### Recent Advances for a Better Understanding of Deep Learning - Part I

I would like to live in a world whose systems are built on rigorous, reliable, verifiable knowledge, and not on alchemy. Simple experiments and simple theorems are the building blocks that help understand complicated larger phenomena.

- Non-Convex Optimization: How can we understand the highly non-convex loss function associated with deep neural networks? Why does stochastic gradient descent even converge?
- Overparametrization and Generalization: In classical statistical theory, generalization depends on the number of parameters, but this does not seem to hold in deep learning. Why? Can we find another good measure of generalization?
- Role of Depth: How does depth help a neural network to converge? What is the link between depth and generalization?
- Generative Models: Why do Generative Adversarial Networks (GANs) work so well? What theoretical properties could we use to stabilize them or avoid mode collapse?

I bet a lot of you have tried training a deep net of your own from scratch and walked away feeling bad about yourself because you couldn't get it to perform. I don't think it's your fault. I think it's gradient descent's fault.

- What does the loss function look like?
- Why does SGD converge?

If we perturb a single parameter, say by adding a small constant, but leave the others free to adapt to this change to still minimise the loss, it may be argued that by adjusting somewhat, the myriad other parameters can "make up" for the change imposed on only one of them.
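The compensation argument above can be made concrete with a deliberately tiny sketch. It assumes a hypothetical redundant model `y_hat = (a + b) * x` fitted to toy data with slope 2; the names `loss` and `refit_b` and all constants are illustrative choices, not taken from any paper. Perturbing `a` while holding `b` fixed raises the loss, but letting `b` adapt brings it back to the minimum.

```python
def loss(a, b, xs, ys):
    """Mean squared error of the redundant model y_hat = (a + b) * x."""
    return sum(((a + b) * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def refit_b(a, b, xs, ys, lr=0.05, steps=500):
    """Plain gradient descent on b only, with a held fixed."""
    n = len(xs)
    for _ in range(steps):
        grad_b = sum(2 * ((a + b) * x - y) * x for x, y in zip(xs, ys)) / n
        b -= lr * grad_b
    return b


xs = [-2.0, -1.0, 0.5, 1.0, 3.0]
ys = [2.0 * x for x in xs]           # toy data with true slope 2 (assumption)

a, b = 1.0, 1.0                      # one of many minimizers: any a + b = 2
base = loss(a, b, xs, ys)

a_perturbed = a + 0.3                # perturb a single parameter
frozen = loss(a_perturbed, b, xs, ys)         # b is not allowed to adapt
b_adapted = refit_b(a_perturbed, b, xs, ys)   # b is free to adapt
adapted = loss(a_perturbed, b_adapted, xs, ys)

print(f"base={base:.4f} frozen={frozen:.4f} adapted={adapted:.2e}")
```

With only two parameters the compensation is exact because the model is perfectly redundant. In a deep network the claim is weaker: many parameters can only approximately absorb the change.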

- The functional that is minimized by SGD can be rewritten as a sum of two terms (Eq. 11): the expectation of a potential Φ, and the entropy of the distribution. The temperature 1/β controls the trade-off between those two terms.
- The potential Φ depends only on the data and the architecture of the network (and not on the optimization process). If it is equal to the loss function, SGD will converge to a global minimum. However, the paper shows that this is rarely the case, and knowing how far Φ is from the loss function tells you how likely SGD is to converge.
- The entropy of the final distribution depends on the ratio learning_rate/batch_size (the temperature). Intuitively, the entropy is related to the size of a distribution, and a high temperature often comes down to a distribution with high variance, which usually means a flat minimum. Since flat minima are often considered to generalize better, this is consistent with the empirical finding that a high learning rate and a small batch size often lead to better minima.
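The link between the ratio learning_rate/batch_size and the spread of the final distribution can be checked on a toy problem. The sketch below is not the paper's setting: it assumes a one-dimensional quadratic loss f(w) = E[(w - x)^2] / 2 with x drawn from a standard normal, and it uses the variance of the SGD iterates as a stand-in for the "size" of the stationary distribution. The function name `sgd_stationary_variance` and all constants are hypothetical.

```python
import random


def sgd_stationary_variance(lr, batch_size, steps=20000, burn_in=2000, seed=0):
    """Run minibatch SGD on f(w) = E[(w - x)^2] / 2 with x ~ N(0, 1)
    and return the empirical variance of the iterates after burn-in."""
    rng = random.Random(seed)
    w = 0.0
    samples = []
    for t in range(steps):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        grad = w - sum(batch) / batch_size   # minibatch gradient of the loss
        w -= lr * grad
        if t >= burn_in:
            samples.append(w)
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)


# Same learning rate, smaller batch: higher "temperature" lr / batch_size.
v_hot = sgd_stationary_variance(lr=0.1, batch_size=4)
v_cold = sgd_stationary_variance(lr=0.1, batch_size=32)

# Different lr and batch size, but the same ratio as the "hot" run.
v_same = sgd_stationary_variance(lr=0.05, batch_size=2)

print(f"lr/b=0.025 (0.1/4):    var={v_hot:.5f}")
print(f"lr/b=0.003 (0.1/32):   var={v_cold:.5f}")
print(f"lr/b=0.025 (0.05/2):   var={v_same:.5f}")
```

For this quadratic the stationary variance works out to lr / (batch_size * (2 - lr)), so for small learning rates it is governed almost entirely by the ratio lr / batch_size. The two runs that share a ratio should therefore report similar variances, and the low-ratio run a much smaller one. Whether the same scaling carries over to a deep network is exactly what the paper's analysis addresses; this example only illustrates the intuition.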