Opening the Black Box of Deep Neural Networks via Information

Mar 2, 2017
19 pages
Abstract: (arXiv)
Despite their great success, there is still no comprehensive theoretical understanding of learning with Deep Neural Networks (DNNs) or of their inner organization. Previous work proposed to analyze DNNs in the Information Plane, i.e., the plane of the mutual information values that each layer preserves on the input and output variables. That work suggested that the goal of the network is to optimize the Information Bottleneck (IB) tradeoff between compression and prediction, successively, for each layer. In this work we follow up on this idea and demonstrate the effectiveness of the Information-Plane visualization of DNNs. Our main results are: (i) most of the training epochs in standard deep learning are spent on compression of the input to an efficient representation and not on fitting the training labels. (ii) The representation-compression phase begins when the training error becomes small and the Stochastic Gradient Descent (SGD) epochs change from a fast drift towards smaller training error into a stochastic relaxation, or random diffusion, constrained by the training error value. (iii) The converged layers lie on or very close to the Information Bottleneck (IB) theoretical bound, and the maps from the input to any hidden layer and from this hidden layer to the output satisfy the IB self-consistent equations. This generalization-through-noise mechanism is unique to Deep Neural Networks and absent in one-layer networks. (iv) The training time is dramatically reduced when adding more hidden layers; thus the main advantage of the hidden layers is computational. This can be explained by the reduced relaxation time, as it scales super-linearly (exponentially for simple diffusion) with the information compression from the previous layer.
Note:
  • 19 pages, 8 figures
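The central quantities in the abstract are the information-plane coordinates I(X;T) and I(T;Y) of each hidden representation T. As an illustration only, here is a minimal Python sketch of how such coordinates can be estimated by discretizing a layer's activations into bins and computing mutual information from the resulting discrete joint distributions. The binning scheme, bin count, and helper names (discrete_mutual_information, information_plane_point) are assumptions for this sketch, not the authors' released code.

```python
# Illustrative sketch (not the paper's code): estimate I(X;T) and I(T;Y)
# for one hidden layer by binning its activations into discrete symbols.
import numpy as np

def discrete_mutual_information(a, b):
    """Mutual information I(A;B) in bits for two discrete 1-D arrays."""
    n = len(a)
    joint = {}
    for pair in zip(a, b):
        joint[pair] = joint.get(pair, 0) + 1
    pa, pb = {}, {}
    for (x, y), c in joint.items():
        pa[x] = pa.get(x, 0) + c
        pb[y] = pb.get(y, 0) + c
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        mi += pxy * np.log2(pxy / ((pa[x] / n) * (pb[y] / n)))
    return mi

def information_plane_point(x_ids, y_labels, activations, n_bins=30):
    """Return (I(X;T), I(T;Y)) for one layer's activations.

    x_ids      : integer id of each input pattern, shape (n_samples,)
    y_labels   : integer class label of each sample, shape (n_samples,)
    activations: layer output, shape (n_samples, n_units)
    """
    # Discretize each unit into equal-width bins, then hash the binned
    # activation vector of every sample into a single discrete symbol T.
    lo, hi = activations.min(), activations.max()
    edges = np.linspace(lo, hi, n_bins)
    binned = np.digitize(activations, edges)
    t_ids = np.array([hash(row.tobytes()) for row in binned])
    return (discrete_mutual_information(x_ids, t_ids),
            discrete_mutual_information(t_ids, y_labels))

# Example with random stand-in data (hypothetical shapes, for demonstration).
rng = np.random.default_rng(0)
x_ids = np.arange(4096)                  # each input pattern treated as unique
y = rng.integers(0, 2, size=4096)        # binary labels
T = rng.normal(size=(4096, 10))          # stand-in hidden-layer activations
print(information_plane_point(x_ids, y, T))
```

In practice one would record the activations of every layer at many epochs and plot the resulting I(X;T) versus I(T;Y) trajectories; the fitting and compression phases described in the abstract appear as distinct segments of those trajectories.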