Chapter 3 Probability and Information Theory

3.1 Why Probability?

Two viewpoints of probability:

1. Frequentist probability: relates probability directly to the rates at which events occur, e.g. the probability of a tossed coin showing heads.
2. Bayesian probability: relates probability to qualitative levels of certainty, e.g. "a patient has a 40% chance of having the flu" represents the doctor's degree of belief.

3.9 Common Probability Distributions

The Bernoulli and multinoulli distributions are used for a single discrete variable with two states or a finite number of states, respectively.
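As a quick illustration, here is a minimal NumPy sketch of drawing samples from a Bernoulli and a multinoulli (categorical) distribution; the parameter values are illustrative assumptions, not from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli: a single binary variable, parameterized by phi = P(x = 1).
phi = 0.3
bernoulli_samples = rng.binomial(n=1, p=phi, size=10)

# Multinoulli (categorical): a single variable with k finite states,
# parameterized by a probability vector p that sums to 1.
p = np.array([0.2, 0.5, 0.3])
multinoulli_samples = rng.choice(len(p), size=10, p=p)

print(bernoulli_samples)    # array of 0s and 1s
print(multinoulli_samples)  # array of values in {0, 1, 2}
```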

Chapter 9 Convolutional Networks

9.1 The convolution operation

To illustrate the convolution operation, the authors give an example from 1-dimensional signal processing. The starting point is taking a weighted average of a noisy signal. If we apply such a weighting function $w(a)$ at every point of the signal, we obtain a smoothed estimate of the signal. This operation is called convolution, i.e. each point in the output feature map is the sum of the element-wise products of the input and the kernel.
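A minimal NumPy sketch of this 1-D smoothing, assuming an illustrative noisy signal and a simple averaging kernel:

```python
import numpy as np

# Noisy 1-D signal: a slow sine wave plus random noise (illustrative data).
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 200)
signal = np.sin(t) + 0.3 * rng.standard_normal(t.shape)

# Weighting function w(a): here a simple 5-point moving average.
w = np.ones(5) / 5.0

# Each output point is the sum of element-wise products of the (flipped)
# kernel and the corresponding input window, i.e. a discrete convolution.
smoothed = np.convolve(signal, w, mode='same')
```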

In the 2D imaging case, the kernel can be seen as a template for a part of an image. The convolution operation is then a template-matching process: the output is larger where the patterns in the input match the kernel. In my understanding, this is because at each point in the feature map the operation computes a dot product, which measures the similarity between the kernel and the local patch.
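A small sketch of this intuition, assuming a toy image and a hand-picked kernel. (As the book notes, most libraries actually implement cross-correlation, i.e. convolution without flipping the kernel, which is what the loop below computes.)

```python
import numpy as np

# A tiny "image" containing a 2x2 corner pattern (illustrative data).
image = np.zeros((5, 5))
image[1:3, 1:3] = np.array([[1.0, 0.0],
                            [1.0, 1.0]])

# Kernel equal to the pattern we are looking for: it acts as a template.
kernel = np.array([[1.0, 0.0],
                   [1.0, 1.0]])

# Valid cross-correlation: slide the kernel and take a dot product
# between the kernel and each image patch.
kh, kw = kernel.shape
out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        patch = image[i:i + kh, j:j + kw]
        out[i, j] = np.sum(patch * kernel)

# The response peaks at (1, 1), where the patch matches the template.
print(out)
```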

9.3 Pooling

Pooling helps to make the representation approximately invariant to small translations of the input, i.e. if we translate the input by a small amount, the values of most of the pooled outputs do not change. This is especially helpful if we care more about whether some feature is present than exactly where it is.
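A minimal NumPy sketch of this invariance, loosely in the spirit of the book's max-pooling example; the values are illustrative:

```python
import numpy as np

def max_pool_1d(x, width=3):
    # Max over each sliding window of the given width (stride 1, no padding).
    return np.array([x[i:i + width].max() for i in range(len(x) - width + 1)])

detector = np.array([0.1, 1.0, 0.2, 0.1])
shifted  = np.array([0.3, 0.1, 1.0, 0.2])  # same input translated right by one

print(max_pool_1d(detector))  # [1. 1.]
print(max_pool_1d(shifted))   # [1. 1.]  -- every detector value changed,
                              #            but the pooled outputs did not
```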

Note: in some applications, such as segmentation, it is more important to preserve the location of a feature. Section 9.4 mentions a similar idea:

If a task relies on preserving precise spatial information, then using pooling on all features can increase the training error. Some convolutional network architectures are designed to use pooling on some channels but not on other channels, in order to get both highly invariant features and features that will not underfit when the translation invariance prior is incorrect.

9.4 Convolution and pooling as an infinitely strong prior

Think of a CNN as a fully connected net with an infinitely strong prior. This prior says that the function the layer should learn contains only local interactions and is equivariant to translation (and, in the case of a pooling layer, invariant to small translations).

Note: the structure of a CNN imposes prior constraints on the model before it has seen any data. The prior is based on assumptions about a specific task; for example, in image classification, detecting local features regardless of their location is relatively crucial.

Chapter 11 Practical Methodology

The practical methodology of applying deep learning techniques successfully includes knowing how to choose an algorithm for a particular application, and how to monitor and respond to feedback obtained from experiments.

11.1 Performance Metrics

A target error rate can be estimated beforehand from benchmarks (in an academic setting) or from real-world application requirements. In the case of dataset imbalance, precision, recall, and their combination, the F-score, are usually used.
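A minimal sketch of these metrics for a binary classifier, using the standard definitions; the toy labels are illustrative:

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for a binary classifier (1 = positive class)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

# Imbalanced toy data: 9 negatives, 1 positive (illustrative).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
print(precision_recall_f1(y_true, y_pred))  # (0.5, 1.0, 0.666...)
```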

11.4.1 Manual Hyperparameter Tuning

Hyperparameters can be categorized into two groups:

1. Increase model capacity
• Number of hidden units
• Convolution kernel width
2. Decrease model capacity, acting as a regularizer (see the sketch after this list)
• Weight decay coefficient
• Dropout rate
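A minimal PyTorch sketch showing where some of these hyperparameters appear in practice; the specific values and layer sizes are illustrative assumptions, not recommendations from the book:

```python
import torch
import torch.nn as nn

# Capacity-increasing knobs: number of hidden units (and, for conv nets, kernel width).
# Capacity-decreasing knobs: weight decay coefficient and dropout rate.
hidden_units = 256      # larger -> more capacity
dropout_rate = 0.5      # larger -> stronger regularization
weight_decay = 1e-4     # larger -> stronger regularization

model = nn.Sequential(
    nn.Linear(784, hidden_units),
    nn.ReLU(),
    nn.Dropout(p=dropout_rate),
    nn.Linear(hidden_units, 10),
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=weight_decay)
```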

11.5 Debugging Strategies

Deep networks are notoriously difficult to debug, since errors can only be detected by examining the output of the model, and other parts of the model may adapt to compensate for some bugs.

Some important debugging tests include:

1. Visualize the model in action
2. Visualize the worst mistakes
3. Reason about software using train and test errors
4. Fit a tiny dataset
5. Compare back-propagated derivatives to numerical derivatives (see the sketch after this list)
6. Monitor histograms of activations and gradients
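For item 5, a minimal NumPy sketch of a finite-difference gradient check; in a real network the analytic gradient would come from back-propagation, but here it is just the known gradient of a toy function:

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Central-difference estimate of df/dx for a scalar-valued function f."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step.flat[i] = eps
        grad.flat[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

# Toy example: f(x) = sum(x**2), whose analytic gradient is 2*x.
x = np.array([0.5, -1.3, 2.0])
f = lambda v: np.sum(v ** 2)
analytic = 2 * x
numeric = numerical_grad(f, x)

# The difference should be very small if the analytic gradient is correct.
print(np.max(np.abs(analytic - numeric)))
```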