[Reading] Deep learning

12 minute read


  1. S. Carter and M. Nielsen, “Using Artificial Intelligence to Augment Human Intelligence,” Distill, vol. 2, no. 12, p. e9, Dec. 2017.

    GAN, VAE, AI, Human-computer interaction

    This is a very enlightening article that states the relationship between AI and HCI. Since the invention of computers, there are two distinguishing viewpoints of “what are computers for”? The AI (artificial intelligence) community regards computers as number-crunching machines that can enable human to outsource intellectual tasks, and problems in AI are often framed in terms of matching or surpassing human performance. On the other hand, the vision of intelligence augmentation (IA) sees the computer as which human could work with to support and expand their own problem-solving process. This idea deeply influenced digital arts, computational creativity, interaction design, data visualization, and human-computer interaction.

    The two viewpoints are not mutually exclusive. With the fast development of AI, this commentary brings up a new field by fusing the two ideas, namely Artificially intelligence augmentation (AIA), the use of AI systems to help develop new methods for intelligence augmentation. The human thinking is a representation of the world and its relationship. Similarly, machine learning largely deals with representation. The authors give some examples of how using VAE and GAN and exploring in latent representation space facilitates human designers’ work by expanding their internal representation. This is achieved by the means of interface design or HCI. In general, the author refers to tools like ancient writing system, Photoshop, or GAN as “cognitive technology” in which “users can internalize the interface operations as new primitive elements in their thinking”. In another word, while users are using the interface, the interface itself shapes users’ mind. Nowadays, instead of being developed by human inventors, more and more cognitive technologies are being developed with AI equipped with the proper interface. Therefore, instead of outsourcing cognition, as regarded by traditional AI community, AIA states that intelligence augmentation is achieved by such cognitive transformation for humans.

    Tweet: Human cognition can be regarded as a representation of the world. Similarly, learning for machine largely deals with representation. In this AI era, the two representation can be shared and improve each other. How to use AI to augment human intelligence? This will be an emerging topic in human-computer interaction.

  2. B. Zoph and Q. V. Le, “Neural Architecture Search with Reinforcement Learning,” arXiv:1611.01578 [cs], Nov. 2016.

    Reinforcement learning, hyperparameter optimization

    Designing neural net architecture requires a lot of expert knowledge and ample time for trial-and-error. To improve this process of neural net design and hyperparameter searching, the researchers from Google applied reinforcement learning to this trial-and-error problem. They used a RNN as the agent that samples with probability p as policy to generate a string that specifies the network structure and connectivity. Thus, the generated CNN was trained on the real data and result in an accuracy as the reward signal. They then compute the policy gradient to update the policy. In this way, the controller leans to improve its search over time.

    In addition, the team applied the similar approach in the search of architecture with skip connections and other layer types such as pooling or batchnorm. It also used the reinforcement learning framework to generate recurrent cell architectures. The experiments showed that the model can compose novel network architecture that rivals the best human-invented architecture in terms of test set accuracy.

  3. B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning Transferable Architectures for Scalable Image Recognition,” arXiv:1707.07012 [cs, stat], Jul. 2017.

    Reinforcement learning, hyperparameter optimization

    This work is from the same research group in Google as the above paper. It’s a continuation to the above work of Neural Architecture Search (NAS). The NAS is computationally expensive since it will train multiple child networks on a large dataset. Therefore, the authors propose to search for architecture on a proxy dataset, and then transfer the learned architecture to a larger dataset. The main contribution of this work is the design of a novel search space, such that the best architecture found on the small dataset would scale to a larger dataset. It decouples the complexity of an architecture from the depth of a network, defines two building blocks (normal and reduction cells) which form the network by stacking alternately, and transferrers the same building block structure found on the small dataset to the large one by stacking more of the same building blocks. The structures of the building blocks are expressed with a sequence string and the best structure was searched with NAS framework on the small dataset. The resulting architectures approach or exceed state-of-the-art performance in both small and big dataset with less computation demand than human-designed architecture.

  4. M. Zitnik and J. Leskovec, “Predicting multicellular function through multi-layer tissue networks,” Bioinformatics, vol. 33, no. 14, pp. i190–i198, Jul. 2017.

    graph network

    The aim of the paper is to model functions of proteins in specific human tissues in the protein-protein interaction network. Previous works ignore and could not differentiate the protein functions in different tissues. This paper present OhmNet, which learns features of protein functions in a protein-protein interaction network with an unsupervised learning fashion of node2vec embedding. It encourages sharing of similar features among proteins with similar network neighborhoods. To reserve the hierarchical relationship of proteins in different tissues, the model uses mathematical regularization to express the fact that the proteins “stay together functions similar”. Specifically, it incorporates a recursive structure into the regularization that enforces the proteins to have similar features of the same proteins in the parent layer. The results showed OhmNet provided more accurate predictions of cellular function than alternative approaches, and also generated more accurate hypotheses about tissue-specific protein actions.

  5. P. Domingos, “A few useful things to know about machine learning,” Communications of the ACM, vol. 55, no. 10, p. 78, Oct. 2012.

    Tips on ML

    This paper summarized the key lessons that machine learning researchers and practitioners have learned, including pitfalls to avoid, important issues to focus on, and answers to common questions. It is written at the pre-deep learning era, and most of the tips focus on traditional ML.

    1) Learning = Representation + Evaluation + Optimization. All three parts are equally important. 2) Data alone is not enough. Model structure is a priori. 3) Despite the “no free lunch” theorems, very general assumptions (like smoothness, similar examples having similar classes, limited dependences, or limited complexity) are often enough for ML to do very well. Learning is more like farming, which lets nature do most of the work. ML is all about letting data to do the heavy lifting. Farmers combine seeds with nutrients to grow crops. ML combine knowledge with data to grow programs. 4) It mentioned multiple testing is closely related to overfitting. ML can easily test millions of hypothesis while standard statistical tests assume only one hypothesis. As a result, what looks significant may in fact not be. This can be overcome by correcting the significance test to take the number of hypothesis into account, or better, to control the fraction of falsely accepted non-null hypothesis, known as the false discovery rate. 5) Correlation does not imply causation: machine learning usually applies to observational data, as opposed to experimental data. The goal of learning predictive models is to use them as guides to action. Sometimes, the correlation is a sign of a potential causal connection, and we can use it as a guide to further investigation.

  6. S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” arXiv:1502.03167 [cs], Feb. 2015.

    Batch Normalization

    Training deep neural networks suffers from internal covariate shift and saturation problem. Internal covariate shift is defined as the change in the distribution of network activations due to the change in network parameters during training. It has been long known that the network training converges faster if its inputs are whitened, i.e.: linearly transformed to have zeros means and unit variances, and decorrelated. This work provides a solution that performs input normalization in each layer in a way that is differentiable, and does not require the analysis of the entire training set after every parameter update.

    The authors made two simplifications based on the full whitening method. 1). normalize each dimension of input data independently, by making it have the mean of 0 and variance of 1. 2). use mini-batch to provide estimates of the mean and variance of each activation, so that the statistics that used for normalization can fully participate in the gradient backpropagation.

    At inference time, the batch norm uses the mean and variance of the whole training set, using a moving average to store these values. For CNN, the batch norm is done over each location in a feature map and over the mini-batch.

    Note:this post gives great explanation on why BN normalize over a mini-batch plus a whole feature map. Because the values of one feature map are generated from one weight kernel, therefore it is reasonable to normalize over one entire feature map.

    The experiment showed the advantages of batch norm: 1) it accelerates training: with a stable distribution of activation values throughout training, it enables higher learning rates and learning rate decay; 2) it regularizes the model since the normalization shift the input data with a non-deterministic manner, thus the network could reduce the dropout rate and the L2 regularization; 3) makes training more resilient to the parameter scale and initialization.

  7. D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance Normalization: The Missing Ingredient for Fast Stylization,” arXiv:1607.08022 [cs], Jul. 2016.

    Instance normalization, style transfer

    This work aims to improve the aesthetic results of style transfer by replacing the batch normalization with the proposed instance normalization. The author design the instance normalization based on the observations: 1) the result of stylization is independent on the contrast of the content image; 2) the style loss is designed such that the model transfer the contrast of the stylized image to the content image. Thus, to discard contrast information, i.e.: to normalize the contrast in the content image, the author replace batch normalization, which applies normalization over a whole batch of images and a whole feature map, with instance-specific normalization, which normalize only over one feature map within one batch size. The instance normalization is applied at test time as well, since it only applies to one training/test data instead of to a min-batch of data, thus independent of the mini-batch.

  8. D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Improved Texture Networks: Maximizing Quality and Diversity in Feed-forward Stylization and Texture Synthesis,” arXiv:1701.02096 [cs], Jan. 2017.

    Instance Normalization, style transfer

    As a continuation of the above work, this paper conducted experiments to show the effect of instance normalization. When applying instance normalization to the content image, the style loss showed faster converge compared to the batch norm.

  9. X. Huang and S. Belongie, “Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization,” arXiv:1703.06868 [cs], Mar. 2017.

    Adaptive instance normalization, style transfer

    It is unclear why instance normalization(IN) is so effective in style transfer. The authors conducted experiments and compare IN to batch normalization (BN) on contrast normalized images and style normalized images. The improvement brought by IN remains significant even when all training images are normalized to the same contrast, but are much smaller when all images are normalized to the same style. Different from the authors of IN that attribute the success of IN to its invariance to the contrast of the content image, the authors in this paper argue that IN performs more like a style normalization by normalizing feature statistics (the channel-wise mean and variance). And such statistics of a generator network can also control the style of the generated image.

    The author proposed a novel adaptive instance normalization (AdaIN) layer to solve the style transfer problem that the feed-forward network usually tied to a fixed set of styles and cannot adapt to arbitrary new styles. The AdaIn aligns the mean and variance of the content features with those of the style features. It adaptively computes the affine parameters from the style input and thus has no learnable affine parameters.

    The proposed method is three orders of magnitude faster than the iterative optimization process. It also allows flexible user controls at runtime, such as content-style transition, color, and spatial controls.

Leave a Comment