Top Machine and Deep Learning Research Papers

Having had the privilege of compiling a wide range of articles exploring state-of-the-art machine and deep learning research in 2019 (you can find many of them here), I wanted to take a moment to highlight the ones that I found most interesting. I’ll also share links to their code implementations so that you can try your hand at them.

Contrastive Representation Distillation

This paper in the model distillation arena leverages the family of contrastive objectives to capture correlations and higher-order output dependencies. The paper adapts these objectives to distill knowledge from one neural network to another.

The paper considers three distillation tasks:

Model compression
Transferring knowledge from one modality (e.g., RGB) to another (e.g., depth)
Distilling an ensemble of networks into a single network

The main idea in contrastive learning is to learn a representation that is close in some metric space for positive pairs while pushing apart the representations of negative pairs.
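To make the pull-together, push-apart idea concrete, here is a minimal sketch of an InfoNCE-style contrastive loss in plain Python. It is my own illustration rather than the paper's exact objective: the function name `contrastive_loss`, the cosine similarity, and the temperature of 0.1 are all assumptions chosen for readability.

```python
import math

def _unit(v):
    # Scale a vector to unit length so the dot product becomes cosine similarity.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def _cosine(u, v):
    return sum(a * b for a, b in zip(_unit(u), _unit(v)))

def contrastive_loss(student, teacher_pos, teacher_negs, temperature=0.1):
    """Hypothetical InfoNCE-style loss: small when the student representation
    is closer to the teacher's view of the same input (positive pair) than to
    the teacher's views of other inputs (negative pairs)."""
    logits = [_cosine(student, teacher_pos) / temperature]
    logits += [_cosine(student, neg) / temperature for neg in teacher_negs]
    # Numerically stable cross-entropy with the positive pair as the target.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[0]

teacher_same_input = [1.0, 0.0]
teacher_other_inputs = [[0.0, 1.0], [-1.0, 0.0]]
aligned = contrastive_loss([0.9, 0.1], teacher_same_input, teacher_other_inputs)
misaligned = contrastive_loss([0.0, 1.0], teacher_same_input, teacher_other_inputs)
print(aligned < misaligned)
```

A student representation that lines up with the teacher's representation of the same input gets a much lower loss than one that lines up with a negative.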

Network Pruning via Transformable Architecture Search

This is a paper in the network pruning arena. It proposes applying neural architecture search directly to a network with flexible channel and layer sizes. The number of channels is learned by minimizing the loss of the pruned networks.

The feature map of the pruned network is made up of K feature map fragments that are sampled from a parameterized probability distribution. The loss is backpropagated both to the network weights and to that distribution.
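The sampling step can be sketched roughly as follows. This is a toy illustration under my own assumptions, not the authors' implementation: the candidate channel counts, the logits, and the zero-padding used to align fragments of different widths are all hypothetical stand-ins, and the sampling here is not differentiable.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_channel_counts(logits, candidates, k, rng):
    """Sample k distinct candidate channel counts according to softmax(logits)
    and return (count, renormalized probability) pairs."""
    probs = softmax(logits)
    weights = list(probs)
    chosen = []
    for _ in range(k):
        idx = rng.choices(range(len(candidates)), weights=weights)[0]
        chosen.append(idx)
        weights[idx] = 0.0  # sample without replacement
    total = sum(probs[i] for i in chosen)
    return [(candidates[i], probs[i] / total) for i in chosen]

def mix_fragments(channels, picks):
    """Combine fragments: each fragment keeps the first `count` channels
    (zero-padded to full width) and is weighted by its probability."""
    mixed = [0.0] * len(channels)
    for count, weight in picks:
        for c in range(count):
            mixed[c] += weight * channels[c]
    return mixed

rng = random.Random(0)
picks = sample_channel_counts([0.1, 0.5, 2.0], [4, 8, 16], k=2, rng=rng)
mixed = mix_fragments([1.0] * 16, picks)
```

Here `channels` is a list of per-channel activations for a single position, which keeps the example small; a real feature map would carry spatial dimensions as well.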

The pruning approach proposed in this paper is divided into three stages:

Training an unpruned large network with a standard classification training procedure.
Searching for the depth and width of a small network via Transformable Architecture Search (TAS). TAS aims at searching for the best size of a network.
Transferring the information from the unpruned network to the searched small network with a simple knowledge distillation (KD) approach.
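The last stage relies on knowledge distillation. As a hedged sketch, here is the commonly used soft-target formulation of a KD loss; the paper only says "simple" KD, so the temperature, the `alpha` weighting, and the function name `kd_loss` are my assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """Weighted sum of cross-entropy on the true label and the KL divergence
    between temperature-softened teacher and student distributions."""
    hard = -math.log(softmax(student_logits)[label])
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student))
    # The temperature**2 factor keeps the soft-term gradients on a comparable scale.
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft

teacher = [5.0, 1.0, -2.0]
close_student = kd_loss([4.8, 1.1, -1.9], teacher, label=0)
far_student = kd_loss([-2.0, 1.0, 5.0], teacher, label=0)
print(close_student < far_student)
```

A small network whose logits track the unpruned network's logits is penalized far less than one that disagrees with it.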

Learning Data Augmentation Strategies for Object Detection

While this isn’t itself a model architecture, this paper proposes learning transformations for object detection datasets that can be transferred to other object detection datasets. The transformations are usually applied at training time. The code for training with the learned policy is also available.

In this approach, an augmentation policy is defined as a set of n sub-policies, one of which is selected at random for each image during training. The operations used include distorting color channels, distorting the images geometrically, and distorting only the pixel content found within the bounding box annotations.
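Here is a small sketch of how such a policy could be applied. Everything in it is illustrative: the three operations, their probabilities and magnitudes, and the `POLICY` table are made up by me and are not the learned policy from the paper. Images are nested lists of grayscale values, and boxes are `(x1, y1, x2, y2)` with exclusive right and bottom edges.

```python
import random

def flip_horizontal(image, boxes, magnitude):
    # Geometric distortion: the boxes must move together with the pixels.
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    new_boxes = [(width - x2, y1, width - x1, y2) for x1, y1, x2, y2 in boxes]
    return flipped, new_boxes

def brighten(image, boxes, magnitude):
    # Color distortion applied to the whole image.
    return [[min(255, p + magnitude) for p in row] for row in image], boxes

def brighten_boxes_only(image, boxes, magnitude):
    # Distortion restricted to the pixels inside the bounding boxes.
    out = [row[:] for row in image]
    for x1, y1, x2, y2 in boxes:
        for y in range(y1, y2):
            for x in range(x1, x2):
                out[y][x] = min(255, out[y][x] + magnitude)
    return out, boxes

# Hypothetical policy: each sub-policy is a list of (operation, probability, magnitude).
POLICY = [
    [(flip_horizontal, 0.5, 0), (brighten, 0.8, 30)],
    [(brighten_boxes_only, 1.0, 50)],
]

def apply_policy(image, boxes, policy, rng):
    """Pick one sub-policy at random, then apply each of its operations
    with the operation's own probability."""
    sub_policy = rng.choice(policy)
    for operation, probability, magnitude in sub_policy:
        if rng.random() < probability:
            image, boxes = operation(image, boxes, magnitude)
    return image, boxes

image = [[0] * 4 for _ in range(4)]
boxes = [(0, 0, 2, 2)]
aug_image, aug_boxes = apply_policy(image, boxes, POLICY, random.Random(0))
```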

XLNet: Generalized Autoregressive Pretraining for Language Understanding

XLNet was a very mind-blowing paper in the field of Transformers. XLNet is a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order. It doesn’t use a fixed forward or backward factorization order.

Instead, it maximizes the expected log-likelihood of a sequence with respect to all possible permutations of the factorization order. As a result of these permutations, the context for each position can consist of tokens from both the left and the right. Bidirectional context is captured because each position learns to use contextual information from all positions.
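A tiny sketch helps show why permuting the factorization order yields bidirectional context. It only computes which positions each token may condition on for one sampled order; the helper name `permutation_contexts` and the example sentence are mine, and nothing here reproduces XLNet's actual attention machinery.

```python
import random

def permutation_contexts(order):
    """For a factorization order (a permutation of positions), map each
    position to the positions that come before it in that order."""
    return {pos: sorted(order[:step]) for step, pos in enumerate(order)}

tokens = ["New", "York", "is", "a", "city"]
order = list(range(len(tokens)))
random.Random(0).shuffle(order)  # sample one factorization order
contexts = permutation_contexts(order)

for pos in range(len(tokens)):
    visible = [tokens[i] for i in contexts[pos]]
    print(f"predict {tokens[pos]!r} given {visible}")
```

The tokens themselves stay in their original positions; only the prediction order changes. Averaged over many sampled orders, every position ends up conditioning on tokens from both sides.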

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (ACL 2019)

Transformer-XL (meaning extra long) allows for the learning of dependency beyond a fixed length without disrupting temporal coherence. It incorporates a segment-level recurrence mechanism and a positional encoding scheme. Transformer-XL learns dependency that’s 80% longer than RNNs and 450% longer than vanilla Transformers. It’s available in both TensorFlow and PyTorch.

The authors introduce recurrence to their deep self-attention network. Instead of calculating hidden states from scratch for each new segment, they reuse the hidden states obtained in the previous segments. The reused hidden states act as memory for the current segment.

This builds up a recurrent connection between the segments. Modeling long-term dependency becomes possible because information is passed through the recurrent connections. The authors also introduce a more effective relative positional encoding formulation that generalizes to attention lengths longer than those observed during training.
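The caching mechanics can be sketched in a few lines. This is a deliberately simplified, single-layer toy of my own: hidden states are plain numbers, `mean_attention` is a stand-in for real self-attention, and relative positional encoding is left out entirely.

```python
def run_segments(segments, mem_len, attend):
    """Process segments in order. Each position attends over the cached
    states from earlier segments plus the current segment up to itself."""
    memory = []
    outputs = []
    for segment in segments:
        context = memory + segment  # keys/values: cached + current states
        offset = len(memory)
        outputs.append(
            [attend(context[: offset + i + 1]) for i in range(len(segment))]
        )
        # Keep only the most recent mem_len states as memory for the next segment.
        memory = context[-mem_len:] if mem_len > 0 else []
    return outputs

def mean_attention(visible):
    # Stand-in for self-attention: uniform average over visible states.
    return sum(visible) / len(visible)

segments = [[1.0, 2.0], [3.0, 4.0]]
with_memory = run_segments(segments, mem_len=2, attend=mean_attention)
without_memory = run_segments(segments, mem_len=0, attend=mean_attention)
print(with_memory)     # second segment also sees the first segment's states
print(without_memory)  # each segment is processed in isolation
```

In the real model the cached states are also excluded from gradient computation, which this toy does not need to represent.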

Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos (AAAI 2019)

This paper is concerned with the task of unsupervised learning of scene depth and robot ego-motion, where supervision is provided by monocular videos. This is done by introducing geometric structure into the learning process. It involves modeling the scene and the individual objects, camera ego-motion, and object motions learned from monocular video inputs. The authors also introduce an online refinement method.

The authors introduce an object motion model that shares the same architecture as the ego-motion network. It is, however, specialized for predicting motions of individual objects in 3D.

It takes an RGB image sequence as input, complemented by pre-computed instance segmentation masks. The job of the motion model is to learn to predict the transformation vectors per object in 3D space. Applying these transformations reproduces the observed object appearance in the respective target frame.

Auto-Keras: An Efficient Neural Architecture Search System

This paper proposes a framework for enabling Bayesian optimization to guide network morphism for efficient NAS. Based on their method, the authors built an open-source AutoML system known as Auto-Keras.

The major building block of the proposed method is to explore the search space by morphing the neural architectures, guided by a Bayesian optimization (BO) algorithm. Since the NAS space is not a Euclidean space, the authors tackle this challenge by designing a neural network kernel function. The kernel function is based on the edit distance for morphing one neural architecture into another.
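To give a feel for an edit-distance kernel, here is a simplified sketch. It is an assumption-laden stand-in rather than the Auto-Keras kernel: architectures are reduced to flat lists of layer names, the distance is a plain Levenshtein distance, and the Gaussian form with a `length_scale` parameter is my choice for illustration.

```python
import math

def edit_distance(arch_a, arch_b):
    """Levenshtein distance between two layer sequences: the minimum number
    of layer insertions, deletions, or substitutions to turn one into the other."""
    prev = list(range(len(arch_b) + 1))
    for i, layer_a in enumerate(arch_a, start=1):
        curr = [i]
        for j, layer_b in enumerate(arch_b, start=1):
            cost = 0 if layer_a == layer_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def architecture_kernel(arch_a, arch_b, length_scale=2.0):
    """Hypothetical similarity: 1.0 for identical architectures, decaying
    toward 0 as more edits are needed."""
    d = edit_distance(arch_a, arch_b)
    return math.exp(-(d ** 2) / (2 * length_scale ** 2))

base = ["conv", "relu", "dense"]
one_edit = ["conv", "bn", "relu", "dense"]
two_edits = ["dense"]
print(architecture_kernel(base, one_edit) > architecture_kernel(base, two_edits))
```

A Bayesian optimizer can use a similarity like this in its surrogate model so that architectures reachable with few morphing steps are treated as likely to perform similarly.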

Depth-Aware Video Frame Interpolation (CVPR 2019)

This paper proposes a video frame interpolation method that detects occlusion by exploring depth information. The authors develop a depth-aware flow projection layer that synthesizes intermediate flows that preferentially sample closer objects over ones that are farther away.
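The idea behind the depth-aware projection can be sketched as an inverse-depth weighted average. This is my own simplified reading, assuming that several source pixels' flows pass through the same target location and that each one is weighted by the reciprocal of its depth; the function name `project_flow` and the input layout are hypothetical.

```python
def project_flow(candidates, t):
    """candidates: list of ((dx, dy), depth) for source pixels whose flow
    passes through the target location at time t in (0, 1).
    Returns the flow from the intermediate frame back to the first frame."""
    weights = [1.0 / depth for _, depth in candidates]  # closer => larger weight
    total = sum(weights)
    dx = sum(w * flow[0] for w, (flow, _) in zip(weights, candidates)) / total
    dy = sum(w * flow[1] for w, (flow, _) in zip(weights, candidates)) / total
    # Reverse the direction and scale by the time step.
    return (-t * dx, -t * dy)

# A near object moving right and a far object moving left overlap at one pixel.
near_and_far = [((2.0, 0.0), 1.0), ((-2.0, 0.0), 4.0)]
projected = project_flow(near_and_far, t=0.5)
print(projected)
```

Because the near object has four times the weight of the far one, the projected flow follows the near object's motion, which is what lets the model favor closer objects when objects occlude each other.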

Learning of hierarchical features is done by gathering contextual information from neighboring pixels. The output frame is then generated by warping the input frames, depth maps, and contextual features based on the optical flow and local interpolation kernels.

The authors propose a Depth-Aware video frame INterpolation (DAIN) model that effectively exploits the optical flow, local interpolation kernels, depth maps, and contextual features to generate high-quality video frames.

OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields

OpenPose is an open-source real-time system for multi-person 2D pose estimation, including body, foot, hand, and facial keypoints. This paper proposes a real-time approach for detecting 2D human poses in images and videos.

The proposed method uses a nonparametric representation known as Part Affinity Fields (PAFs). It takes an image as input to a CNN and predicts confidence maps for detecting body parts and PAFs for associating those parts with individuals. The paper also presents an annotated foot dataset with 15K human foot instances. Some of the authors of this paper are IEEE members.
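As a rough sketch of how a PAF can be used for part association, the snippet below scores a candidate limb by sampling the field along the segment between two keypoints and checking how well it aligns with the segment's direction. The grid layout `paf[y][x] = (vx, vy)`, the nearest-pixel sampling, and the function name are my assumptions for illustration.

```python
import math

def association_score(paf, start, end, samples=10):
    """Average alignment between the field and the unit vector from start
    to end, sampled at evenly spaced points along the segment."""
    dx, dy = end[0] - start[0], end[1] - start[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length
    total = 0.0
    for k in range(samples):
        u = k / (samples - 1)
        x = round(start[0] + u * dx)  # nearest-pixel lookup
        y = round(start[1] + u * dy)
        vx, vy = paf[y][x]
        total += vx * ux + vy * uy
    return total / samples

# Toy field that points to the right everywhere, as if a horizontal limb were present.
paf = [[(1.0, 0.0) for _ in range(5)] for _ in range(5)]
along = association_score(paf, (0, 0), (4, 0))   # candidate limb follows the field
across = association_score(paf, (0, 0), (0, 4))  # candidate limb is perpendicular
print(along, across)
```

Keypoint pairs whose connecting segment runs along the field score high and are likely to belong to the same person, while pairs that cut across it score near zero.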

FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation

This paper proposes a joint upsampling module named Joint Pyramid Upsampling (JPU) to replace the dilated convolutions that consume a lot of time and memory. It works by formulating the function of extracting high-resolution maps as a joint upsampling problem.

The method uses a fully convolutional network (FCN) as the backbone and applies JPU to upsample the low-resolution final feature maps, resulting in high-resolution feature maps. Replacing the dilated convolutions with JPU does not result in any loss in performance.

Conclusion

Hopefully, this gives you some insights into the machine and deep learning research space in 2019. I’ve tried to include links to both the original papers and their code where possible. Try your hand at them and let us know what you accomplish.