As a former academic researcher, I like to maintain ties to the research community while working in the data science field. I find that a firm understanding of the origins of the technologies I use in my consulting work (AI, machine learning, and deep learning) gives me a foundational perspective on how things work behind the scenes. In this article, I’ve put together a list of influential data science research papers from 2018 that all data scientists should review. I’ve included a number of “survey” style papers because they allow you to see the entire landscape of a technology area, and also because they often have complete lists of references, including seminal papers.
While watching a recent webinar sponsored by the ACM, “Break Into AI: A Q&A with Andrew Ng on Building a Career in Machine Learning,” I found out that Dr. Ng routinely carries around a folder of research papers that he can draw from when there’s a lull in his busy schedule, like when he’s riding in an Uber. I thought I was the only one who carries around a bunch of research papers; apparently, I’m in very good company! So load up your own folder with some of the following papers. Dr. Ng advised that if you read a couple of papers per week (not all in great detail), after a year you will have read 100+ papers, which will give you a very good command of the discipline.
We’ve all been taught that the backpropagation algorithm, originally introduced in the 1970s, is the pillar of learning in neural networks. In turn, backpropagation makes use of the well-known first-order iterative optimization algorithm known as Gradient Descent, which is used for finding the minimum of a function. In this paper, Bangalore-based PES University researchers describe an alternative to backpropagation without the use of Gradient Descent. Instead, they devise a new algorithm to find the error in the weights and biases of an artificial neuron using Moore-Penrose Pseudo Inverse. The paper features numerical studies and experiments performed on various data sets designed to verify that the alternative algorithm functions as intended.
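To make the pseudo-inverse idea concrete, here is a minimal sketch. It is not the paper's algorithm, only the simplest related case: for a single linear neuron, the least-squares weights and bias can be solved in one step with NumPy's `np.linalg.pinv` instead of iterating with gradient descent. The function name `fit_linear_neuron` and the synthetic data are my own illustrative assumptions.

```python
import numpy as np

def fit_linear_neuron(X, y):
    """Solve y ~ X @ w + b for w and b in one step using the pseudo-inverse."""
    # Append a column of ones so the bias is learned as an extra weight.
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    params = np.linalg.pinv(X_aug) @ y
    return params[:-1], params[-1]

# Synthetic, noise-free data (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 0.3
y = X @ true_w + true_b

w, b = fit_linear_neuron(X, y)
```

With noise-free data the recovered `w` and `b` match the generating values up to floating-point error; a multi-layer network with non-linear activations needs the more elaborate treatment the paper describes.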
Sentiment analysis is a widely used process of computationally identifying and categorizing opinions expressed in a piece of text, in order to determine whether the writer’s attitude towards a particular topic, product, etc., is positive, negative, or neutral. Sentiment analysis is especially valuable when acting on social media data sources.
Deep learning is another technology that’s growing in popularity: a powerful machine learning technique that learns multiple layers of representations or features of the data and yields prediction results. Along with its success in many other application domains, deep learning has also found common use in sentiment analysis in recent years. This paper provides an informative overview of deep learning and then offers a comprehensive survey of its current application in the area of sentiment analysis.
As a mathematician myself, I like to see tutorials that present data science topics in light of their connections to applied mathematics. This paper provides a good introduction to the basic ideas that underlie deep learning from an applied mathematics perspective. Multilayered artificial neural networks are becoming a pervasive tool in a host of application domains. At the heart of this deep learning revolution are familiar concepts from applied and computational mathematics, notably calculus, partial differential equations, linear algebra, and approximation/optimization theory.
This paper is a comprehensive historical review of deep learning models. It covers the genesis of artificial neural networks all the way up to the models that dominate the last decade of research in deep learning like convolutional neural networks, deep belief networks, and recurrent neural networks. The paper also focuses on the precedents of these classes of models, examining how the initial ideas are assembled to construct the early models and how these preliminary models are developed into their current forms.
Recurrent neural networks (RNNs) are capable of learning features and long term dependencies from sequential and time-series data. RNNs consist of a stack of non-linear units where at least one connection between units forms a directed cycle. A well-trained RNN can model any dynamical system; however, training RNNs is mostly plagued by issues in learning long-term dependencies. This paper presents a survey on RNNs and highlights several recent advances in the field.
Although deep learning has historical roots going back decades, neither the term “deep learning” nor the approach was popular just over five years ago, when the field was reignited by papers such as Krizhevsky, Sutskever and Hinton’s now classic 2012 paper “ImageNet Classification with Deep Convolutional Neural Networks.” What has the field discovered in the five subsequent years? Against a background of considerable progress in areas such as speech recognition, image recognition, and game playing, AI contrarian Gary Marcus of New York University presents ten concerns for deep learning, and suggests that deep learning must be supplemented by other techniques if we are to reach the long-term goal of Artificial General Intelligence.
This paper is a wonderful resource that explains all the linear algebra you need in order to understand the operation of deep neural networks (and to read most of the other papers on this list). It assumes little math knowledge beyond what you learned in freshman calculus, and provides links to help you refresh the necessary math where needed. Note that you do not need to understand this material before you start learning to train and use deep learning in practice; rather, this material is for those who are already familiar with the basics of neural networks and wish to deepen their understanding of the underlying math.
Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems — BN’s error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation. This limits BN’s usage for training larger models and transferring features to computer vision tasks including detection, segmentation, and video, which require small batches constrained by memory consumption. This paper, by Facebook AI Researchers (FAIR), presents Group Normalization (GN) as a simple alternative to BN. GN divides the channels into groups and computes within each group the mean and variance for normalization. GN’s computation is independent of batch sizes, and its accuracy is stable in a wide range of batch sizes.
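The grouping step described above can be sketched in a few lines of NumPy. This is a minimal illustration under my own assumptions (an `(N, C, H, W)` input layout, a hypothetical function name `group_norm`, and no learnable scale and shift parameters), not the authors' implementation.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Normalize x of shape (N, C, H, W) within each group of channels, per sample."""
    n, c, h, w = x.shape
    assert c % num_groups == 0, "channels must divide evenly into groups"
    # Split the channel axis into (groups, channels_per_group).
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    # Statistics are computed per sample and per group, never across the batch.
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    return g.reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 8, 5, 5))
out = group_norm(x, num_groups=2)
# Normalizing the first sample alone gives the same result as within the batch.
out_single = group_norm(x[:1], num_groups=2)
```

Because no statistic is taken over the batch axis, the output for a given sample does not depend on how many other samples are in the batch, which is the property the paragraph above highlights.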
Deep neural networks are typically trained by optimizing a loss function with a Stochastic Gradient Descent (SGD) variant, in conjunction with a decaying learning rate, until convergence. This paper shows that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. The paper also shows that this Stochastic Weight Averaging (SWA) procedure finds much broader optima than SGD, and approximates the recent Fast Geometric Ensembling (FGE) approach with a single model.
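The averaging idea can be illustrated on a toy problem. The sketch below is my own simplification, not the paper's training recipe: it runs noisy gradient descent with a constant learning rate on a quadratic and keeps a running mean of the weights after a burn-in period. The names `sgd_with_swa` and `noisy_grad`, and averaging at every step rather than once per epoch or cycle, are assumptions made for brevity.

```python
import numpy as np

def sgd_with_swa(grad_fn, w0, lr, steps, swa_start, rng):
    """Run constant-learning-rate SGD and average the iterates after swa_start."""
    w = np.array(w0, dtype=float)
    w_swa = np.zeros_like(w)
    n_avg = 0
    for step in range(steps):
        w = w - lr * grad_fn(w, rng)
        if step >= swa_start:
            # Running mean of the weights visited so far.
            w_swa = (w_swa * n_avg + w) / (n_avg + 1)
            n_avg += 1
    return w, w_swa

def noisy_grad(w, rng):
    # Gradient of 0.5 * ||w||^2 plus Gaussian noise standing in for minibatch noise.
    return w + rng.normal(scale=1.0, size=w.shape)

rng = np.random.default_rng(0)
w_last, w_swa = sgd_with_swa(
    noisy_grad, w0=[5.0, -5.0], lr=0.1, steps=2000, swa_start=1000, rng=rng
)
```

With a constant learning rate the last iterate keeps bouncing around the minimum at the origin, while the averaged weights `w_swa` settle much closer to it.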
Automatic text summarization, the automated process of shortening a body of text while preserving its main ideas, is a critical research area in natural language processing (NLP). This paper surveys recent work on neural-based models in automatic text summarization. The author examines in detail ten state-of-the-art neural-based summarizers: five abstractive models and five extractive models.
The seminal work of Gatys et al. in the 2015 paper “A Neural Algorithm of Artistic Style” demonstrated the power of Convolutional Neural Networks (CNN) in creating artistic imagery by separating and recombining image content and style. This process of using CNN to render a content image in different styles is referred to as Neural Style Transfer (NST). Since then, NST has become a trending topic both in academic literature and industrial applications. It is receiving increasing attention and a variety of approaches are proposed to either improve or extend the original NST algorithm. This paper provides an overview of the current progress towards NST, as well as discussing its various applications and open problems for future research.
There is a growing interest in using Riemannian geometry in machine learning. This paper introduces geomstats, a Python package that performs computations on manifolds such as hyperspheres, hyperbolic spaces, spaces of symmetric positive definite matrices, and Lie groups of transformations. Also provided are efficient and extensively unit-tested implementations of these manifolds, together with useful Riemannian metrics and associated Exponential and Logarithm maps. The corresponding geodesic distances provide a range of intuitive choices of machine learning loss functions. The authors also give the corresponding Riemannian gradients. The operations implemented in geomstats are available with different computing backends such as NumPy, TensorFlow, and Keras. The authors have enabled GPU implementation and integrated geomstats manifold computations into the Keras deep learning framework.
This paper presents a two-parameter loss function which can be viewed as a generalization of many popular loss functions used in robust statistics: the Cauchy/Lorentzian, Geman-McClure, Welsch/Leclerc, and generalized Charbonnier loss functions (and by transitivity the L2, L1, L1-L2, and pseudo-Huber/Charbonnier loss functions). The author describes and visualizes this loss and its corresponding distribution, and documents several useful properties.
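For readers who want to experiment, here is a minimal sketch of such a two-parameter loss. It assumes the general form, as I understand it, is f(x, α, c) = (|α − 2| / α) · (((x/c)² / |α − 2| + 1)^(α/2) − 1), with shape parameter α and scale c; the function name `general_robust_loss` is my own. The values α = 0 and α = 2 are defined only as limits, so this sketch simply rejects them.

```python
import numpy as np

def general_robust_loss(x, alpha, c):
    """Two-parameter robust loss for alpha not in {0, 2} (those cases are limits)."""
    if alpha in (0, 2):
        raise ValueError("alpha = 0 and alpha = 2 are limiting cases, not handled here")
    z = (np.asarray(x, dtype=float) / c) ** 2
    b = abs(alpha - 2.0)
    return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)

x = np.linspace(-3.0, 3.0, 13)
c = 1.5
z = (x / c) ** 2

# alpha = 1 should reduce to the pseudo-Huber/Charbonnier form sqrt(z + 1) - 1.
loss_a1 = general_robust_loss(x, alpha=1.0, c=c)
# alpha = -2 should reduce to the Geman-McClure form 2z / (z + 4).
loss_am2 = general_robust_loss(x, alpha=-2.0, c=c)
```

Substituting α = 1 and α = −2 into the formula by hand gives exactly the two special cases noted in the comments, which is a quick way to check the implementation.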
This paper introduces backdrop, a flexible and simple-to-implement method, intuitively described as dropout acting only along the backpropagation pipeline. Backdrop is implemented via one or more masking layers which are inserted at specific points along the network. Each backdrop masking layer acts as the identity in the forward pass, but randomly masks parts of the backward gradient propagation. Intuitively, inserting a backdrop layer after any convolutional layer leads to stochastic gradients corresponding to features of that scale. Therefore, backdrop is well suited for problems in which the data have a multi-scale, hierarchical structure.
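The forward-identity, backward-mask behavior can be sketched with a tiny hand-written layer. This is an illustration under my own assumptions, not the authors' code: the class name `BackdropMask`, the element-wise Bernoulli mask, and the lack of any gradient rescaling are all choices I made for simplicity, and the paper's masking schemes may differ.

```python
import numpy as np

class BackdropMask:
    """Identity in the forward pass; randomly zeroes gradient entries in the backward pass."""

    def __init__(self, keep_prob, rng):
        self.keep_prob = keep_prob
        self.rng = rng

    def forward(self, x):
        # Activations pass through unchanged.
        return x

    def backward(self, grad_out):
        # Each gradient entry is kept with probability keep_prob, otherwise dropped.
        mask = self.rng.random(grad_out.shape) < self.keep_prob
        return grad_out * mask

rng = np.random.default_rng(0)
layer = BackdropMask(keep_prob=0.5, rng=rng)
x = rng.normal(size=(4, 6))
y = layer.forward(x)
grad_in = layer.backward(np.ones_like(x))
```

In a real framework the same effect would be achieved with a custom autograd operation; the point here is only that the forward output is untouched while the gradient becomes stochastic.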
This paper introduces an approach for deep reinforcement learning (RL) that improves upon the efficiency, generalization capacity, and interpretability of conventional approaches through structured perception and relational reasoning. It uses self-attention to iteratively reason about the relations between entities in a scene and to guide a model-free policy. The results show that in a novel navigation and planning task called Box-World, the agent finds interpretable solutions that improve upon baselines in terms of sample complexity, ability to generalize to more complex scenes than experienced during training, and overall performance.
Few ideas have enjoyed as large an impact on deep learning as convolution. For any problem involving pixels or spatial representations, common intuition holds that CNNs may be appropriate. This paper shows a striking counterexample to this intuition via the seemingly trivial coordinate transform problem, which simply requires learning a mapping between coordinates in (x,y) Cartesian space and one-hot pixel space. Although CNNs would seem appropriate for this task, the authors from Uber show that they fail spectacularly. The paper demonstrates and carefully analyzes the failure first on a toy problem, at which point a simple fix becomes obvious. This solution is called CoordConv, which works by giving convolution access to its own input coordinates through the use of extra coordinate channels.
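The extra coordinate channels are easy to picture in code. The sketch below only builds the augmented input that a convolution layer would then consume; it is my own minimal illustration, and the function name `add_coord_channels`, the `(N, C, H, W)` layout, and the scaling of coordinates to [-1, 1] are assumptions rather than details taken from the paper.

```python
import numpy as np

def add_coord_channels(images):
    """Append x and y coordinate channels to a batch of shape (N, C, H, W)."""
    n, c, h, w = images.shape
    ys = np.linspace(-1.0, 1.0, h)
    xs = np.linspace(-1.0, 1.0, w)
    # yy varies along the height axis, xx along the width axis; each is (H, W).
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    coords = np.broadcast_to(np.stack([xx, yy]), (n, 2, h, w))
    return np.concatenate([images, coords], axis=1)

batch = np.zeros((2, 3, 4, 5))
with_coords = add_coord_channels(batch)  # shape (2, 5, 4, 5)
```

Every pixel now carries its own position, so a filter applied to `with_coords` can condition on where it is in the image, which an ordinary translation-equivariant convolution cannot do.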
The back-propagation algorithm is the cornerstone of deep learning. Despite its importance, few variations of the algorithm have been attempted. This work presents an approach to discover new variations of the back-propagation equation. The authors from Google use a domain-specific language to describe update equations as a list of primitive functions. An evolution-based method is used to discover new propagation rules that maximize the generalization performance after several training epochs. The research finds several update equations that train faster than standard back-propagation when training time is short, and perform similarly to standard back-propagation at convergence.
Object detection is the computer vision task dealing with detecting instances of objects of a certain class (e.g., ‘car’, ‘plane’, etc.) in images. It has attracted a lot of attention from the community during the last 5 years. This strong interest can be explained not only by the importance this task has for many applications but also by the phenomenal advances in this area since the arrival of deep convolutional neural networks (CNNs). This paper offers a comprehensive review of the recent literature on object detection with deep CNNs and provides an in-depth view of these recent advances.
This paper surveys neural approaches to conversational AI that have been developed in the last few years. Conversational systems are grouped into three categories: (1) question answering agents, (2) task-oriented dialogue agents, and (3) chatbots. For each category, the paper presents a review of state-of-the-art neural approaches, draws connections between them and traditional approaches, and discusses the progress that has been made and challenges still being faced, using specific systems and models as case studies.
Recurrent neural networks (RNNs) provide state-of-the-art performance in processing sequential data but are memory intensive to train, limiting the flexibility of RNN models that can be trained. Reversible RNNs (RNNs for which the hidden-to-hidden transition can be reversed) offer a path to reduce the memory requirements of training, as hidden states need not be stored and instead can be recomputed during backpropagation. This paper shows that perfectly reversible RNNs, which require no storage of the hidden activations, are fundamentally limited because they cannot forget information from their hidden state. The paper then provides a scheme for storing a small number of bits in order to allow perfect reversal with forgetting. The authors’ method achieves comparable performance to traditional models while reducing the activation memory cost by a factor of 10–15.
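To see what a reversible hidden-to-hidden transition looks like, here is a toy sketch. It uses a simple additive coupling update of my own choosing, not the paper's architectures or its bit-storage scheme; the helper names `make_update`, `forward_step`, and `reverse_step` are hypothetical. Only the final state is kept, and earlier states are recomputed by running the steps backwards.

```python
import numpy as np

def make_update(rng, dim):
    """Return a small fixed non-linear map h -> tanh(h @ W)."""
    W = rng.normal(scale=0.5, size=(dim, dim))
    return lambda h: np.tanh(h @ W)

def forward_step(h1, h2, x, f, g):
    # Each half of the hidden state is updated by adding a function of the other half.
    h1_new = h1 + f(h2 + x)
    h2_new = h2 + g(h1_new)
    return h1_new, h2_new

def reverse_step(h1_new, h2_new, x, f, g):
    # Undo the additions in the opposite order to recover the previous state.
    h2 = h2_new - g(h1_new)
    h1 = h1_new - f(h2 + x)
    return h1, h2

rng = np.random.default_rng(0)
dim, steps = 4, 20
f, g = make_update(rng, dim), make_update(rng, dim)
inputs = rng.normal(size=(steps, dim))
h1_0, h2_0 = rng.normal(size=dim), rng.normal(size=dim)

# Forward pass: keep only the final hidden state.
h1, h2 = h1_0, h2_0
for x in inputs:
    h1, h2 = forward_step(h1, h2, x, f, g)

# Backward pass: reconstruct the initial state from the final one.
r1, r2 = h1, h2
for x in inputs[::-1]:
    r1, r2 = reverse_step(r1, r2, x, f, g)
```

Note that this update only ever adds to the hidden state and never scales it down, so nothing is discarded; that is the "cannot forget" limitation the paper identifies, and the reason it stores a few extra bits to make forgetting reversible.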
Research into data science is a perpetual machine with new advancements coming frequently. Whether it’s machine learning, deep learning, neural networks, or something else, there’s always something new to learn.