The discipline of AI most often mentioned these days is deep learning (DL) along with its many incarnations implemented with deep neural networks. DL also is a rapidly accelerating area of research with papers being published at a fast clip by research teams from around the globe.
I enjoy keeping a pulse on deep learning research and so far in 2021 research innovations have propagated at a quick pace. Some of the top topical areas for deep learning research are: causality, explainability/interpretability, transformers, NLP, GPT, language models, GANs, deep learning for tabular data, and many others.
In this article, we’ll take a brief tour of my top picks for deep learning research (in no particular order) of papers that I found to be particularly compelling. I’m pretty attached to this leading-edge research. I’m known to carry a thick folder of recent research papers around in my backpack and consume all the great developments when I have a spare moment. Enjoy!
Check out my previous lists: Best Machine Learning Research of 2021 So Far, Best of Deep Reinforcement Learning Research of 2019, Most Influential NLP Research of 2019, and Most Influential Deep Learning Research of 2019.
In many scenarios, human decisions are explained based on some high-level concepts. This paper takes a step in the interpretability of neural networks by examining their internal representation or neuron’s activations against concepts. A concept is characterized by a set of samples that have specific features in common. A framework is proposed to check the existence of a causal relationship between a concept (or its negation) and task classes. While the previous methods focus on the importance of a concept to a task class, the paper goes further and introduces four measures to quantitatively determine the order of causality. Through experiments, the effectiveness of the proposed method is demonstrated in explaining the relationship between a concept and the predictive behavior of a neural network.
Text generation has become one of the most important yet challenging tasks in natural language processing (NLP). The resurgence of deep learning has greatly advanced this field by neural generation models, especially the paradigm of pretrained language models (PLMs). This paper presents an overview of the major advances achieved in the topic of PLMs for text generation. As the preliminaries, the paper presents the general task definition and briefly describes the mainstream architectures of PLMs for text generation. As the core content, the deep learning research paper discusses how to adapt existing PLMs to model different input data and satisfy special properties in the generated text.
Building a dialogue system that can communicate naturally with humans is a challenging yet interesting problem of agent-based computing. The rapid growth in this area is usually hindered by the long-standing problem of data scarcity as these systems are expected to learn syntax, grammar, decision making, and reasoning from insufficient amounts of task-specific data sets. The recently introduced pre-trained language models have the potential to address the issue of data scarcity and bring considerable advantages by generating contextualized word embeddings. These models are considered counterparts of ImageNet in NLP and have demonstrated the ability to capture different facets of language such as hierarchical relations, long-term dependency, and sentiment. This short survey paper discusses the recent progress made in the field of pre-trained language models.
AI is becoming increasingly more popular and can be found in workplaces and homes around the world. However, how do we ensure trust in these systems? Regulation changes such as the GDPR mean that users have a right to understand how their data has been processed as well as saved. Therefore if, for example, you are denied a loan you have the right to ask why. This can be hard if the method for working this out uses “black box” machine learning techniques such as neural networks. TrustyAI is a new initiative which looks into explainable artificial intelligence (XAI) solutions to address trustworthiness in ML as well as decision services landscapes. This deep learning research paper looks at how TrustyAI can support trust in decision services and predictive models. The paper investigates techniques such as LIME, SHAP and counterfactuals, benchmarking both LIME and counterfactual techniques against existing implementations.
Ever since its debut, generative adversarial networks (GANs) have attracted tremendous amount of attention. Over the past years, different variations of GANs models have been developed and tailored to different applications in practice. Meanwhile, some issues regarding the performance and training of GANs have been noticed and investigated from various theoretical perspectives. This paper starts from an introduction of GANs from an analytical perspective, then moves onto the training of GANs via SDE approximations and finally discusses some applications of GANs in computing high dimensional MFGs as well as tackling mathematical finance problems.
In spite of showing unreasonable effectiveness in modalities like Text and Image, deep learning has always lagged gradient boosting in tabular data – both in popularity and performance. But recently there have been newer models created specifically for tabular data, which is pushing the performance bar. But popularity is still a challenge because there is no easy, ready-to-use library like scikit-learn for deep learning. PyTorch Tabular is a new deep learning library which makes working with deep learning and tabular data easy and fast. It is a library built on top of PyTorch and PyTorch Lightning and works on Pandas dataframes directly. Many SOTA models like NODE and TabNet are already integrated and implemented in the library with a unified API. PyTorch Tabular is designed to be easily extensible for researchers, simple for practitioners, and robust in industrial deployments.
As soon as abstract mathematical computations were adapted to computation on digital computers, the problem of efficient representation, manipulation, and communication of the numerical values in those computations arose. Strongly related to the problem of numerical representation is the problem of quantization: in what manner should a set of continuous real-valued numbers be distributed over a fixed discrete set of numbers to minimize the number of bits required and also to maximize the accuracy of the attendant computations? This perennial problem of quantization is particularly relevant whenever memory and/or computational resources are severely restricted, and it has come to the forefront in recent years due to the remarkable performance of Neural Network models in computer vision, natural language processing, and related areas. Moving from floating-point representations to low-precision fixed integer values represented in four bits or less holds the potential to reduce the memory footprint and latency by a factor of 16x; and, in fact, reductions of 4x to 8x are often realized in practice in these applications. Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with Neural Networks. This paper surveys approaches to the problem of quantizing the numerical values in deep Neural Network computations, covering the advantages/disadvantages of current methods.
Complex learning rate schedules have become an integral part of deep learning. This research finds empirically that common fine-tuned schedules decay the learning rate after the weight norm bounces. This leads to the proposal of ABEL: an automatic scheduler which decays the learning rate by keeping track of the weight norm. ABEL’s performance matches that of tuned schedules and is more robust with respect to its parameters. Through extensive experiments in vision, NLP, and RL, it is shown that if the weight norm does not bounce, it is possible to simplify schedules even further with no loss in performance. In such cases, a complex schedule has similar performance to a constant learning rate with a decay at the end of training.
While GPTs with traditional fine-tuning fail to achieve strong results on natural language understanding (NLU), this paper shows that GPTs can be better than or comparable to similar-sized BERTs on NLU tasks with a novel method P-tuning — which employs trainable continuous prompt embeddings. On the knowledge probing (LAMA) benchmark, the best GPT recovers 64% (P@1) of world knowledge without any additional text provided during test time, which substantially improves the previous best by 20+ percentage points. On the SuperGlue benchmark, GPTs achieve comparable and sometimes better performance to similar-sized BERTs in supervised learning. Importantly, it is found that P-tuning also improves BERTs’ performance in both few-shot and supervised settings while largely reducing the need for prompt engineering. Consequently, P-tuning outperforms the state-of-the-art approaches on the few-shot SuperGlue benchmark.
Deep Convolutional Neural Networks (CNNs) have long been the architecture of choice for computer vision tasks. Recently, Transformer-based architectures like Vision Transformer (ViT) have matched or even surpassed ResNets for image classification. However, details of the Transformer architecture — such as the use of non-overlapping patches — lead one to wonder whether these networks are as robust. This paper performs an extensive study of a variety of different measures of robustness of ViT models and compare the findings to ResNet baselines. Investigated is robustness to input perturbations as well as robustness to model perturbations. The paper finds that when pre-trained with a sufficient amount of data, ViT models are at least as robust as the ResNet counterparts on a broad range of perturbations. Also found is that Transformers are robust to the removal of almost any single layer, and that while activations from later layers are highly correlated with each other, they nevertheless play an important role in classification.
The creation of altered and manipulated faces has become more common due to the improvement of DeepFake generation methods. Simultaneously, we have seen the development of detection models for differentiating between a manipulated and original face from image or video content. We have observed that most publicly available DeepFake detection datasets have limited variations, where a single face is used in many videos, resulting in an oversampled training dataset. Due to this, deep neural networks tend to overfit to the facial features instead of learning to detect manipulation features of DeepFake content. As a result, most detection architectures perform poorly when tested on unseen data. This paper provides a quantitative analysis to investigate this problem and present a solution to prevent model overfitting due to the high volume of samples generated from a small number of actors.
Edge TPUs are a domain of accelerators for low-power, edge devices and are widely used in various Google products such as Coral and Pixel devices. This paper first discusses the major microarchitectural details of Edge TPUs. This is followed by an extensive evaluation of three classes of Edge TPUs, covering different computing ecosystems that are either currently deployed in Google products or are the product pipeline. Building upon this extensive study, the paper discusses critical and interpretable microarchitectural insights about the studied classes of Edge TPUs. Mainly discussed is how Edge TPU accelerators perform across CNNs with different structures. Finally, the paper presents ongoing efforts in developing high-accuracy learned machine learning models to estimate the major performance metrics of accelerators such as latency and energy consumption. These learned models enable significantly faster (in the order of milliseconds) evaluations of accelerators as an alternative to time-consuming cycle-accurate simulators and establish an exciting opportunity for rapid hard-ware/software co-design.
Recently, the advancement of 3D point clouds in deep learning has attracted intensive research in different application domains such as computer vision and robotic tasks. However, creating feature representation of robust, discriminative from unordered and irregular point clouds is challenging. The goal of this paper is to provide a comprehensive overview of the point clouds feature representation which uses attention models. More than 75+ key contributions in the recent three years are summarized in this survey, including the 3D objective detection, 3D semantic segmentation, 3D pose estimation, point clouds completion etc. Also provided are: a detailed characterization of (i) the role of attention mechanisms, (ii) the usability of attention models into different tasks, and (iii) the development trend of key technology.
Deep neural networks (DNNs) are notorious for making more mistakes for the classes that have substantially fewer samples than the others during training. Such class imbalance is ubiquitous in clinical applications and very crucial to handle because the classes with fewer samples most often correspond to critical cases (e.g., cancer) where misclassifications can have severe consequences. Not to miss such cases, binary classifiers need to be operated at high True Positive Rates (TPR) by setting a higher threshold but this comes at the cost of very high False Positive Rates (FPR) for problems with class imbalance. Existing methods for learning under class imbalance most often do not take this into account. This paper argues that prediction accuracy should be improved by emphasizing reducing FPRs at high TPRs for problems where misclassification of the positive samples are associated with higher cost. To this end, it’s posed the training of a DNN for binary classification as a constrained optimization problem and introduce a novel constraint that can be used with existing loss functions to enforce maximal area under the ROC curve (AUC). The resulting constrained optimization problem is solved using an Augmented Lagrangian method (ALM), where the constraint emphasizes reduction of FPR at high TPR. Results demonstrate that the proposed method almost always improves the loss functions it is used with by attaining lower FPR at high TPR and higher or equal AUC.
While normalizations aim to fix the exploding and vanishing gradient problem in deep neural networks, they have drawbacks in speed or accuracy because of their dependency on the data set statistics. This paper is a comprehensive study of a novel method based on unitary synaptic weights derived from Lie Group to construct intrinsically stable neural systems. Here it’s shown that unitary convolutional neural networks deliver up to 32% faster inference speeds while maintaining competitive prediction accuracy. Unlike prior arts restricted to square synaptic weights, the paper expands the unitary networks to weights of any size and dimension.
The recent explosive interest with transformers has suggested their potential to become powerful “universal” models for computer vision tasks, such as classification, detection, and segmentation. An important question is how much further transformers can go – are they ready to take some more notoriously difficult vision tasks, e.g., generative adversarial networks (GANs)? Driven by that curiosity, this paper conducts the first pilot study in building a GAN completely free of convolutions, using only pure transformer-based architectures. The proposed vanilla GAN architecture, dubbed TransGAN, consists of a memory-friendly transformer-based generator that progressively increases feature resolution while decreasing embedding dimension, and a patch-level discriminator that is also transformer-based. TransGAN is seen to notably benefit from data augmentations (more than standard GANs), a multi-task co-training strategy for the generator, and a locally initialized self-attention that emphasizes the neighborhood smoothness of natural images. Equipped with those findings, TransGAN can effectively scale up with bigger models and high-resolution image datasets. Specifically, the architecture achieves highly competitive performance compared to current state-of-the-art GANs based on convolutional backbones. The GitHub repo associated with this paper can be found HERE.
Scene classification, aiming at classifying a scene image to one of the predefined scene categories by comprehending the entire image, is a longstanding, fundamental and challenging problem in computer vision. The rise of large-scale datasets, which constitute a dense sampling of diverse real-world scenes, and the renaissance of deep learning techniques, which learn powerful feature representations directly from big raw data, have been bringing remarkable progress in the field of scene representation and classification. To help researchers master needed advances in this field, the goal of this paper is to provide a comprehensive survey of recent achievements in scene classification using deep learning. More than 260 major publications are included in this survey covering different aspects of scene classification, including challenges, benchmark datasets, taxonomy, and quantitative performance comparisons of the reviewed methods. In retrospect of what has been achieved so far, this paper is concluded with a list of promising research opportunities.
Explainable Artificial Intelligence (XAI) has in recent years become a well-suited framework to generate human-understandable explanations of black box models. This paper presents a novel XAI visual explanation algorithm denoted SIDU that can effectively localize entire object regions responsible for prediction. The paper analyzes its robustness and effectiveness through various computational and human subject experiments. In particular, the SIDU algorithm is assessed using three different types of evaluations (Application, Human and Functionally-Grounded) to demonstrate its superior performance. The robustness of SIDU is further studied in presence of adversarial attack on black box models to better understand its performance.
This paper proposes a method for meta-learning reinforcement learning algorithms by searching over the space of computational graphs which compute the loss function for a value-based model-free RL agent to optimize. The learned algorithms are domain-agnostic and can generalize to new environments not seen during training. The method can both learn from scratch and bootstrap off known existing algorithms, like DQN, enabling interpretable modifications which improve performance. Learning from scratch on simple classical control and gridworld tasks, the method rediscovers the temporal-difference (TD) algorithm. Bootstrapped from DQN, two learned algorithms are highlighted which obtain good generalization performance over other classical control tasks, gridworld type tasks, and Atari games. The analysis of the learned algorithm behavior shows resemblance to recently proposed RL algorithms that address overestimation in value-based methods.
VGG-style ConvNets, although now considered a classic architecture, were attractive due to their simplicity. In contrast, ResNets have become popular due to their high accuracy but are more difficult to customize and display undesired inference drawbacks. To address these issues, Ding et al. propose RepVGG – the return of the VGG!
RepVGG is an efficient and simple architecture using plain VGG-style ConvNets. It decouples the inference-time and training-time architecture through a structural re-parameterization technique. The researchers report favorable speed-accuracy tradeoff compared to state-of-the-art models, such as EfficientNet and RegNet. RepVGG achieves 80% top-1 accuracy on ImageNet and is benchmarked as being 83% faster than ResNet-50. This research is part of a broader effort to build more efficient models using simpler architectures and operations. The GitHub repo associated with this paper can be found HERE.
In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model — with outrageous numbers of parameters — but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability — this paper addresses these with the Switch Transformer. The Google Brain researchers simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. The proposed training techniques help wrangle the instabilities and it is shown that large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. They design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings to measure gains over the mT5-Base version across all 101 languages. Finally, the paper advances the current scale of language models by pre-training up to trillion parameter models on the “Colossal Clean Crawled Corpus” and achieve a 4x speedup over the T5-XXL model. The GitHub repo associated with this paper can be found HERE.
How to Learn More about Deep Learning Research
At our upcoming event this November 16th-18th in San Francisco, ODSC West 2021 will feature a plethora of talks, workshops, and training sessions on deep learning and deep learning research. You can register now for 60% off all ticket types before the discount drops to 40% in a few weeks. Some highlighted sessions on deep learning include:
Sessions on Deep Learning and Deep Learning Research:
- GANs: Theory and Practice, Image Synthesis With GANs Using TensorFlow: Ajay Baranwal | Center Director | Center for Deep Learning in Electronic Manufacturing, Inc
- Machine Learning With Graphs: Going Beyond Tabular Data: Dr. Clair J. Sullivan | Data Science Advocate | Neo4j
- Deep Dive into Reinforcement Learning with PPO using TF-Agents & TensorFlow 2.0: Oliver Zeigermann | Software Developer | embarc Software Consulting GmbH
- Get Started with Time-Series Forecasting using the Google Cloud AI Platform: Karl Weinmeister | Developer Relations Engineering Manager | Google
Sessions on Machine Learning:
- Towards More Energy-Efficient Neural Networks? Use Your Brain!: Olaf de Leeuw | Data Scientist | Dataworkz
- Practical MLOps: Automation Journey: Evgenii Vinogradov, PhD | Head of DHW Development | YooMoney
- Applications of Modern Survival Modeling with Python: Brian Kent, PhD | Data Scientist | Founder The Crosstab Kite
- Using Change Detection Algorithms for Detecting Anomalous Behavior in Large Systems: Veena Mendiratta, PhD | Adjunct Faculty, Network Reliability and Analytics Researcher | Northwestern University
Sessions on MLOps:
- Tuning Hyperparameters with Reproducible Experiments: Milecia McGregor | Senior Software Engineer | Iterative
- MLOps… From Model to Production: Filipa Peleja, PhD | Lead Data Scientist | Levi Strauss & Co
- Operationalization of Models Developed and Deployed in Heterogeneous Platforms: Sourav Mazumder | Data Scientist, Thought Leader, AI & ML Operationalization Leader | IBM
- Develop and Deploy a Machine Learning Pipeline in 45 Minutes with Ploomber: Eduardo Blancas | Data Scientist | Fidelity Investment