As we close in on the end of 2022, I’m energized by all the amazing work completed by many prominent research groups extending the state of AI, machine learning, deep learning, and NLP in a variety of important directions. In this article, I’ll keep you up to date with some of my top picks of papers thus far for 2022 that I found particularly compelling and useful. Through my effort to stay current with the field’s research advancement, I found the directions represented in these papers to be very promising. I hope you enjoy my selections of data science research as much as I have. I typically designate a weekend to consume an entire paper. What a great way to relax!
This post explains the GELU activation function, which has been recently used in Google AI’s BERT and OpenAI’s GPT models. Both of these models have achieved state-of-the-art results in various NLP tasks. For busy readers, this section covers the definition and implementation of the GELU activation. The rest of the post provides an introduction and discusses some intuition behind GELU.
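As a quick reference, here is a minimal sketch of the standard GELU definition, GELU(x) = x · Φ(x), where Φ is the standard normal CDF, alongside the widely used tanh approximation. The function names `gelu` and `gelu_tanh` are chosen for illustration only and are not taken from the post.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


if __name__ == "__main__":
    # Compare the exact form and the approximation on a few inputs.
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"x={x:+.1f}  exact={gelu(x):+.4f}  tanh={gelu_tanh(x):+.4f}")
```

Unlike ReLU, GELU is smooth and lets small negative inputs through with a small negative output, which is visible in the values printed for x = -1 and x = -2.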
Neural networks have shown tremendous growth in recent years to solve numerous problems. Various types of neural networks have been introduced to deal with different types of problems. However, the main goal of any neural network is to transform the non-linearly separable input data into more linearly separable abstract features using a hierarchy of layers. These layers are combinations of linear and nonlinear functions. The most popular and common non-linearity layers are activation functions (AFs), such as Logistic Sigmoid, Tanh, ReLU, ELU, Swish, and Mish. In this paper, a comprehensive overview and survey is presented for AFs in neural networks for deep learning. Different classes of AFs such as Logistic Sigmoid and Tanh based, ReLU based, ELU based, and Learning based are covered. Several characteristics of AFs such as output range, monotonicity, and smoothness are also pointed out. A performance comparison is also performed among 18 state-of-the-art AFs with different networks on different types of data. The paper closes with insights intended to help researchers pursue further data science research and to help practitioners choose among the available AFs. The code used for experimental comparison is released HERE.
The final goal of all industrial machine learning (ML) projects is to develop ML products and rapidly bring them into production. However, it is highly challenging to automate and operationalize ML products, and thus many ML endeavors fail to deliver on their expectations. The paradigm of Machine Learning Operations (MLOps) addresses this issue. MLOps includes several aspects, such as best practices, sets of concepts, and development culture. However, MLOps is still a vague term, and its consequences for researchers and professionals are ambiguous. This paper addresses this gap by conducting mixed-method research, including a literature review, a tool review, and expert interviews. As a result of these investigations, the paper provides an aggregated overview of the necessary principles, components, and roles, as well as the associated architecture and workflows.
Diffusion models are a class of deep generative models that have shown impressive results on various tasks and are backed by a solid theoretical foundation. Although diffusion models have achieved more impressive quality and diversity of sample synthesis than other state-of-the-art models, they still suffer from costly sampling procedures and sub-optimal likelihood estimation. Recent studies have shown great enthusiasm for improving the performance of diffusion models. This paper presents the first comprehensive review of existing variants of diffusion models. Also provided is the first taxonomy of diffusion models, which categorizes them into three types: sampling-acceleration enhancement, likelihood-maximization enhancement, and data-generalization enhancement. The paper also introduces five other generative models (i.e., variational autoencoders, generative adversarial networks, normalizing flows, autoregressive models, and energy-based models) in detail and clarifies the connections between diffusion models and these generative models. Lastly, the paper investigates the applications of diffusion models, including computer vision, natural language processing, waveform signal processing, multi-modal modeling, molecular graph generation, time series modeling, and adversarial purification.
This paper presents a new method for supervised learning with multiple sets of features (“views”). Multiview analysis with “-omics” data such as genomics and proteomics measured on a common set of samples represents an increasingly important challenge in biology and medicine. Cooperative learning combines the usual squared error loss of predictions with an “agreement” penalty to encourage the predictions from different data views to agree. The method can be especially powerful when the different data views share some underlying relationship in their signals that can be exploited to boost the signals.
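To make the idea of a squared error loss plus an agreement penalty concrete, here is a minimal sketch for two views. It assumes the objective takes the form 0.5 · (y − f_X − f_Z)² + 0.5 · ρ · (f_X − f_Z)², averaged over samples. The weight name `rho`, the function name `cooperative_loss`, and the toy numbers are assumptions made for illustration, not details taken from the summary above.

```python
def cooperative_loss(y, pred_x, pred_z, rho):
    """Average two-view loss: squared-error fit plus an agreement penalty.

    y      -- observed targets
    pred_x -- predictions contributed by the first view
    pred_z -- predictions contributed by the second view
    rho    -- assumed non-negative weight on the agreement penalty
    """
    if not (len(y) == len(pred_x) == len(pred_z)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for yi, fx, fz in zip(y, pred_x, pred_z):
        fit = 0.5 * (yi - fx - fz) ** 2          # usual squared error
        agreement = 0.5 * rho * (fx - fz) ** 2   # penalize disagreement
        total += fit + agreement
    return total / len(y)


if __name__ == "__main__":
    # Toy, made-up numbers: the summed predictions fit y exactly,
    # but the two views disagree on the second sample.
    y = [2.0, 4.0]
    pred_x = [1.0, 1.5]
    pred_z = [1.0, 2.5]
    print(cooperative_loss(y, pred_x, pred_z, rho=0.0))
    print(cooperative_loss(y, pred_x, pred_z, rho=1.0))
```

With `rho=0` only the fit term matters. Raising `rho` increases the cost of any gap between the two views' predictions, even when their sum fits the target perfectly.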
Getting the most out of limited resources, whether data, time, storage, or energy, allows natural language processing (NLP) data science research and practice to advance without a matching rise in consumption. Recent work in NLP has yielded interesting results from scaling; however, using only scale to improve results means that resource consumption also scales. That relationship motivates research into efficient methods that require fewer resources to achieve similar results. This survey relates and synthesizes methods and findings on efficiency in NLP, aiming to guide new researchers in the field and inspire the development of new methods.
This paper shows that standard Transformers without graph-specific modifications can lead to promising results in graph learning both in theory and practice. Given a graph, it is a matter of simply treating all nodes and edges as independent tokens, augmenting them with token embeddings, and feeding them to a Transformer. With an appropriate choice of token embeddings, the paper proves that this approach is theoretically at least as expressive as an invariant graph network (2-IGN) composed of equivariant linear layers, which is already more expressive than all message-passing Graph Neural Networks (GNN). When trained on a large-scale graph dataset (PCQM4Mv2), the suggested method coined Tokenized Graph Transformer (TokenGT) achieves significantly better results compared to GNN baselines and competitive results compared to Transformer variants with sophisticated graph-specific inductive bias. The code associated with this paper can be found HERE.
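The tokenization step can be sketched roughly as follows. This is an illustrative assumption about how the token embeddings might be built, not the authors' implementation: each token concatenates its features with two node identifiers (one-hot vectors here, as a simple orthonormal choice) and a fixed type identifier marking it as a node or an edge. The helper names `one_hot` and `tokenize_graph` are hypothetical, and node and edge features are assumed to share the same width.

```python
def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1.0 at `index`."""
    return [1.0 if j == index else 0.0 for j in range(size)]


def tokenize_graph(node_feats, edges, edge_feats):
    """Turn a graph into a flat list of independent tokens.

    node_feats -- one feature list per node
    edges      -- (u, v) index pairs
    edge_feats -- one feature list per edge, same width as node features
    """
    n = len(node_feats)
    node_type, edge_type = [1.0, 0.0], [0.0, 1.0]  # assumed fixed type ids
    tokens = []
    for v, feats in enumerate(node_feats):
        ident = one_hot(v, n)
        # A node token repeats its own identifier twice.
        tokens.append(list(feats) + ident + ident + node_type)
    for (u, v), feats in zip(edges, edge_feats):
        # An edge token carries the identifiers of both endpoints.
        tokens.append(list(feats) + one_hot(u, n) + one_hot(v, n) + edge_type)
    return tokens


if __name__ == "__main__":
    # A triangle graph with scalar (width-1) features, values made up.
    nodes = [[0.1], [0.2], [0.3]]
    edges = [(0, 1), (1, 2), (2, 0)]
    edge_feats = [[1.0], [1.0], [1.0]]
    tokens = tokenize_graph(nodes, edges, edge_feats)
    print(len(tokens), len(tokens[0]))
```

The resulting sequence can be fed to an ordinary Transformer encoder. The node identifiers are what let attention recover which edge tokens touch which node tokens, since the sequence order itself carries no graph structure.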
While deep learning has enabled tremendous progress on text and image datasets, its superiority on tabular data is not clear. This paper contributes extensive benchmarks of standard and novel deep learning methods as well as tree-based models such as XGBoost and Random Forests, across a large number of datasets and hyperparameter combinations. The paper defines a standard set of 45 datasets from varied domains with clear characteristics of tabular data and a benchmarking methodology accounting for both fitting models and finding good hyperparameters. Results show that tree-based models remain state-of-the-art on medium-sized data (∼10K samples) even without accounting for their superior speed. To understand this gap, the paper conducts an empirical investigation into the differing inductive biases of tree-based models and Neural Networks (NNs). This leads to a series of challenges that should guide researchers aiming to build tabular-specific NNs: 1. be robust to uninformative features, 2. preserve the orientation of the data, and 3. be able to easily learn irregular functions.
By providing unprecedented access to computational resources, cloud computing has enabled rapid growth in technologies such as machine learning, the computational demands of which incur a high energy cost and a commensurate carbon footprint. As a result, recent scholarship has called for better estimates of the greenhouse gas impact of AI: data scientists today do not have easy or reliable access to measurements of this information, precluding the development of actionable tactics. Having cloud providers present information about software carbon intensity to users is a fundamental stepping stone towards minimizing emissions. This paper provides a framework for measuring software carbon intensity and proposes to measure operational carbon emissions by using location-based and time-specific marginal emissions data per energy unit. The paper provides measurements of operational software carbon intensity for a set of modern natural language processing and computer vision models across a wide range of model sizes, including pretraining of a 6.1 billion parameter language model. The paper then evaluates a suite of approaches for reducing emissions on the Microsoft Azure cloud compute platform: using cloud instances in different geographic regions, using cloud instances at different times of day, and dynamically pausing cloud instances when the marginal carbon intensity is above a certain threshold.
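The core accounting idea, energy used in each time interval multiplied by that interval's marginal emissions per energy unit, can be sketched in a few lines, along with a simple threshold-based pause-and-resume policy. Everything below is a hypothetical illustration: the function names, the units (kWh and grams of CO2-equivalent per kWh), and the numbers are made up for the example and are not taken from the paper.

```python
def operational_emissions(energy_kwh, intensity_g_per_kwh):
    """Sum energy * marginal carbon intensity over matching time intervals."""
    if len(energy_kwh) != len(intensity_g_per_kwh):
        raise ValueError("series must cover the same intervals")
    return sum(e * c for e, c in zip(energy_kwh, intensity_g_per_kwh))


def pause_and_resume(energy_per_step_kwh, steps_needed, intensity_g_per_kwh, threshold):
    """Run one step per interval, but only when intensity <= threshold.

    Returns (emissions in grams, number of intervals elapsed).
    """
    done = 0
    emitted = 0.0
    for t, intensity in enumerate(intensity_g_per_kwh):
        if done == steps_needed:
            return emitted, t
        if intensity <= threshold:
            emitted += energy_per_step_kwh * intensity
            done += 1
    if done == steps_needed:
        return emitted, len(intensity_g_per_kwh)
    raise ValueError("intensity series too short to finish the job")


if __name__ == "__main__":
    intensity = [400.0, 600.0, 300.0, 700.0, 350.0]  # made-up values
    # Baseline: run 3 steps back to back, 2 kWh each.
    print(operational_emissions([2.0, 2.0, 2.0], intensity[:3]))
    # Policy: skip intervals where intensity exceeds 450.
    print(pause_and_resume(2.0, 3, intensity, threshold=450.0))
```

In this toy run the paused job emits less but takes five intervals instead of three, which is the basic trade-off a pausing policy has to weigh.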
According to the paper, YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 160 FPS and has the highest accuracy, 56.8% AP, among all known real-time object detectors running at 30 FPS or higher on a V100 GPU. The YOLOv7-E6 object detector (56 FPS on V100, 55.9% AP) outperforms both the transformer-based detector SWIN-L Cascade-Mask R-CNN (9.2 FPS on A100, 53.9% AP) by 509% in speed and 2% in accuracy, and the convolution-based detector ConvNeXt-XL Cascade-Mask R-CNN (8.6 FPS on A100, 55.2% AP) by 551% in speed and 0.7% AP in accuracy. YOLOv7 also outperforms YOLOR, YOLOX, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, DINO-5scale-R50, ViT-Adapter-B, and many other object detectors in speed and accuracy. Moreover, YOLOv7 is trained only on the MS COCO dataset from scratch, without using any other datasets or pre-trained weights. The code associated with this paper can be found HERE.
The Generative Adversarial Network (GAN) is one of the state-of-the-art generative models for realistic image synthesis. As training and evaluating GANs becomes increasingly important, the current GAN research ecosystem does not provide reliable benchmarks in which evaluation is conducted consistently and fairly. Furthermore, because there are few validated GAN implementations, researchers devote considerable time to reproducing baselines. This paper studies the taxonomy of GAN approaches and presents a new open-source library named StudioGAN. StudioGAN supports 7 GAN architectures, 9 conditioning methods, 4 adversarial losses, 13 regularization modules, 3 differentiable augmentations, 7 evaluation metrics, and 5 evaluation backbones. With the proposed training and evaluation protocol, the paper presents a large-scale benchmark using various datasets (CIFAR10, ImageNet, AFHQv2, FFHQ, and Baby/Papa/Granpa-ImageNet) and 3 different evaluation backbones (InceptionV3, SwAV, and Swin Transformer). Unlike other benchmarks used in the GAN community, the paper trains representative GANs, including BigGAN, StyleGAN2, and StyleGAN3, in a unified training pipeline and quantifies generation performance with 7 evaluation metrics. The benchmark also evaluates other cutting-edge generative models (e.g., StyleGAN-XL, ADM, MaskGIT, and RQ-Transformer). StudioGAN provides GAN implementations, training, and evaluation scripts with pre-trained weights. The code associated with this paper can be found HERE.
Detecting out-of-distribution inputs is critical for the safe deployment of machine learning models in the real world. However, neural networks are known to suffer from the overconfidence issue, where they produce abnormally high confidence for both in- and out-of-distribution inputs. This ICML 2022 paper shows that this issue can be mitigated through Logit Normalization (LogitNorm), a simple fix to the cross-entropy loss that enforces a constant vector norm on the logits in training. The proposed method is motivated by the analysis that the norm of the logit keeps increasing during training, leading to overconfident output. The key idea behind LogitNorm is thus to decouple the influence of the output’s norm during network optimization. Trained with LogitNorm, neural networks produce highly distinguishable confidence scores between in- and out-of-distribution data. Extensive experiments demonstrate the superiority of LogitNorm, reducing the average FPR95 by up to 42.30% on common benchmarks.
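The idea of enforcing a constant norm on the logits can be sketched in plain Python: divide the logit vector by its L2 norm (scaled by a temperature) before applying the usual cross-entropy. This is a minimal illustration rather than the authors' code. The function names, the `eps` guard against division by zero, and the example temperature are assumptions for the sketch.

```python
import math


def cross_entropy(logits, label):
    """Standard softmax cross-entropy for a single example."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def logit_norm_loss(logits, label, tau, eps=1e-7):
    """Cross-entropy on logits rescaled to a constant L2 norm of 1/tau.

    tau -- temperature hyperparameter (must be tuned; value is an assumption)
    eps -- small constant to avoid dividing by zero
    """
    norm = math.sqrt(sum(z * z for z in logits)) + eps
    scaled = [z / (norm * tau) for z in logits]
    return cross_entropy(scaled, label)


if __name__ == "__main__":
    small = [1.0, 2.0, 3.0]
    large = [10.0, 20.0, 30.0]  # same direction, 10x the norm
    # Plain cross-entropy rewards simply growing the logit norm...
    print(cross_entropy(small, 2), cross_entropy(large, 2))
    # ...while the normalized loss depends only on the logit direction.
    print(logit_norm_loss(small, 2, tau=0.5), logit_norm_loss(large, 2, tau=0.5))
```

Because the normalized loss is unchanged when the logits are scaled up, the optimizer can no longer lower the training loss just by inflating the logit norm, which is the mechanism the paper links to overconfident outputs.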
This is a collection of (mostly) pen-and-paper exercises in machine learning. The exercises are on the following topics: linear algebra, optimization, directed graphical models, undirected graphical models, expressive power of graphical models, factor graphs and message passing, inference for hidden Markov models, model-based learning (including ICA and unnormalized models), sampling and Monte-Carlo integration, and variational inference.
The recent success of Vision Transformers is shaking the decade-long dominance of Convolutional Neural Networks (CNNs) in image recognition. Specifically, in terms of robustness on out-of-distribution samples, recent data science research finds that Transformers are inherently more robust than CNNs, regardless of different training setups. Moreover, it is believed that such superiority of Transformers should largely be credited to their self-attention-like architectures per se. This paper questions that belief by closely examining the design of Transformers. The findings lead to three highly effective architecture designs for boosting robustness, each simple enough to be implemented in several lines of code: a) patchifying input images, b) enlarging kernel size, and c) reducing activation layers and normalization layers. Bringing these components together, it’s possible to build pure CNN architectures without any attention-like operations that are as robust as, or even more robust than, Transformers. The code associated with this paper can be found HERE.
Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. This paper presents Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which the authors aim to share fully and responsibly with interested researchers. The paper shows that OPT-175B is comparable to GPT-3 while requiring only 1/7th the carbon footprint to develop. The code associated with this paper can be found HERE.
Heterogeneous tabular data are the most commonly used form of data and are essential for numerous critical and computationally demanding applications. On homogeneous data sets, deep neural networks have repeatedly shown excellent performance and have therefore been widely adopted. However, their adaptation to tabular data for inference or data generation tasks remains challenging. To facilitate further progress in the field, this paper provides an overview of state-of-the-art deep learning methods for tabular data. The paper categorizes these methods into three groups: data transformations, specialized architectures, and regularization models. For each of these groups, the paper offers a comprehensive overview of the main approaches.
Learn more about data science research at ODSC West 2022
If all of this data science research into machine learning, deep learning, NLP, and more interests you, then learn more about the field at ODSC West 2022 this November 1st-3rd. At this event – with both in-person and virtual ticket options – you can learn from many of the leading research labs around the world, all about new tools, frameworks, applications, and developments in the field. Here are a few standout sessions as part of our data science research frontier track:
- Scalable, Real-Time Heart Rate Variability Biofeedback for Precision Health: A Novel Algorithmic Approach
- Causal/Prescriptive Analytics in Business Decisions
- Artificial Intelligence Can Learn from Data. But Can It Learn to Reason?
- StructureBoost: Gradient Boosting with Categorical Structure
- Machine Learning Models for Quantitative Finance and Trading
- An Intuition-Based Approach to Reinforcement Learning
- Robust and Equitable Uncertainty Estimation