2022 Data Science Research Round Up: Highlighting ML, AI/DL, & NLP

As we say farewell to 2022, I’m encouraged to look back at all the leading-edge research that happened in just a year’s time. So many prominent data science research groups have worked tirelessly to extend the state of machine learning, AI, deep learning, and NLP in a variety of important directions. In this article, I’ll summarize some of my favorite papers of 2022, the ones I found particularly compelling and useful. Through my efforts to stay current with the field’s research advances, I found the directions represented in these papers to be very promising. I hope you enjoy my selections as much as I have. I typically designate the year-end break as a time to consume a number of data science research papers. What a great way to wrap up the year! Be sure to check out my last research round-up for even more fun!

Galactica: A Large Language Model for Science

Information overload is a major obstacle to scientific progress. The explosive growth in scientific literature and data has made it even harder to discover useful insights in a large mass of information. Today scientific knowledge is accessed through search engines, but they are unable to organize scientific knowledge alone. This is the paper that introduces Galactica: a large language model that can store, combine and reason about scientific knowledge. The model is trained on a large scientific corpus of papers, reference material, knowledge bases, and many other sources.

Beyond neural scaling laws: beating power law scaling via data pruning

Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both, have driven substantial performance improvements in deep learning. However, these improvements through scaling alone come at considerable cost in compute and energy. This NeurIPS 2022 outstanding paper from Meta AI focuses on the scaling of error with dataset size and shows how, in theory, we can break beyond power law scaling and potentially even reduce it to exponential scaling, provided we have access to a high-quality data pruning metric that ranks the order in which training examples should be discarded to achieve any pruned dataset size.
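
To make the pruning step concrete, here is a minimal sketch, assuming a per-example difficulty score has already been computed somehow. The names `prune_dataset`, `scores`, `keep_fraction`, and `keep_hardest` are illustrative and not taken from the paper, and whether to keep the hardest or the easiest examples is left as a switch rather than a recommendation.

```python
def prune_dataset(examples, scores, keep_fraction, keep_hardest=True):
    """Return the examples that survive pruning, in their original order.

    scores[i] is a hypothetical difficulty score for examples[i]
    (higher = harder). The pruning metric itself is assumed given.
    """
    if len(examples) != len(scores):
        raise ValueError("examples and scores must have the same length")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    n_keep = max(1, round(keep_fraction * len(examples)))
    # Rank indices by score; the ranking defines the discard order.
    ranked = sorted(range(len(examples)),
                    key=lambda i: scores[i],
                    reverse=keep_hardest)
    kept = sorted(ranked[:n_keep])  # restore original ordering
    return [examples[i] for i in kept]


examples = ["a", "b", "c", "d", "e"]
scores = [0.1, 0.9, 0.4, 0.7, 0.2]
pruned_hard = prune_dataset(examples, scores, keep_fraction=0.4)
pruned_easy = prune_dataset(examples, scores, keep_fraction=0.4,
                            keep_hardest=False)
```

Because a single ranking determines the discard order, the same scores can produce a pruned dataset of any target size just by changing `keep_fraction`.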


TSInterpret: A unified framework for time series interpretability

With the increasing application of deep learning algorithms to time series classification, especially in high-stakes scenarios, the relevance of interpreting those algorithms becomes key. Although research in time series interpretability has grown, accessibility for practitioners is still an obstacle. Interpretability approaches and their visualizations are diverse in use without a unified API or framework. To close this gap, we introduce TSInterpret, an easily extensible open-source Python library for interpreting predictions of time series classifiers that combines existing interpretation approaches into one unified framework.

A Time Series is Worth 64 Words: Long-term Forecasting with Transformers

This paper proposes an efficient design of Transformer-based models for multivariate time series forecasting and self-supervised representation learning. It is based on two key components: (i) segmentation of time series into subseries-level patches, which serve as input tokens to the Transformer; (ii) channel independence, where each channel contains a single univariate time series and all series share the same embedding and Transformer weights. Code for this paper can be found HERE.
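
The two components above can be sketched in a few lines of plain Python. This is only an illustration of the idea: the function names, `patch_len`, and `stride` are my own assumptions, and details such as padding and the embedding layer are left out.

```python
def make_patches(series, patch_len, stride):
    """Split one univariate series into (possibly overlapping) patches."""
    if patch_len <= 0 or stride <= 0:
        raise ValueError("patch_len and stride must be positive")
    last_start = len(series) - patch_len
    # Each patch would become one input token after embedding.
    return [series[s:s + patch_len]
            for s in range(0, last_start + 1, stride)]


def patch_channels(multivariate, patch_len, stride):
    """Channel independence: every channel is patched on its own.

    In a real model the resulting token sequences would all be fed
    through the same embedding and Transformer weights.
    """
    return [make_patches(channel, patch_len, stride)
            for channel in multivariate]


# Two channels, eight time steps each.
multivariate = [list(range(8)), [10 * v for v in range(8)]]
patches = patch_channels(multivariate, patch_len=4, stride=2)
```

With these illustrative settings, each channel of length 8 turns into 3 tokens instead of 8, which is where the efficiency gain of patching comes from.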

TalkToModel: Explaining Machine Learning Models with Interactive Natural Language Conversations

Machine Learning (ML) models are increasingly used to make critical decisions in real-world applications, yet they have become more complex, making them harder to understand. To this end, researchers have proposed several techniques to explain model predictions. However, practitioners struggle to use these explainability techniques because they often do not know which one to choose or how to interpret the resulting explanations. In this work, we address these challenges by introducing TalkToModel: an interactive dialogue system for explaining machine learning models through conversations. Code for this paper can be found HERE.

ferret: a Framework for Benchmarking Explainers on Transformers

Many interpretability tools allow practitioners and researchers to explain Natural Language Processing systems. However, each tool requires different configurations and provides explanations in different forms, hindering the possibility of assessing and comparing them. A principled, unified evaluation benchmark will guide the users through the central question: which explanation method is more reliable for my use case? This paper introduces ferret, an easy-to-use, extensible Python library to explain Transformer-based models integrated with the Hugging Face Hub.

Large language models are not zero-shot communicators

Despite the widespread use of LLMs as conversational agents, evaluations of performance fail to capture a crucial aspect of communication: interpreting language in context. Humans interpret language using beliefs and prior knowledge about the world. For example, we intuitively understand the response “I wore gloves” to the question “Did you leave fingerprints?” as meaning “No”. To investigate whether LLMs have the ability to make this type of inference, known as an implicature, we design a simple task and evaluate widely used state-of-the-art models. 

Core ML Stable Diffusion

Apple released a Python package for converting Stable Diffusion models from PyTorch to Core ML, to run Stable Diffusion faster on hardware with M1/M2 chips. The repository comprises:

  • python_coreml_stable_diffusion, a Python package for converting PyTorch models to Core ML format and performing image generation with Hugging Face diffusers in Python
  • StableDiffusion, a Swift package that developers can add to their Xcode projects as a dependency to deploy image generation capabilities in their apps. The Swift package relies on the Core ML model files generated by python_coreml_stable_diffusion

Adam Can Converge Without Any Modification On Update Rules

Ever since Reddi et al. 2018 pointed out the divergence issue of Adam, many new variants have been designed to obtain convergence. However, vanilla Adam remains exceptionally popular and it works well in practice. Why is there a gap between theory and practice? This paper points out a mismatch between the settings of theory and practice: Reddi et al. 2018 pick the problem after picking the hyperparameters of Adam, while practical applications often fix the problem first and then tune the hyperparameters.
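
For reference, here is a minimal sketch of the standard, unmodified Adam update on a one-dimensional toy problem, so it is clear which hyperparameters are being discussed. This is the textbook update rule, not code from the paper; the toy objective f(x) = (x - 3)^2 and the specific values of `lr`, `beta1`, `beta2`, and `steps` are assumptions chosen for illustration.

```python
import math


def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Run vanilla Adam on a scalar parameter and return the final value."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# The problem is fixed first (minimize (x - 3)^2); the hyperparameters
# above are then chosen for it, mirroring the practical workflow.
x_final = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

The point of the sketch is the order of choices: the objective is decided first, and `beta1` and `beta2` are tuned afterwards.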

Language Models are Realistic Tabular Data Generators

Tabular data is among the oldest and most ubiquitous forms of data. However, generating synthetic samples that preserve the original data’s characteristics remains a significant challenge for tabular data. While many generative models from the computer vision domain, such as autoencoders or generative adversarial networks, have been adapted for tabular data generation, less research has been directed towards recent transformer-based large language models (LLMs), which are also generative in nature. To this end, we propose GReaT (Generation of Realistic Tabular data), which exploits an auto-regressive generative LLM to sample synthetic and yet highly realistic tabular data.
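
To use a language model on a table, each row first has to become text. Below is a minimal sketch of one way to do that, assuming a simple "column is value" sentence format with optional shuffling of the column order. The format, the function names, and the sample row are my own illustrative assumptions rather than a description of the GReaT code base, and the round trip returns all values as strings.

```python
import random


def row_to_text(row, rng=None):
    """Serialize one table row (a dict) into a single sentence."""
    items = list(row.items())
    if rng is not None:
        rng.shuffle(items)  # optional: randomize the column order
    return ", ".join(f"{col} is {val}" for col, val in items)


def text_to_row(text):
    """Parse a generated sentence back into a dict of string values."""
    row = {}
    for clause in text.split(", "):
        col, sep, val = clause.partition(" is ")
        if sep:  # skip malformed clauses without " is "
            row[col] = val
    return row


row = {"age": 39, "education": "Bachelors", "income": "<=50K"}
sentence = row_to_text(row)
shuffled = row_to_text(row, rng=random.Random(0))
recovered = text_to_row(shuffled)
```

A model fine-tuned on such sentences can then be sampled, and its output parsed back into rows with `text_to_row`.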

Deep Classifiers trained with the Square Loss

This data science research represents one of the first theoretical analyses covering optimization, generalization and approximation in deep networks. The paper proves that sparse deep networks such as CNNs can generalize significantly better than dense networks. 

Gaussian-Bernoulli RBMs Without Tears

This paper revisits the challenging problem of training Gaussian-Bernoulli restricted Boltzmann machines (GRBMs), introducing two innovations. The first is a novel Gibbs-Langevin sampling algorithm that outperforms existing methods like Gibbs sampling. The second is a modified contrastive divergence (CD) algorithm that makes it possible to generate images with GRBMs starting from noise. This enables direct comparison of GRBMs with deep generative models, improving evaluation protocols in the RBM literature.

Data2vec 2.0: Highly efficient self-supervised learning for vision, speech and text

data2vec 2.0 is a new general self-supervised algorithm built by Meta AI for speech, vision, and text. It is vastly more efficient than its predecessor while improving on that model’s already strong performance: it achieves the same accuracy as the most popular existing self-supervised algorithm for computer vision but trains 16x faster.

A Path Towards Autonomous Machine Intelligence 

How could machines learn as efficiently as humans and animals? How could machines learn to reason and plan? How could machines learn representations of percepts and action plans at multiple levels of abstraction, enabling them to reason, predict, and plan at multiple time horizons? This position paper proposes an architecture and training paradigms with which to construct autonomous intelligent agents. It combines concepts such as a configurable predictive world model, behavior driven by intrinsic motivation, and hierarchical joint embedding architectures trained with self-supervised learning.

Linear algebra with transformers

Transformers can learn to perform numerical computations from examples only. This paper studies nine problems of linear algebra, from basic matrix operations to eigenvalue decomposition and inversion, and introduces and discusses four encoding schemes to represent real numbers. On all problems, transformers trained on sets of random matrices achieve high accuracies (over 90%). The models are robust to noise, and can generalize out of their training distribution. In particular, models trained to predict Laplace-distributed eigenvalues generalize to different classes of matrices: Wigner matrices or matrices with positive eigenvalues. The reverse is not true.
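
Since a transformer consumes tokens, real numbers must first be turned into token sequences. The sketch below shows one plausible scheme of this kind (a sign token, a fixed number of mantissa digits, and an exponent token). It is an illustrative assumption on my part and should not be read as an exact reproduction of any of the paper's four encodings; the function names and the `digits` parameter are hypothetical.

```python
def encode_number(x, digits=3):
    """Encode a float as [sign, d1, ..., dn, exponent] tokens."""
    sign = "-" if x < 0 else "+"
    # Scientific notation with `digits` significant digits, e.g. '3.14e+00'.
    mantissa, exponent = f"{abs(x):.{digits - 1}e}".split("e")
    digit_tokens = list(mantissa.replace(".", ""))
    # Shift the exponent so the mantissa can be read as an integer.
    exp = int(exponent) - (digits - 1)
    return [sign] + digit_tokens + [f"E{exp}"]


def decode_number(tokens):
    """Invert encode_number (up to the rounding done when encoding)."""
    sign = 1 if tokens[0] == "+" else -1
    mantissa = int("".join(tokens[1:-1]))
    exponent = int(tokens[-1][1:])
    return sign * mantissa * 10.0 ** exponent


tokens_pi = encode_number(3.14)
tokens_small = encode_number(-0.05)
```

A matrix would then be represented as the concatenation of the token sequences of its entries, and `digits` trades sequence length against numerical precision.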

Guided Semi-Supervised Non-Negative Matrix Factorization

Classification and topic modeling are popular techniques in machine learning that extract information from large-scale datasets. By incorporating a priori information such as labels or important features, methods have been developed to perform classification and topic modeling tasks; however, most methods that can perform both do not allow for the guidance of the topics or features. This paper proposes a novel method, namely Guided Semi-Supervised Non-negative Matrix Factorization (GSSNMF), that performs both classification and topic modeling by incorporating supervision from both pre-assigned document class labels and user-designed seed words.

Learn more about these trending data science research topics at ODSC East

The above list of data science research topics is quite broad, spanning new developments and future outlooks in machine/deep learning, NLP, and more. If you want to learn how to work with the above new tools, pick up strategies for getting into research yourself, and meet some of the innovators behind modern data science research, then be sure to check out ODSC East this May 9th-11th. Act soon, as tickets are currently 70% off!

Daniel Gutierrez, ODSC

Daniel D. Gutierrez is a practicing data scientist who’s been working with data since long before the field came into vogue. As a technology journalist, he enjoys keeping a pulse on this fast-paced industry. Daniel is also an educator, having taught data science, machine learning, and R classes at the university level. He has authored four computer industry books on database and data science technology, including his most recent title, “Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R.” Daniel holds a BS in Mathematics and Computer Science from UCLA.