Top Machine Learning Research for the Second Half of 2021 Top Machine Learning Research for the Second Half of 2021
As 2021 draws to a close, I’m energized by all the amazing work completed by many prominent research groups extending the... Top Machine Learning Research for the Second Half of 2021

As 2021 draws to a close, I’m energized by all the amazing work completed by many prominent research groups extending the state of machine learning in a variety of important directions. In this article, I’ll keep you up to date with my top picks of machine learning research papers for the last half of 2021 that I found particularly compelling and useful. Through my effort to stay current with the field’s research advancement, I found the directions represented in these machine learning research papers to be very promising. I hope you enjoy my selections as much as I have. (Check my recent lists from 2019, 2020, and 2021).

The Causal Loss: Driving Correlation to Imply Causation

Most algorithms in classical and contemporary machine learning focus on correlation-based dependence between features to drive performance. Although success has been observed in many relevant problems, these algorithms fail when the underlying causality is inconsistent with the assumed relations. This machine learning research paper proposes a novel model-agnostic loss function called Causal Loss that improves the interventional quality of the prediction using an intervened neural-causal regularizer. In support of the theoretical results, the experimental illustration shows how causal loss bestows a non-causal associative model (like a standard neural net or decision tree) with interventional capabilities.

Instance-Conditioned GAN

Generative Adversarial Networks (GANs) can generate near photo-realistic images in narrow domains such as human faces. Yet, modeling complex distributions of datasets such as ImageNet and COCO-Stuff remains challenging in unconditional settings. This machine learning research paper takes inspiration from kernel density estimation techniques and introduces a non-parametric approach to modeling distributions of complex datasets. The data manifold is partitioned into a mixture of overlapping neighborhoods described by a data point and its nearest neighbors, and a model is introduced called instance-conditioned GAN (IC-GAN), which learns the distribution around each data point. Experimental results on ImageNet and COCO-Stuff show that IC-GAN significantly improves over unconditional models and unsupervised data partitioning baselines. It’s also shown that IC-GAN can effortlessly transfer to datasets not seen during training by simply changing the conditioning instances, and still generate realistic images. Source code can be found HERE


Fairness without Imputation: A Decision Tree Approach for Fair Prediction with Missing Values

This paper investigates the fairness concerns of training a machine learning model using data with missing values. Even though there are a number of fairness intervention methods in the literature, most of them require a complete training set as input. In practice, data can have missing values, and data missing patterns can depend on group attributes (e.g. gender or race). Simply applying off-the-shelf fair learning algorithms to an imputed dataset may lead to an unfair model. This paper theoretically analyzes different sources of discrimination risks when training with an imputed dataset. An integrated approach is proposed based on decision trees that does not require a separate process of imputation and learning. Instead, a tree is trained with missing incorporated as attribute (MIA), which does not require explicit imputation, and a fairness-regularized objective function is optimized. The approach outperforms existing fairness intervention methods applied to an imputed dataset, through several experiments on real-world datasets.

Merlion: A Machine Learning Library for Time Series

This paper introduces Merlion, an open-source machine learning library for time series. It features a unified interface for many commonly used models and datasets for anomaly detection and forecasting on both univariate and multivariate time series, along with standard pre/post-processing layers. It has several modules to improve ease-of-use, including visualization, anomaly score calibration to improve interpetability, AutoML for hyperparameter tuning and model selection, and model ensembling. Merlion also provides a unique evaluation framework that simulates the live deployment and re-training of a model in production. This library aims to provide engineers and researchers a one-stop solution to rapidly develop models for their specific time series needs and benchmark them across multiple time series datasets. The paper highlights Merlion’s architecture and major functionalities, and a report of benchmark numbers across different baseline models and ensembles.

A Machine Learning Pipeline to Examine Political Bias with Congressional Speeches

Computational methods to model political bias in social media involve several challenges due to heterogeneity, high-dimensional, multiple modalities, and the scale of the data. Political bias in social media has been studied in multiple viewpoints like media bias, political ideology, echo chambers, and controversies using machine learning pipelines. Most of the current methods rely heavily on the manually-labeled ground-truth data for the underlying political bias prediction tasks. Limitations of such methods include human-intensive labeling, labels related to only a specific problem, and the inability to determine the near future bias state of a social media conversation. This machine learning research paper addresses such problems and gives machine learning approaches to study political bias in two ideologically diverse social media forums: Gab and Twitter without the availability of human-annotated data. 

Sketch Your Own GAN

Can a user create a deep generative model by sketching a single example? Traditionally, creating a GAN model has required the collection of a large-scale data set of exemplars and specialized knowledge in deep learning. In contrast, sketching is possibly the most universally accessible way to convey a visual concept. This machine learning research paper presents a method, GAN Sketching, for rewriting GANs with one or more sketches, to make GANs training easier for novice users. In particular, the weights of an original GAN model are changed according to user sketches. The model’s output is encouraged to match the user sketches through a cross-domain adversarial loss. Furthermore, different regularization methods are explored to preserve the original model’s diversity and image quality. Experiments have shown that this method can mold GANs to match shapes and poses specified by sketches while maintaining realism and diversity. Source code can be found HERE

Interpretable Propaganda Detection in News Articles

Online users today are exposed to misleading and propagandistic news articles and media posts on a daily basis. To counter this, a number of approaches have been designed aiming to achieve healthier and safer online news and media consumption. Automatic systems are able to support humans in detecting such content; yet, a major impediment to their broad adoption is that besides being accurate, the decisions of such systems need also to be interpretable in order to be trusted and widely adopted by users. Since misleading and propagandistic content influences readers through the use of a number of deception techniques, this machine learning research paper proposes to detect and to show the use of such techniques as a way to offer interpretability. 

Man versus Machine: AutoML and Human Experts’ Role in Phishing Detection

Machine learning (ML) has developed rapidly in the past few years and has successfully been utilized for a broad range of tasks, including phishing detection. However, building an effective ML-based detection system is not a trivial task, and requires data scientists with knowledge of the relevant domain. Automated Machine Learning (AutoML) frameworks have received a lot of attention in recent years, enabling non-ML experts in building a machine learning model. This brings to an intriguing question of whether AutoML can outperform the results achieved by human data scientists. This machine learning research paper compares the performances of six well-known, state-of-the-art AutoML frameworks on ten different phishing data sets to see whether AutoML-based models can outperform manually crafted machine learning models. The results indicate that AutoML-based models are able to outperform manually developed machine learning models in complex classification tasks, specifically in data sets where the features are not quite discriminative, and datasets with overlapping classes or relatively high degrees of non-linearity. 

Learning with Multiclass AUC: Theory and Algorithms

The Area under the ROC curve (AUC) is a well-known ranking metric for problems such as imbalanced learning and recommender systems. The vast majority of existing AUC-optimization-based machine learning methods only focus on binary-class cases, while leaving the multiclass cases unconsidered. This machine learning research paper starts an early trial to consider the problem of learning multiclass scoring functions via optimizing multiclass AUC metrics. Our foundation is based on the M metric, which is a well-known multiclass extension of AUC. The paper pays a revisit to this metric, showing that it could eliminate the imbalance issue from the minority class pairs. Motivated by this, it proposes an empirical surrogate risk minimization framework to approximately optimize the M metric. Theoretically, it is shown that: (i) optimizing most of the popular differentiable surrogate losses suffices to reach the Bayes optimal scoring function asymptotically; (ii) the training framework enjoys an imbalance-aware generalization error bound, which pays more attention to the bottleneck samples of minority classes compared with the traditional O(√(1/N)) result. Practically, to deal with the low scalability of the computational operations, acceleration methods are proposed for three popular surrogate loss functions, including the exponential loss, squared loss, and hinge loss, to speed up loss and gradient evaluations. Finally, experimental results on 11 real-world data sets demonstrate the effectiveness of our proposed framework.

ChainerRL: A Deep Reinforcement Learning Library 

This machine learning research paper introduces ChainerRL, an open-source deep reinforcement learning (DRL) library built using Python and the Chainer deep learning framework. ChainerRL implements a comprehensive set of DRL algorithms and techniques drawn from state-of-the-art research in the field. To foster reproducible research, and for instructional purposes, ChainerRL provides scripts that closely replicate the original machine learning research papers’ experimental settings and reproduce published benchmark results for several algorithms. Lastly, ChainerRL offers a visualization tool that enables the qualitative inspection of trained agents. The ChainerRL source code can be found HERE.

Subspace Clustering through Sub-Clusters 

The problem of dimension reduction is of increasing importance in modern data analysis. This machine learning research paper considers modeling the collection of points in a high dimensional space as a union of low dimensional subspaces. In particular, a highly scalable sampling based algorithm is proposed that clusters the entire data via first spectral clustering of a small random sample followed by classifying or labeling the remaining out-of-sample points. The key idea is that this random subset borrows information across the entire dataset and that the problem of clustering points can be replaced with the more efficient problem of “clustering sub-clusters”. Theoretical guarantees for the procedure are provided. The numerical results indicate that for large datasets the proposed algorithm outperforms other state-of-the-art subspace clustering algorithms with respect to accuracy and speed.

LassoNet: Neural Networks with Feature Sparsity

Much work has been done recently to make neural networks more interpretable, and one approach is to arrange for the network to use only a subset of the available features. In linear models, Lasso (or ℓ1-regularized) regression assigns zero weights to the most irrelevant or redundant features, and is widely used in data science. However the Lasso only applies to linear models. Here we introduce LassoNet, a neural network framework with global feature selection. The approach outlined in this machine learning research paper achieves feature sparsity by allowing a feature to participate in a hidden unit only if its linear representative is active. Unlike other approaches to feature selection for neural nets, the method uses a modified objective function with constraints, and so integrates feature selection with the parameter learning directly. As a result, it delivers an entire regularization path of solutions with a range of feature sparsity. In experiments with real and simulated data, LassoNet significantly outperforms state-of-the-art methods for feature selection and regression. The LassoNet method uses projected proximal gradient descent, and generalizes directly to deep networks. It can be implemented by adding just a few lines of code to a standard neural network.

Interpretable Random Forests via Rule Extraction 

This machine learning research paper introduces SIRUS (Stable and Interpretable RUle Set) for regression, a stable rule learning algorithm, which takes the form of a short and simple list of rules. State-of-the-art learning algorithms are often referred to as “black boxes” because of the high number of operations involved in their prediction process. Despite their powerful predictivity, this lack of interpretability may be highly restrictive for applications with critical decisions at stake. On the other hand, algorithms with a simple structure—typically decision trees, rule algorithms, or sparse linear models—are well known for their instability. This undesirable feature makes the conclusions of the data analysis unreliable and turns out to be a strong operational limitation. This motivates the design of SIRUS, based on random forests, which combines a simple structure, a remarkable stable behavior when data is perturbed, and an accuracy comparable to its competitors. The efficiency of the method both empirically (through experiments) and theoretically (with the proof of its asymptotic stability) is demonstrated. An R/C++ software implementation sirus is available from CRAN.

KML – Using Machine Learning to Improve Storage Systems

Operating systems include many heuristic algorithms designed to improve overall storage performance and throughput. Because such heuristics cannot work well for all conditions and workloads, system designers resorted to exposing numerous tunable parameters to users – essentially burdening users with continually optimizing their own storage systems and applications. Storage systems are usually responsible for most latency in I/O heavy applications, so even a small overall latency improvement can be significant. Machine learning (ML) techniques promise to learn patterns, generalize from them, and enable optimal solutions that adapt to changing workloads. This machine learning research paper proposes that ML solutions become the first-class component in OSs and replace manual heuristics to optimize storage systems dynamically. The machine learning research paper describes a proposed ML architecture, called KML. The researchers developed a prototype KML architecture and applied it to two problems: optimal readahead and NFS read-size values. The experiments show that KML consumes little OS resources, adds negligible latency, and yet can learn patterns that can improve I/O throughput by as much as 2.3x or 15x for the two use cases respectively – even for complex, never-before-seen, concurrently running mixed workloads on different storage devices.

Feature selection or extraction decision process for clustering using PCA and FRSD

This machine learning research paper concerns the critical decision process of extracting or selecting the features before applying a clustering algorithm. It is not obvious to evaluate the importance of the features since the most popular methods to do it are usually made for a supervised learning technique process. A clustering algorithm is an unsupervised method. It means that there is no known output label to match the input data. This machine learning research paper proposes a new method to choose the best dimensionality reduction method (selection or extraction) according to the data scientist’s parameters, aiming to apply a clustering process at the end. It uses Feature Ranking Process Based on Silhouette Decomposition (FRSD) algorithm, a Principal Component Analysis (PCA) algorithm, and a K-Means algorithm along with its metric, the Silhouette Index (SI). This machine learning research paper presents 5 use cases based on a smart city dataset. This research also aims to discuss the impacts, advantages, and disadvantages of each choice that can be made in this unsupervised learning process.

Learn more about Machine Learning and Machine Learning Research at ODSC East 2022

At our upcoming event this April 19th-21st in Boston, MA, ODSC East 2022, you’ll be able to learn from the leaders in machine learning to hear more about all of the topics above. Register now to learn more about machine learning research, deep learning, NLP, ML for cybersecurity, and so on. Tickets are 70% off for a limited time, so register now before prices go up soon.

Current machine learning and machine learning research talks include:

  • Need of Adaptive Ethical ML models in the post-pandemic era: Sharmistha Chatterjee/Senior Manager Data Sciences & Juhi Pandey/Senior Data Scientist | Publicis Sapient
  • AI Observability: How To Fix Issues With Your ML Model: Danny D. Leybzon | MLOps Architect | WhyLabs
  • Data Science and Contextual Approaches to Palliative Care Need Prediction: Evie Fowler | Manager/Data Science Product Owner | Highmark Health
  • Demystify the gap between Data Scientist and Business Users: Amir Meimand, PhD | Data Science/ML Solution Engineer | Salesforce

Daniel Gutierrez, ODSC

Daniel D. Gutierrez is a practicing data scientist who’s been working with data long before the field came in vogue. As a technology journalist, he enjoys keeping a pulse on this fast-paced industry. Daniel is also an educator having taught data science, machine learning and R classes at the university level. He has authored four computer industry books on database and data science technology, including his most recent title, “Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R.” Daniel holds a BS in Mathematics and Computer Science from UCLA.