Solving Problems in Machine Learning with scAlign
Editor’s Note: See Nelson’s tutorial on solving problems in machine learning in his talk “Data Harmonization for Generalizable Deep Learning Models: From Learning to Hands-On Tutorial” at ODSC West 2019.

A common problem in machine learning (ML) is learning models that work well (generalize) beyond the training data used to fit them. For example, suppose a photographer takes photos of different cars and trains a classifier to predict the car make and model. Ideally, this classifier would also be able to classify pictures of cars taken by another photographer. However, many classifiers predict the make and model of a car by implicitly or explicitly finding similar cars with known labels in a training dataset. In practice, if we were to draw a low-dimensional embedding of pictures taken by both photographers, photos of the same car (Fig. 1, Volkswagen bug) may be dissimilar (far apart in the embedding space) due to differences in photo style, which ultimately hinders classification performance:

Figure 1: Unaligned images of cars from STL-10, where red and blue indicate two different sets of images (red: labeled, blue: unlabeled). The photos mark a single labeled and a single unlabeled Volkswagen bug.

In this case, the photographer acts as a confounding variable because it leads to wide separation between photos of the same car. Methods termed “domain adaptation” (DA) try to eliminate these confounding effects and bring cars of the same make and model together, regardless of which photographer took the picture.
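
To make the confounding effect concrete, here is a minimal base R sketch. Everything in it is an assumption made for illustration: the 2-D "embeddings", the constant `style_shift`, and the helper `predict_1nn` are hypothetical and are not part of scAlign. Real photographer effects are not a known constant offset, which is why a learned alignment is needed in practice.

```r
# Toy 2-D "embeddings": rows are photos, columns are embedding dimensions.
# All values are made up for illustration.
photos_a <- rbind(c(0, 0), c(0.2, 0.1), c(3, 0), c(3.1, 0.2))
labels_a <- c("beetle", "beetle", "sedan", "sedan")

# Photographer B shoots the same two car types, but a (hypothetical) style
# effect shifts every one of B's photos by the same offset.
style_shift <- c(2, 0)
photos_b <- sweep(rbind(c(0.1, 0), c(2.9, 0.1)), 2, style_shift, "+")
labels_b <- c("beetle", "sedan")

# 1-nearest-neighbor classifier: copy the label of the closest labeled photo.
predict_1nn <- function(train, train_labels, query) {
  apply(query, 1, function(q) {
    d <- sqrt(rowSums(sweep(train, 2, q, "-")^2))
    train_labels[which.min(d)]
  })
}

# Classify B's photos as they are, and again after undoing the known shift.
pred_raw   <- predict_1nn(photos_a, labels_a, photos_b)
pred_fixed <- predict_1nn(photos_a, labels_a,
                          sweep(photos_b, 2, style_shift, "-"))

mean(pred_raw == labels_b)    # 0.5: the shifted beetle lands next to the sedans
mean(pred_fixed == labels_b)  # 1
```

In this toy setup the shift alone is enough to move photographer B's beetle closer to A's sedans than to A's beetles, so the nearest-neighbor lookup returns the wrong label until the shift is removed.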

We have developed a novel deep learning-based domain adaptation approach, called scAlign, which reduces the effect of confounding variables for both unsupervised and supervised learning. We originally developed scAlign to work on biological (genomic) data, but it generalizes to many other kinds of data, such as images. You can read our paper on scAlign here:

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1766-4

scAlign is available on Bioconductor and GitHub as an R package:

https://bioconductor.org/packages/release/bioc/html/scAlign.html

https://github.com/quon-titative-biology/scAlign

In our tutorial, we’ll show you how to use scAlign for both unsupervised and supervised domain adaptation.

Unsupervised domain adaptation: leveraging unlabeled data to improve model performance

In the above example, the goal of domain adaptation is to remove the photographer-specific effects so that cars of the same type group together in the low-dimensional embedding. In the unsupervised case, we assume that we don’t have labels on the photos; we just want to merge the photos of the two photographers into a joint embedding space, where photos group by car make and model, regardless of which photographer took the picture. Unsupervised domain adaptation is useful when large amounts of unlabeled data (e.g. photos) are available for training.

scAlign performs unsupervised DA by learning a shared embedding space for both sets of photos that is constrained to satisfy two properties:

  1. Every photo from photographer A is close to at least one photo from photographer B;
  2. Photos of similar objects reside close together in the embedding space.
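
As a rough illustration of the first property, the base R sketch below measures, for each photo from photographer A, the distance to its nearest photo from photographer B in an embedding. This is only a diagnostic on made-up numbers, not scAlign's training objective; the helper `nearest_cross_dist`, the matrices `emb_a` and `emb_b`, and the cutoff of 1 are all hypothetical.

```r
# For each row of `from`, the Euclidean distance to its nearest row in `to`.
nearest_cross_dist <- function(from, to) {
  apply(from, 1, function(x) {
    min(sqrt(rowSums(sweep(to, 2, x, "-")^2)))
  })
}

# Hypothetical embeddings (rows = photos) for photographers A and B.
emb_a <- rbind(c(0, 0), c(1, 1), c(5, 5))
emb_b <- rbind(c(0, 0.5), c(1, 1.2))

d <- nearest_cross_dist(emb_a, emb_b)

# Photos from A whose nearest B photo is farther than an (arbitrary) cutoff
# do not satisfy property 1 in this embedding.
which(d > 1)  # 3: the third photo from A has no nearby photo from B
```

A well-aligned embedding would leave few or no photos flagged by a check like this, although the appropriate cutoff depends on the scale of the embedding.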

Here is an example code snippet to run scAlign on the above data: 

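library(scAlign)

# scAlignHSC is assumed to be an scAlign object built beforehand.
scAlignHSC = scAlign(scAlignHSC,
                    options=scAlignOptions(steps=5000, log.every=5000),
                    encoder.data="scale.data",
                    decoder.data="logcounts",
                    supervised='none',      # unsupervised: no labels are used
                    run.encoder=TRUE,
                    run.decoder=TRUE,
                    log.dir=file.path("./tmp"),
                    device="CPU")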

 

And here is a visualization of the embeddings of the photos after scAlign-based domain adaptation (right) as compared to the unaligned photos (left):

Now we can train a classifier (machine learning model) on these embeddings that predicts car make and model, regardless of which photographer took the photo. Because the effect of confounding is reduced, the generalization performance of this classifier should be higher. Similarly, we could train a classifier on a large database of images, then use it to classify images taken by a new photographer.
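
As a minimal sketch of this step, the base R code below fits a nearest-centroid classifier on labeled embeddings and applies it to embeddings of new photos. The classifier choice, the toy values, and the helper names `fit_centroids` and `predict_centroid` are assumptions made for illustration; any classifier could be trained on the aligned embeddings instead.

```r
# Hypothetical aligned embeddings for labeled photos (rows = photos).
train_emb <- rbind(c(0, 0), c(0.2, 0.2), c(3, 3), c(3.2, 2.8))
train_lab <- c("beetle", "beetle", "sedan", "sedan")

# Fit: one centroid (mean embedding) per class.
fit_centroids <- function(emb, lab) {
  sums <- rowsum(emb, lab)                      # per-class column sums
  sums / as.vector(table(lab)[rownames(sums)])  # divide by class counts
}

# Predict: assign each query photo to the class with the closest centroid.
predict_centroid <- function(centroids, query) {
  apply(query, 1, function(q) {
    d <- rowSums(sweep(centroids, 2, q, "-")^2)
    rownames(centroids)[which.min(d)]
  })
}

centroids <- fit_centroids(train_emb, train_lab)

# Hypothetical embeddings of photos taken by a new photographer.
new_emb <- rbind(c(0.1, 0.3), c(2.9, 3.1))
pred <- predict_centroid(centroids, new_emb)
pred  # "beetle" "sedan"
```

This only works as intended if the new photos have been mapped into the same aligned embedding space as the training photos.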

Supervised domain adaptation: improving classification performance with inclusion of unlabeled data

scAlign also implements supervised DA, where we assume at least some of the photos are labeled. Adding labels to domain adaptation in scAlign can further improve performance, and in the call below the only change is the supervised argument:

# scAlignHSC is assumed to be an scAlign object built beforehand.
scAlignHSC = scAlign(scAlignHSC,
                    options=scAlignOptions(steps=5000, log.every=5000),
                    encoder.data="scale.data",
                    decoder.data="logcounts",
                    supervised='source',    # use labels from the source dataset
                    run.encoder=TRUE,
                    run.decoder=TRUE,
                    log.dir=file.path("./tmp"),
                    device="CPU")

 

To be continued at the ODSC West 2019 scAlign tutorial:

Our tutorial at ODSC West will showcase how to use scAlign to perform supervised and unsupervised domain adaptation for different types of data. It will also cover how to perform “differential feature analysis”, where we extract important data features that distinguish data from different sources (in this case, for example, features of photo style specific to each photographer).

Nelson Johansen

Nelson is a PhD student in the Department of Computer Science. His current research interests include learning the relationship between gene expression, physiological and morphological properties defining cell state in order to identify abnormalities underlying complex diseases. His previous research focused on the integration of scRNA data from multiple studies and differential expression at the resolution of individual cells.
