The following Q&A is part of a series of interviews conducted with speakers at the 2017 ODSC East conference in Boston. This interview is with Aedin Culhane, Computational Biologist at the Dana-Farber Cancer Institute, whose talk was entitled “R and Bioconductor in Cancer Research – Big Data in Genomics”. The transcript has been condensed and edited for clarity.
What was your talk about?
My talk was about how we use open source tools and methods in the R/Bioconductor project to analyze genomic data. I work at the Dana-Farber Cancer Institute, and we analyze genomics data to discover biomarkers and predictive models in cancer.
Cool. What were some of the questions people asked?
Immediately after the talk, one person asked about one of the datasets I presented and how we’d corrected for batch effects, which is obviously a big issue in any data analysis. A second person asked about semantic analysis and whether we had used text learning.
Can you expand more about the applications of data science within this field? Describe how you use machine learning.
We use DNA sequencing and other technologies to quantify the molecular profiles of a cell. These data give us an understanding of which biological pathways are essential in a normal cell, and which biological pathways are perturbed to allow a cancer cell to survive, thrive and progress to malignancy.
The technology and tools have advanced dramatically in the past five years. DNA sequencing technologies are cheaper, so we can now sequence many more tumors, and we have much more data than we ever had before. I’m working with molecular profiles of 10,000 tumors as part of The Cancer Genome Atlas, the TCGA project. In that study, we have multiple datasets that measure different molecules in the cell, including DNA, RNA, microRNA, methylation and proteomics. We need to integrate these molecular data, link them to the clinical data, and find out which pathways are perturbed in disease. For example, what’s the difference between the patients who respond and those who don’t respond to therapy?
Our data science tools range from software and methods for basic processing of the data to those that are specific to our field. We apply standard exploratory data analysis approaches, from unsupervised analysis, such as principal components analysis and clustering, to supervised machine learning. We have a lot of tools that are specific to the field, for example, those that map gene/protein features to the genome, or map genes to pathways or other biological databases. There’s a lot of biological information in the published literature, so we can validate results using other public datasets and perform aggregate or meta-analysis among datasets. We also apply supervised classification algorithms, linear models, penalized regressions, and different methods for sparse data processing. The latter is important when dealing with some types of data; for example, single-cell data can be sparse.
Like a sparse matrix?
Yes, sparse matrices. Some can be large. For example, 10x Genomics just released a dataset that quantified gene expression in about 1.3 million cells. It measured 20,000 to 30,000 genes, so it’s roughly a 30,000 by 1.3 million matrix, but it’s really sparse.
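To make the sparsity point concrete, here is a minimal Python sketch (an illustration, not how Bioconductor stores these data). It works out what a dense layout of the matrix shape mentioned above would cost, then stores a toy matrix as non-zero entries only. The 4-byte cell size, the helper names and the toy counts are assumptions made for the example.

```python
# Dense storage cost for the matrix shape mentioned in the interview.
# Assumption: each count is stored as a 4-byte integer.
N_GENES = 30_000
N_CELLS = 1_300_000
BYTES_PER_VALUE = 4

dense_bytes = N_GENES * N_CELLS * BYTES_PER_VALUE
dense_gigabytes = dense_bytes / 1e9  # decimal gigabytes


def to_sparse(dense_rows):
    """Keep only the non-zero entries as {(row, col): value}."""
    return {
        (i, j): value
        for i, row in enumerate(dense_rows)
        for j, value in enumerate(row)
        if value != 0
    }


def sparsity(dense_rows):
    """Fraction of entries that are zero."""
    total = sum(len(row) for row in dense_rows)
    nonzero = len(to_sparse(dense_rows))
    return 1 - nonzero / total


# Toy genes-by-cells count matrix (hypothetical values).
toy_counts = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 7],
]
toy_sparse = to_sparse(toy_counts)

print(f"dense size: {dense_gigabytes:.0f} GB")
print(f"toy non-zero entries: {len(toy_sparse)} of 12")
print(f"toy sparsity: {sparsity(toy_counts):.2f}")
```

Under these assumptions the dense layout needs about 156 GB, which is why only the non-zero counts are worth storing when most entries are zero.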
That’s my next question about the big data aspect. Could you translate that into gigabytes or terabytes?
The TCGA data is about 2.5 petabytes. The raw 10x Genomics 1.3 million cell single cell gene expression dataset is about 3.6 terabytes.
The raw data we work with are files called FASTQ files. These text files contain short sequences of a genome, that is, the biological sequence of molecules (nucleotides) that form little pieces of DNA. We call these short reads of DNA. We then align or match these short reads to the 6 billion base pairs of nucleotides in the genome. The aligned reads are stored in BAM files, which are compressed and smaller in size. We then count how many reads match at each position. We look at the differences between each read and the reference genome, because differences may indicate a mutation or change in the expected DNA sequence. With DNA sequencing of tumors, we also ask how many reads map to each position. If there are more or fewer than expected in a normal cell (where we expect 2 copies, one from mom and one from dad), we count these as amplifications or deletions. These count matrices provide a portrait of changes in the genome of a cancer cell. All three file types carry the data, but they shrink at each step: FASTQ files are larger than BAM files, which are larger than count matrices.
However, we examine many types of molecules in the cells. Proteins in the cell are essential for cell function and are created using the instructions in DNA. The cells transcribe a copy from DNA in the form of mRNA to make a protein. Therefore, changes in DNA can modify mRNA or proteins. Changes in a protein’s function can perturb the normal network of biological pathways, essentially throwing the cell “engine out of tune”. Therefore, we look at the RNA sequencing data and proteomics in addition to the DNA sequencing and seek to find the pathways that are modified. We may be able to target these cancer-specific pathways to develop better medicines to directly target cancer. Each of these steps, DNA sequencing (which can be all of the genome, or just the transcribed portion, the exons), RNA sequencing and proteomics, generates data, and the integration of all of these can be computationally complex.
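The read-to-count pipeline described above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: the reference, the reads and the function names are hypothetical, and each read is placed at its first exact match, whereas real aligners tolerate mismatches and work on billions of bases. The only format detail relied on is that a FASTQ record spans four lines (identifier, sequence, separator, quality).

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            return
        sequence = handle.readline().strip()
        handle.readline()  # the '+' separator line
        quality = handle.readline().strip()
        yield header[1:], sequence, quality


def coverage(reference, reads):
    """Count how many reads cover each reference position.

    Toy alignment: a read is placed at its first exact match. This is an
    assumption made for brevity; real aligners tolerate mismatches.
    """
    depth = [0] * len(reference)
    for _, sequence, _ in reads:
        start = reference.find(sequence)
        if start == -1:
            continue  # read does not align, so it is skipped
        for position in range(start, start + len(sequence)):
            depth[position] += 1
    return depth


# Hypothetical reference and reads, for illustration only.
REFERENCE = "ACGTTGCAAGCT"
FASTQ_TEXT = """@read1
ACGTT
+
IIIII
@read2
GTTGC
+
IIIII
@read3
TTTTT
+
IIIII
"""

reads = list(read_fastq(io.StringIO(FASTQ_TEXT)))
depth = coverage(REFERENCE, reads)
print(depth)
```

The resulting per-position depth is the kind of count data the answer refers to: positions covered by more or fewer reads than expected are the candidates for amplifications or deletions.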
Ok, got it.
The read data (FASTQ) files are large. The aligned data are about 500 GB each. The count data are smaller. Depending on the analysis we’re doing, we’re dealing with files of different sizes.
Describe the history and the mission of the Bioconductor project.
The Bioconductor project (http://www.bioconductor.org) started just over 10 years ago. It’s one of the largest projects within R and is entirely open source, open development. It has over 1,300 packages, and its mission is to provide software for the statistical and bioinformatic analysis of genomics data (DNA, RNA, sequencing, proteomics, etc).
Bioconductor has data importers that make it very easy to bring genomics data into R. For example, if you wish to download a genomics dataset from the NCBI GEO repository, the command GEOquery::getGEO will load it into R in Bioconductor format.
Bioconductor has developed R objects and classes that are useful and functional for bioinformatics and genomics data. In Bioconductor, data are typically stored in an object that is an S4 class. The sample-level data, or phenotype data, are stored together with the data matrix. We also have many packages that support annotation of features (genes, proteins, probes, etc.). This triplet of information (the feature-level data, the sample-level information and the data matrix) can be retained together in an S4 class, such that if one subsets the data, one retains the associated phenotype and feature data. This avoids the errors sometimes seen in other software (such as Excel) when the annotation and data become disconnected.
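The idea of keeping the data matrix and its two sets of annotation locked together can be illustrated with a small Python analogy. This is not Bioconductor's S4 implementation: the class name `AnnotatedMatrix`, its fields and the example values are all hypothetical, and the sketch only shows why subsetting through one object keeps everything in sync.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedMatrix:
    """Hypothetical container: rows are features, columns are samples."""
    values: list    # values[i][j] = measurement of feature i in sample j
    features: list  # one annotation dict per row
    samples: list   # one annotation dict per column

    def __post_init__(self):
        # Refuse to build an object whose annotation does not match the data.
        if len(self.values) != len(self.features):
            raise ValueError("one feature annotation is needed per row")
        if any(len(row) != len(self.samples) for row in self.values):
            raise ValueError("one sample annotation is needed per column")

    def subset(self, feature_idx=None, sample_idx=None):
        """Return a new object with data and annotations subset together.

        Indices are expected as lists (or ranges) of integer positions.
        """
        rows = range(len(self.features)) if feature_idx is None else feature_idx
        cols = range(len(self.samples)) if sample_idx is None else sample_idx
        return AnnotatedMatrix(
            values=[[self.values[i][j] for j in cols] for i in rows],
            features=[self.features[i] for i in rows],
            samples=[self.samples[j] for j in cols],
        )


# Hypothetical example data.
data = AnnotatedMatrix(
    values=[[5.1, 0.2, 3.3], [1.0, 4.4, 0.0]],
    features=[{"gene": "TP53"}, {"gene": "BRCA1"}],
    samples=[
        {"id": "s1", "response": "yes"},
        {"id": "s2", "response": "no"},
        {"id": "s3", "response": "yes"},
    ],
)

# Select responders; the matrix columns and sample annotation move together.
responders = [j for j, s in enumerate(data.samples) if s["response"] == "yes"]
subset = data.subset(sample_idx=responders)
print([s["id"] for s in subset.samples], subset.values)
```

Because the only way to subset is through the object, a column can never end up paired with the wrong sample label, which is the failure mode described for disconnected spreadsheets.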
Do you use Neural Networks frequently?
I do some supervised classification with neural nets. I’ve also used support vector machines and other classification algorithms. At the moment, much of my work is exploratory analysis, because we’re trying to find subtypes. We don’t know the “labels”. We don’t have that gold standard or a good training dataset to train a supervised classifier.
So it’s more unsupervised work?
A lot of what I’m doing at the moment is trying to identify subtypes that span across cancers, and currently, that is very much unsupervised analysis.
Okay. Is it more so of a dimensionality reduction? Or are you doing clustering?
In terms of clustering, how do you know when you’ll be able to make the move from unsupervised to supervised, when you can say, “Okay, we have this number of clusters from the data, and we think these clusters are indicative of the structure of the data”? When do you make that jump? Or how far away do you think you are from making that jump from unsupervised to supervised?
When we find clusters that we can reproduce, we ensure they are not associated with a batch effect or a reproducible technical artifact (for example, the version of an instrument). Clusters need to be robust and biologically meaningful. We test clusters to identify those associated with clinical outcome, as these are most likely to make a meaningful impact on patients. When validating clusters, I like to first use the Prediction Strength algorithm from Tibshirani and Walther. However, on finding clusters, we need to generate more data in the lab, so that we have an independent validation set in addition to cross-validation within the dataset itself. We rarely rely on one dataset, no matter how large it is. We always seek validation on multiple independent datasets. Sometimes an additional validation dataset is available from our research groups, but I’m also working with clinicians and lab scientists to actually generate those data. We have published several articles and approaches for validating biomarkers and clusters.
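The core calculation behind prediction strength can be sketched as follows. This is a simplified Python reading of the idea, not the authors' reference implementation: it assumes the clustering has already been done, and takes as input two labelings of the same test samples, one from clustering the test data directly and one from assigning test samples using clusters learned on the training data. The function name and the choice to skip single-member clusters are assumptions.

```python
from itertools import combinations


def prediction_strength(test_labels, train_based_labels):
    """Minimum, over test clusters, of the fraction of sample pairs that the
    training-derived clustering also places together."""
    clusters = {}
    for index, label in enumerate(test_labels):
        clusters.setdefault(label, []).append(index)

    strengths = []
    for members in clusters.values():
        if len(members) < 2:
            continue  # assumption: clusters with no pairs are skipped
        pairs = list(combinations(members, 2))
        together = sum(
            train_based_labels[a] == train_based_labels[b] for a, b in pairs
        )
        strengths.append(together / len(pairs))
    if not strengths:
        raise ValueError("no test cluster has two or more members")
    return min(strengths)


# Hypothetical labelings of six test samples.
test_labels = [0, 0, 0, 1, 1, 1]
agreeing = [0, 0, 0, 1, 1, 1]     # training-derived clusters match exactly
disagreeing = [0, 0, 1, 1, 1, 1]  # one sample is assigned differently

print(prediction_strength(test_labels, agreeing))
print(prediction_strength(test_labels, disagreeing))
```

A value near 1 means the clusters found in one half of the data are recovered in the other half; taking the minimum over clusters means a single unstable cluster is enough to lower the score.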
Yeah, so it sounds like you have to be extra, like 120% sure that a cluster is meaningful, I’m guessing.
If it’s not going to translate to the clinic, if it’s not going to translate to patients, if it’s not going to advance our understanding of biology, then it’s no use to us. We need something that will actually inform where we’re going, as scientists, in the cancer biology field.
How can you be sure that something will inform or not inform your field?
In addition to employing statistical rigor in analysis, we use our a priori understanding of biology and medicine.
We are more confident in results if we detect a noticeable difference between biological pathways, that is between groups of genes, and if there are some associations supported by the published literature. However, we are less confident in our discovery if we identify a random assortment of genes that are not described in the literature or have few known biological connections.
In biology, genes and proteins function together in pathways in a network. They work together in processes, and these coordinated processes differ between cell types, such that cells in your mouth are doing something different from cells in your liver, your immune cells, or your neurons. Therefore, we can borrow information across the network. We typically use approaches that extend gene set enrichment analysis, where we test the associations of groups of features (genes) between groups of cases (tumors). We recently described a method (moGSA) that integrates multiple datasets using matrix factorization, and we test the groups of features in that space.
Another consideration is that only some biological discoveries can be translated to a therapeutic in the clinic. Some genes are not easily “druggable”, meaning it is difficult to generate small molecules that are active against them. Additionally, some genes have the potential to generate adverse side effects if targeted.
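As a rough illustration of testing a group of genes rather than single genes, here is a minimal over-representation test in Python based on the hypergeometric distribution. To be clear, this is a simpler relative of gene set enrichment analysis and is not the moGSA method mentioned above. The gene names, set sizes and function name are hypothetical.

```python
from math import comb


def enrichment_p_value(universe_size, pathway_size, selected_size, overlap):
    """P(X >= overlap) when selected_size genes are drawn without replacement
    from universe_size genes, pathway_size of which belong to the pathway."""
    upper = min(pathway_size, selected_size)
    total = comb(universe_size, selected_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, selected_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 20 measured genes, a 5-gene pathway, and 4 genes
# that differ between two groups of tumors.
universe = {f"gene{i}" for i in range(20)}
pathway = {"gene0", "gene1", "gene2", "gene3", "gene4"}
selected = {"gene0", "gene1", "gene2", "gene15"}

overlap = len(pathway & selected)
p_value = enrichment_p_value(len(universe), len(pathway), len(selected), overlap)
print(f"overlap = {overlap}, p = {p_value:.4f}")
```

A small p-value says the selected genes fall in the pathway more often than a random draw of the same size would, which is one way to distinguish a coherent pathway signal from a random assortment of genes.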
Okay. What is the short-term outlook for the Bioconductor project? The main short-term goals you want to achieve and then switch to the long-term vision of it, as well.
An immediate short-term goal is an event. Our annual meeting is in Boston from the 26th to the 28th of July at the Dana-Farber Cancer Institute. Developer Day is on the 26th, and the main meeting is on the 27th and 28th of July. Some great speakers are coming. We will introduce new networking events for Bioconductor developers. We will be doing “birds of a feather” meetings, where people can gather to discuss topics of interest, in addition to the workshops, talks, posters and social events that we normally do.
Intermediate-term goals include making it easier to access and work with Bioconductor when performing analyses in the cloud. It is already possible to work with R in the cloud or run Google BigQuery queries, but we wish to make it simpler. This is particularly important given that our datasets are increasing in size. I mentioned that 10x Genomics just released a dataset of 1.3 million cells. These data are big and sparse. We need to develop infrastructure within Bioconductor to analyze single-cell sequencing data more efficiently. We need to develop classes, algorithms, and tools to support analysis of these data. Biology and genomics are advancing at a rapid pace, and Bioconductor is continually adapting to meet those needs.