Editor’s note: Alex is a presenter for ODSC East 2019 this April 30 – May 3! Be sure to check out his talk, “From the New York Times to NASA: How Text Analysis Saves Lives.”
Machine learning techniques are driving disruptive change across disparate fields in engineering. A parallel and equally bold revolution is occurring in information science. As the Internet continues to intensify the density of information we are exposed to, advancements in information science are crucial for our ability to make informed decisions, approach new fields of knowledge, and find truth amid misinformation.
First, what is “information science”? Put simply, it is the study of how to organize information. Can we better classify documents to improve search? Can we extract facts from articles to weed out fake news? Can we pose questions in real time to a student reading a textbook?
I’ve encountered many such problems in my work. I worked as a data scientist at the New York Times, where we explored novel ways of recommending news to readers and predicting emotion in articles. I moved to Microsoft Research, where we investigated fake news and Russian misinformation. I am now working with the National Aeronautics and Space Administration (NASA) to make it easier for scientists to find collaborators, relevant datasets and methodologies.
Information science, as a field, is both old and new: some of the earliest computer algorithms sought to resolve ambiguities in information. Soundex, for example, is an early algorithm for matching names that are spelled differently but pronounced alike (e.g., “Hanna” and “Hannah”). It was patented in 1918, and used in 1930 by the U.S. Census Bureau. Advances in record linkage in the late 1960s built on emerging theories in Bayesian modeling, and helped clean some of the world’s earliest databases. And yet the field is booming, with the emergence of novel machine learning approaches offering exciting possibilities.
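To make the Soundex idea concrete, here is a minimal Python sketch of the commonly described American Soundex rules: keep the first letter, map the remaining consonants to digits, collapse repeated digits, and pad to four characters. It is an illustration under those assumptions, not the exact procedure from the 1918 patent or the Census Bureau's implementation, and the function name `soundex` is chosen here for illustration.

```python
# Consonant groups that share a digit under the common American Soundex rules.
GROUPS = {
    "1": "bfpv",
    "2": "cgjkqsxz",
    "3": "dt",
    "4": "l",
    "5": "mn",
    "6": "r",
}
CODES = {letter: digit for digit, letters in GROUPS.items() for letter in letters}


def soundex(name: str) -> str:
    """Return a four-character phonetic code for a name (illustrative sketch)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""

    first = letters[0]
    encoded = first.upper()
    prev = CODES.get(first, "")  # the first letter's digit also blocks repeats

    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same digit
        code = CODES.get(ch, "")  # vowels (and y) have no digit
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated digit can appear again
        if len(encoded) == 4:
            break

    return encoded.ljust(4, "0")  # pad short codes with zeros


print(soundex("Hanna"), soundex("Hannah"))    # both names map to the same code
print(soundex("Robert"), soundex("Rupert"))  # a classic pair that also collides
```

Two names are treated as a candidate match when their codes are equal, which is why spelling variants of the same surname tend to end up grouped together.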
Problems in information science differ from typical machine learning problems in that the data being analyzed is usually textual. Text is discrete (the raw input is typically words or characters), sparse (an input will typically use only a small portion of the vocabulary), and heavy-tailed (some words, like “science”, are used very frequently, while others, like “nucleation”, are used rarely). As such, many challenges emerge.
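These three properties are easy to see even on a tiny example. The sketch below uses a made-up three-sentence corpus (an assumption for illustration, not data from any of the projects mentioned here) and a plain bag-of-words representation. Each document becomes a vector of integer counts (discrete), most entries of which would be zero in a realistic vocabulary (sparse), and a handful of words account for most of the occurrences (heavy-tailed).

```python
from collections import Counter

# Hypothetical toy corpus, invented for this illustration.
docs = [
    "science is the study of the natural world",
    "the science of nucleation explains how crystals form",
    "the study of information is a science",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})
counts = Counter(word for doc in tokenized for word in doc)


def bag_of_words(tokens):
    """Map a token list to a vector of integer counts, one slot per vocab word."""
    token_counts = Counter(tokens)
    return [token_counts[word] for word in vocab]


vectors = [bag_of_words(doc) for doc in tokenized]

# Sparsity: the fraction of the vocabulary each document actually uses.
density = [sum(1 for x in vec if x > 0) / len(vocab) for vec in vectors]

# Heavy tail: a few words dominate, while many appear only once.
singletons = [word for word, n in counts.items() if n == 1]

print("vocabulary size:", len(vocab))
print("fraction of vocabulary used per document:", density)
print("most common words:", counts.most_common(3))
print("words seen only once:", len(singletons))
```

With only three sentences the vectors are still fairly dense. On a real corpus the vocabulary grows to tens of thousands of words while each document still uses only a few hundred, which is where the sparsity becomes a practical concern.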
In my talk at ODSC East, I’ll focus on how NASA is using hierarchical topic modeling to break down barriers between fields of science. What are the knowledge gaps in certain subfields? How can the methodologies and instruments used in one subfield help scientists in another?
I will pose questions that I hope are broadly relevant across industries. How do you choose the right methodology for your problem? How do you navigate the landscape of open-source tools? Whether you are a manager at a startup, a student at a university, or a coder in a large company, we are all, at the core, students of information, and our work revolves around its effective organization.