Asking the Right Questions with Machine Learning

Chris Gropp, a PhD student at Clemson University, spoke at HPCC Systems Tech Talk 10, focusing on how to plan effectively at the start of a machine learning research project to achieve a successful outcome. This blog shares his experience on how to ask the right questions with machine learning, by taking a step back and carefully examining the requirements.

Chris Gropp’s PhD research project focused on using machine learning techniques to analyze text data in documents that may change over time. This project was successful, but as with most projects, it wasn’t without its challenges.

The lessons learned during this research project are ones many people will relate to. Chris wanted to share with us and our community why it is important to ask the right questions when using machine learning, while also being very careful "what you ask for!"

His simple, slightly exaggerated example illustrates this point effectively:

A Cautionary Tale

Imagine you have a text reading application. Your client wants to create something that reads text aloud to the visually impaired. They have some still images that need to be converted into raw text data. The images may include, for example, legal documents, so accuracy is very important. Guided by this, we choose accuracy as our key metric. Let's consider what a "perfect" text reading method satisfying this high accuracy requirement might look like.

Suppose we take the image and email it to 100 grad students, offering Starbucks gift cards in return for transcriptions of the images. When enough responses that agree with each other have been received, they are used to create text.

If it takes longer than a specified amount of time to get a response from a student, just keep sending the image to more students until the number of agreeing responses required by the accuracy constraint has been received. A perfect solution, right? Of course not. It scores well on your metric, but it does not solve the problem.

Let’s consider these additional questions:

  1. How long is someone using the system willing to wait for the transcript to come back? Probably not long at all.
  2. How much is the client prepared to pay for Starbucks gift cards to get the verified transcripts? This could be a prohibitively expensive solution.

The Moral Of The Story – Make Sure You Know What You Actually Need!

In reality, there were two additional implied constraints that needed to be factored in. Instead of simply asking for the most accurate solution, what was needed was a solution that provided accurate results within a specified amount of time, without being prohibitively expensive.

Chris’s reflections on his own experience take us through his use of topic modeling, which he used to understand and summarize large collections of textual information. He takes us through the models he used and the problems he encountered along the way, which caused him to re-evaluate his methodology and identify a better solution.

Introducing Topic Models – Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a type of topic model used to correlate text in a document to a particular topic.

In LDA, documents are assumed to be created via a generative process: for each word in a document, sample the document's topic mixture to choose a topic, then sample the chosen topic to choose a word from the vocabulary, repeating until the document is complete. To discover topics, we reverse engineer this generative model: from the observed documents, we infer the latent topics and topic mixtures using variational inference or Gibbs sampling, iterating over the documents and updating the estimates until they converge.
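The generative process can be sketched in a few lines of Python. This is a toy illustration only; the vocabulary, number of topics, and Dirichlet hyperparameters below are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 3 topics over a 6-word vocabulary.
vocab = ["court", "ruling", "contract", "model", "data", "topic"]
# topic_word[k] is topic k's distribution over the vocabulary,
# itself drawn from a Dirichlet prior.
topic_word = rng.dirichlet(np.full(len(vocab), 0.5), size=3)

def generate_document(n_words, alpha=0.1):
    """Sample one document via the LDA generative process."""
    # 1. Draw this document's topic mixture from a Dirichlet prior.
    theta = rng.dirichlet(np.full(len(topic_word), alpha))
    words = []
    for _ in range(n_words):
        # 2. For each word, sample a topic from the document's mixture...
        k = rng.choice(len(topic_word), p=theta)
        # 3. ...then sample a word from that topic's word distribution.
        words.append(vocab[rng.choice(len(vocab), p=topic_word[k])])
    return words

doc = generate_document(10)
```

Inference runs this process in reverse: given many such documents, it recovers estimates of `topic_word` and each document's `theta`.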

Dynamic Topic Models

Chris and his team were tasked with examining how language changes over time in a large text corpus. The initial model used for this problem was the Dynamic Topic Model (DTM). DTM relaxes a key assumption of LDA: that all documents are generated at the same time. Instead, documents are separated into discrete timesteps, and each timestep has a distinct version of each of the topics under consideration.

So, there is an evolution of topics over time, which allows for language to change with new related concepts. This makes it possible to determine which topics are most important at each timestep.

However, the time dependencies, and the way they are arranged, make this model difficult to infer. The intention was to apply it to a large dataset, and the obvious approach was to parallelize. But the data dependencies and interconnections between the parts of the model complicate parallelization, and the original code was not designed for performance.

When this proved more difficult than anticipated, the requirements were revisited: Do we really need this? What do we actually need from it?

What was needed was fast, parallel code that could run on a large amount of data: something that could extract topics from different timesteps so that information would be preserved, and that could express how the topics change, which words they use, and how important they are.

A parallel dynamic topic model code could accomplish this, but there was another solution. Rather than keeping topics linked during inference, it was decided to perform inference with purely local information, and then link the topics afterwards. This allowed for parallelism and simplified the computation; it also runs very quickly, even on large datasets. The new model is referred to as Clustered Latent Dirichlet Allocation.

Clustered Latent Dirichlet Allocation (CLDA)

Clustered Latent Dirichlet Allocation (CLDA) runs LDA independently on each timestep of data. Resulting topics are clustered, and each of these clusters works similarly to the dynamic topics in DTM.
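The two-step structure of CLDA can be sketched with off-the-shelf tools. This is not the project's actual code (which uses Python and C); it is a minimal sketch using scikit-learn on invented toy data, showing per-timestep LDA followed by clustering of the resulting topic-word distributions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus split into two timesteps.
timesteps = [
    ["court ruling contract law", "contract law court judge"],
    ["model data topic learning", "data model training topic"],
]

# Shared vocabulary so topic-word vectors are comparable across timesteps.
vectorizer = CountVectorizer()
vectorizer.fit([doc for step in timesteps for doc in step])

# Step 1: run LDA independently on each timestep (embarrassingly parallel).
all_topics = []
for docs in timesteps:
    X = vectorizer.transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(X)
    # Normalize rows so each topic is a distribution over the vocabulary.
    topics = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    all_topics.append(topics)

# Step 2: cluster the topics from all timesteps; each cluster plays the
# role of one evolving "dynamic" topic.
stacked = np.vstack(all_topics)
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(stacked)
```

Because step 1 has no cross-timestep dependencies, each timestep's LDA run can be dispatched to a different worker, which is what makes the approach scale.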

The CLDA application was very successful for a number of reasons:

  • It is two orders of magnitude faster than DTM.
  • It provides more detailed topic evolution information.
  • It allows for topics to arise and die off. For more details, see the white paper.

This solution solved the problem, rather than simply providing an approximation.

If you want to know more about CLDA implementations, see the following link. The code uses Python and C. There is also an active project to implement CLDA in the ECL language.

What We Learned

The important thing to take away from this is that Chris tried something that was harder than it really needed to be and found a much simpler solution that solved the problem.

Also, when a project has been implemented and does not achieve the desired outcome, it is time to stop, ask questions and reflect. If you start with a tool you have decided you want to use, it might not be the right one for the job. Instead, first identify the requirements, and then choose an appropriate tool.

Questions to Ask Before You Start

When it comes to embarking on a machine learning project, Chris's advice is to start by identifying the requirements, determining how to evaluate the success of the project, and analyzing the methods to be used to satisfy those requirements. To do this effectively, it is important to ask the right questions:

What is the problem you want to solve?

  • Start with the big picture.
  • What is the final application?

How should the solution look?

  • What types of input do you have?
  • What does your output need to contain?
  • What other constraints are there? Think about any implied constraints, too. (Speed, memory, security, etc.)

Evaluating Success

When evaluating the success of a project, remember that what you can measure easily and what you need to measure may be different. So, it is important to always have one eye on the problem.

This means, you should always be thinking about:

  • How is the application evaluated?
  • How is a good solution distinguished from a bad solution?
  • What can a good solution do that a bad solution cannot?
  • How can you measure the difference?

Choosing a Method

Once you have identified what you need, then you can finally pick (or create) a method. To choose a method, do the following:

Look at the requirements:

  • Which methods process the types of input you have, and produce the type of output you need?
  • Which methods are within the constraints of your application?

Look at the metrics:

  • Which candidate methods perform best where you need them to excel?
  • What trade-offs do the candidate methods have? Which trade-offs best satisfy your priorities?

The Key to Success is Proper Planning

Remember to ask the right questions before starting a project. Asking the right questions allows for better planning and has the happy side effect of cutting down on execution time during the implementation of a project.

About Chris Gropp

Chris is a PhD Candidate working alongside Dr. Amy Apon and her team at Clemson University in the USA. As a member of this team, Chris works on refining topic modeling approaches to text analysis by improving the algorithms themselves, and also by developing new methods to analyze output.

Listen to the full recording of Chris Gropp speaking about “Asking the Right Questions with Machine Learning” at our HPCC Systems Tech Talk webcast.

Original post here.

HPCC Systems

Discover HPCC Systems – the big data platform that enables you to spend less time formatting data and more time analyzing it. This truly open source solution allows you to quickly process, analyze, and understand large data sets, even data stored in massive, mixed schema data lakes. Designed by data scientists, HPCC Systems is a complete, integrated solution from data ingestion and data processing to data delivery. Connectivity modules and third-party tools, a Machine Learning Library, and a robust developer community help you get up and running quickly. Visit www.hpccsystems.com.