I am back from the Open Data Science Conference (ODSC West) in California. What a blast! Not only was I able to present my talk on the democratization of AI, but I also learned a lot of very interesting things!
I am honestly impressed by the projects and technologies presented throughout the week. The state of data science is way more advanced in 2018 than it was a year ago.
Here’s a quick summary of what I found the most interesting. To keep it succinct, I will only cover the techniques and concepts I know best. I’ve grouped my learnings into two categories: business concepts, and technologies and tools.
How to Retain Data Scientists in Your Company
We’ve learned that a data scientist doesn’t stay in the same company for more than 18 months on average! According to Harvard Business Review, data scientists need support from their manager and organization, which is not easy in the early stages of a team’s creation. They also need ownership, which is hard to provide since data science is often seen as a support function within organizations.
Finally, data scientists are also seeking purpose. Unfortunately, companies that lack vision tend to give data scientists either no direction or very low-level tasks. These behaviors drive employee churn.
My take on this: It is critical to keep data scientists motivated. Data science teams are often created organically by companies that want to capitalize on the many opportunities AI offers. However, it is critical to keep in mind that a data science team, like any team, needs a vision, a solid structure, and coaches in place to make sure that the data science professionals thrive.
Data Science Goes Beyond Statistical Models
Implementing TensorFlow or scikit-learn models is an important ability, but a data scientist must also master business and mathematical concepts to have the biggest impact on the businesses they work for.
My take on this: Being a data scientist is about using the right tool to solve a problem. If the data scientist does not understand the business problem and its statistical implications, it’s not going to work. Data cleanliness and data science methodology are also important concepts to master in order to build well-performing and relevant statistical models.
The Hype is Real
A lot of A.I. use cases are very successful, confirming that the hype is not going to disappear. Identifying fake news, predicting a reader’s feelings while reading a New York Times article, early detection of behavioral health disorders in children, and very advanced image recognition tools were among the great projects presented.
My take on this: The question is not even about proving the usefulness of A.I., but rather making the outside world’s expectations more realistic. A common trait between these projects is that current technologies can already deliver good performance. It is less about technology and more about resolving a real problem.
It is All About Data
Data is the real competitive advantage going forward, as statistical models are shared amongst the community. The AI research community is constantly growing and largely open source, which means that months and years of research work quickly become available to everyone.
Any data scientist can then use these tools to develop best-in-class statistical models to solve their problems, so statistical models tend to be similar throughout the community. However, since machine learning is a set of statistical algorithms that identify and generalize patterns from already observed data points, a voluminous and clean dataset is the best way to get more out of a statistical model than your competitors do.
My take on this: This couldn’t be more relevant. However, the truth is that most companies have a looooong way to go. Still, if you want to quickly and smartly invest in your data, the techniques discussed below can help you augment or clean your dataset. Some companies present at the event also offer labelling services. This is a gold mine for companies that want to get started with data science.
Technologies & Tools
Monte Carlo simulation and active learning are increasingly, and successfully, used to prepare data in an agile fashion, or in cases where data isn’t abundant enough.
Regarding Markov chain Monte Carlo (MCMC), the real advantage is that it provides a serious alternative for augmenting your dataset, even if you do not already have an extensive one (versus a generative adversarial network, which requires a good amount of data to begin with).
Note: it is, however, crucial to have all of the clusters and use cases present within the population before thinking of synthetically augmenting the dataset. Hamiltonian MCMC using PyMC3 is a great technique, as it handles multiple features while converging better than other similar techniques.
My take on this: While data quality is super important and it is always better to have a big dataset, that is not always possible. Monte Carlo methods allow companies with smaller datasets to augment them so that they can use more advanced models, when done with care that is. Also, some use cases like forecasting and logistics simulation are more efficient with this technique.
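The talks used Hamiltonian MCMC via PyMC3; as a minimal, library-free sketch of the underlying idea (a hypothetical toy, not the presented workflow), here is a plain Metropolis sampler drawing synthetic points from a density fitted to a small observed sample:

```python
import math
import random
import statistics

def metropolis(log_target, start, step, n_samples, seed=0):
    """Minimal Metropolis sampler: draws from an unnormalized log-density."""
    rng = random.Random(seed)
    x, lp = start, log_target(start)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        lp_new = log_target(proposal)
        # Always accept uphill moves; accept downhill moves with prob exp(dlp).
        if lp_new >= lp or rng.random() < math.exp(lp_new - lp):
            x, lp = proposal, lp_new
        samples.append(x)
    return samples

# Toy "augmentation": fit a normal to a small observed sample, then draw
# synthetic points from the fitted density.
observed = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2]
mu, sigma = statistics.mean(observed), statistics.stdev(observed)

def log_density(x):
    return -((x - mu) ** 2) / (2 * sigma ** 2)

synthetic = metropolis(log_density, start=mu, step=0.5, n_samples=5000)
print(round(statistics.mean(synthetic[1000:]), 2))  # near the observed mean
```

In a real project you would let PyMC3’s Hamiltonian/NUTS sampler do this over many features at once; the caveat from the note above still applies, since the sampler can only reproduce clusters already present in the observed data.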
Transfer Learning Is the Way to Go
Transfer learning is the method of adapting existing, proven models to your needs. As these models were already trained by a big corporation with a lot of resources, it is usually the best method in the use cases presented so far. You simply retrain the last layers on your problem, et voilà!
My take on this: Models that are reusable via transfer learning are especially available for image recognition and natural language processing. If you use those types of models, please try transfer learning; you’ll be blown away!
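To make the pattern concrete without pulling in TensorFlow, here is a toy sketch: a fixed function stands in for the frozen, pretrained feature extractor (think of a CNN minus its final layer), and only the final logistic "head" is retrained on the new task. Everything here is hypothetical illustration, not a real pretrained model.

```python
import math

# Stand-in for a frozen, pretrained feature extractor. In real transfer
# learning you would load a published model and keep its weights fixed.
def pretrained_features(x):
    return [x[0] + x[1], math.tanh(x[0] - x[1])]

# Tiny labeled set for the *new* task.
data = [([0.0, 0.1], 0), ([0.2, 0.0], 0), ([0.9, 1.0], 1), ([1.1, 0.8], 1)]

# Only the final head (one logistic unit) is trained: the transfer step.
w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(500):
    for x, y in data:
        f = pretrained_features(x)
        z = w[0] * f[0] + w[1] * f[1] + b
        p = 1.0 / (1.0 + math.exp(-z))
        grad = p - y  # dLoss/dz for log loss
        w = [w[i] - lr * grad * f[i] for i in range(2)]
        b -= lr * grad

def predict(x):
    f = pretrained_features(x)
    return 1 if w[0] * f[0] + w[1] * f[1] + b > 0 else 0

print([predict(x) for x, _ in data])
```

The design point is the same as with a real Keras or PyTorch model: the expensive representation is reused as-is, and only a small number of parameters are fit to your (small) dataset.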
(More on) Transfer Learning
Many people were talking about transfer learning during the conference, but no simple framework is available online. So here is a super simplified series of steps to get you started.
- You label a number of observations
- You fit a shallow (simple) statistical model on the labeled data
- You predict all of your observations that were not labeled
- You review a random sample of predicted labels with a high prediction error
- You re-label those observations
- You go back to step 2 until the labels you review are correct.
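As a toy illustration, the loop above can be sketched in Python. Everything here is hypothetical: a scripted oracle stands in for the human labeler, a simple threshold stands in for the shallow model, and the least confident predictions (closest to the decision boundary) substitute for the high-error reviews:

```python
import random

rng = random.Random(42)

# Toy setup: one feature; the true label is 1 when the feature exceeds 0.5.
points = [rng.random() for _ in range(200)]

def oracle(x):  # stands in for the human labeler
    return 1 if x > 0.5 else 0

# Step 1: hand-label a small seed batch.
labeled = {i: oracle(points[i]) for i in range(10)}
threshold = 0.5

for _ in range(5):
    # Step 2: fit a shallow model, here a midpoint threshold.
    ones = [points[i] for i, y in labeled.items() if y == 1]
    zeros = [points[i] for i, y in labeled.items() if y == 0]
    if ones and zeros:
        threshold = (min(ones) + max(zeros)) / 2
    # Steps 3-4: predict the unlabeled points and pick the least
    # confident predictions (closest to the decision boundary).
    unlabeled = [i for i in range(len(points)) if i not in labeled]
    uncertain = sorted(unlabeled, key=lambda i: abs(points[i] - threshold))[:10]
    # Steps 5-6: have the oracle (re)label those and repeat.
    for i in uncertain:
        labeled[i] = oracle(points[i])

accuracy = sum((x > threshold) == (oracle(x) == 1) for x in points) / len(points)
print(len(labeled), round(accuracy, 2))  # 60 labels instead of 200
```

The point of the loop is the label budget: the model reaches good accuracy after labeling a fraction of the dataset, because every round of review is spent where the model is least sure.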
Some benefits were discussed as well. For sure, it is quicker to label 2,000 observations manually and iterate than to label millions of observations by hand. Also, you ensure that labels are standardized (vs. hiring multiple people with different criteria). Finally, this is a great way to correct your model while it is live, when performance starts to degrade.
My take on this: One of the biggest hassles for the data science community is labelling data for supervised machine learning. Too much time is spent manually labelling customer support tickets, user profiles, pictures and texts… this is insane, especially considering that manually labelling thousands of items is boring and too easily produces unstandardized results. The smart approach is to label a small batch of data points, use the technique above, and iterate until all data points are labeled.
Model Creation Is Just the Beginning
Without real production experience, people can think that machine learning is mainly about building a model with past observations and validating it successfully. However, experienced data scientists carry many scars from production. In fact, the real and complex work happens once your model is in production!
A data scientist at Tesla demonstrated how edge cases are critical and part of their testing process. Overall accuracy or loss minimization is clearly not sufficient. Tesla treats its edge cases as regular software-delivery test scenarios that have to pass before models are updated in production.
Other data scientists talked about various sampling biases that caused models that performed very well on training data to be terrible in production. It is vital to make sure that all production use cases and clusters are present within the training dataset. It is also important to compare the distributions of the two datasets to make sure that their values match closely.
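One simple way to compare the two distributions (an illustrative sketch, not a tool from the talks) is the two-sample Kolmogorov-Smirnov statistic, which measures the largest gap between the empirical CDFs of the training and production samples:

```python
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the two empirical CDFs. Large values flag a distribution shift."""
    def cdf(sample, v):
        return sum(1 for x in sample if x <= v) / len(sample)
    return max(abs(cdf(a, v) - cdf(b, v)) for v in set(a) | set(b))

rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(500)]
prod_same = [rng.gauss(0.0, 1.0) for _ in range(500)]
prod_shifted = [rng.gauss(1.5, 1.0) for _ in range(500)]  # simulated drift

print(round(ks_statistic(train, prod_same), 2))     # small gap: distributions match
print(round(ks_statistic(train, prod_shifted), 2))  # large gap: investigate!
```

In practice you would run this per feature (for example with `scipy.stats.ks_2samp`) as a scheduled check, and alert when production data drifts away from what the model was trained on.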
One of Google’s engineers discussed the importance of presenting a model’s output. Even if you automate the decision making, it is always a good idea to understand why the model predicted a case the way it did, for periodic validation.
My take on this: Even if you have a live model, it is always a good idea to review how it performs on real production data, so you know that all your hard work defining a good statistical model is paying off once your project is live. You might be surprised how different reality is from your expectations. Also, you will quickly realize that without ways to interpret your model, you will be lost trying to make sense of it. Techniques such as LIME and SHAP will clearly help you interpret your model.
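This is not the actual LIME or SHAP API; as a rough, hypothetical illustration of the perturbation idea behind such tools, you can measure how often small perturbations of each feature flip a black-box model's prediction for one case:

```python
import random

# A black-box model to explain locally (hypothetical; in practice this
# would be your trained classifier's predict function).
def model(x):
    return 1 if 2.0 * x[0] - 0.5 * x[1] > 0.3 else 0

def local_importance(model, x, n=500, scale=0.1, seed=0):
    """LIME-style idea in miniature: perturb one feature at a time around x
    and measure how often the prediction flips."""
    rng = random.Random(seed)
    base = model(x)
    flips = []
    for j in range(len(x)):
        count = 0
        for _ in range(n):
            perturbed = list(x)
            perturbed[j] += rng.gauss(0.0, scale)
            if model(perturbed) != base:
                count += 1
        flips.append(count / n)
    return flips

x = [0.2, 0.1]
importance = local_importance(model, x)
print(importance)  # feature 0 flips the prediction far more often
```

Real LIME fits a local surrogate model on the perturbed samples and SHAP computes Shapley values, but the core intuition is the same: features whose perturbation changes the prediction most are the ones driving this particular decision.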
It’s about People, Processes, Data and Technologies
Open Data Science Conference was a great reminder of the most important aspects of data science: people, processes, data, and technologies.
It is important to support data scientists well and give them all the ingredients to thrive and really make a difference. If the vision is well defined, if they are well surrounded, and if they work on interesting and strategic problems, data science professionals will enable the organization to be more data-driven.
In statistics and in computer science, processes and methodologies are critical. Without a doubt, data science is no exception. So far, the best-known methodologies are the model definition and model validation processes. These processes are key to getting valid results.
Lately, as the data science community matures, more practices are being discussed, such as model DevOps, which consists of validating the statistical model’s accuracy and performance while it is in production. Active learning has also been discussed: essentially, it is the process of adapting a model based on feedback.
The community now realizes that a performant statistical model really depends on its underlying data, making data the most important element of data science. Simulation and active learning are interesting and creative approaches to building a bigger dataset even when you do not have access to a lot of labeled data.
Once again, new frameworks, tools, and algorithms were presented. Research is important in this field. This year, though, I was happy to hear about use cases that were successful using already existing technologies. Even with some of the newest algorithms, some projects saw only small gains.
It is satisfying to know that current versions are good baselines in most cases. Imagine what you could achieve once you master all four aspects of data science (people, processes, data, technologies) while working on resolving strategic problems.