

Building a State-of-the-Art Data Science Platform

In the real world, the Data Science process flows beyond the steps of processing some data and training a predictive model. In a production system, the Data Science process also covers data and model delivery, auditing, and preparation. Data Science is highly elaborate, and getting it right is a critical factor in determining whether a business will succeed in its Data Science projects or not.
In this article, we are going to discuss the different components required to build a highly effective and robust DS platform.
The Objectives
Why do we need a Data Science platform?
What do we want to get out of this platform?
The main reason why businesses fail in their Data Science initiatives is largely their inability to properly quantify their DS requirements. Most are unable to answer “why do you need DS?”.
Unfortunately, several businesses jump on the hype train of the trendy buzzwords of the moment and feel obliged to do Data Science. At the end of the day, it’s what sells, no? There are two main possible scenarios here:
- get lucky and produce a successful project
- fail miserably
What often ends up happening is a failure to deliver on that hype, which results in a tainted reputation and, of course, harm to the business!
Modern-day businesses always have a use case for a Data Science platform. Data is key in today’s world; it is the world’s most valuable asset, after all. However, what most fail to understand is the differentiation between a Data Science platform and an Artificial Intelligence (AI) platform. Data Science flows beyond the confines of AI.
Data Science is the beating heart of the data body. Data Science is the medium through which any organization can make sense of its raw data. Data Science is, in its simplest form, a toolset for enabling the delivery of analytical intelligence.
So, why do you need a Data Science Platform? What do we want to get out of this platform?
Quite simply, a Data Science platform renders you capable of truly understanding your data.
We all have personal relationships, and I am sure that all of us, at one point or another, have wished we knew what the other party was thinking. Well, as an organization, you are in a relationship with your clients and stakeholders. And Data Science does just that: it enables you to know and understand what your partners are thinking. It equips you with the right knowledge to assess what is working and what is not, and helps you decide which buttons to push.
The Components
Before we dive deep into the winning formula of the Data Science platform, I want to point out 2 incredibly important considerations:
- Think FUTURE: As with anything tech, data technologies are rapidly evolving both in scope and in magnitude. So keeping our options open to future scalability opportunities is probably a wise thing to do.
- Think MODULARISATION: When building our Data Science platform, it is imperative that we think of it as another product. As an organization, you should not build a Data Science platform to simply aid your workflow. Be proud of your Data Science platform! Exhibit it. Market it. And most importantly, SELL IT!
Yes, you heard that right. Sell it. Build your platform in such a way that every single component is individualized and can be sold as a separate product. Every single component of the platform should work perfectly fine individually and also be easily integrated with other solutions.
So — let us start discussing the key components of a solid Data Science platform.

Data Flow Diagram of the Platform Components. Image by the Author
Component 1: The Data Messaging System
Part 1: Any data-related product needs to start with handling the transfer of data. How are we going to get our data inside our platform? We have two main options really:
- Real-time data streaming
- ETL or batch loading
I strongly believe that real-time data streaming is the way to go in almost every scenario; however, this choice depends heavily on your infrastructure and use case. If real-time streaming is your choice, then Apache Kafka is your friend. Kafka is simply a message-passing solution. There are various other solutions such as Amazon Kinesis, TIBCO Messaging, and RabbitMQ. Again, just pick your poison.
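To make this concrete, here is a minimal sketch of pushing events into the platform, assuming a Kafka broker on localhost and the kafka-python client; the "raw-events" topic name is just a placeholder.

```python
# Minimal sketch of sending events into the platform with kafka-python.
# Assumes a broker on localhost:9092 and a hypothetical "raw-events" topic.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each event is a plain dict; the serializer turns it into JSON bytes.
producer.send("raw-events", {"user_id": 42, "action": "page_view"})
producer.flush()  # block until the message is actually delivered
```

The consuming side (Spark, Flink, or a plain consumer) then picks these messages up downstream, which brings us to storage and processing.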
Part 2: The data storage aspect. Selecting the database system to go for is heavily dependent on the following characteristics:
- Velocity — How fast is the data coming in?
- Volume — How large (in size) is the data?
- Variety — How many different data structures do you have?
- Latency — How fast do you want the database to return results?
This is a huge domain for discussion and expands way beyond the objectives of this article, so I won’t be going into it much more.
Part 3: Data Processing and Feature Engineering. Once we start getting in our data streams, we can start processing this data on the fly. The ultimate goal here is to create Feature Stores (i.e. stores of specific, pre-calculated features). The Feature Stores become even more relevant with the integration of machine learning solutions within the Data Science platform. In production, any machine learning models that are running will constantly read from the Feature Stores to retrieve their data.
There are various solutions to this. One might develop their own using something like Spark to process the data streams and store the results in database objects (depending on what database flavor one opted for). Conversely, one might also go for existing Feature Store implementations such as Feast, Tecton, or Hopsworks. The important note to keep in mind here is that we should always have a single source of truth, especially for the engineered features.
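As a rough illustration of the in-house route, the sketch below uses Spark Structured Streaming to turn the raw Kafka stream into a simple engineered feature. The topic name, feature name, and sink are all placeholders; a real setup would write to whichever feature store flavor you picked.

```python
# Sketch: compute a per-user event count from the Kafka stream and expose it
# as a feature table. Topic, column, and sink names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("feature-engineering").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "raw-events")
    .load()
)

# Parse the JSON payload and aggregate it into a simple engineered feature.
parsed = events.select(
    F.get_json_object(F.col("value").cast("string"), "$.user_id").alias("user_id")
)
features = parsed.groupBy("user_id").agg(F.count("*").alias("event_count"))

# Keep a single source of truth for the engineered features.
query = (
    features.writeStream.outputMode("complete")
    .format("memory")  # swap for a real sink (Delta, JDBC, Feast ingestion) in production
    .queryName("user_features")
    .start()
)
```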
Component 2: The Intelligence
This component is all about building a framework that allows for automated and manual machine learning model training. All model training processes should flow through a standard model evaluation strategy (common across all models) which would perform A/B testing and statistical hypothesis testing to both evaluate and compare base models.
The final model should be further optimized through automated hyper-parameter tuning. Every single model should first pass through this framework before being deployed to your Production environment.
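The sketch below shows what such a shared evaluation gate could look like with scikit-learn: one common scoring routine applied to every base model, followed by automated hyper-parameter tuning for the winning family. The dataset and parameter grid are stand-ins.

```python
# Sketch of a shared evaluation gate: every candidate model goes through the
# same cross-validated scoring and hyper-parameter search before deployment.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=1_000, random_state=0)  # stand-in data

def evaluate(model, X, y):
    """Common evaluation strategy applied to every base model."""
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

baseline_score = evaluate(RandomForestClassifier(random_state=0), X, y)

# Automated hyper-parameter tuning for the chosen model family.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X, y)
print(baseline_score, search.best_score_, search.best_params_)
```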
Proper model versioning is also critical here. We need to be able to answer questions such as:
- when was the current model trained?
- what are the current model’s performance metrics?
- which model was in production on that specific day?
An awesome tool for this task is MLFlow.
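As a hedged example, here is roughly how a training run could be logged with MLFlow so that the questions above become answerable from the tracking server. The experiment name, parameters, and metric are placeholders.

```python
# Sketch: log the model, its parameters, and its metrics so that "when was it
# trained?" and "how did it score?" can be answered from the tracking server.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

mlflow.set_experiment("churn-model")  # placeholder experiment name
with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, artifact_path="model")
```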
At this point, we should also start thinking about Data Version Control (DVC) — mainly for the same reasons that you would want to version your code-base.
For DVC, we can either:
- Version data as a snapshot at training time (artifacts in MLFlow for example), or
- Use a specialized data versioning system.
This methodology would allow us to version data and code separately (any change in data would affect the code, but not all code changes would affect the data — so they should be separate units). Similar to the Feature Store, one might opt for building the DVC in-house or going via third-party solutions. Some examples include Pachyderm, DVC, and Hangar.
Properly implementing DVC in our Data Science platform would prove to be incredibly powerful in the long term. Especially with something like MLFlow in place, we would be in a position to bind a trained model (versioned in MLFlow) to the training data set used by that model (versioned in DVC). This also allows us to have an audit trail on our data. Nonetheless, both model versioning and DVC should communicate with a Metadata Store to establish a strong audit trail of what, when, and how. But more on this later.
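To illustrate the simpler "snapshot at training time" option, the sketch below logs a fingerprint of the training data alongside the MLFlow run, so a model version can always be traced back to the exact data it saw. The file path is a placeholder.

```python
# Sketch of the "snapshot at training time" option: store a fingerprint of the
# training data alongside the model run so the two can always be reconciled.
import hashlib
import mlflow

TRAIN_PATH = "data/train.csv"  # placeholder path to the training data

def file_digest(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

with mlflow.start_run():
    mlflow.set_tag("train_data_sha256", file_digest(TRAIN_PATH))
    mlflow.log_artifact(TRAIN_PATH)  # or track the file with a dedicated DVC tool instead
```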
Another aspect within this component is Explainable Artificial Intelligence (XAI). Building machine learning models is one task. But delivering interpretable predictions to the end-user is a task on a whole other level. Nowadays, there exist numerous XAI packages to help you interpret the results of any machine learning algorithm. As such, incorporating such techniques within our intelligence component really helps take our platform to the next level.
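For instance, a few lines of SHAP (one of the more popular XAI packages) are enough to produce feature-importance explanations for a tree-based model; the model and data below are stand-ins.

```python
# Sketch: explain the predictions of a tree-based model with SHAP.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:100])

# Global view of which features drive the model's output.
shap.summary_plot(shap_values, X[:100])
```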
The final gear of this component is model monitoring. We need a way to evaluate the performance of our trained models and continuously monitor them for performance degradation. Model monitoring and evaluation should be split into three parts (a small production-monitoring sketch follows the list below):
- Model evaluation/benchmarking during training
- Model monitoring using XAI and What-If analysis
- Continuous production monitoring
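For the continuous production monitoring part, a lightweight starting point is a drift check such as the Population Stability Index. The sketch below is a minimal, framework-free version; the 0.2 threshold is a common rule of thumb rather than a hard rule.

```python
# Sketch of continuous production monitoring: a Population Stability Index
# check that flags when a feature's live distribution drifts from training.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """PSI between a training-time sample and a production sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

train_feature = np.random.normal(0, 1, 10_000)
live_feature = np.random.normal(0.3, 1, 10_000)   # simulated drift
print(psi(train_feature, live_feature))           # > 0.2 usually warrants a closer look
```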
Component 3: The Productionisation
We release our models to Production to be used by our clients. This means that we have our models making predictions on real-time data and delivering the results to the respective destination. This step should also communicate with the Metadata Store to keep a productionisation audit trail (for instance, recording when a new model was pushed to production and keeping a track record of every single prediction made).
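A minimal sketch of what this could look like is shown below: a scoring endpoint that loads the production model and records an audit entry for every prediction. The model registry URI, feature names, and the in-memory audit log are all assumptions standing in for your real setup.

```python
# Sketch of a real-time scoring endpoint that also records an audit entry for
# the Metadata Store. Model URI, feature names, and the log are placeholders.
from datetime import datetime, timezone

import mlflow.pyfunc
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = mlflow.pyfunc.load_model("models:/churn-model/Production")  # assumes a model registry
audit_log = []  # stand-in for the real Metadata Store client

class Features(BaseModel):
    user_id: int
    event_count: float

@app.post("/predict")
def predict(features: Features) -> dict:
    score = float(model.predict(pd.DataFrame([{"event_count": features.event_count}]))[0])
    audit_log.append({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user_id": features.user_id,
        "prediction": score,
    })
    return {"churn_probability": score}
```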
Another interesting mechanism that can be integrated within this component is the Feedback Loop. The feedback loop is a mechanism by which the prediction that a machine learning model just made is fed back to that same model for re-training. This aspect becomes increasingly powerful when combined with actionable predictions. The feedback loop should enable us to evaluate the model in production as close to real-time as humanly possible (depending on the task at hand) and also facilitate the transition to online learning. This would also enable us to minimize the effects of model decay and concept drift in our platform.
For example, let us assume a model that predicts whether a customer will churn within the next 5 days. Once we make a prediction, we can monitor that customer for 5 days to determine whether they did in fact churn or not. This feedback is then fed back to the initial model to further refine and improve its performance.
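Sketched out, the churn feedback loop could look something like this; `fetch_outcome` and `retrain` are hypothetical helpers standing in for your label lookup and retraining pipeline, and the prediction record is assumed to carry a timezone-aware timestamp.

```python
# Sketch of the feedback loop for the 5-day churn example: once the outcome
# window closes, the observed label is appended to the training set and the
# model is refreshed. fetch_outcome and retrain are hypothetical helpers.
from datetime import datetime, timedelta, timezone

HORIZON = timedelta(days=5)

def close_feedback_loop(prediction_record, training_set, model):
    made_at = prediction_record["ts"]  # assumed to be a timezone-aware datetime
    if datetime.now(timezone.utc) - made_at < HORIZON:
        return model  # outcome window still open, nothing to do yet

    # Did the customer actually churn within the 5-day window?
    actual = fetch_outcome(prediction_record["user_id"])      # hypothetical label lookup
    training_set.append((prediction_record["features"], actual))
    return retrain(model, training_set)                       # hypothetical retraining step
```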
Component 4: The Monitoring and Exposure
This is the part where we can show off our cool observations and insights. This is the most important part of the entire platform. This is the face of the operation; the component which the customers will interact with and experience directly. This is where we deliver all of our results and monitor closely the generated business value.
Here is our opportunity to put on our creative hats. We need to build dashboards to fully visualize the capabilities of our product, and also ensure that the customer is able to fully and quickly absorb our insights.
We would also be able to continuously monitor current model production performance, enabling us to take a proactive stance on model retraining and delivery. We can also adopt a multi-layered dashboard that uses different views for different types of users. For instance, the Developer view would allow us to deep dive into a particular model’s technical metrics while a User/Regulator view would load up an XAI dashboard and What-If analysis.
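As a quick illustration, the multi-layered view could be as simple as a Streamlit page that switches content based on the selected audience; the metrics shown here are placeholders.

```python
# Sketch of a multi-layered dashboard: the sidebar chooses the audience and
# the page adapts. Metric values are placeholders.
import streamlit as st

view = st.sidebar.selectbox("View", ["Developer", "User / Regulator"])

if view == "Developer":
    st.header("Model technical metrics")
    st.metric("ROC AUC (production)", 0.87)
    st.metric("Feature drift (PSI)", 0.12)
else:
    st.header("Explainability")
    st.write("Per-prediction explanations and What-If analysis would load here.")
```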
Component 5: The Metadata Store
The Metadata Store is simply a way for us to keep an auditable trace of what is happening in our workflow. Here we can store an audit trail of queries, predictions made, and model training runs. One can also enhance the Metadata Store to include a prognostic trail of the different products.
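In its simplest form, the Metadata Store could be a single append-only audit table that every other component writes to. The sketch below uses SQLite purely for illustration; the schema is a suggestion, not a prescription.

```python
# Sketch: the Metadata Store as a single append-only audit table shared by the
# other components. The schema here is illustrative, not prescriptive.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("metadata_store.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS audit_trail (
           ts         TEXT NOT NULL,   -- when it happened
           component  TEXT NOT NULL,   -- e.g. 'training', 'productionisation'
           event      TEXT NOT NULL,   -- e.g. 'model_promoted', 'prediction_made'
           details    TEXT             -- free-form JSON payload
       )"""
)

def record(component: str, event: str, details: str = "") -> None:
    conn.execute(
        "INSERT INTO audit_trail VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), component, event, details),
    )
    conn.commit()

record("training", "model_promoted", '{"model_version": "3"}')
```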
TL;DR
Essentially, our Data Science platform architecture should be composed of 5 main components:
- Getting the Data
– Real-time stream processing
– Feature Engineering
– Feature Store
- Intelligence
– Machine Learning
- Productionisation
– Model Deployment
- Monitoring
– Production Monitoring
- Metadata Store
The above components should be the backbone of any Data Science framework. What might differ is how we handle the integration and allow the different components to interact with each other.
Originally posted here. Reposted with permission by David Farrugia, Data Scientist | AI Enthusiast and Researcher