What is the problem that is compelling you to solve it using data science? The power in data, and the mechanisms to harness it, are now available to us. Identifying the right problem or use case is the first step. Many use cases across the industry are currently being attacked with data science. Which one resonates with you? Being very clear about your objectives, and about what success of the project would look like, is the stepping stone.
[Related Article: Making the Move from Data Science to Data Leadership]
Having identified your use case and a definition of success, you can begin focusing on the project more precisely. You may wonder how a data science project is similar to, or different from, any other software development project. The principles of software engineering are no different here: handling requirements, design and development, testing and verification, and finally implementation. This may look very generic, and it is. What differs is the type of requirements you are talking about, what design and development mean in data science, how testing is done, and how deployment happens; and further, post-deployment, how you support and manage the products you created. In a nutshell, the entire lifecycle, often called MLOps, together with the model governance framework.
Any data science project begins, of course, with data. Collecting data, cleansing it, understanding its meaning, applying appropriate processing to make it ready for consumption, and finally feeding it to the models and testing them may seem a straightforward, mechanical task. In reality, it is not. A great deal of work goes into collecting and preparing data before its final treatment. Unless you have created smart ways of streamlining this process, it is laborious; even then, effort is needed up front to build that automation. A dataset with hundreds of dimensions demands the most crucial task of all: the data scientist must develop a thorough understanding of this data, gaining domain knowledge while leveraging tools and techniques for data investigation and visualization. Several feature engineering tools and techniques are at the data scientist's disposal today. This phase can be seen as the requirements management phase of a typical software engineering process: you work with domain experts to clearly elicit the information you need, so that your model has the predictive capabilities you are aiming for.
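To make the cleansing and feature engineering step concrete, here is a minimal sketch in pandas. The column names, fill rules, and derived features are illustrative assumptions, not a prescription; real projects derive them from domain knowledge.

```python
import pandas as pd

# Hypothetical raw dataset with missing values, standing in for real collected data.
raw = pd.DataFrame({
    "age": [25, None, 47, 31],
    "income": [50000, 62000, None, 58000],
    "signup_date": pd.to_datetime(["2021-01-05", "2021-03-10",
                                   "2021-02-20", "2021-04-01"]),
})

# Cleansing: fill missing numeric values with the column median.
clean = raw.copy()
for col in ["age", "income"]:
    clean[col] = clean[col].fillna(clean[col].median())

# Feature engineering: derive columns the model can consume directly.
clean["income_per_year_of_age"] = clean["income"] / clean["age"]
clean["signup_month"] = clean["signup_date"].dt.month
```

The point is less the specific transformations than the habit: every fill rule and derived column encodes a decision about what the data means, which is exactly what the domain experts help you get right.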
In model development, your data scientists will use this data to train the models and tune the hyperparameters. It is an iterative process of refining the models over a number of experiments. Unlike development in a typical software project, in data science you build by training models on the dataset you create iteratively through feature engineering. At the same time, building models with varying sets of hyperparameter settings, which again can be automated, is the other source of variability in model development. The crucial factor is narrowing down to the dataset and the hyperparameter settings that together generate the model you are looking for. Version control of both the datasets and the models is thus an essential part of the development process.
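The automated sweep over hyperparameter settings can be sketched as a simple grid search. The toy model (a one-weight regression fit by gradient descent) and the grid values are assumptions for illustration; in practice each row of `results` would be logged alongside the dataset and model versions it came from.

```python
import itertools
import random

random.seed(0)

# Toy 1-D dataset, y = 3x + noise, standing in for a versioned training set.
xs = [i / 10 for i in range(50)]
ys = [3 * x + random.gauss(0, 0.1) for x in xs]

def train(lr, epochs):
    """Fit y = w*x by gradient descent; lr and epochs are the hyperparameters."""
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w

def mse(w):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Each combination of settings is one experiment; record every outcome.
grid = {"lr": [0.01, 0.05, 0.1], "epochs": [50, 200]}
results = []
for lr, epochs in itertools.product(grid["lr"], grid["epochs"]):
    w = train(lr, epochs)
    results.append({"lr": lr, "epochs": epochs, "w": w, "mse": mse(w)})

best = min(results, key=lambda r: r["mse"])
```

Keeping the full `results` table, rather than only `best`, is what turns a pile of experiments into the systematic record that version control of datasets and models is meant to support.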
Testing involves exposing the model to unseen data and letting it predict. The accuracy of the predictions, and whether the model achieves the expected targets, determine the final suitability of the model. Test data plays a crucial role too: an accurate representation of the real data is needed for the right assessments and conclusions to be drawn from the test results. In a typical software project, test cases are derived from the requirements the software must meet. The analogy in data science is that the right dataset, rather than a set of test cases, is chosen for testing the models. Different mechanisms and factors must be considered to ensure the right testing dataset is selected.
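The hold-out idea can be shown in a few lines. The synthetic data and the trivial threshold "model" below are assumptions for illustration; what matters is that the test set is carved off before training and never touched until evaluation.

```python
import random

random.seed(1)

# Synthetic labelled data: the label is 1 when the feature exceeds 0.5.
labelled = [(x, 1 if x > 0.5 else 0)
            for x in (random.random() for _ in range(200))]

# Hold out 25% as unseen test data; the model never sees it during training.
random.shuffle(labelled)
split = int(len(labelled) * 0.75)
train_set, test_set = labelled[:split], labelled[split:]

# "Train" a trivial threshold model on the training set only.
threshold = sum(x for x, _ in train_set) / len(train_set)

def predict(x):
    return 1 if x > threshold else 0

# Evaluate on the unseen test set.
accuracy = sum(predict(x) == y for x, y in test_set) / len(test_set)
```

Because the synthetic data here mirrors the real distribution by construction, the accuracy is meaningful; with a test set that misrepresents production data, the same number would mislead you.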
Once your models are ready, it is about their consumption: using them for the predictions they are to make, and integrating those predictions into the bigger applications and systems. This requires deploying the models and wrapping them in web services that expose REST APIs. Data can be passed to these APIs and a prediction received from the models. Predictions can be generated as a batch process or in real time, depending upon the use case.
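A minimal sketch of such a prediction endpoint, using only the standard library's WSGI convention so it stays self-contained. The `predict` function is a hypothetical stand-in for a trained model, and the feature names are invented; a real deployment would sit this behind a WSGI server and add validation, authentication, and logging.

```python
import json
from io import BytesIO

# Hypothetical stand-in for a trained model: a linear scoring function.
def predict(features):
    return 0.5 * features["age"] + 0.0001 * features["income"]

def application(environ, start_response):
    """Minimal WSGI app: POST JSON features, receive a JSON prediction."""
    size = int(environ.get("CONTENT_LENGTH") or 0)
    features = json.loads(environ["wsgi.input"].read(size))
    body = json.dumps({"prediction": predict(features)}).encode()
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]

# Simulate one request without a network, the way a unit test would.
payload = json.dumps({"age": 40, "income": 50000}).encode()
environ = {"CONTENT_LENGTH": str(len(payload)), "wsgi.input": BytesIO(payload)}
status = {}
response = application(environ, lambda s, h: status.update(code=s))
result = json.loads(response[0])
```

The same `predict` function can serve both modes: called per request for real-time use, or mapped over a whole file of rows for batch scoring.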
The IT support needed for the development of the models, the entire pipeline, is often called MLOps. It is not restricted to requirements, design, development and deployment, but involves further governance around the entire model lifecycle. This includes regulatory aspects, model maintenance over time to address issues such as data drift, model explainability, and the surrounding setup that facilitates the models' consumption.
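Data drift monitoring, one piece of that ongoing governance, can be sketched with a deliberately simple check: flag when the live feature mean moves far from the training-time mean. The sample values and the three-standard-error threshold are illustrative assumptions; production systems typically use richer tests over many features.

```python
import statistics

# Feature values seen at training time vs. values arriving in production.
# The numbers are illustrative; in practice you would sample real traffic.
training_values = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8, 10.1, 10.4]
production_values = [13.0, 13.5, 12.8, 14.1, 13.2, 13.7, 12.9, 13.4]

def drifted(reference, live, z_threshold=3.0):
    """Flag drift when the live mean sits more than z_threshold standard
    errors away from the reference mean: a deliberately simple check."""
    mu = statistics.mean(reference)
    se = statistics.stdev(reference) / len(reference) ** 0.5
    z = abs(statistics.mean(live) - mu) / se
    return z > z_threshold

alert = drifted(training_values, production_values)
```

Run on a schedule against fresh production samples, a check like this gives the governance process an objective trigger for retraining rather than waiting for prediction quality to visibly degrade.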
Models are the creation of the data they are fed. Hence, any bias left in the data will result in a bias in the predictions the models make. While selecting and preparing a dataset for model training, it is important that the training dataset accurately represents the real scenario. Bias can mean different things in different use cases and datasets. A true understanding of all such possible biases, and protection of the dataset against them, should be planned and adhered to early on, in the data preparation and analysis phase. It is akin to detecting defects early in a software development lifecycle.
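One concrete early check along these lines is comparing label rates across groups in the training data. The rows, group names, and the notion of a "protected group" below are illustrative assumptions; what counts as bias, and which gaps matter, depends on the use case.

```python
# Illustrative training rows: (group, label). In a real project the groups
# and labels come from your own schema and domain definitions of fairness.
rows = [("A", 1), ("A", 0), ("A", 1), ("A", 1),
        ("B", 0), ("B", 0), ("B", 0), ("B", 1)]

def positive_rate(group):
    labels = [y for g, y in rows if g == group]
    return sum(labels) / len(labels)

rates = {g: positive_rate(g) for g in {g for g, _ in rows}}
# A large gap between groups is an early warning of representation bias
# in the training data, caught before any model is trained on it.
gap = max(rates.values()) - min(rates.values())
```

Like a failing unit test early in a software lifecycle, a large `gap` here is cheap to investigate at data preparation time and expensive to discover after deployment.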
Predictions from models may not be acceptable blindly. The need for explanations of what led to those predictions varies from use case to use case. It may arise from business requirements, regulatory expectations, and other sources. That need, and the design of the end-to-end processes through which the predictions are consumed, will drive the level of explainability the model is expected to fulfill. Not all models are black boxes, nor can all models be completely white box. An appropriate level of explainability should be planned and delivered based on the use case and all other expectations surrounding it.
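One widely applicable, model-agnostic way to get a first level of explainability is permutation importance: shuffle one feature and measure how much accuracy drops. The toy data and the stand-in `model` function below are assumptions for illustration; the technique itself works with any fitted model.

```python
import random

random.seed(2)

# Toy dataset: the label depends on feature 0 only; feature 1 is pure noise.
X = [[random.random(), random.random()] for _ in range(300)]
y = [1 if row[0] > 0.5 else 0 for row in X]

def model(row):
    """Stand-in for a trained model; in practice, your fitted estimator."""
    return 1 if row[0] > 0.5 else 0

def accuracy(rows, labels):
    return sum(model(r) == t for r, t in zip(rows, labels)) / len(labels)

def permutation_importance(feature_idx):
    """Drop in accuracy after shuffling one feature column: a simple,
    model-agnostic explainability signal."""
    baseline = accuracy(X, y)
    shuffled_col = [row[feature_idx] for row in X]
    random.shuffle(shuffled_col)
    perturbed = [row[:feature_idx] + [v] + row[feature_idx + 1:]
                 for row, v in zip(X, shuffled_col)]
    return baseline - accuracy(perturbed, y)
```

A signal like this will not satisfy every regulatory expectation, but it gives consumers of a black-box model a defensible answer to "which inputs drove this?", which is often the first level of explainability a use case demands.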
To appreciate the uniqueness of a data science project, it is often said that data science is both art and science. The art lies in the way feature engineering is done, and it is an art indeed. A painter knows how to paint, but ask him to define it in words precisely enough that someone else could paint from them, and he cannot. You don't know exactly how you do it, but you do it. Feature engineering is like that: you gain a feel for the data you will feed into your models to train them. It is the most important phase, and you come back to it again and again to arrive at your ideal model. When you lead a project, ensuring the team understands this and keeps the highest focus on feature engineering is key to developing high-quality models that will meet the goals of the project. As a painter gets a feel for colors, strokes, and how everything blends, a data scientist should get a feel for the data, its meaning, and why it is important. That feel ultimately determines the effectiveness of the models and the quality of the predictions they make.
Model development is iterative, but not with a linear progression. To get it right, the first thing is to be ready to experiment. Each iteration tests a new hypothesis built on what you learned from the previous ones. You won't easily get a result that meets your expectations; identifying which factor has an impact, and how to fix it, takes numerous cycles. When you deal with very large datasets, the resource needs and cost of each experiment are significant. Thoughtful planning and systematic learning from each experiment will take the results in the right direction. Collaboration with the right people throughout the process yields far more benefit than working in silos. Agile practices such as a regular standup meeting, where each team member shares what they did, what went well and where they are facing challenges, keep the team in synergy so the right hurdles are cracked in time. Through such practice, even a small gain or a big failure is quickly uncovered and given the right level of attention and support, without loss of time and resources.
Having the team use visuals to share their work, achievements as well as hurdles, on a regular basis and in any creative way, is again highly useful for maintaining a sense of the collective picture and status. No good or bad news comes as a surprise, and an authentic feel for things keeps the team grounded and glued together. Individuals presenting knowledge-sharing sessions improves the team's understanding of each other's work, strengthening each member's sense of ownership of their area. An empowered work culture with fully engaged members leads to real productivity and results.
[Related Article: How to Lead a Great Code Sprint]
In summary, it is easy to see a data science project as parallel to any software development project, with certain unique characteristics. It is an exploratory journey and a close candidate for agile project practices.
Founder & CEO, SEEM — a regTech startup, with a product “Compliance by Design” for Financial Regulatory & Risk Support for GSIBs, leader for the data science team