This is the first post in a series of three articles in which we discuss tips and guidelines for successful data science implementations. This post covers the things you should think about before you write the first line of code.
At a high level, a data science project has three phases:
1) Gathering information needs from users,
2) Designing an analytical approach, and
3) Implementing the pipeline and evaluating the results of the data product(s).
It’s usually more complex in practice than it is on paper. During the project you have to make several decisions about platforms, technologies, storage, algorithms, implementation, and people. Not having clear definitions at the beginning of the project can further complicate matters.
Clearly defining the problem is crucial and requires framing it as a question to answer. Make the scope of the project explicit to everyone involved. Define the aim(s) of the project, the users and stakeholders, and how you will validate the results.
Meet early and often with users and stakeholders, and ask for input and feedback. Involve colleagues and explain to them what you are trying to achieve. Consider their feedback and experience. Doing this often will help you get new ideas.
Pick out the tools you are going to use in the project while being open to trying out new things. It could be a new library, IDE, or platform.
The Question of the Data Science Team
There is an important question that can give focus and balance to data science efforts. Data analysts often receive a task and report the results back after a few days. This means they work in isolation from the rest of the team. That is why one simple question is key within the data science team.
“What are you working on?” is one of the most important questions for a data science team because:
- It demonstrates one of the most important traits for a data scientist: curiosity.
- It creates connections between peers and projects.
- It gives you the opportunity to provide and receive feedback from experts and peers.
- It helps you understand your own work better, because you have to explain it.
Make sure you do this exercise at the beginning of the project, and keep doing it at least once per sprint. Use this occasion to gather suggestions for analyses and potential approaches.
As a suggestion for the duration of a show & tell meeting: allow 30 minutes to explain the problem and your approach, and add 15 extra minutes per attendee. For example, if you invite two colleagues, add 30 extra minutes, so the meeting should last 60 minutes.
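The rule of thumb above is simple arithmetic, and a minimal sketch makes it easy to reuse when scheduling. The function name and parameter names are illustrative, not part of any established tool; the defaults come from the suggestion above (30 minutes base, 15 minutes per person invited).

```python
def show_and_tell_minutes(invited: int, base: int = 30, per_person: int = 15) -> int:
    """Suggested meeting length: a fixed base plus extra time per person invited."""
    if invited < 0:
        raise ValueError("invited must be zero or more")
    return base + per_person * invited


# Two invited colleagues: 30 + 2 * 15 = 60 minutes.
print(show_and_tell_minutes(2))
```

Treat the result as a starting point; a large audience may justify capping the total rather than letting it grow linearly.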
Find the right question
The question you want to answer should be of interest to your organization. Your question is the compass of your project because it determines its direction. It’s not about data, it’s not about technology, it’s about your question. A well-defined question aligns your design with your goal.
Spend time designing
Spend at least 15% of the project time on design, if not more. What types of analysis would you like to run? What datasets do you need to answer the question? Think about how you can deliver the results.
Delivering results is an important part of any analytical project, and it calls for solid visualizations to drive home your points.
When designing, consider the data collection process and data sources; the technologies, user accounts, code repositories, infrastructure, networking, and storage you will need; and the expected results and how to deploy and visualize them.
Involve the user early in the process
Communication is key. Maintain constant and clear communication with the users. You can understand the why of the project by communicating directly with the final user.
Paraphrasing Nietzsche: “He who internalizes the why can bear almost any how.” Identify the main user and spend time with them. Meetings with the user don’t need to be a waste of time; use them to better shape the project. How much do you know now about the events under study? How will the insights from the project be used? Who will be the main consumers of this information? What similar solutions exist? Have there been similar attempts in the past?
Build prototypes and show them to the users for feedback. Use that feedback to improve your designs.
A key success factor is getting executive support for your project. C-level support is a good way to motivate the people you need to collaborate on what you are trying to do. Executive support aligns people and resources with the project.
Set realistic expectations
Be specific and transparent about the deliverables of the project.
Do not commit deliverables to fixed dates. Instead, document what you are doing and what you are trying to achieve. Hold regular meetings with users and stakeholders to present progress and next steps. The project is a living entity of value discovery for the company.
Data collection starts once you have a broad understanding of the goal of the project. Look for new data sources that might be useful to add to the project. You might want to consider Open Data as an option. As a rule of thumb, make sure that the data you use in the project is going to be available once you deploy it.
If you are thinking of buying datasets, test them first using a sample from the vendor before you purchase the full dataset.
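One way to test a vendor sample is to profile it before any purchase decision. The sketch below is only an illustration using the standard library: it assumes the sample has been loaded as a list of dictionaries (for example with `csv.DictReader`), and the function name, the column names, and the checks chosen (missing values and exact duplicate rows) are assumptions rather than a complete data quality procedure.

```python
from collections import Counter


def profile_sample(rows, required_columns):
    """Report row count, missing-value rate per required column, and duplicate rows."""
    total = len(rows)
    missing = Counter()
    seen = set()
    duplicates = 0
    for row in rows:
        # Count empty or absent values in the columns we care about.
        for col in required_columns:
            value = row.get(col)
            if value is None or str(value).strip() == "":
                missing[col] += 1
        # Detect rows that repeat an earlier row exactly.
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {
        "rows": total,
        "missing_rate": {
            col: (missing[col] / total if total else 0.0) for col in required_columns
        },
        "duplicate_rows": duplicates,
    }


# Hypothetical three-row sample with one empty price and one repeated row.
sample = [
    {"id": "1", "price": "10"},
    {"id": "2", "price": ""},
    {"id": "1", "price": "10"},
]
print(profile_sample(sample, ["id", "price"]))
```

A report like this gives you concrete questions to raise with the vendor before committing to the full dataset.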
Estimate storage and define the pipelines of the project. State any special needs the project has, such as a data lake, a staging area, a landing area, processing power, platform, or language.
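A back-of-the-envelope storage estimate is often enough at this stage. The sketch below assumes a simple model that is not taken from this post: daily row volume times average row size times retention period, multiplied by a replication factor and an overhead margin for indexes and intermediate files. All parameter names and default values are hypothetical and should be replaced with figures from your own environment.

```python
def estimate_storage_gb(
    rows_per_day: int,
    bytes_per_row: int,
    retention_days: int,
    replication: int = 1,
    overhead: float = 0.2,
) -> float:
    """Rough storage estimate in gigabytes (1 GB = 10**9 bytes)."""
    raw_bytes = rows_per_day * bytes_per_row * retention_days
    total_bytes = raw_bytes * replication * (1 + overhead)
    return total_bytes / 1e9


# Hypothetical workload: one million 500-byte rows per day, kept for a year,
# stored in three copies with the default 20% overhead.
print(round(estimate_storage_gb(1_000_000, 500, 365, replication=3), 1))
```

The estimate ignores compression and growth in daily volume, so revisit it once you have real data from the first iterations.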
Divide the work into subtasks
Split the work into pieces for teams of 1 or 2 people. If a task seems bigger than that, divide it again until you have pieces of work of about 8 hours of effort.
Build a backlog of the project
Make a list of the tasks with an estimate of the effort needed to complete each one. Rank the tasks by importance. This prioritized list is your backlog. You can build a kanban board to make the work of each team member transparent.
Because the backlog is ranked by importance, the first task in the backlog is always the next one to work on.
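The backlog idea can be sketched in a few lines. This is a minimal illustration, not a replacement for a kanban tool: the `Task` fields, the importance scale (higher means more important), and the example tasks are hypothetical, and the 8-hour threshold reflects the guideline of roughly 8 hours of effort per piece of work.

```python
from dataclasses import dataclass

MAX_TASK_HOURS = 8  # guideline: pieces of work of about 8 hours of effort


@dataclass
class Task:
    name: str
    importance: int  # hypothetical scale: higher means more important
    effort_hours: float


def build_backlog(tasks):
    """Return tasks ranked by importance, most important first."""
    return sorted(tasks, key=lambda t: t.importance, reverse=True)


def needs_splitting(tasks, max_hours=MAX_TASK_HOURS):
    """Return tasks whose estimated effort exceeds the size guideline."""
    return [t for t in tasks if t.effort_hours > max_hours]


tasks = [
    Task("Build dashboard prototype", importance=2, effort_hours=16),
    Task("Profile vendor sample", importance=3, effort_hours=4),
    Task("Document data sources", importance=1, effort_hours=6),
]

backlog = build_backlog(tasks)
print("Next task:", backlog[0].name)
print("Split further:", [t.name for t in needs_splitting(backlog)])
```

Python's `sorted` is stable, so tasks with equal importance keep the order in which they were listed, which gives you a simple tie-breaking rule.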
Prepare for iterations. Prepare to present weekly or monthly results to the users. Use their feedback to move forward with the project.
Start coding after design
Start coding when you run out of questions about the project. Have discussions about every aspect of the implementation and document the agreements.
Think of a broader ecosystem
Data products aren’t isolated entities. Think of what the organization already has. Think of how your new system can integrate with, feed, or use other data products in the organization.
You are looking for synergies between data products. Synergy optimizes costs and improves efficiency.
Check the strategic goals of the organization. Analytical projects aligned with strategic goals add more value to the company.
Many times, the final version is not what was defined at the beginning of the project.
Developing data products is an evolving process. I hope the pieces of advice in this post will help with your projects.
In summary, get the question right. Design around the question and your process will get you to the solution. Involve users as business allies who can provide feedback. Get executive support and doors will open more easily. Divide and conquer. And finally, think of your system as a component of an ecosystem, so that “data synergies” can happen in your organization.
Diego Arenas, ODSC
I've worked in BI, DWH, and Data Mining, and hold an MSc in Data Science. I have experience with multiple BI and data science tools, always thinking about how to solve information needs and add value to organisations from the data available: Business Objects, Pentaho, Informatica Power Center, SSAS, SSIS, SSRS, MS SQL Server from 2000 to 2017 and other DBMS, Tableau, Hadoop, Python, R, SQL, and predictive modelling. My interests are in Information Systems, Data Modeling, Predictive and Descriptive Analysis, Machine Learning, Data Visualization, and Open Data. Specialties: data modeling, data warehousing, data mining, performance management, business intelligence.