Editor’s Note: If you’re interested in data storage and the intersection of data science and data engineering, make sure to see Stephanie at ODSC West 2019 and watch her talk, “Integrating Elasticsearch with Analytics Workflows.”
Bringing data science and data engineering together in service of business goals.
Data scientists at companies across industries have had this experience: you join a new team, excited to be a contributor, and sit down to start generating the models that will help your company leaders make better decisions and get ahead in the marketplace. Then you hit a wall: accessing the data. Perhaps you don’t have experience with the data storage solution being used, and no one seems to have a clear idea of how data extraction for analysis is supposed to work. Perhaps the data storage option is performant for ingestion but not for extraction, and you only discover this when you actually try to get the data out. Maybe you are conceptualizing your problem as a relational data challenge, but the data storage solution is not architected relationally.
In a vast number of cases, data scientists are the users of data, but not the decision-makers or maintainers of the data storage. Those responsibilities more commonly fall on data engineers or devops professionals. These members of the company have many things to consider when choosing how to store data—they are not setting out to make end users’ lives more difficult. But when you must balance ingestion performance, reliability, data security, cost, scalability, and myriad other needs, end user friendliness can be low on the list of priorities.
This bottleneck is an interesting problem to solve. Without smooth communication and mutual understanding between those managing the data storage and those expected to produce business insights from the data, the whole enterprise can screech to a halt. Data scientists feel frustrated and unable to do their best work, and data engineers feel that the complexity and challenge of their responsibilities are not appreciated.
We all need to meet each other halfway in the business data space. In the most effective companies, the ones that really gain an advantage from their data, people look beyond the edges of their own desks: everyone understands a common goal and recognizes that no team can achieve it alone.
For data scientists: come prepared with a clear explanation of what functionality you require to produce insights. Many of us have had the experience of being told by business leadership to do a specific task, and having to ask “Wait, what is the goal we want to accomplish?” before discovering that a different task or method would actually be much more effective. This could be the case here as well, except that now you are the one who needs to step back and think about your goal. Talk frankly with your data engineering staff about what you need to be able to do, and be as flexible as you can about the solution.
For data engineering, recognize that all your hard work on storing data well is meaningless unless the business can use the data to meet its goals. Building the most secure safe in the world to store a priceless piece of artwork is great, but if the patron can’t open the safe to see the artwork, then the whole enterprise was for nothing. Listen to your data scientists, and remember that the big goal, shared by the entire organization, is helping the business succeed. Making the data available for business needs is part of your job too.
At ODSC West 2019, I’ll be discussing the data storage solution Elasticsearch, and describing its pros and cons as a data storage choice, with a strong emphasis on the end user experience. Attendees will get a chance to use it hands-on, and see for themselves what the ETL is like. I’d encourage attendees to use this case study as a model for how to think about bridging the data science and data engineering gap.