Starting a Data Science Project

I spoke in a webinar this past Saturday about how to get into Data Science. One of the questions asked was “What does a typical day look like?” I think there is a big opportunity to explain what really happens on a large project before any machine learning takes place.

I’ve previously written about thinking creatively for feature engineering, but there is even more to getting ready for a data science project. You need to get buy-in on the project from other areas of the business to ensure you’re delivering insights that the business wants and needs.

The road to getting to the machine learning algorithm looks something like:

  • Meeting
  • More meetings
  • Data gathering
  • Feature engineering
  • Then machine learning

In this article, we’re only going to cover the first 3 bullets. Researching the best solution to use might also be part of the process, but here I know I’m doing a segmentation.

There are a ton of meetings that take place before I ever write a line of SQL for a big project.  If you read enough comments/blogs about Data Science, you’ll see people say it’s 95% data aggregation and 5% modeling (or some other similar split), but that’s also not quite the whole picture. I’d love for you to fully understand what you’re signing up for when you become a data scientist. 

As I mentioned, the first step is really getting buy-in on your project. It’s important that, as an Analytics department, we’re working to solve the needs of the business. We want to help the rest of the business understand the value that a segmentation could deliver by pitching the idea in meetings with these stakeholders. I’m also not a one-woman show: my boss takes the opportunity to talk about what we could potentially learn and act on with this project whenever he gets the chance. We now have consensus across multiple groups that they would like us to deliver a behavioral customer segmentation.

But I’m still not just diving into SQL. There are people on my team and in the previously mentioned areas of the business who can help brainstorm what data we might have available that could tell us something about our customers. In our case, data exists that we haven’t previously had the opportunity to analyze.

The first step was meeting with my team to discuss every piece of data that we could think of that might be relevant.  Thinking of things like:

  • Whether something might be a proxy for customers who are more “tech savvy”. Maybe this is having a business email address as opposed to a Google address, or maybe it is using our more advanced features.
  • Whether census data could tell us if a customer’s zip code is in a rural or urban area. Those customers might behave differently.
  • What is available in the BigData environment? In the Data Warehouse? In other data sources within the company? When you really try to list everything, you find that this can be a large undertaking.
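
As a rough illustration of how ideas like these could turn into candidate inputs, here is a minimal Python sketch. Everything in it is an assumption made for the example: the field names, the list of free email domains, the made-up zip codes, and the small lookup table standing in for census data. It is not the code or data used on this project.

```python
# Hypothetical illustration only: field names, the free-mail domain list,
# and the zip lookup below are assumptions, not real project data.
FREE_EMAIL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com"}

# Stand-in for a census-derived lookup of zip code -> area type.
# The zip codes and labels here are made up for the example.
ZIP_AREA_TYPE = {"00001": "rural", "00002": "urban"}


def has_business_email(email: str) -> bool:
    """Return True when the address is not on a known free-mail domain."""
    domain = email.rsplit("@", 1)[-1].strip().lower()
    return domain not in FREE_EMAIL_DOMAINS


def area_type(zip_code: str) -> str:
    """Look up rural/urban for a zip code; 'unknown' if it is not in the table."""
    return ZIP_AREA_TYPE.get(zip_code, "unknown")


def build_features(customer: dict) -> dict:
    """Turn one raw customer record into candidate segmentation inputs."""
    return {
        "customer_id": customer["customer_id"],
        "has_business_email": has_business_email(customer["email"]),
        "uses_advanced_features": customer.get("advanced_feature_uses", 0) > 0,
        "area_type": area_type(customer["zip_code"]),
    }


customers = [
    {"customer_id": 1, "email": "owner@examplebakery.com",
     "zip_code": "00001", "advanced_feature_uses": 3},
    {"customer_id": 2, "email": "someone@gmail.com", "zip_code": "00002"},
]
features = [build_features(c) for c in customers]
```

Each flag is only a proxy, so it is worth checking with the people who know the data whether it really captures what you think it does before it goes into a segmentation.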

Next, I met with marketing and operations at different meetings to make sure we weren’t missing anything, and see if they had more potential thoughts on inputs. It’s also just good practice to keep ongoing communication throughout the course of a project.

Once we have a list of potential data to find, the meetings start to help track all that data down. You certainly don’t want to reinvent the wheel here. No one gets brownie points for writing all of the SQL themselves when leveraging previously written queries would have taken half the time.

If I know of a project where someone has already created a few cool features, I email them and ask for their code. We’re a team.

In the end, there were 6 different people outside of my team that I needed to connect with, because they knew these tables or data sources better than members of my team did. Asking those people about those tables meant scheduling more meetings.

I honestly enjoy this process. It’s an opportunity to learn about the data we have, work with others, and think of cool opportunities for feature engineering.

Also worth noting: these meetings with people outside of your team are probably not their highest priority, so scheduling might get tricky. Don’t get discouraged; just be persistent. The more clear and concise you can be about exactly what you need from others, the more quickly you are likely to get assistance. And if they’re in the same office, you can always try swinging by their desk.

The mental picture is often painted of data scientists sitting in a corner by themselves for months, and then coming back with a model that no one asked for. But by getting buy-in and collaborating with other teams and your own team members, this doesn’t need to be the case. You can be a thought partner who proactively delivers solutions to better target the customer base and personalize the customer experience.


 

Kristen Kehrer

Kristen is currently a Senior Data Scientist for Constant Contact. After completing an MS in Statistics and a BS in Mathematics, she started her career utilizing Econometric Time Series analysis to forecast electric and gas load in the utility industry, leveraging Neural Nets and ARIMA models. Since then, Kristen has spent her career in Analytics and Data Science in both healthcare and e-commerce. Her most recent position was managing a team at Vistaprint working to optimize the Vistaprint website and increase conversion. Kristen writes about different topics in data science at www.kristenkehrer.com.