The 7 Steps of a Data Project
It’s hard to know where to start once you’ve decided that yes, you want to become more data-driven. Just looking at all the technologies you have to understand and all the languages you’re supposed to master is enough to make you dizzy.
Well, building your first data project is actually not that hard. And yes, Dataiku DSS helps, but what will really help you is understanding the data science process. Becoming data-driven is about this: knowing the basic steps and following them to go from raw data to a machine learning model.
The steps to complete a data project were conceptualized a while ago as the KDD process (for Knowledge Discovery in Databases), and made popular with lots of vintage-looking graphs like this one.
This is our take on the steps of a data project in this awesome age of big data!
STEP 1: UNDERSTAND THE BUSINESS
Understanding the business is the key to ensuring the success of your data project. To motivate the different people needed to get your project from design to production, your project must answer a clear business need. So before you even think about the data, go out and talk to the people who could use data to make their processes or their business better. Then sit down and define a timeline and concrete indicators to measure. I know, processes and politics seem boring, but in the end, they turn out to be quite useful!
If you’re working on a personal project, playing around with a dataset or an API, this may seem irrelevant. It’s not. Just downloading a cool open dataset is not enough. I can’t tell you how many cool datasets I’ve downloaded and never done anything with… So settle on a question to answer, or a product to build!
STEP 2: GET YOUR DATA
Once you’ve gotten your goal figured out, it’s time to start looking for your data. Mixing and merging data from as many data sources as possible is what makes a data project great, so look as far as possible.
Here are a few ways to get yourself some data:
- Connect to a database: ask your data and IT teams for the data that’s available, or open up your private database, and start digging through it to understand what information your company has been collecting.
- Use APIs: think of the APIs of all the tools your company has been using, and the data they have been collecting. You have to get these all set up so you can use those email open/click stats, the information your sales team put in Pipedrive or Salesforce, the support tickets somebody submitted, etc. If you’re not an expert coder, plugins in DSS give you lots of possibilities to bring in external data!
- Look for open data: the Internet is full of datasets to enrich what you have with extra information; census data will help you add the average income for the district where your user lives, or OpenStreetMap can show you how many coffee shops are on their street. A lot of countries have open data platforms (like data.gov in the US). If you’re working on a fun project outside of work, these open datasets are also an incredible resource! Check out Kaggle, or this GitHub repository with lots of datasets, for example.
- Use more APIs: another great way to start a personal project is to make it super personal by working on your own data! You can connect to your social media tools, like Twitter or Facebook, to analyze your followers and friends. It’s extremely easy to set up these connections with tools like IFTTT. For example, I have a bunch of recipes that collect the music I listen to, the places I visit, my steps and the kilometers I run, the contacts I add, etc. And this can be useful for businesses as well! You can analyze very interesting trends on Twitter, or even monitor the competition.
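To make the first option a bit more concrete, here is a minimal sketch of what "start digging through a database" could look like in Python. Everything in it is made up for illustration: the in-memory SQLite database stands in for your company's real one, and the `customers` table and its columns are hypothetical. A real database will need its own driver and credentials.

```python
import sqlite3

# Hypothetical example: an in-memory SQLite database stands in for
# your company's real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, country TEXT, signup_date TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "France", "2015-01-12"), (2, "USA", "2015-02-03"), (3, "France", "2015-02-20")],
)

# Start digging: how many customers do we have per country?
rows = conn.execute(
    "SELECT country, COUNT(*) FROM customers GROUP BY country ORDER BY country"
).fetchall()
conn.close()

for country, n in rows:
    print(country, n)
```

A quick count like this is often the fastest way to find out what your company has actually been collecting.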
STEP 3: EXPLORE AND CLEAN YOUR DATA
(AKA the dreaded preprocessing step that typically takes up 80% of the time dedicated to a data project)
Once you’ve gotten your data, it’s time to get to work on it! Start digging to see what you’ve got and how you can link everything together to answer your original goal. Start taking notes on your first analyses, and ask the business people or the IT team questions to understand what all your variables mean! Because not everyone will get that c06xx is a product category referring to something awesome.
Once you understand your data, it’s time to clean it! You’ve probably noticed that even though you have a country feature for instance, you’ve got different spellings, or even missing data. It’s time to look at every one of your columns to make sure your data is homogeneous and clean.
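Here is a small sketch of what cleaning a country column could look like, assuming plain Python and a hand-made lookup table. The raw values and the `CANONICAL` mapping are invented for the example; in practice you would build the mapping by looking at the distinct values in your own column.

```python
# Hypothetical raw values for a "country" column.
raw_countries = ["France", "france ", "FR", "U.S.A.", "usa", "", None, "United States"]

# Hand-made mapping from known spellings to one canonical name (an assumption:
# you would build this from the distinct values you actually see).
CANONICAL = {
    "france": "France",
    "fr": "France",
    "usa": "United States",
    "united states": "United States",
}

def clean_country(value):
    """Return a canonical country name, or None for missing or unknown values."""
    if value is None:
        return None
    key = value.strip().lower().replace(".", "")
    if not key:
        return None
    return CANONICAL.get(key)

cleaned = [clean_country(v) for v in raw_countries]
missing = sum(1 for v in cleaned if v is None)
print(cleaned)
print("missing:", missing)
```

Counting what ends up as `None` tells you how much of the column is missing or still needs a mapping entry.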
Warning! This is probably the longest, most annoying step of your data project. Data scientists report that data cleaning takes up about 80% of the time spent on a project. So it’s going to suck a little bit. Luckily, tools like Dataiku DSS can make this much faster!
STEP 4: ENRICH YOUR DATASET
Now that you’ve got clean data, it’s time to manipulate it to get the most value out of it. This is the time to join all your different sources, and group logs, to get your data down to the essential features.
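As a rough sketch of joining sources and grouping logs, assuming your data fits in memory and using nothing but plain Python. The `users` table, the `visit_logs` records and all the field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sources: a users table and a raw log of page visits.
users = {1: {"country": "France"}, 2: {"country": "United States"}}
visit_logs = [
    {"user_id": 1, "page": "/home"},
    {"user_id": 1, "page": "/pricing"},
    {"user_id": 2, "page": "/home"},
    {"user_id": 1, "page": "/home"},
]

# Group the logs: one entry per user with a visit count and the distinct pages.
visits = defaultdict(int)
pages = defaultdict(set)
for log in visit_logs:
    visits[log["user_id"]] += 1
    pages[log["user_id"]].add(log["page"])

# Join the aggregated logs back onto the users table.
dataset = [
    {
        "user_id": uid,
        "country": info["country"],
        "n_visits": visits[uid],
        "n_distinct_pages": len(pages[uid]),
    }
    for uid, info in sorted(users.items())
]
for row in dataset:
    print(row)
```

The result has one row per user, which is usually the shape you want before moving on to graphs or models.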
You’ll then start manipulating the data to extract lots of valuable features: for example, getting a country and even a town out of a visitor’s IP address, or extracting the time of day or week of year from your dates to get something more meaningful.
The possibilities are pretty much endless, and you’ll get a pretty good idea of the operations you can execute by scrolling through Dataiku DSS’s processors in the Lab.
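The date part of this is easy to sketch with Python's standard library. The timestamp format and the cut-off hours for "morning", "afternoon" and so on are assumptions chosen for the example. (Getting a country from an IP address needs an external geolocation database, so it's left out here.)

```python
from datetime import datetime

def date_features(timestamp):
    """Turn a 'YYYY-MM-DD HH:MM:SS' timestamp string into a few simple features."""
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    hour = dt.hour
    # Arbitrary cut-offs, picked for illustration only.
    if hour < 6:
        part_of_day = "night"
    elif hour < 12:
        part_of_day = "morning"
    elif hour < 18:
        part_of_day = "afternoon"
    else:
        part_of_day = "evening"
    return {
        "hour": hour,
        "part_of_day": part_of_day,
        "week_of_year": dt.isocalendar()[1],  # ISO 8601 week number
        "is_weekend": dt.weekday() >= 5,      # Monday is 0, Sunday is 6
    }

features = date_features("2015-07-04 21:15:00")
print(features)
```

A raw timestamp rarely helps a model, but "Saturday evening" very well might.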
STEP 5: BUILD VISUALISATIONS
You now have a nice dataset (or maybe several), so this is a good time to start exploring it by building graphs. When you’re dealing with large volumes of data, they’re the best way to explore and communicate your findings.
You’ll find lots of tools available that make this step fun, both for you and for your audience. The tricky part is always to be able to dig into your graphs to answer any question somebody might have about an insight. That’s when the data preparation comes in handy: you’re the one who did the dirty work, so you know the data like the back of your hand!
If this is the final step of your project, it’s important to use APIs and plugins so you can push those insights to where your end users want to have them. So get integrated with their tools!
Your graphs don’t have to be the end of your project though. They’re a way to uncover more trends that you want to explain. They’re also a way to develop more interesting features. For example, by putting your data points on a map you could perhaps notice that specific geographic zones are more telling than specific countries or cities.
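One simple way to turn that kind of map insight into a feature is to bucket coordinates into grid cells. This is only a sketch: a square grid is a crude stand-in for real geographic zones, and the `zone` helper, the half-degree cell size and the sample points are all made up for the example.

```python
import math
from collections import Counter

def zone(lat, lon, cell_size=0.5):
    """Map a coordinate to a grid cell, identified by its south-west corner in degrees."""
    return (
        math.floor(lat / cell_size) * cell_size,
        math.floor(lon / cell_size) * cell_size,
    )

# Hypothetical visitor locations (latitude, longitude).
points = [(48.85, 2.35), (48.86, 2.29), (48.80, 2.13), (45.76, 4.84), (43.30, 5.37)]

# Count how many visitors fall in each cell.
counts = Counter(zone(lat, lon) for lat, lon in points)
for cell, n in counts.most_common():
    print(cell, n)
```

The cell a point falls in can then be added to your dataset as a new column, alongside country and city.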
STEP 6: GET PREDICTIVE
By working with clustering algorithms (aka unsupervised learning), you can build models to uncover trends in the data that were not distinguishable in graphs and stats. These create groups of similar events (or clusters) and more or less explicitly express which features are decisive in these results. Tools like Dataiku DSS help beginners run basic open source algorithms easily in clickable interfaces.
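To give an idea of what a clustering algorithm does under the hood, here is a bare-bones sketch of k-means, one common clustering method, in plain Python. It is an illustration, not what Dataiku DSS runs: the customer data is invented, the starting centroids are picked by hand, and a real project would scale the features and use a tested library.

```python
import math

def kmeans(points, centroids, n_iter=10):
    """Plain k-means: assign each point to its nearest centroid, move each
    centroid to the mean of its points, and repeat."""
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    # Final assignment of each point to its nearest centroid.
    labels = [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]
    return centroids, labels

# Hypothetical customers: (visits per month, average basket).
customers = [(1, 10), (2, 12), (1, 11), (9, 50), (10, 52), (11, 49)]

# Hand-picked starting centroids (an assumption, to keep the example deterministic).
centroids, labels = kmeans(customers, centroids=[(1, 10), (11, 49)])
print(labels)
print(centroids)
```

The two groups that come out (occasional small spenders and frequent big spenders) are the kind of structure that is hard to spot by staring at rows.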
More advanced data scientists can then go even further and predict future trends with supervised algorithms. By analyzing past data, they find features that have impacted past trends, and use them to build predictions. More than just gaining knowledge, this final step can lead to building whole new products and processes. To get these into production, you’ll need data scientists and engineers. Still, it’s important for everyone involved (business users and analysts as well) to understand the process, so they can understand what comes out in the end.
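The simplest possible version of "learn from past data, then predict" is fitting a straight line. The sketch below uses ordinary least squares in plain Python; the sales figures are invented and perfectly linear on purpose, so don't expect real data to behave this nicely.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical past data: month number -> units sold.
months = [1, 2, 3, 4, 5]
sales = [110, 120, 130, 140, 150]

# Learn the trend from the past, then predict month 6.
slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept
print(forecast)
```

Real supervised models use many features instead of one, but the idea is the same: learn the relationship on past data, then apply it to new inputs.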
STEP 7: ITERATE
The main goal in any business project is to prove its effectiveness as fast as possible to justify, well, your job. Data projects are the same. By saving time on data cleaning and enriching, you can get to the end of the project fast and get your first results. These first insights are a great starting point for uncovering more necessary cleaning and developing more features, in order to continuously improve results and model outputs.
Now that you’ve got the skills, get started right now by building projects in Dataiku DSS!
Originally posted at www.dataiku.com/blog