Data Visualization for Data Scientists – Choosing the Right Tool for the Job
Posted by ODSC Community, February 4, 2021
Photo by David Pisnoy on Unsplash.
Data visualization sometimes gets categorized as a field separate from machine learning or data science. Skill in designing effective, attractive plots and graphs doesn’t show up in job descriptions in the same way as experience with Keras or XGBoost. I think this is a mistake. In my years of data science practice, I have never built a model without visualizing some of the data, performance metrics, or output – or more often, all three. While not all data scientists are visual learners like me, it’s still true that visualization gives us a way to experience and understand our data that is distinct from what tables or text can provide.
This is a big reason why it took me a bit longer than I’d like to develop my skills in machine learning in Python. The data visualization ecosystem in Python has just never stood up to the tools R makes available, and visualizing is a non-negotiable part of machine learning development for me. I’ve complained about this before, I’ll readily admit it. However, as my career path has grown into the Python sphere more and more, I have had to bite the bullet and use Python tools to do data visualization tasks too.
This progression has led me to think a lot about what it is that the Python libraries lack that the ggplot2 world in R provides, and what it is about a library for data visualization that creates a passionate, happy user base. This foundation is what I use to critique and assess the six Python libraries I’m going to be discussing at ODSC East 2021.
Ease of Use
The first make-or-break moment for any software, including a dataviz library, is when you first pick it up. A user needs to generate a plot to see how some data is shaped, or needs to show something to a peer, and so googles or asks someone, “What’s the best data visualization library in Python?” How many steps, lines of code, or different docs pages will that user put up with to get this visualization done? How many new concepts or paradigms will they want to learn before getting the one plot they need? This process needs to be as easy and short as possible if you want someone to become a user. There are really three possible results of the first interaction: “Hey, that was pretty good. I’ll use that again!”, “Ugh, that was tough but I got it done. Not looking forward to the next time though.”, or “This is a mess, I don’t have time for this, I’m going to find some other tool to use.”
Sensible, consistent grammar
If the library makes it past that first hurdle, then you have a user. They might have varying levels of enthusiasm about the tool, however. So, what’s the development of this user going to look like? Grammar, in particular its intuitiveness and consistency, is key here. Once you learn a few key elements or functions, can you reasonably guess how to adapt them to a new use case? If you know how to create a scatterplot, does this give you any help when you need to create a line graph next? Or are you essentially starting over from scratch? People like to feel that they are making progress and developing a more sophisticated understanding of the tool as they keep working with it. Discovering that every new use case requires a new skill set or more memorization is a big bummer. Think of it like learning a spoken or written language. Nobody likes learning English verbs because so many of them are irregular – you try to guess what the correct conjugation might be, and you’re wrong, AGAIN, and eventually this becomes really frustrating. Nobody wants this experience when making data visualizations.
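To make the scatterplot-to-line-graph question concrete, here is a minimal sketch using matplotlib’s object-oriented interface. The library choice, the data, and the helper names (`labelled_axes`, `scatterplot`, `line_graph`) are all assumptions made for illustration, not anything taken from the session. In this small case the two plots differ by a single method call and everything else carries over; it says nothing about how well that holds for more complex plots or in other libraries.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up data for illustration only.
days = [1, 2, 3, 4, 5]
signups = [12, 18, 15, 24, 30]


def labelled_axes():
    """Create a figure and axes with the labels both plots share."""
    fig, ax = plt.subplots()
    ax.set_xlabel("Day")
    ax.set_ylabel("Signups")
    return fig, ax


def scatterplot(x, y):
    """Draw one marker per observation."""
    fig, ax = labelled_axes()
    ax.scatter(x, y)
    return fig, ax


def line_graph(x, y):
    """Same data and same arguments; only the drawing call changes."""
    fig, ax = labelled_axes()
    ax.plot(x, y)
    return fig, ax


scatter_fig, scatter_ax = scatterplot(days, signups)
line_fig, line_ax = line_graph(days, signups)
```

The test of a grammar is whether this kind of guess keeps working: if the next chart type needs a different data shape or a different way of naming axes, the user is back to reading docs.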
Once the user has developed a robust understanding of the library, and is getting comfortable with making a variety of plots, they’re going to run up against customization needs. Perhaps they need a label just exactly right here, to point out an outlier – or they want to remove just the y axis tick marks, or they want to fill just this little bit with color. Perhaps there is a brand theme they must use for their company or organization. This kind of thing is the difference between a library that’s all right for exploratory analysis and messing around, and the library that you can use to make plots that really illuminate important things, or plots you’d be willing to show to your peers, boss, or clients.
If someone does all their exploratory analysis and then has to switch proverbial horses midstream, rewriting plots in another toolkit that gives them the specific features they want, they’re going to be annoyed, and they’re going to be less efficient. The user’s time is valuable, and even if they CAN do this kind of thing, that doesn’t mean they’ll want to. Within reason, a good plotting library will be full-featured and allow pretty detailed customizations.
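The three customizations mentioned above (a label pointing at an outlier, hiding only the y-axis tick marks, and filling one small region with color) can be sketched in matplotlib roughly as follows. The data is made up, and matplotlib is simply the library picked for this illustration; other libraries will express the same tweaks differently.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up data; the spike in the middle plays the role of the outlier.
x = list(range(10))
y = [2, 3, 3, 4, 4, 5, 14, 5, 6, 6]

fig, ax = plt.subplots()
ax.plot(x, y, color="tab:blue")

# 1. A label placed next to the outlier, with an arrow pointing at it.
outlier_x = y.index(max(y))
ax.annotate(
    "outlier",
    xy=(outlier_x, max(y)),               # the point being labelled
    xytext=(outlier_x - 3, max(y) - 1),   # where the text sits
    arrowprops={"arrowstyle": "->"},
)

# 2. Hide only the y-axis tick marks; the tick labels stay in place.
ax.tick_params(axis="y", length=0)

# 3. Fill just one small region under the line with color.
in_region = [3 <= xi <= 5 for xi in x]
ax.fill_between(x, y, where=in_region, color="tab:orange", alpha=0.4)
```

Each tweak here is one call on the same `ax` object that drew the exploratory plot, which is the property that spares the user from rewriting the plot in a second toolkit.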
Beautiful, readable results
In my experience, people either ONLY think about the aesthetics of plots, or they don’t think about them at all. The truth is, everyone deserves good-looking data visualizations. This is not just about beautifying our environments: if a plot is appealing and easy on the eyes, more people will look at it, and the message it’s trying to convey will reach a larger audience. This matters!
However, attractive design shouldn’t take all day, and it shouldn’t be its own source of massive frustration for a non-designer user. A good data visualization library ought to have reasonably attractive design elements out of the gate, and its design needs to be as customizable as every other aspect of the plot. Built-in themes, support for color schemes like ColorBrewer, and font versatility, for example, all add to a plot’s effectiveness by making the result more pleasant to look at.
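As a rough illustration of what built-in themes, ColorBrewer palettes, and font control can look like in practice, here is a minimal matplotlib sketch. The data and the particular choices (the `ggplot` style sheet, the `Set2` palette, a serif font) are assumptions picked for the example, not recommendations from the session.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up data for illustration only.
categories = ["A", "B", "C", "D"]
values = [5, 3, 8, 6]

# "Set2" is one of the ColorBrewer qualitative palettes bundled with matplotlib.
palette = plt.get_cmap("Set2").colors
bar_colors = list(palette[: len(categories)])

# Apply a built-in style sheet plus a font override, scoped to this figure only
# so the global defaults are left untouched.
with plt.style.context("ggplot"), plt.rc_context({"font.family": "serif"}):
    fig, ax = plt.subplots()
    bars = ax.bar(categories, values, color=bar_colors)
    ax.set_title("Styled with a built-in theme")
```

Printing `plt.style.available` lists the style sheets that ship with the installed matplotlib version, which is a quick way to see what a non-designer gets without writing any styling code.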
In my opinion, these are the major considerations worth our attention when evaluating a data visualization library. I’d argue that R’s ggplot2 ecosystem does a remarkable job hitting these marks, which is why it has such a strong and enthusiastic following. To make a comparison, I tested six Python libraries on these same criteria: matplotlib, seaborn, bokeh, altair, plotnine, and plotly. To find out more about my assessments, and to see samples of my code and plots from all these libraries, join me at ODSC East 2021 in my session, “Going Beyond Matplotlib and Seaborn: A survey of Python Data Visualization Tools”!
Stephanie Kirmer is a Senior Data Scientist at Saturn Cloud, a company making large scale Python easy and accessible to the data community using Dask. Throughout her career, she’s used varied tools to make effective data visualizations, including as a DS Tech Lead at a travel data startup, and as a Senior Data Scientist at Uptake, an industrial data science company. She holds Master’s degrees in sociology and education, and was formerly an adjunct faculty member at DePaul University in Chicago.