If you think you can get away with poor communication skills as a data scientist because the data will speak for itself, Dr. Lindsay Brin is here to tell you otherwise. She believes that learning how to communicate data through proper visualization is critical for data scientists and outlines why that ability is essential for data truth-telling.
The Data Science Workflow
The purpose of data visualization is to make concepts and truths more transparent than they would be in pure number form. Brin is quick to point out that looking closely at your data before you begin applying analyses is vital, whether the analysis is simple, as her example slide suggests, or much more complex.
She gives examples of how really looking at your data helps you apply statistics appropriately and set the correct model parameters.
This matters for communication because the right kind of image or visualization often makes us feel that we understand the data better. That applies when you present data to a broader audience, where you choose whether to clarify or conceal, and it also applies to you when you’re considering the data you have.
Choices in Data Visualization
Certain types of data lend themselves to visualization through images, but not all fit the criteria. In fact, some visualizations can be needlessly redundant or purposefully misleading. Knowing when to employ a visual is the first step in proper data communication:
- Visualization isn’t useful when you have only two data points.
- It’s not effective when it obscures a relationship in the data, particularly when a table would serve better.
- It’s highly effective for showing overall trends and pattern progressions.
Data Science And Integrity
Data visualization enhances understanding, but the flip side is that it can tell the wrong story, whether by accident or with more insidious intentions. Avoiding these pitfalls requires knowing how people fundamentally interpret signals, both covert and overt. Awareness of these signals can help you present data more clearly and avoid misinterpretations.
The choice of scale can affect how we see any data. In some cases, shortening the scale can highlight relationships in the data. Narrowing the scale to the range the data actually covers can have a profound effect on how we see differences between data points. For example:
The seasonality of birth rates in Toronto, Canada, jumps out in this visualization. If we pull back, however, to a much broader visualization, the data story changes:
Here, we see that despite seasonal changes, the overall birth rate is relatively stable. Which story you tell will depend on the question being asked and your own integrity.
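As a rough illustration of the scale effect, the sketch below measures how much of the vertical axis a series occupies under two different axis ranges. The monthly counts are made-up numbers, not Toronto’s actual birth data, and `visual_span_fraction` is a hypothetical helper named for this example.

```python
def visual_span_fraction(values, axis_min, axis_max):
    """Return the share of the axis height covered by the data's own range."""
    if axis_max <= axis_min:
        raise ValueError("axis_max must be greater than axis_min")
    return (max(values) - min(values)) / (axis_max - axis_min)


# Made-up monthly counts for illustration only (not real birth data).
monthly_counts = [2400, 2350, 2500, 2550, 2650, 2700,
                  2750, 2720, 2680, 2560, 2420, 2380]

# Axis cropped to the data: the seasonal swing fills the whole plot.
tight = visual_span_fraction(monthly_counts, min(monthly_counts), max(monthly_counts))

# Axis starting at zero: the same swing occupies only a narrow band.
wide = visual_span_fraction(monthly_counts, 0, 3000)

print(f"cropped axis:    {tight:.0%} of plot height")
print(f"zero-based axis: {wide:.0%} of plot height")
```

The data are identical in both cases; only the axis limits differ. In a plotting library these would be the y-axis limits (in matplotlib, for instance, they can be set with `Axes.set_ylim`).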
Extremes in axis ranges can also alter the effectiveness of your visualization. It’s difficult to know whether values at the extremes have true meaning or are just at the edge of your range. Subsetting the axis or applying a log transformation, for example, can reveal patterns in the data that you may not have seen otherwise. While there are many ways to address the issue, Brin hopes that you’ll begin to see how visualization may be affecting your own interpretations.
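To make the log transformation concrete, here is a minimal sketch that computes where each value would sit along an axis, from 0 (bottom) to 1 (top), on a linear scale and on a log scale. The values are invented for illustration, and `axis_positions` is a hypothetical helper, not something shown in the talk.

```python
import math


def axis_positions(values, log_scale=False):
    """Map each value to its relative position (0 to 1) along an axis."""
    # A log scale only works for strictly positive values.
    transform = math.log10 if log_scale else float
    scaled = [transform(v) for v in values]
    low, high = min(scaled), max(scaled)
    return [(s - low) / (high - low) for s in scaled]


# Made-up values spanning several orders of magnitude.
values = [1, 10, 100, 1000, 10000]

linear = axis_positions(values)
logged = axis_positions(values, log_scale=True)

for v, lin, lg in zip(values, linear, logged):
    print(f"{v:>6}: linear {lin:.3f}   log {lg:.3f}")
```

On the linear axis, four of the five points are squeezed into the bottom tenth of the plot, while the log axis spaces them evenly. Neither view is the correct one in general; which is appropriate depends on the question being asked.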
Another example of misleading visualization is the use of points and lines for something like a categorical variable. If there’s no real relationship between the categories, or there are gaps in the information, you may begin to see patterns that don’t actually exist because of the way we read connecting lines. For example:
This is a nonsensical visualization because the mammals don’t actually relate to each other in a way that warrants the line progression.
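One way to see why the line is misleading: the order of categories is arbitrary, so the shape of the line changes whenever the categories are rearranged. The sketch below uses made-up scores for four mammals (not the data from the talk) and a hypothetical helper, `segment_directions`, to describe the shape of the line under two orderings.

```python
def segment_directions(values):
    """Label each line segment between neighbours as 'up', 'down' or 'flat'."""
    directions = []
    for previous, current in zip(values, values[1:]):
        if current > previous:
            directions.append("up")
        elif current < previous:
            directions.append("down")
        else:
            directions.append("flat")
    return directions


# Made-up scores per mammal, for illustration only.
scores = {"bat": 5, "cat": 12, "dog": 10, "elephant": 70}

# Two equally arbitrary ways to order the categories on the x-axis.
alphabetical = sorted(scores)
another_order = ["elephant", "cat", "dog", "bat"]

alphabetical_shape = segment_directions([scores[name] for name in alphabetical])
other_shape = segment_directions([scores[name] for name in another_order])

print(alphabetical_shape)
print(other_shape)
```

The same four numbers read as a zigzag in one ordering and as a steady decline in the other. Because the apparent trend depends entirely on an arbitrary ordering, marks that don’t connect categories, such as bars or isolated points, are usually the safer choice.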
Choice Of Colors
Different elements of color can affect how you interpret your data as well. You may not realize it, but things like hue, value, and intensity all play a part in your interpretation. The study of color theory and how our eyes move through a composition can help explain how visualizations can be evocative and useful, or fail to draw attention.
Artists often use color intentionally, but data scientists may not always be so deliberate. For example, hue (or color name), value (dark or light), and intensity (or saturation) can lead your audience through a composition in particular ways. This is very apparent in Degas’ painting, in which the eye typically catches on the red sweater (hue, intensity) and moves through the painting following value (dark to light).
In graphs, color can help draw out specific patterns while disguising others, as seen in these three graphs, which vary in color scheme but present exactly the same data. The polynomial relationship is more evident in some and misleading or disguised in others.
Consider how color is causing you to assign importance. Is that importance relevant or accurate? If it isn’t, Brin encourages you to reconsider how you use color in your visualization.
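To check how a palette might assign importance through value (dark to light), the following sketch compares a rainbow-style palette with a single-hue ramp. It is an approximation under stated assumptions: lightness is estimated with the Rec. 709 luma weights applied directly to RGB values without gamma correction, and both palettes are invented for this example rather than taken from the talk.

```python
import colorsys


def approx_lightness(rgb):
    """Rough perceived lightness of an (r, g, b) color with channels in 0..1.

    Applies the Rec. 709 luma weights directly to the RGB values and skips
    gamma correction, so treat the result as an approximation.
    """
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def is_monotonic(sequence):
    """True when the sequence never decreases or never increases."""
    pairs = list(zip(sequence, sequence[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


steps = 5

# Rainbow-style palette: hue sweeps from red to blue at full saturation.
rainbow = [colorsys.hsv_to_rgb(i / (steps - 1) * (2 / 3), 1.0, 1.0)
           for i in range(steps)]

# Single-hue ramp: blue going from dark to bright.
single_hue = [colorsys.hsv_to_rgb(2 / 3, 1.0, (i + 1) / steps)
              for i in range(steps)]

rainbow_lightness = [approx_lightness(c) for c in rainbow]
single_hue_lightness = [approx_lightness(c) for c in single_hue]

print("rainbow monotonic:   ", is_monotonic(rainbow_lightness))
print("single hue monotonic:", is_monotonic(single_hue_lightness))
```

In the rainbow palette, lightness rises and falls along the scale, so mid-range values (yellow, cyan) can look more prominent than the extremes. The single-hue ramp gets steadily lighter, so visual emphasis follows the data order.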
When groups are shown in the same hue, or in hues close together on the color wheel, they can appear to be related when they are not. If groups are, in fact, related, color can help show that relationship. Otherwise, it can skew your intuition.
Color can also obscure patterns. To bring them out, you can apply a log transformation before assigning colors, or just add a lot more colors to your values for visual contrast. Color can also have the opposite effect, emphasizing relationships that aren’t quite as extreme as they appear.
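The effect of a log transformation on color assignment can be sketched by sorting values into a fixed number of color bins on a linear scale and on a log scale. The values below are invented and heavily skewed on purpose, and `color_bin` is a hypothetical helper, not a function from any plotting library.

```python
import math


def color_bin(value, vmin, vmax, n_bins, log_scale=False):
    """Return the index of the color bin that a value falls into."""
    transform = math.log10 if log_scale else float
    fraction = (transform(value) - transform(vmin)) / (transform(vmax) - transform(vmin))
    # Clamp so that vmax lands in the last bin rather than one past it.
    return min(int(fraction * n_bins), n_bins - 1)


# Made-up, heavily skewed values (all positive, as a log scale requires).
values = [1, 2, 5, 10, 50, 100, 1000]
n_bins = 5

linear_bins = [color_bin(v, 1, 1000, n_bins) for v in values]
log_bins = [color_bin(v, 1, 1000, n_bins, log_scale=True) for v in values]

print("linear:", linear_bins)
print("log:   ", log_bins)
```

On the linear scale, every value except the largest receives the same color, which hides the variation among the smaller values. On the log scale, the values spread across all five colors. In matplotlib, a comparable log-scaled color mapping is available through `matplotlib.colors.LogNorm`.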
It’s Not Just Numbers That Talk
Brin emphasizes that color can be a remarkable tool for showing relationships in data, but it becomes a distraction when it isn’t used thoughtfully. It’s possible to obscure relationships or emphasize them unnecessarily just by manipulating things like color and visual style. Considering those choices alongside the data itself makes the story you tell more transparent and more reliable.