Data scientists work with columns and rows. This is at the core of our training and we are very good at it! From SQL tables to Pandas dataframes and everything in between, we like our columnar data. We work best under the assumption that each of our rows is an independent data point, not correlated in any way with any of the other rows. We live in this paradigm so fully that the vast majority of our machine learning models assume that our rows are independent of each other. And why shouldn’t we think that? After all, our data is brought to us by a series of measurements, whether they be user profiles on our platforms, transactions across payment systems, or maybe screen clicks. These measurements are all independent of each other, right?
Well, except when they are not. Like most generalizations, this one has exceptions, and in fact there are many. For example, users on a social network platform might be related to other users through relationships like friendships. If we wanted to predict the churn of one user, that churn may well be correlated with whether one of their friends churns. The purchase of one item on a website, for example shampoo, might naturally lead to the purchase of another, like conditioner. Screen clicks can only occur when one webpage is linked to another. Relationships between our data points matter.
And so as data scientists it is sometimes necessary for us to let go of our long-held assumption that data points are independent of each other. This is where graphs come in. But how do we know that we have a problem that would lend itself well to graphs?
We are used to working in relational databases queried with SQL. However, SQL can make certain relationship-based tasks difficult and inefficient. For example, a simple JOIN operation can be O(N * M) in the worst case, such as a nested-loop join without a usable index, where N and M are the sizes of the tables being joined. And then consider what happens to “big O” when there are multiple JOINs in a single query!
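To make the O(N * M) worst case concrete, here is a minimal Python sketch of a naive nested-loop join that counts how many row comparisons it performs. The table contents and the `nested_loop_join` helper are hypothetical and exist only for illustration. Real database engines often avoid this cost with indexes or hash joins, so treat this as the worst-case picture rather than a description of any particular system.

```python
def nested_loop_join(left, right, key_left, key_right):
    """Naive join: compare every left row with every right row."""
    comparisons = 0
    joined = []
    for l_row in left:
        for r_row in right:
            comparisons += 1
            if l_row[key_left] == r_row[key_right]:
                # Merge the two matching rows into one output row.
                joined.append({**l_row, **r_row})
    return joined, comparisons


# Hypothetical toy tables: N = 4 users, M = 3 friendship rows.
users = [{"user_id": i, "name": f"user{i}"} for i in range(4)]
friendships = [
    {"user_id": 0, "friend_id": 1},
    {"user_id": 0, "friend_id": 2},
    {"user_id": 1, "friend_id": 3},
]

rows, comparisons = nested_loop_join(users, friendships, "user_id", "user_id")
print(f"{len(rows)} joined rows from {comparisons} comparisons")
```

The comparison count is always `len(users) * len(friendships)`, no matter how few rows actually match. Chaining a second join onto the output repeats the same pattern against another table, which is why queries with many JOINs can become expensive.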
It can sometimes be easy to see that you have a graph problem just by knowing that the relationships between your data points matter. However, it is not always that obvious. Multiple JOINs are actually one of the biggest hints that you have a graph problem. Once you know you have a graph problem, there are many tools available to you using a graph data structure or graph database that can open up many new possibilities for calculations that far exceed the capabilities of SQL! These include finding the most important data points (nodes) within the graph, clusters within the data, similarity among the nodes, and even machine learning enabled and enhanced by graphs.
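As a rough illustration of why multi-JOIN questions are often graph questions, the sketch below stores friendships as an adjacency structure and finds everyone within two hops of a user by walking edges. In SQL, each extra hop would typically mean another self-JOIN on a friendships table. The names, edges, and helper functions are all made up for this example, and node degree is used only as the simplest possible stand-in for the richer importance measures a graph library or graph database would provide.

```python
from collections import deque

# Hypothetical undirected friendship edges.
friend_edges = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("ann", "eve")]


def build_adjacency(edges):
    """Map each node to the set of its direct neighbors."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def within_hops(adjacency, start, max_hops):
    """Breadth-first search returning {node: hop_count} up to max_hops away."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # Do not expand beyond the hop limit.
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    del distances[start]  # The start node is not its own friend.
    return distances


graph = build_adjacency(friend_edges)
reachable = within_hops(graph, "ann", 2)  # friends and friends-of-friends
degree = {node: len(neighbors) for node, neighbors in graph.items()}

print(reachable)
print(degree)
```

Changing `max_hops` from 2 to 3 or 4 requires no change to the traversal itself, whereas the equivalent SQL query would usually grow by one JOIN per hop.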
In my talk at ODSC East 2022, I will show you how to identify “graph-y” problems and what to do with them once you have them. We will walk through some common SQL queries, exploring how “big O” varies with the query. Then we will transition to looking at how to solve graph problems in SQL. We will compare each of our SQL queries with the equivalent query in Cypher, a common graph query language. Finally, we will touch on some of those exciting calculations that can only be done within a graph.
I hope to see you at my ODSC East talk, “When SQL is Not the Best Answer: Identifying ‘Graph-y’ Problems and When Graphs Can Help”!
About the author / ODSC East 2022 speaker:
Dr. Clair Sullivan is currently a graph data science advocate at Neo4j, working to expand the community of data scientists and machine learning engineers using graphs to solve challenging problems. She received her doctorate in nuclear engineering from the University of Michigan in 2002 and has worked in a variety of settings, including national laboratories, the federal government, and as a professor at the University of Illinois. She has authored 4 book chapters, over 20 peer-reviewed papers, and more than 30 conference papers.
Cover image by Savionasc, CC BY-SA 4.0, via Wikimedia Commons