When you visit a new place, you probably rely on a map to get oriented, guide you from place to place, and help you find the most interesting spots. The same goes for data. When you get a new dataset to work with, what should you do first? I would argue that creating a data map is the best way to jump-start an analysis. But creating a data map is not easy. In this article we'll discuss a method of data mapping that combines dimensionality reduction and network theory.
[Related Article: Why We Need Graph Analytics for Real-World Predictions]
The increased use of data analytics across all industries is creating a trend of collecting information from multiple data sources, which increases the dimensions of our datasets. Say we are analyzing customers: we will collect their billing, their contract information, their phone usage, the calls they make to customer support, their navigation on our webpage, their demographics… This can lead to hundreds of attributes per individual.
In analytics, we try to understand the effects that some variables have on others, and to trace in those effects the root causes behind business questions: for example, why are my customers leaving, or how can I sell more to them? Often we are more interested in latent factors than in the raw collected data.
Traditionally, we have created dashboards that map the relationship between two, or at most three, attributes per customer. Analysts manually decide which variables to plot against each other in an x vs. y plot, a time series, a histogram, or a map, and then draw conclusions about their relationships manually.
Another approach is to represent the relationships between all pairs of attributes at once, as we would do with variable correlations. However, correlation measures a linear relationship between variables, which leaves categorical data and non-linearly distributed data out of the picture. Moreover, with hundreds of attributes there are thousands of possible pairs to examine.
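To make that limitation concrete: a variable can be fully determined by another and still show near-zero linear correlation. A minimal sketch with synthetic data (the variables are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 1000)  # a centered attribute, e.g. a normalized usage metric
y = x ** 2                    # fully determined by x, but not linearly

# Pearson correlation is near zero despite the perfect (non-linear) dependency
r = np.corrcoef(x, y)[0, 1]
print(round(abs(r), 3))
```

A correlation matrix would suggest these two attributes are unrelated, which is exactly the kind of relationship this approach misses.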
A third option is dimensionality reduction, where you combine all attributes, keeping the dimensions (not necessarily the original variables) along which the data varies the most. Each point is then represented as a combination of these new dimensions. This can help visualize groups of data points that share a combination of attributes. However, on their own these projections do not explain which variables drive the patterns we see, and therefore do not answer the underlying business questions.
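As a minimal sketch of this idea, here is a dimensionality reduction with PCA on synthetic data (the dataset, the number of latent factors, and the use of scikit-learn are all my assumptions, not part of the article's method):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 rows whose 10 attributes are driven by 2 latent factors
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(200, 10))

# Keep the 2 dimensions along which the data varies the most
pca = PCA(n_components=2)
coords = pca.fit_transform(X)  # each row becomes a point on a 2-D map

print(coords.shape)                               # (200, 2)
print(pca.explained_variance_ratio_.sum() > 0.9)  # most variance retained
```

Plotting `coords` would reveal the latent structure, but not which of the 10 original attributes drives it: that is the gap the network approach below tries to close.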
Our proposal to solve this complex problem is to use networks (graphs). A network lets us keep a high number of attributes per item, map each item's relationship or similarity to other items using links, and create a map of the data's topology. On that map we can visually identify patterns such as groups and relationships, and overlay the distributions of the different attributes, allowing us to find combinations of attributes reflected in the shape of the data.
A natural representation of knowledge is a network, but data is not naturally encoded as a network. How do we do it?
The process of building a network from a dataset consists of four steps: (1) gather the data; (2) define a similarity function and compute the similarity among the rows of the dataset; (3) build the network by linking each row to the rows most similar to it; (4) obtain insights from the map of the relationships.
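The four steps can be sketched with a k-nearest-neighbor graph on synthetic numeric data. This is only one possible instantiation: Euclidean distance as the similarity function, and scikit-learn plus networkx as tooling, are my choices here, not necessarily the author's:

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import NearestNeighbors

# (1) Gather the data: here, 100 synthetic rows with 5 numeric attributes
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))

# (2) Similarity: Euclidean distance; find each row's 3 nearest neighbors
nn = NearestNeighbors(n_neighbors=4).fit(X)  # 4 = the row itself + 3 neighbors
dist, idx = nn.kneighbors(X)

# (3) Build the network: link each row to its most similar rows
G = nx.Graph()
for i, (drow, irow) in enumerate(zip(dist, idx)):
    for d, j in zip(drow[1:], irow[1:]):  # skip the self-match in position 0
        G.add_edge(i, int(j), weight=float(d))

# (4) Insight: inspect the topology, e.g. components as a first cut at groups
print(G.number_of_nodes(), nx.number_connected_components(G))
```

A layout algorithm (e.g. force-directed) applied to `G` then produces the visual map; coloring nodes by an attribute gives the heatmaps discussed below.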
The relationships in a graph can be either explicit, when a node is connected to another node because an explicit relationship exists in the data, or implicit, when a node is connected to another node because it is the closest according to a similarity function. The similarity function is explained in more depth below.
Data maps are very intuitive because people are used to interpreting visuals, and visual representations leave a much stronger impression than other forms of data representation.
Maps work for all types of users and are naturally intuitive for humans. The sense of space between items, and their density or sizes, all give context to the information. Heatmaps or colors for classification are a great way to understand distributions quickly, and to understand relationships among the distributions of variables. Maps can also be extended with legends or cross-filtering charts.
Using graphs as maps of data, major data components (clusters, exceptions, outliers) and their relationships (similarity, difference) are made evident, so users can identify interesting areas for further examination.
The similarity function will vary depending on the nature of the data. If the dataset is made of text documents, the similarity function will use NLP embedding techniques. If the dataset consists of multiple attributes per data point, those attributes will be transformed depending on their type (categorical, numerical, text, images…) before applying a graph-based dimensionality reduction algorithm.
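For the text case, here is a minimal sketch of such a similarity function using TF-IDF and cosine similarity, a simple stand-in for the NLP embedding techniques mentioned; the example sentences are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "customer cancelled the phone contract",
    "the phone contract was cancelled by the customer",
    "quarterly sales report for the board meeting",
]

# Vectorize the documents and compare them pairwise
tfidf = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(tfidf)

# The two churn-related sentences are far more similar to each other
print(sim[0, 1] > sim[0, 2])  # True
```

The resulting similarity matrix is exactly what the graph-building step consumes: each document gets linked to the documents it scores highest against.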
The main objective of the similarity function is to preserve the structure of distances between data points:
- If data points are similar in the original dimensions, they should also be close in the lower-dimensional representation
- Conversely, if data points are very different, they should be far from each other in the lower-dimensional space
- Data points should have the same or similar neighborhoods in the lower-dimensional space as in the original space
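The neighborhood property can be checked empirically by comparing each point's nearest neighbors before and after the reduction. A sketch on synthetic rank-2 data, where a PCA projection (my choice of reduction, for illustration) should lose essentially no structure:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# 150 points that live on a 2-D plane embedded in 20 dimensions
X = rng.normal(size=(150, 2)) @ rng.normal(size=(2, 20))

low = PCA(n_components=2).fit_transform(X)  # lower-dimensional representation

def knn_sets(M, k=5):
    """Return each point's set of k nearest neighbors (excluding itself)."""
    idx = NearestNeighbors(n_neighbors=k + 1).fit(M).kneighbors(M)[1]
    return [set(row[1:]) for row in idx]

# Fraction of each neighborhood preserved after the reduction
overlap = np.mean([len(a & b) / 5 for a, b in zip(knn_sets(X), knn_sets(low))])
print(overlap)  # ~1.0 here, since the data really is 2-dimensional
```

On real high-dimensional data the overlap will be below 1; how gracefully it degrades is one way to judge the quality of a similarity function and reduction.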
Data maps can be built from datasets of different types, as mentioned above, depending on the definition of the similarity function. Let’s see some examples:
- Using data from social media to map explicit connections, like who follows whom on Twitter, or who talks to whom, to understand communities: https://www.minsait.com/ideasfordemocracy/en/second-round-brazilian-general-elections
- Using text data to group documents or sentences by the topics they cover, finding narratives, sub-narratives, and the relevance of organizations or people within those narratives: https://medium.com/graphext/healthy-food-73dcd68dfd5e
- Understanding customer or employee behavior from CRM data, grouping nodes by their attributes and using the graph's shape to understand communities and heatmaps to understand variable distributions: https://towardsdatascience.com/behaviour-analysis-using-graphext-ddca25ebd660
Data exploration is one of the most overlooked parts of Data Science and Data Analytics, even though it is crucial for understanding the initial problem and variable distributions, finding relationships between variables for feature creation, and evaluating the results of your model on different parts of your data.
[Related Article: Crisis Intervention and Saving Lives with NLP & Predictive Analytics]
Using the concept of data maps will help you navigate these problems. If you have a problem that you think you could solve by exploring a data map, let me know and I would love to see how Graphext could help.
Almudena works at Graphext as a Data Scientist focused on Business Development, helping clients maximize the value they get from data. Previously she worked at McKinsey & Company as a Data Scientist in the Advanced Analytics Hub. She also collaborates as an Associate Professor at the IE Data Science Bootcamp.
She holds an Industrial Engineering degree from Universidad Carlos III and began her analytics academic background at the University of California, Berkeley.