fbpx
An Introduction to Social Network Analysis with NetworkX: Two Factions of a Karate Club An Introduction to Social Network Analysis with NetworkX: Two Factions of a Karate Club
Networks (a.k.a graphs) are one of the most interesting areas of data science and have been subject to an explosion of... An Introduction to Social Network Analysis with NetworkX: Two Factions of a Karate Club
Networks (a.k.a graphs) are one of the most interesting areas of data science and have been subject to an explosion of interest in recent years. The ability to model the relationship between data points is powerful. This article introduces some basic concepts in network science and gives in python using networkx, the go-to python package for network-related analysis. Keep your eyes peeled for my next article where I will explore network visualizations in more detail!

Why even care about networks?

A network (or graph) is a collection of nodes and edges which allows us to model the relationships between nodes. In recent years, network science has seen a recent explosion in popularity thanks to its wide-ranging applications.

With the meteoric rise of social media, social network analsyis is a typical use case and is what we will be discussing here. Networks also have significant applications in the modeling of pandemics and in modeling biological systems. The ability to model the relationships between agents has applications far and wide.

As with most things python makes it really easy for us to work with and analyze graphs. For the rest of this article, I am going to take you through the example of Zachary’s Karate Club using python and networkx.

An Example – Zachary’s Karate Club (ZKC)

When it comes to social network analysis, Zachary’s Karate Club (ZKC) is perhaps the most well-known example. It was introduced by Wayne Zachary in this paper in 1977 and has been a popular example ever since.

The network models the relationships between 34 members of a karate club: each node represents to an individual, and the links/edges represent individuals who interact outside of the karate club setting (e.g. spending social time together, like meeting up for a coffee, separate to karate).

The network has two main players the ‘Officer’ – John A (node 33) and the instructor – Mr. Hi (node 0). And the story goes that a rift occurred between Mr. Hi and John A, causing the karate club to splinter into two new clubs (or factions). One club lead by John A and the other led by Mr. Hi.

One might expect that each member’s decision to join either faction would be driven by their relationships with the other members of the club. So if we have a model of the relationships between the individuals (i.e. the network) we should be able to predict with faction each person will join. As you will see, network analysis gives us the power to do just that!

“The Karate Kid’ theatrical release poster (https://en.wikipedia.org/w/index.php?curid=8872317)

(N.B. as far as I know Zachary’s research didn’t have any influence the 1984 hit The Karate Kid)

Environment Set Up

First lets make sure we have installed the right packages into deepnote and import packages that we need.

#Install networkx if we don't already have it
!pip install networkx

Requirement already satisfied: networkx in /opt/venv/lib/python3.7/site-packages (2.4)
Requirement already satisfied: decorator>=4.3.0 in /opt/venv/lib/python3.7/site-packages (from networkx) (4.4.2)

#Import the required packages

import sys
import networkx as nx 
import matplotlib.pyplot as plt

%matplotlib inline

print(f"Python version {sys.version}")
print(f"networkx version: {nx.__version__}")

Python version 3.7.3 (default, Jun 11 2019, 01:11:15)
[GCC 6.3.0 20170516] networkx version: 2.4

Great – since everything appears to be in order we can move on to some more interesting things!

Import the ZKC graph

Our first step is to import the ZKC graph. Thankfully, this is already included in networkx. We will create an instance of the graph and also retrieve the club labels for each node. They tell us which faction of each member ends up siding with: Mr. Hi or the Officer (John A).

#Let's import the ZKC graph:
ZKC_graph = nx.karate_club_graph()

#Let's keep track of which nodes represent John A and Mr Hi
Mr_Hi = 0
John_A = 33

#Let's display the labels of which club each member ended up joining
club_labels = nx.get_node_attributes(ZKC_graph,'club')

#just show a couple of the labels
print({key:club_labels[key] for key in range(10,16)})

{10: ‘Mr. Hi’, 11: ‘Mr. Hi’, 12: ‘Mr. Hi’, 13: ‘Mr. Hi’, 14: ‘Officer’, 15: ‘Officer’}

So we see, as expected, the members of the club are split between Mr. Hi’s faction and the Officer’s faction.

Mathematically, a network is simply represented as an adjacency matrix,

AA. Each element of the matrix

AijA_{ij} represents the strength of the relationship between nodes i and j. Displaying

AA is one way for us to have a look at what is going on in the graph.

As a note on some of the definitions in graph theory relevant here: the ZKC graph is both undirected and unweighted. This means that the edges in the graph have no associated direction (so

AA is symmetric) and that the edges take a binary value of either 1 or 0 (i.e. the members either have a relationship outside of the karate club or not).

A = nx.convert_matrix.to_numpy_matrix(ZKC_graph)

A

matrix([[0., 1., 1., …, 1., 0., 0.],
[1., 0., 1., …, 0., 0., 0.],
[1., 1., 0., …, 0., 1., 0.],
…,
[1., 0., 0., …, 0., 1., 1.],
[0., 0., 1., …, 1., 0., 1.],
[0., 0., 0., …, 1., 1., 0.]])

Visualize the graph we have just imported

It becomes evident fairly quickly that the adjacency matrix is not the most intuitive way of visualizing the karate club.

There are many methods we can use to visualize a network and here I will just focus on the functions built into networkx. They are based off of matplotlib and don’t provide us with the prettiest visualizations you have ever seen, but they’ll do the job here. Look out for my next article where I will explore graph visualizations in more detail!

#To plot using networkx we first need to get the positions we want for each node. 
#Here we will use a ciruclar layout but there are many other variations you could choose!
circ_pos = nx.circular_layout(ZKC_graph)

#Use the networkx draw function to easily visualise the graph
nx.draw(ZKC_graph,circ_pos)

#let's highlight Mr Hi (green) and John A (red)
nx.draw_networkx_nodes(ZKC_graph, circ_pos, nodelist=[Mr_Hi], node_color='g', alpha=1)
nx.draw_networkx_nodes(ZKC_graph, circ_pos, nodelist=[John_A], node_color='r', alpha=1)

<matplotlib.collections.PathCollection at 0x7f60ffe90f28>

This is a lot more clear than the adjacency matrix! We quickly can see which members of the club are connected to each other and we might even be able to draw a couple of quick initial conclusions. The first thing to note is that both Mr. Hi and John A have the most connections in the graph (in other words they are the most central nodes). This is to be expected for the instructor and officer of the group. Given that we can see some nodes are connected to only one of Mr. Hi or John A (but not the other) we could also start to make some educated guesses at which members will join each faction.

Network Statistics (Exploratory Analysis)

Before we dive into some community detection it is worth exploring the network we have in a bit more detail (this is essentially our exploratory analysis stage). For networks, we often want to retrieve some common network statistics.

The first statistic we will look at is the density. This is a measure of how complete the graph is (how many edges are present in the network as compared to the total possible number of edges).

density = nx.density(ZKC_graph)

print('The edge density is: ' + str(density))

The edge density is: 0.13903743315508021

This value of ~0.14 is roughly in line with what we might expect for a social network.

Next, let’s look at the degree (how many edges each node has). This is a common centrality measure, which gives an idea of how ‘important each node is in the network. The assumption is that nodes with the most edges are the most important/central as they are directly connected to lots of other nodes. Nodes with a high centrality might be expected to play important roles in network. This is not the only way that centrality is measured in a graph and the interested reader is directed here.

#the degree function in networkx returns a DegreeView object capable of iterating through (node, degree) pairs
degree = ZKC_graph.degree()

degree_list = []

for (n,d) in degree:
degree_list.append(d)

av_degree = sum(degree_list) / len(degree_list)

print('The average degree is ' + str(av_degree))

The average degree is 4.588235294117647

#we now plot the degree distribution to get a better insight
plt.hist(degree_list,label='Degree Distribution')
plt.axvline(av_degree,color='r',linestyle='dashed',label='Average Degree')
plt.legend()
plt.ylabel('Number of Nodes')
plt.title('Karate Club: Node Degree')

This distribution is similar to what we might expect given the visualization of the graph. The majority of members of the club do not have very many links (most have 2 or 3 links) but a few nodes (which in our case correspond to Mr. Hi and John A) have a lot of links.

Another interesting statistic is the local clustering coefficient. You can find the full mathematical definition here. The local clustering coefficient can be thought of as the average probability that a pair of node i’s friends are also friends with each other. In other words, it measures the extent to which any given node is located within a tight ‘cluster’ of neighboring nodes.

#Now we can compute the local clustering coefficient
local_clustering_coefficient = nx.algorithms.cluster.clustering(ZKC_graph)

#lets find the average clustering coefficient
av_local_clustering_coefficient = sum(local_clustering_coefficient.values())/len(local_clustering_coefficient)

#similarly to the degree lets plot the local clustering coefficient distribution
plt.hist(local_clustering_coefficient.values(),label='Local Clustering Coefficient Distribution')
plt.axvline(av_local_clustering_coefficient,color='r',linestyle='dashed',label='Average Local Clustering Coefficient')
plt.legend()
plt.ylabel('Number of Nodes')
plt.title('Local Clustering Coefficient')
plt.show()

The clustering coefficient has been the subject of a lot of interesting research. A low value of the clustering coefficient indicates the presence of a ‘strucutral hole’, i.e. a location in the network where there are links missing that would have otherwise been expected. Which typically indicates that a node is located on the outside of a cluster. It has been argued that individuals at strucutral holes in the network are more likely to come up with good ideas because, being at the edge of a tight cluster of nodes, they are exposed to a greater variety of different ideas from outside of that cluster (Burt, 2004).

Community Detection

A key aspect of network analysis is community detection. In a social network, this is the idea that a large network can be broken down into smaller communities/cliques. For example, if the network represents the social relationships of all the students at a school, a community/clique would be a friendship group.

The ability to detect communities in networks has many applications. In the context of the karate club, it will allow us to predict which members with side with Mr. Hi and which will side with John A.

There are many ways to approach community detection in networks. I am not going to go into the maths in too much detail but we are going to opt for a modularity optimization method. Modularity is a measure of the extent to which like is connected to like in a network. The algorithm we will choose is already implemented for us in networkx, which makes its implementation very easy!

from networkx.algorithms.community.modularity_max import greedy_modularity_communities

#preform the community detection
c = list(greedy_modularity_communities(ZKC_graph))

#Let's find out how many communities we detected
print(len(c))

3

#Lets see these 3 clusters
community_0 = sorted(c[0])
community_1 = sorted(c[1])
community_2 = sorted(c[2])

print(community_0)
print(community_1)
print(community_2)
[8, 14, 15, 18, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33] [1, 2, 3, 7, 9, 12, 13, 17, 21] [0, 4, 5, 6, 10, 11, 16, 19]

We immediately see a couple of interesting things in these communities. For instance, Mr. Hi (node 0) and John A (node 33) are in different communities. Given the split in the club this is what we would have expected – buts it’s good to have it confirmed!

We can move on to visualise these different communities in the network. This will help to tell us how useful the communities we have detected really are. We are going to plot each community in a different colour, and also include the label of which club each member ended up joining.

#draw each set of nodes in a seperate colour
nx.draw_networkx_nodes(ZKC_graph,circ_pos, nodelist=community_0, node_color='g', alpha=0.5)
nx.draw_networkx_nodes(ZKC_graph,circ_pos, nodelist=community_1, node_color='r', alpha=0.5)
nx.draw_networkx_nodes(ZKC_graph,circ_pos, nodelist=community_2, node_color='b', alpha=0.5)

#now we can add edges to the drawing 
nx.draw_networkx_edges(ZKC_graph,circ_pos,stlye='dashed',width = 0.2)

#finally we can add labels to each node corresponding to the final club each member joined 
nx.draw_networkx_labels(ZKC_graph,circ_pos,club_labels,font_size=9)

plt.show()

From this we notice that the communitites detected actually match our split between the ‘Officer’ (John A) and Mr. Hi quite closely. Let’s combine communities 1 (red) and 2 (blue) and take another look.

combined_community = community_1 + community_2

#draw the network as before
nx.draw_networkx_nodes(ZKC_graph, circ_pos, nodelist=community_0, node_color=’g’, alpha=0.5)
nx.draw_networkx_nodes(ZKC_graph, circ_pos, nodelist=combined_community, node_color=’m’, alpha=0.5)

nx.draw_networkx_edges(ZKC_graph, circ_pos,stlye=’dashed’,width = 0.5)

nx.draw_networkx_labels(ZKC_graph, circ_pos, club_labels, font_size=9)

plt.show()

So, following the combination of those two communities we have split our network into two groups – the green group we predict to join John A’s faction and the purple group Mr. Hi’s. Comparing to the club labels we see only one incorrect prediction in each group, meaning we have an accuracy of ~94%. Considering that the analysis itself was straightforward to carry out, I’d say that is impressive!

This demonstrates the exciting power of network analysis. The idea that we can develop a mathematical framework that can predict an individual’s choices based on their relationships with others is immensely powerful. We live in an interconnected world and the study of networks allows us to explore those connections.

Wrap-Up

Hopefully, this has opened your eyes to the exciting world of networks! The uses of this type of analysis stretch far and wide, and are rapidly increasing in popularity. There are many different forms that network analysis can take, this article just scratches the surface. A whole host of different algorithms are implemented in networkx. You can open up this notebook in deepnote and have a play-around building on the analysis already performed here!

Originally posted here. Reposted with permission.

ODSC Community

The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.

1