In the 1840s Dr. Ignaz Semmelweis had a big problem on his hands. He was an obstetrician in the Vienna General Hospital, and healthy young women in the maternity ward were dying at an alarming rate. The problem was well known throughout Vienna. There were two wards; we’ll call them ward A and ward B. The mortality rate in ward A was significantly higher than that of ward B.

Patients were admitted to ward A on even days and ward B on odd days. It is said that women used to show up at the hospital on odd days begging to be admitted to ward B and not risk going into labor on an even day. Ward A and ward B had the same equipment, did the same procedures, and saw the same patient population. The only difference between the two was that ward A was staffed by physicians whereas ward B was staffed by midwives.


We could model this as a graph problem to illustrate our point:


Had Ignaz Semmelweis used a graph analysis, he would have noticed that the blue path, (Physician Trainee)-[Trains_in]->(Cadaver Lab), is predictive of high mortality. He might have tested this hypothesis to see whether it held true across other units in the hospital. Luckily, Dr. Semmelweis noticed this relationship on his own and instituted a handwashing policy in the hospital. After the new policy went into effect, the mortality rates were equal between the two wards.
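To make the path idea concrete, here is a minimal Python sketch of the Vienna wards as a list of relationships. Only the Trains_in relationship comes from the example above; the STAFFS relationship name and the helper function are assumptions made for illustration.

```python
# Toy graph of the Vienna wards as (source, relationship, target) triples.
# TRAINS_IN mirrors the path in the text; STAFFS is a hypothetical name.
EDGES = [
    ("Physician Trainee", "TRAINS_IN", "Cadaver Lab"),
    ("Physician Trainee", "STAFFS", "Ward A"),
    ("Midwife", "STAFFS", "Ward B"),
]


def wards_linked_to(place, edges):
    """Return the wards staffed by anyone who also trains in `place`."""
    trainees = {src for src, rel, dst in edges
                if rel == "TRAINS_IN" and dst == place}
    return sorted(dst for src, rel, dst in edges
                  if rel == "STAFFS" and src in trainees)


exposed_wards = wards_linked_to("Cadaver Lab", EDGES)
print(exposed_wards)
```

The query follows two relationships in a row, which is the kind of pattern that a graph makes easy to express.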


Modern Applications

I work at New York Presbyterian Hospital, one of the largest hospitals in the country and a leader in academic medicine. When patients come to our hospital, the last thing they expect is to contract a new infection from the hospital. Unfortunately, as in Dr. Semmelweis’s hospital, hospital-acquired infections remain a big problem. Many hospital-acquired infections (HAIs) can be prevented with good housekeeping (such as disinfecting rooms between patients) and careful attention to infection control procedures like hand washing and the use of gloves, gowns, and other personal protective equipment.

Careful monitoring of active infections will help prevent outbreaks.

We will look at the tools of graph databases and social network analysis to monitor and prevent outbreaks or “dangerous patterns” like the path in the Vienna example.

In this post I’ll discuss some of the benefits of a graph database and explain how we went about modeling and creating our graph, then I’ll review some of the key social network analyses and how they work on our dataset.


Why graph?

Many of the leaders in the technology industry are built on graphs: Google search uses PageRank, which is a graph algorithm, and LinkedIn, Twitter, and Facebook are social networks. Amazon has just released its version of a graph database. Financial firms and insurance companies commonly use graph databases for fraud detection. There are a number of graph databases available; I used Neo4j, which is the leader in the industry. For me the major benefit of Neo4j is its query language, Cypher. Cypher is to a graph database what SQL is to the relational database, and like SQL, once you get the hang of Cypher it’s incredibly powerful and intuitive.


Graph overview

Graphs are made up of nodes and relationships (or vertices and edges). Usually nodes represent a person, place, thing, or event. Relationships show some sort of interaction between them.

If you come from a relational database background, nodes are like the data stored in a table and the relationships are like joins between tables.

For example, here is how data would be represented in a traditional RDBMS:



 This is how you could represent the same data in a graph:


The advantage is that when you have highly connected data you don’t have to spend time creating joins between tables—it’s very easy to traverse the graph, explore data, and discover relationships.
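As a rough illustration of the difference, the sketch below stores the same facts twice: once as tables linked by a join table, and once as nodes that hold their own relationships. All table contents, names, and the ADMITTED_TO relationship are made up for this example.

```python
# Relational style: three tables linked by foreign keys (all rows are made up).
patients = {1: "Patient 1", 2: "Patient 2"}
units = {10: "ICU", 20: "Oncology"}
admissions = [(1, 10), (2, 10), (2, 20)]  # (patient_id, unit_id) join table


def units_for_patient_relational(patient_id):
    # Equivalent of joining admissions to units on unit_id.
    return sorted(units[u] for p, u in admissions if p == patient_id)


# Graph style: each node stores its relationships directly.
graph = {
    "Patient 1": {"ADMITTED_TO": ["ICU"]},
    "Patient 2": {"ADMITTED_TO": ["ICU", "Oncology"]},
}


def units_for_patient_graph(name):
    # Follow the relationship from the node; no join table is scanned.
    return sorted(graph[name]["ADMITTED_TO"])


print(units_for_patient_relational(2))
print(units_for_patient_graph("Patient 2"))
```

Both functions return the same answer. The graph version simply starts at a node and walks outward, which is what a traversal does.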


Modeling the data


For the hospital data model we wanted to capture spatial data (like the unit to which the patient was admitted); time series data, such as the movement of patients through the hospital over time and when patients were on the same unit; the patient’s providers, and when and where each provider cared for the patient; and other data such as consults the patient had during their stay, diagnoses, and procedures.

The data model had to be flexible enough to answer a number of questions. For example: how many patients stayed in the ICU on Jan 1st? Where do patients generally move after staying in the ICU? Which provider sees the most patients who eventually contract an HAI?
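To show what answering the first two questions involves, here is a small Python sketch over a handful of invented location stays. The record layout, unit names, and dates are assumptions for illustration, not our actual schema or data.

```python
from collections import Counter
from datetime import date, datetime, timedelta

# (patient, unit, arrived, left); every record here is hypothetical.
STAYS = [
    ("p1", "ICU", datetime(2018, 1, 1, 8), datetime(2018, 1, 2, 9)),
    ("p1", "Cardiology", datetime(2018, 1, 2, 9), datetime(2018, 1, 4, 12)),
    ("p2", "ICU", datetime(2017, 12, 30, 20), datetime(2018, 1, 1, 6)),
    ("p2", "Neurology", datetime(2018, 1, 1, 6), datetime(2018, 1, 3, 10)),
    ("p3", "ICU", datetime(2018, 1, 5, 7), datetime(2018, 1, 6, 7)),
    ("p3", "Cardiology", datetime(2018, 1, 6, 7), datetime(2018, 1, 8, 7)),
]


def patients_on_unit(unit, day, stays):
    """Patients whose stay on `unit` overlaps the calendar day `day`."""
    start = datetime.combine(day, datetime.min.time())
    end = start + timedelta(days=1)
    return {p for p, u, arrived, left in stays
            if u == unit and arrived < end and left > start}


def next_units_after(unit, stays):
    """Count which unit each patient moved to directly after `unit`."""
    by_patient = {}
    for patient, u, arrived, _left in sorted(stays, key=lambda s: (s[0], s[2])):
        by_patient.setdefault(patient, []).append(u)
    moves = Counter()
    for path in by_patient.values():
        for here, there in zip(path, path[1:]):
            if here == unit:
                moves[there] += 1
    return moves


icu_on_jan_1 = patients_on_unit("ICU", date(2018, 1, 1), STAYS)
after_icu = next_units_after("ICU", STAYS)
print(len(icu_on_jan_1), after_icu.most_common(1))
```

In a graph, each of these becomes a short pattern match rather than a loop, but the logic is the same.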


Capturing Time

Capturing time in a graph database can be tricky. In order to look at patients who were collocated on a unit at a given time, we wanted to graph both the location and the location at a given time. We created a timeline:


We can then create a relationship between a visit and an hour on the timeline to map the hours that the patient was on a given unit.

In the example below the patient is on AD01 at 10:00 and 11:00. In the 11:00 hour they moved to G10S.


This way we can very easily query the database for patients who were on the same unit at the same time.
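The idea can be sketched in plain Python by treating each visit-to-hour relationship as a record. Here visit_1 follows the AD01 to G10S example; visit_2, visit_3, and the record layout are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# (visit, unit, hour on the timeline).
# visit_1 follows the AD01 -> G10S example; the other visits are made up.
ON_UNIT_AT = [
    ("visit_1", "AD01", "10:00"),
    ("visit_1", "AD01", "11:00"),
    ("visit_1", "G10S", "11:00"),
    ("visit_2", "AD01", "10:00"),
    ("visit_3", "G10S", "11:00"),
    ("visit_3", "G10S", "12:00"),
]


def colocated(records):
    """Map each pair of visits to the (unit, hour) slots they shared."""
    present = defaultdict(set)
    for visit, unit, hour in records:
        present[(unit, hour)].add(visit)
    overlaps = defaultdict(list)
    for (unit, hour), visits in present.items():
        for a, b in combinations(sorted(visits), 2):
            overlaps[(a, b)].append((unit, hour))
    return dict(overlaps)


contacts = colocated(ON_UNIT_AT)
for pair, slots in sorted(contacts.items()):
    print(pair, slots)
```

Because a visit can touch two units within the same hour, visit_1 ends up colocated with both of the other visits even though those two never meet.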


Social Network Analysis

We can apply the tools of social network analysis to the graph database.

We used the Louvain community detection algorithm to find similar units. Different floors or units in the hospital may be designated for different types of patients. For example, there is a pediatric floor, an oncology floor, and a neurology floor. The algorithm was able to classify similar units together because patients of a certain type tend to move between those floors.


The above command writes the community ID back on to the nodes.
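Louvain works by repeatedly moving nodes between communities to increase a score called modularity, which rewards groupings where most of the edge weight stays inside a community. The sketch below only computes that score for two candidate groupings of a toy transfer graph; it is not the Louvain algorithm itself or the Neo4j procedure we ran. The unit names and transfer counts are invented.

```python
# Undirected transfer graph: (unit, unit) -> number of transfers (all made up).
TRANSFERS = {
    ("Onc A", "Onc B"): 5,
    ("Onc B", "Onc C"): 4,
    ("Onc A", "Onc C"): 3,
    ("Card A", "Card B"): 6,
    ("Card B", "Card ICU"): 4,
    ("Onc C", "Card A"): 1,
}


def modularity(edges, communities):
    """Weighted modularity of a partition given as a list of node sets."""
    total = sum(edges.values())  # m: total edge weight
    degree = {}
    for (a, b), w in edges.items():
        degree[a] = degree.get(a, 0) + w
        degree[b] = degree.get(b, 0) + w
    score = 0.0
    for members in communities:
        inside = sum(w for (a, b), w in edges.items()
                     if a in members and b in members)
        deg_sum = sum(degree[n] for n in members)
        # Fraction of weight inside, minus what random wiring would give.
        score += inside / total - (deg_sum / (2 * total)) ** 2
    return score


by_specialty = [{"Onc A", "Onc B", "Onc C"}, {"Card A", "Card B", "Card ICU"}]
mixed = [{"Onc A", "Card B", "Onc C"}, {"Card A", "Onc B", "Card ICU"}]

q_specialty = modularity(TRANSFERS, by_specialty)
q_mixed = modularity(TRANSFERS, mixed)
print(round(q_specialty, 3), round(q_mixed, 3))
```

Grouping units by where patients actually move scores higher than an arbitrary grouping, which is why units with similar patient populations tend to end up in the same community.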

Here is community 49. These are all oncology units. We see that the community detection has worked as expected—it grouped floors with similar patient populations.


This is community 2. Most of these are cardiac floors at one campus. One node, however, is the cardiac ICU at another campus: patients are commonly transferred from the first campus to the larger hospital. This was an interesting and unexpected finding.



One of the most fundamental applications of social network analysis is centrality. Centrality is a measure of how important a node is in a network. Centrality may be based on how many paths pass through a node or the sum of the weights of its edges.

In analyzing the hospital infection data we look at which nodes have high centrality between infections. A provider with a high centrality could be spreading disease in the hospital, or they may be an infectious disease specialist who sees all the patients with an HAI.

It is important to contextualize the results and to maintain a close relationship with subject matter experts who can verify that a pattern is actually concerning.
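The two flavors of centrality mentioned above can give different answers on the same graph. The following sketch computes both on a tiny provider and patient network: the sum of edge weights per node, and betweenness (how many shortest paths pass through a node, computed here with Brandes’ algorithm on the unweighted graph). The providers, patients, and encounter counts are hypothetical.

```python
from collections import deque

# (provider, patient) -> number of encounters; all values are made up.
ENCOUNTERS = {
    ("dr_a", "p1"): 2,
    ("dr_a", "p2"): 1,
    ("dr_a", "p3"): 1,
    ("dr_b", "p3"): 3,
    ("dr_b", "p4"): 2,
}


def weighted_degree(edges):
    """Sum of edge weights touching each node."""
    total = {}
    for (a, b), w in edges.items():
        total[a] = total.get(a, 0) + w
        total[b] = total.get(b, 0) + w
    return total


def betweenness(edges):
    """Unnormalized betweenness on the undirected, unweighted graph."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    score = {v: 0.0 for v in adj}
    for s in adj:
        dist = {s: 0}
        paths = {v: 0 for v in adj}  # number of shortest paths from s
        paths[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # accumulate from the farthest nodes back
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each pair is visited from both ends in an undirected graph.
    return {v: c / 2 for v, c in score.items()}


strength = weighted_degree(ENCOUNTERS)
between = betweenness(ENCOUNTERS)
print(max(strength, key=strength.get), max(between, key=between.get))
```

In this toy network dr_b has the most encounters, but dr_a sits on more of the paths between patients. Which measure matters depends on the question, which is one more reason to review results with subject matter experts.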



Graph databases are becoming increasingly important in a number of markets. Products such as Neo4j make it easy to get started and can scale to massive graphs. The future of data and data science is about contextualizing and making connections between the massive amounts of data we are collecting. Graph databases allow you to model and query the data more organically.

Graphs make the tools of social network analysis like centrality, neighborhood detection, and subgraph pattern recognition available to datasets that aren’t commonly considered to be social network problems.


Michael Zelenetz is a Data Insights Programmer at New York Presbyterian Hospital. He is interested in how technology and data can help drive meaningful change. He likes solving complex problems. In his free time, he builds small robots, climbs, and reads non-fiction.

For more information on Michael, visit his LinkedIn.