Call for Collaboration: Data Science and COVID-19 – Modeling and Future Assumptions

Update 3/20/2020: An initial conclusion has been made and is added at the bottom of the original article.

Update 3/17/2020: This analysis & repository is ongoing. If you’d like to contribute to this open-source project, please email Alex (alex.l@odsc.com) and Ben (ben.vigoda@gamalon.com) to request access to the GitHub repository and contribute your thoughts on data science and COVID-19.

Disclaimer: The models and graphs below are not definitive and should not be used in any serious decision-making capacity. The purpose of this article is to provide a framework for collaboration for future analysis. We want your help in improving this project and highly encourage interested parties to reach out to us.

Data Science and COVID-19: The Need and the Answer

The goal of this simulation is to answer the questions, “How long will the COVID-19 pandemic last?” and “How many people are infected right now in my local area but do not know it?”

We are making the code available because we are not sure that we believe this simulation. We want data scientists to critique the assumptions and how the code works to iterate towards the most useful and realistic model we can get.

This is a very emotional and human issue.  But the assumptions we put into the model are numbers, and the code is code. So while doing this data science work, it is important to view the code, inputs, and results with a kind of clinical emotional distance.

For example, to make mathematical modeling easier, the model assumes that if one person in a household gets sick, so does everyone else in that household. On a personal basis, I hope that is not true for your household.  As a simplified mathematical model, we treat a household where a group of people may be sheltering in place or self-quarantining as if it were a single person.

Before we get into the details of the code and the assumptions, the high-level conclusion is this: There are two regimes. In one regime, people are very careful about staying at home and not spreading the disease. The curve spreads out over a 100 day period. The other regime is where people are a bit less careful. In this regime, the disease doubles every 4 days in the population. Since it takes an average of 6.5-7 days to express symptoms, the virus can spread to roughly 3x to 4x more people before the people who transmitted it feel any symptoms.
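The multiplier quoted above comes from compounding the doubling time over the incubation period. Here is a minimal sketch of that arithmetic, assuming clean exponential growth with no slowdown. The function name `growth_factor` is ours for illustration and is not taken from the repository.

```python
def growth_factor(days: float, doubling_time_days: float = 4.0) -> float:
    """Multiplicative growth in infections after `days` of pure exponential doubling."""
    return 2.0 ** (days / doubling_time_days)


# Incubation periods from the text (6.5 and 7 days), plus 8 days for comparison.
for incubation_days in (6.5, 7.0, 8.0):
    print(f"{incubation_days} days -> {growth_factor(incubation_days):.2f}x")
```

With a 4-day doubling time, 6.5 to 7 days of incubation gives a little over 3x, and a full two doublings (8 days) gives exactly 4x.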

The Process

Let’s talk through the code now. I have been basing my numerical assumptions on this Medium article, but the goal of this work is to create a code artifact that anyone can use to try out their own assumptions.

The model assumes that there are two kinds of households, “home” and “moving.” A “moving” household is a household where one or more people in the household have a job where they move around in the community.  This includes people who are delivering food, bagging or boxing food in distribution centers, police, firemen, doctors, nurses, grocery store workers, and so forth.

The other kind of household, “home,” stays in their house, receives deliveries of food or other necessities, and practices social distancing (6+ feet) if they go for a walk outside.  They make decisions like whether to order take-out, whether to treat Amazon or Instacart type deliveries with dilute bleach or let non-perishables with hard surfaces sit for 2 days, etc.  They also decide whether to go see their “best friend” once every 10 days.  These are critical decisions.

This split into two kinds of households is what I see when I look out of my window in Cambridge, MA.  Then I found the paper from the Centre for Mathematical Modelling of Infectious Diseases, London School of Hygiene & Tropical Medicine cited below (1).  They independently did the same exact thing in their model.

You can put whatever assumptions into the model that you want, but here is how I thought about putting in assumptions, such as considering the range of pessimistic assumptions versus optimistic assumptions:

Once every 10 days, a home person who is incubating the virus visits a friend because they have cabin fever, and every one of those visits (1/1) transmits the disease. That means that, on average, a given home household spreads the disease to some other home household once every 10 days. A more optimistic assumption would put this number at once every 100 days.

Once every 5 days, an Instacart delivery person, a plumber, or another outside individual brings the virus to a home via a surface or a brief failure of social distancing, and 50% of those incidents transmit the disease. That puts the probability that a mobile household transmits COVID-19 to a home household on any given day at 0.1 (1/5 × 0.5). In other words, about one day in ten, a transmission occurs from a given mobile household to some home household that they deliver to. A more optimistic assumption would put this number at 1 in every 100 days that a given moving delivery person or caregiver transmits the disease to one of the homes they help support.
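The daily probabilities above are just a contact frequency multiplied by a per-contact transmission chance. Here is a small sketch of that conversion using the numbers quoted in the text. The helper name `daily_transmission_probability` and the variable names are hypothetical, chosen for illustration only.

```python
def daily_transmission_probability(days_between_contacts: float,
                                   transmission_per_contact: float) -> float:
    """Average daily probability = (1 / days between contacts) * P(transmit | contact)."""
    return transmission_per_contact / days_between_contacts


# Pessimistic inputs quoted in the text.
home_to_home = daily_transmission_probability(10, 1.0)    # friend visit every 10 days, always transmits
moving_to_home = daily_transmission_probability(5, 0.5)   # outside contact every 5 days, 50% transmit

# Optimistic alternative quoted in the text: one transmission per 100 days.
home_to_home_optimistic = daily_transmission_probability(100, 1.0)

print(home_to_home, moving_to_home, home_to_home_optimistic)
```

Keeping the two factors separate makes it easy to test a change in behavior (fewer visits) independently from a change in precautions (lower chance of transmission per visit).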

I assume that “home” households that have symptoms will take very strong precautions not to get mobile workers sick, so there is zero transfer from home_sick to moving.

Another assumption is that because the US government has been vague about financial relief for moving workers who don’t or cannot go to work, and because many doctors, nurses, food delivery people, and so forth are very dedicated to helping others, 1 out of every 10 mobile workers unfortunately decides to go to work even with some symptoms. With a dry cough, they might have a 50% chance of infecting a home each day and a 50% chance of infecting another mobile worker each day. Who knows what those numbers really are. I just put some in. But what the model shows is that it doesn’t matter what these numbers are unless you are prepared to make them at least 10x more optimistic than this. There is a critical threshold of mobile workers ignoring their symptoms, above which the virus spreads like wildfire.
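To make the assumptions above concrete, here is a minimal deterministic sketch of a two-household-type simulation. It is not the repository's code and is not tuned to reproduce the figures in this article. The class and function names, the moving-to-moving rate for incubating workers, the 90,000/10,000 household split, and the rule that a household can only be infected once are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Params:
    # Daily probability that one infectious household infects one other household.
    home_to_home: float = 0.1           # incubating home -> home (friend visits)
    moving_to_home: float = 0.1         # incubating moving -> home (deliveries, repairs)
    moving_to_moving: float = 0.1       # incubating moving -> moving (ASSUMED for this sketch)
    home_sick_to_moving: float = 0.0    # symptomatic home households isolate fully
    sick_working_fraction: float = 0.1  # share of symptomatic moving workers who still work
    sick_to_home: float = 0.5           # working symptomatic worker -> home, per day
    sick_to_moving: float = 0.5         # working symptomatic worker -> moving, per day
    incubation_days: float = 7.0        # mean days from infection to symptoms


def simulate(days, n_home, n_moving, params, seed_moving=1.0):
    """Return the expected number of newly symptomatic households for each day."""
    home_well, home_inc, home_sick = float(n_home), 0.0, 0.0
    moving_well, moving_inc, moving_sick = n_moving - seed_moving, seed_moving, 0.0
    new_symptomatic = []
    for _ in range(days):
        working_sick = params.sick_working_fraction * moving_sick
        # Expected infection attempts aimed at each household type today.
        attempts_home = (params.home_to_home * home_inc
                         + params.moving_to_home * moving_inc
                         + params.sick_to_home * working_sick)
        attempts_moving = (params.moving_to_moving * moving_inc
                           + params.home_sick_to_moving * home_sick
                           + params.sick_to_moving * working_sick)
        # ASSUMPTION: only attempts that land on a still-well household infect it.
        new_home = min(home_well, attempts_home * home_well / n_home)
        new_moving = min(moving_well, attempts_moving * moving_well / n_moving)
        # Incubating households become symptomatic at rate 1 / incubation_days.
        onset_home = home_inc / params.incubation_days
        onset_moving = moving_inc / params.incubation_days
        home_well -= new_home
        home_inc += new_home - onset_home
        home_sick += onset_home
        moving_well -= new_moving
        moving_inc += new_moving - onset_moving
        moving_sick += onset_moving
        new_symptomatic.append(onset_home + onset_moving)
    return new_symptomatic


# Hypothetical split of 100,000 households into 90,000 home and 10,000 moving.
pessimistic = simulate(365, 90_000, 10_000, Params())
optimistic = simulate(365, 90_000, 10_000, Params(
    home_to_home=0.01, moving_to_home=0.01,
    moving_to_moving=0.01, sick_working_fraction=0.01))
print(f"peak new symptomatic households per day: "
      f"{max(pessimistic):.0f} vs {max(optimistic):.2f}")
```

Swapping in different `Params` values is the intended way to explore where the threshold between the two regimes sits under your own assumptions.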

The results are hard to believe, so I am looking for people to comb over the code. What they say right now is that unless we use the most optimistic assumptions, which seem to me to impose unrealistically draconian constraints on people, the virus doubles in the population every 4 days. No one knows it is happening until an average of 7 days later, when they all start to exhibit symptoms. There is then a tremendous spike in hospitalizations that completely overwhelms the medical facilities, and the fatality rate becomes the fatality rate of a country that does not have any respirators, which right now looks to be in the 2-4% range. In that scenario, 4.5M people die in the US in the next 65 days. The death experience is one of slow asphyxiation, like slowly drowning over a period of 4 days.

Conclusion and Initial Analyses

The simulations show that given how people are behaving now, there will be a day at the peak in late April when 100,000 households in Cambridge and Arlington, MA will yield 4,500 new symptomatic cases in one day. If 5% of those are serious cases, the hospitals will be admitting 225 serious cases per day at the peak, and will average on the order of 100 new serious cases per day for a twenty-day period. Given over a million households in the Boston area, that is on the order of 2,000 new serious cases per day for twenty days. That is 40,000 serious cases in the Boston area in a very short period. We do not have that many ventilators. The hospitals will still be overrun.
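The hospital-load figures above can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the text. The helper `scale_daily_cases` is hypothetical, and scaling cases linearly with the number of households is our simplifying assumption.

```python
def scale_daily_cases(cases_per_day, from_households, to_households):
    """Scale a daily case count linearly with the number of households (an assumption)."""
    return cases_per_day * to_households / from_households


peak_new_symptomatic = 4_500   # new symptomatic cases on the peak day (from the text)
serious_percent = 5            # share of cases assumed serious (from the text)
peak_serious_per_day = peak_new_symptomatic * serious_percent / 100

# The text averages about 100 serious cases per day across 100,000 households.
per_million = scale_daily_cases(100, 100_000, 1_000_000)
per_two_million = scale_daily_cases(100, 100_000, 2_000_000)

# Boston-area total from the text: about 2,000 per day over a twenty-day surge.
boston_serious_total = 2_000 * 20

print(peak_serious_per_day, per_million, per_two_million, boston_serious_total)
```

Under linear scaling, one million households corresponds to about 1,000 serious cases per day, so the 2,000 per day figure matches a region of roughly two million households, which is consistent with "over a million" only as an order-of-magnitude estimate.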

To flatten the curve, we need to be even more careful not to transmit the disease than most people are being right now.

Each stay-at-home household must transmit the virus to another stay-at-home household no more than once every 20 days. So no going for walks with your best friend once every day or even every week. If a visit is likely to transmit the virus to the person you are with, then one person from your household can make such a visit at most about once per month.

A given moving/mobile worker like a food delivery person must transmit the virus to a stay-at-home household no more than once every 20 days. With all of the bags being carried and all of the individual fruits being picked up and put into bags by workers who might be incubating the virus, and with a delivery person delivering to 10-20 households per day, deliveries from a given delivery person can only result in a household getting infected once every 20 days.

The mobile workers need to be very careful as well. If they are together at a grocery store or packing warehouse, a given worker can transmit the virus to another worker no more than once every 10 days.
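These "once every N days" limits translate into daily probabilities, and for a delivery person into a per-delivery risk budget. Here is a short sketch of that conversion. The function names are hypothetical, and the calculation treats transmissions as an expected count per day, which is a reasonable approximation only when the probabilities are small.

```python
def max_daily_probability(days_between_transmissions: float) -> float:
    """Daily transmission probability implied by one transmission every N days."""
    return 1.0 / days_between_transmissions


def max_per_delivery_probability(days_between_transmissions: float,
                                 deliveries_per_day: float) -> float:
    """Per-delivery risk budget when the daily budget is spread over all deliveries."""
    return 1.0 / (days_between_transmissions * deliveries_per_day)


home_to_home_limit = max_daily_probability(20)       # stay-at-home household to household
worker_to_worker_limit = max_daily_probability(10)   # mobile worker to mobile worker

# A delivery person visiting 10 to 20 households per day (range from the text).
per_delivery_at_10 = max_per_delivery_probability(20, 10)
per_delivery_at_20 = max_per_delivery_probability(20, 20)

print(home_to_home_limit, worker_to_worker_limit, per_delivery_at_10, per_delivery_at_20)
```

Read this way, a delivery person making 10 to 20 drops per day has to keep the chance of infecting any single household at roughly 1 in 200 to 1 in 400 per delivery.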

Call for Collaboration

We deeply welcome improvements to the model, and experiments to simulate different input assumptions. My goal here was to create a seed for the open-source data science community to build from. Even though a big first wave of the virus will be already “crashing over the population” before this model can probably be put into public policy use, I anticipate that it could become extremely useful for helping local governments manage through how we lift the social distancing directives, and how we deal with the second wave of the virus. Examples of things that are not yet in this model:

  1. After people recover from the illness, they become a kind of buffer in the population. It acts as if people become spatially more spread out. If you think of people like particles diffusing through space, on average they will start to have immune people in between them. That is not in the model.
  2. Age groups. I wanted the model to be simple so that a lot of people could quickly read and understand it. So I didn’t add age bands like the models reported by the WHO modeling team. Adding age bands would be a great way for a graduate student in mathematical modeling to contribute.
  3. I didn’t set this model up to learn the parameters from data. This is a generative model. It would be a great project to put it into, for example, https://pyro.ai/ and infer/learn the parameters from data. It could become important to re-estimate the parameters on an ongoing basis, if for example, Spring-time weather starts to turn the tide.
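As a much simpler stand-in for the parameter learning described in item 3, the sketch below estimates a single parameter, the doubling time, from a series of daily counts using an ordinary least-squares fit on the log of the counts. This is not Pyro and not Bayesian inference. The function name is ours, and the data is synthetic, generated from a known 4-day doubling time purely to check that the fit recovers it.

```python
import math


def fit_doubling_time(daily_counts):
    """Fit log(count) = a + r * day by least squares; return ln(2) / r in days.

    Requires at least two strictly positive counts.
    """
    days = range(len(daily_counts))
    logs = [math.log(count) for count in daily_counts]
    n = len(logs)
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))
             / sum((x - mean_x) ** 2 for x in days))
    return math.log(2) / slope


# Synthetic counts (not real case data) built from a known 4-day doubling time.
synthetic_counts = [10 * 2 ** (day / 4) for day in range(15)]
print(round(fit_doubling_time(synthetic_counts), 2))
```

Re-running a fit like this on a rolling window of recent counts is one simple way to notice when the growth rate starts to change, before moving to a full probabilistic treatment.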

How Computer Modeling Of COVID-19’s Spread Could Help Fight The Virus: https://www.npr.org/sections/health-shots/2020/03/04/811146915/how-computer-modeling-of-covid-19s-spread-could-help-fight-the-virus

“Scientists who use math and computers to simulate the course of epidemics are taking on the new coronavirus to try to predict how this global outbreak might evolve and how best to tackle it.

But some say more could be done to take advantage of these modeling tools and the researchers’ findings.

“It is sort of an ad hoc, volunteer effort, and I think that’s something that we could improve upon,” says Caitlin Rivers, an infectious diseases modeler with the Johns Hopkins Center for Health Security.

In her view, “modeling plays a really important role in understanding how an outbreak is unfolding, where it might be going, and what we should be thinking through.”

But only a small number of the modelers of epidemics work for the federal government, she says. Most are in academia, and they don’t have formal relationships with officials who have to make key public health decisions.

“Putting that information out really quickly helped to bring a lot more attention of other modelers to say, ‘There are now things that we can do, so let’s do that,’” says Rosalind Eggo, an infectious disease modeler at the London School of Hygiene & Tropical Medicine.”

Useful resources:
(1) The effect of control strategies that reduce social mixing on outcomes of the COVID-19 epidemic in Wuhan, China. Kiesha Prem, Yang Liu, Timothy W Russell, Adam J Kucharski, Rosalind M Eggo, Nicholas Davies, Centre for the Mathematical Modelling of Infectious Diseases COVID-19 Working Group, Mark Jit, Petra Klepac.

See the full GitHub repository here.

Are you working on another open-source project related to COVID-19 or another important issue? We’d love to help you get more eyes on your project! Please email the blog manager (alex.l@odsc.com) and share some details about your project. We can help you get collaborators to make a difference on your project.

Ben Vigoda

From 2013 to 2017, Gamalon received the largest single contract for the next generation of machine learning from DARPA. With a foundational advance in machine learning developed in collaboration with leading groups at MIT, Berkeley, Stanford, and Columbia, and over 40 patent filings, Gamalon was named one of the 50 Smartest Companies by MIT Technology Review in 2017, and in 2018 became a World Economic Forum Technology Pioneer.