This open-source project features a community contribution cluster which can be made available to ODSC followers.
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has taken around 930,000 lives and infected more than 29 million people worldwide so far. The COVID-19 pandemic is still ongoing, and the crisis moves at a very quick pace. Even though the topic has attracted great attention from both academic and industrial communities, it’s a challenge to conduct research while also keeping up with the daunting task of managing, cleaning, and maintaining data collected from various sources.
Our team at HPCC Systems decided to build an open-source, public COVID-19 Data Lake in order to support our academic partners, provide a comprehensive source of information for the general public, and demonstrate the power of our Data Lake technology. Our system is currently used by many individuals, and it serves as a data source for university researchers attempting to forecast the course of the pandemic. We invite you to check out our site at: https://covid19.hpccsystems.com. Our GitHub repository features information on data models and our academic partners in this project: https://github.com/hpcc-systems/covid19
Our map-based website provides a multi-level view of the state of the infection, from the global view down to the state and county levels. The system embeds a classical epidemiological model (SIR) and provides informative metrics that serve to illuminate the state and evolution of the pandemic within each location. Metrics include Effective Reproductive Rate (R), Contagion Risk, Social Distance Indicator, Medical Quality Indicator, Case Fatality Rate, Infection Fatality Rate, and more. These metrics are used to construct a comprehensive English-language daily commentary on the current state of the infection for every country, state, and county in the world for which data is available.
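For readers unfamiliar with the SIR model mentioned above, the following is a minimal textbook sketch in Python. It is not the code that runs on our site: the names `SIRState`, `sir_step`, `effective_reproduction_number`, and `simulate` are hypothetical, the transmission rate `beta` and recovery rate `gamma` are illustrative values, and the simple one-day Euler step is an assumption made for brevity.

```python
from dataclasses import dataclass


@dataclass
class SIRState:
    """Population split into Susceptible, Infected, and Recovered compartments."""
    susceptible: float
    infected: float
    recovered: float

    @property
    def total(self) -> float:
        return self.susceptible + self.infected + self.recovered


def sir_step(state: SIRState, beta: float, gamma: float, dt: float = 1.0) -> SIRState:
    """Advance the model by dt days using a forward Euler step.

    beta  = transmission rate per day (illustrative assumption)
    gamma = recovery rate per day (illustrative assumption)
    """
    new_infections = beta * state.susceptible * state.infected / state.total * dt
    new_recoveries = gamma * state.infected * dt
    return SIRState(
        susceptible=state.susceptible - new_infections,
        infected=state.infected + new_infections - new_recoveries,
        recovered=state.recovered + new_recoveries,
    )


def effective_reproduction_number(state: SIRState, beta: float, gamma: float) -> float:
    """Textbook R for SIR: (beta / gamma) scaled by the susceptible fraction."""
    return (beta / gamma) * state.susceptible / state.total


def simulate(initial: SIRState, beta: float, gamma: float, days: int) -> list[SIRState]:
    """Return the initial state followed by one state per simulated day."""
    states = [initial]
    for _ in range(days):
        states.append(sir_step(states[-1], beta, gamma))
    return states


if __name__ == "__main__":
    start = SIRState(susceptible=999.0, infected=1.0, recovered=0.0)
    history = simulate(start, beta=0.3, gamma=0.1, days=30)
    final = history[-1]
    print(f"Infected after 30 days: {final.infected:.1f}")
    print(f"Effective R after 30 days: {effective_reproduction_number(final, 0.3, 0.1):.2f}")
```

In this simplified form, a value of R above 1 means the infected compartment is still growing, and a value below 1 means it is shrinking.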
We also make data available through web-based APIs. The APIs provide access to any of our data, from raw ingestion to the highest-level analysis. We have found that the publicly available data is fraught with a number of problems that make it unsuitable for analysis. We needed to correct these problems because of the deep level of analysis that we perform. The filtered and corrected data series can be very useful for researchers, who would otherwise face the same issues that we have already addressed, as well as any new ones we encounter and correct in the future.
The COVID-19 Data Lake is orchestrated by the Tombolo Data Lake Curation system. The production workflow is initiated whenever new data is available. The workflow performs all the defined data enrichment activities, triggering each job when the data it depends on is ready. Tombolo automatically monitors the workflow and ensures that each job completes as scheduled. If errors occur, Tombolo will escalate the issue according to the defined policy, notifying the appropriate people. This flexible system allows production-quality workflows to be easily created and maintained as new data enrichment paths are developed.
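The idea of triggering each job once its inputs are ready, and escalating on failure, can be illustrated with a small generic Python sketch. This is not Tombolo's API or implementation: `run_workflow`, the job names, and the `notify` callback are all hypothetical, and the sketch simply uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) to order jobs by their dependencies.

```python
from graphlib import TopologicalSorter
from typing import Callable


def run_workflow(
    jobs: dict[str, Callable[[], None]],
    dependencies: dict[str, set[str]],
    notify: Callable[[str, Exception], None],
) -> list[str]:
    """Run each job only after every job it depends on has completed.

    jobs         maps a job name to a callable that performs the work.
    dependencies maps a job name to the set of job names it depends on.
    notify       is called with (job name, error) when a job fails
                 (a stand-in for an escalation policy).
    Returns the names of the jobs that completed, in execution order.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    completed: list[str] = []
    while sorter.is_active():
        for name in sorter.get_ready():
            try:
                jobs[name]()
            except Exception as exc:
                notify(name, exc)
                return completed  # stop; downstream jobs never run
            completed.append(name)
            sorter.done(name)  # unblocks jobs waiting on this one
    return completed


if __name__ == "__main__":
    # Hypothetical three-stage enrichment path.
    demo_jobs = {
        "ingest": lambda: print("ingesting raw data"),
        "clean": lambda: print("filtering and correcting"),
        "metrics": lambda: print("computing metrics"),
    }
    demo_dependencies = {"clean": {"ingest"}, "metrics": {"clean"}}
    run_workflow(demo_jobs, demo_dependencies, notify=lambda name, exc: print(name, exc))
```

Stopping at the first failure is a deliberate simplification; a production scheduler would typically also handle retries, timeouts, and independent branches that can continue.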
To learn more about this project and how others use HPCC Systems, join us at our Virtual Community Day event on October 5th. This year, HPCC Systems has made its annual Community Day a virtual event that is free to all technologists, including those new to HPCC Systems. Our worldwide virtual event will provide a high-quality training workshop, as well as presentations covering a wide variety of topics and technical posters from students working on HPCC Systems-related projects.
About HPCC Systems:
HPCC Systems is a completely free big data platform that handles data from ingestion and enrichment to content delivery. Our comprehensive, dedicated data lake platform makes combining different types of data easier and faster than competing platforms, even when the data is stored in massive, mixed-schema data lakes, and it scales very quickly as your data needs grow. It’s also open source and easy to learn.