Machines continue to show us how valuable they are to our everyday lives, and healthcare is no exception. However, finding quality healthcare data to train these machines can be a challenge. Luckily, researchers, governments, and even private companies recognize the value of providing (anonymized) data to advance healthcare initiatives and the public good. Here are 12 notable healthcare datasets for 2022.
You didn’t think we’d get out of this article without talking about Covid-19, did you? The Covid-19 X-Ray dataset offers more than 6,000 annotated images of lungs with other identifying characteristics removed. For example, traditional lung x-rays often show the shoulders or ribcage, which could help identify the age of the patient. Images include patients with and without Covid-19 and could help develop better tools for assessing disease severity in individual patients.
The Big Cities Health Coalition upgraded its platform to include comparisons of key public health indicators across 28 cities. This collection contains more than 17,000 data points, and researchers can use the navigation menu to narrow in on the focus areas they want. Users gain greater insight into what’s impacting the big cities of the U.S. and can train machines accordingly.
Governments are beginning to recognize the value of making datasets available to encourage innovation. This site offers high-value health data, including recent datasets for Covid-19, collected from the U.S. Department of Health and Human Services, as well as state partners. Researchers can explore
With information from 41 different countries, this dataset provides detailed mortality and population data. This type of data aids researchers and entrepreneurs in building solutions to improve life quality, address pressing chronic illness challenges, and manage or prevent environmental causes of shortened lifespans, among many other applications. The site now includes a new dataset—Short Term Mortality Fluctuations—for comparing responses to epidemics across 38 countries.
The latest release, OASIS-3, offers freely available datasets for researchers and citizen data scientists looking to explore advances in cognitive health, with images of normal brain scans and scans from people diagnosed with Alzheimer’s. It aims to improve clinical neuroscience initiatives and includes data across a broad demographic spectrum. Researchers can find thousands of images across the first, second, and now third releases. The datasets are free, but prospective users must apply for access and sign the appropriate privacy agreements.
The nonprofit Public Health Institute offers data on factors in early life. Researchers can access environmental, behavioral, genetic, and other biological data for participants. In many cases, these datasets cover decades of monitoring. The datasets offer a connection from these factors in early life to health outcomes later in adulthood. The datasets are free, but researchers must apply and sign agreements to access the data.
The National Library of Medicine offers a variety of datasets, from public health to drugs and supplements. These give researchers data to explore in a variety of formats and across more than 130 different projects. Many of the datasets were updated last year. Researchers may use datasets for free but should follow the individual license agreement for each set.
The Surveillance, Epidemiology, and End Results (SEER) Program offers population data by age, sex, race, year of diagnosis, and geographic area. SEER releases new research data every spring and offers specialized datasets for researchers looking for something outside the available datasets. While these sets are free, researchers must apply for special access. There’s also an interactive toolbox to make the search for the right dataset easier.
For drug discovery training sets, this Kaggle collection offers datasets simulating how molecule sets interact with each other. It includes starter code in R for reading the datasets, and benchmark results for several tasks are available as examples. The 15 molecular datasets were originally part of a Kaggle competition, and each corresponds to a biologically relevant target.
Located on the ELVIRA Biomedical Data Set Repository, this biomedical dataset collection focuses on data published in journals such as Science and Nature. It offers high-dimensional sets, including gene expression, protein profiling, and genomic sequence data. The list ranges from breast cancer sets to data on the central nervous system.
This free database contains medical images, teaching scenarios, teaching cases, and clinical topics. Its nearly 59,000 images are organized by disease location, pathology category, and patient profile. The images are indexed and curated, and they come from over 12,000 patients. The database continually accepts new data submissions, and the images could offer valuable training options for computer vision, diagnostics, or other tools.
This 100,000-plus image dataset lives on Kaggle and focuses specifically on chest x-rays. It covers over 30,000 unique patients and includes disease labels generated from NLP text-mining, which have an expected accuracy of 90%. Researchers don’t have access to the original radiology reports, but interested parties can read the paper outlining the labeling process. Kaggle encourages other parties to contribute annotations that update or correct erroneous labels.
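With more than 100,000 images from just over 30,000 patients, many patients appear in several images. If you train a model on a set like this, one common precaution is to split by patient rather than by image, so the same person never lands in both the training and test sets. Here is a minimal sketch of that idea in plain Python. The record layout (a patient ID paired with an image file name), the function name, and the sample values are all hypothetical and are not taken from the dataset's actual files.

```python
import random
from collections import defaultdict


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split (patient_id, image_name) pairs so no patient is in both sets.

    The record layout is an assumption made for illustration only.
    """
    # Group every image under the patient it belongs to.
    by_patient = defaultdict(list)
    for patient_id, image_name in records:
        by_patient[patient_id].append(image_name)

    # Shuffle patients (not images) with a fixed seed for repeatability.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)

    # Hold out a fraction of the patients, at least one, for testing.
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = patient_ids[:n_test]
    train_ids = patient_ids[n_test:]

    train = [(p, img) for p in train_ids for img in by_patient[p]]
    test = [(p, img) for p in test_ids for img in by_patient[p]]
    return train, test


# Hypothetical sample: five patients, some with more than one image.
records = [
    ("p1", "img_001.png"), ("p1", "img_002.png"),
    ("p2", "img_003.png"),
    ("p3", "img_004.png"), ("p3", "img_005.png"), ("p3", "img_006.png"),
    ("p4", "img_007.png"),
    ("p5", "img_008.png"),
]
train, test = split_by_patient(records)
```

Splitting at the patient level tends to give a more honest estimate of how a model handles people it has not seen. Given the expected label accuracy noted above, it may also be worth reviewing a sample of labels by hand before relying on test scores.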
Training New Healthcare Solutions
These datasets offer new choices for your healthcare solutions, whether you need structured data or images. As A.I. becomes a more significant part of healthcare solutions from beginning to end, a wider choice of datasets can supply the training data those solutions need. Be sure to check out the datasets from 2020 to find even more options for quality healthcare data.
Learn More about Healthcare AI and Healthcare Datasets at ODSC East 2022
At our upcoming event, ODSC East 2022 in Boston this April 19th-21st, you’ll be able to learn more about AI in healthcare, including healthcare datasets, case studies, the ethical use of AI, and more. Here are a few sessions that may be of interest:
- Need of Adaptive Ethical ML Models in Post Pandemic Era
- Spark NLP for Healthcare: Modular Approach to Solve Problems at Scale in Healthcare NLP
- Data Science and Contextual Approaches to Palliative Care Need Prediction
- Unlocking the value of siloed data with multi-party ML
- Data Operations for Research Quality Health Data