12 Notable Healthcare Datasets for 2022

Machines continue to show us how valuable they are to our everyday lives, and healthcare is no exception. However, finding quality healthcare data to train these machines can be a challenge. Luckily, researchers, governments, and even private companies recognize the value of providing (anonymized) data to advance healthcare initiatives and the public good. Here are 12 notable healthcare datasets for 2022. 

V7 COVID-19 X-Ray Dataset

You didn’t think we’d get out of this article without talking about Covid-19, did you? The Covid-19 X-Ray dataset offers more than 6,000 annotated images of lungs with other characteristics removed. For example, traditional lung x-rays often show the shoulders or ribcage, which could help identify the age of the patient. The images include patients with and without Covid-19 and could help develop better tools for assessing disease severity in individual patients.

Big Cities Health Inventory Data Platform

Big Cities Health Coalition upgraded its platform to include comparisons of key public health indicators across 28 cities. The collection contains more than 17,000 data points, and researchers can use the navigation menu to move between the focus areas they want. Users gain greater insight into what’s affecting big U.S. cities and can train machines accordingly.

Health Data

Governments are beginning to recognize the value of making datasets available to encourage innovation. This site offers high-value health data, including recent datasets for Covid-19, collected from the U.S. Department of Health and Human Services, as well as state partners. Researchers can explore 

Human Mortality Database

With information from 41 different countries, this dataset provides detailed mortality and population data. This type of data aids researchers and entrepreneurs in building solutions to improve life quality, address pressing chronic illness challenges, and manage or prevent environmental causes of shortened lifespans, among many other applications. The site now includes a new dataset—Short Term Mortality Fluctuations—for comparing responses to epidemics across 38 countries.

Open Access Series of Imaging Studies (OASIS) Brains Dataset

The latest release, OASIS-3, offers freely available datasets for researchers and citizen data scientists looking to explore advances in cognitive health, with brain scans from cognitively normal participants and from participants diagnosed with Alzheimer’s. It aims to improve clinical neuroscience initiatives and includes data across a broad demographic spectrum. Researchers can find thousands of images across the first, second, and now third releases. The datasets are free, but prospective researchers must apply for use and sign the appropriate privacy agreements.

Child Health and Development Studies

The nonprofit Public Health Institute offers data on factors in early life. Researchers can access environmental, behavioral, genetic, and other biological data for participants. In many cases, these datasets cover decades of monitoring, which lets researchers link early-life factors to health outcomes later in adulthood. The datasets are free, but researchers must apply and sign agreements to access the data.

Data Discovery at the National Library of Medicine

The National Library of Medicine offers a variety of datasets on topics ranging from public health to drugs and supplements. The data comes in a variety of formats and spans more than 130 different projects. Many of the datasets were updated last year. Researchers may use the datasets for free but should follow the individual license agreement for each set.

National Cancer Institute SEER Data

The Surveillance, Epidemiology, and End Results Program offers population data by age, sex, race, year of diagnosis, and geographic areas. SEER releases new research data every spring and offers specialized datasets for researchers looking for something outside the available datasets. While these sets are free, researchers must apply for special access. There’s also an interactive toolbox to make the search for the right dataset easier.

Merck Molecular Health Activity Challenge

For drug discovery training sets, this collection on Kaggle offers datasets simulating how molecule sets interact with each other. It also includes starter code in R for reading the datasets, and benchmark results for several tasks are available as examples. The 15 molecular datasets were originally part of a Kaggle competition, and each belongs to a biologically relevant target.

Kent Ridge Biomedical Datasets

Located on the ELVIRA Biomedical Data Set Repository, this biomedical dataset collection focuses on data published in journals such as Science and Nature. It offers high-dimensional sets, including gene expression, protein profiling, and genomic sequence data. The list ranges from breast cancer sets to data on the central nervous system.

MedPix from the National Library of Medicine

This free database contains medical images, teaching scenarios, teaching cases, and clinical topics. Its nearly 59,000 images are organized by disease location, pathology category, and patient profile. The images are indexed and curated and come from over 12,000 patients. MedPix continually accepts new data submissions, and the images could offer valuable training options for computer vision, diagnostics, or other tools.

National Institutes of Health X-Ray Dataset

This dataset of more than 100,000 images lives on Kaggle and focuses specifically on chest x-rays. It includes over 30,000 unique patients, with disease labels generated by NLP text-mining of radiology reports. The labels have an expected accuracy of 90%. Researchers don’t have access to the original radiology reports, but interested parties can read the paper outlining the labeling process. Kaggle encourages other parties to contribute annotations that update or correct erroneous labels.
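
To get a feel for what a 90% expected label accuracy means at this scale, here is a minimal Python sketch. The function name is hypothetical and not part of the dataset or of Kaggle's tooling, and it makes the simplifying assumption that each image carries one label and that errors are spread evenly, which may not hold in practice.

```python
def expected_mislabeled(num_labels: int, label_accuracy: float) -> float:
    """Estimate how many labels are wrong, given an assumed per-label accuracy.

    Hypothetical helper for illustration only; it assumes errors are
    independent and evenly spread across labels.
    """
    if num_labels < 0:
        raise ValueError("num_labels must be non-negative")
    if not 0.0 <= label_accuracy <= 1.0:
        raise ValueError("label_accuracy must be between 0 and 1")
    return num_labels * (1.0 - label_accuracy)


if __name__ == "__main__":
    # Figures from the description: about 100,000 images, 90% expected accuracy.
    # Treating each image as a single label is a simplification.
    estimate = expected_mislabeled(100_000, 0.90)
    print(f"Roughly {estimate:,.0f} labels may be incorrect")
```

An estimate like this is only a rough planning aid, but it suggests why community corrections to the labels can matter for anyone training on the set.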

Training New Healthcare Solutions

These datasets offer new choices for your healthcare solutions, whether you need tabular data or images. As A.I. becomes a more significant part of healthcare solutions from beginning to end, a wider choice of datasets can supply the training data those solutions need. Be sure to check out the datasets from 2020 to find even more options for quality healthcare data.

Elizabeth Wallace, ODSC

Elizabeth is a Nashville-based freelance writer with a soft spot for startups. She spent 13 years teaching language in higher ed and now helps startups and other organizations explain - clearly - what it is they do. Connect with her on LinkedIn here: https://www.linkedin.com/in/elizabethawallace/