Machine learning is exploding into the world of healthcare. When we talk about the ways ML will revolutionize certain fields, healthcare is always one of the top areas seeing huge strides, thanks to the processing and learning power of machines. There’s a good chance you either are or will soon be employed in the healthcare field. A while back, I wrote a list of 25 excellent open datasets for ML and included healthdata.gov and the MIMIC Critical Care Database. Here are 15 more excellent datasets specifically for healthcare.
General and Public Health:
WHO: Provides datasets based on global health priorities. The organization includes easy search and provides insights for topics along with the datasets.
CDC: Use this for US-specific public health data. The CDC maintains WONDER (Wide-ranging Online Data for Epidemiological Research), and its datasets are searchable by topic, state, and other factors.
data.gov: US-focused healthcare data searchable by several different factors. Datasets are intended to improve the lives of people living in the US, but the information could also be valuable for training sets in research or other public health areas.
Re3Data: A registry of more than 2,000 research data repositories organized across several broad subject categories. While not all datasets available are free, the access terms are clearly marked, and the registry is easily searchable based on fees, membership requirements, and copyright restrictions.
CHDS: The Child Health and Development Studies datasets are intended for research into how disease and health pass down through generations. They support research into not just genomic expression but also how social, environmental, and cultural factors play into disease and health.
Kent Ridge Biomedical Datasets: High-dimensional datasets in the biomedical field. It focuses on journal-published data (Nature, Science, and others).
Merck Molecular Activity Challenge: Datasets designed to foster the use of machine learning in drug discovery by simulating how molecule combinations could interact with each other.
SEER: Cancer statistics provided by the US government and arranged by demographic groups. You can search based on age, race, and gender.
1000 Genomes Project: Sequencing data from 2,500 individuals across 26 different populations. An international collaboration, it’s one of the biggest genome repositories you can access, and it’s available through AWS. (Note: there are grants available for genome projects.)
Medicare: Provides datasets based on services provided by Medicare-accepting institutions. Datasets are well scrubbed for the most part and offer exciting insights into the service side of hospital care.
HCUP: Datasets from US hospitals. It includes emergency room stays, in-patient stays, and ambulance stats. The data is clean and sheds light on the services side of US healthcare.
OASIS: The Open Access Series of Imaging Studies makes neuroimaging datasets of the brain freely available, hoping to foster research and new advances in both basic and clinical neuroscience.
OpenfMRI: More imaging datasets from MRI machines, shared to foster research, better diagnostics, and training. It includes 95 datasets from 3,372 subjects, with new material being added as researchers make their own data open to the public.
CT Medical Images: This one is a small dataset, but it’s specifically cancer-related. It contains labeled images with age, modality, and contrast tags. Again, high-quality images associated with training data may help speed breakthroughs.
DeepLesion: One of the largest image sets currently available. The NIH released these CT images to help improve the accuracy of lesion documentation and diagnosis. It includes over 32,000 lesions from more than 4,000 unique patients.
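One practical note on the imaging sets above: when a dataset holds several images or lesions per patient, a random split by image can put the same patient in both the training and test sets and inflate your results. Here is a minimal sketch of a patient-level split in plain Python. The function name `split_by_patient` and the `(patient_id, item)` record layout are assumptions made for illustration, not part of any dataset's actual format or API.

```python
import random
from collections import defaultdict


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split records into train/test so that no patient appears in both.

    `records` is a list of (patient_id, item) pairs. This layout is a
    hypothetical stand-in for whatever metadata your dataset provides.
    """
    # Group every item under the patient it belongs to.
    by_patient = defaultdict(list)
    for patient_id, item in records:
        by_patient[patient_id].append(item)

    # Sort first so the shuffle is reproducible for a given seed.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)

    # Hold out a fraction of patients (not images) for testing.
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])

    train = [(p, x) for p in patient_ids if p not in test_ids for x in by_patient[p]]
    test = [(p, x) for p in patient_ids if p in test_ids for x in by_patient[p]]
    return train, test


if __name__ == "__main__":
    # Toy data: 10 made-up patients with 3 images each.
    toy = [(f"p{i}", f"img{i}_{j}") for i in range(10) for j in range(3)]
    train, test = split_by_patient(toy, test_fraction=0.2, seed=1)
    print(len(train), "training items,", len(test), "test items")
```

Libraries such as scikit-learn offer group-aware splitters that do the same job, but the idea is simple enough that a few lines of standard-library code may be all you need to get started.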
Bonus! Dataset Aggregators
Kaggle: As always, an excellent resource for finding datasets pertaining not only to healthcare but also to other areas. If your healthcare explorations expand to a different subject or you need other datasets for training, this is always a great resource.
Subreddit: It may take some doing, but you can find some serious gems within the subreddit discussions on open datasets. If you have a burning question that other public datasets can’t answer, this could be the solution.
Healthcare.ai: Not necessarily an aggregator but a full, open-source software project and community dedicated to training, activism, and furthering the integration of machine learning into all things healthcare.
ML in Healthcare
People around the world are living longer, and we need new answers more than ever. If you’re a data scientist working with health organizations or conducting your own research into some of humanity’s most persistent questions, having free access to data is a critical part of that research. Get started with some of these datasets, and they could be a jumping-off point for the answers you need.