

22 Machine Learning Open Datasets for 2021
Machine Learning, Modeling, Datasets. Posted by Elizabeth Wallace, ODSC, October 4, 2021

It’s that time again. We know you’re diligently working on your machine learning skills, and it’s time to find datasets worthy of the challenge. Whether you’re new to the field or looking for some inspiration, here are some great machine learning open datasets for training models. Even better, they’re open access.
But first: How to Find Machine Learning Open Datasets
Searching for machine learning open datasets is a skill in itself and one you should get really good at if you’re in the data science community. Luckily, there are a few sources for finding these datasets. Some common ones include:
- Kaggle
- Google Dataset Search
- GitHub
- data.gov
- UCI Machine Learning Repository
There are many others, too. In fact, joining communities like GitHub or Kaggle (and Open Data Science) combined with some well-placed keyword searches can help you find machine learning open datasets specific to your project. Once you find a dataset, ask yourself:
- Can I trust the source?
- Can I find/fix inaccuracies?
- Is it complete?
- Is the data objective?
Take a little time to explore the dataset and answer these questions. They’ll help you separate “data” from “high-quality data.”
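To make the "Is it complete?" and "Can I find/fix inaccuracies?" questions concrete, here is a minimal sketch of a first-pass check using only Python's standard library. The sample CSV, its column names, and the `profile_rows` helper are all made up for illustration; in practice you would point the reader at a file you actually downloaded.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a downloaded CSV file; all values are made up.
SAMPLE_CSV = """id,label,text
1,positive,Great product
2,,Arrived late
3,negative,
3,negative,
4,neutral,It works
"""


def profile_rows(rows):
    """Count total rows, exact duplicate rows, and empty values per column."""
    rows = list(rows)
    missing = Counter()
    seen = set()
    duplicates = 0
    for row in rows:
        # A row counts as a duplicate if every column matches an earlier row.
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)
        for column, value in row.items():
            if value is None or value.strip() == "":
                missing[column] += 1
    return {"rows": len(rows), "duplicates": duplicates, "missing": dict(missing)}


report = profile_rows(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(report)
```

A report like this won't tell you whether the source is trustworthy or the data is objective, but it is a quick way to spot gaps and repeated rows before you invest time in training a model.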
22 Best Machine Learning Open Datasets
We’ll divide these machine learning open datasets based on some general categories, but you can also mix and match based on the data available in each set. Just because something is labeled for sentiment analysis doesn’t mean it wouldn’t also work with general natural language processing, for example.
Image Processing
LabelMe: A computer vision dataset published by MIT that allows users to contribute through the annotation tool. You can download the images via the MATLAB toolbox or work with them online.
Google Open Images: A massive dataset (befitting all Google contributions) with links to millions of public images sorted into thousands of categories. The images fall under a Creative Commons license, which makes them easier to reuse in open projects.
UMDFaces Dataset: For images specific to facial analysis, this dataset includes hundreds of thousands of both still and video images of over 8,000 subjects, all annotated.
VisualGenome: An ongoing project connecting “structured image concepts to language,” this knowledge base includes over 100,000 images and millions of labeled attributes, relationships, and visual question answers.
Natural Language Processing
Dirty Words: This fun dataset from GitHub itself looks at what you definitely do not want showing up in your chatbot, unless it’s that type of chatbot. It’s a fascinating and ongoing collection of socially unacceptable words and phrases in a multitude of languages.
Amazon Reviews: Spanning the course of two decades, this dataset features about 35 million Amazon reviews with the associated product for reference. It also provides the ranking, text, and basic user information.
Microsoft MAchine Reading COmprehension Dataset (MS MARCO): A Microsoft resource focused on deep learning in search. It includes a question dataset, a natural language generation dataset, passage ranking and keyword extraction datasets, and a conversational search dataset.
Jeopardy! Questions Dataset: With over 200,000 Jeopardy questions, answers, and relevant data, this dataset is an excellent multi-use option. It also contains info about the value of the question and its category.
Sentiment Analysis
DynaSent: This English-language dataset includes over 121,000 sentences labeled as positive, negative, or neutral, created on its own open platform. Each sentence has been verified by five crowd workers.
ReDial: An annotated dataset featuring conversations of people recommending movies to each other. There are around 10,000 conversations, and the site offers examples from the conversations for validation.
Youtubean: Using closed captions from videos focused on reviews, this dataset supports a range of sentiment analysis tasks and goals.
iSarcasm: Twitter is a goldmine for sentiment analysis, and this dataset focuses solely on sarcasm. Tweets are labeled as sarcastic or non-sarcastic, and the sarcastic ones are further labeled by type of ironic speech (irony, satire, understatement, overstatement, and rhetorical questions).
Speech
VoxCeleb: A large-scale speaker identification set with over 100,000 utterances compiled from YouTube videos. It covers a range of accents, a balanced gender split, and a spread of ages, and it gives users around 2,000 hours of speech.
Flickr Audio Captions: A collection of over 40,000 spoken captions describing 8,000 images. This dataset was created to investigate multi-modal learning schemes for unsupervised speech pattern discovery.
VoxForge: Unlike some other collections, this one specifically targets collections of accents in English utterances. It’s suitable for robust training in diverse speech patterns.
CHiME: This challenge dataset provides real recordings, i.e., recordings of speakers in real-world settings, not just a studio. Specifically, it offers real audio and “synthetic” audio made by layering environments over the recordings, as well as clean audio with no noise.
Government Data
Data USA: A well-organized place to find all sorts of data from the US government and its various departments. It includes info on congressional districts, public workers, population studies, and so much more.
UN Data (the United Nations): For datasets on a variety of state powers and regional profiles, the site from the United Nations delivers.
EuroStat: This European-based database categorizes datasets by area or theme and includes sections on policy.
data.gov.au: Australia’s available public data in a searchable format. Users can find thousands of datasets on a variety of topics, including population, environment, and regional data.
For Beginners
NYC Taxi Trip Data: A collection of trip data starting in 2009. This dataset covers things like rates, trip lengths, and payment types. In addition, it comes with user guides and a user-friendly format.
Wheat Seeds Dataset: A simple dataset that’s useful for classification, it offers information about three wheat varieties analyzed with a soft X-ray technique.
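If you’re wondering what "useful for classification" looks like in practice, here is a rough sketch of a k-nearest-neighbors classifier in plain Python. The measurements, the two feature names (area and perimeter), and the `variety_1`/`variety_2`/`variety_3` labels are invented for illustration and are not rows from the real Wheat Seeds file; the `predict` helper is likewise just a hypothetical name.

```python
import math
from collections import Counter

# Made-up (area, perimeter) measurements for three hypothetical varieties.
# These are NOT rows from the real Wheat Seeds file; they only show the shape of the task.
TRAIN = [
    ((15.2, 14.8), "variety_1"),
    ((14.9, 14.6), "variety_1"),
    ((15.5, 14.9), "variety_1"),
    ((18.9, 16.3), "variety_2"),
    ((19.2, 16.5), "variety_2"),
    ((18.5, 16.1), "variety_2"),
    ((11.8, 13.2), "variety_3"),
    ((12.1, 13.4), "variety_3"),
    ((11.5, 13.0), "variety_3"),
]


def predict(features, train=TRAIN, k=3):
    """Return the majority label among the k nearest training points (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: math.dist(features, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(predict((19.0, 16.4)))
print(predict((12.0, 13.3)))
```

With the real dataset you would load every available measurement, scale the features so no single one dominates the distance, and hold out some rows to check accuracy. A library such as scikit-learn handles those steps for you, but the idea is the same.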
Leveraging Open Datasets for Your Data Science Practice
There are so many great open datasets you can use to practice your craft, build your dream projects, and expand your portfolio. Whether you’re building for your current employer or dreaming up new projects, these datasets offer great machine learning training without the cost of buying expensive private data collections.
Do the ODSC community a huge favor and comment with your favorite machine learning open datasets below. Are there ones on the list that you’ve worked with? Ones we didn’t mention? Let us know.
Read the 2019 datasets here to refresh your memory and add to your collections of machine learning open datasets!
How to Learn More about ML and How to Use These Machine Learning Open Datasets
At our upcoming event this November 16th-18th in San Francisco, ODSC West 2021 will feature a plethora of talks, workshops, and training sessions on machine learning and machine learning open datasets. You can register now for 30% off all ticket types before the discount drops to 20% in a few weeks. Some highlighted sessions on machine learning include:
- Towards More Energy-Efficient Neural Networks? Use Your Brain!: Olaf de Leeuw | Data Scientist | Dataworkz
- Practical MLOps: Automation Journey: Evgenii Vinogradov, PhD | Head of DHW Development | YooMoney
- Applications of Modern Survival Modeling with Python: Brian Kent, PhD | Data Scientist | Founder The Crosstab Kite
- Using Change Detection Algorithms for Detecting Anomalous Behavior in Large Systems: Veena Mendiratta, PhD | Adjunct Faculty, Network Reliability and Analytics Researcher | Northwestern University
Sessions on MLOps:
- Tuning Hyperparameters with Reproducible Experiments: Milecia McGregor | Senior Software Engineer | Iterative
- MLOps… From Model to Production: Filipa Peleja, PhD | Lead Data Scientist | Levi Strauss & Co
- Operationalization of Models Developed and Deployed in Heterogeneous Platforms: Sourav Mazumder | Data Scientist, Thought Leader, AI & ML Operationalization Leader | IBM
- Develop and Deploy a Machine Learning Pipeline in 45 Minutes with Ploomber: Eduardo Blancas | Data Scientist | Fidelity Investments
Sessions on Deep Learning:
- GANs: Theory and Practice, Image Synthesis With GANs Using TensorFlow: Ajay Baranwal | Center Director | Center for Deep Learning in Electronic Manufacturing, Inc
- Machine Learning With Graphs: Going Beyond Tabular Data: Dr. Clair J. Sullivan | Data Science Advocate | Neo4j
- Deep Dive into Reinforcement Learning with PPO using TF-Agents & TensorFlow 2.0: Oliver Zeigermann | Software Developer | embarc Software Consulting GmbH
- Get Started with Time-Series Forecasting using the Google Cloud AI Platform: Karl Weinmeister | Developer Relations Engineering Manager | Google