

Learn How to Organize, Cleanup and Process Medical Image Datasets for Computer Vision Training
Blogs from ODSC Speakers | Guest Contributor | Modeling | Computer Vision
Posted by Jesse Freeman, April 23, 2019

We are entering a whole new world in which AI, and more specifically computer vision, can assist with medical decisions we have long relied on doctors to make. While fully doctor-less diagnosis is still science fiction, every day we get closer to making it a reality. During the “Putting the ‘Data’ in Data Scientist” workshop, we will examine how to handle large amounts of data at scale with the help of deep learning ops, or DeepOps as we call it. DeepOps is a set of methodologies, tools, and cultural practices through which data engineers and scientists collaborate to build faster and more reliable deep learning pipelines.
While we are still in the early stages of detecting cancer and other conditions from medical images alone, new advancements in deep learning are creating challenges that budding data scientists and engineers need to understand. The good news is that the datasets for this kind of early-stage detection are publicly available, so anyone can access and use them for training.
The ChestXray14 Dataset
In 2017, the NIH Clinical Center released one of the most extensive publicly available chest x-ray datasets, ChestXray14, to the scientific community. It includes data from over 30,000 patients, including many with advanced lung disease. This kind of data had never before been made available to the general public.
The goal of releasing it was to help data scientists one day bridge the gap between computers and doctors by building software that can confirm the findings of radiologists. A solution that can correctly analyze these images may one day help reduce misdiagnoses and save lives. Several papers have already addressed this topic, such as ChestX-ray8: Hospital-Scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases.
The ChestX-ray dataset comprises 112,120 frontal-view X-ray images of 30,805 unique patients, annotated with fourteen distinct, text-mined disease labels (each image can carry multiple labels) extracted from the associated radiological reports using natural language processing. These labels cover fourteen common thoracic pathologies.
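To make the multi-label structure concrete, here is a minimal sketch of turning one of these text-mined label strings into a binary vector suitable for training. It assumes the labels are stored as pipe-separated strings, as in the dataset's public metadata CSV; the fourteen pathology names are taken from the ChestXray14 release.

```python
# The fourteen thoracic pathologies labeled in ChestXray14.
PATHOLOGIES = [
    "Atelectasis", "Cardiomegaly", "Effusion", "Infiltration",
    "Mass", "Nodule", "Pneumonia", "Pneumothorax",
    "Consolidation", "Edema", "Emphysema", "Fibrosis",
    "Pleural_Thickening", "Hernia",
]

def encode_labels(finding_labels: str) -> list:
    """Map a pipe-separated label string (e.g. "Cardiomegaly|Effusion")
    to a 14-dimensional binary vector, one slot per pathology."""
    present = set(finding_labels.split("|"))
    return [1 if p in present else 0 for p in PATHOLOGIES]

# An image can carry multiple labels at once:
vec = encode_labels("Cardiomegaly|Effusion")
```

Because each image can belong to several classes simultaneously, training on this data is a multi-label classification problem rather than a single-class one, which affects both the loss function and the evaluation metrics you choose.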
The circular diagram from the ChestX-ray8: Hospital-Scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases paper shows the proportion of multi-label images in each of eight pathology classes, along with the labels’ co-occurrence statistics.
Unfortunately, managing a dataset this large on your own can be a challenge, especially when you are new to the field. How do you query and version-control 112,120 images that add up to 45 GB of data? Do you want to store another 45 GB every time you add or change an image? What if you could version all of this data and pull out only what you need, with a few lines of code, in a matter of seconds, to speed up testing?
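The idea of querying metadata rather than copying the archive can be sketched in a few lines. The file names and the "Finding Labels" field below are assumptions modeled on the ChestXray14 metadata CSV; the rows are illustrative stand-ins for the real 112,120-row file.

```python
import csv
import io

# A tiny stand-in for the dataset's metadata CSV (illustrative rows only).
metadata_csv = """Image Index,Finding Labels
00000001_000.png,Cardiomegaly
00000002_000.png,No Finding
00000003_001.png,Cardiomegaly|Effusion
"""

def query_images(csv_text: str, pathology: str) -> list:
    """Return the image files whose pipe-separated label string
    contains the requested pathology."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["Image Index"] for row in reader
            if pathology in row["Finding Labels"].split("|")]

# Pull only the Cardiomegaly images instead of touching all 45 GB.
subset = query_images(metadata_csv, "Cardiomegaly")
```

The point of the design is that the metadata is tiny compared to the images, so you can filter it cheaply and fetch only the matching files, rather than duplicating the whole archive for every experiment.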
What You’ll Learn
Together, we’ll look at a publicly available dataset, ChestXray14, for reference and learn how to:
- Organize one of the largest publicly available chest x-ray datasets.
- Correctly correlate and tag medical images to corresponding metadata.
- Discuss strategies for storing this data.
- Prepare and stream data for deep learning training.
- View and version data with MissingLink.ai’s query tool.
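As a taste of the "prepare and stream" step above, here is a minimal sketch of batching file paths with a generator so that images are loaded lazily rather than all at once. The file names are hypothetical; in a real pipeline each batch of paths would be handed to an image decoder such as PIL or OpenCV.

```python
# Stream file paths in fixed-size batches so that only one batch of
# images needs to be decoded and held in memory at a time.
def batch_stream(paths, batch_size):
    """Yield successive slices of `paths`, each at most `batch_size` long."""
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]

# Hypothetical file names standing in for the real x-ray images.
all_paths = ["img_%d.png" % i for i in range(10)]
batches = list(batch_stream(all_paths, 4))
```

Because the generator yields one batch at a time, the same loop works whether the dataset has ten images or 112,120; only the batch currently being trained on needs to fit in memory.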
Afterward, you’ll walk away with the knowledge of how to automate data management, exploration, and versioning in your deep learning projects.
Don’t forget to sign up to reserve your space on the ODSC workshop page!
– Jesse Freeman (@jessefreeman), Chief Evangelist at MissingLink.ai