Benchmarking a Computer Vision Deep Learning Pipeline with Distributed Computing

Editor’s Note: Jennifer is a speaker for ODSC East 2022. Be sure to check out her talk, “Creating a Benchmark for a Large-Scale Image Captioning Pipeline,” to learn more about computer vision deep learning there!

Computer vision plays an essential role in solving some of the most challenging problems across industries, including automotive (self-driving cars), pharmaceuticals (drugs, DNA binding, 3D modeling), merchandise (warehouse management), and energy (safety monitoring), among others. Our team decided to try our hand at a large-scale computer vision Kaggle competition. We were less interested in winning the competition than in working with an extensive computer vision dataset and seeing how efficiently we could do deep learning. Deep learning pipelines are notoriously tricky and time-consuming. In our talk, we will touch upon:

  1. How to optimize a deep learning pipeline.
  2. How some deep networks can run well on CPUs.
  3. General best practices for creating ever more complex computer vision pipelines.


The goal of the Kaggle competition was to create captions for organic chemistry images using a standardized chemical identifier, the International Chemical Identifier (InChI). The competition data contained images and their respective labels. The image quality was deliberately poor, since older works have poorer-quality images. Our goal, however, was to optimize the time needed for deep learning. For that reason, we created a derived dataset, which we used for all of our work going forward.

Figure: Example image from the derived dataset.

Many of the competition’s winners reported notoriously long training times for a single epoch. One person (the 12th-place winner) noted that one epoch took over 24 hours on a single GPU; another reported between 3 and 12 hours per epoch on a single GPU. Using a distributed system and CPUs, we reduced the training time for one epoch over the entire dataset of 2.4 million images to a little over 8 hours. We plan to test this on GPUs in a distributed fashion.
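To put these epoch times in perspective, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope calculation: it rounds "a little over 8 hours" down to 8, and it assumes the other competitors' epochs covered a comparable number of images, which the reports above do not confirm. The models and hardware also differ, so the ratio is indicative rather than a controlled comparison.

```python
def images_per_second(num_images: int, hours: float) -> float:
    """Average training throughput for one epoch."""
    return num_images / (hours * 3600)


NUM_IMAGES = 2_400_000  # size of the full dataset, from the text

# "A little over 8 hours" is rounded to 8, so this is an upper bound.
distributed_cpu = images_per_second(NUM_IMAGES, hours=8)

# Assumption: the 24-hour single-GPU epoch covered a similar image count.
single_gpu_slowest = images_per_second(NUM_IMAGES, hours=24)

print(f"Distributed CPUs: ~{distributed_cpu:.0f} images/s")
print(f"Slowest reported single GPU: ~{single_gpu_slowest:.0f} images/s")
print(f"Ratio of epoch times: {24 / 8:.1f}x")
```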

Our image captioning pipeline combined several classical techniques, including transfer learning, natural language processing, and natural language generation. Here’s a picture of the general pipeline. The encoder was a transfer learning model, and the decoder was an LSTM model. We chose this generic image captioning pipeline because it is simple to apply and troubleshoot.

Figure: Our image captioning pipeline, used with our derived dataset.
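To make the encoder-decoder structure concrete, here is a minimal PyTorch sketch. It is an illustration rather than the code used for the competition: the tiny convolutional `Encoder` is only a stand-in for a pretrained transfer learning backbone, and the class names, layer sizes, image size, and vocabulary size are all hypothetical choices.

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Stand-in for a pretrained backbone: maps an image to a feature vector."""

    def __init__(self, feature_dim: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global average pool -> (B, 16, 1, 1)
        )
        self.proj = nn.Linear(16, feature_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        pooled = self.conv(images).flatten(1)  # (B, 16)
        return self.proj(pooled)               # (B, feature_dim)


class Decoder(nn.Module):
    """LSTM that predicts the next caption token, conditioned on image features."""

    def __init__(self, vocab_size: int, feature_dim: int = 64,
                 embed_dim: int = 32, hidden_dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # The image features initialize the LSTM hidden and cell states.
        self.init_h = nn.Linear(feature_dim, hidden_dim)
        self.init_c = nn.Linear(feature_dim, hidden_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        h0 = torch.tanh(self.init_h(features)).unsqueeze(0)  # (1, B, H)
        c0 = torch.tanh(self.init_c(features)).unsqueeze(0)  # (1, B, H)
        embedded = self.embed(tokens)                        # (B, T, E)
        hidden, _ = self.lstm(embedded, (h0, c0))            # (B, T, H)
        return self.out(hidden)                              # (B, T, vocab)


# Toy forward pass: 2 grayscale 32x32 images, captions of 5 tokens each.
VOCAB_SIZE = 40  # hypothetical size of the caption character vocabulary
encoder, decoder = Encoder(), Decoder(vocab_size=VOCAB_SIZE)
images = torch.randn(2, 1, 32, 32)
tokens = torch.randint(0, VOCAB_SIZE, (2, 5))
logits = decoder(encoder(images), tokens)
print(logits.shape)
```

In a transfer learning setup, the encoder's convolutional layers would typically come from a network pretrained on another image task, often with some or all of its weights frozen, so that most of the training effort goes into the decoder.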

We ran all of our experiments on the Domino Data Lab platform using one of the available distributed computing frameworks. We chose Ray because of its specialized libraries for deep learning and its wrappers for classical approaches such as distributed data-parallel training.
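The idea behind distributed data-parallel training can be illustrated without a cluster. The sketch below is plain Python and does not use Ray's API: the "workers" are simulated in a loop, and the one-parameter linear model, function names, learning rate, and toy data are all made up for illustration. Each worker computes a gradient on its own shard of the data, the gradients are averaged, and every model copy applies the same update.

```python
def mse_gradient(w, batch):
    """Gradient of mean squared error for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def shard(data, num_workers):
    """Split data into equal-sized contiguous shards, one per worker.

    Any remainder that does not divide evenly is dropped in this sketch.
    """
    size = len(data) // num_workers
    return [data[i * size:(i + 1) * size] for i in range(num_workers)]


def data_parallel_step(w, data, num_workers, lr):
    # Each simulated worker computes a gradient on its own shard only.
    local_grads = [mse_gradient(w, s) for s in shard(data, num_workers)]
    # Averaging plays the role of the all-reduce step; every replica then
    # applies the same update, so the model copies stay in sync.
    avg_grad = sum(local_grads) / num_workers
    return w - lr * avg_grad


# Toy data generated from y = 3x, so the true weight is 3.0.
data = [(float(x), 3.0 * x) for x in range(1, 9)]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, data, num_workers=4, lr=0.01)

print(f"learned weight: {w:.4f}")
```

With equal-sized shards, the averaged gradient equals the gradient over the full batch, which is why splitting the work across workers changes the wall-clock time but not the update itself.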

Ray is a distributed computing framework whose libraries can distribute PyTorch or TensorFlow training over multiple worker nodes (CPUs or GPUs). We did this with much success. First, our team worked on the classical pipeline, which runs on a single GPU, using a small dataset. We examined wall-clock times and found that, for a small sample of 10,000 images, a single epoch took about 13 minutes on a single GPU, whereas on a set of distributed CPUs it took only 3 minutes. We are in the process of testing whether a distributed set of GPUs would speed this up even more. But given the GPU shortage at most cloud providers, we decided to focus instead on distributed CPU usage. It’s also well established that both time and hardware contribute to cost, and we attempted to optimize both. Our benchmarking experiments included single high-memory CPUs, a cluster of CPUs, and single high-memory GPUs.
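For readers who want to run a similar comparison, a wall-clock measurement and the resulting speedup can be sketched as follows. The helper names `wall_clock` and `speedup` are hypothetical, and the timed loop is only a stand-in for a real training epoch; the 13-minute and 3-minute figures are the ones quoted above for the 10,000-image sample.

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_clock(label, results):
    """Record the elapsed wall-clock seconds of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Timing a stand-in workload (a real benchmark would wrap one training epoch).
timings = {}
with wall_clock("toy_epoch", timings):
    sum(i * i for i in range(100_000))
print(f"toy_epoch took {timings['toy_epoch']:.4f} s")

# Speedup implied by the figures quoted in the text.
single_gpu_seconds = 13 * 60
cpu_cluster_seconds = 3 * 60
ratio = speedup(single_gpu_seconds, cpu_cluster_seconds)
print(f"Distributed CPUs vs. single GPU: {ratio:.1f}x faster")
```

Wall-clock time is the relevant measure here because it includes data loading and communication overhead, not just compute.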

Our talk at ODSC East 2022, “Creating a Benchmark for a Large-Scale Image Captioning Pipeline,” will describe in more detail how we optimized our pipeline, some preliminary results from our benchmarking efforts, and best practices for approaching this type of pipeline.

About the author/ODSC East 2022 Speaker on Computer Vision Deep Learning:

Jennifer Davis, Ph.D., is a Staff Field Data Scientist at Domino Data Lab, where she empowers clients on complex data science projects. She has completed two postdocs in computational and systems biology, trained at a supercomputing center at the University of Texas, Austin, and worked on hundreds of consulting projects with companies ranging from start-ups to the Fortune 100. Jennifer has previously presented on LSTMs and natural language generation at Association for Computing Machinery conferences, as well as at conferences across the US and in Italy. Jennifer was part of a panel discussion for an IEEE conference on artificial intelligence in biology and medicine. She has practical experience teaching both corporate classes and college-level courses. Jennifer enjoys working with clients and helping them achieve their goals.
