Data Storage Keeping Pace for AI and Deep Learning
Posted by Daniel Gutierrez, ODSC, on October 2, 2018
Data is the new currency driving accelerated levels of innovation powered by AI. Enterprises require modern data storage architectures purpose-built for deep learning and designed to shorten the time to insights while simplifying complex big data pipelines. Continuing to use legacy storage systems, however, will introduce serious complications in this march toward innovation. In this article, we’ll review why traditional data storage is unable to meet the demands of deep learning, and also take a snapshot of what an AI-enabled storage architecture should look like.
Traditional Data Storage Unable to Meet Deep Learning Demands
Rapid AI evolution today is powered by leading-edge technologies like deep learning algorithms and GPU hardware equipped for massively parallel computation. Legacy storage systems, however, were largely built on aging architectures designed in the serial era. These systems may work well only with normalized data sources and are not optimized for the kind of unstructured and semi-structured data often used in AI and deep learning. As a result, they have become the primary bottleneck for AI applications, and the performance gap between compute and storage continues to widen.
Deep learning applications require a significant level of computational power coupled with massive amounts of data. The primary workloads consist of neural network training processes in which GPUs consume vast amounts of data gathered from many sources, such as specialized sensors on autonomous vehicles that collect images, radar, and LIDAR data. Deep learning frameworks such as TensorFlow and PyTorch process the data for training purposes.
Once a robust GPU-based compute platform has been chosen, attention turns to storage. Early tests with GPU-based architectures often expose the storage system as unable to feed the GPUs with enough data to keep them busy; in other words, the storage cannot deliver enough ingest throughput to maximize training performance on GPU-powered servers. What is needed is true linear scaling of capacity and performance, in solutions well suited to modern analytics workloads for AI and deep learning.
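One way to make "ingest throughput" concrete is to time how fast a sample of training files can be read from the storage under test. The sketch below is only an illustration: the function name `measure_read_throughput` and the 1 MiB chunk size are assumptions, not part of any framework, and it measures a single sequential reader rather than a full training pipeline.

```python
import time


def measure_read_throughput(paths, chunk_size=1 << 20):
    """Read every file in `paths` once and return the rate in MB/s.

    Hypothetical helper for illustration only. It uses one reader thread,
    so it gives a baseline, not the peak a parallel loader could reach.
    """
    total_bytes = 0
    start = time.perf_counter()
    for path in paths:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero-length interval on very small inputs.
    return (total_bytes / 1e6) / max(elapsed, 1e-9)
```

Keep in mind that a second pass over the same files may be served from the operating system's page cache, which would overstate what the storage itself can deliver.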
Source: Pure Storage
Data sets for each training run can reach the terabyte range, and a run can take weeks to complete if the architecture is not properly optimized for such demands. Combining modern compute with modern storage shortens training runs to the point where data scientists can iterate on their models as quickly as they expect to. For competitive reasons, enterprises face mounting business pressure to iterate faster and develop new neural network models.
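A rough back-of-the-envelope calculation shows how storage throughput alone can stretch a training run. The sketch below assumes, purely for illustration, that training is storage-bound (reading the data is the only limit) and that every epoch re-reads the full data set; the function names and the example figures are hypothetical.

```python
def hours_per_pass(dataset_tb, throughput_gb_per_s):
    """Hours needed to stream the whole data set once at a sustained rate."""
    dataset_gb = dataset_tb * 1000  # decimal units: 1 TB = 1000 GB
    seconds = dataset_gb / throughput_gb_per_s
    return seconds / 3600


def run_days(dataset_tb, throughput_gb_per_s, epochs):
    """Lower bound on run length in days if every epoch re-reads the data."""
    return hours_per_pass(dataset_tb, throughput_gb_per_s) * epochs / 24


# Illustrative numbers only: a 10 TB data set read for 50 epochs.
slow = run_days(10, 1.0, 50)   # storage sustaining 1 GB/s
fast = run_days(10, 5.0, 50)   # storage sustaining 5 GB/s
print(f"{slow:.1f} days vs {fast:.1f} days")
```

Real runs are also limited by compute, preprocessing, and caching, so these figures are a floor on I/O time rather than a prediction.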
There are good compute options for deep learning, such as GPU-based hardware, that allow data scientists to reduce training run intervals. GPU utilization translates directly into data scientist productivity, so it’s important to test the architecture in a hands-on environment. As the basis for calculating storage costs, you need to evaluate how many GPUs a storage solution of a given size can supply with data without throttling them.
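The "how many GPUs can this storage feed" question reduces to simple arithmetic once you have a measured storage throughput and a per-GPU consumption rate. The sketch below is a minimal illustration; the function names and the sample figures (samples per second, average sample size) are assumptions that you would replace with numbers from your own hands-on tests.

```python
import math


def per_gpu_demand_mb_s(samples_per_s, avg_sample_kb):
    """Data rate one GPU consumes, in MB/s (decimal units)."""
    return samples_per_s * avg_sample_kb / 1000


def gpus_fed_without_throttling(storage_mb_s, samples_per_s, avg_sample_kb):
    """Whole number of GPUs the storage can keep busy at full rate."""
    demand = per_gpu_demand_mb_s(samples_per_s, avg_sample_kb)
    return math.floor(storage_mb_s / demand)


# Illustrative numbers only: storage sustaining 5000 MB/s, each GPU
# consuming 2000 samples/s at an average of 110 KB per sample.
print(gpus_fed_without_throttling(5000, 2000, 110))  # 220 MB/s per GPU -> 22
```

Dividing the cost of the storage system by this GPU count gives a storage cost per fully fed GPU, which is one way to compare solutions of different sizes.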
Modern Data Storage for Deep Learning
Contemporary AI and deep learning applications demand fast and efficient data storage that legacy storage systems are no longer able to provide. A new approach is needed, consisting of innovative storage architectures that provide state-of-the-art performance in terms of concurrency, capacity, density, throughput, latency, and I/O operations per second.
There is a need for AI-enabled data centers populated with servers consisting of multi-core CPUs and GPUs using parallel processing and extremely fast networks. The storage solution should have a modern, massively parallel architecture, eliminating serial bottlenecks that impede legacy storage systems. It should be engineered to deliver the performance essential for AI and deep learning while offering simplicity, so data scientists can focus on training and inference, not details of infrastructure.
The seemingly simple process of data transfer, moving a block of data from a disk drive to a compute cluster, involves more than meets the eye, and the method of transfer used can itself slow down processing. Fortunately, a storage system can be tuned to meet the needs of deep learning, and it’s important to realize that those needs may be very different from those of other enterprise applications.
There are a number of factors that influence the ability of a storage system to provide speedy data access to a deep learning framework:
- The storage solution should use a highly parallelized architecture.
- The storage solution should work to eliminate I/O bottlenecks.
- The storage solution should keep the data near the deep learning cluster that will consume it.
- The storage solution should reduce latencies and accelerate write speeds.
- The storage solution should be able to optimize itself to respond to patterns associated with unstructured data being accessed randomly.
- The storage solution should remove inter-GPU traffic and move data quickly between multiple storage systems and the compute elements hosting the deep learning algorithms.
- The storage solution should be continuously optimized and then tested with AI and deep learning applications, network architectures and commercially available GPUs.
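Two of the points above, parallel access and randomly accessed unstructured data, also shape the client side of the pipeline. The following sketch shows the access pattern in miniature using only the Python standard library: files are read in a shuffled order by a pool of threads so that several requests are in flight at once. The names `read_file` and `load_shuffled` and the default of 8 workers are assumptions for illustration; in practice the input pipelines built into frameworks such as TensorFlow (`tf.data`) and PyTorch (`torch.utils.data.DataLoader`) would usually play this role.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    """Read one file fully and return its bytes."""
    with open(path, "rb") as f:
        return f.read()


def load_shuffled(paths, workers=8, seed=0):
    """Read files in a shuffled order with overlapping requests.

    Returns a list of (path, bytes) pairs in the shuffled order.
    Hypothetical helper: it holds everything in memory, so it only
    suits small samples, not a full training set.
    """
    order = list(paths)
    random.Random(seed).shuffle(order)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps results in the order of `order`,
        # while the underlying reads run concurrently.
        return list(zip(order, pool.map(read_file, order)))
```

Running this with different `workers` values against the same storage gives a quick feel for how well it handles concurrent random reads, which is the pattern shuffled training data tends to produce.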
With storage solutions fully optimized for AI and deep learning applications, data scientists, data engineers, and academic researchers can focus their full attention on what matters most: transforming valuable data assets into even more valuable insights, quickly and accurately.