This time last year we brought you a detailed report of all the important updates for popular data science (machine learning and deep learning) frameworks throughout 2018. The developers of these frameworks continue to innovate at an accelerated rate. Given the deep shortage of data science skill sets, data scientists demand more powerful tools to get work done faster and more efficiently; the idea is that better tools let the existing pool of data scientists accomplish more. In this article, I’ll go back and review the status of a number of popular data science frameworks to better understand what’s new and exciting. By all indications, 2020 should be a very productive year!
TensorFlow has long been one of the most popular data science frameworks. It’s an open source AI framework that uses data flow graphs to build models, and it allows developers to create large-scale neural networks with many layers. It is arguably the most popular deep learning framework, and 2019 saw a major leap forward with the TensorFlow 2.0 release (originally announced in January 2019 and released in September), a major milestone in the framework’s development. Over the years, the main complaint from data scientists about TensorFlow was its complicated, somewhat black-box style. Version 2.0’s design is aimed directly at solving that problem by making TensorFlow more user-friendly. Here is a short list of important changes included in TensorFlow 2.0:
- Refining the API—as 1.x development proceeded over time, the API became somewhat untidy as the framework’s functionality was expanded. Version 2.0 included much API cleanup to simplify and unify the TensorFlow API. The framework is now much simpler and easier to use.
- Eager execution is performed by default – eager execution provides an imperative interface to TensorFlow. With eager execution enabled, TensorFlow functions execute operations immediately as opposed to adding to a graph to be executed later. In addition, concrete values are returned as opposed to symbolic references to a node in a computational graph.
- Keras is the high-level API—Keras has become the official high-level API of TensorFlow in release 2.0. When you install TensorFlow 2.0, it comes with Keras. It integrates seamlessly with TensorFlow without the need for any sort of interface code. This is great news for many data scientists since Keras is the API of choice for building deep learning models.
- No more queue runners—in 2.0 queue runners have been deprecated and completely replaced with the tf.data module to construct input pipelines.
- Backward compatibility with TensorFlow 1.x code—this eases the transition to 2.0 if you already have a code base consisting of 1.x code.
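To make two of the changes above concrete, here is a minimal sketch of eager execution and a tf.data input pipeline. It assumes TensorFlow 2.x is installed; the tensor values and the toy dataset are made up for illustration and do not come from the release notes.

```python
import tensorflow as tf

# With eager execution (the 2.x default), operations run immediately
# and return concrete values rather than symbolic graph nodes.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.matmul(a, a)
print(b.numpy())  # a plain NumPy array; no Session is needed

# tf.data takes over the input-pipeline role that queue runners had.
# The features and labels below are illustrative placeholders.
features = tf.constant([[0.0], [1.0], [2.0], [3.0]])
labels = tf.constant([0, 1, 0, 1])
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(2)

# In eager mode a dataset can be iterated like any Python iterable.
batch_shapes = [x.shape.as_list() for x, _ in dataset]
print(batch_shapes)
```

Because values are concrete, ordinary Python debugging tools (print statements, breakpoints) work directly on intermediate results.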
Keras is an open-source framework for deep learning gaining much popularity among data scientists. Keras is a high-level API capable of running on top of TensorFlow, CNTK, Theano, or MXNet. Version 2.3.0 of the Keras framework was released in September 2019 with a point release (2.3.1) a month later. Keras 2.3.0 is the first release of multi-backend Keras that supports TensorFlow 2.0.
This release brings the API in sync with the tf.keras API (TensorFlow’s high-level API for building and training deep learning models) as of TensorFlow 2.0. However, it does not support most TensorFlow 2.0 features, in particular “eager execution.” If you need these features, use tf.keras.
This is also the last major release of multi-backend Keras. Going forward, it is recommended that users consider switching their Keras code to tf.keras in TensorFlow 2.0. It implements the same Keras 2.3.0 API and switching should be as easy as changing the Keras import statements. But it has many advantages for TensorFlow users, such as support for eager execution, distribution, TPU training, and generally far better integration between low-level TensorFlow and high-level concepts like Layer and Model.
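As a rough illustration of the import switch described above, the sketch below builds a tiny model with tf.keras. It assumes TensorFlow 2.x is installed; the layer sizes and the random input data are arbitrary choices for the example, and the commented-out lines show what the equivalent multi-backend Keras imports would typically look like.

```python
# Multi-backend Keras imports would typically look like:
#   from keras.models import Sequential
#   from keras.layers import Dense
# With tf.keras, only the import path changes:
import numpy as np
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

# Layer sizes are arbitrary, chosen only to keep the example small.
model = Sequential([
    Dense(8, activation="relu"),
    Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Random data stands in for a real dataset.
x = np.random.rand(16, 4).astype("float32")
y = np.random.randint(0, 3, size=(16,))
model.fit(x, y, epochs=1, batch_size=8, verbose=0)

probs = model.predict(x, verbose=0)
print(probs.shape)
```

The rest of the model code (layers, compile, fit, predict) is unchanged, which is what makes the migration mostly a matter of editing imports.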
There were 7 releases of scikit-learn in 2019 and one small release already in 2020. Most of the point releases during 2019 were bug fixes, but to give you an idea of how rapidly this data science framework is evolving, here are the release highlights for the major year-end release, 0.22.0:
- New plotting API
- Stacking classifier and regressor
- Permutation-based feature importance
- Native support for missing values for gradient boosting
- Precomputed sparse nearest neighbor graph
- KNN based imputation
- Tree pruning
- Retrieve dataframes from OpenML
- Checking scikit-learn compatibility of an estimator
- ROC AUC now supports multiclass classification
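Several of these additions can be tried together in a few lines. The following is a minimal sketch, assuming scikit-learn 0.22 or later; the choice of the iris dataset, the base estimators, and every parameter value is illustrative rather than recommended.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.impute import KNNImputer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Knock out a few values so KNNImputer has something to fill in.
rng = np.random.RandomState(0)
X_missing = X.copy()
rows = rng.randint(0, X.shape[0], 10)
cols = rng.randint(0, X.shape[1], 10)
X_missing[rows, cols] = np.nan
X_filled = KNNImputer(n_neighbors=5).fit_transform(X_missing)

X_train, X_test, y_train, y_test = train_test_split(
    X_filled, y, random_state=0, stratify=y
)

# Stack two base models under a logistic-regression meta-learner.
# ccp_alpha turns on cost-complexity pruning for the decision tree.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("tree", DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_train, y_train)

# roc_auc_score handles multiclass targets via the multi_class argument.
auc = roc_auc_score(y_test, stack.predict_proba(X_test), multi_class="ovr")

# Permutation importance: how much the score drops when a feature is shuffled.
result = permutation_importance(
    stack, X_test, y_test, n_repeats=5, random_state=0
)
print(round(auc, 3), result.importances_mean)
```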
Another popular data science framework that saw a number of important additions and enhancements in 2019 was PyTorch, an open source machine learning framework that accelerates the path from research prototyping to production deployment. PyTorch provides two high-level features: tensor computation (like NumPy) with strong GPU acceleration, and deep neural networks built on a tape-based autograd system. You can reuse your favorite Python packages such as NumPy, SciPy and Cython to extend PyTorch when needed.
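The two high-level features mentioned above can be sketched in a few lines. This assumes PyTorch and NumPy are installed; the numbers are arbitrary and chosen so the gradient is easy to check by hand.

```python
import numpy as np
import torch

# Tensors behave much like NumPy arrays and convert in both directions.
x_np = np.array([1.0, 2.0, 3.0], dtype=np.float32)
x = torch.from_numpy(x_np)
doubled = (x * 2).numpy()

# Autograd records operations as they run, then replays them backward.
w = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)
loss = ((w * x) ** 2).sum()
loss.backward()

# Analytically, d(loss)/dw = 2 * w * x**2.
expected_grad = 2 * w.detach() * x ** 2
print(w.grad, expected_grad)

# Use a GPU only if one is available; otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dot = (x.to(device) @ x.to(device)).item()
print(dot)
```

The same tensor code runs on CPU or GPU; only the device placement changes.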
2019 saw several releases, including PyTorch 1.2 and PyTorch 1.3, culminating in PyTorch 1.4, which was released on January 15, 2020. Important additions and enhancements include tools for elastic training (PyTorch Elastic) and large-scale computer vision, along with a new end-to-end framework for large-scale training of state-of-the-art image and video classification models, which allows researchers to quickly prototype and iterate on large distributed training jobs at the scale of billions of images.
Additional features include: PyTorch Mobile build level customization, distributed model parallel training, Java bindings, and new releases for all three domain libraries alongside the PyTorch 1.4 core release (torchvision 0.5, torchaudio 0.4, and torchtext 0.5).
MLlib is Apache Spark’s scalable machine learning library. There were 9 releases of Spark in 2019, culminating with Spark 2.4.4 in September as well as the Preview Release of Spark 3.0 on December 23, 2019. Each Spark release includes an updated version of MLlib. The latest MLlib Guide details improvements in the machine learning framework.
Chainer is a Python-based, standalone open source framework for deep learning models. Chainer provides a flexible, intuitive, and high performance means of implementing a full range of deep learning models, including state-of-the-art models such as recurrent neural networks and variational autoencoders.
On December 5, 2019, Preferred Networks (the Japanese company behind Chainer) released Chainer and CuPy v7.0.0. New features in Chainer include: most features of Chainer, including ChainerMN, are now compatible with the ChainerX ndarray; ONNX-Chainer is integrated into Chainer; TabularDataset is added as a rich abstraction of columnar datasets with pandas-like manipulations; NHWC support is added; and performance for convolutions and batch normalization is greatly improved on GPUs with Tensor Cores. As for CuPy v7, new features include support for NVIDIA cuTENSOR and CUB for better performance, and experimental support for ROCm, which means CuPy now runs on AMD GPUs.
The company also announced it is changing its primary framework to PyTorch. After reviewing the available frameworks, the company felt that PyTorch is the closest in spirit to the Chainer style of code and is therefore the appropriate replacement. They expect that Chainer v7 will be the last major release for Chainer, and further development will be limited to bug fixes and maintenance. CuPy will continue its development as before. Although developed as a GPU backend for Chainer, it has been widely adopted by different communities and is relatively unusual in accelerating computation with GPUs using NumPy syntax.
Apache MXNet (incubating) is an open source deep learning framework suited for flexible research prototyping and production. The current release, 1.5, was announced on June 29, 2019. The complete release notes for MXNet 1.5 are available online. Feature highlights include:
- Automatic mixed precision – AMP automatically applies the guidelines of FP16 training (FP16 refers to half-precision, or 16-bit, floating point, as opposed to the standard 32-bit floating point, FP32), using FP16 precision where it is most beneficial while keeping operations that are unsafe in FP16 in full FP32 precision.
- MKL-DNN reduced precision inference – Two advanced features, fused computation and reduced-precision kernels, were introduced by the Intel MKL-DNN (Math Kernel Library for Deep Neural Networks) back end in the recent version. These features can significantly speed up inference on CPUs for a broad range of deep learning topologies and applications, including image classification, object detection, and natural language processing.
- Dynamic shape – MXNet now supports Dynamic Shape in both imperative and symbolic mode. MXNet used to require that operators statically infer the output shapes from the input shapes. However, for some operators, the shape of the output depends on more than just the shape of the input.
- Large tensor support – By default, MXNet supports a maximum tensor size of 2^32. This is because uint32_t is used as the default data type for tensor size as well as for variable indexing. The limitation has caused many problems when larger tensors are used in a model, so a systematic approach to supporting large tensors was needed. You can now enable large tensor support by setting the build flag USE_INT64_TENSOR_SIZE = 1 (the default is 0).
- Dependency update – MXNet has added support for CUDA 10, CUDA 10.1, cuDNN 7.5, NCCL 2.4.2, and NumPy 1.16.0.
- Gluon fit API – Training a model in Gluon requires users to write the training loop. This is useful because of its imperative nature; however, repeating the same boilerplate code across multiple models can become tedious. The training loop can also be overwhelming to users who are new to deep learning. The current release introduces an Estimator and Fit API to simplify the training loop.
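Two of the items above, mixed precision and large tensor support, come down to numeric limits that are easy to check by hand. The sketch below uses only the Python standard library rather than MXNet itself; the helper `to_fp16` and the example tensor shape are hypothetical names and values introduced for illustration.

```python
import struct

def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

# Near 1.0, FP16 values are spaced about 0.001 apart, so a small
# weight update can be rounded away entirely.
small_update = to_fp16(1.0 + 1e-4)

# Very small gradient values underflow to zero in FP16.
tiny_gradient = to_fp16(1e-8)

# 65504 is the largest finite FP16 value; anything much larger overflows.
largest = to_fp16(65504.0)
try:
    to_fp16(70000.0)
    overflowed = False
except OverflowError:
    overflowed = True

print(small_update, tiny_gradient, largest, overflowed)

# Large tensor support: a uint32 size caps a tensor at 2**32 entries.
UINT32_LIMIT = 2 ** 32
rows, cols = 100_000, 50_000  # hypothetical embedding-table shape
needs_int64 = rows * cols > UINT32_LIMIT
print(rows * cols, needs_int64)
```

These are the kinds of operations (accumulating small updates, summing tiny gradients, handling large activations) that a mixed-precision scheme keeps in FP32.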
fastai is a free and open source library for deep learning that simplifies training fast and accurate neural nets. The library sits on top of PyTorch and provides a single consistent API to the most important deep learning applications and data types. Since its initial 1.0 release in October 2018, the folks at fastai have been hard at work throughout 2019 with a very healthy 20 releases for the year. The current release is 1.0.60. For a relative newcomer to the deep learning framework community, fastai is clearly serious about making an impact.
So, those are the major updates to the most popular data science frameworks from 2019. Maybe I’ll see you at the start of next year for the next edition of this series.