2018 Major Updates to the Most Popular Data Science Frameworks
Posted by Alex Landa, ODSC, January 23, 2019
New data science projects, tools, and frameworks are popping up at a hectic pace. However, most of us are already using many of the major and most popular ones. Read on and check out major releases of some of the most popular frameworks you may have missed in 2018.
PyTorch: The release of 1.0.0
The PyTorch open-source machine learning library is quickly becoming the go-to for machine learning and NLP pros, with big names like Facebook and Uber contributing to its resources. The biggest news from 2018 is the long-awaited release of version 1.0.0, the first stable version. This release introduced the JIT, a set of compiler tools that bridge the gap between research in PyTorch and production. It allows for the creation of models that can run without a dependency on the Python interpreter and that can be optimized more aggressively.
Other important features added in 1.0.0 include the C++ frontend, a pure C++ interface to the PyTorch backend that follows the API and architecture of the established Python frontend, and Torch Hub, a pre-trained model repository designed to facilitate research reproducibility.
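To make the JIT idea more concrete, here is a minimal sketch of one common workflow: tracing a model, serializing it, and loading it back. The TinyNet class and its layer sizes are hypothetical and exist only for illustration; torch.jit.trace, torch.jit.save, and torch.jit.load are the PyTorch calls being demonstrated.

```python
import io

import torch


class TinyNet(torch.nn.Module):
    """A hypothetical two-layer model used only for illustration."""

    def __init__(self):
        super().__init__()
        self.hidden = torch.nn.Linear(4, 8)
        self.out = torch.nn.Linear(8, 2)

    def forward(self, x):
        return self.out(torch.relu(self.hidden(x)))


model = TinyNet().eval()
example_input = torch.rand(1, 4)

# Trace the model: run it once on example input and record the operations.
traced = torch.jit.trace(model, example_input)

# Serialize the traced model to an in-memory buffer (a file path works too).
buffer = io.BytesIO()
torch.jit.save(traced, buffer)
buffer.seek(0)

# Reload the serialized model; the TinyNet class is not needed for this step.
loaded = torch.jit.load(buffer)
```

The reloaded module should produce the same outputs as the original model. Tracing records only the operations executed for the example input, so models with data-dependent control flow may need torch.jit.script instead.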
Keras: Significant improvements on Keras 2.0
When Keras 2.0 was released in 2017, it proved to have significant improvements over 1.0. Since then, numerous improvements have been made, leading up to 2.2.4 as of October 2018.
In June 2018, release 2.2.0 brought API changes, new input modes, bug fixes, and performance improvements to the high-level neural network API.
Some notable features added with 2.2.0 include:
- A new API called Model subclassing was added for model definition.
- A new input mode that provides the ability to call models on TensorFlow tensors directly (TensorFlow backend only).
- An improved engine that follows a much more modular structure, improving code structure and code health and reducing test time.
- The Keras modules applications and preprocessing are now externalized to their own repositories, Keras-applications and Keras-preprocessing, respectively.
TensorFlow: Ease of use, Eager Execution, TF Lite, and an ML Crash Course
TensorFlow is an open-source software library for dataflow programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks. TensorFlow was developed by the Google Brain team for internal Google use for both research and production.
The biggest news for TF is the impending release of version 2.0, which is designed for ease of use, support for more platforms and languages, and better compatibility between them.
Some of the biggest changes came early in 2018 with version 1.5, such as the inclusion of eager execution, an experimental interface to TensorFlow that provides an imperative programming style. When you enable eager execution, TensorFlow operations execute immediately; you do not execute a pre-constructed graph with Session.run(). This version also introduced the developer preview of TensorFlow Lite, a lightweight solution for mobile and embedded devices that lets you take a trained TensorFlow model and convert it into a .tflite file, which can then be executed on a mobile device with low latency.
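The difference between graph-style and eager-style execution can be illustrated without TensorFlow itself. The following is a toy sketch in plain Python, not TensorFlow's API: the Node class and the add and mul helpers are hypothetical names invented here to mimic building a graph first and running it later, versus computing each result immediately.

```python
class Node:
    """A deferred operation: stores what to compute, not the result."""

    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs

    def run(self):
        # Evaluate any input nodes first, then apply this node's operation.
        values = [i.run() if isinstance(i, Node) else i for i in self.inputs]
        return self.op(*values)


def add(a, b):
    return a + b


def mul(a, b):
    return a * b


# Graph style: describe the whole computation first, execute it later.
graph = Node(mul, Node(add, 2, 3), 4)  # nothing has been computed yet
graph_result = graph.run()             # the computation happens here

# Eager style: each operation runs as soon as it is called,
# so intermediate values can be inspected right away.
eager_sum = add(2, 3)
eager_result = mul(eager_sum, 4)
```

Both styles arrive at the same value. The eager version exposes intermediate results such as eager_sum as ordinary Python values, which is what makes debugging with an imperative style more direct.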
Originally an internal tool meant to help Google employees learn about applied AI and ML basics, Google’s Machine Learning Crash Course (MLCC) was released for public use in March 2018. There are videos, workshops, case studies, and more to learn from.
Scikit-learn: Better support across the board and cleaning up what’s missing
Scikit-learn is a machine learning library for Python, known for its classification, regression, and clustering algorithms. Part of what makes scikit-learn special is its devoted support: it was originally created in 2007 by David Cournapeau as a Google Summer of Code project, and it has since been developed by numerous volunteers as part of a community of data science devotees.
Version 0.20.0, released in September 2018, stands out as the biggest one of the year. The release notes highlight support for common data science use cases, including missing values, categorical variables, heterogeneous data, and features or targets with unusual distributions. Missing values in features, represented by NaNs, are now accepted in column-wise preprocessing such as scalers: each feature is fitted disregarding NaNs, and data containing NaNs can be transformed. The new impute module provides estimators for learning despite missing data.
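The NaN handling described above can be sketched with a small example. The toy array below is made up for illustration; StandardScaler and SimpleImputer are the scikit-learn estimators being shown, and this assumes scikit-learn 0.20 or later.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy data: the second feature has one missing value.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
])

# The scaler fits each column while disregarding NaNs,
# and leaves the NaNs in place when transforming.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# The impute module fills missing entries, here with the column mean.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
```

Here the scaler's mean for the second column is computed from the two observed values only, and the imputer replaces the missing entry with that same column mean. In practice the two steps are often chained, for example in a Pipeline.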
FastAI: fastai v1 released and the training wheels are off
Fast.ai is still a newcomer to the data science field. Founded in 2016, fast.ai is a research lab with the mission to make AI more accessible, largely through deep learning and machine learning courses, libraries, and a passionate community.
In October 2018, fast.ai released the full 1.0 version of fastai, a free, open source deep learning library that runs on top of Facebook’s PyTorch framework.
“Fastai is the first deep learning library to provide a single consistent interface to all the most commonly used deep learning applications for vision, text, tabular data, time series, and collaborative filtering,” fast.ai cofounder Jeremy Howard said in the announcement blog. The library provides a single consistent API to the most important deep learning applications and data types.
Apache MXNet: The Great Expansion
The release of MXNet 1.0 in late 2017 included major updates such as Gluon (a high-level API), sparse tensors, Raspberry Pi support, and more. Quickly following up, version 1.1.0 in January 2018 provided performance improvements and major bug fixes, along with new features around operators and build optimization. Alongside some solid API additions came new options for building vocabularies and loading pre-trained word embeddings.
Release 1.2.0 in April expanded the project’s language support with a new Scala API, adding MXNet to the handful of deep learning libraries (DL4J, TensorFlow, ScalNet) that support Scala. A host of additional updates included the ability to import ONNX (Open Neural Network Exchange) models, support for model quantization with calibration, new integration with the MKL-DNN neural network accelerator library for Intel chips, and a slew of other minor features.
Version 1.3 updated the Gluon Model Zoo’s pre-defined and pre-trained models, which help bootstrap machine learning applications. An important new feature was the conversion of Gluon RNN layers to HybridBlocks. HybridBlocks combine declarative and imperative programming to provide the benefits of both: programmers can quickly develop and debug models with imperative programming and later switch easily to efficient declarative execution. This version also contained a host of experimental features, including Clojure support, sparse tensor support, a new memory pool type, and MXNet-to-ONNX model export.
Wrapping up the year in late December, Apache MXNet version 1.4.0 continued the project’s quest for broader language support with a major release that included a Java API and a Julia API. Given the expanded support for JVM languages, including Scala, Clojure, and Java, this version also added memory management, dispensing with the need to manually manage MXNet memory objects.
Chainer: Expanding the chain with various new features
Chainer is a Python-based, standalone open source framework that provides a flexible, intuitive, and high-performance means of implementing a full range of deep learning models. Chainer had two major updates in 2018, versions 4.0.0 and 5.0.0.
Version 4.0.0 introduced new features for accelerating deep learning computations and making the installation process easier. Links that use supported functions can now be exported to Caffe format, which makes it easier to deploy your models. Also, most functions were rewritten in the new style, which supports double backpropagation.
More recently, version 5.0.0 introduced its own series of updates including static subgraph optimization, Float16 support, ChainerMN integration, and the introduction of the chainer.distributions module which implements many parametric probability distributions with autograd capability.
Also pretty cool – Chainer is now listed as a choice for many of the Amazon Web Services applications as of mid-2018. The combination with AWS is notable as it leverages Chainer’s abilities in multi-GPU and multi-server scaling.
Of course, this list isn’t comprehensive of all frameworks or all updates. What would you add to this list? What updates are you excited for in 2019? Sound off!