The 2019 Data Science Dictionary – Key Terms You Need to Know

The data science field is teeming with terminology, a confluence of terms from computer science, statistics, mathematics, and software engineering. In addition, the language of data science evolves very quickly. As a journalist and also a data scientist, I probably see the newest terms before many others in the ecosystem. I encounter them at conferences, on social media, LinkedIn, and Stack Overflow, in research papers, and in conversations with colleagues.

In this article, I’ll provide you with a brief dictionary of terms surrounding data science, including AI, machine learning, and deep learning. The list below consists of the 20 terms, in roughly alphabetical order, that I feel have the strongest momentum for 2019. Of course, data science has hundreds of terms that you need to be familiar with, but these are the hottest ones.

  1. Activation function: In neural networks, an activation function is applied to the weighted sum of a neuron’s inputs to produce that neuron’s output. Non-linear activation functions are what allow a network to learn non-linear decision boundaries. The ReLU (Rectified Linear Unit) is the most commonly used activation function right now, although the Tanh (hyperbolic tangent) and Sigmoid (logistic) activation functions are also used.

  2. Backpropagation: For this definition, I defer to a nice one I found by data scientist Mikio L. Braun on Quora: “Back prop is just gradient descent on individual errors. You compare the predictions of the neural network with the desired output and then compute the gradient of the errors with respect to the weights of the neural network. This gives you a direction in the parameter weight space in which the error would become smaller.”

  3. Blockchain: Blockchain is essentially a decentralized distributed database. Data scientists with access to blockchain data are able to build models and make predictions with cleaner, more reliable historical data. This is because the linked structure of blockchain makes it possible to trace the origin (as well as ownership changes) of any digital asset. This ability can provide key evidence in support of the authenticity of an object, asset, or record. This could lead to large amounts of highly structured, anonymized, and authenticated data assets with a transparent chronology of ownership.

  4. Convolutional Neural Network (CNN): A CNN is a common architecture used in deep learning, and is typically associated with computer vision and image recognition. CNNs employ the mathematical concept of convolution to simulate the neural connectivity lattice of the visual cortex in biological systems. Convolution can be viewed as a sliding window over a matrix representation of an image. This allows for the simulation of the overlapping tiling of the visual field.

  5. Cost function: A cost function represents a value to be minimized, like the sum of squared errors over a training set. Gradient descent is a method for finding the minimum of a function of multiple variables. So you can use gradient descent to minimize your cost function.

  6. Data Storytelling: The last step of the data science process involves communicating potentially complex machine learning results to project stakeholders who are not data science experts. Data storytelling is an important skill set for all data scientists.

    [Related article: Your Data is Garbage Unless it Tells a Story]

  7. DevOps/DataOps/MLOps/AIOps: DevOps is a software development methodology that couples software development (Dev) with IT operations (Ops) to shorten the application development life cycle while delivering features, updates, and bug fixes in an efficient manner that aligns with business goals. The application of continuous delivery and DevOps to data analytics is referred to as DataOps. MLOps refers to the collaboration between data scientists and IT operations professionals to help manage the production machine learning lifecycle. Finally, AIOps refers to software systems that couple big data and AI functionality to enhance and potentially replace a broad range of IT processes such as performance monitoring, availability monitoring, event analysis, IT services management, and automation.

  8. Docker containers: In a nutshell, a Docker container is a lightweight, operating-system-level form of virtualization that helps data scientists build, install, and run code. A container behaves much like a lightweight virtual machine (VM), although it shares the host’s kernel rather than running a full guest operating system. Its image is built from a script (a Dockerfile) that can be version controlled, resulting in the ability to version control a data science environment.

  9. Explainable AI: Explainable (interpretable) AI models strive to solve a recognized problem: as we generate newer and more innovative applications for neural networks, the question “How do they work?” becomes more and more important. Opening the black box to enable transparency matters more as we realize that we don’t really know why AI models make the choices they do. As models become more complex, the task of producing an interpretable version of the model becomes more difficult. There are several approaches toward resolving the explainable AI issue: Reverse Time Attention Model (RETAIN), Local Interpretable Model-Agnostic Explanations (LIME), and Layer-wise Relevance Propagation (LRP).

  10. Gradient descent: Gradient descent is an iterative optimization algorithm used while training a machine learning model. The algorithm adjusts the model’s parameters step by step, moving in the direction of the negative gradient, to drive a given function toward a local minimum. When the function is convex, that local minimum is also the global minimum.

    [Related article: Understanding the 3 Primary Types of Gradient Descent]

  11. GPU acceleration: GPU acceleration refers to the use of a graphics processing unit (GPU) along with a central processing unit (CPU) in order to speed up compute-intensive AI operations such as deep learning. A related term is GPU database, which is a database, relational or non-relational, that uses a GPU to accelerate certain database operations.

  12. H2O: H2O is a leading open source data science and machine learning platform for R and Python. H2O’s Driverless AI tool is an AI platform that automates some of the most difficult data science and machine learning workflows such as feature engineering, model validation, model tuning, model selection, and model deployment. It aims to achieve the highest predictive accuracy in the shortest amount of time, while minimizing the amount of data scientist time required. H2O’s mission is to democratize AI for all.

  13. Jupyter Notebooks: The Jupyter (an acronym using the names of several popular languages used by data scientists: Julia, Python, and R) Notebook is the tool of choice for many data scientists. It is an open-source web application that allows you to create and share documents that contain code, equations, visualizations, and narrative text. Jupyter Notebooks help data scientists streamline their work, increase their productivity, and collaborate with others.

  14. Long Short-Term Memory (LSTM): An LSTM network is a special kind of recurrent neural network (RNN) that is well suited to learning from and acting upon time-related data which may have undefined or unknown lengths of time between relevant events. LSTMs work very well on a wide range of problems and are now widely used. They were introduced in 1997 by Hochreiter & Schmidhuber, and were refined and popularized by many subsequent researchers.

  15. MXNet: MXNet is a popular and scalable deep learning framework. As an open source library, MXNet helps data scientists build, train, and run deep learning models.

  16. Natural Language Processing (NLP): NLP is a branch of AI that provides a vehicle for computers to understand, interpret, and manipulate natural (human) language. NLP is composed of elements from a number of fields including computer science and computational linguistics in order to bridge the separation between human communication and computer understanding.

  17. Recurrent Neural Network (RNN): An RNN is a type of neural network that works with sequences of data, where the output from the previous step is fed to the current step as input. In traditional neural networks, all the inputs and outputs are independent of one another. However, in some cases, such as predicting the next word of a sentence, there is a need to remember the previous words. RNNs meet this need with the help of a hidden layer. The distinguishing characteristic of an RNN is the hidden state, which retains information about a sequence. RNNs have, in a sense, some memory of what happened earlier in the sequence of data.

  18. Reinforcement learning: Reinforcement learning algorithms deal with the problem of finding suitable actions to take in a given situation in order to maximize a reward. The algorithm is not given examples of optimal behavior up front; instead, it must discover them by trial and error. When I first learned about reinforcement learning, I reflected back to the old Pac-Man video game. With reinforcement learning, using trial and error, the algorithm would find that certain uses of the button and movements of the joystick would improve the player’s score; moreover, the process of trial and error would tend toward an optimal state of the game.

  19. Transfer Learning: Transfer learning is a deep learning technique where a model developed for one task is repurposed as the starting point for a model on another task. It is especially popular for computer vision and natural language problems, where pre-trained deep learning models are commonly reused. This saves considerable computing resources that would otherwise be required to train deep neural networks from scratch for these problem domains.

  20. Tidyverse: The tidyverse is a very well thought-out collection of R packages for data manipulation, exploratory data analysis, and visualization that share a common design philosophy. The tidyverse was primarily developed by data science luminary Hadley Wickham, but is now being expanded by several other contributors. The goal of the tidyverse is to make data scientists more productive by providing a path through workflows that facilitates clear communication and results in reproducible work products.
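
To make two of the entries above more concrete, here is a minimal sketch of how a cost function (term 5) and gradient descent (term 10) fit together. It is plain Python rather than anything from a particular library, and the toy data, the function names, the learning rate, and the step count are all assumptions chosen for illustration. The cost is the sum of squared errors of a straight line over a tiny training set, and gradient descent repeatedly nudges the two parameters against the gradient of that cost.

```python
# Toy training set generated from y = 2x + 1 (hypothetical values, for illustration only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def cost(w, b):
    """Sum of squared errors of the line y = w*x + b over the training set."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys))


def gradient(w, b):
    """Partial derivatives of the cost with respect to w and b."""
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys))
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys))
    return dw, db


def gradient_descent(w=0.0, b=0.0, learning_rate=0.01, steps=2000):
    """Iteratively step against the gradient to reduce the cost."""
    for _ in range(steps):
        dw, db = gradient(w, b)
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b


w, b = gradient_descent()
print(f"w={w:.3f}, b={b:.3f}, cost={cost(w, b):.6f}")
```

Because this particular cost is convex, the parameters should settle close to the slope and intercept used to generate the data. With a learning rate that is too large the updates would diverge instead, which is why that value is usually tuned in practice.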

Editor’s note: Ready to learn about all of these terms and more? Attend ODSC East 2019 this April 30-May 3 in Boston and learn from industry-leading experts directly!

Daniel Gutierrez, ODSC

Daniel D. Gutierrez is a practicing data scientist who’s been working with data since long before the field came into vogue. As a technology journalist, he enjoys keeping his finger on the pulse of this fast-paced industry. Daniel is also an educator, having taught data science, machine learning, and R classes at the university level. He has authored four computer industry books on database and data science technology, including his most recent title, “Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R.” Daniel holds a BS in Mathematics and Computer Science from UCLA.