Veysel is a speaker for ODSC East 2020 this April 13-17! Be sure to check out his talk, “Spark NLP for Healthcare: Lessons Learned Building Real-World Healthcare AI Systems,” there!
This is the first article in a series of blog posts to help data scientists and NLP practitioners learn the basics of the Spark NLP library from scratch and easily integrate it into their workflows. Throughout this series, we will do our best to produce high-quality content and clear instructions, with accompanying code in both Python and Scala, covering the most important features of Spark NLP. Through these articles, we aim to make the underlying concepts of the Spark NLP library as clear as possible by addressing the practical issues and pain points with code and instructions. The ultimate goal is to let readers get started with this amazing library in a short time and to smooth the learning curve. The reader is expected to have at least a basic understanding of Python and Spark.
1. Why would we need another NLP library?
Natural language processing (NLP) is a key component in many data science systems that must understand or reason about a text. Common use cases include question answering, paraphrasing or summarizing, sentiment analysis, natural language BI, language modeling, and disambiguation.
NLP is essential in a growing number of AI applications. Extracting accurate information from free text is a must if you are building a chatbot, searching through a patent database, matching patients to clinical trials, grading customer service or sales calls, extracting facts from financial reports or solving for any of these 44 use cases across 17 industries.
Due to the popularity of NLP and hype in Data Science in recent years, there are many great NLP libraries developed and even the newbie data science enthusiasts started to play with various NLP techniques using these open source libraries. Here are the most popular NLP libraries that have been used heavily in the community and under various levels of development.
- Natural Language Toolkit (NLTK): The complete toolkit for all NLP techniques.
- TextBlob: Easy to use NLP tools API, built on top of NLTK and Pattern.
- SpaCy: Industrial strength NLP with Python and Cython.
- Gensim: Topic Modelling for Humans
- Stanford Core NLP: NLP services and packages by Stanford NLP Group.
- Fasttext: NLP library for the learning of word embeddings and sentence classification created by Facebook’s AI Research (FAIR) lab
Obviously, there are many more libraries in the general field of NLP—but we focus here on general-purpose libraries and not ones that cater to specific use cases. Given all these libraries, you can ask why we would need another NLP library.
We will try to answer this question under the following topics:
a. A single unified solution for all your NLP needs
When you want to deliver scalable, high-performance and high-accuracy NLP-powered software for real production use, none of those libraries provides a unified solution.
Keep in mind that any NLP pipeline is always just a part of a bigger data processing pipeline: For example, question answering involves loading training data, transforming it, applying NLP annotators, building features, training the value extraction models, evaluating the results (train/test split or cross-validation), and hyperparameter estimation. We need an all-in-one solution to ease the burden of text preprocessing and connecting the dots between various steps of solving a data science problem with NLP. So, we can say that a good NLP library should be able to correctly transform the free text into structured features and let you train your own NLP models that are easily fed into the downstream machine learning (ML) or deep learning (DL) pipeline with no hassle.
b. Take advantage of transfer learning and implementing the latest and greatest algorithms and models in NLP research
Transfer learning is a means to extract knowledge from a source setting and apply it to a different target setting, and it is a highly effective way to keep improving the accuracy of NLP models and to get reliable accuracies even with small data by leveraging the already existing labelled data of some related task or domain. As a result, there is no need to amass millions of data points in order to train a state-of-the-art model.
Big changes have been underway in the world of NLP for the last few years, and a modern industry-scale NLP library should be able to implement the latest and greatest algorithms and models. That is not easy while NLP is having its ImageNet moment and state-of-the-art models are being outpaced twice a month.
The long reign of word vectors as NLP’s core representation technique has seen an exciting new line of challengers such as ELMo, BERT, RoBERTa, ALBERT, XLNet, Ernie, ULMFiT, OpenAI transformer, which are all open-source, including pre-trained models, and can be tuned or reused without a major computing effort. These works made headlines by demonstrating that pre-trained language models can be used to achieve state-of-the-art results on a wide range of NLP tasks, sometimes even surpassing the human level benchmarks.
c. Lack of any NLP library that’s fully supported by Spark
As a general-purpose, in-memory distributed data processing engine, Apache Spark has gained a lot of attention from industry. It already has its own ML library (SparkML) and a few other modules for certain NLP tasks, but it doesn’t cover all the NLP tasks needed for a full-fledged solution. When you try to use Spark in your pipeline, you usually need other NLP libraries to accomplish certain tasks and then have to feed the intermediate results back into Spark. Splitting your data processing framework from your NLP framework means that most of your processing time is spent serializing and copying strings back and forth, which is highly inefficient.
d. Delivering a mission-critical, enterprise-grade NLP library
Many of the most popular NLP packages today have academic roots, which shows in design trade-offs that favor ease of prototyping over runtime performance and breadth of options over simple minimalist APIs, and that downplay scalability, error handling, frugal memory consumption, and code reuse.
The library is already in use in enterprise projects—which means that the first level of bugs, refactoring, unexpected bottlenecks, and serialization issues have been resolved. Unit test coverage and reference documentation are at a level that made us comfortable to make the code open source.
In sum, there was an immediate need for an NLP library that has a simple-to-learn API, is available in your favourite programming language, supports the human languages you need it for, is very fast, and scales to large datasets, including streaming and distributed use cases.
Considering all these issues, limitations of the popular NLP libraries and recent trends in industry, John Snow Labs, a global AI company that helps healthcare and life science organizations put AI to work faster, decided to take the lead and developed Spark NLP library.
John Snow Labs is an award-winning data analytics company leading and sponsoring the development of the Spark NLP library. The company provides commercial support, indemnification and consulting for it. This provides the library with long-term financial backing, a funded active development team, and a growing stream of real-world projects that drives robustness and roadmap prioritization.
2. What is Spark NLP?
Spark NLP is an open-source natural language processing library, built on top of Apache Spark and Spark ML. It provides an easy API to integrate with ML Pipelines and it is commercially supported by John Snow Labs. Spark NLP’s annotators utilize rule-based algorithms and machine learning, and some of them run TensorFlow under the hood to power specific deep learning implementations.
The library covers many common NLP tasks, including tokenization, stemming, lemmatization, part of speech tagging, sentiment analysis, spell checking, named entity recognition, and more. The full list of annotators, pipelines, and concepts is described in the online reference. All of them are included as open-source and can be used by training models with your data. It also provides pre-trained pipelines and models, although they serve as a way of getting a feeling on how the library works, and not for production use.
Spark NLP library is written in Scala and it includes Scala and Python APIs for use from Spark. It has no dependency on any other NLP or ML library. For each type of annotator, we do an academic literature review to find the state of the art (SOTA), have a team discussion and decide which algorithm(s) to implement. Implementations are evaluated on three criteria:
- Accuracy—there’s no point in a great framework if it has sub-par algorithms or models.
- Performance—runtime should be on par or better than any public benchmark. No one should have to give up accuracy because annotators don’t run fast enough to handle a streaming use case, or don’t scale well in a cluster setting.
- Trainability or Configurability—NLP is an inherently domain-specific problem. Different grammars and vocabularies are used in social media posts vs. academic papers vs. electronic medical records vs. newspaper articles.
Spark NLP is geared towards production use in software systems that outgrow older libraries such as spaCy, NLTK, and CoreNLP. As of February 2019, the library is in use by 16% of enterprise companies and is the most widely used NLP library among such companies.
Built natively on Apache Spark and TensorFlow, the library provides simple, performant and accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. The library reuses the Spark ML pipeline while integrating NLP functionality.
A recent annual survey by O’Reilly identified several trends among enterprise companies adopting artificial intelligence. According to the survey results, the Spark NLP library was listed as the seventh most popular across all AI frameworks and tools. It is also by far the most widely used NLP library, twice as common as spaCy. It was also found to be the most popular AI library after scikit-learn, TensorFlow, Keras, and PyTorch.
As a native extension of the Spark ML API, the library offers the capability to train, customize and save models so they can run on a cluster, other machines or saved for later. It is also easy to extend and customize models and pipelines, as we’ll get in detail during this article series. Spark NLP is open source with an Apache 2.0 license, so you are welcome to examine the full source code.
The rise of deep learning for natural language processing in the past few years meant that the algorithms implemented in popular libraries, like spaCy, Stanford CoreNLP, NLTK, and OpenNLP, are less accurate than what the latest scientific papers made possible.
Claiming to deliver state-of-the-art accuracy and speed has us constantly on the hunt to productize the latest scientific advances.
Optimizations that bring Apache Spark’s performance closer to bare metal, on both a single machine and a cluster, mean that common NLP pipelines can run orders of magnitude faster than the inherent design limitations of legacy libraries allow.
The most comprehensive benchmark to date, Comparing production-grade NLP libraries, was published a year ago on O’Reilly Radar. On the left is the comparison of runtimes for training a simple pipeline (sentence boundary detection, tokenization, and part of speech tagging) on a single Intel i5, 4-core, 16 GB memory machine.
Being able to leverage GPUs for training and inference has become table stakes. Using TensorFlow under the hood for deep learning enables Spark NLP to make the most of modern computer platforms, from nVidia’s DGX-1 to Intel’s Cascade Lake processors. Older libraries, whether or not they use some deep learning techniques, will require a rewrite to take advantage of these new hardware innovations, which can improve the speed and scale of your NLP pipelines by another order of magnitude.
Being able to scale model training, inference, and full AI pipelines from a local machine to a cluster with little or no code changes has also become table stakes. Being natively built on Apache Spark ML enables Spark NLP to scale on any Spark cluster, on-premise or in any cloud provider. Speedups are optimized thanks to Spark’s distributed execution planning and caching, which has been tested on just about any current storage and compute platform.
This is how the functionality of the most popular NLP libraries compares:
Spark NLP also comes with an OCR package that can read both PDF files and scanned images (requires Tesseract 4.x+). This is the first NLP library that includes OCR functionality out of the box. (* Since 2.2.2, the OCR feature has been moved to the licensed version.)
3. Basic components and underlying technologies
Since Spark NLP is sitting on the shoulders of Apache Spark, it’s better to explain Spark NLP components with a reference to Spark itself.
Apache Spark, once a component of the Hadoop ecosystem, is now becoming the big-data platform of choice for enterprises mainly because of its ability to process streaming data. It is a powerful open-source engine that provides real-time stream processing, interactive processing, graph processing, in-memory processing as well as batch processing with very fast speed, ease of use and standard interface.
In the industry, there is a big demand for a powerful engine that can do all of the above. Sooner or later, your company or your clients will be using Spark to develop sophisticated models that enable you to discover new opportunities or avoid risk. Spark is not hard to learn; if you already know Python and SQL, it is very easy to get started. To get familiar with Spark and its Python wrapper PySpark, you can find additional resources at the bottom of this article.
Spark has a module called Spark ML which introduces several ML components: Estimators, which are trainable algorithms, and Transformers, which are either the result of training an Estimator or an algorithm that doesn’t require training at all. Both Estimators and Transformers can be part of a Pipeline, which is no more and no less than a sequence of steps that execute in order and usually depend on each other’s results.
Spark NLP introduces NLP annotators that fit into this framework, and their algorithms are designed to predict in parallel. Now, let’s start by explaining each component in detail.
In Spark NLP, all Annotators are either Estimators or Transformers as we see in Spark ML. An Estimator in Spark ML is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model. A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer that transforms a DataFrame with features into a DataFrame with predictions.
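To make the Estimator/Transformer contract concrete, here is a minimal sketch in plain Python. It does not use Spark at all: the class names VocabEstimator and VocabModel are hypothetical and a list of dicts stands in for a DataFrame, so treat it as an illustration of the idea rather than Spark ML's actual classes.

```python
# Toy illustration only: these classes mimic the Estimator/Transformer
# contract in plain Python and are NOT Spark ML's real classes.
from collections import Counter


class VocabModel:
    """Transformer: maps each row's tokens to vocabulary indices."""

    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        # Return new rows with an extra "ids" column; unknown tokens map to -1.
        return [
            {**row, "ids": [self.vocab.get(tok, -1) for tok in row["tokens"]]}
            for row in rows
        ]


class VocabEstimator:
    """Estimator: learns a vocabulary from data and returns a Transformer."""

    def fit(self, rows):
        counts = Counter(tok for row in rows for tok in row["tokens"])
        # Most frequent token gets index 0; ties are broken alphabetically.
        ordered = sorted(counts, key=lambda t: (-counts[t], t))
        return VocabModel({tok: i for i, tok in enumerate(ordered)})


train = [{"tokens": ["spark", "nlp"]}, {"tokens": ["spark", "ml"]}]
model = VocabEstimator().fit(train)   # fit(): Estimator -> Transformer
out = model.transform([{"tokens": ["nlp", "spark", "bert"]}])
print(out[0]["ids"])
```

The point of the split is that the object doing the learning (fit) and the object applying what was learned (transform) are different things, which is the same distinction Spark NLP draws between its trainable and trained annotators.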
In Spark NLP, there are two types of annotators: AnnotatorApproach and AnnotatorModel
AnnotatorApproach extends Estimators from Spark ML, which are meant to be trained through fit(), and AnnotatorModel extends Transformers which are meant to transform data frames through transform().
Some Spark NLP annotators have a Model suffix and some do not. The Model suffix is explicitly stated when the annotator is the result of a training process. Some annotators, such as Tokenizer, are transformers but do not carry the Model suffix since they are not trained annotators. Model annotators have a pretrained() method on their static object to retrieve the public pre-trained version of a model.
Long story short, if it trains on a DataFrame and produces a model, it’s an AnnotatorApproach; if it transforms one DataFrame into another DataFrame through some model, it’s an AnnotatorModel (e.g. WordEmbeddingsModel); and it doesn’t take the Model suffix if it doesn’t rely on a pre-trained annotator while transforming a DataFrame (e.g. Tokenizer).
Here is the list of annotators offered by Spark NLP v2.2.2
By convention, there are three possible names:
- Approach: a trainable annotator
- Model: a trained annotator
- nothing: either a non-trainable annotator with a pre-processing step, or shorthand for a model
So, for example, Stemmer says neither Approach nor Model, yet it is a Model. On the other hand, Tokenizer says neither Approach nor Model either, but it has a TokenizerModel(), because it is not “training” anything; it just does some preprocessing before being converted into a Model.
Even though we will do many hands-on practices in the following articles, let us give you a glimpse to let you understand the difference between AnnotatorApproach and AnnotatorModel.
As stated above, Tokenizer is an AnnotatorApproach, so we need to call fit() and then transform().

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

tokenizer.fit(df).transform(df)

On the other hand, Stemmer is an AnnotatorModel, so we just need to call transform().

stemmer = Stemmer() \
    .setInputCols(["token"]) \
    .setOutputCol("stem")

stemmer.transform(df)
You will get to learn all these parameters and syntax later on. So, don’t bother trying to reproduce these code snippets before we get into that part.
Another important point is that each annotator accepts certain types of columns and outputs new columns in another type (we call this AnnotatorType). In Spark NLP, we have the following types: Document, token, chunk, pos, word_embeddings, date, entity, sentiment, named_entity, dependency, labeled_dependency. That is, the DataFrame you have needs to have a column from one of these types if that column will be fed into an annotator; otherwise, you’d need to use one of the Spark NLP transformers. We will talk about this concept in detail later on.
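The column-type rule can be pictured as a simple wiring check. The sketch below is plain Python and purely illustrative: check_wiring is a hypothetical helper, "string" is a made-up label for a raw text column, and the stage table is an assumption for the sake of the example rather than an official listing of what each annotator accepts.

```python
# Hypothetical sketch of the "each annotator accepts certain column types"
# rule. Nothing here is Spark NLP code; the stage tuples are illustrative.

def check_wiring(stages, columns):
    """Return a list of wiring problems; an empty list means every stage
    receives a column of the type it expects."""
    columns = dict(columns)  # column name -> annotator type
    problems = []
    for name, in_col, in_type, out_col, out_type in stages:
        actual = columns.get(in_col)
        if actual != in_type:
            problems.append(
                f"{name}: column '{in_col}' is {actual}, expected {in_type}"
            )
        # The stage's output column becomes available to later stages.
        columns[out_col] = out_type
    return problems


# (stage name, input column, required input type, output column, output type)
good = [
    ("DocumentAssembler", "text", "string", "document", "document"),
    ("Tokenizer", "document", "document", "token", "token"),
    ("Stemmer", "token", "token", "stem", "token"),
]
# Raw text fed straight into an annotator that expects tokens.
bad = [("Stemmer", "text", "token", "stem", "token")]

good_problems = check_wiring(good, {"text": "string"})
bad_problems = check_wiring(bad, {"text": "string"})
print(good_problems)
print(bad_problems)
```

In the second case the check reports a mismatch, which is the situation where you would first need a transformer to produce a column of the right type.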
b. Pre-trained Models
We mentioned that trained annotators are called AnnotatorModel and the goal here is to transform one DataFrame into another through the specified model (trained annotator). Spark NLP offers the following pre-trained models in four languages (English, French, German, Italian) and all you need to do is to load the pre-trained model into your disk by specifying the model name and then configuring the model parameters as per your use case and dataset. Then you will not need to worry about training a new model from scratch and will be able to enjoy the pre-trained SOTA algorithms directly applied to your own data with transform(). In the official documentation, you can find detailed information regarding how these models are trained by using which algorithms and datasets.
Here is the list of pre-trained models offered by Spark NLP v2.2.2
from sparknlp.annotator import NerDLModel

# load NER model trained by deep learning approach and GloVe word embeddings
ner_dl = NerDLModel.pretrained('ner_dl')

# load NER model trained by deep learning approach and BERT word embeddings
ner_bert = NerDLModel.pretrained('ner_dl_bert')

ner_bert.transform(df)
Remember that we talked about the certain types of columns that each Annotator accepts or outputs. So, what are we going to do if our DataFrame doesn’t have columns of those types? This is where transformers come in. In Spark NLP, we have five different transformers that are mainly used for getting the data in or transforming the data from one AnnotatorType to another. Here is the list of transformers:
DocumentAssembler: To get through the NLP process, we need to get raw data annotated. This is a special transformer that does this for us; it creates the first annotation of type Document which may be used by annotators down the road.
TokenAssembler: This transformer reconstructs a Document type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., so that this document annotation can be used in further annotators.
Doc2Chunk: Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol.
Chunk2Doc: Converts a CHUNK type column back into DOCUMENT. Useful when trying to re-tokenize or do further analysis on a CHUNK result.
Finisher: Once we have our NLP pipeline ready to go, we might want to use our annotation results somewhere else where it is easy to use. The Finisher outputs annotation(s) values into a string.
from sparknlp.base import DocumentAssembler

# get the dataframe with text column, and transform into another dataframe
# with a new document type column appended
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentAssembler.transform(df)
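To give a rough feel for what "getting the data in" and "getting the results out" means, here is a plain-Python sketch. The functions assemble_document, tokenize and finish are hypothetical stand-ins, not the library's implementation, and the dictionary layout (annotatorType, begin, end, result, metadata, with an inclusive end offset) is an assumption made for illustration.

```python
# Illustrative stand-ins only: these functions imitate the roles of
# DocumentAssembler, a tokenizer, and Finisher on plain Python data.
import re


def assemble_document(text):
    """Wrap raw text in a single document-type annotation."""
    return [{
        "annotatorType": "document",
        "begin": 0,
        "end": len(text) - 1,   # assumed inclusive end offset
        "result": text,
        "metadata": {},
    }]


def tokenize(document):
    """Split each document annotation on whitespace into token annotations."""
    tokens = []
    for doc in document:
        for match in re.finditer(r"\S+", doc["result"]):
            tokens.append({
                "annotatorType": "token",
                "begin": doc["begin"] + match.start(),
                "end": doc["begin"] + match.end() - 1,
                "result": match.group(),
                "metadata": {},
            })
    return tokens


def finish(annotations):
    """Keep only the annotation values, as plain strings."""
    return [a["result"] for a in annotations]


doc = assemble_document("Spark NLP rocks")
tokens = tokenize(doc)
print(finish(tokens))
```

The first function plays the entry-point role, the last one flattens structured annotations back into simple values that are easy to use elsewhere.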
We mentioned before that Spark NLP provides an easy API to integrate with Spark ML Pipelines and all the Spark NLP annotators and transformers can be used within Spark ML Pipelines. So, it’s better to explain Pipeline concept through Spark ML official documentation.
What is a Pipeline anyway? In machine learning, it is common to run a sequence of algorithms to process and learn from data. E.g., a simple text document processing workflow might include several stages:
- Split each document’s text into sentences and tokens (words).
- Normalize the tokens by applying some text preprocessing techniques (cleaning, lemmatizing, stemming etc.)
- Convert each token into a numerical feature vector (e.g. word embeddings, tfidf, etc.).
- Learn a prediction model using the feature vectors and labels.
This is how such a flow can be written as a pipeline with sklearn, a popular Python ML library.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

def text_processing(texts):
    # your text preprocessing steps ..
    processed_text = [text.lower().strip() for text in texts]
    return processed_text

mypipeline = Pipeline([
    # wrap the function so it can be used as a pipeline step
    ("preprocess", FunctionTransformer(text_processing)),
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("model", LogisticRegression()),
])

mypipeline.fit(X_train, y_train)
Apache Spark ML represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order.
In simple terms, a Pipeline chains multiple Transformers and Estimators together to specify a machine learning workflow.
The figure below is for the training time usage of a Pipeline.
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. That is, the data are passed through the fitted pipeline in order. Each stage’s transform() method updates the dataset and passes it to the next stage. With the help of Pipelines, we can ensure that training and test data go through identical feature processing steps.
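The fitting behaviour described above can be sketched in a few lines of plain Python. This is a toy model of the idea, not Spark ML's Pipeline: all class names are hypothetical, and lists of strings stand in for DataFrames.

```python
# Toy sketch of pipeline fitting: Estimators are fit on the data as it looks
# at their position in the pipeline, then replaced by their fitted models.

class Lowercase:
    """Transformer: needs no training."""

    def transform(self, rows):
        return [r.lower() for r in rows]


class StopwordModel:
    """Transformer produced by StopwordEstimator."""

    def __init__(self, stop):
        self.stop = stop

    def transform(self, rows):
        return [" ".join(w for w in r.split() if w not in self.stop)
                for r in rows]


class StopwordEstimator:
    """Estimator: treats words that occur in every row as stop words."""

    def fit(self, rows):
        common = set.intersection(*(set(r.split()) for r in rows))
        return StopwordModel(common)


class PipelineModel:
    """Fitted pipeline: applies each fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Fit Estimators; Transformers are used as they are.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)   # pass transformed data onwards
            fitted.append(model)
        return PipelineModel(fitted)


train = ["The cat sat", "the dog ran"]
pipeline_model = Pipeline([Lowercase(), StopwordEstimator()]).fit(train)
result = pipeline_model.transform(["THE bird flew"])
print(result)
```

Because the lowercasing stage runs before the estimator both at fit time and at transform time, new data goes through exactly the same preprocessing as the training data, which is the guarantee mentioned above.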
Now let’s see how this can be done in Spark NLP using Annotators and Transformers. Assume that we have the following steps that need to be applied one by one on a data frame.
- Split text into sentences
- Tokenize
- Normalize
- Get word embeddings
And here is how we code this pipeline up in Spark NLP.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, Normalizer, WordEmbeddingsModel

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = Tokenizer() \
    .setInputCols(["sentences"]) \
    .setOutputCol("token")

normalizer = Normalizer()\
    .setInputCols(["token"])\
    .setOutputCol("normal")

word_embeddings = WordEmbeddingsModel.pretrained()\
    .setInputCols(["document", "normal"])\
    .setOutputCol("embeddings")

nlpPipeline = Pipeline(stages=[
    document_assembler,
    sentenceDetector,
    tokenizer,
    normalizer,
    word_embeddings,
])

pipelineModel = nlpPipeline.fit(df)
Let’s see what’s going on here. As you can see from the flow diagram below, each generated (output) column is pointed to the next annotator as an input depending on the input column specifications. It’s like building-blocks and legos through which you can come up with amazing pipelines with a little bit of creativity.
What’s actually happening under the hood?
When we call fit() on the pipeline with a Spark data frame (df), its text column is first fed into the DocumentAssembler() transformer, and a new column “document” of Document type (AnnotatorType) is created. As we mentioned before, this transformer is basically the initial entry point to Spark NLP for any Spark data frame. Then the document column is fed into SentenceDetector() (AnnotatorModel), the text is split into an array of sentences, and a new column “sentences” of Document type is created. Then the “sentences” column is fed into Tokenizer() (AnnotatorApproach), each sentence is tokenized, and a new column “token” of Token type is created. And so on. You’ll learn all these rules and steps in detail in the following articles, so we’re not elaborating much here.
In addition to customized pipelines, Spark NLP also has pre-trained pipelines that are already fitted using certain annotators and transformers for various use cases.
Here is the list of pre-trained pipelines.
We will explain all these pipelines in the following articles but let’s give you an example using one of these pipelines.
Here are the NLP annotators we have in “explain_document_dl” pipeline:
- WordEmbeddings (GloVe 6B 100)
- NerConverter (chunking)
All these annotators are already trained and tuned with SOTA algorithms and ready to fire up at your service. So, when you call this pipeline, these annotators will be run under the hood and you will get a bunch of new columns generated by them. To use a pre-trained pipeline, all you need to do is specify the pipeline name and then call transform(). You can also design and train such pipelines yourself and then save them to disk to use later on.

from sparknlp.pretrained import PretrainedPipeline

print(df.columns)
# >> ['text']

pipeline = PretrainedPipeline("explain_document_dl", lang="en")
transformed_df = pipeline.transform(df)

print(transformed_df.columns)
# >> ['text', 'document', 'sentence', 'token', 'checked', 'lemma', 'stem', 'pos', 'embeddings', 'ner', 'entities']
When we say SOTA algorithms, we really mean it. For example, NerDLModel is trained by the NerDLApproach annotator with a Char CNNs, BiLSTM, CRF architecture and GloVe embeddings on the WikiNER corpus, and it supports the identification of PER, LOC, ORG and MISC entities. According to a recent survey paper, this DL architecture achieved the highest scores for NER. So, with just one single line of code, you get a SOTA result!
In this very first article, we tried to get you familiar with the basics of Spark NLP and its building blocks. Used in enterprise projects, built natively on Apache Spark and TensorFlow, and offering an all-in-one, state-of-the-art NLP solution, the Spark NLP library provides simple, performant and accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. Despite its steep learning curve and sophisticated framework, the virtual developer team behind this amazing library pushes the limits to implement and cover the recent breakthroughs in NLP studies and strives to make it easy to integrate into your daily workflows.
In the following articles, we plan to cover all the details with clear code samples both in Python and Scala. Till then, feel free to visit Spark NLP workshop repository or take a look at the following resources. Welcome to the amazing world of Spark NLP and stay tuned!
Originally Posted Here
Veysel Kocaman: Veysel is a Senior Data Scientist at John Snow Labs, lecturer at Leiden University and a seasoned data scientist with a strong background in every aspect of data science including machine learning, artificial intelligence and big data with over ten years of experience. He is also working towards his PhD in Computer Science and is a Google Developer Expert in Machine Learning.