Introduction to Spark NLP: Foundations and Basic Components
Veysel is a speaker for ODSC East 2020 this April 13-17! Be sure to check out his talk, “Spark NLP for Healthcare: Lessons Learned...

* This is the first article in a series of blog posts to help data scientists and NLP practitioners learn the basics of the Spark NLP library from scratch and integrate it into their workflows. Throughout the series, we will provide clear instructions and accompanying code in both Python and Scala for the most important features of Spark NLP. We aim to make the underlying concepts of the library as clear as possible by covering both the practical details and the common pain points. The ultimate goal is to help readers get started with the library quickly and to smooth the learning curve. Readers are expected to have at least a basic understanding of Python and Spark.


1. Why would we need another NLP library?


Spark NLP is already in use in enterprise projects for various use cases
John Snow Labs is a recipient of several awards in Data Analytics

2. What is Spark NLP?

Spark NLP provides licensed annotators and models that are already trained by SOTA algorithms for Healthcare Analytics
Spark NLP training performance on single machine vs cluster

3. Basic components and underlying technologies

An overview of Spark NLP components

a. Annotators

* The marked annotators do not take the “Approach” suffix, while all the others do. All AnnotatorModels take the “Model” suffix.
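from sparknlp.annotator import Tokenizer, Stemmer

# Tokenizer must be fitted before it can transform
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

tokenizer.fit(df).transform(df)

# Stemmer can transform directly, without fitting
stemmer = Stemmer() \
    .setInputCols(["token"]) \
    .setOutputCol("stem")

stemmer.transform(df)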
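
The two calls above differ for a reason: an annotator that has to learn something from data behaves like an estimator and needs `fit()` before `transform()`, while one that applies a fixed rule can transform right away. The following is a minimal pure-Python sketch of that split, not Spark NLP code. The `ToyVocabApproach`, `ToyVocabModel`, and `ToyLowercaser` classes are hypothetical names invented for illustration, and plain lists of tokens stand in for dataframe columns.

```python
# Toy illustration only: these classes are hypothetical and are not part of Spark NLP.

class ToyVocabModel:
    """Result of fitting: holds learned state and can only transform."""

    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        # Keep only the tokens that were seen during fitting.
        return [[tok for tok in row if tok in self.vocab] for row in rows]


class ToyVocabApproach:
    """Estimator-style annotator: fit() learns from data and returns a model."""

    def fit(self, rows):
        vocab = {tok for row in rows for tok in row}
        return ToyVocabModel(vocab)


class ToyLowercaser:
    """Transformer-style annotator: nothing to learn, so there is no fit()."""

    def transform(self, rows):
        return [[tok.lower() for tok in row] for row in rows]


train = [["Spark", "NLP"], ["NLP", "library"]]
new = [["Spark", "rocks"]]

model = ToyVocabApproach().fit(train)      # "Approach" produces a "Model"
kept = model.transform(new)                # only "Spark" was in the training data
lowered = ToyLowercaser().transform(new)   # no fitting step required
```

The naming convention in the note above follows the same idea: an "Approach" is the trainable side, and the "Model" is what you get back after fitting or loading pre-trained weights.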

b. Pre-trained Models

from sparknlp.annotator import NerDLModel

# load NER model trained by deep learning approach and GloVe word embeddings
ner_dl = NerDLModel.pretrained("ner_dl")

# load NER model trained by deep learning approach and BERT word embeddings
ner_bert = NerDLModel.pretrained("ner_dl_bert")

ner_bert.transform(df)

c. Transformers

from sparknlp.base import DocumentAssembler

# get the dataframe with a text column, and transform it into another dataframe
# with a new document-type column appended
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

documentAssembler.transform(df)
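
To make the "document type column" more concrete: each Spark NLP annotation carries an annotator type, begin and end character offsets, a result string, and a metadata map. The sketch below mimics that shape with plain Python dictionaries. `assemble_document` is a hypothetical helper, not the library's API, and treating the end offset as inclusive is an assumption made for this example.

```python
# Toy illustration only: assemble_document is a hypothetical helper, not the Spark NLP API.

def assemble_document(row, input_col="text", output_col="document"):
    """Return a copy of row with a document-style annotation column appended."""
    text = row[input_col]
    annotation = {
        "annotatorType": "document",
        "begin": 0,
        "end": len(text) - 1,   # assumed inclusive character offset
        "result": text,
        "metadata": {},
    }
    new_row = dict(row)          # leave the input row untouched
    new_row[output_col] = [annotation]
    return new_row


row = {"text": "Spark NLP is fast."}
out = assemble_document(row)
```

The original `text` column is kept and a new column is added next to it, which is the same append-only behavior the comment in the snippet above describes.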

d. Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

def text_processing(texts):
    # your text preprocessing steps ..
    processed_text = [t.lower() for t in texts]  # placeholder step
    return processed_text

mypipeline = Pipeline([
    # wrap the function so it can act as a pipeline step
    ("preprocess", FunctionTransformer(text_processing)),
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("model", LogisticRegression()),
])

mypipeline.fit(X_train, y_train)
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, Normalizer, WordEmbeddingsModel

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

tokenizer = Tokenizer() \
    .setInputCols(["sentences"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normal")

word_embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["document", "normal"]) \
    .setOutputCol("embeddings")

nlpPipeline = Pipeline(stages=[
    document_assembler,
    sentenceDetector,
    tokenizer,
    normalizer,
    word_embeddings,
])

pipelineModel = nlpPipeline.fit(df)
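
Each stage in the pipeline above is connected to the earlier ones only through column names, so a stage's input columns must already exist by the time it runs. A minimal pure-Python sketch of that ordering check follows. `check_wiring` and the stage tuples are hypothetical: they mirror the column names used in this article and do not show how Spark validates a pipeline internally.

```python
# Toy illustration only: check_wiring is a hypothetical helper, not part of Spark or Spark NLP.

def check_wiring(initial_cols, stages):
    """Walk the stages in order and report input columns that are not yet available.

    stages is a list of (name, input_cols, output_col) tuples.
    Returns a list of (stage_name, missing_column) pairs.
    """
    available = set(initial_cols)
    problems = []
    for name, input_cols, output_col in stages:
        for col in input_cols:
            if col not in available:
                problems.append((name, col))
        available.add(output_col)
    return problems


stages = [
    ("document_assembler", ["text"], "document"),
    ("sentenceDetector", ["document"], "sentences"),
    ("tokenizer", ["sentences"], "token"),
    ("normalizer", ["token"], "normal"),
    ("word_embeddings", ["document", "normal"], "embeddings"),
]

ok = check_wiring(["text"], stages)                  # stages in the article's order
swapped = check_wiring(["text"], stages[:2][::-1])   # sentence detector placed first
```

With the stages in the article's order nothing is missing. Once the sentence detector is moved ahead of the document assembler, its `document` input is reported as unavailable.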

print(df.columns)
# >> ['text']

from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("explain_document_dl", lang="en")

transformed_df = pipeline.transform(df)

print(transformed_df.columns)
# >> ['text',
#  'document',
#  'sentence',
#  'token',
#  'checked',
#  'lemma',
#  'stem',
#  'pos',
#  'embeddings',
#  'ner',
#  'entities']

4. Conclusion


5. Resources


Originally Posted Here

Author Bio:

Veysel Kocaman: Veysel is a Senior Data Scientist at John Snow Labs and a lecturer at Leiden University. He has over ten years of experience and a strong background in data science, including machine learning, artificial intelligence, and big data. He is also working towards his PhD in Computer Science and is a Google Developer Expert in Machine Learning.

ODSC Community


The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.
