Introduction to Spark NLP: Foundations and Basic Components
* This is the first article in a series of blog posts to help Data Scientists and NLP practitioners learn the basics of the Spark NLP library...

1. Why would we need another NLP library?


Spark NLP is already in use in enterprise projects for various use cases
John Snow Labs is a recipient of several awards in Data Analytics

2. What is Spark NLP?

Spark NLP provides licensed annotators and models that are already trained by SOTA algorithms for Healthcare Analytics
Spark NLP training performance on single machine vs cluster

3. Basic components and underlying technologies

An overview of Spark NLP components

a. Annotators

* The annotators marked above do not take the “Approach” suffix at the end, while all the other trainable annotators do. All AnnotatorModels take the “Model” suffix at the end.
tokenizer = Tokenizer() \
 .setInputCols(["document"]) \
 .setOutputCol("token")

stemmer = Stemmer() \
 .setInputCols(["token"]) \
 .setOutputCol("stem")
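The Approach/Model naming convention mirrors Spark ML's Estimator/Transformer pattern: a trainable annotator (an "...Approach") is fit on data, and the result of fitting is a fitted annotator (a "...Model") that can transform new data. A minimal sketch in plain Python, with hypothetical class names chosen only to illustrate the convention (no Spark required):

```python
# Sketch of the Approach -> fit() -> Model pattern behind the naming.
# StemmerApproach/StemmerModel here are illustrative toys, not Spark NLP classes.

class StemmerApproach:
    """A 'trainable' annotator: fit() returns a fitted Model."""
    def fit(self, corpus):
        # "training" here is trivial: we just pick a fixed suffix list
        suffixes = ("ing", "ed", "s")
        return StemmerModel(suffixes)

class StemmerModel:
    """The fitted annotator: transform() applies what was learned."""
    def __init__(self, suffixes):
        self.suffixes = suffixes

    def transform(self, tokens):
        out = []
        for t in tokens:
            for s in self.suffixes:
                # strip the first matching suffix, keeping a minimal stem
                if t.endswith(s) and len(t) > len(s) + 2:
                    t = t[: -len(s)]
                    break
            out.append(t)
        return out

model = StemmerApproach().fit(corpus=[])
print(model.transform(["running", "jumped", "cats"]))  # → ['runn', 'jump', 'cat']
```

In Spark NLP the same split applies: calling `fit()` on an Approach yields the corresponding Model, which is what you deploy.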

b. Pre-trained Models

# load NER model trained by deep learning approach and GloVe word embeddings
ner_dl = NerDLModel.pretrained('ner_dl')

# load NER model trained by deep learning approach and BERT word embeddings
ner_bert = NerDLModel.pretrained('ner_dl_bert')

ner_bert.transform(df)

c. Transformers

# get the dataframe with a text column, and transform it into another
# dataframe with a new document type column appended
documentAssembler = DocumentAssembler()\
 .setInputCol("text")\
 .setOutputCol("document")
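The new document column holds annotations rather than plain strings. Each annotation carries an annotator type, character offsets, a result string, and a metadata map. A plain-Python sketch of that structure (field names mirror Spark NLP's Annotation schema; the dataclass itself is just for illustration):

```python
# Illustrative model of the annotation rows Spark NLP appends to a DataFrame.
from dataclasses import dataclass, field

@dataclass
class Annotation:
    annotator_type: str          # e.g. "document", "token", "named_entity"
    begin: int                   # start character offset in the original text
    end: int                     # end character offset (inclusive)
    result: str                  # the annotation's content
    metadata: dict = field(default_factory=dict)

text = "Spark NLP is fast."
# A DocumentAssembler-style annotation covering the whole text:
doc = Annotation("document", 0, len(text) - 1, text, {"sentence": "0"})
print(doc)
```

Downstream annotators read such annotations from their input columns and append new ones, which is what makes the stages composable.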

d. Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

def text_processing(texts):
    # your text preprocessing steps ..
    return processed_texts

mypipeline = Pipeline([
    ("preprocess", FunctionTransformer(text_processing)),
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("model", LogisticRegression()),
])

mypipeline.fit(X_train, y_train)
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler()\
 .setInputCol("text")\
 .setOutputCol("document")

sentenceDetector = SentenceDetector()\
 .setInputCols(["document"])\
 .setOutputCol("sentences")

tokenizer = Tokenizer() \
 .setInputCols(["sentences"]) \
 .setOutputCol("token")

normalizer = Normalizer()\
 .setInputCols(["token"])\
 .setOutputCol("normal")

word_embeddings = WordEmbeddingsModel.pretrained()\
 .setInputCols(["document", "normal"])\
 .setOutputCol("embeddings")

nlpPipeline = Pipeline(stages=[
 document_assembler,
 sentenceDetector,
 tokenizer,
 normalizer,
 word_embeddings,
])

pipelineModel = nlpPipeline.fit(df)
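Under the hood, `Pipeline.fit` walks the stages in order, fitting each trainable stage on the data produced so far and collecting the fitted stages into a PipelineModel that replays `transform()` in the same order. A minimal sketch of that mechanic in plain Python (the `SimplePipeline`, `Lower`, and `Tokenize` classes are illustrative toys, not Spark ML classes):

```python
# Toy re-implementation of the estimator-chaining idea behind Pipeline.fit.

class SimplePipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        fitted = []
        for stage in self.stages:
            # trainable stages expose fit(); plain transformers are used as-is
            stage = stage.fit(data) if hasattr(stage, "fit") else stage
            data = stage.transform(data)   # feed each stage's output forward
            fitted.append(stage)
        return SimplePipelineModel(fitted)

class SimplePipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

class Lower:                      # toy transformer stage
    def transform(self, data):
        return [t.lower() for t in data]

class Tokenize:                   # toy transformer stage
    def transform(self, data):
        return [t.split() for t in data]

model = SimplePipeline([Lower(), Tokenize()]).fit(["Spark NLP Rocks"])
print(model.transform(["Hello World"]))  # → [['hello', 'world']]
```

The real Spark ML version works on distributed DataFrames and serializes the fitted PipelineModel for reuse, but the chaining logic is the same.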

print(df.columns)
>> ['text']

from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("explain_document_dl", lang="en")
transformed_df = pipeline.transform(df)

print(transformed_df.columns)
>> ['text', ...]

4. Conclusion


5. Resources

Originally Posted Here

Author Bio:

Veysel Kocaman: Veysel is a Senior Data Scientist at John Snow Labs, a lecturer at Leiden University, and a seasoned data scientist with over ten years of experience across machine learning, artificial intelligence, and big data. He is also working towards his PhD in Computer Science and is a Google Developer Expert in Machine Learning.

ODSC Community


The Open Data Science community is passionate and diverse, and we always welcome contributions from data science professionals! All of the articles under this profile are from our community, with individual authors mentioned in the text itself.