For many of you in data science, natural language processing is a critical component of your projects. David Talby of Pacific.ai is here to introduce the Spark NLP library for Apache Spark and outline how it can facilitate your NLP pipeline, delivering higher accuracy and faster results on the same amount of data and moving closer to real natural language understanding. Let’s take a look.
What is Natural Language Understanding?
Natural Language Processing began with simple keyword combinations that led to specific results. Think about the way you searched Google in the early days: “Data Science Conferences San Francisco.”
[Related Article: Ben Vigoda on the New Era of NLP]
To achieve Natural Language Understanding, machines have to parse real, human language, not ideal keywords. Those keyword searches leave out a lot of possibilities. Human language presents a unique problem for algorithms because it is:
- medium-specific (e.g., texting)
- domain-specific (e.g., legal texts)
The Spark NLP library was designed for performance, accuracy, and enterprise-grade solutions. It’s built on top of the Spark ML APIs and has an active development community. Even better, it’s open source.
Performance and Scale
So why use NLP on Spark? The performance bottleneck in NLP pipelines at scale was a huge issue, and it’s the communication along the pipeline that drove the team to write their own library. Spark holds your data in data frames inside the JVM, but a typical NLP library runs in a separate Python process, so most of your time goes to copying strings and reserializing them between the two processes, again and again.
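The cost of that round trip can be illustrated with a toy, pure-Python simulation (this is not Spark code; it just mimics the serialize-ship-deserialize cycle that happens when rows leave the JVM for a Python worker):

```python
# Toy illustration of inter-process string shipping: compare operating on
# strings in place with serializing each one, deserializing it, processing
# it, and serializing the batch back -- roughly the copy/reserialize cycle
# described above.
import pickle
import time

documents = ["natural language processing at scale"] * 50_000

# In-process: work on the strings directly, no copies.
start = time.perf_counter()
in_process = [doc.upper() for doc in documents]
in_process_time = time.perf_counter() - start

# Simulated cross-process: every document is pickled and unpickled before
# processing, then the whole result is round-tripped once more.
start = time.perf_counter()
shipped = [pickle.loads(pickle.dumps(doc)).upper() for doc in documents]
round_trip = pickle.loads(pickle.dumps(shipped))
cross_process_time = time.perf_counter() - start

print(f"in-process: {in_process_time:.4f}s, "
      f"simulated round trip: {cross_process_time:.4f}s")
assert in_process == round_trip  # identical output, only extra copying
```

The results are identical; the only difference is the copying, which is exactly the overhead Spark NLP eliminates by staying inside the JVM.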
Instead, the Spark NLP library works directly on top of Spark: zero copying between processes, zero extra memory, and everything that comes with Spark ML, plus Spark NLP’s added features for classic NLU tasks.
This was 80 times faster than spaCy because it’s native. It’s built for scale because you’re training faster while using the same amount of data. Spark doesn’t use the Java memory model; it has its own memory model and does its own caching. What you would have done manually is now part of your training model.
The organization built two kinds of pipelines in 2018. Light pipelines provide roughly ten times the speed of typical processing for “small data,” i.e., fewer than 50,000 documents. For those of you working at scale, the aptly named scalable pipelines are the only choice for cluster-distributed NLP (more than five million documents).
It’s highly scalable because it allows frictionless reuse. You can combine the NLP and ML pipelines into one, and then Spark builds the execution plan.
So how is the accuracy? Deep learning has taken over NLP, too, not just classical machine learning. In the past, we could get high performance if we didn’t care about accuracy, but we want to make sure we’re still getting accurate results. A state-of-the-art designation requires public benchmarks and peer-reviewed results; in practice, that’s how the community determines whether something is actually state of the art.
Most of these state-of-the-art methods are deep learning-based, so Spark considered that with this new pipeline. These community tasks included common NLP actions like named entity recognition as well as de-identification and word embeddings.
If you look at how the architecture evolved, Spark is there, but TensorFlow is embedded, and the Python process isn’t needed. They’ve added sentiment analysis and spelling correction. Plus, it deals with images and other common NLP obstacles like page breaks.
So how’s the performance? You need to be able to train on GPU clusters out of the box and build with Intel-optimized deep learning libraries. Also, it must work efficiently with the Databricks cloud. Spark NLP is currently the only NLP library optimized for these and other platforms.
You can’t always use an off-the-shelf model because it just won’t work: your context-specific needs require a different type of solution. If you have any domain-specific questions, you might run into issues with your model, because you need something more robust in natural language understanding. For example, most sentiment analysis models operate on emotions rather than on reputation and change.
[Related Article: Getting to Know Natural Language Understanding]
Using Spark NLP to train your own models is then the only option. In general, state-of-the-art deep learning for natural language understanding can do slightly better than, or at least match, human domain experts. Spark NLP lets you reach that level of accuracy without continually having to find domain experts to test your model against.