Corpus analysis is a technique widely used by data scientists because it builds understanding of a document collection and surfaces insights about the text. It’s an apt methodology to consider as Charles Dickens’ 210th birthday came around earlier this year, given how frequently passages from his works have made their way into popular culture. “It was the best of times, it was the worst of times, it was the age of wisdom …” and 119 words later the sentence finally ends, in this excerpt from page 4 of A Tale of Two Cities. Were you eager to know how many words that was? Have you ever wondered why he used such long sentences? As we explore what corpus analysis is, you will understand more about this technique and learn two key ways to use it.
Corpus analysis (a corpus is a collection of documents) using Natural Language Processing (NLP) can surface insights about Dickens’ work before any additional analysis begins. Through easily accessible output statistics, corpus analysis reveals the structure of a corpus so that NLP can be leveraged effectively. It works on unstructured text data, unlike structured information that fits neatly into rows and columns. Data scientists use NLP for tasks such as cleansing data, separating out noise, sampling effectively, preparing data as input for further models (rules-based and machine learning), and strategizing modeling approaches.
Two ways data scientists can leverage corpus analysis are:
- Generate statistics about the text to better understand the content and structure of your document collection. Use cases where data scientists apply NLP include viewing and understanding insights about:
  - Information complexity
  - Vocabulary diversity
  - Information density
  - Comparison metrics against a predetermined reference corpus
- Further analyze or visualize these statistics (using the counts) in reports created in Visual Analytics.
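To make one of these use cases concrete, vocabulary diversity is often approximated with a type-token ratio: the number of unique word forms divided by the total number of tokens. The sketch below is a plain-Python illustration, not the metric SAS Visual Text Analytics computes internally.

```python
import re

def type_token_ratio(text):
    # Lowercase word tokens; the ratio of unique forms to total tokens.
    # A simplified measure of vocabulary diversity.
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens)

opening = ("It was the best of times, it was the worst of times, "
           "it was the age of wisdom, it was the age of foolishness")
ratio = type_token_ratio(opening)
# Dickens' repeated parallel clauses keep the ratio low:
# many tokens, but few unique forms.
```

A low ratio like this one reflects heavy repetition; comparing ratios across documents or corpora is one simple way to quantify stylistic differences.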
To begin corpus analysis using SAS Visual Text Analytics, you profile the data. The process starts with a CAS action called Text Profile, which profiles data to produce descriptive statistics relevant for understanding text. This analysis informs model building, testing, and usage on specific data sets. The action can also characterize a data set, identify differences between data sets, identify errors or noise, and compare a data set to a reference data set.
A key element of corpus analysis is the token, which can be a word, a morpheme, or a character. Tokenization splits a character sequence, such as a sentence or document, into these useful units.
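A minimal regex-based tokenizer illustrates the idea. This is a simplified sketch; the linguistic tokenizer inside SAS Visual Text Analytics handles far more cases than this.

```python
import re

def tokenize(text):
    """Split a character sequence into word and punctuation tokens.

    Illustration only: words are runs of word characters, and each
    punctuation mark becomes its own token.
    """
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "It was the best of times, it was the worst of times."
tokens = tokenize(sentence)
# 12 words plus a comma and a period: 14 tokens in all.
```

Note that the comma and period count as tokens, which is why punctuation percentages appear among the profiling statistics below.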
Check out the analysis video, or follow the process below, which profiles the first six paragraphs of Charles Dickens’ A Tale of Two Cities to explore his unique literary style.
- _TOTAL_SENTENCES_ is the total number of sentences in the corpus.
- _AVG_SENTENCES_DOC_ is the average number of sentences per document.
- _MAX_SENTENCES_DOC_ is the number of sentences in the longest document by sentence count.
- _AVG_TOKENS_SENTENCE_ is the average number of tokens per sentence in the corpus.
- _MAX_TOKENS_SENTENCE_ is the number of tokens in the longest sentence by token count.
- _TOTAL_TOKENS_ is the total number of tokens in the corpus.
- _AVG_TOKEN_LEN_ is the average number of characters per token.
- _MAX_TOKEN_LEN_ is the number of characters or bytes in the longest token.
- _TOTAL_FORMS_ is the number of unique tokens in the corpus.
- _FORM_80_PERCENT_ is the number of unique tokens that account for 80% of the data.
- _PERCENT_CONTENT_TOKENS_ is the percentage of tokens that are content (excluding numbers, stop words, and punctuation).
- _PERCENT_STOP_TOKENS_ is the percentage of tokens that are stop words.
- _PERCENT_NUM_TOKENS_ is the percentage of tokens containing a number or digit.
- _PERCENT_PUNCT_TOKENS_ is the percentage of tokens that are punctuation marks.
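Several of these statistics can be approximated in plain Python. The sketch below is an illustration of what the counts mean, not the SAS Text Profile implementation; its tokenizer and sentence splitter are simplified stand-ins.

```python
import re

def profile(docs):
    """Compute a few corpus statistics analogous to the Text Profile
    output columns described above (simplified approximations)."""
    sent_counts, all_tokens = [], []
    for doc in docs:
        # Naive sentence split: break after ., !, or ? followed by space.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", doc) if s]
        sent_counts.append(len(sentences))
        # Naive tokenization: words plus individual punctuation marks.
        all_tokens += re.findall(r"\w+|[^\w\s]", doc)
    return {
        "_TOTAL_SENTENCES_": sum(sent_counts),
        "_MAX_SENTENCES_DOC_": max(sent_counts),
        "_TOTAL_TOKENS_": len(all_tokens),
        "_TOTAL_FORMS_": len(set(t.lower() for t in all_tokens)),
        "_AVG_TOKEN_LEN_": sum(len(t) for t in all_tokens) / len(all_tokens),
        "_PERCENT_PUNCT_TOKENS_": 100 * sum(
            1 for t in all_tokens if re.fullmatch(r"[^\w\s]", t)
        ) / len(all_tokens),
    }

stats = profile([
    "It was the best of times. It was the worst of times.",
    "We had everything before us, we had nothing before us.",
])
```

On a real corpus, comparing these numbers against a reference corpus is what surfaces stylistic signatures like Dickens’ unusually long sentences.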
As you can see, Dickens wrote some very long sentences. In these six paragraphs, there were only 19 sentences! From literary works to legal documents, corpus analysis provides the ability to compare information across documents and corpora (more than one corpus). After seeing this analysis, I hope you’re inspired to continue exploring NLP!
About the authors
Ali Dixon is an Associate Marketing Specialist for AI at SAS. She leads product marketing for SAS Visual Text Analytics and SAS Analytics Pro. Ali is a strategic marketer in the Marketing Associate Rotational Program with experience on Partner Marketing and the Global Customer Advisory Board & Product Strategy team. She has an MBA with a concentration in analytics from Baylor University. She also has a MA in Ministry Leadership from Southeastern Seminary and a BA in Journalism and Mass Communication with a concentration in Public Relations from UNC-Chapel Hill. Ali’s passion for innovation, leadership, and service has led her to participate in social impact initiatives around the world.
Mary Osborne is the SAS product manager for text analytics and all things natural language processing. She is an analytics expert with over 20 years of experience at SAS with expertise spanning a variety of technologies and subject matters. She has a special interest in the application of analytics to provide aid during humanitarian crises and enjoys her work in the #data4good and #analytics for good movements. Mary is known for her dynamic and fun presentations and enjoys using technology to solve complex problems.
Article originally posted here. Reposted with permission.