Data Synthesis, Augmentation, and NLP Insights with LLMs

Editor’s note: Tamilla Triantoro, PhD is a speaker for ODSC East this April 23-25. Be sure to check out her talk, “Data Synthesis, Augmentation, and NLP Insights with LLMs,” there!

Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. This technology bridges the gap between human communication and computer understanding, making it possible for machines to translate text, respond to voice commands, and summarize large volumes of information.

Humans use language as a complex tool to exchange information and to convey emotions and intentions. We rely on a combination of vocabulary, grammar, context, and cultural knowledge when we communicate. This presents significant challenges for NLP systems, which must interpret ambiguities, idiomatic expressions, and diverse linguistic structures to accurately process and generate text or speech.

In practical terms, NLP applications range from everyday tools such as spell checkers and voice-activated assistants to more advanced systems that support customer service, content analysis, and personality and mood assessment. The development and refinement of NLP technologies continue to expand the ways in which we interact with machines, enhancing efficiency and opening new avenues for human-computer collaboration.

In NLP, several key techniques streamline and standardize text analysis. Let’s briefly discuss some of them: tokenization, lowercasing, removing punctuation and stop words, and lemmatization. 

Tokenization breaks text into smaller units known as tokens. It is the foundational step in NLP, allowing algorithms to analyze text at a granular level. For example, in the sentence “We like apples”, tokenization would split this into individual words like “we”, “like”, “apples”, making it easier for a computer to process and analyze each word’s role and meaning.
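For readers who want to see the idea in code, here is a minimal sketch of word-level tokenization using only Python's standard library. The `tokenize` helper and its regular expression are illustrative assumptions, not the tokenizer that NLTK, spaCy, or an LLM uses internally. Note that it preserves the original casing; lowercasing is a separate step.

```python
import re

def tokenize(text: str) -> list[str]:
    """Split text into word tokens, keeping internal straight apostrophes (e.g. "don't")."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", text)

tokens = tokenize("We like apples")
print(tokens)  # ['We', 'like', 'apples']
```

Production tokenizers handle many more cases (hyphenation, URLs, emoji, non-English scripts), so treat this as a teaching aid only.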

Lowercasing converts all text to lowercase, which standardizes the dataset and ensures consistency across different inputs. For instance, “Apple”, “APPLE”, and “apple” would all be treated as the same word, “apple”. This is important for search and classification tasks where letter case does not affect meaning, because it reduces the number of distinct forms the machine has to compare.
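The effect is easy to see by counting words before and after lowercasing. This is a small sketch with a made-up word list, using only the standard library:

```python
from collections import Counter

# Made-up sample words for illustration.
words = ["Apple", "APPLE", "apple", "Pear"]

raw_counts = Counter(words)
lower_counts = Counter(w.lower() for w in words)

print(raw_counts["apple"])    # 1: only the exact-case match is counted
print(lower_counts["apple"])  # 3: all three spellings collapse to one key
```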

 

Another technique is removing punctuation and stop words. Stop words are common words such as “the”, “is”, and “at”, which are frequent in languages but often don’t contribute to the main sentiment or meaning. For example, in the sentence “The quick brown fox jumps over the lazy dog,” stop words like “the” and “over” might be removed to focus on the more meaningful words in this sentence. 
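Here is a rough sketch of this step applied to the example sentence. The `STOP_WORDS` set is a tiny hand-picked list assumed for illustration; libraries such as NLTK and spaCy ship much longer, language-specific lists.

```python
import string

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "is", "at", "over", "a", "an", "and", "of"}

def remove_punct_and_stop_words(text: str) -> list[str]:
    # Delete ASCII punctuation, lowercase, split on whitespace, drop stop words.
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return [w for w in cleaned.lower().split() if w not in STOP_WORDS]

sentence = "The quick brown fox jumps over the lazy dog,"
print(remove_punct_and_stop_words(sentence))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```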

Lemmatization reduces words to their base or dictionary form, called the lemma. This addresses the challenge of understanding that different word forms represent the same underlying concept. Unlike stemming, which chops off word endings, lemmatization considers the context and uses morphological analysis to convert words to their lemma. For instance, “running”, “ran”, and “runs” would all be lemmatized to “run”. This helps in identifying the root meaning of words and enhances the machine’s ability to understand and process human language.
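The contrast between stemming and lemmatization can be illustrated with a toy example. Both helpers below are deliberate simplifications that I am assuming for the sake of the demonstration: `naive_stem` just strips two common suffixes, and `toy_lemmatize` uses a three-entry lookup table in place of the full morphological dictionaries (and context awareness) that real lemmatizers rely on.

```python
# Toy lookup table standing in for a real morphological dictionary.
LEMMAS = {"running": "run", "ran": "run", "runs": "run"}

def naive_stem(word: str) -> str:
    """Chop a common suffix off the end of the word, with no dictionary lookup."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word: str) -> str:
    """Return the dictionary form if known, else the word unchanged."""
    return LEMMAS.get(word, word)

for w in ("running", "ran", "runs"):
    print(w, naive_stem(w), toy_lemmatize(w))
# running runn run
# ran ran run
# runs run run
```

The stemmer produces the non-word "runn" and cannot connect "ran" to "run" at all, while the dictionary-based approach maps all three forms to the same lemma.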

To carry out NLP tasks, we generally need specialized libraries such as NLTK and spaCy. Using these libraries, however, requires a solid grasp of programming, library management, and NLP methodologies. With large language models (LLMs), some of these steps can be bypassed: we can now run text analysis through straightforward voice commands and text prompts, reducing the need for deep technical knowledge.

To see this in action, we will use ChatGPT to generate and analyze data. Here is what we will do:

Step 1. Data Generation and Processing: Generate consumer reviews for a fictional product.

Step 2. Data Analysis: Examine the reviews to extract insights and sentiment.

Step 3. Data Visualization: Create visual representations of results.

Step 1. Data Generation and Processing

To start, let’s generate some text data, such as a consumer review for a product. 

Prompt: Generate a consumer review for an electronic watch. Ensure the review is realistic, covering aspects such as product quality, customer service experience, delivery, and overall satisfaction. Avoid including any real names, locations, or sensitive personal information. The length of review is up to 50 words.

ChatGPT: Recently purchased an electronic watch and I’m thoroughly impressed. The build quality exceeds expectations, feeling both durable and stylish on the wrist. Customer service was prompt and helpful, addressing my inquiries with ease. Delivery was swift, arriving earlier than anticipated. Overall, a fantastic product that has met all my needs with remarkable efficiency.

We can apply data processing techniques using human language as follows: 

Prompt: Make all words lowercase, remove punctuation, stop words and perform lemmatization of this review.

ChatGPT: recently purchase electronic watch thoroughly impress build quality exceed expectation feel durable stylish wrist customer service prompt helpful address inquiry ease delivery swift arrive early anticipate overall fantastic product meet need remarkable efficiency

This example shows how effortless it is to perform some of the NLP techniques using simple human language prompts.

Now let’s generate a set of reviews and display them as a table.

Prompt: Generate a set of ten reviews for an electronic watch with various customer perspectives. Please ensure the reviews are realistic. Avoid including any real names, locations, or sensitive personal information. The length of each review is less than 50 words. Display reviews as a table that can be easily copied to Excel.

ChatGPT: 

| Review Number | Review |
|---|---|
| 1 | The watch is stylish and the battery lasts for days. Highly recommend. |
| 2 | Love the fitness tracking features, but syncing can be hit or miss. |
| 3 | Great value for the price. However, the screen scratches easily. |
| 4 | Impressed by the water resistance, but the interface could be more intuitive. |
| 5 | Battery life didn’t meet expectations, but customer service was helpful. |
| 6 | Fantastic design, but I had to return it due to connectivity issues. |
| 7 | Very accurate health monitoring, but the app ecosystem is limited. |
| 8 | Lightweight and comfortable, yet the charging cable feels flimsy. |
| 9 | Excellent for daily use with seamless notifications, though a bit pricey. |
| 10 | Disappointed by the lack of third-party app support. Returning it. |
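If you would rather save the reviews to a file that Excel can open than copy and paste them, one option is to write them out as CSV. The sketch below uses Python's built-in `csv` module and only the first three reviews from the table above; the `reviews_to_csv` helper is a hypothetical name introduced for this example.

```python
import csv
import io

# First three reviews from the table above; extend the list with the rest as needed.
reviews = [
    (1, "The watch is stylish and the battery lasts for days. Highly recommend."),
    (2, "Love the fitness tracking features, but syncing can be hit or miss."),
    (3, "Great value for the price. However, the screen scratches easily."),
]

def reviews_to_csv(rows) -> str:
    """Serialize (number, review) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Review Number", "Review"])
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = reviews_to_csv(reviews)
print(csv_text)
```

The `csv` writer automatically wraps any review containing a comma in double quotes, so each review stays in a single cell when the file is opened in a spreadsheet.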

Step 2. Data Analysis

Text analysis of consumer reviews and social media posts is a great way to understand consumer sentiment and preferences. Pinpointing frequently mentioned features, such as “battery life,” can show a business where to direct its focus to enhance satisfaction and drive sales. Additionally, sentiment analysis sheds light on how consumers perceive products, informing critical decisions around product development, customer service improvements, and marketing tactics. Utilizing LLMs for these tasks simplifies the process, making it accessible without the need for extensive technical expertise, and enabling businesses to swiftly adapt and respond to market demands.

Let’s extract the most important keywords from the reviews to see what consumers care or are concerned about.

Prompt: Read the reviews. Extract 5 important keywords. Rank them by popularity, evaluate sentiment for each keyword, provide an explanation.

ChatGPT:

  • Battery Life: Sentiment – Mixed. Battery life is crucial for electronic devices. While one review highly recommends the watch for its long battery life, another points to unmet expectations regarding battery performance. This indicates that while the watch generally performs well, there are instances where it does not meet customer expectations.
  • Design: Sentiment – Positive. Design is mentioned positively, indicating that the watch’s style and aesthetics are well-received by customers. This suggests that the product’s appearance is a strong selling point that resonates with its user base.
  • Customer Service: Sentiment – Mixed. Customer service is highlighted as helpful in one review, suggesting that the company provides support when issues arise. However, the lack of comments directly associated with customer service in other reviews implies that it may not be a primary concern for all customers.
  • Functionality/Features: Sentiment – Mixed. The functionality or specific features like fitness tracking and health monitoring are mentioned with mixed sentiments. Customers appreciate the watch’s features but experience occasional issues with syncing and app limitations, indicating room for improvement in software integration and ecosystem development.
  • Durability: Sentiment – Negative. Durability, especially concerning the screen’s susceptibility to scratches, is mentioned negatively. This suggests that while the watch may excel in other areas, its build quality or material choice could be points of concern for potential buyers.

This keyword analysis reveals what aspects of the product customers are most engaged with and how they feel about them. While design is perceived positively, there are mixed sentiments on functionality and customer service, and concerns about durability. These are the areas with opportunities for improvement.
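Since an LLM's ranking is not always easy to verify, it can be useful to cross-check it with a simple count. The sketch below tallies how many of the ten reviews mention each of the five themes. The term lists in `THEMES` are my own hand-picked assumptions, not something ChatGPT produced, and different lists will give different counts.

```python
import re
from collections import Counter

reviews = [
    "The watch is stylish and the battery lasts for days. Highly recommend.",
    "Love the fitness tracking features, but syncing can be hit or miss.",
    "Great value for the price. However, the screen scratches easily.",
    "Impressed by the water resistance, but the interface could be more intuitive.",
    "Battery life didn’t meet expectations, but customer service was helpful.",
    "Fantastic design, but I had to return it due to connectivity issues.",
    "Very accurate health monitoring, but the app ecosystem is limited.",
    "Lightweight and comfortable, yet the charging cable feels flimsy.",
    "Excellent for daily use with seamless notifications, though a bit pricey.",
    "Disappointed by the lack of third-party app support. Returning it.",
]

# Hypothetical term lists for each theme, chosen by hand for this illustration.
THEMES = {
    "Battery Life": ["battery"],
    "Design": ["design", "stylish"],
    "Customer Service": ["customer service"],
    "Functionality/Features": ["fitness tracking", "health monitoring", "syncing", "app"],
    "Durability": ["scratches", "flimsy"],
}

def count_theme_mentions(texts, themes):
    """Count how many reviews mention at least one term for each theme."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for theme, terms in themes.items():
            # \b keeps "app" from matching inside "disappointed".
            if any(re.search(rf"\b{re.escape(term)}\b", lowered) for term in terms):
                counts[theme] += 1
    return counts

mentions = count_theme_mentions(reviews, THEMES)
for theme, n in mentions.most_common():
    print(f"{theme}: {n} review(s)")
```

A count like this will not necessarily reproduce the LLM's ordering, because the model weighs meaning as well as frequency, but large disagreements are a good prompt to reread the reviews yourself.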

Next, let’s perform sentiment analysis to get insights into overall customer satisfaction. 

Prompt: Perform sentiment analysis for generated reviews.

ChatGPT:

| Review Number | Review | Sentiment |
|---|---|---|
| 1 | The watch is stylish and the battery lasts for days. Highly recommend. | Positive |
| 2 | Love the fitness tracking features, but syncing can be hit or miss. | Mixed |
| 3 | Great value for the price. However, the screen scratches easily. | Mixed |
| 4 | Impressed by the water resistance, but the interface could be more intuitive. | Mixed |
| 5 | Battery life didn’t meet expectations, but customer service was helpful. | Mixed |
| 6 | Fantastic design, but I had to return it due to connectivity issues. | Negative |
| 7 | Very accurate health monitoring, but the app ecosystem is limited. | Mixed |
| 8 | Lightweight and comfortable, yet the charging cable feels flimsy. | Mixed |
| 9 | Excellent for daily use with seamless notifications, though a bit pricey. | Positive |
| 10 | Disappointed by the lack of third-party app support. Returning it. | Negative |


The sentiment analysis of this electronic watch reveals a spectrum of customer experiences, predominantly mixed, with positive and negative points highlighting specific areas of interest and concern.
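To put numbers on "predominantly mixed", we can tally the labels from the table above. This short sketch simply counts the ten labels ChatGPT assigned; it does not perform any sentiment analysis of its own.

```python
from collections import Counter

# Sentiment labels for reviews 1-10, copied from the table above.
labels = ["Positive", "Mixed", "Mixed", "Mixed", "Mixed",
          "Negative", "Mixed", "Mixed", "Positive", "Negative"]

tally = Counter(labels)
for label, n in tally.most_common():
    share = n / len(labels)
    print(f"{label}: {n} ({share:.0%})")
# Mixed: 6 (60%)
# Positive: 2 (20%)
# Negative: 2 (20%)
```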

Step 3. Data Visualization

Visualization transforms complex data into an intuitive format, making it easier to spot trends and key themes. A word cloud is ideal for summarizing these reviews because it visually emphasizes the most mentioned aspects of the watch, such as design, battery life, and functionality. 

Prompt: Create a word cloud graph

ChatGPT: I can’t directly generate visualizations such as word clouds within this text-based interface…. 

When asked to create a word cloud, ChatGPT explained that it could not generate visualizations directly within the text-based interface. However, it provided Python code as a workaround. I pasted the code into a Python code editor to create the graph, and then asked ChatGPT to generate a title and a caption. Voilà, this is the result!
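The exact code ChatGPT returned is not reproduced here. As a rough idea of what such a script might look like, the sketch below counts word frequencies with the standard library and, optionally, renders them with the third-party `wordcloud` package. The helper names, the stop-word list, the three sample reviews, and the output file name are all assumptions made for this illustration.

```python
import re
from collections import Counter

# Small hand-picked stop-word list for this illustration.
STOP_WORDS = {"the", "and", "but", "is", "for", "a", "it", "by", "of", "to", "i", "can", "be"}

def word_frequencies(texts):
    """Count lowercase word occurrences across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts

def save_word_cloud(frequencies, path="wordcloud.png"):
    """Render the frequencies as an image; requires `pip install wordcloud`."""
    from wordcloud import WordCloud  # imported lazily so counting works without it
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(path)

# Three of the generated reviews, used here as sample input.
sample_reviews = [
    "The watch is stylish and the battery lasts for days. Highly recommend.",
    "Fantastic design, but I had to return it due to connectivity issues.",
    "Very accurate health monitoring, but the app ecosystem is limited.",
]
freqs = word_frequencies(sample_reviews)
print(freqs.most_common(5))
# save_word_cloud(freqs)  # uncomment once the wordcloud package is installed
```

In a word cloud, words with higher counts are drawn larger, so with only a handful of short reviews most words will appear at a similar size. The picture becomes more informative as the number of reviews grows.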

 

In this tutorial, we covered text generation and processing, data analysis, and visualization using prompts. I hope you enjoyed these simple steps. If you are a business owner, try these steps using actual reviews, and if you are just learning, generate various types of text and see how you can apply these techniques!

I will go into more detail on data synthesis, NLP techniques, and visuals in my upcoming talk “Data Synthesis, Augmentation, and NLP Insights with LLMs” at ODSC East 2024. Meanwhile, let me know if you have questions or comments.

Tamilla Triantoro, Ph.D. is an Associate Professor in the School of Business at Quinnipiac University. She has directed graduate and undergraduate programs in Business Analytics at Quinnipiac University and the University of Connecticut. Her expertise includes Artificial Intelligence, Human-AI Collaboration, and the Future of Work. She has spoken about these topics in various parts of the world and presented her work on six continents. With a Ph.D. from the City University of New York, where she researched online user behavior, Dr. Triantoro brings a deep understanding of the human element to her work.
