ODSC East 2024 Keynote: Carolyn Rosé on Setting Up Text Processing Models for Success

With today's immense volumes of textual data, the field of Natural Language Processing (NLP) is evolving rapidly. At the core of this challenge lies the perennial tension between the need for intricate knowledge engineering and the drive to generate actionable insights with minimal human intervention.

In her keynote speech at ODSC East 2024, Carolyn Rosé, Professor of Language Technologies and Human-Computer Interaction at Carnegie Mellon University, discussed new techniques that allow attendees to leverage large language models (LLMs) in their approaches to text mining and conversational data mining. You can watch the entire keynote here on Ai+ Training.

The Pursuit of Minimal Knowledge Engineering

For decades, NLP has pursued the goal of minimizing human knowledge engineering. The idea is to design models that can automatically learn from data, reducing the need for extensive human input. However, each new generation of machine learning research rekindles the old debate: should more effort be invested in knowledge engineering, or should the focus be on leveraging what can be directly gleaned from data?


The Role of Formal Representations in NLP

Formal representations involve meticulously crafting models based on well-defined rules and structured knowledge of languages. These representations are pivotal in ensuring that NLP models are not just effective but also interpretable and reliable, particularly in specialized or restricted domains where precision is paramount. Recent advancements in neural-symbolic approaches to NLP highlight the potential of integrating formal representations with neural learning techniques. These approaches have shown promising results, providing enhanced model robustness by incorporating structured linguistic and domain knowledge.
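As a rough illustration of the neural-symbolic idea, the sketch below (not from the keynote; all rules and scores are hypothetical) shows how hand-engineered symbolic rules can adjust the output of a learned model, with a rule handling a strategic exception such as negation:

```python
# Illustrative sketch: combining symbolic rule features with a learned
# scorer, in the spirit of neural-symbolic NLP. The rules and the toy
# score adjustments below are hypothetical examples.

def rule_features(text):
    """Hand-engineered symbolic features from structured domain knowledge."""
    text = text.lower()
    return {
        "has_negation": any(w in text.split() for w in ("not", "never", "no")),
        "has_positive_term": any(w in text for w in ("great", "excellent")),
    }

def hybrid_score(text, neural_score):
    """Adjust a neural model's sentiment score using symbolic rules."""
    feats = rule_features(text)
    score = neural_score
    if feats["has_negation"]:
        score = -score          # strategic exception: negation flips polarity
    if feats["has_positive_term"]:
        score += 0.2            # structured knowledge nudges the score
    return max(-1.0, min(1.0, score))  # keep the score in [-1, 1]

print(hybrid_score("not a great movie", 0.6))  # negation rule overrides the neural score
```

The symbolic layer stays interpretable: each adjustment traces back to a named rule, which is exactly the kind of reliability formal representations offer in restricted domains.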

Nevertheless, formal methods are not without challenges according to Carolyn Rosé. Identifying high-utility abstractions and managing strategic exceptions often necessitate external data sources. The process of integrating these formal systems with the more fluid, generalizable insights derived from large datasets poses significant hurdles, requiring a delicate balance between structured knowledge and adaptive learning.


Large Language Models: A Paradigm Shift

Conversely, Large Language Models like GPT and BERT have shifted the paradigm by demonstrating that models trained on vast datasets can achieve remarkable understanding and generation capabilities. These models leverage massive amounts of text to learn a broad range of language patterns and nuances, which can then be applied to a variety of NLP tasks without domain-specific tuning.

Recent developments have seen LLMs being utilized to augment data representations in NLP, offering a more flexible approach compared to strictly formal methods. By dynamically integrating insights from large-scale data, LLMs can adapt to new contexts and domains more effectively, making them particularly valuable in applications requiring broad generalizability and up-to-date knowledge.
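One common pattern behind this idea is to concatenate a sparse, interpretable representation with a dense embedding supplied by a large model. The sketch below is a minimal illustration; `toy_embed` is a deterministic stand-in for a real LLM encoder, and the vocabulary is hypothetical:

```python
# Illustrative sketch: augmenting a sparse bag-of-words representation
# with a dense embedding, standing in for LLM-derived features.
# `toy_embed` is a stand-in for a real LLM sentence encoder.

from collections import Counter

def bag_of_words(text, vocab):
    """Sparse, interpretable counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts.get(w, 0) for w in vocab]

def toy_embed(text, dim=4):
    """Deterministic stand-in for an LLM sentence embedding."""
    vec = [0.0] * dim
    for i, ch in enumerate(text.lower()):
        vec[i % dim] += ord(ch) / 1000.0
    return vec

def augmented_representation(text, vocab):
    """Concatenate symbolic (sparse) and neural (dense) features."""
    return bag_of_words(text, vocab) + toy_embed(text)

vocab = ["model", "data"]
rep = augmented_representation("the model fits the data", vocab)
print(len(rep))  # 2 sparse counts + 4 dense dimensions = 6
```

A downstream classifier trained on the combined vector gets both the precision of the hand-chosen vocabulary and the broad coverage of the dense features.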


Bridging the Gap

Carolyn Rosé continued by exploring ongoing research aimed at enhancing the availability and utility of both formal and informal representations of language. This work examines the productive tension between the two approaches and seeks ways to harness their respective strengths. For instance, blending the precision of formal representations with the scalability and adaptability of LLMs could lead to hybrid models that excel in both specific and general applications.
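One way such a hybrid could be wired together, sketched under stated assumptions: a pipeline answers with a high-precision formal rule when one matches and falls back to a broad-coverage model otherwise. The rules, labels, and `fallback_model` below are hypothetical; the fallback stands in for an LLM call.

```python
# Hedged sketch of a hybrid pipeline: precise formal rules first,
# a broad-coverage model as fallback. `fallback_model` is a
# hypothetical stand-in for an LLM-based classifier.

import re

# Formal layer: precise, domain-specific patterns.
RULES = [
    (re.compile(r"\border (\d+)\b"), lambda m: f"order:{m.group(1)}"),
    (re.compile(r"\brefund\b"), lambda m: "intent:refund"),
]

def fallback_model(text):
    """Stand-in for a general-purpose LLM classifier."""
    return "intent:other"

def hybrid_classify(text):
    for pattern, action in RULES:
        match = pattern.search(text.lower())
        if match:
            return action(match)      # precise formal answer
    return fallback_model(text)       # broad-coverage fallback

print(hybrid_classify("Where is order 42?"))   # formal rule fires
print(hybrid_classify("Tell me a joke"))       # falls back to the model
```

The design choice mirrors the trade-off in the talk: the rule layer guarantees precision where the domain is well understood, while the fallback keeps coverage from collapsing outside it.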

Looking ahead, the challenge for NLP practitioners is to continue exploring this space of tensions to find optimal strategies for utilizing vast textual resources. By understanding the strengths and limitations of different approaches, researchers can better design systems that not only perform well across various tasks and domains but also push the boundaries of what automated text processing can achieve.

If you enjoyed this overview, then you won’t want to miss another keynote. Check out ODSC’s next conference, ODSC Europe, and enjoy 40 trainings/workshops, 130 hybrid sessions, and more! If you’re among the first attendees to register for ODSC Europe, you’ll save 75% by buying early. Not long after is ODSC West, this October 29th-31st!



ODSC gathers the attendees, presenters, and companies that are shaping the present and future of data science and AI. ODSC hosts one of the largest gatherings of professional data scientists with major conferences in USA, Europe, and Asia.