Natural language processing (NLP) occupies an intermediate space between computer science and linguistics, drawing more heavily on one discipline or the other as the task at hand demands. With this breadth in mind, NLP serves as an umbrella term encompassing everything from named entity recognition to machine translation. Applications under that umbrella perform tasks as varied as sentiment analysis, question answering, and coreference resolution, but this same breadth makes it difficult to define "natural language processing" precisely.
The difficulty of pinning down a neat and tidy description of NLP extends to any attempt to outline a concise history of the field. Nonetheless, this two-part article series aims to do just that.
Drawing largely from the timeline laid out by Karen Spärck Jones in "Natural Language Processing: A Historical Review" (2001), this series showcases five exemplary research papers to articulate a unifying foundation for NLP and to show how the field's earlier roots extend into later and contemporary research.
Lexical and syntactic independence
The first forays into NLP came about during the late 1940s, ushered in by the Cold War. Machine translation evolved into a massive defense initiative as governments sought automatic translation from Russian into English to monitor the enemy's correspondence and activity. The year 1952 brought the first international conference on machine translation, and the debut of the journal Mechanical Translation in 1954 further cemented widespread interest in the growing field.
Translation began as a lookup task but slowly expanded to include context-based ambiguity resolution and, ultimately, autonomous sentence grammars and parsers. Syntax was the primary focus of the translation process, and sentences were largely analyzed independently of one another, with minimal integration of world knowledge.
Brown et al. build on this early infrastructure in "The Mathematics of Statistical Machine Translation: Parameter Estimation" (1993). Their work concentrates on word-by-word alignment, echoing the way the earliest machine translation attempts forsook global text structure in favor of lexical and syntactic phenomena.
The model described in the paper starts with a large dataset of English sentences paired with their French counterparts. Then, by examining sequences of words, the model algorithmically maximizes the probability that given English words align in meaning with particular French words. While the algorithm naturally unpacks the simplest of translations, such as milk to lait, it also succeeds in discovering word-to-phrase translations, such as the single word starred being equivalent to marquées d'un astérisque.
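The flavor of this estimation procedure can be sketched with the expectation-maximization loop at the heart of the IBM alignment models. The toy corpus, uniform initialization, and iteration count below are illustrative assumptions, not the paper's actual data or full model:

```python
# Sketch of IBM Model 1-style word-alignment training via EM.
# The parallel corpus here is a toy assumption for illustration.
from collections import defaultdict

corpus = [
    (["the", "milk"], ["le", "lait"]),
    (["the", "house"], ["la", "maison"]),
    (["milk"], ["lait"]),
]

# Translation probabilities t(f|e), initialized uniformly; the first
# E-step normalizes them, so the starting constant is arbitrary.
t = defaultdict(lambda: 1.0)

for _ in range(10):  # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for en, fr in corpus:
        for f in fr:
            # E-step: distribute each French word's alignment mass
            # over the English words in the same sentence pair.
            z = sum(t[(f, e)] for e in en)
            for e in en:
                c = t[(f, e)] / z  # expected alignment count
                count[(f, e)] += c
                total[e] += c
    # M-step: renormalize translation probabilities per English word.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]
```

After a few iterations, t[("lait", "milk")] dominates t[("le", "milk")]: the co-occurrence statistics alone pull milk toward lait, with no linguistic knowledge supplied.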
Brown et al.’s emphasis on statistical techniques over linguistic complexity reflects the idea that a basic concept of sentence composition and probabilistic likelihood can achieve noteworthy results — the core of machine translation and, in many ways, NLP itself.
Relations between entities
In the late 1960s, world knowledge came into play in the creation of NLP models and algorithms, and researchers sought to formalize its connection to meaning. The 1960s was a time of data, of knowledge bases, of ontologies. As the BASEBALL question-answering system and the LUNAR and SHRDLU natural-language systems made their mark over the next decade, NLP moved even closer to artificial intelligence (AI).
These newfound interfaces are most noteworthy for dealing with highly constrained instances of natural language while still setting the stage for NLP applications grounded in real-world problem solving, a first for the field. Moving beyond syntactic constituents, leaders in the field wanted to think about "relationships between the elements of a whole universe of discourse" (Jones 2001).
One method of taking relationships between elements into account is the conditional random field. Consider an instance "where the variables y represent the attributes of the entities that we wish to predict, and the input variables x represent our observed knowledge about the entities." Given this, a conditional random field (CRF) is defined as "a conditional distribution p(y|x) with an associated graphical structure" (Sutton and McCallum 2010). Its power lies in representing dependencies between entities, which can enhance classification tasks. In fact, a key application of NLP, information extraction, relies on constructing a database populated with relationships extracted from text.
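To make the definition concrete, the conditional distribution of the simplest graphical structure, a linear chain, can be computed exactly with the forward algorithm. The emission and transition scores below are hand-set assumptions standing in for learned feature weights, not values from the paper:

```python
# Sketch of p(y|x) for a linear-chain CRF over 3 positions and 2 labels.
# Scores are illustrative stand-ins for feature weights applied to x.
import numpy as np

emission = np.array([[2.0, 0.5],    # emission[t, y]: score of label y at step t
                     [0.3, 1.5],
                     [1.0, 1.0]])
transition = np.array([[1.0, 0.2],  # transition[i, j]: score of label i -> j
                       [0.4, 1.2]])

def log_partition(emission, transition):
    """Forward algorithm: log-sum of exp-scores over all label paths."""
    alpha = emission[0]
    for t in range(1, len(emission)):
        scores = alpha[:, None] + transition + emission[t][None, :]
        m = scores.max(axis=0)                      # stabilized logsumexp
        alpha = m + np.log(np.exp(scores - m).sum(axis=0))
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

def log_prob(y, emission, transition):
    """log p(y|x) = score(y, x) - log Z(x)."""
    score = emission[0, y[0]]
    for t in range(1, len(y)):
        score += transition[y[t - 1], y[t]] + emission[t, y[t]]
    return score - log_partition(emission, transition)

# Sanity check: probabilities over all 8 label sequences sum to one.
total = sum(np.exp(log_prob((a, b, c), emission, transition))
            for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

The transition scores are exactly where dependencies between adjacent labels enter; a plain per-token classifier would keep only the emission terms.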
A comprehensive paper detailing the ins and outs of CRFs is "An Introduction to Conditional Random Fields for Relational Learning" (2010). In it, researchers Sutton and McCallum implement a skip-chain CRF for named entity recognition. The skip-chain CRF is so named for its long-distance edges between similar words, which allow the context of both endpoints to be incorporated. In general, CRFs combine the advantages of discriminative modeling and sequence modeling.
Sutton and McCallum found that skip-chain CRFs outperform other models on named-entity recognition tasks prone to inconsistently mislabeled tokens, that is, tokens whose labeling accuracy varies throughout the document. As their work shows, CRFs embody the broader progression from treating words as islands to capturing their relationships with one another.
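The structure behind those long-distance edges can be illustrated with a small sketch. The token list and the identical-capitalized-word heuristic below are assumptions chosen for illustration, not the paper's feature set:

```python
# Sketch of skip-chain edge construction: alongside the usual linear
# chain, connect repeated capitalized tokens so a confident label at
# one mention can inform the label at the other.
tokens = ["Speaker", "John", "Smith", "said", "that", "Smith",
          "would", "present", "the", "results"]

# Linear-chain edges link each token to its successor.
chain_edges = [(i, i + 1) for i in range(len(tokens) - 1)]

# Skip edges link distant occurrences of the same capitalized word
# (j starts at i + 2 because adjacent tokens are already chained).
skip_edges = [(i, j)
              for i in range(len(tokens))
              for j in range(i + 2, len(tokens))
              if tokens[i] == tokens[j] and tokens[i][0].isupper()]

print(skip_edges)  # → [(2, 5)]: the two "Smith" mentions are linked
```

Here "Smith" at position 5 lacks the helpful "John" context that position 2 has; the skip edge lets the better-supported mention constrain the weaker one, which is exactly the consistency benefit Sutton and McCallum report.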
The next part of this series will track the history of NLP from the 1970s onward, rounding out the discussion with three additional research papers and considering what the future holds for the field.