“There’s simply too much textual information available for human beings to find or analyze it by themselves. Machines must help,” said Chris Biow, the senior vice president of Global Public Sector at Basis Technology.
A 20-year text analytics veteran, Biow started his career in the field in 2000 at Verity, then the leader in enterprise search technology. Not long after he began, the terrorist attacks of September 11, 2001 shocked the world and pushed the entire sector toward national security applications.
In all of his roles, and especially those involving security and defense, Biow tackled one fundamental problem: extracting meaning from huge, heterogeneous repositories of human-generated unstructured text.
These repositories are larger than any one person, or even an army of people, would have time to read and analyze. To provide some sense of scope, 2.5 quintillion bytes of data are created every minute, and much of it is textual data. A human typically reads about a thousand bytes in a minute.
This information problem has become universal, faced by public and private organizations all over the world. But perhaps no sector has a greater interest in solving it than national security, where finding the right information is a matter of life and death.
The Text Analytics Pipeline
Text analytics is part of a larger workflow. Data collection comes first.
Across government and civilian organizations, three huge sources account for most of the initial gathering of information: digital forensics tools, open sources such as the Web, and intercepted communications. Once this data is collected, machine processing can begin. Machines are best suited for the first level of processing: separating what’s interesting from what’s not. After this noise-reduction phase, humans shoulder the analytical burden.
“One way to look at this is that text analytics allow the computer to perform a triage function, narrowing this impossibly large stream of text to the relative trickle that humans actually will have the time to look at,” Biow said. The essential function of these machines, he added, is helping humans focus.
Google Alerts are a simple example of this process. For problems that can be expressed as a simple, highly selective text query, the alerts provide an automated workflow, sifting the entire Web to deliver usable quantities of information to a user.
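To make the triage idea concrete, here is a minimal sketch of a highly selective text query applied to a stream of documents. It is an illustration only, not how Google Alerts or any product mentioned here actually works. The function name `triage`, the sample documents, and the query terms are all invented for the example.

```python
import re
from typing import Iterable, List


def triage(documents: Iterable[str], required_terms: List[str]) -> List[str]:
    """Keep only documents that contain every required term.

    Terms are matched as whole words, ignoring case. This is a
    hypothetical stand-in for the machine filtering step, not a
    real alerting system.
    """
    patterns = [
        re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in required_terms
    ]
    return [doc for doc in documents if all(p.search(doc) for p in patterns)]


# Invented sample stream: most documents are noise for this query.
stream = [
    "Quarterly earnings rose at the regional bank.",
    "Port authority reports a suspicious shipment at the harbor.",
    "Local team wins the harbor regatta.",
    "Suspicious package found; shipment delayed at harbor customs.",
]

# Only documents mentioning both terms reach the human analyst.
flagged = triage(stream, ["suspicious", "harbor"])
for doc in flagged:
    print(doc)
```

The filter is deliberately crude: it narrows the stream to the few documents a person should read and leaves the actual analysis to that person.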
Text Analytics in the Wild: Civilian Agency Applications
Civilian agencies apply text analytics to different sources of information and for different applications, and their legal authorities vary widely across law enforcement, national security, and benefits distribution.
But across all of these, another common pattern emerges. Working with citizens, beneficiaries, and other parties puts the focus on people and, particularly, their names. In any given situation, Biow notes, a key goal for analysts is to establish context and verify identities.
Matching names with lists is a common and crucial aspect of civilian agency text analytics. The 2013 Boston Marathon bombing is a tragic example of the importance of this process.
U.S. Customs and Border Protection (CBP) checks travelers’ information against national and international watchlists to flag potentially dangerous individuals. But the screening technology in 2013 had a key gap in handling name diversity.
“This is due to the complexity of human names as they travel across languages, cultures, and written alphabets,” Biow said. “For example, there is one correct way to spell the name ‘Muhammad’. That’s in Arabic! But in the Latin letters we use in written English, and for our national security databases, there are at least ten common variants.”
Tragically, similar spelling problems allowed one architect of the attack, Tamerlan Tsarnaev, to leave and re-enter the US without issue, despite being on watchlists both times. After the incident, CBP upgraded its screening system using text analytics technology built by Basis Technology. This machine learning-driven application could handle the various challenges posed by name variation.
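A rough sketch shows why exact string comparison fails on transliterated names and how a tolerant comparison helps. This is not the Basis Technology system, which the article describes only as machine learning-driven. It swaps in a generic character-similarity score from Python's standard library. The names, the `screen` function, and the 0.8 threshold are all assumptions made for illustration.

```python
import unicodedata
from difflib import SequenceMatcher
from typing import List, Tuple


def normalize(name: str) -> str:
    """Lowercase, strip diacritics, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.lower().split())


def similarity(a: str, b: str) -> float:
    """Character-level similarity between 0.0 and 1.0 (generic, not name-aware)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def screen(traveler: str, watchlist: List[str],
           threshold: float = 0.8) -> List[Tuple[str, float]]:
    """Return watchlist entries whose similarity meets the threshold, best first.

    The threshold is an arbitrary illustrative value; a real system
    would tune it against false positives and false negatives.
    """
    scored = [(entry, similarity(traveler, entry)) for entry in watchlist]
    hits = [(entry, score) for entry, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Invented names: the traveler's document uses a different transliteration.
watchlist = ["Muhammad Karimov", "Elena Petrova"]
traveler = "Mohammed Karimov"

print(traveler in watchlist)        # exact match misses the variant
print(screen(traveler, watchlist))  # tolerant match surfaces it for review
```

A character-level score like this knows nothing about how names actually vary across languages and scripts, which is the gap a purpose-built name matcher is meant to close. Any hit would still go to a human for review.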
Text Analytics in the Wild: Intelligence Applications
Intelligence organizations have a broader mission: They are tasked with gathering all types of textual information and finding the national security value within that corpus. On the one hand, they have the legal authority to gather nearly any foreign data. On the other hand, the requirement for clearances and the variety of languages drastically reduce the pool of people who could read it. This, according to Biow, means they may analyze far larger amounts of text through a much narrower lens.
The primary problem for intelligence organizations is actually not analysis, but triage: determining what part of the information humans will not see. The goal, Biow said, is to identify “those few needles in the haystack which demand human attention.”
The “needle in a haystack” analogy applies across text analysis in national security, regardless of whether analysts are matching names from lists or searching for useful information amid a massive corpus of text. “Humans don’t have time to consider every possible name match or to read every document, across dozens of languages,” Biow said.
Therefore, the most crucial aspect of modern text analytics tools is their ability to power automated processes and perform triage functions. A saving grace is that these machine functions are relatively easy to scale using now well-established big data technology. Biow said this means combined systems can use both machines and humans to their best advantage, resulting in accurate analysis that is nearly infinitely scalable.
“For the first time in history, the key problem for decision makers isn’t information availability. In fact, almost all the answers to the questions we care about when it comes to security exist somewhere in the grand expanse of Internet and enterprise data. Instead, the problem is information discovery, and that’s a problem that these hybrid, human/machine text analytics systems are already quite good at solving,” Biow said.
“And they’re only going to get better.”