Businesses increasingly delegate simple, boring, and repetitive tasks to artificial intelligence. In a case study, Alexandre Hubert — lead data scientist of software company Dataiku’s U.K. operations — worked on a team of three to automate mail processing with deep learning.
At ODSC Europe 2018, Hubert detailed how his team built fairly successful mail processing software for a young insurance company. The deep learning system correctly processed about two-thirds of all the mail it received, at a rate of 1,000 letters per hour. This marked an improvement over the third-party sorting service the company used before.
Hubert and his team followed a four-part development procedure, detailed here.
Separate Handwritten vs. Typed Letters
Hubert’s team decided early on that their deep learning system should handle handwritten and typed letters differently. So they had to label letters in a provided training dataset as handwritten, typed, or both. They also needed to separate out anything that was not a letter (envelopes, forms, etc.). The team built a web platform and manually labeled about 2,000 documents into those four categories.
Starting from this labeled set, Hubert’s team needed to label the rest of the training data, which was still unlabeled. They used autoencoders, networks that take an input such as an image and are trained to reproduce it.
The network takes the input, encodes it into a compact internal representation, and reconstructs it from that representation. The space of these representations is called the latent space, and it contains the important features necessary to reconstruct an image quickly and efficiently.
Hubert fed the 2,000 labeled images through the autoencoder and paired each image’s latent-space representation with its manually assigned label. This made it easy for a traditional machine learning algorithm, trained on those pairs, to label all images in the dataset. The latent space-informed model achieved an AUC of 97 percent with very low error rates, meaning it very effectively recognized handwritten vs. typed letters.
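The AUC figure can be read as the probability that the model scores a randomly chosen handwritten letter above a randomly chosen typed one. The sketch below computes that quantity directly for a handful of invented scores. The function name `auc`, the labels, and the scores are illustrative assumptions, not values from the case study, and the example reduces the four-category task to two classes for simplicity.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented example: 1 = handwritten, 0 = typed.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]

# Eight of the nine positive/negative pairs are ranked correctly.
example_auc = auc(labels, scores)
```

A model that ranked every handwritten letter above every typed one would score 1.0, and random guessing would land near 0.5.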
Deal with Typed Letters
Hubert said dealing with typed letters was the easiest part of creating the mail processing system. Using a tool called the Tesseract Open Source Optical Character Recognition Engine, the team simply supplied the images and specified their language. Tesseract returned the fully digitized text.
The Tesseract tool isn’t perfect. For instance, when it tries to parse signatures it produces wild and inaccurate characters. Overall, though, the tool was very effective for Hubert’s team: Digitizing the words made sorting a trivial problem. To sort, the team simply computed the letters’ term frequency-inverse document frequency (tf-idf) weights. Running a logistic regression on these weights achieved good sorting results.
Typed letters, then, could be forwarded to the proper departments with relative ease.
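To make the tf-idf step concrete, here is a minimal sketch of the plain textbook weighting in pure Python. It is not the team's code: the function name `tf_idf`, the whitespace tokenizer, and the three example letters are assumptions made for illustration, and library implementations usually add smoothing and normalization. The logistic regression that consumes these weights is left out.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = term count / number of terms in the document
    idf = log(number of documents / documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Invented example letters, not data from the case study.
letters = [
    "please cancel my car insurance contract",
    "i want to report a car accident claim",
    "please update my home address on the contract",
]
weights = tf_idf(letters)
```

Terms that appear in only one letter, such as "cancel", receive a higher weight than terms shared across letters, such as "please", which is what lets a classifier tell topics apart.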
Detecting Words in Handwritten Letters
In contrast, detecting words in handwritten letters is fairly difficult. From a scanned handwritten letter, the deep learning network had to extract all words and find a way to digitize them.
The team started by narrowing the scope of the words they told the deep learning system to read. Body paragraphs, they decided, were the only section it really needed to read to identify a letter’s topic. So Hubert and his team used computer vision and decomposition techniques to find the body paragraphs of the letters.
First, they used dilation by convolution to recognize the letters’ general layout. They achieved this using a cross-shaped dilation kernel. Then, they applied connected component techniques. The process revealed that most written letters have the same structure, meaning the body text of a letter is easy to identify (highlighted in red in the accompanying figure).
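The dilation and connected-component steps can be sketched on a toy binary image, where 1 marks ink. This is an illustration under stated assumptions, not the team's pipeline: the helper names, the 3x3 cross kernel size, the 4-connectivity, and the rule of picking the largest block as the body are all hypothetical choices.

```python
def dilate_cross(image):
    """Binary dilation with a 3x3 cross (plus-shaped) kernel."""
    rows, cols = len(image), len(image[0])
    offsets = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and image[rr][cc]:
                    out[r][c] = 1
                    break
    return out


def connected_components(image):
    """Bounding boxes (top, left, bottom, right) of 4-connected ink blobs."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not image[r][c] or seen[r][c]:
                continue
            stack = [(r, c)]
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while stack:  # flood fill from the first unseen ink pixel
                y, x = stack.pop()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Invented toy page: a short header line and a two-line body.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
blocks = connected_components(dilate_cross(page))


def area(box):
    top, left, bottom, right = box
    return (bottom - top + 1) * (right - left + 1)


# Hypothetical heuristic: treat the largest block as the body text.
body = max(blocks, key=area)
```

Without dilation, every stroke on the toy page is its own component. After dilation, nearby strokes merge into two blocks, a header and a body, which is the kind of layout structure the paragraph above describes.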
After defining the desired area of the image, they needed to identify the lines of text, and then each individual word in each line. This process mimics human reading patterns.
They chose to separate lines and individual words by finding the white space between them. They used the projection profile method, which sums the pixel values along one axis of the image: along each row to find lines of text, and along each column within a line to find words. With ink counted as nonzero, the sum should be close to zero where there’s white space and quite large where something is written.
Graphing these sums shows how the projection profile method identifies spaces separating lines and individual words:
Optimizing this computer vision technique required significant parameter tuning. The method did fail in some instances because spaces between some words weren’t very large, but it was generally very successful.
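A minimal sketch of the projection profile idea on a toy binary image (1 marks ink) follows. The function names and the `threshold` and `min_gap` parameters are assumptions introduced for illustration; they stand in for the kind of parameters that needed tuning, not the team's actual settings.

```python
def projection_profile(image, axis):
    """Sum ink pixels per row (axis="rows") or per column (axis="cols")."""
    if axis == "rows":
        return [sum(row) for row in image]
    return [sum(col) for col in zip(*image)]


def find_segments(profile, threshold=0, min_gap=1):
    """Return (start, end) spans, end exclusive, where the profile has ink.

    A run of blank positions only splits two segments when it is at
    least min_gap long; shorter gaps are absorbed into the segment.
    """
    segments = []
    start = None  # index where the current segment began
    gap = 0       # length of the current run of blank positions
    for i, value in enumerate(profile):
        if value > threshold:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                segments.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        segments.append((start, len(profile) - gap))
    return segments


# Invented toy image: two text lines separated by one blank row.
image = [
    [1, 1, 0, 0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0],
]

# Step 1: row sums separate the lines of text.
line_spans = find_segments(projection_profile(image, "rows"))

# Step 2: column sums inside each line separate the words.
words = [
    find_segments(projection_profile(image[top:bottom], "cols"), min_gap=2)
    for top, bottom in line_spans
]
```

With `min_gap=2`, the second line comes back as a single word because its two ink runs are only one pixel apart. Lowering `min_gap` to 1 splits it in two. This is the sensitivity to narrow spacing noted above.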
Extracting Words from Images
Deep learning can turn images of words into machine-readable text, which natural language processing techniques can then read and organize.
Hubert’s team did a small amount of labeling themselves and augmented those images to inflate the dataset, but they needed more training data to build a truly robust deep learning tool. So they combined their labeled data with the online IAM database, which contains more than 100,000 correctly labeled handwritten words. Then they added word images rendered in various fonts that resemble human handwriting.
Between these three sources, their corpus grew to 300,000 labeled word images.
Using a combination of two popular deep learning architectures, a convolutional neural network (CNN) and a long short-term memory network (LSTM), the team trained a network stack that took an image of a word as input and produced the digitized word. The CNN captures the information in the image, meaning it learns the image’s visual features, such as the fact that it contains a sequence of letters. The LSTM reads the features identified by the CNN and translates them into a sequence of characters.
Based on the digitized words, the team was again able to use tf-idf to find the topic of the letters and identify what department each should be sent to.
Of all letters scanned into the system, the deep learning system attempted to sort 78 percent, and 90 percent of those were sent to the proper department. In the remaining 22 percent of cases the system couldn’t identify a prevailing topic and the letter had to be sorted manually, along with the 10 percent of attempted letters that were routed to the wrong department.
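The percentages above combine as follows. This is simple arithmetic on the reported figures, written as a short script; the variable names are illustrative.

```python
attempted = 0.78                 # share of letters the system tried to sort
accuracy_when_attempted = 0.90   # share of attempted letters routed correctly

# Letters sorted correctly without human help: roughly 70 percent.
correct = attempted * accuracy_when_attempted

# Letters sent to the wrong department: roughly 8 percent of all mail.
misrouted = attempted * (1 - accuracy_when_attempted)

# Letters the system declined to sort: 22 percent.
unsorted_share = 1 - attempted

# Everything that still needs a person: roughly 30 percent.
manual = unsorted_share + misrouted

print(f"correct: {correct:.1%}, manual: {manual:.1%}")
```

So roughly 70 percent of letters were handled automatically and roughly 30 percent needed manual attention.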
In the end, about a third of letters still had to be sorted manually at the insurance company. Overall, though, the process was much quicker than if the third-party mail sorter had done it all. Since then, the team has also implemented curriculum learning.