Supervised machine learning is, in large part, classification: ball vs. strike; dog vs. cat vs. horse vs. cow; and so on. For these types of problems, the most fundamental question is always: can I create an accurate and generalized model (classifier) from the data I have collected? Today, the only way to answer this question is to actually create and train a model and then validate it against held-out data. If you don’t get the validation accuracy you’re looking for, you have to repeat the process over and over again. This is fundamentally a game of whack-a-mole that costs time & money.
Why Don’t Data Scientists Measure?
Brainome believes there is a better approach. We believe that data science should be based on precise measurements, just like every other field of science & engineering. Rather than engaging in trial & error, we believe that data scientists should measure their information – specifically, the learnability of their training data. Doing so would allow one to very quickly answer some fundamental questions, before any model creation or training:
– do I have enough data?
– what are my most important data features?
– will my model generalize or overfit?
– what kind of ML model will produce the best results with my data?
– what is my accuracy vs generalization trade-off curve?
If the training data is learnable, one can use the learnability measurements to guide the automated creation of a bespoke model from scratch — similar to a tailor who sews bespoke clothing based on measurements of his client’s physical dimensions.
Learnability vs Memorization
So what makes one data set learnable and another data set not learnable? Let’s look at two examples.
Example A: 2, 4, 6, 8
If you’re given the set of numbers [2, 4, 6, 8] and asked to guess the next number in the sequence, what would your guess be? Most people would say 10, followed by 12, because the rule is “+2”. This is the epitome of a learnable data set – the explanatory rule (pattern) is immediately obvious.
Interestingly, adding more instances to this data set does not make it more learnable. If you were given [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, ...], you would still have figured out the rule after seeing just the first 4 numbers. At some point in every learnable data set, your learning must plateau and the explanatory rule has to settle.
Example B: 6, 5, 1, 3
If we play the same game with [6, 5, 1, 3], then it’s a little more challenging. There is no obvious rule or pattern. And that’s because this data consists of 4 numbers chosen at random. The reader may point out that there could be a complicated pattern like “start at 6, then -1, -4, +2, etc.” but such a pattern would be as long as the original sequence and, therefore, doesn’t help at all when it comes to predicting the next number. [6, 5, 1, 3] is no different from [8, 6, 7, 5, 3, 0, 9] – the string of digits made famous by Tommy Tutone’s classic ’80s pop hit.
Jenny’s number (like everyone’s phone number) is just a random sequence of digits with no rhyme or reason whatsoever … which makes it … only learnable via memorization … which is the best one can do with random data.
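The contrast between Example A and Example B can be sketched in a few lines of Python. This is only an illustration under a strong assumption: the only kind of rule the learner looks for is a constant step between neighbors. The function names `infer_constant_step` and `predict_next` are made up for this sketch.

```python
def infer_constant_step(seq):
    """Return the common difference if every neighboring pair differs
    by the same amount; return None if no such rule exists."""
    if len(seq) < 2:
        return None
    step = seq[1] - seq[0]
    for prev, curr in zip(seq, seq[1:]):
        if curr - prev != step:
            return None
    return step


def predict_next(seq):
    """Predict the next element using the constant-step rule, if any."""
    step = infer_constant_step(seq)
    if step is None:
        return None  # no rule found: the sequence can only be memorized
    return seq[-1] + step


example_a = [2, 4, 6, 8]
example_a_long = list(range(2, 26, 2))  # 2, 4, ..., 24
example_b = [6, 5, 1, 3]

print(infer_constant_step(example_a))       # 2
print(infer_constant_step(example_a_long))  # 2 (same rule, learning has plateaued)
print(predict_next(example_a))              # 10
print(predict_next(example_b))              # None
```

The longer version of Example A yields exactly the same rule as the first 4 numbers, while Example B yields no rule at all under this assumption.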
Human Learning vs Machine Learning
The lead character in the Amazon Prime series “Mozart in the Jungle” is a symphony conductor whose life revolves around classical music. In one episode, he asks his audience: “Why do you call it ‘classical music’?” His point is: when the symphonies were first written and performed for audiences, they were known as just “music”. Similarly, there is a modern discipline called “machine learning” which some people believe has magical predictive powers. But, the reality is that computers learn exactly the same way humans learn – which shouldn’t really be a surprise since humans invented computers … and machine learning.
So how exactly do humans learn? Well, we either (1) memorize or (2) recognize patterns & rules. As explained above, some data sets can only be memorized because they are inherently random. Non-random data sets, however, don’t require memorization because rules can be extracted and used for prediction (i.e., learning occurs).
Compressible = Learnable = Generalization
Consider two different strategies for teaching children how to multiply. One requires memorization. If we were to ask a child (or a computer) to “learn” multiplication by memorizing an ever-growing multiplication table, that learner (human or digital) would eventually run out of memory. And short of having infinite memory (which is not possible), there would still be many multiplication problems that the memorization-driven learner could not solve.
Alternatively, if we teach the learner a simple rule – multiplication is just adding the same number over and over again – they could solve an INFINITE number of multiplication problems while still preserving lots of memory to learn other topics. And the memory footprint of a recursive program that adds numbers together to implement multiplication is vanishingly small compared to the memory footprint of a 1-trillion-row by 1-trillion-column multiplication table. Simply put, multiplication is pretty easy to learn … as long as your learning strategy isn’t memorization.
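The two strategies can be compared in a minimal Python sketch. The names `build_table` and `multiply_by_rule` are hypothetical, the table size of 12 is an arbitrary choice, and the rule is written as a loop rather than as a recursive program to stay clear of Python’s recursion limit.

```python
def build_table(n):
    """Memorization: store every product for factors 0..n."""
    return {(a, b): a * b for a in range(n + 1) for b in range(n + 1)}


def multiply_by_rule(a, b):
    """Rule: add `a` to a running total `b` times (b must be >= 0)."""
    total = 0
    for _ in range(b):
        total += a
    return total


table = build_table(12)

print(len(table))               # 169 memorized entries for factors 0..12
print(table.get((13, 7)))       # None: 13 x 7 was never memorized
print(multiply_by_rule(13, 7))  # 91: the rule covers it anyway
```

The table grows with the square of the largest factor it covers, while the rule stays the same few lines no matter how large the factors get.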
So there’s clearly a relationship between compression and learnability which is further explored in this YouTube lecture.
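One rough way to see this relationship for yourself is to run a general-purpose compressor over patterned and random data. The sketch below uses Python’s standard `zlib` module purely as a stand-in; it is not the measure discussed in the lecture, and the sequence length and random seed are arbitrary choices.

```python
import random
import zlib

N = 4096  # number of bytes in each sequence (arbitrary)

# Patterned data: the "+2" rule, wrapped to fit in one byte per value.
patterned = bytes((2 * i) % 256 for i in range(1, N + 1))

# Random data: bytes drawn from a seeded generator so the run is repeatable.
rng = random.Random(0)
random_bytes = bytes(rng.randrange(256) for _ in range(N))

patterned_size = len(zlib.compress(patterned))
random_size = len(zlib.compress(random_bytes))

print("original size:        ", N)
print("patterned, compressed:", patterned_size)
print("random, compressed:   ", random_size)
```

The patterned sequence should shrink to a small fraction of its original size, while the random sequence should barely shrink at all, which mirrors the difference between a rule and a list that has to be memorized.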
The Punch Line
Most adults know how to multiply. But, since we’re all unique, the multiplication model in each of our heads is slightly different and occupies a different amount of memory. Similarly, anyone who is 5 or older has a pretty general dog vs cat classifier somewhere in their brain. Like the multiplication model, the dog vs cat classifier is slightly different for everyone. What we cannot do today is measure the memory footprint of your dog vs cat classifier (i.e., the number of neurons it took for you to learn the difference between dogs and cats).
What Brainome can do today is measure the amount of memory a machine learning model needs in order to learn an arbitrary data set. The details of how we do this are explained in this academic paper. Knowing how much memory is required to learn a data set is the key to measuring (quantifying) learnability.
If we ask ourselves, “What is the hardest thing in the world to learn?”, it should be clear from Example B above that the answer is random data. Randomness can only be memorized – it cannot be learned. In fact, it is the opposite of learnable. If we can measure how much memory is needed to learn a random data set of arbitrary size and class balance, then we have our stake in the ground for determining learnability. If the same machine learning model requires less memory to learn your data than is required to learn random data, then it’s a clear indication that your data isn’t random and there are rules & patterns that can be identified, learned and used for prediction, as depicted in the accompanying diagram.
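The idea of comparing your data against a random baseline can be illustrated with a toy Python sketch. To be clear, this is not Brainome’s measurement and it does not reproduce the method from the paper. It assumes a single numeric feature and binary labels, and it uses a made-up proxy for memory: the number of label changes seen when rows are sorted by the feature, i.e. how many thresholds a lookup would need to reproduce the labels exactly. The helper name `thresholds_to_memorize` is hypothetical.

```python
import random


def thresholds_to_memorize(features, labels):
    """Toy memory proxy: count how often the label changes when the
    rows are sorted by feature value."""
    ordered = [label for _, label in sorted(zip(features, labels))]
    return sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)


rng = random.Random(0)  # seeded so the run is repeatable
features = [rng.random() for _ in range(1000)]

# Structured data: the label follows a simple rule on the feature.
labels = [1 if x > 0.5 else 0 for x in features]

# Random baseline: same size and class balance, but labels are shuffled.
shuffled_labels = labels[:]
rng.shuffle(shuffled_labels)

structured_cost = thresholds_to_memorize(features, labels)
random_cost = thresholds_to_memorize(features, shuffled_labels)

print("thresholds for structured labels:", structured_cost)
print("thresholds for shuffled labels:  ", random_cost)
```

Under this proxy the structured labels need a single threshold, while the shuffled labels need hundreds. A gap of that kind is the signal that the data contains a rule rather than noise.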
At ODSC East 2021, check out our talk, “A New Measurements-Based Approach to Machine Learning,” for a deeper discussion on measurements, memory, learning, overfitting, and generalization 🙂
Dr. Gerald Friedland is the CTO of Brainome, Inc. and also teaches as an adjunct professor in the Electrical Engineering and Computer Sciences Department of UC Berkeley. Before that, Dr. Friedland was at Lawrence Livermore National Lab and the International Computer Science Institute in Berkeley. Dr. Friedland’s work is primarily in the areas of signal processing and machine learning. Dr. Friedland has published more than 250 peer-reviewed articles in conferences and journals, as well as 3 books. Dr. Friedland received his doctorate (summa cum laude) in computer science from Freie Universitaet Berlin, Germany, in 2006.