Ensemble models deliver excellent performance across a wide variety of problems. They are often easier to get strong results from than a single, carefully tuned model. In machine learning, ensemble models are the norm. Even if you aren’t using them, your competitors are.
What Can We Do With Ensembles?
Ensemble models combine several simple models, known as weak learners, that aren’t especially accurate on their own. Combined, they compensate for one another’s weaknesses and deliver better performance than any of them would alone.
Two popular ensemble techniques are bagging and boosting. Bagging has its roots in bootstrapping: drawing a random sample of the data with replacement, which helps us understand bias within the dataset. Bagging applies that method to a group of decision trees; each tree trains independently on its own bootstrap sample, and the results are aggregated at the end. Boosting, by contrast, trains models sequentially, weighting each new model toward the errors of the previous ones.
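A minimal pure-Python sketch of the bagging side, not Lemagnen’s code: draw bootstrap samples with replacement, fit an independent weak learner on each (a simple mean stands in for a small decision tree here), and average the predictions at the end.

```python
import random
import statistics

random.seed(0)
data = list(range(1, 11))  # toy dataset of 10 rows

def bootstrap_sample(rows):
    """Draw len(rows) items *with replacement* -- the bootstrap step."""
    return [random.choice(rows) for _ in rows]

def weak_learner(sample):
    """Stand-in for a small decision tree trained on one sample."""
    return statistics.mean(sample)

# Bagging: each weak learner sees its own bootstrap sample,
# and the final prediction aggregates (here, averages) their outputs.
predictions = [weak_learner(bootstrap_sample(data)) for _ in range(50)]
bagged = statistics.mean(predictions)
print(round(bagged, 2))  # close to the true mean of 5.5
```

Each learner’s view of the data is noisy, but averaging fifty of them lands near the true mean, which is exactly the variance reduction bagging is after.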
So why bother? Suppose you have access to two doctors who each diagnose a disease with imperfect but better-than-chance accuracy. If both independently tell you that you have a particular disease, their combined diagnosis is more likely to be right than either opinion alone.
Like those doctors, algorithms can be combined to gain the same benefit. If your weak learners are still “more right than wrong” and their errors are diverse, each one will likely compensate for the weaknesses of the others. Ensembling reduces variance and bias, the two sources of the gap between predicted and actual results.
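The doctor intuition can be made exact. Assuming three independent learners that are each right 70% of the time, a majority vote is correct whenever at least two of the three are right:

```python
p = 0.7  # accuracy of each independent weak learner

# Majority of 3 is correct if all 3 are right, or exactly 2 of 3 are right.
majority = p**3 + 3 * p**2 * (1 - p)
print(round(majority, 3))  # 0.784
```

Three 70%-accurate learners vote their way to 78.4% accuracy, and adding more diverse learners pushes that higher, provided their errors really are independent.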
What Are the Building Blocks?
Weak learners are the first building block of an ensemble. Using decision trees, you can capture complex relationships in your data while taking advantage of the principles above: accuracy and diversity.
The second building block is constraints. You control each tree with a maximum depth, which limits how many rules it can chain together, and with a minimum number of samples required to justify each split, so no rule is based on too little evidence.
In practice it looks like this:
Find data. In Lemagnen’s example, he imports Facebook data describing features of a specific type of post: the category, the number of comments so far, and more. The target is predicting the number of comments the post will receive in the next hour.
Look at how much data you have. With 40,000 rows, there is plenty to work with. Extract the data and split it into a training set and a test set so you can evaluate the model later.
Look at the decision tree. He’s working with a regression problem, so a decision tree regressor is key. He constrains the tree to a maximum of three binary rules; once trained, the tree contains only those three rules.
Evaluate the model. Lemagnen chose a model that integrates well with the result he’s trying to predict. He started with substantial constraints, then decided to increase the depth; the tree became more complex and difficult to visualize. To avoid the overfitting he sees in the deeper tree, he introduces a new parameter constraining the samples behind each rule.
The result of changing the parameters is a tree that’s deep enough to capture a complicated relationship, but not so deep that it stops representing the overall data. The takeaway is to tune those two parameters, depth and minimum samples, until the tree fits the right amount of data.
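A sketch of those two constraints in practice, assuming scikit-learn is available; the dataset here is synthetic, not Lemagnen’s Facebook data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a smooth signal plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The two constraints from the talk: a maximum depth, and a minimum
# number of samples per leaf to justify each rule.
shallow = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)
deeper = DecisionTreeRegressor(max_depth=8, min_samples_leaf=20).fit(X_train, y_train)

print(round(shallow.score(X_test, y_test), 2))
print(round(deeper.score(X_test, y_test), 2))
```

With only three levels of rules the tree underfits the signal; relaxing the depth while requiring at least 20 samples per leaf lets it go deeper without chasing the noise.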
So How Do You Build Diverse Trees?
Instead of training every tree on the full dataset, he builds a new dataset for each tree with the same number of rows, sampled with replacement from the original. He can also give each tree a random subset of the columns, forcing the trees to learn from different features for diversity.
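Both sources of diversity can be sketched in a few lines of pure Python; the column names below are hypothetical stand-ins, not the fields from Lemagnen’s dataset.

```python
import random

random.seed(1)
n_rows = 8
rows = list(range(n_rows))
# Hypothetical feature names, purely for illustration.
cols = ["category", "page_likes", "post_hour", "shares", "comments_so_far"]

def diverse_view():
    # Same number of rows, sampled with replacement from the original...
    row_sample = [random.choice(rows) for _ in rows]
    # ...plus a random subset of the columns, so each tree
    # is forced to learn from different features.
    col_sample = random.sample(cols, k=3)
    return row_sample, col_sample

for _ in range(3):
    print(diverse_view())
```

Every call produces a different slice of rows and features, so no two trees in the ensemble learn from quite the same evidence.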
Allow each tree to overfit its sample slightly so it can learn relationships more complex than the basic patterns in the data. Loosening some constraints to overfit a little is vital; don’t overdo it, but you must learn to gauge when enough is enough.
What Are The Benefits? Downsides?
Bagging is excellent for implementation because everything runs in parallel. Building, training, and deploying can run on different CPUs, so it’s quite easy to scale these models. It’s also easy to run several trees at the same time and compare the best features across them.
One issue is that even though the trees are trained on many different samples, the models always remain somewhat correlated. Ensembles are also harder to interpret than a single tree: because you’re combining several, the hierarchy of rules is no longer apparent.
The presence of outliers can be a good or a bad thing depending on your data. Because only some of your trees see any given outlier, outliers are implicitly diluted. When an outlier would otherwise have too much influence on the model, this pulls the model back toward what really represents your data. On the other hand, outliers sometimes carry valuable insights, and the ensemble may wash them out altogether.
Bagging And Boosting
Lemagnen goes through the steps for both types of ensemble methods, bagging and boosting. For each, he shows how the process surfaces insights in the data, covering the training methods and how boosting compensates for the errors of each previous decision tree. Follow his code on GitHub during the talk to find out how each technique gets you closer to the truth and how these two processes enhance each weak learner to reveal those essential data insights.
Ensembling Isn’t Magic
If your ensemble is built from bad models, it won’t magically reveal insights in the data. However, if you have a few different weak learners that don’t offer much on their own, ensembling can build on each one’s strengths.
For the full steps, be sure to watch Lemagnen as he moves through each ensembling method. There are different reasons to bag or boost, and a variety of tools can help you work through these processes and take full advantage of what ensembling can do for your predictions.