The past decade has witnessed a growing adoption of automation and artificial intelligence (AI) in various industries. This trend has helped companies optimize their custom workflows in a cost-efficient manner, avoid missed deadlines, and enhance the quality of life of their consumers. Naturally, these benefits have boosted the popularity and prevalence of the so-called “smart” systems that incorporate these technologies. For example, voice-controlled digital home assistants and self-driving cars have already become a part of our lives. Not surprisingly, at LinkedIn, we wanted to apply automation and AI principles to the management of our large-scale streams infrastructure. To achieve this goal, we built Cruise Control.
Cruise Control is a system that aims to provide effortless management of Kafka clusters. Several factors have been influential in building such a system. First, the growing scale (e.g. LinkedIn has over 2.6K brokers hosting ~5M partitions, which serve ~5 trillion messages per day) induced new challenges in manually identifying, tracking, and mitigating issues with unhealthy cluster components (e.g. racks, hosts, and brokers) and logical entities (e.g. replicas, leadership, and partitions). This increased the mean time to recovery, reduced availability, and intermittently prevented clients from producing to or consuming from partitions. In particular, it has become clear that reactive mitigation alone is not adequate for managing our infrastructure.
Next, the uneven distribution of resource utilization over brokers (e.g. due to differing popularity of partitions) has made Kafka less responsive to certain client requests, which exposed clients to higher variability in both throughput and latency. Finally, adding or removing brokers for expanding, shrinking, or upgrading clusters has also incurred a significant management overhead. These operational challenges convinced us that building a system that minimizes the need for human intervention was the way to go.
To achieve its goals, Cruise Control (1) supports admin operations for cluster maintenance, (2) enables anomaly detection with self-healing, and (3) provides real-time monitoring for Kafka clusters. Admin operations enable Cruise Control to handle on-demand requests without revealing the complexity of the underlying process to the users. Common requests include providing dynamic load balancing; adding, removing, or demoting brokers; and exposing and fixing offline replicas. To support such user demands, Cruise Control tackles a multi-objective optimization problem: it balances multiple conflicting goals while trying to minimize its operational side-effects.
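These admin operations are exposed through Cruise Control's REST API. The sketch below composes request URLs for a couple of common operations; the endpoint and parameter names follow the open-source Cruise Control REST API, while the host, port, and `admin_request` helper are placeholders for illustration.

```python
# Sketch: composing Cruise Control REST calls for common admin operations.
# Endpoint/parameter names follow the open-source Cruise Control REST API;
# the host, port, and this helper function are illustrative placeholders.
from urllib.parse import urlencode

BASE = "http://cruise-control.example.com:9090/kafkacruisecontrol"

def admin_request(endpoint, **params):
    """Build the URL for an admin endpoint (e.g. rebalance, remove_broker)."""
    query = urlencode(params)
    return f"{BASE}/{endpoint}?{query}" if query else f"{BASE}/{endpoint}"

# Trigger a dry-run rebalance: proposals are computed but not executed.
print(admin_request("rebalance", dryrun="true"))

# Decommission broker 42, moving its replicas to the remaining brokers.
print(admin_request("remove_broker", brokerid=42, dryrun="false"))
```

A dry run is a useful first step in practice: it lets an operator inspect the proposed replica movements before committing the cluster to them.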
This optimization process adopts a variant of the hill-climbing algorithm to satisfy specified requirements while considering their priorities, strictness (e.g. best-effort), and mode. This lets it yield a near-optimal and timely solution in a dynamic environment. Anomaly detection identifies pre-defined goal violations (e.g. rack-awareness for fault-tolerance), broker failures, or metric anomalies (e.g. an abnormal increase in log flush rate due to bad disks). Self-healing takes automated action to fix the detected anomalies reactively or proactively. For example, broker failures are fixed by decommissioning the failed brokers. Real-time monitoring lets users examine the distribution of cluster entities, identify unhealthy partitions, and check the health of cluster components. Overall, these key features made Cruise Control an essential system for us.
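To make the hill-climbing idea concrete, here is a toy sketch under heavy simplifying assumptions (a single resource dimension, integer partition counts, one goal): it repeatedly applies a single-partition move from the hottest broker to the coolest one, accepting the move only if it improves the objective, and stops at a local optimum. This is an illustration of the technique, not Cruise Control's actual implementation, which juggles many prioritized goals at once.

```python
# Toy hill-climbing sketch (not Cruise Control's actual code): greedily
# move one partition at a time from the most-loaded broker to the
# least-loaded one while the move reduces overall imbalance.

def imbalance(loads):
    """Objective: spread between the hottest and coolest broker."""
    return max(loads.values()) - min(loads.values())

def rebalance(loads):
    """Hill climbing: apply single-partition moves while they improve
    the objective; stop when no move helps (a local optimum)."""
    loads = dict(loads)
    while True:
        src = max(loads, key=loads.get)   # hottest broker
        dst = min(loads, key=loads.get)   # coolest broker
        candidate = dict(loads)
        candidate[src] -= 1
        candidate[dst] += 1
        if imbalance(candidate) >= imbalance(loads):
            return loads                  # no improving move left
        loads = candidate

print(rebalance({"b0": 10, "b1": 2, "b2": 3}))  # → {'b0': 5, 'b1': 5, 'b2': 5}
```

The real system evaluates each proposed move against an ordered list of goals, so a move that helps a low-priority goal is rejected if it regresses a higher-priority (e.g. rack-awareness) goal; this toy collapses all of that into a single scalar objective.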
Cruise Control has established itself in a rapidly growing data streaming ecosystem and has built a global open source community around it. We encourage the wider community to follow this quick tutorial to enjoy its benefits!
If you want to learn more about this topic, be sure to check out my talk at ODSC East 2019 this April 30-May 3 in Boston, “How LinkedIn Navigates Streams Infrastructure Using Cruise Control.”