Author’s note: This article on survival analysis was originally published on The Crosstab Kite.
Sometimes data scientists just don’t realize survival analysis is a good fit for their particular projects, so let’s talk about applications. Specifically:
- When and why you should consider survival analysis for a project
- What you get out of a survival analysis
- Examples of survival analysis applications other than clinical research.
When should I use survival analysis?
Kleinbaum and Klein’s introductory textbook defines survival analysis (page 4) as
“…a collection of statistical procedures for data analysis for which the outcome variable of interest is time until an event occurs.”
This definition misses the point. Here’s the main question to ask when considering survival analysis:
Do we need to act before we observe all of the data?
If yes, use survival analysis.
When we can’t afford to wait for all the data to roll in, some of our data will be censored. We have to close our observation window at some point to finalize our dataset and make decisions, but at that point, some subjects will not yet have experienced the target event. For these subjects, we know only that the time-to-event is at least some duration.
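To make the censoring idea concrete, here is a minimal sketch of how a censored dataset might be assembled once the observation window closes. The function name, the dates, and the (duration, observed) record layout are all illustrative assumptions, not something prescribed by any particular library.

```python
from datetime import date


def to_survival_record(start, event_date, cutoff):
    """Return (duration_days, observed) for one subject.

    observed is True when the event happened on or before the cutoff;
    otherwise the duration is right-censored at the cutoff.
    """
    if event_date is not None and event_date <= cutoff:
        return (event_date - start).days, True
    return (cutoff - start).days, False


cutoff = date(2021, 6, 30)  # hypothetical end of the observation window
subjects = [
    (date(2021, 1, 1), date(2021, 3, 2)),  # event seen inside the window
    (date(2021, 2, 1), None),              # no event yet: censored
    (date(2021, 4, 1), date(2021, 9, 1)),  # event after the cutoff: censored
]
records = [to_survival_record(s, e, cutoff) for s, e in subjects]
```

For the two censored subjects, the recorded duration is only a lower bound on the true time-to-event, which is exactly the "at least some duration" information described above.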
For example, suppose we make kitchen appliances and we need to decide if a new manufacturing process decreases the lifespan of our toasters. We need to make the decision in a few months at most, but toasters wear out over the course of years.[2] We can’t just ignore the toasters that haven’t broken yet; that would lead us to underestimate toaster lifespan. So how do we incorporate that data? Survival analysis!
Notice that I didn’t mention time-to-event in the survival analysis criterion above. Usually, we do care about durations, it’s true, but not always.
When to think about using survival analysis. The importance of durations vs. binary classification only comes into play if we don’t have censored data. Image by the author.
Let’s stick with the toaster example. Suppose we don’t care how long toasters survive; we just want to make sure they last at least three years. Even at this threshold, we still have a censoring problem, because we need to make decisions with only a few months of data. Survival models take this censored data into account.
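As a rough illustration of the three-year question, the sketch below fits a constant-hazard (exponential) model, for which the maximum-likelihood rate under right censoring is the number of observed failures divided by the total time at risk. The constant-hazard assumption is mine and is a strong one (it is probably wrong for parts that wear out), and the toaster data is made up. Extrapolating from six months to three years leans entirely on that assumption.

```python
import math


def exponential_survival(durations, observed, t):
    """Estimate P(T > t) under a constant-hazard (exponential) model.

    With right censoring, the maximum-likelihood hazard rate is
    (number of observed events) / (total time at risk).
    """
    events = sum(observed)
    exposure = sum(durations)
    rate = events / exposure
    return math.exp(-rate * t)


# Hypothetical toaster data: months on test, and whether the toaster failed.
durations = [2, 3, 4, 6, 6, 6, 6, 6, 6, 6]
observed = [True, True, True, False, False, False, False, False, False, False]

# Censoring-aware estimate of surviving 36 months (about 0.12 here).
p_three_years = exponential_survival(durations, observed, t=36)

# Naive alternative: drop the unbroken toasters entirely (effectively zero).
failed = [d for d, o in zip(durations, observed) if o]
naive = exponential_survival(failed, [True] * len(failed), t=36)
```

Throwing away the censored toasters makes the product look far worse than the data supports, because the seven units that survived the whole test contribute most of the time at risk.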
The terminology of survival analysis (survival, hazard, failure, etc.) implies the target event must be a bad thing to be delayed as long as possible. This is emphatically not the case; in many applications, the outcome event is a good thing and we want it to happen as soon as possible.
What do I get out of a survival analysis?
Survival models accomplish the same things as other supervised models: prediction and causal inference. Whereas mainstream regression methods tend to assume the regression function is the expectation of the target variable given a feature vector, survival models tend to output entire distributions and let the data scientist decide how to summarize them. Here are three uses of survival analysis:
- Quantify and visualize the distribution of durations. The shape of a survival curve can help to set expectations, baselines, and thresholds for qualitative program analysis or other quantitative studies.
Suppose we want to help our Customer Care team improve their performance, one aspect of which is the time it takes to resolve support tickets. We build intuition by computing and plotting the Kaplan-Meier survival curve for these durations.
If the plot looks like curve A below, the support team is doing a relatively good job, but I would ask why a small fraction of tickets stays open much longer than the rest. If the plot looks like curve C, the team isn’t doing a great job; most tickets are open for 5 weeks. However, most tickets do get resolved quickly after 5 weeks, so there could be a systematic cause for the delay that we can fix.
Three very different survival curves for a hypothetical Customer Care team’s support tickets. Image by the author.
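For readers who want to see what goes into a Kaplan-Meier curve like the ones described above, here is a small pure-Python sketch of the estimator. The ticket data is hypothetical, and in practice a survival library such as lifelines would handle this (plus confidence intervals and plotting) for you.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time."""
    event_times = sorted({d for d, o in zip(durations, observed) if o})
    curve = []
    survival = 1.0
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Subjects whose event was observed exactly at time t.
        events = sum(1 for d, o in zip(durations, observed) if o and d == t)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical ticket data: days open, and whether the ticket was resolved.
days_open = [1, 2, 2, 3, 5, 8, 8, 10]
resolved = [True, True, False, True, True, False, True, False]

curve = kaplan_meier(days_open, resolved)
for day, prob in curve:
    print(f"day {day}: {prob:.3f} of tickets still open")
```

Tickets that are still open when the data is pulled stay in the at-risk count until their censoring time, which is how the estimator uses them without pretending they were resolved.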
- Predict the survival curve for individual subjects. Survival curve predictions can be used to prioritize interventions, even if we don’t understand the underlying causes.
There are two wrinkles with survival prediction to keep in mind. First, the predicted median survival time may be infinite. This happens when the predicted survival probability doesn’t drop below 50% even at the longest observed duration.
Second, we have to specify whether the prediction subject is new to the system, i.e. their duration starts at 0, or they are an existing subject, censored in the training set. In the latter case, we want to predict time remaining until the target event, not total time.
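Both wrinkles are easy to see in code. The sketch below assumes a predicted survival curve is represented as a list of (time, probability) steps, which is my own simplification; the helper names are hypothetical. For an existing subject, the curve for time remaining is the original curve divided by the survival probability at the time already elapsed.

```python
import math


def survival_at(curve, t):
    """Evaluate a step survival curve [(time, prob), ...] at time t."""
    prob = 1.0
    for time, p in curve:
        if time > t:
            break
        prob = p
    return prob


def median_survival_time(curve):
    """First time the curve reaches 0.5 or lower, else infinity."""
    for time, p in curve:
        if p <= 0.5:
            return time
    return math.inf


def conditional_curve(curve, elapsed):
    """Survival curve for time remaining, given survival up to `elapsed`."""
    base = survival_at(curve, elapsed)
    if base == 0:
        raise ValueError("subject cannot have survived to `elapsed`")
    return [(time - elapsed, p / base) for time, p in curve if time > elapsed]


# Hypothetical predicted curve that levels off at 60% survival.
predicted = [(1, 0.9), (2, 0.8), (4, 0.7), (6, 0.6)]

print(median_survival_time(predicted))   # inf: the curve never reaches 50%
print(conditional_curve(predicted, 2))   # remaining-time curve after 2 units
```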
- Estimate causal relationships to optimize duration. This bucket includes both field experiment analysis and observational causal inference. Don’t pay heed to claims that we just want to estimate associations; when the goal is to optimize time-to-event (in either direction), we need to know what levers we can pull and what their effects will be.[3]
Examples of survival analysis applications (that aren’t clinical research)
I think all data scientists should know at least a bit about survival analysis, but if your work touches any of the following application areas, you really should consider adding survival analysis to your professional tool belt.
A back-of-the-envelope brainstorm of possible survival analysis applications. Image by author.
1. Hardware failure
This is a natural place to start because hardware failure is directly analogous to clinical research. In this case, we’re interested in predicting and preventing mechanical equipment failure instead of human medical problems.
2. Customer analytics
In this application, the subjects are individual customers and there are both good and bad events.
The good outcomes—which we want to accelerate—include marketing and sales conversions, especially for long transactions like making travel arrangements or applying for a mortgage. They also include up-selling to premium tiers and reaching critical levels of engagement.
The classic example of a bad event (from the company’s point of view) that we want to delay is customer churn.
3. Product analytics: time to adoption
Suppose we want to understand how long it takes for customers to upgrade to the latest version of some product, like a mobile phone. We might use the survival curve to make operational plans or detect when adoption is slower or faster than expected. Causal survival models might help us to accelerate product adoption.
From a modeling perspective, this is the same as the customer analytics use cases, but here the target event is about the product instead of revenue.
4. Unit economics: time to break-even revenue
This one falls under customer analytics, but it’s more general. The target event here is earning revenue equal to cost. For example, in marketing, how long does it take for a new customer to generate revenue equal to their acquisition cost? For capital assets (let’s say rental cars, to be concrete), how long until each unit generates revenue equal to its cost? More usefully, how can we reduce break-even time to the point where we feel comfortable scaling the business?
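Here is one way the break-even target might be derived from raw revenue data, as a sketch. The function name, the monthly granularity, and the numbers are all assumptions for illustration. A unit whose cumulative revenue has not yet reached its cost is censored at the number of months observed so far.

```python
def time_to_break_even(monthly_revenue, cost):
    """Return (months, observed) for one customer or asset.

    observed is True if cumulative revenue reached `cost` within the
    months recorded so far; otherwise the duration is censored.
    """
    total = 0.0
    for month, revenue in enumerate(monthly_revenue, start=1):
        total += revenue
        if total >= cost:
            return month, True
    return len(monthly_revenue), False


# Hypothetical customers: monthly revenue so far, and acquisition cost.
paid_back = time_to_break_even([20, 30, 30, 40], cost=75)
still_waiting = time_to_break_even([10, 10, 10], cost=75)
```

The resulting (months, observed) pairs have the same shape as any other survival dataset, so the usual survival estimators apply directly.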
5. Human Resources
Opportunities abound for survival analysis in HR. Modeling employee tenure is a good place to start; the People team needs to decide which policies work to boost morale or retain talent, but employee resignations (typically) happen over durations of years, so most subjects have censored durations.
Other interesting things to model in HR are the time it takes to fill open positions and the time it takes for promotions.
6. Engineering and support tickets
Engineering and Customer Care teams often track work with ticket systems. Managers sometimes track the time it takes to resolve tickets and pressure their teams to reduce that time (hopefully as one objective of several). Survival analysis can be a great fit for this task if some tickets stay open long enough to cause data censoring.
Ticket durations are a great use case for cure models where some fraction of the subjects never experience the target event. In engineering, tickets dumped in the P4 or wontfix buckets look open but everybody knows they’ll never be addressed. The Convoys Python package uses the example of Manhattan building code complaints to illustrate this use case.
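To give a flavor of what a cure model does, here is a toy sketch of an exponential mixture cure model fit by brute-force grid search. This is not the Convoys API or its method; the exponential form, the grid search, the function names, and the ticket data are all my own simplifying assumptions. The model says a fraction of tickets will never be resolved, and the rest are resolved at a constant rate.

```python
import math


def cure_log_likelihood(durations, observed, cure_fraction, rate):
    """Log-likelihood of an exponential mixture cure model.

    A fraction `cure_fraction` of subjects never has the event; the rest
    have exponentially distributed event times with hazard `rate`.
    """
    total = 0.0
    for t, event in zip(durations, observed):
        if event:
            # Density of an event at t for a subject who can have the event.
            total += math.log((1 - cure_fraction) * rate) - rate * t
        else:
            # Still open at t: never going to close, or just not closed yet.
            total += math.log(
                cure_fraction + (1 - cure_fraction) * math.exp(-rate * t)
            )
    return total


def fit_cure_model(durations, observed, cure_grid, rate_grid):
    """Pick the (cure_fraction, rate) grid point with the highest likelihood."""
    candidates = [(c, r) for c in cure_grid for r in rate_grid]
    return max(
        candidates,
        key=lambda p: cure_log_likelihood(durations, observed, p[0], p[1]),
    )


# Hypothetical tickets: six resolved within days, four open for months.
days = [1, 2, 2, 3, 4, 5, 30, 45, 60, 90]
closed = [True, True, True, True, True, True, False, False, False, False]

cure_grid = [i / 50 for i in range(1, 50)]    # candidate never-resolved fractions
rate_grid = [j / 50 for j in range(1, 101)]   # candidate daily resolution rates
cure_fraction, rate = fit_cure_model(days, closed, cure_grid, rate_grid)
```

On this toy data the fit attributes the long-open tickets to the never-resolved group rather than stretching the resolution-time distribution to cover them, which is the behavior that makes cure models attractive for ticket queues.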
7. Loan repayment
Lenders are obviously keenly interested in loan repayment, which is often seen as an all-or-nothing outcome. But the time to default (or full repayment) is a richer way to model the problem, because the longer a loan goes without default, the more the lender is paid. Loans outstanding at the time of analysis that are not (yet) in default are considered censored, as are loans repaid early. See this text for a worked example with code.
8. Inventory management
Forecasting when product supply is going to run out can have a lot of business value, but this is an application where we need to keep in mind the primary survival analysis question about durations vs. decision-making cadence. Survival analysis is only a good fit for products that sell slowly, over the course of months.
Houses are a good example; we might want to track time-to-sale on a monthly basis, but houses often take more than a month to sell.
Despite its narrowly-focused name, survival analysis is a general and powerful framework for modeling outcomes that occur more slowly than our decision-making cadence. If it’s not already in your data science tool belt, consider adding it.
To help you do that, please join my upcoming talk at ODSC West 2021, “Applications of modern survival modeling with Python,” where I’ll explain data censoring and survival data structures in more detail and show in code how to get started with survival modeling.
[1] The claim that survival analysis is rare outside of clinical research is a personal impression, not an empirical fact. Here’s one bit of evidence in support, though: the 2021 AAAI Symposium on Survival Prediction accepted five applications papers (of 22 total), all of which were about human health.
[2] Another way to make decisions quickly is to simulate the process in a lab. It helps, but it’s essential to measure real-world performance as well.
[3] For example, Kleinbaum and Klein (2012, page 16) say there are three goals of survival analysis:
“Goal 1: To estimate and interpret survivor and/or hazard functions…
“Goal 2: To compare survivor and/or hazard functions.
“Goal 3: To assess the relationship of explanatory variables to survival time.”
They seem to be going out of their way to avoid saying that our goal is to learn what causes durations to be longer or shorter, so we can improve the survival time in the direction we prefer.
About the author/ODSC West 2021 speaker: Brian Kent, PhD
Brian Kent is the founder of The Crosstab Kite, a publication for professional data scientists solving real-world challenges. He writes about survival analysis, data-driven decision-making, data science tools, and big picture trends in statistical modeling.
Prior to The Crosstab Kite, Brian worked in the FinTech space as Director of Data Science & Machine Learning at Credit Sesame. Before that, he was a machine learning engineer at Apple, where he worked on autonomous systems, personalized health, and silicon engineering.
Editor’s note: Brian is a speaker for ODSC West 2021. Check out his talk, Applications of Modern Survival Modeling with Python, there!