
Why Auto-Discovering Event Sequences Is Harder Than It Looks

Priya Nakashima February 18, 2025 9 min read


When we started building the auto-discovery engine inside Prodlytix, the problem statement seemed clean: given a user's event stream, identify the ordered sequence of events that best predicts whether they'll retain at Day 30. Feed in the events. Surface the sequence. Done.

We were wrong about "done." Eighteen months and three significant architectural pivots later, we have something that actually works — and the path there taught us more about the limits of sequential pattern mining than most papers on the subject will tell you. This post is a technical account of where we started, where we went wrong, and what we ended up building.

What We Mean by "Auto-Discovery"

To be precise about the problem: we're not trying to discover what events correlate with retention in the aggregate. That's table-stakes analytics — you can get there with a cohort analysis and a join. We're trying to discover the ordered sequence of events, within a bounded time window, that is both statistically discriminative (retained users do it at much higher rates than churned users) and interpretable enough for a PM to act on.

That second constraint — interpretability — is where things get genuinely hard. A machine learning model can discover a 14-event sequence with 82% predictive accuracy that is completely meaningless to a product team ("users who fired events e47, e12, e83, e91, e47 again, e23... retain at higher rates"). The sequence might be real. It's also useless. The goal is to find sequences that are short enough to comprehend, ordered in a way that makes product sense, and discriminative enough to be worth acting on.

Dead End #1: Naive Frequent Sequence Mining

The first approach was GSP (Generalized Sequential Patterns), a well-established algorithm in the sequence mining literature. GSP works by finding frequent subsequences: ordered lists of events that appear in at least a minimum percentage of sequences (the minimum support). Apply it to retained users, apply it to churned users, compare the frequent sequences, and surface the ones that appear in retained users at significantly higher rates. Simple, auditable, interpretable.
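To make "support" concrete, here is a minimal sketch of the counting that frequency-first mining rests on. It assumes each user's stream is a plain list of event names and that gaps between matched events are allowed. The function names and the sample events are illustrative, not part of GSP or of the Prodlytix codebase.

```python
def contains_subsequence(stream, pattern):
    """True if the events in `pattern` occur in `stream` in order (gaps allowed)."""
    it = iter(stream)
    # `event in it` consumes the iterator, so each match must come after the previous one.
    return all(event in it for event in pattern)


def support(streams, pattern):
    """Fraction of streams that contain `pattern` as an ordered subsequence."""
    if not streams:
        return 0.0
    return sum(contains_subsequence(s, pattern) for s in streams) / len(streams)


# Hypothetical event streams, one list per user.
streams = [
    ["signup", "create_project", "invite_teammate"],
    ["signup", "invite_teammate", "create_project"],  # same events, wrong order
    ["signup", "create_project"],
    ["signup"],
]
print(support(streams, ["create_project", "invite_teammate"]))  # 0.25
```

Order matters here: the second user fired both events but in the opposite order, so that stream does not count toward the pattern's support.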

It didn't work. The problem wasn't that GSP is wrong; our event space was too large. When you have 200+ distinct events in your tracking plan, the number of candidate subsequences grows combinatorially. No minimum-support threshold gave us what we wanted. Set low, GSP was too permissive, finding thousands of sequences, many of which were just noise. Set at a reasonable level (≥5% of sequences), it was too restrictive, missing the actual discriminative sequences because they appeared in only 3–4% of retained users. That is still a statistically meaningful signal in a 50,000-user dataset, but it falls below the support floor, so GSP discards it.

We could tune the support threshold, but there was no principled way to know where to set it. A sequence appearing in 4% of retained users vs 0.8% of churned users is a 5x lift — very significant. A sequence appearing in 40% of retained users vs 32% of churned users is a 1.25x lift — much weaker, even though the absolute support is higher. Minimum support was the wrong filter. We needed lift-first filtering, not frequency-first.
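The arithmetic above can be checked in a few lines. This sketch uses the two support pairs quoted in the paragraph; the pattern labels and the 2.0x lift cutoff are illustrative choices, not values from the product.

```python
# (support in retained users, support in churned users), from the paragraph above.
patterns = {
    "rare_but_sharp": (0.04, 0.008),
    "common_but_flat": (0.40, 0.32),
}


def lift(retained_support, churned_support):
    """How many times more often retained users complete the pattern."""
    return retained_support / churned_support


def kept_by_min_support(patterns, min_support):
    """Frequency-first filter: looks only at support among retained users."""
    return {name for name, (r, _) in patterns.items() if r >= min_support}


def kept_by_min_lift(patterns, min_lift):
    """Lift-first filter: looks only at the retained/churned ratio."""
    return {name for name, (r, c) in patterns.items() if lift(r, c) >= min_lift}


for name, (r, c) in patterns.items():
    print(f"{name}: support={r:.1%}, lift={lift(r, c):.2f}x")

print(kept_by_min_support(patterns, 0.05))  # keeps only the weak pattern
print(kept_by_min_lift(patterns, 2.0))      # keeps only the strong pattern
```

Lowering the support threshold to 3% would let the sharp pattern through, but only alongside the flat one, which is why no support threshold separates the two.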

Dead End #2: Feature Engineering Into a Classifier

The second approach: treat this as a supervised classification problem. Label users as retained (Day-30 active) or churned. Engineer features from the event stream — event presence booleans, event order indicators, time-gap features between events. Train a gradient-boosted classifier. Extract feature importances to understand which event sequence features mattered.

This worked much better in terms of predictive accuracy: we got to AUC ~0.78 on holdout fairly quickly. The problem was the feature importance readout. Feature importances from GBDT models are notoriously unreliable when features are correlated, and event sequences are highly correlated by construction. If event B always follows event A, then features like "user did A," "user did B," and "user did A before B" carry nearly the same information, and the importance gets split arbitrarily among them depending on which split the tree happens to make.

We were generating "top 10 most important features" lists that changed substantially between training runs on similar data. A PM cannot build a product strategy around sequence insights that flip every time you retrain the model. Predictive accuracy wasn't the bottleneck. Stability was.

What We Actually Shipped: Lift-Filtered PrefixSpan with Stability Constraints

The third architecture, which is what powers the discovery engine today, combines three things:

PrefixSpan for Sequence Enumeration

PrefixSpan (a depth-first sequential pattern mining algorithm) is significantly more memory-efficient than GSP for large event catalogs, which matters when you're mining sequences across hundreds of thousands of event streams. We use PrefixSpan to enumerate candidate sequences of up to six events; longer sequences are rarely interpretable and tend to be statistically thin.
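For readers who haven't seen PrefixSpan, the core idea fits in a short function. The sketch below is a simplified version under stated assumptions: one event per step (no itemsets), an absolute count threshold, no time-gap constraints, and none of the pruning a production miner would need. It is not the Prodlytix implementation.

```python
from collections import defaultdict


def prefixspan(sequences, min_count, max_len=6):
    """Return {pattern_tuple: number_of_sequences_containing_it}."""
    results = {}

    def mine(prefix, projected):
        # `projected` holds (sequence_index, start_position) pairs: the suffix
        # of each sequence that follows the current prefix's leftmost match.
        first_pos = defaultdict(dict)  # event -> {sequence_index: first position}
        for seq_idx, start in projected:
            seq = sequences[seq_idx]
            for pos in range(start, len(seq)):
                first_pos[seq[pos]].setdefault(seq_idx, pos)

        for event, occurrences in first_pos.items():
            if len(occurrences) < min_count:
                continue  # infrequent extension: prune this branch
            pattern = prefix + (event,)
            results[pattern] = len(occurrences)
            if len(pattern) < max_len:
                # Depth-first: recurse on the suffixes after this event.
                mine(pattern, [(i, p + 1) for i, p in occurrences.items()])

    mine((), [(i, 0) for i in range(len(sequences))])
    return results


# Toy streams with single-letter event names.
toy = [["a", "b", "c"], ["a", "c"], ["b", "a", "c"], ["a", "b"]]
for pattern, count in prefixspan(toy, min_count=2).items():
    print(pattern, count)
```

The memory advantage comes from the projection step: each recursion only stores positions into the original sequences rather than generating and storing candidate patterns level by level.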

Lift-First Filtering

Instead of filtering by minimum support, we filter by minimum lift: a sequence must appear at least 2.5x more frequently in retained users than in churned users (within the same time window) to be considered a candidate. This surfaces sequences regardless of their absolute frequency, so a sequence that appears in only 3.2% of retained users but only 0.6% of churned users (5.3x lift) gets surfaced. A sequence appearing in 45% of retained users and 38% of churned users (1.18x lift) gets filtered out, even though it's "frequent."
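A lift-first filter over mined candidates might look like the following sketch. The 2.5x cutoff and the percentages come from the paragraph above; the event names, the 10,000-user cohort sizes, and the choice to treat a pattern never seen in churned users as infinite lift are assumptions made for illustration.

```python
import math


def lift(retained_hits, n_retained, churned_hits, n_churned):
    """Ratio of the pattern's rate in retained users to its rate in churned users."""
    retained_rate = retained_hits / n_retained
    churned_rate = churned_hits / n_churned
    if churned_rate == 0:
        # Assumption: no churned user matched, so lift is unbounded
        # (or zero if no retained user matched either).
        return math.inf if retained_rate > 0 else 0.0
    return retained_rate / churned_rate


def lift_filter(hit_counts, n_retained, n_churned, min_lift=2.5):
    """Keep patterns whose lift meets the cutoff, regardless of absolute frequency."""
    kept = {}
    for pattern, (r_hits, c_hits) in hit_counts.items():
        value = lift(r_hits, n_retained, c_hits, n_churned)
        if value >= min_lift:
            kept[pattern] = value
    return kept


# pattern -> (users matching in retained cohort, users matching in churned cohort)
hit_counts = {
    ("create_project", "invite_teammate"): (320, 60),  # 3.2% vs 0.6%
    ("login", "view_dashboard"): (4500, 3800),         # 45% vs 38%
}
candidates = lift_filter(hit_counts, n_retained=10_000, n_churned=10_000)
for pattern, value in candidates.items():
    print(pattern, f"{value:.1f}x")
```

Patterns with very few matches can produce large lifts by chance, which is why a filter like this needs a stability check behind it.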

Bootstrap Stability Scoring

This is the piece that solved the stability problem. For each candidate sequence that passes the lift filter, we draw 100 bootstrap resamples of the data and compute the sequence's lift in each one. If the 10th-percentile bootstrap lift is below 2.0x (meaning the sequence's lift is fragile to data perturbation), the sequence gets dropped even if the point estimate is high. We only surface sequences where the lift signal is stable, that is, reproducible across realistic data variation.
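As a rough sketch of the stability check, assume each user has been reduced to a 0/1 flag saying whether their stream contains the candidate sequence. Resampling users with replacement, the nearest-rank percentile, and the function name are assumptions of this example; the 100 resamples and the 2.0x floor come from the description above.

```python
import random


def bootstrap_p10_lift(retained_flags, churned_flags, n_boot=100, seed=0):
    """10th-percentile lift across bootstrap resamples of both cohorts."""
    rng = random.Random(seed)
    lifts = []
    for _ in range(n_boot):
        # Resample each cohort with replacement, keeping its original size.
        r = rng.choices(retained_flags, k=len(retained_flags))
        c = rng.choices(churned_flags, k=len(churned_flags))
        r_rate = sum(r) / len(r)
        c_rate = sum(c) / len(c)
        lifts.append(r_rate / c_rate if c_rate > 0 else float("inf"))
    lifts.sort()
    rank = max(1, (n_boot * 10 + 99) // 100)  # nearest-rank 10th percentile
    return lifts[rank - 1]


def is_stable(retained_flags, churned_flags, min_p10_lift=2.0):
    return bootstrap_p10_lift(retained_flags, churned_flags) >= min_p10_lift


# Synthetic flags: 5% vs 1% in cohorts of 2,000 (point-estimate lift 5x).
well_supported = ([1] * 100 + [0] * 1900, [1] * 20 + [0] * 1980)
# Synthetic flags: 3% vs 1% in cohorts of 200 (point-estimate lift 3x, few matches).
thin = ([1] * 6 + [0] * 194, [1] * 2 + [0] * 198)

print(is_stable(*well_supported))
print(is_stable(*thin))
```

The second pattern clears a 2.5x point-estimate filter, but with only two matching churned users a single resample can halve its lift, and that fragility is what the percentile is meant to expose.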

The output is typically 3–8 sequences per product dataset, ranked by a composite score of lift × stability × interpretability-length-penalty (longer sequences get penalized). The interpretability-length-penalty is deliberate: a 4-event sequence that explains 68% of the lift of a 6-event sequence is almost always more useful to a PM than the longer version.
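The exact form of the composite score isn't spelled out here, so the following is only one plausible shape for it. It assumes stability is already expressed as a number between 0 and 1 and that the length penalty decays geometrically per extra event; both choices, the 0.8 decay, and the equal stability values are hypothetical.

```python
def composite_score(lift, stability, length, length_decay=0.8):
    """lift x stability x length penalty, with a hypothetical geometric penalty."""
    # Assumption: each event beyond the first multiplies the score by `length_decay`.
    length_penalty = length_decay ** (length - 1)
    return lift * stability * length_penalty


# A six-event sequence and a four-event sequence carrying 68% of its lift.
candidates = {
    "six_event": {"lift": 5.0, "stability": 0.9, "length": 6},
    "four_event": {"lift": 3.4, "stability": 0.9, "length": 4},
}
ranked = sorted(
    ((name, composite_score(**c)) for name, c in candidates.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranked:
    print(name, round(score, 3))
```

With this decay the shorter sequence ranks first despite its lower lift. A decay closer to 1 would flip the order, so the penalty strength is a product decision as much as a statistical one.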

The Time-Window Problem

One nuance that isn't obvious from the algorithm description: the activation window over which you mine sequences matters enormously. Mine over day 0–3 and you capture early onboarding behavior. Mine over day 0–14 and you pick up habit formation signals. Mine over day 0–30 and you start to mix activation signals with engagement signals in ways that make the sequences less actionable.

We ended up building an automatic window optimizer that tests multiple windows (day 0–1, 0–3, 0–7, 0–14) and selects the window where the discovered sequences have the highest predictive validity — measured by how well sequences discovered in a training cohort predict retention in a held-out cohort from a different time period. For most B2B SaaS products, the optimal window lands between day 0–3 and day 0–7. Products with complex onboarding (data integrations, team setup) tend to have their most discriminative sequences in the day 0–7 window.
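The selection loop itself is simple; the hard part is the validity measure. In this sketch, `holdout_validity` is a hypothetical callable standing in for "discover sequences on a training cohort, score them on a held-out cohort from a different period," and the validity numbers are placeholders rather than measurements.

```python
def events_in_window(events, end_day):
    """Keep event names from day 0 through `end_day`, in time order.

    `events` is a list of (days_since_signup, event_name) pairs.
    """
    ordered = sorted(events, key=lambda e: e[0])  # stable sort keeps same-day order
    return [name for day, name in ordered if 0 <= day <= end_day]


def select_window(end_days, holdout_validity):
    """Score each candidate window and return the best one plus all scores."""
    scores = {end_day: holdout_validity(end_day) for end_day in end_days}
    best = max(scores, key=scores.get)
    return best, scores


# Placeholder scores for windows ending on day 1, 3, 7, and 14.
placeholder_validity = {1: 0.61, 3: 0.70, 7: 0.74, 14: 0.69}
best, scores = select_window([1, 3, 7, 14], placeholder_validity.get)
print(f"best window: day 0-{best}")

user_events = [(0, "signup"), (2, "create_project"), (9, "invite_teammate")]
print(events_in_window(user_events, best))
```

Scoring on a cohort from a different time period matters because a window chosen on the same users it was mined from will tend to favor whichever window overfits most.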

What This Is Not

We want to be precise about the limitations here, because the temptation to over-interpret sequence discovery output is real.

Auto-discovered sequences are correlational, not causal. We're not saying that completing event sequence A causes retention. We're saying that retained users are much more likely to have completed sequence A during their first week. Causality requires an intervention — an A/B test where you actively route some users through the sequence and not others. The discovery step tells you what to test. It doesn't tell you that the test will work.

We've seen products where the top-discovered sequence was something like: "user created project → invited teammate → received reply notification → commented in thread." That 4-event sequence had 4.8x lift over churned users. When the team redesigned onboarding to push users toward that sequence, Day-30 retention improved by 14 percentage points. That's a causal win. But the discovery step didn't guarantee it — it just made the test worth running.

We've also seen sequences that didn't pan out as interventions: the sequence was discovered because high-intent users who would have retained anyway happened to explore the product more thoroughly. Redirecting lower-intent users toward that sequence didn't improve their retention. Understanding the difference between selection effects (high-intent users naturally discover high-value features) and causal effects (routing users to features causes higher retention) is one of the places where product intuition still matters more than any algorithm.

The Infrastructure Reality

Running this pipeline on production event data requires more infrastructure than the algorithm description suggests. Event streams need to be ordered and sessionized before sequence mining. Raw event tables from most CDPs or data warehouses may come pre-sorted by timestamp, but that ordering alone is not enough: they require additional preprocessing to handle events fired in the same millisecond (common in single-page app instrumentation) and to filter out bot traffic that creates spurious sequences.
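A minimal sketch of the ordering step, assuming the instrumentation attaches a per-client sequence counter that can break same-millisecond ties and that bot users have already been identified upstream. The `Event` fields, `seq_no`, and `bot_user_ids` are hypothetical names, and sessionization is left out.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    user_id: str
    ts_ms: int   # event timestamp in milliseconds
    seq_no: int  # hypothetical client-side counter, used only to break ties
    name: str


def build_streams(events, bot_user_ids=frozenset()):
    """Group events into per-user ordered streams, dropping known bot users."""
    streams = defaultdict(list)
    # Sort explicitly; never rely on the row order of the source table.
    for e in sorted(events, key=lambda e: (e.user_id, e.ts_ms, e.seq_no)):
        if e.user_id in bot_user_ids:
            continue
        streams[e.user_id].append(e.name)
    return dict(streams)


raw = [
    Event("u1", 1000, 2, "invite_teammate"),
    Event("u1", 1000, 1, "create_project"),  # same millisecond as the row above
    Event("u1", 900, 0, "signup"),
    Event("bot-7", 950, 0, "signup"),
]
print(build_streams(raw, bot_user_ids={"bot-7"}))
```

Without a tie-breaker, two events in the same millisecond can come back in either order, and an order-sensitive miner will treat the two orderings as different sequences.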

The PrefixSpan mining step for a dataset of 50,000 users with an average of 200 events each takes around 4–8 minutes on a single compute node with 16 GB of RAM, assuming appropriate pruning. For larger datasets, we partition by user cohort and run in parallel. The bootstrap stability step adds another 20–40 minutes but runs entirely off the mined candidates, which are much smaller than the raw data.

The current pipeline runs on a weekly cadence for most products — enough freshness to catch changes in user behavior after product updates, not so frequent that the PM team drowns in new sequence reports every morning. For products going through rapid onboarding iterations, we can drop to a 3-day cadence with appropriate caveats about statistical noise in smaller cohorts.

Where the Algorithm Breaks

Two known failure modes worth calling out:

Sparse event catalogs. If your product has fewer than 20–30 distinct events in the tracking plan, the sequence space is too small to find discriminative sequences reliably. Everything is correlated with everything else and the lift signals are weak. This usually indicates an instrumentation problem, not an algorithm problem — the tracking plan isn't granular enough to capture meaningful behavioral differences. We surface this as a data quality warning before running discovery.

Small cohorts. Below ~1,500 retained users and ~1,500 churned users in the training window, bootstrap stability scores become unreliable. We can still surface candidate sequences, but the stability constraint gets loosened, and the output should be treated as exploratory rather than directional. Growing products often start in this range; the output gets more reliable as the user base grows.
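Both failure modes can be checked before any mining runs. This sketch hard-codes thresholds loosely based on the numbers above (30 as the upper end of the 20–30 range, and 1,500 per cohort); the function name, message wording, and exact cutoffs are assumptions.

```python
def preflight_warnings(n_distinct_events, n_retained, n_churned,
                       min_events=30, min_cohort=1500):
    """Return data-quality warnings to show before sequence discovery runs."""
    warnings = []
    if n_distinct_events < min_events:
        warnings.append(
            f"Only {n_distinct_events} distinct events in the tracking plan; "
            "lift signals are likely to be weak."
        )
    if min(n_retained, n_churned) < min_cohort:
        warnings.append(
            f"Smallest cohort has {min(n_retained, n_churned)} users; "
            "treat discovered sequences as exploratory."
        )
    return warnings


for message in preflight_warnings(n_distinct_events=12, n_retained=900, n_churned=5000):
    print("WARNING:", message)
```

Returning warnings rather than raising an error matches the behavior described above: discovery can still run on a small cohort, but the output is labeled accordingly.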

The algorithm isn't magic. It's a principled way to automate a search process that most product teams currently do by hand — or don't do at all because writing the SQL for sequence analysis is a two-day data engineering task. The value is in making that search fast and reproducible, not in replacing the product judgment required to act on the results.
