Flaky tests are architecture problems that compound quietly, and the way most teams respond to them makes things worse.
They emerge from a small set of recurring conditions: shared state between parallel test runs, timing dependencies in async operations, environment instability, and test data that is not isolated between executions. When these pile up, your team starts making a quiet trade: retry instead of investigate, quarantine instead of fix.
Google, Spotify, Atlassian, and Daimler Truck AG have all documented this exact trajectory. So has almost every scaling SaaS team we have worked with that treated flaky tests as individual problems rather than architectural issues.
Centerbase, a legal practice management software provider, came to us facing this pattern: false positives, constant manual maintenance, and a suite the team had stopped trusting. We reduced their flaky test failures by 85% and cut execution time by 60%.

In this article, we’ll share how to figure out if you are in the same situation, how bad it actually is, and how to fix it step-by-step.
Three Questions to Find Out If You Actually Have a Problem
Before any solution, you need an honest read on your current exposure. Most teams are watching the wrong metric and drawing the wrong conclusion from it.
Answer these three questions using your actual CI data from the last 60 days.
You’ll See Results Like These
| Signal | Healthy | Warning | Critical |
| --- | --- | --- | --- |
| Unique flaky tests in 60 days | < 2% of the suite | 2–5% of the suite | > 5% of the suite |
| Quarantined tests with a named owner | All of them | Some of them | None or few |
| Failure timestamp correlation done | Yes, regularly | Once or rarely | Never attempted |
| Per-execution flaky rate | < 1% | 1–3% | > 3% |
| Team reaction when CI fails | Investigates root cause | Re-runs and moves on | Doesn't notice anymore |
If two or more rows in your table land in Warning or Critical, your CI signal is already degraded in ways your dashboard is not showing you. The rest of this article is about what architectural changes you need to make to resolve this.
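As a rough way to apply the two numeric rows of the table, the sketch below rates a value against the thresholds shown above and counts how many rows land in Warning or Critical. The function names (`classify`, `assess`) are hypothetical, and the qualitative rows are assumed to be rated by hand and passed in as labels.

```python
def classify(value_pct, warning_at, critical_above):
    """Rate a percentage as 'Healthy', 'Warning', or 'Critical'."""
    if value_pct > critical_above:
        return "Critical"
    if value_pct >= warning_at:
        return "Warning"
    return "Healthy"


def assess(unique_flaky_pct, per_execution_pct, qualitative=()):
    """Return (ratings, degraded) using the thresholds from the table.

    `qualitative` holds hand-assigned ratings for the non-numeric rows
    (ownership, correlation analysis, team reaction).
    """
    ratings = [
        classify(unique_flaky_pct, 2.0, 5.0),   # unique flaky tests in 60 days
        classify(per_execution_pct, 1.0, 3.0),  # per-execution flaky rate
        *qualitative,
    ]
    flagged = sum(r in ("Warning", "Critical") for r in ratings)
    return ratings, flagged >= 2  # two or more flagged rows = degraded signal
```

For example, `assess(16.0, 1.5)` rates the first row Critical and the second Warning, so the signal counts as degraded even before the qualitative rows are considered.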
Your monitoring dashboard probably shows a per-execution flaky rate: the percentage of individual test runs that fail intermittently on a given day.
A 2% rate feels manageable, especially if it has been stable for months.
Stability is the problem. It creates a false sense of control while the real damage accumulates elsewhere.
John Micco, Engineering Productivity Researcher at Google, put a precise number on this gap: a 1.5% per-execution flaky rate affects 16% of total test inventory over time. Against a 2,000-test suite, that is roughly 320 distinct tests behaving non-deterministically over a quarter, while your dashboard showed a tidy, manageable number every single day.

The metric you are watching tells you how many runs failed today. What you actually need to know is how much of your suite can no longer be trusted.
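To see both numbers side by side, here is a minimal sketch that computes them from raw CI records. It assumes you can export each execution as a `(test_name, commit_sha, passed)` tuple, and it uses one common working definition of flaky: a test that both passed and failed on the same commit. The function name `flaky_metrics` and the record shape are illustrative assumptions, not a specific CI tool's API.

```python
from collections import defaultdict


def flaky_metrics(runs):
    """Return (per_execution_flaky_rate, unique_flaky_share).

    runs: iterable of (test_name, commit_sha, passed) tuples.
    A (test, commit) pair is treated as flaky when it has at least one
    pass and at least one failure.
    """
    tally = defaultdict(lambda: [0, 0])  # (test, commit) -> [passes, failures]
    tests = set()
    total = 0
    for test, commit, passed in runs:
        tally[(test, commit)][0 if passed else 1] += 1
        tests.add(test)
        total += 1

    flaky_failures = 0
    flaky_tests = set()
    for (test, _commit), (passes, failures) in tally.items():
        if passes and failures:
            flaky_failures += failures
            flaky_tests.add(test)

    if total == 0:
        return 0.0, 0.0
    # First value: what the dashboard shows. Second: how much of the suite is affected.
    return flaky_failures / total, len(flaky_tests) / len(tests)
```

Run over a 60-day export, the first value is the daily dashboard number averaged out, and the second is the share of your inventory that has behaved non-deterministically at least once.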
Open your quarantine folder right now. Find the oldest test in it. Check when it was quarantined, then check whether a single comment, commit, or ticket explains why it was moved there or when it is coming back.
For most teams, the honest answer is: it was flagged, it was moved, and that was the last anyone thought about it. The test is still counted in your inventory. The coverage it was supposed to provide is quietly gone.
Martin Fowler named this failure mode directly: quarantined tests stop helping with regression coverage, and without accountability, the quarantine folder becomes a one-way door. Atlassian built its Flakinator system precisely because it knew engineers would not follow up on quarantined tests voluntarily.
Their fix was automatic routing to a named owner, a ticket, and a deadline. The design assumption was explicit: good intentions are not enough.
Every test behind that door is a regression your suite will no longer catch. Your coverage percentage looks the same. Your actual coverage does not.
This is where your team is probably spending the most engineering time for the least return.
Consider an example: an engineer spends three hours chasing a flaky timeout in the payment confirmation test and fixes a hardcoded wait. The next week, a different engineer spends two hours on a race condition in the checkout flow. Different tests, different people, different days, but both failures trace to the same shared resource: a test environment database that is not properly isolated between parallel runs.

Two engineering sessions. One root cause. Fourteen more tests in quarantine with the same underlying condition, still waiting.
Owain Parry and colleagues at the University of Sheffield, Allegheny College, and Carnegie Mellon University documented this in a 2025 study: 75% of flaky tests fail in correlated clusters, with a mean cluster size of 13.5 tests. Failures are not random; they cluster because they share root causes. Fix the shared condition, and you resolve, on average, around 13 tests at once.
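One simple way to look for such clusters in your own data is to group tests that repeatedly fail in the same CI runs. The sketch below is not the method from the study; it is a simplified stand-in that links two tests when they fail together in at least `min_shared_runs` runs and then merges linked tests with a union-find. The input shape (a mapping from run ID to the set of failed tests) and all names are assumptions.

```python
from collections import Counter
from itertools import combinations


def co_failure_clusters(failures_by_run, min_shared_runs=2):
    """Group tests that fail together in at least `min_shared_runs` runs.

    failures_by_run: dict mapping a run ID to the set of tests that failed in it.
    Returns a list of clusters (sorted lists of test names), largest first.
    """
    pair_counts = Counter()
    for failed in failures_by_run.values():
        for a, b in combinations(sorted(failed), 2):
            pair_counts[(a, b)] += 1

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Link every pair that co-failed often enough.
    for (a, b), shared in pair_counts.items():
        if shared >= min_shared_runs:
            parent[find(a)] = find(b)

    clusters = {}
    for test in list(parent):
        clusters.setdefault(find(test), set()).add(test)
    return sorted(
        (sorted(c) for c in clusters.values() if len(c) > 1),
        key=lambda c: (-len(c), c),
    )
```

Each cluster it returns is a candidate for a single shared root cause, which is where investigation time is best spent first.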
The economics compound fast. Leinen and colleagues at Daimler Truck AG measured the cost directly: manually investigating a flaky failure costs $5.67 in developer time. An automatic retry costs $0.02 but resolves nothing. It just makes the failure disappear until the next run. For a team of 50 engineers seeing 50 flaky failures per day, investigating every one would cost over $100,000 a year (50 × $5.67 × 365 ≈ $103,000). Choosing retry instead defers that work rather than removing it, and the problem gets worse every sprint.

Most flaky test clusters trace to one underlying design failure: tests that share state. Shared databases, shared test data, shared environment configuration, any of these creates conditions where the order of execution, or the speed of a parallel run, determines whether your tests pass or fail. Non-determinism is a design gap that no amount of individual test fixing will close.
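To make the idea concrete, here is a minimal sketch of per-test isolation. It assumes an in-memory SQLite database stands in for your real test database, and the helper and table names (`isolated_db`, `orders`) are hypothetical. Each test case gets its own database, so running the cases in parallel cannot change either result.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


@contextmanager
def isolated_db():
    """Yield a fresh in-memory database that no other test can see."""
    conn = sqlite3.connect(":memory:")  # every connection is a separate database
    try:
        conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
        yield conn
    finally:
        conn.close()


def case_payment_confirmation():
    with isolated_db() as db:
        db.execute("INSERT INTO orders (status) VALUES ('paid')")
        count = db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
        assert count == 1  # holds regardless of what other cases insert
        return count


def case_checkout_flow():
    with isolated_db() as db:
        db.executemany(
            "INSERT INTO orders (status) VALUES (?)", [("open",), ("open",)]
        )
        count = db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
        assert count == 2
        return count


def run_in_parallel():
    """Run both cases concurrently; isolation keeps the counts independent."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(case_payment_confirmation),
                   pool.submit(case_checkout_flow)]
        return [f.result() for f in futures]
```

With a shared database, the row counts above would depend on which case ran first. With one database per case, execution order and parallelism stop mattering. In a real suite the same idea is usually wrapped in a test fixture, using a per-test schema, transaction rollback, or a throwaway container.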

For Centerbase, isolation was the core of the engagement. Here is what we implemented, and what we put in place across engagements at this stage of suite growth:

Spotify cut its flakiness from 6% to 4% in two months by making flaky tests visible to the people who owned them. Before a single test was repaired, the accountability structure changed, and the rate dropped.

Jason Palmer says that without confidence in your test suite, you are in no better position than a team with zero tests. What your team needs is not more intent around ownership; it is infrastructure that makes ownership unavoidable.
Naming someone in a comment and moving on does not work. Forcing functions do: automatic tickets, fix-by dates with pipeline consequences, and escalation when deadlines pass without action.
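A forcing function of this kind can be as small as a pipeline step that refuses to pass while any quarantined test lacks an owner, a ticket, or a live fix-by date. The sketch below assumes your quarantine list can be loaded into simple records; the `QuarantineEntry` fields and the function name are hypothetical, not part of any particular CI product.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class QuarantineEntry:
    test_id: str
    owner: Optional[str]
    ticket: Optional[str]
    fix_by: Optional[date]


def quarantine_violations(entries, today):
    """List every reason a quarantined test should block the pipeline."""
    problems = []
    for entry in entries:
        if not entry.owner:
            problems.append(f"{entry.test_id}: no named owner")
        if not entry.ticket:
            problems.append(f"{entry.test_id}: no ticket")
        if entry.fix_by is None:
            problems.append(f"{entry.test_id}: no fix-by date")
        elif entry.fix_by < today:
            problems.append(
                f"{entry.test_id}: fix-by date {entry.fix_by.isoformat()} has passed"
            )
    return problems
```

In a CI job, you would print the returned list and exit non-zero when it is not empty, so an expired deadline shows up as a red build rather than a forgotten comment.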
Leinen and colleagues at Daimler Truck AG found that flaky tests consume 2.5% of productive developer time, including 1.1% spent investigating failures and 1.3% spent repairing tests. For a 100-person engineering team at $150K average loaded cost, that is $375,000 per year in direct productivity loss.
The Bitrise Mobile Insights Report, analyzing over 10 million builds between 2022 and 2025, found that the share of teams experiencing flakiness grew from 10% to 26% in that period, a 160% increase as CI pipelines became 23% more complex.
| Cost Category | Annual Impact (100-person team) |
| --- | --- |
| Developer time investigating failures | ~$165,000 |
| Developer time repairing flaky tests | ~$195,000 |
| Retry-only "resolution" (50 failures/day) | ~$100,000+ |
| Total direct productivity loss | ~$375,000 |
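The productivity figures in the table follow from a simple calculation: headcount times loaded cost times the share of time lost. The sketch below reproduces them so you can substitute your own team size and cost. The function name is hypothetical, and shares are expressed in basis points (1% = 100) to keep the arithmetic exact.

```python
def annual_cost(headcount, loaded_cost, share_bp):
    """Annual cost of a time share given in basis points (1% = 100 bp)."""
    return headcount * loaded_cost * share_bp // 10_000


HEADCOUNT = 100         # engineers, as in the example above
LOADED_COST = 150_000   # average loaded cost per engineer, in dollars

investigating = annual_cost(HEADCOUNT, LOADED_COST, 110)  # 1.1% of time
repairing = annual_cost(HEADCOUNT, LOADED_COST, 130)      # 1.3% of time
total = annual_cost(HEADCOUNT, LOADED_COST, 250)          # 2.5% of time

print(f"Investigating: ${investigating:,}")
print(f"Repairing:     ${repairing:,}")
print(f"Total (2.5%):  ${total:,}")
```

The two itemised shares add up to 2.4%, so about $15,000 of the 2.5% total is not broken out in the rows above.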
Every sprint you defer the architecture work, the suite falls further behind a product that keeps growing.
Here is the honest reality: the engineers who could fix your flaky test architecture are the same ones building your next release. That trade-off does not resolve on its own.
If any of this sounds familiar, it might be time to talk to someone who has done this before:
ThinkSys has worked through this with SaaS teams at exactly this stage. We know which patterns cause the most damage, which fixes hold up as suites scale, and how to run this work alongside your roadmap rather than instead of it.
If your team has stopped fully trusting its own CI results, that is worth a conversation. We will tell you honestly what we find and what it takes to fix it.
We helped Centerbase cut flaky test failures by 85%. We can do the same for your team.