I Replaced My Entire Playwright Test Maintenance Workflow With AI — And Saved 8 Hours a Week
Suneet Malhotra
Mar 24, 2026
Let me be honest about something: for years, Tuesdays were the worst day of my week. Not because of stand-ups, not because of sprint demos. Because of Playwright test failures.
Every sprint cycle, the UI team shipped changes. New class names, restructured components, updated aria labels. And every Monday night, our end-to-end suite would quietly break across a dozen tests. By Tuesday morning, my calendar was a graveyard of maintenance tickets. It was reactive, soul-crushing work that did nothing to actually improve quality — it just kept the suite from decaying.
Six months ago, I decided to fix it permanently. Here is the full story of how I rebuilt our test maintenance workflow around AI — and what I would do differently if I were starting from scratch today.
The Problem With Traditional Playwright Maintenance
Playwright is genuinely excellent. The auto-waiting, the locator API, the trace viewer — it is the best end-to-end testing tool I have used in twenty years of QA. But no framework protects you from the fundamental reality of web development: UIs change constantly, and tests written against a snapshot of the DOM become liabilities the moment a designer refactors a component.
The classic failure modes are familiar to every automation engineer:
- Brittle selectors: tests written against data-testid attributes that the frontend team removed in a refactor
- Timing assumptions: waits baked in for a loading state that no longer exists
- Assertion drift: expected text that changed with copy updates or localization
The standard fix is discipline: always use semantic locators, always use getByRole, write tests that survive refactors. Great advice. Also completely insufficient when you are inheriting a 600-test suite built over four years by six different engineers.
Step 1: AI-Powered Failure Triage
The first thing I changed was how we diagnose failures. Instead of an engineer manually reading Playwright error output and tracing through the DOM, I built a pipeline that feeds failure data directly into an LLM.
When a test fails in CI, a script captures three things: the Playwright error message, the aria snapshot of the page at the point of failure, and the current source of the test file. It pipes all of that to Claude via the API with a single prompt:
"This Playwright test failed. Here is the error, the current page state, and the test source. Identify the root cause and suggest a corrected selector or assertion."
The output is a structured JSON object with a diagnosis and a proposed fix. Nine times out of ten, it is correct. We review it, accept or reject the change, and the test is updated. What used to take fifteen minutes of manual tracing now takes ninety seconds.
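The triage step can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the `FailureContext` and `TriageResult` shapes, the field names `diagnosis` and `proposedFix`, and both function names are assumptions, since the JSON schema is not spelled out here. The API call itself is left out; the sketch only covers assembling the prompt and validating the reply.

```typescript
// Hypothetical shapes: the exact JSON schema is an assumption.
interface FailureContext {
  errorMessage: string; // Playwright error output
  ariaSnapshot: string; // aria snapshot of the page at the point of failure
  testSource: string;   // current source of the failing test file
}

interface TriageResult {
  diagnosis: string;
  proposedFix: string;
}

// Assemble the three captured inputs into one clearly labeled prompt.
function buildTriagePrompt(ctx: FailureContext): string {
  return [
    "This Playwright test failed. Here is the error, the current page state, " +
      "and the test source. Identify the root cause and suggest a corrected " +
      "selector or assertion.",
    'Respond with JSON only: {"diagnosis": string, "proposedFix": string}',
    "## Error",
    ctx.errorMessage,
    "## Page state (aria snapshot)",
    ctx.ariaSnapshot,
    "## Test source",
    ctx.testSource,
  ].join("\n\n");
}

// Validate the model's reply before a human reviews it.
// Returns null when the reply is not the expected JSON object.
function parseTriageResponse(raw: string): TriageResult | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const obj = parsed as Record<string, unknown>;
  const diagnosis = obj["diagnosis"];
  const proposedFix = obj["proposedFix"];
  if (typeof diagnosis !== "string" || typeof proposedFix !== "string") {
    return null;
  }
  return { diagnosis, proposedFix };
}
```

Rejecting malformed replies up front means the reviewer only ever sees a diagnosis and a fix, never a half-parsed response.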
Step 2: Self-Healing Locator Generation
Diagnosis is one thing. Preventing the failure from recurring is another.
The second piece of the system is proactive locator analysis. Every time a PR merges to main, a lightweight GitHub Action runs our Playwright locator inventory against the new build. For any locator that has changed or is now ambiguous, it flags the affected tests and automatically proposes updated selectors using the Playwright accessibility tree.
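To make the inventory check concrete, here is a minimal sketch under stated assumptions: the locator inventory is a list of role and name pairs per test file, and the new build's accessibility tree has been flattened into a list of nodes. Both shapes and all names below are hypothetical. The exact string comparison is also a simplification of how Playwright matches accessible names.

```typescript
// Hypothetical shapes: the inventory format is an assumption.
interface AccessibleNode {
  role: string;
  name: string;
}

interface LocatorEntry {
  testFile: string;
  role: string;
  name: string;
}

type LocatorStatus = "ok" | "missing" | "ambiguous";

interface FlaggedLocator {
  entry: LocatorEntry;
  status: Exclude<LocatorStatus, "ok">;
}

// Count how many nodes in the new build a role+name locator would hit.
function checkLocator(entry: LocatorEntry, tree: AccessibleNode[]): LocatorStatus {
  const matches = tree.filter(
    (node) => node.role === entry.role && node.name === entry.name
  ).length;
  if (matches === 0) return "missing";
  return matches === 1 ? "ok" : "ambiguous";
}

// Return only the locators that need attention, so their tests can be flagged.
function flagLocators(
  inventory: LocatorEntry[],
  tree: AccessibleNode[]
): FlaggedLocator[] {
  const flagged: FlaggedLocator[] = [];
  for (const entry of inventory) {
    const status = checkLocator(entry, tree);
    if (status !== "ok") flagged.push({ entry, status });
  }
  return flagged;
}
```

A locator with zero matches has probably been renamed or removed, while one with several matches will fail under Playwright's strict mode, so both cases are worth flagging before the suite runs.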
This is the self-healing pattern in practice. The system does not blindly update tests; that would be dangerous. Instead, it opens a draft PR with the proposed changes, each clearly labeled with a confidence score. High-confidence fixes (exact aria-label match found, one candidate) get auto-merged after a brief delay. Low-confidence ones require human review.
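The routing rule described above can be sketched as a small pure function. The `ProposedFix` shape, the two-level `Confidence` type, and the function names are assumptions for illustration; a real system might use a numeric score, and the merge delay itself is not modeled here.

```typescript
// Hypothetical shape for one proposed locator change.
interface ProposedFix {
  testFile: string;
  oldLocator: string;
  newLocator: string;
  exactAriaLabelMatch: boolean; // an exact aria-label match was found
  candidateCount: number;       // how many candidate elements were found
}

type Confidence = "high" | "low";
type Route = "auto-merge-after-delay" | "human-review";

// High confidence only when both stated conditions hold:
// an exact aria-label match and exactly one candidate.
function classifyConfidence(fix: ProposedFix): Confidence {
  return fix.exactAriaLabelMatch && fix.candidateCount === 1 ? "high" : "low";
}

// Everything that is not high confidence goes to a human.
function routeFix(fix: ProposedFix): Route {
  return classifyConfidence(fix) === "high"
    ? "auto-merge-after-delay"
    : "human-review";
}

// One line of draft-PR description, labeled with the confidence level.
function describeFix(fix: ProposedFix): string {
  const level = classifyConfidence(fix);
  return `[confidence: ${level}] ${fix.testFile}: ${fix.oldLocator} -> ${fix.newLocator}`;
}
```

Defaulting to human review whenever either condition fails keeps the automatic path deliberately narrow.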
Since deploying this, our Tuesday morning failure count dropped from an average of fourteen broken tests per sprint to under two. The two that still break are almost always genuine regressions (actual bugs) rather than test maintenance issues. That is exactly what you want.
Step 3: AI-Generated Test Cases From User Stories
The third layer is the one that surprised me most with its impact: using AI to write new tests directly from Jira tickets.
Our workflow now looks like this. When a user story is marked "QA Ready," a Jira automation triggers a webhook that sends the story description and acceptance criteria to an agent. The agent reads the criteria, looks up existing tests for the affected component, and generates a Playwright test file covering the happy path and two to three edge cases.
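As a rough sketch of what the agent receives, the snippet below turns a story into a generation prompt. The `UserStory` shape, the path-based lookup of existing tests, and the function names are all assumptions; the real Jira webhook payload and the agent's lookup logic are not described here.

```typescript
// Hypothetical shape: a simplified view of the webhook payload.
interface UserStory {
  key: string;
  component: string;
  description: string;
  acceptanceCriteria: string[];
}

// Naive lookup (an assumption): treat any test file whose path mentions
// the component name as an existing test for that component.
function findExistingTests(component: string, testFiles: string[]): string[] {
  const needle = component.toLowerCase();
  return testFiles.filter((path) => path.toLowerCase().includes(needle));
}

// Build the prompt: story text, numbered criteria, and existing tests.
function buildGenerationPrompt(story: UserStory, existingTests: string[]): string {
  const criteria = story.acceptanceCriteria
    .map((criterion, index) => `${index + 1}. ${criterion}`)
    .join("\n");
  const existing =
    existingTests.length > 0 ? existingTests.join("\n") : "(none found)";
  return [
    `Write a Playwright test file for story ${story.key}.`,
    "Cover the happy path and two to three edge cases.",
    `Description:\n${story.description}`,
    `Acceptance criteria:\n${criteria}`,
    `Existing tests for this component:\n${existing}`,
  ].join("\n\n");
}
```

Numbering the acceptance criteria gives the reviewer an easy way to check that each one maps to at least one generated test.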
An engineer reviews the generated tests before they merge. In practice, the review takes five to ten minutes rather than forty-five. We are adding roughly 30% more test coverage per sprint with the same headcount.
What I Would Do Differently
If I were rebuilding this from scratch, I would start with the failure triage layer. It delivers value immediately, requires almost no infrastructure, and builds team trust in AI-assisted testing before you introduce more autonomous components.
I would also invest earlier in good prompt design. The quality of AI-generated fixes is directly proportional to how precisely you describe the context. Sending raw error messages gets mediocre results. Sending structured context — error, DOM state, test intent — gets excellent ones.
Finally: do not skip the human review gate. Self-healing tests that update themselves without any oversight will eventually introduce false passes into your suite: tests that have been "healed" to match a regression and now pass when they should fail. The goal is to make the right thing easy for engineers, not to remove engineers from the loop entirely.
The Bottom Line
Test maintenance is not going away. UIs will keep changing, and suites will keep drifting. But the amount of human time that work requires is now genuinely negotiable, and in my experience it comes to about eight hours less per week than it used to.
If you are running Playwright and spending meaningful engineering time on maintenance, I would strongly encourage you to start with failure triage automation. It is the highest-leverage change I have made to my QA workflow in years, and it took less than a day to build the first version.
Fight On — and may your locators always find their targets. ✌️
Suneet Malhotra is a Sr. Manager of Test Engineering at Motorola Solutions and an AI-driven QA automation specialist with 20+ years of experience. Follow him for weekly insights on QA engineering, AI testing, and building better software.