I Replaced 500 Flaky Selectors With AI Self-Healing Tests — Here's What Happened
Suneet Malhotra
Mar 3, 2026
The Night I Almost Quit Automation
It was 11 PM on a Thursday, and I was staring at 47 failed tests in our CI pipeline. Not because anything was actually broken — a frontend dev had renamed a few CSS classes during a refactor. Sound familiar?
After 20 years in QA engineering, I've watched this cycle repeat endlessly: write precise selectors, UI changes, selectors break, spend hours fixing them, repeat. In 2026, with AI agents building entire workflows autonomously, why are we still hardcoding data-testid="submit-btn-v3" and praying nobody touches it?
I decided to fix this once and for all. Here's exactly how I did it — and the surprising results.
The Problem: Brittle Selectors Are a Tax on Velocity
Every QA team knows the pain. You maintain hundreds (sometimes thousands) of CSS selectors, XPaths, and test IDs across your automation suite. One innocent UI refactor, and suddenly your pipeline is red.
In my current role at Motorola Solutions, I tracked the data: roughly 30% of our test failures weren't real bugs; they were broken selectors. That's not a testing problem. That's a maintenance tax that slows down every sprint.
The Solution: AI-Powered Semantic Element Discovery
Instead of hardcoding selectors, I built a system where an AI agent describes what it's looking for in natural language, and a local LLM figures out how to find it on the current page.
Here's the stack:
- Playwright for browser automation
- Model Context Protocol (MCP) as the communication layer
- Ollama running a local LLM (gpt-oss:20b) for reasoning
- TypeScript gluing it all together
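To make the Ollama piece of the stack concrete, here is a minimal sketch of the HTTP request a TypeScript process could send to a local Ollama server. It is an illustration under assumptions, not the setup from this post: the helper names (`buildLocatorPrompt`, `buildOllamaRequest`), the prompt wording, and the JSON reply shape are hypothetical, and it talks to Ollama directly rather than through MCP. The `/api/generate` endpoint, the default port 11434, and the `stream` and `format` fields come from Ollama's REST API.

```typescript
// Sketch only: builds (but does not send) a request for Ollama's /api/generate endpoint.
// The prompt wording and the JSON reply shape are assumptions made for illustration.

interface OllamaRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Combine the natural-language intent and the page snapshot into one prompt.
function buildLocatorPrompt(intent: string, snapshot: string): string {
  return [
    "You are helping a browser test find one element.",
    `Intent: ${intent}`,
    "Page snapshot (accessibility tree):",
    snapshot,
    'Reply with JSON only: {"kind": "role" | "label" | "testId" | "css", "value": string, "name"?: string}',
  ].join("\n");
}

function buildOllamaRequest(
  intent: string,
  snapshot: string,
  model = "gpt-oss:20b", // the model tag named in the stack above
  baseUrl = "http://localhost:11434", // Ollama's default local address
): OllamaRequest {
  return {
    url: `${baseUrl}/api/generate`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model,
        prompt: buildLocatorPrompt(intent, snapshot),
        stream: false, // one complete reply instead of streamed chunks
        format: "json", // ask Ollama to constrain the reply to valid JSON
      }),
    },
  };
}
```

The returned `url` and `init` can be passed straight to `fetch(url, init)`. With `stream: false`, the generated text comes back in the `response` field of the JSON reply.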
The flow works like this:
- A test says: "Find the primary submit button in the checkout form"
- The agent takes a page snapshot (accessibility tree + DOM)
- The LLM analyzes the snapshot and generates a locator strategy
- If the locator fails, the agent automatically retries — taking a fresh snapshot and re-analyzing
No hardcoded selectors. No brittle XPaths. The AI adapts to whatever the UI looks like right now.
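The four steps above can be sketched as a small retry loop. This is a sketch under stated assumptions rather than the actual agent: `DiscoveryDeps` and its three functions are hypothetical stand-ins for the Playwright snapshot, the LLM call, and a visibility check, and the cap of three attempts is an assumed default.

```typescript
// Hypothetical dependency shape. In a real suite these would wrap Playwright and the LLM.
interface DiscoveryDeps {
  takeSnapshot: () => Promise<string>; // accessibility tree + DOM, serialized as text
  proposeSelector: (intent: string, snapshot: string) => Promise<string>; // LLM call
  isVisible: (selector: string) => Promise<boolean>; // does the selector match a visible element?
}

interface DiscoveryResult {
  selector: string | null; // null when every attempt failed
  attempts: number; // how many snapshot/LLM rounds were used
}

async function discoverElement(
  intent: string,
  deps: DiscoveryDeps,
  maxAttempts = 3, // assumed cap, not a figure from the flow above
): Promise<DiscoveryResult> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Take a fresh snapshot on every attempt so the LLM sees the page as it is now.
    const snapshot = await deps.takeSnapshot();
    const selector = await deps.proposeSelector(intent, snapshot);
    if (await deps.isVisible(selector)) {
      return { selector, attempts: attempt };
    }
  }
  return { selector: null, attempts: maxAttempts };
}
```

Passing the dependencies in as functions keeps the loop testable without a browser or a model: stubs can simulate a stale first answer and a correct second one.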
The Migration: 500 Selectors in 2 Weeks
I started with our most flaky test suite — the e-commerce checkout flow. Here's what the migration looked like:
Before (brittle):
await page.click('[data-testid="checkout-submit-btn"]');
await page.fill('#shipping-address-line-1', address);
After (self-healing):
await agent.action("Click the primary submit button on the checkout page");
await agent.action("Fill the first shipping address line with: " + address);
Over two weeks, I migrated 500+ selectors across 120 test files. The LLM handles the element discovery, and if the UI changes, it simply re-discovers the element on the next run.
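One way an intent like the ones above could end up as a real Playwright locator: the LLM replies with a small JSON strategy, and the test runner validates it and maps it to a locator call. The sketch below assumes that design. `LocatorStrategy`, `parseStrategy`, and `toLocator` are hypothetical names, and `LocatorSource` is a hand-written interface modelled on Playwright's `getByRole`, `getByLabel`, `getByTestId`, and `locator` methods so the example runs without a browser.

```typescript
// Hypothetical JSON contract between the LLM and the test runner.
type LocatorStrategy =
  | { kind: "role"; value: string; name?: string }
  | { kind: "label"; value: string }
  | { kind: "testId"; value: string }
  | { kind: "css"; value: string };

// Minimal slice of a page object, modelled on Playwright's locator methods.
interface LocatorSource<L> {
  getByRole(role: string, options?: { name?: string }): L;
  getByLabel(text: string): L;
  getByTestId(testId: string): L;
  locator(selector: string): L;
}

// Validate the raw LLM reply instead of trusting it. Returns null on anything unexpected.
function parseStrategy(reply: string): LocatorStrategy | null {
  try {
    const data = JSON.parse(reply);
    const kinds = ["role", "label", "testId", "css"];
    const nameOk = data && (data.name === undefined || typeof data.name === "string");
    if (data && kinds.indexOf(data.kind) !== -1 && typeof data.value === "string" && nameOk) {
      return data as LocatorStrategy;
    }
  } catch {
    // The model returned something that is not JSON; treat it as "no strategy".
  }
  return null;
}

// Map a validated strategy onto the matching locator method.
function toLocator<L>(page: LocatorSource<L>, strategy: LocatorStrategy): L {
  switch (strategy.kind) {
    case "role":
      return strategy.name === undefined
        ? page.getByRole(strategy.value)
        : page.getByRole(strategy.value, { name: strategy.name });
    case "label":
      return page.getByLabel(strategy.value);
    case "testId":
      return page.getByTestId(strategy.value);
    case "css":
      return page.locator(strategy.value);
  }
}
```

Preferring role and label strategies over raw CSS keeps the generated locators tied to what the user sees, which is the property that lets them survive a class rename.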
The Results: 70% Fewer False Failures
After running the self-healing suite alongside our traditional tests for 30 days:
- False failures dropped roughly 70%: from ~15 per week to ~4
- Maintenance hours fell by 60% — engineers stopped playing "fix the selector" every sprint
- Test authoring got faster — writing natural language intent is quicker than crafting precise selectors
- Slight speed tradeoff — AI-powered runs were ~20% slower due to LLM inference, but running Ollama locally kept latency manageable
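As a quick check on the first bullet, the relative drop can be computed from the two weekly counts. `relativeReduction` is a hypothetical helper, and the inputs are the approximate figures quoted above.

```typescript
// Relative reduction between a "before" and an "after" count, as a fraction of "before".
function relativeReduction(before: number, after: number): number {
  if (before <= 0) throw new RangeError("before must be positive");
  return (before - after) / before;
}

// Approximate weekly false-failure counts from the list above.
const falseFailureDrop = relativeReduction(15, 4); // (15 - 4) / 15 ≈ 0.733
```

With ~15 and ~4 as inputs, this comes out near 73%, which is consistent with the rounded 70% figure.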
The biggest win? Developer confidence in the test suite went up. When tests fail now, people actually investigate because they trust it's a real bug, not a stale selector.
What Google's Opal Announcement Means for QA
Last week, Google announced an AI agent for building automated workflows in Opal, powered by Gemini 3 Flash. It lets non-technical users build workflows with text prompts. This is the same direction QA is heading: natural language as the interface for automation.
The convergence is clear. In 2026, the best QA engineers aren't the ones writing the most precise XPaths — they're the ones designing systems where AI handles the fragile parts and humans focus on test strategy.
How to Start Self-Healing Your Own Suite
If you want to experiment, here's my recommendation:
- Start small — pick your flakiest 20 tests and migrate those first
- Use local LLMs — Ollama gives you privacy and zero API costs
- Keep fallbacks — if the AI can't find an element after 3 retries, fall back to a traditional selector
- Measure everything — track false failure rates before and after
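The fallback and measurement recommendations can be combined in one small wrapper. This is a minimal sketch, assuming the AI discovery step is available as a function that returns a selector or `null`. `resolveWithFallback`, `tally`, and the `aiLocate` parameter are hypothetical names, and "3 retries" is interpreted here as three AI attempts in total.

```typescript
type Source = "ai" | "fallback";

interface Resolution {
  selector: string;
  source: Source; // which path produced the selector, kept for metrics
}

async function resolveWithFallback(
  intent: string,
  aiLocate: (intent: string) => Promise<string | null>, // hypothetical AI discovery call
  fallbackSelector: string, // the traditional selector kept as a safety net
  maxRetries = 3,
): Promise<Resolution> {
  for (let i = 0; i < maxRetries; i++) {
    const selector = await aiLocate(intent);
    if (selector !== null) return { selector, source: "ai" };
  }
  // Every AI attempt failed: use the hand-written selector instead.
  return { selector: fallbackSelector, source: "fallback" };
}

// Count how often each path was used, so the fallback rate can be tracked over time.
function tally(resolutions: Resolution[]): Record<Source, number> {
  const counts: Record<Source, number> = { ai: 0, fallback: 0 };
  for (const r of resolutions) counts[r.source] += 1;
  return counts;
}
```

Recording the `source` on every resolution makes a rising fallback count visible early, before it shows up as a red pipeline.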
I've open-sourced my Playwright Agent + MCP + Ollama setup if you want to see the full implementation.
The Bottom Line
Flaky selectors are a solved problem in 2026 — we just need to stop solving it the 2015 way. AI self-healing tests aren't science fiction. They're running in my CI pipeline right now, catching real bugs instead of crying wolf about renamed CSS classes.
The future of QA automation isn't writing better selectors. It's not writing selectors at all.
Fight On! ✌️