
How I use AI to write tests (without making a mess)

I ask AI to write a bunch of tests, then delete most of them. Sounds weird, but it works way better than trying to guide AI from the start.


TL;DR: Ask the agent to write broad tests, then review and keep only the ones that catch likely production issues. Result: fewer tests, clearer signal, lower maintenance.

The Problem with Traditional Test Writing

When I added tests for analytics in a feature, I needed to be sure every viewed event was wired correctly without spending a weekend on boilerplate. Tools like Copilot or Claude can generate lots of tests, but without guidance they add noise: checks of internals, repeated cases for the same branch, and fragile mocks.

What works for me is a simple two‑step approach: ask the agent to write broad coverage, then keep only the few tests that catch real problems.

Phase 1: Generate Comprehensive Tests

The Initial Prompt

Create a draft PR with tests for all analytics `viewed` events in the feature to prevent accidental breaks.

Why this works: I don’t constrain the AI here. I ask it to search the codebase and find every event, every prop combination, and every edge case. The first pass is discovery.

The agent explored the codebase and produced tests for two places where we track viewed events:

  • ViewTracker (12 tests): reusable page‑view tracker (rough sketch below)
  • TotalsPanel (15 tests): transaction fee display tracking
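
To make the test examples concrete, here is a rough, hypothetical sketch of what a reusable page‑view tracker like ViewTracker could look like. The prop names, the `track` helper, and the `manager` path check are inferred from the generated test names in the next section, not taken from the real component.

// Hypothetical sketch of a reusable page-view tracker, for illustration only
import { useEffect } from 'react';
import { track } from './analytics'; // assumed analytics client

type ViewTrackerProps = {
  pageName: string;
  appName: string;
  tenantId?: string;
};

export function ViewTracker({ pageName, appName, tenantId }: ViewTrackerProps) {
  useEffect(() => {
    // Skip internal admin routes, mirroring the "manager" check the tests cover
    if (window.location.pathname.includes('manager')) return;
    track('viewed', { pageName, appName, tenantId, url: window.location.href });
  }, [pageName, appName, tenantId]);

  return null;
}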

What the AI Generated

Representative cases for ViewTracker:

it('tracks viewed with correct payload on mount', () => { /* ... */ })
it('skips tracking when pathname includes "manager"', () => { /* ... */ })
it('handles optional tenantId correctly', () => { /* ... */ })
it('re-tracks when pageName changes', () => { /* ... */ })
// + variations for component/app/page names, and URL payload fields
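
To show what those stubs expand to, here is the first one filled in. It's a sketch assuming Jest, React Testing Library, and an analytics client mocked at the module boundary; the import paths and payload fields are placeholders, not the project's real API.

import { render } from '@testing-library/react';
import { ViewTracker } from './ViewTracker';
import { track } from './analytics';

// Mock the analytics client so the test asserts behavior, not network calls
jest.mock('./analytics', () => ({ track: jest.fn() }));

it('tracks viewed with correct payload on mount', () => {
  render(<ViewTracker pageName="checkout" appName="payments" />);

  // Assert only the fields the data pipeline depends on
  expect(track).toHaveBeenCalledTimes(1);
  expect(track).toHaveBeenCalledWith(
    'viewed',
    expect.objectContaining({ pageName: 'checkout', appName: 'payments' }),
  );
});

The expect.objectContaining is deliberate: asserting only the fields that matter is what keeps a test like this out of the “overly specific mock assertions” pile later.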

And for TotalsPanel:

it('tracks fee_displayed with correct payload', () => { /* ... */ })
it('does not track when the feature is disabled', () => { /* ... */ })
it('does not track when the fee is null', () => { /* ... */ })
it('tracks only once on initial render', () => { /* ... */ })
it('does not emit unrelated viewed events', () => { /* ... */ })
// + fee amount variants, membership permutations, etc.
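
And one of the conditional TotalsPanel cases, with the same caveat: the props and flag handling here are stand-ins for whatever the real component takes.

import { render } from '@testing-library/react';
import { TotalsPanel } from './TotalsPanel';
import { track } from './analytics';

jest.mock('./analytics', () => ({ track: jest.fn() }));

it('does not track when the fee is null', () => {
  // No fee on screen means no fee_displayed event should fire
  render(<TotalsPanel fee={null} featureEnabled />);

  expect(track).not.toHaveBeenCalledWith('fee_displayed', expect.anything());
});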

Phase 2: Ruthless Filtering

Now comes the important part: I reviewed each test and asked, “Would this catch a real production bug?” Tests that only verified implementation details or checked obvious cases got removed.

Result: From 27 tests down to 8 high-signal tests. Each one protects against a real failure mode.

What I Kept

  • Tests that verify the analytics payload structure (breaking these would cause data pipeline failures)
  • Tests for conditional tracking logic (feature flags, null checks)
  • Tests that prevent duplicate events
  • Edge cases with real user impact

What I Removed

  • Tests that checked internal state instead of behavior
  • Redundant variations of the same logic path
  • Tests for TypeScript-enforced requirements
  • Overly specific mock assertions
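
For contrast, here is the shape of test that got cut (hypothetical, but representative; the extra payload fields are invented for illustration). Unlike the payload tests I kept, which assert only the fields the pipeline depends on, this one pins the mock’s entire call array, so it fails whenever an unrelated field changes or a second legitimate event is added.

import { render } from '@testing-library/react';
import { TotalsPanel } from './TotalsPanel';
import { track } from './analytics';

jest.mock('./analytics', () => ({ track: jest.fn() }));

// Removed: couples the test to the exact payload shape and call count
it('tracks fee_displayed with the exact full payload and nothing else', () => {
  render(<TotalsPanel fee={12.5} featureEnabled />);

  expect((track as jest.Mock).mock.calls).toEqual([
    ['fee_displayed', { fee: 12.5, currency: 'USD', source: 'TotalsPanel', variant: 'default' }],
  ]);
});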

Why This Works

AI agents are excellent at exhaustive generation but poor at prioritization. By splitting the process:

  1. Phase 1 (AI): Generate comprehensive coverage without overthinking
  2. Phase 2 (Human): Apply judgment about production risks and maintenance burden

The result is a lean test suite that actually protects the codebase without becoming a maintenance liability.

“The best code is no code. The second best is code that clearly prevents real problems.”

Takeaways

  • Let AI generate broad test coverage as a starting point
  • Review ruthlessly: keep only tests that prevent real production issues
  • Fewer, better tests beat comprehensive, noisy coverage
  • This approach works best when you have clear production failure modes in mind

Questions or feedback? This is an experiment in practical AI-assisted development.