Why Claude Code Needs More Than Just Your Words: A Case Study in Story Refinement
By Mike Holloway
The 3-Sentence Story Showcasing Lazy Prompting
Here's a real story description I wrote for one of my own projects:
"We need to create a prototype web page that can be launched to test the consumer chat functionality outside of the product_intel.io site. So this page is basically a representation of a client implementation of our consumer chat. And we need a testing page to see the behaviors and everything else. It doesn't need to be very fancy or anything else, but it should include a basic chat input box and chat window and chat responses so that we can test the true behavior and output of the consumer chat."
Three sentences. Somewhat clear intent. A human developer would read this, ask a few questions, make some assumptions, and start building. Granted, this is not the quality of story you would put into Jira for your development team, but recent models have become so good at appearing to interpret your intent that it's easy to slip into giving your coding agent this kind of prompt. Especially when your coding sessions carry a lot of memories and "context" about what you've been building, it really can enable laziness, because everything seems to be working...
An AI agent would do the same, except it wouldn't ask the questions. It would make the assumptions silently, start building immediately, and produce something that technically works but may not be what you needed.
Now here's what that same story looks like after 30 seconds of AI-powered spec enrichment:
- 6 files to create, 2 to modify, 3 to reuse (don't recreate)
- 6 acceptance criteria, each with data source, display behavior, and error state
- 8 negative constraints ("Do NOT build a full chat UI library")
- 5 error handling scenarios with exact user-facing messages
- 10 unresolved questions the AI identified but couldn't answer alone
The enriched spec is 400+ lines. The original was 3 sentences. The difference between the two is the difference between an AI agent that builds what you meant and one that builds what it assumed.
The Hidden Cost of Vague Stories
The problem with vague stories isn't that they produce bad code. The code works. It compiles, it renders, it does what was asked. The problem is that vague stories create an implicit permission to make assumptions, and every assumption is a coin flip that compounds over time.
I learned this the hard way building ProductIntel, an AI-native operations platform with 21 modules, 85+ database tables, and 55+ pages. Over six months of development, across hundreds of iterations and hundreds of conversations with AI, we accumulated technical debt that was invisible until it wasn't.
The Architecture That Outgrew Itself
When we started building, the implicit requirement was "build pages that show data." Simple enough. Every page used client-side fetch() calls to load data from API routes. It was the simplest pattern and it worked perfectly, when there were 5 pages.
By Release 6, we had 55+ pages making 63 client-side fetch calls. The authentication system used server-side cookies that raced with client-side fetches, causing silent 401 failures. Pages with AI features made 3-6 parallel LLM calls that took 10-20 seconds. Users would navigate between tabs and watch loading spinners while the same data was re-fetched from scratch.
The fix took three full sessions across four days: migrating 65 fetch calls across 4 tiers, converting 40 to server-side data loading, wrapping 15 with auth-aware utilities, converting 25 mutations to server actions, removing 7 dead API routes, and adding 9 error boundaries.
One line in an early story spec would have prevented all of it:
Negative Constraint: Do NOT use client-side fetch() for initial
page data loads. Use server components with direct service calls.
That's what a refined story catches. Not a bug, but a decision that should have been made explicitly instead of assumed implicitly.
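Stripped of framework specifics, the two patterns differ in where the data resolves: in the browser after render, or on the server before it. Here's a minimal sketch of that contrast; every name in it (`Page`, `listRows`, `/api/rows`) is invented for illustration, not ProductIntel's real API:

```typescript
type Page = { title: string; rows: string[] };

// Anti-pattern: the page renders first, then fetches from an API route in the
// browser. This is the pattern that raced with server-side auth cookies and
// produced silent 401s and loading spinners.
async function clientSideLoad(
  fetchJson: (url: string) => Promise<string[]>
): Promise<Page> {
  const rows = await fetchJson('/api/rows'); // runs in the browser, after render
  return { title: 'Rows', rows };
}

// Preferred: the server calls the service layer directly. No HTTP round trip,
// no cookie race; the data arrives with the initial render.
async function serverSideLoad(service: {
  listRows: () => Promise<string[]>;
}): Promise<Page> {
  const rows = await service.listRows(); // direct service call, full auth context
  return { title: 'Rows', rows };
}
```

In Next.js terms, the second pattern is a server component awaiting a service call; the point is that the choice between the two is an architectural decision, not an implementation detail the agent should make for you.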
The 170-Document System Prompt
We built a Knowledge Chat feature. The story was, in effect: "Add an AI chat to the knowledge base that can answer questions about our docs." Simple, right?
The implementation loaded all documents from the database and stuffed their summaries into the LLM's system prompt. With 9 product docs, this worked great. Then our Discovery Agent started running, generating findings stored as document-type artifacts. Each run created 5+ new documents. We ran Discovery dozens of times during testing.
One day, the Knowledge Chat started giving vague, generic answers. Response times increased. Token costs spiked. We didn't know why, until we built an Inference Inspector and looked at the actual prompt being sent to the model.
170+ documents. 51,000+ characters. Most of them duplicate discovery outputs that had nothing to do with the knowledge base.
The model was drowning in noise. The actual product documentation, the stuff the user was asking about, was buried under a mountain of AI-generated artifacts.
The fix was a one-line filter:
const docs = allDocs.filter(d => d.status === 'published' && !d.title.startsWith('Discovery:'))
But here's the thing. The original code wasn't wrong. getDocuments() returned all documents. That's what it was supposed to do. Nobody specified "only include published product documentation, not AI-generated artifacts." So the agent included everything. Technically correct. Practically disastrous.
A refined story would have specified:
Data Source: getDocuments() filtered to status='published' and
type NOT IN ('discovery-finding', 'discovery-output')
Negative Constraint: Do NOT load all documents into the system
prompt. Discovery artifacts are NOT knowledge base content.
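That spec translates almost line-for-line into code. A sketch of the filter it implies, where the `Doc` shape, the artifact type names, and the `Discovery:` title prefix are assumptions drawn from this article's examples rather than ProductIntel's real schema:

```typescript
type Doc = { title: string; status: string; type: string };

// Artifact types to exclude from the knowledge base prompt.
// These type names are illustrative, taken from the spec above.
const EXCLUDED_TYPES = new Set(['discovery-finding', 'discovery-output']);

// Keep only published product documentation; drop AI-generated
// discovery artifacts so they never reach the system prompt.
function knowledgeBaseDocs(allDocs: Doc[]): Doc[] {
  return allDocs.filter(
    (d) =>
      d.status === 'published' &&
      !EXCLUDED_TYPES.has(d.type) &&
      !d.title.startsWith('Discovery:')
  );
}
```

The filter itself is trivial. What the refined story adds is the decision that it should exist at all, made before the prompt ballooned to 170 documents instead of after.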
The Vector Search That Couldn't Find Its Own Documents
We embedded all our product documentation for semantic search. A user asks "What's the difference between Archive API v2 and v3?" and the system should find the v2 and v3 API specs and answer from them.
Instead, the top result was the Preview Manager documentation at 69% relevance. The actual Archive API v3 spec, the document literally about what the user asked, scored 25%.
The root cause: we embedded entire documents as single vectors. A 5,000-word API spec gets compressed into one array of 1,536 numbers, a "semantic fingerprint" of the entire document's average meaning. The v3 spec covers webhooks, batch operations, customer portals, and statistics. "Version differences" is a tiny fraction of what the document is about, so the embedding barely matches.
The original approach was: "Embed the documents so we can do semantic search."
Nobody questioned whether "one embedding per document" was the right granularity. Nobody specified chunking. Nobody defined what "good retrieval" looks like. So the agent did the simplest thing and it worked, until it didn't.
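Chunking is not exotic; even a naive fixed-window splitter changes retrieval behavior dramatically, because each chunk's embedding captures one topic instead of the document's average meaning. A sketch, where the word counts roughly approximate a 300-500 token range and both the window and overlap sizes are illustrative, not tuned values:

```typescript
// Split a long document into overlapping word windows before embedding.
// Overlap keeps sentences that straddle a boundary retrievable from
// either side of it.
function chunkDocument(
  text: string,
  chunkWords = 300,
  overlapWords = 50
): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += chunkWords - overlapWords) {
    chunks.push(words.slice(start, start + chunkWords).join(' '));
    if (start + chunkWords >= words.length) break; // last window reached the end
  }
  return chunks;
}
```

With this in place, "version differences between v2 and v3" matches the specific chunk that discusses versioning instead of competing against webhooks, batch operations, and customer portals for a single vector.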
What Good Refinement Actually Looks Like
Story refinement isn't about writing a novel. It's about answering the questions that an AI agent would otherwise answer silently and wrong.
The enriched spec we generated has several sections that prevent the exact problems we lived through:
Negative Constraints: What NOT to Build
This is arguably the most valuable section:
- Do NOT build a full chat UI library, use basic HTML/Tailwind components only
- Do NOT create a separate database table, reuse pb_artifacts
- Do NOT hardcode model names, always use resolveModel()
- Do NOT implement message search, filtering, or export features
Without these, an agent will build more than you need. It will create a component library for a test page. It will design a schema for a throwaway feature. It will add search functionality nobody asked for. Each of these adds complexity, increases maintenance surface, and burns tokens.
Negative constraints are the guardrails that keep an agent focused on the 20% of the work that delivers 80% of the value.
Unresolved Questions: Where the AI Needs You
The enrichment produced 10 questions it couldn't answer:
- Should chat history persist across browser sessions?
- What is the maximum message length allowed?
- Should the test page require authentication?
- What is the expected response time for assistant messages?
This is the AI saying: "Here are the decisions I'd have to guess about if you don't tell me. Do you want me guessing, or do you want to decide?"
In our experience, the unresolved questions are where the real architectural decisions hide. "Should chat history persist?" is actually a question about storage strategy, session management, and data lifecycle. An agent that guesses "yes" might build a full persistence layer for a throwaway test page.
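For a throwaway test page, "yes, but only within the browser session" can be the entire storage strategy: a few lines against a Storage-shaped interface instead of a persistence layer. A sketch with invented names (the `ChatMessage` shape, the storage key, and the interface are all illustrative):

```typescript
type ChatMessage = { role: 'user' | 'assistant'; text: string };

// Minimal Storage-shaped interface so this works against
// window.sessionStorage in the browser or a plain stub in tests.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const HISTORY_KEY = 'test-chat-history'; // illustrative key

function saveHistory(store: KeyValueStore, history: ChatMessage[]): void {
  store.setItem(HISTORY_KEY, JSON.stringify(history));
}

function loadHistory(store: KeyValueStore): ChatMessage[] {
  const raw = store.getItem(HISTORY_KEY);
  return raw ? (JSON.parse(raw) as ChatMessage[]) : [];
}
```

Passing `window.sessionStorage` gives tab-scoped persistence for free. The point is the size of the gap: this is what "yes, within the session" costs, versus the database table, schema, and lifecycle logic an agent might build if it silently guesses "yes, forever."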
Files to Reuse: Don't Reinvent the Wheel
The enrichment identifies existing code that should be reused:
- context-engine.ts: Use for enriching chat context
- schema.ts: Reference pb_artifacts table definition
- models.ts: Use resolveModel() for LLM selection
Without this, agents commonly rebuild utilities that already exist in the codebase. We've seen it happen. A new feature creates its own database helper when a perfectly good one exists two directories over. The enrichment prevents this by explicitly naming what already exists.
The Honest Truth About AI Spec Enrichment
The enriched spec isn't perfect. It hallucinated some implementation details, like suggesting the wrong table for message storage because it didn't have full awareness of how the consumer chat system actually works. It made assumptions about architecture that a human with context would challenge.
But imperfect refinement is 10x better than no refinement.
The goal of spec enrichment isn't to produce a perfect blueprint. It's to:
- Surface decisions that need to be made before code is written
- Establish constraints that prevent scope creep and over-engineering
- Identify reusable code so the agent doesn't reinvent existing patterns
- Define error handling so failures are graceful, not silent
- Give the agent enough structure to build 80% correctly on the first attempt instead of 20%
The remaining 20%, the hallucinated details and wrong assumptions, is where human judgment matters. I think an AI PM's job isn't to accept the enrichment as-is. It's to review it, catch the mistakes, answer the unresolved questions, and then hand a genuinely solid spec to the agent.
The Pattern Behind the Pattern
Looking back at every architectural problem we encountered, from the client-side fetch migration to the system prompt bloat to the embedding granularity to the dozens of smaller issues along the way, they all share the same root cause:
A decision that should have been made explicitly was instead made implicitly by the agent.
| What I said | What the agent assumed | What should have been specified |
|---|---|---|
| "Add a page that shows data" | Client-side fetch | Server component with direct service calls |
| "Add AI chat to the knowledge base" | Load all docs into prompt | Filter by status, exclude AI-generated artifacts |
| "Embed the documents for search" | One embedding per document | Chunk at 300-500 tokens, embed each chunk |
| "Show discovery findings on the product page" | Show all findings | Scope to product, filter by confidence, limit count |
Every row in that table is a story that wasn't refined. Every row cost hours to fix after the fact.
The Practical Takeaway
If you're building with AI agents, whether through Claude Code, Cursor, Copilot, or any other tool, the highest-leverage investment you can make isn't a better model or a bigger context window. It's better input.
Spend 30 seconds running your story through spec enrichment before the first line of code is written. Review the negative constraints. Answer the unresolved questions. Verify the file paths and data sources.
The 30 seconds the LLM spends enriching saves hours of migration later. And the unresolved questions it surfaces? Those are your highest-risk assumptions, the ones that compound silently until they become the wall your architecture hits at scale.
Story refinement isn't about being pedantic. It's about being intentional. And in an AI-first development world, intentionality is the difference between building something that works and building something that lasts.
Mike Holloway is a VP of Product Management at a Fortune 500 FinTech with 25+ years of enterprise product and technology leadership. He writes about the intersection of product management, AI engineering, and enterprise software. See his AI skills portfolio at mikeholloway.dev/ai-skills.
This is an early preview. I'd rather get the ideas out while the conversation is happening than wait for perfectly polished prose. Further editing and proofing will come in time.
A note on how this was written: This article was developed through collaborative sessions with Claude, drawing on real learnings from building ProductIntel over the past six months. The experiences, the specific technical examples, and the lessons are mine. The organization and drafting was a human-AI partnership, which feels appropriate given the article is about how to work with AI more effectively.