Most prompt engineering content is useless. "Be specific" and "provide context" don't help when you need 99.9% reliability at 2 million requests per month. This is what actually works in production.
These are real prompts running in production systems: the patterns that survived A/B testing, edge cases, and scale.
Pattern 1: The Production System Prompt
A good system prompt does three things: defines the role, sets constraints, and specifies output format. Here's our lead qualification prompt:
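The exact production prompt isn't reproduced here, but a minimal sketch of the shape (role, constraints, output format) might look like this. The wording is illustrative, not the real prompt:

```typescript
// Hypothetical lead-qualification system prompt following the
// role / constraints / output-format structure described above.
export const SYSTEM_PROMPT = `You are a lead qualification engine for a real estate company.

Constraints:
- Output JSON only. No prose, no markdown fences.
- Never invent data. Use null for any field not present in the input.
- If the input is unparseable, output {"error":"unparseable","raw":"<input>"}.

Output format:
{"name": string|null, "phone": string|null, "email": string|null,
 "property_address": string|null,
 "motivation": "high"|"medium"|"low"|"unknown",
 "timeline": string|null, "confidence": number}`;
```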
Pattern 2: Few-Shot Examples
Examples are worth 1000 words of instructions. Include 2-3 examples covering common cases and edge cases:
const fewShotExamples = [
  {
    role: "user",
    content: "hi im john smith at 123 main st need to sell fast my number is 5551234567"
  },
  {
    role: "assistant",
    content: JSON.stringify({
      name: "John Smith",
      phone: "5551234567",
      email: null,
      property_address: "123 Main St",
      motivation: "high",
      timeline: "immediate",
      confidence: 0.85
    })
  },
  {
    role: "user",
    content: "asdf keyboard smash 12345"
  },
  {
    role: "assistant",
    content: JSON.stringify({
      error: "unparseable",
      raw: "asdf keyboard smash 12345"
    })
  }
];
Critical: Always Include Edge Cases
Your few-shot examples should include at least one "bad input" example. Without it, the model will try to extract data from garbage, leading to hallucinated outputs that pollute your database.
Pattern 3: Structured Output Enforcement
JSON mode isn't enough. Validate and coerce outputs to your schema:
import { z } from 'zod';

const LeadSchema = z.object({
  name: z.string().nullable(),
  phone: z.string().regex(/^\d{10}$/).nullable(),
  email: z.string().email().nullable(),
  property_address: z.string().nullable(),
  motivation: z.enum(['high', 'medium', 'low', 'unknown']),
  timeline: z.string().nullable(),
  confidence: z.number().min(0).max(1)
});
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

type Lead = z.infer<typeof LeadSchema>;

async function qualifyLead(input: string): Promise<Lead> {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 500,
    system: SYSTEM_PROMPT,
    messages: [...fewShotExamples, { role: 'user', content: input }]
  });
  // Content blocks are typed; guard before reading .text.
  const block = response.content[0];
  const text = block.type === 'text' ? block.text : '';
  try {
    const parsed = JSON.parse(text);
    return LeadSchema.parse(parsed);
  } catch (e) {
    // Log the raw failure so the prompt can be iterated on later.
    await logPromptFailure(input, text, e);
    throw new Error('Output validation failed');
  }
}
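One failure mode worth handling before the `JSON.parse` step: models sometimes wrap their output in markdown code fences even when told not to. A small defensive stripper (a hypothetical helper, not part of the original code):

```typescript
// Strip an optional ```json fence that models sometimes emit around output,
// so JSON.parse sees only the payload. Fences are matched as `{3}.
function extractJson(text: string): string {
  return text
    .replace(/^\s*`{3}(?:json)?/, '')  // leading fence, with optional language tag
    .replace(/`{3}\s*$/, '')           // trailing fence
    .trim();
}
```

Calling `extractJson(text)` before `JSON.parse(text)` is cheap insurance; plain JSON passes through unchanged.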
Pattern 4: Token Optimization
Tokens cost money. At scale, every word matters:
| Technique | Token Savings | Trade-off |
|---|---|---|
| Abbreviate instructions | 20-30% | Slight accuracy drop |
| Remove verbose examples | 40-50% | Weaker edge-case handling |
| Use shorter field names | 10-15% | Reduced readability |
| Compress system prompt | 25-35% | Harder to maintain |
Before: 847 tokens
You are a helpful assistant that analyzes real estate leads and extracts relevant information from them. Please carefully read the input and identify the following fields if they are present...
After: 312 tokens
Extract lead data. Output JSON only.
Fields: name, phone (10 digits), email, address, motivation (high/med/low/unknown), timeline, confidence (0-1).
Unknown = null. Bad input = {"error":"unparseable"}.
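To sanity-check savings while editing a prompt, a rough character-based estimate is often enough. This is a sketch: ~4 characters per token is a heuristic for English text, not the real tokenizer, so use the usage counts the API reports for exact numbers:

```typescript
// Rough heuristic: ~4 characters per token for English prose.
// For exact counts, rely on the provider's reported token usage.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

const before = 'You are a helpful assistant that analyzes real estate leads...';
const after = 'Extract lead data. Output JSON only.';
console.log(estimateTokens(before), estimateTokens(after));
```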
Pattern 5: Prompt Versioning
Prompts are code. Version them like code:
export const LEAD_QUALIFIER_PROMPT = {
  version: '2.3',
  model: 'claude-sonnet-4-20250514',
  system: `Extract lead data. Output JSON only...`,
  examples: [...],
  changelog: [
    '2.3: Added confidence score',
    '2.2: Fixed phone validation edge case',
    '2.1: Reduced tokens by 40%',
    '2.0: Complete rewrite for Claude 3'
  ],
  testConfig: {
    enabled: true,
    variants: ['v2.2', 'v2.3'],
    metric: 'conversion_rate'
  }
};
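The testConfig above implies routing traffic between variants. One way to do that is deterministic bucketing, so the same lead always sees the same prompt version and conversion metrics stay clean. A sketch (the hashing scheme is a hypothetical choice, not from the original):

```typescript
// Deterministically assign a lead to a prompt variant: hash the lead ID
// and bucket it, so repeated calls for the same lead pick the same version.
function pickVariant(leadId: string, variants: string[]): string {
  let hash = 0;
  for (const ch of leadId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit rolling hash
  }
  return variants[hash % variants.length];
}
```

Usage: `pickVariant(lead.id, LEAD_QUALIFIER_PROMPT.testConfig.variants)` at request time, logging the chosen version alongside the outcome metric.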
Pattern 6: Graceful Degradation
When the model fails, have a fallback:
async function qualifyWithFallback(input: string) {
  try {
    return await qualifyLead(input);
  } catch (e) {
    console.log('AI extraction failed, trying regex');
  }
  // Deterministic fallback: pull what we can with regex, mark it low-confidence.
  const phone = input.match(/\d{10}/)?.[0] || null;
  const email = input.match(/[\w.-]+@[\w.-]+/)?.[0] || null; // loose match; validate downstream
  return {
    name: null,
    phone,
    email,
    property_address: null,
    motivation: 'unknown',
    timeline: null,
    confidence: 0.3,
    extraction_method: 'regex_fallback'
  };
}
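Before dropping to regex, it can also be worth retrying the AI path once, since many failures are transient. A sketch of a retry wrapper; the attempt count is an assumption, not from the original, and a real version would add backoff:

```typescript
// Retry an async operation a fixed number of times before giving up,
// so the caller's regex fallback only triggers on persistent failures.
async function withRetries<T>(fn: () => Promise<T>, attempts = 2): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn(); // success on any attempt wins
    } catch (e) {
      lastErr = e; // remember the failure and try again
    }
  }
  throw lastErr; // out of attempts: let the caller fall back
}
```

In qualifyWithFallback, the try block would become `return await withRetries(() => qualifyLead(input))`.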
Production Checklist
- System prompt defines role, constraints, and output format
- Few-shot examples cover common cases AND edge cases
- Zod or similar validates all AI outputs
- Prompts are versioned with changelogs
- Token usage is monitored and optimized
- Fallback extraction exists for failures
- Failed extractions are logged for iteration
- A/B testing infrastructure for prompt variants
The difference between a demo and production is error handling. Your prompt will fail. Plan for it.