The techniques that turn vague AI outputs into production-grade, QA-defensible results, with interactive before/after comparisons and a live prompt builder anchored on your Day 1 exercise: Case Context Summarizer.
80% of prompt quality comes from 4 fundamentals. Master these and every prompt you write will be dramatically better.
Say exactly what you mean. If a colleague would ask "what do you mean?", your prompt needs work.
Give the AI the background it needs: domain, data, situation, constraints. Without context, it guesses.
Tell the AI who to be. A "senior IRT TL briefing leadership" focuses on different signals than a "Live-chat agent drafting a Pax response."
Define what "done" looks like: format, length, structure, style. No framing = unpredictable output.
Context is the most impactful pillar for GS workflows. Skip any type and the output suffers in a specific way:
| Type | What it tells the AI | If you skip it... | GS example |
|---|---|---|---|
| Domain | Industry, market, business area | Generic, non-specific answers | "In the context of GS Support for Southeast Asian ride-hailing..." |
| Data | Specific case data, transcripts, history | AI hallucinates plausible booking IDs / SOP steps | "Here is the D365 case + last 5 chat turns: [data]" |
| Situational | Why you need this now: the trigger | Wrong tone, urgency, severity framing | "True Safety case escalation" vs "Routine MIWI write-up" |
| Constraints | Rules, limits, requirements | Ignores your standards | "Default SGD; redact PAX/DAX names; cite SOP article ID" |
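The four context types can be assembled mechanically, which also makes a skipped type easy to catch. A minimal sketch, assuming a hypothetical `build_prompt` helper and placeholder case data; the `### ROLE` / `### CONTEXT` header style is the plain-text convention mentioned later in this course, and nothing here comes from a real library:

```python
def build_prompt(role: str, context: dict[str, str], task: str, output_format: str) -> str:
    """Assemble a prompt from a role, the four context types, a task, and a format."""
    required = ("domain", "data", "situational", "constraints")
    missing = [key for key in required if not context.get(key)]
    if missing:
        # Fail loudly: each skipped context type degrades the output in its own way.
        raise ValueError(f"missing context types: {', '.join(missing)}")

    context_lines = "\n".join(f"- {key.title()}: {context[key]}" for key in required)
    return (
        f"### ROLE\n{role}\n\n"
        f"### CONTEXT\n{context_lines}\n\n"
        f"### TASK\n{task}\n\n"
        f"### OUTPUT FORMAT\n{output_format}"
    )


prompt = build_prompt(
    role="Senior IRT TL briefing leadership",
    context={
        "domain": "GS Support for Southeast Asian ride-hailing",
        "data": "[D365 case + last 5 chat turns]",  # placeholder: paste real data here
        "situational": "True Safety case escalation",
        "constraints": "Default SGD; redact PAX/DAX names; cite SOP article ID",
    },
    task="Summarise this case",
    output_format="Sections: Symptom, Severity, Booking, Action, Next Step",
)
print(prompt)
```

Raising an error on a missing context type is a design choice for the sketch; a gentler version could warn and continue.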
Prompt: "Summarise this case"
The Pax seems to have an issue with their ride. They mentioned some concerns about the driver. The case looks like it might need follow-up. Overall this appears to be a customer service matter to resolve.
⚠️ Vague, no severity, no booking, no SOP citation, not stakeholder-ready, would fail QA audit
Prompt: Role + Context + Structure + Constraints
RISK RATING: 🟡 AMBER
Transaction Analysis: Volume surged 271% (4,200→15,600). Avg txn flat at $10 suggests organic growth.
Chargeback Alert: Rate 0.3%→4.1%, exceeding 1.0% benchmark by 4x.
Action: 1) Issue warning letter (Ops, 5 days) 2) Reduce PayLater limit (Risk, immediate)
✅ Structured, data-cited, specific actions with owners and timelines
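Part of what makes the improved output "data-cited" is that its figures can be re-derived from the raw numbers. A quick check using only the values quoted in the example (the helper name `pct_change` is illustrative):

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100


volume_growth = pct_change(4_200, 15_600)  # transaction volume
chargeback_multiple = 4.1 / 1.0            # observed rate vs the 1.0% benchmark

print(f"Volume change: {volume_growth:.0f}%")
print(f"Chargeback rate vs benchmark: {chargeback_multiple:.1f}x")
```

The volume change works out to 271%, and the chargeback multiple to 4.1x, which the summary rounds to 4x.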
Financial decisions require multi-step logic. Chain-of-thought (CoT) prompting makes reasoning visible and auditable: the AI shows its work.
Just add "Think step by step." No examples needed. Best for quick calculations and simple logic.
Provide one example with reasoning. The AI follows the exact same pattern. Best for consistent processes.
"First identify key factors, then analyze." Forces prioritization before writing. Best for complex analysis.
"Solve 3 ways, report majority." Multiple approaches catch what a single analysis misses. Best for high-stakes.
Question: "Is this IRT case true Safety P1 or downgradable?"
Yes, this looks like a true Safety case. The Pax raised a concern that warrants IRT attention.
⚠️ No reasoning. No SOP citation. No severity criteria visible. Not QA-defensible. Could be wrong (= 30-min SLA loss + DSAT).
Pax message: "the driver smelled like beer when I got in"
1) Explicit impairment language? YES [SOP §8.3 trigger]
2) Ride concluded <1 hour ago? YES (22:14 vs 22:38)
3) Two or more P1 indicators? YES → P1 confirmed [SOP §7.1]
4) Action: suspend Dax + notify Country Safety in 30 min [SOP §2.4]
Severity: P1 true Safety
✅ Every step visible. SOP citations. Conclusion backed by criteria. QA-defensible.
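The visible reasoning above maps onto a small rule checklist. The sketch below is illustrative only: the keyword list, the one-hour window, the two-indicator threshold, and the `REVIEW` fallback label are assumptions read off this example, not the actual SOP, and `assess` is a hypothetical helper.

```python
from datetime import datetime, timedelta

IMPAIRMENT_KEYWORDS = ("smelled like beer", "drunk", "intoxicated")  # assumed list


def assess(message: str, ride_end: datetime, case_opened: datetime) -> dict:
    """Return each check with its result so the reasoning stays auditable."""
    steps = [
        ("Explicit impairment language? [SOP §8.3]",
         any(k in message.lower() for k in IMPAIRMENT_KEYWORDS)),
        ("Ride concluded <1 hour ago?",
         timedelta(0) <= case_opened - ride_end < timedelta(hours=1)),
    ]
    indicators = sum(passed for _, passed in steps)
    # Two or more indicators confirm P1 in the example; anything else goes to review.
    severity = "P1" if indicators >= 2 else "REVIEW"
    return {"steps": steps, "indicators": indicators, "severity": severity}


result = assess(
    "the driver smelled like beer when I got in",
    ride_end=datetime(2025, 1, 1, 22, 14),    # date is a placeholder
    case_opened=datetime(2025, 1, 1, 22, 38),
)
for label, passed in result["steps"]:
    print(f"{label} {'YES' if passed else 'NO'}")
print("Severity:", result["severity"])
```

Returning the individual steps, not just the verdict, is the point: it is the code equivalent of asking the AI to show its work.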
For high-stakes Safety / fraud / Dax-suspension decisions, analyse from 3 independent angles and take the majority vote:
| Approach | Analysis | Conclusion |
|---|---|---|
| 1. Pax language | Explicit "smelled like beer": impairment keyword per SOP §8.3 | 🔴 P1 SAFETY |
| 2. Timing | Ride ended 22:14 SGT, case opened 22:38, within 1-hour P1 window | 🔴 P1 SAFETY |
| 3. Pax credibility | Grab VIP, no past complaints, consistent ride history | 🟢 LEGITIMATE REPORT |
Majority: 2/3 P1 SAFETY. Approach 3 alone (looking only at Pax credibility) might have de-prioritised this case. The majority vote catches what a single lens misses and protects the 30-min SLA.
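Once the three conclusions exist, the vote itself is simple counting. A minimal sketch using the conclusions from the table above; `majority_vote` is a hypothetical helper, and in practice each conclusion would come from a separate model run:

```python
from collections import Counter

conclusions = {
    "Pax language": "P1 SAFETY",
    "Timing": "P1 SAFETY",
    "Pax credibility": "LEGITIMATE REPORT",
}


def majority_vote(votes: dict[str, str]) -> tuple[str, int, list[str]]:
    """Return the winning conclusion, its vote count, and the dissenting approaches."""
    winner, count = Counter(votes.values()).most_common(1)[0]
    dissent = [name for name, verdict in votes.items() if verdict != winner]
    return winner, count, dissent


winner, count, dissent = majority_vote(conclusions)
print(f"Majority: {count}/{len(conclusions)} {winner}; dissent: {dissent}")
```

Ties fall to whichever conclusion appeared first, so a real workflow would want an explicit tie rule, for example escalating to a human.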
Same data, dramatically different insights, just by changing who the AI "is." The AI was trained on millions of documents written by different professionals. When you assign a persona, you activate that specific knowledge cluster.
Data: Pax (Grab VIP) reported "the driver smelled like beer", ride ended 22:14, Dax has a 4.2 rating with one prior unresolved Pax safety report
SEVERITY: P1 - IMMEDIATE ACTION
Explicit impairment language is an automatic P1 trigger per SOP §8.3. Dax has a prior unresolved Pax safety flag, which indicates pattern risk.
Action: Suspend Dax pending review. Notify Country Safety in <30 min. First-response email to Pax within SLA.
P1 SAFETY + PAX RECOVERY OPPORTUNITY
Pax is a Grab VIP, so there is DSAT and brand risk if the case is not handled with care. Beyond the SOP-required actions, the relationship matters.
Action: Standard P1 escalation + VIP-tier care script. Personalised first response within 15 min, offer goodwill credit, dedicated TL follow-up.
Both are valid. The Safety analyst sees the regulatory + Dax-action picture. The Pax Care lead sees the Pax-relationship picture. Neither is wrong: they serve different audiences (Country Safety lead vs DSAT trend report).
Get 3 perspectives in one prompt, with no need to schedule 3 meetings:
| Perspective | Focus | Key finding |
|---|---|---|
| Risk Manager | Default rate, exposure, regulation | "Doubling limits increases exposure by $12M" |
| Product Manager | Adoption, competition, revenue | "Current $500 limit is #1 reason for churn" |
| Compliance | Responsible lending, MAS guidelines | "MAS requires affordability assessment above $500" |
Synthesis: Proceed with phased rollout ($750 first) with income verification. Monitor default rate weekly. Full $1,000 after 90-day review.
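A single prompt can request all three perspectives plus the synthesis. A sketch, assuming a hypothetical `multi_persona_prompt` helper; the persona names and focus areas come from the table above, while the question wording is inferred from the example and is not quoted from it:

```python
PERSONAS = {
    "Risk Manager": "default rate, exposure, regulation",
    "Product Manager": "adoption, competition, revenue",
    "Compliance": "responsible lending, MAS guidelines",
}


def multi_persona_prompt(question: str, personas: dict[str, str]) -> str:
    """Ask for one analysis per persona, then a synthesis, in a single prompt."""
    sections = "\n".join(
        f"{i}. As the {name}, focus on: {focus}. Give one key finding."
        for i, (name, focus) in enumerate(personas.items(), start=1)
    )
    return (
        f"Question: {question}\n\n"
        f"Answer from each perspective separately:\n{sections}\n\n"
        "Then write a Synthesis that reconciles the perspectives into one recommendation."
    )


# Question wording is an assumption based on the $500 and $1,000 limits in the example.
print(multi_persona_prompt("Should we double the PayLater limit from $500 to $1,000?", PERSONAS))
```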
Multi-agent framing is a well-established prompt engineering technique with several names in the research literature:
| Technique | Source | Key idea |
|---|---|---|
| Solo Performance Prompting (SPP) | Wang et al., 2023 | A single LLM simulates multiple personas that collaborate internally: "cognitive synergy through multi-persona self-collaboration" |
| Multi-Persona Thinking (MPT) | arXiv 2025 | Dialectical reasoning from multiple perspectives to reduce bias and improve decision quality |
| Town Hall Debate Prompting | arXiv 2025 | Splices a language model into multiple personas that debate one another to reach a conclusion |
| Self-Consistency | Wang et al., 2022 | Generate multiple reasoning paths and aggregate; the broader technique family that multi-perspective builds on |
Consistent format + grounded in YOUR data = production-safe outputs.
Different every time. Hard to compare. Can't feed into systems. Requires human parsing.
Consistent format. Comparable across items. Machine-parseable. Scannable by busy stakeholders.
Tell the AI exactly what shape the output should take. The more specific your format instructions, the more consistent the results.
| Technique | Prompt example | What you get |
|---|---|---|
| Named sections | "Use these sections: Symptom, Severity, Booking, Action, Next Step" | Same headings every case, scannable for TLs and stakeholders |
| Table format | "Present as: Field \| Value \| Source \| Confidence" | Aligned data, scannable in D365 case notes |
| JSON output | "Return JSON: {symptom, severity, booking_id, action, sop_citation}" | Machine-readable, feeds into D365 / dashboards / Slack escalations |
| Numbered actions | "List 3 next agent actions. Each: action, owner (agent / TL / SPV), deadline, SOP-ID" | Actionable items with accountability + SOP traceability |
| Severity + justification | "Assign P1 / P2 / P3 severity. Justify in exactly 2 sentences with SOP citation." | Consistent QA-defensible severity decisions across all cases |
| Length control | "Stakeholder summary: max 3 sentences. Detail: max 150 words." | Right depth for the audience (stakeholder vs TL vs agent) |
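Asking for JSON only pays off if something checks the shape before it feeds a downstream system. A minimal validation sketch, assuming the five keys from the JSON row above and a hypothetical `parse_case_summary` helper; the sample response, booking ID and SOP reference are placeholders:

```python
import json

REQUIRED_KEYS = {"symptom", "severity", "booking_id", "action", "sop_citation"}
ALLOWED_SEVERITIES = {"P1", "P2", "P3"}


def parse_case_summary(raw: str) -> dict:
    """Parse a model response and reject it if the agreed shape is not met."""
    data = json.loads(raw)  # raises json.JSONDecodeError on non-JSON output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["severity"] not in ALLOWED_SEVERITIES:
        raise ValueError(f"unexpected severity: {data['severity']!r}")
    return data


# Hypothetical model response used only to exercise the check.
sample = (
    '{"symptom": "Driver impairment reported", "severity": "P1", '
    '"booking_id": "<BOOKING-ID>", "action": "Suspend Dax pending review", '
    '"sop_citation": "SOP §8.3"}'
)
print(parse_case_summary(sample)["severity"])
```

A rejected response is a signal to re-run or tighten the format instruction, not to patch the output by hand.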
When you ask AI to produce a report, analysis, or any reusable document, ask for Markdown. It's the format that works best for both humans and AI.
| Format | Human readable | AI readable | Token cost | Reusable |
|---|---|---|---|---|
| | ✅ | ❌ Can't parse | N/A | ❌ |
| Word (.docx) | ✅ | ⚠️ Partial | N/A | ❌ |
| HTML | ⚠️ Tags clutter | ✅ | High (~20 tokens/heading) | ✅ |
| Markdown | ✅ | ✅ | Low (~8 tokens/heading) | ✅ |
How to ask for it:
On Day 2, every artifact you create (steering files such as .kiro/steering/rules.md, skills in SKILL.md, agent configs) is Markdown. It's the interface layer between you and AI: structured enough for machines, readable enough for humans, and about 60% fewer tokens than HTML (a heading costs roughly 8 tokens versus 20).
This isn't just a convention; research and industry practice back it up:
| Finding | Impact | Source |
|---|---|---|
| Markdown vs HTML token usage | 60% fewer tokens for same content structure | Token comparison (heading: ~8 vs ~20 tokens) |
| Markdown vs JSON for LLM comprehension | 16% average token savings with equal or better accuracy | Format performance benchmarks |
| Table extraction accuracy | Markdown 60.7% vs HTML 53.6% | ReleasePad, 2025 |
| RAG retrieval with clean Markdown | Up to 35% better retrieval accuracy, 20-30% fewer tokens | AnythingMD |
| llms.txt web standard (Sept 2024) | Websites now serve Markdown specifically for AI agents | Jeremy Howard, Answer.AI |
| LLM Markdown awareness research | LLMs are expected to produce structured Markdown for readability | arXiv:2501.15000, 2025 |
The llms.txt standard (proposed by Jeremy Howard of Answer.AI in September 2024) is like robots.txt but for AI: websites that adopt it serve a Markdown file at their root specifically for AI agents to read. When you write a steering file, a SKILL.md, or ask for a report, Markdown is the right default.
This section is for Citizen Developers and technical team members. Most business users can skip this; the plain-text techniques above are all you need for daily use.
When building prompt templates at the code level (Bedrock API, application backends), developers often wrap prompt sections in XML tags. This is how Anthropic recommends structuring complex API calls: the tags create unambiguous boundaries between instructions, data, and constraints.
| Pattern | Tags | When to use |
|---|---|---|
| Data analysis | <data> <task> <format> | Analyzing reports, transactions, metrics |
| Document Q&A | <document> <question> <rules> | Policy lookups, compliance checks |
| Multi-step | <context> <step1> <step2> | Complex workflows, chained analysis |
| Review | <draft> <criteria> <instructions> | Reviewing reports, emails, proposals |
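At the code level, the wrapping is a few lines. A sketch using the Data analysis pattern from the table; `wrap_sections` is a hypothetical helper rather than part of any SDK, and the section contents are placeholders:

```python
from xml.sax.saxutils import escape


def wrap_sections(**sections: str) -> str:
    """Wrap each named section in a matching pair of XML-style tags."""
    return "\n".join(
        f"<{tag}>\n{escape(body)}\n</{tag}>" for tag, body in sections.items()
    )


prompt = wrap_sections(
    data="[case transcript goes here]",  # placeholder
    task="Summarise the case for a TL handover.",
    format="Sections: Symptom, Severity, Booking, Action, Next Step",
)
print(prompt)
```

Escaping the body is a design choice in this sketch: it stops pasted data that happens to contain `</data>` from closing the section early, at the cost of the model seeing `&lt;` instead of `<`.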
Your tech team wraps your content in <data>...</data> tags behind the scenes before sending it to the Bedrock API. You design the content; your tech team handles the XML structure. For daily use in Claude Cowork or Kiro, plain-text headers (### ROLE, ### CONTEXT) work just as well.
Without grounding rules, the AI mixes its training data with your documents, so you can't trace which claims come from where. In an audit, that's a disaster.
Q: "What happens if a customer misses 3 PayLater payments?"
The account is typically suspended and sent to collections. Late fees of around 1-2% per month are usually applied.
⚠️ Sounds plausible but NONE of this may be in your actual policy. "Typically" and "usually" = hallucination.
Q: Same question + policy doc attached + grounding rules
Per PayLater Terms v3.2:
• After 3 missed payments → "Default" status [Section 8.3]
• Late fee: $5 or 1.5% whichever higher [Section 7.1]
• [INSUFFICIENT DATA: credit score reporting not in provided document]
✅ Every claim cites a section. Admits what it doesn't know. No hallucination.
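Grounding can be spot-checked automatically: every bullet should carry either a section citation or an explicit gap marker. A sketch, assuming the rule wording and the `[Section X.Y]` / `[INSUFFICIENT DATA: ...]` markers shown in the example; `GROUNDING_RULES` and `ungrounded_lines` are illustrative names, not an existing tool:

```python
import re

# Assumed wording for the grounding rules; adapt to your own policy documents.
GROUNDING_RULES = (
    "Answer ONLY from the provided document. "
    "Cite the section for every claim as [Section X.Y]. "
    "If the document does not cover something, write [INSUFFICIENT DATA: what is missing]."
)

CITATION = re.compile(r"\[Section \d+(\.\d+)*\]|\[INSUFFICIENT DATA:[^\]]+\]")


def ungrounded_lines(answer: str) -> list[str]:
    """Return bullet lines that carry neither a citation nor an explicit gap marker."""
    bullets = [ln.strip() for ln in answer.splitlines() if ln.strip().startswith("•")]
    return [ln for ln in bullets if not CITATION.search(ln)]


answer = """Per PayLater Terms v3.2:
• After 3 missed payments → "Default" status [Section 8.3]
• Late fee: $5 or 1.5% whichever higher [Section 7.1]
• [INSUFFICIENT DATA: credit score reporting not in provided document]"""

print(ungrounded_lines(answer))  # an empty list means every bullet is traceable
```

The check confirms that a citation is present, not that it is correct; a reviewer still has to open the cited section.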
Recognize these patterns? Fix them with one-line additions to your prompt.
| Mistake | Why it hurts | Quick fix |
|---|---|---|
| The Kitchen Sink | Cramming 5 tasks into 1 prompt | One task per prompt, chain results |
| The Blank Canvas | No examples = AI guesses your format | Show 1-2 examples of desired output |
| The Trust Fall | No grounding = confident hallucinations | "ONLY from provided data" |
| The Vague Ask | "Analyze this": analyze what, how, for whom? | Specify audience, format, length |
| The One-Shot Wonder | Expecting perfection on first try | Plan for 2-3 refinement turns |
| The Copy-Paste Trap | Same prompt for different models | Tune syntax per model family |
| The Set-and-Forget | Never re-testing after model updates | Monthly prompt health checks |
Every production-quality prompt goes through this cycle:
| Round | What you do | Result |
|---|---|---|
| 1. Baseline | Write prompt using 4 pillars. Run 3 times. | See what AI gets right and wrong (~60% quality) |
| 2. Fix failures | Add negative constraints + example of good output. Run 3 more. | Consistency jumps to ~85% |
| 3. Polish | Add self-review step. Tighten format. Test edge cases. | Production-ready at ~95% |
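"Run 3 times" is easier to act on with a number attached. The sketch below measures exact-match agreement across runs, assuming a hypothetical `run` callable that stands in for a real model call. Agreement is only a crude proxy; the quality percentages in the table are the course's own estimates, not outputs of this function.

```python
from collections import Counter
from typing import Callable


def consistency(run: Callable[[str], str], prompt: str, n: int = 3) -> float:
    """Share of n runs that agree with the most common output."""
    outputs = [run(prompt) for _ in range(n)]
    top_count = Counter(outputs).most_common(1)[0][1]
    return top_count / n


# Stand-in for a real model call: cycles through canned answers.
canned = iter(["P1", "P1", "P2"])
score = consistency(lambda _prompt: next(canned), "Assign severity to this case.")
print(f"{score:.0%} of runs agree")
```

For free-text outputs, compare an extracted field such as the severity label instead of the whole response, since wording will differ between runs even when the decision is the same.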
Negative constraints prevent common failure modes:
| Problem | Add this constraint |
|---|---|
| AI adds unsolicited opinions | "Do not include personal opinions or speculation" |
| AI uses data not in your input | "Do not reference any data outside the provided documents" |
| AI writes too much | "Do not exceed 300 words" |
| AI hedges everything | "Do not use phrases like 'it depends' or 'generally speaking'" |
| AI explains obvious things | "Do not explain what PayLater is or how digital wallets work" |
| AI invents numbers | "If a metric is not in the data, write [DATA NOT AVAILABLE]" |
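Some of these constraints can be verified mechanically after the fact. A sketch that appends the constraints from the table to a prompt and checks the two that are easy to test, word count and banned hedging phrases; the helper names and the `### CONSTRAINTS` header are assumptions:

```python
NEGATIVE_CONSTRAINTS = [
    "Do not include personal opinions or speculation",
    "Do not reference any data outside the provided documents",
    "Do not exceed 300 words",
    "Do not use phrases like 'it depends' or 'generally speaking'",
    "Do not explain what PayLater is or how digital wallets work",
    "If a metric is not in the data, write [DATA NOT AVAILABLE]",
]
BANNED_PHRASES = ("it depends", "generally speaking")
MAX_WORDS = 300


def with_constraints(prompt: str) -> str:
    """Append the negative constraints as a bulleted block."""
    rules = "\n".join(f"- {rule}" for rule in NEGATIVE_CONSTRAINTS)
    return f"{prompt}\n\n### CONSTRAINTS\n{rules}"


def violations(output: str) -> list[str]:
    """Check the two constraints that can be verified mechanically."""
    found = [f"banned phrase: {p!r}" for p in BANNED_PHRASES if p in output.lower()]
    if len(output.split()) > MAX_WORDS:
        found.append(f"over {MAX_WORDS} words")
    return found


print(violations("Generally speaking, it depends on the case."))
```

The remaining constraints (opinions, outside data, invented numbers) still need a human or a second review prompt to check.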
The AI does not remember previous sessions. Each session is completely isolated: the AI has zero memory of previous conversations.
| What persists | What doesn't |
|---|---|
| ✅ Files in your workspace (reports, templates, code) | ❌ Chat conversation history |
| ✅ Steering files (.kiro/steering/), loaded every session | ❌ What you said 3 sessions ago |
| ✅ Skills (.kiro/skills/), activated by keywords | ❌ Old tabs or closed sessions |
| ✅ Custom agents (.kiro/agents/), invoked by name | ❌ Your "relationship" with the AI |