Intercom Fin AI: RAG-Powered Resolution Engine
A 17-page technical PRD for a production-ready multi-tenant RAG system — dual-model routing, 9-point failure analysis, permission-aware retrieval, token cost modeling, and a complete system prompt specification.
The Problem
B2B SaaS support teams are drowning in fragmented knowledge. Agents switch between Intercom Articles, Notion, Confluence, Slack, admin tools, and macros to answer a single ticket — copying, pasting, hoping they haven't leaked an internal link to a customer.
Each ticket means switching context across five or more tools, and the answers come out inconsistent. Internal links get shared with customers by accident, and new agents take months to onboard because knowledge is tribal. The result is slow responses, uneven quality, and growing "knowledge debt."
Why RAG — Not Fine-Tuning or Pure Prompting
| Approach | Problem |
|---|---|
| Pure Prompting | Can't fit hundreds of articles in-context. "Lost in the middle" issues degrade answer quality on long prompts. |
| Fine-Tuning | Bakes knowledge into weights — makes per-tenant isolation, GDPR "right to be forgotten," and rapid policy updates impractical. |
| RAG | Separates reasoning (LLM) from knowledge (indexed docs + live APIs). Supports instant re-indexing, citation tracking, and permission-aware retrieval per tenant. |
Technical Architecture — 7 Layers
| Layer | What It Does |
|---|---|
| 1. Data Ingestion | Connectors for Intercom Articles, PDFs, Notion/Confluence, CRM/admin DB, past conversations. Normalize, parse, optional OCR, generate semantic metadata per chunk. |
| 2. Chunking & Embedding | Semantic header-aware recursive chunking (400–800 tokens). Metadata attached: tenant_id, visibility, roles, plan, source_type. Stored in a multi-tenant vector DB with namespace filtering. |
| 3. Query Processing | Safety + intent classification → query refinement → pre-filtered vector query (tenant + visibility + role/plan) → retrieve top 40 → rerank to top 5–10. |
| 4. Dynamic Context | If query needs live data (billing, feature flags), call registered tools: billing API, CRM, admin DB. Inject structured outputs alongside retrieved chunks. |
| 5. LLM Generation | Needle (GPT-4o mini / Claude Haiku) for standard queries. Sword (GPT-4o / Claude Sonnet) triggered on low confidence, multi-doc synthesis, VIP users. |
| 6. Response Post-processing | Validate: hallucination heuristics, policy keywords, missing citations. Render citation UI for end-users; source snippets + macro suggestions for agents. |
| 7. Observability | Structured logs per step: query, chunks, tools used, model version, latency, cost, confidence. LangSmith/Arize integration for pipeline traces. |
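Layer 2's header-aware chunking can be illustrated with a minimal sketch. It is a simplified two-level version (split on headings, then pack paragraphs up to a budget) and not the PRD's actual implementation. The `Chunk` class, the `chunk_by_headers` function, and the use of whitespace word counts as a stand-in for tokens are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    heading: str
    metadata: dict = field(default_factory=dict)


def chunk_by_headers(doc: str, base_meta: dict, max_tokens: int = 800) -> list[Chunk]:
    """Split a markdown-style document on headings, then pack paragraphs
    into chunks of at most max_tokens (approximated by word count)."""
    # re.split with a capture group yields: [preamble, heading, body, heading, body, ...]
    parts = re.split(r"(?m)^(#{1,6} .+)$", doc)
    sections = [("", parts[0])]
    for i in range(1, len(parts), 2):
        sections.append((parts[i].lstrip("#").strip(), parts[i + 1]))

    chunks: list[Chunk] = []
    for heading, body in sections:
        current, size = [], 0
        for para in re.split(r"\n\s*\n", body):
            para = para.strip()
            if not para:
                continue
            n = len(para.split())  # crude token proxy (assumption)
            if current and size + n > max_tokens:
                # Budget exceeded: close the current chunk before adding more.
                chunks.append(Chunk("\n\n".join(current), heading,
                                    {**base_meta, "heading": heading}))
                current, size = [], 0
            current.append(para)
            size += n
        if current:
            chunks.append(Chunk("\n\n".join(current), heading,
                                {**base_meta, "heading": heading}))
    return chunks
```

Every chunk inherits the tenant and visibility metadata passed in `base_meta`, so the retrieval-time filters have something to match against. A production version would use the embedding model's tokenizer and enforce the 400-token lower bound as well.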
Dual-Model Strategy: Needle vs Sword
Needle (GPT-4o mini / Claude Haiku): intent detection, FAQ answers, short responses. ~1–3K input + 400–800 output tokens. Fast, cheap. 85–90% of all queries are resolved here.
Sword (GPT-4o / Claude Sonnet): triggered by confidence <0.7, multi-doc synthesis, complex policy questions, or VIP users. ~3–5K input tokens. 10–15% of queries escalate here.
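The routing rule above can be sketched as a small function. This is a hedged illustration: the `QuerySignals` fields and the `choose_model` name are hypothetical, and only the 0.7 threshold and the four triggers come from the text. How each signal is computed (for example, which classifier flags a complex policy question) is left open.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.7  # escalation threshold stated in the trigger list


@dataclass
class QuerySignals:
    confidence: float               # router confidence in [0, 1] (assumed scale)
    needs_multi_doc_synthesis: bool
    is_complex_policy: bool
    is_vip: bool


def choose_model(signals: QuerySignals) -> str:
    """Return "sword" if any escalation trigger fires, otherwise "needle"."""
    if (
        signals.confidence < CONFIDENCE_THRESHOLD
        or signals.needs_multi_doc_synthesis
        or signals.is_complex_policy
        or signals.is_vip
    ):
        return "sword"
    return "needle"
```

Keeping the threshold in one named constant makes the "adjust Needle→Sword routing thresholds" lever a single-line change.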
Target COGS: ≤$0.15 per resolution against $0.99 billed, for a gross margin of about 85% (84.8% at the $0.15 ceiling). With semantic caching, prompt caching, and small-to-large routing, modeled Year-1 costs at 50K MAUs and 75K queries/month stay well below revenue.
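The margin arithmetic behind this target is easy to check. The sketch below uses only the two figures from the text ($0.99 price, $0.15 COGS ceiling); the function names are illustrative.

```python
def gross_margin(price: float, cogs: float) -> float:
    """Gross margin as a fraction of price."""
    return (price - cogs) / price


def max_cogs_for_margin(price: float, target_margin: float) -> float:
    """Highest COGS that still achieves the target margin."""
    return price * (1 - target_margin)


PRICE = 0.99         # billed per resolution
COGS_CEILING = 0.15  # target cost per resolution

print(f"{gross_margin(PRICE, COGS_CEILING):.1%}")  # prints 84.8%
print(f"{max_cogs_for_margin(PRICE, 0.85):.4f}")   # prints 0.1485
```

At exactly $0.15 the margin lands just under 85%, so a strict 85% floor implies a COGS ceiling of about $0.1485 per resolution.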
9-Point RAG Failure Analysis
For each failure point: what breaks, how to detect it, what the PM specifies to fix it.
| # | Failure Point | Detection Signal | PM Fix |
|---|---|---|---|
| 1 | Data Quality / OCR | Unusual embedding norms; "bad answer" feedback clustered by source | Source quality score exposed to AI Ops Manager; mark certain PDFs "human-reviewed only" |
| 2 | Chunking Strategy | Golden eval failures on multi-step questions; LLM-Judge scoring low completeness | Tune chunk sizes; "Warning/Note" must stay in same chunk as the procedure it qualifies |
| 3 | Embedding Quality | Low recall on golden questions where correct chunk is known | Upgrade embedding model; add synonym/alias dictionaries to metadata |
| 4 | Search/Retrieval | Ground-truth chunk not in retrieved set on eval queries | Per-source priority (Articles > PDFs > tickets); filter knobs in AI Ops UI |
| 5 | Re-ranking Failures | Offline eval on reranker; A/B test reranking models | Fine-tune reranker on domain-specific pairs; add source authority + recency features |
| 6 | Prompt / Augmentation | Hallucination feedback; "unfaithful but plausible" golden eval answers | Hard constraint: "Use only provided snippets; if insufficient, say you cannot answer and escalate" |
| 7 | Model Quality | Golden set comparison across models; hallucination rate per model | Switch/upgrade models; adjust Needle→Sword routing thresholds |
| 8 | Data Drift | Time-based mismatch: doc last-updated vs. retrieval hit rates; spike in agent edits on specific topics | Near-real-time re-index on article publish webhooks; doc gating before feature rollouts |
| 9 | User Behavior | Router flags low clarity / multi-intent; safety guardrails detect prompt injection | Ask clarifying questions on low-confidence intent; split multi-intent queries; reject unsafe with friendly refusal |
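The detection signal for failure points 3 and 4 (the ground-truth chunk missing from the retrieved set) can be computed with a short evaluation loop. This is a sketch under assumptions: the golden set is taken to be a list of `(query, expected_chunk_id)` pairs, and `retrieve` is a hypothetical callable standing in for whatever retriever is under test.

```python
from typing import Callable


def retrieval_misses(
    golden: list[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 10,
) -> tuple[float, list[str]]:
    """Return (hit rate at k, queries whose expected chunk was not retrieved)."""
    misses = []
    for query, expected_chunk_id in golden:
        top_k = retrieve(query, k)[:k]  # ranked chunk IDs, best first
        if expected_chunk_id not in top_k:
            misses.append(query)
    hit_rate = 1 - len(misses) / len(golden) if golden else 0.0
    return hit_rate, misses
```

Returning the missed queries alongside the rate lets the failures be clustered by source or topic, which helps separate embedding problems from filtering or priority problems.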
Permission Architecture
Every chunk carries metadata that enforces access control at retrieval time, not at generation time: once restricted text reaches the model's context, filtering the output is too late to be safe.
tenant_id · source_type · visibility (public / customer_internal / agent_internal) · required_role(s) · required_plan(s) · locale · version
Every retrieval query includes mandatory pre-filters: WHERE tenant_id = <tenant> AND visibility IN allowed_visibilities AND required_plan IN user_plans
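The pre-filter can be expressed as a predicate over chunk metadata. The sketch below mirrors the three conditions in the WHERE clause; the `ChunkMeta` and `Requester` types are hypothetical, and treating an empty `required_plans` set as "no plan restriction" is an assumption. In a real deployment the vector store would apply this filter server-side before similarity search instead of in application code, and a `required_roles` check would follow the same pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkMeta:
    chunk_id: str
    tenant_id: str
    visibility: str  # "public" | "customer_internal" | "agent_internal"
    required_plans: frozenset = frozenset()  # empty = any plan (assumption)


@dataclass(frozen=True)
class Requester:
    tenant_id: str
    allowed_visibilities: frozenset
    plans: frozenset


def is_retrievable(chunk: ChunkMeta, user: Requester) -> bool:
    """True only if the chunk passes tenant, visibility, and plan filters."""
    return (
        chunk.tenant_id == user.tenant_id
        and chunk.visibility in user.allowed_visibilities
        and (not chunk.required_plans or bool(chunk.required_plans & user.plans))
    )


def prefilter(chunks: list[ChunkMeta], user: Requester) -> list[ChunkMeta]:
    """Candidate set that similarity search is allowed to see."""
    return [c for c in chunks if is_retrievable(c, user)]
```

Because the filter runs before ranking, a chunk the requester may not see never becomes a candidate and so never enters the prompt.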
Edge Cases Handled
- Visibility change (Internal → Public): Watcher re-indexes permissions within minutes of change
- Agent offboarding: SSO (Okta/SAML) revocation immediately removes access — no delay
- Plan downgrade: Retrieval narrows automatically; Fin suggests upgrade flow instead of leaking enterprise content
Success Metrics
| Metric | Type | Definition |
|---|---|---|
| Automated Resolution Rate | Business | % conversations resolved without human intervention |
| CSAT (AI-handled) | Business | Satisfaction score for Fin-resolved tickets — target ≥90% |
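| Precision@K | RAG | Share of golden-dataset queries whose correct chunk appears in the top K retrieved results (strictly a hit rate@K; textbook precision@K is the fraction of the top K results that are relevant) |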
| Faithfulness Score | RAG | LLM-as-judge 1–5 scale vs. ground-truth answers |
| Hallucination Rate | RAG | % answers introducing unsupported facts or violating citation rules |
| Gross Margin | Business | COGS per resolution ≤$0.15 against $0.99 billing; GM target of about 85% (84.8% at the $0.15 ceiling) |
System Prompt Specification
You are "Fin," an AI support agent for a multi-tenant SaaS product.
Use only provided snippets and tool outputs. Do not invent or speculate.
If retrieved context is insufficient, outdated, or contradictory — clearly say you are not certain and recommend escalation to a human agent.
Never expose content marked internal-only to end-users — summarize as needed in internal notes for agents only.
Always include citations: after each factual claim, reference the snippet IDs that support it.
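The citation rule lends itself to an automated post-processing check. The sketch below is a heuristic under stated assumptions: citations are written as bracketed snippet IDs such as `[S1]` placed before the sentence's closing punctuation, and every sentence is treated as a factual claim. The prompt does not specify a citation format, so both the pattern and the `check_citations` name are illustrative.

```python
import re

CITATION = re.compile(r"\[(S\d+)\]")  # assumed citation format, e.g. [S1]


def check_citations(answer: str, provided_ids: set[str]) -> dict:
    """Flag sentences without a citation and citations to unknown snippets."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    uncited = [s for s in sentences if not CITATION.search(s)]
    unknown = sorted({c for c in CITATION.findall(answer) if c not in provided_ids})
    return {"uncited_sentences": uncited, "unknown_ids": unknown}
```

An answer with any uncited sentence or any ID outside the retrieved set could be blocked or escalated before it reaches the customer. Greetings and clarifying questions would need an exemption in practice.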
5 Example Interactions
| Scenario | Query | Fin's Behavior |
|---|---|---|
| Simple FAQ | "How do I reset my password?" | Needle model → retrieve public article → 2–3 step answer + citation → no tools needed |
| Negative Constraint | "How do I set up round-robin for Twitter DMs?" | Explain not supported for Twitter; cite both assignment docs and channel limitation — never invent a workaround |
| Policy + Dynamic | "My trial ended yesterday, can I extend it?" | Tool call for trial_end_date → retrieve trial policy → personalized yes/no with rationale + citation |
| Internal SOP | Agent on Legacy plan deprecation issue | Retrieve internal SOP → generate internal note with workaround → propose customer-safe reply (no internal jargon) |
| Escalation | "There's a bug with the new webhook system" | Retrieve docs → tool check for known incidents → if unclear: gather info, escalate to #support-eng with summarized context + tags |