B2B SaaS · Technical PRD · RAG Architecture · AI Evals

Intercom Fin AI:
RAG-Powered Resolution Engine

A 17-page technical PRD for a production-ready multi-tenant RAG system — dual-model routing, 9-point failure analysis, permission-aware retrieval, token cost modeling, and a complete system prompt specification.

Type: Technical PRD
Domain: B2B SaaS / AI Support
Target: Series B–D SaaS
Informed by: Real consulting work

The Problem

B2B SaaS support teams are drowning in fragmented knowledge. Agents switch between Intercom Articles, Notion, Confluence, Slack, admin tools, and macros to answer a single ticket — copying, pasting, hoping they haven't leaked an internal link to a customer.

The Before State

Context-switching across 5+ tools per ticket. Inconsistent answers. Internal links accidentally shared with customers. New agents take months to onboard because knowledge is tribal. The result: slow responses, inconsistent quality, and growing "knowledge debt."

Why RAG — Not Fine-Tuning or Pure Prompting

Approach | Assessment
Pure Prompting | Can't fit hundreds of articles in-context. "Lost in the middle" issues degrade answer quality on long prompts.
Fine-Tuning | Bakes knowledge into weights, which makes per-tenant isolation, GDPR "right to be forgotten," and rapid policy updates impractical.
RAG | Separates reasoning (LLM) from knowledge (indexed docs + live APIs). Supports instant re-indexing, citation tracking, and permission-aware retrieval per tenant.

Technical Architecture — 7 Layers

Layer | What It Does
1. Data Ingestion | Connectors for Intercom Articles, PDFs, Notion/Confluence, CRM/admin DB, past conversations. Normalize, parse, optional OCR, generate semantic metadata per chunk.
2. Chunking & Embedding | Semantic header-aware recursive chunking (400–800 tokens). Metadata attached: tenant_id, visibility, roles, plan, source_type. Stored in multi-tenant vector DB with namespace filtering.
3. Query Processing | Safety + intent classification → query refinement → pre-filtered vector query (tenant + visibility + role/plan) → retrieve top 40 → rerank to top 5–10.
4. Dynamic Context | If query needs live data (billing, feature flags), call registered tools: billing API, CRM, admin DB. Inject structured outputs alongside retrieved chunks.
5. LLM Generation | Needle (GPT-4o mini / Claude Haiku) for standard queries. Sword (GPT-4o / Claude Sonnet) triggered on low confidence, multi-doc synthesis, VIP users.
6. Response Post-processing | Validate: hallucination heuristics, policy keywords, missing citations. Render citation UI for end-users; source snippets + macro suggestions for agents.
7. Observability | Structured logs per step: query, chunks, tools used, model version, latency, cost, confidence. LangSmith/Arize integration for pipeline traces.
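
The Layer 3 flow (pre-filter, retrieve top 40, rerank to a short list) can be sketched in a few lines. This is a minimal in-memory illustration, not the production design: the `retrieve` function, the dict-shaped chunks, and the `allowed` and `rerank` callables are all hypothetical stand-ins for a vector DB query and a reranking model.

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: list[float],
    chunks: list[dict],
    allowed: Callable[[dict], bool],   # hypothetical permission pre-filter
    rerank: Callable[[dict], float],   # hypothetical reranker score
    k_retrieve: int = 40,              # "retrieve top 40"
    k_final: int = 5,                  # "rerank to top 5-10"
) -> list[dict]:
    # 1. Pre-filter BEFORE similarity search so restricted chunks never compete.
    candidates = [c for c in chunks if allowed(c)]
    # 2. Vector search: keep the k_retrieve most similar candidates.
    candidates.sort(key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    shortlist = candidates[:k_retrieve]
    # 3. Rerank the shortlist and keep the final context set.
    shortlist.sort(key=rerank, reverse=True)
    return shortlist[:k_final]
```

The point of the ordering is that the filter runs first: a chunk from another tenant is dropped even if it would have been the closest vector match.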

Dual-Model Strategy: Needle vs Sword

🪡 Needle (Default)
GPT-4o mini · Claude Haiku · Gemini Flash

Intent detection, FAQ answers, short responses. ~1–3K input + 400–800 output tokens. Fast, cheap.

85–90% of all queries resolved here.

⚔️ Sword (On-Demand)
GPT-4o · Claude Sonnet · Gemini Pro

Triggered by: confidence <0.7, multi-doc synthesis, complex policy questions, VIP users. ~3–5K input tokens.

10–15% of queries escalate here.
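
The escalation triggers listed above map naturally onto a small routing function. The sketch below assumes the upstream classifier already produces a confidence score and boolean flags; `QueryContext`, its field names, and `choose_tier` are illustrative names, not part of the PRD.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.7  # escalation trigger stated in the Sword card


@dataclass
class QueryContext:
    confidence: float             # assumed output of the intent/confidence step
    multi_doc: bool = False       # answer needs synthesis across documents
    complex_policy: bool = False  # flagged as a complex policy question
    vip: bool = False             # requester is a VIP user


def choose_tier(ctx: QueryContext) -> str:
    """Return "sword" if any escalation trigger fires, otherwise "needle"."""
    if ctx.confidence < CONFIDENCE_THRESHOLD:
        return "sword"
    if ctx.multi_doc or ctx.complex_policy or ctx.vip:
        return "sword"
    return "needle"


print(choose_tier(QueryContext(confidence=0.92)))            # needle
print(choose_tier(QueryContext(confidence=0.55)))            # sword
print(choose_tier(QueryContext(confidence=0.92, vip=True)))  # sword
```

Keeping the threshold as a single named constant makes it easy to tune the Needle-to-Sword split against the target escalation share.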

Token Economics

Target COGS: ≤$0.15 per resolution against $0.99 billed per resolution, a gross margin of roughly 85% (84.8% at exactly $0.15; a strict ≥85% needs COGS ≤$0.1485). With semantic caching, prompt caching, and small-to-large routing, modeled Year-1 costs at 50K MAUs and 75K queries/month stay well below revenue.
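
The margin arithmetic is easy to check directly. The sketch below uses only the $0.99 price and $0.15 COGS target from this section; the function names are illustrative, and `blended_cogs` takes per-tier costs as inputs because the PRD summary does not state them.

```python
def gross_margin(price: float, cogs: float) -> float:
    """Gross margin as a fraction of price."""
    return (price - cogs) / price


def max_cogs(price: float, target_margin: float) -> float:
    """Highest COGS that still meets the target margin."""
    return price * (1 - target_margin)


def blended_cogs(needle_share: float, needle_cost: float, sword_cost: float) -> float:
    """Expected cost per query when needle_share of traffic stays on the small model.

    needle_cost and sword_cost are caller-supplied assumptions, not PRD figures.
    """
    return needle_share * needle_cost + (1 - needle_share) * sword_cost


print(f"{gross_margin(0.99, 0.15):.1%}")  # 84.8%
print(f"{max_cogs(0.99, 0.85):.4f}")      # 0.1485
```

At exactly $0.15 the margin is 84.8%, so the ≥85% target leaves essentially no headroom above a COGS of about $0.1485.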

9-Point RAG Failure Analysis

For each failure point: what breaks, how to detect it, what the PM specifies to fix it.

# | Failure Point | Detection Signal | PM Fix
1 | Data Quality / OCR | Unusual embedding norms; "bad answer" feedback clustered by source | Source quality score exposed to AI Ops Manager; mark certain PDFs "human-reviewed only"
2 | Chunking Strategy | Golden eval failures on multi-step questions; LLM-Judge scoring low completeness | Tune chunk sizes; "Warning/Note" must stay in same chunk as the procedure it qualifies
3 | Embedding Quality | Low recall on golden questions where correct chunk is known | Upgrade embedding model; add synonym/alias dictionaries to metadata
4 | Search/Retrieval | Ground-truth chunk not in retrieved set on eval queries | Per-source priority (Articles > PDFs > tickets); filter knobs in AI Ops UI
5 | Re-ranking Failures | Offline eval on reranker; A/B test reranking models | Fine-tune reranker on domain-specific pairs; add source authority + recency features
6 | Prompt / Augmentation | Hallucination feedback; "unfaithful but plausible" golden eval answers | Hard constraint: "Use only provided snippets; if insufficient, say you cannot answer and escalate"
7 | Model Quality | Golden set comparison across models; hallucination rate per model | Switch/upgrade models; adjust Needle→Sword routing thresholds
8 | Data Drift | Time-based mismatch: doc last-updated vs. retrieval hit rates; spike in agent edits on specific topics | Near-real-time re-index on article publish webhooks; doc gating before feature rollouts
9 | User Behavior | Router flags low clarity / multi-intent; safety guardrails detect prompt injection | Ask clarifying questions on low-confidence intent; split multi-intent queries; reject unsafe with friendly refusal
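
The chunking fix in row 2 (a "Warning/Note" must stay with the procedure it qualifies) can be enforced as a pre-pass before size-based splitting. The sketch below is one hypothetical way to do it; it assumes the document is already split into text blocks and that a qualifier follows the procedure it refers to, which may not hold for every source.

```python
import re

# Blocks that begin with "Warning" or "Note" (any case) count as qualifiers.
QUALIFIER = re.compile(r"^(warning|note)\b", re.IGNORECASE)


def attach_qualifiers(blocks: list[str]) -> list[str]:
    """Merge each Warning/Note block into the block that precedes it."""
    merged: list[str] = []
    for block in blocks:
        if merged and QUALIFIER.match(block.strip()):
            # Glue the qualifier onto the previous block so they chunk together.
            merged[-1] = merged[-1] + "\n" + block
        else:
            merged.append(block)
    return merged


blocks = [
    "Step 1: Export your data.",
    "Warning: Exporting clears unsaved drafts.",
    "Step 2: Import into the new workspace.",
]
for chunk in attach_qualifiers(blocks):
    print(repr(chunk))
```

A chunker that then splits on the merged blocks would need to treat each one as indivisible, or at least prefer to break between blocks rather than inside them.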

Permission Architecture

Every chunk carries metadata that enforces access control at retrieval time, not at generation time: once a restricted chunk is in the prompt, it is too late to guarantee it stays out of the answer.

Chunk Metadata Schema

tenant_id · source_type · visibility (public / customer_internal / agent_internal) · required_role(s) · required_plan(s) · locale · version

Every retrieval query includes mandatory pre-filters: WHERE tenant_id = <tenant> AND visibility IN allowed_visibilities AND required_role IN user_roles AND required_plan IN user_plans
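
As a rough illustration of the pre-filter, the sketch below expresses the same checks as an in-memory predicate. The `Chunk` and `Requester` classes and `is_visible` are hypothetical names; a real deployment would push these conditions into the vector DB's metadata filter rather than filter in Python. Two assumptions are made here: an empty requirement set means "no restriction", and the role check is included because Layer 3 lists role among the pre-filters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    tenant_id: str
    visibility: str                  # public / customer_internal / agent_internal
    required_roles: frozenset[str]   # empty = no role restriction (assumption)
    required_plans: frozenset[str]   # empty = no plan restriction (assumption)


@dataclass(frozen=True)
class Requester:
    tenant_id: str
    allowed_visibilities: frozenset[str]
    roles: frozenset[str]
    plans: frozenset[str]


def is_visible(chunk: Chunk, user: Requester) -> bool:
    """True only if the chunk passes every mandatory pre-filter."""
    if chunk.tenant_id != user.tenant_id:
        return False
    if chunk.visibility not in user.allowed_visibilities:
        return False
    if chunk.required_plans and not (chunk.required_plans & user.plans):
        return False
    if chunk.required_roles and not (chunk.required_roles & user.roles):
        return False
    return True
```

Because the requester's plans are read at query time, a plan downgrade narrows retrieval on the next request without any re-indexing of the chunks themselves.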

Edge Cases Handled

  • Visibility change (Internal → Public): Watcher re-indexes permissions within minutes of change
  • Agent offboarding: SSO (Okta/SAML) revocation removes access immediately
  • Plan downgrade: Retrieval narrows automatically; Fin suggests upgrade flow instead of leaking enterprise content

Success Metrics

≥50%: Automated Resolution Rate target
≥90%: CSAT for AI-handled conversations
≤$0.15: Target COGS per resolution

Metric | Type | Definition
Automated Resolution Rate | Business | % of conversations resolved without human intervention
CSAT (AI-handled) | Business | Satisfaction score for Fin-resolved tickets; target ≥90%
Precision@K | RAG | Share of golden-dataset queries whose correct chunk appears in the top K retrieved results
Faithfulness Score | RAG | LLM-as-judge 1–5 scale vs. ground-truth answers
Hallucination Rate | RAG | % of answers introducing unsupported facts or violating citation rules
Gross Margin | Business | COGS per resolution ≤$0.15 against $0.99 billing = roughly 85% GM target (84.8% at exactly $0.15)
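
The Precision@K metric, as defined in the table, counts a query as a success when its known-correct chunk shows up in the top K results (many teams call this hit rate@K). A minimal sketch of that computation follows; the `golden` data shape and the function name are assumptions made for illustration.

```python
def hit_rate_at_k(golden: list[tuple[str, list[str]]], k: int) -> float:
    """Fraction of golden queries whose expected chunk is in the top k results.

    Each golden item is (expected_chunk_id, ranked_retrieved_chunk_ids).
    """
    if not golden:
        return 0.0
    hits = sum(1 for expected, retrieved in golden if expected in retrieved[:k])
    return hits / len(golden)


# Toy golden set: three queries with their ranked retrieval results.
golden = [
    ("c1", ["c1", "c9"]),        # correct chunk ranked first
    ("c2", ["c7", "c2", "c3"]),  # correct chunk ranked second
    ("c3", ["c4", "c5"]),        # correct chunk never retrieved
]
print(round(hit_rate_at_k(golden, k=1), 2))  # 0.33
print(round(hit_rate_at_k(golden, k=2), 2))  # 0.67
```

Tracking the metric at both the retrieval depth and the post-rerank depth helps separate retrieval misses (failure point 4) from reranking misses (failure point 5).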

System Prompt Specification

Hard Rules (Excerpt)

You are "Fin," an AI support agent for a multi-tenant SaaS product.

Use only provided snippets and tool outputs. Do not invent or speculate.

If retrieved context is insufficient, outdated, or contradictory — clearly say you are not certain and recommend escalation to a human agent.

Never expose content marked internal-only to end-users — summarize as needed in internal notes for agents only.

Always include citations: after each factual claim, reference the snippet IDs that support it.
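
The citation rule can be checked mechanically in post-processing. The sketch below assumes a citation format of `[S1]`, `[S2]`, and so on, placed before the sentence's closing punctuation, and it treats every sentence as a factual claim; both are simplifying assumptions, since the excerpt does not fix a citation syntax.

```python
import re

CITATION = re.compile(r"\[(S\d+)\]")  # assumed snippet-ID format, e.g. [S1]


def citation_problems(answer: str, snippet_ids: set[str]) -> list[str]:
    """List sentences with no citation or with a citation to an unknown snippet."""
    problems: list[str] = []
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    for sentence in sentences:
        cited = CITATION.findall(sentence)
        if not cited:
            problems.append(f"uncited: {sentence}")
        for sid in cited:
            if sid not in snippet_ids:
                problems.append(f"unknown snippet {sid}: {sentence}")
    return problems


answer = (
    "Open Settings and choose Security [S1]. "
    "Click Reset password [S2]. "
    "It takes about a minute."
)
for problem in citation_problems(answer, snippet_ids={"S1"}):
    print(problem)
```

An answer that returns a non-empty problem list could be regenerated, routed to the larger model, or escalated, depending on the post-processing policy.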

5 Example Interactions

Scenario | Query | Fin's Behavior
Simple FAQ | "How do I reset my password?" | Needle model → retrieve public article → 2–3 step answer + citation → no tools needed
Negative Constraint | "How do I set up round-robin for Twitter DMs?" | Explain not supported for Twitter; cite both assignment docs and channel limitation; never invent a workaround
Policy + Dynamic | "My trial ended yesterday, can I extend it?" | Tool call for trial_end_date → retrieve trial policy → personalized yes/no with rationale + citation
Internal SOP | Agent on Legacy plan deprecation issue | Retrieve internal SOP → generate internal note with workaround → propose customer-safe reply (no internal jargon)
Escalation | "There's a bug with the new webhook system" | Retrieve docs → tool check for known incidents → if unclear: gather info, escalate to #support-eng with summarized context + tags