**Created**: November 11, 2025
**For**: Understanding the mechanics behind AI verification
**Context**: Technical implementation details for evaluating verification tools
- [[#The Verification Problem|The Verification Problem]]
    - [[#The Verification Problem#What We're Trying to Solve|What We're Trying to Solve]]
    - [[#The Verification Problem#The Core Challenge|The Core Challenge]]
- [[#RAG-Based Verification|RAG-Based Verification]]
    - [[#RAG-Based Verification#How It Works|How It Works]]
    - [[#RAG-Based Verification#Technical Components|Technical Components]]
    - [[#RAG-Based Verification#Verification Mechanism|Verification Mechanism]]
    - [[#RAG-Based Verification#Performance Metrics|Performance Metrics]]
- [[#NLI-Based Verification|NLI-Based Verification]]
    - [[#NLI-Based Verification#Natural Language Inference|Natural Language Inference]]
    - [[#NLI-Based Verification#How It's Used for Verification|How It's Used for Verification]]
    - [[#NLI-Based Verification#Technical Implementation|Technical Implementation]]
    - [[#NLI-Based Verification#NLI Models Used|NLI Models Used]]
    - [[#NLI-Based Verification#Performance|Performance]]
- [[#Multi-Agent Verification (Hebbia Approach)|Multi-Agent Verification (Hebbia Approach)]]
    - [[#Multi-Agent Verification (Hebbia Approach)#The "Agent Swarm" Architecture|The "Agent Swarm" Architecture]]
    - [[#Multi-Agent Verification (Hebbia Approach)#Key Technical Innovations|Key Technical Innovations]]
    - [[#Multi-Agent Verification (Hebbia Approach)#Hebbia's Verification Mechanisms|Hebbia's Verification Mechanisms]]
    - [[#Multi-Agent Verification (Hebbia Approach)#Performance|Performance]]
- [[#Ensemble Methods|Ensemble Methods]]
    - [[#Ensemble Methods#The Principle|The Principle]]
    - [[#Ensemble Methods#Technical Implementation|Technical Implementation]]
    - [[#Ensemble Methods#Component Methods|Component Methods]]
    - [[#Ensemble Methods#Performance|Performance]]
- [[#Comparative Analysis|Comparative Analysis]]
    - [[#Comparative Analysis#Method Comparison Matrix|Method Comparison Matrix]]
    - [[#Comparative Analysis#Cost-Benefit Analysis|Cost-Benefit Analysis]]
    - [[#Comparative Analysis#Decision Framework|Decision Framework]]
- [[#Key Takeaways|Key Takeaways]]
    - [[#Key Takeaways#Technical Principles|Technical Principles]]
    - [[#Key Takeaways#Implementation Insights|Implementation Insights]]
    - [[#Key Takeaways#Future Directions|Future Directions]]
- [[#Further Reading|Further Reading]]
## The Verification Problem
### What We're Trying to Solve
```
Input: LLM generates text
Problem: Is it factually accurate?
Constraint: Can't retrain the model every time
Goal: Verify outputs faster/cheaper than humans reading everything
```
### The Core Challenge
**LLMs have no "ground truth" detector built in**. They're trained to predict next tokens, not to verify factual accuracy. This creates three verification challenges:
1. **Hallucination**: Model generates plausible but false information
2. **Source Attribution**: Model can't show where info came from
3. **Consistency**: Same question → different answers
---
## RAG-Based Verification
### How It Works
**RAG (Retrieval-Augmented Generation)** = Fetch relevant docs → Give to LLM → Generate answer
```
┌──────────────┐
│  User Query  │
└──────┬───────┘
       │
       ▼
┌──────────────────────┐
│ 1. Convert to Vector │ ← Embedding model converts text to numbers
└──────┬───────────────┘
       │
       ▼
┌────────────────────────┐
│  2. Search Vector DB   │ ← Find semantically similar documents
│   (semantic search)    │
└──────┬─────────────────┘
       │
       ▼
┌────────────────────────┐
│ 3. Retrieve Top N Docs │ ← Usually N = 3-10
└──────┬─────────────────┘
       │
       ▼
┌─────────────────────────┐
│  4. Inject into Prompt  │ ← "Given these docs: [docs], answer: [query]"
└──────┬──────────────────┘
       │
       ▼
┌─────────────────────────┐
│ 5. LLM Generates Answer │ ← With context from retrieved docs
└──────┬──────────────────┘
       │
       ▼
┌─────────────────────────┐
│ 6. Return with Citations│ ← Shows which docs were used
└─────────────────────────┘
```
The most common verification approach: fetch relevant documents, inject into prompt, generate with citations.
![[Screenshot 2025-11-24 at 01.00.22.png]]
Standard RAG workflow: The most common verification method, but requires humans to manually check sources.
### Technical Components
**1. Embedding Model**
- Converts text to dense vectors (typically 768-1536 dimensions)
- Examples: `text-embedding-ada-002` (OpenAI), `all-MiniLM-L6-v2` (open source)
- **Purpose**: Enable semantic search (find similar meaning, not just keywords)
**2. Vector Database**
- Stores document embeddings for fast retrieval
- Examples: Pinecone, Weaviate, Chroma, FAISS
- **Search**: Uses cosine similarity or dot product to find nearest neighbors
**3. Chunking Strategy**
- Documents split into smaller pieces (chunks)
- **Challenge**: Too small = missing context; too large = irrelevant info
- **Typical size**: 256-512 tokens per chunk
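To make the chunking trade-off concrete, here is a minimal token-window chunker. It is a sketch that assumes a generic `tokenizer` object with `encode`/`decode` methods (e.g., a Hugging Face tokenizer); `chunk_document` and its defaults are illustrative, not a specific library's API.
```python
def chunk_document(text, tokenizer, chunk_size=384, overlap=64):
    """Split text into ~chunk_size-token pieces with a small overlap,
    so sentences straddling a boundary keep some surrounding context."""
    token_ids = tokenizer.encode(text)  # assumed tokenizer API
    chunks = []
    start = 0
    while start < len(token_ids):
        window = token_ids[start:start + chunk_size]
        chunks.append(tokenizer.decode(window))
        start += chunk_size - overlap  # slide forward, keeping `overlap` tokens
    return chunks
```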
**4. Retrieval Methods**
```python
# Dense Retrieval (semantic)
query_embedding = embed(query)
results = vector_db.search(query_embedding, top_k=5)
# Sparse Retrieval (keyword)
results = bm25_search(query, documents)
# Hybrid (best of both)
dense_results = semantic_search(query)
sparse_results = keyword_search(query)
final_results = rerank(dense_results + sparse_results)
```
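The `rerank` step above is left abstract. One common, lightweight choice is reciprocal rank fusion (RRF); the sketch below assumes each retrieval path returns an ordered list of document ids and is illustrative rather than tied to any particular library.
```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked lists of doc ids into one fused ranking.
    Each doc earns 1 / (k + rank) per list it appears in; k dampens
    the influence of any single list (60 is the commonly used default)."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused_ids = reciprocal_rank_fusion([dense_ids, sparse_ids])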
### Verification Mechanism
RAG verifies through **source grounding**:
```
User: "What's the capital of France?"
Without RAG:
LLM: "The capital of France is Paris."
Problem: Could be hallucinated, no source
With RAG:
1. Retrieve: [doc1: "France, capital Paris", doc2: "Paris, French capital"]
2. Generate: "The capital of France is Paris."
3. Cite: [Source: doc1, doc2]
Benefit: Human can check sources
```
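A minimal sketch of steps 4-6 (prompt injection plus numbered citations), assuming placeholder `vector_db.search` and `llm.generate` calls and result objects with `.text`/`.source` attributes rather than any specific framework:
```python
def answer_with_citations(query, vector_db, llm, top_k=5):
    """Retrieve chunks, number them, and ask the model to cite by number."""
    docs = vector_db.search(query, top_k=top_k)  # placeholder retrieval call
    context = "\n".join(f"[{i + 1}] {d.text}" for i, d in enumerate(docs))
    prompt = (
        "Answer using ONLY the sources below and cite them like [1].\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )
    answer = llm.generate(prompt)  # placeholder generation call
    citations = [(i + 1, d.source) for i, d in enumerate(docs)]
    return answer, citations
```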
### Performance Metrics
**Standard RAG** (like ChatGPT with web search):
- Accuracy: ~68% on complex queries
- Speed: 2-5 seconds per query
- Cost: $0.001-0.01 per query
**Limitations**:
1. **Retrieval Fails**: Relevant info not found → hallucination
2. **Context Loss**: Chunks miss important context
3. **Ranking Errors**: Wrong docs ranked higher
4. **Citation Lag**: Still need human to verify sources
---
## NLI-Based Verification
### Natural Language Inference
**NLI Task**: Given two sentences, determine relationship:
- **Entailment**: if the premise is true, the hypothesis must be true
- **Contradiction**: if the premise is true, the hypothesis cannot be true
- **Neutral**: the premise neither supports nor contradicts the hypothesis
### How It's Used for Verification
```
Premise (Source): "The company reported $10M revenue in Q3"
Hypothesis (LLM Output): "Q3 revenue was $10 million"
NLI Model: ENTAILMENT ✓
vs.
Premise (Source): "The company reported $10M revenue in Q3"
Hypothesis (LLM Output): "Q3 revenue was $12 million"
NLI Model: CONTRADICTION ✗
```
Use inference models to check if generated text is logically supported by source documents.
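As a concrete sketch, the pair check above can be run with an off-the-shelf NLI model from Hugging Face; `microsoft/deberta-large-mnli` is one possible checkpoint, and the label names are read from the model's own config rather than hard-coded.
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "microsoft/deberta-large-mnli"  # one possible NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def nli_verdict(premise, hypothesis):
    """Return (label, probability) for a premise/hypothesis pair."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    idx = int(probs.argmax())
    return model.config.id2label[idx], float(probs[idx])

print(nli_verdict("The company reported $10M revenue in Q3",
                  "Q3 revenue was $12 million"))  # expect a contradiction label
```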
![[Screenshot 2025-11-24 at 01.02.12.png]]
![[Screenshot 2025-11-24 at 01.02.20.png]]
Two NLI approaches: SelfCheckGPT uses consistency across samples; RefChecker compares knowledge triplets against sources.
### Technical Implementation
**Method 1: SelfCheckGPT NLI**
```python
# Step 1: Generate multiple samples for the same prompt
samples = []
for i in range(20):
    samples.append(llm.generate(prompt))

# Step 2: Split into sentences
sentences = []
for sample in samples:
    sentences.extend(split_sentences(sample))

# Step 3: Check each sentence for consistency
for sentence in sentences:
    # Use an NLI model (e.g., DeBERTa) to check if the sentence
    # is entailed by the other samples
    consistency_scores = []
    for other_sample in samples:
        score = nli_model(premise=other_sample, hypothesis=sentence)
        consistency_scores.append(score)

    # If low consistency → hallucination
    if sum(consistency_scores) / len(consistency_scores) < 0.35:
        flag_as_hallucination(sentence)
```
**Key Insight**: If LLM knows something, it should be consistent across multiple generations. Inconsistency = uncertainty = likely hallucination.
**Method 2: RefChecker (Amazon)**
```
┌─────────────────┐
│  LLM Response   │
└────────┬────────┘
         │
         ▼
┌───────────────────────────┐
│ Extract Knowledge Triplets│ ← <subject, predicate, object>
│ Example: <Apple, revenue, │
│   $394B in 2023>          │
└────────┬──────────────────┘
         │
         ▼
┌──────────────────────────┐
│ Compare with Source Docs │
│ Using NLI Model          │
└────────┬─────────────────┘
         │
         ▼
┌──────────────────────────┐
│ Classify Each Triplet:   │
│ • Entailment (supported) │
│ • Contradiction (wrong)  │
│ • Neutral (can't verify) │
└────────┬─────────────────┘
         │
         ▼
┌──────────────────────────┐
│  Report Hallucinations   │
└──────────────────────────┘
```
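A simplified sketch of that loop (not Amazon's actual implementation): `extract_triplets` stands in for an LLM-based extraction call, and `nli_verdict` is the pair checker sketched in the NLI section above.
```python
def check_response(response, source_docs, extract_triplets, nli_verdict):
    """RefChecker-style loop (sketch): turn the response into atomic claims,
    then label each claim against the concatenated source documents."""
    premise = "\n".join(source_docs)
    report = []
    for subj, pred, obj in extract_triplets(response):  # e.g. an LLM extraction call
        claim = f"{subj} {pred} {obj}"                  # render the triplet as a sentence
        label, prob = nli_verdict(premise, claim)
        report.append({"claim": claim, "label": label, "confidence": prob})
    return report
```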
### NLI Models Used
**Common Choices**:
1. **DeBERTaV3**: Fine-tuned on NLI datasets, ~90% accuracy
2. **RoBERTa-NLI**: Lightweight, fast inference
3. **T5-based**: TRUE/TrueTeacher models
**Training Data**:
- SNLI (Stanford Natural Language Inference)
- MNLI (Multi-Genre NLI)
- DocNLI (Document-level NLI)
### Performance
**SelfCheckGPT**:
- Calibration: a hallucination score of 0.8 means roughly 80% of sentences flagged at that level are actually hallucinations
- Speed: ~1 second per sentence (20 samples = ~20 seconds)
- Cost: 20x generation cost
**RefChecker**:
- Precision: ~85-90% on detecting hallucinations
- Recall: ~75-80% (misses some hallucinations)
- Speed: ~2-3 seconds per response
---
## Multi-Agent Verification (Hebbia Approach)
### The "Agent Swarm" Architecture
Unlike single-model RAG, Hebbia uses multiple specialized agents working in parallel.
```
User Query: "Compare termination clauses in 100 contracts"
┌────────────────────────┐
│ Orchestrator Agent │ ← Breaks down complex query
└───────┬────────────────┘
│
│ Decomposes into sub-tasks:
│
├──► Task 1: Retrieve all 100 contracts
│
├──► Task 2: Identify termination clauses in each
│
├──► Task 3: Extract key terms from each clause
│
├──► Task 4: Categorize by clause type
│
└──► Task 5: Generate comparison matrix
│
▼
┌─────────────────────┐
│ Specialized Agents │ ← Each handles one task type
└─────────────────────┘
│
│ Agents work in parallel:
│
┌─────────┴─────────────┬──────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌──────────────┐ ┌─────────────┐
│ ReadAgent │ │ ExtractAgent │ │ CompareAgent│
│ (o1 for │ │ (GPT-4o for │ │ (GPT-4o for │
│ reasoning) │ │ extraction) │ │ synthesis) │
└───────┬───────┘ └──────┬───────┘ └─────┬───────┘
│ │ │
└────────────────────┴───────────────────┘
│
▼
┌──────────────────────┐
│ OutputAgent │ ← Formats final result
│ (structured output) │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Matrix Grid Interface│ ← Spreadsheet view
│ With Citations │
└──────────────────────┘
```
Advanced platforms use specialized agents working in parallel with transparent verification at each step.
![[Screenshot 2025-11-24 at 01.03.06.png]]
Multi-agent architecture: Task decomposition → Specialized agents work in parallel → Each verifies its own output → Aggregated with full citations.
### Key Technical Innovations
**1. Task Decomposition**
```python
def decompose_query(complex_query):
    # Use LLM to break down the query
    decomposition = orchestrator_llm.call(
        prompt=f"Break this into atomic steps: {complex_query}"
    )
    # Example output:
    #   Step 1: Retrieve contracts mentioning "termination"
    #   Step 2: For each contract, extract termination clause
    #   Step 3: For each clause, identify: notice period, conditions, penalties
    #   Step 4: Create comparison table
    return decomposition.steps
```
**2. Dynamic Model Routing**
```python
def route_task(task):
    if task.requires_deep_reasoning:
        return o1_model      # OpenAI o1 for complex reasoning
    elif task.requires_vision:
        return gpt4o_vision  # For charts, tables
    elif task.is_simple_extraction:
        return gpt4o_mini    # Cheaper for simple tasks
    else:
        return gpt4o         # General purpose
```
**3. "Infinite Context Window"**
Not actually infinite, but a clever workaround:
```
Traditional LLM: 128K token limit → Can't process 100 documents at once
Hebbia Approach:
1. Index all 100 documents in vector DB
2. For each sub-task, retrieve only relevant chunks
3. Each agent works on <128K tokens
4. Combine results across agents
5. Result: Effectively "infinite" by not loading everything at once
```
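A rough sketch of the same idea, with `vector_db.search`, `chunk.num_tokens`, `task.query`, `task.prompt`, and `agent_llm` as placeholders rather than Hebbia's actual interfaces: each sub-task retrieves only what it needs and stays under the model's context budget, and the partial results are combined downstream.
```python
def run_subtasks(subtasks, vector_db, agent_llm, token_budget=100_000):
    """For each sub-task, retrieve only the chunks it needs, respect the
    context budget, and collect partial results to aggregate later."""
    partials = []
    for task in subtasks:
        chunks, used = [], 0
        for chunk in vector_db.search(task.query, top_k=50):  # placeholder search
            if used + chunk.num_tokens > token_budget:
                break                                         # stay under the context limit
            chunks.append(chunk.text)
            used += chunk.num_tokens
        partials.append(agent_llm.generate(task.prompt(chunks)))
    return partials  # an aggregation step combines these across agents
```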
**4. Verification Through Transparency**
Every agent step is visible:
```
Matrix Grid View:

┌────────────┬─────────────────┬──────────────────┐
│ Contract   │ Termination     │ Notice Period    │
│            │ Clause (source) │ (extracted)      │
├────────────┼─────────────────┼──────────────────┤
│ Contract-1 │ [📄 p. 12]      │ 30 days [📄 p.12] │
│ Contract-2 │ [📄 p. 8]       │ 60 days [📄 p.8]  │
│ Contract-3 │ [📄 p. 15]      │ 90 days [📄 p.15] │
└────────────┴─────────────────┴──────────────────┘

Click on [📄 p. 12] → See exact source text
User can verify: "Did the AI extract correctly?"
```
### Hebbia's Verification Mechanisms
**1. Source Tracking**
- Every cell in Matrix links to exact source
- Hover → see verbatim text from document
- Citations at token level, not document level
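One way to model this kind of cell-level provenance (a sketch; the field names and example values are illustrative, not Hebbia's schema):
```python
from dataclasses import dataclass

@dataclass
class SourcedCell:
    value: str   # what is shown in the grid, e.g. "30 days"
    doc_id: str  # which document it came from
    page: int    # page of the cited passage
    quote: str   # verbatim source text shown on hover

cell = SourcedCell(
    value="30 days",
    doc_id="Contract-1",
    page=12,
    quote="Either party may terminate this agreement upon thirty (30) days written notice.",
)
```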
**2. Agent Consensus**
```python
# For critical extractions, use multiple agents and compare results
def extract_with_consensus(clause):
    results = []
    for agent in [agent1, agent2, agent3]:
        results.append(agent.extract(clause))

    # Verify consistency
    if len(set(results)) == 1:
        return results[0]  # All agree
    else:
        return flag_for_human_review(results)
```
**3. Iterative Refinement**
```
User: "Extract revenue"
Agent: Returns $50M
User: "That's wrong, check again"
Agent: Re-runs with adjusted prompt → Returns $53.2M
System: Learns from correction
```
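One simple way to wire up such a feedback loop is to store corrections and prepend them to later prompts; this is an illustrative sketch, not Hebbia's mechanism.
```python
class CorrectionMemory:
    """Keep user corrections and replay them as extra prompt instructions."""

    def __init__(self):
        self.notes = []

    def add(self, field, wrong, right):
        self.notes.append(f"For '{field}', {wrong} was wrong; the correct value was {right}.")

    def augment(self, prompt):
        if not self.notes:
            return prompt
        return "Known corrections:\n" + "\n".join(self.notes) + "\n\n" + prompt

memory = CorrectionMemory()
memory.add("revenue", "$50M", "$53.2M")
# Next run: agent_llm.generate(memory.augment("Extract revenue from the filing"))
```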
### Performance
**Hebbia Matrix**:
- Accuracy: 92% (vs 68% standard RAG)
- Speed: Seconds for 1000s of documents
- Verification: Built-in source links
- Cost: High (multiple model calls) but justified by accuracy
**Why It Works Better**:
1. **Specialization**: Each agent optimized for its task
2. **Parallel Processing**: Faster than sequential
3. **Better Retrieval**: Advanced indexing beats standard RAG
4. **Transparency**: Every step visible = easier verification
---
## Ensemble Methods
### The Principle
Use multiple verification methods, vote on result.
```
Input: LLM Output to verify
Method 1: NLI Check → 85% confidence "Correct"
Method 2: NER Check → 90% confidence "Correct"
Method 3: Span Detection → 70% confidence "Hallucination"
Method 4: Consistency Check → 80% confidence "Correct"
Ensemble Vote: 3 "Correct", 1 "Hallucination"
→ Final: "Correct" (but flagged for review due to dissent)
```
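A minimal sketch of that vote-plus-dissent rule (the verdict labels and threshold are illustrative):
```python
def ensemble_vote(method_results, dissent_threshold=1):
    """method_results: list of (verdict, confidence) pairs from each method.
    Majority wins; any dissenting vote flags the item for human review."""
    correct = [c for v, c in method_results if v == "Correct"]
    wrong = [c for v, c in method_results if v != "Correct"]
    verdict = "Correct" if len(correct) > len(wrong) else "Hallucination"
    needs_review = min(len(correct), len(wrong)) >= dissent_threshold
    return verdict, needs_review

print(ensemble_vote([("Correct", 0.85), ("Correct", 0.90),
                     ("Hallucination", 0.70), ("Correct", 0.80)]))
# -> ('Correct', True)   # correct overall, but flagged for review due to dissent
```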
Combine multiple verification methods and vote on results for maximum accuracy.
![[Screenshot 2025-11-24 at 01.04.01.png]]
Ensemble method: Run multiple verification techniques, combine signals with gradient boosting, achieve 5-10% higher accuracy than single methods.
### Technical Implementation
**Production System Example** (from research):
```python
class HallucinationDetector:
    def __init__(self):
        self.ner_model = load_ner_model()  # Named Entity Recognition
        self.nli_model = load_nli_model()  # Natural Language Inference
        self.sbd_model = load_sbd_model()  # Span-Based Detection
        self.gbdt = load_gbdt_model()      # Gradient Boosting Decision Tree

    def detect(self, response, source_docs):
        # Run all methods
        ner_score = self.check_entities(response, source_docs)
        nli_score = self.check_entailment(response, source_docs)
        sbd_score = self.check_spans(response, source_docs)

        # Ensemble with GBDT (probability of the "hallucination" class)
        features = [ner_score, nli_score, sbd_score]
        hallucination_prob = self.gbdt.predict_proba([features])[0][1]

        if hallucination_prob > 0.7:
            return "HALLUCINATION"
        elif hallucination_prob < 0.3:
            return "VERIFIED"
        else:
            return "UNCERTAIN"
```
### Component Methods
**1. Named Entity Recognition (NER)**
```python
# Check if entities in the response exist in the source
source_entities = ner_model.extract(source_docs)
response_entities = ner_model.extract(response)

hallucinated_entities = []
for entity in response_entities:
    if entity not in source_entities:
        hallucinated_entities.append(entity)

score = 1 - (len(hallucinated_entities) / len(response_entities))
```
**2. Span-Based Detection (SBD)**
```python
# Mark each span in the response as verified or not
spans = split_into_spans(response)

verified_spans = []
for span in spans:
    # Check if the span appears in the source (with some fuzzy matching)
    if fuzzy_match(span, source_docs, threshold=0.8):
        verified_spans.append(span)

coverage = len(verified_spans) / len(spans)
```
**3. Gradient Boosting Decision Trees**
```python
from sklearn.ensemble import GradientBoostingClassifier

# Train on labeled data
X_train = [
    [ner_score1, nli_score1, sbd_score1],  # Example 1
    [ner_score2, nli_score2, sbd_score2],  # Example 2
    ...
]
y_train = [0, 1, ...]  # 0 = verified, 1 = hallucination

gbdt = GradientBoostingClassifier()
gbdt.fit(X_train, y_train)

# Use for new predictions
new_scores = [ner_score, nli_score, sbd_score]
prediction = gbdt.predict([new_scores])
```
### Performance
**Ensemble vs Single Methods**:
- Accuracy: +5-10% over best single method
- Robustness: Handles edge cases better
- Latency: 2-3x slower (multiple models)
- Cost: Higher (running multiple methods)
**When to Use**:
- High-stakes applications (legal, medical)
- When single-method confidence is low
- Production systems with strict accuracy requirements
---
## Comparative Analysis
### Method Comparison Matrix
![[Screenshot 2025-11-24 at 01.04.41.png]]
### Cost-Benefit Analysis
**Example: Legal Document Review**
**Traditional Human Review**:
- Speed: 2-3 hours per document
- Cost: $400-600 per document ($200/hr lawyer)
- Accuracy: 95-98%
- Verification: N/A (human is ground truth)
**Standard RAG**:
- Speed: 2-3 minutes per document
- Cost: $0.50 per document
- Accuracy: 68%
- Verification Cost: $200 (1 hour human review)
- **Total**: $200.50 per document, 1 hour total
- **ROI**: Marginal (human verification of sources consumes most of the time and cost savings)
**Multi-Agent (Hebbia)**:
- Speed: 30 seconds per document
- Cost: $5 per document
- Accuracy: 92%
- Verification Cost: $50 (15 min spot-check)
- **Total**: $55 per document, 15 min total
- **ROI**: **90% time savings, 86% cost savings**
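The savings figures follow from the numbers above, taking the low end of the human cost range ($400) and a 2.5-hour review as the baseline:
```python
human_cost, human_minutes = 400, 150    # $200/hr lawyer, ~2.5 h review
hebbia_cost, hebbia_minutes = 55, 15    # $5 run + $50 spot-check

cost_savings = 1 - hebbia_cost / human_cost          # 0.8625 → ~86%
time_savings = 1 - hebbia_minutes / human_minutes    # 0.90   → 90%
print(f"{cost_savings:.0%} cost savings, {time_savings:.0%} time savings")
```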
### Decision Framework
**Choose Standard RAG when**:
- Low-stakes applications
- Simple queries
- Speed > accuracy
- Budget constrained
**Choose NLI-Based when**:
- Need hallucination detection
- Can tolerate latency
- Post-processing is acceptable
- Have labeled training data
**Choose Multi-Agent when**:
- Complex analytical tasks
- High-stakes decisions
- Need transparency
- ROI justifies cost
**Choose Ensemble when**:
- Highest accuracy required
- Have labeled data for training
- Latency acceptable
- High-stakes + production system
---
## Key Takeaways
### Technical Principles
1. **No Silver Bullet**: All methods have trade-offs
2. **Verification ≠ Elimination**: Reduces but doesn't eliminate hallucinations
3. **Transparency Matters**: Verifiable systems need source tracking
4. **Context is King**: Better retrieval → better verification
![[Screenshot 2025-11-24 at 01.05.37.png]]
### Implementation Insights
1. **Start Simple**: RAG is good enough for 80% of use cases
2. **Measure First**: Track accuracy before optimizing
3. **Build Iteratively**: Add verification layers as needed
4. **User Feedback**: Critical for improving accuracy over time
### Future Directions
1. **Real-time Verification**: Verify during generation, not after
2. **Uncertainty Quantification**: Model should know when it doesn't know
3. **Formal Verification**: Provable correctness for critical applications
4. **Self-Improving**: Systems that learn from corrections
---
## Further Reading
**Papers**:
- RAG: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al., 2020)
- SelfCheckGPT: "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models" (Manakul et al., 2023)
- RefChecker: Amazon Science Blog, 2024
- Multi-Agent: Hebbia blog posts on Matrix architecture
**Tools**:
- LangChain (RAG framework)
- LlamaIndex (RAG framework)
- Haystack (NLP framework with verification)
- Hebbia Matrix (production multi-agent system)
**Benchmarks**:
- HaluEval: Hallucination evaluation dataset
- RAGTruth: Human-labeled RAG outputs
- BEIR: Information retrieval benchmark
---
_For questions or discussion: This is a living document based on November 2025 research._