LLM as Judge Verification System
Automatic numerical accuracy verification for generated slides using MLflow's custom prompt judge, with content-hash-based persistence and human feedback collection.
Stack / Entry Points
- Backend: MLflow 3.6+ (make_judge API), litellm 1.80+ (Databricks model routing), FastAPI verification routes
- Frontend: React auto-verification in SlidePanel, verification badge component, feedback UI
- Storage: PostgreSQL verification_map column (JSON keyed by content hash)
- MLflow: Traces logged to Databricks workspace for verification runs and human feedback
- Boot files: src/services/evaluation/llm_judge.py (core evaluator), src/api/routes/verification.py (API endpoints), src/utils/slide_hash.py (content hashing)
- Environment: Requires DATABRICKS_HOST and DATABRICKS_TOKEN for MLflow tracking
Architecture Snapshot
Slides generated/modified → Frontend (SlidePanel.tsx)
↓
Auto-verify slides without verification
(parallel API calls for each unverified slide)
↓
POST /api/verification/{slide_index}
↓
Backend (verification.py) assembles:
- Slide HTML + scripts (what LLM generated)
- Genie query results (source truth)
↓
evaluate_with_judge(genie_data, slide_content)
↓
MLflow make_judge creates custom prompt judge:
- Semantic comparison (7M = 7,000,000)
- Derived calc validation (growth % from Q1/Q2)
- Chart data accuracy (Chart.js arrays)
↓
Judge returns: score (0-100), rating, explanation, issues
↓
Backend saves to verification_map[content_hash]
↓
Frontend displays badge with rating + details popup
↓
User provides feedback (👍/👎 + optional rationale)
↓
POST /api/verification/{slide_index}/feedback
↓
mlflow.log_feedback(trace_id, ...)
→ Logged as Assessment in Databricks MLflow UI
Key Concepts / Data Contracts
1. Verification Result (Frontend ↔ Backend)
// frontend/src/types/verification.ts
interface VerificationResult {
  score: number;                  // 0-100 (internal use)
  rating: VerificationRating;     // 'green' | 'amber' | 'red' | 'error' | 'unknown'
  explanation: string;            // Human-readable assessment
  issues: Array<{                 // Specific problems found
    type: string;
    detail: string;
  }>;
  duration_ms: number;            // Verification latency
  trace_id?: string;              // MLflow trace ID for linking feedback
  genie_conversation_id?: string; // For "View Source Data" link
  error: boolean;
  error_message?: string;
  timestamp?: string;             // ISO string
  content_hash?: string;          // Hash of slide content (for persistence)
}
2. Rating Scale & Thresholds (RAG System)
The verification uses a simple RAG (Red/Amber/Green) indicator system:
# src/services/evaluation/llm_judge.py
RATING_SCORES = {
    "green": 85,  # No issues detected (≥80%)
    "amber": 65,  # Review suggested (50-79%)
    "red": 25,    # Review required (<50%)
}
# Rating thresholds:
# - green: ≥80% — All data correctly represents source
# - amber: 50-79% — Some concerns, review suggested
# - red: <50% — Significant issues, review required
# - unknown: No source data available (title slides, etc.)
| Rating | Score Range | Badge Label | User Action |
|---|---|---|---|
| 🟢 Green | ≥80% | No issues | Proceed with confidence |
| 🟡 Amber | 50-79% | Review suggested | Quick review recommended |
| 🔴 Red | <50% | Review required | Must review before using |
| ⚪ Unknown | N/A | Unable to verify | No source data available |
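The bands in the table can be written as a small mapping function. This is only a sketch: rating_for_score is a hypothetical helper, not a function from llm_judge.py, where the judge returns the rating label and RATING_SCORES supplies a representative score for it.

```python
from typing import Optional

# Representative score per rating, as listed in RATING_SCORES above.
RATING_SCORES = {"green": 85, "amber": 65, "red": 25}


def rating_for_score(score: Optional[float]) -> str:
    """Map a 0-100 score to a RAG rating (hypothetical helper)."""
    if score is None:
        return "unknown"  # no source data to verify against
    if score >= 80:
        return "green"
    if score >= 50:
        return "amber"
    return "red"


# Each representative score should fall inside its own band.
for rating, representative in RATING_SCORES.items():
    assert rating_for_score(representative) == rating
```

The loop at the end is a cheap consistency check worth keeping if the thresholds are ever changed.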
3. Content Hash Persistence
Verification results persist separately from slide content using content-hash-based storage:
# Database schema
class SessionSlideDeck(Base):
    deck_json = Column(Text)         # Slides (html, scripts, css) - NO verification
    verification_map = Column(Text)  # JSON: {"content_hash": VerificationResult}

# Hash computation (src/utils/slide_hash.py)
import hashlib

def compute_slide_hash(html: str) -> str:
    normalized = normalize_html(html)  # Strip whitespace, comments, lowercase
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]
Why content hash? Decouples verification from slide regeneration. When chat modifies slides, deck_json is overwritten but verification_map is preserved. On load, verification is merged back by matching content hashes.
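A self-contained sketch of the hashing step follows. The body of normalize_html is not shown in this document, so the version here is an assumption based on its comment (strip whitespace, strip comments, lowercase); the real normalizer in slide_hash.py may differ.

```python
import hashlib
import re


def normalize_html(html: str) -> str:
    """Assumed normalizer: drop comments, collapse whitespace, lowercase."""
    without_comments = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    collapsed = re.sub(r"\s+", " ", without_comments).strip()
    return collapsed.lower()


def compute_slide_hash(html: str) -> str:
    """First 16 hex characters of the SHA-256 of the normalized HTML."""
    normalized = normalize_html(html)
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]


original = "<div>\n  <h1>Revenue</h1>  <!-- draft -->\n</div>"
reformatted = "<div> <h1>Revenue</h1> </div>"
edited = "<div> <h1>Revenue 2024</h1> </div>"

# Cosmetic differences keep the hash; a content edit changes it.
same_after_reformat = compute_slide_hash(original) == compute_slide_hash(reformatted)
changed_after_edit = compute_slide_hash(original) != compute_slide_hash(edited)
```

Under this normalizer, whitespace or comment changes do not invalidate a stored result, while any visible edit produces a new key and triggers re-verification.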
4. Judge Prompt Structure
The custom prompt judge evaluates:
- Numerical exactness: Source 7234567 ↔ Slide 7.2M (✓ pass)
- Semantic equivalence: $7.2M, 7,200,000, ~7 million all match
- Derived calculations: "50% growth" validated against Q1/Q2 source numbers
- Chart data: Chart.js data: [7.2, 8.5, 9.1] compared to source CSV
- Format tolerance: Rounding, currency symbols, percentage conversion allowed
See src/services/evaluation/llm_judge.py::_build_judge_prompt() for full prompt text.
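The judge is an LLM, so these rules live in the prompt rather than in code. Still, a deterministic sketch can make the intended tolerance concrete. Everything below (parse_amount, roughly_equal, growth_pct, and the 5% tolerance) is hypothetical and illustrative; none of it comes from llm_judge.py.

```python
import re

# Multipliers for the suffixes used in the examples above.
_SUFFIXES = {"k": 1e3, "m": 1e6, "b": 1e9,
             "thousand": 1e3, "million": 1e6, "billion": 1e9}


def parse_amount(text: str) -> float:
    """Parse strings such as '$7.2M', '7,200,000' or '~7 million'."""
    cleaned = text.lower().replace(",", "").replace("$", "").replace("~", "").strip()
    match = re.fullmatch(r"(-?\d+(?:\.\d+)?)\s*([a-z]*)", cleaned)
    if match is None or (match.group(2) and match.group(2) not in _SUFFIXES):
        raise ValueError(f"cannot parse amount: {text!r}")
    number, suffix = match.groups()
    return float(number) * _SUFFIXES.get(suffix, 1.0)


def roughly_equal(source: float, shown: float, rel_tol: float = 0.05) -> bool:
    """True when the displayed value is within rel_tol of the source value."""
    return abs(source - shown) <= rel_tol * abs(source)


def growth_pct(previous: float, current: float) -> float:
    """Percentage growth from one period to the next."""
    return (current - previous) / previous * 100.0


source_value = 7234567
matches = [roughly_equal(source_value, parse_amount(s))
           for s in ("7.2M", "$7.2M", "7,200,000", "~7 million")]
mismatch = roughly_equal(source_value, parse_amount("9.1M"))
```

A fixed relative tolerance is a simplification: the prompt-based judge can also weigh context, for example whether "~7 million" was presented as an approximation.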
Component Responsibilities
Backend
| Path | Responsibility | APIs Touched |
|---|---|---|
| src/services/evaluation/llm_judge.py | Core judge evaluation logic using MLflow 3.x make_judge | MLflow (mlflow.genai.make_judge, mlflow.set_tracking_uri) |
| src/services/evaluation/__init__.py | Exports evaluate_with_judge, LLMJudgeResult, RATING_SCORES | None (module exports) |
| src/api/routes/verification.py | FastAPI endpoints for verification and feedback | MLflow (log_feedback), SessionManager, evaluate_with_judge |
| src/utils/slide_hash.py | HTML normalization and content hash computation | None (pure functions) |
| src/api/services/session_manager.py | Load/save verification_map, merge verification on get_slide_deck | Database |
Frontend
| Path | Responsibility | Backend Touchpoints |
|---|---|---|
| frontend/src/components/SlidePanel/SlidePanel.tsx | Auto-verification trigger, parallel verify calls, state management | api.verifySlide, api.getSlides |
| frontend/src/components/SlidePanel/VerificationBadge.tsx | Renders rating badge, details popup, feedback UI | api.submitVerificationFeedback |
| frontend/src/components/SlidePanel/SlideTile.tsx | Hosts badge, edit detection | – |
| frontend/src/services/api.ts | API client methods for verification flow | /api/verification/* endpoints |
| frontend/src/types/verification.ts | TypeScript types and utility functions (badge colors, icons) | None (types only) |
| frontend/src/components/Help/HelpPage.tsx | Verification tab with user documentation | None (UI only) |
State/Data Flow
Auto-Verification Flow
- Slides generated or modified
  - Chat completes → frontend fetches slides via api.getSlides()
  - Backend returns slides with content_hash computed for each
- Frontend triggers auto-verification
  - SlidePanel effect detects slides without verification
  - Filters out already-attempted hashes (prevents re-triggering)
  - Calls runAutoVerification() for unverified slides in parallel
- Backend verifies each slide
  - Fetch slide HTML + Genie query results
  - If no Genie data → return rating="unknown" (skip verification)
  - Call evaluate_with_judge(genie_data, slide_content)
  - Save result to verification_map[content_hash]
- Frontend displays results
  - Refresh slides to get merged verification
  - Badges appear on each slide with color-coded rating
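The selection and fan-out logic of the frontend step can be sketched as follows. The real implementation lives in SlidePanel.tsx; this Python version only illustrates the control flow. The "verification" key on each slide, the attempted set, and the verify_slide callback are assumptions made for the example.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Set


def slides_needing_verification(slides: List[dict], attempted: Set[str]) -> List[int]:
    """Indices of slides with no verification whose hash has not been tried yet."""
    return [
        index
        for index, slide in enumerate(slides)
        if slide.get("verification") is None and slide["content_hash"] not in attempted
    ]


async def run_auto_verification(
    slides: List[dict],
    attempted: Set[str],
    verify_slide: Callable[[int], Awaitable[dict]],
) -> Dict[int, dict]:
    """Verify all pending slides concurrently and return results by index."""
    pending = slides_needing_verification(slides, attempted)
    # Record hashes before awaiting so a second trigger cannot duplicate calls.
    attempted.update(slides[i]["content_hash"] for i in pending)
    results = await asyncio.gather(*(verify_slide(i) for i in pending))
    return dict(zip(pending, results))
```

Marking hashes as attempted before the calls complete is what keeps a re-render from triggering the same verification twice, including for slides whose verification failed.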
Verification Persistence Behavior
| Action | Verification Behavior |
|---|---|
| Generate new slides | All slides auto-verified (no hash match) |
| Edit a slide | Only edited slide re-verified (hash changed) |
| Add slides via chat | Existing slides keep verification, new slides auto-verified |
| Delete a slide | Other slides keep verification (different hashes) |
| Reorder slides | All slides keep verification (position-independent) |
| Restore session | Verification merged back by hash match |
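The table follows directly from keying results by content hash instead of slide position. A minimal sketch of the merge step, assuming slides are dicts with an "html" field and using a simplified stand-in for compute_slide_hash (both are assumptions for illustration):

```python
import hashlib
from typing import Dict, List


def slide_hash(html: str) -> str:
    """Stand-in for compute_slide_hash (full normalization omitted for brevity)."""
    return hashlib.sha256(html.strip().lower().encode()).hexdigest()[:16]


def merge_verification(slides: List[dict], verification_map: Dict[str, dict]) -> List[dict]:
    """Attach the stored result to every slide whose content hash has one."""
    merged = []
    for slide in slides:
        content_hash = slide_hash(slide["html"])
        merged.append({
            **slide,
            "content_hash": content_hash,
            "verification": verification_map.get(content_hash),  # None if unverified
        })
    return merged


deck = [{"html": "<h1>Q1</h1>"}, {"html": "<h1>Q2</h1>"}]
stored = {slide_hash(s["html"]): {"rating": "green"} for s in deck}

# Reordering keeps every result; editing one slide drops only that slide's result.
reordered = merge_verification(list(reversed(deck)), stored)
edited = merge_verification([{"html": "<h1>Q1 revised</h1>"}, deck[1]], stored)
```

Slides that come back with verification set to None are the ones the auto-verification step would pick up next.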
Verification and Save Points
Save points use a two-phase approach to ensure both deck content integrity and verification score preservation:
- Backend creates save point immediately after deck persistence (no verification yet for new edits)
- Frontend calls sync-verification after auto-verification completes, backfilling scores onto the latest save point via POST /api/slides/versions/sync-verification
This decoupling prevents the race condition where verification timing (especially fast unable_to_verify in no-Genie mode) could cause save points to capture stale deck state.
Human Feedback Flow
- User clicks verification badge → popup shows details
- User provides feedback via 👍/👎 buttons
- 👎 opens rationale input → user explains issue
- Frontend calls api.submitVerificationFeedback(... , traceId)
- Backend logs to MLflow as structured Assessment
- Feedback visible in MLflow UI under original trace
Interfaces / API Table
Backend REST API
| Method | Path | Request Body | Response | Purpose |
|---|---|---|---|---|
| POST | /api/verification/{slide_index} | { session_id } | VerifySlideResponse | Verify slide accuracy |
| POST | /api/verification/{slide_index}/feedback | { session_id, is_positive, rationale?, trace_id? } | { status, message, linked_to_trace } | Submit human feedback |
| GET | /api/verification/genie-link?session_id=... | – | { has_genie_conversation, url?, message } | Get Genie conversation URL |
Frontend API Client (api.ts)
// Verify slide
api.verifySlide(sessionId: string, slideIndex: number): Promise<VerificationResult>
// Submit feedback
api.submitVerificationFeedback(
  sessionId: string,
  slideIndex: number,
  isPositive: boolean,
  rationale?: string,
  traceId?: string
): Promise<{ status, message, linked_to_trace }>
// Get Genie link
api.getGenieLink(sessionId: string): Promise<{
  has_genie_conversation,
  url?,
  message
}>
MLflow APIs Used
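# Feedback logging
import mlflow
from mlflow.entities import AssessmentSource, AssessmentSourceType

mlflow.set_tracking_uri("databricks")
mlflow.log_feedback(
    trace_id=trace_id,                  # Trace produced by the verification run
    name="human_verification_feedback",
    value=is_positive,                  # True for 👍, False for 👎
    rationale=user_comment,             # Optional free-text explanation
    source=AssessmentSource(
        source_type=AssessmentSourceType.HUMAN,
        source_id=f"session_{session_id[:8]}",
    ),
    metadata={"session_id": session_id, "slide_index": slide_index},
)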
Operational Notes
Error Handling
- No Genie data (title slides, no-query sessions)
  - Backend returns score=0, rating="unknown"
  - Frontend shows gray badge: "? Unknown"
  - Not an error – expected for non-data slides
- MLflow judge failure (network, model timeout)
  - Backend catches exception, returns error=true, error_message
  - Frontend shows red badge: "! Error"
  - Error result not persisted
- Feedback submission failure
  - If log_feedback() fails, error logged but request returns 200
  - Response includes linked_to_trace: false
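The three cases above can be sketched with the judge and the feedback logger injected as plain callables, so the control flow is visible without MLflow. The function names, the score used for error results, the "status" and "message" values, and the decision not to store the "unknown" result are all assumptions for illustration; only the returned fields named above come from this document.

```python
from typing import Callable, Dict, Optional


def verify_slide(
    slide_html: str,
    genie_data: Optional[str],
    content_hash: str,
    verification_map: Dict[str, dict],
    judge: Callable[[str, str], dict],
) -> dict:
    """Verify one slide and persist only successful judge results."""
    if not genie_data:
        # Expected for title slides: nothing to compare against, not an error.
        return {"score": 0, "rating": "unknown", "error": False}
    try:
        result = judge(genie_data, slide_html)
    except Exception as exc:
        # Judge failure: surface it to the caller, but do not persist it.
        return {"score": 0, "rating": "error", "error": True, "error_message": str(exc)}
    result = {**result, "error": False, "content_hash": content_hash}
    verification_map[content_hash] = result
    return result


def submit_feedback(trace_id: Optional[str], log_feedback: Callable[[str], None]) -> dict:
    """Never fail the request; report whether the feedback reached the trace."""
    linked = False
    if trace_id:
        try:
            log_feedback(trace_id)
            linked = True
        except Exception:
            linked = False  # real code would also log the failure here
    return {"status": "success", "message": "Feedback received", "linked_to_trace": linked}
```

Because error results are never written to verification_map, the hash stays unverified in storage, so a later session load can attempt verification again.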
Logging & Tracing
- Verification events: Structured logs with session_id, slide_index, score, rating, content_hash
- MLflow traces: All verifications logged to Databricks
- Performance: Judge latency typically 1-3 seconds
Configuration
# src/services/evaluation/llm_judge.py
DATABRICKS_MODEL = "databricks-claude-3-5-sonnet"
JUDGE_TEMPERATURE = 0 # Deterministic judgments
# MLflow tracking
mlflow.set_tracking_uri("databricks")
Database Migration
If upgrading from a version without verification_map:
ALTER TABLE session_slide_decks ADD COLUMN verification_map TEXT;
Backward compatible – NULL treated as empty dict {}.
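A minimal sketch of the NULL handling when reading the column; load_verification_map is a hypothetical helper name, and the real logic in session_manager.py may be structured differently.

```python
import json
from typing import Dict, Optional


def load_verification_map(raw: Optional[str]) -> Dict[str, dict]:
    """Decode the verification_map column; NULL or empty text means no results."""
    if not raw:
        return {}
    return json.loads(raw)


legacy_row = load_verification_map(None)  # row created before the migration
current_row = load_verification_map('{"a1b2c3d4e5f60718": {"rating": "green", "score": 85}}')
```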
Extension Guidance
Adding New Validation Checks
- Update judge prompt in src/services/evaluation/llm_judge.py::_build_judge_prompt()
- Add new issue types to prompt
- Update frontend VerificationResult interface if needed
- Add test cases in test_llm_judge_spike.py
Changing Rating Thresholds
The RAG system uses these thresholds:
- Green: ≥80% (judge returns "green")
- Amber: 50-79% (judge returns "amber")
- Red: <50% (judge returns "red")
To modify:
- Update RATING_SCORES dict in llm_judge.py
- Update judge prompt instructions in JUDGE_INSTRUCTIONS
- Update frontend verification.ts::getRatingColor() and related functions
- Update Help page documentation in HelpPage.tsx
Custom Feedback Fields
- Add fields to FeedbackRequest Pydantic model
- Update frontend api.submitVerificationFeedback() call
- Include new fields in mlflow.log_feedback() metadata
Known Limitations
1. No Slide → Query Mapping
The LLM as Judge verifies each slide against all Genie query results from the session, not the specific query that generated that slide.
Why: The agent makes multiple queries during slide generation but doesn't tag which query's data goes to which slide.
Practical impact: Verification still works because the judge searches every query result in the session for matching data. However, it cannot tell you which query produced a specific slide's content.
Future consideration: Log query attribution per slide during generation.
2. Narrative Quality Not Assessed (Future Consideration)
- Phase 1 verifies numerical accuracy only
- Nice to have: Add narrative coherence scoring for story flow and logical structure
Cross-References
- Frontend Overview – React component structure
- Backend Overview – FastAPI routes and session management
- Database Configuration – Schema details including verification_map
Last Updated: 2024-12-16
Status: ✅ Production-ready (Phase 1 – Numerical Accuracy + Auto-Verification)