Pure Evaluation Dashboard
Research-oriented metrics that evaluate RAG retrieval quality and LLM generation quality in isolation from operational noise. These metrics are designed for reproducible experiments and model benchmarking.
Methodology: Retrieval quality is measured via latency (time to retrieve context, in milliseconds), top-k similarity scores (average cosine similarity of the returned chunks), and cross-source retrieval rate (percentage of queries that returned both textbook and uploaded chunks). Similarity is computed on L2-normalized embeddings (all-MiniLM-L6-v2, 384-dim) via NumPy's np.dot; because the vectors have unit length, the dot product equals the cosine similarity.
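As a rough illustration of the similarity computation described above, the sketch below L2-normalizes a query vector and a set of chunk vectors, scores them with np.dot, and averages the top-k scores. The function names, the default k, and the epsilon guard are assumptions made for this example, and loading the embedding model is left out; it is not the dashboard's actual code.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each vector (last axis) to unit length; guard against zero norms."""
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def top_k_similarity(query_vec, chunk_vecs, k=5):
    """Return (top-k cosine scores in descending order, their mean).

    Hypothetical helper: `k` and the return shape are assumptions.
    """
    q = l2_normalize(np.asarray(query_vec, dtype=np.float64))
    c = l2_normalize(np.asarray(chunk_vecs, dtype=np.float64))
    scores = np.dot(c, q)  # unit vectors, so dot product == cosine similarity
    top = np.sort(scores)[::-1][:k]
    return top, float(top.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    query = rng.normal(size=384)          # stand-in for a 384-dim embedding
    chunks = rng.normal(size=(20, 384))   # stand-in for 20 chunk embeddings
    top, avg = top_k_similarity(query, chunks, k=5)
    print(top.shape, avg)
```

Normalizing once at indexing time and storing unit vectors would avoid repeating the division on every query; the sketch normalizes on each call only to stay self-contained.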
Similarity Score Distribution
Retrieval Metrics at a Glance
| Metric | Value |
|---|---|
| Average Latency | 1479.5 ms |
| Average Similarity | 0.554 |
| Cross-Source Rate | 24.5% |
| Total Logs | 196 |
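The summary values in this table could be aggregated from per-query retrieval logs along the following lines. This is only a sketch: the RetrievalLog record, its field names, and the source labels "textbook" and "uploaded" are hypothetical stand-ins for whatever the database actually stores.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RetrievalLog:
    """Hypothetical per-query log record; field names are assumptions."""
    latency_ms: float
    avg_similarity: float
    sources: frozenset  # source types of the returned chunks


def summarize(logs):
    """Aggregate latency, similarity, and cross-source rate over all logs."""
    if not logs:
        return {"avg_latency_ms": 0.0, "avg_similarity": 0.0,
                "cross_source_rate": 0.0, "total_logs": 0}
    # A query counts as cross-source when it returned both chunk types.
    cross = sum(1 for log in logs if {"textbook", "uploaded"} <= log.sources)
    return {
        "avg_latency_ms": mean(log.latency_ms for log in logs),
        "avg_similarity": mean(log.avg_similarity for log in logs),
        "cross_source_rate": 100.0 * cross / len(logs),
        "total_logs": len(logs),
    }
```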
Methodology: Generation quality is assessed via per-type success rates, provider reliability comparisons, output validation pass rates, and average output lengths. Success means a non-empty, structurally valid output was produced. Validation is an additional structural check on the output content.
GenerationMetric Performance
| Type | Success | Validation | Avg Duration | Cache Hit | Avg Length |
|---|---|---|---|---|---|
| summary | 87.9% | 87.9% | 35333 ms | 40.2% | 2126 |
| quiz | 98.3% | 98.3% | 29685 ms | 50.9% | 2969 |
| recommendations | 100.0% | 100.0% | 1300 ms | 0.8% | 3092 |
Methodology: Content lineage verifies that AI-generated summaries and quiz questions actually contain terms from the uploaded source material. A keyword overlap ratio is computed between the source chunks fed into the AI prompt and the generated output. A score ≥ 0.10 (10% shared vocabulary) is considered verified. Low scores (< 0.05) flag potential hallucinations or heavy reliance on textbook cross-references.
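One plausible way to compute a keyword overlap ratio of this kind is sketched below. The tokenization, the small stopword list, and the choice to divide by the number of distinct keywords in the generated output are all assumptions; the page states only the 0.10 and 0.05 thresholds, not the exact formula.

```python
import re

# Small illustrative stopword list (an assumption, not the dashboard's list).
STOPWORDS = frozenset({
    "the", "and", "for", "that", "with", "this", "which", "what",
    "are", "was", "from",
})


def keywords(text):
    """Lowercased alphanumeric tokens longer than 2 characters, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if len(w) > 2 and w not in STOPWORDS}


def lineage_score(source_chunks, output):
    """Fraction of the output's distinct keywords that also occur in the sources."""
    out_terms = keywords(output)
    if not out_terms:
        return 0.0
    src_terms = set().union(*(keywords(chunk) for chunk in source_chunks))
    return len(out_terms & src_terms) / len(out_terms)


def lineage_status(score):
    """Apply the thresholds stated above: >= 0.10 verified, < 0.05 flagged low."""
    status = "Verified" if score >= 0.10 else "Unverified"
    return status, score < 0.05
```

Scores between 0.05 and 0.10 come back here as "Unverified" without the low-score flag, since the methodology does not say how that band is treated.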
Question Lineage Score Distribution
Recent Questions — Lineage Audit
| Q# | Question | Score | Status | Chunks |
|---|---|---|---|---|
| 11 | What is the primary purpose of computer programming? | 0.021 | Unverified | 8 |
| 12 | In the programming cycle, which step comes directly after "Planning a… | 0.0182 | Unverified | 8 |
| 13 | What is computer programming? | 0.021 | Unverified | 8 |
| 14 | Which statement best defines a computer program? | 0.0266 | Unverified | 8 |
| 15 | In a flowchart, which symbol represents a decision point where multip… | 0.0196 | Unverified | 8 |
Data Transparency Notice
All metrics on this page are computed in real time from the live database. No cached or synthetic values are used. The evaluation methodology follows standard information-retrieval and machine-learning evaluation practices:
- Success rates are calculated as successes / (successes + failures)
- Similarity scores are cosine similarities on L2-normalized embeddings
- Retrieval latency measures end-to-end context retrieval time in milliseconds
- GenerationMetric records are captured per AI provider call
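The success-rate formula listed above, successes / (successes + failures), can be applied per generation type as in the sketch below. The input shape, a list of (type, succeeded) pairs, is a hypothetical simplification of GenerationMetric records, and rounding to one decimal simply mirrors how the table displays percentages.

```python
from collections import defaultdict


def success_rates(records):
    """Per-type success rate in percent: successes / (successes + failures).

    `records` is an iterable of (generation_type, succeeded) pairs; this
    shape is an assumption made for the example.
    """
    counts = defaultdict(lambda: [0, 0])  # type -> [successes, failures]
    for gen_type, succeeded in records:
        counts[gen_type][0 if succeeded else 1] += 1
    return {
        gen_type: round(100.0 * ok / (ok + failed), 1)
        for gen_type, (ok, failed) in counts.items()
    }
```

Types with no records never appear in the result, so the denominator is always at least one.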