Pure Evaluation Dashboard

Research-oriented metrics that evaluate RAG retrieval quality and LLM generation quality in isolation from operational noise. These metrics are designed for reproducible experiments and model benchmarking.

RAG Retrieval Evaluation

Methodology: Retrieval quality is measured via latency (time to retrieve context, in milliseconds), top-k similarity (average cosine similarity of the returned chunks), and cross-source retrieval rate (percentage of queries that returned both textbook and uploaded chunks). Similarity is computed on L2-normalized embeddings (all-MiniLM-L6-v2, 384-dim) with NumPy's np.dot; for unit-length vectors, the dot product equals the cosine similarity.
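A minimal sketch of the similarity step described in the methodology, assuming the embeddings are already available as NumPy arrays. The function names `l2_normalize` and `top_k_similarity` and the default `k=5` are illustrative assumptions, not taken from the dashboard's code.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (hypothetical helper)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return vectors / np.clip(norms, 1e-12, None)


def top_k_similarity(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 5):
    """Return indices and cosine scores of the k most similar chunks.

    query_vec has shape (dim,), chunk_vecs has shape (n_chunks, dim).
    The value of k is an assumption; the dashboard does not state it.
    """
    q = l2_normalize(query_vec.reshape(1, -1))[0]
    chunks = l2_normalize(chunk_vecs)
    # With unit-length vectors, the dot product is the cosine similarity.
    scores = np.dot(chunks, q)
    top = np.argsort(scores)[::-1][:k]
    return top, scores[top]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    query = rng.normal(size=384)          # 384-dim, as for all-MiniLM-L6-v2
    chunks = rng.normal(size=(20, 384))   # random stand-ins for chunk embeddings
    idx, scores = top_k_similarity(query, chunks, k=5)
    print("avg top-k similarity:", float(scores.mean()))
```

Averaging the returned scores gives one per-query value; how those values are pooled across queries into the dashboard figure is not stated in the source.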

Avg Retrieval Latency: 1479.5 ms (mean across all retrieval logs)
Avg Top-k Similarity: 0.554 (cosine similarity of returned chunks)
Total Retrieval Logs: 196 (instrumented retrievals)
Cross-Source Rate: 24.5% (both textbook and uploaded chunks returned)
[Chart: Similarity Score Distribution]
Retrieval Metrics at a Glance

Metric              Value
Average Latency     1479.5 ms
Average Similarity  0.554
Cross-Source Rate   24.5%
Total Logs          196
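The aggregate retrieval figures above could be derived from per-query logs along the following lines. This is a sketch under assumptions: the `RetrievalLog` record, its field names, and the source labels "textbook" and "uploaded" are hypothetical stand-ins for whatever the live database stores, and averaging per-log mean similarity is one of several plausible pooling choices.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RetrievalLog:
    """Hypothetical shape of one instrumented retrieval."""
    latency_ms: float
    similarities: list   # cosine scores of the returned chunks
    sources: set         # e.g. {"textbook"}, {"uploaded"}, or both


def summarize_retrieval(logs):
    """Compute average latency, average similarity, and cross-source rate."""
    total = len(logs)
    if total == 0:
        return {"avg_latency_ms": 0.0, "avg_similarity": 0.0,
                "cross_source_pct": 0.0, "total_logs": 0}
    # Assumption: average the per-log mean similarity, skipping empty results.
    per_log_sim = [mean(log.similarities) for log in logs if log.similarities]
    # A log counts as cross-source when both source types were returned.
    cross = sum(1 for log in logs if {"textbook", "uploaded"} <= log.sources)
    return {
        "avg_latency_ms": mean(log.latency_ms for log in logs),
        "avg_similarity": mean(per_log_sim) if per_log_sim else 0.0,
        "cross_source_pct": 100.0 * cross / total,
        "total_logs": total,
    }


if __name__ == "__main__":
    demo = [
        RetrievalLog(1000.0, [0.6, 0.4], {"textbook", "uploaded"}),
        RetrievalLog(2000.0, [0.5], {"textbook"}),
    ]
    print(summarize_retrieval(demo))
```

As a consistency check on the reported values, a 24.5% cross-source rate over 196 logs corresponds to 48 retrievals that returned both source types.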
Model Generation Evaluation

Methodology: Generation quality is assessed via per-type success rates, provider reliability comparisons, output validation pass rates, and average output lengths. Success means a non-empty, structurally valid output was produced. Validation is an additional structural check on the output content.

Quiz Validation Pass Rate: 100.0% (structurally valid quiz outputs)
Summary Validation Pass Rate: 100.0% (structurally valid summary outputs)
Generation Metrics Recorded: Yes (instrumentation active)
GenerationMetric Performance
Type             Success  Validation  Avg Duration  Cache Hit  Avg Length
summary          87.9%    87.9%       35333 ms      40.2%      2126
quiz             98.3%    98.3%       29685 ms      50.9%      2969
recommendations  100.0%   100.0%      1300 ms       0.8%       3092
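The per-type figures above amount to a group-by over generation records. The sketch below shows one way to compute them; the `GenerationRecord` class and its field names are hypothetical, and the success and validation flags are taken as already recorded rather than re-derived from the output.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class GenerationRecord:
    """Hypothetical shape of one logged generation."""
    gen_type: str        # e.g. "summary", "quiz", "recommendations"
    success: bool        # non-empty, structurally valid output produced
    validated: bool      # passed the additional structural check
    duration_ms: float
    cache_hit: bool
    output_length: int


def per_type_stats(records):
    """Aggregate success, validation, duration, cache, and length by type."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec.gen_type].append(rec)

    stats = {}
    for gen_type, rows in groups.items():
        n = len(rows)
        stats[gen_type] = {
            "success_pct": 100.0 * sum(r.success for r in rows) / n,
            "validation_pct": 100.0 * sum(r.validated for r in rows) / n,
            "avg_duration_ms": sum(r.duration_ms for r in rows) / n,
            "cache_hit_pct": 100.0 * sum(r.cache_hit for r in rows) / n,
            "avg_length": sum(r.output_length for r in rows) / n,
        }
    return stats


if __name__ == "__main__":
    demo = [
        GenerationRecord("quiz", True, True, 100.0, True, 1000),
        GenerationRecord("quiz", False, False, 200.0, False, 3000),
    ]
    for gen_type, row in per_type_stats(demo).items():
        print(gen_type, row)
```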
Content Lineage Evaluation

Methodology: Content lineage verifies that AI-generated summaries and quiz questions actually contain terms from the uploaded source material. A keyword overlap ratio is computed between the source chunks fed into the AI prompt and the generated output. A score ≥ 0.10 (10% shared vocabulary) is considered verified. Low scores (< 0.05) flag potential hallucinations or heavy reliance on textbook cross-references.
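A minimal sketch of a keyword overlap ratio with the two thresholds named above. The source does not specify the tokenizer, the stopword list, or the denominator, so everything here is an assumption: words are lowercased alphanumeric runs longer than two characters, a small illustrative stopword set is removed, and the ratio is the share of output keywords that also appear in the source chunks.

```python
import re

VERIFIED_THRESHOLD = 0.10   # from the methodology: score >= 0.10 is verified
LOW_THRESHOLD = 0.05        # from the methodology: score < 0.05 is low lineage

# Illustrative stopword list; the real list (if any) is not documented.
STOPWORDS = frozenset({
    "the", "and", "for", "with", "that", "this", "into",
    "from", "are", "which", "what",
})


def keywords(text: str) -> set:
    """Lowercased alphanumeric tokens, minus short words and stopwords."""
    return {
        w for w in re.findall(r"[a-z0-9]+", text.lower())
        if len(w) > 2 and w not in STOPWORDS
    }


def lineage_score(source_chunks, output: str) -> float:
    """Fraction of output keywords that also occur in the source chunks."""
    out_terms = keywords(output)
    if not out_terms:
        return 0.0
    src_terms = set().union(*(keywords(chunk) for chunk in source_chunks))
    return len(out_terms & src_terms) / len(out_terms)


def lineage_flags(score: float) -> dict:
    """Apply the two documented thresholds to a score."""
    return {
        "verified": score >= VERIFIED_THRESHOLD,
        "low_lineage": score < LOW_THRESHOLD,
    }


if __name__ == "__main__":
    chunks = ["An assembler translates assembly language into machine code."]
    question = "The assembler converts assembly into binary."
    score = lineage_score(chunks, question)
    print(round(score, 4), lineage_flags(score))
```

Short outputs such as quiz questions have few keywords, so a ratio of this kind is sensitive to the tokenizer and to which side is used as the denominator.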

Summary Lineage Pass: 63.6% (keyword overlap ≥ 10%)
Quiz Lineage Pass: 0.8% (keyword overlap ≥ 10%)
Avg Summary Lineage: 0.12 (mean keyword overlap ratio)
Low-Lineage Questions: 244 (score < 0.05, possible hallucination)
[Chart: Question Lineage Score Distribution]
Recent Questions: Lineage Audit

Q#  Question                                                               Score   Status      Chunks
1   Which operation is NOT typically performed by a program?               0.0283  Unverified  6
2   Which statement about high-level languages is FALSE?                   0.0445  Unverified  6
3   Which of the following best defines a program?                         0.0324  Unverified  6
4   Machine language instructions are written using which representation?  0.0202  Unverified  6
5   What is the primary function of an assembler?                          0.0263  Unverified  6
Data Transparency Notice

All metrics on this page are computed in real time from the live database; no cached or synthetic values are used. The evaluation methodology follows standard information-retrieval and machine-learning evaluation practices: