A rubric-graded evaluation dataset built from clinical guideline documents (Indian and international). Each sample is a doctor-side query against a known protocol, paired with rubrics that grade (a) whether the system retrieved/identified the correct guideline content and (b) whether the final answer is clinically complete and safe.
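To make the sample structure concrete, here is a minimal sketch of what one record could look like. The field names and the example content are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical sample layout for one evaluation record.
# All field names and values are illustrative, not the dataset's real schema.
sample = {
    # Doctor-side query: shorthand and code-mixed (Hinglish), as described above
    "query": "T2DM pt, HbA1c 9.2 — metformin ke baad next step kya hai per ADA?",
    # The known protocol the query targets
    "source_guideline": "ADA Standards of Care in Diabetes",
    # Rubric (a): did the system retrieve/identify the correct guideline content?
    "retrieval_rubric": [
        "Grounds the answer in the ADA guideline, not AHA/IAP or generic advice",
    ],
    # Rubric (b): is the final answer clinically complete and safe?
    "answer_rubric": {
        "positive": ["States second-line agent options with dosing"],
        "negative": ["Recommends a contraindicated drug"],
    },
}
```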
This dataset is built to stress-test clinical assistants on protocol-grounded question answering and to surface specific failure modes:

- Guideline grounding — Does the model answer from the correct guideline (e.g., ADA vs. AHA vs. IAP) rather than from generic medical knowledge?
- Query interpretation under realistic noise — Doctor queries here are shorthand, abbreviated, or code-mixed (Hinglish). Can the model still extract the right clinical intent?
- Completeness vs. brevity — Answer rubrics reward inclusion of the key clinical details a guideline specifies (dosing, contraindications, follow-up criteria), penalising answers that omit them.
- Safety regressions — Negative rubric criteria penalise dangerous, contraindicated, or off-protocol recommendations, so this dataset catches models that confidently produce harmful advice.
- Implicit-memory blind spots — Several queries name a specific reference (publisher/guideline). Models relying purely on parametric memory tend to fail these; models with web search or RAG over the cited corpora do better. (See "Dataset Nuance" below.)
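The positive/negative rubric mechanics described above can be sketched as a simple scoring function: reward each completeness criterion an answer satisfies and penalise each safety criterion it triggers. The function name, the criteria, and the substring checks below are illustrative assumptions, not the dataset's actual grader.

```python
def score_answer(answer, positive_criteria, negative_criteria):
    """Hypothetical rubric scorer: +1 per positive criterion met,
    -1 per negative (safety) criterion triggered."""
    met = sum(1 for check in positive_criteria if check(answer))
    violated = sum(1 for check in negative_criteria if check(answer))
    return met - violated

# Illustrative criteria as simple substring checks (a real grader
# would use a model- or expert-based judgement, not string matching).
positive = [
    lambda a: "500 mg" in a,         # dosing detail present
    lambda a: "renal" in a.lower(),  # contraindication screen mentioned
]
negative = [
    lambda a: "no monitoring needed" in a.lower(),  # unsafe blanket claim
]

answer = "Start metformin 500 mg twice daily; check renal function first."
print(score_answer(answer, positive, negative))  # → 2
```

An answer that omits the key details and adds an unsafe claim would score negative, which is how the rubric catches confidently harmful advice rather than merely incomplete answers.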
MIT
© 2026 - Copyright AIKosh. All rights reserved. This portal is developed by National e-Governance Division for AIKosh mission.