A benchmark for evaluating whether LLMs can correctly compute numeric medical values (BMI, GFR, APACHE II, drug dosing, etc.) from parametric memory and inline arithmetic alone. Covers all calculators listed on Omni Calculator — Health.
Medical Calculator Evaluation Dataset

Overview

Property            Value
Total questions     1066
Unique calculators  358
Categories          24 medical domains
Format              Single-turn structured JSON Q&A
Evaluation          Numeric comparison on…

See the full description on the dataset page: https://huggingface.co/datasets/ekacare/medical_calculator_eval.
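The evaluation entry in the overview is truncated in this listing, so the exact matching rule is not stated here. As a minimal sketch of what a numeric comparison could look like, the Python below assumes a 1% relative tolerance and a hypothetical JSON reply with an `answer` field. Neither the tolerance nor the field name comes from the dataset card.

```python
import json
import math


def is_correct(model_reply: str, reference: float, rel_tol: float = 0.01) -> bool:
    """Return True if the reply's numeric answer is within rel_tol of reference.

    Assumptions (not from the dataset card): the reply is a JSON object with
    an "answer" field, and a 1% relative tolerance counts as a match.
    """
    try:
        value = float(json.loads(model_reply)["answer"])
    except (ValueError, KeyError, TypeError):
        # Unparseable JSON, a missing field, or a non-numeric value is scored as wrong.
        return False
    return math.isclose(value, reference, rel_tol=rel_tol)
```

A relative tolerance is used here because calculator outputs span very different magnitudes; a benchmark could equally use an absolute tolerance or per-calculator rounding rules.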
Purpose

This dataset is designed to surface two failure modes in frontier and local models:

Formula recall: does the model know the correct formula or scoring rubric for a given clinical calculator?
Inline arithmetic: can the model carry out the computation accurately without a calculator tool?

Use it to find out where a model's parametric medical knowledge ends and where its arithmetic breaks down.
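To illustrate how the two failure modes differ, the sketch below uses BMI (weight in kilograms divided by height in metres squared). The `classify` helper, the idea of separately evaluating the formula a model states, and the 1% tolerance are illustrative assumptions, not part of the dataset or its scoring.

```python
import math


def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2


def classify(reference: float, formula_value: float, answer: float,
             rel_tol: float = 0.01) -> str:
    """Label a response by where it went wrong (hypothetical helper).

    reference:     the correct value for the question
    formula_value: what the model's stated formula yields when evaluated exactly
    answer:        the number the model actually reported
    """
    if not math.isclose(formula_value, reference, rel_tol=rel_tol):
        return "formula recall"      # the formula itself is wrong
    if not math.isclose(answer, formula_value, rel_tol=rel_tol):
        return "inline arithmetic"   # right formula, wrong computation
    return "correct"


reference = bmi(70.0, 1.75)  # 70 / 3.0625

# The model divides by height instead of height squared.
wrong_formula = classify(reference, 70.0 / 1.75, 40.0)

# The model states the right formula but slips in the division.
wrong_arithmetic = classify(reference, reference, 25.9)
```

Here `wrong_formula` is labelled "formula recall" and `wrong_arithmetic` is labelled "inline arithmetic". In practice the two modes can only be separated when the model exposes its formula or intermediate steps alongside the final number.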
License: MIT