Benchmark Results
SecondLook outperforms leading LLMs on a rare-disease diagnosis benchmark
SecondLook was tested on the same Phenopacket2Prompt benchmark used to evaluate o1-preview and GPT-4o in the peer-reviewed literature, and graded the same way, in Mondo ontology space, so the numbers are directly comparable.
head-to-head
+20.8%
Higher Top-3 accuracy than OpenAI o3 (single-shot) on identical cases: same patients, same grader.
head-to-head
+15.5%
Higher Top-3 accuracy than Claude Opus 4.7 (single-shot) on identical cases: same patients, same grader.
vs. prior LLM SOTA
+55.5%
Higher Top-1 accuracy than the o1-preview result reported in the published Phenopacket2Prompt evaluation.
About the Phenopacket2Prompt benchmark
Phenopacket2Prompt is a public dataset of 9,587 published clinical vignettes, each derived from a peer-reviewed case report and paired with a verified ground-truth diagnosis (typically an OMIM identifier). Because every case maps to a real published patient, it is widely used as a rare-disease benchmark for evaluating diagnostic AI.
We generated the Claude Opus 4.7 and OpenAI o3 numbers ourselves on the same case sample as SecondLook, giving each model a single-shot diagnostic prompt so the comparison is head-to-head.
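For readers unfamiliar with the metric, here is a minimal sketch of how Top-k accuracy can be compared across two systems graded on the same cases. It is an illustration only, not SecondLook's grader: the function names, case IDs, and disease IDs are hypothetical placeholders, and it assumes each system returns a ranked list of disease identifiers that have already been normalized to one ontology. The source does not say whether the headline figures are absolute or relative differences, so the sketch computes both.

```python
from typing import Dict, List


def top_k_accuracy(predictions: Dict[str, List[str]],
                   truth: Dict[str, str],
                   k: int) -> float:
    """Fraction of cases whose true diagnosis is among the first k ranked guesses."""
    if not truth:
        raise ValueError("no cases to grade")
    hits = 0
    for case_id, true_id in truth.items():
        # A case with no prediction counts as a miss.
        ranked = predictions.get(case_id, [])
        if true_id in ranked[:k]:
            hits += 1
    return hits / len(truth)


def compare(acc_a: float, acc_b: float) -> Dict[str, float]:
    """Absolute gap in percentage points and relative gain in percent (a over b)."""
    if acc_b == 0:
        raise ValueError("relative gain is undefined when the baseline accuracy is 0")
    return {
        "absolute_points": (acc_a - acc_b) * 100,
        "relative_percent": (acc_a - acc_b) / acc_b * 100,
    }


# Toy data: placeholder IDs, not real cases or real ontology terms.
truth = {"c1": "D1", "c2": "D2", "c3": "D3", "c4": "D4"}
system_a = {"c1": ["D9", "D1"], "c2": ["D2"], "c3": ["D5"], "c4": ["D4"]}
system_b = {"c1": ["D1", "D9"], "c2": ["D7", "D2", "D8"],
            "c3": ["D5", "D6", "D3"], "c4": ["D9", "D8", "D7"]}

top1_a = top_k_accuracy(system_a, truth, k=1)
top1_b = top_k_accuracy(system_b, truth, k=1)
gap = compare(top1_a, top1_b)
print(f"Top-1: A={top1_a:.2f}, B={top1_b:.2f}, gap={gap}")
```

Both systems are scored against the same `truth` dictionary, which is what makes the comparison head-to-head: any difference in accuracy comes from the rankings, not from the case mix.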