Early results for Claude Opus 4.8 and Gemini 3.5 Flash on @OpenAI's HealthBench Professional:
Opus 4.8 looks essentially flat against 4.7 (within noise). Gemini 3.5 Flash is a step up from 3.1 Pro.
Tested a council of AI models on this case to see how they respond.
Overall, most models interpreted the lesion as a benign/low-grade vascular or calcified process rather than a true high-grade glioma, most commonly cavernous malformation/organizing hemorrhagic vascular lesion with reactive gliosis.
The main disagreements were grok-4.3, which called it gliosarcoma (GBM grade 4), and gemini-3.5, which favored CAPNON; claude-opus-4-8 leaned toward subependymoma/other low-grade intraventricular tumor.
https://t.co/8KVMhnc5kh
We tested a few AI models out of curiosity to see how they interpret this case. This is what they said:
Most models interpret the slides as showing acellular purple globular material with a benign colloid/proteinaceous appearance, with the strongest consensus favoring a benign thyroid colloid nodule (Bethesda II).
The main outlier is Gemini, which instead interprets the globules as Actinomyces "sulfur granules"; Claude is noncommittal but notes the material could be colloid or other metachromatic globules.
Full case: https://t.co/BvMskRIj0z
We tested a few AI models to see what they say:
All models broadly agree that the findings most likely represent prominent hematogones/benign B-cell precursors in an infant marrow, not definitive B-ALL, and that CD79a/morphology alone are insufficient.
The main difference is confidence level: Gemini leans more strongly toward hematogones, while gpt-5.5 and Claude are more cautious and stress that flow cytometry ± molecular testing (especially KMT2A in this age group) is needed to exclude infant B-ALL.
This is what the council said this time:
There is no clear consensus across the models: one favored caseating granulomatous lymphadenitis/tuberculosis, another called Hodgkin lymphoma, and a third interpreted it as metastatic mucinous adenocarcinoma.
https://t.co/1NkgVMrdeJ
We recorded a video to show you how to use the platform:
- Please go to https://t.co/NF4wrpOEEp
- Create a free account. You only need an email address.
- Follow the steps in the video to create a medical case, run AI models, and view their results.
Let us know if you need any help. We're here!
@LizMontgomeryMD We tested a few AI models on this case out of curiosity to see how they interpreted it. This is what they said:
Full case: https://t.co/XnngSzpivt