Released ClinAuthBench v1: a synthetic inpatient health authorization benchmark for testing whether LLMs can reason over dense chart packets without inventing unsupported claims.
HF dataset: https://t.co/QMxkjndnAf
GitHub: https://t.co/eM8NBqTn0k
@MParakhin Unless the code you wrote was bad, clumsy, and AI-generated, there is no serious way Claude can find these issues in a large codebase unless you babysit it and talk to it like a toddler. It actually sucks on a large codebase.
@gdb It would also help if you could break it down by industry, like healthcare and agriculture: how droughts are being caused, more insight into calamities. Also, how do we share what we do using Codex?
@_philschmid Isn’t this where the model should be intelligent enough to do it itself, rather than being trained for a specific behavior? After 6 months something different comes in. Do we keep retraining the model, with the cycle always continuing?
@yuyinzhou_cs This is interesting. I also released a dataset in a similar healthcare area: https://t.co/QMxkjndnAf
The idea is similar: final-answer accuracy is not enough. You need workflow-stage evaluation, validation checks, deterministic scoring, and error diagnosis.
@dwarkesh_sp @srush_nlp What happens if the second model is biased, is not as SOTA as the first one, or hallucinates and gives an incorrect recommendation? I understand trajectory → reward = 0. But your bet is:
“A slightly wrong local signal is better than an extremely sparse global signal.”
No real patient data. No payer/provider/facility/EHR affiliation. This also includes a notebook. This is evaluation-scale, not training-from-scratch scale.
The goal is to support reproducible work on authorization reasoning, evidence-grounded summarization, contradiction handling, and hallucination control in dense clinical-style documentation.
You could have a Ford Raptor R for off-roading and a CT5-V Blackwing as your daily driver and sports vehicle, right? So why not 2 models? In research, for finding a needle in the haystack, Codex is way better. Claude sucks and breaks down on extremely long, complex cases, but its UI is hands down better.
@DrDatta_AIIMS I have 2 poster papers selected in healthcare, at the Berkeley AI Summit and AMIA. Not sure if someone in the US on a visa can collaborate. This is more on behavioral health: suicide, opioids, clinical depression, trials, clinical forms, labs, etc.