Husband to @pearlsesq, Software engineer @databricks, French Canadian (i.e. likes poutine and hockey), previously @stripe, @SlackHQ, @Foursquare, @Google
We just published OfficeQA Pro - a set of 133 challenging questions from the original OfficeQA benchmark. Even the best frontier agents still struggle on OfficeQA Pro; common issues stem from errors in parsing, retrieval, and visual reasoning.
Most AI benchmarks test reasoning in isolation.
Real enterprise tasks require grounded reasoning:
1️⃣ Find the right documents
2️⃣ Extract the right values
3️⃣ Perform analyses
OfficeQA Pro evaluates this end-to-end. Frontier agents still score <50%.
🧵Paper & details below!
Today we’re releasing OfficeQA — a new benchmark for end-to-end grounded reasoning that reflects the real work enterprises need AI agents to do.
More details below 👇
Since joining @databricks, our research team has been hard at work on Agent Bricks, a new product that helps enterprises develop state-of-the-art domain-specific agents. We are now releasing a research blog about Agent Learning from Human Feedback (ALHF) https://t.co/2RDs3H6mkY
RLVR isn't just for math and coding! At @databricks, it's impacting products and users across domains. One example: SQL Q&A. We hit the top of the BIRD single-model single-generation leaderboard with our standard TAO+RLVR recipe - the one rolling out in our Agent Bricks product.
Hey @minimax_ai, I'm trying to serve M1-80k on vLLM. Your docs say "a server with 8 H800s can process inputs up to 2 million tokens" but then recommend --max_model_len 4096. What settings did you use for 2M tokens? I'm trying this on 8 H100s.
Big news: we've agreed to acquire @MosaicML, a leading generative AI platform. I couldn’t be more excited to join forces once the deal closes. https://t.co/L4TyrruUEU
@harryh Sidney is much better at this, except it gets confused easily. I think because of the other people who show up on your LinkedIn profile, it thought I currently had @leok's job. I assume it's the same problem with ChatGPT.