BioMysteryBench, our new bioinformatics eval, tests whether Claude can devise creative solutions to open-ended research problems.
Read more: https://t.co/iKDWA76Nu9
Opus 4.6 is our most capable Computer Use model to date. Excited for everyone to give Computer Use a try with Claude in Chrome, Cowork, and Claude Code!
To celebrate, I let Claude (4.6) Monet show off his artistic side in the Claude for Chrome extension.
I'm excited about this! Our team has been working really hard to improve Gemini 1.5 capabilities significantly on multiple fronts and in particular MATH/STEM! Please see the report here:
https://t.co/Wi3bBNPewY
Today we have published our updated Gemini 1.5 Model Technical Report. As @JeffDean highlights, we have made significant progress in Gemini 1.5 Pro across all key benchmarks; TL;DR: 1.5 Pro > 1.0 Ultra, 1.5 Flash (our fastest model) ~= 1.0 Ultra.
As a math undergrad, our dramatic results in mathematics are particularly exciting to me!
In section 7 of the tech report, we present new results on a math-specialised variant of Gemini 1.5 Pro which performs strongly on competition-level math problems, including a breakthrough performance of 91.1% on Hendrycks' MATH benchmark without tool-use (examples below 🧵).
Gemini 1.5 is widely available. Try it out for free here: https://t.co/GJXW8lduNk & read the full tech report here: https://t.co/Pltp92WcNo
In writing this paper, there were countless features we thought might be bugs. After careful inspection, ~all of them revealed surprising and subtle model properties.
To me this capacity for surprise is the true test of a new technique.
This thread is about my favorite finding.
Excited to announce that the entire Blueshift team has joined @DeepMind! We will be working with @OriolVinyalsML and others to advance the capabilities of LLMs developed by DM / Alphabet! We hope to continue to grow DM's presence in the Bay Area and New York in the coming months :-)
If you are interested in solving challenging multi-step reasoning problems with LLMs, join us!
We have an opening for a Research Scientist position at Blueshift!
Learn more about the role & apply here:
https://t.co/zDM9ooMLRN
Learn about our team:
https://t.co/eg6Obh2167
@amirzait Great question! In https://t.co/RBS70Y20Ww we began to study memorization. We indeed looked at acc on modified questions, checked for MATH in the training data, and compared acc when removing answers similar to MATH. But this is an important direction for more follow up!
1/ Super excited to introduce #Minerva 🦉(https://t.co/UI7zV0IXlS). Minerva was trained on math and science found on the web and can solve many multi-step quantitative reasoning problems.
Very excited to present Minerva🦉: a language model capable of solving mathematical questions using step-by-step natural language reasoning.
Combining scale, data, and other factors dramatically improves performance on the STEM benchmarks MATH and MMLU-STEM. https://t.co/bQJOyMSCD4
@HAKSOAT MMLU doesn't seem to have many pure E&M problems that require multiple steps. I agree it would be interesting to do a systematic evaluation. But here is one that I grabbed:
@KyleCranmer One fun aspect of how few-shot prompting works with these generative models is that we give:
Question: ... Answer: ...
...
Question: ... Answer: ...
Question:
And the model produces an answer. But then it keeps making up new questions and answers -- next year's pset 😉.
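A minimal sketch of the prompt pattern described above, assuming a generic text-completion model. The helper names `build_few_shot_prompt` and `truncate_at_next_question` are hypothetical and are not taken from the Minerva code; they only illustrate why the raw completion has to be cut off before the model starts inventing new questions.

```python
def build_few_shot_prompt(examples, new_question):
    """Format solved (question, answer) pairs, then leave the last answer open."""
    blocks = [f"Question: {q} Answer: {a}" for q, a in examples]
    # The final block has no answer, so the model continues from "Answer:".
    blocks.append(f"Question: {new_question} Answer:")
    return "\n".join(blocks)


def truncate_at_next_question(completion, stop="Question:"):
    """Keep only the first answer and drop any follow-on questions the model makes up."""
    return completion.split(stop, 1)[0].strip()
```

For example, if the model returns `" 4\nQuestion: 3+3? Answer: 6"`, truncating at the next `Question:` leaves only `"4"`.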
@holmesjtg We don't have any concrete plans, but are definitely very interested in how this can be adapted to be a helpful tutor, answer questions as students ask them (rather than as tests phrase them) etc... Do you have any favorite datasets for this?
@pablo_derbez Without additional prompting, it can still be quite brittle to such things. On the other hand, we have seen examples where the answer options assume some kind of rounding: Minerva solves the problem exactly and then correctly realizes it is supposed to round.
3/ Find out more about Minerva in the blog post (https://t.co/UI7zV0IXlS), paper (https://t.co/RBS70Y20Ww) or explore more Minerva samples (https://t.co/zMcW595QpD)!
2/ Among many impressive properties, one side effect of training on the web is that Minerva has seen text used to draw mathematical figures and so can sometimes reason about diagrams.
Thrilled to announce🦉Minerva: a large language model capable of solving mathematical problems using step-by-step reasoning in natural language.
See blog here: https://t.co/eDtHy9oXci and samples here: https://t.co/GGECkO5Noo (1/n)
Very excited to announce a significant milestone in expanding reasoning capabilities of language models! 🎉🎉
We introduce #Minerva🦉: a language model that can solve mathematical questions using step-by-step natural language reasoning:
https://t.co/ned5a6jcVl
🧵
1/