Today I finally get to share something our team has been quietly grinding on for months: we've created an open sourced version of Cursor Bench @cursor_ai.
If you've been following Cursor's Composer launch and their internal "Cursor Bench" for testing vibe coding models, you can think of our LCBA bench as the open-source, model-agnostic counterpart.
Here is what we provide from @SFResearch. With LCBA bench we:
• Ship a Cursor-style agent stack: ReAct loop, semantic @ codebase search, grep, file read/write, refactor tools, and a three-tier memory system inspired by production coding assistants like Cursor.
• Take 8,000 real-world vibe coding scenarios and turn them into interactive agent gyms across 10 languages and 10K to 1M token codebases.
• Let you plug in any model (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro, etc.) and see how it actually behaves on long, messy, multi-turn coding tasks.
A few fun findings: Cursor-style agents with context management are surprisingly robust at 1M-token contexts, but there's a hard trade-off between deep exploration and efficiency: no single frontier model sits in the "perfect" top-right corner yet. Anthropic Claude 4.5 and Google Gemini 2.5 Pro are on the Pareto frontier.
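For readers less familiar with the term, here is a minimal sketch of how a Pareto frontier over two "higher is better" metrics can be computed. The model names, the metric names (`exploration`, `efficiency`), and the numbers are made up for illustration; they are not results from the benchmark, and this is not the report's own analysis code.

```python
from typing import List, NamedTuple


class ModelScore(NamedTuple):
    name: str
    exploration: float  # hypothetical metric, higher is better
    efficiency: float   # hypothetical metric, higher is better


def pareto_frontier(scores: List[ModelScore]) -> List[ModelScore]:
    """Return the entries that no other entry dominates.

    Entry `o` dominates `s` when it is at least as good on both
    metrics and strictly better on at least one.
    """
    frontier = []
    for s in scores:
        dominated = any(
            o.exploration >= s.exploration
            and o.efficiency >= s.efficiency
            and (o.exploration > s.exploration or o.efficiency > s.efficiency)
            for o in scores
        )
        if not dominated:
            frontier.append(s)
    return frontier


# Placeholder scores, for illustration only.
scores = [
    ModelScore("model_a", exploration=0.9, efficiency=0.4),
    ModelScore("model_b", exploration=0.5, efficiency=0.8),
    ModelScore("model_c", exploration=0.4, efficiency=0.3),
]

frontier_names = [s.name for s in pareto_frontier(scores)]
print(frontier_names)
```

In this toy data, `model_c` is worse than `model_a` on both axes, so it falls off the frontier, while `model_a` and `model_b` each win on one axis and both stay on it.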
Everything is open source (agent, code, scenarios, traces, metrics) on @huggingface:
Tech Report: https://t.co/i6UTFGou4T
GitHub: https://t.co/OEwv4x5tC5
Dataset: https://t.co/PuxHwxoHVU
If you're building coding agents, benchmarking your model against GPT/Claude/Gemini, or want to train your coding agents with RL in real coding environments, we'd love for you to try LCBA bench and tell us your findings!
Excited to see the first paper getting accepted at @DMLRJournal. Over the last few months, we have been fascinated by the quality of reviews and the engaging interactions between authors and reviewers! Thanks everyone! Please continue to send your best work about Data x ML.
'Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift'
by Jielin Qiu, Yi Zhu, Xingjian Shi, Florian Wenzel, Zhiqiang Tang, Ding Zhao, Bo Li, Mu Li
Action Editor: Hongyang Zhang
https://t.co/mXh5OCAt9T
#Multimodal #Robustness #DistributionShift
Evaluate any story to your heart's content with our new personalized story evaluation model, PerSE! No more worries about diverse preferences - get your own story evaluation report now! https://t.co/uRIGBlnGAI
1/5
What is missing in text generation evaluation with BERTScore, BLEURT, COMET, SEScore & SEScore2? Explanation! Can we build a metric that not only produces a well-correlated quality score but also tells you the rationales, error type, and error location? Check out InstructScore!
Excited to share our latest work at the EMNLP main conference: "Learning from Mistakes via Interactive Study Assistant for Large Language Models". We introduce a study assistant (SALAM) to conduct thoughtful analysis on LLMs' mistakes and provide guidelines to avoid past mistakes.
😭 Tired of in-context demos & docs for LLM tool use?
😰 Too GPU-poor to tune LLMs for unseen tools?
🤬 Frustrated with frequent syntax errors in tool calls?
Check out our new preprint ToolDec that addresses all these issues from the decoding side!
https://t.co/vssxVg833j
1/5
Excited to share our recent work, AnyMAL -- a unified Multimodal LLM built on LLaMA-2 that can reason over various inputs, e.g. images, audio, motion sensors.
Check out our paper for more information on the model training, evaluation, safety and more!
➡️ https://t.co/HmyVynWXPH
A topic that comes up in every interview:
Bias, variance, and their relationship with machine learning algorithms.
Here is a simple summary that you will easily remember.
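For reference, the textbook bias-variance decomposition can be written as follows. This is the standard result, not necessarily the summary the original post showed. It assumes squared-error loss and data generated as $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, where $\hat{f}$ is the model fitted on a random training set (these symbols are introduced here for illustration).

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Simple, rigid models tend to have a large bias term and a small variance term; flexible models tend toward the opposite.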
Our #ACL2022 paper "Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions" is out (https://t.co/k9IrNlrlqJ)!!!
It serves as a thorough reference for the VLN research community (for both starters and experts).
https://t.co/m9xkegUs1g
How to present a line plot?
Line plots are effective for describing the relationship between two variables of interest.
Unfortunately, most junior students simply copy and paste the figure from the paper into their talk, which causes much confusion.
Let's break it down ... 🧵
Our team at Google Brain is looking for outstanding PhD students (expected graduation after 2023) who are interested in student researcher internships this year (2022). https://t.co/U2vuC8WoFI
The Embodied AI Lecture Series at AI2 is back! Subscribe to the mailing list for info about how to join these free lectures live, or stay tuned and we'll post the recorded sessions after the fact.
Subscribe:
https://t.co/RGbNIgKhxA
More info: https://t.co/lZyPl1jhLu
I've been writing research articles for over 10 years now and one of the hardest parts is writing consistently and efficiently without procrastinating. I'm going to share some of my tips here 🧵 1/10
AI2's computer vision team PRIOR announced an exciting new release of their #EmbodiedAI platform AI2-THOR: in partnership with @unity, you can now train headlessly on multiple GPUs.
Learn more:
https://t.co/iBbjPJBfMQ