“What’s my approval rating?”
- It’s bad Mr Prime Minister. It’s -43. It’s…it’s in the mud.
“Could it get lower?”
- I mean anything COULD happen but realistically-
“Kill the ponies.”
- Wha-
“The ponies. The cute little ponies. Kill them.”
- Sir, they’re endangered
“Fuck ‘em. Make the call.”
What has been publicly “rolled back” may still be happening privately, as there is no way to know what was degraded and what was not.
Previously it was an “accidental oversight” in multi-turn coding chats via IDEs (e.g. through Cursor), where minor mistakes were added so that a few more turns of usage would increase token consumption. Since most coding benchmarks aren't multi-turn coding sessions via chat, this would have no impact on scores.
Although the models are far better than what OpenAI or Google currently offer, one should reduce dependence on whatever model is SOTA at a given time and limit usage to harder tasks or the ones other models fail to grasp or solve.
The same applies to tools: I personally wouldn't use Claude/Codex unless I absolutely have to, but I regularly use both of them via Cursor or other IDEs.
@therealoliulv @speedrun Unfortunately, this is something that already exists out of the box for AWS; some service companies have built it already.
https://t.co/9JVg2p5GGm
🚀 Our community-led ML Agents group is kicking off a new collaborative project to build a Street Navigation Agent for more inclusive, region-aware local search. In many parts of the world, businesses exist physically — but not digitally.
They're exploring how AI can use tools like Google Street View to read storefront signs, apply distance & category constraints, and reason step-by-step to identify real-world services.
We’re also building a global benchmark across countries and languages to evaluate visual verification.
Congrats to everyone involved in Kaleidoscope, a cross-institutional collaboration accepted to ICLR 2026 🔥
A special shoutout to @mziizm, who championed this collaboration from day 1. It is the first accepted paper for many of the collaborators, who are first-time authors.
Many researchers join our community seeking mentorship, support, and a roadmap as they embark on their journeys.
@_1024_m and @jebish7 did just this. Now, just 2 years later, they are creating these pathways for others, opening doors, and leading the way.
In 2025, our Open Science Community Leads showed what’s possible when AI research is built in the open.
38 leads, 17 programs, 125 guest speakers advancing open, collaborative AI across the world (find all talks here! https://t.co/UBXm7nkwB1). 🤯
(3/3)
Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance.
https://t.co/8HkIeXYHRr
A Hindi-English bilingual LLM with over 140 checkpoints, trained with variations in data distribution.
Findings :
- LLM-translated data can work as well as real data for addressing a lack of data
- Each task type has a different optimal data distribution, which could be determined by test runs on a subset of the data.
- LLM-generated thinking texts were made descriptive yet concise; this led to fewer emitted tokens (lower token consumption) during evals for text-generation tasks while providing better performance.
Release :
- Open Data, Models and 140 Checkpoints
https://t.co/rkG5vXBhZu
https://t.co/ukSrLOShq8
Three of our papers have been accepted at AACL 2025 @aaclmeeting (2 Main, 1 Findings).
1. DSBC : Data Science task Benchmarking with Context engineering
https://t.co/WwYwQ6uunl
2. Uncovering Cultural Representation Disparities in Vision-Language Models
https://t.co/jTtbZqGx3w
3. Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance
https://t.co/8HkIeXYHRr
Grateful to the co-authors
@SidYaeger @Siddartha_10 @jebish7 @delliott @alexrs95 @_sumand @_srishtiyadav @KanwalMehreen2
This was made possible through research grants from
@TraversaalAI @AnthropicAI @Cohere_Labs
(2/3)
Uncovering Cultural Representation Disparities in Vision-Language Models
https://t.co/jTtbZqGx3w
https://t.co/VASt9oPevl
Key Highlights :
- We test several VLMs on a country/culture recognition task in 3 settings : open-ended, MCQs with similar or neighbouring countries, and MCQs with random countries
- We also test them under image ablations (noise, rotations, greyscaling, etc.)
Findings :
- Country-level biases do correlate with the country-wise availability of online data, i.e. more data or mentions means less bias or misclassification. This contradicts the common assumption of Western favouritism.
- Image perturbations affect biases unpredictably, even among models belonging to the same family.
- The language of the prompt had a negligible effect, other than improving accuracy for countries that speak that language.
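For anyone curious what the image ablations (noise, rotation, greyscaling) might look like in practice, here is a rough sketch. It is not the paper's code: the nested-list image representation, the function names, and the noise range are all assumptions made for illustration, and a real pipeline would more likely use an imaging library.

```python
import random

# Assumption: an image is a list of rows, each row a list of (r, g, b)
# tuples with integer values in 0-255. All names here are hypothetical.

def to_greyscale(image):
    """Replace each pixel with its luma, repeated across the three channels."""
    out = []
    for row in image:
        new_row = []
        for r, g, b in row:
            # ITU-R BT.601 luma weights
            y = round(0.299 * r + 0.587 * g + 0.114 * b)
            new_row.append((y, y, y))
        out.append(new_row)
    return out

def rotate_90_clockwise(image):
    """Rotate the image a quarter turn clockwise."""
    return [list(column) for column in zip(*image[::-1])]

def add_noise(image, max_shift=25, seed=0):
    """Shift every channel by a random integer in [-max_shift, max_shift],
    clamped to 0-255. A fixed seed keeps the perturbation reproducible."""
    rng = random.Random(seed)

    def clamp(value):
        return max(0, min(255, value))

    return [
        [tuple(clamp(c + rng.randint(-max_shift, max_shift)) for c in pixel)
         for pixel in row]
        for row in image
    ]

def make_ablations(image):
    """Return the original image alongside each perturbed variant."""
    return {
        "original": image,
        "greyscale": to_greyscale(image),
        "rotated": rotate_90_clockwise(image),
        "noisy": add_noise(image),
    }
```

Each variant would then be sent to the model with the same prompt, so that any change in the predicted country can be attributed to the perturbation alone.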
Our open science community welcomes a new group focusing on agents, led by @_1024_m & @jebish7.
They'll explore:
📊evaluation frameworks
🖥️agentic applications
🏇efficient systems
...via panel discussions and community-led projects.
Join our community on this exploration.