Great conversation with @20vcFund & @HarryStebbings about the Dubsmash journey and all things consumer products. Lots of learnings over the years packed into an hour. https://t.co/6OH2JU8Gy8
Introducing Agent Arena: real-world agentic evals at scale.
How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks.
On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents.
Every session produces rich signals. Users iterate with the agent turn by turn: approving, editing, correcting, praising, or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more.
Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination.
This leaderboard snapshot is built from 300K+ tasks, 2M+ tool calls, and 40M lines of code written by agents.
Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6
More analysis in the thread, with the full technical blog below.
Introducing Agent Mode: Agentic AI is now measured in the Arena.
Agent Mode can do deep research, create reports, generate images, build websites, debug code, and more.
It completes more complex tasks by using tools like web search, bash in a sandbox environment, image generation, file writing, and asking follow-up questions.
Frontier models are waiting for you in Agent Mode to take on real-world tasks. GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, and top open models. Test them yourself.
A huge kudos to Joe & Jared for their persistence and commitment through the HitRecord journey. Their emphasis on community creation is something we deeply respect at Reddit. Excited to see what is next with @MasterClass.
HITRECORD has landed an exit!
Our mission has always been to inspire creativity & I think teaching and learning is a huge part of that.
Last fall, we launched a learning service, & @MasterClass liked it. Now we’re joining their team.
Full post: https://t.co/vDXUFJncH4
(1/5)
Really excited for @dubsmash to partner with @CashApp to back the first-ever black creator house in Atlanta. Incredible talent, all under one roof.
The New Influencer Capital of America https://t.co/MElsYgg4yi
A great post on @legiontech's Series B funding round by @meganrosedickey: https://t.co/U4GfqHlF5s
Want to help reimagine workforce management and improve the lives of hourly workers? Join the legion! https://t.co/UZL5Saf1oJ
@dubsmash this. Account. Is. On. 🔥 Almost at 2k followers. Well over 100k views; they just haven’t updated! I’m so happy to be thriving on this platform. Everyone join @dubsmash like now #Verified #dubsmash
"In the middle of what looked like a moment of huge success, we realized we had to tear our company down. If we hadn’t, it wouldn’t have had a future at all." https://t.co/kGjaKrTtYf
1/ Dubsmash was in the press recently talking about the growth of the product and community and the non-linear, turbulent path of the last few years. In times like these, it felt important to capture the ups, downs, and in-betweens…
@dHandlos @maxniederhofer @dubsmash never really focused on monetization in the beginning.
We stayed focused largely on building a product that could retain users.