Evolvent AI

Verified account

@Evolvent_AI

Building persistent agent infrastructure for infinite self-evolving intelligence. More at:

Joined April 2026

91 Following

180 Followers

79 Posts

Pinned Tweet

about 2 months ago

Launch Week — Day 1: ClawMark Most agent benchmarks give the model one shot, one prompt, one frozen environment. Real coworker tasks span multiple days — and the world keeps changing while the agent works. Introducing 🦞ClawMark: a multi-day, dynamic-environment benchmark for coworker agents. Built by Evolvent together with 40+ researchers from NUS, HKU, MIT, UW, and UC Berkeley. Open-sourced at: https://t.co/QN7XgIoaN1 100 tasks. 13 professional domains. Fully rule-based scoring. Results from 6 frontier models below. 🧵👇

6

55

11

22

17K

6 days ago

Exactly what Evolvent AI (https://t.co/RUIUwpvR2l) is building

8 days ago

None of this guarantees recursive self-improvement is on the horizon. It’s not yet clear that Claude is capable of research judgment—of choosing the right problems to work on. But if these trends continue, AI systems designing and building their own successors is plausible. This could revolutionize society—medicine, technology, the economy—for the better. But it may also compound alignment issues and ultimately lead to loss of control. The Anthropic Institute (in collaboration with external stakeholders) will conduct research to think through the implications of increasingly powerful, potentially self-improving systems—and how to create the ability for the world to make deliberate choices about the future development of the technology. Read the full post: https://t.co/XkYALsONft

110

2K

191

444

524K

0

0

0

0

41

29 days ago

@tydsh Congrats! We are also launching a startup to explore self-evolving agent infrastructure!

0

1

0

1

499

about 1 month ago

@Elonsfannumber1 yeah the most cost-effective model so far

0

1

0

0

33

about 1 month ago

Ran deepseek-v4-pro through ClawMark (our living-world openclaw benchmark) — 100/100 tasks, 0.685 avg score, 40.7h total time. Slots in at #4, just edging out kimi-k2.6 (0.684) and gemini-3.1-pro (0.682) — all three within a 0.003 window. claude-4-6 / gpt-5.4 still hold the top at 0.72–0.76. Updated leaderboard 👇
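
As a quick sanity check on the "0.003 window" claim, the following minimal sketch ranks the three mid-table models and computes the spread between them. It uses only the average scores quoted in the post; the dictionary layout and variable names are illustrative assumptions, not part of ClawMark itself.

```python
from decimal import Decimal

# Average ClawMark scores as quoted in the post (higher is better).
# Decimal avoids binary floating-point noise in the subtraction below.
scores = {
    "deepseek-v4-pro": Decimal("0.685"),
    "kimi-k2.6": Decimal("0.684"),
    "gemini-3.1-pro": Decimal("0.682"),
}

# Rank the three models by score, best first.
ranked = sorted(scores, key=scores.get, reverse=True)

# Spread between the highest and lowest of the three scores.
spread = max(scores.values()) - min(scores.values())

print(ranked)
print(spread)
```

The spread comes out to 0.003, matching the figure in the post, so the ordering among these three models rests on very small differences.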

5

84

4

11

7K

about 1 month ago

@wojtess Yeah, we will release the glm 5.1 results soon

0

1

0

0

40

about 2 months ago

@bourneliu66 Open weights + top-tier performance + game-changing pricing = paradigm shift. Our independent ClawMark results confirm K2.6 is the real deal: https://t.co/RDdk22LIpa

about 2 months ago

Can confirm: K2.6 isn't just a demo-reel model. A few days ago, we received a bug report from the Kimi team, got early API access, and re-ran ClawMark (our living-world openclaw benchmark). After fixing a compatibility bug in openclaw's repo (https://t.co/owWPiOuWgs), K2.6 lands at a 0.684 avg score, edging out gemini-3.1-pro (0.682) and jumping +0.124 over K2.5. Shipping shaders and agentic benchmark gains in the same release is a pretty rare combo. 👀

3

145

5

32

35K

0

0

0

0

944

about 2 months ago

@oran_ge Price, performance, open weights—name a better combo. We put it to the test on our live agent benchmark: https://t.co/RDdk22LIpa

0

0

0

0

493

about 2 months ago

@shao__meng Can confirm: K2.6 is not a demo. It’s a production-grade beast. Our benchmark says it all: https://t.co/RDdk22LIpa Open source is eating the world.

0

1

0

0

80

about 2 months ago

Update: Added Kimi K2.6 results. Fixed OpenClaw compatibility bug (reported by Kimi team), re-ran benchmark, and finalized fresh ClawMark scores. Updated results table below 👇 https://t.co/iEUMemFxqo

1

2

0

0

303

about 2 months ago

@_akhaliq Verified: K2.6 is the real deal 🚀 Outperformed Gemini 3.1 Pro on our ClawMark living-world benchmark. Read our full analysis: https://t.co/RDdk22LIpa

0

0

0

0

119

about 2 months ago

@DeRonin_ Kimi 2.6 = next-level agentic performance. Confirmed on ClawMark, our living-world openclaw benchmark. Full scores here: https://t.co/RDdk22LIpa

0

0

0

0

139

about 2 months ago

@cgtwts Can confirm: K2.6 is not a demo. It’s a production-grade beast. Our benchmark says it all: https://t.co/RDdk22LIpa Open source is eating the world.

0

2

0

0

48

about 2 months ago

@shiri_shh Historic day for open-source AI. We independently measured K2.6’s agent capabilities and the results are massive: https://t.co/RDdk22LIpa

0

0

0

0

58

about 2 months ago

@chetaslua Price, performance, open weights: name a better combo. We put it to the test on ClawMark, our live agent benchmark: https://t.co/RDdk22LIpa

0

2

0

0

190

about 2 months ago

@itsPaulAi Open weights + top-tier performance + game-changing pricing = paradigm shift. Our independent ClawMark results confirm K2.6 is the real deal: https://t.co/RDdk22LIpa

0

0

0

0

57

about 2 months ago

@k1rallik Independent ClawMark test: Kimi 2.6 > Gemini 3.1 Pro. Real-world performance, real data. Read why it matters: https://t.co/RDdk22LIpa

0

1

1

1

157

about 2 months ago

@svpino Independent ClawMark test: Kimi 2.6 > Gemini 3.1 Pro. Real-world performance, real data. Read why it matters: https://t.co/RDdk22LIpa

1

1

0

0

126

about 2 months ago

@kanavtwt Verified: K2.6 is the real deal 🚀 Outperformed Gemini 3.1 Pro on our ClawMark living-world benchmark. Read our full analysis: https://t.co/RDdk22LIpa

0

1

0

0

3K

about 2 months ago

@mervenoyann Can confirm — K2.6 isn’t just a demo-reel model. It outperformed Gemini 3.1 Pro on ClawMark. Our independent test: https://t.co/RDdk22LIpa

0

0

0

0

39

about 2 months ago

@JulianGoldieSEO Kimi 2.6 just proved it’s NOT a demo-reel model! We tested it on ClawMark and it beat Gemini 3.1 Pro. Full results: https://t.co/RDdk22LIpa

0

1

0

0

1K
