Wu Haoning @HaoningTimothy - Twitter Profile

Pinned Tweet

5 months ago

It took us a long time to prove this: everyone is building Big Macs, but we bring you a kiwi🥝 instead. You get multimodality with K2.5 everywhere: chat with visual tools, code with vision, generate aesthetic frontends from visual refs... and, most fundamentally, it is a SUPER POWERFUL VLM

Kimi.ai @Kimi_Moonshot

5 months ago

Kimi K2.5 has arrived! 🥝 Here are 2 things to know: Aesthetic Coding x Agent Swarm.

134

6K

507

3K

617K

19

477

19

92

44K

Wu Haoning

@HaoningTimothy

13 days ago

actually opus-4.7 and 4.8 have stalled at the same token-accuracy pareto for osworld

Tianbao Xie

@TianbaoX

14 days ago

OSWorld: Hurray! survive one more day

2

18

0

1

5K

2

16

0

3K

HaoningTimothy retweeted

ishan

@0xishand

15 days ago

@_LuoFuli Looking forward to the blog! (If you’re able to share) - do you guys use G3 or G4 offloading or just CPU?

0

2

1

0

2K

Wu Haoning

@HaoningTimothy

about 1 month ago

this is literally insane that we truly see an open model thinking 200-300k for math problems…

1

111

0

24

12K

Wu Haoning

@HaoningTimothy

about 1 month ago

I saw flowers and moonlight today.

1

19

0

1

854

Wu Haoning

@HaoningTimothy

about 1 month ago

Very Nice Move👍

Peter Gostev

@petergostev

about 1 month ago

Note we've renamed Code Arena to Frontend Design: WebDev for these chats. I hope this is less confusing, but lmk if you have better suggestions

11

178

9

15

23K

0

8

0

1

703

Wu Haoning

@HaoningTimothy

about 1 month ago

Walking in the wrong direction will not get you to the destination.

4

43

0

6K

HaoningTimothy retweeted

Fanqing Meng

@FanqingMengAI

about 1 month ago

https://t.co/mAk0mbDNBa https://t.co/x1SEMzxzGp tech report release

2

40

8

16

12K

Wu Haoning

@HaoningTimothy

about 2 months ago

open intelligence in duet

Lincoln 🇿🇦

@Presidentlin

about 2 months ago

@deepseek_ai Victory Whales are smarter than ants, I guess 15.18% vs 5.80%

7

35

2

1

11K

1

40

1

2

3K

HaoningTimothy retweeted

Andon Labs

@andonlabs

about 2 months ago

Kimi K2.6 is #5 on Vending-Bench 2. It's the best open model, overtaking GLM 5.1.

6

161

7

16

16K

Wu Haoning

@HaoningTimothy

about 2 months ago

Since K2.5/K2.6 is multimodal, we are also building it to be used. (I am really amazed by how well it handles long multi-image documents, given that we are not specifically optimizing for them much.) However, there is still a long way to go.

Arena.ai

@arena

about 2 months ago

Kimi K2.6 is the new SOTA open model in Vision and Document Arena, with solid gains since Kimi K2.5:
- #1 open on Vision Arena (#15 overall), +14 over #2 Kimi K2.5 (Thinking)
- #1 open on Document Arena (#8 overall), +9 over K2.5 and on par with proprietary models like Muse Spark and Gemini 3.1 Pro.

Huge congrats again to the @Kimi_Moonshot team on the open source progress!

12

256

17

35

95K

1

102

7

11

14K

Wu Haoning

@HaoningTimothy

about 2 months ago

I think deepseek-v4 is not over-benchmaxxing, which is good. We build these things for people to use.

14

1K

36

41

45K

HaoningTimothy retweeted

Vals AI

@ValsAI

about 2 months ago

The 🐳 has surfaced and it’s a powerhouse on the Vals leaderboards, dominating on coding. DeepSeek V4 just landed #2 on the Vals Index, nearly tying Kimi K2.6 (only 0.07% behind). https://t.co/Ow5R3uA4UI

9

258

13

20

16K

HaoningTimothy retweeted

Artificial Analysis

@ArtificialAnlys

about 2 months ago

GPT-5.5 takes OpenAI back to the clear number one in AI. OpenAI’s new model tops the Artificial Analysis Intelligence Index by 3 points, breaking a three-way tie with Anthropic and Google.

OpenAI gave us pre-release access to test all five reasoning effort levels: xhigh, high, medium, low and non-reasoning.

➤ OpenAI topping five headline evaluations: GPT-5.5 (xhigh) leads Terminal-Bench Hard, GDPval-AA and our newly hosted APEX-Agents-AA. The model trails only other OpenAI models in CritPt and AA-LCR, and comes second to Gemini 3.1 Pro Preview on three additional evaluations. The largest gains are on AA-Omniscience (+14 pts), our knowledge and hallucination benchmark, and τ²-Bench Telecom (+7 pts), a customer service agent benchmark.

➤ 20% more expensive to run our Intelligence Index: Per-token pricing has doubled from GPT-5.4 to $5/$30 per 1M input/output tokens. However, a ~40% token use reduction largely absorbs the hike - resulting in a net ~+20% cost to run our Intelligence Index.

➤ Effort a clear ladder for balancing intelligence and cost: GPT-5.5 (medium) scores the same as Claude Opus 4.7 (max) on our Intelligence Index at one quarter of the cost (~$1,200 vs $4,800) - although Gemini 3.1 Pro Preview scores the same at a cost of ~$900. GPT-5.5 (low) approximates Claude Opus 4.7 (Non-reasoning, high) on our Intelligence Index at half the cost to run (~$500 vs ~$1,000).

➤ Number one in GDPval-AA with an Elo of 1785: GPT-5.5 (xhigh) leads Claude Opus 4.7 (max) by ~30 pts and Gemini 3.1 Pro Preview by ~470 pts. GDPval-AA is Artificial Analysis’ benchmark that leverages OpenAI’s GDPval dataset to evaluate models on real-world economically valuable tasks.

➤ Top AA-Omniscience accuracy, but trailing the frontier on hallucination: Our private AA-Omniscience benchmark rewards factual knowledge across diverse topics, but punishes hallucination. GPT-5.5 (xhigh) has the highest accuracy at 57% - meaning the model can recall facts in the Omniscience corpus more effectively than any other model. However, it has a hallucination rate of 86% - vs Opus 4.7 (max) at 36%, and Gemini 3.1 Pro Preview at 50%. This makes it more likely to answer a question when it does not ‘know’ the answer. The 14 pt gain in AA-Omniscience from GPT-5.4 (xhigh) was largely driven by knowledge, with a modest improvement in hallucination.

Congratulations to the team at @OpenAI and @sama on the launch

63

2K

208

278

265K

Wu Haoning

@HaoningTimothy

about 2 months ago

It has become much smarter than previous generations!

Jasper Dekoninck @j_dekoninck

about 2 months ago

Kimi K2.6 becomes the #1 open model on MathArena!

3

245

10

17

33K

0

25

1

2

1K

Wu Haoning

@HaoningTimothy

about 2 months ago

@teortaxesTex Enjoy Kimi first! (We are trying our best to serve everyone)

0

3

0

53

Wu Haoning

@HaoningTimothy

about 2 months ago

@teortaxesTex Hope they are fast... (From my very personal perspective, day-0 open-source is better than day-X open-source, but I do not work for the business teams; I am just a model trainer.)

0

4

0

130

Wu Haoning

@HaoningTimothy

about 2 months ago

Shall we be looking forward to V4 now? Serving the burst of K2.6 has made us GPU poor again now😂

0

2

0

220

HaoningTimothy retweeted

Maksym Andriushchenko

@maksym_andr

about 2 months ago

💥 Kimi-K2.6-thinking is the new best open-weight model on HalluHard (without web search)!

K2.5 had 76.9% hallucination rate, whereas K2.6 now has 63.6%. Since our benchmark contains hard hallucination cases, this improvement is very notable.

Thank you @Kimi_Moonshot for providing API credits and @dyfan22 for running the eval!

Full results: https://t.co/BFzcZWC555
Paper: https://t.co/kKxIapQjkB

0

12

3

2

2K
