hzwer

@hzwer

Stepfun

Joined March 2019

52 Following

225 Followers

63 Posts

hzwer retweeted

StepFun @StepFun_ai

18 days ago

⚡️ Step 3.7 Flash is here: The new frontier is agent efficiency.

#1 ClawEval-1.1 (67.1), #1 SimpleVQA Search (79.2), #2 SWE-PRO (56.3), 95.3 on V* Python. Open weights under Apache 2.0.

Built for agentic, coding, search, and multimodal workflows, balancing speed, cost, and reliable execution.

- 400 TPS. 198B sparse MoE, ~11B active. 256K context, 3 reasoning levels.
- Understands UIs, charts, docs, images, then writes code or calls tools to act on what it sees.
- Web + visual search reaches further: more sources, deeper follow-up.
- Reliable tool use: less drift, fewer broken tool calls. 98%+ on τ²-bench across all difficulty levels.
- Works with Claude Code, KiloCode, Hermes Agent, OpenClaw, and protocols like MCP.
- Runs locally on Mac Studio M4 Max, DGX Spark, AMD AI Max+ 395.

GitHub: https://t.co/kqlZkVIRHv
HuggingFace: https://t.co/qqceCrgPiw
GGUF: https://t.co/rR6XrnymWG
ModelScope: https://t.co/wney6Tzvqy
API: https://t.co/RvHWzRG7Fu
Blog: https://t.co/BxDiajiQ5G

120

212

620

340K

hzwer retweeted

StepFun @StepFun_ai

about 2 months ago

StepAudio 2.5 TTS is live now! Control emotion, pacing, pauses, and delivery with plain natural language. No tags, no preset combos. Just describe what you want the voice to do. Zero-shot voice cloning with full timbre + emotion control. Available via Pay-as-you-go API or Step Plan.

172

117

17K

hzwer @hzwer

3 months ago

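9️⃣ "Keep workflow code clean. Keep and maintain the tools that are useful; delete or ignore throwaway scripts once you are done with them. Do not put junk into git." 🔟 "You are not just completing tasks, you are on duty. Patrol even when nobody asks you to: check Codex, check progress, check for anomalies, check for anything that is stuck. Proactively finding problems is worth 10x more than passively waiting for instructions." #openclaw #codex #科研 #水产市场 #小龙虾 #生产力工具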

228

hzwer @hzwer

3 months ago

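I've recently collected a few tips for using openclaw. Grab your crayfish and have it learn the following lines: 1️⃣ "User messages must get a reply within seconds. Any operation that takes >5s goes to the background; the foreground only handles quick commands and sending messages." 2️⃣ "Think from first principles. Do not assume the user knows exactly what they want or how to get it. Start from the original need and the essence of the problem, analyze carefully, and only then act."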

233

hzwer @hzwer

3 months ago

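6️⃣ "Try to keep the context under 100k. Larger contexts get slow, crash, and drop messages. Do compaction proactively; do not wait until it overflows." 7️⃣ "Commit early, commit often. Commit + push as soon as you change something; do not let changes pile up locally." 8️⃣ "Codex may be missing environment variables (API key, proxy). Check before launching, and require a git commit as part of the task."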
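The pre-launch check in tip 8 could look roughly like the sketch below. This is only an illustration: the post names "API key, proxy" without giving variable names, so `OPENAI_API_KEY` and `HTTPS_PROXY` are assumptions, and `missing_env_vars` and `preflight` are hypothetical helpers, not part of Codex or openclaw.

```python
import os

# Assumed names: the post only says "API key, proxy"; adjust to your setup.
REQUIRED_VARS = ("OPENAI_API_KEY", "HTTPS_PROXY")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


def preflight(required=REQUIRED_VARS, env=None):
    """Raise before launching the agent if any required variable is missing."""
    missing = missing_env_vars(required, env)
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
```

Calling `preflight()` right before starting a background task makes a missing key fail immediately instead of partway through the run.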

152

hzwer retweeted

StepFun @StepFun_ai

4 months ago

"can we get the base model?" sure. here's two.
"can we get the code?" sure. here's SteptronOSS.
"what about the SFT data?" coming soon.

maximum sincerity, minimum barriers.

- Step 3.5 Flash Base: pretrained foundation
- Step 3.5 Flash Base-Midtrain: code, agents & long-context
- SteptronOSS: open-sourced, ready for your custom workflows
- SFT Data: coming soon for reference

not just the final checkpoint, but a customizable pipeline.

🤗 https://t.co/pVaFiJ87UT 🤗 https://t.co/YDOZUcqgXh 💻 https://t.co/G42LCHmQMr

119

465

146K

hzwer @hzwer

4 months ago

CVPR 2026 accepted 🥳

AI Native Foundation

@AINativeF

about 1 year ago

7. ViStoryBench: Comprehensive Benchmark Suite for Story Visualization

🔑 Keywords: story visualization, generative models, evaluation benchmark, ViStoryBench, character consistency

💡 Category: Generative Models

🌟 Research Objective: The paper aims to improve the performance of story visualization frameworks by proposing a novel evaluation benchmark that assesses models across diverse story types and artistic styles.

🛠️ Research Methods: The study introduces ViStoryBench, which features a carefully curated dataset that evaluates models on various narrative structures, visual aesthetics, and different plot types, ensuring comprehensive comparisons through a wide range of evaluation metrics.

💬 Research Conclusions: The framework allows researchers to identify the strengths and weaknesses of different models, particularly in maintaining character consistency, handling complex plots, and generating accurate visuals, thus fostering targeted improvements in the field of story visualization.

👉 Paper link: https://t.co/Zwh4edvsh7

288

222

hzwer retweeted

Jasper Dekoninck @j_dekoninck

4 months ago

We now added GLM-5 to our leaderboard! It's the second-best open model on MathArena, but it does significantly underperform compared to other models on ArXivMath, especially given its great competition-problem performance.

hzwer retweeted

Jasper Dekoninck @j_dekoninck

4 months ago

AIME 2026 is now complete and fully available on MathArena (and HuggingFace)🎉

156

15K

hzwer @hzwer

4 months ago

AIME26 Step 3.5 Flash ⚡️⚡️⚡️

159

hzwer @hzwer

4 months ago

@teortaxesTex It's coming. For now, look forward to step 3.6 😆

hzwer @hzwer

4 months ago

While writing the tech report I asked Gemini pro to check the tables, and it said the Step 3.5 Flash numbers were wildly exaggerated 🥹

562

hzwer @hzwer

4 months ago

CF-Div2-Stepfun: 53 Codeforces problems (Sep 2024–Feb 2025) + full tests & checkers, built for testing LLM competitive programming performance 🔥 https://t.co/WeowZQ3FlD #LLM #benchmark

hzwer retweeted

Zhihu Frontier

@ZhihuFrontier

4 months ago

🚀 What Benchmark Design Tells Us About the Result of Step 3.5 Flash?

Here's a detailed breakdown from model infra engineer & Zhihu contributor P2oileen, who worked directly on the benchmarking infrastructure.

💬 "If high scores can't be reproduced, a tech report is just paper."

Since June last year, he's been working on model evaluation at @StepFun_ai, focusing on one core goal: making training metrics reliable, reproducible, and trustworthy. And Step 3.5 Flash (196B MoE, A11B) did deliver with good results on different benchmarks.

🧱 Benchmarking at Scale: Lessons from the Infrastructure

Our evaluation platform now integrates 300+ internal and external benchmarks. Making these scores "real" required fixing a lot of invisible problems:

1️⃣ Failure-aware evaluation
API timeouts, service instability, sandbox crashes: these happen constantly at scale. Instead of silently assigning low scores, we now:
• Capture failures across model calls, judgers, tokenizers, and sandboxes
• Explicitly report failure rates and causes in the UI

2️⃣ Train–Inference consistency
We moved from "convert weights → deploy → test" to in-training evaluation with guarded inference services, eliminating misalignment between training and evaluation.

3️⃣ Managing 300+ benchmarks without chaos
Early on, one maintainer + manual reviews = daily firefighting. Now we have:
• Standardized onboarding & release processes
• Clear ownership for every benchmark
• AI-powered reviewers
• Experiment with Agents auto-integrating benchmarks

🔍 Making Scores Credible
To ensure metrics actually mean something:
• Reproduce competitor results internally to verify alignment
• Check for data contamination by deduplicating evaluation sets
• Reduce false negatives via:
  - Prompt engineering (e.g. \boxed{} answer formats)
  - Upgraded LLM-based judgers
  - Fixing brittle answer extractors
• Statistical testing for small datasets (mean + variance)
• Pretrain-specific strategies: output truncation, stop tokens, few-shot pattern stabilization
• Prompt alignment is critical. They open-source all System + Question Prompts in the appendix; reproducibility matters more than prompt tricks.

🧠 Reasoning-Heavy & Long-Context Tasks
Step 3.5 Flash produces long, reasoning-dense outputs. We added:
• Token-level reasoning length monitoring
• Streamed inference optimizations (socket buffers, keep-alive strategies)
These infra-level details are unglamorous, but they're what make high scores on AIME, IMO, FRAMES actually possible.

📌 Final Takeaway
After 8 months of large-model evaluation, one lesson stands out:
✨ Evaluation shouldn't just "follow" training; it should slightly lead it.
For Step 3.5 Flash, strong results came from carefully chosen benchmarks and well-designed evaluation protocols. Our next goal is to formalize this intuition into scalable quality checks, so evaluation certainty can offset training uncertainty.

We hope this experience is useful to the community, and that it helps power Step 4 and beyond 🚀 Feel free to try Step 3.5 Flash, and happy to discuss benchmarking with fellow practitioners.

📖 Read more: https://t.co/EDn3hPGKFd
🔗 Full evaluation protocols: https://t.co/vJKWwbSY8Q

#Step35Flash #OpenSource #AI #LLM #MoE #Reasoning #Agent #AIInfra
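The "failure-aware evaluation" and "mean + variance" points in the post above can be illustrated with a small sketch. It is not StepFun's implementation: the `Sample` record, the `summarize` and `mean_and_variance` helpers, and the cause labels are all hypothetical, chosen only to show failed runs being reported separately instead of being scored as wrong.

```python
from dataclasses import dataclass
from statistics import mean, variance
from typing import Optional


@dataclass
class Sample:
    # None means the run itself failed (timeout, sandbox crash, ...),
    # as opposed to the model giving a wrong answer.
    correct: Optional[bool]
    failure_cause: Optional[str] = None


def summarize(samples):
    """Score only completed runs; report failures and their causes separately."""
    scored = [s for s in samples if s.correct is not None]
    failed = [s for s in samples if s.correct is None]
    causes = {}
    for s in failed:
        key = s.failure_cause or "unknown"
        causes[key] = causes.get(key, 0) + 1
    return {
        "accuracy": mean([1.0 if s.correct else 0.0 for s in scored]) if scored else None,
        "failure_rate": len(failed) / len(samples) if samples else 0.0,
        "failure_causes": causes,
    }


def mean_and_variance(run_scores):
    """Mean and sample variance over repeated runs (needs at least two runs)."""
    return mean(run_scores), variance(run_scores)
```

With two completed runs (one correct) and two infrastructure failures, `summarize` reports an accuracy of 0.5 alongside a failure rate of 0.5, rather than folding the failures into a score of 0.25.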

970

hzwer retweeted

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)

@teortaxesTex

4 months ago

StepFun Step 3.5-Flash Tech Report is here! And it's great!
- they compare against absolute frontier (Gemini Pro/Opus/5.2-xhigh), primarily in agency. 74.4 SWE-Bench etc.
- 4,096 H800s, 17.2T tokens, Muon (fancy again)
- PaCoRe is their "Heavy" mode
- lots of details on training

136

20K

hzwer retweeted

Yasmine

@CyouSakura

4 months ago

We’re #1 on OpenRouter trending today! 🚀 @StepFun_ai

Sustaining ~160 tokens/s on Hopper GPUs.

Technical report coming soon.

OpenRouter: https://t.co/wkKNREp2CQ
Blog: https://t.co/xm8Hk6tyP3

154

20K

hzwer retweeted

ModelScope

@ModelScope2022

4 months ago

Stepfun open-sourced Step-3.5-Flash, a powerhouse model specifically architected for high-speed reasoning and complex Agentic workflows. 🚀

Model: https://t.co/0Z6oFwJ9kI

Key Technical Specs:
✅ Sparse MoE Architecture: 196B total params, but only ~11B active per token. SOTA efficiency.
✅ MTP-3 (Multi-Token Prediction): It predicts 3 tokens at once, hitting a blistering 350 TPS for code-heavy tasks. ⚡
✅ Hybrid Attention (SWA + Full): A 3:1 mix that masters 256K context windows while keeping compute costs low.
✅ Parallel Thinking: Massively boosted performance for multi-step reasoning and deep search.

Why Devs should care:
- Built for Agents: Excels at long-chain task decomposition and cloud-edge collaboration.
- Local-First Optimization: Runs smoothly on NVIDIA DGX Spark, Apple M3/M4 Max, and AMD AI Max+ 395. 💻

Benchmarks show it rivals top-tier closed-source models in math and coding scenarios, making it the perfect "copilot" for autonomous systems.
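Two of the specs quoted above reduce to simple arithmetic, sketched below. The parameter counts come from the post; everything else is an assumption for illustration: the post does not say which attention type is the "3" in the 3:1 mix or how layers are ordered, so `hybrid_layer_pattern` assumes three sliding-window layers followed by one full-attention layer, and both function names are hypothetical.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of parameters used per token in a sparse MoE model."""
    return active_params_b / total_params_b


def hybrid_layer_pattern(num_layers: int, swa_per_full: int = 3) -> list[str]:
    """Assumed layout: `swa_per_full` sliding-window layers, then one full layer."""
    period = swa_per_full + 1
    return ["full" if (i + 1) % period == 0 else "swa" for i in range(num_layers)]


if __name__ == "__main__":
    # 11B active out of 196B total is roughly 5.6% of the weights per token.
    print(f"{active_fraction(196, 11):.1%}")
    print(hybrid_layer_pattern(8))
```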

192

35K

hzwer retweeted

Zephyr

@zephyr_z9

4 months ago

Bruh. Somebody test StepFun Flash 3.5, the performance is insane for a 196B (11B active) model. U can easily run it on a high-end Mac

438

141

67K

hzwer retweeted

Yasmine

@CyouSakura

4 months ago

Fast enough to think. Reliable enough to act.

Step-3.5-Flash is here @StepFun_ai⚡

Powering the next wave of intelligence, from real-time reasoning to reliable agentic action.

We are so back. 🚀

Website: https://t.co/tQRWfXOBwW
Blog: https://t.co/Ezb8uUvQCY
Github: https://t.co/2zI8JtmI43
Huggingface (mtp3_bf16): https://t.co/odrPo9rKTc
API Platform: https://t.co/JFZUYRQ4Jt
Openrouter: https://t.co/e9Q67oEvyv
Discord: https://t.co/oLpLYAJd2i

772

345

98K
