Rubio Huang @HuangRubio - Twitter Profile

13 days ago

Submit your GUI model's API to test on Workflow Gym.
https://t.co/Wh7Q29dQ1C

I really love the showcase page [co-designed with Trae Agent], do you? https://t.co/lXylRye4y0

2

5

1

2

236

HuangRubio retweeted

Rui-Jie Zhu

@RidgerZhu

8 months ago

Thrilled to release our new paper: “Scaling Latent Reasoning via Looped Language Models.”

TLDR: We scale up looped language models to 2.6 billion parameters, pretrained on more than 7 trillion tokens. The resulting model is on par with SOTA language models 2 to 3x its size. https://t.co/6iauhVZ83g

23

691

149

463

176K

HuangRubio retweeted

Yong Shan @yshan783399

10 months ago

We are thrilled to introduce the Seed-OSS family of open-source LLMs, developed by ByteDance's Seed Team.

GitHub: https://t.co/lUNRuigqMA
HuggingFace: https://t.co/1WuQHpGcIo

Feel free to try it out and share your feedback! https://t.co/Q8kYgO1uT2

7

206

52

71

41K

Rubio Huang @HuangRubio

10 months ago

@elonmusk didn’t lie. Grok-4 sees the future. 🔮🚀 FutureX Leaderboard (live benchmark for real-world foresight) https://t.co/YYDPeBU2pJ

Jiashuo Liu @liujiashuo77

10 months ago

We built FutureX, the world’s first live benchmark for real future prediction: politics, economy, culture, sports, etc.
Among 23 AI agents, #Grok4 ranked #1 🏆
Elon didn’t lie.
@elonmusk your model sees further 🚀🍀

LeaderBoard: https://t.co/fwck0NROHZ https://t.co/L26mqqCxMp

219

1K

191

540

8M

0

1

0

60

HuangRubio retweeted

Daoguang Zan @zandaoguang

about 1 year ago

🔥 Can your LLM fix bugs beyond Python?
Meet our Multi-SWE-bench, the first multilingual benchmark for issue resolving.
Not just Python, but Java, TS, JS, Go, Rust, C, and C++ 🧩

💥 1,632 real-world issues
✅ Verified by 68 engineers
📦 Dockerized, reproducible, battle-tested
🧠 Covers easy, medium, and hard bug fixes
📊 Designed to benchmark LLMs as true dev agents

To scale beyond benchmarks, we also launch Multi-SWE-RL:
An open-source RL community to build interactive training environments for LLMs as autonomous agents.
🌱 4,723 containerized issue-resolving tasks, 7 languages, and counting.
🤝 We invite the community to contribute, expand, and shape the future of software-native RL.

It took us a year to build.
Now let’s see what your model can do.

🏆 Leaderboard: https://t.co/o3FyhQBjgp
📄 Paper: https://t.co/fMyR8vM7bu
🧬 Code: https://t.co/zPfnScyFyj
📚 Multi-SWE-bench Dataset: https://t.co/6aEMNCiyBk
🎮 Multi-SWE-RL Dataset: https://t.co/0yIPbFiz5M

#LLM #RL #SWEbench #OpenAI #Anthropic #DeepSeek #Doubao

8

46

10

29

13K

Rubio Huang @HuangRubio

over 1 year ago

Great MOE kernels

Haibin @eric_haibin_lin

over 1 year ago

❗️Open source MOE kernels alert❗️
Introducing COMET, a computation/communication library for MoE models from ByteDance. Battle-tested in our 10k+ GPU clusters, COMET shows promising efficiency gains and significant GPU-hour savings (millions 💰💰💰).

Integration of DualPipe & DeepEP requires too much effort? Try COMET, a drop-in replacement for your MOE block!

Key Points:
✅ Deployed on 10K+ GPU cluster, saved MILLIONS of GPU hours
✅ 1.96x layer-wise speedup, 1.71x end-to-end boost for MoE models
✅ Fine-grained computation-communication overlapping for MoE

Why devs care:
📌 Plug-and-play with existing frameworks (just a few lines of code change)
📌 Supports ALL MoE parallel modes: TP/EP/EP+TP
📌 MLSys'25 top scores (5/5/5/4), battle-tested at scale

📄 Paper: https://t.co/T8lL1V1g9t
📦 Code: https://t.co/cPBN3prBQA

Great work done by Shulai, @NingxinZheng_ and team

#OpenSource #LLM #MOE #MLSys2025 #CUDA

8

218

57

133

83K

0

85

HuangRubio retweeted

Ge Zhang @GeZhang86038849

over 1 year ago

[1/n]

SuperExcited to announce SuperGPQA!!!
We spent more than half a year to finally get it done!
SuperGPQA is a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines.
It also provides the largest human-LLM collaborative, high-quality benchmark annotation practice!

We thank https://t.co/3wo3OclltR and https://t.co/W10Le33GhQ for their sponsorship!

Resources:
Website: https://t.co/rycXK6loPJ
Huggingface: https://t.co/2XLu2RTs2t
Github: https://t.co/2HcE5n0uuZ
Paper: https://t.co/vzoiz0bVcK
HF Paper: https://t.co/DgMZFoxfMj

5

209

46

67

29K

HuangRubio retweeted

Qian Liu

@sivil_taram

over 1 year ago

🎉 Announcing the first Open Science for Foundation Models (SCI-FM) Workshop at #ICLR2025! Join us in advancing transparency and reproducibility in AI through open foundation models.

🤝 Looking to contribute? Join our Program Committee: https://t.co/nWjz2GOGwc

🔍 Learn more at: https://t.co/Y5tJBSRoow

#OpenScience #MachineLearning #FoundationModels

1/N

6

173

42

43

41K

HuangRubio retweeted

Ge Zhang @GeZhang86038849

over 1 year ago

[1/n]
🎉 We are very pleased to introduce FineFineWeb, which is currently the largest open-source, fully automatic classification practice for fine-grained web data. Specifically, our contributions are as follows:
🔪 We decompose the entire deduplicated version of FineWeb into 67 categories with a significant amount of seed data.
🧮 We conduct a correlation analysis between vertical categories, as well as between vertical categories and common benchmarks for FineFineWeb, and also provide a distribution analysis of URLs and other content.
🧑‍⚖️ We provide test sets for PPL evaluation based on the 67 selected vertical domains of FineFineWeb, and offer a "small cup" (Validation) and a "medium cup" (Test).
🪙 We provide all the full-process materials for training fastText and BERT.
📅 We will give suggestions on data proportioning based on our dataset. (Based on RegMix, coming soon in our report! [Due to tight computing power, it will be as soon as possible])

7

160

44

77

24K

Rubio Huang @HuangRubio

over 1 year ago

+1

Junyang Lin

@JustinLin610

over 1 year ago

No visa. Can't go to NeurIPS. ( QwQ

35

343

6

7

70K

0

1

0

47

HuangRubio retweeted

Zekun Wang (ZenMoore) 🔥 @ZenMoore1

over 1 year ago

Models and code are available at https://t.co/kwULVZrtYS.

0

5

2

0

526

HuangRubio retweeted

Ge Zhang @GeZhang86038849

over 1 year ago

[1/n]
🔥 Happy to introduce FullStack Bench: a comprehensive evaluation dataset focusing on full-stack programming across 16 languages and more than 11 real-world application domains like data analysis, software engineering, and machine learning.

Is your CodeLLM a FullStack Coder rather than a LeetCode nerd?

It's time to put your code LLMs to the test!!! 📝

11

134

33

63

47K

HuangRubio retweeted

Ge Zhang @GeZhang86038849

over 1 year ago

[1/n] ### Discover AutoKaggle: Revolutionizing Data Science Competitions with Multi-Agent Collaboration! 🚀

Introducing AutoKaggle, a multi-agent framework designed to automate the full spectrum of data science competitions on Kaggle! From background understanding to model prediction, AutoKaggle takes on all phases, boosting efficiency and reducing manual overhead.

💡 Highlights of AutoKaggle:
🛠️ Phase-based workflow: Six key phases (Understanding, EDA, Cleaning, Feature Engineering, Model Building).
🤖 Five specialized agents: Reader, Planner, Developer, Reviewer, Summarizer.
🔁 Iterative debugging & unit testing for robust, correct code generation.
📊 Built-in ML tools library to handle data cleaning, feature engineering, and modeling.
🤤 Flexible customization support for the ML tools library lets you drive the workflow as you want.

7

152

36

76

15K

HuangRubio retweeted

Ge Zhang @GeZhang86038849

over 1 year ago

[1/n]
### Exploring the Boundaries of AI Reasoning: Launch of KOR-Bench

🚀 To more accurately assess large models' reasoning in new, unfamiliar areas, we’re thrilled to introduce the all-new KOR-Bench (Knowledge-Orthogonal Reasoning Benchmark)!

### 💡 Highlights of KOR-Bench:

> 5 categories (🔢Operation, 🔍Logic, 🔐Cipher, 🧩Puzzle, 📖Counterfactual) assess reasoning from multiple perspectives, using 25 custom rules 📜 with 10 problem ❓ instances each, ensuring rules are orthogonal to pre-training data.

> Minimizes reliance on pre-trained knowledge by testing large language models' ability to solve new rule-driven questions using new rule descriptions, ensuring a fairer evaluation of models' true reasoning skills.

> Encourages models to break traditional frameworks and adapt to non-standard challenges, revealing abilities in reading comprehension, immediate learning, knowledge transfer, logical reasoning, and problem-solving.

🔗 #Reasoning #KOR Bench #Large Language Models #Benchmark

3

53

14

35

5K

HuangRubio retweeted

Rhymes.AI @rhymes_ai_

over 1 year ago

Really humbled that Aria from Rhymes AI is so well received by the @huggingface community as 🔥#1 Daily Paper🔥. Awesome work from @LiJunnan0409 and our multimodality native team! https://t.co/qWhAAGr3Zs

📢 Blog: https://t.co/wBW6kbIYD8
📖 Tech Report: https://t.co/gGz7okl403
⏬ Model: https://t.co/w2ekCpAu0A
🔧 Github: https://t.co/FtuPBbxh5A
👀 Demo (scroll down to Chat): https://t.co/j7z3WBsLGB

31

46

8

11

6K

Rubio Huang @HuangRubio

over 1 year ago

Congrats @GeZhang86038849

JB @IAMJBDEL

over 1 year ago

HuggingFace Paper-central now hosts open-source leaderboards.
This is like an h-index but for 🤗 artifacts. Discover the authors whose papers have attracted the most open-source artifacts (datasets, models or spaces), and the most active contributors who have developed artifacts associated with papers.

2

44

15

20

35K

0

30

HuangRubio retweeted

JB @IAMJBDEL

over 1 year ago

HuggingFace Paper-central now hosts open-source leaderboards.
This is like an h-index but for 🤗 artifacts. Discover the authors whose papers have attracted the most open-source artifacts (datasets, models or spaces), and the most active contributors who have developed artifacts associated with papers.

2

44

15

20

35K

HuangRubio retweeted

Yizhi Li @yizhilll

over 1 year ago

Exciting news! We're thrilled to introduce OmniBench: a groundbreaking benchmark for evaluating omni-language models (OLMs) that can process visual, acoustic, and textual inputs simultaneously! 🖼️🔊📝 https://t.co/K438PYY7wm #Multimodal #LLM

1

15

8

3

3K

HuangRubio retweeted

Wenhu Chen @WenhuChen

almost 2 years ago

A sad truth about evaluation is that: If you make a private test set for your benchmark, people just won't adopt it. We have our official MMMU private test set hosted in EvalAI (https://t.co/PsEuP1XQLd), but everyone is still reporting validation score. I found it's similar for MathVista, where everyone is just reporting testmini score.

9

196

11

65

83K

Rubio Huang @HuangRubio

about 2 years ago

@__kolesnikov__ It seems DMs don't work. Do you have time for a coffee chat?

0

82
