Sijun He

@SijunHe

Working on #ERNIE at @Baidu, used to struggle at @TwitterEng, @Stanford

Shenzhen, China

Joined September 2013

626 Following

395 Followers

441 Posts

SijunHe retweeted

Chuang Gan

@gan_chuang

7 months ago

ICLR has placed OpenReview in a difficult position, so I want to offer a few words about the OpenReview team working behind the scenes. OpenReview has long been operated at UMass Amherst as a non-profit organization founded by Andrew McCallum. Each year, Andrew must raise more than $2 million to support a 20-person team that provides essential infrastructure for most major conferences. I once asked Andrew what might have been a naïve question: whether he had considered developing a business model for OpenReview, given its prominence and the seemingly obvious opportunities. He pushed back, explaining that everything he has done for OpenReview is driven by a commitment to serve and strengthen the academic community. He is willing to devote significant personal effort to ensure the platform remains freely accessible to all. We should not blame such a brilliant and dedicated team for an accidental issue. Otherwise, fewer people would be willing to shoulder this kind of responsibility in the future. Deep respect to the OpenReview team! I’m grateful for their work and happy to support in any way!

987

135

178K

Sijun He @SijunHe

7 months ago

@finnlau_cd Ever wondered where all that bamboo comes from? Why would China prefer exporting metal scaffolding than (chinese-imported) bamboo?

110

Sijun He @SijunHe

7 months ago

This guy dropped the “best US open-weight model” — which is fine-tuned on a Deepseek base model — then proceed to thank a bunch of open-source orgs, not including Deepseek AI. Shameless...

Drishan Arora

@drishanarora

7 months ago

4/ For Cogito v2.1, we fork off the open-licensed Deepseek base model from November 2024. This is an obvious choice for a pretrained base model, as Deepseek architecture has an ecosystem of cheap inference built around it. We have built a frontier training stack, while being an early stage startup, since we can stand on the shoulders of open source champions like @huggingface, @togethercompute, @runpod and @nebiusai, as well as stellar contributions by @Microsoft, @Meta, @nvidia and a lot of other folks in open source. Over the last months, we have iterated and refined our post-training strategies of self-play + RL (called Iterated Distillation and Amplification - IDA) with Cogito v1 and v2. You will see high-quality responses from Cogito v2.1 while being a bit different from usual models - we increase the model’s intelligence prior and teach it how to think via process supervision. So there are significantly shorter reasoning chains for the responses. We also use less markdown, less verbosity. In short, we want to make the model great for API usage - faster, fewer tokens with super high quality.

32K

145

SijunHe retweeted

PaddlePaddle

@PaddlePaddle

12 months ago

🚀Excited to announce that the ERNIE 4.5 series models are officially open-sourced today! 🙌ERNIE 4.5 models achieved state-of-the-art performance across multiple text and multimodal benchmarks, especially in instruction following, world knowledge memorization, visual understanding and multimodal reasoning. ERNIE 4.5 models are trained using PaddlePaddle, achieving 47% MFU in the pre-training of the largest ERNIE 4.5 language model. ERNIE 4.5 models include 10 models, consist of MoE models with 47B and 3B active parameters, with the largest model having 424B total parameters, as well as a 0.3B dense model. Try it now👉https://t.co/531t0rVSWZ Hugging Face: https://t.co/xZHzlXh8UX GitHub: https://t.co/QkcxLgChbK AI Studio: https://t.co/Gyak0JYlLD

PaddlePaddle's tweet photo. 🚀Excited to announce that the ERNIE 4.5 series models are officially open-sourced today!

🙌ERNIE 4.5 models achieved state-of-the-art performance across multiple text and multimodal benchmarks, especially in instruction following, world knowledge memorization, visual understanding and multimodal reasoning. ERNIE 4.5 models are trained using PaddlePaddle, achieving 47% MFU in the pre-training of the largest ERNIE 4.5 language model.

ERNIE 4.5 models include 10 models, consist of MoE models with 47B and 3B active parameters, with the largest model having 424B total parameters, as well as a 0.3B dense model.
Try it now👉https://t.co/531t0rVSWZ

Hugging Face: https://t.co/xZHzlXh8UX
GitHub: https://t.co/QkcxLgChbK
AI Studio: https://t.co/Gyak0JYlLD

213

40K

Who to follow

Yilei Qian

@YileiQian

Post-training @OpenAI, Previously "The algorithm" @Twitter, @Coursera

Tu Vu

@tuvllms

Assistant Professor @VT_CS & @Google (part-time). PhD from @UMass_NLP. Google FLAMe/FreshLLMs/Flan-T5 Collection/SPoT #LLMs #NLProc #AI

Asa Cooper Stickland

@AsaCoopStick

Poking language models @ the UK AI Security Institute

SijunHe retweeted

Baidu Inc. @Baidu_Inc

about 1 year ago

Introducing ERNIE X1 Turbo & ERNIE 4.5 Turbo! Building on the success of ERNIE X1 and 4.5, the upgraded ERNIE X1 Turbo and 4.5 Turbo deliver results faster and cheaper. Both models stand out for their multimodal capabilities, strong reasoning and low costs. For X1 Turbo, input and output prices start at $0.14 per 1M tokens and $0.55 per 1M tokens, respectively, priced at approximately 25% of DeepSeek R1. #BaiduCreate2025 (Exchange rate used: 1 RMB ≈ 0.14 USD)

537

160

136K

Sijun He @SijunHe

over 1 year ago

@karpathy Wrote up my observations and implementation notes at https://t.co/jSgJa9qknA. The code is available at https://t.co/ongR8kF2uM - an implementation that prioritizes clarity over efficiency

105

Sijun He @SijunHe

over 1 year ago

Weekend hack: extended @karpathy's nanoGPT to play with MoE architecture! Despite being GPU-poor, I was able to validate some key insights from the Switch Transformers paper with my humble experiments. Learned a ton about scaling MoE and expert imbalance in the process.

SijunHe's tweet photo. Weekend hack: extended @karpathy's nanoGPT to play with MoE architecture! Despite being GPU-poor, I was able to validate some key insights from the Switch Transformers paper with my humble experiments. Learned a ton about scaling MoE and expert imbalance in the process. https://t.co/u9vodjq9cB

265

Sijun He @SijunHe

over 1 year ago

@julien_c All the best Julien!

Sijun He @SijunHe

almost 2 years ago

@awnihannun Nice! How does MLX compare to llama.cpp running on Metal backend when it comes to performance?

218

SijunHe retweeted

Andrej Karpathy

@karpathy

almost 2 years ago

# RLHF is just barely RL Reinforcement Learning from Human Feedback (RLHF) is the third (and last) major stage of training an LLM, after pretraining and supervised finetuning (SFT). My rant on RLHF is that it is just barely RL, in a way that I think is not too widely appreciated. RL is powerful. RLHF is not. Let's take a look at the example of AlphaGo. AlphaGo was trained with actual RL. The computer played games of Go and trained on rollouts that maximized the reward function (winning the game), eventually surpassing the best human players at Go. AlphaGo was not trained with RLHF. If it were, it would not have worked nearly as well. What would it look like to train AlphaGo with RLHF? Well first, you'd give human labelers two board states from Go, and ask them which one they like better: Then you'd collect say 100,000 comparisons like this, and you'd train a "Reward Model" (RM) neural network to imitate this human "vibe check" of the board state. You'd train it to agree with the human judgement on average. Once we have a Reward Model vibe check, you run RL with respect to it, learning to play the moves that lead to good vibes. Clearly, this would not have led anywhere too interesting in Go. There are two fundamental, separate reasons for this: 1. The vibes could be misleading - this is not the actual reward (winning the game). This is a crappy proxy objective. But much worse, 2. You'd find that your RL optimization goes off rails as it quickly discovers board states that are adversarial examples to the Reward Model. Remember the RM is a massive neural net with billions of parameters imitating the vibe. There are board states are "out of distribution" to its training data, which are not actually good states, yet by chance they get a very high reward from the RM. For the exact same reasons, sometimes I'm a bit surprised RLHF works for LLMs at all. The RM we train for LLMs is just a vibe check in the exact same way. It gives high scores to the kinds of assistant responses that human raters statistically seem to like. It's not the "actual" objective of correctly solving problems, it's a proxy objective of what looks good to humans. Second, you can't even run RLHF for too long because your model quickly learns to respond in ways that game the reward model. These predictions can look really weird, e.g. you'll see that your LLM Assistant starts to respond with something non-sensical like "The the the the the the" to many prompts. Which looks ridiculous to you but then you look at the RM vibe check and see that for some reason the RM thinks these look excellent. Your LLM found an adversarial example. It's out of domain w.r.t. the RM's training data, in an undefined territory. Yes you can mitigate this by repeatedly adding these specific examples into the training set, but you'll find other adversarial examples next time around. For this reason, you can't even run RLHF for too many steps of optimization. You do a few hundred/thousand steps and then you have to call it because your optimization will start to game the RM. This is not RL like AlphaGo was. And yet, RLHF is a net helpful step of building an LLM Assistant. I think there's a few subtle reasons but my favorite one to point to is that through it, the LLM Assistant benefits from the generator-discriminator gap. That is, for many problem types, it is a significantly easier task for a human labeler to select the best of few candidate answers, instead of writing the ideal answer from scratch. A good example is a prompt like "Generate a poem about paperclips" or something like that. An average human labeler will struggle to write a good poem from scratch as an SFT example, but they could select a good looking poem given a few candidates. So RLHF is a kind of way to benefit from this gap of "easiness" of human supervision. There's a few other reasons, e.g. RLHF is also helpful in mitigating hallucinations because if the RM is a strong enough model to catch the LLM making stuff up during training, it can learn to penalize this with a low reward, teaching the model an aversion to risking factual knowledge when it's not sure. But a satisfying treatment of hallucinations and their mitigations is a whole different post so I digress. All to say that RLHF *is* net useful, but it's not RL. No production-grade *actual* RL on an LLM has so far been convincingly achieved and demonstrated in an open domain, at scale. And intuitively, this is because getting actual rewards (i.e. the equivalent of win the game) is really difficult in the open-ended problem solving tasks. It's all fun and games in a closed, game-like environment like Go where the dynamics are constrained and the reward function is cheap to evaluate and impossible to game. But how do you give an objective reward for summarizing an article? Or answering a slightly ambiguous question about some pip install issue? Or telling a joke? Or re-writing some Java code to Python? Going towards this is not in principle impossible but it's also not trivial and it requires some creative thinking. But whoever convincingly cracks this problem will be able to run actual RL. The kind of RL that led to AlphaGo beating humans in Go. Except this LLM would have a real shot of beating humans in open-domain problem solving.

karpathy's tweet photo. # RLHF is just barely RL

Reinforcement Learning from Human Feedback (RLHF) is the third (and last) major stage of training an LLM, after pretraining and supervised finetuning (SFT). My rant on RLHF is that it is just barely RL, in a way that I think is not too widely appreciated. RL is powerful. RLHF is not. Let's take a look at the example of AlphaGo. AlphaGo was trained with actual RL. The computer played games of Go and trained on rollouts that maximized the reward function (winning the game), eventually surpassing the best human players at Go. AlphaGo was not trained with RLHF. If it were, it would not have worked nearly as well.

What would it look like to train AlphaGo with RLHF? Well first, you'd give human labelers two board states from Go, and ask them which one they like better:

Then you'd collect say 100,000 comparisons like this, and you'd train a "Reward Model" (RM) neural network to imitate this human "vibe check" of the board state. You'd train it to agree with the human judgement on average. Once we have a Reward Model vibe check, you run RL with respect to it, learning to play the moves that lead to good vibes. Clearly, this would not have led anywhere too interesting in Go. There are two fundamental, separate reasons for this:

1. The vibes could be misleading - this is not the actual reward (winning the game). This is a crappy proxy objective. But much worse,
2. You'd find that your RL optimization goes off rails as it quickly discovers board states that are adversarial examples to the Reward Model. Remember the RM is a massive neural net with billions of parameters imitating the vibe. There are board states are "out of distribution" to its training data, which are not actually good states, yet by chance they get a very high reward from the RM.

For the exact same reasons, sometimes I'm a bit surprised RLHF works for LLMs at all. The RM we train for LLMs is just a vibe check in the exact same way. It gives high scores to the kinds of assistant responses that human raters statistically seem to like. It's not the "actual" objective of correctly solving problems, it's a proxy objective of what looks good to humans. Second, you can't even run RLHF for too long because your model quickly learns to respond in ways that game the reward model. These predictions can look really weird, e.g. you'll see that your LLM Assistant starts to respond with something non-sensical like "The the the the the the" to many prompts. Which looks ridiculous to you but then you look at the RM vibe check and see that for some reason the RM thinks these look excellent. Your LLM found an adversarial example. It's out of domain w.r.t. the RM's training data, in an undefined territory. Yes you can mitigate this by repeatedly adding these specific examples into the training set, but you'll find other adversarial examples next time around. For this reason, you can't even run RLHF for too many steps of optimization. You do a few hundred/thousand steps and then you have to call it because your optimization will start to game the RM. This is not RL like AlphaGo was.

And yet, RLHF is a net helpful step of building an LLM Assistant. I think there's a few subtle reasons but my favorite one to point to is that through it, the LLM Assistant benefits from the generator-discriminator gap. That is, for many problem types, it is a significantly easier task for a human labeler to select the best of few candidate answers, instead of writing the ideal answer from scratch. A good example is a prompt like "Generate a poem about paperclips" or something like that. An average human labeler will struggle to write a good poem from scratch as an SFT example, but they could select a good looking poem given a few candidates. So RLHF is a kind of way to benefit from this gap of "easiness" of human supervision. There's a few other reasons, e.g. RLHF is also helpful in mitigating hallucinations because if the RM is a strong enough model to catch the LLM making stuff up during training, it can learn to penalize this with a low reward, teaching the model an aversion to risking factual knowledge when it's not sure. But a satisfying treatment of hallucinations and their mitigations is a whole different post so I digress. All to say that RLHF *is* net useful, but it's not RL.

No production-grade *actual* RL on an LLM has so far been convincingly achieved and demonstrated in an open domain, at scale. And intuitively, this is because getting actual rewards (i.e. the equivalent of win the game) is really difficult in the open-ended problem solving tasks. It's all fun and games in a closed, game-like environment like Go where the dynamics are constrained and the reward function is cheap to evaluate and impossible to game. But how do you give an objective reward for summarizing an article? Or answering a slightly ambiguous question about some pip install issue? Or telling a joke? Or re-writing some Java code to Python? Going towards this is not in principle impossible but it's also not trivial and it requires some creative thinking. But whoever convincingly cracks this problem will be able to run actual RL. The kind of RL that led to AlphaGo beating humans in Go. Except this LLM would have a real shot of beating humans in open-domain problem solving.

403

SijunHe retweeted

Hamel Husain

@HamelHusain

about 2 years ago

LLM bullshit knife, to cut through bs RAG -> Provide relevant context Agentic -> Function calls that work CoT -> Prompt model to think/plan FewShot -> Add examples PromptEng -> Someone w/good written comm skills. Prompt Optimizer -> For loop to find best examples.

525

290K

Sijun He @SijunHe

about 2 years ago

@_philschmid Great post! How is the synthetic training dataset created? Couldn't find that detail in the blog

238

Sijun He @SijunHe

about 2 years ago

@aria42 @RaniNelken 🤷

Sijun He @SijunHe

about 2 years ago

@RaniNelken @aria42 Well, this one comes to mind 😂

Elon Musk

@elonmusk

about 4 years ago

@paraga 💩

48K

437

234

SijunHe retweeted

Yann LeCun

@ylecun

over 2 years ago

Vinod, Chinese companies are very much on top of it already. Many have already developed their own home-grown LLMs from scratch. They're not far behind and they have some original ideas that could actually be useful to the rest of the world. They don't want to "steal" Western technology because (1) they don't need to, (2) they want sovereignty, (3) they want control.

106

137

182

276K

SijunHe retweeted

Stanford NLP Group

@stanfordnlp

almost 3 years ago

It’s the origin of attention! @DBahdanau & @kchonyc couldn’t afford Google’s large multi-GPU neural MT, so they thought of a better way. Us either. @lmthang & @chrmanning introduced simpler multiplicative attention. Then @GoogleAI folk wondered if attention is all you need…

339

108

170K

Sijun He @SijunHe

almost 3 years ago

Congrats! This is the right move. HF gotta pay the bills somehow and the Elastic license/model is a good choice

Jeff Boudier 🤗

@jeffboudier

almost 3 years ago

Today is a huge milestone for one of our latest libraries: Text Generation Inference - we released v1.0 and under a new license: HFOIL 1.0 https://t.co/7OTgLyM8T3 This 🧵 explains what this new license means, and why the change!

115

116

309K

349

SijunHe retweeted

martin grasser

@martingrasser

almost 3 years ago

Today we say goodbye to this great blue bird This logo was designed in 2012 by a team of three. @toddwaterbury, @angyche and myself, The logo was designed to be simple, balanced, and legible at very small sizes, almost like a lowercase "e", a 🧵

168K

50K

14K

14M

Sijun He @SijunHe

about 3 years ago

Great thread on AI Product and LLMs!

Tim Shi

@timshi_ai

about 3 years ago

New Post: Sharing some thoughts on the emerging tech stack of generative AI. LLMs != AI System != AI Product Thin wrappers are being built every day, but are they sufficient for vertical SaaS applications? Let's unpack 👉🚀 @cresta #GenerativeAI #GPT4 https://t.co/AEmK76YGh5

timshi_ai's tweet photo. New Post: Sharing some thoughts on the emerging tech stack of generative AI.

LLMs != AI System != AI Product

Thin wrappers are being built every day, but are they sufficient for vertical SaaS applications? Let's unpack 👉🚀 @cresta #GenerativeAI #GPT4

https://t.co/AEmK76YGh5 https://t.co/FE0wzMbv4z

125

29K

191

Sijun He @SijunHe

about 3 years ago

@mmbronstein nice watch!

194

Sijun He

@SijunHe

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users