Jingang Wang @bitwjg - Twitter Profile

6/ RSI is getting more practical. Near-term self-improvement probably looks like: auto-generate tasks, sample failures, improve harnesses, select RL data, update skills, and run self-verify -> self-train loops.

0

44

Jingang Wang @bitwjg

4 days ago

May takeaway on Coding Agents and AI-for-AI: the frontier is moving from "which base model is best?" to "who has the best training data, verifier, harness, and eval loop?"

6

1

0

49

Jingang Wang @bitwjg

4 days ago

5/ Harness became a core asset. LiteCoder-Terminal, unix-ctf, and agent failure taxonomies all point to the same bottleneck: better coding agents need better executable environments, not just better prompts.

0

22

Jingang Wang @bitwjg

4 days ago

4/ The data unit is changing. For agent SFT/RL, a "successful patch" is too coarse. The useful unit is the state transition: what the agent saw, tried, verified, failed, recovered from, and changed.

0

17

Jingang Wang @bitwjg

4 days ago

3/ RL post-training matured fast. May moved from rollout efficiency and cold-start tricks, to GRPO/credit assignment, to RL data selection: SAERL, label-free RL, verifier-free rewards, and offline RL for code.

0

46

Jingang Wang @bitwjg

4 days ago

2/ Eval is fragmenting. SpecBench, AgentLens-style trace diagnostics, FastKernels, TerminalBench-related work, DeepSWE, and SWE-bench Multimodal all point to one thing: final pass/fail is not enough.

0

79

Jingang Wang @bitwjg

4 days ago

1/ SWE-bench looked stable at the top in May. Public Verified scores hovered around 79%, while new work pushed toward harder evals: multimodal SWE, terminal tasks, production kernels, long-horizon agents, and safety.

0

27

Jingang Wang @bitwjg

4 days ago

@DhruvJain08 @thsottiaux YES, PLEASE!!!

0

50

Jingang Wang @bitwjg

4 days ago

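@op7418 It looks like frequent reconnecting is a common problem; I had always assumed it was an issue with my own VPN. There is also the problem that the remote connection always times out during automatic context compaction.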

0

5

0

2K

Jingang Wang @bitwjg

10 days ago

A worthwhile read. "The Ultra-Long Context Paradox" by Chen Zhang, the core contributor of Longcat Zigzag Attention. https://t.co/saps520U1A

Chen Zhang @ChenZhang0212

12 days ago

"Long context is a solved problem." It isn't. Context rot. Context anxiety. Lossy compression that compounds over hours. I wrote about why the answer is to continue scaling: it's to push past 1M tokens. https://t.co/8BR599To0C

0

2

1

0

108

0

2

0

65

bitwjg retweeted

Chen Zhang @ChenZhang0212

12 days ago

"Long context is a solved problem." It isn't. Context rot. Context anxiety. Lossy compression that compounds over hours. I wrote about why the answer is to continue scaling: it's to push past 1M tokens. https://t.co/8BR599To0C

0

2

1

0

108

Jingang Wang @bitwjg

10 days ago

@lorenlugosch Jason was probably constrained by a non-compete agreement, so he couldn’t use his real name.

0

8

0

763

Jingang Wang @bitwjg

17 days ago

@tvytlx There is no consensus yet; each harness framework has its own abstractions and implementation.

1

2

0

3K

bitwjg retweeted

Wenhao Chai @ CVPR 2026

@wenhaocha1

25 days ago

Introducing MLS-Bench for machine learning science. Auto research built on coding agents is undoubtedly another major market beyond SWE coding. It is harder and more challenging. However, we believe there are two different categories here. Auto research from @karpathy , MLE-Bench, and PostTrainBench are one type of attempt: engineering. Agents are asked to optimize a specific engineering objective, but we do not require them to produce transferable, generalizable behavior. MLS-Bench contains 140 tasks across 12 domains. Each task requires an agent to improve a specific component of an ML system or algorithm, and to demonstrate that the improvement generalizes and scales under controlled settings. We find that current agents are still far from consistently outperforming human-designed methods, and that engineering-style tuning is much easier for them than genuine method invention. @Lyubh22 is the lead of this project. I was deeply impressed by the way he used Discord to organize agent trajectories and share them with the team. Paper: https://t.co/pr2DgBcQto Code: https://t.co/mvoirci6o6 Website: https://t.co/jmJMwfgChL

9

193

39

99

74K

Jingang Wang @bitwjg

29 days ago

@natolambert Thanks for visiting.

0

1

0

484

Jingang Wang @bitwjg

about 2 months ago

@natolambert I think it is also feasible to update the knowledge cutoff time during the CPT/mid-training phase.

0

27

bitwjg retweeted

Denny Zhou

@denny_zhou

4 months ago

Intelligence will not become a commodity. Intelligence will not be too cheap to meter. In the future, the real gap between people will be defined by their access to intelligence. Just as a car lets an average person outrun any world champion, AI lets ordinary minds solve problems the greatest minds of the past could not. Be prepared for this revolution.

69

448

41

148

43K
