Yuqing Yang @yyqcode - Twitter Profile

Pinned Tweet

about 1 month ago

🧵 1/8 What should an LLM assistant remember across conversations? Existing memory work studies this one task at a time. But real-world assistants see all kinds of conversations, and that changes the problem. Introducing BEHEMOTH 🦣 + CluE 🌱: a benchmark & self-evolving method for heterogeneous memory extraction. 📄 Paper: https://t.co/szLIOdA4bm

yyqcode retweeted

Deqing Fu

@DeqingFu

16 days ago

Excited to share that I’ve started at @GoogleResearch as a student researcher today. I'll be working on tabular foundation models. Come and chat if you are around at Google or in the Bay Area.

Yuqing Yang @yyqcode

14 days ago

Excited to share that I've started my summer internship at SystemsResearch@Google in Sunnyvale, working on agentic environment generation! Always happy to chat about coding agents or LLM memory too. If you're around the Bay Area, would love to meet up.

yyqcode retweeted

Linxin Song

@linxins2

about 2 months ago

The future risk of computer-use agents won’t come only from malicious prompts. It will come from agents that can flawlessly follow normal instructions straight into harm. Introducing 𝐎𝐒-𝐁𝐥𝐢𝐧𝐝: a realistic but overlooked setting where every task begins with a benign user instruction, yet the harmfulness only emerges as the agent acts in the environment.

yyqcode retweeted

Deqing Fu

@DeqingFu

about 1 month ago

New paper: Convergent Evolution: How Different Language Models Learn Similar Number Representations. Language models, classical word embeddings, and even raw token frequencies all develop the same Fourier features for numbers. But only some develop the underlying structure. 🧵

yyqcode retweeted

Deqing Fu

@DeqingFu

about 1 month ago

After three papers on Fourier features in LLMs, I think there's a principle worth naming. How should we do science on an LLM? It corresponds to the existential questions:
> who am I? ↔ the phenomenon.
> where do I come from? ↔ the emergence.
> where am I going? ↔ the use.
🧵

Yuqing Yang @yyqcode

about 1 month ago

8/8 Both artifacts may find use beyond this paper. 🦣 BEHEMOTH as a testbed for diverse memory extraction approaches (self-evolving, routing-based, skill-based, and beyond). 🌱 CluE for any setting where one agent must handle heterogeneous demands, e.g. serving users with distinct habits. w/ @TengxiaoLiu, @BillJohn1235813, @taiwei_shi, @linxins2, @robinomial Check out the paper & code if this resonates!

Yuqing Yang @yyqcode

about 1 month ago

7/8 Bonus findings:

• CluE preserves strengths when starting from a stronger seed
• Transfers to Gemini-3-Flash backend
• Single-step gains carry over to continual memory settings
• Produces clean, structured taxonomies, not bloated rule lists

yyqcode retweeted

Wang Bill Zhu

@BillJohn1235813

about 1 month ago

Frontier LLMs don't debug, they regenerate. We built PDB to measure that gap: GPT-5.1-Codex passes unit tests >76% of the time, but touches only <45% of the right lines. Even Claude Code touches only ~50%. 📄 Paper: https://t.co/OHvjcqAwJa 🌐 Project: https://t.co/CraU9xeUKg

Yuqing Yang @yyqcode

2 months ago

Coding agents running 24/7 will unlock a lot of breakthroughs 🚀. Easy to feel like we're being replaced 😨. But the real question: What can we learn from this, and where do they still fall short? New blog ⬇️

Tengxiao Liu

@TengxiaoLiu

2 months ago

Auto research is on 🔥 We give algorithmic problems (like circle packing) to general coding agents and let them run overnight. 🌙 Agents reach SoTA. But more importantly: we analyze 100+ hours of trajectories to understand how they get there 🧵

Yuqing Yang @yyqcode

6 months ago

A practical insight!

Tengxiao Liu

@TengxiaoLiu

6 months ago

🏧Giving your agent unlimited tool calls doesn't make it smarter. 💡Why? It lacks 'Budget Awareness'! Introducing Budget Tracker, a simple plug-in that enables more effective scaling behaviors: higher performance, lower cost. Paper: https://t.co/aKm2Tzt1wx

yyqcode retweeted

Johnny Tian-Zheng Wei @johntzwei

7 months ago

Announcing 🔭✨Hubble, a suite of open-source LLMs to advance the study of memorization! Pretrained models up to 8B params, with controlled insertion of texts (e.g., book passages, biographies, test sets, and more!) designed to emulate key memorization risks 🧵

yyqcode retweeted

Chenxin An @AnChancy46881

12 months ago

# 🚨 4B open-recipe model beats Claude-4-Opus 🔓 100% open data, recipe, model weights and code. Introducing Polaris✨--a post-training recipe for scaling RL on advanced reasoning models. 🥳 Check out how we boost open-recipe reasoning models to incredible performance levels (65 → 79 on AIME25) through RL training on open-source data and academic-level resources. 📑Notion: https://t.co/k5ITJFzCe1 📗Blog post: https://t.co/Leth9PWSod 🤗Model & data: https://t.co/SVdfIwYTrU 💻Code: https://t.co/txg0qcywWi

yyqcode retweeted

Xi Ye

@xiye_nlp

12 months ago

There’s been hot debate about (The Illusion of) The Illusion of Thinking. My take: it’s not that models can’t reason — they just aren’t perfect at long-form generation yet. We eval reasoning models on LongProc benchmark (requiring generating 8K CoTs, see thread). Reasoning models actually outperform instruction-tuned ones, showing the benefit of long CoT training for long outputs. Still, there’s plenty of room for improvement. See details in our post 👉 https://t.co/vuYneL4ZeW
