Tsendsuren @TsendeeMTS - Twitter Profile

Tsendsuren @TsendeeMTS

1 day ago

Another outer loop

Anthropic

@AnthropicAI

1 day ago

Our internal data shows Claude is accelerating AI development—a possible path to recursive self-improvement, or AI autonomously building a more capable successor. It’s happening faster than we thought, and the implications deserve greater attention. https://t.co/OVVPJO7VQx

2K

27K

4K

15K

17M

0

1

86

TsendeeMTS retweeted

Xie Zhifei

@XieZhifei14110

2 days ago

Audio is the modality of interaction. Audio language model is out. Introducing the Audio Interaction Model, a new paradigm for end to end streaming unified audio models

11

112

12

82

12K

TsendeeMTS retweeted

elie

@eliebakouch

3 days ago

the only knob they change is model depth (number of layers), everything is derived from it with heuristics.

first heuristic:

hidden size = L * 256/3

this is derived from recent models, here is how it compares to others.

other parameters:
- fixed expert sparsity (unless ablated)
- FFN expansion is 2x, latentMoE hyperparameters are 2x compression -> 3x expansion (see the plot on latentMoE to understand what this means)
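
To make the depth heuristic above concrete, here is a minimal sketch that derives a hidden size and FFN width from the layer count alone. The function names, the config keys, and the rounding to a multiple of 128 are assumptions added for illustration; the tweet only states hidden size = L * 256/3 and a 2x FFN expansion, and the latentMoE settings are not modeled here.

```python
def hidden_size_from_depth(num_layers: int, multiple_of: int = 128) -> int:
    """Apply the quoted heuristic: hidden size = L * 256/3.

    Rounding to `multiple_of` is an assumption for hardware-friendly
    shapes; it is not part of the quoted heuristic.
    """
    raw = num_layers * 256 / 3
    return max(multiple_of, round(raw / multiple_of) * multiple_of)


def derive_config(num_layers: int) -> dict:
    """Hypothetical helper: derive the widths from depth only."""
    hidden = hidden_size_from_depth(num_layers)
    return {
        "num_layers": num_layers,
        "hidden_size": hidden,
        "ffn_hidden_size": 2 * hidden,  # "FFN expansion is 2x"
    }


if __name__ == "__main__":
    for depth in (12, 24, 48):
        print(derive_config(depth))
```

With this sketch, depth stays the single free knob: changing `num_layers` changes every derived width.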

1

47

1

3

5K

Tsendsuren @TsendeeMTS

8 days ago

IIUC, this reads the inputs N times to update the memory more. This is something I tried in Infini-attention, and it did improve NIH results in some cases, but not consistently.

Sangyun Lee

@sang_yun_lee

9 days ago

Our fix is simple: to use N recurrent forward passes for learning fast weights. This gives the model enough time to learn a good representation of context. We call this process “sleep”. This is not the same as looped transformers–our model still uses a single forward pass outside the sleep phase (i.e. when the context window is not full).

1

15

0

6

3K

0

1

0

121

Tsendsuren @TsendeeMTS

8 days ago

Add noise conditioned on the layer index and train the layers to denoise independently, so that when they are stacked together they recover the true target. Neat!

Sakana AI

@SakanaAILabs

10 days ago

Introducing DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation https://t.co/c9AvsRKybj

What if we didn’t have to hold an entire neural network in memory to train it? Standard neural net training optimizes all parameters jointly. As a result, the memory required during training grows linearly with the depth of the network.

In our #ICLR2026 paper, we propose DiffusionBlocks, a principled framework to train networks one block at a time, drastically reducing memory requirements while matching end-to-end performance.

With DiffusionBlocks, we split the network into blocks and train them one at a time, so you only need memory for a single block.

How? We explicitly assign each block a role: to move the representation a little closer to the target than the block before it did. That role turns out to be precisely what a diffusion model does, step by step. Each block only needs to optimize its own objective and can be trained independently.

We validated this across five different architectures:
• ViT
• DiT
• Masked diffusion
• Autoregressive transformers
• Recurrent-depth transformers

In each case, performance is competitive with end-to-end training while using a fraction of the memory.

This perspective also extends naturally to recurrent-depth (Looped) transformers, which apply the same network iteratively and normally require expensive backpropagation through time (BPTT). Viewed through DiffusionBlocks, we can replace those multiple iterations with a single forward pass during training.

Read our paper and code to learn more.
Paper: https://t.co/CRj96VGYQn
GitHub: https://t.co/eNW0K9Xh8E 🐟

55

2K

365

2K

854K

0

1

105

Tsendsuren @TsendeeMTS

9 days ago

This is an incredibly fast release with quality gains 🤯

Claude

@claudeai

9 days ago

Introducing Claude Opus 4.8: it builds on Opus 4.7 with sharper judgment, more honesty about its own progress, and the ability to work independently for longer than its predecessors. Available today at the same price.

https://t.co/EufxL7T1kb

4K

67K

9K

8K

15M

0

1

50

TsendeeMTS retweeted

Nous Research

@NousResearch

18 days ago

Today we release Contrastive Neuron Attribution (CNA), a method for steering LLM behavior by identifying and ablating sparse circuits in the MLP basis without training a sparse autoencoder, modifying weights, or degrading general capability benchmarks.

Given a small set of contrastive prompt pairs that elicit a target behavior and its opposite, CNA isolates the top 0.1% of MLP neurons whose activations differ most between the two sets. Ablating that small circuit removes the behavior while leaving the rest of the model intact, and the intervention remains robust at high strengths where residual-stream methods like Contrastive Activation Addition (CAA) start to degrade.

Validated on the refusal circuit across 8 instruct-tuned models, including Llama-3.1-70B, Llama-3.2-3B, Qwen2.5-72B, and Qwen2.5-14B.

The work on CNA was led by @yaboilyrical, with support from @qorprate and @karan4d.
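
The selection step described above (rank MLP neurons by how much their activations differ between two contrastive prompt sets, keep the top 0.1%, then ablate them) can be sketched in a few lines. This is an illustrative toy only: the function names, the use of an absolute difference of mean activations as the score, and zero-ablation are all assumptions, and none of it is taken from the CNA release itself.

```python
import math


def select_contrastive_neurons(pos_acts, neg_acts, fraction=0.001):
    """Return indices of the neurons whose mean activation differs most.

    pos_acts / neg_acts: lists of per-prompt activation vectors
    (one float per MLP neuron). The scoring rule, absolute difference
    of means, is an assumed stand-in for the actual attribution score.
    """
    n_neurons = len(pos_acts[0])

    def mean_per_neuron(acts):
        return [sum(row[j] for row in acts) / len(acts) for j in range(n_neurons)]

    pos_mean = mean_per_neuron(pos_acts)
    neg_mean = mean_per_neuron(neg_acts)
    gaps = [abs(p - n) for p, n in zip(pos_mean, neg_mean)]

    # Keep the top `fraction` of neurons (0.001 = 0.1%), at least one.
    k = max(1, math.ceil(fraction * n_neurons))
    ranked = sorted(range(n_neurons), key=lambda j: gaps[j], reverse=True)
    return sorted(ranked[:k])


def ablate(activations, neuron_ids):
    """Zero out the selected neurons in one activation vector (assumed ablation)."""
    out = list(activations)
    for j in neuron_ids:
        out[j] = 0.0
    return out


if __name__ == "__main__":
    pos = [[1.0, 5.0, 0.0, 2.0], [1.0, 7.0, 0.0, 2.0]]
    neg = [[1.0, 0.0, 0.0, 2.5], [1.0, 1.0, 0.0, 2.5]]
    circuit = select_contrastive_neurons(pos, neg, fraction=0.25)
    print(circuit, ablate([3.0, 4.0, 5.0, 6.0], circuit))
```

In a real model the activation vectors would come from forward hooks on the MLP layers, and the ablation would be applied inside the forward pass rather than to a plain list.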

75

1K

162

638

102K

TsendeeMTS retweeted

Mati Staniszewski

@mati

18 days ago

Albert Einstein + ElevenLabs. AI agents can make education more accessible: a teacher for every student in every field, a classroom size of one, learning from icons who shaped the world. Today, with his estate, we’re bringing Albert Einstein to ElevenLabs.

25

340

39

103

31K

Tsendsuren @TsendeeMTS

19 days ago

@ChangHao564792d Np, congratulations on the awesome work.

0

30

Tsendsuren @TsendeeMTS

21 days ago

Good old DAgger for LLMs, going on my reading list. I feel there is more fruit to pick in this direction.

ChangHao @ChangHao564792d

23 days ago

🚀 Excited to share our new paper: Revisiting DAgger in the Era of LLM-Agents!

Training long-horizon LLM agents is hard:
🔸 SFT → covariate shift
🔸 RL → sparse rewards
🔸 On-policy distillation → cold-start failure + needs white-box teacher logits

We bring back DAgger to fix all three: on-policy rollouts ✕ dense teacher supervision, no cold-start, fully black-box-teacher compatible.

✨ Results on SWE-bench Verified:
🔹 Our 4B agent hits 27.3%, beating published 8B SWE-agent systems
🔹 Our 8B agent hits 29.8%, surpassing SWE-Gym-32B and within 5 pts of strong 32B agents

📄 Paper: https://t.co/e8TTuh1VWb
🤗 HF Daily: https://t.co/nwVwqaahlq
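
For readers unfamiliar with DAgger, the core loop the thread refers to (roll out the learner's own policy, have the teacher label every visited state, aggregate, retrain) can be sketched on a toy problem. Everything below is an assumption made for illustration: the 5-state chain environment, the tabular learner, and all names are hypothetical, and the classic expert/learner mixing schedule is omitted. It is not the paper's LLM-agent setup.

```python
from collections import Counter, defaultdict

N_STATES = 5           # toy chain: states 0..4 (assumed environment)
GOAL = N_STATES - 1
HORIZON = 10


def expert_action(state):
    """Teacher: always step right toward the goal (dense supervision)."""
    return +1


def step(state, action):
    return min(max(state + action, 0), GOAL)


def rollout(policy):
    """Run the learner's own policy; unseen states default to a bad action (-1)."""
    state, visited = 0, []
    for _ in range(HORIZON):
        if state == GOAL:
            break
        visited.append(state)
        state = step(state, policy.get(state, -1))
    return visited, state == GOAL


def fit(dataset):
    """Tabular 'training': majority teacher label per state."""
    votes = defaultdict(Counter)
    for s, a in dataset:
        votes[s][a] += 1
    return {s: c.most_common(1)[0][0] for s, c in votes.items()}


def dagger(n_iters):
    dataset, policy = [], {}
    for _ in range(n_iters):
        visited, _ = rollout(policy)                                # on-policy states
        dataset.extend((s, expert_action(s)) for s in visited)      # teacher labels
        policy = fit(dataset)                                       # retrain on aggregate
    return policy


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, rollout(dagger(n))[1])
```

Each iteration lets the learner reach one state further along the chain, which is then labeled by the teacher. That is the covariate-shift fix in miniature: the training data comes from states the learner actually visits.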

11

126

18

121

26K

2

3

1

2

385

TsendeeMTS retweeted

Jonas Geiping

@jonasgeiping

24 days ago

We’re training models wrong and it’s due to chatGPT. Even the modern coding agents used daily still use message-based exchanges: They send messages to users, to themselves (CoT) and to tools, and receive messages in turn. This bottlenecks even very intelligent agents to a single stream. The models cannot read while writing, cannot act while thinking and cannot think while processing information.

In our new paper, see below, we discuss LLMs with parallel streams. We show that multi-stream LLMs can …
🔵 Be created by instruction-tuning for the stream format
🔵 Simplify user and tool use UX removing many pain points with agents and chat models (such as having to interrupt the model to get a word in)
🔵 Multi-Stream LLMs are fast, they can predict+read tokens in all streams in parallel in each forward pass, improving latency
🔵 LLMs with multiple streams have an easier time encoding a separation of concerns, improving security
🔵 LLMs with many internal streams provide a legible form of parallel/cont. reasoning. Even if the main CoT stream is accidentally pressured or too focused on a particular task to voice concerns, other internal streams can subvocalize concerns that would otherwise not be verbalized.

Does this sound related to a recent thinky post :) - Yes, but I don’t feel so bad about being outshipped with such a cool report on their side by 23 hours. I’ll link a 2nd thread below with a more direct comparison. I actually think both are complementary in interesting ways.

42

1K

168

1K

156K

TsendeeMTS retweeted

Jonas Geiping

@jonasgeiping

24 days ago

What do we gain? First off, we can improve latencies because we now overlap thinking, system inputs, tool use and even auditing calls (and we show this in the paper).

Second, we find that the models we train in a clean ablation with this format actually have a significantly easier time withstanding prompt injections, because it is easier to separate input and output if they are separate streams.

2

52

6

20

14K

Tsendsuren @TsendeeMTS

29 days ago

Since the rise of LLMs, the community has gone through some forgetting. It is good to look at things with fresh eyes, but it can also be inefficient.

0

53

TsendeeMTS retweeted

elie

@eliebakouch

29 days ago

this is fascinating, they train an encoder/decoder but use LLM matching the target model's shape for each part, so the latent space is just plain language and they can detect reward hacking, unwanted behavior and more

could even see it being used as an eval to quantify how smart a model is, i love this

22

1K

109

874

111K

TsendeeMTS retweeted

fly51fly @fly51fly

about 1 month ago

[LG] Reward Models Are Secretly Value Functions: Temporally Coherent Reward Modeling
A Nikulkov [AI at Meta] (2026)
https://t.co/UCxxkYtD7w https://t.co/d1hZpW31ou

2

53

9

52

8K

Tsendsuren @TsendeeMTS

about 1 month ago

Compression strikes here. Let us vibe check

DeepSeek

@deepseek_ai

about 1 month ago

Structural Innovation & Ultra-High Context Efficiency

🔹 Novel Attention: Token-wise compression + DSA (DeepSeek Sparse Attention).
🔹 Peak Efficiency: World-leading long context with drastically reduced compute & memory costs.
🔹 1M Standard: 1M context is now the default across all official DeepSeek services.

4/n

20

2K

149

206

642K

0

136

Tsendsuren @TsendeeMTS

about 2 months ago

@GiimaaAj @billboard This year's one turned out great, though. I'm sitting here regretting that I didn't watch it 😅

1

0

74

Tsendsuren @TsendeeMTS

2 months ago

Interesting! A few years back, I ran this experiment and observed a 4x reduction without regression. In some cases, it even gave a boost. But it still entailed computing that giant LxL matrix, so I dropped it.

Ashwin Gopinath

@ashwingop

2 months ago

https://t.co/IJoTFAonJS

16

358

50

473

64K

0

1

226

Tsendsuren @TsendeeMTS

2 months ago

@GiimaaAj Happy holidays!

0

1

0

13
