Aetjess @aetjesseth - Twitter Profile

about 21 hours ago

@WilliambilSf Just open https://t.co/K0UPvK5FSp, go to the menu and the AI chat, and type "create a coin with the name Aevon". You'll then have your own coin.

1

0

51

Aetjess

@aetjesseth

about 21 hours ago

@WilliambilSf Bro, you better relaunch it so it grows organically. Don't trust the community.

1

0

1K

Aetjess

@aetjesseth

1 day ago

@WilliambilSf I think you need funding for your project. Someone created a coin for you on @bankrbot. Coin CA: 0x20d35a75b2547d8ad23e629868226c0bf3934ba3. Are you interested in integrating your project with Bankrbot? You'll get more money to develop your project.

1

0

104

aetjesseth retweeted

Adithya S K

@adithya_s_k

16 days ago

ICYMI, just dropped the largest Open Source Multilingual OCR Dataset

> 1M images, 22 languages, 6 tasks

It's also trending in the multimodal category with close to 3k downloads in the last 3 days https://t.co/NQLp5nM6i9

1

84

6

19

6K

aetjesseth retweeted

Joseph Suarez 🐡

@jsuarez

18 days ago

Reinforcement learning research with Joseph Suarez https://t.co/YiT9ahZUsm

0

12

1

1K

aetjesseth retweeted

Joseph Suarez 🐡

@jsuarez

18 days ago

Another massive fail. Cites PPO-v3 + DreamerV3 on percentile scaling for robust advantage scaling. Pretty nifty right? Except I'm the last author on PPO-v3 and the paper states that DreamerV3's scaling tricks generally do not work at all. https://t.co/jUSUDHwQpD

4

55

2

24

6K

aetjesseth retweeted

waterloo intern

@waterloo_intern

17 days ago

if you can't guess the kernel, you're not locked in enough

16

309

8

202

34K

aetjesseth retweeted

Jueun Kim

@jueunkim_0525

18 days ago

🚨New Optimizer Paper
AMUSE: Anytime MUon with Stable gradient Evaluation

AMUSE combines Muon with Schedule-Free-style gradient evaluation for stable anytime training without LR decay.

• Stronger 124M / 720M / 1B pretraining
• Strong ImageNet / ViT fine-tuning performance. https://t.co/Y1qQnpDt2n

16

320

40

207

43K

aetjesseth retweeted

DailyPapers

@HuggingPapers

17 days ago

GARD: Geometry-Aware Representation Denoising

Diffusion-based restoration directly inside the feature space of a 3D reconstruction model.

Preserves cross-view geometry while recovering clean images and 3D structure from degraded inputs.

Outperforms pixel-space and VAE-based methods.

1

49

8

34

3K

aetjesseth retweeted

Xiuyu Li

@sheriyuo

17 days ago

It is not the first time API providers have misled users by offering a weaker model than the one they claim. Even OpenAI can undermine the trust game.

Our latest paper is the first academic work to discuss this issue in detail. We propose an attack against existing detection methods, showing how a small model can impersonate a larger model in practice and fool users.

I really love working on these kinds of fresh ideas, whether or not they are directly related to my main research line lol

Your “Pro” LLM Subscription May Actually Be “Free”: Exposing Fingerprint Spoofing Risks in LLM Inference Services
Coming to arXiv in several days!

GPT-5.5 getting caught for silently downgrading intelligence
https://t.co/i0Xom5I3Yh

5

44

5

19

3K

aetjesseth retweeted

𝚐𝔪𝟾𝚡𝚡𝟾

@gm8xx8

17 days ago

DATA QUALITY IS NOT JUST A MIXTURE WEIGHT, IT IS A SCHEDULING VARIABLE.

Curated data plays two roles: early, it amplifies signal through smaller batches; late, it suppresses noise through larger batches.

Drop-Stable-Rampup follows directly: drop batch at the quality transition, hold low, then ramp near the end.

Paper: https://t.co/HgmF2Gdz2A
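
The "drop, hold low, then ramp" description above can be read as a piecewise batch-size schedule. What follows is only a rough sketch of that shape, not the paper's method: the function name, every parameter, the linear ramp, and the example numbers are all assumptions made for illustration.

```python
def drop_stable_rampup_batch_size(step, total_steps, base_batch, low_batch,
                                  final_batch, transition_step, ramp_start_step):
    """Hypothetical batch-size schedule: base -> drop -> hold low -> linear ramp.

    All names and the linear ramp shape are illustrative assumptions.
    """
    if not 0 <= transition_step <= ramp_start_step < total_steps - 1:
        raise ValueError(
            "expected 0 <= transition_step <= ramp_start_step < total_steps - 1"
        )
    if step < transition_step:
        # Before the data-quality transition: keep the starting batch size.
        return base_batch
    if step < ramp_start_step:
        # After the transition: drop the batch size and hold it low.
        return low_batch
    # Near the end: ramp linearly from low_batch up to final_batch.
    frac = (step - ramp_start_step) / (total_steps - 1 - ramp_start_step)
    frac = min(1.0, frac)
    return round(low_batch + frac * (final_batch - low_batch))


if __name__ == "__main__":
    # Placeholder numbers chosen only to show the three phases.
    for s in (0, 399, 400, 799, 800, 900, 999):
        size = drop_stable_rampup_batch_size(
            s, total_steps=1000, base_batch=512, low_batch=128,
            final_batch=1024, transition_step=400, ramp_start_step=800,
        )
        print(s, size)
```

Where the transition falls and how steep the ramp should be are exactly the questions the paper addresses, so the constants here carry no meaning beyond illustrating the shape.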

1

30

5

19

2K

aetjesseth retweeted

Ethan Caballero

@ethanCaballero

17 days ago

New paper: We present a "Unified Neural Scaling Law" functional form that accurately models & extrapolates the multivariate scaling behaviors of artificial neural networks as the variables listed in this attached video are varied. (1/N)

10

479

64

410

46K

aetjesseth retweeted

Sebastian Raschka

@rasbt

17 days ago

The MiniMax M2 series was one of the most widely used open-weight LLM series earlier this year. Now, we got a technical report with some interesting tidbits. I summarized some of them below:

1. Full attention as an anti-trend?:

They tried hybrid sliding-window attention variants (like so many others, like Xiaomi MiMo, Laguna, Gemma 4, Arcee, Olmo 3, etc.). But even though there were efficiency gains, they said that the production-quality tradeoffs were not worth it for M2.

2. Linear and sparse attention deployment issues:

They found that linear and sparse attention are attractive on paper because they reduce the cost of long-context attention, but they are harder to make work well in a production agent system.

In particular, they found that these efficient attention variants may be more fragile when KV-like state or intermediate memory is stored in lower precision.

Also, they have worse prefix caching support, which matters a lot when using coding agents (which reuse a lot of the context).

3. Fine-grained Mixture-of-Experts (MoEs) are useful:

Finally, a recent MoE ablation study! It's only on the 2B-active parameter scale, but hey, better than nothing.

Concretely, they compare a baseline with 32 experts and top-2 routing against a fine-grained setup with 128 experts and top-8 routing.

The fine-grained setup improves MATH from 19.6 to 24.1 and HumanEval from 29.7 to 32.5. That's clearly a win for more fine-grained experts (confirming what the DeepSeek MoE paper reported ~2 years ago).

4. Sophisticated agent pipeline

It's probably no surprise, but this paper confirms that training for agent-like behavior on software engineering tasks is now a big component of the training pipeline.

They mine GitHub pull requests, build runnable Docker environments, extract task-specific test rewards, etc.

5. Interleaved thinking for context management

Interestingly, they found that removing reasoning blocks from previous turns results in worse performance, especially in multi-step agent tasks. (Another point why long-context support is so important these days).

6. Speed rewards

It's common to have token usage penalties, but what's interesting is that the MiniMax team adds a task-completion-time reward that depends on wall-clock time. This is to minimize unnecessary (slow) tool calls. Also, I'm thinking that this would encourage agent parallelization (if supported by the harness).

7. Self-evolution

Looks like self-evolution is also already a big design component of open-weight LLMs. E.g., the paper says that M2.7 already handles 30 to 50 percent of the daily RL iteration workload, modifies its own scaffold, and completed a 100-round autonomous scaffold optimization cycle with a 30 percent gain on internal evaluations.
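
To make the two routing setups in point 3 concrete (32 experts with top-2 versus 128 experts with top-8), here is a minimal pure-Python sketch of top-k expert routing. It is not MiniMax's router: softmax followed by top-k selection and renormalization is just one common variant, the function name is made up, and the random logits are placeholders. The combination count at the end illustrates one commonly cited intuition for why finer-grained experts can help: many more distinct expert sets are available per token.

```python
import math
import random


def top_k_route(logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Numerically stable softmax over all router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts and renormalize their weights to sum to 1.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


if __name__ == "__main__":
    rng = random.Random(0)  # placeholder router logits, not real model outputs
    for num_experts, k in [(32, 2), (128, 8)]:
        logits = [rng.gauss(0.0, 1.0) for _ in range(num_experts)]
        routed = top_k_route(logits, k)
        combos = math.comb(num_experts, k)
        print(f"{num_experts} experts, top-{k}: "
              f"{len(routed)} active, {combos} possible expert sets")
```

Whether the two setups in the report activate the same number of parameters per token is not stated in the summary above, so the sketch makes no claim about expert sizes.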

36

537

94

259

38K

aetjesseth retweeted

Serena Ge (Datacurve)

@serenaa_ge

18 days ago

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks.

On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work. https://t.co/HCDcjNuTFK

511

6K

742

3K

2M

aetjesseth retweeted

Binfeng Xu

@billxbf

18 days ago

Excited to release 🌟Polar🌟, our Agent RL rollout infra for real-world harnesses. Be it Codex, Claude Code, OpenClaw, Hermes, or your self-made ones 🔥 -- Polar takes your harnesses directly as training environments without code change.

Find a problem, design the harness, and train your own agents! 🧵

25

904

144

949

131K
