bfuzzy @bfuzzy1 - Twitter Profile

Pinned Tweet

bfuzzy @bfuzzy1

8 months ago

you’re never ready. start, screw up, adjust, repeat. progress comes from doing, not waiting.

0

4

1

0

718

bfuzzy1 retweeted

Ramp Labs

@RampLabs

1 day ago

Today we’re releasing Ramp SWE-Bench: a private, production-grounded coding benchmark created from real engineering problems we've faced at Ramp.

34

914

48

447

163K

bfuzzy @bfuzzy1

4 days ago

great post!

Interlatent

@interlatent

4 days ago

Our mission is to make it easy for anyone to deploy a robot to help them in the real world. We wrote an intuitive guide to understanding modern robotics, catered toward an audience that understands technology but not AI robotics. We hope that this short blog post embeds in you the core principles that will bring further curiosity.

35

2K

281

3K

314K

0

1

0

47

bfuzzy @bfuzzy1

4 days ago

Getting downgraded just setting up pufferlib. lol.😬

SemiAnalysis

@SemiAnalysis_

5 days ago

BREAKING NEWS: Anthropic's latest model will NOT help you if it thinks your ML research/ML engineering is interesting, and/or will secretly degrade its IQ so that the average engineer won't notice. We are already seeing Anthropic's latest model's moderation filters our GPU inference research and programming 😭

206

5K

520

2K

2M

0

25

bfuzzy1 retweeted

Michael Tschannen @mtschannen

11 days ago

For the past years my research focus was on unifying models and training paradigms across modalities. Today I'm excited that we're releasing our latest model aligned with this theme: Gemma 4 12B, a dense encoder-free model which processes raw text, image, and audio inputs! 1/ https://t.co/4J2JKCtzU5

27

1K

129

539

108K

bfuzzy @bfuzzy1

21 days ago

observe. ooda. pay attention.

0

14

bfuzzy1 retweeted

Nous Research

@NousResearch

23 days ago

Today we release a study on decoupling the benefits of subword tokenization for language model training, by simulating each suspected benefit one at a time inside a 1.7B byte-level pretraining pipeline. We formulate seven hypotheses for why subword LLMs outperform byte-level LLMs (covering computational efficiency, structural priors over subword boundaries and positions, and the optimization objective) and implement each as a controlled intervention against a byte-level baseline. Three of the seven move the validation loss at this scale; the rest either have negligible effect or hurt. Validated at 1.7B parameters on fineweb-edu with a LLaMA-3 architecture, with 68M-parameter replications in the appendix. The work was led by Théo Gigant, Bowen Peng, and Jeffrey Quesnelle. Paper: https://t.co/Blk7YdVLnc

41

981

115

321

69K

bfuzzy1 retweeted

Nous Research

@NousResearch

26 days ago

Today we release Contrastive Neuron Attribution (CNA), a method for steering LLM behavior by identifying and ablating sparse circuits in the MLP basis without training a sparse autoencoder, modifying weights, or degrading general capability benchmarks. Given a small set of contrastive prompt pairs that elicit a target behavior and its opposite, CNA isolates the top 0.1% of MLP neurons whose activations differ most between the two sets. Ablating that small circuit removes the behavior while leaving the rest of the model intact, and the intervention remains robust at high strengths where residual-stream methods like Contrastive Activation Addition (CAA) start to degrade. Validated on the refusal circuit across 8 instruct-tuned models, including Llama-3.1-70B, Llama-3.2-3B, Qwen2.5-72B, and Qwen2.5-14B. The work on CNA was led by @yaboilyrical, with support from @qorprate and @karan4d.

75

1K

160

638

102K

bfuzzy1 retweeted

Nous Research

@NousResearch

30 days ago

Today we release Lighthouse Attention, a selection-based hierarchical attention for long-context pre-training that delivers a 1.4-1.7× wall-clock speedup at 98K context. It runs the same forward+backward pass ~17× faster than standard attention at 512K context on a single B200, without a custom sparse attention kernel, a straight-through estimator, or an auxiliary loss. During training, queries, keys, and values are pooled symmetrically into a multi-resolution pyramid. We then score every pyramid heads, and a top-k cascade selects a small hierarchical dense sub-sequence, and after a sorting pass that enforces causality, we use standard attention for token mixing. A brief full attention resume at the end converts the checkpoint back into a competent dense-attention model. Validated this using 530M parameter Llama-3 models across 50B tokens, with up to 1M-token benchmarks across 32 B200s under context parallelism. The work on Lighthouse Attention was led by @bloc97_, @SubhoGhosh02, and @theemozilla.