Daniel Chang

@dnnssl2

poignantmaxxing @mercor_ai abstraction of oneself empathy for the game

Joined February 2022

207 Following

190 Followers

191 Posts

Pinned Tweet

Daniel Chang

@dnnssl2

4 months ago

moshi moshi

792

dnnssl2 retweeted

Susan Zhang

@suchenzang

13 days ago

current LLMs fundamentally consist of four main components:
- input layer: where input "words" (prompt) get mapped to "latents" aka some-model-representation-you-don't-understand-unless-you-start-reading-tea-leaves-of-spurious-correlations (some quite compelling à la word2vec style; latents is also unnecessary lingo so i will refer to these as "inputs" with quotes from now on)
- mixing layers: where you jumble all your "inputs" together to see if any correlations between "inputs" can become useful (commonly used to compress or expand dims; predicting a single classification target == compress to a single dim, etc)
- attention layers: where you learn how "inputs" relate to each other (aka discern what's important to remember vs fluff)
- residuals: where you short-circuit a mixing/attention layer because it's probably adding too much confusion (aka avoid overthinking for simple things)
-----
a "big" LLM simply scales two things:
- width == how many dimensions you give to your "inputs" (the more dims, in theory the more unique/discerning/precise/complex your knowledge can become)
- depth == how many mixing/attention/residual layers you can stack/loop between (aka "reason" over, where more of these ~= more "reasoning" abilities)
"capabilities" that seem impressive to humans usually arise from taking advantage of both depth & width: where a model seemingly makes connections between disparate ideas, beyond what an average human can hold in working memory. this requires models to "completely light up" when responding to a "hard prompt", where effectively no param/layer goes unused.
-----
the anatomy of a "model capability" is precisely the same mechanism that can be co-opted for a jailbreaking exploit: your goal is simply to "light up" as much of the model as possible, dodging any shallow input-classifiers at the beginning by triggering as many disparate "input ideologies" as possible, and subsequently have these "inputs" relate to each other in seemingly unrelated-yet-related ways that ideally have similar "complexity" as your jailbreak goal (to make it past enough layers of the model). think of the attack-vector as bundling your goal in a series of schizo-nerd-snipes: a sufficiently capable model will try to reason through everything all at once, eliminate the dead-ends, and successfully deliver the one jailbreak use-case you bubble-wrapped for. of course, there's an art to the above, and some are already extraordinarily proficient at the trojan-horse-packaging, but at some point there's no difference between "a capability" and "a jailbreak", though i'll be happy to be proven otherwise.
-----
tl;dr ant flew too close to the sun, better kiss the ring or get buried.
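Editor's sketch: the four components and the width/depth knobs described in the post above can be illustrated with a toy forward pass. This is a rough illustration under stated assumptions, not how any production LLM is built: the names `toy_model`, `attention`, and `mixing` are hypothetical, the weights are random rather than trained, and layer normalization, multiple heads, and causal masking are all left out.

```python
import math
import random


def matmul(a, b):
    # (n x k) times (k x m) -> (n x m), using plain nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    # Numerically stable softmax over one row of scores.
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    # Untrained weights; a real model would learn these.
    scale = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def attention(x, wq, wk, wv):
    # Every position scores every other position, then averages their values.
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    width = len(x[0])
    scores = matmul(q, list(zip(*k)))  # q times k-transposed
    weights = [softmax([s / math.sqrt(width) for s in row]) for row in scores]
    return matmul(weights, v)


def mixing(x, w):
    # Per-position linear map followed by ReLU.
    return [[max(0.0, v) for v in row] for row in matmul(x, w)]


def add(x, y):
    # Residual connection: the layer's input is added back to its output.
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, y)]


def toy_model(token_ids, vocab_size=10, width=8, depth=2, seed=0):
    rng = random.Random(seed)
    embedding = random_matrix(vocab_size, width, rng)
    x = [embedding[t] for t in token_ids]  # input layer: ids -> vectors
    for _ in range(depth):  # depth: number of stacked blocks
        wq, wk, wv, wm = (random_matrix(width, width, rng) for _ in range(4))
        x = add(x, attention(x, wq, wk, wv))
        x = add(x, mixing(x, wm))
    return x


out = toy_model([1, 2, 3], width=8, depth=2)
print(len(out), len(out[0]))
```

Raising `width` gives each token vector more dimensions, and raising `depth` stacks more attention/mixing/residual blocks, which are the two scaling axes the post names.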

168K

dnnssl2 retweeted

Serena Ge (Datacurve)

@serenaa_ge

about 1 month ago

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.

510

740

dnnssl2 retweeted

Sri Kosuri

@srikosuri

about 1 month ago

Why did Erdos have so many problems?

134

183

113

264K

dnnssl2 retweeted

Yulin Chen ✈️ ICML2026

@YulinChen99

about 1 month ago

[1/6] GRPO on math problems with Qwen2.5-0.5B/3B and Llama-3.2-3B-Instruct. Bucket hard examples by training dynamics. ~Half of all hard examples are unlearnable. Across model and dataset.

Daniel Chang

@dnnssl2

about 1 month ago

@JesseTinsley Sent a DM

195

Daniel Chang

@dnnssl2

2 months ago

ok so we have 2 unsaturated OSS benchmarks left

Noam Brown

@polynoamial

2 months ago

I'm a manager at @OpenAI, but with GPT-5.5 I'm a more effective IC than I've ever been. I can now write CUDA kernels like a pro. I can rely on it to run my research experiments. And we know how to make it much more powerful from here.

164

491

362K

302

dnnssl2 retweeted

Cameron R. Wolfe, Ph.D.

@cwolferesearch

2 months ago

Reading through these papers has given me a better understanding of why RL scaling laws are so messy compared to those from pretraining. Pretraining scaling laws and RL scaling laws are two completely different things for several reasons:
1. Defining compute: Pretraining has a very clean compute footprint of C = 6ND. RL compute is more complex to capture due to the presence of both sampling and policy updates. Some papers try to maintain the same FLOP estimate for compute, while others measure compute in terms of GPU hours. The efficiency of our training framework can cause the relationship between FLOPs / GPU hours to vary pretty drastically.
2. Intra versus inter-model extrapolation: Pretraining scaling laws fit trends across many model training runs with different settings to understand how model / data size (and compute) impact results. This allows us to extrapolate the results of future training runs. In RL, we fit scaling laws both within an individual training run (intra-model extrapolation) and across training runs (inter-model extrapolation). Intra-model extrapolation is not necessary for pretraining because it is more stable, while RL is extremely sensitive to the exact training configuration being used.
3. Measuring performance: Pretraining scaling laws predict a very particular performance metric: the cross entropy loss (or some other related entropy metric) measured over an in-domain, held-out validation set. This is a stable performance metric that is typically computed over a very diverse dataset (i.e., some random sample from the pretraining corpus). RL scaling laws maintain the practice of computing performance over an in-domain validation set. However, the performance metric that they predict is reward (or accuracy) on a validation set. This is a downstream performance metric, and it can fluctuate drastically depending on the benchmark being used or the composition of data in that benchmark.
4. Lack of standardization: There are generally just more knobs that we can change in RL compared to pretraining. The design space is massive, and we are not sure (yet) which design decisions impact the scaling properties of RL. Several papers have focused on this topic and made meaningful progress on understanding what changes actually impact RL scaling. However, this does not change the fact that slight differences in the RL training setup can completely change observed scaling trends for RL. For this reason, many papers are comparing apples to oranges in terms of their recommendations for RL scaling, making progress on the topic difficult. There are even some papers that have completely opposite findings from each other, and this is likely due to slight differences in their exact GRPO formulation.
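Editor's sketch: the C = 6ND footprint mentioned in point 1 can be worked through numerically, where N is the parameter count and D is the number of training tokens. The second function is only a rough illustration of why RL compute is harder to pin down. It assumes the common approximation of about 2N FLOPs per sampled token for a forward pass, which is not stated in the post, and the function names and example sizes are hypothetical.

```python
def pretraining_flops(n_params: int, n_tokens: int) -> int:
    # C = 6ND: roughly 6 FLOPs per parameter per training token.
    return 6 * n_params * n_tokens


def rl_flops_rough(n_params: int, sampled_tokens: int, trained_tokens: int) -> int:
    # Assumption (not from the post): sampling costs about 2N FLOPs per
    # generated token, and policy updates cost about 6N per trained token.
    sampling = 2 * n_params * sampled_tokens
    updates = 6 * n_params * trained_tokens
    return sampling + updates


# Hypothetical sizes: a 7B-parameter model trained on 2T tokens.
c_pretrain = pretraining_flops(7_000_000_000, 2_000_000_000_000)
print(f"{c_pretrain:.2e}")  # 8.40e+22

# Hypothetical RL run: 1B sampled tokens, all of them also used for updates.
c_rl = rl_flops_rough(7_000_000_000, 1_000_000_000, 1_000_000_000)
print(f"{c_rl:.2e}")  # 5.60e+19
```

The RL estimate depends on how many sampled tokens are actually trained on, which is one reason FLOP counts and GPU hours can disagree between papers.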

230

270

28K

Daniel Chang

@dnnssl2

2 months ago

@aj_kourabi could be worse https://t.co/3gGKhnSr9G

Convexititties, CFA

@convexititties

2 months ago

Food equity research feels so unserious. Just read a deep dive on Chili’s new chicken sandwich and how the specs compare to peers. Right down to the “bun advantage.” A sector where certified fat asses are the alpha generators, surely.

126

86K

dnnssl2 retweeted

Anjney Midha

@AnjneyMidha

2 months ago

if you train on data from dead startups, your AI will learn…how to run a dead startup mediocrity at scale

333

56K

Daniel Chang

@dnnssl2

2 months ago

the terminal value of the economy is underwritten to working with a sell-side broker such as myself to sell your data assets in order to work with a buy-side broker such as myself to get convertible debt and gpu allocation

128

dnnssl2 retweeted

TeraflopAI

@TeraflopAI

2 months ago

Given the increasingly closed-source nature of the U.S. AI ecosystem, it is now more important than ever to push for the proliferation of open model and dataset releases. Datamule (@johngfriedman), @TeraflopAI, and @daftengine collaborated to release 43 Billion Tokens of SEC EDGAR data.

42K

dnnssl2 retweeted

Tim Hwang

@timhwang

3 months ago

ICMI believes that Christian theology offers concrete technical methods for confronting the trickiest problems in AI safety. Today, we release a pair of papers that reproduce @PalisadeAI @apolloaievals work showing how religious framings influence corrigibility and scheming.

767

476

333K

dnnssl2 retweeted

Ziming Liu

@ZimingLiu11

3 months ago

Many people (myself included) are wondering: If not LLMs, what should I work on? In today's blog, I provide a unified framework for finding blue-ocean opportunities in AI, especially in this chaotic era of LLMs. https://t.co/WNXdfNTJ8Z

901

106

944

74K

Daniel Chang

@dnnssl2

3 months ago

@Mascobot generational run

Daniel Chang

@dnnssl2

4 months ago

@BristowEwan specifically smash bros

Daniel Chang

@dnnssl2

4 months ago

@BristowEwan nintendo commercial-coded

543

Daniel Chang

@dnnssl2

4 months ago

this too is a supply chain risk

Drew Fallon

@drewfallon12

4 months ago

BLUE BOTTLE COFFEE ACQUIRED BY LUCKIN FOR $400M
A couple months ago, it was reported that Nestlé was exploring a sale of Blue Bottle Coffee - the property it invested ~$450m in to acquire a 68% stake in 2017.
Fast forward 9 years later, & Luckin Coffee is acquiring Blue Bottle for just a touch less than the principal Nestlé invested.
Prior to Nestlé's acquisition, Blue Bottle had raised over $100m from the likes of Index Ventures, Google Ventures, and more.
Philz Coffee & La Colombe Coffee Workshop also recently traded hands.

106

305

445

242

Daniel Chang

@dnnssl2

4 months ago

@regardthefrost @POTUS @NSF Congrats Jim!

Daniel Chang

@dnnssl2
