yjoan:Y-I.Kim(yongil

@yjoankim

gpt 3.5 to seedance 2.0 moment, and what oa next?

Bundang-gu, Republic of Korea

Joined January 2010

18 Following

24 Followers

947 Posts

yjoankim retweeted

Jun Song

@jun_song

about 2 months ago

Running Kimi-k2.6 1T 8bit with only 21GB RAM on my Macbook at speed of 25tok/s. Some of my theory worked, but architecture is not perfect. Need to fix a lot of stuff, but there is hope. Working hard on this future method of Local LLM.

745

320

41K

yjoankim retweeted

Jun Song

@jun_song

about 2 months ago

Just had a chat with an insider at one of South Korea's biggest memory chip makers (reminder: Korea is the world's biggest memory exporter). Here is what I can share:

• B2B supply prices have already skyrocketed. This hasn't even hit consumer pricing yet.
• Global demand is still insane. Low-end DRAM is going to face major bottlenecks due to yield issues.
• It's not just DRAM. Production can't keep up with demand for other parts either. The only exception? Consumer CPUs. Sales are down, so they are actually oversupplied.

Expect massive price hikes for consumer hardware over the next 12 months. Brace your wallets.

114

11K

yjoankim retweeted

CJ Zafir

@cjzafir

about 2 months ago

If you love fine-tuning open-source models (like me), then listen.

> Start with 1B, 2B, 4B, and 8B models. (Don't start with a 27B model or bigger at first.)

> Use WebGPU providers. I use Google Colab Pro for any model smaller than 9B. A single A100 80GB costs around $0.60/hr, which is cheap. Enough for small models.

> Don’t buy GPUs unless you fine-tune 7 to 10 models. You'll understand the nitty-gritty in the process.

> Use Codex 5.5 × DeepSeek v4 Pro to create datasets. Codex to plan, DeepSeek v4 Pro to generate rows.

> Use Unsloth's instruct models as a base from Hugging Face. Yes, there are others too, but Unsloth also provides fast fine-tuning notebooks.

> Use Unsloth's fine-tuning notebooks as a reference. Paste them into Codex, and Codex will write a custom notebook with the configs you need.

> Spend 1 day learning about:
- SFT (supervised fine-tuning)
- RL training (GRPO, DPO, PPO, etc.)
- LoRA / QLoRA training
- Quantization and types
- Local inference engines (llama.cpp)
- KV cache and prompt cache

> Just get started. Claude, Codex, and ChatGPT can design a step-by-step plan for how you can fine-tune your first AI model.

Future tech is moving toward small 5B to 15B ELMs (Expert Language Models) rather than general 1T LLMs.

So fine-tuning is an important skill that anyone can acquire today.

Tune models, test them, use them. Then fine-tune for companies and make a career out of it. (Companies pay $50k+ to fine-tune models on their data so they can get personalized AI models.)

Shoot your questions below. I'll be sharing in-depth raw findings about this topic in the coming days.
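For the LoRA / QLoRA item in the post above, here is a rough sketch of why adapters are cheap to train. LoRA freezes a weight matrix of shape d_out x d_in and learns two low-rank factors, which adds rank x (d_in + d_out) trainable parameters. The layer shape and rank below are hypothetical and not taken from any specific model or from the post.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # Factor A has shape (rank, d_in); factor B has shape (d_out, rank).
    return rank * d_in + d_out * rank


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen full-rank weight matrix."""
    return d_in * d_out


# Hypothetical projection layer: 4096 x 4096 with LoRA rank 16.
d_in, d_out, rank = 4096, 4096, 16
adapter = lora_trainable_params(d_in, d_out, rank)
frozen = full_params(d_in, d_out)
print(f"adapter params: {adapter:,}")
print(f"frozen params:  {frozen:,}")
print(f"trainable share: {adapter / frozen:.2%}")
```

With these made-up shapes the adapter is well under 1% of the layer, which is the main reason a small rented GPU can be enough for small models.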

312

189K

yjoankim retweeted

Jun Song

@jun_song

about 2 months ago

Gemma4 is the strongest multilingual local LLM. If you are not using English or Chinese, Gemma4 might be a better choice than Qwen3.6.

yjoankim retweeted

@eliebakouch

2 months ago

new mistral model: 128B dense with an arch from 3 years ago (llama 2), very low context (128k), priced higher than deepseek v4 pro (1.6T total params, 1M context) and every other oss model that outperforms it

this is very sad

107

228

433K

yjoankim retweeted

Deedy

@deedydas

2 months ago

Researchers just estimated the size of all the LLMs by asking them knowledge questions of varying degrees of obscurity!

– GPT 5.5: ~10T params
– Claude Opus 4.x: ~4-5T
– Grok 4: ~3T

The idea here is that factual capacity scales log-linearly with size. The paper shows 7 knowledge tiers and T7 is essentially ~0% for all models, suggesting there is still significant headroom for pretraining. Gemini 3.1 Pro is likely >10T given it's used as an anchor but has no direct estimate.

This means we can infer what different models might cost to some degree and their post-training effectiveness (performance at certain non-factual tasks given their size).

One of the coolest papers I’ve read of late.

157

203

966

356K

yjoankim retweeted

Bojie Li

@bojie_li

2 months ago

Closed labs hide model sizes. They can't hide what their models know, and what a model knows is an indicator of how big it is.

Reasoning compresses. Factual knowledge doesn't. So you can size a frontier model from black-box API calls alone, and across releases you can literally watch a single fact arrive in the parameters over time.

For three years, my friends Jiyan He and Zihan Zheng have been asking frontier LLMs the same question: "what do you know about USTC Hackergame?", a CTF contest. May 2024: GPT-4o invented fake titles. Feb 2025: Claude 3.7 Sonnet listed 19 verified 2023 challenges. By April 2026, frontier models recall specific challenges across consecutive years.

After DeepSeek-V4 dropped, I instructed my agent to spend four days autonomously turning that habit into Incompressible Knowledge Probes (IKP): 1,400 questions, 7 tiers of obscurity, 188 models, 27 vendors. Three findings:

1/ You can approximately size any black-box LLM from factual accuracy alone. Penalized accuracy is log-linear in log(params), R² = 0.917 on 89 open-weight models from 135M to 1.6T params. Project closed APIs onto the curve → GPT-5.5 ~9T, Claude Opus 4.7 ~4T, GPT-5.4 ~2.2T, Claude Sonnet 4.6 ~1.7T, Gemini 2.5 Pro ~1.2T (90% CI: 0.3-3x size).

2/ Citation count and h-index don't predict whether a frontier model recognizes a researcher. Two researchers with similar citation profiles get very different responses. Models memorize impact: work that shaped a field, not many incremental papers.

3/ Factual capacity doesn't compress over time. Across 96 open-weight models over 3 years, the IKP time coefficient is statistically zero, rejecting the Densing-Law prediction of +0.0117/month at p<10⁻¹⁵. Reasoning benchmarks saturate; factual capacity keeps scaling with parameters.

Website: https://t.co/CkwJsXqnsX
Paper: https://t.co/eNUdC9ye7w
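The "project closed APIs onto the curve" step in the post above can be sketched as a plain least-squares fit followed by an inversion. This is only an illustration of the idea, assuming accuracy is linear in log10(params); it is not the paper's code. The data points are made up for the example and are not IKP results, and `fit_line` and `estimate_params` are hypothetical helper names.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_params(accuracy, slope, intercept):
    """Invert the fitted line to get a parameter-count estimate."""
    return 10 ** ((accuracy - intercept) / slope)


# Made-up (parameter count, accuracy) pairs standing in for open-weight models.
open_models = [(1e9, 0.10), (1e10, 0.25), (1e11, 0.40), (1e12, 0.55)]
log_sizes = [math.log10(p) for p, _ in open_models]
accuracies = [acc for _, acc in open_models]
slope, intercept = fit_line(log_sizes, accuracies)

# A closed model with a made-up measured accuracy, projected onto the line.
closed_accuracy = 0.70
estimate = estimate_params(closed_accuracy, slope, intercept)
print(f"estimated size: {estimate:.3g} params")
```

Because the inversion is exponential in accuracy, small measurement errors turn into large size errors, which fits the wide 0.3-3x interval quoted in the post.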

234

391K

yjoankim retweeted

Jun Song

@jun_song

2 months ago

A new open-source LLM company gave me the opportunity to preview their upcoming models, so I'm currently testing them out. It's really interesting because these new models have strong multimodal capabilities.

yjoankim retweeted

Chon Tang

@chontang

2 months ago · Taiwan

Apparently Deepseek is paying 5500 rmb a day ($19k USD a month) to *interns* out of Tsinghua. Apparently ByteDance pays 8000 rmb a day ($26k a month). Most Chinese returnees from Silicon Valley compare offers between US and China with a 1:3 ratio (because of the lower cost of living / higher quality of life in China). In other words, they would choose the Deepseek offer over a $500k USD offer from the US. Tldr, frontier labs in China are offering competitive pay vs US labs.

204

655

489K

yjoankim retweeted

antirez @antirez

2 months ago

DeepSeek v4 Flash with *local inference* after 24h of playing with that: even with the 2 bit selective quantization GGUF, it is the FIRST time I feel I have a frontier model running on my computer. This is *crazy*, and probably a much stronger change in the landscape than PRO.

103

495

122K

yjoankim retweeted

elvis

@omarsar0

2 months ago

NEW paper from Alibaba.

A 30B MoE with only 3B active params matches Qwen3-235B on real tool-use workloads.

AgenticQwen-30B-A3B: 50.2 average on TAU-2 + BFCL-V4 Multi-Turn.

AgenticQwen-8B: 47.4.

Both more than double their vanilla Qwen baselines and close most of the gap to a 235B model.

How: two RL flywheels run in parallel.

- The reasoning loop mines the model's own errors into harder problems each round.

- The agentic loop grows simple linear tool-use trajectories into multi-branch behavior trees.

- Simulated users actively try to mislead the agent. The training distribution gets harder on its own.

Why it matters for agent devs: you can stop paying frontier prices for routine tool-use workloads.

And the flywheel recipe is reusable. Generate your hard examples from your own agent's failures, not from static synthetic data.

Paper: https://t.co/NGDXulumid

Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

433

396

38K

yjoankim retweeted

Sebastian Raschka

@rasbt

2 months ago

April was a pretty strong month for LLM releases:
- Gemma 4
- GLM-5.1
- Qwen3.6
- Kimi K2.6
- DeepSeek V4

All are now added to the LLM Architecture Gallery.
More details once I am fully back in May!

436

127K

yjoankim retweeted

Benjamin Marie

@bnjmn_marie

2 months ago

Updated: Qwen3.6 GGUF evals

I added models from bartowski and lm-studio. All good overall.

I also tried going below Q2_K_XL, but those quants are not usable. Most could not finish the benchmarks in a reasonable time because they generated way too many tokens. Endless generation, broken outputs.

I let UD-IQ2_XXS finish the evals just to show an example.

For Qwen3.6 27B, 11.8 GB is probably the lowest you can go without breaking the model.

More details and evaluations of abliterated versions:
https://t.co/zbBCZ0Ty7a
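To put the 11.8 GB figure from the post above in perspective, a quick sketch converts file size into average bits per weight. It assumes decimal gigabytes, exactly 27 billion parameters, and that the whole file is weights (GGUF metadata is ignored), so treat the result as approximate. `bits_per_weight` is a hypothetical helper name.

```python
def bits_per_weight(file_size_gb: float, params_billion: float) -> float:
    """Average bits stored per parameter, assuming 1 GB = 1e9 bytes."""
    total_bits = file_size_gb * 1e9 * 8
    return total_bits / (params_billion * 1e9)


bpw = bits_per_weight(11.8, 27)
print(f"about {bpw:.2f} bits per weight")
```

Under these assumptions the floor works out to roughly three and a half bits per weight on average.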

249

134

17K

yjoankim retweeted

Wildminder

@wildmindai

3 months ago

RotorQuant - upgraded TurboQuant.

> 10x KV cache compression
> 28% faster decoding
> 5x faster prefill
> 44x fewer parameters

Same quality as full attention. 1/10th the memory.
Ok, another massive VRAM discount for local LLMs.

https://t.co/2LHZ47fptn
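For a sense of what a 10x KV cache reduction would mean in memory terms, here is the standard size formula for an uncompressed cache. It does not implement RotorQuant or TurboQuant; it only applies the ratio claimed in the post to a baseline. The model configuration is hypothetical.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int) -> int:
    """Uncompressed KV cache size for one sequence."""
    # The leading 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, 32k context, FP16.
baseline = kv_cache_bytes(32, 8, 128, 32_768, 2)
compressed = baseline / 10  # taking the claimed 10x at face value

gib = 1024 ** 3
print(f"baseline:   {baseline / gib:.2f} GiB")
print(f"compressed: {compressed / gib:.2f} GiB")
```

The cache grows linearly with context length, so any saving matters most for long-context local inference.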

161

66K

yjoankim retweeted

Chris

@ChrissGPT

3 months ago

Hear me out. A 100T dense model isn’t impossible anymore; it’s just deeply unnatural to serve at scale. Vera Rubin solves the fit problem: a 100T model is 200 TB, which drops to 700 GPUs unquantized or 170 GPUs at 4 bit, MEANING a few racks to hold it (depending on how many GPUs per rack).

Assume 1 million concurrent users at 20 tokens per second each: about 20 million tokens per second system wide. Even at about 1k to 5k tokens per second per GPU you still need about 4k to 20k GPUs minimum, which realistically becomes tens of thousands of Vera Rubins once you factor in latency and batching limits.

Training time for Rubin at 17.5 PFLOPS per GPU (FP4 peak): 100k GPUs still means 4 days for 1T tokens and 40 days for 10T tokens at perfect utilization.

So the good news is that Rubin makes 100T dense easy to instantiate but not to serve massively (obviously), and Feynman will push hardware further!! But the direction is already clear that 100T+ models will be sparse and system optimized rather than brute forced dense. (In the beginning) however this is what I foresee being the AGI models we see in 2029.
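The arithmetic in the post above can be reproduced with a short script. Two inputs are assumptions rather than figures from the post: 288 GB of memory per GPU (chosen because it is consistent with 200 TB over about 700 GPUs) and the common rule of thumb of about 6 FLOPs per parameter per training token for dense models. Function names are hypothetical.

```python
import math

PARAMS = 100 * 10**12          # 100T parameters (from the post)
HBM_GB_PER_GPU = 288           # assumption: per-GPU memory in decimal GB
PEAK_FLOPS_PER_GPU = 17.5e15   # 17.5 PFLOPS FP4 peak (from the post)


def gpus_to_hold(params: int, bits_per_param: int, hbm_gb: int) -> int:
    """GPUs needed just to fit the weights (no KV cache or activations)."""
    weight_gb = params * bits_per_param / 8 / 1e9
    return math.ceil(weight_gb / hbm_gb)


def gpus_to_serve(users: int, tok_per_user: float, tok_per_gpu: float) -> int:
    """GPUs needed to sustain the aggregate decode throughput."""
    return math.ceil(users * tok_per_user / tok_per_gpu)


def training_days(params: int, tokens: int, n_gpus: int,
                  flops_per_gpu: float) -> float:
    """Days of training at perfect utilization, assuming 6 * N * D FLOPs."""
    total_flops = 6 * params * tokens
    return total_flops / (n_gpus * flops_per_gpu) / 86_400


print("fit, 16-bit:", gpus_to_hold(PARAMS, 16, HBM_GB_PER_GPU), "GPUs")
print("fit, 4-bit: ", gpus_to_hold(PARAMS, 4, HBM_GB_PER_GPU), "GPUs")
print("serve @5k tok/s/GPU:", gpus_to_serve(1_000_000, 20, 5_000), "GPUs")
print("serve @1k tok/s/GPU:", gpus_to_serve(1_000_000, 20, 1_000), "GPUs")
days_1t = training_days(PARAMS, 10**12, 100_000, PEAK_FLOPS_PER_GPU)
print(f"train on 1T tokens: {days_1t:.1f} days")
```

Under these assumptions the script lands close to the rounded figures in the post, and the 10T-token case is simply ten times the 1T-token result.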

643

200

84K

yjoankim retweeted

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)

@teortaxesTex

3 months ago

now we only need to know active params and training tokens to estimate xAI's compute efficiency, anyone up for coaxing it out of Elon? 500B, regardless, is extremely good for Grok-4.20's apparent knowledge. Just how overtrained is it?

107

yjoankim retweeted

Lifetimize

@lifetimization

3 months ago

lol Elon musk leaked Sonnet’s and Opus’s sizes: Sonnet: 1T Opus: 5T

501

454K

yjoankim retweeted

will brown

@willccbb

3 months ago

cheaper blended cost than GPT-4-32K when it was released 3 years ago

642

82K

yjoankim retweeted

Eric ⚡️ Building...

@outsource_

3 months ago

🤯 GEMMA 4 + OPUS 4.6 REASONING DROPPED

@kaiostephens goal: produce a Gemma 4-31B reasoning adapter trained only on Opus reasoning 🧠

What the model is:
🧬 Tiny QLoRA adapter on Gemma 4 31B-it
📊 Fine-tuned on ~1,900 curated Opus examples
⚡ Trained in ~1 hour on a single GH200 GPU
📖 Fully open Apache 2.0

What it does:
✨ Boosts overall quality, coherence, and personality
🧮 Stronger math, code, and Opus problem solving
💬 More refined, thoughtful responses
🏠 Built for local agents, workflows, and heavy daily

Vs base Gemma 4 31B:
📐 Same efficient base model, no extra size or speed
📈 Noticeable step up in real-world depth and quality
💪 Base was already strong; this levels it up!

Grab the adapter here 👇🏻
https://t.co/ncOxReheZx

121

84K

yjoankim retweeted

Lisan al Gaib

@scaling01

3 months ago

Mythos speeds up AI research by up to 400 times

A 300X speedup over the baseline requires 40 hours of work by a human expert

It also clears the >8h threshold of human equivalent work time on ALL tasks!

154

411

86K
