today's read is vasily volkov's phd thesis on gpu latency hiding. it's quite long, so i think it'll keep me occupied for the next few days.
https://t.co/TVZKVeMh9Q
LoRA, low-rank adaptation, is arguably the most popular parameter-efficient fine-tuning method for LLMs.
But how does it actually work?
Check out the video to learn LoRA and friends (LoRA+, QLoRA, VeRA, and DoRA)!
https://t.co/v9qFegbK7g
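For a rough idea of the core trick, here is a minimal NumPy sketch of plain LoRA on a single linear layer. The layer sizes, the `alpha` value, and the name `lora_forward` are illustrative assumptions, not taken from the video, and the variants (LoRA+, QLoRA, VeRA, DoRA) are not covered.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 64x128 layer adapted with rank 4.
d_out, d_in, r = 64, 128, 4
alpha = 8.0  # hypothetical scaling hyperparameter

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init


def lora_forward(x):
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    return W @ x + (alpha / r) * (B @ (A @ x))


x = rng.normal(size=d_in)
y_base = W @ x
y_lora = lora_forward(x)

full_params = W.size           # parameters touched by full fine-tuning
lora_params = A.size + B.size  # parameters touched by LoRA

print("same output at init:", np.allclose(y_base, y_lora))
print("trainable params:", lora_params, "vs", full_params)
```

Because `B` starts at zero, the adapted layer matches the frozen layer exactly before training, and only the two small factors need gradients.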
Best YouTube Channels To Learn AI in 2026 (No BS). Save it.
1. Fundamentals – 3Blue1Brown
2. Deep Learning – Andrej Karpathy
3. AI Research – Yannic Kilcher
4. Practical AI – AssemblyAI
5. LLMs – AI Explained
6. ML Theory – StatQuest
7. Papers Simplified – Two Minute Papers
8. GenAI – Matthew Berman
9. AI Agents – Nicholas Renotte
10. Applied ML – Krish Naik
11. PyTorch – Aladdin Persson
12. Math for ML – Serrano Academy
13. Industry Insights – Lex Fridman
14. Real-world AI – DeepLearningAI
tensor algebra is not abstract math.
it is the grammar of modern intelligence.
a scalar is one number.
a vector is a line of numbers.
a matrix is a grid of numbers.
a tensor is the general form: numbers arranged across multiple dimensions.
images are tensors.
videos are tensors.
robot sensor streams are tensors.
neural network weights are tensors.
physics simulations are tensors.
deep learning is basically tensor algebra + optimization + compute.
once you understand tensors, AI stops looking like magic.
it becomes structure.
reality → numbers → geometry → transformations → intelligence.
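a small NumPy sketch of the ladder above, assuming tiny made-up shapes and a height/width/channels layout for images (other layouts are just as common). the grayscale weights are one conventional choice, used here only as an example of a transformation.

```python
import numpy as np

scalar = np.float32(3.0)           # one number, 0 dimensions
vector = np.zeros(3)               # a line of numbers, 1 dimension
matrix = np.zeros((2, 3))          # a grid of numbers, 2 dimensions
image = np.ones((4, 4, 3))         # height, width, channels (assumed layout)
video = np.zeros((8, 4, 4, 3))     # frames, height, width, channels

for name, t in [("scalar", scalar), ("vector", vector), ("matrix", matrix),
                ("image", image), ("video", video)]:
    print(name, "ndim =", np.ndim(t), "shape =", np.shape(t))

# a transformation: the same linear map applied to every pixel's channel vector
to_gray = np.array([0.299, 0.587, 0.114])
gray = image @ to_gray             # contracts the channel axis
print("gray shape =", gray.shape)
```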
Ling-2.6 and Ring-2.6 are here
Alibaba's Ant Group open-sources trillion-parameter agentic models with hybrid linear attention and the KPop RL framework, delivering instant responses and deep reasoning.
ViTTT
[CVPR 2026] [Best Paper Finalist] [Oral] Official repository of Vision Test-Time Training
https://t.co/DiVWBWXhzv
Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates the attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, in this paper, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time Training (ViT3) model, a pure TTT architecture that achieves linear complexity and parallelizable computation. We evaluate ViT3 across diverse visual tasks, including image classification, image generation, object detection, and semantic segmentation. Results show that ViT3 consistently matches or outperforms advanced linear-complexity models (e.g., Mamba and linear attention variants) and effectively narrows the gap to highly optimized vision Transformers. We hope this study and the ViT3 baseline can facilitate future work on visual TTT models.
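To make the "attention as online learning" idea concrete, here is a minimal sketch of a generic TTT layer. It assumes a linear inner model `W`, a squared-error inner loss, and one gradient step per token in sequence order. This is not the ViT3 design (the repository's inner module, training rule, and parallel formulation may differ); the function name `ttt_linear` and the `lr` parameter are hypothetical.

```python
import numpy as np


def ttt_linear(Q, K, V, lr=0.1):
    """Sequential TTT sketch. Q, K, V have shape (seq_len, d).

    The inner model is a d x d matrix W trained so that W @ k approximates v.
    Each token takes one gradient step on 0.5 * ||W @ k - v||^2, then the
    updated W is queried with q. Cost grows linearly with seq_len.
    """
    n, d = K.shape
    W = np.zeros((d, d))
    out = np.empty((n, d))
    for t in range(n):
        k, v, q = K[t], V[t], Q[t]
        err = W @ k - v                # residual of the inner loss
        W = W - lr * np.outer(err, k)  # gradient of the loss w.r.t. W is err k^T
        out[t] = W @ q
    return out


# Repeating one (key, value) pair: the inner model learns to return the value.
K = np.tile(np.array([1.0, 0.0]), (20, 1))
V = np.tile(np.array([2.0, 3.0]), (20, 1))
out = ttt_linear(Q=K, K=K, V=V, lr=0.5)
print(out[0], out[-1])
```

The state carried between tokens is just `W`, whose size does not depend on sequence length, which is where the linear complexity comes from in this sketch.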
Best models for your hardware
- 4gb to 12gb vram -
VibeThinker-3B - smokes everything remotely close to its weight class. Challenging 30b models! Last version was also topping math benchmarks
https://t.co/RTchJFFTnV
- 12gb to 24gb vram -
Gemma-12B-coder
Built on top of an already strong model, with reduced refusals and a 262k context window, trained on fable traces https://t.co/DVAhlQ7Y4n
- 24gb to 64gb vram -
Gemma-4-26b-diffusion
This model was already by far one of the most functional and capable models, now it’s hitting 500+ tok/s on consumer hardware! Smart AF, made by Google DeepMind https://t.co/mSaWPFpgXQ
Cohere North-Mini-Code 30B
A new coding model made by an already impressive lab. It’s probably worth a shot if you’re looking to test the limits of local coding https://t.co/gDPEj6lPAW
———
For those with 4x 6000s or 3x DGX Spark, I think my GLM-5.2-REAP is worth a shot.
Lmk how it goes!
This video by @jbhuang0604 is a compact but very informative dive into the progress of self-supervised learning over the past few decades.
from IMAX in 1992
covering methods like MoCo, SimCLR, DINO, BYOL, MAE
all the way up to LeJEPA in 2025
Highly recommend watching!
probably the best blog i have read for some time
viewing SFT, RL, and OPD as different ways of reshaping a model's distribution makes their tradeoffs super intuitive.
- SFT pulls toward a fixed external target
- RL moves along the reward gradient on on-policy samples
- OPD sits in between, using a teacher signal but on student-generated data, which is why it inherits RL's anti-forgetting properties even when the teacher itself was an overtrained SFT model.
the post is heavily grounded in recent literature and uses the distributional perspective as a unifying bridge across all three paradigms. i really like its argument that the load-bearing ingredient is on-policy data, with OPD's convergence to RL-like outcomes as the strongest evidence.
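a toy sketch of the three update directions, assuming the "model" is a single categorical distribution over four outcomes parameterized by logits, and using exact expectations instead of sampling. the function names are hypothetical and this is my own illustration, not code from the blog. it reads SFT as cross-entropy to a fixed target, RL as the policy gradient of expected reward, and OPD as reverse KL to the teacher.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def sft_grad(logits, target):
    """Gradient of cross-entropy(target, p) w.r.t. logits: pull p toward a fixed target."""
    return softmax(logits) - target


def rl_grad(logits, reward):
    """Gradient of -E_{a~p}[reward(a)] w.r.t. logits, expectation under the policy p.

    The reward is treated as a constant (no gradient flows through it).
    """
    p = softmax(logits)
    return -p * (reward - p @ reward)


def opd_grad(logits, teacher):
    """Gradient of reverse KL(p || teacher) w.r.t. logits, weighted by the student p."""
    p = softmax(logits)
    log_ratio = np.log(p) - np.log(teacher)
    return p * (log_ratio - p @ log_ratio)


logits = np.array([0.5, -0.2, 0.1, 0.0])
teacher = softmax(np.array([2.0, 0.0, -1.0, 0.5]))
p = softmax(logits)

# In this toy setting, OPD is the RL update with reward log(teacher) - log(student).
implied_reward = np.log(teacher) - np.log(p)
same = np.allclose(opd_grad(logits, teacher), rl_grad(logits, implied_reward))
print("OPD matches RL with a log-ratio reward:", same)
print("SFT direction:", sft_grad(logits, teacher))
print("OPD direction:", opd_grad(logits, teacher))
```

in this sketch the SFT gradient is weighted by the target, while the RL and OPD gradients are both weighted by the student's own probabilities, which is the on-policy ingredient the post highlights.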
The Top AI Papers of the Week (June 7 - June 14)
- Agentopia
- Self-Harness
- Agents' Last Exam
- MiniMax Sparse Attention
- Lookahead Sparse Attention
- How AI Agents Reshape Knowledge Work
- The Geometry of On-Policy Distillation
Read on for more:
I’m giving a talk on LLM judges at the Toronto Machine Learning Summit next week. The talk will cover practical techniques like:
- Collecting high-quality expert feedback on subjective tasks.
- Calibrating LLM judges with expert opinions.
- Properly eliciting reasoning within an LLM judge.
- Using multiple agents to decompose complex evaluation tasks.
- Continually improving LLM judges with production monitoring / metrics.
This talk will be full of practical details for building useful evaluation systems. Hope to see you there!