Brilliant idea! Next up: Apple randomly reboots your Mac if you're building competing tech, Gmail silently edits your email if you mention rival platforms, and Tesla Autopilot swerves if it detects you're working on self-driving cars.
All in the name of safety, of course. Because malicious actors controlling the world’s operating systems, inboxes and cars would be extremely dangerous!
We are hiring Research Scientists for our Frontiers-of-AI team at Google DeepMind in Bangalore, Singapore, and Mountain View.
If you're passionate about cutting-edge AI research and building thinking, efficient, elastic, customized, and safe LLMs, we'd love to hear from you.
We are looking for candidates with a PhD and a strong demonstrated record of ideating and executing deep research projects.
If interested, please apply here: https://t.co/NSxao1nPYo
The Matryoshka🪆wave strikes again!
🚀 Excited to share our latest work, accepted to KDD 2025: Matryoshka Model Learning for Improved Elastic Student Models! https://t.co/uWPU3WhP3K
We introduce MatTA, a novel nested distillation framework that enables the extraction of multiple high-quality student models from a single training run, enhancing adaptability in production ML systems. A thread. 🧵 (1/6)
cc @ManishGuptaMG1 @jainprateek_
Over the last ~2 hours I curated a new Podcast of 10 episodes called "Histories of Mysteries". Find it up on Spotify here:
https://t.co/BH6FTglLIf
The 10 episodes of this season are:
Ep 1: The Lost City of Atlantis
Ep 2: Baghdad battery
Ep 3: The Roanoke Colony
Ep 4: The Antikythera Mechanism
Ep 5: Voynich Manuscript
Ep 6: Late Bronze Age collapse
Ep 7: Wow! signal
Ep 8: Mary Celeste
Ep 9: Göbekli Tepe
Ep 10: LUCA: Last Universal Common Ancestor
Process:
- I researched cool topics using ChatGPT, Claude, Google
- I linked NotebookLM to the Wikipedia entry of each topic and generated the podcast audio
- I used NotebookLM to also write the podcast/episode descriptions.
- Ideogram to create all digital art for the episodes and the podcast itself
- Spotify to upload and host the podcast
I did this as an exploration of the space of possibility unlocked by generative AI, and the leverage afforded by the use of AI. The fact that I can, as a single person in 2 hours, curate (not create, but curate) a podcast is, I think, kind of incredible. I also completely understand and acknowledge the potential and immediate critique here, of AI-generated slop taking over the internet. I guess - have a listen to the podcast when you go for a walk/drive next time and see what you think.
The Transformer architecture has changed surprisingly little from the original paper in 2017 (over 7 years ago!).
The diff:
- The nonlinearity in the MLP has undergone some refinement. Almost every model uses some form of gated nonlinearity. A SiLU or GELU nonlinearity is common.
- The placement of normalization layers. This tends to vary a little from architecture to architecture. Sometimes there are more normalization layers per Transformer block (e.g. Gemma 2). Sometimes keys and queries are normalized (e.g. Command R+).
- The type of normalization layer. RMS norm is commonly used instead of Layer Norm. All of Llama 3, Phi 3 and Gemma 2 use RMS norm now. Seems like vanilla Layer Norm is becoming a little less common.
- Grouped-query attention is now a staple as it really speeds up inference for larger KV caches (e.g. longer prompts / generations).
- And of course the positional encodings have changed from sinusoidal to rotary (aka RoPE). Not too much variation otherwise.
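To make the diff concrete, here is a minimal pure-Python sketch of the pieces listed above: RMS norm, a SiLU-gated MLP, rotary position embeddings, and the head sharing behind grouped-query attention. It is an illustration under assumptions, not any particular model's implementation. All function names and the `eps` and `base` defaults are hypothetical choices, the RoPE variant rotates adjacent pairs (some codebases pair the two halves of the vector instead), and real models apply these to batched tensors rather than Python lists.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMS norm: divide by the root-mean-square of x (no mean subtraction, no bias)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def gated_mlp(x, w_gate, w_up, w_down):
    """Gated MLP: down(silu(gate(x)) * up(x)), instead of down(act(up(x)))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])

def rope(x, position, base=10000.0):
    """Rotary embedding: rotate each adjacent pair (x[i], x[i+1]) by
    position * base**(-i/d), so q.k depends only on relative position."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def kv_head_for(q_head, n_q_heads, n_kv_heads):
    """Grouped-query attention: consecutive query heads share one KV head."""
    return q_head // (n_q_heads // n_kv_heads)

def kv_cache_elements(n_layers, n_kv_heads, head_dim, seq_len):
    """Cached values: keys plus values for every layer, KV head and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len
```

With this accounting, going from 32 KV heads to 8 (shared across 32 query heads) cuts the KV cache by 4x at the same sequence length, which is where the inference speedup for long prompts comes from.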
We pay too much attention to the most confident voices—and too little attention to the most thoughtful ones.
Certainty is not a sign of credibility. Speaking assertively is not a substitute for thinking deeply.
It's better to learn from complex thinkers than smooth talkers.