An Ex-Meta L8’s Agentic Engineering Setup
In this guest article, @kunchenguid shares the agentic engineering workflow he uses on a day-to-day basis.
Read the full article here: https://t.co/4mXyh1EiFF
CMU Advanced NLP Lecture 9: Decoding Algorithms
This lecture explains a key aspect of generative LLMs:
The model learns a probability distribution, but useful generation still depends on how we decode from that distribution.
🔹 Greedy decoding picks the most likely token each step, but local best choices may not produce the best full sequence.
🔹 Beam search keeps multiple candidate paths, making decoding closer to sequence-level optimization.
🔹 Sampling turns probabilities into diverse outputs, but naive sampling can become incoherent because of long-tail tokens.
🔹 Top-k, top-p, and temperature control the tradeoff between quality, diversity, and randomness.
The key idea: LLM generation is not just “the model predicts words.” It is model probabilities + decoding strategy.
My note:
https://t.co/ucwWjpPowE
Jane Street pays $650,000 a year for quants. Stanford just released the exact RL-for-trading bible for free.
16 chapters. 0 to algo trader. Asset allocation, market making, American option exercise, full Python code & Colab notebooks.
Bookmark & give it a weekend.
This 115-page book unlocks the secrets of LLM fine tuning.
https://t.co/Uhs8edPUV8
A comprehensive guide which covers:
> the fine-tuning process for LLMs
> combining both theory and practice.
𝗧𝗵𝗲 𝗿𝗲𝗰𝗼𝗿𝗱𝗶𝗻𝗴 𝗼𝗳 𝗟𝘂𝗰𝗮𝘀 𝗕𝗲𝘆𝗲𝗿'𝘀 (@giffmana) 𝗹𝗲𝗰𝘁𝘂𝗿𝗲 𝗮𝘁 @ETH 𝗶𝘀 𝗻𝗼𝘄 𝗹𝗶𝘃𝗲 𝗼𝗻 𝗬𝗼𝘂𝗧𝘂𝗯𝗲 𝗳𝗼𝗿 𝗲𝘃𝗲𝗿𝘆𝗼𝗻𝗲 𝘄𝗵𝗼 𝗰𝗼𝘂𝗹𝗱𝗻'𝘁 𝗷𝗼𝗶𝗻 𝘂𝘀 𝗶𝗻 𝗽𝗲𝗿𝘀𝗼𝗻!
This past Monday, we had the pleasure of hosting Lucas (@Meta@AIatMeta Superintelligence Labs) for our "Robot Learning: From Fundamentals to Foundation Models" course. He joined us to talk about: "𝗩𝗶𝘀𝗶𝗼𝗻 𝗶𝗻 𝘁𝗵𝗲 𝗔𝗴𝗲 𝗼𝗳 𝗟𝗟𝗠𝘀".
Drawing from a remarkable track record in computer vision and multimodal AI (𝗩𝗶𝗧, 𝗦𝗶𝗴𝗟𝗜𝗣, 𝗣𝗮𝗹𝗶𝗚𝗲𝗺𝗺𝗮) 🧠, Lucas delivered a masterclass on the frontier of multimodal foundation model training: from pre-training to post-training, where the field stands today, and what comes next 🚀
📽️ YouTube Recording: https://t.co/wNz1NwYkvb
📚 Course Website: https://t.co/DoQUYy3MjB
Excited to see my student’s work on Flux Matching out. It turns out you can learn a much broader class of vector fields with the data distribution as stationary (not just the score). This lets you enforce useful properties like fast mixing, and it already works on high-dimensional image datasets!
Many roughly know how a transformer works
To REALLY understand modern neural LMs—MoEs, GPU tiling, kernels, RLHF, data—you need CS336
By @tatsu_hashimoto, @percyliang
The 2026 edition appears on yt with ~2 weeks delay
https://t.co/iEWTqEivvB
Materials
https://t.co/E1pzUSC6Tr
New Anthropic research: Natural Language Autoencoders.
Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read.
Here, we train Claude to translate its activations into human-readable text.
Introducing a new sequence model Raven which pushes the boundary of fixed-state-size sequence models!
Raven bridges popular linear-time models with constant state capacity, like SSMs and sliding window attention (SWA). Like SWA, its state is a finite set of slots; unlike SWA, Raven learns to selectively choose which slots to update with each new token it caches. This is a much more principled update mechanism that leads to dramatically better retrieval abilities than prior linear models.
I personally don't think SWA is a very principled model - but it's convenient and works well empirically - and am most excited to see Raven be used as a strictly better drop-in replacement. More broadly the framework it develops hopefully introduces more ideas to combine the strengths of SSM-like and attention-like models.
This work was led by @rshia_afz and @avivbick
David Silver RL Lecture 7: Policy Gradient Methods
The main story is actor-critic.
Pure policy gradient learns from full returns:
unbiased, but high variance.
Actor-critic adds a critic:
🔹 actor updates the policy
🔹 critic estimates how good the action was
🔹 lower-variance signal
🔹 online step-by-step learning
Actor-critic feels like the bridge between policy optimization and value learning.
My note: https://t.co/et5hAJFzcb
We post-trained a 3B model with RL to beat Opus on spreadsheet retrieval. Faster, cheaper, more accurate.
- If a piece of your agent loop is narrow, verifiable, and highly repeatable, a tiny trained model might beat the frontier.
- The application layer is still early and new verticals are opening fast. Cheap domain specialists orchestrated by a frontier model that only spends tokens on judgment is a bet worth watching.
This is exactly why Anthropic spends enormous effort on preventing reward hacking in every generation of RL training.
In highly constrained RLVR settings, reward hacking can be relatively limited because the verifier is narrow and deterministic.
But once you move toward reward models, open-ended environments, or agentic interaction, reward hacking never truly disappears. The model will always search for shortcuts inside the reward surface.
Restrict KL divergence? Better verifiers? LLM-as-a-verifier?
Probably the clearest article explaining Anthropic’s work on reward hacking prevention (Chinese):
https://t.co/uTFBPHoHju
this is fascinating, they train an encoder/decoder but use LLM matching the target model's shape for each part, so the latent space is just plain language and they can detect reward hacking, unwanted behavior and more
could even see it being used as an eval to quantify how smart a model is, i love this