@cneuralnetwork It's extremely hard to land a pre-doc because it's very competitive. You can still try applying, though. I wasn't able to get a good pre-doc after trying, so I had to think about doing a PhD
LLM RL optimizes for sequential reasoning
We also optimize over the reasoning strategy, incl parallel trains of thought, aggregation of parallel traces, & sequential reasoning
This allows the model to better explore & allocate compute at test time
https://t.co/DkTSllkmvp
Defended my PhD recently!
Grateful to my advisors @LukeZettlemoyer and @nlpnoah, committee members Colin and @lucyluwang, and the many mentors, collaborators, and the @uwnlp community for all the support and friendship along the way ❤️
Slides: Break the language model monolith (https://t.co/gKWcXEck2f)
An interesting new paper by my recent PhD graduate on how AI agents' greed for visible incentives can lead them to abandon their safety alignment.
You can read it here: https://t.co/y64uOBvSiC
Turn any paper into running code.
Just swap arxiv → autoarxiv in the paper url.
That hands the paper to an AI agent from alphaXiv. It reads the abstract, the claims, and the linked GitHub repo, then clones the codebase and works through the usual setup pain like dependencies, broken paths, environment config, and hardware assumptions.
From there it designs a minimal reproduction. That means a smaller model, fewer steps, and a single GPU instead of a cluster, scaled down just enough to test whether the headline claim holds.
The whole run is live and fully logged. Loss curves, metrics, and training progress are all observable as it happens.
What comes back is a clean signal on whether the minimal run matches the paper's reported result, plus an estimate of what a full replication would cost in compute and time.
A lot of research code dies in setup before anyone verifies a single number. This moves reproduction from a weekend of debugging to a url change.
Pick a paper and try it now.
video credits: @askalphaxiv
I'm joining OpenAI next week!🥹 The job search turned out to be really challenging but also super rewarding, so I wrote a short blog post to share what I learned along the way and hopefully make the process a little less mysterious for the next person. https://t.co/6FigSBdenD
Resources to Review Linear Algebra for Deep Learning (Interview)
I've been asked many times by new grads how to review linear algebra and deep learning fundamentals.
I find reading textbooks particularly efficient and effective. Even after joining industry, I still read textbooks from time to time. I summarize resources, important topics, and my personal notes below.
🚀 15+ Model Compression Techniques Every ML Engineer Should Know
1. Quantization
Reduces numerical precision from FP32 to FP16, INT8, INT4, or even 1-bit representations. This significantly reduces memory usage and often accelerates inference with minimal accuracy loss.
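
A minimal sketch of the idea, assuming symmetric per-tensor INT8 quantization (one of several possible schemes). The function names and the sample weights are hypothetical and only for illustration:

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to the range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float values."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.03, 1.0]  # hypothetical FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored as one byte plus a shared scale, instead of four bytes per weight.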
---
2. Pruning
Removes unimportant weights, neurons, attention heads, layers, or entire blocks. The goal is to eliminate redundancy while preserving model performance.
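
As a rough sketch, assuming "unimportant" means "smallest absolute value" (magnitude pruning, one common criterion among many). The helper name and weights are hypothetical:

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitudes."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]  # hypothetical weights
pruned = magnitude_prune(weights, sparsity=0.5)
```

In practice, pruning is usually followed by some fine-tuning to recover accuracy.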
---
3. Knowledge Distillation
Transfers knowledge from a large teacher model into a smaller student model. The student learns both the task and the teacher's behavior.
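
One common way to express "the task and the teacher's behavior" is a loss with two terms: cross-entropy on the true label plus a KL term on temperature-softened teacher outputs. The sketch below assumes that formulation; `temperature` and `alpha` are illustrative hyperparameters, not values from the thread:

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    # Task term: cross-entropy of the student against the true label.
    hard = -math.log(softmax(student_logits)[label])
    # Teacher term: KL(teacher || student) on temperature-softened outputs.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0)
    # The T^2 factor keeps the soft term's gradient scale comparable.
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft


teacher = [4.0, 1.0, 0.0]  # hypothetical logits
loss_matching = distillation_loss([4.0, 1.0, 0.0], teacher, label=0)
loss_mismatched = distillation_loss([0.0, 1.0, 4.0], teacher, label=0)
```

A student that reproduces the teacher's logits pays only the task term, while a student that disagrees pays both.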
---
4. Low-Rank Factorization
Decomposes large weight matrices into smaller matrices with lower rank. This reduces storage and matrix multiplication costs.
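
The savings follow from simple counting: an m x n matrix stores m*n values, while a rank-r factorization into an m x r and an r x n matrix stores r*(m + n). A small sketch with hypothetical layer sizes:

```python
def low_rank_savings(m, n, rank):
    """Parameter counts for W (m x n) versus factors A (m x rank), B (rank x n)."""
    full = m * n
    factored = rank * (m + n)
    return full, factored, full / factored


# Hypothetical 4096 x 4096 layer approximated at rank 64.
full, factored, ratio = low_rank_savings(4096, 4096, 64)
```

The factorization only pays off when the rank is below m*n / (m + n), and the accuracy cost depends on how close the original matrix actually is to low rank.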
---
5. Weight Sharing
Multiple parameters share the same stored values. Instead of millions of unique weights, many connections reuse common representations.
---
6. Sparse Representations
Stores only important weights while ignoring near-zero values. Sparse models can dramatically reduce storage and computation requirements.
---
7. Structured Pruning
Instead of removing individual weights, entire neurons, channels, attention heads, or Transformer blocks are removed. Hardware often benefits more from structured sparsity.
---
8. Dynamic Sparsity
The sparse structure changes during training rather than remaining fixed. Important connections can emerge while unimportant ones disappear.
---
9. Weight Clustering
Groups similar weights together and replaces them with shared cluster centroids. This reduces memory requirements while preserving behavior.
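
A minimal sketch, assuming plain 1-D k-means with linearly spaced initial centroids (real toolkits may initialize and fine-tune differently). Names and weights are hypothetical:

```python
def cluster_weights(weights, k, iterations=10):
    """1-D k-means: returns (centroids, per-weight cluster index). Assumes k >= 2."""
    lo, hi = min(weights), max(weights)
    # Spread the initial centroids evenly across the weight range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    def assign(values):
        return [min(range(k), key=lambda c: abs(w - centroids[c])) for w in values]

    for _ in range(iterations):
        assignments = assign(weights)
        for c in range(k):
            members = [w for w, a in zip(weights, assignments) if a == c]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[c] = sum(members) / len(members)
    return centroids, assign(weights)


weights = [-1.02, -0.98, 0.01, -0.01, 0.97, 1.03]  # hypothetical weights
centroids, indices = cluster_weights(weights, k=3)
compressed = [centroids[i] for i in indices]  # what the layer would actually use
```

Only the small centroid table and one short index per weight need to be stored.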
---
10. Huffman Coding
Applies lossless compression after quantization or clustering. Frequently occurring values receive shorter binary representations.
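
To make that concrete, here is a small sketch that builds a Huffman code over a hypothetical stream of quantized weight indices using only the standard library:

```python
import heapq
from collections import Counter


def huffman_codes(symbols):
    """Return {symbol: bitstring}; frequent symbols get shorter codes."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (count, tiebreaker, {symbol: code_so_far}).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        # Merge the two rarest subtrees, prefixing their codes with 0 and 1.
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]


# Hypothetical 2-bit quantized indices with a skewed distribution.
quantized = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
codes = huffman_codes(quantized)
encoded_bits = sum(len(codes[v]) for v in quantized)
fixed_bits = 2 * len(quantized)
```

On this skewed input the variable-length encoding needs fewer bits than the fixed 2-bit encoding. The gain shrinks as the distribution approaches uniform.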
---
11. Tensor Decomposition
Uses techniques such as CP, Tucker, and Tensor Train decompositions to compress large neural network tensors.
---
12. Neural Architecture Search (NAS)
Discovers compact architectures automatically. Instead of compressing a large model, it finds a smaller architecture from the beginning.
---
13. Lottery Ticket Pruning
Finds sparse subnetworks capable of achieving performance comparable to the original network. The idea is that efficient subnetworks already exist inside large models.
---
14. Early Exit Networks
Allows inference to terminate early when confidence is sufficiently high. Easy samples require less computation than difficult samples.
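
A rough sketch of the control flow, assuming each exit head produces class logits and that "confidence" means the maximum softmax probability (other criteria, such as entropy, are also used). The logits and threshold are hypothetical:

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(exit_logits, threshold=0.9):
    """exit_logits: iterable of logit vectors, shallowest exit first.

    Returns (predicted_class, number_of_exits_evaluated). Passing a
    generator lets deeper exits stay uncomputed once one is confident.
    """
    prediction, depth = None, 0
    for depth, logits in enumerate(exit_logits, start=1):
        probs = softmax(logits)
        confidence = max(probs)
        prediction = probs.index(confidence)
        if confidence >= threshold:
            break  # confident enough: skip the remaining layers
    return prediction, depth


easy = early_exit_predict([[5.0, 0.0, 0.0], [6.0, 0.0, 0.0], [7.0, 0.0, 0.0]])
hard = early_exit_predict([[0.2, 0.1, 0.0], [1.0, 0.5, 0.0], [0.0, 6.0, 0.0]])
```

The easy input leaves at the first exit, while the hard one runs through to the last.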
---
15. Mixture of Experts (MoE)
Only a small subset of the model's parameters is activated for each input. Massive models become computationally efficient because most parameters remain inactive.
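
A toy sketch of top-k routing, assuming the gate logits are already given (in a real model a learned router computes them from the input) and that the experts are plain functions. Everything here is hypothetical and only for illustration:

```python
import math


def top_k_route(gate_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    m = max(gate_logits[i] for i in top)
    exps = {i: math.exp(gate_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, gate_logits, k=2):
    routing = top_k_route(gate_logits, k)
    # Only the selected experts are evaluated; the rest stay idle.
    return sum(weight * experts[i](x) for i, weight in routing.items())


experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
gate_logits = [0.1, 2.0, 2.0, -1.0]  # hypothetical router output
routing = top_k_route(gate_logits, k=2)
output = moe_forward(3.0, experts, gate_logits, k=2)
```

All experts still have to be stored, so the savings are in compute per input, not in memory.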
---
16. Retrieval-Augmented Generation (RAG)
Instead of storing all knowledge inside model weights, external knowledge is retrieved when needed. This reduces pressure to continuously scale model size.
---
17. Adapter-Based Learning
Techniques such as LoRA, QLoRA, Adapters, IA3, and Prefix Tuning train tiny parameter subsets instead of full models.
---
18. Layer Dropping
Removes less important layers after training. Many Transformer layers contribute less than expected and can sometimes be eliminated.
---
19. Token Pruning
Removes less important tokens during inference. Especially useful for vision transformers and long-context language models.
---
20. KV Cache Compression
Compresses attention cache memory used during long-context inference. Critical for serving modern LLMs efficiently.
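
A back-of-the-envelope sketch of why this matters: cache size grows linearly with sequence length, layers, and KV heads. The model dimensions below are hypothetical, and the INT8 figure ignores the small overhead of storing quantization scales:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_value=2):
    """Memory for the attention KV cache of a decoder-only Transformer."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


# Hypothetical model: 32 layers, 32 KV heads, head_dim 128, 4096-token context.
fp16_bytes = kv_cache_bytes(32, 32, 128, 4096, bytes_per_value=2)
int8_bytes = kv_cache_bytes(32, 32, 128, 4096, bytes_per_value=1)
fp16_gib = fp16_bytes / 2**30
```

Under these assumptions, quantizing the cache halves its memory. Reducing the number of KV heads or evicting old tokens shrinks the other factors in the same formula.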
---
Modern Compression Stack
Pretrained Model
↓
Architecture Optimization
↓
Pruning
↓
Distillation
↓
Quantization
↓
Sparse Computation
↓
KV Cache Compression
↓
Production Deployment
since there's a good bit of discourse going on around "how to do research", these pieces are well worth a read.
https://t.co/pA0MkOMlKS
https://t.co/rw9uMiwlCj
https://t.co/H1AGvnb7LP
https://t.co/FTyAabr9Rx