@vboykis Producer: Pitch me.
Me: It's a psych horror about a software engineer who will be auctioned off to someone who will inhabit his body. His only clues are commit messages in a code repository. It's called "Git Out."
Producer: Get out.
Me: no, git out. Git is a
Producer: Get out.
JHU mmBERT extended from 8k to 32k token length by vLLM Semantic Router Team. Cutting edge results on 1,800+ languages, now with longer context!
https://t.co/maN3bT1X17
A dead-simple trick to improve LLM performance:
Just repeat your prompt twice.
No fancy prompting techniques, no chain-of-thought, just plain repetition.
Google researchers tested this across Gemini, GPT, Claude, and Deepseek, and the results were surprisingly good.
Here's why it works:
LLMs are causal, meaning tokens can only see what came before them. When you ask a question after providing context, the question tokens never "saw" the full picture.
By repeating the prompt, every token gets to attend to every other token during prefill.
The best part:
- No increase in output length
- No increase in latency
- Works as a simple drop-in replacement
On one task, Gemini Flash-Lite jumped from 21% to 97% accuracy just by repeating the input.
Important note:
This helps most when reasoning is disabled. If you're already using "think step-by-step," the gains are mostly neutral since reasoning models tend to repeat the prompt internally anyway.
Paper: "Prompt Repetition Improves Non-Reasoning LLMs" from Google Research.
Sometimes the simplest ideas win.
Link to the paper in the next tweet.
This deadline for post-doc applications is coming up. There are so many great people in AI at JHU, even more with the >20 tenure track hires that started this fall.
Stanford researchers built a new prompting technique!
By adding ~20 words to a prompt, it:
- boosts LLM's creativity by 1.6-2x
- raises human-rated diversity by 25.7%
- beats fine-tuned model without any retraining
- restores 66.8% of LLM's lost creativity after alignment
Let's understand why and how it works:
Post-training alignment methods like RLHF make LLMs helpful and safe, but they unintentionally cause mode collapse. This is where the model favors a narrow set of predictable responses.
This happens because of typicality bias in human preference data:
When annotators rate LLM responses, they naturally prefer answers that are familiar, easy to read, and predictable. The reward model then learns to boost these "safe" responses, aggressively sharpening the probability distribution and killing creative output.
But here's the interesting part:
The diverse, creative model isn't gone. After alignment, the LLM still has two personalities. The original pre-trained model with rich possibilities, and the safety-focused aligned model.
Verbalized Sampling (VS) is a training-free prompting strategy that recovers the diverse distribution learned during pre-training.
The idea is simple:
Instead of prompting "Tell me a joke" (which triggers the aligned personality), you prompt: "Generate 5 responses with their corresponding probabilities. Tell me a joke."
By asking for a distribution instead of a single instance, you force the model to tap into its full pre-trained knowledge rather than defaulting to the most reinforced answer.
Results show verbalized sampling enhances diversity by 1.6-2.1x over direct prompting while maintaining or improving quality.
Variants like VS-based Chain-of-Thought and VS-based Multi push diversity even further.
You can find the paper link in the next tweet.
👉 Over to you: What other methods can be used to improve LLM diversity?
Just read Apple's "OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework". Similar to the OLMo, it's refreshing to see an LLM paper that shares details discussing the architecture, training methods, and training data.
Let's start with the most interesting tidbits:
- OpenELM comes in 4 relatively small and convenient sizes: 270M, 450M, 1.1B, and 3B
- OpenELM performs slightly better than OLMo even though it's trained on 2x fewer tokens
- The main architecture tweak is a layer-wise scaling strategy
Sharing details is not the same as explaining them, which is what research papers were aimed to do when I was a graduate student. For instance, they sampled a relatively small subset of 1.8T tokens from various publicly available datasets (RefinedWeb, RedPajama, The PILE, and Dolma). This subset was 2x smaller than Dolma, which was used for training OLMo. What was the rationale for this subsampling, and what were the criteria?
The layer-wise scaling strategy (adopted from the "DeLighT: Deep and Light-weight Transformer" paper) is very interesting. I wish there was an ablation studio training an LLM with and without this strategy on the same dataset. But those experiments are expensive, and I can understand why they didn't do it.
An interesting bonus that I didn't expect was that the researchers compared LoRA and DoRA (which I discussed a few weeks ago) for parameter-efficient finetuning! It turns out that there wasn't a noticeable difference between the two methods, though.
Anyways, great work, and big kudos to the researchers (and Apple) for sharing!
A while ago I complained here about persistent storage in Google Colab.
Have been using @LightningAI Studios for a while now for:
- Full VSCode (incl. GH Copilot)
- Persisted files shared across notebooks
- Multi-GPU/node (!!)
It's been great. Feels like a remote ML workstation
1: did you hear bob was fired?
2: I didn't, what did they even do?
1: no one knows, maybe that's why they got fired
(Two weeks later)
2: oh, yeah bob kind of did a lot.
Ready to build a “Chat with your GitHub repository” application with Mistral via @ollama, @weaviate_io, and @llama_index?
I’ve just dropped a @LightningAI Studio template.
No setup, just copy and dive right into action.
Jump right in here: https://t.co/Bh7Wog5Kqi
@deliprao Also necessary vs sufficient. Models keep getting bigger, but what is necessary for different use cases? What is the 'right size's for different tasks? Different data sets? When does a small domain specific model beat a large, general model?
@deliprao Representation learning. Everyone looks at what you can do with LLMs, but few understand what they actually are. Open the hood and poke around.
I'm making a list of NLP faculty who are recruiting PhD students:
https://t.co/jnfwyXou0j
Results are shared here (after I confirm the submission):
https://t.co/qhweG2kFm0
This is an experiment intended to help students find advisers and help advisers find students
One pattern I noticed is that great AI researchers are willing to manually inspect lots of data. And more than that, they build infrastructure that allows them to manually inspect data quickly. Though not glamorous, manually examining data gives valuable intuitions about the problem.
The canonical example here is Andrej Karpathy doing the ImageNet 2000-way classification task himself. And in the era of large language models, manually examining data is probably even more insightful since completions are hard to evaluate via benchmarks.
In this spirit, I recently did a few days of pair programming with @hwchung27 where we were starting on a new problem. Instead of trying to replicate baselines and design new methods, we ran some evaluations and manually inspected them to gain insights. We first paid about one day of overhead getting all the relevant information in a single UI so we could examine the data without having to click through multiple web pages. The second day, we spent an afternoon reading examples together and taking notes on the patterns that we noticed in the examples. ChatGPT generates long text, and we actually read the whole thing carefully, even if one example took 20 minutes to understand. I think we both gained a deeper understanding of the problem that we could not have gotten from reading research papers.
(In 2018, for example, I helped pathologists label a lot of data to train a lung cancer classifier. After having manually labeled 200+ images (with pathologist correction), I’d probably gained a pathologist-level understanding at that one particular lung cancer classification task :))
TinyML and Efficient Deep Learning Computing
MIT 6.5940 (https://t.co/9cZmEhXrrr)
“This course will introduce efficient AI computing techniques that enable powerful deep learning applications on resource-constrained devices. Topics include model compression, pruning, quantization, neural architecture search, distributed training, data/model parallelism, gradient compression, and on-device fine-tuning. It also introduces application-specific acceleration techniques for large language models, diffusion models, video recognition, and point cloud. This course will also cover topics about quantum machine learning. Students will get hands-on experience deploying large language models (e.g., LLaMA 2) on a laptop.”