I'm Boris and I created Claude Code. Lots of people have asked how I use Claude Code, so I wanted to show off my setup a bit.
My setup might be surprisingly vanilla! Claude Code works great out of the box, so I personally don't customize it much. There is no one correct way to use Claude Code: we intentionally build it in a way that you can use it, customize it, and hack it however you like. Each person on the Claude Code team uses it very differently.
So, here goes.
I quite like the new DeepSeek-OCR paper. It's a good OCR model (maybe a bit worse than dots), and yes data collection etc., but anyway it doesn't matter.
The more interesting part for me (esp as a computer vision person at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input.
Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:
- more information compression (see paper) => shorter context windows, more efficiency
- significantly more general information stream => not just text, but e.g. bold text, colored text, arbitrary images.
- input can now be processed with bidirectional attention easily and as default, not autoregressive attention - a lot more powerful.
- delete the tokenizer (at the input)!! I already ranted about how much I dislike the tokenizer. The tokenizer is an ugly, separate, not end-to-end stage. It "imports" all the ugliness of Unicode and byte encodings, and it inherits a lot of historical baggage and security/jailbreak risk (e.g. continuation bytes). It makes two characters that look identical to the eye look like two completely different tokens internally in the network. A smiling emoji looks like a weird token, not an... actual smiling face, pixels and all, and all the transfer learning that brings along. The tokenizer must go.
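A minimal sketch of the "look identical, encoded differently" point, using only the Python standard library. The choice of Latin "a" versus Cyrillic "а" is an illustrative assumption (the post names no specific characters), and the sketch stops at the byte level rather than running any particular tokenizer:

```python
import unicodedata

# Two characters that render almost identically in most fonts.
latin_a = "a"          # U+0061
cyrillic_a = "\u0430"  # U+0430

def describe(ch: str) -> tuple:
    """Return (Unicode name, UTF-8 bytes) for a single character."""
    return unicodedata.name(ch), ch.encode("utf-8")

latin_info = describe(latin_a)
cyrillic_info = describe(cyrillic_a)

print(latin_info)
print(cyrillic_info)

# A byte-level tokenizer starts from these bytes, so the two glyphs
# enter the model as unrelated inputs even though a reader sees one letter.
same_bytes = latin_info[1] == cyrillic_info[1]
print("same bytes:", same_bytes)
```

The Cyrillic letter takes two UTF-8 bytes, the second of which is a continuation byte, while the Latin letter takes one. A rendered image of either character would be nearly the same pixels.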
OCR is just one of many useful vision -> text tasks. And text -> text tasks can be made to be vision -> text tasks. Not vice versa.
So maybe the User message is images, but the decoder (the Assistant response) remains text. It's a lot less obvious how to output pixels realistically... or if you'd want to.
Now I have to also fight the urge to side quest an image-input-only version of nanochat...
From double descent to grokking, deep learning sometimes works in unpredictable ways... or does it?
For NeurIPS, @Jeffaresalan & I explored if & how statistics + smart linearisation can help us better understand & predict numerous odd deep learning phenomena, and learned a lot... 🧵1/n
What is the performance limit when scaling LLM inference? Sky's the limit.
We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed. Remarkably, constant depth is sufficient.
https://t.co/HO2seV73KT (ICLR 2024)
1/8 Looks like my paper "Tabular Data: Deep Learning is Not All You Need" just hit 1,000+ citations 🥳🥳🥳
Here's the story of how we almost didn't publish it...
https://t.co/KiZ9dUTYWn
Self-attention as a directed graph!
Self-attention is at the heart of transformers, the architecture that led to the LLM revolution that we see today.
In this post, I'll clearly explain self-attention & how it can be thought of as a directed graph.
Read more...👇
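A minimal sketch of the graph view, not taken from the linked post. It assumes single-head scaled dot-product attention with a causal mask and uses the token vectors directly as queries and keys (no learned projections). The function names and the toy vectors are hypothetical. Each nonzero attention weight becomes a directed edge from the attended token to the attending token:

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys, causal=True):
    """Row i holds how strongly token i attends to each token j."""
    d = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # token i cannot see the future
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d))
        weights.append(softmax(scores))
    return weights

def attention_edges(weights, threshold=0.0):
    """Edge (j, i, w): information flows from token j into token i."""
    return [
        (j, i, w)
        for i, row in enumerate(weights)
        for j, w in enumerate(row)
        if w > threshold
    ]

# Toy 2-dimensional token vectors (illustrative values only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights(tokens, tokens)
edges = attention_edges(weights)
for src, dst, w in edges:
    print(f"{src} -> {dst}  weight={w:.3f}")
```

With the causal mask every edge points from an earlier (or the same) token to a later one, so the attention matrix is the weighted adjacency matrix of a directed graph with self-loops.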
🚀 Excited to introduce PowerInfer-2: A game-changing LLM inference engine for mobile devices by the #PowerInfer team. It smoothly runs a 47B model with a staggering 29x speedup on smartphones! Watch our demo to see it in action! 🎥
Technical details at: https://t.co/7bx5EnzWCs
The single most undervalued fact of linear algebra: matrices are graphs, and graphs are matrices.
Encoding matrices as graphs is a cheat code, making complex behavior simple to study.
Let me show you how!
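A minimal sketch of the matrix-as-graph correspondence, under one common convention that is an assumption here: entry `A[i][j]` is the weight of a directed edge from node `i` to node `j`. The helper names and the 3-node example are hypothetical:

```python
def matrix_to_edges(matrix):
    """Read every nonzero entry A[i][j] as a directed edge i -> j."""
    return [
        (i, j, value)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    ]

def matmul(a, b):
    """Plain matrix product of two square lists-of-lists."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]

# A directed 3-cycle: 0 -> 1 -> 2 -> 0.
cycle = [
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
]

edges = matrix_to_edges(cycle)
print(edges)

# Entry (i, j) of the k-th power counts the walks of length k from i to j.
# After three steps around the cycle every node is back where it started.
three_steps = matmul(matmul(cycle, cycle), cycle)
print(three_steps)
```

Here the third power is the identity matrix, which is the graph statement "every walk of length 3 returns to its start" written in matrix form.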
There's a stunning, simple explanation behind matrix multiplication.
This is the first time this clicked in my brain, and it will be the best thing you read all week.
Here is a breakdown of the most crucial idea behind modern machine learning:
1/15
So much misunderstanding of this comment!
Here is a list of things I am *NOT* saying:
- you need a PhD to do Science. You don't. A PhD teaches you to do research, but you can learn that on your own (though it's much easier with a mentor).
- you need to get papers accepted by a journal or conference to publish: you don't. You can just post it in https://t.co/yQWSsbCJc1 . Many influential papers never went through the formal peer review process, or went through it after they became influential.
- engineering is not science: it can be, depending on your methodology. I'm a scientist *and* an engineer. These activities are complementary and need each other.
- science requires formal papers: it doesn't. A clear explanation on a website and a piece of code on a public repo will do.
What I *AM* saying is that science progresses through the collision of ideas, verification, analysis, reproduction, and improvements.
If you don't publish your research *in some way* your research will likely have no impact.
It is only rarely that, after reading a research paper, I feel like giving the authors a standing ovation. But I felt that way after finishing Direct Preference Optimization (DPO) by @rm_rafailov @archit_sharma97 @ericmitchellai @StefanoErmon @chrmanning and @chelseabfinn. This beautiful paper proposes a much simpler alternative to RLHF (reinforcement learning from human feedback) for aligning language models to human preferences.
RLHF has been a key technique for training LLMs. In brief, RLHF (i) gets humans to specify their preferences by ranking LLM outputs, (ii) trains a reward model (used to score LLM outputs) -- typically represented using a transformer network -- to be consistent with the human rankings, (iii) uses reinforcement learning to tune an LLM, also represented as a transformer, to maximize rewards. This requires two transformer networks, and RLHF is also finicky about the choice of hyperparameters.
DPO simplifies the whole thing. Via clever mathematical insight, the authors show that given an LLM, there is a specific reward function for which that LLM is optimal. DPO then trains the LLM directly to make the reward function (that’s now implicitly defined by the LLM) consistent with the human rankings. So you no longer need to deal with a separately represented reward function, and you can train the LLM directly to optimize the same objective as RLHF.
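A minimal sketch of the per-example DPO loss as I understand it from the paper: the implicit reward of a response is beta times the log-ratio between the policy being trained and a frozen reference model, and the loss is the negative log-sigmoid of the reward margin between the preferred and the dispreferred response. The function name, argument names, the default `beta=0.1`, and the toy log-probabilities are all illustrative assumptions, and a real training loop would compute this over batches of summed token log-probabilities:

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Negative log-sigmoid of the implicit reward margin for one preference pair."""
    # Implicit rewards: how much more likely the policy makes each
    # response compared with the frozen reference model.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Toy sequence log-probabilities (hypothetical numbers).
loss_equal = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # policy == reference
loss_better = dpo_loss(-4.0, -6.0, -5.0, -5.0)  # policy favors the chosen response
loss_worse = dpo_loss(-6.0, -4.0, -5.0, -5.0)   # policy favors the rejected response

print(loss_equal, loss_better, loss_worse)
```

No separate reward network appears anywhere: the only inputs are log-probabilities from the policy and the reference model, and the loss falls as the policy shifts probability toward the human-preferred response.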
Although it’s still too early to be sure, I am cautiously optimistic that DPO will have a huge impact on LLMs and beyond in the next few years.
You can read the paper here: https://t.co/m14qRYszVa I also write more about this in The Batch (linked to below).
https://t.co/8h2ag2plIa