John Lau @JohnKSLau - Twitter Profile

6 months ago

I'm Boris and I created Claude Code. Lots of people have asked how I use Claude Code, so I wanted to show off my setup a bit. My setup might be surprisingly vanilla! Claude Code works great out of the box, so I personally don't customize it much. There is no one correct way to use Claude Code: we intentionally build it in a way that you can use it, customize it, and hack it however you like. Each person on the Claude Code team uses it very differently. So, here goes.

1K

55K

7K

104K

8M

JohnKSLau retweeted

Andrej Karpathy

@karpathy

8 months ago

I quite like the new DeepSeek-OCR paper. It's a good OCR model (maybe a bit worse than dots), and yes data collection etc., but anyway it doesn't matter. The more interesting part for me (esp as a computer vision person at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input.

Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:

- more information compression (see paper) => shorter context windows, more efficiency
- significantly more general information stream => not just text, but e.g. bold text, colored text, arbitrary images.
- input can now be processed with bidirectional attention easily and as default, not autoregressive attention - a lot more powerful.
- delete the tokenizer (at the input)!! I already ranted about how much I dislike the tokenizer. Tokenizers are ugly, separate, not end-to-end stage. It "imports" all the ugliness of Unicode, byte encodings, it inherits a lot of historical baggage, security/jailbreak risk (e.g. continuation bytes). It makes two characters that look identical to the eye look as two completely different tokens internally in the network. A smiling emoji looks like a weird token, not an... actual smiling face, pixels and all, and all the transfer learning that brings along. The tokenizer must go.

OCR is just one of many useful vision -> text tasks. And text -> text tasks can be made to be vision -> text tasks. Not vice versa.

So maybe the User message is images, but the decoder (the Assistant response) remains text. It's a lot less obvious how to output pixels realistically... or if you'd want to.

Now I have to also fight the urge to side quest an image-input-only version of nanochat...
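The point about look-alike characters can be checked with the standard library alone. The sketch below is an illustration added here, not something from the tweet: it compares a Latin "a" with a Cyrillic "а" (the choice of characters and the helper name `describe` are assumptions made for the example) and shows that their UTF-8 bytes differ, so any byte-based tokenizer necessarily sees two unrelated inputs.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str, bytes]:
    """Return the code point, Unicode name, and UTF-8 bytes of one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8"))


# Two characters that render almost identically in most fonts.
latin_a = "a"          # LATIN SMALL LETTER A
cyrillic_a = "\u0430"  # CYRILLIC SMALL LETTER A

for ch in (latin_a, cyrillic_a):
    print(describe(ch))

# The glyphs look alike, but the byte sequences differ, so a tokenizer that
# operates on bytes cannot treat them as the same symbol.
same_bytes = latin_a.encode("utf-8") == cyrillic_a.encode("utf-8")
print(same_bytes)
```

A model that reads rendered pixels would receive nearly the same input for both characters, which is the contrast the tweet draws.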

558

13K

2K

7K

3M

JohnKSLau retweeted

Alicia Curth @AliciaCurth

over 1 year ago

From double descent to grokking, deep learning sometimes works in unpredictable ways.. or does it? For NeurIPS, @Jeffaresalan & I explored if & how statistics + smart linearisation can help us better understand & predict numerous odd deep learning phenomena, and learned a lot.. 🧵1/n https://t.co/uaLMDhFVc6

6

573

81

710

71K

JohnKSLau retweeted

Denny Zhou

@denny_zhou

almost 2 years ago

What is the performance limit when scaling LLM inference? Sky's the limit. We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed. Remarkably, constant depth is sufficient. https://t.co/HO2seV73KT (ICLR 2024)

107

3K

506

2K

795K

JohnKSLau retweeted

Daily Dose of Data Science

@DailyDoseOfDS_

almost 2 years ago

Bayes’ theorem clearly explained:
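The explanation itself was an image that did not survive in this feed. As a stand-in, here is a minimal worked example of Bayes' theorem, P(H | E) = P(E | H) P(H) / P(E). The scenario and all numbers (a 1% prior, a 95% true-positive rate, a 5% false-positive rate) are hypothetical values chosen for illustration and are not taken from the tweet.

```python
def posterior(prior: float, likelihood: float, false_positive_rate: float) -> float:
    """Bayes' theorem for a binary hypothesis H and a positive observation E.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    """
    # Total probability of the evidence: P(E) = P(E|H)P(H) + P(E|~H)P(~H)
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: a rare condition and a fairly accurate test.
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"P(condition | positive test) = {p:.3f}")
```

With these assumed numbers the posterior stays well below 50% even though the test is right 95% of the time, because the low prior dominates.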

24

6K

969

5K

907K

JohnKSLau retweeted

Daily Stoic

@dailystoic

almost 2 years ago

30 short rules for a better life (from the Stoics).

14

9K

1K

9K

1M

JohnKSLau retweeted

Ravid Shwartz Ziv

@ziv_ravid

almost 2 years ago

1/8 Looks like my paper "Tabular Data: Deep Learning is Not All You Need" just hit 1,000+ citations 🥳🥳🥳 Here's the story of how we almost didn't publish it... https://t.co/KiZ9dUTYWn

14

1K

178

997

240K

John Lau @JohnKSLau

almost 2 years ago

@mhidaka Thank you for all your hard work!

0

1

0

152

JohnKSLau retweeted

LetsDefend

@LetsDefendIO

about 2 years ago

TCP IP Model

5

2K

470

2K

162K

JohnKSLau retweeted

Akshay 🚀

@akshay_pachaar

about 2 years ago

Self-attention as a directed graph! Self-attention is at the heart of transformers, the architecture that led to the LLM revolution that we see today. In this post, I'll clearly explain self-attention & how it can be thought of as a directed graph. Read more...👇

https://t.co/HNfVEnUkzu
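The linked thread is not reproduced here, so the following is only one way to read the graph view, under assumptions: each token is a node, and the softmax-normalised attention weight from token i to token j is the weight of a directed edge i -> j. The toy query and key vectors and the function name `attention_edges` are made up for the example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_edges(queries, keys):
    """Return {(i, j): weight}, where weight is how much token i attends to token j."""
    d = len(keys[0])
    edges = {}
    for i, q in enumerate(queries):
        # Scaled dot-product scores of query i against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        for j, w in enumerate(softmax(scores)):
            edges[(i, j)] = w
    return edges


# Toy 2-dimensional queries and keys for three tokens (hypothetical values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

edges = attention_edges(Q, K)
for (i, j), w in sorted(edges.items()):
    print(f"{i} -> {j}: {w:.3f}")
```

Each node's outgoing weights sum to 1, so the attention matrix is the weighted adjacency matrix of a complete directed graph with self-loops.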

7

1K

172

1K

118K

JohnKSLau retweeted

Holden @hodlenx

about 2 years ago

🚀 Excited to introduce PowerInfer-2: A game-changing LLM inference engine for mobile devices by the #PowerInfer team. It smoothly runs a 47B model with a staggering 29x speedup on smartphones! Watch our demo to see it in action! 🎥 Technical details at: https://t.co/7bx5EnzWCs

31

574

140

385

68K

JohnKSLau retweeted

Tivadar Danka

@TivadarDanka

over 3 years ago

The single most undervalued fact of linear algebra: matrices are graphs, and graphs are matrices. Encoding matrices as graphs is a cheat code, making complex behavior simple to study. Let me show you how!

https://t.co/abEviwAmIO
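A small sketch of the matrix/graph correspondence, under the usual convention (an assumption here, since the thread is not included) that entry A[i][j] = 1 means there is an edge from node i to node j. One well-known consequence is that entry (i, j) of the k-th power of A counts the walks of length k from i to j. The example graph is made up.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def edges_of(a):
    """Read a 0/1 matrix as a directed graph: one edge i -> j per nonzero entry."""
    return [(i, j) for i, row in enumerate(a) for j, x in enumerate(row) if x]


# Hypothetical 3-node graph with edges 0->1, 0->2, 1->2, 2->0.
A = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]

print(edges_of(A))

# Entry (i, j) of A @ A counts the walks of length 2 from i to j.
A2 = matmul(A, A)
for row in A2:
    print(row)
```

For instance, A2[0][2] is 1 because the only two-step walk from node 0 to node 2 is 0 -> 1 -> 2.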

177

18K

3K

9K

2M

JohnKSLau retweeted

Santiago

@svpino

about 2 years ago

There's a stunning, simple explanation behind matrix multiplication. This is the first time this clicked on my brain, and it will be the best thing you read all week. Here is a breakdown of the most crucial idea behind modern machine learning: 1/15

https://t.co/R0V6fHVYVV

49

6K

769

11K

1M

JohnKSLau retweeted

Yann LeCun

@ylecun

about 2 years ago

So much misunderstanding of this comment! Here is a list of things I am *NOT* saying:

- you need a PhD to do Science. You don't. A PhD teaches you to do research, but you can learn that on your own (though it's much easier with a mentor).
- you need to get papers accepted by a journal or conference to publish: you don't. You can just post it in https://t.co/yQWSsbCJc1 . Many influential papers never went through the formal peer review process, or went through it after they became influential.
- engineering is not science: it can be, depending on your methodology. I'm a scientist *and* an engineer. These activities are complementary and need each other.
- science requires formal papers: it doesn't. A clear explanation on a website and a piece of code on a public repo will do.

What I *AM* saying is that science progresses through the collision of ideas, verification, analysis, reproduction, and improvements. If you don't publish your research *in some way* your research will likely have no impact.

744

9K

671

2K

2M

JohnKSLau retweeted

Koichi Tsunoda @無職

@KoichiTsunoda

about 2 years ago

If you want to promote diversity, I recommend first throwing yourself into an environment where you are an overwhelming minority (socially disadvantaged).

0

28

2

7K

JohnKSLau retweeted

antirez @antirez

over 2 years ago

Using MicroPython Viper code emitter I got something like a 20x speed-up. C-level performances or alike. Quite a game changer when speed is needed.

https://t.co/AJzq4HA3Af

5

96

9

38

14K

JohnKSLau retweeted

Josef Strzibny

@strzibnyj

over 2 years ago

I still don't know how we got here as an industry.

109

1K

69

303

518K

JohnKSLau retweeted

Andrew Ng

@AndrewYNg

over 2 years ago

It is only rarely that, after reading a research paper, I feel like giving the authors a standing ovation. But I felt that way after finishing Direct Preference Optimization (DPO) by @rm_rafailov @archit_sharma97 @ericmitchellai @StefanoErmon @chrmanning and @chelseabfinn. This beautiful paper proposes a much simpler alternative to RLHF (reinforcement learning from human feedback) for aligning language models to human preferences.

RLHF has been a key technique for training LLMs. In brief, RLHF (i) Gets humans to specify their preferences by ranking LLM outputs, (ii) Trains a reward model (used to score LLM outputs) -- typically represented using a transformer network -- to be consistent with the human rankings, (iii) Uses reinforcement learning to tune an LLM, also represented as a transformer, to maximize rewards. This requires two transformer networks, and RLHF is also finicky to the choice of hyperparameters.

DPO simplifies the whole thing. Via clever mathematical insight, the authors show that given an LLM, there is a specific reward function for which that LLM is optimal. DPO then trains the LLM directly to make the reward function (that’s now implicitly defined by the LLM) consistent with the human rankings. So you no longer need to deal with a separately represented reward function, and you can train the LLM directly to optimize the same objective as RLHF.

Although it’s still too early to be sure, I am cautiously optimistic that DPO will have a huge impact on LLMs and beyond in the next few years. You can read the paper here: https://t.co/m14qRYszVa I also write more about this in The Batch (linked to below). https://t.co/8h2ag2plIa
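To make the "implicitly defined reward" concrete, here is a minimal sketch of the per-example DPO loss as commonly stated: the implicit reward of a response is beta times the log-ratio of the policy's probability to a frozen reference model's probability, and the loss is the negative log-sigmoid of the reward margin between the preferred and the rejected response. The function name, the value of `beta`, and the log-probabilities below are hypothetical; a real implementation would compute sequence log-probabilities from two language models and average over a batch.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss from sequence log-probabilities.

    Implicit reward of a response y: beta * (log pi(y|x) - log pi_ref(y|x)).
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Hypothetical log-probabilities. When the policy equals the reference,
# the margin is zero and the loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))

# Raising the chosen response's probability and lowering the rejected one's
# (relative to the reference) reduces the loss.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

No separate reward network appears anywhere in the computation, which is the simplification the tweet highlights.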

51

5K

749

4K

696K

JohnKSLau retweeted