I don't have too too much to add on top of this earlier post on V3 and I think it applies to R1 too (which is the more recent, thinking equivalent).
I will say that Deep Learning has a legendary ravenous appetite for compute, like no other algorithm that has ever been developed in AI. You may not always be utilizing it fully but I would never bet against compute as the upper bound for achievable intelligence in the long run. Not just for an individual final training run, but also for the entire innovation / experimentation engine that silently underlies all the algorithmic innovations.
Data has historically been seen as a separate category from compute, but even data is downstream of compute to a large extent - you can spend compute to create data. Tons of it. You've heard this called synthetic data generation, but less obviously, there is a very deep connection (equivalence even) between "synthetic data generation" and "reinforcement learning". In the trial-and-error learning process in RL, the "trial" is model generating (synthetic) data, which it then learns from based on the "error" (/reward). Conversely, when you generate synthetic data and then rank or filter it in any way, your filter is straight up equivalent to a 0-1 advantage function - congrats you're doing crappy RL.
Last thought. Not sure if this is obvious. There are two major types of learning, in both children and in deep learning. There is 1) imitation learning (watch and repeat, i.e. pretraining, supervised finetuning), and 2) trial-and-error learning (reinforcement learning). My favorite simple example is AlphaGo - 1) is learning by imitating expert players, 2) is reinforcement learning to win the game. Almost every single shocking result of deep learning, and the source of all *magic* is always 2. 2 is significantly significantly more powerful. 2 is what surprises you. 2 is when the paddle learns to hit the ball behind the blocks in Breakout. 2 is when AlphaGo beats even Lee Sedol. And 2 is the "aha moment" when the DeepSeek (or o1 etc.) discovers that it works well to re-evaluate your assumptions, backtrack, try something else, etc. It's the solving strategies you see this model use in its chain of thought. It's how it goes back and forth thinking to itself. These thoughts are *emergent* (!!!) and this is actually seriously incredible, impressive and new (as in publicly available and documented etc.). The model could never learn this with 1 (by imitation), because the cognition of the model and the cognition of the human labeler is different. The human would never know to correctly annotate these kinds of solving strategies and what they should even look like. They have to be discovered during reinforcement learning as empirically and statistically useful towards a final outcome.
(Last last thought/reference this time for real is that RL is powerful but RLHF is not. RLHF is not RL. I have a separate rant on that in an earlier tweet
https://t.co/RMIpFPVpuM)
I’m a US citizen because I was born while my dad was a student.
My citizenship allowed me to work at UChicago, which then got me into a Harvard PhD, a degree I hope to use to teach fellow Americans.
The US has given me so many opportunities Hurts to think I’m not wanted here.
The impact of EO would be huge.
Story I received today 👇
I don't know if this feels like a pricking case of betrayal in many levels just for me or if this applies to a lot of others.
I've been brought up to believe in merit, the value of hard work and dedication and doing what's right by the law.
I completed my masters in this country from a reputed institute and I major in cyber security.
I also have an approved petition in the National Interest waiver and exceptional ability category (EB2-NIW ) but because of my country of birth I'm stuck in backlogs.
My husband works for one of the companies that Elon founded and is also an accomplished engineer with skills in advanced manufacturing.
He doesn't have a US citizenship like me and we are both on work visas.
We are both expecting our first kid in the next two months and as we have built a whole life together here for over 8+ years,
after having worked extremely hard, it feels like a last stab to watch birthright citizenship also being taken away from our yet to be born child.
I know this might seem like a rant but I'm sure there are human stories like mine that makes this new EO extremely unfair.
My request to you is: can you help amplify stories like ours?
I would like to keep names out of this until we figure out the situation going forward. However I'm sure I'm not the only one.
We are both legal immigrants who are very much within the "jurisdictional authority" of the US.
We studied here.
We have built our lives here.
We are not 'birth tourists'.
We pay our taxes.
We pay for an education system that even we don't see benefitting our own kid in the years to come.
We are in perpetual fear and uncertainty.
There is an unjust and unforgiving lack of political will to do right by the folks abiding by the law even if we are a silent minority.
I hope this gets amplified and I hope you'd help in doing that. Thank you very much for reading and I appreciate your time.
@elonmusk - what are you even doing?
# scheduling workloads to run on humans
Some computational workloads in human organizations are best "run on a CPU": take one single, highly competent person and assign them a task to complete in a single-threaded fashion, without synchronization. Usually the best fit when starting something new. Comparable to "building the skeleton" of a thing.
Other workloads are best run on a GPU: take a larger number of (possibly more junior) people and assign tasks in parallel: massively multi-threaded, requiring synchronization overhead. Usually a good fit for later stages of a project, or parts that naturally afford parallelism, comparable to "fleshing out" a thing when the skeleton is there.
There's some middle ground here - sometimes you can imagine a multi-threaded CPU execution of a small team collaborating.
A good manager will understand the computational geometry of the project at hand and know when to delegate parts of it on the CPU or on the GPU. One notable place where the analogy breaks down a bit is that the worst thing that can happen when you misallocate computer resources is that your program will run slower. But in human organizations it can be much worse - not just slower, but the result can be of lower quality overall, more brittle, more disorganized, less consistent, uglier.
The most common stumbling point here is trying to parallelize something that was supposed to run on the CPU. In the common tongue, this comes from the misunderstanding that something can go faster if you put more people on it, usually leading to outcomes where something is "designed by a committee" - not only is the thing actually slower, but the philosophy is inconsistent, the entropy is high, and the long-term outcomes much worse.
The opposite problem is more rare and usually looks like someone doing something repetitive, uninteresting or tedious, where they could really benefit from more help.
I think this is one accidental advantage of startups - they lack resources of large companies and run compute on powerful CPUs, winning in cases where that is the right thing to do. Larger companies, especially in cases where something is deemed of high strategic importance, will almost always reach for too much parallelism.
TLDR: Think about your project, its computational geometry, its inherent parallelism, and which parts are a best fit for a CPU or a GPU.
Panda: Performance debugging for databases using LLM agents - a very interesting application of LLMs by @vikramank14 and the team at #AWS AI Labs, published at @cidrdb. Download the pdf from @AmazonScience. https://t.co/uBQe3gqCGF
Panda: Performance debugging for databases using LLM agents - a very interesting application of LLMs by @vikramank14 and the team at #AWS AI Labs, published at @cidrdb. Download the pdf from @AmazonScience. https://t.co/uBQe3gqCGF
This is how I advise my #PhD students to write research manuscripts (in case someone finds it helpful).
General points:
1. Research questions addressed by your manuscript are key and should guide you.
2. Don’t view your manuscript as an article. See it as a STORY.
3. Pick the writing style that is easily understood by a broader community. Make reading easy.
4. Most of data should get into the paper. If some doesn’t support the hypothesis, it still must be in the Suppl. Information. It must show the reproducibility limits.
5. Make the paper shorter, not longer. Cut out things that may sound like ‘bluff’ or ‘decoration’ of the story. Use well-defined terminology, don’t invent it unless clearly necessary.
6. Focus on reporting & explaining the numbers. Minimize discussions of qualitative outcomes and your imagination.
▫️
Specific steps:
1️⃣ First, formulate and polish the key questions that your study addresses. It may take hours or even days (even though you've been doing research in this area for years). A single study should address no more than 1-3 key questions. It’s your perfect start for writing.
2️⃣ Write down the structure of your STORY first: Sections and Subsections that will answer those questions. Into each subsection, put 1-2 sentences that formulate the message(s) from this subsection. It will hugely help you navigate the manuscript later and save a lot of time.
3️⃣ Write approximate messages in the conclusion section. Usually, no more than 1-4 sentences.
▫️
At this point, SHARE your [structure+questions+messages] document with your advisor for feedback. Toss it back and forth until you both converge. You can also include major collaborators if needed.
▫️
4️⃣ Write the introduction part. Put down the paragraphs that introduce a reader into the key question(s) of the manuscript and the background of your story.
5️⃣ Write the main text for each section, smoothly and firmly. Each paragraph should add a separate value and end with a message-like sentence. Follow the “First… Second… Third…” structure for paragraphs when possible, it gives rigor and readability to your story.
6️⃣ Write the conclusions. Add a broader perspective that is justified and not generic.
7️⃣ Write the abstract. It must have simple terminology and clearly explain what readers can find inside the paper. It also should contain the key conclusions.
8️⃣ Write up 4-5 different titles and spend >30 mins with your team discussing which title sounds best.
Finally, iterate on the resulting draft within your team.
The number of drafts can easily exceed 20.
▫️
❗In addition, I always emphasize that a high quality of your research paper:
- sharpen your writing and analytical skills.
- shapes your reputation.
- shows who you are as a researcher and communicator.
▫️
p.s. Everyone has a different style of advising and writing. You can adopt only some specific steps if you find them helpful.
p.p.s. Another way that we sometimes use is by starting with figures ('story in figures' style).
#AcademicTwitter #AcademicChatter
So the actual scary part to me is that GPT4 understands what it means to say, "Compress this in a way where *you* can decompress it." Humans take for granted that we know our own capabilities, that we reflect, that we can imagine how we would react to a future input, we can optimize over possible future inputs like that.
Since when does GPT4 know what "you" refers to, what "you" can do, since when can it imagine "you" well enough to answer an unprecedented and untrained question about how to compress a sentence that "you" will decompress? And solve it in a way no human would, that couldn't just be imitative?
Like. It found a short sentence with, I assume, embedding similar to the long sentence.
How does it KNOW? It doesn't have access to the encodings!
("Stochastic parrot" my fucking ass.)
🪐 Introducing Galactica. A large language model for science.
Can summarize academic literature, solve math problems, generate Wiki articles, write scientific code, annotate molecules and proteins, and more.
Explore and get weights: https://t.co/jKEP8S7Yfl
How can robots perform a wide variety of novel tasks from natural language?
Execited to present Code as Policies - using language models to directly write robot policy code from language instructions.
See paper, colabs, blog, and demos at https://t.co/QS6bgGOORw
long 🧵👇
"What Makes Convolutional Models Great on Long Sequence Modeling?"
CNNs—not transformers—now dominate the hardest sequence modeling benchmark.
Here's how this happened: [1/14]
@ylecun@jamesbuchanan27 Also, these systems are fundamentally deterministic but what makes them stochastic are millions of end-points (users) interacting with them in an unpredictable manner. It'll be interesting how this uncertainty can be factorized in aleatoric & epistemic, 'Z' in JEPA and estimated.
@ylecun@jamesbuchanan27 More specifically, in our group we look at building these world models for systems like databases that are well studied and interpretable but complex and configurable. So modular world models or causal models seem to be a better approach than large monolith architecture (4/n)