20min talk I gave at the Berkeley AI hackathon a few weeks ago, on how hacking around makes its way to real-world impact in my experience.
While True: build and publish projects.
Accumulate 10,000 hours.
Snowball your work.
https://t.co/dril50VAU5
never ever watch tiktok, instagram / facebook reels, or youtube shorts. it's literally devil juice incarnate, it will fry your brain. do not watch that shit
I often see resumes where someone has worked at 6 companies over the last 10 years.
I'm staring at one right now. These job-hoppers usually last a little under 2yrs per company on average.
If someone has a jumpy resume, it's the first thing I notice. It's so blatant to me, yet it continues to fool a lot of people.
Thinking back over the last 15yrs at my companies:
I've hired thousands of people, and I can't think of a single person I've hired with a jumpy resume who has worked out long-term. Not one.
An incredible skill that I have witnessed, especially at OpenAI, is the ability to make “yolo runs” work.
The traditional advice in academic research is, “change one thing at a time.” This approach forces you to understand the effect of each component in your model, and therefore is a reliable way to make something work. I personally do this quite religiously. However, the downside is that it takes a long time, especially if you want to understand the interactive effects among components.
A “yolo run” directly implements an ambitious new model without extensively de-risking individual components. The researcher doing the yolo run relies primarily on intuition to set hyperparameter values, decide what parts of the model matter, and anticipate potential problems. These choices are non-obvious to everyone else on the team.
Yolo runs are hard to get right because many things have to go correctly for it to work, and even a single bad hyperparameter can cause your run to fail. It is probabilistically unlikely to guess most or all of them correctly.
Yet multiple times I have seen someone make a yolo run work on the first or second try, resulting in a SOTA model. Such yolo runs are very impactful, as they can leapfrog the team forward when everyone else is stuck.
I do not know how these researchers do it; my best guess is intuition built up from decades of running experiments, a deep understanding of what matters to make a language model successful, and maybe a little bit of divine benevolence. But what I do know is that the people who can do this are surely 10-100x AI researchers. They should be given as many GPUs as they want and be protected like unicorns.
The GOAT of tennis @DjokerNole said: "35 is the new 25.” I say: “60 is the new 35.” AI research has kept me strong and healthy. AI could work wonders for you, too!
PyTorch's design origins, its connection to Lua, its intertwined deep connection to JAX, its symbiotic connection to Chainer
The groundwork for PyTorch originally started in early 2016, online, among a band of Torch7's contributors.
Torch7 (~2010-2017)
These days, we also commonly refer to Torch7 as LuaTorch, as it was used via Lua. Torch7 was written by Ronan Collobert, @clmt and @koraykv in ~2010. I was deeply involved in Torch7 since 2012, with official "maintainer" status, joining these three original authors in April 2014.
Refactoring LuaTorch to be language agonstic (late 2015 to mid 2016)
LuaTorch's C backend with all the CPU and CUDA code for Linear Algebra and Neural Networks was deeply intertwined with Lua. So, a bunch of us lead by @lantiga@neurosp1ke@szagoruyko5 me @apaszke@fvsmassa refactored these backends to be agnostic of Lua, and usable independently. We did this after discussing online that we should move LuaTorch to a new, modern design, but hadn't quite framed what that design should be.
Writing a new Python based Torch (mid 2016)
@apaszke reached out to me early 2016 looking for internships. At that time, the entire LuaTorch team at @AIatMeta was ~3 people (@GregoryChanan@TrevorKilleen and me). I asked Adam to come do an internship to build the next version of LuaTorch, with modern design. @colesbury was in-between projects, so he joined in full-time as well.
We started from a fork of the LuaTorch, LuaTorch-nn codebases specifically for two things:
1. the TH/THC and THNN/THCUNN C backends
2. Building a compatibility with LuaTorch's checkpoints, so that LuaTorch users could smoothly continue into PyTorch. We did this by transpiling LuaTorch's `nn` code to Python. We called this package in PyTorch `torch.legacy.nn`.
Then, coming to the design itself, we debated a lot of designs. The strong inspirations were:
1. torch-autograd (written by @awiltschko and @clmt ) 2. Chainer (written by the team at @PreferredNet ).
@ebetica who loved Chainer would obsessively tell us its the best thing ever, so he came on board to build this together with us. Quite a few others such as Natalia Gimelshein and @adamlerer part-time got involved in various ways.
We wrote the code for the new design of PyTorch from scratch.
The connection to JAX: inspiration of HIPS/autograd
@awiltschko's torch-autograd (which was a big inspiration for PyTorch's design) was directly inspired by @SingularMattrix@DougalMaclaurin@DavidDuvenaud and @ryan_p_adams 's HIPS/autograd library, so in that indirect sense, we had strong inspiration from Ryan's library. In fact, we were so oblivious to certain origins that we named our Autodiff engine `torch.autograd` because we thought it was the norm within the autodiff community to call things "autograd". We later had to apologize to @SingularMattrix and team about the name of our subpackage conflicting with their `autograd` package.
Later, @SingularMattrix@DougalMaclaurin and others went on to create JAX, continuing down their design exploration of HIPS/autograd.
The inspiration from Chainer -> PyTorch and the inspiration for PyTorch -> Chainer v2
Chainer was a strong inspiration, we really liked the concept of Chains and stuff. The Chainer devs were friends of us, and we interacted with them a lot as well. I visited them in Japan in 2017.
Chainer's design is in my opinion a revolutionary design -- very original for that time and pretty awesome. We are proud to have been inspired from it.
However, unlike people commonly misunderstand and misattribute, we didn't simply replicate Chainer's design as-is. People have posted online on how PyTorch's design looks exactly like Chainer's and hence its origins are just copy-paste -- and that's because they don't understand the co-evolution. After PyTorch's release, Chainer evolved to include some of PyTorch's good ideas, and eventually they converged to look the same. For example, Chainer's nn Chains required you to pass in all the modules to the constructor (or use an add_link). The concept of self-assignment (i.e.) `self.conv = nn.Conv2d(...)`, the concept of `Parameter` was something we introduced as an evolved upgrade from Chainer v1. We also innovatively changed the way the autodiff engine was implemented -- things like "variable versioning" to detect correctness issues with inplace operations, and a few other new ideas, ideas that eventually went back into Chainer in their v2.
When Chainer's community wanted to stop development, @PreferredNet amicably and proactively joined the PyTorch community (link in references).
Post-launch evolution (2017 to present)
This post doesn't have the space to cover PyTorch's:
* evolution to add in ideas from Caffe2 (@jiayq@dzhulgakov et. al)
* its 5 compiler designs before we landed on what seems great (Zach DeVito, @ezyang@apaszke@jamesr66a Jason Ansel Christian Sarofeen et. al.)
* our inspirations from JAX and designing functorch (Richard Zou, @cHHillee@vfdev_5 Animesh Jain)
* our entire distributed design and evolution
* the origins of the sparse package (@braizh ) and its evolution (@cpuhrsch et. al.)
* PyTorch's domain libraries
* data loading (@colesbury@TongzhouWang )
* community design, community growth, innovation in design of incentives (@ptrblck_de Alban Desmaison, me)
* Several innovations in GPU code (several key folks from NVIDIA and Meta)
Many other parts of PyTorch that I didn't include -- its become somewhat of a monolith at this point.
Attributing ideas is healthy, awesome and should be done more often
Since PyTorch has launched, several new libraries have used the designs and ideas from PyTorch -- the particular new ideas that we introduced eventually propagated to many other libraries -- and this is awesome.
We are proud to have been inspired by work before us, and we are proud to have inspired work after us.
We also take pride in always attributing our inspirations clearly -- torch-autograd, chainer and many other projects that have inspired us in lesser ways.
I think people don't do this enough, attribute their origins clearly -- either ego or corporate controls come into play to erase history -- and people should do more here. In that sense, I'm really proud of my JAX friends who see framework design as a scientific endeavor, openly discussing ideas and evolutions, and proudly attributing their origins and inspiration.
References:
1. My reply in March'17 on the origins of PyTorch: https://t.co/zjViIvFm2M
2. Chainer's v1 design: https://t.co/gMdDC1ZMKR
3. https://t.co/tra6N8bJiT
4. PyTorch's autodiff innovations in a short paper: https://t.co/ke1bbkzv5o
5. The PyTorch paper: https://t.co/tbcGibPqgR
why does it feel like there are more insane people in the generative image diffusion space than language modeling space? is it really because of just anime? picture related
So @ylecun: "I've been advocating for deep learning architecture capable of planning since 2016" vs me: "I've been publishing deep learning architectures capable of planning since 1990." I guess in 2016 @ylecun also picked up the torch. (References attached)
Original tweet by @ylecun: https://t.co/bCvO0SEkhl
REFERENCES
Online planning with deep learning architectures / artificial neural networks (NNs):
[PLAN1] J. Schmidhuber (1990). Making the world differentiable: on using self-supervised fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments. TR FKI-126-90, TU Munich, Feb 1990, revised Nov 1990. The first paper on online planning with reinforcement learning recurrent NNs and on artificial curiosity with generative adversarial networks where a generator NN is fighting a predictor NN in a minimax game. Extending NN-based system identification and control of the 1980s by Werbos, Munro, Nguyen & Widrow, and others. See also overviews [PLAN7][LEC]. https://t.co/VJvrmtsE6m
[PLAN2] J. Schmidhuber (1990). An on-line algorithm for dynamic reinforcement learning and planning in reactive environments. In IJCNN'90, San Diego, volume 2, pages 253-258, June 17-21. Based on [PLAN1].
[PLAN3] J. Schmidhuber (1991). Reinforcement learning in Markovian and non-Markovian environments. NIPS'3, pages 500-506. San Mateo, CA: Morgan Kaufmann. Partially based on [PLAN1]. https://t.co/pp2LU4lr0w
[PLAN4] David Ha (@hardmaru), J. Schmidhuber (2018). Recurrent World Models Facilitate Policy Evolution. Advances in Neural Information Processing Systems (NIPS), Montreal. (Oral.) https://t.co/PrHKOHF54i. Github: https://t.co/RrUNYSIz6n
[PLAN7] J. Schmidhuber (2020, updated 2023). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). https://t.co/OBoGmuZWab
Hierarchical planning with with deep learning architectures:
[HPLAN1] J. Schmidhuber (1990). Towards compositional learning with dynamic neural networks. TR FKI-129-90, TU Munich. An RL machine gets extra command inputs of the form (start, goal). A JEPA-like evaluator NN learns to predict the current rewards/costs of going from start to goal. An (R)NN-based subgoal generator also sees (start, goal), and uses (copies of) the evaluator NN to learn by gradient descent a sequence of cost-minimising intermediate subgoals. The RL machine tries to use such subgoal sequences to achieve final goals. The system is learning action plans at multiple levels of abstraction and multiple time scales and solves what Y. LeCun called an "open problem" in 2022 [LEC]. https://t.co/Kq1oViaRCo
[HPLAN2] J. Schmidhuber (1991). Learning to generate sub-goals for action sequences. In T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 967-972. Elsevier Science Publishers B.V., North-Holland. Extending [HPLAN1]. https://t.co/3Jt80SSqy9
[HPLAN3] J. Schmidhuber and R. Wahnsiedler (1992). Planning simple trajectories using neural subgoal generators. In J. A. Meyer, H. L. Roitblat, and S. W. Wilson, editors, Proc. of the 2nd International Conference on Simulation of Adaptive Behavior, pages 196-202. MIT Press. Based on [HPLAN1-2]. https://t.co/aqoryPjb3O
[HPLAN4] J. Schmidhuber (2015). On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models. Check out the reinforcement learning prompt engineer in Sec. 5.3: a controller neural network C learns to send prompt sequences into a world model M (e.g., a foundation model) trained on, say, videos of actors. C also learns to interpret answers of M, extracting algorithmic information from M. Acid test: does C learn its control tasks faster with M than without? Is it cheaper to learn C’s tasks from scratch, or to address algorithmic info in M in some computable way, enabling things such as abstract hierarchical planning and reasoning? https://t.co/5FQEb7Cc3F . See also this tweet https://t.co/1szutyBLUw
[HPLAN5] J. Schmidhuber (2018). One Big Net For Everything. Collapsing the controller and the world model and the abstract planner of [HPLAN4] into a single network, using the neural network distillation of 1991 [UN1]. https://t.co/7NLeiMnFcq
[UN1] J. Schmidhuber (1991). Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992. Based on TR FKI-148-91, TUM, 1991. First working Deep Learner based on a deep RNN hierarchy (with different self-organising time scales), overcoming the vanishing gradient problem through unsupervised pre-training and predictive coding (with self-supervised target generation). Also: compressing or distilling a teacher net (the chunker) into a student net (the automatizer) that does not forget its old skills—--such approaches are now widely used. See also this tweet: https://t.co/E9GjXsamkE
[LEC] J. Schmidhuber (AI Blog, 2022). LeCun's 2022 paper on autonomous machine intelligence rehashes but does not cite essential work of 1990-2015. Years ago, Schmidhuber's team published most of what LeCun calls his "main original contributions:" neural nets that learn multiple time scales and levels of abstraction, generate subgoals, use intrinsic motivation to improve world models, and plan (1990); controllers that learn informative predictable representations (1997), etc. This was also discussed on Hacker News, reddit, and in the media. See this tweet: https://t.co/R3s2amgjle LeCun also listed the "5 best ideas 2012-2022" without mentioning that most of them are from Schmidhuber's lab, and older. See this tweet: https://t.co/akuK7L0lCi
It is embarrassing how much this stupid fucking meme has affected my own thinking and my own mental health.
So, SO many times I’ve been the left person, and someone is explaining to me why I can’t do or think something without reading every book they’ve read, received the degrees they’ve earned, or had the life experiences they’ve had.
It used to bother me.
But every single time now when I find myself in that situation I pause and think, “is this person correct, or are they just the middle person and I should find the right person”
this is actually hilarious. Quora SEO'd themselves to the top of every search result, and is now serving chatGPT answers on their page, so that's propagating to the answers google gives
the internet is dying
@mikpanko if you can't express what you want in words, no amount of clicking will get you the right answer
if you can express it in words, it's not a big leap to express it in code