Very proud of our new preprint introducing reverse predictivity — a two-way test of AI–brain alignment. We find a striking asymmetry: models & brains don’t map to each other equally, while brain-to-brain mappings are symmetric 🧠🤖
Who invented convolutional neural networks (CNNs)?
1969: Fukushima had CNN-relevant ReLUs [2].
1979: Fukushima had the basic CNN architecture with convolution layers and downsampling layers [1]. Compute was 100 times more costly than in 1989, and a billion times more costly than today.
1987: Waibel applied Linnainmaa's 1970 backpropagation [3] to weight-sharing TDNNs with 1-dimensional convolutions [4].
1988: Wei Zhang et al. applied "modern" backprop-trained 2-dimensional CNNs to character recognition [5].
All of the above was published in Japan 1979-1988.
1989: LeCun et al. applied CNNs again to character recognition (zip codes) [6,10].
1990-93: Fukushima’s downsampling based on spatial averaging [1] was replaced by max-pooling for 1-D TDNNs (Yamaguchi et al.) [7] and 2-D CNNs (Weng et al.) [8].
2011: Much later, my team with Dan Ciresan made max-pooling CNNs really fast on NVIDIA GPUs. In 2011, DanNet achieved the first superhuman pattern recognition result [9]. For a while, it enjoyed a monopoly: from May 2011 to Sept 2012, DanNet won every image recognition challenge it entered, 4 of them in a row. Admittedly, however, this was mostly about engineering & scaling up the basic insights from the previous millennium, profiting from much faster hardware.
Some "AI experts" claim that "making CNNs work" (e.g., [5,6,9]) was as important as inventing them. But "making them work" largely depended on whether your lab was rich enough to buy the latest computers required to scale up the original work. It's the same as today. Basic research vs engineering/development - the R vs the D in R&D.
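To make the 1990-93 entry concrete, here is a minimal NumPy sketch contrasting downsampling by spatial averaging with max-pooling. It is an illustration only and is not taken from any of the cited papers; the helper name `pool2d`, the 2x2 block size, and the toy feature map are all assumptions chosen for the example.

```python
import numpy as np

def pool2d(x, size, reduce):
    """Downsample a 2-D array by applying `reduce` to
    non-overlapping size x size blocks (hypothetical helper)."""
    h, w = x.shape
    # Drop trailing rows/columns that do not fill a whole block.
    h, w = h - h % size, w - w % size
    # Axes 1 and 3 index the positions inside each block.
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return reduce(blocks, axis=(1, 3))

# Toy feature map (made-up values) with one strong, isolated activation.
feature_map = np.array([
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 8.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 4.0],
])

avg = pool2d(feature_map, 2, np.mean)  # spatial averaging
mx = pool2d(feature_map, 2, np.max)    # max-pooling

print(avg)
print(mx)
```

In this toy case, averaging gives the same value for the block holding the single strong activation and for the block of uniform weak ones, while max-pooling keeps the strong activation distinct.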
REFERENCES
[1] K. Fukushima (1979). Neural network model for a mechanism of pattern recognition unaffected by shift in position — Neocognitron. Trans. IECE, vol. J62-A, no. 10, pp. 658-665, 1979.
[2] K. Fukushima (1969). Visual feature extraction by a multilayered network of analog threshold elements. IEEE Transactions on Systems Science and Cybernetics. 5 (4): 322-333. This work introduced rectified linear units (ReLUs), now used in many CNNs.
[3] S. Linnainmaa (1970). Master's Thesis, Univ. Helsinki, 1970. The first publication on "modern" backpropagation, also known as the reverse mode of automatic differentiation. (See Schmidhuber's well-known backpropagation overview: "Who Invented Backpropagation?")
[4] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. Backpropagation for a weight-sharing TDNN with 1-dimensional convolutions.
[5] W. Zhang, J. Tanida, K. Itoh, Y. Ichioka. Shift-invariant pattern recognition neural network and its optical architecture. Proc. Annual Conference of the Japan Society of Applied Physics, 1988. First backpropagation-trained 2-dimensional CNN, with applications to English character recognition.
[6] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel: Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, 1(4):541-551, 1989. See also Sec. 3 of [10].
[7] K. Yamaguchi, K. Sakamoto, A. Kenji, T. Akabane, Y. Fujimoto. A Neural Network for Speaker-Independent Isolated Word Recognition. First International Conference on Spoken Language Processing (ICSLP 90), Kobe, Japan, Nov 1990. A 1-dimensional convolutional TDNN using Max-Pooling instead of Fukushima's Spatial Averaging [1].
[8] Weng, J., Ahuja, N., and Huang, T. S. (1993). Learning recognition and segmentation of 3-D objects from 2-D images. Proc. 4th Intl. Conf. Computer Vision, Berlin, pp. 121-128. A 2-dimensional CNN whose downsampling layers use Max-Pooling (which has become very popular) instead of Fukushima's Spatial Averaging [1].
[9] In 2011, the fast and deep GPU-based CNN called DanNet (7+ layers) achieved the first superhuman performance in a computer vision contest. See overview: "2011: DanNet triggers deep CNN revolution."
[10] How 3 Turing awardees republished key methods and ideas whose creators they failed to credit. Technical Report IDSIA-23-23, Swiss AI Lab IDSIA, 14 Dec 2023. See also the YouTube video for the Bower Award Ceremony 2021: J. Schmidhuber lauds Kunihiko Fukushima.
Excited to release what we’ve been working on at Amaranth Foundation, our latest whitepaper, NeuroAI for AI safety! A detailed, ambitious roadmap for how neuroscience research can help build safer AI systems while accelerating both virtual neuroscience and neurotech. 1/N
Why do video models handle motion so poorly? It might be lack of motion equivariance.
Very excited to introduce: Flow Equivariant RNNs (FERNNs), the first sequence models to respect symmetries over time.
Paper: https://t.co/dkk43PyQe3
Blog: https://t.co/I1gpam1OL8
1/🧵
Vision transformers have high-norm outliers that hurt performance and distort attention. While prior work removed them by retraining with “register” tokens, we find the mechanism behind outliers and make registers at ✨test-time✨—giving clean features and better performance! 🧵
Last week, our Triangle splatting paper was quietly released, and since then the tech community has ignited fierce debates about it!
It was trending on @hackernews!
Today we released the code!
A deep dive into the epic “comeback” of Triangles to the throne of 3D
🧵
1/n
A robot hand that grasps over 500 totally new objects without fail? Zero-shot, single-view & super reliable
⬇️ + Paper
Grasping random objects is hard for robots, especially when shapes, weights, and materials vary.
RobustDexGrasp solves this with a smart new way of seeing and controlling the hand, leading to near-perfect grasping, even in noisy or cluttered scenes.
Thank you for sharing, @Hui_Zhang_eth 🙏
Follow him!!
What makes it special
✅ Grabs 500+ unseen objects with 94.6% success using only single-view input
✅ Learns local shapes, not full geometry, for better generalization
✅ Trained with just 35 objects in sim but works in the real world with hundreds more
✅ Adapts to noise, unexpected forces, and even plays chess with VLM planning
It shows that smart sensing and adaptive control can take dexterous grasping to the next level.
Project: https://t.co/JWyFmmCmJ5
Paper: https://t.co/M90aheG6J6
Meet SO-101, next-gen robot arm for all, by @huggingface 🤗
Enables smooth takeover to boost AI capabilities, faster assembly (20 min), same affordable price ($100 per arm) 🤯
Get yours today! Links in thread below 👇
A banger just got released 💥
Here is a snapshot of L2D, the biggest self-driving dataset by far!
- 90 terabytes of data
- 5000 hours of driving
- 6 surrounding HD cameras
- OPENLY AVAILABLE
- Train your car to drive like @Tesla at home
🧵 More details in thread
AI can now generate high-quality music, and it sounds insanely good
NotaGen just dropped, and it's pre-trained on 1.6M pieces of music.
7 WILD examples so far
I'm a bit confused...
Google's Veo 2 is the best video model in text-to-video.
But on the other hand...
The newly released image-to-video for Veo 2 (on @freepik and @FAL) feels underwhelming.
Input images generated with @runwayml Frames.
Here it is compared to @LumaLabsAI