AI start-up founder at Kortical. Since the age of 15, I've been chasing the dream of building an AI to do my chores...
but for now it's mainly B2B AI
It is well established that models often memorise some subsets of their data, but the key distinction is that this isn't ONLY what they do.
This is an explainer for a paper that shows they can learn rules that are provably beyond interpolation. In other words, they can learn things that cannot be reached by memorisation or by interpolation between known datapoints.
Pretty cool huh?
https://t.co/nNfvyqfbLc
@fchollet Exciting timing, I just published a paper showing transformers can learn held-out rules where interpolation provably scores 0%. That rules out the big argument for why scores can't keep pushing higher on ARC-AGI-2, and now 3. Paper and explainer here: https://t.co/ygaLBR6qXz
@GaryMarcus your ACM piece made the case that LLMs are fundamentally limited to interpolation, no doubt why you expect this market plateau. I just published results that might surprise you —> transformers hit 97.9% on held-out rules where every interpolation method scores 0%! Backed by a formal proof. Would love your take. Full explainer and paper here: https://t.co/nNfvyqfbLc
Excited to release new repo: nanochat!
(it's among the most unhinged I've written).
Unlike my earlier similar repo nanoGPT, which only covered pretraining, nanochat is a minimal, from-scratch, full-stack training/inference pipeline of a simple ChatGPT clone in a single, dependency-minimal codebase. You boot up a cloud GPU box, run a single script, and as little as 4 hours later you can talk to your own LLM in a ChatGPT-like web UI.
It weighs ~8,000 lines of imo quite clean code to:
- Train the tokenizer using a new Rust implementation
- Pretrain a Transformer LLM on FineWeb, evaluate CORE score across a number of metrics
- Midtrain on user-assistant conversations from SmolTalk, multiple choice questions, tool use.
- SFT, evaluate the chat model on world knowledge multiple choice (ARC-E/C, MMLU), math (GSM8K), code (HumanEval)
- RL the model optionally on GSM8K with "GRPO"
- Run efficient inference on the model in an Engine with KV cache, simple prefill/decode, tool use (Python interpreter in a lightweight sandbox), talk to it over CLI or ChatGPT-like WebUI.
- Write a single markdown report card, summarizing and gamifying the whole thing.
Even for as low as ~$100 in cost (~4 hours on an 8XH100 node), you can train a little ChatGPT clone that you can kind of talk to, and which can write stories/poems, answer simple questions. About ~12 hours surpasses GPT-2 CORE metric. As you further scale up towards ~$1000 (~41.6 hours of training), it quickly becomes a lot more coherent and can solve simple math/code problems and take multiple choice tests. E.g. a depth 30 model trained for 24 hours (this is about equal to FLOPs of GPT-3 Small 125M and 1/1000th of GPT-3) gets into 40s on MMLU and 70s on ARC-Easy, 20s on GSM8K, etc.
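A quick back-of-the-envelope check of the figures above, as a sketch only. It assumes the cost scales linearly with node-hours, which the post does not state; the function names are made up for illustration, and the 12-hour and 24-hour dollar figures are derived here, not quoted from the post.

```python
# Rough cost arithmetic for the quoted 8XH100 figures.
# Assumption: price is linear in node-hours (not stated in the post).

def hourly_rate(total_cost_usd: float, hours: float) -> float:
    """Implied node price per hour for a run of the given cost and length."""
    return total_cost_usd / hours

def cost_for(hours: float, rate_usd_per_hour: float) -> float:
    """Estimated cost of a run of `hours` at a fixed hourly rate."""
    return hours * rate_usd_per_hour

# Rates implied by the two quoted price points.
rate_small = hourly_rate(100, 4)       # the ~$100, ~4 hour run
rate_large = hourly_rate(1000, 41.6)   # the ~$1000, ~41.6 hour run

# Derived estimates (assumptions, not figures from the post).
cost_12h = cost_for(12, rate_small)    # the run said to surpass GPT-2 CORE
cost_24h = cost_for(24, rate_small)    # the depth 30 run

print(f"implied rate: ${rate_small:.2f}/h vs ${rate_large:.2f}/h")
print(f"estimated 12 h run: ${cost_12h:.0f}, 24 h run: ${cost_24h:.0f}")
```

The two quoted price points imply nearly the same hourly rate, so the numbers in the post are at least internally consistent.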
My goal is to get the full "strong baseline" stack into one cohesive, minimal, readable, hackable, maximally forkable repo. nanochat will be the capstone project of LLM101n (which is still being developed). I think it also has potential to grow into a research harness, or a benchmark, similar to nanoGPT before it. It is by no means finished, tuned or optimized (actually I think there's likely quite a bit of low-hanging fruit), but I think it's at a place where the overall skeleton is ok enough that it can go up on GitHub where all the parts of it can be improved.
Link to repo and a detailed walkthrough of the nanochat speedrun is in the reply.
😂 I thought it was a terrible paper! Disingenuous, badly researched and hyperbolic. I wrote an article with a detailed takedown here https://t.co/t2O9zQlzsM
That said, I do think over-anthropomorphising LLMs is a bad idea and we can see way too much of it already, so I don't disagree with you entirely 😆
I'm seeing lots of people taking Apple's new paper "The Illusion of Thinking" at face value, but there is so much wrong with it that I felt compelled to write an article debunking its claims: https://t.co/t2O9zQlzsM
I dive into why, but it looks like they are knowingly trying to create FUD about AI
@rubenhassid I'm seeing lots of people taking Apple's new paper "The Illusion of Thinking" at face value, but there is so much wrong with it that I felt compelled to write an article debunking its claims: https://t.co/DwQzfeqaLz
AGI is on track 😉
GPT-4o just got an INSANE upgrade!
OpenAI just dropped native Image Generation in GPT-4o.
Image & Text quality is insane. 100% AI
10 wild examples (prompts included):
1. Polaroid style photographs
Important update: Figure is launching robots into the home
Our AI, Helix, is advancing faster than any of us anticipated, accelerating our timeline into the home
Therefore, we've moved up our home timeline by 2 years, starting Alpha testing this year
Teslas now drive themselves from their birthplace at the factory to their designated loading dock lanes without human intervention
One step closer to large-scale unsupervised FSD
Deep Dive On DeepSeek’s New Multimodal AI Released Today And How We Are Getting It Running On A Gaming PC!
—
DeepSeek’s Janus-Pro represents a significant advancement in multimodal large language models (LLMs), particularly in text-to-image generation. Building upon the foundation of the original Janus model, Janus-Pro introduces enhancements in training processes, data quality, and model architecture, resulting in more stable and detailed image outputs.
Technical Architecture:
Janus-Pro employs a decoupled architecture, optimizing it for tasks involving both multimodal understanding and text-to-image generation. This design allows for separate processing pathways for different modalities, enhancing the model’s flexibility and performance.
The model has been trained on a diverse dataset comprising multimodal, textual, and synthetic aesthetic data through a three-stage process, ensuring superior performance across various tasks.
Performance Benchmarks:
Janus-Pro has demonstrated exceptional capabilities:
• Text-to-Image Generation:
  • GenEval: Scored 0.80, surpassing OpenAI’s DALL-E 3 (0.67) and Stability AI’s Stable Diffusion 3 Medium (0.74).
  • DPG-Bench: Achieved an overall accuracy of 84.19, highlighting its proficiency in handling dense and nuanced prompts.
• Multimodal Understanding:
  • MMMU (Massive Multi-discipline Multimodal Understanding): Attained an accuracy of 41.0, outperforming models like TokenFlow-XL (38.7).
  • MME (Multimodal Evaluation): Showed significant gains in reasoning and contextual understanding.
These results underscore Janus-Pro’s capabilities in both generating high-quality images from textual prompts and understanding complex multimodal inputs.
Running Janus-Pro on Consumer-Grade GPUs
These are some of the techniques we deploy when adapting a new, larger AI model to run efficiently on less expensive computer hardware. This is not an exhaustive list, but it is enough to give you an overview.
1. Model Quantization: Reducing the precision of the model’s weights (e.g., from 16-bit to 8-bit or lower) can significantly decrease memory usage and computational requirements, enabling the model to run on GPUs with limited VRAM. Tools like MiniLLM facilitate running large language models on consumer-grade GPUs. We also employ distillation processes to further reduce the GPU cycles required.
2. Efficient Inference Engines: Utilizing inference engines designed for consumer hardware can enhance performance. For instance, PowerInfer is a high-speed LLM inference engine optimized for personal computers equipped with a single consumer-grade GPU. It exploits the high locality inherent in LLM inference to reduce GPU memory demands and CPU-GPU data transfers.
3. Hardware Considerations: High-end consumer GPUs, such as the NVIDIA RTX 4090, are more suitable for running large models like Janus-Pro due to their substantial VRAM and computational capabilities. However, with appropriate optimization techniques, it’s possible to run the model on GPUs with lower specifications, though performance may be affected.
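To make the quantization point in the list above concrete, here is a minimal sketch of symmetric 8-bit weight quantization in plain Python. It is an illustration of the general idea only, not the pipeline used for Janus-Pro: the function names, the toy weights, and the 1-billion-parameter count are all assumptions chosen for the example, and real toolchains typically quantize per channel or per block.

```python
# Minimal sketch of symmetric per-tensor int8 quantization (illustrative only).

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

def weight_memory_bytes(num_params, bits_per_weight):
    """Memory needed for the weights alone (ignores activations and cache)."""
    return num_params * bits_per_weight // 8

weights = [0.5, -1.0, 0.25, 0.0]          # toy weights, not real model data
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

hypothetical_params = 1_000_000_000        # assumed size, for illustration
fp16_bytes = weight_memory_bytes(hypothetical_params, 16)
int8_bytes = weight_memory_bytes(hypothetical_params, 8)

print("quantized:", q, "max round-trip error:", max_error)
print("fp16 bytes:", fp16_bytes, "int8 bytes:", int8_bytes)
```

Going from 16-bit to 8-bit halves the weight memory, at the cost of a rounding error bounded by half the scale per weight.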
These are some of the strategies we are deploying to run Janus-Pro on consumer-grade gaming computers.
By leveraging Janus-Pro, developers and researchers can explore advanced capabilities in both multimodal understanding and image generation, pushing the boundaries of what’s achievable in AI-driven applications.
We will keep you updated on the progress.
AGI isn’t “near,” @sama—it’s already here.
We coined AGI thinking of human-level intelligence. But LLMs are already general, intelligent, just not sentient, superhuman or demanding civil rights.
It’s time to redefine:
1️⃣ AGI = Broad general knowledge, common-sense machines (e.g., LLMs).
2️⃣ ASI = Artificial Super-intelligence - Superintelligent but not necessarily conscious.
3️⃣ ACI = Artificial Conscious Intelligence - Thinking feeling machines that have a sense of self, needs and wants, etc.
Let’s update the terms. #AI
Heard a leak from one of the frontier labs (not oai tbh), they reached an unexpected HUGE wall of diminishing returns trying to brute-force better results by training longer & using more and more data..
(more severe than what is published publicly)
I've spent some time now with @OpenAI o1. The model that was "too dangerous" to release. OpenAI's first specialist reasoning model.
The big question I wanted to answer was just how good is o1 at reasoning?
Despite all the hype, many people are skeptical about whether LLMs can reason at all. Is it just mimicry, like a parrot, repeating words it doesn't understand?
In this article https://t.co/aVuC8CuilF I give a bit of background, show how I set about proving it one way or the other and talk through the surprising result. Not to overcook it too much but... I was genuinely not expecting this result.
Let me know if you enjoy the read! 😃
Feedback loop: train SOTA chip design model (AlphaChip) -> use it to design better AI chips -> use them to train better models -> to design better chips... part of the reason why our TPU stack is so good. Congrats @Azaliamirh, @annadgoldie, @JeffDean & the AlphaChip team!