What I've been working on for the past year! https://t.co/CAQMYS1rR7
Inspired by CoVe, ELMo, and ULMFiT, we show that a single transformer language model can be finetuned to a wide variety of NLP tasks and performs very well with little tuning/tweaking.
New work with @AlecRad and @DavidDuvenaud:
Have you ever dreamed of talking to someone from the past? Introducing talkie, a 13B model trained only on pre-1931 text.
Vintage models should help us to understand how LMs generalize (e.g., can we teach talkie to code?). Thread:
Announcing Talkie: a new, open-weight historical LLM! We trained and finetuned a 13B model on a newly-curated dataset of only pre-1930 data. Try it below!
with @AlecRad and @status_effects 🧵
We trained diffusion models on a billion LLM activations, and we want you to use them!
New preprint: Learning a Generative Meta-Model of LLM Activations
Joint work with @feng_jiahai, @trevordarrell, @AlecRad, @JacobSteinhardt.
More in thread 🧵
New paper, w/ @AlecRad
Models acquire a lot of capabilities during pretraining.
We show that we can precisely shape what they learn simply by filtering their training data at the token level.
@skornblith @DGBassani It's the max width with 12 layers that could fit in memory on the dev box that trained GPT-1. It also worked out to a month to train, which was the edge of my patience. The prototypes went from 6 layer 512 wide (og tformer paper "base") to 12 layer 512 wide to 12 layer 768 wide.
@chipro Dynamic eval improves an AWD-LSTM baseline by 0.11 nats. Can't be sure it'd have equal sized benefits for both architectures (though https://t.co/hkVohkVMd4 suggests it works fine) but if that gain carried over, the Transformer-XL model would be 48.6 test perplexity.
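The arithmetic behind that estimate can be sketched as follows. Perplexity is the exponential of the per-token cross-entropy in nats, so a fixed gain in nats scales perplexity by a constant factor. The function name and the baseline value below are hypothetical and chosen only for illustration; the Transformer-XL baseline is not stated in the tweet, so it is not filled in here.

```python
import math


def perplexity_after_gain(baseline_ppl: float, gain_nats: float) -> float:
    """Perplexity after lowering per-token cross-entropy by gain_nats.

    perplexity = exp(loss), so exp(loss - gain) = perplexity * exp(-gain).
    """
    return baseline_ppl * math.exp(-gain_nats)


# Hypothetical baseline for illustration only (not a number from the thread).
illustrative_baseline = 100.0
improved = perplexity_after_gain(illustrative_baseline, 0.11)
print(f"{illustrative_baseline:.1f} -> {improved:.1f}")
```

A 0.11 nat gain therefore multiplies any baseline perplexity by exp(-0.11), roughly a 10% reduction, under the tweet's stated assumption that the gain carries over unchanged between architectures.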
Extremely excited to share work I've been doing at OpenAI the past few months: MuseNet, a neural net music generator. It's been a huge team effort pulling this all together!
Releasing some work today with @scottgray76, @AlecRad, and @ilyasut. Contains some simple adaptations for Transformers that extend them to long sequences.
@tallinzen @mcxfrank @emilymbender @yoavgo Don't know exact # since there is not a traditional word-level tokenization step. There are 9B tokens total and the ratio is probably around 1.1 tokens per word? You can probably just call those tokens words for the purpose of a # on a slide.
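As a rough back-of-the-envelope check on those figures, the word count implied by 9B tokens at about 1.1 tokens per word can be computed directly. The function name is hypothetical, and the ratio is the tweet's own guess rather than a measured value.

```python
def estimate_words(total_tokens: float, tokens_per_word: float) -> float:
    """Estimate a word count from a token count and a tokens-per-word ratio."""
    return total_tokens / tokens_per_word


# Figures taken from the tweet: 9B tokens, roughly 1.1 tokens per word.
approx_words = estimate_words(9e9, 1.1)
print(f"about {approx_words / 1e9:.1f}B words")
```

That works out to roughly 8.2B words, close enough to 9B that calling the tokens words on a slide loses little.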
One commonly cited argument about the difficulty of learning common-sense reasoning is that "no-one writes down common sense". A counter-argument is "well, the web is big": https://t.co/qPNmra86ES
@jacobandreas Sorry - I interpreted:
"if a paper had crossed my desk saying here are some hand-curated best-of-25 samples from our model + PPL comparisons with models trained on other datasets"
as about the paper - especially since the second half of the statement is about the paper.
@jacobandreas The paper relegates samples to the appendix. The unicorn sample is on page 20 and used to make a qualitative point. Almost everything else in the paper is random samples.
@jacobandreas Those samples use a different technique than the ones shown in the blog. The samples you are looking at are temperature=1. We use top_k=40. Unconditional samples with that are here: https://t.co/OxQBnCc6mA
It's also important to note that conditioning on "real" text helps too.
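To make the difference between plain temperature=1 sampling and top_k=40 concrete, here is a minimal sketch of top-k sampling over a list of logits. It is not the code used for the released samples; the function names and the toy logits are assumptions for illustration. Top-k keeps only the k highest-scoring tokens, renormalizes their probabilities, and samples from that truncated distribution, which removes the long tail of unlikely tokens that plain sampling can still pick.

```python
import math
import random


def top_k_probs(logits, k, temperature=1.0):
    """Return probabilities where only the k highest logits are nonzero."""
    if k <= 0 or k > len(logits):
        k = len(logits)  # fall back to plain sampling over all tokens
    # Indices of the k largest logits.
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = {i: logits[i] / temperature for i in keep}
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled.values())
    exps = {i: math.exp(v - peak) for i, v in scaled.items()}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Sample one token index from the top-k truncated distribution."""
    probs = top_k_probs(logits, k, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Toy logits for a 4-token vocabulary (illustrative values only).
toy_logits = [2.0, 1.0, 0.0, -1.0]
print(top_k_probs(toy_logits, k=2))
print(sample_top_k(toy_logits, k=2))
```

With k equal to the vocabulary size this reduces to ordinary temperature sampling, which is the temperature=1 setting mentioned above.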
First, reproducibility is not about rerunning code to get the same results. Science must be more robust, as naive copying has many flaws. Second, reproducibility should never be above public safety. We must publish responsibly, with hope and kindness in our minds.
I'd like to weigh in on the #GPT2 discussion. The decision not to release the trained model was carefully considered and important for norm-forming. Serving the public good requires us to draw lines on release somewhere: better long before catastrophe than after.