Harm de Vries

@harmdevries77

Building something new | prev co-lead @BigCodeProject @ServiceNowRsrch | PhD from @Mila_Quebec

Amsterdam

Joined September 2022

176 Following

1.4K Followers

66 Posts

Pinned Tweet

Harm de Vries @harmdevries77

about 3 years ago

Surprised by the loss of LLaMA-7B still going down after 1 trillion tokens? In a new blogpost, I explain why you shouldn't be and argue we haven't reached the limit of the recent trend of training smaller LLMs for longer: https://t.co/E26sli9xTq Analysis in 🧵👇

651

123

413

232K

Harm de Vries @harmdevries77

over 1 year ago

@Nikita_Arora17 Which AI model did you use?

126

Harm de Vries @harmdevries77

over 1 year ago

@Thom_Wolf @pollenrobotics @RemiCadene @LeRobotHF @NepYope The real AI at Hugging Face! 🤗

Harm de Vries @harmdevries77

over 1 year ago

@AmirRPeimani Open now!

246

Harm de Vries @harmdevries77

over 1 year ago

We're hiring! Over the past few months, we’ve been building up our agent tech stack. Now we're ready to scale up. If you live and breathe agentic systems and how they are going to impact work—DM me. We just opened a few engineering and product roles, see https://t.co/cIa1xMQfEG

12K

Harm de Vries @harmdevries77

over 1 year ago

@karpathy @DBahdanau I love my honorable mention. Not in the science part, of course, but it seems I was spot on with the rumours @karpathy

112

17K

Harm de Vries @harmdevries77

over 1 year ago

Interesting analogy between the current GenAI revolution and the computer industry from the 80s!

Nikita Arora @Nikita_Arora17

over 1 year ago

The microprocessor's invention in the mid 80s shifted the computer industry from a vertical to a more horizontal stack. It was a METAMORPHOSIS. The same transition is currently underway with the GenAI revolution. 🧵

harmdevries77 retweeted

BigCode @BigCodeProject

almost 2 years ago

Introducing 🌸BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks! BigCodeBench goes beyond simple evals like HumanEval and MBPP and tests LLMs on more realistic and challenging coding tasks.

211

102K

harmdevries77 retweeted

Guilherme Penedo @gui_penedo

about 2 years ago

We are (finally) releasing the 🍷 FineWeb technical report! In it, we detail and explain every processing decision we took, and we also introduce our newest dataset: 📚 FineWeb-Edu, a (web only) subset of FW filtered for high educational content. Link: https://t.co/MRsc8Q5K9q

304

harmdevries77 retweeted

ServiceNow AI Research

@ServiceNowRSRCH

about 2 years ago

It’s been a year since the release of @BigCodeProject’s 💫 StarCoder models and paper: May the source be with you! Join us as we celebrate the anniversary, and share what you’ve done using #StarCoder. Read how StarCoder has helped ServiceNow developers: https://t.co/tICamMY7OQ

harmdevries77 retweeted

Philipp Schmid

@_philschmid

about 2 years ago

Self-Instruct for CodeLLMs! 👀 @BigCodeProject released a new StarCoder2-Instruct, the first entirely self-aligned code LLM trained with a transparent and permissive pipeline. 🧑🏻‍💻 It used itself to generate thousands of instruction-response pairs, which were then used to fine-tune, achieving 72.6 on HumanEval without relying on human annotations. 🤯

Implementation
1️⃣ Collect Seed Code Snippets, e.g., functions with docstrings.
2️⃣ Apply type checking, decontamination (benchmarks), Quality Filtering & Near-Deduplication
3️⃣ Employ in-context learning to self-generate coding tasks from these snippets.
4️⃣ For each instruction, generate answers and tests using in-context learning.
5️⃣ Execute these tests in a sandbox environment and select responses that pass for training.
6️⃣ Create a Training Dataset with the validated responses
7️⃣ Fine-Tune StarCoder2-15B on the generated self-instruct dataset

Insights
🧮 15B parameter version with 8192 context
🔓 Fully open-source datasets and pipeline for distillation
📝 Fully self-aligned without human annotation
🏆 Outperforms CodeLlama-70B-Instruct (72.0) and GPT-4 (march) on HumanEval
🥇 Outperforming other open Models like Grok-1, Command-R+, and DBRX, and closely matching Snowflake Arctic 480B and Mixtral-8x22B-Instruct
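
Steps 5️⃣ and 6️⃣ above (run the self-generated tests, keep only the responses that pass) can be sketched roughly as follows. This is a minimal illustration, not the BigCode implementation: the `Candidate` type and the function names are hypothetical, and a plain subprocess with a timeout stands in for a real sandbox, which would need much stronger isolation.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Candidate:
    """Hypothetical record for one self-generated training example."""
    instruction: str  # self-generated task description
    response: str     # candidate solution code
    tests: str        # self-generated tests, written as plain asserts


def passes_tests(candidate, timeout_s=10.0):
    """Run response + tests in a fresh interpreter; True if it exits with code 0.

    Assumption: a subprocess is used here only for illustration. It is not
    a security boundary and should not be used on untrusted code as-is.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(
            candidate.response + "\n\n" + candidate.tests + "\n",
            encoding="utf-8",
        )
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Treat hanging candidates as failures.
            return False
    return result.returncode == 0


def select_validated(candidates):
    """Keep only candidates whose own tests pass (the training set of step 6)."""
    return [c for c in candidates if passes_tests(c)]


if __name__ == "__main__":
    good = Candidate("Add two numbers.",
                     "def add(a, b):\n    return a + b",
                     "assert add(1, 2) == 3")
    bad = Candidate("Add two numbers.",
                    "def add(a, b):\n    return a - b",
                    "assert add(1, 2) == 3")
    print(len(select_validated([good, bad])))
```

A failing assert makes the child interpreter exit with a non-zero code, so the filter needs no test framework; a real pipeline would likely also record why a candidate was rejected.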

155

117

29K

Harm de Vries @harmdevries77

about 2 years ago

@BlackHC Thanks for the shout out :)

190

harmdevries77 retweeted

Andreas Kirsch 🇺🇦

@BlackHC

about 2 years ago

Test-of-time awards should maybe be handed out after a longer period of time but in my opinion this blog post (and the following) were incredibly prescient, and about one year later, everybody in LLMs is doing exactly what it suggested

Harm de Vries @harmdevries77

about 2 years ago

@BG2Pod @altcap @bgurley @sundeep Great pod! Small correction @sundeep : chinchilla is not considered the point of diminishing returns but referred to as the compute-optimal point, the best model for a given FLOP budget. See my blogpost from last year: https://t.co/E26sli903S
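
To make "the best model for a given FLOP budget" concrete, here is a small sketch under two widely used approximations that are assumptions here and do not come from the tweet: training compute C ≈ 6·N·D (N parameters, D tokens), and a compute-optimal ratio of roughly 20 tokens per parameter. The function names are illustrative only.

```python
import math

# Assumption: training FLOPs are approximated as C = 6 * N * D.
FLOPS_PER_PARAM_TOKEN = 6.0
# Assumption: rule-of-thumb compute-optimal ratio of ~20 tokens per parameter.
TOKENS_PER_PARAM = 20.0


def training_flops(n_params, n_tokens):
    """Approximate training compute for a model of n_params trained on n_tokens."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def compute_optimal(flop_budget):
    """Return (n_params, n_tokens) that spend flop_budget at the assumed ratio.

    With D = 20 * N and C = 6 * N * D, it follows that N = sqrt(C / 120).
    """
    n_params = math.sqrt(flop_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    budget = training_flops(70e9, 1.4e12)  # a 70B model on 1.4T tokens
    n, d = compute_optimal(budget)
    print(f"budget={budget:.3e} FLOPs, params={n:.3e}, tokens={d:.3e}")
```

Under these assumptions a smaller model trained for longer can reach the same loss at a higher training cost, which is why the compute-optimal point is not the same thing as a point of diminishing returns.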

222

harmdevries77 retweeted

Gabriele Sarti @gsarti_

about 2 years ago

LLaMA 3 is testing the limits of @harmdevries77's Law (viz: https://t.co/88TvKsOh3o using 8B param & 15T tokens)
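
As a rough worked calculation of the figures quoted above (8B parameters, 15T tokens), compared with the commonly cited rule of thumb of about 20 tokens per parameter, which is an assumption brought in here and is not stated in the tweet:

```latex
\[
\frac{D}{N} = \frac{15 \times 10^{12}}{8 \times 10^{9}} = 1875 \ \text{tokens per parameter},
\qquad
\frac{1875}{20} \approx 94
\]
```

That is, on the order of 94 times more tokens per parameter than the rule-of-thumb ratio.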

Harm de Vries @harmdevries77

about 2 years ago

@lvwerra @ServiceNow 💯 ! A big thanks to @NicolasChapados for understanding the value of open-science for businesses like ServiceNow

harmdevries77 retweeted

Leandro von Werra

@lvwerra

over 2 years ago

Took some time to reflect on the past 1+year of the @BigCodeProject: Here are a few of my learnings from leading it during this time and some ingredients I think are important for a successful open collaboration in ML.

What is BigCode?
BigCode is an open scientific collaboration working on the responsible development and use of large language models for code. It is hosted by @ServiceNowRSRCH and @huggingface and recently also got additional support from @nvidia. For more info see: https://t.co/wOUYiev9ns

Impact
The collaboration started in October 2022 and these are some of the most impactful outcomes the project delivered since then:
1⃣ StarCoder: the model is one of the most transparently built models (e.g. as measured by the Stanford Transparency index) and has been used widely. For example ServiceNow has successfully integrated it across their platform leading to a significant increase in valuation.
2⃣ The Stack: In my opinion the most impactful artefact of BigCode. Since we released it every code LLM and most general LLMs as well have used it (even if they don't publicly say it 🙃). It enables many to pretrain, fine-tune and build on top of it.
3⃣ Collaboration itself: I think BigCode showed that you can build great models and datasets in a transparent and responsible fashion. Some aspects of the collaboration are now serving as a blueprint for some similar projects - hopefully there will be more open research!

Community
BigCode is organised as an open collaboration with now over 1k people on our Slack channel. While many people are mostly watching, there are around 40-50 active people contributing on some level. BigCode was on purpose a bit less democratic than BigScience, which helped to reduce noise and coordinate more clearly. Also not everybody can contribute the same amount of hours and some people need a bit more assistance, but I think that's fine. Some people appeared throughout the collaboration while some others phased out. Overall I think it worked quite well.

Core team
Although the community is quite big, the core work has been done by maybe 2-3 full time people both at Hugging Face and at ServiceNow with the community helping out on specific tasks. E.g. dataset inspection and model evaluation were largely done by community members. I think 3-5 people is a great size even for a project at this scale and allows being very aligned and thus moving fast.

Transparency
There is a bit of tension about being fully open and transparent. Other organisations can see your timeline and move their releases accordingly while you don’t have much visibility in return. This happened for both StarCoder releases and is quite stressful. Trying to keep releases/timelines a bit opaque might be a good idea. On the other hand building something with controversial aspects in the open as a community effort helps with scrutiny. Overall people look at it already with a more positive view than other similar corporate projects.

Collaborations
Collaborations even with large companies can work out! My learning is that two things are essential for this:
1⃣ Having a very aligned single point of contact that can move things forward internally. Doing this from the outside is very hard.
2⃣ Top-down support: things can move slower in big companies. Having support from the top helps pushing things forward for example if something needs to be pushed through legal or fast decisions need to be made.

Focus + motivation
There are many interesting things to work on and people bring a wide range of ideas to the table. It's important to go through the sometimes painful process to find focus for the collaboration and work on delivering concrete artefacts. The fuel of the collaboration is traction: the more exciting things are released the more people join and the higher the motivation of the team to keep pushing and build more. Working on concrete milestones that can be released helps a lot. We did this with The Stack and SantaCoder which were stepping stones towards StarCoder.

Takeaways
➡️ Small, aligned teams can build great things fast.
➡️ Building datasets is long term more impactful than building models, but you still need to prove your dataset is good, usually by training a good model 🙂
➡️ Collaborations with other organizations/companies can work well, if there’s a lead pushing things forward on their side with top down support. Otherwise they can turn painful quickly.
➡️ Building in the open has cons and pros; striking the right balance is important.
➡️ Frequently releasing intermediate artefacts is better than just releasing big things and removes a bit of the pressure from big releases.

Of course @BigCodeProject wasn't the first open collaboration and follows in big footsteps with projects run by for example @AiEleuther, @BigscienceW, @laion_ai, @CohereForAI and @allen_ai. Overall it has been a great pleasure working on @BigCodeProject and I hope that there will be more open collaborations in ML as I think working on one of the most impactful technologies openly is important!

21K

harmdevries77 retweeted

BigCode @BigCodeProject

over 2 years ago

Introducing: StarCoder2 and The Stack v2 ⭐️ StarCoder2 is trained with a 16k token context and repo-level information for 4T+ tokens. All built on The Stack v2 - the largest code dataset with 900B+ tokens. All code, data and models are fully open! https://t.co/fM7GinxJBd

660

202

241

223K

harmdevries77 retweeted

Terry Yue Zhuo

@terryyuezhuo

over 2 years ago

Instruction Tuning Code LLMs Using #PEFT methods? Introducing 🌠
✨Astraios Model Suite: A suite of 28 #StarCoder instruct-tuned using #OctoPack, 7 tuning methods & 4 model sizes, and up to 16B parameters.
📝Extensive Evaluation: 5 tasks & 8 datasets in both Code Comprehension 🧠 & Generation ✍️.
🔍Further Analysis: Model Robustness 🛡️ & Code Security 🔒

📜https://t.co/zZC6HszNSn
💻https://t.co/xqr9YxPhkg
1/9

13K

harmdevries77 retweeted

BigCode @BigCodeProject

over 2 years ago

Exciting times: we are working on the next generation of StarCoder trained on a new dataset! 🚀 If you would like to have your code excluded from the training run you can check if your data is in the dataset and follow the link to opt-out: https://t.co/sLKdt0nLnP

14K

Harm de Vries @harmdevries77

over 2 years ago

First promising results for pre-training with related documents in the context window, nicely addressing the data issue I explained in my last blog post. Looks de-risked enough to go into llama-3. https://t.co/vo1gFG8H5K

Weijia Shi

@WeijiaShi2

over 2 years ago

@harmdevries77 raises a key issue: the lack of long pretraining data (<5% web docs exceed 2k tokens) poses challenges for pretraining LMs with long context windows. In-Context Pretraining offers a scalable solution for creating meaningful long contexts https://t.co/PwY5wfUuRa
