@HuggingPapers I was concerned that loop-based models might reduce inference efficiency and that simply reducing parameters would offer limited gains. PLT seems to address this concern. If a model can be trained with 1× resources while gaining loop× inference benefits, it may be even better.
LoopCoder-v2 is out
A 7B model trained on 18T tokens that scores 64.4 on SWE-bench Verified with just two loops, beating models 30x larger.
Adding a third loop makes it worse.
Model and code are on Hugging Face.
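As a rough illustration of what "loops" could mean here, the sketch below reuses one block of weights several times, so effective depth grows without adding parameters. `make_block` and `looped_forward` are hypothetical names, and this is an assumption about loop-based models in general, not LoopCoder-v2's actual architecture.

```python
def make_block(weight, bias):
    """Build a toy residual block; weight and bias stand in for real layer parameters."""
    def block(hidden):
        # Residual update: h <- h + (weight * h + bias)
        return [h + weight * h + bias for h in hidden]
    return block


def looped_forward(hidden, block, num_loops):
    """Apply the SAME block num_loops times (weight sharing across loops)."""
    for _ in range(num_loops):
        hidden = block(hidden)
    return hidden


if __name__ == "__main__":
    shared_block = make_block(weight=0.5, bias=1.0)
    # Two loops and three loops use identical parameters; only compute differs.
    print(looped_forward([2.0], shared_block, num_loops=2))
    print(looped_forward([2.0], shared_block, num_loops=3))
```

Since the loop count changes compute but not parameters, it can be treated as a setting to tune on a held-out benchmark, which fits the observation that a third loop does not help.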
[1/n]
🎉We are very pleased to introduce FineFineWeb, currently the largest open-source, fully automatic classification effort for fine-grained web data. Specifically, our contributions are as follows:
🔪We decompose the entire deduplicated version of Fineweb into 67 categories with a significant amount of seed data.
🧮We conduct a correlation analysis between vertical categories, as well as between vertical categories and common benchmarks, for FineFineWeb, and also provide a distribution analysis of URLs and other content.
🧑⚖️We provide test sets for PPL evaluation based on the 67 selected vertical domains of FineFineWeb, and offer a "small cup" (Validation) and a "medium cup" (Test).
🪙We provide all the full-process materials for training the fastText and BERT classifiers.
📅We will give suggestions on data proportioning based on our dataset. (Based on RegMix, coming soon in our report! [Compute is tight, so it will arrive as soon as possible])
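For readers unfamiliar with PPL evaluation, here is a minimal sketch of how per-domain perplexity could be computed from token log-probabilities. It assumes the log-probabilities (natural log) have already been produced by some language model; the function names and the example domain labels are hypothetical and not part of the FineFineWeb release.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def per_domain_perplexity(domain_to_logprobs):
    """Map each domain name to the perplexity of its pooled tokens."""
    return {
        domain: perplexity(logprobs)
        for domain, logprobs in domain_to_logprobs.items()
    }


if __name__ == "__main__":
    # Hypothetical domains with made-up log-probabilities, for illustration only.
    scores = per_domain_perplexity({
        "domain_a": [math.log(0.25)] * 4,
        "domain_b": [math.log(0.5)] * 4,
    })
    print(scores)
```

Lower perplexity on a domain's test set suggests the model fits that domain's text better, which is what makes per-domain test sets useful for comparing data mixtures.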
[1/n]
🔥 Happy to Introduce FullStack Bench: A comprehensive evaluation dataset, focusing on full-stack programming across 16 languages and more than 11 real-world application domains like data analysis, software engineering, and machine learning.
Is your CodeLLM a FullStack Coder rather than a LeetCode nerd?
It's time to put your code LLMs to the test!!! 📝
🚀Now it is the time, Nov. 11 10:24! The perfect time for our best coder model ever! Qwen2.5-Coder-32B-Instruct!
Wait wait... it's more than a big coder! It is a family of coder models! Besides the 32B coder, we have coders of 0.5B / 1.5B / 3B / 7B / 14B! As usual, we not only share base and instruct models, we also provide quantized models in the format of GPTQ, AWQ, as well as the popular GGUF! 💖
👉🏻Blog: https://t.co/7FnV3SUHuD
👉🏻Tech Report: https://t.co/Y3JN2Ly7H6
👉🏻Hugging Face: https://t.co/GgfeNq0XML
👉🏻ModelScope: https://t.co/VJwMAvEaHN
👉🏻Kaggle: https://t.co/7GW9GZJYre
👉🏻GitHub: https://t.co/gMGC8b5Hwv
👉🏻Demo [chat]: https://t.co/JxAYwnLM9u
👉🏻 Demo [Artifacts]: https://t.co/cyJEHV30e1
The flagship model, Qwen2.5-Coder-32B-Instruct, reaches top-tier performance, highly competitive with (or even surpassing) proprietary models like GPT-4o, in a series of benchmark evaluations, including HumanEval, MBPP, LiveCodeBench, BigCodeBench, McEval, Aider, etc. It reaches 92.7 in HumanEval, 90.2 in MBPP, 31.4 in LiveCodeBench, 73.7 in Aider, 85.1 in Spider, and 68.9 in CodeArena!
🔍Why could a coding model trained on just 2.5T tokens compete with top-tier models like DeepSeekCoder (10T tokens) and QwenCoder (15T tokens)?
🌟 Curious about the answer? Check out our paper, OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models (🏠 https://t.co/Hh5otarsvx, 📑 https://t.co/mkEr6kBkjk), a new code language model with top-tier code generation performance and full openness!
In this paper, we reveal the full details of our data cleaning, processing, and synthesis pipeline — insights that top labs often keep under wraps for code pre-training! Here’s what we offer:
✨ 1.5B & 8B code models supporting both English and Chinese
📚 Code to reproduce the 2.5T tokens of training data (coming soon!)
🛠️ 4.5M+ high-quality SFT examples
This work was led by the awesome @SimingHUAN38187, @crazycth0901 and @ziliwang8011184. Please find more details in this thread! 🧵
[1/n] Discover AutoKaggle: Revolutionizing Data Science Competitions with Multi-Agent Collaboration! 🚀
Introducing AutoKaggle — a multi-agent framework designed to automate the full spectrum of data science competitions on Kaggle! From background understanding to model prediction, AutoKaggle takes on all phases, boosting efficiency and reducing manual overhead.
💡 Highlights of AutoKaggle:
🛠️ Phase-based workflow: Six key phases (Background Understanding, Preliminary EDA, Data Cleaning, In-depth EDA, Feature Engineering, Model Building).
🤖 Five specialized agents: Reader, Planner, Developer, Reviewer, Summarizer.
🔁 Iterative debugging & unit testing for robust, correct code generation.
📊 Built-in ML tools library to handle data cleaning, feature engineering, and modeling.
🤤 Flexible customization of the ML tools library lets you drive the workflow as you want.
Qwen Code Interpreter, with Qwen Code 2.5 & WebLLM
Running locally on your browser
A cool @huggingface space showcasing the power of open-source models and WebLLM
-----
WebLLM is a high-performance, in-browser language model inference engine that leverages WebGPU for hardware acceleration, enabling powerful LLM operations directly within web browsers without server-side processing.
[6/6] 🔍Deep Insights📚 FuzzCoder's fine-tuning on the Fuzz-Instruct dataset, collected from heuristic fuzzing tools, provides deep insights into the mutation process, leading to more targeted fuzzing.
GitHub: https://t.co/q5BOrQsmuj
💥Introducing FuzzCoder: Revolutionizing Byte-level Fuzzing with Large Language Models!🌟 Experience the future of software security with our groundbreaking approach.
🔗 Learn more: https://t.co/9eaXDmdbaU
#Fuzzing #Cybersecurity #AI #LLM #LargeLanguageModels #SoftwareSecurity
[5/6] 📊Performance📊 FuzzCoder demonstrates remarkable improvements across various input formats. Our extensive experiments show that FuzzCoder, integrated with AFL, outperforms traditional methods in effective mutation proportion and crash discovery rates.
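To make "byte-level mutation" concrete, here is a minimal sketch of applying a list of (position, strategy) pairs to an input. In a FuzzCoder-style setup such pairs would come from the model; here they are hard-coded. The strategy names `bitflip`, `byte_inc`, and `byte_zero` and the function `apply_mutations` are illustrative assumptions, not FuzzCoder's or AFL's actual interface.

```python
def apply_mutations(data: bytes, mutations):
    """Apply (position, strategy) mutations to a copy of the input bytes.

    Positions outside the input are skipped so a bad prediction cannot crash the fuzzer.
    """
    buf = bytearray(data)
    for pos, strategy in mutations:
        if not 0 <= pos < len(buf):
            continue  # ignore out-of-range positions
        if strategy == "bitflip":
            buf[pos] ^= 0x01  # flip the lowest bit
        elif strategy == "byte_inc":
            buf[pos] = (buf[pos] + 1) % 256  # increment with wraparound
        elif strategy == "byte_zero":
            buf[pos] = 0  # overwrite with a zero byte
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return bytes(buf)


if __name__ == "__main__":
    seed = b"AB"
    # Hypothetical predictions: flip a bit in byte 0, increment byte 1.
    print(apply_mutations(seed, [(0, "bitflip"), (1, "byte_inc")]))
```

The mutated bytes would then be handed to the target program, and mutations that reach new paths or trigger crashes count as effective.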
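What is being studied in the vision-language model (VLM) field? 🧐
VLM is a field that has been developing rapidly since late last year. There are still plenty of "gold mines" for researchers to discover, current exploration is still very preliminary, and the barrier to entry is fairly low for newcomers to large models 🥰
Here are paper recommendations to help you quickly get up to speed on the current state of the VLM field 📰:
1. Get a big-picture view of the specific research directions in the field (e.g., data mixture ratios, choice of image encoder, VL connector design, which benchmarks currently exist, VLM training strategies, etc.)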
a. Cambrian: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
The most complete and up-to-date all-around exploration, bar none
Link: https://t.co/fqS9zVB5AS
b. MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Somewhat older, but still worth a read
Link: https://t.co/Q7HSytHhwB
c. What matters when building vision-language models?
Its conclusions complement the previous two papers well
Link: https://t.co/xOTVQj8PZ6
2. VLM-specific ways to improve inference efficiency: designing better V-L attention mechanisms
a. An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Finds that vision tokens are highly redundant, so token dropping can greatly speed up inference without hurting quality
Link: https://t.co/4hvsgy0nr7
b. VoCo-LLaMA: Towards Vision Compression with Large Language Models
Reduces the number of vision tokens with an RMT-like token compression scheme, thereby speeding up inference
Link: https://t.co/8227vL5Sd0
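3. How the vision encoder's resolution affects model performance. The conclusion is simple and blunt: it matters a lot, and the higher the resolution, the better the results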
a. InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Link: https://t.co/hcUik3aVlQ
b. DeepSeek-VL: Towards Real-World Vision-Language Understanding
Link: https://t.co/OBJRnRBlK1
4. VLM architecture choice: an all-in-one decoder (early fusion), or a separate vision encoder and language decoder?
a. Unveiling Encoder-Free Vision-Language Models
https://t.co/puhPIluZEB
b. Chameleon: Mixed-Modal Early-Fusion Foundation Models
https://t.co/p5iWwWHbMA
5. For the more mainstream separated VLM architecture, how should the vision-language connector be designed?
a. TokenPacker: Efficient Visual Projector for Multimodal LLM
https://t.co/s3N1ntrajw
6. The best way to train a separated-architecture VLM
a. Long Context Transfer from Language to Vision
https://t.co/eoKGiw9ufm
7. All the papers and blog posts in the LLaVA series
Improved Baselines with Visual Instruction Tuning
https://t.co/8pLH4RNxAZ
https://t.co/amvIg2TynU
https://t.co/cfVN0Pf6at
https://t.co/CKiCyN2d0G
https://t.co/By6hrZyNyU
https://t.co/PMX6iqmwHt
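8. Some hands-on repositories for quickly leveling up your VLM coding skills (see image)
(The list is not exhaustive; please keep adding to it in the comments)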
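The token-dropping idea from item 2a can be sketched as follows: rank vision tokens by an importance score and keep only the top fraction, preserving their original order. This is a minimal sketch that assumes the score is the attention each vision token receives; the function name `drop_vision_tokens` and the `keep_ratio` parameter are hypothetical, and the paper's exact criterion may differ.

```python
def drop_vision_tokens(tokens, attention_scores, keep_ratio=0.5):
    """Keep the highest-scoring fraction of vision tokens, in original order."""
    if len(tokens) != len(attention_scores):
        raise ValueError("tokens and attention_scores must have the same length")
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not tokens:
        return []
    num_keep = max(1, int(len(tokens) * keep_ratio))
    # Indices of the num_keep tokens with the largest attention scores.
    top = sorted(
        range(len(tokens)),
        key=lambda i: attention_scores[i],
        reverse=True,
    )[:num_keep]
    # Restore positional order so later layers see tokens in sequence.
    return [tokens[i] for i in sorted(top)]


if __name__ == "__main__":
    vision_tokens = ["v0", "v1", "v2", "v3"]
    scores = [0.1, 0.4, 0.3, 0.2]  # made-up attention mass per token
    print(drop_vision_tokens(vision_tokens, scores, keep_ratio=0.5))
```

With half the vision tokens removed after an early layer, every later layer processes a shorter sequence, which is where the inference speedup comes from.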
🚀 Thrilled to introduce 🔥McEval🔥, the first massively multilingual code evaluation benchmark of 40 programming languages with 16K test samples, including code generation, completion, and explanation tasks.
McEval: Massively Multilingual Code Evaluation
https://t.co/KjVwoxoX6X
[9/10]
Based on algorithmic complexity, we classify McEval into three levels (Easy/Medium/Hard).
The performance of the CodeQwen model on code generation tasks shows that for most languages, the model can answer most easy questions but struggles with medium and hard ones.