Xinting Huang

Pengyu Cheng @cheng_pengyu

about 2 years ago

Also glad to share: We made it work for instruction-following models! NH2-Mixtral-8x7B ➕ NH2-Solar-10.7B ➕ OpenChat-3.5-7B ➡️ new SOTA for 7B models on MT-Bench. Check it out: https://t.co/VvnEYTZZVe

elvis

@omarsar0

over 2 years ago

Knowledge Fusion of LLMs

Is it possible to merge existing models into a more potent model?

We have already seen a few approaches that show the potential to do this effectively, such as weight merging and model ensembling.

This work proposes FuseLLM, with the core idea of externalizing knowledge from multiple LLMs and transferring their capabilities to a target LLM.

It leverages the generative distributions of the source LLMs to externalize both their collective knowledge and their individual strengths, and transfers them to the target LLM through continual training.

To put it simply, the idea is to benefit from the strengths of all the LLMs and combine them into one integrated model.

The paper finds that FuseLLM can improve the performance of the target model across a range of capabilities such as reasoning, common sense, and code generation.

By the way, you can also perform the fusion among fine-tuned LLMs that specialize in specific tasks.

This continues to be an interesting research area, so I am hoping to document more of the new ideas and findings I come across.
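A rough sketch of what "transferring generative distributions through continual training" could look like for a single token position. Everything here is an illustrative assumption rather than the paper's confirmed recipe: the models are assumed to share one vocabulary (so no token alignment is needed), the fused distribution is simply the source distribution with the lowest cross-entropy against the gold token, and the 0.9 weight and all function names are hypothetical.

```python
import math


def cross_entropy(dist, gold_index):
    """Cross-entropy of a predicted distribution against a one-hot gold token."""
    return -math.log(dist[gold_index])


def fuse_min_ce(source_dists, gold_index):
    """Assumed fusion rule: keep the source distribution that best predicts the gold token."""
    return min(source_dists, key=lambda d: cross_entropy(d, gold_index))


def fusion_loss(target_dist, fused_dist):
    """Cross-entropy of the target model's distribution against the fused soft labels."""
    return -sum(p * math.log(q) for p, q in zip(fused_dist, target_dist) if p > 0)


def combined_loss(target_dist, source_dists, gold_index, lam=0.9):
    """Weighted sum of the usual language-modeling loss and the fusion loss.

    lam is a hypothetical weight; all probabilities are assumed strictly positive.
    """
    fused = fuse_min_ce(source_dists, gold_index)
    clm = cross_entropy(target_dist, gold_index)
    return lam * clm + (1 - lam) * fusion_loss(target_dist, fused)


# Two source models and one target model over a toy 3-token vocabulary.
sources = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1]]
target = [0.6, 0.3, 0.1]
loss = combined_loss(target, sources, gold_index=0)
```

In continual training this per-token loss would be averaged over a corpus and minimized with respect to the target model's parameters; the sketch only shows how the source distributions enter the objective.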


timhuangxt retweeted

DailyPapers

@HuggingPapers

8 months ago

The End of Manual Decoding: Meet AutoDeco

Researchers unveil AutoDeco, a groundbreaking framework that teaches LLMs to control their own decoding strategy. It dynamically predicts temperature & top-p for each token, eliminating manual tuning & enabling natural language control. https://t.co/AHm1GyQr6K
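To make "temperature & top-p for each token" concrete, here is a minimal sketch of sampling where every decoding step gets its own pair of values. It is not AutoDeco's implementation: in the framework the values are predicted by the model itself, whereas here they are passed in by hand through the hypothetical `step_params` argument, and the logits are toy numbers.

```python
import math
import random


def softmax(logits, temperature):
    """Temperature-scaled softmax (numerically stabilized)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature, top_p, rng):
    """Sample one token id using this step's own temperature and top-p."""
    nucleus = top_p_filter(softmax(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


def decode(step_logits, step_params, seed=0):
    """step_params[i] is the (temperature, top_p) pair assumed to be predicted for step i."""
    rng = random.Random(seed)
    return [sample_token(logits, t, p, rng)
            for logits, (t, p) in zip(step_logits, step_params)]


# A tiny top_p keeps only the most likely token, so these two steps are deterministic.
tokens = decode([[1.0, 3.0, 2.0], [5.0, 0.0, 0.0]], [(1.0, 0.01), (0.5, 0.01)])
```

The point of the design is that a step needing precision (say, a digit in an arithmetic answer) can use a near-greedy setting while a step in free-form prose can use a looser one, all within a single generation.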

timhuangxt retweeted

9 months ago

Introducing Search Self-play (SSP, https://t.co/fF0GyrgMAs)! We let deep search agents act simultaneously as a task proposer and a problem solver. Through competition and cooperation, their agent capabilities co-evolve and uniformly surpass SOTA performance without supervision!


timhuangxt retweeted

9 months ago

🚀 Excited to announce our 🍁Marco‑MT🍁 achieved outstanding results at #WMT2025 General Translation! 🏆 Notably, in English→Chinese it outperformed closed‑source leaders like GPT‑4.1 and Gemini 2.5 Pro.

Among the 13 language pairs we competed in, Marco-MT-Algharb achieved (final human evaluation):
🏅6 First Places
🥈4 Second Places
🥉2 Third Places

🎯We did this with key innovations:
• Novel M2PO translation paradigm
• Two-stage SFT + CPO+MPO reinforcement learning
• Hybrid decoding with word alignment & MBR

Learn more:
- Demo: https://t.co/XxNUm8g578
- Technical Report: https://t.co/6nCk1Q29xK
- Hugging Face: https://t.co/3KMnCmcR4o

timhuangxt retweeted

about 1 year ago

🌺GPT-4o's image generation is stunning, but how well does it handle complex scenarios? 🤔

We introduce 🚀CIGEVAL🚀, a novel method to evaluate models' capabilities in Conditional Image Generation 🖼️➕🖼️🟰🖼️. Find out how top models perform when conditions get truly challenging! 🔥

#ImageGeneration #AutoEvaluation #Multimodal #GPT4O

over 1 year ago

These findings resonate with my impressions. AFAIC, structured prompting outperforms CoT & ICL by steering LLMs through workflows. Great to see this ‘rebuttal’ backed by such rigorous analysis — reminds me of the insights in LLMs Cannot Self-Correct. We need more like this!

Philipp Schmid

@_philschmid

over 1 year ago

Do structured outputs hurt LLM performance? 🤔 The "Let Me Speak Freely" paper claimed that they do, but new experiments by @dottxtai (the team behind outlines) show they don't if you do it correctly! 👀

TL;DR:
📈 The poor results in "Let Me Speak Freely" came from weak prompts and incorrect use of structured prompting
📊 Structured outputs outperform unstructured ones on the tests: GSM8K: 0.78 vs 0.77, Last Letter: 0.77 vs 0.73, Shuffle Object: 0.44 vs 0.41
🛠️ The key success criterion is to align your prompt, parser, and generator - it's not just about using JSON mode
📌 JSON generation requires careful prompt design, including specifying the desired schema
📝 Good prompts should contain enough information for a human to understand the task and the expected response format
🎯 JSON generation reached 77% accuracy vs the paper's reported <10%
🔮 Examples in prompts should match the exact format expected in the actual tasks
🧰 Structured generation works best when implemented as "running our response parser as a generator"
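One way to read "align your prompt, parser, and generator" is to derive all of them from a single schema definition, so the format shown to the model is exactly the format the parser accepts. The sketch below illustrates only the prompt and parser halves using the standard library; the schema, field names, and prompt wording are hypothetical, and it does not use or imitate the outlines API.

```python
import json

# Single source of truth for the response format (hypothetical fields).
SCHEMA = {"reasoning": str, "answer": int}


def build_prompt(question):
    """Describe the schema and show one example in exactly the format the parser expects."""
    fields = ", ".join(f'"{name}": <{kind.__name__}>' for name, kind in SCHEMA.items())
    example = json.dumps({"reasoning": "2 + 3 = 5", "answer": 5})
    return (
        f"Answer the question. Reply with one JSON object of the form {{{fields}}}.\n"
        f"Example: {example}\n"
        f"Question: {question}\n"
    )


def parse_response(text):
    """Accept only responses that match SCHEMA; raise ValueError otherwise."""
    data = json.loads(text)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or set(data) != set(SCHEMA):
        raise ValueError(f"expected exactly the keys {sorted(SCHEMA)}")
    for name, kind in SCHEMA.items():
        if not isinstance(data[name], kind):
            raise ValueError(f"field {name!r} should be of type {kind.__name__}")
    return data


prompt = build_prompt("What is 4 + 4?")
parsed = parse_response('{"reasoning": "4 + 4 = 8", "answer": 8}')
```

Because the in-prompt example is produced with the same `json.dumps` conventions the parser reads back, the example cannot drift from the expected format, which is the failure mode the thread attributes to the original experiments.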


over 1 year ago

paper: https://t.co/xqGDJ2la7i code: https://t.co/5wdGzTdtgK


over 1 year ago

RAG matters so much in real-world LLM applications, and so glad to see this new work bring Soooo much efficiency boost!

#RAG #BlockAttention #Innovation

A quick walk-through of this amazing work 👇 https://t.co/X6itY429Pv

over 1 year ago

The Block-Attention Mechanism, by independently encoding passages, allows us to reuse the KV states of previously seen passages. From now on, no matter how many passages are retrieved, the latency and computational cost are brought down to nearly the same level as non-RAG LLMs.
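The reuse idea can be sketched as a cache keyed by passage content: a passage that is encoded independently of its neighbors yields the same KV states wherever it appears, so it only has to be encoded once. This is a toy illustration under that assumption, not the paper's code. `fake_encode` is a hypothetical stand-in for a real model's per-passage KV computation, and the sketch ignores how positional information would be handled when cached blocks are placed into a new prompt.

```python
import hashlib


class PassageKVCache:
    """Caches per-passage encodings so repeated passages are never re-encoded."""

    def __init__(self, encode_fn):
        self._encode_fn = encode_fn
        self._store = {}
        self.misses = 0  # number of passages that actually had to be encoded

    def get(self, passage):
        key = hashlib.sha256(passage.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self._encode_fn(passage)
            self.misses += 1
        return self._store[key]


def fake_encode(passage):
    """Hypothetical stand-in for computing a passage's KV states."""
    return [len(token) for token in passage.split()]


def build_context(cache, passages):
    """Assemble the per-passage states for one request, reusing cached blocks."""
    return [cache.get(p) for p in passages]


cache = PassageKVCache(fake_encode)
build_context(cache, ["a bc", "def"])             # both passages encoded
build_context(cache, ["def", "a bc", "new one"])  # only "new one" is encoded
```

With a warm cache, the per-request cost is dominated by the user query and any unseen passages, which is the intuition behind the claim that latency approaches that of a non-RAG model.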


over 1 year ago

Exciting to see our old friend continuing to push the real-world boundaries of LLM applications (shoutout to MT here)!

over 1 year ago

🔥Our LLM-powered MT (Marco-MT) has achieved massive commercial use, leading the industry in both efficiency and cost-effectiveness. 🌏 Revolutionizing translation in e-commerce and beyond! 🚀

🌍 For more details: https://t.co/UIo3pj2m0l
✨ Try it now: https://t.co/3hruAnSlUF

timhuangxt retweeted

@_akhaliq

almost 2 years ago

To Code, or Not To Code? Exploring Impact of Code in Pre-training

discuss: https://t.co/igjUwMtv3p

Including code in the pre-training data mixture, even for models not specifically designed for code, has become a common practice in LLM pre-training. While there has been anecdotal consensus among practitioners that code data plays a vital role in general LLMs' performance, there is only limited work analyzing the precise impact of code on non-code tasks. In this work, we systematically investigate the impact of code data on general performance. We ask "what is the impact of code data used in pre-training on a large variety of downstream tasks beyond code generation". We conduct extensive ablations and evaluate across a broad range of natural language reasoning tasks, world knowledge tasks, code benchmarks, and LLM-as-a-judge win-rates for models with sizes ranging from 470M to 2.8B parameters. Across settings, we find consistent results that code is a critical building block for generalization far beyond coding tasks and that improvements to code quality have an outsized impact across all tasks. In particular, compared to text-only pre-training, the addition of code results in a relative increase of up to 8.2% in natural language (NL) reasoning, 4.2% in world knowledge, a 6.6% improvement in generative win-rates, and a 12x boost in code performance, respectively. Our work suggests investments in code quality and preserving code during pre-training have positive impacts.


timhuangxt retweeted

almost 2 years ago

🚀Check out VideoVista, a comprehensive video-LMMs evaluation benchmark! https://t.co/56a0djR07n

🚀 Dive into our leaderboard:
- 📊 Evaluating 33 Video-LMMs across 27 tasks;
- 🥉 The latest GPT-4o-Mini clinches 3rd place;
- 🏆 InternLM-XComposer-2.5 emerges as the top-performing open-source model.

about 2 years ago

Open-sourced Multimodal models -- fascinating
Open-sourced MoE models -- fascinating
Open-sourced Multimodal MoE models -- WOW!

Check this out 👇

about 2 years ago

🥳We introduce Uni-MoE, a unified multimodal LLM based on sparse MoE architecture. It integrates 📹 video, 🖼️ image, 📄 text, 🔊 audio, and 🗣️ speech, supporting 8+ experts in parallel training across mixed modalities.
🌈Paper: https://t.co/u0aNJnT5yv
💐Project (Code, Data, Model, Demo): https://t.co/l2h0IZtmjU


timhuangxt retweeted