Maggie

@chubbymaggie

Joined January 2011

1.5K Following

257 Followers

3.2K Posts

chubbymaggie retweeted

Rohan Paul

@rohanpaul_ai

6 months ago

This paper studies real coding chats to show why Large Language Models, chatbots trained on text and code, miss instructions.

Only 24.07% of chats follow every instruction, and even single instructions are followed only 48.24% of the time.

Most coding benchmarks miss real back and forth, so the authors pull coding conversations from large public chat logs and label what happens.

They treat each thread as connected turns, and the structure usually looks linear, star like, or tree like.

Linear threads fit step by step code polishing, star threads fit quick information queries, and tree threads fit big design work that splits into sub problems.

Bug fixing and refactoring are where the model slips most, because the answer must obey many small constraints at once across several turns.

Satisfaction drops as threads get longer, since the chat often turns into a loop of error spotting and patching instead of steady progress.

People seem happier with structured knowledge questions and algorithm design, and less happy with code generation and code cleanup where mistakes block them.

----

Paper Link – arxiv.org/abs/2512.10493

Paper Title: "Decoding Human-LLM Collaboration in Coding: An Empirical Study of Multi-Turn Conversations in the Wild"

138

chubbymaggie retweeted

elvis

@omarsar0

7 months ago

Reasoning models are expensive.

Not because the models are huge.

It's because they generate thousands of tokens just to think.

But what if smaller models could learn to reason efficiently?

This new paper compares training 12B models on reasoning traces from two frontier systems:

- DeepSeek-R1
- gpt-oss (OpenAI's open-source reasoner)

The key finding: gpt-oss traces produce 4x more efficient reasoning. DeepSeek-R1 averages ~15,500 tokens per response. gpt-oss averages ~3,500 tokens.

Yet accuracy stays nearly identical across benchmarks. Verbose reasoning doesn't mean better reasoning.

Why does this matter?

Inference cost scales linearly with tokens. If your reasoning model generates 4x fewer tokens with the same accuracy, you cut costs by 75%.

That's a massive efficiency gain.

Interesting observation: Nemotron base models already had DeepSeek-R1 traces in pretraining. Training loss on DeepSeek traces started low and stayed flat. Training loss on gpt-oss traces started high and dropped gradually.

They showed that the model was learning something new, which also means you can distill reasoning capabilities from frontier models into smaller systems. But the source matters. Different reasoning styles produce different efficiency profiles.

(bookmark it)

Paper: arxiv.org/abs/2511.19333

433

337

24K
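The cost arithmetic in the tweet above can be checked with a short sketch. It assumes, as the tweet does, that inference cost scales linearly with generated tokens; the function name `token_savings` is hypothetical and only the two token averages come from the tweet.

```python
def token_savings(baseline_tokens: float, efficient_tokens: float) -> tuple[float, float]:
    """Return (reduction ratio, fractional cost saving) under a linear cost-per-token assumption."""
    ratio = baseline_tokens / efficient_tokens
    saving = 1 - efficient_tokens / baseline_tokens
    return ratio, saving


# Average tokens per response quoted in the tweet (approximate figures).
ratio, saving = token_savings(15_500, 3_500)
print(f"{ratio:.2f}x fewer tokens, {saving:.1%} lower cost")

# The rounded "4x" case from the tweet gives exactly the quoted 75%.
print(token_savings(4, 1))
```

With the quoted averages the reduction works out slightly above the rounded "4x" figure, so the saving is a little more than 75% under the same linear-cost assumption.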

chubbymaggie retweeted

elvis

@omarsar0

over 1 year ago

Enhancing Reasoning to Adapt Large Language Models for Domain-Specific Applications

Regardless of how generalized LLMs are becoming, there is still a lot of interest in adapting LLMs for specialized domains.

Researchers from IBM and MIT-IBM Watson AI Lab introduce SOLOMON, a neuro-inspired reasoning network that enhances LLM adaptability for specialized tasks.

Key insights include:

• Enabling flexible domain adaptation – Unlike traditional fine-tuning, SOLOMON uses a multi-agent oversight system with prompt engineering and in-context learning to adapt LLMs efficiently across domains without retraining.

• Improving spatial reasoning in LLMs – Standard LLMs struggle with spatial reasoning and applying domain knowledge. SOLOMON addresses this by integrating multiple LLMs to generate diverse reasoning paths, refining them through a Thought Assessor mechanism inspired by the Free Energy Principle.

• Significant performance gains – Evaluations on 25 semiconductor layout design tasks show SOLOMON outperforms baseline LLMs and achieves results comparable to state-of-the-art reasoning models like OpenAI’s o1-preview. It notably reduces runtime errors and improves reasoning accuracy.

296

245

29K

chubbymaggie retweeted

Mayur Naik

@AI4Code

over 1 year ago

My talk on finding security vulnerabilities by combining classical symbolic reasoners with modern-day LLMs:

Recording: https://t.co/BHzt7Y6EiZ
Slides: https://t.co/TtmPZxzs7v

I gave this talk yesterday at the 2024 Static Analysis Symposium in Pasadena, California.

Finding security vulnerabilities is a grand challenge in static analysis. I talked about three kinds of approaches:
➡️ classical symbolic reasoning which has dominated for much of the history of the field;
➡️ modern-day LLMs which excel at many code-reasoning tasks and are rapidly improving; and
➡️ neuro-symbolic approaches, which combine the best of both worlds and are the focus of my group's research.

My favorite illustration, originally due to Hyung Won Chung from OpenAI, puts these approaches in perspective. He argues how with less compute, an approach with more structure wins over an approach with less structure, but as the available compute grows, the former saturates in performance but the latter keeps improving.

The question then is: would approaches with even less structure be even more scalable? Not necessarily: it depends on how much compute we have today. If we are at the dotted line, then "even less structure" doesn't make sense, but we should remember to undo the structure already present when more compute becomes available tomorrow.

So when I reflect upon my own research over the last decade which developed approaches with a lot of structure, it was the right thing to do at the time. But today we should be careful not to repeat yesterday's approaches -- rather we should determine what parts of the structure in those approaches are still needed, and remember to remove them tomorrow.

In the setting of security vulnerabilities, the attached picture shows how our recent neuro-symbolic approach IRIS divides this challenging problem into parts that an LLM like GPT-4 is good at today (e.g. inferring missing specifications and contextual analysis) and what a classical symbolic reasoning tool like CodeQL is good at (e.g. whole-repository taint tracking).

Thanks to the SAS'24 chairs Roberto Giacobazzi and Alessandra Gorla for inviting me, and to the many who attended including Patrick Cousot, Francesco Logozzo, Anders Møller, and Kwangkeun Yi (apologies to those whose names I missed!). Thanks also to the SPLASH organizers for an excellent job hosting the conference.

11K

chubbymaggie retweeted

elvis

@omarsar0

almost 3 years ago

Textbooks Are All You Need II

Presents phi-1.5, a new 1.3 billion parameter model trained on 30 billion tokens. The dataset consists of "textbook-quality" synthetically generated data.

phi-1.5 competes with or outperforms larger models on reasoning tasks, suggesting that data quality plays a more important role than previously thought.

The authors claim that phi-1.5 is the first LLM at the 1B scale to exhibit most of the relevant traits of larger LLMs.

The model is open-sourced to encourage research around in-context learning, mechanistic interpretability, and safety topics such as hallucinations and toxicity.

paper: https://t.co/U4iNWPW7g7
model: https://t.co/4Vppq0ylmc

430

226

75K

chubbymaggie retweeted

Thomas Capelle @capetorch

almost 3 years ago

If you are still confused by all the benchmarks and evaluation pipelines for LLMs, you should definitely watch @dk21's latest video from the LLM fine-tune course. https://t.co/ww8JxYsJNh He covers the commonly used benchmarks, like HumanEval, HellaSwag, and ARC, and standard metrics like BLEU, ROUGE, etc. You will finally understand what it means to be at the top of an OSS leaderboard and the caveats about it. In the next lesson @jefrankle will talk about dataset processing and curation!

228

246

39K

chubbymaggie retweeted

Sebastian Raschka

@rasbt

almost 3 years ago

Code Llama was just released 4 days ago. Since then, we already got 1) WizardCoder-34B (https://t.co/WDd94tt3fG) 2) Phind's finetuned CodeLLama-34B (https://t.co/FGtnv2w7xy) *Both reported to be surpassing GPT-4 on HumanEval. The open source community is amazing!

906

156

438

212K

chubbymaggie retweeted

WizardLM @WizardLM_AI

almost 3 years ago

🔥🔥🔥
Introduce the newest WizardCoder 34B based on Code Llama.

✅WizardCoder-34B surpasses GPT-4, ChatGPT-3.5 and Claude-2 on HumanEval with 73.2% pass@1

🖥️Demo: http://47.103.63.15:50085/
🏇Model Weights: https://t.co/3jrdUMYPFz
🏇Github: https://t.co/AY7ECXenfT

The 13B/7B versions are coming soon.

*Note:
There are two HumanEval results of GPT4 and ChatGPT-3.5:
1. The 67.0 and 48.1 are reported by the official GPT4 Report (2023/03/15) of OpenAI.
2. The 82.0 and 72.5 are tested by ourselves with the latest API (2023/08/26).

374

882

738K
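For readers unfamiliar with the pass@1 metric quoted in the tweet above, here is a minimal sketch of the unbiased pass@k estimator introduced alongside HumanEval. The tweet does not say how WizardLM sampled or scored its outputs, so the per-problem counts below are invented purely for illustration, and `benchmark_pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples generated, c of them pass the tests."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a passing one.
        return 1.0
    # 1 minus the probability that all k drawn samples fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) pairs, one per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Illustrative (made-up) results for four problems, 10 samples each.
example = [(10, 10), (10, 3), (10, 0), (10, 7)]
print(f"pass@1 = {benchmark_pass_at_k(example, 1):.3f}")
```

With a single sample per problem (n = 1, k = 1) the estimator reduces to the plain fraction of problems solved, which is how a single-number score such as 73.2% is usually read.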

chubbymaggie retweeted

Piotr Pomorski

@PtrPomorski

almost 3 years ago

Just reminding you that this golden repo on financial machine learning exists: https://t.co/2JdsK9LbF2

221

101K

chubbymaggie retweeted

Guido van Rossum

@gvanrossum

almost 3 years ago

Yeah, so I helped the Excel team with this. Excited that it's out! https://t.co/zH1o2Fiaad

277

15K

chubbymaggie retweeted

@_akhaliq

almost 3 years ago

AgentBench: Evaluating LLMs as Agents

paper page: https://t.co/r9ipiPnbfs

Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there has been an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities in a multi-turn open-ended generation setting. Our extensive test over 25 LLMs (including APIs and open-sourced models) shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and open-sourced competitors. It also serves as a component of an ongoing project with wider coverage and deeper consideration towards systematic LLM evaluation.

381

180

66K

chubbymaggie retweeted

Tim Blazytko

@mr_phrazer

almost 3 years ago

The recording of my talk "Unveiling Secrets in Binaries using Code Detection Strategies" at @reconmtl is now online. It covers strategies to explore unknown, large binaries.

Recording: https://t.co/pmpBRoUO6z
Slides: https://t.co/ibeUOsfsee
Code: https://t.co/wWPTnvRHvJ

199

26K

chubbymaggie retweeted

jason

@jxnlco

almost 3 years ago

It's almost like... embeddings were compression the whole time.

Word count is an embedding and a compression that loses position and is sensitive to common words.
tf-idf is an embedding and a compression that is insensitive to document frequency.
gzip is an embedding like tf-idf, but it contains positional information and has a tokenizer.

Absolutely love this.

119

707

659K
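The "gzip as an embedding" idea in the tweet above is often made concrete with the normalized compression distance (NCD): two texts count as close when compressing them together costs little more than compressing one alone. The sketch below is one illustrative way to compute it with Python's standard `gzip` module; the example sentences and function names are invented, and the tweet itself does not specify any particular formula.

```python
import gzip


def compressed_size(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: lower means the two texts share more structure."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical example texts, not taken from any dataset.
query = "the model compresses text and compares compressed lengths to classify it"
near = "the model compresses text and compares compressed sizes to classify it"
far = "stock prices fell sharply after quarterly earnings missed analyst forecasts"

print(f"near: {ncd(query, near):.2f}  far: {ncd(query, far):.2f}")
```

Unlike a word count or tf-idf vector, the compressor sees the text in order, so shared phrases rather than just shared words reduce the distance.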

chubbymaggie retweeted

quarkslab @quarkslab

almost 3 years ago

You want to visualize your firmware binaries and their interaction? Use our new tool Pyrrha, introduced by @_cryptocorn_ during @passthesaltcon #pts23 https://t.co/vd4cAZXPYA

100

16K

chubbymaggie retweeted

👩‍💻 Paige Bailey

@DynamicWebPaige

almost 3 years ago

✨👩‍💻 Our @DeepMind Code AI team delivered a presentation this morning about the work we've done internally and externally, and the path for reinventing what it means to do software development and creative technical work in the age of generative models.

🤖 RL and generative models combined have massive potential for creators: and there has never been a more capable group to build and implement this vision. ✨ Very excited for the years to come!

🙌🏻 If you're curious about what we've been working on, a small fraction of our research and applied work can be found in the links below:

• Large sequence models for software development activities https://t.co/gVm8vHtYQ3
• Understanding HTML with Large Language Models https://t.co/hp1OlaB35A
• Natural Language to Code Generation in Interactive Data Science Notebooks https://t.co/apiSxNZxAh
• ML-Enhanced Code Completion Improves Developer Productivity https://t.co/fHyPl3J9C7
• Code as Policies: Language Model Programs for Embodied Control https://t.co/AWo2eVfDFH
• Learning Performance-Improving Code Edits https://t.co/jFdPZ35VTJ
• Generative Agents: Interactive Simulacra of Human Behavior https://t.co/hhlwioR83m
• AlphaDev discovers faster sorting algorithms https://t.co/Ea1RXVywlG
• Competitive programming with AlphaCode https://t.co/HlxXfOOWfv
• Baldur: Whole-Proof Generation and Repair with Large Language Models https://t.co/FrAbfvFyP0

...and more.

727

158

627

302K

chubbymaggie retweeted

Dr Alina Utrata @AlinaUtrata

over 4 years ago

I downloaded all the data Amazon has on me, and honestly the creepiest thing about it is that they sent me the *actual audio files* of every time I spoke* to Amazon Alexa

*years ago when I was young and foolish about surveillance

344

33K

chubbymaggie retweeted

Thomas Wolf

@Thom_Wolf

almost 3 years ago

What was going on with the Open LLM Leaderboard? Its numbers didn't match the ones reported in the LLaMA paper! We've decided to dive in this rabbit hole with friends from the LLaMA & Falcon teams and got back with a blog post of learnings & surprises: https://t.co/bREo0oiQ01

583

131

276

282K

chubbymaggie retweeted

Nat Friedman

@natfriedman

almost 3 years ago

AI Grant's second batch is now accepting applications! https://t.co/Slw8BTlEfi

540

322

353K

chubbymaggie retweeted

Avanika Narayan

@Avanika15

about 3 years ago

Fine-tuned performance without a step of SGD? Excited to share TART, which transplants transformer-based reasoning modules on arbitrary foundation models to improve in-context learning performance! 📜 https://t.co/bYtxGvfern 💻 https://t.co/nKcrMXWH1F ✍️ https://t.co/7tjSqQLzvm

24K

chubbymaggie retweeted

Sebastian Raschka

@rasbt

almost 3 years ago

I often get requests to dispel some of the jargon behind transformers and LLMs! So here we go, my new article on "Understanding Encoder and Decoder LLMs": https://t.co/muow0Kew8d

649

134

480

115K
