Jielin Qiu

@_Jason_Q

Research Scientist @Salesforce AI Research, Ph.D. from @SCSatCMU

Carnegie Mellon University

Joined January 2021

184 Following

73 Followers

32 Posts

_Jason_Q retweeted

Weiran Yao

@iscreamnearby

7 months ago

Today I finally get to share something our team has been quietly grinding on for months – we've created an 𝗼𝗽𝗲𝗻 𝘀𝗼𝘂𝗿𝗰𝗲𝗱 𝘃𝗲𝗿𝘀𝗶𝗼𝗻 𝗼𝗳 Cursor 𝗕𝗲𝗻𝗰𝗵 @cursor_ai.

If you’ve been following Cursor’s Composer launch and their internal "Cursor Bench" for testing vibe coding models, you can think of our 𝗟𝗖𝗕𝗔 𝗯𝗲𝗻𝗰𝗵 as the open-source, model-agnostic counterpart.

Here is what we provide by @SFResearch . With 𝗟𝗖𝗕𝗔 𝗯𝗲𝗻𝗰𝗵 we:

• Ship a 𝗖𝘂𝗿𝘀𝗼𝗿-𝘀𝘁𝘆𝗹𝗲 𝗮𝗴𝗲𝗻𝘁 𝘀𝘁𝗮𝗰𝗸: ReAct loop, semantic @ codebase search, grep, file read/write, refactor tools, and a three-tier memory system inspired by production coding assistants like Cursor.
• 𝗧𝗮𝗸𝗲 𝟴,𝟬𝟬𝟬 𝗿𝗲𝗮𝗹-𝘄𝗼𝗿𝗹𝗱 𝘃𝗶𝗯𝗲 𝗰𝗼𝗱𝗶𝗻𝗴 𝘀𝗰𝗲𝗻𝗮𝗿𝗶𝗼𝘀 and turn them into interactive agent gyms across 10 languages and 10K–1M token codebases.
• Let you plug in any model (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro, etc.) and see how it actually behaves on long, messy, multi-turn coding tasks.

A few fun findings: Cursor-style agents with context management are surprisingly robust at 1M-token contexts, but there’s a hard trade-off between deep exploration and efficiency: no one frontier model sits in the “perfect” top-right corner yet. Anthropic Claude 4.5 and Google Gemini 2.5 Pro are at the Pareto frontier.

Everything is open source (agent, code, scenarios, traces, metrics) on @huggingface:

📄 Tech Report: https://t.co/i6UTFGou4T
🤖 GitHub: https://t.co/OEwv4x5tC5
🤗 Dataset: https://t.co/PuxHwxoHVU

If you’re building coding agents, benchmarking your model against GPT/Claude/Gemini, or want to train your coding agents with RL in real coding environments, we’d love for you to try LCBA bench, and tell us your findings!

_Jason_Q retweeted

Salesforce AI Research

@SFResearch

7 months ago

🚨 Introducing LoCoBench-Agent: a comprehensive benchmark for evaluating LLM agents in long-context software engineering

📄 Paper: https://t.co/avyK5Aya2c
🔗 GitHub: https://t.co/PyI7pXSn7V

✨ Key Features:
🤖 8,000 interactive agent scenarios with multi-turn conversations (up to 50 turns)
🔍 Context lengths: 10K-1M tokens across 10 programming languages
⚡ 9 bias-free evaluation metrics (5 comprehension + 4 efficiency)
🛠️ 8 specialized development tools: file operations, semantic search, grep, code analysis
🎯 8 task categories: architectural understanding, cross-file refactoring, multi-session development, bug investigation, feature implementation, code comprehension, integration testing, and security analysis

🔬 Key Findings:
- Fundamental comprehension-efficiency trade-off
- Tool usage patterns matter more than raw capabilities
- Strategic exploration > exhaustive exploration

LoCoBench-Agent assesses agent behavior across extended development sessions, measuring context retention, adaptive strategy refinement, and tool usage efficiency.

Authors: Jielin Qiu @Jason_Q, Zuxin Liu @LiuZuxin, Zhiwei Liu @JYJimLiu, Rithesh Murthy @rithesh__rn, Jianguo Zhang @JianguoZhang3, Haolin Chen @HaolinChen11, Shiyu Wang @shiyu04490786, Ming Zhu @ming_zhu0527, Liangwei Yang @Liangwei_Yang, Juntao Tan @chrisjtan, Roshan Ram @shoonyaka1, Akshara Prabhakar @aksh_555, Tulika Awalgaonkar @tulika614, Zixiang Chen @_zxchen_, Zhepeng Cen @ZhepengCen, Cheng Qian @qiancheng1231, Shelby Heinecke @shelbyh_ai, Weiran Yao @iscreamnearby, Silvio Savarese @silviocinguetta, Caiming Xiong @CaimingXiong, Huan Wang @huan__wang

#LLM #AIAgents #SoftwareEngineering #MachineLearning #Benchmark #FutureOfAI #EnterpriseAI

_Jason_Q retweeted

Salesforce AI Research

@SFResearch

9 months ago

🚨 Introducing LoCoBench: a comprehensive benchmark for evaluating long-context LLMs in complex software development

📄 Paper: https://t.co/UClJiHZPHj
🔗 GitHub: https://t.co/pWrYNrnx6j

✨ Key Features:
📊 8,000 evaluation scenarios across 10 programming languages
🔍 Context lengths: 10K-1M tokens (100× variation!)
⚡ 17 evaluation metrics across 4 dimensions (6 newly proposed)
🎯 8 essential task categories: architectural understanding, cross-file refactoring, multi-session development, bug investigation, feature implementation, code comprehension, integration testing, and security analysis

Current SOTA models show dramatic performance drops as context increases - highlighting critical gaps in long-context understanding for real-world software engineering.

Authors: Jielin Qiu @_Jason_Q, Zuxin Liu @LiuZuxin, Zhiwei Liu @JYJimLiu, Rithesh Murthy @rithesh__rn, Jianguo Zhang @JianguoZhang3, Haolin Chen @HaolinChen11, Shiyu Wang @shiyu04490786, Ming Zhu @ming_zhu0527, Liangwei Yang @Liangwei_Yang, Juntao Tan @chrisjtan, Zhepeng Cen @ZhepengCen, Cheng Qian @qiancheng1231, Shelby Heinecke @shelbyh_ai, Weiran Yao @iscreamnearby, Silvio Savarese @silviocinguetta, Caiming Xiong @CaimingXiong, Huan Wang @huan__wang

#LLM #SoftwareEngineering #MachineLearning #Benchmark #FutureOfAI #EnterpriseAI

_Jason_Q retweeted

Ce Zhang

@ce_zhang

over 2 years ago

Excited to see the first paper getting accepted at @DMLRJournal. In the last few months, we are fascinated by the quality of reviews and the engaging interactions between authors and reviewers! Thanks everyone! Please continue to send your best work about Data x ML😀

Jielin Qiu @_Jason_Q

over 2 years ago

@JiachengZhu_ML @DMLRJournal @yizhu59 @sxjscience @flwenz @mli65 Thanks, Jiacheng!

Jielin Qiu @_Jason_Q

over 2 years ago

🎊Extremely honored to share that our paper on multimodal model robustness has been accepted as the 1st paper for the Journal of Data-centric Machine Learning Research @DMLRJournal With @yizhu59 @sxjscience @flwenz @mli65 #Multimodal #Robustness #DistributionShift

Journal of Data-centric Machine Learning Research @DMLRJournal

over 2 years ago

'Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift' by Jielin Qiu, Yi Zhu, Xingjian Shi, Florian Wenzel, Zhiqiang Tang, Ding Zhao, Bo Li, Mu Li Action Editor: Hongyang Zhang https://t.co/mXh5OCAt9T #Multimodal #Robustness #DistributionShift

Jielin Qiu @_Jason_Q

over 2 years ago

@nmervegurel @DMLRJournal @yizhu59 @sxjscience @flwenz @mli65 Thank you very much, Merve!!

_Jason_Q retweeted

Danqing Wang @dqwang122

over 2 years ago

📚🌟 Evaluate any story to your heart's content with our new personalized story evaluation model, PerSE! No more worries about diverse preferences - get your own story evaluation report now! 📝🎯 https://t.co/uRIGBlnGAI
1/5 https://t.co/8rDbAQSCib

_Jason_Q retweeted

Wenda Xu

@WendaXu2

about 3 years ago

What is missing in the text generation evaluation for BERTScore, BLEURT, COMET, SEScore & SEScore2? Explanation! Can we build a metric that not only produces a well-correlated quality score but also tells you the rationales, error type, and error location? Check out InstructScore! https://t.co/dtZZiLfSoK

_Jason_Q retweeted

Danqing Wang @dqwang122

over 2 years ago

🚀 Excited to share our latest work in EMNLP main conference: "Learning from Mistakes via Interactive Study Assistant for Large Language Models". We introduce a study assistant (SALAM) to conduct thoughtful analysis on LLMs' mistakes and provide guidelines to avoid past mistakes https://t.co/c85H4jgs23

_Jason_Q retweeted

Kexun Zhang

@kexun_zhang

over 2 years ago

😭Tired of in-context demos & docs for LLM tool use?
💰Too GPU-poor to tune LLMs for unseen tools?
🤬Frustrated with frequent syntax errors in tool calls?
Check out our new preprint 𝐓𝐨𝐨𝐥𝐃𝐞𝐜 that addresses all these issues from the decoding side!
https://t.co/vssxVg833j
1/5 https://t.co/AHVZTjpE6i

_Jason_Q retweeted

Seungwhan Shane Moon

@shane_moon

over 2 years ago

Excited to share our recent work, AnyMAL -- a unified Multimodal LLM built on LLaMA-2 that can reason over various inputs, e.g. images, audio, motion sensors.

Check out our paper for more information on the model training, evaluation, safety and more!
➡️ https://t.co/HmyVynWXPH https://t.co/zj7xbY8qFp

_Jason_Q retweeted

Yi Zhu @yizhu59

over 3 years ago

Check out our new evaluation benchmarks and metrics for robustness of image-text multimodal models! @AmazonScience #multimodal #stablediffusion

_Jason_Q retweeted

Santiago

@svpino

about 4 years ago

A topic that comes up in every interview: Bias, variance, and their relationship with machine learning algorithms. Here is a simple summary that you will easily remember. ↓

_Jason_Q retweeted

Xin Eric Wang

@xwang_lk

about 4 years ago

Our #ACL2022 paper "Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions" is out (https://t.co/k9IrNlrlqJ)!!! It serves as a thorough reference for the VLN research community (for both starters and experts). https://t.co/m9xkegUs1g

_Jason_Q retweeted

Jia-Bin Huang

@jbhuang0604

about 4 years ago

How to present a line plot?

Line plots are effective for describing the relationship between two variables of interest.

Unfortunately, most junior students would simply copy&paste the figure from the paper in their talk and cause much confusion. 😕

Let's break it down ... 🧵 https://t.co/zbJgX7tbFg

_Jason_Q retweeted

Jiahui Yu

@jhyuxm

over 4 years ago

Our team at Google Brain is looking for outstanding PhD students (expected graduation after 2023) who are interested in student researcher internships this year 2022. https://t.co/U2vuC8WoFI

_Jason_Q retweeted

Ai2 @allen_ai

over 4 years ago

The Embodied AI Lecture Series at AI2 is back! Subscribe to the mailing list for info about how to join these free lectures live, or stay tuned and we'll post the recorded sessions after the fact. Subscribe: https://t.co/RGbNIgKhxA More info: https://t.co/lZyPl1jhLu

_Jason_Q retweeted

Andrew White 🐦‍⬛

@andrewwhite01

over 4 years ago

I've been writing research articles for over 10 years now and one of the hardest parts is writing consistently and efficiently without procrastinating. I'm going to share some of my tips here 🧵 1/10

_Jason_Q retweeted

Ai2 @allen_ai

over 4 years ago

AI2's computer vision team PRIOR announced an exciting new release of their #EmbodiedAI platform AI2-THOR – in partnership with @unity, you can now train headlessly on multiple GPUs. 📈 Learn more: https://t.co/iBbjPJBfMQ
