Ken Gu @kenqgu - Twitter Profile

Pinned Tweet

about 1 year ago

🚨Are LLMs truly ready for autonomous data science? Real-world data is messy—missing values, outliers, inconsistencies—and if not handled properly, can lead to wrong conclusions. 🌟We introduce RADAR, a benchmark evaluating whether LLMs can handle imperfect tabular data. 🧵

4

152

36

150

20K

kenqgu retweeted

Alex Shaw

@alexgshaw

7 months ago

Today, we’re announcing the next chapter of Terminal-Bench with two releases:
1. Harbor, a new package for running sandboxed agent rollouts at scale
2. Terminal-Bench 2.0, a harder version of Terminal-Bench with increased verification

27

395

72

130

144K

Ken Gu @kenqgu

8 months ago

Even as agents gain greater autonomy, it’s critical to consider human collaboration in agent design and evaluation. This work is an important step in this direction. Check it out!

Shannon Shen

@shannonzshen

8 months ago

Today's AI agents are optimized to complete tasks in one shot. But real-world tasks are iterative, with evolving goals that need collaboration with users. We introduce collaborative effort scaling to evaluate how well agents work with people—not just complete tasks 🧵

7

287

53

172

106K

0

20

5

9

3K

Ken Gu @kenqgu

8 months ago

@harsh3vedi Thank you Harsh! I really appreciated your feedback on this project 👏

0

1

0

66

Ken Gu @kenqgu

8 months ago

📃 Paper: https://t.co/ITwP4YdtDG
💻 Eval Code: https://t.co/Wx3bLQxwyi
🤗 Data and Tasks: https://t.co/OE0HHJZfcp
Huge thanks to my co-authors @advaitmb, @Mike_A_Merrill, @cervisiarius, @xliucs, @danmcduff, and @timalthoff 🎉

0

8

0

177

Ken Gu @kenqgu

8 months ago

True intelligence = reasoning about new information, not memorized facts. How can we scalably create benchmarks that are completely novel yet have known answers? Meet SynthWorlds, an eval & data-gen framework to disentangle reasoning and knowledge⬇️🧵 📄https://t.co/ITwP4YdtDG

4

107

13

80

10K

Ken Gu @kenqgu

8 months ago

Lastly, our framework is fully automatic 🤖! SynthWorlds scales to new worlds and tasks beyond our initial dataset📈 Generate novel worlds, facts, and tasks to benchmark reasoning. No human labeling required. Perfect for synthetic data and testing agentic/long-context models🏄

1

2

0

156

kenqgu retweeted

Tim Althoff @timalthoff

8 months ago

(please reshare) I'm recruiting multiple PhD students and Postdocs @uwcse @uwnlp (https://t.co/I5wQsFnCLL). Focus areas incl. psychosocial AI simulation and safety, Human-AI collaboration. PhD: https://t.co/ku40wCrpYh Postdocs: https://t.co/K9HUIPJ5h6

7

400

111

224

36K

Ken Gu @kenqgu

about 1 year ago

📃 Paper: https://t.co/6AQafUHoUV
💻 Code: https://t.co/yYX5h3YiJw
🤗 Data: https://t.co/MWDsVmlRmL
This was a team effort with @k8_lin_, @Yahskapar, @kazemi_sm, @kmr_ayush, @yang_yuzhe, @hamidpalangi, @Orson_Xu, @danmcduff, @timalthoff, @xliucs, and other amazing co-authors.

0

2

0

228

Ken Gu @kenqgu

about 1 year ago

This framework lets us systematically generate tasks across table sizes and imperfect table variants. In total, RADAR includes 2,980 table–query pairs, 53 tasks, 6 artifact variants (including 1 clean), 3 column number variants, and 4 table sizes: 2K, 4K, 8K, and 16K tokens.

1

6

1

0

544

Ken Gu @kenqgu

about 1 year ago

💡Our results highlight key implications for deploying LLM agents: when to allocate test-time compute, how to efficiently represent data tables for detecting imperfections, and how to coordinate code execution with table inspection. Additional results and details in the paper!

1

5

0

457
