Wout Schellaert @WoutSchellaert - Twitter Profile

Wout Schellaert @WoutSchellaert

8 months ago

@Jsevillamol They've also posted a commentary on the paper in question: https://t.co/P4EdPgOoPU

0

1

0

17

Wout Schellaert @WoutSchellaert

10 months ago

@EsbenKC @apartresearch @withmartian Did anything come out of Track 2 that is published? Or routers (as opposed to judges) in general?

1

0

23

Wout Schellaert @WoutSchellaert

about 1 year ago

This is one of the best (if not the best) approaches to AI evaluation I've seen. You can't blabla your way to the predictive power they report in section 3.4!

Microsoft Research

@MSFTResearch

about 1 year ago

ADeLe, a new evaluation method, explains what AI systems are good at—and where they’re likely to fail. By breaking tasks into ability-based requirements, it has the potential to provide a clearer way to evaluate and predict AI model performance: https://t.co/zPt8DxSLdT

7

154

38

50

17K

0

7

1

0

277

Wout Schellaert @WoutSchellaert

over 1 year ago

These cheeky LLMs make their answers very convincing, and unsurprisingly, this is tricking users into overrelying on them...

Lexin Zhou

@lexin_zhou

over 1 year ago

1/ New paper @Nature! Discrepancy between human expectations of task difficulty and LLM errors harms reliability. In 2022, Ilya Sutskever @ilyasut predicted: "perhaps over time that discrepancy will diminish" (https://t.co/HADDUztzhu, min 61-64). We show this is *not* the case!

19

1K

292

954

298K

1

6

2

1

537

WoutSchellaert retweeted

Lexin Zhou

@lexin_zhou

over 1 year ago

1/ New paper @Nature! Discrepancy between human expectations of task difficulty and LLM errors harms reliability. In 2022, Ilya Sutskever @ilyasut predicted: "perhaps over time that discrepancy will diminish" (https://t.co/HADDUztzhu, min 61-64). We show this is *not* the case!

19

1K

292

954

298K

Wout Schellaert @WoutSchellaert

over 2 years ago

@S_OhEigeartaigh @ryancbriggs I quite like Danish Energibajer, and it's widely available.

0

65

Wout Schellaert @WoutSchellaert

about 3 years ago

with wonderful collaborators @NandoMartinezP, @karinavold, @JohnJBurden, Pablo A.M. Casares, Aiden Loe, @roireichart, @S_OhEigeartaigh, @annalkorhonen and José Hernández-Orallo.

0

4

0

182

Wout Schellaert @WoutSchellaert

about 3 years ago

New and shiny AI systems have superseded the ones we reference (it took a while to publish), but our perspectives and suggestions for evaluating them have only become more relevant. Go have a read! 👽👽

J. AI Research-JAIR @JAIR_Editor

about 3 years ago

New Article: "Your Prompt is My Command: On Assessing the Human-Centred Generality of Multimodal Models" by Schellaert, Martínez-Plumed, Vold, Burden, Casares, Loe, Reichart, Ó hÉigeartaigh, Korhonen and Hernández-Orallo https://t.co/LE7pWxdsON

0

10

6

4

5K

2

5

3

0

992

Wout Schellaert @WoutSchellaert

about 3 years ago

Happy to be a part of this, among so many I look up to in this field.

Ryan Burnell @DrRyanBurnell

about 3 years ago

Is it time to rethink how we perform system evaluations in AI? In our new @ScienceMagazine paper, we show that over-reliance on aggregate metrics and a lack of transparency in reporting threatens public understanding and hinders progress in the field. 1/8 https://t.co/kZMNCEALbG

5

179

40

61

116K

0

4

0

207

Wout Schellaert @WoutSchellaert

over 3 years ago

@JohnJBurden While a stochastic parrot could be considered a world model, the logic of how to use it, e.g. in self-dialogue or planning, is conceptually external to the parrot, whereas we humans seem to encapsulate the whole package.

0

35

WoutSchellaert retweeted

Ryan Burnell @DrRyanBurnell

over 3 years ago

Interested in AI robustness and predictability? Come join us in sunny Valencia for an exciting workshop on March 8th! Information here: https://t.co/5tkt5jqsMU

0

4

3

1

785

Wout Schellaert @WoutSchellaert

over 3 years ago

This is a goldmine for labs without the compute to run these huge LMs themselves. Great effort!

Percy Liang

@percyliang

over 3 years ago

From https://t.co/7IIkp7TBOc, you can explore all the results. Drill down beyond aggregate statistics into individual predictions and exact prompts. It’s fully reproducible. Download it all, perform your own analyses, and let us know what you find!

1

18

2

1

0

Wout Schellaert @WoutSchellaert

over 3 years ago

We're starting an old school mailing list for folks interested in how to evaluate AI (and all questions that come with it). Open for all to join and post! Come come! https://t.co/WFv3eQZsDD

0

Wout Schellaert @WoutSchellaert

about 4 years ago

@Choisissez @IJCAIconf @adinamwilliams @AmandaMSeed Our primary selection criterion! Wigs allowed 🤡

0

1

0

Wout Schellaert @WoutSchellaert

about 4 years ago

Still 10 full days to submit your papers for the 📐Evaluation Beyond Metrics workshop @IJCAIconf. Not that you need another excuse to come, with @adinamwilliams and @AmandaMSeed giving a talk! 🔗 https://t.co/RRRSJvt1hR

1

8

6

0

Wout Schellaert @WoutSchellaert

about 4 years ago

@yanaiela @VictorButoi Probabilistic outputs are at the instance level, while your measures are typically aggregate. Instances and corresponding confidence differ.

0

Wout Schellaert @WoutSchellaert

about 4 years ago

@LucyCheke @DanajaRutar @JohnJBurden @DrRyanBurnell @TomerUllman and @NandoMartinezP is on Twitter anyway 🙃

1

2

0

Wout Schellaert @WoutSchellaert

about 4 years ago

📐Our Evaluation Beyond Metrics workshop at IJCAI got accepted... so prepare your cool papers! 💻https://t.co/RRRSJvt1hR With @LucyCheke, @DanajaRutar, @JohnJBurden, @DrRyanBurnell, @TomerUllman and twitterless Josh Tenenbaum, José Hernández-Orallo and Fernando Martínez-Plumed

2

14

7

0

Wout Schellaert @WoutSchellaert

over 4 years ago

@rajiinio "If the so-called 'general' benchmarks were legitimate tests of progress towards general artificial cognitive abilities, we would expect the tasks they embody to be chosen systematically." Great insight. Thanks!

0

1

0

Wout Schellaert @WoutSchellaert

about 5 years ago

@ddfreyne If I understand correctly, this is what the Semantic Web tries to solve. For the web at least.

0
