Barb @itscientist - Twitter Profile

2 days ago

jane street has some interesting blogs
too bad it would take a week to understand each

https://t.co/Z00MpzUDiU https://t.co/qMXCn5wO8J https://t.co/OdSGZzzYSY https://t.co/kTgOenAhVT

37

3K

257

5K

841K

itscientist retweeted

Yangzhen Wu

@yangzhen04

4 days ago

Static benchmarks are dying — they tend to get saturated quickly. Evaluation and training data should co-evolve with frontier models. We released BenchEvolver — a framework that automatically evolves saturated problems into harder, verified tasks for evaluating frontier models, which can also serve as useful self-improvement signals for RL. New work from UC Berkeley @berkeley_ai @BerkeleyRDI @BerkeleySky Project Page: https://t.co/PL1KpGyd87 Paper: https://t.co/gBQOXrZbAV

5

94

19

63

38K

itscientist retweeted

Owain Evans

@OwainEvans_UK

over 4 years ago

New paper on truthful AI!

- We introduce a definition of “lying” for AI
- We explore how to train truthful ML models
- We propose institutions to support *standards* for truthful AI
- We weigh costs/benefits (economy + AI Safety)

(w/ coauthors at Oxford & OpenAI) https://t.co/9QII1WXFqw

3

85

22

0

itscientist retweeted

David Bau @davidbau

2 days ago

We really don't understand AI lying fully. What happens when an AI lies? How can you best detect it? Can white-box detectors beat black-box ones in an adversarial setting? It is an ongoing interpretability challenge. We'd like you to try your hand at it. Join our contest!

1

4

1

1K

itscientist retweeted

David Bau @davidbau

2 days ago

But AI lie detection is hard and remains a central research challenge. Recent research suggests that simple probes can pick up on neural "tells" that reveal when a model is lying, even when the output looks clean. https://t.co/zTQIaNcpag https://t.co/wVCwKlGVIr

1

5

1

0

897

itscientist retweeted

Aaron Scher @aaronscher

3 days ago

Interestingly, training compute for open-weight AI models doesn't appear to have grown very much in the last 2 years. Llama 3.1-405B (and derivative models) still holds the record 2 years later. Data from Epoch. https://t.co/DWBq23xNAt

1

3

1

0

46

itscientist retweeted

Garry Tan

@garrytan

6 days ago

https://t.co/0hcNuYzbMd

81

962

79

2K

520K

itscientist retweeted

BLVCKL!GHT

@BLVCKLIGHTai

8 days ago

https://t.co/WmpxZ0cmx4

31

165

21

87

11K

itscientist retweeted

ModelScope

@ModelScope2022

11 days ago

Introducing Q-Judger and Qwen-Image-Bench, an automated T2I evaluation suite from Qwen team. Apache 2.0. 🤖 https://t.co/hv3R6iU9UX

Q-Judger is built on Qwen3.6-27B with thinking mode. Input prompt + image, get structured JSON scores across 5 dimensions: Quality, Aesthetics, Alignment, Real-world Fidelity, Creative Generation. Spearman ρ = 0.92 vs human expert rankings. Trained on 130K+ bilingual pairs supervised by 80 professional annotators from art academies.

4

78

13

59

15K

itscientist retweeted

Adam Chalmers @adam_chal

10 days ago

Announcing Zoo's completely redesigned sketch mode! You can sketch complex 2D geometry SO much quicker now. So many improvements:

- Constraint solver
- Trim tool (remove parts of intersecting geometry)
- Select regions (overlapping 2D spaces) for extrude

It's a big upgrade.

1

55

6

40

5K

itscientist retweeted

Luba Elliott @ CVPR @elluba

11 days ago

The #CVPR2026 Art Gallery is now live 🥳 114 artworks using or about computer vision, presented online and as videos and installations at the @CVPRConf in Denver between 5-7 June next week 😍 Check it out https://t.co/XTGoCWaDV8 #creativeAI #AIart

1

59

22

5

7K

itscientist retweeted

Serena Ge (Datacurve)

@serenaa_ge

12 days ago

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks.

On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.

511

6K

743

3K

2M

itscientist retweeted

Director Michael Kratsios

@mkratsios47

11 days ago

You can’t make scientific breakthroughs without the tools to pursue them. Big news from @NSF: $250M to restart and supercharge the SBIR/STTR program, including a new $40M pilot for next-gen scientific instrumentation. America’s small businesses will build the platforms that define discovery. https://t.co/a0z16TawAy

8

188

32

53

29K

itscientist retweeted

ангел𓂀𓋹🪽𖤓 @cleomythra

11 days ago

‘Hands weaving magnetic-core memory, IBM, Poughkeepsie, New York,’ 1956. Photograph by Ansel Adams.

12

4K

731

708

108K

itscientist retweeted

Zelda

@zeldapoem

15 days ago

Written by @ulkar_aghayeva: https://t.co/Vbo3B1YpPn

4

204

34

227

11K

itscientist retweeted

PyMC Labs @pymc_labs

20 days ago

The radar plot is the most popular chart in football analytics. It might also be the least effective. Chris Fonnesbeck breaks down why and builds the replacement: https://t.co/DeOScY3tfh #SportsAnalytics #DataViz #Bayesian

1

41

6

46

11K

itscientist retweeted

Vlad Feinberg

@FeinbergVlad

20 days ago

How to land a job at a frontier lab https://t.co/oHIqLgBMbC

50

3K

160

7K

1M

itscientist retweeted

Charles 🎉 Frye

@charles_irl

26 days ago

Inference isn't everything, but it does require a new stack -- not Kubernetes, not SLURM. At @modal, we dove deep to build that stack. In this blog post we explain how, from compute management & cloud-native caching to CRIU & GPU checkpointing. https://t.co/DQ4wvuXjre