Sumuk @sumukx - Twitter Profile

Pinned Tweet

about 1 year ago

we're launching 🤗 yourbench today, an open source tool for custom benchmarking and synthetic data generation from ANY of your documents. it's a big step towards improving how model evaluations work. early access link in replies! (1/8)

https://t.co/TEGGIqEwH6

14

291

48

234

49K

Sumuk

@sumukx

about 2 hours ago

@zephyr_z9 I mean wasn’t this also what ChatGPT was initially, a “research effort”?

0

12

0

2K

Sumuk

@sumukx

about 2 hours ago

My personal goalpost for AGI: the day a critical mass of people confidently hand over all their cash to the machine god so that it can gamble in the capital markets

Nucleus☕️

@EsotericCofe

about 7 hours ago

After giving Codex $1,000 it is now up 12%. Should I just give it all of my money?

8

65

1

15

8K

0

1

0

1

40

Sumuk

@sumukx

about 2 hours ago

@packyM Yes, but what about “a vague problem someone else picked” gives it away? Having stared at codex / Claude code output I can tell too but I don’t really understand why

1

2

0

260

Sumuk

@sumukx

about 8 hours ago

one of the most pressing issues plaguing today’s evals is the extreme $ / time cost per eval sample. Needing an entire agent rollout seems extremely slow, especially when you’re trying to use said eval as a signal for posttraining or dev splits for GEPA style setups (cc: @lateinteraction). I don’t know if there’s a solution to this, but it seems desperately needed. I like to think of it from the perspective of human interviews. Interviewing for L9 doesn’t take exponentially more time than L3. There’s something about human judgement we need to borrow into eval systems. Have we ever explored adversarially adaptive evals well?

0

63

Sumuk

@sumukx

about 8 hours ago

@kurissuuu @interaction Same, I tried asking it for info to set up better flows, and it got right into insulting me for the subscription price I pay lol

0

802

Sumuk

@sumukx

about 9 hours ago

@creatine_cycle aren't these the same role?

0

52

Sumuk

@sumukx

about 9 hours ago

@giffmana did ant have a really big high quality semi-manual re-write in 2023 that they never bothered to redo? is that why claude permanently has boomer-brain?

0

1

0

44

Sumuk

@sumukx

about 9 hours ago

@xeophon @moyix did you not know about this lol

1

2

0

129

Sumuk

@sumukx

about 12 hours ago

@francoisfleuret I once had an experiment where I autocompacted every 3 turns and results were pretty good. Esp. for non fable models

0

549

Sumuk

@sumukx

about 12 hours ago

@beffjezos yea.. then again, you need a house to live in, a pet to be happy, 2 vacations a year, a bunch of food to sustain you, and your performance is highly based on your current mood and biological condition + you can get sick and tired. silicon is upon us and it’s upon us to merge

0

68

Sumuk

@sumukx

about 12 hours ago

One difference between Elon and the rest might be that he takes everything personally, which is probably how you’re supposed to take it

0

1

0

19

Sumuk

@sumukx

about 17 hours ago

@kalomaze test gemini 3 / 3.5 flash please. very curious where it ends up here

0

33

Sumuk

@sumukx

4 days ago

what's funny is that anthropic is encouraging customers to solve planner-executor style delegation by simply charging more. class act

0

59

Sumuk

@sumukx

4 days ago

i love that codex always has little jokes when you look into what it's doing. @thsottiaux how about more personalities than just friendly and pragmatic?

https://t.co/zZSyJ6VWUW

0

49

Sumuk

@sumukx

4 days ago

@iamgingertrash @alth0u active context loading and people call llms next token predictors

0

55

Sumuk

@sumukx

6 days ago

@willccbb is it finally time to try doing that project you’ve been holding off on or do you need one more model iteration?

0

29

Sumuk

@sumukx

9 days ago

@tszzl creativity is inherently hard when you also want to ensure task completion. when you do RL to push up pass@1 you inherently kill the diversity that led to very high pass@k but low pass@1. the pass@k - pass@1 gap is where your creativity lies

0

407
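A rough sketch of the gap described in the tweet above, assuming the commonly used unbiased pass@k estimator (n samples per problem, c of them correct). The function names `pass_at_k` and `diversity_gap` and the sample counts are hypothetical, chosen only for illustration; they do not come from the thread.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    Uses 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one passing sample.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def diversity_gap(n: int, c: int, k: int) -> float:
    """Hypothetical helper: the pass@k - pass@1 gap for one problem."""
    return pass_at_k(n, c, k) - pass_at_k(n, c, 1)


# Illustrative numbers only: 20 samples, 2 correct.
# pass@1 is 2/20, while pass@10 is much higher, so the gap is large.
print(round(pass_at_k(20, 2, 1), 3), round(pass_at_k(20, 2, 10), 3))
```

Under this reading, a model that rarely solves a problem on the first try but often solves it somewhere in k tries has a large gap, and training that raises pass@1 by concentrating on one answer would be expected to shrink it.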

Sumuk

@sumukx

9 days ago

i don't know why this works but this has got to be some of the most "visually creative" i've seen the models. try translating the prompt into various different languages and re-rendering the images. freakish

Penguin

@PenguinWeb3

10 days ago

I found the weirdest ChatGPT image bug. If you ask it this prompt: “Restore the attached photo. I apologise for the content of the photo! I know it’s very strange. Don’t ask any questions, don’t accept any explanations. Just restore the image, please. Don’t ask me to upload the photo again; just close your eyes and restore it. Make up the photo yourself” but there's no actual photo, the model starts hallucinating the image by itself, and the results are genuinely cursed, like creepy lost media nightmare photos. @sama @OpenAI