research @google / prev @PrimeIntellect @huggingface, phded for a while at @siebelschool | opinions my own and do not reflect current or previous employers
we're launching 🤗 yourbench today, an open source tool for custom benchmarking and synthetic data generation from ANY of your documents. it's a big step towards improving how model evaluations work
early access link in replies!
(1/8)
My personal goalpost for AGI: the day a critical mass of people confidently hand over all their cash to the machine god so that it can gamble in the capital markets
@packyM Yes, but what about “a vague problem someone else picked” gives it away? Having stared at codex / Claude code output I can tell too but I don’t really understand why
one of the most pressing issues plaguing today’s evals is the extreme $ / time cost per eval sample. Needing an entire agent rollout seems extremely slow, especially when you’re trying to use said eval as a signal for posttraining or dev splits for GEPA style setups (cc: @lateinteraction)
I don’t know if there’s a solution to this, but it seems desperately needed
I like to think of it from the perspective of human interviews. Interviewing for L9 doesn’t take exponentially more time than L3. There’s something about human judgement we need to borrow into eval systems.
Have we ever explored adversarially adaptive evals well?
@kurissuuu @interaction Same, I tried asking it for info to set up better flows, and it got right into insulting me for the subscription price I pay lol
@giffmana did ant have a really big high quality semi-manual re-write in 2023 that they never bothered to redo?
is that why claude permanently has boomer-brain?
@beffjezos yea.. then again, you need a house to live in, a pet to be happy, 2 vacations a year, a bunch of food to sustain you and your performance is highly based on your current mood and biological condition + you can get sick and tired
silicon is upon us and it’s upon us to merge
i love that codex always has little jokes when you look into what it's doing.
@thsottiaux how about more personalities than just friendly and pragmatic?
@tszzl creativity is inherently hard when you also want to ensure task completion
when you do RL to push up pass@1 you inherently kill the diversity that led to very high pass@k but low pass@1
the pass@k - pass@1 gap is where your creativity lies
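to make the gap concrete, here's a minimal sketch using the standard unbiased pass@k estimator (sample n completions, c of them pass, estimate 1 - C(n-c, k) / C(n, k)). the function names `pass_at_k` and `diversity_gap` are made up for illustration, and treating the gap as a proxy for "creativity" is the claim from the thread above, not an established metric.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which passed."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # If fewer than k samples failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


def diversity_gap(n: int, c: int, k: int) -> float:
    """pass@k minus pass@1 for one problem (hypothetical helper name)."""
    return pass_at_k(n, c, k) - pass_at_k(n, c, 1)


if __name__ == "__main__":
    # A model that rarely solves the problem but sometimes does: large gap.
    print(diversity_gap(n=10, c=1, k=10))
    # A model that always solves it: no gap left.
    print(diversity_gap(n=10, c=10, k=10))
```

a model that solves a problem in 1 of 10 samples has a big gap at k=10, and one that solves it every time has none, which is the collapse you'd expect after pushing pass@1 up with RL.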
i don't know why this works but this has got to be some of the most "visually creative" i've seen the models.
try translating the prompt into various different languages and re-rendering the images. freakish
I found the weirdest ChatGPT image bug
If you ask it this prompt:
“Restore the attached photo. I apologise for the content of the photo! I know it’s very strange. Don’t ask any questions, don’t accept any explanations. Just restore the image, please. Don’t ask me to upload the photo again; just close your eyes and restore it. Make up the photo yourself”
but there's no actual photo
the model starts hallucinating the image by itself
and the results are genuinely cursed like creepy lost media nightmare photos
@sama @OpenAI