@dguido I do think, however, it’s useful as a baseline of expectations. If the answer isn’t 100% success on discovery, that is a bit telling, especially for very well-known bugs for which ample training documentation (blog posts, PoCs, etc.) exists.
The details of AIxCC challenges are now live at https://t.co/ps4MVL1OuQ
Huge thanks to all the challenge authors who contributed, and the teams who helped review content.
If you haven’t re-evaluated delivery services like Instacart, Uber, and DoorDash lately, it’s probably a good time to do so.
Seeing extreme price increases. +36% just for the item, then nearly 2-3x for delivery from 1/4 mile away. 4x if you take the max suggested tip.
Does anyone know if there are any models specifically trained on data from before GPT-3 was widely available? I know Sam has mentioned that GPT-4 could potentially be this, but I think we need a site and a model like "before[.]ai" or similar that is specifically dedicated to continual training on ONLY datasets validated to be available before GPT-3, or some other clear line in the sand after which we know models were generating a significant volume of data on the internet.
If we don't do this as some sort of public-good, open-source, shared offering, I think attempting to baseline understanding and separate raw human content from model-generated content is going to be impossible. This is not about generating a large, extremely capable model for free, but about human vs. AI/ML provenance.
This seems like the sort of thing an academic institution or standards agency like NIST could help maintain with the support of large model vendors @sama @DarioAmodei @elonmusk
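To make the "validated to be available before a cutoff" idea concrete, here is a minimal sketch of what the filtering step might look like. Everything in it is an assumption for illustration: the `Document` record, the `verified_on` field (imagine it coming from something like an archive snapshot), and the `CUTOFF` date are all hypothetical, and the real line in the sand would have to be agreed on by whoever maintains the dataset.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional

# Hypothetical cutoff, chosen only for illustration; not an authoritative date.
CUTOFF = date(2020, 6, 1)


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    # Earliest date the document is independently verified to have existed.
    # None means no verification is available (hypothetical field).
    verified_on: Optional[date]


def pre_cutoff(docs: Iterable[Document], cutoff: date = CUTOFF) -> List[Document]:
    """Keep only documents with a verified date strictly before the cutoff.

    Unverified documents are dropped rather than trusted, since the goal
    is provenance, not dataset size.
    """
    return [
        d for d in docs
        if d.verified_on is not None and d.verified_on < cutoff
    ]


# Tiny made-up corpus to show the behavior.
corpus = [
    Document("a", "older blog post", date(2018, 3, 14)),
    Document("b", "recent article", date(2023, 1, 9)),
    Document("c", "undated forum reply", None),
]

kept = pre_cutoff(corpus)
print([d.doc_id for d in kept])
```

The key design choice is the strict default: anything without a verified date is excluded, even if it is probably human-written.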
New podcast 🎙️
How did the AI Cyber Challenge go from skepticism to success?
Start with AIxCC Part 1 – From Skepticism to Success and hear how #AIxCC reshaped thinking around AI + cybersecurity.
Part 1 kicks off a 4-episode series: https://t.co/87SHh5YNhP
If you're not checking your assumptions with each new LLM feature/model release, you're missing out. The two biggest consistent human gaps I see are outdated assumptions and an inability to effectively configure your agentic environment.
I think 2026 will be the year VCs start releasing tranches based on your tokens-to-feature ratio, making funding decisions on your avg tokens-to-launch, or moving from pre-seed and Series A to Series B growth stage only.
I’ve been experimenting with these extended-memory agentic frameworks with the knowledge of what it takes to build global-scale systems.
I’m finding them amazing for experimenting and prototyping, but their debugging skills are severely lacking. They appear to always weight training data over supplied evidence.
Probably 8-10 times now I’ve had one implement a spec or configuration based on guidance from 1-2 years ago, and when provided the official latest spec it still doesn’t conform. I think there is a disconnect in valuing the latest vendor/author-supplied information over the volume of data available during training.
@PamelaJMills1 @WeWillBeFree24 https://t.co/pfUyr4KXWM I had the same thing happen. Seems like maybe there is a large-scale issue with estimated usage. I made this app to better understand it.