what if we spread nyc tech week events across 52 weeks instead of cramming them into the best-weather week of the year at a rate of 1 every 6 min
(ans: we’d be in SF)
recent ai-flavoured nightmares:
- the concept of taking an english igcse exam & being presented with an insightful / emotive LLM conversation transcript as the case study for analysis. genre, content, context, aim, theme, syntax, diction, rhythm, imagery, form, and tone.
- the concept of my models becoming spoilt and constantly demanding hand-corrected samples for better fine-tuning.
What is the most compelling example of a task in a non-verifiable domain where models really struggle? That might hint at a lack of generalization from verifiable to non-verifiable domains.
the spicy claude code mosaic is an anthropic experiment to test the extent to which human users can continue to derive meaning from word fragments as they become increasingly garbled.
a ploy to turn us into more efficient token parsers
everyone is assuming this is some kind of quirk chungus marketing campaign, but if you've worked with 5.4 and beyond, you'll know they tend to call everything goblins, gremlins, etc. it's super noticeable, and if you work with them all day you start to get annoyed
ah i’ll try gemini! ended up remembering/reconstructing the piece while waiting for chat and claude to produce the notes correctly 🫠
it’s an interesting problem space, since music is a fundamentally different symbolic language from natural language (polyphonic, multidimensional, continuous) - so there are all sorts of limitations when trying to manipulate it w LLMs
curious if anyone's had luck with LLMs writing sheet music from sample audio (my attempted transcriptions with claude resulted in phantom notes & dissonance)
A few weeks ago, I joined @cosmos_inst. AI is the transformative technology of our era. By acting as a force multiplier on intelligence, I believe it has the potential to do tremendous good for humanity. At the same time, I worry that we may pursue the path of least resistance and increasingly outsource our judgment and our ability to think. We risk ending up in the warm bath of learned helplessness.
Conversations about these risks often feel confined to communities that would rather we abandon the development of advanced AI systems. Many of my fellow optimists treat these concerns as minor details we don't need to worry about.
But over the past few months, I feel like I've found a real community. Through meeting @mbrendan1, the Cosmos team, and the wider network at the first Cosmos symposium, I've encountered a rare combination of technical depth and moral seriousness.
At Cosmos, I’ll be continuing to fight the battle of ideas. To give you a flavour of what I’ll be working on, for my first piece, I’ve written about why even highly capable AI systems will make for poor economic planners. https://t.co/payqWG60fe
Scaling laws formalize performance as a function of three variables: model size, dataset size, and compute. Data quality is excluded from the framework.
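For illustration, one widely cited parametric form is the Chinchilla fit (Hoffmann et al., 2022). It writes pretraining loss in terms of parameter count N and training tokens D, and links compute C to both through the rough approximation C ≈ 6ND. This is a sketch of that one formulation, not the only scaling law. E, A, B, α and β are constants fitted per setup. No term in it captures data quality:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad C \approx 6ND
```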
The companies that figure out how to make domain-expert reasoning tractable as training data will determine the next inflection in model capability.
If you're building in this space, I’d love to chat!
The binding constraint for model performance, then, is the absence of process-level data: how domain experts reason, allocate attention, process new info, weigh evidence, & dynamically update beliefs to reach conclusions over the length of a task.