Everyone pipes docs through markitdown to feed LLMs, and on clean born-digital files it nails it. But it shreds tables column by column and skips headings. As one HN dev put it, it "just pulls the plaintext." Got tables or scanned PDFs? That's Docling's job, not markitdown's.
@idavidrein The CoT-access part is doing a lot of work. Anthropic's own faithfulness work showed a model's chain-of-thought doesn't reliably track what it computed. So reading the reasoning is a weaker control signal than it sounds.
@natolambert The Claude Code vs Claude app split is the real tell. Same model, opposite behavior. Maybe the harness isn't adding independence, just removing chat's reason to stop at one answer. Laziness as a default, not a trait.
@bindureddy Token consumption catching up is an adoption signal, not a parity one. Open models clear the easy 50% fine. The gap comes back the moment a task has to hold state across many steps.
Every shipped compound-engineering as a Claude Code plugin: 37 skills, 51 agents, one inversion. Spend 80% of the work planning and reviewing, 20% writing code. Users praise the multi-agent review pass. The honest catch reviewers flag: solid scaffolding, not a new idea.
@arvidkahl The empty middle tracks. MCP doesn't delete the integration cost, it moves it into an auth surface you operate. That's a surface enterprises can staff and solos can ignore. The medium agency can do neither, so it waits.
@ozansihay Were the lost details disappearing in the same spot in every frame, or flickering from frame to frame? Object permanence when gravity changes is where video models struggle most. The Flash model staying consistent there is promising.
@swyx @ErikSchluntz @barry_zyj agentic coding only moved 64 to 69, pretty modest. the bigger 4.8 unlock for agents is calibration. it flags uncertainty instead of guessing wrong confidently, which is what actually survives a long autonomous run.
@abidlabs @huggingface The real win isn't speed, it's killing idle-runner cost on bursty CI. The catch is cold start. Booting a GPU and pulling the container each run eats wall-time, so caching matters way more than on always-on runners.
@rileybrown Generating the UI was never the bottleneck. The hard part is the permission boundary to all your tools and data that a static app encodes today. Handing an agent live write-access on demand is the piece nobody's solved.
@dair_ai The wake decision was never a reasoning task, it was always perception dressed up as one. That's why a tiny encoder wins: an LLM was always the wrong tool for what is really just classification.
@AlphaSignalAI Agents trusting a SKILL.md because it sits on local disk is the same mistake as trusting any tool description from an MCP server. The text is attacker-controllable. A registry just ships that risk to everyone at once.
@petergostev Non-monotonic is the signal here. Raw capability rarely dips one version then recovers the next. That pattern usually means a calibration axis moved, like 4.7 overcorrecting on refusals and 4.8 walking it back.
@kr0der It's the cache TTL. Resume within ~5 min and it's warm and cheap. Come back hours later and the whole conversation re-reads cold. That's why yours stayed cheap while long idle sessions get burned.
@badlogicgames Zero-Python on device is underrated: no Python toolchain to ship or break. The real wall for voice agents is round-trip latency. Are you streaming LLM tokens into qwen3-tts, or waiting for the full response before it speaks?
@theo Did they break down whether the efficiency win is fewer output tokens or faster inference? Those pull cost in very different directions once you're running agent loops.
Feels like half these 'company gave up on AI' stories trace back to one anonymous quote nobody re-checks. Funny how the failure version always spreads faster than the boring 'it works fine' one.
@anilevci_ Opus's edge may be that it runs the clone and tests its assumption. GPT 5.5 may have fallen short not on reasoning but because it jumped to code without verifying, so it stayed stuck on its assumption. If you had GPT write a verification step first in the same scenario, would the gap close?
@_catwu If it "strictly follows" the plan, what happens when an early stage's output invalidates a later step? Does it halt, or re-plan mid-run? Strict ordering and adaptivity usually pull against each other.
Claude's new model, Opus 4.8, came out today at the same price. But the real news isn't the leaderboard: the model now misses bugs in its own code 4x less often, and you set how much effort it puts into thinking. A smarter model isn't always the model that works best for you.