AI/ML engineer @ Voortman Steel
Building LLM agents, retrieval engines and CV/3D perception pipelines, including the training and deployment infra around them.
@ClaudeDevs This is huge for longer agent flows. Being able to keep the expensive context cached while changing the system instructions mid-run is basically made for deep research / coding agents.
@benhylak This is exactly why I love reading models’ reasoning traces while they’re working. Sad to see that most providers are trying to abstract/simplify that away now.
@Nathanone Haven't looked at PC development like that yet. But the main insight for me is to keep what you build model-agnostic & focus on everything around the model as well.
Building a retrieval/classification system made me distrust “use the best model”.
Best for what?
Sometimes embeddings fail, lexical search wins, reranking helps, or the benchmark lies.
Find the failing layer before swapping models like Pokémon.
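A rough sketch of what "find the failing layer" can look like in practice: score each retrieval stage separately on the same labeled queries before touching the model. Everything here is an assumption for illustration (the stage names, the `runs`/`qrels` layout, and recall@k as the metric); it is not the actual system.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant doc ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def stage_report(
    runs: Dict[str, Dict[str, List[str]]],  # stage name -> query id -> ranked doc ids
    qrels: Dict[str, Set[str]],             # query id -> relevant doc ids
    k: int = 5,
) -> Dict[str, float]:
    """Mean recall@k per stage, so the weak layer shows up by name."""
    report: Dict[str, float] = {}
    for stage, per_query in runs.items():
        scores = [
            recall_at_k(per_query.get(query_id, []), relevant, k)
            for query_id, relevant in qrels.items()
        ]
        report[stage] = sum(scores) / len(scores) if scores else 0.0
    return report


# Hypothetical toy data: two queries, two stages.
qrels = {"q1": {"a"}, "q2": {"b", "c"}}
runs = {
    "lexical": {"q1": ["a", "x"], "q2": ["b", "c"]},
    "embedding": {"q1": ["x", "y"], "q2": ["b", "z"]},
}
print(stage_report(runs, qrels, k=2))
```

If one stage scores clearly worse on the same queries, that is the layer to debug first; a bigger model only helps if the model is actually the bottleneck.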
@alexshander03 Agent-as-judge for long-horizon evals makes sense in theory.
Any public benchmarks showing it beats LangSmith offline evals or MLflow trajectory scoring on real traces & datasets?
Latency is currently 1–3s total depending on length. Might switch to local transcription models like Whisper later to try to bring it even lower. Very lightweight (~15 MB runtime), whole package only 1.38 MB zipped.
Watching @yacineMTB tweet so much using transcription made me realize how much typing is slowing my thinking.
So I built my own simple native Windows transcription tool that pastes right where your cursor is focused.
https://t.co/aY5khmSaKm
One example I’m working on is a single-stock analysis flow: resolve the listing, run a background deep-research job with an agentic loop, stream tool calls/progress to the user, then return a cited report and continue in chat against that company/report context.
LangChain/LangGraph handles enough boring plumbing that I can focus on the research workflow itself.
The deeper trap is that sycophancy makes the path of least resistance feel productive. The model adds nuance and a clarifying question so it feels like a real exchange, then ships code & writes some basic tests. You don't notice you stopped thinking, because nothing externally flags it.
@JohnGal43951639 Not a benchmark per se, but I really like the sheer size of the Uco3D benchmark and variety of data per entry for 3D reconstruction & segmentation. https://t.co/afOBmqQ28C
@JohnGal43951639 Thank you, haven't seen MTEB before. I recognise most of the top models, but am excited to try/benchmark a few I haven't used before on the domain dataset.
@levelsio @dcbuilder Running that exact routing stack on my self-hosted Linux box. Tailscale lets me ssh & redeploy my server even from my phone using Termux tasks lol. CF is there to expose only the ports you want people to access + all the nice anti-bot/attacker tools it gives for free.