if i ever needed a resume, this isn't far from it
nice to see the math / classic ml on here, as some foundational parts
my biggest weakness is model training and deep understanding of transformer architectures
time to explore weaker areas of my understanding
As an AI Engineer. Please learn:
Harness engineering, not just prompt engineering
Context engineering, not just long prompts
Prompt caching vs. semantic caching tradeoffs
KV cache management, eviction, reuse, and memory pressure at scale
Prefill vs. decode latency and why they optimize differently
Continuous batching, paged attention, and throughput optimization
Speculative decoding vs. quantization vs. distillation tradeoffs
INT8, INT4, FP8, AWQ, GPTQ, and when quantization hurts quality
Structured output failures, schema validation, repair loops, and fallback chains
Function calling reliability, tool contracts, argument validation, and idempotency
Agent guardrails, loop budgets, tool budgets, and termination conditions
Model routing, graceful fallback logic, and degraded-mode UX
RAG architecture: chunking, embeddings, hybrid search, reranking, and freshness
Retrieval evals: recall, precision, grounding, attribution, and citation quality
Evals: golden sets, regression tests, adversarial tests, LLM-as-judge, and human evals
LLM observability as a first-class discipline: traces, spans, tokens, latency, errors, and drift
Cost attribution per feature, workflow, tenant, and user journey not just per model
Safety engineering: prompt injection defense, data leakage prevention, and permission boundaries
Multi-tenant isolation, cache safety, and cross-user context contamination prevention
Fine-tuning vs. in-context learning vs. RAG vs. distillation and when each is the wrong tool
Latency, quality, cost, and reliability tradeoffs across the full inference stack
Production failure modes: hallucinated tool calls, malformed JSON, stale retrieval, runaway agents, and silent eval regressions
Shipping LLM systems as reliable infrastructure, not demos wrapped around prompts
https://t.co/OhK9MK04ld
Every Ai event I go to I love to see this girl! One of the homies for years.
Thanks for the coming out to @arizeai Observe Conf
@temporalio is lucky to have ya!
@MelGoesTech@belizsoyak
this might sound bearish for the current AI cycle we are in, but yes we did unlock language in the ai space, but don't forget to zoom out
before language models we did a pretty damn good job with predictive modeling for classification / regression tasks
but we still have a few pretty hard TODOs to go tho
physics is not solved
world models are not solved
causality is not solved
embodiment is not solved
coordination is not solved
governance is not solved
language was a major unlock
but language is not all of reality
/rant
New research from Microsoft Research
I see a lot of AI engineers handwriting agent skill docs and hope they generalize.
Probably not optimal. This works show why.
It treats the skill doc as a trainable external state of a frozen agent instead.
It introduces SkillOpt, where an optimizer model makes validation-gated edits to the skill file. It adds, deletes, or replaces instructions, with a textual learning rate that controls how aggressively each round rewrites the doc. The agent itself never changes.
SkillOpt is best or tied on all 52 (model, benchmark, harness) cells.
On GPT-5.5 it adds 23.5 points in direct chat, 24.8 with Codex, and 19.1 with Claude Code over no skill. It beats human-written skills, TextGrad, GEPA, and EvoSkill, carries zero extra inference-time cost, and the learned skills transfer across models and harnesses.
Paper: https://t.co/mNgTmmT32U
Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
Had excellent time at Arize AI @arizeai San Francisco 🌉 office. Learned about Harness Arch and enjoyed the amazing SF views from rooftop. Greet to see @jason_lopatecki@aparnadhinak and entire Arize AI team!
Missed you @dat_attacked here!
Honestly I feel they undersold the bench by a lot.
It is basically "how good are models at Spec-driven development"
Or as the cool kids nowadays do, "/goal"
It is the GoalBench.
Since /goal is all the rage now, the findings here are super important.
4 key takeaways:
- more reward hacking in bigger tasks
- better models hack less
- more tests don't fix it
- running longer don't fix it; some time it makes it worse
Yeah so if your model sucks, doing fancier TDD won't save your ass. Having the model run longer won't either.
You only have two levers. Make the task smaller or use a better model.