Muratcan Koylan

@muratcan

Member of Technical Staff (Agents), Research @sullyai Building AI medical employees Previously worked on AI persona development & HCI @ 99Ravens

Toronto, Canada 🇨🇦

Joined December 2022

3.7K Following

21.6K Followers

12.5K Posts

Pinned Tweet

Muratcan Koylan

@muratcan

6 months ago

I’m excited to share a new repo: Agent Skills for Context Engineering Instead of just offering a library of black-box tools, it acts as a "Meta-Agent" knowledge base. It provides a standard set of skills, written in markdown and code, that you can feed to an agent so it understands how to manage its own cognitive resources. https://t.co/vWwrYPAd8k Most agent failures are not model failures; they are context failures. This is still an experimental project. The goal is to establish a platform-agnostic standard for context engineering that can be used in Cursor, Claude Code, Copilot or Codex. skills/ context-fundamentals: What context is, why it matters context-degradation: How context fails (lost-in-middle, poisoning) context-optimization: Compaction, masking, caching multi-agent-patterns: Orchestrator, swarm, hierarchical memory-systems: Vector RAG, knowledge graphs, Zep tool-design: Building tools agents can use evaluation: Testing and measuring agent systems I believe this is a good start, showing developers how to approach context engineering rather than relying on ready-made tools. You will also find the aggregated research documents I used to build these skills in the repo. The skills are synthesized from technical blogs on context and prompt engineering that I bookmarked, AI Labs' documentations, and Anthropic Skills examples. Try the 7 Skills, created using Antrhopic's Skills template format. Experiment with the provided scripts and references, and feel free to contribute to the repo.

muratcan's tweet photo. I’m excited to share a new repo: Agent Skills for Context Engineering

Instead of just offering a library of black-box tools, it acts as a "Meta-Agent" knowledge base. It provides a standard set of skills, written in markdown and code, that you can feed to an agent so it understands how to manage its own cognitive resources.

https://t.co/vWwrYPAd8k

Most agent failures are not model failures; they are context failures. This is still an experimental project. The goal is to establish a platform-agnostic standard for context engineering that can be used in Cursor, Claude Code, Copilot or Codex.

skills/
context-fundamentals: What context is, why it matters
context-degradation: How context fails (lost-in-middle, poisoning)
context-optimization: Compaction, masking, caching
multi-agent-patterns: Orchestrator, swarm, hierarchical
memory-systems: Vector RAG, knowledge graphs, Zep
tool-design: Building tools agents can use
evaluation: Testing and measuring agent systems

I believe this is a good start, showing developers how to approach context engineering rather than relying on ready-made tools.

You will also find the aggregated research documents I used to build these skills in the repo. The skills are synthesized from technical blogs on context and prompt engineering that I bookmarked, AI Labs' documentations, and Anthropic Skills examples.

Try the 7 Skills, created using Antrhopic's Skills template format. Experiment with the provided scripts and references, and feel free to contribute to the repo.

Muratcan Koylan

@muratcan

6 months ago

It’s actually a good question; the difference is subtle but structural. I usually frame it like this: AGENTS[.]md acts as the declarative context. You write this for every repo (and nested directories) to define the project structure, persona, and coding rules. Skills are the functional protocols. They provide the agent with modular capabilities like advanced tool-use and multi-step chaining that are dynamically discovered only when needed. If AGENTS[.]md defines the identity and environment (the body), Skills provide the specialized toolset (the capabilities) used to execute tasks autonomously.

135

131

193K

156

301K

Muratcan Koylan

@muratcan

31 minutes ago

My prompting has changed dramatically over the past five to six months. I used to write detailed, PRD-like prompts. Now, it’s just long, messy voice dictations. They’re really messy (the task details and some background, what i'm feeling, why we are doing this etc like a diary), but current models and harnesses are doing a great job when you provide as much context as possible.

gabriel

@gabriel1

about 1 hour ago

changing your mind in the middle of voice prompting gives the model so much more context like 70% of my prompts i say "actually ignore everything before this" but it gives so much information when i imagined one thing but then decided on something else aim for MAX tokens

381

271

muratcan retweeted

kuz

@kylekuzma

about 11 hours ago

You either use LLMs to learn at an extremely fast rate or use them to skip learning totally.

349

52K

Muratcan Koylan

@muratcan

about 22 hours ago

@bneyshabur @mirendil @a16z @kleinerperkins Congrats

517

Who to follow

A bunch of nerds making progress toward open source AI https://t.co/vrD0aDJeto

Anil Chandra Naidu Matcha

@matchaman11

CTO @VadooAI ⚡️ Youtube : https://t.co/MRA3d9h6pv 🌐 AI apps : https://t.co/Qpxd3qhdOk

Muratcan Koylan

@muratcan

about 24 hours ago

@ExaAILabs Is it only available for enterprises? What is the API pricing if pay-per-usage options are offered?

540

Muratcan Koylan

@muratcan

1 day ago

Pretty great to hear the thinking behind DeepMind's Virtual Agent Economies paper. The LLM is just a component. The harness is what's eating the world, and everything is turning into harnesses and sandbox economies. Why autonomous agents haven't taken off the way we all expected? We have MCPs, CLIs, browser-using agents, and models that are cheap and can do almost anything. Yet humans still do all the real work. Except coding. The reason is verification. Software loops are cheap to close: you write a test, run it, and errors roll back. That's why coding is where agents actually work today. But someting like "plan my wedding," "negotiate this contract," "manage my inbox autonomously" verification cost is roughly equal to doing it yourself, and errors are expensive and irreversible. When verification ≈ doing it yourself, autonomy delivers zero leverage. It doesn't save you work. That's why "models keep getting better" hasn't unlocked it. Frankly, my hypothesis on slow agent adoption was slow human behavior change. But that's downstream; we adopted ChatGPT in a weekend. Behavior change is slow here because the value proposition, net of verification cost, is unproven for autonomous tasks. So we're overestimating that intelligence is the bottleneck. Autonomy only pays when verification is cheap, errors are reversible, and someone can be held accountable. Almost no valuable real-world task has all three yet. One last thing I find interesting is this: cognitive monoculture → correlated failure → systemic risk. A handful of base models means thousands of "independent" agents make correlated decisions, so failures stop being independent and start stacking. That's why I think not just evals, but persona embodiment is becoming a crucial part of adoption, something like a risk-decorrelation tool for multi-agent fleets. One model won't be doing everything; a society of certified specialists with a generalist connective layer. Paper: https://t.co/s1wOOr8uap

Google DeepMind @GoogleDeepMind

1 day ago

What happens when millions of AI agents start negotiating, transacting, and delegating to one another? @weballergy joined our podcast with @fryrsquared to explore the rise of agentic economies – and how we can diversify agent decision-making to avoid AI groupthink. Timecodes: 00:00 Intro 1:07 Defining AI agents 4:44 Agentic exploration in science and research 15:46 Delegation between agents 22:46 Agentic security and traps 29:31 Building an agentic economy 33:22 Cognitive monoculture 36:29 Distributed intelligence

717

443

91K

Muratcan Koylan

@muratcan

1 day ago

@RhysSullivan Wow I just realized this. So true.

194

Muratcan Koylan

@muratcan

1 day ago

@karpathy i miss the old kanye

Muratcan Koylan

@muratcan

1 day ago

@leerob Congrats Lee

570

1 day ago

128

1 day ago

@xdotli Amazing

828

Muratcan Koylan

@muratcan

1 day ago

@JPoehnelt If you take an initiative, you get fired; that’s why you should always choose startups over these enterprises.

613

muratcan retweeted

Muratcan Koylan

@muratcan

8 days ago

Memento-Skills is a self-evolving skill harness. It's a great paper to understand loop-engineered agent systems. The editable surface is the skill library. The feedback signal is judge-scored task success. Learning happens through external artifact mutation. Observe -> Read Skill -> Execute -> Judge/Feedback -> Write Skill Update The router selects a skill for the current task. A query about “extracting patient medication history” and a query about “summarizing patient history” may look semantically close, but the correct execution behavior is different. Their proposed router is trained for behavioral similarity which I find very useful: - generate synthetic user goals for each skill - generate hard negatives that share terminology but should not use that skill - train with multi-positive InfoNCE (contrastive training objective for retrieval) interpret the embedding score as a soft Q-value - route to the skill most likely to produce successful execution So the router is asking: “Which skill historically leads to the right outcome for this kind of task?” The repo also has a gateway layer between the agent and skill execution. The execution engine is a ReAct-style loop: build messages, expose allowed tools, let the LLM act, process tool results, update state, and stop on success/failure/loop conditions. That makes execution observable. The system can tell whether a skill produced artifacts, repeated itself, got blocked by policy, timed out, or claimed success without evidence. They also have Reflector / Judge. In benchmark mode, the judge knows the gold answer. In production, this could be an eval rubric, human review, unit test, or downstream signal. The Skill Evolution Engine then has three update modes: - Utility update: track empirical success rate per skill. - Optimize existing skill: patch prompts, code, or guardrails for a specific failure mode. - Discover new skill: create a new skill when repeated failures show the current abstraction is wrong. The skill is effectively policy. So the model stays fixed, but the action distribution changes because the retrieved skill changes. Turning skills from static context files into versioned, evaluated, self-improving policy artifacts is still hard. But I expect coding platforms to integrate this into user-facing systems soon. As frontier model access gets constrained by pricing, policy, rate limits, and availability shocks, more performance will have to come from the substrate around the model.

muratcan's tweet photo. Memento-Skills is a self-evolving skill harness. It's a great paper to understand loop-engineered agent systems.

The editable surface is the skill library.
The feedback signal is judge-scored task success.
Learning happens through external artifact mutation.

Observe -> Read Skill -> Execute -> Judge/Feedback -> Write Skill Update

The router selects a skill for the current task. A query about “extracting patient medication history” and a query about “summarizing patient history” may look semantically close, but the correct execution behavior is different.

Their proposed router is trained for behavioral similarity which I find very useful:
- generate synthetic user goals for each skill
- generate hard negatives that share terminology but should not use that skill
- train with multi-positive InfoNCE (contrastive training objective for retrieval)
interpret the embedding score as a soft Q-value
- route to the skill most likely to produce successful execution

So the router is asking: “Which skill historically leads to the right outcome for this kind of task?”

The repo also has a gateway layer between the agent and skill execution. The execution engine is a ReAct-style loop: build messages, expose allowed tools, let the LLM act, process tool results, update state, and stop on success/failure/loop conditions.

That makes execution observable. The system can tell whether a skill produced artifacts, repeated itself, got blocked by policy, timed out, or claimed success without evidence.

They also have Reflector / Judge. In benchmark mode, the judge knows the gold answer. In production, this could be an eval rubric, human review, unit test, or downstream signal.

The Skill Evolution Engine then has three update modes:
- Utility update: track empirical success rate per skill.
- Optimize existing skill: patch prompts, code, or guardrails for a specific failure mode.
- Discover new skill: create a new skill when repeated failures show the current abstraction is wrong.

The skill is effectively policy. So the model stays fixed, but the action distribution changes because the retrieved skill changes.

Turning skills from static context files into versioned, evaluated, self-improving policy artifacts is still hard. But I expect coding platforms to integrate this into user-facing systems soon.

As frontier model access gets constrained by pricing, policy, rate limits, and availability shocks, more performance will have to come from the substrate around the model.

Muratcan Koylan

@muratcan

3 days ago

@tuhinone Amazing! Congrats, team. Thanks for continued support.

165

Muratcan Koylan

@muratcan

4 days ago

@eliebakouch I love Sakana but the announcement post is disappointing

769

Muratcan Koylan

@muratcan

6 days ago

They are playing a very technical game; it’s difficult for teams that use park-the-bus tactics like Paraguay or Australia, but their first real tests will be against teams that play positive football with talented players after the group stage. The next Turkiye match will be also very fun to watch.

Muratcan Koylan

@muratcan

7 days ago

I’m actually struggling with this the past few days. Without writing documentation, you have to spend a lot of time and tokens onboarding agents to projects. The same goes for new hires. However, documentation becomes outdated so quickly, and keeping it up to date isn’t as easy as it sounds.

526

Muratcan Koylan

@muratcan

7 days ago

Oh boy, the amount of AI-generated job applications is crazy. As someone who became an MTS without an academic background, here are a few things I would do if I were looking for a job in AI right now. I’m writing this post because after sharing a few AI job openings, I’ve been receiving a lot of very low-quality applications from people at genuinely reputable universities. I’m honestly shocked. First: stop using AI terms just to use them. No one cares about your “RAG application” that you built 3 years ago. No one is impressed because your resume says “GenAI.” These terms do not mean anything unless there is actual work behind them. Applying to 1 job a day with a customized application is way better than spamming 100 companies. For example, we have open research, papers, product details, and public writing about what we are building. Go read them. Learn what the company is actually researching. Try to understand the system, not just the job title. The competition is incredibly high. Your CV, unless you have a logo like DeepMind, or your university, is not that important by itself. You need to grind and build stuff. When I look at a candidate, new grad or experienced, I expect to see a GitHub and Hugging Face with a bunch of projects, blog, demos, or some public proof that they are actually working on AI right now. If you want to be an AI engineer but you have not experimented with auto-research, RL environments, meta-harnesses, eval loops, small model training, dataset generation, or similar systems, your chances are very low. These might sound like buzzwords, but they are not. They are real components of where AI is evolving. Building projects in these areas shows that you are paying attention. We rewrite large parts of our AI codebase every 6 months because models are changing fast, but more importantly, harnesses are changing fast. The only legit way to keep up is not just reading research papers, but also following companies and teams that generously share what they are learning: OpenAI, Cohere, Ramp Labs, Dropbox, Thinking Machines, and others. As a candidate, you should have at least 4–5 recent projects from the last few months. Reproduce a paper. Train a unique small model. Build an eval harness. Do some RL work. Build an agent loop. Create a synthetic data pipeline. Ship something weird but useful. It does not need to be perfect. It needs to show that you are alive in the field. Also, ask questions. Challenge us. If you are interviewing with an AI startup, you should be thinking about the same things we think about internally. For example: what happens if a company like Anthropic decides to become a competitor and releases a model that destroys our current advantage? What should we do? How do we become not just defensive, but untouchable? If you want to join an AI startup, you should be thinking about these questions and coming up with creative answers. Not generic answers. The only reason I found my first AI role was because I was a marketer who became obsessed with LLMs embodying different personas. So I read papers and started building open-source persona AI projects. The ex-CMO of Google Canada saw them, brought me into his startup as the first AI hire, and we built 99Ravens. The reason I left and joined Sully is similar. @omar_or_ahmed saw my context engineering projects and small model training examples, and wanted to meet. That is the game. Not “I applied to 400 jobs.” Build things. Make them public. Make them useful. Make them weird enough that someone remembers them. Join hackathons. Reproduce papers. Build tools around startups you like. If a company is hiring, analyze what they are building and create something related to it. @benln from Cursor shares growing and hiring startups. Go analyze those companies, learn about their projects and research. Also, @silviasapora has an amazing writeup on ML interviews with great insights. Read it and follow her steps: https://t.co/RfPw34V3wM You do not need to know everything about this field. It is impossible. AI is too fast and too deep. But you do need to show understanding, taste, and obsession.

Muratcan Koylan

@muratcan

7 days ago

We're hiring for a role very close to the work I care about most. Applied AI Engineer at @sullyai The role is evals + agent harnesses + self-improving loops: - build eval systems for clinical AI agents - turn clinician feedback + traces into self-improving loops - read frontier work, then ship the useful parts fast The job is to turn research into production systems clinicians can trust. DM or apply: https://t.co/lXczGhiJ2p