Yaowei Zheng @code_hiyouga - Twitter Profile

Yaowei Zheng

@code_hiyouga

about 1 hour ago

@alexx_p977 Would love to connect

0

2

Yaowei Zheng

@code_hiyouga

about 18 hours ago

@praj_kesi let's go

0

28

Yaowei Zheng

@code_hiyouga

about 18 hours ago

Without human input and constraints, AI-generated writing often lacks real insight. Maybe it's time to stop asking AI to "just write" and start giving it better prompts.

0

3

0

173

Yaowei Zheng

@code_hiyouga

about 18 hours ago

New week of #buildinpublic 💪 What are you building this week? 📷 Looking to #connect with founders, builders, entrepreneurs, and people working on: SaaS, AI/ML, Vibe coding, AI agents & workflows Drop your project below. I'd love to check it out and connect.

37

25

0

1

771

code_hiyouga retweeted

Finn Mallery

@fin465

1 day ago

in @ycombinator they have a playbook on how to get customers ASAP for your startup. if you follow this, you’ll brute force your way to 100 customers, almost no matter what your product is. Here it is:

1/ launch-max. product hunt, hackerNews, devhunt, betalist, peerlist, indie hackers, etc. YC tells you to launch 3 times MINIMUM

2/ pull your competitor’s strongest backlinks and get yourself listed in the same places. whatever article they have listed, you make a better version and ask the site to replace it (or supplement) with yours.

3/ WARM OUTBOUND. Everyone knows about building in public. but you still need to capitalize on the 99% of leads who see your content but don’t come inbound. scrape everyone who likes your posts on Linkedin each week, check if they fit your customer profile, and message them. you set this up to fire automatically with @origamichat (i dropped a prompt in the comments)

4/ find 20 to 30 ugc creators on tiktok / instagram in your niche. ask them to create content about your product, ideally from a fresh account. pay them a fixed fee ($15–$30 per video) plus performance incentives ($1k for 1 million views, etc). you can use @sideshift_app (best creators imo) and line up 20+ of these creators in 1 day

5/ when building in public, a video is 10x better than an image/text - spam use cases of ur product on X/Linkedin

6/ figure out where your customers actually spend time. which slack/discord groups are they in? what newsletters do they open? which podcasts and accounts do they follow? pay those people for shoutouts

7/ there's a fresh trend on x basically every week. jump on the relevant ones and fold your product in (like i’m doing right now). To find trends i just use Origami & search “Lead Gen/GTM posts that are viral on X” to find the best posts every week in my niche. Then, I will reply to those, quote tweet them, and use the formats that work myself (that’s the secret to why my account has high engagement BTW - you can do this too)

---------

if you are doing all this every single week and DO NOT GIVE UP (launching, posting demos, contacting new customers) I guarantee you will hit your customer goals. Then the game becomes retention. will be posting 2-3 more growth hacks every single week

94

4K

234

10K

272K

Yaowei Zheng

@code_hiyouga

about 23 hours ago

My spec-driven setup:
Scenarios → initial requirements
Roadmap → system requirements
Architecture + Prototype → implementation
Each item gets a global ID like PRN-001, SCN-001, or MVP-001, so decisions can be referenced across docs, issues, PRs, and code.
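One way to make such global IDs useful is to index where each one is mentioned. The sketch below is only an illustration: the ID pattern (three uppercase letters, a hyphen, three digits) is inferred from the examples in the post, and `index_spec_ids`, the file names, and the sample text are all hypothetical.

```python
import re
from collections import defaultdict

# Assumed ID shape, inferred from the examples PRN-001, SCN-001, MVP-001.
SPEC_ID = re.compile(r"\b[A-Z]{3}-\d{3}\b")


def index_spec_ids(documents: dict[str, str]) -> dict[str, list[str]]:
    """Map each spec ID to the sorted names of the documents that mention it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in documents.items():
        for match in SPEC_ID.finditer(text):
            index[match.group(0)].add(name)
    return {spec_id: sorted(names) for spec_id, names in sorted(index.items())}


# Hypothetical documents, standing in for docs, issues, PRs, and code comments.
docs = {
    "roadmap.md": "MVP-001 depends on SCN-001.",
    "pr_description.txt": "Implements MVP-001; see PRN-001 for background.",
}
spec_index = index_spec_ids(docs)
```

With an index like this, a reviewer could look up every place a decision is referenced before changing it.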

0

2

0

20

Yaowei Zheng

@code_hiyouga

about 23 hours ago

Trying spec-driven development for a large TypeScript project. Today I sketched this 7-layer spec structure to make specs the operating system for product, architecture, and implementation decisions. Curious how others structure specs. Would love to discuss.

https://t.co/IcyLaBjEAE

1

2

0

34

Yaowei Zheng

@code_hiyouga

about 23 hours ago

Agent Observability has 4 layers:
①Token-level: CPU/GPU compute & comm. cost per decoded token
②Request-level: network + model latency per API call
③Task-level: LLM / sandbox / tool-call latency per task
④Evolution-level: memory update and dreaming latency across tasks

https://t.co/SqdMkzYu6K
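The four layers above could be recorded with nested timing spans. This is a minimal sketch under stated assumptions: `LatencyRecorder` and the layer keys are hypothetical names, and it measures only wall-clock latency, not the compute or communication cost mentioned for the token level.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Layer names taken from the post; the string keys themselves are an assumption.
LAYERS = ("token", "request", "task", "evolution")


class LatencyRecorder:
    """Collects wall-clock durations in seconds, keyed by observability layer."""

    def __init__(self) -> None:
        self.samples: dict[str, list[float]] = defaultdict(list)

    @contextmanager
    def span(self, layer: str):
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer}")
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the wrapped code raises.
            self.samples[layer].append(time.perf_counter() - start)

    def totals(self) -> dict[str, float]:
        return {layer: sum(self.samples[layer]) for layer in LAYERS}


recorder = LatencyRecorder()
with recorder.span("task"):
    with recorder.span("request"):
        time.sleep(0.01)  # stand-in for one API call
totals = recorder.totals()
```

Because spans nest, the task total always includes the request time spent inside it, which makes it easy to see how much of a task is spent waiting on API calls.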

0

1

0

11

Yaowei Zheng

@code_hiyouga

1 day ago

Code as Agent Harness: https://t.co/yHLcOck6qW

Yaowei Zheng

@code_hiyouga

1 day ago

We can basically think of an agent as a GitHub repo. The LLM is an executable inside it. The context is a set of text files. The harness is the code that manages how context and the LLM interact. Apply harness engineering to this repo, and you get a self-improving agent.
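The repo analogy can be sketched in a few lines. Everything here is an assumption made for illustration: the `context/` directory layout, the `run_step` function, and the `echo_llm` stand-in, which replaces a real model call so the example runs offline.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_step(repo: Path, llm: Callable[[str], str]) -> str:
    """One harness step: gather context files, call the LLM, append its reply to the context."""
    context_dir = repo / "context"
    parts = [
        p.read_text(encoding="utf-8").strip()
        for p in sorted(context_dir.glob("*.txt"))
    ]
    reply = llm("\n".join(parts))
    # The reply becomes part of the context that the next step reads.
    with (context_dir / "notes.txt").open("a", encoding="utf-8") as fh:
        fh.write(reply + "\n")
    return reply


def echo_llm(prompt: str) -> str:
    """Hypothetical stand-in for the LLM 'executable'."""
    return f"seen {len(prompt.splitlines())} line(s)"


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "context").mkdir()
    (repo / "context" / "task.txt").write_text("Fix the failing test.", encoding="utf-8")
    first = run_step(repo, echo_llm)
    second = run_step(repo, echo_llm)
```

In this reading, the harness is just `run_step`: it decides which text files reach the model and where the output is stored, so changing that code changes how the agent behaves on the next step.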

0

1

0

56

0

18

Yaowei Zheng

@code_hiyouga

1 day ago

We should pay more attention to the DevAI benchmark, from a paper co-authored by Jürgen Schmidhuber, the father of LSTM. They were already exploring self-evolving agents back in 2024. https://t.co/YPVGTqZ0NI

0

1

0

13

Yaowei Zheng

@code_hiyouga

1 day ago

Data Interpreter (Data): https://t.co/gcuucWscfN
MLE-Bench (Training): https://t.co/PN5vzUlRNh
Agent-as-a-Judge (Evaluation): https://t.co/aKGctoBHXC

0

1

0

17

Yaowei Zheng

@code_hiyouga

1 day ago

Today, we don't just need benchmarks for building software. We need benchmarks for building AI itself. Only then can AI move into the next paradigm: self-evolution. From data curation, to training, to evaluation (yes, a benchmark for auto-eval), there is still a long way to go.

1

0

15

Yaowei Zheng

@code_hiyouga

1 day ago

mini-swe-agent is insane. A tiny harness, one bash tool, and somehow it beats or matches many 1P agent setups. Definitely going to study this.

Philipp Schmid

@_philschmid

21 days ago

Interesting new SWE/agentic benchmark (DeepSWE) was released yesterday. 113 tasks across 91 repos in 5 languages. Here are interesting things I noticed:

- The evaluation harness (mini-swe-agent) gives every model a single bash tool and the same SI. No vendor editing primitives.

- Eval prompts are shorter than SWE-Bench Pro, but require 5.5× more code and touch 7 files on average. The idea is to mimic how developers actually talk to agents: short behavioral descriptions, not verbose specs.

- SI describes a specific workflow: find code, reproduce, fix, verify, edge cases, submit. This maps directly onto how the verifier grades, which could bias toward models that follow instructions literally over models that explore more.

- The bash tool is guarded: outputs over 10k chars get truncated. Malformed tool calls get caught and retried with guidance rather than crashing. This keeps the context from blowing up.

- Mini-swe-agent claims to match or beat 1P harnesses on the same tasks. Claude Opus scored +10pp over Claude Code. Gemini 3.1 Pro scored +20pp over Gemini CLI.

Would love to see how other harness × model combinations will do, e.g. @cursor_ai, @antigravity, @FactoryAI and how well the eval harness does on more general knowledge work, e.g. GDPval.

Great to see the SWE-agent team keep pushing on both the research and eval side. 🤗

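The guarded bash tool described above can be approximated as follows. This is a rough sketch, not mini-swe-agent's actual code: `run_bash`, the error strings, and the timeout are assumptions, and only the 10k-character limit comes from the post. It assumes `bash` is available on `PATH`.

```python
import subprocess

MAX_OUTPUT_CHARS = 10_000  # truncation threshold mentioned in the post


def run_bash(command: str, timeout_s: float = 30.0) -> str:
    """Run a command through bash and return combined output, truncated if too long."""
    if not isinstance(command, str) or not command.strip():
        # Malformed call: return guidance instead of raising, so the caller can retry.
        return "ERROR: malformed tool call. Pass a non-empty shell command string."
    try:
        proc = subprocess.run(
            ["bash", "-c", command],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return f"ERROR: command timed out after {timeout_s} seconds."
    output = proc.stdout + proc.stderr
    if len(output) > MAX_OUTPUT_CHARS:
        omitted = len(output) - MAX_OUTPUT_CHARS
        output = output[:MAX_OUTPUT_CHARS] + f"\n[truncated {omitted} characters]"
    return output
```

Returning error text rather than raising keeps the agent loop alive: the model sees the guidance as an ordinary tool result and can correct its next call.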

15

182

10

63

16K

0

1

0

27

Yaowei Zheng

@code_hiyouga

1 day ago

Very useful 🙏

ben hylak

@benhylak

21 days ago

introducing howtoeval dot com. the no-bullshit guide to eval'ing AI agents. from personal experience, and from working with the best companies in the world. there's even a quiz. link below.

https://t.co/QIvoyr4VWX

40

1K

84

2K

103K

0

1

0

19

Yaowei Zheng

@code_hiyouga

1 day ago

When we say "meta harness" or "agentic harness engineering", we may miss what "harness" means. It is not just code assets for building agents. Harness is the top-level concept: leveraging intelligence inside a Ralph Loop until the human is satisfied. https://t.co/yfdrSIq6O1

0

16

Yaowei Zheng

@code_hiyouga

1 day ago

Heuristic Learning is an RL loop over code assets. https://t.co/aGx7aRdpQ2 OpenAI's Harness Engineering is exactly the same idea in broader practice. https://t.co/fHcIcWCJ0s

1

0

20

Yaowei Zheng

@code_hiyouga

1 day ago

By "forgetting", I don't mean model performance. I mean training horizon. RL is hard to run for thousands of stable update steps. SFT can even run for millions of steps.

0

12

Yaowei Zheng

@code_hiyouga

1 day ago

Agent self-evolving = RL in the whole agent space. Same data. Same eval loop. Same bottleneck: mitigating forgetting during learning.

2

0

16

Yaowei Zheng

@code_hiyouga

1 day ago

RL is more fragile to forgetting than SFT. SFT uses large batches on a simple task: next-token prediction. RL uses small batches on a harder task: finding the right action sequence. In agent evolving, optimizing with one example per step makes forgetting more severe.

1

0

14

Yaowei Zheng

@code_hiyouga

1 day ago

Agent = Model + Context + Runtime + Harness. With feedback, we can optimize every layer:
① Model: parameters, architecture
② Context: prompts, rules, memory
③ Runtime: files, tools, software interfaces
④ Harness: orchestration, compaction, control flow
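As a rough illustration of the decomposition above, each layer can be modeled as a separate field that feedback updates independently. The `Agent` dataclass, the `apply_feedback` helper, and the example keys are hypothetical; a real system would hold far richer objects than string dictionaries.

```python
from dataclasses import dataclass, field, replace

LAYERS = ("model", "context", "runtime", "harness")


@dataclass(frozen=True)
class Agent:
    model: dict = field(default_factory=dict)    # parameters, architecture
    context: dict = field(default_factory=dict)  # prompts, rules, memory
    runtime: dict = field(default_factory=dict)  # files, tools, software interfaces
    harness: dict = field(default_factory=dict)  # orchestration, compaction, control flow


def apply_feedback(agent: Agent, layer: str, key: str, value: str) -> Agent:
    """Return a new Agent with one setting changed in the named layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    updated = {**getattr(agent, layer), key: value}
    return replace(agent, **{layer: updated})


base = Agent(context={"system_prompt": "v1"})
tuned = apply_feedback(base, "context", "system_prompt", "v2")
```

Keeping the original agent unchanged makes it possible to compare the two versions on the same evaluation before adopting the update.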

0

8
