Without human input and constraints, AI-generated writing often lacks real insight.
Maybe it's time to stop asking AI to "just write" and start giving it better prompts.
New week of #buildinpublic 💪
What are you building this week? 📷
Looking to #connect with founders, builders, entrepreneurs, and people working on:
SaaS, AI/ML, Vibe coding, AI agents & workflows
Drop your project below.
I'd love to check it out and connect.
in @ycombinator they have a playbook on how to get customers ASAP for your startup.
if you follow this, you’ll brute force your way to 100 customers, almost no matter what your product is.
Here it is:
1/ launch-max.
Product Hunt, Hacker News, DevHunt, BetaList, Peerlist, Indie Hackers, etc. YC tells you to launch 3 times MINIMUM
2/ pull your competitor’s strongest backlinks and get yourself listed in the same places.
whatever article they have listed, you make a better version and ask the site to replace (or supplement) it with yours.
3/ WARM OUTBOUND.
Everyone knows about building in public. but you still need to capitalize on the 99% of leads who see your content but don’t come inbound
scrape everyone who likes your posts on Linkedin each week, check if they fit your customer profile, and message them.
you can set this up to fire automatically with @origamichat (i dropped a prompt in the comments)
4/ find 20 to 30 ugc creators on tiktok / instagram in your niche. ask them to create content about your product, ideally from a fresh account.
pay them a fixed fee ($15–$30 per video) plus performance incentives ($1k for 1 million views, etc).
you can use @sideshift_app (best creators imo) and line up 20+ of these creators in 1 day
5/ when building in public, a video is 10x better than an image/text - spam use cases of ur product on X/Linkedin
6/ figure out where your customers actually spend time.
which slack/discord groups are they in? what newsletters do they open? which podcasts and accounts do they follow? pay those people for shoutouts
7/ there's a fresh trend on x basically every week. jump on the relevant ones and fold your product in (like i’m doing right now).
To find trends i just use Origami & search “Lead Gen/GTM posts that are viral on X” to find the best posts every week in my niche
Then, I will reply to those, quote tweet them, and use the formats that work myself
(that’s the secret to why my account has high engagement BTW - you can do this too)
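A rough budget sketch for step 4, assuming one video per creator and a bonus paid per full million views. The function name and the linear bonus rule are hypothetical; the post only gives the $15–$30 fee range and the "$1k for 1 million views" example.

```python
def creator_cost(videos: int, fee_per_video: float, total_views: int,
                 bonus_per_million: float = 1_000.0) -> float:
    """Fixed fees plus a bonus for every full million views (assumed rule)."""
    fixed = videos * fee_per_video
    bonus = (total_views // 1_000_000) * bonus_per_million
    return fixed + bonus


# 20 videos at $15 with no viral hits vs. 30 videos at $30 with 2.5M total views.
low = creator_cost(videos=20, fee_per_video=15, total_views=0)
high = creator_cost(videos=30, fee_per_video=30, total_views=2_500_000)
print(low, high)  # 300.0 2900.0
```

Under these assumptions the fixed spend stays in the hundreds of dollars, and the bonus only kicks in once a video actually performs.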
---------
if you are doing all this every single week and DO NOT GIVE UP (launching, posting demos, contacting new customers)
I guarantee you will hit your customer goals. Then the game becomes retention.
will be posting 2-3 more growth hacks every single week
My spec-driven setup:
Scenarios → initial requirements
Roadmap → system requirements
Architecture + Prototype → implementation
Each item gets a global ID like PRN-001, SCN-001, or MVP-001, so decisions can be referenced across docs, issues, PRs, and code.
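A minimal sketch of how such IDs could be cross-referenced, assuming every ID is three uppercase letters, a hyphen, and three digits. The pattern, the function name `index_spec_ids`, and the sample documents are illustrative assumptions, not part of the actual setup.

```python
import re
from collections import defaultdict

# Assumed ID shape: three uppercase letters, a hyphen, three digits (e.g. PRN-001).
SPEC_ID = re.compile(r"\b[A-Z]{3}-\d{3}\b")


def index_spec_ids(documents: dict[str, str]) -> dict[str, list[str]]:
    """Map each spec ID to the sorted list of document names that mention it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in documents.items():
        for spec_id in SPEC_ID.findall(text):
            index[spec_id].add(name)
    return {spec_id: sorted(names) for spec_id, names in sorted(index.items())}


docs = {
    "roadmap.md": "MVP-001 depends on SCN-001.",
    "pr-description.txt": "Implements MVP-001 (see PRN-001).",
}
print(index_spec_ids(docs))
```

An index like this makes it easy to see which docs, issues, or PRs touch a given decision.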
Trying spec-driven development for a large TypeScript project.
Today I sketched this 7-layer spec structure to make specs the operating system for product, architecture, and implementation decisions.
Curious how others structure specs. Would love to discuss.
Agent Observability has 4 layers:
①Token-level: CPU/GPU compute & comm. cost per decoded token
②Request-level: network + model latency per API call
③Task-level: LLM / sandbox / tool-call latency per task
④Evolution-level: memory update and dreaming latency across tasks
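One way to keep the four layers apart in practice is to tag every latency sample with its layer. This is only a sketch; `Layer` and `LatencyLog` are hypothetical names, not an existing library.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    TOKEN = "token"          # compute and communication cost per decoded token
    REQUEST = "request"      # network + model latency per API call
    TASK = "task"            # LLM / sandbox / tool-call latency per task
    EVOLUTION = "evolution"  # memory update and dreaming latency across tasks


@dataclass
class LatencyLog:
    """Collects latency samples (in seconds) grouped by observability layer."""
    samples: dict = field(default_factory=lambda: defaultdict(list))

    def record(self, layer: Layer, seconds: float) -> None:
        self.samples[layer].append(seconds)

    def mean(self, layer: Layer) -> float:
        values = self.samples[layer]
        return sum(values) / len(values) if values else 0.0


log = LatencyLog()
log.record(Layer.REQUEST, 0.5)
log.record(Layer.REQUEST, 1.5)
print(log.mean(Layer.REQUEST))  # 1.0
```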
We can basically think of an agent as a GitHub repo.
The LLM is an executable inside it. The context is a set of text files. The harness is the code that manages how context and the LLM interact.
Apply harness engineering to this repo, and you get a self-improving agent.
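A toy sketch of the analogy, assuming the "repo" is a directory, the context lives in `context/*.txt`, and the LLM is any callable from prompt to text. `run_harness` and `stub_llm` are made-up names, and the stub stands in for a real model.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_harness(repo: Path, llm: Callable[[str], str], task: str, steps: int = 3) -> str:
    """Read context files, call the LLM, and write its output back as new context."""
    context_dir = repo / "context"
    context_dir.mkdir(parents=True, exist_ok=True)
    reply = ""
    for step in range(steps):
        # Context is just the text files currently in the repo.
        context = "\n".join(p.read_text() for p in sorted(context_dir.glob("*.txt")))
        reply = llm(f"{context}\nTask: {task}")
        # Persisting each reply lets later steps build on earlier ones.
        (context_dir / f"note_{step:03d}.txt").write_text(reply)
    return reply


def stub_llm(prompt: str) -> str:
    # Stand-in for a real model: reports how many earlier notes it was shown.
    return f"seen {prompt.count('seen')} earlier notes"


with tempfile.TemporaryDirectory() as tmp:
    print(run_harness(Path(tmp), stub_llm, task="demo"))  # seen 2 earlier notes
```

The point of the sketch is that the harness, not the model, decides what gets written back into the repo between steps.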
We should pay more attention to the DevAI benchmark, from a paper co-authored by Jürgen Schmidhuber, the father of LSTM.
They were already exploring self-evolving agents back in 2024. https://t.co/YPVGTqZ0NI
Today, we don't just need benchmarks for building software. We need benchmarks for building AI itself.
Only then can AI move into the next paradigm: self-evolution.
From data curation, to training, to evaluation (yes, benchmark for auto-eval), there is still a long way to go.
Interesting new SWE/agentic benchmark (DeepSWE) was released yesterday. 113 tasks across 91 repos in 5 languages. Here are interesting things I noticed:
- The evaluation harness (mini-swe-agent) gives every model a single bash tool and the same SI. No vendor editing primitives.
- Eval prompts are shorter than SWE-Bench Pro, but require 5.5× more code and touch 7 files on average. The idea is to mimic how developers actually talk to agents: short behavioral descriptions, not verbose specs.
- SI describes a specific workflow: find code, reproduce, fix, verify, edge cases, submit. This maps directly onto how the verifier grades, which could bias toward models that follow instructions literally over models that explore more.
- The bash tool is guarded: outputs over 10k chars get truncated, and malformed tool calls get caught and retried with guidance rather than crashing. This keeps the context from blowing up.
- Mini-swe-agent claims to match or beat 1P harnesses on the same tasks. Claude Opus scored +10pp over Claude Code. Gemini 3.1 Pro scored +20pp over Gemini CLI.
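The output guard described above can be sketched roughly like this. It is not mini-swe-agent's actual code: the head-plus-tail strategy, the function names, and the timeout are assumptions, and only the 10k-character limit comes from the post.

```python
import subprocess

MAX_OUTPUT_CHARS = 10_000  # limit mentioned in the post; everything else is assumed


def truncate_output(text: str, limit: int = MAX_OUTPUT_CHARS) -> str:
    """Keep the head and tail of long output and note how much was dropped."""
    if len(text) <= limit:
        return text
    half = limit // 2
    dropped = len(text) - 2 * half
    tail = text[len(text) - half:]
    return f"{text[:half]}\n[... {dropped} characters truncated ...]\n{tail}"


def run_bash(command: str, timeout: int = 60) -> str:
    """Run a command through bash and return its combined, possibly truncated output."""
    result = subprocess.run(
        ["bash", "-c", command], capture_output=True, text=True, timeout=timeout
    )
    return truncate_output(result.stdout + result.stderr)
```

Keeping both ends of the output is a common choice because error messages tend to appear at the end while the command's setup appears at the start.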
Would love to see how other harness × model combinations will do, e.g. @cursor_ai, @antigravity, @FactoryAI and how well the eval harness does on more general knowledge work, e.g. GDPval.
Great to see the SWE-agent team keep pushing on both the research and eval side. 🤗
introducing howtoeval dot com. the no-bullshit guide to eval'ing AI agents.
from personal experience, and from working with the best companies in the world.
there's even a quiz. link below.
When we say "meta harness" or "agentic harness engineering", we may miss what "harness" means.
It is not just code assets for building agents.
Harness is the top-level concept: leveraging intelligence inside a Ralph Loop until the human is satisfied. https://t.co/yfdrSIq6O1
Heuristic Learning is an RL loop over code assets.
https://t.co/aGx7aRdpQ2
OpenAI's Harness Engineering is exactly the same idea in broader practice.
https://t.co/fHcIcWCJ0s
By "forgetting", I don't mean model performance.
I mean training horizon.
RL is hard to run for thousands of stable update steps.
SFT can even run for millions of steps.
RL is more prone to forgetting than SFT.
SFT uses large batches on a simple task: next-token prediction.
RL uses small batches on a harder task: finding the right action sequence.
In agent evolution, optimizing on one example per step makes forgetting more severe.
Agent = Model + Context + Runtime + Harness.
With feedback, we can optimize every layer:
① Model: parameters, architecture
② Context: prompts, rules, memory
③ Runtime: files, tools, software interfaces
④ Harness: orchestration, compaction, control flow