Setu Chokshi @setuc - Twitter Profile

setuc retweeted

about 4 years ago

Hi all, my name is Alisa, I am making the online training "Zero Day Engineering". If you want to get into the *real* offensive cyber security (reverse eng, vulns, exploits, fuzzing, pwn, ... 0days), eager to get your hands dirty, and haven't seen it yet, you probably should:

16

914

88

766

0

setuc retweeted

Andrew Ng

@AndrewYNg

about 2 months ago

New course: Spec-Driven Development with Coding Agents, built in partnership with @jetbrains, and taught by @paulweveritt.

Vibe coding is fast, but often produces code that doesn't match what you asked for. This short course teaches you spec-driven development: write a detailed spec defining what to build, and work with your coding agent to implement it.

Many of the best developers already build this way. A spec lets you control large code changes with a few words, preserve context across agent sessions, and stay in control as your project grows in complexity.

Skills you'll gain:

- Write a detailed specification to define your mission, tech stack, and roadmap, giving your agent the context it needs from the start
- Plan, implement, and validate features in iterative loops using a spec as your agent's guide
- Apply the same repeatable workflow to both new and legacy codebases
- Package your workflow into a portable agent skill that works across agents and IDEs

Join and write specs that keep your coding agent on track! https://t.co/hI4GwuvhtN

171

3K

419

4K

454K

setuc retweeted

elvis

@omarsar0

about 2 months ago

Long-horizon AI research agents are mostly a state-management problem.

It is not enough for an agent to reason well in the next turn. ML research requires task setup, implementation, experiments, debugging, and evidence tracking over hours or days.

This new paper introduces AiScientist, a system for autonomous long-horizon engineering for ML research.

The key idea is to keep control thin and state thick. A top-level orchestrator manages stage-level progress, while specialized agents repeatedly ground themselves in durable workspace artifacts: analyses, plans, code, logs, and experimental evidence.

That "File-as-Bus" design matters. AiScientist improves PaperBench by 10.54 points over the best matched baseline and reaches 81.82 Any Medal% on MLE-Bench Lite. Removing File-as-Bus drops PaperBench by 6.41 points and MLE-Bench Lite by 31.82 points.

Why does it matter?

Autonomous research agents need durable project memory, not just longer chats.

Paper: https://t.co/A84c75oumP

Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

10

351

66

358

34K

setuc retweeted

Yoonho Lee

@yoonholeee

about 2 months ago

We just released code for Meta-Harness! https://t.co/OdU7zocdPl

Aside from replicating paper experiments, the repo is designed to help users implement good Meta-Harnesses in completely new domains! Just point your agent at ONBOARDING.md and have a conversation https://t.co/0H6Zrvg8FQ

26

1K

164

1K

125K

setuc retweeted

Salesforce AI Research

@SFResearch

2 months ago

Rethinking Memory Mechanisms of Foundation Agents in the Second Half: A Survey https://t.co/Mh19vMYrH7

As AI agents move beyond static benchmarks into long-horizon, real-world environments, memory becomes the critical infrastructure for bridging the utility gap. This survey unifies foundation agent memory across three dimensions:

🧠 Memory Substrate → internal (weights, KV cache, latent states) vs. external (vector stores, knowledge graphs, text records)

🔄 Cognitive Mechanism → sensory, working, episodic, semantic, and procedural memory, mapped from human cognition to agent architectures

👤 Memory Subject → who is memory serving? User-centric memory for personalization vs. agent-centric memory for skill accumulation and task transfer

→ Analyzes memory operations across single-agent and multi-agent topologies, including architecture, routing, and conflict resolution

→ Covers learning policies for memory management: prompting, fine-tuning, and RL-based approaches

→ Reviews 218 papers across 2023–2025 with evaluation benchmarks and metrics for both user- and agent-centric settings

→ Identifies six open challenges including continual learning, privacy-preserving memory, multimodal grounding, and real-world evaluation

@Salesforce authors: Zixuan Ke @KeZixuan, Liangwei Yang @Liangwei_Yang, Juntao Tan @chrisjtan, Shelby Heinecke @shelbyh_ai, Huan Wang @huan__wang, Caiming Xiong @CaimingXiong

#FutureOfAI #EnterpriseAI #LLMAgents

2

39

10

36

3K

setuc retweeted

elvis

@omarsar0

2 months ago

NEW research from CMU.

(bookmark this one)

The biggest unlock in coding agents is understanding strategies for how to run them asynchronously.

Simply giving a single agent more iterations helps, but does not scale well.

And multi-agent research shows that coordination > compute.

A new paper from CMU proves this with a practical multi-agent system.

CAID (Centralized Asynchronous Isolated Delegation) borrows proven human SWE practices: a manager builds a dependency graph, delegates tasks to engineer agents who work in isolated git worktrees, execute concurrently, self-verify with tests, and integrate via git merge.

CAID improves accuracy over single-agent baselines by 26.7% absolute on paper reproduction tasks (PaperBench) and 14.3% on the Python library development tasks (Commit0).

The key insight is that isolation plus explicit integration beats both single-agent scaling and naive multi-agent approaches.

For long-horizon software engineering tasks, multi-agent coordination using git-native primitives should be the default strategy, not a fallback.

Paper: https://t.co/cRAbG7SrR5

Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

38

438

86

565

54K

Setu Chokshi @setuc

3 months ago

@ckvishwakarma Thanks. I will reach out to him as well.

0

22

Setu Chokshi @setuc

3 months ago

@CrownRelo We had a very distressing experience with our relocation shipment from Singapore. After our goods were already in transit, we were suddenly told very late, close to delivery, that we had to pay more due to an alleged “volume increase.”

2

1

0

58

Setu Chokshi @setuc

3 months ago

After unpacking, we also discovered one Garmin pedal from our bicycle was missing and the other was damaged. The bicycle had been packed by the moving team. We are still seeking a proper resolution.

0

33

Setu Chokshi @setuc

3 months ago

We paid, but the delivery situation still became chaotic and was delayed. This happened while our family was dealing with a bereavement and funeral. Instead of clear coordination, we were passed between different people and offices and had to keep chasing updates ourselves.

5

0

40

Setu Chokshi @setuc

3 months ago

This was never clearly raised during survey or packing. The timing and manner of this demand were deeply unreasonable. We were contacted late at night and felt pressured to pay in order to avoid further delay to delivery.

0

35

setuc retweeted

Philipp Schmid

@_philschmid

4 months ago

Context-Bench evaluates model performance on Filesystems and Skills. It measures an LLM's ability to manage its own context window, deciding what information to retrieve, load, and discard to solve long-horizon tasks.

- Filesystem: Evaluates how well agents can chain file operations, trace entity relationships, and manage multi-step retrieval.

- Skills: Evaluates how well agents can discover and load skills to complete tasks.

Great work by @Letta_AI. Link and Leaderboard below 🤗

12

141

16

86

18K

setuc retweeted

Santiago

@svpino

4 months ago

A 100% automated QA team that works 24/7:

1. Write your test in plain English
2. AI generates the test cases
3. Web agents execute the tests in parallel
4. Live browser preview with everything that happens

Try this for free in your project.

55

493

55

654

55K

setuc retweeted

Stan Girard

@_StanGirard

4 months ago

I reverse-engineered Claude Code's internal protocol. Now you can spawn and orchestrate agents from TypeScript. No SDK. No -p flag. Full programmatic control. OSS below 👇 https://t.co/7f6GNVKWvr

10

238

25

326

17K

setuc retweeted

elvis

@omarsar0

4 months ago

I think one of the most underappreciated findings in AI engineering is what this paper calls the "Grep Tax."

First, they ran nearly 10,000 experiments testing how agents handle structured data, and the headline result is that format barely matters.

But here's the weird finding: a compact, token-saving format they tested (TOON) actually consumed *up to 740% more tokens* at scale because models didn't recognize the syntax and kept cycling through search patterns from formats they already knew.

It's one of the reasons my preferred formats are XML and Markdown. LLMs know those really well.

The models have preferences baked into their training data, and fighting those preferences doesn't save you money. It costs you.

The other finding worth sitting with: the same agentic architecture that improves frontier model performance actively *hurts* open-source models. It seems that the universal best-practices guide for AI engineering may not exist.

46

520

63

604

84K

setuc retweeted

himanshu

@himanshustwts

4 months ago

New paper on how to finetune any multiagent system on any task.

They have used AI feedback as per-action process rewards to solve credit assignment and sample efficiency without needing expensive rollouts. https://t.co/qLaWVw5Tr9

2

156

19

132

7K

setuc retweeted

AVB

@neural_avb

4 months ago

New article out on @TDataScience implementing custom LLM memory systems from scratch using DSPy. Go give it a read! Code is open source. https://t.co/bAxQeCFKOi

5

507

76

546

25K

setuc retweeted

Han

@HanchungLee

7 months ago

full deep dive here https://t.co/vDvSfyFZy5

3

87

9

126

6K

setuc retweeted

Charly Wargnier

@DataChaz

4 months ago

Wild.

By far the most complete Claude Skills repo yet 🤯

@Composio’s Awesome-Claude-Skills packs 100s of ready-to-use workflows:
↳ PDF tools, changelog generation
↳ Playwright automation
↳ AWS/CDK tools, MCP builders

... and much more!

Free and open-source.

Repo in 🧵↓ https://t.co/dEAGQzAfRt

11

868

131

2K

77K

setuc retweeted

Ian Nuttall

@iannuttall

4 months ago

The playbooks skills directory looks a bit more diversified now it's sorted by trending skills 📈

Popular skills this week:

- bird by @openclaw
- prd by @ryancarson
- multi-pr-preview by @dyad_sh https://t.co/ujuU5SZHh9

9

120

6

135

8K
