@cat_eye_on Not entirely sure tbh. Maybe the MLP alone wasn't expressive enough to learn how to avoid other agents? After all, the inputs to the agents are literally just a flat tensor. I tried GNNs and reward hacking (which I'll talk about later) and both worked much better.
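One way to picture the "flat tensor" point: a flat observation changes when the other agents are listed in a different order, while a pooled (GNN-style) summary does not. This is only a minimal NumPy sketch of that idea, not the GNN from the thread. The feature sizes, the `w_msg` weight matrix, and both helper functions are made up for illustration.

```python
import numpy as np

def flat_features(own, others):
    # Flat layout: own features followed by every other agent's features in list order.
    return np.concatenate([own, others.reshape(-1)])

def pooled_features(own, others, w_msg):
    # GNN-style layout: embed each neighbour with shared weights, then mean-pool,
    # so the result does not depend on the order of the neighbours.
    messages = np.tanh(others @ w_msg)            # (n_others, hidden)
    return np.concatenate([own, messages.mean(axis=0)])

rng = np.random.default_rng(0)
own = rng.normal(size=4)             # hypothetical own-state features
others = rng.normal(size=(2, 2))     # hypothetical relative positions of 2 other agents
w_msg = rng.normal(size=(2, 8))      # hypothetical shared message weights
swapped = others[::-1]               # same neighbours, listed in the opposite order

flat_changed = not np.allclose(flat_features(own, others), flat_features(own, swapped))
pooled_same = np.allclose(pooled_features(own, others, w_msg),
                          pooled_features(own, swapped, w_msg))
print(flat_changed, pooled_same)
```

With a flat input the MLP has to learn separately that both orderings describe the same situation, which may be part of why the pooled version trains more easily.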
Following up on my last post, here are some of the main pain points I ran into before getting my first successful runs on Simple Spread using @CleanRL.
As for the "orchestrator" policy, there are ig levels to the amount of "orchestration." Hierarchical is, as u described, where an orchestrator policy outputs goals to other agents. I haven't really explored that since CTDE has been performing pretty well for me. CTDE is sorta like an orchestrator during training: a critic network uses all the actions and observations of the individual agents to determine how well the agents are doing, while each actor still acts only on its own observation.
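To make the centralized critic concrete, here is a rough NumPy sketch of how its input could be assembled from every agent's observation and action. It follows the description above. Many MAPPO implementations instead feed the critic only the joint observations (a state-value function), so treat the action part as optional. The sizes (3 agents, 18-dim observations, 5 discrete actions), the tiny two-layer critic, and all function names are assumptions for illustration, not the code from the repo.

```python
import numpy as np

def build_critic_input(observations, actions, n_actions):
    # observations: (n_agents, obs_dim), actions: (n_agents,) integer action ids.
    one_hot = np.eye(n_actions)[actions]                      # (n_agents, n_actions)
    per_agent = np.concatenate([observations, one_hot], axis=1)
    return per_agent.reshape(-1)                              # one joint vector

def centralized_value(critic_input, w1, b1, w2, b2):
    # Tiny two-layer critic: joint vector -> hidden layer -> scalar value estimate.
    hidden = np.tanh(critic_input @ w1 + b1)
    return float(hidden @ w2 + b2)

rng = np.random.default_rng(0)
n_agents, obs_dim, n_actions, hidden_dim = 3, 18, 5, 64      # assumed sizes
obs = rng.normal(size=(n_agents, obs_dim))
acts = np.array([0, 3, 4])

joint = build_critic_input(obs, acts, n_actions)
w1 = rng.normal(scale=0.1, size=(joint.shape[0], hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(scale=0.1, size=hidden_dim)
b2 = 0.0

value = centralized_value(joint, w1, b1, w2, b2)
print(joint.shape, value)
```

The critic is only needed while training. At execution time each actor runs on its own observation, which is the "decentralized execution" half of CTDE.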
Quick teaser for what's next: I started messing around with VMAS and got a 10x SPS speedup out of the box. Also implemented a GNN which is completely crushing standard MLPs (even with homogeneous actors). Next goal is scaling to higher N using imitation learning!
The fix: MAPPO with CTDE (Decentralized, Heterogeneous Actors + Centralized Critic).
Giving each agent its own independent weights naturally breaks symmetry. They didn't just learn to move; they learned roles and asymmetric strategies, like implicitly learning to let specific agents go first to clear bottlenecks.
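A small NumPy sketch of the symmetry-breaking point: with shared weights, agents that see the same observation produce identical action distributions, while independently initialized actors do not. The `LinearActor` class, the sizes, and the initialization scale are all hypothetical and much simpler than a real MAPPO actor.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max()      # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

class LinearActor:
    """Hypothetical one-layer policy: observation -> action probabilities."""
    def __init__(self, obs_dim, n_actions, rng):
        self.w = rng.normal(scale=0.1, size=(obs_dim, n_actions))
        self.b = np.zeros(n_actions)

    def action_probs(self, obs):
        return softmax(obs @ self.w + self.b)

rng = np.random.default_rng(0)
n_agents, obs_dim, n_actions = 3, 18, 5   # assumed sizes

# Heterogeneous: every agent owns a separate set of weights.
hetero_actors = [LinearActor(obs_dim, n_actions, rng) for _ in range(n_agents)]
# Homogeneous: every agent points at the same weights.
shared = LinearActor(obs_dim, n_actions, rng)
homo_actors = [shared] * n_agents

same_obs = rng.normal(size=obs_dim)
hetero_probs = [actor.action_probs(same_obs) for actor in hetero_actors]
homo_probs = [actor.action_probs(same_obs) for actor in homo_actors]
print(np.allclose(homo_probs[0], homo_probs[1]),
      np.allclose(hetero_probs[0], hetero_probs[1]))
```

In practice agents rarely see exactly the same observation, but shared weights still push them toward the same behaviour in similar situations, which is the symmetry that separate weights break.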
@cat_eye_on @jparkjmc One time I was down 0-12 on Ancient with 2 AFK and I thought all hope was lost. Then Catherine was like I gotcha fam and clutched up 13-12. Then they gave me a dragon lore.
@aryan33864 Oh there's this cool YouTube series I found that talks about MARL fundamentals that I'm working my way through: https://t.co/vyxSKXSvFt. Otherwise it's a lot of googling and asking Gemini why my code doesn't work
I'm learning Multi-Agent RL (MARL)! To really understand how it works, I'm building from scratch starting with @cleanrl_lib in the PettingZoo Simple Spread environment. I'd like to thank @arth_shukla and @Stone_Tao for helping guide me in the initial understanding here.
Baseline: getting basic decentralized agents to just figure out where to go.
Hey! I was thinking about using pufferlib actually cuz it was recommended to me by a few people (like Stone and Daphne). I decided to start with just cleanrl cuz it has single-file implementations which make it easy to understand and modify (and it wasn't TOO slow, ~2.5K SPS). I also didn't know pufferlib supports multi-agent envs/PettingZoo. How does this compare with VMAS + BenchMARL/TorchRL? I was planning on using that for the next step when I need to start scaling up the number of experiments/agents. But also, in the meantime, I do see https://t.co/f7nIT7H9lG. Does that mean I can just add this wrapper to mpe2 simple spread and my SPS will become much higher? Although, I am doing some weird stuff with frame stack, vector envs and gym wrappers to get compatibility between PettingZoo and gym, which I'll post about next!
I'm documenting this to solidify my own understanding and hopefully help others. I've done my best to verify the correctness of everything I post, but MARL is hard ;)
If there are any flaws in my reasoning or a better way to implement something, let me know in the replies!
Here is my code: https://t.co/NHzBkrrEHl
A quick roadmap: I've already worked my way through CTDE, MAPPO, and some reward hacking, and now I'm working on imitation learning.
The next few threads will be a retrospective on how I built up to that, sharing the pitfalls and bugs I hit along the way. After that, we go live with real-time updates as I move into VMAS and Isaac Gym.