Founder @tensorlake, building sandbox infra for agents. Past - Built Nomad @hashicorp, AI Infra @meta, Container Scheduler tech lead @linkedin and @netflix
Sneak peek into the new distributed shared file system coming soon to @tensorlake sandboxes!
It's a fully POSIX-compatible file system that multiple sandboxes can mount. As sandboxes are migrated between machines and restored from a suspended state, the file system persists.
The two shells at the top are sandboxes with the same file system mounted, and they can read/write to it concurrently.
The file system is built on top of blob stores, with NVMe-based caching to scale reads.
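To make the "blob store plus cache" idea concrete, here is a minimal read-through LRU cache sketch. It is only an illustration: `ReadThroughCache`, `fetch_block`, and the in-memory `blob_store` dict are hypothetical stand-ins, not the Tensorlake or ZeroFS implementation.

```python
from collections import OrderedDict


class ReadThroughCache:
    """LRU cache in front of a slower backing store.

    Hypothetical stand-in for a fast local cache over blob storage.
    """

    def __init__(self, fetch_block, capacity):
        self.fetch_block = fetch_block  # called only on a cache miss
        self.capacity = capacity        # max number of cached blocks
        self.blocks = OrderedDict()     # block_id -> bytes, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            # Hit: mark the block as most recently used.
            self.blocks.move_to_end(block_id)
            self.hits += 1
            return self.blocks[block_id]
        # Miss: fetch from the backing store, then cache the result.
        self.misses += 1
        data = self.fetch_block(block_id)
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            # Evict the least recently used block.
            self.blocks.popitem(last=False)
        return data


# Assumed toy backing store: a dict pretending to be a blob store.
blob_store = {i: f"block-{i}".encode() for i in range(8)}
cache = ReadThroughCache(blob_store.__getitem__, capacity=2)
```

Repeated reads of a hot block are served from the cache, so only cold reads touch the backing store. A real system would also have to handle writes and invalidation across sandboxes, which this sketch ignores.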
@pavitrabhalla @tensorlake Well, @PierreB80788038 works at Tensorlake! Also, file systems are really hard and there aren't really any perfect solutions. We put in a ton of work around ZeroFS to serve distributed file systems. There is a lot more work to be done in front of us, hence the sneak preview :)
@sriramsubram @tensorlake @PierreB80788038 The local blocks of the file system are snapshotted too, so I am not sure what happens when we restore with the local blocks but the file system state has moved forward.
@diptanu and sandboxes is like the best founder/product/market fit I've seen.
Diptanu's entire career has led to this moment, and he's more than ready to take over the world.
@aaqaishtyaq The machines have 100G NICs. Having said that, it could be 1-10 seconds depending on the size of the sandbox if the clone happens across the network. Same-node clone is 1-1.5 seconds.
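For a rough sense of the network part of a cross-machine clone, here is a back-of-the-envelope estimate of wire time on a 100 Gbit/s link. The function name, the `efficiency` parameter, and the 10 GiB example size are assumptions for illustration. It gives a lower bound only and does not model snapshot restore or any other overhead.

```python
def transfer_seconds(size_gib, link_gbps=100.0, efficiency=1.0):
    """Lower-bound time to move size_gib GiB over a link_gbps Gbit/s link.

    efficiency is an assumed fraction of line rate actually achieved (0-1].
    """
    size_bits = size_gib * 1024**3 * 8
    return size_bits / (link_gbps * 1e9 * efficiency)


# A hypothetical 10 GiB sandbox at full line rate: a bit under one second.
print(f"{transfer_seconds(10):.2f} s")
```

At half the line rate (`efficiency=0.5`) the same transfer takes twice as long, which is one way the estimate scales with sandbox size and real-world throughput.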
tensorlake sbx cp <sbx-id> coming tomorrow.
We have supported full memory snapshots of sandboxes from the early days, so technically you could snapshot a running VM and clone it to create copies.
Making this first class would mean we make this a one-line API call, and optimize some aspects of the process, like landing a clone as close as possible to the original sandbox to reduce latencies.
@colemurray @modal We do this while a sandbox is still running. Logs from archived sandboxes coming soon. You can see the commands run, stderr and stdout.
We built a simulator to understand the performance of @tensorlake's sandbox scheduler and dataplane during sandbox creation bursts.
We can safely simulate traffic bursts without spinning up 100s of very expensive machines.
Google talks about a similar simulator for Borg in their EuroSys 2015 paper. They go further and use cluster traces to optimize the scheduler on an ongoing basis.
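As a toy illustration of simulating a creation burst without real machines, here is a tiny event-driven sketch. It is not Tensorlake's simulator: `simulate_burst`, the "one create at a time per worker" model, and the fixed `create_seconds` are all simplifying assumptions.

```python
import heapq


def simulate_burst(num_requests, num_workers, create_seconds):
    """Simulate num_requests sandbox creates all arriving at t=0.

    Each of num_workers handles one create at a time (assumed model).
    Returns the completion time of every request, in dispatch order.
    """
    # Min-heap of the times at which each worker next becomes free.
    free_at = [0.0] * num_workers
    heapq.heapify(free_at)
    latencies = []
    for _ in range(num_requests):
        start = heapq.heappop(free_at)  # earliest available worker
        done = start + create_seconds
        latencies.append(done)
        heapq.heappush(free_at, done)
    return latencies


# Hypothetical burst: 10 creates, 4 workers, 1 second per create.
latencies = simulate_burst(10, 4, 1.0)
print(max(latencies))
```

A more realistic version would draw arrival times and create durations from recorded traces instead of constants, which is where trace-driven simulation comes in.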