We had an incident because we migrated traffic to a brand-new S3 bucket.
Our millions of servers instantly crushed the new bucket’s partitions and started getting slammed with 5xx errors.
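The post doesn't say how the incident was fixed, so the following is only a generic client-side mitigation: retrying throttled requests with exponential backoff and full jitter, so that many clients don't all retry at the same instant. `ThrottledError`, `backoff_delay`, and `call_with_backoff` are hypothetical names for illustration, not part of any AWS SDK.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a 5xx response such as S3's 503 Slow Down."""


def backoff_delay(attempt, base=0.1, cap=20.0, rng=random.random):
    """Full-jitter delay: a random value in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(request, max_attempts=6, sleep=time.sleep):
    """Call `request`, retrying on ThrottledError with a jittered delay."""
    for attempt in range(max_attempts):
        try:
            return request()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(backoff_delay(attempt))
```

Backoff only softens the retry storm. Ramping traffic onto a new bucket gradually, rather than all at once, would be the complementary server-friendly step.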
We run 18 million EC2 instances per month. At our scale, we see very rare bugs very frequently.
Last week, we received *half* an HTTP request. Not an HTTP 206, literally half a request.
Content-Length was 2350 bytes. The body was actually 1200 bytes, truncated mid-JSON document.
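A minimal sketch of how a receiver might catch this kind of truncation: compare the declared Content-Length with the bytes actually received before trying to parse the body. `inspect_request` is a hypothetical helper, and it assumes the headers arrive as a plain dict with an exact `Content-Length` key.

```python
import json


def inspect_request(headers, body):
    """Classify a request body as ok, truncated, oversized, or invalid JSON."""
    declared = int(headers["Content-Length"])
    received = len(body)
    if received < declared:
        return f"truncated: got {received} of {declared} bytes"
    if received > declared:
        return f"oversized: got {received} of {declared} bytes"
    try:
        json.loads(body)
    except json.JSONDecodeError:
        return "invalid JSON"
    return "ok"
```

With the numbers from the post, a 1200-byte body against a declared 2350 bytes is flagged as truncated before the JSON parser ever sees it.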
By default, the most popular Rust crate for PostgreSQL (tokio_postgres) waits for *2 hours!!* before timing out a dead connection.
All because of a bad decision from 1989 🧵👇️
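The post doesn't spell out the mechanism here, so this sketch rests on an assumption: that the two hours comes from the operating system's TCP keepalive default (on Linux, `tcp_keepalive_time` is 7200 seconds, in line with RFC 1122 from 1989, which requires the keepalive interval to default to no less than two hours). It uses plain Python sockets rather than tokio_postgres, so it says nothing about that crate's API. `enable_fast_keepalive` and `worst_case_detection_seconds` are hypothetical helper names.

```python
import socket


def enable_fast_keepalive(sock, idle=30, interval=10, probes=3):
    """Turn on TCP keepalive and, where the platform exposes them, shorten its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These per-socket option names come from Linux; guard each one so the
    # sketch still runs on platforms that lack it.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


def worst_case_detection_seconds(idle, interval, probes):
    """Idle time before the first probe, plus every unanswered probe."""
    return idle + interval * probes
```

With the Linux defaults (7200 s idle, 75 s interval, 9 probes), `worst_case_detection_seconds(7200, 75, 9)` gives 7875 seconds, a bit over two hours. The shorter values used above bring that down to 60 seconds.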
A famous bit of '90s hacker lore: after a server update, Trey Harris found that his campus mail server couldn't deliver email to any server more than about 500 miles away.
The root cause: a connection timeout erroneously set to "0", which in practice worked out to about 3 ms. At the speed of light, 3 ms is a bit over 500 miles.
Last week, something very similar happened to us. Except that instead of a misconfigured timeout, in our case the issue involved speeding up time.
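The 3 ms and 500 miles figures in the email story are easy to sanity-check. This is a small worked calculation using the vacuum speed of light and a one-way trip, which is an idealization: real signals in fiber or copper are slower and have to make a round trip.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre
METERS_PER_MILE = 1609.344            # exact, international mile


def light_distance_miles(seconds):
    """Distance light travels in a vacuum in `seconds`, in miles."""
    return SPEED_OF_LIGHT_M_PER_S * seconds / METERS_PER_MILE


print(round(light_distance_miles(0.003), 1))  # 558.8
```

So "500 miles" is a rounded-down version of roughly 559 miles, which fits a story where servers a little past the 500-mile mark still failed.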
We forked Chromium and bolted on a lock-free, zero-copy, low-latency shared-memory ring buffer written in Rust.
We needed to move 100+ MB/s of raw video over IPC, and Chromium's WebSocket implementation is dreadfully inefficient.
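For readers who haven't met a ring buffer before, here is a deliberately simplified sketch of the index arithmetic behind one. It is not the implementation described above: it runs in a single Python process, copies bytes, and is neither lock-free nor backed by shared memory. The `RingBuffer` class and its method names are hypothetical.

```python
class RingBuffer:
    """Fixed-capacity byte ring buffer with one writer and one reader."""

    def __init__(self, capacity):
        self.buf = bytearray(capacity)
        self.capacity = capacity
        self.head = 0  # total bytes ever written
        self.tail = 0  # total bytes ever read

    def free(self):
        """Bytes that can be written before the writer would lap the reader."""
        return self.capacity - (self.head - self.tail)

    def write(self, data):
        """Copy `data` in if it fits entirely; return False otherwise."""
        n = len(data)
        if n > self.free():
            return False
        start = self.head % self.capacity
        first = min(n, self.capacity - start)         # bytes before the wrap
        self.buf[start:start + first] = data[:first]
        self.buf[0:n - first] = data[first:]          # wrapped remainder
        self.head += n
        return True

    def read(self, n):
        """Return up to `n` unread bytes, following the wrap if needed."""
        n = min(n, self.head - self.tail)
        start = self.tail % self.capacity
        first = min(n, self.capacity - start)
        out = bytes(self.buf[start:start + first]) + bytes(self.buf[0:n - first])
        self.tail += n
        return out
```

In a cross-process version, the buffer would live in shared memory and `head` and `tail` would typically be atomic counters, with the writer only advancing `head` and the reader only advancing `tail`. That split is what makes a single-producer, single-consumer design workable without locks.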
At 7:00am PT every day, load on our systems spikes 300% in 60 seconds.
Recall is the API for meeting recording.
Everyone starts their meetings at the top of the hour, meaning we handle Black Friday-level traffic spikes 20 times every day.
At our scale, this involves launching hundreds of thousands of EC2 instances within seconds of each other.
We process over 3 TiB/sec of raw video at our peak load.
We're hiring systems engineers.
If this sounds interesting to you, DM me your GitHub!
@_anmonteiro @recallai Any cracked engineers who make it to the last stage of our interview process will also receive a framed photo of Linus Torvalds.
At @recallai, we ran 18M EC2 instances last month.
We use Rust to process 3 TB/second of raw video in real time.
If you’re a cracked engineer in SF, DM me your GitHub – if you interview with us, we’ll gift you a free bundle of K&R’s C, TLPI and TCP/IP Illustrated Vol 1 🔥