Awesome explainer from @MagedMMichael on Monad's caching mechanism, which handles a problem specific to blockchain: caching state key-value pairs for pre-finalized blocks
@builnad@kkuehlz@_jhunsaker 4 & 256 is the current recommended setting. I haven't measured thr and fiber variations recently. Last I checked 5 and 4 were basically equal so might as well go with 4 until 5 is decidedly better. Usually the difference between N and N+1 threads is only a few % points of tput.
@builnad@kkuehlz@_jhunsaker Although we wouldn't want to go too far that running more txns in parallel with tx N makes it slower. It's more important to get tx N done than to allow tx N+500 to start.
@builnad@kkuehlz@_jhunsaker Yes but not always. Depends on the improvement. For ex an improvement that makes the same txn hide io latency doesn't need to increase fibers.
It makes sense to increase fibers if cores are idle while tx N+256 can't run because tx N is not yet completed.
@kkuehlz@_jhunsaker@builnad The more io latency is reduced or hidden the more throughput will improve with more threads.
Potential improvements in the works could bump tx exec threads up to 5 or 6 for higher throughput.
Yes, sort of. The hazard pointer technique is suitable for use in frequent operations to protect access to dynamic objects (the access can be read-only or read-write) against (less frequent) concurrent removal of such objects.
The ratio of frequent to infrequent operations doesn't need to be too high. 10:1 is enough to make it worthwhile.
Examples of frequent operations are hash map lookups and dynamic queue push/pops. The corresponding infrequent operations would be deletions of items from hash maps and deletions of queue segments.
Hazard pointer protection is super fast, but is it scalable too? Yes. it is.
Hazard pointer protection is effectively perfectly scalable:
• Line 8: Writing to the hazard pointer almost never generates cache coherence traffic. Hazard pointers are padded and properly aligned to avoid false sharing among writes to different hazard pointers. Each hazard pointer is only written by one thread at a time (its owner) and is read very rarely (by a reclaimer). The ratio of writes (by owner) to reads (by a reclaimer) is in practice at worst in the order of 1,000:1 and more typically a lot higher (e.g., one write per lookup in a concurrent hash map vs one read per amortized reclamation of thousands of deleted items).
• Line 10: The lightweight asymmetric memory barrier doesn't generate any instructions.
• Line 12: Re-reading the pointer from the source is almost always an L1 cache hit and so it doesn't generate cache coherence traffic.
• Line 14: Perfectly scalable local computation that doesn't involve memory access or synchronization.
How fast is hazard pointer protection?
It takes a small fraction of a nanosecond to protect an object from unsafe reclamation using a hazard pointer.
Here is a concise representation of hazard pointer protection:
@davidtgoldblatt@paulmckrcu Operations on Folly ConcurrentHashMap use at most 3 hazard pointers, one for the bucket, 2 for hand-over-hand traversal.
The iterator has 3 HPs, one to protect the current bucket, one to protect the current kv node, the 3rd is just for convenient traversal on '++it'.
How fast is hazard pointer protection?
It takes a small fraction of a nanosecond to protect an object from unsafe reclamation using a hazard pointer.
Here is a concise representation of hazard pointer protection:
My point was the opposite. This part of the conversation started with @davidtgoldblatt wondering why nobody is doing hybrid RCU/HP.
I understood that to mean: (1) enter RCU critical section, (2) handle some pointers, (3) decide that some pointers need protection beyond this critical section, (4) protect these pointers using HP, (5) exit RCU critical section.
So I replied that my guess is that the performance cost of just remembering which pointers may need to be protected by HP-s may be on par with the perf cost of protecting pointers with HP-s from the start.
Then you, Paul, not me, brought up the similarity to read-write locks and ref counting.
So I replied that while these are functionally similar, the relative performance costs are different.
After some back and forth I referred to the example of (1) acquire shared mutex, (2) handle some shared pointers, (3) decide that some shared ptr refs need to be retained beyond the end of the critical section, (4) copy these shared ptrs, (5) release the shared mutex. My intent for this example was to be similar functionally to the hybrid RCU/HP case that IIUC David meant, but with different relative performance costs.
My guess is using hybrid shared mutex / shared ptrs is faster than pure atomic shared ptrs, whereas I doubt that hybrid RCU/HP is faster than pure HP (because of the relative cost of remembering which pointers may need HP protection is on par with outright HP protection in the first place, but that's not the case for the loading from atomic shared ptrs, which is relatively expensive).
So this is how we got in this rabbit hole. I hope I didn't make things more confusing :)
https://t.co/SDfbh2QIz6
Copying from same shared_ptr is thread safe. Each copy involves atomic increment of the use count in the control block associated with the protected object.
However, we need the shared mutex to protect from concurrent modification of the shared_ptr being copied from (which should acquire the lock exclusively).
I was thinking of something like this
where it is probably faster to acquire a lock and copy a shared_ptr than to read an atomic shared_ptr without the lock.
This is a big milestone: half of the codebase, implementing MonadBFT, RaptorCast, blocksync, statesync, mempool, etc. is open source!
This codebase is a treasure trove of performant systems engineering work. We hope this materially pushes the space forward.
Step by step.