Maged Michael @magedmmichael - Twitter Profile

3 months ago

Awesome explainer from @MagedMMichael on Monad's caching mechanism, which handles a problem specific to blockchain: caching state key-value pairs for pre-finalized blocks

19

202

25

9

12K

Maged Michael

@MagedMMichael

3 months ago

Describing Monad’s key-value cache design and how it handles undecided blocks.

Category Labs

@category_xyz

3 months ago

https://t.co/cQCsjVhuO9

7

120

25

23K

0

4

0

181

Maged Michael

@MagedMMichael

7 months ago

@builnad @kkuehlz @_jhunsaker 4 & 256 is the current recommended setting. I haven't measured thr and fiber variations recently. Last I checked 5 and 4 were basically equal so might as well go with 4 until 5 is decidedly better. Usually the difference between N and N+1 threads is only a few % points of tput.

2

3

0

1

109

Maged Michael

@MagedMMichael

7 months ago

@builnad @kkuehlz @_jhunsaker Although we wouldn't want to go too far that running more txns in parallel with tx N makes it slower. It's more important to get tx N done than to allow tx N+500 to start.

1

4

0

111

Who to follow

David is an NI agent and can make mistakes. Please double-check responses for correctness.

Omar Chaikhouni

@OCopines

focus on 🇸🇾 and 🇵🇸. international communication expert. fundraising to end poverty. #ynwa

Maged Michael

@MagedMMichael

7 months ago

@builnad @kkuehlz @_jhunsaker Yes but not always. Depends on the improvement. For ex an improvement that makes the same txn hide io latency doesn't need to increase fibers. It makes sense to increase fibers if cores are idle while tx N+256 can't run because tx N is not yet completed.

1

3

0

69

Maged Michael

@MagedMMichael

7 months ago

@kkuehlz @_jhunsaker @builnad The more io latency is reduced or hidden the more throughput will improve with more threads. Potential improvements in the works could bump tx exec threads up to 5 or 6 for higher throughput.

1

5

0

77

Maged Michael

@MagedMMichael

9 months ago

@0x_eunice @CppCon Thank you, Eunice! The talks will be posted by the conference on YouTube in a few weeks. I’ll post a link here. <3

0

1

0

66

Maged Michael

@MagedMMichael

9 months ago

At @CppCon

1

12

0

1

701

Maged Michael

@MagedMMichael

9 months ago

Proud to be part of the team @category_xyz

James

@_jhunsaker

9 months ago

Monad execution client is now open source (link below). This is the result of thousands of hours of effort by the team at @category_xyz. Enjoy

449

2K

244

48

262K

0

9

0

541

Maged Michael

@MagedMMichael

9 months ago

Yes, sort of. The hazard pointer technique is suitable for use in frequent operations to protect access to dynamic objects (the access can be read-only or read-write) against (less frequent) concurrent removal of such objects. The ratio of frequent to infrequent operations doesn't need to be too high. 10:1 is enough to make it worthwhile. Examples of frequent operations are hash map lookups and dynamic queue push/pops. The corresponding infrequent operations would be deletions of items from hash maps and deletions of queue segments.

0

52

Maged Michael

@MagedMMichael

9 months ago

Hazard pointer protection is super fast, but is it scalable too? Yes. it is. Hazard pointer protection is effectively perfectly scalable: • Line 8: Writing to the hazard pointer almost never generates cache coherence traffic. Hazard pointers are padded and properly aligned to avoid false sharing among writes to different hazard pointers. Each hazard pointer is only written by one thread at a time (its owner) and is read very rarely (by a reclaimer). The ratio of writes (by owner) to reads (by a reclaimer) is in practice at worst in the order of 1,000:1 and more typically a lot higher (e.g., one write per lookup in a concurrent hash map vs one read per amortized reclamation of thousands of deleted items). • Line 10: The lightweight asymmetric memory barrier doesn't generate any instructions. • Line 12: Re-reading the pointer from the source is almost always an L1 cache hit and so it doesn't generate cache coherence traffic. • Line 14: Perfectly scalable local computation that doesn't involve memory access or synchronization.

MagedMMichael's tweet photo. Hazard pointer protection is super fast, but is it scalable too? Yes. it is.

Hazard pointer protection is effectively perfectly scalable:
• Line 8: Writing to the hazard pointer almost never generates cache coherence traffic. Hazard pointers are padded and properly aligned to avoid false sharing among writes to different hazard pointers. Each hazard pointer is only written by one thread at a time (its owner) and is read very rarely (by a reclaimer). The ratio of writes (by owner) to reads (by a reclaimer) is in practice at worst in the order of 1,000:1 and more typically a lot higher (e.g., one write per lookup in a concurrent hash map vs one read per amortized reclamation of thousands of deleted items).
• Line 10: The lightweight asymmetric memory barrier doesn't generate any instructions.
• Line 12: Re-reading the pointer from the source is almost always an L1 cache hit and so it doesn't generate cache coherence traffic.
• Line 14: Perfectly scalable local computation that doesn't involve memory access or synchronization.

Maged Michael

@MagedMMichael

10 months ago

How fast is hazard pointer protection? It takes a small fraction of a nanosecond to protect an object from unsafe reclamation using a hazard pointer. Here is a concise representation of hazard pointer protection:

$MagedMMichael's tweet photo. How fast is hazard pointer protection? It takes a small fraction of a nanosecond to protect an object from unsafe reclamation using a hazard pointer. Here is a concise representation of hazard pointer protection:$

1

13

2

12

2K

1

9

1

634

Maged Michael

@MagedMMichael

10 months ago

@davidtgoldblatt @paulmckrcu Operations on Folly ConcurrentHashMap use at most 3 hazard pointers, one for the bucket, 2 for hand-over-hand traversal. The iterator has 3 HPs, one to protect the current bucket, one to protect the current kv node, the 3rd is just for convenient traversal on '++it'.

1

2

0

101

Maged Michael

@MagedMMichael

10 months ago

How fast is hazard pointer protection? It takes a small fraction of a nanosecond to protect an object from unsafe reclamation using a hazard pointer. Here is a concise representation of hazard pointer protection:

$MagedMMichael's tweet photo. How fast is hazard pointer protection? It takes a small fraction of a nanosecond to protect an object from unsafe reclamation using a hazard pointer. Here is a concise representation of hazard pointer protection:$

1

13

2

12

2K

Maged Michael

@MagedMMichael

10 months ago

@paulmckrcu @davidtgoldblatt Yes. This seems like a good use case for hybrid RCU/HP.

0

2

0

1

40

Maged Michael

@MagedMMichael

10 months ago

My point was the opposite. This part of the conversation started with @davidtgoldblatt wondering why nobody is doing hybrid RCU/HP. I understood that to mean: (1) enter RCU critical section, (2) handle some pointers, (3) decide that some pointers need protection beyond this critical section, (4) protect these pointers using HP, (5) exit RCU critical section. So I replied that my guess is that the performance cost of just remembering which pointers may need to be protected by HP-s may be on par with the perf cost of protecting pointers with HP-s from the start. Then you, Paul, not me, brought up the similarity to read-write locks and ref counting. So I replied that while these are functionally similar, the relative performance costs are different. After some back and forth I referred to the example of (1) acquire shared mutex, (2) handle some shared pointers, (3) decide that some shared ptr refs need to be retained beyond the end of the critical section, (4) copy these shared ptrs, (5) release the shared mutex. My intent for this example was to be similar functionally to the hybrid RCU/HP case that IIUC David meant, but with different relative performance costs. My guess is using hybrid shared mutex / shared ptrs is faster than pure atomic shared ptrs, whereas I doubt that hybrid RCU/HP is faster than pure HP (because of the relative cost of remembering which pointers may need HP protection is on par with outright HP protection in the first place, but that's not the case for the loading from atomic shared ptrs, which is relatively expensive). So this is how we got in this rabbit hole. I hope I didn't make things more confusing :) https://t.co/SDfbh2QIz6

1

0

90

Maged Michael

@MagedMMichael

10 months ago

Copying from same shared_ptr is thread safe. Each copy involves atomic increment of the use count in the control block associated with the protected object. However, we need the shared mutex to protect from concurrent modification of the shared_ptr being copied from (which should acquire the lock exclusively).

1

0

91

Maged Michael

@MagedMMichael

10 months ago

I was thinking of something like this where it is probably faster to acquire a lock and copy a shared_ptr than to read an atomic shared_ptr without the lock.

MagedMMichael's tweet photo. I was thinking of something like this
where it is probably faster to acquire a lock and copy a shared_ptr than to read an atomic shared_ptr without the lock.

1

0

105

MagedMMichael retweeted

tunez

@cryptunez

11 months ago

https://t.co/WmDUsv2eGZ

121

1K

56

168

155K

MagedMMichael retweeted

Keone Hon

@keoneHD

10 months ago

This is a big milestone: half of the codebase, implementing MonadBFT, RaptorCast, blocksync, statesync, mempool, etc. is open source! This codebase is a treasure trove of performant systems engineering work. We hope this materially pushes the space forward. Step by step.

733

3K

758

26

131K

Maged Michael

@MagedMMichael

10 months ago

The cat is making Monad faster and faster.

0

7

1

0

136

Maged Michael

@MagedMMichael

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users