A new blog on Windows - thread management, thread synchronization primitives with examples.
Explained :
1. Race conditions with simple and as well as low-level explanation
2. Dead locks
3. Interview ready problems like producer consumer problem etc.
https://t.co/koFr0HaGaU
https://t.co/QOhm0HDWg1
From ARM v8.2-A, we have LSE(large system extension) and LSE2 from v8.4 onwards I guess, could be wrong about the versions here.
We have atomic RMW in the memory subsystem itself. Yes, the modification can happen in memory fabric or in L1/L2 cache controllers.
LSE introduced on Armv8.1-A, it improved the performance of atomic operations in multi-core and reduced bus traffic.
The operations like LDADD(atomic add), CAS (compare and swap) performs the entire atomic read-modify-write in a single instruction.
Most beautiful thing:
"LSE pushes the arithmetic/logic operation down into the cache subsystem (the interconnect or L3 coherency controller) where the data already resides" - Compute near data instead of CPU fetching the data, modifying it and storing it again to avoid cache ping-pong problems which existed in before v8.1 with LDAXR, STLXR (it fails if another core modifies the value so triggers a retry - more on this later)
Mutex Internals: OS Synchronization Object to Hardware internals
When I started reading about multi-threading, I understood that I need to use Mutex for synchronisation. But never actually went into deep to understand it. My understanding about Mutex was that it ensures synchronisation using atomic primitives. Plain and simple.
Recently I've started reading about CPU micro architecture and memory subsystems concepts on ARM-A like Out-of-order execution (speculations, replays, commits etc), Reorder buffers, store to load forwarding, AXI, and AMBA CHI etc.
My understanding was vague back then. I didn't know the difference between memory ordering vs atomicity. Because Mutexes essentially ensure ownership+memory ordering+atomicity, But HOW?
Before going deep into it, let's clarify memory ordering on x86 vs ARM. x86 has a strong memory model and ARM follows weak memory ordering to ensure performance. What do I mean by Strong memory ordering?
TSO (Total Store Ordering) on x86 dictates how a core's memory operations are observed by other cores. In simple terms, loads and stores are generally observed in program order, except that a load may appear before an older store to a different address. In ARM, the memory model is weaker, so we must explicitly specify the ordering we want other cores to observe using acquire/release operations or memory barriers.
What are these acquire and release semantics?
LDAR, STLR are Ordering only semantics* and LDAXR, STLXR are Reservation(exclusive monitor) + Atomic RMW primitives.
LDAR (Load-Acquire) ensures that subsequent memory accesses cannot be observed before the acquire load.
STLR (Store-Release) ensures that prior memory accesses become visible before the release store becomes visible.
LDAXR(Load Exclusive Acquire) and STLXR(Store Exclusive Release) are similar to the above in ordering part and in addition to ordering we also have exclusive monitor + atomicity.
Exclusive monitor : "I loaded this memory location and I intend to update it atomically later."
So, because of this exclusive monitor we get proper atomicity, How ?
For example, after LDAXR executed on core 0, we marked this address/cache line as reserved in Exclusive monitor and when we use STLXR - conflicting write invalidates reservation, so it retries if failed.
Now we understand memory ordering, acquire and release semantics and Atomicity. So, let's get into what a Mutex really is.
Mutex is just a structure that has a shared variable, lock() and unlock() methods (Simplifying a lot here)
We use the synchronization object like this :
lock()
critical_section()
unlock()
*Remember that, CPU can still execute in Out-of-order, but just that commits happen in specific order. Meaning loads after lock() can still be speculated(crazy to think but it can happen) but cannot be observed as if they occurred before the acquire.
Okay, back to Mutex. When we use lock() and unlock() typically the below happens.
void lock(int *l) {
while (atomic_exchange_acquire(l, 1) == 1)
;
}
void unlock(int *l) {
atomic_store_release(l, 0);
}
Or in Assembly something like below.
lock:
retry:
LDAXR X0, [lock]
CBNZ X0, retry
MOV X1, #1
STLXR W2, X1, [lock]
CBNZ W2, retry
RET
unlock:
STLR WZR, [lock] // store 0 with release
RET
Using STLR because all older stores must be globally visible before this store.
Here, we make the mutex object's shared value as 1 (lock acquired) and the other threads can spin or wait for it.
Real OS implementation -
Windows implementation of a Mutex :
4: kd> dt nt!_KMUTANT +0x000 Header : _DISPATCHER_HEADER +0x018 MutantListEntry : _LIST_ENTRY +0x028 OwnerThread : Ptr64 _KTHREAD +0x030 MutantFlags : UChar +0x030 Abandoned : Pos 0, 1 Bit +0x030 Spare1 : Pos 1, 7 Bits +0x031 ApcDisable : UChar
It has an ownerThread for ownership tracking and support waiting and wake-up other threads mechanisms, integrates with the scheduler, and even more.
Locks work not because CPUs stop reordering instructions or executing out of order, but because atomic operations, acquire/release semantics, cache coherence, and architectural memory-ordering guarantees ensure that all cores observe synchronization events in a consistent manner. Although operating-system synchronization objects such as mutexes can be quite sophisticated internally, they ultimately rely on these same hardware foundations: atomic read-modify-write operations, cache coherence protocols, and the memory model provided by the processor architecture. Together, these mechanisms allow CPUs to aggressively speculate and reorder execution internally while still providing correct and predictable synchronization behavior to software.
Atomicity ≠ Memory Ordering
Atomicity means - RMW(read-modify-write) operation should be completed as a single indivisible operation, so no other core can see an in-between state.
Memory ordering is different. It defines the order in which loads and stores can be observed by other cores.
Memory ordering + atomicity together are what locks provide.
LSE introduced on Armv8.1-A, it improved the performance of atomic operations in multi-core and reduced bus traffic.
The operations like LDADD(atomic add), CAS (compare and swap) performs the entire atomic read-modify-write in a single instruction.
Most beautiful thing:
"LSE pushes the arithmetic/logic operation down into the cache subsystem (the interconnect or L3 coherency controller) where the data already resides" - Compute near data instead of CPU fetching the data, modifying it and storing it again to avoid cache ping-pong problems which existed in before v8.1 with LDAXR, STLXR (it fails if another core modifies the value so triggers a retry - more on this later)
I started reading about Store-to-Load forwarding and now I'm exploring the entire memory subsystem microarchitecture.
Currently reading about ARMv8.1-A Large System Extensions(LSE), AMBA CHI protocols(Communication protocol used between memory subsystems like cache-to-cache interconnects). Will write about this in detail soon :)
Single Load and store are Atomic:
What I learnt:-
In modern CPU architectures, *aligned* data accesses are atomic for single loads and stores, provided the data types are *naturally aligned*. In short, what it means is that the reader shouldn't see tearing.
On my hardware :-
@greenboxal I believe this could be it. But I'll check the PMU counters to understand more. So, probably instruction cycles should be more for the unaligned data but let me check.
I started reading about Store-to-Load forwarding and now I'm exploring the entire memory subsystem microarchitecture.
Currently reading about ARMv8.1-A Large System Extensions(LSE), AMBA CHI protocols(Communication protocol used between memory subsystems like cache-to-cache interconnects). Will write about this in detail soon :)
After many years, I finally got a chance to read about Clocks and Timers. I've read about LC circuits, impedance, resonance, oscillators(feedback + loop gains + inverters + amplifiers), and piezoelectric concepts in Electrical engineering. But never had a chance to actually work/understand the concept of CPU clocks or Timers because I never went in-depth into VLSI or into electronics in general.
These are the same concepts used in Clocks and Timers. Quartz Crystal is a kind of LC circuit with High Q factor, which results in precise and stable frequencies. But it's not a conductor ! Quartz Crystal inside an electric field changes its polarity thus generating high voltages. When the AC voltage from the circuit hits the crystal, it physically flexes. If that AC frequency matches the crystal’s "natural" mechanical frequency, the crystal vibrates violently (resonance) and its impedance drops sharply. This is how it generates stable frequency !
The combination below is what makes a CPU clock.
1. Pierce oscillator
2. Phase locked loops (PLL) circuits(Since the oscillator just acts as a frequency selector - stable small frequency in MHz, this circuit amplifies to high frequencies)
3. Clock trees (converting sine-ish signals to digital squarish waves)
Think of Timers as just counters with stable frequency thus providing the proper times for OS.
I am still learning, so lots of over-simplications are there. I'll share more in upcoming tweets.
After many years, I finally got a chance to read about Clocks and Timers. I've read about LC circuits, impedance, resonance, oscillators(feedback + loop gains + inverters + amplifiers), and piezoelectric concepts in Electrical engineering. But never had a chance to actually work/understand the concept of CPU clocks or Timers because I never went in-depth into VLSI or into electronics in general.
These are the same concepts used in Clocks and Timers. Quartz Crystal is a kind of LC circuit with High Q factor, which results in precise and stable frequencies. But it's not a conductor ! Quartz Crystal inside an electric field changes its polarity thus generating high voltages. When the AC voltage from the circuit hits the crystal, it physically flexes. If that AC frequency matches the crystal’s "natural" mechanical frequency, the crystal vibrates violently (resonance) and its impedance drops sharply. This is how it generates stable frequency !
The combination below is what makes a CPU clock.
1. Pierce oscillator
2. Phase locked loops (PLL) circuits(Since the oscillator just acts as a frequency selector - stable small frequency in MHz, this circuit amplifies to high frequencies)
3. Clock trees (converting sine-ish signals to digital squarish waves)
Think of Timers as just counters with stable frequency thus providing the proper times for OS.
I am still learning, so lots of over-simplications are there. I'll share more in upcoming tweets.
@j4_v_3s It's definitely not helpful in all cases. I was referring to simple logics like -
helper(add);
helper(multiply);
helper() takes the function pointer and does something with it - in simple terms, it can be read as helper() accepts behaviour.
Function pointers are really beneficial for better readability and coding patterns. But the key issue in performance critical programs is that the compiler cannot inline if it's determined at runtime(dynamic decisions). Because of indirect branches which hurt branch predictions, which in turn causes pipeline stalls.
Recently I've completed cache internals and full on about Store-to-Load forwarding. In the coming days, we will look more into branch predictions :)
[C] Function Pointers - Everything you need to know!
Function pointers are pointers that point to code. When these are dereferenced, the fetched data is treated like instructions and the CPU executes them.
""
When we dereference a function pointer, the PC register in the CPU is set to the address held by the pointer.
""
The full note is available here: https://t.co/wTKPQDMxQn