Do you like footnotes? Do you like long, rambling lists?
Then, this might be the thing for you:
https://t.co/nUI5pxKTgG
There, I smash my personal footnote count record (20). As for lists? It has lists *within* lists. The only thing left is to write it in LISP.
@supahvee1234 This does work, though I had to disable the layering check and the sandbox. It's possible that one or both could be re-enabled with more work.
@supahvee1234 Looking into this myself now, though clang-only.
There's this: https://t.co/GmWOwKYwHc which seems to have worked out as it's still in there today:
https://t.co/PKCuoAPNIP
@tavianator That's interesting. My intuition is that prefetchw is the right thing before a CAS because you want the line in E state, since a CAS, like any write, requires the line in that state.
So prefetchw provides the hint needed to get it into that state, potentially avoiding two back-to-back transitions (e.g., to S, then E).
However, my testing seems to show that even without it, things work better than you'd expect if the CPU were just using simple textbook rules about which state to initially bring a line in; no doubt there are predictors involved.
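A minimal C++ sketch of the pattern being discussed, assuming GCC or Clang: `__builtin_prefetch(addr, 1)` issues a prefetch with write intent right before a CAS loop. The function name `fetch_add_via_cas` is hypothetical. Whether the builtin actually becomes PREFETCHW depends on the target flags, and whether the hint helps at all is what the testing above is probing. Note that the plain load seeding `expected` is exactly the kind of read that, without the hint, might bring the line in shared state first.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: "prefetch for write, then CAS".
// __builtin_prefetch(addr, rw) is a GCC/Clang builtin; rw = 1 asks for a prefetch
// in anticipation of a write. On x86 it becomes PREFETCHW only when the target
// enables that instruction (e.g. -mprfchw); otherwise the compiler may emit an
// ordinary read prefetch instead. It is only a hint and never changes results.
std::uint64_t fetch_add_via_cas(std::atomic<std::uint64_t>& counter,
                                std::uint64_t delta) {
  __builtin_prefetch(&counter, 1);  // write-intent hint on the counter's line
  std::uint64_t expected = counter.load(std::memory_order_relaxed);
  // On failure, compare_exchange_weak stores the current value into `expected`,
  // so each retry uses fresh data.
  while (!counter.compare_exchange_weak(expected, expected + delta,
                                        std::memory_order_acq_rel,
                                        std::memory_order_relaxed)) {
  }
  return expected;  // value just before our successful CAS, like fetch_add
}
```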
The latest release of the simdutf C++ library (6.0.0) brings a more convenient API for C++20 users. Where you used to have to provide both a pointer and a size parameter, you can now often just pass your container:
std::vector<char> data{1, 2, 3, 4, 5};
// C++11 API
auto cpp11 = simdutf::autodetect_encoding(data.data(), data.size());
// C++20 API
auto cpp20 = simdutf::autodetect_encoding(data);
Link in the comments.
@corsix @davidtgoldblatt Yeah, good point, and yes, I was thinking of the scalar side.
Without affineqb, I guess you could do something not terrible with vanilla PSHUFB: split the top and bottom nibbles and use a PSHUFB LUT to reverse each, then reassemble them with the nibbles swapped and PSHUFB again to reverse the bytes.
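The nibble-LUT approach can be sketched as follows. To keep it runnable on any machine, this models PSHUFB's semantics in scalar C++ (a hypothetical `pshufb` helper) instead of calling `_mm_shuffle_epi8`; comments name the SSE operation each step would correspond to. It assumes the 128-bit value is stored little-endian (byte 0 least significant), so the result is a full 128-bit bit reversal.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// A 16-byte vector standing in for an XMM register.
using Vec16 = std::array<std::uint8_t, 16>;

// Scalar model of PSHUFB: each output byte is table[ctrl & 0x0F],
// or 0 if the control byte's high bit is set.
Vec16 pshufb(const Vec16& table, const Vec16& ctrl) {
  Vec16 out{};
  for (std::size_t i = 0; i < 16; ++i) {
    out[i] = (ctrl[i] & 0x80) ? 0 : table[ctrl[i] & 0x0F];
  }
  return out;
}

// Reverse all 128 bits: reverse bits within each byte via two nibble LUTs,
// then reverse the byte order with one more shuffle.
Vec16 reverse_bits_128(const Vec16& v) {
  // 4-bit reversal of the index, pre-shifted into the high nibble.
  const Vec16 low_nibble_lut = {0x00, 0x80, 0x40, 0xC0, 0x20, 0xA0, 0x60, 0xE0,
                                0x10, 0x90, 0x50, 0xD0, 0x30, 0xB0, 0x70, 0xF0};
  // 4-bit reversal of the index, left in the low nibble.
  const Vec16 high_nibble_lut = {0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
                                 0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF};
  const Vec16 reverse_bytes = {15, 14, 13, 12, 11, 10, 9, 8,
                               7,  6,  5,  4,  3,  2,  1, 0};

  Vec16 lo{}, hi{};
  for (std::size_t i = 0; i < 16; ++i) {
    lo[i] = static_cast<std::uint8_t>(v[i] & 0x0F);         // AND with 0x0F
    hi[i] = static_cast<std::uint8_t>((v[i] >> 4) & 0x0F);  // SSE: 16-bit shift, then mask
  }
  const Vec16 a = pshufb(low_nibble_lut, lo);   // reversed low nibble, now on top
  const Vec16 b = pshufb(high_nibble_lut, hi);  // reversed high nibble, now on bottom
  Vec16 per_byte{};
  for (std::size_t i = 0; i < 16; ++i) {
    per_byte[i] = static_cast<std::uint8_t>(a[i] | b[i]);  // OR the halves together
  }
  return pshufb(per_byte, reverse_bytes);  // final byte-order reversal
}
```

For example, a single input byte `0x12` (binary `0001 0010`) in lane 0 ends up as `0x48` (binary `0100 1000`) in lane 15.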
@tavianator @corsix Yes.
There's another one about flag handling for folded immediates too, which is pretty interesting if only to understand the complexity caused by fault handling.
@tavianator @eigenform BTW, feel free to stop testing my suggestions as soon as you get bored (I don't have an Alder Lake, so I can't do it myself).
Now I'm curious whether a chain like:
andn rax, rbx, rcx
lea rbx, [rax + 1]
lea rcx, [rax + 2]
...
executes at 3 instructions/cycle.
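A back-of-the-envelope model of why 3 instructions/cycle would be notable, assuming 1-cycle latency for both andn and the simple leas, unlimited issue width, and no port limits: the loop-carried path andn → lea → andn takes 2 cycles per 3 instructions, which caps throughput at 1.5 IPC. Reaching 3 IPC would require the lea results to be available with effectively zero latency. The zero-latency case below is purely hypothetical, and `chain_ipc` is an illustrative helper, not a measurement tool.

```cpp
#include <algorithm>

// Latency-only model of the repeating chain:
//   andn rax, rbx, rcx   ; reads rbx, rcx
//   lea  rbx, [rax + 1]  ; reads rax
//   lea  rcx, [rax + 2]  ; reads rax
// Each variable holds the cycle at which that register's value becomes ready;
// only data dependencies are modeled (no width, port, or frontend limits).
double chain_ipc(long andn_latency, long lea_latency, long iterations) {
  long rax = 0, rbx = 0, rcx = 0;
  for (long i = 0; i < iterations; ++i) {
    rax = std::max(rbx, rcx) + andn_latency;
    rbx = rax + lea_latency;
    rcx = rax + lea_latency;
  }
  const long finish = std::max({rax, rbx, rcx});
  return 3.0 * static_cast<double>(iterations) / static_cast<double>(finish);
}
```

With `chain_ipc(1, 1, 1000)` the model gives 1.5; with a hypothetical zero-latency lea, `chain_ipc(1, 0, 1000)`, it gives 3.0.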
@tavianator Yeah, the syntax is like
MpAB NpCD ...
which means M uops to any of ports A, B and N uops to any of ports C, D, and so on. If N or M is omitted, it is 1.
This still isn't really enough to figure out all the details, since the uops may have different dependency relationships: e.g., a 3-uop operation may have 2 independent uops which feed into the third, have all of them serially dependent, or have 1 initial uop which feeds into the other two (this makes sense if there are multiple outputs or side effects).
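As a concrete reading of the "MpAB NpCD" notation, here is a small hypothetical parser (`parse_port_usage`, `UopGroup`, and `total_uops` are illustrative names; the thread does not say which tool prints this format). It assumes every port is named by a single digit and treats a missing count as 1. As noted above, the result captures only counts and port sets, not the dependencies between uops.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One term of the notation: `count` uops, each of which may issue to any port in `ports`.
struct UopGroup {
  int count;
  std::vector<int> ports;
};

// Parses e.g. "2p06 p23": 2 uops to port 0 or 6, then 1 uop to port 2 or 3.
// Tokens without a 'p' (such as "...") are skipped; non-digit characters after
// the 'p' are ignored. std::stoi throws std::invalid_argument if the count
// prefix is not a number.
std::vector<UopGroup> parse_port_usage(const std::string& spec) {
  std::vector<UopGroup> groups;
  std::istringstream in(spec);
  std::string token;
  while (in >> token) {
    const std::size_t p = token.find('p');
    if (p == std::string::npos) continue;
    UopGroup g{p == 0 ? 1 : std::stoi(token.substr(0, p)), {}};
    for (std::size_t i = p + 1; i < token.size(); ++i) {
      if (std::isdigit(static_cast<unsigned char>(token[i]))) {
        g.ports.push_back(token[i] - '0');
      }
    }
    groups.push_back(g);
  }
  return groups;
}

// Total number of uops across all terms, e.g. 3 for "2p06 p23".
int total_uops(const std::vector<UopGroup>& groups) {
  int total = 0;
  for (const auto& g : groups) total += g.count;
  return total;
}
```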