@ciphergoth @oconnor663 @cryptodavidw It wouldn't make much sense to benchmark universal hashes against unkeyed collision-resistant hash functions. But an almost (Δ-)universal hash section on eBACS could be useful on its own.
@kste_ @ciphergoth @veorq @WatsonLadd @SchmiegSophie The best trail (the usual caveats apply) for Salsa jumps from 2^-18 to 2^-46 going from 3 to 4 rounds; the best trail for ChaCha jumps from 2^-12 to 2^-39. But restricting the differences to the attacker-controlled 128 bits instead of the entire space would greatly decrease these probabilities.
@SeanieCurran @veorq The comparison formulas there were derived independently and, if I remember correctly, unsigned < requires one fewer operation than Hacker's Delight.
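For reference, a minimal sketch of a branch-free unsigned < in the style of the Hacker's Delight baseline, as I recall it: the borrow out of x - y, read from the top bit. The independently derived formulas mentioned above are not reproduced in this thread, so this shows only the baseline; the name `ult_hd` and the 32-bit width are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit words with Python integers

def ult_hd(x, y):
    """Return 1 if x < y as unsigned 32-bit words, else 0, without branching.

    Computes the borrow out of x - y: (~x & y) | ((~x | y) & (x - y)),
    then extracts the most significant bit.
    """
    nx = ~x & MASK32          # bitwise complement of x, truncated to 32 bits
    d = (x - y) & MASK32      # wrapped 32-bit difference
    return ((nx & y) | ((nx | y) & d)) >> 31
```

Checking it against Python's own `<` on edge values such as 0, 1, 0x7FFFFFFF, 0x80000000 and 0xFFFFFFFF is a quick sanity test.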
@chrisrohlf Probably a similar interface to WRMSR or XSETBV: register index in ECX, upper bits in EDX. Since there's only one 32-bit register so far, both are hardcoded to 0.
@oconnor663 rdtsc(p) no longer counts cycles in most chips; it is a timer that runs at the nominal frequency of the processor, but the processor itself can clock higher or lower. So you need to force it to also run at the nominal frequency to have reasonably accurate cycle counts.
@oconnor663 Another thing---those (particularly the single-threaded) numbers are either too good to be true, or you're not actually disabling Turbo Boost for measuring.
@oconnor663 There's little point in an AVX2 implementation of BLAKE2s, beyond taking advantage of AVX512F's native rotation instructions and such. On another note, have you considered using the compression function directly to specify the hash?
@oconnor663 @zooko NEON should make a big difference, seeing that it has native 64-bit addition. On SUPERCOP, blake2b generally outperforms blake2s where NEON is present, e.g., https://t.co/45q5zSFWzC On the other hand, blake2s does not generally benefit from NEON, but tree'd blake2s might.
@oconnor663 @zooko Twitter is really not the best medium for this. Everything's out of order. blake2sp is essentially the same speed as blake2bp but is more sensitive to compiler codegen quirks, so depending on compiler version/flags it is often slower.
@pbarreto @cryptojedi @dsp6s The branch is caused by the (unintentional?) conversion of `flip` to `double`, not the mask generation. https://t.co/fRt16doj8t Cleaner version: https://t.co/g1D92s0zxY
@oe1cxw @rygorous Neat, the high part does the inversion itself. You can also compute the xor of any number of rotations of a word; for example SHA-256's S1 is doable as clmul(e, 0x4200080) ^ clmulh(e, 0x4200080).
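A minimal sketch of the rotation trick, assuming `clmul` and `clmulh` return the low and high 32 bits of the 64-bit carry-less product (here emulated bit by bit in software, not with a hardware instruction). The constant 0x4200080 has bits 7, 21 and 26 set, i.e. left rotations by 32-25, 32-11 and 32-6. The helper names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def clmul64(a, b):
    """Carry-less (GF(2)[x]) product of two 32-bit words, up to 63 bits wide."""
    r = 0
    for i in range(32):
        if (b >> i) & 1:
            r ^= a << i
    return r

def clmul(a, b):
    """Low 32 bits of the carry-less product."""
    return clmul64(a, b) & MASK32

def clmulh(a, b):
    """High 32 bits of the carry-less product."""
    return clmul64(a, b) >> 32

def rotr32(x, n):
    """Rotate a 32-bit word right by n bits (0 < n < 32)."""
    return ((x >> n) | (x << (32 - n))) & MASK32

def sha256_S1_reference(e):
    """SHA-256 big sigma 1: ROTR6 ^ ROTR11 ^ ROTR25."""
    return rotr32(e, 6) ^ rotr32(e, 11) ^ rotr32(e, 25)

def sha256_S1_clmul(e):
    """Same value via one constant: low half holds e << k, high half the wrapped bits."""
    c = 0x4200080  # (1 << 26) | (1 << 21) | (1 << 7)
    return clmul(e, c) ^ clmulh(e, c)
```

Each set bit k of the constant contributes `e << k` to the low half and `e >> (32 - k)` to the high half, so xoring the halves gives the xor of the corresponding rotations.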
@oe1cxw @rygorous Since the Gray code is bit-reversed polynomial multiplication by x + 1, whose inverse modulo x^32 is all 1s, you can also have grev32(clmul32(grev32(x ^ (x >> 1), 31), -1), 31) == x.
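A minimal sketch of that identity, assuming `grev32(x, 31)` is a full 32-bit bit reversal and `clmul32` returns the low 32 bits of the carry-less product. Both are emulated in plain software here, and -1 is taken as the all-ones word 0xFFFFFFFF. The name `gray_decode` is illustrative.

```python
MASK32 = 0xFFFFFFFF

def grev32(x, k):
    """Generalized bit reverse: swap adjacent blocks of size 1, 2, 4, 8, 16 per bit of k."""
    if k & 1:
        x = ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1)
    if k & 2:
        x = ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2)
    if k & 4:
        x = ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4)
    if k & 8:
        x = ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8)
    if k & 16:
        x = ((x & 0x0000FFFF) << 16) | ((x & 0xFFFF0000) >> 16)
    return x

def clmul32(a, b):
    """Low 32 bits of the carry-less product of two 32-bit words."""
    a &= MASK32
    b &= MASK32  # maps -1 to the all-ones word
    r = 0
    for i in range(32):
        if (b >> i) & 1:
            r ^= a << i
    return r & MASK32

def gray_encode(x):
    """Binary-reflected Gray code of a 32-bit word."""
    return x ^ (x >> 1)

def gray_decode(g):
    """Invert the Gray code: reverse bits, multiply by 1/(x + 1) mod x^32, reverse back."""
    return grev32(clmul32(grev32(g, 31), -1), 31)
```

Multiplying by the all-ones word computes a prefix xor from the low bit upward; the two bit reversals turn that into the prefix xor from the top bit downward that Gray decoding needs.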