We just released Oodle 2.9.13.
Significantly increased BC7 encoding speed (about 20-25% encode time reduction for non-RDO on typical content, 25-30% encode time reduction for RDO) at slightly increased quality.
Also several bug fixes and experimental WASM 64-bit support.
@daniel_collin On x86, same thing with PSUBW + PMULHRSW + PADDW, FWIW. (PMULHRSW is basically the same as ARM SQRDMULH, the just-multiply-not-multiply-accumulate version of SQRDMLAH.)
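A scalar sketch of the per-lane arithmetic may make the comparison concrete. The two multiply models follow the documented instruction semantics for 16-bit lanes. `lerp_q15` is an assumption: the tweet does not say what the three-instruction sequence computes, and a fixed-point interpolation `a + (b - a) * t` is only one plausible reading. All function names here are hypothetical.

```python
def _wrap16(x):
    """Wrap an integer to signed 16-bit two's complement (what a 16-bit lane holds)."""
    return ((x + 0x8000) & 0xFFFF) - 0x8000

def pmulhrsw(a, b):
    """x86 PMULHRSW on one lane: rounded high half of a Q15 product, no saturation."""
    return _wrap16((a * b + 0x4000) >> 15)

def sqrdmulh(a, b):
    """ARM SQRDMULH on one 16-bit lane: doubling, rounding, high half, saturating."""
    t = (2 * a * b + 0x8000) >> 16
    return max(-0x8000, min(0x7FFF, t))

def lerp_q15(a, b, t):
    """ASSUMPTION: the PSUBW + PMULHRSW + PADDW sequence read as a Q15 lerp."""
    diff = _wrap16(b - a)        # PSUBW
    scaled = pmulhrsw(diff, t)   # PMULHRSW, t is a Q15 fraction
    return _wrap16(a + scaled)   # PADDW
```

The two multiplies agree on every input pair except `(-32768, -32768)`, where SQRDMULH saturates to 32767 and PMULHRSW wraps to -32768.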
New blog post: "BC7 optimal solid-color blocks" https://t.co/MXQ3btVEGj clearing out my "I should write this up" queue, this technique is from... *checks git logs* May 2017. Oh my. (I have quite the backlog.)
@tom_forsyth PMULHW is at 0x0f 0xe5. PMULHUW is 0x0f 0xe4. MUL and IMUL are ModR/M reg=4 and reg=5 (/4 and /5) in their group. It's possible they just blocked out things this way by coincidence, but given this and Andy's comments, I doubt it.
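The pattern being pointed at can be checked mechanically: in each pair, the unsigned variant has an even encoding and the signed variant the next odd one. A small sketch follows. The table layout and names are illustrative, and the MUL/IMUL entries are the ModR/M reg-field values (/4 and /5) of the one-operand forms in the 0xF6/0xF7 group.

```python
# (encoding value, is_signed): second opcode byte for the SIMD multiplies,
# ModR/M reg field for the one-operand scalar multiplies.
ENCODINGS = {
    "PMULHUW": (0xE4, False),
    "PMULHW":  (0xE5, True),
    "MUL":     (4, False),
    "IMUL":    (5, True),
}

def low_bit_matches_signedness(table):
    """True if bit 0 of every encoding is 1 for signed and 0 for unsigned."""
    return all((code & 1) == int(signed) for code, signed in table.values())
```

`low_bit_matches_signedness(ENCODINGS)` returns `True`, which fits the "probably not a coincidence" reading but does not prove it.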
New blog post: "Why those particular integer multiplies?" https://t.co/YSyLngUack some explanation and some speculation on the integer SIMD multiplies offered in x86, along with some history
@tom_forsyth Because at the time there was a mandate to be "more RISC-y" which management at the time interpreted as "fewer instructions is good". Andy Glew was still publicly salty about it 5 years later. https://t.co/yL0f2W3deB
@geofflangdale It's different for every "iteration" and BC7 decode does it 1-3 times in a row. The actual decoder has this in vector regs so I don't have PDEP/PEXT to begin with.
@Simon_Fe1 @tom_forsyth @FreyaHolmer I was talking about Booth encoding in a regular multiplier (you never Booth encode both operands). I'm pretty sure squarers don't Booth encode at all, yes.
@tom_forsyth @FreyaHolmer The main application I'm aware of is "High-Speed Function Approximation Using a Minimax Quadratic Interpolator" by Piñeiro, Oberman, Muller and Bruguera. (Internals of NVidia GPU SFUs at some point, I think their current SFUs are still descended from this.)
@tom_forsyth @FreyaHolmer Squarers are mostly a thing in special function units for polynomial eval.
You always only Booth encode one of the arguments, the other is left alone, so that doesn't save anything, but IIRC there are some shortcuts you can do for squaring.
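A software model of both ideas, as a rough sketch. Radix-4 Booth recoding is applied to one operand only, turning it into digits in {-2, ..., 2} so that half as many partial products are needed. The squaring function shows one well-known shortcut, folding the symmetric partial products `a_i*a_j` and `a_j*a_i` into a single term of doubled weight. The thread does not say which shortcuts it has in mind, so treat that choice as an assumption. Function names are hypothetical.

```python
def booth_radix4_digits(b, bits=16):
    """Recode signed `bits`-wide b (bits even) into radix-4 digits in {-2..2}."""
    u = b & ((1 << bits) - 1)      # two's complement bit pattern
    digits, prev = [], 0           # prev is bit 2i-1 (0 below the LSB)
    for i in range(0, bits, 2):
        b0 = (u >> i) & 1
        b1 = (u >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits

def booth_multiply(a, b, bits=16):
    """a * b using one partial product per Booth digit of b; a is left alone."""
    return sum(d * a * 4 ** i for i, d in enumerate(booth_radix4_digits(b, bits)))

def square_folded(a, bits=16):
    """Square unsigned a using each cross term a_i*a_j (i < j) only once."""
    total = 0
    for i in range(bits):
        if (a >> i) & 1:
            total += 1 << (2 * i)              # diagonal term: a_i * a_i = a_i
            for j in range(i + 1, bits):
                if (a >> j) & 1:
                    total += 1 << (i + j + 1)  # folded pair, weight doubled
    return total
```

For `bits = 16` the folded square needs 16 + 120 = 136 bit products instead of the 256 of a plain array multiplier, which is where the savings come from without any Booth recoding.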