Speed up a little the slicing-by-8 code path by replacing the (load+shift+xor)*4 sequence with a single u32 load plus a xor. Before: ``` iterative: 1018 MiB/s [000000006c3b110d] small keys: 1075 MiB/s [0035bf3dcac00000] ``` After: ``` iterative: 1114 MiB/s [000000006c3b110d] small keys: 1324 MiB/s [0035bf3dcac00000] ```
5.7 KiB
5.7 KiB