Add matrixMultiplyAVX2b
This version is based on matrixMultiplyAVX2, with the following changes:
- Remove the multiplication with inv
- Use _mm256_addsub_ps
- Replace the overkill _mm256_permutevar8x32_ps with the cheaper (and cleaner) _mm256_permute_ps
- Reshuffle b_1 and b_2 to get b_3 and b_4
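
For reviewers unfamiliar with the two intrinsics named above, here is a minimal sketch of one common pattern where they appear together: an element-wise product of interleaved complex floats. It is not the code from matrixMultiplyAVX2b, which is not shown in this message. The interleaved (re, im) layout and the names complex_mul4 and check_complex_mul4 are assumptions made for illustration only. The sketch assumes GCC or Clang on x86-64.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper: out[k] = a[k] * b[k] for four complex floats.
   Assumed layout: re0, im0, re1, im1, re2, im2, re3, im3. */
__attribute__((target("avx")))
static void complex_mul4(const float *a, const float *b, float *out)
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);

    __m256 a_re = _mm256_moveldup_ps(va);      /* ar0 ar0 ar1 ar1 ... */
    __m256 a_im = _mm256_movehdup_ps(va);      /* ai0 ai0 ai1 ai1 ... */

    /* In-lane swap of each (re, im) pair using an immediate control
       (0xB1 selects elements 1,0,3,2); no index vector is needed. */
    __m256 b_sw = _mm256_permute_ps(vb, 0xB1); /* bi0 br0 bi1 br1 ... */

    __m256 t1 = _mm256_mul_ps(a_re, vb);       /* ar*br, ar*bi, ... */
    __m256 t2 = _mm256_mul_ps(a_im, b_sw);     /* ai*bi, ai*br, ... */

    /* addsub subtracts in even elements and adds in odd elements, giving
       (ar*br - ai*bi, ar*bi + ai*br) without a separate sign multiply. */
    _mm256_storeu_ps(out, _mm256_addsub_ps(t1, t2));
}

/* Compares the vector result against a scalar reference.
   Returns 1 on success; skips the comparison if the CPU lacks AVX. */
static int check_complex_mul4(void)
{
    if (!__builtin_cpu_supports("avx"))
        return 1;

    const float a[8] = {1, 2, 3, 4, -1, 0.5f, 0, 1};
    const float b[8] = {5, 6, 7, -8, 2, 2, 3, 4};
    float out[8];

    complex_mul4(a, b, out);

    for (int k = 0; k < 4; ++k) {
        float re = a[2 * k] * b[2 * k] - a[2 * k + 1] * b[2 * k + 1];
        float im = a[2 * k] * b[2 * k + 1] + a[2 * k + 1] * b[2 * k];
        /* Inputs are small exactly representable values, so exact
           comparison is safe here. */
        assert(out[2 * k] == re);
        assert(out[2 * k + 1] == im);
    }
    return 1;
}
```

Both _mm256_permute_ps and _mm256_addsub_ps only require AVX, while _mm256_permutevar8x32_ps requires AVX2 and a separate index register, which is the cost the change list refers to when the shuffle does not need to cross 128-bit lanes.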