Add matrixMultiplyAVX2b
This version is based on matrixMultiplyAVX2 with some changes:
- Remove the multiplication with inv
- Use _mm256_addsub_ps
- Replace the overkill _mm256_permutevar8x32_ps with cheaper (and cleaner) _mm256_permute_ps
- Reshuffle b_1 and b_2 to get b_3 and b_4
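
For reference, here is a rough scalar model of what the intrinsics named above compute. It is only a sketch and not the code from this change: the `*_model` function names are hypothetical, and plain arrays of 8 floats stand in for `__m256` registers so that it runs without AVX2 hardware.

```c
#include <stddef.h>

/* Scalar model of _mm256_addsub_ps(a, b):
   even-indexed elements are a - b, odd-indexed elements are a + b. */
void addsub_ps_model(const float a[8], const float b[8], float dst[8])
{
    for (size_t i = 0; i < 8; ++i)
        dst[i] = (i % 2 == 0) ? a[i] - b[i] : a[i] + b[i];
}

/* Scalar model of _mm256_permute_ps(a, imm8):
   imm8 packs four 2-bit selectors that are applied identically to each
   128-bit lane (4 floats); elements never cross a lane boundary. */
void permute_ps_model(const float a[8], unsigned imm8, float dst[8])
{
    for (size_t lane = 0; lane < 2; ++lane)
        for (size_t i = 0; i < 4; ++i)
            dst[lane * 4 + i] = a[lane * 4 + ((imm8 >> (2 * i)) & 3u)];
}

/* Scalar model of _mm256_permutevar8x32_ps(a, idx):
   each output element may select any of the 8 inputs (low 3 bits of idx),
   so it can cross lanes, at the price of an index vector in a register. */
void permutevar8x32_ps_model(const float a[8], const unsigned idx[8], float dst[8])
{
    for (size_t i = 0; i < 8; ++i)
        dst[i] = a[idx[i] & 7u];
}
```

A shuffle that stays within each 128-bit lane, such as swapping adjacent pairs (imm8 = 0xB1), gives the same result in both permute models, so the lane-crossing variant is not needed for that kind of pattern.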