This implements an AVX-256 based replacement for the MC2x2F class. The code is used in the iterative full Jones solver. Depending on the availability of the AVX-256 instruction set, either the original or the new code is used.
The new matrix class has only the minimal functionality needed to replace the MC2x2F class in the iterative full Jones solver.
Since it's easier to develop code in one repository, the new code is part of DP3. In the future it should be moved to the aocommon library. However, using this code in other places in DP3 may reveal missing functionality in the new class.
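A 2x2 matrix of complex floats holds four `std::complex<float>` values, i.e. 8 floats or exactly 256 bits, so it fits in a single AVX register. The following is a minimal sketch of that idea, not the code in this change: `PackedMatrix` and `Conjugate` are hypothetical names, row-major storage is assumed, and selecting the implementation at compile time via the `__AVX__` macro is an assumption about how the fallback could work.

```cpp
#include <complex>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical stand-in for a 2x2 complex float matrix (not the DP3 class).
struct PackedMatrix {
  // Assumed row-major: data[0]=(0,0), data[1]=(0,1), data[2]=(1,0), data[3]=(1,1).
  std::complex<float> data[4];
};

// Four complex floats are 8 floats, which is one 256-bit register.
static_assert(sizeof(PackedMatrix) == 8 * sizeof(float),
              "matrix must occupy exactly 256 bits");

// Returns the element-wise complex conjugate of m.
PackedMatrix Conjugate(const PackedMatrix& m) {
  PackedMatrix result;
#if defined(__AVX__)
  // std::complex<float> is layout-compatible with float[2] (real, imag).
  const float* in = reinterpret_cast<const float*>(m.data);
  float* out = reinterpret_cast<float*>(result.data);
  // -0.0f has only the sign bit set, so XOR flips the sign of the
  // imaginary lanes (odd indices) and leaves the real lanes untouched.
  const __m256 sign_mask = _mm256_setr_ps(0.0f, -0.0f, 0.0f, -0.0f,
                                          0.0f, -0.0f, 0.0f, -0.0f);
  _mm256_storeu_ps(out, _mm256_xor_ps(_mm256_loadu_ps(in), sign_mask));
#else
  // Scalar fallback when AVX is not enabled at compile time.
  for (int i = 0; i < 4; ++i) result.data[i] = std::conj(m.data[i]);
#endif
  return result;
}
```

Both branches produce the same result, so callers do not need to know which one was compiled in.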
LOCAL
~~~~~
Before
------
time ctest -R 'solvers/iterative_full_jones$'
Start 1: extract_resources
1/3 Test #1: extract_resources ................ Passed 0.13 sec
Start 2: buildunittests
2/3 Test #2: buildunittests ................... Passed 0.04 sec
Start 17: solvers/iterative_full_jones
3/3 Test #17: solvers/iterative_full_jones ..... Passed 115.92 sec
100% tests passed, 0 tests failed out of 3
Label Time Summary:
slow = 115.92 sec*proc (1 test)
unit = 115.92 sec*proc (1 test)
Total Test time (real) = 116.10 sec
real 1m56,106s
user 7m11,487s
sys 0m0,468s
After
-----
time ctest -R 'solvers/iterative_full_jones$'
Start 1: extract_resources
1/3 Test #1: extract_resources ................ Passed 0.14 sec
Start 2: buildunittests
2/3 Test #2: buildunittests ................... Passed 0.04 sec
Start 17: solvers/iterative_full_jones
3/3 Test #17: solvers/iterative_full_jones ..... Passed 53.37 sec
100% tests passed, 0 tests failed out of 3
Label Time Summary:
slow = 53.37 sec*proc (1 test)
unit = 53.37 sec*proc (1 test)
Total Test time (real) = 53.55 sec
real 0m53,562s
user 3m14,429s
sys 0m0,427s
Benchmarks
----------
| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | Benchmarking MC2x2F normal versus AVX
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:--------------------------------------
| 4.55 | 219,854,437.94 | 0.5% | 29.00 | 12.71 | 2.282 | 0.00 | 101.4% | 0.03 | ` Conjugate`
| 1.60 | 624,043,067.70 | 0.0% | 8.00 | 4.49 | 1.781 | 0.00 | 0.0% | 0.01 | `AVX Conjugate`
| 12.05 | 82,958,011.32 | 0.1% | 26.00 | 33.76 | 0.770 | 0.00 | 53.6% | 0.07 | ` Transpose`
| 2.00 | 498,794,911.49 | 0.0% | 9.00 | 5.62 | 1.601 | 0.00 | 0.0% | 0.01 | `AVX Transpose`
| 4.54 | 220,244,803.59 | 0.3% | 29.00 | 12.70 | 2.283 | 0.00 | 107.3% | 0.03 | ` HermitianTranspose`
| 1.61 | 621,252,971.44 | 0.3% | 8.00 | 4.51 | 1.773 | 0.00 | 0.0% | 0.01 | `AVX HermitianTranspose`
| 16.15 | 61,906,795.78 | 0.4% | 91.00 | 45.24 | 2.011 | 8.00 | 0.0% | 0.10 | ` Multiply`
| 3.09 | 323,542,585.15 | 0.8% | 24.00 | 8.64 | 2.777 | 0.00 | 0.0% | 0.02 | `AVX Multiply`
DAS6
~~~~
Before
------
time ctest -R 'solvers/iterative_full_jones$'
Start 1: extract_resources
1/3 Test #1: extract_resources ................ Passed 0.62 sec
Start 2: buildunittests
2/3 Test #2: buildunittests ................... Passed 0.03 sec
Start 17: solvers/iterative_full_jones
3/3 Test #17: solvers/iterative_full_jones ..... Passed 13.79 sec
100% tests passed, 0 tests failed out of 3
Label Time Summary:
slow = 13.79 sec*proc (1 test)
unit = 13.79 sec*proc (1 test)
Total Test time (real) = 14.45 sec
real 0m15.162s
user 0m44.323s
sys 0m0.346s
After
-----
time ctest -R 'solvers/iterative_full_jones$'
Start 1: extract_resources
1/3 Test #1: extract_resources ................ Passed 0.10 sec
Start 2: buildunittests
2/3 Test #2: buildunittests ................... Passed 0.03 sec
Start 17: solvers/iterative_full_jones
3/3 Test #17: solvers/iterative_full_jones ..... Passed 8.83 sec
100% tests passed, 0 tests failed out of 3
Label Time Summary:
slow = 8.83 sec*proc (1 test)
unit = 8.83 sec*proc (1 test)
Total Test time (real) = 8.96 sec
real 0m8.968s
user 0m25.785s
sys 0m0.316s
Benchmarks
----------
| ns/op | op/s | err% | ins/op | bra/op | miss% | total | Benchmarking MC2x2F normal versus AVX
|--------------------:|--------------------:|--------:|----------------:|---------------:|--------:|----------:|:--------------------------------------
| 1.63 | 612,281,078.64 | 0.2% | 22.00 | 0.00 | 50.7% | 0.01 | ` Conjugate`
| 0.54 | 1,846,403,267.64 | 0.0% | 4.00 | 0.00 | 95.7% | 0.00 | `AVX Conjugate`
| 0.72 | 1,384,820,487.38 | 0.0% | 4.00 | 0.00 | 95.6% | 0.00 | ` Transpose`
| 0.82 | 1,225,432,003.60 | 0.5% | 5.00 | 0.00 | 93.0% | 0.00 | `AVX Transpose`
| 1.63 | 612,210,869.07 | 0.1% | 22.00 | 0.00 | 50.7% | 0.01 | ` HermitianTranspose`
| 0.72 | 1,381,799,253.23 | 0.2% | 4.00 | 0.00 | 95.6% | 0.00 | `AVX HermitianTranspose`
| 9.26 | 108,002,363.09 | 0.2% | 140.00 | 10.00 | 0.0% | 0.06 | ` Multiply`
| 1.77 | 563,976,985.23 | 0.2% | 19.00 | 0.00 | 50.0% | 0.01 | `AVX Multiply`
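For readers unfamiliar with the operations named in the benchmark tables, the sketch below shows what each one computes in plain scalar form. It assumes the usual linear-algebra definitions and row-major storage; `ScalarMatrix` and the free functions are hypothetical names for illustration and are not the MC2x2F interface.

```cpp
#include <complex>

// Hypothetical scalar reference type (not the MC2x2F class itself).
struct ScalarMatrix {
  // Assumed row-major: data[0]=(0,0), data[1]=(0,1), data[2]=(1,0), data[3]=(1,1).
  std::complex<float> data[4];
};

// Swaps the two off-diagonal elements.
ScalarMatrix Transpose(const ScalarMatrix& m) {
  return {{m.data[0], m.data[2], m.data[1], m.data[3]}};
}

// Conjugates every element and transposes the result.
ScalarMatrix HermitianTranspose(const ScalarMatrix& m) {
  return {{std::conj(m.data[0]), std::conj(m.data[2]),
           std::conj(m.data[1]), std::conj(m.data[3])}};
}

// Standard 2x2 matrix product a * b.
ScalarMatrix Multiply(const ScalarMatrix& a, const ScalarMatrix& b) {
  return {{a.data[0] * b.data[0] + a.data[1] * b.data[2],
           a.data[0] * b.data[1] + a.data[1] * b.data[3],
           a.data[2] * b.data[0] + a.data[3] * b.data[2],
           a.data[2] * b.data[1] + a.data[3] * b.data[3]}};
}
```

The product needs eight complex multiplications and four complex additions per call, which is consistent with Multiply being the most expensive row in both tables.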
Closes RAP-28