RAP-28 Use AVX-256 MC2x2F

Mark de Wever requested to merge RAP-28-AVX-256-MC2x2F into master

This implements an AVX-256 based replacement for the MC2x2F class, which is used in the iterative full Jones solver. Depending on whether the AVX-256 instruction set is available, either the original or the new code is used.
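To illustrate the idea, here is a minimal sketch of one operation, the element-wise conjugate, with an AVX path and a scalar fallback. It is not the code from this merge request: the type `Matrix2x2F` and the functions `ConjugateScalar`, `ConjugateAvx` and `Conjugate` are hypothetical names, and the sketch selects the path at compile time via the `__AVX__` macro, which may differ from how the merge request chooses between the original and the new code. A 2x2 matrix of `std::complex<float>` is eight floats, so it fits exactly in one 256-bit register.

```cpp
#include <complex>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical illustration type; not the class added by this merge request.
struct Matrix2x2F {
  // Row-major: {m00, m01, m10, m11}, each stored as (real, imag).
  // 32-byte alignment allows aligned 256-bit loads and stores.
  alignas(32) std::complex<float> data[4];
};

// Scalar reference: conjugate every element.
inline Matrix2x2F ConjugateScalar(const Matrix2x2F& m) {
  Matrix2x2F result;
  for (std::size_t i = 0; i < 4; ++i) result.data[i] = std::conj(m.data[i]);
  return result;
}

#if defined(__AVX__)
// AVX path: flip the sign bit of the four imaginary parts with one XOR.
inline Matrix2x2F ConjugateAvx(const Matrix2x2F& m) {
  // -0.0f has only the sign bit set; it sits in the imaginary (odd) lanes.
  const __m256 mask = _mm256_setr_ps(0.0f, -0.0f, 0.0f, -0.0f,
                                     0.0f, -0.0f, 0.0f, -0.0f);
  const __m256 value = _mm256_load_ps(reinterpret_cast<const float*>(m.data));
  Matrix2x2F result;
  _mm256_store_ps(reinterpret_cast<float*>(result.data),
                  _mm256_xor_ps(value, mask));
  return result;
}
#endif

// Assumption of this sketch: the path is chosen at compile time.
inline Matrix2x2F Conjugate(const Matrix2x2F& m) {
#if defined(__AVX__)
  return ConjugateAvx(m);
#else
  return ConjugateScalar(m);
#endif
}
```

Both paths return the same values, so callers do not need to know which one was compiled in.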

The new matrix class only provides the minimal functionality needed to replace the MC2x2F class in the iterative full Jones solver.
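For reference, the four operations named in the benchmark tables of this merge request (Conjugate, Transpose, HermitianTranspose and Multiply) can be written in plain scalar C++ as below. This is a sketch under assumptions: the alias `Mat2`, the `Ref*` function names and the row-major element order are chosen for illustration and are not taken from MC2x2F or from the new class.

```cpp
#include <array>
#include <complex>

using Complex = std::complex<float>;
// Hypothetical stand-in for a 2x2 complex matrix, row-major:
// {m00, m01, m10, m11}.
using Mat2 = std::array<Complex, 4>;

// Element-wise complex conjugate.
inline Mat2 RefConjugate(const Mat2& m) {
  return {std::conj(m[0]), std::conj(m[1]), std::conj(m[2]), std::conj(m[3])};
}

// Swap the two off-diagonal elements.
inline Mat2 RefTranspose(const Mat2& m) { return {m[0], m[2], m[1], m[3]}; }

// Conjugate transpose: transpose, then conjugate every element.
inline Mat2 RefHermitianTranspose(const Mat2& m) {
  return RefConjugate(RefTranspose(m));
}

// Standard 2x2 matrix product a * b.
inline Mat2 RefMultiply(const Mat2& a, const Mat2& b) {
  return {a[0] * b[0] + a[1] * b[2], a[0] * b[1] + a[1] * b[3],
          a[2] * b[0] + a[3] * b[2], a[2] * b[1] + a[3] * b[3]};
}
```

A scalar version like this is useful as a correctness oracle when testing a vectorized implementation of the same operations.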

Since it's easier to develop code in one repository, the new code is part of DP3. In the future it should be moved to the aocommon library. However, using this code in other places in DP3 may reveal missing functionality in the new class.

LOCAL
~~~~~

Before
------

time ctest -R 'solvers/iterative_full_jones$'

    Start  1: extract_resources
1/3 Test  #1: extract_resources ................   Passed    0.13 sec
    Start  2: buildunittests
2/3 Test  #2: buildunittests ...................   Passed    0.04 sec
    Start 17: solvers/iterative_full_jones
3/3 Test #17: solvers/iterative_full_jones .....   Passed  115.92 sec

100% tests passed, 0 tests failed out of 3

Label Time Summary:
slow    = 115.92 sec*proc (1 test)
unit    = 115.92 sec*proc (1 test)

Total Test time (real) = 116.10 sec

real	1m56,106s
user	7m11,487s
sys	0m0,468s

After
-----

time ctest -R 'solvers/iterative_full_jones$'

    Start  1: extract_resources
1/3 Test  #1: extract_resources ................   Passed    0.14 sec
    Start  2: buildunittests
2/3 Test  #2: buildunittests ...................   Passed    0.04 sec
    Start 17: solvers/iterative_full_jones
3/3 Test #17: solvers/iterative_full_jones .....   Passed   53.37 sec

100% tests passed, 0 tests failed out of 3

Label Time Summary:
slow    =  53.37 sec*proc (1 test)
unit    =  53.37 sec*proc (1 test)

Total Test time (real) =  53.55 sec

real	0m53,562s
user	3m14,429s
sys	0m0,427s

Benchmarks
----------
|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | Benchmarking MC2x2F normal versus AVX
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:--------------------------------------
|                4.55 |      219,854,437.94 |    0.5% |           29.00 |           12.71 |  2.282 |           0.00 |  101.4% |      0.03 | `    Conjugate`
|                1.60 |      624,043,067.70 |    0.0% |            8.00 |            4.49 |  1.781 |           0.00 |    0.0% |      0.01 | `AVX Conjugate`
|               12.05 |       82,958,011.32 |    0.1% |           26.00 |           33.76 |  0.770 |           0.00 |   53.6% |      0.07 | `    Transpose`
|                2.00 |      498,794,911.49 |    0.0% |            9.00 |            5.62 |  1.601 |           0.00 |    0.0% |      0.01 | `AVX Transpose`
|                4.54 |      220,244,803.59 |    0.3% |           29.00 |           12.70 |  2.283 |           0.00 |  107.3% |      0.03 | `    HermitianTranspose`
|                1.61 |      621,252,971.44 |    0.3% |            8.00 |            4.51 |  1.773 |           0.00 |    0.0% |      0.01 | `AVX HermitianTranspose`
|               16.15 |       61,906,795.78 |    0.4% |           91.00 |           45.24 |  2.011 |           8.00 |    0.0% |      0.10 | `    Multiply`
|                3.09 |      323,542,585.15 |    0.8% |           24.00 |            8.64 |  2.777 |           0.00 |    0.0% |      0.02 | `AVX Multiply`


DAS6
~~~~

Before
------

time ctest -R 'solvers/iterative_full_jones$'

    Start  1: extract_resources
1/3 Test  #1: extract_resources ................   Passed    0.62 sec
    Start  2: buildunittests
2/3 Test  #2: buildunittests ...................   Passed    0.03 sec
    Start 17: solvers/iterative_full_jones
3/3 Test #17: solvers/iterative_full_jones .....   Passed   13.79 sec

100% tests passed, 0 tests failed out of 3

Label Time Summary:
slow    =  13.79 sec*proc (1 test)
unit    =  13.79 sec*proc (1 test)

Total Test time (real) =  14.45 sec

real	0m15.162s
user	0m44.323s
sys	0m0.346s

After
-----

time ctest -R 'solvers/iterative_full_jones$'

    Start  1: extract_resources
1/3 Test  #1: extract_resources ................   Passed    0.10 sec
    Start  2: buildunittests
2/3 Test  #2: buildunittests ...................   Passed    0.03 sec
    Start 17: solvers/iterative_full_jones
3/3 Test #17: solvers/iterative_full_jones .....   Passed    8.83 sec

100% tests passed, 0 tests failed out of 3

Label Time Summary:
slow    =   8.83 sec*proc (1 test)
unit    =   8.83 sec*proc (1 test)

Total Test time (real) =   8.96 sec

real	0m8.968s
user	0m25.785s
sys	0m0.316s

Benchmarks
----------

|               ns/op |                op/s |    err% |          ins/op |         bra/op |   miss% |     total | Benchmarking MC2x2F normal versus AVX
|--------------------:|--------------------:|--------:|----------------:|---------------:|--------:|----------:|:--------------------------------------
|                1.63 |      612,281,078.64 |    0.2% |           22.00 |           0.00 |   50.7% |      0.01 | `    Conjugate`
|                0.54 |    1,846,403,267.64 |    0.0% |            4.00 |           0.00 |   95.7% |      0.00 | `AVX Conjugate`
|                0.72 |    1,384,820,487.38 |    0.0% |            4.00 |           0.00 |   95.6% |      0.00 | `    Transpose`
|                0.82 |    1,225,432,003.60 |    0.5% |            5.00 |           0.00 |   93.0% |      0.00 | `AVX Transpose`
|                1.63 |      612,210,869.07 |    0.1% |           22.00 |           0.00 |   50.7% |      0.01 | `    HermitianTranspose`
|                0.72 |    1,381,799,253.23 |    0.2% |            4.00 |           0.00 |   95.6% |      0.00 | `AVX HermitianTranspose`
|                9.26 |      108,002,363.09 |    0.2% |          140.00 |          10.00 |    0.0% |      0.06 | `    Multiply`
|                1.77 |      563,976,985.23 |    0.2% |           19.00 |           0.00 |   50.0% |      0.01 | `AVX Multiply`

Closes RAP-28
