From f74f5528b12184c67e40a7a289d79a9b82ac7d46 Mon Sep 17 00:00:00 2001
From: Bram Veenboer <bram.veenboer@gmail.com>
Date: Mon, 26 Apr 2021 09:43:21 +0000
Subject: [PATCH] COB-121: Update PerformanceTest.md for
 tSubbandProcPerformance

---
 .../doc/performanceTest/PerformanceTest.md    | 83 ++++---------------
 1 file changed, 18 insertions(+), 65 deletions(-)

diff --git a/RTCP/Cobalt/GPUProc/doc/performanceTest/PerformanceTest.md b/RTCP/Cobalt/GPUProc/doc/performanceTest/PerformanceTest.md
index 170927b9041..7be0a0e8695 100644
--- a/RTCP/Cobalt/GPUProc/doc/performanceTest/PerformanceTest.md
+++ b/RTCP/Cobalt/GPUProc/doc/performanceTest/PerformanceTest.md
@@ -1,73 +1,26 @@
 # Goal
-We run a set of representative benchmarks to test performance regression of the kernels for the different pipelines. As we only aim to test a representative parameter space for the kernels it seems best to run a set of pipeline tests with representative use cases for LOFAR and LOFAR Mega Mode. Tests are specified through the LOFAR parsets and are passed to the gpu_load test that sets up a relative simple pipeline with performance benchmarking enabled.
+We run a set of representative benchmarks to test for performance regressions in the kernels of the different pipelines. As we only aim to cover a representative parameter space for the kernels, it seems best to run a set of pipeline tests with representative use cases for LOFAR and LOFAR Mega Mode. Tests are specified through LOFAR parsets and are passed to the `gpu_load` test, which sets up a relatively simple pipeline with performance benchmarking enabled.
 
-# Run benchmarks
+# The `tSubbandProcPerformance` test
+Start the performance test using the `ctest` framework:
 ```
-INPUT:  (required) location of gpu_load binary,
-        (required) location of source parsets,
-        (required) root directory for results,
-        (optional) specify [int] number of iterations for gpu_load, default is 1000 iterations
-        (optional) enable storing of parsets,
-        (optional) select specific parset(s)
-        (optional) disable processing (e.g. dry run)
-
-Take as test suites a list of: <parset-name><observation ID>
-
-FOR EACH test suite P in a list of test suites (P1 through Pn):
-    Take the base parset
-    SET Cobalt.Benchmark.file=<benchmark file path>/<P.name>.csv
-    IF storing of parsets is enabled store modified parset to <path>/<P.name>.parset
-    IF processing is enabled
-        blocking call gpu_load with parset as argument, results are written (by gpu_load) to <benchmark file path>/<P.name>.csv
-    ENDIF
-END
-
-Optionally later add test cases (variation of parset parameters)
-FOR EACH test suite P in a list of test suites (P1 through Pn):
-    Take its base parset (observation ID) or a minimal version of that (TBD)
-    FOR EACH test case C in the test suite
-        Modify the parset with use case specific parameters
-        SET Cobalt.Benchmark.file=<benchmark file path>/<P.name><C.name>.csv
-        IF storing of parsets is enabled store modified parset to <path>/<P.name><C.name>.parset
-        IF processing is enabled
-            blocking call gpu_load with parset as argument, results are written (by gpu_load) to <benchmark file path>/<P.name><C.name>.csv
-        ENDIF
-    END
-END
+ctest -V -R tSubbandProcPerformance
 ```
+This runs all `tSubbandProcPerformance_*.parset` files in `test/SubbandProc` and compares the results with the reference output in `tSubbandProcPerformance_reference`. Note that these parsets do not yet contain the `Cobalt.Benchmark` keys; these are filled in during the test.
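+
+For illustration, the key that selects where the performance counter data is written takes the following form; the test fills in the actual path during the run, and additional `Cobalt.Benchmark` keys may be set as well (the value below is only a placeholder):
+```
+Cobalt.Benchmark.file=<path to output CSV file>
+```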
 
-# Analyze benchmarks
-```
-INPUT:  (required) root directory with benchmark results
+The file name format of the reference output is `<OBSID>_<GPUNAME>.csv`. The GPU name should match the name reported by `nvidia-smi`, with spaces replaced by dashes; for example, "Tesla V100-PCIE-16GB" becomes "Tesla-V100-PCIE-16GB". If no reference output is found (for instance, when running on a system with different GPUs), the test is skipped.
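+
+The expected reference file name for the local machine can be derived directly from `nvidia-smi`. A minimal sketch, assuming `nvidia-smi` is on the `PATH` and using a hypothetical observation ID:
+```
+# Print the reference file name for the first GPU in this machine;
+# 123456 is a hypothetical observation ID.
+echo "123456_$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1 | tr ' ' '-').csv"
+```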
 
-FOR EACH file in <benchmark file path> (list of files <P.name>.csv)
-    Read the file
-    FOR EACH KernelName in the file 
-        Filter for "PerformanceCounter" and list the performance statistics
-        <File name (test ID) > ; <kernel name> ; <statistics>
-        Construct kernelIndex from test ID and kernel name
-        Write in a separate file: <kernelIndex> ; <mean>
-    END
-END
-```
-Example of input that we are filtering for: 
-```
-format;             kernelName;             count;  mean;       stDev;      min;        max;        unit
-PerformanceCounter; output (incoherent);     10;    0.28483;     0.01499;   0.24310;    0.29318;    ms
-```
-# Compare analysis
-```
-INPUT:  (required) file with reference mean values (output format from analysis script)
-        (required) file with current mean values (output format from analysis script)
+One can add new reference files or update existing ones as follows (see the sketch after this list):
+1. Get a parset and make sure that the `OBSID` is either unique (for new tests) or matches the `OBSID` of an existing `tSubbandProcPerformance_*.parset` (when updating an existing reference output).
+2. Add the `Cobalt.Benchmark` keys to enable dumping of the performance counter data to a CSV file.
+3. Run `gpu_load` with the new parset.
+4. Rename the resulting CSV file as described above and copy it to the `tSubbandProcPerformance_reference` directory.
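+
+A minimal sketch of steps 3 and 4 in shell form, assuming `gpu_load` takes the parset as its only argument and that the CSV is written to the location given by `Cobalt.Benchmark.file`; all file names are hypothetical:
+```
+# Step 3: run the pipeline with benchmarking enabled (hypothetical parset name).
+gpu_load my_observation.parset
+
+# Step 4: rename the resulting CSV to <OBSID>_<GPUNAME>.csv and copy it into
+# the reference directory (all names below are placeholders).
+cp my_observation.csv tSubbandProcPerformance_reference/123456_Tesla-V100-PCIE-16GB.csv
+```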
 
-Construct a dict with kernelIndex and mean values for both files
-FOR EACH key in the reference dict
-    IF key also exists in the current dict
-        Calculate diff and write output as <kernelIndex> ; <referece mean> ; <current mean> ; <difference of mean values>
-    END
-END
-```
+The `tSubbandProcPerformance_compare.py` script compares two CSV files using a configurable tolerance. For example, if the tolerance is 10%, a measurement is considered a PASS if it is no more than 10% slower than the reference. Measurements that run faster than the reference, for instance after a kernel has been optimized, are considered a PASS as well.
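+
+The pass/fail rule amounts to a single comparison: a measurement passes when its mean runtime is at most `(1 + tolerance)` times the reference mean. A minimal sketch of that rule (not the actual implementation of the script), with hypothetical values in ms:
+```
+reference_mean=0.285
+current_mean=0.300
+tolerance=0.10
+# PASS when current_mean <= reference_mean * (1 + tolerance), FAIL otherwise.
+awk -v ref="$reference_mean" -v cur="$current_mean" -v tol="$tolerance" \
+    'BEGIN { if (cur <= ref * (1 + tol)) print "PASS"; else print "FAIL" }'
+```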
+
+It is advised to add new reference files separately from any code changes that may have caused performance differences. This way, one cannot accidentally introduce a performance regression.
 
-Optionally extend with:
-* Calculation of % real-time
-* Compare with performance (golden) reference file, pass fail output to top script for CI pipeline
\ No newline at end of file
+This test has two caveats:
+1. Some timings are unreliable, as the kernels run too briefly to provide a meaningful measurement. Therefore, any timing of less than 5% of the total runtime is ignored. This threshold is currently hard-coded in the comparison script.
+2. When using an LMM parset, performance counters with the same name are likely to co-exist. This case is not yet taken into account by the benchmark.
\ No newline at end of file
-- 
GitLab