Update timings
The performance counter class did not work correctly with asynchronous launches: an event could be reused before it had been recorded. This is fixed by keeping the events in a double-ended queue, so that every launch gets its own events. The pipeline benchmark now relies on the performance counters to print statistics about individual kernels, which simplifies its code. Because events are now created at kernel launch, the device context must be set correctly at that point. This affects the benchmarks for the individual kernels.
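
A minimal sketch of the queue-based idea, under stated assumptions and not the actual implementation: `Event`, `Measurement`, and `PerformanceCounter` as written here are hypothetical stand-ins (a real event would be a GPU API handle that is queried for completion, and creating one would require the correct device context to be current). Each launch gets a fresh start/stop pair at the back of a `std::deque`, and finished pairs are consumed from the front in launch order.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical stand-in for a GPU event; a real one would wrap an API handle.
struct Event {
    bool recorded = false;
    double time_ms = 0.0;

    void record(double now_ms) {
        assert(!recorded);  // an event must never be recorded twice (no reuse)
        recorded = true;
        time_ms = now_ms;
    }
};

// One start/stop pair per kernel launch.
struct Measurement {
    Event start;
    Event stop;
};

class PerformanceCounter {
public:
    // Called at launch time: create a new pair instead of reusing an old one.
    // References into a std::deque stay valid when elements are appended.
    Measurement& enqueue() {
        pending_.emplace_back();
        return pending_.back();
    }

    // Consume finished measurements in launch order and stop at the first
    // pair whose events have not both been recorded yet.
    void update() {
        while (!pending_.empty()) {
            const Measurement& m = pending_.front();
            if (!m.start.recorded || !m.stop.recorded) break;
            total_ms_ += m.stop.time_ms - m.start.time_ms;
            ++count_;
            pending_.pop_front();
        }
    }

    double total_ms() const { return total_ms_; }
    std::size_t count() const { return count_; }
    std::size_t pending() const { return pending_.size(); }

private:
    std::deque<Measurement> pending_;  // launches still in flight
    double total_ms_ = 0.0;
    std::size_t count_ = 0;
};
```

A deque fits this pattern because new pairs are only appended at the back and completed pairs are only removed from the front, and appending does not invalidate references to pairs that are still in flight.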