Improve performance consistency for the different quantization use cases
Now that tQuantizeOutput tests various 'extreme cases', it became apparent that some cases took much longer to quantize than others. These changes fix that by moving to a different thread-block mapping and by further fine-tuning the launch configuration.