We were executing 1 non-helper invocation and 3 helpers per CS tgsi_exec
machine, which was a total waste of the CPU when we could trivially have
all 4 invocations do real work (at least in the common case of a
gl_WorkGroupSize.x >= 4).
This didn't have the effect on dEQP that I was hoping for, as it turns out
that its shaders are almost all 1x1x1 workgroups. However, it does reduce
the runtime of piglit arb_compute_shader-local-id from 2:10 to 47 seconds
on my system.
Part of #4097
Reviewed-by: Dave Airlie <airlied@redhat.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14728>