According to NVIDIA, threads have individual program counters, and have had them for nearly 10 years
https://docs.nvidia.com/cuda/cuda-programming-guide/03-advan...
> the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity
Divergence isn't good, but sometimes it's necessary - not supporting it in a programming model is a mistake. There are some problems you simply can't solve without it, and in some cases you will absolutely get better performance by using divergence
People often try to avoid divergence by writing an algorithm that effectively does what Pascal and earlier GPUs did anyway: unconditionally doing all the work on every thread. That will give worse performance than just having a branch, because of the better hardware scheduling these days
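To make the two styles concrete, here's a hypothetical CUDA sketch (the kernel names and `expensive_path_a`/`expensive_path_b` are placeholders of mine, not from the thread):

```cuda
#include <math.h>

// Placeholder "expensive" work; stand-ins for whatever the real paths do.
__device__ float expensive_path_a(float x) { return sqrtf(x); }
__device__ float expensive_path_b(float x) { return -sqrtf(-x); }

// Style 1: an ordinary branch. Lanes not taking a path are masked off,
// and if a whole warp agrees on the condition, the other path is
// skipped entirely - the scheduler only serializes warps that actually
// diverge.
__global__ void branched(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] > 0.0f)
        out[i] = expensive_path_a(in[i]);
    else
        out[i] = expensive_path_b(in[i]);
}

// Style 2: "branchless" - compute both paths on every thread and select.
// Every lane always pays for both paths, which is the behavior the
// comment above describes as emulating Pascal-era execution in software.
__global__ void branchless(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float a = expensive_path_a(in[i]);
    float b = expensive_path_b(in[i]);
    out[i] = (in[i] > 0.0f) ? a : b;
}
```

Whether style 1 wins depends on how often warps actually diverge and how expensive the paths are; the point is that style 2 pays the worst case unconditionally.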
Look at my user profile. Divergence in modern NVIDIA GPUs does not work the way you think it does. A separate program counter per thread does not mean that each thread issues a different instruction on each clock. See section 3.2.2.1 of https://docs.nvidia.com/cuda/cuda-programming-guide/03-advan...
Of course divergence is sometimes unavoidable. That is why GPUs support it. But substantially divergent code comes at a significant cost.
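For what per-thread program counters actually buy you: a minimal sketch (my own, kernel name hypothetical) of an intra-warp handoff that Independent Thread Scheduling (Volta+) makes possible. Pre-Volta, the single shared warp PC could let the spinning lane starve the lane that needs to set the flag, deadlocking the warp:

```cuda
__device__ volatile int flag = 0;

__global__ void handoff(int *out) {
    if (threadIdx.x == 0) {
        while (flag == 0) { }   // lane 0 spin-waits...
        *out = 1;
    } else if (threadIdx.x == 1) {
        flag = 1;               // ...until lane 1, on the other side of
    }                           // the divergent branch, releases it
}
```

Even here, the warp still issues one instruction per clock for whichever lanes happen to be converged; the two diverged paths are interleaved, not executed in parallel, which is exactly the cost being described above.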