Managed to connect the PIX profiler to Shade today, so I collected actual GPU timings for the first time. Previously I was using WebGPU’s timestamp queries, which have wonky timings for “security reasons”.
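For reference, this is roughly what that timestamp-query path looks like in WebGPU. It’s a minimal sketch rather than Shade’s actual profiling code, and browsers deliberately coarsen the reported timestamps (the “security reasons” above), so treat the results accordingly:

// Minimal timestamp-query sketch (TypeScript), not Shade's actual code.
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error('no WebGPU adapter');
const device = await adapter.requestDevice({
  requiredFeatures: ['timestamp-query'],
});

const querySet = device.createQuerySet({ type: 'timestamp', count: 2 });
const resolveBuffer = device.createBuffer({
  size: 16, // two 64-bit timestamps, in nanoseconds
  usage: GPUBufferUsage.QUERY_RESOLVE | GPUBufferUsage.COPY_SRC,
});
const readbackBuffer = device.createBuffer({
  size: 16,
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
});

const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass({
  timestampWrites: {
    querySet,
    beginningOfPassWriteIndex: 0,
    endOfPassWriteIndex: 1,
  },
});
// ... dispatch the pass being measured here ...
pass.end();
encoder.resolveQuerySet(querySet, 0, 2, resolveBuffer, 0);
encoder.copyBufferToBuffer(resolveBuffer, 0, readbackBuffer, 0, 16);
device.queue.submit([encoder.finish()]);

await readbackBuffer.mapAsync(GPUMapMode.READ);
const [start, end] = new BigUint64Array(readbackBuffer.getMappedRange());
console.log(`pass took ${Number(end - start) / 1000} µs`);
readbackBuffer.unmap();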
Here are the precise timings from PIX for the 1024-light scenario:
- Culling: 7.84 µs
- Binning: 193 µs
- Shading: 274.37 µs
This is pretty much the first time I’ve used a real GPU profiler, and I must say I’m really impressed. I’ve been creating a lot of tooling in Shade just to provide a window into what the GPU is doing, and this tool does it all and more out of the box. I wouldn’t say I regret writing the tools that I have, but I will definitely be relying on PIX more in the future.
That said, setting it up has been a real pain. I’ve tried it in the past only to fail to record anything; today I managed to obtain a recording, but only on Chrome Canary, and only with a very specific version of PIX (I tried 5 different versions, only one worked). And even then, many of the recording features don’t work and recording only succeeds about 20% of the time. So there’s a lot left to be desired.
For posterity:
- I’m on Windows 11 x64
- PIX version is 2405.15.002-OneBranch_release
- Chrome Canary is Version 137.0.7125.0 (Official Build) canary (64-bit)
- Launch parameters are: --disable-gpu-sandbox --disable-gpu-watchdog --enable-dawn-features=emit_hlsl_debug_symbols,disable_symbol_renaming "http://localhost:5173/"
Tickboxes (screenshot):
When everything works (the 20% I mentioned), you get something like this:
Especially enlightening for me was wave occupancy on ray traced shadows:
The warps are only ~30% utilized on average, with the rest sitting idle. Another way to picture this would be to imagine something like this:
if(x){
// code 1
}else if(y){
// code 2
}else{
// code 3
}
Where each thread hits a random branch, so on average only ~30% of the threads are running the same branch at any given time.
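To sanity-check that intuition, here’s a quick Monte Carlo of the picture above: 32 lanes each pick one of three branches at random, divergent branches execute one after another, and only the lanes that took the current branch are active. A standalone sketch, not code from Shade:

// Simulate average lane utilization for a 32-wide warp with 3 random branches.
const WARP_SIZE = 32;
const BRANCHES = 3;
const TRIALS = 10000;

let utilization = 0;
for (let t = 0; t < TRIALS; t++) {
  const choices = Array.from({ length: WARP_SIZE }, () =>
    Math.floor(Math.random() * BRANCHES),
  );
  let activeSlots = 0;
  let totalSlots = 0;
  // Divergent branches run serially; during each one, only its lanes are active.
  for (let b = 0; b < BRANCHES; b++) {
    const active = choices.filter((c) => c === b).length;
    if (active === 0) continue; // branch skipped if no lane took it
    activeSlots += active;
    totalSlots += WARP_SIZE;
  }
  utilization += activeSlots / totalSlots;
}
console.log(`average lane utilization ≈ ${((100 * utilization) / TRIALS).toFixed(0)}%`); // ≈ 33%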
This is not too bad for an inline ray tracer, for somewhat obvious reasons. But the main takeaway for me here was that optimizing memory access further is pointless, and any BVH improvements will bring marginal gains, as the main issue is occupancy. I’m curious about that last bit at the end where we see a lot of blue; it indicates a lot of stragglers.
Imagine if you could reach close to 90% occupancy - suddenly ray tracing would be 3 times faster.
More relevant to this topic is shading. Here’s the shading occupancy:
Red is occupied, orange is idle, and blue is free. Occupancy is ~40% throughout.
The reason seems to be register pressure, which hovers around 93% at its peak, so it seems we can only schedule about 40% of the threads in a warp.
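To make the link between register pressure and occupancy concrete, here’s a rough back-of-envelope. All the hardware numbers are assumptions (an NVIDIA-style SM with a 64K-entry register file and a 1,536-thread residency limit), and the per-thread register count is hypothetical; the point is just that register usage caps how many threads can be resident at once:

// Back-of-envelope occupancy estimate; all hardware numbers are assumed.
const REGISTER_FILE_SIZE = 65536; // 32-bit registers per SM (assumption)
const MAX_RESIDENT_THREADS = 1536; // per-SM thread limit (assumption)
const REGISTERS_PER_THREAD = 104; // hypothetical value for the shading pass

const threadsThatFit = Math.min(
  MAX_RESIDENT_THREADS,
  Math.floor(REGISTER_FILE_SIZE / REGISTERS_PER_THREAD),
);
const occupancy = threadsThatFit / MAX_RESIDENT_THREADS;
console.log(`theoretical occupancy ≈ ${(occupancy * 100).toFixed(0)}%`); // ≈ 41%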
Also, take the timings with a grain of salt; GPU timings are not super reliable because:
- The GPU has acceleration/parking strategies to limit power usage; this causes timings to drift
- Ambient temperature has an impact on the above, and I (and, I assume, you too) am not in a temperature-controlled environment
- The GPU is doing other stuff, from drawing your other application windows to things like video decoding
- Modern GPUs run on virtual memory, so there’s paging involved
- The browser actually interacts with the driver for you, so there’s a fat layer of abstraction between WebGPU and the GPU
- The GPU scheduler provides no guarantees on reproducibility of scheduling. That is, your thread/warp allocation will vary pretty much every time, and some allocations might lead to better occupancy
- The GPU has caches
I’m sure I’m missing at least a few extra reasons there; the point is that timings will vary. The only useful metrics are averages over relatively large sample sizes, and comparisons relative to other metrics.
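In practice that means collecting plenty of samples before drawing conclusions. A minimal sketch of how that aggregation might look (illustrative names, not Shade’s actual profiler code):

// Collect per-pass GPU timings over many frames and report aggregates.
const samples: number[] = [];

function recordSample(gpuTimeUs: number) {
  samples.push(gpuTimeUs);
}

function report() {
  if (samples.length < 100) return; // wait for a decent sample size
  const sorted = [...samples].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, v) => sum + v, 0) / sorted.length;
  const median = sorted[Math.floor(sorted.length / 2)];
  const p95 = sorted[Math.floor(sorted.length * 0.95)];
  console.log(
    `mean ${mean.toFixed(1)} µs, median ${median.toFixed(1)} µs, p95 ${p95.toFixed(1)} µs`,
  );
}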