Shade - WebGPU graphics

Now it’s much better :slight_smile:

Looking forward to getting my hands on your engine and trying my point cloud rendering with FX in VR project on it.

1 Like

GI off ~80fps, GI on ~40fps
RTX4060, Win 11, latest Chrome, 1920x1080

1 Like

Another demo link

This is an actual arch viz scene done in Blender. I scaled down the textures to something reasonable, but otherwise the scene is as-is, warts and all.


The entire scene is ~170 MB with

  • 158 textures
  • 223 individual meshes
  • 446,446 triangles
  • 21 lights
  • 111 unique PBR materials
4 Likes

This is stunning, it looks amazing!! How are you managing draw calls in this scene, and how many do you have currently? Have you considered trying to use KTX2 CompressedArrayTextures to store and assign all materials’ separate texture channels as layers? How to create a Motion Vector map - #24 by vis_prime

a quote from the thread…

4 Likes

The number of draw calls is roughly 111, i.e. the number of materials.

This is not exactly true, as there is some fixed overhead.

Here’s a basic breakdown of how a frame is rendered in Shade:

  1. All instances (meshes) are filtered using a compute shader into 2 sets:
    a. Visible (passing the frustum check and a conservative occlusion check)
    b. Potentially visible (“maybe” set)
  2. All visible instances are expanded to meshlets, again, we have “visible” and “maybe” groups
  3. All visible meshlets are expanded to triangles, same story with “visible” and “maybe” sets
  4. All visible triangles are now rasterized into a visibility buffer, an rg32uint texture with mesh_id and triangle_id. The actual rasterizer is dead simple, about as complex as a depth pre-pass shader.
  5. Using what we rasterized, we build a depth pyramid; this is the basis for the occlusion testing mentioned above
  6. Using the current depth pyramid, we process the “maybe” sets; this gives us the remaining visible triangles for this frame
  7. We rasterize what we filtered in the previous step
  8. We once again rebuild the depth pyramid; this will be used in the next frame for steps 1-4

At this point we have a Visibility Buffer and we have spent 2 draw calls on actual geometry drawing so far. We also spent something like 20 draw calls on the depth pyramid, but that’s relatively cheap since, as with mip mapping, all the downsampling passes combined only touch about a third as many pixels as the full screen (1/4 + 1/16 + …).
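To make the “visible” / “maybe” split a bit more concrete, here’s a rough TypeScript sketch of the classification in steps 1-6; in Shade this runs in compute shaders on the GPU, and all names here are illustrative rather than the engine’s actual code.

```ts
// Illustrative sketch of the visible/"maybe" split, assuming a conventional
// 0 = near, 1 = far depth range. Not Shade's actual code.
type CullSet = "visible" | "maybe" | "culled";

interface DepthPyramid {
  // Farthest depth over a screen-space rect, sampled at a mip level coarse
  // enough that the rect only covers a few texels.
  maxDepthOver(rectMin: [number, number], rectMax: [number, number]): number;
}

function classify(
  inFrustum: boolean,
  rectMin: [number, number],   // projected screen-space bounds of the instance/meshlet
  rectMax: [number, number],
  nearestDepth: number,        // closest depth of those bounds
  lastFramePyramid: DepthPyramid,
): CullSet {
  if (!inFrustum) return "culled";
  // Conservative occlusion check against *last* frame's pyramid: if the nearest
  // point of the bounds is closer than the farthest depth stored over its
  // footprint, it cannot be fully occluded by what was drawn last frame.
  if (nearestDepth <= lastFramePyramid.maxDepthOver(rectMin, rectMax)) {
    return "visible"; // rasterize now (step 4)
  }
  // Possibly occluded, or possibly disoccluded since last frame:
  // re-test in step 6 against the pyramid built from this frame's depth.
  return "maybe";
}
```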

Next is the material pass: we fetch mesh_id from the visibility buffer and draw “depth” in a depth-only pass, where the “depth” value is the material ID for the mesh at that pixel.

Next we do a draw pass for each material, with the depth test set to equal and the depth value set to match the material ID. Essentially we abuse the depth-test hardware to get 0 overdraw. And I don’t mean that hyperbolically, like “virtually zero”; I mean that we run the material shader once per pixel, and only for pixels that are actually visible in the final render.
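For anyone curious what this looks like through the WebGPU API, here’s a hedged sketch of a per-material pipeline using depthCompare: "equal"; the entry points, formats, and MAX_MATERIALS are placeholders, not Shade’s real setup.

```ts
// Sketch of the "material ID as depth" trick. A prior depth-only pass has written
// materialId / MAX_MATERIALS into the depth buffer; each material then draws a
// full-screen pass whose vertex shader outputs that same depth, so depthCompare:
// "equal" lets the fragment shader run only on pixels owned by this material.
function createMaterialPassPipeline(
  device: GPUDevice,
  layout: GPUPipelineLayout,
  shader: GPUShaderModule,
): GPURenderPipeline {
  return device.createRenderPipeline({
    layout,
    vertex: { module: shader, entryPoint: "vs_fullscreen" },
    fragment: {
      module: shader,
      entryPoint: "fs_material",            // reads the visibility buffer, writes the g-buffer
      targets: [{ format: "rgba16float" }], // g-buffer targets abbreviated to one here
    },
    depthStencil: {
      format: "depth24plus",
      depthWriteEnabled: false,
      depthCompare: "equal", // exactly one material shader invocation per visible pixel
    },
    primitive: { topology: "triangle-list" },
  });
}
```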

As a result, the cost of texture switching is actually very low. Also, the material shader is uniform, meaning we don’t actually do any lighting here; we output a g-buffer instead, with things like roughness, albedo, normal, etc.

The advantage is that we scale incredibly well with material and texture counts as well as number of instances and geometry size, at the cost of high GPU bandwidth.

I can’t say if this is a good trade, as it obviously depends on your use case. But if you’re dealing with large scenes, and/or you want to run some post-processing, it’s definitely a huge win.

I did think about it; the problem is uniformity, you have to force every texture to have the same dimensions. I’m not exactly opposed to it, but it seems like a big ask. I already have texture resizing shaders that would make this transparent to the user, but the loss of quality due to scaling would be a nasty surprise.

One more issue is the layer count limit. I actually use a texture array for the ray-tracing path; that is, I have a special code path that does full inline ray tracing, and there you have to have access to all textures at the same time, so I pack them into a texture array.

However, even then I ran into an issue with larger scenes, like Lumberyard’s Bistro.

This scene has 400 textures

GPUDevice.limits.maxTextureArrayLayers is 256 by default

So, you simply can’t support larger scenes, full stop.

I had to get creative: I emulate texture sampling in the shader, skip mip maps, and treat the texture array as an atlas, meaning that I pack multiple textures per layer. This works alright for the ray-tracing path, as I mostly use it for global illumination, so the loss of texture quality and mips is quite acceptable there.
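As a rough illustration of the atlas-in-a-texture-array idea, here’s how a global texture index could map to a layer and a UV rectangle; the 2048 layer size is an assumption, not necessarily what Shade uses.

```ts
// Treating a fixed-size texture array as an atlas: every texture is squeezed into
// a 128x128 tile and multiple tiles are packed per layer, so the 256-layer limit
// is no longer the bottleneck. Sizes here are assumptions for illustration.
const TILE_SIZE = 128;
const LAYER_SIZE = 2048;                                // assumed layer resolution
const TILES_PER_ROW = LAYER_SIZE / TILE_SIZE;           // 16
const TILES_PER_LAYER = TILES_PER_ROW * TILES_PER_ROW;  // 256 tiles per layer

/** Map a global texture index to its layer and normalized UV rectangle. */
function tileForTexture(textureIndex: number) {
  const layer = Math.floor(textureIndex / TILES_PER_LAYER);
  const slot = textureIndex % TILES_PER_LAYER;
  const x = slot % TILES_PER_ROW;
  const y = Math.floor(slot / TILES_PER_ROW);
  const scale = TILE_SIZE / LAYER_SIZE; // 0.0625
  return { layer, uvOffset: [x * scale, y * scale], uvScale: scale };
}

// In the shader, a 0..1 UV on the original texture is remapped before reading the
// array (without mips): uvInLayer = uvOffset + fract(uv) * uvScale
```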

Here’s an example of what the ray tracer sees with all texture resolutions fixed at 128x128

It looks surprisingly good for such a low resolution, but this is not acceptable for the general use case.

So, in short, the best answer would be bindless textures, but alas, this is not part of the WebGPU spec, and it doesn’t look like we’re going to be getting bindless resources anytime soon.

The other alternative, which I consider workable, would be virtual textures. In fact, virtual textures have the benefit of managing memory as well, since your “physical” texture is where sampling will be done, and it’s quite small, so you’re going to get way better cache utilization on the GPU. Virtual textures are hard though, and even though I have implemented them in the past, on WebGL, which is a less powerful API, it’s still a lot of work to do a proper solution, so it’s something to look into in the future.
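For reference, the indirection that makes virtual textures attractive looks roughly like this; it’s a generic sketch with made-up page sizes, not code from Shade or meep.

```ts
// Generic virtual-texture indirection: a page table maps a virtual page coordinate
// to a resident page in the small "physical" texture, and the final UV is rebuilt
// from that entry. Page and pool sizes are assumptions for illustration.
interface PageEntry { physX: number; physY: number; resident: boolean }

const PAGE_SIZE = 128;  // texels per page side (assumed)
const PHYS_PAGES = 32;  // physical texture is 32 x 32 pages (assumed)

function virtualToPhysicalUV(
  pageTable: (px: number, py: number) => PageEntry,
  u: number, v: number,
  virtualSize: number,  // virtual texture size in texels
): [number, number] | null {
  const vx = u * virtualSize, vy = v * virtualSize;
  const entry = pageTable(Math.floor(vx / PAGE_SIZE), Math.floor(vy / PAGE_SIZE));
  if (!entry.resident) return null; // page fault: request streaming, fall back to a coarser mip
  const physSize = PHYS_PAGES * PAGE_SIZE;
  return [
    (entry.physX * PAGE_SIZE + (vx % PAGE_SIZE)) / physSize,
    (entry.physY * PAGE_SIZE + (vy % PAGE_SIZE)) / physSize,
  ];
}
```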

Anyway, thanks for the interesting questions @Lawrence3DPK

2 Likes

Thanks for such an elaborate and clear reply, this is really insightful!

Going down somewhat of a rabbit hole here, could layers themselves be divided into sprite maps? For example, 30% of layers could be comprised of smaller texture atlases, let’s say 8 x 8 atlases of 256-squared textures, 30% could be 4 x 4 atlases at 512 squared, 30% 2 x 2 atlases at 1024 squared, and the remaining 10% as HQ 2048 textures? Even thinking about this almost hurts, and the idea only gets more challenging / complex, as it’d take a completely custom pipeline (file type specification?) to manage and assign every packed texture to their relative material / channel / atlas ID counterpart :hot_face:

1 Like

Yes, but whether that’s worth doing is somewhat arguable. The first issue is the packing; it’s not exactly free. I have a sophisticated packer implemented in meep which is incredibly fast, but it needs a lot of memory to keep its current state. Essentially this is an allocator problem: instead of 1D, as with most memory allocators, we are now going 2D in texture space, and then we say “hey, that’s not hard enough” and add layers, so it’s a 3D allocator.

Not to say that it’s impossible, it very much is possible, but I believe the complexity is not worth the trouble.

One simpler approach would be to just bin your layers: say, N layers are dedicated to texture slots of resolution X by Y, where X and Y are whole divisors of the layer resolution.

Of course one of the problems is that you kind of have to allocate all of that texture space ahead of time, to avoid re-allocations which would absolutely destroy your performance.

Your texture lookup gets messier as well: now you have to carry not only the slot, but also the size of the texture when trying to read it on the GPU. In my case, a single u32 ID is enough to reconstruct the UV bounds for a texture, because each slot is the same size.
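As a tiny sketch of how the two approaches differ, here’s a hypothetical packed reference for mixed slot sizes; the 2-bit size class and the layout are made up, purely to contrast with the single uniform-slot index.

```ts
// Hypothetical packed texture reference for mixed slot sizes: the top 2 bits pick
// a size class (256/512/1024/2048 assumed), the rest is the slot index. With
// uniform slots, the plain index alone is enough to rebuild the UV bounds.
const SLOT_MASK = (1 << 30) - 1;

function packTextureRef(slot: number, sizeClass: number): number {
  return ((sizeClass << 30) | (slot & SLOT_MASK)) >>> 0;
}

function unpackTextureRef(ref: number): { slot: number; sizeInTexels: number } {
  return { slot: ref & SLOT_MASK, sizeInTexels: 256 << (ref >>> 30) };
}
```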


But as much as it’s an interesting thought experiment, consider 2 more things:

  1. Texture caches. Say you’re reading a texel for a texture in slot 0 and a texel for a texture in slot 99; they are probably in different layers, and not even adjacent layers. Your cache utilization is going to be bad, the hardware does not expect you to access textures like that; rather, it’s optimized for something else. The problem gets worse the bigger the overall underlying texture array.
  2. Sampling. I implemented software sampling, but it’s almost an order of magnitude slower than hardware sampling. In fact, I don’t even do trilinear sampling; if you add that, it would be even worse. Now, texture sampling is fast anyway, so that cost is pretty trivial in the grand scheme of things, but these samples add up (a sketch of the per-sample work is below).
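To give a sense of the per-sample work involved, here’s a minimal single-channel bilinear filter in plain TypeScript; it only illustrates the arithmetic, it’s not Shade’s shader code.

```ts
// Manual bilinear filtering: four texel fetches plus three lerps per sample,
// which the fixed-function hardware sampler otherwise does for free.
// Edge clamping and layer/atlas addressing are omitted for brevity.
type Fetch = (x: number, y: number) => number; // single-channel texel fetch

function sampleBilinear(fetch: Fetch, u: number, v: number, width: number, height: number): number {
  const x = u * width - 0.5;
  const y = v * height - 0.5;
  const x0 = Math.floor(x), y0 = Math.floor(y);
  const fx = x - x0, fy = y - y0;
  const lerp = (a: number, b: number, t: number) => a + (b - a) * t;
  const top = lerp(fetch(x0, y0), fetch(x0 + 1, y0), fx);
  const bottom = lerp(fetch(x0, y0 + 1), fetch(x0 + 1, y0 + 1), fx);
  return lerp(top, bottom, fy);
}
```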

And that’s to say nothing of data types and channel counts. For example, say you have an HDR texture; you’d be hard-pressed to put it into an rgba8unorm texture array. Or you might have a single-channel texture, like roughness or a light map for example; you’d be wasting 3/4 of the space on unused channels.

I’m all for over-engineering things in general, and I think it would be an interesting research project, but in my experience the more you fight against the hardware and the API (WebGPU and GPU architecture), the more pain you incur. And the less likely you are to beat the defaults.

That said, if you go down this route, you might enable some use cases that are simply impossible without it, such as ray tracing. There’s a lot of fun and cool stuff to be had here, but I think you really need these unique and special use cases to make it worth it.

2 Likes

Added vertex color support, not a huge thing since it’s somewhat of a niche tool. If you’re in the CAD world, material color tends to be a better choice, and when you’re in a PBR flow, vertex color is just unnecessary.

But either way, it’s there:

Here’s the “before”, and here’s the version with vertex color support:

Pretty close to three.js too. Here’s a comparison with the gltf-preview; first is three.js r176

Here’s the current version of Shade

Also implemented stochastic PCSS shadows, for now just directional. This is for devices where ray-traced shadows are too much of a burden. It’s somewhat funny that PCSS shadows with just 5 “nearest” taps and TAA work incredibly well.
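Here’s a hedged sketch of the idea; the tap positions, tap count, and penumbra estimate are illustrative, the real shader differs.

```ts
// Stochastic PCSS sketch: a handful of shadow-map taps per pixel, rotated by a
// per-pixel random angle each frame, with TAA averaging the noise over time.
const TAPS: [number, number][] = [
  [0.0, 0.0], [0.7, 0.2], [-0.5, 0.6], [0.1, -0.8], [-0.6, -0.4],
];

type ShadowFetch = (u: number, v: number) => number; // 1 = lit, 0 = occluded (depth compare done inside)

function stochasticShadow(
  fetch: ShadowFetch,
  u: number, v: number,
  penumbraRadius: number, // in shadow-map UV units, from the PCSS blocker search (contact hardening)
  randomAngle: number,    // per-pixel, per-frame (e.g. blue noise); TAA resolves the noise
): number {
  const c = Math.cos(randomAngle), s = Math.sin(randomAngle);
  let sum = 0;
  for (const [ox, oy] of TAPS) {
    const rx = (ox * c - oy * s) * penumbraRadius;
    const ry = (ox * s + oy * c) * penumbraRadius;
    sum += fetch(u + rx, v + ry);
  }
  return sum / TAPS.length; // 0..1 shadow factor
}
```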

Here are a couple of shorts of just the shadows



Here are a couple of things to note

There are shadows from very far away, so they are quite blurry; contact hardening is working as intended.

And just to prove that there’s nothing untoward going on, I’m drawing shadows at a very low resolution; here’s the same shadow map without filtering:

Other than this, fixed a few bugs, and managed to get rid of an extra render target, which reduces bandwidth by a bit.

Finished implementing spotlight support as well; integrating spotlights into clustered rendering is a bit of a pain.

That’s about it for now.

In case anyone is interested, my engine is very opinionated when it comes to rendering: the vertex format and layout are fixed, which makes a lot of the code much simpler than it would have to be otherwise. Of course, I don’t want to limit the user in what data they can bring in, so there are efficient converters that transform the data at load time into the right form. This is how I can get away with using just 8 bytes for the entire tangent frame, for example (vertex normal + vertex tangent), where normally you would use 12 bytes just for the normal and another 16 for the tangent.
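One common way to get a tangent frame down to 8 bytes is octahedral encoding into four snorm16 values; this is an assumption on my part, not necessarily how Shade packs it, and the bitangent sign still needs a bit stolen from somewhere.

```ts
// Octahedral encoding of a unit vector into two values in [-1, 1]; normal and
// tangent each take two snorm16 components, i.e. 4 x 16-bit = 8 bytes total.
// This is a generic illustration, not Shade's actual vertex layout.
function octEncode(n: [number, number, number]): [number, number] {
  const [x, y, z] = n;
  const invL1 = 1 / (Math.abs(x) + Math.abs(y) + Math.abs(z));
  let u = x * invL1;
  let v = y * invL1;
  if (z < 0) {
    // Fold the lower hemisphere onto the upper one.
    const fu = (1 - Math.abs(v)) * Math.sign(u || 1);
    const fv = (1 - Math.abs(u)) * Math.sign(v || 1);
    u = fu; v = fv;
  }
  return [u, v];
}

const toSnorm16 = (f: number) => Math.round(Math.max(-1, Math.min(1, f)) * 32767);

function packTangentFrame(normal: [number, number, number], tangent: [number, number, number]): Int16Array {
  const [nu, nv] = octEncode(normal);
  const [tu, tv] = octEncode(tangent);
  return Int16Array.of(toSnorm16(nu), toSnorm16(nv), toSnorm16(tu), toSnorm16(tv));
}
```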

5 Likes

Out of curiosity… which factor(s) contribute to shadow alignment, and/or squareness?

  • does the workflow allocate predefined maps, which accumulate taps/bounces?
  • or is it like radar arrays, in which closer spectrum bands have more precision but also “dead zones” in the debris ball?

Since things appear square, it must be the former?