Shade - WebGPU graphics

For a while, I’ve had an idea to write a graphics engine. I like three.js, and I understand why it is the way it is, but I’ve always wanted something with the ease and elegance of three.js and the feature set of something like Unreal.

Since WebGPU was announced, it got me thinking that you could finally implement a fully GPU-resident renderer in a browser. This year I had some time, so I thought I would give it a try.

Before I go on, here are some mandatory pictures:





Why make another graphics engine?

I’ve had a set of very specific goals:

  • Occlusion culling. I don’t want to draw stuff that’s behind other stuff. This is one of the biggest performance killers for complex scenes. It’s so crucial that there is even a company that does nothing but occlusion culling and does very well for itself.
  • GPU draw dispatch. I don’t want to dispatch an individual draw command for each mesh. I’ve done some experiments, and on the web you’re limited to about 60,000 draw calls per frame (at 60 FPS), as long as you’re using the same material and geometry. If you switch geometries, that drops significantly, and if you switch materials as well, you’re in the ballpark of about 3,000 draw calls per frame. Of course the CPU, GPU, etc. will affect these numbers, and I’m on very high-end hardware, so typical numbers will be significantly lower. I want to send the entire scene to the GPU and let it draw all objects, with no communication with the CPU. This gives you the best possible GPU utilization and essentially removes the draw call limit, meaning you can draw even millions of objects per frame. Unreal has been so-called GPU-resident for ages now, and Unity seems to be pushing for it as well.
  • Deferred visibility-based shading. This basically means that we only want to shade pixels that will be visible on the screen, and eliminate all material and lighting calculations for pixels that will be occluded. This is what Unreal has been doing since 4.2, and it’s a really significant step toward allowing a large number of materials in a single scene.
  • Efficient and good-looking post-processing. Not much to say here, but I think a modern graphics engine has to come with things like SSAO, SSR and AA out of the box.
  • Global illumination solution. Again, I think turn-key global illumination is something that’s possible today, and as such it’s a very desirable feature for a modern graphics engine. You can see that if you look at the in-house graphics engine of any major AAA game studio.
  • Rock-solid shadowing solution. Not much to add here: shadows are a pain to tweak, and there is no technical reason for that today. There are solutions that remove that need.

As I said earlier, three.js is great, but as a project it is optimized toward being easy to understand and easy to work with. That goes against complexity, and complexity, unfortunately, is inherent in graphics. This means that under the current philosophy of three.js, it is unable to have complex features.

You can extend three.js quite far: you can add features and modify existing behavior, and I think the team has done an incredible job with that. In my experience three.js is getting more flexible and extensible every year. That being said, it is still quite rigid. And it has to be, in some sense: extensibility sacrifices performance, increases complexity and creates a headache for maintainers.

Since I got into three.js around 2013, I have been extending it: with shaders, with alternative lighting, with simulation systems and so on. But I would always run up against the limits of what I was allowed to do.

What is already achieved

GPU draw dispatch

Here’s a scene released by the Blender Foundation that has 40,765 individual pebbles:

Here’s the same scene in three.js r162:

You’ll notice that it runs at around 9.5 FPS.

Here’s another scene that looks pretty bad because of the Blender export, but it has 118,352 blades of grass and flowers:

Here’s Shade:


And here’s three.js again (FPS is 1.5 or so):

Just to reiterate: no matter how complex your scene is, it’s always O(1) work on the CPU side; all of the actual culling and draw command dispatch happens 100% on the GPU. This means that the only bottleneck is your GPU. If it can draw 100,000 objects per frame, that’s what you’ll get; if it can do 100,000,000, you’ll get that instead.
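If you’ve never seen what that looks like in raw WebGPU, here’s a rough sketch of the idea (illustrative names, not my actual code): a culling compute pass fills an indirect-argument buffer on the GPU, and the CPU records a single `drawIndexedIndirect` no matter how many objects survive.

```js
// Minimal sketch: the CPU records one indirect draw; a culling compute pass
// decides on the GPU how many instances it covers.
function createIndirectArgs(device) {
  // drawIndexedIndirect reads 5 u32s: indexCount, instanceCount, firstIndex,
  // baseVertex, firstInstance. The culling shader writes instanceCount.
  return device.createBuffer({
    size: 5 * 4,
    usage: GPUBufferUsage.INDIRECT | GPUBufferUsage.STORAGE,
  });
}

function encodeFrame(encoder, frame) {
  const { cullPipeline, cullBindGroup, drawPipeline, drawBindGroup,
          indexBuffer, indirectArgs, instanceCount, renderPassDescriptor } = frame;

  // 1. GPU culling: test every instance, append survivors into indirectArgs.
  //    (Resetting instanceCount to 0 at the start of the frame is omitted here.)
  const cull = encoder.beginComputePass();
  cull.setPipeline(cullPipeline);
  cull.setBindGroup(0, cullBindGroup);
  cull.dispatchWorkgroups(Math.ceil(instanceCount / 64));
  cull.end();

  // 2. Rendering: a single CPU-side call, however many objects survived.
  const pass = encoder.beginRenderPass(renderPassDescriptor);
  pass.setPipeline(drawPipeline);
  pass.setBindGroup(0, drawBindGroup);
  pass.setIndexBuffer(indexBuffer, 'uint32');
  pass.drawIndexedIndirect(indirectArgs, 0);
  pass.end();
}
```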

Occlusion culling

Here’s the Sponza scene:

We can see that culling is doing a good job: there are ~270,000 triangles in the scene, and we’re only drawing ~96,000 in the current view.

If we move the camera behind the pillar, we’re not going to see much of the complex geometry in the curtains, the lion head at the back etc:

And thanks to occlusion culling, we’re not drawing those triangles:


We have reduced the amount of work for the GPU by 85%.

The culling is conservative and heavily based on Activision’s and Epic’s published research, and computationally it’s almost free.
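The heart of that approach is a conservative depth-pyramid (HiZ) test: project the object’s bounds to a screen-space rect, pick a mip where the rect spans a couple of texels, and compare against the farthest depth stored there. A rough sketch of the test (WGSL embedded in a JS string; illustrative, not my actual shader, and assuming a 0-near/1-far depth convention with a max-depth pyramid):

```js
// Illustrative WGSL for the conservative HiZ visibility test.
const hiZTestWGSL = /* wgsl */ `
  // hiZ is a max-depth pyramid: each texel of mip N stores the farthest depth
  // of the screen region it covers. minPx/maxPx is the object's screen rect,
  // nearestZ is the closest depth the object can have (0 = near, 1 = far).
  fn hiz_visible(hiZ : texture_2d<f32>, minPx : vec2<f32>, maxPx : vec2<f32>, nearestZ : f32) -> bool {
    let sizePx = maxPx - minPx;
    // choose the mip where the rect spans at most ~2 texels, so 4 loads cover it
    // (clamping mip to the pyramid's level count is omitted for brevity)
    let mip = i32(ceil(log2(max(max(sizePx.x, sizePx.y), 1.0))));
    let scale = exp2(f32(mip));
    let lo = vec2<i32>(minPx / scale);
    let hi = vec2<i32>(maxPx / scale);
    var farthest = textureLoad(hiZ, lo, mip).r;
    farthest = max(farthest, textureLoad(hiZ, vec2<i32>(hi.x, lo.y), mip).r);
    farthest = max(farthest, textureLoad(hiZ, vec2<i32>(lo.x, hi.y), mip).r);
    farthest = max(farthest, textureLoad(hiZ, hi, mip).r);
    // the object can only be hidden if even its nearest point is behind
    // the farthest occluder depth in that region
    return nearestZ <= farthest;
  }
`;
```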

Deferred visibility-based shading

Not too much to say here; visually you can’t see much of a difference. It eliminates over-shading completely, which is huge. Another huge benefit is that there is no extraneous material switching: all pixels on the screen that should be drawn with a particular material are drawn at the same time, even if they span multiple separate meshes. Yet another benefit shows up in scenes with a high poly count, where you start running into poor quad utilization and end up shading pixels that never actually land on screen, due to how the shading hardware works in 2x2 quads.
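The underlying data structure is very simple: one integer per pixel saying which instance and which triangle covers it, and shading happens later, one material at a time. A tiny sketch of one possible packing (the bit split here is arbitrary, not my actual layout):

```js
// Illustrative packing for a visibility-buffer texel, as WGSL in a JS string:
// high bits = instance id, low bits = triangle id within that instance's geometry.
const visibilityWGSL = /* wgsl */ `
  const TRIANGLE_BITS : u32 = 12u;   // 4096 triangles per instance in this example

  fn pack_visibility(instanceId : u32, triangleId : u32) -> u32 {
    return (instanceId << TRIANGLE_BITS) | (triangleId & ((1u << TRIANGLE_BITS) - 1u));
  }

  fn unpack_visibility(v : u32) -> vec2<u32> {
    return vec2<u32>(v >> TRIANGLE_BITS, v & ((1u << TRIANGLE_BITS) - 1u));
  }
`;
```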

Efficient and good-looking post-processing

GTAO

So far I’ve got a FrameGraph system implemented. I’ve had a prototype lying around for years now, but I never managed to fit it into meep’s graphics pipeline; with a little bit of tweaking I got it to work in Shade with WebGPU, and I must say it’s a dream to work with. It automagically takes care of efficient resource allocation and aliasing. Some people talk about post-processing in three.js and say things like “let’s reuse FrameBuffer X for Y, and avoid an extra draw in Z”; well, FrameGraph just gives you all of that for free, you don’t even need to think about it. And if you want to turn something off, you can do it at runtime and it just works :exploding_head:
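If you haven’t met the pattern before, this is roughly the shape such a system takes (hypothetical names, not my actual API): passes declare what they read and write, and the graph figures out allocation, aliasing and which passes can be dropped.

```js
// `frameGraph` stands for a hypothetical instance of such a system; the method
// names are illustrative only.
frameGraph.addPass('gtao', {
  reads: ['depthPyramid', 'normals'],
  writes: [{ name: 'ao', format: 'r8unorm', scale: 1.0 }],
  execute(encoder, resources) {
    // `resources.ao` is a transient texture the graph allocated (or aliased with
    // another pass's memory) only because this pass feeds the final output.
  },
});

frameGraph.compile();        // drop unreachable passes, compute lifetimes, alias memory
frameGraph.execute(device);  // record the surviving passes for this frame
```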

I’ve implemented an SSAO solution following the GTAO paper and an excellent resource published by Intel. The solution uses temporal and spatial blue noise, so you get incredible value out of it; with just 9 taps you get pictures like these:


Here’s without:

Here’s another one:


And without:

Here are some screenshots of the pure AO without denoising:



One key trick used here is sampling a depth mip map; this way we can accumulate a much larger contribution with very few samples. With just 2 depth mips we can effectively sample 16 pixels with a single tap, and I’m using 5 mip levels, so we’re getting up to 1024 texels’ worth of contribution with each tap. To my knowledge, this is currently state of the art. The implementation is very cheap, cheaper even than what we have in three.js, and incredibly good-looking.
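Roughly, each tap picks a depth mip based on how far it sits from the pixel being shaded, so one fetch stands in for a whole neighbourhood of depth values. A sketch of that selection (WGSL in a JS string; illustrative, not the Intel reference code):

```js
// Illustrative mip selection for an AO tap against a depth pyramid.
const aoTapWGSL = /* wgsl */ `
  fn depth_mip_for_offset(offsetPx : f32) -> f32 {
    // mip 0 = full-res depth, mip 5 covers 32x32 = 1024 texels per fetch
    return clamp(log2(max(offsetPx, 1.0)), 0.0, 5.0);
  }

  fn sample_tap_depth(depthPyramid : texture_2d<f32>, s : sampler, uv : vec2<f32>, offsetPx : f32) -> f32 {
    // the farther the tap, the coarser the depth mip it reads
    return textureSampleLevel(depthPyramid, s, uv, depth_mip_for_offset(offsetPx)).r;
  }
`;
```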

TAA

I’ve looked at a few different options for anti-aliasing. Hardware multi-sampling is not an option in a deferred pipeline, so we have to look for alternatives. I’ve tried FXAA, and it just looks terrible, no surprise there. I wanted to go for TAA from the beginning, so after reading a whole load of papers I got something that doesn’t smear and has no visible ghosting. Here are two screenshots, with and without TAA:

Here’s with TAA on:

And here’s with TAA and mip-map bias:

The effect is somewhat subtle, but you can notice it in the book titles especially, as they become crisper and more readable.
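The mip-map bias itself is a standard companion to TAA: sample material textures with a small negative LOD bias and let the temporal accumulation clean up the extra aliasing it introduces. Something along these lines (WGSL in a JS string; the bias value is just a common starting point, not my tuned number):

```js
// Illustrative material sampling with a negative mip bias while TAA is active.
const materialSampleWGSL = /* wgsl */ `
  const TAA_MIP_BIAS : f32 = -0.5;   // sharper textures; TAA resolves the shimmer

  fn sample_albedo(tex : texture_2d<f32>, s : sampler, uv : vec2<f32>) -> vec4<f32> {
    return textureSampleBias(tex, s, uv, TAA_MIP_BIAS);
  }
`;
```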

For completeness, here’s the FXAA:

And here’s the GTAO + TAA:


Conclusion

It’s a new era for graphics engines. You can see the influence of traditional graphics engines all over even modern APIs like WebGPU, but for high-performance graphics you just have to go with a GPU-driven pipeline. The CPU is no longer in charge; most of your data doesn’t even live on the CPU anymore. It’s just a glorified butler for the GPU now.

I still don’t have a shadowing solution. I looked into CSM and virtual shadow maps, but I’m not entirely happy with either: the first is quite expensive in terms of draw calls, and the second, as it turns out, is a pain to write using WebGPU. I might go for ray-traced shadows with a bit of temporal reprojection, and maybe use upscaling to make it cheaper too.

As for global illumination, I plan to port my existing work with light probes over.

Overall I’m very happy with what I’ve got so far and I’m really excited about the future, both for this project and for 3D on the web in general.

PS:

Probably close to 90% of the shading model is lifted from three.js, so thank you everyone for your amazing work. I spent a lot of time researching the state of the art, and almost every time I would find that three.js already implements it, or that it’s not particularly feasible to do on the web.

39 Likes

Duuuude. You’re on a whole other level.

Where is your repo? I want to join Your discourse! BRING THE LINKS!!!

This is incredible.

5 Likes

Some people see Three.js as an engine, for me it’s a set of tools or framework. It has all the 3D math part, along with shader instructions, both front and backbuffer. I think it meets all needs, from game creation to web advertising. Of course, the final result and its performance depend on the programmer and the machine that will run it, just like any program that uses a GPU.

3 Likes

IMO something like this is the actual delivered promise of WebGPU.
This style of end-to-end / AZDO pipeline is what will bring the web into the future.

4 Likes

@manthrax
Thanks for the kind words

I’m not quite ready to show a working demo yet, and the project is not going open-source anytime soon. As the saying goes - “I’m too poor for that”.

Regarding the AA and texture filtering, I managed to fix up a few issues in the Bistro scene so that three.js would import it correctly. Here is as close as I could get to the same book shelf shot in three.js:

And as a reminder, here’s the same shot from Shade:

Here are a couple of highlights:

| Three | Shade |
| --- | --- |
| (image) | (image) |
| (image) | (image) |
| (image) | (image) |
2 Likes

A bit more on the rasterization process. I mentioned that the number of instances (Meshes) doesn’t matter anymore; you can draw as many as you like, provided your GPU can actually handle it.

Here’s an artificial scene with 1,000,000 individual Meshes. They all use the same geometry and material, but they are individual objects. There is no instancing or batching of any kind going on:

Each cube has 6 sides with 2 triangles per side, so that’s 12 triangles per instance and a total of 12 * 1,000,000 = 12 million triangles.

If we look at the stats from the rasterizer, it’s only drawing 1,090,040 triangles, or about 9% of the total, because the rest are occluded and we don’t waste resources trying to draw them.

If we look at the GPU utilization (I’m on an RTX 4090), it’s at ~34%:

And we’re running at 144 FPS which is my screen’s cap.

If we look at the CPU side of things in the profiling tab we see the following:

Each frame takes about 1.4ms, including post-processing. This means the CPU is free to do other work, such as processing input events or updating CSS.


Let’s add some material variation. Here’s the same benchmark, but now each mesh is assigned one of 1,000 randomly generated materials:


Let’s add geometry variation as well: now we generate 10,000 unique geometries and assign each mesh one of them:

We’re still at 144 FPS, and the CPU effort is about the same as before, at 1.13ms for a random sample:

Again, no instancing or batching takes place here. Each mesh is a fully dynamic object: you can move it, change its material or geometry, remove it, etc.

11 Likes

Working on shadows. I looked into various shadow mapping techniques, including virtual shadow maps. I was originally going to go with virtual shadow maps, but after working on them for a while I came to the realization that they are very complex and it’s difficult to avoid drawing parts of the scene multiple times. I didn’t like that, and I didn’t like the complexity involved.

For the past few days I’ve been dusting off my path tracing code from meep, from back when I was working on global illumination, and with a little bit of effort I got ray tracing to work.

Here are a few shots with ray-traced shadows:

Here are a few close-ups:



Here’s an even closer zoom on the spider’s leg:



As you can see, shadows are pixel-perfect and there’s no peter-panning going on.

All in all, I’m pretty happy with the results. Performance seems to be a bit of an issue, as my BVH traversal was not optimized for the GPU. My path tracer was originally written for the CPU, and it shows :slight_smile:

That said, performance is okay in most scenes. I feel this is a good approach overall, even if there is no RTX API available in WebGPU at the moment. We can get within ~20% of a dedicated API’s performance, according to some numbers I read in an Intel paper a while ago. And considering that hard shadows only require a single ray per pixel, it’s quite cheap.
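The per-pixel side of hard ray-traced shadows really is tiny; all of the cost hides in the BVH traversal, which is stubbed out in this sketch (illustrative WGSL in a JS string, not my actual shader):

```js
const shadowRayWGSL = /* wgsl */ `
  struct Ray { origin : vec3<f32>, dir : vec3<f32>, tMax : f32 }

  // Stub standing in for the real any-hit BVH traversal, which is where all
  // the cost (and all the heatmap red) comes from.
  fn trace_any_hit(ray : Ray) -> bool { return false; }

  fn shadow_factor(worldPos : vec3<f32>, normal : vec3<f32>, lightPos : vec3<f32>) -> f32 {
    let toLight = lightPos - worldPos;
    let dist = length(toLight);
    // nudge the origin along the normal to avoid self-intersection ("shadow acne")
    let ray = Ray(worldPos + normal * 1e-3, toLight / dist, dist - 1e-3);
    return select(1.0, 0.0, trace_any_hit(ray));  // 1 = lit, 0 = in shadow
  }
`;
```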

I’m thinking of adding SDF ray-marched shadows later on, as they should be significantly cheaper, and the technique is proven to work well in production: Unreal has been using it for a number of years now.

The goal is to offer a shadowing solution that requires no tweaking and looks good, and I think I’m close to achieving that already.

PS

For the sharp-eyed among you: some of the screenshots show 42 FPS. That was a software cap I introduced for prototyping, to keep my room a bit less hot.

Why 42? - why not? :person_shrugging:

10 Likes

I did a bit more work on shadows and managed to get performance to a good point.

While doing that, I did some debugging to help me understand what’s going on. It turns out the rays that don’t produce any shadow are the most expensive ones. It makes sense if you think about it, but somehow it was a discovery for me.

Here are a few heatmaps using the inferno color scale; “hot” areas have more intersection tests and cold areas have fewer. The scale here goes up to 1024 tests.





And here’s something close to a worst-case scenario using the Rungholt test scene:



Here are a few more:


Conclusion

I think ray traced shadows are viable. There are a few extra things I want to do in the future, such as

  • variable-rate shading
  • upscaling
  • occluder caching
  • ray length limiting, with a cheaper fallback such as SDF

And a few things I can’t even remember now. Going to be moving onto global illumination now.

8 Likes

B"H
Where’s the code? You owe it to us after an intro like that!

1 Like

        ↓

3 Likes

:frowning: that’s pretty disappointing. Hopefully one day?

1 Like

This is insane! Looking forward to seeing where it goes.

Do you have an idea of what the performance is like on a more mid-tier device / mobile?

Is depth peeling something you are going to tackle?

1 Like

Not at the moment. It’s a technique that requires more rasterization, and as optimized as my rasterization pipeline is, I don’t think depth peeling gives you enough to justify the cost, at least not in the use cases I’m targeting. That, and the fact that depth peeling requires a fairly large amount of extra memory for depth slices.

In general, transparency is not handled at all right now. It will be handled in the future, first via alpha masking and then with a proper forward transparency pass.

What I’m doing is quite specific, and in a way it has to be: I’m just one person, and implementing every feature that a competitive graphics engine could have is just not in the cards for me. So I have to focus on a few features that serve as many use cases as possible ( :firecracker: / :dollar: ) and do them well.

In fact, I’d say that about 70% of the time so far has been spent on making the work go faster: creating abstraction layers that speed up development, automating repetitive tasks, and in general trying to keep the code base small and understandable.

I implemented a fat abstraction on top of the WebGPU API to handle BindGroups and Pipelines, which resulted in the code that uses them being ~10% of what it would be otherwise.
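To give a flavour of what that kind of abstraction buys (a hypothetical helper, not my actual API, and only covering buffer bindings): you describe the resources once, and both the layout and the bind group are derived from the same description.

```js
// Hypothetical helper: derive a GPUBindGroupLayout and GPUBindGroup from one spec.
function createBoundResources(device, spec) {
  const layout = device.createBindGroupLayout({
    entries: spec.map((r, i) => ({
      binding: i,
      visibility: r.visibility ?? GPUShaderStage.COMPUTE,
      buffer: { type: r.type ?? 'storage' },
    })),
  });
  const bindGroup = device.createBindGroup({
    layout,
    entries: spec.map((r, i) => ({ binding: i, resource: { buffer: r.buffer } })),
  });
  return { layout, bindGroup };
}

// usage (buffers assumed to exist elsewhere):
// const { layout, bindGroup } = createBoundResources(device, [
//   { buffer: instanceBuffer, type: 'read-only-storage' },
//   { buffer: indirectArgs },
// ]);
```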

I created a caching abstraction on top of textures, buffers, bind groups, pipelines and shaders, so I don’t have to worry about actual handles and memory management directly. Three.js in fact does something similar under the hood, for much the same reason I suppose.

Anyway, back to the point. I’m trying to keep the feature set, as well as the scope, small. I’m also trying to avoid divergence: my engine will not have an arbitrary post-processing solution, it will ship with just the one anti-aliasing solution, just one SSR/SSAO, etc. Same with the shadowing and lighting solutions.

I’ve done a bit of testing on an old integrated VEGA graphics card, using a laptop from 6 years ago or so. I think it was the second generation of VEGA, but I don’t remember off the top of my head.

Just the rasterization + lighting was running at a stable 60 FPS on Sponza. Which doesn’t say too much, as post-processing was off and there were no shadows yet.

The goal is to target mid-to-high-end hardware for now. I want the engine to run well on mobile, but on something around the iPhone X generation and above.

I’m adding various variables in the codebase that allow performance to be tweaked later on, but I’m not yet focused on optimization for different devices.

It’s just economics and a skill issue. I’m not rich enough to work on this for free, and I’m not skilled enough at community management to make a successful open-source project out of it.

There’s also the complexity. I love complex problems, but they don’t fit well with casual open-source development.


Here’s one example:

Others and I have advocated for a spatial index to be included in three.js since around 2018. There is still no spatial index in three.js, even though it would provide a very significant performance improvement.

Why?

  • Because it would make the codebase more complex and there would be no one to maintain that part of the codebase.

Here’s another example:

Light probe volumes have been discussed on the three.js GitHub since early 2019. There has been a lot of excitement, and very smart people have participated in working on it, such as Ben Houston, @donmccurdy , Langley West and others.

What do we have to show for it?

  • Only the LightProbe class, which is the smallest part of the entire puzzle.

Why is that?

  • It’s complex. That is, the topic itself is complex. The number of people who can understand it is very, very small, and when you consider only contributors to three.js, the number is even smaller. You would need smart people to spend a lot of time understanding the topic and implementing a viable solution. That takes time, time that is not paid for. And then, who will maintain it?

I could go on, but I think my point is pretty clear. Does that mean that three.js is :poop: and the people who contribute to the project are dum-dums? Not at all. three.js is simple, and that’s a strength. If it were more complex, it would turn away many potential contributors and cause the project to deteriorate over time as features break from version to version with no one to support them.

I’m not a genius, I’m just stubborn and motivated. But this approach is not very suitable for open-source.

8 Likes

You have done very impressive work so far, bravo @Usnul !
I have to ask: do you have an end goal for these efforts? You said not open source. So will it remain a project that only you will ever use? Or would you eventually license the engine somehow?

I’m also curious what language you are using. JS, Typescript or other? I myself am a big fan of Haxe, which has an impressive number of end target platforms. I use it with Three.js and would love to see the 3-D options grow, especially ones written in Haxe.

1 Like

Hey @Confidant

The goal is to license in the end, yes.

Regarding Haxe, I’m not very familiar with the language, but from what I’ve seen - I like it a lot. The community that’s developing the language seems like a likable bunch too.

I thought about what language to use and settled on JS. If this were a project for a larger team, I’d probably go with TS.

Why JS? There are basically 3 options when it comes to the web, if you want to stay somewhat native:

  1. JavaScript
  2. TypeScript
  3. WebAssembly

TypeScript is lovely, but it has the small hurdle of needing to be compiled to JS first, so I decided against it for now. I type-annotate pretty much every piece of JS that I write using JSDoc, so I don’t lose a lot by skipping TypeScript. My code does end up being more verbose and there’s more to write as a result, but that’s something I’m quite used to at this point.

The main thing I miss from TypeScript is generics; JSDoc offers similar functionality, but in a severely diminished form.
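For anyone unfamiliar, this is the flavour of JSDoc typing I mean; `@template` is about as close as JSDoc gets to a generic parameter (the function itself is just a throwaway example, not Shade code):

```js
/**
 * Pick the item with the highest score, or undefined for an empty array.
 * @template T
 * @param {T[]} items
 * @param {(item: T) => number} score
 * @returns {T | undefined}
 */
function pickBest(items, score) {
  let best, bestScore = -Infinity;
  for (const item of items) {
    const s = score(item);
    if (s > bestScore) { bestScore = s; best = item; }
  }
  return best;
}
```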

WebAssembly is an interesting option, and I think there is a lot going for it; the project itself is still in active development. WASM is typically slow to compile, depending on the source language, it has an inferior debugging experience, and it ultimately takes the browser API away from you. Oh, sure, you can get it back with hooks, and they are pretty much given to you by default with any X-to-WASM compiler, where X (formerly known as Twitter) is your favorite language of choice.

Finally, there is a fault in WASM that just doesn’t sit well with me: it pretends to be a C target, but it really isn’t. You don’t have access to the typical C platform features such as multithreading, SIMD, etc. In terms of speed, WASM doesn’t seem to be faster than JS in any meaningful way. When my JS is too slow, I can just rewrite it to be more “C-like” et voilà, it runs about as fast as WASM.

There are cases where WASM is just straight-up faster, but they are very niche.

So in the end I chose to go with JS.

PS:

I know some people will be upset about me saying that there’s no SIMD or multithreading in WASM. There is SIMD, but not on all platforms, so it’s unreliable, and I don’t count WebWorkers as multithreading: they aren’t threads in the way most languages model threads, so if you use them as such you will just be wasting CPU cycles and thrashing memory.

4 Likes

This looks great. I think for what it adds, it is totally reasonable for it to be a commercial product; it will be interesting to see how much you ask for and what your licensing model is.

For those of us who already have large ThreeJS projects but want the advantages you bring, how feasible will it be to swap this into an existing setup?

1 Like

The main focus here is on larger and more complex scenes. If you have a shoe or a t-shirt to show off on your e-commerce platform, what I’m creating is likely to be more hassle than it’s worth for you.

If you have a large CAD model or a complex scene, that’s a different story. For those, Shade’s rasterization will be a clear win. I’d say anything above 100,000 polygons will already be worth it.

For scenes with a lot of materials this will be a win as well.

For scenes with many individual instances, above, say, 1,000 instances or so, this will be a win, because Shade’s rasterization pipeline doesn’t really care how many individual objects you have.

Overall, the engine is designed for heavy workloads.

Same story for materials: if you have a large number of different materials in use, you will see a clear performance improvement, especially for complex materials.

Other than that, it’s a turn-key solution. You can set up three.js to look beautiful, but it takes careful consideration of various parameters, such as light positions, intensities, shadow camera position/size, etc. Shade is intended to look great without tweaks.

So far the API is not too different from three.js. There are a few key differences, but it will still feel familiar to anyone used to three.js. There are your typical things like:

  • Camera
  • Scene
  • Mesh
  • Material
  • Geometry
  • Light

I intend to offer a lightweight translation layer as well, so you can supply three.js Scene and Camera objects for rendering and the translation layer will take care of the rest. I don’t think that will be the best way to use Shade, but it should be a good starting point for most people.
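Purely to illustrate the intent (the adapter class and its method names below are made up, nothing like this exists yet): you would keep authoring the scene with three.js and hand it over for rendering.

```js
import * as THREE from 'three';

// Regular three.js scene setup, exactly as you have it today.
const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(60, innerWidth / innerHeight, 0.1, 1000);
scene.add(new THREE.Mesh(new THREE.BoxGeometry(), new THREE.MeshStandardMaterial()));

// Hypothetical translation layer: name and API are illustrative only.
const renderer = new ShadeThreeAdapter({ canvas: document.querySelector('canvas') });
renderer.render(scene, camera);
```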

2 Likes

Have I missed you mentioning the ETA for v1 of this?

I’m really glad that you are working with WebGPU. We all need to be migrating from WebGL to WebGPU as fast as possible and examples like these show the advantages and possibilities of doing so.

No, but I think I will have some live demos to put out in a month or so.

Meanwhile, I made some improvements to the material model, here are a few new screenshots:


Currently I’m working on a path tracer to enable global illumination. It turns out WebGPU’s limits are a lot closer than they seem at first, once you try to write something that requires a lot of different types of data, such as a path tracer.

For context, WebGPU allows you to use storage buffers inside a shader. These are roughly a flat structure or an array of dynamic size. This is pretty good and fits well with how a GPU works, but the problem is that you’re limited to 8 of them per shader stage by default, and on most platforms you can only push that up to 10. That seems like a lot at first, but it really isn’t.

Consider a path tracer: you need the entire scene’s data available inside a single compute shader, because to trace a single path you need to know about instances, geometries and materials, as well as lights. You typically want each kind of input data to live in its own buffer. Here’s what that looks like in terms of arrays:

  1. array of Mesh Instances
  2. array of Geometries
  3. array of Materials
  4. array of Geometry attributes
  5. array of Geometry indices
  6. array of Lights
  7. array of BVH nodes for the scene (technically referred to as TLAS or Top-Level Acceleration Structure)
  8. array of BVH nodes for each individual geometry (BLAS or Bottom-Level Acceleration Structure)
  9. array of pointers for each geometry into the BLAS, this is basically a lookup table.

That’s already 9, and this is the bare minimum of what you would need. Add the remaining inputs to this and you’re at the limit. A compute shader is practically useless if it doesn’t output anything, so you also need to reserve at least one buffer for output. And herein lies the problem: you’re out of bindings.
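For reference, the limit in question is `maxStorageBuffersPerShaderStage`, which defaults to 8 and can only be raised as far as the adapter allows; a quick sketch of querying and requesting it:

```js
// Query the adapter and ask the device for as many storage buffers per stage
// as it will give us (the spec default is 8; adapters rarely go much past 10).
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice({
  requiredLimits: {
    maxStorageBuffersPerShaderStage: adapter.limits.maxStorageBuffersPerShaderStage,
  },
});
console.log(device.limits.maxStorageBuffersPerShaderStage); // typically 8..10
```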

In the WebGL era we would just pack everything into textures and basically build our own memory and data model inside the shader. It was very inefficient, and you had some of the same limitations: texture slots are limited too.

Not to say that things are as bad as in WebGL, they are not, but there are some serious limitations.

2 Likes