Dr. Strangelight or: How I Learned to Stop Worrying and Love the Cluster

Hey guys,

I’ve been working on a forward+ renderer for a while and wanted to show off some of the results so far.

updated link

The scene is the recognizable Sponza with 1000 moving point lights.

Original demo

Link


The scene contains 500 torus knots and 1000 point lights scattered around.

Controls


Navigation is via mouse using the standard OrbitControls: left mouse drag to look around, right mouse drag to pan, mouse wheel to zoom in/out. There are also 3 interactable things on the screen. The buttons in the bottom left toggle different render layers; of interest are 2 and 3, which contain most of the scene. Layer 2 contains a debug material that shows a heatmap of the number of lights in each cluster.
The top left corner contains a toggle for the 2d debugger that shows cluster light counts.
Finally, the bottom right corner contains a rendering of the cluster texture itself.

Motivation


There have been quite a few tiled and even clustered renderer projects in the past, but they all have various limitations, like a maximum number of lights per cluster/tile or a hard limit on the total number of lights. My approach is more flexible: a 3d texture encodes the clusters, with 2 levels of de-referencing to keep the memory requirements of the solution down.

I’m trying to build an extension to WebGLRenderer that would just work out of the box with existing materials in three.js. I tried going down the deferred rendering path, but three.js doesn’t currently support multiple render targets, so that’s a no-go. Forward+ is the only logical alternative. The tiled forward+ example was a great starting point; adding clusters introduces another dimension for culling, which is a logical extension. My implementation is fully configurable in clustering resolution; the example linked here uses a 24x12x12 resolution, that is, a total of 3456 unique “clusters” in the scene.

Optimizations


One major difference is the precise frustum/sphere intersection testing. That one took a couple of days to fully figure out. What this means is that my implementation does not have false positives. Pretty much every other clustered renderer is satisfied with the standard aabb/frustum intersection test, which overestimates the bounds of the sphere by around ~47.6% from the outset, as well as producing a massive number of false positives per cluster because of an inherent flaw in the multi-plane intersection test.
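For reference, here is a sketch of the standard multi-plane test being criticized. It is conservative (it never misses a real intersection), but near frustum corners and edges it accepts spheres that don’t actually touch the volume. The plane format and all names are illustrative, not the actual implementation:

```javascript
// Conservative sphere-vs-frustum test: a sphere is rejected only if it
// lies fully behind one of the planes. Planes are { nx, ny, nz, d } with
// normals pointing inward. This is the test that produces the
// false positives mentioned above.
function sphereIntersectsFrustumConservative(cx, cy, cz, r, planes) {
  for (const p of planes) {
    const dist = p.nx * cx + p.ny * cy + p.nz * cz + p.d;
    if (dist < -r) return false; // fully behind one plane: definitely outside
  }
  return true; // may still be a false positive near corners/edges
}
```

For example, a sphere sitting just outside a corner passes every individual plane check and is counted as intersecting, even though it isn’t.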

References


Here are some links I found very useful during my research in no particular order:

22 Likes

everyone: [building our cute, tiny games, engines, and workflow optimisations to bring quality of life improvements to our happy, little development community]
meanwhile @Usnul: So, anyways, here’s a thermonuclear renderer with 3 depths of quantum-computing layers from Mars - it also transcends time and can semi-accurately render most scents and flavours. :upside_down_face:

Already started reading the linked PDFs - but could you describe briefly how it’s working, how the 3D texture (afaiu) is translated into lights?

6 Likes

Sure, that’s pretty simple to understand, just hard to convey properly. The idea is this:

You have the view frustum, this is a truncated pyramid.

image

You transform that into a cube, because that’s what the fragment shader works in. To a fragment shader your geometry is all inside a cube, mapped from the view frustum.
image

You then proceed to slice this cube along each axis, X, Y and Z forming “clusters”:
image

Inside the material shader, you can now very easily figure out which one of these little boxes (clusters) the fragment that you are shading is located in. In fact, you can do this in constant time O(1) if you ignore the hardware specifics.

Clustered rendering takes advantage of this by encoding all the light information into these clusters, so that when you need to compute lighting for a pixel (fragment), you only need to consider lights that affect that cluster; you don’t need to consider all lights in the scene. This is also an advantage over standard forward rendering, which usually hard-codes the number of lights into the shader as a constant; for example, in three.js adding or removing a light triggers a re-compilation of all shaders currently being rendered, which is a fairly expensive process.
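To make the lookup concrete, here is a minimal sketch of the cluster index calculation, assuming uniform slicing along each axis and NDC coordinates in [-1, 1]. The function and parameter names are illustrative, not the actual implementation:

```javascript
// Map a fragment's NDC position to a flat cluster index.
// Uniform slicing along all three axes; real renderers often slice Z
// non-uniformly (e.g. exponentially) to counter perspective distortion.
function clusterIndex(ndcX, ndcY, ndcZ, resX, resY, resZ) {
  const clamp01 = (v) => Math.min(0.999999, Math.max(0, v));
  const ix = Math.floor(clamp01((ndcX + 1) / 2) * resX);
  const iy = Math.floor(clamp01((ndcY + 1) / 2) * resY);
  const iz = Math.floor(clamp01((ndcZ + 1) / 2) * resZ);
  // flatten the 3d coordinate into a single index, O(1) per fragment
  return ix + iy * resX + iz * resX * resY;
}
```

With the 24x12x12 resolution from the demo, this yields indices 0 through 3455, one per cluster.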

8 Likes

To help illustrate the clusters, here’s a scene with 1 light tightly fitted into the view frustum; in the bottom right corner you can see the 3d cluster texture:

If we increase the clustering resolution to 64x64x64, it becomes more obvious:
image
You can see the single light being represented as a voxelized sphere inside the frustum.

The un-intuitive thing is that the sphere is not a sphere under projection: you can see that it becomes egg-shaped along the Z axis, because under perspective projection it gets smaller and smaller with distance. Here’s the view of the same cluster, but from above (looking down at the camera frustum).
image
To illustrate it better, here’s the same thing with a FOV of 50 degrees (up from the previous 30):
image

Because this stuff is hard to understand without visualization, I created debug tools for myself. Here’s the 2d visualization with a counter in each cell, as well as a heat map showing how many clusters along the depth (Z) axis contain some light information:

8 Likes

I think I love these 2 replies as much as the original showcase - if not even more. :grin:

Thanks for an amazing, in-depth explanation!

3 Likes

@Usnul
You’ve got an incredible ability to explain a complex concept in a way that even I can understand :slight_smile:

4 Likes

This is really impressive and it works like a charm on my laptop! Also thank you for the explanation, I hope someday I’ll reach your level…

Originally I tried your renderer on my mobile and I had some issues though. Are you aware of any hardware-specific bugs?

Here is a screenshot from my mobile (HTC U11 EYES) on Chrome for Android:

Most geometries are not lit, and when they are, it seems to betray a bug with the clusters. You can see one geometry on the bottom left of my screen that’s partially lit in yellow. The lit parts look right, but it’s cut off abruptly.

I got the logs for about 10 seconds after page load, if it helps:

bundle.js:1 DataTexture constructor
Vr @ bundle.js:1
(anonymous) @ bundle.js:1
(anonymous) @ bundle.js:1
n @ bundle.js:1
(anonymous) @ bundle.js:1
(anonymous) @ bundle.js:1
bundle.js:1 DataTexture constructor
Vr @ bundle.js:1
(anonymous) @ bundle.js:1
(anonymous) @ bundle.js:1
n @ bundle.js:1
(anonymous) @ bundle.js:1
(anonymous) @ bundle.js:1
bundle.js:1 DataTexture constructor
Vr @ bundle.js:1
(anonymous) @ bundle.js:1
(anonymous) @ bundle.js:1
(anonymous) @ bundle.js:1
n @ bundle.js:1
(anonymous) @ bundle.js:1
(anonymous) @ bundle.js:1
bundle.js:1 done
bundle.js:1 Dr {uuid: "8DFA7800-2ED1-412B-A0F1-C7FBD9493F3C", name: "", type: "PerspectiveCamera", parent: null, children: Array(0), …} ea {domElement: canvas, debug: {…}, autoClear: true, autoClearColor: true, autoClearDepth: true, …} ia {uuid: "F45E95BD-A4CD-4F53-B61E-0DB5B8723EBA", name: "", type: "Scene", parent: null, children: Array(3), …}
bundle.js:1 positions are undefined, assuming uniform distribution
write @ bundle.js:1
wm.width @ bundle.js:1
(anonymous) @ bundle.js:1
n @ bundle.js:1
(anonymous) @ bundle.js:1
(anonymous) @ bundle.js:1
main.css:1 GET http://server1.lazy-kitty.com/tests/renderer_forward_plus_2021_02_03/data/textures/ui/Fantasy%20Cursors/Mouse3/x32/Mouse03%20standart.png 404 (Not Found)
main.css:1 GET http://server1.lazy-kitty.com/tests/renderer_forward_plus_2021_02_03/data/textures/ui/Fantasy%20Cursors/Mouse9/x32/Mouse09%20Bright.png 404 (Not Found)
main.css:1 GET http://server1.lazy-kitty.com/tests/renderer_forward_plus_2021_02_03/data/fonts/tt0246m_.ttf net::ERR_ABORTED 404 (Not Found)
bundle.js:1 Mean: 48.29ms, Median: 35.10ms, traversal[ checks: 49,633, rejects: 13,850, s.rejects: 10563]
bundle.js:1 Mean: 27.99ms, Median: 15.60ms, traversal[ checks: 49,633, rejects: 13,850, s.rejects: 10563]
bundle.js:1 Mean: 28.31ms, Median: 15.40ms, traversal[ checks: 50,686, rejects: 13,747, s.rejects: 11172.791666666666]
bundle.js:1 Mean: 26.50ms, Median: 20.20ms, traversal[ checks: 58,088, rejects: 13,030, s.rejects: 15629.09]
bundle.js:1 Mean: 26.70ms, Median: 23.00ms, traversal[ checks: 67,789, rejects: 12,564, s.rejects: 21026.25]
bundle.js:1 Mean: 29.29ms, Median: 25.00ms, traversal[ checks: 76,825, rejects: 12,953, s.rejects: 25236.33]
bundle.js:1 Mean: 28.00ms, Median: 25.00ms, traversal[ checks: 81,671, rejects: 13,931, s.rejects: 26493.26]
bundle.js:1 Mean: 29.37ms, Median: 26.40ms, traversal[ checks: 86,088, rejects: 15,483, s.rejects: 26690.14]
bundle.js:1 Mean: 30.41ms, Median: 28.40ms, traversal[ checks: 89,419, rejects: 15,675, s.rejects: 27748.19]
bundle.js:1 Mean: 31.85ms, Median: 28.70ms, traversal[ checks: 94,231, rejects: 14,338, s.rejects: 30999.37]
bundle.js:1 Mean: 31.53ms, Median: 28.50ms, traversal[ checks: 93,317, rejects: 14,526, s.rejects: 30565.91]
bundle.js:1 Mean: 26.34ms, Median: 27.80ms, traversal[ checks: 75,758, rejects: 12,601, s.rejects: 24349.37]
bundle.js:1 Mean: 21.64ms, Median: 18.90ms, traversal[ checks: 60,253, rejects: 11,963, s.rejects: 17941.35]
bundle.js:1 Mean: 18.51ms, Median: 14.90ms, traversal[ checks: 46,891, rejects: 9,963, s.rejects: 13619.08]
bundle.js:1 Mean: 19.23ms, Median: 15.20ms, traversal[ checks: 51,542, rejects: 10,017, s.rejects: 15930.6]
bundle.js:1 Mean: 22.91ms, Median: 23.50ms, traversal[ checks: 61,318, rejects: 11,566, s.rejects: 19203.27]
bundle.js:1 Mean: 24.76ms, Median: 23.50ms, traversal[ checks: 71,451, rejects: 15,424, s.rejects: 20294.88]
bundle.js:1 Mean: 23.92ms, Median: 22.50ms, traversal[ checks: 70,060, rejects: 15,302, s.rejects: 19720.6]
bundle.js:1 Mean: 22.49ms, Median: 22.00ms, traversal[ checks: 65,415, rejects: 16,192, s.rejects: 16483.32]
bundle.js:1 Mean: 23.28ms, Median: 21.90ms, traversal[ checks: 65,007, rejects: 15,827, s.rejects: 16638.48]
bundle.js:1 Mean: 24.61ms, Median: 21.50ms, traversal[ checks: 70,669, rejects: 17,942, s.rejects: 17282.73]
bundle.js:1 Mean: 25.59ms, Median: 21.70ms, traversal[ checks: 75,367, rejects: 19,447, s.rejects: 18053.52]
bundle.js:1 Mean: 25.23ms, Median: 21.50ms, traversal[ checks: 75,365, rejects: 19,501, s.rejects: 18046.81]
bundle.js:1 Mean: 24.83ms, Median: 21.30ms, traversal[ checks: 74,417, rejects: 19,351, s.rejects: 17729.63]
bundle.js:1 Mean: 25.23ms, Median: 22.10ms, traversal[ checks: 75,682, rejects: 19,447, s.rejects: 18262.22]
bundle.js:1 Mean: 24.30ms, Median: 22.00ms, traversal[ checks: 74,792, rejects: 19,544, s.rejects: 17697.92]
bundle.js:1 Mean: 25.86ms, Median: 22.10ms, traversal[ checks: 76,600, rejects: 19,632, s.rejects: 18495.5]
bundle.js:1 Mean: 26.14ms, Median: 22.00ms, traversal[ checks: 77,289, rejects: 19,653, s.rejects: 18819]

Edit: another screenshot:

1 Like

There are still bugs there; it’s somewhere in the de-referencing part. I’m still quite new to this kind of shader work :slight_smile:

There might be some hardware limitations too, I haven’t explored this topic at all yet :bug:

This has been kind of a hobby project that I wanted to do for a while. It’s by no means complete, but I felt it was close enough to show off and interesting enough for others :bowing_woman:

4 Likes

“They’re after our precious bodily clusters!”
:laughing:

1 Like

Loaded up Sponza today for testing. I feel that a large, proper piece of geometry conveys what’s going on a lot better. Here are two views: one is just lights, and the other is a heatmap showing the number of intersecting lights per cluster. In this image it goes from 0 to 5; incidentally, most of the scene has ~2 lights per pixel on average.


Some stats:
Total number of lights in the scene: 5000
Number of lights in the camera frustum: 147
Resolution of cluster texture: 32x32x32

In essence, this demonstrates quite handily the benefit of clustered rendering. If we used the standard forward rendering approach, same as what’s used in three.js - we would have to do 5000 light calculations per pixel. If we were very clever and only calculated lights in the view frustum - we’d still be doing 147 light calculations per pixel, and would likely have to recompile all materials in the scene every few frames, which would absolutely destroy performance. With clustered rendering, however, this is roughly equivalent to having only 2 lights in your entire scene, performance-wise.

[update]

A few more pictures, using three.js light calculation code



4 Likes

Added support for MeshStandardMaterial

There are 139 lights visible in this shot alone. Here are the light volumes:


7 Likes

Latest demo

It still has a lot of issues with planar interpolation for building clusters, as well as some of the light maths.

5 Likes

An update.

I’ve been working on the engine on and off for the past few months and now it’s production-ready and is being used in the main branch of the game engine.

A few things I noticed quite early on about other implementations.

Cluster resolution

Cluster resolution is typically kept very low; a lot of implementations go with 15x15x15 cluster slices. After experimenting a lot (like… a lot) I use 24x14x8, which provides more uniform slices for my purposes while requiring a fairly trivial amount of time to update.

Common sense would tell us that smaller clusters are better, as they produce tighter bounds and result in fewer unnecessary light evaluations. So why do people keep the resolution fairly low? Because each cluster needs to be updated every frame: with 15x15x15 slices you end up with 3375 clusters that need updating, while 64x64x64 slices give you 262144 clusters, which is roughly 80 times more work (yes, 80 times, not 2 times, not 10 times). So if we assume that updating 15x15x15 clusters takes 1ms, updating 64x64x64 clusters would be expected to take about 80ms.
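The update-cost arithmetic above can be checked directly (plain numbers, nothing engine-specific):

```javascript
// Per-frame update cost scales with the total cluster count,
// which grows with the cube of the resolution.
const clusterCount = (x, y, z) => x * y * z;

const low = clusterCount(15, 15, 15);  // 3375 clusters
const high = clusterCount(64, 64, 64); // 262144 clusters
const ratio = high / low;              // ~77.7, i.e. roughly 80x the work
```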

You don’t want your “lighting acceleration” solution to reduce the performance of your overall application, so… small cluster resolution it is. I think if you have access to compute shaders or threads, this is less of an issue, as you can split the work up. With a single-threaded CPU solution, though, you have to be quite careful about your performance budget.

Per-cluster light limit

Another common theme I found is that every implementation I’ve seen (and I’ve seen many) imposes an arbitrary limit on the number of lights per cluster. This limit is entirely arbitrary, usually decided by the concrete needs of the engine, and typically set to a relatively high value, like 64 or more. The issue is that at some point, in some use cases, it will not be enough; in those cases engines just drop the lights that don’t fit.

However, there is a simple and elegant solution to this, which has been proposed by various authors: using a separate lookup table. You store a pointer to the lookup table in the cluster and then pull the actual lights out of a non-uniform table (a separate texture). It does add the overhead of an extra texture fetch per cluster, but I found it to be negligible, especially since it’s a non-filtered lookup.

Using a lookup table, I don’t have to limit the number of lights: if the lookup table overflows, I simply resize it; when the table is too large, I scale it down. This way I keep overall memory usage very low, which in turn gives very nice cache locality. The lookup table only grows on complex scenes where there are a lot of overlapping lights. It’s a bit of extra complexity in the code, but you don’t have to worry about a per-cluster light limit at all.

To mitigate growth of the lookup table, I use a small cache to re-use light assignments between clusters (pointing different clusters to the same area of the lookup table). I found that this saves up to 99% of the space in simple scenes with a few large lights. The benefit drops as the scene becomes more complex and there are fewer clusters with identical light assignments to exploit. Since my aim with this engine was a general-purpose solution, I optimize for the more common case where there aren’t too many lights in a single camera frame.
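A rough sketch of how the lookup-table indirection and the de-duplication cache could fit together (all names are hypothetical, and the real implementation works on textures rather than plain arrays):

```javascript
// Pack per-cluster light lists into one shared index table.
// Identical lists are de-duplicated via a string-keyed cache, so many
// clusters can point at the same region of the table.
function packClusterLights(clusterLightLists) {
  const table = [];        // flat array of light indices (the "lookup table")
  const cache = new Map(); // light-list signature -> offset into `table`
  const clusters = [];     // per cluster: { offset, count }

  for (const lights of clusterLightLists) {
    const key = lights.join(',');
    let offset = cache.get(key);
    if (offset === undefined) {
      offset = table.length;
      table.push(...lights);
      cache.set(key, offset);
    }
    clusters.push({ offset, count: lights.length });
  }
  return { clusters, table };
}
```

On the GPU side, each cluster would then store only `offset` and `count`, and the shader fetches the actual light indices from the shared table.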

Culling

I found that most implementations don’t do culling well: they iterate over all lights and try to reject them as early as possible, but that’s not a very scalable solution. For my engine I used a spatial index that I had previously implemented to select lights in the view frustum before considering per-cluster light assignment.

When assigning lights to clusters, I use the same spatial index trick, but with only the visible lights, so culling actually happens twice:

  • Once for the entire camera frustum
  • Once per individual cluster frustum
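The two passes might be structured roughly like this; a brute-force “index” stands in for the actual BVH so the two-level shape stays visible (all names are hypothetical):

```javascript
// A real engine would use a BVH here; filter() is a stand-in with the
// same query shape.
const buildIndex = (lights) => ({
  query: (intersects) => lights.filter(intersects),
});

function assignLightsToClusters(allLights, inCameraFrustum, clusters) {
  // cull #1: once for the entire camera frustum
  const visible = buildIndex(allLights).query(inCameraFrustum);
  // re-index only the visible lights, then cull #2: per cluster frustum
  const visibleIndex = buildIndex(visible);
  return clusters.map((c) => visibleIndex.query(c.intersects));
}
```

The point of the second index is that each per-cluster query only ever touches lights that already survived the camera-frustum cull.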

I found that for a small number of lights in the scene (around 500) this approach with a spatial index is actually slower than using a straight-up array, because the memory access patterns are less linear. However, it amortizes very well: as the number of lights grows, performance drops sub-linearly (close to logarithmic, but I didn’t build proper performance metrics, so it’s just a feel based on a handful of exploratory tests).

When using a spatial index, the overall number of lights in the scene almost doesn’t matter, I can have interactive performance with a million lights in the scene, provided only a few of those lights are in the camera frustum at any given time.

Performance considerations

I found that the biggest obstacle to performance is memory access: the dataset is relatively small, but the amount of number-crunching done over it is large.

Even instruction count plays a role here, having a light-to-cluster assignment function that takes up less memory can result in fewer cache misses and significantly faster overall speed.

The same applies to the data: I converted all cluster frustum data to uniform Float32 arrays and got around a 50% reduction in execution time from that alone. You get fewer random memory accesses from de-referencing object fields, as well as an overall reduction in data size, since objects in V8 take up some 60 bytes or so just for existing (I don’t remember the exact numbers, but it’s a lot).
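As an illustration of that object-to-flat-array conversion, here is a sketch that packs per-cluster frustum planes into a single Float32Array, structure-of-arrays style. The layout - 6 planes of 4 floats (nx, ny, nz, d) per cluster - is an assumption for the example, not the engine’s actual format:

```javascript
// Pack per-cluster frustum planes from objects into one flat
// Float32Array: linear reads, no pointer chasing per field.
function packFrustums(clusterFrustums) {
  const FLOATS_PER_CLUSTER = 6 * 4; // 6 planes, 4 floats each
  const data = new Float32Array(clusterFrustums.length * FLOATS_PER_CLUSTER);
  clusterFrustums.forEach((frustum, ci) => {
    frustum.planes.forEach((p, pi) => {
      const o = ci * FLOATS_PER_CLUSTER + pi * 4;
      data[o] = p.nx;
      data[o + 1] = p.ny;
      data[o + 2] = p.nz;
      data[o + 3] = p.d;
    });
  });
  return data;
}
```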

Many lights without clusters

I asked myself the question: “Can we really not have many lights in the scene without clusters or some other magic?” So I explored how far you can push three.js with the number of lights before performance starts to become an issue. It turns out - quite far. On my GTX 1080 I can run around 400 point lights at 1080p at an interactive frame rate. This will not be the case on weaker hardware, sure, but you will rarely need 400 lights in the scene at the same time.

Another issue is that adding/removing lights in three.js causes material re-compilation, which is a much larger problem in my view. This can be mitigated though: I coded up a “light cache”, which is basically the object-pool pattern. You pre-allocate a certain number of lights of each type, and when a light is not needed, it’s just dimmed to 0 intensity. That way, as long as you have a relatively fixed maximum number of lights in your application, you can enjoy not having to worry about adding/removing lights to/from the scene, and get somewhat decent performance overall.
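A minimal sketch of that “light cache” idea - a pool where releasing a light just dims it to zero instead of removing it from the scene. The class is hypothetical; `createLight` would be something like `() => new THREE.PointLight()`:

```javascript
// Object pool for lights: the set of lights the shader sees never
// changes, so no material re-compilation is triggered.
class LightPool {
  constructor(createLight, size) {
    this.free = [];
    for (let i = 0; i < size; i++) {
      const light = createLight();
      light.intensity = 0; // dormant: in the scene, contributes nothing
      this.free.push(light);
    }
  }
  acquire(intensity = 1) {
    const light = this.free.pop(); // undefined if the pool is exhausted
    if (light) light.intensity = intensity;
    return light;
  }
  release(light) {
    light.intensity = 0; // dim instead of removing from the scene
    this.free.push(light);
  }
}
```

All pooled lights would be added to the scene once, up front, so the shader program stays stable for the lifetime of the application.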

I haven’t done any robust testing, but my gut feeling is that my clustered implementation pays off at around 16 lights on my hardware; on lower-end GPUs the break-even point will probably be a much lower number, like 2-4. However, building the clusters is not free, so that’s something you need to include in your calculations as well. I don’t think clustered rendering is useful for cases where you have fewer than 16 lights visible at any given time; light “caching” would be a smarter option there.

Incidentally, in meep I use both techniques, since the clustered rendering engine I wrote only supports point lights so far.

Future Work

  • Add support for spotlights; these are a pain because of having to code up frustum/cone intersection tests
  • Move cluster assignment code to GPU
  • Experiment with pushing the entire light BVH to GPU so that no clusters are required in the first place
4 Likes

This is awesome work! Do you have any updated demos? Would you consider providing light clustering in the form of a lib?

1 Like

Hey Joe, thanks for asking. I’m not planning on releasing this for free currently. I do have a couple more demos; I’m currently moving house, so I’ll try to upload them once that’s done, in a month or so.

I think the basics of clustered rendering are fairly easy to implement. It takes a lot of time, and you will need to do a lot of research unless you’re already really good at computational geometry, but it’s nothing like research-grade stuff. Based on my previous experience, I don’t think this engine would benefit many people if I released it as-is, without proper documentation and packaging, and I don’t have the time to do all that currently.

Also, my engine makes heavy use of Meep’s internals, like BitSet, BVH and various rendering abstractions. Those are not essential, and one could replace them all if they were determined, but it also means that the system is not “standalone” and couldn’t be packaged into a tiny library that could just be dropped in.

That may be true of most people, although I think there are some of us here who would love to read the source to learn from it, and maybe improve on it.

Looking forward to seeing the next demo! :smiley:

2 Likes

Hey guys, here’s a slightly more interesting demo:
http://server1.lazy-kitty.com/tests/renderer_forward_plus_2021_12_16/

There are a few widgets for debugging purposes; feel free to play around with those. The most interesting one is probably the light bounds render, number 1 in the bottom right corner, which will display the bounds of each light like so:


Here’s a corner of the scene; the spherical bounds are a bit more readable here:

Overall there are 500 lights moving about the scene. This is more of a stress test for the engine. I’m curious what kind of FPS you’re seeing on different hardware; for me it’s around 100 with the default camera view.

I was inspired to add the moving lights and visualize the actual point-sources by this wonderful example from @wizgrav:
https://threejs.org/examples/?q=forwa#webgl_tiled_forward

For reference, that example uses 32 lights.

3 Likes

Hm - is there a reason why the Sponza model loads super quick in the network panel, but shows up only after 1-2 mins in the actual scene :thinking: ?


90-120fps on 2021 M1 MBP (depending on the zoom level)
30-40fps on iPhone 12 (suddenly drops to 10fps after a moment / doesn’t show sponza model at all, only floating lights / and throws a shader error :pensive:)

1 Like

On the iPhone it looks like it’s the GLES 3.0 shader model (the #version 300 bit) that messes things up. Not sure what that’s about :face_with_raised_eyebrow:
Maybe WebGL 2.0 is not supported? …Strange. Anyway, thanks for sharing the log, food for thought :thinking:

The loading/display thing, I think, is related to texture decode, and possibly shader compilation. I’m using a Sponza version from some random GitHub repo; it’s got 68 individual textures, all pretty large (147 Mb of png). Decode could take a while, depending on how well Chrome has that part optimized on M1. Interesting to know :slight_smile:

Extremely slow to load here too - but mostly it’s all the images. The Sponza model takes about 15s, and then it starts on the images, which takes several minutes. Then once they’re all done there’s another 30s or so before the model shows up (this is on an i7).

The actual demo is very cool though, great work! It’s running at 4k@120fps with 60% gpu use on a 3070m

Would love to see this added to three.js someday.

1 Like