Optimizing Point Lights

Hello there :cowboy_hat_face:

I was thinking about how we can improve the current dynamic light limit in three.js. In a small scene my NVIDIA GTX 960 could only render 50 point lights before dropping frames. I chose point lights for this as they are the most common case for large numbers of lights. The simplest method seems to be a range check. Currently in the Phong material, every shaded pixel iterates over every light to calculate its contribution, and all materials share the same lights list uniform. It's a naive method that also checks lights which have no chance of affecting the pixel at all.

Forward+ rendering usually uses compute shaders, which we don't have yet (coming in WebGPU). Another option is to do bounding-sphere-to-light-range distance checks on the CPU to cull lights, but that's a lot of work :smiling_face_with_tear:, so I started with the simplest GPU check. It's not as optimal due to code branching, but it has the exact pixel position and light position/distance, making the check very accurate (unlike bounding spheres, which are only an approximation).
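For reference, the CPU check I'm describing would be something along these lines (just a sketch with made-up names - it assumes precomputed bounding spheres, world-space light positions and uniform scale):

function lightAffectsMesh( light, mesh ) {

	// a point light can only affect a mesh if the mesh's world-space
	// bounding sphere intersects the light's range sphere
	const sphere = mesh.geometry.boundingSphere; // assumed precomputed
	const center = sphere.center.clone().applyMatrix4( mesh.matrixWorld );

	return center.distanceTo( light.position ) < light.distance + sphere.radius;

}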

Currently #pragma unroll_loop_start is used for looping over lights, but as far as I know it doesn't allow skipping specific iterations, so I changed it to a dynamic for loop without unrolling and did something along the lines of

for ( int i = 0; i < NUM_POINT_LIGHTS; i ++ ) {
  if ( distanceFromPointLightToPixel > pointLight.distance ) { continue; } // skip out-of-range lights
  ... // continue with shading and lighting calculations
}

to skip loop iterations for point lights which are not in range to affect the fragment.
This was a success - the point light limit increased by 60%, from a max of 50 to a max of 80 point lights! However, GPU usage is still extremely high, and in a real app/game there would be a lot more going on.

Is this increase even worth reporting to GitHub? It ain't much, but it requires only tiny changes to the shader chunks. What do you think?

Here is the scene with 80 lights:

The actual shader code changes are just 2 tiny chunk edits:

  1. THREE.ShaderChunk[ 'lights_pars_begin' ]:
    128: light.color = pointLight.color;
    →
    light.color = pointLight.color * step( lightDistance, pointLight.distance );
    This makes the light color black if the pixel is outside the range of the point light (step( lightDistance, pointLight.distance ) is 1.0 within range and 0.0 outside it). Two lines later, this causes light.visible to be set to false.

  2. THREE.ShaderChunk[ 'lights_fragment_begin' ]:
    38: #pragma unroll_loop_start
    39: for ( int i = 0; i < NUM_POINT_LIGHTS; i ++ ) {
    →

for ( int i = 0; i < NUM_POINT_LIGHTS; i ++ ) {
	pointLight = pointLights[ i ];
	getPointLightInfo( pointLight, geometry, directLight );

	if ( !directLight.visible ) { continue; }

which prevents further lighting calculations done inside
RE_Direct( directLight, geometry, material, reflectedLight );
thus saving some performance when there are a lot of distance-limited lights. I tested in the editor, and it seems like the point light distance is always accurately respected, regardless of renderer.physicallyCorrectLights.

8 Likes

This seems interesting. Did you get anywhere with this?

My understanding of GLSL might be out of date, but I thought there was no way to “bail early” in a shader - that conditionals (including early returns) are a fiction, and that the compiler hides the truth that all code paths are always executed.

Taking this specific case, if you continue early, that would just mean that the compiler will secretly ignore your statement, and simply multiply any subsequent effects in that loop by zero if the condition is met. Again, I may be totally wrong, but this is the way shaders used to work. Maybe graphics cards and/or GLSL are much cleverer these days.

If you have a working prototype, I would check whether this approach measurably improves performance. Simply bogging things down to 30fps, and seeing a framerate increase with your change applied would be enough to prove me wrong.

But if I’m right, a more effective approach (at least for WebGL) might be to try to limit NUM_POINT_LIGHTS to a reasonable number, and make sure that only the closest lights to that mesh are added as uniforms. This might require a bit of faffing around with onBeforeRender so you can update the visibility of lights per mesh. In the special case of the ground plane, which seems to be lit by every light, it may need to be partitioned so that each piece only needs to calculate a subset of lights (under the limit you set).
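Something like this, just to sketch the idea (names are made up, and keep in mind three.js recompiles shaders when the set of visible lights changes size, hence the faffing):

const MAX_LIGHTS = 8; // keep NUM_POINT_LIGHTS at a fixed, reasonable budget

mesh.onBeforeRender = function () {

	// sort candidate lights by distance to this mesh (a cheap proxy)...
	const sorted = allPointLights.slice().sort( ( a, b ) =>
		a.position.distanceToSquared( mesh.position ) -
		b.position.distanceToSquared( mesh.position )
	);

	// ...and let only the closest MAX_LIGHTS contribute
	sorted.forEach( ( light, i ) => { light.visible = ( i < MAX_LIGHTS ); } );

};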

What do you think?

1 Like

Even though it's a very old topic, I thought I'd throw in a few musings, as I have some experience on the subject.

The reason having many lights is hard on your GPU is that you end up looping over each light for each pixel on the screen.

Even if your light has no contribution - that is, the pixel being shaded is too far away from the light - it will still do the computations.

With the branching early exit that @DolphinIQ has suggested - you will potentially skip all of those extra computations. However, you’ll pay the branching cost, which traditionally has been quite high for GPUs. It’s much better now, at least on desktop. I don’t know about mobile though.

So, no matter how well you do with your early exits, you will still have to at least read some data about the light and do some computations before you have enough info to decide whether to exit or not.

If you have 1000 lights which will be optimized via early exit, that’s still 1000 distance calculations and 1000 conditionals as a minimum.

The Forward+ that was mentioned - or more accurately, I suppose, “clustered” or “tiled” rendering - breaks the whole viewport (the picture to be rendered) into small rectangular regions, and then stores a small amount of data for each region. This data is your pre-computed list of lights that actually affect pixels in that region. For any given scene you might have 1000 lights, but it's pretty rare that any given pixel will receive light from more than 2-3 lights at the same time.

The technique lets us consider a very practical number of lights for each pixel, at the cost of some pre-computation. The problem with clustered rendering and JS: JS is slow, and the technique gets better and better as you split your viewport into more pieces. That's a lot of computation. You can do those computations on the GPU, but without compute shaders it's reeeealy painful. In C++ you can take advantage of compiler optimizations and threading, but in JS - no such luck. And please don't get me started on WebWorkers - they are nice, but they are not anywhere near as powerful as native threads.
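For the curious, the binning step itself is conceptually simple. A rough JS sketch with made-up names, assuming each light has already been projected to screen space (x/y/radius in pixels):

const TILE = 32; // tile size in pixels
const tilesX = Math.ceil( viewportWidth / TILE );
const tilesY = Math.ceil( viewportHeight / TILE );
const tileLights = Array.from( { length: tilesX * tilesY }, () => [] );

for ( const light of screenSpaceLights ) {

	// conservative range of tiles the light's circle may touch
	const minX = Math.max( 0, Math.floor( ( light.x - light.radius ) / TILE ) );
	const maxX = Math.min( tilesX - 1, Math.floor( ( light.x + light.radius ) / TILE ) );
	const minY = Math.max( 0, Math.floor( ( light.y - light.radius ) / TILE ) );
	const maxY = Math.min( tilesY - 1, Math.floor( ( light.y + light.radius ) / TILE ) );

	for ( let ty = minY; ty <= maxY; ty ++ ) {
		for ( let tx = minX; tx <= maxX; tx ++ ) {
			tileLights[ ty * tilesX + tx ].push( light.index );
		}
	}

}

// tileLights is then packed into a data texture for the shader to sample

Doing that for thousands of lights, every frame, over thousands of tiles is exactly the kind of work that hurts in JS.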

4 Likes

One more technique that I wanted to mention, which doesn't strictly speaking increase the number of lights you can see, but helps a lot, is pooling + prioritization.

You create a static pool of lights, say 16 point lights, 1 directional, 1 ambient. And you resize the pool as necessary, which will necessitate shader recompilations.

What you do with those lights is the interesting bit. You treat your actual lights as “virtual”, and at any point in time a virtual light MAY be assigned a “physical” light from the pool.

If we have a scene with 1000 lights, depending on the camera position/angle we would assign a priority to each light - how “important” it is. We then sort the list and basically give the top 16 entries a light to use from the pool.
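To illustrate the assignment step - the priority function here (intensity over distance to camera) is just one possible choice:

// virtualLights: plain data objects (with Vector3 positions);
// pool: e.g. 16 real THREE.PointLight instances
function assignPool( virtualLights, pool, camera ) {

	const scored = virtualLights.map( ( v ) => ( {
		v,
		score: v.intensity / ( 1 + camera.position.distanceTo( v.position ) )
	} ) ).sort( ( a, b ) => b.score - a.score );

	pool.forEach( ( light, i ) => {

		const entry = scored[ i ];

		if ( entry ) {

			light.position.copy( entry.v.position );
			light.color.copy( entry.v.color );
			light.intensity = entry.v.intensity;

		} else {

			light.intensity = 0; // unused pool slots contribute nothing

		}

	} );

}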

This may sound dumb, but it actually works incredibly well, and it's a default technique used in many somewhat older games. I believe Unity's URP renderer also uses this approach.

The technique does have a few gotchas, however:

  1. What to do when a light suddenly gets re-assigned? If you don’t handle this well, you’ll see a lot of flicker as you move around, lights getting turned on and off.
  2. What to do with the lights that should be visible but are not assigned? Again, similar to above, if you don’t figure this out - your smaller light sources will have absolutely no presence in the world and no contribution.
  3. How big should the pool be? This one is relatively easy - it can just be a quality setting, letting the user pick between, say, 4, 16 and 32, or something similar.
  4. You no longer have any guarantee about light interaction. An object may or may not actually be lit by a nearby light depending on the camera position/orientation.

Not super relevant, but I do use light pooling as well in my engine, mainly for the nice side-effect of letting you avoid recompilation of materials.

5 Likes

Usnul, thanks for the nuance re: “branching cost”. My understanding was that branching cost was always maximal, but it sounds like that’s not always true these days, especially on desktop.

That virtual lights + pool approach seems useful enough to justify at least a Threejs addon… :thinking:

More grist for the mill :slight_smile:

https://tiansijie.github.io/Tile_Based_WebGL_DeferredShader/

2 Likes

Greetings :eyes:

I did not push this anywhere due to lack of interest; however, because of the resurrection of this thread, I've pushed the code of this demo to GitHub:

If anyone is interested in this, it would be awesome if you could run it and let us know here what the FPS/GPU time differences are with and without the shader edits!

As for branching in the shader, as @Usnul said - there are costs and benefits. The goal of this experiment is to figure out whether the benefits outweigh the costs :slight_smile: Due to the heavily parallel design of GPUs, they aren't fond of code branching, but it is by no means impossible. The results - the framerate - don't lie. If better performance is achieved, then why not try this?

Of course Forward+ (or tiled, or clustered) rendering is the “proper” way to limit light computations within a forward rendering pipeline (which Three.js uses). However, that is an entirely different beast on its own, and even just overriding custom light uniforms can be a huge hassle if you don't have years of experience with the 3js WebGLRenderer.
The goal of this experiment was to see if there is a simple, couple-lines-of-code edit that would optimize most scenes with a large number of lights, on most GPUs.

Another approach is a completely different pipeline - Deferred Rendering, mentioned by @manthrax. That, once again, is tons of work and lots of edits to the inner workings… which I'm happy to say I undertook, and I recently built something that could be considered a working prototype! :partying_face: :tada:

It even has live demos, one of which shows how you can run thousands of point lights in your scene as long as they don't all light the same surfaces (which is of course very rare in games).

TL;DR: It would be great if anyone interested in this topic tried the GitHub - DolphinIQ/threejs-point-lights-optimization: A test to see if Three.js point lights can be shader optimized to branch out of the light loop if pixel is outside the light's range demo and reported here or in my DMs the FPS/GPU time differences, as well as the GPU it runs on. I will then try to compile them into some meaningful data that we can draw conclusions from. Thanks! :smiling_face:

4 Likes

Wow, Ogar looks like a lot of work! I haven’t taken a close look yet, but that seems pretty amazing.

In terms of FPS, have you done the test yourself? If so, what speed up did you see?

If not, it helps to be able to make a specific performance claim to encourage others to try to reproduce it. There are plenty of FPS meters, but I recommend mrdoob’s excellent stats.js especially because it can run easily as a bookmarklet.
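For reference, hooking it up is just a few lines:

import Stats from 'stats.js';

const stats = new Stats();
stats.showPanel( 0 ); // 0: fps, 1: ms, 2: mb
document.body.appendChild( stats.dom );

function animate() {

	stats.begin();
	renderer.render( scene, camera ); // your existing render call
	stats.end();

	requestAnimationFrame( animate );

}

requestAnimationFrame( animate );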

Once you have an fps meter, bog down your render workload enough that any increase in performance is not clamped by the browser refresh (raf) rate. I like to target 30fps as a “baseline” framerate.

So, once you have “30fps worth” of lights, reload with your patch applied, and compare the new framerate. As long as the new framerate is below the raf limit, you can divide new by old rate to get a reasonably accurate speed-up factor.
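Or measure frame time directly - a sketch; average over a window of frames, then speedup = unpatched ms / patched ms (the same ratio as dividing the framerates the other way around):

let frames = 0, elapsed = 0, last = performance.now();

function measure() {

	const now = performance.now();
	elapsed += now - last;
	last = now;

	if ( ++ frames === 300 ) {

		console.log( 'avg frame time:', ( elapsed / frames ).toFixed( 2 ), 'ms' );
		frames = 0;
		elapsed = 0;

	}

	requestAnimationFrame( measure );

}

requestAnimationFrame( measure );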

The higher the number, the more interesting the optimization. Very high numbers for realistic cases can often justify code changes that have other disadvantages. (If the optimization is very simple and has no drawbacks, the bar can be lower.)

1 Like

On my Nvidia 960M laptop, I get up to 30fps and 35-44ms GPU time at 78 point lights without the edits.
Once I enable the optimization, I get 60fps (just barely) and 16ms GPU time.

       | With  | Without
FPS    | 60    | 26-30
GPU ms | 15-16 | 35-44

That's over a 100% performance increase for my machine :exploding_head:

1 Like

This is fantastic stuff! I've been looking for a deferred rendering pipeline that allows me to build a particle system that actually lights up surfaces based on particle emitters, for a top-down game I've been tinkering with.

I’m very interested to see Ogar becoming “production ready”.

On a different note: your online examples don't work on my Galaxy A70. It says that my browser doesn't support WebGL, but the regular 3js examples work just fine. Are you using specific extensions that I might be missing on this device?

1 Like

Woah, that’s huge. And you know what’s crazy, I notice your “with” number is 60fps, so your browser may actually be capping the true framerate. That means the actual speedup is possibly even higher than what you’re measuring.

Try adding even more lights until the “with optimization” number drops consistently well below 60 (e.g. 30-50). If we’re lucky, that could mean your “without” framerate might get pushed very low, revealing a very potent optimization (for this case).

This result also shows that, at least on desktop machines, webgl shaders are absolutely capable of true branching, and significantly so. Today I learned! Thanks for teaching me!

Once you’ve got numbers that you’re sure aren’t being “clipped” by the browser, I’ll try to reproduce this to give some extra weight to the claim. It would also be helpful to show whether the optimization works well on other classes of device too (e.g. older and low powered devices, especially phones and tablets).

Yes, it is capped, but barely. There were frames where it dropped to 55, 58, etc. I suspect the real number is around 62-65fps. Either way, GPU profiling doesn't lie, and the performance boost is clearly visible: going from 30fps to 60fps is already a 2x speedup, and the GPU times (35-44ms down to 15-16ms) suggest the uncapped gain is even higher!
With that said, I'm very curious to see what times other users' machines provide.

500 lights, 60fps, js 5ms, gpu 43%
1000 lights, 60fps, js 7ms, gpu 62%
1500 lights, 60fps, js 8ms, gpu 76%
2000 lights, 60fps, js 10ms, gpu 92%
2500 lights, 58fps, js 12ms, gpu 100%

Thanks for running the test! Are these stats with or without the “optimization”? Did you note the difference?

I tested only this page: Ogar Engine - Massive Point Lights
Where can I test the same page without the optimisation?

Sorry, this thread got a little convoluted. The app to test is

How to run is included in the Readme file. Ogar engine is only a side topic.

1 Like

Increase of 200%. Video card: AMD Radeon RX 560
50 lights with optimisation. 60fps, js 4ms, gpu 5ms
50 lights without optimisation. 60fps, js 4ms, gpu 10ms
100 lights with optimisation. 60fps, js 4ms, gpu 10ms
100 lights without optimisation. 45fps, js 4ms, gpu 20ms

1 Like

“capped, but barely”

To get a reasonably accurate number, both the optimized and unoptimized fps need to be significantly and consistently below the browser framerate cap.

It’s not clear whether it’s an apples-to-apples test, but based on the results @Chaser_Code reported, the optimization looks much more powerful than a 2 times speedup. For them, the unoptimized code drops below 60fps somewhere between 50 and 100 lights, and the optimized code drops below 60fps possibly at 2500 lights, but certainly above 2000. Using only the most conservative numbers from this, 100 and 2000, we have evidence of a twenty times speedup (2000 / 100 = 20x, i.e. 1900% faster).

(Edit: :point_up: This was my mistake - I was reading the wrong test results. Even so, the optimization is even faster than this, so …)

It’s possible there are other effects at play here (window size, for example), but you may be selling this optimization short, @DolphinIQ !

I’ll try it now

This optimization has such a strong effect that I couldn’t accurately measure the improvement with fps: the dynamic range is too big. I had to switch to using the time between renders (“frame time”).

I did three tests comparing “stock” Threejs with @DolphinIQ’s mod (“optimized”):

A. Stock with 250 lights: 1519ms (0.66 fps … oof)
B. Optimized with 250 lights: 22–26ms (38–45 fps)
C. Stock with 31 lights: 34ms (29 fps)

Comparing A and B shows us that for very large numbers of lights (the maximum my gpu would allow) the rendering performance in this scene is about 58 times faster with the optimization applied.

Also, comparing tests B and C we can see that, holding frame time constant (approximately), we can render about 8 times more lights, and that’s being conservative.

To be fairer, since test B is performing better than test C, another way of looking at this is “lights rendered per millisecond”. Stock is 31 lights / 34 ms ≈ 0.9 lr/ms and optimized is 250 lights / 26 ms ≈ 9.6 lr/ms. This isn’t a super scientific measure, but I think it’s a useful way to adjust for the performance difference.

I also noticed that when you zoom in with the optimization enabled, the frame rate improves further. In stock three, the framerate is consistent no matter where the camera is pointing. This suggests the early return is working as intended, as though lights are being culled.

To check the effect on smaller, more “reasonable” numbers of lights, I did some tests with 10 lights. For these tests I rendered the scene 50 times per frame to multiply the small load enough to measure it.
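Roughly like this, assuming the usual renderer/scene/camera setup:

function animate() {

	requestAnimationFrame( animate );

	// render the same scene 50 times per frame so the small per-frame
	// cost becomes large enough to measure
	for ( let i = 0; i < 50; i ++ ) {

		renderer.render( scene, camera );

	}

}

animate();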

D. Unoptimized with 10 lights (x50): 17–42ms (23–58 fps)
E. Optimized with 10 lights (x50): 16–37ms (27–60 fps)

I might be wrong, but I think taken together, these results mean two things:

  • This optimization has large effects for large numbers of dynamic lights
  • There seems to be no performance impact on smaller numbers of lights

I’d love to hear from the rest of you about what you think of these results. Further and/or more rigorous testing would be welcome from anyone.

Overall, given the potency and simplicity of this optimization, I think this warrants further discussion and consideration. I’d love to hear what core maintainers think of this: perhaps there are important aspects we’re overlooking.

PS. All these tests were on a 2021 Macbook Pro (M1 Max) running Safari 16.5.2 on MacOS 13.4.1. Canvas size 3174 * 2274 for all tests.

Just following up here. In my limited (but careful) tests, I found @DolphinIQ’s simple patch dramatically improves the performance of scenes with large numbers of point lights, without hurting the performance of regular scenes. The performance gain can be more than 50x.

To recap, this patch does an “early exit” from the shader loop of dynamic lights when a light is out of range.

Is there interest from Threejs maintainers in an improvement like this? If so, what needs to be done to move this forward? If not, what objections are there?