Ideas on performing fast per-instance frustum culling on InstancedMesh

Hi all !

I would like to ask the forum about advice on how to perform frustum culling on the instances of an InstancedMesh.

I know it’s a feature not natively supported by three.js, but I have some intuition that ‘something’ could be done in that regard, even if it’s not considered ‘true’ culling.

The most straightforward way seems to be reordering the instances and all attribute buffers and then playing with the ‘count’ property to draw only the non-culled instances. If I’m not mistaken that’s what @agargaro did in his awesome InstancedMesh2 library. But accessing, reordering and recomputing all those arrays/buffers in JavaScript every frame (especially if you have a good number of custom attribute buffers, which I do) seems to be far more computing work than just rendering all instances without any culling…
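For concreteness, that reorder-and-count idea could look roughly like the sketch below (plain JavaScript; `matrices` stands in for an `instanceMatrix.array` with 16 floats per instance, and the function name is made up, not InstancedMesh2’s actual API):

```javascript
// Compact the matrices of visible instances to the front of the buffer,
// then draw only `count` of them. `visible` is one boolean per instance.
function compactVisible(matrices, visible) {
  let count = 0;
  for (let i = 0; i < visible.length; i++) {
    if (!visible[i]) continue;
    if (count !== i) {
      // Move this instance's 16 matrix floats into the next free slot.
      matrices.copyWithin(count * 16, i * 16, (i + 1) * 16);
    }
    count++;
  }
  return count; // would be assigned to mesh.count
}
```

The return value would go into `mesh.count` and the attribute flagged with `needsUpdate`, which is exactly the per-frame buffer churn being questioned above.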

But instead I’ve been toying for some time with the idea of stopping the computations early in the vertex shader, and consequently discarding the computations in the fragment shader too, for the culled instances, as long as all the instanced meshes’ materials have some kind of texture/attribute that tells them whether a certain instance should be visible.

All my ‘high level’ objects are static, and on init I compute a BVH to know their position and size. So in theory knowing whether they should be culled is pretty fast and efficient (on the CPU side). How could I then pass this data to all the meshes on the GPU in the most efficient way?
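For reference, the CPU-side visibility test can be as cheap as one plane/AABB check per BVH node; a minimal sketch (all names illustrative, planes stored as inward-facing `{nx, ny, nz, d}`):

```javascript
// Returns true if the AABB `box` ({min, max} with x/y/z) is at least
// partially inside the frustum described by `planes`.
function boxInFrustum(planes, box) {
  for (const p of planes) {
    // Pick the box corner furthest along the plane normal (the "p-vertex");
    // if even that corner is behind the plane, the box is fully outside.
    const x = p.nx >= 0 ? box.max.x : box.min.x;
    const y = p.ny >= 0 ? box.max.y : box.min.y;
    const z = p.nz >= 0 ? box.max.z : box.min.z;
    if (p.nx * x + p.ny * y + p.nz * z + p.d < 0) return false;
  }
  return true;
}
```

(three.js offers the same test natively via `Frustum.intersectsBox`.)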

Maybe store the culling value every frame in some kind of shared buffer/attribute/texture among all affected InstancedMeshes? This is where I start to doubt whether the cost of updating/uploading to the GPU would be worth it, especially on low-end hardware, which is the target I would like to improve performance on.
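One possible shape for that shared data, sketched under the assumption of one byte per instance backing a `DataTexture` (`RedFormat`/`UnsignedByteType`); every name below is illustrative:

```javascript
// One byte per instance; this array could back a THREE.DataTexture, and a
// single needsUpdate flag per frame would re-upload it via texSubImage2D.
const SIDE = 64; // 64x64 texture = 4096 instances
const visibility = new Uint8Array(SIDE * SIDE);

function writeVisibility(visibleFlags) {
  for (let i = 0; i < visibleFlags.length; i++) {
    visibility[i] = visibleFlags[i] ? 255 : 0;
  }
  // In three.js you would then set texture.needsUpdate = true, once per frame.
}
```

In the shaders, each material would read its instance’s texel (e.g. with `texelFetch` at `(id % SIDE, id / SIDE)`) to decide visibility, so one upload per frame serves all the affected meshes.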

Maybe there is some way to do the culling 100% on the GPU side using GPU compute?

Any idea would be super welcome; maybe it could end up as something we could add to the engine.

Thx!


Hi!

My library worked this way in the first version, but now it works a bit better:

  • It now uses indirection: an InstancedBufferAttribute manages the indices of the instances to be rendered, allowing selective rendering, efficient culling and sorting before sending data to the GPU.
  • It uses a SquareDataTexture (an extended version of DataTexture supporting partial updates) to store instance matrices and data.

So you don’t touch the InstancedBufferAttribute yourself; you use the utility methods (setUniformAt), and nothing is reordered.

The shader is now similar to the BatchedMesh one.

You can use onFrustumEnter if you don’t want to render an instance based on attributes:

myInstancedMesh.onFrustumEnter = (instanceIndex, camera) => {
  // If this returns false, the instance is not rendered.
  return myInstancedMesh.getUniformAt(instanceIndex, 'myCustomVisibility');
};

Yes, and I would like to implement it in the future, once my library also supports WebGPURenderer.


What I find sorely missing when using InstancedMesh, and apparently @agargaro’s enhanced implementation as well, is the ability to attach instances to arbitrary parents in the scene graph.

I’m making a space shmup where I have a large number of ships on the screen, and each ship has thrusters as billboards attached to them. Now each ship requires several draw calls because of those child billboards, which could all be rendered in a single draw call if we could attach instanced meshes to parent meshes.

335 draw calls for this; it could be done in 120.


@agargaro I wish I could just use your InstancedMesh2 library; I can’t because of some peculiarities of my project, but it’s awesome and always a source of inspiration and new ideas. I’m using your BVH implementation, which I find genius (with some modifications to fit my needs).

In that spirit let me share some ideas, in the best constructive mindset, not a negative criticism at all :slightly_smiling_face:.

I have doubts that the partial updates you implemented in SquareDataTexture are worth it.
The overhead of figuring out which rows should be updated, I think, puts more load on the CPU (extra function calls, temporary object creation and, especially, access to non-contiguous memory through intermediary arrays/objects) than just updating the whole array backing the texture. The JIT compiler is also super optimized for array access by index.

If I’m not mistaken, three.js already uses texSubImage2D for DataTexture when you flag the texture with needsUpdate, with the big benefit over texImage2D of not having to reallocate the texture storage on the GPU.

Those partial updates could make sense if you were updating big textures (something like 2k/4k), but when using them, for example, to store a single value per instance, a 512x512 texture can potentially store data for more than 250k instances, so the CPU/GPU bandwidth savings are negligible compared to the additional CPU work of figuring out the partial updates.
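To make that capacity argument concrete, here is a tiny (made-up) helper computing the smallest square power-of-two texture that fits one texel per instance:

```javascript
// Smallest power-of-two side for a square texture holding `n` one-texel instances.
function textureSideFor(n) {
  let side = 1;
  while (side * side < n) side *= 2;
  return side;
}
// e.g. 250k instances fit in a 512x512 texture (512 * 512 = 262144 texels).
```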

That said, it could make sense if the texture data that has to be updated every frame doesn’t change much; but to store the culling data the whole texture has to change every frame (unless you diff the previous and next arrays, but that would be much more work than just feeding the new culling value for every instance).


Your use case could be rendered in just 2 draw calls :wink: (ship bodies in one and thrusters in another).

It’s more implementation work, but let me tell you how I do it:
You create ‘proxy’ objects that work in the traditional parent-child scene-tree style, but those children are actually rendered using InstancedMeshes.

So what gets rendered are ‘layers’ of InstancedMeshes, while you keep your object logic separate; the objects figure out which layer they have to use to render.

In your animation loop you then ‘sync’ those proxy objects with the data of your instanced meshes.
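A minimal sketch of that proxy idea, assuming 2D positions and a flat two-floats-per-instance layer buffer (every name here is hypothetical; the real thing would carry full matrices):

```javascript
// Logic lives on plain objects in a parent/child tree; rendering reads their
// world state into the flat per-instance buffer of the layer's InstancedMesh.
class Proxy2D {
  constructor(x = 0, y = 0) {
    this.x = x;
    this.y = y;
    this.children = [];
  }
  add(child) {
    this.children.push(child);
    return child;
  }
  // Flatten world positions depth-first into `out` (2 floats per instance).
  sync(out, offset = 0, px = 0, py = 0) {
    const wx = px + this.x;
    const wy = py + this.y;
    out[offset++] = wx;
    out[offset++] = wy;
    for (const c of this.children) offset = c.sync(out, offset, wx, wy);
    return offset;
  }
}
```

Each frame you would call `root.sync(layerBuffer)` and flag the corresponding instanced attribute with `needsUpdate`.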

That’s what I did when implementing https://www.make.com/en/grid, and the whole graph renders using less than 30 draw calls, no matter the total number of nodes/links needed.

FYI: I’m getting VERY good performance (considering the culling is done on the CPU side; I’m looking forward to doing it with WebGPU soon, so it runs only on the GPU side with a compute shader). Running on an integrated Intel GPU, there is just 5-10% CPU overhead to perform the culling and send the data to the GPU.

The 4 data textures you see on the right are 4 different culling passes I’m doing (team/folder/link/node).
Each one relies on a different set of BVHs for those objects.
It brings the GPU usage from absolutely maxed-out to 40-50% in the worst hardware no matter how big the scene is.

A sneak peek of it (the red squares in the textures on the right side represent the objects culled against the red rectangle/camera frustum, which is half the width/height of the normal viewport):

Then I use the values in those textures in the materials of many instanced meshes (around 30) to skip the vertex and fragment shader computations as early as possible, which makes a HUGE difference, especially on shitty integrated GPUs.

I will share more details once I settle on a definitive solution.


I’ve been thinking about the proxy approach by subclassing Object3D, but I haven’t tried to implement it yet; I was hoping someone would point me to an existing implementation and save me the effort :wink:

But how would you render those 72 meshes in one draw call? Each ship is unique, and they have moving parts. I could cram the geometries into a single buffer, but what then?

Thank you :smiley: I implemented it by reading this article and studying the three-mesh-bvh library (an incredible source of inspiration). There are a few things that I want to add.

The only way to improve the library is user feedback, so I am happy to read it.

Great idea, I will prepare a benchmark page to see whether it is worth it with smaller textures. When I implemented and tested this feature, I had about 30k instances and I was updating the matrix and the color of only one instance, based on user interaction; in that case the savings were significant.

In any case, you can disable it by setting the partialUpdate flag, or you can adjust the maxUpdateCalls property.

The culling data (the indices to be rendered) is not stored in a texture, but in an InstancedBufferAttribute.


If you would like to join my Discord server, there is maybe some nice optimization that can be done, and we can discuss it. Also there are a couple of changes to the BVH that I could make that might help you.


It is quite a requested feature that I will definitely add; however, I have been juggling so many projects that I have little time. It would be nice to discuss it on Discord or open an issue on GitHub so we can figure out the best strategy to implement it together.

Have you already seen BatchedMesh and my batched-mesh-extensions library? If all instances share the same material, you can use it.

There’s an official example


I know about BatchedMesh but it’s not supported in all browsers, so it doesn’t help in Firefox, which I care dearly about. Performance of that example is abysmal in Firefox. But I’m working on having a single material for all ships, and I hope they will support the batch extension at one point.

Ok, I got a solution that at least works for me.

At first I thought that just providing a ‘visibility’ buffer to the shaders to stop them as early as possible (a return on the first line of main() in the vertex shader, and a discard on the first line of main() in the fragment shader) would do the trick. But much to my dismay I found out that, at least on low-end hardware (integrated GPUs…), it made zero difference performance-wise; GPU usage didn’t change a bit :face_with_spiral_eyes:.
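For reference, this is roughly how such an early-out can be injected through three.js’s `onBeforeCompile` (the `vVisible` varying is a made-up name); as described above, on integrated GPUs it made no measurable difference:

```javascript
// Patch a fragment shader so it discards culled instances on the first line of
// main(). `vVisible` is an assumed varying fed from a per-instance attribute.
function injectEarlyDiscard(fragmentShader) {
  return fragmentShader.replace(
    'void main() {',
    'void main() {\n\tif (vVisible < 0.5) discard;'
  );
}

// Typical wiring (sketch):
// material.onBeforeCompile = (shader) => {
//   shader.fragmentShader = injectEarlyDiscard(shader.fragmentShader);
// };
```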

So I had to try an approach similar to what you use in InstancedMesh2, @agargaro: recompute the list of instances to draw every frame and fetch any additional data needed via ‘indirect’ access in the shaders, avoiding the need to rearrange the rest of the buffers.

And yeah, it worked like a charm, even considering the extra CPU overhead of evaluating those ‘renderingLists’ for each InstancedMesh (I have around 40 in my scene) and the extra GPU overhead of using textures instead of instanced attributes.

Sadly I rely not only on InstancedMeshes, but also on BatchedMeshes and a lot on ‘batched’ Troika text (which uses neither of those; it’s more of a huge merged-geometry approach), so I can’t squeeze out all the performance gain I wish I could.

For the BatchedMeshes, the native three.js perObjectFrustumCulled did the trick, but for the Troika texts I still have to investigate whether something can be done about it; for now they are sadly excluded from culling.

But I’m kinda happy with the result: on the typical Intel integrated GPU my app goes from 100% GPU utilization and around a 30-50% frame drop, to 70-80% GPU utilization and no frame drop when the camera is zoomed enough to cull most of the vertices.

I still feel there is more to do there; a surprisingly big share of our customers don’t even have a dedicated GPU in their system, so I have to try my best to give them the smoothest experience, I’m stubborn like that :slight_smile:.

Here’s a little video of what I finally came up with.
The 4 textures on the right of the red half-viewport rectangle represent, from top to bottom, the culling of Teams > Folders > Links > Symbols (I perform culling only on those 4 high-level objects and then reuse that info across the objects that need it).
I’m reducing the viewport frustum to half to show how the objects are culled. The team and folder rectangles are culled by three.js itself, so you can’t see them appear/disappear (they use the full viewport for culling), but they are being culled too.
All the texts, rendered using Troika, are not culled, as I mentioned :cry:


@agargaro FYI:

My biggest bottleneck is usually on the CPU, especially when working with big arrays; v8/JS is very picky about what is fast or slow depending on memory access patterns and machine-code optimizations.

Instead of working with a renderList object as you do, I chose to directly modify a buffer of instance ids; I just run something like this each frame:

  • this.cullingData is a reference to an array of the top-level culled objects that is shared across many (around 10-15) child instanced meshes.
updateRenderList() {
    if (this.cullingData) {
        let count = 0;
        for (let i = 0; i < this.elems.length; i++) {
            if (this.cullingData.array[this.elemsGlobalInstanceId[i]] && this.elemsVisibilities[i]) {
                this.instanceId!.array[count++] = i;
            }
        }
        this.count = count;

        this.instanceId!.needsUpdate = true;
    }
}

As you can see, the code above is very similar to what you did in the InstancedMesh2 lib, but with a somewhat different data architecture and execution stack :slight_smile:.

I think we are some of the few people in the community nowadays very focused on the performance of instancing/batching, so please ping me if I can be of any help. I will try to join your Discord as you proposed too; I just need a minute to get my head above water, I’m currently absolutely drowning in work with all this :weary_cat:.

Also I’ve been scratching my head a lot trying to think how to incorporate per-instance frustum culling directly into three.js’s InstancedMesh. I mean, if BatchedMesh has it, why not InstancedMesh too? (I deeply believe the APIs of both classes should converge sooner rather than later.)
Even if it’s not the most efficient system (the one in BatchedMesh isn’t either…), what could be the harm? :thinking:


I think it is right not to add too much complexity to the main repository; there would be too much complex code to maintain.

Also, the shaders would be different and there would be breaking changes.

I created an external library for this :innocent:
