Ideas on performing fast per-instance frustum culling on InstancedMesh

Hi all !

I would like to ask the forum for advice on how to perform frustum culling on the instances of an InstancedMesh.

I know it’s a feature not supported natively by three.js, but I have some intuition that ‘something’ could be done in that regard, even if it’s not considered ‘true’ culling.

The most straightforward way seems to be reordering the instances and all attribute buffers and then playing with the ‘count’ property to draw only the non-culled instances. If I’m not mistaken, that’s what @agargaro did in his awesome InstancedMesh2 library. But accessing, reordering and recomputing all those arrays/buffers in JavaScript every frame (especially if you have a good number of custom attribute buffers, which I do) seems to be way more computing work than just rendering all instances without doing any culling…
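Just to make that idea concrete (a rough sketch, not necessarily how InstancedMesh2 does it; `visible` is a hypothetical boolean array produced by a CPU-side culling pass):

function compactInstances(mesh, visible) {
  const m = new THREE.Matrix4();
  let write = 0;
  for (let read = 0; read < visible.length; read++) {
    if (!visible[read]) continue;
    mesh.getMatrixAt(read, m);
    mesh.setMatrixAt(write++, m);
    // ...and every custom instanced attribute would have to be compacted
    // the same way, from an untouched copy of the source data.
  }
  mesh.count = write;                       // draw only the surviving instances
  mesh.instanceMatrix.needsUpdate = true;   // re-upload the reordered matrices
}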

But instead I’ve been toying for some time with the idea of stopping the computations early in the vertex shader, and therefore also discarding the fragment shader work, for the culled instances, as long as all the instanced meshes’ materials have some kind of texture/attribute that tells them whether a given instance should be visible.
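Something along these lines is what I have in mind, as a rough sketch using a per-instance visibility attribute (`geometry`, `material` and `instanceCount` are placeholders):

// Per-instance 0/1 flag; an InstancedBufferAttribute gets a divisor of 1 automatically.
const visibility = new THREE.InstancedBufferAttribute(
  new Float32Array(instanceCount).fill(1), 1
);
geometry.setAttribute('instanceVisibility', visibility);

material.onBeforeCompile = (shader) => {
  shader.vertexShader = shader.vertexShader
    .replace(
      '#include <common>',
      'attribute float instanceVisibility;\n#include <common>'
    )
    .replace(
      '#include <project_vertex>',
      `#include <project_vertex>
      // Push culled instances outside clip space; the triangle is clipped,
      // so the fragment shader never runs for them.
      if (instanceVisibility < 0.5) gl_Position = vec4(0.0, 0.0, 2.0, 1.0);`
    );
};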

All my ‘high level’ objects are static, and I compute a BVH on init to know their position and size. So in theory, knowing whether they should be culled is pretty fast and efficient (on the CPU side). How could I then pass this data to all the meshes on the GPU in the most efficient way?

Maybe store the culling value every frame in some kind of buffer/attribute/texture shared among all affected InstancedMeshes? This is where I start to doubt whether the cost of updating/uploading it to the GPU would be worth it, especially on low-end hardware, which is the target I’d most like to improve performance on.
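For example, a shared single-channel DataTexture holding one visibility byte per instance, rewritten every frame (sizes and names are just placeholders):

const size = 512;                                     // 512 * 512 = 262,144 instances
const visibilityData = new Uint8Array(size * size);
const visibilityTexture = new THREE.DataTexture(
  visibilityData, size, size, THREE.RedFormat, THREE.UnsignededByteType
);
visibilityTexture.needsUpdate = true;

// Every frame, after the CPU/BVH culling pass:
function uploadCulling(visibleFlags) {                // Uint8Array of 0 / 255 per instance
  visibilityData.set(visibleFlags);
  visibilityTexture.needsUpdate = true;               // one upload shared by all materials
}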

Maybe there is some way using GPU compute to do the culling 100% on the GPU side?

Any idea would be super welcome; maybe it could end up as something we could add to the engine.

Thx!

Hi!

My library worked this way in the first version, but now it works a bit better:

  • It now has indirection: it uses an InstancedBufferAttribute to manage the indices of the instances to be rendered, allowing selective rendering, efficient culling, and sorting before sending data to the GPU.
  • It uses a SquareDataTexture (an extended version of DataTexture supporting partial updates) to store instance matrices and data.

So you don’t use the InstancedBufferAttribute directly, but the utility methods (setUniformAt), and nothing gets reordered.

The shader is now similar to the BatchedMesh one.
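To illustrate the idea for other readers (a conceptual sketch, not the library’s actual code): only a small index attribute is rewritten each frame, and the shader uses it to look up the real instance data, so nothing else gets reordered:

// Hypothetical names: `mesh` draws up to maxCount instances, `geometry` is its geometry.
const renderIndices = new THREE.InstancedBufferAttribute(new Float32Array(maxCount), 1);
geometry.setAttribute('renderIndex', renderIndices);

function setVisibleInstances(visibleIds) {
  for (let i = 0; i < visibleIds.length; i++) renderIndices.setX(i, visibleIds[i]);
  renderIndices.needsUpdate = true;       // tiny upload compared to reordering everything
  mesh.count = visibleIds.length;         // only the referenced instances are drawn
}

// In the vertex shader, `renderIndex` selects which texel of the matrix/data
// texture to fetch, much like BatchedMesh does with its indirection.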

You can use onFrustumEnter if you don’t want to render an instance based on attributes:

myInstancedMesh.onFrustumEnter = (instanceIndex, camera) => {
  // If this returns false, the instance is not rendered.
  return myInstancedMesh.getUniformAt(instanceIndex, 'myCustomVisibility');
};

Yes, and I would like to implement it in the future, when my library also supports WebGPURenderer.

What I find sorely missing when using InstancedMesh, and apparently @agargaro’s enhanced implementation as well, is the ability to attach instances to arbitrary parents in the scene graph.

I’m making a space shmup where I have a large number of ships on the screen, and each ship has thrusters as billboards attached to them. Now each ship requires several draw calls because of those child billboards, which could all be rendered in a single draw call if we could attach instanced meshes to parent meshes.

335 draw calls for this; it could be done in 120.

@agargaro I wish I could just use your InstancedMesh2 library, I can’t because of some peculiarities of my project, but it’s awesome and always a source of inspiration and new ideas. I’m using your BVH implementation which I find genius (with some modifications to fit my needs).

In that spirit, let me share some ideas in the most constructive mindset, not as negative criticism at all :slightly_smiling_face:.

I have doubts that the partial updates you implemented in SquareDataTexture are worth it.
The overhead of figuring out which rows should be updated puts, I think, more load on the CPU (additional function calls, temporary object creation and, especially, access to non-contiguous memory in intermediary arrays/objects) than just updating the whole array backing the texture. The JIT compiler is also heavily optimized for array access by index.

If I’m not mistaken, three.js already uses texSubImage2D for a DataTexture when you flag it as needsUpdate, with the big benefit over texImage2D of not having to reallocate the texture storage on the GPU.

Those partial updates could make sense if you were updating big textures (something like 2k/4k), but when using them, for example, to store a single value per instance, a 512x512 texture can potentially hold data for more than 250k instances, so the CPU/GPU bandwidth savings are negligible compared to the additional CPU work of figuring out the partial updates.
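Rough numbers behind that, assuming a hypothetical 8-bit single-channel 512x512 texture fully re-uploaded every frame:

const side = 512;
const instances = side * side;                        // 262,144 instances
const bytesPerFrame = instances * 1;                  // R8 texture: 256 KB per upload
const mbPerSecond = (bytesPerFrame * 60) / (1024 * 1024);
console.log(instances, bytesPerFrame, mbPerSecond);   // ~15 MB/s at 60 fps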

That said, it could make sense if the data in the texture doesn’t change much from frame to frame, but to store the culling data the whole texture has to change every frame (unless you diffed the previous and next arrays, which would be much more work than just writing the new culling value for every instance).

Your use case could be rendered in just 2 draw calls :wink: (ship bodies in one and the thrusters in another).

It’s more implementation work, but let me tell you how I do it:
You create ‘proxy’ objects that work in the traditional parent-child tree display list style. But then those children are really rendered using InstancedMeshes.

So what gets rendered are ‘layers’ of InstancedMeshes, while you keep your object logic separate; each object figures out which layer it has to render into.

In your animation loop you then ‘sync’ the data of those InstancedMeshes from the proxy objects (see the sketch below).
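A minimal sketch of what that sync could look like (shipBodiesMesh, thrusterBillboardsMesh and the proxy arrays are hypothetical names):

const shipProxies = [];        // plain Object3D parents holding the game logic
const thrusterProxies = [];    // Object3D children attached to the ship proxies

function syncLayer(instancedMesh, proxies) {
  for (let i = 0; i < proxies.length; i++) {
    proxies[i].updateWorldMatrix(true, false);             // resolve parent-child transforms
    instancedMesh.setMatrixAt(i, proxies[i].matrixWorld);
  }
  instancedMesh.count = proxies.length;
  instancedMesh.instanceMatrix.needsUpdate = true;
}

// In the animation loop: one call per layer, so two draw calls in total.
syncLayer(shipBodiesMesh, shipProxies);
syncLayer(thrusterBillboardsMesh, thrusterProxies);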

That’s what I did when implementing https://www.make.com/en/grid, and the whole graph renders using less than 30 draw calls, no matter the total number of nodes/links needed.

FYI: I’m getting VERY good performance (considering the culling is done on the CPU side; I’m looking forward to doing it with WebGPU soon, so it runs entirely on the GPU with a compute shader). Running on an integrated Intel GPU, there’s just 5-10% CPU overhead to perform the culling and send the data to the GPU.

The 4 data textures you see on the right are 4 different cullings I’m doing (team/folder/link/node).
Each one relies on a different set of BVHs for those objects.
It brings the GPU usage from absolutely maxed out to 40-50% on the worst hardware, no matter how big the scene is.

A sneak peek of it (the red squares in the right-side textures represent the culled objects inside the red rectangle/camera frustum, which is half the width/height of the normal viewport):

Then I use the values in those textures in the materials of many instanced meshes (around 30) to skip the vertex and fragment shader computations as early as possible, which makes a HUGE difference, especially on shitty integrated GPUs.
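In essence it’s the texture-based variant of the attribute idea from my first post: the vertex shader fetches the flag by instance id and collapses the instance when it’s culled (rough sketch, WebGL2 / GLSL ES 3.0, reusing the 512x512 visibility texture from before):

material.onBeforeCompile = (shader) => {
  shader.uniforms.uVisibility = { value: visibilityTexture };
  shader.vertexShader = shader.vertexShader
    .replace(
      '#include <common>',
      'uniform sampler2D uVisibility;\n#include <common>'
    )
    .replace(
      '#include <project_vertex>',
      `#include <project_vertex>
      ivec2 texel = ivec2(gl_InstanceID % 512, gl_InstanceID / 512);
      if (texelFetch(uVisibility, texel, 0).r < 0.5) {
        // Clipped away: no rasterization, no fragment shader work.
        gl_Position = vec4(0.0, 0.0, 2.0, 1.0);
      }`
    );
};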

I will share more details once I end up with a definitive solution.

I’ve been thinking about the proxy approach (subclassing Object3D) but haven’t tried to implement it yet; I was hoping someone would point me to an existing implementation and save me the effort :wink:

But how would you render those 72 meshes in one draw call? Each ship is unique, and they have moving parts. I could cram the geometries into a single buffer, but what then?

Thank you :smiley: I implemented it by reading this article and studying the three-mesh-bvh library (an incredible source of inspiration). There are a few things that I want to add.

The only way to improve the library is user feedback, so I am happy to read it.

Great idea, I will prepare a benchmark page to see whether it is worth it with smaller textures. When I implemented and tested this feature, I had about 30k instances and I was updating the matrix and the color of only one instance, based on user interaction; in that case the savings were significant.

In any case, you can disable it by setting the partialUpdate flag, or you can update the maxUpdateCalls property.

The culling data (the indices to be rendered) is not stored in a texture, but in an InstancedBufferAttribute.

If you would like to join my Discord server, we can discuss it; there is maybe some nice optimization that can be done. Also, there are a couple of changes to the BVH that I could make that might help you.

It is quite a requested feature that I will definitely add; however, I have been messing around with so many projects and have little time. It would be nice to discuss it on Discord or open an issue on GitHub so we can figure out the best strategy to implement it together.

Have you already seen BatchedMesh and my batched-mesh-extensions library? If all instances share the same material, you can use it.
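For reference, a minimal BatchedMesh setup (assuming a recent three.js revision where addInstance exists; the geometry/material names are placeholders) looks roughly like this:

// Different geometries, one shared material, one draw call,
// with per-instance frustum culling handled by three.js.
const batched = new THREE.BatchedMesh(72, 50000, 100000, sharedShipMaterial);

const bodyGeomId = batched.addGeometry(shipBodyGeometry);
const thrusterGeomId = batched.addGeometry(thrusterGeometry);

const shipId = batched.addInstance(bodyGeomId);
const thrusterId = batched.addInstance(thrusterGeomId);

batched.setMatrixAt(shipId, shipMatrix);
batched.setMatrixAt(thrusterId, thrusterMatrix);

batched.perObjectFrustumCulled = true;
scene.add(batched);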

There’s an official example

I know about BatchedMesh, but it’s not supported in all browsers, so it doesn’t help in Firefox, which I care dearly about. The performance of that example is abysmal in Firefox. But I’m working on having a single material for all ships, and I hope they will support the batch extension at some point.