Optimization of large amounts (100-1000) of Skinned Meshes (CPU bottlenecks)

Can you sort by self time and after that also check with view sensitive enabled?

With sensitive enabled it is about 50-55 fps.


Please show me the bottom-up tab and sort by self time.

This one?

Yes, thanks. When you click on texSubImage2D, you’ll notice that most of the time is spent uploading textures to the GPU.

@luisherasme @PavelBoytchev Yo guys, while it's good to discuss different optimization strategies, I think it would be best to move debugging conversations to DMs. JavaScript profiler instructions are rather off-topic and also take up a large part of the thread :face_with_spiral_eyes:


Update:
I have improved the speed of the instanced skinned mesh by implementing an approach inspired by @DolphinIQ's proposal. Previously, I recalculated the skeleton for each instance individually. Now I group the instances beforehand by animation and time, calculate the skeleton once per group, and assign the result to every instance in that group. As a result, I no longer need to recalculate the skeleton for each instance, which saves a lot of time.
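
To illustrate the grouping idea, here is a minimal sketch in plain JavaScript. It is not the actual implementation from the demo: `groupInstances`, `updateAll`, the `{ clip, time }` instance shape and the `computePose` callback are hypothetical names, and quantizing time to a fixed frame rate is my own assumption so that instances close in time land in the same group.

```javascript
// Group instances that play the same clip at the same (quantized) time.
// Assumption: each instance is a plain object { clip, time } with time in seconds.
function groupInstances(instances, fps = 30) {
  const groups = new Map();
  for (let i = 0; i < instances.length; i++) {
    const { clip, time } = instances[i];
    const frame = Math.round(time * fps); // quantize so nearby times share a pose
    const key = `${clip}:${frame}`;
    let group = groups.get(key);
    if (!group) {
      group = { clip, frame, members: [] };
      groups.set(key, group);
    }
    group.members.push(i);
  }
  return groups;
}

// computePose is a stand-in for the expensive skeleton update.
// It runs once per group instead of once per instance.
function updateAll(instances, computePose, fps = 30) {
  const poses = new Array(instances.length);
  let evaluations = 0;
  for (const { clip, frame, members } of groupInstances(instances, fps).values()) {
    const pose = computePose(clip, frame / fps);
    evaluations++;
    for (const index of members) poses[index] = pose; // share the same pose object
  }
  return { poses, evaluations };
}
```

With 1000 instances spread over a handful of clips and a 30 fps quantization, the number of skeleton evaluations is bounded by the number of distinct (clip, frame) pairs rather than by the instance count.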

Demo on my phone:


Realistically speaking, the problem is matrix multiplication. You have a lot of bones and a lot of instances.

Sampling animation curves is an issue too.

Off the top of my head, I would recommend two paths:

  1. Move curve sampling and matrix multiplications off the main thread. The simplest path would be to move them to a separate worker thread, but a better path would be to send an animation matrix for each instance to the GPU and let sampling and matrix operations happen there.
    That is, you have a matrix with one side being the animation indices for a model and the other side being timing. You can reduce this to, say, a list of 2-3 animations instead. Up to you really.

  2. Bake animation states. Instead of doing matrix multiplications and animation curve sampling at runtime, pre-bake a table for each animation, where each row contains the matrices for every bone at one point in time. This way you only need to retrieve the relevant row from the table: no computation required, only memory access.
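
As a rough sketch of the baking approach (option 2), the table can be a single flat `Float32Array` with one row per frame. Everything named here is hypothetical: `bakeClip`, `poseAt` and the `samplePose` callback are illustration only, and I am assuming 16 floats per bone matrix, a fixed bake rate, looping playback and non-negative times.

```javascript
// Bake a clip into one flat table: frameCount rows of boneCount * 16 floats.
// samplePose(time) is assumed to return boneCount * 16 floats for that time.
function bakeClip(samplePose, duration, fps, boneCount) {
  const frameCount = Math.max(1, Math.round(duration * fps));
  const stride = boneCount * 16; // floats per row
  const table = new Float32Array(frameCount * stride);
  for (let f = 0; f < frameCount; f++) {
    table.set(samplePose(f / fps), f * stride);
  }
  return { table, frameCount, stride, fps };
}

// Runtime lookup: no curve sampling and no matrix multiplication,
// only a view into the pre-baked memory. Assumes time >= 0 and a looping clip.
function poseAt(baked, time) {
  const frame = Math.floor(time * baked.fps) % baked.frameCount;
  const offset = frame * baked.stride;
  return baked.table.subarray(offset, offset + baked.stride);
}
```

The same flat layout could also be uploaded once as a data texture, so that the row lookup happens in the vertex shader, but that part is not shown here.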

It’s a fun problem to think about. I did work on a similar problem for fun a month or two ago. It was just a fun prototype for me, and I was leaning towards the first approach myself, since you end up with bone matrices baked into a texture on the GPU anyway, might as well move earlier animation stages to the GPU also.

One more thing to think about in these types of problems is memory access patterns. You want to keep data as cache-coherent as possible, and the code that loops over instances as tight as possible. Often it’s more efficient to take a complex piece of loop logic and break it into stages, with one small loop per stage, instead of having one larger loop.
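
A small sketch of what "one loop per stage" can look like, assuming per-instance data is kept in parallel typed arrays rather than in an array of objects. The field names (`time`, `speed`, `duration`, `frame`) and both stage functions are hypothetical and only meant to show the shape of the code.

```javascript
// Struct-of-arrays storage: each field is one contiguous typed array.
function createInstances(count) {
  return {
    count,
    time: new Float32Array(count),
    speed: new Float32Array(count).fill(1),
    duration: new Float32Array(count).fill(1),
    frame: new Uint16Array(count),
  };
}

// Stage 1: advance and wrap the playback time of every instance.
function advanceTime(store, dt) {
  const { time, speed, duration, count } = store;
  for (let i = 0; i < count; i++) {
    time[i] = (time[i] + dt * speed[i]) % duration[i];
  }
}

// Stage 2: turn each time into a frame index (for example, a row in a baked table).
function resolveFrames(store, fps) {
  const { time, frame, count } = store;
  for (let i = 0; i < count; i++) {
    frame[i] = Math.floor(time[i] * fps);
  }
}
```

Each loop touches only the arrays it needs and walks them linearly. Whether this actually beats a single fused loop depends on the workload, so it is worth measuring both.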

PS:
To clarify the first point, about animation matrix or a list, that was to enable animation blending support.

  1. replace matrix.elements with a SharedArrayBuffer
  2. disable the matrixAutoUpdate property of bones
  3. start the animation in another thread
  4. call the updateMatrixWorld method on the bones

Works: provided that the object is added to the render and step 2 is completed.
Doesn’t work: if you perform the second step before adding the object to the render.
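
To show what step 1 of the list above means in terms of memory, here is a minimal sketch that does not depend on three.js. The bones are stand-in objects shaped like `{ matrixAutoUpdate, matrix: { elements } }`, and `BONE_COUNT` and `writeTranslation` are hypothetical. The worker itself is not shown: `writeTranslation` only mimics what a worker would do after receiving the buffer via `postMessage`.

```javascript
const BONE_COUNT = 4;          // hypothetical bone count
const FLOATS_PER_MATRIX = 16;  // 4x4 matrix
const BYTES_PER_MATRIX = FLOATS_PER_MATRIX * Float32Array.BYTES_PER_ELEMENT;

// One shared block of memory holding every bone matrix.
const shared = new SharedArrayBuffer(BONE_COUNT * BYTES_PER_MATRIX);

// Stand-in bones: each matrix.elements is a view into the shared block.
const bones = [];
for (let i = 0; i < BONE_COUNT; i++) {
  const elements = new Float32Array(shared, i * BYTES_PER_MATRIX, FLOATS_PER_MATRIX);
  elements.set([1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]); // identity
  bones.push({ matrixAutoUpdate: false, matrix: { elements } });
}

// What the animation thread would do: create its own view over the same
// buffer and write into it. Indices 12..14 hold the translation in a
// column-major 4x4 matrix.
function writeTranslation(buffer, boneIndex, x, y, z) {
  const view = new Float32Array(buffer, boneIndex * BYTES_PER_MATRIX, FLOATS_PER_MATRIX);
  view[12] = x;
  view[13] = y;
  view[14] = z;
}
```

Because both views point at the same memory, a write from the other thread shows up in `bones[i].matrix.elements` without copying. Note that browsers only expose `SharedArrayBuffer` on cross-origin isolated pages.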
