Performance ideas (loading objects in workers, faster texture uploads, etc)

One idea: toTransferable API (for working across Worker thread boundaries)

Problem:

Getting things like ColladaLoader and GLTFLoader to function in a web Worker is not too difficult: it requires mocking (or polyfilling) some APIs on the worker side (f.e. document.createElement('img')) to make those loader classes believe they are using DOM APIs, and filling in the mocks to use what is available in a worker (f.e. fetch for fetching image data).
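To make the shape of that mocking concrete, here is a minimal sketch of the kind of DOM shim involved. The property names (createElement, addEventListener, src) match what loaders typically touch; the ImageShim internals are an assumption and would need to be fleshed out per loader (f.e. having the src setter call fetch plus createImageBitmap, then fire the load listeners):

```javascript
// Minimal sketch of a DOM shim for running loaders inside a Worker.
// Hypothetical names; a real shim would wire `src` up to fetch() and
// createImageBitmap() and then fire the stored 'load' listeners.
class ImageShim {
  constructor() {
    this.listeners = {};
  }
  addEventListener(type, fn) {
    (this.listeners[type] ||= []).push(fn);
  }
  // In a real worker shim, assigning `src` would kick off fetch + decode.
  set src(url) {
    this._src = url;
  }
  get src() {
    return this._src;
  }
}

const documentShim = {
  createElement(tag) {
    if (tag === 'img') return new ImageShim();
    throw new Error(`no shim for <${tag}>`);
  },
};

// In the worker you would then do something like:
//   self.document = documentShim;
// before running ColladaLoader / GLTFLoader.
```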

That’s the easy part!

Where it gets tricky is when we want to send the loaded objects to another thread (f.e. back to the main thread if that is where Three.js is running).

What I’ve been doing so far is calling toJSON on collada.scene or gltf.scene and passing that through postMessage (using Surma’s fabulous Comlink library from Google makes this easier), then on the receiving end using ObjectLoader to convert the payload back into a usable Three.js object.

But this has costs:

  • On the sender side, in the toJSON call, TypedArrays are converted to regular Arrays (CPU time cost in sender).
  • Transferring the object over postMessage means the regular Arrays incur a memory copy (CPU time cost during postMessage).
  • On the sender side, Textures have their images converted into Blobs, and finally into object URLs (CPU time cost in sender).
  • On the receiving side, Arrays are converted back into TypedArrays by ObjectLoader (CPU time cost in receiver).
  • On the receiving side, images need to be fetched and loaded from the blob URLs (CPU time cost in receiver).

What we really want to avoid is the CPU time cost during postMessage and the CPU time cost in the receiver, because those are what pause (jank) the UI thread. We don’t really care if some processing takes longer in the worker, as long as we eventually get the result without janking the user’s experience, although saving time wherever possible is still better.

But, it works! With minimal polyfill, we can load objects in a worker, and actually get some performance gain, but nowhere near as much as could be possible.

Solution:

What I think would be super nice is an API that is basically identical to toJSON, perhaps called toTransferable, that does exactly what toJSON does, except that it will:

  • On the sender side, leave TypedArrays as TypedArrays (CPU savings in sender).
  • Transfer TypedArrays as Transferables, which is super fast with no memory copy (CPU savings during postMessage).
  • On the sender side, convert images to ImageBitmap (CPU time cost in sender, but savings during postMessage), or update the Texture class to always load images into ImageBitmap by default (CPU savings both in sender and during postMessage).
  • On the receiving side, leave TypedArrays as TypedArrays, simply passed around by ObjectLoader (or perhaps a new class called TransferableLoader) (CPU savings in receiver).
  • On the receiving side, leave ImageBitmaps as ImageBitmaps, simply passed around (CPU savings in receiver).

As before, what really matters are the savings during postMessage and on the receiving side, because the receiver is where the UI is running (even if that means spending more time creating ImageBitmaps on the sender side); making the UI-side experience jank-free is the most important goal.

Note that the result of toTransferable would no longer be compatible with JSON.stringify, but the performance gain will be very nice for use with web workers.

toTransferable would return a tuple, [object, transferables], where object is the output object (similar to toJSON’s, but containing TypedArrays, ImageBitmaps, etc) and transferables is a list of all the Transferable objects inside it.

Transferable objects are objects that get sent as a pointer copy instead of a data copy. Passing a regular Array over postMessage means all of the array data is copied, while transferring a Transferable (f.e. a TypedArray’s underlying ArrayBuffer, or an ImageBitmap) means only a pointer is copied to the other side (plus a minimal object wrapper created on the receiving side, which is very cheap compared to copying all the array memory).
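The copy-vs-transfer difference is easy to see without spinning up a Worker, because structuredClone accepts the same transfer list that postMessage does. A small sketch:

```javascript
// structuredClone() takes the same `transfer` option as postMessage(),
// so it is a quick way to observe copy vs transfer semantics.
const data = new Float32Array([1, 2, 3, 4]);

// Copy: the clone gets its own memory; the original is untouched.
const copied = structuredClone(data);

// Transfer: the underlying ArrayBuffer moves to the clone, and the
// original buffer is detached (its byteLength drops to 0). No data copy.
const transferred = structuredClone(data, { transfer: [data.buffer] });
```

After the transfer, the sender-side view is neutered: reading data.length gives 0, which is exactly why transferring is a pointer move rather than a copy.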

Implementation:

toTransferable would mostly be a matter of copying the toJSON code and modifying it a little to preserve the Transferable objects in the output, and, instead of returning just the output object, returning a tuple of that object plus a list of its Transferables.

The list of Transferables is needed because of how postMessage needs to be called:

const [object, transferables] = superDuperScene.toTransferable()
self.postMessage(object, transferables)

// or

self.postMessage(...superDuperScene.toTransferable())

where object contains (anywhere inside of it at any level deep) the transferables that are listed in transferables.
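One way that transferables list could be gathered is a recursive walk over the output object. This is only a sketch of the idea, not real three.js code; collectTransferables and the commented helper name are hypothetical (and in a browser, ImageBitmaps would be collected by the same walk):

```javascript
// Sketch: gather the transferables list for a toTransferable-style
// output by walking it and collecting the backing ArrayBuffer of every
// TypedArray found at any depth. All names here are hypothetical.
function collectTransferables(value, out = new Set()) {
  if (ArrayBuffer.isView(value)) {
    out.add(value.buffer); // transfer the backing buffer, not the view
  } else if (value instanceof ArrayBuffer) {
    out.add(value);
  } else if (value && typeof value === 'object') {
    for (const key of Object.keys(value)) collectTransferables(value[key], out);
  }
  return out;
}

// A hypothetical toTransferable could then end with something like:
//   const object = buildOutputLikeToJSONButKeepingTypedArrays(scene);
//   return [object, [...collectTransferables(object)]];
```

Using a Set means two views over the same ArrayBuffer (f.e. interleaved attributes) only list that buffer once, which postMessage requires.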

Question: what is the fastest way to pass textures into the GPU? If it is something other than ImageBitmap (f.e. maybe just a TypedArray, or perhaps ImageData?), then perhaps we could default to using that instead of ImageBitmap so that the process of worker -> UI thread -> GPU communication is as fast as possible (even if this means slightly more time cost in the worker, but giving us the least amount of :japanese_ogre: jank :japanese_ogre: in UI thread).

I’d be curious if this is actually saving you much time on the main thread, as compared to transferring the original ArrayBuffer from the .GLB and parsing that with GLTFLoader? glTF includes a number of optimizations for network transfer that other formats like DAE, OBJ, and FBX lack, and network transfer is a very similar problem to transfer with the postMessage API.

Beyond that this sounds a lot like three.js#21035, or see three.js#18234 for some details. I’m a bit skeptical that loading in a Web Worker will benefit GLTFLoader — it already uses Web Workers for the heavy lifts like Draco and KTX processing — but certainly for other formats there would be value here.

It’d also be possible to write an even faster version of GLTFLoader with fewer allocations, I think, but with some drawbacks that don’t necessarily make for a good replacement of the current version.

Thanks for the links! I saw the second one before (scheduling bits of work). I hadn’t seen @gkjohnson’s PR.

network transfer is a very similar problem to transfer with the postMessage API.

True, although after a network transfer, there isn’t the Array-to-TypedArray conversion in the UI thread.

After thinking a bit though, I think the bottleneck will still be uploading textures to the GPU.

How do those infinite-world games (f.e. where you land on any planet) load new textures depending on where you go, without freezing, without ever stopping the game’s real-time rendering? That’s effectively the desired goal.

Can we define a texture buffer and change the pixels over time? An updating atlas?

Basically in the particular case I have we need to load models and images over time.

GLTFLoader never does Array-to-TypedArray conversion for vertex data: the vertex data lives in an ArrayBuffer, and creating the typed array is a zero-copy operation: array = new Float32Array( buffer, offset, length ). This is true whether the ArrayBuffer was transferred from the network or from some other source.
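A small illustration of that zero-copy view (the offsets here are made up for the example, not a real .glb layout):

```javascript
// Zero-copy in action: a typed array constructed over an existing
// ArrayBuffer is just a view; no vertex data is copied when it is made.
const buffer = new ArrayBuffer(8 * Float32Array.BYTES_PER_ELEMENT);

// Pretend this buffer arrived over the network; write some data into it.
new Float32Array(buffer).set([9, 9, 1, 2, 3, 4, 5, 6]);

// View 6 floats starting after a 2-float "header". No copy happens here.
const byteOffset = 2 * Float32Array.BYTES_PER_ELEMENT;
const positions = new Float32Array(buffer, byteOffset, 6);
```

Because positions is only a view, writes to the underlying buffer are visible through it immediately, which is how transferring one ArrayBuffer can hand over all the vertex data at once.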

After thinking a bit though, I think the bottleneck will still be uploading textures to the GPU.

That is a common one, yeah. You have a few options:

  • ImageBitmap: Moves decompression of PNG/JPG textures off the main thread. Still need to upload the fully decompressed textures on the main thread. Not supported in all browsers yet.
  • GPU textures (e.g. KTX 2.0 / Basis): Transcode KTX to a GPU compressed format in a worker. The compressed texture is 4-8x smaller than uncompressed data from a PNG or JPEG file, and uploads to the GPU much faster.
  • Incremental texture upload: There are a few ways to do this, like uploading mips one at a time or using texSubImage2D, see Load textures progressively - #5 by ataylor09. This would take a fair bit of work to do, and I don’t know of good examples for it.

With glTF I’d recommend converting textures to KTX for this use case, that’s exactly what it’s designed for. With formats like DAE, efficient loading is much harder.

What I meant is that after running GLTFLoader in a worker and receiving the final gltf.scene object, transferring that scene object (after toJSONing it first) is what eventually results in Array-to-TypedArray conversions on the receiving thread, whereas using GLTFLoader on the UI thread itself would avoid those conversions.

That’s where the changes Garret made in that PR you linked would come in handy. Thanks for linking that. I will give it a try soon.

Luckily I’m on a project targeting Chrome as the minimum browser version.

That’s interesting. I’ll look into that.

Seems we should just switch everything to GLTF then, as all the (your) great work is there. :slight_smile:

For DAE, it seems I would need to update ColladaLoader or other loaders to make them convert images to KTX (the extra conversion cost in a worker being outweighed by eliminating jank on the UI side). That would be a fair bit of work compared to just switching to glTF, I think.

I took a look at that. Ultimately we still want the higher-res textures, so loading those (even with already-visible lower-res textures during loading) and then uploading them to the GPU would still cause the pauses (jank) I am wishing to avoid. It seems compression is ultimately the only thing that can help here (based on what you described, regardless of whether we have progressive enhancement during loading or not).

Does any API (maybe even in WebGPU) allow us to modify texture memory directly? If we could do that, I imagine modifying only certain pixels in place (f.e. bin-packing multiple textures onto one texture).

How does using a <video> element as a texture not freeze the UI on every frame? Does the browser have special access to GPU memory that normal users don’t have? It seems the browser modifies memory in place when updating textures from <video>; otherwise how could it be stutterless and smooth?

Is it possible to replicate what the browser does with <video> manually, with an ArrayBuffer or bitmap? texSubImage2D will copy memory rather than modify it in place, or will it?

The discussion drifted a little off-topic, but I suppose it’s all related (performance). I modified the title.


This is possible but extremely slow; converting images to KTX is best done offline. It may also require some artistic tuning, especially for normal maps; it’s a bit more effort than adjusting JPEG compression.

Does any API (maybe even in WebGPU) allow us to modify texture memory directly?

Not sure; texSubImage2D is the only thing I’m aware of here.

It sounds like ImageBitmap might be the best place to start here. GLTFLoader should already use ImageBitmap when it can, and switching over to KTX textures could give you further wins. For glTF there’s a workflow here: CLI | glTF-Transform.

@donmccurdy Would it be possible to have two scenes in separate threads, and upload textures to them so that one scene can render while the other is uploading a texture? No matter how many offscreen canvases we have, my guess is this won’t work, because uploading to the GPU would block all threads that depend on the GPU, right?

So it would be like this (if it could work, but I’m guessing GPU will block both threads):

  • receive texture from network in scene B (thread B)
  • in thread B, upload the texture to GPU.
  • keep rendering scene A (thread A) to its OffscreenCanvas while thread B loads the texture to GPU
  • When texture upload is complete, transfer the texture to thread A as a Transferable (fast)
  • now thread B takes over rendering to its OffscreenCanvas (main thread has to swap which canvas element is visible)
  • thread A uploads the texture to the GPU while thread B is rendering so that it can get to the same state
  • after thread A is caught up (texture uploaded), continue rendering with thread A (stop thread B from rendering)
  • later when thread B receives a new texture, repeat the whole process.
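Stripped of the WebGL details, the swap protocol above is just a two-state ping-pong. A plain sketch (SceneSwapper and its method names are hypothetical, not real three.js API):

```javascript
// The swap protocol as a plain state machine (no WebGL; names are
// hypothetical). Whichever side is `active` renders and is visible; the
// idle side is free to upload textures, and the two trade roles whenever
// the idle side finishes an upload.
class SceneSwapper {
  constructor() {
    this.active = 'A'; // currently rendering and visible
  }
  get idle() {
    return this.active === 'A' ? 'B' : 'A';
  }
  // Called when the idle context finishes uploading a new texture: it
  // becomes the visible renderer, and the other side catches up next.
  onIdleUploadComplete() {
    this.active = this.idle;
    return this.active; // the main thread would now show this side's canvas
  }
}
```

Each worker would render only while it is the active side, and the main thread would toggle canvas visibility whenever onIdleUploadComplete reports a swap.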

Is this possible? Or will thread B force any gl contexts (no matter what thread they are in) to be blocked?

I don’t think it’s possible to keep WebGL rendering during GPU texture upload, even with a worker and OffscreenCanvas. Not sure if that’s a WebGL thing or a lower-level GPU API constraint, but in any case I’ve never heard of this being done. See https://stackoverflow.com/q/62612868/1314762.


This is interesting. Here is a demo with two scenes, each in its own worker: during texture upload only one scene freezes, while the other mostly continues to render (with only a tiny pause):

This seems to indicate that bouncing between two scenes in two separate threads might be viable. Wdyt?

Any hiccup in the first scene is totally negligible, at least in this case. If this works like it seems to, imagine loading textures at any time (f.e. the user pastes new URLs at any moment), and while panning around smoothly, textures pop into place when they’re ready, with no pause.

Update: The following example implements the concept of swapping the scenes when a texture upload completes. It could be further optimized by not re-drawing the hidden scene (currently both always render, even when not visible). There is still a tiny pause, but it is totally better than the pause that texture handling with only one scene would cause:

We can probably optimize this further with texture compression. Also, using texSubImage2D with a pre-allocated texture, instead of texImage2D, might help by not creating a new texture each time.

Ah, ok! I’d missed that you were creating two parallel WebGL contexts here, that could work. One cost with this approach is 2X GPU memory; on mobile devices especially you may hit that limit. I guess I’d still reach for ImageBitmap and GPU texture compression first, but could see this method being helpful for things like switching between products in a configurator.

Yeah, the main use case I have is receiving arbitrary GLTF models over network periodically (f.e. a sensor scans physical space and sends each scanned piece while moving through the space) and needing to patch those into the scene, but not wanting the scene to pause.


It seems this WebGL proposal from 8 years ago, which hasn’t been implemented by any browser yet, would allow asynchronous texture upload like in the example above (also with WebGL contexts in separate workers), but without needing to duplicate the memory usage:

https://www.khronos.org/webgl/wiki/SharedResouces
