Shade - WebGPU graphics

Worked on ensuring C0 continuity near surfaces across the entire map. For now this is achieved by purging incomplete node levels, which works well enough even if it’s a bit of a blunt tool.

Actual locations for probes during the bake now go through an optimization phase, which allows me to push probes behind surfaces out into the open, resulting in far fewer light leakage artifacts.

This video here has only 28,376 probes in the map, and takes 1.8 MB of VRAM.


Holy cow, that’s a lot of bounces. I never go beyond 5 bounces and 256 samples on the CPU. Very interesting to see your results, keep up the good work!

Spent more time optimizing probe placements. Here’s what I started with

and here’s what we have now

At first glance it may look like there is some denoising going on - that’s not the case.

The first is noisier because it’s baked at 1,024 samples per probe, while the second uses 16k samples per probe. But that’s not super important.

Let’s take a look at some artifacts which are a result of poor probe placement

These are just some of the more prominent leakage artifacts.

This happens because our probes have implicit locations, based on a recursive grid.

The geometry of the scene doesn’t care about this fact, so we commonly end up with situations like this:

If we follow the surface across the probe grid, from A to B

We can see that lighting will change drastically, because the closest probe at A is on the left side of the surface and at B it’s on the right side. Imagine if the surface is a solid sphere and B is inside of the sphere - we’d get a massive light leak, with B being shadowed just because the nearest probe is sunken into the surface.

So B is problematic. But actually, so is A: it is too close to the surface, and is going to oversample it. This is often just referred to as aliasing.

Ideally this is what we want

We take probes behind surfaces and push them through, so they don’t cause leaks, and we take probes in front of the surfaces that are too close and push them out from the surface.

Let’s back up a bit. I said just a little earlier that the probe locations are implicit from the recursive grid, which means that where we sample is fixed. So we want the locations in blue, but when we sample the light map, we will always use the locations in pink.

Can we have both? This may seem like cheating, but the answer is “yes”. That is, we can bake with the locations in blue, and sample with the locations in pink.

But isn’t this wrong?

Yes, it’s wrong, in that it creates a bias. But this bias produces an end result which is less wrong than if we didn’t bias. So actually we’re cancelling out the bias that comes from the grid-like nature of our probe mesh.

The second thing about this bias is that if we have to choose between light leaks and slight lighting shifts, lighting shifts are preferable. Light leaks are very obvious to our eyes, while the lighting shift from moving a probe during baking is going to be incredibly subtle.

Light leaks create visual discontinuities and increase contrast (erroneously).


How do we achieve this?

Here’s the relevant piece of code:

const hit = new SurfacePoint3();

for (let i = 0; i < probe_count; i++) {
	let probe_location_x = locations[i * 3];
	let probe_location_y = locations[i * 3 + 1];
	let probe_location_z = locations[i * 3 + 2];

	if (!bvh.query_point_distance_to_nearest(hit, probe_location_x, probe_location_y, probe_location_z)) {
		// nothing nearby, this should never happen
		continue;
	}

	// got something close by

	const near_surface_x = hit.position.x;
	const near_surface_y = hit.position.y;
	const near_surface_z = hit.position.z;

	const to_hit_x = near_surface_x - probe_location_x;
	const to_hit_y = near_surface_y - probe_location_y;
	const to_hit_z = near_surface_z - probe_location_z;

	const near_surface_orientation = v3_dot(
		to_hit_x, to_hit_y, to_hit_z,
		hit.normal.x, hit.normal.y, hit.normal.z
	);

	// ... rest of the loop body omitted in the original post
}

Hopefully this is enough to figure out the rest.

One important thing to keep in mind is that when you move probes, you should be careful not to worsen aliasing. I cast a ray from the original position to the desired location, and if we get a collision, we move the probe to the mid-point between where it was and the raycast hit.
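Here is a minimal sketch of how the rest of the relocation could look. It is not Shade's actual code: the scene is a single analytic sphere standing in for the BVH, and `nearest_on_sphere`, `desired_bake_location`, `clamp_move`, `make_wall_raycast` and `min_distance` are all hypothetical names. The raycast in the sketch only tests against other geometry (a toy wall), which is a simplification.

```javascript
// Hypothetical stand-in for the BVH query: the whole scene is one solid sphere.
// Returns the closest surface point and its outward normal.
function nearest_on_sphere(sphere, p) {
	const dx = p.x - sphere.x, dy = p.y - sphere.y, dz = p.z - sphere.z;
	const len = Math.hypot(dx, dy, dz); // assumes p is not exactly at the center
	const normal = { x: dx / len, y: dy / len, z: dz / len };
	return {
		normal,
		position: {
			x: sphere.x + normal.x * sphere.r,
			y: sphere.y + normal.y * sphere.r,
			z: sphere.z + normal.z * sphere.r,
		},
	};
}

// Where we would like to bake the probe: at least `min_distance` in front of the surface.
function desired_bake_location(sphere, p, min_distance) {
	const hit = nearest_on_sphere(sphere, p);
	// Same sign convention as the loop above: positive means the probe is behind the surface.
	const orientation =
		(hit.position.x - p.x) * hit.normal.x +
		(hit.position.y - p.y) * hit.normal.y +
		(hit.position.z - p.z) * hit.normal.z;
	if (-orientation >= min_distance) {
		return { x: p.x, y: p.y, z: p.z }; // already far enough in front
	}
	return {
		x: hit.position.x + hit.normal.x * min_distance,
		y: hit.position.y + hit.normal.y * min_distance,
		z: hit.position.z + hit.normal.z * min_distance,
	};
}

// If something blocks the way, stop at the mid-point between the start and the blocker.
// `raycast(from, to)` is a hypothetical callback: first hit point on the segment, or null.
function clamp_move(from, to, raycast) {
	const blocker = raycast(from, to);
	if (blocker === null) return to;
	return {
		x: (from.x + blocker.x) / 2,
		y: (from.y + blocker.y) / 2,
		z: (from.z + blocker.z) / 2,
	};
}

// Toy raycast for the demo: an infinite wall at x = wall_x.
function make_wall_raycast(wall_x) {
	return (from, to) => {
		const dx = to.x - from.x;
		if (dx === 0) return null;
		const t = (wall_x - from.x) / dx;
		if (t < 0 || t > 1) return null;
		return { x: wall_x, y: from.y + (to.y - from.y) * t, z: from.z + (to.z - from.z) * t };
	};
}

const unit_sphere = { x: 0, y: 0, z: 0, r: 1 };
// A probe sunken into the sphere gets pushed through to the outside.
console.log(desired_bake_location(unit_sphere, { x: 0.5, y: 0, z: 0 }, 0.2));
```

The relocated position is only used for baking; sampling still uses the implicit grid position.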

It’s dry and boring stuff, but it’s something I’ve learned the hard way not to neglect.


Yeah, 7 is a bit of an overkill. You get 90% of the lighting from 3 bounces typically.

As for the samples - that’s a tough one, if your scene has a lot of complexity and you want the nearby samples to be uniform - you need a lot of samples.

I remember watching an EA presentation from around 2013-15 where they were presenting their light map baking approach, and they were citing ~30,000 samples per pixel.

You usually start to see convergence around 4k in my experience, but unless you denoise your probes, you’re going to need a lot of samples to achieve a smooth transition across your probe mesh.


Latest results

Fixed a bunch of smaller bugs

light map stats:

  • VRAM Size: 20 MB
  • Probe samples: 16,384
  • Probe count: 324,674
  • Bake time: 267s
  • Bake hardware: RTX 4090

Implemented a different compression scheme for probes, using 26 bytes per probe now, instead of the previous 56. Visually there is no difference, so a definite win.

Reworked statistics for outlier filtering during baking. Previously it was based on mean, now I’m using median, which is less susceptible to blowing up.

Calculating median on the GPU is a pain, especially per-probe, so I’m using a histogram instead. 32 buckets seems to produce a good result.
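To illustrate the idea, here is a CPU-side sketch of a histogram-based median. It is not the GPU code: the function name `histogram_median`, the fixed `[lo, hi]` range and the clamping of out-of-range values into the edge buckets are my assumptions for the example.

```javascript
// Approximate the median with a fixed number of buckets over [lo, hi].
// Values outside the range are clamped into the first/last bucket,
// so a single huge outlier cannot stretch the histogram.
function histogram_median(values, lo, hi, bucket_count = 32) {
	if (values.length === 0) return NaN;
	const buckets = new Uint32Array(bucket_count);
	const scale = bucket_count / (hi - lo);
	for (const v of values) {
		const i = Math.floor((v - lo) * scale);
		buckets[Math.max(0, Math.min(bucket_count - 1, i))]++;
	}
	// Walk the cumulative counts until half of the samples are covered.
	const half = values.length / 2;
	let cumulative = 0;
	for (let i = 0; i < bucket_count; i++) {
		if (cumulative + buckets[i] >= half) {
			// Interpolate linearly inside the bucket that contains the median.
			const t = (half - cumulative) / buckets[i];
			return lo + (i + t) / scale;
		}
		cumulative += buckets[i];
	}
	return hi;
}

// 99 well-behaved samples plus one firefly.
const samples = [];
for (let i = 1; i <= 99; i++) samples.push(i);
samples.push(1e6);

const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
console.log(mean);                               // blown up by the outlier
console.log(histogram_median(samples, 0, 128));  // stays close to the true median of 50.5
```

The error is bounded by the bucket width, so the bucket count trades accuracy against memory and atomics traffic.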

Slightly changed the energy compensation process of the outlier elimination as well. It now diffuses into L0 only, while still respecting the chromaticity of the probe.

Here’s with old:

And here’s with new:


Here’s Sibenik. It’s lit almost entirely indirectly, so it’s a torture test for the system.

Old

New

To highlight how harsh a stress test it is, here’s the same scene path-traced


Old

New


The effect is more pronounced in highly specular scenes.

Bounce counts and sample counts are the same.

Integrated the sparse volumetric lightmap into the GI pipeline:

Here’s GI off


Specular is done via probes as well, using GGX convolution


Spent more time on the specular component of the light maps

Still using SH3 probes, using a GGX ZH basis (thanks to Matt Pettineo).

There’s a bit of chroma undersampling going on, but in the final output it’s not particularly noticeable.

Using reservoir sampling to pull 2 unique probes per pixel, instead of blending all 8 corners of the voxel.
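As an illustration, here is one way to pick 2 distinct corners with probability driven by the trilinear weights, using weighted reservoir sampling with keys `u^(1/w)` (the Efraimidis-Spirakis scheme). This is a sketch, not the shader: `trilinear_weights` and `pick_two_corners` are hypothetical names, and how the two picks are combined afterwards is not covered here.

```javascript
// Trilinear weights of the 8 voxel corners for a fractional position inside the voxel.
// Corner index bits: bit 0 = x, bit 1 = y, bit 2 = z.
function trilinear_weights(fx, fy, fz) {
	const w = new Float32Array(8);
	for (let i = 0; i < 8; i++) {
		const wx = (i & 1) ? fx : 1 - fx;
		const wy = (i & 2) ? fy : 1 - fy;
		const wz = (i & 4) ? fz : 1 - fz;
		w[i] = wx * wy * wz;
	}
	return w;
}

// One pass over the corners, keeping the two largest keys.
// `random` is any function returning a uniform number in [0, 1).
function pick_two_corners(weights, random) {
	let best = -1, best_key = -1;
	let second = -1, second_key = -1;
	for (let i = 0; i < weights.length; i++) {
		if (!(weights[i] > 0)) continue; // zero-weight corners are never picked
		const key = Math.pow(random(), 1 / weights[i]);
		if (key > best_key) {
			second = best; second_key = best_key;
			best = i; best_key = key;
		} else if (key > second_key) {
			second = i; second_key = key;
		}
	}
	// Fewer than 2 entries when the sample sits exactly on a corner.
	return [best, second].filter((i) => i >= 0);
}

const weights = trilinear_weights(0.3, 0.6, 0.1);
console.log(pick_two_corners(weights, Math.random));
```

The appeal is that only 2 probes are fetched and decoded per pixel instead of 8, with the noise left for temporal accumulation to clean up.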

Thanks to NVIDIA’s Marcos Fajardo et al. for the inspiration from their 2023 paper “Stochastic Texture Filtering”.

Applying parallax correction weights to the samples using sphere proxies. This is different from correcting individual probes, but it still improves accuracy.

Frame timing is 0.05ms on RTX 4090 at 1080x1080 resolution.


Worked on the specular GI some more. Improved the selection logic for 2 samples, there was a bit of a bias in the second sample selection.

Decided to drop parallax correction after some testing.

SH3 is too low frequency to have enough angular resolution for parallax to make a lot of difference. I didn’t measure it numerically, but overlaying 2 images, with and without, I can’t tell the difference.

My “Sparse Volumetric Lightmap” implementation ends up having relatively high spatial resolution, which reduces the average correction that parallax would produce even further.

Glad to have investigated this, but in the end doing less work on the GPU is always better :sweat_smile:




GI demo with sparse volumetric lightmap

The light map is only 2.2 MB. The format (SVLM) was created specifically for this project, and it maps directly to the GPU buffer without any translation.

For comparison, this grass albedo texture is 7.52MB as PNG

It needs to be decoded before we can push it to the GPU, where it will take up 2048*2048 pixels at 4 bytes per pixel, or 16MB

But this texture alone will not be enough to render the grass material. You also need the normal map and ORM (occlusion, roughness, metalness), each of which also needs 16 MB of VRAM. So just this grass material will need 48 MB of VRAM in total, vs this lightmap which takes up 2.2 MB for the entire scene.
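The arithmetic behind those numbers, as a small sketch. It assumes uncompressed RGBA8 textures with no mip chain (a full mip chain would add roughly a third on top); `uncompressed_texture_bytes` is just a helper name for the example.

```javascript
// Size of an uncompressed texture on the GPU, ignoring mipmaps.
function uncompressed_texture_bytes(width, height, bytes_per_pixel = 4) {
	return width * height * bytes_per_pixel;
}

const MiB = 1024 * 1024;

const per_texture = uncompressed_texture_bytes(2048, 2048); // 16,777,216 bytes = 16 MiB
const grass_material = 3 * per_texture;                     // albedo + normal + ORM = 48 MiB

console.log(per_texture / MiB);    // 16
console.log(grass_material / MiB); // 48
```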

The map was baked at 7 bounces per sample, and 32,000 samples per probe.

There are a total of 60,826 probes in the lightmap.


Here’s a flythrough:


Would be curious to know what the performance is like, for me the GI part is blazingly fast, taking ~0.1ms in total for both diffuse and specular.


Hey Antonio - how can I reach you for help with a 360 pano viewer? Multiresolution panorama | Pannellum

Perhaps through a private message?


So, as someone working on a WebGPU renderer, I think I got this.

First you click their avatar (1), and then you look for a button that says “Message” (2). Click that thing and you’re good to go!


Just to make it 100% foolproof, @Soma should click on @Antonio’s profile, not his own. Odd, but it seems that you can message yourself.


Ignacio Castaño on Twitter pointed out a new thing to me: MKS (Magic Kernel Sharp)

It’s a different type of filter kernel. I whipped up support for it in Shade. It’s way more expensive than what I currently use, which is Mitchell-Netravali, but it has some nice properties.

Linear - this is the baseline (what three.js uses)


Mitchell


MKS


The screenshots are pure albedo, that’s why they look a bit weird. But the textures are very high resolution (4k each book cover) which serves as a good test case.

MKS is actually a little softer than Mitchell, which is not surprising, as Mitchell is still a sharpening filter. However, MKS does an amazing job at removing ringing artefacts and moire patterns. You can see this on the “robot dreams” cover most prominently, as the upper part of the cover has some texture to it which results in ringing artefacts for both the linear and Mitchell filters, but MKS does an incredible job of blending it out.
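For reference, here are the two kernels as plain functions. This is a sketch under assumptions: the MKS variant shown is the 2013 one (the Magic Kernel, a quadratic B-spline, followed by a 3-tap `[-1/4, 3/2, -1/4]` sharpening step), and Mitchell uses the usual B = C = 1/3. Which MKS generation Shade uses is not stated here, and the function names are mine.

```javascript
// Magic Kernel: quadratic B-spline with support [-1.5, 1.5].
function magic_kernel(x) {
	const a = Math.abs(x);
	if (a <= 0.5) return 0.75 - a * a;
	if (a <= 1.5) return 0.5 * (a - 1.5) * (a - 1.5);
	return 0;
}

// Magic Kernel Sharp (2013 variant): Magic Kernel combined with the
// [-1/4, 3/2, -1/4] sharpening step. Support is [-2.5, 2.5].
function magic_kernel_sharp_2013(x) {
	return 1.5 * magic_kernel(x) - 0.25 * (magic_kernel(x - 1) + magic_kernel(x + 1));
}

// Mitchell-Netravali cubic, defaulting to B = C = 1/3. Support is [-2, 2].
function mitchell(x, B = 1 / 3, C = 1 / 3) {
	const a = Math.abs(x);
	if (a < 1) {
		return ((12 - 9 * B - 6 * C) * a * a * a + (-18 + 12 * B + 6 * C) * a * a + (6 - 2 * B)) / 6;
	}
	if (a < 2) {
		return ((-B - 6 * C) * a * a * a + (6 * B + 30 * C) * a * a + (-12 * B - 48 * C) * a + (8 * B + 24 * C)) / 6;
	}
	return 0;
}

// Sum of a kernel over all integer shifts; 1 means flat areas stay flat.
function partition_sum(kernel, x, radius = 4) {
	let sum = 0;
	for (let i = -radius; i <= radius; i++) sum += kernel(x + i);
	return sum;
}

console.log(magic_kernel_sharp_2013(0)); // 1.0625 (17/16)
console.log(mitchell(0));                // 8/9
console.log(partition_sum(magic_kernel_sharp_2013, 0.3));
```

MKS has a wider support than Mitchell (5 texels vs 4 per axis), which is part of why it costs more when the kernel is not split into separable passes.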

Not sure if it’s worth the cost. I might refactor my mipmap generation code to split up the kernel and make it run in reasonable time, but it’s something genuinely new to me! You live and you learn :woman_shrugging:


Spent way more time on mipmap texture filtering than is healthy. Ended up switching to MKS as the default filter for color textures.

Spent a lot of time tuning MKS specifically.

Here are screenshots for comparison:

Linear


MKS


CatmullRom


Mitchell


Wronski 2021

(10 tap kernel with MagicKernel pre-pass)



MKS does well at removing aliasing and ringing. It preserves overall image sharpness quite well, but is less aggressive than cubic spline filters.

Here is another scene with the books

Linear


MKS


CatmullRom


Mitchell


Wronski



Reworked occlusion culling architecture.

The reason being 2-fold:

  • I was using the OneSweep prefix scan algorithm on the GPU, and it doesn’t jibe with Apple silicon, which was causing horrible performance problems like stuttering and generally low FPS
  • HZB rebuilds were taking a significant chunk of overall frame time on lower-end GPUs
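For readers unfamiliar with the term, here is a CPU-side sketch of what an HZB (hierarchical Z buffer) build and lookup does in general. It is not Shade's implementation: it assumes a square power-of-two depth buffer and a standard depth convention where larger means farther (with reversed Z you would take the minimum instead), and `build_hzb` and `is_occluded` are hypothetical names.

```javascript
// Build the mip chain: every texel stores the FARTHEST depth of the 2x2 texels below it.
function build_hzb(depth, size) {
	const mips = [{ size, data: depth }];
	while (mips[mips.length - 1].size > 1) {
		const src = mips[mips.length - 1];
		const half = src.size >> 1;
		const dst = new Float32Array(half * half);
		for (let y = 0; y < half; y++) {
			for (let x = 0; x < half; x++) {
				const i = (y * 2) * src.size + x * 2;
				dst[y * half + x] = Math.max(
					src.data[i], src.data[i + 1],
					src.data[i + src.size], src.data[i + src.size + 1]
				);
			}
		}
		mips.push({ size: half, data: dst });
	}
	return mips;
}

// Conservative test: something is hidden if its nearest depth is still
// farther than the farthest occluder depth stored in the covering texel.
function is_occluded(mips, level, x, y, nearest_depth) {
	const mip = mips[level];
	return nearest_depth > mip.data[y * mip.size + x];
}

const depth = new Float32Array([
	0.1, 0.2, 0.5, 0.5,
	0.3, 0.4, 0.5, 0.5,
	0.9, 0.9, 0.7, 0.6,
	0.9, 0.9, 0.8, 0.2,
]);
const hzb = build_hzb(depth, 4);
console.log(hzb.map((m) => Array.from(m.data)));
```

Every level has to be regenerated whenever the depth buffer changes, which is where the per-frame cost comes from.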

So now the engine runs pretty well on older Macs. I’ve got 2 updated demos:

A: Full resolution version
B: Performance upscale version (60% internal resolution)

Some perf numbers

| Demo | Device | FPS | Resolution |
|---|---|---|---|
| A | Apple M1 Pro | 32 | 3456 x 2234 |
| B | Apple M1 Pro | 47 | 3456 x 2234 |
| B | GTX 1080 | 61 | 3840 x 2160 |
| A | RDNA2 iGPU | 19 | 1080 x 1080 |
| B | RDNA2 iGPU | 33 | 1080 x 1080 |

The demos feature:

  • GTAO
  • Bent Normals
  • Volumetric Lightmap (Diffuse & Specular)
  • Bloom
  • Automatic Exposure
  • 3-cascade CSM (shadows)

The scene stats:

  • Meshes: 5202
  • Materials: 32
  • Lights: 131
  • Polycount: 267,302

If you do run the demo, I would be very grateful if you could post your performance numbers :folded_hands:


Performance Uplift

Somewhat unrelated, but because of this and a few other changes the overall FPS has gone up by 10 to 15% on most scenes, with higher complexity scenes seeing more benefit.

The most notable is the Blender 3.3 splash screen scene, which is basically a torture test, I wrote about it earlier in this topic:

  • Mesh count: 374,734
  • Unique Geometries: 353
  • Total polycount on the scene: 717,869,562

It was previously running at 21 FPS, now it’s 46 FPS. I wish I could say exactly why, but I honestly have no idea what the main cause was, as I’ve made so many little improvements since then.

The upshot is that it’s ~21.74ms of frame time. That’s with shadows, volumetric light map etc (see above).

Three.js takes 3671ms per frame on this scene, which makes Shade almost 169 times faster (3671 / 21.74). Relative to before, Shade is about 2.2 times faster.
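The numbers above follow from simple FPS to frame time conversion. A quick sketch, using only the figures quoted in this post:

```javascript
// Convert frames per second to milliseconds per frame.
const fps_to_ms = (fps) => 1000 / fps;

const before_ms = fps_to_ms(21); // ≈ 47.62 ms
const after_ms = fps_to_ms(46);  // ≈ 21.74 ms
const three_js_ms = 3671;

const speedup_vs_before = before_ms / after_ms;     // ≈ 2.19
const speedup_vs_three_js = three_js_ms / after_ms; // ≈ 168.87

console.log(after_ms.toFixed(2), speedup_vs_before.toFixed(2), speedup_vs_three_js.toFixed(2));
```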


Been working on a website for Shade, compiled a comparison table. I’m not super happy with it, but I tried to keep things fair. Feedback is very welcome:

All entries reflect web-deployed (browser-based) capabilities only.

Native/desktop-only features are excluded. Data current as of April 2026.


Rendering Architecture

| Feature | Shade | three.js | Babylon.js | Unity | PlayCanvas |
|---|---|---|---|---|---|
| Rendering pipeline | GPU-driven visibility buffer | CPU-driven forward | CPU-driven forward | CPU-driven SRP | CPU-driven clustered fwd |
| Draw dispatch | GPU-resident indirect | CPU per mesh | CPU per mesh | CPU, SRP Batcher | CPU per mesh |
| Meshlet rendering | ✓ Built-in | ✗ None | ✗ None | ✗ None | ✗ None |
| Visibility buffer | ✓ Deferred visibility shading | ✗ None | ✗ None | ✗ None | ✗ None |
| FrameGraph | ✓ Engine-centric, auto VRAM aliasing | ✗ None | ~ API exists, unused internally | ✗ None | ~ Render-pass based |
| Language | JavaScript (native) | JavaScript | TypeScript | C# → WASM | JavaScript |

Culling & Scene Scale

| Feature | Shade | three.js | Babylon.js | Unity | PlayCanvas |
|---|---|---|---|---|---|
| Frustum culling | GPU, per meshlet | CPU, per mesh | CPU, per mesh | CPU, per mesh | CPU, per mesh |
| Occlusion culling | ✓ GPU HZB | ✗ None | ✗ None (raw queries only) | ~ CPU-side, baked | ✗ None |
| Culling granularity | Sub-mesh (meshlet) | Object bounding box | Object bounding box | Object bounding box | Object bounding box |
| Max meshes @ 60 FPS | Millions | Hundreds–low thousands | Thousands (instanced) | Thousands (batched) | Thousands (instanced) |
| Max active lights | Thousands (clustered) | ~5–50 (forward) | Hundreds (clustered) | Dozens–hundreds | Hundreds (clustered) |
| Instancing required | No: every object dynamic | Yes, manual | Yes, manual | Yes, manual | Yes, manual |

Shadows

| Feature | Shade | three.js | Babylon.js | Unity | PlayCanvas |
|---|---|---|---|---|---|
| Cascaded shadows | ✓ On by default | ~ Addon, manual config | ~ Manual config | ✓ Built-in | ✓ Built-in |
| Cascade blending | ✓ Cross-cascade | ✗ Hard splits | ~ Manual | ✓ Built-in | ~ Limited |
| Cascade selection | ✓ Projection-based (+50% texel density) | Distance-based | Distance-based | Distance-based | Distance-based |
| Shadow GPU culling | ✓ Same GPU pipeline | ✗ CPU-issued | ✗ CPU-issued | ✗ CPU-issued | ✗ CPU-issued |
| Ray-traced shadows | ✓ Software BVH (TLAS+BLAS) | ✗ None | ✗ None | ✗ None | ✗ None |
| Out-of-box quality | ✓ No tweaking needed | ✗ Manual biases, bounds, resolution | ~ Manual biases & bounds | ~ Some tweaking | ~ Some tweaking |

Post-Processing

| Feature | Shade | three.js | Babylon.js | Unity | PlayCanvas |
|---|---|---|---|---|---|
| Integrated stack | ✓ Full stack, production-grade | ✗ Third-party required | ~ Build-your-own (API only) | ✓ URP pipeline | ✓ CameraFrame |
| TAA | ✓ Motion vectors, disocclusion detection, variance clipping, YCoCg | ✗ FXAA / SMAA only | ~ Heavy ghosting, disabled during motion | ~ FXAA / SMAA (TAA experimental) | ~ Basic TAA |
| Temporal upscaling | ✓ TAAU (dynamic resolution) | ✗ None | ✗ None | ~ Experimental (STP) | ✗ None |
| SSAO | ✓ GTAO, temporal reprojection, à-trous spatial filter, PBR-integrated | ~ Image-level, ignores PBR AO | ~ Image-level, ignores PBR AO | ✓ Integrated | ~ Image-level |
| SSR | ✓ HiZ stochastic trace+resolve, temporal+spatial denoising, IBL energy-conserving | ~ No PBR awareness, no IBL mixing | ~ No IBL mixing, energy issues | ~ Basic | ✗ None |
| HDR Bloom | ✓ Multi-pass, Karis-filtered, spatially-stable HDR bloom | ✗ SDR source, not true HDR | ✗ SDR source, not true HDR | ✓ HDR pipeline | ~ Bloom |
| Auto exposure | ✓ Eye adaptation | ✗ None | ~ Manual | ✓ Built-in | ✗ None |
| HDR display output | ✓ Native (>100 nits) | ✗ None | ✗ None | ✗ None | ✗ None |

Transparency & Materials

| Feature | Shade | three.js | Babylon.js | Unity | PlayCanvas |
|---|---|---|---|---|---|
| OIT | ✓ MBOIT | ✗ Sort-based | ~ Depth peeling (expensive) | ✗ Sort-based | ✗ Sort-based |
| Alpha testing | Hashed (volume-preserving) | Binary cutoff | Binary cutoff | Binary cutoff | Binary cutoff |
| Material overdraw | Zero (ID buffer) | Full (forward) | Full (forward) | Full (fwd/deferred) | Full (forward) |
| Shader compilation | Single shader, instant | ~ Per-material variant | ~ Per-material variant | ~ Per-material variant | ~ Per-material variant |
| Runtime stutter | None by design | ✗ On-demand compilation, recompiles on light/feature changes | ✗ On-demand compilation, recompiles on changes | ~ Pre-warm available, stutter possible | ✗ On-demand compilation |
| Vertex compression | ✓ On-line (~40% savings) | ✗ None | ✗ None | ~ Offline only | ✗ None |

Global Illumination & Ray Tracing

| Feature | Shade | three.js | Babylon.js | Unity | PlayCanvas |
|---|---|---|---|---|---|
| 2D Lightmaps (SDR) | ✓ Via PBR AO + UV2 | ✓ External bake + UV2 | ✓ External bake + UV2 | ✓ Baked in editor | ✓ Editor bake tool |
| 3D Lightmaps (HDR) | ✓ Sparse Volumetric, SH3, full HDR | ✗ None | ✗ None | ✗ None | ✗ None |
| Lightmap specular | ✓ GGX SH convolution | ✗ Diffuse only, no angular data | ✗ Diffuse only, no angular data | ~ Limited | ✗ Diffuse only, no angular data |
| Extra UV required | ✓ No: volumetric | Yes (UV2) | Yes (UV2) | Yes (UV2) | Yes (UV2) |
| Software ray tracing | ✓ Full (TLAS+BLAS+Materials) | ✗ None | ✗ None | ✗ None | ✗ None |
| In-engine GI bake | ✓ Own RT engine | ✗ None | ✗ External tools only | ~ Editor only, not in browser | ~ Editor bake, SDR only |

Memory, Performance & Integration

| Feature | Shade | three.js | Babylon.js | Unity | PlayCanvas |
|---|---|---|---|---|---|
| Memory management | ✓ Custom allocator + pooling | ✗ Manual / GC | ~ Semi-automatic | ~ WASM heap + GC | ~ Semi-automatic |
| Resolution scaling | ✓ Dynamic + temporal upscale | ✗ Manual | ✗ Manual | ~ Limited | ✗ Manual |
| Built-in profiler | ✓ GPU timing | ✗ DevTools only | ✓ Inspector (CPU) | ~ Limited in browser | ✓ Profiler |
| Web integration | Native JS | Native JS | Native TS/JS | WASM (heavy) | Native JS |
| Bundle size | ~250 KB gz | ~150 KB core, ~400 KB typical | ~400 KB gz | 10–50+ MB (WASM) | ~300 KB gz |
| License | Commercial | MIT | Apache 2.0 | Commercial | MIT |

Summary

| Strength area | Shade | three.js | Babylon.js | Unity | PlayCanvas |
|---|---|---|---|---|---|
| Best for | Massive scenes, AAA-quality web rendering | Prototyping, small-medium scenes, ecosystem | Full-featured 3D apps, tooling | Porting native games to web | Lightweight collaborative 3D apps |
| Unique advantage | GPU-driven pipeline with RT, TAAU, HZB: unmatched scene scale and visual quality on web | Massive community, simple API | Rich tooling, Node Material Editor | Full game engine feature set | In-browser editor, Gaussian splatting |
| Key limitation | No animation/skinning yet, no editor | No occlusion culling, no built-in PP, stutter-prone, limited scene scale | No GPU-driven pipeline, no RT, stutter-prone | Huge bundle, WASM overhead, experimental WebGPU | No occlusion culling, no RT, no SSR |

First of all, this is amazing stuff: Kaze playtest 2026 April

I literally just saw this, looks incredible. Though I’m assuming some of your assets are AI, but even that is hard to tell. Good stuff. Did you try implementing GI in this game too?


Hey Kitanga,

Sorry for the confusion. The video in question is made in Unity.

I run a small gamedev studio and we’ve been working on an ARPG for the past 4+ years

There are a few custom shaders there, but it’s mostly just Unity HDRP

None of the assets are AI generated though :sweat_smile: They are licensed assets mostly, as we have no 3d artists in the team


So yeah, that’s not Shade. I guess Shade would look somewhat similar if it was, because the techniques are similar under the hood and the post-process stack is similar. But it’s not Shade, it’s Unity 6.0 with HDRP.


One might ask 2 questions:

  1. Why not use Shade for your own game?

Shade as a project started almost a full 2 years after the game project started.

  2. Why not use Meep for your own game?

I judged that Unity would be much easier and faster to work with, and we were aiming primarily for consoles.

In hindsight - I probably would go with Meep, because Unity doesn’t give you nearly as much as it claims on paper.

Anyway, a huge tangent. Glad you liked the gameplay video though.


As for GI specifically in the video, here’s what the shading stack looks like:

  • SSGI
  • SSR
  • GTAO
  • TAA
  • IBL

But again - it’s all Unity’s built-in stack
