We were doing memory testing last week on phones...
1. Don't just allocate. Fill. Reserved virtual pages don't consume any RAM. You need to commit the pages.
2. Filling with zero = perfect memory compression. Fill with random data to avoid memory compression.
Memory is hard :)
I'm a little late to the party on this one since it's from January, but I just read this great blog post by Jure Triglav walking through implementing surfel-based global illumination. It's got a bunch of really cool interactive toys/visualizers!
https://t.co/T5fsPEVx2N
At 6 bytes per splat, 1920x1080 (2Mpix) is just 12MB of read bandwidth. Claybook's SDF ray-tracer (1GB multilevel volume) consumed 8MB of read bandwidth for the distance field in my GDC 2018 benchmark slides.
Splats don't have perfect occlusion culling, so there's some overhead of course. I did my math using tiny 16 splat clusters (a 4x4 screen region in the perfect case). So we should have culling granularity close to hardware HiZ. Much better than Nanite's 128 triangle clusters.
This is one of the reasons we developed GPU-driven rendering at Ubisoft (for Rainbow Six Siege and Assassin's Creed: Unity). We spoke about it in SIGGRAPH 2015. GPU->CPU roundtrip latency is awful. Have to make all visibility decisions on the GPU side.
https://t.co/0rKJP6lNj1
The recorded EUROGRAPHICS 2026 talk for our wave tracing paper. Very honored that this paper received the Günter Enderle Best Paper Honorable Mention award.
Collaboration with Matt Pharr (NVIDIA).
Fun fact: DOOM: The Dark Ages uses Ray-Traced Global Illumination. If it had used older baked GI tech instead, the lighting data could have required up to 110GB and taken up to 68 days to bake 👀
From: https://t.co/Xo74fE97Mz
Had a great time talking to @SebAaltonen about his work! Rendering technology, back then & now, Hype Hype works, as well as "No Graphics API"
Sebastian, thank you so much for the time and effort! :)
https://t.co/LJhzMVbcnh
New Paper🙂
Nanite has shown that small triangles can be rendered fast in compute, we're exploring how fast for large meshes with up to 18.9 billion triangles, without the need to precompute LOD structures.
Paper: https://t.co/F9u4xE6Na3
Source: https://t.co/1LgdSHVi7i
Let's check NVidia's RTXDI SDK. Huge update v3.0.0, published Mar 10: a month before the paper. Highlights include "ReSTIR PT resampling functions", that doesn't sound super specific.
But this SDK is mostly samples and docs, the runtime is in a separate project RTXDI-Library.
Vulkan just got a new descriptor heap extension. Another step towards the right direction: https://t.co/0QDsD6qYJ2
It makes push constants a struct in memory instead of separate API calls, which is a super nice improvement. Not exactly a GPU pointer to root data, but close.
Super nice that Nvidia open sourced Lyra 2.0. Github repo and weights in HuggingFace.
Seems much more usable than Genie 3. Seems to have a real 3d understanding of the environment, and doesn't drift.
What if your Unreal Engine 5 renders looked like living oil paintings? 🎨
Elena Felici just released a full free course on building a procedural painterly brush-stroke shader in UE5
Free course + project files linked 👇
#UnrealEngine5#UE5#StylizedArt#ShaderArt#NPR#3DArt
New Shade (WebGPU engine) demo:
https://t.co/ZjkK3pGD17
Emphasis is on performance improvements, especially for Macs.
The engine is intended for high-end GPUs, but I hope to include as many devices, so it at least runs OK for most people.
Features:
- Global Illumination via Sparse Volumetric Lightmaps (SH3)
- Ambient Occlusion + Bent Normals
- Clustered Lighting
- Automatic Exposure
- HDR display support
- Realtime Cascaded Shadow Maps (3 cascades)
- HDR bloom with temporal stabilization
- TAAU (60% upscale in this demo)
- HZB occlusion culling (including shadow views)
- Meshlet-based rendering
- GPU-driven draw
Optimized SSAO (GTAO) shader: 1.10ms -> 0.758ms
New version didn't have fp16 optimized inner loop and did noise lookups inside the inner loop. Now I apply noise to line sample offset outside the loop. This is fine since only one sample is used (the max horizon).
Occlusion culling shadow cascades is a clear win for large scenes with lots of depth complexity, especially if the sun angle is low. You can also plot receiver pixels to shadow map with atomics (pack 2x16bit min/max to 32-bit) to know the receiver range -> cheap early out.
Bindless with 64-bit pointers is just 1 trip to memory. Don't need to fetch a buffer descriptor from the descriptor heap (using a 32-bit index). That's how CUDA and Metal operate. Available also in GLSL using Vulkan BDA extension. DX12 has no pointer support.