GPU Caching

GPU Cache Hierarchy

GPU caches reduce latency and increase throughput by buffering memory accesses in cache lines. Maximizing the cache hit rate is critical for performance, which in turn gives users a smooth experience. Given the scale of parallelism (usually thousands of threads), the GPU pipeline has to be designed around the type of workload it runs.

Modern GPUs have a multi-level cache hierarchy. L1 (per-core) caches usually consist of a private instruction cache (icache) and data cache (dcache), and are generally much smaller per thread than on CPUs. L2 (device-wide) caches are shared across all cores and can provide a coherent view when required.
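To make the "much smaller per thread" point concrete, the sketch below divides an L1 capacity by the number of threads resident on a core. The function name `l1_bytes_per_thread` is hypothetical, and the figures in the note after the code are illustrative assumptions rather than measurements of any specific device.

```rust
/// Average L1 capacity available to each resident thread, in bytes.
/// Returns `None` when no threads are resident, to avoid dividing by zero.
fn l1_bytes_per_thread(l1_bytes: u64, resident_threads: u64) -> Option<u64> {
    l1_bytes.checked_div(resident_threads)
}
```

Assuming a GPU core with 128 KiB of L1 shared by 2048 resident threads, `l1_bytes_per_thread(128 * 1024, 2048)` gives 64 bytes per thread. Assuming a CPU core with 32 KiB of L1 shared by 2 hardware threads, `l1_bytes_per_thread(32 * 1024, 2)` gives 16384 bytes per thread.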

For efficiency, GPU caches are typically not kept coherent across cores; explicit memory barriers and synchronization primitives are needed to ensure visibility between dispatches or devices.

Using more information: VRS (Variable rate shading)

Heatmap of shading granularity
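One way to picture such a heatmap is to give each tile a coarser shading rate the farther it lies from a point of interest. The following is only a sketch under stated assumptions: the `ShadingRate` enum, the two distance thresholds, and both function names are hypothetical and do not correspond to any particular graphics API.

```rust
/// Coarseness levels for this sketch (hypothetical names, not an API enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ShadingRate {
    Full1x1,   // one shading invocation per pixel
    Coarse2x2, // one invocation per 2x2 pixel block
    Coarse4x4, // one invocation per 4x4 pixel block
}

/// Picks a rate from the distance between a tile centre and the focus point.
/// `inner_radius` and `outer_radius` are assumed thresholds in pixels.
fn shading_rate_for_tile(
    tile_center: (f32, f32),
    focus: (f32, f32),
    inner_radius: f32,
    outer_radius: f32,
) -> ShadingRate {
    let dx = tile_center.0 - focus.0;
    let dy = tile_center.1 - focus.1;
    let dist = (dx * dx + dy * dy).sqrt();
    if dist <= inner_radius {
        ShadingRate::Full1x1
    } else if dist <= outer_radius {
        ShadingRate::Coarse2x2
    } else {
        ShadingRate::Coarse4x4
    }
}

/// Builds a row-major heatmap with one entry per tile.
fn shading_rate_heatmap(
    cols: u32,
    rows: u32,
    tile_size: f32,
    focus: (f32, f32),
    inner_radius: f32,
    outer_radius: f32,
) -> Vec<ShadingRate> {
    let mut map = Vec::with_capacity((cols * rows) as usize);
    for row in 0..rows {
        for col in 0..cols {
            // Centre of the tile in pixel coordinates.
            let center = (
                (col as f32 + 0.5) * tile_size,
                (row as f32 + 0.5) * tile_size,
            );
            map.push(shading_rate_for_tile(center, focus, inner_radius, outer_radius));
        }
    }
    map
}
```

The hard thresholds keep the sketch short; a real heatmap might blend between rates or use more levels.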

Tracing your eye and perception

#[repr(C)]
#[derive(Copy, Clone)]
struct GazeConstants {
  gaze_x: i32,
  gaze_y: i32,
}

// Tile definition
#[repr(C)]
#[derive(Copy, Clone)]
struct Tile {
  x: u32,
  y: u32,
  score: f32,
}

fn compute_priority(tile_x: u32, tile_y: u32, mouse_x: i32, mouse_y: i32) -> f32 {
  let dx = tile_x as f32 - mouse_x as f32;
  let dy = tile_y as f32 - mouse_y as f32;
  -((dx * dx + dy * dy).sqrt()) // Higher score when closer
}

Variable rate compression based on the scene gist / Colorization of the perception of scene gist

Tiles should be sorted before launching threads and recording the command buffers.

// Score and sort tiles by proximity to mouse before dispatching threads
fn sort_tiles_by_priority(tiles: &mut Vec<Tile>, mouse_x: i32, mouse_y: i32) {
  for tile in tiles.iter_mut() {
    tile.score = compute_priority(
      tile.x + 8, // Center of tile (TILE_SIZE / 2)
      tile.y + 8,
      mouse_x,
      mouse_y,
    );
  }
  // Descending by score (closest tile first); total_cmp never panics, even on NaN
  tiles.sort_by(|a, b| b.score.total_cmp(&a.score));
}
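The scoring and sorting step above can be exercised end to end with a small self-contained sketch. It repeats a minimal `Tile` type so that it compiles on its own. The `TILE_SIZE` value of 16 is an assumption inferred from the `+ 8` centre offset, and `build_tile_grid` and `sort_by_focus` are hypothetical helper names.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Tile {
    x: u32,
    y: u32,
    score: f32,
}

// Assumed tile edge length in pixels (implied by the `+ 8` centre offset).
const TILE_SIZE: u32 = 16;

/// Covers a `width` x `height` surface with tiles in row-major order.
fn build_tile_grid(width: u32, height: u32) -> Vec<Tile> {
    let mut tiles = Vec::new();
    for y in (0..height).step_by(TILE_SIZE as usize) {
        for x in (0..width).step_by(TILE_SIZE as usize) {
            tiles.push(Tile { x, y, score: 0.0 });
        }
    }
    tiles
}

/// Scores each tile by negative distance from its centre to the focus point,
/// then sorts so that the closest tile comes first.
fn sort_by_focus(tiles: &mut [Tile], focus_x: i32, focus_y: i32) {
    let half = (TILE_SIZE / 2) as f32;
    for tile in tiles.iter_mut() {
        let dx = tile.x as f32 + half - focus_x as f32;
        let dy = tile.y as f32 + half - focus_y as f32;
        tile.score = -(dx * dx + dy * dy).sqrt();
    }
    tiles.sort_by(|a, b| b.score.total_cmp(&a.score));
}
```

For a 64x32 surface this produces eight tiles. With the focus at (40, 8), the tile at (32, 0) is processed first because its centre coincides with the focus, and the tile at (0, 16) is processed last.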

Angular cone test using gaze information

float depth = LoadCameraDepth(i.positionCS.xy);
PositionInputs posInput = GetPositionInput(
    i.positionCS.xy,
    _ScreenSize.zw,
    depth,
    UNITY_MATRIX_I_VP,
    UNITY_MATRIX_V);

float3 posWS = posInput.positionWS;
float3 posVS = TransformWorldToView(posWS);

// 1) Angular cone test: pixel must lie within gaze half-angle
float inside = InsideCone(posVS, _EyePosVS, normalize(_GazeDirVS), _ConeCosHalf);

// 2) Radial fade near the expected base plane
float targetZ = _BaseCenterVS.z;
float t = (targetZ - _EyePosVS.z) / max(1e-5, (posVS.z - _EyePosVS.z));
t = saturate(t);
float3 projVS = lerp(_EyePosVS, posVS, t);

float2 planar = (projVS - _BaseCenterVS).xy;
float dist = length(planar);

float edge0 = _BaseRadius - _FeatherMeters;
float edge1 = _BaseRadius + _FeatherMeters;
float radial = 1.0 - smoothstep(edge0, edge1, dist);

// Combine angular and radial masks
float mask = inside * radial;

float3 mono = Luma(color.rgb);
color.rgb = lerp(color.rgb, mono, mask);

return color;
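The shader excerpt calls `InsideCone` without showing its body. The sketch below is one plausible CPU-side reading of the two masks, written in Rust so it can be checked in isolation. It assumes `InsideCone` returns 1.0 when the angle between the gaze axis and the direction to the point is within the half-angle; that behaviour, and the helper names `inside_cone`, `smoothstep` and `radial_fade`, are assumptions rather than the original implementation.

```rust
type Vec3 = [f32; 3];

fn sub(a: Vec3, b: Vec3) -> Vec3 {
    [a[0] - b[0], a[1] - b[1], a[2] - b[2]]
}

fn dot(a: Vec3, b: Vec3) -> f32 {
    a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
}

/// Returns 1.0 if `point` lies inside the cone with apex `apex`, unit-length
/// axis `axis`, and a half-angle whose cosine is `cos_half`; otherwise 0.0.
fn inside_cone(point: Vec3, apex: Vec3, axis: Vec3, cos_half: f32) -> f32 {
    let to_point = sub(point, apex);
    let len = dot(to_point, to_point).sqrt();
    if len < 1e-6 {
        return 1.0; // the apex itself is treated as inside
    }
    // Cosine of the angle between the axis and the direction to the point.
    let cos_angle = dot(to_point, axis) / len;
    if cos_angle >= cos_half { 1.0 } else { 0.0 }
}

/// Hermite smoothstep; assumes `edge0 < edge1`.
fn smoothstep(edge0: f32, edge1: f32, x: f32) -> f32 {
    let t = ((x - edge0) / (edge1 - edge0)).clamp(0.0, 1.0);
    t * t * (3.0 - 2.0 * t)
}

/// 1.0 well inside `base_radius`, 0.0 well outside, with a soft edge that is
/// `2 * feather` wide. Assumes `feather > 0`.
fn radial_fade(dist: f32, base_radius: f32, feather: f32) -> f32 {
    1.0 - smoothstep(base_radius - feather, base_radius + feather, dist)
}
```

Multiplying `inside_cone(...)` by `radial_fade(...)` reproduces the combined mask: a pixel is desaturated only when it passes both the angular test and the radial fade.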

Simple box blurring (3x3)

#version 450

layout(location = 0) in vec2 texCoord; // Input texture coordinate
layout(binding = 0) uniform sampler2D inputImage; // Input texture

layout(location = 0) out vec4 outColor;

void main() {
    vec2 texelSize = 1.0 / vec2(textureSize(inputImage, 0));
    vec4 sum = vec4(0.0);

    for (int dx = -1; dx <= 1; ++dx) {
        for (int dy = -1; dy <= 1; ++dy) {
            vec2 offset = vec2(dx, dy) * texelSize;
            sum += texture(inputImage, texCoord + offset);
        }
    }
    outColor = sum / 9.0;
}
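For checking the shader's output, a CPU reference of the same 3x3 average can be handy. The sketch below works on a single-channel, row-major `f32` image and assumes clamp-to-edge addressing at the borders. In the shader, border behaviour depends on the sampler configuration, which the source does not show. The function name `box_blur_3x3` is hypothetical.

```rust
/// 3x3 box blur over a row-major, single-channel image.
/// Out-of-range neighbours are clamped to the nearest edge pixel.
fn box_blur_3x3(src: &[f32], width: usize, height: usize) -> Vec<f32> {
    assert_eq!(src.len(), width * height, "buffer size must match dimensions");
    let mut dst = vec![0.0; src.len()];
    for y in 0..height {
        for x in 0..width {
            let mut sum = 0.0;
            for dy in -1i64..=1 {
                for dx in -1i64..=1 {
                    // Clamp neighbour coordinates into the image.
                    let sx = (x as i64 + dx).clamp(0, width as i64 - 1) as usize;
                    let sy = (y as i64 + dy).clamp(0, height as i64 - 1) as usize;
                    sum += src[sy * width + sx];
                }
            }
            dst[y * width + x] = sum / 9.0;
        }
    }
    dst
}
```

A uniform image is left unchanged by this filter, and a single bright pixel is spread evenly over its 3x3 neighbourhood.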

Simple box blurring