Commit bb7311e

Switch edge buffers to IMemoryOwner
1 parent 5a3b89a commit bb7311e

2 files changed

Lines changed: 202 additions & 147 deletions

src/ImageSharp.Drawing.WebGPU/WEBGPU_BACKEND_PROCESS.md

Lines changed: 116 additions & 52 deletions
@@ -7,7 +7,7 @@ This document describes the current runtime flow used by `WebGPUDrawingBackend`
  ```text
  DrawingCanvasBatcher.Flush()
  -> IDrawingBackend.FlushCompositions(scene)
- -> capability checks first
+ -> capability checks
  -> TryGetCompositeTextureFormat<TPixel>
  -> AreAllCompositionBrushesSupported<TPixel>
  -> if unsupported: scene-scoped fallback (DefaultDrawingBackend)
@@ -18,69 +18,133 @@ DrawingCanvasBatcher.Flush()
  -> compute scene command count + composition bounds
  -> if no visible commands: return
  -> acquire one WebGPUFlushContext for the scene
- -> ensure command encoder (single encoder reused for the scene)
- -> resolve source backdrop texture view for composition bounds
- -> non-readback path: sample target view directly
- -> readback path: copy target region into transient source texture and sample that
- -> allocate transient output texture for composition
- -> build coverage texture from prepared geometry
- -> flatten prepared path geometry
- -> upload line/path/tile/segment buffers
- -> run compute sequence:
- 1) PathCountSetup
- 2) PathCount
- 3) Backdrop
- 4) SegmentAlloc
- 5) PathTilingSetup
- 6) PathTiling
- 7) CoverageFine
- -> build one flush-scoped composite command parameter stream from prepared batches
- -> run composite dispatch sequence:
- 1) PreparedCompositeBinning
- 2) PreparedCompositeTileCount
- 3) PreparedCompositeTilePrefix
- 4) PreparedCompositeTileFill
- 5) PreparedCompositeFine
- -> solid brush uses Color.ToScaledVector4()
- -> image brush samples Image<TPixel> texture directly
- -> writes composed pixels to one transient output texture
- -> copy output texture bounds back into the destination target once
- -> finalize once
- -> non-readback: finish encoder + single queue submit
- -> readback: encode texture->buffer copy, finish encoder + single queue submit, map/copy once
- -> on any GPU failure path: scene-scoped fallback (DefaultDrawingBackend)
+ -> TryRenderPreparedFlush
+ -> ensure command encoder (single encoder reused for the scene)
+ -> use target texture view directly as backdrop source (no copy)
+ -> allocate transient output texture for composition bounds
+ -> deduplicate coverage definitions across batches via CoverageDefinitionIdentity
+ -> TryCreateEdgeBuffer (CPU-side edge preparation)
+ -> for each unique coverage definition:
+ -> path.Flatten() to iterate flattened vertices
+ -> build fixed-point (24.8) GpuEdge via MemoryAllocator (IMemoryOwner<GpuEdge>)
+ -> compute min_row/max_row per edge, clamped to interest
+ -> build CSR (Compressed Sparse Row) band-to-edge mapping:
+ 1) count edges per 16-row band
+ 2) exclusive prefix sum over band counts
+ 3) scatter edge indices into CSR index array
+ -> merge per-definition edges into single buffer with metadata stamps
+ -> single-definition fast path: stamp in-place
+ -> multi-definition: merge via Span.CopyTo
+ -> upload edge buffer via dirty-range detection (word-by-word diff)
+ -> upload merged CSR offsets and indices via QueueWriteBuffer
+ -> TryDispatchPreparedCompositeCommands
+ -> build per-command PreparedCompositeParameters (destination, edge placement,
+ brush type/color/region, blend mode, composition mode)
+ -> upload parameters + dispatch config via QueueWriteBuffer
+ -> single compute dispatch: CompositeComputeShader
+ -> workgroup size: 16x16 (one tile per workgroup)
+ -> dispatched as (tileCountX, tileCountY, 1)
+ -> each workgroup:
+ -> loads backdrop pixel from target texture
+ -> for each command overlapping this tile:
+ -> clears workgroup shared memory (tile_cover, tile_area, tile_start_cover)
+ -> cooperatively rasterizes edges from CSR bands using fixed-point scanline math
+ -> X-range spatial filter: edges left of tile only update start_cover
+ -> barrier, then each thread accumulates its coverage from shared memory
+ -> applies fill rule (non-zero or even-odd)
+ -> samples brush (solid color or image texture)
+ -> composes pixel using Porter-Duff alpha composition + color blend mode
+ -> writes final pixel to output texture
+ -> destination writeback:
+ -> NativeSurface: copy output texture region into target texture
+ -> CPU Region: set ReadbackSourceOverride to output texture (skip extra copy)
+ -> TryFinalizeFlush
+ -> NativeSurface: finish encoder + single QueueSubmit (non-blocking)
+ -> CPU Region: encode texture->buffer copy, finish encoder, QueueSubmit,
+ synchronous BufferMapAsync + poll wait, copy mapped bytes to CPU region
+ -> on any GPU failure: scene-scoped fallback (DefaultDrawingBackend)
  ```

+ ## GPU Buffer Layout
+
+ ### Edge Buffer (`coverage-aggregated-edges`)
+
+ Each edge is a 32-byte `GpuEdge` struct (sequential layout):
+
+ | Field | Type | Description |
+ |---|---|---|
+ | X0, Y0 | i32 | Start point in 24.8 fixed-point |
+ | X1, Y1 | i32 | End point in 24.8 fixed-point |
+ | MinRow | i32 | First pixel row touched (clamped to interest) |
+ | MaxRow | i32 | Last pixel row touched (clamped to interest) |
+ | CsrBandOffset | u32 | Start index into CSR offsets for this definition |
+ | DefinitionEdgeStart | u32 | Edge index offset for this definition in merged buffer |
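
Reviewer sketch (not part of the commit): the 32-byte layout in the table can be modeled as eight little-endian 32-bit fields. The snippet below is a minimal illustration in Python, assuming the field order X0, Y0, X1, Y1, MinRow, MaxRow, CsrBandOffset, DefinitionEdgeStart and round-to-nearest conversion to 24.8 fixed-point. `to_fixed`, `pack_edge`, and the `interest_top`/`interest_bottom` parameters are hypothetical names; the actual C# struct and its rounding may differ.

```python
import struct

FIXED_SHIFT = 8                # 24.8 fixed-point: 8 fractional bits
FIXED_ONE = 1 << FIXED_SHIFT   # 256 units per pixel

# Six i32 fields followed by two u32 fields, little-endian: 8 * 4 = 32 bytes.
# The field order is an assumption taken from the table's row order.
GPU_EDGE = struct.Struct("<6i2I")


def to_fixed(value: float) -> int:
    """Convert a pixel coordinate to 24.8 fixed-point (assumed rounding mode)."""
    return int(round(value * FIXED_ONE))


def pack_edge(x0, y0, x1, y1, interest_top, interest_bottom,
              csr_band_offset=0, definition_edge_start=0) -> bytes:
    """Pack one edge; rows are clamped to the inclusive interest range."""
    fx0, fy0, fx1, fy1 = (to_fixed(v) for v in (x0, y0, x1, y1))
    # Shifting out the fractional bits yields the pixel row of each endpoint.
    min_row = max(min(fy0, fy1) >> FIXED_SHIFT, interest_top)
    max_row = min(max(fy0, fy1) >> FIXED_SHIFT, interest_bottom)
    return GPU_EDGE.pack(fx0, fy0, fx1, fy1, min_row, max_row,
                         csr_band_offset, definition_edge_start)
```

For example, an edge from (1.5, 2.25) to (3.0, 20.75) with an interest range of rows 0 to 15 packs to 32 bytes with `MinRow` 2 and `MaxRow` clamped to 15.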
+
+ ### CSR Buffers
+
+ - `csr-offsets`: `array<u32>` — per-band prefix sum. `offsets[band]..offsets[band+1]` gives the range of edge indices for that 16-row band.
+ - `csr-indices`: `array<u32>` — edge indices within each band, ordered by band.
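
Reviewer sketch (not part of the commit): the count, exclusive prefix sum, and scatter passes that produce these two arrays can be illustrated as follows. This is a minimal Python model, assuming each edge is described by an inclusive `(min_row, max_row)` pair relative to the top of the interest rectangle and that an edge is listed in every 16-row band it touches. `build_csr` is a hypothetical name, not the backend's API.

```python
BAND_HEIGHT = 16  # rows per band, as stated in the buffer description


def build_csr(edge_rows, band_count):
    """Build (offsets, indices) from a list of inclusive (min_row, max_row) pairs."""
    def bands(min_row, max_row):
        return range(min_row // BAND_HEIGHT, max_row // BAND_HEIGHT + 1)

    # Pass 1: count how many edges touch each band.
    counts = [0] * band_count
    for min_row, max_row in edge_rows:
        for band in bands(min_row, max_row):
            counts[band] += 1

    # Pass 2: exclusive prefix sum; offsets has band_count + 1 entries so that
    # offsets[band]..offsets[band + 1] delimits the band's slice of indices.
    offsets = [0] * (band_count + 1)
    for band in range(band_count):
        offsets[band + 1] = offsets[band] + counts[band]

    # Pass 3: scatter edge indices into their band slices.
    indices = [0] * offsets[band_count]
    cursor = offsets[:band_count]  # next free slot per band (a copy)
    for edge_index, (min_row, max_row) in enumerate(edge_rows):
        for band in bands(min_row, max_row):
            indices[cursor[band]] = edge_index
            cursor[band] += 1
    return offsets, indices
```

A consumer then reads `indices[offsets[band]:offsets[band + 1]]` to visit only the edges that touch a given band.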
+
+ ### Command Parameters
+
+ Each `PreparedCompositeParameters` struct contains destination rectangle, edge placement (start, fill rule, CSR offsets start, band count), brush configuration, blend/composition mode, and blend percentage.
+
+ ### Dispatch Config
+
+ `PreparedCompositeDispatchConfig` contains target dimensions, tile counts, source/output origins, and command count.
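
Reviewer sketch (not part of the commit): with one 16x16 workgroup per tile, the tile counts are presumably a ceiling division of the covered extent by the tile size. The snippet below shows that arithmetic in Python. `dispatch_size` is a hypothetical helper, and whether the real config derives its counts from the target dimensions or from the composition bounds is not stated in the document.

```python
TILE_SIZE = 16  # matches the 16x16 workgroup size in the flow description


def dispatch_size(width: int, height: int):
    """Return (tileCountX, tileCountY, 1) covering a width x height extent."""
    tiles_x = (width + TILE_SIZE - 1) // TILE_SIZE   # ceiling division
    tiles_y = (height + TILE_SIZE - 1) // TILE_SIZE
    return tiles_x, tiles_y, 1
```

Under this assumption, a 7200x4800 extent needs 450 by 300 workgroups, and partially covered tiles at the right and bottom edges still get a full workgroup.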
+
+ ## Shader Bindings (CompositeComputeShader)
+
+ | Binding | Type | Description |
+ |---|---|---|
+ | 0 | `storage, read` | Edge buffer (`array<Edge>`) |
+ | 1 | `texture_2d` | Backdrop texture (target) |
+ | 2 | `texture_2d` | Brush texture (image brush or same as backdrop) |
+ | 3 | `texture_storage_2d, write` | Output texture |
+ | 4 | `storage, read` | Command parameters (`array<Params>`) |
+ | 5 | `uniform` | Dispatch config |
+ | 6 | `storage, read` | CSR offsets (`array<u32>`) |
+ | 7 | `storage, read` | CSR indices (`array<u32>`) |
+
  ## Context and Resource Lifetime

- - `WebGPUFlushContext` is created once per `FlushCompositions` execution.
- - The same command encoder is reused across all GPU passes in that flush.
- - Transient textures/buffers/bind-groups are tracked in the flush context and released on dispose.
- - Source image texture views are cached per flush context to avoid duplicate uploads.
+ - `WebGPUFlushContext` is created once per `FlushCompositions` execution and disposed at the end.
+ - The same command encoder is reused across all GPU operations in that flush.
+ - Transient textures, texture views, buffers, and bind groups are tracked in the flush context and released on dispose.
+ - Source image texture views are cached within the flush context to avoid duplicate uploads.
+ - CPU-side edge geometry (`IMemoryOwner<GpuEdge>`) is allocated via `MemoryAllocator` and disposed within the flush.
+ - Shared GPU buffers (edge buffer, CSR buffers, params buffer, dispatch config buffer) are managed by `DeviceState` with grow-only reuse across flushes.
+ - Edge upload uses dirty-range detection: compares current data word-by-word against a cached copy, uploading only the changed byte range via `QueueWriteBuffer`.
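
Reviewer sketch (not part of the commit): the dirty-range detection in the last bullet can be pictured as finding the first and last 32-bit word that differs from the cached copy, then uploading the byte span between them. The Python below is a minimal model under that reading. `dirty_byte_range` is a hypothetical name, and treating words beyond the end of a shorter cache as changed is an assumption about how buffer growth is handled.

```python
WORD = 4  # bytes per 32-bit word


def dirty_byte_range(cached: bytes, current: bytes):
    """Return (offset, length) of the changed span in bytes, or None if identical."""
    if len(current) % WORD:
        raise ValueError("buffer length must be a multiple of 4 bytes")
    word_count = len(current) // WORD

    def differs(index: int) -> bool:
        start = index * WORD
        # Words past the end of the cached copy count as changed (assumption).
        return (start + WORD > len(cached)
                or cached[start:start + WORD] != current[start:start + WORD])

    first = next((i for i in range(word_count) if differs(i)), None)
    if first is None:
        return None  # nothing to upload
    last = next(i for i in range(word_count - 1, first - 1, -1) if differs(i))
    return first * WORD, (last - first + 1) * WORD
```

The span is a single contiguous range, so unchanged words between two changed ones are uploaded as well; that keeps the upload to one write call at the cost of some redundant bytes.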

- ## Destination Writeback and Flush Count
+ ## Destination Writeback

  - `FlushCompositions` performs one command-buffer submission (`QueueSubmit`) per scene flush.
- - Destination writeback to the render target is one copy from the fine output texture into composition bounds.
- - No destination storage init/blit pass is used in the active flush path.
- - CPU-region targets perform one additional texture->buffer copy and one map/read after the single submit.
+ - NativeSurface targets: one GPU-side `CommandEncoderCopyTextureToTexture` from output into the target at composition bounds. No CPU stall.
+ - CPU Region targets: readback from the output texture directly (skipping the output-to-target copy). Uses `CommandEncoderCopyTextureToBuffer`, `QueueSubmit`, synchronous `BufferMapAsync` with device polling, then copies mapped bytes to the CPU `Buffer2DRegion<TPixel>`.

  ## Fallback Behavior

- Fallback is scene-scoped:
+ Fallback is scene-scoped and triggered when:
+ - The pixel format has no supported WebGPU texture format mapping.
+ - Any command uses an unsupported brush type (only `SolidBrush` and `ImageBrush` are GPU-composable).
+ - Any GPU operation fails during the flush.
+
+ Fallback path:
+ - If target exposes a CPU region: run `DefaultDrawingBackend.FlushCompositions(...)` directly.
+ - If target is native-surface only: rent CPU staging frame, run fallback on staging, upload staging pixels back to native target texture.
+
+ ## Shader Source
+
+ `CompositeComputeShader` generates WGSL source per target texture format at runtime, substituting format-specific template tokens for texel decode/encode, backdrop/brush load, and output store. Generated source is cached by `TextureFormat` as null-terminated UTF-8 bytes.
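
Reviewer sketch (not part of the commit): the generate-then-cache behaviour described here can be modeled as token substitution into a WGSL template, with the result stored per format as null-terminated UTF-8. The Python below is only an illustration. The token name `__STORAGE_FORMAT__`, the one-line template, and `get_shader_source` are hypothetical; the real template substitutes several tokens (decode/encode, load, store) that the document does not spell out.

```python
# Hypothetical one-line template; the real shader is far larger.
_TEMPLATE = (
    "@group(0) @binding(3) var output_texture: "
    "texture_storage_2d<__STORAGE_FORMAT__, write>;\n"
)

_cache: dict[str, bytes] = {}  # keyed by texture format name


def get_shader_source(texture_format: str) -> bytes:
    """Return cached WGSL bytes for a format, generating them on first use."""
    cached = _cache.get(texture_format)
    if cached is None:
        source = _TEMPLATE.replace("__STORAGE_FORMAT__", texture_format)
        # Append the null terminator once, at generation time.
        cached = source.encode("utf-8") + b"\0"
        _cache[texture_format] = cached
    return cached
```

Caching the terminated bytes rather than the string means repeated flushes for the same format hand the same buffer to the native API without re-encoding.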

- - if target exposes a CPU region:
- - run `DefaultDrawingBackend.FlushCompositions(...)` directly
- - if target is native-surface only:
- - rent CPU staging frame
- - run `DefaultDrawingBackend.FlushCompositions(...)` on staging
- - upload staging pixels back to native target texture
+ The following static WGSL shaders exist for the legacy CSR GPU pipeline but are not used in the current dispatch path (CSR is computed on CPU):
+ - `CsrCountComputeShader`, `CsrScatterComputeShader`
+ - `CsrPrefixLocalComputeShader`, `CsrPrefixBlockScanComputeShader`, `CsrPrefixPropagateComputeShader`

- ## Shader Source and Null Terminator
+ ## Performance Characteristics

- Static WGSL shaders are stored as null-terminated UTF-8 bytes (`U+0000` terminator required at call site), including:
+ Coverage rasterization and compositing are fused into a single compute dispatch. Each 16x16 tile workgroup computes coverage inline using a fixed-point scanline rasterizer ported from `DefaultRasterizer`, operating on workgroup shared memory with atomic accumulation. This eliminates the coverage texture, its allocation, write/read bandwidth, and the pass barrier that a separate coverage dispatch would require.
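
Reviewer sketch (not part of the commit): once a thread has accumulated its signed coverage, the fill-rule step named in the flow (non-zero or even-odd) maps that value to an alpha in the range 0 to 1. The Python below shows the conventional mapping used by accumulation rasterizers, with coverage normalized so that 1.0 means one full winding. `apply_fill_rule` is a hypothetical name, and the shader works in fixed-point rather than floats, so this only illustrates the rule.

```python
def apply_fill_rule(winding: float, even_odd: bool) -> float:
    """Map accumulated signed coverage to alpha in [0, 1]."""
    magnitude = abs(winding)
    if even_odd:
        # Fold into a triangle wave with period 2: 0 -> 0, 1 -> 1, 2 -> 0.
        folded = magnitude % 2.0
        return 2.0 - folded if folded > 1.0 else folded
    # Non-zero: any winding of magnitude >= 1 is fully covered.
    return min(magnitude, 1.0)
```

A pixel covered twice in the same direction is therefore opaque under non-zero and empty under even-odd, while a partially covered edge pixel keeps its fractional value under both rules.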

- - coverage shaders: `PathCountSetup`, `PathCount`, `Backdrop`, `SegmentAlloc`, `PathTilingSetup`, `PathTiling`, `CoverageFine`
- - prepared-composite shaders: `PreparedCompositeBinning`, `PreparedCompositeTileCount`, `PreparedCompositeTilePrefix`, `PreparedCompositeTileFill`
+ Edge preparation (path flattening, fixed-point conversion, CSR construction) runs on the CPU. The `path.Flatten()` cost is shared with the CPU rasterizer pipeline. CSR construction is three passes over the edge set: count, prefix sum, scatter.

- `PreparedCompositeFine` is generated per target texture format and emitted as null-terminated UTF-8 bytes at runtime.
+ For the benchmark workload (7200x4800 US states GeoJSON polygon, 2px stroke, ~262K edges), NativeSurface performance is at parity with the CPU rasterizer (~28ms).
