# Performance Notes

## Precision Modes

Snapshot storage supports:

- `storage_compression=False` / `"none"`: raw storage.
- `storage_compression="bf16"`: bfloat16-compressed storage.

TM2D compute also supports:

- `compute_precision="default"`: existing float32/float64 execution.

Current API note:

- The historical `fp16_scaled` TM2D path is no longer part of the current public API surface.
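To make the bf16 trade-off concrete, here is a minimal pure-Python sketch of what bfloat16 compression does to a stored value: keep only the top 16 bits of the float32 pattern (sign, 8 exponent bits, 7 mantissa bits). This is an illustration of the format, not the library's actual storage code; the helper names are made up.

```python
import struct

def compress_bf16(value: float) -> bytes:
    # bfloat16 is the top 2 bytes of the big-endian float32 bit pattern
    return struct.pack(">f", value)[:2]

def decompress_bf16(two_bytes: bytes) -> float:
    # Restore a float32 by zero-padding the truncated mantissa bits
    return struct.unpack(">f", two_bytes + b"\x00\x00")[0]

snapshot = [0.123456, -7.89, 1024.5]
roundtrip = [decompress_bf16(compress_bf16(v)) for v in snapshot]
# bf16 keeps float32's exponent range but only ~2-3 decimal digits:
rel_err = max(abs(a - b) / abs(a) for a, b in zip(snapshot, roundtrip))
assert rel_err < 2 ** -7  # 7 mantissa bits -> truncation error below 1/128
```

The payoff is a 2x reduction in snapshot bytes with the full float32 exponent range preserved, at the cost of mantissa precision.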
## CPU vs CUDA

- The CPU path supports float32/float64 propagation and gradients.
- The CUDA path is recommended for large `ny x nx x nt` workloads. `maxwell3d(..., experimental_cuda_graph=True)` is available as an experimental forward-only CUDA mode on the C/CUDA backend.
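A size-based device choice can be sketched as below. The `64**3` threshold and the helper name are purely illustrative assumptions, not tuned values or API from the library; measure on your own hardware before hard-coding a cutoff.

```python
# Hypothetical heuristic: route large ny*nx*nt workloads to CUDA, where
# kernel-launch and transfer overheads are amortized; keep small ones on CPU.
def pick_device(ny: int, nx: int, nt: int, cuda_available: bool) -> str:
    cells_times_steps = ny * nx * nt
    if cuda_available and cells_times_steps >= 64 ** 3:  # illustrative cutoff
        return "cuda"
    return "cpu"

assert pick_device(32, 32, 100, cuda_available=True) == "cpu"
assert pick_device(512, 512, 2000, cuda_available=True) == "cuda"
assert pick_device(512, 512, 2000, cuda_available=False) == "cpu"
```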
## Batching

- Throughput scales with `n_shots` until memory bandwidth saturates.
- For gradient workloads, tune `model_gradient_sampling_interval` and `storage_mode` to control memory use.
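The saturation behavior can be seen in a toy latency model: a fixed per-call overhead amortized over `n_shots`, plus a per-shot cost that sets a throughput ceiling. All constants here are illustrative, not measured values from the library.

```python
# time(n) = overhead + n * per_shot  =>  throughput(n) -> 1 / per_shot
def shots_per_second(n_shots: int, overhead_s: float = 5e-3,
                     per_shot_s: float = 1e-3) -> float:
    return n_shots / (overhead_s + n_shots * per_shot_s)

rates = [shots_per_second(n) for n in (1, 8, 64, 512)]
assert all(b > a for a, b in zip(rates, rates[1:]))  # bigger batch, more shots/s
assert shots_per_second(512) < 1 / 1e-3              # but below the ceiling
# Diminishing returns: 1 -> 8 shots helps far more than 64 -> 512
assert shots_per_second(8) / shots_per_second(1) > \
       shots_per_second(512) / shots_per_second(64)
```

Once the marginal gain flattens out like this, growing `n_shots` further only adds memory pressure.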
## Maxwell3D CUDA Graph

- The current implementation only applies to forward-only `maxwell3d` calls on CUDA with `python_backend=False`.
- Graph capture is aligned to the existing chunk boundary: `effective_chunk = callback_frequency` when a `forward_callback` is present; otherwise the whole forward pass is one chunk.
- Graphs are cached at module scope and reused across compatible later calls on the same device. The cache key is based on shape/layout/config rather than tied to one `maxwell3d` invocation.
- Callback execution stays outside the graph. That keeps `CallbackState` semantics unchanged while still allowing graph replay between callbacks.
- In-place edits to callback-visible wavefields are respected by later graph replays because the graph keeps the same storage. Avoid resizing, reassigning, or changing tensor layout/device inside callbacks.
- Speedup is workload-dependent. Small, launch-bound chunks may benefit; larger chunks can be neutral or slightly slower because graph capture and staging copies are not free. Run `uv run python examples/benchmark_maxwell3d_cuda_graph.py --verify` on the target GPU before enabling it broadly.
## Storage Impact

- `storage_mode="device"` is fastest but VRAM-heavy.
- `storage_mode="cpu"` / `"disk"` reduce VRAM pressure at the cost of transfer overhead.
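A quick way to reason about the trade-off is to estimate the snapshot footprint and compare it to the available budgets. The chooser below is a sketch in the spirit of a byte-limited auto mode, not the library's actual policy; the budgets and fallback order are assumptions.

```python
def snapshot_bytes(ny, nx, nt, n_shots, bytes_per_value=4,
                   sampling_interval=1):
    stored_steps = -(-nt // sampling_interval)  # ceil(nt / interval)
    return ny * nx * stored_steps * n_shots * bytes_per_value

def choose_storage_mode(est_bytes, vram_budget, ram_budget):
    if est_bytes <= vram_budget:
        return "device"  # fastest, VRAM-heavy
    if est_bytes <= ram_budget:
        return "cpu"     # off-device, pays transfer overhead
    return "disk"        # slowest, lowest memory pressure

est = snapshot_bytes(ny=500, nx=500, nt=2000, n_shots=8)  # float32 snapshots
assert est == 16_000_000_000  # ~15 GiB: too big for an 8 GiB card
assert choose_storage_mode(est, vram_budget=8 * 2**30,
                           ram_budget=64 * 2**30) == "cpu"
# Coarser gradient sampling shrinks the footprint proportionally:
assert snapshot_bytes(500, 500, 2000, 8, sampling_interval=4) == est // 4
```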
## Practical Optimization Checklist

- Choose CUDA for medium/large models and batches.
- Increase `n_shots` until throughput saturates, then stop.
- Balance stencil order and grid spacing for the target fidelity.
- Use `storage_mode="auto"` with byte limits on memory-constrained systems.
- Profile representative workloads rather than synthetic tiny benchmarks.