# Performance Notes

## Precision Modes

Snapshot storage supports:

- `storage_compression=False` / `"none"`: raw storage.
- `storage_compression="bf16"`: bfloat16-compressed storage.

TM2D compute also supports:

- `compute_precision="default"`: existing float32/float64 execution.

Current API note:

- The historical `fp16_scaled` TM2D path is no longer part of the current public API surface.
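To make the bf16 trade-off concrete, here is a minimal pure-Python sketch of what bfloat16 compression does to a stored value: keep only the top 16 bits of the float32 pattern (sign, 8 exponent bits, 7 mantissa bits). This is an illustration of the format, not the library's actual storage code; the helper names are made up.

```python
import struct

def compress_bf16(value: float) -> bytes:
    # bfloat16 is the top 2 bytes of the big-endian float32 bit pattern
    return struct.pack(">f", value)[:2]

def decompress_bf16(two_bytes: bytes) -> float:
    # Restore a float32 by zero-padding the truncated mantissa bits
    return struct.unpack(">f", two_bytes + b"\x00\x00")[0]

snapshot = [0.123456, -7.89, 1024.5]
roundtrip = [decompress_bf16(compress_bf16(v)) for v in snapshot]
# bf16 keeps float32's exponent range but only ~2-3 decimal digits:
rel_err = max(abs(a - b) / abs(a) for a, b in zip(snapshot, roundtrip))
assert rel_err < 2 ** -7  # 7 mantissa bits -> truncation error below 1/128
```

The payoff is a 2x reduction in snapshot bytes with the full float32 exponent range preserved, at the cost of mantissa precision.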
## CPU vs CUDA

- The CPU path supports float32/float64 propagation and gradients.
- The CUDA path is recommended for large `ny x nx x nt` workloads. `maxwell3d(..., experimental_cuda_graph=True)` is available as an experimental forward-only CUDA mode on the C/CUDA backend.
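A size-based device choice can be sketched as below. The `64**3` threshold and the helper name are purely illustrative assumptions, not tuned values or API from the library; measure on your own hardware before hard-coding a cutoff.

```python
# Hypothetical heuristic: route large ny*nx*nt workloads to CUDA, where
# kernel-launch and transfer overheads are amortized; keep small ones on CPU.
def pick_device(ny: int, nx: int, nt: int, cuda_available: bool) -> str:
    cells_times_steps = ny * nx * nt
    if cuda_available and cells_times_steps >= 64 ** 3:  # illustrative cutoff
        return "cuda"
    return "cpu"

assert pick_device(32, 32, 100, cuda_available=True) == "cpu"
assert pick_device(512, 512, 2000, cuda_available=True) == "cuda"
assert pick_device(512, 512, 2000, cuda_available=False) == "cpu"
```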
## Batching

- Throughput scales with `n_shots` until memory bandwidth saturates.
- For gradient workloads, tune `model_gradient_sampling_interval` and `storage_mode` to control memory use.
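The saturation behavior can be seen in a toy latency model: a fixed per-call overhead amortized over `n_shots`, plus a per-shot cost that sets a throughput ceiling. All constants here are illustrative, not measured values from the library.

```python
# time(n) = overhead + n * per_shot  =>  throughput(n) -> 1 / per_shot
def shots_per_second(n_shots: int, overhead_s: float = 5e-3,
                     per_shot_s: float = 1e-3) -> float:
    return n_shots / (overhead_s + n_shots * per_shot_s)

rates = [shots_per_second(n) for n in (1, 8, 64, 512)]
assert all(b > a for a, b in zip(rates, rates[1:]))  # bigger batch, more shots/s
assert shots_per_second(512) < 1 / 1e-3              # but below the ceiling
# Diminishing returns: 1 -> 8 shots helps far more than 64 -> 512
assert shots_per_second(8) / shots_per_second(1) > \
       shots_per_second(512) / shots_per_second(64)
```

Once the marginal gain flattens out like this, growing `n_shots` further only adds memory pressure.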
## Maxwell3D CUDA Graph

- The current implementation only applies to forward-only `maxwell3d` calls on CUDA with `python_backend=False`.
- Graph capture is aligned to the existing chunk boundary: `effective_chunk = callback_frequency` when a `forward_callback` is present; otherwise the whole forward pass is one chunk.
- Graphs are cached at module scope and reused across compatible later calls on the same device. The cache key is based on shape/layout/config rather than tied to one `maxwell3d` invocation.
- Callback execution stays outside the graph. That keeps `CallbackState` semantics unchanged while still allowing graph replay between callbacks.
- In-place edits to callback-visible wavefields are respected by later graph replays because the graph keeps the same storage. Avoid resizing, reassigning, or changing tensor layout/device inside callbacks.
- Speedup is workload-dependent. Small, launch-bound chunks may benefit; larger chunks can be neutral or slightly slower because graph capture and staging copies are not free. Run `uv run python examples/benchmark_maxwell3d_cuda_graph.py --verify` on the target GPU before enabling it broadly.
## Storage Impact

- `storage_mode="device"` is fastest but VRAM-heavy.
- `storage_mode="cpu"` / `"disk"` reduce VRAM pressure at the cost of transfer overhead.
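A quick way to reason about the trade-off is to estimate the snapshot footprint and compare it to the available budgets. The chooser below is a sketch in the spirit of a byte-limited auto mode, not the library's actual policy; the budgets and fallback order are assumptions.

```python
def snapshot_bytes(ny, nx, nt, n_shots, bytes_per_value=4,
                   sampling_interval=1):
    stored_steps = -(-nt // sampling_interval)  # ceil(nt / interval)
    return ny * nx * stored_steps * n_shots * bytes_per_value

def choose_storage_mode(est_bytes, vram_budget, ram_budget):
    if est_bytes <= vram_budget:
        return "device"  # fastest, VRAM-heavy
    if est_bytes <= ram_budget:
        return "cpu"     # off-device, pays transfer overhead
    return "disk"        # slowest, lowest memory pressure

est = snapshot_bytes(ny=500, nx=500, nt=2000, n_shots=8)  # float32 snapshots
assert est == 16_000_000_000  # ~15 GiB: too big for an 8 GiB card
assert choose_storage_mode(est, vram_budget=8 * 2**30,
                           ram_budget=64 * 2**30) == "cpu"
# Coarser gradient sampling shrinks the footprint proportionally:
assert snapshot_bytes(500, 500, 2000, 8, sampling_interval=4) == est // 4
```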
## Practical Optimization Checklist

- Choose CUDA for medium/large models and batches.
- Increase `n_shots` until throughput saturates, then stop.
- Balance stencil order and grid spacing for the target fidelity.
- Use `storage_mode="auto"` with byte limits on memory-constrained systems.
- Profile representative workloads rather than synthetic tiny benchmarks.