Chapter 35: Project GPU Compute in Zig

Project

Overview

Chapter 34 surveyed the GPU execution model, address spaces, and dispatch planning; now we build an end-to-end workload that starts from Zig source and ends with a validated binary dump, ready to submit to a Vulkan or WebGPU queue family. The project stitches three pieces together: a SPIR-V kernel written in pure Zig, a host-side orchestration CLI with a CPU fallback, and a diff utility that compares captured GPU buffers against expected results.

Learning Goals

  • Lower a Zig compute kernel to SPIR-V with the self-hosted backend and understand the resource layout it expects.
  • Coordinate buffers, dispatch geometry, and validation paths from a host application that can run with or without GPU access.
  • Build lightweight diagnostics that grade GPU output against a deterministic CPU reference.

Reference: 34__gpu-fundamentals.xml

Building the Compute Pipeline

Our workload squares each element of a vector. The host creates submission metadata and data buffers, the kernel squares each lane, and the diff tool verifies a device capture. A static lane capacity mirrors the GPU storage buffer layout, while the host enforces the logical bound so the kernel can bail out when extra threads are dispatched.

Topology and Data Flow

The dispatch is deliberately modest (1,000 elements in 64-lane workgroups) so you can focus on correctness rather than occupancy tuning. The host injects random floating-point values, records a checksum for observability, and emits binary blobs that downstream tools or a real GPU driver can reuse. Because storage buffers operate on raw bytes, we pair every pointer parameter with an extern struct facade to guarantee layout parity with descriptor tables.
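
As a quick cross-check of that geometry, here is a minimal standalone sketch (not part of the project files) that mirrors the rounding-up arithmetic the host later performs in computeDispatch:

Zig
const std = @import("std");

pub fn main() void {
    const logical_len: u32 = 1000; // elements the host actually cares about
    const workgroup_size: u32 = 64; // lanes per workgroup

    // Round up so every element is covered by some workgroup.
    const groups = std.math.divCeil(u32, logical_len, workgroup_size) catch unreachable;
    const padded = groups * workgroup_size; // total invocations launched
    const tail = padded - logical_len; // lanes the kernel must skip

    std.debug.print("{d} groups x {d} lanes = {d} invocations ({d} tail lanes)\n", .{
        groups, workgroup_size, padded, tail,
    });
}

For 1,000 elements in 64-lane groups this yields 16 groups, 1,024 padded invocations, and a tail of 24 idle lanes, matching the launch plan the host prints later in the chapter.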

Writing the SPIR-V Kernel

The kernel receives three storage buffers: a submission header describing the logical length, an input vector, and an output vector. Each invocation reads one lane, squares it, and writes the result back if it is within bounds. The defensive check keeps stray workgroups from touching memory past the logical length, a common hazard when the host pads the dispatch to match wavefront granularity.

Zig
//! SPIR-V Kernel: element-wise vector squaring
//!
//! This kernel expects three storage buffers: an input vector (`in_values`), an
//! output vector (`out_values`), and a descriptor struct `Submission` that
//! communicates the logical element count. Each invocation squares one element
//! and writes the result back to `out_values`.

const builtin = @import("builtin");

/// Maximum number of elements the kernel will touch.
pub const lane_capacity: u32 = 1024;

/// Submission header shared between the host and the kernel.
///
/// The `extern` layout ensures the struct matches bindings created by Vulkan or
/// WebGPU descriptor tables.
const Submission = extern struct {
    /// Logical element count requested by the host.
    len: u32,
    _padding: u32 = 0,
};

/// Storage buffer layout expected by the kernel.
const VectorPayload = extern struct {
    values: [lane_capacity]f32,
};

/// Squares each element of `in_values` and writes the result to `out_values`.
///
/// The kernel is written defensively: it checks both the logical length passed
/// by the host and the static `lane_capacity` to avoid out-of-bounds writes when
/// dispatched with more threads than necessary.
pub export fn squareVector(
    submission: *addrspace(.storage_buffer) const Submission,
    in_values: *addrspace(.storage_buffer) const VectorPayload,
    out_values: *addrspace(.storage_buffer) VectorPayload,
) callconv(.kernel) void {
    const group_index = @workGroupId(0);
    const group_width = @workGroupSize(0);
    const local_index = @workItemId(0);
    const linear = group_index * group_width + local_index;

    const logical_len = submission.len;
    if (linear >= logical_len or linear >= lane_capacity) return;

    const value = in_values.*.values[linear];
    out_values.*.values[linear] = value * value;
}

// Guard compilation so this file is only compiled when targeting SPIR-V.
comptime {
    switch (builtin.target.cpu.arch) {
        .spirv32, .spirv64 => {},
        else => @compileError("squareVector must be compiled with a SPIR-V target, e.g. -target spirv32-vulkan-none"),
    }
}
Run
Shell
$ zig build-obj -fno-llvm -O ReleaseSmall -target spirv32-vulkan-none \
    -femit-bin=kernels/vector_square.spv \
    01_vector_square_kernel.zig
Output
Shell
no output (binary module generated)

Delete kernels/vector_square.spv when the experiment is over so repeated runs always rebuild the shader from source.

Host Orchestration and CPU Fallback

The host CLI plans the dispatch, seeds deterministic input, runs the CPU fallback, and writes a reference dump to out/reference.bin on request. It also validates the SPIR-V header (0x07230203) so a corrupted build surfaces immediately rather than failing deep inside a graphics API. An optional hook lets you drop in a captured GPU buffer (out/gpu_result.bin) for comparison.

Zig
// Project host pipeline for the vector-square kernel.
//
// This program demonstrates the CPU orchestration that pairs with the
// `squareVector` SPIR-V kernel. It prepares input data, plans a dispatch,
// validates the compiled shader module, and runs a CPU fallback that mirrors the
// GPU algorithm. When requested via `--emit-binary`, it also writes the CPU
// output to `out/reference.bin` so external GPU runs can be compared bit-for-bit.

const std = @import("std");

/// Must match `lane_capacity` in 01_vector_square_kernel.zig.
const lane_capacity: u32 = 1024;
const default_problem_len: u32 = 1000;
const workgroup_size: u32 = 64;
const spirv_path = "kernels/vector_square.spv";
const gpu_dump_path = "out/gpu_result.bin";
const cpu_dump_path = "out/reference.bin";

/// Encapsulates the GPU workgroup dispatch geometry, accounting for padding
/// when the total workload doesn't evenly divide into workgroup boundaries.
const DispatchPlan = struct {
    workgroup_size: u32,
    group_count: u32,
    /// Total invocations including padding to fill complete workgroups
    padded_invocations: u32,
    /// Number of unused lanes in the final workgroup
    tail: u32,
};

/// Tracks a validated SPIR-V module alongside its filesystem path for diagnostics.
const ModuleInfo = struct {
    path: []const u8,
    bytes: []u8,
};

pub fn main() !void {
    // Initialize allocator with leak detection for development builds
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer switch (gpa.deinit()) {
        .ok => {},
        .leak => std.log.err("general-purpose allocator detected a leak", .{}),
    };
    const allocator = gpa.allocator();

    // Parse command-line arguments for optional flags
    var args = try std.process.argsWithAllocator(allocator);
    defer args.deinit();
    _ = args.next(); // skip program name

    var emit_binary = false;
    var logical_len: u32 = default_problem_len;

    while (args.next()) |arg| {
        if (std.mem.eql(u8, arg, "--emit-binary")) {
            emit_binary = true;
        } else if (std.mem.eql(u8, arg, "--length")) {
            const value = args.next() orelse return error.MissingLengthValue;
            logical_len = try std.fmt.parseInt(u32, value, 10);
        } else {
            return error.UnknownFlag;
        }
    }

    // Clamp user-provided length to prevent buffer overruns in the kernel
    if (logical_len == 0 or logical_len > lane_capacity) {
        std.log.warn("clamping problem length to lane capacity ({d})", .{lane_capacity});
        logical_len = @min(lane_capacity, logical_len);
        if (logical_len == 0) logical_len = @min(lane_capacity, default_problem_len);
    }

    // Calculate how many workgroups we need to process this many elements
    const plan = computeDispatch(logical_len, workgroup_size);
    std.debug.print(
        "launch plan: {d} groups × {d} lanes => {d} invocations (tail {d})\n",
        .{ plan.group_count, plan.workgroup_size, plan.padded_invocations, plan.tail },
    );

    // Use deterministic PRNG for reproducible test runs across environments
    var prng = std.Random.DefaultPrng.init(0xBEEFFACE);
    const random = prng.random();

    // Generate input data with a predictable pattern plus random noise
    var input = try allocator.alloc(f32, logical_len);
    defer allocator.free(input);
    for (input, 0..input.len) |*slot, idx| {
        const base: f32 = @floatFromInt(idx);
        slot.* = base * 0.5 + random.float(f32);
    }

    // Execute CPU reference implementation to produce expected results
    var cpu_output = try allocator.alloc(f32, logical_len);
    defer allocator.free(cpu_output);
    runCpuFallback(input, cpu_output);

    // Compute simple checksum for quick sanity verification
    const checksum = checksumSlice(cpu_output);
    std.debug.print("cpu fallback checksum: {d:.6}\n", .{checksum});

    // Attempt to load and validate the compiled SPIR-V shader module
    const module = try loadSpirvIfPresent(allocator, spirv_path);
    defer if (module) |info| allocator.free(info.bytes);

    if (module) |info| {
        std.debug.print(
            "gpu module: {s} ({d} bytes, header ok)\n",
            .{ info.path, info.bytes.len },
        );
    } else {
        std.debug.print(
            "gpu module: missing ({s}); run kernel build command to generate it\n",
            .{spirv_path},
        );
    }

    // Check if a GPU execution captured output for comparison
    const maybe_gpu_dump = try loadBinaryIfPresent(allocator, gpu_dump_path);
    defer if (maybe_gpu_dump) |blob| allocator.free(blob);

    if (maybe_gpu_dump) |blob| {
        // Compare GPU results against CPU reference lane-by-lane
        const mismatches = compareF32Slices(cpu_output, blob);
        std.debug.print(
            "gpu capture diff: {d} mismatched lanes\n",
            .{mismatches},
        );
    } else {
        std.debug.print(
            "gpu capture diff: skipped (no {s} file found)\n",
            .{gpu_dump_path},
        );
    }

    // Display first few lanes for manual inspection
    const sample_count = @min(input.len, 6);
    for (input[0..sample_count], cpu_output[0..sample_count], 0..) |original, squared, idx| {
        std.debug.print(
            "lane {d:>3}: in={d:.5} out={d:.5}\n",
            .{ idx, original, squared },
        );
    }

    // Write reference dump if requested for external GPU validation tools
    if (emit_binary) {
        try emitCpuDump(cpu_output);
        std.debug.print("cpu reference written to {s}\n", .{cpu_dump_path});
    }
}

/// Computes dispatch geometry by rounding up to complete workgroups.
/// Returns the number of groups, total padded invocations, and unused tail lanes.
fn computeDispatch(total_items: u32, group_size: u32) DispatchPlan {
    std.debug.assert(group_size > 0);
    // Divide ceiling to ensure all items are covered
    const groups = std.math.divCeil(u32, total_items, group_size) catch unreachable;
    const padded = groups * group_size;
    return .{
        .workgroup_size = group_size,
        .group_count = groups,
        .padded_invocations = padded,
        .tail = padded - total_items,
    };
}

/// Executes the squaring operation on the CPU, mirroring the GPU kernel logic.
/// Each output element is the square of its corresponding input.
fn runCpuFallback(input: []const f32, output: []f32) void {
    std.debug.assert(input.len == output.len);
    for (input, output) |value, *slot| {
        slot.* = value * value;
    }
}

/// Calculates a simple sum of all f32 values in double precision for observability.
fn checksumSlice(values: []const f32) f64 {
    var total: f64 = 0.0;
    for (values) |value| {
        total += @as(f64, @floatCast(value));
    }
    return total;
}

/// Attempts to read and validate a SPIR-V binary module from disk.
/// Returns null if the file doesn't exist; validates the magic number (0x07230203).
fn loadSpirvIfPresent(allocator: std.mem.Allocator, path: []const u8) !?ModuleInfo {
    var file = std.fs.cwd().openFile(path, .{}) catch |err| switch (err) {
        error.FileNotFound => return null,
        else => return err,
    };
    defer file.close();

    const bytes = try file.readToEndAlloc(allocator, 1 << 20);
    errdefer allocator.free(bytes);

    // Validate minimum size for SPIR-V header
    if (bytes.len < 4) return error.SpirvTooSmall;
    // Check little-endian magic number
    const magic = std.mem.readInt(u32, bytes[0..4], .little);
    if (magic != 0x0723_0203) return error.InvalidSpirvMagic;

    return ModuleInfo{ .path = path, .bytes = bytes };
}

/// Loads raw binary data if the file exists; returns null for missing files.
fn loadBinaryIfPresent(allocator: std.mem.Allocator, path: []const u8) !?[]u8 {
    var file = std.fs.cwd().openFile(path, .{}) catch |err| switch (err) {
        error.FileNotFound => return null,
        else => return err,
    };
    defer file.close();
    const bytes = try file.readToEndAlloc(allocator, 1 << 24);
    return bytes;
}

/// Compares two f32 slices for approximate equality within 1e-6 tolerance.
/// Returns the count of mismatched lanes; returns expected.len if sizes differ.
fn compareF32Slices(expected: []const f32, blob_bytes: []const u8) usize {
    // Ensure blob size aligns with f32 boundaries
    if (blob_bytes.len % @sizeOf(f32) != 0) return expected.len;
    const actual = std.mem.bytesAsSlice(f32, blob_bytes);
    if (actual.len != expected.len) return expected.len;

    var mismatches: usize = 0;
    for (expected, actual) |lhs, rhs| {
        // Use floating-point tolerance to account for minor GPU precision differences
        if (!std.math.approxEqAbs(f32, lhs, rhs, 1e-6)) {
            mismatches += 1;
        }
    }
    return mismatches;
}

/// Writes CPU-computed f32 array to disk as raw bytes for external comparison tools.
fn emitCpuDump(values: []const f32) !void {
    // Ensure output directory exists before writing
    try std.fs.cwd().makePath("out");
    var file = try std.fs.cwd().createFile(cpu_dump_path, .{ .truncate = true });
    defer file.close();
    // Convert f32 slice to raw bytes for binary serialization
    const bytes = std.mem.sliceAsBytes(values);
    try file.writeAll(bytes);
}

Run
Shell
$ zig build-obj -fno-llvm -O ReleaseSmall -target spirv32-vulkan-none \
    -femit-bin=kernels/vector_square.spv \
    01_vector_square_kernel.zig
$ zig run 02_host_pipeline.zig -- --emit-binary
Output
Shell
launch plan: 16 groups × 64 lanes => 1024 invocations (tail 24)
cpu fallback checksum: 83467485.758038
gpu module: kernels/vector_square.spv (5368 bytes, header ok)
gpu capture diff: skipped (no out/gpu_result.bin file found)
lane   0: in=0.10821 out=0.01171
lane   1: in=1.07972 out=1.16579
lane   2: in=1.03577 out=1.07281
lane   3: in=2.33225 out=5.43938
lane   4: in=2.92146 out=8.53491
lane   5: in=2.89332 out=8.37133
cpu reference written to out/reference.bin

If you plan to capture a GPU buffer, keep the generated out/reference.bin; otherwise, delete it to keep the workspace clean.

Validating Device Dumps

The diff utility consumes two binary dumps (expected versus captured) and reports mismatched lanes, previewing the first few differences so you can spot data-dependent bugs quickly. It assumes little-endian f32 values, matching how most host APIs expose raw mapped buffers.

Zig
// Utility to compare two float32 binary dumps.
//
// The files are expected to be raw little-endian 32-bit float arrays. The
// program prints the number of mismatched lanes (based on absolute tolerance)
// and highlights the first few differences for quick diagnostics.

const std = @import("std");

/// Maximum number of mismatched differences to display in diagnostic output
const max_preview = 5;

pub fn main() !void {
    // Initialize allocator with leak detection for development builds
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer switch (gpa.deinit()) {
        .ok => {},
        .leak => std.log.warn("compare_dump leaked memory", .{}),
    };
    const allocator = gpa.allocator();

    // Parse command-line arguments expecting exactly two file paths
    var args = try std.process.argsWithAllocator(allocator);
    defer args.deinit();
    _ = args.next(); // Skip program name

    const expected_path = args.next() orelse return usageError();
    const actual_path = args.next() orelse return usageError();
    if (args.next()) |_| return usageError(); // Reject extra arguments

    // Load both binary dumps into memory for comparison
    const expected_bytes = try readAll(allocator, expected_path);
    defer allocator.free(expected_bytes);

    const actual_bytes = try readAll(allocator, actual_path);
    defer allocator.free(actual_bytes);

    // Reinterpret raw bytes as f32 slices for element-wise comparison
    const expected = std.mem.bytesAsSlice(f32, expected_bytes);
    const actual = std.mem.bytesAsSlice(f32, actual_bytes);

    // Early exit if array lengths differ
    if (expected.len != actual.len) {
        std.debug.print(
            "length mismatch: expected {d} elements, actual {d} elements\n",
            .{ expected.len, actual.len },
        );
        return;
    }

    // Track total mismatches and collect first few for detailed reporting
    var mismatches: usize = 0;
    var first_few: [max_preview]?Diff = .{null} ** max_preview;

    // Compare each lane using floating-point tolerance to account for minor precision differences
    for (expected, actual, 0..) |lhs, rhs, idx| {
        if (!std.math.approxEqAbs(f32, lhs, rhs, 1e-6)) {
            // Store first N differences for diagnostic display
            if (mismatches < max_preview) {
                first_few[mismatches] = Diff{ .index = idx, .expected = lhs, .actual = rhs };
            }
            mismatches += 1;
        }
    }

    // Print summary of comparison results
    std.debug.print("mismatched lanes: {d}\n", .{mismatches});
    
    // Display detailed information for first few mismatches to aid debugging
    for (first_few) |maybe_diff| {
        if (maybe_diff) |diff| {
            std.debug.print(
                "  lane {d}: expected={d:.6} actual={d:.6}\n",
                .{ diff.index, diff.expected, diff.actual },
            );
        }
    }
}

/// Prints usage information and returns an error when invocation is invalid
fn usageError() !void {
    std.debug.print("usage: compare_dump <expected.bin> <actual.bin>\n", .{});
    return error.InvalidInvocation;
}

/// Reads entire file contents into allocated memory with a 64 MiB size limit
fn readAll(allocator: std.mem.Allocator, path: []const u8) ![]u8 {
    var file = try std.fs.cwd().openFile(path, .{});
    defer file.close();
    return try file.readToEndAlloc(allocator, 1 << 26);
}

/// Captures a single floating-point mismatch with its location and values
const Diff = struct {
    index: usize,
    expected: f32,
    actual: f32,
};
Run
Shell
$ zig run 03_compare_dump.zig -- out/reference.bin out/reference.bin
Output
Shell
mismatched lanes: 0

To validate a real GPU run, save the device buffer as out/gpu_result.bin and re-run 03_compare_dump.zig against that file to surface any differences.
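
How you read the device buffer back depends entirely on your graphics API, but once the mapped bytes reach the CPU the dump itself is a one-liner. A minimal sketch, assuming a hypothetical mapped: []const u8 slice obtained from your Vulkan or WebGPU bindings, mirroring emitCpuDump from the host pipeline:

Zig
const std = @import("std");

/// Writes a mapped GPU output buffer to out/gpu_result.bin so that
/// 03_compare_dump.zig can diff it against out/reference.bin.
fn dumpGpuBuffer(mapped: []const u8) !void {
    // Ensure the output directory exists before writing the capture.
    try std.fs.cwd().makePath("out");
    var file = try std.fs.cwd().createFile("out/gpu_result.bin", .{ .truncate = true });
    defer file.close();
    try file.writeAll(mapped);
}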

Notes and Caveats

  • Storage buffers require explicit alignment; keep your extern struct definitions in sync with the host's descriptor bindings to avoid silent padding bugs (a comptime layout check is sketched after this list).
  • The self-hosted SPIR-V backend rejects unsupported address spaces on CPU targets, so keep the kernel source file isolated from the host build (do not @import it from the CPU binary).
  • Deterministic PRNG seeding keeps CPU and GPU executions comparable across runs and CI environments.
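
One cheap way to catch layout drift early is a comptime assertion next to the shared struct definitions. This is a minimal sketch, assuming the same Submission header as the kernel; adjust the expected size and offsets to whatever your descriptor table actually binds:

Zig
const std = @import("std");

/// Mirror of the kernel's submission header; must stay in sync with the
/// descriptor layout the host binds.
const Submission = extern struct {
    len: u32,
    _padding: u32 = 0,
};

comptime {
    // The kernel and the descriptor table both expect an 8-byte header with
    // `len` at offset 0; fail the build immediately if that ever drifts.
    std.debug.assert(@sizeOf(Submission) == 8);
    std.debug.assert(@offsetOf(Submission, "len") == 0);
}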

Exercises

  • Extend the kernel to a fused multiply-add (a * a + b) by binding a second input buffer; update the host and the diff tool accordingly (a possible kernel signature is sketched after this list).
  • Teach the host CLI to emit JSON metadata describing the dispatch plan so external profilers can pick up the run configuration.
  • Capture real GPU output (via Vulkan, WebGPU, or wgpu-native) and feed the binary into 03_compare_dump.zig, noting any tolerance adjustments your hardware requires.
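
For the first exercise, one possible shape of the extended entry point is sketched below; it reuses Submission, VectorPayload, and lane_capacity from the kernel file, and the fmaVector name plus the second binding are suggestions, not part of the project sources:

Zig
/// Hypothetical fused multiply-add variant: out[i] = a[i] * a[i] + b[i].
pub export fn fmaVector(
    submission: *addrspace(.storage_buffer) const Submission,
    a_values: *addrspace(.storage_buffer) const VectorPayload,
    b_values: *addrspace(.storage_buffer) const VectorPayload,
    out_values: *addrspace(.storage_buffer) VectorPayload,
) callconv(.kernel) void {
    const linear = @workGroupId(0) * @workGroupSize(0) + @workItemId(0);
    if (linear >= submission.len or linear >= lane_capacity) return;

    const a = a_values.*.values[linear];
    const b = b_values.*.values[linear];
    out_values.*.values[linear] = a * a + b;
}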

Alternatives and Edge Cases

  • Vendors map storage buffers differently; check the required minimum alignment (for example, 16 bytes on some drivers) before assuming a densely packed f32 array.
  • For very large buffers, compare in a streaming fashion instead of loading whole dumps into memory, to avoid stressing the allocator on low-end machines (see the sketch after this list).
  • When targeting CUDA (nvptx64), swap the calling convention to .kernel and adjust the address spaces (.global/.shared) to satisfy PTX expectations.
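
For the streaming comparison mentioned above, a minimal sketch (fixed 4 KiB chunks, the same 1e-6 absolute tolerance, and a hypothetical streamingCompare helper that is not part of the project files) might look like this; it assumes both dumps have identical sizes that are multiples of four bytes:

Zig
const std = @import("std");

pub fn main() !void {
    const mismatches = try streamingCompare("out/reference.bin", "out/gpu_result.bin");
    std.debug.print("mismatched lanes: {d}\n", .{mismatches});
}

/// Counts mismatched f32 lanes between two raw dumps without loading either
/// file fully into memory.
fn streamingCompare(expected_path: []const u8, actual_path: []const u8) !usize {
    var expected = try std.fs.cwd().openFile(expected_path, .{});
    defer expected.close();
    var actual = try std.fs.cwd().openFile(actual_path, .{});
    defer actual.close();

    var lhs_buf: [4096]u8 = undefined;
    var rhs_buf: [4096]u8 = undefined;
    var mismatches: usize = 0;

    while (true) {
        // readAll fills each buffer completely unless the file ends first.
        const lhs_len = try expected.readAll(&lhs_buf);
        const rhs_len = try actual.readAll(&rhs_buf);
        if (lhs_len != rhs_len or lhs_len % @sizeOf(f32) != 0) return error.SizeMismatch;
        if (lhs_len == 0) break;

        const lhs = std.mem.bytesAsSlice(f32, lhs_buf[0..lhs_len]);
        const rhs = std.mem.bytesAsSlice(f32, rhs_buf[0..rhs_len]);
        for (lhs, rhs) |a, b| {
            if (!std.math.approxEqAbs(f32, a, b, 1e-6)) mismatches += 1;
        }
    }
    return mismatches;
}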
