Table of Contents
- Goal description
- The bare minimum
- Synchronization and lost devices
- Specialization constants and creating pipelines
Goal description
This post is for those who have some knowledge of Vulkan and are purely interested in using Vulkan for computing. You've followed along with vulkan-tutorial, and maybe you've read Thales Sabino's Vulkan compute example. However, you may be a bit foggy on the finer points of synchronization, you don't know all the essential functions off the top of your head, and if you had to populate most structs' fields or use some more advanced functionality, you'd have to look up the spec and figure out what its jargon actually means. Also, basically all the tutorials you've found are about graphics programming, or they cover compute shaders only as a way to complement graphics rendering.
So, the goal here is to go over the bare minimum you need to set up to use compute shaders, as well as tips for avoiding some hiccups along the way. That minimum is actually far less than what you need for graphics, since you don't have to worry about surfaces, fixed functions, viewports or any of that jazz, and afaik everything you need for compute is platform independent.
The bare minimum
For anything Vulkan, the bare minimum you need is an instance, a physical device, a logical device, a queue, a pipeline layout, a pipeline and a command pool.
I would call the trio of instance, physical device, and logical device the core, or perhaps skeleton of the application. Creating these works the same whether you're doing headless compute or pure graphics rendering.
The first divergence from the graphics-oriented vulkan-tutorial is in queue creation. You need a queue family that supports compute, which you can check for with VK_QUEUE_COMPUTE_BIT, or vk::QueueFlagBits::eCompute with vulkan-hpp. Some GPUs have queue families that exclusively support compute. These enable asynchronous compute, but it's not necessary to use them unless you foresee running multiple workloads in parallel, so when you're finding queue families just use the first one that supports compute.
Synchronization and lost devices
Another thing you may want to set up is a fence. In some cases it may actually be fine to skip the fence and just use vkQueueWaitIdle(...) or, in vulkan-hpp, queue.waitIdle(), where queue is the queue you created. However, a library I use called VkFFT has a fence member in its main struct, so I made a fence. If you have one overarching fence, then instead of waiting until the queue is idle you wait for the fence to be signaled and then reset it, so e.g. in vulkan-hpp
queue.submit(submitInfo, nullptr);
queue.waitIdle();
becomes
queue.submit(submitInfo, fence);
auto result = device.waitForFences(fence, vk::True, UINT64_MAX);
device.resetFences(fence);
However, you shouldn't use both for the same synchronization. Nothing bad happens, it's just a bit wasteful and clutters your code.
On the topic of submitting, you might be tempted to offload as much of the work as possible to the compute shader and minimize the number of submissions you make to the queue. This sounds good in theory, but when a single pipeline or shader runs for a very long time, the logical device may be lost. This is most likely the driver's GPU watchdog kicking in; on Windows, for example, timeout detection and recovery (TDR) resets the driver after about 2 seconds by default, and it's happened to me more than a few times. So a good rule of thumb is that a single pipeline dispatch shouldn't run for longer than 2 seconds, and if it's getting to that point you should split it into multiple calls.
This can happen when you have a very long and expensive for loop in your shader, which is a very common case in scientific computing. In that case you should keep the looping part on the CPU and have your shader just do the computation part. Afaik the best way to do this is to set up a command buffer created with no flags: such a buffer can be submitted multiple times and doesn't need to be reset or rewritten. In this command buffer you'll record the pipeline dispatch, along with any memory barriers you might need, up to around 1000 times. This is because submitting a command buffer has rather high latency and overhead, so it's best to pack as much work as possible into a single submission. I've found that you can get up to or over double the speed by recording hundreds of commands in a command buffer versus submitting one command at a time.
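A sketch of that pattern in vulkan-hpp might look like the following. All the handles (device, queue, fence, the command buffer from a pool created with no flags, the pipeline, its layout and a descriptor set) and groupCountX are assumed to come from your own setup; the barrier between dispatches is only needed if each dispatch reads what the previous one wrote.

```cpp
#include <cstdint>
#include <vulkan/vulkan.hpp>

// Record the buffer once; with no usage flags it can be submitted many times.
void recordBatch(vk::CommandBuffer cmd, vk::Pipeline pipeline,
                 vk::PipelineLayout layout, vk::DescriptorSet set,
                 uint32_t groupCountX) {
    cmd.begin(vk::CommandBufferBeginInfo());
    cmd.bindPipeline(vk::PipelineBindPoint::eCompute, pipeline);
    cmd.bindDescriptorSets(vk::PipelineBindPoint::eCompute, layout, 0, set, {});
    // Make each dispatch's writes visible to the next dispatch's reads:
    vk::MemoryBarrier barrier(vk::AccessFlagBits::eShaderWrite,
                              vk::AccessFlagBits::eShaderRead);
    for (int i = 0; i < 1000; i++) {  // hundreds of dispatches per buffer
        cmd.dispatch(groupCountX, 1, 1);
        cmd.pipelineBarrier(vk::PipelineStageFlagBits::eComputeShader,
                            vk::PipelineStageFlagBits::eComputeShader,
                            {}, barrier, {}, {});
    }
    cmd.end();
}

// The outer loop that used to live in the shader then becomes repeated
// submissions of the same pre-recorded buffer:
void runOuterLoop(vk::Device device, vk::Queue queue, vk::Fence fence,
                  vk::CommandBuffer cmd, int outerIterations) {
    vk::SubmitInfo submitInfo;
    submitInfo.setCommandBuffers(cmd);
    for (int i = 0; i < outerIterations; i++) {
        queue.submit(submitInfo, fence);
        (void)device.waitForFences(fence, vk::True, UINT64_MAX);
        device.resetFences(fence);
    }
}
```

Waiting on the fence every submission serializes everything; if your dispatches are independent you could instead submit several batches before waiting.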
Specialization constants and creating pipelines
One cool thing that Vulkan offers is specialization constants. While you do compile glsl shaders to SPIR-V bytecode, this bytecode still needs to be compiled to machine code and transferred to the GPU. It's at this point that Vulkan can insert specialization constants into the shader, allowing it to be optimised as if the constant had been hardcoded into the GLSL code rather than being a variable that's set on the fly. Afaict this feature isn't very commonly used in graphics programming; perhaps there aren't many situations where having to rebuild a pipeline is worth making the shader slightly faster. But for physics simulations, where you usually have tons of numbers that you want to be constant within a run yet easy to modify between runs, it's actually very cool.
So, working backward, to create a compute pipeline you call vkCreateComputePipelines, or device.createComputePipeline() with vulkan-hpp, like so:
auto result = device.createComputePipeline(pipelineCache, cPCI);
vk::Pipeline bleh = result.value;
The hpp version returns a result struct, so you can check whether the creation was successful with result.result. I think it's generally good practice to use a pipeline cache even if it's not strictly necessary. Here cPCI is a vk::ComputePipelineCreateInfo struct. You could also pass a vk::AllocationCallbacks, but that seems like a pretty advanced thing that I still haven't needed to dive into (use nullptr in the C function if not using vulkan-hpp).
So, the compute pipeline create info is a struct that you can default initialise and populate the fields of, or construct in one line in the hpp version. You need to set sType to VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO, set layout to the compute pipeline layout created previously and set stage to the appropriate pipeline shader stage create info. In the hpp version you don't need to set sType; instead the constructor's first argument is a vk::PipelineCreateFlags(). You can see what flags you can set here, but personally I haven't felt the need to set any of them.
Setting the pipeline shader stage create info works much like it does in vulkan-tutorial, except that you set the shader stage flag bit to VK_SHADER_STAGE_COMPUTE_BIT/vk::ShaderStageFlagBits::eCompute, and here you can set a pointer to a vk::SpecializationInfo struct in the pSpecializationInfo field, or pass it as the last argument in the hpp constructor.
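Put together with vulkan-hpp, the stage description is just a one-liner. In this sketch, shaderModule is a vk::ShaderModule assumed to have been created from your SPIR-V earlier, specInfo is the vk::SpecializationInfo set up as described next, and "main" is the entry point name in the GLSL source:

```cpp
vk::PipelineShaderStageCreateInfo stageInfo(
    vk::PipelineShaderStageCreateFlags(),  // no flags
    vk::ShaderStageFlagBits::eCompute,     // this is a compute stage
    shaderModule,                          // compiled SPIR-V module
    "main",                                // entry point in the shader
    &specInfo);                            // specialization constants (optional)
```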
Finally, to set specialization constants you first make a vk::SpecializationMapEntry (or a vector of them, as would usually be the case). There you set constantID to some number by which you can reference the constant in the corresponding shader, set the offset field to the offset of the constant from the start of the data pointer in bytes, and the size field to the constant's size in bytes. A sample using vulkan-hpp:
std::vector<vk::SpecializationMapEntry> bleh(nSpecConsts);
for (uint32_t i = 0; i < nSpecConsts; i++) {
    bleh[i].constantID = i;
    bleh[i].offset = i * 4;
    bleh[i].size = 4;
}
vk::SpecializationInfo specInfo;
specInfo.mapEntryCount = nSpecConsts;
specInfo.pMapEntries = bleh.data();
specInfo.dataSize = sizeof(SimConstants);
specInfo.pData = &params;
Here I'm using a struct to group the constants together. In that case you need to be careful about the alignment of the struct. So each fundamental type in C/++ has a property called alignment that controls where it can be placed in memory, and when you have a struct containing fields with different alignments, the struct will inherit the alignment from the field with the largest alignment. This means that the struct members may not be placed where you think they are. For example:
struct A {
    bool x;
    float y;
    double z;
};
In this struct, the member with the largest alignment is z, which has an alignment of 8 because it's a double. The size of the struct, however, is not 1 + 4 + 8 = 13, nor 8 + 8 + 8 = 24, but 8 + 8 = 16. The bool takes up 1 byte. The float needs to be 4-byte aligned, so instead of being placed directly next to the bool it's placed 3 bytes away and takes up 4 bytes; together the bool and float thus occupy 8 bytes, which means the double lands right after the float and the struct takes up 16 bytes total. If we remove the float, the struct still takes up 16 bytes, since the double needs to be 8-byte aligned and is thus placed 7 bytes after the bool. So the message is: if you have data members of different sizes in your specialization constant struct, you need to be careful with the offsets you use. In my case I could get away with using only 4-byte data members in SimConstants and get nSpecConsts by dividing the size of the struct by 4. In general you should order the data members of a struct in descending order of size, and avoid leaving parts of the struct's size unused. (Technically most compilers support "packed" data members through compiler directives, which lets the struct ignore alignment requirements, so it takes up the same amount of memory regardless of member order. But this forces the hardware to do unaligned accesses, which can be very slow on many older CPUs, although apparently basically as fast as aligned accesses on reasonably modern ones, and it introduces an extra layer of complexity to worry about. For more on this and other hardware considerations, see Timur Doumler's talk at CppCon 2016.)