What is Kompute?
Kompute is a Vulkan-based GPU compute framework that supports C++ and Python. From what I gather, its main selling point is that it lets you get straight to computing without dealing with the copious boilerplate and fiddly bits that raw Vulkan entails, and it provides a neat abstraction for multidimensional arrays in its Tensor class.
What is VkFFT?
VkFFT is a library for performing FFTs on the GPU using various backends, though here I'm only interested in Vulkan.
What is an FFT?
This is one of those things where you either basically already know what it is, or if I tried to explain it the post would be mostly that explanation. Besides the Wikipedia page, this video is a good presentation on it.
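For reference, the underlying operation is the discrete Fourier transform, which the FFT computes in O(N log N) rather than O(N²) operations:

X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i \, kn/N}, \qquad k = 0, \dots, N-1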
Vat is problem?
VkFFT needs to know a number of Vulkan handles that are private members of the Kompute Manager and Tensor classes in order to even initialize the essential VkFFTApplication. Basically, in order to use VkFFT with Kompute you need to use Kompute's BYOV (bring your own Vulkan) feature, meaning you need to do some manual Vulkan setup. You might worry that this negates the point of using Kompute in order to minimise boilerplate and micromanagement, but I think the solution presented here still keeps things pretty minimal.
The Code
The Kompute examples all use handwritten shaders, or at least assume you have a single shader that you can pass in as a function argument. AFAIK VkFFT is far too involved for that to be possible. However, the way Kompute basically works is that you have a Sequence instance and call record on it with operations, i.e. subclasses of kp::OpBase, each of which gets to record its own commands into the sequence's command buffer.
Thus I wrote the following class:
// kp::OpBase subclass that records a VkFFT dispatch into Kompute's command
// buffer. (i64 is an alias for int64_t; bit_cast is std::bit_cast.)
class FFT : public kp::OpBase {
public:
    FFT(VkFFTApplication* app, i64 direction, VkFFTLaunchParams* lParams)
        : app{app}, lParams{lParams}, direction{direction} {}

    void record(const vk::CommandBuffer& commandBuffer) override {
        // Hand Kompute's command buffer to VkFFT, then append the FFT
        // commands to it (direction: -1 = forward, 1 = inverse).
        lParams->commandBuffer = bit_cast<VkCommandBuffer*>(&commandBuffer);
        VkFFTAppend(app, direction, lParams);
    }

    void preEval(const vk::CommandBuffer& commandBuffer) override {}
    void postEval(const vk::CommandBuffer& commandBuffer) override {}
    ~FFT() override {}

    // Non-owning: the caller is responsible for these resources.
    VkFFTApplication* app;
    VkFFTLaunchParams* lParams;
    i64 direction;
};
I could check whether the launch parameters already have a valid command buffer to avoid unnecessary writes, but that just trades a write for a branch, which I think negates whatever advantage there is to be gained. I also decided to make the VkFFTApplication and launch parameters raw pointer members, with the understanding that the FFT class is not responsible for any resources. That's simpler than using shared pointers, and a unique pointer is a no-go if we want to reuse the application, since each record call on the sequence creates a new FFT instance.
You might also notice that this hack allows us to grab Kompute's command buffer that isn't meant to be exposed to the user.
This approach requires that you set up all the VkFFT stuff beforehand, which in turn requires that you have access to the following:
- A VkInstance
- A VkPhysicalDevice
- A VkDevice
- A VkQueue
- A VkFence
- A VkCommandPool
- A VkBuffer
Getting all of this set up is IMO much simpler than dealing with pipelines and descriptors, so I think it's still worth using Kompute even if you need to set these components up manually.
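For concreteness, here's a minimal sketch of what a VulkanApp-style setup class could look like using vulkan.hpp. This is my illustration of the general shape, not the actual class from the repo; error handling, validation layers, and extensions are omitted:

#include <vulkan/vulkan.hpp>

struct VulkanApp {
    vk::Instance instance;
    vk::PhysicalDevice pDevice;
    vk::Device device;
    vk::Queue queue;
    vk::Fence fence;
    vk::CommandPool commandPool;
    uint32_t queueFamilyIndex = 0;

    VulkanApp() {
        vk::ApplicationInfo appInfo{"kompute-vkfft", 1, nullptr, 0,
                                    VK_API_VERSION_1_1};
        instance = vk::createInstance(vk::InstanceCreateInfo{{}, &appInfo});
        pDevice = instance.enumeratePhysicalDevices().front();

        // Find a queue family with compute support.
        auto families = pDevice.getQueueFamilyProperties();
        for (uint32_t i = 0; i < families.size(); i++) {
            if (families[i].queueFlags & vk::QueueFlagBits::eCompute) {
                queueFamilyIndex = i;
                break;
            }
        }

        float priority = 1.0f;
        vk::DeviceQueueCreateInfo queueInfo{{}, queueFamilyIndex, 1, &priority};
        device = pDevice.createDevice(vk::DeviceCreateInfo{{}, 1, &queueInfo});
        queue = device.getQueue(queueFamilyIndex, 0);
        fence = device.createFence(vk::FenceCreateInfo{});
        commandPool = device.createCommandPool(
            vk::CommandPoolCreateInfo{{}, queueFamilyIndex});
    }

    uint32_t getComputeQueueFamilyIndex() const { return queueFamilyIndex; }

    // Clean up in reverse order of creation.
    ~VulkanApp() {
        device.destroyCommandPool(commandPool);
        device.destroyFence(fence);
        device.destroy();
        instance.destroy();
    }
};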
So what I end up with in order to perform an FFT is the following:
// Headers (plus kompute/Kompute.hpp and vkFFT.h); f32/u32/u64 are
// fixed-width aliases used throughout this post.
#include <cmath>
#include <iostream>
#include <memory>
#include <vector>

int main() {
    VulkanApp myApp{};

    // 128 complex samples of one period of a sine wave, interleaved as
    // (real, imaginary) pairs.
    std::vector<f32> buff(128 * 2);
    for (u32 i = 0; i < 128; i++) {
        f32 x = 2 * M_PI * (f32)i / 128.;
        buff[2 * i] = std::sin(x);
        buff[2 * i + 1] = 0.;
    }

    // Construct the tensor and sequence directly instead of going through a
    // Manager; the no-op deleters are explained below.
    auto tensor = std::make_shared<kp::TensorT<f32>>(
        std::shared_ptr<vk::PhysicalDevice>(&myApp.pDevice,
                                            [](vk::PhysicalDevice*) {}),
        std::shared_ptr<vk::Device>(&myApp.device, [](vk::Device*) {}), buff);
    auto seq = std::make_shared<kp::Sequence>(
        std::shared_ptr<vk::PhysicalDevice>(&myApp.pDevice,
                                            [](vk::PhysicalDevice*) {}),
        std::shared_ptr<vk::Device>(&myApp.device, [](vk::Device*) {}),
        std::shared_ptr<vk::Queue>(&myApp.queue, [](vk::Queue*) {}),
        myApp.getComputeQueueFamilyIndex());

    // 128 complex floats = 128 * 2 * sizeof(f32) = 128 * 8 bytes.
    u64 bufferSize = 128 * 8;
    VkFFTConfiguration conf{};
    conf.device = bit_cast<VkDevice*>(&myApp.device);
    conf.queue = bit_cast<VkQueue*>(&myApp.queue);
    conf.FFTdim = 1;
    conf.size[0] = 128;
    conf.fence = bit_cast<VkFence*>(&myApp.fence);
    conf.commandPool = bit_cast<VkCommandPool*>(&myApp.commandPool);
    conf.physicalDevice = bit_cast<VkPhysicalDevice*>(&myApp.pDevice);
    conf.buffer = bit_cast<VkBuffer*>(tensor->getPrimaryBuffer().get());
    conf.bufferSize = &bufferSize;

    VkFFTApplication fftApp{};
    initializeVkFFT(&fftApp, conf);

    VkFFTLaunchParams lp{};
    std::shared_ptr<kp::OpBase> forward{new FFT(&fftApp, -1, &lp)};
    std::shared_ptr<kp::OpBase> backward{new FFT(&fftApp, 1, &lp)};

    // Upload, run a forward and an inverse FFT, and download the result.
    seq->record<kp::OpSyncDevice>({tensor})
        ->record(forward)
        ->record(backward)
        ->record<kp::OpSyncLocal>({tensor})
        ->eval();
    deleteVkFFT(&fftApp);

    for (const auto& e : tensor->vector()) {
        std::cout << e << ' ';
    }
    std::cout << '\n';
}
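One note on the output: if I'm reading VkFFT's defaults right, the inverse transform is unnormalized, so the forward-then-backward round trip prints the input scaled by the FFT length (128 here) rather than the input itself. VkFFT's configuration has a normalize flag for this, set before initializeVkFFT:

conf.normalize = 1;  // normalize the inverse transform so the round trip returns the input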
All the components I listed above except the VkBuffer are contained in the VulkanApp class, which also automatically cleans up its resources in its destructor. You'll notice there's no instance of the Manager class that Kompute provides. At first I thought you'd need it, since it provides convenient methods to create and manage Sequences and Tensors. However, creating it from pre-existing Vulkan resources turns off the flag for it to manage resources (seems fine), but that also means it doesn't create its own compute queue (and as far as I can tell there's no way to provide it with a queue afterwards), which means if you call sequence on it you'll get a segfault.
To work around this you can skip creating a Manager entirely and construct tensors and sequences from their own class constructors instead. Both classes clean up after themselves, so you still don't need to do any manual destruction. Subclassing Manager may seem like an easier pathway, but the private members still can't be accessed from the subclass. This does make me wonder why you would ever create a Manager from pre-existing Vulkan resources. Surely it'd be handy if the Manager could handle pure compute tasks and manage all of those resources while you manage other resources for e.g. rendering, but apparently it can't do that.
You'll also notice the way I've created shared pointers to my Vulkan resources is a bit odd. That's because VulkanApp (or at least its handles) is stack allocated (because I'm a troglodyte who is scared of heap allocations) and will free its resources once it goes out of scope. But when the last shared pointer to one of those resources goes out of scope, it calls the destructor of the object it points to, which would cause double frees. So I provide the shared pointer with a deleter that does nothing in order to avoid this.
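In isolation, the trick looks like this (a generic illustration, not code from the repo):

#include <memory>
#include <vulkan/vulkan.hpp>

int main() {
    vk::Device device{};  // stand-in for a handle owned elsewhere, e.g. by VulkanApp

    // Non-owning shared_ptr: the lambda deleter is a no-op, so when the last
    // copy goes out of scope the pointee is left alone instead of destroyed.
    auto nonOwning = std::shared_ptr<vk::Device>(&device, [](vk::Device*) {});
}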
Finally, I made a GitHub repo that demonstrates a simple example of all of this working together.