
News

Posted over 2 years ago
Deeper Into Software

I don’t feel like blogging about zink today, so here’s more about everyone’s favorite software implementation of Vulkan. The existing LLVMpipe architecture works like this from a top-down view:

- mesa / st - this is the GL/Gallium state tracker
- llvmpipe - this is the Gallium driver
- gallivm - this is the LLVM program compiler
- llvm - this is where the fragment shader runs

In short, everything is for the purpose of compiling LLVM programs which will draw/compute the desired result. Lavapipe makes a slight change:

- lavapipe - this is the Vulkan state tracker
- llvmpipe - this is the Gallium driver
- gallivm - this is the LLVM program compiler
- llvm - this is where the fragment shader runs

It’s that simple. Thus, any time a new feature is added to Lavapipe, what’s actually being done is plumbing that Vulkan feature through some number of layers to change how LLVM is executed. Some features, like samplerAnisotropy, require significant work at the gallivm layer just to toggle a boolean flag at the lavapipe level. Other changes, like KHR_timeline_semaphores, are entirely contained in Lavapipe.

What Are Timeline Semaphores?

Vulkan has a number of mechanisms for synchronization, including fences, events, and binary semaphores, all of which serve a specific purpose. For more concrete detail on all of them, please read the blog of an actual expert. The best and most awesome (don’t @ me, it’s not debatable) of these synchronization methods, however, is the timeline semaphore.

A timeline semaphore is an object that can be used to signal and wait on specific integer-assigned points in command execution, also known as timelines. Each queue submission can be accompanied by an array of timeline semaphores to wait on and an array to signal; command buffers in a given submission will wait before executing, then signal after they’re done. This enables parallel code design where one thread can assemble command buffers and submit them, and the GPU can be made to pause at certain points for referenced buffers/images to be populated by another thread before continuing with execution.

Typically, semaphores are managed through signals which pass through the kernel and hardware, meaning that “waiting” on a timeline is really just waiting on an ioctl (DRM_IOCTL_SYNCOBJ_TIMELINE_WAIT) to signal that the specified timeline id has occurred, which requires no additional host-side synchronization. Things get a bit trickier in software, however, as the kernel is not involved, so everything must be managed in the driver.

Lavapipe And Timelines

This was a todo item sitting on the list for a while because it was tricky to handle. The most visible problems here were:

- connecting timeline identifiers with queue submissions; timelines only need to be monotonic, not sequential, meaning that using something like a sliding array wouldn’t be very efficient
- the actual synchronization when threads are involved

After some thought and deliberation about my life choices up to this point, I decided to tackle this implementation. The methodology I selected was to add a monotonic counter to the internal command buffer submission and then create a series of per-object timeline “links” which would serve to match the counter to the timeline identifier. This would enable each timeline semaphore to maintain a singly-linked list of links each time it was submitted, and the list could then be pruned at any given time (referenced against the internal counter) to update the “current” timeline id and then evaluate whether a specified wait condition had passed. In the case where the condition had not passed, the timeline link could also store a handle to the fence from the llvmpipe queue submission that could be waited on directly.

Did It Work?

Almost on the first try, actually. But then I ran into a wall in CI while running piglit tests through zink. It turns out that the CTS tests are considerably less aggressive than the piglit ones for things like this: specifically, there don’t appear to be any cases where a single timeline has 16 threads all trying to wait on it at different values, iterating thousands of times over the course of a couple seconds. Oops. But that’s now taken care of, and conformance never felt so good.

The road to Vulkan 1.2 continues!
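For reference, here is what the feature looks like from the application side. This is not Lavapipe internals, just a minimal sketch of standard Vulkan 1.2 / VK_KHR_timeline_semaphore usage, assuming device, queue and cmdbuf already exist and that something else advances the timeline to the waited value:

  #include <stdint.h>
  #include <vulkan/vulkan.h>

  /* Sketch: create a timeline semaphore, submit work that waits for
   * value 1 and signals value 2, then block on the host until the
   * timeline reaches value 2. */
  void timeline_example(VkDevice device, VkQueue queue, VkCommandBuffer cmdbuf)
  {
      VkSemaphoreTypeCreateInfo type_info = {
          .sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO,
          .semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE,
          .initialValue = 0,
      };
      VkSemaphoreCreateInfo sem_info = {
          .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
          .pNext = &type_info,
      };
      VkSemaphore timeline;
      vkCreateSemaphore(device, &sem_info, NULL, &timeline);

      /* Some other submission (or vkSignalSemaphore on another thread)
       * is expected to bring the timeline to 1 eventually. */
      uint64_t wait_value = 1, signal_value = 2;
      VkTimelineSemaphoreSubmitInfo timeline_info = {
          .sType = VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO,
          .waitSemaphoreValueCount = 1,
          .pWaitSemaphoreValues = &wait_value,
          .signalSemaphoreValueCount = 1,
          .pSignalSemaphoreValues = &signal_value,
      };
      VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;
      VkSubmitInfo submit = {
          .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
          .pNext = &timeline_info,
          .waitSemaphoreCount = 1,
          .pWaitSemaphores = &timeline,
          .pWaitDstStageMask = &wait_stage,
          .commandBufferCount = 1,
          .pCommandBuffers = &cmdbuf,
          .signalSemaphoreCount = 1,
          .pSignalSemaphores = &timeline,
      };
      vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);

      /* Host-side wait until the GPU signals value 2. */
      VkSemaphoreWaitInfo wait_info = {
          .sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO,
          .semaphoreCount = 1,
          .pSemaphores = &timeline,
          .pValues = &signal_value,
      };
      vkWaitSemaphores(device, &wait_info, UINT64_MAX);
  }

On hardware drivers that vkWaitSemaphores call bottoms out in the kernel ioctl mentioned above; in Lavapipe it is exactly the kind of host-side wait the timeline links have to resolve.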
Posted over 2 years ago
This is a title

I’m back. Where did I go? My birthday passed recently, so I gifted myself a couple weeks off from blogging. Feels good. For today, this is a Lavapipe blog.

What’s New With Lavapipe?

Lots. Let’s check out what conformant features were added just in July:

- EXT_line_rasterization
- EXT_vertex_input_dynamic_state
- EXT_extended_dynamic_state2
- EXT_color_write_enable
- features.strictLines
- features.shaderStorageImageExtendedFormats
- features.shaderStorageImageReadWithoutFormat
- features.samplerAnisotropy
- KHR_timeline_semaphores

Also under the hood now is a new 2D rasterizer from VMware which yields “a 2x to 3x performance improvement for 2D workloads”.

Why Aren’t You Using Lavapipe Yet?

Have a big Vulkan-using project? Do you constantly have to worry about breakages from all manner of patches being merged without testing? Can’t afford or too lazy to set up and maintain actual hardware for testing? Why not Lavapipe? Seriously, why not? If there are features missing that you need for your project, open tickets so we know what to work on.
Posted over 2 years ago
Part 1, Part 2, Part 3

After getting thoroughly nerd-sniped a few weeks back, we now have FreeBSD support through qemu in the freedesktop.org ci-templates. This is possible through the qemu image generation we have had for quite a while now. So let's see how we can easily add a FreeBSD VM (or other distributions) to our gitlab CI pipeline:

  .freebsd:
    variables:
      FDO_DISTRIBUTION_VERSION: '13.0'
      FDO_DISTRIBUTION_TAG: 'freebsd.0'  # some value for humans to read

  build-image:
    extends:
      - .freebsd
      - .fdo.qemu-build@freebsd
    variables:
      FDO_DISTRIBUTION_PACKAGES: "curl wget"

Now, so far this may all seem quite familiar. And indeed, this is almost exactly the same process as for normal containers (see Part 1), the only difference is the .fdo.qemu-build base template. Using this template means we build an image babushka: our desired BSD image is actually a QEMU RAW image sitting inside another generic container image. That latter image only exists to start the QEMU image and set up the environment if need be; you don't need to care which distribution it runs (Fedora for now).

Because of the nesting, we need to handle this accordingly in our script: tag for the actual test job - we need to start the image and make sure our jobs are actually built within it. The templates set up an ssh alias "vm" for this and the vmctl script helps to do things on the vm:

  test-build:
    extends:
      - .freebsd
      - .fdo.distribution-image@freebsd
    script:
      # start our QEMU image
      - /app/vmctl start

      # copy our current working directory to the VM
      # (this is a yaml multiline command to work around the colon)
      - |
        scp -r $PWD vm:

      # Run the build commands on the VM and if they succeed, create a .success file
      - /app/vmctl exec "cd $CI_PROJECT_NAME; meson builddir; ninja -C builddir" && touch .success || true

      # Copy results back to our run container so we can include them in artifacts:
      - |
        scp -r vm:$CI_PROJECT_NAME/builddir .

      # kill the VM
      - /app/vmctl stop

      # Now that we have cleaned up: if our build job before
      # failed, exit with an error
      - [[ -e .success ]] || exit 1

Now, there's a bit to unpack but with the comments above it should be fairly obvious what is happening. We start the VM, copy our working directory over and then run a command on the VM before cleaning up. The reason we use touch .success is simple: it allows us to copy things out and clean up before actually failing the job. Obviously, if you want to build any other distribution you just swap the freebsd out for fedora or whatever - the process is the same. libinput has been using fedora qemu images for ages now.
Posted over 2 years ago
Thanks to the work done by José Expósito, libinput 1.19 will ship with a new type of gesture: Hold Gestures.

So far libinput supported swipe (moving multiple fingers in the same direction) and pinch (moving fingers towards each other or away from each other). These gestures are well-known, commonly used, and familiar to most users. For example, GNOME 40 recently increased its use of touchpad gestures to switch between workspaces, etc. Swipe and pinch gestures require movement; it was not possible (for callers) to detect fingers on the touchpad that don't move. This gap is now filled by Hold gestures. These are triggered when a user puts fingers down on the touchpad without moving the fingers. This allows for some new interactions, and we had two specific ones in mind:

- hold-to-click, a common interaction on older touchscreen interfaces where holding a finger in place eventually triggers the context menu. On a touchpad, a three-finger hold could zoom in, or do dictionary lookups, or kill a kitten. Whatever matches your user interface most, I guess.
- the ability to stop kinetic scrolling. libinput does not actually provide kinetic scrolling, it merely provides the information needed in the client to do it there: specifically, it tells the caller when a finger was lifted off a touchpad at the end of a scroll movement. It's up to the caller (usually: the toolkit) to implement the kinetic scrolling effects. One missing piece was that while libinput provided information about lifting the fingers, it didn't provide information about putting fingers down again later - a common way to stop scrolling on other systems. Hold gestures are intended to address this: a hold gesture triggered after a flick with two fingers can now be used by callers (read: toolkits) to stop scrolling.

Now, one important thing about hold gestures is that they will generate a lot of false positives, so be careful how you implement them. The vast majority of interactions with the touchpad will trigger some movement - once that movement hits a certain threshold the hold gesture will be cancelled and libinput sends out the movement events. Those events may be tiny (depending on touchpad sensitivity) so getting the balance right for the aforementioned hold-to-click gesture is up to the caller.

As usual, the required bits to get hold gestures into the wayland protocol are either in the works, mid-flight or merge-ready, so expect this to hit the various repositories over the medium-term future.
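For callers, the new events slot into the usual libinput gesture handling. A rough sketch of what consuming them might look like (the HOLD_BEGIN/HOLD_END event types are from the libinput 1.19 API; the handler logic itself is only illustrative):

  #include <stdio.h>
  #include <libinput.h>

  /* Sketch of a caller's event loop handling the new hold events.
   * "li" is an already-initialized struct libinput context. */
  static void handle_events(struct libinput *li)
  {
      struct libinput_event *ev;

      libinput_dispatch(li);
      while ((ev = libinput_get_event(li)) != NULL) {
          switch (libinput_event_get_type(ev)) {
          case LIBINPUT_EVENT_GESTURE_HOLD_BEGIN: {
              struct libinput_event_gesture *g =
                  libinput_event_get_gesture_event(ev);
              /* e.g. stop an ongoing kinetic scroll animation here */
              printf("hold begin with %d fingers\n",
                     libinput_event_gesture_get_finger_count(g));
              break;
          }
          case LIBINPUT_EVENT_GESTURE_HOLD_END: {
              struct libinput_event_gesture *g =
                  libinput_event_get_gesture_event(ev);
              /* a cancelled hold means the fingers moved, so other
               * events (pointer motion, scroll, swipe, ...) follow */
              if (!libinput_event_gesture_get_cancelled(g))
                  printf("hold ended without movement\n");
              break;
          }
          default:
              break;
          }
          libinput_event_destroy(ev);
      }
  }

The cancelled flag on the END event is what lets a toolkit distinguish a deliberate hold from the false positives mentioned above.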
Posted over 2 years ago
I have not talked about raytracing in RADV for a while, but after some procrastination and being focused on some other things I recently got back to it and achieved my next milestone. In particular I have been hacking away at CTS and got to a point where CTS on dEQP-VK.ray_tracing.* runs to completion without crashes or hangs. Furthermore, I got the passrate to 90% of non-skipped tests. So we’re finally getting somewhere close to usable. As further evidence that it is usable, my fixes for CTS also fixed the corruption issues in Quake 2 RTX (Github version), delivering this image:

Of course not everything is perfect yet. Besides the not-100% CTS passrate, it has about half the Windows performance at 4k right now and we still have some feature gaps to make it really usable for most games.

Why is it slow?

TL;DR: Because I haven’t optimized it yet and implemented every shortcut imaginable.

AMD raytracing primer

Raytracing with Vulkan works in two steps:

- You build a giant acceleration structure that contains all your geometry. Typically this ends up being some kind of tree, typically a Bounding Volume Hierarchy (BVH).
- Then you trace rays using some traversal shader through the acceleration structure you just built.

With RDNA2 AMD started accelerating this by adding an instruction that allows doing intersection tests between a ray and a single BVH node, where the BVH node can be either

- a triangle
- a box node specifying 4 AABB boxes

Of course this isn’t quite enough to deal with all geometry types in Vulkan, so we also add two more:

- an AABB box
- an instance of another BVH combined with a transformation matrix

Building the BVH

With a search tree like a BVH it is very possible to make trees that are very useless. As an example, consider a binary search tree that is very unbalanced. We can have similarly bad things with a BVH, including making it unbalanced or having overlapping bounding volumes. And my implementation is the simplest thing possible: the input geometry becomes the leaves in exactly the same order and then internal nodes are created just as you’d draw them. That is probably decently fast at building the BVH but surely results in a terrible BVH to actually use.

BVH traversal

After we built a BVH we can start tracing some rays. In rough pseudocode the current implementation is

  stack = empty
  insert root node into stack
  while stack is not empty:
      node = pop a node from the stack
      if we left the bottom level BVH:
          reset ray origin/direction to initial origin/direction
      result = amd_intersect(ray, node)
      switch node type:
          triangle:
              if result is a hit:
                  load some node data
                  process hit
          box node:
              for each box hit:
                  push child node on stack
          custom node 1 (instance):
              load node data
              push the root node of the bottom BVH on the stack
              apply transformation matrix to ray origin/direction
          custom node 2 (AABB geometry):
              load node data
              process hit

We already knew there were inherently going to be some difficulties:

- We have a poor BVH, so we’re going to do way more iterations than needed.
- Calling shaders as a result of hits is going to result in some divergence.

Furthermore, this also clearly shows some difficulties with how we approached the intersection instruction. Some advantages of the intersection instruction are that it avoids divergence in computing collisions if we have different node types in a subgroup and that it is cheaper when there are only a few lanes active. (A single CU can process one ray/node intersection per cycle, modulo memory latency, while it can process an ALU instruction on 64 lanes per cycle.) However, even if it avoids the divergence in the collision computation, we still introduce a ton of divergence in the processing of the results of the intersection. So we are still doing pretty badly here.

A fast GPU traversal stack needs some work too

Another thing to be noted is our traversal stack size. According to the Vulkan specification, a bottom level acceleration structure should support 2^24 - 1 triangles and a top level acceleration structure should support 2^24 - 1 bottom level structures. Combined with a tree with 4 children in each internal node, we can end up with a tree depth of about 24 levels. In each internal node iteration of our loop we pop one element and push up to 4 elements, so at the deepest level of traversal we could end up with a 72 entry stack. Assuming these are 32-bit node identifiers, that ends up with 288 bytes of stack per lane, or ~18 KiB per 64-lane workgroup (the minimum which could possibly keep a CU busy with an ALU-only workload). Given that we have 64 KiB of LDS per CU (yes, I am using LDS since there is no divergent dynamic register addressing), that leaves only 3 workgroups per CU, leaving very few options for parallelism between different hardware execution units (e.g. the ALU and the texture unit that executes the ray intersections) or latency hiding of memory operations. So ideally we get this stack size down significantly.

Where do we go next?

The first step is to get CTS passing and to get an initial merge request into upstream Mesa. As a follow-on to that I’d like to get a minimal prototype going for some DXR 1.0 games with vkd3d-proton just to make sure we have the right feature coverage. After that we’ll have to do all the traversal optimizations. I’ll probably implement a bunch of instrumentation so I actually have a clue what to optimize. This is where having some runnable games really helps get the right idea about performance bottlenecks. Finally, with some luck, better shaders to build a BVH will materialize as well.
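As a small postscript, the stack-size estimate from the traversal section is easy to sanity-check. Nothing here is RADV code, just the arithmetic from that paragraph:

  #include <stdio.h>

  int main(void)
  {
      /* ~2^24 leaves with 4-wide internal nodes per BVH, and two BVH
       * levels (top + bottom), gives a depth of roughly 24. */
      const int max_depth = 24;
      /* each box node pops one entry and pushes up to four */
      const int net_growth_per_level = 4 - 1;
      const int stack_entries = net_growth_per_level * max_depth;  /* 72 */
      const int bytes_per_lane = stack_entries * 4;                /* 32-bit ids */
      const int bytes_per_workgroup = bytes_per_lane * 64;         /* 64 lanes */

      printf("%d entries, %d bytes/lane, ~%d KiB per 64-lane workgroup\n",
             stack_entries, bytes_per_lane, bytes_per_workgroup / 1024);
      return 0;
  }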
Posted over 2 years ago
If you want to write an X application, you need to use some library that speaks the X11 protocol. For a long time this meant libX11, often called xlib, which - like most things about X - is a fantastic bit of engineering that is very much a product of its time with some confusing baroque bits. Overall it does a very nice job of hiding the icky details of the protocol from the application developer.

One of the details it hides has to do with how resource IDs are allocated in X. A resource ID (an XID, in the jargon) is a 32-bit (really, 29-bit) integer that names a resource - window, colormap, what have you. Those 29 bits are split up netmask/hostmask style, where the top 8 or so uniquely identify the client, and the rest identify the resource belonging to that client. When you create a window in X, what you really tell the server is "I want a window that's initially this size, this background color (etc.) and from now on when I say (my client id + 17) I mean that window." This is great for performance because it means resource allocation is assumed to succeed and you don't have to wait for a reply from the server.

Key to all this is that in xlib the XID is the return value from the call that issues the resource creation request. Internally the request gets queued into the protocol's write buffer, but the client can march ahead and issue the next few commands as if creation had succeeded - because it probably did, and if it didn't you're probably going to crash anyway.

So to allocate XIDs the client just marches forward through its XID range. What happens when you hit the end of the range? Before X11R4, you'd crash, because xlib doesn't keep track of which XIDs it's allocated, just the lowest one it hasn't allocated yet. Starting in R4 the server added an extension called XC-MISC that lets the client ask the server for a list of unused XIDs, so when xlib hits the end of the range it can request a new range from the server.

But. UI programming tends to want threads, and xlib is perhaps not the most thread-friendly. So XCB was invented, which sacrifices some of xlib's ease of use for a more direct binding to the protocol and (in theory) an explicitly thread-safe design. We then modified xlib and XCB to coexist in the same process, using the same I/O buffers, reply and event management, etc.

This literal reflection of the protocol into the API has consequences. In XCB, unlike xlib, XID generation is an explicit step. The client first calls into XCB to allocate the XID, and then passes that XID to the creation request in order to give the resource a name.

Which... sorta ruins that whole thread-safety thing.

Let's say you call xcb_generate_id in thread A and the XID it returns is the last one in your range. Then thread B schedules in and tries to allocate another XID. You'll ask the server for a new range, but since thread A hasn't called its resource creation request yet, from the server's perspective that "allocated" XID looks like it's still free! So now, whichever thread issues their resource creation request second will get BadIDChoice thrown at them if the other thread's resource hasn't been destroyed in the interim.

A library that was supposed to be about thread safety baked a thread safety hazard into the API. Good work, team.

How do you fix this without changing the API? Maybe you could keep a bitmap on the client side that tracks XID allocation; that's only like 256KB worst case, you can grow it dynamically, and most clients don't create more than a few dozen resources anyway. Make xcb_generate_id consult that bitmap for the first unallocated ID, and mark it used when it returns. Then track every resource destruction request and zero it back out of the bitmap. You'd only need XC-MISC if some other client destroyed one of your resources and you were completely out of XIDs otherwise.

And you can implement this, except. One, XCB has zero idea what a resource destruction request is, that's simply not in the protocol description. Not a big deal, you can fix that, there's only like forty destructors you'd need to annotate. But then two, that would only catch resource destruction calls that flow through XCB's protocol binding API, which xlib does not; xlib instead pushes raw data through xcb_writev. So now you need to modify every client library (libXext, libGL, ...) to inform XCB about resource destruction.

Which is doable. Tedious. But doable.

I think. I feel a little weird writing about this because: surely I can't be the first person to notice this.
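Purely as an illustration of the bitmap idea sketched above - none of these names exist in XCB, and the storage allocation is left out - the allocator side might look something like this:

  #include <stdint.h>

  /* Hypothetical client-side XID bitmap: one bit per XID in the
   * client's range, scanned for the first free slot on generate,
   * cleared again when a destruction request is seen. */
  struct xid_bitmap {
      uint32_t base;   /* first XID in the client's range */
      uint32_t count;  /* number of XIDs in the range */
      uint8_t *bits;   /* count bits, 1 = allocated */
  };

  static uint32_t xid_bitmap_alloc(struct xid_bitmap *b)
  {
      for (uint32_t i = 0; i < b->count; i++) {
          if (!(b->bits[i / 8] & (1u << (i % 8)))) {
              b->bits[i / 8] |= 1u << (i % 8);
              return b->base + i;
          }
      }
      return 0; /* range exhausted: only now fall back to XC-MISC */
  }

  static void xid_bitmap_free(struct xid_bitmap *b, uint32_t xid)
  {
      uint32_t i = xid - b->base;
      if (i < b->count)
          b->bits[i / 8] &= ~(1u << (i % 8));
  }

The hard part, as the post notes, isn't this allocator - it's getting every destruction path (including xlib's raw writes) to actually call the free side.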
Posted over 2 years ago
Debugging programs using printf statements is not a technique that everybody appreciates. However, it can be quite useful and sometimes necessary depending on the situation. My past work on air traffic control software involved using several forms of printf debugging many times. The distributed and time-sensitive nature of the system being studied made it inconvenient or simply impossible to reproduce some issues and situations if one of the processes was stalled while it was being debugged.

In the context of Vulkan and graphics in general, printf debugging can be useful to see what shader programs are doing, but some people may not be aware it’s possible to “print” values from shaders. In Vulkan, shader programs are normally created in a high level language like GLSL or HLSL and then compiled to SPIR-V, which is then passed down to the driver and compiled to the GPU’s native instruction set. That final binary, many times outside the control of user applications, runs in a quite closed and highly parallel environment without many options to observe what’s happening and without text input and output facilities. Fortunately, tools like glslang can generate some debug information when compiling shaders to SPIR-V, and other tools like Nsight can use that information to let you debug shaders being run. Still, being able to print the values of different expressions inside a shader can be an easy way to debug issues. With the arrival of Ray Tracing, this is even more useful than before. In ray tracing pipelines, the shaders being executed and the resources being used are chosen based on the scene geometry and the origin and direction of the ray being traced. printf debugging can let you see where you are and what you’re using.

So how do you print values from shaders? Vulkan’s debug printf is implemented as part of the Validation Layers and the general procedure is well documented. If you were to implement this kind of mechanism yourself, you’d likely use a storage buffer to save the different values you want to print while shader invocations are running and, later, you’d go over the contents of that buffer and print the associated message with each value or values. And that is, essentially, what debug printf does, but in a very convenient and automated way so that you don’t have to deal with the gory details and corner cases. In a GLSL shader, simply:

- Enable the GL_EXT_debug_printf extension.
- Sprinkle your code with debugPrintfEXT() calls.
- Use the Vulkan Configurator that’s part of the SDK, or manually edit vk_layer_settings.txt for your app, enabling VK_VALIDATION_FEATURE_ENABLE_DEBUG_PRINTF_EXT.
- Normally, disable other validation features so as not to get too much output.
- Take a look at the debug report or debug utils info messages containing printf results, or set printf_to_stdout to true so printf messages are sent to stdout directly.

You can find an example shader in the validation layers test code. The debug printf feature has helped me a lot in the past, so I wanted to make sure it’s widely known and used.

Due to the observer effect, you may end up in situations where your code works correctly when enabling debug printf but incorrectly without it. This may be due to multiple reasons but one of the main ones I’ve encountered is improper synchronization. When debug printf is used, the layers use additional synchronization primitives to sync the contents of auxiliary buffers, which can mask synchronization bugs present in the app.
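If you prefer to keep everything in code instead of editing vk_layer_settings.txt, the same printf feature can also be requested programmatically at instance creation through VK_EXT_validation_features. A minimal sketch, with the rest of the instance setup simplified and error handling omitted:

  #include <vulkan/vulkan.h>

  /* Sketch: enable debug printf via VK_EXT_validation_features at
   * instance creation, with the Khronos validation layer turned on. */
  VkInstance create_instance_with_printf(void)
  {
      const char *layers[] = { "VK_LAYER_KHRONOS_validation" };

      VkValidationFeatureEnableEXT enables[] = {
          VK_VALIDATION_FEATURE_ENABLE_DEBUG_PRINTF_EXT,
      };
      VkValidationFeaturesEXT features = {
          .sType = VK_STRUCTURE_TYPE_VALIDATION_FEATURES_EXT,
          .enabledValidationFeatureCount = 1,
          .pEnabledValidationFeatures = enables,
      };

      VkApplicationInfo app = {
          .sType = VK_STRUCTURE_TYPE_APPLICATION_INFO,
          .apiVersion = VK_API_VERSION_1_1,
      };
      VkInstanceCreateInfo info = {
          .sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO,
          .pNext = &features,
          .pApplicationInfo = &app,
          .enabledLayerCount = 1,
          .ppEnabledLayerNames = layers,
      };

      VkInstance instance = VK_NULL_HANDLE;
      vkCreateInstance(&info, NULL, &instance); /* check the result in real code */
      return instance;
  }

The printf results then arrive through the debug report/debug utils messengers mentioned above, so you still want a debug messenger registered to actually see them.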
Finally, RenderDoc 1.14, released at the end of May, also supports Vulkan’s shader printf statements and will let you take a look at the print statements produced during a draw call. Furthermore, the print statements don’t have to be present in the original shader. You can also use the shader edit system to insert them on the fly and use them to debug the results of a particular shader invocation. Isn’t that awesome? Great work by Baldur Karlsson as always.

PS: As a happy coincidence, just yesterday LunarG published a white paper on Vulkan’s debug printf with additional information on this excellent feature. Be sure to check it out!