
News

Posted over 2 years ago
22.0

I always do one of these big roundups for each Mesa release, so here's what you can expect to see from zink in the upcoming release:

- fewer hangs on RADV
- massively improved usability on NVIDIA
- greatly improved performance with unsupported texture download formats (e.g., CS:GO, L4D2)
- more extensions: ARB_sparse_texture, ARB_sparse_texture2, ARB_sparse_texture_clamp, EXT_memory_object, EXT_memory_object_fd, GL_EXT_semaphore, GL_EXT_semaphore_fd
- ~1000% improved glxgears performance (be sure to run with --i-know-this-is-not-a-benchmark to see the real speed)
- tons and tons and tons of bug fixes

All around looking like another great release.

I Hate gl_PointSize And So Can You

Yes, we're here. After literally years of awfulness, I've finally solved (for good) the debacle that is point size conversion from GL to Vulkan.

What's so awful about it, you might be asking. How hard can it be to just add gl_PointSize to a shader, you follow up with as you push your glasses higher up your nose.

Allow me to explain.

In Vulkan, there is exactly one method for setting the size of points: the gl_PointSize shader output controls it, and that's it.

In OpenGL (core profile):

    14.4 Points
    If program point size mode is enabled, the derived point size is taken from the (potentially clipped) shader built-in gl_PointSize written by the last vertex processing stage and clamped to the implementation-dependent point size range. If the value written to gl_PointSize is less than or equal to zero, or if no value was written to gl_PointSize, results are undefined. If program point size mode is disabled, the derived point size is specified with the command void PointSize( float size );

    11.2.3.4 Tessellation Evaluation Shader Outputs
    Tessellation evaluation shaders have a number of built-in output variables used to pass values to equivalent built-in input variables read by subsequent shader stages or to subsequent fixed functionality vertex processing pipeline stages. These variables are gl_Position, gl_PointSize, gl_ClipDistance, and gl_CullDistance, and all behave identically to equivalently named vertex shader outputs.

    11.3.4.4 Geometry Shader Inputs
    Structure member gl_PointSize holds the per-vertex point size written by the upstream shader to the built-in output variable gl_PointSize. If the upstream shader does not write gl_PointSize, the value of gl_PointSize is undefined, regardless of the value of the enable PROGRAM_POINT_SIZE.

In short, if PROGRAM_POINT_SIZE is enabled, then points are sized based on the gl_PointSize shader output of the last vertex stage, but only if the last stage is a vertex shader or tessellation evaluation, because if there's a geometry shader, you ignore PROGRAM_POINT_SIZE unconditionally.

In OpenGL ES (versions 2.0, 3.0, 3.1):

    (3.3 | 3.4 | 13.3) Points
    The point size is taken from the shader built-in gl_PointSize written by the vertex shader, and clamped to the implementation-dependent point size range.

In OpenGL ES (version 3.2):

    13.5 Points
    The point size is determined by the last vertex processing stage. If the last vertex processing stage is not a vertex shader, the point size is 1.0. If the last vertex processing stage is a vertex shader, the point size is taken from the shader built-in gl_PointSize written by the vertex shader, and is clamped to the implementation-dependent point size range.
Thus for an ES context, the point size always comes from the last vertex stage, which means it can be anything it wants to be if that stage is a vertex shader and cannot be written to for all other stages because it is not a valid output (this last part is going to be really funny in a minute or two).

What do the specs agree on?

- If a vertex shader is the last vertex stage, it can write gl_PointSize

Literally that's it. Awesome.

Zink

As we know, Vulkan has a very simple and clearly defined model for point size:

    The point size is taken from the (potentially clipped) shader built-in PointSize written by:
    • the geometry shader, if active;
    • the tessellation evaluation shader, if active and no geometry shader is active;
    • the vertex shader, otherwise
    - 27.10. Points

It really can be that simple.

So one would think that we can just hook up some conditionals based on the GL rules and then export the correct value. That would be easy. Simple. It would make sense.

HAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHA
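For illustration only, here is roughly what those conditionals would have to encode, written as a small C sketch with made-up names; this is not zink's actual code, just the rules above translated literally:

    #include <stdbool.h>

    /* All names here are invented for illustration. */
    enum psize_source {
       PSIZE_FROM_SHADER,    /* gl_PointSize written by the last vertex stage */
       PSIZE_FROM_STATE,     /* the glPointSize() value */
       PSIZE_ONE,            /* constant 1.0 */
       PSIZE_UNDEFINED,      /* the spec shrugs */
    };

    enum last_vertex_stage { STAGE_VERTEX, STAGE_TESS_EVAL, STAGE_GEOMETRY };

    static enum psize_source
    get_point_size_source(bool is_es, bool program_point_size,
                          enum last_vertex_stage last_stage,
                          bool shader_writes_psize)
    {
       if (is_es) {
          /* ES: only a vertex shader can (legally) provide a point size;
           * ES 3.2 says everything else gets 1.0. */
          if (last_stage == STAGE_VERTEX)
             return shader_writes_psize ? PSIZE_FROM_SHADER : PSIZE_UNDEFINED;
          return PSIZE_ONE;
       }

       /* Desktop GL: a geometry shader ignores PROGRAM_POINT_SIZE entirely. */
       if (last_stage == STAGE_GEOMETRY)
          return shader_writes_psize ? PSIZE_FROM_SHADER : PSIZE_UNDEFINED;

       /* Vertex or tessellation evaluation shader last: the enable decides. */
       if (program_point_size)
          return shader_writes_psize ? PSIZE_FROM_SHADER : PSIZE_UNDEFINED;
       return PSIZE_FROM_STATE;
    }

The punchline is that Vulkan only has the shader-output path, so every one of those sources has to be funneled through it somehow.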
XFB

It gets worse (obviously).

gl_PointSize is a valid XFB varying, which means it must be exported correctly to the transform feedback buffer. For the ES case, it's simple, but for desktop GL, there's a little something called PROGRAM_POINT_SIZE state which totally fucks that up. Because, as we know, Vulkan has exactly one way of setting point size, and it's the shader variable.

Thus, if there is a desktop GL context using a vertex shader as its last vertex stage for a draw, and if that shader has its own gl_PointSize value, this value must be exported for XFB. But not used for point rasterization.

It's Actually Even Worse Than That

…Because in order to pass CTS for ES 3.2, your implementation also has to be able to violate spec.

Remember above when I said it was going to be funny that gl_PointSize is not a legal output for non-vertex stages in ES contexts? CTS explicitly has "wide points" tests which verify illegal point sizes that are exported by the tessellation and geometry shader stages.

Isn't that cool?

Also, let's be reasonable people for a moment, who actually wants a point that's just one pixel? Nobody can see that on their 8k display.

To Sum Up

I hate GL point size, and so should you.
Posted over 2 years ago
Checking In

I keep meaning to blog, but then I get sidetracked by not blogging. Truly a tough life. So what's new in zink-land?

Nothing too exciting. Mostly bug fixes. I managed to sneak ARB_sparse_texture_clamp in for zink just before the branchpoint, so all the sparse texturing features supported by Mesa will be supported by zink. But only on NVIDIA since they're the only driver that fully supports Vulkan sparse texturing.

The past couple days I've been doing some truly awful things with gl_PointSize to try and make this conformant for all possible cases. It's a real debacle, and I'll probably post more in-depth about it so everyone can get a good chuckle.

The one unusual part of my daily routine is that I haven't rebased my testing branch in at least a couple weeks now since I've been trying to iron out regressions. Will I find that everything crashes and fails as soon as I do? Probably.

More posts to come.
Posted over 2 years ago
There was an article on Open for Everyone today about Nobara, a Fedora-based distribution optimized for gaming. To be clear, I have no beef with Tomas Crider or any other creator/maintainer of a distribution targeting a specific use case. In fact, they are usually trying to solve or work around real problems and make things easier for people. That said, I have for years felt that the need for these things is a failing in itself, and it has been a goal for me in the context of Fedora Workstation to figure out what we can do to remove the need for 'usecase distros'. So I thought it would be of interest if I talk a bit about how I have been viewing these things and the concrete efforts we have taken to reduce the need for usecase-oriented distributions. It is worth noting that the usecase distributions have of course proven useful for this too, in the sense that they to some degree also function as a very detailed 'bug report' for why the general-case OS is not enough.

Before I start, you might say: but isn't Fedora Workstation a usecase OS too? You often talk about having a developer focus? Yes, developers are something we care deeply about, but for instance that doesn't mean we pre-install 50 IDEs in Fedora Workstation. Fedora Workstation should be a great general-purpose OS out of the box, and then we should have tools like GNOME Software and Toolbx available to let you quickly and easily tweak it into your ideal development system. But at the same time, by being a general-purpose OS at heart, it should be equally easy to install Steam and Lutris to start gaming, or install Carla and Ardour to start doing audio production, or install OBS Studio to do video streaming.

Looking back over the years, one of the first conclusions I drew from looking at all the usecase distributions out there was that they were often mostly the standard distro, but with a carefully procured list of pre-installed software; for instance the old Fedora Games spin was exactly that, a copy of Fedora with a lot of games pre-installed. So why was this valuable to people? Those of us who have been around for a while remember that the average Linux 'app store' was a very basic GUI which listed available software by name (usually quite cryptic names) and at best with a small icon. There was almost no other metadata available and search functionality was limited at best. So finding software was not simple, and it was usually more a case of 'search the internet and if you find something interesting, see if it's packaged for your distro'. So the usecase distros that focused on procured pre-installed software, be that games, pro-audio software, graphics tools or whatever their focus was, were basically responding to the fact that finding software was non-trivial, and a lot of people maybe missed out on software that could be useful to them since they simply never learned about its existence.

So when we kicked off the creation of GNOME Software, one of the big focuses early on was to create a system for providing good metadata and displaying that metadata in a useful manner. As an end user the most obvious change was of course the richer UI of GNOME Software, but maybe just as important was the creation of AppStream, a specification for how applications ship metadata that allows GNOME Software and others to display much more in-depth information about the application, provide screenshots and so on.
So I do believe that by working on a better 'app store' story for Linux, both through GNOME Software as the actual UI and by working with many stakeholders in the Linux ecosystem to define metadata standards like AppStream, we made software a lot more discoverable on Linux and thus reduced the need for pre-loading significantly. This work also provided an important baseline for things like Flathub to thrive, as it then had a clear way to provide metadata about the applications it hosts. We do continue to polish that user experience on an ongoing basis, but I do feel we have already reduced the need to pre-load a ton of software very significantly with this. Of course another aspect of this is application availability, which is why we worked to ensure things like Steam are available in GNOME Software on Fedora Workstation, and which we have now expanded on by starting to include more and more software listings from Flathub. These things make it easy for our users to find the software they want, while at the same time we are still staying true to our mission of only shipping free software by default in Fedora.

The second major reason for usecase distributions has been that the generic version of the OS didn't really have the right settings or setup to handle an important usecase. I think pro-audio is the best example of this, where usecase distros like Fedora Jam or Ubuntu Studio popped up. Pre-installing a lot of relevant software was definitely part of their DNA too, but there were also other issues involved, like the need for a special audio setup with JACK and often also kernel real-time patches applied. When we decided to include pro-audio support in PipeWire, resolving these issues was a big part of it. I strongly believe that we should be able to provide a simple and good out-of-the-box experience for musicians and audio engineers on Linux without needing the OS to be specifically configured for the task. The strong and positive response we have gotten from the pro-audio community to PipeWire points, I believe, to us moving in the right direction there. I am not claiming things are 100% yet, but we feel very confident that we will get there with PipeWire and make the pro-audio folks full-fledged members of the Fedora Workstation community. Interestingly, we also spent quite a bit of time as part of this trying to ensure the pro-audio tools in Fedora have proper AppStream metadata so that they would appear in GNOME Software. One area we are still looking at is the real-time kernel stuff; our current take is that the remaining unmerged patches are not strictly needed anymore, as most of the important stuff has already been merged, but we are monitoring it as we keep developing and benchmarking PipeWire for the pro-audio usecase.

Another reason I often saw driving the creation of a usecase distribution is special hardware support, and the hardware doesn't necessarily even need to be that special: the NVidia driver, for instance, has triggered a lot of these attempts. The NVidia driver is challenging on a lot of levels and has been something we have been constantly working on. There were technical issues, for instance the NVidia driver and Mesa fighting over who owned the OpenGL.so implementation, which we fixed by the introduction of glvnd a few years ago. But for a distro like Fedora that also cares deeply about free and open source software, it also provided us with a lot of philosophical challenges.
We had to answer the question of how we could on one side make sure our users had easy access to the driver without abandoning our principle of Fedora only shipping free software out of the box. I think we found a good compromise today, where the NVidia driver is available in Fedora Workstation for easy install through GNOME Software, but at the same time we default to Nouveau out of the box. That said, this is a part of the story where we are still hard at work to improve things further, and while I am not at liberty to mention any details, I think I can at least mention that we are meeting with our engineering counterparts at NVidia on almost a weekly basis to discuss how to improve things, not just for graphics, but around compute and other shared areas of interest. The most recent public result of that collaboration was of course the XWayland support in recent NVidia drivers, but I promise you that this is something we keep focusing on, and I expect that we will be able to share more cool news and important progress over the course of the year, both for users of the NVidia binary driver and for users of Nouveau.

What are we still looking at in terms of addressing issues like this? Well, one thing we are talking about is whether there is value in, or need for, a facility to install specific software based on detected hardware or software. For instance, if we detect a high-end gaming mouse connected to your system, should we install Piper/ratbag or at least make GNOME Software suggest it? And if we detect that you installed Lutris and Steam, are there other tools we should recommend you install, like the gamemode GNOME Shell extension? It is a somewhat hard question to answer, which is why we are still pondering it; on one side it seems like a nice addition, but such connections would mean that we need to have a big database we constantly maintain, which isn't trivial, and also having something running on your system to, let's say, check for those high-end mice does add a little overhead that might be a waste for many users.

Another area that we are looking at is the issue of codecs. We made a big effort a couple of years ago and got AC3, MP3, AAC and MPEG-2 video cleared for inclusion, and also got the OpenH264 implementation from Cisco made available. That solved a lot of issues, but today, with so many more people getting into media creation, I believe we need to take another stab at it and for instance try to get reliable hardware-accelerated video encoding and decoding. I am not ready to announce anything, but we have a few ideas and leads we are looking at for how to move the needle there in a significant way.

So to summarize, I am not criticizing anyone for putting together what I call usecase distros, but at the same time I really want to get to a point where they are rarely needed, because we should be able to cater to most needs within the context of a general-purpose Linux operating system. That said, I do appreciate the effort of these distro makers, both in terms of trying to help users have a better experience on Linux and in indirectly helping us showcase potential solutions and highlight the major pain points that still need addressing in a general-purpose Linux desktop operating system.
Posted over 2 years ago
In defense of NIR

NIR has been an integral part of the Mesa driver stack for about six or seven years now (depending on how you count) and a lot has changed since NIR first landed at the end of 2014 and I wrote my initial NIR notes. Also, for various reasons, I've had to give my NIR elevator pitch a few times lately. I think it's time for a new post. This time on why, after working on this mess for seven years, I still think NIR was the right call.

A bit of history

Shortly after I joined the Mesa team at Intel in the summer of 2014, I was sitting in the cube area asking Ken questions, trying to figure out how Mesa was put together, and I asked, "Why don't you use LLVM?" Suddenly, all eyes turned towards Ken and myself and I realized I'd poked a bear. Ken calmly explained a bunch of the packaging/shipping issues around having your compiler in a different project as well as issues radeonsi had run into with apps bundling their own LLVM that didn't work. But for the more technical question of whether or not it was a good idea, his answer was something about trade-offs and how it's really not clear if LLVM would really gain them much.

That same summer, Connor Abbott showed up as our intern and started developing NIR. By the end of the summer, he had a bunch of data structures, a few mostly untested passes, and a validator. He also had most of a GLSL IR to NIR pass which mostly passed validation. Later that year, after Connor had gone off to school, I took over NIR, finished the Intel scalar back-end NIR consumer, fixed piles of bugs, and wrote out-of-SSA and a bunch of optimization passes to get it to the point where we could finally land it in the tree at the end of 2014.

Initially, it was only a few Intel folks and Emma Anholt (Broadcom, at the time) who were all that interested in NIR. Today, it's integral to the Mesa project and at the core of every driver that's still seeing active development. Over the past seven years, we (the Mesa community) have poured thousands of man hours (probably millions of engineering dollars) into NIR and it's gone from something only capable of handling fragment shaders to supporting full Vulkan 1.2 plus ray-tracing (task and mesh are coming) along with OpenCL 1.2 compute.

Was it worth it? That's the multi-million dollar (literally) question.

2014 was a simpler time. Compute shaders were still newish and people didn't use them for all that much more than they would have used a fancy fragment shader for a couple years earlier. More advanced features like Vulkan's variable pointers weren't even on the horizon. Had I known at the time how much work we'd have to put into NIR to keep up, I may have said, "Nah, this is too much effort; let's just use LLVM." If I had, I think it would have been the wrong call.

Distro and packaging issues

I'd like to get this one out of the way first because, while these issues are definitely real, it's easily the least compelling reason to write a whole new piece of software. Having your compiler in a separate project, and in LLVM in particular, comes with an annoying set of problems.

First, there's release cycles. Mesa releases on a rough 3-month cadence whereas LLVM releases on a 6-month cadence and there's nothing syncing the two release cycles. This means that any new feature enabled in Mesa that requires new LLVM compiler work can't be enabled until they pick up a new LLVM. Not only does this make the question "what Mesa version has X?"
unanswerable, it also means every one of these features needs conditional paths in the driver to be enabled or not depending on LLVM version. Also, because we can't guarantee which LLVM version a distro will choose to pair with any given Mesa version, radeonsi (the only LLVM-based hardware driver in Mesa) has to support the latest two releases of LLVM as well as tip-of-tree at all times. While this has certainly gotten better in recent years, it used to be that LLVM would switch around C++ data structures on you, requiring a bunch of wrapper classes in Mesa to deal with the mess. (They still reserve the right; it just happens less these days.)

Second is bug fixing. What do you do if there's a compiler bug? You fix it in LLVM, of course, right? But what if the bug is in an old version of the AMD LLVM back-end and AMD's LLVM people refuse to back-port the fix? You work around it in Mesa, of course! Yup, even though Mesa and LLVM are both open-source projects that theoretically have a stable bugfix release cycle, Mesa has to carry LLVM work-around patches because we can't get the other team/project to back-port fixes. Things also get sticky whenever there's a compiler bug which touches on the interface between the LLVM back-end compiler and the driver. How do you fix that in a backwards-compatible way? Sometimes, you don't. Those interfaces can be absurdly subtle and complex, and sometimes the bug is in the interface itself, so you either have to fix it in LLVM tip-of-tree and work around it in Mesa for older versions, or you have to break backwards compatibility somewhere and hope users pick up the LLVM bug-fix release.

Third is that some games actually link against LLVM and, historically, LLVM hasn't done well with two different versions of it loaded at the same time. Some of this is LLVM and some of it is the way C++ shared library loading is handled on Linux. I won't get into all the details, but the point is that there have been some games in the past which simply can't run on radeonsi because of LLVM library version conflicts. Some of this could probably be solved if Mesa were linked against LLVM statically, but distros tend to be pretty sour on static linking unless you have a really good reason. A closed-source game pulling in their own LLVM isn't generally considered to be a good reason.

And that, in the words of Forrest Gump, is all I have to say about that.

A compiler built for GPUs

One of the key differences between NIR and LLVM is that NIR is a GPU-focused compiler whereas LLVM is CPU-focused. Yes, AMD has an upstream LLVM back-end for their GPU hardware, Intel likes to brag about their out-of-tree LLVM back-end, and many other vendors use it in their drivers as well even if their back-ends are closed-source and internal. However, none of that actually means that LLVM understands GPUs or is any good at compiling for them. Most HW vendors have made that choice because they needed LLVM for OpenCL support and they wanted a unified compiler, so they figured out how to make LLVM do graphics. It works, but that doesn't mean it works well.
To demonstrate this, let’s look at the following GLSL shader I stole from the texelFetch piglit test: #version 120 #extension GL_EXT_gpu_shader4: require #define ivec1 int flat varying ivec4 tc; uniform vec4 divisor; uniform sampler2D tex; out vec4 fragColor; void main() { vec4 color = texelFetch2D(tex, ivec2(tc), tc.w); fragColor = color/divisor; } When compiled to NIR, this turns into shader: MESA_SHADER_FRAGMENT name: GLSL3 inputs: 1 outputs: 1 uniforms: 1 ubos: 1 shared: 0 decl_var uniform INTERP_MODE_NONE sampler2D tex (1, 0, 0) decl_var ubo INTERP_MODE_NONE vec4[1] uniform_0 (0, 0, 0) decl_function main (0 params) impl main { block block_0: /* preds: */ vec1 32 ssa_0 = load_const (0x00000000 /* 0.000000 */) vec3 32 ssa_1 = intrinsic load_input (ssa_0) (0, 0, 34, 160) /* base=0 */ /* component=0 */ /* dest_type=int32 */ /* location=32 slots=1 */ vec1 32 ssa_2 = deref_var &tex (uniform sampler2D) vec2 32 ssa_3 = vec2 ssa_1.x, ssa_1.y vec1 32 ssa_4 = mov ssa_1.z vec4 32 ssa_5 = (float32)txf ssa_2 (texture_deref), ssa_2 (sampler_deref), ssa_3 (coord), ssa_4 (lod) vec4 32 ssa_6 = intrinsic load_ubo (ssa_0, ssa_0) (0, 1073741824, 0, 0, 16) /* access=0 */ /* align_mul=1073741824 */ /* align_offset=0 */ /* range_base=0 */ /* range=16 */ vec1 32 ssa_7 = frcp ssa_6.x vec1 32 ssa_8 = frcp ssa_6.y vec1 32 ssa_9 = frcp ssa_6.z vec1 32 ssa_10 = frcp ssa_6.w vec1 32 ssa_11 = fmul ssa_5.x, ssa_7 vec1 32 ssa_12 = fmul ssa_5.y, ssa_8 vec1 32 ssa_13 = fmul ssa_5.z, ssa_9 vec1 32 ssa_14 = fmul ssa_5.w, ssa_10 vec4 32 ssa_15 = vec4 ssa_11, ssa_12, ssa_13, ssa_14 intrinsic store_output (ssa_15, ssa_0) (0, 15, 0, 160, 132) /* base=0 */ /* wrmask=xyzw */ /* component=0 */ /* src_type=float32 */ /* location=4 slots=1 */ /* succs: block_1 */ block block_1: } Then, the AMD driver turns it into the following LLVM IR: ; ModuleID = 'mesa-shader' source_filename = "mesa-shader" target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7" target triple = "amdgcn--" define amdgpu_ps <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> @main(<4 x i32> addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %0, <8 x i32> addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %1, float addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %2, <8 x i32> addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %3, i32 inreg %4, i32 inreg %5, <2 x i32> %6, <2 x i32> %7, <2 x i32> %8, <3 x i32> %9, <2 x i32> %10, <2 x i32> %11, <2 x i32> %12, float %13, float %14, float %15, float %16, float %17, i32 %18, i32 %19, float %20, i32 %21) #0 { main_body: %22 = call nsz arcp float @llvm.amdgcn.interp.mov(i32 2, i32 0, i32 0, i32 %5) #4 %23 = bitcast float %22 to i32 %24 = call nsz arcp float @llvm.amdgcn.interp.mov(i32 2, i32 1, i32 0, i32 %5) #4 %25 = bitcast float %24 to i32 %26 = call nsz arcp float @llvm.amdgcn.interp.mov(i32 2, i32 2, i32 0, i32 %5) #4 %27 = bitcast float %26 to i32 %28 = getelementptr inbounds <8 x i32>, <8 x i32> addrspace(6)* %3, i32 32, !amdgpu.uniform !0 %29 = load <8 x i32>, <8 x i32> addrspace(6)* %28, align 4, !invariant.load !0 %30 = call nsz arcp <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 15, i32 %23, i32 %25, i32 %27, <8 x i32> %29, i32 0, i32 0) #4 %31 = ptrtoint float addrspace(6)* %2 to i32 
%32 = insertelement <4 x i32> <i32 poison, i32 0, i32 16, i32 163756>, i32 %31, i32 0 %33 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 0, i32 0) #4 %34 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 4, i32 0) #4 %35 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 8, i32 0) #4 %36 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 12, i32 0) #4 %37 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %33) #4 %38 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %34) #4 %39 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %35) #4 %40 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %36) #4 %41 = extractelement <4 x float> %30, i32 0 %42 = fmul nsz arcp float %41, %37 %43 = extractelement <4 x float> %30, i32 1 %44 = fmul nsz arcp float %43, %38 %45 = extractelement <4 x float> %30, i32 2 %46 = fmul nsz arcp float %45, %39 %47 = extractelement <4 x float> %30, i32 3 %48 = fmul nsz arcp float %47, %40 %49 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> undef, i32 %4, 4 %50 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %49, float %42, 5 %51 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %50, float %44, 6 %52 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %51, float %46, 7 %53 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %52, float %48, 8 %54 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %53, float %20, 19 ret <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %54 } ; Function Attrs: nounwind readnone speculatable willreturn declare float @llvm.amdgcn.interp.mov(i32 immarg, i32 immarg, i32 immarg, i32) #1 ; Function Attrs: nounwind readonly willreturn declare <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 immarg, i32, i32, i32, <8 x i32>, i32 immarg, i32 immarg) #2 ; Function Attrs: nounwind readnone willreturn declare float @llvm.amdgcn.s.buffer.load.f32(<4 x i32>, i32, i32 immarg) #3 ; Function Attrs: nounwind readnone speculatable willreturn declare float @llvm.amdgcn.rcp.f32(float) #1 attributes #0 = { "InitialPSInputAddr"="0xb077" "denormal-fp-math"="ieee,ieee" "denormal-fp-math-f32"="preserve-sign,preserve-sign" "target-features"="+DumpCode" } attributes #1 = { nounwind readnone speculatable willreturn } attributes #2 = { nounwind readonly willreturn } attributes #3 = { nounwind readnone willreturn } attributes #4 = { nounwind readnone } !0 = !{} For those of you who can’t read NIR and/or LLVM or don’t want to sift through all that, let me reduce it down to the important lines: GLSL: vec4 color = texelFetch2D(tex, ivec2(tc), tc.w); NIR: vec4 32 ssa_5 = (float32)txf ssa_2 (texture_deref), ssa_2 (sampler_deref), ssa_3 (coord), ssa_4 (lod) LLVM: %30 = call nsz arcp <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 15, i32 %23, i32 %25, i32 %27, <8 x i32> %29, i32 0, i32 0) #4 ; Function Attrs: 
nounwind readonly willreturn declare <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 immarg, i32, i32, i32, <8 x i32>, i32 immarg, i32 immarg) #2 attributes #2 = { nounwind readonly willreturn } attributes #4 = { nounwind readnone }

In NIR, a texelFetch() shows up as a texture instruction. NIR has a special instruction type just for textures called nir_tex_instr to handle the combinatorial explosion of possibilities when it comes to all the different ways you can access a texture. In this particular case, the texture opcode is nir_texop_txf for a texel fetch and it is passed a texture, a sampler, a coordinate and an LOD. Pretty standard stuff.

In AMD-flavored LLVM IR, this turns into a magic intrinsic function called llvm.amdgcn.image.load.mip.2d.v4f32.i32. A bunch of information about the operation, such as the fact that it takes a mip parameter and returns a vec4, is encoded in the function name. The AMD back-end then knows how to turn this into the right sequence of hardware instructions to load from a texture.

There are a couple of important things to note here. First is the @llvm.amdgcn prefix on the function name. This is an entirely AMD-specific function. If I dumped out the LLVM from the Intel Windows drivers for that same GLSL, it would use a different function name with a different encoding for the various bits of ancillary information such as the return type. Even though both drivers share LLVM, in theory, the way they encode graphics operations is entirely different. If you looked at NVIDIA, you would find a third encoding. There is no standardization.

Why is this important? Well, one of the most common arguments I hear from people for why we should all be using LLVM for graphics is because it allows for code sharing. Everyone can leverage all that great work that happens in upstream LLVM. Except it doesn't. Not really. Sure, you can get LLVM's algebraic optimizations and code motion etc. But you can't share any of the optimizations that are really interesting for graphics because nothing graphics-related is common. Could it be standardized? Probably. But, in the state it's in today, any claim that two graphics compilers are sharing significant optimizations because they're both LLVM-based is a half-truth at best. And it will never become standardized unless someone other than AMD decides to put their back-end into upstream LLVM and they decide to work together.

The second important bit about that LLVM function call is that LLVM has absolutely no idea what that function does. All it knows is that it's been decorated nounwind, readonly, and willreturn. The readonly gives it a bit of information so it knows it can move the function call around a bit since it won't write memory. However, it can't even eliminate redundant texture ops because, for all LLVM knows, a second call will return a different result. While LLVM has pretty good visibility into the basic math in the shader, when it comes to anything that touches image or buffer memory, it's flying entirely blind. The Intel LLVM-based graphics compiler tries to improve this somewhat by using actual LLVM pointers for buffer memory so LLVM gets a bit more visibility, but you still end up with a pile of out-of-thin-air pointers that all potentially alias each other, so it's pretty limited. In contrast, NIR knows exactly what sort of thing nir_texop_txf is and what it does.
It knows, for instance, that, even though it accesses external memory, the API guarantees that nothing shifts out from under you so it's fine to eliminate redundant texture calls. For nir_texop_tex (texture() in GLSL), it knows that it takes implicit derivatives and so it can't be moved into non-uniform control-flow. For things like SSBO and workgroup memory, we know what kind of memory they're touching and can do alias analysis that's actually aware of buffer bindings.

Code sharing

When people try to justify their use of LLVM to me, there are typically two major benefits they cite. The first is that LLVM lets them take advantage of all this academic compiler work. In the previous section, I explained why this is a weak argument at best. The second is that embracing LLVM for graphics lets them share code with their compute compiler.

Does that mean that we're against sharing code? Not at all! In fact, NIR lets us get far more code sharing than most companies do by using LLVM. The difference is the axis for sharing. This is something I ran into trying to explain myself to people at Intel all the time. They're usually only thinking about how to get the Intel OpenCL driver and the Intel D3D12 driver to share code. With NIR, we have compiler code shared effectively across 20 years of hardware from 8 different vendors and at least 4 APIs. So while Intel's Linux Vulkan and OpenCL drivers don't share a single line of compiler code, it's not like we went off and hand-coded a whole compiler stack just for Intel Linux Vulkan.

As an example of this, consider nir_lower_tex(), a pass that lowers various different types of texture operations to other texture operations. It can, among other things:

- Lower texture projectors away by doing the division in the shader,
- Lower texelFetchOffset() to texelFetch(),
- Lower rectangle textures by dividing the coordinate by the result of textureSize(),
- Lower texture swizzles to swizzling in the shader,
- Lower various forms of textureGrad*() to textureLod*() under various conditions,
- Lower imageSize(i, lod) with an LOD to imageSize(i, 0) and some shader math,
- And much more…

Exactly what lowering is needed is highly hardware-dependent (except projectors; only old Qualcomm hardware has those), but most of them are needed by at least two different vendors' hardware. While most of these are pretty simple, when you get into things like turning derivatives into LODs, the calculations get complex and we really don't want everyone typing it themselves if we can avoid it.

And texture lowering is just one example. We've got dozens of passes for everything from lowering read-only images to textures for OpenCL, to lowering built-in functions like frexp() to simpler math, to flipping gl_FragCoord and gl_PointCoord when rendering upside down, which is required to implement OpenGL on Linux window systems. All that code is in one central place where it's usable by all the graphics drivers on Linux.
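To give a feel for how a back-end taps into one of these shared passes, here is a hypothetical sketch of a driver asking nir_lower_tex() to do a few of the lowerings listed above; the option field names are quoted from memory, so check nir.h before relying on them:

    /* Hypothetical driver hook; field names recalled from memory, so
     * double-check them against nir.h before copying anything. */
    #include "nir.h"

    static void
    example_lower_textures(nir_shader *s)
    {
       const nir_lower_tex_options opts = {
          /* Lower projectors for every sampler type; almost no hardware
           * supports them natively. */
          .lower_txp = ~0u,
          /* Pretend this hardware has no rectangle-texture support, so
           * coordinates get rescaled using textureSize(). */
          .lower_rect = true,
          /* Turn explicit-gradient lookups (txd) into explicit-LOD ones. */
          .lower_txd = true,
       };

       /* The pass returns whether it made progress, so follow-up cleanup
        * only needs to run when something actually changed. */
       if (nir_lower_tex(s, &opts))
          nir_opt_dce(s);
    }

Each back-end fills in only the options its hardware actually needs, which is how one pass can serve wildly different GPUs.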
Tight driver integration

I mentioned earlier that having your compiler out-of-tree is painful from a packaging and release point-of-view. What I haven't addressed yet is just how tight driver/compiler integration has to be. It depends a lot on the API and hardware, of course, but the interface between compiler and driver is often very complex. We make it look very simple on the API side, where you have descriptor sets (or bindings in GL) and then you access things from them in the shader. Simple, right? Hah!

In the Intel Linux Vulkan driver, we can access a UBO one of four ways depending on a complex heuristic:

1. We try to find up to 4 small ranges of commonly used UBO constants and push those into the shader as push constants.
2. If we can't push it all and it fits inside the hardware's 240-entry binding table, we create a descriptor for it and put it in the binding table.
3. Depending on the hardware generation, UBOs successfully bound to descriptors might be accessed as SSBOs or we might access them through the texture unit.
4. If we run out of entries in the binding table or if it's in a ray-tracing stage (those don't have binding tables), we fall back to doing bounds checking in the shader and access it using raw 64-bit GPU addresses.

And that's just UBOs! SSBO binding has a similar level of complexity and also depends on the SSBO operations done in the shader. Textures have silent fall-back to bindless if we have too many, etc. In order to handle all this insanity, we have a compiler pass called anv_nir_apply_pipeline_layout() which lives in the driver. The interface between that pass and the rest of the driver is quite complex and can communicate information about exactly how things are actually laid out. We do have to serialize it to put it all in the pipeline cache, so that limits the complexity some, but we don't have to worry about keeping the interface stable at all because it lives in the driver. We also have passes for handling YCbCr format conversion, turning multiview into instanced rendering and constructing a gl_ViewID in the shader based on the view mask and the instance number, and a handful of other tasks. Each of these requires information from the VkPipelineCreateInfo and some of them result in magic push constants which the driver has to know need pushing. Trying to do that with your compiler in another project would be insane.

So how does AMD do it with their LLVM compiler? Good question! They either do it in NIR or as part of the NIR to LLVM conversion. By the time the shader gets to LLVM, most of the GL or Vulkanisms have been translated to simpler constructs, keeping the driver/LLVM interface manageable. It also helps that AMD's hardware binding model is crazy simple and was basically designed for an API like Vulkan.

Structured control-flow

One of the riskier decisions we made when designing NIR was to make all control-flow inherently structured. Instead of branch and conditional branch instructions like LLVM or SPIR-V have, NIR has control-flow nodes in a tree structure. The root of the tree is always a nir_function_impl. In each function is a list of control-flow nodes that may be nir_block, nir_if, or nir_loop. An if has a condition and then and else cases. A loop is a simple infinite loop, and there are nir_jump_break and nir_jump_continue instructions which act exactly as their C counterparts.
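Purely as an illustration of what that tree structure looks like to a pass author, here is a hedged sketch using the nir_builder helpers; the helper names are recalled from memory, so treat nir_builder.h as the authoritative reference:

    /* Sketch only: emits the equivalent of max(x, 1.0) as an explicit
     * if/else, just to show the structured push/pop helpers. */
    #include "nir_builder.h"

    static nir_ssa_def *
    build_fmax_one_example(nir_builder *b, nir_ssa_def *x)
    {
       nir_ssa_def *one = nir_imm_float(b, 1.0f);

       /* nir_push_if() opens a nir_if node; everything emitted until the
        * matching nir_pop_if() lands in its then-block (or, after
        * nir_push_else(), its else-block). */
       nir_push_if(b, nir_flt(b, x, one));
          nir_ssa_def *then_val = one;
       nir_push_else(b, NULL);
          nir_ssa_def *else_val = x;
       nir_pop_if(b, NULL);

       /* Values escaping the if are merged with an explicit phi. */
       return nir_if_phi(b, then_val, else_val);
    }

There is no raw branch instruction anywhere in that sequence; the if/else is a node in the tree, and break/continue inside a nir_loop behave the same way.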
At the time, this decision was made from pure pragmatism. We had structure coming out of GLSL and most of the back-ends expected structure. Why break everything? It did mean that, when we started writing control-flow manipulation passes, things were a lot harder. A dead control-flow pass in an unstructured IR is trivial: delete any conditional branch whose condition is false and replace it with an unconditional branch if the condition is true; then delete any unreachable blocks and merge blocks as necessary. Done. In a structured IR, it's a lot more fiddly. You have to manually collapse if ladders, and deleting the unconditional break at the end of a loop is equivalent to loop unrolling. But we got over that hump, built tools to make it less painful, and have implemented most of the important control-flow optimizations at this point. In exchange, back-ends get structure, which is something most GPUs want thanks to the SIMT model they use.

What we didn't see coming when we made that decision (2014, remember?) was wave/subgroup ops. In the last several years, the SIMT nature of shader execution has slowly gone from an implementation detail to something that's baked into all modern 3D and compute APIs and shader languages. With that shift has come the need to be consistent about re-convergence. If we say "texture() has to be in uniform control flow", is the following shader ok?

    #version 120

    varying vec2 tc;
    uniform sampler2D tex;
    out vec4 fragColor;

    void main()
    {
        if (tc.x > 1.0)
            tc.x = 1.0;
        fragColor = texture(tex, tc);
    }

Obviously, it should be. But what guarantees that you're actually in uniform control-flow by the time you get to the texture() call? In an unstructured IR, once you diverge, it's really hard to guarantee convergence. Of course, every GPU vendor with an LLVM-based compiler has algorithms for trying to maintain or re-create the structure, but it's always a bit fragile.

Here's an even more subtle example:

    #version 120

    varying vec2 tc;
    uniform sampler2D tex;
    out vec4 fragColor;

    void main()
    {
        /* Block 0 */
        float x = tc.x;
        while (1) {
            /* Block 1 */
            if (x < 1.0) {
                /* Block 2 */
                tc.x = x;
                break;
            }
            /* Block 3 */
            x = x - 1.0;
        }
        /* Block 4 */
        fragColor = texture(tex, tc);
    }

The same question of validity holds, but there's something even trickier in here. Can the compiler merge block 4 and block 2? If so, where should it put it? To a CPU-centric compiler like LLVM, it looks like it would be fine to merge the two and put it all in block 2. In fact, since texture ops are expensive and block 2 is deeper inside control-flow, it may think the resulting shader would be more efficient if it did. And it would be wrong on both counts. First, the loop exit condition is non-uniform and, since texture() takes derivatives, it's illegal to put it in non-uniform control-flow. (Yes, in this particular case, the result of those derivatives might be a bit wonky.) Second, due to the SIMT nature of execution, you really don't want the texture op in the loop. In the worst case, a 32-wide execution will hit block 2 32 separate times whereas, if you guarantee re-convergence, it only hits block 4 once.

The fact that NIR's control-flow is structured from start to finish has been a hidden blessing here. Once we get the structure figured out from SPIR-V decorations (which is annoyingly challenging at times), we never lose that structure and the re-convergence information it implies. NIR knows better than to move derivatives into non-uniform control-flow and its code-motion passes are tuned assuming a SIMT execution model. What has become a constant fight for people working with LLVM is a non-issue for us. The only thing that has been a challenge has been dealing with SPIR-V's less-than-obvious structure rules and trying to make sure we properly structurize everything that's legal. (It's been getting better recently.)

Side-note: NIR does support OpenCL SPIR-V, which is unstructured. To handle this, we have nir_jump_goto and nir_jump_goto_if instructions which are allowed only for a very brief period of time.
After the initial SPIR-V to NIR conversion, we run a couple passes and then structurize. After that, it remains structured for the rest of the compile.

Algebraic optimizations

Every GPU compiler engineer has horror stories about something some app developer did in a shader. Sometimes it's the fault of the developer and sometimes it's just an artifact of whatever node-based visual shader building system the game engine presents to the artists and how it's been abused. On Linux, however, it can get even more entertaining. Not only do we have those shaders that were written for DX9 and someone lost the code so they ran them through a DX9 to HLSL translator and then through FXC, but when they then ported the app to OpenGL so it could run on Linux, they did a DXBC to GLSL conversion with some horrid tool. The end result is x != 0 implemented with three levels of nested function calls, multiple splats out to a vec4 and a truly impressive pile of control-flow. I only wish I were joking…

To chew through this mess, we have nir_opt_algebraic(). We've implemented a little language for expressing these expression trees using python tuples and nir_opt_algebraic.py. To get a sense for what this looks like, let's look at some excerpts from nir_opt_algebraic.py, starting with the simple description at the top:

    # Written in the form (<search>, <replace>) where <search> is an expression
    # and <replace> is either an expression or a value. An expression is
    # defined as a tuple of the form ([~]<op>, <src0>, <src1>, <src2>, <src3>)
    # where each source is either an expression or a value. A value can be
    # either a numeric constant or a string representing a variable name.

    optimizations = [
       ...
       (('iadd', a, 0), a),

This rule is a good starting example because it's so straightforward. It looks for an integer add operation of something with zero and gets rid of it. A slightly more complex example removes redundant fmax opcodes:

    (('fmax', ('fmax', a, b), b), ('fmax', a, b)),

Since it's written in python, we can also write little rule generators if the same thing applies to a bunch of opcodes or if you want to generalize across types:

    # For any float comparison operation, "cmp", if you have "a == a && a cmp b"
    # then the "a == a" is redundant because it's equivalent to "a is not NaN"
    # and, if a is a NaN then the second comparison will fail anyway.
    for op in ['flt', 'fge', 'feq']:
       optimizations += [
          (('iand', ('feq', a, a), (op, a, b)), ('!' + op, a, b)),
          (('iand', ('feq', a, a), (op, b, a)), ('!' + op, b, a)),
       ]

Because we've made adding new optimizations so incredibly easy, we have a lot of them. Not just the simple stuff I've highlighted above, either. We've got at least two cases where someone hand-rolled bitfieldReverse() and we match a giant pattern and turn it into a single HW instruction. (Some UE4 demo and Cyberpunk 2077, if you want to know who to blame. They hand-roll it differently, of course.) We also have patterns to chew through all the garbage from D3D9 to HLSL conversion where they emit piles of x ? 1.0 : 0.0 everywhere because D3D9 didn't have real Boolean types. All told, as of the writing of this blog post, we have 1911 such search-and-replace patterns.

Not only have we made it easy to add new patterns, but the nir_search framework has some pretty useful smarts in it. The expression I first showed matches a + 0 and replaces it with a, but nir_search is smart enough to know that nir_op_iadd is commutative and so it also matches 0 + a without having to write two expressions.
We also have syntax for detecting constants, handling different bit sizes, and applying arbitrary C predicates based on the SSA value. Since NIR is actually a vector IR (we support a lot of vec4-based hardware), nir_search also magically handles swizzles for you.

You might think 1911 patterns is a lot, and it is. Doesn't that take forever? Isn't it O(N·P·S), where N is the number of instructions, P is the number of patterns, and S is the average pattern size, or something like that? Nope! A couple years ago, Connor Abbott converted it to use a finite-state automaton, built at driver compile time, to filter out impossible matches as we go. The result is that the whole pass effectively runs in time linear in the number of instructions.

NIR is a low(ish) level IR

This one continues to surprise me. When we set out to design NIR, the goal was something that was SSA and used flat lists of instructions (not expression trees). That was pretty much the extent of the design requirements. However, whenever you build an IR, you inevitably make a series of choices about what kinds of things you're going to support natively and what things are going to require emulation or be a bit more painful.

One of the most fundamental choices we made in NIR was that SSA values would be typeless vectors. Each nir_ssa_def has a bit size and a number of vector components and that's it. We don't distinguish between integers and floats and we don't support matrix or composite types. Not supporting matrix types was a bit controversial, but it's turned out fine. We also have to do a bit of juggling to support hardware that doesn't have native integers, because we have to lower integer operations to float and we've lost the type information. When working with shaders that come from D3D to OpenGL or Vulkan translators, the type information does more harm than good. I can't count the number of shaders I've seen where they declare vec4 x1 through vec4 x80 at the top and then uintBitsToFloat() and floatBitsToUint() all over everywhere.

We also made adding new ALU ops and intrinsics really easy, but also added a fairly powerful metadata system for both so the compiler can still reason about them. The lines we drew between ALU ops, intrinsics, texture instructions, and control-flow like break and continue were pretty arbitrary at the time, if we're honest. Texturing was going to be a lot of intrinsics so Connor added an instruction type. That was pretty much it.

The end result, however, has been an IR that's incredibly versatile. It's somehow both a high-level and low-level IR at the same time. When we do SPIR-V to NIR translation, we don't have a separate IR for parsing SPIR-V. We have some data structures to deal with composite types and a handful of other stuff, but when we parse SPIR-V opcodes, we go straight to NIR. We've got variables with fairly standard dereference chains (those do support composite types), bindings, all the crazy built-ins like frexp(), and a bunch of other language-level stuff. By the time the NIR shows up in your back-end, however, all that's gone. Crazy built-in functions have been lowered. GL/Vulkan binding with derefs, descriptors, and locations has been turned into byte offsets and indices in a flat binding table. Some drivers have even attempted to emit hardware instructions directly from NIR. (It's never quite worked, but it says a lot that they even tried.)
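To make the typeless-vector point above concrete, this is roughly the shape of an SSA value in that model; a simplified illustration, not Mesa's actual struct layout:

    /* Simplified illustration of the idea, not the real Mesa definition:
     * an SSA value is just "N components of M bits", with no int/float
     * distinction and no matrix or composite types. */
    struct example_ssa_def {
       unsigned num_components;   /* e.g. 1-4 for typical GLSL vectors */
       unsigned bit_size;         /* 1, 8, 16, 32, or 64 */
       /* ...bookkeeping such as the defining instruction and use list... */
    };

Whether those bits hold an integer or a float is a property of the instructions that produce and consume the value, not of the value itself.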
The Intel compiler back-end has probably shrunk by half in terms of optimization and lowering passes in the last seven years because we're able to do so much in NIR. We've got code that lowers storage image access with unsupported formats to other image formats or even SSBO access, splitting of vector UBO/SSBO access that's too wide for hardware, workarounds for imprecise trig ops, and a bunch of others. All of the interesting lowering is done in NIR. One reason for this is that Intel has two back-ends, one that's scalar and one that's vec4, and any lowering we can do in NIR is lowering that only happens once. But, also, it's nice to be able to have the full power of NIR's optimizer run on your lowered code.

As I said earlier, I find the versatility of NIR astounding. We never intended to write an IR that could get that close to hardware. We just wanted SSA for easier optimization writing. But the end result has been absolutely fantastic and has done a lot to accelerate driver development in Mesa.

Conclusion

If you've gotten this far, I both applaud and thank you! NIR has been a lot of fun to build and, as you can probably tell, I'm quite proud of it. It's also been a huge investment involving thousands of man hours, but I think it's been well worth it. There's a lot more work to do, of course. We still don't have the ray-tracing situation where it needs to be, and OpenCL-style compute needs some help to be really competent. But it's come an incredibly long way in the last seven years and I'm incredibly proud of what we've built and forever thankful to the many, many developers who have chipped in and fixed bugs and contributed optimization and lowering passes.

Hopefully, this post provides some additional background and explanation for the big question of why Mesa carries its own compiler stack. And maybe, just maybe, someone will get excited enough about it to play around with it and even contribute! One can hope, right?
Posted over 2 years ago
In defense of NIR NIR has been an integral part of the Mesa driver stack for about six or seven years now (depending on how you count) and a lot has changed since NIR first landed at the end of 2014 and I wrote my initial NIR notes. Also, for various ... [More] reasons, I’ve had to give my NIR elevator pitch a few times lately. I think it’s time for a new post. This time on why, after working on this mess for seven years, I still think NIR was the right call. A bit of history Shortly after I joined the Mesa team at Intel in the summer of 2014, I was sitting in the cube area asking Ken questions, trying to figure out how Mesa was put together, and I asked, “Why don’t you use LLVM?” Suddenly, all eyes turned towards Ken and myself and I realized I’d poked a bear. Ken calmly explained a bunch of the packaging/shipping issues around having your compiler in a different project as well as issues radeonsi had run into with apps bundling their own LLVM that didn’t work. But for the more technical question of whether or not it was a good idea, his answer was something about trade-offs and how it’s really not clear if LLVM would really gain them much. That same summer, Connor Abbott showed up as our intern and started developing NIR. By the end of the summer, he had a bunch of data structures a few mostly untested passes, and a validator. He also had most of a GLSL IR to NIR pass which mostly passed validation. Later that year, after Connor had gone off to school, I took over NIR, finished the Intel scalar back-end NIR consumer, fixed piles of bugs, and wrote out-of-SSA and a bunch of optimization passes to get it to the point where we could finally land it in the tree at the end of 2014. Initially, it was only a few Intel folks and Emma Anholt (Broadcom, at the time) who were all that interested in NIR. Today, it’s integral to the Mesa project and at the core of every driver that’s still seeing active development. Over the past seven years, we (the Mesa community) have poured thousands of man hours (probably millions of engineering dollars) into NIR and it’s gone from something only capable of handling fragment shaders to supporting full Vulkan 1.2 plus ray-tracing (task and mesh are coming) along with OpenCL 1.2 compute. Was it worth it? That’s the multi-million dollar (literally) question. 2014 was a simpler time. Compute shaders were still newish and people didn’t use them for all that much more than they would have used a fancy fragment shader for a couple years earlier. More advanced features like Vulkan’s variable pointers weren’t even on the horizon. Had I known at the time how much work we’d have to put into NIR to keep up, I may have said, “Nah, this is too much effort; let’s just use LLVM.” If I had, I think it would have made the wrong call. Distro and packaging issues I’d like to get this one out of the way first because, while these issues are definitely real, it’s easily the least compelling reason to write a whole new piece of software. Having your compiler in a separate project and in LLVM in particular comes with an annoying set of problems. First, there’s release cycles. Mesa releases on a rough 3-month cadence whereas LLVM releases on a 6-month cadence and there’s nothing syncing the two release cycles. This means that any new feature enabled in Mesa that require new LLVM compiler work can’t be enabled until they pick up a new LLVM. Not only does this make the question “what mesa version has X? 
unanswerable, it also means every one of these features needs conditional paths in the driver to be enabled or not depending on the LLVM version. Also, because we can't guarantee which LLVM version a distro will choose to pair with any given Mesa version, radeonsi (the only LLVM-based hardware driver in Mesa) has to support the latest two releases of LLVM as well as tip-of-tree at all times. While this has certainly gotten better in recent years, it used to be that LLVM would switch around C++ data structures on you, requiring a bunch of wrapper classes in Mesa to deal with the mess. (They still reserve the right, it just happens less these days.)

Second is bug fixing. What do you do if there's a compiler bug? You fix it in LLVM, of course, right? But what if the bug is in an old version of the AMD LLVM back-end and AMD's LLVM people refuse to back-port the fix? You work around it in Mesa, of course! Yup, even though Mesa and LLVM are both open-source projects that theoretically have a stable bugfix release cycle, Mesa has to carry LLVM work-around patches because we can't get the other team/project to back-port fixes. Things also get sticky whenever there's a compiler bug which touches on the interface between the LLVM back-end compiler and the driver. How do you fix that in a backwards-compatible way? Sometimes, you don't. Those interfaces can be absurdly subtle and complex and sometimes the bug is in the interface itself, so you either have to fix it in LLVM tip-of-tree and work around it in Mesa for older versions, or you have to break backwards compatibility somewhere and hope users pick up the LLVM bug-fix release.

Third is that some games actually link against LLVM and, historically, LLVM hasn't done well with two different versions of it loaded at the same time. Some of this is LLVM and some of it is the way C++ shared library loading is handled on Linux. I won't get into all the details but the point is that there have been some games in the past which simply can't run on radeonsi because of LLVM library version conflicts. Some of this could probably be solved if Mesa were linked against LLVM statically but distros tend to be pretty sour on static linking unless you have a really good reason. A closed-source game pulling in their own LLVM isn't generally considered to be a good reason.

And that, in the words of Forrest Gump, is all I have to say about that.

A compiler built for GPUs

One of the key differences between NIR and LLVM is that NIR is a GPU-focused compiler whereas LLVM is CPU-focused. Yes, AMD has an upstream LLVM back-end for their GPU hardware, Intel likes to brag about their out-of-tree LLVM back-end, and many other vendors use it in their drivers as well, even if their back-ends are closed-source and internal. However, none of that actually means that LLVM understands GPUs or is any good at compiling for them. Most HW vendors have made that choice because they needed LLVM for OpenCL support and they wanted a unified compiler, so they figured out how to make LLVM do graphics. It works but that doesn't mean it works well.
To demonstrate this, let’s look at the following GLSL shader I stole from the texelFetch piglit test: #version 120 #extension GL_EXT_gpu_shader4: require #define ivec1 int flat varying ivec4 tc; uniform vec4 divisor; uniform sampler2D tex; out vec4 fragColor; void main() { vec4 color = texelFetch2D(tex, ivec2(tc), tc.w); fragColor = color/divisor; } When compiled to NIR, this turns into shader: MESA_SHADER_FRAGMENT name: GLSL3 inputs: 1 outputs: 1 uniforms: 1 ubos: 1 shared: 0 decl_var uniform INTERP_MODE_NONE sampler2D tex (1, 0, 0) decl_var ubo INTERP_MODE_NONE vec4[1] uniform_0 (0, 0, 0) decl_function main (0 params) impl main { block block_0: /* preds: */ vec1 32 ssa_0 = load_const (0x00000000 /* 0.000000 */) vec3 32 ssa_1 = intrinsic load_input (ssa_0) (0, 0, 34, 160) /* base=0 */ /* component=0 */ /* dest_type=int32 */ /* location=32 slots=1 */ vec1 32 ssa_2 = deref_var &tex (uniform sampler2D) vec2 32 ssa_3 = vec2 ssa_1.x, ssa_1.y vec1 32 ssa_4 = mov ssa_1.z vec4 32 ssa_5 = (float32)txf ssa_2 (texture_deref), ssa_2 (sampler_deref), ssa_3 (coord), ssa_4 (lod) vec4 32 ssa_6 = intrinsic load_ubo (ssa_0, ssa_0) (0, 1073741824, 0, 0, 16) /* access=0 */ /* align_mul=1073741824 */ /* align_offset=0 */ /* range_base=0 */ /* range=16 */ vec1 32 ssa_7 = frcp ssa_6.x vec1 32 ssa_8 = frcp ssa_6.y vec1 32 ssa_9 = frcp ssa_6.z vec1 32 ssa_10 = frcp ssa_6.w vec1 32 ssa_11 = fmul ssa_5.x, ssa_7 vec1 32 ssa_12 = fmul ssa_5.y, ssa_8 vec1 32 ssa_13 = fmul ssa_5.z, ssa_9 vec1 32 ssa_14 = fmul ssa_5.w, ssa_10 vec4 32 ssa_15 = vec4 ssa_11, ssa_12, ssa_13, ssa_14 intrinsic store_output (ssa_15, ssa_0) (0, 15, 0, 160, 132) /* base=0 */ /* wrmask=xyzw */ /* component=0 */ /* src_type=float32 */ /* location=4 slots=1 */ /* succs: block_1 */ block block_1: } Then, the AMD driver turns it into the following LLVM IR: ; ModuleID = 'mesa-shader' source_filename = "mesa-shader" target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7" target triple = "amdgcn--" define amdgpu_ps <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> @main(<4 x i32> addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %0, <8 x i32> addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %1, float addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %2, <8 x i32> addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %3, i32 inreg %4, i32 inreg %5, <2 x i32> %6, <2 x i32> %7, <2 x i32> %8, <3 x i32> %9, <2 x i32> %10, <2 x i32> %11, <2 x i32> %12, float %13, float %14, float %15, float %16, float %17, i32 %18, i32 %19, float %20, i32 %21) #0 { main_body: %22 = call nsz arcp float @llvm.amdgcn.interp.mov(i32 2, i32 0, i32 0, i32 %5) #4 %23 = bitcast float %22 to i32 %24 = call nsz arcp float @llvm.amdgcn.interp.mov(i32 2, i32 1, i32 0, i32 %5) #4 %25 = bitcast float %24 to i32 %26 = call nsz arcp float @llvm.amdgcn.interp.mov(i32 2, i32 2, i32 0, i32 %5) #4 %27 = bitcast float %26 to i32 %28 = getelementptr inbounds <8 x i32>, <8 x i32> addrspace(6)* %3, i32 32, !amdgpu.uniform !0 %29 = load <8 x i32>, <8 x i32> addrspace(6)* %28, align 4, !invariant.load !0 %30 = call nsz arcp <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 15, i32 %23, i32 %25, i32 %27, <8 x i32> %29, i32 0, i32 0) #4 %31 = ptrtoint float addrspace(6)* %2 to i32 
%32 = insertelement <4 x i32> <i32 poison, i32 0, i32 16, i32 163756>, i32 %31, i32 0 %33 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 0, i32 0) #4 %34 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 4, i32 0) #4 %35 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 8, i32 0) #4 %36 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 12, i32 0) #4 %37 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %33) #4 %38 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %34) #4 %39 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %35) #4 %40 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %36) #4 %41 = extractelement <4 x float> %30, i32 0 %42 = fmul nsz arcp float %41, %37 %43 = extractelement <4 x float> %30, i32 1 %44 = fmul nsz arcp float %43, %38 %45 = extractelement <4 x float> %30, i32 2 %46 = fmul nsz arcp float %45, %39 %47 = extractelement <4 x float> %30, i32 3 %48 = fmul nsz arcp float %47, %40 %49 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> undef, i32 %4, 4 %50 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %49, float %42, 5 %51 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %50, float %44, 6 %52 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %51, float %46, 7 %53 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %52, float %48, 8 %54 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %53, float %20, 19 ret <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %54 } ; Function Attrs: nounwind readnone speculatable willreturn declare float @llvm.amdgcn.interp.mov(i32 immarg, i32 immarg, i32 immarg, i32) #1 ; Function Attrs: nounwind readonly willreturn declare <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 immarg, i32, i32, i32, <8 x i32>, i32 immarg, i32 immarg) #2 ; Function Attrs: nounwind readnone willreturn declare float @llvm.amdgcn.s.buffer.load.f32(<4 x i32>, i32, i32 immarg) #3 ; Function Attrs: nounwind readnone speculatable willreturn declare float @llvm.amdgcn.rcp.f32(float) #1 attributes #0 = { "InitialPSInputAddr"="0xb077" "denormal-fp-math"="ieee,ieee" "denormal-fp-math-f32"="preserve-sign,preserve-sign" "target-features"="+DumpCode" } attributes #1 = { nounwind readnone speculatable willreturn } attributes #2 = { nounwind readonly willreturn } attributes #3 = { nounwind readnone willreturn } attributes #4 = { nounwind readnone } !0 = !{} For those of you who can’t read NIR and/or LLVM or don’t want to sift through all that, let me reduce it down to the important lines: GLSL: vec4 color = texelFetch2D(tex, ivec2(tc), tc.w); NIR: vec4 32 ssa_5 = (float32)txf ssa_2 (texture_deref), ssa_2 (sampler_deref), ssa_3 (coord), ssa_4 (lod) LLVM: %30 = call nsz arcp <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 15, i32 %23, i32 %25, i32 %27, <8 x i32> %29, i32 0, i32 0) #4 ; Function Attrs: 
nounwind readonly willreturn declare <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 immarg, i32, i32, i32, <8 x i32>, i32 immarg, i32 immarg) #2 attributes #2 = { nounwind readonly willreturn } attributes #4 = { nounwind readnone }

In NIR, a texelFetch() shows up as a texture instruction. NIR has a special instruction type just for textures called nir_tex_instr to handle the combinatorial explosion of possibilities when it comes to all the different ways you can access a texture. In this particular case, the texture opcode is nir_texop_txf for a texel fetch and it is passed a texture, a sampler, a coordinate and an LOD. Pretty standard stuff.

In AMD-flavored LLVM IR, this turns into a magic intrinsic function called llvm.amdgcn.image.load.mip.2d.v4f32.i32. A bunch of information about the operation, such as the fact that it takes a mip parameter and returns a vec4, is encoded in the function name. The AMD back-end then knows how to turn this into the right sequence of hardware instructions to load from a texture.

There are a couple of important things to note here. First is the @llvm.amdgcn prefix on the function name. This is an entirely AMD-specific function. If I dumped out the LLVM from the Intel Windows drivers for that same GLSL, it would use a different function name with a different encoding for the various bits of ancillary information such as the return type. Even though both drivers share LLVM, in theory, the way they encode graphics operations is entirely different. If you looked at NVIDIA, you would find a third encoding. There is no standardization.

Why is this important? Well, one of the most common arguments I hear from people for why we should all be using LLVM for graphics is because it allows for code sharing. Everyone can leverage all that great work that happens in upstream LLVM. Except it doesn't. Not really. Sure, you can get LLVM's algebraic optimizations and code motion, etc. But you can't share any of the optimizations that are really interesting for graphics because nothing graphics-related is common. Could it be standardized? Probably. But, in the state it's in today, any claim that two graphics compilers are sharing significant optimizations because they're both LLVM-based is a half-truth at best. And it will never become standardized unless someone other than AMD decides to put their back-end into upstream LLVM and they decide to work together.

The second important bit about that LLVM function call is that LLVM has absolutely no idea what that function does. All it knows is that it's been decorated nounwind, readonly, and willreturn. The readonly gives it a bit of information so it knows it can move the function call around a bit since it won't write to memory. However, it can't even eliminate redundant texture ops because, for all LLVM knows, a second call will return a different result. While LLVM has pretty good visibility into the basic math in the shader, when it comes to anything that touches image or buffer memory, it's flying entirely blind. The Intel LLVM-based graphics compiler tries to improve this somewhat by using actual LLVM pointers for buffer memory so LLVM gets a bit more visibility, but you still end up with a pile of out-of-thin-air pointers that all potentially alias each other, so it's pretty limited.

In contrast, NIR knows exactly what sort of thing nir_texop_txf is and what it does.
It knows, for instance, that, even though it accesses external memory, the API guarantees that nothing shifts out from under you, so it's fine to eliminate redundant texture calls. For nir_texop_tex (texture() in GLSL), it knows that it takes implicit derivatives and so it can't be moved into non-uniform control-flow. For things like SSBO and workgroup memory, we know what kind of memory they're touching and can do alias analysis that's actually aware of buffer bindings.

Code sharing

When people try to justify their use of LLVM to me, there are typically two major benefits they cite. The first is that LLVM lets them take advantage of all this academic compiler work. In the previous section, I explained why this is a weak argument at best. The second is that embracing LLVM for graphics lets them share code with their compute compiler. Does that mean that we're against sharing code? Not at all! In fact, NIR lets us get far more code sharing than most companies do by using LLVM. The difference is the axis for sharing. This is something I ran into trying to explain myself to people at Intel all the time. They're usually only thinking about how to get the Intel OpenCL driver and the Intel D3D12 driver to share code. With NIR, we have compiler code shared effectively across 20 years of hardware from 8 different vendors and at least 4 APIs. So while Intel's Linux Vulkan and OpenCL drivers don't share a single line of compiler code, it's not like we went off and hand-coded a whole compiler stack just for Intel Linux Vulkan.

As an example of this, consider nir_lower_tex(), a pass that lowers various different types of texture operations to other texture operations. It can, among other things:

Lower texture projectors away by doing the division in the shader,
Lower texelFetchOffset() to texelFetch(),
Lower rectangle textures by dividing the coordinate by the result of textureSize(),
Lower texture swizzles to swizzling in the shader,
Lower various forms of textureGrad*() to textureLod*() under various conditions,
Lower imageSize(i, lod) with an LOD to imageSize(i, 0) and some shader math,
And much more…

Exactly what lowering is needed is highly hardware dependent (except projectors; only old Qualcomm hardware has those) but most of them are needed by at least two different vendors' hardware. While most of these are pretty simple, when you get into things like turning derivatives into LODs, the calculations get complex and we really don't want everyone typing them out themselves if we can avoid it.

And texture lowering is just one example. We've got dozens of passes for everything from lowering read-only images to textures for OpenCL, to lowering built-in functions like frexp() to simpler math, to flipping gl_FragCoord and gl_PointCoord when rendering upside down, which is required to implement OpenGL on Linux window-systems. All that code is in one central place where it's usable by all the graphics drivers on Linux.

Tight driver integration

I mentioned earlier that having your compiler out-of-tree is painful from a packaging and release point-of-view. What I haven't addressed yet is just how tight driver/compiler integration has to be. It depends a lot on the API and hardware, of course, but the interface between compiler and driver is often very complex. We make it look very simple on the API side where you have descriptor sets (or bindings in GL) and then you access things from them in the shader. Simple, right? Hah!
In the Intel Linux Vulkan driver, we can access a UBO one of four ways depending on a complex heuristic:

We try to find up to 4 small ranges of commonly used UBO constants and push those into the shader as push constants.
If we can't push it all and it fits inside the hardware's 240 entry binding table, we create a descriptor for it and put it in the binding table.
Depending on the hardware generation, UBOs successfully bound to descriptors might be accessed as SSBOs or we might access them through the texture unit.
If we run out of entries in the binding table or if it's in a ray-tracing stage (those don't have binding tables), we fall back to doing bounds checking in the shader and access it using raw 64-bit GPU addresses.

And that's just UBOs! SSBO binding has a similar level of complexity and also depends on the SSBO operations done in the shader. Textures have silent fall-back to bindless if we have too many, etc. In order to handle all this insanity, we have a compiler pass called anv_nir_apply_pipeline_layout() which lives in the driver. The interface between that pass and the rest of the driver is quite complex and can communicate information about exactly how things are actually laid out. We do have to serialize it to put it all in the pipeline cache, so that limits the complexity some, but we don't have to worry about keeping the interface stable at all because it lives in the driver.

We also have passes for handling YCbCr format conversion, turning multiview into instanced rendering and constructing a gl_ViewID in the shader based on the view mask and the instance number, and a handful of other tasks. Each of these requires information from the VkPipelineCreateInfo and some of them result in magic push constants which the driver has to know need pushing. Trying to do that with your compiler in another project would be insane. So how does AMD do it with their LLVM compiler? Good question! They either do it in NIR or as part of the NIR to LLVM conversion. By the time the shader gets to LLVM, most of the GL or Vulkanisms have been translated to simpler constructs, keeping the driver/LLVM interface manageable. It also helps that AMD's hardware binding model is crazy simple and was basically designed for an API like Vulkan.

Structured control-flow

One of the riskier decisions we made when designing NIR was to make all control-flow inherently structured. Instead of branch and conditional branch instructions like LLVM or SPIR-V has, NIR has control-flow nodes in a tree structure. The root of the tree is always a nir_function_impl. In each function is a list of control-flow nodes that may be nir_block, nir_if, or nir_loop. An if has a condition and then and else cases. A loop is a simple infinite loop and there are nir_jump_break and nir_jump_continue instructions which act exactly as their C counterparts.

At the time, this decision was made from pure pragmatism. We had structure coming out of GLSL and most of the back-ends expected structure. Why break everything? It did mean that, when we started writing control-flow manipulation passes, things were a lot harder. A dead control-flow pass in an unstructured IR is trivial: delete any conditional branch whose condition is false, replace it with an unconditional branch if the condition is true, then delete any unreachable blocks and merge blocks as necessary. Done. In a structured IR, it's a lot more fiddly.
You have to manually collapse if ladders and deleting the unconditional break at the end of a loop is equivalent to loop unrolling. But we got over that hump, built tools to make it less painful, and have implemented most of the important control-flow optimizations at this point. In exchange, back-ends get structure, which is something most GPUs want thanks to the SIMT model they use.

What we didn't see coming when we made that decision (2014, remember?) was wave/subgroup ops. In the last several years, the SIMT nature of shader execution has slowly gone from an implementation detail to something that's baked into all modern 3D and compute APIs and shader languages. With that shift has come the need to be consistent about re-convergence. If we say "texture() has to be in uniform control flow", is the following shader ok?

#version 120
varying vec2 tc;
uniform sampler2D tex;
out vec4 fragColor;
void main()
{
    if (tc.x > 1.0)
        tc.x = 1.0;
    fragColor = texture(tex, tc);
}

Obviously, it should be. But what guarantees that you're actually in uniform control-flow by the time you get to the texture() call? In an unstructured IR, once you diverge, it's really hard to guarantee convergence. Of course, every GPU vendor with an LLVM-based compiler has algorithms for trying to maintain or re-create the structure but it's always a bit fragile. Here's an even more subtle example:

#version 120
varying vec2 tc;
uniform sampler2D tex;
out vec4 fragColor;
void main()
{
    /* Block 0 */
    float x = tc.x;
    while (1) {
        /* Block 1 */
        if (x < 1.0) {
            /* Block 2 */
            tc.x = x;
            break;
        }
        /* Block 3 */
        x = x - 1.0;
    }
    /* Block 4 */
    fragColor = texture(tex, tc);
}

The same question of validity holds but there's something even trickier in here. Can the compiler merge block 4 and block 2? If so, where should it put it? To a CPU-centric compiler like LLVM, it looks like it would be fine to merge the two and put it all in block 2. In fact, since texture ops are expensive and block 2 is deeper inside control-flow, it may think the resulting shader would be more efficient if it did. And it would be wrong on both counts. First, the loop exit condition is non-uniform and, since texture() takes derivatives, it's illegal to put it in non-uniform control-flow. (Yes, in this particular case, the result of those derivatives might be a bit wonky.) Second, due to the SIMT nature of execution, you really don't want the texture op in the loop. In the worst case, a 32-wide execution will hit block 2 up to 32 separate times whereas, if you guarantee re-convergence, it only hits block 4 once.

The fact that NIR's control-flow is structured from start to finish has been a hidden blessing here. Once we get the structure figured out from SPIR-V decorations (which is annoyingly challenging at times), we never lose that structure and the re-convergence information it implies. NIR knows better than to move derivatives into non-uniform control-flow and its code-motion passes are tuned assuming a SIMT execution model. What has become a constant fight for people working with LLVM is a non-issue for us. The only thing that has been a challenge has been dealing with SPIR-V's less than obvious structure rules and trying to make sure we properly structurize everything that's legal. (It's been getting better recently.)

Side-note: NIR does support OpenCL SPIR-V which is unstructured. To handle this, we have nir_jump_goto and nir_jump_goto_if instructions which are allowed only for a very brief period of time.
After the initial SPIR-V to NIR conversion, we run a couple passes and then structurize. After that, it remains structured for the rest of the compile.

Algebraic optimizations

Every GPU compiler engineer has horror stories about something some app developer did in a shader. Sometimes it's the fault of the developer and sometimes it's just an artifact of whatever node-based visual shader building system the game engine presents to the artists and how it's been abused. On Linux, however, it can get even more entertaining. Not only do we have shaders that were written for DX9, where someone lost the code so they ran them through a DX9-to-HLSL translator and then through FXC, but then, when they ported the app to OpenGL so it could run on Linux, they did a DXBC-to-GLSL conversion with some horrid tool. The end result is x != 0 implemented with three levels of nested function calls, multiple splats out to a vec4, and a truly impressive pile of control-flow. I only wish I were joking…

To chew through this mess, we have nir_opt_algebraic(). We've implemented a little language for expressing these expression trees using Python tuples in nir_opt_algebraic.py. To get a sense for what this looks like, let's look at some excerpts from nir_opt_algebraic.py, starting with the simple description at the top:

# Written in the form (<search>, <replace>) where <search> is an expression
# and <replace> is either an expression or a value. An expression is
# defined as a tuple of the form ([~]<op>, <src0>, <src1>, <src2>, <src3>)
# where each source is either an expression or a value. A value can be
# either a numeric constant or a string representing a variable name.
#
optimizations = [
   ...
   (('iadd', a, 0), a),

This rule is a good starting example because it's so straightforward. It looks for an integer add operation of something with zero and gets rid of it. A slightly more complex example removes redundant fmax opcodes:

   (('fmax', ('fmax', a, b), b), ('fmax', a, b)),

Since it's written in Python, we can also write little rule generators if the same thing applies to a bunch of opcodes or if you want to generalize across types:

# For any float comparison operation, "cmp", if you have "a == a && a cmp b"
# then the "a == a" is redundant because it's equivalent to "a is not NaN"
# and, if a is a NaN then the second comparison will fail anyway.
for op in ['flt', 'fge', 'feq']:
   optimizations += [
      (('iand', ('feq', a, a), (op, a, b)), ('!' + op, a, b)),
      (('iand', ('feq', a, a), (op, b, a)), ('!' + op, b, a)),
   ]

Because we've made adding new optimizations so incredibly easy, we have a lot of them. Not just the simple stuff I've highlighted above, either. We've got at least two cases where someone hand-rolled bitfieldReverse() and we match a giant pattern and turn it into a single HW instruction. (Some UE4 demo and Cyberpunk 2077, if you want to know who to blame. They hand-roll it differently, of course.) We also have patterns to chew through all the garbage from D3D9 to HLSL conversion where they emit piles of x ? 1.0 : 0.0 everywhere because D3D9 didn't have real Boolean types. All told, as of the writing of this blog post, we have 1911 such search and replace patterns.

Not only have we made it easy to add new patterns but the nir_search framework has some pretty useful smarts in it. The expression I first showed matches a + 0 and replaces it with a, but nir_search is smart enough to know that nir_op_iadd is commutative and so it also matches 0 + a without having to write two expressions.
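As a bit of extra context on how nir_opt_algebraic() actually gets used: drivers don't call it just once. It typically sits in a fixed-point loop with the other cleanup passes so each pass can expose new opportunities for the others. The pass entry points below are real NIR functions, but the exact list and ordering is driver-specific, so treat this as a minimal sketch rather than any particular driver's loop (it assumes Mesa's in-tree nir.h and build environment):

/* Minimal sketch of a driver-style NIR optimization loop. */
static void
optimize_shader(nir_shader *s)
{
   bool progress;
   do {
      progress = false;
      progress |= nir_copy_prop(s);            /* propagate copies so patterns line up */
      progress |= nir_opt_constant_folding(s); /* fold constant expressions */
      progress |= nir_opt_algebraic(s);        /* the generated pattern matcher */
      progress |= nir_opt_cse(s);              /* common subexpression elimination */
      progress |= nir_opt_dce(s);              /* dead code elimination */
   } while (progress);
}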
We also have syntax for detecting constants, handling different bit sizes, and applying arbitrary C predicates based on the SSA value. Since NIR is actually a vector IR (we support a lot of vec4-based hardware), nir_search also magically handles swizzles for you.

You might think 1911 patterns is a lot and it is. Doesn't that take forever? Isn't it O(NPS) where N is the number of instructions, P is the number of patterns, and S is the average pattern size, or something like that? Nope! A couple of years ago, Connor Abbott converted it to use a finite-state automaton, built at driver compile time, to filter out impossible matches as we go. The result is that the whole pass effectively runs in linear time in the number of instructions.

NIR is a low(ish) level IR

This one continues to surprise me. When we set out to design NIR, the goal was something that was SSA and used flat lists of instructions (not expression trees). That was pretty much the extent of the design requirements. However, whenever you build an IR, you inevitably make a series of choices about what kinds of things you're going to support natively and what things are going to require emulation or be a bit more painful. One of the most fundamental choices we made in NIR was that SSA values would be typeless vectors. Each nir_ssa_def has a bit size and a number of vector components and that's it. We don't distinguish between integers and floats and we don't support matrix or composite types. Not supporting matrix types was a bit controversial but it's turned out fine. We also have to do a bit of juggling to support hardware that doesn't have native integers because we have to lower integer operations to float and, by that point, we've lost the type information. When working with shaders that come from D3D to OpenGL or Vulkan translators, the type information does more harm than good. I can't count the number of shaders I've seen where they declare vec4 x1 through vec4 x80 at the top and then uintBitsToFloat() and floatBitsToUint() all over everywhere.

We also made adding new ALU ops and intrinsics really easy, but we also added a fairly powerful metadata system for both so the compiler can still reason about them. The lines we drew between ALU ops, intrinsics, texture instructions, and control-flow like break and continue were pretty arbitrary at the time, if we're honest. Texturing was going to be a lot of intrinsics so Connor added an instruction type. That was pretty much it.

The end result, however, has been an IR that's incredibly versatile. It's somehow both a high-level and low-level IR at the same time. When we do SPIR-V to NIR translation, we don't have a separate IR for parsing SPIR-V. We have some data structures to deal with composite types and a handful of other stuff but when we parse SPIR-V opcodes, we go straight to NIR. We've got variables with fairly standard dereference chains (those do support composite types), bindings, all the crazy built-ins like frexp(), and a bunch of other language-level stuff. By the time the NIR shows up in your back-end, however, all that's gone. Crazy built-in functions have been lowered. GL/Vulkan binding with derefs, descriptors, and locations has been turned into byte offsets and indices in a flat binding table. Some drivers have even attempted to emit hardware instructions directly from NIR. (It's never quite worked, but it says a lot that they even tried.)
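To give a concrete feel for what consuming NIR looks like from C, here's a hypothetical little helper that walks a shader and counts texel fetches. It uses real NIR iterators and types (nir_foreach_block, nir_foreach_instr, nir_tex_instr), but it's only a minimal sketch: it assumes Mesa's in-tree nir.h and the API roughly as it looked around the time of this post, so details may differ between Mesa versions.

#include "nir.h"  /* Mesa's src/compiler/nir; not a standalone header */

/* Hypothetical example: count nir_texop_txf (texelFetch) instructions. */
static unsigned
count_texel_fetches(nir_shader *shader)
{
   unsigned count = 0;
   nir_foreach_function(function, shader) {
      if (!function->impl)
         continue;
      /* nir_foreach_block walks the structured control-flow tree in order. */
      nir_foreach_block(block, function->impl) {
         nir_foreach_instr(instr, block) {
            if (instr->type != nir_instr_type_tex)
               continue;
            nir_tex_instr *tex = nir_instr_as_tex(instr);
            if (tex->op == nir_texop_txf)
               count++;
         }
      }
   }
   return count;
}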
The Intel compiler back-end has probably shrunk by half in terms of optimization and lowering passes in the last seven years because we're able to do so much in NIR. We've got code that lowers storage image access with unsupported formats to other image formats or even SSBO access, splitting of vector UBO/SSBO access that's too wide for hardware, workarounds for imprecise trig ops, and a bunch of others. All of the interesting lowering is done in NIR. One reason for this is that Intel has two back-ends, one scalar and one vec4, and any lowering we can do in NIR is lowering that only has to happen once. But, also, it's nice to be able to have the full power of NIR's optimizer run on your lowered code.

As I said earlier, I find the versatility of NIR astounding. We never intended to write an IR that could get that close to hardware. We just wanted SSA for easier optimization writing. But the end result has been absolutely fantastic and has done a lot to accelerate driver development in Mesa.

Conclusion

If you've gotten this far, I both applaud and thank you! NIR has been a lot of fun to build and, as you can probably tell, I'm quite proud of it. It's also been a huge investment involving thousands of man hours, but I think it's been well worth it. There's a lot more work to do, of course. We still don't have the ray-tracing situation where it needs to be and OpenCL-style compute needs some help to be really competent. But it's come an incredibly long way in the last seven years and I'm incredibly proud of what we've built and forever thankful to the many, many developers who have chipped in and fixed bugs and contributed optimization and lowering passes.

Hopefully, this post provides some additional background and explanation for the big question of why Mesa carries its own compiler stack. And maybe, just maybe, someone will get excited enough about it to play around with it and even contribute! One can hope, right?
Posted over 2 years ago
(This post was first published with Collabora on Jan 25, 2022.)

A Pixel's Color

My work on Wayland and Weston color management and HDR support has been full of learning new concepts and terms. Many of them are crucial for understanding how color works. I started out so ignorant that I did not know how to blend two pixels together correctly. I did not even know that I did not know - I was just doing the obvious blend, and that was wrong. Now I think I know what I know and do not know, and I also feel that most developers around window systems and graphical applications are as uneducated as I was. Color knowledge is surprisingly scarce in my field, it seems. It is not enough that I educate myself. I need other people to talk to, to review my work, and to write patches that I will be reviewing. With the hope of making it even a little bit easier to understand what is going on with color, I wrote the article: A Pixel's Color.

The article goes through most of the important concepts, trying to give you, a programmer, a vague idea of what they are. It does not explain everything too well, because I want you to be able to read through it, but it still got longer than I expected. My intention is to tell you about things you might not know about, so that you would at least know what you do not know.

A warm thank you to everyone who reviewed and commented on the article.

A New Documentation Repository

Originally, the Wayland CM&HDR extension merge request included documentation about how color management would work on Wayland. The actual protocol extension specification cannot even begin to explain all that. To make that documentation easier to revise and contribute to, I proposed to move it into a new repository: color-and-hdr. That also allowed us to widen the scope of the documentation, so we can easily include things outside of Wayland: EGL, Vulkan WSI, DRM KMS, and more.

I hope that the color-and-hdr documentation repository gains traction and becomes a community maintained effort in gathering information about color and HDR on Linux, and that we can eventually move it out of my personal namespace to become truly community owned.
Posted over 2 years ago
It's Happening (For Real)

After weeks of hunting for the latest rumors of jekstrand's future job prospects, I've finally done it: zink now supports more extensions than any other OpenGL driver in Mesa. That's right. Check it on mesamatrix if you don't believe me.

A couple days ago I merged support for the external memory extensions that I'd been putting off, and today we got sparse textures thanks to Qiang Yu at AMD doing 99% of the work to plumb the extensions through the rest of Mesa. There's even another sparse texture extension, for which I've already landed all the support in zink, that should be enabled for the upcoming release.

What's Next?

Zink (sometimes) has the performance, now it has the features, so naturally the focus now is going to shift to compatibility and correctness. Kopper is going to mostly take care of the former, which leaves the latter. There aren't a ton of CTS cases failing. Ideally, by the end of the year, there won't be any.
Posted over 2 years ago
Suspicion

The last thing I remember Thursday was trying to get the truth out about Jason Ekstrand's new role. Days have now passed, and I can't remember what I was about to say or what I did over the extended weekend. But Big Triangle sure has been busy. It's clear I was on to something, because otherwise they wouldn't have taken such drastic measures. Look at this: jekstrand is claiming Collabora has hired him. This is clearly part of a larger coverup, and the graphics news media are eating it up. Congratulations to him, sure, but it's obvious this is just another attempt to throw us off the trail. We may never find out what Jason's real new job is, but that doesn't mean we're going to stop following the hints and clues as they accumulate. Sooner or later, Big Triangle is going to slip up, and then we'll all know the truth.

Progress

In the meantime, zink goes on. I've spent quite a long while tinkering with NVIDIA and getting a solid baseline of CTS results. At present, I'm down to about 800 combined fails for GL 4.6 and ES 3.2. Given that lavapipe is at around 80 and RADV is just over 600, both excluding the confidential test suites, this is a pretty decent start. This is probably going to be the last time I'm on NVIDIA for a while, and it hasn't been too bad overall.

The Year's First Rebrand

The (second) biggest news story for today is a rebrand. Copper is being renamed. It will, in fact, be named Kopper to match the zink/vulkan naming scheme. I can't overstate how significant this change is and how massive the ecosystem changes around it will be. Just huge. Like the number of words in this blog post.
Posted over 2 years ago
Hello, Collabora!

Ever since I announced that I was leaving Intel, there's been a lot of speculation as to where I'd end up. I left it a bit quiet over the holidays but, now that we're solidly in 2022, it's time to let it spill. As of January 24, I'll be at Collabora!

For those of you that don't know, Collabora is an open-source consultancy. They sell engineering services to companies who are making devices that run Linux and want to contribute to open-source technologies. They've worked on everything from automotive to gaming consoles to smart TVs to infotainment systems to VR platforms. I'm not an expert on what Collabora has done over the years so I'll refer you to their brag sheet for that. Unlike some contract houses, Collabora doesn't just do engineering for hire. They're also an ideologically driven company that really believes in upstream and invests directly in upstream projects such as Mesa, Wayland, and others.

My personal history with Collabora is as old as my history as an open-source software developer. My first real upstream work was on Wayland in early 2013. I jumped in with a cunning plan for running a graphics-enabled desktop Linux chroot on an Android device and absolutely no idea what I was getting myself into. Two of the people who not only helped me understand the underbelly of Linux window systems but also helped me learn to navigate the world of open-source software were Daniel Stone and Pekka Paalanen, both of whom were at Collabora then and still are today.

After switching to Mesa when I joined Intel in 2014, I didn't interact with Collabora devs quite as much since they mostly stayed in the window-system world and I tried to stay in 3D. In the last few years, however, they've been building up their 3D team and doing some really interesting work. Alyssa Rosenzweig and I have worked quite a bit together on various NIR passes as part of her work on Panfrost and now agx. I also worked with Boris Brezillon and Erik Faye-Lund on some of the CLOn12, GLOn12, and Zink work which layers OpenGL and OpenCL on top of D3D12 and Vulkan. In case you haven't figured it out already from my glowing review, Collabora has some top-notch people who are doing great work and I'm excited to be joining the team and working more closely with them.

So how did this happen? What convinced me to leave the cushy corporate job and join a tiny (compared to Intel) open-source company? It's not been for lack of opportunities. I get pinged by recruiters on LinkedIn on a regular basis and certain teams in the industry have been rather persistent. I've thought quite a lot over the years about where I'd want to go if I ever left Intel. Intel has been my engineering home for 7.5 years and has provided the strange cocktail on which I've built my career: a stable team, well-funded upstream open-source work, fairly cutting edge hardware, and an IHV seat at Khronos. Every place I'd ever considered going would mean losing one or more of those things and, until Collabora, no one had given me a good enough reason to give any of that up.

Back in September, I was chatting on IRC with other Mesa devs about OpenCL, SPIR-V, and some corner-case we were missing in the compiler when the following exchange happened:

11:39 < jenatali> I hope I get time to get back to CL at some point, I hate leaving it half-finished, but stupid corporate priorities mean I have to do other stuff instead :P
11:41 < jekstrand> Yeah... Corporations... Why do we work for them again? Oh, right, so we can afford to eat.
About an hour later, Daniel Stone replied privately:

12:40 hey so if corporations ever get you down, there are always less-corporate options … :)
12:40 timing completely coincidental of course
12:42 Of course...
12:42 I'm always open to new things if the offer is right...

This kicked off the weirdest and most interesting career conversation I've had to date. At first, I didn't believe him. The job he was describing doesn't exist. No one gets that offer. Not unless you're Dave Airlie or Linus Torvalds. But, after multiple 1 – 2 hour video chats, more IRC chatter, and an hour chatting with Philippe Kalaf (Collabora's CEO), they had me convinced. This is real.

So what did Collabora finally offer me that no one else has? Total autonomy.

In my new role at Collabora, my mandate consists of two things: invest in and mentor the Collabora 3D graphics team and invest in upstream Linux and open-source graphics however I see fit. I won't be expected to do any contract work. I may meet with clients from time to time and I'll likely get involved more with the various Collabora-driven Mesa projects but my primary focus will be on ensuring that upstream is healthy. I won't be tied to any one driver or hardware vendor either. Sure, it'd be good to do a bit of Panfrost work so I can help Alyssa out since she's now my coworker and I'll likely still work on Intel drivers a bit since that's my home turf. But, at the end of the day, I'm now free to put my effort wherever it's needed in the stack without concern for corporate priorities. Ray-tracing in RADV? Why not. OpenCL 3.0 for everyone? Sure. Hacking on a new kernel interface for Freedreno? That's fine too. As far as I'm concerned, when it comes to how I spend my engineering effort, I now report directly to upstream. No strings attached.

One of the interesting side-effects of this is how it will affect my role within Khronos. Collabora is a Khronos member so I still plan to be involved there but it will look different. For several years now (as long as RADV has been a competent driver, really), I've always worn two hats at Khronos: Intel and Mesa/Linux. Most of the time, I was representing Intel but there were always those weird awkward moments where I helped out the Igalia team working on V3DV or the RADV team. Now that I'm no longer at a hardware vendor, I can really embrace the role of representing Mesa and Linux upstream within Khronos. This doesn't mean that I'm suddenly going to fix all your Vulkan spec problems overnight but it does mean I'll be paying a bit more attention to the non-Intel drivers and doing what I can to ensure that all the Vulkan drivers in Mesa are in good shape.

Honestly, I'm still in shock that I was offered this role. It's a great testament to Collabora's belief in upstream that they're willing to fund such a role and it shows an incredible amount of faith in my work. At Intel, I was blessed to be able to work upstream as part of my day job, which isn't something most open-source software developers get. To have someone believe in your work so much that they're willing to cut you a paycheck just to keep doing what you're doing is mind-boggling. I'm truly honored and I hope the work I do in the days, months, and years to come will prove that their faith was well placed.

So, what am I going to be working on with my newfound freedom? Do I have any cool new projects planned that are going to turn the industry upside-down? Of course I do! But those are topics for other blog posts.
Posted over 2 years ago
We Need To Talk

It's come to my attention that there's a lot of rumors flying around about what exactly I'm doing aside from posting the latest info about where Jason Ekstrand, who coined the phrase "If it compiles, we should ship it," is going to end up. Everyone knows that jekstrand's next career move is big news—the kind of industry-shaking maneuvering that has every BigCo from Alphabet to Meta on tenterhooks. This post is going to debunk a number of the most common bits of nonsense I've been hearing as well as give some updates about what else I've been doing besides scouring the internet for even the tiniest clue about what's coming for this man's career in 2022.

Is Jason going to Apple to work on a modernized, open source implementation of Mac OS with a new Finder based on Vulkan?

My sources were very keen on this rumor up until Tuesday, when, in an undisclosed IRC channel, Jason himself had the following to say:

Sachiel: Contrary to popular belief, I can't work on every idea in the multiverse simultaneously. I'm limited to the same N dimensions as the rest of you.

This absolutely blew all the existing chatter out of the water. Until now, in the course of working on more sparse texturing extensions, I had the firm impression that we'd be seeing a return to form, likely with a Khronos member company, continuing to work on graphics. But now? With this? Clearly everyone was thinking too small. Everyone except jekstrand himself, who will be taking up a position at CERN devising new display technology for particle accelerators. Or at least, that's what I thought until yesterday.

Is Jason really going to be working at CERN? How well does GPU knowledge translate to theoretical physics?

Unfortunately, this turned out to be bogus, no more than chaff deployed to stop us from getting to the truth because we were too close. Later, while I was pondering how buggy NVIDIA's sparse image functionality was in the latest beta drivers and attempting to pass what few equally buggy CTS cases there were for ARB_sparse_texture2, I stumbled upon the obvious. It's so obvious, in fact, that everyone overlooked it because of how obvious it is. Jason has left Intel and turned in his badge because he's on vacation. As everyone knows, he's the kind of person who literally does not comprehend time in the same way that the rest of us do. It was his assessment of the HR policy that, in order to take time off and leave the office, he had to quit. My latest intel (no pun intended) revealed that managers and executives alike were still scrambling, trying to figure out how to explain the company's vacation policy using SSA-based compiler terminology, but optimizer passes left their attempts to engage him as no-ops. Tragic.

So this whole thing was just a ruse?

I'll be completely honest with you since you've read this far: I've just heard breaking news today. This is so fresh, so hot-off-the-presses that it's almost as difficult to reveal as it is that I've implemented another 4 GL extensions. When the totality of all my MRs are landed, zink will become the GL driver in Mesa supporting the most extensions, and this is likely to be the case for the next release. Shocking, I know. But not nearly as shocking as the fact that Jason is actually starting at Texas Instruments working on Vulkan for graphing calculators. Think about it. Anyone who knows jekstrand even the smallest amount knows how much sense this makes on both sides.
He gets unlimited graphing calculators, and that's all he had to hear before signing the contract. It's that simple.

Graphing Calculators? Does Anyone Even Use Those Anymore?

I know at least one person who does, and it's not Jason Ekstrand. Because in the time that I was writing out the last (and now deprecated) information I had available, there's been more, even later-breaking news. Copper now has a real MR open for it. I realize it's entirely off-topic now to be talking about some measly merge request, but it has the WSI tag on it, which means Jason has no choice but to read through the entire thing. That's because he'll be working for Khronos as the Assistant Deputy Director of Presentation. If there are presentations to be done by anyone in the graphics space, for any reason, they'll have to go through jekstrand first. I don't envy the responsibility and accountability that this sort of role demands; when it comes to shedsmanship, people in the presentation space are several levels above the rest. We can only hope he's up to the challenge.

Or at least, we would if that were actually where he was going, because I've just heard from