
News

Posted about 2 years ago
Subgroup operations or wave intrinsics, such as reducing a value across the threads of a shader subgroup or wave, were introduced in GPU programming languages a while ago. They communicate with other threads of the same wave, for example to exchange the input values of a reduction, but not necessarily with all of them if there is divergent control flow.

In LLVM, we call such operations convergent. Unfortunately, LLVM does not define how the set of communicating threads in convergent operations -- the set of converged threads -- is affected by control flow.

If you're used to thinking in terms of structured control flow, this may seem trivial. Obviously, there is a tree of control flow constructs: loops, if-statements, and perhaps a few others depending on the language. Two threads are converged in the body of a child construct if and only if both execute that body and they are converged in the parent. Throw in some simple and intuitive rules about loop counters and early exits (nested return, break and continue, that sort of thing) and you're done.

In an unstructured control flow graph, the answer is not obvious at all. I gave a presentation at the 2020 LLVM Developers' Meeting that explains some of the challenges as well as a solution proposal that involves adding convergence control tokens to the IR.

Very briefly, convergent operations in the proposal use a token variable that is defined by a convergence control intrinsic. Two dynamic instances of the same static convergent operation from two different threads are converged if and only if the dynamic instances of the control intrinsic producing the used token values were converged.

(The published draft of the proposal talks of multiple threads executing the same dynamic instance. I have since been convinced that it's easier to teach this matter if we instead always give every thread its own dynamic instances and talk about a convergence equivalence relation between dynamic instances.
This doesn't change the resulting semantics.)

The draft has three such control intrinsics: anchor, entry, and (loop) heart. Of particular interest here is the heart. For the most common and intuitive use cases, a heart intrinsic is placed in the header of natural loops. The token it defines is used by convergent operations in the loop. The heart intrinsic itself also uses a token that is defined outside the loop: either by another heart in the case of nested loops, or by an anchor or entry.

The heart combines two intuitive behaviors:

- It uses a token in much the same way that convergent operations do: two threads are converged for their first execution of the heart if and only if they were converged at the intrinsic that defined the used token.
- Two threads are converged at subsequent executions of the heart if and only if they were converged for the first execution and they are currently at the same loop iteration, where iterations are counted by a virtual loop counter that is incremented at the heart.

Viewed from this angle, how about we define a weaker version of these rules that lies somewhere between an anchor and a loop heart? We could call it a "light heart", though I will stick with "iterating anchor". The iterating anchor defines a token but has no arguments. Like for the anchor, the set of converged threads is implementation-defined -- when the iterating anchor is first encountered. When threads encounter the iterating anchor again without leaving the dominance region of its containing basic block, they are converged if and only if they were converged during their previous encounter of the iterating anchor.

The notion of an iterating anchor came up when discussing the convergence behaviors that can be guaranteed for natural loops. Is it possible to guarantee that natural loops always behave in the natural way -- according to their loop counter -- when it comes to convergence?

Naively, this should be possible: just put hearts into loop headers!
Unfortunately, that's not so straightforward when multiple natural loops are contained in an irreducible loop:

Hearts in A and C must refer to a token defined outside the loops; that is, a token defined in E. The resulting program is ill-formed because it has a closed path that goes through two hearts that use the same token, but the path does not go through the definition of that token. This well-formedness rule exists because the rules about heart semantics are unsatisfiable if the rule is broken.

The underlying intuitive issue is that if the branch at E is divergent in a typical implementation, the wave (or subgroup) must choose whether A or C is executed first. Neither choice works. The heart in A indicates that (among the threads that are converged in E) all threads that visit A (whether immediately or via C) must be converged during their first visit of A. But if the wave executes A first, then threads which branch directly from E to A cannot be converged with those that first branch to C. The opposite conflict exists if the wave executes C first.

If we replace the hearts in A and C by iterating anchors, this problem goes away because the convergence during the initial visit of each block is implementation-defined. In practice, it should fall out of which of the blocks the implementation decides to execute first.

So it seems that iterating anchors can fill a gap in the expressiveness of the convergence control design. But are they really a sound addition? There are two main questions:

- Satisfiability: Can the constraints imposed by iterating anchors be satisfied, or can they cause the sort of logical contradiction discussed for the example above? And if so, is there a simple static rule that prevents such contradictions?
- Spooky action at a distance: Are there generic code transforms which change semantics while changing a part of the code that is distant from the iterating anchor?
The second question is important because we want to add convergence control to LLVM without having to audit and change the existing generic transforms. We certainly don't want to hurt compile-time performance by increasing the amount of code that generic transforms have to examine for making their decisions.

Satisfiability

Consider the following simple CFG with an iterating anchor in A and a heart in B that refers back to a token defined in E:

Now consider two threads that are initially converged with execution traces:

E - A - A - B - X
E - A - B - A - X

The heart rule implies that the threads must be converged in B. The iterating anchor rule implies that if the threads are converged in their first dynamic instances of A, then they must also be converged in their second dynamic instances of A, which leads to a temporal paradox.

One could try to resolve the paradox by saying that the threads cannot be converged in A at all, but this would mean that the threads must diverge before a divergent branch occurs. That seems unreasonable, since typical implementations want to avoid divergence as long as control flow is uniform.

The example arguably breaks the spirit of the rule about convergence regions from the draft proposal linked above, and so a minor change to the definition of convergence region may be used to exclude it.

What if the CFG instead looks as follows, which does not break any rules about convergence regions:

For the same execution traces, the heart rule again implies that the threads must be converged in B. The convergence of the first dynamic instances of A is technically implementation-defined, but we'd expect most implementations to be converged there.

The second dynamic instances of A cannot be converged due to the convergence of the dynamic instances of B.
That's okay: the second dynamic instance of A in thread 2 is a re-entry into the dominance region of A, and so its convergence is unrelated to any convergence of earlier dynamic instances of A.

Spooky action at a distance

Unfortunately, we still cannot allow this second example. A program transform may find that the conditional branch in E is constant and the edge from E to B is dead. Removing that edge brings us back to the previous example which is ill-formed. However, a transform which removes the dead edge would not normally inspect the blocks A and B or their dominance relation in detail. The program becomes ill-formed by spooky action at a distance.

The following static rule forbids both example CFGs: if there is a closed path through a heart and an iterating anchor, but not through the definition of the token that the heart uses, then the heart must dominate the iterating anchor.

There is at least one other issue of spooky action at a distance. If the iterating anchor is not the first (non-phi) instruction of its basic block, then it may be preceded by a function call in the same block. The callee may contain control flow that ends up being inlined. Back edges that previously pointed at the block containing the iterating anchor will then point to a different block, which changes the behavior quite drastically. Essentially, the iterating anchor is reduced to a plain anchor.

What can we do about that? It's tempting to decree that an iterating anchor must always be the first (non-phi) instruction of a basic block. Unfortunately, this is not easily done in LLVM in the face of general transforms that might sink instructions or merge basic blocks.

Preheaders to the rescue

We could chew through some other ideas for making iterating anchors work, but that turns out to be unnecessary. The desired behavior of iterating anchors can be obtained by inserting preheader blocks.
The initial example of two natural loops contained in an irreducible loop becomes:

Place anchors in Ap and Cp and hearts in A and C that use the token defined by their respective dominating anchor. Convergence at the anchors is implementation-defined, but relative to this initial convergence at the anchor, convergence inside the natural loops headed by A and C behaves in the natural way, based on a virtual loop counter. The transform of inserting an anchor in the preheader is easily generalized.

To sum it up: We've concluded that defining an "iterating anchor" convergence control intrinsic is problematic, but luckily also unnecessary. The control intrinsics defined in the original proposal are sufficient. I hope that the discussion that led to those conclusions helps illustrate some aspects of the convergence control proposal for LLVM as well as the goals and principles that drove it.
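
The temporal paradox from the Satisfiability discussion can be reproduced with a toy model. The Python sketch below is not LLVM code: the helper names, the representation of a thread as a list of block names, and the "no crossing" test are all assumptions made for illustration. It pairs up the dynamic instances that the heart rule and the iterating anchor rule would force to be converged (assuming the first encounters are converged and the threads never leave the dominance region of A), and then checks whether two forced pairs occur in opposite order in the two threads.

```python
from itertools import combinations

def visits(trace, block):
    """Positions in `trace` at which the thread executes `block`."""
    return [i for i, b in enumerate(trace) if b == block]

def required_pairs(trace1, trace2, heart, iterating_anchor):
    """Pairs (pos in trace1, pos in trace2) that the rules force to be converged.

    Assumption: both threads are converged at the token definition and at
    their first encounter of the iterating anchor, and no thread leaves the
    anchor's dominance region, so the k-th visits are paired up.
    """
    pairs = list(zip(visits(trace1, heart), visits(trace2, heart)))
    pairs += zip(visits(trace1, iterating_anchor),
                 visits(trace2, iterating_anchor))
    return pairs

def has_temporal_paradox(pairs):
    """True if two forced pairs happen in opposite order in the two threads."""
    return any((a1 - b1) * (a2 - b2) < 0
               for (a1, a2), (b1, b2) in combinations(pairs, 2))

thread1 = ["E", "A", "A", "B", "X"]
thread2 = ["E", "A", "B", "A", "X"]
pairs = required_pairs(thread1, thread2, heart="B", iterating_anchor="A")
print(has_temporal_paradox(pairs))  # True
```

The forced pair for B sits at positions (3, 2) while the second visits of A sit at (2, 3), so one thread would have to be converged "before" and "after" the other at the same time. The model deliberately ignores dominance, so it only describes the first example CFG.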
Posted about 2 years ago
They Say An Image Macro Conveys An Entire Day Of Shouting At The Computer
Posted about 2 years ago
A quick reminder: libei is the library for emulated input. It comes as a pair of C libraries, libei for the client side and libeis for the server side. libei has been sitting mostly untouched since the last status update. There are two use-cases we need to solve for input emulation in Wayland - the ability to emulate input (think xdotool, or Synergy/Barrier/InputLeap client) and the ability to capture input (think Synergy/Barrier/InputLeap server). The latter effectively blocked development in libei [1]; until that use-case was sorted there wasn't much point investing too much into libei - after all it may get thrown out as a bad idea. And epiphanies were as elusive as toilet paper and RATs, so nothing much got done. This changed about a week or two ago when the required lightbulb finally arrived, pre-lit from the factory.

So, the solution to the input capturing use-case is going to be a so-called "passive context" for libei. In the traditional [2] "active context" approach for libei we have the EIS implementation in the compositor and a client using libei to connect to that. The compositor sets up a seat or more, then some devices within that seat that typically represent the available screens. libei then sends events through these devices, causing input to appear in the compositor which moves the cursor around. In a typical and simple use-case you'd get a 1920x1080 absolute pointer device and a keyboard with a $layout keymap, libei then sends events to position the cursor and/or happily type away on-screen.

In the "passive context" approach for libei we have the EIS implementation in the compositor and a client using libei to connect to that. The compositor sets up a seat or more, then some devices within that seat that typically represent the physical devices connected to the host computer. libei then receives events from these devices, causing input to be generated in the libei client.
In a typical and simple use-case you'd get a relative pointer device and a keyboard device with a $layout keymap, the compositor then sends events matching the relative input of the connected mouse or touchpad.

The two notable differences are thus: events flow from EIS to libei and the devices don't represent the screen but rather the physical [3] input devices. This changes libei from a library for emulated input to an input event transport layer between two processes. On a much higher level than e.g. evdev or HID and with more contextual information (seats, devices are logically abstracted, etc.). And of course, the EIS implementation is always in control of the events, regardless of which direction they flow. A compositor can implement an event filter or designate a key to break the connection to the libei client. In pseudocode, the compositor's input event processing function will look like this:

    function handle_input_events():
        real_events = libinput.get_events()
        for e in real_events:
            if input_capture_active:
                send_event_to_passive_libei_client(e)
            else:
                process_event(e)

        emulated_events = eis.get_events_from_active_clients()
        for e in emulated_events:
            process_event(e)

Not shown here are the various appropriate filters and conversions in between (e.g. all relative events from libinput devices would likely be sent through the single relative device exposed on the EIS context). Again, the compositor is in control so it would be trivial to implement e.g. capturing of the touchpad only but not the mouse.

In the current design, a libei context can only be active or passive, not both. The EIS context is both, it's up to the implementation to disconnect active or passive clients if it doesn't support those.

Notably, the above only caters for the transport of input events, it doesn't actually make any decision on when to capture events. This is handled by the CaptureInput XDG Desktop Portal [4].
The idea here is that an application like Synergy/Barrier/InputLeap server connects to the CaptureInput portal and requests a CaptureInput session. In that session it can define pointer barriers (left edge, right edge, etc.) and, in the future, maybe other triggers. In return it gets a libei socket that it can initialize a libei context from. When the compositor decides that the pointer barrier has been crossed, it re-routes the input events through the EIS context so they pop out in the application. Synergy/Barrier/InputLeap then converts that to the global position, passes it to the right remote Synergy/Barrier/InputLeap client and replays it there through an active libei context where it feeds into the local compositor.

Because the management of when to capture input is handled by the portal and the respective backends, it can be natively integrated into the UI. Because the actual input events are a direct flow between compositor and application, the latency should be minimal. Because it's a high-level event library, you don't need to care about hardware-specific details (unlike, say, the inputfd proposal from 2017). Because the negotiation of when to capture input is through the portal, the application itself can run inside a sandbox. And because libei only handles the transport layer, compositors that don't want to support sandboxes can set up their own negotiation protocol.

So overall, right now this seems like a workable solution.

[1] "blocked" is probably overstating it a bit but no-one else tried to push it forward, so..
[2] "traditional" is probably overstating it for a project that's barely out of alpha development
[3] "physical" is probably overstating it since it's likely to be a logical representation of the types of inputs, e.g. one relative device for all mice/touchpads/trackpoints
[4] "handled by" is probably overstating it since at the time of writing the portal is merely a draft of an XML file
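
The event routing from the compositor pseudocode earlier in this post can be tried out as a small runnable Python sketch. Nothing here is libei or libeis API: the `ToyCompositor` class, its attributes and the `(device, payload)` event tuples are hypothetical stand-ins, chosen only to show that real events go either to the passive client or to normal processing, while emulated events from active clients are always processed. The optional `capture_filter` illustrates the "touchpad only but not the mouse" idea.

```python
class ToyCompositor:
    """Hypothetical stand-in for a compositor with an EIS implementation."""

    def __init__(self, capture_filter=None):
        self.input_capture_active = False
        # Decides which real events may be captured; default: all of them.
        self.capture_filter = capture_filter or (lambda event: True)
        self.processed = []               # events handled by the compositor
        self.sent_to_passive_client = []  # events forwarded to the libei client

    def handle_input_events(self, real_events, emulated_events):
        for e in real_events:
            if self.input_capture_active and self.capture_filter(e):
                self.sent_to_passive_client.append(e)
            else:
                self.processed.append(e)
        # Events from active clients are always processed locally.
        for e in emulated_events:
            self.processed.append(e)


# Capture only touchpad events; events are (device, payload) tuples.
compositor = ToyCompositor(capture_filter=lambda e: e[0] == "touchpad")
compositor.input_capture_active = True
compositor.handle_input_events(
    real_events=[("touchpad", "rel 3,1"), ("mouse", "rel -2,0")],
    emulated_events=[("emulated-keyboard", "key a")],
)
print(compositor.sent_to_passive_client)
print(compositor.processed)
```

Keeping the filter on the compositor side mirrors the point that the EIS implementation stays in control of the events regardless of direction.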
Posted about 2 years ago
Blogging: That Thing I Forgot About

Yeah, my b, I forgot this was a thing. Fuck it though, I’m a professional, so I’m gonna pretend I didn’t just skip a month of blogs and get right back into it.

Gallivm

Gallivm is the nir/tgsi-to-llvm translation layer in Gallium that LLVMpipe (and thus Lavapipe) uses to generate the JIT functions which make triangles. It’s very old code in that it predates me knowing how triangles work, but that doesn’t mean it doesn’t have bugs. And Gallivm bugs are the worst bugs.

For a long time, I’ve had SIGILL crashes on exactly one machine locally for the CTS glob dEQP-GLES31.functional.program_uniform.by*sampler2D_samplerCube*. These tests pass on everyone else’s machines including CI. Like I said, Gallivm bugs are the worst bugs.

Debugging

How does one debug JIT code? GDB can’t be used, valgrind doesn’t work, and, despite what LLVM developers would tell you, building an assert-enabled LLVM doesn’t help at all in most cases here since that will only catch invalid behavior, not questionably valid behavior that very obviously produces invalid results. So we enter the world of lp_build_print debugging.

Much like standard printf debugging, the strategy here is to just lp_build_print_value or lp_build_printf("I hate this part of the shader too") our way to figuring out where in the shader the crash occurs. Here’s an example shader from dEQP-GLES31.functional.program_uniform.by_pointer.render.basic_struct.sampler2D_samplerCube_vertex that crashes:

    #version 310 es
    in highp vec4 a_position;
    out mediump float v_vtxOut;
    struct structType
    {
        mediump sampler2D m0;
        mediump samplerCube m1;
    };
    uniform structType u_var;

    mediump float compare_float (mediump float a, mediump float b) { return abs(a - b) < 0.05 ? 1.0 : 0.0; }
    mediump float compare_vec4 (mediump vec4 a, mediump vec4 b) { return compare_float(a.x, b.x)*compare_float(a.y, b.y)*compare_float(a.z, b.z)*compare_float(a.w, b.w); }

    void main (void)
    {
        gl_Position = a_position;
        v_vtxOut = 1.0;
        v_vtxOut *= compare_vec4(texture(u_var.m0, vec2(0.0)), vec4(0.15, 0.52, 0.26, 0.35));
        v_vtxOut *= compare_vec4(texture(u_var.m1, vec3(0.0)), vec4(0.88, 0.09, 0.30, 0.61));
    }

Which, in llvmpipe NIR, is:

    shader: MESA_SHADER_VERTEX
    source_sha1: {0xcb00c93e, 0x64db3b0f, 0xf4764ad3, 0x12b69222, 0x7fb42437}
    inputs: 1
    outputs: 2
    uniforms: 0
    shared: 0
    ray queries: 0
    decl_var uniform INTERP_MODE_NONE sampler2D lower@u_var.m0 (0, 0, 0)
    decl_var uniform INTERP_MODE_NONE samplerCube lower@u_var.m1 (0, 0, 1)
    decl_function main (0 params)

    impl main {
        block block_0:
        /* preds: */
        vec1 32 ssa_0 = deref_var &a_position (shader_in vec4)
        vec4 32 ssa_1 = intrinsic load_deref (ssa_0) (access=0)
        vec1 16 ssa_2 = load_const (0xb0cd = -0.150024)
        vec1 16 ssa_3 = load_const (0x2a66 = 0.049988)
        vec1 16 ssa_4 = load_const (0xb829 = -0.520020)
        vec1 16 ssa_5 = load_const (0xb429 = -0.260010)
        vec1 16 ssa_6 = load_const (0xb59a = -0.350098)
        vec1 16 ssa_7 = load_const (0xbb0a = -0.879883)
        vec1 16 ssa_8 = load_const (0xadc3 = -0.090027)
        vec1 16 ssa_9 = load_const (0xb4cd = -0.300049)
        vec1 16 ssa_10 = load_const (0xb8e1 = -0.609863)
        vec2 32 ssa_13 = load_const (0x00000000, 0x00000000) = (0.000000, 0.000000)
        vec1 32 ssa_49 = load_const (0x00000000 = 0.000000)
        vec4 16 ssa_14 = (float16)txl ssa_13 (coord), ssa_49 (lod), 0 (texture), 0 (sampler)
        vec1 16 ssa_15 = fadd ssa_14.x, ssa_2
        vec1 16 ssa_16 = fabs ssa_15
        vec1 16 ssa_17 = fadd ssa_14.y, ssa_4
        vec1 16 ssa_18 = fabs ssa_17
        vec1 16 ssa_19 = fadd ssa_14.z, ssa_5
        vec1 16 ssa_20 = fabs ssa_19
        vec1 16 ssa_21 = fadd ssa_14.w, ssa_6
        vec1 16 ssa_22 = fabs ssa_21
        vec1 16 ssa_23 = fmax ssa_16, ssa_18
        vec1 16 ssa_24 = fmax ssa_23, ssa_20
        vec1 16 ssa_25 = fmax ssa_24, ssa_22
        vec3 32 ssa_27 = load_const (0x00000000, 0x00000000, 0x00000000) = (0.000000, 0.000000, 0.000000)
        vec1 32 ssa_50 = load_const (0x00000000 = 0.000000)
        vec4 16 ssa_28 = (float16)txl ssa_27 (coord), ssa_50 (lod), 1 (texture), 1 (sampler)
        vec1 16 ssa_29 = fadd ssa_28.x, ssa_7
        vec1 16 ssa_30 = fabs ssa_29
        vec1 16 ssa_31 = fadd ssa_28.y, ssa_8
        vec1 16 ssa_32 = fabs ssa_31
        vec1 16 ssa_33 = fadd ssa_28.z, ssa_9
        vec1 16 ssa_34 = fabs ssa_33
        vec1 16 ssa_35 = fadd ssa_28.w, ssa_10
        vec1 16 ssa_36 = fabs ssa_35
        vec1 16 ssa_37 = fmax ssa_30, ssa_32
        vec1 16 ssa_38 = fmax ssa_37, ssa_34
        vec1 16 ssa_39 = fmax ssa_38, ssa_36
        vec1 16 ssa_40 = fmax ssa_25, ssa_39
        vec1 32 ssa_41 = flt32 ssa_40, ssa_3
        vec1 32 ssa_42 = b2f32 ssa_41
        vec1 32 ssa_43 = deref_var &gl_Position (shader_out vec4)
        intrinsic store_deref (ssa_43, ssa_1) (wrmask=xyzw /*15*/, access=0)
        vec1 32 ssa_44 = deref_var &v_vtxOut (shader_out float)
        intrinsic store_deref (ssa_44, ssa_42) (wrmask=x /*1*/, access=0)
        /* succs: block_1 */
        block block_1:
    }

There’s two sample ops (txl), and since these tests only do simple texture() calls, it seems reasonable to assume that one of them is causing the crash. Sticking a lp_build_print_value on the texel values fetched by the sample operations will reveal whether the crash occurs before or after them.

What output does this yield?

    Test case 'dEQP-GLES31.functional.program_uniform.by_pointer.render.basic_struct.sampler2D_samplerCube_vertex'..
    texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
    texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
    texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
    texel 1.43279037e-322 6.95333598e-310 0 1.43279037e-322 1.08694442e-322 1.43279037e-322 1.08694442e-322 0
    [1] 3500332 illegal hardware instruction (core dumped)

Each txl op fetches four values, which means this is the result from the first instruction, but the second one isn’t reached before the crash. Unsurprisingly, this is also the cube sampling instruction, which makes sense given that all the crashes of this type I get are from cube sampling tests.

Now that it’s been determined the second txl is causing the crash, it’s reasonable to assume that the construction of that sampling op is the cause rather than the op itself, as proven by sticking some simple lp_build_printf("What am I doing with my life") calls in just before that op. Indeed, as the printfs confirm, I’m still questioning the life choices that led me to this point, so it’s now proven that the txl instruction itself is the problem.

Cube sampling has a lot of complex math involved for face selection, and I’ve spent a lot of time in there recently. My first guess was that the cube coordinates were bogus. Printing them yielded results:

    Test case 'dEQP-GLES31.functional.program_uniform.by_pointer.render.basic_struct.sampler2D_samplerCube_vertex'..
    texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
    texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
    texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
    texel 6.9008994e-310 0 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617 0.349019617
    cubecoords nan nan nan nan nan nan nan nan
    cubecoords nan nan nan nan nan nan nan nan

These cube coords have more NaNs than a 1960s Batman TV series, so it looks like I was right in my hunch. Printing the cube S-face value next yields more NaNs. My printf search continued a couple more iterations until I wound up at this function:

    static LLVMValueRef
    lp_build_cube_imapos(struct lp_build_context *coord_bld, LLVMValueRef coord)
    {
       /* ima = +0.5 / abs(coord); */
       LLVMValueRef posHalf = lp_build_const_vec(coord_bld->gallivm, coord_bld->type, 0.5);
       LLVMValueRef absCoord = lp_build_abs(coord_bld, coord);
       LLVMValueRef ima = lp_build_div(coord_bld, posHalf, absCoord);
       return ima;
    }

Immediately, all of us multiverse-brain engineers spot something suspicious: this has a division operation with a user-provided divisor. Printing absCoord here yielded all zeroes, which was about where my remaining energy was at this Friday morning, so I mangled the code slightly:

    static LLVMValueRef
    lp_build_cube_imapos(struct lp_build_context *coord_bld, LLVMValueRef coord)
    {
       /* ima = +0.5 / abs(coord); */
       LLVMValueRef posHalf = lp_build_const_vec(coord_bld->gallivm, coord_bld->type, 0.5);
       LLVMValueRef absCoord = lp_build_abs(coord_bld, coord);
       /* avoid div by zero */
       LLVMValueRef sel = lp_build_cmp(coord_bld, PIPE_FUNC_GREATER, absCoord, coord_bld->zero);
       LLVMValueRef div = lp_build_div(coord_bld, posHalf, absCoord);
       LLVMValueRef ima = lp_build_select(coord_bld, sel, div, coord_bld->zero);
       return ima;
    }

And blammo, now that Gallivm could no longer divide by zero, the test was now passing. And so were a lot of others.
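
For anyone who wants to poke at the guarded-division idea without building Mesa, here is a minimal Python sketch of the same per-lane logic. It is only an analogy, not Gallivm code: the function name `cube_imapos` and the use of a plain list as the "vector" are assumptions for illustration. One difference worth noting: the C version above emits the division for every lane and then selects between the quotient and zero, whereas Python raises ZeroDivisionError on a float division by zero, so the sketch has to test the divisor before dividing.

```python
def cube_imapos(coords):
    """Per lane: 0.5 / abs(c) where abs(c) > 0, otherwise 0.0."""
    result = []
    for c in coords:
        abs_c = abs(c)
        # Same condition as the compare-greater-than-zero select in the fix.
        result.append(0.5 / abs_c if abs_c > 0.0 else 0.0)
    return result


print(cube_imapos([2.0, -4.0, 0.0]))  # [0.25, 0.125, 0.0]
```

The all-zero input that tripped up the original function now simply yields zeroes.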
Progress

There’s been some speculation about how close Zink really is to being “useful”, where “useful” is determined by the majesty of passing GL4.6 CTS.

So how close is it? The answer might shock you.

Remaining Lavapipe Fails: 17

    KHR-GL46.gpu_shader_fp64.builtin.mod_dvec2,Fail
    KHR-GL46.gpu_shader_fp64.builtin.mod_dvec3,Fail
    KHR-GL46.gpu_shader_fp64.builtin.mod_dvec4,Fail
    KHR-GL46.pipeline_statistics_query_tests_ARB.functional_primitives_vertices_submitted_and_clipping_input_output_primitives,Fail
    KHR-GL46.tessellation_shader.single.isolines_tessellation,Fail
    KHR-GL46.tessellation_shader.tessellation_control_to_tessellation_evaluation.data_pass_through,Fail
    KHR-GL46.tessellation_shader.tessellation_invariance.invariance_rule3,Fail
    KHR-GL46.tessellation_shader.tessellation_shader_point_mode.points_verification,Fail
    KHR-GL46.tessellation_shader.tessellation_shader_quads_tessellation.degenerate_case,Fail
    KHR-GL46.tessellation_shader.tessellation_shader_tessellation.gl_InvocationID_PatchVerticesIn_PrimitiveID,Fail
    KHR-GL46.tessellation_shader.vertex.vertex_spacing,Fail
    KHR-GL46.texture_barrier.disjoint-texels,Fail
    KHR-GL46.texture_barrier.overlapping-texels,Fail
    KHR-GL46.texture_barrier_ARB.disjoint-texels,Fail
    KHR-GL46.texture_barrier_ARB.overlapping-texels,Fail
    KHR-GL46.texture_swizzle.functional,Fail
    KHR-GL46.tessellation_shader.tessellation_shader_quads_tessellation.inner_tessellation_level_rounding,Crash

Remaining ANV Fails (Icelake): 9

    KHR-GL46.pipeline_statistics_query_tests_ARB.functional_primitives_vertices_submitted_and_clipping_input_output_primitives,Fail
    KHR-GL46.tessellation_shader.single.isolines_tessellation,Fail
    KHR-GL46.tessellation_shader.tessellation_control_to_tessellation_evaluation.data_pass_through,Fail
    KHR-GL46.tessellation_shader.tessellation_invariance.invariance_rule3,Fail
    KHR-GL46.tessellation_shader.tessellation_shader_point_mode.points_verification,Fail
    KHR-GL46.tessellation_shader.tessellation_shader_quads_tessellation.degenerate_case,Fail
    KHR-GL46.tessellation_shader.tessellation_shader_quads_tessellation.inner_tessellation_level_rounding,Fail
    KHR-GL46.tessellation_shader.tessellation_shader_tessellation.gl_InvocationID_PatchVerticesIn_PrimitiveID,Fail
    KHR-GL46.tessellation_shader.vertex.vertex_spacing,Fail

Big Triangle better keep a careful eye on us now.
Posted about 2 years ago
Around 2 years ago, while I was working on tessellation support for llvmpipe and running the heaven benchmark on my Ryzen, I noticed that heaven, despite running slowly, wasn't saturating all the cores. I dug in a bit, and found that llvmpipe, despite threading the rasterization, fragment shading and blending stages, never did anything else while those were happening.

I dug into the code as I clearly remembered seeing a concept of a "scene" into which all the primitives were binned and then dispatched. It turned out the "scene" was always executed synchronously.

At the time I wrote support to allow multiple scenes to exist, so while one scene was executing, the vertex shading and binning for the next scene could execute, and it would be queued up. For heaven at the time I saw some places where it would build 36 scenes. However heaven was still 1fps with tess, and regressions in other areas were rampant, and I mostly left the patches in a branch.

The reason so many things were broken by the patches was that large parts of llvmpipe, and also lavapipe, weren't ready for the async pipeline processing. The concept of a fence after the pipeline finished was there, but wasn't used properly everywhere. A lot of operations assumed there was nothing going on behind the scenes so never fenced. Lots of things like queries broke due to the fact that a query would always be ready in the old model, but now query availability could return unavailable like a real hw driver. Resource tracking existed but was incomplete, so knowing when to flush wasn't always accurate. Presentation was broken due to incorrect waiting both for GL and Lavapipe. Lavapipe needed semaphore support that actually did things as apps used it between the render and present pipeline pieces.

Mesa CI recently got some paraview traces added to it, and I was doing some perf traces with them. Paraview is a data visualization tool, and it generates vertex heavy workloads, as opposed to compositors and even games.
It turned out binning was most of the overhead, and I realized the overlapping series could help this sort of workload. I dusted off the patch series and nailed down all the issues.

Emma Anholt ran some benchmarks on the results with the paraview traces and got

pv-waveletvolume fps +13.9279% +/- 4.91667% (n=15)
pv-waveletcountour fps +67.8306% +/- 11.4762% (n=3)

which seems like a good return on the investment. I've got it all lined up in a merge request and it doesn't break CI anymore, so hopefully I'll get it landed in the next while, once I clean up any misc bits.
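
The overlap described in this post, where the next scene is binned while the previous one is still being rasterized and anything that needs a result waits on a fence, can be sketched with Python threads. This is a toy model and not llvmpipe code: the `Scene` class, the `rasterizer` worker, the `render` function and the use of `sum()` as a stand-in for rasterization are all assumptions made for illustration.

```python
import queue
import threading

class Scene:
    """Hypothetical scene: binned primitives plus a fence for completion."""
    def __init__(self, primitives):
        self.primitives = primitives
        self.result = None
        self.fence = threading.Event()  # set once the scene has executed

def rasterizer(scenes):
    """Worker loop: execute queued scenes in submission order."""
    while True:
        scene = scenes.get()
        if scene is None:               # sentinel: no more scenes
            break
        scene.result = sum(scene.primitives)  # stand-in for rasterization
        scene.fence.set()

def render(frames):
    scenes = queue.Queue()
    worker = threading.Thread(target=rasterizer, args=(scenes,))
    worker.start()
    submitted = []
    for prims in frames:
        scene = Scene(list(prims))  # "vertex shading and binning" happens here
        scenes.put(scene)           # queue it without waiting for earlier scenes
        submitted.append(scene)
    scenes.put(None)
    for scene in submitted:
        scene.fence.wait()          # a query or present has to wait on the fence
    worker.join()
    return [s.result for s in submitted]

print(render([[1, 2], [3], []]))  # [3, 3, 0]
```

Reading `scene.result` without waiting on the fence first is the toy equivalent of the query and presentation bugs mentioned above: in the synchronous model the result was always ready, in the overlapped model it may not be.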
Posted about 2 years ago
Earlier this week, Neil McGovern announced that he is due to be stepping down as the Executive Director of the GNOME Foundation later this year. As the President of the board and Neil’s effective manager together with the Executive Committee, I wanted to take a moment to reflect on his achievements in the past 5 years and explain a little about what the next steps would be.

Since joining in 2017, Neil has overseen a productive period of growth and maturity for the Foundation, increasing our influence both within the GNOME project and the wider Free and Open Source Software community. Here’s a few highlights of what he’s achieved together with the Foundation team and the community:

- Improved public perception of GNOME as a desktop and GTK as a development platform, helping to align interests between key contributors and wider ecosystem stakeholders and establishing an ongoing collaboration with KDE around the Linux App Summit.
- Worked with the board to improve the maturity of the board itself and allow it to work at a more strategic level, instigating staggered two-year terms for directors providing much-needed stability, and established the Executive and Finance committees to handle specific topics and the Governance committees to take a longer-term look at the board’s composition and capabilities.
- Arranged 3 major grants to the Foundation totaling $2M and raised a further $250k through targeted fundraising initiatives.
- Grown the Foundation team to its largest ever size, investing in staff development, and established ongoing direct contributions to GNOME, GTK and Flathub by Foundation staff and contractors.
- Launched and incubated Flathub as an inclusive and sustainable ecosystem for Linux app developers to engage directly with their users, and delivered the Community Engagement Challenge to invest in the sustainability of our contributor base – the Foundation’s largest and most substantial programs outside of GNOME itself since Outreachy.
- Achieved a fantastic resolution for GNOME and the wider community, by negotiating a settlement which protects FOSS developers from patent enforcement by the Rothschild group of non-practicing entities.
- Stood for a diverse and inclusive Foundation, implementing a code of conduct for GNOME events and online spaces, establishing our first code of conduct committee and updating the bylaws to be gender-neutral.
- Established the GNOME Circle program together with the board, broadening the membership base of the foundation by welcoming app and library developers from the wider ecosystem.

Recognizing and appreciating the amazing progress that GNOME has made with Neil’s support, the search for a new Executive Director provides the opportunity for the Foundation board to set the agenda and next high-level goals we’d like to achieve together with our new Executive Director.

In terms of the desktop, applications, technology, design and development processes, whilst there are always improvements to be made, the board’s general feeling is that thanks to the work of our amazing community of contributors, GNOME is doing very well in terms of what we produce and publish. Recent desktop releases have looked great, highly polished and well-received, and the application ecosystem is growing and improving through new developers and applications bringing great energy at the moment. From here, our largest opportunity in terms of growing the community and our user base is being able to articulate the benefits of what we’ve produced to a wider public audience, and deliver impact which allows us to secure and grow new and sustainable sources of funding.

For individuals, we are able to offer an exceedingly high quality desktop experience and a broad range of powerful applications which are affordable to all, backed by a nonprofit which can be trusted to look after your data, digital security and your best interests as an individual.
From the perspective of being a public charity in the US, we also have the opportunity to establish programs that draw upon our community, technology and products to deliver impact such as developing employable skills, incubating new Open Source contributors, learning to program and more. For our next Executive Director, we will be looking for an individual with existing experience in that nonprofit landscape, ideally with prior experience establishing and raising funds for programs that deliver impact through technology, and appreciation for the values that bring people to Free, Open Source and other Open Culture organizations. Working closely with the existing members, contributors, volunteers and whole GNOME community, and managing our relationships with the Advisory Board and other key partners, we hope to find a candidate that can build public awareness and help people learn about, use and benefit from what GNOME has built over the past two decades. Neil has agreed to stay in his position for a 6-month transition period, during which he will support the board in our search for a new Executive Director and support a smooth hand-over. Over the coming weeks we will publish the job description for the new ED, and establish a search committee who will be responsible for sourcing and interviewing candidates to make a recommendation to the board for Neil’s successor – a hard act to follow! I’m confident the community will join me and the board in personally thanking Neil for his 5 years of dedicated service in support of GNOME and the Foundation. Should you have any queries regarding the process, or offers of assistance in the coming hiring process, please don’t hesitate to join the discussion or reach out directly to the board.
Posted about 2 years ago
After roughly 20 years and counting up to 0.40 in release numbers, I've decided to call the next version of the xf86-input-wacom driver the 1.0 release. [1] This cycle has seen a bulk of development (>180 patches) which is roughly as much as the last 12 releases together. None of these patches actually added user-visible features, so let's talk about technical debt and what turned out to be an interesting way of reducing it. The wacom driver's git history goes back to 2002 and the current batch of maintainers (Ping, Jason and I) have all been working on it for one to two decades. It used to be a Wacom-only driver but with the improvements made to the kernel over the years the driver should work with most tablets that have a kernel driver, albeit some of the more quirky niche features will be more limited (but your non-Wacom devices probably don't have those features anyway). The one constant was always: the driver was extremely difficult to test, something common to all X input drivers. Development is a cycle of restarting the X server a billion times, testing is mostly plugging hardware in and moving things around in the hope that you can spot the bugs. On a driver that doesn't move much, this isn't necessarily a problem. That is, until a bug comes along that requires some core rework of the event handling - in the kernel, libinput and, yes, the wacom driver. After years of libinput development, I wasn't really in the mood for the whole "plug every tablet in and test it, for every commit". In a rather caffeine-driven development cycle [2], the driver was separated into two logical entities: the core driver and the "frontend". The default frontend is the X11 one which is now a relatively thin layer around the core driver parts, primarily to translate events into the X Server's API. So, not unlike libinput + xf86-input-libinput in terms of architecture.
In ascii-art:

                                           |
                   +--------------------+  | big giant
/dev/input/event0->| core driver | x11  |->| X server
                   +--------------------+  | process
                                           |

Now, that logical separation means we can have another frontend which I implemented as a relatively light GObject wrapper and is now a library creatively called libgwacom:

                   +-----------------------+  |
/dev/input/event0->| core driver | gwacom  |--| tools or test suites
                   +-----------------------+  |

This isn't a public library or API and it's very much focused on the needs of the X driver so there are some peculiarities in there. What it allows us though is a new wacom-record tool that can hook onto event nodes and print the events as they come out of the driver. So instead of having to restart X and move and click things, you get this:

    $ ./builddir/wacom-record
    wacom-record:
      version: 0.99.2
      git: xf86-input-wacom-0.99.2-17-g404dfd5a
      device:
        path: /dev/input/event6
        name: "Wacom Intuos Pro M Pen"
      events:
      - source: 0
        event: new-device
        name: "Wacom Intuos Pro M Pen"
        type: stylus
        capabilities:
          keys: true
          is-absolute: true
          is-direct-touch: false
          ntouches: 0
          naxes: 6
          axes:
            - {type: x       , range: [    0, 44800], resolution: 200000}
            - {type: y       , range: [    0, 29600], resolution: 200000}
            - {type: pressure, range: [    0, 65536], resolution: 0}
            - {type: tilt_x  , range: [  -64,    63], resolution: 57}
            - {type: tilt_y  , range: [  -64,    63], resolution: 57}
            - {type: wheel   , range: [ -900,   899], resolution: 0}
      ...
      - source: 0
        mode: absolute
        event: motion
        mask: [ "x", "y", "pressure", "tilt-x", "tilt-y", "wheel" ]
        axes: { x: 28066, y: 17643, pressure: 0, tilt: [ -4, 56], rotation: 0, throttle: 0, wheel: -108, rings: [ 0, 0] }

This is YAML which means we can process the output for comparison or just to search for things. A tool to quickly analyse data makes for faster development iterations but it's still a far cry from reliable regression testing (and writing a test suite is a daunting task at best).
But one nice thing about GObject is that it's accessible from other languages, including Python. So our test suite can be in Python, using pytest and all its capabilities, plus all the advantages Python has over C. Most of driver testing comes down to: create a uinput device, set up the driver with some options, push events through that device and verify they come out of the driver in the right sequence and format. I don't need C for that. So there's a pull request sitting out there doing exactly that - adding a pytest test suite for a 20-year old X driver written in C. That this is a) possible and b) a lot less work than expected got me quite unreasonably excited. If you do have to maintain an old C library, maybe consider whether it's possible to do the same because there's nothing like the warm fuzzy feeling a green tick on a CI pipeline gives you.

[1] As scholars of version numbers know, they make as much sense as your stereotypical uncle's facebook opinion, so why not.
[2] The Colombian GDP probably went up a bit
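The testing pattern described above (set up the driver with options, push events, check the sequence and format that come out) might look roughly like this in pytest style. This is only a sketch: `FakeDriver` is a hypothetical in-memory stand-in invented for illustration and is not the libgwacom API, and a real test would write events to a uinput device node instead of calling `push()` directly.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDriver:
    """Hypothetical stand-in for the driver wrapper; NOT the libgwacom API."""

    options: dict
    events: list = field(default_factory=list)

    def push(self, x, y, pressure):
        # A real test would write evdev events to a uinput device node.
        # This fake only records one driver-style event per input.
        self.events.append(
            {
                "event": "motion",
                "mode": self.options.get("mode", "absolute"),
                "x": x,
                "y": y,
                "pressure": pressure,
            }
        )


def test_motion_sequence():
    # Set up the (fake) driver with some options.
    driver = FakeDriver(options={"mode": "absolute"})
    positions = [(10, 20), (11, 21), (12, 22)]

    # Push events through it.
    for x, y in positions:
        driver.push(x, y, pressure=0)

    # Verify they come out in the right sequence and format.
    assert [e["event"] for e in driver.events] == ["motion"] * 3
    assert [(e["x"], e["y"]) for e in driver.events] == positions
    assert all(e["mode"] == "absolute" for e in driver.events)
```

Running pytest on a file containing this would collect and run `test_motion_sequence` automatically.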
Posted about 2 years ago
FOSDEM 2022 took place this past weekend, on February 5th and 6th. It was a virtual event for the second year in a row, but this year the Graphics devroom made a comeback and I participated in it with a talk titled “Fun with border colors in Vulkan”. In the talk, I explained the context and origins behind the VK_EXT_border_color_swizzle extension that was published last year and in which I’m listed as one of the contributors. Big kudos and a big thank you to the FOSDEM organizers once again. FOSDEM is arguably the most important free and open source software conference in Europe and one of the most important FOSS conferences in the world. It’s run entirely by volunteers, doing an incredible amount of work that makes it possible to have hundreds of talks and dozens of different devrooms in the span of two days. Special thanks to the Graphics devroom organizers. For the virtual setup, FOSDEM once again relied on Matrix. It’s great because at Igalia we also use Matrix for our internal communications and, thanks to the federated nature of the service, I could join the FOSDEM virtual rooms using the same interface, client and account I normally use for work. The FOSDEM organizers also let participants create ad-hoc accounts to join the conference, in case they didn’t have a Matrix account previously. Thanks to Matrix widgets, each virtual devroom had its corresponding video stream, which you could also watch freely on their site, embedded in each of the virtual devrooms, so participants wanting to watch the talks and ask questions had everything in a single page. Talks were pre-recorded and submitted in advance, played at the scheduled times, and Jitsi was used for post-talk Q&A sessions, in which moderators and devroom organizers read aloud the most voted questions in the devroom chat. Of course, a conference this big is not without its glitches.
Video feeds from presenters and moderators were sometimes cut automatically by Jitsi, allegedly due to insufficient bandwidth. It also happened to me during my Q&A section while I was using a wired connection on a 300 Mbps symmetric FTTH line. I can only suppose the pipe was not wide enough on the other end to handle dozens of streams at the same time, or Jitsi was playing games as it sometimes does. In any case, audio was flawless. In addition, some of the pre-recorded videos could not be played at the scheduled time, resulting in a black screen with no sound, due to an apparent bug in the video system. It’s worth noting all pre-recorded talks had been submitted, processed and reviewed prior to the conference, so this was an unexpected problem. This happened with my talk and I had to do the presentation live. Fortunately, I had written a script for the talk and could use it to deliver it without issues by sharing my screen with the slides over Jitsi. Finally, one possible improvement for future virtual or mixed editions: the deadline for submitting talk videos was only communicated directly and prominently by email on the day the deadline ended, a couple of weeks before the conference. It was also mentioned in the presenter’s guide that was linked in a previous email message, but an explicit warning a few days or a week before the deadline would have been useful to avoid last-minute rushes and delays submitting talks. In any case, those small problems don’t take away from the great online-only experience we had this year.

Transcription
Another advantage of having a written script for the talk is that I can use it to provide a pseudo-transcription of its contents for those that prefer not to watch a video or are unable to do so. I’ve also summed up the Q&A section at the end below. The slides are available as an attachment in the talk page. Enjoy and see you next year, hopefully in Brussels this time.
Slide 1 (Talk cover)
Hello, my name is Ricardo Garcia. I work at Igalia as part of its Graphics Team, where I mostly work on the CTS project creating new Vulkan tests and fixing existing ones. Sometimes this means I also contribute to the specification text and other pieces of the Vulkan ecosystem. Today I’m going to talk about the story behind the “border color swizzle” extension that was published last year. I created tests for this one and I also participated in its release process, so I’m listed as one of the contributors.

Slide 2 (Sampling in Vulkan)
I’ve already started mentioning border colors, so before we dive directly into the extension let me give you a brief introduction to sampling operations in Vulkan and explain where border colors fit in that. Sampling means reading pixels from an image view and is typically done in the fragment shader, for example to apply a texture to some geometry. In the example you see here, we have an image view with 3 8-bit color components in BGR order and in unsigned normalized format. This means we’ll suppose each image pixel is stored in memory using 3 bytes, with each byte corresponding to the blue, green and red components in that order. However, when we read pixels from that image view, we want to get back normalized floating point values between 0 (for the lowest value) and 1 (for the highest value, i.e. when all bits are 1 and the natural number in memory is 255). As you can see in the GLSL code, the result of the operation is a vector of 4 floating point numbers. Since the image does not have alpha information, it’s natural to think the output vector may have a 1 in the last component, making the color opaque. If the coordinates of the sample operation make us read the pixel represented there, we would get the values you see on the right. It’s also worth noting the sampler argument is a combination of two objects in Vulkan: an image view and a sampler object that specifies how sampling is done.
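To make the conversion described in slide 2 concrete, here is a minimal Python sketch of how one 3-byte BGR unsigned normalized texel could turn into the 4-component float vector the shader sees. The function names and the sample byte values are made up for illustration; real hardware and drivers do this internally.

```python
def unorm8_to_float(byte_value: int) -> float:
    """Convert an 8-bit unsigned normalized value to a float in [0, 1]."""
    return byte_value / 255.0


def sample_bgr8_texel(texel_bytes: bytes) -> tuple:
    """Decode one 3-byte BGR texel into an RGBA float vector."""
    # In memory the bytes are stored in blue, green, red order.
    blue, green, red = (unorm8_to_float(b) for b in texel_bytes)
    # The image has no alpha information, so the result is made opaque.
    return (red, green, blue, 1.0)


# Hypothetical texel: blue=0, green=128, red=255 as stored in memory.
print(sample_bgr8_texel(bytes([0, 128, 255])))
```

The result is always in RGBA order regardless of the in-memory layout, which is what the later steps of the talk rely on.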

Slide 3 (Normalized Coordinates)
Focusing a bit on the coordinates used to sample from the image, the most common case is using normalized coordinates, which means using floating point values between 0 and 1 in each of the image axes, like the 2D case you see on the right. But what happens if the coordinates fall outside that range? That means sampling outside the original image, in points around it like the red marks you see on the right. That depends on how the sampler is configured. When creating it, we can specify a so-called “address mode” independently for each of the 3 texture coordinate axes that may be used (2 in our example).

Slide 4 (Address Mode)
There are several possible address modes. The most common one is probably the one you see here on the bottom left, which is the repeat addressing mode, which applies some kind of modulo operation to the coordinates as if the texture was virtually repeating in the selected axis. There’s also the clamp mode on the top right, for example, which clamps coordinates to 0 and 1 and produces the effect of the texture borders extending beyond the image edge. The case we’re interested in is the one on the top left, which is the border mode. When sampling outside we get a border color, as if the image was surrounded by a virtually infinite frame of a chosen color.

Slide 5 (Border Color)
The border color is specified when creating the sampler, and initially could only be chosen among a restricted set of values: transparent black (all zeros), opaque white (all ones) or the “special” opaque black color, which has a zero in all color components and a 1 in the alpha component. The “custom border color” extension introduced the possibility of specifying arbitrary RGBA colors when creating the sampler.

Slide 6 (Image View Swizzle)
However, sampling operations are also affected by one parameter that’s not part of the sampler object. It’s part of the image view and it’s called the component swizzle.
In the example I gave you before we got some color values back, but that was supposing the component swizzle was the identity swizzle (i.e. color components were not reordered or replaced). It’s possible, however, to specify other swizzles indicating what the resulting final color should be for each of the 4 components: you can reorder the components arbitrarily (e.g. saying the red component should actually come from the original blue one), you can force some of them to be zero or one, you can replicate one of the original components in multiple positions of the final color, etc. It’s a very flexible operation.

Slide 7 (Border Color and Swizzle pt. 1)
While working on the Zink Mesa driver, Mike discovered that the interaction between non-identity swizzle and custom border colors produced different results for different implementations, and wondered if the result was specified at all.

Slide 8 (Border Color and Swizzle pt. 2)
Let me give you an example: you specify a custom border color of 0, 0, 1, 1 (opaque blue) and an addressing mode of clamping to border in the sampler. The image view has this strange swizzle in which the red component should come from the original blue, the green component is always zero, the blue component comes from the original green and the alpha component is not modified. If the swizzle applies to the border color you get red. If it does not, you get blue. Either option is reasonable: if the border color is specified as part of the sampler, maybe you want to get that color no matter which image view you use that sampler on, and expect to always get a blue border. If the border color is supposed to act as if it came from the original image, it should be affected by the swizzle as the normal pixels are and you’d get red.

Slide 9 (Border Color and Swizzle pt. 3)
Jason pointed out the spec laid out the rules in a section called “Texel Input Operations”, which specifies that swizzling should affect border colors, and that non-identity swizzles could be applied to custom border colors without restrictions according to the spec, unlike “opaque black”, which was considered a special value for which non-identity swizzles would result in undefined values.

Slide 10 (Texel Input Operations)
The Texel Input Operations spec section describes what the expected result is according to some steps which are supposed to happen in a defined order. It doesn’t mean the hardware has to work like this. It may need instructions before or after the hardware sampling operation to simulate things happening in the order described there. I’ve simplified and removed some of the steps but if border color needs to be applied we’re interested in the steps we can see in bold, and step 5 (border color applied) comes before step 7 (applying the image view swizzle). I’ll describe the steps with a bit more detail now.

Slide 11 (Coordinate Conversion)
Step 1 is coordinate conversion: this includes converting normalized coordinates to integer texel coordinates for the image view and clamping and modifying those values depending on the addressing mode.

Slide 12 (Coordinate Validation)
Once that is done, step 2 is validating the coordinates. Here, we’ll decide if texel replacement takes place or not, which may imply using the border color. In other sampling modes, robustness features will also be taken into account.

Slide 13 (Reading Texel from Image)
Step 3 happens when the coordinates are valid, and is reading the actual texel from the image. This immediately implies reordering components from the in-memory layout to the standard RGBA layout, which means a BGR image view gets its components immediately put in RGB order after reading.

Slide 14 (Format Conversion)
Step 4 also applies if an actual texel was read from the image and is format conversion.
For example, unsigned normalized formats need to convert pixel values (stored as natural numbers in memory) to floating point values. Our example texel, already in RGB order, results in the values you see on the right.

Slide 15 (Texel Replacement)
Step 5 is texel replacement, and is the alternative to the previous two steps when the coordinates were not valid. In the case of border colors, this means taking the border color and cutting it short so it only has the components present in the original image view, to act as if the border color was actually part of the image. Because this happens after the color components have already been reordered, the border color is always specified in standard red, green, blue and alpha order when creating the sampler. The fact that the original image view was in BGR order is irrelevant for the border color. We care about the alpha component being missing, but not about the in-memory order of the image view. Our transparent blue border is converted to just “blue” in this step.

Slide 16 (Expansion to RGBA)
Step 6 takes us back to a unified flow of steps: it applies to the color no matter where it came from. The color is expanded to always have 4 components as expected in the shader. Missing color components are replaced with zeros and the alpha component, if missing, is set to one. Our original transparent blue border is now opaque blue.

Slide 17 (Component Swizzle)
Step 7, finally the swizzle is applied. Let’s suppose our image view had that strange swizzle in which the red component is copied from the original blue, the green component is set to zero, the blue one is set to one and the alpha component is not modified. Our original transparent blue border is now opaque magenta.

Slide 18 (VK_EXT_custom_border_color)
So we had this situation in which some implementations swizzled the border color and others did not. What could we do?
We could double down on the existing spec and ask vendors to fix their implementations, but what happens if they cannot fix them? Or if the fix is impractical due to its impact on performance? Unfortunately, that was the actual situation: some implementations could not be fixed. After discovering this problem, CTS tests were going to be created for these cases. If an implementation failed to behave as mandated by the spec, it wouldn’t pass conformance, so those implementations only had one way out: stop supporting custom border colors, but that’s also a loss for users if those implementations are in widespread use (and they were). The second option is backpedaling a bit, making behavior undefined unless some other feature is present and designing a mechanism that would allow custom border colors to be used with non-identity swizzles at least in some of the implementations.

Slide 19 (VK_EXT_border_color_swizzle)
And that’s how the “border color swizzle” extension was created last year. Custom colors with non-identity swizzle produced undefined results unless the borderColorSwizzle feature was available and enabled. Some implementations could advertise support for this almost “for free” and others could advertise lack of support for this feature. In the middle ground, some implementations can indicate they support the case, but the component swizzle has to be indicated when creating the sampler as well. So it’s both part of the image view and part of the sampler. Samplers created this way can only be used with image views having a matching component swizzle (which means they are no longer generic samplers). The drawback of this extension, apart from the obvious observation that it should’ve been part of the original custom border color extension, is that it somehow lowers the bar for applications that want to use a single code path for every vendor. If borderColorSwizzle is supported, it’s always legal to pass the swizzle when creating the sampler.
Some implementations will need it and the rest can ignore it, so the unified code path is now harder or more specific. And that’s basically it. Sometimes the Vulkan Working Group in Khronos has had to backpedal and mark as undefined something that previous versions of the Vulkan spec considered defined. It’s not frequent nor ideal, but it happens. But it usually does not go as far as publishing a new extension as part of the fix, which is why I considered this interesting.

Slide 20 (Questions?)
Thanks for watching! Let me know if you have any questions.

Q&A Section
Martin: The first question is from "ancurio" and he’s asking if swizzling is implemented in hardware.
Me: I don’t work on implementations so take my answer with a grain of salt. It’s my understanding you can usually program that in hardware and the hardware does the swizzling for you. There may be implementations which need to do the swizzling in software, emitting extra instructions.
Martin: Another question from "ancurio". When you said lowering the bar do you mean raising it?
I explain that, yes, I meant to say raising the bar for the application. Note: I meant to say that it lowers the bar for the specification and API, which means a more complicated solution has been accepted.
Martin: "enunes" asks if this was originally motivated by some real application bug or by something like conformance tests/spec disambiguation?
I explain it has both factors. Mike found the problem while developing Zink, so a real application hit the problematic case, and then the Vulkan Working Group inside Khronos wanted to fix this, make the spec clear and provide a solution for apps that wanted to use non-identity swizzle with border colors, as it was originally allowed.
Martin: no more questions in the room but I have one more for you. How was your experience dealing with Khronos coordinating with different vendors and figuring out what was the acceptable solution for everyone?
I explain that the main driver behind the extension in Khronos was Piers Daniell from NVIDIA (NB: listed as the extension author). I mention that my experience was positive, that the Working Group is composed of people who are really interested in making a good specification and implementations that serve app developers. When this problem was detected I created some tests that worked as a poll to see which vendors could make this work easily and what others may need to make this work if at all. Then, this was discussed in the Working Group, a solution was proposed (the extension), then more vendors reviewed and commented on that, then tests were adapted to the final solution, and finally the extension was published.
Martin: How long did this whole process take?
Me: A few months. Take into account the Working Group does not meet every day, and they have a backlog of issues to discuss. Each of the previous steps takes several weeks, so you end up with a few months, which is not bad.
Martin: Not bad at all.
Me: Not at all, I think it works reasonably well.
Martin: Especially when you realize you broke something and the specification needs fixing. Definitely decent.
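For readers who prefer code to slides, steps 5 to 7 of the Texel Input Operations as described in the talk (texel replacement, expansion to RGBA, component swizzle) can be sketched in a few lines of Python. This is an illustrative model only: the function names and the string-based swizzle encoding are assumptions made up for this sketch, not Vulkan API.

```python
ZERO, ONE = "zero", "one"  # swizzle selectors that force a constant
R, G, B, A = "r", "g", "b", "a"  # swizzle selectors that copy a component


def replace_texel(border_rgba, view_components):
    """Step 5: keep only the border components the image view actually has."""
    return {c: v for c, v in zip("rgba", border_rgba) if c in view_components}


def expand_to_rgba(color):
    """Step 6: missing color components become 0, missing alpha becomes 1."""
    return {
        "r": color.get("r", 0.0),
        "g": color.get("g", 0.0),
        "b": color.get("b", 0.0),
        "a": color.get("a", 1.0),
    }


def apply_swizzle(rgba, swizzle):
    """Step 7: build the final color from the image view's component swizzle."""
    constants = {ZERO: 0.0, ONE: 1.0}
    return tuple(constants[s] if s in constants else rgba[s] for s in swizzle)


def border_texel(border_rgba, view_components, swizzle):
    """Run steps 5, 6 and 7 in the order the spec section describes."""
    color = replace_texel(border_rgba, view_components)
    return apply_swizzle(expand_to_rgba(color), swizzle)


# Slides 15 to 17: transparent blue border, image view without alpha,
# swizzle (red <- blue, green <- 0, blue <- 1, alpha unchanged).
print(border_texel((0.0, 0.0, 1.0, 0.0), "rgb", (B, ZERO, ONE, A)))
```

With the slide 17 swizzle the transparent blue border comes out as `(1.0, 0.0, 1.0, 1.0)`, opaque magenta, and with the slide 8 swizzle an opaque blue border comes out red, matching the talk. An implementation that skips step 7 for border colors would return blue instead, which is exactly the divergence the extension had to address.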
Posted about 2 years ago
(I nearly went with clutterectomy, but that would be doing our old servant project a disservice.)

Yesterday, I finally merged the work-in-progress branch porting totem to GStreamer's GTK GL sink widget, undoing a lot of the work done in 2011 and 2014 to port the video widget and then to finally make use of its features.

But GTK has been modernised (in GTK3 but in GTK4 even more so), GStreamer grew a collection of GL plugins, Wayland and VA-API matured and clutter (and its siblings clutter-gtk, and clutter-gst) didn't get the resources they needed to follow.

[Screenshot: practically no changes, as expected]

The list of bug fixes and enhancements is substantial:

    Makes some files that threw shaders warnings playable
    Fixes resize lag for the widgets embedded in the video widget
    Fixes interactions with widgets on some HDR capable systems, or even widgets disappearing sometimes (!)
    Gets rid of the floating blank windows under Wayland
    Should help with tearing, although that's highly dependent on the system
    Hi-DPI support
    Hardware acceleration (through libva)

Until the port to GTK4, we expect an overall drop in performance on systems where there's no VA-API support, and the GTK4 port should bring it to par with the fastest of players available for GNOME.

You can install a Preview version right now by running:

    $ flatpak install --user https://flathub.org/beta-repo/appstream/org.gnome.Totem.Devel.flatpakref

and file bugs in the GNOME GitLab.

Next stop, a GTK4 port!
Posted about 2 years ago
22.0

I always do one of these big roundups for each Mesa release, so here’s what you can expect to see from zink in the upcoming release:

    fewer hangs on RADV
    massively improved usability on NVIDIA
    greatly improved performance with unsupported texture download formats (e.g., CS:GO, L4D2)
    more extensions: ARB_sparse_texture, ARB_sparse_texture2, ARB_sparse_texture_clamp, EXT_memory_object, EXT_memory_object_fd, GL_EXT_semaphore, GL_EXT_semaphore_fd
    ~1000% improved glxgears performance (be sure to run with --i-know-this-is-not-a-benchmark to see the real speed)
    tons and tons and tons of bug fixes

All around looking like another great release.

I Hate gl_PointSize And So Can You

Yes, we’re here. After literally years of awfulness, I’ve finally solved (for good) the debacle that is point size conversion from GL to Vulkan. What’s so awful about it, you might be asking. How hard can it be to just add gl_PointSize to a shader, you follow up with as you push your glasses higher up your nose. Allow me to explain. In Vulkan, there is exactly one method for setting the size of points: the gl_PointSize shader output controls it, and that’s it.

In OpenGL (core profile):

    14.4 Points
    If program point size mode is enabled, the derived point size is taken from the (potentially clipped) shader built-in gl_PointSize written by the last vertex processing stage and clamped to the implementation-dependent point size range. If the value written to gl_PointSize is less than or equal to zero, or if no value was written to gl_PointSize, results are undefined. If program point size mode is disabled, the derived point size is specified with the command
        void PointSize( float size );

    11.2.3.4 Tessellation Evaluation Shader Outputs
    Tessellation evaluation shaders have a number of built-in output variables used to pass values to equivalent built-in input variables read by subsequent shader stages or to subsequent fixed functionality vertex processing pipeline stages.
    These variables are gl_Position, gl_PointSize, gl_ClipDistance, and gl_CullDistance, and all behave identically to equivalently named vertex shader outputs.

    11.3.4.5 Geometry Shader Outputs
    The built-in output gl_PointSize, if written, holds the size of the point to be rasterized, measured in pixels

In short, if PROGRAM_POINT_SIZE is enabled, then points are sized based on the gl_PointSize shader output of the last vertex stage.

In OpenGL ES (versions 2.0, 3.0, 3.1):

    (3.3 | 3.4 | 13.3) Points
    The point size is taken from the shader built-in gl_PointSize written by the vertex shader, and clamped to the implementation-dependent point size range.

In OpenGL ES (version 3.2):

    13.5 Points
    The point size is determined by the last vertex processing stage. If the last vertex processing stage is not a vertex shader, the point size is 1.0. If the last vertex processing stage is a vertex shader, the point size is taken from the shader built-in gl_PointSize written by the vertex shader, and is clamped to the implementation-dependent point size range.

Thus for an ES context, the point size always comes from the last vertex stage, which means it can be anything it wants to be if that stage is a vertex shader and cannot be written to for all other stages because it is not a valid output (this last, bolded part is going to be really funny in a minute or two).

What do the specs agree on?

    If a vertex shader is the last vertex stage, it can write gl_PointSize

Literally that’s it. Awesome.

Zink

As we know, Vulkan has a very simple and clearly defined model for point size:

    The point size is taken from the (potentially clipped) shader built-in PointSize written by:
    • the geometry shader, if active;
    • the tessellation evaluation shader, if active and no geometry shader is active;
    • the vertex shader, otherwise
    - 27.10. Points

It really can be that simple. So one would think that we can just hook up some conditionals based on the GL rules and then export the correct value.
That would be easy. Simple. It would make sense.

HAHA hahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahahaha

XFB

It gets worse (obviously).

gl_PointSize is a valid XFB varying, which means it must be exported correctly to the transform feedback buffer. For the ES case, it’s simple, but for desktop GL, there’s a little something called PROGRAM_POINT_SIZE state which totally fucks that up.

Because, as we know, Vulkan has exactly one way of setting point size, and it’s the shader variable.
Thus, if there is a desktop GL context with PROGRAM_POINT_SIZE disabled using a vertex shader as its last vertex stage for a draw, and if that shader has its own gl_PointSize value, this value must be exported for XFB.

But not used for point rasterization.

It’s Actually Even Worse Than That

…Because in order to pass CTS for ES 3.2, your implementation also has to be able to violate spec.

Remember above when I said it was going to be funny that gl_PointSize is not a legal output for non-vertex stages in ES contexts?

CTS explicitly has “wide points” tests which verify illegal point sizes that are exported by the tessellation and geometry shader stages.

Isn’t that cool?

Also, let’s be reasonable people for a moment, who actually wants a point that’s just one pixel? Nobody can see that on their 8k display.

To Sum Up

I hate GL point size, and so should you.
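For the record, here’s roughly what the XFB and CTS complaints add up to, as a sketch and nothing resembling zink’s real code. PointSizeOutputs and required_outputs are hypothetical names, clamping is ignored, and None means "undefined or unspecified". The point is that one shader-written value can have to feed two different consumers.

```python
from typing import NamedTuple, Optional


class PointSizeOutputs(NamedTuple):
    raster: Optional[float]  # value that must reach Vulkan's PointSize output
    xfb: Optional[float]     # value that must land in the XFB buffer


def required_outputs(is_desktop_gl: bool,
                     program_point_size: bool,
                     shader_size: Optional[float],
                     state_size: float = 1.0,
                     last_stage_is_vertex: bool = True) -> PointSizeOutputs:
    """What a GL-on-Vulkan layer has to produce for one draw (illustrative)."""
    # Transform feedback captures whatever the shader itself wrote.
    xfb = shader_size

    if is_desktop_gl:
        # Desktop GL: the shader value only rasterizes when
        # PROGRAM_POINT_SIZE is on; otherwise glPointSize() state is used.
        raster = shader_size if program_point_size else state_size
    elif shader_size is not None:
        # ES: honor a written size from any stage, because the CTS
        # "wide points" tests expect it even for tess/geometry stages.
        raster = shader_size
    else:
        # ES, nothing written: 1.0 for non-vertex last stages per ES 3.2.
        raster = None if last_stage_is_vertex else 1.0

    return PointSizeOutputs(raster=raster, xfb=xfb)
```

With PROGRAM_POINT_SIZE disabled on desktop GL, a shader that writes 8.0 while glPointSize() is 2.0 needs raster=2.0 and xfb=8.0 at the same time, which is exactly the case a single Vulkan shader output can’t express on its own.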