Posted over 6 years ago
We are happy to announce the release of bzip2 1.0.8.
This is a fixup release: the CVE-2019-12900 fix in bzip2 1.0.7 was too strict and might have prevented decompression of some files that earlier bzip2 versions could decompress. It also contains a few more patches from various distros and forks.
bzip2 1.0.8 contains the following fixes:
Accept as many selectors as the file format allows. This relaxes the fix for CVE-2019-12900 from 1.0.7 so that bzip2 allows decompression of bz2 files that use (too) many selectors again.
Fix handling of large (> 4GB) files on Windows.
Cleanup of bzdiff and bzgrep scripts so they don’t use any bash extensions and handle multiple archives correctly.
There is now a bz2-files testsuite at https://sourceware.org/git/bzip2-tests.git
Patches by Joshua Watt, Mark Wielaard, Phil Ross, Vincent Lefevre, Led and Kristýna Streitová.
This release also finalizes the move of bzip2 to a community maintained project at https://sourceware.org/bzip2/
Git repository
Public (developer) mailing list: <[email protected]>. To subscribe, send email to <[email protected]>. You do not have to be subscribed to send messages to the list.
Bug tracker
Documentation
Latest and historical downloads (ftp or https)
Extended testsuite
Thanks to Bhargava Shastry, bzip2 is now also part of oss-fuzz, to catch fuzzing issues early and (hopefully not) often.
We are happy to announce the release of bzip2 1.0.7.
This is an emergency release because the old bzip2 website is gone and there were outstanding security issues. The original bzip2 home, downloads and documentation can now be found at: https://sourceware.org/bzip2/
bzip2 1.0.7 contains only the following bug/security fixes:
Fix undefined behavior in the macros SET_BH, CLEAR_BH, & ISSET_BH
bzip2: Fix return value when combining --test (-t) and -q.
bzip2recover: Fix buffer overflow for large argv[0]
bzip2recover: Fix use after free issue with outFile (CVE-2016-3189)
Make sure nSelectors is not out of range (CVE-2019-12900)
A future 1.1.x release is being prepared by Federico Mena Quintero, which will include more fixes, an updated build system and possibly an updated SONAME default.
Please read his blog for more background on this.
NOTE/WARNING: There has been a report that the CVE-2019-12900 fix prevents decompression of some (buggy lbzip2-compressed) files that bzip2 1.0.6 could decompress. See the discussion on the bzip2-devel mailing list. There is a proposed workaround now.
In this miniseries, I’d like to introduce a couple of new developments of the Shenandoah GC that are upcoming in JDK 13. This here is about a new architecture and a new operating system that Shenandoah will be working with.
Solaris
Only a few days ago, Bellsoft contributed a change that allowed Shenandoah to build and run on Solaris. Shenandoah itself has zero operating-system-specific code in it, and is therefore relatively easy to port to new operating systems. In this case, it mostly amounted to a batch of fixes to make the Solaris compiler happy, like removing trailing commas in enums.
One notable gotcha that we hit was with Solaris 10. Contrary to what later versions of Solaris do, and what basically all other relevant operating systems do, Solaris 10 maps user memory to upper address ranges, e.g. to addresses starting with 0xff… instead of 0x7f…. Other operating systems reserve the upper half of the address space for kernel memory. This conflicted with an optimization of Shenandoah’s task queues, which would encode pointers on the assumption that there is some spare space in the upper address range. It was easy enough to disable via a build-time flag, and so Aleksey did. The fix is totally internal to Shenandoah GC and does not affect the representation of Java references in the heap. With this change, Shenandoah can be built and run on Solaris 10 and newer (and possibly older, but we haven’t tried). This is not only interesting for folks who want Shenandoah to run on Solaris, but also for us, because it requires the extra bit of cleanliness that makes non-mainline toolchains happy.
The changes for Solaris support are already in JDK 13 development repositories, and are in-fact already backported to Shenandoah’s JDK 11 and JDK 8 backports repositories.
x86_32
Shenandoah used to support x86_32 in “passive” mode a long time ago. This mode relies only on stop-the-world GC to avoid implementing barriers (basically, it runs Degenerated GC all the time). It was an interesting mode for seeing the footprint numbers you can get with uncommits and slimmer native pointers in really small microservice-size VMs. This mode was dropped before integration upstream, because many Shenandoah tests expect all heuristics/modes to work properly, and the rudimentary x86_32 support was breaking tier1 tests. So we disabled it.
Today, thanks to load-reference-barriers and the elimination of the separate forwarding pointer slot, we have a significantly simplified runtime interface, and we can build fully concurrent x86_32 support on top of that. This allows us to maintain 32-bit cleanness in the Shenandoah code (we have fixed >5 bugs ahead of this change!), and it serves as proof of concept that Shenandoah can be implemented on 32-bit platforms. It is interesting in scenarios where the extra footprint savings are important, like containers or embedded systems. The combination of LRB + no forwarding pointer + 32-bit support gives us the current lower bound for the footprint that is possible with Shenandoah.
The changes for x86_32 support are done and ready to be integrated into JDK 13. However, they are currently waiting for the elimination-of-forwarding-pointer change, which in turn is waiting for a nasty C2 bug fix. The plan is to later backport it to the Shenandoah JDK 11 and JDK 8 backports, after the load-reference-barriers and elimination-of-forwarding-pointer changes have been backported.
Other arches and OSes
With those two additions to OS and architecture support, Shenandoah will soon be available (i.e. known to build and run) on four operating systems: Linux, Windows, macOS and Solaris, plus three architectures: x86_64, arm64 and x86_32. Given Shenandoah’s design with zero OS-specific code, and not overly complex architecture-specific code, we may be looking at more OSes or architectures joining the flock in future releases, if anybody finds it interesting enough to implement.
As always, if you don’t want to wait for releases, you can already have everything and help sort out problems: check out The Shenandoah GC Wiki.
In this miniseries, I’d like to introduce a couple of new developments of the Shenandoah GC that are upcoming in JDK 13. The change I want to talk about here addresses another very frequent, perhaps *the* most frequent concern about Shenandoah GC:
the need for an extra word per object. Many believe this is a core requirement for Shenandoah, but it is actually not, as you will see below.
Let’s first look at the usual object layout of an object in the Hotspot JVM:
0: [mark-word ]
8: [class-word ]
16: [field 1 ]
24: [field 2 ]
32: [field 3 ]
Each section here marks a heap-word. That would be 64 bits on 64 bit architectures and 32 bits on 32 bit architectures.
The first word is the so-called mark-word, or header of the object. It is used for a variety of purposes: it can keep the hash-code of an object, it has 3 bits that are used for various locking states, some GCs use it to track object age and marking status, and it can be ‘overlaid’ with a pointer to the ‘displaced’ mark, to an ‘inflated’ lock or, during GC, the forwarding pointer.
The second word is reserved for the klass-pointer. This is simply a pointer to the Hotspot-internal data-structure that represents the class of the object.
Arrays would have an additional word next to store the arraylength. What follows afterwards is the actual ‘payload’ of the object, i.e. fields and array elements.
When running with Shenandoah enabled, the layout would look like this instead:
-8: [fwd pointer]
0: [mark-word ]
8: [class-word ]
16: [field 1 ]
24: [field 2 ]
32: [field 3 ]
The forward pointer is used for Shenandoah’s concurrent evacuation protocol:
Normally it points to itself -> the object is not evacuated yet
When evacuating (by the GC or via a write-barrier), we first copy the object, then install the new forwarding pointer to that copy using an atomic compare-and-swap; if another thread won the race, this yields a pointer to that thread’s copy instead. Only one copy wins.
Now, the canonical copy to read-from or write-to can be found simply by reading this forwarding pointer.
The advantage of this protocol is that it’s simple and cheap. The cheap aspect is important here, because, remember, Shenandoah needs to resolve the forwardee for every single read or write, even primitive ones. And using this protocol, the read-barrier for this would be a single instruction:
mov -8(%rax), %rax
That’s about as simple as it gets.
The disadvantage is obviously that it requires more memory. In the worst case, for objects without any payload, that is one more word for an otherwise two-word object: 50% more. With more realistic object size distributions, you’d still end up with 5%-10% overhead, YMMV. This also results in reduced performance: allocating the same number of objects hits the heap ceiling faster than it would without that overhead, prompting GC cycles more often and therefore reducing throughput.
If you’ve read the above paragraphs carefully, you will have noticed that the mark-word is also used/overlaid by some GCs to carry the forwarding pointer. So why not do the same in Shenandoah? The answer is (or used to be) that reading the forwarding pointer then requires a little more work: we need to somehow distinguish a true mark-word from a forwarding pointer. That is done by setting the lowest two bits in the mark-word. Those are usually used as locking bits, but the combination 0b11 is not a legal combination of lock bits. In other words: when they are set, the mark-word, with the lowest bits masked to 0, is to be interpreted as a forwarding pointer. This decoding of the mark word is significantly more complex than the simple read of the forwarding pointer shown above. I did in fact build a prototype a while ago, and the additional cost of the read-barriers was prohibitive and did not justify the savings.
All of this changed with the recent arrival of load reference barriers:
We no longer require read-barriers, especially not on (very frequent) primitive reads
The load-reference-barriers are conditional, which means their slow-path (actual resolution) is only activated when 1. GC is active and 2. the object in question is in the collection set. This is fairly infrequent. Compare that to the previous read-barriers which would be always-on.
We no longer allow any access to from-space copies. The strong invariant guarantees us that we only ever read from and write to to-space copies.
Two consequences follow: since the from-space copy is not actually used for anything anymore, we can use that space for the forwarding pointer instead of reserving an extra word for it. We can basically nuke the whole contents of the from-space copy and put the forwarding pointer anywhere; we only need to be able to distinguish between ‘not forwarded’ (in which case we don’t care about the other contents) and ‘forwarded’ (in which case the rest is the forwarding pointer).
It also means that the actual mid- and slow-paths of the load-reference-barriers are not all that hot, and we can easily afford to do a little bit of decoding there. It amounts to something like (in pseudocode):
oop decode_forwarding(oop obj) {
  mark m = obj->load_mark();
  if ((m & 0b11) == 0b11) {
    return (oop) (m & ~0b11);
  } else {
    return obj;
  }
}
While this looks noticeably more complicated than the simple load of the forwarding pointer above, it is still basically a free lunch, because it is only ever executed in the not-very-hot mid-path of the load-reference-barrier. With this, the new object layout would be:
0: [mark word (or fwd pointer)]
8: [class word]
16: [field 1]
24: [field 2]
32: [field 3]
Doing so has a number of advantages:
Obviously, it reduces Shenandoah’s memory footprint by doing away with the extra word.
Not quite as obviously, this results in increased throughput: we can now allocate more objects before hitting the GC trigger, resulting in fewer cycles spent in actual GC.
Objects are packed more tightly, which reduces pressure on the CPU caches.
Again, the required GC interfaces are simpler: where we used to need special implementations of the allocation paths (to reserve and initialize the extra word), we can now use the same allocation code as any other GC.
To give you an idea of the throughput improvements: all the GC-sensitive benchmarks that I have tried showed gains between 10% and 15%. Others benefited less or not at all, but that is not surprising for benchmarks that don’t do any GC at all. It is important to note that the extra decoding cost does not actually show up anywhere; it is basically negligible. It probably would show up on heavily evacuating workloads, but most applications don’t evacuate that much, and most of the work is done by GC threads anyway, making mid-path decoding cheap enough.
The implementation of this has recently been pushed to the shenandoah/jdk repository. We are currently shaking out one last known bug, and then it’s ready to go upstream into the JDK 13 repository. The plan is to eventually backport it to Shenandoah’s JDK 11 and JDK 8 backports repositories, and from there into RPMs. If you don’t want to wait, you can already have it: check out The Shenandoah GC Wiki.
In this miniseries, I’d like to introduce a couple of new developments of the Shenandoah GC that are upcoming in JDK 13. Perhaps the most significant change, even though not directly user-visible, is the switch of Shenandoah’s barrier model to load reference barriers. It resolves one major point of criticism against Shenandoah: its expensive primitive read-barriers.
Shenandoah (as well as other collectors) employs barriers in order to ensure heap consistency. More specifically, Shenandoah GC employs barriers to ensure what we call the ‘to-space invariant’. What it means is this: when Shenandoah is collecting, it copies objects from so-called ‘from-space’ to ‘to-space’, and it does so while Java threads are running (concurrently). This means that there may be two copies of any object floating around in the JVM. In order to maintain heap consistency, we need to ensure one of:
writes happen into to-space copy + reads can happen from both copies, subject to memory model constraints = weak to-space invariant
writes and reads always happen into/from the to-space copy = strong to-space invariant
And the way we ensure that is by employing the corresponding type of barriers whenever reads and writes happen. Consider this pseudocode:
void example(Foo foo) {
  Bar b1 = foo.bar; // Read
  while (..) {
    Baz baz = b1.baz; // Read
    b1.x = makeSomeValue(baz); // Write
  }
}
Employing the Shenandoah barriers, it would look like this (what the JVM+GC would do under the hood):
void example(Foo foo) {
  Bar b1 = readBarrier(foo).bar; // Read
  while (..) {
    Baz baz = readBarrier(b1).baz; // Read
    X value = makeSomeValue(baz);
    writeBarrier(b1).x = readBarrier(value); // Write
  }
}
I.e. wherever we read from an object, we first resolve the object via a read-barrier, and wherever we write to an object, we possibly copy the object to to-space first. I won’t go into the details of this here; let’s just say that both operations are somewhat costly. Notice also that we need a read-barrier on the value of the write here, to ensure we only ever write to-space references into fields while heap references are being updated (another nuisance of Shenandoah’s old barrier model).
Seeing that those barriers are a costly affair, we worked quite hard to optimize them. A very important optimization is to hoist barriers out of loops. We see that b1 is defined outside the loop, but only used inside the loop. We can just as well do the barriers outside the loop, once, instead of many times inside the loop:
void example(Foo foo) {
  Bar b1 = readBarrier(foo).bar; // Read
  Bar b1' = readBarrier(b1);
  Bar b1'' = writeBarrier(b1);
  while (..) {
    Baz baz = b1'.baz; // Read
    X value = makeSomeValue(baz);
    b1''.x = readBarrier(value); // Write
  }
}
And because write-barriers are stronger than read-barriers, we can fold the two into one:
void example(Foo foo) {
  Bar b1 = readBarrier(foo).bar; // Read
  Bar b1' = writeBarrier(b1);
  while (..) {
    Baz baz = b1'.baz; // Read
    X value = makeSomeValue(baz);
    b1'.x = readBarrier(value); // Write
  }
}
This is all nice and works fairly well, but it is also troublesome: the optimization passes for this are very complex. The fact that both from-space and to-space copies of any object can float around the JVM at any time is a major source of headaches and complexity. For example, we need extra barriers for comparing objects, in case we compare an object to a different copy of itself. Read-barriers and write-barriers need to be inserted for *any* read or write, including primitive reads or writes. And those are very frequent, especially reads.
So why not short-cut this, and strongly ensure to-space-invariance right when an object is loaded from memory? That is where load-reference-barriers come in. They work mostly like our previous write-barriers, but are not employed at use-sites (when reading from or storing to the object), but instead much earlier when objects are loaded (at their definition-site):
void example(Foo foo) {
  Bar b1' = loadReferenceBarrier(foo.bar);
  while (..) {
    Baz baz = loadReferenceBarrier(b1'.baz); // Read
    X value = makeSomeValue(baz);
    b1'.x = value; // Write
  }
}
You can see that the code is basically the same as before (after our optimizations), except that we didn’t need to optimize anything yet. Also, the read-barrier for the store-value is gone, because we now know (because of the strong to-space invariant) that whatever makeSomeValue() did, it must already have employed the load-reference-barrier if needed. The new load-reference-barrier is almost 100% the same as our previous write-barrier.
The advantages of this barrier model are many (for us GC developers):
Strong invariant means it’s a lot easier to reason about the state of GC and objects
Much simpler barrier interface. In fact, a lot of stuff that we added to GC barrier interfaces after JDK 11 will now become unused: no need for barriers on primitives, no need for object equality barriers, etc.
Optimization is much easier (see above). Barriers are naturally placed at the least-hot locations, their def-sites, instead of the most-hot locations, their use-sites, from which we previously had to try to optimize them away (not always successfully).
No more need for object equals barriers
No more need for ‘resolve’ barriers (a somewhat exotic kind of barriers used mostly in intrinsics and places that do read-like or write-like operations)
All barriers are now conditional, which opens up opportunities for further optimization later on
We can re-enable a bunch of optimizations like fast JNI getters that needed to be disabled before because they did not play well with possible from-space references
For users, this is mostly invisible, and the bottom line is that this improves overall Shenandoah’s performance. It also opens the way for follow-up improvements like elimination of the forwarding pointer, which I’ll get to in a follow-up article.
Load reference barriers were integrated into the JDK 13 development repository in April 2019. We will start backporting them to Shenandoah’s JDK 11 and JDK 8 backports soon. If you don’t want to wait, you can already have it: check out The Shenandoah GC Wiki.
|
|
Posted
over 6 years
ago
In this miniseries, I’d like to introduce a couple of new developments of the Shenandoah GC that are upcoming in JDK 13. Perhaps the most significant, even though not directly user-visible, change is the switch of Shenandoah’s barrier model to load
... [More]
reference barriers. It resolves one major point of criticism against Shenandoah, that is their expensive primitive read-barriers.
Shenandoah (as well as other collectors) employ barriers in order to ensure heap consistency. More specifically, Shenandoah GC employs barriers to ensure what we call ‘to-space-invariant’. What it means is this: when Shenandoah is collecting, it is copying objects from so-called ‘from-space’ to ‘to-space’, and it does so while Java threads are running (concurrently). This means that there may be two copies of any object floating around in the JVM. In order to maintain heap consistency, we need to ensure either of:
writes happen into to-space copy + reads can happen from both copies, subject to memory model constraints = weak to-space invariant
writes and reads always happen into/from the to-space copy = strong to-space invariant
And the way we ensure that is by employing the corresponding type of barriers whenever reads and writes happen. Consider this pseudocode:
void example(Foo foo) {
Bar b1 = foo.bar; // Read
while (..) {
Baz baz = b1.baz; // Read
b1.x = makeSomeValue(baz); // Write
}
Employing the Shenandoah barriers, it would look like this (what the JVM+GC would do under the hood):
void example(Foo foo) {
Bar b1 = readBarrier(foo).bar; // Read
while (..) {
Baz baz = readBarrier(b1).baz; // Read
X value = makeSomeValue(baz);
writeBarrier(b1).x = readBarrier(value); // Write
}
I.e. whereever we read from an object, we first resolve the object via a read-barrier, and wherever we write to an object, we possibly copy the object to to-space. I won’t go into the details of this here, let’s just say that both operations are somewhat costly. Notice also that we need a read-barrier on the value of the write here to ensure we only ever write to-space-references into fields while heap references get updated (another nuisance of Shenandoah’s old barrier model).
Seeing that those barriers are a costly affair, we worked quite hard to optimize them. A very important optimization is to hoist barriers out of loops. We see that b1 is defined outside the loop, but only used inside the loop. We can just as well do the barriers outside the loop, once, instead of many times inside the loop:
void example(Foo foo) {
Bar b1 = readBarrier(foo).bar; // Read
Bar b1' = readBarrier(b1);
Bar b1'' = writeBarrier(b1);
while (..) {
Baz baz = b1'.baz; // Read
X value = makeSomeValue(baz);
b1''.x = readBarrier(value); // Write
}
And because write-barriers are stronger than read-barriers, we can fold the two up:
void example(Foo foo) {
Bar b1 = readBarrier(foo).bar; // Read
Bar b1' = writeBarrier(b1);
while (..) {
Baz baz = b1'.baz; // Read
X value = makeSomeValue(baz);
b1'.x = readBarrier(value); // Write
}
This is all nice and works fairly well, but it is also troublesome: the optimization passes for this are very complex. The fact that both from-space and two-space-copies of any objects can float around the JVM at any time is a major source of headaches and complexity. For example, we need extra barriers for comparing objects in case we compare an object to a different copy of itself. Read-barriers and write-barriers need to be inserted for *any* read or write, including primitive reads or writes. And those are very frequent, especially reads.
So why not short-cut this, and strongly ensure to-space-invariance right when an object is loaded from memory? That is where load-reference-barriers come in. They work mostly like our previous write-barriers, but are not employed at use-sites (when reading from or storing to the object), but instead much earlier when objects are loaded (at their definition-site):
void example(Foo foo) {
Bar b1' = loadReferenceBarrier(foo.bar);
while (..) {
Baz baz = b1'.baz; // Read
X value = makeSomeValue(baz);
b1'.x = value; // Write
}
You can see that the code is basically the same as before – after our optimizations- , except that we didn’t need to optimize anything yet. Also, the read-barrier for the store-value is gone, because we now know (because of the strong to-space-invariant) that whatever makeSomeValue() did, it must already have employed the load-reference-barrier if needed. The new load-reference-barrier is almost 100% the same as our previous write-barrier.
The advantages of this barrier model are many (for us GC developers):
Strong invariant means it’s a lot easier to reason about the state of GC and objects
Much simpler barrier interface. Infact, a lot of stuff that we added to GC barrier interfaces after JDK11 will now become unused: no need for barriers on primitives, no need for object equality barriers, etc.
Optimization is much easier (see above). Barriers are naturally placed at the least-hot locations: their def-sites, instead of their most-hot locations: their use-sites, and then attempted to optimize them away from there (and not always successfully).
No more need for object equals barriers
No more need for ‘resolve’ barriers (a somewhat exotic kind of barrier used mostly in intrinsics and places that do read-like or write-like operations)
All barriers are now conditional, which opens up opportunities for further optimization later on
We can re-enable a bunch of optimizations like fast JNI getters that needed to be disabled before because they did not play well with possible from-space references
For users, this is mostly invisible, and the bottom line is that it improves Shenandoah’s overall performance. It also opens the way for follow-up improvements like elimination of the forwarding pointer, which I’ll get to in a follow-up article. Stay tuned.
Load reference barriers have been integrated into the JDK 13 development repository in April 2019. We will start backporting them to Shenandoah’s JDK 11 and JDK 8 backports soon. If you don’t want to wait, you can already have it: check out the Shenandoah GC Wiki.
Posted over 6 years ago
glibc 2.29 has already been released, but I was still on a much older version and hadn’t noticed that 2.28 (the version shipped in RHEL 8) has a really nice fix for people who obsess about memory leaks.
When running valgrind to track memory leaks you might have noticed that there are sometimes some glibc data structures left.
These are often harmless, small things that are needed during the whole lifetime of the process, so it is normally fine not to clean them up explicitly: the memory is reclaimed anyway when the process dies.
But when tracking memory leaks they are slightly annoying. When you want to be sure you don’t have any leaks in your program it is distracting to have to ignore and filter out some harmless leaks.
glibc already had a mechanism to help memory trackers like valgrind memcheck. If you call the secret __libc_freeres function from the last exiting thread, glibc will dutifully free all its memory. This is what valgrind does for you (unless you want to see all the memory left over, in which case use --run-libc-freeres=no).
But it didn’t work for memory allocated by pthreads (libpthread.so) or dlopen (libdl.so). So sometimes you would still see some stray “garbage” left even if you were sure to have released all memory in your own program.
Carlos O’Donell has fixed this:
Bug 23329 – The __libc_freeres infrastructure is not properly run across DSO boundaries.
So upgrade to glibc 2.28+ and really get those memory leaks to zero!
All heap blocks were freed -- no leaks are possible
Posted over 6 years ago
Julian Seward released valgrind 3.15.0 which updates support for existing platforms and adds a major overhaul of the DHAT heap profiler. There are, as ever, many refinements and bug fixes. The release notes give more details.
Nicholas Nethercote used the old experimental DHAT tool a lot while profiling the Rust compiler and then decided to write and contribute A better DHAT (which contains a screenshot of the new graphical viewer).
CORE CHANGES
The XTree Massif output format now makes use of the information obtained when specifying --read-inline-info=yes.
amd64 (x86_64): the RDRAND and F16C insn set extensions are now supported.
TOOL CHANGES
DHAT
DHAT has been thoroughly overhauled, improved, and given a GUI. As a result, it has been promoted from an experimental tool to a regular tool. Run it with --tool=dhat instead of --tool=exp-dhat.
DHAT now prints only minimal data when the program ends, instead writing the bulk of the profiling data to a file. As a result, the --show-top-n and --sort-by options have been removed.
Profile results can be viewed with the new viewer, dh_view.html. When a run ends, a short message is printed, explaining how to view the result.
See the documentation for more details.
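For reference, a typical session with the promoted tool might look like this (output file names depend on the process id, so treat this as a sketch):

```shell
valgrind --tool=dhat ./myprog
# profiling data is written to dhat.out.<pid>;
# open dh_view.html in a browser and load that file to explore the results
```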
Cachegrind
cg_annotate has a new option, --show-percs, which prints percentages next to all event counts.
Callgrind
callgrind_annotate has a new option, --show-percs, which prints percentages next to all event counts.
callgrind_annotate now inserts commas in call counts, and sorts the caller/callee lists in the call tree.
Massif
The default value for --read-inline-info is now yes on Linux/Android/Solaris. It remains no on other OSes.
Memcheck
The option --xtree-leak=yes (to output leak result in xtree format) automatically activates the option --show-leak-kinds=all, as xtree visualisation tools such as kcachegrind can in any case select what kind of leak to visualise.
There has been further work to avoid false positives. In particular, integer equality on partially defined inputs (C == and !=) is now handled better.
OTHER CHANGES
The new option --show-error-list=no|yes displays, at the end of the run, the list of detected errors and the used suppressions. Prior to this change, showing this information could only be done by specifying -v -v, but that also produced a lot of other, possibly non-useful messages. The option -s is equivalent to --show-error-list=yes.
Posted over 6 years ago
Since the GNU Toolchain has many shared modules it sometimes feels like you have to rebuild everything (assembler, linker, binutils tools, debugger, simulators, etc.) just to get one of the latest tools from source.
Having all this reusable shared code is fun, but it does make build times a bit long.
Luckily most of the “extras” can be disabled if all you want is a fresh new GDB. Sergio Durigan Junior added the GDB configure steps to the GDB wiki so you can build GDB in just a couple of minutes after checking it out.
git clone git://sourceware.org/git/binutils-gdb.git
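After the clone above, a minimal configure-and-build sketch could look like the following (the --disable flags come from the GDB wiki and may change over time, so treat this as an illustration rather than the canonical recipe):

```shell
cd binutils-gdb
# skip the shared "extras" so that only gdb itself is built
./configure --disable-binutils --disable-ld --disable-gold \
            --disable-gas --disable-sim --disable-gprof
make -j"$(nproc)"
```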