|
Posted
over 15 years
ago
by
[email protected] (Charles Oliver Nutter)
There's lots of ways to present RESTful web services these days, and REST has obviously become the new "it's IPC no it's not" hotness. And of course Rubyists have been helping to lead the way, building restfulness into just about everything they
... [More]
write. Rails itself is built around REST, with most controllers doubling as RESTful interfaces, and it even provides extra tools to help you transparently make RESTful calls from your application. If you're doing RESTful services for a typical Ruby application, Rails is the way to go (and even if you're not using Ruby in general...you should be considering JRuby + Rails).In the Java world the options aren't quite as clear, but one API has at least attempted to standardize the idea of doing RESTful services on the Java platform: JSR-311, otherwise known as JAX-RS.JAX-RS in theory makes it easy for you to simply mark up a piece of Java code with annotations and have it automatically be presented as a RESTful service. Of the available options, it may be the simplest, quickest way to get a Java-based service published and running.So I figured I'd try to use it from JRuby, and when doing it in Ruby it's actually surprisingly clean, even compared to Ruby options.The ServiceI followed the Jersey Getting Started tutorial using Ruby for everything (and not using Maven in this case).My version of their HelloWorldResource looks like this in Ruby:require 'java'java_import 'javax.ws.rs.Path'java_import 'javax.ws.rs.GET'java_import 'javax.ws.rs.Produces'java_package 'com.headius.demo.jersey'java_annotation 'Path("/helloworld")'class HelloWorldjava_annotation 'GET'java_annotation 'Produces("text/plain")'def cliched_message "Hello World"endendNotice that we're using the new features in JRuby 1.5 for producing "real" Java classes: java_package to specify a target package for the Java class, java_annotation to specify class and method annotations.We compile it using jrubyc from JRuby 1.5 like this:~/projects/jruby ➔ jrubyc -c ../jersey-archive-1.2/lib/jsr311-api-1.1.1.jar --javac restful_service.rbGenerating Java class HelloWorld to /Users/headius/projects/jruby/com/headius/demo/jersey/HelloWorld.javajavac -d /Users/headius/projects/jruby -cp /Users/headius/projects/jruby/lib/jruby.jar:../jersey-archive-1.2/lib/jsr311-api-1.1.1.jar /Users/headius/projects/jruby/com/headius/demo/jersey/HelloWorld.javaThe new --java(c) flags in jrubyc examine your source for any classes, spitting out .java source for each one in turn. Along the way, if you have marked it up with signatures, annotations, imports, and so on, it will emit those into the Java source as well. If you specify --javac (as opposed to --java), it will also compile the resulting sources for you with jruby and your user-specified jars in the classpath, as I've done here.The result looks like this to Java:~/projects/jruby ➔ javap com.headius.demo.jersey.HelloWorldCompiled from "HelloWorld.java"public class com.headius.demo.jersey.HelloWorld extends org.jruby.RubyObject{ public static org.jruby.runtime.builtin.IRubyObject __allocate__(org.jruby.Ruby, org.jruby.RubyClass); public com.headius.demo.jersey.HelloWorld(); public java.lang.Object cliched_message(); static {};}Under the covers, this class will load in the source of our restful_service.rb file and wire up all the Ruby and Java pieces so that both sides see HelloWorld as the in-memory representation of the Ruby HelloWorld class. Method calls are dispatched to the Ruby code, constructors dispatch to initialize, and so on. It's truly living in both worlds.With the service in hand, we now need a server script to start it up.The Server ScriptContinuing with the tutorial, I've taken their simple Java-based server script and ported it directly to Ruby:require 'java'java_import com.sun.jersey.api.container.grizzly.GrizzlyWebContainerFactorybase_uri = "http://localhost:9998/"init_params = {"com.sun.jersey.config.property.packages" => "com.headius.demo.jersey"}puts "Starting grizzly"thread_selector = GrizzlyWebContainerFactory.create( base_uri, init_params.to_java)puts <Jersey app started with WADL available at #{base_uri}application.wadlTry out #{base_uri}helloworldHit enter to stop it...EOSgetsthread_selector.stop_endpointexit(0)It's somewhat cleaner and shorter, but it wasn't particularly large to begin with. At any rate, it shows how simple it is to launch a Grizzly server and how nice annotation-based APIs can be for auto-configuring our JAX-RS service.The CLASSPATHAhh CLASSPATH. You are so maligned when all you hope to do is make it explicit where libraries are coming from. The world should learn from your successes and your failures.There's five jars required for this Jersey example to run. I've tossed them into my CLASSPATH env var, but you're free to do it however you likejersey-core-1.2.jarjersey-server-1.2.jarjsr311-api-1.1.1.jarasm-3.1.jargrizzly-servlet-webserver-1.9.9.jarThe first four are available in the jersey-archive download, and you can fetch the Grizzly jar from Maven or other places.Testing It OutThe lovely bit about this is that it's essentially a four-step process to do the entire thing: write and compile the service, write the server script, set up CLASSPATH, and run the server. Here's the server output, finding my HelloWorld service right where it should:~/projects/jruby ➔ jruby main.rbStarting grizzlyJersey app started with WADL available at http://localhost:9998/application.wadlTry out http://localhost:9998/helloworldHit enter to stop it...Jun 1, 2010 6:36:55 PM com.sun.jersey.api.core.PackagesResourceConfig initINFO: Scanning for root resource and provider classes in the packages:com.headius.demo.jerseyJun 1, 2010 6:36:55 PM com.sun.jersey.api.core.ScanningResourceConfig logClassesINFO: Root resource classes found:class com.headius.demo.jersey.HelloWorldJun 1, 2010 6:36:55 PM com.sun.jersey.api.core.ScanningResourceConfig initINFO: No provider classes found.Jun 1, 2010 6:36:55 PM com.sun.jersey.server.impl.application.WebApplicationImpl initiateINFO: Initiating Jersey application, version 'Jersey: 1.2 05/07/2010 02:04 PM'And perhaps the most anti-climactic climax ever, curling our newly-deployed service:~ ➔ curl http://localhost:9998/helloworldHello WorldIt works!What We've LearnedThis was a simple example, but we demonstrated several things:
JRuby's new jrubyc --java(c) support for generating "real" Java classes
Tagging a Ruby class with Java annotations, so it can be seen by a Java framework
Booting a Grizzly server from Ruby code
Implementing a JAX-RS service with Jersey in Ruby code
Do you have any examples of other nice annotation-based APIs we could test out with JRuby? [Less]
|
|
Posted
over 15 years
ago
by
[email protected] (Charles Oliver Nutter)
We've often talked about the benefit of running on the JVM, and while our performance numbers have been *good*, they've never been as incredibly awesome as a lot of people expected. After all, regardless of how well we performed when compared to
... [More]
other Ruby implementations, we weren't as fast as statically-typed JVM languages.It's time for that to change.I've started playing with performing more optimizations based on runtime information in JRuby. You may know that JRuby has always had a JIT (just-in-time compiler), which lazily compiled Ruby AST into JVM bytecode. What you may not know is that unlike most other JIT-based systems, we did not gather any information at runtime that might help the eventual compilation produces a better result. All we applied were the same static optimizations we could safely do for AOT compilation (ahead-of-time, like jrubyc), and ultimately the JIT mode just deferred compilation to reduce the startup cost of compiling everything before it runs.This was a pragmatic decision; the JVM itself does a lot to boost performance, even with our very naïve compiler and our limited static optimizations. Because of that, our performance has been perfectly fine for most users; we haven't even made a concentrated effort to improve execution speed for almost 18 months, since the JRuby 1.1.6 release. We've spent a lot more time handling the issues users actually wanted us to work on: better Java integration, specific Ruby incompatibilities, memory reduction (and occasional leaks), peripheral libraries like jruby-rack and activerecord-jdbc, and general system stability. When we asked users what we should focus on, performance always came up as a "nice to have" but rarely as something people had a problem with. We performed very nicely.These days, however, we're recognizing a few immutable truths:
You can never be fast enough, and if your performance doesn't steadily improve people will feel like you're moving backward.
People eventually *do* want Ruby code to run as fast as C or Java, and if we can make it happen we should.
JRuby users like to write in Ruby, obviously, so we should make an effort to allow writing more of JRuby in Ruby, which requires that we suffer no performance penalty in the process.
Working on performance can be a dreadful time sink, but succeeding in improving it can be a tremendous (albeit shallow) ego boost.
So along with other work for JRuby 1.6, I'm back in the compiler.I'm just starting to play with this stuff, so take all my results as highly preliminary.A JRuby Call Site PrimerAt each location in Ruby code where a dynamic call happens, JRuby installs what's called a call site. Other VMs may simply refer to the location of the call as the "call site", but in JRuby, each call site has an implementation of org.jruby.runtime.CallSite associated with it. So given the following code:def fib(a) if a < 2 a else fib(a - 1) + fib(a - 2) endendThere are six call sites: the "call site cache at each call site to blunt that lookup cost. And in implementation terms, that means each caching CallSite is an instance of org.jruby.runtime.callsite.CachingCallSite. Easy, right?The simplest way to take advantage of runtime information is to use these caching call sites as hints for what method we've been calling at a given call site. Let's take the example of "+". The "+" method here is being called against a Fixnum object every time it's encountered, since our particular run of "fib" never overflows into Bignum and never works with Float objects. In JRuby, Fixnum is implemented with org.jruby.RubyFixnum, and the "+" method is bound to RubyFixnum.op_plus. Here's the implementation of op_plus in JRuby: @JRubyMethod(name = "+") public IRubyObject op_plus(ThreadContext context, IRubyObject other) { if (other instanceof RubyFixnum) { return addFixnum(context, (RubyFixnum)other); } return addOther(context, other); }The details of the actual addition in addFixnum are left as an exercise for the reader.Notice that we have an annotation on the method: @JRubyMethod(name = "+"). All the Ruby core classes implemented in JRuby are tagged with a similar annotation, where we can specify the minimum and maximum numbers of arguments, the Ruby-visible method name (e.g. "+" here, which is not a valid Java method name), and other details of how the method should be called. Notice also that the method takes two arguments: the current ThreadContext (thread-local Ruby-specific runtime structures), and the "other" argument being added.When we build JRuby, we scan the JRuby codebase for JRubyMethod annotations and generate what we call invokers, one for every unique class+name combination in the core classes. These invokers are all instance of org.jruby.internal.runtime.methods.DynamicMethod, and they carry various informational details about the method as well as implementing various "call" signatures. The reason we generate an invoker class per method is simple: most JVMs will not inline through code used for many different code paths, so if we only had a single invoker for all methods in the system...nothing would ever inline.So, putting it all together. At runtime, we have a CachingCallSite instance for every method call in the code. The first time we call "+" for example, we go out and fetch the method object and cache it in place for future calls. That cached method object is a subclass of DynamicMethod, which knows how to do the eventual invocation of RubyFixnum.op_plus and return the desired result. Simple!First Steps Toward Runtime OptimizationThe changes I'm working on locally will finally take advantage of the fact that after running for a while, we have a darn good idea what methods are being called. And if we know what methods are being called, those invocations no longer have to be dynamic, provided our assumptions hold (assumptions being things like "we're always calling against Fixnums" or "it's always the core implementation of "+"). Put differently: because JRuby defers its eventual compilation to JVM bytecode, we can potentially turn dynamic calls into static calls.The process is actually very simple, and since I just started playing with it two days ago, I'll describe the steps I took.Step One: Turn Dynamic Into StaticFirst, I needed the generated methods to bring along enough information for me to do a direct call. Specifically, I needed to know what target Java class they pointed at (RubyFixnum in our "+" example), what the Java name of the method was (op_plus), what arguments the method signature expected to receive (returning IRubyObject and receiving ThreadContext, IRubyObject), and whether the implementation was static (as are most module-based methods...but our "op_plus" is not). So I bundled this information into what I called a "NativeCall" data structure, populated on any invokers for which a native call might be possible. public static class NativeCall { private final Class nativeTarget; private final String nativeName; private final Class nativeReturn; private final Class[] nativeSignature; private final boolean statik; ...In the compiler, I added a similar bit of logic. When compiling a dynamic call, it now will look to see whether the call site has already cached some method. If so, it checks whether that method has a corresponding NativeCall data structure associated with it. If it does, then instead of emitting the dynamic call logic it would normally emit, it compiles a normal, direct JVM invocation. So where we used to have a call like this:INVOKEVIRTUAL org/jruby/runtime/CallSite.callWe now have a call like this:INVOKEVIRTUAL org/jruby/RubyFixnum.op_plusThis call to "+" is now essentially equivalent to writing the same code in Java against our RubyFixnum class.This comprises the first step: get the compiler to recognize and "make static" dynamic calls we've seen before. My experimental code is able to do this for most core class methods right now, and it will not be difficult to extend it to any call.(The astute VM implementer will notice I made no mention of inserting a guard or test before the direct call, in case we later need to actually call a different method. This is, for the moment, intentional. I'll come back to it).Step Two: Reduce Some Fixnum UseThe second step I took was to allow some call paths to use primitive values rather than boxed RubyFixnum objects. RubyFixnum in JRuby always represents a 64-bit long value, but since our call path only supports invocation with IRubyObject, we generally have to construct a new RubyFixnum for all dynamic calls. These objects are very small and usually very short-lived, so they don't impact GC times much. But they do impact allocation rates; we still have to grab the memory space for every RubyFixnum, and so we're constantly chewing up memory bandwidth to do so.There are various ways in which VMs can eliminate or reduce object allocations like this: stack allocation, value types, true fixnums, escape analysis, and more. But of these, only escape analysis is available on current JVMs, and it's a very fragile optimization: all paths that would consume an object must be completely inlined, or else the object can't be elided.So to help reduce the burden on the JVM, we have a couple call paths that can receive a single long or double argument where it's possible for us to prove that we're passing a boxed long or double value (i.e. a RubyFixnum or RubyFloat). The second step I took was to make the compiler aware of several of these methods on RubyFixnum, such as the long version of op_plus: public IRubyObject op_plus(ThreadContext context, long other) { return addFixnum(context, other); }Since we have not yet implemented more advanced means of proving a given object is always a Fixnum (like doing local type propagation or per-variable type profiling at runtime), the compiler currently can only see that we're making a call with a literal value like "100" or "12.5". As luck would have it, there's three such cases in the "fib" implementation above: one for the "which actual method is being called in each case and how to make a call without standing up a RubyFixnum object, our actual call in the bytecode gets simplified further:INVOKEVIRTUAL org/jruby/RubyFixnum.op_minus (Lorg/jruby/runtime/ThreadContext;J)(For the uninitiated, that "J" at the end of the signature means this call is passing a primitive long for the second argument to op_minus.)We now have the potential to insert additional compiler smarts about how to make a call more efficiently. This is a simple case, of course, but consider the potential for another case: calling from Ruby to arbitrary Java code. Instead of encumbering that call with all our own multi-method dispatch logic *plus* the multiple layers of Java's reflection API, we can instead make the call *directly*, going straight from Ruby code to Java code with no intervening code. And with no intervening code, the JVM will be able to inline arbitrary Java code straight into Ruby code, optimize it as a whole. Are we getting excited yet?Step Three: Steal a Micro-optimizationThere's two pesky calls remaining in "fib": the recursive invocations of "fib" itself. Now the smart thing to do would be to proceed with adding logic to the compiler to be aware of calls from currently-jitting Ruby code to not-yet jitted Ruby code and handle that accordingly. For example, we might also force those methods to compile, and the methods they call to compile, and so on, allowing us to optimize from a "hot" method outward to potentially "colder" methods within some threshold distance. And of course, that's probably what we'll do; it's trivial to trigger any Ruby method in JRuby to JIT at any time, so adding this to the compiler will be a few minutes work. But since I've only been playing with this since Thursday, I figured I'd do an even smarter thing: cheat.Instead of adding the additional compiler smarts, I instead threw in a dirt-simple single-optimization subset of those smarts: detect self-recursion and optimize that case alone. In JRuby terms, this meant simply seeing that the "fib" CachingCallSite objects pointed back to the same "fib" method we were currently compiling. Instead of plumbing them through the dynamic pipeline, we make them be direct recursive calls, similar to the core method calls to "INVOKESTATIC ruby/jit/fib_ruby_B52DB2843EB6D56226F26C559F5510210E9E330D.__file__Where "B52DB28..." is the SHA1 hash of the "fib" method, and __file__ is the default entry point for jitted calls.Everything Else I'm Not Telling YouAs with any experiment, I'm bending compatibility a bit here to show how far we can move JRuby's "upper bound" on performance. So these optimizations currently include some caveats.They currently damage Ruby backtraces, since by skipping the invokers we're no longer tracking Ruby-specific backtrace information (current method, file, line, etc) in a heap-based structure. This is one area where JRuby loses some performance, since the JVM is actually double-tracking both the Java backtrace information and the Ruby backtrace information. We will need to do a bit more work toward making the Java backtrace actually show the relevant Ruby information, or else generate our Ruby backtraces by mining the Java backtrace (and the existing "--fast" flag actually does this already, matching normal Ruby traces pretty well).The changes also don't track the data necessary for managing "backref" and "lastline" data (the $~ and $_ pseudo-globals), which means that regular expression matches or line reads might have reduced functionality in the presence of these optimizations (though you might never notice; those variables are usually discouraged). This behavior can be restored by clever use of thread-local stacks (only pushing down the stack when in the presence of a method that might use those pseudo-globals), or by simply easing back the optimizations if those variables are likely to be used. Ideally we'll be able to use runtime profiling to make a smart decision, since we'll know whether a given call site has ever encountered a method that reads or writes $~ or $_.Finally, I have not inserted the necessary guards before these direct invocations that would branch to doing a normal dynamic call. This was again a pragmatic decision to speed the progress of my experiment...but it also raises an interesting question. What if we knew, without a doubt, that our code had seen all the types and methods it would ever see. Wouldn't it be nice to say "optimize yourself to death" and know it's doing so? While I doubt we'll see a JRuby ship without some guard in place (ideally a simple "method ID" comparison, which I wouldn't expect to impact performance much), I think it's almost assured that we'll allow users to "opt in" to completely "statickifying" specific of code. And it's likely that if we started moving more of JRuby's implementation into Ruby code, we would take advantage of "fully" optimizing as well.With all these caveats in mind, let's take a look at some numbers.And Now, Numbers Meaningless to Anyone Not Using JRuby To Calculate Fibonacci NumbersThe "fib" method is dreadfully abused in benchmarking. It's largely a method call benchmark, since most calls spawn at least two more recursive calls and potentially several others if numeric operations are calls as well. In JRuby, it doubles as an allocation-rate or memory-bandwidth benchmark, since the bulk of our time is spent constructing Fixnum objects or updating per-call runtime data structures.But since it's a simple result to show, I'm going to abuse it again. Please keep in mind these numbers are meaningless except in comparison to each other; JRuby is still cranking through a lot more objects than implementations with true Fixnums or value types, and the JVM is almost certainly over-optimizing portions of this benchmark. But of course, that's part of the point; we're making it possible for the JVM to accelerate and potentially eliminate unnecessary computation, and it's all becoming possible because we're using runtime data to improve runtime compilation.I'll be using JRuby 1.6.dev (master plus my hacks) on OS X Java 6 (1.6.0_20) 64-bit Server VM on a MacBook Pro Core 2 Duo at 2.66GHz.First, the basic JRuby "fib" numbers with full Ruby backtraces and dynamic calls: 0.357000 0.000000 0.357000 ( 0.290000) 0.176000 0.000000 0.176000 ( 0.176000) 0.174000 0.000000 0.174000 ( 0.174000) 0.175000 0.000000 0.175000 ( 0.175000) 0.174000 0.000000 0.174000 ( 0.174000)Now, with backtraces calculated from Java backtraces, dynamic calls restructured slightly to be more inlinable (but still dynamic and via invokers), and limited use of primitive call paths. Essentially the fastest we can get in JRuby 1.5 without totally breaking compatibility (and with some minor breakages, in fact): 0.383000 0.000000 0.383000 ( 0.330000) 0.109000 0.000000 0.109000 ( 0.108000) 0.111000 0.000000 0.111000 ( 0.112000) 0.101000 0.000000 0.101000 ( 0.101000) 0.101000 0.000000 0.101000 ( 0.101000)Now, using all the runtime optimizations described above. Keep in mind these numbers will probably drop a bit with guards in place (but they're still pretty fantastic): 0.443000 0.000000 0.443000 ( 0.376000) 0.035000 0.000000 0.035000 ( 0.035000) 0.036000 0.000000 0.036000 ( 0.036000) 0.036000 0.000000 0.036000 ( 0.036000) 0.036000 0.000000 0.036000 ( 0.036000)And a couple comparisons of another mostly-recursive, largely-abused benchmark target: the tak function...Static optimizations only, as fast as we can make it: 1.750000 0.000000 1.750000 ( 1.695000) 0.779000 0.000000 0.779000 ( 0.779000) 0.764000 0.000000 0.764000 ( 0.764000) 0.775000 0.000000 0.775000 ( 0.775000) 0.763000 0.000000 0.763000 ( 0.763000)And with runtime optimizations: 0.899000 0.000000 0.899000 ( 0.832000) 0.331000 0.000000 0.331000 ( 0.331000) 0.332000 0.000000 0.332000 ( 0.332000) 0.329000 0.000000 0.329000 ( 0.329000) 0.331000 0.000000 0.331000 ( 0.332000)So for these trivial benchmarks, a couple hours hacking in JRuby's compiler has produced numbers 2-3x faster than the fastest JRuby performance we've achieved thusfar. Understanding why requires one last section.Hooray for the JVMThere's a missing trick here I haven't explained: the JVM is still doing most of the work for us.In the plain old dynamic-calling, static-optimized case, the JVM is dutifully optimizing everything we throw at it. It's taking our Ruby calls, inlining the invokers and their eventual calls, inlining core class methods (yes, this means we've always been able to inline Ruby into Ruby or Java into Ruby or Ruby into Java, given the appropriate tweaks), and optimizing things very well. But here's the problem: the JVM's default settings are tuned for optimizing *Java*, not for optimizing Ruby. In Java, a call from one method to another has exactly one hop. In JRuby's dynamic calls, it may take three or four hops, bouncing through CallSites and DynamicMethods to eventually get to the target. Since Hotspot only inlines up to 9 levels of calls by default, you can see we're eating up that budget very quickly. Add to that the fact that Java to Java calls have no intervening code to eat up bytecode size thresholds, and we're essentially only getting a fraction of the optimization potential out of Hotspot that a "simpler" language like Java does.Now consider the dynamically-optimized case. We've eliminated both the CallSite and the DynamicMethod from most calls, even if we have to insert a bit of guard code to do it. That means nine levels of Ruby calls can inline, or nine levels of logic from Ruby to core methods or Ruby to Java. We've now given Hotspot a much better picture of the system to optimize. We've also eliminated some of the background noise of doing a Ruby call, like updating rarely-used data structures. We'll need to ensure backtraces come out in some usable form, but at least we're not double-tracking them.The bottom line here is that instead of doing all the much harder work of adding inlining to our own compiler, all we need to do is let the JVM do it, and we benefit from the years of work that have gone into current JVMs' optimizing compilers. The best way to solve a hard problem is to get someone else to solve it for you, and the JVM does an excellent job of solving this particular hard problem. Instead of giving the JVM a fish *or* teaching it how to fish, we're just giving it a map of the lake; it's already an expert fisherman.There's also another side to these optimizations: our continued maintenance of an interpreter is starting to pay off. Other JVM languages that don't have an interpreted mode have a much more difficult time doing any runtime optimization; simply put, it's really hard to swap out bytecode you've already loaded unless you're explicitly abstracting how that bytecode gets generated in the first place. In JRuby, where methods have always had to switch from interpreted to jitted at runtime, we're can can hop back and forth much more freely, optimizing along the way.That's all for now. Don't expect to download JRuby 1.6 in a few months and see everything be three times (or ten times! or 100 times!) faster. There's a lot of details we need to work out about when these optimizations will be safe, how users might be able to opt-in to "full" optimization if necessary, and whether we'll get things fast enough to start replacing core code with Ruby. But early results are very promising, and it's assured we'll ship some of this very soon. [Less]
|
|
Posted
over 15 years
ago
by
[email protected] (Charles Oliver Nutter)
We've often talked about the benefit of running on the JVM, and while our performance numbers have been *good*, they've never been as incredibly awesome as a lot of people expected. After all, regardless of how well we performed when compared to
... [More]
other Ruby implementations, we weren't as fast as statically-typed JVM languages.It's time for that to change.I've started playing with performing more optimizations based on runtime information in JRuby. You may know that JRuby has always had a JIT (just-in-time compiler), which lazily compiled Ruby AST into JVM bytecode. What you may not know is that unlike most other JIT-based systems, we did not gather any information at runtime that might help the eventual compilation produces a better result. All we applied were the same static optimizations we could safely do for AOT compilation (ahead-of-time, like jrubyc), and ultimately the JIT mode just deferred compilation to reduce the startup cost of compiling everything before it runs.This was a pragmatic decision; the JVM itself does a lot to boost performance, even with our very naïve compiler and our limited static optimizations. Because of that, our performance has been perfectly fine for most users; we haven't even made a concentrated effort to improve execution speed for almost 18 months, since the JRuby 1.1.6 release. We've spent a lot more time handling the issues users actually wanted us to work on: better Java integration, specific Ruby incompatibilities, memory reduction (and occasional leaks), peripheral libraries like jruby-rack and activerecord-jdbc, and general system stability. When we asked users what we should focus on, performance always came up as a "nice to have" but rarely as something people had a problem with. We performed very nicely.These days, however, we're recognizing a few immutable truths:
You can never be fast enough, and if your performance doesn't steadily improve people will feel like you're moving backward.
People eventually *do* want Ruby code to run as fast as C or Java, and if we can make it happen we should.
JRuby users like to write in Ruby, obviously, so we should make an effort to allow writing more of JRuby in Ruby, which requires that we suffer no performance penalty in the process.
Working on performance can be a dreadful time sink, but succeeding in improving it can be a tremendous (albeit shallow) ego boost.
So along with other work for JRuby 1.6, I'm back in the compiler.I'm just starting to play with this stuff, so take all my results as highly preliminary.A JRuby Call Site PrimerAt each location in Ruby code where a dynamic call happens, JRuby installs what's called a call site. Other VMs may simply refer to the location of the call as the "call site", but in JRuby, each call site has an implementation of org.jruby.runtime.CallSite associated with it. So given the following code:def fib(a) if a < 2 a else fib(a - 1) + fib(a - 2) endendThere are six call sites: the "call site cache at each call site to blunt that lookup cost. And in implementation terms, that means each caching CallSite is an instance of org.jruby.runtime.callsite.CachingCallSite. Easy, right?The simplest way to take advantage of runtime information is to use these caching call sites as hints for what method we've been calling at a given call site. Let's take the example of "+". The "+" method here is being called against a Fixnum object every time it's encountered, since our particular run of "fib" never overflows into Bignum and never works with Float objects. In JRuby, Fixnum is implemented with org.jruby.RubyFixnum, and the "+" method is bound to RubyFixnum.op_plus. Here's the implementation of op_plus in JRuby: @JRubyMethod(name = "+") public IRubyObject op_plus(ThreadContext context, IRubyObject other) { if (other instanceof RubyFixnum) { return addFixnum(context, (RubyFixnum)other); } return addOther(context, other); }The details of the actual addition in addFixnum are left as an exercise for the reader.Notice that we have an annotation on the method: @JRubyMethod(name = "+"). All the Ruby core classes implemented in JRuby are tagged with a similar annotation, where we can specify the minimum and maximum numbers of arguments, the Ruby-visible method name (e.g. "+" here, which is not a valid Java method name), and other details of how the method should be called. Notice also that the method takes two arguments: the current ThreadContext (thread-local Ruby-specific runtime structures), and the "other" argument being added.When we build JRuby, we scan the JRuby codebase for JRubyMethod annotations and generate what we call invokers, one for every unique class+name combination in the core classes. These invokers are all instance of org.jruby.internal.runtime.methods.DynamicMethod, and they carry various informational details about the method as well as implementing various "call" signatures. The reason we generate an invoker class per method is simple: most JVMs will not inline through code used for many different code paths, so if we only had a single invoker for all methods in the system...nothing would ever inline.So, putting it all together. At runtime, we have a CachingCallSite instance for every method call in the code. The first time we call "+" for example, we go out and fetch the method object and cache it in place for future calls. That cached method object is a subclass of DynamicMethod, which knows how to do the eventual invocation of RubyFixnum.op_plus and return the desired result. Simple!First Steps Toward Runtime OptimizationThe changes I'm working on locally will finally take advantage of the fact that after running for a while, we have a darn good idea what methods are being called. And if we know what methods are being called, those invocations no longer have to be dynamic, provided our assumptions hold (assumptions being things like "we're always calling against Fixnums" or "it's always the core implementation of "+"). Put differently: because JRuby defers its eventual compilation to JVM bytecode, we can potentially turn dynamic calls into static calls.The process is actually very simple, and since I just started playing with it two days ago, I'll describe the steps I took.Step One: Turn Dynamic Into StaticFirst, I needed the generated methods to bring along enough information for me to do a direct call. Specifically, I needed to know what target Java class they pointed at (RubyFixnum in our "+" example), what the Java name of the method was (op_plus), what arguments the method signature expected to receive (returning IRubyObject and receiving ThreadContext, IRubyObject), and whether the implementation was static (as are most module-based methods...but our "op_plus" is not). So I bundled this information into what I called a "NativeCall" data structure, populated on any invokers for which a native call might be possible. public static class NativeCall { private final Class nativeTarget; private final String nativeName; private final Class nativeReturn; private final Class[] nativeSignature; private final boolean statik; ...In the compiler, I added a similar bit of logic. When compiling a dynamic call, it now will look to see whether the call site has already cached some method. If so, it checks whether that method has a corresponding NativeCall data structure associated with it. If it does, then instead of emitting the dynamic call logic it would normally emit, it compiles a normal, direct JVM invocation. So where we used to have a call like this:INVOKEVIRTUAL org/jruby/runtime/CallSite.callWe now have a call like this:INVOKEVIRTUAL org/jruby/RubyFixnum.op_plusThis call to "+" is now essentially equivalent to writing the same code in Java against our RubyFixnum class.This comprises the first step: get the compiler to recognize and "make static" dynamic calls we've seen before. My experimental code is able to do this for most core class methods right now, and it will not be difficult to extend it to any call.(The astute VM implementer will notice I made no mention of inserting a guard or test before the direct call, in case we later need to actually call a different method. This is, for the moment, intentional. I'll come back to it).Step Two: Reduce Some Fixnum UseThe second step I took was to allow some call paths to use primitive values rather than boxed RubyFixnum objects. RubyFixnum in JRuby always represents a 64-bit long value, but since our call path only supports invocation with IRubyObject, we generally have to construct a new RubyFixnum for all dynamic calls. These objects are very small and usually very short-lived, so they don't impact GC times much. But they do impact allocation rates; we still have to grab the memory space for every RubyFixnum, and so we're constantly chewing up memory bandwidth to do so.There are various ways in which VMs can eliminate or reduce object allocations like this: stack allocation, value types, true fixnums, escape analysis, and more. But of these, only escape analysis is available on current JVMs, and it's a very fragile optimization: all paths that would consume an object must be completely inlined, or else the object can't be elided.So to help reduce the burden on the JVM, we have a couple call paths that can receive a single long or double argument where it's possible for us to prove that we're passing a boxed long or double value (i.e. a RubyFixnum or RubyFloat). The second step I took was to make the compiler aware of several of these methods on RubyFixnum, such as the long version of op_plus: public IRubyObject op_plus(ThreadContext context, long other) { return addFixnum(context, other); }Since we have not yet implemented more advanced means of proving a given object is always a Fixnum (like doing local type propagation or per-variable type profiling at runtime), the compiler currently can only see that we're making a call with a literal value like "100" or "12.5". As luck would have it, there's three such cases in the "fib" implementation above: one for the "which actual method is being called in each case and how to make a call without standing up a RubyFixnum object, our actual call in the bytecode gets simplified further:INVOKEVIRTUAL org/jruby/RubyFixnum.op_minus (Lorg/jruby/runtime/ThreadContext;J)(For the uninitiated, that "J" at the end of the signature means this call is passing a primitive long for the second argument to op_minus.)We now have the potential to insert additional compiler smarts about how to make a call more efficiently. This is a simple case, of course, but consider the potential for another case: calling from Ruby to arbitrary Java code. Instead of encumbering that call with all our own multi-method dispatch logic *plus* the multiple layers of Java's reflection API, we can instead make the call *directly*, going straight from Ruby code to Java code with no intervening code. And with no intervening code, the JVM will be able to inline arbitrary Java code straight into Ruby code, optimize it as a whole. Are we getting excited yet?Step Three: Steal a Micro-optimizationThere's two pesky calls remaining in "fib": the recursive invocations of "fib" itself. Now the smart thing to do would be to proceed with adding logic to the compiler to be aware of calls from currently-jitting Ruby code to not-yet jitted Ruby code and handle that accordingly. For example, we might also force those methods to compile, and the methods they call to compile, and so on, allowing us to optimize from a "hot" method outward to potentially "colder" methods within some threshold distance. And of course, that's probably what we'll do; it's trivial to trigger any Ruby method in JRuby to JIT at any time, so adding this to the compiler will be a few minutes work. But since I've only been playing with this since Thursday, I figured I'd do an even smarter thing: cheat.Instead of adding the additional compiler smarts, I instead threw in a dirt-simple single-optimization subset of those smarts: detect self-recursion and optimize that case alone. In JRuby terms, this meant simply seeing that the "fib" CachingCallSite objects pointed back to the same "fib" method we were currently compiling. Instead of plumbing them through the dynamic pipeline, we make them be direct recursive calls, similar to the core method calls to "INVOKESTATIC ruby/jit/fib_ruby_B52DB2843EB6D56226F26C559F5510210E9E330D.__file__Where "B52DB28..." is the SHA1 hash of the "fib" method, and __file__ is the default entry point for jitted calls.Everything Else I'm Not Telling YouAs with any experiment, I'm bending compatibility a bit here to show how far we can move JRuby's "upper bound" on performance. So these optimizations currently include some caveats.They currently damage Ruby backtraces, since by skipping the invokers we're no longer tracking Ruby-specific backtrace information (current method, file, line, etc) in a heap-based structure. This is one area where JRuby loses some performance, since the JVM is actually double-tracking both the Java backtrace information and the Ruby backtrace information. We will need to do a bit more work toward making the Java backtrace actually show the relevant Ruby information, or else generate our Ruby backtraces by mining the Java backtrace (and the existing "--fast" flag actually does this already, matching normal Ruby traces pretty well).The changes also don't track the data necessary for managing "backref" and "lastline" data (the $~ and $_ pseudo-globals), which means that regular expression matches or line reads might have reduced functionality in the presence of these optimizations (though you might never notice; those variables are usually discouraged). This behavior can be restored by clever use of thread-local stacks (only pushing down the stack when in the presence of a method that might use those pseudo-globals), or by simply easing back the optimizations if those variables are likely to be used. Ideally we'll be able to use runtime profiling to make a smart decision, since we'll know whether a given call site has ever encountered a method that reads or writes $~ or $_.Finally, I have not inserted the necessary guards before these direct invocations that would branch to doing a normal dynamic call. This was again a pragmatic decision to speed the progress of my experiment...but it also raises an interesting question. What if we knew, without a doubt, that our code had seen all the types and methods it would ever see. Wouldn't it be nice to say "optimize yourself to death" and know it's doing so? While I doubt we'll see a JRuby ship without some guard in place (ideally a simple "method ID" comparison, which I wouldn't expect to impact performance much), I think it's almost assured that we'll allow users to "opt in" to completely "statickifying" specific of code. And it's likely that if we started moving more of JRuby's implementation into Ruby code, we would take advantage of "fully" optimizing as well.With all these caveats in mind, let's take a look at some numbers.And Now, Numbers Meaningless to Anyone Not Using JRuby To Calculate Fibonacci NumbersThe "fib" method is dreadfully abused in benchmarking. It's largely a method call benchmark, since most calls spawn at least two more recursive calls and potentially several others if numeric operations are calls as well. In JRuby, it doubles as an allocation-rate or memory-bandwidth benchmark, since the bulk of our time is spent constructing Fixnum objects or updating per-call runtime data structures.But since it's a simple result to show, I'm going to abuse it again. Please keep in mind these numbers are meaningless except in comparison to each other; JRuby is still cranking through a lot more objects than implementations with true Fixnums or value types, and the JVM is almost certainly over-optimizing portions of this benchmark. But of course, that's part of the point; we're making it possible for the JVM to accelerate and potentially eliminate unnecessary computation, and it's all becoming possible because we're using runtime data to improve runtime compilation.I'll be using JRuby 1.6.dev (master plus my hacks) on OS X Java 6 (1.6.0_20) 64-bit Server VM on a MacBook Pro Core 2 Duo at 2.66GHz.First, the basic JRuby "fib" numbers with full Ruby backtraces and dynamic calls: 0.357000 0.000000 0.357000 ( 0.290000) 0.176000 0.000000 0.176000 ( 0.176000) 0.174000 0.000000 0.174000 ( 0.174000) 0.175000 0.000000 0.175000 ( 0.175000) 0.174000 0.000000 0.174000 ( 0.174000)Now, with backtraces calculated from Java backtraces, dynamic calls restructured slightly to be more inlinable (but still dynamic and via invokers), and limited use of primitive call paths. Essentially the fastest we can get in JRuby 1.5 without totally breaking compatibility (and with some minor breakages, in fact): 0.383000 0.000000 0.383000 ( 0.330000) 0.109000 0.000000 0.109000 ( 0.108000) 0.111000 0.000000 0.111000 ( 0.112000) 0.101000 0.000000 0.101000 ( 0.101000) 0.101000 0.000000 0.101000 ( 0.101000)Now, using all the runtime optimizations described above. Keep in mind these numbers will probably drop a bit with guards in place (but they're still pretty fantastic): 0.443000 0.000000 0.443000 ( 0.376000) 0.035000 0.000000 0.035000 ( 0.035000) 0.036000 0.000000 0.036000 ( 0.036000) 0.036000 0.000000 0.036000 ( 0.036000) 0.036000 0.000000 0.036000 ( 0.036000)And a couple comparisons of another mostly-recursive, largely-abused benchmark target: the tak function...Static optimizations only, as fast as we can make it: 1.750000 0.000000 1.750000 ( 1.695000) 0.779000 0.000000 0.779000 ( 0.779000) 0.764000 0.000000 0.764000 ( 0.764000) 0.775000 0.000000 0.775000 ( 0.775000) 0.763000 0.000000 0.763000 ( 0.763000)And with runtime optimizations: 0.899000 0.000000 0.899000 ( 0.832000) 0.331000 0.000000 0.331000 ( 0.331000) 0.332000 0.000000 0.332000 ( 0.332000) 0.329000 0.000000 0.329000 ( 0.329000) 0.331000 0.000000 0.331000 ( 0.332000)So for these trivial benchmarks, a couple hours hacking in JRuby's compiler has produced numbers 2-3x faster than the fastest JRuby performance we've achieved thusfar. Understanding why requires one last section.Hooray for the JVMThere's a missing trick here I haven't explained: the JVM is still doing most of the work for us.In the plain old dynamic-calling, static-optimized case, the JVM is dutifully optimizing everything we throw at it. It's taking our Ruby calls, inlining the invokers and their eventual calls, inlining core class methods (yes, this means we've always been able to inline Ruby into Ruby or Java into Ruby or Ruby into Java, given the appropriate tweaks), and optimizing things very well. But here's the problem: the JVM's default settings are tuned for optimizing *Java*, not for optimizing Ruby. In Java, a call from one method to another has exactly one hop. In JRuby's dynamic calls, it may take three or four hops, bouncing through CallSites and DynamicMethods to eventually get to the target. Since Hotspot only inlines up to 9 levels of calls by default, you can see we're eating up that budget very quickly. Add to that the fact that Java to Java calls have no intervening code to eat up bytecode size thresholds, and we're essentially only getting a fraction of the optimization potential out of Hotspot that a "simpler" language like Java does.Now consider the dynamically-optimized case. We've eliminated both the CallSite and the DynamicMethod from most calls, even if we have to insert a bit of guard code to do it. That means nine levels of Ruby calls can inline, or nine levels of logic from Ruby to core methods or Ruby to Java. We've now given Hotspot a much better picture of the system to optimize. We've also eliminated some of the background noise of doing a Ruby call, like updating rarely-used data structures. We'll need to ensure backtraces come out in some usable form, but at least we're not double-tracking them.The bottom line here is that instead of doing all the much harder work of adding inlining to our own compiler, all we need to do is let the JVM do it, and we benefit from the years of work that have gone into current JVMs' optimizing compilers. The best way to solve a hard problem is to get someone else to solve it for you, and the JVM does an excellent job of solving this particular hard problem. Instead of giving the JVM a fish *or* teaching it how to fish, we're just giving it a map of the lake; it's already an expert fisherman.There's also another side to these optimizations: our continued maintenance of an interpreter is starting to pay off. Other JVM languages that don't have an interpreted mode have a much more difficult time doing any runtime optimization; simply put, it's really hard to swap out bytecode you've already loaded unless you're explicitly abstracting how that bytecode gets generated in the first place. In JRuby, where methods have always had to switch from interpreted to jitted at runtime, we're can can hop back and forth much more freely, optimizing along the way.That's all for now. Don't expect to download JRuby 1.6 in a few months and see everything be three times (or ten times! or 100 times!) faster. There's a lot of details we need to work out about when these optimizations will be safe, how users might be able to opt-in to "full" optimization if necessary, and whether we'll get things fast enough to start replacing core code with Ruby. But early results are very promising, and it's assured we'll ship some of this very soon. [Less]
|
|
Posted
over 15 years
ago
by
[email protected] (Charles Oliver Nutter)
We've often talked about the benefit of running on the JVM, and while our performance numbers have been *good*, they've never been as incredibly awesome as a lot of people expected. After all, regardless of how well we performed when compared to
... [More]
other Ruby implementations, we weren't as fast as statically-typed JVM languages.It's time for that to change.I've started playing with performing more optimizations based on runtime information in JRuby. You may know that JRuby has always had a JIT (just-in-time compiler), which lazily compiled Ruby AST into JVM bytecode. What you may not know is that unlike most other JIT-based systems, we did not gather any information at runtime that might help the eventual compilation produces a better result. All we applied were the same static optimizations we could safely do for AOT compilation (ahead-of-time, like jrubyc), and ultimately the JIT mode just deferred compilation to reduce the startup cost of compiling everything before it runs.This was a pragmatic decision; the JVM itself does a lot to boost performance, even with our very naïve compiler and our limited static optimizations. Because of that, our performance has been perfectly fine for most users; we haven't even made a concentrated effort to improve execution speed for almost 18 months, since the JRuby 1.1.6 release. We've spent a lot more time handling the issues users actually wanted us to work on: better Java integration, specific Ruby incompatibilities, memory reduction (and occasional leaks), peripheral libraries like jruby-rack and activerecord-jdbc, and general system stability. When we asked users what we should focus on, performance always came up as a "nice to have" but rarely as something people had a problem with. We performed very nicely.These days, however, we're recognizing a few immutable truths:You can never be fast enough, and if your performance doesn't steadily improve people will feel like you're moving backward.People eventually *do* want Ruby code to run as fast as C or Java, and if we can make it happen we should.JRuby users like to write in Ruby, obviously, so we should make an effort to allow writing more of JRuby in Ruby, which requires that we suffer no performance penalty in the process.Working on performance can be a dreadful time sink, but succeeding in improving it can be a tremendous (albeit shallow) ego boost.So along with other work for JRuby 1.6, I'm back in the compiler.I'm just starting to play with this stuff, so take all my results as highly preliminary.A JRuby Call Site PrimerAt each location in Ruby code where a dynamic call happens, JRuby installs what's called a call site. Other VMs may simply refer to the location of the call as the "call site", but in JRuby, each call site has an implementation of org.jruby.runtime.CallSite associated with it. So given the following code:def fib(a) if a < 2 a else fib(a - 1) + fib(a - 2) endendThere are six call sites: the "<" call for "a < 2", the two "-" calls for "a - 1" and "a - 2", the two "fib" calls, and the "+" call for "fib(a - 1) + fib(a - 2)". The naïve approach to doing a dynamic call would be to query the object for the named method every time and then invoke it, like the dumbest possible reflection code might do. In JRuby (like most dynamic language VMs) we instead have a call site cache at each call site to blunt that lookup cost. And in implementation terms, that means each caching CallSite is an instance of org.jruby.runtime.callsite.CachingCallSite. Easy, right?The simplest way to take advantage of runtime information is to use these caching call sites as hints for what method we've been calling at a given call site. Let's take the example of "+". The "+" method here is being called against a Fixnum object every time it's encountered, since our particular run of "fib" never overflows into Bignum and never works with Float objects. In JRuby, Fixnum is implemented with org.jruby.RubyFixnum, and the "+" method is bound to RubyFixnum.op_plus. Here's the implementation of op_plus in JRuby: @JRubyMethod(name = "+") public IRubyObject op_plus(ThreadContext context, IRubyObject other) { if (other instanceof RubyFixnum) { return addFixnum(context, (RubyFixnum)other); } return addOther(context, other); }The details of the actual addition in addFixnum are left as an exercise for the reader.Notice that we have an annotation on the method: @JRubyMethod(name = "+"). All the Ruby core classes implemented in JRuby are tagged with a similar annotation, where we can specify the minimum and maximum numbers of arguments, the Ruby-visible method name (e.g. "+" here, which is not a valid Java method name), and other details of how the method should be called. Notice also that the method takes two arguments: the current ThreadContext (thread-local Ruby-specific runtime structures), and the "other" argument being added.When we build JRuby, we scan the JRuby codebase for JRubyMethod annotations and generate what we call invokers, one for every unique class+name combination in the core classes. These invokers are all instance of org.jruby.internal.runtime.methods.DynamicMethod, and they carry various informational details about the method as well as implementing various "call" signatures. The reason we generate an invoker class per method is simple: most JVMs will not inline through code used for many different code paths, so if we only had a single invoker for all methods in the system...nothing would ever inline.So, putting it all together. At runtime, we have a CachingCallSite instance for every method call in the code. The first time we call "+" for example, we go out and fetch the method object and cache it in place for future calls. That cached method object is a subclass of DynamicMethod, which knows how to do the eventual invocation of RubyFixnum.op_plus and return the desired result. Simple!First Steps Toward Runtime OptimizationThe changes I'm working on locally will finally take advantage of the fact that after running for a while, we have a darn good idea what methods are being called. And if we know what methods are being called, those invocations no longer have to be dynamic, provided our assumptions hold (assumptions being things like "we're always calling against Fixnums" or "it's always the core implementation of "+"). Put differently: because JRuby defers its eventual compilation to JVM bytecode, we can potentially turn dynamic calls into static calls.The process is actually very simple, and since I just started playing with it two days ago, I'll describe the steps I took.Step One: Turn Dynamic Into StaticFirst, I needed the generated methods to bring along enough information for me to do a direct call. Specifically, I needed to know what target Java class they pointed at (RubyFixnum in our "+" example), what the Java name of the method was (op_plus), what arguments the method signature expected to receive (returning IRubyObject and receiving ThreadContext, IRubyObject), and whether the implementation was static (as are most module-based methods...but our "op_plus" is not). So I bundled this information into what I called a "NativeCall" data structure, populated on any invokers for which a native call might be possible. public static class NativeCall { private final Class nativeTarget; private final String nativeName; private final Class nativeReturn; private final Class[] nativeSignature; private final boolean statik; ...In the compiler, I added a similar bit of logic. When compiling a dynamic call, it now will look to see whether the call site has already cached some method. If so, it checks whether that method has a corresponding NativeCall data structure associated with it. If it does, then instead of emitting the dynamic call logic it would normally emit, it compiles a normal, direct JVM invocation. So where we used to have a call like this:INVOKEVIRTUAL org/jruby/runtime/CallSite.callWe now have a call like this:INVOKEVIRTUAL org/jruby/RubyFixnum.op_plusThis call to "+" is now essentially equivalent to writing the same code in Java against our RubyFixnum class.This comprises the first step: get the compiler to recognize and "make static" dynamic calls we've seen before. My experimental code is able to do this for most core class methods right now, and it will not be difficult to extend it to any call.(The astute VM implementer will notice I made no mention of inserting a guard or test before the direct call, in case we later need to actually call a different method. This is, for the moment, intentional. I'll come back to it).Step Two: Reduce Some Fixnum UseThe second step I took was to allow some call paths to use primitive values rather than boxed RubyFixnum objects. RubyFixnum in JRuby always represents a 64-bit long value, but since our call path only supports invocation with IRubyObject, we generally have to construct a new RubyFixnum for all dynamic calls. These objects are very small and usually very short-lived, so they don't impact GC times much. But they do impact allocation rates; we still have to grab the memory space for every RubyFixnum, and so we're constantly chewing up memory bandwidth to do so.There are various ways in which VMs can eliminate or reduce object allocations like this: stack allocation, value types, true fixnums, escape analysis, and more. But of these, only escape analysis is available on current JVMs, and it's a very fragile optimization: all paths that would consume an object must be completely inlined, or else the object can't be elided.So to help reduce the burden on the JVM, we have a couple call paths that can receive a single long or double argument where it's possible for us to prove that we're passing a boxed long or double value (i.e. a RubyFixnum or RubyFloat). The second step I took was to make the compiler aware of several of these methods on RubyFixnum, such as the long version of op_plus: public IRubyObject op_plus(ThreadContext context, long other) { return addFixnum(context, other); }Since we have not yet implemented more advanced means of proving a given object is always a Fixnum (like doing local type propagation or per-variable type profiling at runtime), the compiler currently can only see that we're making a call with a literal value like "100" or "12.5". As luck would have it, there's three such cases in the "fib" implementation above: one for the "<" call and two for the "-" calls. By combining the compiler's knowledge of which actual method is being called in each case and how to make a call without standing up a RubyFixnum object, our actual call in the bytecode gets simplified further:INVOKEVIRTUAL org/jruby/RubyFixnum.op_minus (Lorg/jruby/runtime/ThreadContext;J)(For the uninitiated, that "J" at the end of the signature means this call is passing a primitive long for the second argument to op_minus.)We now have the potential to insert additional compiler smarts about how to make a call more efficiently. This is a simple case, of course, but consider the potential for another case: calling from Ruby to arbitrary Java code. Instead of encumbering that call with all our own multi-method dispatch logic *plus* the multiple layers of Java's reflection API, we can instead make the call *directly*, going straight from Ruby code to Java code with no intervening code. And with no intervening code, the JVM will be able to inline arbitrary Java code straight into Ruby code, optimize it as a whole. Are we getting excited yet?Step Three: Steal a Micro-optimizationThere's two pesky calls remaining in "fib": the recursive invocations of "fib" itself. Now the smart thing to do would be to proceed with adding logic to the compiler to be aware of calls from currently-jitting Ruby code to not-yet jitted Ruby code and handle that accordingly. For example, we might also force those methods to compile, and the methods they call to compile, and so on, allowing us to optimize from a "hot" method outward to potentially "colder" methods within some threshold distance. And of course, that's probably what we'll do; it's trivial to trigger any Ruby method in JRuby to JIT at any time, so adding this to the compiler will be a few minutes work. But since I've only been playing with this since Thursday, I figured I'd do an even smarter thing: cheat.Instead of adding the additional compiler smarts, I instead threw in a dirt-simple single-optimization subset of those smarts: detect self-recursion and optimize that case alone. In JRuby terms, this meant simply seeing that the "fib" CachingCallSite objects pointed back to the same "fib" method we were currently compiling. Instead of plumbing them through the dynamic pipeline, we make them be direct recursive calls, similar to the core method calls to "<" and "-" and "+". So our fib calls turn into the following bytecode-level invocation:INVOKESTATIC ruby/jit/fib_ruby_B52DB2843EB6D56226F26C559F5510210E9E330D.__file__Where "B52DB28..." is the SHA1 hash of the "fib" method, and __file__ is the default entry point for jitted calls.Everything Else I'm Not Telling YouAs with any experiment, I'm bending compatibility a bit here to show how far we can move JRuby's "upper bound" on performance. So these optimizations currently include some caveats.They currently damage Ruby backtraces, since by skipping the invokers we're no longer tracking Ruby-specific backtrace information (current method, file, line, etc) in a heap-based structure. This is one area where JRuby loses some performance, since the JVM is actually double-tracking both the Java backtrace information and the Ruby backtrace information. We will need to do a bit more work toward making the Java backtrace actually show the relevant Ruby information, or else generate our Ruby backtraces by mining the Java backtrace (and the existing "--fast" flag actually does this already, matching normal Ruby traces pretty well).The changes also don't track the data necessary for managing "backref" and "lastline" data (the $~ and $_ pseudo-globals), which means that regular expression matches or line reads might have reduced functionality in the presence of these optimizations (though you might never notice; those variables are usually discouraged). This behavior can be restored by clever use of thread-local stacks (only pushing down the stack when in the presence of a method that might use those pseudo-globals), or by simply easing back the optimizations if those variables are likely to be used. Ideally we'll be able to use runtime profiling to make a smart decision, since we'll know whether a given call site has ever encountered a method that reads or writes $~ or $_.Finally, I have not inserted the necessary guards before these direct invocations that would branch to doing a normal dynamic call. This was again a pragmatic decision to speed the progress of my experiment...but it also raises an interesting question. What if we knew, without a doubt, that our code had seen all the types and methods it would ever see. Wouldn't it be nice to say "optimize yourself to death" and know it's doing so? While I doubt we'll see a JRuby ship without some guard in place (ideally a simple "method ID" comparison, which I wouldn't expect to impact performance much), I think it's almost assured that we'll allow users to "opt in" to completely "statickifying" specific of code. And it's likely that if we started moving more of JRuby's implementation into Ruby code, we would take advantage of "fully" optimizing as well.With all these caveats in mind, let's take a look at some numbers.And Now, Numbers Meaningless to Anyone Not Using JRuby To Calculate Fibonacci NumbersThe "fib" method is dreadfully abused in benchmarking. It's largely a method call benchmark, since most calls spawn at least two more recursive calls and potentially several others if numeric operations are calls as well. In JRuby, it doubles as an allocation-rate or memory-bandwidth benchmark, since the bulk of our time is spent constructing Fixnum objects or updating per-call runtime data structures.But since it's a simple result to show, I'm going to abuse it again. Please keep in mind these numbers are meaningless except in comparison to each other; JRuby is still cranking through a lot more objects than implementations with true Fixnums or value types, and the JVM is almost certainly over-optimizing portions of this benchmark. But of course, that's part of the point; we're making it possible for the JVM to accelerate and potentially eliminate unnecessary computation, and it's all becoming possible because we're using runtime data to improve runtime compilation.I'll be using JRuby 1.6.dev (master plus my hacks) on OS X Java 6 (1.6.0_20) 64-bit Server VM on a MacBook Pro Core 2 Duo at 2.66GHz.First, the basic JRuby "fib" numbers with full Ruby backtraces and dynamic calls: 0.357000 0.000000 0.357000 ( 0.290000) 0.176000 0.000000 0.176000 ( 0.176000) 0.174000 0.000000 0.174000 ( 0.174000) 0.175000 0.000000 0.175000 ( 0.175000) 0.174000 0.000000 0.174000 ( 0.174000)Now, with backtraces calculated from Java backtraces, dynamic calls restructured slightly to be more inlinable (but still dynamic and via invokers), and limited use of primitive call paths. Essentially the fastest we can get in JRuby 1.5 without totally breaking compatibility (and with some minor breakages, in fact): 0.383000 0.000000 0.383000 ( 0.330000) 0.109000 0.000000 0.109000 ( 0.108000) 0.111000 0.000000 0.111000 ( 0.112000) 0.101000 0.000000 0.101000 ( 0.101000) 0.101000 0.000000 0.101000 ( 0.101000)Now, using all the runtime optimizations described above. Keep in mind these numbers will probably drop a bit with guards in place (but they're still pretty fantastic): 0.443000 0.000000 0.443000 ( 0.376000) 0.035000 0.000000 0.035000 ( 0.035000) 0.036000 0.000000 0.036000 ( 0.036000) 0.036000 0.000000 0.036000 ( 0.036000) 0.036000 0.000000 0.036000 ( 0.036000)And a couple comparisons of another mostly-recursive, largely-abused benchmark target: the tak function...Static optimizations only, as fast as we can make it: 1.750000 0.000000 1.750000 ( 1.695000) 0.779000 0.000000 0.779000 ( 0.779000) 0.764000 0.000000 0.764000 ( 0.764000) 0.775000 0.000000 0.775000 ( 0.775000) 0.763000 0.000000 0.763000 ( 0.763000)And with runtime optimizations: 0.899000 0.000000 0.899000 ( 0.832000) 0.331000 0.000000 0.331000 ( 0.331000) 0.332000 0.000000 0.332000 ( 0.332000) 0.329000 0.000000 0.329000 ( 0.329000) 0.331000 0.000000 0.331000 ( 0.332000)So for these trivial benchmarks, a couple hours hacking in JRuby's compiler has produced numbers 2-3x faster than the fastest JRuby performance we've achieved thusfar. Understanding why requires one last section.Hooray for the JVMThere's a missing trick here I haven't explained: the JVM is still doing most of the work for us.In the plain old dynamic-calling, static-optimized case, the JVM is dutifully optimizing everything we throw at it. It's taking our Ruby calls, inlining the invokers and their eventual calls, inlining core class methods (yes, this means we've always been able to inline Ruby into Ruby or Java into Ruby or Ruby into Java, given the appropriate tweaks), and optimizing things very well. But here's the problem: the JVM's default settings are tuned for optimizing *Java*, not for optimizing Ruby. In Java, a call from one method to another has exactly one hop. In JRuby's dynamic calls, it may take three or four hops, bouncing through CallSites and DynamicMethods to eventually get to the target. Since Hotspot only inlines up to 9 levels of calls by default, you can see we're eating up that budget very quickly. Add to that the fact that Java to Java calls have no intervening code to eat up bytecode size thresholds, and we're essentially only getting a fraction of the optimization potential out of Hotspot that a "simpler" language like Java does.Now consider the dynamically-optimized case. We've eliminated both the CallSite and the DynamicMethod from most calls, even if we have to insert a bit of guard code to do it. That means nine levels of Ruby calls can inline, or nine levels of logic from Ruby to core methods or Ruby to Java. We've now given Hotspot a much better picture of the system to optimize. We've also eliminated some of the background noise of doing a Ruby call, like updating rarely-used data structures. We'll need to ensure backtraces come out in some usable form, but at least we're not double-tracking them.The bottom line here is that instead of doing all the much harder work of adding inlining to our own compiler, all we need to do is let the JVM do it, and we benefit from the years of work that have gone into current JVMs' optimizing compilers. The best way to solve a hard problem is to get someone else to solve it for you, and the JVM does an excellent job of solving this particular hard problem. Instead of giving the JVM a fish *or* teaching it how to fish, we're just giving it a map of the lake; it's already an expert fisherman.There's also another side to these optimizations: our continued maintenance of an interpreter is starting to pay off. Other JVM languages that don't have an interpreted mode have a much more difficult time doing any runtime optimization; simply put, it's really hard to swap out bytecode you've already loaded unless you're explicitly abstracting how that bytecode gets generated in the first place. In JRuby, where methods have always had to switch from interpreted to jitted at runtime, we're can can hop back and forth much more freely, optimizing along the way.That's all for now. Don't expect to download JRuby 1.6 in a few months and see everything be three times (or ten times! or 100 times!) faster. There's a lot of details we need to work out about when these optimizations will be safe, how users might be able to opt-in to "full" optimization if necessary, and whether we'll get things fast enough to start replacing core code with Ruby. But early results are very promising, and it's assured we'll ship some of this very soon. [Less]
|
|
Posted
over 15 years
ago
by
[email protected] (Charles Oliver Nutter)
We've often talked about the benefit of running on the JVM, and while our performance numbers have been *good*, they've never been as incredibly awesome as a lot of people expected. After all, regardless of how well we performed when compared to
... [More]
other Ruby implementations, we weren't as fast as statically-typed JVM languages.It's time for that to change.I've started playing with performing more optimizations based on runtime information in JRuby. You may know that JRuby has always had a JIT (just-in-time compiler), which lazily compiled Ruby AST into JVM bytecode. What you may not know is that unlike most other JIT-based systems, we did not gather any information at runtime that might help the eventual compilation produces a better result. All we applied were the same static optimizations we could safely do for AOT compilation (ahead-of-time, like jrubyc), and ultimately the JIT mode just deferred compilation to reduce the startup cost of compiling everything before it runs.This was a pragmatic decision; the JVM itself does a lot to boost performance, even with our very naïve compiler and our limited static optimizations. Because of that, our performance has been perfectly fine for most users; we haven't even made a concentrated effort to improve execution speed for almost 18 months, since the JRuby 1.1.6 release. We've spent a lot more time handling the issues users actually wanted us to work on: better Java integration, specific Ruby incompatibilities, memory reduction (and occasional leaks), peripheral libraries like jruby-rack and activerecord-jdbc, and general system stability. When we asked users what we should focus on, performance always came up as a "nice to have" but rarely as something people had a problem with. We performed very nicely.These days, however, we're recognizing a few immutable truths:You can never be fast enough, and if your performance doesn't steadily improve people will feel like you're moving backward.People eventually *do* want Ruby code to run as fast as C or Java, and if we can make it happen we should.JRuby users like to write in Ruby, obviously, so we should make an effort to allow writing more of JRuby in Ruby, which requires that we suffer no performance penalty in the process.Working on performance can be a dreadful time sink, but succeeding in improving it can be a tremendous (albeit shallow) ego boost.So along with other work for JRuby 1.6, I'm back in the compiler.I'm just starting to play with this stuff, so take all my results as highly preliminary.A JRuby Call Site PrimerAt each location in Ruby code where a dynamic call happens, JRuby installs what's called a call site. Other VMs may simply refer to the location of the call as the "call site", but in JRuby, each call site has an implementation of org.jruby.runtime.CallSite associated with it. So given the following code:def fib(a) if a < 2 a else fib(a - 1) + fib(a - 2) endendThere are six call sites: the "<" call for "a < 2", the two "-" calls for "a - 1" and "a - 2", the two "fib" calls, and the "+" call for "fib(a - 1) + fib(a - 2)". The naïve approach to doing a dynamic call would be to query the object for the named method every time and then invoke it, like the dumbest possible reflection code might do. In JRuby (like most dynamic language VMs) we instead have a call site cache at each call site to blunt that lookup cost. And in implementation terms, that means each caching CallSite is an instance of org.jruby.runtime.callsite.CachingCallSite. Easy, right?The simplest way to take advantage of runtime information is to use these caching call sites as hints for what method we've been calling at a given call site. Let's take the example of "+". The "+" method here is being called against a Fixnum object every time it's encountered, since our particular run of "fib" never overflows into Bignum and never works with Float objects. In JRuby, Fixnum is implemented with org.jruby.RubyFixnum, and the "+" method is bound to RubyFixnum.op_plus. Here's the implementation of op_plus in JRuby: @JRubyMethod(name = "+") public IRubyObject op_plus(ThreadContext context, IRubyObject other) { if (other instanceof RubyFixnum) { return addFixnum(context, (RubyFixnum)other); } return addOther(context, other); }The details of the actual addition in addFixnum are left as an exercise for the reader.Notice that we have an annotation on the method: @JRubyMethod(name = "+"). All the Ruby core classes implemented in JRuby are tagged with a similar annotation, where we can specify the minimum and maximum numbers of arguments, the Ruby-visible method name (e.g. "+" here, which is not a valid Java method name), and other details of how the method should be called. Notice also that the method takes two arguments: the current ThreadContext (thread-local Ruby-specific runtime structures), and the "other" argument being added.When we build JRuby, we scan the JRuby codebase for JRubyMethod annotations and generate what we call invokers, one for every unique class+name combination in the core classes. These invokers are all instance of org.jruby.internal.runtime.methods.DynamicMethod, and they carry various informational details about the method as well as implementing various "call" signatures. The reason we generate an invoker class per method is simple: most JVMs will not inline through code used for many different code paths, so if we only had a single invoker for all methods in the system...nothing would ever inline.So, putting it all together. At runtime, we have a CachingCallSite instance for every method call in the code. The first time we call "+" for example, we go out and fetch the method object and cache it in place for future calls. That cached method object is a subclass of DynamicMethod, which knows how to do the eventual invocation of RubyFixnum.op_plus and return the desired result. Simple!First Steps Toward Runtime OptimizationThe changes I'm working on locally will finally take advantage of the fact that after running for a while, we have a darn good idea what methods are being called. And if we know what methods are being called, those invocations no longer have to be dynamic, provided our assumptions hold (assumptions being things like "we're always calling against Fixnums" or "it's always the core implementation of "+"). Put differently: because JRuby defers its eventual compilation to JVM bytecode, we can potentially turn dynamic calls into static calls.The process is actually very simple, and since I just started playing with it two days ago, I'll describe the steps I took.Step One: Turn Dynamic Into StaticFirst, I needed the generated methods to bring along enough information for me to do a direct call. Specifically, I needed to know what target Java class they pointed at (RubyFixnum in our "+" example), what the Java name of the method was (op_plus), what arguments the method signature expected to receive (returning IRubyObject and receiving ThreadContext, IRubyObject), and whether the implementation was static (as are most module-based methods...but our "op_plus" is not). So I bundled this information into what I called a "NativeCall" data structure, populated on any invokers for which a native call might be possible. public static class NativeCall { private final Class nativeTarget; private final String nativeName; private final Class nativeReturn; private final Class[] nativeSignature; private final boolean statik; ...In the compiler, I added a similar bit of logic. When compiling a dynamic call, it now will look to see whether the call site has already cached some method. If so, it checks whether that method has a corresponding NativeCall data structure associated with it. If it does, then instead of emitting the dynamic call logic it would normally emit, it compiles a normal, direct JVM invocation. So where we used to have a call like this:INVOKEVIRTUAL org/jruby/runtime/CallSite.callWe now have a call like this:INVOKEVIRTUAL org/jruby/RubyFixnum.op_plusThis call to "+" is now essentially equivalent to writing the same code in Java against our RubyFixnum class.This comprises the first step: get the compiler to recognize and "make static" dynamic calls we've seen before. My experimental code is able to do this for most core class methods right now, and it will not be difficult to extend it to any call.(The astute VM implementer will notice I made no mention of inserting a guard or test before the direct call, in case we later need to actually call a different method. This is, for the moment, intentional. I'll come back to it).Step Two: Reduce Some Fixnum UseThe second step I took was to allow some call paths to use primitive values rather than boxed RubyFixnum objects. RubyFixnum in JRuby always represents a 64-bit long value, but since our call path only supports invocation with IRubyObject, we generally have to construct a new RubyFixnum for all dynamic calls. These objects are very small and usually very short-lived, so they don't impact GC times much. But they do impact allocation rates; we still have to grab the memory space for every RubyFixnum, and so we're constantly chewing up memory bandwidth to do so.There are various ways in which VMs can eliminate or reduce object allocations like this: stack allocation, value types, true fixnums, escape analysis, and more. But of these, only escape analysis is available on current JVMs, and it's a very fragile optimization: all paths that would consume an object must be completely inlined, or else the object can't be elided.So to help reduce the burden on the JVM, we have a couple call paths that can receive a single long or double argument where it's possible for us to prove that we're passing a boxed long or double value (i.e. a RubyFixnum or RubyFloat). The second step I took was to make the compiler aware of several of these methods on RubyFixnum, such as the long version of op_plus: public IRubyObject op_plus(ThreadContext context, long other) { return addFixnum(context, other); }Since we have not yet implemented more advanced means of proving a given object is always a Fixnum (like doing local type propagation or per-variable type profiling at runtime), the compiler currently can only see that we're making a call with a literal value like "100" or "12.5". As luck would have it, there's three such cases in the "fib" implementation above: one for the "<" call and two for the "-" calls. By combining the compiler's knowledge of which actual method is being called in each case and how to make a call without standing up a RubyFixnum object, our actual call in the bytecode gets simplified further:INVOKEVIRTUAL org/jruby/RubyFixnum.op_minus (Lorg/jruby/runtime/ThreadContext;J)(For the uninitiated, that "J" at the end of the signature means this call is passing a primitive long for the second argument to op_minus.)We now have the potential to insert additional compiler smarts about how to make a call more efficiently. This is a simple case, of course, but consider the potential for another case: calling from Ruby to arbitrary Java code. Instead of encumbering that call with all our own multi-method dispatch logic *plus* the multiple layers of Java's reflection API, we can instead make the call *directly*, going straight from Ruby code to Java code with no intervening code. And with no intervening code, the JVM will be able to inline arbitrary Java code straight into Ruby code, optimize it as a whole. Are we getting excited yet?Step Three: Steal a Micro-optimizationThere's two pesky calls remaining in "fib": the recursive invocations of "fib" itself. Now the smart thing to do would be to proceed with adding logic to the compiler to be aware of calls from currently-jitting Ruby code to not-yet jitted Ruby code and handle that accordingly. For example, we might also force those methods to compile, and the methods they call to compile, and so on, allowing us to optimize from a "hot" method outward to potentially "colder" methods within some threshold distance. And of course, that's probably what we'll do; it's trivial to trigger any Ruby method in JRuby to JIT at any time, so adding this to the compiler will be a few minutes work. But since I've only been playing with this since Thursday, I figured I'd do an even smarter thing: cheat.Instead of adding the additional compiler smarts, I instead threw in a dirt-simple single-optimization subset of those smarts: detect self-recursion and optimize that case alone. In JRuby terms, this meant simply seeing that the "fib" CachingCallSite objects pointed back to the same "fib" method we were currently compiling. Instead of plumbing them through the dynamic pipeline, we make them be direct recursive calls, similar to the core method calls to "<" and "-" and "+". So our fib calls turn into the following bytecode-level invocation:INVOKESTATIC ruby/jit/fib_ruby_B52DB2843EB6D56226F26C559F5510210E9E330D.__file__Where "B52DB28..." is the SHA1 hash of the "fib" method, and __file__ is the default entry point for jitted calls.Everything Else I'm Not Telling YouAs with any experiment, I'm bending compatibility a bit here to show how far we can move JRuby's "upper bound" on performance. So these optimizations currently include some caveats.They currently damage Ruby backtraces, since by skipping the invokers we're no longer tracking Ruby-specific backtrace information (current method, file, line, etc) in a heap-based structure. This is one area where JRuby loses some performance, since the JVM is actually double-tracking both the Java backtrace information and the Ruby backtrace information. We will need to do a bit more work toward making the Java backtrace actually show the relevant Ruby information, or else generate our Ruby backtraces by mining the Java backtrace (and the existing "--fast" flag actually does this already, matching normal Ruby traces pretty well).The changes also don't track the data necessary for managing "backref" and "lastline" data (the $~ and $_ pseudo-globals), which means that regular expression matches or line reads might have reduced functionality in the presence of these optimizations (though you might never notice; those variables are usually discouraged). This behavior can be restored by clever use of thread-local stacks (only pushing down the stack when in the presence of a method that might use those pseudo-globals), or by simply easing back the optimizations if those variables are likely to be used. Ideally we'll be able to use runtime profiling to make a smart decision, since we'll know whether a given call site has ever encountered a method that reads or writes $~ or $_.Finally, I have not inserted the necessary guards before these direct invocations that would branch to doing a normal dynamic call. This was again a pragmatic decision to speed the progress of my experiment...but it also raises an interesting question. What if we knew, without a doubt, that our code had seen all the types and methods it would ever see. Wouldn't it be nice to say "optimize yourself to death" and know it's doing so? While I doubt we'll see a JRuby ship without some guard in place (ideally a simple "method ID" comparison, which I wouldn't expect to impact performance much), I think it's almost assured that we'll allow users to "opt in" to completely "statickifying" specific of code. And it's likely that if we started moving more of JRuby's implementation into Ruby code, we would take advantage of "fully" optimizing as well.With all these caveats in mind, let's take a look at some numbers.And Now, Numbers Meaningless to Anyone Not Using JRuby To Calculate Fibonacci NumbersThe "fib" method is dreadfully abused in benchmarking. It's largely a method call benchmark, since most calls spawn at least two more recursive calls and potentially several others if numeric operations are calls as well. In JRuby, it doubles as an allocation-rate or memory-bandwidth benchmark, since the bulk of our time is spent constructing Fixnum objects or updating per-call runtime data structures.But since it's a simple result to show, I'm going to abuse it again. Please keep in mind these numbers are meaningless except in comparison to each other; JRuby is still cranking through a lot more objects than implementations with true Fixnums or value types, and the JVM is almost certainly over-optimizing portions of this benchmark. But of course, that's part of the point; we're making it possible for the JVM to accelerate and potentially eliminate unnecessary computation, and it's all becoming possible because we're using runtime data to improve runtime compilation.I'll be using JRuby 1.6.dev (master plus my hacks) on OS X Java 6 (1.6.0_20) 64-bit Server VM on a MacBook Pro Core 2 Duo at 2.66GHz.First, the basic JRuby "fib" numbers with full Ruby backtraces and dynamic calls: 0.357000 0.000000 0.357000 ( 0.290000) 0.176000 0.000000 0.176000 ( 0.176000) 0.174000 0.000000 0.174000 ( 0.174000) 0.175000 0.000000 0.175000 ( 0.175000) 0.174000 0.000000 0.174000 ( 0.174000)Now, with backtraces calculated from Java backtraces, dynamic calls restructured slightly to be more inlinable (but still dynamic and via invokers), and limited use of primitive call paths. Essentially the fastest we can get in JRuby 1.5 without totally breaking compatibility (and with some minor breakages, in fact): 0.383000 0.000000 0.383000 ( 0.330000) 0.109000 0.000000 0.109000 ( 0.108000) 0.111000 0.000000 0.111000 ( 0.112000) 0.101000 0.000000 0.101000 ( 0.101000) 0.101000 0.000000 0.101000 ( 0.101000)Now, using all the runtime optimizations described above. Keep in mind these numbers will probably drop a bit with guards in place (but they're still pretty fantastic): 0.443000 0.000000 0.443000 ( 0.376000) 0.035000 0.000000 0.035000 ( 0.035000) 0.036000 0.000000 0.036000 ( 0.036000) 0.036000 0.000000 0.036000 ( 0.036000) 0.036000 0.000000 0.036000 ( 0.036000)And a couple comparisons of another mostly-recursive, largely-abused benchmark target: the tak function...Static optimizations only, as fast as we can make it: 1.750000 0.000000 1.750000 ( 1.695000) 0.779000 0.000000 0.779000 ( 0.779000) 0.764000 0.000000 0.764000 ( 0.764000) 0.775000 0.000000 0.775000 ( 0.775000) 0.763000 0.000000 0.763000 ( 0.763000)And with runtime optimizations: 0.899000 0.000000 0.899000 ( 0.832000) 0.331000 0.000000 0.331000 ( 0.331000) 0.332000 0.000000 0.332000 ( 0.332000) 0.329000 0.000000 0.329000 ( 0.329000) 0.331000 0.000000 0.331000 ( 0.332000)So for these trivial benchmarks, a couple hours hacking in JRuby's compiler has produced numbers 2-3x faster than the fastest JRuby performance we've achieved thusfar. Understanding why requires one last section.Hooray for the JVMThere's a missing trick here I haven't explained: the JVM is still doing most of the work for us.In the plain old dynamic-calling, static-optimized case, the JVM is dutifully optimizing everything we throw at it. It's taking our Ruby calls, inlining the invokers and their eventual calls, inlining core class methods (yes, this means we've always been able to inline Ruby into Ruby or Java into Ruby or Ruby into Java, given the appropriate tweaks), and optimizing things very well. But here's the problem: the JVM's default settings are tuned for optimizing *Java*, not for optimizing Ruby. In Java, a call from one method to another has exactly one hop. In JRuby's dynamic calls, it may take three or four hops, bouncing through CallSites and DynamicMethods to eventually get to the target. Since Hotspot only inlines up to 9 levels of calls by default, you can see we're eating up that budget very quickly. Add to that the fact that Java to Java calls have no intervening code to eat up bytecode size thresholds, and we're essentially only getting a fraction of the optimization potential out of Hotspot that a "simpler" language like Java does.Now consider the dynamically-optimized case. We've eliminated both the CallSite and the DynamicMethod from most calls, even if we have to insert a bit of guard code to do it. That means nine levels of Ruby calls can inline, or nine levels of logic from Ruby to core methods or Ruby to Java. We've now given Hotspot a much better picture of the system to optimize. We've also eliminated some of the background noise of doing a Ruby call, like updating rarely-used data structures. We'll need to ensure backtraces come out in some usable form, but at least we're not double-tracking them.The bottom line here is that instead of doing all the much harder work of adding inlining to our own compiler, all we need to do is let the JVM do it, and we benefit from the years of work that have gone into current JVMs' optimizing compilers. The best way to solve a hard problem is to get someone else to solve it for you, and the JVM does an excellent job of solving this particular hard problem. Instead of giving the JVM a fish *or* teaching it how to fish, we're just giving it a map of the lake; it's already an expert fisherman.There's also another side to these optimizations: our continued maintenance of an interpreter is starting to pay off. Other JVM languages that don't have an interpreted mode have a much more difficult time doing any runtime optimization; simply put, it's really hard to swap out bytecode you've already loaded unless you're explicitly abstracting how that bytecode gets generated in the first place. In JRuby, where methods have always had to switch from interpreted to jitted at runtime, we're can can hop back and forth much more freely, optimizing along the way.That's all for now. Don't expect to download JRuby 1.6 in a few months and see everything be three times (or ten times! or 100 times!) faster. There's a lot of details we need to work out about when these optimizations will be safe, how users might be able to opt-in to "full" optimization if necessary, and whether we'll get things fast enough to start replacing core code with Ruby. But early results are very promising, and it's assured we'll ship some of this very soon. [Less]
|
|
Posted
over 15 years
ago
by
[email protected] (Charles Oliver Nutter)
I originally started to send this to the JRuby dev list and to the Ruboto list, but realized quickly that it might make a good blog post. Since I don't blog enough lately, here it is.I've been looking into better ways to precompile Ruby code to
... [More]
classes for deploy on Android devices. Normally, JRuby users can just jar up their .rb files and load and require them as though they were on the filesystem; JRuby finds, loads, and runs them just fine. This works well enough on Android, but since there's no way to generate bytecode at runtime, JRuby code that isn't precompiled must run interpreted forever...and run a bit slower than we'd like because of it. In order for Ruby to be a first-class language for Android development, we must make it possible to precompile Ruby code *completely* and bundle it up with the application. So this evening I spent some time making that possible.I have some good news, some bad news, and some good news. First, a bit of background into JRuby's compiler.What JRuby's Compiler ProducesJRuby's ahead-of-time compiler produces a single .class file per .rb file, to ease deployment and lookup of those files (and because I think it's ugly to always vomit out an unidentifiable class for every method body). This produces a nice 1:1 mapping between .rb and .class, but it comes at a cost: since those .class files are just "bags of methods" we need to bind those methods somehow. This usually happens at runtime, with JRuby generating a small "handle" class for every method as it is bound. So for a script like this:# foo.rbclass Foo def bar; endenddef hello; endYou will get one top-level class file when you AOT compile, and then two more class files are generated at runtime for the "handles" for methods "bar" and "hello". This provides the best possible performance for invocation, plus a nice 1:1 on-disk format...but it means we're still generating a lot of code at runtime.The other complication is that jrubyc normally outputs a .class file of the same name as the .rb file, to ease lookup of that .class file at runtime. So the main .class for the above script would be called "foo.class". The problem with this is that "foo.rb" may not always be loaded as "foo.rb". A user might load '../yum/../foo.rb' or some other peculiar path. As a result, the base name of the file is not enough to determine what class name to load. To solve this, I've introduced an alternate naming scheme that uses the SHA1 hash of the *actual* content of the file as the class name. So, for the above script, the resulting class would be named:ruby.jit.FILE_351347C9126659D4479558A2706DBC35E45D16D2While this isn't a pretty name, it does provide a way to locate the compiled version of a script universally, regardless of what path is used to load it.The Good NewsI've modified jrubyc (on master only...we need to talk about whether this should be a late addition to 1.5) to have a new --sha1 flag. As you might guess, this flag alters the compile process to generate the sha1-named class for each compiled file.~/projects/jruby ➔ jrubyc foo.rb Compiling foo.rb to class foo~/projects/jruby ➔ jrubyc --sha1 foo.rb Compiling foo.rb to class ruby.jit.FILE_351347C9126659D4479558A2706DBC35E45D16D2~/projects/jruby ➔ jruby -X+C -J-Djruby.jit.debug=true -e "require 'foo'"...found jitted code for ./foo.rb at class: ruby.jit.FILE_351347C9126659D4479558A2706DBC35E45D16D2...This is actually finding the foo.rb file, calculating its SHA1 hash, and then loading the .class file instead. So if you had a bunch of .rb code for an Android application and wanted to precompile it, you'd run this command to get the sha1 classes, and then include both the .rb file and the .class file in your application (the .rb file must be there because...you guessed it...we need to calculate the sha1 hash from its contents).To test this out, I actually ran jrubyc against the Ruby stdlib to produce a sha1 class for every .rb file:~/projects/jruby ➔ jrubyc -t /tmp --sha1 lib/ruby/1.8/Compiling all in '/Users/headius/projects/jruby/lib/ruby/1.8'...Compiling lib/ruby/1.8//abbrev.rb to class ruby.jit.FILE_4F30363F88066CC74555ABA5BE4B73FDE323BE1ACompiling lib/ruby/1.8//base64.rb to class ruby.jit.FILE_DD42170B797E34D082C952B92A19474E3FDF3FA2Compiling lib/ruby/1.8//benchmark.rb to class ruby.jit.FILE_0C42EBD7F248AF396DE7A70C0FBC31E9E8D233DE...Compiling lib/ruby/1.8//xsd/xmlparser/rexmlparser.rb to class ruby.jit.FILE_8B106B9E9F2F1768470A7A4E6BD1A36FC0859862Compiling lib/ruby/1.8//xsd/xmlparser/xmlparser.rb to class ruby.jit.FILE_AF51477EA5467822D8ADED37EEB5AB5D841E07D9Compiling lib/ruby/1.8//xsd/xmlparser/xmlscanner.rb to class ruby.jit.FILE_3203482AEE794F4B9D5448BF51935879B026092CThis produces 524 class files for 524 .rb files, just as it should, and running with forced compilation (-X+C) and jruby.jit.debug=true shows that it finds each class when loading anything from stdlib. That's a good start!What About the Handles?I mentioned above that we also generate, at runtime, a small handle class for every bound method in a given script. And again, since we can't generate bytecode on-device, we need a way to pregenerate all those handles.An hour's worth of work later, and jrubyc has a --handles flag that will additionally spit out all method handles for each script compiled. Here's our foo script compiled with --sha1 and --handles, along with the resulting .class files:~/projects/jruby ➔ jrubyc --sha1 --handles foo.rbCompiling foo.rb to class ruby.jit.FILE_351347C9126659D4479558A2706DBC35E45D16D2Generating direct handles for foo.rb~/projects/jruby ➔ ls ruby/jit/*351347*ruby/jit/FILE_351347C9126659D4479558A2706DBC35E45D16D2.class~/projects/jruby ➔ ls *351347*ruby_jit_FILE_351347C9126659D4479558A2706DBC35E45D16D2Invokermethod__1$RUBY$barFixed0.classruby_jit_FILE_351347C9126659D4479558A2706DBC35E45D16D2Invokermethod__2$RUBY$helloFixed0.classAnd sure enough, we can also see that these handles are being loaded instead of generated at runtime. So it's possible with these two options to *completely* precompile JRuby sources into .class files. Hooray!The Bad NewsMy next step was obviously to try to precompile and dex the entire Ruby standard library. That's 524 files, but how many method bodies? We'd need to generate a handle for each one of them.~/projects/jruby ➔ mkdir stdlib-compiled~/projects/jruby ➔ jrubyc --sha1 --handles -t stdlib-compiled/ lib/ruby/1.8/Compiling all in '/Users/headius/projects/jruby/lib/ruby/1.8'...Compiling lib/ruby/1.8//abbrev.rb to class ruby.jit.FILE_4F30363F88066CC74555ABA5BE4B73FDE323BE1AGenerating direct handles for lib/ruby/1.8//abbrev.rbCompiling lib/ruby/1.8//base64.rb to class ruby.jit.FILE_DD42170B797E34D082C952B92A19474E3FDF3FA2Generating direct handles for lib/ruby/1.8//base64.rb...Compiling lib/ruby/1.8//xsd/xmlparser/xmlparser.rb to class ruby.jit.FILE_AF51477EA5467822D8ADED37EEB5AB5D841E07D9Generating direct handles for lib/ruby/1.8//xsd/xmlparser/xmlparser.rbCompiling lib/ruby/1.8//xsd/xmlparser/xmlscanner.rb to class ruby.jit.FILE_3203482AEE794F4B9D5448BF51935879B026092CGenerating direct handles for lib/ruby/1.8//xsd/xmlparser/xmlscanner.rb~/projects/jruby ➔ find stdlib-compiled/ -name \*.class | wc -l 8212Wowsers, that's a lot of method bodies..over 7500 of them. But of course this is the entire Ruby standard library, with code for network protocols, templating, xml parsing, soap, and so on. Now for the more frightening numbers: keeping in mind that .class is a pretty verbose file format, how big are all these class files?~/projects/jruby ➔ du -ks stdlib-compiled/ruby14008 stdlib-compiled/ruby~/projects/jruby ➔ du -ks stdlib-compiled/44784 stdlib-compiled/Yeeow! The standard library alone (without handles) produces 14MB of .class files, and with handles it goes up to a whopping 44MB of .class files! That seems a bit high, doesn't it? Especially considering that the .rb files add up to around 4.5MB?Well there's a few explanations for this. First off, the generated handles are rather small, around 2k each, but they each are probably 50% the exact same code. They're generated as separate handles primarily because the JVM will not inline the same loaded body of code through two different call paths, so we have to duplicate that logic repeatedly. Java 7 fixes some of this, but for now we're stuck. The handle classes also share almost identical constant pools, or in-file tables of strings. Many of the same characteristics apply to the compiled Ruby scripts, so the 44MB number is a bit larger than it needs to be.We can show a more realistic estimate of on-disk size by compressing the lot, first with normal "jar", and then with the "pack200" utility, which takes greater advantage of the .class format's intra-file redundancy:~/projects/jruby ➔ cd stdlib-compiled/~/projects/jruby/stdlib-compiled ➔ jar cf stdlib-compiled.jar .~/projects/jruby/stdlib-compiled ➔ pack200 stdlib-compiled.pack.gz stdlib-compiled.jar~/projects/jruby/stdlib-compiled ➔ ls -l stdlib-compiled.*-rw-r--r-- 1 headius staff 13424221 Apr 30 01:43 stdlib-compiled.jar-rw-r--r-- 1 headius staff 4051355 Apr 30 01:44 stdlib-compiled.pack.gzNow we're seeing more reasonable numbers. A 13MB jar file is still pretty large, but it's not bad considering we started with 44MB of .class files. The packed size is even better: only 4MB for a completely-compiled Ruby standard library, and ultimately *smaller* than the original sources.So what's the bad news? It obviously wasn't the size, since I just showed that was a red herring. The bad news is when we try to dex this jar.The "dx" ToolThe Android SDK ships with a tool called "dx" which gets used at build time to translate Java bytecode (in .class files, .jar files, etc) to Dalvik bytecode (resulting in a .dex file. Along the way it optimizes the code, tidies up redundancies, and basically makes it as clean and compact as possible for distribution to an Android device. Once on the device, the Dalvik bytecode gets immediately compiled into whatever the native processor runs, so the dex data needs to be as clean and optimized as possible.Every Java application shipped for Android must pass through dx in some way, so my next step was to try to "dex" the compiled standard library:~/projects/jruby/out ➔ ../../android-sdk-mac_86/platforms/android-7/tools/dx --dex --verbose --positions=none --no-locals --output=stdlib-compiled.dex stdlib-compiled.jar processing archive stdlib-compiled.jar...ignored resource META-INF/MANIFEST.MFprocessing ruby/jit/FILE_003796EE1C0C24540DF7239B8197C183BC7017BB.class...processing ruby/jit/FILE_00499F5FE29ED8EDB63965B0F65B19CFE994D120.class......processing ruby_jit_FILE_FEF23DE8CDA5B9BD9D880CBC08D3249158379E58Invokermethod__5$RUBY$run_suiteFixed0.class...processing ruby_jit_FILE_FEF23DE8CDA5B9BD9D880CBC08D3249158379E58Invokermethod__6$RUBY$create_resultFixed0.class...trouble writing output: format == nullUh-oh, that doesn't look good. What happened?Well it turns out that the Ruby standard library *plus* all the handles needed to bind it is too much for the current dex file format It's a known issue that similarly bit Maciek Makowski (reported of the linked bug) when he tried to dex both the Scala compiler and the Scala base set of libraries in one go. And similar to his case, I was able to successfully dex *either* the precompiled stdlib *or* the generated handles...but not both at the same time.What Can We Do?It appears that for the moment, it's not going to be possible to completely precompile the entire Ruby standard library. But there's ways around that.First off, probably no application on the planet needs the entire standard library, so we can easily just include the files needed for a given app. That may be enough to cut the size down tremendously. It's also perfectly possible to build a very complicated Ruby application for Android that will easily fit into the current dex format; I doubt most mobile applications would result in 4.5MB of uncompressed .rb source. So the added --sha1 and --handle features will be immediately useful for Android development.Secondly, I've been planning on adding a different way to bind methods that doesn't require a class file per method. I would probably generate a large switch for each .rb file and then bind the methods numerically, so only a single additional class (and only a few methods in that class) would be needed to bind an entire compiled .rb script. This issue with dex will force me to finally do that.And lastly, there's a bit more good news. Remember that the packed size of the entire standard library plus handles was around 4MB? Here's the sizes of the dex'ed standard library and handles:~/projects/jruby ➔ ls -l *.dex-rw-r--r-- 1 headius staff 3718340 Apr 30 00:57 stdlib-compiled-solo.dex-rw-r--r-- 1 headius staff 8656300 Apr 30 00:52 stdlib-compiled-handles.dex~/projects/jruby ➔ jar cf blah.apk *.dex~/projects/jruby ➔ ls -l blah.apk -rw-r--r-- 1 headius staff 3179625 Apr 30 02:01 blah.apkOnce dex has worked its magic against our sources, we're now down to 3.1MB of compressed code...a pretty good size for the entire Ruby standard library plus 7500+ noisy, repetitive handles. We're definitely within reach of making full Ruby development for Android a reality. [Less]
|
|
Posted
over 15 years
ago
by
[email protected] (Charles Oliver Nutter)
I originally started to send this to the JRuby dev list and to the Ruboto list, but realized quickly that it might make a good blog post. Since I don't blog enough lately, here it is.I've been looking into better ways to precompile Ruby code to
... [More]
classes for deploy on Android devices. Normally, JRuby users can just jar up their .rb files and load and require them as though they were on the filesystem; JRuby finds, loads, and runs them just fine. This works well enough on Android, but since there's no way to generate bytecode at runtime, JRuby code that isn't precompiled must run interpreted forever...and run a bit slower than we'd like because of it. In order for Ruby to be a first-class language for Android development, we must make it possible to precompile Ruby code *completely* and bundle it up with the application. So this evening I spent some time making that possible.I have some good news, some bad news, and some good news. First, a bit of background into JRuby's compiler.What JRuby's Compiler ProducesJRuby's ahead-of-time compiler produces a single .class file per .rb file, to ease deployment and lookup of those files (and because I think it's ugly to always vomit out an unidentifiable class for every method body). This produces a nice 1:1 mapping between .rb and .class, but it comes at a cost: since those .class files are just "bags of methods" we need to bind those methods somehow. This usually happens at runtime, with JRuby generating a small "handle" class for every method as it is bound. So for a script like this:# foo.rbclass Foo def bar; endenddef hello; endYou will get one top-level class file when you AOT compile, and then two more class files are generated at runtime for the "handles" for methods "bar" and "hello". This provides the best possible performance for invocation, plus a nice 1:1 on-disk format...but it means we're still generating a lot of code at runtime.The other complication is that jrubyc normally outputs a .class file of the same name as the .rb file, to ease lookup of that .class file at runtime. So the main .class for the above script would be called "foo.class". The problem with this is that "foo.rb" may not always be loaded as "foo.rb". A user might load '../yum/../foo.rb' or some other peculiar path. As a result, the base name of the file is not enough to determine what class name to load. To solve this, I've introduced an alternate naming scheme that uses the SHA1 hash of the *actual* content of the file as the class name. So, for the above script, the resulting class would be named:ruby.jit.FILE_351347C9126659D4479558A2706DBC35E45D16D2While this isn't a pretty name, it does provide a way to locate the compiled version of a script universally, regardless of what path is used to load it.The Good NewsI've modified jrubyc (on master only...we need to talk about whether this should be a late addition to 1.5) to have a new --sha1 flag. As you might guess, this flag alters the compile process to generate the sha1-named class for each compiled file.~/projects/jruby ➔ jrubyc foo.rb Compiling foo.rb to class foo~/projects/jruby ➔ jrubyc --sha1 foo.rb Compiling foo.rb to class ruby.jit.FILE_351347C9126659D4479558A2706DBC35E45D16D2~/projects/jruby ➔ jruby -X+C -J-Djruby.jit.debug=true -e "require 'foo'"...found jitted code for ./foo.rb at class: ruby.jit.FILE_351347C9126659D4479558A2706DBC35E45D16D2...This is actually finding the foo.rb file, calculating its SHA1 hash, and then loading the .class file instead. So if you had a bunch of .rb code for an Android application and wanted to precompile it, you'd run this command to get the sha1 classes, and then include both the .rb file and the .class file in your application (the .rb file must be there because...you guessed it...we need to calculate the sha1 hash from its contents).To test this out, I actually ran jrubyc against the Ruby stdlib to produce a sha1 class for every .rb file:~/projects/jruby ➔ jrubyc -t /tmp --sha1 lib/ruby/1.8/Compiling all in '/Users/headius/projects/jruby/lib/ruby/1.8'...Compiling lib/ruby/1.8//abbrev.rb to class ruby.jit.FILE_4F30363F88066CC74555ABA5BE4B73FDE323BE1ACompiling lib/ruby/1.8//base64.rb to class ruby.jit.FILE_DD42170B797E34D082C952B92A19474E3FDF3FA2Compiling lib/ruby/1.8//benchmark.rb to class ruby.jit.FILE_0C42EBD7F248AF396DE7A70C0FBC31E9E8D233DE...Compiling lib/ruby/1.8//xsd/xmlparser/rexmlparser.rb to class ruby.jit.FILE_8B106B9E9F2F1768470A7A4E6BD1A36FC0859862Compiling lib/ruby/1.8//xsd/xmlparser/xmlparser.rb to class ruby.jit.FILE_AF51477EA5467822D8ADED37EEB5AB5D841E07D9Compiling lib/ruby/1.8//xsd/xmlparser/xmlscanner.rb to class ruby.jit.FILE_3203482AEE794F4B9D5448BF51935879B026092CThis produces 524 class files for 524 .rb files, just as it should, and running with forced compilation (-X+C) and jruby.jit.debug=true shows that it finds each class when loading anything from stdlib. That's a good start!What About the Handles?I mentioned above that we also generate, at runtime, a small handle class for every bound method in a given script. And again, since we can't generate bytecode on-device, we need a way to pregenerate all those handles.An hour's worth of work later, and jrubyc has a --handles flag that will additionally spit out all method handles for each script compiled. Here's our foo script compiled with --sha1 and --handles, along with the resulting .class files:~/projects/jruby ➔ jrubyc --sha1 --handles foo.rbCompiling foo.rb to class ruby.jit.FILE_351347C9126659D4479558A2706DBC35E45D16D2Generating direct handles for foo.rb~/projects/jruby ➔ ls ruby/jit/*351347*ruby/jit/FILE_351347C9126659D4479558A2706DBC35E45D16D2.class~/projects/jruby ➔ ls *351347*ruby_jit_FILE_351347C9126659D4479558A2706DBC35E45D16D2Invokermethod__1$RUBY$barFixed0.classruby_jit_FILE_351347C9126659D4479558A2706DBC35E45D16D2Invokermethod__2$RUBY$helloFixed0.classAnd sure enough, we can also see that these handles are being loaded instead of generated at runtime. So it's possible with these two options to *completely* precompile JRuby sources into .class files. Hooray!The Bad NewsMy next step was obviously to try to precompile and dex the entire Ruby standard library. That's 524 files, but how many method bodies? We'd need to generate a handle for each one of them.~/projects/jruby ➔ mkdir stdlib-compiled~/projects/jruby ➔ jrubyc --sha1 --handles -t stdlib-compiled/ lib/ruby/1.8/Compiling all in '/Users/headius/projects/jruby/lib/ruby/1.8'...Compiling lib/ruby/1.8//abbrev.rb to class ruby.jit.FILE_4F30363F88066CC74555ABA5BE4B73FDE323BE1AGenerating direct handles for lib/ruby/1.8//abbrev.rbCompiling lib/ruby/1.8//base64.rb to class ruby.jit.FILE_DD42170B797E34D082C952B92A19474E3FDF3FA2Generating direct handles for lib/ruby/1.8//base64.rb...Compiling lib/ruby/1.8//xsd/xmlparser/xmlparser.rb to class ruby.jit.FILE_AF51477EA5467822D8ADED37EEB5AB5D841E07D9Generating direct handles for lib/ruby/1.8//xsd/xmlparser/xmlparser.rbCompiling lib/ruby/1.8//xsd/xmlparser/xmlscanner.rb to class ruby.jit.FILE_3203482AEE794F4B9D5448BF51935879B026092CGenerating direct handles for lib/ruby/1.8//xsd/xmlparser/xmlscanner.rb~/projects/jruby ➔ find stdlib-compiled/ -name \*.class | wc -l 8212Wowsers, that's a lot of method bodies..over 7500 of them. But of course this is the entire Ruby standard library, with code for network protocols, templating, xml parsing, soap, and so on. Now for the more frightening numbers: keeping in mind that .class is a pretty verbose file format, how big are all these class files?~/projects/jruby ➔ du -ks stdlib-compiled/ruby14008 stdlib-compiled/ruby~/projects/jruby ➔ du -ks stdlib-compiled/44784 stdlib-compiled/Yeeow! The standard library alone (without handles) produces 14MB of .class files, and with handles it goes up to a whopping 44MB of .class files! That seems a bit high, doesn't it? Especially considering that the .rb files add up to around 4.5MB?Well there's a few explanations for this. First off, the generated handles are rather small, around 2k each, but they each are probably 50% the exact same code. They're generated as separate handles primarily because the JVM will not inline the same loaded body of code through two different call paths, so we have to duplicate that logic repeatedly. Java 7 fixes some of this, but for now we're stuck. The handle classes also share almost identical constant pools, or in-file tables of strings. Many of the same characteristics apply to the compiled Ruby scripts, so the 44MB number is a bit larger than it needs to be.We can show a more realistic estimate of on-disk size by compressing the lot, first with normal "jar", and then with the "pack200" utility, which takes greater advantage of the .class format's intra-file redundancy:~/projects/jruby ➔ cd stdlib-compiled/~/projects/jruby/stdlib-compiled ➔ jar cf stdlib-compiled.jar .~/projects/jruby/stdlib-compiled ➔ pack200 stdlib-compiled.pack.gz stdlib-compiled.jar~/projects/jruby/stdlib-compiled ➔ ls -l stdlib-compiled.*-rw-r--r-- 1 headius staff 13424221 Apr 30 01:43 stdlib-compiled.jar-rw-r--r-- 1 headius staff 4051355 Apr 30 01:44 stdlib-compiled.pack.gzNow we're seeing more reasonable numbers. A 13MB jar file is still pretty large, but it's not bad considering we started with 44MB of .class files. The packed size is even better: only 4MB for a completely-compiled Ruby standard library, and ultimately *smaller* than the original sources.So what's the bad news? It obviously wasn't the size, since I just showed that was a red herring. The bad news is when we try to dex this jar.The "dx" ToolThe Android SDK ships with a tool called "dx" which gets used at build time to translate Java bytecode (in .class files, .jar files, etc) to Dalvik bytecode (resulting in a .dex file. Along the way it optimizes the code, tidies up redundancies, and basically makes it as clean and compact as possible for distribution to an Android device. Once on the device, the Dalvik bytecode gets immediately compiled into whatever the native processor runs, so the dex data needs to be as clean and optimized as possible.Every Java application shipped for Android must pass through dx in some way, so my next step was to try to "dex" the compiled standard library:~/projects/jruby/out ➔ ../../android-sdk-mac_86/platforms/android-7/tools/dx --dex --verbose --positions=none --no-locals --output=stdlib-compiled.dex stdlib-compiled.jar processing archive stdlib-compiled.jar...ignored resource META-INF/MANIFEST.MFprocessing ruby/jit/FILE_003796EE1C0C24540DF7239B8197C183BC7017BB.class...processing ruby/jit/FILE_00499F5FE29ED8EDB63965B0F65B19CFE994D120.class......processing ruby_jit_FILE_FEF23DE8CDA5B9BD9D880CBC08D3249158379E58Invokermethod__5$RUBY$run_suiteFixed0.class...processing ruby_jit_FILE_FEF23DE8CDA5B9BD9D880CBC08D3249158379E58Invokermethod__6$RUBY$create_resultFixed0.class...trouble writing output: format == nullUh-oh, that doesn't look good. What happened?Well it turns out that the Ruby standard library *plus* all the handles needed to bind it is too much for the current dex file format It's a known issue that similarly bit Maciek Makowski (reported of the linked bug) when he tried to dex both the Scala compiler and the Scala base set of libraries in one go. And similar to his case, I was able to successfully dex *either* the precompiled stdlib *or* the generated handles...but not both at the same time.What Can We Do?It appears that for the moment, it's not going to be possible to completely precompile the entire Ruby standard library. But there's ways around that.First off, probably no application on the planet needs the entire standard library, so we can easily just include the files needed for a given app. That may be enough to cut the size down tremendously. It's also perfectly possible to build a very complicated Ruby application for Android that will easily fit into the current dex format; I doubt most mobile applications would result in 4.5MB of uncompressed .rb source. So the added --sha1 and --handle features will be immediately useful for Android development.Secondly, I've been planning on adding a different way to bind methods that doesn't require a class file per method. I would probably generate a large switch for each .rb file and then bind the methods numerically, so only a single additional class (and only a few methods in that class) would be needed to bind an entire compiled .rb script. This issue with dex will force me to finally do that.And lastly, there's a bit more good news. Remember that the packed size of the entire standard library plus handles was around 4MB? Here's the sizes of the dex'ed standard library and handles:~/projects/jruby ➔ ls -l *.dex-rw-r--r-- 1 headius staff 3718340 Apr 30 00:57 stdlib-compiled-solo.dex-rw-r--r-- 1 headius staff 8656300 Apr 30 00:52 stdlib-compiled-handles.dex~/projects/jruby ➔ jar cf blah.apk *.dex~/projects/jruby ➔ ls -l blah.apk -rw-r--r-- 1 headius staff 3179625 Apr 30 02:01 blah.apkOnce dex has worked its magic against our sources, we're now down to 3.1MB of compressed code...a pretty good size for the entire Ruby standard library plus 7500+ noisy, repetitive handles. We're definitely within reach of making full Ruby development for Android a reality. [Less]
|
|
Posted
over 15 years
ago
by
[email protected] (Charles Oliver Nutter)
I originally started to send this to the JRuby dev list and to the Ruboto list, but realized quickly that it might make a good blog post. Since I don't blog enough lately, here it is.I've been looking into better ways to precompile Ruby code to
... [More]
classes for deploy on Android devices. Normally, JRuby users can just jar up their .rb files and load and require them as though they were on the filesystem; JRuby finds, loads, and runs them just fine. This works well enough on Android, but since there's no way to generate bytecode at runtime, JRuby code that isn't precompiled must run interpreted forever...and run a bit slower than we'd like because of it. In order for Ruby to be a first-class language for Android development, we must make it possible to precompile Ruby code *completely* and bundle it up with the application. So this evening I spent some time making that possible.I have some good news, some bad news, and some good news. First, a bit of background into JRuby's compiler.What JRuby's Compiler ProducesJRuby's ahead-of-time compiler produces a single .class file per .rb file, to ease deployment and lookup of those files (and because I think it's ugly to always vomit out an unidentifiable class for every method body). This produces a nice 1:1 mapping between .rb and .class, but it comes at a cost: since those .class files are just "bags of methods" we need to bind those methods somehow. This usually happens at runtime, with JRuby generating a small "handle" class for every method as it is bound. So for a script like this:# foo.rbclass Foo def bar; endenddef hello; endYou will get one top-level class file when you AOT compile, and then two more class files are generated at runtime for the "handles" for methods "bar" and "hello". This provides the best possible performance for invocation, plus a nice 1:1 on-disk format...but it means we're still generating a lot of code at runtime.The other complication is that jrubyc normally outputs a .class file of the same name as the .rb file, to ease lookup of that .class file at runtime. So the main .class for the above script would be called "foo.class". The problem with this is that "foo.rb" may not always be loaded as "foo.rb". A user might load '../yum/../foo.rb' or some other peculiar path. As a result, the base name of the file is not enough to determine what class name to load. To solve this, I've introduced an alternate naming scheme that uses the SHA1 hash of the *actual* content of the file as the class name. So, for the above script, the resulting class would be named:ruby.jit.FILE_351347C9126659D4479558A2706DBC35E45D16D2While this isn't a pretty name, it does provide a way to locate the compiled version of a script universally, regardless of what path is used to load it.The Good NewsI've modified jrubyc (on master only...we need to talk about whether this should be a late addition to 1.5) to have a new --sha1 flag. As you might guess, this flag alters the compile process to generate the sha1-named class for each compiled file.~/projects/jruby ➔ jrubyc foo.rb Compiling foo.rb to class foo~/projects/jruby ➔ jrubyc --sha1 foo.rb Compiling foo.rb to class ruby.jit.FILE_351347C9126659D4479558A2706DBC35E45D16D2~/projects/jruby ➔ jruby -X+C -J-Djruby.jit.debug=true -e "require 'foo'"...found jitted code for ./foo.rb at class: ruby.jit.FILE_351347C9126659D4479558A2706DBC35E45D16D2...This is actually finding the foo.rb file, calculating its SHA1 hash, and then loading the .class file instead. So if you had a bunch of .rb code for an Android application and wanted to precompile it, you'd run this command to get the sha1 classes, and then include both the .rb file and the .class file in your application (the .rb file must be there because...you guessed it...we need to calculate the sha1 hash from its contents).To test this out, I actually ran jrubyc against the Ruby stdlib to produce a sha1 class for every .rb file:~/projects/jruby ➔ jrubyc -t /tmp --sha1 lib/ruby/1.8/Compiling all in '/Users/headius/projects/jruby/lib/ruby/1.8'...Compiling lib/ruby/1.8//abbrev.rb to class ruby.jit.FILE_4F30363F88066CC74555ABA5BE4B73FDE323BE1ACompiling lib/ruby/1.8//base64.rb to class ruby.jit.FILE_DD42170B797E34D082C952B92A19474E3FDF3FA2Compiling lib/ruby/1.8//benchmark.rb to class ruby.jit.FILE_0C42EBD7F248AF396DE7A70C0FBC31E9E8D233DE...Compiling lib/ruby/1.8//xsd/xmlparser/rexmlparser.rb to class ruby.jit.FILE_8B106B9E9F2F1768470A7A4E6BD1A36FC0859862Compiling lib/ruby/1.8//xsd/xmlparser/xmlparser.rb to class ruby.jit.FILE_AF51477EA5467822D8ADED37EEB5AB5D841E07D9Compiling lib/ruby/1.8//xsd/xmlparser/xmlscanner.rb to class ruby.jit.FILE_3203482AEE794F4B9D5448BF51935879B026092CThis produces 524 class files for 524 .rb files, just as it should, and running with forced compilation (-X+C) and jruby.jit.debug=true shows that it finds each class when loading anything from stdlib. That's a good start!What About the Handles?I mentioned above that we also generate, at runtime, a small handle class for every bound method in a given script. And again, since we can't generate bytecode on-device, we need a way to pregenerate all those handles.An hour's worth of work later, and jrubyc has a --handles flag that will additionally spit out all method handles for each script compiled. Here's our foo script compiled with --sha1 and --handles, along with the resulting .class files:~/projects/jruby ➔ jrubyc --sha1 --handles foo.rbCompiling foo.rb to class ruby.jit.FILE_351347C9126659D4479558A2706DBC35E45D16D2Generating direct handles for foo.rb~/projects/jruby ➔ ls ruby/jit/*351347*ruby/jit/FILE_351347C9126659D4479558A2706DBC35E45D16D2.class~/projects/jruby ➔ ls *351347*ruby_jit_FILE_351347C9126659D4479558A2706DBC35E45D16D2Invokermethod__1$RUBY$barFixed0.classruby_jit_FILE_351347C9126659D4479558A2706DBC35E45D16D2Invokermethod__2$RUBY$helloFixed0.classAnd sure enough, we can also see that these handles are being loaded instead of generated at runtime. So it's possible with these two options to *completely* precompile JRuby sources into .class files. Hooray!The Bad NewsMy next step was obviously to try to precompile and dex the entire Ruby standard library. That's 524 files, but how many method bodies? We'd need to generate a handle for each one of them.~/projects/jruby ➔ mkdir stdlib-compiled~/projects/jruby ➔ jrubyc --sha1 --handles -t stdlib-compiled/ lib/ruby/1.8/Compiling all in '/Users/headius/projects/jruby/lib/ruby/1.8'...Compiling lib/ruby/1.8//abbrev.rb to class ruby.jit.FILE_4F30363F88066CC74555ABA5BE4B73FDE323BE1AGenerating direct handles for lib/ruby/1.8//abbrev.rbCompiling lib/ruby/1.8//base64.rb to class ruby.jit.FILE_DD42170B797E34D082C952B92A19474E3FDF3FA2Generating direct handles for lib/ruby/1.8//base64.rb...Compiling lib/ruby/1.8//xsd/xmlparser/xmlparser.rb to class ruby.jit.FILE_AF51477EA5467822D8ADED37EEB5AB5D841E07D9Generating direct handles for lib/ruby/1.8//xsd/xmlparser/xmlparser.rbCompiling lib/ruby/1.8//xsd/xmlparser/xmlscanner.rb to class ruby.jit.FILE_3203482AEE794F4B9D5448BF51935879B026092CGenerating direct handles for lib/ruby/1.8//xsd/xmlparser/xmlscanner.rb~/projects/jruby ➔ find stdlib-compiled/ -name \*.class | wc -l 8212Wowsers, that's a lot of method bodies..over 7500 of them. But of course this is the entire Ruby standard library, with code for network protocols, templating, xml parsing, soap, and so on. Now for the more frightening numbers: keeping in mind that .class is a pretty verbose file format, how big are all these class files?~/projects/jruby ➔ du -ks stdlib-compiled/ruby14008 stdlib-compiled/ruby~/projects/jruby ➔ du -ks stdlib-compiled/44784 stdlib-compiled/Yeeow! The standard library alone (without handles) produces 14MB of .class files, and with handles it goes up to a whopping 44MB of .class files! That seems a bit high, doesn't it? Especially considering that the .rb files add up to around 4.5MB?Well there's a few explanations for this. First off, the generated handles are rather small, around 2k each, but they each are probably 50% the exact same code. They're generated as separate handles primarily because the JVM will not inline the same loaded body of code through two different call paths, so we have to duplicate that logic repeatedly. Java 7 fixes some of this, but for now we're stuck. The handle classes also share almost identical constant pools, or in-file tables of strings. Many of the same characteristics apply to the compiled Ruby scripts, so the 44MB number is a bit larger than it needs to be.We can show a more realistic estimate of on-disk size by compressing the lot, first with normal "jar", and then with the "pack200" utility, which takes greater advantage of the .class format's intra-file redundancy:~/projects/jruby ➔ cd stdlib-compiled/~/projects/jruby/stdlib-compiled ➔ jar cf stdlib-compiled.jar .~/projects/jruby/stdlib-compiled ➔ pack200 stdlib-compiled.pack.gz stdlib-compiled.jar~/projects/jruby/stdlib-compiled ➔ ls -l stdlib-compiled.*-rw-r--r-- 1 headius staff 13424221 Apr 30 01:43 stdlib-compiled.jar-rw-r--r-- 1 headius staff 4051355 Apr 30 01:44 stdlib-compiled.pack.gzNow we're seeing more reasonable numbers. A 13MB jar file is still pretty large, but it's not bad considering we started with 44MB of .class files. The packed size is even better: only 4MB for a completely-compiled Ruby standard library, and ultimately *smaller* than the original sources.So what's the bad news? It obviously wasn't the size, since I just showed that was a red herring. The bad news is when we try to dex this jar.The "dx" ToolThe Android SDK ships with a tool called "dx" which gets used at build time to translate Java bytecode (in .class files, .jar files, etc) to Dalvik bytecode (resulting in a .dex file. Along the way it optimizes the code, tidies up redundancies, and basically makes it as clean and compact as possible for distribution to an Android device. Once on the device, the Dalvik bytecode gets immediately compiled into whatever the native processor runs, so the dex data needs to be as clean and optimized as possible.Every Java application shipped for Android must pass through dx in some way, so my next step was to try to "dex" the compiled standard library:~/projects/jruby/out ➔ ../../android-sdk-mac_86/platforms/android-7/tools/dx --dex --verbose --positions=none --no-locals --output=stdlib-compiled.dex stdlib-compiled.jar processing archive stdlib-compiled.jar...ignored resource META-INF/MANIFEST.MFprocessing ruby/jit/FILE_003796EE1C0C24540DF7239B8197C183BC7017BB.class...processing ruby/jit/FILE_00499F5FE29ED8EDB63965B0F65B19CFE994D120.class......processing ruby_jit_FILE_FEF23DE8CDA5B9BD9D880CBC08D3249158379E58Invokermethod__5$RUBY$run_suiteFixed0.class...processing ruby_jit_FILE_FEF23DE8CDA5B9BD9D880CBC08D3249158379E58Invokermethod__6$RUBY$create_resultFixed0.class...trouble writing output: format == nullUh-oh, that doesn't look good. What happened?Well it turns out that the Ruby standard library *plus* all the handles needed to bind it is too much for the current dex file format It's a known issue that similarly bit Maciek Makowski (reported of the linked bug) when he tried to dex both the Scala compiler and the Scala base set of libraries in one go. And similar to his case, I was able to successfully dex *either* the precompiled stdlib *or* the generated handles...but not both at the same time.What Can We Do?It appears that for the moment, it's not going to be possible to completely precompile the entire Ruby standard library. But there's ways around that.First off, probably no application on the planet needs the entire standard library, so we can easily just include the files needed for a given app. That may be enough to cut the size down tremendously. It's also perfectly possible to build a very complicated Ruby application for Android that will easily fit into the current dex format; I doubt most mobile applications would result in 4.5MB of uncompressed .rb source. So the added --sha1 and --handle features will be immediately useful for Android development.Secondly, I've been planning on adding a different way to bind methods that doesn't require a class file per method. I would probably generate a large switch for each .rb file and then bind the methods numerically, so only a single additional class (and only a few methods in that class) would be needed to bind an entire compiled .rb script. This issue with dex will force me to finally do that.And lastly, there's a bit more good news. Remember that the packed size of the entire standard library plus handles was around 4MB? Here's the sizes of the dex'ed standard library and handles:~/projects/jruby ➔ ls -l *.dex-rw-r--r-- 1 headius staff 3718340 Apr 30 00:57 stdlib-compiled-solo.dex-rw-r--r-- 1 headius staff 8656300 Apr 30 00:52 stdlib-compiled-handles.dex~/projects/jruby ➔ jar cf blah.apk *.dex~/projects/jruby ➔ ls -l blah.apk -rw-r--r-- 1 headius staff 3179625 Apr 30 02:01 blah.apkOnce dex has worked its magic against our sources, we're now down to 3.1MB of compressed code...a pretty good size for the entire Ruby standard library plus 7500+ noisy, repetitive handles. We're definitely within reach of making full Ruby development for Android a reality. [Less]
|
|
Posted
over 15 years
ago
by
[email protected] (Charles Oliver Nutter)
I originally started to send this to the JRuby dev list and to the Ruboto list, but realized quickly that it might make a good blog post. Since I don't blog enough lately, here it is.I've been looking into better ways to precompile Ruby code to
... [More]
classes for deploy on Android devices. Normally, JRuby users can just jar up their .rb files and load and require them as though they were on the filesystem; JRuby finds, loads, and runs them just fine. This works well enough on Android, but since there's no way to generate bytecode at runtime, JRuby code that isn't precompiled must run interpreted forever...and run a bit slower than we'd like because of it. In order for Ruby to be a first-class language for Android development, we must make it possible to precompile Ruby code *completely* and bundle it up with the application. So this evening I spent some time making that possible.I have some good news, some bad news, and some good news. First, a bit of background into JRuby's compiler.What JRuby's Compiler ProducesJRuby's ahead-of-time compiler produces a single .class file per .rb file, to ease deployment and lookup of those files (and because I think it's ugly to always vomit out an unidentifiable class for every method body). This produces a nice 1:1 mapping between .rb and .class, but it comes at a cost: since those .class files are just "bags of methods" we need to bind those methods somehow. This usually happens at runtime, with JRuby generating a small "handle" class for every method as it is bound. So for a script like this:# foo.rbclass Foo def bar; endenddef hello; endYou will get one top-level class file when you AOT compile, and then two more class files are generated at runtime for the "handles" for methods "bar" and "hello". This provides the best possible performance for invocation, plus a nice 1:1 on-disk format...but it means we're still generating a lot of code at runtime.The other complication is that jrubyc normally outputs a .class file of the same name as the .rb file, to ease lookup of that .class file at runtime. So the main .class for the above script would be called "foo.class". The problem with this is that "foo.rb" may not always be loaded as "foo.rb". A user might load '../yum/../foo.rb' or some other peculiar path. As a result, the base name of the file is not enough to determine what class name to load. To solve this, I've introduced an alternate naming scheme that uses the SHA1 hash of the *actual* content of the file as the class name. So, for the above script, the resulting class would be named:ruby.jit.FILE_351347C9126659D4479558A2706DBC35E45D16D2While this isn't a pretty name, it does provide a way to locate the compiled version of a script universally, regardless of what path is used to load it.The Good NewsI've modified jrubyc (on master only...we need to talk about whether this should be a late addition to 1.5) to have a new --sha1 flag. As you might guess, this flag alters the compile process to generate the sha1-named class for each compiled file.~/projects/jruby ➔ jrubyc foo.rb Compiling foo.rb to class foo~/projects/jruby ➔ jrubyc --sha1 foo.rb Compiling foo.rb to class ruby.jit.FILE_351347C9126659D4479558A2706DBC35E45D16D2~/projects/jruby ➔ jruby -X+C -J-Djruby.jit.debug=true -e "require 'foo'"...found jitted code for ./foo.rb at class: ruby.jit.FILE_351347C9126659D4479558A2706DBC35E45D16D2...This is actually finding the foo.rb file, calculating its SHA1 hash, and then loading the .class file instead. So if you had a bunch of .rb code for an Android application and wanted to precompile it, you'd run this command to get the sha1 classes, and then include both the .rb file and the .class file in your application (the .rb file must be there because...you guessed it...we need to calculate the sha1 hash from its contents).To test this out, I actually ran jrubyc against the Ruby stdlib to produce a sha1 class for every .rb file:~/projects/jruby ➔ jrubyc -t /tmp --sha1 lib/ruby/1.8/Compiling all in '/Users/headius/projects/jruby/lib/ruby/1.8'...Compiling lib/ruby/1.8//abbrev.rb to class ruby.jit.FILE_4F30363F88066CC74555ABA5BE4B73FDE323BE1ACompiling lib/ruby/1.8//base64.rb to class ruby.jit.FILE_DD42170B797E34D082C952B92A19474E3FDF3FA2Compiling lib/ruby/1.8//benchmark.rb to class ruby.jit.FILE_0C42EBD7F248AF396DE7A70C0FBC31E9E8D233DE...Compiling lib/ruby/1.8//xsd/xmlparser/rexmlparser.rb to class ruby.jit.FILE_8B106B9E9F2F1768470A7A4E6BD1A36FC0859862Compiling lib/ruby/1.8//xsd/xmlparser/xmlparser.rb to class ruby.jit.FILE_AF51477EA5467822D8ADED37EEB5AB5D841E07D9Compiling lib/ruby/1.8//xsd/xmlparser/xmlscanner.rb to class ruby.jit.FILE_3203482AEE794F4B9D5448BF51935879B026092CThis produces 524 class files for 524 .rb files, just as it should, and running with forced compilation (-X+C) and jruby.jit.debug=true shows that it finds each class when loading anything from stdlib. That's a good start!What About the Handles?I mentioned above that we also generate, at runtime, a small handle class for every bound method in a given script. And again, since we can't generate bytecode on-device, we need a way to pregenerate all those handles.An hour's worth of work later, and jrubyc has a --handles flag that will additionally spit out all method handles for each script compiled. Here's our foo script compiled with --sha1 and --handles, along with the resulting .class files:~/projects/jruby ➔ jrubyc --sha1 --handles foo.rbCompiling foo.rb to class ruby.jit.FILE_351347C9126659D4479558A2706DBC35E45D16D2Generating direct handles for foo.rb~/projects/jruby ➔ ls ruby/jit/*351347*ruby/jit/FILE_351347C9126659D4479558A2706DBC35E45D16D2.class~/projects/jruby ➔ ls *351347*ruby_jit_FILE_351347C9126659D4479558A2706DBC35E45D16D2Invokermethod__1$RUBY$barFixed0.classruby_jit_FILE_351347C9126659D4479558A2706DBC35E45D16D2Invokermethod__2$RUBY$helloFixed0.classAnd sure enough, we can also see that these handles are being loaded instead of generated at runtime. So it's possible with these two options to *completely* precompile JRuby sources into .class files. Hooray!The Bad NewsMy next step was obviously to try to precompile and dex the entire Ruby standard library. That's 524 files, but how many method bodies? We'd need to generate a handle for each one of them.~/projects/jruby ➔ mkdir stdlib-compiled~/projects/jruby ➔ jrubyc --sha1 --handles -t stdlib-compiled/ lib/ruby/1.8/Compiling all in '/Users/headius/projects/jruby/lib/ruby/1.8'...Compiling lib/ruby/1.8//abbrev.rb to class ruby.jit.FILE_4F30363F88066CC74555ABA5BE4B73FDE323BE1AGenerating direct handles for lib/ruby/1.8//abbrev.rbCompiling lib/ruby/1.8//base64.rb to class ruby.jit.FILE_DD42170B797E34D082C952B92A19474E3FDF3FA2Generating direct handles for lib/ruby/1.8//base64.rb...Compiling lib/ruby/1.8//xsd/xmlparser/xmlparser.rb to class ruby.jit.FILE_AF51477EA5467822D8ADED37EEB5AB5D841E07D9Generating direct handles for lib/ruby/1.8//xsd/xmlparser/xmlparser.rbCompiling lib/ruby/1.8//xsd/xmlparser/xmlscanner.rb to class ruby.jit.FILE_3203482AEE794F4B9D5448BF51935879B026092CGenerating direct handles for lib/ruby/1.8//xsd/xmlparser/xmlscanner.rb~/projects/jruby ➔ find stdlib-compiled/ -name \*.class | wc -l 8212Wowsers, that's a lot of method bodies..over 7500 of them. But of course this is the entire Ruby standard library, with code for network protocols, templating, xml parsing, soap, and so on. Now for the more frightening numbers: keeping in mind that .class is a pretty verbose file format, how big are all these class files?~/projects/jruby ➔ du -ks stdlib-compiled/ruby14008 stdlib-compiled/ruby~/projects/jruby ➔ du -ks stdlib-compiled/44784 stdlib-compiled/Yeeow! The standard library alone (without handles) produces 14MB of .class files, and with handles it goes up to a whopping 44MB of .class files! That seems a bit high, doesn't it? Especially considering that the .rb files add up to around 4.5MB?Well there's a few explanations for this. First off, the generated handles are rather small, around 2k each, but they each are probably 50% the exact same code. They're generated as separate handles primarily because the JVM will not inline the same loaded body of code through two different call paths, so we have to duplicate that logic repeatedly. Java 7 fixes some of this, but for now we're stuck. The handle classes also share almost identical constant pools, or in-file tables of strings. Many of the same characteristics apply to the compiled Ruby scripts, so the 44MB number is a bit larger than it needs to be.We can show a more realistic estimate of on-disk size by compressing the lot, first with normal "jar", and then with the "pack200" utility, which takes greater advantage of the .class format's intra-file redundancy:~/projects/jruby ➔ cd stdlib-compiled/~/projects/jruby/stdlib-compiled ➔ jar cf stdlib-compiled.jar .~/projects/jruby/stdlib-compiled ➔ pack200 stdlib-compiled.pack.gz stdlib-compiled.jar~/projects/jruby/stdlib-compiled ➔ ls -l stdlib-compiled.*-rw-r--r-- 1 headius staff 13424221 Apr 30 01:43 stdlib-compiled.jar-rw-r--r-- 1 headius staff 4051355 Apr 30 01:44 stdlib-compiled.pack.gzNow we're seeing more reasonable numbers. A 13MB jar file is still pretty large, but it's not bad considering we started with 44MB of .class files. The packed size is even better: only 4MB for a completely-compiled Ruby standard library, and ultimately *smaller* than the original sources.So what's the bad news? It obviously wasn't the size, since I just showed that was a red herring. The bad news is when we try to dex this jar.The "dx" ToolThe Android SDK ships with a tool called "dx" which gets used at build time to translate Java bytecode (in .class files, .jar files, etc) to Dalvik bytecode (resulting in a .dex file. Along the way it optimizes the code, tidies up redundancies, and basically makes it as clean and compact as possible for distribution to an Android device. Once on the device, the Dalvik bytecode gets immediately compiled into whatever the native processor runs, so the dex data needs to be as clean and optimized as possible.Every Java application shipped for Android must pass through dx in some way, so my next step was to try to "dex" the compiled standard library:~/projects/jruby/out ➔ ../../android-sdk-mac_86/platforms/android-7/tools/dx --dex --verbose --positions=none --no-locals --output=stdlib-compiled.dex stdlib-compiled.jar processing archive stdlib-compiled.jar...ignored resource META-INF/MANIFEST.MFprocessing ruby/jit/FILE_003796EE1C0C24540DF7239B8197C183BC7017BB.class...processing ruby/jit/FILE_00499F5FE29ED8EDB63965B0F65B19CFE994D120.class......processing ruby_jit_FILE_FEF23DE8CDA5B9BD9D880CBC08D3249158379E58Invokermethod__5$RUBY$run_suiteFixed0.class...processing ruby_jit_FILE_FEF23DE8CDA5B9BD9D880CBC08D3249158379E58Invokermethod__6$RUBY$create_resultFixed0.class...trouble writing output: format == nullUh-oh, that doesn't look good. What happened?Well it turns out that the Ruby standard library *plus* all the handles needed to bind it is too much for the current dex file format It's a known issue that similarly bit Maciek Makowski (reported of the linked bug) when he tried to dex both the Scala compiler and the Scala base set of libraries in one go. And similar to his case, I was able to successfully dex *either* the precompiled stdlib *or* the generated handles...but not both at the same time.What Can We Do?It appears that for the moment, it's not going to be possible to completely precompile the entire Ruby standard library. But there's ways around that.First off, probably no application on the planet needs the entire standard library, so we can easily just include the files needed for a given app. That may be enough to cut the size down tremendously. It's also perfectly possible to build a very complicated Ruby application for Android that will easily fit into the current dex format; I doubt most mobile applications would result in 4.5MB of uncompressed .rb source. So the added --sha1 and --handle features will be immediately useful for Android development.Secondly, I've been planning on adding a different way to bind methods that doesn't require a class file per method. I would probably generate a large switch for each .rb file and then bind the methods numerically, so only a single additional class (and only a few methods in that class) would be needed to bind an entire compiled .rb script. This issue with dex will force me to finally do that.And lastly, there's a bit more good news. Remember that the packed size of the entire standard library plus handles was around 4MB? Here's the sizes of the dex'ed standard library and handles:~/projects/jruby ➔ ls -l *.dex-rw-r--r-- 1 headius staff 3718340 Apr 30 00:57 stdlib-compiled-solo.dex-rw-r--r-- 1 headius staff 8656300 Apr 30 00:52 stdlib-compiled-handles.dex~/projects/jruby ➔ jar cf blah.apk *.dex~/projects/jruby ➔ ls -l blah.apk -rw-r--r-- 1 headius staff 3179625 Apr 30 02:01 blah.apkOnce dex has worked its magic against our sources, we're now down to 3.1MB of compressed code...a pretty good size for the entire Ruby standard library plus 7500+ noisy, repetitive handles. We're definitely within reach of making full Ruby development for Android a reality. [Less]
|
|
Posted
over 15 years
ago
by
[email protected] (Charles Oliver Nutter)
One of the most commonly used native extensions for Ruby is the Nokogiri XML API. Nokogiri wraps libxml and has a fair amount of C code that links directly against the Ruby C extension API...an API we don't support in JRuby (and won't, without a lot
... [More]
of community help).A bit over a year ago, the Nokogiri folks did us a big favor by creating an FFI version of Nokogiri that works surprisingly well; it's probably the most widely-used FFI-based Ruby library around. But the endgame for Nokogiri on JRuby has always been to get a pure-Java version. Not everyone is allowed to link native libraries on their Java platform of choice, and those that are often have trouble getting the right libxml versions installed. The Java version needs to happen.That day is very close.I spent a bit of time this weekend getting the Nokogiri "java" port running on my system, and the folks working on it have brought it almost to 100% passing. It's time to push it over the edge.Building and TestingHere's my process for getting it building. Let me know if this needs to be edited.Update: Added rake-compiler and hoe to gems you need to install and modified the git command-line for versions that don't automatically create a local tracking branch.1. Clone the Nokogiri repository and switch to the "java" branch~/projects ➔ git clone git://github.com/tenderlove/nokogiri.gitInitialized empty Git repository in /Users/headius/projects/nokogiri/.git/remote: Counting objects: 14767, done.remote: Compressing objects: 100% (3882/3882), done.remote: Total 14767 (delta 10482), reused 13969 (delta 9945)Receiving objects: 100% (14767/14767), 3.73 MiB | 742 KiB/s, done.Resolving deltas: 100% (10482/10482), done.~/projects ➔ cd nokogiri/~/projects/nokogiri ➔ git checkout -b java origin/javaBranch java set up to track remote branch java from origin.Switched to a new branch 'java'2. Install racc, rexical, rake-compiler, and hoe into Ruby (C Ruby, that is, since they also have extensions)~/projects/nokogiri ➔ sudo gem install racc rexical rake-compiler hoeBuilding native extensions. This could take a while...Successfully installed racc-1.4.6Successfully installed rexical-1.0.4Successfully installed rake-compiler-0.7.0Successfully installed hoe-2.6.04 gems installed3. Build the lexer and parser using C Ruby~/projects/nokogiri ➔ rake gem:dev:spec(in /Users/headius/projects/nokogiri)warning: couldn't activate the debugging plugin, skippingrake-compiler must be configured first to enable cross-compilation/usr/bin/racc -l -o lib/nokogiri/css/generated_parser.rb lib/nokogiri/css/parser.yrex --independent -o lib/nokogiri/css/generated_tokenizer.rb lib/nokogiri/css/tokenizer.rex4. Build the Java bits (using rake in JRuby)~/projects/nokogiri ➔ jruby -S rake java:build(in /Users/headius/projects/nokogiri)warning: couldn't activate the debugging plugin, skippingjavac -g -cp /Users/headius/projects/jruby/lib/jruby.jar:../../lib/nekohtml.jar:../../lib/nekodtd.jar:../../lib/xercesImpl.jar:../../lib/isorelax.jar:../../lib/jing.jar nokogiri/*.java nokogiri/internals/*.javajar cf ../../lib/nokogiri/nokogiri.jar nokogiri/*.class nokogiri/internals/*.class5. Run the tests (again with rake on JRuby)~/projects/nokogiri ➔ jruby -S rake test(in /Users/headius/projects/nokogiri)...full output...On my system, I get about 8 failures and 19 errors, out of 785 tests and 1657 assertions. We're very close!A few other useful tasks:
jruby -S rake java:clean_all wipes out the build Java stuff
jruby -S rake java:gem builds the Java gem, if you want to try installing it
Helping OutIf you'd like to help fix these bugs, there's a few ways to approach it.
Join the nokogiri-talk Google Group so you can communicate with others working on the port. The key folks right now are Yoko Harada and Sergio Arbeo (who did the original bulk of the work for GSoC 2009). I'm also poking at it a bit in my spare time.
Post to the group to let folks know you want to help. This will help avoid duplicated effort.
Pick tests that appear to be missing or incorrect Ruby logic, like "not implemented", nil results ("method blah not found for nil") or arity errors ("3 arguments for 2" kinds of things). These are often the simplest ones to fix.
Don't give up! We're almost there!
It would be great if we could have a 100% working Nokogiri Java port for JRuby 1.5 final this month. I hope to see you on the nokogiri-talk list! Feel free to comment here if you have questions about getting bootstrapped. [Less]
|