| April 1997 |
| Get FREE JW e-mail alerts |
Summary
One of the concessions Java programs make to natively compiled programs in other languages is reduced performance. However, there are techniques that you can use to narrow the performance difference. This article explores some low-level optimization techniques in Java and presents a benchmark applet to measure Java's performance. (5,900 words)
By Doug Bell
|
Mail this article to a friend |
ccording to the pioneering computer
scientist Donald Knuth,
"Premature optimization is the root
of all evil." Any article on optimization must start by pointing out
that there are usually more reasons not to optimize than to
optimize.
Before optimizing, you should carefully consider whether you need to optimize at all. Optimization in Java can be an elusive target since the execution environments vary. Using a better algorithm probably will yield a bigger performance increase than any amount of low-level optimizations and is more likely to deliver an improvement under all execution conditions. As a rule, high-level optimizations should be considered before doing low-level optimizations.
So why optimize?
If it's such a bad idea, why optimize at all? Well, in an ideal world
you wouldn't. But the reality is that sometimes the biggest problem
with a program is that it requires simply too many resources, and
these resources (memory, CPU cycles, network bandwidth, or a
combination) may be limited. Code fragments that occur multiple
times throughout a program are likely to be size-sensitive, while
code with many execution iterations may be speed-sensitive.
Make Java fast!
As an interpreted language with a compact bytecode, speed, or the
lack of it, is what most often pops up as a problem in Java. We'll
primarily look at how to make Java run faster rather than making it fit
into a smaller space -- although we'll point out where and how these
approaches affect memory or network bandwidth. The focus will be on
the core language rather than on the Java APIs.
By the way, one thing we won't discuss here is the use of native methods written in C or assembly. While using native methods can give the ultimate performance boost, it does so at the cost of Java's platform independence. It is possible to write both a Java version of a method and native versions for selected platforms; this leads to increased performance on some platforms without giving up the ability to run on all platforms. But this is all I'm going to say on the subject of replacing Java with C code. (See the Java Tip, "Write native methods" for more information on this topic.) Our focus in this article is on how to make Java fast.
90/10, 80/20, hut, hut, hike!
As a rule, 90% of a program's excution time is spent executing 10% of
the code. (Some people use the 80/20 rule, but my experience writing
and optimizing commercial games in several languages for the last 15
years has shown that the 90/10 formula is typical for
performance-hungry programs since few tasks tend to be performed with
great frequency.) Optimizing the other 90% of the program (where 10% of
the execution time was spent) has no noticeable effect on performance.
If you were able to make that 90% of the code execute twice as fast,
the program would only be 5% faster. So the first task in optimizing
code is to identify the 10% (frequently it is less than this) of the
program that consumes most of the execution time. This isn't always
where you expect it to be.
General optimization techniques
There are several common optimization techniques that apply
regardless of the language being used. Some of these techniques, such
as global register allocation, are sophisticated strategies to
allocate machine resources (for example, CPU registers) and don't
apply to Java bytecodes. We'll focus on the techniques that basically
involve restructuring code and substituting equivalent operations
within a method.
Strength reduction
Strength reduction occurs when an operation is replaced by an
equivalent operation that executes faster. The most common example of
strength reduction is using the shift operator to multiply and divide
integers by a power of 2. For example, x >> 2 can
be used in place of x / 4, and x << 1
replaces x * 2.
Common sub expression elimination
Common sub expression elimination removes redundant calculations.
Instead of writing
double x = d * (lim / max) * sx;
double y = d * (lim / max) * sy;
the common sub expression is calculated once and used for both calculations:
double depth = d * (lim / max);
double x = depth * sx;
double y = depth * sy;
Code motion
Code motion moves code that performs an operation or calculates an
expression whose result doesn't change, or is invariant.
The code is moved so that it only executes when the result may
change, rather than executing each time the result is required. This
is most common with loops, but it can also involve code repeated on
each invocation of a method. The following is an example of invariant
code motion in a loop:
for (int i = 0; i < x.length; i++)
x[i] *= Math.PI * Math.cos(y);
becomes
double picosy = Math.PI * Math.cos(y);
for (int i = 0; i < x.length; i++)
x[i] *= picosy;
Unrolling loops
Unrolling loops reduces the overhead of loop control code by
performing more than one operation each time through the loop, and
consequently executing fewer iterations. Working from the previous
example, if we know that the length of x[] is always a
multiple of two, we might rewrite the loop as:
double picosy = Math.PI * Math.cos(y);
for (int i = 0; i < x.length; i += 2) {
x[i] *= picosy;
x[i+1] *= picosy;
}
In practice, unrolling loops such as this -- in which the value of
the loop index is used within the loop and must be separately
incremented -- does not yield an appreciable speed increase in
interpreted Java because the bytecodes lack instructions to
efficiently combine the "+1" into the array index.
All of the optimization tips in this article embody one or more of the general techniques listed above.
Putting the compiler to work
Modern C and Fortran compilers produce highly optimized code. C++
compilers generally produce less efficient code, but are still well
along the path to producing optimal code. All of these compilers have
gone through many generations under the influence of strong market
competition and have become finely honed tools for squeezing every
last drop of performance out of ordinary code. They almost certainly
use all of the general optimization techniques presented above. But
there are still plenty of tricks left for making compilers generate
efficient code.
javac, JITs, and native code compilers
The level of optimization that javac performs when
compiling code at this point is minimal. It defaults to doing the
following:
i = (10 *10) compiles to
i = 100.
goto bytecodes are avoided.
if(false) i = 1.
The level of optimization javac provides should improve, probably dramatically, as the language matures and compiler vendors begin to compete in earnest on the basis of code generation. Java just now is getting second-generation compilers.
Then there are just-in-time (JIT) compilers that convert Java bytecodes into native code at run time. Several are already available, and while they can increase the execution speed of your program dramatically, the level of optimization they can perform is constrained because optimization occurs at run time. A JIT compiler is more concerned with generating the code quickly than with generating the quickest code.
Native code compilers that compile Java directly to native code should offer the greatest performance but at the cost of platform independence. Fortunately, many of the tricks presented here will be achieved by future compilers, but for now it takes a little work to get the most out the compiler.
javac does offer one performance option you can
enable: invoking the -O option to cause the compiler to
inline certain method calls:
javac -O MyClass
Inlining a method call inserts the code for the method directly into
the code making the method call. This eliminates the overhead of the
method call. For a small method this overhead can represent a significant
percentage of its execution time. Note that only methods declared as
either private, static, or final
can be considered for inlining, because only these methods are
statically resolved by the compiler. Also, synchronized
methods won't be inlined. The compiler will only inline small methods
typically consisting of only one or two lines of code.
Unfortunately, the 1.0 versions of the javac compiler have a
bug that will generate code that can't pass the bytecode verifier when
the -O option is used. This has been fixed in JDK 1.1.
(The bytecode verifier checks code before it is allowed to run to make
sure that it doesn't violate any Java rules.) It will inline methods
that reference class members inaccessible to the calling class. For
example, if the following classes are compiled together using the
-O option
class A {
private static int x = 10;
public static void getX () { return x; }
}
class B {
int y = A.getX();
}
the call to A.getX() in class B will get inlined in class B as if B had been written as:
class B {
int y = A.x;
}
However, this will cause the generation of bytecodes to access the
private A.x variable that will be generated in B's code. This code will
execute fine, but since it violates Java's access restrictions, it will
get flagged by the verifier with an IllegalAccessError the
first time the code is executed.
This bug doesn't make the -O option useless, but you do
have to be careful about how you use it. If invoked on a single class, it can
inline certain method calls within the class without risk. Several
classes can be inlined together as long as there aren't any potential
access restrictions. And some code (such as applications) isn't
subjected to the bytecode verifier. You can ignore the bug if you know
your code will only execute without being subjected to the verifier.
For additional information, see my javac-O FAQ.
Profilers
Fortunately, the JDK comes with a built-in profiler to help identify
where time is spent in a program. It will keep track of the time spent
in each routine and write the information to the file
java.prof. To run the profiler, use the -prof
option when invoking the Java interpreter:
java -prof myClass
Or for use with an applet:
java -prof sun.applet.AppletViewer myApplet.html
There are a few caveats for using the profiler. The profiler output
is not particularly easy to decipher. Also, in JDK 1.0.2 it truncates
the method names to 30 characters, so it may not be possible to
distinguish some methods. Unfortunately, with the Mac there is no
means to invoke the profiler, so Mac users are out of luck. On top of
all this, Sun's Java document page (see Resources) no longer includes the documentation
for the -prof option). However, if your platform does
support the -prof option, either Vladimir Bulatov's
HyperProf or Greg White's ProfileViewer can be used to help interpret
the results (see Resources).
It is also possible to "profile" code by inserting explicit timing into the code:
long start = System.currentTimeMillis();
// do operation to be timed here
long time = System.currentTimeMillis() - start;
System.currentTimeMillis() returns time in 1/1000ths of
a second. However, some systems, such as a Windows PC, have a system
timer with less (much less) resolution than a 1/1000th of a second.
Even 1/1000th of a second isn't long enough to accurately time many
operations. In these cases, or on systems with low-resolution timers,
it may be necessary to time how long it takes to repeat the operation
n times and then divide the total time by n to get
the actual time. Even when profiling is available, this technique can
be useful for timing a specific task or operation.
Here are a few closing notes on profiling:
The Benchmark applet
The Benchmark applet measures the time it takes to do an operation
thousands (or even millions) of times, subtracts the time spent doing
operations other than the test (such as the loop overhead), and then
uses this information to compute how long each operation took. It runs
each test for approximately one second. In an attempt to eliminate
random delays from other operations the computer may perform during a
test, it runs each test three times and uses the best result. It also
attempts to eliminate garbage collection as a factor in the tests.
Because of this, the more memory available to the benchmark, the more
accurate the benchmark results are.
The times for operations are given in nanoseconds, or billionths of a second, designated by "ns." While this is a very small increment of time, it actually may not be small enough to measure some operations accurately on very fast computers! If you are lucky enough to be using a fast machine, the benchmark will adjust the timing units to picoseconds (trillionths of a second), designated by "ps." If, on the other hand, your computer is slow, it may adjust back to using microseconds (millionths of a second), designated by "µs," or even all the way back to milliseconds (thousandths of a second), designated by "ms."
To run the Benchmark applet, select the tests you wish to run on the left, click the "Run Benchmark" button, and sit back and wait. Even moving the mouse around the screen can affect the accuracy of the benchmark, so just be patient while it runs the tests. You can stop the current test by clicking the "Stop" button, although it may take the applet a second or two to respond. "Clear" simply clears the text area. If "Console" is checked, the output will be mirrored to the Java console. If you want to save the results, you may need to check this.
Briefly, the individual benchmarks are:
instanceof
expressions.
try/catch/finally statements.
wait()/notify(). (Developing a test for threads that
actually ran on all the platforms I could test turned out to be
quite a challenge.)
To test another Java virtual machine (JVM) besides your browser, download the applet and run it. To test output from another compiler, download the source, recompile the code, and run it. You also can easily add your own tests to the benchmark. For comparison, here are benchmark results for a few Java runtimes.
Understanding the benchmarks
The purpose of benchmarking operations is to know which operations to
avoid or to replace with alternate implementations. This varies between
environments -- and especially between interpreters and JIT compilers
that convert Java's bytecodes into native code at run time.
Interpreted code will want to take advantage of the bytecode
instructions, while a JIT compiler often will have strengths and
weaknesses separate from the bytecode. We'll go over each of the
benchmarks and identify the types of operations that work best for both
interpreted and non-interpreted environments. (Note that some
compilers, such as Asymetrix's SuperCede on the PC, produce native code
directly. Wherever there is a reference to a JIT compiler, you can
substitute any compiler that produces native code.)
The loop benchmark
Almost all speed problems involve loops, and the loop itself is an
easy place to get a quick performance boost. The impact of the loop
overhead depends on how many iterations the loop executes and how
much it does during each iteration. Obviously, the more iterations
and the less work done per iteration, the greater the benefit with
optimizing the loop itself. The basic loop we're going to look at is:
for (i = 0; i < N; i++) {
// do something
}
The first rule is to use a local int variable for the
loop counter. We'll see the reason for this in the Variable benchmark.
It's unlikely that the loop counter would be anything other than a
local int, but I mention it just in case. If it isn't
necessary inside the loop for the loop variable i to count
from 0 to N-1, then the loop can be
restructured to use a more efficient test. Rather than comparing
i against N at each iteration, which requires
N to be loaded (assuming N isn't a constant
or static final variable), restructure the loop to take
advantage of the efficient checks against 0. Comparing against zero is
almost always more efficient in any language since the
underlying tests are based on < 0,<= 0, == 0, != 0, >= 0 and
> 0. To optimize, restructure the loop as:
for (i = N; --i >= 0; ) {
// do something
}
By combining the predecrement of i with the
comparison against 0, the necessary condition flags for
the comparison will already be set -- so no explicit comparison is
required. Interpreted code doesn't actually get this free test
because the bytecodes used for conditional tests require the
condition value to be on the stack. But interpreted code does get a
cheaper test because i is used for the test directly
without requiring an explicit comparison. JIT compiler code would
almost certainly get the benefit of both the free test and the
implicit comparison to zero.
One common use of small loops is to iterate through the elements
in an array. If the purpose of the loop simply is to copy elements
from one array to another array of the same data type (that is, from
an array of int to another array of int --
not from an array of byte to an array of
int where a data type conversion is required), then use
System.arraycopy(). This is a special-purpose optimized
native method for copying array elements; for copies of more than a
few elements, it's by far the fastest way to copy array elements.
If you're not copying and want to iterate to the end of the array,
there is an alternative. Because all array accesses in Java are
bounds checked, you can rely on this implicit testing and simply
catch the ArrayIndexOutOfBoundsException, which will be
thrown when the loop runs past the end of the array.
try {
for (i = 0; ; i++)
array[i] = i;
}
catch (ArrayIndexOutOfBoundsException e) {}
The above isn't necessarily good coding style but coding style frequently is compromised by optimizations. You'll want to make sure the loop performs enough iterations to cover the cost of throwing the exception. (See the exception benchmark for these cases.) Because an exception is not cheap to throw, this technique usually would be used only when the loop typically would perform a lot of iterations. In fact, consider using an exception to terminate a loop as a last-ditch effort since the results are likely to vary on different platforms.
The variable benchmark
The performance of a variable is determined by its scope and its
type. Local method variables are the fastest to access. There are a
couple of reasons for this. One is that there is no reference either
to the object instance, in the case of an instance variable, or to
the class in the case of a static variable. There simply is
an index into the data space. The other reason, for
interpreted code at least, is that there are special bytecodes with
implicit offsets for accessing the first four local variables and
parameters. Variables are assigned to slots in a method according to
the following rules:
this reference (this is
implicitly passed as the left-most parameter to an instance
method).
this reference, if any, is assigned, the
parameters to the method are assigned to slots, starting with the
left-most parameter through the right-most parameter.
double or long occupies two
variable slots. The first slot must be within the first four slots
to use the special bytecodes for accessing the long
or double variable.
Under a JIT compiler, some local variables are likely to get a further boost in performance by being kept in a register.
The fastest types of variables are int and reference
variables. For values stored in variables (as opposed to values
stored in arrays), int, boolean,
byte, short, and char
variables all are represented as 32-bit values and use the same
bytecodes for loading and storing. In arrays, boolean
and byte values are stored as 8-bit values, while
short and char values are stored as 16-bit
values. int, float, and object reference
values are always stored as 32-bits each, and long and
double values are always stored as 64-bit values.
The operations byte, short, and
char are performed as ints, and assigning the
results to a variable of the corresponding type requires an explicit
cast. In the following, (b + c) is an int
value and has to be cast to assign to a byte variable:
byte a, b, c; a = (byte) (b + c);
This cast requires an extra bytecode instruction, so there is another penalty for using smaller types. The speed differences in using these types when using a JIT compiler are less clear. In this case, it depends on what data types the processor handles efficiently.
When using arrays in Java, it's important to realize how an array initializer is implemented. The array
byte[][] myData = {{1, 2}, {3, 4}};
is initialized at run time one element at a time. It is translated literally by the compiler to the equivalent of:
byte[][] myData = new byte[2][]; myData[0] = new byte[2]; myData[0][0] = 1; myData[0][1] = 2; myData[1] = new byte[2]; myData[1][0] = 3; myData[1][1] = 4;
This array initialization has a couple of ramifications: 1) It can bloat your class files to embed data in arrays in the class file. For data of any significant size, it is probably faster to load the data as a separate element. 2) An initialized array as a local variable will execute all this code each time the method is invoked. Declare the array as a static or instance member to eliminate the initialization for each iteration. Even if you need a fresh copy each time the method is called, for non-trivial arrays it is faster to store a single initialized copy and make a copy of it.
private static int[] data = {1,1,2,3,5,8,13,21,34,55,89};
int[] fibo (int inc) {
int[] data = (int[])this.data.clone();
for (int i = data.length; --i >=0; )
data[i] += inc;
return data;
}
(For a multidimensional array, System.arraycopy()
andclone() both will only do a shallow copy of the
outermost dimension. The subarrays will be shared unless they are
individually copied or cloned.)
The method benchmark
There are four different bytecodes for invoking methods, and even
more quick bytecodes. Quick bytecodes are special bytecodes
used by the JVM to accelerate operations at run time. You can think
of the quick bytecodes as the interpreter's version of a JIT
compiler. They convert an operation that requires looking up a method
(or field or class) by name into one that uses an index so that it
only has to resolve the name once. The quick bytecodes are only
generated at run time and are never produced by the compiler. You
don't really need to know anything about quick bytecodes other than
that they make certain operations faster the second and consecutive
times they are executed. The first time the method benchmark is run
in an interpreter after the Benchmark applet is loaded, it runs a
test of the non-quick bytecodes. After that the interpreter has
changed the code to use quick bytecodes so the benchmark then runs
those tests four to five times faster.
There are two basic categories of methods: those that can be
statically resolved by the compiler and the rest, which must
be dynamically resolved at run time. To statically resolve a
method, the compiler must be able to determine absolutely which
method should be invoked. Methods declared as static,
private, and final, as well as all
constructors, may be resolved at compile time because the class the
method belongs to is known. A statically resolved method executes
more quickly than a dynamically resolved method.
The best way to optimize method calls is to eliminate them. The
compiler has an option for inlining certain statically resolved
methods (as described above in "Putting the
compiler to work") -- or you can simply inline method calls
yourself by replacing the calls with equivalent code. In a critical
section of code, calling small, easily inlined methods such as
Math.min(), Math.max(), and
Math.abs() can speed up those operations dramatically.
The next best option is to convert method and especially interface
invocations to static, private, or
final calls. If you have methods in your class that
don't make use of instance variables or call other instance methods,
they can be declared static. If a method doesn't need to
be overridden, it can be declared final. And methods that
shouldn't be called from outside the class should always be declared
private anyway.
By far the most expensive type of call in Java is to a synchronized method. Not only does the overhead of calling a synchronized method dramatically increase the calling time, especially with a JIT compiler, but there is always the potential that the call will be delayed waiting for the monitor for the object. And when the method returns, the monitor must be released, which takes more time.
There are a couple things you can do to speed up synchronized code. Again, the first line of defense is to eliminate calls to synchronized methods, so make sure that the operations actually require synchronization. Be careful though; this is not an area of your program to be overly ambitious in optimizing as the problems you can cause can be difficult to track down.
Because of the way a synchronized method is invoked, it is slightly faster to call a synchronized method than to enter a synchronized block. Also, a call to a synchronized method when the monitor is already owned by the thread executes somewhat faster than the synchronized method when the monitor isn't owned. So for the four combinations of these you get:
class Test {
static int x;
void test () {
meth1();
meth2(); // slowest
synchronized (this) {
meth1(); // fastest
meth2();
}
}
synchronized void meth1 () {
x = 1;
}
void meth2 () {
synchronized (this) { x = 1; }
}
}
Of course, these timings don't include the time it took to enter
the synchronized block inside test(). If
you have a heavily synchronized class that is calling
lots of synchronized methods from other
synchronized methods, you may want to consider having
synchronized methods delegate the work to private
non-synchronized methods to avoid the overhead of reacquiring the
monitor.
class X {
public synchronized void setStuff (int i) {
doSetStuff();
}
private doSetStuff () {
// setStuff code goes here
}
synchronized void doLotsOfStuff () {
doSetStuff(); // instead of calling setStuff()
// do more stuff...
}
}
The operator benchmark
The operator benchmark runs a lot of tests but doesn't tell us too
much. When run on an interpreter, however, it does point out one
interesting tidbit about the bytecodes. The int++ test has
a significantly faster execution than any of the other operators. This
is because the only bytecode that directly modifies a variable instead
of operating on the stack is the iinc instruction; this
increments a local int directly with a sign-extended byte
(-128to 127). This bytecode is one big reason why you should always
use a local int for a loop counter.
In general, it is this benchmark that illustrates where strength
reduction optimization techniques can be used.
Under
an interpreter there is no real difference between writing x =
x + y; vs. x += y;, because the bytecodes
generated are identical when x is a local variable. You
should still use the compound operator because it is easier to see
the operation being performed. However, if x is changed
to an instance variable, array element, or just about any other
lvalue, then x += y; becomes the preferred
expression. (An lvalue refers the the subexpression on
the left-hand
side of an assignment operator.)
Another issue is using shifting rather than multiplying and
dividing by powers of 2. Many modern RISC CPUs have or nearly have
eliminated the difference in execution time between multiply and
shift. However, divide remains the slow operation on most if not all
architectures. So while x <<= 1; generally has
little to no advantage over x *= 2, x >>=
1;, still it is an improvement over x /= 2.
If you are running under an interpreter, then the benchmark also
should show the general advantage int offers over using
byte, short, char, or
long.
The casting benchmark
Casting object references to refer to different object types in
Java (object casts)
can get pretty dense --
especially if you work with a lot of vectors or other container
classes. It turns out that the cost of a cast is high enough that any
object cast that can't be resolved at compile time (an object cast
that can be resolved at compile time is an unnecessary cast) takes
long enough that it is better to save the cast object in a local
variable than to repeat the cast. So instead of writing:
boolean equals (Object obj) {
if (obj instanceof Point)
return (((Point)obj).x == this.x && ((Point)obj).y == this.y);
return false;
}
write it as:
boolean equals (Object obj) {
if (obj instanceof Point) {
Point p = (Point)obj;
return (p.x == this.x && p.y == this.y);
}
return false;
}
Try it out yourself. These two examples are the last two tests in the casting benchmark.
If the cast is to an interface, it's probably at least twice as bad speedwise as casting to a class. In fact, there is one type of cast that can take much longer to execute. If you have an object hierarchy like
interface X {}
class Super {}
class Sub extends Super implements X {}
then the following cast, to an interface implemented by a subclass, takes anywhere from two to three times as long as casting to the subclass.
Super cali = new Sub(); X x = (X) cali;
The further the separation between the interface and the subclass (that is, the further back in the interface inheritance chain the cast interface is from the implemented interface), the longer the cast takes to resolve.
Also beware of unnecessary uses of instanceof. The
following cast is resolved by the compiler and produces no runtime
code to implement the unnecessary cast.
Point q = new Point(); Point p = (Point) q;
However,
Point p = new Point(); boolean b = (p instanceof Point);
can not be resolved by the compiler because
instanceof must return false if p ==
null.
Casting data types is simpler and cheaper than casting objects
because the type of the data value being cast is known (for example, an
int never is actually a subclass of an int.)
However, again since int is the natural size used by the
JVM (ints are the common data type that all other data types can be
directly converted to and from), beware of using the other data types.
Casting from a long to a short requires first
casting from a long to an int and then from
an int to a short.
The instantiation benchmark
Instantiating an object is fairly expensive in terms of CPU cycles.
On top of that, discarded objects will need to be garbage
collected at some point. The efficiency of garbage collection is
difficult to measure, and the Benchmark applet makes no attempt to
measure it. However, if you run the instantiation benchmark, most of
the extra time between the "test" time and the total running time is
spent garbage collecting, so this gives some idea of the impact
garbage collection can have on the performance of code that
instantiates lots of objects. It appears to take about as long to
garbage collect an object as it takes to create one.
Also, the longer the hierarchy chain for the object, the more constructors that must be called. This adds to the instantiation overhead. If you add extra layers to your class hierarchy for increased abstraction, you'll also get increased instantiation time.
The best option is to avoid instantiating objects in tight loops. Where possible, reuse an existing object. The loop
for (int i = 0; i < limit; i++) {
StringBuffer sb = new StringBuffer();
// do something with sb...
}
would be faster if written as
StringBuffer sb = new StringBuffer();
for (int i = 0; i < limit; i++) {
sb.setLength(0);
// do something with sb...
}
One other minor issue with instantiating objects: When an object is instantiated, all of its instance variables automatically are initialized to their default values, which not coincidentally is the value with all bits set to zero for all data types. If the instance variables are explicitly initialized to their default values, this will duplicate the default initialization, generate additional bytecodes, and make the instantiation take longer. (There are rare cases when the explicit initialization is needed to reinitialize a variable that was altered from the default value during the super class's constructor.)
The exception benchmark
Throwing exceptions and catching them with try/catch blocks
is normally an exceptional circumstance and therefore not typically
a performance concern for optimization. Perhaps surprisingly, a
try {} statement is "free" -- that is to say, no bytecodes
are generated, and unless an exception is thrown, all the try
{} statement adds to your code as far as overhead is the
goto bytecode to skip the catch () {}
statement and one or more entries in the method's exception
table. (An exception table keeps track of the beginning and
ending bytecode offsets for try statements and the offsets
of the corresponding catch and finally statements.)
A finally statement adds a little more overhead as it
is implemented as an inline subroutine. When an exception is thrown,
a quick check against the exception table for each method in the call
chain determines whether or not the exception is handled by the
method. Overall, the try/catch/finally mechanism in the
JVM is efficient. (When translated to native code,
try/catch/finally can present a challenge to producing
optimized code because of issues with register variables, so using
these statements may reduce the efficiency of code generated by a JIT
compiler.)
As you can see from the exception benchmark, instantiating a new exception is not as streamlined as throwing and catching the exception. An exception generates and stores a snapshot of the stack at the time the exception was instantiated. This is where most of the time for instantiating an exception goes. If you're throwing exceptions regularly as part of the normal operation of your code, you may want to rethink your design. But, if you do have a need to do this (such as returning a result from several calls deep), and if you're simply going to catch and handle the exception, then repeatedly instantiating an exception can be a waste of time. Instead, it is possible to reuse an exception:
MyException myexcept = new MyException();
void doit (int val) throws MyException {
if (val == 0)
throw myexcept;
}
(I warned you that optimizing isn't pretty.)
The thread benchmark
I could write an entire article on the difficulties and bugs in
various Java implementations that I encountered while writing a
thread benchmark. Suffice it to say that just getting threads to
behave similarly on all platforms can be a difficult proposition in
itself. There are a number of performance issues related to threads,
and they tend to be platform-specific.
To start with, each running thread has both a native or "C" stack
and a Java stack, so memory can be an issue. There is an overhead for
switching between threads (a context switch), which can vary
significantly between platforms. The thread benchmark shows the
timing of using wait()/notify() to initiate context
switches. Using threads also requires releasing and acquiring the
monitor for an object, so a context switch without the synchronized
overhead would be somewhat faster. Synchronized access between
threads, as discussed in the method benchmark
section, can result in deadlock and delays or it simply adds the
overhead for the monitor to calls.
For example, a game that uses sprite animation could implement the sprites with a separate thread for each sprite. (A sprite is an arbitrarily shaped graphic that moves over the background without destroying it.) If there were many sprites, the overhead just from the context switches could bring the system to its knees, let alone what would happen if several sprites wanted to synchronize on the same object (like the game field) at the same time. A more efficient implementation would be to implement a single sprite manager that would handle all of the sprites in a single thread.
Timer threads provide another opportunity for optimization. The Benchmark applet uses threads to time its tests. It starts a new thread for each test, which is simple and straightforward. The Benchmark applet would spend a lot less time with the timer if it simply left a single thread running and signaled it when to start the timer. However, following my own advice, I didn't bother optimizing the timer because doing so would unnecessarily complicate the code.
Conclusions
Low-level optimizations in Java can increase the speed of your code.
Frequently, just being aware of which operations are time consuming
is enough to avoid problems in the first place. If it's necessary to
optimize your code, make sure to follow these simple guidelines:
About the author
Doug Bell is vice president of development for FTL Games. FTL Games has
been producing entertainment software for dozens of different home game
consoles and home computers since the early '80s (before it was cool).
During his 15-plus years in the entertainment software business, he has
led the design and development of many top-selling commercial games,
including the groundbreaking DungeonMaster RPG series. He has been
working full-time on programming in Java for the last year and is a
frequent contributor to the technical discussions in the
comp.lang.java.* newsgroups. He is a co-founder of the San Diego County Java SIG
user's group. He can be reached at dbell@shvn.com, and his neglected home
page can be found at http://www.2nu.com/Doug/index.html.
If you have problems with this magazine, contact
webmaster@javaworld.com
URL: http://www.javaworld.com/javaworld/jw-04-1997/jw-04-optimize.html
Last modified: Wednesday, April 22, 1998