Wednesday, September 24, 2008

The Cost Of Indirection

Consider these two methods, identical except that one takes an ArrayList<String> while the other takes a List<String>.

static void direct(ArrayList<String> list, String string) {
    for (int j = 10000; j > 0; j--) {
        list.clear();
        for (int i = 10000; i > 0; i--)
            list.add(string);
    }
}

static void indirect(List<String> list, String string) {
    for (int j = 10000; j > 0; j--) {
        list.clear();
        for (int i = 10000; i > 0; i--)
            list.add(string);
    }
}

Believe it or not, the method that takes an ArrayList runs about 30% faster than the method that takes a List. Even more interestingly, switching from List<String> to the Collection<String> interface would make no difference. Why is this? Perhaps the JVM is skipping vtable lookups within the method that uses ArrayList...

If our theory is correct, then the JVM will skip vtable lookups in code that it can prove calls a specific, concrete method. Therefore, if we load a subclass of ArrayList that overrides add(), the JVM can no longer make that proof, and the direct method above should slow down. Indeed, we can demonstrate this as follows:

private static class MyArrayList<T> extends ArrayList<T> {
    @Override
    public boolean add(T e) {
        return super.add(e);
    }
}

public static void main(String[] args) {
    ArrayList<String> list = new ArrayList<String>();
    String string = "dummy";

    for (int j = 0; j < 2; j++) {
        for (int i = 0; i < 3; i++) {
            long directTime = 0;
            long indirectTime = 0;

            // Do a few runs to increase precision
            for (int k = 0; k < 10; k++) {

                indirectTime -= System.nanoTime();
                indirect(list, string);
                indirectTime += System.nanoTime();

                directTime -= System.nanoTime();
                direct(list, string);
                directTime += System.nanoTime();
            }

            // Print the results
            print(directTime/1e9, indirectTime/1e9);
        }

        System.out.println(">> Loading MyArrayList into memory");
        System.out.println();
        // Referencing the class literal forces MyArrayList to be loaded
        MyArrayList.class.getName();
    }
}

static void print(double directTime, double indirectTime) {
    double delta = 100.0 * (1.0 - directTime/indirectTime);
    System.out.printf("%.3gs for List\n", indirectTime);
    System.out.printf("%.3gs for ArrayList\n", directTime);
    System.out.printf("Difference is %.3g%%\n", delta);
    System.out.println();
}

The theory is correct! The output of the above program looks like this:

31.7s for List
21.7s for ArrayList
Difference is 31.6%

31.7s for List
21.3s for ArrayList
Difference is 32.7%

31.4s for List
21.4s for ArrayList
Difference is 31.7%

>> Loading MyArrayList into memory

31.2s for List
31.5s for ArrayList
Difference is -0.782%

31.4s for List
31.1s for ArrayList
Difference is 1.20%

31.3s for List
31.5s for ArrayList
Difference is -0.726%

So what rules can we learn from this?

  1. Try to avoid method indirection in performance-critical code. If your code needs to be modular then you must use indirection, but try to limit it to as few indirections as possible. Overly general code does carry a performance cost.
  2. Extending a class in one piece of code can have surprising performance effects on another piece of code. You're not in C++ land anymore, folks!

I do wonder, though: is there scope for runtime templates in the JVM? Could the JVM produce multiple instances of the indirect method, one for each concrete subclass of List, and jump directly to the most appropriate one? It seems to have real potential for improving JVM performance.
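In the meantime, this kind of specialization can be simulated by hand. The sketch below (hypothetical names, not part of the benchmark above) performs the type test once, outside the hot loop, so that the loop's call site only ever sees a concrete ArrayList receiver:

```java
import java.util.ArrayList;
import java.util.List;

public class Specialize {
    // Fast path: inside this method the JIT can prove the receiver
    // is an ArrayList, so add() needs no interface dispatch.
    static void fillConcrete(ArrayList<String> list, String s, int n) {
        for (int i = 0; i < n; i++)
            list.add(s);
    }

    // Slow path: fully general, pays for the indirection.
    static void fillGeneric(List<String> list, String s, int n) {
        for (int i = 0; i < n; i++)
            list.add(s);
    }

    // Dispatch once, before the loop, instead of on every add().
    static void fill(List<String> list, String s, int n) {
        if (list instanceof ArrayList)
            fillConcrete((ArrayList<String>) list, s, n);
        else
            fillGeneric(list, s, n);
    }

    public static void main(String[] args) {
        List<String> list = new ArrayList<String>();
        fill(list, "dummy", 5);
        System.out.println(list.size()); // prints 5
    }
}
```

This is essentially what a runtime-template JVM could do automatically; done by hand it clutters the code, which is why rule 1 above says to limit indirection rather than eliminate it everywhere.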

Monday, September 8, 2008

Counting Bytecode

I'd like to open this blog with a simple but extremely crude optimisation method: counting bytecode.

Presumably you are aware that Java (the language) compiles to Java bytecode. If you've never seen it, a human-readable form of the bytecode looks something like this:

0:  ICONST_0
1:  ISTORE 1
4:  IINC 1  1
6:  ILOAD 1
7:  IFNE 11
9:  ALOAD 0
10: INVOKESPECIAL run()V
11: RETURN

If you take a bit of time, you can learn to read this code. Writing bytecode by hand is not recommended, though. :)

Now, take a simple routine in an application that is executed often, such as firing an event. Using freely available Java tools, we can look closely at its bytecode and get some ideas on how to optimise it...

class Alarm {
  private Runnable[] listeners;
  private int count;
  public void fireAlarm() {
    for (int i = 0; i < count; i++)
      listeners[i].run();
  }
}

0:   iconst_0                      get the constant 0
1:   istore_1                        assign it to “i”
2:   goto 19                   jump to the comparison
5:   aload_0                   get the value of “this”
6:   getfield #27;        get the value of “listeners”
9:   iload_1                      get the value of “i”
10:  aaload            get the value of “listeners[i]”
11:  invokeinterface #41,1;                 call run()
16:  iinc 1, 1                           increment “i”
19:  iload_1                      get the value of “i”
20:  aload_0                   get the value of “this”
21:  getfield #23;            get the value of “count”
24:  if_icmplt 5           perform comparison and loop
27:  return

Looking at the above bytecode, we can see that there are 10 operations performed in every loop iteration. No matter how clever the JVM becomes, it can be worthwhile to reduce the number of operations within oft-executed loops.

In this case, we could try moving all those aload_0 and getfield operations outside of the loop by copying the member variables into locals. This shortens our loop by 2 bytecode operations, though it does make the entire method longer...

class Alarm {
  private Runnable[] listeners;
  private int count;
  public void fireAlarm() {
    Runnable[] listeners = this.listeners;
    int count = this.count;
    for (int i = 0; i < count; i++)
      listeners[i].run();
  }
}

0:   aload_0                    get value of “this”   
1:   getfield #27;         get value of “listeners”   
4:   astore_1           assign to local “listeners”   
5:   aload_0                    get value of “this”   
6:   getfield #23;             get value of “count”   
9:   istore_2               assign to local “count”   
10:  iconst_0                    get the constant 0   
11:  istore_3                         assign to “i”   
12:  goto 26                 jump to the comparison   
15:  aload_1         get value of local “listeners”   
16:  iload_3                       get value of “i”   
17:  aaload             get value of “listeners[i]”   
18:  invokeinterface #41,1;              call run()   
23:  iinc 3, 1               increment value of “i”   
26:  iload_3                       get value of “i”   
27:  iload_2             get value of local “count”   
28:  if_icmplt 15       perform comparison and loop   
31:  return

Would you believe that such a small change can improve the performance of this routine by up to 20 percent? I kid you not. Simply forcing member variables to be copied on to the stack can significantly improve performance.

Of course, the listeners themselves will consume most of the execution time, so you will not see an immediate 20% improvement in overall performance from this small change. However, perhaps other pieces of code could be altered to execute more efficiently... Indeed, the main lessons to take from this are:

  1. Think about the bytecode behind your Java code. If you can find ways to shorten frequently-executed bytecode, you're probably on the right track. This applies to all code, not just listeners and loops!
  2. Test the performance of your code, and derive 'best practices' for your coding. In the above example, we learned that copying member variables into local variables can provide a performance improvement. Use it in the future!
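For rule 2, such a test can be as simple as a hand-rolled harness like the sketch below (hypothetical names; it ignores warm-up and other micro-benchmarking pitfalls, so treat any numbers it prints as rough):

```java
import java.util.Arrays;

public class FireAlarmBench {
    static final int N = 1000;
    private Runnable[] listeners = new Runnable[N];
    private int count = N;

    FireAlarmBench() {
        // Fill the array with do-nothing listeners so only the
        // loop overhead itself is measured.
        Runnable nop = new Runnable() { public void run() {} };
        Arrays.fill(listeners, nop);
    }

    // Original version: re-reads the member variables on every iteration.
    void fireAlarmFields() {
        for (int i = 0; i < count; i++)
            listeners[i].run();
    }

    // Optimised version: copies the member variables into locals first.
    void fireAlarmLocals() {
        Runnable[] listeners = this.listeners;
        int count = this.count;
        for (int i = 0; i < count; i++)
            listeners[i].run();
    }

    public static void main(String[] args) {
        FireAlarmBench b = new FireAlarmBench();
        for (int round = 0; round < 3; round++) {
            long t1 = System.nanoTime();
            for (int i = 0; i < 10000; i++) b.fireAlarmFields();
            t1 = System.nanoTime() - t1;

            long t2 = System.nanoTime();
            for (int i = 0; i < 10000; i++) b.fireAlarmLocals();
            t2 = System.nanoTime() - t2;

            System.out.printf("fields: %.1fms, locals: %.1fms%n",
                              t1 / 1e6, t2 / 1e6);
        }
    }
}
```

Several rounds are printed because the first measurements include JIT compilation; only the later rounds are worth comparing.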

Over the next few weeks, I will post some of the techniques I have learned to improve performance of my own Java code. Indeed, there are ways to make this event firing loop execute much faster...