
High Performance Computing in Java

This article analyzes the current state of Java and JVM support for high performance computing. The focus is on optimizations based on hardware architecture and on avoiding synchronization. It is assumed that your algorithmic complexity is already optimal, or the best you can achieve; as we will see, better performance can often be gained by understanding the basic characteristics of memory and CPU architecture than by trying to reduce the algorithm's complexity further.

Memory and CPU architecture

For a long time I wondered how best to start this article to pique the reader's curiosity. In the end I decided to go with two simple examples and then explain their behavior.

Example 1: Iterating over a matrix

This example shows two different ways of iterating over a matrix: column based and row based. Conceptually the code does exactly the same thing in both cases: it reads the matrix cell by cell and sums the values:

Column oriented traversal:

for (int row = 0; row < MATRIX_SIZE; row++)
    for (int col = 0; col < MATRIX_SIZE; col++)
        sum += matrix[col][row];

Row oriented traversal:

for (int row = 0; row < MATRIX_SIZE; row++)
    for (int col = 0; col < MATRIX_SIZE; col++)
        sum += matrix[row][col];

Full benchmark code:

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.TimeValue;

import java.util.Random;
import java.util.concurrent.TimeUnit;

public class MatrixTraversal {

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(MatrixTraversal.class.getSimpleName())
                .warmupIterations(10)
                .warmupTime(TimeValue.seconds(10))
                .forks(1)
                .build();

        new Runner(opt).run();
    }

    @State(Scope.Thread)
    public static class MatrixTraversalState {

        private static int MATRIX_SIZE = 8 * 400;
        private int[][] matrix = new int[MATRIX_SIZE][MATRIX_SIZE];

        @Setup(Level.Trial)
        public void setup() {
            Random rand = new Random();
            rand.setSeed(System.currentTimeMillis());
            for (int row = 0; row < MATRIX_SIZE; row++)
                for (int col = 0; col < MATRIX_SIZE; col++)
                    matrix[row][col] = rand.nextInt();
        }

        @TearDown(Level.Trial)
        public void tearDown() {
            matrix = new int[MATRIX_SIZE][MATRIX_SIZE];
        }
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput) @OutputTimeUnit(TimeUnit.MILLISECONDS)
    public void columnBased(MatrixTraversalState state, Blackhole blackhole) {
        long sum = 0L;
        for (int row = 0; row < MatrixTraversalState.MATRIX_SIZE; row++)
            for (int col = 0; col < MatrixTraversalState.MATRIX_SIZE; col++)
                sum += state.matrix[col][row];
        blackhole.consume(sum);
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput) @OutputTimeUnit(TimeUnit.MILLISECONDS)
    public void rowBased(MatrixTraversalState state, Blackhole blackhole) {
        long sum = 0L;
        for (int row = 0; row < MatrixTraversalState.MATRIX_SIZE; row++)
            for (int col = 0; col < MatrixTraversalState.MATRIX_SIZE; col++)
                sum += state.matrix[row][col];
        blackhole.consume(sum);
    }
}

Here are the results:

# Run complete. Total time: 00:04:03

Benchmark                    Mode   Cnt  Score    Error  Units
MatrixTraversal.columnBased  thrpt   20  0,006 ±  0,001  ops/ms
MatrixTraversal.rowBased     thrpt   20  0,249 ±  0,021  ops/ms

How can it be that two conceptually identical algorithms, differing only in traversal order, show such a big difference in throughput? One is over 40 times faster than the other!

Let’s whet your curiosity with a second example.

Example 2: Local and global variable mutation

This example counts odd numbers in a matrix in parallel. Each thread processes a separate row and increments its own counter; the only difference between the two code samples is how they mutate those counters.
The first mutates entries of the shared results array directly; the second uses a local variable and stores the result only at the end.

Global variable mutation:

for (int number : toProcess) {
    if (number % 2 != 0) ParallelOddCounter.results[rowNum] += 1;
}

Local variable mutation:

int result = 0;
for (int number : toProcess) {
    if (number % 2 != 0) result += 1;
}
ParallelOddCounter.results[rowNum] = result;

Parallel execution:

int threadNum = ROWS;
List<CalculateOddTask> tasks = IntStream.range(0, threadNum)
       .mapToObj(i -> new CalculateOddTask(matrix[i], i))
       .collect(toList());

Long sum = tasks.parallelStream()
       .map(CalculateOddTask::calculate)
       .mapToLong(Long::longValue).sum();

Full benchmark code:

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.TimeValue;

import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;

import static java.util.stream.Collectors.toList;

public class ParallelOddCounter {

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(ParallelOddCounter.class.getSimpleName())
                .warmupIterations(10)
                .warmupTime(TimeValue.seconds(10))
                .forks(1)
                .build();

        new Runner(opt).run();
    }

    @State(Scope.Thread)
    public static class ParallelOddCounterState {

        private static int ROWS = 8;
        private static int COLUMNS = 500000;

        private int[][] matrix = new int[ROWS][COLUMNS];
        private static int[] results = new int[ROWS];

        @Setup(Level.Trial)
        public void setup() {
            Random rand = new Random();
            rand.setSeed(System.currentTimeMillis());
            for (int row = 0; row < ROWS; row++)
                for (int col = 0; col < COLUMNS; col++)
                    matrix[row][col] = rand.nextInt();
        }

        @TearDown(Level.Trial)
        public void tearDown() {
            matrix = new int[ROWS][COLUMNS];
            results = new int[ROWS];
        }
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput) @OutputTimeUnit(TimeUnit.MILLISECONDS)
    public void localVariableMutation(ParallelOddCounterState state, Blackhole blackhole) {
        int threadNum = ParallelOddCounterState.ROWS;
        List<LocalVariableMutation> tasks = IntStream.range(0, threadNum)
                .mapToObj(i -> new LocalVariableMutation(state.matrix[i], i, state))
                .collect(toList());

        long sum = tasks.parallelStream()
                .map(LocalVariableMutation::calculate)
                .mapToLong(Long::longValue).sum();
        blackhole.consume(sum);
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput) @OutputTimeUnit(TimeUnit.MILLISECONDS)
    public void globalVariableMutation(ParallelOddCounterState state, Blackhole blackhole) {
        int threadNum = ParallelOddCounterState.ROWS;
        List<GlobalVariableMutation> tasks = IntStream.range(0, threadNum)
                .mapToObj(i -> new GlobalVariableMutation(state.matrix[i], i, state))
                .collect(toList());

        long sum = tasks.parallelStream()
                .map(GlobalVariableMutation::calculate)
                .mapToLong(Long::longValue).sum();
        blackhole.consume(sum);
    }

    private static class LocalVariableMutation {

        private final int[] toProcess;
        private final int rowNum;
        private final ParallelOddCounterState state;

        LocalVariableMutation(int[] toProcess, int rowNum, ParallelOddCounterState state) {
            this.toProcess = toProcess;
            this.rowNum = rowNum;
            this.state = state;
        }

        private long calculate() {
            int result = 0;
            for (int number : toProcess) {
                if (number % 2 != 0) result += 1;
            }
            state.results[rowNum] = result;

            return state.results[rowNum];
        }
    }

    private static class GlobalVariableMutation {

        private final int[] toProcess;
        private final int rowNum;
        private final ParallelOddCounterState state;

        GlobalVariableMutation(int[] toProcess, int rowNum, ParallelOddCounterState state) {
            this.toProcess = toProcess;
            this.rowNum = rowNum;
            this.state = state;
        }

        private long calculate() {
            state.results[rowNum] = 0;
            for (int number : toProcess) {
                if (number % 2 != 0) state.results[rowNum] += 1;
            }

            return state.results[rowNum];
        }
    }
}

I ran the benchmark and measured throughput; here are the results:

# Run complete. Total time: 00:04:01

Benchmark                                  Mode   Cnt  Score    Error  Units
ParallelOddCounter.globalVariableMutation  thrpt   20  0,158 ±  0,007  ops/ms
ParallelOddCounter.localVariableMutation   thrpt   20  0,300 ±  0,010  ops/ms

Again, two conceptually similar algorithms show a large difference in throughput: the local-variable version is almost twice as fast.
Before we jump into a deeper analysis of hardware and system performance counters, let's describe in very general terms how the CPU and memory work.

Memory

To reduce access time to main memory, most modern CPUs have dedicated hardware caches for instructions and data, organized as a hierarchy of cache levels (usually L1, L2 and L3).
The diagram below shows such a hierarchy with typical access times.

[Figure: CPU cache hierarchy (registers, L1, L2, L3, main memory) with typical access times]

As you can see, the L1 cache is the fastest but smallest, and L3 is the slowest but biggest.
In order to read data as fast as possible, the CPU uses several techniques, such as hardware prefetching, to predict which data will be needed in the near future. These techniques rely on memory access patterns, e.g.:

  • Sequential: the simplest pattern; data are read one element at a time, forward or backward, with the address incremented (or decremented) by a fixed amount each step
  • Random: strictly speaking also a pattern, but one that hardware prefetchers cannot predict and therefore cannot handle efficiently
  • Linear: similar to sequential, but with a linear transformation of the index, e.g. a fixed stride touching every n-th element
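
As an illustration of the three patterns (my sketch, not from the original benchmarks; the class and method names `AccessPatterns`, `sequentialSum`, `stridedSum` and `randomSum` are invented), note that only the index expression changes between the loops:

```java
import java.util.Random;

// Sketch: the three access patterns over a plain int array. Hardware
// prefetchers recognize the first two patterns; the random one defeats them.
public class AccessPatterns {

    // Sequential: the address advances by one element per step.
    static long sequentialSum(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) sum += data[i];
        return sum;
    }

    // Linear (strided): a fixed linear transformation of the index; with a
    // stride of 16 ints this touches one element per 64-byte cache line.
    static long stridedSum(int[] data, int stride) {
        long sum = 0;
        for (int i = 0; i < data.length; i += stride) sum += data[i];
        return sum;
    }

    // Random: consecutive accesses may land on unrelated cache lines (and
    // pages), so neither the prefetcher nor the TLB can help much.
    static long randomSum(int[] data, int[] indices) {
        long sum = 0;
        for (int idx : indices) sum += data[idx];
        return sum;
    }

    public static void main(String[] args) {
        int n = 1 << 20;                      // 4 MB of ints, larger than L2
        int[] data = new int[n];
        int[] indices = new int[n];
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) {
            data[i] = i;
            indices[i] = rnd.nextInt(n);
        }
        System.out.println(sequentialSum(data));
        System.out.println(stridedSum(data, 16));
        System.out.println(randomSum(data, indices));
    }
}
```

On a typical machine the sequential and strided loops run close to memory bandwidth, while the random loop is dominated by cache and TLB misses even though it executes the same kind of additions.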

Example 1, Analysis

Let's run perf on the row- and column-based traversals and see what is going on:

Row based:
Perf stats:

  19609,482454      task-clock:u (msec)       #    0,163 CPUs utilized          
             0      context-switches:u        #    0,000 K/sec                  
             0      cpu-migrations:u          #    0,000 K/sec                  
         18661      page-faults:u             #    0,952 K/sec                  
   61118777117      cycles:u                  #    3,117 GHz                      (28,47%)
   27524496200      stalled-cycles-frontend:u #   45,03% frontend cycles idle     (28,56%)
  136464457301      instructions:u            #    2,23  insn per cycle         
                                              #    0,20  stalled cycles per insn  (35,72%)
    6985113568      branches:u                #  356,211 M/sec                    (35,79%)
      18318068      branch-misses:u           #    0,26% of all branches          (35,85%)
   54548129986      L1-dcache-loads:u         # 2781,722 M/sec                    (28,55%)
    3410858719      L1-dcache-load-misses:u   #    6,25% of all L1-dcache hits    (14,33%)
    2667108320      LLC-loads:u               #  136,011 M/sec                    (14,31%)
   <not supported>      LLC-load-misses:u                                           
   <not supported>      L1-icache-loads:u                                           
        1468678      L1-icache-load-misses:u                                       (21,43%)
    54734555275      dTLB-loads:u              # 2791,229 M/sec                    (15,49%)
       53875977      dTLB-load-misses:u        #    0,10% of all dTLB cache hits   (20,06%)
          86805      iTLB-loads:u              #    0,004 M/sec                    (14,22%)
         158113      iTLB-load-misses:u        #  182,15% of all iTLB cache hits   (21,32%)
    <not supported>      L1-dcache-prefetches:u                                      
     3306277688      L1-dcache-prefetch-misses:u #  168,606 M/sec                    (28,43%)

 120,529408835 seconds time elapsed

Column based:
Perf stats:

  21516,069282      task-clock:u (msec)       #    0,176 CPUs utilized          
             0      context-switches:u        #    0,000 K/sec                  
             0      cpu-migrations:u          #    0,000 K/sec                  
         18669      page-faults:u             #    0,868 K/sec                  
   65470848000      cycles:u                  #    3,043 GHz                      (28,13%)
   63180495808      stalled-cycles-frontend:u #   96,50% frontend cycles idle     (28,20%)
   11394498221      instructions:u            #    0,17  insn per cycle         
                                              #    5,54  stalled cycles per insn  (35,36%)
    1905013858      branches:u                #   88,539 M/sec                    (35,43%)
       1840750      branch-misses:u           #    0,10% of all branches          (35,48%)
    3772951641      L1-dcache-loads:u         #  175,355 M/sec                    (27,84%)
    3107093650      L1-dcache-load-misses:u   #   82,35% of all L1-dcache hits    (17,01%)
    3036548176      LLC-loads:u               #  141,129 M/sec                    (15,67%)
   <not supported>      LLC-load-misses:u                                           
   <not supported>      L1-icache-loads:u                                           
        2018699      L1-icache-load-misses:u                                       (22,13%)
     3774983469      dTLB-loads:u              #  175,449 M/sec                    (22,05%)
     2133463613      dTLB-load-misses:u        #   56,52% of all dTLB cache hits   (19,68%)
          75845      iTLB-loads:u              #    0,004 M/sec                    (14,23%)
         112901      iTLB-load-misses:u        #  148,86% of all iTLB cache hits   (21,33%)
    <not supported>      L1-dcache-prefetches:u                                      
       85831840      L1-dcache-prefetch-misses:u #    3,989 M/sec                    (28,10%)

 122,431284555 seconds time elapsed

What is most interesting here are the following three statistics:

  • L1-dcache-load-misses:u (6.25% of all L1-dcache hits vs 82.35%)
  • dTLB-load-misses:u (0.10% of all dTLB cache hits vs 56.52%) - the TLB (Translation Lookaside Buffer) is a hardware cache that translates virtual memory addresses to physical memory addresses
  • instructions:u (2.23 insn per cycle vs 0.17 insn per cycle)

These results clearly show that the column-based experiment has very poor cache performance. It is essential to know that memory is loaded into caches in contiguous blocks called cache lines (64 bytes on this CPU, as the clflush size in the specification below shows).
Row-based traversal consumes a whole cache line's worth of row data per fetch, while column-based traversal uses only the first cell of each fetched line:

[Figure: row-based traversal uses every element of a fetched cache line; column-based traversal uses only one element per line]
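
For readers who want to reproduce the effect without JMH, here is a rough single-shot sketch (mine; the class name `TraversalSketch` is invented for illustration). One-off `System.nanoTime` measurements are noisy, so treat the printed times only as a rough indication:

```java
// Rough sanity check: time both traversals of a square matrix once.
// The row-major pass is typically several times faster because Java's
// int[][] stores each row as one contiguous int[] array.
public class TraversalSketch {

    static long rowSum(int[][] m) {
        long sum = 0;
        for (int row = 0; row < m.length; row++)
            for (int col = 0; col < m[row].length; col++)
                sum += m[row][col];      // walks one cache line after another
        return sum;
    }

    // Assumes a square matrix.
    static long colSum(int[][] m) {
        long sum = 0;
        for (int row = 0; row < m.length; row++)
            for (int col = 0; col < m[row].length; col++)
                sum += m[col][row];      // jumps a whole row's worth of bytes each step
        return sum;
    }

    public static void main(String[] args) {
        int n = 3200;
        int[][] m = new int[n][n];
        for (int r = 0; r < n; r++)
            for (int c = 0; c < n; c++)
                m[r][c] = r + c;

        long t0 = System.nanoTime();
        long a = rowSum(m);
        long t1 = System.nanoTime();
        long b = colSum(m);
        long t2 = System.nanoTime();

        // Both orders visit every cell exactly once, so the sums must match.
        System.out.printf("row: %d ms, col: %d ms, sums equal: %b%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a == b);
    }
}
```

Both orders visit every cell exactly once, so the sums must agree; only the order of memory accesses, and therefore the cache-line utilization, differs.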

Example 2, Analysis

Let's run perf on the second example with the same arguments:

Local variable mutation:

 135454,217254      task-clock:u (msec)       #    1,125 CPUs utilized          
             0      context-switches:u        #    0,000 K/sec                  
             0      cpu-migrations:u          #    0,000 K/sec                  
          3180      page-faults:u             #    0,023 K/sec                  
  408320114133      cycles:u                  #    3,014 GHz                      (25,04%)
  277245114110      stalled-cycles-frontend:u #   67,90% frontend cycles idle     (26,15%)
  244103544681      instructions:u            #    0,60  insn per cycle         
                                              #    1,14  stalled cycles per insn  (33,27%)
   31263636316      branches:u                #  230,806 M/sec                    (33,18%)
   11355446592      branch-misses:u           #   36,32% of all branches          (32,85%)
   22849799812      L1-dcache-loads:u         #  168,690 M/sec                    (19,98%)
    1445932294      L1-dcache-load-misses:u   #    6,33% of all L1-dcache hits    (18,03%)
     137910971      LLC-loads:u               #    1,018 M/sec                    (17,29%)
   <not supported>      LLC-load-misses:u                                           
   <not supported>      L1-icache-loads:u                                           
       22658484      L1-icache-load-misses:u                                       (20,66%)
    22856383028      dTLB-loads:u              #  168,739 M/sec                    (17,12%)
       55405650      dTLB-load-misses:u        #    0,24% of all dTLB cache hits   (17,59%)
         928501      iTLB-loads:u              #    0,007 M/sec                    (15,27%)
        2619021      iTLB-load-misses:u        #  282,07% of all iTLB cache hits   (19,86%)
    <not supported>      L1-dcache-prefetches:u                                      
     1398462355      L1-dcache-prefetch-misses:u #   10,324 M/sec                    (25,34%)

 120,425967116 seconds time elapsed

Global variable mutation:

Perf stats:

135905,565940      task-clock:u (msec)       #    1,127 CPUs utilized          
             0      context-switches:u        #    0,000 K/sec                  
             0      cpu-migrations:u          #    0,000 K/sec                  
           185      page-faults:u             #    0,001 K/sec                  
  411036541860      cycles:u                  #    3,024 GHz                      (24,78%)
  304354234256      stalled-cycles-frontend:u #   74,05% frontend cycles idle     (25,94%)
  132423213521      instructions:u            #    0,32  insn per cycle         
                                              #    2,30  stalled cycles per insn  (33,06%)
   30718239771      branches:u                #  226,026 M/sec                    (33,03%)
   12839661193      branch-misses:u           #   41,80% of all branches          (32,64%)
   19402593594      L1-dcache-loads:u         #  142,765 M/sec                    (19,89%)
    2595520843      L1-dcache-load-misses:u   #   13,38% of all L1-dcache hits    (18,09%)
     785438220      LLC-loads:u               #    5,779 M/sec                    (17,17%)
   <not supported>      LLC-load-misses:u                                           
   <not supported>      L1-icache-loads:u                                           
       15883187      L1-icache-load-misses:u                                       (20,51%)
    19425528148      dTLB-loads:u              #  142,934 M/sec                    (17,08%)
       34359211      dTLB-load-misses:u        #    0,18% of all dTLB cache hits   (17,37%)
        1017412      iTLB-loads:u              #    0,007 M/sec                    (15,25%)
        1790179      iTLB-load-misses:u        #  175,95% of all iTLB cache hits   (19,76%)
    <not supported>      L1-dcache-prefetches:u                                      
      974441614      L1-dcache-prefetch-misses:u #    7,170 M/sec                    (25,08%)

 120,537409202 seconds time elapsed

Again, the slower variant shows poorer cache performance, yet both benchmarks use the same traversal; the only difference is how they mutate the counters. Here we run into an issue known as 'false sharing': multiple threads read and write memory addresses that are different but close enough to each other to land in the same L1 cache line.

[Figure: the counters of the results array packed into a single cache line that is shared between threads]

To understand why such application behavior hurts, let's examine how caches work in a multithreaded environment. The CPU used here implements the MESI cache coherence protocol, also known as the Illinois protocol.

To avoid needless synchronization between caches and main memory, every cache line is marked with exactly one of four states:

  • Modified (M): the cache line resides only in the current cache and has been modified relative to main memory. It must be written back to main memory or a higher-level cache before any other read of that location is permitted.
  • Exclusive (E): the cache line resides only in the current cache and is unmodified.
  • Shared (S): the cache line may also reside in other caches and is unmodified.
  • Invalid (I): the cache line is unused.
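
The state transitions relevant to false sharing can be modeled with a toy (my illustration, not a real protocol implementation; the class name `MesiToy` is invented): a write moves the writer's copy of the line to M and invalidates every other copy, while a read of a line that is modified elsewhere forces a write-back and demotes the copies to S:

```java
import java.util.Arrays;

// Toy model: one cache line, tracked per core. Not a real implementation;
// it only illustrates how writes and reads move lines between MESI states.
public class MesiToy {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    final State[] line;                  // state of the line in each core's cache

    MesiToy(int cores) {
        line = new State[cores];
        Arrays.fill(line, State.INVALID);
    }

    // A write gains exclusive ownership: writer -> M, everyone else -> I.
    void write(int core) {
        for (int i = 0; i < line.length; i++)
            line[i] = (i == core) ? State.MODIFIED : State.INVALID;
    }

    // A read of a line modified elsewhere forces a write-back; all live
    // copies, including the reader's fresh one, end up SHARED.
    void read(int core) {
        boolean modifiedElsewhere = false;
        for (int i = 0; i < line.length; i++)
            if (i != core && line[i] == State.MODIFIED) modifiedElsewhere = true;

        if (modifiedElsewhere) {
            for (int i = 0; i < line.length; i++)
                if (line[i] != State.INVALID) line[i] = State.SHARED;
            line[core] = State.SHARED;
        } else if (line[core] == State.INVALID) {
            boolean anyCopy = false;
            for (State s : line) if (s != State.INVALID) anyCopy = true;
            line[core] = anyCopy ? State.SHARED : State.EXCLUSIVE;
            if (anyCopy)                 // demote any EXCLUSIVE holder to SHARED
                for (int i = 0; i < line.length; i++)
                    if (line[i] == State.EXCLUSIVE) line[i] = State.SHARED;
        }
    }
}
```

Running write(0), read(1), write(1), read(0), ... in a loop reproduces the ping-pong of Example 2: the line bounces between M, S and I on every iteration, and each bounce costs a write-back and a re-fetch on real hardware.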

During runtime, the state of a cache line changes as cores read and write it. (In this article I skip NUMA, Non-Uniform Memory Access, architectures.)

So now it is clear how the threads interfere with each other: each write sets the line's state to M in the writer's cache. Before another thread can read it, the whole cache line is written back to a higher-level cache or main memory and then fetched again into the reading thread's local L1.
Using a local variable and mutating the shared line only once at the end avoids this ping-pong. It is also possible to add padding between the results cells so that each counter lives on its own cache line.
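
A minimal sketch of that padding fix (my illustration; the class PaddedCounters and its constant are invented, not part of the benchmark): with 4-byte ints, spacing live counters 16 slots apart puts each one on its own 64-byte cache line:

```java
// Spread counters so that no two live counters share a 64-byte cache line.
// Only every 16th slot of the backing array is used; the 15 slots in
// between are deliberate padding.
public class PaddedCounters {
    static final int CACHE_LINE_INTS = 64 / Integer.BYTES;   // 16 ints per line

    final int[] padded;

    PaddedCounters(int counters) {
        padded = new int[counters * CACHE_LINE_INTS];
    }

    void increment(int counter) {
        padded[counter * CACHE_LINE_INTS]++;   // slot 0 of the counter's own line
    }

    int get(int counter) {
        return padded[counter * CACHE_LINE_INTS];
    }
}
```

Each thread would then call increment(rowNum) instead of writing results[rowNum] directly. Two caveats: the array object header means the first counter can still share a line with the header, and since JDK 8 the JVM-internal @Contended annotation (sun.misc.Contended in JDK 8, jdk.internal.vm.annotation.Contended later) can apply such padding to fields automatically, though user code needs -XX:-RestrictContended for it to take effect.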

As you can see, we should be careful about how we design our data and algorithms, keeping in mind the hardware they will be running on.


Linux distribution: Fedora 26.
My old laptop's CPU specification (8 logical CPUs):

vendor_id	: GenuineIntel
cpu family	: 6
model		: 58
model name	: Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz
stepping	: 9
microcode	: 0x1f
cpu MHz		: 1197.421
cache size	: 6144 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt ibpb ibrs stibp dtherm ida arat pln pts
bogomips	: 4589.84
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual

PS: SO1 is one of the driving forces of retailer digitalization. We have created a very powerful AI for retail which is capable of personalizing promotions for users in real-time and across devices. The SO1 Engine sources the entire portfolio of the retailer and automatically selects the right products for each individual consumer and adjusts discounts such that revenue, profit, or consumer satisfaction are maximized.


Michał Warecki

Michał Warecki is a Software Engineer at SO1 with over a decade of experience building complex systems on the JVM. His main focus is simplicity and performance of code.
