[java] Performance of Java matrix math libraries?

We are computing something whose runtime is bound by matrix operations. (Some details below if interested.) This experience prompted the following question:

Do folk have experience with the performance of Java libraries for matrix math (e.g., multiply, inverse, etc.)? For example:

I searched and found nothing.


Details of our speed comparison:

We are using Intel FORTRAN (ifort (IFORT) 10.1 20070913). We have reimplemented it in Java (1.6) using Apache commons math 1.2 matrix ops, and it agrees to all of its digits of accuracy. (We have reasons for wanting it in Java.) (Java doubles, Fortran real*8). Fortran: 6 minutes, Java 33 minutes, same machine. jvisualm profiling shows much time spent in RealMatrixImpl.{getEntry,isValidCoordinate} (which appear to be gone in unreleased Apache commons math 2.0, but 2.0 is no faster). Fortran is using Atlas BLAS routines (dpotrf, etc.).

Obviously this could depend on our code in each language, but we believe most of the time is in equivalent matrix operations.

In several other computations that do not involve libraries, Java has not been much slower, and sometimes much faster.

This question is related to java math matrix performance

The answer is


There are many different freely available java linear algebra libraries. http://www.ujmp.org/java-matrix/benchmark/ Unfortunately that benchmark only gives you info about matrix multiplication (with transposing the test does not allow the different libraries to exploit their respective design features).

What you should look at is how these linear algebra libraries perform when asked to compute various matrix decompositions. http://ojalgo.org/matrix_compare.html


We have used COLT for some pretty large serious financial calculations and have been very happy with it. In our heavily profiled code we have almost never had to replace a COLT implementation with one of our own.

In their own testing (obviously not independent) I think they claim within a factor of 2 of the Intel hand-optimised assembler routines. The trick to using it well is making sure that you understand their design philosophy, and avoid extraneous object allocation.


For 3d graphics applications the lwjgl.util vector implementation out-performed above mentioned jblas by a factor of about 3.

I have done 1 million matrix multiplications of a vec4 with a 4x4 matrix.

lwjgl finished in about 18ms, jblas required about 60ms.

(I assume, that the JNI approach is not very suitable for fast successive application of relatively small multiplications. Since the translation/mapping may take more time than the actual execution of the multiplication.)


I'm the author of la4j (Linear Algebra for Java) library and here is my point. I've been working on la4j for 3 years (the latest release is 0.4.0 [01 Jun 2013]) and only now I can start doing performace analysis and optimizations since I've just covered the minimal required functional. So, la4j isn't as fast as I wanted but I'm spending loads of my time to change it.

I'm currently in the middle of porting new version of la4j to JMatBench platform. I hope new version will show better performance then previous one since there are several improvement I made in la4j such as much faster internal matrix format, unsafe accessors and fast blocking algorithm for matrix multiplications.


I'm the main author of jblas and wanted to point out that I've released Version 1.0 in late December 2009. I worked a lot on the packaging, meaning that you can now just download a "fat jar" with ATLAS and JNI libraries for Windows, Linux, Mac OS X, 32 and 64 bit (except for Windows). This way you will get the native performance just by adding the jar file to your classpath. Check it out at http://jblas.org!


I have found that if you are creating a lot of high dimensional Matrices, you can make Jama about 20% faster if you change it to use a single dimensional array instead of a two dimensional array. This is because Java doesn't support multi-dimensional arrays as efficiently. ie. it creates an array of arrays.

Colt does this already, but I have found it is more complicated and more powerful than Jama which may explain why simple functions are slower with Colt.

The answer really depends on that you are doing. Jama doesn't support a fraction of the things Colt can do which make make more of a difference.


I just compared Apache Commons Math with jlapack.

Test: singular value decomposition of a random 1024x1024 matrix.

Machine: Intel(R) Core(TM)2 Duo CPU E6750 @ 2.66GHz, linux x64

Octave code: A=rand(1024); tic;[U,S,V]=svd(A);toc

results                                execution time
---------------------------------------------------------
Octave                                 36.34 sec

JDK 1.7u2 64bit
    jlapack dgesvd                     37.78 sec
    apache commons math SVD            42.24 sec


JDK 1.6u30 64bit
    jlapack dgesvd                     48.68 sec
    apache commons math SVD            50.59 sec

Native routines
Lapack* invoked from C:                37.64 sec
Intel MKL                               6.89 sec(!)

My conclusion is that jlapack called from JDK 1.7 is very close to the native binary performance of lapack. I used the lapack binary library coming with linux distro and invoked the dgesvd routine to get the U,S and VT matrices as well. All tests were done using double precision on exactly the same matrix each run (except Octave).

Disclaimer - I'm not an expert in linear algebra, not affiliated to any of the libraries above and this is not a rigorous benchmark. It's a 'home-made' test, as I was interested comparing the performance increase of JDK 1.7 to 1.6 as well as commons math SVD to jlapack.


Jeigen https://github.com/hughperkins/jeigen

  • wraps Eigen C++ library http://eigen.tuxfamily.org , which is one of the fastest free C++ libraries available
  • relatively terse syntax, eg 'mmul', 'sub'
  • handles both dense and sparse matrices

A quick test, by multiplying two dense matrices, ie:

import static jeigen.MatrixUtil.*;

int K = 100;
int N = 100000;
DenseMatrix A = rand(N, K);
DenseMatrix B = rand(K, N);
Timer timer = new Timer();
DenseMatrix C = B.mmul(A);
timer.printTimeCheckMilliseconds();

Results:

Jama: 4090 ms
Jblas: 1594 ms
Ojalgo: 2381 ms (using two threads)
Jeigen: 2514 ms
  • Compared to jama, everything is faster :-P
  • Compared to jblas, Jeigen is not quite as fast, but it handles sparse matrices.
  • Compared to ojalgo, Jeigen takes about the same amount of elapsed time, but only using one core, so Jeigen uses half the total cpu. Jeigen has a terser syntax, ie 'mmul' versus 'multiplyRight'

There's also UJMP


Building on Varkhan's post that Pentium-specific native code would do better:


You should add Apache Mahout to your shopping list.


I'm the author of Java Matrix Benchmark (JMatBench) and I'll give my thoughts on this discussion.

There are significant difference between Java libraries and while there is no clear winner across the whole range of operations, there are a few clear leaders as can be seen in the latest performance results (October 2013).

If you are working with "large" matrices and can use native libraries, then the clear winner (about 3.5x faster) is MTJ with system optimised netlib. If you need a pure Java solution then MTJ, OjAlgo, EJML and Parallel Colt are good choices. For small matrices EJML is the clear winner.

The libraries I did not mention showed significant performance issues or were missing key features.


There's a benchmark of various matrix packages available in java up on http://code.google.com/p/java-matrix-benchmark/ for a few different hardware configurations. But it's no substitute for doing your own benchmark.

Performance is going to vary with the type of hardware you've got (cpu, cores, memory, L1-3 cache, bus speed), the size of the matrices and the algorithms you intend to use. Different libraries have different takes on concurrency for different algorithms, so there's no single answer. You may also find that the overhead of translating to the form expected by a native library negates the performance advantage for your use case (some of the java libraries have more flexible options regarding matrix storage, which can be used for further performance optimizations).

Generally though, JAMA, Jampack and COLT are getting old, and do not represent the state of the current performance available in Java for linear algebra. More modern libraries make more effective use of multiple cores and cpu caches. JAMA was a reference implementation, and pretty much implements textbook algorithms with little regard to performance. COLT and IBM Ninja were the first java libraries to show that performance was possible in java, even if they lagged 50% behind native libraries.


You may want to check out the jblas project. It's a relatively new Java library that uses BLAS, LAPACK and ATLAS for high-performance matrix operations.

The developer has posted some benchmarks in which jblas comes off favourably against MTJ and Colt.


I can't really comment on specific libraries, but in principle there's little reason for such operations to be slower in Java. Hotspot generally does the kinds of things you'd expect a compiler to do: it compiles basic math operations on Java variables to corresponding machine instructions (it uses SSE instructions, but only one per operation); accesses to elements of an array are compiled to use "raw" MOV instructions as you'd expect; it makes decisions on how to allocate variables to registers when it can; it re-orders instructions to take advantage of processor architecture... A possible exception is that as I mentioned, Hotspot will only perform one operation per SSE instruction; in principle you could have a fantastically optimised matrix library that performed multiple operations per instruction, although I don't know if, say, your particular FORTRAN library does so or if such a library even exists. If it does, there's currently no way for Java (or at least, Hotspot) to compete with that (though you could of course write your own native library with those optimisations to call from Java).

So what does all this mean? Well:

  • in principle, it is worth hunting around for a better-performing library, though unfortunately I can't recomend one
  • if performance is really critical to you, I would consider just coding your own matrix operations, because you may then be able perform certain optimisations that a library generally can't, or that a particular library your using doesn't (if you have a multiprocessor machine, find out if the library is actually multithreaded)

A hindrance to matrix operations is often data locality issues that arise when you need to traverse both row by row and column by column, e.g. in matrix multiplication, since you have to store the data in an order that optimises one or the other. But if you hand-write the code, you can sometimes combine operations to optimise data locality (e.g. if you're multiplying a matrix by its transformation, you can turn a column traversal into a row traversal if you write a dedicated function instead of combining two library functions). As usual in life, a library will give you non-optimal performance in exchange for faster development; you need to decide just how important performance is to you.


Linalg code that relies heavily on Pentiums and later processors' vector computing capabilities (starting with the MMX extensions, like LAPACK and now Atlas BLAS) is not "fantastically optimized", but simply industry-standard. To replicate that perfomance in Java you are going to need native libraries. I have had the same performance problem as you describe (mainly, to be able to compute Choleski decompositions) and have found nothing really efficient: Jama is pure Java, since it is supposed to be just a template and reference kit for implementers to follow... which never happened. You know Apache math commons... As for COLT, I have still to test it but it seems to rely heavily on Ninja improvements, most of which were reached by building an ad-hoc Java compiler, so I doubt it's going to help. At that point, I think we "just" need a collective effort to build a native Jama implementation...


Have you taken a look at the Intel Math Kernel Library? It claims to outperform even ATLAS. MKL can be used in Java through JNI wrappers.


Matrix Tookits Java (MTJ) was already mentioned before, but perhaps it's worth mentioning again for anyone else stumbling onto this thread. For those interested, it seems like there's also talk about having MTJ replace the linalg library in the apache commons math 2.0, though I'm not sure how that's progressing lately.


Examples related to java

Under what circumstances can I call findViewById with an Options Menu / Action Bar item? How much should a function trust another function How to implement a simple scenario the OO way Two constructors How do I get some variable from another class in Java? this in equals method How to split a string in two and store it in a field How to do perspective fixing? String index out of range: 4 My eclipse won't open, i download the bundle pack it keeps saying error log

Examples related to math

How to do perspective fixing? How to pad a string with leading zeros in Python 3 How can I use "e" (Euler's number) and power operation in python 2.7 numpy max vs amax vs maximum Efficiently getting all divisors of a given number Using atan2 to find angle between two vectors How to calculate percentage when old value is ZERO Finding square root without using sqrt function? Exponentiation in Python - should I prefer ** operator instead of math.pow and math.sqrt? How do I get the total number of unique pairs of a set in the database?

Examples related to matrix

How to get element-wise matrix multiplication (Hadamard product) in numpy? How can I plot a confusion matrix? Error: stray '\240' in program What does the error "arguments imply differing number of rows: x, y" mean? How to input matrix (2D list) in Python? Difference between numpy.array shape (R, 1) and (R,) Counting the number of non-NaN elements in a numpy ndarray in Python Inverse of a matrix using numpy How to create an empty matrix in R? numpy matrix vector multiplication

Examples related to performance

Why is 2 * (i * i) faster than 2 * i * i in Java? What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism? How to check if a key exists in Json Object and get its value Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? Most efficient way to map function over numpy array The most efficient way to remove first N elements in a list? Fastest way to get the first n elements of a List into an Array Why is "1000000000000000 in range(1000000000000001)" so fast in Python 3? pandas loc vs. iloc vs. at vs. iat? Android Recyclerview vs ListView with Viewholder