Using Java with Nvidia GPUs CUDA

Question

I m working on a business project that is done in Java  and it needs huge computation power to compute business markets  Simple math  but with huge amount of data   We ordered some CUDA GPUs to try it with and since Java is not supported by CUDA  I m wondering where to start  Should I build a JNI interface  Should I use JCUDA or are there other ways   I don   t have experience in this field and I would like if someone could direct me to something so I can start researching and learning

User · Answer

From the research I have done  if you are targeting Nvidia GPUs and have decided to use CUDA over OpenCL  I found three ways to use the CUDA API in java     JCuda  or alternative - http   www jcuda org   This seems like the best solution for the problems I am working on  Many of libraries such as CUBLAS are available in JCuda  Kernels are still written in C though  JNI - JNI interfaces are not my favorite to write  but are very powerful and would allow you to do anything CUDA can do  JavaCPP - This basically lets you make a JNI interface in Java without writing C code directly  There is an example here  What is the easiest way to run working CUDA code in Java  of how to use this with CUDA thrust  To me  this seems like you might as well just write a JNI interface    All of these answers basically are just ways of using C C   code in Java  You should ask yourself why you need to use Java and if you can t do it in C C   instead   If you like Java and know how to use it and don t want to work with all the pointer management and what-not that comes with C C   then JCuda is probably the answer  On the other hand  the CUDA Thrust library and other libraries like it can be used to do a lot of the pointer management in C C   and maybe you should look at that   If you like C C   and don t mind pointer management  but there are other constraints forcing you to use Java  then JNI might be the best approach  Though  if your JNI methods are just going be wrappers for kernel commands you might as well just use JCuda    There are a few alternatives to JCuda such as Cuda4J and Root Beer  but those do not seem to be maintained  Whereas at the time of writing this JCuda supports CUDA 10 1  which is the most up-to-date CUDA SDK   Additionally there are a few java libraries that use CUDA  such as deeplearning4j and Hadoop  that may be able to do what you are looking for without requiring you to write kernel code directly  I have not looked into them too much though

User · Answer

I d start by using one of the projects out there for Java and CUDA  http   www jcuda org

User · Answer

Marco13 already provided an excellent answer   In case you are in search for a way to use the GPU without implementing CUDA OpenCL kernels  I would like to add a reference to the finmath-lib-cuda-extensions  finmath-lib-gpu-extensions  http   finmath net finmath-lib-cuda-extensions   disclaimer  I am the maintainer of this project    The project provides an implementation of  vector classes   to be precise  an interface called RandomVariable  which provides arithmetic operations and reduction on vectors  There are implementations for the CPU and GPU  There are implementation using algorithmic differentiation or plain valuations   The performance improvements on the GPU are currently small  but for vectors of size 100 000 you may get a factor   10 performance improvements   This is due to the small kernel sizes  This will improve in a future version   The GPU implementation use JCuda and JOCL and are available for Nvidia and ATI GPUs   The library is Apache 2 0 and available via Maven Central

User · Answer

First of all  you should be aware of the fact that CUDA will not automagically make computations faster  On the one hand  because GPU programming is an art  and it can be very  very challenging to get it right  On the other hand  because GPUs are well-suited only for certain kinds of computations   This may sound confusing  because you can basically compute anything on the GPU  The key point is  of course  whether you will achieve a good speedup or not  The most important classification here is whether a problem is task parallel or data parallel  The first one refers  roughly speaking  to problems where several threads are working on their own tasks  more or less independently  The second one refers to problems where many threads are all doing the same - but on different parts of the data    The latter is the kind of problem that GPUs are good at  They have many cores  and all the cores do the same  but operate on different parts of the input data    You mentioned that you have  simple math but with huge amount of data   Although this may sound like a perfectly data-parallel problem and thus like it was well-suited for a GPU  there is another aspect to consider  GPUs are ridiculously fast in terms of theoretical computational power  FLOPS  Floating Point Operations Per Second   But they are often throttled down by the memory bandwidth   This leads to another classification of problems  Namely whether problems are memory bound or compute bound    The first one refers to problems where the number of instructions that are done for each data element is low  For example  consider a parallel vector addition  You ll have to read two data elements  then perform a single addition  and then write the sum into the result vector  You will not see a speedup when doing this on the GPU  because the single addition does not compensate for the efforts of reading writing the memory    The second term   compute bound   refers to problems where the number of instructions is high compared to the number of memory reads writes  For example  consider a matrix multiplication  The number of instructions will be O n 3  when n is the size of the matrix  In this case  one can expect that the GPU will outperform a CPU at a certain matrix size  Another example could be when many complex trigonometric computations  sine cosine etc  are performed on  few  data elements    As a rule of thumb  You can assume that reading writing one data element from the  main  GPU memory has a latency of about 500 instructions      Therefore  another key point for the performance of GPUs is data locality  If you have to read or write data  and in most cases  you will have to  -    then you should make sure that the data is kept as close as possible to the GPU cores  GPUs thus have certain memory areas  referred to as  local memory  or  shared memory   that usually is only a few KB in size  but particularly efficient for data that is about to be involved in a computation   So to emphasize this again  GPU programming is an art  that is only remotely related to parallel programming on the CPU  Things like Threads in Java  with all the concurrency infrastructure like ThreadPoolExecutors  ForkJoinPools etc  might give the impression that you just have to split your work somehow and distribute it among several processors  On the GPU  you may encounter challenges on a much lower level  Occupancy  register pressure  shared memory pressure  memory coalescing     just to name a few     However  when you have a data-parallel  compute-bound problem to solve  the GPU is the way to go      A general remark  Your specifically asked for CUDA  But I d strongly recommend you to also have a look at OpenCL  It has several advantages  First of all  it s an vendor-independent  open industry standard  and there are implementations of OpenCL by AMD  Apple  Intel and NVIDIA  Additionally  there is a much broader support for OpenCL in the Java world  The only case where I d rather settle for CUDA is when you want to use the CUDA runtime libraries  like CUFFT for FFT or CUBLAS for BLAS  Matrix Vector operations   Although there are approaches for providing similar libraries for OpenCL  they can not directly be used from Java side  unless you create your own JNI bindings for these libraries      You might also find it interesting to hear that in October 2012  the OpenJDK HotSpot group started the project  Sumatra   http   openjdk java net projects sumatra    The goal of this project is to provide GPU support directly in the JVM  with support from the JIT  The current status and first results can be seen in their mailing list at http   mail openjdk java net mailman listinfo sumatra-dev     However  a while ago  I collected some resources related to  Java on the GPU  in general  I ll summarize these again here  in no particular order    Disclaimer  I m the author of http   jcuda org  and http   jocl org      Byte code translation and OpenCL code generation   https   github com aparapi aparapi   An open-source library that is created and actively maintained by AMD  In a special  Kernel  class  one can override a specific method which should be executed in parallel  The byte code of this method is loaded at runtime using an own bytecode reader  The code is translated into OpenCL code  which is then compiled using the OpenCL compiler  The result can then be executed on the OpenCL device  which may be a GPU or a CPU  If the compilation into OpenCL is not possible  or no OpenCL is available   the code will still be executed in parallel  using a Thread Pool   https   github com pcpratts rootbeer1   An open-source library for converting parts of Java into CUDA programs  It offers dedicated interfaces that may be implemented to indicate that a certain class should be executed on the GPU  In contrast to Aparapi  it tries to automatically serialize the  relevant  data  that is  the complete relevant part of the object graph   into a representation that is suitable for the GPU   https   code google com archive p java-gpu    A library for translating annotated Java code  with some limitations  into CUDA code  which is then compiled into a library that executes the code on the GPU  The Library was developed in the context of a PhD thesis  which contains profound background information about the translation process   https   github com ochafik ScalaCL   Scala bindings for OpenCL  Allows special Scala collections to be processed in parallel with OpenCL  The functions that are called on the elements of the collections can be usual Scala functions  with some limitations  which are then translated into OpenCL kernels   Language extensions  http   www ateji com px index html   A language extension for Java that allows parallel constructs  e g  parallel for loops  OpenMP style  which are then executed on the GPU with OpenCL  Unfortunately  this very promising project is no longer maintained    http   www habanero rice edu Publications html  JCUDA    A library that can translate special Java Code  called JCUDA code  into Java- and CUDA-C code  which can then be compiled and executed on the GPU  However  the library does not seem to be publicly available   https   www2 informatik uni-erlangen de EN research JavaOpenMP index html   Java language extension for for OpenMP constructs  with a CUDA backend  Java OpenCL CUDA binding libraries  https   github com ochafik JavaCL   Java bindings for OpenCL  An object-oriented OpenCL library  based on auto-generated low-level bindings  http   jogamp org jocl www    Java bindings for OpenCL  An object-oriented OpenCL library  based on auto-generated low-level bindings  http   www lwjgl org    Java bindings for OpenCL  Auto-generated low-level bindings and object-oriented convenience classes  http   jocl org    Java bindings for OpenCL  Low-level bindings that are a 1 1 mapping of the original OpenCL API  http   jcuda org    Java bindings for CUDA  Low-level bindings that are a 1 1 mapping of the original CUDA API  Miscellaneous  http   sourceforge net projects jopencl    Java bindings for OpenCL  Seem to be no longer maintained since 2010  http   www hoopoe-cloud com    Java bindings for CUDA  Seem to be no longer maintained

User · Answer

There is not much information on the nature of the problem and the data  so difficult to advise  However  would recommend to assess the feasibility of other solutions  that can be easier to integrate with java and enables horizontal as well as vertical scaling  The first I would suggest to look at is an open source analytical engine called Apache Spark https   spark apache org  that is available on Microsoft Azure but probably on other cloud IaaS providers too  If you stick to involving your GPU then the suggestion is to look at other GPU supported analytical databases on the market that fits in the budget of your organisation

[java] Using Java with Nvidia GPUs (CUDA)

Examples related to java

Examples related to cuda

Examples related to gpu

Examples related to multi-gpu