Julia has a library called CUDAnative, which hacks the compiler to run your code on GPUs.
using CuArrays, CUDAnative
xs, ys, zs = CuArray(rand(1024)), CuArray(rand(1024)), CuArray(zeros(1024))
function kernel_vadd(out, a, b)
  i = (blockIdx().x-1) * blockDim().x + threadIdx().x
  out[i] = a[i] + b[i]
  return
end
@cuda (1, length(xs)) kernel_vadd(zs, xs, ys)
@assert zs == xs + ys
Is this better than writing CUDA C? At first it’s easy to mistake this for mere syntactic convenience, but I’m convinced that it brings something fundamentally new to the table. Julia’s powerful array abstractions turn out to be a great fit for GPU programming, and it should be of interest to GPGPU hackers regardless of whether they use the language already.
A New Dimension
For numerics experts, one of Julia’s killer features is its powerful N-dimensional array support. This extends not just to high-level “vectorised” operations like broadcasting arithmetic, but also to the inner loops in the lowest-level kernels. For example, take a CPU kernel that adds two 2D arrays:
function add!(out, a, b)
  for i = 1:size(a, 1)
    for j = 1:size(a, 2)
      out[i,j] = a[i,j] + b[i,j]
    end
  end
end
This kernel is fast, but hard to generalise across different numbers of dimensions. The change needed to support 3D arrays, for example, is small and mechanical (add an extra inner loop), but we can’t write it using normal functions.
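To make that concrete, here is what the mechanical 3D extension looks like (a sketch; add3d! is an illustrative name, not part of any package):

```julia
# The 3D version of add!: identical pattern, one extra hand-written
# inner loop. Every new dimension needs another copy-pasted loop.
function add3d!(out, a, b)
    for i = 1:size(a, 1)
        for j = 1:size(a, 2)
            for k = 1:size(a, 3)
                out[i,j,k] = a[i,j,k] + b[i,j,k]
            end
        end
    end
    return out
end

a, b = rand(2, 3, 4), rand(2, 3, 4)
add3d!(zeros(2, 3, 4), a, b) == a .+ b  # true
```

The 2D and 3D bodies differ only by one loop, yet no ordinary function can express "one loop per dimension" on its own.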
Julia’s code generation enables an elegant, if somewhat arcane, solution:
using Base.Cartesian
@generated function add!(out, a, b)
  N = ndims(out)
  quote
    @nloops $N i out begin
      @nref($N, out, i) = @nref($N, a, i) + @nref($N, b, i)
    end
  end
end
The @generated annotation allows us to hook into Julia’s code specialisation; when the function receives matrices as input, our custom code generation will create and run a doubly-nested loop. This will behave the same as our add! function above, but it works for arrays of any dimension. If you remove the @generated annotation, you can see the internals:
julia> using MacroTools
julia> add!(zs, xs, ys) |> macroexpand |> MacroTools.prettify
quote
for i_2 = indices(out, 2)
nothing
for i_1 = indices(out, 1)
nothing
out[i_1, i_2] = a[i_1, i_2] + b[i_1, i_2]
nothing
end
nothing
end
end
If you try it with, say, a seven-dimensional input, you’ll be glad you didn’t have to write the code yourself.
for i_7 = indices(out, 7)
for i_6 = indices(out, 6)
for i_5 = indices(out, 5)
for i_4 = indices(out, 4)
for i_3 = indices(out, 3)
for i_2 = indices(out, 2)
for i_1 = indices(out, 1)
out[i_1, i_2, i_3, i_4, i_5, i_6, i_7] = a[i_1, i_2, i_3, i_4, i_5, i_6, i_7] + b[i_1, i_2, i_3, i_4, i_5, i_6, i_7]
# Some output omitted
Base.Cartesian is a powerful framework and has many more elegant tools, but that illustrates the core point.
Here’s a bonus. Addition is of course defined over any number of input arrays. The same tools we used for generic dimensionality can be used to generalise over the number of inputs, too:
@generated function addn!(out, xs::Vararg{Any,N}) where N
  quote
    for i = 1:length(out)
      out[i] = @ncall $N (+) j -> xs[j][i]
    end
  end
end
Again, remove the @generated to see what’s happening:
julia> addn!(zs, xs, xs, ys, ys) |> macroexpand |> MacroTools.prettify
quote
for i = 1:length(out)
out[i] = (xs[1])[i] + (xs[2])[i] + (xs[3])[i] + (xs[4])[i]
end
end
If we put this together, we can make an N-dimensional, N-argument version of kernel_vadd for the GPU (where @cuindex hides the messy ND indexing):
@generated function kernel_vadd(out, xs::NTuple{N}) where N
  quote
    I = @cuindex(out)
    out[I...] = @ncall $N (+) j -> xs[j][I...]
    return
  end
end
@cuda (1, length(xs)) kernel_vadd(zs, (xs, ys))
This short kernel can now add any number of arrays of any dimension; is it still just “CUDA with Julia syntax”, or is it something more?
Functions for Nothing
Julia has more tricks up its sleeve. It automatically specialises higher-order functions, which means that if we write:
function kernel_zip2(f, out, a, b)
  i = (blockIdx().x-1) * blockDim().x + threadIdx().x
  out[i] = f(a[i], b[i])
  return
end
@cuda (1, length(xs)) kernel_zip2(+, zs, xs, ys)
it behaves and performs exactly like kernel_vadd, but we can use any binary function without writing extra code. For example, we can now subtract two arrays:
@cuda (1, length(xs)) kernel_zip2(-, zs, xs, ys)
Combining this with the above, we now have all the tools we need to write a generic broadcast kernel (if you’re unfamiliar with array broadcasting, think of it as a slightly more general map). This is implemented in the CuArrays package loaded earlier, so you can immediately write:
julia> σ(x) = 1 / (1 + exp(-x))
julia> σ.(xs)
1024-element CuArray{Float64,1}:
0.547526
0.6911
⋮
(Which, if we generalise kernel_vadd in the ways outlined above, is just an “add” using the σ function and a single input.)
There’s no trace of it in our code, but Julia will compile a custom GPU kernel to run this high-level expression. Julia will also fuse multiple broadcasts together, so a compound expression such as zs .= xs.^2 .+ ys .* 2 creates a single kernel call, with no memory allocation or temporary arrays required. Pretty cool – and well out of the reach of any other system I know of.
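Broadcast fusion isn’t GPU-specific: the same dotted syntax fuses on ordinary CPU arrays, which makes its semantics easy to check locally (a minimal sketch; the particular expression is illustrative):

```julia
# All the dotted operations below fuse into a single loop, and
# writing with .= fills zs in place instead of allocating a
# temporary array for each intermediate operation.
xs, ys = rand(1024), rand(1024)
zs = zeros(1024)

zs .= xs .^ 2 .+ ys .* 2

# Element by element, the fused loop computed xs[i]^2 + ys[i]*2:
all(zs[i] == xs[i]^2 + ys[i] * 2 for i in eachindex(zs))  # true
```

On a CuArray the identical expression compiles to one GPU kernel instead of one CPU loop.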
& Derivatives for Free
If you look at the original kernel_vadd above, you’ll notice that there are no types mentioned. Julia is duck typed, even on the GPU, and this kernel will work for anything that supports the right operations.
For example, the inputs don’t have to be CuArrays, as long as they look like arrays and can be transferred to the GPU. If we add a range of numbers to a CuArray like so:
@cuda (1, length(xs)) kernel_vadd(xs, xs, 1:1024)
the range 1:1024 is never actually allocated in memory; the elements [1, 2, ..., 1024] are computed on the fly as needed on the GPU. The element type of the array is also generic, and only needs to support +; so Int + Float64 works, as above, but we can also use user-defined number types.
A powerful example is the dual number. A dual number is simply a pair of numbers, like a complex number; it’s a value that carries around its own derivative.
julia> the usage of ForwardDiff
julia> f(x) = x^2 + 2x + 3
julia> x = ForwardDiff.Dual(5, 1)
Dual{Void}(5,1)
julia> f(x)
Dual{Void}(38,12)
The final Dual carries the value that we expect from f (5^2 + 2*5 + 3 == 38), but also the derivative (2*5 + 2 == 12).
Dual numbers have an amazingly high power-to-simplicity ratio and are extremely fast, but they are completely impractical in most languages. Julia makes them easy, and what’s more, a vector of dual numbers will transparently carry out the derivative computation on the GPU.
julia> xs = CuArray(ForwardDiff.Dual.(1:1024, 1))
julia> f.(xs)
1024-element CuArray{ForwardDiff.Dual{Void,Int64,1},1}:
Dual{Void}(6,4)
Dual{Void}(11,6)
Dual{Void}(18,8)
⋮
julia> σ.(xs)
1024-element CuArray{ForwardDiff.Dual{Void,Float64,1},1}:
Dual{Void}(0.731059,0.196612)
Dual{Void}(0.880797,0.104994)
Dual{Void}(0.952574,0.0451767)
⋮
Not only is there no overhead compared with hand-writing the equivalent CUDA kernel; there’s no overhead at all! In my benchmarks, taking a derivative using dual numbers is just as fast as computing only the value with raw floats. Pretty impressive.
In machine learning frameworks, it’s common to need a “layer” for every possible activation function: sigmoid, relu, tanh and so on. Having this trick in our toolkit means that backpropagation through any scalar function will work for free.
Overall, GPU kernels in Julia are amazingly generic, across types, dimensions and arity. Want to broadcast an integer range, a matrix of dual numbers, and a 6D array of floats? Go ahead – a single, extremely fast GPU kernel will give you the result.
xs = CuArray(ForwardDiff.Dual.(randn(100,100), 1))
ys = CuArray(randn(1, 100, 5, 5, 5))
(1:100) .* xs ./ ys
100×100×5×5×5 Array{ForwardDiff.Dual{Void,Float64,1},5}:
[:, :, 1, 1, 1] =
Dual{Void}(0.0127874,-0.427122) … Dual{Void}(-0.908558,-0.891798)
Dual{Void}(0.97554,-2.56273) … Dual{Void}(-8.22101,-5.35079)
Dual{Void}(-7.13571,-4.27122) Dual{Void}(2.14025,-8.91798)
⋮    ⋱
The full broadcasting machinery in CuArrays is 60 lines long. While not completely trivial, this is an incredible amount of functionality to get from so little code. CuArrays itself is under 400 source lines, while providing nearly all general array operations (indexing, concatenation, permutedims and so on) in a similarly generic way.
Julia’s ability to spit out specialised code is unrivalled, and I’m excited to see where this leads in future. For example, it would be relatively easy to build a Theano-like framework in Julia, generating specialised kernels for larger computations. Either way, I think we’ll be hearing more about Julia and GPUs as time goes on.
Full credit for the work behind this goes to Tim Besard and Jarrett Revels, respective authors of the excellent CUDAnative and ForwardDiff packages.