Lecture 6

Agenda
📚 GPU computing & performance assessment (continued)
💻 Unit testing and reference tests
🚧 Exercises:

  • Data transfer optimisations on GPUs

  • GPU codes for diffusion 2D

  • Reference testing in Julia


Content

👉 get started with exercises


GPU computing and performance assessment

The goal of this lecture 6 is to tackle:

GPU architecture and kernel programming

We'll get started with a brief overview of the Nvidia GPU architecture and how to program it.

The Nvidia general purpose GPUs we will use in this course can be programmed using the CUDA language extension.

CUDA is accessible in Julia via CUDA.jl, which exposes most of the native CUDA features to the Julia ecosystem.

Note, however, that CUDA.jl does not use nvcc, the Nvidia compiler, but compiles GPU code, like other Julia code, just ahead of time with LLVM.

First, let's distinguish among CPU, GPU, hardware, application and CUDA.

What are host and device?

The host is the system CPU. The system memory (DRAM) linked to the CPU is the host memory. The GPU is called a device and GPU memory is device memory.

The GPU hardware is composed of Global (DRAM) memory, L2 cache and many streaming multi-processors (SMs). Each SM contains many compute units (called "CUDA cores" by Nvidia), registers, L1 cache (can be repurposed as shared memory depending on the architecture) and read-only memory.

The CUDA programming model provides an abstraction of GPU architecture that acts as a bridge between an application and its possible implementation on GPU hardware. [ref]

In the CUDA programming model, blocks of threads compose the grid. In our implementation, we want to map one thread to each finite-difference cell of the 2D Cartesian domain.

The figure hereafter depicts the relation between the CUDA domain and the finite-difference domain:

[Figure: CUDA grid of thread blocks mapped onto the 2D finite-difference domain]

Indices ix and iy replace the loop indices and provide a "vectorised" map of threads, which is the key to leveraging GPU performance. We'll come back to this in the second part of this lecture.

In the CUDA programming model, blocks (red) of threads compose the grid (green).

In our implementations, we will map one thread (red box) to each cell of the 2D Cartesian domain (blue). Other mappings are possible, of course.

How does it relate to the GPU hardware?

All threads of a block are guaranteed to be executed concurrently on an SM (yellow box) and therefore share SM resources such as registers, L1 cache (/shared memory) and read-only memory.

We'll see later that the performance of a GPU application is highly sensitive to the optimal choice of the thread, block, grid layout, the so-called kernel launch parameters.

Writing a Julia GPU function (aka kernel) copying array B to array A with the layout from the above figure looks as follows:

using CUDA

function copy!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
    A[ix,iy] = B[ix,iy]
    return
end

threads = (4, 3)
blocks  = (2, 2)
nx, ny  = threads[1]*blocks[1], threads[2]*blocks[2]
A       = CUDA.zeros(Float64, nx, ny)
B       =  CUDA.rand(Float64, nx, ny)

@cuda blocks=blocks threads=threads copy!(A, B)
synchronize()

Playing with GPUs: the rules

With this short overview we should have the important concepts in mind to get started with GPU computing 🚀

💡 Note
A more complete introduction to CUDA (or refresher) can be accessed here. Julia GPU resources can be accessed at https://juliagpu.org.

GPU computing and performance assessment

The goal of this part is to:

  1. Learn about:

  1. Consolidate:

This content is distributed under MIT licence. Authors: S. Omlin (CSCS), L. Räss (ETHZ).

In order to get started, we need to connect to a machine which has GPU(s).

Let's take some time to get started.

👉 Getting started:

💡 Note
Values reported in this notebook are for the Nvidia P100 16GB PCIe GPU.

We will use the packages CUDA and BenchmarkTools to create a little performance laboratory:

import Pkg; Pkg.add("BenchmarkTools");
using CUDA
using BenchmarkTools

Scientific applications' performance

The performance of most scientific applications nowadays is bound by memory access speed (memory-bound) rather than by the speed computations can be done (compute-bound).

The reason is that current GPUs (and CPUs) can do many more computations in a given amount of time than they can access numbers from main memory.

This imbalance can be quantified by dividing the computation peak performance [GFLOP/s] by the memory access peak performance [GB/s] and multiplying by the size of a number in bytes (for simplicity, theoretical peak performance values as specified by the vendors can be used). For example, for the Tesla P100 GPU, it is:

\frac{5300~\mathrm{[GFlop/s]}}{732~\mathrm{[GB/s]}}~×~8 = 58

(here computed with double precision values taken from the vendor's product specification sheet).

So we can do 58 floating point operations per number read from main memory or written to it.
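As a quick cross-check, the same machine balance can be computed in Julia; a small sketch, assuming the vendor peak values for the Tesla P100 quoted above:

# machine balance: how many Flop can be performed per number transferred to/from memory
peak_flops_dp  = 5300e9          # double-precision peak performance [Flop/s]
peak_bandwidth = 732e9           # peak memory bandwidth [B/s]
flops_per_number = peak_flops_dp / (peak_bandwidth / sizeof(Float64))
println(round(flops_per_number)) # ≈ 58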

As a consequence, we can consider floating point operations to be "for free" when we work in the memory-bound regime, as in this lecture.

Therefore, let us start by investigating the performance of different ways to express and launch GPU memory copies. We will wrap all of these memory copies in functions, to enable the Julia compiler to best optimise them.

There already exists the function copyto!, which copies data from one pre-allocated array to another; thus, we start by analysing this function's performance.

But first, let us list what GPUs are available and make sure we assign no more than one user per GPU:

collect(devices())
device!(0) # select a GPU between 0-7

To this purpose, we allocate two arrays and benchmark the function using BenchmarkTools:

nx = ny = 32
A = CUDA.zeros(Float64, nx, ny);
B = CUDA.rand(Float64, nx, ny);
@benchmark begin copyto!($A, $B); synchronize() end
💡 Note
Previously defined variables are interpolated with $ into the benchmarked expression.
⚠️ Warning!
If not specified otherwise, CUDA.zeros(nx, ny) allocates Float32.

Time samples resulting from benchmarking as just performed normally follow a right-skewed distribution.

For such a distribution, the median is the most robust of the commonly used estimators of the central tendency; the minimum is in general also a good estimator, as hardware cannot by accident run faster than ideally possible, and it is as a result commonly used for reporting performance (for more information on estimators see here).
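If you want to inspect these estimators yourself, a small sketch based on a @benchmark result could look as follows (BenchmarkTools reports times in nanoseconds):

using Statistics # for median
res   = @benchmark begin copyto!($A, $B); synchronize() end
t_med = median(res).time  / 1e9   # median sample time [s]
t_min = minimum(res).time / 1e9   # minimum sample time [s]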

Using @belapsed instead of @benchmark, we directly obtain the minimum of the taken time samples:

t_it = @belapsed begin copyto!($A, $B); synchronize() end

Now, we know that it does not take "an awful lot of time". Of course, we do not want to stop here, but figure out how good the achieved performance was.

To this aim, we compute the total memory throughput, T_tot [GB/s], which is defined as the volume of the copied data [GB] divided by the time spent [s]:

T_tot = 2*1/1e9*nx*ny*sizeof(Float64)/t_it
💡 Note
The factor 2 comes from the fact that the data is read and written (2 operations).

Compare now T_tot with the known peak memory throughput, T_peak, which is found e.g. in scientific or vendor publications (for the Nvidia Tesla P100 GPUs, it is 559 GB/s, according to this source).

💡 Note
Achievable peak memory throughput is usually significantly lower than the theoretical peak bandwidth announced by the vendor (for the Tesla P100 GPUs, the latter is 732 GB/s as noted already earlier).
💡 Note
Here 1 GB is 1e9 bytes, as in the publication from which the peak memory throughput of the Tesla P100 GPU was obtained.

You have surely found T_tot to be orders of magnitude below T_peak. This is to be expected when copying a small array.
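To quantify this, a simple sketch dividing the measured throughput by the achievable peak (using the P100 value quoted above; adapt it to your GPU):

T_peak = 559.0   # [GB/s], achievable peak of the Tesla P100 (see source above)
println("T_tot/T_peak = $(round(T_tot/T_peak*100, sigdigits=3)) %")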

Let us determine how T_tot behaves with increasing array sizes:

array_sizes = []
throughputs = []
for pow = 0:11
    nx = ny = 32*2^pow
    if (3*nx*ny*sizeof(Float64) > CUDA.available_memory()) break; end
    A = CUDA.zeros(Float64, nx, ny);
    B = CUDA.rand(Float64, nx, ny);
    t_it = @belapsed begin copyto!($A, $B); synchronize() end
    T_tot = 2*1/1e9*nx*ny*sizeof(Float64)/t_it
    push!(array_sizes, nx)
    push!(throughputs, T_tot)
    println("(nx=ny=$nx) T_tot = $(T_tot)")
    CUDA.unsafe_free!(A)
    CUDA.unsafe_free!(B)
end

You can observe that the best performance is on par with T_peak or a bit lower (522 GB/s measured with the Tesla P100 GPU), as copyto! is a function that needs to work in all possible cases and is not specifically optimised for particular hardware.

Furthermore, we note that the best performance is obtained for large arrays (on the order of gigabytes).
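If you would like to visualise this scaling, a minimal sketch assuming the Plots package is installed:

using Plots
plot(array_sizes, throughputs; xscale=:log10, marker=:circle, legend=false,
     xlabel="nx (= ny)", ylabel="T_tot [GB/s]", title="copyto! throughput")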

We will use the array size for which we obtained the best result for the remainder of the performance experiments:

T_tot_max, index = findmax(throughputs)
nx = ny = array_sizes[index]
A = CUDA.zeros(Float64, nx, ny);
B = CUDA.rand(Float64, nx, ny);

GPU array programming

Let us now create our own memory copy function using GPU Array Programming (AP).

We can write a memory copy simply as A .= B; wrapped in a function using Julia's concise notation, it looks as follows:

@inbounds memcopy_AP!(A, B) = (A .= B)
💡 Note
We use the @inbounds macro to make sure no array bounds checking is performed, which would significantly slow down the execution.
💡 Note
A = B would not do a memcopy, but make A an alias of B, i.e. make A point to the same data in memory as B.

We also benchmark it and compute T_tot:

t_it = @belapsed begin memcopy_AP!($A, $B); synchronize() end
T_tot = 2*1/1e9*nx*ny*sizeof(Float64)/t_it

The performance you observe might be a little lower than with the copyto! function (measured 496 GB/s with the Tesla P100 GPU).

The few experiments that we have done together so far have shown you already that performing memory copy with maximal possible performance (T_peak) is not a completely trivial task.

GPU kernel programming

We will now use GPU Kernel Programming (KP) to try to get closer to T_peak.

A memory copy kernel can be written e.g. as follows:

@inbounds function memcopy_KP!(A, B)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
    A[ix,iy] = B[ix,iy]
    return nothing
end

Then, in order to copy the (entire) array B to A, we need to launch the kernel such that the above indices ix and iy map exactly to each array cell.

Therefore, we need to have blocks[1]*threads[1] == nx and blocks[2]*threads[2] == ny.

We will try first with the simplest possible option using only one thread per block:

threads = (1, 1)
blocks  = (nx, ny)
t_it = @belapsed begin @cuda blocks=$blocks threads=$threads memcopy_KP!($A, $B); synchronize() end
T_tot = 2*1/1e9*nx*ny*sizeof(Float64)/t_it

T_tot is certainly orders of magnitude below T_peak with these kernel launch parameters.

We need to take into account that single threads cannot run completely independently, but threads are launched in small groups within a block, called warps; a warp consists of 32 threads on current GPUs.

Furthermore, warps should access contiguous memory for best performance.

We therefore retry using 32 threads (one warp) per block as follows:

threads = (32, 1)
blocks  = (nx÷threads[1], ny)
t_it = @belapsed begin @cuda blocks=$blocks threads=$threads memcopy_KP!($A, $B); synchronize() end
T_tot = 2*1/1e9*nx*ny*sizeof(Float64)/t_it
💡 Note
For simplicity, the number of threads was set explicitly to 32 here; a more future-proof approach would be to retrieve the warp size from the corresponding CUDA attribute by doing: attribute(device(),CUDA.DEVICE_ATTRIBUTE_WARP_SIZE).
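For instance, a sketch deriving the launch configuration from the queried warp size could look like this:

warp_size = attribute(device(), CUDA.DEVICE_ATTRIBUTE_WARP_SIZE)  # 32 on current Nvidia GPUs
threads   = (warp_size, 1)
blocks    = (nx÷threads[1], ny)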

T_tot is now probably of the same order of magnitude as T_peak, yet depending on the GPU used it can still be significantly below it (302 GB/s measured with the Tesla P100 GPU).

If T_tot is significantly below T_peak, then we need to set the number of threads per block closer to the maximum the GPU allows.

Let us determine how T_tot behaves with an increasing number of threads per block:

max_threads  = attribute(device(),CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)
thread_count = []
throughputs  = []
for pow = Int(log2(32)):Int(log2(max_threads))
    threads = (2^pow, 1)
    blocks  = (nx÷threads[1], ny)
    t_it = @belapsed begin @cuda blocks=$blocks threads=$threads memcopy_KP!($A, $B); synchronize() end
    T_tot = 2*1/1e9*nx*ny*sizeof(Float64)/t_it
    push!(thread_count, prod(threads))
    push!(throughputs, T_tot)
    println("(threads=$threads) T_tot = $(T_tot)")
end

You should observe now that beyond a certain minimum number of threads per block (64 with the Tesla P100 GPU), T_tot is quite close to T_peak (which exact thread/block configuration leads to the best T_tot depends on the GPU architecture used).

Instead of increasing the number of threads only in the x dimension, we can also do so in the y dimension.

We keep, though, 32 threads in the x dimension in order to let the warps access contiguous memory:

thread_count = []
throughputs  = []
for pow = 0:Int(log2(max_threads/32))
    threads = (32, 2^pow)
    blocks  = (nx÷threads[1], ny÷threads[2])
    t_it = @belapsed begin @cuda blocks=$blocks threads=$threads memcopy_KP!($A, $B); synchronize() end
    T_tot = 2*1/1e9*nx*ny*sizeof(Float64)/t_it
    push!(thread_count, prod(threads))
    push!(throughputs, T_tot)
    println("(threads=$threads) T_tot = $(T_tot)")
end

T_tot is in general even slightly better. Much more important, though, is that a thread block now accesses not a 1D line of the arrays, but a 2D block.

We will see later that this is of great benefit when, e.g., computing finite difference derivatives in x and y direction.

So far, we experimented with memory copy in the strict sense: copy an array from one place to the other. When doing computations, we often read more data than we write.

We will therefore also do a few experiments on another commonly benchmarked case: read two arrays and write only one.

We therefore modify the previous kernel to take a third array C as input and add it to B (the rest is identical):

@inbounds function memcopy2_KP!(A, B, C)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
    A[ix,iy] = B[ix,iy] + C[ix,iy]
    return nothing
end

Then, we test, exactly as for the previous kernel, how T_tot behaves with an increasing number of threads per block in the y dimension, keeping it fixed to 32 in the x dimension:

C = CUDA.rand(Float64, nx, ny);
thread_count = []
throughputs  = []
for pow = 0:Int(log2(max_threads/32))
    threads = (32, 2^pow)
    blocks  = (nx÷threads[1], ny÷threads[2])
    t_it = @belapsed begin @cuda blocks=$blocks threads=$threads memcopy2_KP!($A, $B, $C); synchronize() end
    T_tot = 3*1/1e9*nx*ny*sizeof(Float64)/t_it
    push!(thread_count, prod(threads))
    push!(throughputs, T_tot)
    println("(threads=$threads) T_tot = $(T_tot)")
end
💡 Note
There is now a factor 3 instead of 2 in the computation of T_tot: 2 arrays are read and 1 written (3 operations).

Compare now the best measured T_tot to the T_peak obtained from the publication; if it is higher, then we need to correct T_peak to the measured T_tot value (the T_tot measured with the Tesla P100 GPU is 561 GB/s, i.e., 2 GB/s higher than the T_peak obtained from the publication mentioned earlier).

Note that the T_peak reported in the publication was obtained with a slightly different kernel which additionally multiplies C by a scalar; it is usually referred to as a triad.

For completeness, we will also quickly benchmark a triad kernel.

To this purpose, we will directly use the best thread/block configuration that we have found in the previous experiment:

@inbounds function memcopy_triad_KP!(A, B, C, s)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
    A[ix,iy] = B[ix,iy] + s*C[ix,iy]
    return nothing
end

s = rand()

T_tot_max, index = findmax(throughputs)
threads = (32, thread_count[index]÷32)
blocks  = (nx÷threads[1], ny÷threads[2])
t_it = @belapsed begin @cuda blocks=$blocks threads=$threads memcopy_triad_KP!($A, $B, $C, $s); synchronize() end
T_tot = 3*1/1e9*nx*ny*sizeof(Float64)/t_it

There should be no significant difference between T_tot of this triad kernel and of the previous kernel (with the Tesla P100 GPU, it is 561 GB/s with both kernels).

Finally, let us also check the triad performance we obtain with GPU array programming:

@inbounds memcopy_triad_AP!(A, B, C, s) = (A .= B .+ s.*C)

t_it = @belapsed begin memcopy_triad_AP!($A, $B, $C, $s); synchronize() end
T_tot = 3*1/1e9*nx*ny*sizeof(Float64)/t_it

T_tot is probably a bit lower than in the above experiment, but still rather close to T_peak.

Congratulations! You have successfully made it through the memory copy kernel optimization experiments and learned about the fundamental parameters determining memory throughput.

One moment! For the following exercises you will need the parameters we have established here for best memory access:

println("nx=ny=$nx; threads=$threads; blocks=$blocks")

back to Content

Unit testing and reference tests in Julia

(Unit) testing is pervasive in the Julia ecosystem thanks to efficient built-in tools and a culture encouraging testing.

This JuliaCon 2021 talk gives a nice overview: more than 90% of all registered packages have at least some tests, with the median package having about 25% of the code being tests.

Terms:

The goals of this lecture are

(How to setup CI as part of a project of yours will be taught later)

Registered Packages: CI tests & using as documentation

Let's look at a simple package: UnPack.jl

UnPack.jl

💻 -> "demo"

Registered Packages: test locally

Using: UnPack.jl

Installed packages can be tested:

pkg> add UnPack

pkg> test UnPack

Registered Packages: test locally

Going one step further: make and test changes to a package. dev the package:

pkg> dev UnPack

This will check out the package to ~/.julia/dev/UnPack.

Restart Julia with this package activated:

$ cd ~/.julia/dev/UnPack
$ julia --project

In package mode run the tests:

(UnPack) pkg> test
    Testing UnPack
      Status `/tmp/jl_LgpabA/Project.toml`
  [3a884ed6] UnPack v1.0.2 `~/julia/dot-julia-dev/UnPack`
...

If you edit the source, e.g. to fix a bug, re-run the tests before submitting a PR.

Write your own tests

Start easy:

Step up:

Another day:

Write your own tests: demo with "car_travel.jl" from Lecture 1

using Plots

function car_travel_1D()
    # Physical parameters
    V     = 113.0          # speed, km/h
    L     = 200.0          # length of segment, km
    dir   = 1              # switch 1 = go right, -1 = go left
    ttot  = 16.0           # total time, h
    # Numerical parameters
    dt    = 0.1            # time step, h
    nt    = Int(cld(ttot, dt))  # number of time steps
    # Array initialisation
    T     = zeros(nt)
    X     = zeros(nt)
    # Time loop
    for it = 2:nt
        T[it] = T[it-1] + dt
        X[it] = X[it-1] + dir*V*dt  # move the car
        if X[it] > L
            dir = -1      # if beyond L, go back (left)
        elseif X[it] < 0
            dir = 1       # if beyond 0, go back (right)
        end
    end
    # Visualisation
    # display(scatter(T, X, markersize=5, xlabel="time, hrs", ylabel="distance, km", framestyle=:box, legend=:none))
    return T, X
end

T, X = car_travel_1D()
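As a first flavour of in-line unit tests, one could check properties that follow directly from the parameters above; a small sketch (the expected values follow from dt = 0.1 h and V = 113 km/h):

using Test

@testset "car_travel_1D" begin
    T, X = car_travel_1D()
    @test length(T) == length(X)   # one position per time step
    @test T[2] ≈ 0.1               # first time step equals dt
    @test X[2] ≈ 11.3              # first displacement equals V*dt
end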

Write your own tests: demo with "car_travel.jl" from Lecture 1

Steps:

  1. generate a project and add scripts/car_travel.jl

  2. use reference tests

  3. add some unit tests in-line

  4. move the tests to test/runtests.jl

💡 Note
To make the pkg> test run, you have to have a file src/MyPkg.jl, even if it is just empty.

Write your own tests: demo with "car_travel.jl" from Lecture 1

Step 1: generate a package

$ cd to-some-dir
$ julia --project

julia> using Pkg; Pkg.generate("L6Testing")

Steps 3–4 are in the repository course-101-0250-00-L6Testing.jl; note that these steps are encoded in the git history, which the README links to.

💡 Note
For outputs from big simulations, such as ours, it makes sense to reference-test only at a few tens of indices.
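A minimal sketch of such a reference test, assuming ReferenceTests.jl and BSON.jl are added to the test environment (file name and comparison helper are illustrative):

using Test, ReferenceTests, BSON

"Compare all dict entries, approximately"
comp(d1, d2) = keys(d1) == keys(d2) &&
               all(v1 ≈ v2 for (v1, v2) in zip(values(d1), values(d2)))

T, X = car_travel_1D()
inds = Int.(ceil.(LinRange(1, length(T), 12)))   # reference-test only at a few indices

@testset "Reference tests" begin
    @test_reference "reftest-files/X.bson" Dict(:T => T[inds], :X => X[inds]) by=comp
end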

back to Content

Exercises - lecture 6

⚠️ Warning!
Exercises (including the .ipynb) have to be uploaded to your private (shared) GitHub repository, in a specific folder for each lecture. The git commit hash (or SHA) of the final push needs to be uploaded on Moodle (more).

Exercise 1 - Data transfer optimisations

👉 See Logistics for submission details.

The goal of this exercise is to:

Prerequisites:

This content is distributed under MIT licence. Authors: S. Omlin (CSCS), L. Räss (ETHZ).

Getting started

👉 Download the lecture6_ex1.ipynb notebook and edit it.

💡 Note
Values reported in this exercise are for the Nvidia P100 16GB PCIe GPU.

We will again use the packages CUDA, BenchmarkTools and Plots to create a little performance laboratory:

using IJulia
using CUDA
using BenchmarkTools
using Plots

Before we go further, make sure we select the GPU we want to run on (if running on a multi-GPU node). In the terminal or in the Julia REPL in shell mode (type ;), run the nvidia-smi command to list the visible GPUs. Remember the GPU_ID you want to use.

Then, in Julia, add following if you decide to, e.g., use GPU 0:

GPU_ID = 0
device!(GPU_ID)
⚠️ Warning!
Having multiple users accessing the same GPU will result in severe performance degradation.

Let us consider the following 2-D heat diffusion solver (the comments explain the code):

function diffusion2D()
    # Physics
    lam      = 1.0                                          # Thermal conductivity
    c0       = 2.0                                          # Heat capacity
    lx, ly   = 10.0, 10.0                                   # Length of computational domain in dimension x and y

    # Numerics
    nx, ny   = 32*2, 32*2                                   # Number of gridpoints in dimensions x and y
    nt       = 100                                          # Number of time steps
    dx       = lx/(nx-1)                                    # Space step in x-dimension
    dy       = ly/(ny-1)                                    # Space step in y-dimension
    _dx, _dy = 1.0/dx, 1.0/dy

    # Array initializations
    T    = CUDA.zeros(Float64, nx, ny)                      # Temperature
    Ci   = CUDA.zeros(Float64, nx, ny)                      # 1/Heat capacity
    qTx  = CUDA.zeros(Float64, nx-1, ny-2)                  # Heat flux, x component
    qTy  = CUDA.zeros(Float64, nx-2, ny-1)                  # Heat flux, y component
    dTdt = CUDA.zeros(Float64, nx-2, ny-2)                  # Change of Temperature in time

    # Initial conditions
    Ci .= 1/c0                                              # 1/Heat capacity (could vary in space)
    T  .= CuArray([10.0*exp(-(((ix-1)*dx-lx/2)/2)^2-(((iy-1)*dy-ly/2)/2)^2) for ix=1:size(T,1), iy=1:size(T,2)]) # Initialization of Gaussian temperature anomaly

    # Time loop
    dt  = min(dx^2,dy^2)/lam/maximum(Ci)/4.1                # Time step for 2D Heat diffusion
    opts = (aspect_ratio=1, xlims=(1, nx), ylims=(1, ny), clims=(0.0, 10.0), c=:davos, xlabel="Lx", ylabel="Ly") # plotting options
    for it = 1:nt
        diffusion2D_step!(T, Ci, qTx, qTy, dTdt, lam, dt, _dx, _dy) # Diffusion time step.
        if it % 10 == 0
            IJulia.clear_output(true)
            display(heatmap(Array(T)'; opts...))            # Visualization
            sleep(0.1)
        end
    end
end
💡 Note
Divisions are precomputed as they are slower than multiplications.

The function to compute an actual time step is still missing to complete this solver. It can be written, e.g., as follows with finite differences using GPU array programming (AP):

@inbounds @views macro d_xa(A) esc(:( ($A[2:end  , :     ] .- $A[1:end-1, :     ]) )) end
@inbounds @views macro d_xi(A) esc(:( ($A[2:end  ,2:end-1] .- $A[1:end-1,2:end-1]) )) end
@inbounds @views macro d_ya(A) esc(:( ($A[ :     ,2:end  ] .- $A[ :     ,1:end-1]) )) end
@inbounds @views macro d_yi(A) esc(:( ($A[2:end-1,2:end  ] .- $A[2:end-1,1:end-1]) )) end
@inbounds @views macro  inn(A) esc(:( $A[2:end-1,2:end-1]                          )) end

@inbounds @views function diffusion2D_step!(T, Ci, qTx, qTy, dTdt, lam, dt, _dx, _dy)
    qTx     .= .-lam.*@d_xi(T).*_dx                              # Fourier's law of heat conduction: qT_x  = -λ ∂T/∂x
    qTy     .= .-lam.*@d_yi(T).*_dy                              # ...                               qT_y  = -λ ∂T/∂y
    dTdt    .= @inn(Ci).*(.-@d_xa(qTx).*_dx .- @d_ya(qTy).*_dy)  # Conservation of energy:           ∂T/∂t = 1/cp (-∂qT_x/∂x - ∂qT_y/∂y)
    @inn(T) .= @inn(T) .+ dt.*dTdt                               # Update of temperature             T_new = T_old + ∂t ∂T/∂t
end
💡 Note
We use views everywhere to avoid allocations of temporary arrays (see here for more information).
💡 Note
We use dots everywhere to fuse vectorised operations and avoid any allocations of temporary arrays (see here for more information). We wrote all dots explicitly for clarity; the macro @. removes the need to write all dots explicitly.
💡 Note
We use simple macros to enable nice syntax; in particular, macros can also be used on the left-hand side of an assignment (learn more about macros here).

Run now the 2-D heat diffusion solver to verify that it is working:

diffusion2D()

Task 1 (Benchmarking)

Benchmark the function diffusion2D_step! using BenchmarkTools and compute a straightforward lower bound of the total memory throughput, T_tot_lb; then, compare it to the peak memory throughput, T_peak. You can compute T_tot_lb considering only full array reads and writes and knowing that there is no data reuse between different GPU array computation statements as each statement is translated into a separate and independently launched kernel (note that to obtain the actual T_tot, one would need to use a profiler).

Furthermore, use the nx=ny found best in the introduction notebook (1_memorycopy.ipynb) to allocate the necessary arrays if the amount of memory of your GPU allows it (else divide this nx and ny by 2).

To help you, there is already some code below to initialize the required arrays and scalars for the benchmarking.

💡 Note
hint: Do not forget to interpolate these predefined variables into the benchmarking expression using $ and note that you do not need to call the solver itself (diffusion2D)!

nx = ny = # complete!
T    = CUDA.rand(Float64, nx, ny);
Ci   = CUDA.rand(Float64, nx, ny);
qTx  = CUDA.zeros(Float64, nx-1, ny-2);
qTy  = CUDA.zeros(Float64, nx-2, ny-1);
dTdt = CUDA.zeros(Float64, nx-2, ny-2);
lam = _dx = _dy = dt = rand();
# solution
t_it = @belapsed begin diffusion2D_step!($T, $Ci, $qTx, $qTy, $dTdt, $lam, $dt, $_dx, $_dy); synchronize() end
T_tot_lb = 11*1/1e9*nx*ny*sizeof(Float64)/t_it

Save the measured minimal runtime and the computed T_tot_lb in other variables (t_it_task1 and T_tot_lb_task1) in order not to overwrite them later (adapt these two lines if you used other variable names!); moreover, we will remove the arrays we no longer need in order to save space:

t_it_task1 = t_it
T_tot_lb_task1 = T_tot_lb
CUDA.unsafe_free!(qTx)
CUDA.unsafe_free!(qTy)
CUDA.unsafe_free!(dTdt)

T_tot_lb should be relatively close to T_peak. Nevertheless, one could do these computations at least three times faster. You may wonder why it is possible to predict that just by looking at the code. It is because three of the four arrays that are updated every iteration are not computed based on their values in the previous iteration; their individual values could therefore be computed on-the-fly when needed or stored in the much faster on-chip memory as intermediate results, and these three arrays would never need to be stored in main memory and read from there. Only the temperature array (T) needs inevitably to be read from main memory and written to it at every iteration, as it is computed based on its values from the previous iteration (and the entire temperature array is orders of magnitude bigger than the available on-chip memory). In addition, the heat capacity array (Ci) needs to be entirely read at every iteration. To sum up, all but three of the eleven full array memory reads or writes can be avoided. If we avoid them, we reduce the main memory accesses by more than a factor three and can therefore expect the code to be at least three times faster.

As a consequence, T_tot and T_tot_lb are often not good metrics to evaluate the optimality of an implementation. Based on these reflections, we will introduce a better metric for the performance evaluation of solvers as the above. But first, let us verify that we can indeed speed up these computations by a factor three or more.

With GPU kernel programming, we could do that as just described, fusing the four kernels that correspond to the four GPU array programming statements into one. However, we want to try an easier solution using GPU array programming at this point.

There is, however, no obvious way to compute values on-the-fly when needed or to store intermediate results on-chip in order to achieve what was described above. We can instead do the mathematically equivalent: we can substitute qTx and qTy into the expression to compute dTdt and then substitute this in turn into the expression to compute T to get:

T_\mathrm{new} = T_\mathrm{old} + ∂t~\frac{1}{c_p} \left( -\frac{∂}{∂x} \left(-λ~\frac{∂T}{∂x}\right) -\frac{∂}{∂y} \left(-λ~\frac{∂T}{∂y}\right) \right) \qquad (2)

Note that this would obviously be mathematically equivalent to the temperature update rule that we would obtain based on the commonly used heat diffusion equation for constant and scalar λ:

T_\mathrm{new} = T_\mathrm{old} + ∂t~\frac{λ}{c_p} \left( \frac{∂^2T}{∂x^2} + \frac{∂^2T}{∂y^2} \right)

We will, though, use (2) in order not to make limiting assumptions or simplify the computations done in diffusion2D_step!, but solely to optimise data transfer.

We therefore remove the arrays qTx, qTy and dTdt in the main function of the 2-D heat diffusion solver as they are no longer needed; moreover, we introduce T2 as a second array for the temperature. T2 is needed to write newly computed temperature values to a different location than the old temperature values while the latter are still needed for computations (else we would perform the spatial derivatives partly with new temperature values instead of only with old ones). Here is the resulting main function:

function diffusion2D()
    # Physics
    lam      = 1.0                                          # Thermal conductivity
    c0       = 2.0                                          # Heat capacity
    lx, ly   = 10.0, 10.0                                   # Length of computational domain in dimension x and y

    # Numerics
    nx, ny   = 32*2, 32*2                                   # Number of gridpoints in dimensions x and y
    nt       = 100                                          # Number of time steps
    dx       = lx/(nx-1)                                    # Space step in x-dimension
    dy       = ly/(ny-1)                                    # Space step in y-dimension
    _dx, _dy = 1.0/dx, 1.0/dy

    # Array initializations
    T    = CUDA.zeros(Float64, nx, ny)                      # Temperature
    T2   = CUDA.zeros(Float64, nx, ny)                      # 2nd array for Temperature
    Ci   = CUDA.zeros(Float64, nx, ny)                      # 1/Heat capacity

    # Initial conditions
    Ci .= 1/c0                                              # 1/Heat capacity (could vary in space)
    T  .= CuArray([10.0*exp(-(((ix-1)*dx-lx/2)/2)^2-(((iy-1)*dy-ly/2)/2)^2) for ix=1:size(T,1), iy=1:size(T,2)]) # Initialization of Gaussian temperature anomaly
    T2 .= T;                                                 # Assign also T2 to get correct boundary conditions.

    # Time loop
    dt  = min(dx^2,dy^2)/lam/maximum(Ci)/4.1                # Time step for 2D Heat diffusion
    opts = (aspect_ratio=1, xlims=(1, nx), ylims=(1, ny), clims=(0.0, 10.0), c=:davos, xlabel="Lx", ylabel="Ly") # plotting options
    for it = 1:nt
        diffusion2D_step!(T2, T, Ci, lam, dt, _dx, _dy)     # Diffusion time step.
        if it % 10 == 0
            IJulia.clear_output(true)
            display(heatmap(Array(T)'; opts...))            # Visualization
            sleep(0.1)
        end
        T, T2 = T2, T                                       # Swap the aliases T and T2 (does not perform any array copy)
    end
end

Task 2 (GPU array programming)

Write the corresponding function diffusion2D_step! to compute a time step using the temperature update rule (2); write it in a single GPU array programming statement (it should go over multiple lines) and without using any helper macros or functions in order to be sure that all computations will get fused into one single kernel.

💡 Note
hint: Make sure to use the correct function signature: diffusion2D_step!(T2, T, Ci, lam, dt, _dx, _dy).

💡 Note
hint: To verify that it does the right computations, you can launch diffusion2D().
💡 Note
hint: Only add the @inbounds macro to the function once you have verified that it works as it should.
# solution
@inbounds @views function diffusion2D_step!(T2, T, Ci, lam, dt, _dx, _dy)
    T2[2:end-1,2:end-1] .= T[2:end-1,2:end-1] .+ dt.*Ci[2:end-1,2:end-1].*(                        # T_new = T_old + ∂t 1/cp (
                            .- (   (.-lam.*(T[3:end  ,2:end-1] .- T[2:end-1,2:end-1]).*_dx)
                                .- (.-lam.*(T[2:end-1,2:end-1] .- T[1:end-2,2:end-1]).*_dx)).*_dx  #         - ∂(-λ ∂T/∂x)/∂x
                            .- (   (.-lam.*(T[2:end-1,3:end  ] .- T[2:end-1,2:end-1]).*_dy)
                                .- (.-lam.*(T[2:end-1,2:end-1] .- T[2:end-1,1:end-2]).*_dy)).*_dy  #         - ∂(-λ ∂T/∂y)/∂y
                            )                                                                      #         )
end

Task 3 (Benchmarking)

Benchmark the new function diffusion2D_step! and compute the runtime speed-up compared to the function benchmarked in Task 1. Then, compute T_tot_lb and the ratio between this T_tot_lb and the one obtained in Task 1.

# solution
T2 = CUDA.zeros(Float64, nx, ny);
t_it = @belapsed begin diffusion2D_step!($T2, $T, $Ci, $lam, $dt, $_dx, $_dy); synchronize() end
speedup = t_it_task1/t_it
T_tot_lb = 3*1/1e9*nx*ny*sizeof(Float64)/t_it
ratio_T_tot_lb = T_tot_lb/T_tot_lb_task1

Save the measured minimal runtime and the computed T_tot_lb in other variables (t_it_task3 and T_tot_lb_task3) in order not to overwrite them later (adapt these two lines if you used other variable names!):

t_it_task3 = t_it
T_tot_lb_task3 = T_tot_lb

You should have observed a significant speedup (a speedup of factor 2 measured with the Tesla P100 GPU) even though T_tot_lb has probably decreased (to 214 GB/s with the Tesla P100 GPU, i.e. about 56% of the T_tot_lb measured in Task 1). This empirically confirms our earlier statement that T_tot_lb, and consequently also T_tot (measured with a profiler), are often not good metrics to evaluate the optimality of an implementation.

A good metric should certainly be tightly linked to observed runtime. We will now try to further speed up the function diffusion2D_step! using straightforward GPU kernel programming.

Task 4 (GPU kernel programming)

Rewrite the function diffusion2D_step! using GPU kernel programming: from within this function, call a GPU kernel which updates the temperature using update rule (2) (you also need to write this kernel); for simplicity's sake, hardcode the kernel launch parameter threads found best in the introduction (l6_1-gpu-memcopy.ipynb) into the function and compute blocks accordingly in order to have it work with the existing main function diffusion2D (use the function size instead of nx and ny to compute blocks).

💡 Note
You can base yourself on the kernel memcopy_triad_KP! from the introduction notebook to help you remember the very basics of GPU kernel programming.

💡 Note
In this kind of kernel, the computations are described for one array cell (here T2[ix,iy]) rather than for whole arrays - just like in a for loop; moreover, if-statements allow us to make sure we remain within the array boundaries (in a for loop this is achieved with the loop ranges).
💡 Note
To verify that it does the right computations, you can launch diffusion2D() (as in task 2).
💡 Note
Add the @inbounds macro directly in front of the temperature assignment (T2[ix,iy]), as otherwise it does not propagate to the computations (more information on the propagation of @inbounds can be found here; however, as noted earlier, outside of these exercises it is often more convenient to activate and deactivate bounds-checking globally instead of using the @inbounds macro).
💡 Note
Only add the @inbounds macro to the function once you have verified that it works as it should (as in Task 2).
# solution
function diffusion2D_step!(T2, T, Ci, lam, dt, _dx, _dy)
    threads = (32, 8)
    blocks  = (size(T2,1)÷threads[1], size(T2,2)÷threads[2])
    @cuda blocks=blocks threads=threads update_temperature!(T2, T, Ci, lam, dt, _dx, _dy)
end

function update_temperature!(T2, T, Ci, lam, dt, _dx, _dy)
    ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
    iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
    if (ix>1 && ix<size(T2,1) && iy>1 && iy<size(T2,2))
        @inbounds T2[ix,iy] = T[ix,iy] + dt*(Ci[ix,iy]*(
                              - ((-lam*(T[ix+1,iy] - T[ix,iy])*_dx) - (-lam*(T[ix,iy] - T[ix-1,iy])*_dx))*_dx
                              - ((-lam*(T[ix,iy+1] - T[ix,iy])*_dy) - (-lam*(T[ix,iy] - T[ix,iy-1])*_dy))*_dy
                              ))
    end
    return
end

Task 5 (Benchmarking)

Just like in Task 3, benchmark the new function diffusion2D_step! and compute the runtime speedup compared to the function benchmarked in Task 1. Then, compute T_tot_lb and the ratio between this T_tot_lb and the one obtained in Task 1.

# solution
t_it = @belapsed begin diffusion2D_step!($T2, $T, $Ci, $lam, $dt, $_dx, $_dy); synchronize() end
speedup = t_it_task1/t_it
T_tot_lb = 3*1/1e9*nx*ny*sizeof(Float64)/t_it
ratio_T_tot_lb = T_tot_lb/T_tot_lb_task1

The runtime speedup is probably even higher (a speedup of factor 5 measured with the Tesla P100 GPU), even though T_tot_lb is probably somewhat similar to the one obtained in Task 1 (524 GB/s with the Tesla P100 GPU, i.e. about 36% above the T_tot_lb measured in Task 1). We will now define a better metric for the performance evaluation of solvers like the one above, which is always proportional to observed runtime.

To this aim, let us recall first the reflections made after benchmarking the original GPU array programming code in Task 1:

three of the four arrays that are updated every iteration are not computed based on their values in the previous iteration; their individual values could therefore be computed on-the-fly when needed or stored in the much faster on-chip memory as intermediate results, and these three arrays would never need to be stored in main memory and read from there. Only the temperature array (T) needs inevitably to be read from main memory and written to it at every iteration, as it is computed based on its values from the previous iteration (and the entire temperature array is orders of magnitude bigger than the available on-chip memory). In addition, the heat capacity array (Ci) needs to be entirely read at every iteration. To sum up, all but three of the eleven full array memory reads or writes can be avoided. If we avoid them, we reduce the main memory accesses by more than a factor three and can therefore expect the code to be at least three times faster.

With this in mind, we will now define the metric, which we call the effective memory throughput, T_eff.

The effective memory access, A_eff [GB], is the sum of twice the memory footprint of the unknown fields, D_u (fields that depend on their own history and that need to be updated every iteration), and the known fields, D_k, that do not change every iteration. The effective memory access divided by the execution time per iteration, t_it [sec], defines the effective memory throughput, T_eff [GB/s]:

A_\mathrm{eff} = 2~D_\mathrm{u} + D_\mathrm{k}

T_\mathrm{eff} = \frac{A_\mathrm{eff}}{t_\mathrm{it}}

The upper bound of T_eff is T_peak, as measured e.g. by McCalpin, 1995 for CPUs or a GPU analogue. Defining the T_eff metric, we assume that 1) we evaluate an iterative stencil-based solver, 2) the problem size is much larger than the cache sizes and 3) the usage of time blocking is not feasible or advantageous (which is a reasonable assumption for real-world applications). An important concept is not to include fields within the effective memory access that do not depend on their own history (e.g. fluxes); such fields can be re-computed on the fly or stored on-chip. Defining a theoretical upper bound for T_eff that is closer to the real upper bound is work in progress.

Task 6 (Benchmarking)

Compute the effective memory throughput, T_eff, for the solvers benchmarked in Tasks 1, 3 and 5 (you do not need to redo any benchmarking, but can compute it based on the measured runtimes saved in these three tasks) and recompute the speedups achieved in Tasks 3 and 5 based on T_eff instead of on the runtime; compare the newly computed speedups with the previous ones.

# solution
T_eff_task1 = (2*1+1)*1/1e9*nx*ny*sizeof(Float64)/t_it_task1
T_eff_task3 = (2*1+1)*1/1e9*nx*ny*sizeof(Float64)/t_it_task3
T_eff_task5 = (2*1+1)*1/1e9*nx*ny*sizeof(Float64)/t_it
speedup_Teff_task3 = T_eff_task3/T_eff_task1
speedup_Teff_task5 = T_eff_task5/T_eff_task1

Did the speedups you recomputed differ from the previous ones?

If yes, then you made a mistake. Due to the way T_eff is defined, it is always proportional to observed runtime and therefore reflects any runtime speedup one-to-one as long as the problem size and the numbers' data type are kept fixed. If, however, you increase these parameters, then T_eff will reflect the additionally performed work and therefore enables comparing the performance achieved as a function of the problem size (or of the data type). It even allows comparing the performance of different solvers or implementations, to a certain point.

Most importantly though, comparing a measured T_eff with T_peak informs us about room for performance improvement.

Task 7 (Benchmarking)

Compute by how many percent you could improve the performance of the solver at most:

#solution for P100
T_peak = ... # Peak memory throughput of the Tesla P100 GPU
@show T_eff/T_peak

Report the value and potentially a short explanation in the README.md on GitHub, within lecture 6 folder (do not forget to upload this finalised notebook as well).

back to Content


Exercise 2 - Solving PDEs on GPUs

👉 See Logistics for submission details.

The goal of this exercise is to:

Task 1

In there, place the Pf_diffusion_2D_perf_loop_fun.jl script you created for Lecture 5 homework (Exercise 1, Task 1). Duplicate the script and rename it Pf_diffusion_2D_perf_gpu.jl.

Getting inspiration from the material presented in lecture 6 and exercise 1, work out the necessary modifications to the Pf_diffusion_2D_perf_gpu.jl code in order to enable it to execute on the Nvidia P100 GPU. For this task, use a kernel programming approach.

Hereafter follows a step-wise list of the changes you'll need to perform, starting from your Pf_diffusion_2D_perf_loop_fun.jl code.

Add using CUDA at the top, in the packages.

Define, in the # Numerics section, the parameters to set the block and grid size such that the number of threads per block is fixed to threads = (32,4) (or to a better layout you could figure out from the performance assessment you did). Then define the number of blocks blocks such that nx = threads[1]*blocks[1] and similarly for ny.
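One possible sketch (variable names are illustrative; cld covers array sizes that are not exact multiples of the block size, in which case the kernels must guard against out-of-bounds indices as described below):

threads = (32, 4)                                     # number of threads per block (x, y)
blocks  = (cld(nx, threads[1]), cld(ny, threads[2]))  # number of blocks covering the whole grid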

In the # Array initialisation section, make sure to now initialise CUDA arrays. You can use CUDA.zeros(Float64,nx,ny) as the GPU variant of zeros(nx,ny). Also, you can use CuArray() to wrap a CPU array, turning it into a GPU array; CUDA.zeros(Float64,nx,ny) would be equivalent to CuArray(zeros(nx,ny)). This may be useful to, e.g., define initial conditions using broadcasting operations on CPU arrays and then wrap them into a GPU array for further use.
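For instance, an initial condition could be built with a CPU array comprehension and then wrapped into a GPU array; a sketch with illustrative names, assuming dx, dy, lx and ly are defined as in your CPU script:

Pf_cpu = [exp(-((ix-1)*dx - lx/2)^2 - ((iy-1)*dy - ly/2)^2) for ix = 1:nx, iy = 1:ny]
Pf     = CuArray(Pf_cpu)   # device array used in the solver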

Going to the compute functions (or "kernels"), remove the nested loop(s) and replace them by the CUDA-related vectorised indices unique to each thread:

ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
iy = (blockIdx().y-1) * blockDim().y + threadIdx().y

Pay attention that you need to guard against out-of-bounds array accesses (this was previously ensured by the loop bounds). A convenient way of doing so is using if conditions:

if (ix<=nx && iy<=ny)  Pf[ix,iy] = ...  end

Moving to the # Time loop, you'll now have to add information in order to allow a compute function (e.g. my_kernel!) to execute on the GPU. This can be achieved by adding the @cuda macro together with the blocks and threads launch parameters prior to the function call, turning, e.g.,

my_kernel!(...)

into

@cuda blocks=blocks threads=threads my_kernel!(...)
synchronize()

or alternatively

CUDA.@sync @cuda blocks=blocks threads=threads my_kernel!(...)
⚠️ Warning!
Don't forget to synchronize the device to ensure all threads reached the barrier before the next iteration to avoid erroneous results.

Finally, for visualisation, you'll need to "gather" information from the GPU array (CuArray) back to the CPU array (Array) in order to plot it. This can be achieved by calling Array(Pf) in your visualisation routine.

💡 Note
CuArray() allows you to "transform" a CPU (or host) array to a GPU (or device) array, while Array() allows you to bring back the GPU (device) array to a CPU (host) array.

Task 2

Ensure the GPU code produces results similar to the reference CPU loop code for nx = ny = 127 grid points. To assess this, save the output (pressure Pf) fields for both the CPU and GPU codes after e.g. 50 iterations, and make sure their difference is close to machine precision. You could use Julia's unit testing functionalities, e.g. @testset, for this task as well.
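Such a check could be as simple as the following sketch, where Pf_c and Pf_g are illustrative names for the CPU and GPU pressure fields saved after 50 iterations:

using Test

@testset "CPU vs GPU Pf diffusion" begin
    @test Array(Pf_g) ≈ Pf_c
end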

Task 3

Assess T_peak of the Nvidia Tesla P100 GPU. To do so, embed the triad benchmark (kernel programming version) from lecture 6 in a Julia script and use it to assess T_peak. Upload the script to your GitHub folder and save the T_peak value for the next task.

Task 4

Report in a figure you will insert in the README.md the effective memory throughput T_eff for the 2D fluid pressure diffusion GPU code as a function of the number of grid points nx = ny. Realise a weak scaling benchmark varying nx = ny = 32 .* 2 .^ (0:8) .- 1 (or until you run out of device memory). On the same figure, report as well T_peak from Task 3.

Comment on the T_eff and T_peak values achieved on the Tesla P100.

back to Content


Exercise 3 - Unit and reference tests

👉 See Logistics for submission details.

The goal of this exercise is to:

💡 Note
I had some odd errors caused by @views which I couldn't get to the bottom of. If you do too, just remove the @views.
💡 Note
I packaged the Demo of the lecture within the repo course-101-0250-00-L6Testing.jl, which should be the blueprint for this exercise.

Task:

back to Content