Agenda

📚 GPU computing & performance assessment (continued)

💻 Unit testing and reference tests

🚧 Exercises:

Data transfer optimisations on GPUs

GPU codes for diffusion 2D

Reference testing in Julia

GPU architecture and kernel programming

GPU computing and performance assessment

Unit testing and reference tests in Julia

We'll get started with a brief overview of the Nvidia GPU architecture and how to program it.

The Nvidia general purpose GPUs we will use in this course can be programmed using the CUDA language extension.

CUDA is accessible in Julia via CUDA.jl, which exposes most of the native CUDA features to the Julia ecosystem.

Note, however, that CUDA.jl does not use `nvcc`

, the Nvidia compiler, but compiles like other Julia code just ahead of time with LLVM.

First, let's distinguish among CPU, GPU, hardware, application and CUDA.

What are * host* and

The * host* is the system CPU. The system memory (DRAM) linked to the CPU is the host memory. The GPU is called a

The GPU hardware is composed of Global (DRAM) memory, L2 cache and many streaming multi-processors (SMs). Each SM contains many compute units (called "CUDA cores" by Nvidia), registers, L1 cache (can be repurposed as shared memory depending on the architecture) and read-only memory.

The CUDA programming model provides an abstraction of GPU architecture that acts as a bridge between an application and its possible implementation on GPU hardware.

[ref]

In the CUDA programming model, `blocks`

of `threads`

compose the `grid`

. In our implementation, we want to map one thread to each finite-difference cell of the 2D Cartesian domain.

The figure hereafter depicts the relation between the CUDA domain and the finite-difference domain:

Indices `ix`

and `iy`

replace the loop indices providing a "vectorised" map of threads - the core to leverage GPU performance. We'll come back to this in a second part of this lecture.

In the CUDA programming model, `blocks`

(red) of `threads`

compose the `grid`

(green).

In our implementations, we will map one thread (red box) to each cell of the 2D Cartesian domain (blue). Other mappings are possible, of course.

*How does it relate to the GPU hardware?*

All threads of a block are guaranteed to be executed concurrently on an SM (yellow box) and therefore share SM resources such as registers, L1 cache (/shared memory) and read-only memory.

We'll see later that the performance of a GPU application is highly sensitive to the optimal choice of the thread, block, grid layout, the so-called kernel launch parameters.

Writing a Julia GPU function (aka kernel) copying array `A`

to array `B`

with the layout from the above figure looks as follow

```
using CUDA
function copy!(A, B)
ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
A[ix,iy] = B[ix,iy]
return
end
threads = (4, 3)
blocks = (2, 2)
nx, ny = threads[1]*blocks[1], threads[2]*blocks[2]
A = CUDA.zeros(Float64, nx, ny)
B = CUDA.rand(Float64, nx, ny)
@cuda blocks=blocks threads=threads copy!(A, B)
synchronize()
```

**Playing with GPUs: the rules**

Current GPUs allow typically a maximum of 1024 threads per block.

The maximum number of blocks allowed is huge; computing the largest possible array on the GPU will make you run out of device memory (currently 16-80 GB) before hitting the maximal number of blocks when selecting sensible kernel launch parameters (usually threads per block >= 128).

Threads, blocks and grid have 3D "Cartesian" topology, which is very useful for 1D, 2D and 3D Cartesian finite-difference domains.

With this short overview we should have the important concepts in mind to get started with GPU computing 🚀

💡 Note

A more complete introduction to CUDA (or refresher) can be accessed here. Julia GPU resources can be accessed at https://juliagpu.org.

Learn about:

How to establish the peak memory throughput of your GPU

GPU array and kernel programming

Consolidate:

The basics of benchmarking

How to compute achieved memory throughput

*This content is distributed under MIT licence. Authors: S. Omlin (CSCS), L. Räss (ETHZ).*

In order to get started, we need to connect to a machine which has GPU(s).

Let's take some time to get started.

👉 Getting started:

Fetch your login infos in the

`daint_login.md`

file within your personal Polybox folder.Head to Software install for the directions.

Finally, fetch the

`l6_1-gpu-memcopy.ipynb`

notebooks for this lecture and upload them to your`scratch`

on Piz Daint.

💡 Note

Values reported in this notebook are for the Nvidia P100 16GB PCIe GPU.

We will use the packages `CUDA`

and `BenchmarkTools`

to create a little performance laboratory:

```
import Pkg; Pkg.add("BenchmarkTools");
using CUDA
using BenchmarkTools
```

The performance of most scientific applications nowadays is bound by memory access speed (*memory-bound*) rather than by the speed computations can be done (*compute-bound*).

The reason is that current GPUs (and CPUs) can do many more computations in a given amount of time than they can access numbers from main memory.

This imbalance can be quantified by dividing the computation peak performance [GFLOP/s] by the memory access peak performance [GB/s] and multiplied by the size of a number in Bytes (for simplicity, theoretical peak performance values as specified by the vendors can be used). For example for the Tesla P100 GPU, it is:

$\frac{5300 ~\mathrm{[GFlop/s]}}{732 ~\mathrm{[GB/s]}}~×~8 = 58$(here computed with double precision values taken from the vendor's product specification sheet).

So we can do 58 floating point operations per number read from main memory or written to it.

As a consequence, we can consider **floating point operations be "for free"** when we work in the memory-bounded regime as in this lecture.

Therefore, let us start with investigating the performance of different ways to express and launch GPU memory copies. We will wrap all of these memory copies in functions, to enable the Julia compiler to optimize them best.

There exists already the function `copyto!`

, which permits to copy data from one pre-allocated array to another; thus, we start with analysing this function's performance.

But first, let us list what GPUs are available and make sure we assign no more than one user per GPU:

```
collect(devices())
device!(0) # select a GPU between 0-7
```

To this purpose, we allocate two arrays and benchmark the function using `BenchmarkTools`

:

```
nx = ny = 32
A = CUDA.zeros(Float64, nx, ny);
B = CUDA.rand(Float64, nx, ny);
@benchmark begin copyto!($A, $B); synchronize() end
```

💡 Note

Previously defined variables are interpolated with

`$`

into the benchmarked expression.⚠️ Warning!

If not specified otherwise,

`CUDA.zeros(nx, ny)`

allocates `Float32`

.Time samples resulting from benchmarking as just performed follow normally a right skewed distribution.

For such distribution, the median is the most robust of the commonly used estimators of the central tendency; the minimum is in general also a good estimator as hardware cannot by accident run faster than with the ideal and it is as a result commonly used for reporting performance (for more information on estimators see here).

Using `@belapsed`

instead of `@benchmark`

, we directly obtain the minimum of the taken time samples:

`t_it = @belapsed begin copyto!($A, $B); synchronize() end`

Now, we know that it does not take "an awful lot of time". Of course, we do not want to stop here, but figure out how good the achieved performance was.

To this aim, we compute the *total memory throughput*, `T_tot`

[GB/s], which is defined as the volume of the copied data [GB] divided by the time spent [s]:

`T_tot = 2*1/1e9*nx*ny*sizeof(Float64)/t_it`

💡 Note

The factor

`2`

comes from the fact that the data is read and written (`2`

operations).Compare now `T_tot`

with the known peak memory throughput, `T_peak`

, which is found e.g. in scientific or vendor publications (for the Nvidia Tesla P100 GPUs, it is 559 GB/s, according to this source.

💡 Note

Achievable peak memory throughput is usually significantly lower than the *theoretical peak bandwidth* announced by the vendor (for the Tesla P100 GPUs, the latter is 732 GB/s as noted already earlier).

💡 Note

Here 1 GB is 1e9 Bytes as in the publication, where the peak memory throughput of the Tesla P100 GPU was obtained from.

You have surely found `T_tot`

to be orders of magnitude below `T_peak`

. This is to be expected when copying a small array.

Let us determine how `T_tot`

behaves with increasing array sizes:

```
array_sizes = []
throughputs = []
for pow = 0:11
nx = ny = 32*2^pow
if (3*nx*ny*sizeof(Float64) > CUDA.available_memory()) break; end
A = CUDA.zeros(Float64, nx, ny);
B = CUDA.rand(Float64, nx, ny);
t_it = @belapsed begin copyto!($A, $B); synchronize() end
T_tot = 2*1/1e9*nx*ny*sizeof(Float64)/t_it
push!(array_sizes, nx)
push!(throughputs, T_tot)
println("(nx=ny=$nx) T_tot = $(T_tot)")
CUDA.unsafe_free!(A)
CUDA.unsafe_free!(B)
end
```

You can observe that the best performance is on pair with `T_peak`

or a bit lower (measured 522 GB/s with the Tesla P100 GPU) as `copyto!`

is a function that needs to work in all possible cases and it is not specifically optimised for a particular hardware.

Furthermore, we note that best performance is obtained for large arrays (in the order of Gigabytes).

We will use the array size for which we obtained the best result for the remainder of the performance experiments:

```
T_tot_max, index = findmax(throughputs)
nx = ny = array_sizes[index]
A = CUDA.zeros(Float64, nx, ny);
B = CUDA.rand(Float64, nx, ny);
```

Let us now create our own memory copy function using GPU *Array Programming* (AP).

We can write a memory copy simply as `A .= B`

; and wrap it in a function using Julia's concise notation, it looks as follows:

`@inbounds memcopy_AP!(A, B) = (A .= B)`

💡 Note

We use

`@inbounds`

macro to make sure no array bounds checking is performed, which would slow down significantly. Note, furthermore, that outside of these exercises it can be more convenient not to use the `@inbounds`

macro, but to deactivate bounds checking instead globally for high performance runs by calling julia as follows : `julia --check-bounds=no ...`

💡 Note

`A = B`

would not do a memcopy, but make `A`

an alias of `B`

, i.e. make `A`

point to the same data in memory as `B`

.We also benchmark it and compute `T_tot`

:

```
t_it = @belapsed begin memcopy_AP!($A, $B); synchronize() end
T_tot = 2*1/1e9*nx*ny*sizeof(Float64)/t_it
```

The performance you observe might be a little lower than with the `copyto!`

function (measured 496 GB/s with the Tesla P100 GPU).

The few experiments that we have done together so far have shown you already that performing memory copy with maximal possible performance (T_peak) is not a completely trivial task.

We will now use GPU *Kernel Programming* (KP) to try to get closer to `T_peak`

.

A memory copy kernel can be written e.g. as follows:

```
@inbounds function memcopy_KP!(A, B)
ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
A[ix,iy] = B[ix,iy]
return nothing
end
```

Then, in order to copy the (entire) array `B`

to `A`

, we need to launch the kernel such that the above indices `ix`

and `iy`

map exactly to each array cell.

Therefore, we need to have `blocks[1]*threads[1] == nx`

and `blocks[2]*threads[2] == ny`

.

We will try first with the simplest possible option using only one thread per block:

```
threads = (1, 1)
blocks = (nx, ny)
t_it = @belapsed begin @cuda blocks=$blocks threads=$threads memcopy_KP!($A, $B); synchronize() end
T_tot = 2*1/1e9*nx*ny*sizeof(Float64)/t_it
```

`T_tot`

is certainly orders of magnitude below `T_peak`

with this kernel launch parameters.

We need to take into account that single threads cannot run completely independently, but threads are launched in small groups within a block, called *warps*; a warp consists of 32 threads on current GPUs.

Furthermore, warps should access contiguous memory for best performance.

We therefore retry using 32 threads (one warp) per block as follows:

```
threads = (32, 1)
blocks = (nx÷threads[1], ny)
t_it = @belapsed begin @cuda blocks=$blocks threads=$threads memcopy_KP!($A, $B); synchronize() end
T_tot = 2*1/1e9*nx*ny*sizeof(Float64)/t_it
```

💡 Note

For simplicity, the number of threads was set here explicitly to 32; more future proof would be to retrieve the warp size from the corresponding CUDA attribute by doing:

`attribute(device(),CUDA.DEVICE_ATTRIBUTE_WARP_SIZE)`

.`T_tot`

is now probably in the order of magnitude of `T_peak`

, yet depending on the used GPU it can be still significantly below (measured 302 GB/s with the Tesla P100 GPU).

If `T_tot`

is significantly below `T_peak`

, then we need to set the numbers of threads per block closer to the maximum the GPU allows.

Let us determine how `T_tot`

behaves with an increasing number of threads per blocks:

```
max_threads = attribute(device(),CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)
thread_count = []
throughputs = []
for pow = Int(log2(32)):Int(log2(max_threads))
threads = (2^pow, 1)
blocks = (nx÷threads[1], ny)
t_it = @belapsed begin @cuda blocks=$blocks threads=$threads memcopy_KP!($A, $B); synchronize() end
T_tot = 2*1/1e9*nx*ny*sizeof(Float64)/t_it
push!(thread_count, prod(threads))
push!(throughputs, T_tot)
println("(threads=$threads) T_tot = $(T_tot)")
end
```

You should observe now that beyond a certain minimum number of threads per block (64 with the Tesla P100 GPU), `T_tot`

is quite close to `T_peak`

(which exact thread/block configuration leads to the best `T_tot`

depends on the used GPU architecture).

Instead of increasing the number of threads only in the x dimension, we can also do so in the y dimension.

We keep though 32 threads in the x dimension in order to let the warps access contiguous memory:

```
thread_count = []
throughputs = []
for pow = 0:Int(log2(max_threads/32))
threads = (32, 2^pow)
blocks = (nx÷threads[1], ny÷threads[2])
t_it = @belapsed begin @cuda blocks=$blocks threads=$threads memcopy_KP!($A, $B); synchronize() end
T_tot = 2*1/1e9*nx*ny*sizeof(Float64)/t_it
push!(thread_count, prod(threads))
push!(throughputs, T_tot)
println("(threads=$threads) T_tot = $(T_tot)")
end
```

`T_tot`

is even slightly better in general. Much more important is though that a thread block accesses now not a 1D-line of the arrays, but a 2D block.

We will see later that this is of great benefit when, e.g., computing finite difference derivatives in x and y direction.

So far, we experimented with memory copy in the strict sense: copy an array from one place to the other. When doing computations, we often read more data than we write.

We will therefore also do a few experiments on another commonly benchmarked case: read two arrays and write only one.

We modify therefore the previous kernel to take a third array `C`

as input and add it to `B`

(the rest is identical):

```
@inbounds function memcopy2_KP!(A, B, C)
ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
A[ix,iy] = B[ix,iy] + C[ix,iy]
return nothing
end
```

Then, we test exactly as for the previous kernel how `T_tot`

behaves with an increasing number of threads per blocks in y dimension, keeping it fixed to 32 in x dimension:

```
C = CUDA.rand(Float64, nx, ny);
thread_count = []
throughputs = []
for pow = 0:Int(log2(max_threads/32))
threads = (32, 2^pow)
blocks = (nx÷threads[1], ny÷threads[2])
t_it = @belapsed begin @cuda blocks=$blocks threads=$threads memcopy2_KP!($A, $B, $C); synchronize() end
T_tot = 3*1/1e9*nx*ny*sizeof(Float64)/t_it
push!(thread_count, prod(threads))
push!(throughputs, T_tot)
println("(threads=$threads) T_tot = $(T_tot)")
end
```

💡 Note

There is now a factor

`3`

instead of `2`

in the computation of `T_tot`

: `2`

arrays are read and `1`

written (`3`

operations).Compare now the best measured `T_tot`

to the `T_peak`

obtained from the publication and if it is higher, then it means we need to correct `T_peak`

to take the value of the `T_tot`

measured (`T_tot`

measured with the Tesla P100 GPU is 561 GB/s, i.e., 2 GB/s higher than the `T_peak`

obtained from the publication mentioned earlier).

Note that the `T_peak`

reported in the publication was obtained with a slightly different kernel which multiplies C with a scalar in addition; it is usually referred to as *triad*.

For completeness, we will also quickly benchmark a *triad* kernel.

To this purpose, we will directly use the best thread/block configuration that we have found in the previous experiment:

```
@inbounds function memcopy_triad_KP!(A, B, C, s)
ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
A[ix,iy] = B[ix,iy] + s*C[ix,iy]
return nothing
end
s = rand()
T_tot_max, index = findmax(throughputs)
threads = (32, thread_count[index]÷32)
blocks = (nx÷threads[1], ny÷threads[2])
t_it = @belapsed begin @cuda blocks=$blocks threads=$threads memcopy_triad_KP!($A, $B, $C, $s); synchronize() end
T_tot = 3*1/1e9*nx*ny*sizeof(Float64)/t_it
```

There should be no significant difference between `T_tot`

of this triad kernel and of the previous kernel (with the Tesla P100 GPU, it is 561 GB/s with both kernels).

Finally, let us also check the triad performance we obtain with GPU array programming:

```
@inbounds memcopy_triad_AP!(A, B, C, s) = (A .= B.+ s.*C)
t_it = @belapsed begin memcopy_triad_AP!($A, $B, $C, $s); synchronize() end
T_tot = 3*1/1e9*nx*ny*sizeof(Float64)/t_it
```

`T_tot`

is probably a bit lower than in the above experiment, but still rather close to `T_peak`

.

Congratulations! You have successfully made it through the memory copy kernel optimization experiments and learned about the fundamental parameters determining memory throughput.

One moment! For the following exercises you will need the parameters we have established here for best memory access:

`println("nx=ny=$nx; threads=$threads; blocks=$blocks")`

(Unit) testing is pervasive in the Julia ecosystem thanks to efficient built-in tools and a culture encouraging testing.

This JuliaCon 2021 talk gives a nice overview: more than 90% of all registered packages have at least some tests, with the median package having about 25% of the code being tests.

Terms:

"unit test": small tests, usually on a per-function basis

"integration test": test large part of the code base

"reference test": test against a previous output (not necessarily known whether "good" or "bad")

"Continuous Integration" (CI): running of the tests automatically on push to github/gitlab/etc

how to assess CI-run tests for packages of the Julia ecosystem (registered packages)

how to run tests for registered packages on your computer

how to use tests of packages as "documentation"

create a test-suite for a small project of your

learn to do reference-tests (besides unit-tests)

(How to setup CI as part of a project of yours will be taught later)

Let's look at a simple package: UnPack.jl

the CI-results are often displayed in form of

*badges*there are different CI-services, most used is GitHub-Actions

often the tests are a fairly good source of documentation by example

💻 -> "demo"

Using: UnPack.jl

Installed packages can be tested:

```
pkg> add UnPack
pkg> test UnPack
```

Going one step further. Make and test changes of a package. `dev`

the package:

`pkg> dev UnPack`

This will checkout the package to `~/.julia/dev/UnPack`

.

Re-Start Julia with this package activated:

```
$ cd ~/.julia/dev/UnPack
$ julia --project
```

In package mode run the tests:

```
(UnPack) pkg> test
Testing UnPack
Status `/tmp/jl_LgpabA/Project.toml`
[3a884ed6] UnPack v1.0.2 `~/julia/dot-julia-dev/UnPack`
...
```

If you edit the source, e.g. to fix a bug, re-run the tests before submitting a PR.

Start easy:

add test just to a script

Step up:

move tests to

`test/runtests.jl`

, the standard locationinclude scripts to just run-through

use "reference-tests" as integration tests

Another day:

setup CI on GitHub

```
using Plots
function car_travel_1D()
# Physical parameters
V = 113.0 # speed, km/h
L = 200.0 # length of segment, km
dir = 1 # switch 1 = go right, -1 = go left
ttot = 16.0 # total time, h
# Numerical parameters
dt = 0.1 # time step, h
nt = Int(cld(ttot, dt)) # number of time steps
# Array initialisation
T = zeros(nt)
X = zeros(nt)
# Time loop
for it = 2:nt
T[it] = T[it-1] + dt
X[it] = X[it-1] + dir*V*dt # move the car
if X[it] > L
dir = -1 # if beyond L, go back (left)
elseif X[it] < 0
dir = 1 # if beyond 0, go back (right)
end
end
# Visualisation
# display(scatter(T, X, markersize=5, xlabel="time, hrs", ylabel="distance, km", framestyle=:box, legend=:none))
return T, X
end
T, X = car_travel_1D()
```

Steps:

`generate`

a project and add`scripts/car_travel.jl`

use reference tests

add some unit tests in-line

move the tests to

`test/runtests.jl`

💡 Note

To make the

`pkg> test`

run, you have to have a file `src/MyPkg.jl`

, even if it is just empty.Step 1: generate a package

```
$ cd to-some-dir
$ julia --project
julia> using Pkg; Pkg.generate("L6Testing")
```

Steps 3–4 are in the repository course-101-0250-00-L6Testing.jl; note that this steps are encoded in the git history which the README links into.

💡 Note

For outputs from big simulations, such as ours, it make sense to only reference-test at a few 10s of indices.

⚠️ Warning!

Exercises (including the **specific folder for each lecture**. The git commit hash (or SHA) of the final push needs to be uploaded on Moodle (more).

`.ipynb`

) have to be uploaded to your private (shared) GitHub repository, in a 👉 See Logistics for submission details.

The goal of this exercise is to:

learn how to minimize redundant main memory transfers and understand its importance;

understand the limits of the

*total memory throughput*metric for performance evaluation;learn how to compute the

*effective memory throughput*and understand its interest;learn about GPU array and kernel programming on the way.

Prerequisites:

the lecture 6

*Benchmarking memory copy and establishing peak memory access performance*(`l6_1-gpu-memcopy.ipynb`

)

*This content is distributed under MIT licence. Authors: S. Omlin (CSCS), L. Räss (ETHZ).*

👉 Download the `lecture6_ex1.ipynb`

notebook and edit it.

Create a new folder in your GitHub repository for this week's (

`lecture6`

) exercises, including a`README.md`

(as usual).Hand-in the finalised notebook in your GitHub lectrue 6 folder and;

Report the output of Task 7 in the

`README.md`

.

💡 Note

Values reported in this exercise are for the Nvidia P100 16GB PCIe GPU.

We will again use the packages `CUDA`

, `BenchmarkTools`

and `Plots`

to create a little performance laboratory:

```
using IJulia
using CUDA
using BenchmarkTools
using Plots
```

Before we go further, make sure we select the GPU we want to run on (if running on a multi-GPU node). In the terminal or Julia REPL in shell mode (typing `;`

), type `nvidia-smi`

command to list visible GPUs. Remember the GPU_ID you want to use.

Then, in Julia, add following if you decide to, e.g., use GPU 0:

```
GPU_ID = 0
device!(GPU_ID)
```

⚠️ Warning!

Having multiple users accessing the same GPU will result in severe performance deprecation.

Let us consider the following 2-D heat diffusion solver (the comments explain the code):

```
function diffusion2D()
# Physics
lam = 1.0 # Thermal conductivity
c0 = 2.0 # Heat capacity
lx, ly = 10.0, 10.0 # Length of computational domain in dimension x and y
# Numerics
nx, ny = 32*2, 32*2 # Number of gridpoints in dimensions x and y
nt = 100 # Number of time steps
dx = lx/(nx-1) # Space step in x-dimension
dy = ly/(ny-1) # Space step in y-dimension
_dx, _dy = 1.0/dx, 1.0/dy
# Array initializations
T = CUDA.zeros(Float64, nx, ny) # Temperature
Ci = CUDA.zeros(Float64, nx, ny) # 1/Heat capacity
qTx = CUDA.zeros(Float64, nx-1, ny-2) # Heat flux, x component
qTy = CUDA.zeros(Float64, nx-2, ny-1) # Heat flux, y component
dTdt = CUDA.zeros(Float64, nx-2, ny-2) # Change of Temperature in time
# Initial conditions
Ci .= 1/c0 # 1/Heat capacity (could vary in space)
T .= CuArray([10.0*exp(-(((ix-1)*dx-lx/2)/2)^2-(((iy-1)*dy-ly/2)/2)^2) for ix=1:size(T,1), iy=1:size(T,2)]) # Initialization of Gaussian temperature anomaly
# Time loop
dt = min(dx^2,dy^2)/lam/maximum(Ci)/4.1 # Time step for 2D Heat diffusion
opts = (aspect_ratio=1, xlims=(1, nx), ylims=(1, ny), clims=(0.0, 10.0), c=:davos, xlabel="Lx", ylabel="Ly") # plotting options
for it = 1:nt
diffusion2D_step!(T, Ci, qTx, qTy, dTdt, lam, dt, _dx, _dy) # Diffusion time step.
if it % 10 == 0
IJulia.clear_output(true)
display(heatmap(Array(T)'; opts...)) # Visualization
sleep(0.1)
end
end
end
```

💡 Note

Divisions are precomputed as they are slower than multiplications.

The function to compute an actual time step is still missing to complete this solver. It can be written, e.g., as follows with finite differences using GPU *array programming* (AP):

```
@inbounds @views macro d_xa(A) esc(:( ($A[2:end , : ] .- $A[1:end-1, : ]) )) end
@inbounds @views macro d_xi(A) esc(:( ($A[2:end ,2:end-1] .- $A[1:end-1,2:end-1]) )) end
@inbounds @views macro d_ya(A) esc(:( ($A[ : ,2:end ] .- $A[ : ,1:end-1]) )) end
@inbounds @views macro d_yi(A) esc(:( ($A[2:end-1,2:end ] .- $A[2:end-1,1:end-1]) )) end
@inbounds @views macro inn(A) esc(:( $A[2:end-1,2:end-1] )) end
@inbounds @views function diffusion2D_step!(T, Ci, qTx, qTy, dTdt, lam, dt, _dx, _dy)
qTx .= .-lam.*@d_xi(T).*_dx # Fourier's law of heat conduction: qT_x = -λ ∂T/∂x
qTy .= .-lam.*@d_yi(T).*_dy # ... qT_y = -λ ∂T/∂y
dTdt .= @inn(Ci).*(.-@d_xa(qTx).*_dx .- @d_ya(qTy).*_dy) # Conservation of energy: ∂T/∂t = 1/cp (-∂qT_x/∂x - ∂qT_y/∂y)
@inn(T) .= @inn(T) .+ dt.*dTdt # Update of temperature T_new = T_old + ∂t ∂T/∂t
end
```

💡 Note

We use everywhere views to avoid allocations of temporary arrays (see here for more information).

💡 Note

💡 Note

We use simple macros to enable nice syntax, in particular macros can be used also on the left side of an equal sign (learn here more about macros).

Run now the 2-D heat diffusion solver to verify that it is working:

`diffusion2D()`

Benchmark the function `diffusion2D_step!`

using BenchmarkTools and compute a straightforward *lower bound of the total memory throughput*, `T_tot_lb`

; then, compare it to the *peak memory throughput*, `T_peak`

. You can compute `T_tot_lb`

considering only full array reads and writes and knowing that there is no data reuse between different GPU array computation statements as each statement is translated into a separate and independently launched kernel (note that to obtain the actual `T_tot`

, one would need to use a profiler).

Furthermore, use the `nx=ny`

found best in the introduction notebook (`1_memorycopy.ipynb`

) to allocate the necessary arrays if the amount of memory of your GPU allows it (else divide this `nx`

and `ny`

by 2).

To help you, there is already some code below to initialize the required arrays and scalars for the benchmarking.

💡 Note

`$`

and note that you do not need to call the solver itself (`diffusion2D`

)!```
nx = ny = # complete!
T = CUDA.rand(Float64, nx, ny);
Ci = CUDA.rand(Float64, nx, ny);
qTx = CUDA.zeros(Float64, nx-1, ny-2);
qTy = CUDA.zeros(Float64, nx-2, ny-1);
dTdt = CUDA.zeros(Float64, nx-2, ny-2);
lam = _dx = _dy = dt = rand();
```

```
# solution
t_it = @belapsed begin ... end
T_tot_lb = .../1e9*nx*ny*sizeof(Float64)/t_it
```

Save the measured minimal runtime and the computed `T_tot_lb`

in other variables (`t_it_task1`

and `T_tot_lb_task1`

) in order not to overwrite them later (adapt these two lines if you used other variable names!); moreover, we will remove the arrays we do no longer need in order to save space:

```
t_it_task1 = t_it
T_tot_lb_task1 = T_tot_lb
CUDA.unsafe_free!(qTx)
CUDA.unsafe_free!(qTy)
CUDA.unsafe_free!(dTdt)
```

`T_tot_lb`

should be relatively close to `T_peak`

. Nevertheless, one could do these computations at least three times faster. You may wonder why it is possible to predict that just looking at the code. It is because three of the four arrays that are updated every iteration are not computed based on their values in the previous iteration and their individual values could therefore be computed on-the-fly when needed or stored in the much faster on-chip memory as intermediate results; these three arrays would never need to be stored in main memory and read from there. Only the temperature array (`T`

) needs inevitably to be read from main memory and written to it at every iteration as is computed based on its values from the previous iteration (and the entire temperature array is orders of magnitudes bigger than the available on-chip memory). In addition, the heat capacity array (`Ci`

) needs to be entirely read at every iteration. To sum up, all but three of eleven full array memory reads or writes can be avoided. If we avoid them, we reduce the main memory accesses by more than a factor three and can therefore expect the code to be at least three times faster.

As a consequence, `T_tot`

and `T_tot_lb`

are often not good metrics to evaluate the optimality of an implementation. Based on these reflections, we will introduce a better metric for the performance evaluation of solvers as the above. But first, let us verify that we can indeed speed up these computations by a factor three or more.

With GPU kernel programming, we could do that as just described, fusing the four kernels that correspond to the four GPU array programming statements into one. However, we want to try an easier solution using GPU array programming at this point.

There is, however, no obvious way to compute values on-the-fly when needed or to store intermediate result on chip in order to achieve the above described. We can instead do the equivalent mathematically: we can substitute `qTx`

and `qTy`

into the expression to compute `dTdt`

and then substitute this in turn into the expression to compute `T`

to get:

Note that this would obviously be mathematically equivalent to the temperature update rule that we would obtain based on the commonly used heat diffusion equation **for constant and scalar** $λ$:

We will though use (2) in order not to make limiting assumptions and simplify the computations done in `diffusion2D_step!`

, but to solely optimize data transfer.

We remove therefore the arrays `qTx`

, `qTy`

and `dTdt`

in the main function of the 2-D heat diffusion solver as they are no longer needed; moreover, we introduce `T2`

as a second array for the temperature. `T2`

is needed to write newly computed temperature values to a different location then the old temperature values while they are still needed for computations (else we would perform the spatial derivatives partly with new temperature values instead of only with old ones). Here is the resulting main function:

```
function diffusion2D()
# Physics
lam = 1.0 # Thermal conductivity
c0 = 2.0 # Heat capacity
lx, ly = 10.0, 10.0 # Length of computational domain in dimension x and y
# Numerics
nx, ny = 32*2, 32*2 # Number of gridpoints in dimensions x and y
nt = 100 # Number of time steps
dx = lx/(nx-1) # Space step in x-dimension
dy = ly/(ny-1) # Space step in y-dimension
_dx, _dy = 1.0/dx, 1.0/dy
# Array initializations
T = CUDA.zeros(Float64, nx, ny) # Temperature
T2 = CUDA.zeros(Float64, nx, ny) # 2nd array for Temperature
Ci = CUDA.zeros(Float64, nx, ny) # 1/Heat capacity
# Initial conditions
Ci .= 1/c0 # 1/Heat capacity (could vary in space)
T .= CuArray([10.0*exp(-(((ix-1)*dx-lx/2)/2)^2-(((iy-1)*dy-ly/2)/2)^2) for ix=1:size(T,1), iy=1:size(T,2)]) # Initialization of Gaussian temperature anomaly
T2 .= T; # Assign also T2 to get correct boundary conditions.
# Time loop
dt = min(dx^2,dy^2)/lam/maximum(Ci)/4.1 # Time step for 2D Heat diffusion
opts = (aspect_ratio=1, xlims=(1, nx), ylims=(1, ny), clims=(0.0, 10.0), c=:davos, xlabel="Lx", ylabel="Ly") # plotting options
for it = 1:nt
diffusion2D_step!(T2, T, Ci, lam, dt, _dx, _dy) # Diffusion time step.
if it % 10 == 0
IJulia.clear_output(true)
display(heatmap(Array(T)'; opts...)) # Visualization
sleep(0.1)
end
T, T2 = T2, T # Swap the aliases T and T2 (does not perform any array copy)
end
end
```

Write the corresponding function `diffusion2D_step!`

to compute a time step using the temperature update rule (2); write it in **a single GPU array programming statement** (it should go over multiple lines) and without using any helper macros or functions in order to be sure that all computations will get fused into one single kernel.

💡 Note

`diffusion2D_step!(T2, T, Ci, lam, dt, _dx, _dy)`

.💡 Note

`diffusion2D()`

.💡 Note

`@inbounds`

macro to the function once you have verified that it work as they should. Remember that outside of these exercises it can be more convenient not to use the `@inbounds`

macro, but to deactivate bounds checking instead globally for high performance runs by calling julia as follows : `julia --check-bounds=no ...`

```
# solution
@inbounds @views function diffusion2D_step!(T2, T, Ci, lam, dt, _dx, _dy)
T2[2:end-1,2:end-1] .= T[2:end-1,2:end-1] .+ dt.* ...
end
```

Benchmark the new function `diffusion2D_step!`

and compute the runtime speed-up compared to the function benchmarked in Task 1. Then, compute `T_tot_lb`

and the ratio between this `T_tot_lb`

and the one obtained in Task 1.

```
# solution
T2 = ...
t_it = @belapsed begin ...; synchronize() end
speedup = t_it_task1/t_it
T_tot_lb = .../1e9*nx*ny*sizeof(Float64)/t_it
ratio_T_tot_lb = ...
```

Save the measured minimal runtime and the computed T*tot*lb in other variables (`t_it_task3`

and `T_tot_lb_task3`

) in order not to overwrite them later (adapt these two lines if you used other variable names!):

```
t_it_task3 = t_it
T_tot_lb_task3 = T_tot_lb
```

You should have observed a significant speedup (a speedup of factor 2 measured with the Tesla P100 GPU) even though `T_tot_lb`

has probably decreased (to 214 GB/s with the Tesla P100 GPU, i.e about 56% of `T_tot_lb`

measured in task 1). This empirically confirms our earlier statement that `T_tot_lb`

and consequently also `T_tot`

(measured with a profiler) are often not good metrics to evaluate the **optimality** of an implementation.

A good metric should certainly be tightly linked to observed runtime. We will now try to further speedup the function `diffusion2D_step!`

using straightforward GPU kernel programming.

Rewrite the function `diffusion2D_step!`

using GPU kernel programming: from within this function, call a GPU kernel, which updates the temperature using update rule (2) (you also need to write this kernel); for simplicity's sake, hardcode the kernel launch parameter `threads`

found best in the introduction (`l6_1-gpu-memcopy.ipynb`

) into the function and compute `blocks`

accordingly in order to have it work with the existing main function `diffusion2`

(use the function `size`

instead of `nx`

and `ny`

to compute `blocks`

).

💡 Note

You can base yourself on the kernel

`memcopy_triad_KP!`

from the introdution notebook to help you remember the very basics of GPU kernel programming.💡 Note

In this kind of kernels, the computations are described for one array cell (here

`T2[ix,iy]`

) rather than for whole arrays - just like in a for loop; moreover, if-statements allow to ensure to remain within the array boundaries (in for loop this is achieved with the loop ranges).💡 Note

To verify that it does the right computations, you can launch

`diffusion2D()`

(as in task 2).💡 Note

Add the

`@inbounds`

macro direcly in front of the Temperature assignement (`T2[ix,iy]`

) as else it does not propagate to the computations (more information on the propagation of `@inbounds`

can be found here; however, as noted earlier, outside of these exercises, it is often more convenient to activate and deactivate bounds-checking globally instead of using the `@inbounds`

macro).💡 Note

Only add the

`@inbounds`

macro to the function once you have verified that it work as they should (as in task 2).```
# solution
function diffusion2D_step!(...)
threads = (..., ...)
blocks = (size(...)÷threads[1], size(...)÷threads[2])
@cuda ...
end
function update_temperature!(...)
ix = ...
iy = ...
if (ix... && iy... )
@inbounds T2[ix,iy] = T[ix,iy] + dt*(Ci[ix,iy]*( ... ))
end
return
end
```

Just like in Task 3, benchmark the new function `diffusion2D_step!`

and compute the runtime speedup compared to the function benchmarked in Task 1. Then, compute `T_tot_lb`

and the ratio between this `T_tot_lb`

and the one obtained in Task 1.

```
# solution
t_it = @belapsed begin ...; synchronize() end
speedup = ...
T_tot_lb = .../1e9*nx*ny*sizeof(Float64)/t_it
ratio_T_tot_lb = ...
```

The runtime speedup is probably even higher (a speedup of factor 5 measured with the Tesla P100 GPU), even though `T_tot_lb`

is probably somewhat similar to the one obtained in task 1 (524 GB/s with the Tesla P100 GPU, i.e about 36% above `T_tot_lb`

measured in task 1). We will now define a better metric for the performance evaluation of solvers like the one above, which is always proportional to observed runtime.

To this aim, let us recall first the reflections made after benchmarking the original GPU array programming code in Task 1:

three of the four arrays that are updated every iteration are not computed based on their values in the previous iteration and their individual values could therefore be computed on-the-fly when needed or stored in the much faster on-chip memory as intermediate results; these three arrays would never need to be stored in main memory and read from there. Only the temperature array (

`T`

) needs inevitably to be read from main memory and written to it at every iteration as is computed based on its values from the previous iteration (and the entire temperature array is orders of magnitudes bigger than the available on-chip memory). In addition, the heat capacity array (`Ci`

) needs to be entirely read at every iteration. To sum up, all but three of eleven full array memory reads or writes can be avoided. If we avoid them, we reduce the main memory accesses by more than a factor three and can therefore expect the code to be at least three times faster.

With this in mind, we will now define the metric, which we call the *effective memory throughput*, $T_\mathrm{eff}$.

The effective memory access, $A_\mathrm{eff}$ [GB], is the the sum of twice the memory footprint of the unknown fields, $D_\mathrm{u}$, (fields that depend on their own history and that need to be updated every iteration) and the known fields, $D_\mathrm{k}$, that do not change every iteration. The effective memory access divided by the execution time per iteration, t_it [sec], defines the effective memory throughput, $T_\mathrm{eff}$ [GB/s]:

$A_\mathrm{eff} = 2~D_\mathrm{u} + D_\mathrm{k}$ $T_\mathrm{eff} = \frac{A_\mathrm{eff}}{t_\mathrm{it}}$The upper bound of $T_\mathrm{eff}$ is $T_\mathrm{peak}$ as measured e.g. by McCalpin, 1995 for CPUs or a GPU analogue. Defining the $T_\mathrm{eff}$ metric, we assume that 1) we evaluate an iterative stencil-based solver, 2) the problem size is much larger than the cache sizes and 3) the usage of time blocking is not feasible or advantageous (which is a reasonable assumption for real-world applications). An important concept is not to include fields within the effective memory access that do not depend on their own history (e.g. fluxes); such fields can be re-computed on the fly or stored on-chip. Defining a theoretical upper bound for $T_\mathrm{eff}$ that is closer to the real upper bound is work in progress.

Compute the effective memory throughput, $T_\mathrm{eff}$, for the solvers benchmarked in Task 1, 3 and 5 (you do not need to redo any benchmarking, but you can compute it based on the saved measured runtimes in these three tasks) and recompute the speedup achieved in Task 3 and 5 based on $T_\mathrm{eff}$ instead of based on the runtime; compare the newly computed speedups with the previous.

```
# solution
T_eff_task1 = .../t_it_task1
T_eff_task3 = .../t_it_task3
T_eff_task5 = .../t_it
speedup_Teff_task3 = T_eff_task3/T_eff_task1
speedup_Teff_task5 = T_eff_task5/T_eff_task1
```

Did the speedups you recomputed differ from the previous ones?

If yes, then you made a mistake. Due to the way $T_\mathrm{eff}$ is defined, it is always proportional to observed runtime and it reflects therefore any runtime speedup by 100% while the problem size and the number data type are kept fixed. If, however, you increase these parameters, then T_eff will reflect the additionally performed work and it therefore enables the comparison of the performance achieved in function of the problem size (or number data type). It even allows to compare the performance of different solvers or implementations to a certain point.

Most importantly though, comparing a measured $T_\mathrm{eff}$ with $T_\mathrm{peak}$ informs us about room for performance improvement.

Compute by how much percent you can improve the performance of the solver at most:

```
#solution for P100
T_peak = ... # Peak memory throughput of the Tesla P100 GPU
@show T_eff/T_peak
```

Report the value and potentially a short explanation in the `README.md`

on GitHub, within lecture 6 folder (do not forget to upload this finalised notebook as well).

👉 See Logistics for submission details.

The goal of this exercise is to:

Port the 2D fluid diffusion CPU code from Lecture 5 to GPU

Assess the performance of the GPU solver

In there, place the `Pf_diffusion_2D_perf_loop_fun.jl`

script you created for Lecture 5 homework (Exercise 1, Task 1). Duplicate the script and rename it `Pf_diffusion_2D_perf_gpu.jl`

.

Getting inspiration from the material presented in lecture 6 and exercise 1, work-out the necessary modifications in the `Pf_diffusion_2D_perf_gpu.jl`

code in order to enable it to execute on the Nvidia P100 GPU. For this task, * use a kernel programming approach*.

Hereafter, a step-wise list of changes you'll need to perform starting from your `Pf_diffusion_2D_perf_loop_fun.jl`

code.

Add `using CUDA`

at the top, in the packages.

Define, in the `# Numerics`

section, the parameters to set the block and grid size such that the number of threads per blocks are fixed to `threads = (32,4)`

(or to a better layout you could figure out from running the performance assessment you did). Define then the number of blocks `blocks`

to be computed such that `nx = threads[1]*blocks[1]`

and similarly for `ny`

.

In the `# Array initialisation`

section, make sure to now initialise CUDA arrays. You can use `CUDA.zeros(Float64,nx,ny)`

as the GPU variant of `zeros(nx,ny)`

. Also, you can use `CuArray()`

to wrap and CPU array turning it into a GPU array; `CUDA.zeros(Float64,nx,ny)`

would be equivalent to `CuArray(zeros(nx,ny))`

. This may be useful to, e.g., define initial conditions using broadcasting operations on CPU arrays and wrapping them in a GPU array for further use.

Going to the compute functions (or "kernels"), remove the nested loop(s) and replace them by the CUDA-related vectorised indices unique to each thread:

```
ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
```

Pay attention that you need to enforce array bound checking (this was previously done by the loop bounds). A convenient way of doing so is using `if`

conditions:

`if (ix<=nx && iy<=ny) Pf[ix,iy] = ... end`

Moving to the `# Time loop`

, you'll now have to add information in order to allow the compute function to execute on the GPU. This can be achieved by adding `@cuda blocks threads`

prior to the function call, turning, e.g.,

`compute!(...)`

into

```
@cuda blocks blocks=blocks threads=threads compute!(...)
synchronize()
```

or alternatively

`CUDA.@sync @cuda blocks=blocks threads=threads compute!(...)`

⚠️ Warning!

Don't forget to synchronize the device to ensure all threads reached the barrier before the next iteration to avoid erroneous results.

Finally, for visualisation, you'll need to "gather" information from the GPU array (`CuArray`

) back to the CPU array (`Array`

) in order to plot it. This can be achieved by calling `Array(Pf)`

in your visualisation routine.

💡 Note

`CuArray()`

allows you to "transform" a CPU (or host) array to a GPU (or device) array, while `Array()`

allows you to bring back the GPU (device) array to a CPU (host) array.Ensure the GPU code produces similar results as the reference CPU loop code for `nx = ny = 127`

number of grid points. To assess this, save the output (pressure `Pf`

) fields for both the CPU and GPU codes after e.g. 50 iterations, and make sure their difference is close to machine precision. You could use Julia's unit testing functionalities, e.g., `testset`

, for this task as well.

Assess $T_\mathrm{peak}$ of the Nvidia Tesla P100 GPU. To do so, embed the *triad* benchmark (kernel programming version) from lecture 6 in a Julia script and use it to assess $T_\mathrm{peak}$. Upload the script to your GitHub folder and save the $T_\mathrm{peak}$ value for next task.

Report in a figure you will insert in the `README.md`

the effective memory throughput $T_\mathrm{eff}$ for the 2D fluid pressure diffusion GPU code as function of number of grid points `nx = ny`

. Realise a weak scaling benchmark varying `nx = ny = 32 .* 2 .^ (0:8) .- 1`

(or until you run out of device memory). On the same figure, report as well $T_\mathrm{peak}$ from Task 3.

Comment on the $T_\mathrm{eff}$ and $T_\mathrm{peak}$ values achieved on the Tesla P100.

👉 See Logistics for submission details.

The goal of this exercise is to:

revisit the last part of the lecture

learn how testing works in Julia

💡 Note

I had some odd errors caused by

`@views`

which I couldn't get to the bottom of. If you do too, just remove the `@views`

.💡 Note

I packaged the Demo of the lecture within the repo course-101-0250-00-L6Testing.jl, which should be the blueprint for this exercise.

Task:

Use the

`l2_diffusion_1D.jl`

script as a base and rename it`diffusion_1D_test.jl`

.Define your own

`diff()`

function as`@views Diff(A) = A[2:end].-A[1:end-1]`

Create a Julia project

`L6TestingExercise`

within the exercise submission folder`lecture6`

. Use the`generate`

command of the REPL package-mode.Add

`l2_diffusion_1D.jl`

to a`scripts/`

folderYou should remove/disable the plotting. This will make the tests run faster. Remove the

`@views`

for the main function. Also return the final`C`

and`qx`

from the function.Make two unit tests for

`Diff(A)`

function; wrap them in a`@testset`

Make a reference-test which tests the value at 20 random indices of

`C`

and`qx`

against a truth.Make sure that all tests run and pass when called via package-mode

`test`

Last modified: October 03, 2023. Website built with Franklin.jl and the Julia programming language.