Agenda

📚 Parallel computing on CPUs & performance assessment, the $T_\mathrm{eff}$ metric

💻 Unit testing in Julia

🚧 Exercises:

CPU perf. codes for 2D diffusion and memcopy

Unit tests and testset implementation

Performance limiters

Effective memory throughput metric $T_\mathrm{eff}$

Parallel computing on CPUs

Shared memory parallelisation

Recent processors (CPUs and GPUs) have multiple (or many) cores

Recent processors use their parallelism to hide latency

Multi-core CPUs and GPUs share similar challenges

*Recall from lecture 1 ( why we do it) ...*

Use **parallel computing** (to address this):

The "memory wall" in ~ 2004

Single-core to multi-core devices

GPUs are massively parallel devices

SIMD machine (programmed using threads - SPMD) (more)

Further increases the FLOPS vs Bytes gap

Taking a look at a recent GPU and CPU:

Nvidia Tesla A100 GPU

AMD EPYC "Rome" 7282 (16 cores) CPU

Device | TFLOP/s (FP64) | Memory BW TB/s |
---|---|---|

Tesla A100 | 9.7 | 1.55 |

AMD EPYC 7282 | 0.7 | 0.085 |

Current GPUs (and CPUs) can do many more computations in a given amount of time than they can access numbers from main memory.

Quantify the imbalance:

$\frac{\mathrm{computation\;peak\;performance\;[TFLOP/s]}}{\mathrm{memory\;access\;peak\;performance\;[TB/s]}} × \mathrm{size\;of\;a\;number\;[Bytes]}$*(Theoretical peak performance values as specified by the vendors can be used).*

Back to our hardware:

Device | TFLOP/s (FP64) | Memory BW TB/s | Imbalance (FP64) |
---|---|---|---|

Tesla A100 | 9.7 | 1.55 | 9.7 / 1.55 × 8 = 50 |

AMD EPYC 7282 | 0.7 | 0.085 | 0.7 / 0.085 × 8 = 66 |

*(here computed with double precision values)*

**Meaning:** we can do 50 (GPU) and 66 (CPU) floating point operations per number accessed from main memory. Floating point operations are "for free" when we work in memory-bounded regimes

➡ Requires to re-think the numerical implementation and solution strategies

Most algorithms require only a few operations or FLOPS ...

... compared to the amount of numbers or bytes accessed from main memory.

First derivative example $∂A / ∂x$:

If we "naively" compare the "cost" of an isolated evaluation of a finite-difference first derivative, e.g., computing a flux $q$:

$q = -D~\frac{∂A}{∂x}~,$which in the discrete form reads `q[ix] = -D*(A[ix+1]-A[ix])/dx`

.

The cost of evaluating `q[ix] = -D*(A[ix+1]-A[ix])/dx`

:

1 reads + 1 write => $2 × 8$ = **16 Bytes transferred**

1 (fused) addition and division => **1 floating point operation**

assuming:

$D$, $∂x$ are scalars

$q$ and $A$ are arrays of

`Float64`

(read from main memory)

GPUs and CPUs perform 50 - 60 floating-point operations per number accessed from main memory

First derivative evaluation requires to transfer 2 numbers per floating-point operations

The FLOPS metric is no longer the most adequate for reporting the application performance of many modern applications on modern hardware.

Need for a memory throughput-based performance evaluation metric: $T_\mathrm{eff}$ [GB/s]

➡ Evaluate the performance of iterative stencil-based solvers.

The effective memory access $A_\mathrm{eff}$ [GB]

Sum of:

twice the memory footprint of the unknown fields, $D_\mathrm{u}$, (fields that depend on their own history and that need to be updated every iteration)

known fields, $D_\mathrm{k}$, that do not change every iteration.

The effective memory access divided by the execution time per iteration, $t_\mathrm{it}$ [sec], defines the effective memory throughput, $T_\mathrm{eff}$ [GB/s]:

$A_\mathrm{eff} = 2~D_\mathrm{u} + D_\mathrm{k}$ $T_\mathrm{eff} = \frac{A_\mathrm{eff}}{t_\mathrm{it}}$The upper bound of $T_\mathrm{eff}$ is $T_\mathrm{peak}$ as measured, e.g., by McCalpin, 1995 for CPUs or a GPU analogue.

Defining the $T_\mathrm{eff}$ metric, we assume that:

we evaluate an iterative stencil-based solver,

the problem size is much larger than the cache sizes and

the usage of time blocking is not feasible or advantageous (reasonable for real-world applications).

💡 Note

Fields within the effective memory access that do not depend on their own history; such fields can be re-computed on the fly or stored on-chip.

As first task, we'll compute the $T_\mathrm{eff}$ for the 2D fluid pressure (diffusion) solver at the core of the porous convection algorithm from previous lecture.

👉 Download the script `l5_Pf_diffusion_2D.jl`

to get started.

**To-do list:**

copy

`l5_Pf_diffusion_2D.jl`

, rename it to`Pf_diffusion_2D_Teff.jl`

add a timer

include the performance metric formulas

deactivate visualisation

💻 Let's get started

Use

`Base.time()`

to return the current timestampDefine

`t_tic`

, the starting time, after 11 iterations steps to allow for "warm-up"Record the exact number of iterations (introduce e.g.

`niter`

)Compute the elapsed time

`t_toc`

at the end of the time loop and report:

```
t_toc = ...
A_eff = ... # Effective main memory access per iteration [GB]
t_it = ... # Execution time per iteration [s]
T_eff = A_eff/t_it # Effective memory throughput [GB/s]
```

Report

`t_toc`

,`T_eff`

and`niter`

at the end of the code, formatting output using`@printf()`

macro.Round

`T_eff`

to the 3rd significant digit.

`@printf("Time = %1.3f sec, ... \n", t_toc, ...)`

Use keyword arguments ("kwargs") to allow for default behaviour

Define a

`do_check`

flag set to`false`

```
function Pf_diffusion_2D(;??)
...
return
end
```

So far so good, we have now a timer.

Let's also boost resolution to `nx = ny = 511`

and set `maxiter = max(nx,ny)`

to have the code running ~1 sec.

In the next part, we'll work on a multi-threading implementation.

*Towards implementing shared memory parallelisation using multi-threading capabilities of modern multi-core CPUs.*

We'll work it out in 3 steps:

Precomputing scalars, removing divisions (and non-necessary arrays)

Back to loops I

Back to loops II - compute functions (future "kernels")

Precomputing scalars, removing divisions and casual arrays

Let's duplicate `Pf_diffusion_2D_Teff.jl`

and rename it as `Pf_diffusion_2D_perf.jl`

.

First, replace

`k_ηf/dx`

and`k_ηf/dy`

in the flux calculations by inverse multiplications, such that

`k_ηf_dx, k_ηf_dy = k_ηf/dx, k_ηf/dy`

Then, replace divisions

`./(1.0 + θ_dτ)`

by inverse multiplications`*_1_θ_dτ`

such that

`_1_θ_dτ = 1.0./(1.0 + θ_dτ)`

Then, replace

`./dx`

and`./dy`

in the`Pf`

update by inverse multiplications`*_dx, *_dy`

where

`_dx, _dy = 1.0/dx, 1.0/dy`

Finally, also apply the same treatment to

`./β_dτ`

Back to loops I

As first, duplicate `Pf_diffusion_2D_perf.jl`

and rename it as `Pf_diffusion_2D_perf_loop.jl`

.

The goal is now to write out the diffusion physics in a loop fashion over $x$ and $y$ dimensions.

Implement a nested loop, taking car of bounds and staggering.

```
for iy=??
for ix=??
qDx[??] -= (qDx[??] + k_ηf_dx* ?? )*_1_θ_dτ
end
end
for iy=??
for ix=??
qDy[??] -= (qDy[??] + k_ηf_dy* ?? )*_1_θ_dτ
end
end
for iy=??
for ix=??
Pf[??] -= ??
end
end
```

We could now use macros to make the code nicer and clearer. Macro expression will be replaced during pre-processing (prior to compilation). Also, macro can take arguments by appending `$`

in their definition.

Let's use macros to replace the derivative implementations

```
macro d_xa(A) esc(:( $A[??]-$A[??] )) end
macro d_ya(A) esc(:( $A[??]-$A[??] )) end
```

And update the code within the iteration loop:

```
for iy=??
for ix=??
qDx[??] -= (qDx[??] + k_ηf_dx* ?? )*_1_θ_dτ
end
end
for iy=??
for ix=??
qDy[??] -= (qDy[??] + k_ηf_dy* ?? )*_1_θ_dτ
end
end
for iy=??
for ix=??
Pf[??] -= ??
end
end
```

Performance is already quite better with the loop version 🚀.

Reasons are that `diff()`

are allocating and that Julia is overall well optimised for executing loops.

Let's now implement the final step.

Back to loops II

Duplicate `Pf_diffusion_2D_perf_loop.jl`

and rename it as `Pf_diffusion_2D_perf_loop_fun.jl`

.

In this last step, the goal is to define `compute`

functions to hold the physics calculations, and to call those within the time loop.

Create a `compute_flux!()`

and `compute_Pf!()`

functions that take input and output arrays and needed scalars as argument and return nothing.

```
function compute_flux!(...)
nx,ny=size(Pf)
...
return nothing
end
function update_Pf!(Pf,...)
nx,ny=size(Pf)
...
return nothing
end
```

💡 Note

Functions that modify arguments take a

`!`

in their name, a Julia convention.The `compute_flux!()`

and `compute_Pf!()`

functions can then be called within the time loop.

This last implementation executes a bit faster as previous one, as functions allow Julia to further optimise during just-ahead-of-time compilation.

💡 Note

For optimal performance assessment, bound-checking should be deactivated. This can be achieved by adding

`@inbounds`

in front of the compute statement, or running the scripts (or launching Julia) with the `--check-bounds=no`

option.Various timing and benchmarking tools are available in Julia's ecosystem to track performance issues. Julia's `Base`

exposes the `@time`

macro which returns timing and allocation estimation. BenchmarkTools.jl package provides finer grained timing and benchmarking tooling, namely the `@btime`

, `@belapsed`

and `@benchmark`

macros, among others.

Let's evaluate the performance of our code using `BenchmarkTools`

. We will need to wrap the two compute kernels into a `compute!()`

function in order to be able to call that one using `@belapsed`

. Query `? @belapsed`

in Julia's REPL to know more.

The `compute!()`

function:

```
function compute!(Pf,qDx,qDy, ???)
compute_flux!(...)
update_Pf!(...)
return nothing
end
```

can then be called using `@belapsed`

to return elapsed time for a single iteration, letting `BenchmarkTools`

taking car about sampling

```
t_toc = @belapsed compute!($Pf,$qDx,$qDy,???)
niter = ???
```

💡 Note

Note that variables need to be interpolated into the function call, thus taking a

`$`

in front.Julia ships with it's `Base`

feature the possibility to enable multi-threading.

The only 2 modifications needed to enable it in our code are:

Place

`Threads.@threads`

in front of the outer loop definitionExport the desired amount of threads, e.g.,

`export JULIA_NUM_THREADS=4`

, to be activate prior to launching Julia (or executing the script from the shell). You can also launch Julia with`-t`

option setting the desired numbers of threads. Setting`-t auto`

will most likely automatically use as many hardware threads as available on a machine.

The number of threads can be queried within a Julia session as following: `Threads.nthreads()`

💡 Note

For optimal performance, the numbers of threads should be identical to the number of physical cores of the target CPU (hardware threads).

Relying on Julia's LoopVectorization.jl package, it is possible to combine multi-threading with advanced vector extensions (AVX) optimisations, leveraging extensions to the x86 instruction set architecture.

To enable it:

Add

`using LoopVectorization`

at the top of the scriptReplace

`Threads.@threads`

by`@tturbo`

And here we go 🚀

We discussed main performance limiters

We implemented the effective memory throughput metric $T_\mathrm{eff}$

We optimised the Julia 2D diffusion code (multi-threading and AVX)

`Test`

moduleBasic unit tests in Julia

Provides simple

*unit testing*functionalityA way to assess if code is correct by checking that results are as expected

Helpful to ensure the code still works after changes

Can be used when developing

Should be used in package for CI

Simple unit testing can be performed with the `@test`

and `@test_throws`

macros:

```
using Test
@test true
```

Or another example

`@test [1, 2] + [2, 1] == [3, 3]`

Testing an expression which is a call using infix syntax such as approximate comparisons

`@test π ≈ 3.14 atol=0.01`

For example, suppose we want to check our new function `square!(x)`

works as expected:

`square!(x) = x^2`

If the condition is true, a `Pass`

is returned:

`@test square!(5) == 25`

If the condition is false, then a `Fail`

is returned and an exception is thrown:

`@test square!(5) == 20`

```
Test Failed at none:1
Expression: square!(5) == 20
Evaluated: 25 == 20
Test.FallbackTestSetException("There was an error during testing")
```

The `@testset`

macro can be used to group tests into sets.

All the tests in a test set will be run, and at the end of the test set a summary will be printed.

If any of the tests failed, or could not be evaluated due to an error, the test set will then throw a `TestSetException`

.

```
@testset "trigonometric identities" begin
θ = 2/3*π
@test sin(-θ) ≈ -sin(θ)
@test cos(-θ) ≈ cos(θ)
@test sin(2θ) ≈ 2*sin(θ)*cos(θ)
@test cos(2θ) ≈ cos(θ)^2 - sin(θ)^2
end;
```

Let's try it with our `square!()`

function

```
square!(x) = x^2
@testset "Square Tests" begin
@test square!(5) == 25
@test square!("a") == "aa"
@test square!("bb") == "bbbb"
end;
```

If we now introduce a bug

```
square!(x) = x^2
@testset "Square Tests" begin
@test square!(5) == 25
@test square!("a") == "aa"
@test square!("bb") == "bbbb"
@test square!(5) == 20
end;
```

```
Square Tests: Test Failed at none:6
Expression: square!(5) == 20
Evaluated: 25 == 20
Stacktrace:
[...]
Test Summary: | Pass Fail Total
Square Tests | 3 1 4
Some tests did not pass: 3 passed, 1 failed, 0 errored, 0 broken.
```

Then then the reporting tells us a test failed.

The

`Test`

module provides simple*unit testing*functionality.Tests can be grouped into sets using

`@testset`

.We'll later see how tests can be used in CI.

⚠️ Warning!

Exercises have to be handed in as monolithic Julia scripts (one code per script) and uploaded to your private (shared) GitHub repository, in a **specific folder for each lecture**. The git commit hash (or SHA) of the final push needs to be uploaded on Moodle (more).

👉 See Logistics for submission details.

The goal of this exercise is to:

Finalise the script discussed in class

In this first exercise, you will terminate the performance oriented implementation of the 2D fluid pressure (diffusion) solver script from lecture 5.

👉 If needed, download the `l5_Pf_diffusion_2D.jl`

to get you started.

Create a new folder in your GitHub repository for this week's (lecture 5) exercises. In there, create a new subfolder `Pf_diffusion_2D`

where you will add following script:

`Pf_diffusion_2D_Teff.jl`

:`T_eff`

implementation`Pf_diffusion_2D_perf.jl`

: scalar precomputations and removing disabling`ncheck`

`Pf_diffusion_2D_loop_fun.jl`

: physics computations in`compute!()`

function, derivatives done with macros, and multi-threading

💡 Note

Refer to this section in lecture 5 to capture the starting point describing which features are specific to each version of the diffusion 2D codes.

👉 See Logistics for submission details.

The goal of this exercise is to:

Create a script to assess $T_\mathrm{peak}$, using memory-copy

Assess $T_\mathrm{peak}$ of your CPU

Perform a strong-scaling test: assess $T_\mathrm{eff}$ for the fluid pressure diffusion 2D solver as function of number of grid points and implementation

For this exercise, you will write a code to assess the peak memory throughput of your CPU and run a strong scaling benchmark using the fluid pressure diffusion 2D solver and report performance.

In the `Pf_diffusion_2D`

folder, create a new script named `memcopy.jl`

. You can use as starting point the `diffusion_2D_loop_fun.jl`

script from lecture 5 (or exercise 1).

Rename the "main" function

`memcopy()`

Modify the script to only keep following in the initialisation:

```
# Numerics
nx, ny = 512, 512
nt = 2e4
# array initialisation
C = rand(Float64, nx, ny)
C2 = copy(C)
A = copy(C)
```

Implement 2 compute functions to perform the following operation

`C2 = C + A`

, replacing the previous calculations:create an "array programming"-based function called

`compute_ap!()`

that includes a broadcasted version of the memory copy operation;create a second "kernel programming"-based function called

`compute_kp!()`

that uses a loop-based implementation with multi-threading.

Update the

`A_eff`

formula accordingly.Implement a switch to monitor performance using either a manual approach or

`BenchmarkTools`

and pass the`bench`

value as kwarg to the`memcopy()`

main function including a default:

```
if bench == :loop
# iteration loop
t_tic = 0.0
for iter=1:nt
...
end
t_toc = Base.time() - t_tic
elseif bench == :btool
t_toc = @belapsed ...
end
```

Then, create a `README.md`

file in the `Pf_diffusion_2D`

folder to report the results for each of the following tasks (including a .png of the figure when instructed)

💡 Note

Use

`![fig_name](./<relative-path>/my_fig.png)`

to insert a .png figure in the `README.md`

.Report on a figure $T_\mathrm{eff}$ of your `memcopy.jl`

code as function of number of grid points `nx × ny`

for the array and kernel programming approaches, respectively, using the `BenchmarkTools`

implementation. Vary `nx`

and `ny`

such that

`nx = ny = 16 * 2 .^ (1:8)`

*($T_\mathrm{eff}$ of your memcopy.jl code represents $T_\mathrm{peak}$, the peak memory throughput you can achieve on your CPU for a given implementation.)*

On the same figure, report the best value of memcopy obtained using the manual loop-based approach (manual timer) to assess $T_\mathrm{peak}$.

💡 Note

For performance evaluation we only need the code to run a couple of seconds; adapt

`nt`

accordingly (you could also, e.g., make `nt`

function of `nx, ny`

). Ensure also to implement "warm-up" iterations.Add the above figure in a new section of the `Pf_diffusion_2D/README.md`

, and provide a minimal description of 1) the performed test, and 2) a short description of the result. Figure out the vendor-announced peak memory bandwidth for your CPU, add it to the figure and use it to discuss your results.

Repeat the strong scaling benchmark you just realised in Task 2 using the various fluid pressure diffusion 2D codes (`Pf_diffusion_2D_Teff.jl`

; `Pf_diffusion_2D_perf.jl`

; `Pf_diffusion_2D_loop_fun.jl`

- `for`

, `Threads.@threads`

for the latter).

Report on a figure $T_\mathrm{eff}$ of the 4 diffusion solvers' implementations as function of number of grid points `nx × ny`

. Vary `nx`

and `ny`

such that `nx = ny = 16 * 2 .^ (1:8)`

. **Use the BenchmarlTools-based evaluation approach.**

On the same figure, report also the best memory copy value (as, e.g, dashed lines) and vendor announced values (if available - optional).

Add this second figure in a new section of the `Pf_diffusion_2D/README.md`

, and provide a minimal description of 1) the performed test, and 2) a short description of the result.

👉 See Logistics for submission details.

The goal of this exercise is to:

Implement basic unit tests for the diffusion and acoustic 2D scripts

Group the tests in a test-set

For this exercise, you will implement a test set of basic unit tests to verify the implementation of the diffusion and acoustic 2D solvers.

In the `Pf_diffusion_2D`

folder, duplicate the `Pf_diffusion_2D_perf_loop_fun.jl`

script and rename it `Pf_diffusion_2D_test.jl`

.

Implement a test set in order to test `Pf[xtest, ytest]`

and assess that the values returned are approximatively equal to the following ones for the given values of `nx = ny`

. Make sure to set `do_check = false`

i.e. to ensure the code to run 500 iterations.

```
xtest = [5, Int(cld(0.6*lx, dx)), nx-10]
ytest = Int(cld(0.5*ly, dy))
```

for

```
nx = ny = 16 * 2 .^ (2:5) .- 1
maxiter = 500
```

should match

`nx, ny` | `Pf[xtest, ytest]` |
---|---|

`64` | `[0.00785398056115133 0.007853980637555755 0.007853978592411982]` |

`128` | `[0.00787296974549236 0.007849556884184108 0.007847181374079883]` |

`256` | `[0.00740912103848251 0.009143711648167267 0.007419533048751209]` |

`512` | `[0.00566813765849919 0.004348785338575644 0.005618691590498087]` |

Report the output of the test set as code block in a new section of the `README.md`

in the `Pf_diffusion_2D`

folder.

💡 Note

Use triple backticks to generate code blocks in the

`README.md`

(more).Last modified: October 03, 2023. Website built with Franklin.jl and the Julia programming language.