📚 Parallel computing on CPUs & performance assessment, the $T_\mathrm{eff}$ metric
💻 Unit testing in Julia
🚧 Exercises:
- CPU performance codes for 2D diffusion and memcopy
- Unit tests and testset implementation
How to assess the performance of a numerical application?
Are you familiar with the concept of wall-time?
What are the key ingredients to understanding performance?
Performance limiters
Effective memory throughput metric
Parallel computing on CPUs
Shared memory parallelisation
Recent processors (CPUs and GPUs) have multiple (or many) cores
Recent processors use their parallelism to hide latency (i.e. overlapping execution times (latencies) of individual operations with execution times of other operations)
Multi-core CPUs and GPUs share similar challenges
Recall from lecture 1 (why we do it) ...
Use parallel computing (to address this):
The "memory wall" in ~ 2004
Single-core to multi-core devices
GPUs are massively parallel devices
SIMD machine (programmed using threads - SPMD) (more)
Further increases the FLOPS vs Bytes gap
Taking a look at a recent GPU and CPU:
Nvidia Tesla A100 GPU
AMD EPYC "Rome" 7282 (16 cores) CPU
| Device | TFLOP/s (FP64) | Memory BW [TB/s] |
| --- | --- | --- |
| Tesla A100 | 9.7 | 1.55 |
| AMD EPYC 7282 | 0.7 | 0.085 |
Current GPUs (and CPUs) can do many more computations in a given amount of time than they can access numbers from main memory.
Quantify the imbalance:

$$ \mathrm{imbalance} = \frac{\text{peak performance [TFLOP/s]}}{\text{memory bandwidth [TB/s]}} \times 8 \;\text{[bytes per FP64 number]} $$

(Theoretical peak performance values as specified by the vendors can be used.)
Back to our hardware:
| Device | TFLOP/s (FP64) | Memory BW [TB/s] | Imbalance (FP64) |
| --- | --- | --- | --- |
| Tesla A100 | 9.7 | 1.55 | 9.7 / 1.55 × 8 ≈ 50 |
| AMD EPYC 7282 | 0.7 | 0.085 | 0.7 / 0.085 × 8 ≈ 66 |
(here computed with double precision values)
Meaning: we can perform 50 (GPU) or 66 (CPU) floating point operations per number accessed from main memory. Floating point operations come "for free" when we work in memory-bounded regimes.
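As a quick sanity check, the imbalance can be reproduced in a few lines of Julia (using the vendor values from the table above):

```julia
# FLOPs performed per FP64 number accessed (sizeof(Float64) = 8 bytes)
imbalance(tflops, tb_s) = tflops / tb_s * sizeof(Float64)

imbalance(9.7, 1.55)  # Tesla A100    ≈ 50
imbalance(0.7, 0.085) # AMD EPYC 7282 ≈ 66
```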
➡ Requires re-thinking the numerical implementation and solution strategies
Most algorithms require only a few operations or FLOPS ...
... compared to the amount of numbers or bytes accessed from main memory.
First derivative example:

If we "naively" compare the "cost" of an isolated evaluation of a finite-difference first derivative, e.g., computing a flux $q = -D\,\frac{\partial A}{\partial x}$, which in the discrete form reads `q[ix] = -D*(A[ix+1]-A[ix])/dx`.

The cost of evaluating `q[ix] = -D*(A[ix+1]-A[ix])/dx` is:
- 1 read + 1 write => 16 bytes transferred
- 1 (fused) addition and division => 1 floating point operation

assuming:
- $D$ and $dx$ are scalars,
- $q$ and $A$ are arrays of `Float64` (read from main memory).
GPUs and CPUs perform 50–66 floating-point operations per number accessed from main memory, while a first derivative evaluation requires transferring 2 numbers per floating-point operation.

The FLOPS metric is thus no longer the most adequate for reporting the performance of many modern applications on modern hardware.

We need a memory throughput-based performance evaluation metric: $T_\mathrm{eff}$ [GB/s]
➡ Evaluate the performance of iterative stencil-based solvers.
The effective memory access, $A_\mathrm{eff}$ [GB], is the sum of:
- twice the memory footprint of the unknown fields, $D_\mathrm{u}$ (fields that depend on their own history and that need to be updated every iteration), and
- the memory footprint of the known fields, $D_\mathrm{k}$, that do not change every iteration:

$$ A_\mathrm{eff} = 2\,D_\mathrm{u} + D_\mathrm{k} $$

The effective memory access divided by the execution time per iteration, $t_\mathrm{it}$ [sec], defines the effective memory throughput, $T_\mathrm{eff}$ [GB/s]:

$$ T_\mathrm{eff} = \frac{A_\mathrm{eff}}{t_\mathrm{it}} $$

The upper bound of $T_\mathrm{eff}$ is $T_\mathrm{peak}$, as measured, e.g., by McCalpin, 1995 for CPUs, or a GPU analogue.
Defining the $T_\mathrm{eff}$ metric, we assume that:
- we evaluate an iterative stencil-based solver,
- the problem size is much larger than the cache sizes, and
- the usage of time blocking is not feasible or advantageous (reasonable for real-world applications).
As a first task, we'll compute the $T_\mathrm{eff}$ for the 2D fluid pressure (diffusion) solver at the core of the porous convection algorithm from the previous lecture.
👉 Download the script `l5_Pf_diffusion_2D.jl` to get started.
To-do list:
- copy `l5_Pf_diffusion_2D.jl` and rename it to `Pf_diffusion_2D_Teff.jl`
- add a timer
- include the performance metric formulas
- deactivate visualisation
💻 Let's get started
- Use `Base.time()` to return the current timestamp.
- Define `t_tic`, the starting time, after 11 iteration steps to allow for "warm-up".
- Record the exact number of iterations (introduce, e.g., `niter`).
- Compute the elapsed time `t_toc` at the end of the time loop and report:
```julia
t_toc = Base.time() - t_tic
A_eff = (3*2)/1e9*nx*ny*sizeof(Float64) # Effective main memory access per iteration [GB]
t_it  = t_toc/niter                     # Execution time per iteration [s]
T_eff = A_eff/t_it                      # Effective memory throughput [GB/s]
```
- Report `t_toc`, `T_eff` and `niter` at the end of the code, formatting the output using the `@printf()` macro. Round `T_eff` to the 3rd significant digit:
```julia
@printf("Time = %1.3f sec, T_eff = %1.2f GB/s (niter = %d)\n", t_toc, round(T_eff, sigdigits=3), niter)
```
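Put together, the timing logic could look as follows (a minimal sketch, assuming the solver's `while` loop and error-check variables from the script):

```julia
t_tic = 0.0; niter = 0; iter = 1
while err_Pf >= ϵtol && iter <= maxiter
    # restart the timer after 10 warm-up iterations
    if iter == 11
        t_tic = Base.time(); niter = 0
    end
    # ... flux and pressure updates ...
    niter += 1
    iter  += 1
end
t_toc = Base.time() - t_tic
```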
Use keyword arguments ("kwargs") to allow for default behaviour: define a `do_check` flag set to `false`:
```julia
function Pf_diffusion_2D(; do_check=false)
    ...
    if do_check && (iter % ncheck == 0)
        ...
    end
    return
end
```
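Calling `Pf_diffusion_2D()` then runs without the error check by default, while `Pf_diffusion_2D(do_check=true)` restores the previous behaviour.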
So far so good, we now have a timer.

Let's also boost the resolution to `nx = ny = 511` and set `maxiter = max(nx,ny)` to have the code run for ~1 sec.
In the next part, we'll work on a multi-threading implementation.
Towards implementing shared memory parallelisation using multi-threading capabilities of modern multi-core CPUs.
We'll work it out in 3 steps:
1. Precomputing scalars, removing divisions (and unnecessary arrays)
2. Back to loops I
3. Back to loops II - compute functions (future "kernels")
Precomputing scalars, removing divisions and unnecessary arrays
Let's duplicate `Pf_diffusion_2D_Teff.jl` and rename it as `Pf_diffusion_2D_perf.jl`.
First, replace `k_ηf/dx` and `k_ηf/dy` in the flux calculations by inverse multiplications, such that

```julia
k_ηf_dx, k_ηf_dy = k_ηf/dx, k_ηf/dy
```
Then, replace divisions `./(1.0 + θ_dτ)` by inverse multiplications `*_1_θ_dτ` such that

```julia
_1_θ_dτ = 1.0./(1.0 + θ_dτ)
```
Then, replace `./dx` and `./dy` in the `Pf` update by inverse multiplications `*_dx, *_dy` where

```julia
_dx, _dy = 1.0/dx, 1.0/dy
```
Finally, also apply the same treatment to `./β_dτ`.
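Collected together, the precomputed scalars could look as follows (a sketch; `_β_dτ` naming the last inverse):

```julia
# precompute scalars to replace repeated divisions by multiplications
k_ηf_dx, k_ηf_dy = k_ηf/dx, k_ηf/dy
_1_θ_dτ = 1.0/(1.0 + θ_dτ)
_dx, _dy = 1.0/dx, 1.0/dy
_β_dτ   = 1.0/β_dτ
```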
Back to loops I
As a first step, duplicate `Pf_diffusion_2D_perf.jl` and rename it as `Pf_diffusion_2D_perf_loop.jl`.
The goal is now to write out the diffusion physics in a loop fashion over the x and y dimensions.

Implement a nested loop, taking care of bounds and staggering:
```julia
for iy = 1:ny
    for ix = 1:nx-1
        qDx[ix+1,iy] -= (qDx[ix+1,iy] + k_ηf_dx*(Pf[ix+1,iy]-Pf[ix,iy]))*_1_θ_dτ
    end
end
for iy = 1:ny-1
    for ix = 1:nx
        qDy[ix,iy+1] -= (qDy[ix,iy+1] + k_ηf_dy*(Pf[ix,iy+1]-Pf[ix,iy]))*_1_θ_dτ
    end
end
for iy = 1:ny
    for ix = 1:nx
        Pf[ix,iy] -= ((qDx[ix+1,iy]-qDx[ix,iy])*_dx + (qDy[ix,iy+1]-qDy[ix,iy])*_dy)*_β_dτ
    end
end
```
We could now use macros to make the code nicer and clearer. Macro expressions are replaced during parsing (prior to compilation). Also, macros can take arguments, which are interpolated into their definition using `$`.

Let's use macros to replace the derivative implementations:
```julia
macro d_xa(A) esc(:( $A[ix+1,iy] - $A[ix,iy] )) end
macro d_ya(A) esc(:( $A[ix,iy+1] - $A[ix,iy] )) end
```
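You can inspect what a macro expands to using `@macroexpand`; it should return something like:

```julia-repl
julia> @macroexpand @d_xa(Pf)
:(Pf[ix + 1, iy] - Pf[ix, iy])
```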
And update the code within the iteration loop:
```julia
for iy = 1:ny
    for ix = 1:nx-1
        qDx[ix+1,iy] -= (qDx[ix+1,iy] + k_ηf_dx*@d_xa(Pf))*_1_θ_dτ
    end
end
for iy = 1:ny-1
    for ix = 1:nx
        qDy[ix,iy+1] -= (qDy[ix,iy+1] + k_ηf_dy*@d_ya(Pf))*_1_θ_dτ
    end
end
for iy = 1:ny
    for ix = 1:nx
        Pf[ix,iy] -= (@d_xa(qDx)*_dx + @d_ya(qDy)*_dy)*_β_dτ
    end
end
```
Performance is already significantly better with the loop version 🚀. The reason is that `diff()` allocates temporary arrays, whereas Julia is overall well optimised for executing loops.
Let's now implement the final step.
Back to loops II
Duplicate `Pf_diffusion_2D_perf_loop.jl` and rename it as `Pf_diffusion_2D_perf_loop_fun.jl`.
In this last step, the goal is to define compute functions to hold the physics calculations, and to call those within the time loop.

Create `compute_flux!()` and `update_Pf!()` functions that take input and output arrays and the needed scalars as arguments and return nothing:
```julia
function compute_flux!(qDx, qDy, Pf, k_ηf_dx, k_ηf_dy, _1_θ_dτ)
    nx, ny = size(Pf)
    for iy = 1:ny
        for ix = 1:nx-1
            qDx[ix+1,iy] -= (qDx[ix+1,iy] + k_ηf_dx*@d_xa(Pf))*_1_θ_dτ
        end
    end
    for iy = 1:ny-1
        for ix = 1:nx
            qDy[ix,iy+1] -= (qDy[ix,iy+1] + k_ηf_dy*@d_ya(Pf))*_1_θ_dτ
        end
    end
    return nothing
end

function update_Pf!(Pf, qDx, qDy, _dx, _dy, _β_dτ)
    nx, ny = size(Pf)
    for iy = 1:ny
        for ix = 1:nx
            Pf[ix,iy] -= (@d_xa(qDx)*_dx + @d_ya(qDy)*_dy)*_β_dτ
        end
    end
    return nothing
end
```
Note that the functions modify their input arguments in place and thus carry a `!` in their name, a Julia convention. The `compute_flux!()` and `update_Pf!()` functions can then be called within the time loop.
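Within the iteration loop, the physics update then reduces to two calls:

```julia
compute_flux!(qDx, qDy, Pf, k_ηf_dx, k_ηf_dy, _1_θ_dτ)
update_Pf!(Pf, qDx, qDy, _dx, _dy, _β_dτ)
```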
This last implementation executes a bit faster than the previous one, as functions allow Julia to further optimise during just-ahead-of-time compilation.
Note: bound checking can be deactivated for extra performance by placing `@inbounds` in front of the compute statements, or by running the scripts (or launching Julia) with the `--check-bounds=no` option.

Various timing and benchmarking tools are available in Julia's ecosystem to track performance issues. Julia's `Base` exposes the `@time` macro which returns timing and allocation estimates. The BenchmarkTools.jl package provides finer-grained timing and benchmarking tooling, namely the `@btime`, `@belapsed` and `@benchmark` macros, among others.
Let's evaluate the performance of our code using `BenchmarkTools`. We will need to wrap the two compute kernels into a `compute!()` function in order to be able to call it using `@belapsed`. Query `?@belapsed` in Julia's REPL to learn more.

The `compute!()` function:
```julia
function compute!(Pf, qDx, qDy, k_ηf_dx, k_ηf_dy, _1_θ_dτ, _dx, _dy, _β_dτ)
    compute_flux!(qDx, qDy, Pf, k_ηf_dx, k_ηf_dy, _1_θ_dτ)
    update_Pf!(Pf, qDx, qDy, _dx, _dy, _β_dτ)
    return nothing
end
```
It can then be called using `@belapsed` to return the elapsed time for a single iteration, letting `BenchmarkTools` take care of the sampling:

```julia
t_toc = @belapsed compute!($Pf, $qDx, $qDy, $k_ηf_dx, $k_ηf_dy, $_1_θ_dτ, $_dx, $_dy, $_β_dτ)
niter = 1
```

Note that variables need to be interpolated into the benchmarked expression, thus taking a `$` in front.
in front.Julia ships with it's Base
feature the possibility to enable multi-threading.
The only 2 modifications needed to enable it in our code are:
Place Threads.@threads
in front of the outer loop definition
Export the desired amount of threads, e.g., export JULIA_NUM_THREADS=4
, to be activate prior to launching Julia (or executing the script from the shell). You can also launch Julia with -t
option setting the desired numbers of threads. Setting -t auto
will most likely automatically use as many hardware threads as available on a machine.
The number of threads can be queried within a Julia session as following: Threads.nthreads()
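Applied to, e.g., the pressure update kernel from the previous step, this could look as follows (a sketch):

```julia
function update_Pf!(Pf, qDx, qDy, _dx, _dy, _β_dτ)
    nx, ny = size(Pf)
    Threads.@threads for iy = 1:ny # distribute the outer loop over threads
        for ix = 1:nx
            Pf[ix,iy] -= (@d_xa(qDx)*_dx + @d_ya(qDy)*_dy)*_β_dτ
        end
    end
    return nothing
end
```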
Relying on Julia's LoopVectorization.jl package, it is possible to combine multi-threading with advanced vector extensions (AVX) optimisations, leveraging extensions to the x86 instruction set architecture.
To enable it:
- Add `using LoopVectorization` at the top of the script
- Replace `Threads.@threads` by `@tturbo`
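Applied to the same kernel, this could look as follows (a sketch; the derivatives are written out explicitly here as plain indexing expressions):

```julia
using LoopVectorization

function update_Pf!(Pf, qDx, qDy, _dx, _dy, _β_dτ)
    nx, ny = size(Pf)
    @tturbo for iy = 1:ny # vectorised (AVX) and multi-threaded
        for ix = 1:nx
            Pf[ix,iy] -= ((qDx[ix+1,iy] - qDx[ix,iy])*_dx +
                          (qDy[ix,iy+1] - qDy[ix,iy])*_dy)*_β_dτ
        end
    end
    return nothing
end
```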
And here we go 🚀
- We discussed the main performance limiters
- We implemented the effective memory throughput metric $T_\mathrm{eff}$
- We optimised the Julia 2D diffusion code (multi-threading and AVX)
The `Test` module: basic unit tests in Julia
- Provides simple unit testing functionality
- A way to assess if code is correct by checking that results are as expected
- Helpful to ensure the code still works after changes
- Can be used when developing
- Should be used in packages for continuous integration (aka CI; when tests are run automatically on GitHub on a push)
Simple unit testing can be performed with the `@test` and `@test_throws` macros:
```julia
using Test

@test 1 == 1
@test_throws MethodError 1 + "a" # the expected error must be provided too
```
Or another example:

```julia
@test [1, 2] + [2, 1] == [3, 3]
```
Testing an expression which is a call using infix syntax, such as approximate comparisons with `≈` (`\approx` + tab):

```julia
@test π ≈ 3.14 atol=0.01
```
For example, suppose we want to check that our new function `square(x)` works as expected:

```julia
square(x) = x^2
```
If the condition is true, a `Pass` is returned:

```julia
@test square(5) == 25
```
If the condition is false, then a `Fail` is returned and an exception is thrown:

```julia
@test square(5) == 20
```
```
Test Failed at none:1
  Expression: square(5) == 20
   Evaluated: 25 == 20
Test.FallbackTestSetException("There was an error during testing")
```
The `@testset` macro can be used to group tests into sets. All the tests in a test set will be run, and at the end of the test set a summary will be printed. If any of the tests fail, or cannot be evaluated due to an error, the test set will throw a `TestSetException`.
@testset "trigonometric identities" begin
θ = 2/3*π
@test sin(-θ) ≈ -sin(θ)
@test cos(-θ) ≈ cos(θ)
@test sin(2θ) ≈ 2*sin(θ)*cos(θ)
@test cos(2θ) ≈ cos(θ)^2 - sin(θ)^2
end;
Let's try it with our `square()` function:

```julia
square(x) = x^2

@testset "Square Tests" begin
    @test square(5) == 25
    @test square("a") == "aa"
    @test square("bb") == "bbbb"
end;
```
If we now introduce a bug:

```julia
square(x) = x^2

@testset "Square Tests" begin
    @test square(5) == 25
    @test square("a") == "aa"
    @test square("bb") == "bbbb"
    @test square(5) == 20
end;
```
```
Square Tests: Test Failed at none:6
  Expression: square(5) == 20
   Evaluated: 25 == 20
Stacktrace:
 [...]
Test Summary: | Pass  Fail  Total
Square Tests  |    3     1      4
Some tests did not pass: 3 passed, 1 failed, 0 errored, 0 broken.
```
The report then tells us that a test failed.
The simplest approach is to just put the tests in your script, alongside the tested functions. The tests then run every time the script is executed. However, for bigger pieces of software, such as packages, this becomes unwieldy and also undesirable (as we don't want tests to run all the time). Tests are then put into `test/runtests.jl`, where they will be run when called from package mode or from automated tests (CI).
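A minimal `test/runtests.jl` could look like this (a sketch for a hypothetical package `MyPkg` providing the `square` function from above):

```julia
# test/runtests.jl (sketch)
using Test
using MyPkg # hypothetical package under test

@testset "MyPkg" begin
    @test MyPkg.square(5) == 25
    @test MyPkg.square("a") == "aa"
end
```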
Example for the package Literate.jl (we use it to generate the website):
```julia-repl
julia> using Pkg

julia> Pkg.test("Literate")
     Testing Literate
...
     Testing Running tests...
Test Summary:  | Pass  Total  Time
Literate.parse |  533    533  0.8s
Test Summary:   | Pass  Broken  Total  Time
Literate.script |   36       1     37  2.1s
Test Summary:     | Pass  Broken  Total  Time
Literate.markdown |   82       2     84  4.3s
Test Summary:     | Pass  Broken  Total  Time
Literate.notebook |   86       1     87  5.2s
Test Summary: | Pass  Total  Time
Configuration |    7      7  0.3s
     Testing Literate tests passed
```
The `Test` module provides simple unit testing functionality. Tests can be grouped into sets using `@testset`.
We'll later see how tests can be used in CI.
👉 See Logistics for submission details.
The goal of this exercise is to:
- Finalise the script discussed in class
In this first exercise, you will finalise the performance-oriented implementation of the 2D fluid pressure (diffusion) solver script from lecture 5.
👉 If needed, download the `l5_Pf_diffusion_2D.jl` script to get you started.
Create a new folder in your GitHub repository for this week's (lecture 5) exercises. In there, create a new subfolder `Pf_diffusion_2D` where you will add the following scripts:
- `Pf_diffusion_2D_Teff.jl`: the $T_\mathrm{eff}$ implementation
- `Pf_diffusion_2D_perf.jl`: scalar precomputations and disabling of `ncheck`
- `Pf_diffusion_2D_loop_fun.jl`: physics computations in `compute!()` functions, derivatives done with macros, and multi-threading
👉 See Logistics for submission details.
The goal of this exercise is to:
- Create a script to assess $T_\mathrm{peak}$, using memory copy
- Assess the $T_\mathrm{peak}$ of your CPU
- Perform a strong-scaling test: assess $T_\mathrm{eff}$ for the fluid pressure diffusion 2D solver as function of the number of grid points and the implementation
For this exercise, you will write a code to assess the peak memory throughput of your CPU and run a strong scaling benchmark using the fluid pressure diffusion 2D solver and report performance.
In the `Pf_diffusion_2D` folder, create a new script named `memcopy.jl`. You can use the `Pf_diffusion_2D_loop_fun.jl` script from lecture 5 (or exercise 1) as a starting point.

- Rename the "main" function `memcopy()`.
- Modify the script to only keep the following in the initialisation:
```julia
# Numerics
nx, ny = 512, 512
nt     = 2e4
# array initialisation
C  = rand(Float64, nx, ny)
C2 = copy(C)
A  = copy(C)
```
- Implement 2 compute functions to perform the operation `C2 = C + A`, replacing the previous calculations (see the sketch after this list):
  - create an "array programming"-based function called `compute_ap!()` that includes a broadcasted version of the memory copy operation;
  - create a second "kernel programming"-based function called `compute_kp!()` that uses a loop-based implementation with multi-threading.
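One possible shape for these two functions (a sketch; other valid implementations exist):

```julia
# array programming: broadcast the memory copy operation in place
compute_ap!(C2, C, A) = (C2 .= C .+ A)

# kernel programming: explicit nested loops with multi-threading
function compute_kp!(C2, C, A)
    nx, ny = size(C2)
    Threads.@threads for iy = 1:ny
        for ix = 1:nx
            C2[ix,iy] = C[ix,iy] + A[ix,iy]
        end
    end
    return nothing
end
```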
- Update the `A_eff` formula accordingly.
- Implement a switch to monitor performance using either a manual approach or `BenchmarkTools`, and pass the `bench` value as a kwarg to the `memcopy()` main function, including a default:
```julia
if bench == :loop
    # iteration loop
    t_tic = 0.0
    for iter = 1:nt
        ...
    end
    t_toc = Base.time() - t_tic
elseif bench == :btool
    t_toc = @belapsed ...
end
```
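The `memcopy()` main function could then be declared as, e.g., `function memcopy(; bench = :btool) ... end`, the default value being a free choice (this signature is a suggestion, not prescribed).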
Then, create a `README.md` file in the `Pf_diffusion_2D` folder to report the results for each of the following tasks (including a .png of the figure when instructed).

Note: use `![fig_name](./<relative-path>/my_fig.png)` to insert a .png figure in the `README.md`.
Report on a figure the $T_\mathrm{eff}$ of your `memcopy.jl` code as function of the number of grid points `nx × ny` for the array and kernel programming approaches, respectively, using the `BenchmarkTools` implementation. Vary `nx` and `ny` such that

```julia
nx = ny = 16 * 2 .^ (1:8)
```

(The $T_\mathrm{eff}$ of your `memcopy.jl` code represents $T_\mathrm{peak}$, the peak memory throughput you can achieve on your CPU for a given implementation.)

On the same figure, report the best memory copy value obtained using the manual loop-based approach (manual timer) to assess $T_\mathrm{peak}$.
Note: adapt `nt` accordingly (you could also, e.g., make `nt` a function of `nx, ny`). Ensure also to implement "warm-up" iterations.

Add the above figure in a new section of the `Pf_diffusion_2D/README.md`, and provide a minimal description of 1) the performed test, and 2) a short description of the result. Figure out the vendor-announced peak memory bandwidth for your CPU, add it to the figure, and use it to discuss your results.
Repeat the strong-scaling benchmark you just realised in Task 2 using the various fluid pressure diffusion 2D codes (`Pf_diffusion_2D_Teff.jl`; `Pf_diffusion_2D_perf.jl`; `Pf_diffusion_2D_loop_fun.jl` - with/without `Threads.@threads` for the latter).
Report on a figure the $T_\mathrm{eff}$ of the 4 diffusion solver implementations as function of the number of grid points `nx × ny`. Vary `nx` and `ny` such that `nx = ny = 16 * 2 .^ (1:8)`. Use the `BenchmarkTools`-based evaluation approach.
On the same figure, report also the best memory copy value (as, e.g., dashed lines) and the vendor-announced values (if available - optional).
Add this second figure in a new section of the `Pf_diffusion_2D/README.md`, and provide a minimal description of 1) the performed test, and 2) a short description of the result.
👉 See Logistics for submission details.
The goal of this exercise is to:
- Implement basic unit tests for the diffusion and acoustic 2D scripts
- Group the tests in a test-set
For this exercise, you will implement a test set of basic unit tests using Julia's built-in testing infrastructure to verify the implementation of the diffusion and acoustic 2D solvers.
In the `Pf_diffusion_2D` folder, duplicate the `Pf_diffusion_2D_perf_loop_fun.jl` script and rename it `Pf_diffusion_2D_test.jl`.
Implement a test set using the `@testset` and `@test` macros from Test.jl in order to test `Pf[xtest, ytest]` and assess that the returned values are approximately equal to the reference values listed below for the given values of `nx = ny`. Make sure to set `do_check = false`, i.e., to ensure the code runs 500 iterations.
```julia
xtest = [5, Int(cld(0.6*lx, dx)), nx-10]
ytest = Int(cld(0.5*ly, dy))
```

for

```julia
nx = ny = 16 * 2 .^ (2:5) .- 1
maxiter = 500
```

should match:
| nx, ny | Pf[xtest, ytest] |
| --- | --- |
| 63 | [0.00785398056115133 0.007853980637555755 0.007853978592411982] |
| 127 | [0.00787296974549236 0.007849556884184108 0.007847181374079883] |
| 255 | [0.00740912103848251 0.009143711648167267 0.007419533048751209] |
| 511 | [0.00566813765849919 0.004348785338575644 0.005618691590498087] |
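A possible skeleton for the test set (a sketch; it assumes a hypothetical `Pf_diffusion_2D(nx, ny; do_check=false)` signature returning the final `Pf` field together with `lx, ly, dx, dy` - adapt it to your script):

```julia
using Test

@testset "Pf_diffusion_2D" begin
    nx = ny = 63
    Pf, lx, ly, dx, dy = Pf_diffusion_2D(nx, ny; do_check=false) # hypothetical return values
    xtest = [5, Int(cld(0.6*lx, dx)), nx-10]
    ytest = Int(cld(0.5*ly, dy))
    @test all(Pf[xtest, ytest] .≈ [0.00785398056115133, 0.007853980637555755, 0.007853978592411982])
end;
```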
Report the output of the test set as a code block in a new section of the `README.md` in the `Pf_diffusion_2D` folder.
Note: check out how to format code blocks in a `README.md` (more).