Lecture 5

Agenda
📚 Parallel computing on CPUs & performance assessment, the $T_\mathrm{eff}$ metric
💻 Unit testing in Julia
🚧 Exercises:

  • CPU perf. codes for 2D diffusion and memcopy

  • Unit tests and testset implementation


Content

👉 get started with exercises


Parallel computing (on CPUs) and performance assessment

The goal of this lecture 5 is to introduce performance limiters, the effective memory throughput metric $T_\mathrm{eff}$, and shared-memory parallel computing on CPUs.

Performance limiters

Hardware

Recall from lecture 1 (why we do it) ...

Use parallel computing (to address this):

[Figure: the "memory wall" — the growing gap between compute performance and memory access speed]

GPUs are massively parallel devices

[Figure: evolution of CPU and GPU peak performance over the years]

Taking a look at a recent GPU and CPU:

| Device        | TFLOP/s (FP64) | Memory BW [TB/s] |
| ------------- | -------------- | ---------------- |
| Tesla A100    | 9.7            | 1.55             |
| AMD EPYC 7282 | 0.7            | 0.085            |

Current GPUs (and CPUs) can do many more computations in a given amount of time than they can access numbers from main memory.

Quantify the imbalance:

$$\frac{\mathrm{computation\;peak\;performance\;[TFLOP/s]}}{\mathrm{memory\;access\;peak\;performance\;[TB/s]}} \times \mathrm{size\;of\;a\;number\;[Bytes]}$$

(Theoretical peak performance values as specified by the vendors can be used).

Back to our hardware:

| Device        | TFLOP/s (FP64) | Memory BW [TB/s] | Imbalance (FP64)     |
| ------------- | -------------- | ---------------- | -------------------- |
| Tesla A100    | 9.7            | 1.55             | 9.7 / 1.55 × 8 = 50  |
| AMD EPYC 7282 | 0.7            | 0.085            | 0.7 / 0.085 × 8 = 66 |

(here computed with double precision values)

Meaning: we can do 50 (GPU) and 66 (CPU) floating-point operations per number accessed from main memory. Floating-point operations are "for free" when we work in memory-bound regimes.

➡ Requires to re-think the numerical implementation and solution strategies

On the scientific application side

First derivative example $∂A/∂x$:

If we "naively" compare the "cost" of an isolated evaluation of a finite-difference first derivative, e.g., computing a flux $q$:

$$q = -D~\frac{∂A}{∂x}~,$$

which in the discrete form reads q[ix] = -D*(A[ix+1]-A[ix])/dx.

The cost of evaluating q[ix] = -D*(A[ix+1]-A[ix])/dx:

1 read + 1 write => 2 × 8 = 16 Bytes transferred

1 (fused) addition and division => 1 floating point operation

assuming:

GPUs and CPUs perform 50 - 60 floating-point operations per number accessed from main memory

First derivative evaluation requires transferring 2 numbers per floating-point operation

The FLOPS metric is thus no longer the most adequate one for reporting the performance of many modern applications on modern hardware.

Effective memory throughput metric $T_\mathrm{eff}$

Need for a memory throughput-based performance evaluation metric: $T_\mathrm{eff}$ [GB/s]

➡ Evaluate the performance of iterative stencil-based solvers.

The effective memory access $A_\mathrm{eff}$ [GB]

Sum of:

  • twice the memory footprint of the unknown fields, $D_\mathrm{u}$ (fields that depend on their own history and need to be read and written every iteration), and

  • the memory footprint of the known fields, $D_\mathrm{k}$, that do not change every iteration (read only).

$$A_\mathrm{eff} = 2~D_\mathrm{u} + D_\mathrm{k}$$

The effective memory access divided by the execution time per iteration, $t_\mathrm{it}$ [sec], defines the effective memory throughput, $T_\mathrm{eff}$ [GB/s]:

$$T_\mathrm{eff} = \frac{A_\mathrm{eff}}{t_\mathrm{it}}$$

The upper bound of $T_\mathrm{eff}$ is $T_\mathrm{peak}$ as measured, e.g., by McCalpin, 1995 for CPUs or a GPU analogue.

Defining the $T_\mathrm{eff}$ metric, we assume that:

  1. we evaluate an iterative stencil-based solver,

  2. the problem size is much larger than the cache sizes and

  3. the usage of time blocking is not feasible or advantageous (reasonable for real-world applications).

💡 Note
Fields that do not depend on their own history do not enter the effective memory access; such fields can be re-computed on the fly or stored on-chip.

As a first task, we'll compute $T_\mathrm{eff}$ for the 2D fluid pressure (diffusion) solver at the core of the porous convection algorithm from the previous lecture.

👉 Download the script l5_Pf_diffusion_2D.jl to get started.

To-do list: add a timer and the performance metrics, then deactivate visualisation (and error checking).

💻 Let's get started

Timer and performance

t_toc = ...
A_eff = ...          # Effective main memory access per iteration [GB]
t_it  = ...          # Execution time per iteration [s]
T_eff = A_eff/t_it   # Effective memory throughput [GB/s]
@printf("Time = %1.3f sec, ... \n", t_toc, ...)

Deactivate visualisation (and error checking)

function Pf_diffusion_2D(;??)
    ...
    return
end
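A minimal sketch of how the switch could look (the do_check keyword and ncheck interval are assumed names):

function Pf_diffusion_2D(; do_check=false)
    nx = ny = 511
    maxiter = max(nx, ny); ncheck = ceil(Int, 0.25max(nx, ny))
    for iter = 1:maxiter
        # physics updates go here
        if do_check && iter % ncheck == 0
            # error checking and visualisation run only in this branch
        end
    end
    return
end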

So far so good, we now have a timer.

Let's also boost resolution to nx = ny = 511 and set maxiter = max(nx,ny) to have the code running ~1 sec.

In the next part, we'll work on a multi-threading implementation.

Parallel computing on CPUs

Towards implementing shared memory parallelisation using multi-threading capabilities of modern multi-core CPUs.

We'll work it out in 3 steps:

  1. Precomputing scalars, removing divisions (and non-necessary arrays)

  2. Back to loops I

  3. Back to loops II - compute functions (future "kernels")

  1. Precomputing scalars, removing divisions (and non-necessary arrays)

Let's duplicate Pf_diffusion_2D_Teff.jl and rename it as Pf_diffusion_2D_perf.jl.

k_ηf_dx, k_ηf_dy = k_ηf/dx, k_ηf/dy
_1_θ_dτ = 1.0/(1.0 + θ_dτ)
_dx, _dy = 1.0/dx, 1.0/dy

  2. Back to loops I

First, duplicate Pf_diffusion_2D_perf.jl and rename it as Pf_diffusion_2D_perf_loop.jl.

The goal is now to write out the diffusion physics in a loop fashion over the x and y dimensions.

Implement a nested loop, taking care of bounds and staggering.

for iy=??
    for ix=??
        qDx[??] -= (qDx[??] + k_ηf_dx* ?? )*_1_θ_dτ
    end
end
for iy=??
    for ix=??
        qDy[??] -= (qDy[??] + k_ηf_dy* ?? )*_1_θ_dτ
    end
end
for iy=??
    for ix=??
        Pf[??]  -= ??
    end
end
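As a hint, the first loop nest could read as follows (a sketch assuming qDx has size (nx+1, ny) and inner fluxes are updated; the qDy and Pf nests follow the same pattern):

for iy = 1:ny
    for ix = 1:nx-1
        # flux update without division, using the precomputed scalars
        qDx[ix+1,iy] -= (qDx[ix+1,iy] + k_ηf_dx*(Pf[ix+1,iy] - Pf[ix,iy]))*_1_θ_dτ
    end
end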

We could now use macros to make the code nicer and clearer. Macro expressions are replaced during parsing (prior to compilation). Also, macros can take arguments, which are referenced by prepending $ to them in the definition.

Let's use macros to replace the derivative implementations

macro d_xa(A)  esc(:( $A[??]-$A[??] )) end
macro d_ya(A)  esc(:( $A[??]-$A[??] )) end
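One possible definition, assuming the loop indices at the call site are named ix and iy:

macro d_xa(A)  esc(:( $A[ix+1,iy] - $A[ix,iy] )) end
macro d_ya(A)  esc(:( $A[ix,iy+1] - $A[ix,iy] )) end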

And update the code within the iteration loop:

for iy=??
    for ix=??
        qDx[??] -= (qDx[??] + k_ηf_dx* ?? )*_1_θ_dτ
    end
end
for iy=??
    for ix=??
        qDy[??] -= (qDy[??] + k_ηf_dy* ?? )*_1_θ_dτ
    end
end
for iy=??
    for ix=??
        Pf[??]  -= ??
    end
end
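With these macros in place, the flux update sketched above could, e.g., shrink to:

qDx[ix+1,iy] -= (qDx[ix+1,iy] + k_ηf_dx*@d_xa(Pf))*_1_θ_dτ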

Performance is already significantly better with the loop version 🚀.

The reasons are that diff() allocates and that Julia is overall well optimised for executing loops.

Let's now implement the final step.

  3. Back to loops II

Duplicate Pf_diffusion_2D_perf_loop.jl and rename it as Pf_diffusion_2D_perf_loop_fun.jl.

In this last step, the goal is to define compute functions to hold the physics calculations, and to call those within the time loop.

Create compute_flux!() and update_Pf!() functions that take the input and output arrays and the needed scalars as arguments and return nothing.

function compute_flux!(...)
    nx,ny=size(Pf)
    ...
    return nothing
end

function update_Pf!(Pf,...)
    nx,ny=size(Pf)
    ...
    return nothing
end
💡 Note
Functions that modify arguments take a ! in their name, a Julia convention.
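A sketch of what the two functions could look like (argument lists are an assumption; _β_dτ denotes a precomputed 1.0/β_dτ):

function compute_flux!(qDx, qDy, Pf, k_ηf_dx, k_ηf_dy, _1_θ_dτ)
    nx, ny = size(Pf)
    for iy = 1:ny, ix = 1:nx-1
        qDx[ix+1,iy] -= (qDx[ix+1,iy] + k_ηf_dx*(Pf[ix+1,iy] - Pf[ix,iy]))*_1_θ_dτ
    end
    for iy = 1:ny-1, ix = 1:nx
        qDy[ix,iy+1] -= (qDy[ix,iy+1] + k_ηf_dy*(Pf[ix,iy+1] - Pf[ix,iy]))*_1_θ_dτ
    end
    return nothing
end

function update_Pf!(Pf, qDx, qDy, _dx, _dy, _β_dτ)
    nx, ny = size(Pf)
    for iy = 1:ny, ix = 1:nx
        # pressure update from the flux divergence
        Pf[ix,iy] -= ((qDx[ix+1,iy] - qDx[ix,iy])*_dx + (qDy[ix,iy+1] - qDy[ix,iy])*_dy)*_β_dτ
    end
    return nothing
end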

The compute_flux!() and update_Pf!() functions can then be called within the time loop.

This last implementation executes a bit faster than the previous one, as functions allow Julia to further optimise during just-ahead-of-time compilation.

💡 Note
For optimal performance assessment, bound-checking should be deactivated. This can be achieved by adding @inbounds in front of the compute statement, or running the scripts (or launching Julia) with the --check-bounds=no option.

Various timing and benchmarking tools are available in Julia's ecosystem to track performance issues. Julia's Base exposes the @time macro, which reports timing and allocation estimates. The BenchmarkTools.jl package provides finer-grained timing and benchmarking tooling, namely the @btime, @belapsed and @benchmark macros, among others.

Let's evaluate the performance of our code using BenchmarkTools. We will need to wrap the two compute kernels into a compute!() function in order to call it using @belapsed. Query ? @belapsed in Julia's REPL to learn more.

The compute!() function:

function compute!(Pf,qDx,qDy, ???)
    compute_flux!(...)
    update_Pf!(...)
    return nothing
end

can then be called using @belapsed to return the elapsed time for a single iteration, letting BenchmarkTools take care of sampling:

t_toc = @belapsed compute!($Pf,$qDx,$qDy,???)
niter = ???
💡 Note
Note that variables need to be interpolated into the function call, thus taking a $ in front.
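Putting it together, a possible compute!() wrapper and its benchmarking call (argument names follow the sketches above; requires using BenchmarkTools):

function compute!(Pf, qDx, qDy, k_ηf_dx, k_ηf_dy, _1_θ_dτ, _dx, _dy, _β_dτ)
    compute_flux!(qDx, qDy, Pf, k_ηf_dx, k_ηf_dy, _1_θ_dτ)
    update_Pf!(Pf, qDx, qDy, _dx, _dy, _β_dτ)
    return nothing
end

t_toc = @belapsed compute!($Pf, $qDx, $qDy, $k_ηf_dx, $k_ηf_dy, $_1_θ_dτ, $_dx, $_dy, $_β_dτ)
niter = 1  # @belapsed already reports the time of a single compute!() call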

Shared memory parallelisation

Julia ships with built-in multi-threading support as part of its Base features.

The only 2 modifications needed to enable it in our code are:

  1. Place Threads.@threads in front of the outer loop definition

  2. Export the desired number of threads, e.g., export JULIA_NUM_THREADS=4, prior to launching Julia (or executing the script from the shell). You can also launch Julia with the -t option to set the desired number of threads. Setting -t auto will most likely use as many hardware threads as available on the machine.

The number of threads can be queried within a Julia session as follows: Threads.nthreads()
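Applied to the flux loop nest from above, this could look like:

Threads.@threads for iy = 1:ny
    for ix = 1:nx-1
        qDx[ix+1,iy] -= (qDx[ix+1,iy] + k_ηf_dx*(Pf[ix+1,iy] - Pf[ix,iy]))*_1_θ_dτ
    end
end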

💡 Note
For optimal performance, the number of threads should be identical to the number of physical cores of the target CPU (hardware threads).

Multi-threading and AVX (🚧 currently refactored)

Relying on Julia's LoopVectorization.jl package, it is possible to combine multi-threading with advanced vector extensions (AVX) optimisations, leveraging extensions to the x86 instruction set architecture.

To enable it:

  1. Add using LoopVectorization at the top of the script

  2. Replace Threads.@threads by @tturbo
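Applied to the same flux loop nest, a sketch could read:

using LoopVectorization

@tturbo for iy = 1:ny
    for ix = 1:nx-1
        qDx[ix+1,iy] -= (qDx[ix+1,iy] + k_ηf_dx*(Pf[ix+1,iy] - Pf[ix,iy]))*_1_θ_dτ
    end
end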

And here we go 🚀

Wrapping-up

back to Content

Unit testing in Julia

The Julia Test module

Basic unit tests in Julia

Basic unit tests

Simple unit testing can be performed with the @test and @test_throws macros:

using Test

@test true

Or another example

@test [1, 2] + [2, 1] == [3, 3]

Testing an expression which is a call using infix syntax, such as an approximate comparison:

@test π ≈ 3.14 atol=0.01
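Similarly, @test_throws asserts that an expression throws the expected exception:

@test_throws DomainError sqrt(-1.0)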

For example, suppose we want to check our new function square!(x) works as expected:

square!(x) = x^2

If the condition is true, a Pass is returned:

@test square!(5) == 25

If the condition is false, then a Fail is returned and an exception is thrown:

@test square!(5) == 20
Test Failed at none:1
  Expression: square!(5) == 20
   Evaluated: 25 == 20
Test.FallbackTestSetException("There was an error during testing")

Working with test sets

The @testset macro can be used to group tests into sets.

All the tests in a test set will be run, and at the end of the test set a summary will be printed.

If any of the tests failed, or could not be evaluated due to an error, the test set will then throw a TestSetException.

@testset "trigonometric identities" begin
    θ = 2/3*π
    @test sin(-θ) ≈ -sin(θ)
    @test cos(-θ) ≈ cos(θ)
    @test sin(2θ) ≈ 2*sin(θ)*cos(θ)
    @test cos(2θ) ≈ cos(θ)^2 - sin(θ)^2
end;

Let's try it with our square!() function

square!(x) = x^2

@testset "Square Tests" begin
    @test square!(5) == 25
    @test square!("a") == "aa"
    @test square!("bb") == "bbbb"
end;

If we now introduce a bug

square!(x) = x^2

@testset "Square Tests" begin
    @test square!(5) == 25
    @test square!("a") == "aa"
    @test square!("bb") == "bbbb"
    @test square!(5) == 20
end;
Square Tests: Test Failed at none:6
  Expression: square!(5) == 20
   Evaluated: 25 == 20
Stacktrace:
 [...]
Test Summary: | Pass  Fail  Total
Square Tests  |    3     1      4
Some tests did not pass: 3 passed, 1 failed, 0 errored, 0 broken.

The reporting then tells us that a test failed.

Wrapping-up

back to Content

Exercises - lecture 5

⚠️ Warning!
Exercises have to be handed in as monolithic Julia scripts (one code per script) and uploaded to your private (shared) GitHub repository, in a specific folder for each lecture. The git commit hash (or SHA) of the final push needs to be uploaded on Moodle (more).

Exercise 1 - Performance implementation

👉 See Logistics for submission details.

The goal of this exercise is to finalise the performance-oriented implementation of the 2D fluid pressure (diffusion) solver script from lecture 5.

👉 If needed, download the l5_Pf_diffusion_2D.jl script to get you started.

Task 1

Create a new folder in your GitHub repository for this week's (lecture 5) exercises. In there, create a new subfolder Pf_diffusion_2D where you will add the following scripts: Pf_diffusion_2D_Teff.jl, Pf_diffusion_2D_perf.jl and Pf_diffusion_2D_perf_loop_fun.jl.

💡 Note
Refer to this section in lecture 5 as a starting point; it describes which features are specific to each version of the diffusion 2D codes.

back to Content


Exercise 2 - Performance evaluation

👉 See Logistics for submission details.

The goal of this exercise is to write a code that assesses the peak memory throughput of your CPU, and to run a strong scaling benchmark using the fluid pressure diffusion 2D solver, reporting performance.

Task 1

In the Pf_diffusion_2D folder, create a new script named memcopy.jl. You can use the diffusion_2D_loop_fun.jl script from lecture 5 (or exercise 1) as a starting point.

  1. Rename the "main" function memcopy().

  2. Modify the script to keep only the following in the initialisation:

# Numerics
nx, ny  = 512, 512
nt      = 2e4
# array initialisation
C       = rand(Float64, nx, ny)
C2      = copy(C)
A       = copy(C)
  3. Implement 2 compute functions to perform the operation C2 = C + A, replacing the previous calculations (see the sketch after this list):

    • create an "array programming"-based function called compute_ap!() that includes a broadcasted version of the memory copy operation;

    • create a second "kernel programming"-based function called compute_kp!() that uses a loop-based implementation with multi-threading.

  4. Update the A_eff formula accordingly.

  5. Implement a switch to monitor performance using either a manual approach or BenchmarkTools, passing the bench value as a kwarg to the memcopy() main function with a default:

if bench == :loop
    # iteration loop with manual timing
    t_tic = 0.0
    for iter=1:nt
      (iter == 11) && (t_tic = Base.time())  # start the timer after 10 warm-up iterations (assumed choice)
      ...
    end
    t_toc = Base.time() - t_tic
elseif bench == :btool
    t_toc = @belapsed ...
end
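A possible sketch of the two compute functions and the updated A_eff (signatures and the exact A_eff expression are assumptions following the lecture's conventions):

compute_ap!(C2, C, A) = (C2 .= C .+ A; nothing)   # array programming: broadcasted update

function compute_kp!(C2, C, A)                    # kernel programming: multi-threaded loops
    nx, ny = size(C2)
    Threads.@threads for iy = 1:ny
        for ix = 1:nx
            C2[ix,iy] = C[ix,iy] + A[ix,iy]
        end
    end
    return nothing
end

A_eff = 3*nx*ny*sizeof(Float64)/1e9               # 2 reads (C, A) + 1 write (C2) per iteration [GB]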

Then, create a README.md file in the Pf_diffusion_2D folder to report the results for each of the following tasks (including a .png of the figure when instructed).

💡 Note
Use ![fig_name](./<relative-path>/my_fig.png) to insert a .png figure in the README.md.

Task 2

Report on a figure $T_\mathrm{eff}$ of your memcopy.jl code as function of the number of grid points nx × ny for the array and kernel programming approaches, respectively, using the BenchmarkTools implementation. Vary nx and ny such that

nx = ny = 16 * 2 .^ (1:8)

($T_\mathrm{eff}$ of your memcopy.jl code represents $T_\mathrm{peak}$, the peak memory throughput you can achieve on your CPU for a given implementation.)

On the same figure, report the best value of memcopy obtained using the manual loop-based approach (manual timer) to assess TpeakT_\mathrm{peak}.

💡 Note
For performance evaluation we only need the code to run for a couple of seconds; adapt nt accordingly (you could also, e.g., make nt a function of nx and ny, as sketched below). Ensure also to implement "warm-up" iterations.
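For instance, an assumed heuristic (not prescribed by the exercise) could be:

nt = max(min(2*10^4, 10^8 ÷ (nx*ny)), 20)  # fewer iterations on larger grids, keeping enough for warm-up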

Add the above figure in a new section of the Pf_diffusion_2D/README.md, and provide a minimal description of 1) the performed test, and 2) a short description of the result. Figure out the vendor-announced peak memory bandwidth for your CPU, add it to the figure and use it to discuss your results.

Task 3

Repeat the strong scaling benchmark you just realised in Task 2 using the various fluid pressure diffusion 2D codes (Pf_diffusion_2D_Teff.jl; Pf_diffusion_2D_perf.jl; Pf_diffusion_2D_loop_fun.jl, with both plain for and Threads.@threads loops for the latter).

Report on a figure $T_\mathrm{eff}$ of the 4 diffusion solver implementations as function of the number of grid points nx × ny. Vary nx and ny such that nx = ny = 16 * 2 .^ (1:8). Use the BenchmarkTools-based evaluation approach.

On the same figure, also report the best memory copy value (as, e.g., dashed lines) and vendor-announced values (if available - optional).

Add this second figure in a new section of the Pf_diffusion_2D/README.md, and provide a minimal description of 1) the performed test, and 2) a short description of the result.

back to Content


Exercise 3 - Unit tests

👉 See Logistics for submission details.

The goal of this exercise is to implement a test set of basic unit tests to verify the implementation of the diffusion 2D solver.

Task 1

In the Pf_diffusion_2D folder, duplicate the Pf_diffusion_2D_perf_loop_fun.jl script and rename it Pf_diffusion_2D_test.jl.

Implement a test set in order to test Pf[xtest, ytest] and assert that the values returned are approximately equal to the reference values below for the given values of nx = ny. Make sure to set do_check = false, i.e., to ensure the code runs 500 iterations.

xtest = [5, Int(cld(0.6*lx, dx)), nx-10]
ytest = Int(cld(0.5*ly, dy))

for

nx = ny = 16 * 2 .^ (2:5) .- 1
maxiter = 500

should match

| nx, ny | Pf[xtest, ytest]                                                 |
| ------ | ---------------------------------------------------------------- |
| 64     | [0.00785398056115133 0.007853980637555755 0.007853978592411982]  |
| 128    | [0.00787296974549236 0.007849556884184108 0.007847181374079883]  |
| 256    | [0.00740912103848251 0.009143711648167267 0.007419533048751209]  |
| 512    | [0.00566813765849919 0.004348785338575644 0.005618691590498087]  |

Report the output of the test set as code block in a new section of the README.md in the Pf_diffusion_2D folder.

💡 Note
Use triple backticks to generate code blocks in the README.md (more).

back to Content