Lecture 5

Agenda
📚 Parallel computing on CPUs & performance assessment, the $T_\mathrm{eff}$ metric
💻 Unit testing in Julia
🚧 Exercises:

  • CPU perf. codes for 2D diffusion and memcopy

  • Unit tests and testset implementation


Content

👉 get started with exercises


Parallel computing (on CPUs) and performance assessment

Performance

❓ some questions for you:

The goal of this lecture 5 is to introduce:

Performance limiters

Hardware

Recall from lecture 1 (why we do it) ...

Use parallel computing (to address this):

[Figure: the memory wall]

GPUs are massively parallel devices

[Figure: CPU vs GPU performance evolution]

Taking a look at a recent GPU and CPU:

| Device | TFLOP/s (FP64) | Memory BW [TB/s] |
| --- | --- | --- |
| Tesla A100 | 9.7 | 1.55 |
| AMD EPYC 7282 | 0.7 | 0.085 |

Current GPUs (and CPUs) can do many more computations in a given amount of time than they can access numbers from main memory.

Quantify the imbalance:

$$\frac{\mathrm{computation\;peak\;performance\;[TFLOP/s]}}{\mathrm{memory\;access\;peak\;performance\;[TB/s]}} \times \mathrm{size\;of\;a\;number\;[Bytes]}$$

(Theoretical peak performance values as specified by the vendors can be used).

Back to our hardware:

| Device | TFLOP/s (FP64) | Memory BW [TB/s] | Imbalance (FP64) |
| --- | --- | --- | --- |
| Tesla A100 | 9.7 | 1.55 | 9.7 / 1.55 × 8 ≈ 50 |
| AMD EPYC 7282 | 0.7 | 0.085 | 0.7 / 0.085 × 8 ≈ 66 |

(here computed with double precision values)

Meaning: we can do 50 (GPU) and 66 (CPU) floating point operations per number accessed from main memory. Floating point operations are "for free" when we work in memory-bounded regimes.

➡ This requires re-thinking the numerical implementation and solution strategies.

On the scientific application side

First derivative example $\partial A/\partial x$:

If we "naively" compare the "cost" of an isolated evaluation of a finite-difference first derivative, e.g., computing a flux qq:

q=D Ax ,q = -D~\frac{∂A}{∂x}~,

which in the discrete form reads q[ix] = -D*(A[ix+1]-A[ix])/dx.

The cost of evaluating q[ix] = -D*(A[ix+1]-A[ix])/dx:

1 read + 1 write => 2 × 8 = 16 Bytes transferred

1 (fused) addition and division => 1 floating point operation

assuming:

GPUs and CPUs perform 50 - 66 floating-point operations per number accessed from main memory

First derivative evaluation requires transferring 2 numbers per floating-point operation

The FLOPS metric is no longer the most adequate for reporting the application performance of many modern applications on modern hardware.

Effective memory throughput metric $T_\mathrm{eff}$

Need for a memory throughput-based performance evaluation metric: $T_\mathrm{eff}$ [GB/s]

➡ Evaluate the performance of iterative stencil-based solvers.

The effective memory access $A_\mathrm{eff}$ [GB] is the sum of:

  • twice the memory footprint of the unknown fields, $D_\mathrm{u}$ (fields that depend on their own history and need to be both read and updated every iteration), and

  • the memory footprint of the known fields, $D_\mathrm{k}$, which do not change every iteration (read only):

$$A_\mathrm{eff} = 2~D_\mathrm{u} + D_\mathrm{k}$$

The effective memory access divided by the execution time per iteration, $t_\mathrm{it}$ [sec], defines the effective memory throughput, $T_\mathrm{eff}$ [GB/s]:

$$T_\mathrm{eff} = \frac{A_\mathrm{eff}}{t_\mathrm{it}}$$

The upper bound of $T_\mathrm{eff}$ is $T_\mathrm{peak}$ as measured, e.g., by McCalpin, 1995 for CPUs or a GPU analogue.

Defining the $T_\mathrm{eff}$ metric, we assume that:

  1. we evaluate an iterative stencil-based solver,

  2. the problem size is much larger than the cache sizes and

  3. the usage of time blocking is not feasible or advantageous (reasonable for real-world applications).

💡 Note
Fields within the effective memory access that do not depend on their own history are not included; such fields can be re-computed on the fly or stored on-chip.

As a first task, we'll compute $T_\mathrm{eff}$ for the 2D fluid pressure (diffusion) solver at the core of the porous convection algorithm from the previous lecture.

👉 Download the script l5_Pf_diffusion_2D.jl to get started.

To-do list:

💻 Let's get started

Timer and performance

t_tic = Base.time()                      # start the timer (before the iteration loop)
# ... iteration loop ...
t_toc = Base.time() - t_tic
A_eff = (3*2)/1e9*nx*ny*sizeof(Float64)  # effective main memory access per iteration [GB]: 3 arrays, 1 read + 1 write each
t_it  = t_toc/niter                      # execution time per iteration [s]
T_eff = A_eff/t_it                       # effective memory throughput [GB/s]
@printf("Time = %1.3f sec, T_eff = %1.2f GB/s (niter = %d)\n", t_toc, round(T_eff, sigdigits=3), niter)

Deactivate visualisation (and error checking)

function Pf_diffusion_2D(;do_check=false)
    # ...
    if do_check && (iter%ncheck == 0)
        # error checking and visualisation
    end
    # ...
    return
end

So far so good, we have now a timer.

Let's also boost the resolution to nx = ny = 511 and set maxiter = max(nx,ny) to have the code run for ~1 sec.
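In the numerics section of the script, this could look as follows (a sketch; variable names as in the solver above):

nx, ny  = 511, 511
maxiter = max(nx, ny)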

In the next part, we'll work on a multi-threading implementation.

Parallel computing on CPUs

Towards implementing shared memory parallelisation using multi-threading capabilities of modern multi-core CPUs.

We'll work it out in 3 steps:

  1. Precomputing scalars, removing divisions (and unnecessary arrays)

  2. Back to loops I

  3. Back to loops II - compute functions (future "kernels")

1. Precomputing scalars, removing divisions and unnecessary arrays

Let's duplicate Pf_diffusion_2D_Teff.jl and rename it as Pf_diffusion_2D_perf.jl.

k_ηf_dx, k_ηf_dy = k_ηf/dx, k_ηf/dy  # precompute scalars and reciprocals to avoid divisions in the loop
_1_θ_dτ  = 1.0/(1.0 + θ_dτ)
_β_dτ    = 1.0/β_dτ
_dx, _dy = 1.0/dx, 1.0/dy

2. Back to loops I

As a first step, duplicate Pf_diffusion_2D_perf.jl and rename it as Pf_diffusion_2D_perf_loop.jl.

The goal is now to write out the diffusion physics in a loop fashion over the $x$ and $y$ dimensions.

Implement a nested loop, taking care of bounds and staggering.

for iy=1:ny
    for ix=1:nx-1
        qDx[ix+1,iy] -= (qDx[ix+1,iy] + k_ηf_dx*(Pf[ix+1,iy]-Pf[ix,iy]))*_1_θ_dτ
    end
end
for iy=1:ny-1
    for ix=1:nx
        qDy[ix,iy+1] -= (qDy[ix,iy+1] + k_ηf_dy*(Pf[ix,iy+1]-Pf[ix,iy]))*_1_θ_dτ
    end
end
for iy=1:ny
    for ix=1:nx
        Pf[ix,iy]  -= ((qDx[ix+1,iy]-qDx[ix,iy])*_dx + (qDy[ix,iy+1]-qDy[ix,iy])*_dy)*_β_dτ
    end
end
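Note that the inner loop runs over ix, the first array dimension: Julia stores arrays in column-major order, so this accesses memory contiguously.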

We could now use macros to make the code nicer and clearer. Macro expressions are replaced during pre-processing (prior to compilation). Also, macros can take arguments, which are interpolated into the macro body using $.

Let's use macros to replace the derivative implementations

macro d_xa(A)  esc(:( $A[ix+1,iy]-$A[ix,iy] )) end
macro d_ya(A)  esc(:( $A[ix,iy+1]-$A[ix,iy] )) end
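Here, esc() ensures the interpolated array name is resolved in the scope where the macro is called, not where it is defined.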

And update the code within the iteration loop:

for iy=1:ny
    for ix=1:nx-1
        qDx[ix+1,iy] -= (qDx[ix+1,iy] + k_ηf_dx*@d_xa(Pf))*_1_θ_dτ
    end
end
for iy=1:ny-1
    for ix=1:nx
        qDy[ix,iy+1] -= (qDy[ix,iy+1] + k_ηf_dy*@d_ya(Pf))*_1_θ_dτ
    end
end
for iy=1:ny
    for ix=1:nx
        Pf[ix,iy]  -= (@d_xa(qDx)*_dx + @d_ya(qDy)*_dy)*_β_dτ
    end
end

Performance is already significantly better with the loop version 🚀.

The reasons are that diff() allocates temporary arrays and that Julia is overall well optimised for executing loops.

Let's now implement the final step.

3. Back to loops II

Duplicate Pf_diffusion_2D_perf_loop.jl and rename it as Pf_diffusion_2D_perf_loop_fun.jl.

In this last step, the goal is to define compute functions to hold the physics calculations, and to call those within the time loop.

Create compute_flux!() and update_Pf!() functions that take the input and output arrays and needed scalars as arguments and return nothing.

function compute_flux!(qDx,qDy,Pf,k_ηf_dx,k_ηf_dy,_1_θ_dτ)
    nx,ny=size(Pf)
    for iy=1:ny
        for ix=1:nx-1
            qDx[ix+1,iy] -= (qDx[ix+1,iy] + k_ηf_dx*@d_xa(Pf))*_1_θ_dτ
        end
    end
    for iy=1:ny-1
        for ix=1:nx
            qDy[ix,iy+1] -= (qDy[ix,iy+1] + k_ηf_dy*@d_ya(Pf))*_1_θ_dτ
        end
    end
    return nothing
end

function update_Pf!(Pf,qDx,qDy,_dx,_dy,_β_dτ)
    nx,ny=size(Pf)
    for iy=1:ny
        for ix=1:nx
            Pf[ix,iy]  -= (@d_xa(qDx)*_dx + @d_ya(qDy)*_dy)*_β_dτ
        end
    end
    return nothing
end
💡 Note
Functions that modify arguments take a ! in their name, a Julia convention.

The compute_flux!() and update_Pf!() functions can then be called within the time loop.

This last implementation executes a bit faster than the previous one, as functions allow Julia to further optimise during just-ahead-of-time compilation.

💡 Note
For optimal performance assessment, bound-checking should be deactivated. This can be achieved by adding @inbounds in front of the compute statement, or running the scripts (or launching Julia) with the --check-bounds=no option.
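For example, a minimal sketch adding @inbounds to the pressure-update kernel (the same applies to the flux kernel):

function update_Pf!(Pf,qDx,qDy,_dx,_dy,_β_dτ)
    nx,ny=size(Pf)
    for iy=1:ny
        for ix=1:nx
            @inbounds Pf[ix,iy] -= (@d_xa(qDx)*_dx + @d_ya(qDy)*_dy)*_β_dτ  # skip bound-checking on array accesses
        end
    end
    return nothing
end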

Various timing and benchmarking tools are available in Julia's ecosystem to track performance issues. Julia's Base exposes the @time macro, which returns timing and allocation estimates. The BenchmarkTools.jl package provides finer-grained timing and benchmarking tooling, namely the @btime, @belapsed and @benchmark macros, among others.
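For instance, comparing the two on a toy reduction (a quick sketch):

using BenchmarkTools
A = rand(1000, 1000)
@time  sum(A)   # single run; the first call includes compilation time
@btime sum($A)  # many samples; reports the minimum time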

Let's evaluate the performance of our code using BenchmarkTools. We will need to wrap the two compute kernels into a compute!() function in order to be able to call it using @belapsed. Query ?@belapsed in Julia's REPL to learn more.

The compute!() function:

function compute!(Pf,qDx,qDy,k_ηf_dx,k_ηf_dy,_1_θ_dτ,_dx,_dy,_β_dτ)
    compute_flux!(qDx,qDy,Pf,k_ηf_dx,k_ηf_dy,_1_θ_dτ)
    update_Pf!(Pf,qDx,qDy,_dx,_dy,_β_dτ)
    return nothing
end

can then be called using @belapsed to return the elapsed time for a single iteration, letting BenchmarkTools take care of sampling:

t_toc = @belapsed compute!($Pf,$qDx,$qDy,$k_ηf_dx,$k_ηf_dy,$_1_θ_dτ,$_dx,$_dy,$_β_dτ)
niter = 1
💡 Note
Note that variables need to be interpolated into the function call, thus taking a $ in front.

Shared memory parallelisation

Julia ships with multi-threading support as part of its Base features.

The only 2 modifications needed to enable it in our code are:

  1. Place Threads.@threads in front of the outer loop definition

  2. Export the desired number of threads, e.g., export JULIA_NUM_THREADS=4, prior to launching Julia (or executing the script from the shell). You can also launch Julia with the -t option to set the desired number of threads. Setting -t auto will most likely use as many hardware threads as available on the machine.

The number of threads can be queried within a Julia session as follows: Threads.nthreads().
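Applied to our flux kernel, step 1 looks as follows (a sketch; the qDy loop follows the same pattern):

function compute_flux!(qDx,qDy,Pf,k_ηf_dx,k_ηf_dy,_1_θ_dτ)
    nx,ny=size(Pf)
    Threads.@threads for iy=1:ny  # distribute the outer loop over the available threads
        for ix=1:nx-1
            qDx[ix+1,iy] -= (qDx[ix+1,iy] + k_ηf_dx*@d_xa(Pf))*_1_θ_dτ
        end
    end
    # same pattern for the qDy update
    return nothing
end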

💡 Note
For optimal performance, the number of threads should be identical to the number of physical cores of the target CPU (hardware threads).

Multi-threading and AVX (🚧 currently refactored)

Relying on Julia's LoopVectorization.jl package, it is possible to combine multi-threading with advanced vector extensions (AVX) optimisations, leveraging extensions to the x86 instruction set architecture.

To enable it:

  1. Add using LoopVectorization at the top of the script

  2. Replace Threads.@threads by @tturbo

And here we go 🚀
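As a sketch, the pressure-update kernel then reads (derivatives written out explicitly to keep the loop body simple for @tturbo):

using LoopVectorization

function update_Pf!(Pf,qDx,qDy,_dx,_dy,_β_dτ)
    nx,ny=size(Pf)
    @tturbo for iy=1:ny  # multi-threaded and SIMD-vectorised loop
        for ix=1:nx
            Pf[ix,iy] -= ((qDx[ix+1,iy]-qDx[ix,iy])*_dx + (qDy[ix,iy+1]-qDy[ix,iy])*_dy)*_β_dτ
        end
    end
    return nothing
end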

Wrapping-up

back to Content

Unit testing in Julia

The Julia Test module

Basic unit tests in Julia

Basic unit tests

Simple unit testing can be performed with the @test and @test_throws macros:

using Test

@test 1==1

@test_throws MethodError 1+"a" # the expected error must be provided too

Or another example

@test [1, 2] + [2, 1] == [3, 3]

Testing an expression which is a call using infix syntax, such as approximate comparisons (≈, typed as \approx + tab):

@test π ≈ 3.14 atol=0.01

For example, suppose we want to check our new function square(x) works as expected:

square(x) = x^2

If the condition is true, a Pass is returned:

@test square(5) == 25

If the condition is false, then a Fail is returned and an exception is thrown:

@test square(5) == 20
Test Failed at none:1
  Expression: square(5) == 20
   Evaluated: 25 == 20
Test.FallbackTestSetException("There was an error during testing")

Working with test sets

The @testset macro can be used to group tests into sets.

All the tests in a test set will be run, and at the end of the test set a summary will be printed.

If any of the tests failed, or could not be evaluated due to an error, the test set will then throw a TestSetException.

@testset "trigonometric identities" begin
    θ = 2/3*π
    @test sin(-θ) ≈ -sin(θ)
    @test cos(-θ) ≈ cos(θ)
    @test sin(2θ) ≈ 2*sin(θ)*cos(θ)
    @test cos(2θ) ≈ cos(θ)^2 - sin(θ)^2
end;

Let's try it with our square() function

square(x) = x^2

@testset "Square Tests" begin
    @test square(5) == 25
    @test square("a") == "aa"
    @test square("bb") == "bbbb"
end;

If we now introduce a bug

square(x) = x^2

@testset "Square Tests" begin
    @test square(5) == 25
    @test square("a") == "aa"
    @test square("bb") == "bbbb"
    @test square(5) == 20
end;
Square Tests: Test Failed at none:6
  Expression: square(5) == 20
   Evaluated: 25 == 20
Stacktrace:
 [...]
Test Summary: | Pass  Fail  Total
Square Tests  |    3     1      4
Some tests did not pass: 3 passed, 1 failed, 0 errored, 0 broken.

The summary then tells us that a test failed.

Where to put them, how to run them

The simplest approach is to just put the tests in your script, alongside the tested function. Then the tests run every time the script is executed.

However, for bigger pieces of software, such as packages, this becomes unwieldy and also undesirable (as we don't want tests to run all the time). Then tests are put into test/runtests.jl, from where they are run when called from package mode or from automated testing (CI).
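A minimal test/runtests.jl might look as follows (a sketch, assuming a hypothetical package MyPackage that exports square):

using Test
using MyPackage

@testset "MyPackage" begin
    @test square(5)    == 25
    @test square("ab") == "abab"
end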

Example for the package Literate.jl (we use that to generate the website):

julia> using Pkg

julia> Pkg.test("Literate")
     Testing Literate
   ...
     Testing Running tests...
Test Summary:  | Pass  Total  Time
Literate.parse |  533    533  0.8s
Test Summary:   | Pass  Broken  Total  Time
Literate.script |   36       1     37  2.1s
Test Summary:     | Pass  Broken  Total  Time
Literate.markdown |   82       2     84  4.3s
Test Summary:     | Pass  Broken  Total  Time
Literate.notebook |   86       1     87  5.2s
Test Summary: | Pass  Total  Time
Configuration |    7      7  0.3s
     Testing Literate tests passed

Wrapping-up

back to Content

Exercises - lecture 5

⚠️ Warning!
Exercises have to be handed in as monolithic Julia scripts (one code per script) and uploaded to your private (shared) GitHub repository, in a specific folder for each lecture. The git commit hash (or SHA) of the final push needs to be uploaded on Moodle (more).

Exercise 1 — Performance implementation

👉 See Logistics for submission details.

The goal of this exercise is to:

In this first exercise, you will finalise the performance-oriented implementation of the 2D fluid pressure (diffusion) solver script from lecture 5.

👉 If needed, download the l5_Pf_diffusion_2D.jl script to get you started.

Task 1

Create a new folder in your GitHub repository for this week's (lecture 5) exercises. In there, create a new subfolder Pf_diffusion_2D where you will add the following scripts:

💡 Note
Refer to this section in lecture 5 to capture the starting point describing which features are specific to each version of the diffusion 2D codes.

back to Content


Exercise 2 — Performance evaluation

👉 See Logistics for submission details.

The goal of this exercise is to:

For this exercise, you will write a code to assess the peak memory throughput of your CPU and run a strong scaling benchmark using the fluid pressure diffusion 2D solver and report performance.

Task 1

In the Pf_diffusion_2D folder, create a new script named memcopy.jl. You can use as starting point the diffusion_2D_loop_fun.jl script from lecture 5 (or exercise 1).

  1. Rename the "main" function memcopy()

  2. Modify the script to only keep the following in the initialisation:

# Numerics
nx, ny  = 512, 512
nt      = 2e4
# array initialisation
C       = rand(Float64, nx, ny)
C2      = copy(C)
A       = copy(C)
  3. Implement 2 compute functions to perform the operation C2 = C + A, replacing the previous calculations:

    • create an "array programming"-based function called compute_ap!() that includes a broadcasted version of the memory copy operation;

    • create a second "kernel programming"-based function called compute_kp!() that uses a loop-based implementation with multi-threading (a sketch of both is given after the snippet below).

  4. Update the A_eff formula accordingly.

  5. Implement a switch to monitor performance using either a manual approach or BenchmarkTools, and pass the bench value as a kwarg to the memcopy() main function, including a default:

if bench == :loop
    # manual timing: (re)set t_tic = Base.time() after a few warm-up iterations
    t_tic = 0.0
    for iter=1:nt
        # ...
    end
    t_toc = Base.time() - t_tic
elseif bench == :btool
    t_toc = @belapsed ...
end
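For orientation, a minimal sketch of the two kernels (one possible shape; function names as defined above, your implementation may differ):

function compute_ap!(C2, C, A)
    C2 .= C .+ A  # broadcasted "array programming" version
    return nothing
end

function compute_kp!(C2, C, A)
    nx, ny = size(C2)
    Threads.@threads for iy = 1:ny  # multi-threaded "kernel programming" version
        for ix = 1:nx
            C2[ix,iy] = C[ix,iy] + A[ix,iy]
        end
    end
    return nothing
end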

Then, create a README.md file in the Pf_diffusion_2D folder to report the results for each of the following tasks (including a .png of the figure when instructed).

💡 Note
Use ![fig_name](./<relative-path>/my_fig.png) to insert a .png figure in the README.md.

Task 2

Report on a figure $T_\mathrm{eff}$ of your memcopy.jl code as a function of the number of grid points nx × ny for the array and kernel programming approaches, respectively, using the BenchmarkTools implementation. Vary nx and ny such that

nx = ny = 16 * 2 .^ (1:8)

($T_\mathrm{eff}$ of your memcopy.jl code represents $T_\mathrm{peak}$, the peak memory throughput you can achieve on your CPU for a given implementation.)

On the same figure, report the best value of memcopy obtained using the manual loop-based approach (manual timer) to assess $T_\mathrm{peak}$.

💡 Note
For performance evaluation we only need the code to run for a couple of seconds; adapt nt accordingly (you could also, e.g., make nt a function of nx, ny). Also make sure to implement "warm-up" iterations.

Add the above figure in a new section of the Pf_diffusion_2D/README.md, and provide a minimal description of 1) the performed test, and 2) a short description of the result. Figure out the vendor-announced peak memory bandwidth for your CPU, add it to the figure and use it to discuss your results.

Task 3

Repeat the strong scaling benchmark you just performed in Task 2 using the various fluid pressure diffusion 2D codes (Pf_diffusion_2D_Teff.jl; Pf_diffusion_2D_perf.jl; Pf_diffusion_2D_loop_fun.jl - with/without Threads.@threads for the latter).

Report on a figure $T_\mathrm{eff}$ of the 4 diffusion solver implementations as a function of the number of grid points nx × ny. Vary nx and ny such that nx = ny = 16 * 2 .^ (1:8). Use the BenchmarkTools-based evaluation approach.

On the same figure, also report the best memory copy value (as, e.g., dashed lines) and vendor-announced values (if available - optional).

Add this second figure in a new section of the Pf_diffusion_2D/README.md, and provide a minimal description of 1) the performed test, and 2) a short description of the result.

back to Content


Exercise 3 — Unit tests

👉 See Logistics for submission details.

The goal of this exercise is to:

For this exercise, you will implement a test set of basic unit tests using Julia's built-in testing infrastructure to verify the implementation of the diffusion and acoustic 2D solvers.

Task 1

In the Pf_diffusion_2D folder, duplicate the Pf_diffusion_2D_perf_loop_fun.jl script and rename it Pf_diffusion_2D_test.jl.

Implement a test set using the @testset and @test macros from Test.jl in order to test Pf[xtest, ytest] and assert that the returned values are approximately equal to the ones listed below for the given values of nx = ny. Make sure to set do_check = false, i.e., to ensure the code runs 500 iterations.

xtest = [5, Int(cld(0.6*lx, dx)), nx-10]
ytest = Int(cld(0.5*ly, dy))

for

nx = ny = 16 * 2 .^ (2:5) .- 1
maxiter = 500

should match

| nx, ny | Pf[xtest, ytest] |
| --- | --- |
| 63 | [0.00785398056115133 0.007853980637555755 0.007853978592411982] |
| 127 | [0.00787296974549236 0.007849556884184108 0.007847181374079883] |
| 255 | [0.00740912103848251 0.009143711648167267 0.007419533048751209] |
| 511 | [0.00566813765849919 0.004348785338575644 0.005618691590498087] |

Report the output of the test set as code block in a new section of the README.md in the Pf_diffusion_2D folder.

💡 Note
Use triple backticks to generate code blocks in the README.md (more).

back to Content