Lecture 10

Agenda
📚 Distributed multi-xPU computing with ImplicitGlobalGrid.jl
💻 Documenting your code
🚧 Exercises:

  • 2D diffusion with multi-xPU


Content

👉 get started with exercises


Using ImplicitGlobalGrid.jl

The goal of lecture 10:

Distributed computing

Let's have a look at ImplicitGlobalGrid.jl's repository.

ImplicitGlobalGrid.jl can make distributed parallelisation on GPUs and CPUs for HPC a very simple task. Moreover, ImplicitGlobalGrid.jl combines elegantly with ParallelStencil.jl.

Finally, the cool part: using both packages together makes it possible to hide communication behind computation. This feature enables a parallel efficiency close to 1.

Getting started with ImplicitGlobalGrid

For this development, we'll start from the l9_diffusion_2D_perf_xpu.jl code.

Only a few changes are required to enable multi-xPU execution, namely:

  1. Initialise the implicit global grid

  2. Use global coordinates to compute the initial condition

  3. Update halo (and overlap communication with computation)

  4. Finalise the global grid

  5. Tune visualisation

But before we start programming the multi-xPU implementation, let's get set up with GPU MPI on daint.alps. The following steps are needed:

💡 Note
See GPU computing on Alps for detailed information on how to run MPI GPU (multi-GPU) applications on daint.alps.

To (1.) initialise the global grid, one first needs to use the package

using ImplicitGlobalGrid

Then, one can add the global grid initialisation in the # Derived numerics section

me, dims = init_global_grid(nx, ny, 1; select_device = false)  # Initialization of MPI and more...
dx, dy  = Lx/nx_g(), Ly/ny_g()

💡 Note
On Alps, SLURM makes the device with ID=0 visible to each MPI rank, which requires disabling device selection in the call to init_global_grid(...; select_device = false).

Then, for (2.), one can use x_g() and y_g() to compute the global coordinates in the initialisation (to correctly spread the Gaussian distribution over all local processes)

C       = @zeros(nx,ny)
C      .= Data.Array([exp(-(x_g(ix,dx,C)+dx/2 -Lx/2)^2 -(y_g(iy,dy,C)+dy/2 -Ly/2)^2) for ix=1:size(C,1), iy=1:size(C,2)])
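To make the idea of global coordinates concrete, here is a minimal pure-Julia sketch of the index arithmetic in one dimension. It assumes an overlap of 2 cells between neighbouring local grids, non-periodic boundaries, and an array of the same size as the local grid. The helpers `global_size` and `x_global` and the explicit `coord` argument are hypothetical stand-ins for illustration (the package tracks the process coordinates internally); they are not ImplicitGlobalGrid.jl's API.

```julia
# Sketch only: mimics the index arithmetic behind nx_g() and x_g() in 1D.
# Assumptions: overlap of 2 cells, non-periodic boundaries.
global_size(n, nprocs; overlap=2) = nprocs * (n - overlap) + overlap

# Global coordinate of local index `ix` on the process with 0-based
# Cartesian coordinate `coord` (hypothetical helper).
x_global(ix, dx, coord, n; overlap=2) = (coord * (n - overlap) + (ix - 1)) * dx

nx, nprocs_x, Lx = 64, 2, 10.0
nxg = global_size(nx, nprocs_x)    # number of global grid points in x
dx  = Lx / nxg

# The last two cells of process 0 coincide with the first two cells of process 1.
left  = x_global.(nx-1:nx, dx, 0, nx)
right = x_global.(1:2,     dx, 1, nx)
println(nxg, " ", left ≈ right)
```

In the initial condition above, `dx/2` is added to the value returned by `x_g` so that the Gaussian is evaluated at cell centres.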

The halo update (3.) can be performed simply by adding the following line after the compute! kernel

update_halo!(C)

Now, when running on GPUs, it is possible to hide MPI communication behind computations!

This option is implemented as follows:

@hide_communication (8, 2) begin
    @parallel compute!(C2, C, D_dx, D_dy, dt, _dx, _dy, size_C1_2, size_C2_2)
    C, C2 = C2, C # pointer swap
    update_halo!(C)
end

The @hide_communication (8, 2) will first compute the first and last 8 grid points in the x dimension and the first and last 2 grid points in the y dimension. Then, while the boundaries are being exchanged, the rest of the local domain is computed (overlapping the MPI communication).
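As a rough illustration of how the boundary widths partition a local domain, the following pure-Julia sketch counts the cells in the boundary strips and in the inner region for a 64 x 64 local grid. The helper `split_region` is hypothetical and only mirrors the description above; it is not how ParallelStencil.jl implements the macro.

```julia
# Sketch only: index ranges of the boundary strips and the inner region
# along one dimension, for a boundary width `w`.
function split_region(n, w)
    first_strip = 1:w
    inner       = w+1:n-w
    last_strip  = n-w+1:n
    return first_strip, inner, last_strip
end

nx, ny = 64, 64
wx, wy = 8, 2                      # widths passed to @hide_communication
_, inner_x, _ = split_region(nx, wx)
_, inner_y, _ = split_region(ny, wy)

n_inner    = length(inner_x) * length(inner_y)   # cells updated during communication
n_boundary = nx * ny - n_inner                   # cells updated before communication starts
println("inner: ", n_inner, ", boundary: ", n_boundary)
```

Wider boundary strips mean more work has to finish before the exchange can start, while narrower strips leave more inner work available to overlap with it; which choice performs best is something to measure.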

To (4.) finalise the global grid,

finalize_global_grid()

needs to be added before the return of the "main".

The last change to take care of is to (5.) handle visualisation in an appropriate fashion. Here, several options exist.

To implement the latter and generate a gif, one needs to define a global array for visualisation:

if do_visu
    if (me==0)
        ENV["GKSwstype"]="nul"
        if !isdir("viz2D_mxpu_out") mkdir("viz2D_mxpu_out") end
        loadpath = "./viz2D_mxpu_out/"
        anim = Animation(loadpath,String[])
        println("Animation directory: $(anim.dir)")
    end
    nx_v, ny_v = (nx-2)*dims[1], (ny-2)*dims[2]
    if (nx_v*ny_v*sizeof(Data.Number) > 0.8*Sys.free_memory()) error("Not enough memory for visualization.") end
    C_v   = zeros(nx_v, ny_v) # global array for visu
    C_inn = zeros(nx-2, ny-2) # no halo local array for visu
    xi_g, yi_g = LinRange(dx+dx/2, Lx-dx-dx/2, nx_v), LinRange(dy+dy/2, Ly-dy-dy/2, ny_v) # inner points only
end

Then, the plotting routine can be adapted to first gather the inner points of the local domains into the global array (using the gather! function) and then plot and/or save the global array (here C_v) from the master process me==0:

# Visualize
if do_visu && (it % nout == 0)
    C_inn .= Array(C)[2:end-1,2:end-1]; gather!(C_inn, C_v)
    if (me==0)
        opts = (aspect_ratio=1, xlims=(xi_g[1], xi_g[end]), ylims=(yi_g[1], yi_g[end]), clims=(0.0, 1.0), c=:turbo, xlabel="Lx", ylabel="Ly", title="time = $(round(it*dt, sigdigits=3))")
        heatmap(xi_g, yi_g, Array(C_v)'; opts...); frame(anim)
    end
end
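The gathering step can be pictured with a single-process sketch that assembles the inner points of each local block into one global array. The loop over process coordinates, the block placement, and the `rank` numbering are assumptions made for illustration; in the real code `gather!` performs this collection across MPI ranks.

```julia
# Sketch only: emulate on ONE process what gathering inner blocks into C_v achieves.
nx, ny = 4, 4                      # local size including the halo
dims   = (2, 2)                    # assumed process topology
nx_v, ny_v = (nx-2)*dims[1], (ny-2)*dims[2]
C_v = zeros(nx_v, ny_v)            # global array for visualisation

for cy in 0:dims[2]-1, cx in 0:dims[1]-1
    rank  = cx * dims[2] + cy                  # hypothetical rank numbering
    C_loc = fill(Float64(rank), nx, ny)        # stand-in for the local array C
    C_inn = C_loc[2:end-1, 2:end-1]            # strip the halo
    # place the block at the position given by the process coordinates
    C_v[cx*(nx-2) .+ (1:nx-2), cy*(ny-2) .+ (1:ny-2)] .= C_inn
end
println(size(C_v))
```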

To finally generate the gif, one needs to place the following after the time loop:

if (do_visu && me==0) gif(anim, "diffusion_2D_mxpu.gif", fps = 5)  end

💡 Note
Here we relied on CUDA-aware MPI. To use this feature, set (and export) IGG_CUDAAWARE_MPI=1. Note that the examples using ImplicitGlobalGrid.jl would also work with USE_GPU = false; however, the overlap of communication and computation is then not yet available, as its implementation currently relies on GPU streams.

Hiding communication

Hiding communication behind computation is a common optimisation technique in distributed stencil computing.

You can think about it as each MPI rank being a watermelon.

  1. We first want to compute the updates for the crust region (green), and then directly start the MPI non-blocking communication (Isend/Irecv).

  2. In the meantime, we asynchronously compute the updates of the inner region (red) of the watermelon.

The aim is to hide the communication started in step (1.) behind the computation of step (2.).
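The order of operations can be sketched in pure Julia with a 1D diffusion step. Two overlapping arrays on a single process stand in for two MPI ranks, and a Julia Task stands in for the non-blocking Isend/Irecv pair. All names here are hypothetical, and on a single thread nothing runs truly concurrently, so the sketch shows only the sequencing and that the result matches an update of the undivided array.

```julia
# Sketch only: crust first, start the "communication", inner region, then wait.
function update!(C2, C, idx, λ)
    for i in idx
        C2[i] = C[i] + λ * (C[i-1] - 2C[i] + C[i+1])
    end
    return C2
end

n, w, λ = 8, 2, 0.25               # local size (incl. 2-cell overlap), crust width, D*dt/dx^2
ng = 2 * (n - 2) + 2               # global size for two subdomains
G  = [exp(-(i - ng / 2)^2 / 4) for i in 1:ng]
G2 = update!(copy(G), G, 2:ng-1, λ)        # reference: one step on the global array

A, B   = G[1:n], G[n-1:ng]         # the two "ranks"
A2, B2 = copy(A), copy(B)
crust  = [2:1+w; n-w:n-1]          # cells next to the local boundaries
inner  = 2+w:n-w-1                 # everything else

# 1. Update the crust, then start the halo exchange.
update!(A2, A, crust, λ); update!(B2, B, crust, λ)
halo = @async begin
    A2[n] = B2[2]                  # "receive" from the right neighbour
    B2[1] = A2[n-1]                # "receive" from the left neighbour
end
# 2. Update the inner region while the exchange is in flight.
update!(A2, A, inner, λ); update!(B2, B, inner, λ)
wait(halo)

println([A2; B2[3:end]] ≈ G2)
```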

We can examine the effect of hiding communication by looking at the profiler trace produced by running a 3D diffusion code under the NVIDIA Nsight Systems profiler (see the Profiling on Alps section about how to launch the profiler).

hidecomm

no hidecomm


Documenting your code

In this lecture we will learn:


Why should I document my code?

Why should I write code comments?

Why should I write documentation?

Documentation easily rots...

Worse than no documentation/code comments is documentation which is outdated.

I find the best way to keep documentation up to date is:

Documentation tools: doc-strings

A Julia doc-string (Julia manual):

"Typical size of a beer crate"
const BEERBOX = 12
?BEERBOX

Documentation tools: doc-strings with examples

One can add examples to doc-strings (they can even be part of testing: doc-tests).

(Run it in the REPL and copy paste to the docstring.)

"""
    transform(r, θ)

Transform polar `(r,θ)` to cartesian coordinates `(x,y)`.

# Example
```jldoctest
julia> transform(4.5, pi/5)
(3.6405764746872635, 2.6450336353161292)
```
"""
transform(r, θ) = (r*cos(θ), r*sin(θ))
?transform

Documentation tools: GitHub markdown rendering

The easiest way to write long-form documentation is to just use GitHub's markdown rendering.

A nice example is this short course by Ludovic (incidentally about solving PDEs on GPUs 🙂).

👉 this is a good and low-overhead way to produce pretty nice documentation

Documentation tools: Literate.jl

There are several tools which render .jl files (with special formatting) into markdown files. These files can then be added to GitHub and will be rendered there.

Example

Literate.markdown("car_travels.jl", directory_of_this_file, execute=true, documenter=false, credit=false)

But this is not automatic! Manual steps: run Literate, add files, commit and push...

or use GitHub Actions...

Documentation tools: Automating Literate.jl

Demonstrated in the repo course-101-0250-00-L8Documentation.jl

name: Run Literate.jl
# adapted from https://lannonbr.com/blog/2019-12-09-git-commit-in-actions

on: push

jobs:
  lit:
    runs-on: ubuntu-latest
    steps:
      # Checkout the branch
      - uses: actions/checkout@v4

      - uses: julia-actions/setup-julia@latest
        with:
          version: '1.12'
          arch: x64

      - uses: julia-actions/cache@v1
      - uses: julia-actions/julia-buildpkg@latest

      - name: run Literate
        run: QT_QPA_PLATFORM=offscreen julia --color=yes --project -e 'cd("scripts"); include("literate-script.jl")'

      - name: setup git config
        run: |
          # setup the username and email. I tend to use 'GitHub Actions Bot' with no email by default
          git config user.name "GitHub Actions Bot"
          git config user.email "<>"

      - name: commit
        run: |
          # Stage the file, commit and push
          git add scripts/md/*
          git commit -m "Commit markdown files from Literate"
          git push origin master

Documentation tools: Documenter.jl

If you want to have full-blown documentation, including, e.g., automatic API documentation generation and versioning, then use Documenter.jl.

Examples:

Notes:


Exercises - lecture 10

Exercise 1 — Multi-xPU computing

👉 See Logistics for submission details.

The goal of this exercise is to:

In this exercise, you will:

Start by fetching the l9_diffusion_2D_perf_xpu.jl code from the scripts/l9_scripts folder and copying it to your lecture_10 folder.

Make a copy and rename it diffusion_2D_perf_multixpu.jl.

Task 1

Follow the steps listed in the section from lecture 10 about using ImplicitGlobalGrid.jl to add multi-xPU support to the 2D diffusion code.

The 5 steps you'll need to implement are summarised hereafter:

  1. Initialise the implicit global grid

  2. Use global coordinates to compute the initial condition

  3. Update halo (and overlap communication with computation)

  4. Finalise the global grid

  5. Tune visualisation

Once the above steps are implemented, head to daint.alps and either configure an salloc or prepare an sbatch script to access 1 node.

Task 2

Run the single xPU l9_diffusion_2D_perf_xpu.jl code on a single CPU and on a single GPU (changing the USE_GPU flag accordingly) for the following parameters

# Physics
Lx, Ly  = 10.0, 10.0
D       = 1.0
ttot    = 1.0
# Numerics
nx, ny  = 126, 126
nout    = 20

and save the output C data. Confirm that the difference between the CPU and GPU implementations is negligible, and report it in a new section of the README.md for this exercise within the lecture_10 folder in your shared private GitHub repo.

Task 3

Then run the newly created diffusion_2D_perf_multixpu.jl script with the following parameters on 4 MPI processes, having set USE_GPU = true:

# Physics
Lx, Ly  = 10.0, 10.0
D       = 1.0
ttot    = 1e0
# Numerics
nx, ny  = 64, 64 # number of grid points
nout    = 20
# Derived numerics
me, dims = init_global_grid(nx, ny, 1; select_device = false)  # Initialization of MPI and more...

Save the global C_v output array. Ensure its size matches the inner points of the output produced by the single xPU code (C[2:end-1,2:end-1]) and then compare the results to the 2 existing outputs produced in Task 2.
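The size check can be done by hand before running anything. The sketch below assumes that the 4 MPI processes are arranged in a 2 x 2 topology and that neighbouring local grids overlap by 2 cells; `max_abs_diff` is a hypothetical helper for the comparison, not part of any package.

```julia
# Sketch only: expected sizes for Task 3 versus Task 2.
nx, ny         = 64, 64            # local size from Task 3
dims           = (2, 2)            # assumed topology for 4 MPI processes
nx_ref, ny_ref = 126, 126          # single xPU size from Task 2

size_v   = ((nx-2)*dims[1], (ny-2)*dims[2])    # size of the gathered C_v
size_ref = (nx_ref-2, ny_ref-2)                # size of C[2:end-1,2:end-1]
println(size_v, " ", size_ref)

# Hypothetical helper to quantify the mismatch between two saved outputs.
max_abs_diff(A, B) = maximum(abs.(A .- B))
```

With matching sizes, `max_abs_diff(C_v, C_ref[2:end-1,2:end-1])` gives a single number to report, where `C_ref` stands for whichever saved single xPU output you load.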

Task 4

Now that we are confident the xPU and multi-xPU codes produce correct physical output, we will assess performance.

Use the code diffusion_2D_perf_multixpu.jl and make sure to deactivate visualisation, saving or any other operation that would save to disk or slow the code down.

Strong scaling: Using a single GPU, gather the effective memory throughput T_eff, varying nx, ny as follows

nx = ny = 16 * 2 .^ (1:10)

⚠️ Warning!
Make sure the code only spends about 1-2 seconds in the time loop, adapting ttot or nt accordingly.

In a new figure that you'll add to the README.md, report T_eff as a function of nx, and include a short comment on what you see.
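For reference, the resolutions above expand to ten values, and the throughput can be computed from a measured time per iteration. The helper below is a sketch under the assumption that two arrays are counted per iteration (C read and C2 written) and that the data are Float64; adapt `n_arrays` and `T` if your code differs. No timings are given here, as they have to be measured.

```julia
# Sketch only: resolutions for the sweep and an effective-throughput helper.
resolutions = 16 * 2 .^ (1:10)     # 32, 64, ..., 16384

# t_it: measured wall time per iteration in seconds.
# Assumption: n_arrays = 2 (one array read, one array written per iteration).
T_eff(nx, ny, t_it; n_arrays=2, T=Float64) =
    n_arrays * nx * ny * sizeof(T) / 1e9 / t_it    # GB/s

println(resolutions)
```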

Task 5

Weak scaling: Select the smallest nx, ny values from the previous task for which you obtained the best T_eff. Now run the same code using this optimal local resolution, varying the number of MPI processes as follows: np = 1, 4, 16, 25, 64.

⚠️ Warning!
Make sure the code only runs for a couple of seconds each time; otherwise we will run out of node hours for the rest of the course.

In a new figure, report the execution time of the various runs, normalised by the execution time of the single process run. Comment in one sentence on what you see.

Task 6

Finally, let's assess the impact of hiding communication behind computation achieved using the @hide_communication macro in the multi-xPU code.

Using the 64 MPI process configuration, run the multi-xPU code, changing the values of the tuple after @hide_communication such that

@hide_communication (2,2)
@hide_communication (16,4)
@hide_communication (16,16)

Then, you should also run the code once with both the @hide_communication and the corresponding end statements commented out. In a figure, report the execution time as a function of [no-hidecomm, (2,2), (8,2), (16,4), (16,16)] (note that you should already have the (8,2) case from Task 4 and/or 5), making sure to normalise it by the single process execution time (from Task 5). Add a short comment on your results.
