Lecture 8

Agenda
📚 Distributed multi-xPU computing, MPI
💻 Running multi-GPU applications on supercomputers
🚧 Exercises:

  • Fake-parallelisation, Julia MPI

  • Using ImplicitGlobalGrid.jl

  • Multi-xPU diffusion 2D


Content

👉 get started with exercises


Distributed computing in Julia

The goal of this lecture 8:

New to distributed computing?

If this is the case or not - hold-on, we certainly have some good stuff for everyone

Distributed computing

Adds one additional layer of parallelisation:

Simply said:

If one CPU or GPU is not sufficient to solve a problem, then use more than one and solve a subset of the global problem on each.

Distributed (memory) computing permits to take advantage of computing "clusters", many similar compute nodes interconnected by high-throughput network. That's also what supercomputers are.

Parallel scaling

So here we go. Let's assume we want to solve a certain problem, which we will call the "global problem". This global problem, we split then into several local problems that execute concurrently.

Two scaling approaches exist:

Increasing the amount of computing resources to resolve the same global problem would increase parallelism and may result in faster execution (wall-time). This parallelisation is called strong scaling; the resources are increased but the global problem size does not change, resulting in an increase in the number of (smaller) local problems that can be solved in parallel.

The strong scaling approach is often used when parallelising legacy CPU codes, as increasing the number of parallel local problems can lead to some speed-up, reaching an optimum beyond which additional local processes is no longer be beneficial.

However, we won't follow that path when developing parallel multi-GPU applications from scratch. Why?

Because GPUs' performance is very sensitive to the local problem size as we experienced when trying to tune the kernel launch parameters (threads, blocks, i.e., the local problem size).

When developing multi-GPU applications from scratch, it is likely more suitably to approach distributed parallelisation from a weak scaling perspective; defining first the optimal local problem size to resolve on a single GPU and then increasing the number of optimal local problems (and the number of GPUs) until reaching the global problem one originally wants to solve.

Implicit Global Grid

We can thus replicate a local problem multiple times in each dimension of the Cartesian space to obtain a global grid, which is therefore defined implicitly. Local problems define each others local boundary conditions by exchanging internal boundary values using intra-node communication (e.g., message passing interface - MPI), as depicted on the figure hereafter:

IGG

Distributing computations - challenges

Many things could potentially go wrong in distributed computing. However, the ultimate goal (at least for us) is to keep up with parallel efficiency.

The parallel efficiency defines as the speed-up divided by the number of processors. The speed-up defines as the execution time using an increasing number of processors normalised by the single processor execution time. We will use the parallel efficiency in a weak scaling configuration.

Ideally, the parallel efficiency should stay close to 1 while increasing the number of computing resources proportionally with the global problem size (i.e. keeping the constant local problem sizes), meaning no time is lost (no overhead) in due to, e.g., inter-process communication, network congestion, congestion of shared filesystem, etc... as shown in the figure hereafter:

Parallel scaling


Let's get started

we will explore distributed computing with Julia's MPI wrapper MPI.jl. This will enable our codes to run on multiple CPUs and GPUs in order to scale on modern multi-CPU/GPU nodes, clusters and supercomputers. In the proposed approach, each MPI process handles one CPU or GPU.

We're going to work out the following steps to tackle distributed parallelisation in this lecture (in 5 steps):

Fake parallelisation

As a first step, we will look at the below 1-D diffusion code which solves the linear diffusion equations using a "fake-parallelisation" approach. We split the calculation on two distinct left and right domains, which requires left and right C arrays, CL and CR, respectively.

In this "fake parallelisation" code, the computations for the left and right domain are performed sequentially on one process, but they could be computed on two distinct processes if the needed boundary update (often referred to as halo update in literature) was done with MPI.

1D Global grid

The idea of this fake parallelisation approach is the following:

# Compute physics locally
CL[2:end-1] .= CL[2:end-1] .+ dt*D*diff(diff(CL)/dx)/dx
CR[2:end-1] .= CR[2:end-1] .+ dt*D*diff(diff(CR)/dx)/dx
# Update boundaries (MPI)
CL[end] = ...
CR[1]   = ...
# Global picture
C .= [CL[1:end-1]; CR[2:end]]

We see that a correct boundary update will be the critical part for a successful implementation. In our approach, we need an overlap of 2 cells between CL and CR in order to avoid any wrong computations at the transition between the left and right domains.

Step 1 (fake parallelisation with 2 fake processes)

Run the "fake parallelisation" 1-D diffusion code l8_diffusion_1D_2procs.jl, which is missing the boundary updates of the 2 fake processes and describe what you see in the visualisation.

Then, add the required boundary update:

# Update boundaries (MPI)
CL[end] = ...
CR[1]   = ...

in order make the code work properly and run it again. Note what has changed in the visualisation.

The next step will be to generalise the fake parallelisation with 2 fake processes to work with n fake processes. The idea of this generalised fake parallelisation approach is the following:

for ip = 1:np # compute physics locally
    C[2:end-1,ip] .= C[2:end-1,ip] .+ dt*D*diff(diff(C[:,ip])/dxg)/dxg
end
for ip = 1:np-1 # update boundaries
   # ...
end
for ip = 1:np # global picture
    i1 = 1 + (ip-1)*(nx-2)
    Cg[i1:i1+nx-2] .= C[1:end-1,ip]
end

The array C contains now n local domains where each domain belongs to one fake process, namely the fake process indicated by the second index of C (ip). The boundary updates are to be adapted accordingly. All the physical calculations happen on the local chunks of the arrays.

We only need "global" knowledge in the definition of the initial condition.

The previous simple initial conditions can be easily defined without computing any Cartesian coordinates. To define other initial conditions we often need to compute global coordinates.

In the code below, which serves to define a Gaussian anomaly in the centre of the domain, Cartesian coordinates can be computed for each cell based on the process ID (ip), the cell ID (ix), the array size (nx), the overlap of the local domains (2) and the grid spacing of the global grid (dxg); moreover, the origin of the coordinate system can be moved to any position using the global domain length (lx):

# Initial condition
for ip = 1:np
    for ix = 1:nx
        x[ix,ip] = ...
        C[ix,ip] = exp(-x[ix,ip]^2)
    end
    i1 = 1 + (ip-1)*(nx-2)
    xt[i1:i1+nx-2] .= x[1:end-1,ip]; if (ip==np) xt[i1+nx-1] = x[end,ip] end
    Ct[i1:i1+nx-2] .= C[1:end-1,ip]; if (ip==np) Ct[i1+nx-1] = C[end,ip] end
end

Step 2 (fake parallelisation with n fake processes)

Modify the initial condition in the 1-D diffusion code l8_diffusion_1D_nprocs.jl to a centred (Lx/2)(L_x/2) Gaussian anomaly.

Then run this code which is missing the boundary updates of the n fake processes and describe what you see in the visualisation. Then, add the required boundary update in order make the code work properly and run it again. Note what has changed in the visualisation.

Julia and MPI

We are now ready to write a code that will truly distribute calculations on different processors using MPI.jl for inter-process communication.

💡 Note
At this point, make sure to have a working Julia MPI environment. Head to Julia MPI install to set-up Julia MPI. See Julia MPI GPU on Piz Daint for detailed information on how to run MPI GPU (multi-GPU) applications on Piz Daint.

Let us see what are the somewhat minimal requirements that will allow us to write a distributed code in Julia using MPI.jl. We will solve the following linear diffusion physics (see l8_diffusion_1D_mpi.jl):

for it = 1:nt
    qx         .= .-D*diff(C)/dx
    C[2:end-1] .= C[2:end-1] .- dt*diff(qx)/dx
end

To enable distributed parallelisation, we will do the following steps:

  1. Initialise MPI and set up a Cartesian communicator

  2. Implement a boundary exchange routine

  3. Create a "global" initial condition

  4. Finalise MPI

To (1.) initialise MPI and prepare the Cartesian communicator, we do (upon import MPI):

import MPI

MPI.Init()
dims        = [0]
comm        = MPI.COMM_WORLD
nprocs      = MPI.Comm_size(comm)
MPI.Dims_create!(nprocs, dims)
comm_cart   = MPI.Cart_create(comm, dims, [0], 1)
me          = MPI.Comm_rank(comm_cart)
coords      = MPI.Cart_coords(comm_cart)
neighbors_x = MPI.Cart_shift(comm_cart, 0, 1)

where me represents the process ID unique to each MPI process (the analogue to ip in the fake parallelisation).

Then, we need to (2.) implement a boundary update routine, which can have the following structure:

@views function update_halo!(A, neighbors_x, comm)
    # Send to / receive from neighbour 1 ("left neighbor")
    if neighbors_x[1] != MPI.PROC_NULL
        sendbuf = ??
        recvbuf = ??
        MPI.Send(??,  neighbors_x[?], 0, comm)
        MPI.Recv!(??, neighbors_x[?], 1, comm)
        A[1] = ??
    end
    # Send to / receive from neighbour 2 ("right neighbor")
    if neighbors_x[2] != MPI.PROC_NULL
        sendbuf = ??
        recvbuf = ??
        MPI.Recv!(??, neighbors_x[?], 0, comm)
        MPI.Send(??,  neighbors_x[?], 1, comm)
        A[end] = ??
    end
    return
end

Then, we (3.) initialize C with a "global" initial Gaussian anomaly that spans correctly over all local domains. This can be achieved, e.g., as given here:

x0    = coords[1]*(nx-2)*dx
xc    = [x0 + ix*dx - dx/2 - 0.5*lx  for ix=1:nx]
C     = exp.(.-xc.^2)

where x0 represents the first global x-coordinate on every process (computed in function of coords) and xc represents the local chunk of the global coordinates on each local process (this is analogue to the initialisation in the fake parallelisation).

Last, we need to (4.) finalise MPI prior to returning from the main function:

MPI.Finalize()

All the above described is found in the code l8_diffusion_1D_mpi.jl, except for the boundary updates (see 2.).

Step 3 (1-D parallelisation with MPI)

Run the code l8_diffusion_1D_mpi.jl which is still missing the boundary updates three times: with 1, 2 and 4 processes (replacing np by the number of processes):

mpiexecjl -n <np> julia --project <my_script.jl>

Visualise the results after each run with the l8_vizme1D_mpi.jl code (adapt the variable nprocs!). Describe what you see in the visualisation. Then, add the required boundary update in order make the code work properly and run it again. Note what has changed in the visualisation.

💡 Note
For the boundary updates, you can use the following approach for the communication with each neighbour: 1) create a sendbuffer and receive buffer, storing the right value in the send buffer; 2) use MPI.Send and MPI.Recv! to send/receive the data; 3) store the received data in the right position in the array.

Congratulations! You just did a distributed memory diffusion solver in only 70 lines of code.

Let us now do the same in 2D: there is not much new there, but it may be interesting to work out how boundary update routines can be defined in 2D as one now needs to exchange vectors instead of single values.

👉 You'll find a working 1D script in the scripts/l8_scripts folder after the lecture.

Step 4 (2-D parallelisation with MPI)

Run the code l8_diffusion_2D_mpi.jl which is still missing the boundary updates three times: with 1, 2 and 4 processes.

Visualise the results after each run with the l8_vizme2D_mpi.jl code (adapt the variable nprocs!). Describe what you see in the visualisation. Then, add the required boundary update in order make the code work properly and run it again. Note what has changed in the visualisation.

diffusion 2D MPI

The last step is to create a multi-GPU solver out of the above multi-CPU solver. CUDA-aware MPI is of great help in this task, because it allows to directly pass GPU arrays to the MPI functions.

Besides facilitating the programming, it can leverage Remote Direct Memory Access (RDMA) which can be of great benefit in many HPC scenarios.

Step 5 (multi-GPU)

Translate the code diffusion_2D_mpi.jl from Step 4 to GPU using GPU array programming. Note what changes were needed to go from CPU to GPU in this distributed solver.

Use a similar approach as implemented in the CPU code to perform the boundary updates. You can use the copyto! function in order to copy the data from the GPU memory into the send buffers (CPU memory) or to copy the receive buffer data (CPU memory) baclk to the GPU array.

💡 Note
Have a look at the l8_hello_mpi_gpu.jl code to get an idea on how to select a GPU based on node-local MPI infos.

Head to the exercise section for further directions on this step which is part of this week's homework assignments.

💡 Note
As alternative, one could use the same approach as in the CPU code to perform the boundary updates thanks to CUDA-aware MPI (it allows to pass GPU arrays directly to the MPI functions). However, this requires MPI being specifically built for that purpose.

This completes the introduction to distributed parallelisation with Julia.

Note that high-level Julia packages as for example ImplicitGlobalGrid.jl can render distributed parallelisation with GPU and CPU for HPC a very simple task. Let's check it out!

Using ImplicitGlobalGrid.jl

Let's have look at ImplicitGlobalGrid.jl's repository.

ImplicitGlobalGrid.jl can render distributed parallelisation with GPU and CPU for HPC a very simple task. Moreover, ImplicitGlobalGrid.jl elegantly combines with ParallelStencil.jl.

Finally, the cool part: using both packages together enables to hide communication behind computation. This feature enables a parallel efficiency close to 1.

For this development, we'll start from the l8_diffusion_2D_perf_xpu.jl code.

Only a few changes are required to enable multi-xPU execution, namely:

  1. Initialise the implicit global grid

  2. Use global coordinates to compute the initial condition

  3. Update halo (and overlap communication with computation)

  4. Finalise the global grid

  5. Tune visualisation

But before we start programming the multi-xPU implementation, let's get setup with GPU MPI on Piz Daint. Follow steps are needed:

💡 Note
See Julia MPI GPU on Piz Daint for detailed information on how to run MPI GPU (multi-GPU) applications on Piz Daint.

To (1.) initialise the global grid, one first needs to use the package

using ImplicitGlobalGrid

Then, one can add the global grid initialisation in the # Derived numerics section

me, dims = init_global_grid(nx, ny, 1)  # Initialization of MPI and more...
dx, dy  = Lx/nx_g(), Ly/ny_g()
💡 Note
The function init_global_grid takes care of MPI GPU mapping based on node-local MPI infos. Have a look at the l8_hello_mpi_gpu.jl code to get an idea about the process.

Then, for (2.), one can use x_g() and y_g() to compute the global coordinates in the initialisation (to correctly spread the Gaussian distribution over all local processes)

C       = @zeros(nx,ny)
C      .= Data.Array([exp(-(x_g(ix,dx,C)+dx/2 -Lx/2)^2 -(y_g(iy,dy,C)+dy/2 -Ly/2)^2) for ix=1:size(C,1), iy=1:size(C,2)])

The halo update (3.) can be simply performed adding following line after the compute! kernel

update_halo!(C)

Now, when running on GPUs, it is possible to hide MPi communication behind computations!

This option implements as:

@hide_communication (8, 2) begin
    @parallel compute!(C2, C, D_dx, D_dy, dt, _dx, _dy, size_C1_2, size_C2_2)
    C, C2 = C2, C # pointer swap
    update_halo!(C)
end

The @hide_communication (8, 2) will first compute the first and last 8 and 2 grid points in x and y dimension, respectively. Then, while exchanging boundaries the rest of the local domains computations will be perform (overlapping the MPI communication).

To (4.) finalise the global grid,

finalize_global_grid()

needs to be added before the return of the "main".

The last changes to take care of is to (5.) handle visualisation in an appropriate fashion. Here, several options exists.

To implement the latter and generate a gif, one needs to define a global array for visualisation:

if do_visu
    if (me==0) ENV["GKSwstype"]="nul"; if isdir("viz2D_mxpu_out")==false mkdir("viz2D_mxpu_out") end; loadpath = "./viz2D_mxpu_out/"; anim = Animation(loadpath,String[]); println("Animation directory: $(anim.dir)") end
    nx_v, ny_v = (nx-2)*dims[1], (ny-2)*dims[2]
    if (nx_v*ny_v*sizeof(Data.Number) > 0.8*Sys.free_memory()) error("Not enough memory for visualization.") end
    C_v   = zeros(nx_v, ny_v) # global array for visu
    C_inn = zeros(nx-2, ny-2) # no halo local array for visu
    xi_g, yi_g = LinRange(dx+dx/2, Lx-dx-dx/2, nx_v), LinRange(dy+dy/2, Ly-dy-dy/2, ny_v) # inner points only
end

Then, the plotting routine can be adapted to first gather the inner points of the local domains into the global array (using gather! function) and then plot and/or save the global array (here C_v) from the master process me==0:

# Visualize
if do_visu && (it % nout == 0)
    C_inn .= Array(C)[2:end-1,2:end-1]; gather!(C_inn, C_v)
    if (me==0)
        opts = (aspect_ratio=1, xlims=(xi_g[1], xi_g[end]), ylims=(yi_g[1], yi_g[end]), clims=(0.0, 1.0), c=:turbo, xlabel="Lx", ylabel="Ly", title="time = $(round(it*dt, sigdigits=3))")
        heatmap(xi_g, yi_g, Array(C_v)'; opts...); frame(anim)
    end
end

To finally generate the gif, one needs to place the following after the time loop:

if (do_visu && me==0) gif(anim, "diffusion_2D_mxpu.gif", fps = 5)  end
💡 Note
We here did not rely on CUDA-aware MPI. To use this feature set (and export) IGG_CUDAAWARE_MPI=1. Note that the examples using ImplicitGlobalGrid.jl would also work if USE_GPU = false; however, the communication and computation overlap feature is then currently not yet available as its implementation relies at present on leveraging CUDA streams.

Wrapping up

Let's recall what we learned today about distributed computing in Julia using GPUs:

back to Content

Exercises - lecture 8

⚠️ Warning!
Exercises have to be handed in and uploaded to your private (shared) GitHub repository, in a newly created lecture8 folder (and not in the PorousConvection folder). The git commit hash (or SHA) of the final push needs to be uploaded on Moodle (more).

Exercise 1 — Towards distributed memory computing on GPUs

👉 See Logistics for submission details.

The goal of this exercise is to:

In this exercise, you will:

Create a new lectrue_8 folder for this first exercise in your shared private GitHub repository for this week's exercises.

Task 1

Finalise the l8_diffusion_1D_2procs.jl and l8_diffusion_1D_nprocs.jl scripts discussed during lecture 8. Make sure to correctly implement the halo update in order to exchange the internal boundaries among the fake parallel processes (left and right and ip in the "2procs" and "nprocs" codes, respectively). See here for details.

Report in two separate figures the final distribution of concentration C for both fake parallel codes. Include these figure in a first section of your lecture's 8 README.md adding a description sentence to each.

Task 2

Finalise the l8_diffusion_2D_mpi.jl script discussed during lecture 8. In particular, finalise the update_halo functions to allow for correct internal boundary exchange among the distributed parallel MPI processes. Add the final code to your GitHub lecture 8 folder.

For each of the (4) neighbour exchanges:

  1. Start by defining a sendbuffer sendbuf to hold the vector you need to send

  2. Initialise a receive buffer recvbuf to later hold the vector received from the corresponding neighbouring process

  3. Use MPI.Send and MPI.Recv! functions to perform the boundary exchange

  4. Assign the values within the receive buffer to the corresponding row or column of the array A

💡 Note
Apply similar overlap and halo update as in the fake parallelisation examples. Look-up MPI.Send and MPI.recv! for further details.

In a new section of your lecture's 8 README.md, add a .gif animation showing the diffusion of the quantity C, running on 4 MPI processes, for the physical and numerical parameters suggested in the initial file. Add a short description of the results and provide the command used to launch the script in the README.md as well.

Task 3

Create a multi-GPU implementation of the l8_diffusion_2D_mpi.jl script as suggested here. To this end, create a new script l8_diffusion_2D_mpi_gpu.jl that you will upload to your lecture 8 GitHub repository upon completion.

Translate the l8_diffusion_2D_mpi.jl code from exercise 1 (task 3) to GPU using GPU array programming. You can use a similar approach as in the CPU code to perform the boundary updates. You should use copyto! function in order to copy the data from the GPU memory into the send buffers (CPU memory) or to copy the receive buffer data to the GPU array.

The steps to realise this task summarise as following:

  1. Select the GPU based on node-local MPI infos (see the l8_hello_mpi_gpu.jl code to get started.)

  2. Use GPU array initialisation (CUDA.zeros, CuArray(), ...)

  3. Gather the GPU arrays back on the host memory for visualisation or saving (using Array())

  4. Modify the update_halo function; use copyto! to copy device data to the host into the send buffer or to copy host data to the device from the receive buffer

In a new (3rd) section of your lecture's 8 README.md, add .gif animation showing the diffusion of the quantity C, running on 4 GPUs (MPI processes), for the physical and numerical parameters suggested in the initial file. Add a short description of the results and provide the command used to launch the script in the README.md as well. Note what changes were needed to go from CPU to GPU in this distributed solver.

back to Content


Exercise 2 — Multi-xPU computing

👉 See Logistics for submission details.

The goal of this exercise is to:

In this exercise, you will:

Start by fetching the l8_diffusion_2D_perf_xpu.jl code from the scripts/l8_scripts folder and copy it to your lectrue_8 folder.

Make a copy and rename it diffusion_2D_perf_multixpu.jl.

Task 1

Follow the steps listed in the section from lecture 8 about using ImplicitGlobalGrid.jl to add multi-xPU support to the 2D diffusion code.

The 5 steps you'll need to implement are summarised hereafter:

  1. Initialise the implicit global grid

  2. Use global coordinates to compute the initial condition

  3. Update halo (and overlap communication with computation)

  4. Finalise the global grid

  5. Tune visualisation

Once the above steps are implemented, head to Piz Daint and configure either an salloc or prepare a sbatch script to access 4 nodes.

Task 2

Run the single xPU l8_diffusion_2D_perf_xpu.jl code on a single CPU and single GPU (changing the USE_GPU flag accordingly) for following parameters

# Physics
Lx, Ly  = 10.0, 10.0
D       = 1.0
ttot    = 1.0
# Numerics
nx, ny  = 126, 126
nout    = 20

and save output C data. Confirm that the difference between CPU and GPU implementation is negligible, reporting it in a new section of the README.md for this exercise 2 within the lecture_8 folder in your shared private GitHub repo.

Task 3

Then run the newly created diffusion_2D_perf_multixpu.jl script with following parameters on 4 MPI processes having set USE_GPU = true:

# Physics
Lx, Ly  = 10.0, 10.0
D       = 1.0
ttot    = 1e0
# Numerics
nx, ny  = 64, 64 # number of grid points
nout    = 20
# Derived numerics
me, dims = init_global_grid(nx, ny, 1)  # Initialization of MPI and more...

Save the global C_v output array. Ensure its size matches the inner points of the single xPU produced output (C[2:end-1,2:end-1]) and then compare the results to the existing 2 outputs produced in Task 2

Task 4

Now that we are confident the xPU and multi-xPU codes produce correct physical output, we will asses performance.

Use the code diffusion_2D_perf_multixpu.jl and make sure to deactivate visualisation, saving or any other operation that would save to disk or slow the code down.

Strong scaling: Using a single GPU, gather the effective memory throughput T_eff varying nx, ny as following

nx = ny = 16 * 2 .^ (1:10)
⚠️ Warning!
Make sur the code only spends about 1-2 seconds in the time loop, adapting ttot or nt accordingly.

In a new figure you'll add to the README.md, report T_eff as function of nx, and include a short comment on what you see.

Task 5

Weak scaling: Select the smallest nx,ny values from previous step (2.) for which you've gotten the best T_eff. Run now the same code using this optimal local resolution varying the number of MPI process as following np = 1,4,16,25,64.

⚠️ Warning!
Make sure the code only executes a couple of seconds each time otherwise we will run out of node hours for the rest of the course.

In a new figure, report the execution time for the various runs normalising them with the execution time of the single process run. Comment in one sentence on what you see.

Task 6

Finally, let's assess the impact of hiding communication behind computation achieved using the @hide_communication macro in the multi-xPU code.

Using the 64 MPI processes configuration, run the multi-xPU code changing the values of the tuple after @hide_communication such that

@hide_communication (2,2)
@hide_communication (16,4)
@hide_communication (16,16)

Then, you should also run once the code commenting both @hide_communication and corresponding end statements. On a figure report the execution time as function of [no-hidecomm, (2,2), (8,2), (16,4), (16,16)] (note that the (8,2) case you should have from Task 4 and/or 5) making sure to normalise it by the single process execution time (from Task 5). Add a short comment related to your results.

back to Content