Programming Project: Dijkstra's Algorithm
Summer school: 'Introduction to High Performance Computing', Center for High-Performance Computing & KTH School of Computer Science and Communication
Salvator Gkalea, Department of ICT/ES, KTH
[email protected]
Dionysios Zelios, Department of Physics, Stockholm University
[email protected]
Tutor: Stefano Markidis
Date: October 2014
Contents
Abstract
Introduction
Dijkstra's Parallelization
Simulation results and analysis
References
Abstract
Dijkstra's algorithm is a graph search algorithm that solves the single-source shortest-path problem for a graph with non-negative edge weights, producing a shortest-path tree. The algorithm is often used in routing and as a subroutine in other graph algorithms. In this project we investigate the parallelization of the algorithm and its speed-up with respect to the sequential version. The Message Passing Interface (MPI) is used to parallelize the single-source shortest-path algorithm.
Introduction
Dijkstra's algorithm determines the shortest path from a single source vertex to every other vertex. It can also be used to find the cost of the shortest path from a single source vertex to a single destination vertex, by stopping the algorithm once the shortest path to that destination has been determined. For instance, if the vertices of the graph represent cities and the edge weights represent driving distances between pairs of cities connected by a direct road, Dijkstra's algorithm can be used to find the shortest route between one city and all other cities. A pseudo-code representation of Dijkstra's sequential single-source shortest-paths algorithm is given below and proceeds as follows:
1.  function Dijkstra(V, E, w, s)
2.  begin
3.      VT := {s};
4.      for all v ∈ (V - VT) do
5.          if (s, v) exists set l[v] := w(s, v);
6.          else set l[v] := ∞;
        end for
7.      while VT ≠ V do
8.      begin
9.          find a vertex u such that l[u] = min{ l[v] | v ∈ (V - VT) };
10.         VT := VT ∪ {u};
11.         for all v ∈ (V - VT) do
12.             l[v] := min{ l[v], l[u] + w(u, v) };
13.     end while
14. end Dijkstra
Suppose we are given a weighted graph G = (V, E, w), where V is the set of vertices, E the set of edges and w the edge-weight function. The pseudo-code above computes the shortest path from the source vertex s to every other vertex v. For every vertex v in (V - VT) the algorithm stores in l[v] the minimum known cost of reaching v from s. In lines 4-6 the procedure initialises the l[] array with the weights of the edges leaving s. In the remaining lines the algorithm repeatedly finds the vertex u closest to s, among all those not yet examined, and then updates the minimum distances in the l[] array.

The main idea behind the parallelization is that every process finds, among the vertices assigned to it, the one that is closest to the source. A reduction then determines the globally closest vertex, which is broadcast to all processes, and finally each process updates its own part of the distance array l[]. This parallel computation can be achieved with a few basic MPI communication functions. These functions are described below, together with a minimal sketch of how they are combined, and they are used to parallelize the single-source shortest-paths algorithm.
MPI_Init        Initializes the parallel environment
MPI_Finalize    Ends the parallel environment and releases system resources
MPI_Comm_size   Returns the number of parallel computing units (processes)
MPI_Comm_rank   Returns the identification number (rank) of the current computing unit
MPI_Reduce      Reduces values from all processes to a single value
MPI_Bcast       Broadcasts a message from the process with rank "root" to all other processes of the communicator
MPI_Gather      Gathers together values from a group of processes
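To make the interplay of these calls concrete, the minimal sketch below (a hypothetical stand-alone example, not part of the project code) shows how MPI_Reduce with the MPI_MINLOC operation and MPI_Bcast let every process learn the globally smallest (distance, vertex) pair; the same pattern appears inside the main loop of the full program further below.

/* Sketch: every rank proposes a (distance, vertex) pair and all ranks
   learn which pair is globally smallest.                              */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, localMin[2], globalMin[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* pretend that rank r found closest vertex r at distance 10 - r */
    localMin[0] = 10 - rank;   /* local minimum distance  */
    localMin[1] = rank;        /* vertex that attains it  */

    /* MPI_MINLOC keeps the smallest distance together with its vertex index */
    MPI_Reduce(localMin, globalMin, 1, MPI_2INT, MPI_MINLOC, 0, MPI_COMM_WORLD);
    /* rank 0 then broadcasts the winning pair to all other ranks */
    MPI_Bcast(globalMin, 1, MPI_2INT, 0, MPI_COMM_WORLD);

    printf("rank %d: global min %d at vertex %d\n", rank, globalMin[0], globalMin[1]);
    MPI_Finalize();
    return 0;
}

The same effect could also be obtained with a single MPI_Allreduce call; the project code uses the Reduce-followed-by-Bcast formulation shown here.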
Dijkstra's Parallelization
The pseudo-code above can be parallelized around line 9, where the minimum distance among all unvisited vertices is computed, or around lines 11-12, where the newly found overall minimum is used to update the local minimum-distance array. The classic approach is to partition the data into segments, so that each node processes and computes one segment; at the end, the results of all segments are gathered on a single node. The C code below implements this idea.
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <mpi.h>

/* plain int (rather than unsigned) is used so that the MPI_INT and
   MPI_2INT datatypes used in the communication calls match exactly   */
int numV,          /* number of vertices                                */
    *todo,         /* 1 while a vertex has not been finalised yet       */
    numNodes,      /* number of nodes (MPI processes)                   */
    chunk,         /* number of vertices handled by every node          */
    start, end,    /* first and last vertex owned by this node          */
    myNode,        /* ID (rank) of this node                            */
    maxInt,        /* "infinity", kept small enough to avoid overflow   */
    localMin[2],   /* [0]: min for local node, [1]: vertex for that min */
    globalMin[2],  /* [0]: global min for all nodes, [1]: its vertex    */
    *graph,        /* local rows of the weight matrix (chunk x numV)    */
    *minDistance;  /* tentative minimum distances                       */

double T1, T2;     /* start and end time */

/* Deterministic, symmetric edge weight in [1, 21]; it replaces the original
   rand() call so that every node generates the same (symmetric) graph.     */
static int edgeWeight(int a, int b)
{
    return (int)(((long long)(a + 1) * (b + 1)) % 21) + 1;
}

/* Generation of the graph and initialisation of the local data structures */
void init(int ac, char **av)
{
    int i, j, gi;

    MPI_Init(&ac, &av);
    MPI_Comm_size(MPI_COMM_WORLD, &numNodes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myNode);

    numV   = atoi(av[1]);
    chunk  = numV / numNodes;      /* assumes numV is divisible by numNodes */
    start  = myNode * chunk;
    end    = start + chunk - 1;
    maxInt = INT_MAX / 2;

    graph       = malloc((size_t)chunk * numV * sizeof(int));
    minDistance = malloc((size_t)numV * sizeof(int));
    todo        = malloc((size_t)numV * sizeof(int));

    /* each node stores only its own rows [start, end] of the matrix */
    for (i = 0; i < chunk; i++) {
        gi = start + i;                        /* global row index */
        for (j = 0; j < numV; j++)
            graph[i * numV + j] = (gi == j) ? 0 : edgeWeight(gi, j);
    }
    for (i = 0; i < numV; i++) {
        todo[i] = 1;
        minDistance[i] = maxInt;
    }
    minDistance[0] = 0;                        /* vertex 0 is the source */
}

int main(int ac, char **av)
{
    int i, j, k, step;

    init(ac, av);
    if (myNode == 0) T1 = MPI_Wtime();

    for (step = 0; step < numV; step++) {
        /* find the local minimum distance among this node's vertices */
        localMin[0] = maxInt;
        localMin[1] = -1;
        for (i = start; i <= end; i++)
            if (todo[i] && minDistance[i] < localMin[0]) {
                localMin[0] = minDistance[i];
                localMin[1] = i;
            }

        /* combine the local minima into the global one and broadcast it */
        MPI_Reduce(localMin, globalMin, 1, MPI_2INT, MPI_MINLOC,
                   0, MPI_COMM_WORLD);
        MPI_Bcast(globalMin, 1, MPI_2INT, 0, MPI_COMM_WORLD);
        if (globalMin[1] < 0) break;           /* no reachable vertex left */

        /* mark that vertex as finalised */
        todo[globalMin[1]] = 0;

        /* update the local part of the distance array; the graph is
           symmetric, so w(u, v) is read from the locally stored row of v */
        for (j = 0; j < chunk; j++)
            if (globalMin[0] + graph[j * numV + globalMin[1]]
                    < minDistance[start + j])
                minDistance[start + j] =
                    globalMin[0] + graph[j * numV + globalMin[1]];
    }

    /* gather the distance segments of all nodes on node 0 (MPI_IN_PLACE
       avoids overlapping send and receive buffers on the root)          */
    if (myNode == 0)
        MPI_Gather(MPI_IN_PLACE, chunk, MPI_INT,
                   minDistance, chunk, MPI_INT, 0, MPI_COMM_WORLD);
    else
        MPI_Gather(minDistance + start, chunk, MPI_INT,
                   minDistance, chunk, MPI_INT, 0, MPI_COMM_WORLD);
    if (myNode == 0) T2 = MPI_Wtime();

    if (0 && myNode == 0)                      /* verification output, disabled */
        for (k = 1; k < numV; k++)
            printf("node[%d] = %d\n", k, minDistance[k]);

    if (myNode == 0) printf("elapsed: %f\n", T2 - T1);

    free(graph); free(minDistance); free(todo);
    MPI_Finalize();
    return 0;
}
Node 0 acts as a 'master': it collects the local minima of all nodes through the MPI_Reduce call and broadcasts (MPI_Bcast) the resulting global minimum distance back to them. The other nodes act as receivers, waiting for the global minimum distance from node 0. As mentioned before, each node processes one segment of the whole data; after the main loop, the distance segments of all nodes are gathered (MPI_Gather) on node 0.
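For completeness, a typical way to build and launch the program is sketched below; the file name Dijkstra.c and the generic-cluster commands are assumptions, while the aprun invocation mirrors the one used for the measurements in the next section.

# on a Cray system such as Milner (compiler wrappers link MPI automatically):
cc -O2 Dijkstra.c -o Dijkstra.out
aprun -n 8 ./Dijkstra.out 4000000      # 8 processes, graph with 4*10^6 vertices

# on a generic MPI installation the equivalent would be:
mpicc -O2 Dijkstra.c -o Dijkstra.out
mpirun -np 8 ./Dijkstra.out 4000000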
Simulation results and analysis
The experiments were conducted on Milner with the Intel compiler.

Time measurements
First, we ran the algorithm for different mesh sizes. Each time we varied the number of cores and measured the time needed for the algorithm to run. The runs were launched as aprun -n Nodes ./Dijkstra.out mesh, where Nodes = 1..32 and mesh is the size of the graph, i.e. the number of vertices (e.g. mesh = 4*10^6). The measurements are shown in the tables below.

mesh = 4*10^6
Nodes   Time (sec)
1       0.009083
2       0.010213
4       0.007118
8       0.008694
16      0.015567
32      0.034436

mesh = 64*10^6
Nodes   Time (sec)
1       0.14132
2       0.150809
4       0.092254
8       0.062077
16      0.105193
32      0.1638634

mesh = 256*10^6
Nodes   Time (sec)
1       0.564806
2       0.597349
4       0.352130
8       0.202681
16      0.189242
32      0.200911

mesh = 900*10^6
Nodes   Time (sec)
1       1.984156
2       2.077290
4       1.213774
8       0.874406
16      0.85756
32      1.056482
Plotting the aforementioned values in the same diagram shows the following behaviour. The running time decreases as the number of cores increases until it reaches its smallest value; beyond that point the running time increases again because of the communication latency. For medium-sized graphs this reduction in running time is less pronounced, and for small graphs the running time even increases with the number of cores, because the communication latency outweighs the benefit of using more cores.
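A simple cost model (a back-of-the-envelope estimate on our part, not a measured quantity) makes this behaviour plausible. With the row-wise decomposition, each of the n iterations performs about n/p local work per process plus one reduction and one broadcast over p processes, so the parallel running time behaves roughly as

T_p \approx \frac{n^2}{p}\, t_c + n \log p \,(t_s + t_w),

where n is the number of vertices, p the number of processes, t_c the cost of one comparison/update, and t_s, t_w the message start-up and per-word transfer costs. The first term shrinks with p while the second grows, so for small graphs the communication term quickly dominates, which matches the measurements above.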
Speed-up measurements
Next, we investigate the speed-up of the algorithm. To obtain it, we divided the single-node (base) execution time by the execution time on multiple nodes; a worked example is given after the tables. The measurements and the corresponding plot are given below:

mesh = 4*10^6
Nodes   Speed-up
1       1
2       0.889356
4       1.276060
8       1.044743
16      0.583477
32      0.263764

mesh = 64*10^6
Nodes   Speed-up
1       1
2       0.9370793
4       1.531857
8       2.27652
16      1.3434353
32      0.862425

mesh = 256*10^6
Nodes   Speed-up
1       1
2       0.945520
4       1.60397
8       2.78667
16      2.98457
32      2.81122

mesh = 900*10^6
Nodes   Speed-up
1       1
2       0.9551
4       1.63469
8       2.269147
16      2.3137
32      1.8780
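As a check of how the entries above are obtained, using only the times already reported for the mesh of 4*10^6 vertices, the speed-up on 4 nodes is

S(4) = T(1) / T(4) = 0.009083 / 0.007118 ≈ 1.276,

which is the value 1.276060 listed in the first table.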
The speed-up increases with the number of cores until it reaches a maximum value and then decreases, because beyond that point the communication latency outweighs the benefit of using more cores. As the graph size increases, the number of cores at which the maximum speed-up is reached also increases.
Cost measurements
In addition, we investigate the cost as a function of the number of cores used. The cost is obtained by multiplying the number of cores by the execution time measured on that number of cores; a worked example follows the tables. The measurements and the corresponding diagram are presented below:
mesh = 4*10^6
Nodes   Cost (nodes × sec)
1       0.009083
2       0.020426
4       0.028472
8       0.069552
16      0.249072
32      1.101952

mesh = 64*10^6
Nodes   Cost (nodes × sec)
1       0.14132
2       0.301618
4       0.369016
8       0.496616
16      1.683088
32      5.2436288

mesh = 256*10^6
Nodes   Cost (nodes × sec)
1       0.564806
2       1.194698
4       1.40852
8       1.621448
16      3.027872
32      6.429152

mesh = 900*10^6
Nodes   Cost (nodes × sec)
1       1.984156
2       4.15458
4       4.855096
8       6.995248
16      13.72096
32      36.97687
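As a worked example using only the values already reported: for the mesh of 64*10^6 vertices on 2 nodes, the cost is

C(2) = 2 × T(2) = 2 × 0.150809 = 0.301618 node-seconds,

which matches the entry in the second table.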
The cost increases with the number of cores because the speed-up (i.e. the benefit of a reduced running time) cannot outweigh the cost of using more cores.
References
1) Wikipedia, "Dijkstra's algorithm": http://en.wikipedia.org/wiki/Dijkstra%27s_algorithm
2) Lecture notes by Erwin Laure: http://agenda.albanova.se/conferenceDisplay.py?confId=4384
3) Han, Xiao Gang, Qin Lei Sun, and Jiang Wei Fan. "Parallel Dijkstra's Algorithm Based on Multi-Core and MPI." Applied Mechanics and Materials 441 (2014): 750-753.
4) MPICH: http://www.mpich.org/
5) Open MPI: http://www.open-mpi.org/
6) Crauser, Andreas, et al. "A Parallelization of Dijkstra's Shortest Path Algorithm." Mathematical Foundations of Computer Science 1998. Springer Berlin Heidelberg, 1998. 722-731.
7) Meyer, Ulrich, and Peter Sanders. "Δ-stepping: A Parallel Single Source Shortest Path Algorithm." Algorithms - ESA '98. Springer Berlin Heidelberg, 1998. 393-404.
8) http://www.inf.ed.ac.uk/publications/thesis/online/IM040172.pdf