Solutions: An Introduction to Parallel Programming - Pacheco - Chapter 3

3.1. What happens in the greetings program if, instead of strlen(greeting) + 1, we use strlen(greeting) for the length of the message being sent by processes 1, 2, ..., comm_sz-1? What happens if we use MAX_STRING instead of strlen(greeting) + 1? Can you explain these results? Get solution

3.2. Modify the trapezoidal rule so that it will correctly estimate the integral even if comm_sz doesn’t evenly divide n. (You can still assume that n >= comm_sz.) Get solution

3.3. Determine which of the variables in the trapezoidal rule program are local and which are global. Get solution

3.4. Modify the program that just prints a line of output from each process (mpi_output.c) so that the output is printed in process rank order: process 0's output first, then process 1's, and so on. Get solution

3.5. In a binary tree, there is a unique shortest path from each node to the root. The length of this path is often called the depth of the node. A binary tree in which every nonleaf has two children is called a full binary tree, and a full binary tree in which every leaf has the same depth is sometimes called a complete binary tree. See Figure 3.14. Use the principle of mathematical induction to prove that if T is a complete binary tree with n leaves, then the depth of the leaves is log2(n). Get solution

3.6. Suppose comm_sz = 4 and suppose that x is a vector with n = 14 components.
a. How would the components of x be distributed among the processes in a program that used a block distribution?
b. How would the components of x be distributed among the processes in a program that used a cyclic distribution?
c. How would the components of x be distributed among the processes in a program that used a block-cyclic distribution with blocksize b = 2?
You should try to make your distributions general so that they could be used regardless of what comm_sz and n are. You should also try to make your distributions “fair” so that if q and r are any two processes, the difference between the number of components assigned to q and the number of components assigned to r is as small as possible. Get solution
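As a hint for making the distributions general, one possible owner mapping is sketched below in C. The function names and the fairness convention (the first r = n % comm_sz processes get one extra component in the block distribution) are my own choices, not the book's.

   /* Sketch: which process owns global component i under each distribution. */
   int Block_owner(int i, int n, int comm_sz) {
      int q = n / comm_sz, r = n % comm_sz;     /* first r procs get q+1 components */
      return (i < r*(q+1)) ? i/(q+1) : r + (i - r*(q+1))/q;
   }
   int Cyclic_owner(int i, int comm_sz) {
      return i % comm_sz;
   }
   int Block_cyclic_owner(int i, int b, int comm_sz) {
      return (i / b) % comm_sz;                 /* b = blocksize */
   }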

3.7. What do the various MPI collective functions do if the communicator contains a single process? Get solution

3.8. Suppose comm_sz = 8 and n = 16.
a. Draw a diagram that shows how MPI_Scatter can be implemented using tree-structured communication with comm_sz processes when process 0 needs to distribute an array containing n elements.
b. Draw a diagram that shows how MPI_Gather can be implemented using tree-structured communication when an n-element array that has been distributed among comm_sz processes needs to be gathered onto process 0. Get solution

3.9. Write an MPI program that implements multiplication of a vector by a scalar and dot product. The user should enter two vectors and a scalar, all of which are read in by process 0 and distributed among the processes. The results are calculated and collected onto process 0, which prints them. You can assume that n, the order of the vectors, is evenly divisible by comm_sz. Get solution
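A rough sketch of the core communication for this exercise follows. It is an outline rather than a solution: it assumes my_rank, comm, comm_sz, local_n = n/comm_sz, and the buffers (x, y, local_x, local_y, scalar, local_dot, dot) have already been set up.

   /* Distribute the data read on process 0 (x and y are only significant there). */
   MPI_Scatter(x, local_n, MPI_DOUBLE, local_x, local_n, MPI_DOUBLE, 0, comm);
   MPI_Scatter(y, local_n, MPI_DOUBLE, local_y, local_n, MPI_DOUBLE, 0, comm);
   MPI_Bcast(&scalar, 1, MPI_DOUBLE, 0, comm);

   /* Local dot product, then local scalar multiplication. */
   local_dot = 0.0;
   for (i = 0; i < local_n; i++)
      local_dot += local_x[i]*local_y[i];
   for (i = 0; i < local_n; i++) {
      local_x[i] *= scalar;
      local_y[i] *= scalar;
   }

   /* Collect the results onto process 0. */
   MPI_Reduce(&local_dot, &dot, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
   MPI_Gather(local_x, local_n, MPI_DOUBLE, x, local_n, MPI_DOUBLE, 0, comm);
   MPI_Gather(local_y, local_n, MPI_DOUBLE, y, local_n, MPI_DOUBLE, 0, comm);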

3.10. In the Read_vector function shown in Program 3.9, we use local_n as the actual argument for two of the formal arguments to MPI_Scatter: send_count and recv_count. Why is it OK to alias these arguments? Get solution

3.11. Finding prefix sums is a generalization of global sum. Rather than simply finding the sum of n values, x_0 + x_1 + ... + x_(n-1), the prefix sums are the n partial sums x_0, x_0 + x_1, x_0 + x_1 + x_2, ..., x_0 + x_1 + ... + x_(n-1).
a. Devise a serial algorithm for computing the n prefix sums of an array with n elements.
b. Parallelize your serial algorithm for a system with n processes, each of which is storing one of the x_i's.
c. Suppose n = 2^k for some positive integer k. Can you devise a serial algorithm and a parallelization of the serial algorithm so that the parallel algorithm requires only k communication phases?
d. MPI provides a collective communication function, MPI_Scan, that can be used to compute prefix sums:
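Its prototype, as defined by the MPI standard (argument names follow the book's usual convention), is:

   int MPI_Scan(
         void*         sendbuf_p  /* in  */,
         void*         recvbuf_p  /* out */,
         int           count      /* in  */,
         MPI_Datatype  datatype   /* in  */,
         MPI_Op        op         /* in  */,
         MPI_Comm      comm       /* in  */);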

It operates on arrays with count elements; both sendbuf_p and recvbuf_p should refer to blocks of count elements of type datatype. The op argument is the same as op for MPI_Reduce. Write an MPI program that generates a random array of count elements on each MPI process, finds the prefix sums, and prints the results. Get solution

3.12. An alternative to a butterfly-structured allreduce is a ring-pass structure. In a ring-pass, if there are p processes, each process q sends data to process q+1, except that process p-1 sends data to process 0. This is repeated until each process has the desired result. Thus, we can implement allreduce with the following code:

   sum = temp_val = my_val;
   for (i = 1; i < p; i++) {
      MPI_Sendrecv_replace(&temp_val, 1, MPI_INT, dest, sendtag,
            source, recvtag, comm, &status);
      sum += temp_val;
   }

a. Write an MPI program that implements this algorithm for allreduce. How does its performance compare to the butterfly-structured allreduce?
b. Modify the MPI program you wrote in the first part so that it implements prefix sums. Get solution
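In the loop above, each process would typically compute its ring neighbors once before entering the loop. A minimal sketch, assuming my_rank holds the calling process's rank and the other names match the code above:

   dest   = (my_rank + 1) % p;       /* process we send to */
   source = (my_rank + p - 1) % p;   /* process we receive from */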

3.13. MPI_Scatter and MPI_Gather have the limitation that each process must send or receive the same number of data items. When this is not the case, we must use the MPI functions MPI_Gatherv and MPI_Scatterv. Look at the man pages for these functions, and modify your vector sum, dot product program so that it can correctly handle the case when n isn't evenly divisible by comm_sz. Get solution

3.14. a. Write a serial C program that defines a two-dimensional array in the main function. Just use numeric constants for the dimensions:
   int two_d[3][4];
Initialize the array in the main function. After the array is initialized, call a function that attempts to print the array. The prototype for the function should look something like this:
   void Print_two_d(int two_d[][], int rows, int cols);
After writing the function, try to compile the program. Can you explain why it won't compile?
b. After consulting a C reference (e.g., Kernighan and Ritchie [29]), modify the program so that it will compile and run, but so that it still uses a two-dimensional C array. Get solution
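For part (b), one common fix is sketched below, assuming the column count 4 from main is used as the constant in the parameter declaration; the compiler needs the rightmost dimension to compute element addresses.

   #include <stdio.h>

   /* The rightmost dimension must be known to the compiler so that it can
      compute the address of two_d[i][j]. */
   void Print_two_d(int two_d[][4], int rows, int cols) {
      int i, j;
      for (i = 0; i < rows; i++) {
         for (j = 0; j < cols; j++)
            printf("%d ", two_d[i][j]);
         printf("\n");
      }
   }

   int main(void) {
      int two_d[3][4];
      int i, j;
      for (i = 0; i < 3; i++)
         for (j = 0; j < 4; j++)
            two_d[i][j] = i*4 + j;
      Print_two_d(two_d, 3, 4);
      return 0;
   }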

3.15. What is the relationship between the “row-major” storage for two-dimensional arrays that we discussed in Section 2.2.3 and the one-dimensional storage we use in Section 3.4.9? Get solution
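For concreteness, with an m x n array the two storage schemes are related by the usual row-major index mapping (the names two_d and one_d here are hypothetical):

   two_d[i][j]  corresponds to  one_d[i*n + j]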

3.16. Suppose comm_sz = 8 and the vector x = (0, 1, 2, ..., 15) has been distributed among the processes using a block distribution. Draw a diagram illustrating the steps in a butterfly implementation of allgather of x. Get solution

3.17. MPI_Type_contiguous can be used to build a derived datatype from a collection of contiguous elements in an array. Its syntax is
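(The prototype below is a reconstruction of the standard declaration, with argument names in the book's style.)

   int MPI_Type_contiguous(
         int           count        /* in  */,
         MPI_Datatype  old_mpi_t    /* in  */,
         MPI_Datatype* new_mpi_t_p  /* out */);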

Modify the Read_vector and Print_vector functions so that they use an MPI datatype created by a call to MPI_Type_contiguous and a count argument of 1 in the calls to MPI_Scatter and MPI_Gather. Get solution

3.18. MPI_Type_vector can be used to build a derived datatype from a collection of blocks of elements in an array as long as the blocks all have the same size and they're equally spaced. Its syntax is
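(Again a reconstruction of the standard declaration, argument names in the book's style.)

   int MPI_Type_vector(
         int           count        /* in  */,
         int           blocklength  /* in  */,
         int           stride       /* in  */,
         MPI_Datatype  old_mpi_t    /* in  */,
         MPI_Datatype* new_mpi_t_p  /* out */);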

For example, if we had an array x of 18 doubles and we wanted to build a type corresponding to the elements in positions 0, 1, 6, 7, 12, 13, we could call
   MPI_Type_vector(3, 2, 6, MPI_DOUBLE, &vect_mpi_t);
since the type consists of 3 blocks, each of which has 2 elements, and the spacing between the starts of the blocks is 6 doubles. Write Read_vector and Print_vector functions that will allow process 0 to read and print, respectively, a vector with a block-cyclic distribution. But beware! Do not use MPI_Scatter or MPI_Gather. There is a technical issue involved in using these functions with types created with MPI_Type_vector. (See, for example, [23].) Just use a loop of sends on process 0 in Read_vector and a loop of receives on process 0 in Print_vector. The other processes should be able to complete their calls to Read_vector and Print_vector with a single call to MPI_Recv and MPI_Send. The communication on process 0 should use a derived datatype created by MPI_Type_vector. The calls on the other processes should just use the count argument to the communication function, since they're receiving/sending elements that they will store in contiguous array locations. Get solution

3.19. MPI_Type_indexed can be used to build a derived datatype from arbitrary array elements. Its syntax is
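(The standard declares it roughly as follows, with argument names in the book's style.)

   int MPI_Type_indexed(
         int           count                     /* in  */,
         int           array_of_blocklengths[]   /* in  */,
         int           array_of_displacements[]  /* in  */,
         MPI_Datatype  old_mpi_t                 /* in  */,
         MPI_Datatype* new_mpi_t_p               /* out */);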

Unlike MPI_Type_create_struct, the displacements are measured in units of old_mpi_t, not bytes. Use MPI_Type_indexed to create a derived datatype that corresponds to the upper triangular part of a square matrix. For example, in the 4x4 matrix

    0   1   2   3
    4   5   6   7
    8   9  10  11
   12  13  14  15

the upper triangular part is the elements 0, 1, 2, 3, 5, 6, 7, 10, 11, 15. Process 0 should read in an n x n matrix as a one-dimensional array, create the derived datatype, and send the upper triangular part with a single call to MPI_Send. Process 1 should receive the upper triangular part with a single call to MPI_Recv and then print the data it received. Get solution

3.20. The functions MPI_Pack and MPI_Unpack provide an alternative to derived datatypes for grouping data. MPI_Pack copies the data to be sent, one block at a time, into a user-provided buffer. The buffer can then be sent and received. After the data is received, MPI_Unpack can be used to unpack it from the receive buffer. The syntax of MPI_Pack is
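(Reconstructed from the standard, argument names in the book's style.)

   int MPI_Pack(
         void*         in_buf        /* in     */,
         int           in_buf_count  /* in     */,
         MPI_Datatype  datatype      /* in     */,
         void*         pack_buf      /* out    */,
         int           pack_buf_sz   /* in     */,
         int*          position_p    /* in/out */,
         MPI_Comm      comm          /* in     */);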

 We could therefore pack the input data to the trapezoidal rule program with
the following code:
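A sketch of what that packing code might look like, assuming the trapezoidal rule inputs are a and b (both doubles) and n (an int), and that pack_buf is a 100-byte buffer, matching the MPI_Bcast call below:

   int position = 0;
   MPI_Pack(&a, 1, MPI_DOUBLE, pack_buf, 100, &position, comm);
   MPI_Pack(&b, 1, MPI_DOUBLE, pack_buf, 100, &position, comm);
   MPI_Pack(&n, 1, MPI_INT, pack_buf, 100, &position, comm);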
The key is the position argument. When MPI_Pack is called, position should refer to the first available slot in pack_buf. When MPI_Pack returns, it refers to the first available slot after the data that was just packed, so after process 0 executes this code, all the processes can call MPI_Bcast:
   MPI_Bcast(pack_buf, 100, MPI_PACKED, 0, comm);
Note that the MPI datatype for a packed buffer is MPI_PACKED. Now the other processes can unpack the data using MPI_Unpack:
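(Its prototype, reconstructed from the standard with argument names in the book's style, is:)

   int MPI_Unpack(
         void*         pack_buf       /* in     */,
         int           pack_buf_sz    /* in     */,
         int*          position_p     /* in/out */,
         void*         out_buf        /* out    */,
         int           out_buf_count  /* in     */,
         MPI_Datatype  datatype       /* in     */,
         MPI_Comm      comm           /* in     */);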


MPI_Unpack can be used by “reversing” the steps in MPI_Pack; that is, the data is unpacked one block at a time, starting with position = 0.
Write another Get_input function for the trapezoidal rule program. This one should use MPI_Pack on process 0 and MPI_Unpack on the other processes. Get solution

3.21. How does your system compare to ours? What run-times does your system get for matrix-vector multiplication? What kind of variability do you see in the times for a given value of comm_sz and n? Do the results tend to cluster around the minimum, the mean, or the median? Get solution

3.22. Time our implementation of the trapezoidal rule that uses MPI_Reduce. How will you choose n, the number of trapezoids? How do the minimum times compare to the mean and median times? What are the speedups? What are the efficiencies? On the basis of the data you collected, would you say that the trapezoidal rule is scalable? Get solution

3.23. Although we don't know the internals of the implementation of MPI_Reduce, we might guess that it uses a structure similar to the binary tree we discussed. If this is the case, we would expect that its run-time would grow roughly at the rate of log2(p), since there are roughly log2(p) levels in the tree. (Here, p = comm_sz.) Since the run-time of the serial trapezoidal rule is roughly proportional to n, the number of trapezoids, and the parallel trapezoidal rule simply applies the serial rule to n/p trapezoids on each process, with our assumption about MPI_Reduce, we get a formula for the overall run-time of the parallel trapezoidal rule that looks like

   T_parallel(n, p) = a*(n/p) + b*log2(p)

for some constants a and b.
a. Use the formula, the times you've taken in Exercise 3.22, and your favorite program for doing mathematical calculations (e.g., MATLAB) to get a least-squares estimate of the values of a and b.
b. Comment on the quality of the predicted run-times using the formula and the values for a and b computed in part (a). Get solution

3.24. Take a look at Programming Assignment 3.7. The code that we outlined for timing the cost of sending messages should work even if the count argument is zero. What happens on your system when the count argument is 0? Can you explain why you get a nonzero elapsed time when you send a zero-byte message? Get solution

3.25. If comm_sz = p, we mentioned that the “ideal” speedup is p. Is it possible to do better?
a. Consider a parallel program that computes a vector sum. If we only time the vector sum—that is, we ignore input and output of the vectors—how might this program achieve speedup greater than p?
b. A program that achieves speedup greater than p is said to have superlinear speedup. Our vector sum example only achieved superlinear speedup by overcoming certain “resource limitations.” What were these resource limitations? Is it possible for a program to obtain superlinear speedup without overcoming resource limitations? Get solution

3.26. Serial odd-even transposition sort of an n-element list can sort the list in considerably
fewer than n phases. As an extreme example, if the input list is already sorted, the algorithm requires 0 phases.
a. Write a serial Is_sorted function that determines whether a list is sorted.
b. Modify the serial odd-even transposition sort program so that it checks whether the list is sorted after each phase.
c. If this program is tested on a random collection of n-element lists, roughly what fraction get improved performance by checking whether the list is sorted? Get solution

3.27. Find the speedups and efficiencies of the parallel odd-even sort. Does the program obtain linear speedups? Is it scalable? Is it strongly scalable? Is it weakly scalable? Get solution

3.28. Modify the parallel odd-even transposition sort so that the Merge functions simply swap array pointers after finding the smallest or largest elements. What effect does this change have on the overall run-time? Get solution

PROGRAMMING ASSIGNMENTS

3.1. Use MPI to implement the histogram program discussed in Section 2.7.1. Have process 0 read in the input data and distribute it among the processes. Also have process 0 print out the histogram. Get solution

3.2. Suppose we toss darts randomly at a square dartboard, whose bullseye is at the origin, and whose sides are 2 feet in length. Suppose also that there's a circle inscribed in the square dartboard. The radius of the circle is 1 foot, and its area is π square feet. If the points that are hit by the darts are uniformly distributed (and we always hit the square), then the number of darts that hit inside the circle should approximately satisfy the equation

   number in circle / total number of tosses = π/4,

since the ratio of the area of the circle to the area of the square is π/4.
We can use this formula to estimate the value of π with a random number generator:
   number_in_circle = 0;
   for (toss = 0; toss < number_of_tosses; toss++) {
      x = random double between -1 and 1;
      y = random double between -1 and 1;
      distance_squared = x*x + y*y;
      if (distance_squared <= 1) number_in_circle++;
   }
   pi_estimate = 4*number_in_circle/((double) number_of_tosses);

This is called a “Monte Carlo” method, since it uses randomness (the dart tosses).
Write an MPI program that uses a Monte Carlo method to estimate π.
Process 0 should read in the total number of tosses and broadcast it to the other processes. Use MPI_Reduce to find the global sum of the local variable number_in_circle, and have process 0 print the result. You may want to use long long ints for the number of hits in the circle and the number of tosses, since both may have to be very large to get a reasonable estimate of π. Get solution
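One possible outline of such a program is sketched below. It is not the official solution, and it makes simplifying assumptions: total_tosses is divisible by comm_sz, and rand() is used as the random number generator.

   #include <stdio.h>
   #include <stdlib.h>
   #include <mpi.h>

   int main(void) {
      int my_rank, comm_sz;
      long long total_tosses = 0, local_tosses, toss;
      long long local_in_circle = 0, total_in_circle = 0;
      MPI_Comm comm = MPI_COMM_WORLD;

      MPI_Init(NULL, NULL);
      MPI_Comm_rank(comm, &my_rank);
      MPI_Comm_size(comm, &comm_sz);

      if (my_rank == 0) {
         printf("Enter the total number of tosses\n");
         scanf("%lld", &total_tosses);
      }
      MPI_Bcast(&total_tosses, 1, MPI_LONG_LONG, 0, comm);
      local_tosses = total_tosses / comm_sz;

      srand(my_rank + 1);   /* different seed on each process */
      for (toss = 0; toss < local_tosses; toss++) {
         double x = 2.0*rand()/(double) RAND_MAX - 1.0;
         double y = 2.0*rand()/(double) RAND_MAX - 1.0;
         if (x*x + y*y <= 1.0) local_in_circle++;
      }

      MPI_Reduce(&local_in_circle, &total_in_circle, 1, MPI_LONG_LONG,
            MPI_SUM, 0, comm);
      if (my_rank == 0)
         printf("pi estimate = %f\n",
               4*total_in_circle/((double) total_tosses));

      MPI_Finalize();
      return 0;
   }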

3.3. Write an MPI program that computes a tree-structured global sum. First write your program for the special case in which comm_sz is a power of two. Then, after you've gotten this version working, modify your program so that it can handle any comm_sz. Get solution

3.4. Write an MPI program that computes a global sum using a butterfly. First write your program for the special case in which comm_sz is a power of two. Can you modify your program so that it will handle any number of processes? Get solution Part a - Get solution Part b

3.5. Implement matrix-vector multiplication using a block-column distribution of the matrix. You can have process 0 read in the matrix and simply use a loop of sends to distribute it among the processes. Assume the matrix is square of order n and that n is evenly divisible by comm_sz. You may want to look at the MPI function MPI_Reduce_scatter. Get solution

3.6. Implement matrix-vector multiplication using a block-submatrix distribution of the matrix. Assume that the vectors are distributed among the diagonal processes.
Once again, you can have process 0 read in the matrix and aggregate the sub-matrices before sending them to the processes. Assume comm_sz is a perfect square and that sqrt(comm_sz) evenly divides the order of the matrix. Get solution

3.7. A ping-pong is a communication in which two messages are sent, first from process A to process B (ping) and then from process B back to process A (pong). Timing blocks of repeated ping-pongs is a common way to estimate the cost of sending messages. Time a ping-pong program using the C clock function on your system. How long does the code have to run before clock gives a nonzero run-time? How do the times you got with the clock function compare to times taken with MPI_Wtime? Get solution
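A minimal sketch of the timing loop follows. It assumes exactly two processes (ranks 0 and 1), that my_rank and comm have already been set up, and made-up constants PING_PONG_LIMIT and MSG_SIZE; the point is simply to wrap the same block of ping-pongs with both clock() and MPI_Wtime().

   #include <time.h>
   /* ... inside main, after MPI_Init and the rank lookup ... */
   char buf[MSG_SIZE];
   clock_t c_start = clock();
   double  w_start = MPI_Wtime();
   for (int pass = 0; pass < PING_PONG_LIMIT; pass++) {
      if (my_rank == 0) {
         MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, comm);
         MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, comm, MPI_STATUS_IGNORE);
      } else {  /* my_rank == 1 */
         MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, comm, MPI_STATUS_IGNORE);
         MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, comm);
      }
   }
   double c_elapsed = (clock() - c_start)/((double) CLOCKS_PER_SEC);
   double w_elapsed = MPI_Wtime() - w_start;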

3.8. Parallel merge sort starts with n/comm_sz keys assigned to each process. It ends with all the keys stored on process 0 in sorted order. To achieve this, it uses the same tree-structured communication that we used to implement a global sum. However, when a process receives another process's keys, it merges the new keys into its already sorted list of keys. Write a program that implements parallel merge sort. Process 0 should read in n and broadcast it to the other processes. Each process should use a random number generator to create a local list of n/comm_sz ints. Each process should then sort its local list, and process 0 should gather and print the local lists. Then the processes should use tree-structured communication to merge the global list onto process 0, which prints the result. Get solution

3.9. Write a program that can be used to determine the cost of changing the distribution of a distributed data structure. How long does it take to change from a block distribution of a vector to a cyclic distribution? How long does the reverse redistribution take? Get solution