Solutions An Introduction to Parallel Programming - Pacheco - Chapter 4

4.1. When we discussed matrix-vector multiplication we assumed that both m and n, the number of rows and the number of columns, respectively, were evenly divisible by t, the number of threads. How do the formulas for the assignments change if this is not the case? Get solution

4.2. If we decide to physically divide a data structure among the threads, that is, if we decide to make various members local to individual threads, we need to consider at least three issues: a. How are the members of the data structure used by the individual threads? b. Where and how is the data structure initialized? c. Where and how is the data structure used after its members are computed? We briefly looked at the first issue in the matrix-vector multiplication function. We saw that the entire vector x was used by all of the threads, so it seemed pretty clear that it should be shared. However, for both the matrix A and the product vector y, just looking at (a) seemed to suggest that A and y should have their components distributed among the threads. Let’s take a closer look at this. What would we have to do in order to divide A and y among the threads? Dividing y wouldn’t be difficult–each thread could allocate a block of memory that could be used for storing its assigned components. Presumably, we could do the same for A–each thread could allocate a block of memory for storing its assigned rows. Modify the matrix-vector multiplication program so that it distributes both of these data structures. Can you “schedule” the input and output so that the threads can read in A and print out y? How does distributing A and y affect the run-time of the matrix-vector multiplication? (Don’t include input or output in your run-time.) Get solution
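A minimal sketch of the allocations this would involve (names are illustrative; it assumes m is evenly divisible by thread_count, so each thread owns local_m rows):

int local_m = m/thread_count;
double* local_A = malloc(local_m*n*sizeof(double)); /* this thread's rows of A  */
double* local_y = malloc(local_m*sizeof(double));   /* this thread's block of y */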

4.3. Recall that the compiler is unaware that an ordinary C program is multithreaded, and as a consequence, it may make optimizations that can interfere with busy-waiting. (Note that compiler optimizations should not affect mutexes, condition variables, or semaphores.) An alternative to completely turning off compiler optimizations is to identify some shared variables with the C keyword volatile. This tells the compiler that these variables may be updated by multiple threads and, as a consequence, it shouldn’t apply optimizations to statements involving them. As an example, recall our busy-wait solution to the race condition when multiple threads attempt to add a private variable into a shared variable:
/* x and flag are shared, y is private */
/* x and flag are initialized to 0 by main thread */
y = Compute(my_rank);
while (flag != my_rank);
x = x + y;
flag++;

It’s impossible to tell by looking at this code that the order of the while statement and the x = x + y statement is important; if this code were single-threaded, the order of these two statements wouldn’t affect the outcome of the code. But if the compiler determined that it could improve register usage by interchanging the order of these two statements, the resulting code would be erroneous. If, instead of defining
int flag;
int x;
we define
int volatile flag;
int volatile x;
then the compiler will know that both x and flag can be updated by other threads, so it shouldn’t try reordering the statements. With the gcc compiler, the default behavior is no optimization. You can make certain of this by adding the option -O0 to the command line. Try running the pi calculation program that uses busy-waiting (pth_pi_busy.c) without optimization. How does the result of the multithreaded calculation compare to the single-threaded calculation? Now try running it with optimization; if you’re using gcc, replace the -O0 option with -O2. If you found an error, how many threads did you use? Which variables should be made volatile in the pi calculation? Change these variables so that they’re volatile and rerun the program with and without optimization. How do the results compare to the single-threaded program? Get solution

4.4. The performance of the pi calculation program that uses mutexes remains roughly constant once we increase the number of threads beyond the number of available CPUs. What does this suggest about how the threads are scheduled on the available processors? Get solution

4.5. Modify the mutex version of the pi calculation program so that the critical section is in the for loop. How does the performance of this version compare to the performance of the original busy-wait version? How might we explain this? Get solution

4.6. Modify the mutex version of the pi calculation program so that it uses a semaphore instead of a mutex. How does the performance of this version compare with the mutex version? Get solution

4.7. Although producer-consumer synchronization is easy to implement with semaphores, it’s also possible to implement it with mutexes. The basic idea is to have the producer and the consumer share a mutex. A flag variable that’s initialized to false by the main thread indicates whether there’s anything to consume. With two threads we’d execute something like this:
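The book’s code is omitted here; a minimal sketch of what the two threads might execute, assuming a shared mutex and a shared flag message_available that main initializes to 0:

/* Producer thread */
pthread_mutex_lock(&mutex);
message = "hello";             /* create the message       */
message_available = 1;         /* let the consumer see it  */
pthread_mutex_unlock(&mutex);

/* Consumer thread */
while (1) {
   pthread_mutex_lock(&mutex);
   if (message_available) {
      printf("%s\n", message); /* consume the message      */
      pthread_mutex_unlock(&mutex);
      break;
   }
   pthread_mutex_unlock(&mutex); /* nothing yet; loop back and relock */
}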

So if the consumer gets into the loop first, it will see there’s no message available and return to the call to pthread_mutex_lock. It will continue this process until the producer creates the message. Write a Pthreads program that implements this version of producer-consumer synchronization with two threads. Can you generalize this so that it works with 2k threads: odd-ranked threads are consumers and even-ranked threads are producers? Can you generalize this so that each thread is both a producer and a consumer? For example, suppose that thread q “sends” a message to thread (q+1) mod t and “receives” a message from thread (q-1+t) mod t? Does this use busy-waiting? Get solution

4.8. If a program uses more than one mutex, and the mutexes can be acquired in different orders, the program can deadlock. That is, threads may block forever waiting to acquire one of the mutexes. As an example, suppose that a program has two shared data structures–for example, two arrays or two linked lists–each of which has an associated mutex. Further suppose that each data structure can be accessed (read or modified) after acquiring the data structure’s associated mutex. a. Suppose the program is run with two threads. Further suppose that the following sequence of events occurs:
Time 0: thread 0 acquires the mutex for the first data structure.
Time 1: thread 1 acquires the mutex for the second data structure.
Time 2: thread 0 attempts to acquire the mutex for the second data structure and blocks.
Time 3: thread 1 attempts to acquire the mutex for the first data structure and blocks.
What happens? b. Would this be a problem if the program used busy-waiting (with two flag variables) instead of mutexes? c. Would this be a problem if the program used semaphores instead of mutexes? Get solution

4.9. Some implementations of Pthreads define barriers. The function
int pthread_barrier_init(
      pthread_barrier_t*           barrier_p  /* out */,
      const pthread_barrierattr_t* attr_p     /* in  */,
      unsigned                     count      /* in  */);
initializes a barrier object, barrier_p. As usual, we’ll ignore the second argument and just pass in NULL. The last argument indicates the number of threads that must reach the barrier before they can continue. The barrier itself is a call to the function
int pthread_barrier_wait(
      pthread_barrier_t* barrier_p /* in/out */);

As with most other Pthreads objects, there is a destroy function
int pthread_barrier_destroy(
      pthread_barrier_t* barrier_p /* in/out */);
Modify one of the barrier programs from the book’s website so that it uses a Pthreads barrier. Find a system with a Pthreads implementation that includes barriers and run your program with various numbers of threads. How does its performance compare to the other implementations? Get solution
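A minimal usage sketch, assuming a system that provides these optional functions and a shared variable thread_count:

pthread_barrier_t barrier;                           /* shared */

/* In main, before starting the threads: */
pthread_barrier_init(&barrier, NULL, thread_count);

/* In each thread function; the call blocks until
   thread_count threads have reached the barrier:  */
pthread_barrier_wait(&barrier);

/* In main, after joining the threads: */
pthread_barrier_destroy(&barrier);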

4.10. Modify one of the programs you wrote in the Programming Assignments that follow so that it uses the scheme outlined in Section 4.8 to time itself. In order to get the time that has elapsed since some point in the past, you can use the macro GET_TIME defined in the header file timer.h on the book’s website. Note that this will give wall clock time, not CPU time. Also note that since it’s a macro, it can operate directly on its argument. For example, to implement
Store current time in my_start;
you would use
GET_TIME(my_start);
not
GET_TIME(&my_start);
How will you implement the barrier? How will you implement the following pseudo-code?

elapsed = Maximum of my_elapsed values; Get solution
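A sketch of one way the scheme might look inside each thread; it assumes the barrier, the mutex, and a shared elapsed initialized to 0 are set up elsewhere:

double my_start, my_finish, my_elapsed;

pthread_barrier_wait(&barrier);  /* start everyone together */
GET_TIME(my_start);

/* ... code being timed ... */

GET_TIME(my_finish);
my_elapsed = my_finish - my_start;

/* elapsed = Maximum of my_elapsed values */
pthread_mutex_lock(&mutex);
if (my_elapsed > elapsed) elapsed = my_elapsed;
pthread_mutex_unlock(&mutex);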

4.11. Give an example of a linked list and a sequence of memory accesses to the linked list in which the following pairs of operations can potentially result in problems: a. Two deletes executed simultaneously b. An insert and a delete executed simultaneously c. A member and a delete executed simultaneously d. Two inserts executed simultaneously e. An insert and a member executed simultaneously Get solution

4.12. The linked list operations Insert and Delete consist of two distinct “phases.” In the first phase, both operations search the list for either the position of the new node or the position of the node to be deleted. If the outcome of the first phase so indicates, in the second phase a new node is inserted or an existing node is deleted. In fact, it’s quite common for linked list programs to split each of these operations into two function calls. For both operations, the first phase involves only read-access to the list; only the second phase modifies the list. Would it be safe to lock the list using a read-lock for the first phase? And then to lock the list using a write-lock for the second phase? Explain your answer. Get solution
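A sketch of the locking pattern the exercise is asking about (Search and Insert_at are hypothetical names for the two phases):

/* Phase 1: read-only search */
pthread_rwlock_rdlock(&rwlock);
pos = Search(list, value);
pthread_rwlock_unlock(&rwlock);

/* The list is unlocked here, so other threads can modify it
   between the two phases.                                   */

/* Phase 2: modify the list */
pthread_rwlock_wrlock(&rwlock);
Insert_at(list, pos, value);
pthread_rwlock_unlock(&rwlock);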

4.13. Download the various threaded linked list programs from the website. In our examples, we ran a fixed percentage of searches and split the remaining percentage among inserts and deletes. a. Rerun the experiments with all searches and all inserts. b. Rerun the experiments with all searches and all deletes. Is there a difference in the overall run-times? Is insert or delete more expensive? Get solution

4.14. Recall that in C a function that takes a two-dimensional array argument must specify the number of columns in the argument list. Thus it is quite common for C programmers to only use one-dimensional arrays, and to write explicit code for converting pairs of subscripts into a single dimension. Modify the Pthreads matrix-vector multiplication so that it uses a one-dimensional array for the matrix and calls a matrix-vector multiplication function. How does this change affect the run-time? Get solution
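A sketch of the one-dimensional version: with row-major storage, element (i, j) of an m x n matrix lives at A[i*n + j]:

void Mat_vect_mult(double A[], double x[], double y[], int m, int n) {
   for (int i = 0; i < m; i++) {
      y[i] = 0.0;
      for (int j = 0; j < n; j++)
         y[i] += A[i*n + j]*x[j];  /* A[i][j] becomes A[i*n + j] */
   }
}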

4.15. Download the source file pth_mat_vect_rand_split.c from the book’s website. Find a program that does cache profiling (for example, Valgrind [49]) and compile the program according to the instructions in the cache profiler documentation. (With Valgrind you will want a symbol table and full optimization: gcc -g -O2 ....) Now run the program according to the instructions in the cache profiler documentation, using inputs k x (k*10^6), (k*10^3) x (k*10^3), and (k*10^6) x k. Choose k so large that the number of level 2 cache misses is on the order of 10^6 for at least one of the input sets of data. a. How many level 1 cache write-misses occur with each of the three inputs? b. How many level 2 cache write-misses occur with each of the three inputs? c. Where do most of the write-misses occur? For which input data does the program have the most write-misses? Can you explain why? d. How many level 1 cache read-misses occur with each of the three inputs? e. How many level 2 cache read-misses occur with each of the three inputs? f. Where do most of the read-misses occur? For which input data does the program have the most read-misses? Can you explain why? g. Run the program with each of the three inputs, but without using the cache profiler. With which input is the program the fastest? With which input is the program the slowest? Can your observations about cache misses help explain the differences? How? Get solution

4.16. Recall the matrix-vector multiplication example with the 8000x8000 input. Suppose that the program is run with four threads, and thread 0 and thread 2 are assigned to different processors. If a cache line contains 64 bytes or eight doubles, is it possible for false sharing between threads 0 and 2 to occur for any part of the vector y? Why? What about if thread 0 and thread 3 are assigned to different processors–is it possible for false sharing to occur between them for any part of y? Get solution

4.17. Recall the matrix-vector multiplication example with an 8 x 8,000,000 matrix. Suppose that doubles use 8 bytes of memory and that a cache line is 64 bytes. Also suppose that our system consists of two dual-core processors. a. What is the minimum number of cache lines that are needed to store the vector y? b. What is the maximum number of cache lines that are needed to store the vector y? c. If the boundaries of cache lines always coincide with the boundaries of 8-byte doubles, in how many different ways can the components of y be assigned to cache lines? d. If we only consider which pairs of threads share a processor, in how many different ways can four threads be assigned to the processors in our computer? Here we’re assuming that cores on the same processor share a cache. e. Is there an assignment of components to cache lines and threads to processors that will result in no false sharing in our example? In other words, is it possible that the threads assigned to one processor will have their components of y in one cache line, and the threads assigned to the other processor will have their components in a different cache line? f. How many assignments of components to cache lines and threads to processors are there? g. Of these assignments, how many will result in no false sharing? Get solution

4.18. a. Modify the matrix-vector multiplication program so that it pads the vector y when there’s a possibility of false sharing. The padding should be done so that if the threads execute in lock-step, there’s no possibility that a single cache line containing an element of y will be shared by two or more threads. Suppose, for example, that a cache line stores eight doubles and we run the program with four threads. If we allocate storage for at least 48 doubles in y, then, on each pass through the for i loop, there’s no possibility that two threads will simultaneously access the same cache line. b. Modify the matrix-vector multiplication so that each thread uses private storage for its part of y during the for i loop. When a thread is done computing its part of y, it should copy its private storage into the shared variable. c. How does the performance of these two alternatives compare to the original program? How do they compare to each other? Get solution
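A sketch of alternative (b), assuming this thread owns local_m components of y starting at index my_first_row (names are illustrative):

double* my_y = malloc(local_m*sizeof(double));   /* private storage */

for (int i = 0; i < local_m; i++) {
   my_y[i] = 0.0;
   for (int j = 0; j < n; j++)
      my_y[i] += A[(my_first_row + i)*n + j]*x[j];
}

/* One copy into the shared vector after the thread is done */
memcpy(y + my_first_row, my_y, local_m*sizeof(double));
free(my_y);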

4.19. Although strtok_r is thread-safe, it has the rather unfortunate property that it gratuitously modifies the input string. Write a tokenizer that is thread-safe and doesn’t modify the input string. Get solution

 PROGRAMMING ASSIGNMENTS

4.1. Write a Pthreads program that implements the histogram program in Chapter 2. Get solution

4.2. Suppose we toss darts randomly at a square dartboard, whose bullseye is at the origin, and whose sides are 2 feet in length. Suppose also that there’s a circle inscribed in the square dartboard. The radius of the circle is 1 foot, and its area is pi square feet. If the points that are hit by the darts are uniformly distributed (and we always hit the square), then the number of darts that hit inside the circle should approximately satisfy the equation number in circle/total number of tosses = pi/4, since the ratio of the area of the circle to the area of the square is pi/4. We can use this formula to estimate the value of pi with a random number generator:
number_in_circle = 0;
for (toss = 0; toss < number_of_tosses; toss++) {
   x = random double between -1 and 1;
   y = random double between -1 and 1;
   distance_squared = x*x + y*y;
   if (distance_squared <= 1) number_in_circle++;
}
pi_estimate = 4*number_in_circle/((double) number_of_tosses);
This is called a “Monte Carlo” method, since it uses randomness (the dart tosses). Write a Pthreads program that uses a Monte Carlo method to estimate pi. The main thread should read in the total number of tosses and print the estimate. You may want to use long long ints for the number of hits in the circle and the number of tosses, since both may have to be very large to get a reasonable estimate of pi. Get solution

4.3. Write a Pthreads program that implements the trapezoidal rule. Use a shared variable for the sum of all the threads’ computations. Use busy-waiting, mutexes, and semaphores to enforce mutual exclusion in the critical section. What advantages and disadvantages do you see with each approach? Get solution

4.4. Write a Pthreads program that finds the average time required by your system to create and terminate a thread. Does the number of threads affect the average time? If so, how? Get solution

4.5. Write a Pthreads program that implements a “task queue.” The main thread begins by starting a user-specified number of threads that immediately go to sleep in a condition wait. The main thread generates blocks of tasks to be carried out by the other threads; each time it generates a new block of tasks, it awakens a thread with a condition signal. When a thread finishes executing its block of tasks, it should return to a condition wait. When the main thread completes generating tasks, it sets a global variable indicating that there will be no more tasks, and awakens all the threads with a condition broadcast. For the sake of explicitness, make your tasks linked list operations. Get solution

4.6. Write a Pthreads program that uses two condition variables and a mutex to implement a read-write lock. Download the online linked list program that uses Pthreads read-write locks, and modify it to use your read-write locks. Now compare the performance of the program when readers are given preference with the program when writers are given preference. Can you make any generalizations? Get solution

Solutions An Introduction to Parallel Programming - Pacheco - Chapter 3

3.1. What happens in the greetings program if, instead of strlen(greeting) + 1, we use strlen(greeting) for the length of the message being sent by processes 1, 2, ..., comm_sz-1? What happens if we use MAX_STRING instead of strlen(greeting) + 1? Can you explain these results? Get solution

3.2. Modify the trapezoidal rule so that it will correctly estimate the integral even if comm_sz doesn’t evenly divide n. (You can still assume that n >= comm_sz.) Get solution

3.3. Determine which of the variables in the trapezoidal rule program are local and which are global. Get solution

3.4. Modify the program that just prints a line of output from each process (mpi_output.c) so that the output is printed in process rank order: process 0’s output first, then process 1’s, and so on. Get solution

3.5. In a binary tree, there is a unique shortest path from each node to the root. The length of this path is often called the depth of the node. A binary tree in which every nonleaf has two children is called a full binary tree, and a full binary tree in which every leaf has the same depth is sometimes called a complete binary tree. See Figure 3.14. Use the principle of mathematical induction to prove that if T is a complete binary tree with n leaves, then the depth of the leaves is log2(n). Get solution

3.6. Suppose comm_sz = 4 and suppose that x is a vector with n = 14 components. a. How would the components of x be distributed among the processes in a program that used a block distribution?

b. How would the components of x be distributed among the processes in a program that used a cyclic distribution? c. How would the components of x be distributed among the processes in a program that used a block-cyclic distribution with blocksize b = 2? You should try to make your distributions general so that they could be used regardless of what comm_sz and n are. You should also try to make your distributions “fair” so that if q and r are any two processes, the difference between the number of components assigned to q and the number of components assigned to r is as small as possible. Get solution

3.7. What do the various MPI collective functions do if the communicator contains a single process? Get solution

3.8. Suppose comm_sz = 8 and n = 16. a. Draw a diagram that shows how MPI_Scatter can be implemented using tree-structured communication with comm_sz processes when process 0 needs to distribute an array containing n elements. b. Draw a diagram that shows how MPI_Gather can be implemented using tree-structured communication when an n-element array that has been distributed among comm_sz processes needs to be gathered onto process 0. Get solution

3.9. Write an MPI program that implements multiplication of a vector by a scalar and dot product. The user should enter two vectors and a scalar, all of which are read in by process 0 and distributed among the processes. The results are calculated and collected onto process 0, which prints them. You can assume that n, the order of the vectors, is evenly divisible by comm_sz. Get solution

3.10. In the Read_vector function shown in Program 3.9, we use local_n as the actual argument for two of the formal arguments to MPI_Scatter: send_count and recv_count. Why is it OK to alias these arguments? Get solution

3.11. Finding prefix sums is a generalization of global sum. Rather than simply finding the sum of n values, x0 + x1 + ... + x(n-1), the prefix sums are the n partial sums x0, x0 + x1, x0 + x1 + x2, ..., x0 + x1 + ... + x(n-1). a. Devise a serial algorithm for computing the n prefix sums of an array with n elements. b. Parallelize your serial algorithm for a system with n processes, each of which is storing one of the x_i's. c. Suppose n = 2^k for some positive integer k. Can you devise a serial algorithm and a parallelization of the serial algorithm so that the parallel algorithm requires only k communication phases? d. MPI provides a collective communication function, MPI_Scan, that can be used to compute prefix sums:
int MPI_Scan(
      void*        sendbuf_p  /* in  */,
      void*        recvbuf_p  /* out */,
      int          count      /* in  */,
      MPI_Datatype datatype   /* in  */,
      MPI_Op       op         /* in  */,
      MPI_Comm     comm       /* in  */);
It operates on arrays with count elements; both sendbuf_p and recvbuf_p should refer to blocks of count elements of type datatype. The op argument is the same as op for MPI_Reduce. Write an MPI program that generates a random array of count elements on each MPI process, finds the prefix sums, and prints the results. Get solution
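A minimal call sketch (array names are illustrative; each process contributes count doubles):

MPI_Scan(local_vals, prefix_sums, count, MPI_DOUBLE, MPI_SUM, comm);

After the call, the prefix_sums array on process q holds the element-wise sum of the local_vals arrays on processes 0, 1, ..., q.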

3.12. An alternative to a butterfly-structured allreduce is a ring-pass structure. In a ring-pass, if there are p processes, each process q sends data to process q+1, except that process p-1 sends data to process 0. This is repeated until each process has the desired result. Thus, we can implement allreduce with the following code:

sum = temp_val = my_val;
for (i = 1; i < p; i++) {
   MPI_Sendrecv_replace(&temp_val, 1, MPI_INT, dest, sendtag,
         source, recvtag, comm, &status);
   sum += temp_val;
}

a. Write an MPI program that implements this algorithm for allreduce. How does its performance compare to the butterfly-structured allreduce? b. Modify the MPI program you wrote in the first part so that it implements prefix sums. Get solution

3.13. MPI_Scatter and MPI_Gather have the limitation that each process must send or receive the same number of data items. When this is not the case, we must use the MPI functions MPI_Gatherv and MPI_Scatterv. Look at the man pages for these functions, and modify your vector sum, dot product program so that it can correctly handle the case when n isn’t evenly divisible by comm_sz. Get solution

3.14. a. Write a serial C program that defines a two-dimensional array in the main function. Just use numeric constants for the dimensions: int two_d[3][4]; Initialize the array in the main function. After the array is initialized, call a function that attempts to print the array. The prototype for the function should look something like this: void Print_two_d(int two_d[][], int rows, int cols); After writing the function try to compile the program. Can you explain why it won’t compile? b. After consulting a C reference (e.g., Kernighan and Ritchie [29]), modify the program so that it will compile and run, but so that it still uses a two-dimensional C array. Get solution

3.15. What is the relationship between the “row-major” storage for two-dimensional arrays that we discussed in Section 2.2.3 and the one-dimensional storage we use in Section 3.4.9? Get solution

3.16. Suppose comm_sz = 8 and the vector x = (0, 1, 2, ..., 15) has been distributed among the processes using a block distribution. Draw a diagram illustrating the steps in a butterfly implementation of allgather of x. Get solution

3.17. MPI_Type_contiguous can be used to build a derived datatype from a collection of contiguous elements in an array. Its syntax is
int MPI_Type_contiguous(
      int           count        /* in  */,
      MPI_Datatype  old_mpi_t    /* in  */,
      MPI_Datatype* new_mpi_t_p  /* out */);
Modify the Read_vector and Print_vector functions so that they use an MPI datatype created by a call to MPI_Type_contiguous and a count argument of 1 in the calls to MPI_Scatter and MPI_Gather. Get solution
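A sketch of the datatype setup, assuming each process stores local_n doubles (names are illustrative):

MPI_Datatype vect_mpi_t;
MPI_Type_contiguous(local_n, MPI_DOUBLE, &vect_mpi_t);
MPI_Type_commit(&vect_mpi_t);

/* e.g., in Read_vector, a count argument of 1 with the derived type */
MPI_Scatter(a, 1, vect_mpi_t, local_a, 1, vect_mpi_t, 0, comm);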

3.18. MPI_Type_vector can be used to build a derived datatype from a collection of blocks of elements in an array as long as the blocks all have the same size and they’re equally spaced. Its syntax is
int MPI_Type_vector(
      int           count        /* in  */,
      int           blocklength  /* in  */,
      int           stride       /* in  */,
      MPI_Datatype  old_mpi_t    /* in  */,
      MPI_Datatype* new_mpi_t_p  /* out */);
For example, if we had an array x of 18 doubles and we wanted to build a type corresponding to the elements in positions 0, 1, 6, 7, 12, 13, we could call MPI_Type_vector(3, 2, 6, MPI_DOUBLE, &vect_mpi_t); since the type consists of 3 blocks, each of which has 2 elements, and the spacing between the starts of the blocks is 6 doubles. Write Read_vector and Print_vector functions that will allow process 0 to read and print, respectively, a vector with a block-cyclic distribution. But beware! Do not use MPI_Scatter or MPI_Gather. There is a technical issue involved in using these functions with types created with MPI_Type_vector. (See, for example, [23].) Just use a loop of sends on process 0 in Read_vector and a loop of receives on process 0 in Print_vector. The other processes should be able to complete their calls to Read_vector and Print_vector with a single call to MPI_Recv and MPI_Send. The communication on process 0 should use a derived datatype created by MPI_Type_vector. The calls on the other processes should just use the count argument to the communication function, since they’re receiving/sending elements that they will store in contiguous array locations. Get solution

3.19. MPI_Type_indexed can be used to build a derived datatype from arbitrary array elements. Its syntax is
int MPI_Type_indexed(
      int           count                    /* in  */,
      int           array_of_blocklengths[]  /* in  */,
      int           array_of_displacements[] /* in  */,
      MPI_Datatype  old_mpi_t                /* in  */,
      MPI_Datatype* new_mpi_t_p              /* out */);
Unlike MPI_Type_create_struct, the displacements are measured in units of old_mpi_t—not bytes. Use MPI_Type_indexed to create a derived datatype that corresponds to the upper triangular part of a square matrix. For example, in the 4x4 matrix
 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15
the upper triangular part is the elements 0, 1, 2, 3, 5, 6, 7, 10, 11, 15. Process 0 should read in an n x n matrix as a one-dimensional array, create the derived datatype, and send the upper triangular part with a single call to MPI_Send. Process 1 should receive the upper triangular part with a single call to MPI_Recv and then print the data it received. Get solution
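A sketch of how the derived datatype might be built for the general n x n case, assuming a matrix of doubles (names are illustrative):

int* blocklengths  = malloc(n*sizeof(int));
int* displacements = malloc(n*sizeof(int));
MPI_Datatype upper_tri_t;

/* Row i contributes the n - i elements starting at entry i*n + i */
for (int i = 0; i < n; i++) {
   blocklengths[i]  = n - i;
   displacements[i] = i*n + i;
}
MPI_Type_indexed(n, blocklengths, displacements, MPI_DOUBLE, &upper_tri_t);
MPI_Type_commit(&upper_tri_t);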

3.20. The functions MPI_Pack and MPI_Unpack provide an alternative to derived datatypes for grouping data. MPI_Pack copies the data to be sent, one block at a time, into a user-provided buffer. The buffer can then be sent and received. After the data is received, MPI_Unpack can be used to unpack it from the receive buffer. The syntax of MPI_Pack is
int MPI_Pack(
      void*        in_buf       /* in     */,
      int          in_count     /* in     */,
      MPI_Datatype datatype     /* in     */,
      void*        pack_buf     /* out    */,
      int          pack_buf_sz  /* in     */,
      int*         position_p   /* in/out */,
      MPI_Comm     comm         /* in     */);
We could therefore pack the input data to the trapezoidal rule program with the following code:
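A sketch of what that code might look like, assuming the usual trapezoidal rule inputs a, b (doubles) and n (an int) and a buffer char pack_buf[100], consistent with the MPI_Bcast call below:

int position = 0;
MPI_Pack(&a, 1, MPI_DOUBLE, pack_buf, 100, &position, comm);
MPI_Pack(&b, 1, MPI_DOUBLE, pack_buf, 100, &position, comm);
MPI_Pack(&n, 1, MPI_INT,    pack_buf, 100, &position, comm);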
The key is the position argument. When MPI_Pack is called, position should refer to the first available slot in pack_buf. When MPI_Pack returns, it refers to the first available slot after the data that was just packed, so after process 0 executes this code, all the processes can call MPI_Bcast:
MPI_Bcast(pack_buf, 100, MPI_PACKED, 0, comm);
Note that the MPI datatype for a packed buffer is MPI_PACKED. Now the other processes can unpack the data using MPI_Unpack:

int MPI_Unpack(
      void*        pack_buf     /* in     */,
      int          pack_buf_sz  /* in     */,
      int*         position_p   /* in/out */,
      void*        out_buf      /* out    */,
      int          out_count    /* in     */,
      MPI_Datatype datatype     /* in     */,
      MPI_Comm     comm         /* in     */);
This can be used by “reversing” the steps in MPI_Pack, that is, the data is unpacked one block at a time starting with position = 0.
Write another Get_input function for the trapezoidal rule program. This one should use MPI_Pack on process 0 and MPI_Unpack on the other processes. Get solution

3.21. How does your system compare to ours? What run-times does your system get for matrix-vector multiplication? What kind of variability do you see in the times for a given value of comm_sz and n? Do the results tend to cluster around the minimum, the mean, or the median? Get solution

3.22. Time our implementation of the trapezoidal rule that uses MPI_Reduce. How will you choose n, the number of trapezoids? How do the minimum times compare to the mean and median times? What are the speedups? What are the efficiencies? On the basis of the data you collected, would you say that the trapezoidal rule is scalable? Get solution

3.23. Although we don’t know the internals of the implementation of MPI_Reduce, we might guess that it uses a structure similar to the binary tree we discussed. If this is the case, we would expect that its run-time would grow roughly at the rate of log2(p), since there are roughly log2(p) levels in the tree. (Here, p = comm_sz.) Since the run-time of the serial trapezoidal rule is roughly proportional to n, the number of trapezoids, and the parallel trapezoidal rule simply applies the serial rule to n/p trapezoids on each process, with our assumption about MPI_Reduce, we get a formula for the overall run-time of the parallel trapezoidal rule that looks like

T_parallel(n, p) = a*(n/p) + b*log2(p)

for some constants a and b.
a. Use the formula, the times you’ve taken in Exercise 3.22, and your favorite program for doing mathematical calculations (e.g., MATLAB) to get a least-squares estimate of the values of a and b.
b. Comment on the quality of the predicted run-times using the formula and the values for a and b computed in part (a). Get solution

3.24. Take a look at Programming Assignment 3.7. The code that we outlined for timing the cost of sending messages should work even if the count argument is zero. What happens on your system when the count argument is 0? Can you explain why you get a nonzero elapsed time when you send a zero-byte message? Get solution

3.25. If comm_sz = p, we mentioned that the “ideal” speedup is p. Is it possible to do better?
a. Consider a parallel program that computes a vector sum. If we only time the vector sum—that is, we ignore input and output of the vectors—how might this program achieve speedup greater than p?
b. A program that achieves speedup greater than p is said to have superlinear speedup. Our vector sum example only achieved superlinear speedup by overcoming certain “resource limitations.” What were these resource limitations? Is it possible for a program to obtain superlinear speedup without overcoming resource limitations? Get solution

3.26. Serial odd-even transposition sort of an n-element list can sort the list in considerably fewer than n phases. As an extreme example, if the input list is already sorted, the algorithm requires 0 phases.
a. Write a serial Is_sorted function that determines whether a list is sorted.
b. Modify the serial odd-even transposition sort program so that it checks whether the list is sorted after each phase.
c. If this program is tested on a random collection of n-element lists, roughly what fraction get improved performance by checking whether the list is sorted? Get solution

3.27. Find the speedups and efficiencies of the parallel odd-even sort. Does the program obtain linear speedups? Is it scalable? Is it strongly scalable? Is it weakly scalable? Get solution

3.28. Modify the parallel odd-even transposition sort so that the Merge functions simply swap array pointers after finding the smallest or largest elements. What effect does this change have on the overall run-time? Get solution

PROGRAMMING ASSIGNMENTS

3.1. Use MPI to implement the histogram program discussed in Section 2.7.1. Have process 0 read in the input data and distribute it among the processes. Also have process 0 print out the histogram. Get solution

3.2. Suppose we toss darts randomly at a square dartboard, whose bullseye is at the origin, and whose sides are 2 feet in length. Suppose also that there’s a circle inscribed in the square dartboard. The radius of the circle is 1 foot, and its area is pi square feet. If the points that are hit by the darts are uniformly distributed (and we always hit the square), then the number of darts that hit inside the circle should approximately satisfy the equation

number in circle/total number of tosses = pi/4,

since the ratio of the area of the circle to the area of the square is pi/4.
We can use this formula to estimate the value of pi with a random number generator:
number_in_circle = 0;
for (toss = 0; toss < number_of_tosses; toss++) {
   x = random double between -1 and 1;
   y = random double between -1 and 1;
   distance_squared = x*x + y*y;
   if (distance_squared <= 1) number_in_circle++;
}
pi_estimate = 4*number_in_circle/((double) number_of_tosses);

This is called a “Monte Carlo” method, since it uses randomness (the dart tosses).
Write an MPI program that uses a Monte Carlo method to estimate pi.
Process 0 should read in the total number of tosses and broadcast it to the other processes. Use MPI_Reduce to find the global sum of the local variable number_in_circle, and have process 0 print the result. You may want to use long long ints for the number of hits in the circle and the number of tosses, since both may have to be very large to get a reasonable estimate of pi. Get solution

3.3. Write an MPI program that computes a tree-structured global sum. First write your program for the special case in which comm_sz is a power of two. Then, after you’ve gotten this version working, modify your program so that it can handle any comm_sz. Get solution

3.4. Write an MPI program that computes a global sum using a butterfly. First write your program for the special case in which comm_sz is a power of two. Can you modify your program so that it will handle any number of processes? Get solution Part a - Get solution Part b

3.5. Implement matrix-vector multiplication using a block-column distribution of the matrix. You can have process 0 read in the matrix and simply use a loop of sends to distribute it among the processes. Assume the matrix is square of order n and that n is evenly divisible by comm_sz. You may want to look at the MPI function MPI_Reduce_scatter. Get solution

3.6. Implement matrix-vector multiplication using a block-submatrix distribution of the matrix. Assume that the vectors are distributed among the diagonal processes.
Once again, you can have process 0 read in the matrix and aggregate the sub-matrices before sending them to the processes. Assume comm_sz is a perfect square and that sqrt(comm_sz) evenly divides the order of the matrix. Get solution

3.7. A ping-pong is a communication in which two messages are sent, first from process A to process B (ping) and then from process B back to process A (pong). Timing blocks of repeated ping-pongs is a common way to estimate the cost of sending messages. Time a ping-pong program using the C clock function on your system. How long does the code have to run before clock gives a nonzero run-time? How do the times you got with the clock function compare to times taken with MPI_Wtime? Get solution

3.8. Parallel merge sort starts with n/comm_sz keys assigned to each process. It ends with all the keys stored on process 0 in sorted order. To achieve this, it uses the same tree-structured communication that we used to implement a global sum. However, when a process receives another process’ keys, it merges the new keys into its already sorted list of keys. Write a program that implements parallel merge sort. Process 0 should read in n and broadcast it to the other processes. Each process should use a random number generator to create a local list of n/comm_sz ints. Each process should then sort its local list, and process 0 should gather and print the local lists. Then the processes should use tree-structured communication to merge the global list onto process 0, which prints the result. Get solution

3.9. Write a program that can be used to determine the cost of changing the distribution of a distributed data structure. How long does it take to change from a block distribution of a vector to a cyclic distribution? How long does the reverse redistribution take? Get solution

Solutions An Introduction to Parallel Programming - Pacheco - Chapter 2

2.1. When we were discussing floating point addition, we made the simplifying assumption that each of the functional units took the same amount of time. Suppose that fetch and store each take 2 nanoseconds and the remaining operations each take 1 nanosecond. a. How long does a floating point addition take with these assumptions? b. How long will an unpipelined addition of 1000 pairs of floats take with these assumptions? c. How long will a pipelined addition of 1000 pairs of floats take with these assumptions? d. The time required for fetch and store may vary considerably if the operands/ results are stored in different levels of the memory hierarchy. Suppose that a fetch from a level 1 cache takes two nanoseconds, while a fetch from a level 2 cache takes five nanoseconds, and a fetch from main memory takes fifty nanoseconds. What happens to the pipeline when there is a level 1 cache miss on a fetch of one of the operands? What happens when there is a level 2 miss? Get solution

2.2. Explain how a queue, implemented in hardware in the CPU, could be used to improve the performance of a write-through cache. Get solution

2.3. Recall the example involving cache reads of a two-dimensional array (page 22). How does a larger matrix and a larger cache affect the performance of the two pairs of nested loops? What happens if MAX = 8 and the cache can store four lines? How many misses occur in the reads of A in the first pair of nested loops? How many misses occur in the second pair? Get solution

2.4. In Table 2.2, virtual addresses consist of a byte offset of 12 bits and a virtual page number of 20 bits. How many pages can a program have if it’s run on a system with this page size and this virtual address size? Get solution

2.5. Does the addition of cache and virtual memory to a von Neumann system change its designation as an SISD system? What about the addition of pipelining? Multiple issue? Hardware multithreading? Get solution

2.6. Suppose that a vector processor has a memory system in which it takes 10 cycles to load a single 64-bit word from memory. How many memory banks are needed so that a stream of loads can, on average, require only one cycle per load? Get solution

2.7. Discuss the differences in how a GPU and a vector processor might execute the following code:

sum = 0.0;
for (i = 0; i < n; i++) {
   y[i] += a*x[i];
   sum += z[i]*z[i];
}

Get solution

2.8. Explain why the performance of a hardware multithreaded processing core might degrade if it had large caches and it ran many threads. Get solution

2.9. In our discussion of parallel hardware, we used Flynn’s taxonomy to identify three types of parallel systems: SISD, SIMD, and MIMD. None of our systems were identified as multiple instruction, single data, or MISD. How would an MISD system work? Give an example. Get solution

2.10. Suppose a program must execute 10^12 instructions in order to solve a particular problem. Suppose further that a single processor system can solve the problem in 10^6 seconds (about 11.6 days). So, on average, the single processor system executes 10^6 or a million instructions per second. Now suppose that the program has been parallelized for execution on a distributed-memory system. Suppose also that if the parallel program uses p processors, each processor will execute 10^12/p instructions and each processor must send 10^9(p-1) messages. Finally, suppose that there is no additional overhead in executing the parallel program. That is, the program will complete after each processor has executed all of its instructions and sent all of its messages, and there won’t be any delays due to things such as waiting for messages. a. Suppose it takes 10^-9 seconds to send a message. How long will it take the program to run with 1000 processors, if each processor is as fast as the single processor on which the serial program was run? b. Suppose it takes 10^-3 seconds to send a message. How long will it take the program to run with 1000 processors? Get solution

2.11. Derive formulas for the total number of links in the various distributed-memory interconnects. Get solution

2.12. a. A planar mesh is just like a toroidal mesh, except that it doesn’t have the “wraparound” links. What is the bisection width of a square planar mesh? b. A three-dimensional mesh is similar to a planar mesh, except that it also has depth. What is the bisection width of a three-dimensional mesh? Get solution

2.13. a. Sketch a four-dimensional hypercube. b. Use the inductive definition of a hypercube to explain why the bisection width of a hypercube is p/2. Get solution

2.14. To define the bisection width for indirect networks, the processors are partitioned into two groups so that each group has half the processors. Then, links are removed from anywhere in the network so that the two groups are no longer connected. The minimum number of links removed is the bisection width. When we count links, if the diagram uses unidirectional links, two unidirectional links count as one link. Show that an eight-by-eight crossbar has a bisection width less than or equal to eight. Also show that an omega network with eight processors has a bisection width less than or equal to four. Get solution

2.15. a. Suppose a shared-memory system uses snooping cache coherence and write-back caches. Also suppose that core 0 has the variable x in its cache, and it executes the assignment x = 5. Finally suppose that core 1 doesn’t have x in its cache, and after core 0’s update to x, core 1 tries to execute y = x. What value will be assigned to y? Why? b. Suppose that the shared-memory system in the previous part uses a directory-based protocol. What value will be assigned to y? Why? c. Can you suggest how any problems you found in the first two parts might be solved? Get solution

2.16. a. Suppose the run-time of a serial program is given by T_serial = n^2, where the units of the run-time are in microseconds. Suppose that a parallelization of this program has run-time T_parallel = n^2/p + log2(p). Write a program that finds the speedups and efficiencies of this program for various values of n and p. Run your program with n = 10, 20, 40, ..., 320, and p = 1, 2, 4, ..., 128. What happens to the speedups and efficiencies as p is increased and n is held fixed? What happens when p is fixed and n is increased? b. Suppose that T_parallel = T_serial/p + T_overhead. Also suppose that we fix p and increase the problem size. - Show that if T_overhead grows more slowly than T_serial, the parallel efficiency will increase as we increase the problem size. - Show that if, on the other hand, T_overhead grows faster than T_serial, the parallel efficiency will decrease as we increase the problem size. Get solution
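For part (a), a minimal sketch of the computation in C (times in microseconds; n and p as in the exercise):

#include <math.h>

double t_serial   = (double) n*n;
double t_parallel = (double) n*n/p + log2((double) p);
double speedup    = t_serial/t_parallel;
double efficiency = speedup/p;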

2.17. A parallel program that obtains a speedup greater than p—the number of processes or threads—is sometimes said to have superlinear speedup. However, many authors don’t count programs that overcome “resource limitations” as having superlinear speedup. For example, a program that must use secondary storage for its data when it’s run on a single processor system might be able to fit all its data into main memory when run on a large distributed-memory system. Give another example of how a program might overcome a resource limitation and obtain speedups greater than p. Get solution

2.18. Look at three programs you wrote in your Introduction to Computer Science class. What (if any) parts of these programs are inherently serial? Does the inherently serial part of the work done by the program decrease as the problem size increases? Or does it remain roughly the same? Get solution

2.19. Suppose T_serial = n and T_parallel = n/p + log2(p), where times are in microseconds. If we increase p by a factor of k, find a formula for how much we’ll need to increase n in order to maintain constant efficiency. How much should we increase n by if we double the number of processes from 8 to 16? Is the parallel program scalable? Get solution

2.20. Is a program that obtains linear speedup strongly scalable? Explain your answer. Get solution

2.21. Bob has a program that he wants to time with two sets of data, input_data1 and input_data2. To get some idea of what to expect before adding timing functions to the code he’s interested in, he runs the program with two sets of data and the Unix shell command time:

$ time ./bobs_prog < input_data1
real    0m0.001s
user    0m0.001s
sys     0m0.000s

$ time ./bobs_prog < input_data2
real    1m1.234s
user    1m0.001s
sys     0m0.111s

The timer function Bob is using has millisecond resolution. Should Bob use it to time his program with the first set of data? What about the second set of data? Why or why not? Get solution

2.22. As we saw in the preceding problem, the Unix shell command time reports the user time, the system time, and the “real” time or total elapsed time. Suppose that Bob has defined the following functions that can be called in a C program:

double utime(void);
double stime(void);
double rtime(void);

The first returns the number of seconds of user time that have elapsed since the program started execution, the second returns the number of system seconds, and the third returns the total number of seconds. Roughly, user time is time spent in the user code and library functions that don’t need to use the operating system–for example, sin and cos. System time is time spent in functions that do need to use the operating system–for example, printf and scanf. a. What is the mathematical relation among the three function values? That is, suppose the program contains the following code:

u = utime();
s = stime();
r = rtime();

Write a formula that expresses the relation between u, s, and r. (You can assume that the time it takes to call the functions is negligible.) b. On Bob’s system, any time that an MPI process spends waiting for messages isn’t counted by either utime or stime, but the time is counted by rtime. Explain how Bob can use these facts to determine whether an MPI process is spending too much time waiting for messages. c. Bob has given Sally his timing functions. However, Sally has discovered that on her system, the time an MPI process spends waiting for messages is counted as user time. Furthermore, sending messages doesn’t use any system time. Can Sally use Bob’s functions to determine whether an MPI process is spending too much time waiting for messages? Explain your answer. Get solution

2.23. In our application of Foster’s methodology to the construction of a histogram, we essentially identified aggregate tasks with elements of data. An apparent alternative would be to identify aggregate tasks with elements of bin counts, so an aggregate task would consist of all increments of bin_counts[b] and consequently all calls to Find_bin that return b. Explain why this aggregation might be a problem. Get solution

2.24. If you haven’t already done so in Chapter 1, try to write pseudo-code for our tree-structured global sum, which sums the elements of loc_bin_cts. First consider how this might be done in a shared-memory setting. Then consider how this might be done in a distributed-memory setting. In the shared-memory setting, which variables are shared and which are private? Get solution

Solutions An Introduction to Parallel Programming - Pacheco - Chapter 1

1.1 Devise formulas for the functions that calculate my_first_i and my_last_i in the global sum example. Remember that each core should be assigned roughly the same number of elements of computations in the loop. Hint: First consider the case when n is evenly divisible by p. Get solution

1.2 We’ve implicitly assumed that each call to Compute_next_value requires roughly the same amount of work as the other calls. How would you change your answer to the preceding question if call i = k requires k+1 times as much work as the call with i = 0? So if the first call (i = 0) requires 2 milliseconds, the second call (i = 1) requires 4, the third (i = 2) requires 6, and so on. Get solution

1.3 Try to write pseudo-code for the tree-structured global sum illustrated in Figure 1.1. Assume the number of cores is a power of two (1, 2, 4, 8, . . . ). Hints: Use a variable divisor to determine whether a core should send its sum or receive and add. The divisor should start with the value 2 and be doubled after each iteration. Also use a variable core_difference to determine which core should be partnered with the current core. It should start with the value 1 and also be doubled after each iteration. For example, in the first iteration 0 % divisor = 0 and 1 % divisor = 1, so 0 receives and adds, while 1 sends. Also in the first iteration 0 + core_difference = 1 and 1 - core_difference = 0, so 0 and 1 are paired in the first iteration. Get solution

1.4 As an alternative to the approach outlined in the preceding problem, we can use C’s bitwise operators to implement the tree-structured global sum. In order to see how this works, it helps to write down the binary (base 2) representation of each of the core ranks, and note the pairings during each stage:
Cores:   0 = 000, 1 = 001, 2 = 010, 3 = 011, 4 = 100, 5 = 101, 6 = 110, 7 = 111
Stage 1: 0 <-> 1, 2 <-> 3, 4 <-> 5, 6 <-> 7 (ranks differ in the rightmost bit)
Stage 2: 0 <-> 2, 4 <-> 6 (ranks differ in the second bit)
Stage 3: 0 <-> 4 (ranks differ in the third bit)
From the table we see that during the first stage each core is paired with the core whose rank differs in the rightmost or first bit. During the second stage cores that continue are paired with the core whose rank differs in the second bit, and during the third stage cores are paired with the core whose rank differs in the third bit. Thus, if we have a binary value bitmask that is 001 for the first stage, 010 for the second, and 100 for the third, we can get the rank of the core we’re paired with by “inverting” the bit in our rank that is nonzero in bitmask. This can be done using the bitwise exclusive or ^ operator. Implement this algorithm in pseudo-code using the bitwise exclusive or and the left-shift operator. Get solution
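A sketch of the pairing computation this paragraph describes (pseudo-code; it assumes the number of cores is a power of two):

bitmask = 1;
while (bitmask < number_of_cores) {
   partner = my_rank ^ bitmask;   /* invert the bit that's set in bitmask */
   if ((my_rank & bitmask) == 0) {
      receive partner's sum and add it into my sum;
   } else {
      send my sum to partner;
      break;                      /* this core is finished */
   }
   bitmask <<= 1;                 /* left shift: 001, 010, 100, ... */
}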

1.5 What happens if your pseudo-code in Exercise 1.3 or Exercise 1.4 is run when the number of cores is not a power of two (e.g., 3, 5, 6, 7)? Can you modify the pseudo-code so that it will work correctly regardless of the number of cores? Get solution

1.6 Derive formulas for the number of receives and additions that core 0 carries out using a. the original pseudo-code for a global sum, and b. the tree-structured global sum. Make a table showing the numbers of receives and additions carried out by core 0 when the two sums are used with 2, 4, 8, ..., 1024 cores. Get solution

1.7 The first part of the global sum example—when each core adds its assigned computed values—is usually considered to be an example of data-parallelism, while the second part of the first global sum—when the cores send their partial sums to the master core, which adds them—could be considered to be an example of task-parallelism. What about the second part of the second global sum—when the cores use a tree structure to add their partial sums? Is this an example of data- or task-parallelism? Why? Get solution

1.8 Suppose the faculty are going to have a party for the students in the department. a. Identify tasks that can be assigned to the faculty members that will allow them to use task-parallelism when they prepare for the party. Work out a schedule that shows when the various tasks can be performed. b. We might hope that one of the tasks in the preceding part is cleaning the house where the party will be held. How can we use data-parallelism to partition the work of cleaning the house among the faculty? c. Use a combination of task- and data-parallelism to prepare for the party. (If there’s too much work for the faculty, you can use TAs to pick up the slack.) Get solution

1.9 Write an essay describing a research problem in your major that would benefit from the use of parallel computing. Provide a rough outline of how parallelism would be used. Would you use task- or data-parallelism? Get solution