Solutions: An Introduction to Parallel Programming - Pacheco - Chapter 4

4.1. When we discussed matrix-vector multiplication we assumed that both m and n, the number of rows and the number of columns, respectively, were evenly divisible by t, the number of threads. How do the formulas for the assignments change if this is not the case? Get solution

4.2. If we decide to physically divide a data structure among the threads, that is, if we decide to make various members local to individual threads, we need to consider at least three issues:
a. How are the members of the data structure used by the individual threads?
b. Where and how is the data structure initialized?
c. Where and how is the data structure used after its members are computed?
We briefly looked at the first issue in the matrix-vector multiplication function. We saw that the entire vector x was used by all of the threads, so it seemed pretty clear that it should be shared. However, for both the matrix A and the product vector y, just looking at (a) seemed to suggest that A and y should have their components distributed among the threads. Let’s take a closer look at this. What would we have to do in order to divide A and y among the threads? Dividing y wouldn’t be difficult: each thread could allocate a block of memory that could be used for storing its assigned components. Presumably, we could do the same for A: each thread could allocate a block of memory for storing its assigned rows. Modify the matrix-vector multiplication program so that it distributes both of these data structures (a sketch follows). Can you “schedule” the input and output so that the threads can read in A and print out y? How does distributing A and y affect the run-time of the matrix-vector multiplication? (Don’t include input or output in your run-time.) Get solution
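A possible starting point for the distribution (a sketch, not the book’s code; Read_rows and Print_block are hypothetical helpers for the scheduled input and output, and m, n, thread_count, and x are the shared globals of the original program):

   #include <stdlib.h>
   #include <pthread.h>

   extern int m, n, thread_count;     /* matrix dimensions, thread count */
   extern double* x;                  /* input vector: stays shared      */

   void Read_rows(double* A, int local_m, int n, long rank);   /* hypothetical */
   void Print_block(double* y, int local_m, long rank);        /* hypothetical */

   void* Pth_mat_vect(void* rank) {
      long my_rank = (long) rank;
      int local_m = m / thread_count;           /* rows owned by this thread */
      double* local_A = malloc(local_m * n * sizeof(double));
      double* local_y = malloc(local_m * sizeof(double));

      Read_rows(local_A, local_m, n, my_rank);  /* threads must take turns */
      for (int i = 0; i < local_m; i++) {
         local_y[i] = 0.0;
         for (int j = 0; j < n; j++)
            local_y[i] += local_A[i*n + j] * x[j];
      }
      Print_block(local_y, local_m, my_rank);   /* again, in rank order */
      free(local_A);
      free(local_y);
      return NULL;
   }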

4.3. Recall that the compiler is unaware that an ordinary C program is multithreaded, and as a consequence, it may make optimizations that can interfere with busy-waiting. (Note that compiler optimizations should not affect mutexes, condition variables, or semaphores.) An alternative to completely turning off compiler optimizations is to identify some shared variables with the C keyword volatile. This tells the compiler that these variables may be updated by multiple threads and, as a consequence, it shouldn’t apply optimizations to statements involving them. As an example, recall our busy-wait solution to the race condition when multiple threads attempt to add a private variable into a shared variable:
   /* x and flag are shared, y is private            */
   /* x and flag are initialized to 0 by main thread */
   y = Compute(my_rank);
   while (flag != my_rank);
   x = x + y;
   flag++;

It’s impossible to tell by looking at this code that the order of the while statement and the x = x + y statement is important; if this code were single-threaded, the order of these two statements wouldn’t affect the outcome of the code. But if the compiler determined that it could improve register usage by interchanging the order of these two statements, the resulting code would be erroneous. If, instead of defining
   int flag;
   int x;

we define

   int volatile flag;
   int volatile x;
then the compiler will know that both x and flag can be updated by other threads, so it shouldn’t try reordering the statements. With the gcc compiler, the default behavior is no optimization. You can make certain of this by adding the option -O0 to the command line. Try running the pi calculation program that uses busy-waiting (pth_pi_busy.c) without optimization. How does the result of the multithreaded calculation compare to the single-threaded calculation? Now try running it with optimization; if you’re using gcc, replace the -O0 option with -O2. If you found an error, how many threads did you use? Which variables should be made volatile in the pi calculation? Change these variables so that they’re volatile and rerun the program with and without optimization. How do the results compare to the single-threaded program? Get solution

4.4. The performance of the pi calculation program that uses mutexes remains roughly constant once we increase the number of threads beyond the number of available CPUs. What does this suggest about how the threads are scheduled on the available processors? Get solution

4.5. Modify the mutex version of the pi calculation program so that the critical section is in the for loop. How does the performance of this version compare to the performance of the original busy-wait version? How might we explain this? Get solution
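The change amounts to something like the following (a sketch; sum, mutex, factor, and the loop bounds my_first_i and my_last_i are the shared and private objects of the book’s mutex version):

   /* critical section now executes once per iteration */
   for (i = my_first_i; i < my_last_i; i++, factor = -factor) {
      pthread_mutex_lock(&mutex);
      sum += factor/(2*i + 1);
      pthread_mutex_unlock(&mutex);
   }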

4.6. Modify the mutex version of the pi calculation program so that it uses a semaphore instead of a mutex. How does the performance of this version compare with the mutex version? Get solution
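The substitution itself is mechanical (a sketch; sum and my_sum are from the mutex version, and the semaphore is initialized to 1 so that it acts as a lock):

   #include <semaphore.h>

   sem_t sem;                    /* in main: sem_init(&sem, 0, 1); */

   sem_wait(&sem);               /* replaces pthread_mutex_lock    */
   sum += my_sum;
   sem_post(&sem);               /* replaces pthread_mutex_unlock  */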

4.7. Although producer-consumer synchronization is easy to implement with semaphores, it’s also possible to implement it with mutexes. The basic idea is to have the producer and the consumer share a mutex. A flag variable that’s initialized to false by the main thread indicates whether there’s anything to consume. With two threads we’d execute something like this:
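(A sketch of the idea; mutex and message_available are illustrative names, with message_available set to false by the main thread.)

   /* producer thread */
   pthread_mutex_lock(&mutex);
   /* create message */
   message_available = 1;
   pthread_mutex_unlock(&mutex);

   /* consumer thread */
   while (1) {
      pthread_mutex_lock(&mutex);
      if (message_available) {
         /* consume message */
         pthread_mutex_unlock(&mutex);
         break;
      }
      pthread_mutex_unlock(&mutex);   /* nothing yet: release and retry */
   }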

So if the consumer gets into the loop first, it will see there’s no message available and return to the call to pthread_mutex_lock. It will continue this process until the producer creates the message. Write a Pthreads program that implements this version of producer-consumer synchronization with two threads. Can you generalize this so that it works with 2k threads: odd-ranked threads are consumers and even-ranked threads are producers? Can you generalize this so that each thread is both a producer and a consumer? For example, suppose that thread q “sends” a message to thread (q+1) mod t and “receives” a message from thread (q-1+t) mod t? Does this use busy-waiting? Get solution

4.8. If a program uses more than one mutex, and the mutexes can be acquired in different orders, the program can deadlock. That is, threads may block forever waiting to acquire one of the mutexes. As an example, suppose that a program has two shared data structures (for example, two arrays or two linked lists), each of which has an associated mutex. Further suppose that each data structure can be accessed (read or modified) after acquiring the data structure’s associated mutex.
a. Suppose the program is run with two threads. Further suppose that the following sequence of events occurs:
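A sequence of this kind (mut0 and mut1 are illustrative names for the two data structures’ mutexes) is:

   Time   Thread 0                      Thread 1
   0      pthread_mutex_lock(&mut0)     pthread_mutex_lock(&mut1)
   1      pthread_mutex_lock(&mut1)     pthread_mutex_lock(&mut0)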


What happens?
b. Would this be a problem if the program used busy-waiting (with two flag variables) instead of mutexes?
c. Would this be a problem if the program used semaphores instead of mutexes? Get solution

4.9. Some implementations of Pthreads define barriers. The function
   int pthread_barrier_init(
         pthread_barrier_t*            barrier_p  /* out */,
         const pthread_barrierattr_t*  attr_p     /* in  */,
         unsigned                      count      /* in  */);

initializes a barrier object, barrier_p. As usual, we’ll ignore the second argument and just pass in NULL. The last argument indicates the number of threads that must reach the barrier before they can continue. The barrier itself is a call to the function

   int pthread_barrier_wait(
         pthread_barrier_t*  barrier_p  /* in/out */);

As with most other Pthreads objects, there is a destroy function
   int pthread_barrier_destroy(
         pthread_barrier_t*  barrier_p  /* in/out */);
Modify one of the barrier programs from the book’s website so that it uses a Pthreads barrier. Find a system with a Pthreads implementation that includes barriers and run your program with various numbers of threads. How does its performance compare to the other implementations? Get solution
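Typical usage follows the usual init/use/destroy pattern (a sketch; thread_count and the phases of work are illustrative):

   pthread_barrier_t barrier;                           /* shared */

   /* main thread, before starting the workers: */
   pthread_barrier_init(&barrier, NULL, thread_count);

   /* each worker thread: */
   /* ... do this phase's work ... */
   pthread_barrier_wait(&barrier);   /* no thread continues until all arrive */
   /* ... next phase ... */

   /* main thread, after joining the workers: */
   pthread_barrier_destroy(&barrier);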

4.10. Modify one of the programs you wrote in the Programming Assignments that follow so that it uses the scheme outlined in Section 4.8 to time itself. In order to get the time that has elapsed since some point in the past, you can use the macro GET_TIME defined in the header file timer.h on the book’s website. Note that this will give wall clock time, not CPU time. Also note that since it’s a macro, it can operate directly on its argument. For example, to implement

   Store current time in my_start;

you would use

   GET_TIME(my_start);

not

   GET_TIME(&my_start);

How will you implement the barrier? How will you implement the following pseudo code?

   elapsed = Maximum of my_elapsed values;

Get solution
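One way to realize that pseudocode (a sketch assuming a shared double elapsed initialized to 0, a shared mutex, and whatever barrier implementation you chose, so that all threads start timing together):

   double my_start, my_finish, my_elapsed;

   Barrier();                        /* your barrier: threads start together */
   GET_TIME(my_start);
   /* ... timed work ... */
   GET_TIME(my_finish);
   my_elapsed = my_finish - my_start;

   pthread_mutex_lock(&mutex);       /* elapsed = max of my_elapsed values */
   if (my_elapsed > elapsed) elapsed = my_elapsed;
   pthread_mutex_unlock(&mutex);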

4.11. Give an example of a linked list and a sequence of memory accesses to the linked list in which the following pairs of operations can potentially result in problems:
a. Two deletes executed simultaneously
b. An insert and a delete executed simultaneously
c. A member and a delete executed simultaneously
d. Two inserts executed simultaneously
e. An insert and a member executed simultaneously
Get solution

4.12. The linked list operations Insert and Delete consist of two distinct “phases.” In the first phase, both operations search the list for either the position of the new node or the position of the node to be deleted. If the outcome of the first phase so indicates, in the second phase a new node is inserted or an existing node is deleted. In fact, it’s quite common for linked list programs to split each of these operations into two function calls. For both operations, the first phase involves only read-access to the list; only the second phase modifies the list. Would it be safe to lock the list using a read-lock for the first phase? And then to lock the list using a write-lock for the second phase? Explain your answer. Get solution
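The pattern under discussion looks roughly like this (a sketch; rwlock is a Pthreads read-write lock and Find_position is a hypothetical search function; the question hinges on the gap between releasing the read-lock and acquiring the write-lock):

   pthread_rwlock_rdlock(&rwlock);      /* phase 1: read-only search */
   pos = Find_position(head, value);
   pthread_rwlock_unlock(&rwlock);

   /* the list can be modified by other threads here */

   pthread_rwlock_wrlock(&rwlock);      /* phase 2: modify the list  */
   /* insert or delete at pos */
   pthread_rwlock_unlock(&rwlock);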

4.13. Download the various threaded linked list programs from the website. In our examples, we ran a fixed percentage of searches and split the remaining percentage among inserts and deletes.
a. Rerun the experiments with all searches and all inserts.
b. Rerun the experiments with all searches and all deletes.
Is there a difference in the overall run-times? Is insert or delete more expensive? Get solution

4.14. Recall that in C a function that takes a two-dimensional array argument must specify the number of columns in the argument list. Thus it is quite common for C programmers to only use one-dimensional arrays, and to write explicit code for converting pairs of subscripts into a single dimension. Modify the Pthreads matrix-vector multiplication so that it uses a one-dimensional array for the matrix and calls a matrix-vector multiplication function. How does this change affect the run-time? Get solution
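The subscript conversion is the usual row-major one (a sketch; the function name and signature are illustrative):

   /* A is m x n, stored row-major in a 1-D array: A[i][j] becomes A[i*n + j] */
   void Mat_vect(double A[], double x[], double y[], int m, int n) {
      for (int i = 0; i < m; i++) {
         y[i] = 0.0;
         for (int j = 0; j < n; j++)
            y[i] += A[i*n + j] * x[j];
      }
   }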

4.15. Download the source file pth_mat_vect_rand_split.c from the book’s website. Find a program that does cache profiling (for example, Valgrind [49]) and compile the program according to the instructions in the cache profiler documentation. (With Valgrind you will want a symbol table and full optimization: gcc -g -O2 ....) Now run the program according to the instructions in the cache profiler documentation, using input k × (k·10^6), (k·10^3) × (k·10^3), and (k·10^6) × k. Choose k so large that the number of level 2 cache misses is of the order 10^6 for at least one of the input sets of data.
a. How many level 1 cache write-misses occur with each of the three inputs?
b. How many level 2 cache write-misses occur with each of the three inputs?
c. Where do most of the write-misses occur? For which input data does the program have the most write-misses? Can you explain why?
d. How many level 1 cache read-misses occur with each of the three inputs?
e. How many level 2 cache read-misses occur with each of the three inputs?
f. Where do most of the read-misses occur? For which input data does the program have the most read-misses? Can you explain why?
g. Run the program with each of the three inputs, but without using the cache profiler. With which input is the program the fastest? With which input is the program the slowest? Can your observations about cache misses help explain the differences? How?
Get solution

4.16. Recall the matrix-vector multiplication example with the 8000 × 8000 input. Suppose that the program is run with four threads, and thread 0 and thread 2 are assigned to different processors. If a cache line contains 64 bytes or eight doubles, is it possible for false sharing between threads 0 and 2 to occur for any part of the vector y? Why? What about if thread 0 and thread 3 are assigned to different processors: is it possible for false sharing to occur between them for any part of y? Get solution

4.17. Recall the matrix-vector multiplication example with an 8 × 8,000,000 matrix. Suppose that doubles use 8 bytes of memory and that a cache line is 64 bytes. Also suppose that our system consists of two dual-core processors.
a. What is the minimum number of cache lines that are needed to store the vector y?
b. What is the maximum number of cache lines that are needed to store the vector y?
c. If the boundaries of cache lines always coincide with the boundaries of 8-byte doubles, in how many different ways can the components of y be assigned to cache lines?
d. If we only consider which pairs of threads share a processor, in how many different ways can four threads be assigned to the processors in our computer? Here we’re assuming that cores on the same processor share a cache.
e. Is there an assignment of components to cache lines and threads to processors that will result in no false sharing in our example? In other words, is it possible that the threads assigned to one processor will have their components of y in one cache line, and the threads assigned to the other processor will have their components in a different cache line?
f. How many assignments of components to cache lines and threads to processors are there?
g. Of these assignments, how many will result in no false sharing?
Get solution

4.18. a. Modify the matrix-vector multiplication program so that it pads the vector y when there’s a possibility of false sharing. The padding should be done so that if the threads execute in lock-step, there’s no possibility that a single cache line containing an element of y will be shared by two or more threads. Suppose, for example, that a cache line stores eight doubles and we run the program with four threads. If we allocate storage for at least 48 doubles in y, then, on each pass through the for i loop, there’s no possibility that two threads will simultaneously access the same cache line (see the sketch after this exercise).
b. Modify the matrix-vector multiplication so that each thread uses private storage for its part of y during the for i loop. When a thread is done computing its part of y, it should copy its private storage into the shared variable.
c. How does the performance of these two alternatives compare to the original program? How do they compare to each other? Get solution
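A sketch of the padded layout for part (a), assuming four threads and eight doubles per cache line (BLOCK is an illustrative name):

   /* Pad each thread's block of y to 12 doubles (4*12 = 48), so elements */
   /* accessed on the same pass through the for i loop are at least 12    */
   /* doubles apart and therefore never share an 8-double cache line.     */
   #define BLOCK 12
   double y[4*BLOCK];

   /* thread my_rank writes y[my_rank*BLOCK + i] on pass i of the loop */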

4.19. Although strtok_r is thread-safe, it has the rather unfortunate property that it gratuitously modifies the input string. Write a tokenizer that is thread-safe and doesn’t modify the input string. Get solution
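One possible interface (a sketch; My_tokenize is a hypothetical name; it copies each token out rather than writing '\0' into the input, and keeps its position in a caller-supplied pointer, so it is thread-safe and nondestructive):

   #include <string.h>

   /* Copy the next whitespace-delimited token of line into token (at most */
   /* size-1 chars) and advance *pos past it. Returns token, or NULL when  */
   /* no tokens remain. The input string is never modified.                */
   char* My_tokenize(const char* line, const char** pos,
                     char* token, size_t size) {
      const char* p = (*pos == NULL) ? line : *pos;
      p += strspn(p, " \t\n");              /* skip leading separators */
      if (*p == '\0') return NULL;
      size_t len = strcspn(p, " \t\n");     /* length of this token    */
      if (len >= size) len = size - 1;
      memcpy(token, p, len);
      token[len] = '\0';
      *pos = p + len;
      return token;
   }

   /* usage:  const char* pos = NULL;  char tok[64];
              while (My_tokenize(line, &pos, tok, sizeof tok) != NULL) ... */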

 PROGRAMMING ASSIGNMENTS

4.1. Write a Pthreads program that implements the histogram program in Chapter 2. Get solution

4.2. Suppose we toss darts randomly at a square dartboard, whose bullseye is at the origin, and whose sides are 2 feet in length. Suppose also that there’s a circle inscribed in the square dartboard. The radius of the circle is 1 foot, and its area is pi square feet. If the points that are hit by the darts are uniformly distributed (and we always hit the square), then the number of darts that hit inside the circle should approximately satisfy the equation number in circle/total number of tosses = pi/4, since the ratio of the area of the circle to the area of the square is pi/4. We can use this formula to estimate the value of pi with a random number generator:
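In outline, the tosses look like this (a sketch; random_double is a hypothetical generator of uniform doubles in [-1, 1]):

   number_in_circle = 0;
   for (toss = 0; toss < number_of_tosses; toss++) {
      x = random_double(-1, 1);
      y = random_double(-1, 1);
      distance_squared = x*x + y*y;
      if (distance_squared <= 1) number_in_circle++;
   }
   pi_estimate = 4*number_in_circle/((double) number_of_tosses);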

This is called a “Monte Carlo” method, since it uses randomness (the dart tosses). Write a Pthreads program that uses a Monte Carlo method to estimate pi. The main thread should read in the total number of tosses and print the estimate. You may want to use long long ints for the number of hits in the circle and the number of tosses, since both may have to be very large to get a reasonable estimate of pi. Get solution

4.3. Write a Pthreads program that implements the trapezoidal rule. Use a shared variable for the sum of all the threads’ computations. Use busy-waiting, mutexes, and semaphores to enforce mutual exclusion in the critical section. What advantages and disadvantages do you see with each approach? Get solution
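For the mutex variant, the critical section is just the final addition (a sketch; Trap is a hypothetical serial trapezoidal-rule function applied to this thread’s subinterval, and total is the shared sum):

   double my_sum = Trap(local_a, local_b, local_n, h);   /* local work */

   pthread_mutex_lock(&mutex);
   total += my_sum;                             /* critical section */
   pthread_mutex_unlock(&mutex);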

4.4. Write a Pthreads program that finds the average time required by your system to create and terminate a thread. Does the number of threads affect the average time? If so, how? Get solution
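A sketch of the measurement loop (GET_TIME is the timer.h macro from Exercise 4.10; Empty and iters are illustrative):

   void* Empty(void* arg) { return NULL; }   /* thread starts and exits */

   double start, finish;
   pthread_t thread;
   GET_TIME(start);
   for (int i = 0; i < iters; i++) {
      pthread_create(&thread, NULL, Empty, NULL);
      pthread_join(thread, NULL);
   }
   GET_TIME(finish);
   printf("average create+terminate: %e seconds\n", (finish - start)/iters);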

4.5. Write a Pthreads program that implements a “task queue.” The main thread begins by starting a user-specified number of threads that immediately go to sleep in a condition wait. The main thread generates blocks of tasks to be carried out by the other threads; each time it generates a new block of tasks, it awakens a thread with a condition signal. When a thread finishes executing its block of tasks, it should return to a condition wait. When the main thread completes generating tasks, it sets a global variable indicating that there will be no more tasks, and awakens all the threads with a condition broadcast. For the sake of explicitness, make your tasks linked list operations. Get solution
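The core of each worker is the standard condition-wait loop (a sketch; task_available, no_more_tasks, Dequeue, and Do_task are illustrative):

   pthread_mutex_lock(&mutex);
   while (!task_available && !no_more_tasks)
      pthread_cond_wait(&cond, &mutex);   /* releases mutex while waiting  */
   if (task_available) {
      task = Dequeue();                   /* may clear task_available      */
      pthread_mutex_unlock(&mutex);
      Do_task(task);                      /* e.g., a linked list operation */
   } else {
      pthread_mutex_unlock(&mutex);       /* no more tasks: thread exits   */
   }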

4.6. Write a Pthreads program that uses two condition variables and a mutex to implement a read-write lock. Download the online linked list program that uses Pthreads read-write locks, and modify it to use your read-write locks. Now compare the performance of the program when readers are given preference with the program when writers are given preference. Can you make any generalizations? Get solution
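A sketch of the state such a lock needs, with a writer-preference read-lock shown (names are illustrative; the write-lock and unlock functions maintain the same counts and signal the appropriate condition variable):

   typedef struct {
      pthread_mutex_t mutex;
      pthread_cond_t  readers_ok;    /* signaled when readers may proceed */
      pthread_cond_t  writer_ok;     /* signaled when one writer may go   */
      int readers;                   /* number of active readers          */
      int writer;                    /* 1 if a writer holds the lock      */
      int waiting_writers;           /* gives writers preference          */
   } my_rwlock_t;

   void My_rdlock(my_rwlock_t* rw) {
      pthread_mutex_lock(&rw->mutex);
      while (rw->writer || rw->waiting_writers > 0)   /* writer preference */
         pthread_cond_wait(&rw->readers_ok, &rw->mutex);
      rw->readers++;
      pthread_mutex_unlock(&rw->mutex);
   }

Reader preference is obtained by dropping the waiting_writers test from the while condition.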