4.1. When we discussed matrix-vector multiplication we assumed that both m and
n, the number of rows and the number of columns, respectively, were evenly
divisible by t, the number of threads. How do the formulas for the assignments
change if this is not the case?
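For reference, one common answer uses a block partition in which the first m % t threads get one extra row. A minimal sketch, with variable names in the style of the book's programs (my_rank, thread_count):

   int quotient  = m / thread_count;
   int remainder = m % thread_count;
   int my_m, my_first_row;
   if (my_rank < remainder) {          /* first `remainder` threads get one extra row */
      my_m = quotient + 1;
      my_first_row = my_rank*my_m;
   } else {
      my_m = quotient;
      my_first_row = my_rank*my_m + remainder;
   }
   int my_last_row = my_first_row + my_m;   /* one past my last row */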
4.2. If we decide to physically divide a data structure among the threads, that is,
if we decide to make various members local to individual threads, we need to
consider at least three issues:
a. How are the members of the data structure used by the individual threads?
b. Where and how is the data structure initialized?
c. Where and how is the data structure used after its members are computed?
We briefly looked at the first issue in the matrix-vector multiplication function.
We saw that the entire vector x was used by all of the threads, so it seemed
pretty clear that it should be shared. However, for both the matrix A and the
product vector y, just looking at (a) seemed to suggest that A and y should
have their components distributed among the threads. Let’s take a closer look
at this.
What would we have to do in order to divide A and y among the threads?
Dividing y wouldn’t be difficult–each thread could allocate a block of memory
that could be used for storing its assigned components. Presumably, we could
do the same for A–each thread could allocate a block of memory for storing
its assigned rows. Modify the matrix-vector multiplication program so that
it distributes both of these data structures. Can you “schedule” the input and
output so that the threads can read in A and print out y? How does distributing
A and y affect the run-time of the matrix-vector multiplication? (Don’t include
input or output in your run-time.)
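A possible starting point for the distributed version, assuming the block sizes from Exercise 4.1 and illustrative names: each thread allocates and computes only its own rows of A and its own block of y, while x remains shared:

   double* my_A = malloc(my_m * n * sizeof(double));   /* my rows of A, row-major */
   double* my_y = malloc(my_m * sizeof(double));       /* my block of y */
   /* each thread reads in (or generates) its own rows of my_A */
   for (int i = 0; i < my_m; i++) {
      my_y[i] = 0.0;
      for (int j = 0; j < n; j++)
         my_y[i] += my_A[i*n + j] * x[j];              /* x is still shared */
   }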
4.3. Recall that the compiler is unaware that an ordinary C program is multithreaded,
and as a consequence, it may make optimizations that can interfere
with busy-waiting. (Note that compiler optimizations should not affect
mutexes, condition variables, or semaphores.) An alternative to completely
turning off compiler optimizations is to identify some shared variables with
the C keyword volatile. This tells the compiler that these variables may be
updated by multiple threads and, as a consequence, it shouldn’t apply optimizations
to statements involving them. As an example, recall our busy-wait
solution to the race condition when multiple threads attempt to add a private
variable into a shared variable:
/* x and flag are shared, y is private */
/* x and flag are initialized to 0 by main thread */
y = Compute(my_rank);
while (flag != my_rank);
x = x + y;
flag++;
It’s impossible to tell by looking at this code that the order of the while
statement and the x = x + y statement is important; if this code were
single-threaded, the order of these two statements wouldn’t affect the outcome of the
code. But if the compiler determined that it could improve register usage by
interchanging the order of these two statements, the resulting code would be
erroneous.
If, instead of defining
int flag;
int x;
we define
int volatile flag;
int volatile x;
then the compiler will know that both x and flag can be updated by other
threads, so it shouldn’t try reordering the statements.
With the gcc compiler, the default behavior is no optimization. You can
make certain of this by adding the option -O0 to the command line. Try
running the pi calculation program that uses busy-waiting (pth_pi_busy.c)
without optimization. How does the result of the multithreaded calculation
compare to the single-threaded calculation? Now try running it with optimization;
if you’re using gcc, replace the -O0 option with -O2. If you found an
error, how many threads did you use?
Which variables should be made volatile in the pi calculation? Change
these variables so that they’re volatile and rerun the program with and without
optimization. How do the results compare to the single-threaded program?
4.4. The performance of the pi calculation program that uses mutexes remains
roughly constant once we increase the number of threads beyond the number
of available CPUs. What does this suggest about how the threads are
scheduled on the available processors?
4.5. Modify the mutex version of the pi calculation program so that the critical
section is in the for loop. How does the performance of this version compare
to the performance of the original busy-wait version? How might we explain
this?
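For reference, a sketch of the modified loop, with names in the style of the book's pi programs (factor alternates sign; sum is the shared variable; all names illustrative):

   for (i = my_first_i; i < my_last_i; i++, factor = -factor) {
      pthread_mutex_lock(&mutex);
      sum += factor/(2*i + 1);     /* the critical section now runs every iteration */
      pthread_mutex_unlock(&mutex);
   }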
4.6. Modify the mutex version of the pi calculation program so that it uses a
semaphore instead of a mutex. How does the performance of this version
compare with the mutex version?
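A sketch of the change, using an unnamed POSIX semaphore initialized to 1 so that it behaves like a mutex (names illustrative):

   #include <semaphore.h>
   sem_t sem;                 /* shared; main calls sem_init(&sem, 0, 1); */
   /* in each thread, the critical section becomes: */
   sem_wait(&sem);            /* "lock" */
   sum += my_sum;
   sem_post(&sem);            /* "unlock" */
   /* main, at the end: sem_destroy(&sem); */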
4.7. Although producer-consumer synchronization is easy to implement with
semaphores, it’s also possible to implement it with mutexes. The basic idea is
to have the producer and the consumer share a mutex. A flag variable that’s
initialized to false by the main thread indicates whether there’s anything to
consume. With two threads we’d execute something like this:
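A minimal sketch of the idea, assuming a shared mutex and a shared flag message_available, initialized to 0 (false) by the main thread (names illustrative):

   /* producer */
   /* ... create the message ... */
   pthread_mutex_lock(&mutex);
   message_available = 1;
   pthread_mutex_unlock(&mutex);

   /* consumer */
   while (1) {
      pthread_mutex_lock(&mutex);
      if (message_available) {
         /* ... consume (e.g., print) the message ... */
         pthread_mutex_unlock(&mutex);
         break;
      }
      pthread_mutex_unlock(&mutex);
   }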
So if the consumer gets into the loop first, it will see there’s no message
available and return to the call to pthread_mutex_lock. It will continue
this process until the producer creates the message. Write a Pthreads program
that implements this version of producer-consumer synchronization
with two threads. Can you generalize this so that it works with 2k threads–
odd-ranked threads are consumers and even-ranked threads are producers?
Can you generalize this so that each thread is both a producer and a consumer?
For example, suppose that thread q “sends” a message to thread
(q+1) mod t and “receives” a message from thread (q-1+t) mod t? Does
this use busy-waiting?
4.8. If a program uses more than one mutex, and the mutexes can be acquired
in different orders, the program can deadlock. That is, threads may block
forever waiting to acquire one of the mutexes. As an example, suppose that a
program has two shared data structures–for example, two arrays or two linked
lists–each of which has an associated mutex. Further suppose that each data
structure can be accessed (read or modified) after acquiring the data structure’s
associated mutex.
a. Suppose the program is run with two threads. Further suppose that the
following sequence of events occurs: first, thread 0 acquires the mutex for
the first data structure while thread 1 acquires the mutex for the second;
then thread 0 attempts to acquire the mutex for the second data structure,
while thread 1 attempts to acquire the mutex for the first.
What happens?
b. Would this be a problem if the program used busy-waiting (with two flag
variables) instead of mutexes?
c. Would this be a problem if the program used semaphores instead of
mutexes?
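For reference, a minimal sketch of the interleaving in part (a), with illustrative mutex names:

   /* thread 0 */
   pthread_mutex_lock(&mutex_a);
   pthread_mutex_lock(&mutex_b);   /* thread 1 already holds mutex_b */

   /* thread 1 */
   pthread_mutex_lock(&mutex_b);
   pthread_mutex_lock(&mutex_a);   /* thread 0 already holds mutex_a */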
4.9. Some implementations of Pthreads define barriers. The function
int pthread_barrier_init(
      pthread_barrier_t*           barrier_p  /* out */,
      const pthread_barrierattr_t* attr_p     /* in  */,
      unsigned                     count      /* in  */);
initializes a barrier object, barrier_p. As usual, we’ll ignore the second argument
and just pass in NULL. The last argument indicates the number of threads
that must reach the barrier before they can continue. The barrier itself is a call
to the function
int pthread_barrier_wait(
      pthread_barrier_t* barrier_p /* in/out */);
As with most other Pthreads objects, there is a destroy function
int pthread_barrier_destroy(
      pthread_barrier_t* barrier_p /* in/out */);
Modify one of the barrier programs from the book’s website so that it uses a
Pthreads barrier. Find a system with a Pthreads implementation that includes
barriers and run your program with various numbers of threads. How does its
performance compare to the other implementations?
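A minimal usage sketch of the barrier API (thread_count is illustrative):

   pthread_barrier_t barrier;                            /* shared */
   pthread_barrier_init(&barrier, NULL, thread_count);   /* in main */

   pthread_barrier_wait(&barrier);    /* in each thread: none proceed until all arrive */

   pthread_barrier_destroy(&barrier); /* in main, after joining the threads */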
4.10. Modify one of the programs you wrote in the Programming Assignments that
follow so that it uses the scheme outlined in Section 4.8 to time itself. In order
to get the time that has elapsed since some point in the past, you can use
the macro GET_TIME defined in the header file timer.h on the book’s website.
Note that this will give wall clock time, not CPU time. Also note that since it’s
a macro, it can operate directly on its argument. For example, to implement
Store current time in my_start;
you would use
GET_TIME(my_start);
not
GET_TIME(&my_start);
How will you implement the barrier? How will you implement the following
pseudo code?
elapsed = Maximum of my_elapsed values;
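One possible sketch, assuming a shared elapsed initialized to 0.0 and protected by a mutex, and a barrier so the threads start timing together (all names illustrative):

   double my_start, my_finish, my_elapsed;
   pthread_barrier_wait(&barrier);      /* or your own barrier implementation */
   GET_TIME(my_start);
   /* ... the work being timed ... */
   GET_TIME(my_finish);
   my_elapsed = my_finish - my_start;
   pthread_mutex_lock(&mutex);
   if (my_elapsed > elapsed) elapsed = my_elapsed;   /* maximum over all threads */
   pthread_mutex_unlock(&mutex);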
4.11. Give an example of a linked list and a sequence of memory accesses to the
linked list in which the following pairs of operations can potentially result in
problems:
a. Two deletes executed simultaneously
b. An insert and a delete executed simultaneously
c. A member and a delete executed simultaneously
d. Two inserts executed simultaneously
e. An insert and a member executed simultaneously
4.12. The linked list operations Insert and Delete consist of two distinct
“phases.” In the first phase, both operations search the list for either the position
of the new node or the position of the node to be deleted. If the outcome
of the first phase so indicates, in the second phase a new node is inserted or
an existing node is deleted. In fact, it’s quite common for linked list programs
to split each of these operations into two function calls. For both operations,
the first phase involves only read-access to the list; only the second phase
modifies the list. Would it be safe to lock the list using a read-lock for the
first phase? And then to lock the list using a write-lock for the second phase?
Explain your answer.
4.13. Download the various threaded linked list programs from the website. In
our examples, we ran a fixed percentage of searches and split the remaining
percentage among inserts and deletes.
a. Rerun the experiments with all searches and all inserts.
b. Rerun the experiments with all searches and all deletes.
Is there a difference in the overall run-times? Is insert or delete more
expensive?
4.14. Recall that in C a function that takes a two-dimensional array argument must
specify the number of columns in the argument list. Thus it is quite common
for C programmers to only use one-dimensional arrays, and to write explicit
code for converting pairs of subscripts into a single dimension. Modify the
Pthreads matrix-vector multiplication so that it uses a one-dimensional array
for the matrix and calls a matrix-vector multiplication function. How does this
change affect the run-time?
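A sketch of the flattened version, in which element (i, j) of the m x n matrix is stored in A[i*n + j] (names illustrative):

   void Mat_vect_mult(double A[], double x[], double y[], int m, int n) {
      for (int i = 0; i < m; i++) {
         y[i] = 0.0;
         for (int j = 0; j < n; j++)
            y[i] += A[i*n + j]*x[j];   /* A[i][j] in the two-dimensional version */
      }
   }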
4.15. Download the source file pth_mat_vect_rand_split.c from the book’s website.
Find a program that does cache profiling (for example, Valgrind [49])
and compile the program according to the instructions in the cache profiler
documentation. (With Valgrind you will want a symbol table and full
optimization: gcc -g -O2 ....) Now run the program according to the
instructions in the cache profiler documentation, using inputs k x (k*10^3),
(k*10^3) x (k*10^3), and (k*10^6) x k. Choose k so large that the number of
level 2 cache misses is of the order 10^6 for at least one of the input sets of
data.
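For example, with Valgrind's cachegrind tool a run might look like the following (program name and arguments are illustrative; cachegrind reports read- and write-miss counts for the level 1 and last-level caches):

   gcc -g -O2 -o pth_mat_vect pth_mat_vect_rand_split.c -lpthread
   valgrind --tool=cachegrind ./pth_mat_vect 4 8 8000000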
a. How many level 1 cache write-misses occur with each of the three inputs?
b. How many level 2 cache write-misses occur with each of the three inputs?
c. Where do most of the write-misses occur? For which input data does the
program have the most write-misses? Can you explain why?
d. How many level 1 cache read-misses occur with each of the three inputs?
e. How many level 2 cache read-misses occur with each of the three inputs?
f. Where do most of the read-misses occur? For which input data does the
program have the most read-misses? Can you explain why?
g. Run the program with each of the three inputs, but without using the cache
profiler. With which input is the program the fastest? With which input is
the program the slowest? Can your observations about cache misses help
explain the differences? How?
4.16. Recall the matrix-vector multiplication example with the 8000x8000 input.
Suppose that the program is run with four threads, and thread 0 and thread
2 are assigned to different processors. If a cache line contains 64 bytes or
eight doubles, is it possible for false sharing between threads 0 and 2 to
occur for any part of the vector y? Why? What about if thread 0 and thread
3 are assigned to different processors–is it possible for false sharing to occur
between them for any part of y?
4.17. Recall the matrix-vector multiplication example with an 8 x 8,000,000
matrix. Suppose that doubles use 8 bytes of memory and that a cache line is 64
bytes. Also suppose that our system consists of two dual-core processors.
a. What is the minimum number of cache lines that are needed to store the
vector y?
b. What is the maximum number of cache lines that are needed to store the
vector y?
c. If the boundaries of cache lines always coincide with the boundaries of
8-byte doubles, in how many different ways can the components of y be
assigned to cache lines?
d. If we only consider which pairs of threads share a processor, in how
many different ways can four threads be assigned to the processors in our
computer? Here we’re assuming that cores on the same processor share a
cache.
e. Is there an assignment of components to cache lines and threads to processors
that will result in no false sharing in our example? In other words,
is it possible that the threads assigned to one processor will have their
components of y in one cache line, and the threads assigned to the other
processor will have their components in a different cache line?
f. How many assignments of components to cache lines and threads to
processors are there?
g. Of these assignments, how many will result in no false sharing? Get solution
4.18. a. Modify the matrix-vector multiplication program so that it pads the vector
y when there’s a possibility of false sharing. The padding should be
done so that if the threads execute in lock-step, there’s no possibility that
a single cache line containing an element of y will be shared by two or
more threads. Suppose, for example, that a cache line stores eight doubles
and we run the program with four threads. If we allocate storage for at
least 48 doubles in y, then, on each pass through the for i loop, there’s no
possibility that two threads will simultaneously access the same cache line.
b. Modify the matrix-vector multiplication so that each thread uses private
storage for its part of y during the for i loop. When a thread is done
computing its part of y, it should copy its private storage into the shared
variable.
c. How does the performance of these two alternatives compare to the
original program? How do they compare to each other?
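For reference, a sketch of alternative (b), assuming the row-block distribution and illustrative names (my_first_row and my_m as in Exercise 4.1; memcpy is from <string.h>):

   double* my_y = malloc(my_m * sizeof(double));   /* private to this thread */
   for (int i = 0; i < my_m; i++) {
      my_y[i] = 0.0;
      for (int j = 0; j < n; j++)
         my_y[i] += A[(my_first_row + i)*n + j]*x[j];
   }
   /* one burst of writes to the shared vector at the very end */
   memcpy(y + my_first_row, my_y, my_m*sizeof(double));
   free(my_y);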
4.19. Although strtok_r is thread-safe, it has the rather unfortunate property that
it gratuitously modifies the input string. Write a tokenizer that is thread-safe
and doesn’t modify the input string.
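One possible interface for such a tokenizer: instead of writing '\0' into the string, return the start and length of each token and keep the scan position in a caller-supplied pointer (all names illustrative):

   #include <string.h>
   /* Returns a pointer to the next token, or NULL if there are none left;
      *len_p gets the token's length. *save_p holds the scan position between
      calls; initialize it to point at the start of the input string. */
   const char* Tokenize(const char** save_p, const char* delims, size_t* len_p) {
      const char* s = *save_p;
      s += strspn(s, delims);               /* skip leading delimiters */
      if (*s == '\0') { *save_p = s; return NULL; }
      size_t len = strcspn(s, delims);      /* token runs up to next delimiter */
      *save_p = s + len;                    /* resume scanning after the token */
      *len_p = len;
      return s;
   }

A caller might print the tokens of input with a loop such as: while ((tok = Tokenize(&save, " \t\n", &len)) != NULL) printf("%.*s\n", (int) len, tok);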
PROGRAMMING ASSIGNMENTS
4.1. Write a Pthreads program that implements the histogram program in Chapter 2.
4.2. Suppose we toss darts randomly at a square dartboard, whose bullseye is at the
origin, and whose sides are 2 feet in length. Suppose also that there’s a circle
inscribed in the square dartboard. The radius of the circle is 1 foot, and its area
is pi square feet. If the points that are hit by the darts are uniformly distributed
(and we always hit the square), then the number of darts that hit inside the circle
should approximately satisfy the equation
number in circle/total number of tosses = pi/4,
since the ratio of the area of the circle to the area of the square is pi/4.
We can use this formula to estimate the value of pi with a random number
generator:
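A sketch of the tosses loop, in the pseudocode style used elsewhere in the book (variable names illustrative):

   number_in_circle = 0;
   for (toss = 0; toss < number_of_tosses; toss++) {
      x = random double between -1 and 1;
      y = random double between -1 and 1;
      distance_squared = x*x + y*y;
      if (distance_squared <= 1) number_in_circle++;
   }
   pi_estimate = 4*number_in_circle/((double) number_of_tosses);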
This is called a “Monte Carlo” method, since it uses randomness (the dart
tosses).
Write a Pthreads program that uses a Monte Carlo method to estimate pi.
The main thread should read in the total number of tosses and print the estimate.
You may want to use long long ints for the number of hits in the circle and
the number of tosses, since both may have to be very large to get a reasonable
estimate of pi.
4.3. Write a Pthreads program that implements the trapezoidal rule. Use a shared
variable for the sum of all the threads’ computations. Use busy-waiting,
mutexes, and semaphores to enforce mutual exclusion in the critical section.
What advantages and disadvantages do you see with each approach?
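For instance, the mutex variant of the shared update might look like this, where local_a, local_b, and local_n describe the thread's subinterval and total is the shared sum (names illustrative):

   my_sum = (f(local_a) + f(local_b))/2.0;
   for (i = 1; i <= local_n - 1; i++)
      my_sum += f(local_a + i*h);
   my_sum *= h;                  /* trapezoidal rule on this thread's subinterval */
   pthread_mutex_lock(&mutex);
   total += my_sum;              /* total is the shared variable */
   pthread_mutex_unlock(&mutex);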
4.4. Write a Pthreads program that finds the average time required by your system
to create and terminate a thread. Does the number of threads affect the average
time? If so, how?
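One possible measurement, using the GET_TIME macro from the book's timer.h (Empty_func, count, and the timing variables are illustrative):

   void* Empty_func(void* arg) { return NULL; }

   /* in main: */
   GET_TIME(start);
   for (i = 0; i < count; i++) {
      pthread_create(&thread, NULL, Empty_func, NULL);
      pthread_join(thread, NULL);
   }
   GET_TIME(finish);
   printf("Average = %e seconds\n", (finish - start)/count);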
4.5. Write a Pthreads program that implements a “task queue.” The main thread
begins by starting a user-specified number of threads that immediately go to
sleep in a condition wait. The main thread generates blocks of tasks to be carried
out by the other threads; each time it generates a new block of tasks, it
awakens a thread with a condition signal. When a thread finishes executing
its block of tasks, it should return to a condition wait. When the main thread
completes generating tasks, it sets a global variable indicating that there will be
no more tasks, and awakens all the threads with a condition broadcast. For the
sake of explicitness, make your tasks linked list operations.
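A skeleton of the synchronization, assuming a shared mutex, a condition variable cond_var, and flags task_available and done (all names illustrative; note that a woken worker must recheck its predicate in a loop):

   /* worker thread */
   while (1) {
      pthread_mutex_lock(&mutex);
      while (!task_available && !done)
         pthread_cond_wait(&cond_var, &mutex);  /* atomically unlocks, sleeps, relocks */
      if (done && !task_available) {
         pthread_mutex_unlock(&mutex);
         break;                                 /* no more tasks will arrive */
      }
      task_available = 0;                       /* claim the block of tasks */
      pthread_mutex_unlock(&mutex);
      /* ... carry out the block of tasks (e.g., linked list operations) ... */
   }

   /* main thread, for each new block of tasks */
   pthread_mutex_lock(&mutex);
   task_available = 1;
   pthread_cond_signal(&cond_var);
   pthread_mutex_unlock(&mutex);

   /* main thread, when no more tasks will be generated */
   pthread_mutex_lock(&mutex);
   done = 1;
   pthread_cond_broadcast(&cond_var);
   pthread_mutex_unlock(&mutex);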
4.6. Write a Pthreads program that uses two condition variables and a mutex to
implement a read-write lock. Download the online linked list program that
uses Pthreads read-write locks, and modify it to use your read-write locks.
Now compare the performance of the program when readers are given preference
with the program when writers are given preference. Can you make any
generalizations?
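For reference, a minimal sketch of one reader-preference design (all names illustrative; initialize the fields with pthread_mutex_init and pthread_cond_init and set readers = writer = 0; a writer-preference version would instead make readers wait while writers are queued):

   #include <pthread.h>

   typedef struct {
      pthread_mutex_t mutex;
      pthread_cond_t  readers_ok;  /* signaled when readers may proceed */
      pthread_cond_t  writer_ok;   /* signaled when one writer may proceed */
      int readers;                 /* number of active readers */
      int writer;                  /* 1 if a writer is active */
   } my_rwlock_t;

   void My_read_lock(my_rwlock_t* lk) {
      pthread_mutex_lock(&lk->mutex);
      while (lk->writer)           /* readers wait only for an active writer */
         pthread_cond_wait(&lk->readers_ok, &lk->mutex);
      lk->readers++;
      pthread_mutex_unlock(&lk->mutex);
   }

   void My_read_unlock(my_rwlock_t* lk) {
      pthread_mutex_lock(&lk->mutex);
      if (--lk->readers == 0)
         pthread_cond_signal(&lk->writer_ok);
      pthread_mutex_unlock(&lk->mutex);
   }

   void My_write_lock(my_rwlock_t* lk) {
      pthread_mutex_lock(&lk->mutex);
      while (lk->writer || lk->readers > 0)
         pthread_cond_wait(&lk->writer_ok, &lk->mutex);
      lk->writer = 1;
      pthread_mutex_unlock(&lk->mutex);
   }

   void My_write_unlock(my_rwlock_t* lk) {
      pthread_mutex_lock(&lk->mutex);
      lk->writer = 0;
      pthread_cond_broadcast(&lk->readers_ok);  /* wake all waiting readers */
      pthread_cond_signal(&lk->writer_ok);      /* and one waiting writer */
      pthread_mutex_unlock(&lk->mutex);
   }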