2.1. When we were discussing floating point addition, we made the simplifying
assumption that each of the functional units took the same amount of time.
Suppose that fetch and store each take 2 nanoseconds and the remaining
operations each take 1 nanosecond.
a. How long does a floating point addition take with these assumptions?
b. How long will an unpipelined addition of 1000 pairs of floats take with
these assumptions?
c. How long will a pipelined addition of 1000 pairs of floats take with these
assumptions?
d. The time required for fetch and store may vary considerably if the operands/
results are stored in different levels of the memory hierarchy. Suppose that
a fetch from a level 1 cache takes two nanoseconds, while a fetch from a
level 2 cache takes five nanoseconds, and a fetch from main memory takes
fifty nanoseconds. What happens to the pipeline when there is a level 1
cache miss on a fetch of one of the operands? What happens when there
is a level 2 miss?
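(One way to organize the arithmetic, stated generally rather than as the answers: if an operation passes through k functional units with times t1, t2, ..., tk, then a single unpipelined operation takes t1 + t2 + ... + tk, n unpipelined operations take about n(t1 + ... + tk), and n pipelined operations take about (t1 + ... + tk) + (n - 1)*max(t1, ..., tk), since once the pipeline is full one result emerges per slowest-stage time.)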
2.2. Explain how a queue, implemented in hardware in the CPU, could be used to
improve the performance of a write-through cache.
2.3. Recall the example involving cache reads of a two-dimensional array
(page 22). How does a larger matrix and a larger cache affect the performance
of the two pairs of nested loops? What happens if MAX = 8 and the cache can
store four lines? How many misses occur in the reads of A in the first pair of
nested loops? How many misses occur in the second pair?
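For reference, the two pairs of nested loops discussed on page 22 have essentially the following form (a sketch; the names MAX, A, x, and y follow the text, but this is not meant as a verbatim copy):

double A[MAX][MAX], x[MAX], y[MAX];
/* ... A and x initialized, y set to 0 ... */

/* first pair: traverse A row by row */
for (i = 0; i < MAX; i++)
   for (j = 0; j < MAX; j++)
      y[i] += A[i][j]*x[j];

/* ... y reset to 0 ... */

/* second pair: traverse A column by column */
for (j = 0; j < MAX; j++)
   for (i = 0; i < MAX; i++)
      y[i] += A[i][j]*x[j];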
2.4. In Table 2.2, virtual addresses consist of a byte offset of 12 bits and a virtual
page number of 20 bits. How many pages can a program have if it’s run on a
system with this page size and this virtual address size?
2.5. Does the addition of cache and virtual memory to a von Neumann system
change its designation as an SISD system? What about the addition of
pipelining? Multiple issue? Hardware multithreading?
2.6. Suppose that a vector processor has a memory system in which it takes 10
cycles to load a single 64-bit word from memory. How many memory banks
are needed so that a stream of loads can, on average, require only one cycle
per load?
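(If it helps to picture the usual textbook model of interleaved memory, which this exercise assumes rather than any particular machine: with m banks, consecutive 64-bit words are assigned round-robin, so word i lives in bank i % m, and a bank that begins a load is busy for the full 10-cycle latency before it can start another.)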
2.7. Discuss the differences in how a GPU and a vector processor might execute
the following code:
sum = 0.0;
for (i = 0; i < n; i++) {
   y[i] += a*x[i];
   sum += z[i]*z[i];
}
2.8. Explain why the performance of a hardware multithreaded processing core
might degrade if it had large caches and it ran many threads.
2.9. In our discussion of parallel hardware, we used Flynn’s taxonomy to identify
three types of parallel systems: SISD, SIMD, and MIMD. None of our systems
were identified as multiple instruction, single data, or MISD. How would an
MISD system work? Give an example.
2.10. Suppose a program must execute 10^12 instructions in order to solve a particular
problem. Suppose further that a single processor system can solve the problem
in 10^6 seconds (about 11.6 days). So, on average, the single processor system
executes 10^6, or a million, instructions per second. Now suppose that the
program has been parallelized for execution on a distributed-memory system.
Suppose also that if the parallel program uses p processors, each processor will
execute 10^12/p instructions and each processor must send 10^9(p - 1) messages.
Finally, suppose that there is no additional overhead in executing the
parallel program. That is, the program will complete after each processor has
executed all of its instructions and sent all of its messages, and there won’t be
any delays due to things such as waiting for messages.
a. Suppose it takes 10^-9 seconds to send a message. How long will it take
the program to run with 1000 processors, if each processor is as fast as the
single processor on which the serial program was run?
b. Suppose it takes 10^-3 seconds to send a message. How long will it take the
program to run with 1000 processors?
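A small calculator program can take care of the arithmetic in both parts. The C sketch below simply evaluates the timing model stated above; the variable names and the per-instruction time of 10^-6 seconds (implied by the million-instructions-per-second rate) are assumptions of the sketch, not part of the exercise.

#include <stdio.h>

int main(void) {
   double instrs     = 1.0e12;  /* total instructions in the program       */
   double instr_time = 1.0e-6;  /* seconds per instruction (10^6 per sec)  */
   double msg_time   = 1.0e-9;  /* seconds per message (change for part b) */
   int    p          = 1000;    /* number of processors                    */

   double compute = (instrs/p)*instr_time;      /* 10^12/p instructions */
   double comm    = 1.0e9*(p - 1)*msg_time;     /* 10^9(p-1) messages   */
   printf("compute = %e s, comm = %e s, total = %e s\n",
          compute, comm, compute + comm);
   return 0;
}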
2.11. Derive formulas for the total number of links in the various distributed-memory
interconnects.
2.12. a. A planar mesh is just like a toroidal mesh, except that it doesn’t have
the “wraparound” links. What is the bisection width of a square planar
mesh?
b. A three-dimensional mesh is similar to a planar mesh, except that it also
has depth. What is the bisection width of a three-dimensional mesh?
2.13. a. Sketch a four-dimensional hypercube.
b. Use the inductive definition of a hypercube to explain why the bisection
width of a hypercube is p/2.
2.14. To define the bisection width for indirect networks, the processors are partitioned
into two groups so that each group has half the processors. Then,
links are removed from anywhere in the network so that the two groups are
no longer connected. The minimum number of links removed is the bisection
width. When we count links, if the diagram uses unidirectional links, two unidirectional
links count as one link. Show that an eight-by-eight crossbar has a
bisection width less than or equal to eight. Also show that an omega network
with eight processors has a bisection width less than or equal to four.
2.15. a. Suppose a shared-memory system uses snooping cache coherence and
write-back caches. Also suppose that core 0 has the variable x in its cache,
and it executes the assignment x = 5. Finally suppose that core 1 doesn’t
have x in its cache, and after core 0’s update to x, core 1 tries to execute
y = x. What value will be assigned to y? Why?
b. Suppose that the shared-memory system in the previous part uses a
directory-based protocol. What value will be assigned to y? Why?
c. Can you suggest how any problems you found in the first two parts might
be solved?
2.16. a. Suppose the run-time of a serial program is given by Tserial = n^2, where
the units of the run-time are in microseconds. Suppose that a parallelization
of this program has run-time Tparallel = n^2/p + log2(p). Write a
program that finds the speedups and efficiencies of this program for various
values of n and p. Run your program with n = 10, 20, 40, ..., 320, and
p = 1, 2, 4, ..., 128. What happens to the speedups and efficiencies as p
is increased and n is held fixed? What happens when p is fixed and n is
increased?
b. Suppose that Tparallel = Tserial/p + Toverhead. Also suppose that we fix p and
increase the problem size.
- Show that if Toverhead grows more slowly than Tserial, the parallel
efficiency will increase as we increase the problem size.
- Show that if, on the other hand, Toverhead grows faster than Tserial, the
parallel efficiency will decrease as we increase the problem size.
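One possible shape for the program asked for in part (a) is sketched below in C: it just tabulates the speedup S = Tserial/Tparallel and the efficiency E = S/p over the requested values of n and p (log2 from math.h is used for log2(p); the output format is, of course, arbitrary).

#include <stdio.h>
#include <math.h>

int main(void) {
   for (int n = 10; n <= 320; n *= 2) {
      for (int p = 1; p <= 128; p *= 2) {
         double t_serial   = (double) n*n;
         double t_parallel = (double) n*n/p + log2((double) p);
         double speedup    = t_serial/t_parallel;
         double efficiency = speedup/p;
         printf("n = %3d, p = %3d: speedup = %8.2f, efficiency = %5.3f\n",
                n, p, speedup, efficiency);
      }
   }
   return 0;
}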
2.17. A parallel program that obtains a speedup greater than p—the number of processes
or threads—is sometimes said to have superlinear speedup. However,
many authors don’t count programs that overcome “resource limitations” as
having superlinear speedup. For example, a program that must use secondary
storage for its data when it’s run on a single processor system might be able
to fit all its data into main memory when run on a large distributed-memory
system. Give another example of how a program might overcome a resource
limitation and obtain speedups greater than p.
2.18. Look at three programs you wrote in your Introduction to Computer Science
class. What (if any) parts of these programs are inherently serial? Does the
inherently serial part of the work done by the program decrease as the problem
size increases? Or does it remain roughly the same?
2.19. Suppose Tserial = n and Tparallel = n/p + log2(p), where times are in microseconds.
If we increase p by a factor of k, find a formula for how much we’ll
need to increase n in order to maintain constant efficiency. How much should
we increase n by if we double the number of processes from 8 to 16? Is the
parallel program scalable?
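(As a starting point, recall the chapter's definition of efficiency, E = Tserial/(p*Tparallel); with the formulas given here this works out to E = n/(n + p*log2(p)), which is the quantity to hold constant as p and n grow.)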
2.20. Is a program that obtains linear speedup strongly scalable? Explain your answer.
2.21. Bob has a program that he wants to time with two sets of data, input_data1
and input_data2. To get some idea of what to expect before adding timing
functions to the code he’s interested in, he runs the program with two sets of
data and the Unix shell command time:
$ time ./bobs_prog < input_data1
real 0m0.001s
user 0m0.001s
sys 0m0.000s
$ time ./bobs_prog < input_data2
real 1m1.234s
user 1m0.001s
sys 0m0.111s
The timer function Bob is using has millisecond resolution. Should Bob use
it to time his program with the first set of data? What about the second set of
data? Why or why not?
2.22. As we saw in the preceding problem, the Unix shell command time reports the
user time, the system time, and the “real” time or total elapsed time. Suppose
that Bob has defined the following functions that can be called in a C program:
double utime(void);
double stime(void);
double rtime(void);
The first returns the number of seconds of user time that have elapsed since the
program started execution, the second returns the number of system seconds,
and the third returns the total number of seconds. Roughly, user time is time
spent in the user code and library functions that don’t need to use the operating
system—for example, sin and cos. System time is time spent in functions that
do need to use the operating system—for example, printf and scanf.
a. What is the mathematical relation among the three function values? That
is, suppose the program contains the following code:
u = utime();
s = stime();
r = rtime();
Write a formula that expresses the relation between u, s, and r. (You can
assume that the time it takes to call the functions is negligible.)
b. On Bob’s system, any time that an MPI process spends waiting for messages
isn’t counted by either utime or stime, but the time is counted by
rtime. Explain how Bob can use these facts to determine whether an MPI
process is spending too much time waiting for messages.
c. Bob has given Sally his timing functions. However, Sally has discovered
that on her system, the time an MPI process spends waiting for messages
is counted as user time. Furthermore, sending messages doesn’t use
any system time. Can Sally use Bob’s functions to determine whether an
MPI process is spending too much time waiting for messages? Explain your
answer.
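Purely as an illustration of how the three functions might be called around a region of interest (this fragment assumes implementations of utime, stime, and rtime are linked in and behave as described above):

double u0 = utime(), s0 = stime(), r0 = rtime();

/* ... region of interest, e.g. a communication-heavy phase ... */

double u1 = utime(), s1 = stime(), r1 = rtime();
double user_sec = u1 - u0;   /* user time spent in the region   */
double sys_sec  = s1 - s0;   /* system time spent in the region */
double real_sec = r1 - r0;   /* elapsed time for the region     */
/* parts (a)-(c) ask how these three quantities relate and what can be
   concluded from comparing them on Bob's and Sally's systems         */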
2.23. In our application of Foster’s methodology to the construction of a histogram,
we essentially identified aggregate tasks with elements of data. An apparent
alternative would be to identify aggregate tasks with elements of bin_counts,
so an aggregate task would consist of all increments of bin_counts[b] and
consequently all calls to Find_bin that return b. Explain why this aggregation
might be a problem.
2.24. If you haven’t already done so in Chapter 1, try to write pseudo-code for our
tree-structured global sum, which sums the elements of loc_bin_cts. First
consider how this might be done in a shared-memory setting. Then consider
how this might be done in a distributed-memory setting. In the shared-memory
setting, which variables are shared and which are private?
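For a sense of the structure the exercise is after, here is one possible shared-memory outline in C-like pseudo-code; it is only a sketch, and the names my_rank, thread_count, bin_count, and the layout of loc_bin_cts as a per-thread array are assumptions made here, not taken from the text.

int divisor = 2, core_difference = 1;
while (core_difference < thread_count) {
   if (my_rank % divisor == 0) {
      int partner = my_rank + core_difference;
      if (partner < thread_count)
         for (int b = 0; b < bin_count; b++)
            loc_bin_cts[my_rank][b] += loc_bin_cts[partner][b];
   }
   divisor *= 2;
   core_difference *= 2;
   /* shared-memory version: all threads synchronize (barrier) here */
}
/* afterwards loc_bin_cts[0][0..bin_count-1] holds the global counts;
   a distributed-memory version would replace the shared additions
   with receives and sends between partner processes                 */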