Solutions: An Introduction to Parallel Programming - Pacheco - Chapter 2

2.1. When we were discussing floating point addition, we made the simplifying assumption that each of the functional units took the same amount of time. Suppose that fetch and store each take 2 nanoseconds and the remaining operations each take 1 nanosecond. a. How long does a floating point addition take with these assumptions? b. How long will an unpipelined addition of 1000 pairs of floats take with these assumptions? c. How long will a pipelined addition of 1000 pairs of floats take with these assumptions? d. The time required for fetch and store may vary considerably if the operands/results are stored in different levels of the memory hierarchy. Suppose that a fetch from a level 1 cache takes two nanoseconds, while a fetch from a level 2 cache takes five nanoseconds, and a fetch from main memory takes fifty nanoseconds. What happens to the pipeline when there is a level 1 cache miss on a fetch of one of the operands? What happens when there is a level 2 miss? Get solution
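
Parts (a) through (c) reduce to arithmetic once the stage times are fixed. A minimal C sketch of that arithmetic, assuming the chapter's seven functional units (fetch, compare, shift, add, normalize, round, store) and assuming that a full pipeline delivers one result per slowest-stage time (both are modeling assumptions, not the book's solution):

    #include <stdio.h>

    int main(void) {
        /* Assumed stage times in ns: fetch = 2, five 1-ns internal
           stages, store = 2. */
        int stages[] = {2, 1, 1, 1, 1, 1, 2};
        int n_stages = 7, n_pairs = 1000;

        int one_add = 0, slowest = 0;
        for (int i = 0; i < n_stages; i++) {
            one_add += stages[i];
            if (stages[i] > slowest) slowest = stages[i];
        }
        printf("(a) one addition: %d ns\n", one_add);
        printf("(b) unpipelined:  %d ns\n", n_pairs * one_add);
        /* (c) the first result fills the pipeline; after that, one
           result completes per slowest-stage time. */
        printf("(c) pipelined:    %d ns\n",
               one_add + (n_pairs - 1) * slowest);
        return 0;
    }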

2.2. Explain how a queue, implemented in hardware in the CPU, could be used to improve the performance of a write-through cache. Get solution
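
The queue in question is usually called a write buffer. A minimal sketch, assuming a fixed-size circular queue that the CPU enqueues stores into so execution can continue while the memory controller drains writes to main memory (the buffer depth and interface are made up for illustration):

    #include <stdio.h>

    #define QSIZE 8  /* assumed buffer depth */

    struct write_entry { unsigned addr; unsigned val; };

    static struct write_entry queue[QSIZE];
    static int head = 0, tail = 0, count = 0;

    /* CPU side: enqueue the store and keep executing instead of
       stalling for main memory. Returns 0 when the buffer is full
       and the CPU really would have to stall. */
    int buffered_write(unsigned addr, unsigned val) {
        if (count == QSIZE) return 0;
        queue[tail] = (struct write_entry){addr, val};
        tail = (tail + 1) % QSIZE;
        count++;
        return 1;
    }

    /* Memory side: the controller drains one pending write at a time. */
    void drain_one(void) {
        if (count == 0) return;
        struct write_entry e = queue[head];
        head = (head + 1) % QSIZE;
        count--;
        printf("writing %u to address %u\n", e.val, e.addr);
    }

    int main(void) {
        buffered_write(0x100, 42);  /* CPU proceeds immediately */
        drain_one();                /* memory catches up later  */
        return 0;
    }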

2.3. Recall the example involving cache reads of a two-dimensional array (page 22). How do a larger matrix and a larger cache affect the performance of the two pairs of nested loops? What happens if MAX = 8 and the cache can store four lines? How many misses occur in the reads of A in the first pair of nested loops? How many misses occur in the second pair? Get solution
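
For reference, the example traverses A in row-major order in the first pair of loops and in column-major order in the second. A sketch along the lines of the book's example (the initial values are placeholders):

    #include <stdio.h>
    #define MAX 8

    int main(void) {
        double A[MAX][MAX] = {{0}}, x[MAX] = {0}, y[MAX] = {0};
        int i, j;

        /* First pair: A is read row by row, matching C's row-major
           layout, so one miss brings in several useful elements. */
        for (i = 0; i < MAX; i++)
            for (j = 0; j < MAX; j++)
                y[i] += A[i][j] * x[j];

        /* Second pair: A is read column by column, so consecutive
           reads are MAX doubles apart and a small cache reuses
           almost nothing from each line it loads. */
        for (j = 0; j < MAX; j++)
            for (i = 0; i < MAX; i++)
                y[i] += A[i][j] * x[j];

        printf("%f\n", y[0]);  /* keep the loops from being elided */
        return 0;
    }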

2.4. In Table 2.2, virtual addresses consist of a byte offset of 12 bits and a virtual page number of 20 bits. How many pages can a program have if it’s run on a system with this page size and this virtual address size? Get solution
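
Since the virtual page number is 20 bits, the count is a power of two; a one-line check (purely illustrative):

    #include <stdio.h>

    int main(void) {
        /* A 20-bit virtual page number allows 2^20 distinct pages. */
        printf("%ld pages\n", 1L << 20);  /* 1048576 */
        return 0;
    }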

2.5. Does the addition of cache and virtual memory to a von Neumann system change its designation as an SISD system? What about the addition of pipelining? Multiple issue? Hardware multithreading? Get solution

2.6. Suppose that a vector processor has a memory system in which it takes 10 cycles to load a single 64-bit word from memory. How many memory banks are needed so that a stream of loads can, on average, require only one cycle per load? Get solution
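
The usual reasoning is that, with a 10-cycle load, consecutive loads must land in different banks so a new load can issue every cycle. A small sketch of low-order interleaving (the bank count shown is the assumption the question asks you to justify):

    #include <stdio.h>

    #define NBANKS 10  /* assumed: one bank per cycle of load latency */

    int main(void) {
        /* With low-order interleaving, consecutive words map to
           consecutive banks, so a stream of loads can issue one per
           cycle while each bank stays busy for 10 cycles. */
        for (unsigned addr = 0; addr < 20; addr++)
            printf("word %2u -> bank %u\n", addr, addr % NBANKS);
        return 0;
    }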

2.7. Discuss the differences in how a GPU and a vector processor might execute the following code:

    sum = 0.0;
    for (i = 0; i < n; i++) {
       y[i] += a*x[i];
       sum += z[i]*z[i];
    }

Get solution

2.8. Explain why the performance of a hardware multithreaded processing core might degrade if it had large caches and it ran many threads. Get solution

2.9. In our discussion of parallel hardware, we used Flynn’s taxonomy to identify three types of parallel systems: SISD, SIMD, and MIMD. None of our systems were identified as multiple instruction, single data, or MISD. How would an MISD system work? Give an example. Get solution

2.10. Suppose a program must execute 10^12 instructions in order to solve a particular problem. Suppose further that a single processor system can solve the problem in 10^6 seconds (about 11.6 days). So, on average, the single processor system executes 10^6, or a million, instructions per second. Now suppose that the program has been parallelized for execution on a distributed-memory system. Suppose also that if the parallel program uses p processors, each processor will execute 10^12/p instructions and each processor must send 10^9(p - 1) messages. Finally, suppose that there is no additional overhead in executing the parallel program. That is, the program will complete after each processor has executed all of its instructions and sent all of its messages, and there won’t be any delays due to things such as waiting for messages. a. Suppose it takes 10^-9 seconds to send a message. How long will it take the program to run with 1000 processors, if each processor is as fast as the single processor on which the serial program was run? b. Suppose it takes 10^-3 seconds to send a message. How long will it take the program to run with 1000 processors? Get solution
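
A quick sanity check of the arithmetic, assuming each processor executes 10^6 instructions per second and pays the stated cost per message (this just evaluates the problem's formulas; how compute and communication are combined is a modeling assumption):

    #include <stdio.h>

    int main(void) {
        double instructions = 1e12, rate = 1e6;  /* instr. per second */
        double p = 1000.0;
        double msgs = 1e9 * (p - 1.0);           /* messages per processor */

        double compute = (instructions / p) / rate;
        printf("compute time per processor: %g s\n", compute);
        printf("(a) t_msg = 1e-9 s: total = %g s\n", compute + msgs * 1e-9);
        printf("(b) t_msg = 1e-3 s: total = %g s\n", compute + msgs * 1e-3);
        return 0;
    }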

2.11. Derive formulas for the total number of links in the various distributed-memory interconnects. Get solution
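
As a numerical cross-check while deriving them, the standard counts for the chapter's interconnects can be tabulated; the sketch below encodes the usual formulas the exercise asks you to derive, evaluated at p = 16 (a power of two that is also a perfect square):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double p = 16.0;
        printf("ring:            %g links\n", p);
        printf("toroidal mesh:   %g links\n", 2.0 * p);
        printf("fully connected: %g links\n", p * (p - 1.0) / 2.0);
        printf("hypercube:       %g links\n", (p / 2.0) * log2(p));
        return 0;
    }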

2.12. a. A planar mesh is just like a toroidal mesh, except that it doesn’t have the “wraparound” links. What is the bisection width of a square planar mesh? b. A three-dimensional mesh is similar to a planar mesh, except that it also has depth. What is the bisection width of a three-dimensional mesh? Get solution

2.13. a. Sketch a four-dimensional hypercube. b. Use the inductive definition of a hypercube to explain why the bisection width of a hypercube is p/2. Get solution

2.14. To define the bisection width for indirect networks, the processors are partitioned into two groups so that each group has half the processors. Then, links are removed from anywhere in the network so that the two groups are no longer connected. The minimum number of links removed is the bisection width. When we count links, if the diagram uses unidirectional links, two unidirectional links count as one link. Show that an eight-by-eight crossbar has a bisection width less than or equal to eight. Also show that an omega network with eight processors has a bisection width less than or equal to four. Get solution

2.15. a. Suppose a shared-memory system uses snooping cache coherence and write-back caches. Also suppose that core 0 has the variable x in its cache, and it executes the assignment x = 5. Finally suppose that core 1 doesn’t have x in its cache, and after core 0’s update to x, core 1 tries to execute y = x. What value will be assigned to y? Why? b. Suppose that the shared-memory system in the previous part uses a directory-based protocol. What value will be assigned to y? Why? c. Can you suggest how any problems you found in the first two parts might be solved? Get solution

2.16. a. Suppose the run-time of a serial program is given by T_serial = n^2, where the units of the run-time are in microseconds. Suppose that a parallelization of this program has run-time T_parallel = n^2/p + log2(p). Write a program that finds the speedups and efficiencies of this program for various values of n and p. Run your program with n = 10, 20, 40, ..., 320, and p = 1, 2, 4, ..., 128. What happens to the speedups and efficiencies as p is increased and n is held fixed? What happens when p is fixed and n is increased? b. Suppose that T_parallel = T_serial/p + T_overhead. Also suppose that we fix p and increase the problem size. - Show that if T_overhead grows more slowly than T_serial, the parallel efficiency will increase as we increase the problem size. - Show that if, on the other hand, T_overhead grows faster than T_serial, the parallel efficiency will decrease as we increase the problem size. Get solution
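
Part (a) asks for exactly such a program; a minimal sketch in C that plugs the given run-times into the definitions speedup S = T_serial/T_parallel and efficiency E = S/p:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        for (int n = 10; n <= 320; n *= 2) {
            double t_serial = (double)n * n;  /* n^2 microseconds */
            for (int p = 1; p <= 128; p *= 2) {
                double t_parallel = (double)n * n / p + log2((double)p);
                double speedup = t_serial / t_parallel;
                double efficiency = speedup / p;
                printf("n = %3d, p = %3d: S = %8.2f, E = %5.3f\n",
                       n, p, speedup, efficiency);
            }
        }
        return 0;
    }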

2.17. A parallel program that obtains a speedup greater than p—the number of processes or threads—is sometimes said to have superlinear speedup. However, many authors don’t count programs that overcome “resource limitations” as having superlinear speedup. For example, a program that must use secondary storage for its data when it’s run on a single processor system might be able to fit all its data into main memory when run on a large distributed-memory system. Give another example of how a program might overcome a resource limitation and obtain speedups greater than p. Get solution

2.18. Look at three programs you wrote in your Introduction to Computer Science class. What (if any) parts of these programs are inherently serial? Does the inherently serial part of the work done by the program decrease as the problem size increases? Or does it remain roughly the same? Get solution

2.19. Suppose T_serial = n and T_parallel = n/p + log2(p), where times are in microseconds. If we increase p by a factor of k, find a formula for how much we’ll need to increase n in order to maintain constant efficiency. How much should we increase n by if we double the number of processes from 8 to 16? Is the parallel program scalable? Get solution
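
For numerical experimentation, note that these run-times give E = n/(n + p log2(p)), so constant efficiency means keeping p log2(p)/n fixed. The sketch below checks the scaling factor that follows from that condition (the closed form is derived here from that observation, not quoted from the text):

    #include <stdio.h>
    #include <math.h>

    /* Efficiency for T_serial = n, T_parallel = n/p + log2(p). */
    double eff(double n, double p) {
        return n / (n + p * log2(p));
    }

    int main(void) {
        double n = 1000.0, p = 8.0, k = 2.0;
        /* Holding p*log2(p)/n fixed means scaling n by
           k*log2(k*p)/log2(p). */
        double factor = k * log2(k * p) / log2(p);
        printf("scale n by %g\n", factor);  /* 8/3 when p = 8, k = 2 */
        printf("E before: %g, E after: %g\n",
               eff(n, p), eff(n * factor, k * p));
        return 0;
    }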

2.20. Is a program that obtains linear speedup strongly scalable? Explain your answer. Get solution

2.21. Bob has a program that he wants to time with two sets of data, input_data1 and input_data2. To get some idea of what to expect before adding timing functions to the code he’s interested in, he runs the program with the two sets of data and the Unix shell command time:

    $ time ./bobs_prog < input_data1
    real    0m0.001s
    user    0m0.001s
    sys     0m0.000s

    $ time ./bobs_prog < input_data2
    real    1m1.234s
    user    1m0.001s
    sys     0m0.111s

The timer function Bob is using has millisecond resolution. Should Bob use it to time his program with the first set of data? What about the second set of data? Why or why not? Get solution

2.22. As we saw in the preceding problem, the Unix shell command time reports the user time, the system time, and the “real” time or total elapsed time. Suppose that Bob has defined the following functions that can be called in a C program:

    double utime(void);
    double stime(void);
    double rtime(void);

The first returns the number of seconds of user time that have elapsed since the program started execution, the second returns the number of system seconds, and the third returns the total number of seconds. Roughly, user time is time spent in the user code and in library functions that don’t need to use the operating system (for example, sin and cos). System time is time spent in functions that do need to use the operating system (for example, printf and scanf).

a. What is the mathematical relation among the three function values? That is, suppose the program contains the following code:

    u = utime();
    s = stime();
    r = rtime();

Write a formula that expresses the relation between u, s, and r. (You can assume that the time it takes to call the functions is negligible.)

b. On Bob’s system, any time that an MPI process spends waiting for messages isn’t counted by either utime or stime, but the time is counted by rtime. Explain how Bob can use these facts to determine whether an MPI process is spending too much time waiting for messages.

c. Bob has given Sally his timing functions. However, Sally has discovered that on her system, the time an MPI process spends waiting for messages is counted as user time. Furthermore, sending messages doesn’t use any system time. Can Sally use Bob’s functions to determine whether an MPI process is spending too much time waiting for messages? Explain your answer. Get solution
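
For part (b), one natural approach is to bracket a communication-heavy region with the three timers and attribute the elapsed time that neither utime nor stime accounts for to waiting. A sketch assuming Bob's semantics (the timer bodies below are stubs so the example compiles; on Bob's system they would be his real functions):

    #include <stdio.h>

    /* Stubs for the problem's hypothetical timers. */
    static double utime(void) { return 0.0; }
    static double stime(void) { return 0.0; }
    static double rtime(void) { return 0.0; }

    int main(void) {
        double u0 = utime(), s0 = stime(), r0 = rtime();

        /* ... code containing MPI communication would go here ... */

        double u = utime() - u0, s = stime() - s0, r = rtime() - r0;
        /* On Bob's system r = u + s + wait, so the leftover elapsed
           time estimates the time spent waiting for messages. */
        printf("estimated wait: %f s\n", r - (u + s));
        return 0;
    }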

2.23. In our application of Foster’s methodology to the construction of a histogram, we essentially identified aggregate tasks with elements of data. An apparent alternative would be to identify aggregate tasks with elements of bin_counts, so an aggregate task would consist of all increments of bin_counts[b] and consequently all calls to Find_bin that return b. Explain why this aggregation might be a problem. Get solution

2.24. If you haven’t already done so in Chapter 1, try to write pseudo-code for our tree-structured global sum, which sums the elements of loc_bin_cts. First consider how this might be done in a shared-memory setting. Then consider how this might be done in a distributed-memory setting. In the shared-memory setting, which variables are shared and which are private? Get solution
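
As a starting point, the communication pattern can be prototyped serially. A sketch of the tree-structured reduction over per-core values (the core count and values are made up; in a shared-memory version the loc array would be shared while the loop indices are private to each thread):

    #include <stdio.h>

    #define P 8  /* assumed number of cores/threads */

    int main(void) {
        /* Stand-ins for each core's loc_bin_cts contribution. */
        int loc[P] = {1, 2, 3, 4, 5, 6, 7, 8};

        /* Tree-structured sum: at each stage, core i adds in the
           value from core i + stride; half the cores drop out each
           stage until core 0 holds the global sum. */
        for (int stride = 1; stride < P; stride *= 2)
            for (int i = 0; i + stride < P; i += 2 * stride)
                loc[i] += loc[i + stride];

        printf("global sum = %d\n", loc[0]);  /* 36 */
        return 0;
    }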