Question 1

What is the von Neumann architecture?

Accepted Answer

An abstraction of a general purpose computer. All program instructions and data are stored in a single memory unit. Main memory is separate from the CPU. Registers are built into the CPU, as is the cache, which is used to improve memory performance.

Question 2

Describe a chip multiprocessor:

Accepted Answer

More than one complete CPU on a single silicon chip. Each core executes independently, has its own L1 cache and has identical access to a single shared memory. The caches work together to keep the processors' view of memory consistent.

Question 3

Describe the symmetric multiprocessor architecture:

Accepted Answer

Multiple processors with a single logical memory. An example is a multi-socket server where each socket holds a multi-core chip. Limitations: maintaining consistency between caches adds complexity and overhead, and access to main memory is a limiting factor.

Question 4

Describe heterogeneous chip designs:

Accepted Answer

One or more general purpose CPUs are paired with one or more specialised compute engines, e.g. a GPU, a field-programmable gate array (FPGA) or a Cell processor.

Question 5

What is a Cell processor?

Accepted Answer

Dual-threaded 64-bit PowerPC processor (general purpose).
8 x 32-bit synergistic processing elements (SPEs).
High-speed element interconnect bus connecting the SPEs, the PowerPC core and I/O.

Question 6

Describe clusters:

Accepted Answer

Parallel computers made from commodity parts: several interconnected standard boards, each with one or more processors and RAM.
Also packaged as blade servers (e.g. in a server room).

Question 7

Describe IBM's Blue Gene:

Accepted Answer

Up to 65536 dual-core nodes, each connected to three different interconnects: a 3D torus for general comms, a collective network for fast reductions, and a barrier/interrupt network for co-ordination.

Question 8

Compare shared and disjoint memory:

Accepted Answer

A single shared memory allows a simple programming model, but does not scale well to a large number of processors. Disjoint memory needs a more complicated programming model, but scales a lot better. The programmer must explicitly take the disjoint memory into account when writing the program.

Question 9

What are the classifications in Flynn's taxonomy?

Accepted Answer

SISD (Single Instruction, Single Data - Sequential)
SIMD (Single Instruction, Multiple Data - Vector)
MISD (Multiple Instruction, Single Data - Not commonly used)
MIMD (Multiple Instruction, Multiple Data - any other parallel approach)

Question 10

Describe the random access machine model for sequential computers:

Accepted Answer

Single sequence of instructions, executed one at a time, on data stored in a single randomly accessible memory. The time taken to access memory is assumed to be the same for any location.

Question 11

Describe the parallel random access machine model:

Accepted Answer

Could be modelled as multiple processors all connected to a single random access memory. But this cannot be built in practice with a large number of processors.

Question 12

What is the Candidate Type Architecture (CTA)?

Accepted Answer

P sequential computers, each with a processor and local memory, joined by an interconnection network that has a limited number of connections to each node, with a bounded delay. A controller can be optionally included to help with initialisation and sync.

Question 13

What is the locality rule?

Accepted Answer

Fast programs tend to maximise the number of local memory references and minimise the non-local memory ones.

Question 14

What is a system thread?

Accepted Answer

A thread that is supported by the OS, which schedules these threads over the physical processors, deciding when they execute.

Question 15

What is a user space thread?

Accepted Answer

A thread that uses only co-operative multithreading. All user space threads share a single physical processor. They do not have pre-emptive multitasking like system threads do, and the OS cannot interrupt them.

Question 16

Define a critical section:

Accepted Answer

Groups of instructions that work together to update memory, but which cannot be safely interleaved as they would produce inconsistent results. The most common solution is to ensure that only one critical section can execute at any moment.
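
A minimal sketch of the "only one critical section at a time" solution, using a lock around a shared counter. The names `counter`, `add_many` and `run` are hypothetical, chosen only for this illustration.

```python
import threading

counter = 0                      # shared state updated by every thread
counter_lock = threading.Lock()  # guards the critical section below

def add_many(times):
    """Increment the shared counter `times` times."""
    global counter
    for _ in range(times):
        with counter_lock:       # only one thread at a time may enter
            counter += 1         # read-modify-write: the critical section

def run(num_threads=4, times=10_000):
    """Reset the counter, run the threads to completion, return the total."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_many, args=(times,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

With the lock in place `run()` always returns `num_threads * times`. Without it, the read and the write of `counter` from different threads could interleave and updates could be lost.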

Question 17

What are the issues with locks?

Accepted Answer

Obtaining and releasing a lock is slow, even with one thread. Locks cause the work done in the critical section to be serialised, so most of the potential for parallel work is thrown away. Each thread spends most of its time waiting for the lock.

Question 18

What are the issues with using private variables locally in a parallel program such as count 3s?

Accepted Answer

The private variables are held in the same cache line. The line is locked out when it is written to, so only one thread can increase its count at a time (false sharing). A solution to this is to pad out the private variables so they are further apart in memory.

Question 19

What are the limits to performance gain?

Accepted Answer

At some point, getting values from and/or to main memory will become the limiting factor. Past this point each thread doesn't have enough work to do whilst it waits for more data to arrive.

Question 20

Define execution time/latency:

Accepted Answer

The total time the problem takes to execute.

Question 21

Define Throughput:

Accepted Answer

The total amount of work done in a unit of time.

Question 22

Define Speedup:

Accepted Answer

The execution time of a sequential program, divided by the execution time of a parallel program that computes the same answer. Ideally, for a machine with P processors, speedup would equal P.

Question 23

What does it mean if the speedup of a parallel program is greater than the number of processors?

Accepted Answer

It is super-linear, meaning the parallel version is doing less work than the sequential version.

Question 24

Define efficiency:

Accepted Answer

Normalised speedup, indicating how efficiently each of the processors is being used. Speedup / Number of processors. Ideally equals 1.
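
The speedup and efficiency definitions can be written as two small functions. This is only a sketch; the function names and the timings used in the note below are made up for illustration.

```python
def speedup(t_sequential, t_parallel):
    """Sequential execution time divided by parallel execution time."""
    return t_sequential / t_parallel

def efficiency(t_sequential, t_parallel, processors):
    """Speedup normalised by the number of processors (ideally 1)."""
    return speedup(t_sequential, t_parallel) / processors
```

For a hypothetical run that takes 120 s sequentially and 20 s on 8 processors, `speedup(120.0, 20.0)` gives 6.0 and `efficiency(120.0, 20.0, 8)` gives 0.75.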

Question 25

Define scaled speedup:

Accepted Answer

Most intuitive. It tries to increase the size of the problem to match the size of the machine. This can be very hard if not impossible in practice due to the non-linearity of memory requirements or the algorithm.

Question 26

Define fixed size speedup:

Accepted Answer

This tries to find a problem size that can be used to compare the program fairly. This can be hard because a small problem may not scale well to a large machine and vice versa.

Question 27

Define overhead:

Accepted Answer

Any cost incurred in the parallel solution but not in the serial one. This can include the time required to create and destroy threads, inter-thread communication and synchronisation.

Question 28

What is non-parallelisable computation?

Accepted Answer

If a computation is inherently sequential, then more processors will not help.

Question 29

What is Amdahl's law?

Accepted Answer

If 1/S of a computation is inherently sequential, the maximum possible speedup is S.
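
Amdahl's law is commonly written as speedup = 1 / (f + (1 - f) / P), where f is the inherently sequential fraction and P the number of processors. The sketch below uses that form; the symbols f and P and the function name are assumptions introduced here, not taken from the card.

```python
def amdahl_speedup(sequential_fraction, processors):
    """Predicted speedup when `sequential_fraction` of the work cannot be parallelised."""
    parallel_fraction = 1.0 - sequential_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / processors)
```

As `processors` grows, the result approaches `1 / sequential_fraction` but never reaches it: with a sequential fraction of 0.1, no number of processors gives a speedup of 10 or more.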

Question 30

What does "Contention for resources" mean?

Accepted Answer

Degradation of system performance caused by competition for a shared resource. It can cause parallel programs to become slower as more processors are added.

Question 31

Why are idle processors a source of performance loss?

Accepted Answer

Ideally all processors will be working all of the time, but a thread/process may not be able to proceed because of a lack of work, or because it is waiting for some external event such as the arrival of new work.

Question 32

Define latency hiding (overlapping communication and computation):

Accepted Answer

If you can identify some computation that is independent of the communication, then you can: initiate the communication, do the independent computation while you wait, complete the communication and continue. Usually has few or no costs.
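
A small sketch of the initiate / compute / complete pattern, using a background thread to stand in for the communication. `fetch_remote_block` and `independent_work` are hypothetical placeholders, and the 50 ms sleep is an invented delay.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_remote_block():
    """Hypothetical stand-in for a slow communication step."""
    time.sleep(0.05)             # pretend the transfer takes 50 ms
    return [1, 2, 3, 4]

def independent_work():
    """Computation that does not need the remote data."""
    return sum(i * i for i in range(1000))

def overlapped():
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_remote_block)   # 1. initiate the communication
        local = independent_work()                  # 2. compute while it is in flight
        remote = pending.result()                   # 3. complete the communication
    return local + sum(remote)                      # 4. continue with both results
```

The total time is roughly the longer of the two steps rather than their sum, which is the point of overlapping them.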

Question 33

Describe the shared memory reference mechanism:

Accepted Answer

A single logical address space, accessed by all processors. All processors see consistent values when accessing memory, though different systems provide different levels of consistency.

Question 34

What is sequential shared memory?

Accepted Answer

Memory behaves as if all memory accesses happen in a single strict order. Least efficient.

Question 35

What is relaxed shared memory?

Accepted Answer

Different processors might "see" different memory accesses under some circumstances.

Question 36

Describe the one-sided communication memory reference mechanism:

Accepted Answer

A single, logical address space accessed by all processors. One processor gets and sets values in another processor's memory. There is no co-ordination between processors. A form of relaxed consistency, meaning additional synchronisation will be needed.

Question 37

Describe the message passing memory reference mechanism:

Accepted Answer

No shared address space. To obtain non-local information, processors send and receive messages. In the program, access to non-local information is written completely differently from access to local memory.

Question 38

What is a data parallel computation?

Accepted Answer

A computation that performs the same operation(s) on different items of data at the same time. Potential parallelism grows with the size of the data.

Question 39

What is a task parallel computation?

Accepted Answer

A computation that performs distinct computations (tasks) at the same time. The set of tasks is fixed and so parallelism is not scalable. One example of this is pipelining, where a series of tasks are solved in sequence.

Question 40

Define dependence:

Accepted Answer

An ordering relationship between two computations, which must be observed for correct results.

Question 41

What is Flow dependence?

Accepted Answer

Read after write. A true dependence, equivalent to the reader waiting for a message from the writer.

Question 42

What is an Anti-dependence?

Accepted Answer

Write after read. A false dependence caused by re-using memory locations. Can be eliminated by using separate memory (e.g. different variables).

Question 43

What is an Output dependence?

Accepted Answer

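Write after write. Like an anti-dependence, it is a false dependence caused by re-using memory locations and can be eliminated in the same way.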

Question 44

What is an Input dependence?

Accepted Answer

Read after read. Does NOT impose an ordering constraint.
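
The dependence types can be seen in a few lines of straight-line code. This is an illustrative sketch; the statement labels S1 to S4 and both function names are invented here.

```python
def with_reuse(a, b):
    x = a + 1        # S1: writes x
    y = x * 2        # S2: reads x  -> flow dependence on S1 (read after write)
    x = b + 5        # S3: writes x -> anti-dependence on S2 (write after read)
                     #                 and output dependence on S1 (write after write)
    z = x - 1        # S4: reads x  -> flow dependence on S3
    return y, z

def with_renaming(a, b):
    x1 = a + 1       # S1
    y = x1 * 2       # S2: still flow dependent on S1
    x2 = b + 5       # S3: fresh name, no longer ordered against S1 or S2
    z = x2 - 1       # S4: still flow dependent on S3
    return y, z
```

Both versions compute the same result, but in `with_renaming` the pairs (S1, S2) and (S3, S4) are independent of each other and could run at the same time. Only the two flow dependences remain.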

Question 45

Define granularity:

Accepted Answer

The granularity of a parallel computation is how much work (the number of instructions) can be done within a single thread/process between each interaction with another thread/process.

Question 46

What is fine grained granularity?

Accepted Answer

Few instructions between interactions, interaction is frequent.

Question 47

What is coarse grained granularity?

Accepted Answer

Many instructions between interactions, interaction is infrequent.

Question 48

Batching is a method of making a computation more coarse-grained. What is it?

Accepted Answer

Performing work as a group. Makes computation more coarse-grained by reducing the frequency of interaction. Only makes sense if there are still enough chunks of work for all the processors and the individual tasks don't have dependencies with other tasks.

Question 49

Over-dividing work is a method of making a computation more fine-grained. What is it?

Accepted Answer

Dividing the work into more, smaller units makes computation more fine-grained, since interaction is needed for at least every unit of work. This can make it easier to keep all processors busy. Especially useful if units of work are variable.

Question 50

What is Privatisation in terms of locality?

Accepted Answer

Rather than threads competing to access a single shared variable, give each thread its own separate copy that can be used independently.
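
A sketch of privatisation applied to count 3s: each thread keeps a private tally and the tallies are combined once at the end. The function name and the way the data is split are assumptions for illustration, and Python threads are used only to show the structure, not to demonstrate a real speedup.

```python
import threading

def count_threes(data, num_threads=4):
    """Count the 3s in `data` using one private tally per thread."""
    private_counts = [0] * num_threads          # one result slot per thread
    chunk = (len(data) + num_threads - 1) // num_threads

    def worker(tid):
        local = 0                               # private copy, updated without a lock
        for value in data[tid * chunk:(tid + 1) * chunk]:
            if value == 3:
                local += 1
        private_counts[tid] = local             # single write at the end

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(private_counts)                  # combine the private copies once
```

No lock is needed because no two threads ever write to the same variable.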

Question 51

What is Padding in terms of locality?

Accepted Answer

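Variables that are close together in memory can be cached together in the same cache line, so a write to one affects access to the others. Extra padding moves them onto separate lines and breaks this false dependency.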

Question 52

What is a Redundant Computation?

Accepted Answer

Each thread calculates the same value locally, rather than one thread calculating it and communicating the value to each thread, increasing locality. This is useful if each thread cannot progress until it has this value, as it may as well do it itself.

Question 53

What is the execution model for OpenMP?

Accepted Answer

It uses a thread fork-join execution model: threads are created (forked) to run a task or tasks in parallel and then join when the work is complete.

Question 54

What is the memory model for OpenMP?

Accepted Answer

It has a relaxed consistency shared memory model. All threads can read/write a common shared memory, but each thread can also have its own temporary view of that memory. Each thread can also have its own private memory.

Question 55

What options are available for controlling concurrency in OpenMP?

Accepted Answer

#pragma omp critical
#pragma omp atomic
#pragma omp single
#pragma omp barrier

Question 56

Describe the Schwartz algorithm:

Accepted Answer

Consider a tree operation (e.g. +/- reduce) performed by P parallel processes on n data items where P < n: the tree should connect the P parallel processes rather than the n data items. This will minimise communication, co-ordination and overhead.

Question 57

Discuss the Schwartz algorithm:

Accepted Answer

Best way to go if the chunks are consistently balanced. An application of the locality rule to tree algorithms. The most important aspect is having each of the processes perform a balanced share of the computation locally.

Question 58

Describe a generalised reduce:

Accepted Answer

The intermediate or tally value calculated by each thread can be a different type from the data items. The global summary type can be a different type from the tally values. They need not be single values; they could be compound values or an array.
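
As an illustration, a mean can be computed as a generalised reduce: the data items are numbers, the tally is a (count, total) pair, and the global summary is a single float. The four function names below (`init`, `accum`, `combine`, `reduce_gen`) are labels chosen for this sketch, and each inner list stands in for the data held by one thread.

```python
from functools import reduce

def init():
    return (0, 0.0)                     # tally: (item count, running total)

def accum(tally, item):
    count, total = tally
    return (count + 1, total + item)    # fold one data item into a tally

def combine(left, right):
    return (left[0] + right[0], left[1] + right[1])   # merge two tallies

def reduce_gen(tally):
    count, total = tally
    return total / count if count else None           # tally -> global summary

def mean(chunks):
    tallies = [reduce(accum, chunk, init()) for chunk in chunks]  # one tally per chunk
    return reduce_gen(reduce(combine, tallies, init()))
```

For example, `mean([[1, 2], [3], [4, 5, 6]])` builds three tallies, merges them into `(6, 21.0)` and returns 3.5.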

Question 59

How does a reduce algorithm implemented in a Schwartz style work?

Accepted Answer

Each thread combines a portion of the data to produce its own value. Values from each thread are then combined in a tree to produce the global value.
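
A sequential sketch of the two phases, with a loop standing in for the threads. It assumes a non-empty input and an associative operator; `schwartz_reduce` and its parameters are names invented for this example.

```python
def schwartz_reduce(data, num_workers, op=lambda a, b: a + b):
    """Reduce `data` with `op`, assuming `op` is associative and data is non-empty."""
    chunk = (len(data) + num_workers - 1) // num_workers
    # Local phase: each worker folds its own block sequentially.
    partials = []
    for w in range(num_workers):
        block = data[w * chunk:(w + 1) * chunk]
        if block:
            acc = block[0]
            for item in block[1:]:
                acc = op(acc, item)
            partials.append(acc)
    # Tree phase: pair up neighbouring partial results until one value remains.
    while len(partials) > 1:
        paired = [op(partials[i], partials[i + 1])
                  for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:
            paired.append(partials[-1])      # odd one out moves up unchanged
        partials = paired
    return partials[0]
```

The tree has one leaf per worker rather than one per data item, so the number of combining steps that need communication depends on the worker count, not on the data size.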

Question 60

Describe a Scan:

Accepted Answer

Another common parallel operation for calculating local information that depends on non-local information. Calculates a value at each point in the array that depends on the values before it.

Question 61

How is a scan calculated in parallel?

Accepted Answer

Requires two passes. An upward pass, like reduce, calculates intermediate values. A downward pass combines and distributes the intermediate values back from the top of the tree to the individual threads.
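
A simplified sketch of the two passes for an inclusive + scan, processed block by block. Two assumptions are made here: the scan is inclusive, and a simple running sum over the block totals stands in for the tree that a real parallel version would use. The function name is hypothetical.

```python
def parallel_style_scan(data, num_workers):
    """Inclusive +-scan computed block by block, in the two-pass style."""
    chunk = max(1, (len(data) + num_workers - 1) // num_workers)
    blocks = [data[i:i + chunk] for i in range(0, len(data), chunk)]

    # Upward pass: each worker reduces its own block to a single total.
    totals = [sum(block) for block in blocks]

    # Top of the tree: turn block totals into the offset entering each block.
    offsets, running = [], 0
    for total in totals:
        offsets.append(running)
        running += total

    # Downward pass: each worker scans its block starting from its offset.
    result = []
    for block, offset in zip(blocks, offsets):
        acc = offset
        for item in block:
            acc += item
            result.append(acc)
    return result
```

Each block only needs one non-local value, its incoming offset, before it can finish its part of the scan locally.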

Question 62

Describe overlap regions in static work allocation:

Accepted Answer

Block allocation still requires some non-local references. It is usually better to explicitly fetch and cache the required overlap regions first, then perform the computation entirely locally. This also reduces overhead.

Question 63

Describe block cyclic allocations in static work allocation:

Accepted Answer

Useful where work is not proportional to the amount of data. Each process is allocated many smaller blocks of data spread across the entire array rather than one large chunk, so that on average each process should end up with an equal amount of work.
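
A sketch of a one-dimensional block-cyclic mapping from array index to owning process. The function names and the `block_size` parameter are assumptions made for this illustration.

```python
def block_cyclic_owner(index, block_size, num_procs):
    """Process that owns element `index` under a block-cyclic layout."""
    return (index // block_size) % num_procs

def block_cyclic_indices(n, block_size, num_procs):
    """List, per process, of the element indices it is allocated."""
    owned = [[] for _ in range(num_procs)]
    for i in range(n):
        owned[block_cyclic_owner(i, block_size, num_procs)].append(i)
    return owned
```

With 12 elements, blocks of 2 and 3 processes, process 0 gets indices 0, 1, 6, 7, process 1 gets 2, 3, 8, 9 and process 2 gets 4, 5, 10, 11, so each process holds blocks from both ends of the array.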

Question 64

Describe a work queue in dynamic work allocation:

Accepted Answer

A shared data structure that holds the definitions of the currently unallocated tasks, e.g. blocks of data to be processed. Worker threads repeatedly take a new task, execute it and update the global state. The queue holds pointers to the data rather than copies of it.
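
A minimal sketch of a work queue, assuming all tasks are known up front. Each task is an index range into the shared data rather than a copy of it, and the "global state" is a running total protected by a lock. The function name, the sum-of-squares workload and the default sizes are all invented for the example.

```python
import queue
import threading

def process_with_queue(data, block_size=4, num_workers=3):
    """Sum the squares of `data`, handing out blocks through a shared queue."""
    tasks = queue.Queue()
    for start in range(0, len(data), block_size):
        tasks.put((start, min(start + block_size, len(data))))  # index range, not a copy

    total = 0
    total_lock = threading.Lock()

    def worker():
        nonlocal total
        while True:
            try:
                lo, hi = tasks.get_nowait()      # take the next unallocated task
            except queue.Empty:
                return                           # no work left
            partial = sum(data[i] * data[i] for i in range(lo, hi))
            with total_lock:                     # update the global state safely
                total += partial

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total
```

A worker that finishes a cheap block simply takes another one, so uneven blocks are balanced across the workers at run time.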

G53PDC

Parallel and distributed computing