
Parallel and distributed computing


Question   Answer
What is the von Neumann architecture?   An abstraction of a general-purpose computer. All program instructions and data are stored in a single memory unit. Main memory is separate from the CPU. Registers are built into the CPU, as is the cache, which is used to improve memory performance.
Describe a chip multiprocessor:   More than one complete CPU on a single silicon chip. Each core executes independently, has its own L1 cache and has identical access to a SINGLE SHARED memory. The caches work together to keep the processors' view of memory consistent.
Describe the symmetric multiprocessor architecture:   Multiple processors with a single logical memory. An example is a multi-socket server where each socket holds a multi-core chip. Limitations: maintaining consistency between caches adds complexity and overhead, and access to main memory is a limiting factor.
Describe heterogeneous chip designs:   One or more general-purpose CPUs are paired with one or more specialised compute engines, e.g. a GPU, a field-programmable gate array (FPGA) or a Cell processor.
What is a Cell processor?   A dual-threaded 64-bit PowerPC processor (general purpose), 8 x 32-bit synergistic processing elements (SPEs), and a high-speed element interconnect bus connecting the SPEs, the PowerPC core and I/O.
Describe clusters:   Parallel computers made from commodity parts. Several standard boards, each with one or more processors and RAM, interconnected by a network. Also packaged as blade servers (e.g. in a server room).
Describe IBM's BlueGene:   Up to 65,536 dual-core nodes, each connected to three different interconnects: a 3D torus for general communication, a collective network for fast reductions, and a barrier/interrupt network for co-ordination.
Compare shared and disjoint memory:   A single shared memory allows a simple programming model, but does not scale well to a large number of processors. Disjoint memory is a more complicated programming model, but scales a lot better. Using disjoint memory must be considered when programming.
What are the classifications in Flynn's taxonomy?   SISD (Single Instruction, Single Data - sequential); SIMD (Single Instruction, Multiple Data - vector); MISD (Multiple Instruction, Single Data - not commonly used); MIMD (Multiple Instruction, Multiple Data - any other parallel approach).
Describe the random access machine model for sequential computers:   Single sequence of instructions, executed one at a time, on data stored in a single randomly accessible memory. The time taken to access memory is assumed to be the same for any location.  
Describe the parallel random access machine model:   Could be modelled as multiple processors all connected to a single random access memory. But this cannot be built in practice with a large number of processors.
What is the Candidate Type Architecture (CTA)?   P sequential computers, each with a processor and local memory, joined by an interconnection network that has a limited number of connections to each node, with a bounded delay. A controller can be optionally included to help with initialisation and sync.  
What is the locality rule?   Fast programs tend to maximise the number of local memory references and minimise the non-local memory ones.  
What is a system thread?   A thread that is supported by the OS, which schedules these threads over the physical processor, deciding when they execute.  
What is a user space thread?   Threads that use only co-operative multithreading and all share a single physical processor. They do not have pre-emptive multitasking like system threads do, and the OS cannot interrupt them.
Define a critical section:   Groups of instructions that work together to update memory, but which cannot be safely interleaved as they would produce inconsistent results. The most common solution is to ensure that only one critical section can execute at any moment.  
What are the issues with locks?   Obtaining and releasing a lock is slow, even with one thread. They cause the work done in the critical section to be serialised, so most of the potential for parallel work is thrown away. Each thread spends most of its time waiting for the lock.
What are the issues with using private variables locally in a parallel program such as count 3s?   The private variables are held in the same cache line, and the line is locked out when written to, so only one thread can increase its count at a time. A solution to this is to pad out the private variables so they are further apart in memory.
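
As a concrete illustration of that padding fix (a minimal C/OpenMP sketch, not from the course): each thread tallies into its own slot, and each slot is padded out to an assumed 64-byte cache line so the counters land on different lines and no false sharing occurs.

#include <omp.h>
#include <stddef.h>

/* Pad each thread's counter out to a full (assumed) 64-byte cache line
   so the per-thread counters never share a line. Assumes <= 64 threads. */
#define CACHE_LINE 64
struct padded_count { long count; char pad[CACHE_LINE - sizeof(long)]; };

long count3s(const int *a, size_t n, int nthreads) {
    struct padded_count tally[64] = {0};      /* one padded slot per thread */
    #pragma omp parallel num_threads(nthreads)
    {
        int id = omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < (long)n; i++)
            if (a[i] == 3) tally[id].count++; /* touches only its own cache line */
    }
    long total = 0;
    for (int t = 0; t < nthreads; t++) total += tally[t].count;
    return total;
}
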
What are the limits to performance gain?   At some point, getting values from and/or to main memory will become the limiting factor. Past this point each thread doesn't have enough work to do whilst it waits for more data to arrive.
Define execution time/latency:   The total time the problem takes to execute.  
Define Throughput:   The total amount of work done in a unit of time.  
Define Speedup:   The execution time of a sequential program, divided by the execution time of a parallel program that computes the same answer. Ideally, for a machine with P processors, speedup would equal P.
What does it mean if the speedup of a parallel program is greater than the number of processors?   It is super-linear, meaning the parallel version is doing less work than the sequential version.  
Define efficiency:   Normalised speedup, indicating how efficiently each of the processors is being used. Speedup / Number of processors. Ideally equals 1.  
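
Restating the last two cards in the usual notation, with T_1 the sequential execution time and T_P the time on P processors:

S_P = \frac{T_1}{T_P}, \qquad E_P = \frac{S_P}{P} = \frac{T_1}{P\,T_P} \qquad \text{(ideally } S_P = P \text{ and } E_P = 1\text{)}
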
Define scaled speedup:   Most intuitive. It tries to increase the size of the problem to match the size of the machine. This can be very hard if not impossible in practice due to the non-linearity of memory requirements or the algorithm.  
Define fixed size speedup:   This tries to find a problem size that can be used to compare the program fairly. This can be hard due to the fact that a small problem may not scale well to a large machine and vice versa.  
Define overhead:   Any cost incurred in the parallel solution but not in the serial one. This can include the time required to create and destroy threads, inter-thread communication and synchronisation.
What is non-parallelisable computation?   If a computation is inherently sequential, then more processors will not help.  
What is Amdahl's law?   If a fraction 1/S of a computation is inherently sequential, then the maximum possible speedup is S.
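
Written as a formula (with f = 1/S the inherently sequential fraction of the work and P processors), Amdahl's law is:

S_P = \frac{1}{f + (1 - f)/P} \;\le\; \frac{1}{f} = S
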
What does "Contention for resources" mean?   Degradation of system performance caused by competition for a shared resource. It can cause parallel programs to become slower as more processors are added.  
Why are idle processors a source of performance loss?   Ideally all processors will be working all of the time, but a thread / process may not be able to proceed because of: A lack of work, or waiting for some external event such as the arrival of new work.  
Define latency hiding (overlapping communication and computation):   If you can identify some computation that is independent of the communication, then you can: initiate the communication, do the independent computation while you wait, then complete the communication and continue. Usually has few or no costs.
Describe the shared memory reference mechanism:   A single logical address space, accessed by all processors. All processors see consistent values when accessing memory, though different systems provide different levels of consistency.
What is sequential shared memory?   As if all memory accesses occur in a strict order. Least efficient.
What is relaxed shared memory?   Different processors might "see" different memory accesses under some circumstances.
Describe the one-sided communication memory reference mechanism:   A single, logical address space accessed by all processors. One processor gets and sets values in another processor's memory. There is no co-ordination between processors. A form of relaxed consistency, meaning additional synchronisation will be needed.
Describe the message passing memory reference mechanism:   No shared address space. To receive non-local information processors send and receive messages. In the program, access to non-local information is handled completely differently from access to local memory.
What is a data parallel computation?   A computation that performs the same operation(s) to different items of data at the same time. Potential parallelism grows with the size of the data.  
What is a task parallel computation?   A computation that performs distinct computations (tasks) at the same time. The set of tasks is fixed and so parallelism is not scalable. One example of this is pipelining, where a series of tasks are solved in sequence.  
Define dependence:   An ordering relationship between two computations, which must be observed for correct results.  
What is Flow dependence?   Read after write. A true dependence, equivalent to the reader waiting for a message from the writer.  
What is an Anti-dependence?   Write after read. A false dependence caused by re-using memory locations. Can be eliminated by using separate memory (e.g. different variables)  
What is an Output dependence?   Write after write. Same as anti-dependence.  
What is an Input dependence?   Read after read. Does NOT impose an ordering constraint.  
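
A small C fragment (hypothetical statements S1-S4 on made-up variables) illustrating the four kinds of dependence above:

void dependences(void) {
    int a, b = 1, c, d;
    a = b + 1;   /* S1: writes a */
    c = a * 2;   /* S2: reads a  -> flow (true) dependence on S1: read after write */
    a = 7;       /* S3: writes a -> anti-dependence on S2 (write after read)
                        and output dependence on S1 (write after write) */
    d = b + c;   /* S4: reads b  -> only an input dependence with S1 (read after read),
                        which imposes no ordering constraint */
    (void)a; (void)d;
}
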
Define granularity:   The granularity of a parallel computation is how much work (the number of instructions) can be done within a single thread/process between each interaction with another thread/process.  
What is fine grained granularity?   Few instructions between interactions, interaction is frequent.  
What is coarse grained granularity?   Many instructions between interactions, interaction is infrequent.  
Batching is a method of increasing granularity (coarsening the computation), what is it?   Performing work as a group. Makes computation more coarse-grained by reducing the frequency of interaction. Only makes sense if there are still enough chunks of work for all the processors and the individual tasks don't have dependencies with other tasks.
Over-dividing work is a method of decreasing granularity (making the computation finer-grained), what is it?   Dividing the work into more, smaller units makes computation more fine-grained, since interaction is needed for at least every unit of work. This can make it easier to keep all processors busy. Especially useful if units of work are variable.
What is Privatisation in terms of locality?   Rather than threads competing to access a single shared variable, give each thread its own separate copy that can be used independently.
What is Padding in terms of locality?   Variables that are close together in memory can be cached together. Extra padding can break this dependency.  
What is a Redundant Computation?   Each thread calculates the same value locally, rather than one thread calculating it and communicating the value to each thread, increasing locality. This is useful if each thread cannot progress until it has this value, as it may as well do it itself.  
What is the execution model for OpenMP?   It uses a thread fork-join execution model: threads are created to run a task or tasks in parallel and then join when the task is complete.
What is the memory model for OpenMP?   It has a relaxed-consistency shared memory model. All threads can read/write a common shared memory, and each thread can also have its own temporary view of memory. Each thread can also have its own private memory.
What options are available for controlling concurrency in OpenMP?   #pragma omp critical, #pragma omp atomic, #pragma omp single, #pragma omp barrier
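
A minimal C/OpenMP sketch (an assumed example, not from the course) exercising the four constructs listed above:

#include <stdio.h>
#include <omp.h>

int main(void) {
    long shared_sum = 0, hits = 0;
    #pragma omp parallel
    {
        #pragma omp single          /* executed by exactly one thread */
        printf("running with %d threads\n", omp_get_num_threads());

        #pragma omp critical        /* one thread at a time in this block */
        shared_sum += omp_get_thread_num();

        #pragma omp atomic          /* cheap exclusion for a single memory update */
        hits++;

        #pragma omp barrier         /* every thread waits here before continuing */
        /* ... work that relies on shared_sum and hits being complete ... */
    }
    printf("sum=%ld hits=%ld\n", shared_sum, hits);
    return 0;
}
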
Describe the Schwartz algorithm:   Consider a tree operation (e.g. a +/- reduce) performed by P parallel processes on n data items, where P < n: the tree should connect the P parallel processes rather than the n data items. This will minimise communication, co-ordination and overhead.
Discuss the Schwartz algorithm:   The best way to go if each of the chunks is consistently balanced. An application of the locality rule to tree algorithms. The most important aspect is having each of the processes perform a balanced share of the computation locally.
Describe a generalised reduce:   The intermediate or tally value calculated by each thread can be a different type from the data items. The global summary type can be a different type from the tally values. They need not be single values, they could be compound values or an array.  
How does a reduce algorithm implemented in a Schwartz style work?   Each thread combines a portion of the data to produce its own value. The values from each thread are then combined in a tree to produce the global value.
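
A sketch of that pattern in C/OpenMP (an assumed example): each thread accumulates a local tally over its share of the data, and the per-thread tallies are then combined. Here the combine step uses a critical section rather than an explicit tree, which is adequate for a small number of threads.

#include <omp.h>
#include <stddef.h>

double schwartz_sum(const double *data, size_t n) {
    double global = 0.0;
    #pragma omp parallel
    {
        double tally = 0.0;                 /* local, per-thread intermediate */
        #pragma omp for nowait
        for (long i = 0; i < (long)n; i++)
            tally += data[i];               /* bulk of the work stays local */
        #pragma omp critical                /* combine one tally per thread */
        global += tally;
    }
    return global;
}
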
Describe a Scan:   Another common parallel operation for calculating local information that depends on non-local information. Calculates a value at each point in the array that depends on the values before it.  
How is a scan calculated in parallel?   Requires two passes. An upward pass like reduce, calculates intermediates. A downward pass that combines and distributes intermediate values back from the top of the tree to individual threads.  
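
A simplified block-based version of the two-pass idea in C/OpenMP (an assumed sketch; the card describes a tree, this uses per-thread blocks): pass one computes each thread's block sum, a short serial step turns the block sums into starting offsets, and pass two has each thread scan its own block from its offset.

#include <omp.h>

#define MAX_THREADS 256   /* assumed upper bound on the thread count */

/* Inclusive prefix sum of in[0..n-1] into out[0..n-1]. */
void parallel_scan(const long *in, long *out, long n) {
    long block_sum[MAX_THREADS + 1] = {0};
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int p  = omp_get_num_threads();
        long lo = n *  id      / p;           /* this thread's block [lo, hi) */
        long hi = n * (id + 1) / p;

        long s = 0;                           /* pass 1: local block sum */
        for (long i = lo; i < hi; i++) s += in[i];
        block_sum[id + 1] = s;

        #pragma omp barrier
        #pragma omp single                    /* serial: exclusive scan of block sums */
        for (int t = 1; t <= p; t++) block_sum[t] += block_sum[t - 1];

        long run = block_sum[id];             /* pass 2: scan own block from its offset */
        for (long i = lo; i < hi; i++) {
            run += in[i];
            out[i] = run;
        }
    }
}
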
Describe overlap regions in static work allocation:   Block allocation still requires some non-local references. It is usually better to explicitly fetch and cache the required overlap regions first, then perform the computation entirely locally. This also reduces overhead.
Describe block cyclic allocations in static work allocation:   Useful where work is not proportional to the amount of data. Each process is allocated many smaller blocks of data spread across the entire array rather than large chunks. Meaning on average each process should end up with equal amounts of work.  
Describe a work queue in dynamic work allocation:   A shared data structure that holds the definitions of the currently unallocated tasks. Blocks of data to be processed. Worker threads repeatedly take a new task, execute it and update the global state. The queue holds pointers rather than copying.  
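
A minimal C/OpenMP sketch of such a queue (assumed example; process_block is a hypothetical task function): the "queue" is just a shared index of the next unclaimed block, which each worker claims atomically until no work remains.

#include <omp.h>

void run_queue(int num_blocks, void (*process_block)(int)) {
    int next = 0;                           /* shared: next unallocated task */
    #pragma omp parallel
    {
        for (;;) {
            int mine;
            #pragma omp atomic capture      /* atomically claim one task index */
            mine = next++;
            if (mine >= num_blocks) break;  /* queue exhausted */
            process_block(mine);            /* do the work for this block */
        }
    }
}
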
What is the GPU?   Graphics Processing Unit. The part of the video rendering system of a computer that is specialised to assist with the rendering of 3D graphics and visual effects. Performance depends on fast access to video memory.
What is a general purpose GPU?   Allow developers to control the computations performed on the GPUs.  
How did NVIDIA's GeForce 3 allow general purpose GPU computing?   Computations were written in specialised graphics shading languages. Input data had to be stored in texture images and issued to the GPU by submitting triangles. Output had to be cast as a set of pixels generated from the raster operations.  
What is CUDA?   An architecture that includes components designed strictly for GPU computing. Every ALU on the chip can be used for this. The ALUs implement IEEE single-precision floating-point arithmetic and include instruction sets suitable for general computation.
What is OpenCL?   An alternative to CUDA. The open standard for parallel programming of heterogeneous systems.  
Describe a GPU's architecture:   A larger chip area can be used to implement ALUs, allowing more floating point operations to be done at once. It has a simpler and more restrictive memory model for "normal" use.
Describe the interaction between the CPU and GPU:   The CPU is the host and runs the main program. The GPU is a computer device which is attached to and managed by the host. They work in parallel with each other, communicating asynchronously via a command queue / buffer.  
What is the basic unit of parallel execution?   A kernel, equivalent to a C/C++ function. The host compiles and sends kernel(s) to the GPU. The GPU can execute many copies of a kernel at once, each on different data (work item). The GPU schedules work items like small threads on available SPUs.  
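
For illustration, a minimal kernel written in OpenCL C (a C dialect); the buffers a, b, c and the size n are assumed to have been set up and passed by the host, and each work item handles one element identified by get_global_id:

/* Hypothetical element-wise addition kernel: one work item per element. */
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c,
                      const unsigned int n)
{
    size_t i = get_global_id(0);   /* this work item's index in the global range */
    if (i < n)                     /* guard: the global range may be rounded up */
        c[i] = a[i] + b[i];
}
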
Describe kernel memory:   Kernels do not and often cannot access main memory. Each kernel works with its own private memory plus some local memory shared by a group of work items. The application must move data explicitly between host, device, global and local memory.
What is MPI?   An interface library that allows messages to be sent between different parts of a program, potentially across a network.
Describe MPI's memory model:   It assumes each process' memory is disjoint (i.e. no hardware shared memory), exactly as in the CTA. All communication is via message passing, typically over standard network technology. It defines a set of standard data types, automatically converting between them.
Describe the process of message passing:   The sending and receiving of a message from one process to another. It is logically blocking, meaning the rest of the program cannot run until the message has been sent or received. A message is sent using the rank of the destination process. A true dependency.
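
A minimal C/MPI sketch (assumed example) of blocking point-to-point message passing between ranks 0 and 1:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        value = 42;
        /* blocking send to the process with rank 1, tag 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* blocking receive from rank 0: cannot proceed until it arrives */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
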
What are the limitations of MPI?   Sending and receiving messages have significant overheads compared to shared memory access. It is important to batch and organise communication. Each individual process still needs to be optimised for best parallel performance, e.g. using OpenMP in them.  
What is a group?   An ordered set of processes.  
What is a communicator?   A scoping mechanism that defines a set of processes that can communicate with one another.  
Describe the relationship between groups and communicators:   Each communicator includes a group. Each process in a group is assigned a unique rank / ID. A process can belong to multiple groups. Communicators can be dynamically defined. A topology can be associated with a communicator.  
Define a broadcast:   Copy value(s) from one process to all processes associated with the communicator.  
Define a scatter:   Distribute an array of values between all processes in the communicator.  
Define a gather:   Assemble an array of values from sub-arrays in each process.  
Define an all-gather:   Equivalent to a gather followed by a broadcast, so every process ends up with the combined array.
Define a complete exchange:   Exchanges parts of the array on all processes.  
What is the scope of a collective communication?   A single communicator. If only a subset of processes should take part in a collective communication then a new communicator will have to be defined from a defined sub-group of the processes.  
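
A short C/MPI sketch (assumed example) combining three of the collectives above on MPI_COMM_WORLD: broadcast a parameter, scatter a root-owned array into per-process chunks, do some local work, and gather the results back.

#include <mpi.h>
#include <stdlib.h>

#define CHUNK 4   /* assumed number of elements per process */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int scale = (rank == 0) ? 3 : 0;
    MPI_Bcast(&scale, 1, MPI_INT, 0, MPI_COMM_WORLD);       /* everyone gets scale */

    int *all = NULL;
    if (rank == 0) {                                         /* root owns the full array */
        all = malloc(size * CHUNK * sizeof(int));
        for (int i = 0; i < size * CHUNK; i++) all[i] = i;
    }
    int mine[CHUNK];
    MPI_Scatter(all, CHUNK, MPI_INT, mine, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);

    for (int i = 0; i < CHUNK; i++) mine[i] *= scale;        /* local work */

    MPI_Gather(mine, CHUNK, MPI_INT, all, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) free(all);
    MPI_Finalize();
    return 0;
}
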
Describe asynchronous message passing:   The actual send/receive happens in the background, with the send or receive command returning immediately. Until the operation completes, the caller cannot safely reuse or read the data buffer. It can check the progress of the operation or wait until it has finished.
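
A C/MPI sketch of this pattern (assumed example): start a non-blocking send, overlap it with independent computation, then wait before reusing the buffer.

#include <mpi.h>

/* Stub standing in for computation that does not depend on the message. */
static void independent_work(void) { /* ... */ }

void async_send(double *buf, int count, int dest) {
    MPI_Request req;
    /* returns immediately; the transfer proceeds in the background */
    MPI_Isend(buf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);

    independent_work();              /* overlap communication with computation */

    /* buf must not be reused until the send has completed
       (MPI_Test could be used instead to poll for progress) */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
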
What is particularly important in an MPI program?   The principles of scalable parallel program design, due to the high cost of communication. (e.g. locality and batching communication)  
What is a distributed system?   A system of multiple, autonomous computers communicating through a network in which each computer is a node. The computers interact to achieve a common goal such as providing a common service.  
What are the four architectures associated with distributed systems?   Client-server, the most common type (e.g. www). Multi-tier, one group of servers are clients of another (e.g. a database or application server). Tightly coupled, a cluster of machines that work closely together running a shared process in parallel. P2P.  
What are the similarities between parallel processing and distributed systems?   Both are concurrent, the programs need to express concurrency and concurrency control. Parallel tasks needs to be identified and managed in both. Large and diverse parallel systems also need to be designed as distributed systems.  
What are the differences between parallel processing and distributed systems?   Parallel processing includes both shared and disjoint memory whereas distributed systems are almost all disjoint. Parallel processing assumes computers and networks will not fail; distributed systems assume they may fail, and so on.
What is heterogeneity?   Computers in a distributed network could have varied and different networks, hardware, OSs and programming languages.  
What masks network differences?   Internet protocols, all computers on the internet need to include an implementation of the protocol.  
What masks programming language differences?   Middleware is a software abstraction layer. Mobile code can be transferred from one computer to another.  
What is openness?   Whether a system can be extended and re-implemented in various ways. Key software and/or protocol interfaces need to be agreed, published and used.  
Describe the security concerns of a distributed system:   A node of a distributed system is by definition, networked. Hence they are vulnerable to attack from the network. Many resources managed by a distributed system have a high intrinsic value to their users  
Describe the scalability concerns of a distributed system:   Distributed systems have to operate efficiently and effectively at many different scales (i.e. from a small intranet to the internet). A system is scalable if it will remain effective when there is a significant increase in the number of resources and/or users.
What is failure handling in regards to a distributed system?   Failures in a distributed system are usually partial, some components fail whilst others continue to function - this makes it quite difficult to detect and handle them.  
What does masking a failure involve?   Some detected failures can be hidden or made less severe. An example of this being the re-transmission of messages that failed to arrive.  
What does tolerating a failure involve?   Using redundant components to handle the failure. (e.g. DNS replication) Servers redirect traffic if a fault is detected.  
What does recovering from a failure involve?   Design of software so that the state of permanent data can be recovered or "rolled back" to a previous version.  
What is concurrency in regards to a distributed system?   Services and applications generally allow multiple client requests to be processed concurrently.  
What is transparency in regards to a distributed system?   The concealment of the separation of components from the user, so that the system is perceived as a whole rather than a collection of independent components.  
What is access transparency?   Enables local and remote resources to be accessed using identical operations.  
What is location transparency?   Enables resources to be accessed without knowledge of their physical / network location.
How does reliability help measure the quality of service?   Ability to consistently conform to its specification even in the face of errors.  
How does performance help measure the quality of service?   Ability to meet timeliness requirements. QoS may be guaranteed, which implies careful planning and resource management. It may also be best effort, which implies the requirements may not always be met.


   

Created by: Sparksy