High-Performance Computing Technologies

Lecturer: Yuri Grygorovych Gordienko

Lecture 1. Introduction

date: Wed Feb 15 13:47:01 EET 2017

Historical info

Parallel processing cannot be replaced by sequential processing because of Moore’s Law

Number of transistors in a dense integrated circuit doubles approximately every two years.

Note

See also Kurzweil’s extension of Moore’s Law

Storage law
storage capacity increases 2x every 12 months
Gilder’s law
network (optical) bandwidth increases 2x every 9 months
Distributed computing
a collection of independent computers that appears to its users as a single coherent system

Leslie Lamport’s test for a distributed system

You know you have a distributed system when the crash of a computer you’ve never heard of stops you from getting any work done

Differences between parallel and distributed systems

Criteria        Parallel System                             Distributed System
Connectivity    Tightly coupled                             Loosely coupled
Memory access   Uses shared memory to exchange information  Uses message passing as each CPU has its own memory
Granularity     Finer grained                               Coarse grained

Usages

  • Strategic systems
  • Visualisation and Graphics
  • Economics and Finance
  • Scientific computing
    • Physics (LHC)
    • Bioinformatics (protein-docking)
    • Geology (seismography)
    • Astronomy (simulation of galaxy)

Models

Distributed computations are characterised by models.

  • Architectural models

    Flynn’s taxonomy distinguishes the following categories:

    • SISD: traditional uniprocessor computers
    • MISD: Space shuttle flight control computer
    • SIMD: array processor, GPU
    • MIMD: parallel and distributed systems

    There are different architectural-service models as well:

    • Centralised (mainframe, cluster)
    • Client-server (mail, banking, computations)
    • Multi-tier (grid, DNS)
    • Peer-to-peer (file exchange, computations)
  • Interaction models

    Interaction models give answers to the following questions:

    • How do we handle time?
    • Are there time limits?

    There are two major models:

    • Synchronous
    • Asynchronous
  • Fault models

    The crucial question here is:

    What kind of faults can occur

    • Omission faults (a processor or communication channel fails to perform what it is supposed to do)
    • Timing faults (in synchronous distributed systems)
    • Arbitrary (Byzantine) faults (anything may have happened)

Distributed computing

Advantages

  • Performance
  • Reliability
  • Distribution
  • Incremental growth
  • Sharing computation/data/resources/management
  • Communication
  • Economics
  • Flexibility

Disadvantages

  • Heterogeneity (hardware, software, operation, etc)
  • Software development
  • Networking
  • Incremental growth (scalability is a pain)

Pitfalls

  • The network is NOT reliable
  • The network is NOT secure
  • The network is NOT homogeneous
  • The topology is NOT constant
  • Latency is NOT zero
  • Bandwidth is NOT infinite
  • Transport cost is NOT zero
  • There is NO single administrator

Design

Main characteristics:

  • Transparency

    How to give the impression that the collection of machines is a “simple” single computer?

    • Access
    • Location
    • Migration
    • Replication
    • Concurrency
    • Failure
    • Performance
  • Scalability

  • Performance

    • Performance of individual workstations
    • Speed of the communication infrastructure
    • Extent of reliability
    • Flexibility in workload allocations (i.e. idle processors should be allocated automatically to a user’s task)
  • Heterogeneity

    • different hardware
    • different software
    • various devices (PCs, mobiles, ATM-machines, sensors, etc)
    • diverse networks and protocols

Cluster and Grid computing

Cluster computing:
 a collection of high-end computers, usually closely connected through a LAN
  • Homogeneous: OS, hardware
  • Work: together like a single computer
  • Applications are hosted on one machine and user machines connect to it; clients connect via terminals

High-performance computing center at KPI

Grid computing:
 a collection of clusters, which may be combined into a “GRID” of massive computing power
  • Heterogeneous
  • Work: for collaboration, grids use virtual organizations

Lecture 2. Parallel computing

Parallel computing
a form of computation in which many calculations are carried out simultaneously.

There are different levels of parallel computing:

  • Instruction

    a single operation of a processor

  • Thread

    a stream of execution (consisting of one or more instructions)

  • Task

  • Process

Levels of parallel computing

  • Task level
  • Instruction level
  • Bit level

Classification of systems

  • Flynn’s taxonomy
  • Memory access
    • shared memory
      • centralized (SMP)
      • distributed (NUMA)
    • individual memory - distributed

UMA (Uniform memory access)

  • equal access rights
  • equal memory access time

NUMA

  • Usually two or more SMPs are physically linked so that they can directly access each other’s memory
  • Not all processors have equal memory access time
  • Memory access across the link is much slower

Distributed

  • Each CPU has its own local memory, and changes in it are not visible to other CPUs
  • Processors are connected by a network
  • The program must define a way to transfer data between processors

Massively parallel processors

The MPP architecture consists of nodes, each having its own processor, memory and I/O subsystem.

Programming models

Programming model
an abstraction of the computer system that allows parallel algorithms to be expressed in some form
  • Shared memory model
    • Processors read and write variables stored in a shared address space asynchronously
    • Access to the shared memory is controlled by synchronization mechanisms (locks/semaphores)
  • Threads model
  • Data parallelization
  • Message passing

Aspect               Shared memory          Message passing
Communication        Implicit               Explicit messages
Synchronization      Explicit               Implicit (via messages)
Hardware support     Typically required     None
Development effort   Lower                  Higher
Tuning effort        Higher                 Lower
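
As a rough illustration of the two styles in the table above, here is a minimal Python sketch (not part of the lecture; the function names shared_sum and message_sum are made up). The shared-memory variant lets threads update a common total under a lock, while the message-passing variant keeps data private per process and sends partial results through a queue:

    import threading
    import multiprocessing as mp


    def shared_sum(data, n_threads=4):
        """Shared-memory style: threads update a common total under a lock."""
        total = 0
        lock = threading.Lock()

        def worker(chunk):
            nonlocal total
            partial = sum(chunk)        # local computation
            with lock:                  # explicit synchronization
                total += partial        # communication is implicit (shared variable)

        chunks = [data[i::n_threads] for i in range(n_threads)]
        threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return total


    def _worker(chunk, queue):
        queue.put(sum(chunk))           # communication is explicit (send a message)


    def message_sum(data, n_procs=4):
        """Message-passing style: each process has private data and sends its result back."""
        queue = mp.Queue()
        chunks = [data[i::n_procs] for i in range(n_procs)]
        procs = [mp.Process(target=_worker, args=(c, queue)) for c in chunks]
        for p in procs:
            p.start()
        total = sum(queue.get() for _ in procs)  # receiving the messages synchronizes implicitly
        for p in procs:
            p.join()
        return total


    if __name__ == "__main__":
        data = list(range(100_000))
        assert shared_sum(data) == message_sum(data) == sum(data)
        print("both models give", sum(data))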

Implementation of parallelism

Load balancing

To distribute work among all tasks so that they are all kept busy all of the time.

Ways to achieve it:

  • Adequate partitioning
  • Dynamic work assignment (see the task-pool sketch after the note below)
    • Scheduler / task pool
    • Algorithm to detect and handle imbalances

Note

If barrier synchronization is used, then the slowest task determines the execution time.
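
A minimal sketch of dynamic work assignment through a task pool (an assumed example, not from the lecture): Python’s ProcessPoolExecutor hands each pending task to whichever worker becomes free, so a few very large tasks do not leave the other workers idle:

    from concurrent.futures import ProcessPoolExecutor, as_completed
    import math


    def expensive_task(n):
        """Work whose cost depends on n, so task sizes are very uneven."""
        return sum(math.sqrt(i) for i in range(n))


    if __name__ == "__main__":
        # Tasks of very different sizes: a static block partition would leave some
        # workers idle while the slowest one finishes (cf. the barrier note above).
        tasks = [10_000, 2_000_000, 50_000, 1_500_000, 30_000, 900_000]

        # The pool acts as a scheduler/task pool: each worker pulls the next
        # pending task as soon as it becomes free (dynamic work assignment).
        with ProcessPoolExecutor(max_workers=3) as pool:
            futures = [pool.submit(expensive_task, n) for n in tasks]
            results = [f.result() for f in as_completed(futures)]

        print("processed", len(results), "tasks")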

Granularity

computation/communication ratio

Fine-grained parallelism:
 few computation events are done between communication events

  • High communication overhead
  • Small opportunity to enhance performance
Coarse-grained parallelism:
 many computational events are done between communication events

  • Large opportunity to enhance performance
  • Harder to do load balancing efficiently

Amdahl’s law

  • Suppose that the sequential execution of a program takes \(T_1\) time units and the parallel execution on \(p\) processors takes \(T_p\) time units

  • Suppose that out of the entire execution of the program, \(s\) fraction of it is not parallelizable while \(1-s\) fraction is parallelizable

  • Then the speedup:

    \[\frac{T_1}{T_p} = \frac{T_1}{T_1 \cdot s + T_1 \cdot \frac{1 - s}{p}} = \frac{1}{s + \frac{1 - s}{p}}\]
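
A quick numeric check of the formula (a sketch with an illustrative serial fraction, not lecture material): as p grows, the speedup saturates at 1/s:

    def amdahl_speedup(s, p):
        """Amdahl's Law: speedup for serial fraction s on p processors."""
        return 1.0 / (s + (1.0 - s) / p)


    if __name__ == "__main__":
        s = 0.05  # illustrative value: 5% of the program is not parallelizable
        for p in (1, 2, 4, 8, 16, 64, 1024):
            print(f"p = {p:5d}   speedup = {amdahl_speedup(s, p):6.2f}")
        # As p grows the speedup approaches 1/s = 20, no matter how many
        # processors are added.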

Note

  • Amdahl’s Law is too simple for real cases
  • The communication overhead and workload balance among processes (in general) should be taken into account

There are other laws of parallel computing performance:

  • Gustafson’s Law (1988)
    another way to evaluate the performance of a parallel program
  • Karp-Flatt Metric (1990)
    whether the principal barrier to the program speedup is the amount of inherently sequential code or parallel overhead
  • Isoefficiency (isogranularity) metric
    the scalability of a parallel algorithm executing on parallel systems

Performance guidelines

  • Maximize the fraction of our program that can be parallelized
  • Balance the workload of parallel processes
  • Minimize the time spent for communication

Lecture 3. Metrics

  • Time

  • Speedup

    \[Speedup = \Psi(n,p) = \frac{\text{sequential execution time}}{\text{parallel execution time}} = \frac{t_s}{t_p}\]
  • Efficiency

    a measure of processor utilisation: the speedup divided by the number of processors

    \[Efficiency = \varepsilon(n,p) = \frac{\text{Speedup}}{\text{Processors}}\]

    Note that

    \[\text{speedup} \leq \text{processors}\]

    Since \(\text{speedup} \geq 0\) and \(\text{processors} > 1\), it follows that

    \[0 \leq \varepsilon(n,p) \leq 1\]

    However there are superlinear algorithms, when

    \[\text{speedup} > \text{processors}\]

    and for this case

    \[\varepsilon(n,p) > 1\]
  • Cost

    \[\text{cost} = \text{parallel running time} \cdot \text{processors}\]
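
The three metrics above can be computed directly from measured run times; the sketch below uses invented timings for illustration:

    def metrics(t_seq, t_par, p):
        """Speedup, efficiency and cost from measured run times on p processors."""
        speedup = t_seq / t_par
        efficiency = speedup / p
        cost = t_par * p
        return speedup, efficiency, cost


    if __name__ == "__main__":
        # Hypothetical measurements: 120 s sequentially, 20 s on 8 processors.
        s, e, c = metrics(t_seq=120.0, t_par=20.0, p=8)
        print(f"speedup = {s:.1f}, efficiency = {e:.2f}, cost = {c:.0f} processor-seconds")
        # speedup = 6.0, efficiency = 0.75, cost = 160 processor-seconds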

Types of computing problems

Embarrassingly parallel problems:
 require little or no effort to separate into a number of parallel tasks. They are thus well suited to large internet-based distributed platforms and do not suffer from parallel slowdown. They require little or no communication of results between tasks.
Distributed computing problems:
 require communication between tasks, especially communication of intermediate results.
Inherently serial computing problems:
 cannot be parallelized at all. They are the diametric opposite of embarrassingly parallel problems.

More General Speedup Formula

A better version of Amdahl’s Law:

\[\Psi(n,p) \leq \frac{\sigma(n) + \phi(n)}{\sigma(n) + \frac{\phi(n)}{p} + \kappa(n,p)}\]

where \(\sigma(n)\) is the inherently sequential computation, \(\phi(n)\) is the parallelizable computation and \(\kappa(n,p)\) is the parallelization (communication) overhead.

Speedup is an increasing function of problem size.
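
As a sketch (the cost functions sigma, phi and kappa below are made-up placeholders, not a lecture example), the bound can be evaluated numerically to see both effects: for a fixed n the overhead term eventually limits the speedup, while a larger n raises the achievable speedup:

    import math

    # Hypothetical cost functions of an imaginary algorithm (placeholders only):
    def sigma(n):          # inherently sequential computation
        return n

    def phi(n):            # parallelizable computation
        return n * n

    def kappa(n, p):       # communication overhead
        return n * math.log2(p)


    def speedup_bound(n, p):
        """Upper bound on speedup from the generalized formula."""
        return (sigma(n) + phi(n)) / (sigma(n) + phi(n) / p + kappa(n, p))


    if __name__ == "__main__":
        for n in (100, 1_000, 10_000):
            row = ", ".join(f"{speedup_bound(n, p):.1f}" for p in (2, 8, 32, 128))
            print(f"n = {n:6d}: speedup bounds for p = 2, 8, 32, 128 -> {row}")
        # Larger n gives a higher bound for the same p: speedup is an increasing
        # function of problem size.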

Karp-Flatt Metric

  • analyze parallel program performance
  • predict speedup with additional processors

Start with the speedup formula

\[\Psi(n,p) \leq \frac{\sigma(n) + \phi(n)}{\sigma(n) + \frac{\phi(n)}{p} + \kappa(n,p)}\]

The experimentally determined serial fraction \(e\) is a function of the speedup and the number of processors:

\[e = \frac{1/\Psi - 1/p}{1 - 1/p}\]

From this we can express \(\Psi\) in terms of \(e\) and \(p\):

\[\Psi = \frac{p}{e \cdot (p - 1) + 1}\]

Interpretation of \(e\)

  • if \(e\) remains roughly constant as the number of processors increases, then the speedup is constrained by the inherently sequential component
  • if \(e\) grows with the number of processors, then parallel overhead is the principal barrier to speedup
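
A minimal sketch of the metric in use (the measured speedups are invented for illustration): compute e for each processor count and check how it behaves as p grows:

    def karp_flatt(speedup, p):
        """Experimentally determined serial fraction e from a measured speedup on p processors."""
        return (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p)


    if __name__ == "__main__":
        # Invented measurements: speedups observed on 2, 4, 8 and 16 processors.
        measured = {2: 1.9, 4: 3.5, 8: 6.0, 16: 9.5}
        for p, psi in measured.items():
            print(f"p = {p:2d}   psi = {psi:4.1f}   e = {karp_flatt(psi, p):.3f}")
        # If e stays roughly constant, the inherently sequential code limits the
        # speedup; if e grows with p, parallel overhead is the principal barrier.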

Isoefficiency metric

  • n - data size
  • p - num processes
  • T(n,p) – execution time using p processors
  • \(\Psi(n,p)\) – speedup
  • \(T_0(n,p)\) – the total overhead: time spent by processes doing work not done by the sequential algorithm

    \[T_0(n,p) = (p - 1) \cdot \sigma(n) + p \cdot \kappa(n,p)\]

  • For example, \(T(n,1)\) is the sequential execution time
  • We want the algorithm to maintain a constant level of efficiency as the data size n increases

Main steps to derivation

  • begin with the speedup formula
  • compute the total amount of overhead
  • assume the efficiency is to be kept constant; this gives the isoefficiency relation

\[T(n,1) \geq C \cdot T_0(n,p)\]

where

\[C = \frac{\varepsilon(n,p)}{1 - \varepsilon(n,p)}\]
  • The isoefficiency relation is used to determine the maximum number of CPUs for which a given level of efficiency can be maintained
  • How to maintain a given efficiency? – Increase the problem size when the number of processors increases
  • The maximum problem size we can solve is limited by the available amount of memory
  • Usually for most parallel systems the memory size \(M\) is a constant multiple of the number of processors

Scalability function

  • To maintain efficiency \(\varepsilon(n,p)\) when increasing \(p\) we must increase \(n\)
  • Max problem size is limited by available memory \(M\)
  • The scalability function \(scale(p) = M(f(p))/p\) shows how the memory usage per processor must grow to maintain efficiency, where \(n = f(p)\) is the problem size obtained from the isoefficiency relation and \(M(n)\) is the memory required for a problem of size \(n\)
  • If the scalability function is constant, the parallel system is perfectly scalable
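
A rough numeric sketch under an assumed cost model (the functions below are placeholders, not a lecture example): for each p, find the smallest n that satisfies the isoefficiency relation \(T(n,1) \geq C \cdot T_0(n,p)\), then report the memory needed per processor \(M(n)/p\):

    import math

    # Assumed cost model (placeholders only): sequential time, overhead and memory.
    def t_seq(n):              # T(n,1): sequential execution time
        return n * math.log2(n)

    def t_overhead(n, p):      # T_0(n,p): total overhead of the parallel version
        return p * math.log2(p) * math.log2(n)

    def memory(n):             # M(n): memory needed for a problem of size n
        return n


    def min_problem_size(p, eff=0.8):
        """Smallest n (found by doubling) with T(n,1) >= C * T_0(n,p), C = eff / (1 - eff)."""
        c = eff / (1.0 - eff)
        n = 2
        while t_seq(n) < c * t_overhead(n, p):
            n *= 2
        return n


    if __name__ == "__main__":
        for p in (2, 4, 8, 16, 32, 64):
            n = min_problem_size(p)
            print(f"p = {p:3d}: n >= {n:7d}, memory per processor = {memory(n) / p:9.1f}")
        # If the memory needed per processor grows with p, the parallel system is
        # not perfectly scalable; a constant value would mean perfect scalability.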

Examples

Reduction task: collects the answers to all the subproblems and combines them in some way to form the output
Floyd-Warshall Algorithm:
 a graph analysis algorithm for finding shortest paths in a weighted graph
Finite difference method:
 numerical methods for approximating the solutions to differential equations using finite difference equations to approximate derivatives
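
To make the Floyd-Warshall example concrete, here is a plain sequential sketch in Python (the small graph is invented for illustration); the k loop carries a dependency between iterations, while the inner i/j loops are the part that parallel versions distribute across processors:

    INF = float("inf")


    def floyd_warshall(dist):
        """All-pairs shortest paths; dist[i][j] is the edge weight (INF if no edge)."""
        n = len(dist)
        d = [row[:] for row in dist]        # work on a copy of the distance matrix
        for k in range(n):                  # the k iterations depend on each other
            for i in range(n):              # the i/j loops are independent and
                for j in range(n):          # can be distributed across processors
                    if d[i][k] + d[k][j] < d[i][j]:
                        d[i][j] = d[i][k] + d[k][j]
        return d


    if __name__ == "__main__":
        # A small weighted directed graph, invented for illustration.
        graph = [
            [0,   3,   INF, 7],
            [8,   0,   2,   INF],
            [5,   INF, 0,   1],
            [2,   INF, INF, 0],
        ]
        for row in floyd_warshall(graph):
            print(row)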

Lecture 4

P2P

Pro

  • No central point of failure

Cons

  • Decentralised
  • Nodes are not equal
  • Hard to program

Volunteer computing

Public-resource computing:
 combines the resources of personal computers and game consoles belonging to the general public to perform scientific computations

Started with

  • Great Internet Mersenne Prime Search (GIMPS, 1996)
  • Distributed.net (1997)