High-Performance Computing Technologies¶
Lecturer: Yuri Grygorovych Gordienko
Lecture 1. Introduction¶
date: Wed Feb 15 13:47:01 EET 2017
Historical info¶
Parallel processing cannot be replaced by sequential processing because of Moore's Law:
Number of transistors in a dense integrated circuit doubles approximately every two years.
Note
See also Kurzweil's extension of Moore's law:
- Storage law
- storage capacity increases 2x every 12 months
- Gilder’s law
- network (optical) bandwidth increases 2x every 9 months
- Distributed computing
- a collection of independent computers that appears to its users as a single coherent system
Leslie Lamport's test for a distributed system:
You know you have a distributed system when the crash of a computer you’ve never heard of stops you from getting any work done
Differences between parallel and distributed systems¶
Criteria | Parallel System | Distributed System |
---|---|---|
Connectivity | Tightly coupled | Loosely coupled |
Memory access | Uses shared memory to exchange information | Uses message passing as each CPU has its own memory |
Granularity | Finer grained | Coarse grained |
Usages¶
- Strategic systems
- Visualisation and Graphics
- Economics and Finance
- Scientific computing
- Physics (LHC)
- Bioinformatics (protein-docking)
- Geology (seismography)
- Astronomy (galaxy simulations)
Models¶
Distributed computations are characterised by models.
Architectural models
Flynn’s taxonomy distinguishes the following categories:
- SISD: traditional uniprocessor computers
- MISD: Space shuttle flight control computer
- SIMD: array processor, GPU
- MIMD: parallel and distributed systems
There are different architectural-service models as well:
- Centralised (mainframe, cluster)
- Client-server (mail, banking, computations)
- Multi-tier (grid, DNS)
- Peer-to-peer (file exchange, computations)
Interaction models
Interaction models give answers to the following questions:
- How do we handle time?
- Are there time limits?
There are two major models:
- Synchronous
- Asynchronous
Fault models
The crucial question here is:
What kind of faults can occur?
- Omission faults (a processor or communication channel fails to perform what it is supposed to do)
- Timing faults (in synchronous distributed systems)
- Arbitrary faults (WTF has happened?)
Distributed computing¶
Advantages¶
- Performance
- Reliability
- Distribution
- Incremental growth
- Sharing computation/data/resources/management
- Communication
- Economics
- Flexibility
Disadvantages¶
- Heterogeneity (hardware, software, operation, etc)
- Software development
- Networking
- Incremental growth (scalability is a pain)
Pitfalls¶
- The network is NOT reliable
- The network is NOT secure
- The network is NOT homogeneous
- The topology is NOT constant
- Latency is NOT zero
- Bandwidth is NOT infinite
- Transport cost is NOT zero
- There is NO single administrator
Design¶
Main characteristics:
Transparency
How do we create the impression that the collection of machines is a “simple” single computer?
- Access
- Location
- Migration
- Replication
- Concurrency
- Failure
- Performance
Scalability
Performance
- Performance of individual workstations
- Speed of the communication infrastructure
- Extent of reliability
- Flexibility in workload allocations (i.e. idle processors should be allocated automatically to a user’s task)
Heterogeneity
- different hardware
- different software
- various devices (PCs, mobiles, ATM-machines, sensors, etc)
- diverse networks and protocols
Cluster and Grid computing¶
Cluster computing:
a collection of high-end computers, usually closely connected through a LAN
- Homogeneous: OS, hardware
- Work: together like a single computer
- Applications are hosted on one machine and user machines connect to it; clients connect via terminals
- Example: the high-performance computing center at KPI
Grid computing:
a collection of clusters, which may be combined into a “grid” of massive computing power
- Heterogeneous
- Work: grids use virtual organizations for collaboration
Lecture 2. Parallel computing¶
- Parallel computing
- a form of computation in which many calculations are carried out simultaneously.
There are different levels of parallel computing:
Instruction
a single operation of a processor
Thread
stream of execution (has one or multiple instructions)
Task
Process
Levels of parallel computing:
- Task level
- Instruction level
- Bit level
Classification of systems¶
- Flynn’s taxonomy
- Memory access
- shared memory: centralized (SMP) or distributed (NUMA)
- individual memory: distributed
UMA (Uniform memory access)¶
- equal access rights
- equal memory access time
NUMA¶
- usually two or more SMPs physically linked so that each can directly access the memory of the other
- not all processors have equal memory access time
- memory access across the link is much slower
Distributed¶
- Each CPU has its own local memory, and changes to it are not visible to other CPUs
- Processors are connected by a network
- The program must define a way to transfer data between processors (see the sketch below)
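A minimal sketch of such an explicit data transfer, assuming MPI as the message-passing library (the program, the array contents and the ranks are illustrative, not taken from the lecture):

```c
/* Illustrative sketch: explicit data transfer in a distributed-memory system.
 * Assumes an MPI implementation; compile with mpicc, run with mpirun -np 2 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double data[4] = {1.0, 2.0, 3.0, 4.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* the local memory of rank 0 is invisible to rank 1, so send it */
        MPI_Send(data, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %.1f %.1f %.1f %.1f\n",
               data[0], data[1], data[2], data[3]);
    }

    MPI_Finalize();
    return 0;
}
```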
Massively parallel processors¶
MPP architecture consists of nodes each having its own processor, memory and I/O subsystem
Programming models¶
- Programming Model
- some model which represents an abstraction of the computer system and enables the expression of ideas in some form
- Shared model
- Processors read and write variables stored in a shared address space asynchronously
- Access to the shared memory is controlled by some mechanisms (locks/semaphores)
- Threads model
- Data parallelization
- Message Passing
Aspect | Shared memory | Message passing |
---|---|---|
Communication | Implicit | Explicit messages |
Synchronization | Explicit | Implicit (via message) |
Hardware support | Typically required | None |
Development effort | Lower | Higher |
Tuning Effort | Higher | Lower |
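A minimal sketch of the shared-memory side of this comparison (illustrative, not taken from the lecture): communication is implicit because all threads see the same variable, while synchronization is explicit via a lock.

```c
/* Illustrative sketch: shared-memory model with explicit synchronization.
 * Compile with: cc -pthread counter.c */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                 /* shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);       /* explicit synchronization */
        counter++;                       /* implicit communication via shared memory */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);
    printf("counter = %ld\n", counter);  /* expected: 400000 */
    return 0;
}
```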
Implementation of parallelism¶
Load balancing¶
To distribute work among all tasks so that they are all kept busy all of the time
Ways to achieve:
- Adequate partitioning
- Dynamic work assignment: a scheduler/task pool plus an algorithm to detect and handle imbalances (a minimal sketch follows the note below)
Note
If barrier synchronization is used then the slowest task determines the time of execution
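A minimal sketch of dynamic work assignment (illustrative, assuming OpenMP): schedule(dynamic) hands loop iterations out from a shared pool, so faster threads keep picking up work even when the cost per task is very uneven.

```c
/* Illustrative sketch: dynamic work assignment (task pool) with OpenMP.
 * Compile with: cc -fopenmp balance.c */
#include <omp.h>
#include <stdio.h>

/* artificial task whose cost grows with its index: an imbalanced workload */
static double work(int i) {
    double x = 0.0;
    for (int k = 0; k < i * 1000; k++)
        x += k * 1e-9;
    return x;
}

int main(void) {
    double total = 0.0;

    /* iterations are handed out a few at a time from a shared pool,
     * so no thread sits idle while others are still busy */
    #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
    for (int i = 0; i < 1000; i++)
        total += work(i);

    printf("total = %f\n", total);
    return 0;
}
```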
Granularity¶
the computation/communication ratio
Fine-grained parallelism:
few computation events are done between communication events
Coarse-grained parallelism:
many computation events are done between communication events
Amdahl’s law¶
Suppose that the sequential execution of a program takes \(T_1\) time units and the parallel execution on \(p\) processors takes \(T_p\) time units
Suppose that out of the entire execution of the program, \(s\) fraction of it is not parallelizable while \(1-s\) fraction is parallelizable
Then the speedup:
\[\frac{T_1}{T_p} = \frac{T_1}{T_1 \cdot s + T_1 \cdot \frac{1 - s}{p}} = \frac{1}{s + \frac{1 - s}{p}}\]
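A quick worked example with assumed numbers: for \(s = 0.1\) and \(p = 8\),

\[\frac{T_1}{T_p} = \frac{1}{0.1 + \frac{0.9}{8}} = \frac{1}{0.2125} \approx 4.7\]

and even as \(p \to \infty\) the speedup is bounded by \(1/s = 10\).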
Note
- Amdahl’s Law is too simple for real cases
- The communication overhead and workload balance among processes (in general) should be taken into account
There are other laws of parallel computing performance:
- Gustafson's Law (1988)
- another way to evaluate the performance of a parallel program
- Karp-Flatt Metric (1990)
- whether the principal barrier to program speedup is the amount of inherently sequential code or the parallel overhead
- Isoefficiency (isogranularity) metric
- the scalability of a parallel algorithm executing on parallel systems
Performance guidelines¶
- Maximize the fraction of our program that can be parallelized
- Balance the workload of parallel processes
- Minimize the time spent for communication
Lecture 3. Metrics¶
Time
Speedup
\[Speedup = \Psi(n,p) = \frac{\text{sequential execution time}}{\text{parallel execution time}} = \frac{t_s}{t_p}\]
Efficiency
a measure of processor utilisation: the speedup divided by the number of processors
\[Efficiency = \varepsilon(n,p) = \frac{\text{Speedup}}{\text{Processors}}\]
Note that
\[\text{speedup} \leq \text{processors}\]
Since \(\text{speedup} \geq 0\) and \(\text{processors} > 1\), it follows that
\[0 \leq \varepsilon(n,p) \leq 1\]
However, there are superlinear algorithms, for which
\[\text{speedup} > \text{processors}\]
and in this case
\[\varepsilon(n,p) > 1\]
Cost
\[\text{cost} = \text{parallel running time} \cdot \text{processors}\]
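A quick worked example with assumed numbers: if \(t_s = 100\,\text{s}\) and \(t_p = 20\,\text{s}\) on \(p = 8\) processors, then

\[\Psi(n,8) = \frac{100}{20} = 5, \qquad \varepsilon(n,8) = \frac{5}{8} = 0.625, \qquad \text{cost} = 20 \cdot 8 = 160\,\text{s}\]

The cost exceeds the 100 s of sequential work; the difference is the overhead introduced by the parallel execution.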
Types of computing problems¶
Embarrassingly parallel problem:
one for which little or no effort is required to separate the problem into a number of parallel tasks. Such problems are well suited to large internet-based distributed platforms and do not suffer from parallel slowdown; they require little or no communication of results between tasks.
Distributed computing problems:
require communication between tasks, especially communication of intermediate results.
Inherently serial computing problems:
cannot be parallelized at all. They are the diametric opposite of embarrassingly parallel problems.
More General Speedup Formula¶
A better version of Amdahl's law
Speedup is an increasing function of problem size
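One common way to write such a bound, using the \(\sigma(n)\) and \(\kappa(n,p)\) notation of the isoefficiency section below and introducing \(\varphi(n)\) for the parallelizable part of the computation (the formula is given here for orientation; it is not taken verbatim from the lecture):

\[\Psi(n,p) \leq \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \frac{\varphi(n)}{p} + \kappa(n,p)}\]

Since the overhead \(\kappa(n,p)\) usually grows more slowly than the parallelizable work \(\varphi(n)\), this bound, and hence the achievable speedup, grows with the problem size \(n\).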
Karp-Flatt Metric¶
- analyze parallel program performance
- predict speedup with additional processors
Start with the speedup formula \(\Psi(n,p)\).
The experimentally determined serial fraction \(e\) is a function of the speedup and the number of processors:
\[e = \frac{1/\Psi - 1/p}{1 - 1/p}\]
From this we can express \(\Psi\) in terms of \(e\) and \(p\):
\[\Psi = \frac{1}{e + \frac{1 - e}{p}}\]
Interpretation of \(e\)
- if \(e\) stays roughly constant as the number of CPUs increases, then the speedup is constrained by the sequential component
- if \(e\) grows with the number of CPUs, then parallel overhead (communication, synchronization) is the principal barrier
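A quick worked example with assumed numbers: a program that achieves speedup \(\Psi = 5\) on \(p = 8\) processors gives

\[e = \frac{1/5 - 1/8}{1 - 1/8} = \frac{0.075}{0.875} \approx 0.086\]

i.e. about 8.6% of the work behaves as if it were serial; whether this value stays flat or grows as \(p\) increases tells us whether sequential code or parallel overhead is the bottleneck.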
Isoefficiency metric¶
- \(n\) – data size
- \(p\) – number of processes
- \(T(n,p)\) – execution time using \(p\) processors
- \(\Psi\) – speedup
- \(T_0(n,p)\) – the total overhead time spent by processes doing work not done by the sequential algorithm:
\[T_0(n,p) = (p - 1) \cdot \sigma(n) + p \cdot \kappa(n,p)\]
- For example, \(T(n,1)\) is the sequential execution time
- We want the algorithm to maintain a constant level of efficiency as the data size n increases
Main steps of the derivation:
- begin with the speedup formula
- compute the total amount of overhead
- this yields the isoefficiency relation
\[T(n,1) \geq C \cdot T_0(n,p), \quad \text{where } C = \frac{\varepsilon(n,p)}{1 - \varepsilon(n,p)}\]
- it is used to determine the maximum number of CPUs for which the given level of efficiency can be maintained
- How to maintain a given efficiency? Increase the problem size when the number of processors increases
- The maximum problem size we can solve is limited by available amount of memory
- Usually for most parallel systems the memory size \(M\) is a constant multiple of the number of processors
Scalability function¶
- To maintain efficiency \(\varepsilon(n,p)\) when increasing \(p\) we must increase \(n\)
- Max problem size is limited by available memory \(M\)
- Scalability function \(scale(p)\) shows how memory usage per processor \(M(f(p))\) must grow to maintain efficiency
- If the scalability function is constant this means the parallel system is perfectly scalable
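An illustrative case (assumed numbers, not from the lecture): suppose the isoefficiency relation reduces to \(n \geq C \, p \log p\) and the algorithm stores \(\Theta(n)\) data, so \(M(n) = n\). Then

\[scale(p) = \frac{M(C \, p \log p)}{p} = C \log p\]

so memory per processor must grow logarithmically with \(p\): the system scales, but not perfectly.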
Examples¶
Reduction task:
collects the answers to all the subproblems and combines them in some way to form the output
Floyd-Warshall algorithm:
a graph analysis algorithm for finding shortest paths in a weighted graph (a serial sketch of its core loop nest follows below)
Finite difference method:
numerical methods for approximating the solutions of differential equations, using finite difference equations to approximate derivatives
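A minimal serial sketch of the Floyd-Warshall loop nest (illustrative; the graph and its weights are made up). Parallel versions typically distribute the middle loop, i.e. the rows of the distance matrix, for each fixed \(k\).

```c
/* Illustrative sketch: serial Floyd-Warshall all-pairs shortest paths */
#include <stdio.h>

#define N   4
#define INF 1000000            /* "no edge" marker */

int main(void) {
    /* adjacency matrix of a small weighted directed graph */
    int d[N][N] = {
        {0,   3,   INF, 7  },
        {8,   0,   2,   INF},
        {5,   INF, 0,   1  },
        {2,   INF, INF, 0  }
    };

    for (int k = 0; k < N; k++)          /* intermediate vertex */
        for (int i = 0; i < N; i++)      /* rows: distributed in parallel versions */
            for (int j = 0; j < N; j++)
                if (d[i][k] + d[k][j] < d[i][j])
                    d[i][j] = d[i][k] + d[k][j];

    for (int i = 0; i < N; i++) {        /* print the shortest-path matrix */
        for (int j = 0; j < N; j++)
            printf("%4d ", d[i][j]);
        printf("\n");
    }
    return 0;
}
```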
Lecture 4¶
Volunteer computing¶
Public-resource computing:
combines the resources of personal computers and game consoles belonging to the general public to perform scientific computations. Started with