=======================================
High-Performance Computing Technologies
=======================================

:Lecturer: Yuri Grygorovych Gordienko

.. contents::
   :depth: 4

--------------


Lecture 1. Introduction
=======================

date: Wed Feb 15 13:47:01 EET 2017

Historical info
---------------

Parallel processing cannot be replaced by sequential processing
because of `Moore's Law <https://en.wikipedia.org/wiki/Moore's_law>`_:

  Number of transistors in a dense integrated circuit 
  doubles approximately every two years.

.. note::

  See also
  `Kurzweil's extension of Moore's law <http://www.kurzweilai.net/the-law-of-accelerating-returns>`_

Storage law
  storage capacity increases 2x every 12 months

Gilder's law
  network (optical) bandwidth increases 2x every 9 months

Distributed computing
  a collection of independent computers that appears to its users
  as a single coherent system 

Leslie Lamport's test for a distributed system:

  You know you have a distributed system when the crash of a computer
  you've never heard of stops you from getting any work done


Differences between parallel and distributed systems
----------------------------------------------------

+--------------+-------------+--------------+
| Criteria     | Parallel    | Distributed  |
|              | System      | System       |
+==============+=============+==============+
| Connectivity | Tightly     | Loosely      |
|              | coupled     | coupled      |
+--------------+-------------+--------------+
| Memory       | Uses shared | Uses message |
| access       | memory      | passing as   |
|              | to exchange | each CPU has |
|              | information | its own      |
|              |             | memory       |
+--------------+-------------+--------------+
| Granularity  | Finer       | Coarse       |
|              | grained     | grained      |
+--------------+-------------+--------------+


Usages
------

- Strategic systems
- Visualisation and Graphics
- Economics and Finance
- Scientific computing

  - Physics (LHC)
  - Bioinformatics (protein-docking)
  - Geology (seismography)
  - Astronomy (galaxy simulations)

Models
------

Distributed computations are characterised by models.

- Architectural models
  
  `Flynn's taxonomy <https://en.wikipedia.org/wiki/Flynn's_taxonomy>`_
  distinguishes the following categories:

  - **SISD**: traditional uniprocessor computers
  - **MISD**: Space shuttle flight control computer
  - **SIMD**: array processor, GPU
  - **MIMD**: parallel and distributed systems

  There are different architectural-service models as well:

  - Centralised (mainframe, cluster)
  - Client-server (mail, banking, computations)
  - Multi-tier (grid, DNS)
  - Peer-to-peer (file exchange, computations)
  
- Interaction models

  Interaction models give answers to the following questions:

  - How do we handle time?
  - Are there time limits?

  There are two major models:

  - Synchronous
  - Asynchronous

- Fault models

  The crucial question here is:
    
    What kinds of faults can occur?

  - Omission faults (a processor or communication channel fails to perform what it is supposed to do)
  - Timing faults (in synchronous distributed systems)
  - Arbitrary faults (WTF has happened?)

Distributed computing
---------------------

Advantages
~~~~~~~~~~

- Performance
- Reliability
- Distribution
- Incremental growth
- Sharing computation/data/resources/management
- Communication
- Economics
- Flexibility

Disadvantages
~~~~~~~~~~~~~

- Heterogeneity (hardware, software, operation, etc)
- Software development
- Networking
- Incremental growth (scalability is a pain)

Pitfalls
~~~~~~~~

- The network is **NOT** reliable
- The network is **NOT** secure
- The network is **NOT** homogeneous
- The topology is **NOT** constant
- Latency is **NOT** zero
- Bandwidth is **NOT** infinite
- Transport cost is **NOT** zero
- There is **NO** single administrator

Design
~~~~~~

Main characteristics:

- Transparency
  
    How to give the impression that the collection of machines is a single "simple" computer?

  - Access
  - Location
  - Migration
  - Replication
  - Concurrency
  - Failure
  - Performance
- Scalability
- Performance

  - Performance of individual workstations
  - Speed of the communication infrastructure
  - Extent of reliability
  - Flexibility in workload allocations (i.e. idle processors 
    should be allocated automatically to a user's task)
- Heterogeneity

  - different hardware
  - different software
  - various devices (PCs, mobiles, ATM-machines, sensors, etc)
  - diverse networks and protocols

Cluster and Grid computing
--------------------------

:Cluster computing:
  a collection of high-end computers, usually
  closely connected through a LAN

- Homogeneous: OS, hardware
- Work: the machines work together like a single computer
- Applications are hosted on one machine; user machines connect to it
  as clients via terminals

`High-performance computing center at KPI <http://hpcc.kpi.ua>`_

:Grid computing:
  a collection of clusters, which may be combined into a "GRID"
  of massive computing power

- Heterogeneous
- Work: grids use virtual organizations to support collaborations

Lecture 2. Parallel computing
=============================

Parallel computing
  a form of computation in which many calculations are carried out simultaneously.

There are different units of parallel execution:

  - Instruction

    a single operation of a processor

  - Thread

    a stream of execution (one or more instructions)

  - Task
  - Process

Levels of parallel computing:

  - Task level
  - Instruction level
  - Bit level

Classification of systems
-------------------------

- Flynn's taxonomy
- Memory access

  - shared memory

    - centralized (SMP)
    - distributed (NUMA)

  - individual memory

    - distributed

UMA (Uniform memory access)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

- equal access rights
- equal memory access time

NUMA
~~~~

- usually two or more SMPs are physically linked so that each can directly access the memory of the others
- not all processors have equal access time to all memories
- memory access across the link is much slower

Distributed
~~~~~~~~~~~

- Each CPU has its own local memory, and changes to it are not visible to other CPUs
- Processors are connected by network
- Program must define a way to transfer data between processors

Massively parallel processors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

An MPP architecture consists of nodes, each having its own processor, memory, and I/O subsystem

Programming models
------------------

Programming Model
  an abstraction of the computer system that enables parallel
  computations to be expressed in a convenient form

- Shared memory model

  - Processors read and write variables stored in a shared address space asynchronously
  - Access to the shared memory is controlled by mechanisms such as locks and semaphores

- Threads model
- Data parallelization
- Message Passing


+--------------------+---------------+------------------------+
| Aspect             | Shared memory | Message passing        |
+====================+===============+========================+
| Communication      | Implicit      | Explicit messages      |
+--------------------+---------------+------------------------+
| Synchronization    | Explicit      | Implicit (via message) |
+--------------------+---------------+------------------------+
| Hardware support   | Typically     |                        |
|                    | required      | None                   |
+--------------------+---------------+------------------------+
| Development effort | Lower         | Higher                 |
+--------------------+---------------+------------------------+
| Tuning Effort      | Higher        | Lower                  |
+--------------------+---------------+------------------------+
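
To make the contrast concrete, here is a minimal sketch (Python is assumed
here purely for illustration): shared memory via ``threading`` with an
explicit lock versus message passing via ``multiprocessing`` with a queue.

.. code:: python

   import threading
   import multiprocessing as mp

   # --- Shared memory: implicit communication, explicit synchronization ---
   counter = 0
   lock = threading.Lock()              # explicit synchronization primitive

   def shared_worker(iterations):
       global counter
       for _ in range(iterations):
           with lock:                   # protect the shared variable
               counter += 1

   # --- Message passing: explicit communication, implicit synchronization ---
   def mp_worker(iterations, queue):
       # each process has its own memory; the result is sent as a message
       queue.put(iterations)

   if __name__ == "__main__":
       # shared-memory version
       threads = [threading.Thread(target=shared_worker, args=(1000,))
                  for _ in range(4)]
       for t in threads:
           t.start()
       for t in threads:
           t.join()
       print("shared counter:", counter)            # 4000

       # message-passing version
       queue = mp.Queue()
       processes = [mp.Process(target=mp_worker, args=(1000, queue))
                    for _ in range(4)]
       for p in processes:
           p.start()
       total = sum(queue.get() for _ in processes)  # receiving blocks: implicit sync
       for p in processes:
           p.join()
       print("message-passing total:", total)       # 4000

In the shared-memory version the data exchange is implicit (every thread sees
``counter``) but the lock has to be written explicitly; in the message-passing
version the exchange is explicit, while waiting for a message already
synchronizes sender and receiver.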

Implementation of parallelism
-----------------------------

Load balancing
~~~~~~~~~~~~~~

To distribute work among all tasks so that they are all kept busy all of the time

Ways to achieve it:

- Adequate partitioning
- Dynamic work assignment (see the sketch below)

  - Scheduler/task-pool
  - Algorithm to detect and handle imbalances

.. note::

  If barrier synchronization is used then the slowest task determines the
  time of execution
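
A minimal sketch (Python is assumed only for illustration) of dynamic work
assignment: a pool of workers pulls chunks of deliberately uneven size from a
shared task queue, so faster workers simply take more chunks.

.. code:: python

   import random
   from concurrent.futures import ProcessPoolExecutor

   def work(chunk_size):
       """Simulated subproblem whose cost depends on the chunk size."""
       return sum(i * i for i in range(chunk_size))

   if __name__ == "__main__":
       random.seed(0)
       # unevenly sized pieces of work
       chunks = [random.randint(10_000, 1_000_000) for _ in range(64)]
       with ProcessPoolExecutor(max_workers=4) as pool:
           # chunksize=1: an idle worker immediately pulls the next task,
           # acting as a simple scheduler/task-pool
           results = list(pool.map(work, chunks, chunksize=1))
       print("processed", len(results), "chunks")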

Granularity
~~~~~~~~~~~

computation/communication ratio

:Fine-grained parallelism:

  **few** computation events are done between communication events

  - High communication overhead
  - Small opportunity to enhance performance

:Coarse-grained parallelism:

  **many** computational events are done between communication events.

  - Large opportunity to enhance performance
  - Harder to do load balancing efficiently

Amdahl's law
------------

- Suppose that the sequential execution of a program takes :math:`T_1` time units
  and the parallel execution on :math:`p` processors takes :math:`T_p` time units
- Suppose that a fraction :math:`s` of the entire execution of the program
  is not parallelizable, while the remaining fraction :math:`1-s` is parallelizable
- Then the speedup:

  .. math::
    
      \frac{T_1}{T_p} = \frac{T_1}{T_1 \cdot s + T_1 \cdot \frac{1 - s}{p}} 
                      = \frac{1}{s + \frac{1 - s}{p}}
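
As an illustration, a minimal sketch (Python, with a purely hypothetical
serial fraction) of how this bound behaves as the number of processors grows:

.. code:: python

   def amdahl_speedup(s, p):
       """Upper bound on speedup for serial fraction s on p processors."""
       return 1.0 / (s + (1.0 - s) / p)

   if __name__ == "__main__":
       s = 0.05                            # assume 5% of the program is serial
       for p in (1, 2, 4, 8, 16, 64, 1024):
           print(f"p = {p:4d}   speedup <= {amdahl_speedup(s, p):6.2f}")
       # as p grows, the speedup approaches 1/s = 20 no matter how many
       # processors are added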

.. note::

  - Amdahl's Law is too simple for real cases
  - The communication overhead and workload balance among processes (in general) should 
    be taken into account

There are other laws of parallel computing performance:

- Gustafson's Law (1988)
    another way to evaluate the performance of a parallel program
- Karp-Flatt Metric (1990)
    whether the principal barrier to the program speedup is the amount of inherently
    sequential code or the parallel overhead
- Isoefficiency (isogranularity) metric
    the scalability of a parallel algorithm executing on parallel systems

Performance guidelines
----------------------

- Maximize the fraction of our program that can be parallelized
- Balance the workload of parallel processes
- Minimize the time spent for communication


Lecture 3. Metrics
==================

- Time
- Speedup

  .. math::

     Speedup = \Psi(n,p) = \frac{\text{sequential execution time}}{\text{parallel execution time}}
             = \frac{t_s}{t_p}


- Efficiency

  a measure of processor utilisation, defined as the speedup divided by the number of processors (see the sketch after this list)

  .. math::

     Efficiency = \varepsilon(n,p) = \frac{\text{Speedup}}{\text{Processors}}

  Note that

  .. math::
      \text{speedup} \leq \text{processors}

  Since :math:`0 \leq \text{speedup} \leq \text{processors}`, it follows that

  .. math::
     
    0 \leq \varepsilon(n,p) \leq 1

  However, there are **superlinear** algorithms, for which

  .. math::
    
     \text{speedup} > \text{processors}

  and for this case 

  .. math::
    
    \varepsilon(n,p) > 1

- Cost 

  .. math::

     \text{cost} = \text{parallel running time} \cdot \text{processors}
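
For illustration, a minimal sketch (Python, with purely hypothetical timings)
computing the three metrics above from a measured sequential time :math:`t_s`
and a parallel time :math:`t_p` on :math:`p` processors:

.. code:: python

   def speedup(t_s, t_p):
       """Psi(n, p) = sequential time / parallel time."""
       return t_s / t_p

   def efficiency(t_s, t_p, p):
       """epsilon(n, p) = speedup / number of processors."""
       return speedup(t_s, t_p) / p

   def cost(t_p, p):
       """cost = parallel running time * number of processors."""
       return t_p * p

   if __name__ == "__main__":
       t_s, t_p, p = 120.0, 18.0, 8        # assumed measurements (seconds)
       print("speedup    :", round(speedup(t_s, t_p), 2))       # 6.67
       print("efficiency :", round(efficiency(t_s, t_p, p), 2)) # 0.83
       print("cost       :", cost(t_p, p))                      # 144.0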


Types of computing problems
---------------------------

:Embarrassingly parallel problem:
  is one for which little or no effort is required to separate the problem into a number of parallel tasks.
  Such problems are well suited to large internet-based distributed platforms and do not suffer from parallel slowdown.
  They require little or no communication of results between tasks (see the sketch after this list).

:Distributed computing problems:
  require communication between tasks, especially communication of intermediate results.

:Inherently serial computing problems:
  cannot be parallelized at all. They are the diametric opposite of embarrassingly parallel problems.
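
A minimal sketch (Python ``multiprocessing`` is assumed only for illustration)
of an embarrassingly parallel problem: each task counts primes in its own
range, and the only communication is distributing inputs and collecting the
per-task results.

.. code:: python

   from multiprocessing import Pool

   def count_primes(bounds):
       """Count primes in [lo, hi) -- an independent subproblem."""
       lo, hi = bounds

       def is_prime(k):
           if k < 2:
               return False
           d = 2
           while d * d <= k:
               if k % d == 0:
                   return False
               d += 1
           return True

       return sum(1 for k in range(lo, hi) if is_prime(k))

   if __name__ == "__main__":
       n, tasks = 100_000, 8
       step = n // tasks
       chunks = [(i * step, (i + 1) * step) for i in range(tasks)]
       with Pool(processes=tasks) as pool:
           # no intermediate results are exchanged between the tasks
           total = sum(pool.map(count_primes, chunks))
       print("primes below", n, ":", total)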

More General Speedup Formula
----------------------------

A more general version of `Amdahl's law`_, where :math:`\sigma(n)` is the inherently
sequential computation time, :math:`\phi(n)` is the parallelizable computation time, and
:math:`\kappa(n,p)` is the parallelization (communication and synchronization) overhead:

.. math::
   
   \Psi(n,p) \leq \frac{\sigma(n) + \phi(n)}{\sigma(n) + \frac{\phi(n)}{p} + \kappa(n,p)}

Speedup is an increasing function of problem size
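
A minimal sketch (Python, with hypothetical component times) of how this bound
behaves when the overhead :math:`\kappa(n,p)` grows with the number of
processors:

.. code:: python

   def speedup_bound(sigma, phi, kappa, p):
       """General bound: (sigma + phi) / (sigma + phi / p + kappa)."""
       return (sigma + phi) / (sigma + phi / p + kappa)

   if __name__ == "__main__":
       sigma, phi = 2.0, 98.0               # assumed times (seconds)
       for p in (2, 4, 8, 16, 32):
           kappa = 0.5 * p ** 0.5           # assumed overhead growing with p
           bound = speedup_bound(sigma, phi, kappa, p)
           print(f"p = {p:2d}   speedup <= {bound:5.2f}")
       # the growing kappa term limits the achievable speedup more strongly
       # than plain Amdahl's law would suggest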

Karp-Flatt Metric
-----------------

- analyze parallel program performance
- predict speedup with additional processors

Start with the speedup formula

.. math::

   \Psi(n,p) \leq \frac{\sigma(n) + \phi(n)}{\sigma(n) + \frac{\phi(n)}{p} + \kappa(n,p)}

The experimentally determined serial fraction :math:`e` is a function of the speedup and the number of processors:

.. math::

   e = \frac{1/\Psi - 1/p}{1 - 1/p}

From this we can express :math:`\Psi` in terms of :math:`e` and :math:`p`:

.. math::

  \Psi = \frac{p}{e \cdot (p - 1) + 1}

**Interpretation of e**

- if :math:`e` stays roughly constant as the number of processors increases,
  then the speedup is constrained primarily by the sequential component
- if :math:`e` grows with the number of processors, then parallel overhead
  (communication and synchronization) is the principal barrier
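
A minimal sketch (Python, with hypothetical measured speedups) computing the
experimentally determined serial fraction:

.. code:: python

   def karp_flatt(psi, p):
       """e = (1/psi - 1/p) / (1 - 1/p) for a measured speedup psi on p CPUs."""
       return (1.0 / psi - 1.0 / p) / (1.0 - 1.0 / p)

   if __name__ == "__main__":
       # assumed speedups measured on 2, 4 and 8 processors
       measured = {2: 1.82, 4: 3.08, 8: 4.71}
       for p, psi in measured.items():
           print(f"p = {p}   psi = {psi:.2f}   e = {karp_flatt(psi, p):.3f}")
       # e stays roughly constant (about 0.1), so here the sequential
       # component, not the parallel overhead, limits the speedup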

Isoefficiency metric
--------------------

- :math:`n` -- data size
- :math:`p` -- number of processors
- :math:`T(n,p)` -- execution time using :math:`p` processors
- :math:`\Psi(n,p)` -- speedup

:math:`T_0(n,p)` -- the total overhead: time spent by all processes doing work that is not done by the sequential algorithm

.. math::

   T_0(n,p) = (p - 1) \cdot \sigma(n) + p \cdot \kappa(n,p)

- For example, :math:`T(n,1)` is the sequential execution time
- We want the algorithm to maintain a constant level of efficiency as the data size :math:`n` increases


**Main steps of the derivation**

- begin with the speedup formula
- compute the total amount of overhead


.. math::
   T(n,1) \geq C \cdot T_0(n,p)

where 

.. math::

   C = \frac{\varepsilon(n,p)}{1 - \varepsilon(n,p)}
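
A brief sketch of the omitted step, using the definition of the total overhead
(:math:`T_0(n,p) = p \cdot T(n,p) - T(n,1)`): since

.. math::

   \varepsilon(n,p) = \frac{\Psi(n,p)}{p} = \frac{T(n,1)}{p \cdot T(n,p)}
                    = \frac{T(n,1)}{T(n,1) + T_0(n,p)}

maintaining an efficiency of at least :math:`\varepsilon(n,p)` requires
:math:`T(n,1) \geq C \cdot T_0(n,p)` with :math:`C = \varepsilon(n,p) / (1 - \varepsilon(n,p))`.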


- The resulting isoefficiency relation is used to determine the maximum number of processors
  for which a given level of efficiency can be maintained
- How to maintain a given efficiency?
  Increase the problem size as the number of processors increases
- The maximum problem size we can solve is limited by the available amount of memory
- Usually for most parallel systems the memory size :math:`M` 
  is a constant multiple of the number of processors 

Scalability function
~~~~~~~~~~~~~~~~~~~~

- To maintain efficiency :math:`\varepsilon(n,p)` when increasing 
  :math:`p` we must increase :math:`n`
- Max problem size is limited by available memory :math:`M`
- The scalability function :math:`scale(p) = M(f(p))/p` shows how the memory usage
  per processor must grow to maintain efficiency (see the worked example below)
- If the scalability function is constant this means 
  the **parallel system is perfectly scalable**
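
As a worked example, consider the standard parallel reduction of :math:`n`
values on :math:`p` processors (assuming :math:`T(n,1) \approx n`, overhead
:math:`T_0(n,p) \approx p \log p`, and memory requirement :math:`M(n) = n`):
the isoefficiency relation gives :math:`n \geq C \cdot p \log p`, so

.. math::

   scale(p) = \frac{M(C \cdot p \log p)}{p} = \frac{C \cdot p \log p}{p} = C \log p

i.e. the memory per processor must grow only logarithmically with :math:`p`,
and the parallel reduction is highly (though not perfectly) scalable.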

Examples
~~~~~~~~

:Reduction task:
  collects the answers to all the subproblems and combines 
  them in some way to form the output
:Floyd-Warshall Algorithm:
  a graph-analysis algorithm for finding shortest paths in a weighted graph
:Finite difference method:
  a numerical method for approximating the solutions
  to differential equations by using
  finite difference equations to approximate derivatives

Lecture 4
=========

P2P
---

Pros

- No central point of failure

Cons

- Decentralised management and coordination
- Nodes are not equal (in capability and reliability)
- Hard to program

Volunteer computing
-------------------

:Public-resource computing:

  Combines the resources of personal computers and game consoles belonging
  to the general public to perform scientific computations

  Started with

  - Great Internet Mersenne Prime Search (GIMPS, 1996)
  - Distributed.net (1997)