CS 475: Parallel Programming Introduction

CS 475: Parallel Programming Introduction Wim Bohm Colorado State University Fall 2012

Why Parallel Programming
- Need for speed: many applications require orders of magnitude more compute power than we have right now.
- For a long time, speed came from technology improvements, mainly increased clock speed and chip density.
- These technology improvements have slowed down, and the only way to get more speed is to exploit parallelism.
- All new computers are now parallel computers.

Technology: Moore's Law
- Empirical observation (1965) by Gordon Moore: chip density (the number of transistors that can be inexpensively placed on an integrated circuit) is increasing exponentially, doubling approximately every two years.
- en.wikipedia.org/wiki/moore's_law
- This has held true until now and is expected to hold until at least 2015 (news.cnet.com/new-life-for-Moores-Law/2009-1006_3-5672485.html).

[Figure: transistor counts over time, illustrating Moore's Law. Source: Wikipedia]

Corollary of exponential growth
- When two quantities grow exponentially, but at different rates, their ratio also grows exponentially: for y_1 = a^x and y_2 = b^x with a > b > 1, the ratio y_1 / y_2 = (a/b)^x = α^x with α > 1.
- In CS terms: 2^n ∈ O(3^n), and 2^n grows a lot faster than 1.1^n.
- Consequence for computer architecture: the growth rate for, e.g., memory speed is not as high as that for the processor; therefore memory gets slower and slower (in terms of clock cycles) compared to the processor. This gives rise to so-called gaps or walls.

Gaps or walls of Moore's Law
- Memory gap/wall: memory bandwidth and latency improve much more slowly than processor speeds (since the mid 80s; this was addressed by ever increasing on-chip caches).
- (Brick) power wall: higher frequency means more power, which means more heat. Microprocessor chips are getting exceedingly hot. Consequence: we can no longer increase the clock frequency of our microprocessors.

[Figure: clock speed and transistor trends. Source: "The Free Lunch Is Over", Herb Sutter, Dr. Dobb's Journal, 2005]

Enter Parallel Computing
- The goal is to deliver performance. The only solution to the power wall is to have multiple processor cores on a chip, because we can still increase the chip density.
- Nowadays, there are no more processors with one core. The processor is the new transistor.
- The number of cores will continue to grow at an exponential rate (at least until 2015).

Implicit Parallel Programming
- Let the compiler / runtime system detect parallelism, and do task and data allocation, and scheduling.
- This has been a holy grail of compiler and parallel computing research for a long time. After more than 40 years, it still requires research on automatic parallelization, languages, OS, fault tolerance, etc.

Explicit Parallel Programming
- Let the programmer express parallelism, task and data partitioning, allocation, synchronization, and scheduling, using programming languages extended with explicit parallel programming constructs.
- This makes the programming task harder (but a lot of fun!)

Explicit parallel programming
- Approaches:
  - Multithreading: OpenMP
  - Message passing: MPI
  - Data parallel programming: CUDA
- Explicit parallelism complicates programming:
  - creation, allocation, and scheduling of processes
  - data partitioning
  - synchronization (semaphores, locks, messages)

Computer Architectures
- Sequential: John von Neumann. Stored program, single instruction stream, single data stream.
- Parallel: Michael Flynn's taxonomy:
  - SIMD: Single instruction, multiple data (data parallel)
  - MIMD: Multiple instruction, multiple data

SIMD
- One control unit.
- Many processing elements (PEs):
  - all executing the same instruction
  - local memories
- Conditional execution:
  - where ocean: perform ocean step
  - where land: perform land step
  - This gives rise to idle processors (optimized in GPUs).
- PE interconnection: usually a 2D mesh.

SIMD machines
- Bit level: ILLIAC IV (very early research machine, Illinois), CM2 (Thinking Machines).
- Now GPUs: a grid of SMs (Streaming Multiprocessors):
  - each SM has a collection of processing elements
  - each SM has a shared memory and a register file
  - one global memory

MIMD
- Multiple processors, each executing its own code.
- Distributed memory MIMD: complete PE + memory connected to a network.
  - Cosmic Cube, N-Cube, our Cray
  - Extreme: networks of workstations; data centers: racks of processors with high-speed bus interconnects
  - Programming model: message passing

Shared memory MIMD
- Shared memory: CPUs or cores, bus, shared memory.
  - All our processors are now multi-core.
  - Programming model: OpenMP.
- NUMA: Non-Uniform Memory Access.
  - Memory hierarchy (registers, cache, local, global).
  - Potential for better performance, but problem: memory (cache) coherence (resolved by the architecture).

Speedup
- Why do we write parallel programs again? To get speedup: go faster than sequential.
- What is speedup?
  - T_1 = sequential time to execute a program (sometimes called T, or S)
  - T_p = time to execute the same program with p processors (cores, PEs)
  - S_p = T_1 / T_p, the speedup for p processors

Ideal Speedup, linear speedup
- Ideal: p processors give p-fold speedup: S_p = p. Ideal is not always possible. Why?
  - Certain parts of the computation are inherently sequential.
  - Tasks are data dependent, so not all processors are always busy, and they need to synchronize.
  - Remote data needs communication: memory wall PLUS communication wall.
- Linear speedup: S_p = f·p, with f usually less than 1.

Efficiency
- Speedup is usually not ideal, nor linear.
- We express this in terms of efficiency E_p:
  - E_p = S_p / p
  - E_p is the average utilization of the p processors. What is its range?
  - What does E_p = 1 signify?
  - What does E_p = f (0 < f < 1) signify?

More realistic speedups
- T_1 = 1, T_p = σ + (ο + π)/p
- S_p = 1 / (σ + (ο + π)/p)
  - σ is the sequential fraction of the program
  - π is the parallel fraction of the program, with σ + π = 1
  - ο is the parallel overhead (it does not occur in sequential execution)
- Draw speedup curves for p = 1, 2, 4, 8, 16, 32, with σ = ¼, ½ and ο = ¼, ½.
- When p goes to ∞, S_p goes to?

Plotting speedups
T_1 = 1, T_p = σ + (ο + π)/p, S_p = 1/(σ + (ο + π)/p)

p  | S_p (σ:1/4 ο:1/4) | S_p (σ:1/4 ο:1/2) | S_p (σ:1/2 ο:1/4) | S_p (σ:1/2 ο:1/2)
 1 | 1                 | 1                 | 1                 | 1
 2 | 1.33              | 1.14              | 1.14              | 1.00
 4 | 2.00              | 1.78              | 1.45              | 1.33
 8 | 2.67              | 2.46              | 1.68              | 1.60
16 | 3.2               | 3.05              | 1.83              | 1.78
32 | 3.56              | 3.46              | 1.91              | 1.88

[Figure: speedup S_p versus number of processors p for the four (σ, ο) combinations above]

This class: explicit parallel programming
Class Objectives
- Study the foundations of parallel programming.
- Write explicit parallel programs:
  - in C
  - using OpenMP (shared memory), MPI (message passing), and CUDA (data parallel)
- Execute and measure, plot speedups, interpret the outcomes (σ, π, ο), document performance.

Class Format
- Hands-on: write efficient parallel code. This often starts with writing efficient sequential code.
- Labs cover many details; discussions follow up on what's learned in the labs.
- PAs: write effective and scalable parallel programs. You will submit your programs and report their performance.

Parallel Programming steps
- Partition: break the program into (smallest possible) parallel tasks and partition the data accordingly.
- Communicate: decide which tasks need to communicate (exchange data).
- Agglomerate: group tasks into larger tasks, reducing communication overhead and taking LOCALITY into account.
- High Performance Computing is all about locality.

Locality
- A process (running on a processor) needs to have its data as close to it as possible. Processors have non-uniform memory access (NUMA):
  - registers: 0 (extra) cycles
  - cache: 1, 2, 3 cycles (there are multiple caches on a multicore chip)
  - local memory: 50s to 100s of cycles
  - global memory, processor-processor interconnect, disk
- Process:
  - COARSE, in the OS world: a program in execution
  - FINER GRAIN, in the HPC world: (a number of) loop iterations, or a function invocation

Exercise: summing numbers
- A 10x10 grid of 100 processors. Add up 4000 numbers, initially at the corner processor [0,0]; processors can only communicate with their NEWS (north/east/west/south) neighbors.
- Different cost models:
  - Each communication event may carry an arbitrarily large set of numbers (cost independent of volume).
    - Alternate: communication cost is proportional to volume.
  - Cost: communication takes 100x the time of an addition.
    - Alternate: communication and computation take the same time.

Outline
- Scatter: distribute the numbers to processors row- or column-wise.
  - Each processor along the edge gets a stack:
    - keeps whatever is needed for its row/column
    - forwards the rest to the processor to the south (or east)
    - initiates an independent scatter along its row/column
- Each processor adds its local set of numbers.
- Gather: sums are sent back in reverse order of the scatter and added up along the way.

Broadcast / Reduction
- Broadcast: one processor has a value, and we want every processor to have a copy of that value.
- Grid: the time for a processor to receive the value must be at least its distance (number of hops in the grid graph) from the source. For the opposite corner this is twice the side length of the grid, ~ 2·sqrt(p) hops.
- Reduction: the dual of broadcast.
  - First add n/p numbers on each processor.
  - Then do the broadcast in reverse and add along the way.

Analysis
- Scatter, add locally, communicate back to the corner, interspersed with one addition operation per step.
- Problem size N = 4000, machine size P = 100, machine topology: square grid, communication cost C = 100.
  - Scatter: 18 steps (get data, keep what you need, send the rest onwards): 2·(sqrt(P) − 1).
  - Add 40 numbers locally.
  - Reduce the local answers: 18 steps (one send, add 2 numbers each).
- Is P = 100 good for this problem?