Overhead of the spin-lock in UltraSPARC T2

Relevanta dokument
The present situation on the application of ICT in precision agriculture in Sweden

Adding active and blended learning to an introductory mechanics course

What Is Hyper-Threading and How Does It Improve Performance

Michael Q. Jones & Matt B. Pedersen University of Nevada Las Vegas

A study of the performance

Resultat av den utökade första planeringsövningen inför RRC september 2005

Taking Flight! Migrating to SAS 9.2!

Scalable Dynamic Analysis of Binary Code

Datorteknik och datornät. Case Study Topics

Grafisk teknik IMCDP. Sasan Gooran (HT 2006) Assumptions:

Isolda Purchase - EDI

Grafisk teknik. Sasan Gooran (HT 2006)

Grafisk teknik IMCDP IMCDP IMCDP. IMCDP(filter) Sasan Gooran (HT 2006) Assumptions:

Support Manual HoistLocatel Electronic Locks

Viktig information för transmittrar med option /A1 Gold-Plated Diaphragm

Pipelining i Intel 80486

Digitalteknik och Datorarkitektur 5hp

Semantic and Physical Modeling and Simulation of Multi-Domain Energy Systems: Gas Turbines and Electrical Power Networks

EMIR-European Market Infrastructure Regulation

Information technology Open Document Format for Office Applications (OpenDocument) v1.0 (ISO/IEC 26300:2006, IDT) SWEDISH STANDARDS INSTITUTE

A metadata registry for Japanese construction field

D-RAIL AB. All Rights Reserved.

Stiftelsen Allmänna Barnhuset KARLSTADS UNIVERSITET

Om oss DET PERFEKTA KOMPLEMENTET THE PERFECT COMPLETION 04 EN BINZ ÄR PRECIS SÅ BRA SOM DU FÖRVÄNTAR DIG A BINZ IS JUST AS GOOD AS YOU THINK 05

Boiler with heatpump / Värmepumpsberedare

Datorarkitekturer med operativsystem ERIK LARSSON

Measuring child participation in immunization registries: two national surveys, 2001

PowerCell Sweden AB. Ren och effektiv energi överallt där den behövs

Schenker Privpak AB Telefon VAT Nr. SE Schenker ABs ansvarsbestämmelser, identiska med Box 905 Faxnr Säte: Borås

Rastercell. Digital Rastrering. AM & FM Raster. Rastercell. AM & FM Raster. Sasan Gooran (VT 2007) Rastrering. Rastercell. Konventionellt, AM

Beijer Electronics AB 2000, MA00336A,

PFC and EMI filtering


Förändrade förväntningar

Datorarkitekturer med operativsystem ERIK LARSSON

Foto: Rona Proudfoot (some rights reserved) Datorarkitektur 1. Datapath & Control. December

Biblioteket.se. A library project, not a web project. Daniel Andersson. Biblioteket.se. New Communication Channels in Libraries Budapest Nov 19, 2007

Managing addresses in the City of Kokkola Underhåll av adresser i Karleby stad

Nya driftförutsättningar för Svensk kärnkraft. Kjell Ringdahl EON Kärnkraft Sverige AB

COPENHAGEN Environmentally Committed Accountants

SVENSK STANDARD SS-EN ISO 19108:2005/AC:2015

Enterprise App Store. Sammi Khayer. Igor Stevstedt. Konsultchef mobila lösningar. Teknisk Lead mobila lösningar

Styrteknik: Binära tal, talsystem och koder D3:1


3rd September 2014 Sonali Raut, CA, CISA DGM-Internal Audit, Voltas Ltd.

Dagens OS. Unix, Linux och Windows. Unix. Unix. En översikt av dagens OS Titt på hur de gör. Många varianter Mycket gemensamt. En del som skiljer

RADIATION TEST REPORT. GAMMA: 30.45k, 59.05k, 118.8k/TM1019 Condition D

Föreläsning 4 IS1300 Inbyggda system

Accomodations at Anfasteröd Gårdsvik, Ljungskile

Ringmaster RM3 - RM 5 RM3 RM 4 RM 5

EFFEKTIVA PROJEKT MED WEBBASERAD PROJEKTLEDNING

Image quality Technical/physical aspects

Mönster. Ulf Cederling Växjö University Slide 1

Här kan du sova. Sleep here with a good conscience

2.45GHz CF Card Reader User Manual. Version /09/15

Preschool Kindergarten

Olika OS. Unix, Linux och Windows. Unix. Unix. En översikt av ett par OS. Titt på hur de gör. Många varianter. Mycket gemensamt. En del som skiljer

EASA Standardiseringsrapport 2014

Cacheminne Intel Core i7

Bridging the gap - state-of-the-art testing research, Explanea, and why you should care

Aktivitetsschemaläggning för flerkärninga processorer

INSTALLATION INSTRUCTIONS

CHANGE WITH THE BRAIN IN MIND. Frukostseminarium 11 oktober 2018

Hållbar utveckling i kurser lå 16-17

This manual should be saved! EcoFlush Manual

Kursplan. AB1029 Introduktion till Professionell kommunikation - mer än bara samtal. 7,5 högskolepoäng, Grundnivå 1

The Municipality of Ystad

Protected areas in Sweden - a Barents perspective

Projekt Laxförvaltning för framtiden & Älvspecifik laxförvaltning Salmon Management for the Future / River Specific Management

SVENSK STANDARD SS-ISO :2010/Amd 1:2010

Solutions to exam in SF1811 Optimization, June 3, 2014

Affärsmodellernas förändring inom handeln

Prestandapåverkan på databashanterare av flertrådiga processorer. Jesper Dahlgren

Installation Instructions

Robust och energieffektiv styrning av tågtrafik

This manual should be saved! EcoFlush Manual. Wostman 2018:2

Sara Skärhem Martin Jansson Dalarna Science Park

1. Compute the following matrix: (2 p) 2. Compute the determinant of the following matrix: (2 p)

Urban Runoff in Denser Environments. Tom Richman, ASLA, AICP

Datorarkitekturer med operativsystem ERIK LARSSON

8 < x 1 + x 2 x 3 = 1, x 1 +2x 2 + x 4 = 0, x 1 +2x 3 + x 4 = 2. x 1 2x 12 1A är inverterbar, och bestäm i så fall dess invers.

Module 1: Functions, Limits, Continuity

Parallellism i NVIDIAs Fermi GPU

Mapping sequence reads & Calling variants

Manhour analys EASA STI #17214

2.1 Installation of driver using Internet Installation of driver from disk... 3

Grass to biogas turns arable land to carbon sink LOVISA BJÖRNSSON

Kunskapsintensiva företagstjänster en förutsättning för en konkurrenskraftig industri. HLG on Business Services 2014

denna del en poäng. 1. (Dugga 1.1) och v = (a) Beräkna u (2u 2u v) om u = . (1p) och som är parallell

CUSTOMER READERSHIP HARRODS MAGAZINE CUSTOMER OVERVIEW. 63% of Harrods Magazine readers are mostly interested in reading about beauty

Accessing & Allocating Alternatives

EXPERT SURVEY OF THE NEWS MEDIA

Geo installationsguide

Make a speech. How to make the perfect speech. söndag 6 oktober 13

FANNY AHLFORS AUTHORIZED ACCOUNTING CONSULTANT,

PRESS FÄLLKONSTRUKTION FOLDING INSTRUCTIONS

Kursplan. NA1032 Makroekonomi, introduktion. 7,5 högskolepoäng, Grundnivå 1. Introductory Macroeconomics

Materialplanering och styrning på grundnivå. 7,5 högskolepoäng

Transkript:

Overhead of the spin-lock in UltraSPARC T2 5th HiPEAC Industrial Workshop June 4th, 2008 Vladimir Cakarevic, Petar Radojkovic, Javier Verdú, Alex Pajuelo, Roberto Gioiosa, Francisco J. Cazorla, Mario Nemirovsky, Mateo Valero Barcelona, Spain. June 4th, 2008 1

Motivation Synchronization methods for multithreaded applications Trade-off: Overhead vs Responsiveness To spin OS with no capabilities to suspend and resume a process NetraDPS is a run-time system with no process scheduler High responsiveness, but crucial impact on high contention locks on multicore multithreaded processors or not to spin Make the process sleep until the lock is released Release the CPU without going to sleep Software support Yield() system call Hardware support to delay or even pause the spinning loop Pause and monitor/mwait (ISA extensions included by Intel) Barcelona, Spain. June 4th, 2008 2

Objectives Study of spin-lock loop overhead in UltraSPARC T2 Performance degradation depending on the amount of resources shared among active threads NetraDPS is an ideal framework to do this analysis (it introduces almost no overhead) Detection of the most critical shared resources Responsible sources of the performance slowdown (overhead) Modification of the delay code by adding different type of long-latency instructions Overcome the lack of system support to pause an active thread Reduction of spin-lock loop overhead on the active thread Decreasing the demand of critical shared resources (e.g. fetch bandwidth) Barcelona, Spain. June 4th, 2008 3

Methodology We measure the execution time of a single-threaded application 1 active thread (the thread under study) 1 or more threads constantly running the spin-lock loop We use two metrics Slowdown (overhead) Execution time of active thread vs running in isolation Responsiveness (indirect metric) The time interval between two spin-lock reads Measurements framework is built using FAME techniques Vera et al. Analysis of system overhead on parallel computers, PACT 2007 Barcelona, Spain. June 4th, 2008 4

Levels of sharing resources on UltraSPARC T2 Experiments executed on Sun UltraSPARC T2 CMT processor 8 cores, 64 hardware threads (strands) running at 1,4GHz 2 sets (pipes) of 4 strands per core Each pipe has its own execution unit Pipes fetch and execute instructions in parallel Shared per-core FP unit, Load Store unit, and cryptography unit L2 cache is shared per chip UltraSPARC T2 shows different levels of shared resources than the T1 processor Barcelona, Spain. June 4th, 2008 5

Framework Dedicated virtual machines (logical domains) Managed by Sun Logical Domains (LDoms) software No measurable overhead when using LDoms software Each dedicated domain contains two cores and 4GB of dedicated memory The whole system contains 64GB of physical memory NetraDPS: low-overhead run time environment It provides a smaller set of functions compared to other full-fledged OSes (e.g. Linux, Solaris) But, it introduces almost no overhead It executes bootable images Thread functions are bound to core at compile time (mapping file) Applications cannot migrate to other strands at run time NetraDPS doesn t provide a run time scheduler Barcelona, Spain. June 4th, 2008 6

Benchmarks (Active threads) Single-behavior A set of micro-benchmarks that execute only single instruction all the time Add, branch always (baa), load that hits in L2$, load that hits in L1$, integer mul (mulx) Multiple-behavior A set of applications that emulate some real tasks Matrix-Vector multiplication multiplies very large matrix with large vector Mostly memory bound (large number of L2$ misses) running Int and FP versions QuickSort recursive quicksort algorithm CPU bound and short-latency operations Hashing Use medium to large 2Dhash map CPU bound and non-sequential, pseudo-random memory accesses Barcelona, Spain. June 4th, 2008 7

Linux s spin-lock loop Linux s spin-lock loop Kernel 2.6.25.4 Linux default spin-lock loop Linux spin-lock loop shows low CPI Thread executing spin-lock loops are ready to fetch most of the time Barcelona, Spain. June 4th, 2008 8

Overhead Evaluation Spin-lock loops generate significant negative impact on the strands of the same core The fetch bandwidth is the main source of interaction High slowdown to low CPI tasks Barcelona, Spain. June 4th, 2008 9

Variants of the Linux spin-lock loop Reducing the fetch bandwidth demand of spin-lock loop We insert a long-latency operation inside the loop Less fetch bandwidth and resource requirements This inserted instruction must not cause additional overhead due to shared resources FDIV floating point division 33 cycles IDIV Integer division 12-41 cycles CASX always goes directly to L2$ 20-30 cycles L2miss load misses every time in L2$ (pointer chasing) 220 cycles Barcelona, Spain. June 4th, 2008 10

Results Single-behavior benchmark Multiple-behavior benchmark We reduce the average overhead suffered by the active thread (multiple-behavior benchmark) from 42% on avg. (61% worst case) to 1.5% on avg. (2% worst case) Reduction of the fetch bandwidth demand of spinning threads The worst response time is the time of a loop iteration The time that passes from releasing and acquiring the lock by a spinning thread L2miss is around 220 cycles vs 10 cycles in default Linux spin-lock Barcelona, Spain. June 4th, 2008 11

Conclusions The Linux s spin-lock loop introduces significant negative interaction to other threads running in the same core It can seriously degrade the performance of other active threads The fetch bandwidth is the most critical shared resource and the responsible of performance loss Inserting a long-latency instruction reduces the fetch demand of the spin-lock loop Slower responsiveness Overhead in other shared resources Our delayed spin-lock loop significantly reduces the contention in the fetch stage From 42% (Linux s spin-lock loop) to 1.5% (spin-lock loop with L2miss instruction) of slowdown Future work Use the delayed spin-lock loop in real multithreaded applications and quantify improvement Propose hardware solutions to avoid/reduce the use of shared resources Barcelona, Spain. June 4th, 2008 12

Acknowledgements This work has been supported by The Ministry of Science and Technology of Spain Under contracts TIN-2004-07739-C02-01, TIN-2007-60625 The HiPEAC European Network of Excellence Collaboration Agreement between Sun Microsystems Inc. and BSC We wish to thank Jochen Behrens, Gunawan Ali-Santosa and Ariel Hendel from Sun Microsystems Inc. for their technical support Barcelona, Spain. June 4th, 2008 13