
Institutionen för systemteknik
Department of Electrical Engineering

Examensarbete

A Selection of H.264 Encoder Components Implemented and Benchmarked on a Multi-core DSP Processor

Master's thesis (Examensarbete) carried out in Computer Engineering at Tekniska högskolan i Linköping
by
Jonas Einemo
Magnus Lundqvist

LiTH-ISY-EX--10/4392--SE
Linköping 2010

Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden
Linköpings tekniska högskola, Linköpings universitet, 581 83 Linköping

A Selection of H.264 Encoder Components Implemented and Benchmarked on a Multi-core DSP Processor

Master's thesis (Examensarbete) carried out in Computer Engineering at Tekniska högskolan i Linköping
by
Jonas Einemo
Magnus Lundqvist

LiTH-ISY-EX--10/4392--SE

Handledare (Supervisor): Olof Kraigher, ISY, Linköpings universitet
Examinator (Examiner): Dake Liu, ISY, Linköpings universitet

Linköping, 15 June, 2010

Avdelning, Institution (Division, Department): Division of Computer Engineering, Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden
Datum (Date): 2010-06-15
Språk (Language): Svenska/Swedish, Engelska/English
Rapporttyp (Report category): Licentiatavhandling, Examensarbete, C-uppsats, D-uppsats, Övrig rapport
ISBN:
ISRN: LiTH-ISY-EX--10/4392--SE
Serietitel och serienummer (Title of series, numbering):
ISSN:
URL för elektronisk version: http://www.da.isy.liu.se/en/index.html , http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-4292
Titel (Title): A Selection of H.264 Encoder Components Implemented and Benchmarked on a Multi-core DSP Processor
Författare (Author): Jonas Einemo, Magnus Lundqvist

Sammanfattning (Abstract): H.264 is a video coding standard which offers a high data compression rate at the cost of a high computational load. This thesis evaluates how well parts of the H.264 standard can be implemented for a new multi-core digital signal processing processor architecture called epuma. The thesis investigates if real-time encoding of high definition video sequences could be performed. The implementation consists of the motion estimation, motion compensation, discrete cosine transform, inverse discrete cosine transform, quantization and rescaling parts of the H.264 standard. Benchmarking is done using the epuma system simulator and the results are compared to an implementation of an existing H.264 encoder for another multi-core processor architecture called STI Cell. The results show that the selected parts of the H.264 encoder could be run on 6 calculation cores in 5 million cycles per frame. This setup leaves 2 calculation cores to run the remaining parts of the encoder.

Nyckelord (Keywords): epuma, DSP, SIMD, H.264, Parallel Programming, Motion Estimation, DCT

Abstract

H.264 is a video coding standard which offers a high data compression rate at the cost of a high computational load. This thesis evaluates how well parts of the H.264 standard can be implemented for a new multi-core digital signal processing processor architecture called epuma. The thesis investigates if real-time encoding of high definition video sequences could be performed. The implementation consists of the motion estimation, motion compensation, discrete cosine transform, inverse discrete cosine transform, quantization and rescaling parts of the H.264 standard. Benchmarking is done using the epuma system simulator and the results are compared to an implementation of an existing H.264 encoder for another multi-core processor architecture called STI Cell. The results show that the selected parts of the H.264 encoder could be run on 6 calculation cores in 5 million cycles per frame. This setup leaves 2 calculation cores to run the remaining parts of the encoder.

Acknowledgments

We would like to thank everyone who has helped us during our thesis work, especially our supervisor Olof Kraigher for all the help and useful hints, and our examiner Professor Dake Liu for his support, comments and the opportunity to do this thesis. We would also like to thank Jian Wang for the support on the DMA firmware, Jens Ogniewski for the help with understanding the H.264 standard, and our families and friends for their support and for bearing with us during the work on this thesis.

Linköping, June 2010
Jonas Einemo, Magnus Lundqvist

Contents

1 Introduction
  1.1 Background
  1.2 Purpose
  1.3 Scope
  1.4 Way of Work
  1.5 Outline
2 Overview of Video Coding
  2.1 Introduction to Video Coding
  2.2 Color Spaces
  2.3 Predictive Coding
  2.4 Transform Coding and Quantization
  2.5 Entropy Coding
  2.6 Quality Measurements
    2.6.1 Subjective Quality
    2.6.2 Objective Quality
3 Overview of H.264
  3.1 Introduction to H.264
  3.2 Coded Slices
    3.2.1 I Slice
    3.2.2 P Slice
    3.2.3 B Slice
    3.2.4 SP Slice
    3.2.5 SI Slice
  3.3 Intra Prediction
  3.4 Inter Prediction
    3.4.1 Hexagon search
  3.5 Transform Coding and Quantization
    3.5.1 Discrete Cosine Transform
    3.5.2 Inverse Discrete Cosine Transform
    3.5.3 Quantization
    3.5.4 Rescaling
  3.6 Deblocking filter
  3.7 Entropy coding
4 Overview of the epuma Architecture
  4.1 Introduction to epuma
  4.2 epuma Memory Hierarchy
  4.3 Master Core
    4.3.1 Master Memory Architecture
    4.3.2 Master Instruction Set
    4.3.3 Datapath
  4.4 Sleipnir Core
    4.4.1 Sleipnir Memory Architecture
    4.4.2 Datapath
    4.4.3 Sleipnir Instruction Set
    4.4.4 Complex Instructions
  4.5 DMA Controller
  4.6 Simulator
5 Elaboration of Objectives
  5.1 Task Specification
    5.1.1 Questions at Issue
  5.2 Method
  5.3 Procedure
6 Implementation
  6.1 Motion Estimation
    6.1.1 Motion Estimation Reference
    6.1.2 Complex Instructions
    6.1.3 Sleipnir Blocks
    6.1.4 Master Code
  6.2 Discrete Cosine Transform and Quantization
    6.2.1 Forward DCT and Quantization
    6.2.2 Rescaling and Inverse DCT
7 Results and Analysis
  7.1 Motion Estimation
    7.1.1 Kernel 1
    7.1.2 Kernel 2
    7.1.3 Kernel 3
    7.1.4 Kernel 4
    7.1.5 Kernel 5
    7.1.6 Master Code
    7.1.7 Summary
  7.2 Transform and Quantization
8 Discussion
  8.1 DMA
  8.2 Main Memory
  8.3 Program Memory
  8.4 Constant Memory
  8.5 Vector Register File
  8.6 Register Forwarding
  8.7 New Instructions
    8.7.1 SAD Calculations
    8.7.2 Call and Return
  8.8 Master and Sleipnir Core
  8.9 epuma H.264 Encoding Performance
  8.10 epuma Advantages
  8.11 Observations
9 Conclusions and Future Work
  9.1 Conclusions
  9.2 Future Work
Bibliography
A Proposed Instructions
B Results

List of Figures

2.1 Overview of the data flow in a basic encoder and a decoder
2.2 YUV 4:2:0 sampling format
3.1 Overview of the data flow in an H.264 encoder
3.2 4×4 luma prediction modes
3.3 16×16 luma prediction modes
3.4 Different ways to split a macroblock in inter prediction
3.5 Subsamples interpolated from neighboring pixels
3.6 Multiple frame prediction
3.7 Large (a) and small (b) search pattern in the hexagon search algorithm
3.8 Movement of the hexagon pattern in a search area and the change to the smaller search pattern
3.9 DCT functional schematic
3.10 IDCT functional schematic
3.11 Filtering order of a 16×16 pixel macroblock, with start in A and end in H for luminance (a) and start in 1 and end in 4 for chrominance (b)
3.12 Pixels in blocks adjacent to vertical and horizontal boundaries
4.1 epuma memory hierarchy
4.2 epuma star network interconnection
4.3 Senior datapath for short instructions
4.4 Sleipnir datapath pipeline schematic
4.5 Sleipnir Local Store switch
6.1 Motion estimation program flowchart
6.2 Motion estimation computational flowchart
6.3 Hexagon search program flow controller
6.4 Proposed implementation of call and return hardware
6.5 Reference macroblock overlap
6.6 Reference macroblock partitioning for 13 data macroblocks
6.7 Master program flowchart
6.8 Memory allocation of data memory in the master (a) and main memory allocation (b)
6.9 Sleipnir core motion estimation task partitioning and synchronization
6.10 DCT flowchart
6.11 Memory transpose schematic
7.1 Cycle scaling from 1 to 8 Sleipnir cores for simulation of Riverbed
7.2 Frame 10 from Pedestrian Area video sequence
7.3 Difference between frame 10 and frame 11 in Pedestrian Area video sequence
7.4 Motion vector field calculated by kernel 5 on frames 10 and 11 of the Pedestrian Area video sequence
7.5 Difference between frame 10 and frame 11 in Pedestrian Area video sequence using motion compensation
8.1 Sleipnir core DCT task partitioning and synchronization
8.2 Memory allocation of macroblock in LVM for intra coding
A.1 HVBSUMABSDWA
A.2 HVBSUMABSDNA
A.3 HVBSUBWA
A.4 HVBSUBNA

List of Tables

3.1 Q_step for a few different values of QP
3.2 Multiplication factor MF
3.3 Scaling factor V
4.1 Pipeline specification
4.2 Register file access types
4.3 Address register increment operations
4.4 Addressing mode examples
7.1 Short names for kernels that have been tested
7.2 Description of table columns
7.3 Motion estimation results from simulation on Riverbed frames 10 and 11 with kernel 1 using 1 Sleipnir core
7.4 Motion estimation results from simulation on Riverbed frames 10 and 11 with kernel 1 using 8 Sleipnir cores
7.5 Block 1 costs
7.6 Motion estimation results from simulation on Riverbed frames 10 and 11 with kernel 2 using 1 Sleipnir core
7.7 Motion estimation results from simulation on Riverbed frames 10 and 11 with kernel 2 using 8 Sleipnir cores
7.8 Block 2 costs
7.9 Motion estimation results from simulation on Riverbed frames 10 and 11 with kernel 3 using 8 Sleipnir cores
7.10 Kernel 3 costs
7.11 Motion estimation results from simulation on Riverbed frames 10 and 11 with kernel 4 using 4 Sleipnir cores
7.12 Motion estimation results from simulation on Riverbed frames 10 and 11 with kernel 4 using 8 Sleipnir cores
7.13 Kernel 4 costs
7.14 Motion estimation results from simulation on Sunflower frames 10 and 11 with kernel 5 using 8 Sleipnir cores
7.15 Motion estimation results from simulation on Blue sky frames 10 and 11 with kernel 5 using 8 Sleipnir cores
7.16 Motion estimation results from simulation on Pedestrian area frames 10 and 11 with kernel 5 using 8 Sleipnir cores
7.17 Motion estimation results from simulation on Riverbed frames 10 and 11 with kernel 5 using 4 Sleipnir cores
7.18 Motion estimation results from simulation on Riverbed frames 10 and 11 with kernel 5 using 8 Sleipnir cores
7.19 Kernel 5 costs
7.20 Master code cost
7.21 Prolog and epilog cycle costs
7.22 Simulated epilog cycle cost including waiting for the last Sleipnir to finish
7.23 DMA cycle costs
7.24 Costs for DCT with quantization block and IDCT with rescaling block
B.1 Simulation cycle cost of motion estimation kernels

Abbreviations

AGU            Address Generation Unit
ALU            Arithmetic Logic Unit
AVC            Advanced Video Coding
CABAC          Context-based Adaptive Binary Arithmetic Coding
CAVLC          Context-based Adaptive Variable Length Coding
CB             Copy Back
CM             Constant Memory
CODEC          COder/DECoder
DCT            Discrete Cosine Transform
DMA            Direct Memory Access
DSP            Digital Signal Processing
epuma          Embedded Parallel Digital Signal Processing Processor Architecture with Unique Memory Access
FIR            Finite Impulse Response
FPS            Frames Per Second
FS             Full Search
HDTV           High-Definition Television
HVBSUBNA       Half Vector Bytewise SUBtraction Not word Aligned
HVBSUBWA       Half Vector Bytewise SUBtraction Word Aligned
HVBSUMABSDNA   Half Vector Bytewise SUM of ABSolute Differences Not word Aligned
HVBSUMABSDWA   Half Vector Bytewise SUM of ABSolute Differences Word Aligned
IDCT           Inverse Discrete Cosine Transform
IEC            International Electrotechnical Commission
ISO            International Organization for Standardization
ITU            International Telecommunications Union
LS             Local Storage
LVM            Local Vector Memory
MAE            Mean Absolute Error
MB             Macroblock
MC             Motion Compensation
ME             Motion Estimation
MF             Multiplication Factor
MPEG           Moving Picture Experts Group
MSE            Mean Square Error
NAL            Network Abstraction Layer
NoC            Network on Chip
PM             Program Memory
PSNR           Peak Signal to Noise Ratio
QP             Quantization Parameter
RAM            Random Access Memory
RGB            Red, Green and Blue, a color space
ROM            Read Only Memory
SAD            Sum of Absolute Differences
SPRF           SPecial Register File
STI            Sony Toshiba IBM
V              Rescaling Factor
VCEG           Video Coding Experts Group
VRF            Vector Register File
YUV            A color space

Chapter 1

Introduction

This chapter gives a background to the thesis, defines its purpose and scope, describes the way of work and presents the outline of the thesis.

1.1 Background

With new handheld devices and mobile systems offering more advanced services, the need for increased computational power at low cost, both in terms of chip area and power dissipation, is ever increasing. Now that video playback and recording are standard applications rather than premium features in mobile devices, high computational power at a low cost is still a problem without a sufficient solution.

The Division of Computer Engineering at the Department of Electrical Engineering at Linköpings Tekniska Högskola has for some time been part of a research project called epuma, which can be read out as Embedded Parallel Digital Signal Processing Processor Architecture with Unique Memory Access. The development is driven by the demands of the next generation of digital signal processing. By developing a cheap and low-power processor with large calculation power, this new architecture aims to meet tomorrow's demands in digital signal processing. The main applications for the processor are future radio base stations, radar and High-Definition Television (HDTV).

H.264 is a standard for video compression that saw daylight back in 2003. It is now a mature and widely spread standard that is used in Blu-ray, popular video streaming websites like YouTube, television services and video conferencing. It provides very good compression at the cost of high computational complexity. The hope is that the epuma multi-core architecture will be able to handle real-time video encoding using the H.264 standard.

At the Division of Computer Engineering, previous work has been done on implementing an H.264 encoder for another multi-core architecture. This work was done on the STI Cell, which is used in e.g. the popular video gaming console PLAYSTATION 3.

1.2 Purpose

The purpose of this master thesis is to evaluate the capability of the epuma processor architecture with respect to real-time video encoding using the H.264 video compression standard, and to find and expose possible areas of improvement in the epuma architecture. This will be done by implementing parts of an H.264 encoder and, if possible, comparing the cycle counts to those of the previously implemented STI Cell H.264 encoder.

1.3 Scope

By implementing the most computationally expensive parts of the H.264 standard it is possible to better estimate whether the epuma processor architecture is capable of encoding video using the H.264 standard in real time. Studying the H.264 standard, it can be seen that entropy coding is the most time-consuming part if it is done in software. Because of the large amount of bit manipulation needed, it is not feasible to perform entropy coding on the processor. Therefore an early decision was made that entropy coding had to be hardware accelerated and that it should not be a part of this thesis. In this thesis no exact hardware costs for performance improvements will be calculated; instead, their feasibility will be reasoned about. The time constraint of this master thesis is twenty weeks, which restricts the extent of the work. Because of the time constraint, some parts of a complete encoder have had to be left out.

1.4 Way of Work

One of the most time-consuming tasks is motion estimation, which together with the discrete cosine transform and quantization became the primary target for evaluation. First a working implementation was produced. An iterative development process was then used to refine the implementations and reach better performance. The partial implementations of the H.264 standard were written for the epuma system simulator. The simulator was also used for all performance measurements of the implementations, using frames from several different commonly used test video sequences. Once the performance measurement results were acquired, they were analyzed and conclusions were drawn. The way of work is elaborated in section 5.2 and section 5.3.

1.5 Outline

This thesis is aimed at an audience with an education in electrical engineering, computer engineering or similar. Expertise in video coding or the H.264 standard is not necessary, as the main principles of these topics will be covered. The outline of this thesis is ordered as naturally as possible: this introductory chapter is followed by theoretical chapters containing the topics needed to understand the rest of the thesis. The first of these is chapter 2, which covers the basics of video coding, followed by chapter 3, which offers an introduction to the H.264 video coding standard. The last theoretical chapter is chapter 4, which covers the hardware architecture and toolchain of the epuma processor. The theory is followed by chapter 5, where a more detailed task specification, method and procedure of the thesis are presented with help from the knowledge obtained in the theoretical chapters. After that, chapter 6 describes the function and development of the implementations produced. Chapter 7 then presents the results obtained and gives an analysis of them. Chapter 8 contains a discussion of the results as well as ideas thought of while working on this thesis. The final chapter is chapter 9, which contains the conclusions and the future work that could be done in the area.

Chapter 2

Overview of Video Coding

This chapter gives an introduction to video coding, color spaces, predictive coding, transform coding and entropy coding. This knowledge is necessary to understand the rest of the thesis.

2.1 Introduction to Video Coding

A video consists of several images, called frames, shown in a sequence. The amount of disk space required to store a sequence of raw data is huge, and therefore video coding is needed. The purpose of video coding is to minimize the data to store on disk or to send over a network, without decreasing the image quality too much. There are a lot of techniques and algorithms on the market to do this, such as MPEG-2, MPEG-4 and H.264/AVC. [10]

[Figure 2.1: Overview of the data flow in a basic encoder and a decoder]

All of these algorithms are constructed from a similar template. First some technique is used to reduce the amount of data to be transformed. The video is then transformed with, for example, a Discrete Cosine Transform (DCT). After this a quantization is performed to shrink the data further. The data is then pushed through an entropy coder such as Huffman or a more advanced algorithm such as Context-based Adaptive Binary Arithmetic Coding (CABAC) or Context-based Adaptive Variable Length Coding (CAVLC), which all compress the data based on patterns in the bit-stream. [10] The data flow of a basic encoder and a basic decoder is illustrated in figure 2.1.

As mentioned, a video sequence consists of many frames. In video coding these frames can be divided into something called slices. A slice can be a part of a frame or contain the complete frame. This slice division is advantageous because it gives the ability to know e.g. that data in a slice does not depend on data outside the slice. The frames are also divided into something called macroblocks. A macroblock is a block consisting of 16 × 16 pixels. This partitioning of the data makes computations easier to organize and structure. [10]

2.2 Color Spaces

To understand video coding some knowledge about different color spaces is needed. One of the available color spaces is RGB, whose name comes from its components red, green and blue. With these three colors and different intensities of them it is possible to visualize all colors in the spectrum. Another commonly used color space is YCbCr, also called YUV. In this color space Y represents the luminance (luma) component, which corresponds to the brightness of a specific pixel. The other two components, namely Cb and Cr, are chrominance (chroma) components which carry the color information. [10]

The conversion from the RGB color space to the YUV color space is shown in equation (2.1).

    Y  = kr R + kg G + kb B
    Cb = B - Y                      (2.1)
    Cr = R - Y
    Cg = G - Y

As seen in equation (2.1) there also exists a third chrominance component for green, namely Cg, which thanks to equation (2.2) can be calculated as shown in equation (2.3). This means that Cg can be calculated by the decoder and does not have to be transmitted, which is advantageous in the sense of data compression. [10]

    kb + kr + kg = 1                (2.2)

    kg Cg = -(kb Cb + kr Cr)        (2.3)

The human eye is more sensitive to luminance than to chrominance, and because of that a smaller number of bits can be used to represent the chrominance and a larger number to represent the luminance. With this feature of the YUV color space the total amount of bits needed to encode a pixel can be reduced. A common way to do this is by applying the 4:2:0 sampling format.
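The conversion above can be sketched in a few lines of Python. The weights used below (kr = 0.299, kg = 0.587, kb = 0.114) are the ITU-R BT.601 luma coefficients, chosen here only as a common example of a set that sums to one; the text above does not fix a particular set.

```python
# Sketch of the RGB -> YUV conversion of equation (2.1), using the
# ITU-R BT.601 luma weights as an example set with kr + kg + kb = 1.
KR, KG, KB = 0.299, 0.587, 0.114

def rgb_to_yuv(r, g, b):
    """Return (Y, Cb, Cr); Cg is derivable and is not transmitted."""
    y = KR * r + KG * g + KB * b
    cb = b - y
    cr = r - y
    return y, cb, cr

def cg_from_cb_cr(cb, cr):
    # Equation (2.3): kg * Cg = -(kb * Cb + kr * Cr)
    return -(KB * cb + KR * cr) / KG

y, cb, cr = rgb_to_yuv(200, 100, 50)
# The direct value Cg = G - Y matches the value the decoder can
# recover from Cb and Cr alone, so Cg never needs to be sent.
assert abs(cg_from_cb_cr(cb, cr) - (100 - y)) < 1e-9
```

The final assertion is exactly the point of equation (2.3): the decoder reconstructs the green difference from the two transmitted chroma components.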

[Figure 2.2: YUV 4:2:0 sampling format, showing the Y, Cb and Cr sample positions]

The 4:2:0 sampling format can be described as a 12 bits per pixel format where there are 2 samples of chrominance for every 4 samples of luminance, as shown in figure 2.2. If each sample is stored using 8 bits this adds up to 6 × 8 = 48 bits for 4 YUV 4:2:0 pixels, with an average of 48/4 = 12 bits per pixel. [10]

2.3 Predictive Coding

There are two kinds of predictive coding: intra coding and inter coding. By studying a picture it is easy to see that some parts of the picture are very similar; this is called spatial correlation. The predictive coding that uses these spatial correlations within a frame to form a prediction of other parts of the frame is called intra coding. By studying a sequence of pictures or a video sequence it can be seen that there is usually not much difference between the frames; this is called temporal correlation. By exploiting this temporal correlation a difference, also called a residue, can be calculated which consists of smaller values and can therefore be described with a smaller number of bits. This results in better data compression. The predictive coding that uses temporal correlations between different frames is called inter coding. [10]

2.4 Transform Coding and Quantization

The purpose of transform coding is to convert the image data or motion compensated data into another representation of the data. This can be done with a number of different algorithms, of which the block-based Discrete Cosine Transform (DCT) is one of the most common in video coding. The DCT algorithm converts the data to be described into sums of cosine functions oscillating at different frequencies. [10]
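As a concrete illustration of the transform step just described, the block-based DCT can be sketched with the standard orthonormal DCT-II matrix for 4 × 4 blocks. Note that H.264 actually uses an integer approximation of this transform (covered in chapter 3); the floating-point version below is only meant to show the principle of energy compaction.

```python
# Minimal 4x4 two-dimensional DCT sketch (orthonormal DCT-II).
# H.264 uses an integer approximation of this transform; this float
# version only illustrates how a smooth block compacts its energy
# into a few low-frequency coefficients.
import math

N = 4
# C[i][j] = a_i * cos((2j + 1) * i * pi / (2N)),
# with a_0 = sqrt(1/N) and a_i = sqrt(2/N) for i > 0.
C = [[(math.sqrt(1.0 / N) if i == 0 else math.sqrt(2.0 / N))
      * math.cos((2 * j + 1) * i * math.pi / (2 * N))
      for j in range(N)] for i in range(N)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def dct2(block):
    # 2-D separable DCT: Y = C * X * C^T
    return matmul(matmul(C, block), transpose(C))

# A perfectly flat block: all energy ends up in the DC coefficient,
# so everything except one coefficient quantizes to zero.
flat = [[16] * N for _ in range(N)]
coeffs = dct2(flat)
assert abs(coeffs[0][0] - 64.0) < 1e-9  # DC = N * sample value
assert all(abs(coeffs[i][j]) < 1e-9
           for i in range(N) for j in range(N) if (i, j) != (0, 0))
```

For real image blocks the result is not this extreme, but the bulk of the energy still gathers in the low-frequency corner, which is what makes the quantization step effective.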

There are several different transforms that could be used in video coding, but the common property of them all is that they are reversible, meaning the transform can be reversed without loss of data. This is an important property because otherwise drift between the encoder and decoder can occur, and special algorithms would have to be applied to correct these errors. As mentioned before, block based transform coding is the most common. When using block based transform coding the picture is divided into smaller blocks such as 8×8 or 4×4 pixels. Each block is then transformed with the chosen transform. The transformed data is then quantized to remove high frequency data. This can be done because the human eye is insensitive to higher frequencies, and these can therefore be removed without any noticeable loss of quality. The quantizer re-maps the input data with one range of values to output data with a smaller range of possible values. This means the output can be coded with fewer bits than the original data, and in this way data compression is achieved. [10]

2.5 Entropy Coding

Entropy coding is a lossless data compression technique. The different entropy coding algorithms encode symbols that occur often with few bits and symbols that occur less often with more bits. The bits are all put in a bitstream that can be written to disk or sent over a network. In video coding these symbols can be quantized transform coefficients, motion vectors, headers or other information needed to decode the video stream. As mentioned earlier, a few of the usual entropy coding algorithms are Huffman, CABAC and CAVLC. [10]

2.6 Quality Measurements

There exist several ways to measure the quality of images and compare uncompressed images with reconstructed ones to evaluate video coding algorithms.
2.6.1 Subjective Quality

Subjective quality is the quality that someone watching an image or a video sequence experiences. Subjective quality can be measured by having evaluators rate each part of a series of images or video sequences with different properties. This is a time consuming and impractical way of measurement in most circumstances. [10]

2.6.2 Objective Quality

To enable more automatic measurements of quality some algorithms are commonly used. One of these is Peak Signal to Noise Ratio (PSNR), which can be used to measure the quality of a reconstructed image by comparing it to an uncompressed

one. PSNR gives a logarithmic scale where a higher value is better. The Mean Square Error (MSE) is used in the calculation of PSNR and is calculated as

MSE = (1 / (m·n)) · Σ_{i=1}^{m} Σ_{j=1}^{n} (C(i, j) - R(i, j))²    (2.4)

where n is the image height, m is the image width and C and R are the current and reference images being compared. With the MSE value the PSNR can be calculated as

PSNR = 10 · log10( (2^bits - 1)² / MSE )    (2.5)

where 2^bits - 1 is the largest representable value of a pixel with the specified number of bits. [10]
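Equations (2.4) and (2.5) translate directly into code. The sketch below is illustrative only; images are represented as plain lists of rows of 8-bit pixel values.

```python
# Sketch of MSE (2.4) and PSNR (2.5) between a current image C and a
# reference image R, both given as lists of rows of 8-bit pixels.
import math

def mse(C, R):
    m, n = len(C), len(C[0])          # image dimensions
    return sum((C[i][j] - R[i][j]) ** 2
               for i in range(m) for j in range(n)) / (m * n)

def psnr(C, R, bits=8):
    e = mse(C, R)
    if e == 0:
        return float("inf")           # identical images
    peak = (2 ** bits - 1) ** 2       # (2^bits - 1)^2, e.g. 255^2
    return 10 * math.log10(peak / e)

ref = [[100, 100], [100, 100]]
cur = [[100, 102], [98, 100]]
print(mse(cur, ref))                  # 2.0
print(round(psnr(cur, ref), 2))       # roughly 45 dB
```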

Chapter 3

Overview of H.264

This chapter presents an overview of the H.264 video compression standard. Some sections are more detailed than others because of their relevance to the master thesis. The topics covered include the different frame and slice types, intra and inter prediction, transform coding, quantization, the deblocking filter and finally entropy coding.

3.1 Introduction to H.264

H.264 [12], also known as Advanced Video Coding (AVC) and MPEG-4 Part 10, is a standard for video compression. The standard has been developed by the Video Coding Experts Group (VCEG) of the International Telecommunication Union (ITU) and the Moving Picture Experts Group (MPEG), which is a working group of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). The main objective when H.264 was developed was to maximize the efficiency of the video compression, but also to provide a standard with high transmission efficiency which supports reliable and robust transmission of data over different channels and networks. [10]

H.264 is divided into a number of different profiles. These profiles include different parts of the video coding features from the H.264 standard. Some of the most common ones are the Extended, Baseline, Constrained Baseline and Main profiles. The Baseline profile supports inter and intra coding and entropy coding with CAVLC. The Main profile supports interlaced video, inter coding using B-slices and entropy coding using CABAC. The Extended profile does not support interlaced video nor CABAC, but supports switching slices and has improved error resilience. [10]

In figure 3.1 a detailed view of the data flow in an H.264 encoder can be seen. The figure illustrates the important prediction coding and how it is connected to the other parts of the encoder. The in-loop deblocking filter can also be seen in this illustration. [10]

Figure 3.1: Overview of the data flow in an H.264 encoder (motion estimation, motion compensation, DCT, quantization, reordering and entropy encoding, with a reconstruction loop of rescaling, IDCT and deblocking filter producing the NAL output)

3.2 Coded Slices

A frame can be divided into smaller parts called slices. These slices can then be coded in different modes. The different coding modes in H.264 are presented below. [14]

3.2.1 I Slice

In the I slice all macroblocks are intra coded. The encoder uses the spatial correlations within a single slice to code that slice. The I slice requires the most space of all the different types of slices after it has been encoded. [10]

3.2.2 P Slice

P slices can contain both I coded macroblocks and P coded macroblocks. P coded macroblocks are predicted from a list of reference macroblocks. [10]

3.2.3 B Slice

B slices, or bidirectional slices, can contain both B coded macroblocks and I coded macroblocks. B coded macroblocks can be predicted from two different lists of reference macroblocks, both before and after the current frame in time. [10]

3.2.4 SP Slice

A Switching P (SP) slice is coded in a way that supports easy switching between similar precoded video streams without suffering the high penalty of sending a new I slice. [10]

3.2.5 SI Slice

A Switching I (SI) slice is an intra coded slice and supports easy switching between two different streams that do not correlate. [10]

3.3 Intra Prediction

In intra coding the encoder only uses data from the current frame. Intra prediction is the next step in this direction to try to minimize the coded frame size. With intra prediction the encoder tries to utilize the spatial correlation within the frame. [10]

Figure 3.2: 4×4 luma prediction modes (0 vertical, 1 horizontal, 2 DC, 3 diagonal down-left, 4 diagonal down-right, 5 vertical-right, 6 horizontal-down, 7 vertical-left, 8 horizontal-up)

Figure 3.3: 16×16 luma prediction modes (0 vertical, 1 horizontal, 2 DC, 3 plane)

H.264 supports 9 different intra prediction modes for 4×4 sample luma blocks, four different modes for 16×16 sample luma blocks and four modes for 8×8 chroma components. The 9 4×4 prediction modes are illustrated in figure 3.2 and the 4 16×16 luma prediction modes are illustrated in figure 3.3. The pixels are interpolated or extrapolated from the nearby pixels, i.e. the pixels marked with letters. Usually

the encoder selects the prediction mode that minimizes the difference between the predicted block and the block to be encoded. I_PCM is another prediction mode which makes it possible to transmit samples of an image without prediction or transformation. [10, 14]

3.4 Inter Prediction

Inter prediction creates a prediction model from one or more previously encoded video frames or slices using block-based motion compensation. The motion vector precision can be up to quarter-pixel resolution. The task is to find a vector that points to the block of pixels that has the smallest difference compared to the block in the frame being encoded. [10]

Figure 3.4: Different ways to split a macroblock in inter prediction (16×16, 16×8, 8×16 and 8×8, where each 8×8 block can be further split into 8×4, 4×8 and 4×4)

H.264 supports a range of block sizes from 16×16 down to 4×4 pixels. This is illustrated in figure 3.4. Using big blocks saves data because fewer motion vectors are needed, but the distortion can be very high when there are many small objects moving around in the video sequence. Using smaller blocks will in many cases lower the distortion, but will instead increase the number of bits needed to store the increased number of motion vectors. By letting the encoder find the best trade-off, good data compression of the video sequence can be achieved. The blocks are split when a threshold value is reached. [10]

SAD = Σ_{i=1}^{m} Σ_{j=1}^{n} |C(i, j) - R(i, j)|    (3.1)

MSE = (1 / (m·n)) · Σ_{i=1}^{m} Σ_{j=1}^{n} (C(i, j) - R(i, j))²    (3.2)

MAE = (1 / (m·n)) · Σ_{i=1}^{m} Σ_{j=1}^{n} |C(i, j) - R(i, j)|    (3.3)

The macroblock cost is commonly calculated in one of a few different ways. Sum of Absolute Difference (SAD) is the most common, as it offers the lowest computational complexity. The definition of SAD can be found in equation (3.1). Two other common ways to calculate the cost are Mean Square Error (MSE) and Mean Absolute Error (MAE), presented in equation (3.2) and equation (3.3) respectively. In equations (3.1), (3.2) and (3.3), n is the image width and m is the image height. [10]

Figure 3.5: Subsamples interpolated from neighboring pixels

More accurate motion estimation in the form of sub-pixel motion vectors is available in H.264. Up to quarter-pixel resolution is supported for the luma component and one eighth sample resolution for the chroma components. This motion estimation is made possible by interpolating neighboring pixels and then comparing with the current frame in the encoder. The interpolation is performed by a 6-tap Finite Impulse Response (FIR) filter with weights (1/32, -5/32, 20/32, 20/32, -5/32, 1/32). [10]

In figure 3.5 the half-pixel sample b can be located. To generate this sample equation (3.4) can be used. Sample m can be calculated in a similar way, shown in equation (3.5). [10]

b = round((E - 5F + 20G + 20H - 5I + J)/32)    (3.4)

m = round((B - 5D + 20H + 20N - 5S + U)/32)    (3.5)
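The three block-cost measures (3.1), (3.2) and (3.3) can be sketched as follows; this is an illustrative implementation, with blocks given as lists of rows.

```python
# Sketch of the block-cost measures: SAD (3.1), MSE (3.2) and MAE (3.3)
# between a candidate block C and a reference block R.

def sad(C, R):
    """Sum of Absolute Difference: cheapest cost, no multiplications."""
    return sum(abs(c - r) for row_c, row_r in zip(C, R)
                          for c, r in zip(row_c, row_r))

def mse(C, R):
    """Mean Square Error: squared differences averaged over the block."""
    n = sum(len(row) for row in C)
    return sum((c - r) ** 2 for row_c, row_r in zip(C, R)
                            for c, r in zip(row_c, row_r)) / n

def mae(C, R):
    """Mean Absolute Error: SAD normalized by the number of pixels."""
    return sad(C, R) / sum(len(row) for row in C)

C = [[10, 12], [14, 16]]
R = [[11, 12], [12, 16]]
print(sad(C, R))   # |10-11| + 0 + |14-12| + 0 = 3
print(mae(C, R))   # 3 / 4 = 0.75
```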

After generating all half-pixel samples from real samples, there are some half-pixel samples that have not yet been generated. These samples have to be generated from already generated samples; the sample j in figure 3.5 is an example of that. To generate j the same FIR filter is used, but with samples 1, 2, b, s, 7 and 8. j could also be generated with samples 3, 4, h, m, 5 and 6. Note that unrounded versions of the samples should be used when calculating j. When all half-pixel samples are generated, the quarter-pixel samples are generated by linear interpolation. Sample a in figure 3.5 is calculated as in equation (3.6) and sample d as in equation (3.7). To generate the last samples, two diagonal half-pixel samples are used, see equation (3.8). [10]

a = round((G + b)/2)    (3.6)

d = round((G + h)/2)    (3.7)

e = round((h + b)/2)    (3.8)

To enhance the video compression even more, H.264 has support for predicting macroblocks from more than one frame. This can be applied to both B and P coded slices. With the possibility to predict macroblocks from different frames a much better video compression can be achieved. The downside of multiframe prediction is an increased cost in memory size, memory bandwidth and computational complexity. [10]

Figure 3.6: Multiple frame prediction (previous frames, current frame, following frames)

To find the best motion vector the encoder uses a search algorithm such as Full Search (FS), Diamond Search or Hexagon Search. With Full Search a complete search of the whole search area is performed. This algorithm provides the best compression efficiency but is also the most time consuming. Diamond Search is a less time consuming algorithm where the search pattern is formed as a diamond. Its performance in terms of compression is good in comparison with FS. Hexagon Search is an even more refined search pattern where the search points form a hexagon, see figure 3.7a.
By decreasing the number of search points, the effort needed to calculate the motion vector is minimized and the result is almost as good as with Diamond Search [16]. Motion estimation is the part of H.264 encoding that consumes the most computational power and is estimated to take about 60% to 80% of the total encoding time [15].

3.4.1 Hexagon search

Hexagon search uses a 7-point search pattern, which can be seen in figure 3.7a. Each cross in the grid represents a search point in the search area, where the grid resolution is one pixel. For each search point a Sum of Absolute Difference, equation (3.1), is calculated. [16]

Figure 3.7: Large (a) and small (b) search pattern in the hexagon search algorithm.

The search steps in the hexagon search are the following:

1. Calculate the SAD of the six closest search points and the current search point.
2. Make the search point with the smallest SAD the new current search point. If the middle point has the smallest SAD, jump to step 5.
3. Calculate the SAD of the 3 new search points that have not yet been calculated, as illustrated in figure 3.8.
4. Jump to step 2.
5. Calculate the SAD of the 4 new search points forming a diamond around the middle point. This is illustrated in figure 3.7b.
6. Choose the search point that resulted in the smallest SAD and form a motion vector to this search point.

When the smallest SAD is found the motion compensated residue can be calculated. This residue is then sent to the transformation part of the encoder for further processing. In the decoder the motion vectors are used to restore the image correctly from the residue that was sent from the encoder. [16]
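The steps above can be sketched in Python. This is a simplified illustration, not the thesis implementation: it assumes a `cost(x, y)` function returning the SAD at a candidate offset, and it re-evaluates all six hexagon points in every iteration rather than only the three new points of step 3.

```python
# Sketch of hexagon search over a cost function cost(x, y) (e.g. SAD).

LARGE = [(-2, 0), (-1, -2), (1, -2), (2, 0), (1, 2), (-1, 2)]  # big hexagon
SMALL = [(-1, 0), (1, 0), (0, -1), (0, 1)]                     # small diamond

def hexagon_search(cost, x=0, y=0):
    best = cost(x, y)
    while True:                                   # steps 1-4: move the hexagon
        cand = [(x + dx, y + dy) for dx, dy in LARGE]
        costs = [cost(cx, cy) for cx, cy in cand]
        m = min(range(len(cand)), key=lambda i: costs[i])
        if costs[m] >= best:                      # centre wins -> step 5
            break
        best, (x, y) = costs[m], cand[m]
    for dx, dy in SMALL:                          # steps 5-6: refine
        c = cost(x + dx, y + dy)
        if c < best:
            best, (x, y) = c, (x + dx, y + dy)
    return (x, y), best

# Toy cost surface with a single minimum at offset (3, 2):
mv, c = hexagon_search(lambda x, y: abs(x - 3) + abs(y - 2))
print(mv)   # (3, 2)
```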

Figure 3.8: Movement of the hexagon pattern in a search area and the change to the smaller search pattern.

3.5 Transform Coding and Quantization

The main transform used in H.264 is the discrete cosine transform.

3.5.1 Discrete Cosine Transform

The Discrete Cosine Transform (DCT) is a widely used transform in image and video compression algorithms. In H.264 the DCT decorrelates the residual data before quantization takes place. The DCT is a block based algorithm, which means it transforms one block at a time. In standards prior to H.264 the blocks were 8×8 pixels large, but this has been changed to 4×4 samples to reduce the blocking effects, which degrade the visual quality of the video. The DCT used in H.264 is a modified two-dimensional (2D) DCT transform. The transform matrix for the modified 2D DCT can be found in equation (3.9). [10]

C_f =
[ 1   1   1   1
  2   1  -1  -2
  1  -1  -1   1
  1  -2   2  -1 ]    (3.9)

The 2D DCT transform in H.264 is given by equation (3.10)

Y = C_f X C_f^T ⊗ E_f    (3.10)

where ⊗ denotes element-wise multiplication and

E_f =
[ a²    ab/2  a²    ab/2
  ab/2  b²/4  ab/2  b²/4
  a²    ab/2  a²    ab/2
  ab/2  b²/4  ab/2  b²/4 ]    (3.11)

with

a = 1/2,  b = √(2/5)    (3.12)

and X is the 4×4 block of pixels to calculate the DCT of. To simplify computation somewhat, the post-scaling (⊗ E_f) can be absorbed into the quantization process. [10] This will be described in more detail in section 3.5.3, which covers the quantization.

The modified 2D DCT is an approximation of the standard DCT. It does not give the same result, but the compression is almost identical. The advantage of this approximation is that the core equation C_f X C_f^T can be computed in 16-bit arithmetic with only shifts, additions and subtractions [6]. To do a two-dimensional DCT, two one-dimensional DCTs can be performed after each other, the first one on rows and the second one on columns or vice versa. The function of the one-dimensional DCT can be seen in figure 3.9. [6]

Figure 3.9: DCT functional schematic

The operations performed while calculating the DCT as shown in figure 3.9 can be written as equation (3.13).

X_0 = (x_0 + x_3) + (x_1 + x_2)
X_2 = (x_0 + x_3) - (x_1 + x_2)
X_1 = (x_1 - x_2) + 2(x_0 - x_3)
X_3 = (x_0 - x_3) - 2(x_1 - x_2)    (3.13)
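The butterfly of equation (3.13) and the row-then-column construction of the 2-D transform can be sketched as below; this is an illustrative implementation of the core C_f X C_f^T, without the E_f scaling.

```python
# Sketch of the 1-D forward butterfly (3.13); the 2-D transform applies it
# to rows and then to columns, matching the core product Cf X Cf^T.

def dct1d(x):
    s0, s1 = x[0] + x[3], x[1] + x[2]   # sums
    d0, d1 = x[0] - x[3], x[1] - x[2]   # differences
    return [s0 + s1,                    # X0
            d1 + 2 * d0,                # X1
            s0 - s1,                    # X2
            d0 - 2 * d1]                # X3

def dct2d(X):
    rows = [dct1d(r) for r in X]                                  # transform rows
    cols = [dct1d([rows[i][j] for i in range(4)]) for j in range(4)]  # then columns
    return [[cols[j][i] for j in range(4)] for i in range(4)]

# A constant 4x4 block transforms to a single DC coefficient:
print(dct2d([[1] * 4] * 4)[0][0])   # 16, all other coefficients 0
```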

3.5.2 Inverse Discrete Cosine Transform

The transform that reverses the DCT is called the Inverse Discrete Cosine Transform (IDCT). With the design of the DCT in H.264 it is possible to ensure zero mismatch between different decoders. This is because the DCT and the IDCT (3.14) can be calculated in integer arithmetic. In the standard DCT some mismatch can occur, caused by different representation and precision of fractional numbers in encoder and decoder. [10] The 2D IDCT transform in H.264 is given by

X_r = C_i^T (Y ⊗ E_i) C_i    (3.14)

where

C_i =
[ 1     1     1     1
  1    1/2  -1/2   -1
  1    -1    -1     1
 1/2   -1     1   -1/2 ],

E_i =
[ a²   ab   a²   ab
  ab   b²   ab   b²
  a²   ab   a²   ab
  ab   b²   ab   b² ]

and X_r is the reconstructed original block and Y is the previously transformed block. As with the DCT, the pre-scaling (⊗ E_i) can be absorbed into the rescaling process. [10] This will be described in more detail in section 3.5.4, which covers the rescaling.

Figure 3.10: IDCT functional schematic

The function of the IDCT can be seen in figure 3.10. To do a two-dimensional IDCT, two one-dimensional IDCTs are performed after each other, the first one on rows and the second one on columns or vice versa. [6] The operations performed while calculating the IDCT can be written as equation (3.15).

x_0 = (X_0 + X_2) + (X_1 + ½X_3)
x_1 = (X_0 - X_2) + (½X_1 - X_3)
x_2 = (X_0 - X_2) - (½X_1 - X_3)
x_3 = (X_0 + X_2) - (X_1 + ½X_3)    (3.15)
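The forward and inverse butterflies really are inverses once the per-coefficient scaling is applied. The sketch below illustrates this in one dimension, where the combined E_f/E_i scaling reduces to a factor 1/4 for even-indexed and 1/5 for odd-indexed coefficients (an assumption of this 1-D reduction, following from the row norms of C_f).

```python
# Sketch of the inverse butterfly (3.15), plus a round trip through the
# forward butterfly (3.13) with the 1-D per-coefficient scaling applied.

def dct1d(x):                       # forward butterfly, as in (3.13)
    s0, s1 = x[0] + x[3], x[1] + x[2]
    d0, d1 = x[0] - x[3], x[1] - x[2]
    return [s0 + s1, d1 + 2 * d0, s0 - s1, d0 - 2 * d1]

def idct1d(X):                      # inverse butterfly, as in (3.15)
    s, d = X[0] + X[2], X[0] - X[2]
    e, f = X[1] + X[3] / 2, X[1] / 2 - X[3]
    return [s + e, d + f, d - f, s - e]

SCALE = [1/4, 1/5, 1/4, 1/5]        # 1-D analogue of the Ef/Ei scaling

x = [3, 1, 4, 1]
X = dct1d(x)
print(idct1d([X[k] * SCALE[k] for k in range(4)]))   # recovers [3, 1, 4, 1]
```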

3.5.3 Quantization

Information is often concentrated in the lower frequency area, so quantization can be used to further compress the data after applying the DCT. H.264 uses a parameter in the quantization called the Quantization Parameter (QP). The QP describes how much quantization should be applied, i.e. how much data should be truncated. A total of 52 values, ranging from 0 to 51, are supported by the H.264 standard. Using a high QP decreases the size of the coded data, but it also decreases the visual quality of the coded video. With QP = 0 the quantization is at its minimum and essentially all data is kept. [10]

From QP the quantizer step size (Q_step) can be derived. The first values of Q_step are presented in table 3.1. Note that Q_step doubles in value for every increase of 6 in QP. The large number of step sizes makes it possible to accurately control the trade-off between bitrate and quality in the encoder. [10]

QP      0      1       2       3      4  5      6     7      8     ...
Q_step  0.625  0.6875  0.8125  0.875  1  1.125  1.25  1.375  1.625 ...

Table 3.1: Q_step for a few different values of QP

The basic formula for quantization can be written as

Z_ij = round(Y_ij / Q_step)    (3.16)

where Y_ij is a coefficient of the previously transformed block to be quantized and Z_ij is a coefficient of the quantized block. The rounding operation does not have to be to the nearest integer; it can be biased towards smaller integers, which can give perceptually higher quality. This is true for all rounding operations in the quantization. [10]

As mentioned in section 3.5.1, the quantization can absorb the post-scaling (⊗ E_f) from the DCT. The unscaled output from the DCT can then be written as W = C_f X C_f^T (as compared to the scaled output, which is Y = C_f X C_f^T ⊗ E_f).
[10] This gives

Z_ij = round(W_ij · PF_ij / Q_step)    (3.17)

where W_ij is a coefficient of the unscaled transformed block, Z_ij is a coefficient of the quantized block and PF_ij is either a², ab/2 or b²/4 for each (i, j) according to

PF =
[ a²    ab/2  a²    ab/2
  ab/2  b²/4  ab/2  b²/4
  a²    ab/2  a²    ab/2
  ab/2  b²/4  ab/2  b²/4 ]    (3.18)

where a and b are the same as in equation (3.12) in section 3.5.1. [10]

PF and Q_step can then be reformulated using a multiplication factor (MF) and a division. MF is in fact a 4×4 matrix of multiplication factors according to

MF =
[ A  C  A  C
  C  B  C  B
  A  C  A  C
  C  B  C  B ]    (3.19)

where the values of A, B and C depend on QP according to

QP  A      B     C
0   13107  5243  8066
1   11916  4660  7490
2   10082  4194  6554
3   9362   3647  5825
4   8192   3355  5243
5   7282   2893  4559

Table 3.2: Multiplication factor MF

The scaling factors in MF are repeated for every increase of 6 in QP. The reformulation of PF and Q_step then becomes

PF / Q_step = MF / 2^qbits    (3.20)

where qbits is calculated as

qbits = 15 + floor(QP / 6)    (3.21)

This gives a new quantization formula according to

Z_ij = round(W_ij · MF_ij / 2^qbits)    (3.22)

which is the final form. [10]

3.5.4 Rescaling

The rescaling also uses Q_step, which depends on the Quantization Parameter (QP) and is the same as for quantization (see table 3.1). The basic formula for rescaling can be written as

Y_ij = Z_ij · Q_step    (3.23)

where Z_ij is a coefficient of the previously quantized block and Y_ij is a coefficient of the rescaled block. The rounding operation, as in the quantizer, does not have to be to the nearest integer; it can be biased towards smaller integers, which
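The final quantization formula (3.22) can be sketched as below. The MF values come from table 3.2 and the positions of A, B and C from (3.19); the use of plain nearest-integer rounding is a simplification, since the standard allows a bias towards smaller integers.

```python
# Sketch of forward quantization (3.22): Z = round(W * MF / 2^qbits),
# with MF values from table 3.2 repeated for every increase of 6 in QP.

MF_ABC = {0: (13107, 5243, 8066), 1: (11916, 4660, 7490),
          2: (10082, 4194, 6554), 3: (9362, 3647, 5825),
          4: (8192, 3355, 5243),  5: (7282, 2893, 4559)}

def mf(qp, i, j):
    A, B, C = MF_ABC[qp % 6]
    if i % 2 == 0 and j % 2 == 0:
        return A                    # (0,0), (0,2), (2,0), (2,2)
    if i % 2 == 1 and j % 2 == 1:
        return B                    # (1,1), (1,3), (3,1), (3,3)
    return C                        # remaining positions

def quantize(W, qp):
    qbits = 15 + qp // 6            # (3.21)
    return [[round(W[i][j] * mf(qp, i, j) / 2 ** qbits)
             for j in range(4)] for i in range(4)]

W = [[0] * 4 for _ in range(4)]
W[0][0] = 16                        # DC of an all-ones block (see section 3.5.1)
print(quantize(W, 0)[0][0])         # round(16 * 13107 / 2^15) = 6
```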

could give perceptually higher quality. This is true for all rounding operations in the rescaling. [10]

As the quantization formula was reformulated, the rescaling formula can also absorb the pre-scaling (⊗ E_i) and be reformulated to match the quantization formula. The new formula for rescaling, where the pre-scaling factor is included, can be written as

W_ij = Z_ij · Q_step · PF_ij · 64    (3.24)

where PF_ij is the same as in (3.18), Z_ij is a coefficient of the previously quantized block, W_ij is a coefficient of the rescaled block and the constant scaling factor of 64 is included to avoid rounding errors while calculating the inverse DCT. [10]

Much like MF for the quantization, the rescaling uses a 4×4 matrix of scaling factors called V, which also incorporates the constant scaling factor of 64 introduced in (3.24). V can be written as

V =
[ A  C  A  C
  C  B  C  B
  A  C  A  C
  C  B  C  B ]    (3.25)

where the values of A, B and C depend on QP according to

QP  A   B   C
0   10  16  13
1   11  18  14
2   13  20  16
3   14  23  18
4   16  25  20
5   18  29  23

Table 3.3: Scaling factor V

The scaling factors in V are, like those in MF, repeated for every increase of 6 in QP. With V the rescaling formula can be written as

W_ij = Z_ij · V_ij · 2^floor(QP/6)    (3.26)

which is the final form. [10]

3.6 Deblocking filter

When using block coding algorithms such as the DCT, blocking artifacts can occur. This is unwanted because it lowers the visual quality and the prediction performance. The solution is to add a filter that removes these artifacts. The filter is placed after the IDCT in the encoding loop, which can be seen in figure 3.1. The filter is used on both luma and chroma samples of the video sequence. [10]
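The rescaling formula (3.26) is a sketch-friendly mirror of the quantization. The illustration below uses the V values from table 3.3; at QP = 0 a DC coefficient quantized to 6 rescales to 6 · 10 = 60, matching Z · Q_step · PF · 64 = 6 · 0.625 · 0.25 · 64.

```python
# Sketch of rescaling (3.26): W = Z * V * 2^floor(QP/6),
# with V values from table 3.3 repeated for every increase of 6 in QP.

V_ABC = {0: (10, 16, 13), 1: (11, 18, 14), 2: (13, 20, 16),
         3: (14, 23, 18), 4: (16, 25, 20), 5: (18, 29, 23)}

def v(qp, i, j):
    A, B, C = V_ABC[qp % 6]
    if i % 2 == 0 and j % 2 == 0:
        return A                    # same A/B/C placement as in MF
    if i % 2 == 1 and j % 2 == 1:
        return B
    return C

def rescale(Z, qp):
    shift = qp // 6                 # 2^floor(QP/6)
    return [[Z[i][j] * v(qp, i, j) << shift for j in range(4)]
            for i in range(4)]

Z = [[0] * 4 for _ in range(4)]
Z[0][0] = 6
print(rescale(Z, 0)[0][0])          # 6 * 10 = 60
```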

Figure 3.11: Filtering order of a 16×16 pixel macroblock, with start in A and end in H for luminance (a) and start in 1 and end in 4 for chrominance (b)

The deblocking filter in H.264 has 5 levels of filtering, 0 to 4, where 4 is the option with the strongest filtering. The filter is actually two different filters, where the first filter is applied on levels 1 to 3 and the second on level 4. Level 0 means that no filter is applied. The filter level parameter is called boundary strength (bs). The parameter depends on the current quantization parameter, the macroblock type and the gradient of the image samples across the boundary. There is one bs for every boundary between two 4×4 pixel blocks. The deblocking filter is applied to one macroblock at a time in raster scan order throughout the frame. [5]

Figure 3.12: Pixels in blocks adjacent to vertical and horizontal boundaries

When applying the deblocking filter on a macroblock it is done in a special order, which is illustrated in figure 3.11. The filter is applied on vertical and horizontal edges as shown in figure 3.12, where p_0, p_1, p_2, p_3, q_0, q_1, q_2, q_3 are pixels from two neighboring blocks, p and q. The filtering of these pixels only takes place if equations (3.27), (3.28) and (3.29) are fulfilled.

|p_0 - q_0| < α(index_A)    (3.27)

|p_1 - p_0| < β(index_B)    (3.28)

|q_1 - q_0| < β(index_B)    (3.29)

index_A = Min(Max(0, QP + Offset_A), 51)    (3.30)

index_B = Min(Max(0, QP + Offset_B), 51)    (3.31)

The values of α and β are approximately defined by equations (3.32) and (3.33).

α(x) = 0.8 · (2^(x/6) - 1)    (3.32)

β(x) = 0.5 · x - 7    (3.33)

Note that from equations (3.30) and (3.31) it can be seen that the filtering is dependent on the Quantization Parameter. The different filters applied are 3-, 4- and 5-tap FIR filters, which are further described in [5].

3.7 Entropy coding

The H.264 standard supports two different entropy coding algorithms, Context-based Adaptive Variable Length Coding (CAVLC) and Context-based Adaptive Binary Arithmetic Coding (CABAC). CABAC is the more efficient of the two, but it requires higher computational complexity. The bitrate savings of CABAC can be between 9% and 14% compared to CAVLC [7]. CAVLC is supported in all H.264 profiles, but CABAC is only supported in the profiles above Extended. [10]
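The boundary decision of equations (3.27) to (3.31) can be sketched as below. This is an illustration using the approximate α and β of (3.32)/(3.33); the standard itself uses lookup tables, and the offsets here default to 0.

```python
# Sketch of the deblocking decision (3.27)-(3.31) with the approximate
# alpha (3.32) and beta (3.33) threshold functions.

def clip_index(qp, offset):
    return min(max(0, qp + offset), 51)          # (3.30)/(3.31)

def alpha(x):
    return 0.8 * (2 ** (x / 6) - 1)              # (3.32)

def beta(x):
    return 0.5 * x - 7                           # (3.33)

def should_filter(p1, p0, q0, q1, qp, offset_a=0, offset_b=0):
    a = alpha(clip_index(qp, offset_a))
    b = beta(clip_index(qp, offset_b))
    return (abs(p0 - q0) < a and
            abs(p1 - p0) < b and
            abs(q1 - q0) < b)

# A large step across the boundary at low QP is treated as a real edge
# and left alone; at high QP it is assumed to be a blocking artifact:
print(should_filter(100, 100, 160, 160, qp=20))  # False
print(should_filter(100, 100, 160, 160, qp=48))  # True
```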

Chapter 4

Overview of the epuma Architecture

This chapter gives an introduction to the epuma processor architecture. The memory hierarchy, the master core, the Sleipnir core, the direct memory access controller and the simulator will be covered.

4.1 Introduction to epuma

Embedded Parallel Digital Signal Processing Processor Architecture with Unique Memory Access (epuma) is a multi-core DSP processor architecture with 1 master core and 8 calculation cores. The master core handles the Direct Memory Access (DMA) communication. The slave core, which is also called Sleipnir, is a 15-stage pipelined calculation core.

4.2 epuma Memory Hierarchy

The epuma memory hierarchy consists of three levels, where the first level is the off-chip main memory, the second level is the local storage of the master and slaves and the third and final level is the registers of the master and slave cores. Figure 4.1 depicts how each core is connected to the on-chip interconnection. The on-chip interconnection is in turn connected to the off-chip main memory. The main memory is addressed with both a high word of 16 bits and a low word of another 16 bits, which means that 32-bit addressing is used where each address corresponds to a word of data.

Figure 4.1: epuma memory hierarchy (off-chip main memory at level 1; local storage, program, data and constant memories at level 2; core registers at level 3)

The on-chip network is depicted in figure 4.2, where N0 to N7 are interconnection nodes. As can be seen from the figure, the nodes are connected both to the master and to the respective Sleipnir core, but also to other nodes. This gives the ability to transfer data between Sleipnir cores and even to pipeline the cores. With this setup data can be transferred in any way and combination that does not overlap.

Figure 4.2: epuma star network interconnection

4.3 Master Core

The master core is for the moment based on a processor called Senior. This processor has been around at the Division of Computer Engineering for some years and is used in some courses for educational purposes. The Senior processor is a DSP processor, which means it has a Multiply and ACcumulate (MAC) unit and other DSP related capabilities. To make it able to serve as a master core, memory ports for the DMA controller and interrupts coming from the DMA and Sleipnir cores have been added.

4.3.1 Master Memory Architecture

The master core has 2 RAMs and 2 ROMs, which are called Data Memory 0 (DM 0) and Data Memory 1 (DM 1). These memories are the local storage of the master core. The ROMs start at address 0x8000 in the respective memory. This gives 0x7FFF = 32767 words in each RAM to work with. For calculation the master core has 32 16-bit registers that can be used as buffers. There are also a number of special registers, such as 4 address registers, registers for hardware looping and registers for support of cyclic addressing in address registers 0 and 1. Address registers 0 and 1 also support different step sizes.

4.3.2 Master Instruction Set

The programming guide and instruction set for Senior can be found in [9] and [8], even though they might not be totally accurate because of the modifications made for the epuma project. The master's instruction set is largely the same as the Senior instruction set. It is a standard DSP instruction set with support for a convolution instruction which multiplies and accumulates the results. To speed up looping, a hardware loop function called repeat is included. All jumps, calls and returns can use 0 to 3 delay slots. The number of delay slots specifies how many instructions after the flow control instruction will be executed. If not all delay slots are used for useful instructions, nop instructions are inserted in the pipeline.
4.3.3 Datapath

The datapath of the master consists of a 5-stage pipeline, which can be seen in figure 4.3. There is only one exception to this: the convolution instruction (conv) uses a 7-stage pipeline, but a figure of this is omitted for lack of relevance. The datapath is advanced enough for scalar calculations; larger computational loads should be delegated to the Sleipnir cores. In table 4.1, originally found in [9], a description of the pipeline stages is presented.

Figure 4.3: Senior datapath for short instructions

Pipe  RISC-E1/E2                   RISC memory load/store
P1    IF: Instr. Fetch             IF: Instr. Fetch
P2    ID: Instr. Decode            ID: Instr. Decode
P3    OF: Operand Fetch            OF+AG: Compute addr
P4    EX1: Execution (set flags)   MEM: Read/Write
P5    EX2: Only for MAC, RWB       WB: Write back (if load)

Table 4.1: Pipeline specification

4.4 Sleipnir Core

Sleipnir is the name of the calculation core; in the epuma processor there are 8 of them. Sleipnir is a Single Instruction Multiple Data (SIMD) architecture, which in this case means it can perform vector calculations. Each full vector consists of 128 bits and is divided into 8 words of 16 bits, which can run through the pipeline in parallel. The datapath of the Sleipnir core has 15 pipeline stages. The pipeline length of an instruction is variable, depending on the choice of operands.

4.4 Sleipnir Core 31 4.4.1 Sleipnir Memory Architecture The Sleipnir core has 3 memories where 2 of them are connected to the core and the third memory is connected to the DMA bus. The memories are called Local Vector Memories (LVMs). By being able to swap which memories that are connected to the processor and which memory that is connected to the DMA better utilization can be reached and a lot of the transfer cycle cost can be hidden. Constant Memory Each Sleipnir is also provided with a Constant Memory (CM) for use of constants during runtime. This memory can be used for different tasks such as scalar constants or permutation vectors. All constants that will be used during runtime can be stored in the CM. The memory can contain up to 256 vectors. Local Vector Memory The Local Vector Memories (LVM) are the local memories of the Sleipnir core. As described above each core has access to 2 LVMs at runtime. These memories are 4096 vectors large, where each vector is 128 bits wide. The memories have one address for each word of 16 bits. The memories consist of 8 memory banks, one for each word in a vector. The constant memory can be used to address the LVMs according to the values stored in the constant memory. The constant memory addressing of the LVMs can be used to generate a permutation of data which can be used for e.g. transposing a matri. Vector Registers File There are 8 Vector Registers (VR) in the Vector Register File (VRF), VR0 to VR7, for use in computations during runtime. Each word can be obtained separately, it is also possible to obtain a double word and half vector both high and low in each of the 8 vector registers. The different access types are listed in table 4.2, originally found in [4]. Synta Size Description vrx.y 16-bit Word vrx.yd 32-bit Double word vrx{h,l} 64-bit Half vector vrx 128-bit Vector Table 4.2: Register file access types Special Registers There are 4 address register ar0-ar3 which can be used to address memory in the LVMs. 
There are also 4 configuration registers for these 4 address registers. The subset of these registers are values for top, bottom and step size which can

32 Overview of the epuma Architecture be used when addressing memories in all kinds of loops. The different increment operations are listed in table 4.3, originally found in [4]. arx+=c Fied increment; C = 1,2,4 or 8 arx-=c Fied decrement; C = 1,2,4 or 8 arx+=s Increment from stepx register arx+=c% Fied increment with cyclic addressing arx-=c% Fied decrement with cyclic addressing arx+=% Increment from stepx with cyclic addressing Table 4.3: Address register increment operations The addressing of the two LVMs can be done with one of the four address registers, immediate addresses, vector registers or in combination with the constant memory, to form advanced addressing schemes as shown in table 4.4, originally found in [4]. Mode# Inde Offset Pattern Synta eample 0 arx 0 0,1,2,3,4,5,6,7 [ar0] 1 arx 0 cm[carx] [ar0 + cm[car0]] 2 arx 0 cm[imm8] [ar0 + cm[10]] 3 arx 0 cm[carx + imm8] [ar0 + cm[car0 + 10]] 4 0 vrx.y 0,1,2,3,4,5,6,7 [vr0.0] 5 0 vrx.y cm[carx] [vr0.0 + cm[car0]] 6 0 vrx.y cm[imm8] [vr0.0 + cm[10]] 7 0 vrx.y cm[carx + imm8] [vr0.0 + cm[car0 + 10]] 8 0 0 vrx [vr0] 9 0 0 cm[carx] [cm[car0]] 10 0 0 cm[imm8] [cm[10]] 11 0 0 cm[carx + imm8] [cm[car0 + 10]] 12 arx 0 vrx [ar0 + vr0] 13 arx vrx.y 0,1,2,3,4,5,6,7 [ar0 + vr0.0] 14 arx imm16 0,1,2,3,4,5,6,7 [ar0 + 1024] 15 0 imm16 0,1,2,3,4,5,6,7 [1024] Table 4.4: Addressing modes eamples Program memory The program memory (PM) can contain up to 512 instructions. It can be loaded from the main memory by issuing a DMA transaction. The program that is loaded into the Sleipnir PM is called a block. A kernel is a combination of master code and blocks. A block can utilize several Sleipnir cores with internal data transfers. Blocks can however not communicate with cores outside the block and can not be data dependant on any other block running at the same time.
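The cyclic increment modes in table 4.3 can be sketched in a few lines of Python. This is a minimal model for illustration only: the exact wrap-around semantics (post-increment, inclusive bottom/top bounds) are assumptions, not taken from the architecture description.

```python
class AddressRegister:
    """Minimal model of a Sleipnir arx register with cyclic addressing.

    bottom, top and step mirror the fields of the configuration register;
    the wrap-around behaviour is an assumption for illustration.
    """
    def __init__(self, value, bottom, top, step):
        self.value, self.bottom, self.top, self.step = value, bottom, top, step

    def post_increment(self, c, cyclic=False):
        """Return the current address, then advance by c (e.g. arx+=c%)."""
        addr = self.value
        self.value += c
        if cyclic and self.value > self.top:      # wrap back into [bottom, top]
            self.value = self.bottom + (self.value - self.top - 1)
        return addr

# Address a circular buffer of 8 words starting at address 0:
ar0 = AddressRegister(value=6, bottom=0, top=7, step=1)
addresses = [ar0.post_increment(2, cyclic=True) for _ in range(4)]
print(addresses)  # [6, 0, 2, 4]
```

This kind of modulo stepping is what lets a loop stream through a circular buffer without any explicit address-wrapping instructions.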
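The constant-memory addressing modes (e.g. mode 9, [cm[car0]]) are what enable the data permutations mentioned for the LVMs. A scaled-down Python model of the idea, with a permutation vector in CM gathering a transposed 4x4 matrix out of an LVM; the matrix and memory sizes are illustrative, not the real vector widths.

```python
# Row-major 4x4 matrix stored word by word in a (scaled-down) LVM.
lvm = [row * 4 + col for row in range(4) for col in range(4)]  # values 0..15

# Permutation vector held in constant memory: LVM addresses that read the
# matrix column by column, i.e. a transpose access pattern.
cm = [col * 4 + row for row in range(4) for col in range(4)]

# Mode-9-style access [cm[carx]]: each LVM address comes from a CM entry.
transposed = [lvm[addr] for addr in cm]
print(transposed)  # [0, 4, 8, 12, 1, 5, 9, 13, ...]
```

The permutation happens as a side effect of addressing, so no separate shuffle instructions are spent on the transpose itself.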

If for some reason the Sleipnir block code is larger than 512 lines of instructions, it can be divided into two programs and the memory contents can be transferred between two Sleipnir cores. For this to work, code is needed in the master to keep track of the cores and move data to the next core for further processing. When developing a new block or kernel it can sometimes be good to have a little extra memory; therefore it is possible to increase the size of the PM in the simulator.

4.4.2 Datapath

The datapath of the Sleipnir slave core is an 8-way 16-bit datapath. It is divided into 15 pipeline stages and is depicted in figure 4.4. A more detailed version of the datapath can be found in [2].

[Figure 4.4: Sleipnir datapath pipeline schematic. Stages A1-A2 perform instruction fetch and decode, B1-B4 perform CM addressing and LVM scalar and vector addressing, C1 performs operand selection, D1-D4 contain operand formatting, the multipliers and the two ALUs, and E1-E4 write back to the LVMs, CM, VRF and SPRF.]

The datapath includes 16 16x16-bit multipliers and two Arithmetic Logic Units (ALU) connected in series. Simpler instructions can bypass the first ALU and thereby become shorter instructions, which saves some execution time. These bypasses can be seen in stages D1 to D4 in figure 4.4. Some instructions use a very short datapath, such as the jump instruction, which is executed in stage A2. This makes precalculated branch decisions unnecessary. Stages E1 to E4 can be described as the write back stage and therefore follow after stage D4. Stages D3 and D4 are very similar but provide the core with the possibility of performing, for example, the summation of a complete vector.
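The effect of the 8 computation lanes can be modelled in Python: one word-wise vector operation is eight independent 16-bit operations issued as a single instruction. A minimal sketch of a lane-parallel add; the wrap-around overflow behaviour is an assumption made for illustration.

```python
def vector_word_add(a, b):
    """One 128-bit vector add modelled as eight parallel 16-bit lane adds."""
    assert len(a) == len(b) == 8
    return [(x + y) & 0xFFFF for x, y in zip(a, b)]  # 16-bit wrap-around

va = [1, 2, 3, 4, 5, 6, 7, 8]
vb = [10] * 8
print(vector_word_add(va, vb))  # [11, 12, 13, 14, 15, 16, 17, 18]
```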

4.4.3 Sleipnir Instruction Set

The instruction set used is application specific. It includes no move or load instructions for data; these functions are all covered by a single instruction called copy. Operands and instructions can be combined in different ways, with variable pipeline length as a result. The pipeline length depends on e.g. where the input operands are fetched from, where the result will be stored and whether the instruction uses or bypasses the first ALU and the multipliers.

Instruction names are built upon what data they affect and how. For example, the instruction vcopy m0[0].vw m1[0].vw copies a vector from memory 1, address 0, to memory 0, address 0. If the instruction scopy were used instead, it would only copy a scalar word. Another example is the add instruction. If vaddw m0[0].vw m1[0].vw vr0 is used, two vectors will be loaded, one from m1 and one from vr0. The .vw after the memory address denotes that the vectors will be added word wise, that is, they will be considered as eight words. This means that the processor can carry out 8 additions per clock cycle. [4]

4.4.4 Complex Instructions

To reach better performance the datapath has to be utilized as much as possible, especially in the inner loops of the critical path. To achieve this, new specialized instructions that perform several smaller tasks could be implemented. By pipelining several of these new complex instructions, more work can be done in less time and the program will reach an increased throughput. The considerations made when deciding whether to accelerate certain parts of code are listed below.

Motivation: Why should the acceleration be done?
Description: What is going to be accelerated?
Extra hardware needed: What extra hardware is needed for acceleration of the specific task?
Profiling and usage: Is the task used a lot and therefore worth accelerating?
Extra hardware cost: What is the cost of the extra hardware?
Cycle gain: How many cycles can be saved?
Efficiency: How efficient is the new solution in terms of cost per gain in performance?

4.5 DMA Controller

The Direct Memory Access (DMA) controller is used to load and store data to and from an off-chip memory. The DMA can transfer a 128-bit vector to one of the

Sleipnirs every cycle. It can also broadcast data to one or more Sleipnirs. If a block is to be loaded into two or more cores it can be broadcast, and the process will not lose cycles by loading the block separately into each core. This saves time both because of the time it takes to copy the data and because it takes some cycles to configure and start the DMA transaction.

[Figure 4.5: Sleipnir Local Store switch. The NoC and the Sleipnir core are connected through two switches to the PM, CM and the three LVMs.]

As mentioned before, there are 3 memories belonging to each core. There are 6 different setups for how the memories are connected to the DMA and the core. The switch which controls this is illustrated in figure 4.5, originally described in [2]. There is also a switch for selecting whether the LVM, PM or CM should be connected to the DMA. This switch is changed accordingly when programming the Sleipnir core's PM, CM or LVM.

To initiate a DMA transaction the DMA unit needs to be configured. This configuration includes start addresses in both memories, the number of vectors to be transferred, how to access data, the step size in memory, the switch configuration and the broadcast configuration. The DMA has support for 2D accesses in main memory, which can be helpful when advanced access patterns are used. When the configuration of the DMA unit is done the task can be started.

4.6 Simulator

The ePUMA architecture has a full system simulator available. The simulator is bit and pipeline true. Simulations can be done either on one standalone Sleipnir core or on the full system simulator, where the master core and all 8 Sleipnir cores are included.

The simulator can be invoked from a Python script. It provides a number of functions that can be used to access LVM contents, address registers, vector registers, the program counter and the instruction that is being executed [3]. This can be used both for debugging and profiling. Interrupts from the DMA and the Sleipnir cores can also be caught by the same Python script. Pre-processing of the input and post-processing of the results directly in the Python script are also possible.

The simulator supports different modes of simulation: either simulation until an event occurs or simulation of one cycle at a time. When simulating until an event happens, these events need to be enabled in the simulator. Events that can be enabled include the starting and stopping of a specific Sleipnir core, memory accesses out of range and data hazards such as read before write. Simulating one cycle at a time offers more opportunities to evaluate each step of an execution.

To carry out a full system simulation the simulator needs input such as master code, Sleipnir code and the data that is going to be used during runtime. These can be added before the simulation begins. Allocating memory for results in main memory is also possible.

Chapter 5

Elaboration of Objectives

This chapter gives a more detailed task specification using knowledge acquired from the previous theoretical chapters. It also describes the method and the procedure taken.

5.1 Task Specification

The main task at hand is to evaluate a new processor architecture and its capability with respect to H.264 video encoding, using the available system simulator. The evaluation will be done by developing selected parts of an H.264 encoder. Most weight should be put on evaluating the more computationally intensive parts, which are likely to constitute the bottleneck of the encoding. To implement an encoder, or parts of it, that uses the H.264 standard, a thorough understanding of both the H.264 standard's core parts and the ePUMA processor architecture, tool chain and instruction set is needed. This information and understanding has to be acquired first.

The video focused upon will be 1080p full high-definition (HD) video at a rate of 30 frames per second (FPS), using the 4:2:0 sampling format described in section 2.2. The video frames that calculations will be performed on are presumed to be stored with 8 bits per pixel in the main memory.

Once performance results have been acquired, possible areas of improvement can be exposed. Different ways of improvement can then be compared, both in terms of performance improvement and the estimated extra hardware needed, to give a measurement of efficiency. The results will also be compared to the results from the H.264 encoder for the STI Cell architecture presented in [15]. Other parts that will be evaluated include the Discrete Cosine Transform (DCT), Inverse Discrete Cosine Transform (IDCT), Quantization and Rescaling.
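The raw data rates implied by this target follow directly from the numbers above; a quick check in Python, assuming nothing beyond 4:2:0 sampling giving 1.5 bytes per pixel:

```python
width, height, fps = 1920, 1080, 30

bytes_per_frame = width * height * 3 // 2     # 4:2:0 -> 1.5 bytes per pixel
print(bytes_per_frame)                         # 3110400
print(bytes_per_frame * fps / 1e6)             # ~93.3 MB/s into the encoder

macroblocks = (width // 16) * (height / 16)    # 120 x 67.5
print(macroblocks)                             # 8100.0
```

Note that 1080 is not a multiple of 16, which is why the macroblock count per frame is quoted as 120 x 67.5 = 8100 later in the text.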

5.1.1 Questions at Issue

The following questions were derived from the purpose in section 1.2 and the task specification.

Is it possible to perform real-time full HD video encoding at 30 FPS using the H.264 standard on the ePUMA processor?

Would it be possible to modify the processor architecture to reach better performance and, if so, would it be worth the cost of the potentially added hardware?

What are the cycle costs compared to the STI Cell H.264 encoder?

5.2 Method

The main method used to conduct the work has been to use the ePUMA system simulator. The simulator was invoked from a script written in the Python programming language. This gave enough flexibility to enable measurement of all the results, as well as the ability to automate testing within that same script. If the sole purpose of this thesis had been to give performance measurements, other methods might have been candidates, such as hand calculation of cycle costs. As the purpose of this thesis also includes functional implementations, using the simulator is the choice that offers the best validity if used correctly. If the implementation of the simulator is correct according to the proposed architecture, it will give measurements of high reliability. This is based on the fact that the simulator is pipeline true and cycle- and bit-correct.

5.3 Procedure

The procedure taken while working on this thesis started with a study of video coding, the H.264 standard and the ePUMA processor architecture. Once the required information had been acquired, a functional stand-alone Sleipnir block for motion estimation was developed. When the block was found to be correct and working, code for the master was developed, so that the master could run the motion estimation using the Sleipnir block. From this point the motion estimation kernel was developed with various stepwise improvements. The master code was also extended to run the different versions of the kernel using a variable number of slave cores. Then the construction of the other Sleipnir blocks, such as DCT, IDCT, Quantization and Rescaling, began. Once all blocks were implemented, performance measurements could be acquired and the results analyzed to draw conclusions and answer the questions at issue.

Chapter 6

Implementation

This chapter covers how the implementation of the different kernels and blocks was done, how they evolved, and the different decisions that were made and why.

6.1 Motion Estimation

Motion estimation was found to be the prime target for performance evaluation as it, in nearly all cases, takes up the majority of the encoding cycle time. All implementations of motion estimation are done for a frame 65 macroblocks high and 118 macroblocks wide. The reason for this is that it simplifies the implementation, as the corners and sides of a frame constitute special cases. The number of macroblocks left out by this simplification is 430, compared to the total of 120 x 67.5 = 8100 macroblocks for a full HD frame. This corresponds to 5.31% and still leaves 7670 macroblocks to perform calculations on. The search area was chosen as (-15, -15) to (15, 15), matching what was used in [15], to yield as comparable results as possible.

Another simplification of the motion estimation is that it is only performed on entire macroblocks; no further division into e.g. 16x8, 8x8 or 4x4 pixel blocks is performed. The reason for this is that it might not be feasible to perform these calculations on a low-power architecture such as ePUMA without increasing the clock frequency. Doing so would be counterproductive from a low-power point of view and, even if it were doable, might not be applicable to hand-held devices running on batteries.

6.1.1 Motion Estimation Reference

In order to evaluate the results produced by the motion estimation kernels, a reference motion estimation program was written in the Python scripting language. By comparing the resulting motion vectors and costs in an automated fashion, the functionality of the kernels could be verified with little effort.
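The heart of such a reference program is a plain SAD evaluation. The sketch below is written in the same spirit but is illustrative only; the function name, the flat-list frame layout and the tiny 32x32 test frames are assumptions, not taken from the actual reference script.

```python
def mb_sad(cur, ref, mx, my, width, mb_x, mb_y):
    """Sum of absolute differences between the 16x16 macroblock at
    (mb_x, mb_y) in the current frame and the candidate block displaced
    by (mx, my) in the reference frame.  Frames are flat 8-bit pixel lists."""
    sad = 0
    for row in range(16):
        for col in range(16):
            c = cur[(mb_y + row) * width + mb_x + col]
            r = ref[(mb_y + my + row) * width + mb_x + mx + col]
            sad += abs(c - r)
    return sad

# Tiny example: a 32x32 frame whose content is shifted one pixel left,
# so the best match for the block at (8, 8) lies at displacement (1, 0).
W = 32
ref = [(x + y) % 256 for y in range(W) for x in range(W)]
cur = [ref[y * W + min(x + 1, W - 1)] for y in range(W) for x in range(W)]
print(mb_sad(cur, ref, 1, 0, W, 8, 8))  # perfect match -> 0
print(mb_sad(cur, ref, 0, 0, W, 8, 8))  # 256: every pixel differs by 1
```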

6.1.2 Complex Instructions

The function of the innermost loop of motion estimation can be described as follows:

1. 16 x 16 = 256 subtractions of 8-bit unsigned numbers.
2. Calculation of the absolute value of each subtraction result.
3. Summation of all absolute values into one final sum.

This gives a total of 256 subtraction operations (SUB), 256 absolute value operations (ABS) and 255 addition operations (ADD), which is equal to 767 operations. This theoretically corresponds to 32 vector word SUB instructions, 32 vector word ABS instructions and 37 vector word SUM instructions in the Sleipnir core. Those instructions would need a total of 32 + 32 + 37 = 101 cycles in the Sleipnir core.

By examining the Sleipnir datapath, as seen in figure 4.4, it can be found that several of the necessary operations could be done in series in a pipelined fashion. By exploiting this, a new complex instruction could be constructed, as mentioned in section 4.4.4. In addition, by having the operand selection and operand formatting parts of the pipeline fetch 8-bit unsigned numbers from the operands and feed them to the datapath as 16-bit unsigned numbers, a further reduction of cycle time could be achieved. By utilizing the datapath to this extent, the complex instruction produces two scalar words as its partially summed result. This means another 9 vector word SUM instructions will still be needed, which gives a total theoretical computation time of 32 + 9 = 41 cycles for calculating one macroblock sum of absolute differences (MB SAD).

By studying the hexagon search algorithm (section 3.4.1) it can be seen that the algorithm will need to calculate the sum of absolute differences between two macroblocks (MB SAD) a number of times equal to 7 + 3n + 4, where n is the number of steps taken. It is also known that there will be 8100 macroblocks to perform a hexagon search upon in each frame.

A summary of the considerations taken, as mentioned in section 4.4.4, is listed below.

Motivation: Innermost loop of motion estimation.
Description: Perform absolute difference and partial sum.
Extra hardware needed: None, or possibly operand 8-bit selection.
Profiling and usage: Used (7 + 3n + 4) x 8100 times per frame.
Extra hardware cost: None, or affordable.
Cycle gain: Theoretically 101 - 41 = 60 cycles for each MB SAD.
Efficiency, gain per cost: Very high.
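The figures in the list above follow directly from the operation counts, which a few lines of Python make explicit (n is the number of hexagon steps taken, as in section 3.4.1):

```python
# Plain instruction count for one 16x16 MB SAD on the 8-lane datapath:
sub = 256 // 8             # 32 vector word SUB instructions
absv = 256 // 8            # 32 vector word ABS instructions
sums = 37                  # vector word SUM instructions to reduce 256 values
plain = sub + absv + sums  # 101 cycles

# With the complex instruction: 32 fused passes plus 9 final SUMs.
fused = 32 + 9             # 41 cycles
print(plain - fused)       # 60 cycles saved per MB SAD

# MB SADs needed per frame with the hexagon search (n steps per macroblock):
mb_sads = lambda n: (7 + 3 * n + 4) * 8100
print(mb_sads(2))          # 137700 MB SADs for two-step searches
```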

Having analyzed the innermost loop of the motion estimation leads to the conclusion that using complex instructions could give a real boost to performance. The new instructions specific to motion estimation were named HVBSUMABSDWA and HVBSUMABSDNA, which can be read out as Half Vector Bytewise SUM of ABSolute Differences Word Aligned and Half Vector Bytewise SUM of ABSolute Differences Not word Aligned, respectively. The proposed hardware setups of the datapath for these two instructions are depicted in Appendix A.1 and A.2. As can be seen from the figures, the instructions do not use the full width of the operands, because the data is stored bytewise in the memory. The datapath will still be fully utilized, as the 8-bit input pixel values are promoted to 16-bit values in the 8 computation lanes of the Sleipnir datapath.

In addition to the motion estimation instructions, two more instructions can follow without any essential additional cost. These instructions were named HVBSUBWA and HVBSUBNA, which can be read out as Half Vector Bytewise SUBtraction Word Aligned and Half Vector Bytewise SUBtraction Not word Aligned, respectively. These instructions are used in motion compensation, as the subtraction results have to be kept intact to produce the residue frame. The implementation of these instructions can be seen in Appendix A.3 and A.4.

6.1.3 Sleipnir Blocks

The hexagon search Sleipnir block was the first part to be implemented, at first only focusing on performing calculations on one macroblock at a time. The input needed to perform the calculations is one macroblock from the new frame and a larger chunk of data from the previous frame, the reference, that makes up the search area. All motion estimation blocks are divided into smaller functions, and the program flowchart is depicted in figure 6.1.

When execution starts, the program calculates the Sum of Absolute Differences (SAD) for the first 7 search points: MID, LEFT, RIGHT, UP LEFT, UP RIGHT, DOWN LEFT and DOWN RIGHT, as shown in figure 6.1. Once the first 7 SAD costs have been calculated, the program reaches the main loop, where the MIN function determines which cost is lowest and moves on to one of the 7 corresponding MIN functions. The 6 directional MIN functions (MIN LEFT, MIN RIGHT, MIN UP LEFT, ...) update the motion vectors and data addresses and then move on to the corresponding 3 new search points. Once the SAD costs of these 3 new search points have been calculated, the MIN function is again used to find the new minimum cost and the loop continues.

The MIN MID state is reached if the middle point was found to be the search point with the lowest SAD cost. When this happens, Phase 2 (P2) of the algorithm starts, which means that the search pattern is changed to the small hexagon (figure 3.7). Once the final 4 search points have been calculated, the smallest cost amongst them is found by the P2 MIN function and the final motion vector is calculated. For the Sleipnir blocks that do not use the Motion Compensation (MC) function, the DONE/RESTART state is reached directly; if MC is used, it will be calculated before the block finishes.

The blocks calculating on one macroblock reach the DONE stage and finalize their execution. The RESTART function is naturally only used by the blocks calculating on more than one macroblock per execution. If the DONE/RESTART function is reached, the program starts over from START/RESTART if there are more macroblocks left to compute; otherwise it reaches DONE and finalizes its execution. The P2 in some function names indicates that they are used in phase two of the search algorithm, when the small hexagon pattern discussed in section 3.4.1 is used.

[Figure 6.1: Motion estimation program flowchart, showing the states START/RESTART, the 7 SAD states (MID, LEFT, RIGHT, UP LEFT, UP RIGHT, DOWN LEFT, DOWN RIGHT), the MIN states, the P2 states and the final MC and DONE/RESTART states.]

The MC part in the final stage of figure 6.1 is, as mentioned, only included in the final block, but all other stages are common to all blocks. The computations performed by the different functions in figure 6.1 are depicted in figure 6.2. The Finite State Machine (FSM) included in figure 6.2 is the state machine presented in figure 6.1. Here, SAD calculating functions are e.g. LEFT, RIGHT and UP LEFT, and Min functions are e.g. MIN, P2 MIN, MIN LEFT and MIN UP RIGHT.

[Figure 6.2: Motion estimation computational flowchart, showing the FSM, the SAD and MC calculating functions with their out-of-bounds and odd/even flag checks and data address calculations, the ODD and EVEN computational functions, the Min functions and the START/RESTART initialization.]

When the block starts, it first sets the loop counter to zero, sets the motion vector to (15, 15) (to start in the middle) and initializes the addresses used to access the reference macroblocks stored in the Local Vector Memory (LVM). Then the current data macroblock is copied from the input LVM to the other LVM, which simplifies the calculations by accessing one memory for the reference and the other for the data.

By comparing the motion vector that corresponds to the current search point's position with the minimum and maximum values allowed, 0 and 30 respectively, search points that are out of bounds can be detected. If the search point is found to be out of bounds, the calculation of that macroblock's SAD will not take place and the program will continue to the next search point.

In figure 6.2, ODD and EVEN are the computational functions which use the new instructions HVBSUMABSDWA and HVBSUMABSDNA to calculate the Sum of Absolute Differences (SAD) of the macroblocks. After that, the results are summed up to a single integer value and stored in memory at one of the 11 (7 + 4) addresses dedicated to the search points of the large and small hexagon search patterns. The MIN and P2 MIN functions can then find the smallest value of the costs previously stored at a subset of the specific addresses mentioned above. MIN, for instance, examines the first 7 costs from the large hexagon pattern, and P2 MIN the final 4 search points plus the middle point again.

Once the minimum value is found, it is known whether the ODD/EVEN flag has to be updated or not. If the search point moves an even number of pixels the flag is unchanged; if it moves an odd number of pixels the flag is inverted to indicate the change. The MC calculation is very similar to the SAD calculation; the differences are that it cannot be out of bounds and that the result is one complete macroblock of the residue, which consists of 32 vectors of 16-bit integers. Once the MC calculation is finished, the execution will either finish or restart the calculations on the next macroblock.
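The flag update rule reduces to a parity check on the move. A minimal Python sketch; using only the horizontal displacement is an assumption here (a vertical move steps by whole frame rows, which would not change byte alignment when rows have an even length):

```python
def update_flag(flag, dx):
    """Invert the alignment flag on an odd horizontal move, keep it on an
    even one.  flag selects between the word-aligned and not-word-aligned
    instruction variants (...WA vs ...NA)."""
    return flag ^ (dx % 2 == 1)

flag = False                   # word aligned: use HVBSUMABSDWA
flag = update_flag(flag, 1)    # odd move  -> not aligned: use HVBSUMABSDNA
flag = update_flag(flag, 2)    # even move -> alignment unchanged
print(flag)                    # True
```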

Simple Flow Control

The simple program flow controller was implemented with a series of conditional jump instructions and a status flag stored in memory to move between the functions in the correct order, as depicted in figure 6.3.

[Figure 6.3: Hexagon search program flow controller, a chain of conditional forward jumps dispatching on the status flag values 1-19, from START and MID through the SAD, MIN and P2 states to DONE.]

The status flag is updated by each function to enable execution of the next function in order. To enable recalculation of only the three necessary positions, additional flags are set by the corresponding MIN functions, indicating for each position whether it should be recalculated in the next pass or not. The block was verified to work as intended by comparing the produced results with the results produced by the Python motion estimation reference program described in section 6.1.1.

Advanced Flow Control

When the block using the simple program flow control was functioning, it became clear that implementing functionality for call and return in the slaves could yield an increase in performance. The result was the implementation of a relatively simple hardware stack with only 4 levels, as shown in figure 6.4. In figure 6.4 the original program flow control consists of the blocks inside the dashed box. The additional hardware added by call and return is the Call/Return Controller block and the 4 address registers it uses. The added hardware cost of these parts is very reasonable; the increase in program flow controllability and the performance gain that follows make this a well worthwhile addition.
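The behaviour of the proposed 4-level stack can be modelled in a few lines of Python. This is a minimal sketch; the overflow handling (refusing a fifth nested call) and the pc + 1 return address are assumptions made for illustration.

```python
class CallReturnController:
    """Model of a 4-deep return-address stack for slave call/return."""
    DEPTH = 4

    def __init__(self):
        self.stack = []

    def call(self, pc, target):
        assert len(self.stack) < self.DEPTH, "hardware stack overflow"
        self.stack.append(pc + 1)    # save the return address
        return target                # next PC is the callee entry point

    def ret(self):
        return self.stack.pop()      # next PC is the saved return address

ctrl = CallReturnController()
pc = ctrl.call(10, target=100)       # call e.g. a SAD function at 100
pc = ctrl.call(105, target=200)      # nested call, e.g. a MIN function
print(ctrl.ret(), ctrl.ret())        # 106 11
```

With 4 levels, call depth is limited to four nested functions, which matches the shallow function nesting of the hexagon search block.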

[Figure 6.4: Proposed implementation of call and return hardware, with a Call/Return Controller and 4 return address registers added next to the PC FSM, PM and instruction decoder.]

Once functionality for call and return instructions had been added to the simulator, a new hexagon search block that utilizes these features was written. With the call and return functionality, the somewhat primitive program flow controller could be replaced with function calls. The program flowchart in figure 6.1 is still valid for blocks using the advanced program flow control, because the functionality of the blocks has not changed.

Multiple Macroblocks

Once the call and return version of the block was completed and confirmed to be working, further development focused on making each Sleipnir core perform calculations on several macroblocks. The numbers of macroblocks to calculate motion vectors for during one Sleipnir block execution were chosen as 5 and 13, because they both divide 65 evenly. Two block versions, one for 5 and one for 13 macroblocks, were implemented and tested. One of the most substantial benefits of calculating on multiple macroblocks at a time is the opportunity to exploit data reuse.

[Figure 6.5: Reference macroblock overlap between the search areas of neighbouring data macroblocks.]

For each extra macroblock beyond the first, an amount of data transfer equal to 6 macroblocks (6 x 16 = 96 vectors) can be saved. The reason for this is that the 3 macroblocks high search areas of adjacent data macroblocks overlap vertically, the shaded areas, as depicted in figure 6.5.

The Sleipnir block calculating 13 motion vectors during each execution needs a data input equal to the 13 data macroblocks, but also the search area for them, which makes up a 3 x 15 area of macroblocks. There is still a considerable horizontal overlap in this setup, but the advantage over calculating one macroblock per execution, transferring each data macroblock and its 3 x 3 macroblocks of search area, is considerable.

[Figure 6.6: Reference macroblock partitioning for 13 data macroblocks, showing the upper right corner of the reference frame in main memory and the overlapping reference columns fetched by Sleipnir 0 and Sleipnir 1.]

In figure 6.6 the data partitioning of a frame in the main memory is shown, where only the upper right corner of the full frame is depicted. The numbered areas illustrate the overlay of the data macroblocks being calculated by each Sleipnir block execution. The data macroblocks are taken from the current frame, not the reference frame shown in figure 6.6. As the frame contains 118 columns to be calculated, the next row of 13 macroblocks starts with number 119.
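The transfer saving from the 13-macroblock batching can be quantified directly; pure arithmetic based on the figures above (one 16x16 macroblock of 8-bit pixels is 16 vectors):

```python
MB_VECTORS = 16                        # one 16x16 8-bit macroblock = 16 vectors

# One macroblock per execution, done 13 times: each transfer is
# 1 data macroblock plus its 3x3 search area.
single = 13 * (1 + 3 * 3) * MB_VECTORS

# Thirteen macroblocks per execution: 13 data macroblocks plus one
# shared 3x15 search area.
batched = (13 + 3 * 15) * MB_VECTORS

print(single, batched)                 # 2080 928
print(single - batched)                # 1152 vectors saved per 13 macroblocks
```

The saving of 1152 vectors equals 12 x 96, consistent with the stated 96 vectors saved for each extra macroblock beyond the first.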

Motion Compensation

Once a motion estimation block has found the best match, only a little extra time is needed to calculate the motion compensated residue of that macroblock. By adding this functionality to the motion estimation block, motion compensation can be achieved for a very low overhead cost, as all the information needed is already present. To perform motion compensation the block, once it has found its best match, has to perform a subtraction between two macroblocks. This adds up to 256 subtraction operations, or 32 vector subtraction instructions in the Sleipnir. The result is stored as 32 vectors of eight 16-bit integers and copied back to the main memory along with the motion vectors. As mentioned in section 6.1.3, the motion compensation block uses the HVBSUBWA and HVBSUBNA instructions to speed up the calculation of the residue macroblocks.

6.1.4 Master Code

The implementation of the master code was started once the Sleipnir code was found to be working. The master's tasks include keeping track of how many more macroblocks to perform calculations on, setting up all DMA data transfers to and from the Sleipnir cores and dividing the workload of the motion estimation between them. In figure 6.7 the program flow of the master is shown. In the prolog, the stack pointer, DMA and slave interrupts, the number of macroblocks to compute, address registers and the configuration of data storage in the main memory are set up. In the prolog the master also loads the program and constants into the Sleipnir cores' memories.

[Figure 6.7: Master program flowchart: Prolog, then a loop of Configure DMA, Start DMA, Start Sleipnir, Find Next Available Sleipnir and Copy Results while more macroblocks remain to compute, and finally the Epilog.]

In the Configure DMA stage the coming DMA transfers for data and reference are configured. The addresses where the results should be written are saved to DM0 at the label called Results, which can be found in figure 6.8a.

In the Start DMA stage the DMA transfers are started and the address of the location of the next data block is calculated. There are two DMA transfers to be completed, and therefore a wait for the first one to finish is performed. During this wait the calculation of the next address is hidden. These addresses are saved to DM0 in the RAM blocks DMA data and DMA ref for easier configuration of the DMA when reaching the Configure DMA step the next time.

In the Start Sleipnir stage the memory switches are set in the correct positions before starting the Sleipnir core. After the Sleipnir core has been started, it is time to find a new available Sleipnir core to fill with data and start. The Find Next Available Sleipnir stage iterates over the Sleipnirs until it finds one that is free. This iteration gives the ability to rather quickly find any Sleipnir that has finished execution. Sleipnir cores are chosen in a first-free-first-served fashion, so that Sleipnir 0 has the highest priority and Sleipnir 7 the lowest.

The program uses status flags for each core to know what it is currently doing. These flags are used when finding a new free core. When a running Sleipnir block finishes, it sends an interrupt to the master, which changes the status flag for the core in an interrupt routine. The next time the master is looking for a free Sleipnir it will find, due to the flag value, that this Sleipnir has finished and needs to have its results copied back.

When a Sleipnir core has finished execution, the results from that core need to be copied back to the main memory. This is done in the Copy Results stage. The information about where the results should be copied is fetched from DM0 and written to the Copy Back (CB) allocated memory DMA CB, seen in figure 6.8a. The DMA unit is then configured with the information found in DMA CB. When the DMA transfer is finished it is time to load the Sleipnir with new data to perform calculations on. This happens, of course, in the Configure DMA and Start DMA stages. Using this type of program setup enables out-of-order execution of the Sleipnir cores, which is suitable for a search algorithm such as motion estimation that has a highly variable execution time. If there are more macroblocks available for calculation, the loop continues to the Configure DMA stage again.
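The first-free-first-served search with status flags can be sketched as a small Python loop. The flag values and the helper names are illustrative only, not taken from the actual master code:

```python
IDLE, RUNNING, FINISHED = 0, 1, 2      # illustrative status-flag values
flags = [IDLE] * 8                     # one flag per Sleipnir core

def on_interrupt(core):
    """Interrupt routine: a core signals that its block has finished."""
    flags[core] = FINISHED

def copy_results(core):
    """Copy Results stage: DMA the core's results back to main memory."""
    pass                               # DMA copy back (stub in this sketch)

def find_next_available():
    """Iterate from Sleipnir 0 (highest priority) to 7.  A FINISHED core
    has its results copied back before it counts as free.  If no core is
    free the loop spins until an interrupt changes a flag."""
    while True:
        for core, flag in enumerate(flags):
            if flag == FINISHED:
                copy_results(core)
                flags[core] = IDLE
            if flags[core] == IDLE:
                return core            # configure DMA and start this core

flags[0], flags[1] = RUNNING, FINISHED
core = find_next_available()
print(core)                            # 1: core 0 is busy, core 1 just finished
```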
If all macroblocks are finished, the Epilog is activated. In the Epilog the master waits for all Sleipnirs to finish their executions. The last results are then copied to main memory and the kernel is finalized.

[Figure: the master's data memory (a) holds DM0 with RAM 0 blocks DMA Data, DMA PM, DMA CB, DMA CM and DMA Ref followed by Results, plus ROM 0; the main memory (b) holds Data (current frame), Reference Frame, Results, PM and CM.]
Figure 6.8: Memory allocation of data memory in the master (a) and main memory allocation (b)

The memory allocation for the master's data memory can be found in Figure 6.8a. It is a simple setup where three blocks of DMA configuration settings are stored at the top of RAM 0. During runtime the master points the DMA firmware to the different memory blocks where it reads the DMA settings. The last block in RAM 0 is used to store result pointers for main memory. These result pointers are needed because the blocks are likely to finish in an out-of-order fashion. The allocation overview of the main memory can be seen in figure 6.8b. Data and Ref are data for the frames to encode. The Results block is memory allocated for the residues and for the motion vectors. The last two blocks are memory allocated for the Sleipnir block and its constants.

[Figure: timeline of Task 0, Task 1, ... Task 122 distributed over Sleipnir 0 to Sleipnir 4, with results arriving out of order in main memory.]
Figure 6.9: Sleipnir core motion estimation task partitioning and synchronization

In figure 6.9 the motion estimation task partitioning and synchronization between the Sleipnir cores is shown. Task 0, Task 1 and so on contain the copying of both the data macroblocks and the reference macroblocks used for the search area, and the motion estimation performed on them. The number of data macroblocks in a task can be 1, 5 or 13, and the number of reference macroblocks is then 9, 21 or 45 respectively. The figure also shows examples of the Sleipnir cores' execution times and the resulting out-of-order completion of the tasks.

6.2 Discrete Cosine Transform and Quantization

The output from the motion compensation block is a motion compensated residue, which is the input to the next part of the encoder, the Discrete Cosine Transform (DCT) and Quantization block.

6.2.1 Forward DCT and Quantization

The DCT and the quantization were combined into one Sleipnir block to save cycles by performing the quantization directly after the transform. The Quantization Parameter (QP) was chosen as a fixed value of 10; this value is easily changed if another fixed value is desired. Adding support for a variable QP would cost both additional instructions and additional constants in the constants memory. To get as low execution times as possible while still following the H.264 standard, a variable QP was left out.

The order of computations in the DCT and quantization block is as follows:

1. Process the blocks through the first DCT stage.
2. Transpose the blocks.
3. Process the blocks through the second DCT stage.
4. Transpose the blocks again.
5. Multiply by MF, scale by qbits and round to get the result.

The calculation of a 4x4 block based two-dimensional DCT as discussed in section 3.5.1 can be described as follows:

1. Calculate X0..X3 according to figure 3.9 for each row of the block.
2. Transpose the resulting 4x4 block to be able to calculate the DCT of the columns.
3. Calculate X0..X3 according to figure 3.9 for each column of the block.
4. Transpose the resulting 4x4 block again to get the final result.

As the block is transposed twice, the resulting block will not be transposed compared to the input block. The input data is presumed to be stored as 16-bit integers as this is the native Sleipnir datapath width. The input data itself consists of a number of 4x4 blocks of residue pixel values. To utilize the full datapath width of the Sleipnir, two 4x4 blocks can be calculated simultaneously. The flow of the two-dimensional DCT is depicted in figure 6.10. First the input data consisting of two 4x4 pixel blocks is read in and transformed through the first DCT stage. The result is two one-dimensionally DCT-transformed 4x4 pixel blocks.
The following transpose of the blocks will be performed as shown in figure 6.11. After that the transposed blocks will be processed by the second DCT stage and finally the blocks are yet again transposed to complete the two-dimensional DCT transform.
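As a scalar illustration of the stage/transpose flow, the H.264 forward core transform from section 3.5.1 can be sketched in Python. This models what the vectorized Sleipnir code computes, not how it computes it.

```python
def dct_stage(row):
    """One 1-D stage of the H.264 4x4 integer forward transform (figure 3.9)."""
    x0, x1, x2, x3 = row
    s0, s1 = x0 + x3, x1 + x2      # butterfly sums
    d0, d1 = x0 - x3, x1 - x2      # butterfly differences
    return [s0 + s1, 2 * d0 + d1, s0 - s1, d0 - 2 * d1]

def transpose(block):
    return [list(col) for col in zip(*block)]

def dct_4x4(block):
    """Rows through the first DCT stage, transpose, rows (former columns)
    through the second stage, transpose back: steps 1 to 4 above."""
    block = transpose([dct_stage(row) for row in block])
    return transpose([dct_stage(row) for row in block])
```

Because the block is transposed twice, the output is not transposed relative to the input; for a flat block of ones the only nonzero coefficient is the DC value 16.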

[Figure: two 4x4 blocks of 16-bit integers pass through the first DCT stage, a first blockwise transpose, the second DCT stage and a second blockwise transpose into the final two-dimensional DCT output.]
Figure 6.10: DCT flowchart

[Figure: two 4x4 blocks are stored with a memory mapping that displaces each vector one address higher; permutation addresses prestored in the constant memory (0 9 18 27 4 13 22 31 and so on) read out the transposed output.]
Figure 6.11: Memory transpose schematic

The blocks to be transposed are stored in memory according to the Memory Mapping part of figure 6.11, where the data is displaced one address higher for each new vector stored. The displacement is necessary as only one value can be read out from each memory bank. In the Local Vector Memories (LVMs) there are 8 memory banks, one for each column of the memory. This setup enables addressing of the memory according to prestored addresses in the Constant Memory (CM). As the arrows in the figure display, the first address vector in CM is 0, 9, 18, 27, 4, 13, 22 and 31. This vector will fetch the values of pixels 1, 5, 9, 13, 17, 21, 25 and 29, which can then be stored in e.g. a vector register. By using the memory transpose as shown in figure 6.11 the transpose can be performed in only 4 vector copy instructions. An excerpt from the first transpose of the Sleipnir block code is

vcopy vr0 m1[ar1 + cm[ACCESS_PATTERN_0_4]].vw
vcopy vr3 m1[ar1 + cm[ACCESS_PATTERN_3_7]].vw
vcopy vr1 m1[ar1 + cm[ACCESS_PATTERN_1_5]].vw
vcopy vr2 m1[ar1 + cm[ACCESS_PATTERN_2_6]].vw

where ar1 is an address register pointing to the location of data stored in memory m1 (Memory Mapping), vr0 to vr3 are vector registers and the access patterns are

ACCESS_PATTERN_0_4: 0 9 18 27 4 13 22 31
ACCESS_PATTERN_1_5: 1 10 19 28 5 14 23 32
ACCESS_PATTERN_2_6: 2 11 20 29 6 15 24 33
ACCESS_PATTERN_3_7: 3 12 21 30 7 16 25 34

as also shown in figure 6.11. The particular order of the vector registers in the excerpt comes from minimizing data dependencies in the following stage of the Sleipnir block code. Calculating the transpose of the two 4x4 blocks in only 4 instructions contributes to a fast DCT. Once the DCT is completed, the final stage of the block performs the quantization.
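The displaced storage and the permutation addressing can be modelled with a short Python sketch. The address arithmetic follows the Memory Mapping of figure 6.11; the helper names are ours.

```python
BANKS = 8

def store_displaced(vectors):
    """Memory Mapping of figure 6.11: element j of vector r is written to
    address r*8 + j + r, i.e. each new vector is displaced one address
    higher, so the 8 reads of an access pattern hit 8 different banks."""
    mem = [None] * (len(vectors) * BANKS + len(vectors) - 1)
    for r, vec in enumerate(vectors):
        for j, value in enumerate(vec):
            mem[r * BANKS + j + r] = value
    return mem

ACCESS_PATTERNS = {            # prestored address vectors in the CM
    "ACCESS_PATTERN_0_4": [0, 9, 18, 27, 4, 13, 22, 31],
    "ACCESS_PATTERN_1_5": [1, 10, 19, 28, 5, 14, 23, 32],
    "ACCESS_PATTERN_2_6": [2, 11, 20, 29, 6, 15, 24, 33],
    "ACCESS_PATTERN_3_7": [3, 12, 21, 30, 7, 16, 25, 34],
}

def vcopy(mem, pattern):
    """Model of one vcopy with permutation addressing: one value per bank."""
    assert len({a % BANKS for a in pattern}) == BANKS   # no bank conflicts
    return [mem[a] for a in pattern]
```

Storing the four vectors of figure 6.11 and gathering with ACCESS_PATTERN_0_4 yields the first transposed row 1, 5, 9, 13, 17, 21, 25, 29, matching the text above.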
The final quantization formula from section 3.5.3 has the benefit of being easy to implement in integer arithmetic, as the division can be replaced by a shift operation and MF only consists of integer numbers. The division by 2^qbits can be rewritten as an arithmetic right shift by qbits. Utilizing this, the implemented quantization expression can be written as

Z_ij = round(W_ij * MF_ij >> qbits) (6.1)

where >> is the right shift operation. [6] The quantization was implemented by multiplying the 4x4 blocks by the Multiplication Factor (MF) described in equation (3.19) and table 3.2 in section 3.5.3. This results in 4 vector-to-vector multiplications between the blocks and the MF used for the current value of QP. The shift by qbits and the rounding were implemented using the scaling and rounding of the multiplication result, which is a built-in function of the multiplication instruction. An excerpt from the quantization part of the Sleipnir block code is

vvmul<rnd, scale=16, ss> m0[ar2 += 8].vw vr0 cm[MF_QP_10_1]

vvmul<rnd, scale=16, ss> m0[ar2 += 8].vw vr1 cm[MF_QP_10_2]
vvmul<rnd, scale=16, ss> m0[ar2 += 8].vw vr2 cm[MF_QP_10_1]
vvmul<rnd, scale=16, ss> m0[ar2 += 8].vw vr3 cm[MF_QP_10_2]

where the values to be quantized are stored in the vector registers vr0 to vr3, ar2 is the address register pointing to the location in memory m0 where data will be stored, and the Multiplication Factors (MF) for QP equal to 10 are

MF_QP_10_1: 8192 5243 8192 5243 8192 5243 8192 5243
MF_QP_10_2: 5243 3355 5243 3355 5243 3355 5243 3355

which are derived from table 3.2 and equation (3.19) in section 3.5.3. The ability to quantize two 4x4 blocks in 4 instructions gives a quick quantization.

6.2.2 Rescaling and Inverse DCT

The rescaling and the Inverse DCT (IDCT) were also combined into one Sleipnir block to save cycles by performing the IDCT directly after the rescaling. As with the DCT and quantization block, only a fixed value of rescaling is supported to speed up execution while still following the H.264 standard.

The order of computations in the IDCT and rescaling block is as follows:

1. Perform rescaling by multiplying the blocks by V.
2. Run the blocks through the first IDCT stage.
3. Transpose the blocks.
4. Run the blocks through the second IDCT stage.
5. Divide by 64 and round to get the result.

The calculation of the 4x4 block based two-dimensional IDCT can be described as follows:

1. Calculate x0..x3 according to figure 3.10 for each row of the block.
2. Transpose the resulting 4x4 block to be able to calculate the IDCT of the columns.
3. Calculate x0..x3 according to figure 3.10 for each column of the block.
4. Transpose the resulting 4x4 block again to get the final result.

As the block is transposed twice, the resulting block will not be transposed compared to the input block. To utilize the full datapath width of the Sleipnirs, two 4x4 blocks can be calculated simultaneously.
The first stage of the block performs the rescaling by multiplying the 4x4 blocks by the rescaling factors (V) described in equation (3.25) and table 3.3 in section 3.5.4. The final rescaling formula discussed in section 3.5.4 was

W'_ij = Z_ij * V_ij * 2^floor(QP/6) (6.2)

which like the final quantization formula has the benefit of being easy to implement in integer arithmetic. [6] The factor 2^floor(QP/6) causes the output to increase by a factor of two for every increment of 6 in QP. This factor can be incorporated into V, saving at least the calculation of floor(QP/6) and one multiplication, at the cost of more constants in memory. As the constant memory is only read into the Sleipnir core once for each change of block, this was found to be beneficial, especially if a Sleipnir core is dedicated to running the IDCT and Rescaling block. By incorporating the multiplication by 2^floor(QP/6) into V, (6.2) can be rewritten as

W'_ij = Z_ij * V'_ij (6.3)

where V'_ij is V_ij with a built-in scaling of 2 for every increase of 6 in QP. Note that the result from the following Inverse DCT has to be rescaled once more to remove the constant scaling factor of 64 introduced in (3.24), which was also incorporated in V. This is the formula used in the implementation of the rescaling part of the block. An excerpt from the rescaling part of the Sleipnir block code is

vvmul<rnd, scale=0, ss> vr1 m0[ar0 + 8].vw cm[V_QP_10_2]
vvmul<rnd, scale=0, ss> vr3 m0[ar0 + 24].vw cm[V_QP_10_2]
vvmul<rnd, scale=0, ss> vr0 m0[ar0].vw cm[V_QP_10_1]
vvmul<rnd, scale=0, ss> vr2 m0[ar0 + 16].vw cm[V_QP_10_1]

where ar0 is the address register pointing to the location of the blocks in memory m0, the vector registers vr0 to vr3 will store the rescaled result, and the rescaling factors (V) for QP equal to 10 are

V_QP_10_1: 32 40 32 40 32 40 32 40
V_QP_10_2: 40 50 40 50 40 50 40 50

which are derived from table 3.3 and equation (3.25) in section 3.5.4. The ability to rescale the two 4x4 blocks in 4 instructions gives a quick rescaling. The IDCT is implemented much like the DCT described in section 6.2.1.
Compared to the DCT, the function of the transform stages is changed to perform the IDCT as described in section 3.5.2, but for example the transpose functionality is still the same. In addition, the order of the different stages is reversed compared to the DCT. The IDCT is followed by an arithmetic right shift by 6 bits, which can be written as

X = round(X_r >> 6) (6.4)

where X_r is the output from the IDCT, >> is the right shift operation and X is the final output. This final shift gives a division by 64 and removes the constant scaling factor of 64 introduced through V'_ij. The final stage is done by a vector-to-vector multiplication using the built-in scaling and rounding functionality of the multiplication instruction. An excerpt from the final scaling part of the Sleipnir block code is

vvmul<rnd, scale=6, ss> m1[ar1 + 9].vw vr0 cm[ONES]
vvmul<rnd, scale=6, ss> m1[ar1 + 18].vw vr2 cm[ONES]
vvmul<rnd, scale=6, ss> m1[ar1].vw vr4 cm[ONES]
vvmul<rnd, scale=6, ss> m1[ar1 + 27].vw vr5 cm[ONES]

where ar1 is the address register pointing to the location in memory m1 where the results should be stored, the vector registers vr0, vr2, vr4 and vr5 contain the X_r values, and ONES is a constant memory vector consisting of 8 ones,

ONES: 1 1 1 1 1 1 1 1

as only the scaling and rounding functionality of the multiplier is needed.
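The scalar arithmetic of the quantize/rescale path for QP = 10 can be sketched as follows, with qbits = 15 + floor(QP/6) = 16 and the MF and V' values listed in the excerpts above. Approximating the Sleipnir multiplier's rounding with add-half-then-shift is an assumption on our part.

```python
QP = 10
QBITS = 15 + QP // 6          # = 16 for QP = 10, matching scale=16 above

# MF and V' for QP = 10, laid out per 4x4 position as in tables 3.2 and 3.3
MF = [[8192, 5243, 8192, 5243],
      [5243, 3355, 5243, 3355],
      [8192, 5243, 8192, 5243],
      [5243, 3355, 5243, 3355]]
V = [[32, 40, 32, 40],
     [40, 50, 40, 50],
     [32, 40, 32, 40],
     [40, 50, 40, 50]]

def quantize(W):
    """Z_ij = round(W_ij * MF_ij >> qbits), equation (6.1), positive input."""
    return [[(W[i][j] * MF[i][j] + (1 << (QBITS - 1))) >> QBITS
             for j in range(4)] for i in range(4)]

def rescale(Z):
    """W'_ij = Z_ij * V'_ij, equation (6.3); the factor 64 is removed later
    by the final right shift by 6 after the IDCT."""
    return [[Z[i][j] * V[i][j] for j in range(4)] for i in range(4)]
```

For a flat block with pixel value p, the forward DCT gives the DC coefficient 16p; quantization turns it into 2p, rescaling into 64p, and the final right shift by 6 after the IDCT recovers p.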

Chapter 7

Results and Analysis

In this chapter the performance results from the implementations of the kernels and blocks are presented. The results cover motion estimation, motion compensation and transform and quantization.

7.1 Motion Estimation

In this section the results from different simulations of motion estimation are presented. The results depend on different properties of the kernel code. Each subsection has a separate description of how the simulation was performed. A total of 5 kernels and 4 video sequences have been tested on 1, 2, 4 and 8 Sleipnir cores. The results are based on calculations of 7 670 macroblocks, which means that the edges of the frame have intentionally been left out. The edges consist of 430 macroblocks. This simplification was done because these macroblocks would add special cases which could have been solved with, for example, a message box to each Sleipnir. Message boxes were not available in the revision of the simulator that was used. All test sequences were downloaded from [1]. The simulations were all executed with revision 9888 of the ePUMA simulator with a patch on event.hpp from revision 9958. The patch corrects event IDs for DMA and Sleipnir cores.

In table 7.1 short names of the kernels under test are presented with a short description. These names will be used throughout this section. In table 7.2 the columns of the result tables are described.

Datasent = Searches * ((MBs_reference + MBs_data) * vectors_per_mb) (7.1)

The amount of data sent to the blocks is in all cases calculated according to equation (7.1). Searches is the total number of searches performed and MBs_reference is the number of macroblocks sent to the Sleipnir blocks as reference to be used as the search area. MBs_data is the number of data macroblocks and vectors_per_mb is 16 when using a representation of 8 bits per pixel and 32 when using 16 bits per pixel.

In equation (7.1) the data transfer cost of the DMA for programming the Sleipnirs' PM and CM is not included.

Kernel 1: Calculates the motion vector for one macroblock each execution. Program flow control is implemented with jump.
Kernel 2: Calculates the motion vector for one macroblock each execution. Program flow control has support for call and return.
Kernel 3: Calculates motion vectors for 5 macroblocks each execution. Program flow control has support for call and return.
Kernel 4: Calculates motion vectors for 13 macroblocks each execution. Program flow control has support for call and return.
Kernel 5: Calculates motion vectors and motion compensated residue blocks for 13 macroblocks. Program flow control has support for call and return.
Table 7.1: Short names for kernels that have been tested

Core: Sleipnir core
Number of starts: Number of times the Sleipnir core has been started with the specific block
Total cycles: Total number of cycles that the Sleipnir has executed during simulation
Idling cycles: Total number of cycles that the Sleipnir has been idling during simulation
Runtime idle: Number of cycles the Sleipnir has been idling, not including idle before the first start and after the last start
Utilization in percent: Sleipnir utilization in % based on the total number of cycles executed in the block and the total simulated cycles
Table 7.2: Description of table columns

7.1.1 Kernel 1

Results presented in this section are simulations of the Sleipnir block performing the motion vector calculation of one macroblock; this block is called block 1. The Sum of Absolute Difference (SAD) calculations are implemented using the complex instructions HVBSUMABSDWA and HVBSUMABSDNA as discussed in section 6.1.2. Program flow control is implemented using the jump instruction. Only the simulations that required the most computational power are presented.
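Equation (7.1) and the MByte figures quoted in the result tables below can be reproduced with a small helper. The 16 bytes per vector (8 lanes of 16 bits) is our assumption based on the vector width used throughout the implementation.

```python
VECTOR_BYTES = 16   # assumed: 8 lanes of 16 bits per vector

def data_sent(searches, mbs_reference, mbs_data, vectors_per_mb=16):
    """Equation (7.1): vectors DMA'd to the Sleipnir blocks for one kernel,
    with vectors_per_mb = 16 for the 8-bit-per-pixel representation."""
    return searches * (mbs_reference + mbs_data) * vectors_per_mb

def mbytes(vectors):
    return vectors * VECTOR_BYTES / 2**20

# Kernel 1: 7 670 searches, 9 reference MBs and 1 data MB per search
v = data_sent(7_670, 9, 1)      # 1 227 200 vectors, about 18.73 MByte
```

The same helper reproduces the kernel 3 figure (1 534 searches of 21 + 5 macroblocks, 638 144 vectors) and the kernel 4/5 figure (590 searches of 45 + 13 macroblocks, 547 520 vectors).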

Result

Core | Number of starts | Total cycles | Idling cycles | Runtime idle | Utilization in percent
Sleipnir 0 | 7 670 | 30 853 670 | 5 574 019 | 5 572 399 | 84.7
Avg. util. | | | | | 84.7
Master | | 36 427 689 | | |
Table 7.3: Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 1 using 1 Sleipnir core

Core | Number of starts | Total cycles | Idling cycles | Runtime idle | Utilization in percent
Sleipnir 0 | 1 220 | 4 806 328 | 1 363 006 | 1 358 873 | 77.9
Sleipnir 1 | 1 189 | 4 693 008 | 1 476 326 | 1 318 404 | 76.1
Sleipnir 2 | 1 141 | 4 546 033 | 1 623 301 | 1 257 604 | 73.7
Sleipnir 3 | 1 086 | 4 330 691 | 1 838 643 | 1 197 861 | 70.2
Sleipnir 4 | 993 | 4 072 588 | 2 096 746 | 1 104 705 | 66.0
Sleipnir 5 | 885 | 3 591 793 | 2 577 541 | 984 926 | 58.2
Sleipnir 6 | 698 | 2 902 438 | 3 266 896 | 786 886 | 47.0
Sleipnir 7 | 458 | 1 910 784 | 4 258 550 | 509 376 | 31.0
Avg. util. | | | | | 62.5
Master | | 6 169 334 | | |
Table 7.4: Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 1 using 8 Sleipnir cores

Block 1 | PM | CM | LVM 0 | LVM 1
Cost | 613 instructions | 65 vectors | 26 vectors | 180 vectors
Table 7.5: Block 1 costs

The best runtime for one block execution was 1 584 cycles and the worst was 10 386 cycles in both simulations. The amount of data that was sent to the blocks was 7 670 * ((9 + 1) * 16) = 1 227 200 vectors (18.73 MByte), calculated according to

equation (7.1). 7 670 vectors (0.12 MByte) were copied back to main memory from the blocks. Before any calculations can begin, a prolog is executed to copy vectors to a second memory and to set up address registers. This prolog is 31 cycles in block 1. After the search has finished an epilog is executed, which takes 8 cycles.

Analysis

Kernel 1 was the first working kernel and a proof of concept. The DMA configuration performed by the master is written in such a way that Sleipnir 0 has the highest priority, Sleipnir 1 the second priority and so on. This is the reason why Sleipnir 7 has a lower utilization compared to, for example, Sleipnir 0. With this kernel it was found that a lot of cycles in the Sleipnir block were spent on state handling and extra overhead caused by the jump instruction, which can only jump to immediate addresses.

7.1.2 Kernel 2

Results presented in this section are simulations of kernel 2. This kernel uses an improved version of block 1, called block 2. Block 2 calculates the motion vector of one macroblock each execution. The Sum of Absolute Difference (SAD) calculations are implemented using the complex instructions HVBSUMABSDWA and HVBSUMABSDNA as discussed in section 6.1.2. In block 2, hardware support for call and return has been added to the simulator and this is utilized for program flow control.

Result

Core | Number of starts | Total cycles | Idling cycles | Runtime idle | Utilization in percent
Sleipnir 0 | 7 670 | 16 933 783 | 5 572 953 | 5 571 478 | 75.2
Avg. util. | | | | | 75.2
Master | | 22 506 736 | | |
Table 7.6: Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 2 using 1 Sleipnir core

The results from the simulation with the riverbed video sequence are presented in tables 7.6 and 7.7. The best runtime for one block 2 execution was 986 cycles and the worst was 5 348 cycles in both simulations.
The amount of data that was sent to the blocks was 7 670 * ((9 + 1) * 16) = 1 227 200 vectors (18.73 MByte) and 7 670 vectors (0.12 MByte) were copied back as the result from the blocks. Before any calculations can begin, a prolog is executed to copy vectors to a second memory and to set up address registers. This prolog is the same as in block 1 and therefore takes 31

cycles. After the search has finished an epilog is executed which, as in block 1, takes 8 cycles.

Core | Number of starts | Total cycles | Idling cycles | Runtime idle | Utilization in percent
Sleipnir 0 | 1 755 | 3 848 831 | 2 003 542 | 2 000 872 | 65.8
Sleipnir 1 | 1 678 | 3 652 381 | 2 199 992 | 1 909 706 | 62.4
Sleipnir 2 | 1 562 | 3 393 931 | 2 458 442 | 1 770 069 | 58.0
Sleipnir 3 | 1 358 | 3 004 121 | 2 848 252 | 1 542 502 | 51.3
Sleipnir 4 | 910 | 2 069 282 | 3 783 091 | 1 035 091 | 35.4
Sleipnir 5 | 368 | 867 747 | 4 984 626 | 417 652 | 14.8
Sleipnir 6 | 38 | 95 305 | 5 757 068 | 41 958 | 1.6
Sleipnir 7 | 1 | 2 178 | 5 850 195 | 1 181 | 0.0
Avg. util. | | | | | 36.2
Master | | 5 852 373 | | |
Table 7.7: Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 2 using 8 Sleipnir cores

Block 2 | PM | CM | LVM 0 | LVM 1
Cost | 442 instructions | 65 vectors | 26 vectors | 180 vectors
Table 7.8: Block 2 costs

Analysis

In block 2 an improvement of 37.8% in best execution time can be seen compared to block 1. There is also an improvement of 48.5% in the worst execution time compared to block 1. This improvement is significant and should lower the total execution time of one frame, but as can be seen the total execution time is only improved by 5.1%. The explanation is that the average utilization of the Sleipnir cores has decreased from 84.7% to 75.2% in the simulation with 1 Sleipnir and from 62.5% to 36.2% in the simulation with 8 Sleipnirs. In table 7.7 it can be seen that the utilization of Sleipnir 7 is 0.0%. This indicates that the blocks are executing too few cycles in the Sleipnirs, or that the master is too slow and does not feed the Sleipnirs with enough data. Targeting the master code does not offer many opportunities for optimization, and the complexity of the code has not yet reached the complexity of a complete encoder. It was therefore concluded that searching more macroblocks per block execution should be investigated. Table 7.8 shows the

memory cost of block 2.

7.1.3 Kernel 3

Results presented in this section are simulations of kernel 3. This kernel uses a Sleipnir block called block 3. Block 3 is a further development of block 2 where a wrapper that handles looping has been added. Block 3 calculates the motion vectors of 5 macroblocks during each execution. As in kernel 2, the Sum of Absolute Difference (SAD) calculations are implemented using the complex instructions HVBSUMABSDWA and HVBSUMABSDNA as discussed in section 6.1.2.

Result

Core | Number of starts | Total cycles | Idling cycles | Runtime idle | Utilization in percent
Sleipnir 0 | 198 | 2 226 674 | 273 956 | 264 792 | 89.0
Sleipnir 1 | 195 | 2 235 280 | 265 350 | 264 586 | 89.4
Sleipnir 2 | 193 | 2 218 086 | 282 544 | 256 840 | 88.7
Sleipnir 3 | 194 | 2 187 261 | 313 369 | 255 665 | 87.5
Sleipnir 4 | 191 | 2 152 389 | 348 241 | 253 583 | 86.1
Sleipnir 5 | 190 | 2 175 113 | 325 517 | 252 341 | 87.0
Sleipnir 6 | 187 | 2 135 931 | 364 699 | 250 903 | 85.4
Sleipnir 7 | 186 | 2 139 942 | 360 688 | 247 395 | 85.6
Avg. util. | | | | | 87.3
Master | | 2 500 630 | | |
Table 7.9: Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 3 using 8 Sleipnir cores

Block 3 | PM | CM | LVM 0 | LVM 1
Cost | 478 instructions | 67 vectors | 26 vectors | 444 vectors
Table 7.10: Kernel 3 costs

The results from the simulation with the riverbed video sequence are presented in table 7.9. The best runtime for one Sleipnir block execution was 5 866 cycles and the worst was 19 416 cycles. The amount of data that was sent to the blocks was (7 670/5) * ((3 * 7 + 5) * 16) = 638 144 vectors (9.74 MByte) and 7 670 vectors (0.12 MByte) were

copied back as the result from the blocks. The prolog in block 3 is slightly larger than in block 2; it is now 46 cycles. The epilog has also grown and now takes 83 cycles. Between the calculations on each macroblock there is an intermission that takes 43 cycles to finish. This intermission changes offsets for memory reads and copies a new macroblock to the second memory.

Analysis

Kernel 3 resulted in a 57.3% improvement of total simulation time for execution on 8 Sleipnirs compared to kernel 2. The utilization has increased to over 85% in Sleipnir 7, which is more acceptable. The wrapper introduced in block 3 only required 36 extra instructions compared to block 2. The increase in LVM memory needed is for storage of the 16 extra macroblocks, 4 more motion vectors and extra overhead from e.g. the added loop counter.

7.1.4 Kernel 4

Results presented in this section are simulations of kernel 4. This kernel uses a Sleipnir block called block 4. Block 4 is the next step of improvement of the Sleipnir blocks and it calculates 13 motion vectors during each execution. As in blocks 2 and 3, the Sum of Absolute Difference (SAD) calculations are implemented using the complex instructions HVBSUMABSDWA and HVBSUMABSDNA as discussed in section 6.1.2.

Result

Core | Number of starts | Total cycles | Idling cycles | Runtime idle | Utilization in percent
Sleipnir 0 | 149 | 4 370 096 | 262 074 | 253 825 | 94.3
Sleipnir 1 | 146 | 4 359 195 | 272 975 | 250 107 | 94.1
Sleipnir 2 | 148 | 4 377 214 | 254 956 | 249 644 | 94.5
Sleipnir 3 | 147 | 4 345 295 | 286 875 | 250 289 | 93.8
Avg. util. | | | | | 94.2
Master | | 4 632 170 | | |
Table 7.11: Motion estimation results from simulation with Riverbed frame 10 and Riverbed frame 11 with kernel 4 using 4 Sleipnir cores

The results from the simulation with the riverbed video sequence are presented in tables 7.12 and 7.11. The best runtime for one Sleipnir block execution was 18 057 cycles and the worst was 42 896 cycles in both simulations.
The amount of data that was sent to the blocks was (7 670/13) * ((15 * 3 + 13) * 16) = 547 520 vectors (8.35 MByte) and 7 670 vectors (0.12 MByte) were copied back as the result from the

blocks. Block 4 has the same prolog, intermission and epilog cycle costs as block 3, i.e. 46, 43 and 83 cycles.

Core | Number of starts | Total cycles | Idling cycles | Runtime idle | Utilization in percent
Sleipnir 0 | 75 | 2 214 930 | 146 170 | 130 223 | 93.8
Sleipnir 1 | 74 | 2 200 402 | 160 698 | 129 794 | 93.2
Sleipnir 2 | 74 | 2 194 769 | 166 331 | 129 572 | 93.0
Sleipnir 3 | 75 | 2 191 467 | 169 633 | 130 113 | 92.8
Sleipnir 4 | 75 | 2 191 754 | 169 346 | 132 144 | 92.8
Sleipnir 5 | 73 | 2 174 953 | 186 147 | 127 130 | 92.1
Sleipnir 6 | 72 | 2 127 822 | 233 278 | 125 152 | 90.1
Sleipnir 7 | 72 | 2 155 699 | 205 401 | 127 116 | 91.3
Avg. util. | | | | | 92.4
Master | | 2 361 100 | | |
Table 7.12: Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 4 using 8 Sleipnir cores

Block 4 | PM | CM | LVM 0 | LVM 1
Cost | 478 instructions | 67 vectors | 26 vectors | 964 vectors
Table 7.13: Kernel 4 costs

Analysis

Kernel 4 pushes the utilization up to over 90% in every Sleipnir. The total simulation time has decreased from 2.50 Mega cycles (Mc) to 2.36 Mc, which is an improvement of 5.6%. Kernel 4 only copies 85.8% of the data compared to kernel 3. This decrease in memory data transfers will help later when the whole encoder is implemented. The cost of local memory used in the Sleipnir block has increased by 520 vectors compared to block 3.

7.1.5 Kernel 5

Results presented in this section are simulations of kernel 5. This kernel uses a Sleipnir block called block 5. Block 5 uses the same motion estimation code as block 4, where as before the Sum of Absolute Difference (SAD) calculations are implemented using the complex instructions HVBSUMABSDWA and HVBSUMABSDNA as discussed in section 6.1.2. Added to block 5 is code for calculating the motion compensated residue macroblock, which is done using the HVBSUBWA and HVBSUBNA instructions as discussed in section 6.1.2. The benefit of doing this in the same Sleipnir block is that all extra overhead for moving data to another kernel is avoided. In this part, simulation results from 4 different video sequences are presented to highlight that the total simulation time differs depending on the data that is fed to the Sleipnir blocks.

Result

Core | Number of starts | Total cycles | Idling cycles | Runtime idle | Utilization in percent
Sleipnir 0 | 74 | 1 783 865 | 198 721 | 175 131 | 90.0
Sleipnir 1 | 75 | 1 806 053 | 176 533 | 178 070 | 91.1
Sleipnir 2 | 74 | 1 799 925 | 182 661 | 176 137 | 90.8
Sleipnir 3 | 73 | 1 766 940 | 215 646 | 173 459 | 89.1
Sleipnir 4 | 74 | 1 787 586 | 195 000 | 178 102 | 90.2
Sleipnir 5 | 74 | 1 798 046 | 184 540 | 178 102 | 90.7
Sleipnir 6 | 73 | 1 776 852 | 205 734 | 176 123 | 89.6
Sleipnir 7 | 73 | 1 765 463 | 217 123 | 173 901 | 89.0
Avg. util. | | | | | 90.1
Master | | 1 982 586 | | |
Table 7.14: Motion estimation results from simulation on Sunflower frame 10 and Sunflower frame 11 with kernel 5 using 8 Sleipnir cores

The results from the simulation with the sunflower video sequence are presented in table 7.14. The best runtime for one Sleipnir block execution was 18 457 cycles and the worst was 28 039 cycles. Table 7.15 shows the simulation on the blue sky video sequence, which resulted in a best runtime for one Sleipnir block execution of 18 079 cycles and a worst runtime of 41 415 cycles.

Core | Number of starts | Total cycles | Idling cycles | Runtime idle | Utilization in percent
Sleipnir 0 | 74 | 1 945 498 | 195 944 | 170 596 | 90.8
Sleipnir 1 | 75 | 1 954 782 | 186 660 | 167 994 | 91.3
Sleipnir 2 | 74 | 1 929 502 | 211 940 | 170 267 | 90.1
Sleipnir 3 | 73 | 1 919 413 | 222 029 | 165 579 | 89.6
Sleipnir 4 | 73 | 1 926 889 | 214 553 | 164 997 | 90.0
Sleipnir 5 | 75 | 1 926 105 | 215 337 | 171 946 | 89.9
Sleipnir 6 | 74 | 1 926 972 | 214 470 | 172 619 | 90.0
Sleipnir 7 | 72 | 1 895 777 | 245 665 | 168 488 | 88.5
Avg. util. | | | | | 90.0
Master | | 2 141 442 | | |
Table 7.15: Motion estimation results from simulation on Blue sky frame 10 and Blue sky frame 11 with kernel 5 using 8 Sleipnir cores

The third simulation was done with the pedestrian area clip and the results can be found in table 7.16. The best runtime for one Sleipnir block execution was 15 378 cycles and the worst was 47 611 cycles.

Core | Number of starts | Total cycles | Idling cycles | Runtime idle | Utilization in percent
Sleipnir 0 | 79 | 2 089 641 | 193 948 | 184 881 | 91.5
Sleipnir 1 | 75 | 2 071 619 | 211 970 | 171 855 | 90.7
Sleipnir 2 | 73 | 2 071 989 | 211 600 | 168 548 | 90.7
Sleipnir 3 | 75 | 2 049 072 | 234 517 | 177 366 | 89.7
Sleipnir 4 | 71 | 2 080 353 | 203 236 | 164 579 | 91.1
Sleipnir 5 | 73 | 2 058 428 | 225 161 | 171 021 | 90.1
Sleipnir 6 | 73 | 2 043 657 | 239 932 | 171 602 | 89.5
Sleipnir 7 | 71 | 2 031 873 | 251 716 | 164 102 | 89.0
Avg. util. | | | | | 90.3
Master | | 2 283 589 | | |
Table 7.16: Motion estimation results from simulation on Pedestrian area frame 10 and Pedestrian area frame 11 with kernel 5 using 8 Sleipnir cores

Core | Number of starts | Total cycles | Idling cycles | Runtime idle | Utilization in percent
Sleipnir 0 | 147 | 4 659 746 | 315 540 | 311 485 | 93.7
Sleipnir 1 | 147 | 4 628 572 | 346 714 | 310 410 | 93.0
Sleipnir 2 | 148 | 4 616 250 | 359 036 | 313 279 | 92.8
Sleipnir 3 | 148 | 4 629 588 | 345 698 | 315 330 | 93.1
Avg. util. | | | | | 93.1
Master | | 4 975 286 | | |
Table 7.17: Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 5 using 4 Sleipnir cores

Core | Number of starts | Total cycles | Idling cycles | Runtime idle | Utilization in percent
Sleipnir 0 | 75 | 2 333 516 | 207 336 | 188 358 | 91.8
Sleipnir 1 | 74 | 2 349 169 | 191 683 | 187 756 | 92.5
Sleipnir 2 | 73 | 2 331 312 | 209 540 | 185 101 | 91.8
Sleipnir 3 | 75 | 2 323 308 | 217 544 | 189 070 | 91.4
Sleipnir 4 | 74 | 2 331 564 | 209 288 | 188 828 | 91.8
Sleipnir 5 | 75 | 2 305 196 | 235 656 | 186 202 | 90.7
Sleipnir 6 | 72 | 2 299 853 | 240 999 | 184 476 | 90.5
Sleipnir 7 | 72 | 2 260 234 | 280 618 | 180 283 | 89.0
Avg. util. | | | | | 91.2
Master | | 2 540 852 | | |
Table 7.18: Motion estimation results from simulation on Riverbed frame 10 and Riverbed frame 11 with kernel 5 on 8 Sleipnir cores

The last simulation was done on the riverbed video sequence and the results can be found in table 7.18. The best runtime for one Sleipnir block execution was 19 884 cycles and the worst was 44 739 cycles in the simulations presented in tables 7.18 and 7.17. The amount of data that was sent to the Sleipnir blocks was (7 670/13) * ((15 * 3 + 13) * 16) = 547 520 vectors (8.35 MByte) and 7 670 * 33 = 253 110 vectors (1.99 MByte) were copied back from the blocks. The prolog cost for block 5 is 46 cycles and the intermission cycle cost is 43, the same as in block 4. The epilog in block 5 is 185 cycles, which comes from the time it takes to save the motion compensated residue to local vector memory.

Block 5          Cost
PM               574 instructions
CM               64 vectors
LVM 0            26 vectors
LVM 1            1 411 vectors
copy LVM -> VR   34 instructions
copy CM -> VR    23 instructions

Table 7.19: Kernel 5 costs

Analysis

The difference from block 4 can easily be seen in table 7.19, where the memory cost has increased considerably due to the extra vectors needed for storing the motion compensated residues. This also requires extra data to be copied back to main memory, which increases the runtime idle. The differences can be seen by comparing table 7.12 and table 7.18.

As was mentioned in the beginning of the chapter, the cycle cost for kernel 5 is not based on complete full HD frames. Equation (7.2) is a calculated approximation of the increased cost when the input data is a complete full HD frame.

Number of MB in a full HD frame = 8 100
Number of MB in kernel 5 = 7 670
P_inc = 8 100 / 7 670 ≈ 1.06
Total cycle cost = 2 540 852 · P_inc ≈ 2 693 304    (7.2)

The numbers of copy instructions from one of the LVMs and from the CM to the Vector Register (VR) are listed in table 7.19. These copies do not add any computational functionality and are therefore not desirable. For block 5 these numbers are rather low, which indicates that not too much unnecessary copying is done. Some of these copies are used to speed up the block: for example, by pre-loading a value into the vector register instead of reading it from the CM, the execution of the instruction using it will finish faster.
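Equation (7.2) can be checked in a few lines. Note that the text rounds P_inc to 1.06 before multiplying, so the last digit of the product depends on that rounding:

```python
# Extrapolating the measured kernel 5 master cycle count (table 7.18) to a
# complete full HD frame, as in equation (7.2).
mb_full_hd = 8_100          # macroblocks in a complete full HD frame
mb_kernel5 = 7_670          # macroblocks covered by the kernel 5 simulation
measured_cycles = 2_540_852

p_inc = round(mb_full_hd / mb_kernel5, 2)   # rounded to 1.06 as in the text
total_cycle_cost = measured_cycles * p_inc
print(p_inc, total_cycle_cost)
```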

7.1.6 Master Code

The master code is used when testing the motion estimation blocks, also known as block 1, block 2, block 3, block 4 and block 5.

Program Memory Costs

The master codes used in kernels 1, 2, 3, 4 and 5 are slightly different but have the same size. The simulations done with 1, 2, 4 and 8 Sleipnirs differ somewhat in code size, caused by the removal of code only used when keeping track of more Sleipnirs. The DMA firmware that was used is not included in the statistics for the master; instead it has a row of its own, because it is included in all the kernels.

Description               Code size   RAM   ROM
Master with 1 Sleipnir          326    58    16
Master with 2 Sleipnirs         363    60    16
Master with 4 Sleipnirs         437    64    16
Master with 8 Sleipnirs         585    72    16
DMA Firmware                    272     0     0

Table 7.20: Master code cost

In table 7.20 the column Code size is measured in number of instructions and the columns RAM and ROM are measured in words. Table 7.20 shows that the DMA firmware does not use any memory. That is not really the case; its memory cost has instead been included in the master code costs so that it is not counted twice. Worth mentioning is that the DMA firmware was not written by the authors and therefore has its own row for the cost. The ROM contains information for the DMA pointing to the addresses where the program memory and constant memory for the Sleipnirs are stored.

The instruction costs of the master have not been a target for optimization and therefore leave room for improvement. The reason for optimizing neither the cost nor the code is that the master will not be used in exactly this way in a complete encoder. In main memory, data for 2 complete full HD frames with 4:2:0 sampling is allocated, together with space the size of the Sleipnirs' program memory and constant memory. The master also allocates memory for the resulting motion vectors and motion compensated residue blocks.
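As a rough check of the frame allocation mentioned above, assuming 8-bit samples (an assumption, since the sample width is not restated here), 4:2:0 sampling means 1.5 bytes per pixel and the two frames occupy about 6.2 MByte:

```python
# Approximate main memory needed for two full HD frames with 4:2:0 sampling,
# assuming 8-bit samples: 1.5 bytes per pixel (full-size luma plus two
# quarter-size chroma planes). The sample width is an assumption.
width, height = 1920, 1080
bytes_per_frame = int(width * height * 1.5)
total_bytes = 2 * bytes_per_frame
print(bytes_per_frame, total_bytes)
```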
Prolog and Epilog

Before any calculations can be done, the environment has to be configured in the processor. Table 7.21 introduces this cost. The cycle count of the long prolog includes configuration of the stack pointer, interrupt handling, set-up of registers, programming of the Sleipnir cores' PM and CM, and data copying to LVM. It can also be described as the cycle count until the first Sleipnir core is started. The short prolog is the same as the long prolog except that the data copying to LVM has been excluded.

Task           Cycles
Prolog short      929
Prolog long    38 262
Epilog            277

Table 7.21: Prolog and epilog cycle costs

When all calculations have been performed, an epilog is initiated. This epilog finalizes the kernel; in this case it empties the last Sleipnir core of calculation results. The cycle count of waiting for the last Sleipnir to finish has not been included, because it depends on which data the calculations are performed upon.

Video Sequence    Epilog cycles
Blue Sky                 25 955
Sunflower                23 169
Pedestrian Area          22 350
Riverbed                 31 969

Table 7.22: Simulated epilog cycle cost including waiting for the last Sleipnir to finish

The epilog cycle costs including the wait for the last kernel to finish have been measured and the results are presented in table 7.22. Results for the 4 different video sequences can of course be worse than in table 7.22, but the table gives a better understanding of the cycle cost. All results are from simulations with 8 Sleipnir cores. As can be seen in table 7.22, the riverbed simulation had the longest epilog execution time. This can vary and is not necessarily related to the overall computational load of the frame. The last parts to be motion estimated are in the bottom-right corner of the frame. One of these columns of 13 macroblocks will likely be the last column a Sleipnir has to process, and this is the Sleipnir the master has to wait for. If there is a lot of motion in the bottom-right corner of the frame, the calculation of the motion vectors will need more cycles to finish and the epilog will therefore cost more cycles.

DMA

To initiate a DMA transfer, the DMA module needs to be configured and the transfer needs to be started. The DMA firmware provides subroutines to do this. Table 7.23 lists DMA costs from kernel 5. In table 7.23 the transfer cost for search data is 760 cycles. The observant reader will notice that kernel 5 should only need 720 cycles to transfer 720 vectors.
The measurement is done in such a way that all extra penalties are included, which means that the cycle costs for interrupt and return are counted. The transfer time is therefore longer than expected.
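A quick way to see the penalty is to subtract the ideal one-cycle-per-vector cost from the measured transfer cost:

```python
# Interrupt/return penalty on the kernel 5 search data DMA transfer:
# 720 vectors should need 720 cycles, but 760 cycles were measured.
vectors = 720
measured_cycles = 760
penalty = measured_cycles - vectors
print(penalty)   # extra cycles per transfer
```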

Task                                  Cycles
Loading Sleipnir PM
  Configure                               41
  Start                                   39
  Transfer block 5                       666
Loading Sleipnir CM
  Configure                               41
  Start                                   44
  Transfer block 5                       106
DMA Firmware
  Configure search data                   75
  Configure results                       59
  Start search data                       62
  Start results                           62
  Transfer search data, block 5          760
  Transfer MB to search for, block 5     250
  Transfer results, block 5               55

Table 7.23: DMA cycle costs

The costs for copying the Sleipnir PM and CM are also presented in table 7.23. These costs are included in the prolog of the program; the total cost of the prolog can be found in table 7.21. Considering this when implementing a complete encoder, decisions can be made on whether to distribute the different blocks between different cores or to load the cores with a new block between tasks.

Table 7.23 also shows three different data transfers. At least two are needed: one for filling an LVM with data and one for emptying the LVM. The reason for using three different transfers was that an easier memory allocation scheme in main memory could be used.

To gain better utilization of the Sleipnir cores, the master needs to start them and keep them running as much as possible. One way to increase utilization is to do as much as possible during DMA transactions. The master that was used for simulating kernels 1 to 5 did not offer much opportunity to hide cycles during DMA transfers. During the transfer of search data 98 cycles could be executed, and during the transfer of the macroblocks to search for 22 cycles could be executed. When the results are copied back to main memory, 0 cycles could be saved.

7.1.7 Summary

The results from kernel 2 can be compared with the cycle cost of the H.264 encoder for the STI Cell processor, which can be found in [15] and [11]. There the cycle cost of performing a hexagon search on a macroblock of 16×16 pixels in a (-15,15)×(-15,15) search area is listed. This corresponds to the same functionality

as was implemented in Sleipnir blocks 1 through 4. The listed cycle costs for the best and worst case searches are 1 451 and 3 609 cycles respectively. These results can be compared to the best and worst runtimes for kernel 2 running motion estimation on a macroblock of 16×16 pixels in a (-15,15)×(-15,15) search area for the riverbed video sequence. Kernel 2 is used since it is possible to measure each search separately while the functionality is still the same. The best and worst runtimes were 986 and 5 348 cycles. This shows that the best case runtime is substantially shorter for the ePUMA implementation, whereas the worst case is substantially better for the STI Cell implementation. Block 2 still offers room for improvement: the low runtime of the best case shows that the search and its overhead could be optimized further for long searches to reach better worst case performance.

Scalability

Figure 7.1: Cycle scaling from 1 to 8 Sleipnir cores for simulation of riverbed

Figure 7.1 shows a graph of the scaling from 1 core to 8 cores for the 5 different kernels. It can be seen that the scaling is almost linear in the simulations with kernels 4 and 5, which shows that the master can fully utilize the extra cores to speed up calculations. The simulation results that the graph is created from can be found in appendix B, where simulation results from pedestrian area, sunflower and blue sky can also be found. The reason for the better scaling of kernels 4 and 5 is that more calculations, and therefore more cycles, are performed during each execution of a Sleipnir core. By spending more cycles in the kernel, the master has more time to provide the other 7 cores with data between two executions of a Sleipnir core.
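The per-macroblock hexagon-search comparison with the STI Cell figures above can be condensed into cycle ratios; a small sketch using the numbers quoted in this section:

```python
# Best/worst case hexagon-search cycle counts per 16x16 macroblock:
# kernel 2 on ePUMA vs. the STI Cell implementation from [15] and [11].
epuma = {"best": 986, "worst": 5_348}
cell = {"best": 1_451, "worst": 3_609}

for case in ("best", "worst"):
    ratio = epuma[case] / cell[case]
    print(f"{case}: ePUMA/Cell = {ratio:.2f}")
```

A ratio below 1 favours the ePUMA implementation (best case); above 1 favours the STI Cell implementation (worst case).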
The processor is best utilized when all Sleipnir cores are running simultaneously. The easiest way to increase speed further would be either to optimize kernel 5 even more or to write code for a master that utilizes the

7.1 Motion Estimation 73 third local vector memory that is connected to the DMA. Utilizing the LVM will make it possible to hide more DMA cycles and by that increase utilization of the Sleipnir cores which will result in a faster total eecution time. Energy Reduction Results Figure 7.2: Frame 10 from Pedestrian Area video sequence Figure 7.3: Difference between frame 10 and frame 11 in Pedestrian Area video sequence To see the real difference between ordinary residue calculation and motion compensated residue calculation both will be presented as well as the differences between them. Figure 7.2 is frame 10 from pedestrian area video sequence and shows the

back of a person in the center of the image and a lot of moving people on the street in the background. Figure 7.3 presents the residue between frame 10 and frame 11. The white areas in the picture indicate big differences between the two frames; darker areas indicate a better match between the two frames.

Figure 7.4: Motion vector field calculated by kernel 5 on frame 10 and 11 of the Pedestrian Area video sequence

Figure 7.5: Difference between frame 10 and frame 11 in Pedestrian Area video sequence using motion compensation