Institutionen för systemteknik
Department of Electrical Engineering

Examensarbete (Master's thesis)

Evaluation of computer vision algorithms optimized for embedded GPU:s

Master's thesis carried out in Computer Vision at the Institute of Technology at Linköping University by

Mattias Nilsson

LiTH-ISY-EX--14/4816--SE
Linköping 2014

Department of Electrical Engineering, Linköpings universitet, SE Linköping, Sweden


Evaluation of computer vision algorithms optimized for embedded GPU:s

Master's thesis carried out in Computer Vision at the Institute of Technology at Linköping University by

Mattias Nilsson

LiTH-ISY-EX--14/4816--SE

Handledare (supervisors): Erik Ringaby, ISY, Linköpings universitet; Johan Pettersson, SICK IVP
Examinator (examiner): Klas Nordberg, ISY, Linköpings universitet

Linköping, 20 May 2014


Avdelning, Institution / Division, Department: Computer Vision Laboratory, Department of Electrical Engineering, SE Linköping

Språk / Language: Engelska/English
Rapporttyp / Report category: Examensarbete
ISRN: LiTH-ISY-EX--14/4816--SE

Titel / Title: Utvärdering av bildbehandlingsalgoritmer optimerade för inbyggda GPU:er / Evaluation of computer vision algorithms optimized for embedded GPU:s
Författare / Author: Mattias Nilsson

Nyckelord / Keywords: Embedded GPU, Computer Vision, CUDA


Abstract

The interest in using GPU:s as general processing units for heavy computations (GPGPU) has increased in the last couple of years. Manufacturers such as Nvidia and AMD make GPU:s powerful enough to outperform CPU:s by an order of magnitude for suitable algorithms. For embedded systems, GPU:s are not as popular yet. The embedded GPU:s available on the market have often not been able to justify hardware changes from the current systems (CPU:s and FPGA:s) to systems using embedded GPU:s. They have been too hard to get, too energy consuming and not suitable for some algorithms. At SICK IVP, advanced computer vision algorithms run on FPGA:s. This master thesis optimizes two such algorithms for embedded GPU:s and evaluates the result. It also evaluates the status of the embedded GPU:s on the market today. The results indicate that embedded GPU:s perform well enough to run the evaluated algorithms as fast as needed. The implementations are also easier to understand than implementations for FPGA:s, which are the competing hardware.


Acknowledgments

This project could not have been carried out without the help of Johan Pettersson and Johan Hedborg. Thank you very much! I would also like to thank Erik Ringaby and Klas Nordberg from CVL for their help.

Linköping, June 2014
Mattias Nilsson


Contents

Notation

1 Introduction
   1.1 Background
   1.2 Purpose and goal
   1.3 Delimitations
   1.4 Hardware

2 Sequential Algorithms
   2.1 Rectification of images
   2.2 Pattern recognition
       Normalized cross correlation
       Scaling and rotation
       Complexity
       Sequential implementation
       Pyramid image representation
       Non maxima suppression

3 Parallel programming in theory and practice
   3.1 GPU-programming
       Memory latency
       Implementation
   3.2 Parallel programming metrics
       Parallel time
       Parallel speed-up
       Parallel efficiency
       Parallel cost
       Parallel work
       Memory transfer vs. kernel execution
       Performance compared to bandwidth
   3.3 Related work

4 Method
   4.1 Initial phase
       Parallelization
       Theoretical evaluation
       Implementation
       Evaluation
   4.2 Alternative methods
       Theoretical method
       One algorithm
   4.3 Conclusions

5 Rectification of images
   Generating test data
   Theoretical parallelization
   Theoretical evaluation
   Implementation
       Initial implementation
       General problems
       Texture memory
       Constant memory
   Results
       Memory transfer
       Kernel execution
       Memory access performance
   Discussion
       Performance
       Memory transfer
       Complexity of the software
       Compatibility and scalability
   Conclusions

6 Pattern recognition
   Sequential implementation
       Generating test data
       Assuring the correctness of results
   Theoretical parallelization
       Pyramid image representation
       Parallelizing using reduction
   Theoretical evaluation
       Searching intuitively in full scale
       Trace searching intuitively
       Search using reduction
       Memory transfer vs. kernel execution
       PMPS
   Implementation
       Implementation of reduction in general
       Reduction for pattern recognition
       Implementation of non maxima suppression
   Results
       Kernel performance
       Performance of algorithm
       PMPS
       Memory access performance and bandwidth
   Discussion
       Intuitive implementation
       Reduction implementation
       CPU
   Conclusions

7 Conclusions
   Overall conclusions
   Recommendation about hardware
   Future
       Architecture
       Implementation
   Evaluation of method
   Work in a broader context

Bibliography


Notation

GPU-architecture

SM — Streaming multiprocessor, a main processor in charge of a number of cores.
Warp — The smallest number of cores doing the same operations, often 32.
Compute capability — A number describing which generation of Nvidia GPU-architecture the GPU is built according to. A higher compute capability supports more features.
Kepler — The Nvidia GPU-architecture used in the master thesis project, with compute capability 3.0 or 3.2.
Fermi — The Nvidia GPU-architecture preceding Kepler, with compute capability 2.x.

CUDA

Kernel — A CUDA function written for a GPU. Each kernel runs a number of parallel threads.
Thread — A single parallel execution of a kernel; threads are grouped into blocks.
Block — A block consists of a number of threads indexed in up to 3 dimensions.
Grid — A grid consists of a number of blocks indexed in up to 3 dimensions.


1 Introduction

1.1 Background

The interest in using GPU:s as general processing units for heavy computations (GPGPU) has increased in the last couple of years. Manufacturers such as Nvidia and AMD make GPU:s powerful enough to outperform CPU:s by an order of magnitude for suitable algorithms.

Embedded GPU:s are small GPU:s built into SoC:s (Systems on Chip). SoC:s are integrated circuits where several processor and function blocks are built into one chip. SoC:s are used in embedded systems such as mobile phones. The interest in using embedded GPU:s as general processing units has not been nearly as high as for regular GPU:s yet. The embedded GPU:s available on the market have often not been able to justify hardware changes from the current systems (CPU:s and FPGA:s) to systems using embedded GPU:s. They have been hard to get, since few models have been available on the market. Their energy consumption has been too high and they have not been suitable for some algorithms. However, the performance of embedded GPU:s improves all the time and it is very likely that their performance will be sufficient in the foreseeable future.

At SICK IVP, advanced computer vision algorithms are accelerated on FPGA:s. Accelerating the algorithms on embedded GPU:s instead might be preferred for several reasons. Apart from possibly being faster, GPU:s are in general also easier to program than FPGA:s, because the programming model of a GPU is much more similar to that of a CPU than the programming model of an FPGA is.

1.2 Purpose and goal

The goal of the master thesis is to analyse how well some of the computer vision algorithms that SICK IVP today runs on FPGA:s would instead suit running on GPU:s. Critical factors in the analysis are theoretical parallelization, memory access pattern, memory choice and how good the performance is in practice. Another goal is to determine whether the embedded GPU:s available today are good enough to be considered in computer vision products. The results from the algorithms relate to this question by answering the following questions:

How well can the algorithm be parallelized?
What is the performance of the implemented algorithms compared to what was theoretically expected?
How device specific are the implementations, i.e. how portable are they?
Is the performance sufficient?
Is the code hard to understand, compared to a CPU implementation and compared to an FPGA implementation?

When the algorithms had been implemented and evaluated, so that the previous questions could be answered, a recommendation about hardware was made for SICK IVP based on the answers.

1.3 Delimitations

To define the project and scale it down to a reasonable size, some delimitations were made. The delimitations concern implementation, hardware, the number of algorithms and how the result of the project should be interpreted. To get a complete picture of the performance of computer vision algorithms on embedded GPU:s, a large number of algorithms would have to be analysed and implemented. In this project only two algorithms were analysed. Only Nvidia GPU:s were used, so that the CUDA programming language could be used. CUDA is a modern GPGPU programming language that is easy to set up and use compared to other GPGPU programming languages. For more information about the hardware choice, see section 1.4. In GPU-programming a concept called multiple streams exists.
Multiple streams are explained in the section on memory latency and are of interest for the two implemented algorithms. However, multiple streams are only discussed theoretically and were not implemented. The recommendation about embedded GPU:s in products, mentioned in section 1.2, is based only on the questions of that section. Other factors that could be of interest for a hardware choice are not considered.

1.4 Hardware

Nvidia Tegra is Nvidia's product series of SoC:s. They are embedded devices with both CPU:s and GPU:s on the same chip. Three different hardware set-ups were used during the project. Most of the development was performed on a desktop computer featuring a GTX 680 GPU.

At the start of the master thesis project there were no devices or test boards on the market that ran embedded GPU:s with a unified shader architecture. In a unified shader architecture all streaming multiprocessors (SM:s) can be used for GPGPU operations, but in a non-unified shader architecture some SM:s are reserved for specific graphics operations. Devices without a unified shader architecture can therefore not be utilized to their full capacity by GPGPU operations. To simulate an embedded device with a unified shader architecture, tests were run on a test board featuring an Nvidia Tegra 3 with a separate Geforce GT 640 GPU. Nvidia calls this combination Kayla [Nvidia, 2013]. The separate GPU is there to simulate future devices with a unified shader architecture.

The Nvidia Tegra K1 has a GPU based on the Kepler architecture, which includes unified shaders. There are some differences between the Kayla platform and the K1, though. A big performance difference is that the Tegra K1 has only one SM, while the separate GPU of Kayla has two SM:s. Another difference is that the K1 has a GPU and a CPU with a shared memory pool. This kind of memory drastically reduces the transfer time between the CPU and GPU. A third important difference is that the memory bandwidth is higher on the GPU of Kayla, making memory accesses faster. At the end of the project all tests were run on a test board called Jetson TK1, featuring a K1 SoC. All GPU:s used in the project are based on the Kepler architecture, which is the architecture of a specific generation of Nvidia GPU:s. Some important specifications of the GTX 680, the GT 640 and the Tegra K1 are listed in table 1.1.
Since accessing the global memory is a typical bottleneck of a GPU kernel, the memory bandwidth is very important. The core speed is important to be able to make computations as fast as possible. For all GPU:s built on the Kepler architecture an SM contains 192 cores, so the number of SM:s determines the total number of cores.

Feature            GTX 680    GT 640    Tegra K1
Memory bandwidth   192 GB/s   29 GB/s   17 GB/s
Core speed         1053 MHz   1033 MHz  950 MHz
Number of SM:s     8          2         1

Table 1.1: Specifications for the devices targeted in the master thesis.
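As a rough illustration of what the bandwidth figures in table 1.1 mean in practice, a lower bound on the time to stream one image over each memory bus can be computed. This is a back-of-the-envelope sketch: the function name and the one-megabyte frame size are illustrative assumptions, and real transfers add latency and overhead on top of this bound.

```python
def transfer_time_ms(n_bytes, bandwidth_gb_s):
    """Lower bound on the time to move n_bytes at the given bandwidth, in ms."""
    return n_bytes / (bandwidth_gb_s * 1e9) * 1e3

frame = 1024 * 1024  # one 8-bit megapixel image
for name, bw in [("GTX 680", 192), ("GT 640", 29), ("Tegra K1", 17)]:
    print(f"{name}: {transfer_time_ms(frame, bw):.4f} ms per full read of the frame")
```

The order-of-magnitude spread between the desktop and embedded devices is visible directly from this arithmetic, before any kernel work is counted.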


2 Sequential Algorithms

Many computer vision algorithms are suited for running on GPU:s, and the algorithms chosen for this master thesis project were:

Rectification of images
Pattern recognition using normalized cross correlation

The purpose of the first algorithm is to extract a geometrical plane from an image with respect to the distortion of the camera lens. It was chosen for the project since it is of low complexity. The second algorithm tries to find an object in an image using the intensity of the pixels. It was chosen since it is a common computer vision algorithm of high complexity and because it is, in contrast to the first algorithm, not intuitively well suited for a GPU.

2.1 Rectification of images

The rectification algorithm applies when a camera is stationed to capture a planar surface. Even though the wanted image is one where the camera is placed over the surface pointing straight down on it, see figure 2.1, it is often not desirable to install the camera that way, e.g. because the camera may cast a shadow on the surface. The camera is therefore often placed at around 45 degrees to the surface, see figure 2.1. The purpose of the algorithm is to extract parts of the image using a given homography and lens distortion parameters. Given the homography between the image plane and the surface, it is possible to transform the image to be placed in the image plane. Let X be a coordinate in the original image, H the homography between the surface and the image plane and Y the coordinate in the transformed image. X and Y are written in homogeneous form.
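The full backward work flow of the algorithm described in this section — homography, radial distortion correction and bilinear interpolation — can be sketched as a CPU reference implementation. This is a hypothetical sketch, not SICK IVP's implementation: the function name, the parameter layout and the choice of measuring the radius in pixels from the image centre are illustrative assumptions.

```python
import math

def rectify(src, H_inv, k, out_w, out_h):
    """Backward-map each output pixel through the inverse homography (eq. 2.1),
    apply radial lens distortion (eq. 2.2) and sample the source image with
    bilinear interpolation."""
    h_src, w_src = len(src), len(src[0])
    cx, cy = (w_src - 1) / 2.0, (h_src - 1) / 2.0
    dst = [[0.0] * out_w for _ in range(out_h)]
    for yo in range(out_h):
        for xo in range(out_w):
            # homography: output pixel -> homogeneous source coordinate
            xh = H_inv[0][0] * xo + H_inv[0][1] * yo + H_inv[0][2]
            yh = H_inv[1][0] * xo + H_inv[1][1] * yo + H_inv[1][2]
            wh = H_inv[2][0] * xo + H_inv[2][1] * yo + H_inv[2][2]
            x, y = xh / wh, yh / wh
            # radial lens distortion, radius taken from the optical centre
            r2 = (x - cx) ** 2 + (y - cy) ** 2
            s = 1.0 + k[0] * r2 + k[1] * r2 ** 2 + k[2] * r2 ** 3
            x, y = cx + (x - cx) * s, cy + (y - cy) * s
            # bilinear interpolation between the four neighbouring source pixels
            x0, y0 = int(math.floor(x)), int(math.floor(y))
            if 0 <= x0 < w_src - 1 and 0 <= y0 < h_src - 1:
                fx, fy = x - x0, y - y0
                dst[yo][xo] = ((1 - fx) * (1 - fy) * src[y0][x0]
                               + fx * (1 - fy) * src[y0][x0 + 1]
                               + (1 - fx) * fy * src[y0 + 1][x0]
                               + fx * fy * src[y0 + 1][x0 + 1])
    return dst
```

On a GPU, the doubly nested pixel loop is what would be replaced by one thread per output pixel, since no output pixel depends on any other.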

Figure 2.1: Camera placed orthogonal to the surface to the left and at approximately 45 degrees angle to the right.

Y \sim HX    (2.1)

This transformation is not sufficient, since the camera uses a lens that has lens distortion. The most significant lens distortion is the radial distortion [Janez Pers, 2002]. Radial distortion can be corrected according to equation 2.2.

x_{corrected} = x(1 + k_1 r^2 + k_2 r^4 + k_3 r^6 + \dots)
y_{corrected} = y(1 + k_1 r^2 + k_2 r^4 + k_3 r^6 + \dots)    (2.2)

where k_n is the n:th radial distortion parameter, r is the radial distance from the optical centre of the picture, x and y are the original coordinates and x_{corrected} and y_{corrected} are the corrected coordinates. If the result is not good enough, it is possible to also add tangential distortion to the model, but for most applications it is sufficient to correct for radial distortion.

The work flow of the algorithm is to go through all pixels in the output image and apply the transformation and lens correction backwards to find the corresponding place in the input image. Interpolation between the neighbouring pixels in the input image is then performed to get subpixel accuracy.

2.2 Pattern recognition

Pattern recognition, also known as template matching [Lewis, 1995], is an algorithm that aims to find occurrences of a known pattern in an image using the pixel values. It searches by moving the pattern through the search image and calculating a match-value pixel by pixel, see figure 2.2. The match-value can be calculated in different ways; in this project normalized cross correlation (NCC) is used.
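The NCC match-value for a single pixel, defined formally in equation 2.3 below, can be written directly as a reference function. The names are illustrative; this is a sketch of the per-pixel computation, not the thesis implementation.

```python
import math

def ncc(search_win, pattern):
    """Normalized cross-correlation between a pattern and the equally
    sized window of the search image around one pixel (eq. 2.3)."""
    n = len(pattern) * len(pattern[0])
    s_mean = sum(map(sum, search_win)) / n
    p_mean = sum(map(sum, pattern)) / n
    num = den_s = den_p = 0.0
    for row_s, row_p in zip(search_win, pattern):
        for s, p in zip(row_s, row_p):
            num += (s - s_mean) * (p - p_mean)   # cross term
            den_s += (s - s_mean) ** 2           # window variance
            den_p += (p - p_mean) ** 2           # pattern variance
    return num / math.sqrt(den_s * den_p)
```

A window that equals the pattern up to brightness and contrast changes scores 1.0, which is the insensitivity to illumination differences discussed below.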

Figure 2.2: Performing NCC on each pixel of the image.

Normalized cross correlation

Let P be a template image with width w and height h. Let S_a be the overlapping part of the search image S when placing P around a certain pixel a. \bar{S}_a and \bar{P} are the mean values of the local search image and the template image. The normalized cross-correlation, C_a, for that pixel is defined in equation 2.3 [Lewis, 1995].

C_a = \frac{\sum_{i=1,j=1}^{w,h} (S_a(i,j) - \bar{S}_a)(P(i,j) - \bar{P})}{\sqrt{\sum_{i=1,j=1}^{w,h} (S_a(i,j) - \bar{S}_a)^2 \sum_{i=1,j=1}^{w,h} (P(i,j) - \bar{P})^2}}    (2.3)

Since NCC takes the mean value and standard deviation of the image into account, it is not as sensitive to illumination differences as regular correlation would be. When NCC is calculated for all coordinates in the image, the result is a new image consisting of NCC-values.

Scaling and rotation

The algorithm can be constructed to be rotation and scale invariant by using transformations. In this project a rotation invariant version was implemented. In the rotation invariant version a number of angles are chosen. For each chosen angle a separate image of NCC-values is calculated to find occurrences of the pattern rotated by that angle. When the NCC is calculated for a specific pixel at a specific angle, every coordinate in the pattern is rotated using a rotation matrix based on that angle, see equation 2.4. Each coordinate is then added to the coordinate of the pixel for which the NCC is calculated. The result of the addition is the coordinate in the search image that should be compared with the original coordinate of the pattern. The rotation often results in floating point

indexes. An interpolation between the closest pixels in the search image around the indexes is therefore required; bilinear interpolation is often chosen for this type of interpolation. To make the pattern rotate around its centre instead of its upper left corner, the coordinates of the pattern are in the ranges [-w/2, w/2] and [-h/2, h/2]. The rotation invariant version searches for matches by rotating the pattern through a chosen number of angles and then calculates one image of NCC-values per angle. The index in the pattern is rotated by the transformation in equation 2.4, where x_{S_a}, y_{S_a} are coordinates in the local search image and x_P, y_P are coordinates in the pattern.

\begin{pmatrix} x_{S_a} \\ y_{S_a} \end{pmatrix} = \begin{pmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{pmatrix} \begin{pmatrix} x_P \\ y_P \end{pmatrix}    (2.4)

Complexity

The rotation variant and scale variant algorithm is of complexity O(w_P h_P w_S h_S) [James Maclean, 2008], where w_P and h_P are the width and height of the pattern and w_S and h_S are the width and height of the search image. Making it scale invariant and rotation invariant increases the complexity to O(r s w_P h_P w_S h_S), where r is the number of rotations and s is the number of scales. These calculations assume that the mean and standard deviation of the images are already known. Calculating the mean of the pattern has complexity O(w_P h_P) and can be disregarded. However, calculating the mean of all local search images is of complexity O(w_P h_P w_S h_S) and can not be disregarded.

Sequential implementation

When running the algorithm on a CPU or GPU it is better to calculate the sums in equation 2.3 and the mean of S_a in the same loop instead of in two separate loops. The mathematical complexity does not differ, but the overhead of running a loop on a computer makes it faster to perform several operations in one loop than to perform one operation in several loops.
By rewriting the sum \sum_{i=1}^{n} (a_i - \bar{a})^2, where a_i are pixel values and \bar{a} is the mean of picture a, it is possible to separate the pixel values from the mean:

\sum_{i=1}^{n} (a_i - \bar{a})^2 = \sum_{i=1}^{n} (a_i^2 - 2 a_i \bar{a} + \bar{a}^2) = \sum_{i=1}^{n} a_i^2 + n\bar{a}^2 - 2\bar{a} \sum_{i=1}^{n} a_i = \sum_{i=1}^{n} a_i^2 + n\bar{a}^2 - 2n\bar{a}^2 = \sum_{i=1}^{n} a_i^2 - n\bar{a}^2

In the same way it is possible to rewrite the sum \sum_{i=1}^{n} (a_i - \bar{a})(b_i - \bar{b}), where b is another picture of the same size, to separate the pixel values from the means.
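The rewrite above (and the analogous one for the cross term) is easy to verify numerically. This small check, with illustrative names, compares the two-loop form against the single-pass form:

```python
def two_loop(a):
    """Sum of squared deviations computed the direct way: one loop for the
    mean, one loop for the squared differences."""
    mean = sum(a) / len(a)
    return sum((x - mean) ** 2 for x in a)

def single_pass(a):
    """The same quantity from one loop: accumulate sum(a_i) and sum(a_i^2),
    then use sum(a_i^2) - n * mean^2."""
    s = s2 = 0.0
    for x in a:
        s += x
        s2 += x * x
    return s2 - s * s / len(a)
```

The single-pass form is what makes it possible to fold the mean computation into the same loop as the correlation sums.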

\sum_{i=1}^{n} (a_i - \bar{a})(b_i - \bar{b}) = \sum_{i=1}^{n} (a_i b_i - a_i \bar{b} - \bar{a} b_i + \bar{a}\bar{b}) = \sum_{i=1}^{n} a_i b_i - n\bar{a}\bar{b} - n\bar{a}\bar{b} + n\bar{a}\bar{b} = \sum_{i=1}^{n} a_i b_i - n\bar{a}\bar{b}

By using the rewritten sums, the mean of the overlapping image S_a can be calculated simultaneously with the other sums, thereby reducing the number of loops from 2 to 1. The mean of the pattern and the square sum of the pattern are calculated offline, since they are the same for every NCC-pixel. Three sums are calculated online for each NCC-pixel:

\sum S_i — to be able to calculate \bar{S}
\sum S_i^2 — for the denominator in the NCC
\sum S_i P_i — to be able to calculate \sum (S_i - \bar{S})(P_i - \bar{P})

When the sums are calculated they are converted to the sums in equation 2.3. The NCC-value is calculated and the value of the pixel in the result image is set. These operations are repeated for all pixels so that the result image is filled with NCC-values.

Pyramid image representation

In this project, full scale pattern recognition is defined as calculating the NCC for all pixels in the search image. Because of the high complexity of the full scale pattern recognition algorithm, an image pyramid representation is often used to reduce the complexity, see [James Maclean, 2008]. The original images are downsampled to create an image pyramid of a desired number of levels. The full scale pattern recognition is only performed on the coarsest level. When matches are found on the coarsest level, the matching coordinates are scaled up to match an image of finer scale. In the larger image the NCC is only calculated for the resulting pixel from the previous image and its neighbouring pixels. If any of the neighbouring pixels has a better correlation, the index is changed to the better match. The search in the finer images is in this report called trace search.
The total number of operations performed when using an image pyramid is significantly lower than when performing a full scale pattern search on the original image. The number of operations O is given in equation 2.5.

O = \frac{r s w_P h_P w_S h_S}{16^k} + \sum_{i=1}^{k} \frac{m w_P h_P}{16^{k-i}}    (2.5)

In equation 2.5, k is the number of times the pattern and search image are downsampled and m is the number of matches that are saved from the full scale search on the coarsest image. The coefficient in the denominator is 16 because when the width and height of both the search image and the pattern are downscaled by 2, the total scale factor is 2^4 = 16. The single term in equation 2.5 is the number of operations in the full scale search on the coarsest image. The sum is the total number of operations for scaling up coordinates to fit finer images and calculating the matches of the neighbourhoods. Using initial matches to go from coarser images to finer images is hereby called trace search.

Non maxima suppression

Non maxima suppression is used to suppress all image values where a neighbouring value is higher than the current value. Applying non maxima suppression to the NCC-images makes regions of high correlation values result in only one high value. Since very few pixels are examined at finer levels, it is very important that only one pixel per correct match is saved. Non maxima suppression, see algorithm 1, is performed on all NCC-images on the coarsest level to get unique results to use in the trace search. A rough description of the complete algorithm can be seen in algorithm 2.

Data: image
for all pixels (x,y) in image do
    if max(neighbouring pixels) > pixel then
        pixel = 0;
    end
end
Algorithm 1: Non maxima suppression of an image. If any of the neighbouring pixels has a higher value than the current one, the current pixel's value is set to zero.

When using a large number of different angles in the search, there is a risk that several angles of the same pixel will produce high NCC-values. A better non maxima suppression would suppress not only in the x- and y-dimensions, but also over the closest different angles.
This kind of suppression was not implemented in the project, mainly because in practice the angles searched for are often so few that only the best angle will produce an NCC-value high enough to be considered a match. A larger area to examine when suppressing could also be used, but suppressing only according to the neighbouring pixels was found sufficient for

the project.

Data: patternpyramid, searchimagepyramid, minsimilarity
Result: bestmatches
for all angles do
    image = performncc(coarsestimages, angle);
    image = nonmaximasuppress(image);
    bestmatches = findbestmatches(image, bestmatches);
end
for all larger images do
    upscaleindexes(bestmatches);
    tracesearch(currentimagesize, bestmatches);
    removebadmatches(bestmatches, minsimilarity);
end
Algorithm 2: Pseudo code for finding a pattern in an image. The images in the image pyramids differ in width and height by a factor 2 between levels. W. James MacLean and John K. Tsotsos proposed a similar algorithm [James Maclean, 2008].
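Algorithm 1 translates directly into code. The following is a reference sketch under the assumption of an 8-neighbourhood, with border pixels compared against their existing neighbours only:

```python
def non_maxima_suppress(img):
    """Zero every pixel that has a strictly greater neighbour (algorithm 1).
    Writing into a copy keeps all comparisons based on the original values."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            neighbours = [img[ny][nx]
                          for ny in range(max(0, y - 1), min(h, y + 2))
                          for nx in range(max(0, x - 1), min(w, x + 2))
                          if (ny, nx) != (y, x)]
            if max(neighbours) > img[y][x]:
                out[y][x] = 0
    return out
```

Because the comparison is strict, plateaus of equal maxima are all kept, matching the pseudo code. Suppressing in place instead of into a copy would make the result depend on the scan order.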


3 Parallel programming in theory and practice

3.1 GPU-programming

When programming a GPU there are a number of features that need to be considered. GPU:s use a SIMD architecture (Single Instruction Multiple Data). SIMD means that several cores, often many, run simultaneously and all perform the same operations; the only difference between them is that they take different data as input. The performance bottleneck when programming GPU:s is often the bandwidth of the different memories, see the section on memory latency below. Applications written in the CUDA programming language manage the SIMD architecture in an efficient way. A grid of 1, 2 or 3 dimensions is used to index the running threads of a function. The grid is divided into blocks that are spread over several streaming multiprocessors (SM:s). An important note is that in the CUDA programming language functions are called kernels; they should not be mistaken for cores or processors.

Memory latency

A typical problem when programming a GPU is that the transfers between different memories become a bottleneck. There are different kinds of memory transfers and they are often slow if they are not chosen and implemented with care. The most important memories in GPU:s are:

Global memory
Constant memory
Texture memory

Shared memory

When computations run on a GPU, the input data first needs to be fetched from the memory of the CPU. This is the slowest type of memory transfer in the work flow of a GPU; it is often one order of magnitude slower than accessing the regular GPU memory, called global memory. When the computations are done, the output data is transferred back to the CPU. That transfer is as slow as the first one. This problem is hard to avoid when the time of the kernel is short compared to the amount of data that needs to be transferred to the GPU.

Multiple streams [Jason Sanders, 2010b] can sometimes reduce the problem. A stream is the flow of transferring data to the GPU, computing a kernel and transferring the resulting data back to the CPU. The idea of multiple streams is that as soon as the data for one kernel has been transferred to the GPU, the transfer of data for the next kernel is started, so that when the first kernel is finished, the data for the second kernel has already been transferred. The second kernel can then start its computations at once, see figure 3.1. The benefit of using multiple streams is greatest when the runtime of the kernel is about as long as the transfer time. If the kernel is shorter than the transfer time, the transfer time cannot be hidden, and if the kernel time is much longer than the transfer time, the gained performance will be negligible. Another thing that often increases the transfer speed is to use pinned memory instead of pageable memory. Pinned memory is, unlike pageable memory, locked to a certain address in CPU memory.

Figure 3.1: Advantage of using multiple streams.

For the Tegra K1, memory latency caused by transferring data from the CPU memory to the GPU memory is not as important as for desktop GPU:s. This is because the GPU and CPU share a unified memory pool that stores data accessed by both the GPU and the CPU, i.e.
no transfer between them is needed [Harris, 2014b]. The shared memory pool currently supports only regular memory; other types of memory, such as texture and constant memory, described later in this section, are not supported.
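The benefit of multiple streams described above can be illustrated with a toy pipeline model. This is a simplification under the stated assumption that copy-in, kernel execution and copy-out can overlap perfectly; the function names and the three-stage model are illustrative, not measurements:

```python
def serial_time(t_in, t_kernel, t_out, n):
    """n jobs in a single stream: every stage waits for the previous job."""
    return n * (t_in + t_kernel + t_out)

def pipelined_time(t_in, t_kernel, t_out, n):
    """n jobs in overlapping streams: after the pipeline has been filled,
    one job completes every max(stage) time units."""
    stages = (t_in, t_kernel, t_out)
    return sum(stages) + (n - 1) * max(stages)
```

With the kernel time close to the transfer time the model shows the transfers being hidden almost completely, matching the observation that streams help most when kernel time and transfer time are of similar size, and give little gain when the kernel dominates.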

Coalescing memory accesses

In regular memory the data is stored in horizontal lines. Since all threads that run on an SM perform the same tasks, they will access the global memory at approximately the same time. If neighbouring threads access neighbouring data in the memory, several threads can get their desired data in the same read from the global memory. In this way, the number of accesses to the global memory can be reduced dramatically, see figure 3.2.

Figure 3.2: Perfect coalescing in the upper image and a bad memory access pattern below.

Shared vs. global memory

In addition to the global memory, each SM has a local memory called shared memory, which is shared by all threads in a block. A typical GPU data transfer bottleneck is accessing the global memory of the GPU from a thread. The shared memory is smaller than the global memory and accessing it is faster. If a kernel makes many accesses to the global memory, the values can be stored in the shared memory to reduce the number of accesses to the global memory [Jason Sanders, 2010c]. If a value is read from memory several times, it is beneficial to read it once from the global memory and the remaining times from the shared memory: when the value has been read from the global memory it is stored in the shared memory, so that future readings can be served from there. On newer GPU:s, based on the Fermi or Kepler architecture, it is not as crucial to use shared memory as on earlier architectures, since Fermi introduced built-in caches for each SM. The cache uses the shared memory to store the values. Consideration of shared memory is still important for maximum performance [Ragnemalm, 2013].

Constant memory

If a value is read by many different threads in a kernel, it is preferable to store it in the constant memory to increase performance. The constant memory has a fast cache accessible from the whole GPU [Jason Sanders, 2010a]. Global memory is only cached per SM, since the cache uses the shared memory. Values that are read from all blocks therefore require less memory bandwidth when placed in the constant memory. It is called constant because it can only be read from the GPU; it is set during a memory transfer from the CPU.

Texture memory

When the access pattern is not horizontal and is hard to predict, it is often good to use the texture memory. Texture memory stores data in a different way than the other types of memories used in CUDA: it stores data in square areas instead of in horizontal rows as the global memory does. It also has a cache that fetches a number of these areas rather than lines. Since the cache stores 2-dimensional data, memory accesses are fast not only for horizontally proximate values but also for vertically proximate values. This is called 2D-locality. There is also built-in interpolation, so that an access with a 2D floating point index only requires one memory access [Wilt, 2012]. When normally stored memory is used, all 4 values neighbouring the index need to be fetched from the memory to perform the interpolation. The difference when using texture memory compared to regular memory is that the interpolation is performed before the transfer with texture memory and after the transfer with regular memory.
Figure 3.3: Upper image shows regular memory storing order and lower image shows texture memory storing order.

Implementation

This section covers what to consider, besides memories, when writing software for GPU:s.

Block size

Choosing a correct block size increases the performance of CUDA kernels. There are several factors to consider when choosing the block size. The first thing to consider is how many SM:s the GPU running the kernel has. The workload should be divided into at least as many blocks as there are SM:s, so that all SM:s will be busy. Another important consideration is that the block size should be a multiple of the warp size. The warp size is the smallest number of threads performing the same operation; the GPU always runs whole warps doing the same thing [Jason Sanders, 2010a]. If a block has a thread count that is not a multiple of the warp size, it will be rounded up to the next multiple of the warp size and the extra resources will be wasted. So if the warp size is 32 and 33 threads are chosen for a kernel, 64 threads will be used and 31 of them will idle. The wasted resources can be calculated according to:

r_wasted = (w − b % w) / (w · ⌈b/w⌉) (3.1)

where r_wasted is the fraction of wasted resources in [0, 1], w is the warp size, b is the block size and % denotes the remainder of an integer division. Note that no resources are wasted if the block size is a multiple of the warp size; the equation is not valid in that case.

Partition work between CPU and GPU

As mentioned in section 1.1, GPU:s outrun CPU:s by one order of magnitude for many suitable algorithms. There are also many algorithms where parts of, or the whole, algorithm run faster on a CPU than on a GPU, especially parts that are not parallelizable at all. Therefore it is important to evaluate which parts of an algorithm might run faster on a CPU. Since the Tegra K1 has a memory pool shared between the CPU and the GPU, the overhead of switching from CPU to GPU is reduced, resulting in more situations where it is favourable to switch between CPU and GPU.

Shuffling

A new feature of the Kepler architecture is that it is possible to share data between different threads in a warp without using shared memory.
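The warp-rounding arithmetic of equation (3.1) can be checked with a small host-side helper (a sketch of mine, not thesis code). It computes the wasted fraction directly from the rounded-up thread count, which also handles the case where the block size is a multiple of the warp size.

```cpp
// Fraction of allocated threads that idle, for block size b and warp size w.
// The scheduler rounds b up to the next multiple of w, so
// wasted = (w*ceil(b/w) - b) / (w*ceil(b/w)), matching equation (3.1)
// when b is not a multiple of w, and giving 0 when it is.
double wastedResources(int w, int b) {
    int allocated = ((b + w - 1) / w) * w;  // b rounded up to a warp multiple
    return static_cast<double>(allocated - b) / static_cast<double>(allocated);
}
```

For example, wastedResources(32, 33) reproduces the 31/64 from the 33-thread example above.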
When a variable is read using shuffle, all threads read the value of the variable in a neighbouring thread, one or several steps away, instead of in the local thread. A shuffle of one step makes thread 1 read the variable of thread 0, etc. This way of reading data is even faster than using shared memory, since only one read operation is required, whereas shared memory needs a write, a synchronization and a read. Another benefit of shuffling compared to shared memory is that the shared memory is small: the joint size of all the registers of the threads is bigger than the shared memory [Goss, 2013].

Grid stride loops

In GPU computing the number of threads is often adapted to the number of elements in the processed array. This is not convenient for all algorithms; e.g. if all elements in an array of 33 elements are multiplied by 2, the number of threads should intuitively be 33. But since the warp size of the GPU is often 32, 33 threads will make 31 cores of the GPU idle, see the section on block size above. It is common that a specific number of threads results in a simpler implementation and a higher performance. Grid stride loops can then be used [Harris, 2013] to avoid adapting the number of threads to the array size. The purpose of a grid stride loop is to be able to read a larger number of elements into a fixed, lower number of threads in a coalesced way. In each thread the reading of values is performed in a loop. The first thread is assigned to read the first element in the memory, the next thread is assigned to read the second element, etc. When there are no threads left there will still be elements left to read from the memory. The first thread is then assigned to the first unassigned element, the second thread to the second unassigned element, etc. This assignment continues until all elements are assigned and read. Technically this is done according to algorithm 3.

Data: Array, N
sum = 0;
for i = threadId; i < N; i += threadWidth do
    sum += array[i];
end
Algorithm 3: Grid stride loop performed in a thread. threadWidth is the number of threads and N the number of elements in the array.

3.2 Parallel programming metrics

When comparing the performance of sequential algorithms, time complexity is often used. For parallel algorithms there are other metrics that also show how well parallelized the algorithm is. In this section the metrics that are used for analysing the algorithms are presented. Note that the unit of both time and operations is clock cycles, making some calculations a bit confusing.

Parallel time

Parallel time T_p is the time it takes for a parallel implementation to run on p processors. T_p is measured in clock cycles.

Parallel speed-up

The parallel speed-up S_p measures how much faster the parallel implementation is compared to the sequential implementation.
S_p = T / T_p (3.2)

where T is the time of running the sequential implementation. The parallel speed-up has no unit but is in the range [1, p].

Parallel Efficiency

Parallel efficiency E_p measures how well an implementation scales, independent of the value of p:

E_p = S_p / p (3.3)

where the optimal scaling of an algorithm is 1. E_p is S_p normalized by the number of processors.

Parallel Cost

Parallel cost measures whether resources are wasted when running a parallel algorithm:

C_p = p · T_p (3.4)

Consider the total number of clock cycles passed on a system using p processors during a parallel time T_p, i.e. p · T_p. If the passed clock cycles are more than the total number of clock cycles passed when running the algorithm sequentially on one processor, the parallel implementation is wasting resources. An algorithm is therefore cost optimal if C_p = T.

Parallel work

The work W is the total number of operations that are performed on all processors. If more operations are performed than by one processor running the sequential algorithm, the parallel algorithm is doing more work than the sequential algorithm. An algorithm is work optimal if the number of operations performed by the parallel algorithm equals the operations performed by the sequential algorithm, W = T.

Memory transfer vs. Kernel execution

A crucial part of running computations on a GPU is transferring data between the CPU and the GPU. Sending the input data to the GPU before the computations, and the result back to the CPU after the computations, is time consuming. An analysis of an algorithm must consider the transfer time of the data. Dividing the size of the data by the transfer speed of the device gives the transfer time. An interesting metric is the kernel execution time compared to the memory transfer time.

Performance compared to bandwidth

The performance of a kernel can be evaluated by comparing its average memory bandwidth to the memory bandwidth of the GPU. The memory bandwidth can be estimated by running a kernel that only copies the values from one array to another.
The quotient between the average memory access speed and the memory bandwidth is in this report called the memory access performance. By dividing the size of the copied array by the running time of the kernel, the average memory access speed for the kernel can be calculated:

v_m = w · h · s · n / t_k (3.5)

where v_m is the memory access speed, w and h are the width and height of the image, s is the size of one pixel in the image, n is the number of times each value is read from or written to the memory and t_k is the measured time of the kernel. A kernel with an optimal access pattern has a speed very close to that of the copy kernel.

3.3 Related Work

Egil Fykse wrote a thesis [Fykse, 2013] comparing the performance of computer vision algorithms running on GPU:s in embedded systems and on FPGA:s. His benchmark algorithms were similar to the algorithms used in this thesis, although his focus lay on implementing FPGA versions of the algorithms. The hardware used for his GPU implementations is not an embedded GPU, but the predecessor of the Kayla platform, CARMA. Fykse's conclusion is that his results are slightly faster for the GPU but that the FPGA is more energy efficient. The Tegra K1 should be much more power efficient than CARMA, since CARMA features a desktop GPU, but this project does not examine the power usage of any devices.
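Equation (3.5) from the previous section is simple enough to write down directly. The helper below is my own sketch; the kernel time in the example is an illustrative value, not a measurement from the thesis.

```cpp
// Average memory access speed of a kernel, equation (3.5):
// an image of width*height pixels of s bytes, each read or written n times,
// processed in t seconds, gives v_m = w*h*s*n / t bytes per second.
double memoryAccessSpeed(int width, int height, int bytesPerPixel,
                         int accessesPerPixel, double seconds) {
    return static_cast<double>(width) * height * bytesPerPixel *
           accessesPerPixel / seconds;
}
```

For example, a 1024x1024 image of 4-byte pixels with n = 2 processed in 1 ms corresponds to roughly 8.4 GB/s.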

4 Method

The project was performed according to a method that analysed the algorithms in steps, which are described in this chapter. The parallelization, the theoretical evaluation and the implementation were executed iteratively to be able to test new ideas. The steps were:

Initial phase
Parallelization
Theoretical evaluation
Implementation
Evaluation

4.1 Initial phase

In the initial phase the sequential version of the algorithm was analysed theoretically, by calculating its complexity, and implemented. Artificial test data was generated using Matlab. The test data was in general as simple as possible. The purpose of the project was not to test the accuracy of the algorithms but to optimize the already known algorithms. Imprecise test data could result in problems where it would be hard to know whether undesired results were caused by the accuracy of the algorithm or by bugs in the implementation.

4.2 Parallelization

The parallelization was about making parallel versions of the algorithm and determining which of the versions should be implemented and further analysed. The list below describes the premises on which the implementations were optimized:

Different algorithm variants
Memory choice
Memory access pattern
Partitioning between CPU and GPU
Shuffling
Grid stride loops

For a description of the items in the list, see section 3.1.

4.3 Theoretical evaluation

A theoretical evaluation is a good way of determining how parallelizable an algorithm is. The parallel performance metrics presented in section 3.2 are used for the theoretical evaluation. Since not all algorithms perform well on GPU:s, the result of the theoretical evaluation may differ from the results of later steps of the method.

4.4 Implementation

The implementation was about implementing the different versions of the algorithm proposed in the parallelization phase as efficiently as possible. Measuring performance was an important part of the implementation phase. Profiling tools make it possible to show the performance of the different parts of the running kernels. For this project the Nvidia Visual Profiler was used.

4.5 Evaluation

In the evaluation, the results from the theoretical evaluation and the implementation were compared to reach a conclusion supported in several ways. There were important questions that needed to be answered to be able to draw a conclusion about embedded GPU:s after the two algorithms were analysed:

Was the performance as expected?
Is the performance sufficient?
How advanced is the code compared to code for a CPU or FPGA?

How portable is the code?
Is further optimization possible? What are the bottlenecks?

When all the algorithms were analysed, the possible conclusions about the performance of embedded GPU:s in general, and about the algorithms, were made.

4.6 Alternative methods

The method described above is a combination of practical and theoretical work. Algorithms are analysed theoretically, implemented and evaluated using the results. Other approaches could either be more theoretical or place more focus on one single algorithm.

Theoretical method

A theoretical method would analyse the algorithms only theoretically. With this method the analysis of an algorithm would take less time, so the project could cover more algorithms. The benefit of more algorithms is that it would give a broader picture of how well computer vision algorithms are suited for embedded GPU:s. However, implementations often reveal problems that are easily missed in a theoretical evaluation. An implementation is a certification that something actually works, and an evaluation of how well it works.

One algorithm

Another type of method could spend more time implementing and optimizing one single algorithm. Even better results could be achieved for the chosen algorithm by spending more time on it. However, many of the optimizations regarding one algorithm are specific to that algorithm and do not say much about the performance of embedded GPU:s in general. This method would not give a good picture of computer vision on embedded GPU:s in general.

Conclusions

Given the projected outcomes of the alternative methods described above, the originally proposed method was used.


5 Rectification of images

5.1 Generating test data

Synthetic test data was generated in Matlab. The first step was to create two images, where one had a rectangle located in the image plane, see figure 5.1. The other image contained a geometrical object simulating a rectangle seen from another view, see figure 5.2. By using the edges of the rectangles and equation 2.1, a homography between the rectangles could be calculated. The homography parameters were stored to be used as input to the program running the algorithm.

Figure 5.1: Rectangle in image plane.
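The homography computation from the rectangle corners can be sketched as a standard four-point solve. This is my reconstruction of one common approach, not the thesis's Matlab code: with h33 fixed to 1, each of the four corner correspondences gives two linear equations in the remaining eight homography entries, and the resulting 8x8 system is solved directly.

```cpp
#include <cmath>

// Solve the 8x8 system A*h = b (b stored in column 8 of A) with Gaussian
// elimination and partial pivoting; h receives the solution.
static void solve8(double A[8][9], double h[8]) {
    for (int col = 0; col < 8; ++col) {
        int pivot = col;
        for (int row = col + 1; row < 8; ++row)
            if (std::fabs(A[row][col]) > std::fabs(A[pivot][col])) pivot = row;
        for (int k = 0; k < 9; ++k) {
            double t = A[col][k]; A[col][k] = A[pivot][k]; A[pivot][k] = t;
        }
        for (int row = col + 1; row < 8; ++row) {
            double f = A[row][col] / A[col][col];
            for (int k = col; k < 9; ++k) A[row][k] -= f * A[col][k];
        }
    }
    for (int row = 7; row >= 0; --row) {
        double s = A[row][8];
        for (int k = row + 1; k < 8; ++k) s -= A[row][k] * h[k];
        h[row] = s / A[row][row];
    }
}

// Estimate the 3x3 homography H (row-major, H[8] fixed to 1) mapping the
// four source corners (x, y) to the four destination corners (u, v).
void homographyFromCorners(const double x[4], const double y[4],
                           const double u[4], const double v[4], double H[9]) {
    double A[8][9] = {};
    for (int i = 0; i < 4; ++i) {
        // u = (h0 x + h1 y + h2) / (h6 x + h7 y + 1)
        A[2*i][0] = x[i]; A[2*i][1] = y[i]; A[2*i][2] = 1.0;
        A[2*i][6] = -u[i] * x[i]; A[2*i][7] = -u[i] * y[i]; A[2*i][8] = u[i];
        // v = (h3 x + h4 y + h5) / (h6 x + h7 y + 1)
        A[2*i+1][3] = x[i]; A[2*i+1][4] = y[i]; A[2*i+1][5] = 1.0;
        A[2*i+1][6] = -v[i] * x[i]; A[2*i+1][7] = -v[i] * y[i]; A[2*i+1][8] = v[i];
    }
    solve8(A, H);
    H[8] = 1.0;
}
```

For a pure translation of the unit square, the recovered H is the corresponding translation matrix.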

Figure 5.2: Rectangle from another view.

Lens distortion was also simulated. An image distorted by specific lens distortion parameters cannot be calculated analytically, since equation 2.2 has no closed form solution for obtaining x from x_c. An iterative numerical solution according to Newton's method, equation 5.1, was implemented in Matlab to simulate image distortion. The pseudo code for generating the lens distortion is displayed in algorithm 4. Only radial distortion was simulated.

x_{n+1} = x_n − f(x_n) / f′(x_n) (5.1)

Data: image, maxdiff, param
Result: Distorted image
for all pixels (x, y) in image do
    Convert x and y to normalized coordinates;
    x_i, y_i = x, y;
    while |correct(x_i, param) − x| + |correct(y_i, param) − y| > maxdiff do
        x_i = x_i − (correct(x_i, param) − x) / correct′(x_i, param);
        y_i = y_i − (correct(y_i, param) − y) / correct′(y_i, param);
    end
    Convert x_i and y_i to pixel range;
    outimage(x, y) = interpolation(image(x_i, y_i));
end
Algorithm 4: Pseudo code for distorting an image. correct is the lens correction formula. The interpolation is bilinear, maxdiff is the maximum tolerated error and param contains the distortion parameters.

The test input image was achieved by changing the perspective of the first image, the one with a rectangle in the image plane, and then applying lens distortion to that image, see figure 5.3. To get a reference result for the GPU rectifications, the sequential rectification algorithm was applied to the test input image, see figure 5.4. Note that some parts of the original image are missing. This is not an error, but due to the fact that some parts of the original image do not fit into the input image in figure 5.3.

The parameters of the homography and the lens distortion affect the performance of the algorithm. If the homography makes the algorithm fetch values from a smaller rectangle there will be fewer cache misses, resulting in higher performance. The reason is that a smaller rectangle has fewer pixels, so a larger percentage of the pixels can be kept in the cache. However, when the algorithm is used in reality the rectangle will always be as large as possible while still fitting on the sensor. Therefore the input data is also constructed this way. If the lens distortion parameters are smaller the access pattern will be more linear, which also results in fewer cache misses. Reasonable sizes were therefore chosen for the lens distortion parameters.

The lens distortion assumes an image with normalized coordinates in [−1, 1]. It is therefore important to transform the pixel values into normalized coordinates to get a correct result in terms of lens correction. The lens distortion parameters, described in equation 2.2, used in this project are given in equation 5.2; note that k_3 was not used.

k_1 = 0.04, k_2 = (5.2)

Figure 5.3: Input image for tests.
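The Newton iteration of algorithm 4 reduces, for pure radial distortion, to inverting a scalar function of the radius. The sketch below is my own CPU version; the k1 value used in the test is the one from equation 5.2, while the k2 value is not legible in the source, so an arbitrary small value is substituted.

```cpp
#include <cmath>

// Forward radial lens correction of equation 2.2 (k3 omitted, as in the
// thesis): a distorted radius r maps to r_c = r * (1 + k1 r^2 + k2 r^4).
double correctRadius(double r, double k1, double k2) {
    double r2 = r * r;
    return r * (1.0 + k1 * r2 + k2 * r2 * r2);
}

// Invert the correction with Newton's method (equation 5.1), as algorithm 4
// does per pixel: find r such that correctRadius(r) == rc, using
// f(r) = r(1 + k1 r^2 + k2 r^4) - rc and f'(r) = 1 + 3 k1 r^2 + 5 k2 r^4.
double distortRadius(double rc, double k1, double k2) {
    double r = rc;  // the corrected radius is a good initial guess
    for (int i = 0; i < 50; ++i) {
        double r2 = r * r;
        double f = r * (1.0 + k1 * r2 + k2 * r2 * r2) - rc;
        if (std::fabs(f) < 1e-12) break;
        double df = 1.0 + 3.0 * k1 * r2 + 5.0 * k2 * r2 * r2;
        r -= f / df;
    }
    return r;
}
```

A round trip correctRadius followed by distortRadius recovers the original radius to high precision, which is exactly the property algorithm 4 relies on.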

Figure 5.4: Reference result for tests.

5.2 Theoretical parallelization

The rectification algorithm is very suitable for parallelization, since the lens correction and the homography transformation can be performed independently for each pixel. The pseudo code for the parallelized algorithm, for n pixels on n processors, is shown in algorithm 5.

Data: image, H
Result: rectified image
for all pixels (x, y) in parallel do
    r² = x² + y²;
    x_c = x(1 + k_1 r² + k_2 r⁴);
    y_c = y(1 + k_1 r² + k_2 r⁴);
    (x_h, y_h, 1)^T ~ H(x_c, y_c, 1)^T;
    outimage(x, y) = interpolation(image(x_h, y_h));
end
Algorithm 5: Pseudo code for parallel rectification. Normalized coordinates are assumed.

The interpolation used in the master thesis project is bilinear, and it is needed since (x_h, y_h) typically does not have integer components.
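The body of algorithm 5 can be sketched as plain per-pixel functions (my own CPU version of what one GPU thread would do; names are mine). The coordinate mapping and the bilinear sampling are kept separate, since on the GPU the latter can be delegated to the texture unit.

```cpp
#include <cmath>

// One thread's work in algorithm 5: apply the radial lens correction to a
// normalized coordinate (x, y), then the homography H (row-major 3x3).
void rectifyCoord(double x, double y, double k1, double k2,
                  const double H[9], double* xh, double* yh) {
    double r2 = x * x + y * y;
    double s = 1.0 + k1 * r2 + k2 * r2 * r2;
    double xc = x * s, yc = y * s;
    double w = H[6] * xc + H[7] * yc + H[8];
    *xh = (H[0] * xc + H[1] * yc + H[2]) / w;
    *yh = (H[3] * xc + H[4] * yc + H[5]) / w;
}

// Bilinear interpolation at a non-integer position; the caller must keep
// (x, y) inside [0, width-2] x [0, height-2].
double bilinear(const float* img, int width, double x, double y) {
    int x0 = static_cast<int>(std::floor(x));
    int y0 = static_cast<int>(std::floor(y));
    double fx = x - x0, fy = y - y0;
    double a = img[y0 * width + x0],       b = img[y0 * width + x0 + 1];
    double c = img[(y0 + 1) * width + x0], d = img[(y0 + 1) * width + x0 + 1];
    return (1 - fy) * ((1 - fx) * a + fx * b) + fy * ((1 - fx) * c + fx * d);
}
```

With the texture memory discussed in chapter 3, the four reads inside bilinear collapse into a single hardware-interpolated fetch, which is one reason the texture path wins later in this chapter.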

5.3 Theoretical evaluation

In this section the algorithm is theoretically evaluated according to section 3.2. As section 5.2 states, the algorithm is very suitable for parallelization. The parallel time T_p for p processors and n pixels is of order n/p. The parallel speed-up S_p increases proportionally when p increases. This gives a parallel efficiency E_p of 1. The parallel cost C_p is of order p · n/p = n. Since the sequential time is of order n, the algorithm is cost optimal. The parallel work for p processors is also of order n, so the algorithm is also work optimal, see the definitions in section 3.2.

The slow part of a kernel is the global memory accesses. In this kernel there will be at most 5 global memory accesses per thread: 4 accesses for fetching the neighbouring pixel values to interpolate between and one access for writing the result to the global memory. But since the GPU uses the shared memory as a cache, see section 3.1, there will most likely be fewer global memory accesses, depending on the access pattern. The memory bandwidth between the CPU and GPU can be measured, but it is in general at least 10 times lower than the global memory bandwidth; the factor is 24 for the GTX 680 GPU. Even if all global memory accesses were cache misses, the kernel would still be a lot faster than the memory transfer. Equation 5.3 aims to illustrate that the kernel will be faster. Define M_DtH as the memory transfer latency from device to host and M_HtD as the latency in the opposite direction:

M_DtH + M_HtD >> 5 · GlobMemAccess · Pixels (5.3)

For the GTX 680 GPU the memory transfer should be around 48/5 times slower than the kernel. Since the kernel is so much faster than the memory transfer, multiple streams would not increase the performance in any substantial way. Multiple streams are described in section 3.1.

5.4 Implementation

The implementation was done in steps, to be able to determine how much each step affected the performance.

Initial implementation

The first implementation of the rectification was simple and intuitive. Global memory was used for all memory accesses. As mentioned in section 5.2, the rectification algorithm is easy to parallelize. For an Nvidia GPU of the Fermi generation or newer, the naive implementation is quite good since the shared memory is used as a cache. But for an older GPU without a cache, the implementation would be slow.

Figure 5.5: Access pattern on input image in rectification.

General problems

There are two main problems regarding kernel speed when implementing a GPU kernel for the rectification algorithm. The first problem is that the homography part of the algorithm may make the access pattern in the image non-horizontal, see figure 5.5. The reason that the access pattern can become non-horizontal is that it is hard to install a real camera perfectly straight with respect to the observed plane. If the camera leans slightly to the right or left, the access pattern will not be horizontal. Since the image is stored as one array, with each row placed after the previous row, a non-horizontal access pattern will put the memory accesses of two neighbouring threads on different rows in the memory, i.e. the access pattern will not be coalesced.

The second problem is that the lens correction makes the access pattern nonlinear. Instead of being aligned, the access pattern will be concave. The larger the distortion parameters are, the more concave the access pattern will be, see equation 2.2. According to equation 2.2 the access pattern will be very dense in the middle of the image and sparser further out from the middle. In the sparse areas the memory accesses will be far away from each other. It is not intuitive how to use the shared memory in an efficient way for that access pattern. In this project the problem was solved by disregarding manual use of the shared memory and instead using it as a cache.

Texture memory

When the access pattern is irregular, the performance is often increased by using texture memory instead of global memory. Interpolations are also performed very fast using the texture memory, see section 3.1. The access pattern of the rectification algorithm fits well into that description, and the performance was clearly increased by loading the input image into the texture memory instead of the global memory.

Constant memory

The input parameters are the same for every pixel in the image and they are read once by every thread. The performance of the implementation is drastically increased when reading them from constant memory compared to reading them from global memory, see section 3.1.

5.5 Results

The resulting images from running the rectification on a GPU were very similar to those from running it in Matlab. A slight difference occurred near all edges on the chess board, since Matlab uses 64-bit precision for its floating point values while 32-bit precision was used in CUDA, resulting in slightly worse interpolation; see the difference between the resulting images from Matlab and CUDA in figure 5.6. The slightly bent white line in the image occurs because of the index difference in Matlab compared to most programming languages: indexing starts at 1 and not 0.

Figure 5.6: Absolute difference between Matlab and GPU results, values in range [0, 1].

The results of the different steps of the optimization are all presented below, to be able to evaluate them. All results are averages over 5 runs.

Memory transfer

As explained in section 3.2, transferring data from the CPU to the GPU is often a bottleneck when running smaller kernels. The differences between using pageable and pinned memory, see section 3.1.1, are displayed in table 5.1. The time of the memory transfer does not affect the kernel time.

Table 5.1: Transferring an image of 1024x1024 pixels of 32-bit floating points between CPU and GPU using pageable vs pinned memory (µs).
Task GTX 680 GT 640
Pageable CPU -> GPU
Pageable GPU -> CPU
Pinned CPU -> GPU
Pinned GPU -> CPU

On the Tegra K1 no regular memory needs to be transferred between the CPU and the GPU because of the unified memory pool. Data residing in the texture memory needs to be transferred though. The transfer of a 1024x1024 image of 32-bit floating points to the texture memory takes about 1.1 ms.

Kernel execution

The results of running a rectification implementation on a 1024x1024 pixel image using a naive approach (only global memory), a constant memory approach, and a texture memory and constant memory approach on the GTX 680 and GT 640 are displayed in table 5.2.

Table 5.2: Performance of the different optimization steps (µs). The implementation using texture memory also uses constant memory.
Task GTX 680 GT 640
Naive
Constant memory
Texture memory

On the Tegra K1 it is not as obvious which way of optimizing the kernel is best. The texture memory cannot be used in the unified memory pool. This means that if the texture memory is used, more transfers between GPU and CPU are needed. If the increased performance in the kernel is smaller than the time lost in memory transfers, it is not beneficial to use the texture memory. The results of running the algorithm on the K1 using texture memory and unified memory are shown in table 5.3.

Table 5.3: Performance of using texture memory and global unified memory on K1 (ms).
Task Tegra K1
Texture memory 1.8
Global unified 4.1

Since the memory transfer time of the texture memory is 1.1 ms and the time saved in the kernel by using the texture memory is 2.3 ms, it is preferable to use the texture memory. Since the memory transfer time is shorter than the kernel time, the latency can theoretically also be hidden by using multiple streams. The resulting images would then be received with a constant delay of 1.1 ms, but new results would be received every 1.8 ms.

Table 5.2 shows that choosing texture memory instead of global memory is preferable for a desktop GPU running the rectification algorithm. The data set used for that test is an optimal data set for the global memory. In table 5.4 the plane that is to be extracted from the input image is rotated 90 degrees relative to the input image, making the access pattern in the input image very bad for the global memory, as discussed in the section on general problems above. The tests are run with a 1024x1024 image size. The results show that for this kind of data set the advantage of the texture memory is even larger than for the previous data.

Table 5.4: Results for a hard data set using texture and global memory (µs).
Task GTX 680 GT 640
Kernel using texture memory
Kernel using global memory

Memory access performance

The performance of a kernel can be evaluated by comparing its average memory access speed to the memory access speed of a copy kernel, see section 3.2. The memory access performance was quite good for the rectification algorithm, but it differed between the GTX 680 and the other two GPUs. The size of the input data was 1024x1024 and the size of each element was 4 bytes (32-bit floating points). In the kernel code there is one read and one write, making n = 2. The memory access speed on the Kayla platform is then:

1024 · 1024 · 4 · 2 / t_k GB/s (5.4)

The memory access speed of a copy kernel on Kayla was 27 GB/s, making the memory access performance 0.4. The memory access speed of the Tegra K1 was:

1024 · 1024 · 4 · 2 / t_k GB/s (5.5)

The memory bandwidth of the device was 11.7 GB/s, making the memory access performance 0.4. The memory access speed of the GTX 680 was:

1024 · 1024 · 4 · 2 / t_k GB/s (5.6)

The memory bandwidth of the desktop GPU was 147 GB/s.

5.6 Discussion

As section 5.3 states, the algorithm is very suitable for parallelization. The practical results also show that its performance when running on an actual GPU is good. As mentioned in section 3.1, the bottleneck when running algorithms on GPU:s is often the number of accesses to the global memory. When using texture memory the rectification algorithm only performs 2 accesses per thread, apart from the input parameters, which are all read once by every thread.

As the results show, the most important differences in performance for this algorithm depend on how the different memories are used. Using the constant memory for the parameters of the lens distortion and the homography is crucial for a good result. It is also important to use texture memory, depending on the camera installation, see table 5.4. The texture memory will make the installation of the camera much easier, since a non-horizontal access pattern will not decrease the performance.

Performance

The memory access performance of the algorithm is not very close to optimal. The lens correction part of the algorithm makes it almost impossible to avoid cache misses. It is possible that manual usage of the shared memory could make the memory access performance even higher, but manual shared memory has not been used in this project. Something that is interesting for the project is to use the performance to calculate how much of the GPU computing time is used by the rectification.
The performance goal of the rectification is that other algorithms should be able to run simultaneously on the GPU while keeping their performance. Given that 25 frames per second (fps) are needed for the other algorithm and that the rectification kernel takes 1.8 ms on the K1, the share of GPU time used by the rectification is approximately 1.8 ms · 25 ≈ 0.05, or 5 %. The performance goal is therefore considered fulfilled.

Memory transfer

For a classic computer architecture with separate memories for the GPU and the CPU, the kernel is very fast, but the transfer of data from the CPU to the GPU is slow in comparison. This memory latency cannot be hidden by using multiple streams, i.e. no matter how fast the kernel is, the number of kernels that can be run per second is restricted by the memory transfer time. The conclusion is that for a classic computer architecture it is better to include the rectification part in another algorithm than to use it separately, since the memory transfer from CPU to GPU can then be avoided and the memory latency can be hidden by using multiple streams.

The unified memory on the Tegra K1 results in several benefits. The obvious one is that slow CPU-GPU memory transfers are removed, so there will be less memory latency for the kernels. Another benefit is that the host code will be easier to read and understand, because less code will concern memory transfers.

Complexity of the software

The software written for the rectification algorithm is short and easy to read, although it is harder to read than software doing the same thing on a CPU. The main difference in readability is that management of threads using the combination of grids, blocks and threads is more complex than management of threads in CPU code, where one-dimensional indexes are used.

Compatibility and Scalability

One important task of this project was to determine the portability of software written for a specific GPU. For the rectification algorithm the code does not need to be changed depending on which GPU is used. Both the desktop GPU (Geforce GTX 680) and the Kayla platform (Geforce GT 640) run the same software. They also perform best for approximately the same block size.
The code must be changed a bit to use the unified memory on the K1 though. But the code that uses unified memory is shorter and easier to understand because of the absence of memory transfers.

5.7 Conclusions

The rectification algorithm works well on a GPU, and especially on an embedded GPU. The reason why it works better on an embedded GPU is the avoidance of memory transfer latency when using the shared memory pool. The performance is very high, giving the GPU a chance to perform other tasks along with the rectification.

The algorithm is completely parallelizable, which makes it computationally light for a GPU. The fact that the memory access pattern is not coalesced slows it down though. The software is easy to understand and is compatible with different GPU:s featuring the Kepler architecture. However, to get maximum performance a suitable image size should be selected: the number of pixels should be a multiple of the warp size, see the section on block size in chapter 3.

An obvious benefit of using embedded GPU:s compared to other hardware is that they make the installation easier for customers, since perfect alignment of the camera is not necessary to keep the performance up, see figure 5.4. When the direction of the camera lens gets more horizontal, the resulting image from the rectification gets blurrier though. Solutions similar to the texture memory could be implemented on other hardware, but with a high developer effort.
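As a closing sanity check on the 5 % GPU-load figure from the discussion, the share of GPU time consumed by a kernel is simply its duration times the launch rate (values below are the thesis's 1.8 ms kernel at 25 fps):

```cpp
// Fraction of GPU time consumed by a kernel of duration kernelSeconds,
// launched fps times per second.
double gpuTimeShare(double kernelSeconds, double fps) {
    return kernelSeconds * fps;
}
```

gpuTimeShare(1.8e-3, 25.0) gives about 0.045, which the thesis rounds to 0.05 (5 %).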

6 Pattern Recognition

6.1 Sequential Implementation

Before any parallel pattern recognition algorithm was implemented, a sequential implementation was made. The purpose of the sequential implementation was to get a deeper understanding of the algorithm. Since it is very hard to debug parallel code, and especially GPU code, it is very convenient to rely on a verified CPU implementation when making a GPU implementation. The performance of the sequential CPU implementation should not be compared to the GPU implementations; such a comparison would not be fair, since no greater effort was made to optimize the performance of the sequential algorithm. The sequential implementation was done according to the algorithm's pseudo code.

6.2 Generating test data

The test data for the pattern recognition algorithm was mainly synthetic. To ensure that the images were not too noisy to get a good result, synthetic search images were made by pasting rotated pattern images, see figure 6.1, into a larger image, see figure 6.2. However, the fact that the synthetic data was noise free was not exploited to make a faster implementation that would fail for a noisy data set.

Figure 6.1: Example of pattern image.

Figure 6.2: Example of search image.

6.3 Assuring the correctness of results

The correctness of an implementation was assured by checking the result for a small data set where the correct result was trivial. The search image of the trivial test data was:
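A correctness check of this kind, comparing an implementation's output on trivial data against a trusted reference result, can be automated with a small comparison helper. The helper below is a hypothetical sketch, not from the thesis; the tolerance parameter reflects that GPU floating-point results are generally not bit-identical to CPU results.

```python
import numpy as np

def results_match(reference, candidate, tol=1e-3):
    """Return True if a candidate implementation's output matches a
    trusted reference result, elementwise, within a small tolerance."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        return False
    return bool(np.max(np.abs(reference - candidate)) <= tol)
```

Running the verified CPU implementation and the GPU implementation on the same trivial data set and asserting results_match on their outputs turns the manual check into a repeatable test.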


Läs mer

Senaste trenderna från testforskningen: Passar de industrin? Robert Feldt,

Senaste trenderna från testforskningen: Passar de industrin? Robert Feldt, Senaste trenderna från testforskningen: Passar de industrin? Robert Feldt, robert.feldt@bth.se Vad är på gång i forskningen? (ICST 2015 & 2016) Security testing Mutation testing GUI testing Model-based

Läs mer

ALGEBRA I SEMESTER 1 EXAM ITEM SPECIFICATION SHEET & KEY

ALGEBRA I SEMESTER 1 EXAM ITEM SPECIFICATION SHEET & KEY ALGEBRA I SEMESTER EXAM ITEM SPECIFICATION SHEET & KEY Constructed Response # Objective Syllabus Objective NV State Standard Identify and apply real number properties using variables, including distributive

Läs mer

S 1 11, S 2 9 and S 1 + 2S 2 32 E S 1 11, S 2 9 and 33 S 1 + 2S 2 41 D S 1 11, S 2 9 and 42 S 1 + 2S 2 51 C 52 S 1 + 2S 2 60 B 61 S 1 + 2S 2 A

S 1 11, S 2 9 and S 1 + 2S 2 32 E S 1 11, S 2 9 and 33 S 1 + 2S 2 41 D S 1 11, S 2 9 and 42 S 1 + 2S 2 51 C 52 S 1 + 2S 2 60 B 61 S 1 + 2S 2 A MÄLARDALEN UNIVERSITY School of Education, Culture and Communication Department of Applied Mathematics Examiner: Lars-Göran Larsson EXAMINATION IN MATHEMATICS MAA151 Single Variable Calculus, TEN Date:

Läs mer

EVALUATION OF ADVANCED BIOSTATISTICS COURSE, part I

EVALUATION OF ADVANCED BIOSTATISTICS COURSE, part I UMEÅ UNIVERSITY Faculty of Medicine Spring 2012 EVALUATION OF ADVANCED BIOSTATISTICS COURSE, part I 1) Name of the course: Logistic regression 2) What is your postgraduate subject? Tidig reumatoid artrit

Läs mer