Video coding using compressed transportation plans

Master's thesis (examensarbete) in image coding carried out at Linköping Institute of Technology by Johan Lissing

LITH-ISY-EX--07/3918--SE

Supervisor: Niclas Wadströmer, ISY, Linköpings universitet
Examiner: Niclas Wadströmer, ISY, Linköpings universitet

Department of Electrical Engineering (Institutionen för systemteknik), Linköpings universitet, SE-581 83 Linköping, Sweden

Linköping, 19 January 2007
Abstract

A transportation plan is a byproduct of the calculation of the Kantorovich distance between two images. It describes a transformation from one of the images to the other. This master's thesis shows how transportation plans can be used for video coding and how to process them to achieve a good bitrate/quality ratio. Various parameters are evaluated using an implemented transportation plan video coder. The introduction of transform coding with the DCT proves to be very useful, as it reduces the size of the resulting transportation plans. DCT coding gives roughly a 10-fold decrease in bitrate at maintained quality compared to coding non-transformed transportation plans. With the best settings for transportation plan coding, I was able to code a test sequence at about 5 times the bitrate of MPEG coding of the same sequence at similar quality. As video coding using transportation plans is a very new concept, the thesis ends with conclusions on the test results and suggestions for future research in this area.
Contents

1 Introduction
  1.1 Purpose
  1.2 Background
  1.3 Methodology
2 Background on video coding
  2.1 Source coding and data compression
    2.1.1 Lossless compression
    2.1.2 Lossy compression
  2.2 Image and video coding
    2.2.1 A simple video coder
    2.2.2 A lossy video coder
  2.3 Performance measures
    2.3.1 Compression
    2.3.2 Quality
3 The Kantorovich distance and transportation plans
  3.1 Definitions
  3.2 Examples
  3.3 Application to video coding
4 Compressing the transportation plans
  4.1 Frame selection
    4.1.1 Interpolation methods
  4.2 Frame transformation
  4.3 Transportation plan calculation
    4.3.1 Mass equalization
    4.3.2 Distance function
  4.4 Arc selection
    4.4.1 Arcs with length 0
    4.4.2 Arcs with low mass
    4.4.3 Isolated arcs
  4.5 Arc transformation
    4.5.1 The difference plan
    4.5.2 Scan-line traversal
    4.5.3 Hilbert curve traversal
    4.5.4 Vector coding of relative receiving pixel coordinates
  4.6 Mass compression
  4.7 Lossless compression
  4.8 Filtering
5 Transform coding
  5.1 Frame selection
  5.2 Frame transformation
  5.3 Transportation plan calculation
    5.3.1 Mass equalization
    5.3.2 Mass elevation
  5.4 Arc selection
  5.5 Arc transformation
    5.5.1 Zigzag scan
    5.5.2 Vector coding of relative receiving pixel coordinates
  5.6 Mass compression
  5.7 Lossless compression
  5.8 Filtering
6 Implementation
  6.1 Tools
  6.2 Details
  6.3 Limitations
7 Results
  7.1 Transportation plan coding
  7.2 MPEG coding
8 Conclusions
  8.1 Transform benefits
  8.2 Quality
    8.2.1 Quality control
    8.2.2 Artifacts
  8.3 Interpolation
  8.4 Lossless coding
  8.5 Complexity
9 Future work
  9.1 Modifications
    9.1.1 Interpolation
    9.1.2 Transforms
    9.1.3 Distance functions
    9.1.4 Lossless coding
  9.2 Expansions
    9.2.1 Color video
Bibliography
A The Miss America sequence
B Compression algorithms
  B.1 Unary coding
  B.2 Move-to-front (MTF) coding
  B.3 Run-length coding
  B.4 Huffman
  B.5 LZ78
  B.6 LZW
C The discrete cosine transform
  C.1 Definitions and example
  C.2 Quality factor
Chapter 1

Introduction

1.1 Purpose

The purpose of this master's thesis project is to investigate whether effective video coding based on compressed transportation plans can be achieved. Effectiveness here mainly refers to the ratio between the video bitrate and the distortion, compared to that of a commercial video coder.

1.2 Background

Transportation plans have so far been sparsely used in video coding and are not incorporated in any commercial codecs. A previous master's thesis [7] investigated the possibilities of using transportation plans to code video, but there are many options and parameters left to evaluate. That makes video coding using transportation plans an interesting topic to explore.

1.3 Methodology

I intend to use a program written by my supervisor, Niclas Wadströmer, for calculating the Kantorovich distance between two images, along with the transportation plan from the first image to the second. This code will be fitted into a larger program that reads entire video sequences, calculates transportation plans between the frames, compresses these plans and finally writes them to an encoded video file. The program will also be able to play back the encoded video files and measure their quality.

To achieve high compression of the transportation plans, I will look at the steps involved in a video coder and see what options there are for each step. I will evaluate a few combinations of these options to see which ones are useful and which are not. Most of the ideas presented in the previous master's thesis on the topic [7] will be implemented as a starting point, and I will continue from there.
Chapter 2

Background on video coding

2.1 Source coding and data compression

Coding is a general term that is synonymous with representing. To code a data sequence is to represent it in a different way. The new representation may be more suited to a certain purpose. Coding does not necessarily mean compression, although this is often implied. When an encoded sequence needs less storage space than the original sequence, compression is achieved.

Source coding is to code the output signal of a source (e.g. an image) for a channel with limited capacity (e.g. a hard drive). This often requires data compression. It is assumed that the channel itself is noise-free and does not introduce any distortion to the signal.

Data compression can be divided into two categories: lossless and lossy compression. As their names imply, lossless compression allows the data to be perfectly reconstructed, while lossy compression introduces errors (distortion). Stand-alone lossless compression is mainly used for compressing text documents, where any distortion is intolerable. With images and video, a lossy compression can exploit defects in human vision and let the distortion go unnoticed. Lossy compression is usually followed by lossless compression to further reduce the file size.

The aim of lossless compression is to minimize the number of bits needed to represent the data. With lossy compression, the aim is to achieve a good compromise between bitrate and distortion. Both compression techniques make use of statistical redundancies in the data. The following subsections describe lossless and lossy compression in general. Specific compression algorithms and techniques used in the implementation are described in appendix B.

2.1.1 Lossless compression

A simple lossless compression technique is demonstrated in example 2.1.
Example 2.1: Lossless compression

Let's say we want to compress the output of a data source that produces integers. A sample elementary event is the following vector:

x = (17, 18, 19, 18, 20, 21, 23, 24, 26)

The vector x contains 9 numbers. If we code this vector using plain binary numbers, we need ⌊log₂(26)⌋ + 1 = 5 bits per symbol, knowing that 26 is the highest number in x. The whole vector of 9 numbers then requires 9 · 5 = 45 bits. If we use the fact that there are only 8 distinct numbers in x, we only need ⌈log₂(8)⌉ = 3 bits per symbol, giving 27 bits for the whole vector.

If we construct a new vector x' using

x'(n) = x(0)             for n = 0
x'(n) = x(n) − x(n−1)    for 0 < n < 9

we get

x' = (17, 1, 1, −1, 2, 1, 2, 1, 2)

Apart from x'(0), which can be treated separately, the number of distinct values in x' is 3, requiring only ⌈log₂(3)⌉ = 2 bits per symbol. Using 5 bits for the first number, x' requires 5 + 8 · 2 = 21 bits. That is less than half the number of bits required to store the original vector x.

The statistical redundancy exploited here is the sample-to-sample correlation in x, where each number differs only by a small amount from the previous number. Of course, when decompressing x', one must know how it was constructed in order to reconstruct x. If the vector x is representative of the source that produced it, this method should be usable to compress all outputs from the source.

A lower bound on how many bits are required to code a data sequence is given by its entropy. The entropy measure is based on the distribution of the elements in the data.

Definition 2.1 (Entropy) The (first-order) entropy of a data sequence x is [5]

H(x) = −∑_{i∈x} p(i) log₂ p(i)

where p is the probability function. The resulting value has the unit bits/symbol.

Example 2.2: Entropy

If we assume that the vectors in example 2.1 are representative of the source, we can estimate their probability distributions.
In x, we have p(18) = 2/9, while each of the remaining numbers has probability 1/9. The entropy can then be calculated as

H(x) = −(7 · (1/9) log₂(1/9) + (2/9) log₂(2/9)) ≈ 2.95 bits/symbol

The difference vector x', however, has a more skewed probability distribution:

p(−1) = 1/9,  p(1) = 4/9,  p(2) = 3/9,  p(17) = 1/9

The entropy of x' is

H(x') = −((1/9) log₂(1/9) + (4/9) log₂(4/9) + (3/9) log₂(3/9) + (1/9) log₂(1/9)) ≈ 1.75 bits/symbol

Example 2.2 shows that the difference vector requires, at least theoretically, fewer bits per symbol than the original vector.

2.1.2 Lossy compression

When using lossy compression, the reconstructed data is allowed to differ from the original data. The difference is often small enough to go unnoticed, or at least not to be annoying. How much distortion is considered acceptable depends on the application. Lossy compression is often used to compress audio and video because it can exploit defects in human hearing and vision. The lossy part of lossy compression is often quantization, which reduces the size of the data alphabet and thus the number of bits needed to code each symbol.

Example 2.3: Lossy compression

We want to compress the output of a data source that produces floating-point numbers. A sample elementary event is the following vector:

y = (45.8, 44.1, 43.7, 44.2, 47.9, 46.4, 49.5, 50.3, 49.8)

which has 9 distinct numbers, requiring ⌈log₂(9)⌉ = 4 bits per symbol. If we round each number to the nearest integer, we get

y' = ⌊y + 0.5⌋ = (46, 44, 44, 44, 48, 46, 50, 50, 50)

which has only 4 distinct numbers and therefore requires only ⌈log₂(4)⌉ = 2 bits per symbol.
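The entropy figures above are easy to reproduce. The following is a minimal Python sketch (my own illustration, not part of the thesis implementation) that estimates first-order entropy from symbol frequencies and applies it to the vectors of examples 2.1 and 2.2.

```python
from collections import Counter
from math import log2

def entropy(symbols):
    """First-order entropy in bits/symbol, estimated from relative frequencies."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in counts.values())

x = [17, 18, 19, 18, 20, 21, 23, 24, 26]
# Difference vector: keep the first sample, then sample-to-sample differences.
x_diff = [x[0]] + [x[i] - x[i - 1] for i in range(1, len(x))]

print(entropy(x))       # about 2.95 bits/symbol
print(entropy(x_diff))  # about 1.75 bits/symbol
```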
2.2 Image and video coding

Image coding deals with coding still images, such as photos or artwork. One of the most widespread coding standards for photographic images is JPEG. It uses both lossless and lossy coding, and the trade-off between distortion and compression can be adjusted with a quality parameter. An important observation is that in natural images, pixels tend to have intensity and color similar to those of their neighbors. This is called spatial redundancy.

2.2.1 A simple video coder

A video is essentially a sequence of images, and can thus be coded by applying image coding to each frame. However, a video also has temporal redundancy, meaning that the intensity and color of a pixel are similar to those of the previous and next frames in time. The traditional video coder exploits this by coding the difference frame D(t) = F(t) − F(t−1) instead of the regular frame F(t). The difference frames should then contain mostly pixels with value 0. A block diagram for a difference coder is shown in figure 2.1. In this thesis, I intend to replace the difference frame D(t) = F(t) − F(t−1) with the transportation plan from F(t−1) to F(t).

2.2.2 A lossy video coder

When introducing lossy compression to a coder based on sample differences, the errors must not accumulate. Example 2.4 shows how such a situation can occur.

Example 2.4: Quantization coder with error accumulation

Using the same data as in example 2.3, we have the data vector

x = (45.8, 44.1, 43.7, 44.2, 47.9, 46.4, 49.5, 50.3, 49.8)

and the difference vector

x' = (45.8, −1.7, −0.4, 0.5, 3.7, −1.5, 3.1, 0.8, −0.5)

If we quantize the difference vector x' by rounding each floating-point number to the nearest integer, we get

x̂' = (46, −2, 0, 1, 4, −1, 3, 1, 0)

When the data is reconstructed using

x*(n) = x̂'(0)              for n = 0
x*(n) = x̂'(n) + x*(n−1)    for 0 < n < 9

this results in

x* = (46, 44, 44, 45, 49, 48, 51, 52, 52)
with the absolute errors

|x − x*| = (0.2, 0.1, 0.3, 0.8, 1.1, 1.6, 1.5, 1.7, 2.2)

Figure 2.1. A difference coder. T is a delay element.

Figure 2.2. A coder using quantization of sample differences. Q is a quantizer, T is a delay element and C is a lossless coder.

The problem in example 2.4 is that the calculation of the difference vector x' is based on the exact values of x, which are not available to the decoder. This corresponds to connecting a quantizer to the output of the coder shown in figure 2.1. The solution is to make each difference calculation take the quantization of the previous value into account, i.e. x'(n) = x(n) − x*(n−1). A decoder is built into the coder, which then looks like the block diagram in figure 2.2. Example 2.5 demonstrates how it works.

Example 2.5: Quantization coder with no error accumulation

We again code the vector

x = (45.8, 44.1, 43.7, 44.2, 47.9, 46.4, 49.5, 50.3, 49.8)

x(0) gets quantized to x̂'(0) = x*(0) = 46. The difference between the second number and the reconstruction of the first is x'(1) = x(1) − x*(0) = 44.1 − 46 = −1.9. This gets quantized to x̂'(1) = −2 and the reconstruction is x*(1) = x*(0) + x̂'(1) = 46 + (−2) = 44.
The difference between the third number and the reconstruction of the second is x'(2) = x(2) − x*(1) = 43.7 − 44 = −0.3. This gets quantized to x̂'(2) = 0 and the reconstruction is x*(2) = x*(1) + x̂'(2) = 44 + 0 = 44.

Continuing in the same fashion gives the quantized difference vector

x̂' = (46, −2, 0, 0, 4, −2, 4, 0, 0)

which gets reconstructed to

x* = (46, 44, 44, 44, 48, 46, 50, 50, 50)

with the absolute errors

|x − x*| = (0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.5, 0.3, 0.2)

2.3 Performance measures

The performance of a video coder using transportation plans must somehow be measured in order to compare it with other coders. Compression and quality were selected as the two most important performance factors in this thesis.

2.3.1 Compression

The compression can be measured as the ratio between the size of the original data and the size of the compressed data. A suitable measure for image and video coding is the number of bits per pixel (bpp) in the compressed data. An 8-bit gray-scale image has 8 bpp. If it can be compressed to 0.2 bpp, the compression ratio is 8/0.2 = 40.

2.3.2 Quality

There is no universal measure of image quality. It is rather a matter of definition for the application at hand. For example, an image of a map is readable even if the colors are wrong, but becomes useless if the image is blurred so that thin lines (borders, roads) fade. In a photograph of a natural scene, some blurring may go unnoticed, while heavy discoloring could make the photo look unreal. Ideally, quality should be measured as it is perceived by a human. However, in order to get an objective measure, most measures are formulated mathematically. Two common measures of distortion are the mean squared error (MSE) and the peak signal-to-noise ratio (PSNR).
Figure 2.3. PSNR comparison: (a) original image, (b) PSNR 0 dB, (c) PSNR 6 dB. Figure 2.3(c) has higher PSNR even though figure 2.3(b) looks better to the human eye.

Definition 2.2 (Mean squared error) The MSE is defined as

MSE = (1/n) ∑_{i=1}^{n} (x(i) − x*(i))²

where x is the original signal and x* is the reconstruction of x after lossy coding. If x* = x, i.e. the coding is lossless, then MSE = 0.

Definition 2.3 (Peak signal-to-noise ratio) The PSNR is defined as

PSNR = 10 log₁₀(m_x² / MSE)

and is measured in decibels (dB). m_x is the maximum value of the signal, which is 255 in the case of 8-bit gray-scale images.

The MSE and PSNR measures are easy to use, but do not always agree with the image quality perceived by a human, as illustrated by example 2.6.

Example 2.6: PSNR exploit

The original image is shown in figure 2.3(a). It is 16 × 16 pixels and consists of white (value 255) and black (value 0) lines. Figure 2.3(b) is a shifted version of the original image, while figure 2.3(c) is a uniformly gray image (value 127). The MSE for figure 2.3(b) is 65025 and the PSNR is 0 dB. For figure 2.3(c), the MSE is 16256 and the PSNR is 6 dB. Even though figure 2.3(b) appears more similar to the original image, it has higher MSE and lower PSNR than figure 2.3(c).

PSNR is used as the quality measure in chapter 7, despite its poor performance in example 2.6. The example is an extreme case; for natural images, PSNR mostly works fine. An alternative to PSNR is to let a group of people evaluate image quality, but this method was considered too cumbersome.
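To make the mechanics of examples 2.4 to 2.6 concrete, the following Python sketch (my own illustration, not the thesis implementation) implements the closed-loop quantization coder of figure 2.2 with integer rounding as the quantizer, together with the MSE and PSNR measures of definitions 2.2 and 2.3.

```python
import math

def encode_closed_loop(x):
    """Quantized difference coding with the decoder in the loop (figure 2.2).
    The quantizer Q rounds to the nearest integer, as in example 2.3."""
    quantized = []        # what is transmitted
    reconstruction = []   # what the decoder will see (x*)
    prev = 0.0
    for n, sample in enumerate(x):
        diff = sample if n == 0 else sample - prev   # x'(n) = x(n) - x*(n-1)
        q = math.floor(diff + 0.5)                   # Q: round to nearest integer
        prev = q if n == 0 else prev + q             # x*(n) = x*(n-1) + q
        quantized.append(q)
        reconstruction.append(prev)
    return quantized, reconstruction

def mse(x, x_star):
    return sum((a - b) ** 2 for a, b in zip(x, x_star)) / len(x)

def psnr(x, x_star, peak=255):
    return 10 * math.log10(peak ** 2 / mse(x, x_star))

x = [45.8, 44.1, 43.7, 44.2, 47.9, 46.4, 49.5, 50.3, 49.8]
q, x_star = encode_closed_loop(x)
print(q)        # [46, -2, 0, 0, 4, -2, 4, 0, 0], as in example 2.5
print(x_star)   # [46, 44, 44, 44, 48, 46, 50, 50, 50]
```

The errors stay bounded because each difference is taken against the decoder's reconstruction rather than the original sample, which is exactly the point of example 2.5.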
Chapter 3

The Kantorovich distance and transportation plans

The Kantorovich distance and transportation plans are explained in this chapter. The Kantorovich distance is a distance measure between two gray-scale images. Generally, the smaller the Kantorovich distance is, the more similar the two images are.

3.1 Definitions

The gray-values in a gray-scale image can be viewed as mass. A pixel with gray-value 17 corresponds to the mass 17 in that position. Assuming that two images have the same total mass, the mass elements in one of them can be moved around, transforming the first image into the other image. A transportation vector describes how a mass element should be moved from the transmitting pixel in the first image to the receiving pixel in the second image.

Definition 3.1 (Transportation vector and its components) Using the notation from [4], a transportation vector (i_n, j_n, x_n, y_n, m_n) defines each move of mass, where (i_n, j_n) is the transmitting pixel, (x_n, y_n) is the receiving pixel and m_n is the mass. The pair ((i_n, j_n), (x_n, y_n)) is called an arc.

For simplicity, the terms transportation vector and arc are used interchangeably throughout this thesis. This can safely be done since each transmitting-receiving pixel pair appears only once in a transportation plan. Each transportation vector, or arc, is associated with a cost, indicating how expensive it is to move its mass.

Definition 3.2 (Cost) The cost of a transportation vector (i_n, j_n, x_n, y_n, m_n) is defined as d((i_n, j_n), (x_n, y_n)) · m_n, where d((i_n, j_n), (x_n, y_n)) is the distance between the transmitting pixel and the receiving pixel according to some distance function d [4].
Figure 3.1. Illustration of example 3.1: (a) transmitting image, (b) receiving image, (c) the arcs between them.

The distance function d can for example be the squared Euclidean distance

d((i_n, j_n), (x_n, y_n)) = (x_n − i_n)² + (y_n − j_n)²

Definition 3.3 (Kantorovich distance) The Kantorovich distance between the images A and B, with equal total gray-value, is the smallest total cost required to transform A into B using transportation vectors. The Kantorovich distance between A and B is denoted d_K(A, B) [4].

Calculating the Kantorovich distance between two images yields a transportation plan as a byproduct. The transportation plan can be used to transform the first image into the second.

Definition 3.4 (Transportation plan) The (optimal) transportation plan from image A to image B, which have equal total gray-value, is the set of transportation vectors that yields the smallest total cost when transforming A into B. The total cost of the transportation plan from A to B is equal to d_K(A, B) [4].

3.2 Examples

Example 3.1: A simple transportation plan

Figure 3.1 shows two small 3 × 3 frames. The arcs of the transportation plan from figure 3.1(a) to figure 3.1(b) are shown in figure 3.1(c). The pixel values range from 0 (black) to 255 (white). With the origin in the top left corner, the transportation is

(0, 0, 0, 1, 255)   distance: 1   cost: 255
(1, 0, 1, 0, 128)   distance: 0   cost: 0
(1, 0, 2, 1, 127)   distance: 2   cost: 254

The distances are given by the squared Euclidean distance function. The Kantorovich distance between the two images is 255 + 254 = 509.
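As a sanity check on example 3.1, the arc costs and the total cost (which equals the Kantorovich distance, since the plan is optimal) can be computed directly from definition 3.2. The following Python sketch is my own illustration; the arc format (i, j, x, y, m) follows definition 3.1.

```python
def squared_euclidean(i, j, x, y):
    """Squared Euclidean distance between transmitting pixel (i, j) and receiving pixel (x, y)."""
    return (x - i) ** 2 + (y - j) ** 2

def plan_cost(arcs, distance=squared_euclidean):
    """Total cost of a transportation plan: sum of distance * mass over all arcs."""
    return sum(distance(i, j, x, y) * m for (i, j, x, y, m) in arcs)

# The three arcs of example 3.1.
arcs = [(0, 0, 0, 1, 255),
        (1, 0, 1, 0, 128),
        (1, 0, 2, 1, 127)]

print(plan_cost(arcs))  # 509, the Kantorovich distance of example 3.1
```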
Figure 3.2. Transportation plans between the first two frames of Miss America using different distance functions: (a) using the Euclidean metric, 392 arcs; (b) using the squared Euclidean distance, 1009 arcs. Each arc is represented by a line between the transmitting and the receiving pixel.
Example 3.2: Real transportation plans

Example 3.1 does not reveal the complexity of the transportation plan between two real images. This can instead be seen in figure 3.2. The two transportation plans are calculated between the first two frames of the Miss America sequence, one using the Euclidean metric and the other using the squared Euclidean distance. Appendix A shows the first frames of this sequence. In this example, the frames were subsampled to 32 × 32 pixels to make the figures readable. Figure 3.2(a) shows the result of using the Euclidean metric: the arcs are few, but they have different lengths and point in all directions. When using the squared Euclidean distance, as shown in figure 3.2(b), there are more arcs, but their lengths and directions are more uniform.

3.3 Application to video coding

Using the information in a transmitting image and a transportation plan, the receiving image can be reconstructed. A video sequence can thus be coded with one transportation plan per frame, using the previous frame as the transmitting image.
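The decoding step implied by section 3.3, reconstructing the receiving frame from the transmitting frame and a transportation plan, amounts to moving mass along each arc. The sketch below is my own minimal NumPy illustration of that idea; indexing frames as frame[row, column] with arcs in (column, row) order is an assumption about the coordinate convention, guided by the ordering described in section 4.5.2.

```python
import numpy as np

def apply_plan(transmitting, arcs):
    """Reconstruct the receiving frame by moving mass along each arc.

    transmitting : 2D array of gray-values (mass), indexed as [row, column]
    arcs         : iterable of (i, j, x, y, m), where (i, j) is the transmitting
                   pixel and (x, y) the receiving pixel in (column, row) order
                   (assumed convention), and m is the mass moved.
    """
    receiving = transmitting.astype(np.int64).copy()
    for i, j, x, y, m in arcs:
        receiving[j, i] -= m   # remove mass from the transmitting pixel
        receiving[y, x] += m   # add it at the receiving pixel
    return receiving

# A hypothetical 3x3 transmitting image consistent with the arcs of example 3.1.
A = np.zeros((3, 3), dtype=np.int64)
A[0, 0] = 255   # pixel (i, j) = (0, 0)
A[0, 1] = 255   # pixel (i, j) = (1, 0)
arcs = [(0, 0, 0, 1, 255), (1, 0, 1, 0, 128), (1, 0, 2, 1, 127)]
B = apply_plan(A, arcs)   # the corresponding receiving image
```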
Chapter 4

Compressing the transportation plans

A video coder using transportation plans can be built from the blocks in figure 4.1. These blocks are described, in that order, in the following sections. Note that this coder assumes that consecutive frames are similar, i.e. that they come from the same scene in a video sequence. When a scene change occurs, it is probably best to code the first frame of the new scene as a stand-alone image. A transportation plan from a previous frame or from a completely blank frame could also be used, but these would most likely contain too many arcs to allow efficient coding.

Figure 4.1. Block diagram for a transportation plan video coder. A video sequence is encoded by passing it through all steps of the coder.

4.1 Frame selection

This step decides which video frames to encode. Possible choices include:

- Code all frames
- Code frames 0, n, 2n, 3n, ... and let the decoder interpolate the omitted frames
- Code frames 0, n, 2n, 3n, ... and use interpolations of the omitted frames as predictors for themselves
Figure 4.2. Interpolation of frames: (a) interpolation of frames for direct use, (b) interpolation of frames for use as predictors. F = frame sent as a pure image, T = frame produced from a transportation plan, I = interpolated frame. Bold arrows represent transportation plans, i.e. sent data. Thin arrows show which frames are used to interpolate intermediate frames.

The last two options resemble the use of GOPs (groups of pictures) in MPEG video coding. A GOP consists of I, P and B frames. I frames are stand-alone images, while P frames are predicted from previous frames and B frames are bidirectionally predicted from both previous and future frames. An illustration of the last two options with n = 3 is shown in figure 4.2. Interpolated frames need not be sent, but can be calculated from frames already available to the receiver, which reduces the bitrate drastically. When the interpolated frames are used as predictors, the transportation plans from them have to be sent, but in return the quality can be increased.

4.1.1 Interpolation methods

Interpolation can be done in many ways with varying complexity and quality. A simple interpolation between the frames F(t) and F(t+2) is the linear interpolation F(t+1) = (F(t) + F(t+2))/2. Several frames can be interpolated between F(t) and F(t+n) using

F(t+k) = ((n − k) F(t) + k F(t+n)) / n,   where 1 ≤ k ≤ n − 1

A transportation plan can also be used to perform interpolation, either by halving the arc masses or by halving the arc lengths. However, testing showed that interpolation by halving arc lengths gave poor results. Halving arc masses gave somewhat better quality, but linear interpolation was the superior method.
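A minimal NumPy sketch of the linear interpolation formula above (my own illustration, not the thesis code):

```python
import numpy as np

def interpolate_frames(f_start, f_end, n):
    """Linearly interpolate n-1 frames between f_start = F(t) and f_end = F(t+n).

    Returns [F(t+1), ..., F(t+n-1)] where F(t+k) = ((n - k) * F(t) + k * F(t+n)) / n.
    """
    f_start = f_start.astype(np.float64)
    f_end = f_end.astype(np.float64)
    return [np.rint(((n - k) * f_start + k * f_end) / n).astype(np.uint8)
            for k in range(1, n)]

# Example: two 128x128 gray-scale frames with two interpolated frames in between (n = 3).
f0 = np.zeros((128, 128), dtype=np.uint8)
f3 = np.full((128, 128), 90, dtype=np.uint8)
f1, f2 = interpolate_frames(f0, f3, n=3)   # constant frames with values 30 and 60
```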
4.2 Frame transformation

The input video frames can be transformed before the transportation plan is calculated. As will become apparent in section 4.4, a transportation plan between natural frames has many arcs. This number can be reduced drastically if the frames are transformed prior to calculating the transportation plan. The two tested choices are:

- No transformation
- Discrete cosine transform

The discrete cosine transform turned out to give far superior compression. Coding with the DCT is further described in chapter 5. The rest of this chapter deals with non-transformed frames and their transportation plans.

4.3 Transportation plan calculation

The calculation of transportation plans was done by an external program, written by my supervisor.

4.3.1 Mass equalization

As mentioned in chapter 3, the masses of the two input images must be equal to comply with the definition of the Kantorovich distance. This can be achieved in a number of ways:

- Cross-multiplication of the image masses
- Cross-multiplication with the greatest common divisor
- Sequentially modifying individual pixel values in one of the images until the two images have the same mass

The first and third options were implemented, but showed no significant differences regarding the number of arcs or their distribution across the images. A drawback with cross-multiplication is that the alphabet for the mass component becomes huge. This requires immediate quantization and also prolongs execution time. These methods are also dependent on the frame size.

The third option amounts to changing the average intensity of the whole image. To equalize the masses exactly, some single-pixel changes may be necessary; these can be done in a scan-line fashion (see the sketch below). This method always yields pixel values in the range [0, 255] and arc masses in the range (0, 255]. For this reason, this approach was used in all tests.

All options require that some side information is passed to the receiver. When using cross-multiplication, the image mass of the receiving frame can be used for division back to actual pixel values. The sequential modification scheme can use the difference of the two image sums to restore the receiving frame at the decoder end. In all cases, the size of this side information is negligible. With a resolution of 128 × 128, it requires less than 2 bytes per frame.
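The sequential modification scheme can be sketched as follows. This is my own Python/NumPy illustration of the idea, and a simplified one: the thesis does not specify the exact order or step size of the adjustments beyond "scan-line fashion", so spreading single gray-level changes over the pixels in scan-line order is an assumption.

```python
import numpy as np

def equalize_mass(transmitting, receiving):
    """Adjust the receiving frame so that its total mass equals that of the
    transmitting frame, by spreading single gray-level changes over the pixels
    in scan-line order. Returns the adjusted frame and the mass difference,
    which is the side information mentioned in section 4.3.1."""
    a = transmitting.astype(np.int64)
    b = receiving.astype(np.int64)
    diff = int(a.sum() - b.sum())      # mass to add to (or remove from) the receiving frame

    flat = b.flatten()                 # scan-line order: row by row
    step = 1 if diff > 0 else -1
    remaining = abs(diff)
    idx = 0
    while remaining > 0:
        new_val = flat[idx] + step
        if 0 <= new_val <= 255:        # keep pixel values in [0, 255]
            flat[idx] = new_val
            remaining -= 1
        idx = (idx + 1) % flat.size
    return flat.reshape(b.shape).astype(np.uint8), diff
```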
4.3.2 Distance function

The definition of the Kantorovich distance does not specify which distance function to use for determining arc costs, so this can be chosen freely. A few common distance functions are:

- Euclidean metric
- Squared Euclidean distance
- Manhattan metric (also known as the city block metric)

The squared Euclidean distance was already implemented in the program I used for transportation plan calculation. With this distance function, the cost increases rapidly as an arc gets longer, which gives arcs of similar length and direction that should be easy to code. Calculating the transportation plan between the first two frames of the 128 × 128 Miss America sequence (see appendix A) using the squared Euclidean distance gives 23677 arcs. With the Manhattan metric, the figure is 9280 arcs. However, in the first case all arcs have length 1 or 2, while in the latter case the arc length distribution decreases exponentially from length 1 to 23. Arcs of zero length (see section 4.4.1) are not included in these figures. The squared Euclidean distance was selected as the distance function for all subsequent tests.

4.4 Arc selection

A typical transportation plan between two 128 × 128 video frames contains tens of thousands of arcs. For instance, among the transportation plans between the frames of the Miss America sequence, the plan with the fewest arcs contains 34331 arcs and the one with the most contains 42095 arcs. On average there are 2.3 arcs per pixel. However, several types of arcs can be removed from a transportation plan to improve the bitrate without affecting the quality too much:

- Arcs with length 0
- Arcs with low mass
- Isolated arcs

The first two categories are mentioned in [7]. The removal of arcs from the transportation plan between the first two frames in the Miss America sequence is illustrated in figure 4.3, where every arc's transmitting pixel has been plotted; if several arcs transmit from the same pixel, that pixel has been made brighter. Figure 4.3(a) shows the transmitting pixels of the unmodified transportation plan. From the images, one can make out the woman's eyes and lips, which seem to give rise to most arcs.

Removal of low-mass and isolated arcs does not always result in as drastic a drop in the number of arcs as in the case illustrated by figure 4.3.
Subsequent transportation plans may need more arcs to compensate for accumulated errors. When coding the entire Miss America sequence with a mass limit of 7 and removal of isolated arcs, the number of arcs varies between about 4000 and 14000.

Figure 4.3. Transmitting pixels: (a) all 39921 arcs, (b) arcs with length 0 removed (23677 arcs left), (c) low-mass arcs removed with mass limit 7 (6110 arcs left), (d) isolated arcs removed (5021 arcs left). The brighter a pixel is, the more arcs start in that pixel; the number of arcs starting in a pixel ranges from 0 (black) to 6 (white).

4.4.1 Arcs with length 0

Arcs with zero length can be discarded without any quality loss at all, since the decoder has access to the previous frame. Instead of transporting mass from the previous frame to a blank one, the mass can be moved within the frame. There is generally no reason to keep zero-length arcs, except in a few cases. For example, these arcs are needed if the decoder has to discard the previous frame due to limited memory capacity. Figure 4.3(b) shows the transmitting pixels when zero-length arcs have been removed from the example transportation plan.
Figure 4.4. The result of successively removing arcs of higher and higher mass from the transportation plan between the first two frames of Miss America: (a) the number of arcs as a function of the arc mass limit, (b) the PSNR as a function of the arc mass limit. When the mass limit is 0, no arcs are removed and the PSNR becomes infinitely large. Isolated arcs were not removed.

4.4.2 Arcs with low mass

Assuming that all arcs cost roughly the same to code, low-mass arcs can be removed with the motivation that their absence makes only small contributions to the MSE. The mass limit that decides whether an arc is included or not can be used as a bitrate/quality parameter. Figure 4.3(c) shows the transmitting pixels when low-mass arcs have been removed from the example transportation plan with mass limit 7.

The impact of removing low-mass arcs was tested using the transportation plan between the first two frames of Miss America. After successively removing arcs of higher and higher mass, the number of arcs was counted and the quality of the reconstructed second frame was measured. The results, shown in figure 4.4, give a hint of what quality to expect when choosing a mass limit. When encoding a whole video sequence, the PSNR will be lower and the number of arcs higher than in this test, since the starting image here has perfect quality. Isolated arcs were not removed in this test, unless they were under the mass limit.
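A sketch of the first two arc-selection steps (zero-length and low-mass arcs) in Python follows. This is my own illustration; whether the mass comparison should be strict or not is not spelled out in the text, so dropping arcs with mass at or below the limit is an assumption here.

```python
def select_arcs(arcs, mass_limit=0):
    """Drop arcs that the decoder can do without.

    arcs       : iterable of (i, j, x, y, m) tuples
    mass_limit : arcs with mass at or below this limit are removed (assumption);
                 with limit 0 every mass-moving arc is kept. Zero-length arcs are
                 always removed, since the decoder keeps the previous frame.
    """
    kept = []
    for (i, j, x, y, m) in arcs:
        if (i, j) == (x, y):      # zero-length arc: no quality loss when dropped
            continue
        if m <= mass_limit:       # low-mass arc: small contribution to the MSE
            continue
        kept.append((i, j, x, y, m))
    return kept

arcs = [(0, 0, 0, 1, 255), (1, 0, 1, 0, 128), (1, 0, 2, 1, 127)]
print(select_arcs(arcs, mass_limit=0))   # the zero-length arc (1,0)->(1,0) is dropped
```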
Component                    Entropy   Uncoded bits
Transmitting x-coordinates   6.333     7
Transmitting y-coordinates   6.687     7
Receiving x-coordinates      6.368     7
Receiving y-coordinates      6.719     7
Masses                       4.722     8
Sum (bits per arc)           30.829    36

Table 4.1. Entropies for the components of the original transportation plan between the first two frames of Miss America, using mass limit 7 and removal of isolated arcs.

4.4.3 Isolated arcs

A number of arcs typically appear in uninteresting regions, such as the background behind the woman in the Miss America sequence. The transmitting pixel of an isolated arc has few neighboring transmitting pixels of other arcs, and the same holds for its receiving pixel. Isolated arcs are expensive to code since they do not belong to a region of arcs. The size of the area in which to search for neighbors and the number of required neighbors can be used as bitrate/quality parameters. In all tests, I used a 3 × 3 region, centered on the current pixel, with a requirement of at least 4 neighbors to count as non-isolated. Figure 4.3(d) shows the transmitting pixels when isolated arcs have been removed from the example transportation plan.

4.5 Arc transformation

The 5 values that constitute an arc of a transportation plan typically vary greatly, as the coordinates may be spread out over the entire image and the mass may vary from 1 to 255. This can be illustrated by the transportation plan (with mass limit 7 and removal of isolated arcs) between the first two frames of Miss America. Looking at the separate components gives the entropies in table 4.1. The coordinate component entropies are only slightly lower than 7, which is the number of bits required to code the coordinates (ranging from 0 to 127) without any compression at all! The high entropy is a result of the coordinates' almost uniform distribution. By transforming the transportation plan into a difference plan [7], it can be coded with fewer bits.

4.5.1 The difference plan

A transportation plan can be transformed into a difference plan by replacing, for each arc, the receiving pixel with a vector from the transmitting pixel to the receiving pixel. That is, the arc (i_n, j_n, x_n, y_n, m_n) gets replaced by (i_n, j_n, x_n − i_n, y_n − j_n, m_n).
Component                               Entropy   Uncoded bits
Distances to last transmitting pixel    1.712     14
Receiving x-differences                 1.093     7
Receiving y-differences                 1.430     7
Masses                                  4.722     8
Sum (bits per arc)                      8.957     36

Table 4.2. Entropies for the components of the modified difference plan between the first two frames of Miss America, using mass limit 7 and removal of isolated arcs.

The difference plan can be further transformed into what Östman [7] calls a modified difference plan. The idea is to represent a transmitting pixel by its distance to the transmitting pixel of the previous arc. This distance is measured along some traversal of the coordinate system, corresponding to an ordering of the arcs. In [7], the coordinate system is traversed in a scan-line fashion.

4.5.2 Scan-line traversal

In a scan-line traversal, the coordinate system is traversed row by row. This corresponds to ordering the arcs first by the second (row) coordinate of the transmitting pixel, then by the first (column) coordinate. Starting at (0, 0), the coordinate system is traversed until the transmitting pixel of an arc is encountered. The position of that transmitting pixel is stored as the number of traversal steps since the last transmitting pixel. When the traversal reaches the end of the first line (coordinate (127, 0) in a 128 × 128 frame) it jumps to the beginning of the next line, (0, 1), and continues from there.

The modified plan now consists of four components. Using the same setup as for the original transportation plan, the component entropies are shown in table 4.2. These components have significantly lower entropies than those of the original transportation plan, shown in table 4.1, so coding them should require much fewer bits.

4.5.3 Hilbert curve traversal

Instead of using scan-line traversal when transforming the transmitting pixels, I also tried traversing them along a Hilbert curve. A Hilbert curve is a space-filling curve, which tends to traverse one region of an image before moving on. A traversal along a Hilbert curve does not make such drastic jumps as a scan-line traversal, since a Hilbert curve has no row endings. The Hilbert scan is illustrated in figure 4.5.
Figure 4.5. Hilbert curve.

Definition 4.1 (Hilbert curve) A Hilbert curve can be defined by the following rewrite system [6]:

Alphabet: L, R
Constants: F, +, −
Axiom: L
Production rules: L → +RF−LFL−FR+ and R → −LF+RFR+FL−

A string of alphabet symbols and constants is built recursively using the production rules. The initial string is the axiom L. After the first iteration, the string is +RF−LFL−FR+. After n iterations, the Hilbert curve covers 2ⁿ × 2ⁿ points in the plane. The constants are interpreted as follows: + = turn right, − = turn left and F = move forward. By applying this interpretation to an iterated string, a Hilbert curve is constructed.

Using a Hilbert traversal affects the entropy of the component containing distances to the last transmitting pixel. With the same parameters as in the previous entropy tests, the entropy for this component drops slightly to 1.661, compared to 1.712 for the scan-line traversal. The entropies of the other components remain unchanged, since they are only reordered.

From a few test runs, it is apparent that transmitting pixels are usually grouped in regions and that arcs generally have the same direction within a region. By using a Hilbert traversal, there should be a better chance of repetitive patterns in the transmitting and receiving pixel components, respectively. This would make them more suitable for e.g. LZW compression (see section 4.7). The idea is confirmed by a test showing that the conditional entropy of the receiving pixel component is slightly lower than its ordinary entropy. The conditional entropy incorporates previous knowledge, in this case the previous receiving pixel. The locality of arc direction is also demonstrated in figure 4.6, using the transportation plan between the first two frames of Miss America with a mass limit of 7 and isolated arcs removed.
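Definition 4.1 translates directly into code. The following Python sketch (my own illustration) expands the L-system string and turns it into the sequence of grid points visited by the curve; the choice of starting position and initial heading is an assumption, since the definition leaves it open.

```python
def hilbert_string(n):
    """Expand the axiom 'L' n times using the production rules of definition 4.1."""
    rules = {"L": "+RF-LFL-FR+", "R": "-LF+RFR+FL-"}
    s = "L"
    for _ in range(n):
        s = "".join(rules.get(c, c) for c in s)
    return s

def hilbert_points(n):
    """Interpret the expanded string as turtle moves and return the visited points.

    '+' = turn right, '-' = turn left, 'F' = move forward; 'L' and 'R' are ignored.
    The turtle starts at (0, 0) heading along the positive x-axis (assumed convention).
    """
    x, y = 0, 0
    dx, dy = 1, 0
    points = [(x, y)]
    for c in hilbert_string(n):
        if c == "+":            # turn right
            dx, dy = dy, -dx
        elif c == "-":          # turn left
            dx, dy = -dy, dx
        elif c == "F":          # move forward one step
            x, y = x + dx, y + dy
            points.append((x, y))
    return points

print(len(hilbert_points(7)))   # 2^7 * 2^7 = 16384 points, enough to cover a 128x128 frame
```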
Figure 4.6. Arc directions: (a) color key, (b) arc directions plotted using the key in 4.6(a). The direction color code is plotted on the transmitting pixel. If several arcs share the same pixel, the color represents the arc with the highest mass.

4.5.4 Vector coding of relative receiving pixel coordinates

Since the (modified) x and y coordinates of the receiving pixels are correlated, I chose to code them jointly as vectors, using a spiral code. The coding principle is demonstrated in figure 4.7, which numbers the relative coordinates along a spiral starting at the origin.

Figure 4.7. Spiral code.

The receiving pixels are now represented by a single component, the spiral code. In the example in table 4.2, the sum of the entropies for the receiving pixel components is 1.093 + 1.430 = 2.523. The entropy of the spiral-coded component, using the same parameters, is 2.248, which is slightly lower. Since arcs usually have the same length and direction within a region, I applied move-to-front (MTF) coding to the spiral code. The entropy then dropped to 1.213. MTF coding is further described in appendix B.2.
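A small Python sketch of the idea (my own illustration): relative receiving coordinates are mapped to a single spiral index, and the resulting symbol stream is MTF coded. The exact numbering of figure 4.7 is not reproduced here; spiral_index below is a hypothetical stand-in that enumerates offsets in rings around the origin, and the MTF coder follows the standard scheme described in appendix B.2.

```python
def spiral_index(dx, dy):
    """Map a relative coordinate to a single index by enumerating offsets in
    growing rings around the origin (a stand-in for the numbering of figure 4.7)."""
    if (dx, dy) == (0, 0):
        return 0
    r = max(abs(dx), abs(dy))                 # which ring the offset lies on
    ring = [(x, y) for x in range(-r, r + 1) for y in range(-r, r + 1)
            if max(abs(x), abs(y)) == r]      # the ring in a fixed enumeration order
    inner = (2 * r - 1) ** 2                  # number of points in all inner rings
    return inner + ring.index((dx, dy))

def mtf_encode(symbols, alphabet):
    """Move-to-front coding: emit each symbol's current position, then move it to the front."""
    table = list(alphabet)
    out = []
    for s in symbols:
        pos = table.index(s)
        out.append(pos)
        table.insert(0, table.pop(pos))       # move the symbol to the front
    return out

# Arcs within a region often share direction, so the spiral indices repeat ...
offsets = [(1, 0), (1, 0), (1, 1), (1, 0), (1, 1), (1, 1)]
codes = [spiral_index(dx, dy) for (dx, dy) in offsets]
alphabet = sorted(set(codes))
print(mtf_encode(codes, alphabet))            # ... which MTF turns into mostly small numbers
```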
Figure 4.8. Arc masses plotted on the transmitting pixel. If several arcs share the same transmitting pixel, the average mass is plotted. The mass ranges from 0 (black) to 144 (white).

4.6 Mass compression

The arc masses have so far only been reordered, as a consequence of the chosen traversal. Using knowledge of the mass of the previous arc along the Hilbert curve did not seem to help when coding them. The distribution of masses is shown in figure 4.8. As can be seen in the figure, arcs with similar mass are not grouped within regions, but rather along lines and edges. The mass component therefore does not benefit from the Hilbert traversal as much as the other components do.

However, the mass component can be quantized to reduce the alphabet size. As concluded in [7], the arc mass is the only part of the modified difference plan that can be lossy coded without a severe distortion penalty. I used linear quantization with varying quantization steps. Figure 4.9 shows the result of quantizing the masses using 16 steps and no removal of arcs. The PSNR of the reconstructed frame is 32.643 dB and the quantization errors are visible as noise in the image.

4.7 Lossless compression

The remaining parts of the modified difference plan are three components:

- Transmitting pixels, represented by the distance to the transmitting pixel of the previous arc
- Receiving pixels as MTF-coded spiral codes
- Quantized arc masses
Figure 4.9. (a) The original second frame of the Miss America sequence, (b) the frame reconstructed from a transportation plan using 16 mass quantization steps, with a resulting PSNR of 32.643 dB, (c) the squared difference image.
These components were finally encoded with combinations of lossless compression algorithms. The following lossless coders were tested:

- Unary coding
- Run-length coding
- Huffman
- LZ78
- LZW

Each algorithm is described in appendix B. Huffman and LZW coding gave the best compression, and the differences between them were small. The best combination appeared to be Huffman coding for the transmitting and receiving pixels and LZW for the masses. This setup was used in all tests.

Instead of coding the three components separately, one could form new tuples by combining them. One approach could be to pair transmitting and receiving pixels and code the transportation plan as these tuples plus the masses. However, testing showed that the tuples generally had higher entropy than the two separate components.

4.8 Filtering

After reconstructing an encoded frame, it can be filtered to increase the PSNR to some extent. The filters tried are:

- Blur
- Median

However, the improvement in PSNR does not directly translate into a perception of better image quality. Overuse rather degrades the perceived visual quality, as the image becomes too blurry even though the PSNR increases. The reason for the higher PSNR value is that the errors caused by the encoding are distributed to surrounding pixels, which decreases the MSE. This is an exploit similar to example 2.6.
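To illustrate the effect described in section 4.8, the following sketch (my own, assuming SciPy is available; the toy frames are invented for the example) applies a median filter and a simple blur to a noisy reconstruction and compares PSNR against the original frame.

```python
import numpy as np
from scipy.ndimage import median_filter, uniform_filter

def psnr(original, reconstructed, peak=255.0):
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

def postfilter(frame, mode="median"):
    """Post-filter a reconstructed frame over a 3x3 neighborhood."""
    if mode == "median":
        return median_filter(frame, size=3)
    return uniform_filter(frame, size=3)   # a simple blur

# Toy data: a smooth "original" frame and a noisy "reconstruction" of it.
rng = np.random.default_rng(0)
original = np.fromfunction(lambda r, c: r + c, (128, 128)).astype(np.uint8)   # smooth ramp, 0..254
noise = rng.integers(-10, 11, size=(128, 128))
reconstructed = np.clip(original.astype(np.int16) + noise, 0, 255).astype(np.uint8)

print(psnr(original, reconstructed))                         # PSNR before filtering
print(psnr(original, postfilter(reconstructed, "median")))   # typically higher after filtering
```

As the chapter points out, a higher PSNR after filtering does not necessarily mean the frame looks better; it only means the squared error has been spread out.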
Chapter 5

Transform coding

This chapter describes a coding process similar to that of chapter 4, but using frames transformed with the discrete cosine transform (DCT). An introduction to this transform is given in appendix C. The main benefit of the DCT is that the transformed images have their energy concentrated in a few pixels, which results in very few arcs in a transportation plan between two such images. The chapter is divided into the same sections as chapter 4, and most methods described in that chapter are applicable here too.

5.1 Frame selection

Frames can be selected as in section 4.1. The question here is whether interpolation should be done on reconstructed frames or on DCT-transformed frames. Test runs show that interpolation in the signal domain gives better quality.

5.2 Frame transformation

For this coder, the frames are transformed with the DCT on n × n pixel blocks. The DCT image can at this stage be quantized as in JPEG coding, using a quantization matrix coupled with a quality factor. Quantization is demonstrated in example C.1 and a suitable quality factor is defined in appendix C.2.

5.3 Transportation plan calculation

The use of transformed images as input to the transportation plan calculation function requires some preprocessing, as the following subsections show.
Figure 5.1. Transmitting pixels using the DCT with 16 × 16 block size and quality 50, zero-length arcs omitted. The brighter a pixel is, the more arcs start in that pixel; the number of arcs starting in a pixel ranges from 0 (black) to 3 (white).

5.3.1 Mass equalization

Equalizing the masses of transformed images does not work well with the sequential modification method described in section 4.3.1. The reason is that the mass in a block-transformed image is distributed very unevenly: there are islands of high positive or negative values around the DC pixel of each block, while the pixels on the block perimeters are usually zero. Sequentially modifying mass in such an image would nullify its energy compaction property. The cross-multiplication methods, also described in section 4.3.1, should work better. However, they would suffer even more from the problem of large alphabets, since DCT components may take on a much wider range of values than normal image pixels.

A new option for equalizing the masses of two DCT frames is the following: increase or decrease only the pixel representing the DC component of each block, so that the sum of a block in the first frame becomes equal to the sum of the corresponding block in the second frame. This makes it probable that arcs will be short and move their mass within a block, most likely to or from the DC pixel. The difference for the DC pixels is sent as side information. If the transformation block size is large, there are only a few DC pixels to send side information for. On the other hand, the locality of the transformation decreases with large block sizes.

5.3.2 Mass elevation

As apparent from example C.1, the DCT components may be negative as well as positive. The definition of the Kantorovich distance in [4] does not state whether image pixels are allowed to have negative values, but the transportation plan calculation function only allows positive pixel values. A workaround is to temporarily elevate both images so that they are strictly positive before using the function. The calculated transportation plan can then be used on the non-elevated transmitting image.
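A minimal NumPy sketch of the per-block DC equalization idea from section 5.3.1 (my own illustration, assuming the DC coefficient sits in the top-left pixel of each block, as in standard block DCT layouts):

```python
import numpy as np

def equalize_block_masses(dct_a, dct_b, block=16):
    """Make each block of dct_b have the same sum as the corresponding block of
    dct_a by adjusting only its DC pixel (assumed to be the block's top-left pixel).
    Returns the adjusted frame and the per-block corrections, which would be sent
    as side information."""
    b = dct_b.astype(np.int64).copy()
    rows, cols = b.shape
    corrections = np.zeros((rows // block, cols // block), dtype=np.int64)
    for r in range(0, rows, block):
        for c in range(0, cols, block):
            diff = dct_a[r:r+block, c:c+block].sum() - b[r:r+block, c:c+block].sum()
            b[r, c] += diff                          # adjust the DC pixel only
            corrections[r // block, c // block] = diff
    return b, corrections
```

The mass elevation of section 5.3.2 then amounts to adding a common offset, for example shift = 1 - min(dct_a.min(), dct_b.min()), to both frames before the plan calculation and keeping the non-elevated transmitting frame for reconstruction.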