Camera-Based Friction Estimation with Deep Convolutional Neural Networks


UPTEC F 18046
Degree project (Examensarbete) 30 hp, July 2018
Camera-Based Friction Estimation with Deep Convolutional Neural Networks
Arvi Jonnarth

Abstract

Camera-Based Friction Estimation with Deep Convolutional Neural Networks
Arvi Jonnarth

During recent years, great progress has been made within the field of deep learning, and more specifically, within neural networks. Deep convolutional neural networks (CNN) have been especially successful within image processing, in tasks such as image classification and object detection. Car manufacturers, amongst other actors, are starting to realize the potential of deep learning and have begun applying it to autonomous driving. This is not a simple task, and many challenges still lie ahead. A sub-problem that needs to be solved is a way of automatically determining the road conditions, including the friction. Since many modern cars are equipped with cameras these days, it is only natural to approach this problem with CNNs. This is what has been done in this thesis. First, a data set is gathered which consists of 37,000 labeled road images that are taken through the front window of a car. Second, CNNs are trained on this data set to classify the friction of a given road. Gathering road images and labeling them with the correct friction is a time-consuming and difficult process, and requires human supervision. For this reason, experiments are made on a second data set, which consists of 54,000 simulated images. These images are captured from the racing game World Rally Championship 7 and are used in addition to the real images, to investigate what can be gained from this. Experiments conducted during this thesis show that CNNs are a good approach for the problem of estimating the road friction. The limiting factor, however, is the data set. Not only does the data set need to be much bigger, but it also has to include a much wider variety of driving conditions. Friction is a complex property and depends on many variables, and CNNs are only effective on the type of data that they have been trained on. For these reasons, new data has to be gathered by actively seeking out different driving conditions in order for this approach to be deployable in practice.

Handledare (supervisors): Thomas Svantesson, Daniel Murdin
Ämnesgranskare (subject reader): Thomas Schön
Examinator (examiner): Tomas Nyberg
ISSN: 1401-5757, UPTEC F 18046

Populärvetenskaplig Sammanfattning

In recent years, great progress has been made within machine learning, in particular regarding neural networks. Deep neural networks with convolution layers, or convolutional networks (CNNs), have been especially successful within image processing, in problems such as image classification and object detection. Car manufacturers, among other actors, have now started to realize the potential of machine learning and have begun applying it to autonomous driving. This is no simple task, and many challenges still lie ahead. One sub-problem that must be solved is a way of automatically determining the road conditions, including the friction. Since many new cars are equipped with cameras, it is natural to try to tackle this problem with convolutional networks, which is why this has been done in this thesis project. First, we collect a data set consisting of 37,000 images of roads, taken through the windshield of a car. These images are labeled according to the friction of the road. We then train convolutional networks on this data set to classify the friction. Collecting road images and labeling them is a time-consuming and difficult process that requires human supervision. For this reason, experiments are performed on a second data set consisting of 54,000 simulated images. These have been collected from the game World Rally Championship 7, where the aim is to investigate whether the performance of the networks can be increased with simulated data, and thereby reduce the required size of the real data set. The experiments performed during the thesis project show that convolutional networks are a good approach for estimating road friction. The limiting factor in this case is the data set. The data set not only needs to be larger, but above all it must cover a wider range of road and weather conditions. Friction is a complex property that depends on many variables, and convolutional networks are only effective on the type of data they have been trained on. For these reasons, new data needs to be collected by actively seeking out new driving conditions if this approach is to be applicable in practice.

Acknowledgements

First of all, I would like to thank my supervisors Daniel Murdin and Dr. Thomas Svantesson at NIRA Dynamics for their guidance and support. I would like to thank them for the possibility of doing my master's thesis at such an interesting company as NIRA. My time here has been great. I would also like to thank my subject reader Professor Thomas Schön at Uppsala University. Without all of your feedback this thesis would not have become what it is today. Furthermore, I would like to thank my fellow thesis workers Daniel, Olle and William. I have enjoyed our discussions throughout the thesis as well as our much needed foosball sessions. You have contributed to an enjoyable workplace.

Contents

Abstract
Populärvetenskaplig Sammanfattning
Acknowledgements
Notation
List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Project Goal
2 Theory
  2.1 Artificial Neural Networks
  2.2 Fully Connected Networks
    2.2.1 Activation Functions
  2.3 Training
    2.3.1 Backpropagation
    2.3.2 Batch Training
    2.3.3 Gradient-Based Optimization
    2.3.4 The Vanishing/Exploding Gradient Problem
    2.3.5 Weight Initialization
    2.3.6 Dropout
  2.4 Convolutional Neural Networks
    2.4.1 The Convolution Layer
    2.4.2 Pooling
    2.4.3 Batch Normalization
    2.4.4 The Convolution Cell
    2.4.5 Feature Map Dropout
3 Methods
  3.1 Network Architectures
    3.1.1 VGG
    3.1.2 ResNet
    3.1.3 DenseNet
    3.1.4 Implementation Details
  3.2 Data Set
    3.2.1 Real Data
    3.2.2 Simulated Data
    3.2.3 Data Mixture
  3.3 Data Augmentation
    3.3.1 Left-to-Right Flip
    3.3.2 Rotation
    3.3.3 Translation
    3.3.4 Scaling
    3.3.5 Noise
    3.3.6 Intensity Modification
    3.3.7 Cutout
    3.3.8 Implementation Details
  3.4 Multiple-Crop Evaluation
4 Results
  4.1 Hyperparameter Tuning
  4.2 Training with Simulated Data
  4.3 Network Architecture Evaluation
  4.4 Data Set Evaluation
5 Discussion
  5.1 Network Architecture Analysis
  5.2 Data Set Analysis
  5.3 Effects of Simulated Data
  5.4 Model Generalization
  5.5 Future Work
6 Conclusions

Notation

Symbol : Description
a : A vector representing a fully connected layer, post-activation
b : Bias term (both in fully connected and convolutional networks)
c : The convolution operation in a neural network
C : Loss (as a function of model output and desired output)
D_1, D_2 : Spatial dimensions of an image; height and width respectively
H : The output residual of a residual block
k : Kernel in a convolution
K : Batch size
K_1, K_2 : Spatial dimensions of a kernel; height and width respectively
K : A 4-D tensor describing a kernel, used in the convolution operation c within a neural network
m : 1st moment estimate of the loss gradient
m̂ : Bias-corrected 1st moment estimate
n_in : The number of neurons in the previous layer for a connection
n_out : The number of neurons in the next layer for a connection
N : Number of neurons in the output layer
N_L : Number of neurons in the last hidden layer
p_keep : The fraction of neurons to be kept during dropout
P : Number of input channels of an image or feature map to a convolution
Q : Number of output channels from a convolution
r_1, r_2 : Split parameters describing when to go from simulated data to real data during training
s_1, s_2 : The spatial strides within a convolution or pooling layer, along the height and width respectively
v : 2nd moment estimate of the loss gradient
v̂ : Bias-corrected 2nd moment estimate
w : Weight matrix in a fully connected layer
x : Model input
x̂ : Normalized input to a batch normalization layer
X : A 3-D tensor describing an image or feature map
y : Correct or desired output
ŷ : Model output (or prediction)
z : A vector representing a fully connected layer, pre-activation
Z : A 3-D tensor describing the output feature map of the convolution operation
α : Learning rate
β : Applied shift in a batch normalization layer
β_1, β_2 : Exponential decay rates for the 1st and 2nd moment estimates of the loss gradient
γ : Applied scale in a batch normalization layer
δ : The Kronecker delta
ε : A small number for the purpose of avoiding division by zero
θ : The compression factor in a DenseNet network architecture
κ : The growth rate in a DenseNet network architecture
λ : The intermediate partial derivative ∂C/∂z in the backpropagation algorithm
µ : Friction coefficient
µ_hi : Friction coefficient, indicating high friction (µ ≥ 0.6)
µ_me : Friction coefficient, indicating medium friction (0.2 ≤ µ < 0.6)
µ_B : The batch mean of an input or feature map in a batch normalization layer
σ : Activation function
σ_B^2 : The batch variance of an input or feature map in a batch normalization layer
ω : Any parameter (weight or bias) within a neural network
Ω : A vector containing all the parameters in a neural network
∇ = (∂/∂x_1, ..., ∂/∂x_n) : The nabla operator
⊙ : Hadamard product, or element-wise multiplication of two vectors or matrices of equal size

List of Figures

1  An illustration of a neural network with fully connected layers.
2  Four common activation functions (blue solid line) together with their derivatives (red dashed line).
3  A neural network (a) without dropout and (b) with dropout.
4  An illustration of a convolutional neural network.
5  The convolution operation between the input X and the kernel K producing the feature map Z.
6  Max pooling of a 4×4 image with a kernel size of 2×2 and a stride of 2.
7  The VGG neural network architecture. (Image from [26].)
8  The residual block with (a) same and (b) different input and output sizes.
9  A dense block in the DenseNet architecture where every layer is connected together. (Image from [24].)
10 Example images of roads with (a) high friction and (b) medium friction.
11 Examples of simulated images of roads with (a) high friction and (b) medium friction.
12 A plot illustrating the definition of the split parameters r_1 and r_2.
13 An example where left-to-right flip has been applied to an image.
14 An example where rotation has been applied to an image.
15 An example where translation has been applied to an image.
16 An example where scaling has been applied to an image.
17 An example where noise has been applied to an image.
18 An example where intensity modification has been applied to an image.
19 An example where cutout has been applied to an image.
20 An illustration of multiple-crop evaluation.
21 Trained kernels in the first convolution layer (a) without feature map dropout and (b) with feature map dropout. Left: Local color normalization. Right: Global color normalization.
22 Some selected kernels from Figure 21(b) when training with feature map dropout, highlighting the similarity between kernels.
23 Training, testing and simulation plots when training a DenseNet-A network on both simulated and real data. The data mixture scheme described in Section 3.2.3 has been used, where the vertical lines correspond to the split parameters r_1 = 50 and r_2 = 100 epochs.
24 Trained kernels in the first layer after (a) pre-training on simulated data and then (b) training on real data. Left: Local color normalization. Right: Global color normalization.
25 Training results for VGG-A. The thin lines correspond to each of the five subsets described in Section 3.2.1 and the thick lines are the averages over the subsets.
26 Training results for VGG-B. The thin lines correspond to each of the five subsets described in Section 3.2.1 and the thick lines are the averages over the subsets.
27 Training results for ResNet-A. The thin lines correspond to each of the five subsets described in Section 3.2.1 and the thick lines are the averages over the subsets.
28 Training results for ResNet-B. The thin lines correspond to each of the five subsets described in Section 3.2.1 and the thick lines are the averages over the subsets.
29 Training results for DenseNet-A. The thin lines correspond to each of the five subsets described in Section 3.2.1 and the thick lines are the averages over the subsets.
30 Training results for DenseNet-B. The thin lines correspond to each of the five subsets described in Section 3.2.1 and the thick lines are the averages over the subsets.
31 Example images and their corresponding label and prediction. High or Medium indicates the true label and the percentages indicate the output confidence for the true label. The border indicates whether the prediction was correct, where green corresponds to a correct prediction and red to a false prediction. (The prediction is correct if the output confidence of the true label exceeds 50 %.)
32 Training and validation plots for DenseNet-A, highlighting the close similarity between the two data sets.
33 The prediction accuracy as a function of the fraction of training data points used in the training data set. The prediction accuracy has been computed after the same number of weight updates (2000) for each run. The vertical lines indicate one standard deviation.
34 An example of an objective function with a narrow global minimum which is not robust against perturbations. In their presence, the wider local minimum would be preferred.

List of Tables

1  The specific network compositions of the VGG-type architectures.
2  The specific network compositions of the ResNet-type architectures.
3  The specific network compositions of the DenseNet-type architectures.
4  Summary of the six different networks that were implemented.
5  Class distributions for the real data set.
6  Results for the six different networks. The values are averaged over the five subsets and over the last 20 epochs of training. The test metrics in bold indicate the best performance.
7  Results for the five different subsets where the indicated subset was used as the test set and the other four assembled the training set. The values are averaged over every implemented network and over the last 20 epochs of training.

1 Introduction

Research within the field of deep learning is currently progressing at a rapid pace. New methods are published every year and the extent to which deep learning is applied continually expands. So far, it has been applied in fields such as computer vision, audio recognition, bioinformatics and natural-language processing, to name a few. This technology is now reaching the private sector, where car manufacturers, amongst other actors, are beginning to realize the potential of deep learning and have begun applying it to active safety within autonomous driving. This thesis aims to explore how deep learning can be applied to a specific task related to autonomous driving, namely the task of estimating the road friction.

1.1 Background

This thesis has been conducted at a company called NIRA Dynamics, which develops software solutions for vehicle safety. One of their products is the Tire Grip Indicator (TGI), which estimates the road friction using wheel signals and other automotive-grade sensors. With this approach, the accuracy of the friction estimation is better at higher accelerations. One aspect of this thesis is to see if a deep learning approach using cameras could be implemented to complement TGI at constant speeds.

Monitoring the road friction has many applications and benefits. Apart from active safety within autonomous driving, on the scale of a single vehicle, an on-board road friction monitoring system could be used to alert the driver of a slippery road. Usually, drivers behave differently depending on their road awareness. For example, on dry asphalt, drivers tend to drive more dynamically as they know that they have good grip, but on gravel or snow-covered roads, they tend to drive more cautiously since they know that the tires can begin to slip more easily. The danger comes when the road conditions unforeseeably change and the grip decreases without warning. In these situations, an automated system for friction estimation is a great aid. From a larger perspective, road friction measurements from several cars could be combined in a large-scale cloud-based system to map the friction of entire cities or other large road networks. In this way, even cars without road friction monitoring systems could be alerted of slippery roads, for example through a mobile application based on GPS.

Since modern cars are equipped with cameras these days, they could be used in addition to other sensory data to improve software solutions for vehicle safety. This is in good alignment with the great success of convolutional neural networks (CNN) within the field of image processing. CNNs have been shown to be extremely powerful in tasks such as image classification, object detection and image segmentation, surpassing human-level performance in some cases. This motivates the approach of applying CNNs to estimate the road friction based on image data.

The invention of neural networks was inspired by the neuron activity in the brain. A neural network is, in a sense, a much simplified artificial representation of a human brain where information is propagated through a series of layers of neurons. Neural networks can be applied in many different fields; it all depends on how they are trained. These networks are trained by exposing them to data and, in a sense, showing them how to perform a specific task. An error term is mathematically formulated which depends on the output of the network, and is used to iteratively adjust the connections between the neurons in order to increase the performance. This process, however, requires large amounts of data. In a blog post from NVIDIA [1], the problem of data collection was addressed. An estimate was made, based on conservative assumptions, of the amount of raw data that would be needed in order to develop systems for autonomous vehicles. The required size of the raw data was estimated to be more than 200 PB.

1.2 Project Goal

The massive success of neural networks within image processing, in combination with the fact that many modern cars are equipped with cameras, has sparked interest in applying deep learning to autonomous driving. To solve the task of creating reliable systems for autonomous driving, many sub-problems need to be overcome. One of these sub-problems is the task of automatically determining the road conditions, including the friction. The goal of this thesis is to investigate if it is possible to apply deep learning to the task of estimating the road friction, and if not, what challenges need to be overcome. Additionally, the process of collecting and labeling data according to friction is time consuming and difficult. For this reason, a further goal is to investigate if simulated data can be of any help in this task.

2 Theory

The fundamental goal of statistical modelling is to formulate a model based on incomplete knowledge about an underlying process. What is available is data sampled from a distribution generated by the unknown process, where statistical assumptions are made regarding the sample distribution and the data generation. The final model should then output, or predict, the outcome of the process outside the scope of the sample data. Model validation is then performed on a separate set of sampled data, drawn from the same distribution but not used in the model formulation. Machine learning, which is a branch of computer science, incorporates these notions to solve problems such as classification and regression by letting an algorithm learn from data instead of explicitly defining a model in terms of computer code.

Classification problems are a set of problems where the goal is to categorize data points into a set of pre-defined classes based on some attributes of the data. Concretely, these might involve image classification, where an algorithm is to describe what appears in an image, or spam detection, where an e-mail is to be classified as spam or not depending on its contents. Regression problems, on the other hand, aim to estimate some continuous hidden variable based on other observable variables. Within image processing, a common regression problem is object detection, which includes finding and estimating the position of objects in an image. Other applications include estimating the momentum and trajectory of subatomic particles given some measurements from a particle detector, or predicting the price of a house based on its size and location. In recent years, neural networks have become a popular approach to many machine learning problems, especially within the field of computer vision.

2.1 Artificial Neural Networks

Artificial neural networks (ANN), or simply neural networks (NN), have been around since 1943 when they were first introduced by McCulloch & Pitts [2] within the field of neurophysiology. They were inspired by nervous activity in the brain and set out to model so-called nerve nets. Based on the fact that neurons have an all-or-none character, they used propositional logic to create the very first neural networks, which would later be called McCulloch-Pitts (MP) nerve nets. Rosenblatt [3] was inspired by this idea and built on the notion by introducing the concept of association cells. These association cells formed an intermediate layer in the MP nets whose task was to extract features, much like the hidden layers in today's networks. Using this approach, Rosenblatt set out to model a perceptron, or a pattern recognition device. Even though Rosenblatt's methodology closely resembles the current structure of neural networks, neural networks would not become fully applicable and competitive with other, much simpler methods until much later. One key factor at play was the fact that neural networks required vast amounts of data and computing power, which were not available at the time. Another key component that was missing was an efficient way of training the networks, which was first introduced in 1974 by Werbos [4], where errors were propagated backwards through a multi-component system by differentiation using the chain rule. The more general technique is called automatic differentiation, and Werbos finally applied it to neural networks in 1982 [5] in the form that is used today. The backpropagation algorithm is described in further detail in Section 2.3.1.
2.2 Fully Connected Networks

A fully connected neural network is the simplest variation of neural networks and is illustrated in Figure 1. This type of network consists of an input and an output layer which are connected through a series of one or more hidden layers. If the number of hidden layers is more than one, then the network is categorized as a deep network. Each layer consists of neurons, where each neuron in one layer is connected to every neuron in the next layer, hence the name fully connected network. The connections are represented by real numbers and determine how much and what information is passed on to the next layer. In modern terminology these are referred to as weights and are subject to optimization during training. Additionally, a bias term is added to each neuron. In Figure 1 the bias term is represented by an additional neuron with a constant value of one, whose weights are also optimized during training.

Figure 1: An illustration of a neural network with fully connected layers.

This type of network falls within the family of feed-forward networks. These are networks where information is only propagated forward through the network, without any feedback loops. So-called recurrent neural networks, for example, are not feed-forward networks since their output is fed back into the network. Recurrent networks are usually applied to time series.

Since every hidden layer in a fully connected neural network is an array of neurons, it can be represented by a vector a^l, where l is the layer number. a^0 corresponds to the input layer and is also referred to as x. Subsequently, the weights connecting layer l-1 to layer l can be represented by a matrix w^l, where element w^l_{ij} is the connection from neuron j in layer l-1 to neuron i in layer l. Similarly, the bias terms are represented by a vector b^l. The notation may vary in different literature, but this is the notation used by Michael Nielsen in his book Neural Networks and Deep Learning [6] and has been adopted in this thesis. Using this notation, layer l is computed as a linear combination of the neurons in layer l-1:

    a^l = w^l a^{l-1} + b^l.    (1)

However, when only computing linear combinations in each layer, the resulting model would simply be linear, regardless of the number of layers in the network. In order to introduce some complexity, a non-linear activation function σ is applied to the linear combination before being passed on through the network. Layer l is thus computed by

    a^l = σ^l(w^l a^{l-1} + b^l),    (2)

where σ^l is the activation function applied in layer l. The activation function is a design choice and can technically be any scalar or vector-valued function, but is in practice scalar and monotonically increasing. Note that for the case when σ is a scalar function it is applied element-wise for vector inputs. Currently, the most common activation function is the rectified linear unit (ReLU), which is defined as σ(x) = max(0, x). Other activation functions are also used, with different properties, and are described further in Section 2.2.1. Finally, the output layer ŷ is computed in the same way as the hidden layers, i.e. as

    ŷ = σ^y(w^y a^L + b^y),    (3)

where L is the total number of hidden layers in the network, σ^y is the activation function applied in the output layer, and w^y and b^y are the weights and biases between the last hidden layer L and the output layer.

2.2.1 Activation Functions

The choice of activation function has proven to be an important factor when training deep neural networks. Figure 2 shows some of the common activation functions and they are further described below.

Figure 2: Four common activation functions (blue solid line) together with their derivatives (red dashed line).

Sigmoid
The sigmoid function is a smooth, infinitely differentiable function and is usually the first activation function that is mentioned when introducing neural networks. It is defined as

    σ(x) = 1 / (1 + e^{-x}),    dσ/dx = e^{-x} / (1 + e^{-x})^2,    (4)

and assumes values on the interval (0, 1). In a sense, it is practical since it normalizes the values, preventing a single neuron from dominating within a specific layer. One problem, however, is that if large (absolute) values are fed into the function they get pushed to the so-called saturated regimes, resulting in gradients that are close to zero. This is problematic during backpropagation since small gradients mean that the error is propagated very slowly, thus resulting in slow convergence [7]. Additionally, it has a mean value of 0.5 (assuming an input distribution with mean zero), which means that the magnitude of the output will depend on the number of neurons, since the output will be a sum over positive numbers only. In this case the network will rely heavily on the bias terms to assume negative values in order for the output not to fall far into the saturated regimes. It does have some uses, however, even in deep neural networks. For example, it can be used as the output activation function σ^y if each neuron in the output layer is to represent a probability.

Tanh
Tanh is basically just a shifted version of the sigmoid function and assumes values on the interval (-1, 1). It is defined as

    σ(x) = (e^x - e^{-x}) / (e^x + e^{-x}),    dσ/dx = 4 / (e^x + e^{-x})^2,    (5)

and overcomes the problem of a non-zero mean, but it also has saturated regimes resulting in gradients close to zero.

ReLU
A simple activation function that overcomes the problems of sigmoid and tanh is the rectified linear unit (ReLU), defined as

    σ(x) = max(0, x),    dσ/dx = { 0 for x < 0,  1 for x ≥ 0 }.    (6)

It is a common activation function for ANNs and has become the go-to function for deep CNNs. ReLU does not have saturated regimes, allowing the gradient to flow more easily through the network. This results in faster convergence than for the sigmoid-like functions, which has also been empirically tested [8]. Additionally, ReLU is slightly faster to compute and results in a sparse output, since many elements will be zero.

ELU
A concern with the ReLU activation function has been that if a neuron gets pushed far into the zero regime for negative inputs, the gradient will be zero and the neuron might potentially die [9]. This is called the dying ReLU problem and might happen either as a result of the initialization of the weights or during training if the learning rate is too high. For this reason several variants of ReLU have been formulated, among which one is the exponential linear unit (ELU) [10], defined as

    σ(x) = { e^x - 1 for x < 0,  x for x ≥ 0 },    dσ/dx = { e^x for x < 0,  1 for x ≥ 0 }.    (7)

The idea is that if a neuron falls far into the zero regime it might still have a chance to recover, since the gradient of ELU is non-zero for all inputs.

Softmax
In classification problems it is common to use an activation function on the output layer which returns values between zero and one. In these problems the output neurons can be interpreted as class probabilities. In classification problems where each data point corresponds to exactly one class, the softmax activation function is the most common and is defined as

    σ_i(x) = e^{x_i} / Σ_{n=1}^{N} e^{x_n},    ∂σ_i/∂x_j = (e^{x_i} / Σ_{n=1}^{N} e^{x_n}) (δ_{ij} - e^{x_j} / Σ_{n=1}^{N} e^{x_n}),    (8)

where N is the number of classes and δ_{ij} is the Kronecker delta. Softmax simply transforms a vector x to a probability distribution where the sum of all elements in the output vector is one. The elements of the vector x are sometimes referred to as logits, which are the log-odds for each class.

Identity
In regression problems where the output spans an arbitrary real interval there is no need to use an activation function in the output layer, since the output from a neural network layer naturally spans the real axis. Technically, this is the absence of an activation function, but it has still been given a name: the identity activation function, where σ(x) = x.
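
To make the layer equations (1)-(3) and the activation functions above concrete, the following is a minimal NumPy sketch of a forward pass through a fully connected network with ReLU hidden layers and a softmax output. The layer sizes and variable names are illustrative and not taken from the thesis implementation.

```python
import numpy as np

def relu(z):
    # ReLU activation, equation (6)
    return np.maximum(0.0, z)

def softmax(z):
    # Softmax activation, equation (8); shift by max(z) for numerical stability
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def forward(x, weights, biases):
    """Forward pass, equations (1)-(3): a^l = sigma^l(w^l a^(l-1) + b^l)."""
    a = x
    for w, b in zip(weights[:-1], biases[:-1]):
        a = relu(w @ a + b)                        # hidden layers
    return softmax(weights[-1] @ a + biases[-1])   # output layer

# Illustrative example: 4 inputs, one hidden layer of 8 neurons, 2 output classes
rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 4)), rng.normal(size=(2, 8))]
biases = [np.zeros(8), np.zeros(2)]
y_hat = forward(rng.normal(size=4), weights, biases)
print(y_hat, y_hat.sum())  # class probabilities summing to one
```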

2.3 Training

In machine learning, there are different types of learning methods, broadly grouped into supervised learning, unsupervised learning and reinforcement learning. In supervised learning, data is shown to a model together with the correct label of the data, meaning that it is known beforehand what the correct output should be, given some input. Typical problems in supervised learning are classification and regression. In unsupervised learning, however, no label or categorization is available and a model has to be created given only some input data. Clustering is a prime example of unsupervised learning. In reinforcement learning, an agent is put into an environment where, given the state, it performs some action. It is not known if a certain action is correct or not. Instead, the agent is rewarded based on some reward system where specific actions are not directly connected to specific rewards. The agent has to learn by itself which actions yield high rewards. In this thesis supervised learning was used, since the labels (friction) of the images were known.

Supervised learning is applied to feed-forward networks by first feeding the input data x through the network, yielding some predicted output ŷ. The prediction is then compared to the ground truth output y and an error (or loss) term is formulated. The error is then propagated backwards through the network and the weights are adjusted slightly along the way to reduce the error. This process is called backpropagation and is described in detail in Section 2.3.1. In the backpropagation algorithm, the gradient of the loss with respect to the model parameters is computed, which is in turn used by a gradient-based optimization algorithm to update the parameters. Different optimization methods are described in Section 2.3.3.

2.3.1 Backpropagation

As mentioned previously, a network is trained by computing an error term and backpropagating it through the network. The error is mathematically formulated in terms of a loss function C(ŷ, y), where ŷ is the predicted output of the network and y is the desired output. Just like the activation functions, the loss function is a design choice and is supposed to reflect how far away the prediction is from the correct answer. For a classification problem, a good metric that describes a model is the prediction accuracy. This, however, is discontinuous in its gradients with respect to the weights and is difficult to implement in a sound way. Instead, a different metric is formulated which is a continuous function of the predicted output. A common loss function is the quadratic loss function, defined as

    C(ŷ, y) = (1/2)(ŷ - y)^T (ŷ - y),    ∂C/∂ŷ = ŷ - y,    (9)

which computes the squared error of each output neuron independently. Note that C is a scalar, and ŷ as well as ∂C/∂ŷ are column vectors, where the latter contains the partial derivatives. The quadratic loss is a good general loss function which works for both classification and regression problems. A loss function that is more specifically designed for classification problems with exactly one correct class is the cross-entropy loss function, defined as

    C(ŷ, y) = -y^T ln(ŷ),    ∂C/∂ŷ = -(y_1/ŷ_1, ..., y_N/ŷ_N)^T,    (10)

where N is the number of output neurons. Since the output vector in a classification problem describes the class probabilities, the elements in y are either 0 or 1, with only one element being 1. This is called one-hot labeling. This means that only the correct class is being considered when computing the cross-entropy loss. Additionally, if the prediction of the correct class tends towards 0 the loss tends towards infinity, meaning that false certainty is heavily penalized.
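
As a concrete illustration of (9) and (10), here is a small NumPy sketch of the two loss functions and their gradients with respect to the prediction; the function names and the small epsilon guard are illustrative additions, not part of the thesis.

```python
import numpy as np

def quadratic_loss(y_hat, y):
    # Equation (9): C = 0.5 (y_hat - y)^T (y_hat - y), dC/dy_hat = y_hat - y
    diff = y_hat - y
    return 0.5 * diff @ diff, diff

def cross_entropy_loss(y_hat, y, eps=1e-12):
    # Equation (10): C = -y^T ln(y_hat), dC/dy_hat = -y / y_hat
    # eps guards against log(0) when a predicted probability is exactly zero
    return -y @ np.log(y_hat + eps), -y / (y_hat + eps)

# One-hot label for class 1 and a confident, correct prediction
y = np.array([0.0, 1.0, 0.0])
y_hat = np.array([0.05, 0.9, 0.05])
print(cross_entropy_loss(y_hat, y)[0])  # small loss since the correct class is likely
```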
Now, the goal of backpropagation is to compute the gradient of the loss function with respect to each parameter. Let Ω be a vector containing all the weights and biases of the network. Since ŷ = ŷ(Ω) is an analytical function, the gradient of the loss can be computed with respect to each weight. This is done using the chain rule. First, by differentiating C with respect to ŷ we get

    ∂C/∂ω = Σ_{n=1}^{N} (∂C/∂ŷ_n)(∂ŷ_n/∂ω) = (∂C/∂ŷ)^T (∂ŷ/∂ω),    (11)

where ∂C/∂ŷ = (∂C/∂ŷ_1, ..., ∂C/∂ŷ_N)^T, ∂ŷ/∂ω = (∂ŷ_1/∂ω, ..., ∂ŷ_N/∂ω)^T, ω ∈ Ω is any weight or bias within the network and N is the number of output neurons. The last term can be computed by differentiating (3), yielding

    ∂ŷ/∂ω = (dσ^y/dz^y) ⊙ (w^y ∂a^L/∂ω),    (12)

where dσ^y/dz^y = (dσ^y/dz^y_1, ..., dσ^y/dz^y_N)^T, ∂a^L/∂ω = (∂a^L_1/∂ω, ..., ∂a^L_{N_L}/∂ω)^T, σ^y = σ^y(z^y) is a scalar function of z^y = w^y a^L + b^y, N_L is the number of neurons in the last hidden layer and ⊙ denotes the Hadamard product, or element-wise multiplication. Similarly, by differentiating (2) the error can be backpropagated all the way to the input:

    ∂a^l/∂ω = (dσ^l/dz^l) ⊙ (w^l ∂a^{l-1}/∂ω),    (13)

where dσ^l/dz^l, z^l, a^l and ∂a^{l-1}/∂ω are defined similarly as in (12). Furthermore, these gradients can be compressed into the following four equations, which describe the full backpropagation algorithm for a fully connected neural network (adapted from [11]):

    λ^y = σ^y'(z^y) ∇_ŷ C(ŷ),    (14a)
    λ^l = σ^l'(z^l) ((w^{l+1})^T λ^{l+1}),    (14b)
    ∂C/∂w^l_{ij} = λ^l_i a^{l-1}_j,    (14c)
    ∂C/∂b^l_i = λ^l_i,    (14d)

where the element λ^l_i corresponds to ∂C/∂z^l_i, and λ^y is the same as λ^{L+1}. Note that these equations describe the more general case where the activation functions are not assumed to be scalar functions, but can also be vector-valued. This means that σ^y' and σ^l' are matrices. For scalar functions they will only have non-zero elements on the diagonal.
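
The following is a compact NumPy sketch of equations (14a)-(14d) for a network with element-wise (scalar) activations, reusing the forward-pass structure from the earlier sketch. The variable names are illustrative and this is not the thesis implementation.

```python
import numpy as np

def backprop(x, y, weights, biases, act, act_prime, loss_grad):
    """Return dC/dw and dC/db for one data point, following (14a)-(14d)."""
    # Forward pass, storing pre-activations z^l and activations a^l
    a, zs, activations = x, [], [x]
    for w, b in zip(weights, biases):
        z = w @ a + b
        a = act(z)
        zs.append(z)
        activations.append(a)

    # Backward pass
    grads_w = [None] * len(weights)
    grads_b = [None] * len(weights)
    lam = act_prime(zs[-1]) * loss_grad(activations[-1], y)   # (14a)
    for l in reversed(range(len(weights))):
        grads_w[l] = np.outer(lam, activations[l])            # (14c)
        grads_b[l] = lam                                      # (14d)
        if l > 0:
            lam = act_prime(zs[l - 1]) * (weights[l].T @ lam)  # (14b)
    return grads_w, grads_b

# Sigmoid network with the quadratic loss, as in equations (4) and (9)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sigmoid_prime = lambda z: sigmoid(z) * (1.0 - sigmoid(z))
quad_grad = lambda y_hat, y: y_hat - y

rng = np.random.default_rng(1)
weights = [rng.normal(size=(5, 3)), rng.normal(size=(2, 5))]
biases = [np.zeros(5), np.zeros(2)]
gw, gb = backprop(rng.normal(size=3), np.array([0.0, 1.0]),
                  weights, biases, sigmoid, sigmoid_prime, quad_grad)
```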

2.3.2 Batch Training

The optimization problem at hand is to minimize the total loss over every data point in the whole data set. However, most data sets contain too many data points for it to be feasible to include every data point when computing the gradients in every iteration. On the contrary, using only one data point to compute the weight gradients would result in very noisy gradients, which would lead to slow convergence. Therefore the gradients are computed and averaged over random data points in a batch of data. This requires the loss function to be written as

    C = (1/K) Σ_x C_x,    (15)

where the total loss C is computed by averaging the individual losses C_x for each data point x within a batch of size K. The batch size is chosen such that the gradient computation is accurate enough and that each iteration completes within a feasible time.

2.3.3 Gradient-Based Optimization

In essence, optimization is the task of finding an optimum of a given objective function with respect to some variables. This is basically what is done when training a neural network, where the objective function is the loss function and the variables are the weights and biases. Optimization is generally applied to non-linear, high-dimensional functions where it is impractical or simply impossible to find the optima analytically. Instead, iterative methods are used where we take advantage of the gradient, hence gradient-based optimization. There exist methods which not only use the first-order gradient but also take advantage of higher-order gradients, such as the Hessian. When it comes to neural networks, however, it would be too computationally expensive to compute the Hessian in each iteration. Therefore only the first-order gradient is used when training neural networks, which provides a sufficient approximation of the local behaviour of the loss surface. The first-order gradient of the loss function is computed analytically during backpropagation, which is very convenient. Since the goal is to minimize the loss, the simplest approach is to change the weights in the opposite direction of the gradient by some factor α. Following the notation in Section 2.3.1, where ω represents any parameter in the network, the weight update rule can be written as

    ω_{k+1} = ω_k - α (∂C/∂ω)(ω_k),    (16)

where k is the iteration variable and ω_k is the weight at iteration k. This method is called gradient descent (GD), and α is generally called the step size, but when training networks it is usually referred to as the learning rate. A subtle detail in (16) is that it uses the full analytical gradient which, as mentioned, is not feasible to compute for large data sets. This is why the gradient is computed on a smaller batch of data instead. When this approximation of the gradient is used, the optimization method is called stochastic gradient descent (SGD), since the gradient is, in some sense, stochastically computed. To get a smoother approximation of the gradient, it is common to apply momentum in the form of an exponentially moving average. In this case, the weight update rule is written as

    m_{k+1} = β_1 m_k + (1 - β_1)(∂C/∂ω)(ω_k),    (17a)
    ω_{k+1} = ω_k - α m_{k+1},    (17b)

where β_1 ∈ [0, 1) is the momentum parameter and controls how much the previous gradient computations should be weighted compared to the current gradient. Note that β_1 = 0 corresponds to SGD. Typical values of β_1 are 0.9 or 0.99. A value closer to 1 is recommended for smaller batch sizes or if the data is inherently noisy. However, excessively high values will lead to oscillations around the optimum. Nesterov [12] further enhanced the moment estimation of the gradient by looking ahead in the direction of the momentum. This is called Nesterov accelerated momentum and has been shown to yield faster convergence than normal momentum. It is computed by evaluating the gradient at the point located one step in the direction of where the momentum is pointing, namely by

    m_{k+1} = β_1 m_k + (1 - β_1)(∂C/∂ω)(ω_k - α β_1 m_k),    (18a)
    ω_{k+1} = ω_k - α m_{k+1}.    (18b)

In addition to using momentum in the optimization algorithm, adaptive methods automatically tune the learning rate for each parameter individually during training. This can be done in different ways. Kingma & Ba [13] accomplished this by using adaptive moment estimation in their method, called Adam. Instead of only using the 1st moment estimate m as in (17) and (18), they also used the 2nd moment estimate v. The 2nd moment is the gradient squared and is also implemented as an exponentially moving average, controlled by the parameter β_2. Additionally, they used the bias-corrected moment estimates m̂ and v̂ to account for the bias of the moment estimates towards zero, an effect of the exponentially moving average which is evident in the early stages of training. The weight update scheme for Adam is described in Algorithm 1 (taken from [13]). Adam has become the most popular adaptive optimization algorithm for neural networks and is labeled as the best choice by some practitioners. Recently, however, scepticism has been directed towards adaptive methods, as they tend to converge towards drastically different optima than non-adaptive methods. Wilson et al. [14] demonstrated this by formulating a binary classification problem where SGD achieved zero test error but the adaptive methods (including Adam) attained test errors of a half, which would be the error for a random classifier. This is connected to the topic of the generalization of neural networks and is discussed further in Section 5.4.
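
For reference, a minimal NumPy sketch of the non-adaptive updates in (16)-(18), for comparison with the full Adam scheme in Algorithm 1 below; the function names and the toy objective are illustrative.

```python
import numpy as np

def sgd_step(w, grad, lr):
    # Equation (16): plain (stochastic) gradient descent
    return w - lr * grad(w)

def momentum_step(w, m, grad, lr, beta1=0.9, nesterov=False):
    # Equations (17)-(18): exponentially averaged gradient, optionally
    # evaluated at the look-ahead point w - lr * beta1 * m (Nesterov)
    point = w - lr * beta1 * m if nesterov else w
    m_new = beta1 * m + (1.0 - beta1) * grad(point)
    return w - lr * m_new, m_new

# Toy objective C(w) = 0.5 ||w||^2 with gradient w
grad = lambda w: w
w, m = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    w, m = momentum_step(w, m, grad, lr=0.1, nesterov=True)
print(w)  # approaches the minimum at the origin
```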

Algorithm 1: The Adam optimizer.

Require: α: step size
Require: β_1, β_2 ∈ [0, 1): exponential decay rates for the moment estimates
Require: f(Ω): stochastic objective function with parameters Ω
Require: Ω_0: initial parameter vector

m_0 ← 0 (initialize 1st moment vector)
v_0 ← 0 (initialize 2nd moment vector)
t ← 0 (initialize time step)
ε ← 10^-8 (stability parameter)
while Ω_t not converged do
    t ← t + 1
    g_t ← ∇_Ω f_t(Ω_{t-1}) (get gradients with respect to the stochastic objective at time step t)
    m_t ← β_1 m_{t-1} + (1 - β_1) g_t (update biased 1st moment estimate)
    v_t ← β_2 v_{t-1} + (1 - β_2) g_t ⊙ g_t (update biased 2nd raw moment estimate)
    m̂_t ← m_t / (1 - β_1^t) (compute bias-corrected 1st moment estimate)
    v̂_t ← v_t / (1 - β_2^t) (compute bias-corrected 2nd raw moment estimate)
    Ω_t ← Ω_{t-1} - α m̂_t / (√v̂_t + ε) (update parameters)
end while
return Ω_t (resulting parameters)
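
A NumPy sketch of Algorithm 1, assuming a generic gradient function is supplied (for a network, the gradient from backpropagation); the function name, defaults and toy objective are illustrative, not the thesis implementation.

```python
import numpy as np

def adam(grad, omega0, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a stochastic objective given its gradient, following Algorithm 1."""
    omega = omega0.copy()
    m = np.zeros_like(omega)   # 1st moment vector
    v = np.zeros_like(omega)   # 2nd moment vector
    for t in range(1, steps + 1):
        g = grad(omega)                        # gradient at time step t
        m = beta1 * m + (1 - beta1) * g        # biased 1st moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # biased 2nd raw moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        omega -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return omega

# Toy quadratic objective with minimum at (3, -1)
grad = lambda w: w - np.array([3.0, -1.0])
print(adam(grad, np.zeros(2), lr=0.05, steps=2000))
```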

2.3.4 The Vanishing/Exploding Gradient Problem

An unfortunate problem of the backpropagation algorithm is that gradients tend to decrease as the error is propagated deeper through the network. Since the partial derivatives are multiplied layer by layer, if the distribution of gradients lies below 1, then less and less of the error gets propagated. Consequently, different layers learn at different speeds, making it difficult to train the earlier layers. This is called the vanishing gradient problem. Similarly, if the gradient distributions are centered above 1, the opposite can happen, i.e. the weights diverge. This is called the exploding gradient problem. These problems have a larger impact in deep feed-forward and recurrent neural networks, which involve many gradient multiplications. Furthermore, the sigmoid and tanh activation functions have gradients in the intervals (0, 1/4] and (0, 1] respectively. If these are used in the hidden layers, the gradients decrease exponentially for each layer, causing the problem to emerge.

2.3.5 Weight Initialization

For a long time after the second era of neural network research in the 1990s, people almost gave up hope on neural networks since they seemed very difficult to train, which was in part due to the vanishing/exploding gradient problem [15]. It was not until 2010 that Glorot & Bengio [16] realized that the combination of initialization scheme and activation function played an important role, which sparked new interest in the field. At the time, the common approach for training NNs was to use sigmoid-like activation functions and initialize the weights randomly with a Gaussian distribution with mean 0 and variance 1. Glorot & Bengio found that this combination resulted in an increase of variance in the layer inputs with each subsequent layer. Example 2.1 highlights the problem with an initial variance of 1.

Example 2.1 (Gaussian initialization). Imagine a fully connected network with M neurons in the input layer x as well as in the hidden layers, where all weights are initialized with a normal distribution N(0, 1) with mean 0 and variance 1. For simplicity, omit the bias and assume that the inputs are binary variables where half are -1 and half are 1. Now, each neuron z^1_i (pre-activation) is computed by

    z^1_i = Σ_{j=1}^{M} w^1_{ij} x_j,    i = 1, ..., M.    (19)

Since this is simply a sum over M normally distributed variables, the variance of each hidden neuron is

    Var[z^1_i] = M.    (20)

As can be seen in the example, the variance increases with the number of neurons, pushing the values far into the saturated regimes of sigmoid and tanh where the gradient is close to zero. A potential solution could be to normalize the input to the input layer in order to account for the hidden layer widths, but in deep networks the effect would occur in every hidden layer and this solution would simply shift the problem a few layers. Instead, Glorot & Bengio proposed a weight initialization method which depends on the number of neurons n_in in the previous layer and n_out in the following layer. They formulated this method by constraining the output variance to be the same as the input variance. With this approach they arrive at an initial normal distribution for sigmoid-like activation functions with variance

    Var[ω] = 1 / n_in.    (21)

This keeps the variance constant during forward propagation but not during backpropagation, unless n_in = n_out. In order for the variance to be unchanged between layers during backpropagation, the variance of the initial normal distribution should be

    Var[ω] = 1 / n_out.    (22)

Since (21) and (22) cannot be satisfied simultaneously in the general case, a compromise between the two was formulated. To account for both forward- and backpropagation, a variance that works well is

    Var[ω] = 2 / (n_in + n_out).    (23)

Additionally, with this method, biases are initialized to zero. This method is called Xavier initialization after the author. He et al. [17] extended this method to ReLU-like activation functions. In this case the initial weight variance is

    Var[ω] = 4 / (n_in + n_out).    (24)

2.3.6 Dropout

Since neural networks often contain many more parameters than there are data points in the data set, they are prone to overfitting. To counteract this, several regularization methods have been developed. A simple yet somewhat counterintuitive method is dropout [18]. It works by randomly dropping some fraction of the neurons, by removing the neurons themselves as well as their connections, as illustrated in Figure 3. This forces the network to consider several features when making a prediction, making it less dependent on individual neurons and less prone to overfitting. Dropout is controlled by the hyperparameter p_keep, which is the probability that a given neuron is kept. This process is only applied when training the network. During testing, the full network is used and each activation is scaled down by a factor p_keep to account for the increased number of neurons.
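
A short NumPy sketch of the initialization variances in (21)-(24) and of a dropout mask controlled by p_keep, following the train/test behaviour described above; the helper names are illustrative assumptions, not code from the thesis.

```python
import numpy as np

def xavier_init(n_in, n_out, relu=False, rng=np.random.default_rng()):
    # Equations (23)-(24): Var[w] = 2/(n_in + n_out), doubled for ReLU-like activations
    var = (4.0 if relu else 2.0) / (n_in + n_out)
    return rng.normal(0.0, np.sqrt(var), size=(n_out, n_in))

def dropout(a, p_keep, training, rng=np.random.default_rng()):
    # During training, keep each neuron with probability p_keep;
    # at test time, use all neurons and scale the activations by p_keep
    if training:
        return a * (rng.random(a.shape) < p_keep)
    return a * p_keep

w = xavier_init(256, 128, relu=True)                      # weights for a 256 -> 128 layer
a = dropout(np.ones(128), p_keep=0.8, training=True)      # roughly 20 % of activations zeroed
```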

Figure 3: A neural network (a) without dropout and (b) with dropout.

2.4 Convolutional Neural Networks

As opposed to fully connected networks, convolutional neural networks (CNNs) do not connect every neuron in one layer to every neuron in the next layer, but instead take advantage of weight sharing. This is accomplished by applying convolutions on the hidden layers, where the filters, or kernels, are learnable parameters. The convolution operates by letting the kernel slide over the input while computing the output on one patch of the input at a time, taking into account the spatial relation of the elements in the input. Note that since we are dealing with images, 2D convolutions are used, which have 2D weights and consider the spatial relation in both the x- and y-direction. Parameter sharing is not the only advantage; convolutional networks also offer sparse interactions as well as equivariant representations [19]. This results in far fewer parameters overall and is very useful in applications where adjacent variables in the data are highly correlated, such as image processing as well as text and speech recognition. CNNs are usually constructed by repeating convolution and pooling layers and finally attaching one or more fully connected layers before the classification layer, as shown in Figure 4. This section covers the basic building blocks of convolutional neural networks.

Figure 4: An illustration of a convolutional neural network.

2.4.1 The Convolution Layer

Originally, the word convolution stems from the mathematical convolution operation between two functions, where the 1D continuous version is defined as

$(k * x)(t) = \int_{-\infty}^{\infty} k(\tau)\, x(t - \tau)\, d\tau.$ (25)

The result is another function, which intuitively is calculated by flipping one of the input functions and sliding it over the other. In some sense the convolution is a similarity measure, since it outputs a high value where the functions are similar. When talking about convolutional neural networks, the function k(t) is called the kernel and the function x(t) is called the input.

In signal processing, as well as in image analysis, the kernel is also referred to as a filter, which gives it a more intuitive interpretation. In (25), the convolution is defined between two continuous functions. In real-world applications, however, data is usually sampled into discrete data points which cannot be processed by these continuous operators. Instead one can use the discrete convolution

$(k * x)[n] = \sum_{m=-\infty}^{\infty} k[m]\, x[n - m],$ (26)

which is very similar to the continuous case except that it is calculated as a sum over discrete data points. This can be applied to mono audio, for example, where the waveform is sampled at a specific frequency. In images, however, the data is distributed over two dimensions and a 2D version of the convolution is required. The discrete 2D convolution, denoted by $*$, is defined by

$(k * x)[n_1, n_2] = \sum_{m_1=-\infty}^{\infty} \sum_{m_2=-\infty}^{\infty} k[m_1, m_2]\, x[n_1 - m_1, n_2 - m_2].$ (27)

This operation takes into consideration the spatial correlation in both the x- and y-direction of the image. As before, the image $x[n_1, n_2]$ is flipped before being multiplied by the kernel $k[n_1, n_2]$. In image processing, the flipping step is usually skipped, since it is more intuitive to think of just placing the kernel on top of the image. Technically the operation is then no longer a convolution but a cross-correlation, which is denoted by $\star$ and defined as

$(k \star x)[n_1, n_2] = \sum_{m_1=-\infty}^{\infty} \sum_{m_2=-\infty}^{\infty} k[m_1, m_2]\, x[n_1 + m_1, n_2 + m_2].$ (28)

The only difference between convolution and cross-correlation is a minus sign, which causes cross-correlation to lose the commutative property. This property is more relevant in mathematics than in applied fields such as image processing and convolutional neural networks, where the more intuitive cross-correlation operation is used and often referred to as convolution. For this reason cross-correlation will be referred to as convolution throughout this report as well.

One additional aspect of color images is that they are not represented by a single 2D grid but rather by three, one for each color channel. The most common color encoding is RGB, where each of the colors red, green and blue has its own channel. Therefore, an image X is a 3D tensor of size [D_1, D_2, P], where D_1 is the height, D_2 is the width and P is the number of channels, which is three for an RGB image. In CNNs, convolution layers are usually stacked in series to produce hidden layers called feature maps. These feature maps can be thought of as color images with an arbitrary number of channels and can also be represented by a tensor X. The kernel in a convolution layer is then represented by a 4D tensor K of size [K_1, K_2, P, Q], where K_1 is the height, K_2 is the width and Q is the number of output channels of the convolution. Additionally, in order to decrease the spatial size of the output, a stride can be applied to jump several pixels at a time. The convolution operation between an image or feature map X and a kernel K within a CNN is illustrated in Figure 5 and mathematically described as the operation

$Z_{i,j,q} = c(K, X, s_1, s_2)_{i,j,q} = \sum_{n=1}^{K_1} \sum_{m=1}^{K_2} \sum_{p=1}^{P} K_{n,m,p,q}\, X_{s_1(i-1)+n,\, s_2(j-1)+m,\, p},$ (29)

with $i = 1, \ldots, \frac{D_1 - K_1 + 1}{s_1}$, $j = 1, \ldots, \frac{D_2 - K_2 + 1}{s_2}$ and $q = 1, \ldots, Q$, where i and j index the height and width of the output, q indexes the output channel, and s_1 and s_2 are the strides along the height and width respectively. Z is the output feature map produced by the convolution. Note that the output dimensions of Z are smaller than those of the original input X, since the kernel is not applied outside or on the edge of the input.
Usually the input is padded on the edges before applying the convolution in order to preserve the original shape.
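For illustration, the operation in (29) can be written out directly with nested loops. This is a minimal NumPy sketch (far slower than the optimized implementations in deep learning libraries, and the function name is made up for this example):

import numpy as np

def conv2d(X, K, s1=1, s2=1):
    # Strided multi-channel convolution (cross-correlation) as in Eq. (29).
    # X: input of shape (D1, D2, P); K: kernel of shape (K1, K2, P, Q).
    D1, D2, P = X.shape
    K1, K2, _, Q = K.shape
    out_h = (D1 - K1) // s1 + 1
    out_w = (D2 - K2) // s2 + 1
    Z = np.zeros((out_h, out_w, Q))
    for i in range(out_h):
        for j in range(out_w):
            # Patch of X covered by the kernel at output position (i, j).
            patch = X[i * s1 : i * s1 + K1, j * s2 : j * s2 + K2, :]
            for q in range(Q):
                Z[i, j, q] = np.sum(patch * K[:, :, :, q])
    return Z

# A 6x6 RGB image convolved with two 3x3 kernels and stride 1
# gives a 4x4 feature map with two channels.
X = np.random.rand(6, 6, 3)
K = np.random.rand(3, 3, 3, 2)
print(conv2d(X, K).shape)  # (4, 4, 2)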

Figure 5: The convolution operation between the input X and the kernel K producing the feature map Z.

The necessary gradients for backpropagation are (adapted from [19])

$\frac{\partial C}{\partial K_{n,m,p,q}} = \sum_{i=1}^{\frac{D_1 - K_1 + 1}{s_1}} \sum_{j=1}^{\frac{D_2 - K_2 + 1}{s_2}} \frac{\partial C}{\partial Z_{i,j,q}}\, X_{s_1(i-1)+n,\, s_2(j-1)+m,\, p},$ (30a)

$\frac{\partial C}{\partial X_{i,j,p}} = \sum_{q=1}^{Q} \sum_{\substack{v,n \ \mathrm{s.t.}\ s_1(v-1)+n=i}} \ \sum_{\substack{w,m \ \mathrm{s.t.}\ s_2(w-1)+m=j}} \frac{\partial C}{\partial Z_{v,w,q}}\, K_{n,m,p,q},$ (30b)

for $n = 1, \ldots, K_1$, $m = 1, \ldots, K_2$, $p = 1, \ldots, P$ and $q = 1, \ldots, Q$. Equation (30a) is used to update the kernel weights and (30b) is used to further backpropagate the error through the network. The term $\frac{\partial C}{\partial Z_{i,j,q}}$ is known from the previous step in the backpropagation algorithm.

Note that convolution is performed in only two dimensions even though X is three-dimensional. Since the last dimension corresponds to the channels, convolution is not performed along it. Alternatively, one could think of it as a three-dimensional convolution where the stride in the channel dimension is equal to the number of channels P, resulting in an output of length one along this dimension.

A technique that is sometimes used in CNNs is to apply 1×1 convolutions [20]. This might seem pointless at first, since the operation does not consider adjacent pixels. The reason for using it is simply to change the number of channels, in cases where the exact number of channels is important for compatibility or efficiency reasons.

2.4.2 Pooling

Pooling layers can be applied in order to reduce the spatial dimensions of the feature maps within a network. This is done in a similar patch-wise fashion as in convolutions, but instead of performing element-wise multiplication with a kernel, a single value from each patch is passed to the resulting feature map. In a max pooling layer the maximum value of the patch is computed. Similarly, in an average pooling layer the average value of the patch is returned. As in convolution layers, a stride is applied, which ultimately reduces the spatial size. Figure 6 illustrates the max pooling operation.

The obvious difference between max and average pooling is that average pooling computes a smoothed value. Since max pooling selects the maximum value, its output distribution has a higher mean, as the sign is taken into account, while average pooling has a mean closer to zero. Additionally, in a max pooling layer the error is only backpropagated through one pixel in each patch: a small change in the lower-valued pixels of the patch does not affect the maximum value, and therefore the gradient through these pixels is zero.
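A corresponding minimal sketch of max pooling (illustration only; average pooling is obtained by replacing the maximum with the mean):

import numpy as np

def max_pool2d(X, k=2, s=2):
    # Max pooling over k x k patches with stride s.
    # X: feature map of shape (H, W, C); returns shape (out_h, out_w, C).
    H, W, C = X.shape
    out_h = (H - k) // s + 1
    out_w = (W - k) // s + 1
    Z = np.zeros((out_h, out_w, C))
    for i in range(out_h):
        for j in range(out_w):
            patch = X[i * s : i * s + k, j * s : j * s + k, :]
            Z[i, j, :] = patch.max(axis=(0, 1))  # channel-wise maximum of the patch
    return Z

# A 4x4 single-channel map pooled with kernel size 2 and stride 2, as in Figure 6.
X = np.arange(16, dtype=float).reshape(4, 4, 1)
print(max_pool2d(X)[:, :, 0])  # [[ 5.  7.] [13. 15.]]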

Figure 6: Max pooling of a 4×4 image with a kernel size of 2×2 and a stride of 2.

Pooling is also used in a separate context in the form of global pooling. The common approach when going from the last convolution layer to a fully connected layer is to simply flatten the last feature map, such that each pixel in every channel is assigned its own neuron. However, if the feature map is spatially large and contains many channels, this results in many neurons, and therefore many parameters. An alternative is to apply global average pooling to the feature map, which is simply a regular pooling layer whose stride extends over the whole spatial size of the feature map. Each channel is averaged and the resulting layer contains as many neurons as there are channels. Since fully connected layers naturally contain many parameters, they are sometimes excluded from CNNs entirely. In this case the global pooling layer is connected directly to the classification layer and the network is then called a fully convolutional neural network.

2.4.3 Batch Normalization

Inspired by the relatively early work of LeCun et al. in 1998 [21], where input normalization was recommended, Ioffe & Szegedy [22] extended the idea to perform normalization inside the network as well. In their paper they introduce a new method called batch normalization, which has become standard in today's networks. The authors motivate this approach by formulating the internal covariate shift problem. This is a problem when training neural networks that arises as a consequence of weight updates: when the weights are updated in each training iteration, so is the distribution of each layer's inputs. This leads to slow convergence and is addressed by normalizing the layer inputs. A further motivation for batch normalization is that it counteracts the vanishing/exploding gradient problem.

As the name suggests, normalization is performed on a per-batch basis, where batch statistics are computed in the form of mean and variance for each feature and layer. The features are then normalized such that the output mean is 0 and the variance is 1. Additionally, what makes the method so powerful is that the network can learn the optimal scale and shift. Ioffe & Szegedy accomplished this by introducing two new parameters, γ and β, which are multiplied with and added to the normalized features respectively. These are learnable parameters that allow the network to normalize each feature independently and to weight the features according to importance.

In a fully connected layer, each neuron is normalized and assigned its own normalization parameters. In a convolution layer, however, the normalization is performed across the spatial dimensions of the feature maps as well, yielding a larger effective batch size. This is important since it preserves the spatial correlation of the pixel values. If normalization had been done per neuron or per pixel in each feature map, not only would the number of parameters increase drastically, but it would also take much longer for the network to learn the optimal scale and shift, if it converged at all. Mathematically, a feature x is normalized to y, with a scale and shift given by the parameters γ and β, through the equations

$\mu_B = \frac{1}{K} \sum_{i=1}^{K} x_i,$ (31a)

$\sigma_B^2 = \frac{1}{K} \sum_{i=1}^{K} (x_i - \mu_B)^2,$ (31b)

$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}},$ (31c)

$y_i = \gamma \hat{x}_i + \beta,$ (31d)

where $\mu_B$ and $\sigma_B^2$ are the batch mean and variance respectively, $\hat{x}$ is the zero-centered and normalized input, K is the batch size, i indexes the data points within the batch and $\epsilon$ is a small constant to avoid division by zero. Since the mean value is subtracted from the input, batch normalization eliminates the need for a bias term, as it is in some sense included in the β parameter. The necessary gradients to perform backpropagation through a batch normalization layer are calculated using the chain rule and given by the equations (derived from [22])

$\frac{\partial C}{\partial \gamma} = \sum_{i=1}^{K} \frac{\partial C}{\partial y_i} \hat{x}_i,$ (32a)

$\frac{\partial C}{\partial \beta} = \sum_{i=1}^{K} \frac{\partial C}{\partial y_i},$ (32b)

$\frac{\partial C}{\partial x_i} = \frac{\gamma}{K\sqrt{\sigma_B^2 + \epsilon}} \left( K \frac{\partial C}{\partial y_i} - \sum_{j=1}^{K} \frac{\partial C}{\partial y_j} - \hat{x}_i \sum_{j=1}^{K} \frac{\partial C}{\partial y_j} \hat{x}_j \right),$ (32c)

where $\frac{\partial C}{\partial \gamma}$ and $\frac{\partial C}{\partial \beta}$ are used to update the two parameters and $\frac{\partial C}{\partial x_i}$ is used to propagate the loss further back through the network. $\frac{\partial C}{\partial y_i}$ is known, as it has been computed through backpropagation in the previous step.

Ideally, the features would be normalized based on the mean and variance of the whole training data set. However, since the weights of the network change in each iteration, so do the mean and variance. It is therefore not feasible to compute these statistics on the whole data set; instead they are computed within each batch. To better approximate the data set statistics it is common to maintain an exponential moving average of $\mu_B$ and $\sigma_B^2$ during training. In order to make the method independent of test data, the final averaged values are used at test time. Using this approach also avoids the need to evaluate test data in batches and allows for single data point evaluation.

2.4.4 The Convolution Cell

Since the introduction of batch normalization, it has become common to combine it with ReLU and a convolution layer to form the basic building block of CNNs. The order is not completely standardized, as both BN-ReLU-Conv [23, 24] and Conv-BN-ReLU [25] have been used. The advantage of BN-ReLU-Conv is that it can be used as the first cell directly on non-normalized data, since batch normalization takes care of the normalization.

2.4.5 Feature Map Dropout

When introduced, dropout was designed for fully connected layers where single neurons are dropped. If this is applied directly to convolution layers, individual pixels in the feature maps would be dropped. However, due to the spatial correlation between pixels in feature maps, this would not achieve the desired effect: the necessary information could still be extracted through adjacent pixels and the dropout would only slow down training. Instead, whole feature maps can be dropped. This is called feature map dropout and is controlled by the parameter p_keep, similar to regular dropout.

In a fully connected layer each neuron represents a feature, but in a convolution layer a feature is instead represented by a kernel. Since every kernel produces its own feature map, it is sensible to drop whole feature maps, making the network lose access to the feature itself. This forces the network to base its predictions on information from more features, reducing the risk of overfitting.
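As a minimal sketch (illustration only, not the thesis implementation), feature map dropout differs from regular dropout in that one keep/drop decision is drawn per channel instead of per activation:

import numpy as np

def feature_map_dropout(X, p_keep, training, rng=None):
    # X: batch of feature maps of shape (N, H, W, C).
    # During training, whole channels are dropped with probability 1 - p_keep;
    # at test time the activations are scaled by p_keep instead.
    if not training:
        return X * p_keep
    rng = np.random.default_rng() if rng is None else rng
    N, H, W, C = X.shape
    # One mask value per (sample, channel), broadcast over the spatial dimensions.
    mask = (rng.random((N, 1, 1, C)) < p_keep).astype(X.dtype)
    return X * mask

# Example: a batch of 2 feature maps with 8 channels and p_keep = 0.85.
X = np.random.rand(2, 16, 16, 8)
X_train = feature_map_dropout(X, p_keep=0.85, training=True)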

3 Methods

This section describes the specific implementations in more detail: the network architectures in Section 3.1, the data set with both real and simulated images in Section 3.2, and the data augmentation techniques used to enhance training in Section 3.3.

3.1 Network Architectures

The basic components of neural networks have been outlined in Section 2. With these ingredients, the next step is to figure out how to connect them and build up the network architecture. It is not trivial to know what the optimal structure is; good architectures are usually arrived at by trial and error. In recent years there has been a focus on searching for better and better architectures, and several different architectures have been found. A catalyst in this process has been the annual ImageNet challenge, where the task is to classify and detect objects in images from the ImageNet database. In 2012, deep learning took over the scene; it was the first year a CNN won the competition. The network was called AlexNet and is considered a breakthrough for CNNs, as it crushed its opponents and raised interest across the field. In the following years new architectures were invented and CNNs continued to dominate, decreasing the error rate each year.

During this thesis several different network architectures were tested, some of which were contestants in the ImageNet challenge. Three network structures were chosen for evaluation, and different variations of each were tested, varying the size both lengthwise and breadthwise. The three architectures VGG, ResNet and DenseNet are described in Sections 3.1.1, 3.1.2 and 3.1.3 respectively. Finally, the implementation details of the specific networks that were implemented are described in Section 3.1.4.

3.1.1 VGG

A simple yet effective network structure, called VGG, was suggested by Simonyan & Zisserman [27]. They limited the convolutions to only include 3×3 kernels, as well as pooling layers with 2×2 strides. Figure 7 illustrates the architecture of the VGG network. (Note that the network was designed for the ImageNet challenge, which had 1000 classes; this explains the output layer size.) The network is implemented with blocks of several convolutions followed by a max pooling layer.

Figure 7: The VGG neural network architecture. (Image from [26].)

Previously, larger filter sizes were common, but Simonyan & Zisserman argued that the benefits of larger kernels can still be achieved with a smaller size. The receptive field of a single small kernel is smaller, but by stacking several convolution layers, any given receptive field can be attained with enough layers. For example, a 7×7 convolution has the same receptive field as three 3×3 convolutions, since the receptive field grows by two pixels for each 3×3 convolution. The advantage of using many smaller filters to reach the same receptive field is that it is more parameter efficient and allows more complex features to be detected, since more non-linearities are involved. Additionally, the architecture is easy to implement due to its simple structure, which prioritizes depth over width. This approach has been shown to be successful.

3.1.2 ResNet

Another influential network structure is the residual network, or ResNet for short, which was introduced by He et al. [28] in 2015. The key component is the residual block, shown in Figure 8. Instead of only propagating the output feature map Z of each convolution, the input feature map X is propagated as well. The input and output feature maps are then added together element-wise to produce the final output H = Z + X. Here, Z is called the residual, as it can be seen as a perturbation that is applied to the input. The authors hypothesize that it is easier to optimize the residual mapping than the original, unreferenced mapping. This became evident when they trained a 56-layer deep CNN without residual connections, which performed worse than its 20-layer counterpart; when residual connections were used in both networks, the deeper network outperformed the shallower one [28].

Since the input feature map is added element-wise to the residual, the two must have the same dimensions. However, this is not always the case, as the feature maps are compressed and stretched within the network such that the spatial dimensions decrease while the number of channels increases. This case is considered in Figure 8(b), where a 1×1 convolution is used to increase the number of output channels, followed by a max pooling layer to decrease the spatial dimensions. Here, the first convolution layer (top-right) is implemented with a stride of 2 and produces the same number of channels as the 1×1 convolution.

Figure 8: The residual block with (a) same and (b) different input and output sizes.

Note that this is not the exact implementation as in [28], where a single 1×1 convolution with a stride of 2 is used without the pooling layer.

In this thesis, however, the stride is omitted in the 1×1 convolution, with the motivation that a strided 1×1 convolution would leave out the majority of the pixels. Instead, the stride is placed in a max pooling layer with the same kernel size, so that every pixel is considered.

3.1.3 DenseNet

Inspired by ResNet, Huang et al. [24] extended the idea of skip connections to propagate not only the input feature map, but every previous layer in the network. This network architecture is called DenseNet. A slight modification compared to the residual connections in ResNet is that when feature maps are combined, they are concatenated instead of added element-wise. The concatenation is done along the channel dimension, producing more and more features with each convolution layer. As a result, the later layers contain a mix of both high- and low-level features, encouraging the network to use a wider range of information. An additional benefit of this approach is that it allows for feature reuse, thus reducing the number of parameters needed.

As for the residual connections in ResNet, the concatenation only works on feature maps with the same spatial size. For this reason, DenseNet is divided into several dense blocks within which the spatial dimensions are preserved. A dense block is shown in Figure 9 and comprises several convolution layers, each producing κ output channels, where κ is called the growth rate. Each subsequent convolution therefore has more and more input channels. To prevent extreme computational costs for very deep DenseNet variants, a 1×1 convolution layer is placed in front of every regular convolution that has more than 4κ input channels. The 1×1 convolutions produce 4κ output channels, which means that every 3×3 convolution layer has at most 4κ input channels, improving the computational efficiency. Multiple dense blocks are then connected through transition blocks, which reduce the spatial size with pooling layers. Additionally, to keep the number of feature channels from exploding, each transition block has a 1×1 convolution layer which reduces the number of channels by a factor θ ∈ (0, 1]. The hyperparameter θ is called the compression factor, and the number of output channels of a transition layer is θ · #input channels.

Figure 9: A dense block in the DenseNet architecture where every layer is connected together. (Image from [24].)

3.1.4 Implementation Details

CNNs take as input an image of a fixed size. Therefore, several different image resolutions were tested during the thesis to see what level of pixel detail was necessary. Larger images introduce larger computational costs and require either more pooling layers to reduce the spatial size or pooling layers with a more aggressive stride. Through testing and consideration, a final image resolution of 360×640 was decided on. However, the networks were not trained on the whole images, but on random crops of size 224×224.

There are three reasons for this. First, it is a way of augmenting the data by only showing the network a small part of each image, effectively increasing the size of the data set. Second, 224×224 is a common input size for CNNs, and many architectures have been tuned to this resolution, including VGG, ResNet and DenseNet. Third, the number 224 has the prime factorization 2·2·2·2·2·7, which makes it convenient to use pooling layers with strides of 2 to reduce the spatial size down to 7×7. Subsequently, the final 7×7 feature map can either be flattened pixel-wise into a fully connected layer or fed through a global pooling layer.

As described, the three network architectures that have been evaluated in this thesis are VGG, ResNet and DenseNet. Note, however, that they have not all been implemented exactly as described in their respective papers. ResNet and DenseNet were implemented with an initial 7×7 convolution layer followed by a 3×3 max pooling layer, both with strides of 2, reducing the 224×224 input image to a feature map of size 56×56. Apart from the initial 7×7 convolution, the rest of the convolutions were implemented with a kernel size of either 1×1 or 3×3. Since every convolution in all of the networks was implemented as a convolution cell, BN-ReLU-Conv, as described in Section 2.4.4, the whole cell is referred to as a convolution for simplicity. This is also the case for the initial 7×7 convolution. The transition from the final 7×7 feature map into a fully connected layer was performed using a global average pooling layer, and the resulting fully connected layer was connected directly to the output layer. This was the case for all of the networks. The inputs to every convolution operation in all networks were padded with the edge pixel values to preserve the spatial dimensions.

Each of the three network architectures has been implemented in two different versions, A and B. The variant denoted by the letter B is larger and contains more layers and more channels in each feature map than the variant denoted by A. Tables 1, 2 and 3 show the specific network compositions of the VGG, ResNet and DenseNet architectures respectively. The VGG architectures were implemented by connecting every layer in series, where each convolution layer had a stride of 1 and each max pooling layer had a filter size and stride of 2. In the ResNet architectures, however, the layers were connected through residual connections, which is not portrayed in Table 2.

Table 1: The specific network compositions of the VGG-type architectures.

Layer name | Output size | VGG-A channels | VGG-A composition | VGG-B channels | VGG-B composition
Conv 1 | 224×224 | 16 | [3×3 Conv] ×2 | 32 | [3×3 Conv] ×4
Pool 1 | 112×112 | 16 | Max pooling | 32 | Max pooling
Conv 2 | 112×112 | 32 | [3×3 Conv] ×2 | 64 | [3×3 Conv] ×4
Pool 2 | 56×56 | 32 | Max pooling | 64 | Max pooling
Conv 3 | 56×56 | 64 | [3×3 Conv] ×2 | 128 | [3×3 Conv] ×6
Pool 3 | 28×28 | 64 | Max pooling | 128 | Max pooling
Conv 4 | 28×28 | 128 | [3×3 Conv] ×2 | 128 | [3×3 Conv] ×6
Pool 4 | 14×14 | 128 | Max pooling | 128 | Max pooling
Conv 5 | 14×14 | 128 | [3×3 Conv] ×2 | 256 | [3×3 Conv] ×4
Pool 5 | 7×7 | 128 | Max pooling | 256 | Max pooling
Conv 6 | 7×7 | 128 | [3×3 Conv] ×2 | 256 | [3×3 Conv] ×4
Pool 6 | 1×1 | 128 | Global pooling | 256 | Global pooling
Output | 1×1 | 2 | Softmax | 2 | Softmax
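To make Table 1 concrete, the following is a rough tf.keras sketch of a VGG-A-style stack (for illustration only; it is not the exact thesis implementation, which for example pads with edge pixel values rather than zeros and uses a custom training pipeline):

import tensorflow as tf
from tensorflow.keras import layers

def bn_relu_conv(x, channels, kernel_size=3):
    # The BN-ReLU-Conv cell from Section 2.4.4 (zero padding used here for simplicity).
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return layers.Conv2D(channels, kernel_size, padding="same")(x)

def vgg_a(input_shape=(224, 224, 3)):
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    # (channels, number of 3x3 convolutions) per block, following the VGG-A column of Table 1.
    blocks = [(16, 2), (32, 2), (64, 2), (128, 2), (128, 2), (128, 2)]
    for b, (channels, n_convs) in enumerate(blocks):
        for _ in range(n_convs):
            x = bn_relu_conv(x, channels)
        if b < len(blocks) - 1:
            x = layers.MaxPooling2D(pool_size=2, strides=2)(x)  # Pools 1-5
    x = layers.GlobalAveragePooling2D()(x)                      # Pool 6: global pooling
    outputs = layers.Dense(2, activation="softmax")(x)          # two friction classes
    return tf.keras.Model(inputs, outputs)

model = vgg_a()
model.summary()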

Table 2: The specific network compositions of the ResNet-type architectures.

Layer name | Output size | ResNet-A channels | ResNet-A composition | ResNet-B channels | ResNet-B composition
Conv 0 | 112×112 | 16 | 7×7 Conv | 64 | 7×7 Conv
Pool 0 | 56×56 | 16 | Max pooling | 64 | Max pooling
Conv 1 | 56×56 | 16 | [3×3 Conv; 3×3 Conv] ×2 | 64 | [3×3 Conv; 3×3 Conv] ×3
Pool 1 | 28×28 | 32 | 1×1 Conv, Max pooling | 128 | 1×1 Conv, Max pooling
Conv 2 | 28×28 | 32 | [3×3 Conv; 3×3 Conv] ×2 | 128 | [3×3 Conv; 3×3 Conv] ×4
Pool 2 | 14×14 | 64 | 1×1 Conv, Max pooling | 128 | 1×1 Conv, Max pooling
Conv 3 | 14×14 | 64 | [3×3 Conv; 3×3 Conv] ×2 | 128 | [3×3 Conv; 3×3 Conv] ×6
Pool 3 | 7×7 | 128 | 1×1 Conv, Max pooling | 256 | 1×1 Conv, Max pooling
Conv 4 | 7×7 | 128 | [3×3 Conv; 3×3 Conv] ×2 | 256 | [3×3 Conv; 3×3 Conv] ×3
Pool 4 | 1×1 | 128 | Global pooling | 256 | Global pooling
Output | 1×1 | 2 | Softmax | 2 | Softmax

Here, each residual block of two 3×3 convolutions within a Conv X group had the same dimensions, and the blocks were residually connected as illustrated in Figure 8(a). Since both the spatial dimensions and the number of channels change between the Conv X groups, a pooling block containing a 1×1 convolution and a 2×2 max pooling layer was used. The pooling block was connected in parallel with the first residual block, as illustrated in Figure 8(b). To maintain the correct spatial dimensions, the first 3×3 convolution within each Conv X group was implemented with a stride of 2.

For the DenseNet architectures in Table 3, every group indicated by square brackets within a dense block was densely connected as illustrated in Figure 9. DenseNet-A was implemented with a growth rate of κ = 8 and DenseNet-B with κ = 12, and a compression factor of θ = 0.5 was used in both networks. Additionally, the initial 7×7 convolution produced 2κ output channels in both DenseNet variants. The 1×1 and 3×3 convolutions had strides of 1, and the pooling layers within the transition blocks had strides of 2.

Table 4 summarizes the six networks, listing the total number of convolution layers and the number of parameters, as well as the batch size and learning rate used for each network. Additionally, all networks were trained using feature map dropout with p_keep = 0.85. The optimizer used was SGD with Nesterov momentum, with a momentum parameter of β_1 = 0.97. The Adam optimizer was not used because preliminary tests showed odd behaviour, where random spikes could appear in the training loss. This might be related to the criticism of adaptive methods by Wilson et al. [14] and is discussed in Section 5.4.
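For illustration, this optimizer setting corresponds roughly to the following tf.keras configuration (a sketch using the DenseNet-A values from Table 4, not the actual thesis training code):

import tensorflow as tf

# SGD with Nesterov momentum (beta_1 = 0.97) and learning rate 0.02, as in Table 4.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.02, momentum=0.97, nesterov=True)

# model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])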

Table 3: The specific network compositions of the DenseNet-type architectures.

Layer name | Output size | DenseNet-A channels | DenseNet-A composition | DenseNet-B channels | DenseNet-B composition
Conv 0 | 112×112 | 16 | 7×7 Conv | 24 | 7×7 Conv
Pool 0 | 56×56 | 16 | Max pooling | 24 | Max pooling
Dense block 1 | 56×56 | 64 | [1×1 Conv; 3×3 Conv] ×6 | 96 | [1×1 Conv; 3×3 Conv] ×6
Transition block 1 | 28×28 | 32 | 1×1 Conv, Max pooling | 48 | 1×1 Conv, Max pooling
Dense block 2 | 28×28 | 96 | [1×1 Conv; 3×3 Conv] ×8 | 192 | [1×1 Conv; 3×3 Conv] ×12
Transition block 2 | 14×14 | 48 | 1×1 Conv, Max pooling | 96 | 1×1 Conv, Max pooling
Dense block 3 | 14×14 | 112 | [1×1 Conv; 3×3 Conv] ×8 | 360 | [1×1 Conv; 3×3 Conv] ×22
Transition block 3 | 7×7 | 56 | 1×1 Conv, Max pooling | 180 | 1×1 Conv, Max pooling
Dense block 4 | 7×7 | 120 | [1×1 Conv; 3×3 Conv] ×8 | 372 | [1×1 Conv; 3×3 Conv] ×16
Pool | 1×1 | 120 | Global pooling | 372 | Global pooling
Output | 1×1 | 2 | Softmax | 2 | Softmax

Table 4: Summary of the six different networks that were implemented.

Network | # Convolution layers | # Parameters | Batch size | Learning rate
VGG-A | 12 | 886K | 200 | 0.02
VGG-B | 28 | 6.23M | 50 | 0.005
ResNet-A | 20 | 703K | 200 | 0.02
ResNet-B | 36 | 6.42M | 150 | 0.015
DenseNet-A | 64 | 156K | 200 | 0.02
DenseNet-B | 116 | 938K | 150 | 0.015

When training neural networks it is important to have balanced classes, so that the networks are exposed to equally many images of each class; otherwise they may become biased towards the class with more samples. Unfortunately, this is not the case for either the real or the simulated data set. To account for this, images were sampled independently from each class when assembling each training batch, so that every batch contains equally many images of each class. Furthermore, each batch was sampled independently, meaning that some images could appear more often than others during training. For this reason, the definition of the term epoch differs slightly from the conventional one. Usually, an epoch has elapsed when the network has been trained on each image in the data set once. In this thesis, however, an epoch refers to the network having been trained on as many images as there are in the training data set, where some images might have appeared several times and some might not have appeared at all during one epoch.

Everything was implemented in Python, and the network architectures, training and evaluation were implemented using TensorFlow.

3.2 Data Set

The neural networks were trained on a data set containing images of roads, divided into classes of high and medium friction. A data set of real images was gathered during a three-month period prior to the thesis project and was used for evaluating the different networks; it is described in further detail in Section 3.2.1. Additionally, a data set of simulated images was created from the racing game World Rally Championship 7. This data set was mainly used to pre-train the networks and is described further in Section 3.2.2.

3.2.1 Real Data

The data set used in this project consisted of 37,000 images of roads taken from inside a car, divided into two classes: high friction (µ_hi) and medium friction (µ_me). The data was gathered and labeled by Thomas Svantesson from November 2017 to January 2018, mostly around Linköping, Sweden, but also elsewhere in Sweden as well as in Germany. An image was labeled as high friction if µ ≥ 0.6 and medium friction if 0.2 ≤ µ < 0.6. (Originally, the idea was to also include low friction images where µ < 0.2, but these were extremely few compared to the other classes and were excluded from the project.) Figure 10 shows an example of a µ_hi and a µ_me image.

The images were captured at irregular intervals, varying from a couple of images per second to minutes between shots. The data set also contained some image sequences where many images were taken in rapid succession. This might be problematic since the images within a sequence are highly correlated; if this is not accounted for, it might cause the networks to be biased towards these data points. For this reason the networks have been trained and evaluated both on the full data set and on a smaller subset where all images but one have been removed from each image sequence.

In order to validate a model it is very important to separate training data from validation data. Neural networks usually contain many more parameters than there are data points, which means that the risk of overfitting is prominent. A common technique is to randomly sample a fraction of the data set (for example 70 %) to use as a training set, which is fed through the network during training. The remaining data is called a validation set and is used to evaluate how well a model generalizes to previously unseen data. The motivation for this approach is to get a more robust measure of performance, since it indicates how the model behaves on new data. When it comes to sequentially sampled data, however, this approach might yield misleading results, because consecutive data points are highly correlated. If two subsequent images are separated into the training and validation data sets, a model might accurately classify the validation data point even though it has heavily overfitted to the training data. In this case the validation accuracy no longer gives a robust measure of how well the model generalizes to new data, and overfitting might go unnoticed.

Figure 10: Example images of roads with (a) high friction and (b) medium friction.

To overcome this problem, the whole data set was divided into five subsets which did not contain overlapping sequences. Additionally, the subsets were constructed such that the fractions of µ_hi and µ_me data were preserved from the original data set. This enables cross-validation, where four of the five subsets are chosen as the training set and the remaining subset is referred to as the test set. A network is then trained five times, varying which subset is used as the test set, and the final prediction accuracy is computed as the average over the five runs. Additionally, a validation data set was randomly sampled as 10 % of the training data set, with the aim of highlighting the difference between the two types of validation methods. Table 5 shows the class distributions of each individual subset as well as of the whole data set, with and without image sequences.

To clarify, the training set contains the data that is shown to the model during training and is usually the largest. The validation set is randomly sampled from the same distribution as the training data (but not shown during training) and provides an easy way to see how the model generalizes. The test set is more carefully crafted and not necessarily sampled from the same distribution as the training set. Its purpose is to provide a more realistic way to validate a model, taking into consideration that the training set does not cover the whole span of data. Note that in some literature where a test set is not present, the validation set might be referred to as the test set; throughout this report, however, the terminology described in this section is used.

Table 5: Class distributions for the real data set.

Subset | With sequences, µ_hi | With sequences, µ_me | Without sequences, µ_hi | Without sequences, µ_me
1 | 4677 | 2559 | 1863 | 464
2 | 4877 | 2651 | 2096 | 509
3 | 4521 | 2764 | 2086 | 551
4 | 4572 | 2569 | 1778 | 599
5 | 4815 | 2975 | 2608 | 662
Full | 23,462 | 13,518 | 10,431 | 2785

3.2.2 Simulated Data

In order to artificially increase the size of the data set of real images, new data was simulated by sampling screenshots from the racing game World Rally Championship 7 (WRC 7). In total, 54,029 images were captured at a frequency of 2 Hz. WRC 7 was specifically chosen since the in-game graphics seemed to closely mimic real-world images. An additional advantage of this game was that it was possible to adjust the weather conditions, which allowed for good control over the data distribution. Figure 11 shows two example images from the simulated data set.

Figure 11: Examples of simulated images of roads with (a) high friction and (b) medium friction.

Data was collected from 12 different tracks with three different road surfaces: asphalt, gravel and snow. Asphalt was labeled as high friction while gravel and snow were labeled as medium friction. In total, the simulated data set contained 16,675 high friction images and 37,354 medium friction images. It is worth noting that there might exist cases where this labeling is not completely accurate. However, since the purpose of the simulated data was to pre-train the networks, it is a sufficient labeling criterion. The networks can still learn basic and essential features from this data and be fine-tuned when trained on the real data.

3.2.3 Data Mixture

Several different data mixing schemes were tested. The simplest was to mix a constant fraction of real and simulated data into each batch. Second, the networks were pre-trained on only simulated data and then fine-tuned on real data. A variation of this method was to let the fraction of real data increase smoothly from 0 to 1 after pre-training on simulated data. This was implemented by introducing the split parameters r_1 and r_2: r_1 denotes at which epoch the fraction of real data begins to increase, after which the fraction increases linearly until it reaches 1 at epoch r_2. Figure 12 illustrates the definitions of the split parameters.

Figure 12: A plot illustrating the definition of the split parameters r_1 and r_2 (fraction of real data versus epochs).

When using two different data sets to train the networks, the meaning of the term epoch becomes ambiguous. To clarify, an epoch is still defined as in Section 3.1.4, meaning that an epoch has elapsed when the network has been shown as many images as there are images in the training data set of real images. This means that when pre-training a network on only simulated data, the first epochs could consist of purely simulated images. For practical reasons, we introduce the term effective epoch and define it as in Definition 3.1. This will be useful when analyzing the effect of simulated data.

Definition 3.1 (Effective epoch) When training a neural network, an effective epoch has elapsed when the network has been shown as many real training images as there are images in the training data set. If the network is simultaneously shown both real and simulated images, only the real images count towards the effective epoch number.

3.3 Data Augmentation

The general task in machine learning is to fit a model to data. It is therefore not only important that the model is designed for the specific problem; it is at least as important that the data set in question is of high quality and sufficient size. In most deep learning applications, however, the number of parameters in a neural network exceeds the number of data points in the data set, sometimes by a large margin. This becomes problematic if the network is not trained properly, since it is easy for the model to overfit to the training data. To combat this problem, several regularization methods have been developed.

One of the easiest methods to understand and implement is data augmentation. The idea is to modify the training data in some way so that a larger variety of data is shown to the network. Applying data augmentation effectively increases the size of the data set, reducing the risk of overfitting. This section describes several data augmentation methods.

3.3.1 Left-to-Right Flip

The simplest and most efficient data augmentation technique is to flip the image left to right, as illustrated in Figure 13. The intention of this method is that the model becomes invariant to mirrored objects. Depending on the application, however, this might not be desirable; in a text recognition task, for example, the model might confuse an E with a 3. The flip is applied to each image with a 50 % probability, so that the network sees as many original as flipped images. Similarly, one can flip the images upside down with the intention of making the model invariant to that transform. For road images, however, this augmentation is unnecessary and possibly even disadvantageous, since these images are not, and should not be, symmetric from top to bottom.

Figure 13: An example where left-to-right flip has been applied to an image.

3.3.2 Rotation

Another common data augmentation technique is to rotate the images by some angle. Usually, the angle is uniformly sampled from an interval [−θ, θ], where θ is the maximum rotation angle and controls the strength of the augmentation. As seen in Figure 14, the rotated image extends beyond the frame of the original, and the extra pixels on the border have been colored gray in this example. The border color should be chosen such that it results in as small an activation as possible and affects the prediction as little as possible. Depending on the first layer, the color might be chosen differently. If, for example, the first layer is a convolution layer, a black border gives the smallest activation, since black corresponds to [0, 0, 0] in a red-green-blue (RGB) encoding; because the pixel values are multiplied by the kernel components, the resulting activation is zero for the black regions of an image. On the other hand, if the first layer is a batch normalization layer, a black border shifts the mean value towards zero and thus affects the activations. Preferably, the border color would be the mean value of all pixels of all images in the training set, but this can be expensive to compute, and a good enough approximation is an RGB color of [0.5, 0.5, 0.5], which corresponds to gray. Alternatively, the border pixels can be filled with randomly sampled noise.

Figure 14: An example where rotation has been applied to an image.

3.3.3 Translation

Translation can also be applied in the data augmentation pipeline, so that the images are moved off-center as shown in Figure 15. This is useful in problems where spatial information is important, such as object detection and image segmentation, and it helps prevent the network from only looking for specific objects in specific places rather than using the whole image. When using multiple-crop evaluation, however, this method is unnecessary, since translating the image does not make any difference. As for rotation, the resulting image contains border pixels which were not part of the original image and have to be colored.

Figure 15: An example where translation has been applied to an image.

3.3.4 Scaling

A fourth geometric data augmentation technique is to scale the images by some factor, where both up-scaling and down-scaling can be performed. The purpose of this method is to train the model to be invariant to the size of objects. The scaling can also be performed axis-wise, such that the images are stretched in some direction. This simulates objects being rotated away from the direction of the camera, making the network more robust to suboptimal viewing angles.

Figure 16: An example where scaling has been applied to an image.

3.3.5 Noise

A common, simple and yet effective regularization method is to add noise to the input images. This helps prevent the networks from being sensitive to small changes in the input, which has proven to be a problem in deep neural networks. An additional advantage is that it becomes harder for the networks to learn small pixel-scale features which are only present in a few images in the training set, which also prevents overfitting. It has been shown that neural networks can be trained on completely random data, memorizing the exact data points and achieving close to 0 % error rate on training data while not being able to generalize at all [29]. Adding noise to the inputs helps reduce this problem.
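A minimal sketch of this augmentation for 8-bit images (illustration only; the per-image sampling of the standard deviation anticipates the implementation details in Section 3.3.8):

import numpy as np

def add_gaussian_noise(image, max_std=10.0, rng=None):
    # Additive pixel-wise Gaussian noise for an image with values in [0, 255].
    # The standard deviation is drawn uniformly per image, so the network is
    # exposed to a range of noise scales during training.
    rng = np.random.default_rng() if rng is None else rng
    std = rng.uniform(0.0, max_std)  # one noise scale per image
    noisy = image.astype(np.float32) + rng.normal(0.0, std, size=image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Example on a random 360x640 RGB image.
img = (np.random.rand(360, 640, 3) * 255).astype(np.uint8)
noisy_img = add_gaussian_noise(img, max_std=10.0)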

Figure 17: An example where noise has been applied to an image.

Adding noise as a means of augmenting the data extends beyond images: for most machine learning problems with continuous inputs it is a good initial approach. Depending on the application, however, the noise might be sampled from different distributions. In many cases noise follows a Gaussian distribution, but not always. For example, the noise in images that results from the quantized nature of photons, called shot noise, is Poisson distributed. The difference is apparent in low-light conditions, where Gaussian noise with a constant standard deviation overestimates the noise, when in fact the noise is monotonically increasing with the photon intensity [30]. Besides Gaussian and Poisson distributed noise, salt & pepper noise is commonly mentioned within image analysis. It is applied by randomly coloring pixels either black or white with some probability.

3.3.6 Intensity Modification

To simulate different lighting conditions one can artificially adjust the intensity of the images. This can be done in different ways, for example by adding a constant positive RGB value [a, a, a] to each pixel to make the image brighter, as seen in Figure 18, or a negative value to make it darker. Different values can be added to the different color channels in order to adjust the hue as well. The intensity can also be modified by applying gamma correction. When using batch normalization, however, the effect of intensity modification is nullified, since the input image is normalized and centered around the mean. This is why the inventors of batch normalization [22] discourage the use of this type of data augmentation in combination with batch normalization.

Figure 18: An example where intensity modification has been applied to an image.

3.3.7 Cutout

DeVries & Taylor [31] introduced a different type of regularization technique in 2017, called cutout. The idea is simply to remove part of the image, as shown in Figure 19, to encourage the network to use more of the image when making its prediction. Cutout was inspired by dropout and was initially developed as a targeted approach where the image regions with the highest activations were dropped. Tests showed, however, that simply dropping a random region of fixed size performed just as well and was preferred because of its simplicity. The authors noticed that it was essential to let the cutout region fall partly outside the image and not only be fully contained within the image frame. Alternatively, cutout can be applied randomly to only 50 % of the images, so that the network also sees full images.

Figure 19: An example where cutout has been applied to an image.

3.3.8 Implementation Details

The data augmentation techniques used in the final training of the networks were left-to-right flip, rotation, scaling, noise and cutout. Left-to-right flip and cutout were both applied with a 50 % probability per image, where cutout was implemented as a 60×60 gray pixel square. Each image was rotated by a random angle uniformly sampled from [−10, 10] degrees. In the same way, each image was scaled by a random factor uniformly sampled from [0.9, 1.1]. Gaussian noise was applied pixel-wise to each training image with a mean of 0. To make the networks robust against different noise scales, the standard deviation was randomly sampled uniformly from the interval [0, 10], which was done image-wise. Note that the noise was added directly to the pixel values, which span the interval [0, 255].

Translation and intensity modification were used during preliminary tests but were omitted in the final implementation. Translation was not used since the networks were trained on smaller crops of the images; it would only shift the position of each crop slightly and would not affect training, since the crops were randomly selected. Intensity modification was not used since the first layer of every network was a batch normalization layer: since intensity modification only adds some value channel-wise, it would be nullified by batch normalization.

3.4 Multiple-Crop Evaluation

As mentioned in Section 3.1.4, the networks were trained on random 224×224 crops of the images. At test time, however, one would like to consider the whole 360×640 image when making the final friction estimate. For this reason, eight fixed crops, shown in Figure 20, were fed into the network during evaluation. Additionally, the mirror image of each crop was also fed into the network, for a total of 16 different sub-images. When making the final friction estimate, the output layers from the separate crops were averaged to arrive at the final output vector. This process is called multiple-crop evaluation.

It is not uncommon for neural networks to be sensitive to the input, and several cases have been found where small perturbations of the input image can change the final prediction drastically [32, 33]. Multiple-crop evaluation is a countermeasure for this problem, with the intention of producing a more robust prediction. With multiple-crop evaluation, a perturbation of the input image would have to sway the prediction for around half of the 16 crops for the final prediction to be affected. This is far less likely to occur with 16 crops than for single-image evaluation.
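A minimal sketch of the averaging step in multiple-crop evaluation (illustration only; the crop positions below are hypothetical, whereas the thesis uses the eight fixed crops shown in Figure 20 plus their mirror images):

import numpy as np

def multi_crop_predict(model, image, crop=224):
    # model: callable mapping a batch of (crop, crop, 3) images to softmax outputs.
    # image: the full 360x640x3 input image.
    H, W, _ = image.shape
    # Example crop positions (top-left corners): four corners and the center.
    corners = [(0, 0), (0, W - crop), (H - crop, 0), (H - crop, W - crop),
               ((H - crop) // 2, (W - crop) // 2)]
    crops = []
    for y, x in corners:
        patch = image[y : y + crop, x : x + crop, :]
        crops.append(patch)
        crops.append(patch[:, ::-1, :])  # left-to-right mirror of the crop
    outputs = np.asarray(model(np.stack(crops)))  # shape: (num_crops, num_classes)
    return outputs.mean(axis=0)                   # averaged class probabilities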

Figure 20: An illustration of multiple-crop evaluation.

4 Results

This section presents the results achieved during the thesis, including comparisons between the network architectures, the effect of simulated data and an evaluation of the data set. First, preliminary tests are described in Section 4.1, which explains how the hyperparameters were chosen; it also shows which filters were learned by the networks in the initial layer and highlights the effect of feature map dropout. Section 4.2 investigates if and how simulated data can be used to improve the predictive performance of the networks. Section 4.3 evaluates the six different networks that were implemented, with plots of the loss and prediction accuracy during training; a robust measure of predictive performance using cross-validation is also presented. Finally, Section 4.4 describes experiments which aim to give a better understanding of the data set and how it can be improved.

4.1 Hyperparameter Tuning

Training neural networks involves many hyperparameters that have to be chosen. As opposed to the weights and biases, hyperparameters are not learned during training but chosen explicitly. They include, for example, the learning rate, batch size, dropout, network length, network width and data augmentation parameters. Tuning these hyperparameters is, in a sense, an optimization problem in itself. This section explains the process of finding good values for these parameters.

Early in the thesis, the networks were built with more feature channels in the hidden convolution layers than suggested by the network architectures in Section 3.1.4. To better understand what the networks were looking for and to see how they utilized the feature channels, the initial convolution layers were visualized for an early version of a DenseNet-type network. This was done in conjunction with the evaluation of feature map dropout. Figure 21 shows the 64 learned filters in the initial convolution layer and how they are affected by feature map dropout. The filters to the left have been locally normalized, where the contrast has been maximized for every filter individually. To the right, the filters have been globally normalized: the minimum and maximum values are computed across all filters, and each filter is normalized with respect to these so that the minimum value corresponds to a pixel value of 0 and the maximum value corresponds to 255. Since the filters can have negative values, a pixel value of 127 was chosen to represent a filter value of 0, which means that a gray filter does not propagate any information.

From Figure 21 we see that only a small fraction of the filters resemble an intuitive or meaningful feature; most filters are monotonic or noisy. The networks do not seem to utilize the full potential of the layers, but only use a limited number of features. This is why the number of channels was reduced. The most noticeable difference when using feature map dropout is that the number of monotonic filters is reduced, while the number of filters resembling edges between bright and dark colors has increased. Additionally, many filters seem to appear in duplicates. This can clearly be seen in Figure 22, where some selected kernels have been placed next to each other to highlight the duplicated filters. It was observed that feature map dropout improved generalization and reduced overfitting. The downside, however, was that training was slowed down.
A parameter value of p_keep = 0.85 seemed to give a good balance between these pros and cons.
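The two normalization schemes used for the visualizations can be sketched as follows (a rough illustration assuming a kernel tensor of shape (k, k, 3, 64); this is not the visualization code used in the thesis):

import numpy as np

def normalize_filters(kernels, mode="local"):
    # kernels: array of shape (k, k, in_channels, out_channels).
    # mode="local":  stretch the contrast of each filter individually.
    # mode="global": use one min/max over all filters, so that with a roughly
    #                symmetric weight range a filter value of 0 maps to about 127.
    imgs = []
    if mode == "global":
        vmin, vmax = kernels.min(), kernels.max()
    for q in range(kernels.shape[-1]):
        f = kernels[..., q]
        if mode == "local":
            vmin, vmax = f.min(), f.max()
        scaled = (f - vmin) / (vmax - vmin + 1e-8)
        imgs.append(np.uint8(255 * scaled))
    return imgs

# Example: visualize 64 random 7x7 RGB kernels with global normalization.
K = np.random.randn(7, 7, 3, 64)
tiles = normalize_filters(K, mode="global")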

Figure 21: Trained kernels in the first convolution layer (a) without feature map dropout and (b) with feature map dropout. Left: Local color normalization. Right: Global color normalization.

Figure 22: Some selected kernels from Figure 21(b) when training with feature map dropout, highlighting the similarity between kernels.

The tuning of the batch size K, learning rate α and momentum β_1 was done with reference to the work of Smith & Le [29]. They formulate a noise scale and find that the training behaviour of a network remains unchanged as long as the term $\frac{\alpha N}{K(1-\beta_1)}$ remains constant, where N is the size of the training set. The limiting factor when choosing the batch size was memory consumption: the batch size was set as large as possible such that the network, the image batch and the intermediate feature maps fit into the GPU memory. When the batch size was preliminarily decided, the momentum was fine-tuned to β_1 = 0.97. It was then found through experimentation that a batch size of 100 in combination with a learning rate of 0.01 was a good combination. The largest possible batch size was then chosen for each network architecture individually while keeping the ratio α/K = 0.01/100 constant. It was noted, however, that when exceeding a batch size of around 300-400 with this approach, training began to suffer and the predictive performance of the networks decreased. This was probably a result of the learning rate becoming too large, making gradient descent unstable. For this reason the batch size was capped at 200.

Data augmentation was found to be the most effective method for preventing overfitting, especially the addition of noise. Without data augmentation, the networks adjusted very quickly to the training data, reaching 99.9 % prediction accuracy, but this came at the cost of reduced prediction accuracy on the test set: initially the test metrics would follow the training metrics, but would then quickly begin to suffer. With data augmentation, the test metrics followed a more monotonic behaviour and reached better predictive performance.

4.2 Training with Simulated Data

A DenseNet-A network was trained on a mix of both simulated and real data. Figure 23 shows the training results, where the network was trained for 200 epochs. Here, subset 1 was used as the test set and the network was trained using the data mixture scheme described in Section 3.2.3, with the split parameters set to r_1 = 50 and r_2 = 100 epochs. Since the network was not exposed to real images during the entire training period, the total number of effective epochs is less than 200; with these split parameters the network was trained for 125 effective epochs. The final prediction accuracy on the test set was 84.57 %. When training on only real data, the prediction accuracy was 84.18 % on subset 1, meaning that simulated data only yields a marginal improvement. Considering the amount of noise in the graphs in Figure 23, the effect of simulated data is negligible. Other data mixture schemes were tested but showed the same negligible increase in predictive performance.

Figure 23: Training, testing and simulation plots (cross-entropy loss and prediction accuracy versus epochs) when training a DenseNet-A network on both simulated and real data. The data mixture scheme described in Section 3.2.3 has been used, where the vertical lines correspond to the split parameters r_1 = 50 and r_2 = 100 epochs.

To get an understanding of how simulated data affects the learned features, the filters in the first convolution layer were visualized for a network trained on simulated data as well.
Figure 24 shows what the filters look like for a DenseNet-type network after first pre-training on simulated data and then after fine-tuning on real data. An important observation is that the basic structure of the filters does not change. The filters change mostly in the shade of their colors, being brighter after pre-training than after fine-tuning. Additionally, the filters learned here seem to have slightly different characteristics compared to the filters in Figure 21: when trained on simulated data, the network seems to have learned more tilted and skewed edges and fewer vertical or horizontal ones.

Figure 24: Trained kernels in the first layer after (a) pre-training on simulated data and then (b) training on real data. Left: Local color normalization. Right: Global color normalization.

4.3 Network Architecture Evaluation

Figures 25, 26, 27, 28, 29 and 30 show the cross-entropy loss and prediction accuracy during training for VGG-A, VGG-B, ResNet-A, ResNet-B, DenseNet-A and DenseNet-B respectively. In Table 6 the final loss and accuracy of the networks are summarized using cross-validation over the five subsets. This was done by training the networks from scratch five times and varying which subset was used as the test set. Since the graphs of the loss and accuracy are quite noisy, the final results in Table 6 were averaged over the final 20 epochs of training in order to obtain more accurate values. As can be seen from the table, DenseNet-B achieved the highest prediction accuracy, at just above 90 %.

Note that all of the networks for which results are presented have been trained on the data set excluding image sequences. (The sequences are described in Section 3.2.1.) This was done because the sequences contained many images that were highly correlated with one another and did not contribute anything other than introducing a bias towards these images. Also note that the training parameters were chosen to decrease test loss, increase test accuracy and reduce overfitting as much as possible, which comes at the price of reduced training and validation accuracies. Early in the thesis work, the initial training parameters resulted in training and validation accuracies as high as 99.9 %, but this was changed once the more meaningful test-set metrics were used. Some example images are shown in Figure 31 together with their correct labels and predicted outputs, to show how the networks perform in concrete cases.

Table 6: Results for the six different networks. The values are averaged over the five subsets and over the last 20 epochs of training. The best test metrics are marked with an asterisk.

Network      Cross-entropy loss         Prediction accuracy
             Training    Testing        Training    Testing
VGG-A        0.0747      0.3759*        96.98 %     89.97 %
VGG-B        0.1334      0.3949         95.02 %     88.26 %
ResNet-A     0.0939      0.4164         96.56 %     89.09 %
ResNet-B     0.0677      0.4074         97.47 %     89.50 %
DenseNet-A   0.1093      0.3814         95.69 %     88.96 %
DenseNet-B   0.0738      0.4321         97.16 %     90.02 %*
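The sketch below illustrates how the values in Table 6 can be computed from logged training curves, assuming the relevant metric is available per epoch and per cross-validation fold as a NumPy array; the function and variable names are hypothetical and not taken from the thesis code.

```python
import numpy as np

def summarize_cross_validation(metric_per_fold, last_n=20):
    """Average a logged per-epoch metric over the last `last_n` epochs
    and over all cross-validation folds.

    `metric_per_fold` is assumed to have shape (n_folds, n_epochs), e.g.
    the test accuracy logged for each of the five subsets over 200 epochs.
    """
    metric_per_fold = np.asarray(metric_per_fold)
    per_fold = metric_per_fold[:, -last_n:].mean(axis=1)  # one value per fold
    return per_fold.mean(), per_fold.std()

# Example with synthetic logs: 5 folds, 200 epochs of test accuracy.
logs = 0.85 + 0.02 * np.random.randn(5, 200)
mean_acc, std_acc = summarize_cross_validation(logs)
print(f"Test accuracy: {100 * mean_acc:.2f} % (+/- {100 * std_acc:.2f})")
```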

Figure 25: Training results for VGG-A (cross-entropy loss and prediction accuracy versus epochs). The thin lines correspond to each of the five subsets described in Section 3.2.1 and the thick lines are the averages over the subsets.

Figure 26: Training results for VGG-B. The thin lines correspond to each of the five subsets described in Section 3.2.1 and the thick lines are the averages over the subsets.

Figure 27: Training results for ResNet-A. The thin lines correspond to each of the five subsets described in Section 3.2.1 and the thick lines are the averages over the subsets.

Figure 28: Training results for ResNet-B. The thin lines correspond to each of the five subsets described in Section 3.2.1 and the thick lines are the averages over the subsets.

Figure 29: Training results for DenseNet-A. The thin lines correspond to each of the five subsets described in Section 3.2.1 and the thick lines are the averages over the subsets.

Figure 30: Training results for DenseNet-B. The thin lines correspond to each of the five subsets described in Section 3.2.1 and the thick lines are the averages over the subsets.

Image panel labels: High µ: 99.87 %; High µ: 36.49 %; Medium µ: 78.12 %; High µ: 93.73 %; High µ: 98.89 %; Medium µ: 33.11 %; High µ: 47.10 %; High µ: 99.99 %.

Figure 31: Example images and their corresponding label and prediction. High or Medium indicates the true label and the percentages indicate the output confidence for the true label. The border indicates whether the prediction was correct, where green corresponds to a correct prediction and red to a false prediction. (The prediction is correct if the output confidence of the true label exceeds 50 %.)
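The correctness criterion used in Figure 31 can be illustrated as follows. The class set and ordering of the output vector below are assumptions made for the sake of illustration, and the example probabilities other than the 36.49 % confidence are made up.

```python
import numpy as np

# A prediction is counted as correct if the output confidence (softmax
# probability) assigned to the true class exceeds 50 %. The class names and
# their ordering here are illustrative assumptions.
CLASSES = ["High", "Medium", "Low"]

def is_correct(probabilities, true_class, threshold=0.5):
    """Return True if the confidence for the true class exceeds the threshold."""
    return probabilities[CLASSES.index(true_class)] > threshold

# E.g. the misclassified High-mu image in Figure 31 had only 36.49 % on the
# true class (the other two values are made up for the example).
probs = np.array([0.3649, 0.5210, 0.1141])
print(is_correct(probs, "High"))  # False
```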