Classifying Material Defects with Convolutional Neural Networks and Image Processing


UPTEC F 19026, Degree Project 30 credits, June 2019
Classifying Material Defects with Convolutional Neural Networks and Image Processing
Jawid Heidari

Abstract

Classifying Material Defects with Convolutional Neural Networks and Image Processing
Jawid Heidari

Fantastic progress has been made within the field of machine learning and deep neural networks in the last decade. Deep convolutional neural networks (CNNs) have been hugely successful in image classification and object detection. These networks can automate many industrial processes and increase efficiency; one such process is image classification using various CNN models. This thesis addressed two different approaches to solving the same problem. The first approach implemented two CNN models to classify images. The large pretrained VGG model was retrained using so-called transfer learning, training only the top layers of the network. The other model was a smaller one with customized layers. The trained models are an end-to-end solution: the input is an image, and the output is a class score. The second strategy implemented several classical image processing algorithms to detect the individual defects present in the images. This method works as a rule-based object detection algorithm. The Canny edge detection algorithm, combined with two mathematical morphology operations, forms the backbone of this strategy. Sandvik Coromant operators gathered the approximately 1000 microscope images used in this thesis. Sandvik Coromant is a leading producer of high-quality metal cutting tools. Some unwanted defects occur in the products during the manufacturing process. These defects are analyzed by taking images with a conventional microscope at 100x and 1000x magnification. The three essential defect types investigated in this thesis are called Por, Macro, and Slits.
Experiments conducted during this thesis show that CNN models are a promising approach to classifying impurities and defects in the metal industry. The validation accuracy reached circa 90 percent, and the final evaluation accuracy was around 95 percent, which is an acceptable result. The pretrained VGG model achieved much higher accuracy than the customized model. The Canny edge detection algorithm, combined with dilation, erosion, and contour detection, also produced a good result, detecting the majority of the defects present in the images.

Handledare (supervisor): Mikael Björn
Ämnesgranskare (subject reader): Niklas Wahlström
Examinator (examiner): Tomas Nyberg
ISSN: 1401-5757, UPTEC F 19026

Acknowledgements

First of all, I would like to thank my supervisor Mikael Björn at Sandvik Coromant for guidance and support. I would also like to express my gratitude to the quality lab personnel who helped me with data collection and data categorization. It has been a great time working at a large international company. Thank you for this fantastic opportunity! I would also like to thank my subject reader, Doctor Niklas Wahlström at Uppsala University. Without your feedback and guidance, this thesis would not have become what it is today. Furthermore, I would like to thank my fellow thesis workers Emil Fröjd and Alving Ljung. I have enjoyed our discussions throughout the thesis, and thank you for your ideas and help.

Populärvetenskaplig sammanfattning (Popular Science Summary)

In recent years, great progress has been made in the field of machine learning. Deep convolutional networks, which partly mimic the neurons of the brain in their functionality and learning, have been particularly successful in image processing, image classification, and object detection. These networks are very complex in their structure and consist of many different layers and components. The large cost-saving potential attracts a growing number of researchers, industrial companies, and IT companies such as Tesla, Google, and Facebook, which have themselves driven the development in this field. For example, Tesla increasingly uses machine learning for its self-driving cars. This is no simple task, and many challenges remain. Another problem that many large industrial companies, such as ABB and Sandvik, face is how to automate various processes, save money, and reduce repetitive work with the help of machine learning. In this thesis, deep convolutional networks and image processing are applied to classify and detect three material defects that occur during the manufacturing of cutting tools at Sandvik Coromant. These defects arise naturally during the manufacturing process and degrade product quality. Currently, different types of defects are detected manually by the company's operators. Sandvik Coromant aims to automate this process in the hope of minimizing the time from start to finished product with the help of machine learning. The testing department has collected around 1000 images of three different types of defects: Por, Macro, and Slits. The images were taken with conventional microscopes capable of up to 1000x magnification. This thesis consists of several parts. The first part is to collect new images and prepare them. The second part is to select and train suitable deep networks. For this project, two different networks have been trained.
The first is a large, general pretrained network called VGG19, and the second is a smaller network created specifically for this project. The purpose is to compare the results of these two networks and choose the best one. The experiments carried out during the thesis show that deep convolutional networks are a good approach for classifying material defects. The large pretrained network performed best, with an accuracy of about 95 percent. The limiting factor was the dataset, in both quantity and quality: the amount and quality of data were not sufficient. Future projects will need larger and better datasets. This thesis is the first of its kind and can therefore be regarded as a pilot study; its results can serve as an indication for future work. Much more work remains before defect detection can be automated at Sandvik Coromant.

List of Abbreviations

ACC      Final Accuracy
ADAM     Adaptive Moment Estimation
AI       Artificial Intelligence
ANN      Artificial Neural Network
CNN      Convolutional Neural Network
DNN      Deep Neural Network
GD       Gradient Descent
GPU      Graphical Processing Unit
ML       Machine Learning
PPV      Positive Predicted Values
R-CNN    Region Proposal CNN
RMSprop  Root Mean Square Propagation
RNN      Recurrent Neural Network
SGD      Stochastic Gradient Descent
SSD      Single Shot MultiBox Detector
TPR      True Positive Rate
YOLO     You Only Look Once

List of Symbols

β1, β2   The exponential decay rates for the ADAM optimizer
η        The first coefficient of momentum
γ        Learning rate for SGD
m̂, v̂     The first and second corrected moments of the gradient loss for ADAM
ŷ        The predicted values from a neural network
∇        The nabla operator

ψ             The gradient direction for the Canny filter
σ             The activation function
a             A vector representation of a fully connected neural network layer before the activation function
b             The bias term in a neural network
C             Number of channels in an image
D1, D2        Image height and width
Dk            Kernel size (K1 x K2)
Din           Image input size (D1 x D2)
Dout          Feature map size (output size after the convolution of X and K)
E             The cost/error function for the backpropagation
G             Gaussian kernel for edge detection
g             The gradient on the current mini-batch in ADAM
g1, g2        The Gaussian kernel size for the Canny filter
Gx            Sobel filter for the horizontal derivative in the Canny algorithm
Gy            Sobel filter for the vertical derivative in the Canny algorithm
I             Gradient intensity matrix for the Canny algorithm
K             Number of classes in the softmax function
K             The general kernel used in a CNN
k             1D kernel
K1, K2        Kernel height and width
l             A layer in a network
m, v          The first and second moments of the gradient loss for ADAM
n_{l-1}, n_l, n_{l+1}   The neurons in three successive layers, used in the backpropagation algorithm
P             Number of channels in a kernel
p             Zero padding size
S             Stride size
W             The weight matrix in a fully connected layer
w             A specific weight in the network
X             A general input in a machine learning model
y             The true labels/values
Z             The feature map after the convolution operation
z             A vector representation of a fully connected neural network layer after the activation function

Contents

1 Introduction 9
  1.1 Background 9
  1.2 Short Introduction about Data Set 10
  1.3 Project Goal and Limitations 11
2 Theory of Machine Learning and Neural Networks 12
  2.1 General Introduction to Machine Learning 12
    2.1.1 Supervised Learning 12
    2.1.2 Unsupervised Learning 12
    2.1.3 Reinforcement Learning 12
  2.2 Training, Validation and Test Sets 13
  2.3 Fully Connected Networks 13
  2.4 Activation Functions 15
    2.4.1 Sigmoid Function 15
    2.4.2 Tanh Function 15
    2.4.3 ReLU Function 16
    2.4.4 Softmax Function 16
  2.5 Forward Propagation and Loss Functions 17
    2.5.1 Square Loss or L2-Norm 17
    2.5.2 Cross Entropy Loss 17
  2.6 Backpropagation and Partial Derivatives 17
  2.7 Optimization Methods in Neural Networks 18
  2.8 Convolutional Neural Networks 20
    2.8.1 Convolutional Layers 20
    2.8.2 Pooling Layers 21
  2.9 Regularization 22
    2.9.1 Dropout 22
    2.9.2 Early Stopping 22
  2.10 Artificial Neural Networks 23
3 Classifying Material Defects using Neural Networks 24
  3.1 Dataset 24
  3.2 Data Collection and Data Cleaning 25
  3.3 Data Augmentation 25
  3.4 Neural Network Architectures 28
    3.4.1 VGG Architecture 29
    3.4.2 Customized Architecture 29
  3.5 Network Training and Software 30
    3.5.1 Transfer Learning 31
    3.5.2 Model Performance 32
4 Classifying Material Defects using Image Processing 34
  4.1 Image Processing and Object Detection 34
  4.2 Canny Edge Detection 35
  4.3 Morphological Operations 36
  4.4 Contour Detection 36
  4.5 Implementation and Software 36
5 Result 38
  5.1 Defect Classification using CNN Models 38
  5.2 Evaluation of CNN Models 40
  5.3 Defect Detection using Image Processing 42
  5.4 Comparison of CNN Models and Image Processing 43
6 Discussion 46
  6.1 Analysis of CNN Models 46
  6.2 Analysis of Image Processing 47
  6.3 Comparison of CNN Models and Image Processing 47
  6.4 Limitations 48
7 Conclusion 49
8 Future Work 50

List of Figures

1.1 Images taken at 100x zoom with a white background. The left image contains only one Por and the right image contains many Por. 10
1.2 Images taken at 1000x zoom with a dark background. The left image contains two Macro and the right image contains one Slits. 10
2.1 The split of the total data into three subsets: training, validation and test data. 13
2.2 A fully connected neural network. 14
2.3 A node of a fully connected neural network in mathematical terms. 14
2.4 The three different functions mentioned above. 16
2.5 A convolutional neural network, with several blocks. 20
2.6 A convolution operation between the input X and the kernel K producing the feature map Z. 21
2.7 A max pooling with kernel size 2x2 and stride 2. 22
2.8 The dropout effect in a network. 22
2.9 The process of early stopping. 23
3.1 The existence of pors in the images. The left image is of type A00B06 and the right image of type A02B04. 25
3.2 The horizontal flip of an image: original (left) and flipped (right). 26
3.3 The vertical flip of an image: original (left) and flipped (right). 26
3.4 The rotation of an image: original (left) and rotated (right). 27
3.5 The translation of an image: original (left) and translated (right). 27
3.6 The cropping of an image: original (left) and zoomed (right). 28
3.7 Gaussian noise added to an image: original (left) and noised (right). 28
3.8 VGG-16 neural network architecture [1]. 29
3.9 Three different transfer learning strategies; the large rectangle is the main block of a model and the small rectangle is only the top layers, including the softmax classifier. 31
5.1 Training accuracy and cross entropy loss of the pretrained VGG19 using transfer learning, training the top layers only, without data augmentation. 39
5.2 Training accuracy and cross entropy loss of the pretrained VGG19 using transfer learning, training the top layers only, with data augmentation. 39
5.3 Training accuracy and cross entropy loss of the customized model, without data augmentation. 40
5.4 Training accuracy and cross entropy loss of the customized model, with data augmentation. 40
5.5 The confusion matrix of the VGG19 model without data augmentation (right) and with data augmentation (left). 41
5.6 The confusion matrix of the customized model without data augmentation (right) and with data augmentation (left). 41
5.7 The four performance measures for Por (left) and Slits (right); the + signs represent the model with data augmentation. The blue chart is VGG19 and the orange one is VGG19+. The green one is the customized model (C) and the red is C+. 42

5.8 The four performance measures for Macro (left) and the final accuracy of all models (right); the + signs represent the model with data augmentation. The blue chart is VGG19 and the orange one is VGG19+. The green one is the customized model (C) and the red is C+. 42
5.9 The defects detected by image processing, both Macro and Por defects and some falsely detected defects. 43
5.10 The defects detected by image processing: Macro, Por and Slits defects, and some falsely detected defects at the edges of the picture on the left. 43
5.11 The confusion matrix of the VGG19 model with data augmentation (left) and image processing (right). 44
5.12 The F1 score and final accuracy for the image processing approach and the VGG19 model with data augmentation. The first three charts represent the F1 score for each defect, and the last chart is the final accuracy for all three combined. 45

List of Tables

3.1 The three defect categories for this thesis. 24
3.2 Decomposition of VGG19. 29
3.3 Decomposition of the customized network. 30
3.4 Confusion table/matrix for binary classification. 32
5.1 Training results for the CNN models (VGG19 and customized = C) with and without data augmentation. In the table, data augmentation is denoted by a + sign to save space, so C+ means the customized model with data augmentation. The best performance metric is indicated by bold text. 39

Chapter 1

Introduction

Artificial intelligence (AI) and machine learning (ML) have revolutionized today's society. Enormous progress has been made in this field over the last two or three decades. AI and ML are used in many different areas today, from cancer tumor detection to voice recognition, and together they form a billion-dollar industry. This tremendous growth is possible thanks to the large computational power and the large amounts of data now available. AI and ML algorithms enable computers to learn from data, and even improve themselves, without being explicitly programmed, and they can make predictions about future data. The idea of machine learning is not a new concept. The field started already around 1950, in the early age of computers. However, the breakthrough came in the 1990s with more probabilistic approaches and progress in the field of computer science. Big companies like Google and Facebook are investing large amounts of capital in developing new algorithms and platforms. Recently, Google developed TensorFlow, an open source library used for deep learning and neural networks. Recurrent neural networks (RNN) play an essential role in the field of natural language processing and text analysis, for example in text classification and voice recognition. The great success of convolutional neural networks (CNN) in image classification and object detection makes them a promising strategy for industrial companies that want to automate industrial processes. Another growing area for CNN models is the automotive industry, for autonomous cars.

1.1 Background

This thesis project was conducted at Sandvik Coromant in Gimo, Sweden, a world-leading supplier of tools, tooling solutions, and know-how to the metalworking industry. Sandvik Coromant offers products for turning, milling, drilling, and tool holding, and can also provide extensive process and application knowledge within machining. Sandvik Coromant is part of the Sandvik corporation, a global company with 43,000 employees, active in 150 countries around the world. The manufacturing process of metal cutting tools is very complicated. A cutting tool or cutter is used to remove material from the workpiece through shear deformation; it is, for example, used to drill metals or rocks. The process starts from cemented carbide powder and ends with a cutting tool product. During this long process, the risk of contamination and impurities is very high. These undesired substances can reduce the strength and other physical and chemical properties of the products. The company has rules and regulations that specific products must fulfill; one of these rules concerns defect size. A defect is unwanted structural damage, contamination, or an impurity introduced during the manufacturing process. Therefore, Sandvik Coromant has a testing department that controls and investigates the products. The control is performed by taking and analyzing microscope pictures of the products with a conventional microscope. Operators at the department perform this process. The testing team investigates the occurrence of many different structural defect types. Today, this detection and classification process is performed manually by operators. The process is highly tedious and time-consuming, and the company wants to improve its performance and accelerate this process with an automatic system.

1.2 Short Introduction about Data Set

The testing and quality department at Sandvik Coromant in Gimo has gathered microscope images of many different types of defects. In this thesis, however, three types of defects are analyzed: Por, Macro, and Slits. These defects are of the same kind, but they are caused by different mechanisms; it is therefore highly important to classify and distinguish them from each other. Por defects are usually under 25 micrometers and circle-shaped, see figure 1.1. Macro defects are over 25 micrometers and mostly circle-shaped, but can also take other shapes, see figure 1.2. Slits defects are also over 25 micrometers and long and rectangle-shaped, see figure 1.2. The dataset consists of approximately 1000 images, with two different zoom levels (100x and 1000x) and two different backgrounds (white and dark).

Figure 1.1: Images taken at 100x zoom with a white background. The left image contains only one Por and the right image contains many Por.

Figure 1.2: Images taken at 1000x zoom with a dark background. The left image contains two Macro and the right image contains one Slits.
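Given these size and shape rules, a first rule-based classifier is almost mechanical. The sketch below is illustrative only: the 25-micrometer threshold comes from the text, while the aspect-ratio cutoff (3.0) and the function name are hypothetical choices, not values from the thesis.

```python
def classify_defect(max_diameter_um, aspect_ratio):
    """Toy rule-based defect classifier.

    The 25-micrometer size rule is from the text; the aspect-ratio
    threshold (3.0) is a hypothetical value chosen for illustration.
    """
    if max_diameter_um < 25:
        return "Por"            # small, roughly circular
    if aspect_ratio >= 3.0:     # much longer than wide
        return "Slits"
    return "Macro"              # large, circular or irregular

# A few illustrative calls:
print(classify_defect(10, 1.1))   # small and round -> Por
print(classify_defect(40, 1.3))   # large and roundish -> Macro
print(classify_defect(60, 8.0))   # long and thin -> Slits
```

In practice the diameter and aspect ratio would come from a detected contour's bounding box, which is exactly what the image processing strategy in chapter 4 produces.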

1.3 Project Goal and Limitations

The main objective of this master thesis is to develop a machine learning algorithm and an image processing method to classify three different structural defects, described below.

Por: very small defects, under 25 micrometers.
Macro: bigger defects, from 25 micrometers.
Slits: defects that are very small in width but long in height, from 25 micrometers.

Two different strategies will be investigated to solve this problem.

1. The first approach is to implement CNN models to classify these defects. The input is an image, and the output is a classification score.
2. The second approach is to implement an object detection algorithm, where the idea is to detect and classify the individual defects in an image.

This machine learning approach is the first project at Sandvik Coromant for this type of problem. The main question is this: is it possible to implement a machine learning model with sufficient performance to replace the manual process performed by operators? We will investigate this problem from both the academic and the industrial perspective. The main potential limitations for this project are the following. Firstly, differences in contrast and zoom level between images can be hard for CNN models to handle. The optimal starting point would be if all images were of the same quality, taken with the same microscope and with a similar background. Secondly, the testing department uses several lenses with different properties, which can introduce another limitation to the dataset. Thirdly, gathering new images of the same quality is a limitation in itself. Slits and Macro defects are not very common in the test samples, so collecting completely new images from scratch would be a time-consuming process.

The rest of this thesis is divided into eight chapters. Chapters 2 and 3 relate to the first goal of this project, defect classification using CNN models. Chapter 4 covers the second goal, the rule-based image processing algorithms used to detect individual defects in the images. All results and discussions are presented in chapters 5 and 6. Chapters 7 and 8 contain the conclusion and future work.
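The mathematical morphology operations behind the second strategy (chapter 4) can be illustrated on a tiny binary image. This pure-Python sketch of dilation and erosion with zero padding is a toy model, not the thesis implementation; chaining them (dilation followed by erosion, a closing) fills the one-pixel gap in a broken edge, which is why morphology pairs well with Canny edge maps.

```python
def dilate(img, k=1):
    """Binary dilation: a pixel becomes 1 if any pixel within
    Chebyshev distance k is 1 (pixels outside the image count as 0)."""
    h, w = len(img), len(img[0])
    val = lambda i, j: img[i][j] if 0 <= i < h and 0 <= j < w else 0
    return [[1 if any(val(i + di, j + dj)
                      for di in range(-k, k + 1)
                      for dj in range(-k, k + 1)) else 0
             for j in range(w)] for i in range(h)]

def erode(img, k=1):
    """Binary erosion: a pixel stays 1 only if every pixel within
    Chebyshev distance k is 1 (pixels outside the image count as 0)."""
    h, w = len(img), len(img[0])
    val = lambda i, j: img[i][j] if 0 <= i < h and 0 <= j < w else 0
    return [[1 if all(val(i + di, j + dj)
                      for di in range(-k, k + 1)
                      for dj in range(-k, k + 1)) else 0
             for j in range(w)] for i in range(h)]

# A broken horizontal edge with a one-pixel gap in the middle:
edge = [[0, 0, 0, 0, 0],
        [1, 1, 0, 1, 1],
        [0, 0, 0, 0, 0]]

# Closing = dilation followed by erosion; the gap at (1, 2) is filled.
closed = erode(dilate(edge))
```

After the closing, the middle row reads [0, 1, 1, 1, 0]: the gap is bridged, at the cost of the endpoints, which is the usual trade-off of a closing with a small structuring element.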

Chapter 2 Theory of Machine Learning and Neural Networks 2.1 General Introduction to Machine Learning The fundamental principle of machine learning and statistical modeling is to formulate a mathematically formalized way to approximate reality and make future predictions from this approximation [2]. Machine learning is a sub-domain of computer science. The machine directly learns from the underlying structure of input data and get smarter, and they are not hardcoded based on some deterministic rules. Commonly the machine learning algorithms are categorized into three major categories: supervised, unsupervised, and reinforcement. 2.1.1 Supervised Learning The models are trained based on input data X and output data y, and the main idea is for the algorithm to learn the mapping function f from the input to the output. y = f(x) (2.1) In supervised learning, the learning is supervised or watched based on ground truth label or output data; a good analogy is to a teacher overseeing the learning process of a student. Based on the correct names, the algorithm iteratively makes predictions until it reaches an acceptable performance level. The two major domains in this area are: Classification is a set of problems where the goal is to categorize data points into a set of pre-defined classes based on some attributes of the data. Practically these might involve image classification where an algorithm is to describe whatever an image contains a cat or dog, or spam detection where an e-mail is to be classified as spam or not depending on its contents. Often this is discrete values. Support vector machine and random forest are two known algorithms. Regression is another major area; in this scenario, the aim is to estimate some continuous hidden variable based on other observable variables. For example stock price, inflation rate, or housing price. Linear regression is a famous algorithm in this field. 
The first goal of this project is a supervised strategy, where CNN models are implemented to classify material defects. Many deep learning algorithms are supervised, especially for image classification and object detection.

2.1.2 Unsupervised Learning

In unsupervised learning only input data (X) exists, with no corresponding output variables (y). The goal is to model the underlying structure or distribution of the data in order to learn more about it. No ground-truth or correct labels exist in this case. Two major sub-domains are clustering and association problems. K-means [3] is a well-known algorithm in this category.

2.1.3 Reinforcement Learning

This category is action-based learning [4]: an agent takes actions to maximize a reward in a particular situation. The agent is supposed to find the best possible path to reach the reward.

During the training process, the program will favor actions that previously resulted in higher rewards given a similar state. Reinforcement learning requires no labelled data to learn from, as opposed to supervised learning; the input needed is instead a function that calculates the reward. Some applications of reinforcement learning are path exploration, used in computer vision for robots to find a specific room or location; navigating through a complicated maze to find the opening; and games, where the agent searches for the optimal moves to win.

2.2 Training, Validation and Test Sets

Useful data is the backbone of a machine learning model. Machine learning is sometimes perceived as a kind of magic: put some data into a model and it produces a fantastic result. The reality is not that simple; good data will, with the right techniques, hopefully provide useful results. Deep neural networks (DNNs) need not only valuable data but also a lot of it.

Figure 2.1: The split of the total data into three subsets: training, validation and test data

Usually, the total available data is split into three different subsets, the training, validation and test sets, see figure 2.1. The percentage division depends on many factors, for example the number of hyperparameters, the quality of the data and the amount of data.

Training data The training dataset is the most important one. It is the dataset used to train the models; the model sees and learns from this data.

Validation data Validation data is an important benchmark for monitoring how well the training process is going. It is used to cross-validate the performance of the model given some hyperparameters. Overfitting is a common problem in optimization and machine learning, and the validation data is an excellent tool for checking this problem during training and monitoring the model accuracy.
Test data The test set is used to evaluate the final model after the training and validation steps. This data is independent of the training and validation sets. The test dataset is used to obtain the definitive performance characteristics such as evaluation accuracy, sensitivity, specificity, F-measure, etc.

2.3 Fully Connected Networks

A fully connected neural network consists of a series of fully connected layers. It is the simplest variation of neural networks, illustrated in figure 2.2. The network consists of an input layer, some intermediate hidden layers, and an output layer. Each layer usually consists of a linear function followed by a non-linear transform. The input layer consists of the feature vector x. The network is categorized as a deep network if it has more than one hidden layer. As seen in figure 2.2, each layer consists of several neurons, where each neuron in one layer is connected to every neuron in the next layer. The weights from a layer are represented by real numbers and determine the information passed to the next layer. These weights are trained and

optimized during training to increase network performance and decrease the error function. Furthermore, each layer contains a bias term, see figure 2.3. These terms make the network more robust and generalized. A fully connected network is simply a mapping function f as in equation 2.1. Each hidden layer represents a matrix multiplication of the previous layer. Since every hidden layer in a fully connected neural network is an array of neurons, a vector representation is possible. In this case, the three-layer network in figure 2.2 can be seen as ŷ = a^3(a^2(a^1(x))), where a^1 represents the first hidden layer and a^2 the second. The final layer, a^3 in this case, is usually called the output layer, which yields the prediction ŷ. These deep networks are highly non-linear because of the multiple non-linearities in the hidden layers, and they can learn very complex patterns in the given data.

Figure 2.2: A fully connected neural network

Figure 2.3: A node of a fully connected neural network in mathematical terms

The mathematical formulation for a three-layer fully connected network is as follows. First, the input feature vector x is multiplied by a weight matrix W^1 with an added bias vector b^1. The result is then passed through a non-linear activation function σ^1, see figure 2.3, giving the first activation. This process repeats for the following layers to produce the output:

a^1 = σ^1(W^1 x + b^1)
a^2 = σ^2(W^2 a^1 + b^2)
ŷ = σ^3(W^3 a^2 + b^3) (2.2)
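A minimal numerical sketch of this forward pass follows; the layer sizes, the random weights and the choice of sigmoid activation are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
sizes = [5, 4, 3, 2]   # input layer, two hidden layers, output layer (assumed)
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(m) for m in sizes[1:]]

def forward(x):
    a = x
    for W, b in zip(Ws, bs):   # a^l = sigma^l(W^l a^(l-1) + b^l)
        a = sigmoid(W @ a + b)
    return a

y_hat = forward(rng.standard_normal(5))
print(y_hat.shape)             # the output ŷ has one entry per output neuron: (2,)
```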

The general mathematical formulation is as follows:

z^l = W^l a^(l-1) + b^l
a^l = σ^l(z^l)
ŷ = a^L = σ^L(z^L) (2.3)

where l goes from 1 to L and a^0 = x, the input layer. Here z^l is the weighted input to the neurons in layer l before the activation function, a^l is the activation of the neurons in layer l after the activation function, l is the layer number, σ is the activation function, which can be different for every layer, W is the weight matrix for a layer, b is the bias, and L is the total number of layers in the network. All weights W and biases b are parameters that are optimized during the training process, so that ŷ approaches the actual target y. Each entry in the activation vectors a represents a node in the network. The weights W determine the strength of the links between the interconnected layers. The non-linear transformation σ is introduced to increase the network complexity and capture more complex structure in the data; otherwise, the network would only be a linear superposition of linear functions, regardless of the number of layers. The activation function is a hyperparameter, as is the number of hidden layers. It can in principle be a general function, both scalar and vector valued, but it is commonly a scalar, monotonically increasing function. Note that when σ is a scalar function, it is applied element-wise to vector inputs. Some standard activation functions are described later in this section.

2.4 Activation Functions

The use of activation functions has a biological connection: the activation function is an abstraction representing the rate of action potential firing in a cell. It determines whether a neuron fires or not. The choice of activation function has therefore proven to be an essential factor when training deep neural networks. There exist many different types of activation functions, both linear and nonlinear.
These functions map the output of a node to a bounded range, which determines whether the node should fire or not. The most used ones are:

2.4.1 Sigmoid Function

The sigmoid function is a monotonic, smooth and differentiable function, defined for all real values and bounded in the interval (0, 1). The function and its derivative are defined as follows:

σ(x) = 1 / (1 + e^(-x))
dσ(x)/dx = e^(-x) / (1 + e^(-x))^2 (2.4)

The sigmoid function has several useful properties. First of all, it is highly nonlinear, which introduces complexity to the network. Secondly, it is very smooth and easy to implement. However, it also has some significant drawbacks. The primary problem is the vanishing gradient for large absolute x-values: as seen in figure 2.4, the gradient converges toward zero, which results in slow error convergence during backpropagation. The second problem is that the output is not zero-centered, which makes the gradient updates inefficient.

2.4.2 Tanh Function

The hyperbolic tangent function (tanh) is closely related to the sigmoid function. It is bounded in (-1, 1) and centered around zero, see figure 2.4. This makes optimization easier, so in practice it is often a better option than the sigmoid function. The function and its derivative are defined as follows:

σ(x) = (e^x - e^(-x)) / (e^x + e^(-x))
dσ(x)/dx = 4 / (e^x + e^(-x))^2 (2.5)
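The two functions and their derivatives in equations 2.4 and 2.5 can be checked numerically; the sketch below also verifies the close relation between them, tanh(x) = 2σ(2x) - 1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    # e^(-x) / (1 + e^(-x))^2, equivalently sigma(x)(1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-5, 5, 11)

# tanh is a rescaled, zero-centered sigmoid: tanh(x) = 2*sigmoid(2x) - 1
assert np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1)

# check the derivative against a central finite difference
h = 1e-6
num = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
assert np.allclose(num, dsigmoid(x), atol=1e-8)
print("ok")
```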

2.4.3 ReLU Function

The rectified linear unit (ReLU) has become highly popular in the ANN field in the past years, and it is the standard function for hidden layers in CNN networks. The function and its derivative are defined as follows:

σ(x) = max(0, x)
dσ(x)/dx = { 0, x < 0; 1, x ≥ 0 } (2.6)

As equation 2.6 shows, the mathematical form of this function is straightforward and efficient to compute. The big advantage is that it avoids the vanishing gradient problem, in contrast to the sigmoid and tanh functions; therefore the convergence rate is much faster, as empirically shown in [5]. Another advantage is the sparsity of activations in the hidden layers, which induces a regularization effect and reduces overfitting [6]. However, ReLU also has some minor drawbacks: it is mainly useful in the hidden layers, and it can result in dead neurons.

Figure 2.4: The three different activation functions mentioned above

2.4.4 Softmax Function

For a classification problem, the desired output is a probability distribution over (0, 1). This output can be seen as class probabilities, for example to decide if a picture shows a dog, a cat or a human with a certain probability. For this purpose, the softmax activation function is implemented in the last ANN layer, the output layer. It is defined as follows:

σ_i(x) = e^(x_i) / Σ_{c=1}^{C} e^(x_c)
dσ_i/dx_j = σ_i(x) (δ_ij - σ_j(x)) (2.7)

where C is the total number of classes and δ is the Kronecker delta, used to simplify the expression. Softmax is a generalized version of the sigmoid function 2.4, which is used for binary classification. Softmax

extends this idea to multi-class classification: it assigns a decimal probability to each class in a multi-class problem, and those probabilities add up to 1.0.

2.5 Forward Propagation and Loss Functions

During the training of an ANN, the first step is the forward pass, where the data is passed through the network from the input layer to the final layer. The forward pass calculates the weighted sums, applies the activation functions, predicts the outcome, and computes the error, i.e., the difference between the predicted value and the actual value. A neural network can contain millions of weights; hence, weight initialization is very important to keep the layer activations from exploding or vanishing during a forward pass through the network. A proper weight initialization improves the performance and the convergence speed of the network. Some common initializers are Zeros, Ones, Random Normal, and Random Uniform. Glorot Normal [7] is another popular method which has shown good results recently; it draws samples from a truncated normal distribution centered on 0, with a standard deviation adapted to the number of input and output weights in a layer.

2.5.1 Square Loss or L2-Norm

The prediction error is quantified by a loss function. A widespread choice is the quadratic loss function or square loss, defined as follows in vector form:

E = (1/2) ||ŷ - y||^2 = (1/2) (ŷ - y)^T (ŷ - y) (2.8)

where y contains the ground-truth values and ŷ the predicted values. It is a common loss function for regression.

2.5.2 Cross Entropy Loss

A more specific loss function for classification is the cross-entropy loss. It measures the performance of a classification model based on class probabilities in the range 0 to 1. It is defined as follows:

E = -Σ_{c=1}^{C} y_c ln ŷ_c = -y^T ln ŷ (2.9)

In a binary classification problem, where C = 2, the cross-entropy loss can be simplified as
E = -y ln ŷ - (1 - y) ln(1 - ŷ) (2.10)

The binary cross-entropy loss 2.10 combined with the sigmoid function 2.4 is named binary cross entropy in many existing machine learning frameworks, and the softmax function 2.7 combined with 2.9 is called categorical cross entropy loss.

2.6 Backpropagation and Partial Derivatives

Backpropagation is the backbone of ANN training. It is a supervised learning technique for neural networks that calculates the gradient of the loss with respect to all the weights in the network. The name is short for the backward propagation of errors, since the error is computed at the final layer and distributed backward through the network's layers. The goal is to minimize the prediction error calculated by a forward pass. The error is mathematically formulated in terms of a loss function E(ŷ, y), where ŷ is the predicted label and y the ground truth. This loss function is minimized with respect to the weights and biases in the network. Figure 2.2 shows three consecutive layers of a network, denoted l - 1, l, l + 1, to illustrate the derivatives with respect to the weights and biases; the neurons in those layers are indexed by n_(l-1), n_l and n_(l+1). The first step is to define a loss function E. The derivative with respect to a single weight in layer l is calculated with partial derivatives and the chain rule:

∂E/∂w^l_(n_l n_(l-1)) = (∂E/∂z^l_(n_l)) (∂z^l_(n_l)/∂w^l_(n_l n_(l-1)))
= (∂E/∂a^l_(n_l)) (∂a^l_(n_l)/∂z^l_(n_l)) (∂z^l_(n_l)/∂w^l_(n_l n_(l-1)))
= ( Σ_(n_(l+1)) (∂E/∂z^(l+1)_(n_(l+1))) w^(l+1)_(n_(l+1) n_l) ) σ'(z^l_(n_l)) a^(l-1)_(n_(l-1)) (2.11)

where z and a are defined in equation 2.3. Equation 2.11 represents the derivative with respect to one specific weight; n_(l-1) and n_l are fixed. The summation arises because all contributions from the neurons in layer l + 1 have to be accounted for, since their values affect the error function. The term ∂E/∂z^l_(n_l) in 2.11 is normally called the error signal:

δ^l_(n_l) = ∂E/∂z^l_(n_l) = ( Σ_(n_(l+1)) δ^(l+1)_(n_(l+1)) w^(l+1)_(n_(l+1) n_l) ) σ'(z^l_(n_l)) (2.12)

It encapsulates the error for each node in the network. The derivatives for the biases are:

∂E/∂b^l_(n_l) = (∂E/∂z^l_(n_l)) (∂z^l_(n_l)/∂b^l_(n_l)) = ∂E/∂z^l_(n_l) = δ^l_(n_l) (2.13)

Four equations incorporate the entire backpropagation algorithm in general matrix form [8]:

δ^L = ∇_a E ⊙ σ'(z^L)
δ^l = ((W^(l+1))^T δ^(l+1)) ⊙ σ'(z^l)
∂E/∂b^l = δ^l
∂E/∂W^l = δ^l (a^(l-1))^T (2.14)

The first equation computes the error in the final layer, where z^L is defined in equation 2.3 and ⊙ denotes the Hadamard product, i.e., element-wise multiplication. The second equation propagates the error through the hidden or intermediate layers in a recursive way. The third gives the gradients for the biases, and the last calculates the weight gradients between two successive layers.

2.7 Optimization Methods in Neural Networks

A fundamental part of a machine learning algorithm is minimizing a given loss function to reduce the error, and this is also the case for neural networks. During the training of an ANN, the cost or loss function is optimized with respect to the weights and biases; the main goal is to find the global minimum. The ideal scenario would be to compute the first and second derivatives of the function and find the global minimum analytically, as in a calculus exam question, but that is not possible for the majority of real-world problems. As described in the earlier sections, a neural network is very complex and highly nonlinear.
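The backpropagation equations can be verified numerically on a tiny example. The sketch below (a two-layer sigmoid network with square loss, random weights, all invented for illustration) computes δ^L, δ^l and a weight gradient, and checks one entry against a finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)
x, y = rng.standard_normal(4), rng.standard_normal(2)

def loss(W1_):
    a1 = sigmoid(W1_ @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return 0.5 * np.sum((a2 - y) ** 2)

# forward pass, keeping z and a per layer
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

sp = lambda z: sigmoid(z) * (1 - sigmoid(z))  # sigma'(z)
delta2 = (a2 - y) * sp(z2)                    # delta^L = grad_a E (.) sigma'(z^L)
delta1 = (W2.T @ delta2) * sp(z1)             # delta^l = (W^(l+1)^T delta^(l+1)) (.) sigma'(z^l)
dW1 = np.outer(delta1, x)                     # dE/dW^l = delta^l (a^(l-1))^T

# compare one weight gradient with a central finite difference
i, j, h = 1, 2, 1e-6
Wp = W1.copy(); Wp[i, j] += h
Wm = W1.copy(); Wm[i, j] -= h
num = (loss(Wp) - loss(Wm)) / (2 * h)
assert abs(num - dW1[i, j]) < 1e-6
print("gradient check passed")
```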
The solution is iterative methods that take advantage of the gradient. Many different types of gradient-based algorithms exist. Some common ones in this field are:

Gradient descent (GD) iteratively moves in the direction of steepest descent, as defined by the negative of the gradient, to reach the minimum; in this case, it updates the weights to find a minimum of the loss. The algorithm is defined as follows:

w_(t+1) = w_t - γ ∇E(w_t) (2.15)

Backpropagation computes the gradient and GD updates the weights. Here t is the iteration index and γ is a hyperparameter called the learning rate. Sometimes the entire dataset is used for these updates, but for a neural network the dataset can be extensive, and using the whole dataset can be costly and inefficient.

A mini-batch approach combined with GD is another training procedure, called stochastic gradient descent (SGD), where the complete dataset is divided into smaller portions. This approach is much faster and more efficient. However, SGD can induce some fluctuation in the parameter updates because of this stochasticity. Another problem with SGD and GD is that they can get stuck in, or oscillate around, a local minimum [9].

Momentum [10] is a more refined method that helps accelerate SGD in the appropriate direction and dampens oscillations:

υ_t = η υ_(t-1) - γ ∇E(w_t)
w_(t+1) = w_t + υ_t (2.16)

where η is the momentum coefficient and υ is the retained gradient (velocity). Typically η is set to around 0.9.
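A small sketch of the momentum update in equation 2.16 follows; the toy quadratic objective and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

# SGD with momentum, eq (2.16), on a toy quadratic E(w) = 0.5 w.w
# whose gradient is simply w. Illustrative only, not the thesis setup.
grad = lambda w: w

w = np.array([4.0, -2.0])
v = np.zeros_like(w)
eta, gamma = 0.9, 0.1              # momentum coefficient and learning rate

for _ in range(200):
    v = eta * v - gamma * grad(w)  # retain a fraction of the previous velocity
    w = w + v                      # move with the accumulated velocity

print(np.linalg.norm(w) < 1e-2)    # converged close to the minimum at the origin
```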

Nesterov accelerated gradient (NAG) is a more precise method based on momentum [11]. It adds a look-ahead term to the momentum algorithm before updating, which makes it smarter and more accurate:

υ_t = η υ_(t-1) - γ ∇E(w_t - η υ_(t-1))
w_(t+1) = w_t + υ_t (2.17)

where w_t - η υ_(t-1) is used to compute an approximate derivative for the next update. This anticipatory update prevents the update from taking too big a jump, which makes the algorithm more efficient.

Many other, more efficient algorithms exist nowadays, with learning rates that adapt to the parameters. We will not cover all of them; the two most used ones are RMSprop and Adam.

Root mean square propagation (RMSprop) is an unpublished optimization algorithm designed for neural networks; interestingly, it has never been officially published but is still widely used. Geoffrey Hinton proposed it in the lecture series Neural Networks for Machine Learning [12]. We will not go into much detail here; the main principle is an exponentially decaying average of the squared gradients, used to scale the learning rate, combined with momentum as in NAG. It adapts the learning rate to the dataset.

Another widely used algorithm is adaptive moment estimation (Adam), proposed by D. Kingma and J. Ba in 2015 [13]. Adam adapts the parameter learning rates based on the average first moment (the mean) as in RMSprop, but it also makes use of the average of the second moments of the gradients (the uncentered variance). To estimate the moments, Adam uses exponentially moving averages computed on the gradient evaluated on the current mini-batch:

m_t = β_1 m_(t-1) + (1 - β_1) g_t
v_t = β_2 v_(t-1) + (1 - β_2) g_t^2 (2.18)

where m and v are moving averages, g is the gradient on the current mini-batch, and the exponential decay rates β_1 and β_2 are hyperparameters, with standard values 0.9 and 0.999 respectively.
The vectors of moving averages are initialized with zeros at the first iteration. Additionally, bias-corrected moment estimates m̂ and v̂ are introduced to counteract the bias of the estimates towards zero, an effect of the zero-initialized exponential moving averages which is most evident in the early stages of training:

m̂_t = m_t / (1 - β_1^t)
v̂_t = v_t / (1 - β_2^t) (2.19)

Adam is currently the most used optimizer in neural networks, because it has shown good empirical results. The complete algorithm is outlined below [13].

Algorithm 1: Adaptive moment estimation (Adam) optimizer
Require: α: stepsize
Require: β_1, β_2 ∈ [0, 1): exponential decay rates for the moment estimates
Require: f(θ): stochastic objective function with parameters θ
Require: θ_0: initial parameter vector
  m_0 ← 0 (initialize 1st moment vector)
  v_0 ← 0 (initialize 2nd moment vector)
  t ← 0 (initialize timestep)
  while θ_t not converged do
    t ← t + 1
    g_t ← ∇_θ f(θ_(t-1)) (get gradients w.r.t. stochastic objective at timestep t)
    m_t ← β_1 m_(t-1) + (1 - β_1) g_t (update biased first moment estimate)
    v_t ← β_2 v_(t-1) + (1 - β_2) g_t^2 (update biased second raw moment estimate)
    m̂_t ← m_t / (1 - β_1^t) (compute bias-corrected first moment estimate)
    v̂_t ← v_t / (1 - β_2^t) (compute bias-corrected second raw moment estimate)
    θ_t ← θ_(t-1) - α m̂_t / (√v̂_t + ε) (update parameters)
  end while
  return θ_t (resulting parameters)
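Algorithm 1 can be sketched directly in code; the toy quadratic objective, step count and stepsize below are illustrative assumptions, not the thesis configuration.

```python
import numpy as np

def adam(grad, theta, steps=2000, alpha=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimal Adam sketch following Algorithm 1 (illustrative, not tuned)."""
    m = np.zeros_like(theta)   # 1st moment vector
    v = np.zeros_like(theta)   # 2nd moment vector
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g       # biased first moment estimate
        v = beta2 * v + (1 - beta2) * g * g   # biased second raw moment estimate
        m_hat = m / (1 - beta1 ** t)          # bias correction, eq (2.19)
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy objective E(theta) = 0.5 ||theta||^2, so the gradient is theta itself.
theta = adam(lambda th: th, np.array([3.0, -1.5]))
print(np.abs(theta).max() < 0.1)   # ends near the minimum at the origin
```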

Batch training is a standard approach for all these algorithms. This strategy is much more computationally efficient and more resistant to sample noise. The total data is randomly divided into smaller batches, and the final result is averaged over all these batches.

2.8 Convolutional Neural Networks

The convolutional neural network (CNN) is a class of deep neural networks. It is a powerful technique for image visualization, image classification, and object detection. CNNs were inspired by the human vision system, the primary visual cortex in the brain. When we see an object, the light receptors in the eyes send signals via the optic nerve to the primary visual cortex, the central processing unit for visual input. A newborn child does not know the difference between a car and a bus. A child learns to recognize objects from its direct environment and parents, seeing these objects countless times during childhood, and the brain learns the specific features of these objects. All of this seems very simple and natural to us; however, it is a very complex process. A computer is not as flexible and complex as the brain. A computer algorithm needs millions of pictures before it learns all the features and can generalize the input and make predictions for images it has never seen before. CNN models have proven to be very good at this task. Y. LeCun introduced the concept in 1989 [14].

The CNN architecture differs from the fully connected networks described in the previous section. Convolutional neural networks do not connect every neuron in each layer to every neuron in the next layer but instead take advantage of weight sharing. The weights are connected locally, meaning one node only connects to spatially adjacent nodes in the next layer; the network thereby considers the spatial structure of the data. This concept is especially useful for images, because it is reasonable to assume that every pixel is correlated with its neighbors.
The convolution used in CNNs is similar to the concept of convolution in mathematics, especially in the fields of signal processing and Fourier analysis. The mathematical convolution of 1D (one-dimensional) discrete functions is defined below:

(k ∗ x)[n] = Σ_(i=-∞)^(∞) k[i] x[n - i] (2.20)

where, in the deep learning setting, k is called the kernel and x is the input. The convolution produces a third function, which is in some sense a similarity measure between the two functions.

Figure 2.5: A convolutional neural network, with several blocks

Convolutional neural networks are constructed by concatenating several individual blocks that achieve different tasks. Every layer of a CNN transforms one volume of feature maps to another through a nonlinear activation function. Figure 2.5 illustrates a network with a convolutional layer, a pooling layer, and a fully connected layer. The input data is images, and the output is typically a class score for every image.

2.8.1 Convolutional Layers

An image is represented in the computer as matrices or tensors of RGB values. Therefore, the convolutional filters or kernels in this case are matrices. The convolution operates by letting the kernel slide over the input while computing the output on one patch of the input at a time, taking into account the spatial relation of the elements in the input. The equation below represents a

general convolutional operation for images:

Z_ij = (X ∗ K)(i, j) = Σ_(l=1)^(K_1) Σ_(m=1)^(K_2) Σ_(n=1)^(C) X(i + l, j + m, n) K(l, m, n) (2.21)

The input image is a 3D tensor of size [D_1, D_2, C], where D_1 is the height, D_2 the width and C the number of channels. The corresponding kernel is also a 3D tensor, of size [K_1, K_2, C]; a convolutional layer applies P such kernels, giving P output channels. According to equation 2.21, the kernel depth matches the number of image channels, so the channel dimension is consumed by the sum. The convolution operation depends only on the spatial coordinates, and its output is a feature map Z.

Figure 2.6: A convolution operation between the input X and the kernel K producing the feature map Z

As seen in figure 2.6, the kernel slides over the entire input image and produces an output. The input image is 7x7 and the output is 5x5, i.e., reduced by 2 in each dimension. The focus point of a kernel is always the center weight; consequently, the filter misses the image edges, since the kernel would end up outside the image if it were placed directly at the edges. To handle the edges during convolution, zero padding must be introduced. Zero padding adds extra zeros around the image to increase its size; those zeros do not add any additional features. In this case, the input size increases to 9x9 and the output becomes 7x7, as the original image, which is the desired effect. The following hyperparameters determine the size of the output image:

D_out = (D_in - D_k + 2p) / S + 1 (2.22)

where D_in is the input size, D_k the kernel size, p the zero padding, and S the stride. The stride parameter decides the filter overlap for each spatial coordinate; in the example above, the stride is one.
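The sliding-window convolution of a single-channel image, as in figure 2.6, can be sketched as follows. The input values are loosely reconstructed for illustration (a diagonal stripe pattern and a cross-shaped kernel), not copied exactly from the figure.

```python
import numpy as np

def conv2d_valid(X, K):
    """Slide kernel K over image X (stride 1, no padding), per eq (2.21)."""
    H, W = X.shape
    k1, k2 = K.shape
    Z = np.zeros((H - k1 + 1, W - k2 + 1))
    for i in range(Z.shape[0]):
        for j in range(Z.shape[1]):
            # element-wise product of the kernel and the patch, then sum
            Z[i, j] = np.sum(X[i:i + k1, j:j + k2] * K)
    return Z

X = np.array([[0, 1, 1, 1, 0, 0, 0],
              [0, 0, 1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1, 1, 0],
              [0, 0, 0, 1, 1, 0, 0],
              [0, 0, 1, 1, 0, 0, 0],
              [0, 1, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0, 0]])
K = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]])
print(conv2d_valid(X, K).shape)   # a 7x7 input and 3x3 kernel give a 5x5 map
```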
The stride determines how far the filter shifts at each step. If the stride equals the kernel size, each input pixel is covered by the filter only once, and the spatial correlation is ignored. For a stride of one, the padding needed to preserve the input size is

p = (D_k - 1) / 2 (2.23)

After each convolutional layer, it is a convention to apply a nonlinear activation function. The purpose of this layer is to introduce nonlinearity to a system that has so far only computed linear operations; for more details, see the previous sections.

2.8.2 Pooling Layers

The pooling or downsampling layer is implemented to reduce the spatial size of the activation maps. In general, pooling layers are used after multiple stages of other layers (i.e., convolutional and non-linearity layers) to progressively reduce the computational requirements through the network, as well as to minimize the likelihood of overfitting, see figure 2.5. A pooling layer has two hyperparameters: the filter size and the stride. There are several pooling techniques; an often used one is max pooling, see figure 2.7. Max pooling operates by finding the highest value within the filter region and discarding the rest of the values.
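A minimal sketch of max pooling follows, using a 2x2 window with stride 2 on a small example input (values chosen to match the kind of example shown in figure 2.7).

```python
import numpy as np

def max_pool(X, size=2, stride=2):
    """Max pooling: keep the largest value in each size x size window."""
    H = (X.shape[0] - size) // stride + 1
    W = (X.shape[1] - size) // stride + 1
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            patch = X[i * stride:i * stride + size,
                      j * stride:j * stride + size]
            out[i, j] = patch.max()
    return out

X = np.array([[2, 2, 9, 2],
              [9, 6, 4, 3],
              [5, 0, 9, 3],
              [7, 5, 2, 2]])
print(max_pool(X))   # the 4x4 input reduces to 2x2: [[9, 9], [7, 9]]
```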

Figure 2.7: A max pooling with kernel size 2x2 and stride 2

Other options for pooling layers are average pooling and L2-norm pooling. Max pooling has demonstrated faster convergence and better performance in comparison to average pooling and the L2-norm [15].

2.9 Regularization

Overfitting is a significant problem in the field of data science and machine learning. An overfitted model is a statistical model that contains more parameters than can be justified by the data. An overfitted model performs exceptionally well on training data but is not able to produce good predictions on test data: the model captures the noise or irrelevant information in the dataset during the training step. This phenomenon is more probable for a complex model, and deep neural networks are very complex models, which makes them prone to overfitting. Regularization is a set of techniques that can reduce this effect: some minor modifications are introduced to the learning algorithm such that the trained model generalizes better, which in turn improves the model's performance on future predictions. Several regularization techniques exist; two important ones are described in the next sections.

2.9.1 Dropout

Dropout is a popular regularization technique in the field of neural networks [15]. The key idea is to randomly drop units from the neural network during training, by setting them to zero. This prevents units from co-adapting too much. The number of active weights decreases, and with it the model complexity.

Figure 2.8: The dropout effect in a network

The dropout technique has a hyperparameter in [0, 1], the dropout percentage. Dropout can be implemented after each layer in the network if necessary.

2.9.2 Early Stopping

Early stopping is another powerful and widely used regularization technique in deep learning [16].
Usually, the available data is divided into three different categories: training data, validation data, and test

data. The training data is used to train the model. The validation set is used to validate the performance of the model during each training step, and the errors on both sets are monitored at each step. When the performance on the validation set gets worse, or has not improved after some number of subsequent epochs, the training is stopped, see figure 2.9. This prevents the model from overfitting and improves generalization.

Figure 2.9: The process of early stopping

2.10 Artificial Neural Networks

Artificial neural networks (ANN) were first introduced around 1943 by W. McCulloch and W. Pitts [17]. Their work was inspired by the human brain and the character of nervous activity. They created the first computational model for neural networks based on propositional logic or threshold logic, according to the notion of the "all-or-none" character of nervous activity described in their article. D. O. Hebb later developed the Hebbian theory, which discusses the mechanism of neural plasticity and the interaction between neurons. Rochester, Holland, Habit, and Duda developed neural networks further and performed simulations on IBM computers [18, 19]. Finally, in 1958, F. Rosenblatt introduced the perceptron model [20]; it extended McCulloch and Pitts' work by introducing the concept of association cells. A perceptron is simply a single-layer neural network in today's notation, and the multi-layer perceptron is the building block of today's deep neural networks. Even though Rosenblatt's methodology is closely linked to the current structure of deep neural networks, such networks would not become fully applicable and competitive with other, much simpler methods until many decades later. In 1969, Minsky and Papert pointed out the limitations of the perceptron machines of that time. The first important limitation was that the perceptron was incapable of processing the exclusive-or circuit.
The second major limitation was that neural networks required vast amounts of data and computing power, which were not available at the time [21]. Another critical component that was missing was an efficient way of training the networks, first introduced in 1974 by Werbos [22]: the idea of backpropagation, where the errors are propagated backward through the network by differentiation using the chain rule. The more general technique is called automatic differentiation, and Werbos applied it to neural networks in 1982 [23] in the form used today. In recent years, neural networks have become a popular approach to many machine learning problems, especially within the field of computer vision. These networks apply to both supervised and unsupervised learning.

Chapter 3
Classifying Material Defects using Neural Networks

This chapter covers the practical implementation of two different CNN models for defect classification, related to the first goal of this project. Sections 3.1 and 3.2 outline the dataset and the data processing. Section 3.3 discusses the different data augmentation techniques used to enlarge the original dataset for CNN training. The neural network architectures and the training process are presented in section 3.4.

3.1 Dataset

As described in section 1.2, the total gathered dataset is circa 1000 images in three defect categories. The categories are defined by the defects' geometrical properties, see table 3.1.

Table 3.1: The three defect categories for this thesis

Category | Size | Other criterion
Por | < 25 µm | [0.06, 0.6] volume %
Macro | > 25 µm | length:width < 5:1
Slits | > 25 µm | length:width > 5:1

In the original dataset, the Por category is divided into types A and B. During this project, however, no distinction is made between type A and B as long as the volume area of the defects is greater than 0.06 volume percent. In fact, in some images both types co-exist, see figure 3.1, and it is sometimes difficult to make a clear distinction between them; the essential factor is the total volume percentage. In figure 3.1, the left image contains 0.06 volume percent of type B, while the right image contains 0.02 of type A and 0.04 of type B. The volume percent is an indicator of how many defects exist per image. For the categories Slits and Macro, the vital factor is instead the size of the defect: if the defect size (height or width) is greater than 25 µm, it is considered unacceptable. The risk of product failure is much higher for larger defects; therefore, the requirements and standards are stricter. The distinction between Macro and Slits is only the ratio of height to width.
If the ratio of height to width is equal to or greater than 5:1, the defect is classified as a Slit, otherwise as a Macro, as shown in table 3.1. Slits are thin and long, while Macros are thick and short, see figure 1.2. In total there are 440 Macro, 360 Por, and 340 Slits images.
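As a minimal sketch (not part of the thesis pipeline), the geometric rules in table 3.1 can be expressed as a small function; the 25 µm threshold and the 5:1 ratio come from the table, while the function name and the assumption that sizes are given in micrometres are illustrative:

```python
def classify_defect(height_um: float, width_um: float) -> str:
    """Classify a single defect by the geometric rules in table 3.1.

    Sizes are assumed to be given in micrometres (µm).
    """
    size = max(height_um, width_um)
    if size < 25:                       # small defects fall in the Por category
        return "Por"
    ratio = max(height_um, width_um) / min(height_um, width_um)
    if ratio >= 5:                      # long and thin -> Slit
        return "Slit"
    return "Macro"                      # large but compact -> Macro

print(classify_defect(100, 10))  # a 100x10 µm defect is a Slit
```

The same rule reappears in chapter 4, where detected bounding boxes are classified by their height and width.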

Figure 3.1: The existence of pores in the images. The left image is of type A00B06 and the right image of type A02B04.

3.2 Data Collection and Data Cleaning

The first phase of this project was to understand the problem, break it down into smaller pieces, and make a preliminary time plan. In this phase, the data collection and data cleaning were performed, in collaboration with colleagues in Sandvik Coromant's quality lab. The gathered data was stored on C-drives of the computers in the lab, spread over different folders. The data was neither well structured nor sorted into categories, and the three defects analyzed in this project are just a few of the defect types occurring in the manufactured products. In total there were approximately 30000 images, so this step was quite tedious and time-consuming. After this process, the collected images were classified into the three categories of this project with the help of an operator. We managed to find circa 1000 useful images. The majority of these were of type Macro and Slits, taken at 1000x magnification. During this process some new images were also gathered, mainly of type Por at 100x magnification. The operators managed to find test samples with a high probability of containing these three defects; Macro and Slits were very rare to find. After a few weeks of work, approximately 1100 images were available.

3.3 Data Augmentation

Deep neural networks are highly complex, with several million weights that must be optimized and fitted to the training data. It is therefore not only important to have a good model for the specific problem; it is at least as important that the dataset is of high quality and sufficient size. In most deep learning applications, however, the number of parameters in the network exceeds the number of data points in the dataset, sometimes by a large margin.
This becomes problematic if the network is not trained carefully, since it is easy for the model to overfit the training data. Several regularization methods have been developed over the past years to address this problem, and data augmentation is one of the most widely used. The main idea is to enlarge the dataset by modifying the images in various ways, so that a greater variety of data is shown to the network. This effectively increases the size of the dataset and reduces the risk of overfitting. In this section, several basic data augmentation methods are described. For this part Keras, Imgaug [24] and Augmentor [25] are used.

Horizontal and Vertical Flip

Horizontal and vertical flipping are very efficient and common data augmentation methods, see figures 3.2 and 3.3. They are used so that the trained model becomes invariant to mirrored objects. This is particularly useful for this project, because the probability that mirrored images occur is high. Approximately 50 % of the images are affected by these methods, so the models are trained on original and mirrored images in equal proportion.

Figure 3.2: Horizontal flip of an image; original image on the left, flipped image on the right.

Figure 3.3: Vertical flip of an image; original image on the left, vertically flipped image on the right.

Rotation

Rotation is another efficient and intuitive augmentation method. The rotation angle is uniformly distributed in [-θmax, θmax]. In figure 3.4 the maximum rotation angle is set to 45 degrees. The main motivation is that the network has to recognize the object in any orientation. Rotation can, however, create problems for some applications: the rotated image adds extra background regions, see the four extra corner areas created in the figure. If this background is very different from the rest of the image, it will certainly create problems, because the network can learn false features. However, these extra edges can be useful in this case, because they have the same intensity and color as the rest of the image, so the network can treat them as real edges. The probability is high that future images used for prediction contain edges. Rotation has been applied to all images with random rotation angles.

Figure 3.4: Rotation of an image; original image on the left, rotated image on the right.

Translation

Translation is another common augmentation method. It helps the network recognize the object in any part of the image; the object may also be only partially visible near a corner or edge. For this reason, the object is shifted to various parts of the image, see figure 3.5. Like rotation, this method can create extra edges, which may later cause problems. But in this case the extra edges can be considered real by the network, which helps it generalize to future images that contain real edges.

Figure 3.5: Translation of an image; original image on the left, translated image on the right.

Zooming

Zooming is a good augmentation method for making the models invariant to the size of objects, see figure 3.6. A zoom range of [0.8, 1.0] means zooming in by at most 20 %. Having differently scaled objects of interest in the images is an important aspect of image diversity. This method may be extra helpful for this specific dataset due to the magnification issue: as mentioned earlier, the images are taken at two different magnifications, 100x and 1000x, and zoom augmentation may cancel this effect.
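The flip, rotation, translation, and zoom augmentations described above can be combined in a single Keras generator. The following is a minimal sketch; the 45-degree rotation and the [0.8, 1.0] zoom range follow the text, while the 10 % shift fraction and the fill mode are illustrative assumptions:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# One generator combining the augmentations described in this section.
datagen = ImageDataGenerator(
    horizontal_flip=True,     # mirror left-right with probability 0.5
    vertical_flip=True,       # mirror top-bottom with probability 0.5
    rotation_range=45,        # random rotation in [-45, 45] degrees
    width_shift_range=0.1,    # horizontal translation (assumed fraction)
    height_shift_range=0.1,   # vertical translation (assumed fraction)
    zoom_range=[0.8, 1.0],    # zoom in by at most 20 %
    fill_mode="nearest",      # fill the extra corner regions
)

# Example: augment a dummy batch of one 224x224 RGB image.
batch = np.random.rand(1, 224, 224, 3)
augmented = next(datagen.flow(batch, batch_size=1))
print(augmented.shape)
```

Each call to the generator yields a new randomly augmented batch, so the effective dataset size grows with the number of training iterations.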

Figure 3.6: Zooming of an image; original image on the left, zoomed image on the right.

Gaussian Noise

Adding Gaussian noise is a powerful technique to reduce overfitting to some extent and make the models more robust to injected noise, see figure 3.7. It also makes it more difficult for the network to memorize very small pixel-level details that are not relevant for the problem. Gaussian noise is a common choice, but many other noise types exist, such as Laplace and Poisson noise. Adding salt-and-pepper noise, which presents itself as random black and white pixels spread through the image, is another possibility. It is similar to the effect of Gaussian noise, but may have a lower information distortion level.

Figure 3.7: Gaussian noise added to an image; original image on the left, noised image on the right.

3.4 Neural Network Architectures

The theory of CNNs was presented in section 2.10. As described, these networks contain many different parts; in this section, these ingredients are connected to build complete architectures. Finding a suitable architecture is not a trivial task; there are many hyperparameters to cross-validate, and new, improved, and more complex architectures are presented every year. The ImageNet challenge [?] is an annual challenge used as the benchmark. The purpose of this challenge is to classify millions of images into many different categories. During the last few years, several successful architectures have improved the accuracy further. The breakthrough for CNN models came in 2012,

where the network called AlexNet won the competition by a big margin compared to other methods [5]. This significant progress motivated many researchers and big companies to invest time and money in this field. In the following years new architectures were invented, and CNNs continued to dominate, decreasing the error rate each year. Two different designs are investigated in this thesis. One of them, the VGG network, performed at the top of the ImageNet challenge in 2014. The other is a customized design, much smaller compared to VGG.

3.4.1 VGG Architecture

The VGG network is a relatively compact and straightforward architecture using small 3x3 convolutional layers with stride one, stacked on top of each other in increasing depth. 2x2 max pooling with stride two is added to reduce the image volume and the number of weights. Two fully connected layers, each with 4096 nodes, are followed by a softmax classifier. The final softmax output has 1000 nodes, corresponding to the 1000 classes of the ImageNet competition. Simonyan and Zisserman [26] suggested this network, see figure 3.8. The convolutional layers have a much smaller size and receptive field in VGG compared to previous nets like AlexNet, which typically had 7x7 kernels and a larger receptive field. However, several smaller kernels stacked together give the same effect: for example, three stacked 3x3 kernels have the same receptive field as one 7x7 kernel. Smaller kernels enhance the network's nonlinearity, which can capture more sophisticated features, and the stacked version achieves the same receptive field with fewer weights. There are two VGG variants, one with 16 layers called VGG16 and one with 19 layers called VGG19; both have the same structure. In table 3.2 a model summary is presented. The number of channels starts with three input channels and increases from 64 to 512.
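As a small worked check of the 3x3 stacking argument above (illustrative, not from the thesis): with C input and C output channels, three stacked 3x3 convolutions use 3 * (3 * 3 * C * C) = 27C² weights, while a single 7x7 convolution with the same 7x7 receptive field uses 49C²:

```python
def conv_weights(kernel, channels, layers=1):
    """Weights (ignoring biases) for `layers` stacked kernel x kernel
    convolutions, each with `channels` input and output channels."""
    return layers * kernel * kernel * channels * channels

C = 64  # assumed channel count, e.g. the first VGG block
stacked_3x3 = conv_weights(3, C, layers=3)   # receptive field 7x7
single_7x7 = conv_weights(7, C, layers=1)    # receptive field 7x7
print(stacked_3x3, single_7x7)  # 110592 200704
```

The stacked version therefore uses roughly 45 % fewer weights while also applying three nonlinearities instead of one.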
An additional dropout layer is added to the network to reduce overfitting, and the original 1000-class softmax is replaced by a 3-class softmax for this thesis.

Figure 3.8: VGG-16 neural network architecture [1]

Table 3.2: Decomposition of VGG19

Layer name        Number of layers   Number of channels
Convolution       12                 [64, 128, 256, 512]
Max pooling       4                  [64, 128, 512]
Fully connected   3                  [4096, 512, 3]
Dropout           2                  [4096, 512]

3.4.2 Customized Architecture

All the architectures developed for the ImageNet competition are large networks, and they keep growing; VGG is one of the smaller ones. Two other important networks are the residual

network (ResNet), developed by He et al. [27] in 2016, and DenseNet, developed by Huang et al. [28] in 2017. Training these networks, with their millions of weights, requires a large amount of training data, which is out of scope for this thesis. Instead, a smaller customized architecture has been implemented to compare against VGG19. The architecture is similar to VGG but with fewer layers and with batch normalization for the hidden layers. The concept of batch normalization was introduced in 2015 by S. Ioffe et al. [29]; it came after the VGG nets, which explains why they do not use it. Batch normalization is a powerful technique for improving the speed, performance, and stability of artificial neural networks. In table 3.3, a model summary for this customized network is presented. The number of channels starts with three input channels and increases from 16 to 128. Two additional dropout layers and five batch normalization layers are added to the network to reduce overfitting. Finally, a three-class softmax function classifies the defects.

Algorithm 2: Batch Normalizing Transform, applied to activation x over a mini-batch
Require: Values of x over a mini-batch B = {x_1, ..., x_m}; parameters to be learned: γ, β
  µ_B ← (1/m) Σ_{i=1}^{m} x_i                    (mini-batch mean)
  σ²_B ← (1/m) Σ_{i=1}^{m} (x_i − µ_B)²          (mini-batch variance)
  x̂_i ← (x_i − µ_B) / sqrt(σ²_B + ε)             (normalize)
  y_i ← γ x̂_i + β ≡ BN_{γ,β}(x_i)                (scale and shift)
return y_i = BN_{γ,β}(x_i)

The algorithm first calculates the mean and variance of each mini-batch, then normalizes, and finally performs a scale and shift. Note that γ and β are learned during training, along with the original parameters of the network. The constant ε is added to the mini-batch variance for numerical stability. BN_{γ,β} is called the Batch Normalizing Transform.
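Algorithm 2 can be sketched in a few lines of NumPy (an illustrative forward pass only, not the thesis code; γ, β, and ε are taken as given):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform (Algorithm 2), forward pass.

    x: mini-batch of activations, shape (m, features).
    gamma, beta: learned scale and shift, shape (features,).
    """
    mu = x.mean(axis=0)                    # mini-batch mean
    var = x.var(axis=0)                    # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
print(y.mean(axis=0))  # each feature is normalized to (approximately) zero mean
```

With γ = 1 and β = 0 each feature of the output has zero mean and approximately unit variance; during training the network learns γ and β to undo the normalization where that is beneficial.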
Batch normalization has other great benefits: the network can use a higher learning rate without vanishing or exploding gradients, and it has a slight regularization effect, similar to dropout, that reduces overfitting. In contrast to dropout, however, no information is lost.

Table 3.3: Decomposition of the customized network

Layer name            Number of layers   Number of channels
Convolution           5                  [16, 32, 64, 128]
Max pooling           3                  [16, 32, 64]
Fully connected       3                  [1024, 512, 3]
Dropout               2                  [64, 512]
Batch normalization   5                  [16, 32, 64, 128]

3.5 Network Training and Software

Deep learning is a trendy, state-of-the-art research field and currently the dominant area of machine learning. Therefore, a wide variety of deep learning software tools is publicly available. Most of them are open source and distributed under licenses that allow commercial use. These frameworks are often driven by leading universities and companies like Google, Facebook, Microsoft, and Amazon. Theano, developed by the University of Montreal, is one of the first and most used libraries [30]. Caffe is another powerful framework, developed by UC Berkeley [31]. PyTorch is another popular framework, developed by Facebook [32]. However, TensorFlow, developed by Google [33], is the most used framework for deep learning today; it runs in several languages, including Python, and according to Stack Overflow questions and GitHub users it has the largest user community. For this part of the project, Keras and TensorFlow are used. TensorFlow is based on dataflow graphs and tensor operations for the forward pass and backpropagation derivatives, which makes it very computationally efficient. It provides a powerful framework for building machine learning models at a variety of abstraction levels. Using the low-level APIs, a model is constructed by defining a series of mathematical operations, which is more time-consuming and requires more in-depth coding, with a steeper learning curve.
The other option is to use a high-level API such as Keras [34] to specify predefined architectures. This higher-level abstraction is easier to use but less flexible. Keras is an open-source deep learning library written in Python, capable of using TensorFlow, Theano, and the Microsoft Cognitive Toolkit (CNTK) as back-end. It is very user-friendly and allows fast experimentation with different models and hyperparameters. Extensibility and modularity are

two other great advantages of Keras. Like TensorFlow, it is capable of running on both CPU and GPU. During this project, several different image resolutions were tested to see what level of pixel detail was necessary. The dataset contained images of different resolutions depending on the microscope, a mix of 2448x2048 and 1592x1196. The default input size for the VGG19 model is 224x224, and after some iteration the 224x224 input size was adopted for both networks. The first approach was to randomly crop an original image into several smaller images. However, this method had a major drawback: for this classification strategy, the entire image is of interest, not only a small part of it. Therefore, the images were instead down-sampled to keep the entire image. Additionally, the customized network was tested with grayscale images. The pretrained VGG19, however, is fine-tuned and adapted for color images with three channels and does not accept grayscale input, so the grayscale test could not be performed for it. It is possible to convert the VGG19 weights to single-channel input, but this was not done in this project. It is worth mentioning that grayscale input reduces the number of trainable weights drastically, since two channels are simply dropped, which can enhance computational efficiency. The complete dataset was divided into three sets: training data (80 %), validation data (15 %), and test data (5 %), see section 2.2. The models were trained on mini-batches of the training data. Different batch sizes, varied by factors of 2, were tested for both training and validation; the most common were 32 and 16 for training, and 16 and 8 for validation. The images in the batches were shuffled and selected randomly by Keras. The dataset is imbalanced; there are more Macro images than Slits and Por.
When training neural networks, it is important to have balanced classes, so that the network is exposed to equally many images of each class and does not become biased towards the class with more samples. Two different approaches were tested to solve this problem. The first was to adjust the data to the class with the least data, which in this case was Slits with around 300 images. This method is called undersampling: removing some instances of the over-represented classes. The data augmentation techniques were then applied equally to each class. The other method was to use all the available data and adjust the augmentation methods to produce a balanced final dataset; in this scenario, the Slits class was augmented much more to compensate for its smaller amount of original data. This method is called oversampling: repeating instances of the under-represented class or classes. These two approaches are together called data sampling. Many other methods also exist for handling imbalanced datasets [35].

3.5.1 Transfer Learning

For the VGG19 model, a method called transfer learning was used. It is a compelling approach to various machine learning problems; the concept was first introduced around 1990 by L. Pratt [36].

Figure 3.9: Three different transfer learning strategies; the large rectangle is the main block of a model and the small rectangle is only the top layers, including the softmax classifier

The main idea is to reuse a trained model when training a new model for another, similar problem. This technique is especially applicable to deep learning problems, where a large amount of data and computational capability is required. It is a good approach in this case because the original dataset (without data augmentation) is small; the gained knowledge is recycled. As mentioned in section 3.4.1, the VGG nets are trained on ImageNet, millions of images in many different classes. In Keras, it is straightforward to use these networks as a starting point for a new model; it takes only a few lines of code. In this case, VGG19 with its trained weights was used as the base model. The last fully connected and softmax layers were not included. Instead, a few customized layers were added to the base model, including dropout layers for regularization (see section 2.9) to avoid overfitting, and a final softmax layer producing class probabilities for the three classes. This approach corresponds to strategy 3 in figure 3.9, where the base model is frozen and only the top layers are trained. A CNN model contains many different layers, all trained to recognize and learn useful features and then generalize to future predictions. The learning process of these models has been researched for a long time, and still is, for example through visualization of feature maps. It has been shown that the earlier layers of the network learn low-level image features, such as lines, dots, and curves, while later layers learn high-level features, such as common objects and shapes. Therefore, various layers were retrained for this project while all other weights were frozen. Since the low-level features are relatively common to many images, the focus was on retraining the last 5-10 layers of VGG19, to learn the high-level features extensively.
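A minimal Keras sketch of the strategy-3 setup described above (frozen VGG19 base, new top layers with dropout and a 3-class softmax). The sizes of the new layers and the dropout rate are illustrative assumptions; in the actual project weights="imagenet" loads the pretrained ImageNet weights, while weights=None keeps this sketch self-contained:

```python
from tensorflow.keras.applications import VGG19
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.models import Sequential

# Load VGG19 without the top FC/softmax layers; weights="imagenet"
# would download and load the pretrained ImageNet weights.
base = VGG19(weights=None, include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # strategy 3: freeze the whole base model

model = Sequential([
    base,
    Flatten(),
    Dense(512, activation="relu"),   # assumed size of the new FC layer
    Dropout(0.5),                    # assumed dropout rate
    Dense(3, activation="softmax"),  # three defect classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
print(model.output_shape)
```

Strategy 2 (retraining the last few VGG layers) is obtained by instead setting `layer.trainable = True` on the chosen layers of the base model.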
Retraining some extra layers of VGG in this way corresponds to strategy 2 in figure 3.9. The customized model, however, was trained according to strategy 1 in figure 3.9: the complete model is trained from scratch. The ADAM optimizer (see section 2.7) was used for the loss minimization, and cross-entropy (see section 2.5.2) combined with softmax was implemented as the loss function; regular SGD and the RMSprop optimizer were also tested. Additionally, early stopping (see section 2.9) was used to minimize overfitting. The ReLU activation function (section 2.4) is used almost exclusively for the hidden layers of both networks. The models were trained locally on a PC with an internal GPU (NVIDIA Quadro M2000M, 1029 MHz GPU clock, 4 GB memory). This worked fine for image sizes smaller than 156x156 for the models used here. Three image processing steps were performed before the input batches were sent through the network. First, the pixel values were rescaled to the range 0 to 1. Second, sample-wise centering: the mean value of each sample was set to 0. Third, sample-wise standard normalization: the standard deviation of each sample was set to 1.

3.5.2 Model Performance

The performance measure is an essential part of a machine learning model: a model is trained to make good future predictions. Model accuracy is one of the most used metrics, but accuracy alone is not enough to judge the predictive power. The accuracy paradox, related to the null error rate [37], describes the case where a model performs excellently on the majority class but very poorly on the minority classes. For example, if the Macro class is dominant, being found in 99 % of cases, then predicting that every case is Macro will give an accuracy of 99 %. The confusion matrix, or error matrix, is a better or complementary method to evaluate the final model [38]. This matrix contains as many rows and columns as the number of classes.
Table 3.4: Confusion matrix for binary classification

                 Predicted True         Predicted False
Actual True      True positive (TP)     False negative (FN)
Actual False     False positive (FP)    True negative (TN)

As shown in table 3.4, the confusion matrix represents the predicted classes versus the actual classes, for all classes and not only the majority class. From the confusion matrix, some important performance measures used in this project can be deduced. The true positive rate (TPR), also called sensitivity or recall, is the ability of a classifier to find all the positive instances. For each class it is defined as the ratio of true positives to the sum of true positives and false negatives.

TPR = TP / (TP + FN)    (3.1)

The positive predictive value (PPV), or precision, is the ability of a classifier not to label as positive an instance that is actually negative. For each class it is defined as the ratio of true positives to the sum of true and false positives.

PPV = TP / (TP + FP)    (3.2)

The F1 score is the harmonic mean of TPR and PPV and is defined as follows.

F1 = 2TP / (2TP + FP + FN)    (3.3)

The final accuracy (ACC) is also included in the performance analysis and is defined as follows.

ACC = (TP + TN) / (TP + TN + FN + FP)    (3.4)

All four metrics are used in the final model performance analysis.
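The four metrics can be computed directly from the confusion-matrix counts; the following small helper is illustrative, not thesis code:

```python
def metrics(tp, fp, fn, tn):
    """Compute TPR, PPV, F1 and ACC (equations 3.1-3.4) from
    confusion-matrix counts for one class."""
    tpr = tp / (tp + fn)                   # sensitivity / recall
    ppv = tp / (tp + fp)                   # precision
    f1 = 2 * tp / (2 * tp + fp + fn)       # harmonic mean of TPR and PPV
    acc = (tp + tn) / (tp + tn + fp + fn)  # overall accuracy
    return tpr, ppv, f1, acc

print(metrics(tp=40, fp=10, fn=10, tn=40))  # (0.8, 0.8, 0.8, 0.8)
```

In the multi-class case, TP, FP, and FN are counted per class (one-vs-rest) and the per-class scores are then averaged.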

Chapter 4 Classifying Material Defects using Image Processing

This chapter covers the practical implementation of image processing for defect classification and defect detection, related to the second goal of this project. It is an alternative method to the first one, and the final results will be compared to each other. Section 4.1 gives a more general description of object detection. Sections 4.2 and 4.4 present the Canny edge detection algorithm and contour detection algorithms, with the intermediate morphological operations described in section 4.3. Section 4.5 presents the implementation details.

4.1 Image Processing and Object Detection

In the field of computer vision, classification and object detection are two major categories. Both often take images as input, but the output differs. Classification models typically take an image as input and output a class score, for example to determine whether an image shows a cat or a dog; the model does not care where the cat is in the image. In object detection, the network usually takes two inputs and produces two outputs: the class score and the coordinates of the object. These models can handle several objects in an image. Many different object detection approaches exist. Some successful deep learning approaches are Region Proposals (R-CNN) [39], the Single Shot MultiBox Detector (SSD) [40], and You Only Look Once (YOLO) [41]. These methods are very sophisticated and efficient, but require a lot of data. YOLO and SSD in particular are fast and accurate, and both are used commercially today in self-driving cars and other sectors. In this thesis, however, we focus on a much simpler strategy: implementing classical image processing algorithms to detect the defects (objects) in the images, using an edge detection algorithm combined with contour detection and some other mathematical morphological operations described in the following sections.
This approach is essentially rule-based; the defects are categorized by their pixel sizes. Depending on whether the width or height of a defect is smaller or larger than some specific value, it is classified as Por, Macro, or Slits. This method is therefore not considered a machine learning approach. The essential idea behind gradient-based detection is that local object appearance and shape within an image can be described by the distribution of intensity gradients or edge directions. Edge detection is one of the most fundamental image analysis operations. Edges are often vital clues for the analysis and interpretation of image information, both in biological vision and in computer image analysis. In machine vision, a discontinuity in image brightness can generally be assumed to correspond to a discontinuity in either depth, surface orientation, reflectance, or illumination. Edges in images usually have strong links to the physical properties of the world, which helps in image segmentation and object detection. Edge detection requires computing image derivatives. However, differentiation of an image is an ill-posed problem; image derivatives are sensitive to various sources of noise [42]. Smoothing the images is a common remedy. An early approach to edge detection involved the convolution of the image f with a Gaussian kernel G or

Green function F, followed by the detection of zero-crossings in the Laplacian response [43][44]:

∇²(G ∗ f) = 0    (4.1)

This smoothing approach introduces some undesirable results, for example false edges, loss of information, and pixel displacement. Many more sophisticated methods exist today. All edge detection algorithms compute image derivatives, but they use different filters and approaches. One of them, the Canny filter, is described in more detail in the next section. The gradient of an image is the vector of its partial derivatives,

|∇f| = sqrt((∂f/∂x)² + (∂f/∂y)²),    ψ = arctan((∂f/∂y) / (∂f/∂x))    (4.2)

where the first expression in 4.2 is the gradient magnitude and the second is the edge direction; the gradient direction is perpendicular to the edge orientation.

4.2 Canny Edge Detection

Canny edge detection is a five-stage algorithm to detect a wide range of edges in images. The algorithm was developed by J. Canny [45]. Its five steps are described below.

Noise Reduction

As mentioned above, edge detection involves image derivatives, which are susceptible to image noise. One way to reduce the noise in the image is to apply a Gaussian blur to smooth it, using image convolution with a Gaussian kernel. The equation for a Gaussian filter kernel of size g1 x g2 is given by

G_ij = (1 / (2πσ²)) exp(−((i − g1)² + (j − g2)²) / (2σ²))    (4.3)

The kernel size is an important hyperparameter which impacts the amount of image blurring and the noise sensitivity; a larger kernel reduces the sensitivity to noise.

Gradient Calculation

The gradient calculation step detects the edge intensity and direction by calculating the gradient of the image using edge detection operators. A well-known operator is the Sobel operator, developed by I. Sobel [46]. Two different Sobel filters exist, one for horizontal and one for vertical gradients.
The two kernels, convolved with the input image I, are

        [-1  0  +1]               [+1  +2  +1]
G_x  =  [-2  0  +2] ∗ I,   G_y =  [ 0   0   0] ∗ I
        [-1  0  +1]               [-1  -2  -1]

where G_x and G_y are the approximations of the horizontal and vertical derivatives (gradient intensity matrices) of the input image I. The magnitude |∇f| and the direction ψ of the gradient are then calculated according to 4.4.

|∇f| = sqrt(G_x² + G_y²),    ψ = arctan(G_y / G_x)    (4.4)

Non-Maximum Suppression

Non-maximum suppression is an edge thinning technique that suppresses extra and unwanted edge pixels; ideally, the final image should have thin and distinct edges. This step is applied after the gradient calculation. The principle is relatively simple: the algorithm goes through the gradient intensity matrix and keeps only the pixels that are local maxima along the edge direction.

Double Threshold

After non-maximum suppression, the remaining edge pixels provide a more accurate representation of the real edges in the image. However, some edge pixels may still be caused by noise and color variation. Using two thresholds, the algorithm divides the pixels into three categories: strong, weak, and non-relevant. The strong pixels certainly contribute to the final edges, while the non-relevant pixels are sorted out. The intermediate (weak) pixels are processed in the next step of the algorithm, which decides whether they should be part of the final edges.

Edge Tracking by Hysteresis

Based on the double threshold result, the weak pixels are transformed into strong or non-relevant pixels. If and only if at least one of the pixels around the one being processed is strong, the pixel is upgraded to a strong pixel; otherwise it is set to zero.

4.3 Morphological Operations

Mathematical morphology is a tool for extracting image components that are useful in the representation and description of region shapes, such as boundaries, skeletons, and convex hulls. Morphological operations apply a structuring element, or kernel, to an input image, creating an output image of the same size. The value of each pixel in the output image is based on a comparison of the corresponding pixel in the input image with its neighbors. The most basic morphological operations are dilation and erosion. The mathematical formulas and definitions are omitted in this report; the fundamental operations are union, intersection, and complement plus translation, the essential components of set theory. For more theoretical definitions and proofs, see [47].

Erosion

The effect of this operator on a binary or grayscale image is to erode the boundaries of regions of foreground pixels (i.e., white pixels). Areas of foreground pixels thus shrink in size, and holes within those areas become larger.
The operation takes two inputs: the image and the structuring element (kernel). Erosion is a way to remove small white noise in the image.

Dilation

Dilation is principally the opposite of erosion. It generally increases the size of objects, filling in holes and broken areas and connecting regions that are separated by spaces smaller than the structuring element. With grayscale images, dilation increases the brightness of objects by taking the neighborhood maximum when passing the kernel over the image. In this project, erosion is applied first to remove noise, which also shrinks the objects in the image; dilation is then applied to restore the object size.

4.4 Contour Detection

A contour is a closed curve of points or line segments representing the boundary of an object in an image. In other words, contours represent the shapes of objects found in an image, and they are a useful tool for shape analysis, object detection, and recognition. Contour detection is usually performed after edge detection. The Ramer-Douglas-Peucker algorithm [48], developed in 1972, and the Satoshi Suzuki algorithm, developed in 1985 [49], are two contour finding algorithms.

4.5 Implementation and Software

The first idea for this part was to implement an object detection algorithm like YOLO [41] or R-CNN [50] to detect and draw a box around the different defects; such algorithms both classify the defects in an image and detect their coordinates. YOLOv3 [51], which is a region selective algorithm, was implemented for this project. These object detection algorithms require two inputs for training: the input image and the coordinates of the objects in the image. Therefore, much more work

is needed to prepare and process the data. A graphical object annotation program was used to annotate all the images manually. This process was very tedious and time-consuming. A transfer learning approach was used, with the Darknet neural network framework weights for YOLOv3. A new softmax layer was added on top of the existing network for the purpose of this project. After all these endeavors, however, the final result was not very promising. Firstly, the defects were too small for the network to detect. Secondly, the training data was limited, because it was difficult to find a good data augmentation method that would both augment the data and simultaneously keep track of all the annotated boxes (the coordinates). R-CNN methods were another possible approach, as they are better at detecting smaller objects. After all this work, a much simpler concept was implemented to detect the defects in these images: an edge and contour detection method was used instead. The Canny edge detection algorithm (see section 4.2) combined with some morphological operations (see section 4.3) was tested to detect the defects. This approach was selected because the images in this project are relatively simple: the form and color of the defects are very distinctive from the rest of the image. However, this approach is applicable only to images without the etching process, i.e., the images with a clear white background, see figure 3.1. For these images, detecting edges and contours is simple enough. OpenCV (cv2), a standard Python library for image processing, contains all these algorithms. The complete strategy, consisting of several components, is described below.

1. The original image was converted to grayscale
2. The image was cropped by about 80 % to remove the image corners
3. A Gaussian filter was applied to remove some noise
4. The image was sent through a Canny filter to detect edges
5. Dilation and erosion operations were applied to make the edges more distinct
6. The contour function in cv2 was applied to detect the contours, connecting the edges from the previous steps
7. Another cv2 function was used to find the bounding boxes of the detected contours
8. A rectangle drawing function was used to draw the bounding boxes on the images
9. The defects were classified based on the detected height and width
10. Finally, the resulting images with bounding boxes were saved locally. The height, width, and corresponding file names were also saved in an Excel spreadsheet to record how many boxes were detected per image

Chapter 5 Result

In this chapter, all the results are presented, including a performance comparison between the two CNN models implemented in this project. In section 5.1, the overall training strategy for the two CNN models is outlined, including some figures related to the training process. The model evaluation is discussed in section 5.2, with some statistical performance measures. These two sections relate to the first goal of this project, the classification task using CNN models, linked to chapters 2 and 3. Section 5.3 relates to the second goal of this project, using rule-based image processing to detect and classify defects, linked to chapter 4. In the last section, 5.4, a short comparison between these two strategies is presented.

5.1 Defect Classification using CNN Models

The training process of neural networks involves many hyperparameters that have to be tuned. A few of these parameters are the learning rate, regularization, batch size, image resolution, and data augmentation techniques. Parameter tuning is a trial-and-error process: it requires many experiments and iterations until a satisfying result is achieved. Finding the optimal parameters for a specific problem is an optimization task in itself. Therefore, during this project, many iterations and experiments were performed to find hyperparameters that produced reasonably good results. The final image resolution was set to 224x224, in accordance with the VGG19 training images. The available GPU could not handle a higher resolution, and a lower resolution reduced the accuracy. The final batch size was set to 32 for training and 16 for validation; the smaller validation batch size was chosen because of the limited computational power of the GPU. Overfitting was a significant problem at the beginning of the training process for both CNN models. Two regularization techniques solved this problem: feature-map dropout and early stopping.
A scheduled learning rate was another strategy that improved convergence and training stability significantly. The initial learning rate was 1e-2, and it was then reduced by a factor of 0.75 every four steps. Five different data augmentation techniques were used in this project, described in section 3.3. The augmentation was applied equally across techniques, so the training dataset increased by a factor of five. The models were trained using Keras with GPU-based TensorFlow as backend, both with and without data augmentation. Input samples were shuffled at the beginning of each epoch, the model parameters were updated after each epoch, and the validation process was performed at the end of each epoch. Each training epoch took around 20-80 seconds. For more details about the network implementation, see section 3.5. The final results for five different training strategies are summarized in table 5.1: two for VGG19 and three for the customized model. Figures 5.1 and 5.2 present the cross-entropy loss and the training and validation accuracy for the VGG19 model, both with and without data augmentation. Figures 5.3 and 5.4 present the same quantities for the customized model.
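The learning rate schedule described at the start of this section (initial rate 1e-2, reduced by a factor of 0.75 every four steps, here taken to mean epochs) can be written as a plain function. The Keras usage shown in the comment is an assumed sketch, not the project's verbatim code.

```python
def step_decay(epoch, initial_lr=1e-2, factor=0.75, step=4):
    """Scheduled learning rate: start at 1e-2 and multiply by 0.75
    after every four epochs (interpreting 'steps' as epochs is an
    assumption)."""
    return initial_lr * factor ** (epoch // step)

# Assumed Keras usage (hypothetical, not the project's code):
#   callback = keras.callbacks.LearningRateScheduler(step_decay)
#   model.fit(x_train, y_train, epochs=100, callbacks=[callback])
```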

Table 5.1: Training results for the CNN models (VGG19 and customized = C), with and without data augmentation. Data augmentation is denoted by a + sign to save space, so C+ means the customized model with data augmentation. The best value for each metric was indicated by bold text in the original table.

Model       | Training loss | Validation loss | Training accuracy | Validation accuracy
VGG19       | 0.1020        | 0.9012          | 0.9540            | 0.9010
VGG19+      | 0.0010        | 0.5510          | 0.9950            | 0.9503
C           | 0.1520        | 0.6540          | 0.9560            | 0.7850
C+          | 0.2230        | 0.8505          | 0.9430            | 0.7890
C grayscale | 0.2530        | 0.9040          | 0.9430            | 0.7450

Figure 5.1: Training accuracy and cross-entropy loss of the pretrained VGG19 using transfer learning, training the top layers only, without data augmentation.

Figure 5.2: Training accuracy and cross-entropy loss of the pretrained VGG19 using transfer learning, training the top layers only, with data augmentation.

Figure 5.3: Training accuracy and cross-entropy loss of the customized model, without data augmentation.

Figure 5.4: Training accuracy and cross-entropy loss of the customized model, with data augmentation.

All the networks were trained for 100 epochs, as shown in the figures above. The training and validation accuracy increase very fast in the beginning. After approximately 20 to 30 epochs, they start to stabilize and grow very slowly. The cross-entropy loss follows the same pattern: it decreases very rapidly and then stabilizes. Figures 5.1 and 5.2 illustrate that the accuracy converges much faster for the pretrained VGG19 network than for the customized model in figures 5.3 and 5.4: the accuracy goes from zero to almost 70 % after five to ten epochs. This rapid growth is due to the transfer learning strategy, as described in section 3.5.1. The VGG19 network is already trained on millions of images and can capture the low-level features of the training samples, because those features, such as edges and lines, are common to all images. The customized model, however, is trained from scratch. As seen in the figures, the overall accuracy for VGG19 is significantly higher than for the customized model. As mentioned previously, the networks are trained both with and without data augmentation to compare the overall performance and the effect of augmentation. Figures 5.2 and 5.4 show the networks with five different data augmentation techniques. The augmentation strategy improved the overall accuracy by about 5 % for the VGG19 model. However, it did not greatly affect the customized model: the training accuracy increased by about 5 %, but the validation accuracy suffered slightly. The cross-entropy loss fluctuates more as a result of data augmentation for both networks. As shown in table 5.1, VGG19 with data augmentation reached the highest accuracy and the lowest loss.
5.2 Evaluation of CNN Models

During training, the best weights per epoch are saved into a final model that is used to evaluate the network and make future predictions. The models are evaluated on the test dataset,

which is 5 % of the entire dataset. The complete test set contains 105 images: 50 of category Macro, 34 of Por, and 21 of Slits. It is an imbalanced set, and as described in section 3.5.2, it is therefore essential to have reliable performance metrics to avoid the so-called accuracy paradox. For this project, four different performance indicators are used: True Positive Rate (TPR), Positive Predictive Value (PPV), F1 score, and final accuracy (ACC). All four are calculated from the confusion matrix of each model. The confusion matrices are presented in figures 5.5 and 5.6, which give an indication of the number of misclassified images per class. As shown in the figures, the misclassification rate is higher for the customized model than for VGG19, both with and without data augmentation. Slits are more often misclassified as Macro; for the customized model, 7 of 43 Slits are labeled as Macro. Both models perform very well for Por defects, which are mostly labeled correctly. This is not a surprising result: Macro and Slits have similar features, making them difficult to distinguish, whereas Por defects are easily distinguishable because of their characteristic features. In the figures, the number 0 represents Por, 1 Macro, and 2 Slits. The four matrices represent the VGG19 and customized models, with and without data augmentation.

Figure 5.5: The confusion matrix of the VGG19 model without data augmentation (right) and with data augmentation (left)

Figure 5.6: The confusion matrix of the customized model without data augmentation (right) and with data augmentation (left)
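As a sketch, the four metrics can be computed per class directly from a confusion matrix. The function name and the matrix values in the test are illustrative assumptions, not the project's results.

```python
import numpy as np

def per_class_metrics(cm):
    """Compute TPR, PPV, F1 per class and overall ACC from a square
    confusion matrix, with rows = true classes and columns =
    predicted classes (class order 0=Por, 1=Macro, 2=Slits as in
    the figures). Hypothetical helper for illustration."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    tpr = tp / cm.sum(axis=1)           # recall per true class (row)
    ppv = tp / cm.sum(axis=0)           # precision per predicted class (column)
    f1 = 2 * tpr * ppv / (tpr + ppv)    # harmonic mean of TPR and PPV
    acc = tp.sum() / cm.sum()           # overall accuracy
    return tpr, ppv, f1, acc
```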

Figure 5.7: The four performance measures for Por (left) and Slits (right); the + signs represent the models with data augmentation. The blue chart is VGG19 and the orange one is VGG19+; the green one is the customized model (C) and the red one is C+.

Figure 5.8: The four performance measures for Macro (left) and the final accuracy of all models (right); the + signs represent the models with data augmentation. The blue chart is VGG19 and the orange one is VGG19+; the green one is the customized model (C) and the red one is C+.

Figures 5.7 and 5.8 illustrate the four performance metrics for the models used in this project. The four metrics for each defect type are summarized in one chart to make the comparison easier. The first three charts represent the metrics for Por, Macro, and Slits; the final chart shows the total accuracy for all four model combinations. These charts reflect the previous results, indicating that VGG19 performs better than the customized model for all three defect types, and that VGG19 with data augmentation performs best of all. The F1 score, which is the harmonic mean of TPR and PPV, shows that Por types are recognized with almost no errors by both models; the recognition rate is close to 100 %. The overall result for Macro is also very high, especially for VGG19 with data augmentation. However, the overall results are significantly lower for Slits, especially for the customized model with data augmentation, where the F1 score is only 78 % compared to over 90 % for the other model combinations. The final accuracy is consistent with the other metrics: it shows that VGG19 is the clear winner, with almost 100 % accuracy.

5.3 Defect Detection using Image Processing

In this section, the results related to the second goal of this project are presented; the theory and specific implementation details are given in chapter 4.
The second goal of this project was to use a machine learning approach to detect and classify the individual defects, in contrast to the first goal, where the entire image was classified. Primarily, YOLOv3, a CNN-based object detection algorithm, was implemented for this task. It is a powerful real-time object detection algorithm. However, after some experimentation, the produced result was not satisfying. Three main limitations made this approach challenging: firstly the limited

dataset, secondly the data augmentation techniques, and thirdly the small size of the defects. In the end, the machine learning approach with YOLOv3 was discarded. Instead, a rule-based image processing method was chosen: an edge and contour detection algorithm combined with some other image processing methods, implemented in OpenCV (for more details, see section 4.5). This strategy is completely hard-coded and rule-based. The classification rule is based on table 3.1: if a defect is under 25 pixels, it is labeled as Por; otherwise it is labeled Macro or Slits depending on the length/width ratio. A defect larger than 25 pixels with a length/width ratio of at least 5:1 is recognized as Slits, and otherwise as Macro. A few detection results are illustrated in figures 5.9 and 5.10.

Figure 5.9: The defects detected by image processing: both Macro and Por defects, and some falsely detected defects

Figure 5.10: The defects detected by image processing: Macro, Por, and Slits defects, and some falsely detected defects at the edges of the picture on the left

Figures 5.9 and 5.10 show that the algorithm is capable of detecting the majority of the defects in the images. One problem with this method, however, is that it can identify areas in the images that are not relevant. For example, in the right image of figure 5.9, the algorithm has detected three boxes as Macro that are not defects; they are just lines created by the microscope during the test. A similar problem is illustrated in the left image of figure 5.10, where three or four long boxes at the edge are detected as Slits; these are not defects either. Such falsely detected boxes can create problems if the entire image is classified based on the number of boxes per defect category. During the implementation phase, the images were therefore cropped by 80 % to reduce the effect of falsely contributed boxes at the edges and corners of the picture.
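The hard-coded classification rule from table 3.1 (25-pixel size threshold, 5:1 length/width ratio) can be written down directly. The function name is a hypothetical helper for illustration, not the project's exact code.

```python
def classify_defect(width, height, min_size=25, slit_ratio=5.0):
    """Rule-based defect label from a bounding box: defects under
    25 pixels are Por; larger defects with a length/width ratio of
    at least 5:1 are Slits, otherwise Macro (per table 3.1)."""
    long_side = max(width, height)
    short_side = max(min(width, height), 1)  # guard against division by zero
    if long_side < min_size:
        return "Por"
    if long_side / short_side >= slit_ratio:
        return "Slits"
    return "Macro"
```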
5.4 Comparison of CNN Models and Image Processing

An overall comparison between defect classification using a CNN model and defect detection using image processing is difficult to make, because they address the same problem from different points of view. One possible approach to compare the two methods is to count the

number of detected defects per image for the image processing algorithm and classify the image according to the majority detected defect type. For example, in figure 5.10, the right image is classified as Slits and the left image as Por. For this comparison, 38 images with a white background are used: 13 Por, 10 Macro, and 15 Slits. VGG19 with data augmentation, which performed best on the classification task, is selected for this purpose. The VGG19 model is not retrained; it is the same as in the previous results. Instead of the 105 images in the previous evaluation dataset, only the 38 images with the right background are used, since the image processing algorithm only works with a white background. First, the images are sent through the VGG19 model and the confusion matrix is calculated; the same procedure is then repeated for the image processing algorithm. The confusion matrices are presented in figure 5.11. The F1 score and overall accuracy (ACC) are used as performance metrics, and the results are presented in figure 5.12. As shown in the figure, VGG19 outperforms the image processing algorithm in all aspects: the ACC is 89 % for VGG19 and 69 % for image processing. In particular, VGG19 predicts 100 % correctly for the Macro type, while image processing reaches only 67 %. Image processing performs slightly better for Por and Slits, around 76 %, but that is still less than VGG19, which predicts around 90 %. From the confusion matrix for image processing, it can be seen that many Macro and Slits are misclassified as Por, which reduces the performance metrics. One interesting remark about the VGG19 accuracy is that it decreases by about 10 % compared to the earlier results: it goes from 98 % to 89 %, see figure 5.8. One explanation is that the image background plays an important role in the overall accuracy, which is not highly surprising. The majority of the Macro and Slits images have a dark background, which the network learns as an essential feature. It is therefore vital to have a homogeneous and balanced dataset.
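The majority-vote scheme described above, classifying a whole image by the most common defect type among its detected boxes, can be sketched as follows; the function name is hypothetical.

```python
from collections import Counter

def classify_image(detected_labels):
    """Classify a whole image by the majority defect type among the
    boxes found by the image processing algorithm. Returns None if
    nothing was detected. Hypothetical helper, not project code."""
    if not detected_labels:
        return None
    return Counter(detected_labels).most_common(1)[0][0]
```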
Figure 5.11: The confusion matrix of the VGG19 model with data augmentation (left) and image processing (right)

Figure 5.12: The F1 score and final accuracy for the image processing approach and the VGG19 model with data augmentation. The first three charts represent the F1 score for each defect type, and the last chart is the final accuracy for all three combined