Self-supervised pre-training of CNNs for flatness defect classification in the steelworks industry

International Journal of Advances in Intelligent Informatics, ISSN 2442-6571, Vol. 6, No. 1, March 2020, pp. 13-22

Abstract
Classification of surface defects in the steelworks industry plays a significant role in guaranteeing the quality of the products. From an industrial point of view, a serious concern is represented by the hot-rolled products shape defects and particularly those concerning the strip flatness. Flatness defects are typically divided into four sub-classes depending on which part of the strip is affected and the corresponding shape. In the context of this research, the primary objective is to evaluate the improvements obtained by exploiting the self-supervised learning paradigm for defect classification, taking advantage of unlabelled, real, steel strip flatness maps. Different pre-training methods are compared, as well as architectures, taking advantage of well-established neural subnetworks, such as Residual and Inception modules. A systematic approach in evaluating the different performances guarantees a formal verification of the self-supervised pre-training paradigms evaluated hereafter. In particular, pre-training neural networks with the EgoMotion meta-algorithm shows classification improvements over the AutoEncoder technique, which in turn performs better than a Glorot weight initialization.

Introduction
In the steelmaking cycle, continuous casting is the process where molten steel is solidified into different semi-finished products, and it is the starting point of the Hot Rolling Mill (HRM) process. Slabs are one of these intermediate products, characterized by a rectangular cross-section, and are transformed into flat steel products. The primary thickness reduction of a slab is achieved in the roughing mill, which the heated slab enters after a descaling phase, while the finishing mill refines the thickness of the strip, providing the final thickness and definitively changing the slab into a long and thin product called a strip. From an industrial point of view, shape defects of hot-rolled products are a serious concern, particularly those affecting strip flatness. Such defects highlight non-uniformities within the hot rolling process but can be detected only at the end of the process, and thus cannot be corrected before the next slab is processed. The main consequence is a degradation of the quality of the final product, which leads to economic losses due to non-compliant products.
Flatness defects are, on the one hand, due to different elongation of the internal strip fibres caused by uneven stress along the width or by the high rolling speed, which leads to fluttering strips. On the other hand, an uneven thermal gradient across the strip is responsible for flatness defects generating waviness. Uneven heating or cooling is the main cause of the latter type of defects, due to internal stresses that can locally exceed the yield stress of the material, leading to plastic deformation of the strip [1] [2]. Defects due to different elongation of the fibres are particularly relevant, as they are directly connected to rolling process parameters such as the inflection of the work rolls (bending) or the relative sliding of the work rolls along the transverse axis (shifting).
Flatness defects are typically divided into two sub-classes, depending on whether the edge of the strip is affected or not. When the edge is affected, the defect is typically referred to as a "wave defect," while a buckle typically refers to a defect that does not affect the strip edge. In addition, the position along the transverse direction of the strip allows categorizing buckles as center- or quarter-buckles. In the former case, the defect occurs near the longitudinal centerline of the strip, while in the latter case, it occurs in the transverse regions that engage the upper/lower strip at a distance of about one-quarter of the width from the strip edge.
The strip planarity is usually measured by considering the strip as formed by a series of adjacent longitudinal fibres: if all the fibres have the same length, the strip is perfectly flat. The presence of flatness defects derives from the fact that the fibres do not stretch independently, and when they have different lengths, flatness defects appear as waves on the strip. The main parameter used for the numerical evaluation of strip flatness is the so-called I-Unit index, which is computed for each fibre as follows:

$$I_i = \frac{L_i - L_{ref}}{L_{ref}} \cdot 10^5$$

where $L_i$ is the length of fibre $i$, and $L_{ref}$ is the length of a reference fibre. Typically, the reference fibre is the shortest one, and the I-Unit assumes only non-negative values.
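As a minimal sketch of the I-Unit computation, the snippet below evaluates the relative elongation of each fibre with respect to the shortest one, scaled by the conventional factor of 10^5. The function name `i_units` and the fibre lengths are hypothetical and serve only as an illustration.

```python
def i_units(fibre_lengths):
    """Return the I-Unit of each fibre, using the shortest fibre as the reference."""
    l_ref = min(fibre_lengths)
    # Relative elongation with respect to the reference, scaled by 1e5.
    return [(length - l_ref) / l_ref * 1e5 for length in fibre_lengths]

# Hypothetical fibre lengths (in metres) sampled across the strip width.
lengths = [100.000, 100.010, 100.020, 100.010, 100.000]
profile = i_units(lengths)
```

With the shortest fibre as reference, every value in `profile` is non-negative, and the reference fibre itself has an I-Unit of zero.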
The transversal flatness profiles of each strip are usually concatenated and represented as a two-dimensional map of the strip flatness, which is read directly from a measuring system installed at the end of the finishing mill. The procedure to detect and isolate each defect on the strip surface, which is detailed in [3], provides defect sub-images from the full strip image. The classification of HRM surface defects has been tackled in recent years by exploiting Support Vector Machines (SVM) [4], supervised Neural Networks with Back Propagation [5], unsupervised classifiers via Self-Organizing Maps (SOM) [6], or Learning Vector Quantisers (LVQ) [7].
In general, industrial surface defect detection and classification systems currently applied in the steel sector exploit Artificial Intelligence-based approaches at different levels: in the preliminary preprocessing stage, for instance for the removal of unreliable data [8] and for feature selection [9][10], as well as in the actual classification stage [11]-[13]. Moreover, machine learning approaches are applied to correlate the different kinds of defects with their potential causes [14] [15]. However, in this context, the potential of high capacity networks has not yet been fully exploited. Very recently, in [3], the use of Convolutional Neural Networks (CNNs) was also introduced to cope with the classification problem of surface defects.
In this paper, the classification problem is extended and tackled from a different point of view. In particular, we explore the idea of using unsupervised pre-training of CNNs, which does not require manual labelling of a dataset. Later fine-tuning on a labelled dataset via transfer learning lets us compare how effectively the considered methods increase classification accuracy over purely supervised learning.

Self-Supervised Learning Techniques
High capacity networks are solving many different machine learning tasks, ranging from large-scale image classification [16], segmentation [17], and image generation [18] to natural speech understanding [19] and realistic text-to-speech [20]. A few general trends are easily identified in academia and industry: deeper networks show increasingly better results [21] as long as they are fed with ever-larger amounts of data, and labelled data in particular. Computational and economic costs increase linearly with the size of the dataset. For this reason, in recent years, some unsupervised approaches have aimed at exploiting unlabelled data. The intuition behind many of these techniques is to emulate the human brain's ability to self-determine the goal of a task and to improve at it.
Advancements in algorithms able to exploit labels inherently contained within an unlabelled dataset gave rise to what is now referred to as self-supervised learning. LeNet-5 [22] popularized convolutional operators, which embed a priori knowledge of the data into networks by preserving the spatial correlation of the pixels of an image as the signal proceeds through the layers of the network. Similarly, self-supervision embeds a priori knowledge about a dataset into a network, but not by introducing a different operator. Instead, the output of the network is typically constrained to be coherent with a known transformation of the inputs. Since the input and the transformations are known, we can picture this situation as deriving labels from the input data and forcing the network to converge to those labels. Assuming weights learned through self-supervised learning generalize to a similar task, one can use transfer learning [23][24] to fine-tune the network on a labelled dataset. A few examples of self-supervised techniques include:
a) Physics and Domain Knowledge [25]: the authors show how a CNN fed with images from a video stream of a falling ball learns to predict the height of a falling object, just by forcing the output to be coherent with the coordinates of a parabola, which is the physically feasible trajectory of a falling body.
b) Unsupervised Jigsaw Puzzles [26]: following the principle of self-supervision, the authors build a CNN that can be trained to solve jigsaw puzzles as a pretext task, which requires no manual labelling. The CNN is later repurposed to solve classification and detection via transfer learning.
c) Colorization [27]: the auxiliary task is to predict two color channels of an image given the luminosity of each pixel. The internal feature representations learned by colorizing unlabelled images can then be fine-tuned for classification and detection.
The above methods could not be applied to our problem because: a) our system does not provide a video stream; b) the classified objects do not have strong structural properties that identify each shape; and c) the images are greyscale, with no color channels other than luminosity. Conversely, the method proposed by Agrawal et al. [28] investigates whether the awareness of EgoMotion can be used as a supervisory signal for feature learning. In other words, images of a moving item show different instances of the same object, i.e., a fixed label for different samples. Edges, texture, and colors needed to recognize the object are visual features that persist independently of the location of the object itself.
One way to emulate the situation of learning via EgoMotion is to:
1) Feed an image to a (bottom) CNN and let it output a tensor w·h·f.
2) Feed the same network a randomly transformed version of the same image, obtained by translating/rotating it, and let it output a new tensor w·h·f.
3) Concatenate the two outputs to form a w·h·2f tensor and feed it to a (top) CNN tasked with predicting the random transformation, which is known and constitutes the label.
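The data flow of this procedure can be sketched with NumPy alone. This is an illustration under stated assumptions rather than the networks used in the experiments: `bottom_cnn` is a toy stand-in for the shared base network, rotations are restricted to multiples of 90 degrees, translations wrap around the image border, and the image size, shift range, and number of features f = 8 are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_transform(image, max_shift=3):
    """Apply a random shift and a random quarter-turn; return the result and its label."""
    dx, dy = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    k = int(rng.integers(0, 4))  # number of 90-degree rotations (simplification)
    moved = np.roll(image, shift=(dy, dx), axis=(0, 1))  # wrap-around translation
    return np.rot90(moved, k), (dx, dy, k)

def bottom_cnn(image, weights):
    """Toy stand-in for the shared base CNN: maps a (w, h) image to a (w, h, f) tensor."""
    return np.maximum(image[..., None] * weights, 0.0)

image = rng.random((28, 28))   # hypothetical square flatness-map crop
weights = rng.normal(size=8)   # f = 8 feature channels, shared by both passes

transformed, label = random_transform(image)
features_a = bottom_cnn(image, weights)        # original image: w x h x f
features_b = bottom_cnn(transformed, weights)  # transformed image, same weights: w x h x f
top_input = np.concatenate([features_a, features_b], axis=-1)  # w x h x 2f
```

The tuple `label` is known by construction, so it can serve as the training target of the top network without any manual annotation.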
The schematic of the network is shown in Fig. 1. The self-supervised learning problem is framed as a supervised learning problem and, by backpropagating and iterating over the unlabelled dataset, the solution is a network that exploits visual features to predict the random transformations applied to the image. It is a reasonable assumption that these features can in turn be repurposed to classify an image, via transfer learning, which is the goal of our approach.
Another example of a self-supervised learning technique considered in this work is that of autoencoders, which consist of a neural network that tries to learn the identity function $h_{W,b}(x) \approx x$ [29]. Without placing some form of information bottleneck inside the function $h$, the task of learning the identity function would be trivial. Instead, the amount of information that passes through the network is reduced by having layers with smaller representation capacity, in a way that allows projecting input data into a latent space characteristic of the training data. As the autoencoder is forced to prioritize which aspects of the input should be transferred, it often learns useful properties of the data. Autoencoders are typically composed of two parts: an encoder, which takes the input and generates the latent encoding, and a decoder, which takes the latent encoding and generates the reconstruction of the input. Depending on the task at hand, different types of autoencoders can be used, for instance:
a) Under-complete autoencoders [30]: the latent space representation in the bottleneck layer is obtained by constraining the dimension of the encoder output to be smaller than the dimension of the input, i.e., by placing fewer hidden units than input units.
b) Regularized autoencoders [31]: rather than limiting the number of hidden units, a loss function with a regularization term is used to encourage sparsity of the representation (Sparse Autoencoders [32]) or robustness to noise and missing inputs (Denoising Autoencoders [33]).
Since we are dealing with images and we need to reduce the image representation to a tensor coherent with the one produced by the networks pre-trained with EgoMotion, under-complete Convolutional AutoEncoders represent a reasonable solution. These models use a series of convolutional and max-pooling layers to reduce the input to a certain encoding, while resorting to transposed convolutional and up-sampling layers for decoding.

Architectures
Throughout the experiments, we used a repeating pattern to develop network architectures of different representational capacity. Independently of the self-supervision method applied for pre-training, every network shares the same type of layers. Specifically, we built two modules:
1) Inception module: based on Szegedy et al. [34], we derived an inception layer where the input branches out to four convolutional modules with different kernel sizes, such as the one reported in Fig. 2(a).
2) Residual module: similarly, based on He et al. [35], we defined a residual layer where the input undergoes heavier convolutional processing on one path while being left almost untouched on another path. Both signals are summed to produce the module's output, as shown in Fig. 2(b).

Fig. 2. Inception (a) and Residual (b) modules
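The data flow of the two modules can be illustrated with a small NumPy sketch. It is a simplification and not the architecture of Fig. 2: the kernel sizes (1, 3, 5, 7), the number of filters, the ReLU activation, and the pure identity shortcut of the residual path are all assumptions made for illustration, and `conv2d_same` is a naive stride-1 convolution written for clarity rather than speed.

```python
import numpy as np

def conv2d_same(x, kernels):
    """Naive stride-1 'same' convolution. x: (H, W, C_in); kernels: (k, k, C_in, C_out), k odd."""
    k = kernels.shape[0]
    pad = k // 2
    h, w, _ = x.shape
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, w, kernels.shape[3]))
    for i in range(k):
        for j in range(k):
            # Shifted window of the padded input times the kernel slice at (i, j).
            out += xp[i:i + h, j:j + w, :] @ kernels[i, j]
    return out

def relu(x):
    return np.maximum(x, 0.0)

def inception_module(x, branch_kernels):
    """Run parallel convolutions with different kernel sizes and concatenate along channels."""
    return np.concatenate([relu(conv2d_same(x, k)) for k in branch_kernels], axis=-1)

def residual_module(x, kernels):
    """Sum a convolutional path with an identity shortcut (kernels must keep the channel count)."""
    return relu(x + conv2d_same(x, kernels))

rng = np.random.default_rng(0)
x = rng.random((16, 16, 4))  # hypothetical input feature map

# Four branches with illustrative kernel sizes and 8 filters each.
branches = [rng.normal(scale=0.1, size=(k, k, 4, 8)) for k in (1, 3, 5, 7)]
inc_out = inception_module(x, branches)
res_out = residual_module(x, rng.normal(scale=0.1, size=(3, 3, 4, 4)))
```

The Inception output stacks the channels of all branches, while the Residual output keeps the input shape, which is what allows the two paths to be summed.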
Both modules provide the possibility to apply batch normalization, as well as different convolutional strides. We define a number of base networks composed of the above modules, with a naming convention such as EMInc4BN for a network of four Inception modules with Batch Normalization pre-trained with EgoMotion, and AERes8 for a network of eight Residual modules without Batch Normalization pre-trained with AutoEncoders. Table 1 shows the structure of 4 models with increasing complexity. Each structure is composed of either Inception or Residual modules, for a total of 8 networks. They have been trained with and without batch normalization, for a total of 16 models.
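The naming convention can be decoded mechanically. The helper below is a hypothetical illustration that covers only the two prefixes named in the text (EM and AE); `parse_model_name` is not part of the original work.

```python
import re

# <pre-training><module><number of modules>[BN], e.g. EMInc4BN or AERes8.
NAME_PATTERN = re.compile(r"^(EM|AE)(Inc|Res)(\d+)(BN)?$")
PRETRAINING = {"EM": "EgoMotion", "AE": "AutoEncoder"}
MODULES = {"Inc": "Inception", "Res": "Residual"}

def parse_model_name(name):
    """Split a model name into pre-training method, module type, depth, and BN flag."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised model name: {name!r}")
    prefix, module, depth, bn = match.groups()
    return {
        "pretraining": PRETRAINING[prefix],
        "module": MODULES[module],
        "depth": int(depth),
        "batch_norm": bn is not None,
    }

first = parse_model_name("EMInc4BN")
second = parse_model_name("AERes8")

# 4 structures x 2 module types x with/without batch normalization.
n_models = 4 * 2 * 2
```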

Experiments
Every model undergoes three different training procedures: pre-training with EgoMotion followed by transfer learning on the classification dataset, pre-training with AutoEncoder followed by transfer learning on the classification dataset, and training from scratch on the classification dataset.
During transfer learning, every layer is left trainable, i.e., the weights learned during pre-training are not frozen when turning to classification. In the context of pre-training with EgoMotion, each one of the base networks described in Section 2.1 constitutes the bottom CNN, while the top CNN consists of a dense layer of 300 ELU units [36], followed by a 0.3 rate dropout and the output layer. This learning technique requires three outputs: one for predicting rotations and two for vertical and horizontal translations. As in Agrawal et al. [28], the problem is framed as a classification task, so every output is an array of softmax units predicting the bin corresponding to the right transformation.
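Framing the prediction of a transformation as classification amounts to binning each transformation parameter and scoring every output with a softmax cross-entropy. The sketch below illustrates this idea; the value ranges, the number of bins, and the random logits standing in for network outputs are hypothetical, since the text does not specify them.

```python
import numpy as np

def to_bin(value, low, high, n_bins):
    """Map a continuous value in [low, high] to one of n_bins equal-width class indices."""
    edges = np.linspace(low, high, n_bins + 1)
    # Only interior edges are used, so indices fall in 0..n_bins-1.
    return int(np.clip(np.digitize(value, edges[1:-1]), 0, n_bins - 1))

def softmax(logits):
    shifted = logits - np.max(logits)  # subtract the maximum for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(probabilities, target_bin):
    return float(-np.log(probabilities[target_bin]))

# Hypothetical transformation: 13 degrees of rotation, shifts of -1.5 and 3 pixels.
rotation_bin = to_bin(13.0, low=-30.0, high=30.0, n_bins=20)
shift_x_bin = to_bin(-1.5, low=-3.0, high=3.0, n_bins=6)
shift_y_bin = to_bin(3.0, low=-3.0, high=3.0, n_bins=6)

rng = np.random.default_rng(0)
# Random logits stand in for the three output heads of the top CNN.
heads = {
    "rotation": (rng.normal(size=20), rotation_bin),
    "shift_x": (rng.normal(size=6), shift_x_bin),
    "shift_y": (rng.normal(size=6), shift_y_bin),
}
total_loss = sum(cross_entropy(softmax(logits), target) for logits, target in heads.values())
```

Summing the three cross-entropy terms is one simple way to train all outputs jointly; other weightings are possible.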
In the context of pre-training with AutoEncoders, each one of the base networks described in Section 2.2 constitutes the encoding part, which outputs a 7x7x64 tensor. The decoder architecture is common to each model and is composed of 5 transposed convolutional layers preceded by up-sampling layers. The last convolution has sigmoid activation functions, which are a good fit for regressing pixel luminosity values scaled to the 0-1 range.
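A quick shape calculation shows why such an encoder is under-complete. The input side of 224 pixels, the five 2x2 pooling steps, and the 2x up-sampling factor are assumptions chosen so that the encoder output matches the 7x7x64 tensor mentioned above; they are not taken from the original work.

```python
def encoder_shape(side, n_poolings, filters):
    """Shape of the code after n 2x2 max-poolings applied to a square input."""
    for _ in range(n_poolings):
        side //= 2
    return (side, side, filters)

def decoder_side(side, n_upsamplings):
    """Spatial side after n 2x up-samplings (size-preserving transposed convolutions assumed)."""
    return side * 2 ** n_upsamplings

input_side = 224  # hypothetical; chosen so that five halvings give 7
code = encoder_shape(input_side, n_poolings=5, filters=64)
code_size = code[0] * code[1] * code[2]
input_size = input_side * input_side  # single greyscale channel
is_undercomplete = code_size < input_size
reconstructed_side = decoder_side(code[0], n_upsamplings=5)
```

Under these assumptions the code holds 3136 values against 50176 input pixels, so the network cannot simply copy its input.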
Every network is pre-trained using the Adam optimizer [37] for 100 epochs, with early stopping and L2 regularization to prevent overfitting. Once pre-training is completed, every network is repurposed for classification by removing the top CNN (for EgoMotion) or the decoder (for AutoEncoders), and by adding a 0.3 rate dropout layer, a dense layer of 20 ELU units, and a final layer of four softmax units. The Adam optimizer was run for 100 epochs with 64-sample batches, and training was terminated with early stopping. Heavy artificial data augmentation was part of the process, applying random affine transformations to the input images, such as horizontal and vertical flip, width and height shift, and zooming. Similarly, the training process for classification was also carried out without pre-training of the networks, using Glorot initialization [38]. In order to have better confidence in the performance scores, training on the classification dataset was run three times, and the results were averaged.
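A simplified version of this augmentation can be written with NumPy. The sketch covers only flips and shifts (zooming is omitted), and the flip probability, shift range, and image size are hypothetical.

```python
import numpy as np

def shift_with_zero_fill(image, dy, dx):
    """Translate an image by (dy, dx) pixels, filling uncovered pixels with zeros."""
    h, w = image.shape
    out = np.zeros_like(image)
    src_rows = slice(max(0, -dy), min(h, h - dy))
    dst_rows = slice(max(0, dy), min(h, h + dy))
    src_cols = slice(max(0, -dx), min(w, w - dx))
    dst_cols = slice(max(0, dx), min(w, w + dx))
    out[dst_rows, dst_cols] = image[src_rows, src_cols]
    return out

def augment(image, rng, max_shift=4):
    """Randomly flip and shift a single greyscale image."""
    if rng.random() < 0.5:
        image = image[:, ::-1]  # horizontal flip
    if rng.random() < 0.5:
        image = image[::-1, :]  # vertical flip
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    return shift_with_zero_fill(image, dy, dx)

rng = np.random.default_rng(0)
image = rng.random((32, 32))  # hypothetical defect image scaled to the 0-1 range
batch = np.stack([augment(image, rng) for _ in range(64)])  # one 64-sample batch
```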

Dataset
For the labelled dataset, we exploit the data used in Vannocci et al. [3], where a thorough explanation of how the dataset was built is presented. Using the same data, we can compare pre-training techniques against a common baseline to establish the effectiveness of self-supervision. Here we summarize the main features of the exploited dataset.
Defect images are extracted from the overall image of the strip and manually classified by expert personnel into 4 different categories: Wave, Buckle, Multiwave, and Multibuckle (see Fig. 3). Every strip image is affected by a varying number of defects, so dataset splits refer to defect images, not to strip images. The resulting dataset is composed of 4806 images: 3938 images (~80%) were used for training and validation in a 70-30% split, while the remaining 868 images (~20%) were used for testing. The class distribution is shown in Table 2. For the data used in self-supervised training, we ran the bounding box algorithm of Vannocci et al. [3] on new strips to recover 32437 new unlabelled defect images.
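The proportions quoted for the dataset can be checked with a few lines of arithmetic. The training and validation counts computed here are only a rounded illustration of the 70-30% split, since the exact figures are not reported in the text.

```python
total = 4806
train_val = 3938
test = 868
unlabelled = 32437

train_val_fraction = train_val / total  # roughly 0.82
test_fraction = test / total            # roughly 0.18

# Rounded illustration of the 70-30% split; exact counts are not given in the text.
approx_train = round(0.7 * train_val)
approx_validation = train_val - approx_train

# Unlabelled defect images available per labelled image.
unlabelled_ratio = unlabelled / total
```

Under these figures, the unlabelled set is between six and seven times larger than the labelled one.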

Results
The results of all experiments are summarized in Table 3 and Table 4. Fig. 4 shows each model's performance, comparing the accuracy of the same model with different pre-training policies. In the vast majority of models, we see an increase in validation accuracy whenever pre-training occurs. Pre-training with EgoMotion almost always guarantees a better classification accuracy over training from scratch, where weights are initialized with Glorot [38]. Specifically, the overall average accuracy increase is equal to 1.03%, which, for a validation accuracy of 90%, would mean a relative decrease in the error rate of about 10%. Similarly, pre-training with AutoEncoders shows a performance increase when the model is simpler, typically when the model is 1 or 2 modules deep. The overall average accuracy increase is still positive and equal to 0.41%.
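The relative error reduction follows from dividing the accuracy gain by the baseline error rate. The snippet below reproduces this arithmetic for the EgoMotion gain; applying the same 90% baseline to the AutoEncoder gain is an assumption made here for comparison only.

```python
def relative_error_reduction(baseline_accuracy, accuracy_gain):
    """Relative reduction of the error rate; both arguments are in percentage points."""
    baseline_error = 100.0 - baseline_accuracy
    return accuracy_gain / baseline_error

egomotion = relative_error_reduction(90.0, 1.03)    # about 0.10
autoencoder = relative_error_reduction(90.0, 0.41)  # about 0.04
```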

Conclusion
The problem of classification of surface defects in the steel industry has been examined and advanced in this study by exploiting unlabelled data. By doing so, the improvements in classification accuracy come without a corresponding increase in the cost of expert personnel devoted to assembling a bigger labelled dataset. In particular, we have shown that using self-supervised learning algorithms for pre-training different Convolutional Neural Network architectures leads to increased accuracy once the models are fine-tuned via transfer learning on the classification task. With respect to the results of Vannocci et al. [3] on the same classification dataset, we underline four major achievements. First, validation accuracy is generally improved, with the best performing network, EMRes2, outperforming the results of previous research (93.0% against 92.7%). Second, all the models evaluated in this context drastically reduce the number of parameters needed to achieve a comparable, if not better, performance: EMRes2 has more than 160 times fewer parameters than Inception311 in Vannocci et al. [3]. Third, although the accuracy of EMRes2 (90.6%) shows signs of overfitting, it is still higher than that of Inception311 (89.2%). Finally, the increase in accuracy comes without the need for additional labelled images, thanks to the adoption of self-supervised algorithms for pre-training.