An Automated Learning Method of Semantic Segmentation for Train Autonomous Driving Environment Understanding

This article proposes an automated machine learning method for semantic segmentation that can be used for automated training of models in fields such as autonomous driving. The method is not specific to a particular semantic segmentation model: users simply upload a dataset together with the semantic segmentation model they use, and then choose to apply this method. The method implements end-to-end machine learning from sensing data to semantic segmentation results and model evaluation. It integrates four main components: unsupervised data reduction through feature extraction and clustering, a contrastive learning-based evaluator of semantic segmentation results, interactive reinforcement learning-based data selection, and automatic hyperparameter tuning through Bayesian optimization. We demonstrate the practicality of this method on the MRSI and Cityscapes datasets by training mainstream semantic segmentation models such as BiSeNet and STDC. Our results show that this method can effectively guide semantic segmentation training and reduce training time by more than 20$\%$.


Yang Wang, Jin Zhang, Member, IEEE, Yihao Chen, Hao Yuan, and Cheng Wu, Member, IEEE

Index Terms-Automated machine learning, interactive reinforcement learning (RL), semantic segmentation.

I. INTRODUCTION
As one of the most important means of environmental understanding among image recognition algorithms, semantic segmentation has recently made great progress in many application fields, driving a growing demand for high-performance semantic segmentation systems. Especially with the rise of autonomous driving, semantic segmentation has become an indispensable means of recognizing roads and avoiding obstacles [1].
With the rapid growth of automated machine learning (AutoML) over the past few years, a large number of automated methods and systems have emerged, each claiming to outperform the others in searching the search space. However, in many cases we may not need a machine to help us search for a satisfactory model. A more common situation is that we have already deployed a model on the terminal. This model has been carefully designed by researchers and does not need to be replaced. But as the application scenario changes slightly or migrates to another scenario, the performance of the model is no longer satisfactory and researchers are required to update it.
The method proposed in this article is designed to address this problem. It is not specific to a particular semantic segmentation model, but is an automated processing system built around the semantic segmentation model. Our focus is not on automatically constructing a semantic segmentation model (e.g., via NAS), but on retraining the selected model on new data received from sensors (which reflects new changes in application scenarios) so that it adapts to the changed scenarios. In short, we target model retraining, a part neglected by many current AutoML methods.
The mainstream semantic segmentation process generally includes the following steps [2]: first, data collection and annotation are performed for the scene that needs to be segmented; the segmentation network is then trained on the annotated data; after a certain period of training, the network parameters converge; finally, the converged model is used for inference. It is worth noting that the accuracy of the converged model is often unsatisfactory, especially when it is migrated to a new scene. Therefore, in most cases the training process requires researchers to continuously analyze the training effect and adjust the model parameters based on experience until the model meets the requirements.
Based on the analysis of the above phenomena, we believe there are two major problems in the current training of semantic segmentation: 1) Network training requires a large amount of labeled data, but data collection and labeling is a tedious task. It involves more than the two steps of collection and annotation: to improve the quality of the dataset, we have to carefully study the distribution of the data and eliminate inconsistent samples, such as those with occlusion and overlap, and each step requires considerable labor. In addition, too little data tends to push the network into overfitting, so a certain amount of data is also required. At the same time, exploring unknown scenarios is unavoidable in practical applications, which places higher demands on the typicality and generality of the selected data. 2) After each stage of training, researchers must evaluate the learning effect of semantic segmentation and then manually adjust parameters for retraining; there is no self-update mechanism. Moreover, the parameter adjustment process depends heavily on the researcher's experience and judgment, which is highly subjective.
In this article, we propose an AutoML method for semantic segmentation model iteration. The method innovatively utilizes TAMER [3] and reinforcement learning (RL) to achieve automated data selection, which avoids the huge cost of data collection and cleaning by researchers; at the same time, it can be combined with a general semantic segmentation model. The agent evaluates the degree of model training and adjusts the training set to guide the automatic training of the model. Its structure is shown in Fig. 1.
The focus of this research is on designing an AutoML method for semantic segmentation in autonomous driving, aiming to solve the difficulties of acquiring training data and automating model training.
The rest of this article is organized as follows. Section II reviews related work. Section III introduces our proposed automated learning method for semantic segmentation in detail. Section IV describes the experiments and the analysis of results. Finally, Section V concludes this article.

II. RELATED WORK

A. Parameter Optimization in AutoML
Parameter optimization has long been an important research topic in AutoML. Optimization methods focus on optimizing the hyperparameters used for training. Popular methods include grid search (GS) [4], random search (RS) [5], and Bayesian optimization (BO) [6], [7]. GS divides the search space into regular intervals and selects the best point after evaluating all of them; RS selects the best point from a set of randomly sampled points; and BO builds a probabilistic model mapping hyperparameters to evaluation metrics on the validation set, which balances exploration and exploitation well. In addition, gradient-based optimization (GO) [8], [9], [10] uses gradient information to optimize hyperparameters and significantly improves the efficiency of hyperparameter optimization (HPO). Maclaurin et al. [11] proposed the reversible-dynamics memory-tape method, which efficiently handles thousands of hyperparameters through gradient information. To further improve efficiency, Pedregosa [12] used approximate gradient information instead of real gradients to optimize continuous hyperparameters. Chandra [13] proposed a gradient-based optimizer that optimizes not only regular hyperparameters (such as the learning rate) but also the hyperparameters of the optimizer itself (such as the moment coefficients of the Adam optimizer [14]).
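As a minimal illustration of how GS and RS differ (a sketch with a mock validation objective, not the tooling used in this article), consider tuning a single learning-rate hyperparameter:

```python
import random

def validation_score(lr):
    """Mock objective standing in for real training: peaks at lr = 0.01."""
    return -(lr - 0.01) ** 2

def grid_search(candidates):
    # GS evaluates every point on a regular grid and keeps the best one.
    return max(candidates, key=validation_score)

def random_search(low, high, n_trials, seed=0):
    # RS evaluates points sampled uniformly at random from the range.
    rng = random.Random(seed)
    points = [rng.uniform(low, high) for _ in range(n_trials)]
    return max(points, key=validation_score)

grid_best = grid_search([0.001, 0.005, 0.01, 0.05, 0.1])
rand_best = random_search(0.001, 0.1, n_trials=20)
```

BO differs from both in that each new trial is informed by a probabilistic model of all previous results rather than a fixed grid or blind sampling.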

B. Semantic Segmentation Automation
Automated learning for semantic segmentation is necessary and has already achieved many results. Zhang et al. [15] pointed out that manually designing and tuning semantic segmentation networks requires a lot of expert work, and that it is difficult to balance speed and performance for real-time applications such as autonomous driving; they therefore proposed a customizable architecture search method to automatically generate lightweight networks under specific constraints, the first attempt at automatic network architecture generation for semantic segmentation. Nekrasov et al. [16] pointed out that since manually designing networks is tedious and difficult, automated design of neural network architectures for specific tasks is a very promising route; they used a recurrent neural network (RNN) controller to cyclically output the network structure and the operations of each layer for semantic segmentation, with specialized modifications for compact semantic segmentation and auxiliary units to speed up search and training. Liu et al. [17] proposed a network-level search space containing many popular designs and developed a formulation allowing gradient-based architecture search. Kim et al. [18] applied NASNet, an AutoML RL algorithm, to a deep U-Net network to improve image semantic segmentation performance. Chen et al. [19] proposed a decoupled, fine-grained latency regularization method to address the failure of semantic segmentation models designed automatically with neural architecture search (NAS), achieving a better balance between high accuracy and low latency. Yang et al. [20] introduced automated semantic segmentation to the medical field, proposing a composite structure for dense labeling in which a custom 3-D fully convolutional network explores the spatial intensity concurrency of the initial labeling and an RNN encodes spatial orderliness to counteract boundary ambiguity, resulting in significant refinement; it allows simultaneous segmentation of multiple anatomical structures of clinical significance, such as the fetus. It can be seen that automated learning of semantic segmentation is becoming a very important research direction.

III. FRAMEWORK
Our framework aims to automate the training of semantic segmentation models. It assembles the necessary training steps into an end-to-end machine learning pipeline and obtains the features, algorithms, and hyperparameters that yield the best performance on the validation dataset. The AutoML framework is shown in Fig. 1. The process mainly includes preprocessing of massive data, training of a semantic segmentation result evaluator based on MoCo [21], data selection based on interactive RL, and automatic adjustment of hyperparameters through BO. Given any semantic segmentation model, AutoML is responsible for data preprocessing to generate a highly available dataset based on the perceived relevant scenarios. Interactive RL serves as the key controller throughout the process, and BO guides the self-learning process to find the most suitable hyperparameter values within specific computing power and time constraints.
First of all, we need to clarify what kind of data are worth labeling and training on. We note that general training data has two problems: 1) there is inevitable duplication in the huge amount of raw data, and repeated training on highly consistent scenes wastes valuable resources and incurs high labeling costs; and 2) the new data obtained may already be perfectly segmented by the network, and repeated training on such data cannot improve performance but occupies resources. Therefore, the training data need to meet two conditions: 1) the data are quite different from the data in the existing dataset; and 2) the existing network segments the data poorly. For the first condition, we add a deduplication step using feature extraction and clustering on the original data. For the second condition, we utilize MoCo for contrastive unsupervised learning: we use a small amount of data to train a segmentation result evaluator, and an interactive RL model based on TAMER + RL takes into account the judgments of both humans and agents, adaptively picking out the more valuable data for retraining based on the scoring results. Finally, BO is used to find the best hyperparameter values of the semantic segmentation model to optimize the machine learning model.

A. Data Preprocessing
Our aim is to find images that duplicate too much of the content of images in the existing dataset. In this solution, we first perform feature extraction on the data to be processed, implemented with ResNet50 in this article. The original ResNet50 ends with a 2048-d → 1000-d fully connected layer; to preserve the completeness of the extracted features, we redefine the fully connected layer to output 2048 dimensions. We then read the clustering centers of the original dataset as the initial clustering centers, calculate the distances between the feature vectors of the data to be processed and the clustering centers, assign each sample to the class of the nearest clustering center, and inspect the number of elements in each class to perform splitting and merging operations. The cluster centers are recalculated and the above steps repeated until the maximum number of iterations is reached. Finally, we select the cluster-center images and save the center vectors.
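The assignment and center-update steps described above can be sketched as follows. This is a simplified illustration: toy 3-D vectors stand in for the 2048-d ResNet50 features, and the split/merge operations and stopping criteria are omitted.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_to_centers(features, centers):
    """Assign each feature vector to its nearest existing cluster center."""
    clusters = {i: [] for i in range(len(centers))}
    for idx, f in enumerate(features):
        nearest = min(range(len(centers)), key=lambda i: euclidean(f, centers[i]))
        clusters[nearest].append(idx)
    return clusters

def recompute_centers(features, clusters):
    """Recompute each center as the mean of its assigned feature vectors."""
    centers = []
    for idxs in clusters.values():
        if idxs:
            dim = len(features[idxs[0]])
            centers.append(tuple(
                sum(features[i][d] for i in idxs) / len(idxs) for d in range(dim)
            ))
    return centers

# Toy run: centers loaded from the original dataset, new data assigned to them.
old_centers = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
new_features = [(0.1, 0.0, 0.1), (0.9, 1.1, 1.0), (0.05, 0.1, 0.0)]
clusters = assign_to_centers(new_features, old_centers)
new_centers = recompute_centers(new_features, clusters)
```

Keeping one representative image per final center is what yields the deduplicated dataset.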

B. Semantic Segmentation Result Evaluator
Our approach is based on the insight that the data useful for model training should be those with bad results. Following the general process, we can input the data to be labeled into the trained model for segmentation and judge the quality of the segmentation results as a basis for whether the network has mastered them; for example, the object detection system misdetects an object in front of it, or the semantic segmentation system segments a lane line into pieces. Our method should therefore effectively screen out such misidentified data. For manually trained models, this requires researchers to continuously judge the quality of the output segmented images at the backend of the semantic segmentation network to find data with poor quality, which means that whenever new data is obtained, a lot of researcher labor is required. We need a method that can automatically judge the quality of segmentation results.
We notice that in semantic segmentation of tracks and roads, a good segmentation result is a smooth and regular mask, while a bad result may be an irregular division that also includes artifacts such as burrs. Certain features of the segmentation map can thus be used as a basis for judging whether the results are good or bad. The emergence of contrastive learning provides a solution to this problem. We use MoCo for contrastive unsupervised learning (shown in Fig. 2). Its function is to quantitatively score the processing results of the specified task and label each input datum with a score.
The biggest advantage of contrastive learning is the ability to learn features from new data in a self-supervised manner; prior work shows that the performance of self-supervised learning can equal or even surpass that of supervised methods. As shown in Fig. 2, encoder-q is the feature extraction network we want to train. We use instance discrimination as the pretext task. For an image $x_1$ in the semantic segmentation result set $\{x_1, x_2, \ldots, x_n\}$, data augmentation is performed to obtain the images $x_1^1$ and $x_1^2$, which come from the same original image and form a positive sample pair. The other images $\{x_2, \ldots, x_n\}$ in the segmentation set are all negative samples. $x_1^1$ is input into encoder-q for feature extraction to get $q$, and $\{x_1^2, x_2, \ldots, x_n\}$ are input into encoder-k for feature extraction to get $\{k_+, k_2, \ldots, k_n\}$. MoCo treats the features output from encoder-k as a dynamic dictionary and the feature $q$ output from encoder-q as a query, converting contrastive learning into a dictionary lookup problem. Our goal is to make the features $q$ and $k_+$ from the positive sample pair as close as possible in the feature space, and as far as possible from the features $\{k_2, \ldots, k_n\}$ of the negative samples.
Contrastive loss is a function whose value is low when $q$ is similar to its positive key $k_+$ and dissimilar to all other keys (the negative keys of $q$). This process can be viewed as an $(n+1)$-way classification problem in which the query should be classified as $k_+$, so the contrastive loss takes the InfoNCE form
$$\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\exp(q \cdot k_+ / \tau) + \sum_{i=2}^{n} \exp(q \cdot k_i / \tau)}$$
where $\tau$ is a temperature hyperparameter that controls the shape of the distribution. The query and keys are encoded by their respective encoders; in this article, the same encoder (ResNet50) is used for both.
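A minimal numeric sketch of the InfoNCE computation (pure Python on toy 2-D unit vectors; in the actual pipeline $q$ and the keys are 2048-d ResNet50 features):

```python
import math

def info_nce(q, k_pos, k_negs, tau=0.07):
    """InfoNCE: cross-entropy of an (n+1)-way softmax over similarities,
    where the positive key k_pos is the correct class."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    logits = [dot(q, k_pos) / tau] + [dot(q, k) / tau for k in k_negs]
    m = max(logits)  # subtract the max to stabilize the softmax
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))

q = (1.0, 0.0)
# Positive key aligned with the query: loss should be near zero.
low_loss = info_nce(q, (1.0, 0.0), [(0.0, 1.0), (-1.0, 0.0)])
# Positive key orthogonal to the query while a negative matches it: high loss.
high_loss = info_nce(q, (0.0, 1.0), [(1.0, 0.0), (-1.0, 0.0)])
```

The loss falls as $q$ moves toward $k_+$ and away from the negatives, which is exactly the gradient signal used to train encoder-q.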
The dictionary holds a subset of all data. To maintain a large dictionary, it is represented as a queue, which lets the dictionary size exceed the GPU memory limit, and encoder-k is updated slowly by a momentum update to keep the features extracted by the encoder consistent:
$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q$$
where $\theta_q$ and $\theta_k$ are the parameters of encoder-q and encoder-k, respectively, and $m \in [0, 1)$ is the momentum coefficient.
In this article, the data stored in the dictionary are the semantic segmentation result images, and after being trained by the MoCo model the encoder can learn the features of the segmented images. We add a global average pooling layer and a fully connected layer to the backend of encoder-q for the final classification of segmented images. It classifies the segmented images based on the features extracted by the encoder (ResNet50) and outputs the probability that the segmentation result is good or bad, which is positively related to the actual quality of the image. Therefore, we define $\mathrm{score}_i = \alpha \cdot p_i$, where $p_i$ is the probability that the $i$-th image is a bad segmentation result, $\mathrm{score}_i$ is the score of the $i$-th image, and $\alpha$, which depends on the overall distribution of the dataset, is used to limit the range of scores. From this, we obtain an evaluator for the semantic segmentation results.

C. Interactive RL
After getting the scores from the preprocessing module, the simplest choice is to use a threshold function to separate the images with lower scores, but this is not a good approach. It has two main problems: 1) a threshold function imposes a hard division and lacks adaptability; and 2) a threshold does not reflect the real model performance. First, a single threshold can only be applied to a single scenario. When the scenario switches, researchers need to re-derive the threshold from the dataset. Even worse, a threshold function leads to disastrous results when the model experiences multiple scenarios at the same time: the thresholds in different scenarios may be completely different, and a fixed threshold cannot be adaptively optimized for changing scenarios.
Second, a threshold function does not reflect the real model performance. Take semantic segmentation for autonomous driving as an example: a camera in front of the car acquires images at a certain frame rate. The image at moment $t$ is incorrectly segmented by the in-car model, but at moments $t-n, \ldots, t-1, t+1, \ldots, t+n$ the images are correctly segmented. We believe the error at moment $t$ does not reflect poor model performance, because the model performs very well throughout the period. Moreover, due to the continuity of objects in the real world and the high frame rate of the camera, the $t-1$, $t$, and $t+1$ frames reflect almost the same scene, so the error of a single frame over a long period does not correctly reflect the performance of the model. We need to assess the model on the results of the entire period in order to select misjudged data that is valuable for training, and a threshold function provides no such global consideration. In this article, we adopt RL to solve this problem. Furthermore, a feedback mechanism in which researchers can participate, i.e., interactive RL, is provided to help the RL converge quickly.
As shown in Fig. 3, RL is used as the key classifier for dataset adjustment throughout the pipeline. The RL agent can select training data according to the degree of model training, eliminate worthless data, and focus on increasing the proportion of incorrectly segmented data.
In this scenario, we define the external environment as the image set produced by semantic segmentation with the trained model, as shown in Fig. 4. The agent interacts with the environment (i.e., this segmented dataset) and selects images that are valuable for retraining. In each round of semantic segmentation model training, since the parameters of the model differ, the valuable images selected should also differ; the agent can adaptively explore dynamic selection criteria.
For simplicity, we make a streaming assumption: unprocessed data arrives as a stream, and as each datum arrives the agent must decide what action to take, i.e., whether the datum is retained for retraining. In this model, we define the elements of RL as follows. The state $s_t$ includes the candidate image being considered for retention or deletion and the images left after processing at time steps $1, \ldots, t-1$. We define the state space $S$ as an $N$-dimensional space and use the vector $\phi(s_t)$ to represent the state at moment $t$. Each dimension stores data retained after selection by the agent: dimensions $1, \ldots, N-1$ hold the data retained from previous processing, and the $N$th dimension holds the candidate datum being processed. The larger $N$ is, the richer the information contained in $\phi(s_t)$, meaning the agent considers more global information when selecting data.
The action space $A$ is a set of $N$ actions, i.e., $A = \langle a_1, a_2, \ldots, a_N \rangle$, where $a_1, \ldots, a_{N-1}$ correspond to the candidate image replacing the image in dimensions $1, \ldots, N-1$, respectively (the remaining action corresponds to skipping the candidate).
The reward function $r_{t+1}$ represents the reward obtained by taking action $a_t$ in state $s_t$. In the state iteration, every time the agent performs an action, the environment returns the corresponding reward to evaluate the impact of the action. Incorrectly segmented images are clearly more valuable for retraining than correctly segmented ones, so the model should be encouraged to retain more poorly segmented data. We define the reward function as
$$r_{t+1} = \operatorname{sgn}(\mathrm{score}_N - \mathrm{score}_i)\cdot \min(\mathrm{score}_N, \mathrm{score}_i)$$
where score is obtained from the MoCo evaluator described above, $\mathrm{score}_N$ is the score in the $N$th dimension of $s_t$ (the score of the candidate image being processed), and $\mathrm{score}_i$ is the score of the original image replaced by the current candidate. The intuition behind this definition is that if the new segmentation image is worse (has a higher score) than the original one, the new image replaces the original; otherwise it is skipped. The sign of the reward (i.e., whether the behavior is encouraged) is determined by $\mathrm{score}_N - \mathrm{score}_i$, and $\min(\mathrm{score}_N, \mathrm{score}_i)$ limits the magnitude of the reward so that it does not become abnormally large when the difference between $\mathrm{score}_N$ and $\mathrm{score}_i$ is too large.
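A sketch of one environment step with this reward. The exact reward formula and the skip reward are our reading of the description (sign from score_N - score_i, magnitude bounded by min(score_N, score_i); skipping is assumed to yield zero reward), so treat both as assumptions:

```python
def reward(candidate_score, replaced_score):
    """Hypothetical reward consistent with the text: sign from
    candidate_score - replaced_score, magnitude bounded by min(...)."""
    sign = 1.0 if candidate_score > replaced_score else -1.0
    return sign * min(candidate_score, replaced_score)

def step(retained_scores, candidate_score, action):
    """One step: action i < N replaces slot i with the candidate; the
    last action skips the candidate (reward 0 here, an assumption)."""
    if action == len(retained_scores):
        return list(retained_scores), 0.0
    r = reward(candidate_score, retained_scores[action])
    new_scores = list(retained_scores)
    new_scores[action] = candidate_score  # candidate replaces slot i
    return new_scores, r

# A badly segmented candidate (high score 0.9) replacing a well-segmented
# image (low score 0.2) earns a positive reward.
scores, r = step([0.2, 0.8, 0.5], candidate_score=0.9, action=0)
```

Retaining high-score (poorly segmented) images in the state is exactly the behavior the positive reward encourages.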
In this article, a value-based RL model is used; the deep Q-network (DQN) model [22] serves as the example in the experiments. As shown in Fig. 5, to make network training more stable, DQN uses two Q-networks: the current Q-network is used to select actions and update model parameters, and the target Q-network is used to calculate the target Q value. The two networks share the same structure. The parameters of the target Q-network are not iteratively updated but are copied from the current Q-network at regular intervals to reduce the correlation between the target Q value and the current Q value. DQN uses a neural network to approximate the value function: the network takes the state as input, and each feasible action has a separate output unit giving its predicted value, so the values of all feasible actions in a state are produced in a single forward pass. In this method, we design a neural network containing five convolutional layers and two fully connected layers to represent the value function. The network takes as input the $N$-dimensional image scores, representing the $N$ images obtained from the environment, and the output layer yields the $N$ images ultimately retained in the final state, i.e., the selected images with poor segmentation results. Together, these constitute the RL controller network.
We design the neural network's L2 loss function based on the temporal difference error (TD error):
$$L(\theta) = \mathbb{E}_{s,a,r,s'}\big[\big(y - Q(s, a; \theta)\big)^2\big], \qquad y = r + \gamma \max_{a'} Q(s', a'; \theta)$$
where $\theta$ denotes the set of parameters of the neural network and $y$ is the TD target.
Every $c$ steps, the target Q-network parameters $\theta^-$ are updated to the parameters $\theta$ of the current Q-network. Thus, the loss function is expressed as
$$L(\theta) = \mathbb{E}_{s,a,r,s'}\big[\big(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\big)^2\big].$$
To speed up the training of the RL agent, we introduce the TAMER framework, an algorithm that replaces the reward function with human feedback. Knox and Stone [3] showed that human reinforcement signals (information-rich but defective) and Markov decision process (MDP) rewards (information-poor but not defective) are complementary signals and can be used together. The TAMER agent performs better than the RL agent early in training and rapidly exhibits good-quality behavior. We therefore combine the strong early learning of TAMER with the superior long-term learning of the RL agent to accelerate the RL agent's climb up the learning curve.
TAMER attempts several methods of combining scalar human feedback with the RL reward function; we selected the best-performing one. With $\hat{H}(s, a)$ denoting the human reward function, actions are selected according to
$$a = \arg\max_{a} \big[Q(s, a) + w\,\hat{H}(s, a)\big]$$
so the weighted prediction of human reinforcement is added to $Q(s, a)$ only at action selection, corresponding to the green part of Fig. 5. This method does not affect the update of the Q function. The weight $w$ is an input parameter and was set to 0.98 in the experiment.
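The action-selection rule above reduces to a few lines; the Q and $\hat{H}$ values below are hypothetical stand-ins for the networks' outputs:

```python
def select_action(q_values, h_values, w=0.98):
    """Pick argmax_a [Q(s,a) + w * H_hat(s,a)]: human feedback biases
    action selection only; the Q-function update is untouched."""
    combined = [q + w * h for q, h in zip(q_values, h_values)]
    return max(range(len(combined)), key=lambda a: combined[a])

# Q alone prefers action 0, but human feedback favoring action 2 overrides it.
q_vals = [1.0, 0.5, 0.9]
h_vals = [0.0, 0.0, 0.5]
action = select_action(q_vals, h_vals)
```

Because the bias enters only at selection time, the Q-network's TD updates remain driven by the environment reward alone.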
In each round of training, the state $s_0$, the experience replay pool $D$, the current Q-network, and the target Q-network are first initialized. At each step, the agent uses the $\varepsilon$-greedy method to choose the current replacement action $a_t$ according to the current state: with probability $1 - \varepsilon$ it selects the action with the greatest value predicted by the current Q-network, i.e., $a_t = \arg\max_{a \in A} Q(s, a; \theta)$, and with probability $\varepsilon$ it randomly selects an action $a_t$ from the other actions. After executing action $a_t$, the instant reward $r_t$ is obtained and the agent transitions to state $s_{t+1}$. The agent's past interaction experience $e_t = (s_t, a_t, r_t, s_{t+1})$ is stored in the experience replay pool $D$. At the same time, a batch of experiences $\{e_1, \ldots, e_j\}$ is randomly sampled from $D$ and the target Q value $y_j$ is calculated (when the MDP has not terminated) as
$$y_j = r_j + \gamma \max_{a'} Q'(s_{j+1}, a'; \theta^-)$$
where $Q'$ refers to the target Q-network, i.e., the target Q-network is used for the computation. Finally, gradient descent is used to update the Q-network parameters, with the loss function as described above.
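The $\varepsilon$-greedy choice and the target computation can be sketched as follows, with toy tabular rows standing in for the Q-networks' outputs:

```python
import random

def epsilon_greedy(q_row, epsilon, rng):
    """With prob. 1 - epsilon take the greedy action, otherwise a random
    action from the remaining ones."""
    greedy = max(range(len(q_row)), key=lambda a: q_row[a])
    if rng.random() < epsilon:
        others = [a for a in range(len(q_row)) if a != greedy]
        return rng.choice(others)
    return greedy

def td_target(r, next_q_row_target, gamma=0.99, terminal=False):
    """y_j = r_j + gamma * max_a' Q'(s_{j+1}, a'), or just r_j if terminal."""
    return r if terminal else r + gamma * max(next_q_row_target)

rng = random.Random(0)
a = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)  # always greedy
y = td_target(r=1.0, next_q_row_target=[0.2, 0.5])         # 1.0 + 0.99 * 0.5
```

The squared difference between `y` and the current network's Q(s, a) is the per-sample loss minimized by gradient descent.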
The current Q-network copies its parameters $\theta$ to the target Q-network ($\theta^-$) after every $c$ rounds of training. During training, as the current Q-network and the target Q-network are iterated and updated, the value estimate gradually approaches the real value function.

D. Hyperparameter Auto-Tuning
Automatic HPO is a black-box optimization problem. Our aim is to optimize the semantic segmentation and maximize model performance on the validation set by finding the optimal algorithm settings and hyperparameter values. This process requires trying all promising settings and values, making optimization tedious and expensive. BO is an effective solution in this case: it builds a probabilistic model of the objective function based on past evaluation results to find the value that minimizes the objective function. Because BO uses a continuously updated probabilistic model and makes inferences from past search results, subsequent trials can focus on the more promising values, greatly reducing the number of trials compared with RS and GS. Bayesian HPO is expressed as
$$x^* = \arg\min_{x \in \mathcal{X}} f(x)$$
where $f(x)$ is the objective we need to optimize and $\mathcal{X}$ is the hyperparameter search space. The tree-structured Parzen estimator (TPE) [23] is used in this study. In each trial, TPE maintains a Gaussian mixture model $l(x)$ for the hyperparameter values associated with the best target values and another Gaussian mixture model $g(x)$ for the remaining hyperparameter values, and selects the hyperparameter values that maximize $l(x)/g(x)$ as the next set of search values. In this way, the algorithm adaptively adjusts the parameter search space and finds the globally optimal solution in as few iterations as possible.
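A toy one-dimensional version of the TPE selection rule, using Gaussian kernels via `statistics.NormalDist`. The split fraction and bandwidth are illustrative choices; real TPE implementations add priors and bandwidth heuristics:

```python
import statistics

def tpe_suggest(observations, candidates, gamma=0.25, bandwidth=0.1):
    """Split past (x, loss) observations into the best gamma fraction and
    the rest, model each group with a kernel density l(x) and g(x), and
    return the candidate maximizing l(x) / g(x)."""
    ordered = sorted(observations, key=lambda o: o[1])
    n_good = max(1, int(gamma * len(ordered)))
    good = [x for x, _ in ordered[:n_good]]
    rest = [x for x, _ in ordered[n_good:]] or good

    def kde(points, x):
        # Mixture of Gaussians centered on the observed values.
        return sum(statistics.NormalDist(p, bandwidth).pdf(x) for p in points) / len(points)

    return max(candidates, key=lambda x: kde(good, x) / (kde(rest, x) + 1e-12))

# Observations of loss(x) = (x - 0.3)^2: the good group clusters near 0.3.
obs = [(x, (x - 0.3) ** 2) for x in [0.0, 0.1, 0.25, 0.3, 0.5, 0.8]]
next_x = tpe_suggest(obs, candidates=[0.0, 0.15, 0.3, 0.6, 0.9])
```

The suggested point lands where past low-loss values concentrate, which is how TPE steers each new trial toward promising regions.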

IV. EXPERIMENTS

A. Experiment Preparation
As test scenarios, this article uses the MRSI dataset [24] and the Cityscapes dataset [25]. MRSI provides visible-light images collected on freight railways and subways to simulate possible scenarios in autonomous driving. MRSI uses various sensing devices mounted on the vehicle to record rail scenes under different lighting and weather conditions, including straight lines, curves, and forks during the day, at night, and on rainy days; it is a dataset specific to rail transportation. Cityscapes contains stereoscopic video sequences recorded from street views in 50 different cities, with high-quality pixel-level annotations for 5000 frames, and is a commonly used road dataset. The MRSI and Cityscapes datasets evaluate model performance from two aspects: rail transportation and road traffic, respectively.

B. Feature Extraction and Clustering
To evaluate the deduplication performance of the algorithm, we conduct the following comparative experiments. The dataset used is rail transit scene data containing 18 414 images arranged in chronological order. We divide it into three parts to simulate the original dataset A, the first new-scene dataset B, and the second new-scene dataset C, in a 4:3:3 ratio. The experimental parameters are set as follows: the minimum number of samples per class is 3, the standard deviation threshold is 0.0005, the minimum center distance is 0.13, and the number of iterations is 50. First, the original dataset A is clustered to obtain 2046 deduplicated images (the expected number of cluster centers is 2000), and these clustering centers are recorded. Afterward, dataset B is clustered with these centers as the initial clustering centers; merging A and B yields 3262 deduplicated images, and the new clustering centers are recorded. Finally, dataset C is clustered based on the new clustering centers, and merging the three datasets yields the final 4873 images. We also conducted repeated experiments to verify the stability of the clustering results (for the original dataset A, because its initial cluster centers are randomly selected). The experimental results are shown in Table I. We randomly select a group of clustering centers for the first experiment and obtain 2046 deduplicated images in total; the second clustering yields 2088 images. Compared with the first clustering results, the number of overlapping images is 1935, a consistency rate of 0.94. The consistency rates from the second to the fourth runs are shown in Table I. It can be seen that the final result of this method is highly stable and does not cause large changes in the dataset.

C. Interactive RL
In this section, BiSeNet [26] and STDC [27] are used as the semantic segmentation networks. BiSeNet is a classic lightweight real-time semantic segmentation model with high accuracy and speed. STDC also supports real-time semantic segmentation, with state-of-the-art performance as of 2021. Both are tested separately to verify the effectiveness of the method.
The success of the TAMER + RL combination can be defined in two different ways. The first is the cumulative reward from RL, that is, obtaining more and more reward across iterations. The training data are segmented images output by BiSeNet. In training, one round runs from the initial state to the terminal state. After every 20 rounds of training, the RL model is evaluated once: five randomly selected data selection tasks are tested in each evaluation, and the total reward of their output is taken as the result of that evaluation. Clearly, the higher the total reward, the more valuable the selected training data. Fig. 6 shows the experimental results obtained by training for 30 000 rounds with this method, where the horizontal axis represents the number of training rounds and the vertical axis represents the cumulative reward in each round.
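The evaluation schedule above can be sketched as follows. The environment and policy here are toy stand-ins (a fixed-length episode with random per-image scores), not the paper's segmentation-driven setup; only the every-20-rounds, five-tasks-per-evaluation structure is taken from the text.

```python
import random

class ToyTask:
    """Stand-in data-selection episode: one decision per incoming image."""
    def __init__(self, rng):
        self.scores = [rng.random() for _ in range(10)]  # per-image "value"
        self.t = 0
    def step(self, action):
        # keeping a valuable image earns its score; discarding costs a little
        r = self.scores[self.t] if action == 1 else -0.1
        self.t += 1
        return r, self.t >= len(self.scores)

def greedy_policy(_state):
    # placeholder for the argmax over the trained value network's outputs
    return 1

def evaluate(policy, rng, n_tasks=5):
    """Score a frozen agent on n_tasks random tasks; sum episode rewards."""
    total = 0.0
    for _ in range(n_tasks):
        task, done = ToyTask(rng), False
        while not done:
            r, done = task.step(policy(None))
            total += r
    return total

rng = random.Random(0)
for round_idx in range(1, 101):
    # ... one training round (initial state to terminal state) runs here ...
    if round_idx % 20 == 0:
        eval_score = evaluate(greedy_policy, rng)  # recorded as this evaluation's result
```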
This result illustrates that the data selection method based on interactive RL can effectively learn how to handle data selection tasks. In the initial stage, performance improves rapidly and then gradually levels off, showing that the selection strategy is continuously optimized during training and the model performance steadily improves.
The second definition of success is achieving better final performance than the original model. The evaluation metrics of semantic segmentation are based on the confusion matrix. Popular pixel-label metrics for evaluating semantic segmentation include pixel accuracy (PA), class PA (cPA), mean PA (mPA), intersection over union (IoU), and mean IoU (mIoU) [24]. Among these, mIoU is highly representative and describes segmentation results more comprehensively, and it has therefore become the most commonly used metric.
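All of these metrics derive from the confusion matrix. As a minimal sketch (note that production implementations usually exclude classes absent from both prediction and ground truth, which this version does not):

```python
import numpy as np

def confusion_matrix(gt, pred, n_classes):
    """cm[i, j] counts pixels with ground-truth class i predicted as class j."""
    idx = n_classes * gt.reshape(-1) + pred.reshape(-1)
    return np.bincount(idx, minlength=n_classes**2).reshape(n_classes, n_classes)

def pixel_accuracy(cm):
    return np.diag(cm).sum() / cm.sum()

def mean_iou(cm):
    inter = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    iou = inter / np.maximum(union, 1)  # guard against empty classes
    return iou.mean()

# tiny 2x2 example with two classes
gt   = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
cm = confusion_matrix(gt, pred, n_classes=2)
# pixel_accuracy(cm) -> 0.75; mean_iou(cm) -> (1/2 + 2/3) / 2
```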
In order to prove that the data selected by this method benefit model retraining, we train the semantic segmentation models BiSeNet and STDC on the railway dataset MRSI and the road dataset Cityscapes, respectively, and compare the loss and mIoU with and without AutoML. The experimental results are shown in Fig. 7. Note that the official STDC model is evaluated with a cycle of 1 k iterations, while the BiSeNet model is evaluated with a cycle of 1 epoch; the horizontal axis of BiSeNet's mIoU curve in the figure is therefore the epoch, which has no impact on the results since each comparison is against itself. Taking STDC trained on MRSI as an example, the red line represents training with AutoML and the green line without. As can be seen from Fig. 7(a) and (b), the red line converges significantly faster than the green line: it converges at 34 k iterations with an mIoU of 86.1%, while the green line converges at 41 k iterations with an mIoU of 86.4%. The convergence speed of the red line is clearly higher while the final mIoU is almost the same, which means that the data selected by our model cover the poorly trained data well and maintain training accuracy while reducing the amount of data. The specific results are shown in Table II: adopting our model speeds up training by more than 20%, greatly reducing the training time.

D. Display of Data Selection
TABLE III HYPERPARAMETERS, DOMAIN DISTRIBUTIONS, AND OPTIMAL VALUES

We use the trained RL model for data selection. Fig. 8 shows a complete data selection process. Each row in Fig. 8 represents one moment and includes N = 10 images: images 1-9 are the previously retained images (dark green in Fig. 8), and the 10th is the new image arriving at that moment (brown, on the right side of Fig. 8). Missing images in a row indicate that our model has screened them out: a well-segmented image shows that the model has been adequately trained on it, so it need not be added to the dataset for subsequent training. The model tends to retain poorly segmented images. The agent takes a replacement action at each moment. Take steps 0 and 9 as examples. At step 0, the current state is images 1-10, and the action space contains 10 actions: replacing one of images 1-9 with the 10th image (the newly arrived image), or discarding the 10th image. The trained value network computes a reward for each of these 10 actions and selects the action with the largest reward (greedy method). Here, the third image (the bright green image in Fig. 8) is replaced with the new image. As the figure shows, the segmentation result of the 10th image is very poor, so it is of great value for retraining, while the 3rd image is segmented well and does not require retraining.
At step 9, the optimal action returned by the value network is to discard the new image. As the figure shows, compared with the retained images, the current image is indeed of no value for subsequent training, so it is discarded and a reward of -1 is returned. The negative reward indicates that we do not encourage the model to discard too many new images, but rather to actively update the retained images.
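The per-step greedy selection described above can be sketched as follows. The value network is stubbed with random scores; the action encoding (indices 0-8 for replacement, 9 for discard) and the -1 discard reward follow the text, while the replacement reward is a placeholder since it comes from the evaluator in the real system.

```python
import random

def select_action(q_values):
    """Greedy choice: take the action with the largest predicted reward."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)
# stand-in for the value network: one predicted reward per action, where
# actions 0-8 mean "replace retained image 1-9 with the new image" and
# action 9 means "discard the new image"
q_values = [rng.random() for _ in range(10)]
action = select_action(q_values)
if action == 9:
    reward = -1.0  # discarding the new image returns a -1 reward
else:
    replaced = action + 1  # index of the retained image being replaced
    reward = 0.0           # placeholder; the real reward comes from the evaluator
```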

E. Bayesian HPO
Hyperparameters and domain distributions are listed in Table III. The domain space of hyperparameters to be searched is created accordingly and then refined in subsequent searches. A log-uniform distribution is used for the learning rate, since it varies over several orders of magnitude; values in the domain are sampled with equal probability. The objective function is to minimize the logarithmic loss of the semantic segmentation model under the candidate hyperparameters, evaluated through 10-fold stratified cross-validation. The optimized hyperparameter values are listed on the right side of Table III. Performance and hyperparameter values versus the number of iterations are plotted in Fig. 9 to examine the auto-tuning process; the black pentagram indicates the optimal value. As expected, the loss decreases over time, indicating that the method keeps trying better hyperparameter values. As the search proceeds, the algorithm switches from exploration (trying new values) to exploitation (selecting values with better past results). The loss-hyperparameter relationship is shown in Fig. 10: BO concentrates the search on the more promising values and ultimately finds the optimal hyperparameter values.
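Two ingredients of this search are easy to illustrate: the log-uniform learning-rate domain and the shift from exploration to exploitation. The sketch below is a deliberately simplified stand-in for Bayesian optimization (an annealed random/local search rather than a surrogate model), and the objective is a toy function rather than the 10-fold cross-validated log loss.

```python
import math
import random

def sample_loguniform(rng, low=1e-5, high=1e-1):
    """Sample so that every order of magnitude is equally likely."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def toy_objective(lr):
    # stand-in loss with a minimum at lr = 1e-3 (not the paper's CV log loss)
    return (math.log10(lr) + 3.0) ** 2

rng = random.Random(0)
best_lr, best_loss = None, float("inf")
for i in range(50):
    # anneal from exploration (fresh log-uniform draws) toward exploitation
    # (perturbing the best value found so far)
    explore = best_lr is None or rng.random() < max(0.1, 1.0 - i / 50)
    lr = sample_loguniform(rng) if explore else best_lr * math.exp(rng.gauss(0, 0.3))
    loss = toy_objective(lr)
    if loss < best_loss:
        best_lr, best_loss = lr, loss
```

As in Figs. 9 and 10, later trials cluster around the promising region while the best loss decreases over time.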

V. CONCLUSION
In order to reduce the training cost of migrating models to new scenarios and help researchers with retraining, this article proposes an AutoML method for semantic segmentation. The method integrates the main steps of semantic segmentation model retraining into an automated process, including unsupervised data reduction through feature extraction and clustering, a contrastive learning-based evaluator of semantic segmentation results, interactive RL-based data selection, and automatic hyperparameter tuning through BO. It innovatively combines TAMER and RL to achieve automated data selection, sparing researchers the large cost of data collection and cleaning, and it can be combined with any general semantic segmentation model. The agent adjusts the training set according to how well the model is trained and guides the automated training of the model. We demonstrate the practicality of this method on the MRSI and Cityscapes datasets by training mainstream semantic segmentation models such as BiSeNet and STDC. Our results show that this method can effectively guide semantic segmentation training and shorten the training time by more than 20%. In the future, we will try to combine this method with general AutoML (e.g., NAS), which focuses more on the construction of models, while our method focuses on data and retraining, i.e., automatically screening out valuable data for retraining in place of researchers, reducing consumption and improving speed. The two target different angles and can complement each other.

Fig. 5. Interactive RL framework and training process. The gray blocks are the components of the DQN, the blue arrows indicate the direction of parameter transmission, and the green blocks indicate where the introduced human feedback plays a role.

Fig. 7. Comparison of model effects. (a) and (b) are the loss and mIoU of STDC on MRSI, (c) and (d) are the loss and mIoU of STDC on Cityscapes, (e) and (f) are the loss and mIoU of BiSeNet on MRSI, and (g) and (h) are the loss and mIoU of BiSeNet on Cityscapes.

Fig. 8. Example of data selection based on interactive RL.

TABLE II COMPARISON OF MODEL EFFECTS