Cable fault classification in ADSL copper access network using machine learning

ABSTRACT


Introduction
The digital subscriber line (DSL) remains one of the most popular telecommunication technologies [1] [2]. Asymmetrical Digital Subscriber Line (ADSL) is one of the earliest DSL technologies and delivers internet services over the existing twisted-pair telephone subscriber loop infrastructure [3] [4]. A typical ADSL copper access network consists of a central office (HQ), a Multi-Service Access Network (MSAN) cabinet, a distribution point (DP), and the customer premises [5]-[7]. In ADSL copper access networks, fiber cables form the backbone that connects the central office and the MSAN. Copper cables are laid between the MSAN and the customer premises through the DP. The ADSL network cards and Plain Old Telephone Service (POTS) line cards are integrated at the MSAN, which is the key factor that allows both telephone and internet services to be offered over the same copper cable infrastructure [8]. This network topology is usually established at suburban sites, where the capital expenditure of the telco needs to reflect the customer population in that particular area. This ADSL network configuration can support rates of up to 8 Mbps for distances up to 5 km [9]. However, the offered data rates are still subject to the nature of copper cables: attenuation increases rapidly with distance, and external electromagnetic interference leads to crosstalk. Furthermore, the most common copper cable fault that occurs at cable joints is an impedance imbalance on the line. Note that the effective matched impedance of copper cables is 100 Ohm; values above or below this lead to open or short cable faults. A bridged tap is another cable fault, which arises when an open-circuited twisted pair is connected in shunt with the working twisted pair. An uneven-length fault occurs when the two wires of a twisted pair are not the same length.
All the cable faults described here affect two groups of ADSL network line parameters: line operation and loop line test. These parameters indicate the condition of the copper access network. The line operation attributes, which are available only while the subscriber line is accessing the internet service, are the speed rate, attenuation, and signal-to-noise ratio (SNR). In comparison, the loop line test attributes are used to test the electrical indicators of the line from the MSAN to the customer premises. When a customer's POTS service is faulty, loop line tests can be performed to identify and localize the copper cable fault [10]. Based on the line operation and loop line test parameters, ADSL network performance can be monitored to indicate possible cable faults along the subscriber loop. Copper cable fault classification may be carried out using a machine learning algorithm [11]. To date, few papers have addressed the use of machine learning for cable fault classification in the ADSL copper access network.
The feasibility of classification using machine learning algorithms has been widely demonstrated in other applications, such as identifying objects or data, identifying organisms, gender classification, and transmission line fault classification. Classification assigns data instances to their specified class based on learned training data [12][13] in order to predict the target output. Several popular machine learning classification algorithms have been used across various applications, such as Decision Trees, Naïve Bayes, Artificial Neural Network (ANN), Random Forest, and k-Nearest Neighbor (k-NN) [14].
Multi-Layer Perceptron Neural Network (MLP), Bayes, and Naïve Bayes (NB) machine learning algorithms were deployed to distinguish the type of fault in a transmission line with shunt compensation by a static synchronous compensator (STATCOM), using the Discrete Wavelet Transform (DWT) as the feature extraction method [15]. Several types of faults are classified based on the changes in fault resistance. The applied faults are LG, LL, DLG, and LLLG. Feature extraction using DWT yields the SD and energy values of the various faults at a fault resistance of 0.001 Ω. The acquired features are then used to train the transmission line fault type classifiers. The results indicate excellent performance of the NB classifier both with and without STATCOM, at 100% accuracy, compared to MLP and Bayes, which give average accuracies of 80% and 20%, respectively.
Fault classification was used to distinguish normal from abnormal sensed data in a Wireless Sensor Network (WSN) using Support Vector Machine (SVM), Convolutional Neural Network (CNN), Multilayer Perceptron (MLP), Stochastic Gradient Descent (SGD), Random Forest (RF), and Probabilistic Neural Network (PNN) [16]. That work uses 40 datasets, each with 9566 instances and 12 dimensions. Each dataset has a column that marks normal and abnormal instances as 1 and -1, respectively. The induced faults are gain, offset, stuck-at, out-of-bounds, spike, and data loss faults. The classifiers are evaluated using Detection Accuracy (DA), true positive rate (TPR), Matthews Correlation Coefficient (MCC), and F1-score. These metrics indicate that RF performs better than the other classifiers, which means RF can accurately classify the WSN faults listed above.
The multilayer perceptron is one of the ANN methods often used for classification. It is a feed-forward network comprising several layers: the input, hidden, and output layers [17]. A Multilayer Perceptron Neural Network classifier was used in Adnan et al. [18] to detect heart abnormality using the P, Q, R, S, and T amplitudes of ECG data. The learning algorithms used for training on these data are Back Propagation (BP), Bayesian Regularization (BR), and Levenberg-Marquardt (LM). The high accuracy of all three learning algorithms, more than 90%, indicates the capability and reliability of the MLP model in detecting heart abnormality. Among them, BR gives the lowest Mean Square Error (MSE), 0.237, compared with 4.667 and 4.067 for the BP and LM learning algorithms, respectively. The low MSE for BR means it can give better predictions than the other learning algorithms.
Unlike the multilayer perceptron, which uses several layers of network nodes to process and classify data samples, the decision tree classifies data points using a tree representation. The decision tree aims to build a model that predicts the value of a target output from a number of inputs [19]. Each internal node represents a test on a feature, each branch denotes a test result, and each leaf node represents a class label. A disadvantage of decision trees is that their training process may lead to overfitting [20]. Several works have used machine learning algorithms to classify transmission line faults. As demonstrated in Chandra et al. [21], a decision tree algorithm was applied to classify faults in a transmission line. The 4262 fault data samples were split into 70% training data and 30% testing data, and training and testing resulted in 81.23% accuracy. It was thus concluded that the decision tree algorithm is easy to use and implement for classification purposes.
The closest application of machine learning to the ADSL network is described in Akhikpemelo et al. [22], where fault detection in a 132 kV transmission line system is performed using the Levenberg-Marquardt learning algorithm in an Artificial Neural Network (ANN). The data are generated and extracted using MATLAB software. The detection output is represented as 1 or 0 to indicate the presence or absence of a fault. The ANN network configuration (15-15-10-5) gives excellent classification results, with a correlation coefficient of 0.99953 and an overall MSE of 0.000145.
Based on the machine learning algorithms described above, copper cable fault classification in this paper is performed using Naïve Bayes, Random Forest, Multilayer Perceptron, k-NN, and Decision Tree. The purpose of this preliminary work is to study the feasibility of algorithms that can classify cable faults that may occur in the ADSL copper access network. All the algorithms are readily available in WEKA, a tool that provides visualization and classification for many applications [23]-[25]. WEKA is used as the initial step of exploration in classifying ADSL copper access network fault types. The intention of this paper is to gain insight into which algorithms might benefit this application most. Classification accuracies of up to 97% have been reported for these algorithms [23]. The algorithms are applied to cross-validated ADSL network data, where the copper cable faults considered are bridge tap, open, short, and uneven. The rest of the paper is organized as follows: Section 2 explains the proposed method of classifying cable fault types. Section 3 discusses the findings and evaluates the accuracy of all algorithms. Section 4 concludes the findings and outlines the future direction of this work.

Method
The cable fault classification in the ADSL copper access network is realized in four stages, as shown in Fig. 1. The first stage involves lab data acquisition. The laboratory setup is established to imitate the actual ADSL copper access network. ADSL network data representing both the ideal cable condition and the faulty network are gathered. Only four types of cable faults are emulated in the laboratory measurements: open, short, bridge tap, and uneven. The dataset acquired in the laboratory has been compared with the National Telecommunication company dataset, which consists of the same parameters, to make sure that the ADSL access network dataset meets the benchmark and is within range. The collected data are fed into the second stage, data preprocessing. The raw collected data are prepared in this stage, and only highly correlated parameters and attributes are extracted as the training data set. Data preprocessing also handles any invalid or missing data that may affect the accuracy of the classification in WEKA. The third stage performs the actual cable fault classification, in which the preprocessed data set is used to train the machine learning algorithms in WEKA. This involves five algorithms: J48, k-NN, Multilayer Perceptron, Naïve Bayes, and Random Forest. The accuracy of the cable fault classification algorithms in WEKA is evaluated in the fourth stage.
International Journal of Advances in Intelligent Informatics, ISSN 2442-6571, Vol. 7, No. 3, November 2021, pp. 318-328

Laboratory Setup and Data Acquisition
The laboratory setup is developed to imitate the actual ADSL copper access network. Fig. 2 shows the laboratory setup for network data acquisition, consisting of an MSAN, copper cables of various lengths, and modems. The raw lab data are gathered from the MSAN using a Telnet script run from the command prompt. The raw lab data are stored as a text file, which is then manually processed and converted into an Excel file for ease of use. UTP Cat 3 twisted copper cables are used to connect the MSAN and the modems through tag blocks. The ten-binder copper cable is connected to 10 modems, which represent the network line termination at the customer premises. The possible presence of cable faults in the ADSL network is emulated on the tag block. The UTP Cat 3 cable can support voice and data transmission at data speeds of 8 Mbps. Each twisted pair can be modeled with an RC lumped component model [26], as illustrated in Fig. 4. The A and B wires have the same RC components, but possibly with different values when a cable fault occurs along the line [27]. Each of A and B is connected to Ground (G), represented by the resistances RA-G and RB-G, the capacitances CA-G and CB-G, and the voltages VA-G and VB-G. Although other components are present, they are neglected in the electrical schematic because they change insignificantly when different cable faults are emulated. In the ideal case, the effective impedance Rs is 100 Ohm, and cable faults such as imbalanced impedance, uneven length, and bridge tap can be assessed from the component RA-B, which in turn also affects the capacitance and voltage components. As highlighted previously, the emulation of the ideal cable condition and of cable faults such as bridge tap and imbalanced line is conducted on the tag block side. The copper cable length between the two tag blocks varies from a minimum of 20 m to a maximum of 4000 m.
These cable lengths are chosen to represent the minimum and maximum distances possible between the MSAN and the customer premises in the ADSL network topology. If the customer premises are located relatively far from the MSAN, the attenuation and signal-to-noise ratio (SNR) will be higher, but the achievable data rates will be lower.
The emulation of different cable faults is conducted by modifying the RLC lumped components. The open-wired cable fault is emulated by disconnecting A and B. An open circuit fault usually causes a slight increase in the capacitance between tip and ring, while decreasing the resistance between tip and ring of the line. The short-wired impairment is emulated by connecting the twisted-pair wires A and B directly to each other. A short circuit fault causes the capacitance and resistance between the tip and ring of the line to be extremely low. The bridge tap impairment is emulated by adding another copper cable connected in parallel to the existing A and B at the tag block near the modem. The bridge tap causes a slight drop in line performance due to the increase in signal attenuation and the decrease in the attainable rate of the line, while the resistance and capacitance stay within a good range.
In practice, the bridge tap fault occurs when the customer has installed unnecessary additional telephone wiring. The uneven line fault occurs when the lengths of A and B are not the same. Therefore, this fault is emulated by increasing the length of either A or B by 100 m. The uneven fault causes an increase in capacitance, while the other line performance parameters and the resistance stay within a good range. The acquired lab data may contain missing values that interfere with the classification process in WEKA. Therefore, the raw data are preprocessed, as explained in the next section.

Data preprocessing
WEKA provides learning algorithms that can be easily applied to a dataset. It also provides a variety of tools for transforming datasets, such as algorithms for discretization and sampling. In addition, WEKA allows data to be preprocessed before being fed to a learning algorithm, and it evaluates the classifier results and their efficiency without the need to write any program code. The workbench also provides regression, classification, clustering, association rule mining, and attribute selection methods for solving data mining problems, along with many data visualization facilities and data preprocessing tools for data exploration. Since irrelevant values such as symbols or missing values in the dataset may interfere with the classification process in WEKA, the acquired lab data need to undergo a preprocessing stage, which transforms the data into a format suitable for machine learning algorithms. Data that contain symbols are manually removed, since their presence might prevent certain WEKA classifiers from being used. Besides handling irrelevant data, the primary parameters suspected to be correlated with the types of fault occurring in the ADSL copper access network are selected, while the others are removed from the dataset. After the preprocessing stage, 420 samples of ADSL lab data are obtained. The line operation parameters and loop line test parameters, represented by 26 attributes, are tabulated in Table 1. The loop line test attributes are extracted based on the RLC lumped components discussed previously in Fig. 4. The cable fault classification is initiated by labeling all related cable fault types. Table 2 shows the division of data samples among the classes, namely the ideal condition and the 4 types of cable faults.
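The cleaning step described above is performed manually in the paper; one way it could be automated is sketched below. This is a minimal illustration under stated assumptions: the attribute names (`attenuation`, `snr`), the label key `fault`, and the sample rows are hypothetical and are not taken from the lab dataset.

```python
import math

def is_valid_number(text):
    """Return True if text parses as a finite float."""
    try:
        return math.isfinite(float(text))
    except (TypeError, ValueError):
        return False

def clean_rows(rows, attributes, label_key="fault"):
    """Keep only labeled rows whose selected attributes are all numeric."""
    cleaned = []
    for row in rows:
        values = [row.get(name) for name in attributes]
        if not all(is_valid_number(v) for v in values):
            continue  # drop rows containing symbols or missing values
        if not row.get(label_key):
            continue  # drop rows without a fault label
        record = {name: float(row[name]) for name in attributes}
        record[label_key] = row[label_key]
        cleaned.append(record)
    return cleaned

# Hypothetical raw rows, as they might look after export from the text file.
raw = [
    {"attenuation": "12.5", "snr": "30.1", "fault": "ideal"},
    {"attenuation": "--", "snr": "28.0", "fault": "open"},
    {"attenuation": "40.2", "snr": "", "fault": "short"},
    {"attenuation": "35.0", "snr": "9.8", "fault": "bridge tap"},
]
cleaned = clean_rows(raw, ["attenuation", "snr"])
```

Passing only the selected attribute names also drops the unselected columns, which mirrors the attribute selection described in the text.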
After data preprocessing, the cleansed data are used for classification. The classifiers are chosen to classify the ADSL copper access network fault type. Table 3 shows the parameter settings of each of the algorithms used. Table 4 presents the classification performance on the ADSL data based on the k-fold cross-validation method. In WEKA, a 10-fold setting for cross-validation is chosen by default [28]. The percentage accuracy for each ADSL line fault type is obtained by comparing the correctly classified instances with the total samples of each tested case. The performance of the algorithms is evaluated based on the total accuracy percentage, which is the percentage of correctly classified instances. All samples are used for classification and are evaluated using the cross-validation method, in which the dataset is divided into k subsets. Each subset is evaluated in turn, and the results are averaged to give the accuracy on the dataset.
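The k-fold procedure described above can be sketched as follows. This is an illustrative outline, not WEKA's implementation: the fold assignment here is a plain shuffled split, the placeholder learner `fit_majority` is a hypothetical stand-in for a real classifier, and the toy data are invented.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_majority(train_features, train_labels):
    """Placeholder learner: always predicts the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda row: most_common

def cross_validate(features, labels, fit=fit_majority, k=10, seed=0):
    """Return the percentage of correctly classified instances over k folds."""
    correct = 0
    for test_idx in k_fold_indices(len(labels), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = fit([features[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        # Each sample is tested exactly once, by a model that never saw it.
        correct += sum(model(features[i]) == labels[i] for i in test_idx)
    return 100.0 * correct / len(labels)

# Toy data: 15 "ideal" and 5 "open" samples with a single dummy attribute.
features = [[float(i)] for i in range(20)]
labels = ["ideal"] * 15 + ["open"] * 5
accuracy = cross_validate(features, labels)
```

The sketch pools the correct predictions from all folds, so the returned value is the percentage of correctly classified instances over the whole dataset. Any function that takes training features and labels and returns a predictor can be passed as `fit`.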

Result and Discussion
Classification of the ADSL line impairment data using the J48 algorithm yields its highest accuracy, 91.67%, for the short-wired impairment type. For the k-NN algorithm, the line impairment with the highest accuracy is the open-wired impairment, at 72.50%. The multilayer perceptron and random forest algorithms achieve 85.11% and 95.74% accuracy, respectively; the accuracy of uneven line impairment classification using random forest is noticeably higher than with multilayer perceptron. Considering the overall accuracy of these algorithms, the random forest classifier gives the highest accuracy, 71.67%, with 301 correctly classified samples. The algorithm with the lowest accuracy is k-NN, with only 51.43% accuracy and 216 correctly classified samples.
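As a quick check of the overall figures, the accuracy percentage is the number of correctly classified samples divided by the 420 samples in the dataset. The helper name `accuracy_percent` below is introduced only for illustration.

```python
def accuracy_percent(correct, total):
    """Percentage of correctly classified instances, rounded to 2 decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * correct / total, 2)

random_forest = accuracy_percent(301, 420)  # 71.67
knn = accuracy_percent(216, 420)            # 51.43
```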
Even though random forest has the highest accuracy, its training time for cable impairment classification is slightly higher than that of the k-NN classification algorithm: 0.34 s versus 0.00 s. The multilayer perceptron algorithm has the longest training time, taking 3.74 s under 10-fold cross-validation. Training an MLP can be a gradual process [29] because the number of training epochs and the time elapsed over all epochs determine when training is complete [30]. The classification results obtained using WEKA in Table 4 indicate that the random forest algorithm gives a promising result for classifying fault types in ADSL copper access networks. Because the number of short fault samples is small, the J48 decision tree algorithm can provide a high accuracy for this class [31]. The small number of instances increases the reliability of the algorithm. This is attributed to the algorithm's built-in imputation technique, which makes it suitable for the short-wired line impairment parameters.
k-NN is an algorithm in which classification is achieved by calculating the closest distance between data attributes [32]. Because most attribute values are rather close to each other across classes, classification using this algorithm may result in lower accuracy than with J48. The high accuracy of the random forest method may be due to its ability to handle a large number of parameters, owing to the feature selection built into the model development process [33]. Random forest can be used with many variables and does not require complex parameter tuning to achieve high classification accuracy.
An ensemble classifier is a method that combines the predictions of several classifiers into one classifier model. Several weak classifier models are trained and combined using either voting or averaging. An ensemble classifier such as random forest, which uses the bagging method, suits this dataset best because it can take into account the small changes produced by the weak learners in its decision [34] [35]. One aspect that may contribute to the low accuracy of k-NN is that new data are assigned to classes by a plain majority voting rule, which may overlook how close each neighbor actually is; this is problematic when the distances of the nearest neighbors to the test point vary greatly [36]. Another critical factor that may affect its accuracy is the value of k chosen for k-NN classification. Choosing a high value of k may result in misclassification of a new point, while choosing a small value of k may lead to overfitting [37].
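The bagging-and-voting idea behind ensemble classifiers can be illustrated with a small sketch. It is not WEKA's random forest: the weak learner here is a hypothetical single-attribute nearest-class-mean rule rather than a decision tree, and the two-attribute toy data are invented for the example.

```python
import random
from collections import Counter, defaultdict

def bootstrap_sample(features, labels, rng):
    """Draw len(labels) training pairs with replacement."""
    picks = [rng.randrange(len(labels)) for _ in labels]
    return [features[i] for i in picks], [labels[i] for i in picks]

def fit_stump(features, labels, attribute):
    """Weak learner: nearest class mean on a single attribute."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row, label in zip(features, labels):
        sums[label] += row[attribute]
        counts[label] += 1
    means = {label: sums[label] / counts[label] for label in counts}
    return lambda row: min(means, key=lambda c: abs(row[attribute] - means[c]))

def fit_bagged(features, labels, n_learners=25, seed=0):
    """Train each weak learner on a bootstrap sample and a random attribute."""
    rng = random.Random(seed)
    learners = []
    for _ in range(n_learners):
        sample_x, sample_y = bootstrap_sample(features, labels, rng)
        attribute = rng.randrange(len(features[0]))
        learners.append(fit_stump(sample_x, sample_y, attribute))
    return learners

def predict_bagged(learners, row):
    """Combine the weak learners by plain majority vote."""
    votes = Counter(learner(row) for learner in learners)
    return votes.most_common(1)[0][0]

# Toy data with two well-separated classes (values are illustrative only).
features = [[1.0, 10.0], [1.2, 11.0], [0.9, 9.5],
            [5.0, 50.0], [5.2, 52.0], [4.8, 49.0]]
labels = ["ideal"] * 3 + ["short"] * 3
ensemble = fit_bagged(features, labels)
prediction_low = predict_bagged(ensemble, [1.1, 10.5])
prediction_high = predict_bagged(ensemble, [5.1, 51.0])
```

Each learner sees a different resampled view of the data, so an individual learner can be wrong while the vote across all of them remains stable.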
The nearest neighbor search algorithm used by k-NN in WEKA is the linear NN search, or brute force method. The brute force method calculates the distance from a new point to every point in the training data, sorts the distances, and takes the k nearest for a majority vote. Thus, it needs no separate training process and incurs only prediction complexity, resulting in a training time of 0 seconds for k-NN [38].
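The brute force search described above can be sketched in a few lines. This is a minimal illustration assuming Euclidean distance and invented toy data; the value of k used in the paper is set in Table 3 and is not reproduced here.

```python
import math
from collections import Counter

def knn_predict(train_features, train_labels, query, k=1):
    """Classify query by majority vote among its k nearest training points."""
    # Linear scan: compute the distance from the query to every training point.
    distances = [
        (math.dist(row, query), label)
        for row, label in zip(train_features, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy training set with two attributes per sample (illustrative values).
train = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
labels = ["ideal", "ideal", "open", "open"]
near_origin = knn_predict(train, labels, [0.2, 0.3], k=3)
near_far = knn_predict(train, labels, [5.5, 5.0], k=1)
```

There is no fitting step: all of the work happens at prediction time, which is consistent with the zero training time reported for k-NN.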

Conclusion
This paper presents copper cable fault classification with WEKA using several machine learning algorithms, namely Decision Tree, k-NN, Multilayer Perceptron, Random Forest, Naïve Bayes, and SMO. The results show that the random forest algorithm provides the highest accuracy, about 71%, with a short training time of 0.34 s. This is followed by the Decision Tree (J48). Compared with the existing literature, where machine learning classification accuracy is normally higher than 80%, this accuracy is considerably low. This may be due to an imbalanced distribution of lab data across the tested classes, and further data preprocessing tailored to the ADSL lab data is required. Considering this, the accuracy of cable fault detection should be improved, building on random forest and J48, by developing a machine learning algorithm in Python. In the future, the algorithms that show promising accuracy may be explored further using Python, which offers a variety of libraries suitable for machine learning development and may allow flexibility in developing specific algorithms. New ADSL and VDSL data may be acquired to test the reliability of the classification model chosen with WEKA and further explored with Python. This may give insight into the efficiency of a chosen algorithm in determining fault types for different copper access network technologies. This research on copper access networks might benefit telecommunication sectors that still serve customers on copper access networks and may only need remote line condition assessment rather than onsite troubleshooting.

Declarations
Author contribution. All authors contributed equally as main contributors to this paper. All authors read and approved the final paper.
Funding statement. None of the authors have received any funding or grants from any institution or funding body for the research.
Conflict of interest. The authors declare no conflict of interest.
Additional information. No additional information is available for this paper.