Similarity measure fuzzy soft set for phishing detection


Technology giants such as Google and Facebook lost about $100 million to phishing emails from hackers who impersonated hardware vendors in 2017. The economic effect of phishing attacks is enormous; a report compiled over five years by the FBI Internet Crime Complaint Center showed that the financial loss caused by phishing attacks exceeds $12 billion globally [5].
Phishing attackers have become increasingly resilient over the years, owing to the alarming volume of attacks and the innovative techniques they employ. Security specialists and phishers are locked in a vicious circle because catching phishers has become very complicated: phishers continually change their tactics to beat anti-phishing techniques [6]. The total number of phishing sites detected by APWG in the second quarter of 2019 was 182,465, marginally up from 180,768 in the first quarter, which in turn was a significant increase from 138,328 in the fourth quarter of 2018 and 151,014 in the third quarter of 2018 [7]. Email has also been identified as the top phishing target; consequently, phishing email attacks aimed at individuals and corporate bodies are on the rise [8]. To safeguard users' sensitive information, an adequate means of spotting phishing emails must be developed.
Anti-phishing techniques were developed in a previous study [9] to protect users from phishing scams. Today, many email filters still rely on static approaches; they are insufficiently resilient to cope with emerging phishing trends and can only handle established phishing activities. This leaves email users vulnerable to different phishing attacks. It is a loophole because impostors are not static in their activities: they change their operating mode as often as possible to avoid detection [10]. This has inspired several researchers to investigate more successful strategies for combating both known and emerging fraud. Additionally, techniques that draw on data mining algorithms have been implemented [11]-[14]. One data mining approach is classification, which can be useful for predicting phishing websites [15]-[17]. Phishing detection is a typical classification problem in data mining, in which a classifier is built from a large set of website features. Phishing attacks, phishing classification, detection, and future challenges have been described in [18]-[20].
Two concepts are important when applying soft set theory to classification problems: decision-making based on the fuzzy soft set (FSS), and the comparison of the similarity of two fuzzy soft sets [21]. Maji et al. [22] studied the soft set decision-making problem as a basis for classification. Furthermore, Handaga [23] suggested an extended classification approach based on FSS, called the Fuzzy Soft Set Classifier (FSSC), which uses the similarity of two fuzzy soft sets. Compared to soft set classification based on decision-making problems, FSSC has low computational complexity and a high degree of accuracy.
Based on these findings, the main objective of this study was to investigate the use of FSS to classify phishing websites. We hope the results of this study will support early detection of phishing activity. A classification model is constructed using a feature set. In this case, web page information such as URLs and network features is required. These features can be extracted and used with classification or machine learning techniques [16]. Identifying the best feature sets is in high demand, because doing so can improve the prediction accuracy of classifiers [15].
Thwarting phishing attacks remains challenging, and researchers focus on both the prevention and the identification of phishing attacks. Therefore, in this paper, we propose a novel approach to phishing website detection. In this study, we choose the classification of anti-phishing solutions as the research methodology. The experiments explore the fuzzy soft set (FSS) with several similarity measures and focus on determining classification performance on the phishing dataset. This paper also describes the basic theory and definitions of the fuzzy set (FS), soft set (SS), fuzzy soft set (FSS) [24], similarity measures, and classification. In addition, FSS and new related results are presented, and open questions are provided for further investigation.

Fuzzy Soft Set
This part introduces the main definitions and preliminaries used in the sequel for the following extensions of set theory: fuzzy set (FS), soft set (SS), soft matrix (SM), and fuzzy soft set (FSS).
Definition 2.1 Fuzzy set (FS) [25]: Let $U$ be the universal set, a space of points or objects. A set $A$ characterized by a function $\mu_A : U \to [0,1]$ is called a fuzzy set (class) over $U$. The function $\mu_A$ is called the membership function (or indicator function) of the fuzzy set $A$, and the value $\mu_A(x)$ is the membership grade of $x \in U$ in $A$. A fuzzy set $A$ over $U$ can be written as in (1).
$$A = \{(x, \mu_A(x)) : x \in U\} \tag{1}$$
Definition 2.2 Soft set (SS) [26][27]: Let $U$ and $E$ be a universal set and a set of parameters, respectively. Suppose that $A \subseteq E$, and let $P(U) = 2^U$ denote the power set of $U$. Then a pair $(F, A)$ is called a soft set over $U$ and is defined as the set of ordered pairs (2).
$$(F, A) = \{(e, F(e)) : e \in E,\ F(e) \in P(U)\} \tag{2}$$
where $F$ is a mapping $F : E \to P(U)$. The support of $F$ is $A$, that is, $F(e) \neq \emptyset$ for all $e \in A$ and $F(e) = \emptyset$ for all $e \notin A$. In other words, a soft set $(F, A)$ over $U$ is a parameterized family of subsets of $U$.
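As a minimal sketch of Definitions 2.1 and 2.2, a fuzzy set can be stored as a map from elements to membership grades, and a soft set as a map from parameters to crisp subsets of the universe. The element names, parameter names, and grades below are invented for illustration and are not taken from the paper.

```python
# Universe and parameter set (hypothetical names, for illustration only).
U = ["u1", "u2", "u3", "u4", "u5"]
E = ["e1", "e2", "e3"]

# Fuzzy set over U: each element is mapped to a grade in [0, 1].
fuzzy_A = {"u1": 0.9, "u2": 0.4, "u3": 0.0, "u4": 0.7, "u5": 0.1}


def membership(fuzzy_set, x):
    """Return the membership grade of x; unlisted elements have grade 0."""
    return fuzzy_set.get(x, 0.0)


# Soft set: F maps every parameter in E to a crisp subset of U.
F = {
    "e1": {"u1", "u4"},
    "e2": {"u1", "u2", "u5"},
    "e3": set(),  # empty image, so e3 lies outside the support
}


def support(soft_map):
    """Return the set of parameters whose image under F is non-empty."""
    return {e for e, subset in soft_map.items() if subset}


print(membership(fuzzy_A, "u2"))
print(sorted(support(F)))
```

Here the support of `F` plays the role of the subset A of E in Definition 2.2.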

Example 2.1
The problem of deciding which car to buy is posed in terms of the "attractiveness of the car," which is expressed as a soft set $(F, E)$. Assume that the universal set $U$ contains five cars, denoted $U = \{c_1, c_2, c_3, c_4, c_5\}$, and $E = \{e_1, e_2, e_3\}$, where $e_i$ ($i = 1, 2, 3$) denote the parameters "beautiful", "expensive", and "luxurious", respectively. The soft set $(F, E)$ over $U$ can then be written as the relation shown in Table 1.
Example 2.2 Based on Example 2.1, the soft matrix of the soft set is written as in Table 3.
Table 3. The FSS representation.

Similarity measures

Matching function
In this section, the fuzzy soft set (FSS) is redefined for greater computational convenience. It is also assumed that $U$ and $E$ are finite. We define an FSS as follows.
Definition 2.5 Let $U$ be a universal set and $E$ a set of parameters, and let $I^U$ denote the collection of all fuzzy subsets of $U$. An FSS over $U$ is a pair $(F, E)$, where $F$ is a mapping $F : E \to I^U$.
Basically, definitions 1 and 4 are the same: if we take a subset $A$ of $E$ and assign the e-approximation $F(e) = 0$ for all $e \in E \setminus A$, then the FSS $(F, A)$ and $(F, E)$ have the same meaning. We can represent an FSS over $U$ as a matrix. Look again at Example 2.2. In the fuzzy membership matrix, the $(i, j)$-th entry is the membership value $F(e_i)(x_j)$ if $e_i \in A$, and it is equal to 0 if $e_i \notin A$. An FSS can therefore be written as a fuzzy membership matrix. An example is given below to illustrate Definition 2.6.
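The matrix form described above can be sketched as follows. This is an illustrative sketch only: the function name `membership_matrix` and all membership values are hypothetical, and rows are indexed by parameters and columns by elements of the universe.

```python
def membership_matrix(F, A, E, U):
    """Build the |E| x |U| matrix whose (i, j) entry is F(e_i)(x_j)
    when e_i is in A, and 0 when e_i is not in A."""
    rows = []
    for e in E:
        if e in A:
            rows.append([F[e].get(x, 0.0) for x in U])
        else:
            rows.append([0.0] * len(U))
    return rows


# Hypothetical data: three objects, three parameters, two of them in A.
U = ["x1", "x2", "x3"]
E = ["e1", "e2", "e3"]
A = {"e1", "e3"}
F = {
    "e1": {"x1": 0.2, "x2": 0.9, "x3": 0.5},
    "e3": {"x1": 1.0, "x3": 0.3},
}

M = membership_matrix(F, A, E, U)
for row in M:
    print(row)
```

The row for `e2` is all zeros because `e2` is not in `A`, which is how (F, A) and (F, E) come to carry the same information.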
Let $M(\hat{F}, \hat{G})$ denote the similarity between two FSSs $\hat{F}$ and $\hat{G}$. To determine the similarity between $\hat{F}$ and $\hat{G}$, we first compute the similarity of their e-approximations. To do that, we define $M_i(\hat{F}, \hat{G})$ as the similarity between the two $e_i$-approximations $F(e_i)$ and $G(e_i)$.
Definition 2.7 Let us define $M_i(\hat{F}, \hat{G})$ as in (4).
where $F_{ij} = F(e_i)(x_j) \in I$ and $G_{ij} = G(e_i)(x_j) \in I$. The definition can be illustrated by Example 2.5.
Proof. The result follows directly from the definition.
Note 2.1 Here, $M(\hat{F}, \hat{G}) = 1$ does not imply $\hat{F} = \hat{G}$.

Similarity measure based distance
Given two fuzzy sets $A$ and $B$, if the distance between them is $d(A, B)$, the similarity between them can be formulated as $S(A, B) = \frac{1}{1 + d(A, B)}$. Again, an FSS is a collection of fuzzy sets, namely its e-approximations. Furthermore, the distance between two fuzzy sets can be defined in more than one way.
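The relation between distance and similarity stated above can be sketched in a few lines. The choice of the Hamming distance (sum of absolute differences of membership grades) is an assumption made for illustration; the text does not confirm which distance the authors use, and the function names are hypothetical.

```python
def hamming_distance(a, b):
    """Sum of absolute differences between two lists of membership grades.
    Assumed distance, chosen only for illustration."""
    if len(a) != len(b):
        raise ValueError("fuzzy sets must be defined over the same universe")
    return sum(abs(x - y) for x, y in zip(a, b))


def similarity_from_distance(a, b):
    """Similarity S = 1 / (1 + d) between two fuzzy sets."""
    return 1.0 / (1.0 + hamming_distance(a, b))


def fss_similarity(F, G):
    """Similarity between two fuzzy soft sets given as membership matrices
    (one row per parameter). Distances are summed over all e-approximations,
    which is an assumption of this sketch."""
    total = sum(hamming_distance(f_row, g_row) for f_row, g_row in zip(F, G))
    return 1.0 / (1.0 + total)


print(similarity_from_distance([0.2, 0.8], [0.5, 0.3]))
print(fss_similarity([[0.0, 1.0]], [[1.0, 0.0]]))
```

With this form, identical sets have distance 0 and similarity 1, and the similarity decreases toward 0 as the distance grows.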

Fuzzy Soft set classification
The classification algorithm consists of a learning (training) step and a classification step. Before these two steps, fuzzification and formation of the fuzzy soft set are applied; these yield the feature vectors of all data, for both the training and the testing dataset. The dataset is split into two parts, used for training and testing. Each experiment splits the data randomly using nine different percentages of training and testing data, as shown in Table 4. Training aims to produce a fuzzy soft set as the fixed model of each class. The data are learned according to their class group [31]. The learning step obtains the center of each class. Given data $U = \{u_1, u_2, \dots, u_N\}$, there are $C$ classes of data with $n_r$ ($r = 1, 2, \dots, C$) data in each class, where $\sum_{r=1}^{C} n_r = N$, and $A \subseteq E$, $\{e_i, i = 1, 2, \dots, d\}$, with $E$ a set of parameters. Let $F_{C_r}$ denote the FSS of the $r$-th class. Then the class center vector, denoted $P_{C_r}$, is defined as in (9).
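The learning step can be sketched as follows. Since the class center formula is not reproduced here, the sketch assumes that the center of a class is the component-wise mean of the fuzzified feature vectors belonging to that class; that assumption, the function name `class_centers`, and the sample data are all illustrative.

```python
from collections import defaultdict


def class_centers(vectors, labels):
    """Return one center vector per class label.
    Assumption: the center is the component-wise mean of the class members."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        for i, value in enumerate(vec):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


# Hypothetical fuzzified training data (membership grades in [0, 1]).
train_vectors = [[0.2, 0.4], [0.4, 0.8], [1.0, 0.0]]
train_labels = ["phishing", "phishing", "legitimate"]

centers = class_centers(train_vectors, train_labels)
print(centers)
```

Each center is itself a vector of membership grades, so it can be compared with new data using the same similarity measures as any other fuzzy soft set.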
Classification assigns unknown data to a target class. The class centers produced by the training phase are used to determine the class of new data, specifically by comparing the class center vector of each class with the fuzzy soft set of the new data. This comparison uses the similarity measure formula (10).
where the similarity can be any of the measures discussed above, i.e., the Similarity measure, Distance measure, Matching function, and Comparison table.
After the similarity value for each class is obtained, the class label most suitable for the new data is determined by taking the maximum similarity over all classes. The class label can be written as in (11).
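The classification step, picking the class whose center is most similar to the new data, can be sketched as below. The distance-based similarity used here is only one of the measures the paper mentions, and the function names, class centers, and input vector are hypothetical.

```python
def similarity(a, b):
    """Distance-based similarity 1 / (1 + d), with d the sum of absolute
    differences of membership grades (an assumed choice of distance)."""
    distance = sum(abs(x - y) for x, y in zip(a, b))
    return 1.0 / (1.0 + distance)


def classify(centers, new_vector, sim=similarity):
    """Return the label of the most similar class center and all scores."""
    scores = {label: sim(center, new_vector) for label, center in centers.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical class centers and one fuzzified test sample.
centers = {"phishing": [0.9, 0.8], "legitimate": [0.1, 0.2]}
label, scores = classify(centers, [0.8, 0.7])
print(label, scores)
```

Passing a different function as `sim` is a simple way to compare several similarity measures on the same class centers.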

Computational experiment
The algorithm was implemented for experimentation in MATLAB R2016a (9.0.0.34136), running on an Intel Core i5 1.80 GHz processor with 8 GB of RAM under the macOS High Sierra 10.13.1 operating system. The fuzzy soft set (FSS) algorithm was evaluated in terms of precision, recall, and response time on the experimental datasets. The results are summarized in Fig. 1 to Fig. 3. Fig. 1 shows the accuracy results. The FSS based on the Similarity measure performs better than the other measures, while the lowest accuracy is obtained with the comparison table. Fig. 2 shows that the highest recall is also achieved by the FSS based on the Similarity measure. This indicates that the Similarity measure selects the most relevant items when predicting phishing cases, with the highest accuracy. However, according to the response time shown in Fig. 3, it is not the fastest measure.
The overall averages of all techniques in terms of accuracy, recall, and response time are summarized in Table 5. The Similarity measure based on FSS performs well, reaching 0.9549 in accuracy and 0.9977 in recall. This result shows that the Similarity measure is the most accurate of the measures, although its response time was not better than that of the Matching function.
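For reference, accuracy and recall of the kind reported above are usually computed from predicted and true labels as in the following sketch. The labels are invented, the standard definitions are assumed (accuracy as the fraction of correct predictions, recall as true positives over all actual positives), and this is not the authors' MATLAB code.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Compute accuracy and recall for a binary labelling.
    `positive` marks the phishing class in this sketch."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = correct / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


# Hypothetical labels: 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]
acc, rec = accuracy_and_recall(y_true, y_pred)
print(acc, rec)
```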

Conclusion
In this article, we have analyzed the proposed technique. Classifying phishing data collected from web pages is an important application area of data classification in web mining. Several similarity measures based on the fuzzy soft set were applied to the phishing dataset. The experimental results, in terms of accuracy and recall, show that the best classifier is the fuzzy soft set (FSS) based on the Similarity measure. This suggests that FSS is a promising approach to phishing detection, although its response time was not better than that of the Matching function. Future work could include a hybrid classification model combining multiple web mining techniques such as attribute selection and grouping.