An extended approach of weight collective influence graph for detection influence actor



Introduction
Hate speech is one of the significant topics of discussion in social media analysis. It is closely tied to the freedom users have to share content and opinions on social media platforms, and this freedom has also led to an increase in hate speech on those platforms. This increase is one of the challenges the government faces in uncovering influential actors of hate speech on social media and in issuing legislation in the form of the Electronic Information and Transactions Law (UU ITE). One of the methods used to uncover influential actors is the centrality measure, which identifies the vital nodes in a network. A vital node represents a person whose presence on the network tends to dominate others. Detecting such actors has numerous benefits, such as explaining the network's dynamics. Traditional methods widely used to measure the influence of a node in a network include Betweenness Centrality (BC), Degree Centrality (DC), and Closeness Centrality (CC) [1]. These metrics are evaluated according to the number of node connections, the relationship with neighbours, and the paths that cross the node [2]. DC only looks at the target node's own information, with low accuracy and low time complexity [3]. DC and CC give better results than BC, which has high time complexity [4]. Other methods include Cross-Face Centrality [5]. Rahim et al. [6] combined centrality and similarity measurements for friend recommendation. Detection of central or influential actors in a network is mainly carried out on social media, such as Twitter [7]-[12]. Influential-actor detection has also been applied in other fields, such as cybercrime [13]-[15] and e-commerce [16].
Based on these problems, Morone et al. [25] proposed the Collective Influence (CI) method, which identifies an influential node based on the number of connected neighbouring nodes. After CI was introduced, many studies were carried out to improve its performance [24], [26]-[29]. Morone et al. [24] added a heapification function that arranges nodes in a tree so that the CI values in each subtree are sorted from largest to smallest. Similarly, Kobayashi et al. [26] treated each connected node as a separate community when calculating centrality. Furthermore, Teng et al. [27] extended the CI method by adding a Linear Threshold Model (LTM). Kong et al. [28] introduced the concept of the probably-established subcritical path (PSP) to determine the distribution of information based on the path traversed by nodes. Their results showed that PSP-based CI performed better than CI-TM. Wu et al. [29] proposed Enhanced Collective Influence (ECI), a CI method extended with features that account for the loop density and degree diversity of the local network topology. However, the CI method can only be used for unweighted graphs. The purpose of this research is therefore to introduce a new method, WCIG, for detecting influential actors in a network, specifically in hate speech. The contributions are as follows: (1) the WCIG method handles weighted graphs by adding a weight parameter based on the user's number of followers; (2) WCIG is implemented with different values of the parameter ∂, which determines the coverage of neighbouring nodes.

Method
This research proposes the WCIG method, which extends the CI method developed by Morone et al. [25]. DC is one of the traditional methods used to determine the influential actors in a network. It is based on counting the number of nodes connected to a given node, that is, its degree. The formula used to calculate degree centrality is stated as follows [30].
$$C_D(i) = \sum_{j=1}^{n} a(i, j) \tag{1}$$
where $n$ is the number of nodes in the graph and $a(i, j) = 1$ if there is a relationship between nodes $i$ and $j$. Conversely, $a(i, j) = 0$ implies no relationship between node $i$ and node $j$. The value $C_D(i) = 0$ indicates that node $i$ is an isolated node because it is not connected to any node. Degree centrality is the basic concept used to calculate the centrality of a node based on its relationships with others. Opsahl et al. [31] developed the method further for weighted graphs. They concluded that each node offers a different contribution, so it is necessary to consider node weights. Their method combines the number of relations between nodes (degree) with the strength of those relations. To do this, Opsahl et al. [31] used a tuning parameter (α) to set the importance of the number of relations compared to the weights. The Opsahl method is based on Eq. (2) [31].
$$C_D^{w\alpha}(i) = k_i \times \left(\frac{s_i}{k_i}\right)^{\alpha} = k_i^{(1-\alpha)} \times s_i^{\alpha} \tag{2}$$
Here $C_D^{w\alpha}(i)$ is the weighted centrality of node $i$, $k_i$ is the degree of node $i$, $s_i$ is the weight of node $i$ and $\alpha$ is a tuning parameter. Like degree centrality, the method proposed by Opsahl [31] is based on the connectivity between two nodes (local). According to Opsahl, it is also necessary to calculate the connection or influence of each node connected to the network (global). A method for this was developed by Morone [25] and is called Collective Influence (CI). The CI calculation is based on the number of ties formed between a node and its neighbors [25]. The greater the number of nodes connected to a node, the higher its centrality value. In addition, the CI method considers the network of neighboring nodes: information spreads faster when a node is connected to the centre and has a greater value. Furthermore, this method is also based on the number of nodes connected to it and the neighboring value (∂). The formula for Collective Influence is stated in Eq. (3) [24].
$$CI_{\ell}(i) = (k_i - 1) \sum_{j \in \partial Ball(i, \ell)} (k_j - 1) \tag{3}$$
Where $k$ denotes the number of ties (degree), $i$ and $j$ are the nodes used in the graph, and $j \in \partial Ball(i, \ell)$ ranges over the neighboring nodes ($j$) of the source ($i$) at the radius $\ell$, which this paper sets through the parameter ∂. The value $CI_{\ell}$ is calculated for each node based on its connectivity. Next, the nodes are sorted by their $CI_{\ell}$ value, and the node with the largest value is identified as an influential actor in the network. This process is repeated until none of the nodes is connected, that is, until only isolated nodes remain. As Fig. 1 illustrates, the WCIG method does not only determine the centrality value between two nodes; it also calculates the centrality of the connected nodes. For example, in Fig. 1 with parameter ∂ = 2 (marked with a red line), the WCIG value of node A is based on the connection between nodes B, C, D, E, W, O, H, I, K, and M. The greater the neighboring value (∂), the wider the coverage around the influential node. However, a larger value tends to reduce computational performance [32]. The parameter ∂ determines the extent of the relationship between nodes.
This study proposes a new method called Weighted Collective Influence Graph (WCIG). It was developed because the CI approach is based only on the presence or absence of relations, which leads to information loss and makes the network's topology difficult to explain [31]. In addition, according to Rachman [33], interactions between users on social media have varying intensities: the more intense the interaction, the greater the flow of information. Furthermore, using weights to determine actor centrality provides better accuracy than unweighted graphs [34]. This research develops the CI method using DC, which calculates the centrality of a node based on its relationships. WCIG adds weights to the calculation following Opsahl, combining the number of ties and their weights in the graph.
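The degree, Opsahl and CI measures described in this section can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the adjacency-dictionary representation and the toy graph are assumptions, and the CI function follows the frontier-based definition of Morone et al., in which only nodes at distance exactly equal to the radius contribute to the sum.

```python
from collections import deque


def degree(graph, node):
    """Degree centrality as a raw count of neighbours."""
    return len(graph[node])


def opsahl_degree(graph, strength, node, alpha=0.5):
    """Opsahl's weighted degree: k^(1 - alpha) * s^alpha."""
    k = degree(graph, node)
    if k == 0:
        return 0.0
    return k ** (1 - alpha) * strength[node] ** alpha


def frontier(graph, source, radius):
    """Nodes whose shortest-path distance from `source` is exactly `radius`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue  # do not expand beyond the requested radius
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return {n for n, d in dist.items() if d == radius}


def collective_influence(graph, node, radius):
    """CI = (k_i - 1) * sum of (k_j - 1) over the frontier nodes j."""
    k_i = degree(graph, node)
    return (k_i - 1) * sum(degree(graph, j) - 1
                           for j in frontier(graph, node, radius))


# Hypothetical toy graph, stored as an undirected adjacency mapping.
toy = {
    "A": {"B", "C"},
    "B": {"A", "D", "E"},
    "C": {"A", "F"},
    "D": {"B"},
    "E": {"B"},
    "F": {"C"},
}

print(collective_influence(toy, "A", 1))
print(collective_influence(toy, "A", 2))
```

In the toy graph, node A scores higher at radius 1 than at radius 2 because every node two hops away is a leaf, and a leaf contributes zero to the sum.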
The tuning parameter (α) determines the importance of the number of relations compared to the weights. WCIG was developed from the traditional CI method by adding degree and weight. The measure of the WCIG is shown in Eq. (4).
Where $k$ represents the number of connected nodes (degree), $s_i$ represents the weight of node $i$, $s_j$ represents the weight of node $j$ and α is a tuning parameter. In this study, the two attributes used in the WCIG calculation are the number of connected nodes (degree) and the total weight of each node. The number of ties is obtained from the retweet interactions between users: when user A retweets user B, the two users are connected. The total weight is obtained from followers' interactions. Retweets and followers' interactions both play a significant role in disseminating information on Twitter. Through a retweet, a user indirectly disseminates information to those connected to them. Followers' interactions have the same effect, because everyone who follows a user is able to see the posted information. The proposed WCIG method calculates the value of every node in each iteration and then deletes the node with the largest WCIG value. This research aims to determine the impact on information dissemination of removing the most influential node, followed by the subsequent ones. Iterations are repeated until only isolated nodes remain. The pseudocode of the WCIG method is shown in Algorithm 1, Fig. 2.
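The score-and-remove loop described above can be sketched as follows. The exact form of Eq. (4) is not reproduced in this text, so the scoring function below is only one plausible reading: it replaces the degree terms of CI with Opsahl's weighted degree, using follower counts as node weights. The names `weighted_degree`, `wcig_score` and `rank_by_removal`, the toy graph and the follower counts are all hypothetical.

```python
from collections import deque


def weighted_degree(graph, followers, node, alpha):
    """Opsahl-style weighted degree: k^(1 - alpha) * s^alpha."""
    k = len(graph[node])
    if k == 0:
        return 0.0
    return k ** (1 - alpha) * followers[node] ** alpha


def frontier(graph, source, radius):
    """Nodes at shortest-path distance exactly `radius` from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return [n for n, d in dist.items() if d == radius]


def wcig_score(graph, followers, node, radius, alpha):
    # ASSUMPTION: weighted degree of the node times the summed weighted
    # degree of its frontier. This is not confirmed to match Eq. (4).
    own = weighted_degree(graph, followers, node, alpha)
    return own * sum(weighted_degree(graph, followers, j, alpha)
                     for j in frontier(graph, node, radius))


def rank_by_removal(graph, followers, radius=2, alpha=0.5):
    """Repeatedly remove the top-scoring node until no ties remain."""
    work = {n: set(nbrs) for n, nbrs in graph.items()}  # leave input intact
    order = []
    while any(work.values()):
        # sorted() makes tie-breaking deterministic (first name wins)
        best = max(sorted(work),
                   key=lambda n: wcig_score(work, followers, n, radius, alpha))
        order.append(best)
        for nbr in work.pop(best):
            work[nbr].discard(best)
    return order


# Hypothetical retweet graph and follower counts.
toy = {"H": {"a", "b", "c"}, "a": {"H"}, "b": {"H"},
       "c": {"H", "d"}, "d": {"c"}}
followers = {"H": 100, "a": 1, "b": 1, "c": 4, "d": 1}

print(rank_by_removal(toy, followers, radius=1, alpha=0.5))
```

Scores are recomputed on the reduced graph after every removal, which mirrors the per-iteration recalculation described in the text and is also what makes the procedure expensive on large graphs.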

Dataset
Data were collected through the Twitter Streaming API, and the dataset used in this research is mainly in the Indonesian language. Crawling was carried out by retrieving tweets containing hate speech based on keywords obtained from https://hatebase.org/, such as babi, banci, bule, cabo, celeng, cina, kafir, pelacur, lonte, munafik, perempuan jalang, perempuan nakal, singkek, sundal, bencong, bagong, waria, and binal. In total, 18 keywords were crawled from January 1 to 22, 2021. The summary of the number of tweets from crawling and the description of the total dataset are shown in Table 1, and

Experiment Scenario
The pre-processing stage, consisting of data cleaning, tokenization, filtering, stemming, and stop-word removal, was carried out before implementing the WCIG method. First, data cleaning deletes unused information and missing values and ensures that the remaining data adhere to the graph format. Tokenization then separates each tweet into its constituent words. Filtering removes the word "RT" and symbols from the tweet data. Stemming reduces affixed words to their base forms. Stop-word removal drops terms that are considered less important, such as the conjunctions "dan," "jika," "atau," "tetapi," etc. The pre-processing results were then used to generate the graph data by connecting users based on retweets and mentions (marked with the symbol "RT" or "@"). When a tweet contains both symbols, the user is connected to the retweeted user. Afterwards, influential-actor detection is implemented with WCIG, and the result is visualized as a graph of the relations between nodes. The test was carried out twice, with parameters ∂ = 2 and ∂ = 4. Different parameter values are used to observe how the WCIG value of each user changes as the parameter increases on the same dataset. The larger the parameter, the greater the coverage of neighboring nodes included in the calculation. The dataset has a network diameter of 16, which means that the two most distant connected nodes are separated by 16 levels. The results of the implementation are further analyzed to determine the most influential users. The experiments were run on a High-Performance Computer (HPC). The detail of the experiment is shown in Fig. 3.
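The graph-generation step can be illustrated with a short Python sketch. It assumes that each tweet is available as an (author, text) pair, that retweets start with the conventional "RT @user" prefix, and that ties are stored undirected; the function name `build_graph` and the sample tweets are hypothetical. The edges have to be extracted from the raw text, before the filtering step strips "RT" and symbols.

```python
import re
from collections import defaultdict

MENTION = re.compile(r"@(\w+)")
RETWEET = re.compile(r"^RT @(\w+)")


def build_graph(tweets):
    """Turn (author, text) pairs into an undirected adjacency mapping."""
    graph = defaultdict(set)
    for author, text in tweets:
        retweet = RETWEET.match(text)
        if retweet:
            # A tweet with both "RT" and "@" links only to the retweeted user.
            targets = [retweet.group(1)]
        else:
            targets = MENTION.findall(text)
        for target in targets:
            if target != author:  # ignore self-mentions
                graph[author].add(target)
                graph[target].add(author)
    return dict(graph)


# Hypothetical sample data.
sample = [
    ("u1", "RT @u2: hello @u3"),
    ("u3", "hi @u1 and @u2"),
    ("u4", "no links here"),
]

print(build_graph(sample))
```

Users who neither retweet nor mention anyone, such as `u4` above, never enter the graph, so they are effectively isolated from the start.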

Results
The results are divided into two sections. The first implements the WCIG and CI methods using parameters ∂ = 2 and ∂ = 4. The second applies Kendall's Tau test to determine the agreement between the results of both methods.

Experiment Result of WCIG and CI
This section presents experimental results using the WCIG method and compares them with CI. The WCIG calculation is based on the number of ties that connect one user to another (retweet interactions), the total weight (interactions with followers) and the number of connected users. The process captures the expansion of information dissemination from neighboring nodes. It determines the impact of an influential user based not only on that user's connections with others but also on all those connected to them. This is in contrast to CI, which calculates centrality based only on connected nodes and the relationship with neighboring nodes, without weights. This research is implemented in Python with four experiments, namely WCIG with parameter ∂ = 2 and 4, and CI with parameter ∂ = 2 and 4. Two different ∂ values are used to determine the effect on the nodes' coverage. The larger the parameter used, the better the results, because more nodes are counted. Each experiment yields the 10 users who most influence the spread of hate speech on Twitter. The comparison of the WCIG results with the CI results is shown in Table 3, which lists the 10 users produced by each experiment for the two methods with different parameters. Several users appear in every experiment, namely bernacleboy, zack_rockstar, dr_koko28, hafizzismailz anwarrrahmad, Republikaonline, fahmirusliMFT, and AzzamIzzulhaq. Meanwhile, kakti_64 appeared in the WCIG experiments with ∂ = 2 and 4 and in the CI experiment with ∂ = 2. Rudyroutepecker appeared in the WCIG and CI experiments with ∂ = 4. Norazambudin and Fundulus appeared only in the WCIG results with ∂ = 2 and 4. An example of information dissemination is the user bernacleboy, with 2,271 retweet interactions and 1,732 followers.
The spread of information started when bernacleboy posted a tweet, which could be seen by all 1,732 followers. Other users then retweeted the tweet, giving 2,271 interactions. This led to information dissemination at level 1 of the interaction session. Everyone following the 2,271 users who retweeted the tweet was then able to see it, which spread the information further. This interaction is referred to as level 2 information dissemination. As long as other users retweet the information, more people read it and indirectly spread it. These interactions make bernacleboy the most influential user and the first to disseminate the information. Notably, bernacleboy's followers also shared the information, both internally and externally, directly or indirectly. The graph is visualized in Fig. 4, where the yellow node represents bernacleboy and the black nodes depict other connected users. As illustrated in Fig. 4, in the first iteration bernacleboy is the most influential user, so this node is deleted along with its connections to the 2,271 connected users. The deletion changes the degree of the connected nodes. Next, the WCIG value is recalculated based on the updated value of each node. In this iteration, zack_rockstar has the highest WCIG value, so it is deleted in turn, and the iteration is repeated until only isolated nodes remain.
International Journal of Advances in Intelligent Informatics, ISSN 2442-6571, Vol. 8, No. 1, March 2022, pp. 1-11
The tests showed that the greater the parameter value ∂, the more time is needed to process the WCIG method. With ∂ = 2, the processing time is 120,557.94 seconds, at an average of 35.11 seconds per iteration. Increasing ∂ to 4 raises the processing time to 148,883.42 seconds. The same applies to the number of iterations needed to process the WCIG method, which rises from 3,433 for ∂ = 2 to 3,485 for ∂ = 4. The processing time for the CI method is similar to that of WCIG. CI with ∂ = 2 takes 115,907.43 seconds, at an average iteration time of 34.63 seconds over 3,347 iterations. CI with ∂ = 4 takes 140,390.31 seconds, at an average iteration time of 41.51 seconds over 3,382 iterations. The time complexity of the WCIG algorithm is $O(N^3)$, which differs considerably from that of previous methods such as CI, which is $O(N \log N)$ [26]. A summary of the tests is shown in Table 4.
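The per-iteration averages can be checked by dividing each total processing time by its iteration count. The short Python sketch below does this with the figures quoted above; the dictionary layout and function name are illustrative only. The division shows that the 35.11-second average belongs to the WCIG run with ∂ = 2, and that the WCIG run with ∂ = 4, whose average is not stated, works out to about 42.72 seconds per iteration.

```python
# (method, parameter) -> (total seconds, iterations), as reported in the text
runs = {
    ("WCIG", 2): (120_557.94, 3_433),
    ("WCIG", 4): (148_883.42, 3_485),
    ("CI", 2): (115_907.43, 3_347),
    ("CI", 4): (140_390.31, 3_382),
}


def seconds_per_iteration(total_seconds, iterations):
    """Mean wall-clock time of a single removal iteration."""
    return total_seconds / iterations


averages = {key: seconds_per_iteration(*value) for key, value in runs.items()}

for (method, parameter), average in averages.items():
    print(f"{method} with parameter {parameter}: {average:.2f} s per iteration")
```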

Kendall's Tau Correlation Coefficient
Kendall's Tau coefficient is a non-parametric statistic used to measure the degree of correspondence between two rankings [3]. Here it is used to assess the agreement or relationship between the WCIG and CI methods. Let variable X be the results of the WCIG method and Y the results of the CI method, with $X = \{X_1, X_2, X_3, \dots, X_n\}$ and $Y = \{Y_1, Y_2, Y_3, \dots, Y_n\}$. The formula used to calculate the Kendall Tau correlation is as follows [35].
Where $n_c$ and $n_d$ are the numbers of concordant and discordant pairs, respectively, and $n$ is the total number of pairs. The hypotheses are as follows: (1) H0: There is no match between the results of the WCIG and CI methods; (2) H1: There is a match between the results of the WCIG and CI methods. The test is carried out with SPSS 26 on the WCIG and CI outputs using the parameters ∂ = 2 and 4. The Kendall's Tau correlation coefficient test results for the WCIG and CI methods using the parameters ∂ = 2 and 4 are shown in Table 5 and Table 6. The first test, shown in Table 5, obtained a correlation coefficient of 0.491 between the WCIG and CI variables, from which it can be concluded that the relationship between the two variables is strong. In addition, the significance value, sig. (2-tailed), between both variables is 0.036 < 0.05, so H0 is rejected and H1 is accepted. This result means that there is a match between both variables. In the second test, shown in Table 6, the correlation coefficient between the WCIG and CI variables is -0.822, which indicates a strong negative relationship between both variables. The significance value, sig. (2-tailed), between the WCIG and CI variables is 0.001 < 0.01, so again H0 is rejected and H1 is accepted.
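The concordant and discordant pair counting behind Kendall's Tau can be sketched in plain Python. This is an illustration under stated assumptions, not a reproduction of the SPSS procedure: it computes the tau-a variant, which divides by the number of item pairs n(n - 1)/2 and ignores ties, whereas SPSS reports tau-b, which adjusts for ties. The two agree when neither ranking contains ties. The function name and sample rankings are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of item pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1  # both rankings order the pair the same way
        elif product < 0:
            discordant += 1  # the rankings disagree on this pair
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical rankings of the same four users under two methods.
ranks_a = [1, 2, 3, 4]
ranks_b = [1, 3, 2, 4]

print(kendall_tau_a(ranks_a, ranks_b))
```

A value near 1 means the two methods rank users in almost the same order, while a value near -1 means one ranking is close to the reverse of the other.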

Conclusion
This research presents the implementation of the WCIG method, which, unlike the CI method, can be used on weighted and directed datasets. The weight represents the number of interactions between one user and another. In addition, the neighbor value is one of the WCIG parameters that contributes significantly to detecting influential actors, especially in hate speech. The results showed a correlation between WCIG and CI according to Kendall's Tau coefficient. Furthermore, the time complexities of WCIG and CI differ: $O(N^3)$ and $O(N \log N)$, respectively. CI is therefore slightly faster to process than WCIG, although in the experiment the difference was not large. There are several limitations associated with the WCIG method. One of them relates to the weight used, which is obtained from the number of followers. Therefore, future research needs to apply the WCIG method with other Twitter interactions such as tweets, retweets, follows, and mentions. Other approaches, such as fuzzy methods, can be used to determine the interaction between users. Furthermore, it is hoped that the WCIG method will continue to be used to detect influential actors and the associated communities in a dataset.

Declarations
Author contribution. All authors contributed equally as main contributors to this paper. All authors read and approved the final paper.
Funding statement. None of the authors have received any funding or grants from any institution or funding body for the research.
Conflict of interest. The authors declare no conflict of interest.
Additional information. No additional information is available for this paper.