Sentiment classification from reviews for tourism analytics

User-generated content is critical for tourism destination management, as it can help organizations identify their customers' opinions and come up with solutions to improve their tourism services. There are many reviews on social media, and it is difficult for these organizations to analyze them manually. By applying sentiment classification, reviews can be classified into several classes, which eases decision-making. The reviews contain noisy content, such as typos and emoticons, which could affect the accuracy of the classifiers. This study evaluates the reviews using Support Vector Machine and Random Forest models to identify a suitable classifier. The main phases in this study are data collection, preparation, labeling, and modeling. The reviews are labeled with three sentiments: positive, neutral, and negative. During pre-processing, steps such as removing missing values, tokenization, case folding, stop word removal, stemming, and applying n-grams are performed. The results are evaluated by comparing the performance of the models based on accuracy, and the model with the highest accuracy is chosen as the solution. In this study, the data is collected from TripAdvisor and Google Reviews using web scraping tools. The findings show that the Support Vector Machine model with 5-fold cross-validation is the most suitable classifier, with an accuracy of 67.97%, compared to Naive Bayes with 61.33% accuracy and the Random Forest classifier with 63.55% accuracy. In conclusion, the results of this paper could provide important information for tourism, besides determining a suitable algorithm for sentiment analysis in the tourism domain.


Introduction
Tourism is one of the fastest-growing sectors globally; it is a significant source of income for many countries. Through the years, tourism has evolved from conventional practices to the implementation of technology. One key factor that has evolved along with the implementation of technology is feedback writing. Nowadays, tourists can write feedback on social media platforms about their experiences visiting any tourist destination. These reviews are accessible for everyone to read and are considered user-generated content (UGC) [1]. Public opinions, regarded as the most authentic content, are the best source of feedback that businesses can use to upgrade their services [2]. The influence of UGC on the tourism industry is growing [3]. The tourism industry relies on feedback from its visitors [4]. Tourism organizations cannot afford to ignore UGC because it could help Tourism Destination Management (TDM) better understand their visitors' preferences [5] [6].
Analysing these reviews could reveal underlying information in UGC that could help the TDM in their decision-making process. Instead of analysing the reviews manually, the feedback can be analysed using sentiment analysis with machine learning classifiers. Sentiment analysis detects the sentiments behind the reviews and assigns them to their respective categories. Sentiment analysis can be implemented in many ways, one of which is extracting the polarity of reviews using several machine learning classifiers [2] [7] [8].
Sentiment analysis with machine learning methods has been used in many different domains. For example, informal Malay textual data was classified using Decision Tree (J48), Support Vector Machine and Naive Bayes to evaluate consumer feedback and spot patterns in social commerce [9]. In another study, Twitter data related to two top international apparel brands was compared and analyzed using Naive Bayes and a lexicon dictionary to obtain public opinions of the two brands [2]. In [10], airline passenger reviews were classified into five categories (plane condition, flight comfort, staff service, food and entertainment, and price) using the Bayes and Support Vector Machine methods to understand passengers' satisfaction levels in these categories. Moreover, people's perceptions regarding vaccines in Indonesia on Twitter were captured in the first two weeks to predict public sentiment using Support Vector Machine and Random Forest [11]. The majority of previous research on sentiment analysis uses Naive Bayes, Support Vector Machine and Random Forest as the machine learning classifiers.
Tourism destination organizations can analyse the reviews left by their visitors on social media to come up with decision-making strategies to improve their tourism destinations. However, the feedback is hardly used because the reviews contain noise, which affects classification and data analysis tasks [12] [13]. Noisy content on social media, such as reviews that contain abbreviations, informal language, and emoticons, needs to be pre-processed. This paper identifies and applies appropriate machine learning algorithms for sentiment analysis on social media for tourism analytics, compares the results, and designs a sentiment analysis dashboard for visualizing the reviews. The best model, with the selected k-fold validation and n-gram, is chosen based on its accuracy. The main contributions of this paper are as follows:
• This paper proposes a sentiment model for tourism analytics from reviews.
• A dictionary-based approach and several different parameters of machine learning models are compared to achieve the best model.
• The findings are presented using Microsoft Power BI to understand the results further.
Section 2 of this paper presents the related studies, while Section 3 explains the methods used in this study. Section 4 presents the results and findings, and Section 5 concludes the paper.

Related Works
A tourism destination is any geographic location that attracts and caters to tourists as a source of revenue and is located within an administrative region where tourist attractions, public facilities, accessibility, and communities are interconnected [4] [14]. It is a complex and integrated portfolio of services offered by a destination that supplies a holiday experience meeting the needs of tourists [15]. In Malaysia, Taman Negara has become a key ecotourism destination with its diverse flora and fauna [1] [16] [17]. The national parks are located across several states, the largest being Taman Negara Pahang, which spans three provinces and covers a total area of 4,343 square kilometers [1] [16]. Taman Negara Pahang is 130 million years old and is one of the oldest tropical rainforests in the world. It is on the national agenda to accomplish the Sustainable Development Goals.

Sentiment Analysis
The computational analysis of text to determine people's opinions, appraisals, attitudes, emotions, and sentiment polarity (positive, negative, or neutral) towards entities, situations, events, and topics is known as sentiment analysis or opinion mining [7] [18]. It is an excellent tool for businesses to analyse opinions expressed by users on social media without explicitly asking any questions, as such opinions often reflect the users' genuine thoughts [18] [19]. The increasing number of daily active users on social media makes it a potential data source for sentiment analysis, as people's behaviors and opinions can be observed in textual form [11]. Sentiment analysis is widely used in many domains, such as education, healthcare, politics, and e-commerce [20] [21] [22] [23]. It is also applied to one of the popular domains, tourist reviews, to understand tourists' experiences, opinions, and emotions towards a tourism destination [24] [25]. It is better than the traditional way of sending questionnaires to get visitors' feedback [25] [26].
To get data for the sentiment analysis, the reviews are extracted by web scraping and transformed into a machine-readable format for use during the classification process [27] [28] [29]. Pre-processing ensures that only the critical parts of the text are kept, which removes noisy data, reduces computational time, and supports the accuracy of the classifiers [30] [31]. Tokenization breaks text documents into meaningful elements, called tokens, which are used in the next stage [29]. N-gram tokenization, such as unigram, bigram, and trigram, is applied to tokenize the reviews into terms [18] [35] [36]. Stop words (words that appear often in a text but carry little meaning) are discarded, and case folding converts all upper-case letters in the text into lower case to make the letters uniform [29] [35] [37]. Word normalization consists of stemming and lemmatization, which reduce a word to its root form by removing its prefixes and suffixes [23] [25] [29] [34] [38].
Meanwhile, term weighting is used to find the recurring terms in a corpus and determine their significance [29] [32] [39]. A term is a word or phrase in a document, with an associated weight, that is used to understand the context of the whole corpus. A high weight is given to the term with the most occurrences in a document [29]. The terms, obtained from the result of n-gram tokenization, also represent the class of a corpus [35]. The Term Frequency-Inverse Document Frequency (TF-IDF) process weights each term by its frequency in a document multiplied by a factor that reflects how common or rare the term is across the total number of documents [32] [35].
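The term weighting described above can be illustrated with a minimal sketch. It assumes one common TF-IDF variant (relative term frequency multiplied by the unsmoothed logarithm of N divided by the document frequency); the exact formula used by the tool in this study is not stated in the text, and the function name and toy reviews are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical, already tokenized toy reviews.
docs = [
    ["great", "jungle", "walk"],
    ["great", "canopy", "walk"],
    ["great", "boat", "ride"],
]
weights = tf_idf(docs)
```

In this toy corpus, "great" occurs in every document and therefore receives a weight of zero, while a term that appears in only one document, such as "jungle", receives the highest weight in its document.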

Classification Method
Classification is a supervised learning task that applies training and testing processes to learn from experience. Sentiment analysis can be framed as a supervised learning problem in which classification methods assign each text to one of three categories: "positive", "negative", or "neutral". Several classifiers are used for this purpose, such as Naive Bayes (NB), Support Vector Machine (SVM), and Random Forest (RF). The NB classifier is a simple probabilistic machine learning classifier that solves classification problems by assuming that the attributes are independent of each other. It can also classify the emotion and polarity classes of a review in text classification [40]. The SVM classifier can solve binary classification problems [40] [41], with the text represented as feature vectors using the Bag-of-Words model [10]. RF can be used for regression and classification tasks and is known as an ensemble technique because it combines multiple models to make predictions.
NB is one of the predictive classifiers used in this study. It is the easiest classifier to implement in model development, as it has low computational time and can accept many types of data. As it is generally a simple classifier, only the Laplace correction parameter is applied. NB uses probability for its algorithm. If an attribute value never appears together with a class label in the training set, NB sets the corresponding conditional probability to 0. Because this probability is multiplied with the other conditional probabilities, the whole product becomes 0, which leads to a misleading result; the Laplace correction fixes this problem by adding one to each count so that the probability is small but non-zero. In [8], Naive Bayes outperformed the other algorithms by achieving the highest accuracy.
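As an illustration of the zero-probability problem, the following minimal sketch estimates P(term | class) with additive (Laplace) smoothing. It is not the implementation used in this study; the function name, vocabulary, and token list are hypothetical.

```python
from collections import Counter

def conditional_probabilities(class_tokens, vocabulary, alpha=1):
    """Estimate P(term | class) with additive (Laplace) smoothing.

    alpha=0 gives the unsmoothed estimate; alpha=1 adds one to every count.
    """
    counts = Counter(class_tokens)
    total = len(class_tokens) + alpha * len(vocabulary)
    return {term: (counts[term] + alpha) / total for term in vocabulary}

# Hypothetical toy data: "dirty" never occurs in the positive class.
vocabulary = ["good", "bad", "dirty"]
positive_tokens = ["good", "good", "good", "bad"]

unsmoothed = conditional_probabilities(positive_tokens, vocabulary, alpha=0)
smoothed = conditional_probabilities(positive_tokens, vocabulary, alpha=1)
```

Without smoothing, the unseen term "dirty" has probability 0 and would cancel the whole product for the positive class; with add-one smoothing it receives 1/7, and the probabilities still sum to one.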
The SVM operator used in this study is the Support Vector Machine (LibSVM) operator because, unlike the normal SVM operator, it supports polynominal attributes (RapidMiner's term for nominal attributes with more than two values), such as the sentiment label. A few parameters of the SVM classifier are tuned to increase its accuracy. One of them is the type of SVM; for this study, C-SVC is used, as it supports classification tasks. The linear kernel is used, as it is reported to be a preferred kernel for text classification problems. The C parameter is a penalty parameter for misclassifications in the model. This parameter is set to 0, 1, and 2 throughout the project. The problem with this classifier is that it needs balanced data to predict the classes accurately. SVM is suitable for the tourism domain, as it achieved a high accuracy of 80.11% on a tourism dataset [24].
The last classifier is RF, an ensemble of random trees whose size is specified by the number of trees. The problem with this classifier is that its computational cost is high, and the process stops automatically when the computer has insufficient memory. The parameter tuning for RF is done by changing the number of trees to 10, 50, and 100. This parameter decides how many random trees are generated for the RF. RF is implemented because a single tree is limited and does not perform as well as an RF. RF creates many classification trees that are aggregated to obtain optimum accuracy. The advantages of RF are that it has higher accuracy than a single tree, is a powerful algorithm, and can handle a large dataset with high dimensionality. When RF was combined with a feature extraction process, it outperformed other algorithms based on accuracy, precision, recall and F1-score [27]. Although it has many good points, it also has disadvantages: an RF is expensive to train and hard to interpret, and the user does not have complete control over the model. Table 1 compares the three main studies that use a sentiment analysis approach. Generally, these papers applied Naive Bayes and SVM, with KNN, J48 and Random Forest as additional classifiers, and each paper reported a different best result. Therefore, in this study, the SVM and Random Forest algorithms were applied for sentiment classification.

Method
This study includes data collection, pre-processing, data labeling, text vectorization, and modeling, as shown in Fig. 1. The methods applied in each phase are explained in this section.

Data Collection
The raw data is automatically collected from two UGC sources, TripAdvisor and Google Reviews, using the Instant Data Scraper and Simplescraper web scraping extensions. The keywords used to search on TripAdvisor are "Taman Negara", "National Park", "Endau Rompin", and "Gunung Ledang". The keywords used to search on Google Reviews are "Taman Negara", "Taman Negara Pulau Pinang", "Endau Rompin", "Taman Negara Canopy Walkway", "Mutiara Taman Negara", and "Gunung Ledang". The total number of instances collected from TripAdvisor is 2,154. After analyzing the values inside the columns, the attribute columns obtained are:
• The user's URL
Since there are duplicates during the search process, the data collection is implemented on all sites, and results are classified as the same location even if they come from different URLs. The number of instances for all locations is 5,618. A total of 7,772 raw instances were collected from both websites, as shown in Table 2.

Data Pre-processing
The dataset for this project was pre-processed before modeling to ensure optimum model performance. In this research, data pre-processing is done manually using Microsoft Excel and with operators in RapidMiner. Noisy and missing values are removed, as they are irrelevant to the analysis. Several pre-processing operators from a RapidMiner extension are also used, namely Tokenize, Transform Cases, Filter Stopwords (English), Filter Tokens (by Length), Filter Stopwords (Dictionary), and Stem (Porter), as shown in Fig. 2. The study also tests Generate n-Grams (Terms) to find the best combination of words and finds that 2-gram works best. For the Filter Tokens (by Length) operator, the parameters are set to a minimum of two characters and a maximum of 999 characters. This means the operator removes tokens that are shorter than two characters or longer than 999 characters, as they carry no meaning.
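The operator chain above can be approximated outside RapidMiner with a short sketch. It covers tokenization, case folding, stop word filtering, and the token length filter only; stemming is omitted. The tiny stop word list, the letters-only tokenization rule, and the function name are assumptions made for illustration and are not taken from the study.

```python
import re

# Tiny illustrative stop word list (an assumption, not RapidMiner's list).
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "to", "of", "in", "it"}

def preprocess(review, min_chars=2, max_chars=999):
    """Tokenize, case-fold, drop stop words, and filter tokens by length."""
    # Tokenize on non-letters (assumed rule): keep runs of letters only,
    # which also drops digits, punctuation, and emoticons.
    tokens = re.findall(r"[A-Za-z]+", review)
    # Case folding: convert every token to lower case.
    tokens = [t.lower() for t in tokens]
    # Stop word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Length filter: keep tokens with min_chars..max_chars characters.
    return [t for t in tokens if min_chars <= len(t) <= max_chars]

# Hypothetical review text.
tokens = preprocess("The canopy walk was AMAZING, I loved it :)")
```

Here the emoticon and punctuation are discarded during tokenization, the stop words are removed, and the single-character token "i" is dropped by the length filter.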
The text vectorization methods, Term Frequency (TF) and TF-IDF, are compared to determine which gives better accuracy. For each classifier model, the vectorization method with the highest accuracy is selected. The method that is selected for the larger number of classifiers is then chosen for the final model. The accuracy displayed in the bar chart is the average accuracy obtained from 10 documents. The classifiers used in the models are NB, SVM, and RF, each evaluated with 5-fold and 10-fold cross-validation.
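The fold construction behind k-fold cross-validation can be sketched as follows. This is a plain, unshuffled split written for illustration; whether the tool used in this study shuffles or stratifies the folds is not stated in the text, and the function name is hypothetical. The sample count of 566 corresponds to one document of 480 positive, 43 neutral, and 43 negative reviews.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# One document with 480 + 43 + 43 = 566 reviews, split into 5 folds.
folds = list(k_fold_indices(n_samples=566, k=5))
```

Each review is used for testing exactly once; a model would be trained on the training indices of each fold, and the per-fold accuracies would then be averaged, first within a document and then across the ten documents.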

Data Labelling and Text Vectorization for Data Representation
The review column is selected during the sentiment extraction process to generate the sentiment score used for sentiment classification. RapidMiner automatically calculates the score column for the sentiment polarity of a text, and the scoring string shows the value assigned to every word that is available in the VADER sentiment dictionary. Any additional words are added to the Extracting Sentiment operator. The sentiment column, which is the class label, is obtained after the score column calculation using the rule set by the VADER documentation. If the score is higher than 0.05, the sentiment is considered positive. If the score is lower than -0.05, the sentiment is considered negative. The sentiment is considered neutral if the score is more than -0.05 and less than 0.05. Since the library's VADER score is already established, this study adopted the existing method. Further study is planned to look into the labelling and to extend the sentiment scoring method. The labeling of the records is shown in Fig. 3.
Fig. 3. Labeling the records using Sentiment score 'VADER' in RapidMiner
The processed dataset has an unbalanced class label: 4,807 reviews are labeled as positive, 430 as neutral, and 560 as negative. The challenge caused by this condition is that any misclassified label may lead to inaccuracies. The minority labels are treated as noise and disregarded by classifiers such as SVM and RF. Thus, the dataset is separated into ten new documents to overcome this problem. The Sample operator is used on these documents to balance the data. Documents 1 to 9 each have 480 positive reviews, 43 neutrals, and 43 negatives. Document 10 has 487 positive reviews, 43 neutrals, and 56 negatives.
The Process Documents from Data (PDfP) operator performs text vectorization. This research compares two text vectorization schemas, Term Frequency and TF-IDF, to determine the more suitable method for better accuracy. This study applies Term Frequency for the vectorization. However, the experiments done to reach this choice are not discussed in detail due to page limitations.
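The score-to-label rule described in this section can be written as a small function. The text does not say how scores exactly equal to 0.05 or -0.05 are handled; the sketch assumes the convention in the VADER documentation, which counts them as positive and negative respectively. The function name and the example scores are hypothetical.

```python
def label_sentiment(score, threshold=0.05):
    """Map a sentiment score to a class label using symmetric thresholds.

    Boundary handling (>= and <=) is an assumption based on the VADER
    documentation; the study's text only states strict inequalities.
    """
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical scores for four reviews.
scores = [0.62, -0.31, 0.0, 0.04]
labels = [label_sentiment(s) for s in scores]
```

With these example scores, the first review is labeled positive, the second negative, and the last two neutral.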

Modeling
Two supervised machine learning algorithms, Support Vector Machine (SVM) and Random Forest (RF), are applied. The main intention of this analysis is to determine the accuracy of each sentiment analysis technique in interpreting the sentences from the customer reviews on Taman Negara Pahang. The model's accuracy is derived from a confusion matrix that RapidMiner generates automatically. Other details, such as the true positives and false positives, can also be derived from it. Each classifier gives a different accuracy; thus, the models are compared. A higher accuracy indicates that the algorithm used for the model is suitable for the dataset and will generate a trustworthy result. The comparison and analysis of the performance of these techniques in classifying this informal Malay textual data are modeled through 70% training and cross-validation. The k-fold cross-validation is used in this work: the dataset is split into k groups, which are used in turn for testing and training. Since the dataset used in this project is split into ten different documents, the average accuracy over all ten is taken as a single accuracy. During the evaluation phase, the results and research findings are discussed according to the best performance of the supervised techniques based on accuracy, and a further detailed comparison is done based on the True Positives (TP), False Positives (FP), True Negatives (TN) and False Negatives (FN) of the best classifier.
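To make the evaluation measures concrete, the following sketch builds a three-class confusion matrix and derives the accuracy and the one-vs-rest TP, FP, FN, and TN counts from it. The label lists are hypothetical toy data, not results from this study, and the function names are illustrative.

```python
LABELS = ["positive", "neutral", "negative"]

def confusion_matrix(actual, predicted, labels=LABELS):
    """matrix[i][j] = number of reviews with true label i predicted as j."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Share of reviews on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def one_vs_rest_counts(matrix, i):
    """TP, FP, FN, TN for class i, treating all other classes as 'rest'."""
    tp = matrix[i][i]
    fn = sum(matrix[i]) - tp                  # class i predicted as another
    fp = sum(row[i] for row in matrix) - tp   # another class predicted as i
    tn = sum(sum(row) for row in matrix) - tp - fn - fp
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Hypothetical true and predicted labels for six reviews.
actual = ["positive", "positive", "positive", "neutral", "negative", "negative"]
predicted = ["positive", "positive", "neutral", "neutral", "positive", "negative"]
matrix = confusion_matrix(actual, predicted)
positive_counts = one_vs_rest_counts(matrix, LABELS.index("positive"))
```

In this toy example, four of the six reviews lie on the diagonal, so the accuracy is 4/6, and the positive class has two true positives, one false positive, one false negative, and two true negatives.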

Results and Discussion
This section elaborates on the findings obtained with the SVM and RF classifiers for classifying sentiment as positive, neutral, or negative.

Modelling Experiments with SVM and RF Classifiers
The SVM classifier relies heavily on hyperparameters such as the cost parameter, C. The C parameter affects the misclassification penalty of the model and is tuned in this experiment. The purpose of tuning is to select the best value of C for the final model. The C values tested in this experiment are 0, 1, and 2. Ten documents are tested in six types of models, and the model accuracy is averaged. The highest accuracy is achieved by the SVM model that uses 1 as the C value; therefore, the value 1 is chosen and applied to the final SVM model. The confusion matrix for the best SVM result is shown in Table 3. Fig. 4 and Fig. 5 show that two out of the three classifiers, SVM and RF, achieve higher accuracy when the TF method is used; therefore, Term Frequency is chosen for use in the final model.
RF number of trees comparison (without n-gram)
The number of trees parameter sets the number of random trees generated for the RF model. The purpose of tuning this parameter in the RF classifier is to select the optimum number of trees for the final model. The numbers of trees tested in this experiment are 10, 50, and 100, as shown in Fig. 4. The highest accuracy is recorded when the model uses 100 trees; thus, 100 is chosen as the number of trees for the final model.

n-Gram Comparison
A sequence of words is created during the pre-processing phase using the n-gram operator. The types of n-gram are compared to determine which one generates better accuracy when applied during the pre-processing phase. The n-grams compared are 2-grams (bigrams) and 3-grams (trigrams). They are compared by changing the maximum length parameter of the Generate n-Grams (Terms) operator in RapidMiner. Fig. 5 illustrates that four out of six classifier models record higher accuracy when 2-gram is applied. The models use SVM and RF as the classifiers, and their accuracy ranges between 64% and 68%. This indicates that 2-gram should be applied to the model for better accuracy.
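The effect of the maximum length parameter can be sketched as follows. The sketch assumes that term n-grams are added alongside the original tokens and joined with an underscore, which is how the operator's output is commonly described; the function name and tokens are hypothetical.

```python
def generate_ngrams(tokens, max_length=2):
    """Return the tokens plus all term n-grams up to max_length.

    Assumption: n-grams are appended to the unigrams and joined by '_'.
    """
    terms = list(tokens)
    for n in range(2, max_length + 1):
        for i in range(len(tokens) - n + 1):
            terms.append("_".join(tokens[i:i + n]))
    return terms

# Hypothetical tokens from one pre-processed review.
tokens = ["canopy", "walk", "amazing"]
bigrams = generate_ngrams(tokens, max_length=2)
trigrams = generate_ngrams(tokens, max_length=3)
```

With a maximum length of 2, the review yields the three original tokens plus "canopy_walk" and "walk_amazing"; a maximum length of 3 additionally yields "canopy_walk_amazing", which enlarges the feature space further.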

Comparison of Results
The performance of the final model without n-gram in RF and SVM is compared based on 5-fold and 10-fold cross-validation. The model with 10-fold cross-validation appears to be the most accurate for classifying the sentiment of reviews for Taman Negara, with 69.06% for the SVM models and 65.15% for the RF models. The n-gram operator is then added to the pre-processing phase, and the model with the highest accuracy is chosen as the best model to be used with the Taman Negara dataset, with the TF representation method. The 2-gram and 5-fold cross-validation model generally has higher accuracy than the others. The RF model records an accuracy of 63.55%, while the SVM model has an accuracy of 67.97%. In previous research, TF and n-gram were included in the pre-processing stage, and SVM was the most efficient algorithm for classification modeling problems [11] [20] [21] [42]. These comparisons show that the SVM algorithm is more suitable for the Taman Negara dataset than RF. However, a limitation of this study is the processing machine; therefore, the data needed to be separated into 10 documents, and the results were compared.
Fig. 7 illustrates the dashboard that visualizes the analytics features of the reviews written by Taman Negara visitors, such as the summary of the feedback, the word cloud, and the sentiments. It is a one-page dashboard with six charts, three slicers, and one card, developed using Microsoft Power BI and deployed on the web. There are 4,807 positive reviews, 430 neutral reviews, and 560 negative reviews. These counts show that most tourists gave good reviews of Taman Negara from early 2020 until April 2020.

Conclusion
This study aims to evaluate the polarity of tourism destination reviews by employing Support Vector Machine (SVM) and Random Forest (RF) classifiers and to find a suitable classifier for the tourism dataset. The dataset used for this study consists of reviews of Taman Negara. The performances of these classifiers are compared to determine the most suitable classifier for sentiment analysis of tourism reviews on social media. The SVM model with 5-fold cross-validation is chosen as the most suitable classifier for the tourism dataset. It outperforms the other classifiers with 67.97% accuracy, while RF with 5-fold cross-validation has 63.55% accuracy. A sentiment analysis dashboard is produced to visualize the reviews from Taman Negara visitors. Among the limitations faced in this research are the inability to run several models due to inadequate computational power and the small number of records. Enhancements to be considered before continuing this project include hiring an expert human translator, running models on a machine with high computational power, and collecting a larger dataset to avoid the issue of data imbalance. Improving the model by adding new classifiers is also recommended for future work.