Dynamic convolutional neural network for eliminating item sparse data on recommender system

E-commerce has changed the methods applied by many companies in business transactions. It has become not only an alternative but an imperative. These business organizations are faced with the challenges of finding the best way to conduct and develop their businesses in the digital economy. As a result of this, most of them have moved their total operations to web-based applications while some are settling subsection, then spinning them off as independent online business entities [1]. The need to provide buyers with relevant information about products has brought about the development of recommender system which is used in delivering this information to them through websites portal or mobile phones [2]. This system is an essential part of the e-commerce industry used in promoting sales and services on various online websites and mobile applications. For instance, Movies on Netflix watched 80 percent of movies watched on Netflix came from recommendations [3], and 60 percent of video clicks came from YouTube recommendation system [4]. In addition, according to Schafer [5], sales agents recommended by NetPerceptions system achieved higher average cross-sell value and higher success rate (60 percent and 50 percent respectively) than the ones based on experimental traditional techniques.


Introduction
E-commerce has changed the methods applied by many companies in business transactions.It has become not only an alternative but an imperative.These business organizations are faced with the challenges of finding the best way to conduct and develop their businesses in the digital economy.As a result of this, most of them have moved their total operations to web-based applications while some are settling subsection, then spinning them off as independent online business entities [1].The need to provide buyers with relevant information about products has brought about the development of recommender system which is used in delivering this information to them through websites portal or mobile phones [2].This system is an essential part of the e-commerce industry used in promoting sales and services on various online websites and mobile applications.For instance, Movies on Netflix watched 80 percent of movies watched on Netflix came from recommendations [3], and 60 percent of video clicks came from YouTube recommendation system [4].In addition, according to Schafer [5], sales agents recommended by NetPerceptions system achieved higher average cross-sell value and higher success rate (60 percent and 50 percent respectively) than the ones based on experimental traditional techniques.
Recommender system can be divided into four types based on technical approach [6], [7].The first, content-based, is the mechanism that involves product classification.It uses information retrieval to generate product recommendation.The second is the knowledge-based which is developed for specific recommendation.The particular character in this approach provides information rarely needed by individuals, for example, house, loan, insurance, car etc.The third is the demographic-based which provides recommendation based on demographic information.The fourth is the collaborative filtering which is founded on past behavior of the user.This method is the most successfully applied approach in A R T I C L E I N F O A B S T R A C T larger e-commerce company because of its ability to provide recommendation characters that are product fit, provide relevant information, and accurate [8].It adopts rating through the use of rating matrix to assess the similarities in the feedback of users in order to generate recommendation for products.However, the method is limited by the population of people, which is about 1%, that gives ratings on products.This problem is popularly called sparse data and when it is in extreme condition it is known as cold start.When this happens, the system cannot generate any recommendation.
One of the several efforts towards solving the problem is by extracting information from items.Many authors have come up with different approaches of extracting content feature, for instance, den Oord et.al. [9] developed a music recommendation using convolutional neural network (CNN) to classify music based on acoustic type for the purpose of tackling unpopular music (long tail).Another approach introduced by Jaradat [10] is about extracting deep color feature to recommend fashion using color classifier.
Collaborative filtering is the most popular recommendation engine applied in the e-commerce business because of its ability to provide relevant, accurate information and give surprise product information to users [11].However, it is limited by sparse data caused by the inability of users to give ratings [12].Several attempts have been made to reduce sparse data for it to be more accurate.The use of auxiliary information has proven to be effective in improving this accuracy.Most of them have been successful in increasing robustness in sparse data such as audio features in music [9], [13], color features on online fashion shop [10], and documents in online news [14].According to Chen et.al. [15], valuable information can be extracted from reviews for the purpose of enhancing product recommendations.This can be done through several methods despite the fact that the raw review information is in unstructured textual form that cannot be easily understood by the system.Qu et.al. [16] proposed the use of bag of word method to analyze customer review on movies so as to increase its rating value.Using this method can provide optimal results [17], [18], despite the fact that it has difficulty in capturing contextual meaning of texts.For example, it cannot detect surrounding words like "not good", "not bad", "not very well" and this the reason its prediction of rating results is sometimes inaccurate.Therefore, there is a challenge of the method to use in increasing the level of accuracy when convolutional neural network such as sentiment analysis is applied [19].
Enhancement of text has been applied by many researchers in building recommendations, for example, Wang [20] use classical method of bag of word from users to predict rating value but this method still has its own limitations such as misunderstanding in detecting the meaning of text or sentences.This has led to inaccuracy in the extraction of sentences from product reviews.This can be observed from IMDB movie (www.imdb.com)review portal which include several text resources such as product description, user opinion, user comment, and testimony with the aim of improving accuracy.Feedback provided by customers can be said to be the best form of product rating.According to marketing theory, majority of customers will not buy a new product without experience.They will consider an opinion or testimony from others that have had deep experience about the product.Other previous approaches, such as categorical Bag of Word (BOW), Collaborative Topic Regression [21], and Latent Dirichlet Allocation (LDA) [17], [18] were succeed to generate item latent factor from text sentence document.However, they failed to catch the suitable view of contextual meaning point.
The aim of this study was enhancing DCNN as subclass of Convolutional Neural Network (CNN) to capture contextual meaning of the consumer product review to produce item latent factor which was combining with probabilistic matrix factorization.The evaluation results by Root Mean Squared Error (RMSE) showed that the proposed approach was outperforming the previous works.

Method
In handling the problem of minimum contextual meaning of text sentences convolutional neural network (CNN) is considered.This is a state of the art machine learning platform that can outperform others applications, such as information retrieval [22], [23], computer vision [24], and natural language processing (NLP) [25]- [27].On sentiment analysis, Kim [27] proposed that the approach is effective in sentence classification.Table 1 shows various approaches that have been significantly relevant in different researches in recent years.After LDA model approach [17], [18], collaborative topic regression [21] which was categorized into bag of word and word order was also applied to have deep contextual meaning but this method has been observed to the most shallow of all the sentence models introduced.Kim et.al. [29] proposed a model adopted from [27] to predict rating based on convolutional neural network aimed at capturing all contextual meaning in text sentences document.This model combines both convolutional neural network and probabilistic matrix factorization and it has the ability of outperforming others based on latent item factor.However, its limitation includes high computation perspective and longer time of computation.On the other hand, it has less scalability point of view.
A variant of convolutional neural network has been proposed [26], and it was dubbed Dynamic Convolutional Neural Network (DCNN) for sentences classification.In a language processing, the system ability to represent accurately meaning is important.Therefore, Dynamic K-Max Pooling was implemented in the semantic modeling to handle varying length input sentences, and to generate a graph feature over sentence.It adequate to capture long and short-range relations.The network does not depend on a parse tree and is easily implementing to any language.In solving the problem of sparse data identified, an hybrid of DCNN and probabilistic technique is proposed to improve the result of previous research that adopted convolutional neural network approach to extract content feature based on text sentence document [29].

Text Sentences Description and Probabilistic Matrix Factorization
The proposed Dynamic Convolutional Neural Network and Probabilistic Matrix Factorization (DCNN-MF) (Fig. 1) involves three steps: 1) Exploitation of the probabilistic approach, and implementation of critical factor in bridging Probabilistic Matrix Factorization and DCNN to calculate both rating and item text sentences description; 2) Definition of a structure that produces text sentences latent model by calculating items text sentences description documents; and 3) Optimization of the latent variables of DCNN.

Dynamic Convolutional Neural Network Adopted Sentiment Analysis
The objective of the DCNN architecture is to obtain text sentences documents latent vectors from documents items, then it is utilized to create the latent factors items among epsilon variables.Fig. 2 reveals that the DCNN architecture contains four layers: 1) embedding layer, 2) convolutional layer, 3) pooling layer, 4) output layer.Input Layer.The input layer is the primary information source that comes from the feature content, in this case, the product description.This is the representation of a product through the use of sentences.Based on different researches, it is different from music or fashion products recommendation where the pattern is clear.Products description undergoes many modifications at the preprocessing stage, such as the removal of punctuation marks and elimination of conjunctions.
Embedding Layer.The function of this layer is to transform a raw document into a numeric matrix according to the length of the words, which will be conducted as a convolution operation in the next segment.For instance, if number of words in a document is , then a vector of each word can be concatenated into a matrix in accordance with the sequence.The vectors are initialized with a pre-trained word embedding model as in (1).
where  stands for the dimension of word embedding and  [1:,] represents raw word  in the document.
Convolution Layer.This involves extracting contextual features from a sentence expression.The sentence conveyed by consumers on a movie was considered.Convolution architecture was applied to dispose the documents.The detail formulation is as shown in (2) Pooling Layer.This layer did not only extract instance features from the convolution layer but also dealt with documents variable lengths (via pooling operation) that constructed a fixed-length feature vector.After the convolution layer was created, a document was represented by   contextual feature vectors.Each contextual feature vector has variable length (i.e.,  −  + 1 contextual feature).However, such representation causes two problems: 1) there are too many contextual   while most contextual features might not help to increase performance of the model; 2) varied length of the contextual feature vector caused problem to establish the following layers.Therefore, max-pooling was utilized to reduce the document representation into an   fixed-length vector by generating only the maximum contextual feature from each contextual feature vector as in (4).
where   is a contextual feature vector of length  −  + 1 extracted by th shared weight    .
Existing approach connected to the model sentences was modified by utilizing a convolutional engineering that exchanges wide convolutional layers with dynamic pooling layers given by dynamic kmax pooling.In the system, the width of a component maps a middle layer changes relying upon the length of the info sentence and this gave use the Dynamic Convolutional Neural Network as shown in Fig. 3   Output Layer.Generally, at this stage, high-level features obtained from the previous layer should be converted for a specific task.Thus,  was projected on a k-dimensional space of users and items' latent factors for the recommendation task, which completely produced a document's latent vector by using conventional nonlinear projection.
where W 1 ∈ ℝ ×  , W 2 ∈ ℝ × are projection matrices, and b 1 ∈ ℝ  , b 2 ∈ ℝ  is a bias vector for W 1 , W 2 with  ∈ ℝ  .Eventually, by the mentioned processes, the CNN architecture became a function that takes a raw document as input, and returns a latent vector of each document as output of (6).
where  indicates all the weight and variables of bias to prevent clutter,   denotes a raw document of item , and   denotes a document's latent vector of item .

Optimization Method
For the purpose of optimizing some variables such as item latent models, user latent models, maximum a posteriori (MAP) estimation, and DCNN weight and bias variables was applied as follows:

Probabilistic Matrix Factorization
The main idea of this approach is similar to Kim et.al. [29] but it was modified with the aim of increasing the understanding of contextual meaning and level of scalability.In doing this, integration of probabilistic matrix factorization (PMF) approach with the proposed dynamic convolutional neural network (DCNN) was considered.Assuming there are  users and  items, then learn rating was supposed to be represented by  ∈ ℝ × matrix.Therefore, the objective was to find the user and item latent models ( ∈ ℝ × and  ∈ ℝ × ) whose product (  ) reconfigure the rating  matrix.According to the probabilistic approach, the distribution of the situation through observe rating can be denoted by: (|, ,     This is different from the tradition method such that there was consideration of three variables, 1) internal weight  in the approach; 2)   representing the document of item ; and 3) Epsilon variable as Gaussian noise.Considering the optimization of item latent model in predicting rating, then the equation ( 11) can be used to the generate final item latent model.
where weight   in , zero-mean spherical Gaussian prior was assumed as in (13).
where, the distribution item latent model is denoted by: where  represents a set of text sentences description document of items.A description of product latent vector obtained by the DCNN model was applied to the mean of Gaussian distribution noise of the product as employed as the variance of Gaussian distribution to bridge the gap between CNN and PMF in order to utilize both text sentences description and rating.

Measure Metric
Root Mean Squared Error (RMSE) was frequently used to measure the differences between sample and population values predicted by a model or an estimator and the values that were actually observed [30].It represents the standard deviation sample of the differences between predicted values and observed values.It is the most popular metric used in evaluating accuracy of rating prediction.The RMSE between the predicted and actual rating is given by:

Experiment Setting
This research was focused on developing a method to handle sparse data in items rating observed through the application of collaborative filtering recommender system.According to previous studies, the use of matrix factorization has not been able to effectively address this issue to the extent that the mistakes are most time seen in the results [31].Therefore, this is a categorical lab scale research involving public datasets.MovieLens dataset was considered as the convince dataset in accordance with different researches carried out in the same field [32].The second datasets involved include specific customer opinions on Amazon videos.

Device and library tools
In carrying out this experiment, several tools including software and hardware were used (Table 2).These include Python with several libraries such as TensorFlow for deep learning implementation and GeForce GTX 1001 for running convolutional neural networks supported by Quad Core Xeon 2.4 GHz processors.

Dataset
Two real-world datasets, MovieLens [32] and Amazon Information Video (AIV), were applied to show the robustness of the approach regarding rating prediction.These datasets contained consumer' explicit ratings for products on a scale of 1 to 5. Amazon dataset contained opinion for movie products as item description documents.Because MovieLens does not include item description documents, documents use corresponding items from IMDB server was generated.In accordance to Kim [27] and [21], a preprocessed description document was conducted for all the datasets as follows: 1) set the data with maximum length of raw documents to 300; 2) remove stop words; 3) calculate TF-IDF score for each word; 4) remove corpus specific stop words that have frequency higher than 0.5; 5) Choose top 8000 distinct words as a vocabulary; and 6) remove all non-vocabulary words from raw documents.Consequently, the average numbers of words per document were 97.09 on MovieLens-1m (ML-1m), 92.05 on MovieLens-10m (ML-10m) and 91.50 on Amazon Instant Video (AIV).Items that have no description documents in the dataset especially in Amazon dataset and user's data that have just only less than 3 ratings were considered for removal.The resulting statistics of each data shows that three datasets have different characteristics on Table 3.Finally, even though some users have been removed during preprocessing, Amazon dataset still has more sparse data compared to the others.

Results and Discussion
This experiment considered the ability of DCCN model to capture contextual performance.It was applied in several sentences with similar words but with different contextual meaning.The result of the experiment is as shown in Table 4.The first sentence used the verb word "love" and it has the highest share weight score of 0.0431.In the last sentence, "love" was used as a noun and it has a share weight score of 0.0408.It can be concluded that DCNN has the ability of catching contextual meaning of text sentence in document and even distinguish smooth contextual diversity of similar words through their shared weight scores.Based on the results shown in Fig. 4 and Table 5, the researchers demonstrated that at the beginning of the training process, the results of evaluation errors with RMSE were very high, but after the next iteration, the r value decreases significantly until it approached 30-50 levels where the value of the error reached a minimum of 0.8606681400.This means that the least error is reached at the 50th iteration.The training data process was automatically stopped when it reached the iteration level with the best performance in accuracy as shown in Fig. 4. The results of the iterations that produce a significant result on the level of accuracy are as presented in Table 5.Table 6 shows the average RMSE of Probabilistic Matrix Factorization (PMF), Collaborative Topic Regression (CTR), Collaborative Deep Learning (CDL) and the approach with various ratios for testing and training data on four datasets.First, it can be observed that ConMF approach model has the overall best performance on item side information based on product description documents.This implies that incorporating side information is effective.The proposed approach, DCNN, outperforms other previous ones founded on traditional text sentiment analysis using bag of word and word order (CTR).It shows that increasing deeper understanding of text sentences description would increase the level of accuracy.It was discovered that composition ratio of training datasets has impact on accuracy such that it constantly increases with ratio training and testing composition ration.The application of DCNN is more effective and efficient than the use of traditional methods such as CDL and CTR.

Conclusion
The research which was conducted using a dynamic convolutional neural network (DCNN) and a standard probabilistic matrix factorization revealed that the inclusion of side information in product description is effective in movie e-commerce business.It also found that the application of this method can increase the level of accuracy compared to previous models.Its involvement has also been identified to increase deeper understanding of texts which can be used in predicting rating of a product.There is possible integration of this method with another standard matrix factorization such as SVD, NMF, SVD++.This hybridization approach has been proven to have the capability of improving performance in rating prediction and applied text sentence document to increase the effectiveness of item latent factor representation which is very essential in upgrading the performance of recommendation machines.

Fig. 1 .
Fig.1.Left side with smooth dash is DCNN, and right side with a Coarse dash is the Probabilistic approach[29]

Table 1 .
Comparison of proposed Dynamic-CNN with product rating prediction approaches ∈ ℝ −+1 as contextual feature vector of a document with    can be constructed by: sides, one type of contextual feature is captured by one shared weight captures while multiple types of contextual features were captured by multiple shared weights.This conditions occur possibility to produce contextual feature vectors as much as the number   of   (i.e.,    where  = 1,2, . . .,   ).
where    ∈  is a bias for    , * is an operator of convolution, and () is a nonlinear activation function.Among these functions are sigmoid, tanh, and rectified linear unit (ReLU).ReLU was used to avoid the vanishing gradient issue that occurs minimum optimization convergence, and may evidence to minimize local value.Then,On other 2 ) = ∏ ∏ (  |     ,  2 ) |,  2 ) is the probability density function of the Gaussian normal distribution with mean  and variance  2 , and   is an indicator function.To get the user latent model, conventional priori, a zero-mean spherical Gaussian prior was used on the model with variance   2 .

Table 2 .
Devices and Library

Table 4 .
Phrase test

Table 5 .
The sample result of RMSE with loss, time, and convergence

Table 6 .
Comparison RMSE on ML-10M with existing approach