Japanese sign language classification based on gathered images and neural networks



Introduction
Sign language (SL) is one of the communication tools used by humans. For communication between humans and computers, it is important to develop such communication tools and SL recognition techniques. SL comprises fingerspelling and hand gestures; hand gestures in turn include finger-alphabet signs and hand motions.
There are many techniques to classify SL, covering hand-shape feature extraction, hand and/or finger motion feature extraction, and SL word classification [1]–[28]. For hand-shape feature extraction, Jeballi et al. [9] classified French SL using HMMs, Ranga et al. [10] classified American SL using a Gabor filter with the wavelet transform and a CNN, and Tao et al. [12] classified the American SL alphabet using a CNN. For hand and/or finger motion feature extraction, Silanon [14] classified Thai fingerspelling using histogram-of-oriented-gradient features, and Phitakwinai et al. [15] classified Thai SL using the scale-invariant feature transform. For SL word classification, Pariwat et al. [16] classified Thai SL using an SVM, Pigou et al. [17] classified SL hand gestures using a CNN, Molchanov et al. [18] classified hand gestures using a 3D CNN, Mukai et al. [19] classified JSL using an SVM, and Takayama and Takashi [20] classified JSL using an improved HMM. These machine learning techniques require input data of a specific size.
It is not easy to specify this size because signing speed differs between people, as does the length of each SL word. Furthermore, Rao et al. [21] classified SL using a CNN and a dataset in which the sample size was kept constant. If the information of an SL word can be gathered into a single image, there is no need to specify the input size, and a method can be developed that classifies SL words independently of signing speed and word length.

Method
The proposed method consists of grayscale transformation, mean image creation, gathered image generation, and JSL word classification. The flowchart of the JSL word classification method is shown in Fig. 1. Each step is detailed in the following sections.

Grayscale Transformation for Preprocessing
Fig. 2 shows the grayscale transformation. In this preprocessing stage, after the video is converted to still images, all images are transformed to grayscale as follows:

Gray = 0.299 × R + 0.587 × G + 0.114 × B

where Gray is the grayscale value of each pixel, and B, G, and R are the blue-, green-, and red-scale values, respectively.
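The transformation above can be sketched as follows (a minimal illustration in Python with NumPy, not the authors' C/C++ implementation; the BGR channel order is an assumption based on typical video-frame layouts):

```python
import numpy as np

def to_grayscale(bgr_frame):
    """Convert one BGR still image (H x W x 3) to grayscale using
    Gray = 0.299*R + 0.587*G + 0.114*B.  BGR channel order is an
    assumption (OpenCV-style frames); adjust the indices otherwise."""
    b = bgr_frame[..., 0].astype(np.float64)
    g = bgr_frame[..., 1].astype(np.float64)
    r = bgr_frame[..., 2].astype(np.float64)
    return 0.299 * r + 0.587 * g + 0.114 * b

# A pure red pixel maps to 0.299 * 255, roughly 76.2
frame = np.zeros((1, 1, 3), dtype=np.uint8)
frame[0, 0, 2] = 255  # red channel in BGR order
print(to_grayscale(frame)[0, 0])
```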

Mean Image Creation
A mean image is created by calculating the average value of each block after dividing the image into blocks of N × M pixels, as follows:

Mean(x, y) = (1 / NumImage) × Σ_{i=1}^{NumImage} GrayImage_i(x, y)

where Mean, GrayImage, x, y, i, and NumImage indicate the mean image, a grayscale image, the x-coordinate of a block in an image, the y-coordinate of a block in an image, the image number, and the total number of sample images of a JSL word, respectively. Fig. 3 shows the creation of a mean image. The created mean image expresses information concerning the hand motion of a JSL word, although the gray values carrying the hand motion information are faint.
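The block-wise averaging above can be sketched as follows (an illustrative NumPy version, not the authors' C/C++ code; the 6 × 6 block size follows the experimental settings, and the image dimensions are assumed to be divisible by the block size):

```python
import numpy as np

def block_mean_image(gray_images, n=6, m=6):
    """Mean image over all frames of one JSL word, reduced to one
    value per N x M block:
    Mean(x, y) = (1 / NumImage) * sum_i GrayImage_i(x, y),
    where (x, y) indexes blocks.  n = m = 6 is the block size used
    in the experiments."""
    stack = np.stack([img.astype(np.float64) for img in gray_images])
    num, h, w = stack.shape
    blocks = stack.reshape(num, h // n, n, w // m, m)
    # average over the frame axis and over the pixels inside each block
    return blocks.mean(axis=(0, 2, 4))

frames = [np.full((12, 12), 10.0), np.full((12, 12), 30.0)]
print(block_mean_image(frames).shape)  # a 2 x 2 grid of block means
```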

Gathered Image Generation
In gathered image generation, the difference values of the blocks between the mean image and all images of a JSL word are calculated. Then, the winner blocks, i.e., those with the maximum difference values, are decided. The gathered image consists of these winner blocks. In Fig. 4, MeanImage, GrayImage, maximum, and max_num indicate the created mean image, a grayscale image, the maximum difference value between the created mean image and the grayscale images in each block, and the total number of grayscale images, respectively. Fig. 5 shows the gathered image generation based on computing the maximum difference from the mean image. The generated gathered image highlights the hand motion information of a JSL word because the gray values of the blocks that have the maximum difference from the created mean image are embedded.
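The winner-block selection above can be sketched as follows (an illustrative NumPy version, not the authors' implementation; it assumes a full-resolution mean image, i.e., the pixel-wise average of all frames, and uses the sum of absolute differences within a block as the difference value):

```python
import numpy as np

def gathered_image(gray_images, mean_img, n=6, m=6):
    """For each N x M block, find the frame (winner) whose block
    differs most from the mean image, then embed that block's
    pixels into the output gathered image."""
    stack = np.stack([img.astype(np.float64) for img in gray_images])
    num, h, w = stack.shape
    out = np.empty((h, w))
    for by in range(h // n):
        for bx in range(w // m):
            ys = slice(by * n, (by + 1) * n)
            xs = slice(bx * m, (bx + 1) * m)
            # per-frame absolute difference from the mean block
            diffs = np.abs(stack[:, ys, xs] - mean_img[ys, xs]).sum(axis=(1, 2))
            winner = int(np.argmax(diffs))
            out[ys, xs] = stack[winner, ys, xs]
    return out

frames = [np.full((12, 12), v) for v in (0.0, 10.0, 100.0)]
mean_img = np.mean(np.stack(frames), axis=0)  # about 36.7 everywhere
gi = gathered_image(frames, mean_img)         # frame 2 differs most in every block
```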

Japanese Sign Language Classification
It is not easy to extract the features of the generated gathered image because it is complex. CNNs are therefore used to extract the features of the generated gathered images. The convolutional layers have L × L filters. The information in the generated gathered image is compressed using the pooling layer. Then, a dropout function (dropout ratio: Q%) is applied to protect against overfitting. The CNN structure is shown in Fig. 6. Finally, the JSL words are classified using the MSVM and MLP classifiers, respectively. Fig. 7 shows the structures of the MSVM and MLP. The accuracy rate for classifying JSL words is expressed as follows:

Accuracy = CorrectClassification / TotalNum × 100

where Accuracy, CorrectClassification, and TotalNum represent the accuracy rate for JSL word classification, the number of correctly classified data, and the total number of gathered images, respectively.
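The accuracy computation follows directly from the definition (a trivial Python helper for illustration; the counts in the example are hypothetical):

```python
def accuracy(correct_classification, total_num):
    """Accuracy = CorrectClassification / TotalNum * 100 (percent)."""
    return correct_classification / total_num * 100.0

# hypothetical example: 187 of 200 gathered images classified correctly
print(accuracy(187, 200))
```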

Results and Discussion
We conducted experiments using actual JSL videos. The subjects were 11 healthy persons (3 females and 8 males; mean age = 24.7 years). The number of JSL words was 20, related to greetings and enquiries used during general communication in an information center and/or office. A total of 13,200 gathered images were generated (11 subjects × 20 words × 60 gathered images). The number of classes was 20 (20 JSL words).

Some common phrases used by the participants were "excuse me," "I see," "I'm not sure," "where," "when," "please," and "thanks." In JSL, "excuse me" consists of "talk," "not care," and "could you." "I see," "I'm not sure (not sure)," "when," "please," and "thanks" are single words. "Where" is expressed using two words: "place" and "what." Additionally, some selected greetings consisted of "morning," "afternoon," or "night" combined with "greeting." The place words used in this experiment were "athenaeum," "hospital," and "information" (each a single word). The selected verbs were "go," "say," and "hope" (each a single word).

C/C++ code was employed to implement the grayscale transformation, mean image creation, and gathered image generation. A MATLAB toolbox was used to extract the features using the CNNs and to classify JSL using the MSVM and MLP. The gathered image consisted of 108 × 192 pixels. The block size was 6 (N) × 6 (M) pixels. The filter sizes of the first to third convolutional layers were six, three, and three, respectively, and the numbers of filters in the first to third convolutional layers were 64, 64, and 192, respectively. The pooling layer employed the max pooling algorithm. The number of units in the full-connection layer was 1,000. The dropout rate Q was 50. The hidden layer of the MLP had 1,000 units. The training datasets were 80% of the data, selected randomly. Table 1 shows the mean and standard deviation of the sample number of each Japanese sign for each subject.
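For reference, the experimental settings listed above can be collected into a single configuration sketch (a summary in Python, not the authors' C/C++ and MATLAB code; the key names are illustrative):

```python
# Experimental configuration summarized from the text (key names are
# illustrative, not taken from the authors' code).
CONFIG = {
    "gathered_image_size": (108, 192),  # pixels (H, W)
    "block_size": (6, 6),               # N x M
    "conv_filter_sizes": [6, 3, 3],     # L x L filters, layers 1-3
    "conv_num_filters": [64, 64, 192],  # filters per conv layer
    "pooling": "max",
    "fc_units": 1000,                   # full-connection layer
    "dropout_rate": 0.5,                # Q = 50%
    "mlp_hidden_units": 1000,
    "train_split": 0.8,                 # 80% of data, chosen randomly
    "num_classes": 20,                  # 20 JSL words
    "num_subjects": 11,
}

# sanity check: the image divides evenly into 6 x 6 blocks
h, w = CONFIG["gathered_image_size"]
n, m = CONFIG["block_size"]
assert h % n == 0 and w % m == 0
```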
SubA to SubK in Table 1 represent subjects A to K, respectively. We confirmed that the mean numbers of sample images differed for each JSL word and for each subject, and that they were considerably variable. The maximum and minimum means of the sample images were 92.9 for "athenaeum" in the case of subject C and 18.0 for "place" in the case of subject H. In addition, the maximum and minimum standard deviations of the sample images were 19.0 for "go" in the case of subject H and 1.9 for "could you" in the case of subject I.

Table 2 shows the recognition accuracy (mean and standard deviation) of the 20-JSL-word classification for the previous method [31], the mean image creation method, and the proposed method, i.e., the method of embedding the information of the block having the maximum difference between the mean image and the grayscale images into the gathered image. In the previous method, the difference values of all blocks between the target image and the previous image, and between the target image and the next image, were calculated, and the information of the block of the target image having the maximum difference value was embedded into the image (Fig. 8). This previous method has often been employed to analyze security footage with residual images that express human movements [29], to visualize sleep conditions (e.g., sound sleep and bad sleep) [30], and to classify 10 JSL words [31]. T indicates the trial number. We confirmed that the maximum mean recognition accuracy was 94.1%, obtained using the proposed method with the MSVM classifier, and that the minimum standard deviation was 1.6%. The mean and standard deviation of the recognition accuracy using the previous method were 64.3% and 3.9%, respectively, and those using mean image creation were 89.3% and 2.9%, respectively.

Fig. 9 to Fig. 11 show the generated gathered images of each JSL word for each subject using the previous method, mean image creation, and the proposed method, respectively. In the previous method, the maximum difference of each block was calculated, and the information of each block having the maximum difference was embedded in an image to generate the gathered image. The gathered image expresses the most significant action in the hand motion of a JSL word because it comprises the block information that had the maximum difference. "Place" in JSL is a downward motion from the top with the dominant hand open; this downward motion with the open dominant hand is its significant action. "Afternoon" in JSL is a motion in front of the face with the forefinger and middle finger of the dominant hand; its significant action is holding the forefinger and middle finger up in front of the face. "Greeting" in JSL is expressed by bending the forefingers of both hands in front of one's face; this bending is its significant action. "Could you" in JSL is expressed through a motion that turns the dominant hand from showing its back to showing its palm to the conversation partner; the significant action of this sign is showing the palm of the dominant hand. "Go" in JSL is a motion in which the forefinger of the dominant hand moves from the bottom to the front; the significant action of this sign is moving the forefinger of the dominant hand forward. We confirmed this in Fig. 8 and Fig. 9(b).
International Journal of Advances in Intelligent Informatics, ISSN 2442-6571, Vol. 5, No. 3, November 2019, pp. 243-255
"Athenaeum" in JSL is a motion that opens the hands from the state of pressing them together and then forms a square with both hands. The significant actions of "athenaeum" are opening the hands and making a square.
It was difficult to find the opening-hands motion in the gathered image. The mean and standard deviation of the previous method were 65% or less and 3.5% or more, as shown in Table 2. These results suggest that the information gathered in the gathered image becomes insufficient when the JSL word includes complex hand motions, as in the case of "athenaeum," and that the recognition accuracy of the previous method is therefore not high.

In mean image creation, the gathered image expressed information on the hand motion of a JSL word even though the gray values of the hand motion information deteriorated. The hand motions of "afternoon," "greeting," "could you," and "go" are shown in Fig. 10(b), as is a part of the hand motion of "athenaeum." The gray values of the area related to the hand motion were too faint because the number of sample images was too high. We confirmed that the mean and standard deviation were 85% or more and 3% or less, respectively. These results suggest that it is difficult to extract features using CNNs when the number of sample images is too large, and that it is easy to classify the JSL words when the number of sample images is small because the created mean image gathers enough hand motion information.

In the proposed method, the gathered image highlighted the hand motion information of the JSL word because the embedded gray values of the blocks contained the maximum difference from the created mean image. The hand motions of "afternoon," "athenaeum," "greeting," "could you," and "go" are shown in Fig. 11(b). Here, the proposed method is compared to the mean image creation method. Fig. 12 shows samples ("place") of mean images and gathered images using the proposed method. Even though hand motion information can be found in the mean image, the gray values of the areas related to the hand motion were faint.
We confirmed that the faint information was highlighted and that the mean and standard deviation were 94% or more and 1.8% or less, respectively, which are excellent results. These results suggest that highlighting the gray values of the areas related to the hand motion in the mean image makes it easy to extract the features of JSL words and classify them. In addition, the proposed method had a beneficial effect on extracting the features related to the JSL words and classifying them. The proposed method employed CNNs (Fig. 6) and the MSVM (Fig. 7), obtaining experimental results with high recognition accuracy. These results suggest that the combination of CNNs and the MSVM effectively extracts features from complex images and classifies hand motions.

Conclusion
To solve the issues that signing speed and word length differ among JSL word motions and that the input data size for CNNs must be fixed, this paper employed the gathered image generation technique to create gathered images that allow JSL words to be classified without dependence on signing speed and word length. It was not easy to extract features to classify the JSL words because the generated gathered images were complex. CNNs were therefore employed because they can obtain features from complex images. This paper proposed a novel approach to classify JSL words. The proposed method consisted of grayscale transformation, mean image creation, gathered image generation, and JSL word classification. In the grayscale transformation, the input data were transformed as preprocessing. In the mean image creation, information related to the hand motions of JSL words was gathered. In the gathered image generation, this information was highlighted. Then, the CNNs were employed to extract features of the gathered images. Moreover, the MSVM and the MLP were employed to classify the 20 JSL words.
In the experimental results, we confirmed that the faint information was highlighted and that the mean and standard deviation were 94% or more and 1.8% or less, respectively. The experimental results suggest that highlighting the gray values of the areas related to the hand motion in the mean image makes it easy to extract the features of JSL words and classify them. In addition, the proposed method had a beneficial effect on extracting the features related to the JSL words and classifying them. The proposed method employed CNNs and the MSVM, obtaining experimental results with high recognition accuracy. These results suggest that the combination of CNNs and the MSVM effectively extracts features from complex images and classifies hand motions. However, the gathered image of the proposed method contains no time-series information on the hand motions. Therefore, we will create a novel gathered image that considers time-series information to improve the classification accuracy.