Hybrid deep neural network for Bangla automated image descriptor



Introduction
Humans readily recognize and interpret information presented visually, and they typically respond quickly even to scenes they are seeing for the first time [1] [2]. Understanding a scene is not limited to perceiving the image; it also involves grasping its syntactic and semantic meaning, including the relationships among the objects it contains. Producing a language-based textual representation of a picture is therefore an important area of study in computer vision, image processing, and natural language processing. As the range of practical applications of image captioning grows, research in these areas has been increasing. Researchers worldwide have focused on tasks such as image classification, text-based image analysis, object detection in images, helping people with visual disabilities access the digital world, and recognizing images in social media [3]- [7].
Automated image-to-text generation is a computationally challenging computer vision task that requires sufficient comprehension of both the syntactic and semantic meaning of an image to generate a meaningful description. Until recently, it had been studied only to a limited extent because of the lack of visual-descriptor datasets and of functional models able to capture the intrinsic complexities of image features. In this study, a novel dataset, Bangla Natural Language Image to Text (BNLIT), was constructed by pairing visual input with Bangla textual descriptors; it incorporates 100 classes with annotations. A deep neural network-based image captioning model was proposed to generate image descriptions.
The model employs a Convolutional Neural Network (CNN) to classify the whole dataset, while a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) captures the sequential semantic representation of text-based sentences and generates a pertinent description based on the modular complexities of an image. When tested on the new dataset, the model achieves a significant improvement in performance on the image semantic recovery task. For this experiment, we implemented a hybrid image captioning model that achieved strong results on the new self-made dataset; the task is new in the Bangladeshi context. In brief, the model provides benchmark precision in reconstructing characteristic Bangla syntax, and we present a comprehensive numerical analysis of the model's results on the dataset.
Image captioning generates simple natural-language text from a given image. Our primary goal is to classify the whole dataset and then implement a hybrid model that generates text using a specific optimization technique [8]- [11]. To that end, we constructed a new dataset named Bangla Natural Language Image to Text (BNLIT) [12], built for a different target language than existing caption datasets. The dataset comprises 8,743 images, each with an individual annotation, and some of the annotations were obtained from specialists. To the best of our knowledge, no other image dataset exists for generating Bangla-language text and for studying and improving accuracy and loss on this task [13].
From both a theoretical and a practical perspective, generating text from an input image is an interesting problem in machine learning, image processing, and deep learning. Moreover, image-to-Bangla-text generation is novel work from the perspective of Bangladesh and the Bangla-speaking community. In addition, growing image and video datasets pose a considerable challenge to computational natural language processing because of limited linguistic and semantic templates and closed vocabularies.
To build a model that generates image captions, the model's grasp of visual meaning needs to be strengthened: how it captures context, detects regions of the image, and then constructs a caption consistent with the image content. Improving accuracy is required for this, but the challenging task is to generate Bangla text from a given image. It also involves how a specific textual embedding of an image may be adapted to various contexts. To carry out this task, we proposed a hybrid neural network model. The most significant and challenging design element of an encoder-decoder model is building one that incorporates a Convolutional Neural Network (CNN) [14], a Recurrent Neural Network (RNN) [15], and a Long Short-Term Memory (LSTM) model [16]. Another important aspect of image-to-text generation is therefore to follow this hybrid design precisely and train the resulting structure appropriately.
In image processing and pattern recognition, the first task is usually to classify the dataset; image-to-text generation, object detection, and similar tasks follow. Several datasets are popular for image processing, including Flickr8K, Flickr30K, MS COCO, CIFAR-10, and CIFAR-100. In image classification, the main task is to identify the objects in the images of a dataset. Improving accuracy is the main challenge, together with developing a suitable model; Inception-v3 has been reported as an effective method for improving accuracy and efficiency [17]. Other researchers adopted a CNN to classify a dataset and placed a smoothness prior on the labels. That methodology is related to ours because we also use a CNN for image classification with multiple label classes. They also showed the resulting accuracy improvement and how to update the CNN parameters using the popular stochastic gradient descent optimization technique together with an α-expansion min-cut-based algorithm [17] [18]. Hyperspectral imaging (HSI) is another prominent technique in image processing. A CNN architecture based on spectral-spatial capsule networks was proposed for HSI to improve both classification accuracy and computational time [19]. Generative adversarial networks (GANs) are another strong model for classification, which remains a challenging task in HSI [20]. In addition, HSI supports high-accuracy image classification.
Image-to-text generation is the most crucial task; its main aim is to generate an image caption using a complex neural computing model. To this end, Attention Generative Adversarial Networks have been reported to achieve better accuracy [1]. Other work showed that a new model combining CNN, RNN, and LSTM components also works well and improves accuracy for text generation [23] [24]. Like them, we implemented a hybrid neural image captioning model that combines CNN, RNN, and LSTM methods and achieved a benchmark result on BNLIT. Feature-Guiding Generative Adversarial Networks (FGGAN) are another active line of research on image captioning; they can generate text efficiently even from blurred or poor-quality images. The same work showed that text generation performance also depends on data efficiency, data resizing, data reshaping, and dataset size [25] [26]. Furthermore, the BLEU and METEOR metrics are central to judging such models, and researchers have highlighted how important these metrics are for understanding how efficient a proposed model is [26]. They also reported large-scale evaluation scores of 63.5% and 30.6% for human performance on benchmark datasets such as Flickr8K, Flickr30K, and MS COCO. The papers mentioned above represent the state of the art for our research [27].
The primary objectives and contributions of this research are: first, to create a dataset for a new target language; second, to develop a hybrid image captioning model capable of generating a Bangla caption from any given image; third, to classify the dataset into relevant classes; fourth, to improve accuracy on the target dataset; and fifth, to successfully test the proposed model on semantic recovery tasks for images. Our self-made BNLIT dataset has been published in a machine learning and image processing repository and is available to every researcher [12].

Hybrid deep neural network
In computer vision, a neural network is a set of algorithms that enables a computational system to find patterns by modeling complex relationships in input data, in a manner loosely inspired by the human brain.
A Convolutional Neural Network (CNN) is a deep learning algorithm that extracts characteristic features from an input image and uses them to differentiate it from others. Before CNNs came into use, filters for blurring, sharpening, and edge detection had to be hand-engineered; a CNN learns such filters given sufficient training. CNNs have been widely applied in facial recognition, image detection, recommendation systems, and natural language processing.
We used four main layer types in the CNN architecture: the convolutional layer, the pooling layer, the rectified linear unit (ReLU), and the fully connected layer. The convolutional layer is at the core of the network. It performs convolutions, a linear operation in which a set of weights, called a filter or kernel, is multiplied element-wise with a patch of the input array and the products are summed. The main purpose of convolution is to extract features from the input, from low-level ones such as edges, color, and gradient orientation to higher-level ones in deeper layers. The filter is applied systematically to overlapping patches of the input, from left to right and top to bottom. Applying the same filter across the whole image is powerful because it detects a particular feature wherever it appears in the image [28].
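The sliding-window operation described above can be illustrated with a minimal sketch in plain Python. The function name `conv2d_valid`, the 4 × 4 image, and the 2 × 2 kernel are hypothetical and chosen only for illustration; like most deep learning libraries, the sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of rows), stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for top in range(out_h):            # move the window top to bottom
        row = []
        for left in range(out_w):       # move the window left to right
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    # element-wise product of weight and input value, summed
                    acc += image[top + i][left + j] * kernel[i][j]
            row.append(acc)
        output.append(row)
    return output


# Hypothetical 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 2x2 kernel that responds to a dark-to-bright transition.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d_valid(image, kernel)
```

The resulting feature map is non-zero only in the column where the window straddles the edge, which is the sense in which one shared filter finds a feature wherever it occurs.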
Next comes the pooling layer. Its main objective is to progressively reduce the spatial size of the representation, which reduces the number of parameters and the computation in the network and also helps control overfitting [29]. Using the MAX operation, it operates independently on every depth slice of the input and resizes it spatially.
In the fully connected (FC) layer, neurons have full connections to all activations in the previous layer, so their activations can be computed with a matrix multiplication followed by a bias offset. FC layers can be converted to CONV layers because the two differ only slightly. The CNN architecture is shown in Fig. 1 and the proposed model in Fig. 2.
A Recurrent Neural Network (RNN) is a neural network in which the output of the previous computation is used as an input to the current one. In a conventional feed-forward network, the inputs and outputs are independent of each other, but when the system must predict outputs in a sequence, it needs to remember previous inputs. An RNN keeps this information in a memory called the hidden state. This makes it applicable to tasks such as unsegmented, connected handwriting recognition and speech recognition.
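As a rough illustration of how the hidden state carries information from one step to the next, the sketch below implements a basic recurrence of the form h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h) in plain Python. The function names, weight values, and input sequence are hypothetical; this is a generic textbook RNN, not the trained model.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, b_h):
    """Compute the new hidden state from the current input and previous state."""
    h = []
    for k in range(len(b_h)):
        total = b_h[k]
        total += sum(w_xh[k][i] * x[i] for i in range(len(x)))
        total += sum(w_hh[k][j] * h_prev[j] for j in range(len(h_prev)))
        h.append(math.tanh(total))
    return h


def run_rnn(inputs, w_xh, w_hh, b_h):
    """Feed a sequence through the RNN, starting from a zero hidden state."""
    h = [0.0] * len(b_h)
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_xh, w_hh, b_h)
        states.append(h)
    return states


# Hypothetical weights: one input feature, two hidden units.
w_xh = [[0.5], [-0.5]]
w_hh = [[0.1, 0.0], [0.0, 0.1]]
b_h = [0.0, 0.0]
# The second input is zero, yet the second state is not: it remembers step one.
states = run_rnn([[1.0], [0.0]], w_xh, w_hh, b_h)
```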
In speech and handwriting recognition, a bidirectional RNN is used because the input may be ambiguous, so both past and future context, for example the surrounding words in a sentence, are needed to resolve the current output. For translation in NLP, encoder-decoder (sequence-to-sequence) RNNs are used. The encoder RNN keeps updating its hidden state as it reads the input and outputs the final state as the 'context'. This context is then fed to the decoder RNN, which produces the translation step by step [30]. A plain RNN struggles to retain long sequences when 'tanh' or 'relu' is used as the activation function. In addition, the vanishing and exploding gradient problems are addressed by a modified RNN called Long Short-Term Memory (LSTM). It performs well at classifying, processing, and predicting time series when the duration of time lags is unknown. Back-propagation is used to train the model. LSTM has three gates:
1. The input gate determines which values should modify the memory. A sigmoid function decides which values to let through, with outputs between 0 and 1, and a tanh function weights the candidate values in the range -1 to 1.
2. The forget gate determines which details should be discarded from the block. This is also decided by a sigmoid function, which looks at the previous state and the current input and outputs a number between 0 and 1 for each number in the cell state.
3. The output gate uses the input and the memory of the block to decide the output. As before, a sigmoid function decides which values to let through, with outputs between 0 and 1, and a tanh function weights the values in the range -1 to 1; the two results are multiplied [7]- [10][31].
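The three gates can be sketched for a single LSTM unit with scalar input and state. This is a minimal illustration of the standard LSTM update under assumed, hypothetical weights; the parameter layout (`params` mapping a gate name to an input weight, a recurrent weight, and a bias) is an invention for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a single-unit LSTM.

    `params` maps a gate name to a tuple (input weight, recurrent weight, bias).
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre_activation("input"))        # how much new content to write
    f = sigmoid(pre_activation("forget"))       # how much old memory to keep
    o = sigmoid(pre_activation("output"))       # how much memory to expose
    g = math.tanh(pre_activation("candidate"))  # candidate value in (-1, 1)

    c = f * c_prev + i * g                      # updated cell state
    h = o * math.tanh(c)                        # updated hidden state
    return h, c


# Hypothetical weights: the forget gate is saturated open and the input gate
# is saturated shut, so the cell state should pass through almost unchanged.
params = {
    "input": (0.0, 0.0, -100.0),
    "forget": (0.0, 0.0, 100.0),
    "output": (0.0, 0.0, 100.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, params=params)
```

With the forget gate near 1 and the input gate near 0, the cell state is carried forward essentially untouched, which is the mechanism that lets an LSTM preserve information over long sequences.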

Dataset
Data is everywhere. Converting it into usable information requires methodical procedures, statistical analysis, and categorical presentation. Such information, when ordered, organized, and represented by variables with values, forms a dataset. Datasets are widely used to build models, recognize patterns, and generate meaningful insights in image processing, machine learning, and deep learning. Real-world datasets include musical note datasets, voice clip datasets, image property datasets, character matching datasets, recursive datasets, and so on. Image collections in particular have proven to be among the most dynamic and accurate resources for the research and development of complex data-driven application systems in recent years. For this purpose, we introduced a new dataset titled BNLIT [12], a gallery of 8,743 photos representing the life, heritage, ethnicity, and culture of Bangladesh, where every image speaks in its own language. Rather than reproducing the Western socioeconomic settings of image collections such as Flickr8K, Flickr30K, and MS COCO, we chose to reflect the lifestyle, culture, and beauty of the country where the authors were raised. The dataset was constructed by aggregating numerous sources to ensure the variety, depth, and authenticity of the images. We collected images in both urban and rural settings, including natural scenery, shopping centers, local grocery shops, public transportation, and marriage ceremonies. We took the images with mobile phone cameras, DSLRs, and action cameras. We also took images of ethnic and religious festivities and collected some photographs from web sources that are not under any copyright restriction.
Image annotation, which makes information easy to reference, recognize, and label, was carried out separately for every picture, and the annotation language is our mother tongue, Bangla. In machine learning and deep learning, computers rely mostly on the training data fed to the algorithm, and performance depends on the precision of that data. In addition, image annotation is a well-established computer vision technique in which the objects to detect in training images are predetermined by researchers. Our dataset can thus be viewed as a multiclass image classification problem with a very large number of classes and a corresponding vocabulary size.
When training a machine learning model, a larger dataset generally yields better precision. To handle the 8,743 pictures of BNLIT, we organized them into 100 classes. We wrote one sentence to describe each image and resized all images to the same size, because they originally varied in pixel dimensions and resolution. The file formats are JPG, JPEG, and PNG. The table below shows how many images are in each format, along with their resolutions, image dimensions, and bit depths.
In our self-made BNLIT dataset, 58 images are in JPEG format, 1,237 in PNG, and 7,448 in JPG. Before applying the CNN, we resized the full dataset to 224 × 224. The technical characteristics of the BNLIT dataset are given in Table 1, and the dataset grouping in Table 2. A dedicated Python program was developed to resize all pictures to the same pixel dimensions and split them into categories. The pictures are then prepared for the CNN and RNN. We chose 7,243 images for training. Testing data provides an unbiased evaluation of a final model fitted on the training dataset. Validation data is used to evaluate a given model during development and is crucial for model evaluation. We selected 1,000 and 500 images for testing and validation, respectively.
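The split sizes reported above can be reproduced with a short sketch. The function `split_dataset`, the fixed random seed, and the placeholder image identifiers are assumptions made for illustration; the source does not describe how images were assigned to each subset.

```python
import random


def split_dataset(items, n_test, n_val, seed=0):
    """Shuffle once, then carve out test, validation, and training subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical identifiers standing in for the 8,743 BNLIT image files.
image_ids = [f"img_{k:04d}" for k in range(8743)]
train, val, test = split_dataset(image_ids, n_test=1000, n_val=500)
```

Shuffling before slicing keeps the three subsets disjoint while avoiding any ordering bias in the file list.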

Simulation
Image and annotation processing using hybrid deep model
We resized all images in our dataset to support better generalization and to avoid numerical inconsistencies during the training and testing stages. We used the raw image files of the dataset alongside CNN and VGG16 features. We set the input size to 224 x 224 x 3. The images in the dataset are color images with pixel values ranging from 0 to 255 and a size of 224 x 224, so they must be preprocessed before being fed to the model. First, each 224 x 224 image is converted into an array of size 224 x 224 x 3, which can then be fed to the CNN. We then classified the full dataset using CNN and VGG16 features, with 100 classes and a batch size of 16.
Furthermore, we used Conv2D layers with 2D max pooling and the ReLU activation function. We also added a dropout layer with a rate of 0.5 to the CNN, because dropout regularizes the neural network and reduces its tendency to overfit. We used categorical cross-entropy as the CNN loss and Stochastic Gradient Descent (SGD) as the CNN optimizer, with a learning rate of 0.01, decay rate = 1e-6, momentum = 0.9, and nesterov = True. We ran 20 epochs for the CNN. After classification training finished, we picked the best weights from the training run, saved them in an hdf5 file, and stored it in a directory together with the loss and accuracy vs. epoch plots.
We then turned to the RNN, which is mainly used to generate text from the given input images. We picked one Bangla annotation for each picture. We used the Keras LSTM model together with the NumPy and Matplotlib libraries and trained on our dataset's annotation file. We defined 256 units in the LSTM and set the dropout value to 0.2. Finally, we added a fully connected layer, used the NAdam optimizer with a batch size of 128, and measured the loss with categorical cross-entropy. We ran 50 epochs for the RNN. After training, we created a directory, picked the best weights from the training run, and saved an hdf5 model along with the loss and accuracy vs. epoch plots.
Finally, we combined the CNN and RNN features of our dataset, trained for a further 30 epochs, and produced accuracy and loss plots. After training finished, all trained models were stored in our CV folder. We then evaluated the trained model on the dataset to look for further improvement.
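To make the loss concrete, the sketch below computes categorical cross-entropy for one example from a softmax output, in plain Python. The helper names and the logits are hypothetical; a framework implementation would average this quantity over a batch.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)                        # subtract the max for stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


# Hypothetical three-class example whose true class is index 0.
probs = softmax([2.0, 1.0, 0.1])
loss = categorical_cross_entropy([1, 0, 0], probs)

# A model that guesses uniformly over 100 classes scores log(100), about 4.6.
uniform_loss = categorical_cross_entropy([1] + [0] * 99, [0.01] * 100)
```

The uniform case gives a useful baseline for a 100-class problem: a loss well below log(100) indicates the classifier has learned something beyond chance.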

Implementation
Image recognition is one of the essential aspects of image processing, and recent work offers a wealth of ideas [32]. Sentence descriptions make frequent references to objects and their attributes. Owing to its good precision, we used a CNN for image classification. We used a CNN pre-trained on ImageNet, which gave good results [33]. We identified 100 ImageNet Recognition Task classes [17], which are then optimized using the CNN. Following [5], we use a Region Convolutional Neural Network to detect each object in every image. Following [34], we use the top nineteen detected locations in addition to the whole image and compute representations from the pixels inside each bounding box, as described in [34]. A fully connected layer is placed immediately before the classifier. A Bidirectional Recurrent Neural Network (BRNN) [21] [35] is used to compute the representation of the sentence. The BRNN is one of several variants in the RNN family. Sentence composition was thus a critical part of our plan. The BRNN is often used to predict structured sequences: each element of the BRNN output depends on both the past and the future context of that element. The BRNN achieves this by combining the outputs of two RNNs, one processing the sequence from left to right and the other from right to left. The combined outputs are used to predict the given target signals. In our model, a sentence is a sequence of N words. The BRNN takes these N words and transforms each one into an h-dimensional vector. We use the index t = 1 ... N to denote the position of a word in a sentence.
The precise form of the BRNN is as follows. The weights W_w specify a word embedding matrix, which we initialize with 300-dimensional word2vec [33] weights and keep fixed because of overfitting concerns. I_t is an indicator column vector that has a single one at the index of the t-th word in the word vocabulary. The BRNN consists of two independent streams of processing. The first passes from left to right, and the second from right to left.
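For orientation, a commonly used BRNN formulation that is consistent with the symbols named in the text can be written as below. Only $W_w$ and $I_t$ appear in the source; every other symbol ($x_t$, $e_t$, $h_t^f$, $h_t^b$, $s_t$, the matrices $W_e, W_f, W_b, W_d$, the biases, and the activation $f$, typically a ReLU) is an assumption introduced for this sketch.

```latex
\begin{align}
x_t   &= W_w I_t \\
e_t   &= f(W_e x_t + b_e) \\
h_t^f &= f(e_t + W_f h_{t-1}^f + b_f) \\
h_t^b &= f(e_t + W_b h_{t+1}^b + b_b) \\
s_t   &= f\big(W_d (h_t^f + h_t^b) + b_d\big)
\end{align}
```

Here $h_t^f$ is the left-to-right stream, $h_t^b$ is the right-to-left stream, and $s_t$ is the h-dimensional representation of the t-th word, which depends on context from both directions.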

Optimization
For the CNN part, we used stochastic gradient descent (SGD) with mini-batches of 16 image-sentence pairs. We chose SGD because our dataset is large, with 8,743 images, and SGD performs well and quickly on large datasets. For a given batch size, it can converge faster than other optimization techniques because it updates the parameters more frequently. SGD is a simple variant of gradient descent in which the stochasticity comes from computing the gradient on a mini-batch at each descent step [10]- [12]. It also has a regularizing effect, which makes it suitable for the highly non-convex loss functions involved in training deep networks for classification. In addition, it can update weights on the fly as new data arrive, but because the weights are updated so frequently, the loss and cost functions fluctuate heavily. We used a learning rate of 0.001, a decay rate of 1e-6, momentum = 0.9, and nesterov = True. We used categorical cross-entropy as the loss and accuracy as the metric, with 100 classes for the classification of the entire dataset.
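A single SGD update with momentum, Nesterov acceleration, and time-based learning-rate decay can be sketched as follows. The update rule is the form used by the classic Keras SGD optimizer, which is assumed here rather than confirmed by the source; the toy objective (w - 3)^2 and the demo hyperparameter values are purely illustrative.

```python
def sgd_nesterov_step(w, grad, velocity, step, lr, decay, momentum):
    """One SGD update with momentum, Nesterov look-ahead, and lr decay."""
    lr_t = lr / (1.0 + decay * step)              # time-based decay of the rate
    velocity = momentum * velocity - lr_t * grad  # accumulate past gradients
    w = w + momentum * velocity - lr_t * grad     # Nesterov-style look-ahead
    return w, velocity


# Toy problem: minimize (w - 3)^2, whose gradient is 2 * (w - 3).
w, velocity = 0.0, 0.0
for step in range(300):
    grad = 2.0 * (w - 3.0)
    w, velocity = sgd_nesterov_step(
        w, grad, velocity, step, lr=0.01, decay=1e-6, momentum=0.9
    )
```

In real training, `grad` would be the mini-batch gradient of the loss with respect to each weight, which is where the stochasticity enters.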
For the RNN part, we used the Nesterov-accelerated Adaptive Moment Estimation (NAdam) optimizer for image-to-Bangla caption generation. NAdam combines Nesterov accelerated gradient (NAG) with Adam and is suited to noisy gradients or gradients with high curvature [24] [25]. The learning process is accelerated by summing the exponentially decaying moving averages of the previous and current gradients. We also tracked accuracy and loss against epoch for the RNN part.
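The way NAdam blends Adam's moving averages with a Nesterov-style look-ahead can be sketched with the simplified update below. This omits the momentum schedule that library implementations add, so it is an approximation of the technique rather than a reproduction of any particular library; the default hyperparameters and the toy objective are assumptions.

```python
import math


def nadam_step(w, grad, m, v, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified NAdam update; `t` is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad           # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad    # moving average of squares
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                 # bias-corrected second moment
    # Nesterov look-ahead: mix the corrected momentum with the current gradient.
    nesterov_m = beta1 * m_hat + (1 - beta1) * grad / (1 - beta1 ** t)
    w = w - lr * nesterov_m / (math.sqrt(v_hat) + eps)
    return w, m, v


# Toy problem: minimize (w - 3)^2 starting from w = 0.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 201):
    grad = 2.0 * (w - 3.0)
    w, m, v = nadam_step(w, grad, m, v, t, lr=0.05)
```

Dividing by the root of the second-moment average gives each parameter its own effective step size, which is what makes the method tolerant of noisy or sharply curved gradients.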

Results and Discussion
We implemented a hybrid neural image captioning model capable of generating a Bangla text description of a given image. We trained the model to learn the correspondence between regions of the image and the relevant segments of the sentences. To measure accuracy, we used classification accuracy metrics.

Encoding: CNN implementation
This section discusses the performance of the CNN on the BNLIT dataset. We chose a CNN for image classification with 100 classes. We used stochastic gradient descent (SGD) as the optimizer because our dataset is large and SGD performs well and quickly on such data. We ran 20 epochs with a batch size of 16, used categorical cross-entropy as the CNN loss, and configured SGD with a learning rate of 0.01, decay rate = 1e-6, momentum = 0.9, and nesterov = True. Results on our self-made dataset improved from the first epoch of CNN training. The model was saved to a directory after each epoch; once all epochs were complete, we chose the best weights among them and saved the final model in hdf5 format as model.hdf5.
After 20 epochs, we obtained a training accuracy of 0.824538, the best CNN result for this dataset, and a validation accuracy of 0.801161. Beyond 20 epochs, overfitting occurred, and those results are not shown in the plot. Training and validation accuracy vs. epoch for the CNN, computed over the entire dataset, are shown in Fig. 3.

Decoding: RNN and LSTM implementation
After the CNN, we turn to the RNN implementation for our dataset. For the RNN part, we used the NAdam optimizer. We also tracked accuracy and loss against epoch for the RNN part.
We selected a batch size of 128 for RNN training, ran 50 epochs, and chose a vocabulary size of 98. Accuracy improved from the first epoch of RNN training. Accuracy and loss vs. epoch are shown in Fig. 4. After 50 epochs, we obtained an accuracy of 0.889419, the best RNN result for this dataset.

Image to Text Generation: Hybrid Model Implementation for Caption Generation
To generate text, we produced a features.pkl file from the whole dataset of 8,743 images, using Python's pickle library, which serializes objects. We then ran 30 epochs for generating text from the given input image. Each epoch took approximately 1 hour 50 minutes, and accuracy reached 0.917895 for training and 0.768651 for validation. The loss was approximately 0.181132 for training and 1.605326 for validation. Training and validation accuracy and loss per epoch are shown in Fig. 5. The training accuracy was 0.688928 in the first epoch and rose to 0.7112296 in the second. This increase indicates that the model is starting to pick up the Bangla language.
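Storing extracted features in a pickle file and reading them back can be sketched as follows. The dictionary layout (image identifier mapped to a feature vector), the helper names, and the placeholder values are assumptions; the source states only that a features.pkl file was produced with the pickle library.

```python
import os
import pickle
import tempfile


def save_features(features, path):
    """Serialize the feature dictionary to disk."""
    with open(path, "wb") as handle:
        pickle.dump(features, handle)


def load_features(path):
    """Read the feature dictionary back from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical identifiers and short placeholder vectors; real CNN features
# would be much longer.
features = {
    "img_0001": [0.12, 0.0, 0.87],
    "img_0002": [0.0, 0.45, 0.31],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "features.pkl")
    save_features(features, path)
    restored = load_features(path)
```

Caching the features this way means the CNN needs to process each image only once, and the caption model can then train from the stored vectors.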

Evaluation Result
After training, we need to evaluate the model and report the results. Machine learning and image processing use many kinds of evaluation, but BLEU (bilingual evaluation understudy) and METEOR (Metric for Evaluation of Translation with Explicit Ordering) are especially popular in NLP. BLEU is an algorithm for evaluating text that has been machine-translated from one language to another. The idea behind BLEU is that the closer a machine translation is to a professional human translation, the better it is. BLEU requires two ingredients: 1) a numerical translation closeness metric, and 2) a corpus of human reference translations against which closeness is measured. BLEU averages several measurements based on n-grams, a device frequently used in probabilistic language models in computational linguistics. We used BLEU-1, BLEU-2, BLEU-3, and BLEU-4 for the evaluation score.
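The n-gram matching behind BLEU can be sketched in plain Python. The sketch assumes the common cumulative convention, in which BLEU-k is the geometric mean of the clipped 1-gram to k-gram precisions multiplied by a brevity penalty; the source does not state which convention was used, and the function names and token lists are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Fraction of candidate n-grams found in a reference, with clipping."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Cumulative BLEU with uniform weights over 1..max_n grams."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Use the reference whose length is closest to the candidate's.
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_mean)


# Hypothetical tokenized caption and reference.
reference = ["a", "boy", "is", "playing", "football"]
candidate = ["a", "boy", "is", "playing", "cricket"]
bleu_1 = bleu(candidate, [reference], max_n=1)
bleu_4 = bleu(candidate, [reference], max_n=4)
```

Clipping stops a caption from scoring well by repeating one correct word, and the brevity penalty discourages captions that are too short.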
METEOR is another evaluation approach. It aligns the unigrams of a candidate sentence with those of a reference, combines unigram precision and recall in a weighted harmonic mean, and applies a penalty when the matched words appear in fragmented order. We used hidden layers of different sizes: 64, 128, and 256. The evaluation metric scores of our model are given in Table 3, which is the benchmark result for our dataset. Fig. 6 shows how a Bangla caption is generated from a given image.
Fig. 6. Example of the image to Bangla language caption generation using hybrid image captioning model.

Discussion
We presented a deep hybrid neural image captioning model for generating Bangla text from images. The hybrid model combines CNN, RNN, and LSTM models and was applied to the self-made BNLIT dataset. It achieved good accuracy and can generate text from an image. We implemented CNN features for image classification using the VGG16 architecture. The CNN achieved an accuracy of 0.824538 in training and 0.801161 in validation, with a batch size of 16 and the SGD optimizer. For the RNN part, we achieved a training accuracy of 0.889419 with the NAdam optimizer. Finally, we combined both models and trained the full system, reaching an accuracy of 0.917895 for training and 0.768651 for validation. Table 3 reports the evaluation scores on the BNLIT dataset in terms of BLEU-1, BLEU-2, BLEU-3, BLEU-4, and METEOR.

Conclusion
In this study, we proposed a neural network with a sufficiently deep structure that generates a Bangla natural language sentence by comprehending the intricacies of an image's contents. We implemented the proposed model by combining CNN, RNN, and LSTM architectures and obtained benchmark accuracy on our self-made dataset. We also implemented and reported the classification of the full dataset, along with the annotation part handled by the RNN, which is crucial for image-to-text generation. We then combined both architectures and achieved a benchmark result for BNLIT. Our analysis indicates that further improvement may be achieved with more comprehensive datasets and with better model fine-tuning and engineering. In the future, we intend to improve accuracy by implementing a more advanced neural image captioning model together with object detection.