TelsNet: temporal lesion network embedding in a transformer model to detect cervical cancer through colposcope images

Cervical cancer ranks as the fourth most prevalent malignancy among women globally. Timely identification and intervention in cases of cervical cancer hold the potential for achieving complete remission and cure. In this study, we built a deep learning model based on self-attention mechanism using transformer architecture to classify the cervix images to help in diagnosis of cervical cancer. We have used techniques like an enhanced multivariate gaussian mixture model optimized with mexican axolotl algorithm for segmenting the colposcope images prior to the Temporal Lesion Convolution Neural Network (TelsNet) classifying the images. TelsNet is a transformer-based neural network that uses temporal convolutional neural networks to identify cancerous regions in colposcope images. Our experiments show that TelsNet achieved an accuracy of 92.7%, with a sensitivity of 73.4% and a specificity of 82.1%. We compared the performance of our model with various state-of-the-art methods, and our results demonstrate that TelsNet outperformed the other methods. The findings have the potential to significantly simplify the process of detecting and accurately classifying cervical cancers at an early stage, leading to improved rates of remission and better overall outcomes for patients globally.


Introduction
Cervical cancer has become a major health hazard for women worldwide, with its high mortality and morbidity rates [1]. The majority of fatalities occur in underdeveloped and developing countries [2]. Unlike other types of cancers that are genetically triggered, the cause of cervical cancer is known to be human papillomavirus (HPV) [3]. Cervical cancer can be completely cured if identified in its early stages [4]. In 2018, the World Health Organization (WHO) urged countries worldwide to work towards eradicating cervical cancer [5]. Globally uniform cervical cancer screening can be a potential step toward achieving this goal [6]. Cervical cancer can be nipped in the bud altogether through systematic screening and swift intervention.
There are certain limitations to identifying cervical cancer manually. First and foremost, there is the problem of interobserver variability [7]: the same colposcope image examined by different clinicians can receive different diagnostic decisions. The concordance of diagnosis for colposcope images is 65% [8]. Reducing interobserver variability in colposcopy diagnosis is essential for improving the reliability of cervical cancer screening and early detection [9], [10]. Standardized training and the use of technology to aid the decision making of clinicians can be a step towards achieving the same. In addition to the variability problem, in many regions, especially low-income and remote areas, there is a severe shortage of skilled healthcare professionals, including gynecologists and pathologists, who are trained to accurately interpret cervical cancer screening results [10]. This scarcity of expertise can lead to delayed diagnoses and inadequate follow-up care. Hence, computational support can assist existing clinicians with modest experience and boost diagnostic accuracy. Also, traditional diagnostic methods, like Pap smears and VIA, have limitations in terms of sensitivity and specificity. Therefore, choosing the colposcope test as the deciding examination for cervical malignancy identification can help in several ways: reducing the cost burden associated with the Pap smear test, overcoming the impediment of limited clinician training and, most importantly, reducing the turnaround time of the final diagnostic decision.
Artificial intelligence (AI) assisted cancer screening [11] has gained notable traction in the past two decades, and cervical cancer diagnosis has benefitted remarkably from AI solutions [12]. Several researchers have applied deep-learning solutions to cervical cancer detection through medical imaging. Imaging modalities range from Pap smear and colposcope images to magnetic resonance imaging (MRI) and computerized tomography (CT) [13]. Singh et al. [14] published a chronological review of the deep learning methods in cervical cancer screening. The outcome of the survey ascertains that deep learning CAD solutions are a bridge to developing automatic screening of cervical cancer. Colposcopy examination is a pivotal tool for cervical cancer screening that offers a greater degree of accuracy than the human papillomavirus (HPV) and Thin-Prep cytologic test (TCT) tests [15]. During the colposcopy examination, a 5% acetic acid solution is topically administered to the cervical region to accentuate cancerous characteristics [16]. Subsequently, a colposcope is utilized to capture detailed images of the cervix, where lesions become conspicuously visible within a few minutes following acetic acid application. In some cases, the cervix images are captured in a time series fashion with saline, acetic acid, and Lugol's iodine application. The classification of colposcopy images is primarily employed for diagnostic purposes, aiming to discern between benign lesions and low/high squamous intraepithelial lesions or cervical intraepithelial neoplasia (CIN) [17]. Diagnostic accuracy in traditional colposcope exams depends on the clinician's experience and expertise, a scarce resource in many low-income countries. There are insufficient experienced specialists to accommodate the number of patients who need screening. In parallel, several researchers are investigating the use of deep learning to distinguish between cervical lesions seen in colposcopy images, to help with patient triaging in clinical settings and improve clinicians' diagnostic accuracy. This research aims to create a novel technique to handle multi-stage cervix images and patient data to provide classification support to expert clinicians to enable efficient diagnosis of cervical cancer.
A transformer architecture is a neural network that was initially developed for natural language processing (NLP) tasks. Nevertheless, it has also worked well on image classification problems. Integrating a transformer model in computer vision, which assesses the attention weights of specific local regions in an image, has the potential to enhance image classification tasks. This is because the model can direct its attention towards the most relevant areas of an image, allowing it to better capture subtle differences and fine-grained details that may be crucial for accurate classification. In this approach, the image is divided into smaller regions, or "patches", and each patch is treated as a sequence of pixels. The transformer is then applied to these patch sequences to calculate attention weights that indicate the relative importance of each patch for the classification task.
A colposcope image frequently contains extraneous elements such as background noise and unwanted objects like vaginal walls and speculum [18]. The cervix region must be precisely cropped for subsequent efficient classification. The previous research on cervix ROI extraction is broadly classified into machine learning and deep learning methods. In this paper we use an enhanced Gaussian mixture model (GMM) to precisely segment the cervix region of interest, which can then be given as input to the subsequent classification module. A Gaussian mixture model is a probabilistic model that represents the data as a mixture of multiple Gaussian distributions, describing the distribution of data in multiple dimensions. In the context of image segmentation, each pixel in the image is treated as a data point with multiple attributes. The key parameters of Gaussian mixture modelling are the mean vector, covariance matrix, and mixture coefficients. Usually, these parameters are obtained with expectation-maximization optimization. However, in this study we used a nature-based metaheuristic optimizer for tuning the parameters of the GMM. Hyperparameter tuning is a crucial step in the process of training machine learning models. It involves finding the best set of hyperparameters for a given model and dataset to optimize its performance. Hyperparameters are parameters that are not learned from the data but are set prior to training. In order to extract the optimal hyperparameters, a Mexican axolotl algorithm is used in this study. Metaheuristic optimization refers to a class of optimization algorithms that search for a good solution to a problem without guaranteeing optimality. The Mexican axolotl algorithm is one such optimization technique, and it works by an opposition-based learning strategy. Opposition-based learning (OBL) is an optimization technique that involves considering both the positive (conventional) and negative (opposite) solutions when searching for optimal solutions in a search space. The concept behind OBL is to use opposition or contrast to improve the exploration of the solution space and enhance the performance of optimization algorithms.
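The opposition idea can be illustrated with a minimal sketch. It assumes the common OBL definition in which the opposite of a candidate x inside the bounds [lower, upper] is lower + upper - x; the bounds, the toy fitness function, and the function names are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def opposite_solution(x, lower, upper):
    """Opposite of x inside the box [lower, upper], i.e. lower + upper - x."""
    x = np.asarray(x, dtype=float)
    return np.asarray(lower, dtype=float) + np.asarray(upper, dtype=float) - x

def keep_better(x, lower, upper, fitness):
    """Evaluate a candidate and its opposite and keep the fitter of the two."""
    x = np.asarray(x, dtype=float)
    x_opp = opposite_solution(x, lower, upper)
    return x if fitness(x) >= fitness(x_opp) else x_opp

def toy_fitness(v):
    # Hypothetical fitness: higher is better, with the peak at 0.8 per dimension.
    return -np.sum((v - 0.8) ** 2)

candidate = np.array([0.1, 0.3])
best = keep_better(candidate, lower=0.0, upper=1.0, fitness=toy_fitness)
```

Here the opposite point lies closer to the optimum than the random candidate, so it is the one retained; this is the mechanism by which opposition can speed up exploration.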
The evolution of lesions is dynamic between the saline, acetic acid, and iodine induced cervix images. To capture this, acetowhite lesion recognition is treated as a fine-grained visual classification problem. In order to extract a fine-grained feature of the image, attention weights for specific local regions are obtained by utilizing a transformer model. These attention weights, which denote the significance of various parts of the image, are computed by the transformer model during its analysis of the input dataset. By acquiring attention weights for designated regions of the image, we can pinpoint the areas that are the most pertinent to the feature we are attempting to extract. This enables us to identify the most crucial elements of the image and thus obtain a more precise and accurate fine-grained feature. Hence, we put together a CNN embedded in a transformer to extract highly accurate local lesion features that can solve the over-segmentation problem faced by previous models. The framework takes in the cervix images in small sections and generates latent features of the same. These latent features are integrated with information about where the lesion is located, thus making the features more informative. These features are sent as input to the proposed network, and subsequently, the attention-based model generates and learns the weights of lesion features. Based on the weight assignment, the lesion area is marked for model performance. As the last step, the features and attention weights are optimized by a metaheuristic loss model. A well-structured literature review is essential as it provides a foundation of existing knowledge, contextualizes the research, and identifies gaps that this study aims to address.
Artificial intelligence (AI) is playing an increasingly important role in medical image processing. AI algorithms can be used to automatically analyze and interpret medical images, which can help clinicians with diagnostic decision making [19]. The application of AI in gynecological cancer research [20] has gained traction in the past couple of decades. Diagnosis of endometriosis [21], vaginal cancer [22], ovarian abnormalities [23], uterine cancer [24], cervical cancer [25] and vulvar cancer [26] have all benefited from machine learning and deep learning models. A significant amount of research is aimed at segmenting and classifying colposcope images [27]. Fan et al. [28] used a Mask R-CNN to segment the cervix area of interest, encoded the input images through the EfficientNet B3 architecture, and attained 92.7% accuracy with 0.9856 AUC. Yan et al. [29] designed BFCNN, a bilinear fuse convolutional neural network for the segmentation and classification of cervigrams. Yuzhen Cao [15] developed a multiscale feature fusion classification network to classify the cervical transformation zone and reported an accuracy of 88.49% with 90.12% sensitivity. Asiedu et al. [30] used bounding boxes to extract the ROI and classified the region through support vector machines.
Park et al. [31] used anatomical maps with texture and color to identify cancerous regions, then employed k-means clustering to divide these regions into sub-regions. Using a CRF classifier, they amalgamated the categorization results of surrounding areas in a probabilistic way and finally determined the overall classification results with the help of KNN and LDA integration, thus enabling automatic recognition of normal, CIN, and SCC (squamous cells of the cervix). Xu et al. [32] carried out a study in which they manually extracted three pyramid features (PLBP, PLAB, and PHOG), then compared seven traditional classifiers and one convolutional neural network (CNN). The CNN was found to be more effective than the standard machine learning classifiers. Chen et al. [33] tested a multimodal deep fusion technique called MultiFuseNet to classify cervical dysplasia. They proposed Multimodal Fusion Learning for Cervical Dysplasia Diagnosis for feature fusion of image modality with metadata and reported an accuracy of 87.4% with 86.1% specificity and 88.6% sensitivity. Li et al. [34] created a computer-generated diagnostic program based on an AW opacity index, which yielded a diagnosis with 84% specificity and 88% sensitivity. The authors of [35] developed a diagnostic image analysis system based on acetowhite lesion-based statistical features and evaluated its diagnostic accuracy. The reported sensitivity and specificity were 79% and 88%, respectively. Despite their satisfactory performance, these models share the methodological weakness of using a single acetic acid image as input. In order to overcome this drawback, the input of the model could be enhanced to carry multiple states of information, like sequences of cervix examination images.
As an extension to the above approach, Li et al. [36] built a convolutional network with graph and edge features (E-GCN) and noted a 78.33% accuracy from using time series image features. Perkins et al. [17] contradicted these findings by fusing 17 time-series colposcope images. The study reported no meaningful increase in accuracy after analyzing the performance of 17 fused images. This raises the question of whether adding non-image information could meaningfully increase classification accuracy. Peng et al. [28] analyzed multimodal feature changes by building a multistate convolution neural network with an extension of the genetic algorithm optimization technique. They reported 86.3% accuracy. In parallel, Yinuo Fan et al. [37] built a multimodal fusion colposcopic convolutional neural network (CMF-CNN) that used Squeeze-and-Excitation fusion to combine image and clinical features and achieved 92.70% accuracy. The above two multimodal approaches have used image and clinical data. However, they have the limitation of using a single acetic acid image as input. Adding meaningful information from the cervix image via saline and Lugol's iodine solution application is the way forward to assert superior, interpretable, and dependable results. Li et al. [36] approached the time series imaging problem by building a graph convolutional network with edge features (E-GCN) to fuse sequential images of the cervigrams (images captured at 60s, 90s, 120s, 150s) and achieved an accuracy of 78.33%. A bird's-eye view of the results in this section shows widely scattered accuracies. One explanation for the varied results is that the strength of a deep learning model is dependent on the dataset size and quality. Higher accuracies with low sensitivity and specificity may indicate that overfitting has occurred. In the same manner, a lower accuracy with consistent specificity and sensitivity indicates the robustness of the trained model.
This paper proposes a transformer model (TelsNet) embedded with a 3D CNN to extract the latent features and their weights corresponding to the acetowhite lesion region. In addition to that, the paper presents a preprocessing technique unique to colposcope images. The contributions of this paper are as follows:
• A preprocessing mechanism using a Gaussian mixture model is proposed to segment the whole image to remove the noise and artifacts in the cervix image.
• A novel specular reflection removal model is proposed through bi-dimensional histogram decomposition and Laplacian transformation.
• This is the first experiment using a transformer model that is embedded with 3-D CNN to extract the latent features and their weights corresponding to the acetowhite lesion region.
• A nature inspired meta-heuristic optimization algorithm is proposed to enhance the performance of the model to reach convergence.
• The proposed model is evaluated on a sequential cervix image dataset obtained from international archives for research in cancer (IARC).
The remainder of the paper is structured as follows: Section 2 presents the methodology of the TelsNet transformer architecture. Section 3 discusses the results and comparative analysis of the performance of the model. Section 4 concludes the study.

Method
The schematic architecture of the current study is given in Fig. 1. The architecture of the proposed model contains a 3-dimensional CNN embedded transformer module and a metaheuristic optimization module for the GMM based segmentation.

Preprocessing
Preprocessing an image before inputting it into a deep learning model is a routine practice. Medical images typically have issues such as poor contrast, blur, and medical artifacts. In the current model, we address two problems: deblurring and resizing of the image.

Unsharp filter for blur removal
The colposcope images are typically blurred to an extent due to the bodily movements of the patient. In order to reduce the blur, a sharpening filter called the unsharp filter is employed. It is an image enhancement technique frequently employed to sharpen images. The filter works by creating a blurred version of the original image and subtracting it from the original image. The blurred version is obtained by applying a Gaussian blur kernel [38] to the original image through convolution, which has the effect of diminishing the high-frequency details present in the image. Once the blurred image is subtracted from the original image using the equation in Fig. 2, the resulting high-pass image is multiplied by a scaling factor and added to the original image, yielding a sharper, high-quality image. Fig. 3 shows the blurred image and the resulting image after applying the unsharp mask.
Step 4: Multiply the high-pass image by the scaling factor.
Step 5: Add the scaled image to the original image to obtain the clearer version of the image without blur.
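The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration for a single-channel image, not the authors' implementation; the values of `sigma` and `amount` (the scaling factor) are assumptions, since the paper does not state them here.

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius):
    """Normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(image, sigma=2.0):
    """Separable Gaussian blur of a 2-D image with reflect padding."""
    radius = int(3 * sigma)
    k = gaussian_kernel_1d(sigma, radius)
    padded = np.pad(image.astype(float), radius, mode="reflect")
    # Convolve along rows, then along columns; "valid" restores the input size.
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def unsharp_mask(image, sigma=2.0, amount=1.0):
    """Sharpen: original + amount * (original - blurred), clipped to [0, 255]."""
    image = image.astype(float)
    high_pass = image - gaussian_blur(image, sigma)   # high-frequency detail
    return np.clip(image + amount * high_pass, 0.0, 255.0)
```

Larger `amount` values strengthen edges but also amplify noise, so in practice the factor would be tuned on representative colposcope images.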

Bilinear interpolation for resizing
The images are resized to a uniform size of 512 x 512 using this technique. The bilinear interpolation algorithm calculates the location of each new pixel by dividing the location of the original pixel by the scaling factor, and then computes its value as a distance-weighted average of the four nearest source pixels. For example, if the original image is being scaled down by a factor of 2, the new image will have half as many pixels in each dimension. Fig. 4 demonstrates the resizing of the cervix images after using bilinear interpolation.

Fig. 4. Images after resizing to 512 x 512 uniform size using bilinear interpolation (panels: original image, cropped image)
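As an illustration of the resizing step, the following sketch implements bilinear interpolation directly in NumPy. It assumes a pixel-centre coordinate mapping, which is one common convention; the paper's exact mapping equation is not reproduced here, and the function name is hypothetical.

```python
import numpy as np

def resize_bilinear(image, out_h, out_w):
    """Resize a 2-D (H, W) or 3-D (H, W, C) image with bilinear interpolation."""
    in_h, in_w = image.shape[:2]
    img = image.astype(float)
    # Map each output pixel centre back to fractional source coordinates.
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]          # vertical weights, shape (out_h, 1)
    wx = (xs - x0)[None, :]          # horizontal weights, shape (1, out_w)
    if img.ndim == 3:                # broadcast the weights over channels
        wy, wx = wy[..., None], wx[..., None]
    # Blend the four neighbours: first horizontally, then vertically.
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

small = resize_bilinear(np.arange(16.0).reshape(4, 4), 2, 2)
```

Calling `resize_bilinear(image, 512, 512)` would bring an arbitrary colposcope frame to the uniform size used in this study.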

Specular reflection removal
Acetowhite lesions (AW) are abnormalities that appear white or pale in color when viewed after acetic acid application to the uterine cervix. In the context of cervix images, acetowhite lesions may be indicative of precancerous changes in the tissue [39]. Since specular reflections (SR) have the same morphological appearance as acetowhite lesions, the diagnosis will be hindered by SR. In order to overcome this drawback, it is essential to remove SR before classifying the cervigram. Over the last couple of decades, several researchers have proposed SR removal techniques using machine and deep learning methods [40].
The fundamental principle of specular reflection areas involves the reflection of light from a smooth, shiny surface. This type of reflection produces a clear and sharp image of the light source as opposed to diffuse reflection, which produces a more scattered and diffused image. The angle at which the light strikes the surface, as well as the angle at which it is reflected, plays a critical role in determining the characteristics of the reflected light. Additionally, the smoothness and shininess of the surface can affect the clarity and sharpness of the reflected image (Fig. 5).
Fig. 5. Specular reflections on the cervix surface
Specular reflections obstruct the efficient analysis of cancerous changes of surface regions [40]. For instance, [6] has explored the role of SR in confusing the endoscope procedure. SR removal has two phases. The first is to locate the specular region and remove the SR pixels. The second is to paint these areas back to their original morphology. In the detection phase, the image is generally transformed into a different color space to facilitate further processing of the region of interest (ROI). For instance, the image representations used are RGB [30], grey-level [41], HSV, and HSI [42], together with a threshold value to identify the SR. Subsequently, the removed pixels are replaced through inpainting to preserve the image morphology.

Specular reflection identification
A specular reflection is a mirror-like reflection in which the angle of reflection equals the angle of incidence, so the reflected light leaves the surface in a single direction rather than being scattered. In a bi-dimensional histogram, specular reflection refers to the symmetrical nature of the histogram when it is reflected along the x-axis or y-axis. This means that the shape of the histogram remains the same after it is reflected, and the relative frequencies of the data points are preserved. Therefore, a bi-dimensional histogram decomposition is used to detect specular reflections, whose formula is given in equation (2).
In equation (2), m stands for pixel intensity, s denotes saturation, and (r, g, b) = (red, green, blue). Two important threshold values ($m_{max}$, $s_{max}$) determine the specular reflection pixels through the bi-dimensional histogram. The two independent criteria that must be met for a pixel to be considered as SR are given in equation (3).
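A minimal sketch of intensity and saturation thresholding is given below. Since equations (2) and (3) are not reproduced in this text, the definitions of m and s and the threshold fractions (one half of the maximum intensity, one third of the maximum saturation) follow a formulation commonly used for bi-dimensional histogram SR detection in endoscopic images and should be read as assumptions, not as the authors' exact equations.

```python
import numpy as np

def specular_mask(rgb, m_frac=0.5, s_frac=1.0 / 3.0):
    """Boolean mask of likely specular pixels in an (H, W, 3) RGB image.

    Assumed definitions (not confirmed by the paper):
      m = (r + g + b) / 3
      s = 1.5 * (r - m) if b + r >= 2 * g, else 1.5 * (m - b)
    A pixel is flagged when it is bright (m >= m_frac * m_max)
    and weakly saturated (s <= s_frac * s_max).
    """
    rgb = rgb.astype(float)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    m = (r + g + b) / 3.0
    s = np.where(b + r >= 2.0 * g, 1.5 * (r - m), 1.5 * (m - b))
    m_max, s_max = m.max(), s.max()
    return (m >= m_frac * m_max) & (s <= s_frac * s_max)

# Toy image: reddish tissue, one dark pixel and one white glare pixel.
toy = np.full((4, 4, 3), (200, 80, 80), dtype=np.uint8)
toy[0, 0] = (10, 10, 10)
toy[2, 2] = (250, 250, 250)
mask = specular_mask(toy)
```

Both conditions are needed because bright but strongly coloured tissue should not be flagged; only bright, nearly colourless pixels behave like glare.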

Specular reflection removal
Image linear correction is a simple and effective way to improve the quality of an image and is often used as a preprocessing step for more advanced image analysis techniques. It involves applying a linear transformation to the pixel values of the image in order to stretch or compress the range of intensity values. There are several different techniques that can be used for linear image correction. The pixels must be replaced in such a way that the information of the cervix image is preserved. Routinely, each SR pixel is replaced with the mean of the pixels surrounding it.

Inpainting of deleted specular pixels
The Laplace equation is a partial differential equation that describes the behaviour of a two-dimensional surface. It can be used in image inpainting, a technique used to restore damaged images. In this context, the Laplacian can be used to characterize the SR regions in the image, which are then inpainted. In order to apply the Laplacian in image inpainting, the image is first convolved with a Laplacian kernel to enhance the edges and boundaries. The inpainting is then performed in the areas of the image that have SR, using the enhanced edges and boundaries as a guide. The final step is to smooth the inpainted areas and blend them with the rest of the image, to produce a seamless and natural-looking result. The Laplacian transformation is given in equation (4).
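One simple way to realize Laplace-based inpainting is to solve the discrete Laplace equation inside the masked region, with the surrounding pixels acting as boundary values. The sketch below does this with Jacobi iterations on a single-channel image; it is a simplified harmonic fill shown for illustration, not the authors' exact pipeline, and the iteration count is an assumption.

```python
import numpy as np

def laplace_inpaint(image, mask, n_iter=500):
    """Fill masked pixels so that the discrete Laplacian vanishes inside the mask.

    image: 2-D array; mask: boolean array, True where pixels were removed.
    Each Jacobi sweep replaces every masked pixel by the mean of its four
    neighbours, while unmasked pixels stay fixed as boundary conditions.
    """
    out = image.astype(float).copy()
    out[mask] = out[~mask].mean()                 # neutral starting guess
    for _ in range(n_iter):
        padded = np.pad(out, 1, mode="edge")
        neighbours = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                      padded[1:-1, :-2] + padded[1:-1, 2:]) / 4.0
        out[mask] = neighbours[mask]
    return out

# Toy example: a horizontal intensity ramp with a square "glare" hole.
ramp = np.tile(np.arange(16.0), (16, 1))
hole = np.zeros_like(ramp, dtype=bool)
hole[6:10, 6:10] = True
damaged = ramp.copy()
damaged[hole] = 255.0
restored = laplace_inpaint(damaged, hole)
```

The fill is smooth by construction, so it suits small glare spots; large regions would lose texture and would need a more elaborate inpainting method.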

Segmentation Using Enhanced Gaussian Mixture Model
A colposcope image typically includes the surrounding organs and medical equipment in addition to the cervix region. In order to efficiently implement the transformer deep learning model, the image needs to be segmented to remove this extraneous content. Hence, we have used a multivariate Gaussian mixture model enhanced with an adaptive Mexican axolotl algorithm. A colposcope image is shown in Fig. 6.

Fig. 6. Colposcope image
The multivariate Gaussian distribution is an extension of the univariate model that can fit vectors, which are the pixels in this case. The mixture model is a probabilistic model that represents the data as a mixture of multiple Gaussian distributions, describing the distribution of data in multiple dimensions. In the context of image segmentation, each pixel in the image is treated as a data point with multiple attributes. Let x be an input vector with d values. The distribution is parameterized by a mean µ (a vector of length d) and a covariance matrix Σ (a d x d matrix). The probability density function is given by:

$$N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{T} \Sigma^{-1} (x-\mu)\right)$$

where the factor $\frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}}$ is a constant that ensures the integral value is 1, and $(x-\mu)^{T} \Sigma^{-1} (x-\mu)$ is a scalar that determines the likelihood of the value x belonging to cluster k. These parameters were optimized using a metaheuristic algorithm called Mexican axolotl. The segmented cervix image is presented in Fig. 7.
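The density above translates directly into code. The following sketch evaluates the multivariate Gaussian density, assigns each pixel to the mixture component with the largest weighted density, and computes the mixture log-likelihood (the quantity that a maximum-likelihood fit seeks to increase). The two-component toy parameters are hypothetical and serve only to show the mechanics.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Multivariate normal density for each row of X (shape (n, d))."""
    d = X.shape[1]
    diff = X - mean
    inv = np.linalg.inv(cov)
    maha = np.einsum("ni,ij,nj->n", diff, inv, diff)   # (x-mu)^T Sigma^-1 (x-mu)
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * maha)

def gmm_segment(X, weights, means, covs):
    """Return a hard cluster label per pixel and the mixture log-likelihood."""
    dens = np.stack([w * gaussian_pdf(X, m, c)
                     for w, m, c in zip(weights, means, covs)], axis=1)
    labels = dens.argmax(axis=1)                 # most probable component
    log_lik = np.log(dens.sum(axis=1)).sum()     # sum over pixels
    return labels, log_lik

# Hypothetical 2-component model over RGB pixels (dark background vs tissue).
pixels = np.array([[10.0, 5.0, 0.0], [190.0, 210.0, 205.0]])
weights = [0.5, 0.5]
means = [np.zeros(3), np.full(3, 200.0)]
covs = [100.0 * np.eye(3), 100.0 * np.eye(3)]
labels, log_lik = gmm_segment(pixels, weights, means, covs)
```

For a full image of shape (H, W, 3), the pixels would first be reshaped to (H * W, 3) and the labels reshaped back to (H, W) to obtain the segmentation map.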

Adaptive mexican axolotl optimization algorithm
The hyperparameters of the Gaussian mixture described above are the mean vector value, mixture coefficient value, and the covariance matrix's eigenvector. These values are computed using a nature-based metaheuristic algorithm, Mexican axolotl [43]. A meta-heuristic optimization algorithm is an algorithm that searches for near-optimal solutions to a given problem by using a combination of heuristic techniques. These algorithms are useful for solving complex problems where an exact solution is not practical or possible. Meta-heuristics are not problem-specific but rather provide a framework to generate solutions to any given problem. Mexican axolotl works by an opposition-based learning strategy. The steps of the Mexican axolotl algorithm are given below.
Step 1: Initializing the solution: In the Mexican axolotl algorithm, the initial position of each particle in the population greatly influences the evaluation of the population. The initial solution is created from the parameters: normalized mixture coefficient, the d-th dimension of the mean vector, and the d-th eigenvalue. Initially, the values are assigned randomly. The population of this algorithm consists of Np solutions, where Np represents the population size and A is an axolotl (solution). In each solution, $\rho_k$ represents the mixture coefficient and $\mu_{kd}$, which is not yet normalized, represents the mean.
Step 2: Opposite solution generation: Subsequently, for every solution initialized, an alternative (opposite) solution is created. The opposite solution $A'_i$ is derived from each solution $A_i$.
Step 3: Fitness calculation: Once the solutions are initialized, the fitness is calculated for every solution. The maximum likelihood estimation (MLE) function is taken as the function to calculate fitness. The maximum likelihood function enhances the resulting segmentation accuracy. The fitness is given in equation (9). As stated in (8), the best parameter value is chosen as the one with the highest fitness score. If this is not the case, then the solution is adjusted in the subsequent step.
Step 4: Updating the solution space using AMAO: The three phases that make up this algorithm are transition, injury and restoration, and reproduction and assortment.
The best-adapted male axolotl is determined by the fitness of the solution and a transition parameter within the range of 0 to 1. This male axolotl changes the coloration of its body parts in accordance with the value of the transition parameter. Similarly, female axolotls are identified as progressing from larvae to adults to highly adapted females through the use of equation (10). The female axolotl is represented by $Y_n$.

$$Y_{nm} \leftarrow Y_{nm} + (Y_{best,m} - X_{nm}) \cdot y \qquad (11)$$
A number rand between 0 and 1 is chosen randomly to decide which individuals to pick for random transition. Additionally, the inverse probabilities of transition for female and male axolotls are calculated, and the random transition of individuals occurs as given in equations (13) and (14). This situation holds true under the condition of rand being less than the inverse probability value. The corresponding update is given in equation (15).
Injury & Restoration: When moving through the water, axolotls can be at risk of accidents and injuries. This risk is taken into account during the healing and rehabilitation stages.
Reproduction and Assortment: For each female axolotl, a male is selected through a process of competition in order to produce offspring. The male deposits sperm, which the female collects to produce eggs containing genetic material from both parents. Each pair generates two eggs. Afterward, the female lays the eggs and waits for them to hatch. The hatchlings then compete with their parents for survival using the fitness function. If the young axolotls are better adapted than their parents, they take their place.
Step 5: Termination: The aforementioned steps are iterated until achieving the optimal solution or the initial value.Alternatively, if this condition is not met, the algorithm will be concluded.The chosen value is then applied to the encryption process 2.5.TelsNet embedding A dataset containing one saline, acetic acid, and iodine images pertaining to each patient is inputted into the model, and the latent features are developed.The original image is divided into smaller patches of size  × , and then these patches are flattened to create a sequence of images denoted by   .The rate of change between the sequential images is not highly significant.In the event of attempting to analyze an image sequence by projecting it into latent features using a linear or convolutional layer, we may miss out on the spatial relationships between adjacent image patches.This is because these layers typically process individual patches in isolation without considering their context.This could result in the latent features being incomplete and failing to capture the full meaning and details of the original image sequence.The proposed TelsNet has the ability to consider both the temporal and spatial dimensions of the sequential image frames when performing convolution operations.
The temporal dimension refers to the time-based aspect of the series of images, while the spatial dimension refers to the visual aspects of the individual images. By taking both dimensions into account, the TelsNet model can analyze how the visual elements of the images change over time and how these changes relate to the evolution of the lesion. We take the key acetic acid frame and use the TelsNet model to analyze the feature dynamics between each image patch and its adjacent patches, thus projecting the latent features into a vector subspace. The model is given by = 3 (image sequence), where  is the 3-dimensional feature generated by the network,  and  are the width and length of the images (512 × 512), and  and  are parameters of the network.
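The patch-splitting step described above can be illustrated with a minimal sketch. The patch size of 16 and the helper name `to_patch_sequence` are assumptions made for illustration only; the paper's own patch size and its temporal convolution are not reproduced here.

```python
import numpy as np

def to_patch_sequence(frames, patch):
    """Split each frame into non-overlapping patch x patch tiles and flatten them.

    frames: array of shape (T, H, W, C), one frame per stain
            (saline, acetic acid, iodine).
    Returns an array of shape (T, N, patch * patch * C),
    where N = (H // patch) * (W // patch).
    """
    t, h, w, c = frames.shape
    assert h % patch == 0 and w % patch == 0, "patch must divide H and W"
    gh, gw = h // patch, w // patch
    tiles = frames.reshape(t, gh, patch, gw, patch, c)
    tiles = tiles.transpose(0, 1, 3, 2, 4, 5)  # (T, gh, gw, patch, patch, C)
    return tiles.reshape(t, gh * gw, patch * patch * c)

# Hypothetical input: three 512 x 512 RGB frames for one patient.
frames = np.zeros((3, 512, 512, 3), dtype=np.float32)
seq = to_patch_sequence(frames, patch=16)  # patch size is an assumed value
print(seq.shape)  # (3, 1024, 768)
```

Keeping the frame axis separate from the patch axis preserves the ordering of the three stains, so a later layer can still compare the same patch position across frames.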

Transformer embedding
The multi-head self-attention technique is used to identify relationships between different elements in a sequence of data. It extends single-head attention by incorporating multiple attention mechanisms, or "heads," that operate in parallel to evaluate the different elements. Each attention head processes the input sequence independently, generating its own set of attention weights. These weights are then combined to capture complex relationships between the different elements in the sequence. In the current problem, each image patch is assigned a weight based on its importance, as determined by the attention mechanism. Subsequently, the network performs classification based on these attention weights. The current network is built using a transformer encoder to solve the cervix image classification and recognition problem.
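As an illustration of the mechanism described above, the following is a minimal NumPy sketch of scaled dot-product multi-head self-attention. The dimensions, the random projection matrices, and the function names are assumptions chosen for the example; it is not the trained TelsNet encoder.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, wq, wk, wv, wo, num_heads):
    """x: (N, D) patch embeddings; wq, wk, wv, wo: (D, D) projection matrices."""
    n, d = x.shape
    dh = d // num_heads  # per-head dimension

    def split(m):  # (N, D) -> (heads, N, dh)
        return m.reshape(n, num_heads, dh).transpose(1, 0, 2)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)  # (heads, N, N)
    weights = softmax(scores, axis=-1)               # each row sums to 1
    heads = weights @ v                              # (heads, N, dh)
    merged = heads.transpose(1, 0, 2).reshape(n, d)  # concatenate the heads
    return merged @ wo, weights

rng = np.random.default_rng(0)
n_patches, dim, n_heads = 6, 8, 2  # illustrative sizes
x = rng.normal(size=(n_patches, dim))
wq, wk, wv, wo = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(4))
out, attn = multi_head_self_attention(x, wq, wk, wv, wo, n_heads)
print(out.shape, attn.shape)  # (6, 8) (2, 6, 6)
```

Each row of `attn[h]` gives the weights that one patch assigns to every other patch in head `h`, which is the per-patch importance referred to in the text.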
Because the changes in the epithelium are mild, the changes in acetowhite lesions are sometimes minimal and hard to capture. The remaining cervix region, however, does not change, which directs the multi-head self-attention transformer toward the patch areas that do evolve. When training models with attention mechanisms, some image patches may be assigned a low weight, indicating that they are less relevant to the task at hand. To help the model converge more efficiently and learn the attention weights of each patch more accurately, these low-weight patches are excluded from the training process. By focusing only on the most important patches during training, the model learns the attention weights more effectively and achieves better overall performance. This technique, which we refer to as dropout, improves the efficiency and accuracy of the model by prioritizing the patches that are most relevant to the task. The architecture of the proposed model is shown in Fig. 8. This weight assignment helps in marking the lesion area.
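The exclusion of low-weight patches can be sketched as a simple top-k selection over attention weights. The keep ratio of 0.5 and the function name `keep_high_weight_patches` are hypothetical; the paper does not state the threshold it uses.

```python
import numpy as np

def keep_high_weight_patches(patches, weights, keep_ratio=0.5):
    """Keep the patches with the largest attention weights.

    patches: (N, D) patch features; weights: (N,) attention weight per patch.
    Returns the retained patches and their original indices (in original order).
    """
    k = max(1, int(round(keep_ratio * len(weights))))
    keep = np.sort(np.argsort(weights)[-k:])  # indices of the k largest weights
    return patches[keep], keep

patches = np.arange(8, dtype=np.float32).reshape(4, 2)
weights = np.array([0.05, 0.40, 0.10, 0.45])  # illustrative attention weights
kept, idx = keep_high_weight_patches(patches, weights, keep_ratio=0.5)
print(idx)  # [1 3]
```

Sorting the retained indices keeps the surviving patches in their original sequence order, so positional information is not scrambled by the selection.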
Metaheuristic Optimization: To optimize the model's features and attention weights, a metaheuristic loss model based on the gravitational search algorithm (GSA) is used to ensure convergence and improve the model's performance.
Feature vector   generated by TelsNet is given as input to the encoder. A classification token  c,q is generated and fused with   . To enable a transformer-based model to take positional information into account during training, a learnable positional embedding called  pos is associated with each image patch. This embedding is then combined with the patch data and fed into the transformer encoder model. The process of combining the positional embedding with the patch data is referred to as weighted fusion. This step weighs the importance of the patch data and the positional embedding relative to each other based on the requirements of the task being performed. Fig. 9 shows the layout of the layers in the model. The resulting feature represents both the image patch and its position in the sequence. By incorporating positional information in this way, the model can better capture the relationships between different patches in the sequence and achieve greater accuracy in its analysis. The feature equation is given by: Here, the number of patches is denoted by ,  , is a class token,  1, is the embedded patch projection, and   denotes the position embedding. The class token  , is used to learn the attention weight of the lesion projection.
To enable a transformer model to encode the position of image lesions, it needs to know the order relationship of each patch in the sequence. The TelsNet embedding can provide this information because it captures both the temporal and spatial dimensions of the image patches. By using the TelsNet embedding to extract features that encode the relationships between each image lesion patch and its neighboring patches, we ensure that the transformer model receives accurate positional information. This allows the model to better capture the context and structure of the image sequence, leading to more accurate analysis. The positional embedding   is input into the transformer for a weighted fusion with the new features. Lastly, the feature vectors generated by the transformer encoder are fused with the features produced by TelsNet for cervix lesion identification.
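The assembly of the encoder input from a class token, patch features, and a positional embedding can be sketched as follows. This sketch assumes the plain additive combination used in standard vision transformers; the weighted fusion described above may weigh the two terms differently, and all names and sizes here are illustrative.

```python
import numpy as np

def build_encoder_input(patch_features, class_token, pos_embedding):
    """Prepend a class token to the patch features and add positional embeddings.

    patch_features: (N, D); class_token: (1, D); pos_embedding: (N + 1, D).
    Assumption: simple addition stands in for the paper's weighted fusion.
    """
    tokens = np.concatenate([class_token, patch_features], axis=0)  # (N + 1, D)
    return tokens + pos_embedding

rng = np.random.default_rng(0)
n_patches, dim = 4, 8
patch_features = rng.normal(size=(n_patches, dim))            # stand-in for TelsNet features
class_token = np.zeros((1, dim))                              # learnable in practice
pos_embedding = rng.normal(size=(n_patches + 1, dim)) * 0.02  # learnable in practice

encoder_input = build_encoder_input(patch_features, class_token, pos_embedding)
print(encoder_input.shape)  # (5, 8)
```

After the encoder, the row corresponding to the class token is typically the one passed to the classification head.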
The loss given by the metaheuristic GSA is 0.106, which is considered satisfactory and is lower than that obtained with the loss functions used in traditional networks. This value indicates that the network model is optimized with the desired convergence.
The gravitational search algorithm (GSA) [44] is a type of optimization algorithm that uses Newton's gravitational law as its basis. In this algorithm, each agent is treated as an object, and gravitational forces act on them, causing all objects to move towards those with heavier masses, which represent the optimal solution in the search space [45]. The position of each agent, which corresponds to a potential solution to the problem being solved, is updated repeatedly until a termination condition is met.
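The agent update described above can be sketched in a simplified form. This is a minimal illustration under stated assumptions: it minimizes a generic objective (a sphere function in the example) rather than the network loss used in this work, it omits the Kbest restriction of the original algorithm, and the parameter values (`g0`, `alpha`, the bounds, the agent count) are illustrative choices.

```python
import numpy as np

def gsa_minimize(objective, dim, n_agents=20, n_iter=100,
                 bounds=(-5.0, 5.0), g0=100.0, alpha=20.0, seed=0):
    """Simplified gravitational search: agents attract each other by mass."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, size=(n_agents, dim))  # agent positions
    v = np.zeros_like(x)                           # agent velocities
    best_x, best_f = None, np.inf
    for t in range(n_iter):
        fit = np.array([objective(p) for p in x])
        i = fit.argmin()
        if fit[i] < best_f:                        # track the best solution seen
            best_f, best_x = float(fit[i]), x[i].copy()
        # Lower (better) fitness gives a heavier mass.
        worst, best = fit.max(), fit.min()
        m = (worst - fit) / (worst - best + 1e-12)
        mass = m / (m.sum() + 1e-12)
        g = g0 * np.exp(-alpha * t / n_iter)       # gravitational constant decays
        diff = x[None, :, :] - x[:, None, :]       # diff[i, j] = x_j - x_i
        dist = np.linalg.norm(diff, axis=2)        # pairwise distances
        w = rng.random((n_agents, n_agents)) * mass[None, :] / (dist + 1e-12)
        acc = g * (w[:, :, None] * diff).sum(axis=1)
        v = rng.random(x.shape) * v + acc          # stochastic velocity update
        x = np.clip(x + v, lo, hi)                 # keep agents inside the bounds
    return best_x, best_f

def sphere(p):
    return float(np.sum(p ** 2))

best_x, best_f = gsa_minimize(sphere, dim=2)
print(best_x, best_f)
```

Because the gravitational constant decays over the iterations, early steps explore the search space widely while later steps make smaller, more local moves.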

Results and Discussion
This section presents a comprehensive analysis of the results achieved and their significance in diagnosing cervical cancer through segmentation and classification of colposcope images.

Dataset
The dataset was obtained from the International Agency for Research on Cancer (IARC) as a part of the study. Sequential colposcope images of 200 patients were collected after the application of saline, 5% acetic acid, and Lugol's iodine, in that order. A total of 916 images were collected, because the cervix of some patients was photographed repeatedly. These images are divided in an 80-20 ratio for training and testing the model, which is standard practice in machine learning.
The images are augmented using transformation techniques such as flipping and rotation to increase the volume of the dataset, which is crucial for attaining higher and more reliable accuracy. Fig. 10 shows the transformation operations on a single image.
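The flip and rotation transformations mentioned above can be sketched with NumPy. The specific set of variants below is an assumption for illustration; the exact transformations applied in this study are those shown in Fig. 10.

```python
import numpy as np

def augment(image):
    """Return flipped and rotated variants of an (H, W, C) image."""
    return {
        "original": image,
        "horizontal_flip": np.fliplr(image),
        "vertical_flip": np.flipud(image),
        "rotate_90": np.rot90(image, k=1),   # rotates in the (H, W) plane
        "rotate_180": np.rot90(image, k=2),
        "rotate_270": np.rot90(image, k=3),
    }

image = np.arange(12).reshape(2, 2, 3)  # tiny stand-in for a colposcope image
variants = augment(image)
print(len(variants))  # 6
```

Augmentation of this kind is normally applied to the training portion only, so that the test images remain unmodified.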

Experimental environment
The segmentation framework is implemented in Python on Google Colab; the local system has 16 GB of RAM, an Intel Core i7 processor, a 256 GB SSD, and the Windows 10 operating system. The efficiency of the proposed method is evaluated using a range of widely used performance metrics.

Evaluation metrics
To evaluate the proposed framework, the following metrics are used: accuracy, sensitivity, specificity, Jaccard index, and Dice coefficient.
where $S$ is the region of overlap. If $S_g \cap S_t$ is empty, then $J(S_t, S_g) = 0$.
Accuracy measures the overall correctness of a diagnostic model's predictions, reflecting the proportion of correctly classified cases (both positive and negative) out of the total cases evaluated. High accuracy provides confidence in the model's ability to make correct predictions. This is crucial for medical practitioners when making treatment decisions based on the model's output. Incorrect diagnoses can lead to inappropriate treatments or delayed interventions, potentially endangering patient safety. High accuracy reduces the risk of misdiagnosis and its associated consequences.
Sensitivity, also known as recall, measures the model's ability to correctly identify true positive cases among all actual positive cases. In cervical cancer, sensitivity is crucial for early detection. It ensures that the model can identify cases of cancer (or its markers) even at the earliest stages, when intervention is most effective. High sensitivity reduces the likelihood of false negatives, which occur when the model fails to detect actual cases of cancer. A single false negative diagnosis carries a substantial burden, particularly in the context of cancer. The ramifications of a missed cancer diagnosis are profound, as individuals may unknowingly continue their daily lives while the disease progresses within their bodies. From both moral and ethical standpoints, diagnostic support systems must prioritize the attainment of the highest possible sensitivity.
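The five metrics can be computed as in the following sketch, which uses their standard definitions. The function names and the example counts are illustrative and are not results from this study; the Jaccard index is set to 0 when the overlap is empty, as stated above.

```python
import numpy as np

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (recall), and specificity from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def overlap_metrics(pred_mask, true_mask):
    """Jaccard index and Dice coefficient between two binary masks."""
    pred, true = pred_mask.astype(bool), true_mask.astype(bool)
    inter = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    total = pred.sum() + true.sum()
    jaccard = inter / union if union else 0.0
    dice = 2 * inter / total if total else 0.0
    return {"jaccard": float(jaccard), "dice": float(dice)}

# Illustrative counts and masks only.
cls = classification_metrics(tp=8, fp=2, tn=6, fn=4)
seg = overlap_metrics(np.array([[1, 1], [0, 0]]), np.array([[1, 0], [1, 0]]))
print(cls["accuracy"], cls["specificity"], seg["dice"])  # 0.7 0.75 0.5
```

In this example the sensitivity is 8/12, which shows how a model can reach a moderate accuracy while still missing a third of the positive cases.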

Results
This section discusses the results attained for segmentation and classification.

Segmentation Results
Comparative analysis of the enhanced Gaussian mixture model (EGMM) for segmentation is carried out against two state-of-the-art clustering methods: k-means and the univariate Gaussian mixture model (GMM). The proposed method consistently outperformed the baseline methods. Fig. 11 shows the segmentation outline, while Table 1 and Table 2 display the segmentation results. As shown in Table 1, of the two baselines, the traditional GMM achieved the better accuracy, 74.80%. The EGMM improves on this by optimizing its parameters through AMAO, and it achieved an accuracy of 96.1%. In addition, the proposed framework achieved a sensitivity of 97.9%, a specificity of 91.6%, a Dice score of 0.939, a Jaccard index of 0.869, and a minimal loss of 0.173. The results suggest that the methodology proposed in this study produced better outcomes than the existing state-of-the-art approaches. This can be attributed to the use of the AMAO algorithm for optimizing the EGMM parameters. Fig. 12 displays the ROC curve.

Comparative analysis
The proposed model is compared with three state-of-the-art models pre-trained on ImageNet, namely AlexNet, VGG16, and ResNet50, using transfer learning techniques. The final layer is set to three classes to represent the 'normal', 'pre-cancer', and 'cancer' class labels. The details of the pre-trained models are given below. Additionally, the model is evaluated against other proposed models to confirm that the results achieved are the best.

AlexNet
AlexNet is a convolutional neural network architecture that gained popularity due to its success in the ImageNet Large Scale Visual Recognition Challenge. It uses key building blocks such as max pooling, convolutions, and dense layers to extract features from input images. The architecture comprises eight layers with learnable parameters: five convolutional layers, some of which are followed by max pooling layers, and three fully connected layers. Additionally, the model includes two normalizing layers and one softmax layer. Each convolutional layer is paired with a rectified linear unit (ReLU) activation function. The AlexNet model achieved an accuracy of 0.702. Fig. 13 shows the accuracy plot of the AlexNet model.

ResNet 50
ResNet50 is a convolutional neural network comprising 50 layers: 48 convolutional layers, a max pool layer, and an average pool layer. It performs more than $3.8 \times 10^9$ floating-point operations. The ResNet50 model is designed around residual (skip) connections together with convolutional filters of various sizes, addressing the degradation problem commonly found in deep CNN models. This approach has also contributed to faster training times. Furthermore, the model's use of a limited number of filters translates to even quicker performance. The ResNet architecture takes inputs whose height and width are multiples of 32. The ResNet model gave an accuracy score of 0.8462, which is the highest among the transfer learning models used in this study. The accuracy plot of ResNet is shown in Fig. 14.

VGG16
VGG16 is a convolutional neural network architecture that gained popularity due to its deep structure and use of small 3 × 3 convolutional filters. The model comprises 16 layers, consisting of 13 convolutional layers and 3 fully connected layers. In addition, VGG16 incorporates max pooling layers and dropout regularization to mitigate the risk of overfitting. For classification tasks, the final layer of the model is often a softmax layer. VGG16 has proven effective in a variety of computer vision tasks, such as image classification, object detection, and segmentation. The VGG16 model achieved an accuracy score of 0.8174 (Fig. 15), making it the second best among the transfer learning models used in the current study.

Comparative analysis with state-of-the-art models
In the context of cancer diagnosis, the impact of a false negative surpasses that of a false positive. The findings above show that our proposed technique produced better results than the previous techniques. Table 2 shows the results in terms of accuracy, sensitivity, and specificity. The TelsNet model attained 92.7% accuracy with 73.4% sensitivity and 82.1% specificity. Fig. 16 shows the model's training and validation accuracy. TelsNet's performance and high diagnostic accuracy hold promise for its integration into clinical practice. With its ability to identify cervical cancer-related lesions in colposcope images, TelsNet can serve as a valuable decision support tool for clinicians. This integration would streamline and enhance the diagnostic process, helping patients receive timely and accurate assessments of their cervical health.

Fig. 16. Performance of the proposed TelsNet
TelsNet's potential influence on cervical cancer diagnosis extends beyond improved consistency. Its high accuracy, as demonstrated in our experiments, indicates that it has the potential to detect cervical cancer at an early stage, with the sensitivity and specificity reported in Table 2. This implies that TelsNet can contribute to the early identification of cervical malignancies, enabling timely interventions and potentially life-saving treatments.
Moreover, TelsNet's efficiency in processing colposcope images and generating rapid diagnostic assessments can reduce the turnaround time for results. Quicker diagnosis means that patients can receive follow-up care and treatment promptly, further enhancing their chances of positive outcomes. TelsNet can serve as a valuable tool in the hands of clinicians, offering them a second opinion and aiding in more accurate decision-making. While the ultimate diagnostic decision will still be made by a healthcare professional, TelsNet's support can enhance their confidence in the diagnosis and provide additional insights into lesion characteristics.

Discussion
Cervical cancer remains a significant global health concern, particularly in underdeveloped and developing countries, where high mortality and morbidity rates persist. Early intervention is pivotal for complete remission and cure, and globally uniform cervical cancer screening is a step toward achieving this goal. However, manual diagnostic methods have several limitations that hinder early detection and accuracy. In this study, we developed a deep learning framework to enhance the diagnosis of cervical cancer using colposcope images.
The use of deep learning techniques, including transformer-based models and GMM segmentation with hyperparameter tuning, holds significant implications for cervical cancer diagnosis. This innovative model has the potential to revolutionize cervical cancer diagnosis by reducing interobserver variability, enabling early detection, expediting the diagnostic process, and enhancing the overall quality of cervical health assessments. With further validation and integration into clinical practice, TelsNet can make a substantial contribution to cervical cancer prevention and patient care.
By leveraging transformer-based models, we achieve improved accuracy and sensitivity in feature extraction, allowing for more precise lesion identification. The special focus on segmentation was motivated by the limitations of previous models. The studies reviewed in the introduction pointed out that deep learning approaches so far have overlooked an independent segmentation process and proceeded with automatic segmentation. To bridge this gap, a novel segmentation method was developed. The use of the Mexican Axolotl Algorithm for hyperparameter tuning enhances the optimization of GMM parameters, further improving segmentation accuracy.
While our approach demonstrates promise in enhancing cervical cancer diagnosis, it is not without limitations. Although the sensitivity of the segmentation step is 97.9%, the sensitivity of classification is modest by comparison. For a cancer diagnosis support system, we aim to reduce burdensome false negatives. As future work, we plan to increase the sensitivity of the framework while preserving its accuracy and specificity. Another noteworthy shortcoming of the model is that the classifier relies heavily on the segmentation module. In real-world point-of-care settings, low-cost colposcopes do not capture ideal cervix images [30]. This leaves scope for exploring solutions to the problem of automated segmentation.

Conclusion
Cervical cancer ranks as the fourth most prevalent malignancy among women, presenting a substantial threat to women's health globally due to its elevated mortality rates. Automated colposcopy image analysis represents a pivotal stride towards large-scale screening for cervical cancer. This paper introduces an approach for the early detection of cervical cancer by leveraging colposcope image analysis. Our proposed model, TelsNet, employs a transformer-based neural network combined with temporal lesion convolutional neural networks to identify cancerous regions within the images. To enhance model accuracy, we incorporated diverse preprocessing techniques, such as unsharp filters, bilinear interpolation, and bi-dimensional histogram equalization. We also used Gaussian mixture models to segment the images and identify regions of interest. Experimental results showed that TelsNet achieved an accuracy of 92.7%, with a sensitivity of 73.4% and a specificity of 82.1%. Comparative evaluations against state-of-the-art methods showed the superior performance of TelsNet. By automating colposcope image analysis, TelsNet has the potential to significantly enhance the efficiency and precision of cervical cancer screening, providing invaluable support to medical professionals.

Declarations
Author contribution. LM conducted the study and prepared the manuscript; JT supervised the experiment and corrected the manuscript.
Funding statement. None of the authors have received any funding or grants from any institution or funding body for the research.
Ethical statement. The authors bear responsibility for every facet of the work, ensuring that any queries concerning the accuracy or integrity of any portion of the work are thoroughly investigated and addressed.

Fig. 1. Schematic architecture of the proposed model
the fitness quotient in female and male axolotls.

Fig. 8. Encoder of the transformer architecture used in TelsNet. The model's output embedding is fused with the positional embedding and passed to the transformer encoder, which contains MLP blocks that produce the output classes

Fig. 9. Embedding architecture of TelsNet. The images are split into acetowhite, saline, and iodine patches and are projected into the embedding space. The architecture contains six convolution and four pooling layers

Table 2. Comparative analysis of TelsNet with ImageNet models