Analysis and review of the possibility of using the generative model as a compression technique in DNA data storage: review and future research agenda

The amount of data in the world keeps growing, and existing storage technology faces severe challenges: global data volume is expected to reach 175 ZB by 2025. DNA data storage is an alternative technology with great potential for storing information, particularly digital data. One of the stages of storing information in DNA is synthesis, which is very expensive; integrating compression techniques for the digital data is therefore necessary to minimize the costs incurred. One class of models used in compression is the generative model. This paper examines whether compression using generative models can be integrated into DNA data storage methods. To this end, we conducted a Systematic Literature Review, using the PRISMA method to select papers drawn from four leading databases and several additional sources. Out of 2,440 papers, we selected 34 primary papers for detailed analysis. This systematic literature review (SLR) presents and categorizes findings according to the research questions: which machine learning methods are applied in DNA storage, which compression techniques are used for DNA storage, what role deep learning plays in compression for DNA storage, how generative models relate to deep learning, how generative models are applied in compression, and how a latent space can be formed. The study highlights open problems that remain to be solved and identifies directions for future research.

Worldwide data storage demand will increase to 175 ZB (Fig. 1), or 1.75 × 10^14 GB, by 2025 [3]. This estimated demand will exceed the capacity of current storage media, whose maximum density is about 10^3 GB/mm^3 [4]. In addition, the cost of data maintenance and transmission, limited storage space, and the risk of significant data loss all call for new forms of information storage [4].
Fig. 1. Annual global data volume (modified from [3])

Almost all digital data is stored using technology that operates for a limited period. The lifespan of memory cards and chips is about five years from first use [5]. Standard hard drives are vulnerable to damage caused by high temperatures, magnetic degradation, exposure to ultraviolet light, and mechanical shock. While a solid-state drive (SSD) performs better than a hard drive, it will lose its information if left unused for more than a few months [5].
Nature, by contrast, has solved this problem in its own way since the beginning of life on Earth: the unique information that characterizes each organism is stored as a sequence of bases (A, T, C, G) within a small molecule called deoxyribonucleic acid (DNA). This method of storing information has been in use for three billion years. As an information carrier, the DNA molecule offers several advantages over conventional storage media: its high storage density, low maintenance cost, and other outstanding properties would make it a durable information storage option in the future [6].
The storage capacity of DNA is phenomenal. Castillo stated that the entirety of the internet's information could be stored in a device smaller than a cubic inch [5]. DNA is considered an ideal medium in this regard: instead of the 1s and 0s computers use to store data, DNA uses the four bases adenine, guanine, cytosine, and thymine (A, G, C, and T), paired into the two nucleotide base pairs A-T and G-C, to store information equivalent to binary code [1]. Since a single nucleotide can represent two bits of information, DNA is viewed as an ideal storage medium as the demand for high-capacity media increases. One gram of single-stranded DNA (ssDNA) can therefore encode 455 EB of information [5]. All of the data created worldwide in a single year could be stored in just 4 grams of DNA [1]. Due to its three-dimensional (3D) structure, DNA also provides ample storage space.
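The 455 EB/gram figure can be checked with back-of-envelope arithmetic, assuming the standard approximation of an average nucleotide mass of about 330 g/mol (an assumption of this sketch, not a figure stated in the text):

```python
# Back-of-envelope check of the ~455 EB/gram figure for ssDNA.
# Assumption (not from the paper): average nucleotide mass ~330 g/mol.
AVOGADRO = 6.022e23                       # particles per mole
NT_MASS = 330.0                           # g/mol per nucleotide (approximation)

nt_per_gram = AVOGADRO / NT_MASS          # nucleotides in 1 g of ssDNA
bits_per_gram = 2 * nt_per_gram           # 1 nucleotide encodes 2 bits (A/C/G/T)
exabytes_per_gram = bits_per_gram / 8 / 1e18

print(round(exabytes_per_gram))           # ~456, consistent with the cited 455 EB
```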
DNA tolerates a much wider range of temperatures than conventional media, and it uses energy millions of times more efficiently than today's personal computers. In addition, DNA offers more storage options than most media because it stores data in nonlinear structures rather than linear ones. DNA promises more opportunities for improving latency and data extraction because it permits bidirectional data reading. DNA is also safe and unlikely to be damaged by living organisms, in large part because it is invisible to the human eye [1].
In light of DNA's potential as a medium for information storage, numerous studies have been conducted to determine how digital information can be stored in DNA. Fig. 2 depicts the general process of storing information in DNA. Converting digital files of various formats, such as images, videos, music, and documents, into binary code is known as the binarization stage. Converting files already in binary format into a sequence over the four DNA bases is known as encoding; researchers continue to refine this encoding procedure. After the data is converted into a DNA sequence, DNA synthesis is performed, which involves inserting the sequence into a living organism or creating artificial DNA that is stored in a location or tube. To retrieve data stored in the DNA medium, the sequencing procedure is performed, which entails reading the bases of the DNA molecules from the medium, yielding data in the form of sequences of DNA bases. This sequence data is then decoded, i.e. converted back into binary code. Once the binary code has been recovered, it is converted back into the original data that was initially stored. In general, these are the stages by which information is stored in and extracted from DNA [7].

Fig. 2. DNA data storage process
The binarization stage converts digital data such as photos, documents, and videos into binary code. In an image, each pixel consists of three channel (RGB) color values, each in the range 0–255. For example, the pixel value 255 is converted to the eight-digit binary value 11111111, and the value 25 becomes 00011001. In general, computers read digital data as information encoded in binary [8][9].
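The binarization step above can be sketched in a few lines (an illustrative helper, with the function name chosen here for clarity):

```python
# Binarization: convert 8-bit pixel intensities (0-255) to binary strings.
def pixel_to_binary(value: int) -> str:
    if not 0 <= value <= 255:
        raise ValueError("pixel value must be in 0-255")
    return format(value, "08b")  # zero-padded 8-digit binary

# The two examples from the text:
assert pixel_to_binary(255) == "11111111"
assert pixel_to_binary(25) == "00011001"

# One RGB pixel becomes 24 bits (8 per channel):
rgb = (200, 100, 50)
bitstring = "".join(pixel_to_binary(c) for c in rgb)
print(bitstring)  # a 24-character binary string
```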
The encoding stage converts the binary code into a DNA sequence (A, C, T, and G). Several encoding techniques have been developed, as in the research by Goldman et al. [10], who encoded 739 kilobytes of hard-disk storage, with an estimated Shannon information of 5.2 × 10^6 bits, into DNA code, then synthesized this DNA, sequenced it, and rebuilt the original files with 100% accuracy. Erlich and Zielinski [11] used the DNA Fountain technique, which explored the architectural limits in terms of bytes per molecule and achieved perfect retrieval at a density of 215 petabytes per gram of DNA, orders of magnitude higher than previous reports. In addition, several other researchers have studied encoding in the DNA storage process, namely Yi Zhang et al. [12], Anavy et al. [13], Newman et al. [14], Kosuri & Church [15], Lee et al. [16], and Takahashi et al. [17].
The synthesis stage is the process of forming artificial DNA from a DNA sequence. According to Dong et al. [18], there are several synthesis techniques: (1) solid-phase phosphoramidite chemistry, (2) array-based DNA synthesis, and (3) enzymatic synthesis. Synthesis is expensive, which is why DNA storage has not yet been adopted as a mainstream digital data storage technology. For example, DNA storage costs about 800 million USD per terabyte of data, compared with tape storage at only 15 USD per terabyte [19]. Thus, techniques are needed to minimize the cost of the synthesis process.
The sequencing process reads the DNA sequence from a DNA medium; its result is the nucleotide base sequence of the DNA, namely Adenine, Cytosine, Thymine, and Guanine. According to Dong et al. [18], there are several sequencing techniques: (1) Sanger sequencing, (2) next-generation sequencing, (3) the HeliScope single-molecule sequencer, (4) Pacific Biosciences SMRT technology, (5) Oxford Nanopore technologies, and (6) single-cell genomic sequencing technologies. The decoding stage is the opposite of encoding: for example, if 00 is encoded as A, then A is decoded back to 00. Decoding is therefore usually one unit with the encoding algorithm, much like compress and decompress. Finally, the reading process returns the digital data from binary code to its initial form, for example images, videos, and others [8][9].
The cost of chemical DNA synthesis, $3,500 per megabyte of information (Fig. 3) [11], is still quite high. The expensive synthesis process is one of the primary reasons why DNA data storage technology has not been widely adopted [7]. Consequently, the data can be compressed, either at the binarization stage or as DNA sequence data, prior to the synthesis step of the DNA data storage pipeline. The goal is to minimize the amount of DNA sequence data to be synthesized so that the cost of the synthesis process is kept as low as possible.

Fig. 3. The cost of DNA synthesis and sequencing [11]

As a result of advances in multimedia and communication technology, multimedia entertainment plays a crucial role in contemporary human life, and images and video play a significant part in it. Methods for storing and transmitting image and video data become essential when internet bandwidth is constrained, especially for large, high-quality digital images. Image compression technology therefore concerns researchers, since the internet's bandwidth constraints hinder the development of image communication. Image compression aims to represent and transmit large original images using the fewest bytes possible and to restore the images with acceptable quality. Deep learning-based image compression [20] is one of the image compression techniques currently under development.
With a deep learning model, an image's features can be learned automatically rather than specified manually, and convenient features make image recognition more efficient. Traditionally, image features have been determined manually based on the prior knowledge of the model designer, and the number of features is limited; a deep learning model, by contrast, learns a practically unlimited number of features automatically. Optimizing image processing requires an effective feature extraction method. Using deep learning models, unpredictable image characteristics can be learned and exploited, for example for image security. Consequently, deep learning models can also be applied to image compression [20].
Generative models are one of the deep learning techniques utilized in image compression. A generative model describes how a data set is generated in terms of a probability model; by sampling from this model, we can generate new data [21]. Such models are also called deep generative models: the word "deep" is used because the focus is on generative models with neural network representations, which are adaptable and powerful. With the development of neural networks and the rise in computing power, the deep generative model has emerged as one of the primary directions for advancing artificial intelligence.
This review focuses on image compression using deep generative modeling. We will determine whether compression using generative models achieves sufficient image compression and whether it has been applied to DNA data storage to reduce the cost of synthesizing the DNA sequences to be stored in the DNA medium.

DNA Data Storage
Using DNA, scientists have already begun major projects to develop an alternative form of data storage. Watson and Crick published one of the oldest and most influential papers in the history of biology in the journal Nature in 1953, showing that DNA is a carrier of genetic information [22]. Since then it has been understood that DNA, the genetic information of an organism, is stored as a linear sequence over four bases. In the following decades, many researchers proposed storing arbitrary information in DNA [23]. However, these efforts were unsuccessful due to the limited knowledge of DNA synthesis and sequencing techniques at the time.
In 1988, Joe Davis created the first instance of information storage in DNA, or DNA storage [24]. The information contained in the pixel values of a "Microvenus" image was converted into a string of 0s and 1s arranged in a 5 × 7 matrix, where 1 indicates a dark pixel and 0 a bright pixel. The information was then encoded into a DNA molecule of 28 base pairs (bp) and introduced into Escherichia coli bacteria. After being successfully recovered through DNA sequencing, the original image could be viewed again. In 1999, Clelland proposed a steganography-like method based on "DNA microdots" to conceal information within DNA molecules [25]. Two years later, Bancroft introduced the idea of encoding English text directly in DNA bases, in the same way that amino acid sequences are encoded in DNA.
Church and Goldman led the field of DNA storage research in 2012 [10][26]. Church was able to store up to 659 KB of information in DNA, whereas the previous maximum successfully stored was less than 1 KB. Goldman stored even more data, amounting to 739 KB. In these two studies, the data stored in DNA included not only text but also images, sounds, PDF files, and more.
The research of Church and Goldman spawned additional work in the broader field of DNA storage, and the amount of data that can be stored has continued to increase as methods become more sophisticated. By the end of 2018, the maximum amount of data stored had reached 200 MB, held in over 13 million oligonucleotides. Alongside the continued advancement of DNA synthesis and sequencing technology, this new method of DNA storage continues to evolve, bringing practical applications of DNA storage closer to fruition (Fig. 4).

Generative Model
Generative models are one of the most indicative areas of artificial intelligence's rapid development [27][28]. A generative model can be compared to a team of counterfeiters attempting to produce and pass counterfeit currency undetected, while a discriminative model plays the police trying to detect the counterfeits. In this analogy, competition prompts both teams to perfect their techniques until the counterfeits are indistinguishable from the originals [29]. The objective of a generative model is to study a training data set and the probability distribution that could re-generate that data. Generative models rely on deep learning [29].
Deep generative models can be divided into three major categories (Fig. 5): autoregressive generative models (ARM), flow-based models, and latent variable models. Deep generative modeling is used in text analysis [30], image analysis [29], audio analysis [31], active learning [32], reinforcement learning [33], graph analysis [34], medical imaging [35], image compression [36], and other applications. Unsupervised learning is a subfield of machine learning that contains numerous algorithms with varying objectives; its primary aim is to discover something useful by analyzing datasets of unlabeled inputs. Typical examples of unsupervised learning are clustering and dimensionality reduction, and generative modeling is another. In generative modeling, training examples x are drawn from a distribution p_data(x). The objective of generative modeling algorithms is to learn a p_model(x) that closely resembles p_data(x). Using a latent variable z with a fixed prior distribution p(z), such as a Gaussian, and a decoder or generator network that computes x = f(z), generative models implicitly define the distribution p_model(x) [36].
One approach is to approximate p_data directly by writing a function p_model(x; θ) controlled by parameters θ and searching for the parameter values that bring p_data and p_model as close together as possible. In particular, maximum likelihood estimation, which minimizes the Kullback-Leibler divergence between p_data and the model, is probably the most popular approach to generative modeling. Taking the average of a set of observations to estimate the mean parameter of a Gaussian distribution is one of the simplest instances of maximum likelihood estimation. This method relies on the density function depicted in Fig. 6. In recent years, generative models such as the generative adversarial network (GAN) and the variational autoencoder (VAE) have dominated unsupervised deep learning techniques [37]. A GAN can be trained and then reused as a fixed feature extractor for supervised tasks [38]; such networks are based on the convolutional neural network (CNN) and have demonstrated superiority in visual data analysis as unsupervised learners. In another study, a sparse autoencoder was trained on large-scale image datasets to learn features [39]. This network generates a high-level feature extractor from unlabeled data that can be used for unsupervised face detection, and the resulting features are sufficiently discriminative to identify other high-level objects, such as animals or human bodies.
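The Gaussian example above can be made concrete: under maximum likelihood, the estimated mean is exactly the sample average, and the estimated variance is the (biased) average squared deviation.

```python
# Maximum likelihood estimation for a Gaussian: the MLE of the mean
# is simply the sample average, as noted in the text.
samples = [2.0, 4.0, 6.0, 8.0]

mu_mle = sum(samples) / len(samples)                               # sample mean
var_mle = sum((x - mu_mle) ** 2 for x in samples) / len(samples)   # biased MLE variance

print(mu_mle, var_mle)  # 5.0 5.0
```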

Method
In this section, the authors describe the research questions, search process, study selection criteria, and quality assessment approach. This review utilizes the guidelines and protocols of Paganelli et al. [40]. Once the research questions have been identified, a search strategy is formulated and pertinent articles are extracted from multiple scientific journal databases. The retrieved papers are subjected to the study selection criteria, and the remaining papers are then evaluated for quality. A collection of successfully identified articles was chosen following this rigorous quality testing; the authors read these papers carefully to answer the research questions.

Research Question
This study defines the following research questions (RQ) for the review. These RQs were selected because their answers explain the primary purpose of this paper and serve as a model for future research.
• How can machine learning be applied to DNA storage?
• How are compression techniques utilized in methods for DNA data storage?
• What role does deep learning play in DNA data storage compression?
• How are generative models associated with deep learning?
• How can generative models be implemented in compression strategies?
• How can a latent space for a generative model be constructed?

Search Process
The search sources are online digital databases: IEEE Xplore, Science Direct, ACM Digital Library, Wiley Online Library, and other sources accessible via the Publish or Perish application. The SLR searches for articles published from 2012 to 2022, a period chosen because the development of DNA data storage re-emerged in 2012 and has advanced quite rapidly since. Various word combinations are employed to restrict the scope of the search; each RQ uses a unique query, as presented in Table 1, for example: "method" AND "generative model" AND ("latent space" OR "latent variable").

Selection Criteria
Many articles were extracted from the electronic article search databases to compile a comprehensive review. These articles were then filtered according to the exclusion criteria in Table 2. Fig. 7 depicts the results of paper selection using the PRISMA method.

Quality Assessment
The chosen papers were then evaluated using the quality assessment method described in [41]. Table 3 lists the questions used to assess each article's quality. Articles with scores below eight were removed from the list to refine the search further. This quality assessment procedure corresponds to the eligibility stage of the PRISMA flow (Fig. 7). Table 4 shows the number of papers selected after the quality assessment, and Fig. 8 shows the results of the quality assessment process. After applying the selection criteria and conducting the quality evaluation, 34 articles remained.

Results and Discussion
This section presents the answers to the study's research questions. Each answer is supported by a selection of the articles from our search results.

RQ1. How can Machine Learning be applied to DNA storage?
The application of machine learning techniques to DNA data storage methods has already begun. Stanley et al. [42] demonstrate machine learning techniques used to overcome errors caused by repeated oligos or rewriting of DNA oligos during encoding, before DNA synthesis. In a separate study, Ben Cao et al. [43] developed machine learning through the Damping Multi-Verse Optimizer (DMVO) algorithm to optimize the encoding process under the constraints of DNA sequence composition. In producing DNA strands, one must both make DNA a more efficient storage medium and avoid errors during the synthesis and sequencing processes. According to the study, DNA storage coding is limited by the Hamming distance constraint, the edit distance constraint, the GC-content constraint, the run-length constraint, and the uncorrelated address constraint.
Chao Pan et al. [44] used signal processing and machine learning techniques to store quantized images in DNA without excessive oligo rewriting, addressing error and cost issues. The core of their method is to quantize and compress color images using a distinct encoding on each of the three color channels. The quantization scheme reduces the image color palette to eight intensity levels per channel and compresses the intensity levels by combining Hilbert space-filling curves, differential coding, and Huffman coding.
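The palette-reduction idea can be illustrated with a minimal sketch (this is an illustration of the general technique, not the authors' exact scheme): mapping a 0–255 channel value to one of 8 bins and representing it by the bin centre, so each value needs only 3 bits instead of 8.

```python
# Illustrative quantization of one color channel from 256 to 8 intensity
# levels, in the spirit of the scheme described above (a sketch, not the
# authors' exact method).
def quantize_channel(value: int, levels: int = 8) -> int:
    step = 256 // levels                   # 32 intensities per bin
    bin_index = value // step              # bin index in 0..7 (3 bits)
    return bin_index * step + step // 2    # bin centre as representative

print(quantize_channel(0), quantize_channel(255))  # 16 240
```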
In answer to this research question, we conclude that machine learning methods have been applied in DNA data storage research at the encoding stage, to minimize errors during DNA synthesis and sequencing. DNA formed through synthesis must adhere to certain biological constraints, which stem from the inherent chemical properties of the DNA bases.

RQ2. How are compression techniques utilized in methods for DNA data storage?
DNA synthesis is one of the steps involved in DNA data storage, used to store data in DNA both in vivo and in vitro. Currently, the cost of synthesis is typically higher than the cost of sequencing [45]. This is the rationale Shufang et al. used for compressing digital data [46]: before converting the original digital file into a DNA sequence, they proposed a quaternary Huffman coding method to compress the binary data. Based on the statistical properties of the source, the proposed quaternary Huffman coding can achieve a very high compression ratio by exploiting the non-uniform symbol probabilities of the source file. Fig. 9 depicts the general process of encoding DNA using the quaternary Huffman coding method, which can also be used to correct common synthesis and sequencing errors. In the study, a 5.2 KB document was converted into 3,934 DNA bases.
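The principle behind Huffman coding, as used above, is that frequent symbols receive shorter codes. The following minimal binary Huffman coder is a generic sketch (not the authors' quaternary variant, which follows the same principle with four branches instead of two):

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Build a binary Huffman code table from symbol frequencies."""
    # Each heap entry: [frequency, tie-breaker, string of member symbols]
    heap = [[freq, i, sym] for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    codes = {sym: "" for _, _, sym in heap}
    tie = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)           # two least-frequent subtrees
        hi = heapq.heappop(heap)
        for sym in lo[2]:                  # prepend a bit for each side
            codes[sym] = "0" + codes[sym]
        for sym in hi[2]:
            codes[sym] = "1" + codes[sym]
        heapq.heappush(heap, [lo[0] + hi[0], tie, lo[2] + hi[2]])
        tie += 1
    return codes

codes = huffman_codes("aaaabbc")
# More frequent symbols get shorter codes:
assert len(codes["a"]) < len(codes["c"])
```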
Due to the high cost of DNA synthesis, Mishra et al. [47] also compress digital data during the DNA data storage process. They proposed a method for efficiently compressing digital data into DNA sequences using a variation of the Huffman tree, which simultaneously addresses the limitations of DNA coding; the GC-content and run-length constraints are the limitations overcome. Fig. 10 depicts the method utilized in their research. As an illustration, a simple encoding algorithm uses a rule that maps each 2-digit binary group to one of the DNA nucleotides (A, C, T, G), as shown in Table 5. For instance, the binary code 1100011010110001 is converted into the DNA nucleotide bases TACGGTAC. The conversion formula in the decoding stage is simply the opposite of encoding. However, the DNA sequence must respect "biological constraints" so that the DNA produced during the synthesis process matches the chemistry of DNA base bonding [48]. In their pipeline, step (6) uses the recovered DNA data and the DNA-tree algorithm to restore the same binary sequence with the decoding algorithm, and in step (7) the original input data is retrieved from the recovered binary sequence.

Fig. 10. Binary coding in the DNA data storage method used by Mishra et al. [47]

In answer to this research question, we conclude that scientists use compression techniques to reduce the cost of DNA synthesis. Most compress the input digital data of the file/document to be stored in the DNA medium, and, across all of its variants, Huffman coding is the most widely used technique for this digital data compression.
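The 2-bit mapping illustrated above (00→A, 01→C, 10→G, 11→T) can be reproduced in a few lines, including the worked example from the text:

```python
# The 2-bit-to-nucleotide mapping described in the text:
# 00 -> A, 01 -> C, 10 -> G, 11 -> T.
BIT2NT = {"00": "A", "01": "C", "10": "G", "11": "T"}
NT2BIT = {v: k for k, v in BIT2NT.items()}

def encode(bits: str) -> str:
    """Convert a binary string (even length) to a DNA sequence."""
    return "".join(BIT2NT[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(seq: str) -> str:
    """Invert the mapping: DNA sequence back to the binary string."""
    return "".join(NT2BIT[nt] for nt in seq)

assert encode("1100011010110001") == "TACGGTAC"   # example from the text
assert decode("TACGGTAC") == "1100011010110001"
```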

RQ3. What role does deep learning play in DNA data storage compression?
As described in the previous research question on compression techniques used in DNA data storage methods, several studies have employed compression to address the expense of the DNA synthesis procedure. This section presents research that likewise uses compression for DNA data storage, but with compression techniques based on deep learning.
In their research, Franzese et al. [49] utilized neural network-based compression: they convert an image into a latent space representation that is then stored in DNA. Compression based on neural networks produces excellent results; the generative model technique achieves compression outcomes ten times better than the conventional scheme. Besides reducing the cost of synthesis, this technique can also recover the data at lower sequencing coverage and tolerate numerous errors, thereby reducing the cost of sequencing as well.
Franzese et al. [49] then utilized Huffman coding to convert the binary latent space data to ternary, and converted the ternary to quaternary DNA using a rotating-code method to ensure that biological constraints were met. The biological constraints in question are the GC content (the ratio of G and C bases) and the absence of homopolymer repeats; violating them may cause difficulties during synthesis and sequencing. In answer to this research question, we conclude that few researchers have employed deep learning techniques to compress digital data for DNA data storage. Given their high potential, it is worthwhile to integrate neural network-based, i.e. deep learning, compression into the DNA data storage concept, to provide a broader set of options for the future development and implementation of DNA data storage.
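A rotating code of the kind mentioned above (in the style introduced by Goldman et al.) maps each ternary digit to one of the three bases that differ from the previously emitted base, so homopolymer runs cannot occur. The sketch below is illustrative; the specific rotation table is an assumption, not taken from [49].

```python
# Sketch of a rotating ternary-to-DNA code: each ternary digit selects one
# of the three bases that differ from the previous base, so no two adjacent
# bases are ever identical (no homopolymers). Rotation table is illustrative.
BASES = "ACGT"

def ternary_to_dna(trits: str, prev: str = "A") -> str:
    out = []
    for t in trits:
        choices = [b for b in BASES if b != prev]  # the 3 allowed next bases
        prev = choices[int(t)]                     # trit picks one of them
        out.append(prev)
    return "".join(out)

seq = ternary_to_dna("0120120")
# Homopolymer-free by construction:
assert all(a != b for a, b in zip(seq, seq[1:]))
print(seq)
```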

RQ4. How are generative models associated with deep learning?
Commonly, generative models are utilized as powerful instruments for feature extraction, regression, clustering, and classification. They are also used for pattern recognition via data generation, recommendation generation, topic modeling, text generation, and more [50]. Several machine learning and deep learning algorithms derive their concept or operation from the generative model approach: the Hidden Markov Model (HMM), Gaussian Mixture Model (GMM), Latent Dirichlet Allocation (LDA), Boltzmann Machine (BM), Variational Autoencoder (VAE), and Generative Adversarial Network (GAN) all use generative model concepts [50]. Fig. 11 illustrates how machine learning and deep learning algorithms are classified within the generative model concept. Generative models are widely used in deep learning algorithms, particularly for generating new outputs that resemble the input. Discriminative is the antonym of generative; the pair is analogous to the supervised/unsupervised distinction in machine learning and deep learning [51]. The primary objective of generative models is to capture the joint distribution of all system variables, including the input variables. Because no labels are associated with the input patterns, such learning is typically referred to as unsupervised. The objective is to build an internal model of the input environment by identifying a set of latent features that precisely characterize the correlations between the observed variables. In answer to this research question, we conclude that generative models and deep learning have a strong relationship: the generative/discriminative distinction parallels unsupervised and supervised learning, deep learning is a subset of machine learning, and consequently some deep learning algorithms employ generative model concepts.

RQ5. How can generative models be implemented in compression strategies?
Generative models are widely implemented in deep learning algorithms for various purposes, including data compression [52]. Compression is also used in existing steganography methods [53]; this procedure necessitates an explicit distribution of the generative objects, and here generative models provide an advantage, because they can offer an ideal "sampler" or explicit distribution for the generative media. The Variational Autoencoder (VAE), the Generative Adversarial Network (GAN), and similar algorithms are generative model-based: they can produce objects derived from latent variables that follow a prior distribution, such as the Gaussian distribution.
A Variational Autoencoder (VAE) was used to perform encryption and compression in the study [54], because a VAE can compress and encrypt images more efficiently and accelerate image encryption. A VAE is a generative model that can generate similar images through neural network training and unsupervised learning. In that study, changing the weights and biases of the generative model produced an unrecognizable noise image, i.e. an encrypted image, for the first time. The method uses two images to train the weights and biases of the VAE to generate different images; the system then splits the weights and biases of the two training images and feeds the data into the generated models to create the noise images.
One implementation of generative models in deep learning is the autoencoder architecture, of which there are various developments, including the Variational Autoencoder (VAE). Several researchers use variational autoencoders to compress digital data, for example image data. The autoencoder architecture has two main parts, the encoder and the decoder; the output of the encoder is a latent space representation with much smaller dimensions. This dimensionality reduction is what is meant by the "compression" performed by the generative model. Such compression is intended to minimize the image dimensions and can be used to address the current problem of high synthesis cost in the DNA storage process: with the image dimensions compressed or reduced, the resulting DNA sequence is much shorter than for an uncompressed image. Image compression using a generative model has been carried out by Liu et al. [55] with predetermined input images.
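The encoder/decoder structure and the dimensionality reduction can be sketched with untrained random weights (a purely structural illustration; the sizes 64 and 8 are hypothetical, not taken from any of the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Structural sketch of an autoencoder (untrained, random weights):
# a 64-dimensional input is compressed to an 8-dimensional latent vector,
# then reconstructed. Sizes are illustrative only.
W_enc = rng.normal(size=(64, 8))
W_dec = rng.normal(size=(8, 64))

def encoder(x):
    return np.tanh(x @ W_enc)      # latent representation z

def decoder(z):
    return z @ W_dec               # reconstruction x_hat

x = rng.normal(size=(1, 64))       # one "image" flattened to 64 values
z = encoder(x)
x_hat = decoder(z)

print(z.shape, x_hat.shape)        # (1, 8) (1, 64): 8x fewer latent values
```

In a DNA storage pipeline of the kind discussed above, it is the 8 latent values, not the 64 input values, that would be binarized and encoded into bases.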

RQ6. How can the latent space of a generative model be constructed?
A latent variable is a low-dimensional subspace generated by projecting an observed multivariate sample space [56]. The Autoencoder (AE) is one of the most effective and versatile unsupervised learning techniques for reducing the dimensionality of big-data models. The Variational Autoencoder (VAE), an extension of the AE, can discover an efficient latent variable space in the form of a multivariate normal distribution by adding constraints on the encoding network; the VAE is a nonlinear form of probabilistic principal component analysis (PCA). The VAE uses Bayesian variational inference for parameter estimation and integrates the AE into a generative framework. The technique can be used for dimensionality reduction, reconstruction, and generation, and recent studies have shown great interest in it. A latent code z is inferred from the data x, and x can be reconstructed from z using the generation model. The model qφ(z|x) is considered a probabilistic encoder, and pθ(x|z) is referred to as a probabilistic decoder. The VAE is graphically represented in Fig. 12(b): the filled and outlined circles represent the observed and latent variables, respectively, arrows represent dependencies, and plates represent the number of instances. The VAE's ability to encapsulate complex data distributions in continuous, low-dimensional latent spaces makes it ideal for design applications, due in large part to the characteristics of its latent space [57]. It is more difficult to shape the latent space in generative adversarial network (GAN) algorithms, but GANs have proven to be excellent candidates for generative design applications [58].
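The probabilistic encoder qφ(z|x) described above is usually realized with the reparameterization trick: the encoder outputs a mean and log-variance per latent dimension, and z is sampled as mu + sigma * eps with eps ~ N(0, I), keeping the sampling step differentiable. A minimal sketch, with the encoder outputs assumed rather than learned:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed encoder outputs for one input x (in a real VAE these come from
# a neural network implementing q_phi(z|x)).
mu = np.array([0.5, -1.0])       # latent means
log_var = np.array([0.0, -2.0])  # latent log-variances

# Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

# KL divergence of q(z|x) = N(mu, sigma^2) from the prior N(0, I) -- the
# regularization term in the VAE objective (the ELBO):
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
print(z.shape, kl >= 0)  # (2,) True
```

Training maximizes the ELBO, i.e., the reconstruction likelihood under pθ(x|z) minus this KL term, which is what pushes the latent space toward the smooth, continuous structure the section describes.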
In deep learning, the latent space and latent variables are produced by the Autoencoder architecture, which has two primary components, the Encoder and the Decoder, as shown in Fig. 13 (Autoencoder architecture components [59]). From Fig. 13, it can be seen that the latent-space representation can be generated by the Autoencoder architecture and its derivatives. The contents of the latent space, which are the compressed or reduced data, would be used in the binarization stage at the beginning of the DNA data storage process. This sixth research question concludes that generative models, including the VAE, can generate a latent space. A latent variable is a variable that fills the latent space; the number of such variables depends on the requirements, and they serve as the basis for generating new data from the input. In general, a lower-dimensional latent space extracts more meaningful directions of variation from the data but suffers greater compression losses, making it more difficult to reconstruct the input data from its latent representation. This means that lower-dimensional latent spaces tend to generate fewer but more diverse design candidates in the context of design-space exploration; however, the smaller the latent-space dimension, the simpler the space is to explore.

Discussion
The DNA data storage research journey from the 1960s to the present has been extensive, from the discovery of DNA compounds to methods for storing digital data on DNA. As with most scientific discoveries, DNA data storage research began with a problem: it is motivated by the current limitations of data storage media. These limitations concern both the materials required to build storage technology and the ever-increasing volume of data produced. This issue prompted research into alternative data storage media.
DNA base compounds have been demonstrated to be a storage medium for living organisms' biological data, and numerous studies have utilized biological information collected from living things thousands of years ago. Based on this, the potential of DNA for data storage is enormous; therefore, research into the viability of DNA as a digital data storage medium was initiated.
Storing digital data on DNA involves six steps: binarization, encoding, synthesis, sequencing, decoding, and reading. Existing research focuses on each stage individually or on the process as a whole. Some research creates or develops algorithms for the encoding step, which converts the binary data extracted from a document into a sequence of the four DNA bases, A, T, G, and C. Research on these DNA sequences indicates that certain constraints must be considered during their preparation, two of which are the GC-content constraint and the avoidance of long runs of repeated bases (homopolymers). Therefore, researchers continue to develop algorithms, aided by machine learning techniques, to adapt to DNA's biological constraints.
The DNA data storage process is also hindered by the expense of the synthesis stage, which creates artificial DNA based on the sequence produced in the encoding stage. Compared to sequencing, synthesis is still costly, partly because of the restricted availability of DNA-synthesis instrumentation. Due to this cost barrier, researchers are attempting to reduce synthesis costs by compressing the data stored on DNA. Whether the compression is applied first to the digital data or to the DNA sequence data, it takes place before the synthesis process. A variety of compression techniques have been incorporated into the DNA data storage process; in the literature, Huffman coding is the most frequently used compression method. In addition to Huffman coding, compression techniques based on deep learning are being tested for incorporation into the DNA data storage process, although they have not yet been applied extensively. These deep-learning-based compression methods employ the concept of generative models: they generate a latent variable with a smaller dimension than the input that nevertheless represents the input data distribution. Therefore, the technique can generate new data resembling the input data.
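Since Huffman coding is the compression method most frequently reported in the reviewed studies, a minimal sketch may be useful. This toy version builds the code table with a heap of partial trees; it is a generic illustration of the algorithm, not any specific DNA-storage coding scheme (and it assumes the input contains at least two distinct symbols).

```python
import heapq
from collections import Counter


def huffman_codes(data: str) -> dict:
    """Build a prefix-free code table mapping each symbol to a bit string."""
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}   # prepend branch bits
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]


codes = huffman_codes("aaaabbc")
encoded = "".join(codes[s] for s in "aaaabbc")
print(len(encoded))  # 10 bits, versus 56 bits at a fixed 8 bits per symbol
```

Frequent symbols get shorter codes (here `a` gets a 1-bit code), which is the source of the compression gain and of the shorter DNA sequences after encoding.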
A limitation of the research reviewed in this paper, with respect to the DNA data storage process, is that existing work has not specifically used generative model methods or exploited the latent space for digital data compression to minimize the cost of the synthesis process. The compression carried out in previous research is applied either to digital data that has already passed through the binarization step or to the DNA sequence data. The results of this paper suggest that it is possible to compress digital data from the very beginning, especially digital image data. In this compression, generative models can be used to reduce the image dimensions into a latent-space representation using a deep learning architecture such as the Autoencoder.

Conclusion
The possibility of using generative models as compression techniques in DNA data storage remains wide open, even though this article has addressed several research questions. Several deep learning algorithms employ generative model concepts and are designed for various applications, including data compression. The underlying principle of compression with generative models lies in latent variables, which have a smaller dimension than the input data; consequently, latent variables can be used to carry out the encoding process during the DNA data storage phase.
The Variational Autoencoder (VAE) is a deep learning algorithm that uses generative model concepts for data compression. In addition to the VAE, the Generative Adversarial Network (GAN) algorithm also implements the generative model concept. Research on digital data storage on DNA can employ deep learning algorithms in the data compression procedure to reduce synthesis costs. The results of this SLR show that generative model methods have not yet been used in DNA data storage, so there is still wide potential for integrating them into the stages of digital data storage in the DNA medium.
This review paper provides information regarding the use of generative models in data compression in general, and in relation to DNA data storage in particular. This study shows research opportunities in utilizing generative models to compress data and integrating them with the DNA data storage stages, which can advance existing methods of using DNA as a digital data storage medium. As DNA data storage methods improve, there will be further opportunities to develop the implementation of digital data storage technology in DNA, and the cost of DNA synthesis will decrease. Future research is recommended to investigate how to integrate the benefits of generative models in deep learning algorithms into the stages of storing digital data on DNA. The primary objective of such an application is to maximize data compression in order to reduce synthesis costs; in addition to compressing the data to be stored, the method must also be capable of decompressing the data to restore its original quality.

Fig. 6. Illustration of density estimation with multiple data points on the real number line, used to fit the Gaussian density function describing the observed examples

Table 1. List of queries for each RQ

Table 2. Criteria in the selection of articles

Table 4. Process of paper selection