A coarse-grained parallelization of genetic algorithms


Introduction
Genetic algorithms (GAs) are frequently used to solve scheduling, shortest-path, machine learning, and modeling problems. A genetic algorithm is essentially a search and optimization technique whose working mechanism is based on the principles of genetics and natural selection. However, the problems to be solved keep growing in size and complexity. Consequently, finding optimal solutions takes much longer and requires more advanced objective functions. Time complexity analysis is still being developed to obtain good performance estimates for genetic algorithms [1]. Since this complexity directly affects processing time, a new approach is needed to improve GA performance.
This paper investigates a new method to increase the speed of genetic algorithms in finding optimal solutions by parallelizing the processing of subpopulations. Splitting the population into subpopulations may prevent premature convergence since each subpopulation finds a different genetic combination. The proposed method employs two levels of parallelization: message passing and Single Instruction Multiple Threads (SIMT). On the first level, message passing is used because it can connect more than one computer and hence provides, in principle, unlimited scalability. Previous research, such as Liu and Wang [2], has shown the feasibility of this parallelization. On the second level, SIMT is used because it can generate a large number of threads, so that each individual can be processed by one thread, as shown in Zhang and He [3]. Rapid advances in general-purpose Graphics Processing Unit (GPU) technology have allowed for massive numbers of threads in SIMT parallelization.
The proposed coarse-grained genetic algorithm consists of several GAs, which perform concurrent computations on different subpopulations. These genetic algorithms communicate with each other to exchange information about their best individuals; this technique is often called migration. According to Skolicki and De Jong [4], migration in a coarse-grained method has a definite impact on convergence.

Wang et al. [10] implemented a hybrid parallel genetic algorithm based on two layers of parallelism: process and thread. Their coarse-grained method uses a master-slave hardware processing model that integrates message-passing parallelism, using the Message Passing Interface (MPI), with shared-memory parallelism, using OpenMP [11]. Wahib et al. [12] parallelized genetic algorithms on the SIMT architecture of general-purpose GPUs and discussed the features of the GPU and the relevant issues in implementing parallel genetic algorithms. Johar et al. [13] analyzed genetic algorithms implemented in parallel on both CPU and GPU using the CUDA [14] architecture; the analysis compared the operations performed in both implementations. Millan et al. [15] used a GPU to improve the computation time. Hou et al. [16] built a parallel genetic algorithm that makes use of two parallel systems: a multi-core CPU and a many-core GPU. Furthermore, Li et al. [17] developed a parallel genetic algorithm that runs on a GPU using the island model. The last three studies, however, did not employ a message-passing interface to migrate the best individuals.
This study combines message passing and GPU processing to speed up parallel genetic algorithms. A network is required for migration; hence the CPU is used, but not to process genetic algorithm operations. Instead, the GPU processes each individual in the subpopulations, as in [15], [16], and [17]. The GPU, however, spends considerable time moving individuals from host to device and vice versa, and this affects the resulting parallel genetic algorithm. An asynchronous migration technique is proposed to handle this problem. Message passing and the GPU are combined to build a massive, scalable machine and to speed up parallel genetic algorithms.
The rest of the paper is organized as follows: Section II provides a brief introduction to genetic algorithms and granularity in parallel computation. Section III describes the proposed parallelization methods. Section IV presents experimental results and their analysis, and Section V concludes the paper.

Simple Genetic Algorithm
The generic sequential genetic algorithm is shown in Fig. 1. In general, the factor that most influences the running time of the genetic algorithm is the population size, that is, the number of individuals in one population. If the population size is large, the genetic algorithm takes a long time to complete its iterations up to the defined maximum generation. Therefore, partitioning the population into subpopulations and processing them in parallel may reduce the computation time. Each subpopulation is processed on a different computer, and the computers are connected by a network. Besides partitioning the population into subpopulations, in certain operations each individual in a subpopulation can also be processed in parallel. Hence, parallelization can be applied to the generic sequential genetic algorithm on two levels. Section III discusses this parallelization scheme in detail.
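To make the phases of the generic sequential algorithm concrete, the following is a minimal Python sketch, not the implementation used in this paper. The bit-string encoding, the toy fitness function `one_max`, the one-point crossover, and all default parameter values are assumptions chosen only for illustration; the paper itself targets JSSP with a permutation-based encoding.

```python
import random

def one_max(bits):
    # Toy fitness (assumption): number of 1-bits; larger is better.
    return sum(bits)

def roulette_select(pop, fits, rng):
    # Roulette wheel: draw popsize individuals with probability
    # proportional to fitness (tiny offset avoids all-zero weights).
    return rng.choices(pop, weights=[f + 1e-9 for f in fits], k=len(pop))

def run_ga(popsize=20, length=16, pc=0.9, pm=0.05, maxgen=50, seed=0):
    rng = random.Random(seed)
    # Phase 1: build and evaluate the initial population.
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(popsize)]
    fits = [one_max(ind) for ind in pop]
    # Phase 2: main loop over generations (sequential).
    for gen in range(maxgen):
        pool = roulette_select(pop, fits, rng)            # selection
        new_pop = []
        while len(new_pop) < popsize:                     # crossover + mutation
            a, b = rng.sample(pool, 2)                    # two parents, no replacement
            child = a[:]
            if rng.random() < pc:
                cut = rng.randrange(1, length)            # one-point crossover
                child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < pm else g for g in child]
            new_pop.append(child)
        fits = [one_max(ind) for ind in new_pop]          # evaluation
        pop = new_pop                                     # replacement
    # Phase 3: return the final population and its fitness values.
    return pop, fits

final_pop, final_fits = run_ga()
```

The inner `while` loop runs once per individual in every generation, which is why the running time grows with the population size and why the per-individual work is the natural target for parallelization.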

Granularity in Parallel Computing
Granularity in parallel computing is a qualitative measure of the ratio of computation to communication [18]. Periods of computation are typically separated from periods of communication by synchronization events. There are two types of parallel program design based on granularity: coarse-grained and fine-grained. In a coarse-grained design, a large amount of computation is performed between communication events. In a fine-grained design, by contrast, a relatively small amount of computation is performed between communication events. Fine-grained parallelization facilitates load balancing, since the many communication events between processing steps keep the workload balanced. Coarse-grained parallelization, on the other hand, depends on the size of the workload being processed: if the workload can be divided equally among the processes, then load balancing can be achieved. Fine-grained parallelization is, however, vulnerable to communication overhead, which can prevent the overall speed from increasing and sometimes even decreases it. The main cause is data flooding in the communication medium. Thus, fine-grained parallelization is not effective for processing large amounts of data over slow communication media. The parallelization of genetic algorithms in this paper uses the coarse-grained model.

Parallelization of Genetic Algorithms
In this paper, we propose a method to develop parallel genetic algorithms with two levels of parallelization, as shown in Fig. 2. The proposed method is an improvement on Ratomi [19]. The first level exploits the fact that the population can be partitioned into loosely dependent subpopulations, while the second level exploits the fact that individuals in a subpopulation are independent of each other and, moreover, perform similar computations.
On the first level, the population is divided into subpopulations according to the number of available computing nodes. This level uses message passing with master-slave parallelism: node 0 (master) broadcasts the size of the subpopulation to all slave nodes. Each slave node generates its own subpopulation. Subpopulations are processed using the coarse-grained method: a slave node only establishes communication to exchange its best individual, which is transferred through the network. At the end, the master node gathers all final subpopulations from all slave nodes and produces the final result. In this study, the first level is implemented using MPI. Fig. 3 shows the proposed generic parallel genetic algorithm. Each computing node runs the whole algorithm (from Phase 1 to the End) in parallel; this corresponds to "Level I" in Fig. 2.
On the second level, parallelization is performed in the processing of each individual (cf. "Level II" in Fig. 2). However, not all genetic operations are parallelized. The loop over generations is performed sequentially. The parallelized parts are the genetic algorithm operations that have no dependency between individuals, namely selection, crossover, mutation, individual evaluation, and updating the population's individuals. These correspond to Phase 2, Steps 2.1 to 2.4 in Fig. 3.

Algorithm 1 (The generic sequential genetic algorithm)
Phase 1: Initialization:
Step 1.1: Set parameters: Pc, Pm, popsize, and maxgen.
Step 1.2: Generate popsize individuals randomly to build the initial population and evaluate their fitness values. gen = 0.
Phase 2: Main Loop. Repeat the following steps until gen > maxgen:
Step 2.1: Select popsize individuals from the current population using Roulette Wheel Selection to generate the mating pool.
Step 2.2: Repeat the following operations until a new population with popsize individuals is generated: Select two individuals from the mating pool randomly without replacement to perform crossover with probability Pc, and perform mutation for every gene of the offspring with probability Pm. Then insert the mutant into a new population.
Step 2.3: Evaluate the fitness value for every new individual in the new population.
Step 2.4: Replace the current population with the new population. gen = gen + 1.
Phase 3: Submit the final popsize individuals as the result of the genetic algorithm.
End
Steps 2.1 to 2.4 in Phase 2 are all similar data-updating operations that are performed in lockstep for all individuals in the subpopulation. Hence, these steps are further parallelized as threads in each computing node's GPU. Thread IDs in the GPU are used as indices of the individuals, so as many threads are generated as there are individuals in the subpopulation. Unbiased tournament selection is used because it can be easily executed in parallel. Random permutation values are generated only at the beginning of the execution of the algorithm, and each thread builds its own mating pool based on these values. In the crossover operation, each thread selects two parents randomly from the mating pool and then, based on Pc, performs crossover on them. Afterward, each thread obtains the offspring produced by the crossover operation and, based on Pm, applies mutation to them. Subsequently, all mutated offspring are evaluated simultaneously to obtain their fitness values, and the individuals in the old subpopulation are simultaneously replaced by the new individuals. After this sequence of operations is completed, the elitism operation is executed sequentially to obtain the best individual. The best individual is then copied to the host CPU, and MPI sends it to other nodes. All of these steps are repeated until the predefined maximum generation is reached.
Memory for the data required by the genetic algorithm operations is allocated on the host as well as on the GPU device. Although the genetic algorithm operations are performed only on the device, the data also need to be copied to the host so that the best individual can be delivered by MPI. At the end of the genetic algorithm operations, all individuals at each node can be combined.
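The per-thread selection step can be illustrated with a short Python sketch. It follows one common formulation of unbiased tournament selection, in which slot i of the mating pool is filled by the winner among individual i and the individuals that pre-generated random permutations assign to slot i. This formulation, the function name `unbiased_tournament`, and the sample makespans are assumptions for illustration and are not taken from the authors' GPU code.

```python
import random

def unbiased_tournament(fitness, perms):
    """Return mating-pool indices; slot i holds the winner among
    individual i and the rivals perms[j][i] from each permutation."""
    pool = []
    for i in range(len(fitness)):        # on a GPU, i would be the thread ID
        rivals = [i] + [p[i] for p in perms]
        # Minimization: the individual with the smaller makespan wins.
        pool.append(min(rivals, key=lambda k: fitness[k]))
    return pool

rng = random.Random(1)
makespans = [930, 905, 1010, 888, 950, 975]      # hypothetical fitness values
# Permutations are drawn once, before the main loop starts.
perms = [rng.sample(range(len(makespans)), len(makespans)) for _ in range(2)]
pool = unbiased_tournament(makespans, perms)
```

Because each loop iteration reads only shared, read-only data (`fitness` and `perms`) and writes only its own slot, the iterations are independent and map directly onto one thread per individual.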

Distribution of Population
The distribution of the population also influences the speed of the parallel genetic algorithm, because the overall speed is determined by the computing node that processes the most individuals. The population is therefore distributed, as far as possible, in equal parts so that the nodes are load-balanced and process at roughly the same speed. The value distributed to each node is the size of the node's subpopulation, which is calculated in the master node. Each node then generates its own subpopulation. If the size of the population is less than the number of nodes, each node processes only one individual. If the size of the population is more than the number of nodes, the population is divided equally by rounding it up.
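The rule described above can be written as a small helper. This is a sketch under the assumption that "rounding up" means taking the ceiling of the population size divided by the number of nodes; the function name `subpopulation_size` is hypothetical.

```python
import math

def subpopulation_size(popsize, nodes):
    # Fewer individuals than nodes: each node handles a single individual.
    if popsize < nodes:
        return 1
    # Otherwise split evenly, rounding up so no individual is left out.
    return math.ceil(popsize / nodes)

sizes = [subpopulation_size(100, n) for n in (5, 10)]   # [20, 10]
```

Note that rounding up can make the total number of generated individuals slightly larger than the requested population size when the division is not exact.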

Individual Migration Strategy
In this research, the migration process is integrated into the elitism process, which derives the best individual by comparing individuals against the elitism memory. Individual migration among the slave nodes is performed in one direction over a ring topology: the best individual in slave node i is sent to slave node i+1, and so on, until the last node sends its best individual to the first node. Two transfer modes are available for migration: synchronous and asynchronous. In synchronous mode, migration is carried out by sending the best individual directly to another computing node. In asynchronous mode, the best individual is first copied to a buffer. Migration in asynchronous mode is performed by a separate thread, and the individuals that are sent and received are accessed through the buffer.
Fig. 4 depicts the working boundaries of the threads in the migration and elitism processes in asynchronous mode. Genetic algorithm operations and individual migration work concurrently via a global shared buffer that stores the best individuals. To avoid collisions in accessing the buffer, the buffer is equipped with a handler, namely a status flag. If a migration thread wishes to send an individual, it first checks whether the status flag of the sender buffer is free or in use. If the buffer is in use, the migration thread must wait until its status is free. Similarly, in the elitism process of the genetic algorithm operations, a thread that wishes to copy an individual from the receiver buffer must first check whether the buffer is free or in use. If it is in use, the thread must wait until the status of the receiver buffer is free.
There is no migration on the second level, because the threads in the GPU work in the Single Instruction Multiple Threads model and therefore do not communicate with each other. Communication occurs only between the host CPU and the device GPU, namely to copy the best individual produced by elitism from the device memory to the elitism memory on the host, from which it is then migrated using MPI.
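The one-directional ring can be simulated in a few lines of Python. This sketch only models which node receives which individual; it does not use MPI, and the function name `ring_migrate` and the sample data are hypothetical.

```python
def ring_migrate(best_per_node):
    """Simulate one migration round: node i sends its best individual
    to node (i + 1) mod n, so the last node sends to the first."""
    n = len(best_per_node)
    received = [None] * n
    for i, best in enumerate(best_per_node):
        received[(i + 1) % n] = best
    return received

# Three nodes, each holding a label for its current best individual.
incoming = ring_migrate(["best0", "best1", "best2"])
```

After one round every node holds exactly one incoming individual, here the one from its predecessor in the ring.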
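The status-flag protocol described above can be sketched with Python's standard `threading` module. This is an analogy rather than the paper's implementation (which runs on CUDA and MPI); the class name `MigrationBuffer` and its methods are hypothetical.

```python
import threading

class MigrationBuffer:
    """One-slot shared buffer guarded by a status flag (free / in use)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_use = False          # the status flag
        self._individual = None

    def _acquire(self):
        with self._cond:
            while self._in_use:       # wait until the status is "free"
                self._cond.wait()
            self._in_use = True       # mark the buffer as "in use"

    def _release(self):
        with self._cond:
            self._in_use = False      # mark the buffer as "free" again
            self._cond.notify_all()   # wake up any waiting thread

    def put(self, individual):
        # Called by the thread that wants to store a best individual.
        self._acquire()
        try:
            self._individual = list(individual)
        finally:
            self._release()

    def get(self):
        # Called by the thread that wants to copy the stored individual.
        self._acquire()
        try:
            return None if self._individual is None else list(self._individual)
        finally:
            self._release()

buf = MigrationBuffer()
writer = threading.Thread(target=buf.put, args=([3, 1, 2],))
writer.start()
writer.join()
latest = buf.get()
```

Because the genetic algorithm thread only waits while the flag is set, it does not have to block for the duration of a network transfer, which is the intended benefit of the asynchronous mode.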

Results and Discussion
Experiments are conducted on computers with an Intel Core i5 (4 CPUs), 4 GB RAM, and the Ubuntu 14.04 LTS Linux operating system, each with an NVIDIA GeForce GT 420 GPU. The computers are connected by an Ethernet LAN. In the experiments, the genetic algorithm uses a crossover probability of 0.9, a mutation probability of 0.05, and a population size of 100 individuals [20]. The number of copies for unbiased tournament selection is 5 individuals. Experiments have been conducted with sequential and parallel implementations; for the parallel ones, we have used 5 and 10 computing nodes.
Three series of experiments are carried out in this study. The first series investigates the effect of parallelization on the accuracy of the results of the genetic algorithm. Two measures are used for this, namely the standard deviation (STD) and the Average Relative Percentage of Error (ARPE). The standard deviation measures the variability of the objective values produced by the genetic algorithm, which in this experiment corresponds to the variability of the makespans obtained for each job shop scheduling problem (JSSP) case. A makespan is the total length of time required to complete all tasks in a JSSP case. A given JSSP case is run n times with different initializations for each parallelization method, and each run produces a makespan. Based on the obtained makespans, their standard deviation is defined in (1),
where x_i is the i-th makespan and x̄ is the average value of all makespans. ARPE, on the other hand, quantifies the error of the makespan found by the genetic algorithm in this study relative to the best makespan obtained in previous research for each JSSP case. Let x_o be the makespan produced by running the current implementation, and let x_b be the best makespan obtained by earlier research; then ARPE is defined in (2).
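Equations (1) and (2) are not reproduced in this text, so the following Python sketch shows the two measures under explicit assumptions: the standard deviation uses the sample form with divisor n − 1 (the source does not say whether n or n − 1 is used), and the relative percentage error is taken as 100 · (x_o − x_b) / x_b for a single run. The function names and the sample makespans are hypothetical.

```python
import math

def std_dev(makespans):
    # Sample standard deviation (assumed divisor: n - 1).
    n = len(makespans)
    mean = sum(makespans) / n
    return math.sqrt(sum((x - mean) ** 2 for x in makespans) / (n - 1))

def relative_percentage_error(x_o, x_b):
    # Positive: worse than the best known makespan; negative: better.
    return 100.0 * (x_o - x_b) / x_b

runs = [930, 935, 940]                       # hypothetical makespans of 3 runs
spread = std_dev(runs)
error = relative_percentage_error(min(runs), 926)
```

With this sign convention, a negative value means that the obtained makespan is smaller than the best previously reported one.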
Table 1 presents the standard deviations of the obtained makespans for all 10 JSSP cases under the sequential and parallel implementations. For each JSSP case, the genetic algorithm has been executed 20 times with different initialization values, and each execution produces a makespan as its solution. The standard deviation of the makespans produced for each JSSP case measures the variability of the results. Table 1 indicates that the standard deviations of the sequential and parallel implementations are close to each other and that their difference is not significant. Even the highest standard deviation is still relatively small, namely 23.51 for the la36 case with the parallel genetic algorithm using the asynchronous migration model. This suggests that both the sequential and the parallel implementations converge to a solution. The only anomaly is case la02, but even in this case the obtained makespans do not diverge much.
In this series of experiments, we also investigate the distance between the obtained makespans and the makespans known from previous research; Table 2 shows the results. Columns with the heading "x" contain the smallest makespan obtained for each case in the various settings. Negative ARPE values occur in cases la06 and la31; this means that the makespans obtained in this research for these two cases are smaller than those obtained in earlier research. For case la31, the parallel implementation even achieves a smaller makespan than the standard genetic algorithm.
However, most of the ARPE values in Table 2 are greater than 0 because they do not reach the makespans already known, unlike the Hybrid PGA (the fifth and sixth columns), which obtained ARPE values of less than 1 [20]. This is because the JSSP coding used in this research is a direct encoding. Direct encoding depends on the order of jobs in a given problem, so the set of possible combinations is small. In contrast, previous studies mostly used indirect encoding, whose set of possible combinations is larger than that of direct encoding. We use direct encoding because we are more interested in observing the impact of migration in the parallel genetic algorithm on the permutation structure of the migrated chromosomes, on which direct encoding depends heavily. If the migration process damaged the permutation structure of the chromosomes, it would result in larger ARPE values, and the search results would not satisfy the given problem case. In general, the performance of the search for optimal values is similar for the sequential and parallel genetic algorithms. Therefore, the proposed parallel genetic algorithm is not worse than the standard genetic algorithm at finding the smallest makespan on the JSSP cases used in this study.

Computation Times
In this series of experiments, we observe the computation times of the proposed parallelization method in several scenarios and case studies and compare them to those of the sequential implementation. For this purpose, the computation time of each scenario is obtained by running the corresponding implementation for 10,000 generations. Table 3 shows the resulting computation times. The columns "S(n)" give the speedup obtained when using n nodes. Table 3 shows that a speed increase is obtained only when the genetic algorithm is executed in parallel using 10 nodes. The genetic algorithm running on 5 parallel nodes is not faster than the genetic algorithm executed sequentially. The main cause of the lack of speed increase is the ineffective use of the GPU memory: only a relatively small amount of data is processed in the high-speed memory, while most of the data is processed in memory with slower access speed. Table 3 also shows that the genetic algorithm running in parallel with asynchronous migration is faster than with synchronous migration. Using buffers and assigning different threads to the migration process and to the genetic algorithm operations reduces the waiting time required for sending and receiving the best individual through the network. Only one case, la02, takes longer to solve with the parallel implementation using 10 nodes (with synchronous migration) than with the sequential one. This is because the time required to initialize MPI on the first test is longer than on subsequent tests. However, the speedups obtained in this research are more stable than those of previous studies [20]. As can be observed in Table 3, with the Hybrid PGA (the third column) only six cases end up with speedups in processing time, although these amount to an average of 6.6. In this research, speedups are obtained in all of the parallel cases, although most of them are less than 2. This is because the JSSP direct encoding used in this research is simpler than indirect encoding. Moreover, the asynchronous migration technique in this research is easier to implement than that of Liu and Wang [2]. This research uses only a buffer to store the individuals to be sent and received, and consequently the technique can reduce the migration processing time.
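The speedup values can be reproduced from measured times with a one-line helper. This sketch assumes the standard definition, S(n) = sequential time divided by the parallel time on n nodes; the source does not spell out the formula, and the timings used here are hypothetical, not values from Table 3.

```python
def speedup(t_sequential, t_parallel):
    # S(n) > 1 means the parallel run on n nodes is faster than the
    # sequential run; S(n) < 1 means it is slower.
    return t_sequential / t_parallel

s_10 = speedup(120.0, 80.0)    # hypothetical timings in seconds
s_5 = speedup(120.0, 150.0)    # a slowdown yields a value below 1
```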

Transformation of Objective Values
In this series of experiments, we investigate how the objective values change from generation to generation in a particular JSSP case study for different parallelization methods. The comparison is carried out to observe and compare the speed of convergence of the sequential and parallel genetic algorithms. The selected JSSP case study is la31, and the transformation is observed over the first 1,000 generations of the search for the smallest makespan. Fig. 5(a) shows the transformation of the objective values of the sequential implementation and of the parallel implementations with synchronous migration. The parallel genetic algorithm, with both 5 nodes and 10 nodes, converges faster than the sequential one. In the first 100 generations, the objective values obtained by the parallel genetic algorithm decrease more steeply than those of the sequential genetic algorithm. The performance of the parallel genetic algorithm with the synchronous migration model is roughly the same with 5 nodes and with 10 nodes.
The parallel genetic algorithm with the asynchronous migration model also converges faster than the sequential one, as shown in Fig. 5(b). However, it takes longer to converge than the parallel genetic algorithm with the synchronous migration model. In the first 1,000 generations, the objective value obtained with the asynchronous migration model is still above 1,850, compared to 1,810 for the synchronous migration model.

Conclusion
Parallelization of genetic algorithms using the coarse-grained method can be done by combining two models of processing hardware: MPI and GPU. Based on the standard deviation and ARPE obtained from the experiments, the precision of the results obtained by the parallel and sequential genetic algorithms is roughly the same, with the largest difference in standard deviation being approximately 9.31. The computation time for finding a solution using the parallel genetic algorithm is not yet optimal when compared to the sequential genetic algorithm. Nevertheless, the proposed parallel implementation reaches convergence faster than the sequential one. Furthermore, the parallel genetic algorithm with the asynchronous migration model is faster than with the synchronous migration model.
In the future, we would like to investigate GPU memory processing techniques that reduce data transfer time in order to improve the performance of the parallel genetic algorithm. For the migration process, it is also worth looking into lighter client-server software to carry out the message-passing operations, so that the data transmission process can be shortened.

Algorithm 2 (The generic parallel genetic algorithm)
Phase 1: Initialization:
Step 1.1: Receive spopsize from node 0.
Step 1.2: Set parameters: Pc, Pm, and maxgen.
Step 1.3: Generate spopsize individuals randomly to build the initial subpopulation and evaluate their fitness values and elitism. gen = 0.
Phase 2: Main Loop. Repeat the following steps until gen > maxgen:
Step 2.1: Generate the mating pool from the current subpopulation using Tournament Selection.
Step 2.2: Repeat the following operations until a new subpopulation with spopsize individuals is generated: Select two individuals from the mating pool randomly without replacement to perform crossover with probability Pc, and perform mutation for every gene of the offspring with probability Pm. Then insert the mutant into a new subpopulation.
Step 2.3: Evaluate the fitness value for every new individual in the new subpopulation.
Step 2.4: Replace the current subpopulation with the new subpopulation. gen = gen + 1.
Step 2.5: Perform elitism and migration.
Phase 3: Submit the final spopsize individuals to node 0.
End

Since modern general-purpose GPUs have a large number of threads, this allows for speedy computation of a subpopulation having a large number of individuals.

Fig. 4. The working boundaries of migration threads and genetic algorithm operations threads in asynchronous mode where communication proceeds via a shared buffer.

Algorithm 1 (The generic sequential genetic algorithm), Phase 2:
Step 2.1: Select popsize individuals from the current population using Roulette Wheel Selection to generate the mating pool.
Step 2.2: Repeat the following operations until a new population with popsize individuals is generated: Select two individuals from the mating pool randomly without replacement to perform crossover with probability Pc, and perform mutation for every gene of the offspring with probability Pm. Then insert the mutant into a new population.
Step 2.3: Evaluate the fitness value for every new individual in the new population.

Table 3. Computation Time Comparison