Gabor-enhanced histogram of oriented gradients for human presence detection applied in aerial monitoring

Ocampo et al., International Journal of Advances in Intelligent Informatics, ISSN 2442-6571, Vol. 6, No. 3, November 2020, pp. 223-234

In UAV-based human detection, the extraction and selection of the feature vector is one of the critical tasks for ensuring optimal performance of the detection system. Although UAV cameras capture high-resolution images, the small relative size of human figures renders persons at very low resolution and contrast. Feature descriptors that can adequately discriminate between local symmetrical patterns in a low-contrast image may improve the detection of human figures in vegetative environments. Such a descriptor is proposed and presented in this paper. Initially, the acquired images are fed to a digital processor in a ground station where the human detection algorithm is performed. Part of the human detection algorithm is the Gabor-enhanced HOG (GeHOG) feature extraction, where a bank of Gabor filters is used to generate textured images from the original. The local energy for each cell of the Gabor images is calculated to identify the dominant orientations. The bins of conventional HOG are enhanced based on the dominant orientation index and the accumulated local energy in Gabor images. To measure the performance of the proposed features, GeHOG and two other recent improvements to HOG, Histogram of Edge Oriented Gradients (HEOG) and Improved HOG (ImHOG), are used for human detection on the INRIA dataset and a custom dataset of farmers working in fields captured via unmanned aerial vehicle. The proposed feature descriptor significantly improved human detection and performed better than recent improvements in conventional HOG. Using GeHOG improved the precision of human detection to 98.23% in the INRIA dataset. The proposed feature can significantly improve human detection applied in surveillance systems, especially in vegetative environments.

Introduction
Human detection has been used in many applications such as crime mitigation [1], surveillance [2] [3], crowd estimation or people counting [4], abnormal event recognition [5], person reidentification [6], gender classification [7], elderly fall detection [8] [9], illegal trespassing detection [10], and rescue operations [11]. In applications where the video or images are captured using unmanned aerial vehicles (UAV), the difficulty of automatic human detection is amplified by variations in pose, viewing angle, and the distance from which the UAV-mounted camera captures an image or video sequence. Furthermore, the area that includes any human figure is small relative to the extent of the image, rendering the Region-of-Interest (ROI) at low resolution.
Vision-based human detection systems are generally equipped with visible light, FIR, NIR, IR cameras, or a combination of any of these. Each system differs in the features used to characterize human figures. Some human detection algorithms are based on the subject's silhouette or edges [12], matching templates [13], sector rings [14], thermal images [15], visible human attributes [16], spatiotemporal features [17] [18], covariance descriptor [6], and feature maps derived from convolutional neural networks [2] [19]. Histogram of Oriented Gradients (HOG) is used for pedestrian detection [20], where it performed relatively well for upright human figures. Since then, HOG has been considered by most researchers due to its robustness against varying illumination. Improvements in HOG are often made by supplementing it with another feature descriptor. In Hurney et al. [21], HOG is used in conjunction with Local Binary Patterns (LBP) to detect human figures during the night.
Similarly, LBP and HOG features are extracted from images containing human figures and then subjected to Singular-Value Decomposition and Principal Component Analysis (PCA) to select the most significant feature segments from LBP and HOG [22]. Human detection in uncontrolled environments can also be done by using a textured image instead of the original. This is accomplished by preprocessing an image with Gabor filtering and then extracting HOG features [23]. Similarly, HOG features and Haar wavelets are combined to form a cascade head-shoulder detector that works well for overhead and tilted captures from surveillance systems [24]. Another hybrid use of HOG with LSS has been proposed to improve the descriptive capability of the feature descriptors for human detection [25].
Although various descriptors have been proposed, automatic human detection is still a challenging task. Holistic human detection methods perform well in varying illumination, but their performance degrades under occlusions or pose variations [26]. As mentioned above, among the feature descriptors, HOG has been considered by most researchers as the standard for several years. However, conventional HOG features cannot detect human figures in low-contrast regions, where gradient computation is less effective, or when local patterns occur symmetrically. The most recent improvements on HOG proposed to address these issues include Histogram of Edge Oriented Gradients (HEOG) [26] and Improved HOG [27]. Other literature claims that combined features are superior to the use of any individual descriptor alone [28]; however, such a system becomes exclusive to a specific dataset [29]. It is in this context that this paper is proposed.
The main contribution of this paper is a new feature computation for human detection. This paper proposes and evaluates the Gabor-enhanced HOG feature descriptor to address the discrimination of local patterns that are low in contrast or occur symmetrically. Finally, the paper constructs a dataset intended for detecting humans against a vegetative background at very low resolution.

Method
In human detection, two major steps are performed: first, extraction of the Region-of-Interest (ROI), and second, validation of the human target in the proposed ROI. The proposed method falls under the second step, in which the features of the candidate are extracted. As previously mentioned, HOG has been a persistent standard in holistic human detection. Hence, before detailing the proposed feature descriptor, the conventional HOG and two of its most recent improvements are discussed as benchmark methods. For simplicity and speed, linear SVM is used as the baseline classifier throughout the experimentation.

Dataset description/statistics
Many efforts have been expended to standardize the evaluation of performance in machine-vision systems. Although there are available datasets for person detection such as Caltech, INRIA, CAVIAR, etc., no dataset of farmers working in fields was found. Due to this work's intended application, we used a UAV to gather images of farmers working in crop fields. The INRIA Persons Dataset and a custom Farmer Dataset (FDS) [30] were used. INRIA was originally developed for HOG, which makes it the most appropriate dataset for testing the proposed method. The Farmer Dataset, however, has not been tested before, so to match it with the INRIA dataset, all FDS images are resized to 128 x 64 upright. In addition, multiple scales of images in FDS are considered. Table 1 presents the characteristics of the datasets.

Theoretical comparison of the proposed method to related descriptors: HOG, HEOG, and ImHOG
Histogram of Oriented Gradients is a robust local descriptor designed to capture the gradient information that emphasizes the edges of an object or human figure. It starts with the normalization of gamma and the color of the input image. Then, gradients are computed and used for weighted voting into spatial and orientation cells. Contrast is then normalized across overlapping spatial blocks, and the resulting HOG features are collected and fed to the classifier [20]. The image gradients and their magnitude at any given point are calculated as shown in equations (1) to (4), where Gx and Gy are the horizontal and vertical gradients, respectively, and GMag is the gradient magnitude with θ as the orientation.
The gradient image is divided into cells, and a single histogram is collected in each cell. In the conventional HOG, the cell is usually 8 x 8 pixels and the orientation ranges from 0 to π, which is divided into 9 bins. Cells are grouped into blocks (e.g. four cells per block) and normalized using the L2-norm described in (5).
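The gradient step can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes centered differences for Gx and Gy, replicated border pixels, and an unsigned orientation folded into the 0 to π range. The function name `gradients` is hypothetical.

```python
import math

def gradients(img):
    """Per-pixel gradient magnitude and unsigned orientation.

    img is a 2D list of gray levels. Centered differences and
    replicated borders are assumptions made for this sketch.
    """
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Horizontal (Gx) and vertical (Gy) centered differences.
            gx = img[r][min(c + 1, w - 1)] - img[r][max(c - 1, 0)]
            gy = img[min(r + 1, h - 1)][c] - img[max(r - 1, 0)][c]
            mag[r][c] = math.hypot(gx, gy)          # sqrt(Gx^2 + Gy^2)
            ang[r][c] = math.atan2(gy, gx) % math.pi  # fold reversed angles together
    return mag, ang
```

Folding with the modulo is what places reversed angles in the same bin, which is the behavior of conventional HOG that ImHOG and GeHOG later address.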
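As a rough sketch of the voting and normalization just described, the following Python code builds a 9-bin histogram for one cell and applies L2 block normalization. It assumes hard binning (each pixel votes into exactly one bin, without interpolation) and a small constant `eps` to avoid division by zero. Both are simplifications chosen for illustration, and the function names are hypothetical.

```python
import math

def cell_histogram(mag, ang, bins=9):
    """Magnitude-weighted orientation histogram of one cell.

    mag and ang are 2D lists of equal shape; ang is expected in the
    0 to pi range. Hard binning is an assumption of this sketch.
    """
    hist = [0.0] * bins
    width = math.pi / bins
    for mrow, arow in zip(mag, ang):
        for m, a in zip(mrow, arow):
            hist[int(a / width) % bins] += m
    return hist

def l2_normalize(block, eps=1e-6):
    """L2 block normalization: v / sqrt(||v||^2 + eps^2)."""
    norm = math.sqrt(sum(v * v for v in block) + eps * eps)
    return [v / norm for v in block]
```

A block vector is the concatenation of its cell histograms, so with four cells per block and 9 bins, `l2_normalize` would receive 36 values.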
Finally, normalized blocks are concatenated into a 1D vector that forms the HOG feature vector. The conventional HOG is robust in varying illumination; however, its performance is reduced on low-contrast images.
HEOG adds three steps to the conventional HOG to address the low-contrast problem. HEOG feature extraction starts with gradient computation on the x- and y-axes, which is used to derive the gradient's magnitude and orientation. Unlike conventional HOG, the HEOG feature's dimensionality is reduced right after the gradient calculation, where only the local maximum in a 3 x 3 running window is considered for gradient voting. If the local maximum is present, the gradient magnitude is preserved; otherwise, the value is set to zero. To reinstate the robustness to illumination variance lost during the non-max suppression, adaptive thresholding is performed by calculating each block's median value. The lower (6) and upper (7) thresholds are computed using the median value. These thresholds are used in the next process, hysteresis thresholding, which preserves the edges affected by illumination. It works by tagging edges as "strong" or "weak". The strong edges participate in the gradient voting, while the weak ones are discarded. Lastly, gradient voting and block normalization are conducted as in conventional HOG.
Improved HOG (ImHOG) addresses the issue of reversed angles being placed in one bin, as is done in conventional HOG, which leaves some local patterns inadequately discriminated. In ImHOG, the disparity between the reverse angles is accumulated across all histogram bins and placed into an extra bin. Initially, the Gradient Histogram (GH) is generated. Each histogram bin represents a distinct orientation, i.e. reverse angles are placed in distinct bins. Suppose that GH(x) is the Gradient Histogram value at bin x, where x can assume values from 1 to N/2 (N is the number of bins), and GH(x + N/2) is the value at the bin opposite to x.
Then, the total disparity over the entire GH is described in (8).
By accumulating the absolute differences (disparities) between bins of reverse angles into an extra bin, the issue in discriminating similar but high-contrast patterns is resolved. However, an extra bin for disparity does not work for symmetrical local patterns. This is the gap that the proposed method fills.
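The disparity bin can be illustrated with a short Python sketch. It follows the description above, summing |GH(x) - GH(x + N/2)| over x from 1 to N/2. How the remaining bins are formed is an assumption here: reverse-angle bins are simply added together, as in an unsigned histogram. The function name `imhog_cell` is hypothetical.

```python
def imhog_cell(gh):
    """Fold an N-bin signed gradient histogram and append the disparity.

    gh[x] and gh[x + N/2] hold reverse angles. Returns N/2 folded bins
    followed by one extra bin with the total absolute disparity.
    """
    n = len(gh)
    if n % 2:
        raise ValueError("expected an even number of bins")
    half = n // 2
    # Assumption: folded bins are the sums of reverse-angle pairs.
    folded = [gh[x] + gh[x + half] for x in range(half)]
    # Total disparity over the entire GH.
    disparity = sum(abs(gh[x] - gh[x + half]) for x in range(half))
    return folded + [disparity]
```

With a toy 4-bin histogram, `imhog_cell([3, 1, 1, 1])` and `imhog_cell([1, 1, 3, 1])` both return `[4, 2, 2]`: swapping the two reverse-angle bins leaves both the folded bins and the absolute disparity unchanged, which is one way to see the limitation noted above.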
In Gabor-enhanced HOG (GeHOG), Gabor filters are used to enhance the HOG at each cell of the ROI. The objective is to resolve the inadequacy in discriminating local patterns that are symmetrical or that occur in low contrast. Fig. 2 shows the general framework for generating the GeHOG feature vector. Initially, a three-channel image is converted to grayscale to reduce the dimensionality of the Gabor images, which are extracted directly from the single-channel version of the target image. In the spatial domain, a 2D Gabor filter is a sinusoid-modulated Gaussian kernel, where f is the frequency of the sinusoid, θ represents the orientation orthogonal to the stripes of the Gabor function, ϕ is the phase offset, σ is the standard deviation of the Gaussian envelope, and γ is the two-dimensional aspect ratio, which specifies how elliptical the support of the Gabor function is.
A Gabor filter is applied to each cell in the image's ROI to determine the dominant orientation of the edges that occur in it. Suppose that N Gabor orientations are used at M different scales; then N x M Gabor images are generated for each cell. For each orientation, the cell's local energy at the various scales is summed, creating a scalar value denoting the accumulated local energy at that particular orientation. Hence, a 1D vector of length N is created representing the local energies at different orientations. After normalization, the maximum local energy is used to index which bin of the GH needs "enhancement". The term "enhancement" refers to increasing the particular orientation's influence in creating the final feature vector. The local energy Es at a particular scale s can be computed using (12), where xij is the pixel value at coordinates i and j of an n x n cell. The local energy histogram is normalized to unity as described in (13), where Enorm(N) and EN are the normalized and original values of local energy at the Nth bin, respectively, and E is the 1D vector of local energies. The maximum value in the normalized energy, Enorm(N), is used to determine which bin in the GH is enhanced. The algorithm for enhancing the indexed orientation bin is described in Table 2, where the enhancement modulus β is also introduced.
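For illustration, a Gabor kernel with the parameters named above can be sampled as follows. This sketch uses the commonly cited real-valued form, exp(-(x'^2 + γ^2 y'^2) / (2σ^2)) cos(2π f x' + ϕ), with x' and y' the coordinates rotated by θ. Whether this matches the authors' exact formulation is an assumption, as are the default values of `sigma` and `gamma` and the function name `gabor_kernel`.

```python
import math

def gabor_kernel(size, f, theta, phi=0.0, sigma=4.0, gamma=1.0):
    """Real part of a 2D Gabor filter on a size x size grid.

    f: sinusoid frequency (1 / wavelength in pixels), theta: orientation,
    phi: phase offset, sigma: Gaussian std, gamma: spatial aspect ratio.
    """
    half = size // 2
    kernel = []
    for y in range(-half, size - half):
        row = []
        for x in range(-half, size - half):
            # Rotate coordinates by theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + gamma * gamma * yr * yr) / (2 * sigma * sigma))
            row.append(envelope * math.cos(2 * math.pi * f * xr + phi))
        kernel.append(row)
    return kernel
```

For example, `gabor_kernel(9, 1 / 8, 0.0)` samples a filter with an eight-pixel wavelength at zero orientation. A filter bank would repeat this call over the chosen orientations and scales.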
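The energy accumulation and indexing steps can be sketched as below. Two details are assumptions made only for this illustration: local energy is taken as the sum of squared filter responses in the cell, and normalization to unity divides by the total energy. The enhancement itself (Table 2 and the modulus β) is not reproduced here. The function names are hypothetical.

```python
def accumulated_energy(gabor_cells):
    """Accumulated local energy per orientation.

    gabor_cells[o][s] is the n x n filtered cell at orientation o and
    scale s. Energy is assumed to be the sum of squared responses,
    summed over all scales of the same orientation.
    """
    return [
        sum(sum(v * v for row in cell for v in row) for cell in scales)
        for scales in gabor_cells
    ]

def dominant_orientation(energies):
    """Normalize energies to sum to one and index the largest.

    Returns (index, normalized_energies); index is None for a flat,
    zero-energy cell.
    """
    total = sum(energies)
    if total == 0:
        return None, [0.0] * len(energies)
    norm = [e / total for e in energies]
    return max(range(len(norm)), key=norm.__getitem__), norm
```

The returned index is the position of the dominant orientation in the length-N energy vector, which is then used to select the GH bin to enhance.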

Experiment set-up
The UAV camera used for data acquisition has a 3840 x 2160 resolution paired with a 3-axis gimbal mechanism for optical image stabilization. The 1280 x 720 resolution is used instead of the maximum to simulate the actual distance of a surveillance camera drone to the Point-of-Interest. This approach allows the UAV to fly just 30 m above the ground while simulating a 100 m flight. The three improvements on HOG, namely HEOG, ImHOG, and GeHOG (proposed), are implemented on a workstation with an Intel i7 3.20 GHz CPU and 8 GB of RAM. The images acquired from the UAV are forwarded to the ground workstation, where human detection is implemented. The original parameter set-up of the conventional HOG is applied to all feature descriptors, i.e., a cell size of 8 by 8 pixels, a block size of 2 by 2 cells, and 9 orientation bins. For the proposed method, the Gabor filter bank is created using five scales, eighteen orientations that represent the GH bins from 0° to 360° at 20° intervals, and an eight-pixel wavelength to match the dimension of each cell. Based on the original parameter set-up of HOG, the lengths of the feature vectors for the improved descriptors are as follows: HEOG has 3,780 values (105 blocks by 4 cells with 9 bins each), while ImHOG and GeHOG have 4,200 (105 blocks by 4 cells with 10 bins each). To contrast the different feature descriptors simply, linear SVM is used as the base classifier model. Suppose G(xj, xk) is element (j,k) of the Gram matrix, where xj and xk are n-dimensional vectors representing observations j and k in x. The linear kernel can be described as in (14). Also, five-fold cross-validation is applied to all datasets during training to reduce the possibility of overfitting the data.
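The feature lengths and the linear kernel above can be checked with a small Python sketch. It assumes a 64 x 128 detection window and a block stride of one cell, which together give the 105 blocks quoted in the text. The linear kernel is taken in its usual form, the inner product of the two observation vectors. The function names are hypothetical.

```python
def hog_length(win_w=64, win_h=128, cell=8, block=2, bins=9):
    """Feature length for overlapping blocks with a one-cell stride."""
    cells_x, cells_y = win_w // cell, win_h // cell
    blocks = (cells_x - block + 1) * (cells_y - block + 1)  # 7 * 15 = 105
    return blocks * block * block * bins

def linear_gram(xs):
    """Gram matrix of the linear kernel: G[j][k] = <x_j, x_k>."""
    return [[sum(a * b for a, b in zip(xj, xk)) for xk in xs] for xj in xs]

# Eighteen Gabor orientations at 20-degree intervals, in degrees.
orientations = list(range(0, 360, 20))

print(hog_length(bins=9))   # HEOG: 105 blocks x 4 cells x 9 bins
print(hog_length(bins=10))  # ImHOG and GeHOG: one extra bin per cell
```

The extra bin per cell accounts for the whole difference between the two lengths: 105 blocks x 4 cells x 1 bin = 420 additional values.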

Quantitative evaluation
The F1-score derived from the precision and recall of the linear SVM classifier models is used to evaluate the performance of the feature extraction algorithms. The SVM models are trained separately using HEOG, ImHOG, and GeHOG feature vectors. Precision, recall, and F1-score measurements are calculated for each algorithm. Precision is the classifier's ability to detect only the relevant data points, while recall pertains to the classifier's ability to detect all relevant cases in the dataset. F1-score is a measure that combines the merits of precision and recall in one number. Precision, recall, and F1-score can be computed as described in equations (15) to (17).
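As a minimal sketch, the three measures can be computed from the counts of true positives, false positives, and false negatives using their standard definitions. The function name and the zero-division handling are choices made for this illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-score from detection counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Undefined ratios (zero denominators) are returned as 0.0.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is used here to judge the balance between the two.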

Results and Discussion
This section presents the results of the experiments using the proposed descriptor and the two other feature descriptors in human detection. Part of the proposed algorithm is the enhancement modulus, β, which distinguishes GeHOG from other HOG variations. Several functions are evaluated to serve as the enhancement modulus. Table 3 presents the evaluation of the various enhancement moduli derived from Gabor images, using the following convention to describe how each modulus is calculated. The cell refers to an 8 x 8 region of the Gabor image, and max{} finds the maximum pixel value in the Gabor cell. The sum() refers to the summation of all pixel values, Mask is an overlaying kernel, and eigenvalue() extracts the eigenvalue of the 8 x 8 matrix. Fig. 3 shows the orientation masks used as well as the arrays used in generating Gabor cells. The last row is the modulus selected for enhancing the HOG feature descriptor. It is selected based on the F1-score of the classifier when that modulus is used. Although it yields lower precision than the other moduli, it has the highest recall. The cell described in Table 3 has the same size as the conventional HOG cell. The Gabor cells contain the absolute values derived from filtering.
Fig. 4 compares the histograms generated using HEOG, ImHOG, and GeHOG. The first column shows the contrasting symmetrical patterns (patterns 1 to 6) in a local 8 x 8 cell. Consider pattern 1 and its inverse, pattern 2. HEOG and ImHOG generate similar feature histograms for this pair of patterns, which shows the inability of HEOG and ImHOG to discriminate between the contrasting patterns due to the inherent weakness of combining the reversed angles of the Gradient Histogram. However, this is not the case for GeHOG: the 10th bin in the histogram changes with the contrast of a pattern.
The modulus used to enhance the gradient histogram enables the proposed feature descriptor to capture the texture and the most dominant orientation. With the use of Gabor filtering, GeHOG enhances the original HOG feature so that it adequately discriminates contrasting symmetrical patterns at the local level. Similarly, for patterns 3 to 6, the addition of an extra bin in the proposed feature makes the HOG patterns distinguishable. The extra bin (10th bin) is influenced by the other bins' local energy and weighted vote, as described by the enhancement modulus β. Therefore, the details carried by the Gabor images are embedded in that 10th bin. It can also be noticed that the 10th bin in ImHOG cannot discriminate between symmetrical patterns.
Precision, recall, and F1-scores are calculated to evaluate the relevance of the results of the SVM classifier using the feature descriptors HEOG, ImHOG, and GeHOG. Table 4 presents the classification F1-score derived from precision and recall. As shown there, the proposed method performed better than the other two feature descriptors in terms of the balance between precision and recall.

Conclusion
This paper demonstrates that Gabor filters can be used to enhance a well-known feature such as HOG and its recent improvements. By incorporating the local energy calculations of Gabor images into the HOG feature vector, the influence of such a feature can be increased. This allows any classifier to adequately discriminate local patterns that are both low in contrast and symmetrical. Initially, Gabor images are generated from a sample image. The number of Gabor images is equal to the number of scales multiplied by the number of orientation bins. The accumulated local energy per bin is calculated, and the bin with the highest energy content is indexed. The indexed bin corresponds to the Gradient Histogram bin that is enhanced. Finally, a new feature vector is generated by adding the enhancement modulus to each cell histogram. GeHOG is evaluated on a portion of the INRIA dataset and on FDS. Experimental results show that GeHOG outperformed the recent improvements of HOG, i.e. HEOG and ImHOG, during the testing process.
However, the diverse pose variations of farmers working in the field remain a challenge and are left for future work. In aerial images, where the camera's position relative to the working farmers is very diverse, the human figure can assume almost all possible poses. The ongoing work on this project is developing a monitoring system for farmers' activities based on their poses. The use of GeHOG can provide adequate discrimination between nearly similar and symmetrical patterns, thus increasing the capability to differentiate poses. Moreover, dimensionality reduction methods were not implemented in this work but could significantly increase speed. The use of GeHOG in human detection for low-resolution images can be enhanced further by adopting region proposal methods in which the trust areas are identified before detection.