Flatten-T Swish: a thresholded ReLU-Swish-like activation function for deep learning

(1) * Hock Hung Chieng (Universiti Tun Hussein Onn Malaysia, Malaysia)
(2) Noorhaniza Wahid (Universiti Tun Hussein Onn Malaysia, Malaysia)
(3) Ong Pauline (Universiti Tun Hussein Onn Malaysia, Malaysia)
(4) Sai Raj Kishore Perla (Institute of Engineering and Management, India)
*corresponding author


Activation functions are essential for deep learning methods to learn and perform complex tasks such as image classification. The Rectified Linear Unit (ReLU) has been widely used and has been the default activation function across the deep learning community since 2012. Despite its popularity, the hard-zero property of ReLU heavily hinders negative values from propagating through the network. Consequently, deep neural networks have not benefited from negative representations. This work proposes an activation function called Flatten-T Swish (FTS) that leverages the benefit of negative values. To verify its performance, this study evaluates FTS against ReLU and several recent activation functions. Each activation function is used to train five different deep fully connected neural networks (DFNNs), with depths varying from five to eight layers, on the MNIST dataset. For a fair evaluation, all DFNNs use the same configuration settings. Based on the experimental results, FTS with a threshold value of T=-0.20 has the best overall performance. Compared with ReLU, FTS (T=-0.20) improves MNIST classification accuracy by 0.13%, 0.70%, 0.67%, 1.07% and 1.15% on the wider 5-layer, slimmer 5-layer, 6-layer, 7-layer and 8-layer DFNNs, respectively. The study also observed that FTS converges twice as fast as ReLU. Although other existing activation functions are also evaluated, this study uses ReLU as the baseline activation function.
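
The abstract describes FTS only in words and does not state its formula. The sketch below is a minimal illustration of the idea. It assumes the piecewise form FTS(x) = x * sigmoid(x) + T for x >= 0 and FTS(x) = T otherwise, and uses T = -0.20, the threshold value reported above. The function names are hypothetical, and the code is not the authors' implementation.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relu(x: float) -> float:
    """Standard ReLU: negative inputs are clamped to a hard zero."""
    return max(0.0, x)


def flatten_t_swish(x: float, t: float = -0.20) -> float:
    """Assumed FTS form: x * sigmoid(x) + t for x >= 0, else the constant t.

    With t < 0 the unit outputs a fixed negative value for negative inputs,
    instead of the hard zero produced by ReLU.
    """
    if x >= 0:
        return x * sigmoid(x) + t
    return t


if __name__ == "__main__":
    # Compare the two activations on a few sample inputs.
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  relu={relu(x):.4f}  fts={flatten_t_swish(x):.4f}")
```

Under this assumed form, every negative input maps to the same value T. The negative region is flat, as the name suggests, and the threshold shifts the whole output range downward, so some negative signal reaches the next layer.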


Deep learning; Activation function; Flatten-T Swish; Fully connected neural networks








Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

International Journal of Advances in Intelligent Informatics
ISSN 2442-6571  (print) | 2548-3161 (online)
Organized by UAD and ASCEE Computer Society
Published by Universitas Ahmad Dahlan
W: http://ijain.org
E: info@ijain.org (paper handling issues)
   andri.pranolo.id@ieee.org (publication issues)
