Internal and collective interpretation for improving human interpretability of multi-layered neural networks

(1) * Ryotaro Kamimura (Kumamoto Drone Technology and Development Foundation; IT Education Center, Tokai University, Japan)
*corresponding author


This paper proposes a new information-theoretic method for interpreting the inference mechanism of neural networks. The internal inference mechanism is interpreted in its own terms, without external methods such as symbolic or fuzzy rules. In addition, the interpretation process is made as stable as possible: the inference mechanism is interpreted by taking into account all internal representations created under different conditions and patterns. To make this internal interpretation possible, multi-layered neural networks are compressed into the simplest possible networks, those without hidden layers. The information naturally lost during compression is then compensated for by a mutual information augmentation component. The method was applied to two data sets, the glass data set and the pregnancy data set. On both, information augmentation combined with compression improved generalization performance. Moreover, the compressed (collective) weights of the multi-layered networks tended, ironically, to resemble the linear correlation coefficients between inputs and targets, whereas conventional methods such as logistic regression analysis failed to do so.
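The compression and comparison steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration: compression is taken to be the ordered product of the layer weight matrices (activation functions and biases ignored), collective weights are taken to be the average of the compressed weights over several networks, and the function names `compress_layers`, `collective_weights`, and `input_target_correlations` are hypothetical. The mutual information augmentation component is not modelled here.

```python
import numpy as np

def compress_layers(weights):
    """Collapse layer weight matrices into one input-to-output matrix.

    Assumption: compression is the ordered matrix product of the layer
    weights; nonlinear activations and biases are ignored.
    """
    compressed = weights[0]
    for w in weights[1:]:
        compressed = compressed @ w
    return compressed

def collective_weights(networks):
    """Average compressed weights over several networks.

    Assumption: "collective" means a plain mean over networks obtained
    under different conditions (for example, different initial weights).
    """
    return np.mean([compress_layers(ws) for ws in networks], axis=0)

def input_target_correlations(x, t):
    """Pearson correlation between every input column and target column."""
    xc = x - x.mean(axis=0)
    tc = t - t.mean(axis=0)
    cov = xc.T @ tc
    norms = np.outer(np.linalg.norm(xc, axis=0), np.linalg.norm(tc, axis=0))
    return cov / norms

rng = np.random.default_rng(0)

# Hypothetical setup: 3 networks, each 4 inputs -> 5 -> 5 -> 1 output.
# The weights are random stand-ins, not trained weights.
networks = [
    [rng.normal(size=(4, 5)), rng.normal(size=(5, 5)), rng.normal(size=(5, 1))]
    for _ in range(3)
]
collective = collective_weights(networks)      # shape (4, 1)

# Synthetic data, used only to show how the comparison would be computed.
x = rng.normal(size=(200, 4))
t = x @ np.array([[1.0], [-0.5], [0.0], [2.0]]) + 0.1 * rng.normal(size=(200, 1))
corr = input_target_correlations(x, t)         # shape (4, 1)

# One scalar summary of how closely the two weight profiles agree.
similarity = np.corrcoef(collective.ravel(), corr.ravel())[0, 1]
print(collective.shape, corr.shape)
```

Because the weights here are random rather than trained, the resulting `similarity` carries no meaning; the sketch only shows the shape of the computation that would be applied to trained networks.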


Mutual information; Internal interpretation; Collective interpretation; Inference mechanism; Generalization







This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

International Journal of Advances in Intelligent Informatics
ISSN 2442-6571  (print) | 2548-3161 (online)
Organized by UAD and ASCEE Computer Society
Published by Universitas Ahmad Dahlan