(1) * Hakim Nasaoui (LAROSERI Laboratory, Department of Computer Science, Faculty of Sciences, Chouaïb Doukkali University, El Jadida, Morocco)
(2) Hassan Silkan (LAROSERI Laboratory, Department of Computer Science, Faculty of Sciences, Chouaïb Doukkali University, El Jadida, Morocco)
(3) Insaf Bellamine (LSATE, Sidi Mohamed Ben Abdellah University ENSA, Fes, Morocco)
*corresponding author

Abstract


Fine-grained surgical action recognition in laparoscopic videos remains challenging despite recent progress in deep learning. Current VideoMAE approaches reach 89.11% accuracy on cholecystectomy tasks but have three limitations. Random masking strategies often miss surgical instruments, which occupy only 10% to 15% of frames. Context-independent models struggle to separate visually similar actions that occur in different surgical phases. Symmetric two-stream architectures waste computational resources. To address these limitations, we developed SA-VideoMAE, a surgical-aware video masked autoencoder designed specifically for laparoscopic action recognition. The method uses surgical-aware adaptive masking, which integrates YOLOv7x object detection to prioritize instrument patches. This increased instrument visibility from 10% to 60% during training, so the model focuses on action-relevant regions rather than static backgrounds. Phase-conditioned hierarchical attention injects learnable phase embeddings into the attention mechanisms, enabling the model to disambiguate visually similar actions based on surgical context. For efficiency, an asymmetric dual-stream architecture processes RGB with ViT-Base (86M parameters) and optical flow with ViT-Tiny (5.7M parameters), a 47% parameter reduction compared to symmetric designs. Training balances reconstruction, classification, temporal consistency, and phase prediction through a novel multi-objective optimization strategy. On the Calot's Triangle Dissection phase of Cholec80, SA-VideoMAE reaches 93.5% accuracy, a 4.4 percentage point improvement over the verified baseline. Notably, recall on challenging actions improved from 51% to 74% while real-time inference was maintained at 62 ms per clip. These findings demonstrate that encoding surgical domain knowledge into video architectures significantly enhances action recognition performance.
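
The surgical-aware adaptive masking described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the sampling procedure, so everything below is an assumption made for illustration: the function name `surgical_aware_mask`, the 196-patch grid (14 x 14 patches per frame), the 90% overall masking ratio typical of VideoMAE, and the reading of "instrument visibility" as the fraction of instrument patches left unmasked. In practice the instrument patch indices would come from detector boxes (YOLOv7x in the paper); here they are passed in directly.

```python
import random


def surgical_aware_mask(num_patches, instrument_idx, mask_ratio=0.9,
                        instrument_keep=0.6, seed=0):
    """Return the set of patch indices to mask (hypothetical sketch).

    A fixed budget of visible patches is spent on instrument patches
    first, and the remainder is drawn from the background.
    """
    rng = random.Random(seed)
    instrument_set = {i for i in instrument_idx if 0 <= i < num_patches}
    instrument = sorted(instrument_set)
    background = [i for i in range(num_patches) if i not in instrument_set]

    # Visible-patch budget implied by the overall masking ratio.
    num_visible = num_patches - round(mask_ratio * num_patches)

    # Keep the target fraction of instrument patches, capped by the budget.
    keep_inst = min(round(instrument_keep * len(instrument)), num_visible)
    visible = set(rng.sample(instrument, keep_inst))

    # Fill the rest of the budget with randomly chosen background patches.
    keep_bg = min(num_visible - keep_inst, len(background))
    visible |= set(rng.sample(background, keep_bg))

    return set(range(num_patches)) - visible


if __name__ == "__main__":
    # Assume the detector marked the first 24 of 196 patches as instrument.
    mask = surgical_aware_mask(196, range(24))
    visible_instruments = set(range(24)) - mask
    print(len(mask), len(visible_instruments))
```

With these assumed settings, 176 of 196 patches are masked and 14 of the 24 instrument patches stay visible (about 58%), whereas uniform random masking at the same ratio would leave roughly 10% of them visible on average. Which patches are chosen depends on the seed; the counts do not.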

Keywords


surgical action recognition; masked autoencoders; phase-conditioned attention; asymmetric dual-stream; laparoscopic cholecystectomy

          


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

___________________________________________________________
International Journal of Advances in Intelligent Informatics
ISSN 2442-6571  (print) | 2548-3161 (online)
Organized by UAD and ASCEE Computer Society
Published by Universitas Ahmad Dahlan
W: http://ijain.org
E: info@ijain.org (paper handling issues)
 andri.pranolo.id@ieee.org (publication issues)
