Surgical-aware video masked autoencoders with phase-conditioned attention for laparoscopic action recognition

Hakim Nasaoui; Hassan Silkan; Insaf Bellamine

(1)* Hakim Nasaoui (LAROSERI Laboratory, Department of Computer Science, Faculty of Sciences, Chouaïb Doukkali University, El Jadida, Morocco)
(2) Hassan Silkan (LAROSERI Laboratory, Department of Computer Science, Faculty of Sciences, Chouaïb Doukkali University, El Jadida, Morocco)
(3) Insaf Bellamine (LSATE, Sidi Mohamed Ben Abdellah University ENSA, Fes, Morocco)
* Corresponding author

Abstract

Fine-grained surgical action recognition in laparoscopic videos remains challenging despite recent progress in deep learning. Current VideoMAE approaches reach 89.11% accuracy on cholecystectomy tasks, but they have three specific limitations. Random masking often misses surgical instruments, which occupy only 10% to 15% of each frame. Context-independent models struggle to separate visually similar actions that occur in different surgical phases. Symmetric two-stream architectures spend as many parameters on optical flow as on RGB, which wastes computation. To address these limitations, we developed SA-VideoMAE, a surgical-aware video masked autoencoder designed for laparoscopic action recognition. First, surgical-aware adaptive masking integrates YOLOv7x object detection to prioritize instrument patches. This raises instrument visibility from 10% to 60% during training, so the model focuses on action-relevant regions rather than static background. Second, phase-conditioned hierarchical attention injects learnable phase embeddings into the attention mechanism, enabling the model to disambiguate visually similar actions using surgical context. Third, an asymmetric dual-stream architecture processes RGB with ViT-Base (86M parameters) and optical flow with ViT-Tiny (5.7M parameters), a 47% parameter reduction compared to symmetric designs. Training balances reconstruction, classification, temporal consistency, and phase prediction through a multi-objective optimization strategy. On the Calot's Triangle Dissection phase of Cholec80, SA-VideoMAE reaches 93.5% accuracy, a 4.4 percentage point improvement over the verified baseline. Recall on challenging actions improves from 51% to 74%, and inference remains real-time at 62 ms per clip. These findings indicate that encoding surgical domain knowledge into video architectures substantially improves action recognition performance.
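As an illustration of two ideas in the abstract, the following Python sketch shows one possible way to bias a masking budget toward keeping instrument patches visible, and checks the parameter arithmetic behind the 47% figure. It is not the authors' implementation. The function names, the 0.9 mask ratio, the reading of "instrument visibility" as the fraction of instrument patches left unmasked, and the assumption that the symmetric design uses two ViT-Base streams are all hypothetical choices made for this example.

```python
import random


def instrument_aware_mask(instrument, mask_ratio=0.9, instrument_keep=0.6, seed=0):
    """Return a list of booleans where True means the patch is masked.

    instrument: list of booleans, True where a patch overlaps a detected
    instrument box (for example from an object detector).
    mask_ratio and instrument_keep are assumed values, not from the paper.
    """
    rng = random.Random(seed)
    n = len(instrument)
    # Number of patches the encoder is allowed to see.
    n_visible = n - int(round(n * mask_ratio))
    inst_idx = [i for i, flag in enumerate(instrument) if flag]
    bg_idx = [i for i, flag in enumerate(instrument) if not flag]
    # Keep a target fraction of instrument patches visible, capped by the budget.
    n_inst_visible = min(int(round(len(inst_idx) * instrument_keep)), n_visible)
    visible = set(rng.sample(inst_idx, n_inst_visible))
    # Spend whatever budget remains on randomly chosen background patches.
    n_bg_visible = min(n_visible - n_inst_visible, len(bg_idx))
    visible.update(rng.sample(bg_idx, n_bg_visible))
    return [i not in visible for i in range(n)]


def parameter_reduction(rgb_millions=86.0, flow_millions=5.7):
    """Fractional saving of ViT-Base + ViT-Tiny versus two ViT-Base streams."""
    symmetric = 2 * rgb_millions  # assumed symmetric baseline
    asymmetric = rgb_millions + flow_millions
    return 1.0 - asymmetric / symmetric


if __name__ == "__main__":
    # A 14 x 14 patch grid in which 20 patches (about 10%) cover instruments.
    flags = [i < 20 for i in range(196)]
    mask = instrument_aware_mask(flags)
    kept = sum(1 for i in range(20) if not mask[i])
    print("masked patches:", sum(mask), "visible instrument patches:", kept)
    print("parameter reduction: {:.1%}".format(parameter_reduction()))
```

Under these assumptions, the parameter saving works out to (172 − 91.7) / 172, or about 46.7%, consistent with the 47% quoted in the abstract.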

Keywords

surgical action recognition; masked autoencoders; phase-conditioned attention; asymmetric dual-stream; laparoscopic cholecystectomy

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

___________________________________________________________
International Journal of Advances in Intelligent Informatics
ISSN 2442-6571 (print) | 2548-3161 (online)
Organized by UAD and ASCEE Computer Society
Published by Universitas Ahmad Dahlan
W: http://ijain.org
E: info@ijain.org (paper handling issues)
andri.pranolo.id@ieee.org (publication issues)
