Innovation Series: Advanced Science (ISSN 2938-9933, CNKI Indexed)

Volume 3 · Issue 3 (2026)

A Sparse-Gated Cross-Modal Alignment Framework for Sewer Defect Image-Text Retrieval

 

Chuanxi Zhu*, Kehe Wu

School of Control & Computer Science, North China Electric Power University, Beijing, China

Corresponding Author: Chuanxi Zhu (zcx666@ncepu.edu.cn)

 

Abstract: We present SGC-Align, a Sparse-Gated Cross-Modal Alignment framework for pipeline defect image-text retrieval. SGC-Align replaces dense pairwise similarity with a learnable sparse-gated Hybrid Similarity Mask (HSM) that selectively filters and reweights cross-modal interaction channels: it suppresses noise, amplifies semantically salient signals, and yields interpretable mask visualizations. To capture fine-grained correspondences, we introduce Multidirectional Cyclic Cross-Modal Attention (MCCA), a multi-round, bidirectional attention mechanism that iteratively refines local region-to-text alignments. We further propose a sparsity-oriented alignment loss (L_HSM), trained jointly with the mask, that encourages sparse and semantically coherent matches. Extensive experiments in pipeline-inspection retrieval settings demonstrate substantial improvements in Recall@K and mAP over strong baselines, as well as reduced computation and memory overhead owing to the sparse-first design.
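As a reading aid, the sketch below shows one plausible wiring of the two mechanisms in PyTorch: a gate network scores each region-token interaction channel, a top-k mask zeroes the rest, and bidirectional cross-attention refines both modalities over several rounds. All class names, shapes, and hyperparameters (SparseGatedSimilarityMask, MultidirectionalCyclicAttention, top_k, rounds) are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: names, shapes, and hyperparameters are
# assumptions, not the authors' released SGC-Align implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultidirectionalCyclicAttention(nn.Module):
    """One plausible reading of MCCA: alternate image->text and
    text->image cross-attention for several rounds, so each modality
    is refined against the other's latest state."""

    def __init__(self, dim: int, heads: int = 4, rounds: int = 2):
        super().__init__()
        self.rounds = rounds
        self.i2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2i = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, regions: torch.Tensor, tokens: torch.Tensor):
        # regions: (B, R, D) image region features; tokens: (B, T, D)
        for _ in range(self.rounds):
            tokens = tokens + self.i2t(tokens, regions, regions)[0]
            regions = regions + self.t2i(regions, tokens, tokens)[0]
        return regions, tokens


class SparseGatedSimilarityMask(nn.Module):
    """Gates the dense region-token similarity matrix and keeps only
    the top-k regions per text token (a sparse-first stand-in for HSM)."""

    def __init__(self, dim: int, top_k: int = 4):
        super().__init__()
        self.top_k = top_k
        # Hypothetical gate network scoring each region-token pair.
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, regions: torch.Tensor, tokens: torch.Tensor):
        regions = F.normalize(regions, dim=-1)
        tokens = F.normalize(tokens, dim=-1)
        sim = torch.einsum("brd,btd->brt", regions, tokens)  # (B, R, T)

        # Gate values from concatenated pair features: (B, R, T).
        pair = torch.cat(
            [
                regions.unsqueeze(2).expand(-1, -1, tokens.size(1), -1),
                tokens.unsqueeze(1).expand(-1, regions.size(1), -1, -1),
            ],
            dim=-1,
        )
        gate = torch.sigmoid(self.gate(pair)).squeeze(-1)

        # Sparse-first: keep the top-k regions per token, zero the rest.
        keep = torch.zeros_like(gate).scatter(
            1, gate.topk(self.top_k, dim=1).indices, 1.0
        )
        gated = sim * gate * keep

        # Image-text score: mean over tokens of the best surviving region.
        score = gated.max(dim=1).values.mean(dim=-1)  # (B,)
        return score, gate * keep


if __name__ == "__main__":
    B, R, T, D = 2, 36, 12, 256
    regions, tokens = torch.randn(B, R, D), torch.randn(B, T, D)
    regions, tokens = MultidirectionalCyclicAttention(D)(regions, tokens)
    score, active = SparseGatedSimilarityMask(D)(regions, tokens)
    sparsity_penalty = active.abs().mean()  # L1-style stand-in for L_HSM
    print(score.shape, sparsity_penalty.item())
```

An L1-style penalty on the surviving gate weights, as in the last lines above, plays the role the abstract assigns to L_HSM: it pushes the gate toward sparse, semantically coherent matches, while a standard contrastive retrieval objective (not shown) handles ranking.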

 

Keywords: Cross-modal retrieval; Image-text alignment; Sparse gating; Sewer defect inspection; Attention mechanism

 
