Innovation Series: Advanced Science (ISSN 2938-9933, CNKI Indexed)

Volume 3 · Issue 3 (2026)

A Sparse-Gated Cross-Modal Alignment Framework for Sewer Defect Image-Text Retrieval

 

Chuanxi Zhu*, Kehe Wu

School of Control & Computer Science, North China Electric Power University, Beijing, China

Corresponding Author: Chuanxi Zhu (zcx666@ncepu.edu.cn)

 

Abstract: We present SGC-Align, a Sparse-Gated Cross-Modal Alignment framework for pipeline defect image-text retrieval. SGC-Align replaces dense pairwise similarity with a learnable sparse-gated Hybrid Similarity Mask (HSM) that selectively filters and reweights cross-modal interaction channels: it suppresses noise, amplifies semantically salient signals, and yields interpretable mask visualizations. To capture fine-grained correspondences, we introduce Multidirectional Cyclic Cross-Modal Attention (MCCA), a multi-round, bidirectional attention mechanism that iteratively refines local region-to-text alignments. We further propose a sparsity-oriented alignment loss (L_HSM), trained jointly with the mask, that encourages sparse and semantically coherent matches. Extensive experiments in pipeline-inspection retrieval settings demonstrate substantial improvements in Recall@K and mAP over strong baselines, as well as reduced computation and memory overhead owing to the sparse-first design.
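As a reading aid, the sketch below shows one plausible wiring of the two mechanisms in PyTorch: a gate network scores each region-token interaction channel, a top-k mask zeroes the rest, and bidirectional cross-attention refines both modalities over several rounds. All class names, shapes, and hyperparameters (SparseGatedSimilarityMask, MultidirectionalCyclicAttention, top_k, rounds) are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: names, shapes, and hyperparameters are
# assumptions, not the authors' released SGC-Align implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultidirectionalCyclicAttention(nn.Module):
    """One plausible reading of MCCA: alternate image->text and
    text->image cross-attention for several rounds, so each modality
    is refined against the other's latest state."""

    def __init__(self, dim: int, heads: int = 4, rounds: int = 2):
        super().__init__()
        self.rounds = rounds
        self.i2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2i = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, regions: torch.Tensor, tokens: torch.Tensor):
        # regions: (B, R, D) image region features; tokens: (B, T, D)
        for _ in range(self.rounds):
            tokens = tokens + self.i2t(tokens, regions, regions)[0]
            regions = regions + self.t2i(regions, tokens, tokens)[0]
        return regions, tokens


class SparseGatedSimilarityMask(nn.Module):
    """Gates the dense region-token similarity matrix and keeps only
    the top-k regions per text token (a sparse-first stand-in for HSM)."""

    def __init__(self, dim: int, top_k: int = 4):
        super().__init__()
        self.top_k = top_k
        # Hypothetical gate network scoring each region-token pair.
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, regions: torch.Tensor, tokens: torch.Tensor):
        regions = F.normalize(regions, dim=-1)
        tokens = F.normalize(tokens, dim=-1)
        sim = torch.einsum("brd,btd->brt", regions, tokens)  # (B, R, T)

        # Gate values from concatenated pair features: (B, R, T).
        pair = torch.cat(
            [
                regions.unsqueeze(2).expand(-1, -1, tokens.size(1), -1),
                tokens.unsqueeze(1).expand(-1, regions.size(1), -1, -1),
            ],
            dim=-1,
        )
        gate = torch.sigmoid(self.gate(pair)).squeeze(-1)

        # Sparse-first: keep the top-k regions per token, zero the rest.
        keep = torch.zeros_like(gate).scatter(
            1, gate.topk(self.top_k, dim=1).indices, 1.0
        )
        gated = sim * gate * keep

        # Image-text score: mean over tokens of the best surviving region.
        score = gated.max(dim=1).values.mean(dim=-1)  # (B,)
        return score, gate * keep


if __name__ == "__main__":
    B, R, T, D = 2, 36, 12, 256
    regions, tokens = torch.randn(B, R, D), torch.randn(B, T, D)
    regions, tokens = MultidirectionalCyclicAttention(D)(regions, tokens)
    score, active = SparseGatedSimilarityMask(D)(regions, tokens)
    sparsity_penalty = active.abs().mean()  # L1-style stand-in for L_HSM
    print(score.shape, sparsity_penalty.item())
```

An L1-style penalty on the surviving gate weights, as in the last lines above, plays the role the abstract assigns to L_HSM: it pushes the gate toward sparse, semantically coherent matches, while a standard contrastive retrieval objective (not shown) handles ranking.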

 

Keywords: Cross-modal retrieval; Image-text alignment; Sparse gating; Sewer defect inspection; Attention mechanism

 
