Innovation Series: Advanced Science (ISSN 2938-9933, CNKI Indexed)

Volume 3 · Issue 3 (2026)

GSDP-ViT: A Lightweight Vision Transformer for Cervical Histopathological Image Classification

 

Liping Peng

College of Computer and Information Science, Chongqing Normal University, Chongqing, China

 

Abstract: Cervical cancer is the fourth most common cancer among women worldwide, and histopathological image analysis remains the gold standard for diagnosing cervical precancerous lesions. However, the high morphological similarity among lesion subtypes and the significant information sparsity of pathological microscopic images pose considerable challenges for automated classification. Current Vision Transformer (ViT) models applied to this task are limited by redundant feature generation in the tokenization stage, the lack of frequency-domain texture feature extraction, and wasteful computational allocation to uninformative image regions. To overcome these limitations, this paper proposes GSDP-ViT, a lightweight Vision Transformer that incorporates three novel components. First, a Ghost Convolutional Tokenizer (GCT) generates diverse feature maps from fewer parameters by augmenting intrinsic features through cheap depthwise linear operations. Second, a Spectral–Spatial Dual Attention (SSDA) mechanism simultaneously captures spatial morphological patterns and frequency-domain textural features via Discrete Cosine Transform (DCT)-based parallel attention, with a learnable gating mechanism for adaptive fusion. Third, a Progressive Token Pruning (PTP) strategy dynamically evaluates token importance at each Transformer layer and removes the least informative tokens layer by layer, concentrating computation on pathologically relevant regions. Experiments on the LDCH cervical histopathological image dataset show that GSDP-ViT achieves 87.85% accuracy and 68.21% macro-F1 with only 0.47M parameters and 0.89 GFLOPs, surpassing several state-of-the-art models while maintaining superior computational efficiency. Ablation experiments validate the effectiveness and synergistic benefits of each component.
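To make the Ghost-convolution saving concrete, the following is a minimal sketch of the parameter arithmetic behind a Ghost module in the style of GhostNet, which the GCT builds on: a primary convolution produces a fraction of the output channels, and cheap depthwise operations generate the rest. The channel counts, kernel sizes, and ratio `s` below are illustrative assumptions, not the paper's actual GCT configuration.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def ghost_params(c_in: int, c_out: int, k: int, s: int = 2, d: int = 3) -> int:
    """Ghost module: a primary k x k conv produces c_out/s intrinsic maps,
    then cheap d x d depthwise linear ops generate the remaining maps."""
    m = c_out // s                  # intrinsic feature maps
    primary = c_in * m * k * k      # ordinary convolution
    cheap = m * (s - 1) * d * d     # depthwise "cheap operations"
    return primary + cheap

# Illustrative tokenizer stem: 3 -> 64 channels with 3x3 kernels, s = 2
std = conv_params(3, 64, 3)     # 1728 weights
ghost = ghost_params(3, 64, 3)  # 864 + 288 = 1152 weights
```

For larger input channel counts the saving approaches the ratio `s`, which is why Ghost-style tokenization yields diverse feature maps from fewer parameters.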

 

Keywords: Cervical histopathological images; Ghost convolution; Spectral–spatial dual attention; Dynamic token pruning; Lightweight Vision Transformer
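The layer-by-layer pruning idea behind PTP can be sketched in a few lines: score every token at a layer, keep only the highest-scoring fraction, and repeat at each subsequent layer so computation concentrates on informative regions. The scoring function, keep ratio, and depth below are toy assumptions for illustration; the paper's PTP learns token importance inside the Transformer and would typically exempt the class token.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving order."""
    k = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])        # restore original token order
    return [tokens[i] for i in kept]

def progressive_pruning(tokens, score_fn, depth, keep_ratio=0.7):
    """Apply pruning layer by layer across `depth` Transformer layers."""
    for _ in range(depth):
        scores = [score_fn(t) for t in tokens]
        tokens = prune_tokens(tokens, scores, keep_ratio)
    return tokens

# 196 patch tokens pruned over 4 layers at a 0.7 keep ratio:
# 196 -> 137 -> 95 -> 66 -> 46 tokens reach the final layers.
remaining = progressive_pruning(list(range(196)), lambda t: t % 7, 4)
```

Because attention cost is quadratic in token count, even a moderate per-layer keep ratio compounds into a large FLOPs reduction in the deeper layers.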

 

References

[1] Arbyn M, Weiderpass E, Bruni L, et al. Estimates of incidence and mortality of cervical cancer in 2018: A worldwide analysis. The Lancet Global Health, 2020, 8(2): e191–e203.
[2] Schiffman M, Doorbar J, Wentzensen N, et al. Carcinogenic human papillomavirus infection. Nature Reviews Disease Primers, 2016, 2: 16086.
[3] Arbyn M, Anttila A, Jordan J, et al. European guidelines for quality assurance in cervical cancer screening. Annals of Oncology, 2010, 21(3): 448–458.
[4] Gurcan M N, Boucheron L E, Can A, et al. Histopathological image analysis: A review. IEEE Reviews in Biomedical Engineering, 2009, 2: 147–171.
[5] Krizhevsky A, Sutskever I, Hinton G. ImageNet classification with deep convolutional neural networks.
[6] Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16×16 words: Transformers for image recognition at scale.
[7] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition.
[8] Touvron H, Cord M, Douze M, et al. Training data-efficient image transformers & distillation through attention.
[9] Lin J, Roy S, Li H, et al. Super vision transformer.
[10] Huang G, Liu Z, Van Der Maaten L, et al. Densely connected convolutional networks.
[11] Litjens G, Kooi T, Bejnordi B E, et al. A survey on deep learning in medical image analysis. Medical Image Analysis, 2017, 42: 60–88.
[12] Chen R J, Lu M Y, Wang J, et al. Pathomic fusion: An integrated framework for fusing histopathology and genomic features for cancer diagnosis. IEEE Transactions on Medical Imaging, 2021, 40(3): 757–770.
[13] Liu Z, Lin Y, Cao Y, et al. Swin Transformer: Hierarchical vision transformer using shifted windows.
[14] Chu X, Tian Z, Wang Y, et al. Conditional positional encodings for vision transformers.
[15] Guo J, Han K, Wu H, et al. CMT: Convolutional neural networks meet vision transformers.
[16] Dong X, Bao J, Chen D, et al. CSWin Transformer: A general vision transformer backbone with cross-shaped windows.
[17] Xu K, Qin M, Sun F, et al. Learning in the frequency domain.
[18] Han K, Xiao A, Wu E, et al. Transformer in transformer.
[19] Yuan L, Chen Y, Wang T, et al. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet.
[20] Liang Y, Ge C, Tong Z, et al. Not all tokens are equal: Human-centric visual analysis via token reorganization. In: CVPR, 2022.
[21] Dai Y, Gao Y, Liu F. TransMed: Transformers advance multi-modal medical image classification. Diagnostics, 2021, 11(8): 1384.
[22] Yu S, Li J, Liu Z, et al. MIL-VT: Multiple instance learning enhanced vision transformer for fundus image classification.
[23] Hassani A, Walton S, Li J, et al. Escaping the big data paradigm with compact transformers.
[24] Yao H, Zhang X, Zhou X, et al. EVT: Enhanced Vision Transformer with wavelet position embedding for histopathological image classification. 2024.
[25] Sandler M, Howard A, Zhu M, et al. MobileNetV2: Inverted residuals and linear bottlenecks.
[26] Ma N, Zhang X, Zheng H, et al. ShuffleNet V2: Practical guidelines for efficient CNN architecture design.
[27] Han K, Wang Y, Tian Q, et al. GhostNet: More features from cheap operations.
[28] Tan M, Le Q. EfficientNet: Rethinking model scaling for convolutional neural networks.
[29] Rao Y, Zhao W, Liu B, et al. DynamicViT: Efficient vision transformers with dynamic token sparsification.
[30] Yin H, Molchanov P, Alvarez J M, et al. A-ViT: Adaptive tokens for efficient vision transformer.
[31] Qin Z, Zhang P, Wu F, et al. FcaNet: Frequency channel attention networks.
[32] Guibas J, Mardani M, Karras T, et al. Adaptive Fourier neural operators: Efficient token mixers for transformers.
[33] Komura D, Ishikawa S. Machine learning methods for histopathological image analysis. Computational and Structural Biotechnology Journal, 2018, 16: 34–42.
