
MMTok

Multimodal Coverage Maximization for Efficient Inference of VLMs

🚀 Intelligent Vision Token Pruning for VLM Acceleration

⚡ 1.87× Speedup on H100
🎯 87.7% F1 with 4 tokens on POPE
🧠 Multimodal Coverage Maximization

Authors

Sixun Dong¹ · Arizona State University · sdong46@asu.edu (work done during an internship at Zoom)

Juhua Hu³ · University of Washington · juhuah@uw.edu

Mian Zhang⁴ · University of Texas at Dallas · mian.zhang@utdallas.edu (work done during an internship at Zoom)

Ming Yin⁵ · Duke University · ming.yin@duke.edu (work done during an internship at Zoom)

Yanjie Fu¹ · Arizona State University · yanjie.fu@asu.edu

Qi Qian²† · Zoom Communications · qianq.mail@gmail.com (†Corresponding author)

¹Arizona State University    ²Zoom Communications    ³University of Washington

⁴University of Texas at Dallas    ⁵Duke University

Abstract

Vision-Language Models (VLMs) demonstrate impressive performance in understanding visual content under language instructions by converting visual inputs into vision tokens. However, redundancy among vision tokens degrades the inference efficiency of VLMs. While many algorithms have been proposed to reduce the number of vision tokens, most of them rely on unimodal information (i.e., vision or text) for pruning and ignore the inherently multimodal nature of vision-language tasks. To address this limitation, we propose to leverage both vision and text tokens when selecting informative vision tokens. We first formulate the subset selection task as a maximum coverage problem, and then optimize a subset of vision tokens to simultaneously cover the text tokens and the original set of vision tokens. Under this maximum coverage criterion, our method achieves a 1.87× speedup on the POPE dataset while maintaining 98.7% of the original performance of LLaVA-Next-13B. Furthermore, with only four vision tokens, it still preserves 87.7% of the original performance of LLaVA-1.5-7B.
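To make the coverage idea more concrete, the snippet below is a minimal sketch (not the official implementation) of greedy selection under a joint coverage objective. It assumes `vision_tokens` [N, D] and `text_tokens` [M, D] are L2-normalized features in a shared embedding space, so cosine similarity is a dot product; the balancing weight `alpha` and the zero-initialized coverage are illustrative choices, not values from the paper.

```python
import torch

def select_tokens(vision_tokens: torch.Tensor,
                  text_tokens: torch.Tensor,
                  k: int,
                  alpha: float = 0.5) -> torch.Tensor:
    """Greedily pick k vision tokens that jointly cover the text tokens
    and the full set of vision tokens, measured by cosine similarity."""
    # Pairwise similarities: how well each candidate covers each target.
    sim_vv = vision_tokens @ vision_tokens.T   # [N, N] coverage of vision set
    sim_vt = vision_tokens @ text_tokens.T     # [N, M] coverage of text tokens

    n = vision_tokens.size(0)
    selected: list[int] = []
    # Best coverage achieved so far for every target token; starting at zero
    # means only positive similarity counts as coverage.
    best_v = torch.zeros(n)
    best_t = torch.zeros(text_tokens.size(0))

    for _ in range(k):
        # Marginal gain of adding each candidate: how much the per-target
        # coverage maxima would improve, combining both modalities.
        gain_v = torch.clamp(sim_vv - best_v, min=0).sum(dim=1)
        gain_t = torch.clamp(sim_vt - best_t, min=0).sum(dim=1)
        gain = alpha * gain_t + (1 - alpha) * gain_v
        if selected:
            gain[selected] = float("-inf")     # never re-pick a token

        idx = int(gain.argmax())
        selected.append(idx)
        best_v = torch.maximum(best_v, sim_vv[idx])
        best_t = torch.maximum(best_t, sim_vt[idx])

    return torch.tensor(selected)
```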

Method Overview

Multimodal coverage maximization framework for intelligent token selection


MMTok Architecture: Efficient vision token pruning through multimodal coverage maximization
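As a rough guide to where such pruning sits in a LLaVA-style pipeline, here is an illustrative sketch that reuses the `select_tokens` function above. The module names and the language-model call signature are placeholders for the host VLM's components, not the actual LLaVA API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def prune_and_generate(image, text_embeds, vision_encoder, projector,
                       language_model, k: int = 64):
    # 1. Encode the image into the full set of vision tokens, e.g. [576, D].
    vision_feats = vision_encoder(image)

    # 2. Project vision features into the language embedding space and
    #    normalize both modalities so cosine similarity is a dot product.
    projected = projector(vision_feats)
    v = F.normalize(projected, dim=-1)
    t = F.normalize(text_embeds, dim=-1)

    # 3. Keep only the k vision tokens with maximal multimodal coverage.
    keep = select_tokens(v, t, k)
    pruned = projected[keep]

    # 4. Feed the shortened sequence to the language model; the reduced
    #    prefill length is where the inference speedup comes from.
    return language_model(vision_tokens=pruned, text_embeds=text_embeds)
```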

Key Results

Significant acceleration across multiple benchmark datasets

Inference Speedup

1.87×

On an H100 GPU (even higher on other GPUs)

Performance Retained

95%+

At the highest pruning ratio on LLaVA-1.5 and LLaVA-Next (7B and 13B)

Extreme Compression

87.7%

With only 4 vision tokens on POPE (LLaVA-1.5-7B)

Performance Comparison

Comprehensive comparison with existing methods


Performance comparison: MMTok results across multiple models and datasets

Resources

Citation

@misc{dong2025mmtokmultimodalcoveragemaximization,
      title={MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs}, 
      author={Sixun Dong and Juhua Hu and Mian Zhang and Ming Yin and Yanjie Fu and Qi Qian},
      year={2025},
      eprint={2508.18264},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.18264}, 
}

Acknowledgments

We thank Zoom Communications for providing internship opportunities and research support. We also appreciate the multimodal learning community for providing comprehensive benchmark datasets and baseline implementations.

Special thanks go to our collaborators for their constructive feedback and support. In particular, Yebowen Hu offered valuable discussions and feedback, while Kaiqiang Song contributed many insightful discussions, extensive assistance with computational resource scheduling, and helpful exchanges that enriched our learning. We also acknowledge the support from Zoomies.

This work was supported by Zoom Communications, including the computational resources used in our experiments. We gratefully acknowledge this generous support.

Discussion

Questions & Feedback

Have questions about our method or want to discuss the results? We welcome all questions, discussions, and constructive feedback!

Issues & Improvements

Found issues with our implementation or have suggestions for improvements? Please open an issue on our GitHub repository.

Collaboration

Interested in collaborating or extending this work? We're always open to new research partnerships and joint projects.

Contact us: sdong46@asu.edu

GitHub Issues: MMTok Issues