Sixun Dong

Sixun Dong (Ironieser)

Multimodal Learning, VLM, LLM Agent

Independent Researcher, AZ, USA

Currently, I am an independent researcher. I completed my Master's at ShanghaiTech University under Professor Shenghua Gao.

My research focuses on multimodal AI systems that bridge computer vision, natural language processing, and machine learning across a range of applications.

Recent News

Jan 2026 🎉 Paper on MMTok accepted to ICLR 2026! Thanks to my mentor Qi Qian and all coauthors.
Jan 2026 🎉 Robust Dysarthric Speech Recognition accepted to ICASSP 2026. Congratulations to Xiuwen Zheng!
Sep 2025 🎉 Sculpting Features from Noise accepted to NeurIPS 2025. Congratulations to Nanxu Gong and Zijun Li!
Aug 2025 💼 Completed GenAI Research Internship at Zoom Inc., focusing on efficient vision–language modeling. Grateful to my mentor Qi Qian and the Zoom team.
Aug 2025 📄 Paper on LiveMCP-101, a new benchmark testing AI agents' real-world tool use, released on arXiv
Aug 2025 📄 Paper on LogicIF (Complex Logical Instruction Generation) released on arXiv
Aug 2025 📄 Paper on TimesCLIP, a new multimodal approach to time series forecasting with CLIP, released on arXiv
May 2025 💼 Started GenAI Research Internship at Zoom Inc., focusing on efficient vision–language modeling
Jan 2024 💼 Completed Team Leader internship at DGene (Digital Human Algorithm Dept.), leading co-speech gesture generation and 3D human body reconstruction projects.
Feb 2024 🎉 Paper on MLLM-Tool accepted to WACV 2025
Aug 2023 💼 Completed Team Leader internship at Transsion Holdings (Audio-Video Generation Dept.), leading audio-driven talking-head video generation research with SoTA performance.
Mar 2023 🎉 Paper on WeakSVR accepted to CVPR 2023
Mar 2022 🎉 Paper on TransRAC accepted as an 🏆 oral presentation at CVPR 2022

Selected Publications

ICASSP'26 Towards Robust Dysarthric Speech Recognition: LLM-Agent Post-ASR Correction Beyond WER

Xiuwen Zheng, Sixun Dong, Bornali Phukon, Mark Hasegawa-Johnson, Chang D. Yoo

ICLR'26 MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs

Sixun Dong, Juhua Hu, Mian Zhang, Ming Yin, Yanjie Fu, Qi Qian

arXiv'2506 Teaching Time Series to See and Speak: Forecasting with Aligned Visual and Textual Perspectives

Sixun Dong, Wei Fan, Teresa Wu, Yanjie Fu

NeurIPS'25 Sculpting Features from Noise: Reward-Guided Hierarchical Diffusion for Task-Optimal Feature Transformation

Nanxu Gong*, Zijun Li*, Sixun Dong, Haoyue Bai, Wangyang Ying, Xinyuan Wang, Yanjie Fu

WACV'25 MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

Chenyu Wang, Weixin Luo, Sixun Dong, Xiaohua Xuan, Zhengxin Li, Lin Ma, Shenghua Gao

3DV'24 RoomDesigner: Encoding Anchor-latents for Style-consistent and Shape-compatible Indoor Scene Generation

Yiqun Zhao, Zibo Zhao, Jing Li, Sixun Dong, Shenghua Gao

CVPR'23 Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos

Sixun Dong*, Huazhang Hu*, Dongze Lian, Weixin Luo, Yicheng Qian, Shenghua Gao

Experience

GenAI Research Intern

Zoom Inc., GenAI Research Group

May 2025 - Aug 2025

Worked on VLMs and LLM agents. Published one first-author paper on efficient VLM inference and two collaborative papers on LLM evaluation.

Research Intern (Team Leader)

DGene, Digital Human Algorithm Department

Aug 2023 - Jan 2024

Led digital human projects: co-speech gesture generation and 3D human body reconstruction with <7% measurement error.

Research Intern (Team Leader)

Transsion Holdings, Audio-Video Generation Department

Apr 2023 - Aug 2023

Led audio-driven talking-head video generation research, achieving SoTA performance on commercial and academic benchmarks.

Academic Service

Reviewer

Conferences: CVPR (2023–2026), ICCV (2023, 2025), ECCV (2024, 2026), NeurIPS (2025), ICML (2025, 2026), ICLR (2026), ACM MM (2023–2025), ACCV (2024), KDD (2025)

Journals: IEEE Transactions on Multimedia, Neural Networks (Elsevier), ACM Transactions on Knowledge Discovery from Data

Education

2021 - 2024

M.S. in Computer Science

ShanghaiTech University, China

SVIP-Lab, Advisor: Prof. Shenghua Gao

2016 - 2020

B.E. in Computer Science (Dual Degree)

Dalian University of Technology, China

2016 - 2020

B.E. in Process Equipment and Control Engineering

Dalian University of Technology, China