TimesCLIP

Teaching Time Series to See and Speak

🚀 CLIP is ALL you NEED for Time Series Forecasting

🔥 First CLIP-based Time Series Forecasting
🏆 16/16 Short-term Benchmarks SoTA

Authors

Sixun Dong1

Arizona State University

sixun.dong@asu.edu

Wei Fan2

University of Oxford

wei.fan@ox.ac.uk

Teresa Wu1

Arizona State University

teresa.wu@asu.edu

Yanjie Fu1*

Arizona State University

yanjie.fu@asu.edu

*Corresponding Author

1Arizona State University    2University of Oxford

Abstract

We introduce TimesCLIP, a novel multimodal approach that bridges numerical and visual understanding in time series forecasting. By leveraging CLIP's pretrained text encoder as a backbone and incorporating visual patterns through contrastive learning, our method achieves state-of-the-art performance across 16 short-term forecasting datasets.

Our key insight is that the pretrained CLIP text encoder already provides an aligned multimodal space that captures both numerical and visual patterns in time series data, eliminating the need for complex architectural modifications while preserving the scalability that makes Transformers powerful.

The framework follows a simple principle: CLIP is ALL you NEED. Replace the painstakingly tuned Transformer layers with the pretrained CLIP text encoder and obtain better performance with zero additional hyperparameter tuning.
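
To make this concrete, below is a minimal sketch, assuming the Hugging Face transformers library and the openai/clip-vit-base-patch32 checkpoint, of how a patched time series could be routed through CLIP's pretrained text Transformer. The linear patch embedding, forecast head, and mean pooling are illustrative choices for this sketch, not the paper's exact design.

import torch
import torch.nn as nn
from transformers import CLIPTextModel

class CLIPTextBackboneForecaster(nn.Module):
    """Hypothetical sketch: reuse CLIP's pretrained text Transformer as the
    sequence backbone for patched time series (not the official implementation)."""
    def __init__(self, patch_len: int = 16, horizon: int = 96):
        super().__init__()
        clip_text = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
        d_model = clip_text.config.hidden_size            # 512 for the ViT-B/32 text tower
        self.encoder = clip_text.text_model.encoder       # pretrained Transformer blocks
        self.norm = clip_text.text_model.final_layer_norm
        self.patch_embed = nn.Linear(patch_len, d_model)  # map raw patches into CLIP's width
        self.head = nn.Linear(d_model, horizon)           # forecast head

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, patch_len); positional encoding and the
        # causal attention mask CLIP uses for text are omitted in this sketch.
        tokens = self.patch_embed(patches)
        hidden = self.encoder(inputs_embeds=tokens).last_hidden_state
        return self.head(self.norm(hidden).mean(dim=1))   # (batch, horizon)

In this reading, only the thin patch-embedding and forecast layers are newly trained; everything between them is reused CLIP weights, which is what removes most architecture-specific tuning.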

Architecture Overview

Our multimodal framework combines numerical and visual understanding

TimesCLIP Architecture

TimesCLIP architecture: Bridging numerical time series data with visual patterns through contrastive learning
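
As a concrete illustration of the contrastive alignment shown above, here is a CLIP-style InfoNCE sketch between a numerical-branch embedding and a visual-branch embedding of the same series. The function name, batching convention, and temperature are assumptions for illustration, not the paper's exact objective.

import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(num_emb: torch.Tensor,
                                vis_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch: matching numerical/visual pairs share an index."""
    num_emb = F.normalize(num_emb, dim=-1)   # (batch, d)
    vis_emb = F.normalize(vis_emb, dim=-1)   # (batch, d)
    logits = num_emb @ vis_emb.t() / temperature
    targets = torch.arange(num_emb.size(0), device=num_emb.device)
    # average both matching directions, as in CLIP
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

Matching pairs sit on the diagonal of the logits matrix, so minimizing this loss pulls the two views of the same series together while pushing apart views of different series in the batch.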

Key Results

State-of-the-art performance across multiple benchmarks

Datasets: 22 benchmarks evaluated (16 short-term + 6 long-term forecasting)

Performance Gain: +15% average improvement across 16 benchmarks vs. existing methods

Zero Tuning: 0 extra hyperparameters needed; the CLIP backbone handles everything

Visual Understanding

How humans and our model see time series patterns

Detailed Analysis

Method Visualization

Time Series Visualization

Figure 4: Visualization of time series to image conversion with different colors for each variable
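
To ground this caption, the following is a minimal sketch, under our own assumptions (matplotlib line plots, the tab10 palette, a 224x224 RGB canvas), of rendering a multivariate series as an image with one color per variable.

import io
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

def series_to_image(x: np.ndarray, size=(224, 224)) -> Image.Image:
    """Render a multivariate series of shape (num_vars, seq_len) as an RGB image,
    drawing each variable as a separately colored curve."""
    fig, ax = plt.subplots(figsize=(3, 3), dpi=112)
    colors = plt.cm.tab10(np.linspace(0, 1, x.shape[0]))
    for var, color in zip(x, colors):
        ax.plot(var, color=color, linewidth=1.5)
    ax.axis("off")                                   # drop axes: only the curves matter
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight", pad_inches=0)
    plt.close(fig)
    buf.seek(0)
    return Image.open(buf).convert("RGB").resize(size)

# Example: img = series_to_image(np.random.randn(7, 96))  # 7 variables, 96 time steps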

Performance Comparison

Comparison Results

Figure 5: Comprehensive comparison with state-of-the-art methods

Ablation Studies


Figure 6: Ablation study results showing the impact of different components

Multimodal Backbone

Figure 7: Multimodal backbone ablation study

Resources

Citation

@article{sixun2025teaching,
  title={Teaching Time Series to See and Speak: Forecasting with Aligned Visual and Textual Perspectives},
  author={Dong, Sixun and Fan, Wei and Wu, Teresa and Fu, Yanjie},
  journal={arXiv preprint arXiv:2506.24124},
  year={2025}
}

Acknowledgments

We thank the CLIP team at OpenAI for their foundational vision-language model that inspired this work. We also appreciate the time series forecasting community for providing comprehensive benchmarks and baseline implementations.

Special thanks to our collaborators and reviewers for their valuable feedback and suggestions that helped improve this research.

This work was supported by Arizona State University. We acknowledge the computational resources provided by our institutions.

Welcome Discussion

Questions & Feedback

Have questions about our method or want to discuss the results? We welcome all questions, discussions, and constructive feedback!

Issues & Improvements

Found issues with our implementation or have suggestions for improvements? Please open an issue on our GitHub repository.

Collaboration

Interested in collaborating or extending this work? We're always open to new research partnerships and joint projects.

Contact us: sdong46@asu.edu

GitHub Issues: TimesCLIP Issues