TimesCLIP

Teaching Time Series to See and Speak

🚀 CLIP is ALL you NEED for Time Series Forecasting

🔥 First CLIP-based Time Series Forecasting
🏆 16/16 Short-term Benchmarks SoTA

Authors

Sixun Dong1

Arizona State University

sixun.dong@asu.edu

Wei Fan2

University of Oxford

wei.fan@ox.ac.uk

Teresa Wu1

Arizona State University

teresa.wu@asu.edu

Yanjie Fu1*

Arizona State University

yanjie.fu@asu.edu

*Corresponding Author

1Arizona State University    2University of Oxford

Abstract

We introduce TimesCLIP, a novel multimodal approach that bridges numerical and visual understanding in time series forecasting. By leveraging CLIP's pretrained text encoder as a backbone and incorporating visual patterns through contrastive learning, our method achieves state-of-the-art performance across 16 short-term forecasting datasets.

Our key insight is that the pretrained CLIP text encoder already provides an aligned multimodal space that captures both numerical and visual patterns in time series data, eliminating the need for complex architectural modifications while preserving the scalability that makes Transformers powerful.

The framework follows a simple principle: CLIP is ALL you NEED. Replace the painstakingly tuned Transformer layers with the pretrained CLIP text encoder and obtain better performance with zero additional hyperparameter tuning.
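
To make this concrete, below is a minimal sketch, assuming the Hugging Face transformers library and the openai/clip-vit-base-patch32 checkpoint, of how a patched time series could be routed through CLIP's pretrained text Transformer. The linear patch embedding, forecast head, and mean pooling are illustrative choices for this sketch, not the paper's exact design.

import torch
import torch.nn as nn
from transformers import CLIPTextModel

class CLIPTextBackboneForecaster(nn.Module):
    """Hypothetical sketch: reuse CLIP's pretrained text Transformer as the
    sequence backbone for patched time series (not the official implementation)."""
    def __init__(self, patch_len: int = 16, horizon: int = 96):
        super().__init__()
        clip_text = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
        d_model = clip_text.config.hidden_size            # 512 for the ViT-B/32 text tower
        self.encoder = clip_text.text_model.encoder       # pretrained Transformer blocks
        self.norm = clip_text.text_model.final_layer_norm
        self.patch_embed = nn.Linear(patch_len, d_model)  # map raw patches into CLIP's width
        self.head = nn.Linear(d_model, horizon)           # forecast head

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, patch_len); positional encoding and the
        # causal attention mask CLIP uses for text are omitted in this sketch.
        tokens = self.patch_embed(patches)
        hidden = self.encoder(inputs_embeds=tokens).last_hidden_state
        return self.head(self.norm(hidden).mean(dim=1))   # (batch, horizon)

In this reading, only the thin patch-embedding and forecast layers are newly trained; everything between them is reused CLIP weights, which is what removes most architecture-specific tuning.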

Architecture Overview

Our multimodal framework combines numerical and visual understanding

TimesCLIP Architecture

TimesCLIP architecture: Bridging numerical time series data with visual patterns through contrastive learning
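
As a concrete illustration of the contrastive alignment shown above, here is a CLIP-style InfoNCE sketch between a numerical-branch embedding and a visual-branch embedding of the same series. The function name, batching convention, and temperature are assumptions for illustration, not the paper's exact objective.

import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(num_emb: torch.Tensor,
                                vis_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch: matching numerical/visual pairs share an index."""
    num_emb = F.normalize(num_emb, dim=-1)   # (batch, d)
    vis_emb = F.normalize(vis_emb, dim=-1)   # (batch, d)
    logits = num_emb @ vis_emb.t() / temperature
    targets = torch.arange(num_emb.size(0), device=num_emb.device)
    # average both matching directions, as in CLIP
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

Matching pairs sit on the diagonal of the logits matrix, so minimizing this loss pulls the two views of the same series together while pushing apart views of different series in the batch.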

Key Results

State-of-the-art performance across multiple benchmarks

Datasets: 22 benchmarks evaluated (16 short-term + 6 long-term forecasting)

Performance Gain: +15% average improvement across 16 benchmarks vs. existing methods

Zero Tuning: 0 extra hyperparameters needed; the CLIP backbone handles everything

Visual Understanding

How humans and our model see time series patterns

Detailed Analysis

Method Visualization

Time Series Visualization

Figure 4: Visualization of time series to image conversion with different colors for each variable
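
To ground this caption, the following is a minimal sketch, under our own assumptions (matplotlib line plots, the tab10 palette, a 224x224 RGB canvas), of rendering a multivariate series as an image with one color per variable.

import io
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

def series_to_image(x: np.ndarray, size=(224, 224)) -> Image.Image:
    """Render a multivariate series of shape (num_vars, seq_len) as an RGB image,
    drawing each variable as a separately colored curve."""
    fig, ax = plt.subplots(figsize=(3, 3), dpi=112)
    colors = plt.cm.tab10(np.linspace(0, 1, x.shape[0]))
    for var, color in zip(x, colors):
        ax.plot(var, color=color, linewidth=1.5)
    ax.axis("off")                                   # drop axes: only the curves matter
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight", pad_inches=0)
    plt.close(fig)
    buf.seek(0)
    return Image.open(buf).convert("RGB").resize(size)

# Example: img = series_to_image(np.random.randn(7, 96))  # 7 variables, 96 time steps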

Performance Comparison

Comparison Results

Figure 5: Comprehensive comparison with state-of-the-art methods

Ablation Studies


Figure 6: Ablation study results showing the impact of different components

Multimodal Backbone

Figure 7: Multimodal backbone ablation study

Resources

Citation

@article{sixun2025teaching,
  title={Teaching Time Series to See and Speak: Forecasting with Aligned Visual and Textual Perspectives},
  author={Dong, Sixun and Fan, Wei and Wu, Teresa and Fu, Yanjie},
  journal={arXiv preprint arXiv:2506.24124},
  year={2025}
}

Acknowledgments

We thank the CLIP team at OpenAI for their foundational vision-language model that inspired this work. We also appreciate the time series forecasting community for providing comprehensive benchmarks and baseline implementations.

Special thanks to our collaborators and reviewers for their valuable feedback and suggestions that helped improve this research.

This work was supported by Arizona State University. We acknowledge the computational resources provided by our institutions.

Welcome Discussion

Questions & Feedback

Have questions about our method or want to discuss the results? We welcome all questions, discussions, and constructive feedback!

Issues & Improvements

Found issues with our implementation or have suggestions for improvements? Please open an issue on our GitHub repository.

Collaboration

Interested in collaborating or extending this work? We're always open to new research partnerships and joint projects.

Contact us: sdong46@asu.edu

GitHub Issues: TimesCLIP Issues