CatV2TON: A Novel Approach to Virtual Try-On Using Diffusion Transformers

Virtual try-on (VTON) is becoming increasingly important in online retail: it lets customers realistically visualize clothing on themselves before purchasing. However, previous VTON methods often fall short, especially when generating videos and longer sequences. A new research approach, CatV2TON, aims to close this gap.
CatV2TON: Diffusion Transformers for Seamless Image and Video Integration
CatV2TON uses a single diffusion transformer trained on both images and videos. This simplifies the pipeline compared to previous methods, which often require separate models for still images and video. At the core of CatV2TON is the temporal concatenation of the inputs: the garment and person representations are concatenated along the temporal dimension and fed to the model as a single sequence. By training on a mix of image and video data, the model learns to deliver robust results in both settings.
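As a minimal sketch of this idea (the array shapes and the 4-channel latent size are illustrative assumptions, not values from the paper): the garment latent and the person latent are stacked along the frame axis, so a still image is simply a one-frame clip and the same model can handle both cases.

```python
import numpy as np

# Hypothetical latent shapes: (frames, channels, height, width).
garment_latent = np.random.randn(1, 4, 32, 24)   # garment reference, one frame
person_latent = np.random.randn(8, 4, 32, 24)    # person clip, eight frames

# Temporal concatenation: stack garment and person along the frame axis
# so the diffusion transformer attends over both in one sequence.
model_input = np.concatenate([garment_latent, person_latent], axis=0)

print(model_input.shape)  # (9, 4, 32, 24)
```

Because an image is just a sequence of length one, no architectural change is needed to switch between image-based and video-based try-on.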
Efficient Video Generation through Overlapping Clips and Adaptive Normalization
Generating longer videos is particularly challenging because it demands substantial computational resources. CatV2TON addresses this with a clip-based strategy: the video is divided into overlapping clips that are processed sequentially, with the final frames of each clip serving as guidance for the next to preserve temporal consistency. Additionally, Adaptive Clip Normalization (AdaCN) normalizes the data within each clip, contributing to stable and resource-efficient video generation.
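The clip-splitting logic can be sketched as follows. The function names, the concrete clip sizes, and the overlap-matching form of the normalization are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def overlapping_clips(num_frames, clip_len, overlap):
    """Split frame indices into overlapping clips. The first `overlap`
    frames of each clip after the first repeat the tail of the previous
    clip and serve as temporal guidance."""
    clips, start = [], 0
    while start < num_frames:
        end = min(start + clip_len, num_frames)
        clips.append(list(range(start, end)))
        if end == num_frames:
            break
        start = end - overlap
    return clips

def adaptive_clip_norm(clip, ref_overlap, cur_overlap, eps=1e-6):
    """Hedged sketch of AdaCN's idea: rescale the current clip so that
    the statistics of its overlap region match those of the corresponding
    region in the previously generated clip."""
    mu_r, sd_r = ref_overlap.mean(), ref_overlap.std()
    mu_c, sd_c = cur_overlap.mean(), cur_overlap.std()
    return (clip - mu_c) / (sd_c + eps) * sd_r + mu_r

# A 20-frame video split into 8-frame clips with a 2-frame overlap:
print(overlapping_clips(20, 8, 2))  # three clips: frames 0-7, 6-13, 12-19
```

Because each clip only needs to be normalized against a small overlap region rather than the whole video, memory use stays bounded regardless of the total video length.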
ViViD-S: An Improved Dataset for More Precise Results
The quality of an AI model depends heavily on its training data. The researchers therefore built ViViD-S, a refined video dataset for virtual try-on, optimized by filtering out back-view frames and applying 3D mask smoothing. These measures improve temporal consistency and thereby contribute to more realistic results.
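A simplified stand-in for the 3D mask smoothing step (the function name and the average-then-threshold rule are assumptions for illustration, not the paper's exact method): averaging each frame's binary garment mask with its temporal neighbours removes single-frame flicker in the mask sequence.

```python
import numpy as np

def smooth_masks_3d(masks, window=3):
    """Temporally smooth binary garment masks by averaging each frame's
    mask with its neighbours, then re-thresholding at 0.5."""
    num_frames, half = len(masks), window // 2
    out = np.empty_like(masks, dtype=float)
    for t in range(num_frames):
        lo, hi = max(0, t - half), min(num_frames, t + half + 1)
        out[t] = masks[lo:hi].mean(axis=0)
    return (out >= 0.5).astype(masks.dtype)

masks = np.ones((5, 2, 2), dtype=int)
masks[2, 0, 0] = 0            # single-frame dropout in an otherwise stable mask
smoothed = smooth_masks_3d(masks)
print(smoothed[2, 0, 0])      # the flicker is filled back in
```

Smoothing the masks along the time axis in this way keeps the try-on region from jittering between frames, which is one source of the temporal inconsistency the refined dataset targets.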
Promising Results and Outlook
Reported experiments show that CatV2TON delivers convincing results relative to existing methods in both image- and video-based virtual try-on. The combination of a unified model, efficient video generation, and an optimized dataset provides a promising foundation for realistic virtual try-ons in a variety of scenarios. This technology could reshape online retail by offering customers a more immersive and informative shopping experience.
The further development of VTON technologies like CatV2TON is an exciting field of research. Future developments could focus on improving realism, integrating additional factors such as body shape and movement, and expanding to different types of clothing. Virtual try-on has the potential to fundamentally change the way we buy clothes online.
Bibliography
Chong, Zheng, et al. "CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation." arXiv preprint arXiv:2501.11325 (2025).