SwiftEdit: Fast Text-Guided Image Editing with One-Step Diffusion

The world of image editing is developing rapidly, driven by advances in artificial intelligence. Text-based image editing, which lets users modify images through simple text prompts, is at the forefront. This technology leverages the extensive capabilities of multi-step, diffusion-based text-to-image models. However, these methods are often too slow for real-time and on-device applications: the multi-step inversion and sampling process is simply too time-consuming.
SwiftEdit, a new image editing tool, promises a solution: near-instantaneous text-guided image editing in just 0.23 seconds. This speed rests on two key innovations: a one-step inversion framework that reconstructs the input image in a single step, and a mask-guided editing technique with a novel attention scaling mechanism for localized edits.
One-Step Inversion: A New Approach
Inverting one-step diffusion models is challenging, as existing techniques such as DDIM inversion and null-text inversion are unsuitable for real-time editing. SwiftEdit therefore uses a novel one-step inversion framework inspired by encoder-based GAN inversion methods. Unlike GAN inversion, which requires domain-specific networks and retraining, the SwiftEdit framework generalizes to arbitrary input images. It uses SwiftBrush v2, a fast and powerful one-step text-to-image model, both as the generator and as the backbone of the inversion network. Through a two-stage training process on synthetic and then real data, the network learns to invert any input image.
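The core idea of encoder-based inversion can be illustrated with a deliberately tiny numerical sketch: a frozen one-step "generator" maps noise to an image, and an inversion network is trained on synthetic pairs (noise, image) to predict the noise that reconstructs the image in a single forward pass. Everything here is an assumption for illustration — the generator is a linear map and the "training" is a least-squares fit, whereas the actual SwiftEdit system uses the SwiftBrush v2 network and gradient-based two-stage training.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 256  # toy latent/image dimension and synthetic dataset size

# Toy stand-in for the frozen one-step generator (assumption: a linear map;
# the real generator is SwiftBrush v2, a distilled one-step diffusion model).
W_gen = rng.standard_normal((d, d)) / np.sqrt(d)

def generate(z):
    """One-step 'generation': latent noise z -> image x (row vectors)."""
    return z @ W_gen

# Stage-1-style training on synthetic data: sample ground-truth noise z,
# render x = generate(z), and fit an encoder that maps x back to z.
Z = rng.standard_normal((n, d))
X = generate(Z)
W_inv, *_ = np.linalg.lstsq(X, Z, rcond=None)  # least-squares "training"

def invert(x):
    """One-step inversion: image x -> predicted noise z_hat."""
    return x @ W_inv

# A single invert + generate round trip now reconstructs the input image.
x = generate(rng.standard_normal((1, d)))
recon = generate(invert(x))
print(np.allclose(recon, x, atol=1e-6))  # True
```

Because both inversion and regeneration are single forward passes, the whole reconstruct-then-edit loop avoids the iterative sampling that makes multi-step inversion slow.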
Mask-Guided Editing and Attention Scaling
After the one-step inversion, an efficient, mask-based editing technique is employed. SwiftEdit can either use a predefined editing mask or derive it directly from the trained inversion network and the text input. The mask is then used in a novel attention scaling process to control the editing strength while preserving background elements. This leads to high-quality editing results.
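One plausible way to realize mask-guided attention scaling is to amplify the cross-attention logits between pixels inside the edit mask and the prompt tokens that describe the edit, leaving background pixels untouched. The sketch below is illustrative, not the paper's exact formulation; the function name, the `scale` parameter, and the log-domain boost are assumptions.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_cross_attention(Q, K, V, mask, edit_token_ids, scale=3.0):
    """Mask-guided attention scaling (illustrative, not the paper's exact rule).

    Q: (n_pixels, d) image queries; K, V: (n_tokens, d) prompt keys/values.
    mask: (n_pixels,) 1.0 inside the edit region, 0.0 in the background.
    edit_token_ids: indices of the prompt tokens describing the edit.
    """
    d = Q.shape[1]
    logits = Q @ K.T / np.sqrt(d)             # (n_pixels, n_tokens)
    boost = np.zeros_like(logits)
    # Adding log(scale) to a logit multiplies that token's unnormalized
    # attention weight by `scale` before renormalization.
    boost[:, edit_token_ids] = np.log(scale)
    logits = logits + mask[:, None] * boost   # applied only inside the mask
    return softmax(logits, axis=1) @ V

# Tiny usage example with hypothetical shapes.
rng = np.random.default_rng(1)
Q = rng.standard_normal((4, 8))           # 4 "pixels"
K = rng.standard_normal((5, 8))           # 5 prompt tokens
V = rng.standard_normal((5, 8))
mask = np.array([1.0, 1.0, 0.0, 0.0])     # first two pixels are editable
out = scaled_cross_attention(Q, K, V, mask, edit_token_ids=[2])
print(out.shape)  # (4, 8)
```

For background pixels the mask zeroes out the boost, so their attention output is identical to plain cross-attention — which is what preserves background elements while the edit strength is raised locally.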
Speed and Performance Comparison
SwiftEdit is the first tool to combine diffusion-based one-step inversion with a one-step text-to-image model to enable instantaneous text-guided image editing. It is at least 50 times faster than previous multi-step methods while delivering comparable results. The combination of speed and performance makes SwiftEdit a promising solution for various applications, especially in the field of mobile and real-time image editing.
Outlook and Potential
The development of SwiftEdit marks an important step in text-based image editing. The ability to modify images in real-time and with high quality opens up new possibilities for creative applications, both for professional users and for everyday use. Integration into content creation tools like Mindverse could significantly simplify and accelerate the workflow for creating visual content. Future research could focus on further improving editing accuracy and expanding functionality to enable even more complex editing scenarios.