Research Blog

Review of recent work by the broader Computer Vision community


DeltaEdit: Exploring Text-free Training for Text-Driven Image Manipulation

Posted by Shivam on Thu, 17 Aug 2023 DeltaEdit Review

Text-driven image editing is the manipulation of image content according to a user-provided text prompt. It is an important research area with real-world applications such as graphic design, advertising, photo editing, virtual reality, and content generation for e-commerce. Many existing frameworks perform text-driven image editing, but their performance is limited by a heavy dependence on expensive annotated training data. These frameworks either run an optimization on the fly for each image-text pair or require hyperparameter tuning to find the best result. The authors propose a novel framework that performs text-driven image manipulation without requiring paired text inputs (source text and target text) during training. The motivation comes from a class of existing solutions that rely on CLIP and StyleGAN for image editing. These methods perform reasonably well but offer limited training or inference flexibility and generalize poorly to previously unseen text prompts.

The need for a well-aligned feature space for image and text modalities

To address the drawbacks of existing text-driven image editing models, the authors build a model that learns the relationship between the text feature space and StyleGAN’s latent visual space without any textual supervision during training.

Fig. 1 Feature space analysis of CLIP image and text embeddings and the proposed CLIP delta image and text embeddings for MultiModal-CelebA-HQ dataset visualized using t-SNE.

CLIP is a powerful multi-modal vision-language model trained on 400 million (image, text) pairs. It learns a shared semantic space for images and text. The authors note that the direction of change between CLIP features is more semantically meaningful than the embeddings themselves: for paired visual-text data, the CLIP image feature difference and the CLIP text feature difference correspond to similar semantic changes, as can be seen in Fig. 1. Concretely, the authors define a CLIP delta image space and a CLIP delta text space, and in the training phase build a model that maps the CLIP delta image space to StyleGAN’s editing directions. In the inference phase, the same model predicts StyleGAN’s editing directions from a CLIP delta text input (i.e., the difference between the source and target text CLIP embeddings).
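To make the delta idea concrete, here is a minimal sketch in plain Python. The vectors are made-up 3-D stand-ins for CLIP embeddings (a real pipeline would obtain them from CLIP's image and text encoders), and the helper names `subtract` and `cosine` are illustrative assumptions rather than anything from the paper's code. The toy data is deliberately constructed so that text embeddings sit at a constant offset from image embeddings, which is an idealized picture of a modality gap.

```python
import math

def subtract(b, a):
    """Element-wise difference b - a."""
    return [y - x for x, y in zip(a, b)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Toy stand-ins for CLIP embeddings (assumed values, not real CLIP output).
# The third coordinate plays the role of a constant offset between modalities.
i1 = [0.9, 0.1, 0.0]  # image 1
i2 = [0.7, 0.5, 0.0]  # image 2
t1 = [0.9, 0.1, 1.0]  # text describing image 1
t2 = [0.7, 0.5, 1.0]  # text describing image 2

delta_i = subtract(i2, i1)  # CLIP delta image feature
delta_t = subtract(t2, t1)  # CLIP delta text feature

raw_alignment = cosine(i1, t1)             # image vs. text embedding
delta_alignment = cosine(delta_i, delta_t)  # image delta vs. text delta
print(raw_alignment, delta_alignment)
```

In this constructed example the offset cancels when differences are taken, so the two deltas line up even though the raw image and text embeddings do not. Real CLIP features will only approximate this behavior.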

Method: DeltaEdit

DeltaEdit: Exploring Text-free Training for Text-Driven Image Manipulation

Yueming Lyu and Tianwei Lin and Fu Li and Dongliang He and Jing Dong and Tieniu Tan

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023



During training, a latent mapper is trained to map the CLIP visual feature differences to the editing directions of StyleGAN. During inference, this mapper can be used to predict StyleGAN’s editing directions from the differences of the CLIP textual features.

Fig. 2 DeltaEdit framework.


For training, a pair of images is used to compute the StyleGAN delta image features, \(\Delta{s} = s_2 - s_1\), and the CLIP delta image features, \(\Delta{i} = i_2 - i_1\). The Delta Mapper network is conditioned on \(\Delta{i}\) and trained to predict the editing direction \(\Delta{s'}\), with \(\Delta{s}\) as the supervision target:

\[\Delta{s'} = LatentMapper(s_1, i_1, \Delta{i})\]

where \(s_1\) and \(i_1\) are fed to the Delta Mapper (written as LatentMapper in the equation above) to provide information about the source image. The loss function used to train the Delta Mapper network is:

\[\mathcal{L} = \mathcal{L}_{rec} + \mathcal{L}_{sim} = ||\Delta{s'}-\Delta{s}||_{2} + 1 - \cos{(\Delta{s'}, \Delta{s})}\]

where the L2 reconstruction loss \(\mathcal{L}_{rec}\) supervises the predicted editing direction \(\Delta{s'}\), and the cosine similarity loss \(\mathcal{L}_{sim}\) explicitly encourages the network to minimize the cosine distance between the predicted direction \(\Delta{s'}\) and the ground-truth direction \(\Delta{s}\) in the S space.
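The loss above can be sketched in a few lines of plain Python. This is only an illustration of the formula on small lists; the function name `delta_edit_loss` and the `eps` guard against division by zero are assumptions, and an actual training loop would use a tensor library with automatic differentiation.

```python
import math

def delta_edit_loss(pred, target, eps=1e-8):
    """L_rec + L_sim for a predicted direction `pred` and ground truth `target`."""
    # Reconstruction term: L2 distance between the two directions.
    l_rec = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)))

    # Similarity term: 1 - cosine similarity (eps avoids dividing by zero).
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    l_sim = 1.0 - dot / max(norm_p * norm_t, eps)

    return l_rec + l_sim

# Ground-truth direction from two S-space latents (toy values).
s1 = [0.2, 0.4]
s2 = [1.2, 0.4]
delta_s = [b - a for a, b in zip(s1, s2)]  # [1.0, 0.0]

print(delta_edit_loss([1.0, 0.0], delta_s))  # prediction matches the target
print(delta_edit_loss([0.0, 1.0], delta_s))  # prediction is orthogonal to it
```

The two terms are complementary: the L2 term penalizes errors in both magnitude and direction, while the cosine term depends only on direction.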


During inference, the model takes the CLIP image embedding \(i\) and the StyleGAN S space embedding \(s\) of the input image. The conditioning is \(\Delta{t}\), the difference between the CLIP text features of the target and source texts. The Delta Mapper network predicts \(\Delta{s'}\), the editing direction for the input image in the StyleGAN S space. This can be used to generate the output image as follows:

\[\Delta{s'} = LatentMapper(s, i, \Delta{t})\]

\[s' = s + \Delta{s'}\]

Feeding the edited latent \(s'\) to the StyleGAN generator yields the final output image \(I'\).
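The inference step can be sketched as follows. Everything here is a toy stand-in: `toy_mapper` is a hypothetical fixed linear map that ignores `s` and `i`, whereas the trained network in the paper uses all three inputs, and the vectors are made-up values with tiny dimensions. The final StyleGAN generator call is omitted.

```python
# Hypothetical weights mapping a 2-D "CLIP delta" to a 3-D "S space" direction.
TOY_WEIGHTS = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

def toy_mapper(s, i, delta):
    """Stand-in for the trained mapper: a fixed linear map of the delta only."""
    return [sum(w * d for w, d in zip(row, delta)) for row in TOY_WEIGHTS]

def edit_latent(s, i, delta_t, mapper):
    """Predict the editing direction and apply it: s' = s + delta_s'."""
    delta_s_pred = mapper(s, i, delta_t)
    return [a + b for a, b in zip(s, delta_s_pred)]

s = [1.0, 2.0, 3.0]      # S-space latent of the input image (toy values)
i = [0.3, 0.7]           # CLIP image embedding of the input image (toy values)
delta_t = [0.5, -0.25]   # target-text minus source-text CLIP features (toy values)

s_edited = edit_latent(s, i, delta_t, toy_mapper)
print(s_edited)  # this latent would then be passed to the StyleGAN generator
```

Because the text only enters through `delta_t`, the same mapper can be reused for any source and target prompt pair without retraining.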


The authors performed extensive qualitative and quantitative evaluation on FFHQ, LSUN-Cat, LSUN-Church, and LSUN-Horse datasets.

Fig. 3 DeltaEdit results on StyleGAN2 FFHQ model.

The results in Fig. 3 show that only the target attributes are manipulated, while irrelevant attributes are well preserved. The edits also adapt to the details of each individual rather than overfitting to the same color or shape.

Fig. 4: DeltaEdit results on StyleGAN2 FFHQ model

Compared with the baseline methods, DeltaEdit yields the most convincing and disentangled results in almost every case (Fig. 4).

For quantitative evaluation, the authors use FID, PSNR, and IDS (identity similarity before and after manipulation, measured with ArcFace). Compared with state-of-the-art approaches, DeltaEdit achieves the best performance on all metrics (Fig. 5).

Fig. 5: Quantitative results on FID, PSNR, and IDS.
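For readers unfamiliar with these metrics, the sketch below shows how PSNR and an identity-similarity score can be computed in plain Python. The function names are assumptions, images are represented as flat lists of pixel values, and the embeddings passed to `identity_similarity` are placeholders for what a face recognition model such as ArcFace would produce. Which image pairs are compared depends on the evaluation protocol, which is not reproduced here. FID is omitted because it requires a pretrained Inception network.

```python
import math

def psnr(img_a, img_b, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two equally sized images."""
    if len(img_a) != len(img_b):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def identity_similarity(emb_a, emb_b):
    """Cosine similarity between two face embeddings (placeholder vectors)."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    return dot / (norm_a * norm_b)

original = [10.0, 20.0, 30.0, 40.0]  # toy "image"
edited = [11.0, 21.0, 31.0, 41.0]    # every pixel shifted by one level
print(psnr(original, edited))
print(identity_similarity([1.0, 0.0], [0.8, 0.6]))
```

Higher values are better for both scores: a high PSNR means the edited image stays close to the reference at the pixel level, and a high identity similarity means the face is still recognized as the same person.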

Fig. 6: Results of user study. DeltaEdit is preferred over StyleCLIP on manipulation accuracy and visual realism.

The authors also conducted a user study (Fig. 6) with 40 participants, each completing 20 rounds, who judged the methods on manipulation accuracy and visual realism. DeltaEdit outperforms StyleCLIP on both criteria.


In this article, we explored a text-free training framework for text-driven image editing. The majority of existing frameworks built upon pre-trained vision-language models require either per-text optimization or inference-time hyperparameter tuning. The authors of this paper propose a delta image and text space in which the CLIP visual feature differences of two images and the CLIP textual feature differences of source and target texts have well-aligned distributions. The proposed model shows superior results both qualitatively and quantitatively, and generalizes well to unseen text prompts in a zero-shot manner.
