Research Blog
Review of recent work by the broader Computer Vision community
Text-driven image editing refers to the manipulation of image content according to a user-given text prompt. It is an important research area with real-world applications such as graphic design, advertising, photo editing, virtual reality, and content generation for e-commerce. Many existing frameworks perform text-driven image editing, but their performance is limited by a heavy dependence on expensive annotated training data. These frameworks either perform on-the-fly optimization for each image-text pair or rely on hyperparameter tuning to find the best result. The authors propose a novel framework that performs text-driven image manipulation without requiring paired text inputs (source text and target text) during training. The motivation for this solution comes from a class of existing methods that rely on CLIP and StyleGAN to perform image editing. These methods perform reasonably well but are limited in training and inference flexibility and generalize poorly to previously unseen text prompts.
To address the drawbacks of existing text-driven image editing models, the authors build a model that learns the relationship between the text feature space and StyleGAN's latent visual space without any textual supervision during training.
CLIP is a powerful multi-modal vision-language model trained on 400 million (image, text) pairs, learning a shared semantic space for images and text. The authors note that the direction of CLIP features is more semantically meaningful than the embeddings themselves: CLIP feature differences of paired visual and textual data express similar semantic changes, as can be seen in Fig 1. Specifically, the authors define a CLIP delta image space and a CLIP delta text space, and train a model that maps the CLIP delta image space to StyleGAN's editing directions during the training phase. At inference, the same model can be used to predict StyleGAN's editing directions given a CLIP delta text input (i.e., the difference of the source and target text CLIP embeddings).
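To make the delta spaces concrete, the sketch below shows how the CLIP feature differences could be computed with the public OpenAI CLIP package. The image paths and prompts are placeholders for illustration, not examples from the paper.

```python
# Minimal sketch: computing CLIP delta-image and delta-text features.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_image_embedding(path):
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        feat = model.encode_image(image)
    return feat / feat.norm(dim=-1, keepdim=True)  # L2-normalize

def clip_text_embedding(prompt):
    tokens = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        feat = model.encode_text(tokens)
    return feat / feat.norm(dim=-1, keepdim=True)

# Delta image space: difference of two image embeddings (used as conditioning during training).
delta_i = clip_image_embedding("img2.png") - clip_image_embedding("img1.png")
# Delta text space: difference of target and source prompt embeddings (used at inference).
delta_t = clip_text_embedding("face with smile") - clip_text_embedding("face")
```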
Overview
During training, a latent mapper is trained to map the CLIP visual feature differences to the editing directions of StyleGAN. During inference, this mapper can be used to predict StyleGAN’s editing directions from the differences of the CLIP textual features.
Training
For training, the model takes StyleGAN delta image features, \(\Delta{s} = s_2 - s_1\), as input and CLIP delta image features, \(\Delta{i} = i_2 - i_1\), as conditioning, and trains the Delta Mapper network to predict the editing direction:
\[\Delta{s'} = \mathrm{DeltaMapper}(s_1, i_1, \Delta{i})\]where \(s_1\) and \(i_1\) are passed to the Delta Mapper to provide information about the source image. The loss function used to train the Delta Mapper network is:
\[\mathcal{L} = \mathcal{L}_{rec} + \mathcal{L}_{sim} = ||\Delta{s'}-\Delta{s}||_{2} + 1 - \cos{(\Delta{s'}, \Delta{s})}\]where the L2 reconstruction loss supervises the learning of the editing direction \(\Delta{s'}\), and the cosine similarity loss explicitly encourages the network to minimize the cosine distance between the predicted editing direction \(\Delta{s'}\) and \(\Delta{s}\) in the S space.
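As a rough illustration of this objective, here is a PyTorch sketch of the combined loss with a plain MLP standing in for the mapper. The paper's actual Delta Mapper architecture is more elaborate, and the feature dimensions below are assumptions, not values taken from the paper.

```python
# Rough sketch of the training objective; the MLP is a stand-in for the Delta Mapper.
import torch
import torch.nn as nn
import torch.nn.functional as F

CLIP_DIM, S_DIM = 512, 9088  # assumed sizes of CLIP features and StyleGAN S-space codes

class DeltaMapper(nn.Module):
    def __init__(self):
        super().__init__()
        # Inputs: source style code s1, source CLIP image feature i1, and a delta CLIP feature.
        self.net = nn.Sequential(
            nn.Linear(S_DIM + 2 * CLIP_DIM, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, S_DIM),
        )

    def forward(self, s1, i1, delta):
        return self.net(torch.cat([s1, i1, delta], dim=-1))

def delta_edit_loss(pred_delta_s, delta_s):
    # L_rec: L2 distance between the predicted and ground-truth editing directions.
    l_rec = torch.norm(pred_delta_s - delta_s, dim=-1).mean()
    # L_sim: 1 - cosine similarity, aligning the predicted direction with delta_s.
    l_sim = 1.0 - F.cosine_similarity(pred_delta_s, delta_s, dim=-1).mean()
    return l_rec + l_sim

# One training step on dummy tensors: delta_i conditions the mapper, delta_s supervises it.
mapper = DeltaMapper()
s1, delta_s = torch.randn(8, S_DIM), torch.randn(8, S_DIM)        # dummy S-space codes
i1, delta_i = torch.randn(8, CLIP_DIM), torch.randn(8, CLIP_DIM)  # dummy CLIP features
loss = delta_edit_loss(mapper(s1, i1, delta_i), delta_s)
```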
Inference
During inference, the model takes the CLIP image embedding and the StyleGAN S-space embedding of the input image. The conditioning is \(\Delta{t}\), the CLIP text feature difference of the source and target texts. The Delta Mapper network predicts \(\Delta{s'}\), the editing direction for the input image in the StyleGAN S space. This can be used to generate the output image as follows:
\[\Delta{s'} = \mathrm{DeltaMapper}(s, i, \Delta{t})\]\[s' = s + \Delta{s'}\]
Using the latent code \(s'\) of the output image, we obtain the final result \(I'\) by feeding \(s'\) into the StyleGAN generator.
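Putting the two equations together, inference would look roughly like the function below. It reuses the DeltaMapper sketch from the training section, and `stylegan_synthesis` is a hypothetical callable mapping S-space codes to images, not an API from the paper's code.

```python
import torch

def edit_image(mapper, stylegan_synthesis, s, i, delta_t):
    """Text-driven edit with a trained mapper (all names here are illustrative).

    mapper             -- trained DeltaMapper (see the training sketch above)
    stylegan_synthesis -- hypothetical callable: S-space code -> image tensor
    s                  -- StyleGAN S-space code of the input image
    i                  -- CLIP image embedding of the input image
    delta_t            -- CLIP text feature difference (target minus source prompt)
    """
    with torch.no_grad():
        delta_s = mapper(s, i, delta_t)  # predicted editing direction in S space
        s_edited = s + delta_s           # shift the source latent along that direction
        return stylegan_synthesis(s_edited)
```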
The authors performed extensive qualitative and quantitative evaluation on FFHQ, LSUN-Cat, LSUN-Church, and LSUN-Horse datasets.
The results in Fig 3 show that only the target attributes are manipulated, while other irrelevant attributes are well preserved. Meanwhile, the results adapt to each individual with diverse details, rather than overfitting to the same color or shape.
In comparison, DeltaEdit yields the most impressive and disentangled results in almost every case (Fig 4).
For quantitative evaluation, the authors use FID, PSNR, and IDS (identity similarity before and after manipulation, measured with ArcFace). Compared with state-of-the-art approaches, DeltaEdit achieves the best performance on all metrics (Fig 5).
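For reference, PSNR is simple to compute directly; a minimal sketch is below (FID and ArcFace-based identity similarity require pretrained networks and are omitted).

```python
import torch

def psnr(original, edited, max_val=1.0):
    """Peak signal-to-noise ratio between two image tensors scaled to [0, max_val]."""
    mse = torch.mean((original - edited) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```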
The authors also conducted a user study (Fig 6) with 40 participants (20 rounds per participant), judging the methods in terms of manipulation accuracy and visual realism, where DeltaEdit outperforms StyleCLIP.
In this article, we explored a text-free training framework for text-driven image editing. The majority of existing frameworks built upon pre-trained vision-language models either perform per-text optimization or inference-time hyperparameter tuning. The authors instead propose a delta image-and-text space in which CLIP visual feature differences of two images and CLIP textual embedding differences of source and target texts are well aligned. The proposed model shows superior results both qualitatively and quantitatively and generalizes well to unseen text prompts in zero-shot inference.