Deepfakes are one of the defining trends of this generation, and they rest on a very simple concept: the face of person A is transferred onto a video of person B. Let's look at how it's done, stage by stage:
- Swapping the image: An autoencoder (an encoder paired with a decoder) is trained to reconstruct each face. Think of a criminal sketch: the witness describing the suspect's features is the encoder, and the sketch artist reconstructing the picture from that description is the decoder. From a data set of hundreds or thousands of pictures of each face, an encoder is built using a deep Convolutional Neural Network (CNN). The encoder extracts the most important features from the images, i.e., it encodes the pictures, and a decoder then reconstructs the original image from those features. The encoder and decoder are trained together, like entwined twins, using backpropagation, so that the output matches the input as closely as possible. Crucially, a single encoder is shared between both people, while each person gets their own decoder. After training, the video is processed frame by frame to swap the faces: face detection extracts B's face, but instead of feeding its features to B's own decoder, the decoder of person A is used. The result is the face of A with the pose and expression of B, and this newly created face is merged back into the original frame. (A minimal code sketch of this shared-encoder setup follows this list.)
- Making the image realistic: A deep discriminator network, as in a Generative Adversarial Network (GAN), is used to judge whether an image is real or created. When a real image is fed to the discriminator, it gets better at recognizing real images; when a created image is fed in, the feedback trains the autoencoder to create a more realistic image. This process is repeated over and over until the created image is indistinguishable from a real one. (A sketch of this adversarial loop also follows the list.)
- Lip sync from audio: First, a Long Short-Term Memory (LSTM) network, a type of recurrent neural network, transforms the audio into 18 landmark points on the lips, outputting a mouth shape for each video frame. From each mouth shape, the network synthesizes a mouth texture for the chin and mouth area. The application then searches the target video for frames that match the computed mouth shape, and the candidates are merged together using a median function, balancing realism against temporal smoothness. Any blurriness left in the video can be compensated for by sharpening and enhancing the teeth. The final trick is to re-time the frames, which gives a clear idea of where to insert the fake mouth texture and keeps it in proper sync with the head movement. (The last code sketch below shows the audio-to-mouth-shape step.)
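
To make the shared-encoder idea concrete, here is a minimal sketch in PyTorch (assumed here; any deep learning framework would do). The class names, the 64x64 image size, and the layer sizes are all illustrative choices, not taken from any particular deepfake tool:

```python
# Sketch of the deepfake autoencoder: one shared encoder, one decoder
# per identity. All names and dimensions below are illustrative.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """CNN that compresses a 64x64 face crop into a feature vector."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),   # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 32 -> 16
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), # 16 -> 8
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, latent_dim),
        )
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Reconstructs a 64x64 face from the shared feature vector."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 128 * 8 * 8)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # 8 -> 16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),   # 16 -> 32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),    # 32 -> 64
            nn.Sigmoid(),  # pixel values in [0, 1]
        )
    def forward(self, z):
        h = self.fc(z).view(-1, 128, 8, 8)
        return self.net(h)

encoder = SharedEncoder()
decoder_a, decoder_b = Decoder(), Decoder()  # one decoder per identity

# Training: each decoder learns to reconstruct its own person's faces,
# while the single encoder is trained on both.
opt = torch.optim.Adam(
    list(encoder.parameters())
    + list(decoder_a.parameters())
    + list(decoder_b.parameters()), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(faces_a, faces_b):
    recon_a = decoder_a(encoder(faces_a))
    recon_b = decoder_b(encoder(faces_b))
    loss = loss_fn(recon_a, faces_a) + loss_fn(recon_b, faces_b)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# The swap: encode a frame of B, but decode with A's decoder,
# yielding A's face with B's pose and expression.
with torch.no_grad():
    frame_b = torch.rand(1, 3, 64, 64)  # stand-in for a detected face crop
    swapped = decoder_a(encoder(frame_b))
```

The swap at the end is the whole trick: because both decoders were trained against the same encoder's features, A's decoder can interpret the features extracted from B's face.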
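The adversarial stage can be sketched the same way. The `Discriminator` below and the two-step training loop are illustrative, and they reuse the `encoder` and `decoder_a` from the previous sketch; real systems combine the reconstruction and adversarial losses more carefully:

```python
# Sketch of the GAN-style refinement stage, assuming PyTorch and the
# encoder / decoder_a defined in the previous sketch.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """CNN that scores a 64x64 face crop: real (1) vs. created (0)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),   # 64 -> 32
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 32 -> 16
            nn.LeakyReLU(0.2),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 1),  # a single realism logit
        )
    def forward(self, x):
        return self.net(x)

disc = Discriminator()
bce = nn.BCEWithLogitsLoss()
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder_a.parameters()), lr=1e-4)

def gan_step(real_faces, source_frames):
    fake_faces = decoder_a(encoder(source_frames))

    # 1) Train the discriminator to tell real from created images.
    d_loss = (bce(disc(real_faces), torch.ones(len(real_faces), 1))
              + bce(disc(fake_faces.detach()), torch.zeros(len(fake_faces), 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Train the generator (the autoencoder) to fool the discriminator.
    g_loss = bce(disc(fake_faces), torch.ones(len(fake_faces), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

Note the `fake_faces.detach()`: it keeps the discriminator's update from touching the autoencoder, so only the second step pushes the created image toward realism.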
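Finally, a sketch of the audio-to-mouth-shape step: an LSTM mapping per-frame audio features to the 18 lip landmarks described above. The choice of MFCC features, the feature dimension, the hidden size, and the random training data are all stand-ins:

```python
# Sketch of the lip-sync stage: an LSTM maps a sequence of audio
# features (one row per video frame) to 18 (x, y) lip landmarks.
import torch
import torch.nn as nn

class AudioToMouth(nn.Module):
    """LSTM that maps audio feature sequences to lip landmarks."""
    def __init__(self, n_feats=28, hidden=128, n_landmarks=18):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, batch_first=True)
        # Each landmark is an (x, y) point, so 18 * 2 outputs per frame.
        self.head = nn.Linear(hidden, n_landmarks * 2)

    def forward(self, audio_feats):
        # audio_feats: (batch, time, n_feats)
        out, _ = self.lstm(audio_feats)
        coords = self.head(out)                       # (batch, time, 36)
        return coords.view(*coords.shape[:2], 18, 2)  # (batch, time, 18, 2)

model = AudioToMouth()
mse = nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step against landmarks tracked from real footage
# (random tensors here as stand-ins for real data).
audio = torch.rand(4, 100, 28)      # 4 clips, 100 frames of audio features
target = torch.rand(4, 100, 18, 2)  # tracked lip landmarks per frame
loss = mse(model(audio), target)
opt.zero_grad(); loss.backward(); opt.step()
```

At inference time, the predicted landmark sequence per frame would drive the texture synthesis, candidate matching, and re-timing steps described in the list above.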
Artificial intelligence has made fake videos far more convincing, and they are here to stay. But a word of caution: the social impact can be huge and may lead to legal problems, so it is better to spend your energy on innovative ideas than on pranks. On the positive side, GANs will help in the construction of better images, and researchers believe this may one day even help in detecting tumors.