- Introduction: The line between machine & artist becomes blurred
- While AI has proven superior at complex calculations & predictions, creativity seemed to be one domain that machines couldn't take over.
- As Artificial Intelligence begins to generate stunning visuals, profound poetry & transcendent music, the nature of art & the role of human creativity in the future start to feel uncertain.
- It really is amazing that AI is now capable of producing art that is aesthetically pleasing.
Please Note: I reserve the rights to all the media used in this blog (photographs, animations, videos, etc.); they are my work, except the 7 mentioned artworks by artists, which were used as style images. GIFs might take a while to load, so please be patient. If they don't load in the Medium app, please open the blog in a browser instead.
Here are some image processing techniques that I applied to generate digital artwork from photographs:
2. Color Quantization
- Reduces the number of distinct colors used in an image, with the intention that the new image should be visually similar & compressed in size.
- Common case: transform a 24-bit color image into an 8-bit color image.
- Uses K Means Clustering to group pixels of similar color.
- The K cluster centroids represent points in the 3-D RGB color space & replace the colors of all pixels in their cluster, resulting in an image with K colors.
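The steps above can be sketched with scikit-learn's K-Means; the synthetic image, cluster count & seed here are illustrative stand-ins, not the values used for the artwork in this blog.

```python
# Color quantization sketch: cluster pixels in RGB space with K-Means,
# then repaint every pixel with its cluster centroid color.
import numpy as np
from sklearn.cluster import KMeans

def quantize_colors(image, k=8, seed=0):
    """Reduce an RGB image (H x W x 3, uint8) to at most k colors."""
    h, w, c = image.shape
    pixels = image.reshape(-1, 3).astype(np.float64)  # each pixel = a point in RGB space
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(pixels)
    # Replace every pixel with the centroid of its cluster
    quantized = kmeans.cluster_centers_[kmeans.labels_]
    return quantized.reshape(h, w, c).astype(np.uint8)

# Tiny synthetic image: random noise quantized down to 4 colors
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
out = quantize_colors(img, k=4)
print(len(np.unique(out.reshape(-1, 3), axis=0)))  # at most 4 distinct colors remain
```

For a real photograph you would load the image with `PIL.Image.open` first; the same function applies unchanged.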
3. Superpixel Segmentation
- Partition image into superpixels. A superpixel is a group of connected pixels with similar colors or gray levels.
- Uses an unsupervised segmentation technique called Simple Linear Iterative Clustering (SLIC), which segments the image using K-Means clustering.
- It takes in all the pixel values of the image & tries to separate them into a predefined number of sub-regions.
- All the pixels in each superpixel then take the average color value of all the pixels in that segment.
4. Neural Style Transfer
- Optimization technique which combines the content of an image with the style of a different image, effectively transferring the style.
- Image content: object structure, their specific layout & positioning.
- Image style: color, texture, patterns in strokes, style of painting technique.
- Capable of generating fascinating results that are difficult to produce manually.
- As shown below, the output matches the content statistics of the content image & the style statistics of the style image. These statistics are extracted from images using a convolutional neural network.
- Simply put, the generated image is the same content image, but as though it were painted by Van Gogh in the style of his artwork 'Starry Night'.
4.1 Project Summary
- I’ve been working on this project for over a month. I experimented a lot with model hyperparameters & the pair of content images & style images.
- Generated 2500+ digital artworks so far using a combination of 63 content images & 40 style images (8 artworks & 32 photographs).
- Each image (800 pixels wide) takes 7 minutes to generate (2000 iterations). I could generate images up to 1200 pixels wide on a 6 GB GPU.
- I also applied basic image enhancement techniques & color correction to produce visually aesthetic artwork.
4.2 Style Transfer: VGG-19 CNN Architecture
- Style transfer is a complex technique that requires a powerful model.
- Models large enough to achieve this task can take very long to train & require extremely large datasets to do so.
- For this reason, we import a pre-trained model that has already been trained on the very large ImageNet database.
- Following the original NST paper, we will use the VGG-19 network.
- Pre-trained VGG-19 model has learned to recognize a variety of features.
- Fig. shows dimension of different layers for an input image (1200×800).
- Feature portion – deals with extracting relevant features from the images.
- Classifier portion – deals with image classification (Not required here).
- In our project, instead of building our own CNN from scratch, we will be relying on the pre-trained features portion of the model only.
- The input layer takes a 3-channel colored RGB image, which then passes through 16 convolutional layers; the remaining 3 layers in VGG-19 are fully connected classification layers. There are also a total of 5 max-pooling layers.
- We have frozen the relevant parameters such that they are not updated during the backpropagation process.
4.3 Cost Function
- Unlike regular deep learning algorithms, we are not optimizing a cost function to get a set of parameter values.
- Instead, we optimize a cost function to get the pixel values of the target image.
- Content loss: mean square error at some convolution output produced by the content image & the generated image.
- Style loss: mean square error of gram matrices produced by the style image & the generated image.
- Total loss: weighted addition of content loss & style loss.
4.4 Optimizing Content of the image:
Match the content features of target image with the features of content image.
4.4.1 Feature Map — filter visualizations
- Here I have shown 2 of the 64 feature maps of Conv1_1 layer.
- Each feature map in a layer detects some features of the image.
- The first feature map below is trying to recognize vertical edges in the image (more specifically, edges where the left side is lighter than the right side).
- The second feature map identifies horizontal edges (more specifically, edges where the top region is lighter than the bottom region).
4.4.2 Shallower vs deeper layers
- Shallower layers detect low-level features like edges & simple textures.
- Deeper layers detect high-level features like complex textures & shapes.
- The dimension of feature maps shrinks as we move deeper.
- Following GIFs show some of the feature maps in the mentioned layers.
4.4.3 Choosing a layer for content extraction
- Each successive layer of CNN forgets about the exact details of the original image & focuses more on features (edges, shapes, textures).
- Take the most important features of the content. Any detail we didn’t fill in can be filled in with style. This allows room to balance out content & style.
- At the beginning of neural network, we will always get a sharper image.
- We will get the most visually pleasing results if we choose a layer in the middle of the network, neither too shallow nor too deep.
- Conv4_2 layer is chosen here to capture the most important features.
4.4.4 Content Loss
- Goal: the generated image should have content similar to the content image.
- Content loss takes a hidden layer activation of CNN (Conv4_2 here), & measures how different activations of content & generated image are.
- Minimizing the content loss makes sure both images have similar content.
- Take the squared difference between activations from the content image (AC) & the generated image (AG), then average all those squared differences.
- Content_Loss = mean( Σᵢ (AGᵢ − ACᵢ)² ) ∀ i = 1 to 512
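The content loss above is a one-liner in PyTorch; the activation shape below matches Conv4_2's 512 feature maps, but the spatial size is an illustrative assumption.

```python
# Content loss sketch: mean squared difference between the activations of
# the content image (AC) and the generated image (AG) at one chosen layer.
import torch

def content_loss(AG, AC):
    # AG, AC: activations of shape (channels, height, width), e.g. 512 maps at Conv4_2
    return torch.mean((AG - AC) ** 2)

AC = torch.randn(512, 50, 75)
AG = AC.clone()
print(content_loss(AG, AC).item())  # 0.0 for identical activations
```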
4.5 Optimizing Style of the image:
- Content features are used as-is, since the CNN does a good job of extracting the content elements of an image that is fed into it.
- However, the style features require one additional pre-processing step: the use of a Gram matrix for more effective style feature extraction.
- Applying a Gram matrix to the extracted features eliminates the content information but preserves the style information.
4.5.1 Style Weights
- We have chosen 5 layers to extract features from. Layers close to the beginning are usually more effective at recreating style features, while later layers offer additional variety in the style elements.
- We can choose to prioritize certain layers over other layers by associating certain weight parameters with each layer.
- We will weight earlier layers more heavily.
4.5.2 Gram matrix — G(gram)
- G(gram) measures correlations between feature maps in the same layer.
- A feature map is simply the post-activation output of a convolutional layer.
- Conv2_1 has 128 filters, it will output 128 feature maps.
- Intuition: suppose we have two filters, one detects blue objects & one detects spirals. Applying these filters to an input image will produce 2 feature maps & we measure their correlation.
- If the feature maps are highly correlated, then any spiral present in the image is almost certain to be blue.
- Minimizing the difference between the gram matrix of style & generated image results in having a similar texture in the generated image.
- All the activation maps are then unrolled into a 2D matrix of pixel values.
- Each row in unrolled version represents activations of a filter (or channel).
- G(gram) is computed by multiplying the unrolled filter matrix with its transpose which results in a matrix of dimension channels x channels.
- G(gram) is independent of image resolution i.e. generated image & style image will have gram matrix dimension 128×128 for Conv2_1 layer.
- Hence resolution of generated image (= resolution of content image) & style image can be different.
- The diagonal elements measure how active a filter i is, e.g. suppose filter i is detecting vertical textures; then G(gram)ᵢᵢ measures how common vertical textures are in the image as a whole. If G(gram)ᵢᵢ is large, this means that the image has a lot of vertical texture.
- By capturing the prevalence of different types of features G(gram)ᵢᵢ, as well as how much different features occur together G(gram)ᵢⱼ, the Gram matrix G(gram) measures the style of an image.
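The unroll-and-multiply construction above can be sketched as follows; the channel count matches Conv2_1's 128 filters, while the spatial size is an arbitrary assumption to show that the Gram matrix is resolution-independent.

```python
# Gram matrix sketch: unroll each feature map into one row, then multiply
# by the transpose, giving a (channels x channels) correlation matrix.
import torch

def gram_matrix(features):
    c, h, w = features.shape           # e.g. 128 feature maps for Conv2_1
    unrolled = features.view(c, h * w) # one row per filter/channel
    return unrolled @ unrolled.t()     # correlations between feature maps

f = torch.randn(128, 100, 150)
G = gram_matrix(f)
print(G.shape)  # (128, 128) regardless of the 100x150 spatial size
```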
4.5.3 Style Loss
- Goal: To minimize the distance between gram matrix of style image & gram matrix of generated image.
- Pass generated image & style image through same pre-trained VGG CNN.
- Take the output at each chosen convolution layer of the CNN, calculate the Gram matrices & then calculate the mean squared error for each chosen layer.
- In each iteration, we create an output image so that difference between gram matrix of output & gram matrix of style image is minimized.
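Combining the per-layer weights from 4.5.1 with the Gram comparison gives a total style loss; the layer names, weight values & tensor shapes below are illustrative assumptions rather than the blog's exact configuration.

```python
# Weighted style loss sketch across several layers, with earlier layers
# weighted more heavily, as described above.
import torch

def gram(f):
    c, h, w = f.shape
    u = f.view(c, h * w)
    return (u @ u.t()) / (c * h * w)  # normalized Gram matrix

style_weights = {'conv1_1': 1.0, 'conv2_1': 0.8, 'conv3_1': 0.5,
                 'conv4_1': 0.3, 'conv5_1': 0.1}

def style_loss(gen_feats, style_feats):
    # gen_feats / style_feats: dicts mapping layer name -> activation tensor
    loss = 0.0
    for layer, w in style_weights.items():
        diff = gram(gen_feats[layer]) - gram(style_feats[layer])
        loss = loss + w * torch.mean(diff ** 2)
    return loss

feats = {name: torch.randn(8, 10, 10) for name in style_weights}
print(style_loss(feats, feats).item())  # 0.0 when the features match exactly
```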
4.6 Optimizing content & style together:
- The total loss is a linear combination of content loss & total style loss.
- Total Loss = α*Content_Loss + β*Total_Style_Loss
- α & β hyperparameters control relative weighting between content & style.
- Run content image through the VGG19 model & compute the content cost.
- Run the style image through the VGG19 model & compute the style cost.
- Used the Adam optimizer with learning rate = 0.003.
- Utilized the GPU by transferring the model & tensors to CUDA.
- We have a content image, style image & generated (or target) image.
- While the style transfer paper starts with the target image as random white noise, I started with the target image as a clone of the content image.
- The optimization process then tries to maintain the content of the target image while applying more style from the style image with each iteration.
- We’re going to optimize total loss with respect to the generated image.
- Content loss: we pass content & generated image through a pretrained CNN like VGG-19. We take mean squared error between these 2 outputs.
- Style loss: we pass style & generated image through the same CNN. We grab the output at 5 different layers & calculate the gram matrix.
- We then take the mean square error between the gram matrices of the style image & generated image for each layer.
- Take weighted sum of these mean squares. This gives us a total style loss.
- Total loss is the weighted sum of content loss & total style loss.
- With backpropagation compute all the gradients required to minimize the loss w.r.t. our target image parameters.
- This is how the optimizer learns which pixels to adjust & how to adjust them in order to minimize the total loss.
- Update generated image with backpropagation that minimizes total loss.
- Update the target image with each iteration & repeat the process.
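The whole loop described above can be condensed into a runnable sketch. To keep it self-contained, a tiny random CNN stands in for VGG-19, a single layer is used for both content & style, and the image sizes, iteration count & weights are illustrative; the real project extracts features from several named VGG-19 layers.

```python
# Condensed style transfer loop: optimize the pixels of the target image
# (not the network weights) to minimize alpha*content + beta*style loss.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny frozen "feature extractor" standing in for pre-trained VGG-19
cnn = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(8, 8, 3, padding=1))
for p in cnn.parameters():
    p.requires_grad_(False)

def gram(f):
    _, c, h, w = f.shape
    u = f.view(c, h * w)
    return (u @ u.t()) / (c * h * w)

content = torch.rand(1, 3, 32, 32)
style = torch.rand(1, 3, 32, 32)
target = content.clone().requires_grad_(True)  # start from the content image

alpha, beta = 1.0, 100.0
optimizer = torch.optim.Adam([target], lr=0.003)

with torch.no_grad():
    content_feat = cnn(content)   # fixed content statistics
    style_gram = gram(cnn(style)) # fixed style statistics

losses = []
for step in range(100):
    optimizer.zero_grad()
    feat = cnn(target)
    c_loss = torch.mean((feat - content_feat) ** 2)
    s_loss = torch.mean((gram(feat) - style_gram) ** 2)
    loss = alpha * c_loss + beta * s_loss
    loss.backward()               # gradients w.r.t. the target image's pixels
    optimizer.step()              # adjusts pixels, since the CNN is frozen
    losses.append(loss.item())

print(losses[0], losses[-1])      # the total loss should drop over iterations
```

In the real project the same loop runs for 2000 iterations with the model & tensors moved to CUDA.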
4.8.1 Using famous artworks as style images:
- Repaint the picture in the style of any artist from Van Gogh to Picasso.
- Here are some artworks that I chose for this blog.
Vincent Van Gogh: Starry Night
- One of the most recognized & magnificent pieces of art in the world.
- I added a motion effect here, the whole effect is ethereal & dreamlike.
- Color: Blue dominates the painting, blending hills into the sky. The Yellow & white of the stars & the moon stand out against the sky.
- Sky: Brushstrokes swirl, rolling with the clouds around the stars & moon.
- Hills & trees: Bend & match the soft swirls of the sky. Roll down into the little village below.
- Village: Straight lines & sharp angles divide it from the rest of the painting.
I dream of painting, and then I paint my dream — Vincent Van Gogh
The world today doesn’t make sense, so why should I paint pictures that do? — Pablo Picasso
Here are the results, some combinations produced astounding artwork.
4.8.2 Using photographs as style images:
Here's an image of a bride & graffiti; combining them results in an output similar to a doodle painting.
Here, you can see the buildings popping up in the background.
The effect kind of resembles the glass etching technique here.
Texture of the old wooden door created a unique look of an aged painting.
The pattern of the ceiling of India Habitat Centre is transferred here, creating an effect similar to a mosaic.
Designs generated by spirograph are applied to the content image here.
Texture of an ice block worked really well here.
It seems like graffiti is painted on a brick wall.
Style of a mosaic ceiling is used to generate the output.
4.9 Iteration-wise Visualization:
- Visual of how style & content images combine to optimize target image.
- The image gets progressively more styled as the iterations proceed & it is fascinating to visualize.
4.10 Stylize a sequence of photographs to create animation:
- One application I can think of is in the animation industry: shooting in the real world & stylizing the footage as per the required style image.
- Here, I captured the images with a continuous burst mode of DSLR.
4.11 Stylize a video using Style Transfer
- Stylized a timelapse video that I shot at 30 frames/sec, 30sec duration.
- Each one of the 900 frames is then passed through the style transfer algorithm with different style images to create a unique effect.
- For each style, all frames took approx 18hrs to render in 720p resolution.
4.12 Variation in result with content weight (α) & style weight (β):
- The lower the ratio of α to β, the more style is transferred.
- I got impressive results with α=1 & β=100; all the results in this blog use this ratio.
- For 2000 iterations, here's how the ratio impacts the generated image:
- The variation is more pronounced in the brush strokes in trees.
4.13 Style transfer with 2 style images:
- I’ve extended the algorithm to combine the style from 2 style images.
- Shown below are 2 generated images produced with 2 style images.
- I like the texture in the first generated image. However, I want it to be more colorful like the 2nd generated image.
- So, instead of using 1 style image, I used a combination of both style images, & the result is pretty impressive.
- Modified total loss = 1*content_loss + 100*style1_loss + 45*style2_loss
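As a sketch, the modification is just one extra weighted term in the total loss; `style1_loss` & `style2_loss` are assumed to be computed exactly as the single-style loss in 4.5.3, and the loss values below are made-up numbers.

```python
# Two-style total loss sketch: each style image contributes its own
# weighted style term, extending Total Loss = alpha*content + beta*style.
def total_loss(content_loss, style1_loss, style2_loss):
    return 1 * content_loss + 100 * style1_loss + 45 * style2_loss

print(total_loss(0.5, 0.02, 0.01))
```

Tuning the two style weights (100 & 45 here) controls how much each style image dominates the result.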
It takes hours, days, or even longer to finish a painting, & yet with the help of deep learning we can generate a new digital painting inspired by some style from photographs in a matter of a few minutes. It makes us wonder whether computers, rather than humans, will be the artists of the future.
Ending the blog with a debatable question: If Artificial Intelligence is used to create images, can the final product really be thought of as art? Comment your view on this.
Thank you for reading!
I hope you enjoyed the blog. If you want to keep up to date with my articles please follow me. 🙂
My next blog will be on Deep Dream, an AI algorithm that produces a dream-like, hallucinogenic appearance in intentionally over-processed images.