Image Compression In 10 Lines of R Code

Principal Component Analysis (PCA) is a powerful Machine Learning tool. As an unsupervised learning technique, it excels at dimension reduction and feature extraction.

However, did you know that we can also use PCA to compress images?

[Related Article: Automating Image Annotation with MAX]

Photo by Pietro Jeng on Unsplash

# Install packages and load libraries

#install.packages("tidyverse")
#install.packages("imager")

# Only tidyverse (for the %>% pipe) and imager (for as.cimg) are needed here.
library(tidyverse)
library(imager)
# Load the dataset. This is a 100*100*1000 array; an array is a
# generalization of a matrix to more than 2 dimensions. The first two
# dimensions index the pixels of a 100*100 black-and-white face image;
# the last dimension indexes one of the 1000 face images. The dataset can be
# accessed at: https://cyberextruder.com/face-matching-data-set-download/.
load("faces_array.RData")

# PCA requires a single matrix, so we flatten each 100*100 image into a
# vector of length 10,000, giving a 1000*10000 matrix (one face per row).
face_mat <- sapply(1:1000, function(i) as.numeric(faces_array[, , i])) %>% t

# To visualize an image, we need a matrix, so let's convert the
# 10,000-dimensional vector back into a 100*100 matrix.
plot_face <- function(image_vector) {
  plot(as.cimg(t(matrix(image_vector, ncol = 100))), axes = FALSE, asp = 1)
}
plot_face(face_mat[, sample(1000, 1)])

Here, we obtain the basic structure of the dataset and construct a helper function for plotting any face.
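
For reference, two quick dimension checks (a small addition, not part of the original post) confirm the reshaping:

# Structural checks: the raw array vs. the flattened matrix
dim(faces_array) # 100 100 1000
dim(face_mat)    # 1000 10000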

# Check the average face
face_average = colMeans(face_mat)
plot_face(face_average)

To a large extent, we can think of the "average face" as the baseline for all the other images: by adding deviations to or subtracting them from the average face, we can recover any other face.
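
As a quick sanity check (a sketch that is not counted in the 10 lines; face 232 is an arbitrary pick), any face equals the average face plus its per-pixel deviation from that average:

# A face is the average face plus a per-pixel deviation
deviation = face_mat[232, ] - face_average
par(mfrow = c(1, 3))
plot_face(face_average)               # the baseline
plot_face(face_mat[232, ])            # an original face
plot_face(face_average + deviation)   # identical to the original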

#The code above doesn't count toward the 10-line limit.#

# And here come our 10 lines of code #

# generate PCA results;
# center=TRUE subtracts the average face from every image (mean 0);
# scale=FALSE keeps the original pixel variances
pr.out = prcomp(face_mat, center = TRUE, scale = FALSE)

# pr.out$sdev: the standard deviations of the principal components;
# (pr.out$sdev)^2: variance of the principal components
pr.var = (pr.out$sdev)^2

# pve: proportion of variance explained by each principal component
pve = pr.var / sum(pr.var)

# cumulative proportion of variance explained
cumulative_pve <- cumsum(pve)
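
Plotting pve and cumulative_pve (a small addition, not counted in the 10 lines) shows how quickly the variance accumulates, which helps motivate the cutoffs of 10, 50, 100, and 300 components used below:

# Scree and cumulative-variance plots
par(mfrow = c(1, 2))
plot(pve, type = "l", xlab = "Principal Component",
     ylab = "Proportion of Variance Explained")
plot(cumulative_pve, type = "l", xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained")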

# See the math explanation attached at the end:
# U holds the eigenvectors (prcomp's rotation); Z holds the transposed scores
U = pr.out$rotation
Z = t(pr.out$x)

# Let's compress the 232nd face of the dataset: reconstruct it from the first
# 10, 50, 100, and 300 principal components, adding the average face back.
par(mfrow = c(1, 5))
plot_face(face_mat[232, ])
for (i in c(10, 50, 100, 300)) {
  plot_face((U[, 1:i] %*% Z[1:i, ])[, 232] + face_average)
}

We did it! The results are not bad at all: the original image is on the far left, followed by the four compressed versions (10, 50, 100, and 300 components, left to right).
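
How much space does this actually save? A rough back-of-the-envelope calculation (a sketch, assuming we store one shared basis plus per-image scores for all 1000 faces):

# Rough storage arithmetic (an illustration, not part of the 10 lines)
# original: 1000 images * 10000 pixels each
# compressed: 10000*k (basis) + 1000*k (scores) + 10000 (average face)
k = 50
(10000 * k + 1000 * k + 10000) / (1000 * 10000) # ~0.056, roughly 18x smaller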


A Simple Math Explanation.

PCA is closely related to the singular value decomposition (SVD) of a matrix:

x = U D V^T = z V^T,

where x is the centered 1000*10000 data matrix, and

  • V: the matrix of eigenvectors (the rotation returned by prcomp; this is what the code above calls U)
  • D: the diagonal matrix of singular values, which are proportional to the standard deviations of the principal components (the sdev returned by prcomp)
  • z = UD: the coordinates of the images in the rotated space (the x returned by prcomp; the code's Z is its transpose)

In other words, we can compress the images using the first k columns of V and the first k columns of z:

x ≈ z_k (V_k)^T

so each compressed face is its k scores multiplied by the transpose of the first k eigenvectors, plus the average face.
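
A minimal numerical check of this relationship (a sketch, not part of the original post; singular-vector signs can flip arbitrarily, hence the abs(), and unname() drops prcomp's dimension names):

# Verify that prcomp and svd agree, up to sign flips of the vectors
X_centered = scale(face_mat, center = TRUE, scale = FALSE)
s = svd(X_centered)
all.equal(abs(unname(s$v[1:5, 1:5])),
          abs(unname(pr.out$rotation[1:5, 1:5])))   # V matches rotation
all.equal(abs(unname((s$u %*% diag(s$d))[1:5, 1:5])),
          abs(unname(pr.out$x[1:5, 1:5])))          # z = UD matches scores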

End of math.

[Related Article: Wonders in Image Processing with Machine Learning]

Originally Posted Here

Leihua Ye

Leihua is a Ph.D. Candidate in Political Science with a Master's degree in Statistics at UC Santa Barbara. As a Data Scientist, Leihua has six years of research and professional experience in Quantitative UX Research, Machine Learning, Experimentation, and Causal Inference. His research interests include:

1. Field Experiments, Research Design, Missing Data, Measurement Validity, Sampling, and Panel Data
2. Quasi-Experimental Methods: Instrumental Variables, Regression Discontinuity Design, Interrupted Time-Series, Pre-and-Post-Test Design, Difference-in-Differences, and Synthetic Control
3. Observational Methods: Matching, Propensity Score Stratification, and Regression Adjustment
4. Causal Graphical Models, User Engagement, Optimization, and Data Visualization
5. Python, R, and SQL

Connect here:

1. http://www.linkedin.com/in/leihuaye
2. https://twitter.com/leihua_ye
3. https://medium.com/@leihua_ye
