

UNHCR Refugee Data Visualized
Data Visualization, Modeling. Posted by Eugene Joh, December 24, 2017

Where’s the Data?
The data I’m using comes from the website of the United Nations High Commissioner for Refugees (UNHCR), the UN Refugee Agency. You can read more about what they do and why they exist at the link above. At the time of writing you could only download the mid-year statistics for 2015; otherwise you got a message saying: “Error: Statistics filters are not available at this time. Please try again later.” I just checked again and the data is accessible again!
In this post I am not going to share my views on the current political landscape in the United States (that could be a completely separate blog on its own). Similar to my previous post, I’m going to do a walkthrough of the data cleaning, ask a couple of questions and visualize the data in some meaningful way. If you want to skip the details of the code, you can just scroll down to the figures.
Working in R
As usual, all the code is in my GitHub repository for anyone who wants to download and use it. For this post, I’m going to use the following packages: ggplot2, grid, scales, reshape2 and wordcloud. I was reading some R manuals and I discovered a “new” way to work with files relative to your working directory. It involves the list.files() function, which lists the files or folders in your current working directory. You can also specify a path, which saves the time of constantly changing the path in the setwd() function (something I have been foolishly doing for a while).
    setwd("~/Documents/UNHCR Data/")          # ~ acts as base for home directory
    list.files(full.names = TRUE)             # gives full path for files or folders
    files.all <- list.files(path = "All_Data/")  # assigns name to file names in the All_Data folder
    length(files.all)                         # checks how many objects there are in the /All_Data folder
    files.all
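As a small aside, list.files() can also return ready-to-use paths and filter by file type, so less pasting and less setwd() is needed. This is only a minimal sketch: the temporary folder and the file names (unhcr_demo, sample.csv, readme.txt) are made up for illustration and are not part of the UNHCR download.

```r
# Hypothetical folder layout, created in a temporary directory for illustration
base <- file.path(tempdir(), "unhcr_demo")
dir.create(file.path(base, "All_Data"), recursive = TRUE, showWarnings = FALSE)
writeLines("a,b", file.path(base, "All_Data", "sample.csv"))
writeLines("notes", file.path(base, "All_Data", "readme.txt"))

# List only the .csv files; full.names = TRUE keeps the folder in each path
csv.paths <- list.files(path = file.path(base, "All_Data"),
                        pattern = "\\.csv$", full.names = TRUE)

basename(csv.paths)  # file names without the folder part
```

Because the paths already include the folder, they can be passed straight to read.csv() without changing the working directory.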
Since the file is a comma-delimited file (.csv), we use the read.csv() function to read it in and assign it the name “ref.d”. I used the paste0() function to join the folder name and the file name into a single path. We set the skip argument equal to 2 because, by visual inspection of the file, the first two rows are devoted to the file title (we don’t want to read that into R). I also used the na.strings argument to specify that any blanks (“”), dashes (“-”) and asterisks (“*”) should be treated as missing data, NA. According to the UNHCR website, the asterisks mark redacted information.
    ref.d <- read.csv(
      paste0("All_Data/", files.all),  # file path plus file name (assumes a single file in All_Data/)
      header = TRUE,                   # the column headers are in the 3rd row
      skip = 2,                        # skips the first two rows (title lines in the .csv file)
      na.strings = c("", "-", "*")     # convert all blank, "-" and "*" cells into missing type NA
      # col.names = new.names          # optional: only works once new.names (defined in the next block) exists
    )
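To see what skip and na.strings actually do, here is a minimal sketch on a tiny made-up file passed in through the text argument. The rows and values are hypothetical and only mimic the layout described above (two title lines, then a header row).

```r
# Hypothetical miniature of the UNHCR file: two title lines, then the header
raw <- c(
  "Extracted from the UNHCR database",  # title line 1 (skipped)
  "Hypothetical sample",                # title line 2 (skipped)
  "Year,Country,Origin,Refugees",
  "2014,Canada,Iran,*",                 # redacted value
  "2014,Canada,Iraq,-",                 # dash
  "2014,Canada,Syria,120",
  "2014,Canada,Chad,"                   # blank
)

demo <- read.csv(text = paste(raw, collapse = "\n"),
                 skip = 2, header = TRUE,
                 na.strings = c("", "-", "*"),
                 stringsAsFactors = FALSE)

demo$Refugees               # the three flagged cells come back as NA
sum(is.na(demo$Refugees))   # number of missing values in the column
```

Since the flagged cells are turned into NA during the read, the Refugees column stays numeric instead of being read as text.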
The first thing I see is that the column names are long and would be annoying to retype. So first we’ll change these using the names() function.
    new.names <- c("Year", "Country", "Country_Origin", "Refugees", "Asylum_Seekers", "Returned_Refugees", "IDPs", "Returned_IDPs", "Stateless_People", "Others_of_Concern", "Total")
    names(ref.d) <- new.names
By using summary() and str() on the dataset we can see that the data span 1951 to 2014 and contain information on the country where refugees are situated, their country of origin, counts for each population of concern and total counts.
    > summary(ref.d)
          Year        Country          Country_Origin        Refugees
     Min.   :1951   Length:103746      Length:103746      Min.   :      1
     1st Qu.:2000   Class :character   Class :character   1st Qu.:      3
     Median :2006   Mode  :character   Mode  :character   Median :     16
     Mean   :2004                                         Mean   :   5992
     3rd Qu.:2010                                         3rd Qu.:    167
     Max.   :2014                                         Max.   :3272290
                                                          NA's   :19353
     Asylum_Seekers     Returned_Refugees      IDPs           Returned_IDPs
     Min.   :     0.0   Min.   :      1   Min.   :    470   Min.   :     23
     1st Qu.:     1.0   1st Qu.:      2   1st Qu.:  90746   1st Qu.:   5000
     Median :     5.0   Median :     16   Median : 261704   Median :  27284
     Mean   :   255.9   Mean   :   6737   Mean   : 540579   Mean   : 111224
     3rd Qu.:    34.0   3rd Qu.:    247   3rd Qu.: 594443   3rd Qu.: 104230
     Max.   :358056.0   Max.   :9799410   Max.   :7632500   Max.   :1186889
     NA's   :46690      NA's   :97327     NA's   :103330    NA's   :103544
     Stateless_People  Others_of_Concern      Total
     Min.   :      1   Min.   :     1.0   Min.   :      1
     1st Qu.:    205   1st Qu.:    14.8   1st Qu.:      3
     Median :   1720   Median :   444.5   Median :     16
     Mean   :  66803   Mean   : 24672.5   Mean   :   8605
     3rd Qu.:  11462   3rd Qu.:  6000.0   3rd Qu.:    166
     Max.   :3500000   Max.   :957000.0   Max.   :9799410
     NA's   :103103    NA's   :103018     NA's   :2433
It’s always good practice to identify missing data as well (especially since we told read.csv() above which values to treat as NA). summary() already reports the NA counts for numeric columns; for non-numeric variables you can use a simple function built from apply() and is.na() to count the missing values (NA’s) in every column of your data.
    apply(ref.d, 2, function(x) sum(is.na(x)))
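An equivalent way to get the same counts is colSums() on the logical matrix returned by is.na(). This is a minimal sketch on a made-up toy data frame (the values are hypothetical), just to show that both approaches agree.

```r
# Hypothetical toy data with missing values in both a character and a numeric column
toy <- data.frame(
  Country  = c("Canada", NA, "Germany"),
  Refugees = c(10, NA, NA),
  stringsAsFactors = FALSE
)

# The apply() version used above: one NA count per column
na.apply <- apply(toy, 2, function(x) sum(is.na(x)))

# Alternative: is.na() on a data frame returns a logical matrix, so colSums() counts the TRUEs
na.colsums <- colSums(is.na(toy))

na.apply
na.colsums
```

Both return a named vector with one count per column, so which one to use is mostly a matter of taste.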
When I used str() on the data, I saw that the country names were stored as factors and the population-of-concern categories were integers. I wrote a short for loop to change these to character and numeric types respectively.
    for (i in 2:length(names(ref.d))) {  # "2" ignores the first column (we want to keep Year as an integer)
      if (class(ref.d[, i]) == "factor") {
        ref.d[, i] <- as.character(ref.d[, i])
      }
      if (class(ref.d[, i]) == "integer") {
        ref.d[, i] <- as.numeric(ref.d[, i])
      }
    }
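The same conversion can also be written without an explicit loop by using lapply() over the columns. This is just a sketch on a made-up toy data frame (the rows are hypothetical), not a replacement for the loop above.

```r
# Hypothetical toy data: integer Year, factor Country, integer Refugees
toy <- data.frame(
  Year     = c(2013L, 2014L),
  Country  = factor(c("Canada", "Germany")),
  Refugees = c(10L, 25L)
)

cols <- names(toy)[-1]  # every column except Year, which stays an integer

# Convert factors to character and integers to numeric, leave anything else untouched
toy[cols] <- lapply(toy[cols], function(x) {
  if (is.factor(x)) {
    as.character(x)
  } else if (is.integer(x)) {
    as.numeric(x)
  } else {
    x
  }
})

sapply(toy, class)  # check the resulting column types
```

Using is.factor() and is.integer() instead of comparing class() to a string also avoids surprises when a column has more than one class.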
Another nuance: I wanted to change some of the country names (they were either very long or had extra information). I first identified the names I wanted to change and then replaced them with a new set of names. I did this using a for loop as well.
    old.countries <- c("Bolivia (Plurinational State of)",
                       "China, Hong Kong SAR",
                       "China, Macao SAR",
                       "Iran (Islamic Rep. of)",
                       "Micronesia (Federated States of)",
                       "Serbia and Kosovo (S/RES/1244 (1999))",
                       "Venezuela (Bolivarian Republic of)",
                       "Various/Unknown")
    # replacement names
    new.countries <- c("Bolivia", "Hong Kong", "Macao", "Iran", "Micronesia", "Serbia & Kosovo", "Venezuela", "Unknown")
    for (k in 1:length(old.countries)) {
      ref.d$Country_Origin[ref.d$Country_Origin == old.countries[k]] <- new.countries[k]
      ref.d$Country[ref.d$Country == old.countries[k]] <- new.countries[k]
    }
If anyone has alternative ways to achieve the above (i.e. using the apply family), comment below! Just a short disclaimer on for loops in R: there has been a lot of argument about the effectiveness of for loops in R compared to the apply function family. A quick Google search shows many opinions on this issue, based on computing speed/power, simplicity, elegance, etc. Advanced R by Hadley Wickham talks about this, and I’ve generally used it as a guideline on whether to use a for loop or an apply function.
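One possible loop-free alternative is a named lookup vector combined with match(). This is only a sketch: recode_country() and the toy input vector are names I made up for illustration, and only two of the replacement pairs from above are shown.

```r
# Named vector: names are the old labels, values are the replacements
lookup <- c("Iran (Islamic Rep. of)" = "Iran",
            "Various/Unknown"        = "Unknown")

# Hypothetical helper: replace labels found in the lookup, keep everything else as is
recode_country <- function(x, lookup) {
  hit <- match(x, names(lookup))               # position in the lookup, NA if absent
  ifelse(is.na(hit), x, unname(lookup[hit]))   # swap only the matched labels
}

countries <- c("Iran (Islamic Rep. of)", "Canada", "Various/Unknown")
recode_country(countries, lookup)
```

With the real data the same helper could be applied to both ref.d$Country and ref.d$Country_Origin, so the replacement pairs live in one place.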
Some Descriptives and North Korea
Just to get an idea of the data, we can create a list of the countries and countries of origin, using code similar to what I used to identify countries with certain MRSA strains in my previous post.
    clist <- sort(unique(ref.d$Country))            # alphabetical
    clist
    or.clist <- sort(unique(ref.d$Country_Origin))  # alphabetical
    or.clist
We can then compare them for any differences either using matching operators or the setdiff() function. First we’ll do this…
    clist[!clist %in% or.clist]
    # or
    setdiff(clist, or.clist)
    [1] "Bonaire"                   "Montserrat"
    [3] "Sint Maarten (Dutch part)" "State of Palestine"
… we can infer that these countries haven’t produced refugees or there is no data on these countries in the UNHCR database. If we reverse the comparison…
    or.clist[!or.clist %in% clist]
    # or
    setdiff(or.clist, clist)
     [1] "Andorra"                     "Anguilla"
     [3] "Bermuda"                     "Cook Islands"
     [5] "Dem. People's Rep. of Korea" "Dominica"
     [7] "French Polynesia"            "Gibraltar"
     [9] "Guadeloupe"                  "Holy See (the)"
    [11] "Kiribati"                    "Maldives"
    [13] "Marshall Islands"            "Martinique"
    [15] "New Caledonia"               "Niue"
    [17] "Norfolk Island"              "Palestinian"
    [19] "Puerto Rico"                 "Samoa"
    [21] "San Marino"                  "Sao Tome and Principe"
    [23] "Seychelles"                  "Stateless"
    [25] "Tibetan"                     "Tuvalu"
    [27] "Wallis and Futuna Islands "  "Western Sahara"
… we get a list of countries that have only produced refugees (and not taken any in), or for which data is missing in the UNHCR database. Being Korean-Canadian myself, I noticed North Korea (the Democratic People’s Republic of Korea) on this list. I wanted to ask the question: which countries have the largest number of North Korean refugees based on the UNHCR data? What are the top 10?
    # NK is assumed to hold the rows of ref.d whose Country_Origin is "Dem. People's Rep. of Korea" (the subsetting step is not shown)
    NK.tot <- aggregate(cbind(Total) ~ Country, data = NK, FUN = sum)
    NK.tot[order(-NK.tot[, 2]), ][1:10, ]
                        Country Total
    31           United Kingdom  4808
    5                    Canada  2954
    10                  Germany  2845
    18              Netherlands   487
    3                   Belgium   435
    23       Russian Federation   357
    32 United States of America   346
    1                 Australia   318
    20                   Norway   262
    9                    France   228
We find that the UK has the highest number of North Korean refugees, followed by Canada and Germany with similar counts. Learned something new today.
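One gap in the code above: the NK object is never created in the snippet. My assumption is that it is simply the rows of ref.d whose country of origin is North Korea. Here is a minimal sketch of that step on a made-up toy data frame; the toy rows and numbers are hypothetical, and only the column names and the country label come from the data above.

```r
# Hypothetical stand-in for ref.d with just the columns needed here
toy <- data.frame(
  Year           = c(2013, 2014, 2014, 2014),
  Country        = c("Canada", "Canada", "Germany", "Canada"),
  Country_Origin = c("Dem. People's Rep. of Korea", "Dem. People's Rep. of Korea",
                     "Dem. People's Rep. of Korea", "Iran"),
  Total          = c(100, 150, 80, 999),
  stringsAsFactors = FALSE
)

# Assumed subsetting step: keep only rows with North Korea as country of origin
NK <- toy[toy$Country_Origin == "Dem. People's Rep. of Korea", ]

# Sum the totals per host country, then sort from largest to smallest
NK.tot <- aggregate(Total ~ Country, data = NK, FUN = sum)
top <- head(NK.tot[order(-NK.tot$Total), ], 10)  # head() avoids NA rows when fewer than 10 countries exist
top
```

Using head() instead of [1:10, ] is a small design choice: it returns however many rows exist rather than padding the result with NA rows.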
Word Clouds in R
If you haven’t guessed based on the packages I loaded in the beginning, a word cloud was inevitable. Making word clouds in R is