UNHCR Refugee Data Visualized

Where’s the Data?

The data I’m using is taken from the website of the United Nations High Commissioner for Refugees (UNHCR), the UN Refugee Agency. You can read more on what they do and why they exist in the link above. Currently you can only download the mid-year statistics for 2015. You get a statement saying: “Error: Statistics filters are not available at this time. Please try again later.” Update: I just checked and the data is accessible again!

In this post I am not going to share my views on the current political landscape in the United States (that could be a completely separate blog post on its own). Similar to my previous post, I’m going to do a walkthrough of the data cleaning, ask a couple of questions and visualize the data in some meaningful way. If you want to skip the details on the code, you can just scroll down to the figures.

Working in R

As usual, all the code is in my GitHub repository for whoever wants to download and use it for themselves. For this post, I’m going to use the following packages: ggplot2, grid, scales, reshape2 and wordcloud. I was reading some R manuals and I discovered a “new” way to access your working directory environment. It involves the use of the list.files() function, which lists the files or folders in your current working directory. You can also specify a path, which saves the time of constantly changing the pathname in the setwd() function (which I have been foolishly doing for a while).

 

setwd("~/Documents/UNHCR Data/") # ~ acts as base for home directory
list.files(full.names=TRUE) # gives full path for files or folders
files.all <- list.files(path="All_Data/") #assigns name to file names in the All_Data folder
length(files.all) #checks how many objects there are in the /All_Data folder
files.all
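As a minimal sketch of the same idea without touching setwd() at all, the snippet below builds a throwaway folder in a temporary directory and lists it by path. The folder name UNHCR_demo and the file example.csv are made up for illustration and are not part of the UNHCR download.

```r
# Build a throwaway folder so the example runs anywhere (hypothetical names)
base.dir <- file.path(tempdir(), "UNHCR_demo")
data.dir <- file.path(base.dir, "All_Data")
dir.create(data.dir, recursive = TRUE, showWarnings = FALSE)
writeLines("a,b\n1,2", file.path(data.dir, "example.csv"))

# list.files() accepts a path, so the working directory never has to change
files.demo <- list.files(path = data.dir, pattern = "\\.csv$")
paths.demo <- list.files(path = data.dir, pattern = "\\.csv$", full.names = TRUE)

files.demo   # file names only
paths.demo   # file names with the folder prepended
```

With full.names=TRUE the result can be passed straight to read.csv(), so there is no need to paste the folder name back on.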

 

Since the file is a comma-delimited file (.csv), we use the read.csv() function to read it in and assign it the name “ref.d”. I used the paste0() function to join the folder name and the file name into a single path. We set the skip argument equal to 2 because, by visual inspection of the file, the first two rows are committed to the file title (we don’t want to read that into R). I also used the na.strings argument to specify that any blanks (“”), dashes (“-”) and asterisks (“*”) would be considered missing data, NA. According to the UNHCR website, the asterisks mark redacted information.

ref.d <- read.csv(paste0("All_Data/",files.all), #insert filepath and name into 1st argument (assumes a single file in All_Data)
    header=TRUE, #use the 3rd row of the file as the headers
    skip=2, #skips the first two rows (title metadata in .csv file)
    na.strings=c("","-","*") #convert all blank, "-" and "*" cells into missing type NA
    ) #the columns are renamed in the next step, once new.names exists
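To see what skip and na.strings do without the real download, here is a small sketch that reads a made-up file from a string. The values and the column layout are invented for illustration; only the arguments mirror the call above.

```r
# Toy file: two title rows, then a header row, then data (values are made up)
raw.text <- "Title line 1
Title line 2
Year,Country,Refugees,Asylum seekers
2013,A,10,*
2014,B,-,5
2014,C,,7"

demo.d <- read.csv(text = raw.text,
    header = TRUE,                 # header is the first row read after skipping
    skip = 2,                      # drop the two title rows
    na.strings = c("", "-", "*"),  # blanks, dashes and asterisks become NA
    stringsAsFactors = FALSE)

demo.d
colSums(is.na(demo.d))   # NA count per column
```

Because the dashes and asterisks are converted to NA while reading, the count columns come in as numbers rather than text.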

The first thing I see is that the column names are long and would be annoying to reproduce. So first we’ll change these using the names() function.

new.names <- c("Year", "Country", "Country_Origin", "Refugees", "Asylum_Seekers", "Returned_Refugees", "IDPs", "Returned_IDPs", "Stateless_People", "Others_of_Concern","Total")
names(ref.d) <- new.names

By using summary() and str() on the dataset we can see that the data spans from 1951 to 2014. It contains information on the country where refugees are situated, their country of origin, counts for each population of concern and total counts.

> summary(ref.d)
     Year        Country          Country_Origin        Refugees
Min.   :1951   Length:103746      Length:103746      Min.   :      1
1st Qu.:2000   Class :character   Class :character   1st Qu.:      3
Median :2006   Mode  :character   Mode  :character   Median :     16
Mean   :2004                                         Mean   :   5992
3rd Qu.:2010                                         3rd Qu.:    167
Max.   :2014                                         Max.   :3272290
                                                     NA's   :19353
Asylum_Seekers     Returned_Refugees      IDPs         Returned_IDPs
Min.   :     0.0   Min.   :      1   Min.   :    470   Min.   :     23
1st Qu.:     1.0   1st Qu.:      2   1st Qu.:  90746   1st Qu.:   5000
Median :     5.0   Median :     16   Median : 261704   Median :  27284
Mean   :   255.9   Mean   :   6737   Mean   : 540579   Mean   : 111224
3rd Qu.:    34.0   3rd Qu.:    247   3rd Qu.: 594443   3rd Qu.: 104230
Max.   :358056.0   Max.   :9799410   Max.   :7632500   Max.   :1186889
NA's   :46690      NA's   :97327     NA's   :103330    NA's   :103544
Stateless_People  Others_of_Concern      Total
Min.   :      1   Min.   :     1.0   Min.   :      1
1st Qu.:    205   1st Qu.:    14.8   1st Qu.:      3
Median :   1720   Median :   444.5   Median :     16
Mean   :  66803   Mean   : 24672.5   Mean   :   8605
3rd Qu.:  11462   3rd Qu.:  6000.0   3rd Qu.:    166
Max.   :3500000   Max.   :957000.0   Max.   :9799410
NA's   :103103    NA's   :103018     NA's   :2433

It’s always good practice to identify missing data as well (especially since we told read.csv() above which values to treat as missing). summary() only reports NA counts for the numeric columns, so for non-numeric variables you can use a simple function using apply() and is.na() to identify missing values (NAs) in your data.

apply(ref.d,2, function(x) sum(is.na(x)))
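As a sketch of two equivalent ways to get the same per-column counts, the toy data frame below (invented values) is run through apply(), colSums() and sapply(). apply() converts the data frame to a matrix first, whereas the other two work on the columns directly.

```r
# Toy data with missing values in a character and a numeric column (made up)
toy <- data.frame(Year = c(2013, 2014, 2014),
                  Country = c("A", NA, "C"),
                  Refugees = c(10, NA, NA),
                  stringsAsFactors = FALSE)

na.apply  <- apply(toy, 2, function(x) sum(is.na(x)))   # the approach used above
na.colsum <- colSums(is.na(toy))                        # vectorised equivalent
na.sapply <- sapply(toy, function(x) sum(is.na(x)))     # column-by-column equivalent

na.apply
na.colsum
na.sapply
```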

When I used str() on the data, I saw that the country names were stored as factors and the populations of concern categories as integers. I made a short for loop to change these to character and numeric types respectively.

for(i in 2:length(names(ref.d))){ # "2"-ignores the first column (we want to keep Year as an integer)
    if (class(ref.d[,i])=="factor"){
        ref.d[,i] <- as.character(ref.d[,i])}
    if (class(ref.d[,i])=="integer"){
        ref.d[,i] <- as.numeric(ref.d[,i])}
}
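The same conversion can also be sketched without an explicit loop by applying one function to every column except Year. The toy data frame is invented for illustration; this is one possible alternative, not the code from the repository.

```r
# Toy data: Year stays integer, the factor and integer columns get converted
toy <- data.frame(Year = c(2013L, 2014L),
                  Country = factor(c("A", "B")),
                  Refugees = c(10L, 20L))

cols <- names(toy)[-1]   # every column except Year
toy[cols] <- lapply(toy[cols], function(x) {
    if (is.factor(x)) return(as.character(x))
    if (is.integer(x)) return(as.numeric(x))
    x   # leave any other type unchanged
})

str(toy)
```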

Another nuance: I wanted to change some of the country names (they were either very long or had extra information). I first identified the names I wanted to change and then replaced them with a new set of names. I did this using a for loop as well.

old.countries <- c("Bolivia (Plurinational State of)",
    "China, Hong Kong SAR",
    "China, Macao SAR",
    "Iran (Islamic Rep. of)",
    "Micronesia (Federated States of)",
    "Serbia and Kosovo (S/RES/1244 (1999))",
    "Venezuela (Bolivarian Republic of)",
    "Various/Unknown")
# replacement names
new.countries <- c("Bolivia","Hong Kong","Macao","Iran","Micronesia","Serbia & Kosovo","Venezuela","Unknown")
for (k in 1:length(old.countries)){
    ref.d$Country_Origin[ref.d$Country_Origin==old.countries[k]]<-new.countries[k]
    ref.d$Country[ref.d$Country==old.countries[k]]<-new.countries[k]
}

If anyone has alternative ways to achieve the above (i.e. using the apply family), comment below! Just a short disclaimer on for loops in R: there has been a lot of argument about the effectiveness of for loops in R compared to the apply function family. A quick Google search shows many opinions on this issue, based on computing speed/power, simplicity, elegance, etc. Advanced R by Hadley Wickham talks about this and I’ve generally used it as a guideline on whether to use a for loop or an apply function.
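One possible loop-free take on the renaming step is a named lookup vector applied to both country columns with lapply(). This is only a sketch on invented toy data, and the helper recode.country() is a hypothetical name, not a function from any package.

```r
# Lookup table: names are the old labels, values are the short replacements
old.names   <- c("Iran (Islamic Rep. of)", "Various/Unknown")
short.names <- c("Iran", "Unknown")
lookup <- setNames(short.names, old.names)

toy <- data.frame(Country = c("Canada", "Iran (Islamic Rep. of)"),
                  Country_Origin = c("Various/Unknown", "Canada"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: replace only the values that appear in the lookup
recode.country <- function(x) {
    hit <- x %in% names(lookup)
    x[hit] <- unname(lookup[x[hit]])
    x
}

both <- c("Country", "Country_Origin")
toy[both] <- lapply(toy[both], recode.country)
toy
```

Names that are not in the lookup pass through unchanged, so the same helper can be reused on any column of country labels.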

Some Descriptives and North Korea

Just to get an idea of the data, we can create a list of the countries and countries of origin by using the code similar to identifying countries with certain MRSA strains in my previous post.

clist<-sort(unique(ref.d$Country)) #alphabetical
clist
or.clist<-sort(unique(ref.d$Country_Origin)) #alphabetical
or.clist

We can then compare them for any differences either using matching operators or the setdiff() function. First we’ll do this…

clist[!clist %in% or.clist] # or
setdiff(clist,or.clist)
[1] "Bonaire"                   "Montserrat"
[3] "Sint Maarten (Dutch part)" "State of Palestine"

… we can infer that these countries haven’t produced refugees or there is no data on these countries in the UNHCR database. If we reverse the comparison…

or.clist[!or.clist %in% clist] # or
setdiff(or.clist,clist)
 [1] "Andorra"                     "Anguilla"
 [3] "Bermuda"                     "Cook Islands"
 [5] "Dem. People's Rep. of Korea" "Dominica"
 [7] "French Polynesia"            "Gibraltar"
 [9] "Guadeloupe"                  "Holy See (the)"
[11] "Kiribati"                    "Maldives"
[13] "Marshall Islands"            "Martinique"
[15] "New Caledonia"               "Niue"
[17] "Norfolk Island"              "Palestinian"
[19] "Puerto Rico"                 "Samoa"
[21] "San Marino"                  "Sao Tome and Principe"
[23] "Seychelles"                  "Stateless"
[25] "Tibetan"                     "Tuvalu"
[27] "Wallis and Futuna Islands "  "Western Sahara"
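One thing worth checking in output like this is stray whitespace: entry 27 above is printed with a trailing space. The sketch below uses invented vectors to show how a trailing space alone can make a name appear in setdiff(), and how trimws() removes that effect. Whether this actually applies to the UNHCR lists is an assumption I have not verified.

```r
hosts   <- c("Canada", "Wallis and Futuna Islands")
origins <- c("Canada", "Wallis and Futuna Islands ", "Tuvalu")   # note the trailing space

diff.raw  <- setdiff(origins, hosts)           # the trailing space counts as a difference
diff.trim <- setdiff(trimws(origins), hosts)   # only the genuine difference remains

diff.raw
diff.trim
```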

… we get a list of countries that have only produced refugees (not taken any refugees in) or for which there is missing data in the UNHCR database. Being Korean-Canadian myself, I noticed North Korea (the Democratic People’s Republic of Korea) on this list. I wanted to ask the question: which countries have the largest number of North Korean refugees based on the UNHCR data? What are the top 10?

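# NK is not defined in the post; assumed here to be the rows whose country of origin is North Korea
NK <- ref.d[ref.d$Country_Origin=="Dem. People's Rep. of Korea",]
NK.tot <- aggregate(cbind(Total)~Country,data=NK,FUN=sum)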
NK.tot[order(-NK.tot[,2]),][1:10,]
                    Country Total
31           United Kingdom  4808
5                    Canada  2954
10                  Germany  2845
18              Netherlands   487
3                   Belgium   435
23       Russian Federation   357
32 United States of America   346
1                 Australia   318
20                   Norway   262
9                    France   228
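The subset, aggregate and sort pattern behind this table can be sketched on a small invented data frame. The column names match the renamed dataset, but the values and the origin label "X" are made up.

```r
toy <- data.frame(
    Country = c("A", "A", "B", "C", "C"),
    Country_Origin = c("X", "X", "X", "Y", "X"),
    Total = c(5, 7, 20, 100, NA),
    stringsAsFactors = FALSE)

sub.x <- toy[toy$Country_Origin == "X", ]                      # keep one origin only
tot.x <- aggregate(Total ~ Country, data = sub.x, FUN = sum)   # formula interface drops NA rows
top.x <- head(tot.x[order(-tot.x$Total), ], 2)                 # largest totals first, keep top 2
top.x
```

Note that the sum runs over every row for a country, so if the subset spans several years the yearly counts are added together.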

We find that the UK has the highest number of North Korean refugees, followed by Canada and Germany with similar counts. Learned something new today.

Word Clouds in R

If you haven’t guessed based on the packages I loaded in the beginning, a word cloud was inevitable. Making word clouds in R is

Eugene Joh


A detail-oriented MS and MPH graduate with a strong quantitative background, motivated to use his current and developing skill set to improve the health of vulnerable populations in local and global settings.
