# PI and Simulation Art in R

Posted by Steve Miller, February 25, 2020

I spent the better part of an afternoon last week perusing a set of old flash drives I’d made years ago for my monthly notebook backups. One that especially caught my attention had a folder of R scripts, probably at least 15 years old — harking back to my earliest days with R. I could only smile at some of the inefficient scripts I wrote then, reflecting an early, awkward attempt to switch gears from SAS to R.

The script I reviewed, in particular, had to do with Monte Carlo estimation of pi, as in pi*(r**2), for the area of a circle. Estimating pi via random sampling is quite straightforward and generally a first assignment in an intro numerical/statistical computation course.

The old code actually worked fine but was far from R's vectorized/functional ideal, gnarled as it was with procedurally oriented nested loops and lists. And its limit of 2,500,000 iterations reflected processor performance at the time. So I decided to modernize the code a bit, adding a visualization that shows pi derived from the ratio of an embedded circle to its enclosing square.

The point of departure for this exercise is a circle of diameter 2 centered at coordinates (1,1), embedded within a square of side length 2. A uniform random sample of x's and y's <= 2 is generated, each point's distance from the circle center is calculated, and a determination is made of whether each (x,y) point falls inside or outside the circle. The ratio of "in" points to total points estimates the ratio of the area of the circle to the area of the square. And since the area of the square is 4, the area of the circle is estimated by 4*(in/total). Moreover, since the radius of the circle is 1, 4*(in/total) estimates pi as well. Pretty nifty.
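The whole derivation above fits in a few vectorized lines. A minimal sketch (variable names are mine, not the original script's):

```r
set.seed(1)
n <- 1e6

# uniform points in the 2x2 square centered at (1,1)
x <- runif(n, max = 2)
y <- runif(n, max = 2)

# a point is inside the circle iff its squared distance from (1,1) is <= r^2 = 1
inside <- (x - 1)^2 + (y - 1)^2 <= 1

# area of the square (4) times the in/total ratio estimates pi, since r = 1
piest <- 4 * mean(inside)
piest
```

With a million points, the estimate typically lands within a couple of thousandths of pi.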

What follows is the MC script code, partitioned into Jupyter Notebook cells. The technology used is Wintel 10 with 128 GB RAM, along with JupyterLab 1.2.4 and R 3.6.2. The R data.table, ggplot2, and knitr packages are featured.

Set options, import packages, and load a personal library. The private functions used are blanks, freqsdt, meta, mykab, and obj_sz. freqsdt is a general-purpose, multi-attribute frequencies function for data.tables. meta displays metadata and individual records from data.tables. mykab is a printing function that uses knitr kable, and obj_sz returns the size of R objects.

In [1]:

options(warn=-1)
options(scipen=10)
options(datatable.print.topn=100)
options(datatable.showProgress=FALSE)
options(stringsAsFactors=TRUE)

usualsuspects <- c(
    'tidyverse', 'data.table', 'pryr', 'rvest', 'magrittr', 'lubridate',
    'fst', 'feather', 'knitr', 'kableExtra', 'ggplot2', 'RColorBrewer'
)
suppressMessages(invisible(lapply(usualsuspects, library, character.only = TRUE)))

funcsdir <- "/steve/r/functions"
funcsfile <- "rfunctions.r"
setwd(funcsdir)
source(funcsfile)

lsf.str()
blanks(2)

allfreqs : function (dtn, catlim = 100)

blanks : function (howmany)

freqsdt : function (DTstr, xstr)

freqsonly : function (DTstr, xstr)

meta : function (df, data = FALSE, dict = TRUE)

mykab : function (dt)

obj_sz : function (obj)
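The helper bodies themselves aren't shown. For readers without the private rfunctions.r, minimal stand-ins for the two helpers the later cells actually call (blanks and mykab) are easy to improvise; these bodies are my guesses at the behavior described above, not the author's code:

```r
library(knitr)

# blanks: print the requested number of empty lines
blanks <- function(howmany) cat(rep("\n", howmany), sep = "")

# mykab: render a data.table/data.frame as a table via knitr::kable
mykab <- function(dt) print(kable(dt, align = "c"))
```

Any replacement with the same signatures will do; the simulation logic doesn't depend on them.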

First up, an updated procedural code estimate of pi for 8 different simulations to assess how the estimates vary with sample size. A data.table with columns denoting the pi estimate and sample size is output.

In [2]:

set.seed(531)
HOWMANY <- c(500, 2500, 12500, 62500, 312500, 1562500, 7812500, 39062500)
pisim <- data.table(howmany=HOWMANY, piest=NULL)

for (i in 1:length(HOWMANY)) {
    h <- HOWMANY[i]
    x <- runif(h, max=2)
    y <- runif(h, max=2)
    d <- ((x-1)**2 + (y-1)**2)**.5
    inout <- factor(ifelse(d<=1, "in", "out"))
    pisim[i, piest := 4*sum(inout=='in')/length(inout)]
}

mykab(pisim)
blanks(2)

| howmany  |  piest   |
|:--------:|:--------:|
|      500 | 3.128000 |
|     2500 | 3.158400 |
|    12500 | 3.134400 |
|    62500 | 3.152512 |
|   312500 | 3.136166 |
|  1562500 | 3.140460 |
|  7812500 | 3.142583 |
| 39062500 | 3.141737 |

The same calculation, this time using a more current functional approach. The results are, fortunately, identical.

In [3]:

set.seed(531)
HOWMANY <- c(500, 2500, 12500, 62500, 312500, 1562500, 7812500, 39062500)

mksim <- function(h) {
    x <- runif(h, max=2)
    y <- runif(h, max=2)
    d <- ((x-1)**2 + (y-1)**2)**.5
    inout <- factor(ifelse(d<=1, "in", "out"))
    data.table(howmany=h, piest=4*sum(inout=='in')/length(inout))
}

pisim <- rbindlist(lapply(HOWMANY, mksim))
mykab(pisim)
blanks(2)

| howmany  |  piest   |
|:--------:|:--------:|
|      500 | 3.128000 |
|     2500 | 3.158400 |
|    12500 | 3.134400 |
|    62500 | 3.152512 |
|   312500 | 3.136166 |
|  1562500 | 3.140460 |
|  7812500 | 3.142583 |
| 39062500 | 3.141737 |
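As an aside, the ifelse/factor detour in both versions can be dropped entirely: a logical vector suffices, and comparing the squared distance to r^2 = 1 avoids the square root. A sketch of that leaner variant (shortened here to the first five sample sizes):

```r
library(data.table)

set.seed(531)
HOWMANY <- c(500, 2500, 12500, 62500, 312500)

mksim2 <- function(h) {
    x <- runif(h, max = 2)
    y <- runif(h, max = 2)
    # inside iff squared distance from (1,1) is <= r^2 = 1; no sqrt, no factor
    data.table(howmany = h, piest = 4 * mean((x - 1)^2 + (y - 1)^2 <= 1))
}

pisim2 <- rbindlist(lapply(HOWMANY, mksim2))
pisim2
```

Since sqrt is monotone, d <= 1 and d^2 <= 1 select the same points, so with the same seed this should reproduce the corresponding rows of the table above.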

Graph the pi estimates above as a function of sample size. Note the convergence to the R constant pi.
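The convergence rate is predictable: each point is a Bernoulli draw with success probability p = pi/4, so the estimator 4*(in/n) has standard error 4*sqrt(p*(1-p)/n), shrinking like 1/sqrt(n). A quick check of the expected error at each sample size:

```r
p <- pi / 4   # probability a uniform point in the square lands in the circle
HOWMANY <- c(500, 2500, 12500, 62500, 312500, 1562500, 7812500, 39062500)

# standard error of the estimator 4 * (in / n)
se <- 4 * sqrt(p * (1 - p) / HOWMANY)
round(se, 5)
```

So the n = 500 estimate should typically land within about 0.073 of pi, and the largest run within about 0.00026, consistent with the table above.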

In [4]:

options(repr.plot.width=10, repr.plot.height=10)
bpal <- brewer.pal(9, "Blues")
gpal <- brewer.pal(9, "Greens")

g <- ggplot(pisim, aes(x=howmany, y=piest)) +
    geom_point(size=5) +
    geom_line() +
    theme(plot.background = element_rect(fill = bpal[2]),
          panel.background = element_rect(fill = bpal[2])) +
    geom_hline(aes(yintercept=pi), na.rm = FALSE, show.legend = NA,
               col="black", size=.3, linetype=2) +
    theme(axis.text = element_text(size=15)) +
    theme(legend.position="none") +
    ylim(3.1, 3.2) +
    scale_x_log10(breaks=pisim$howmany)
print(g)
blanks(2)

A later cell builds the simpoints data.table (numeric x, y, distance1_1, and a factor inout, per its surviving str() output) and renders the circle-in-square visualization:

In [5]:

tit <- "Area of Square and Circle"
subtit <- paste("Simulation pi: ", round(pisim, 6), "\nActual pi: ", round(pi, 6), sep="")

g <- ggplot(simpoints, aes(x=x, y=y, col=inout)) +
    geom_point(size=.5) +
    theme(plot.background = element_rect(fill = bpal[2]),
          panel.background = element_rect(fill = bpal[2])) +
    theme(legend.position="none") +
    ylim(-1, 3) +
    xlim(-1, 3) +
    labs(title=tit, subtitle=subtit, y="Height\n", x="\nLength") +
    theme(plot.title = element_text(size=22, face="bold")) +
    theme(plot.subtitle = element_text(size=15, face="bold")) +
    theme(axis.text = element_text(size=15)) +
    scale_color_manual(values = myColors) +
    theme(text = element_text(size=rel(4)))
print(g)

end <- proc.time()
print(end-start)
blanks(2)

user system elapsed

10.19 20.08 30.27
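The construction of simpoints does not survive cleanly above; only its str() output (numeric x, y, distance1_1, and a factor inout) and the plotting calls do. A minimal sketch of what the build step must contain, where the sample size, the start timer, and the myColors palette values are my stand-ins rather than the original's:

```r
library(data.table)

start <- proc.time()   # stand-in for the timer the cell's output reports
set.seed(531)
n <- 100000            # assumed sample size for the visualization

simpoints <- data.table(x = runif(n, max = 2), y = runif(n, max = 2))

# distance of each point from the circle's center at (1,1)
simpoints[, distance1_1 := sqrt((x - 1)^2 + (y - 1)^2)]
simpoints[, inout := factor(ifelse(distance1_1 <= 1, "in", "out"))]

# one color per inout level (stand-in for the original myColors)
myColors <- c("steelblue", "grey60")
```

Feeding this simpoints into the ggplot call above reproduces the circle-in-square scatter, with "in" and "out" points colored separately.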

That’s it for now. More R/Python-Pandas next time.
