Thursday, July 9, 2015

label outlier in ggplot2 boxplot

  • function to add labels to outliers in a ggplot2 boxplot
  • the function add.outlier() takes a ggplot boxplot object as input
  • the second, optional input is a string with the name of the variable that holds the labels; by default the y value itself is used as the label
  • the function expects a unique mapping to x and y, where x is a factor variable
  • the data frame given to the ggplot object must contain the x, y, and labelling variables
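  • a value is flagged as an outlier if it lies at or below Q1 - 1.5*IQR or at or above Q3 + 1.5*IQR of its group, i.e. outside the usual boxplot whiskers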
require(ggplot2)

mtcars$cyl <- factor(mtcars$cyl)
mtcars$labels <- row.names(mtcars)

p <- ggplot(mtcars,aes(x=cyl,colour=cyl,y=qsec)) +
    geom_boxplot()



add.outlier <- function(p,labvar = as.character(p$mapping$y)){
      ## extract the y and x values from the plot's data via its aesthetic mapping
      df <- data.frame(y = with(p$data,eval(p$mapping$y)),
                       x = with(p$data,eval(p$mapping$x)))

      ## split the data by the x factor, i.e. one data frame per box
      df.l <- split(df,df$x)
      
      mm <- Reduce(rbind, lapply(df.l,FUN = function(df){
                                     ## keep only the values outside the whiskers (1.5 * IQR rule)
                                     out <- df$y <= (quantile(df$y)[2] - 1.5 * IQR(df$y)) |
                                         df$y >= (quantile(df$y)[4] + 1.5 * IQR(df$y))
                                     data.frame(y = df$y[out],
                                                x = df$x[out])})
                   )
  
      
      mm$x <- factor(mm$x,levels=sort(as.numeric(as.character(unique(p$data[,as.character(p$mapping$x)])))),
                     labels = levels(p$data[,as.character(p$mapping$x)])
                     )
      
      names(mm) <- c(as.character(p$mapping$y),as.character(p$mapping$x))
      mm <- merge(p$data[,c(names(mm),labvar)],mm)
      
      p + geom_text(data=mm,
                    aes_string(label=labvar),
                    vjust = -0.5)
}

add.outlier(p)





add.outlier(p,"labels")



Saturday, April 25, 2015

R - read excel files without java or perl dependencies

1 read Excel files without java or perl dependencies

With the new readxl package from Hadley Wickham it is finally possible to read excel files without java or perl dependencies, yeah… For now there are only two functions:

  • excel_sheets() to get all sheets contained in an excel file
  • read_excel() to read one sheet out of an excel file
  • get the filenames of excel files contained in the current directory

require(readxl)
dir(pattern = "xls")

[1] "20140927excelfile.xlsx" "example.xlsx"

  • get the names of the excel sheets contained in the excel file (only one in this case)
  • the function takes the file name or the whole path as its argument


excel_sheets("example.xlsx")

[1] "conspvalues"

  • read in the sheet, the function takes the path and the sheet as argument
  • the sheet can be given as position (first sheet = 1, second = 2, etc) or as name (the first sheet is the default)


xx <- read_excel("example.xlsx",1)
head(xx)

mpg cyl    disp  hp drat      wt  qsec vs am gear carb
1 21.0   6     160 110 3.90    2.62 16.46  0  1    4    4
2 21.0   6 missing 110 3.90   2.875 17.02  0  1    4    4
3 22.8   4     108  93 3.85    2.32 18.61  1  1    4    1
4 21.4   6     258 110 3.08 missing 19.44  1  0    3    1
5 18.7   8     360 175 3.15    3.44 17.02  0  0    3    2
6 18.1   6     225 105 2.76    3.46 20.22  1  0    3    1


xx <- read_excel("example.xlsx","conspvalues")
head(xx)

mpg cyl    disp  hp drat      wt  qsec vs am gear carb
1 21.0   6     160 110 3.90    2.62 16.46  0  1    4    4
2 21.0   6 missing 110 3.90   2.875 17.02  0  1    4    4
3 22.8   4     108  93 3.85    2.32 18.61  1  1    4    1
4 21.4   6     258 110 3.08 missing 19.44  1  0    3    1
5 18.7   8     360 175 3.15    3.44 17.02  0  0    3    2
6 18.1   6     225 105 2.76    3.46 20.22  1  0    3    1

  • there is an additional argument (na) to define which strings should be read as NA

xx <- read_excel("example.xlsx","conspvalues",na="missing")
head(xx)

mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 21.0   6   NA 110 3.90 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4 21.4   6  258 110 3.08    NA 19.44  1  0    3    1
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

  • with skip you can skip a given number of lines at the top of the sheet


xx <- read_excel("example.xlsx","conspvalues",skip=1)
head(xx)

21 6     160 110  3.9    2.62 16.46 0 1 4 4
1 21.0 6 missing 110 3.90   2.875 17.02 0 1 4 4
2 22.8 4     108  93 3.85    2.32 18.61 1 1 4 1
3 21.4 6     258 110 3.08 missing 19.44 1 0 3 1
4 18.7 8     360 175 3.15    3.44 17.02 0 0 3 2
5 18.1 6     225 105 2.76    3.46 20.22 1 0 3 1
6 14.3 8     360 245 3.21    3.57 15.84 0 0 3 4

  • and you can use col_names to tell the function whether or not the excel file contains the column names


xx <- read_excel("example.xlsx","conspvalues",skip=1, col_names=F)
head(xx)

X0 X1      X2  X3   X4      X5    X6 X7 X8 X9 X10
1 21.0  6     160 110 3.90    2.62 16.46  0  1  4   4
2 21.0  6 missing 110 3.90   2.875 17.02  0  1  4   4
3 22.8  4     108  93 3.85    2.32 18.61  1  1  4   1
4 21.4  6     258 110 3.08 missing 19.44  1  0  3   1
5 18.7  8     360 175 3.15    3.44 17.02  0  0  3   2
6 18.1  6     225 105 2.76    3.46 20.22  1  0  3   1


  • using col_types you can define the types of your columns (instead of letting the function guess); attention: NA values may be produced where a cell cannot be coerced to the given type
  • you need one type for each column


xx <- read_excel("example.xlsx","conspvalues",skip=1, col_types = rep("numeric",11))
head(xx)

Warning messages:
1: In read_xlsx_(path, sheet, col_names = col_names, col_types = col_types,  :
  [3, 3]: expecting numeric: got 'missing'
2: In read_xlsx_(path, sheet, col_names = col_names, col_types = col_types,  :
  [5, 6]: expecting numeric: got 'missing'
    21 6 160 110  3.9  2.62 16.46 0 1 4 4
1 21.0 6  NA 110 3.90 2.875 17.02 0 1 4 4
2 22.8 4 108  93 3.85 2.320 18.61 1 1 4 1
3 21.4 6 258 110 3.08    NA 19.44 1 0 3 1
4 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
5 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
6 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4

Sunday, March 15, 2015

function which runs two sample t-test on grouped data

1 function which runs a two sample t-test on a grouped data frame

  • the function takes a data frame as its first argument
  • the second argument group is the column which should be used for splitting the data frame
  • col indicates the numeric column to pass through to the t.test() function
  • indcol should be a binary variable, which is also passed through to t.test()


df.t.test <- function(df,group,col,indcol){
    t.test.helper <- function(x,col,indcol,group){
        tob <- t.test(x[,col] ~ x[,indcol])
        tmp <- data.frame(data = paste(col,"by",indcol),
                          group = x[1,group],
                          mean.group.1 = tob$estimate[1],
                          mean.group.2 = tob$estimate[2],
                          name.test.stat = tob$statistic,
                          conf.lower = tob$conf.int[1],
                          conf.upper = tob$conf.int[2],
                          pval = tob$p.value,
                          alternative = tob$alternative,
                          tob$method)
        names(tmp)[3:4] <- make.names(names(tob$estimate))
        row.names(tmp) <- x[1,group]
        tmp
    }
    df.l <- split(df[,c(col,indcol,group)],df[,group])
    Reduce(rbind,lapply(df.l,t.test.helper,col=col,indcol=indcol,group=group))}


## example data
examp.data <- data.frame(group=gl(10,100),
                         values=rnorm(1000),
                         t.group=sample(letters[1:2],1000,replace=T))
## example
df.t.test(examp.data,"group","values","t.group")

data group mean.in.group.a mean.in.group.b name.test.stat
1  values by t.group     1      0.06958824      0.02803721      0.2244456
2  values by t.group     2     -0.20944827     -0.06033410     -0.8368429
3  values by t.group     3     -0.20387479      0.07940850     -1.3245172
4  values by t.group     4      0.11406709      0.01975937      0.4220244
5  values by t.group     5      0.09060241     -0.10442099      1.0620544
6  values by t.group     6     -0.05623630     -0.07537593      0.1056388
7  values by t.group     7     -0.26081841     -0.02721652     -0.9533887
8  values by t.group     8     -0.04723535     -0.17205804      0.5662930
9  values by t.group     9      0.08185406      0.01676488      0.3033993
10 values by t.group    10     -0.41406196     -0.02193303     -2.1353113
   conf.lower  conf.upper       pval alternative              tob.method
1  -0.3263444  0.40944644 0.82293023   two.sided Welch Two Sample t-test
2  -0.5030591  0.20483073 0.40487276   two.sided Welch Two Sample t-test
3  -0.7083319  0.14176535 0.18876884   two.sided Welch Two Sample t-test
4  -0.3491642  0.53777963 0.67393371   two.sided Welch Two Sample t-test
5  -0.1693819  0.55942873 0.29082158   two.sided Welch Two Sample t-test
6  -0.3404432  0.37872241 0.91608663   two.sided Welch Two Sample t-test
7  -0.7200269  0.25282312 0.34281018   two.sided Welch Two Sample t-test
8  -0.3126875  0.56233286 0.57251104   two.sided Welch Two Sample t-test
9  -0.3606510  0.49082936 0.76222949   two.sided Welch Two Sample t-test
10 -0.7566487 -0.02760917 0.03527904   two.sided Welch Two Sample t-test

Monday, November 3, 2014

Graphics for Statistics - figures with ggplot2 - chapter 5 quantile-quantile plots

1 Keen Chapter 5

1.1 Figure 5.1 normal quantile-quantile plot

  • first set up the data vector
  • then we use a point layer with the quantile-quantile stat and change the size of the points: geom_point(stat="qq",shape=1,size=3.5)

require(ggplot2)
require(grid)
mass<-c(5.9,32.0,40.0,51.5,70.0,100.0,78.0,80.0,85.0,85.0,
              110.0,115.0,125.0,130.0,120.0,120.0,130.0,135.0,110.0,130.0,
              150.0,145.0,150.0,170.0,225.0,145.0,188.0,180.0,197.0,218.0,
              300.0,260.0,265.0,250.0,250.0,300.0,320.0,514.0,556.0,840.0,
              685.0,700.0,700.0,690.0,900.0,650.0,820.0,850.0,900.0,1015.0,
              820.0,1100.0,1000.0,1100.0,1000.0,1000.0)
df <- data.frame(y=mass)

ggplot(df,aes(sample=mass)) +
    geom_point(stat="qq",shape=1,size=3.5) +
    scale_x_continuous(limits=c(-3,3),breaks=-3:3,expand=c(0,0.1)) +
    scale_y_continuous(limits=c(0,1200),breaks=seq(0,1200,by=200),expand=c(0,20)) +
    xlab("Standard Normal Quantiles") +
    ylab("Mass (g)") +
          theme(
          panel.background=element_blank(),
          axis.line=element_line(colour="black"),
          axis.text=element_text(colour="black",size=14),
          axis.title=element_text(colour="black",size=14),
          axis.ticks=element_line(colour="black"),
          axis.ticks.length=unit(3,"mm")
          )


1.2 Figure 5.2 normal quantile-quantile plot

  • simply add coord_flip() to the previous plot (the breaks and expansion of the x-axis are also adjusted slightly)


ggplot(df,aes(sample=mass)) +
    geom_point(stat="qq",shape=1,size=3.5) +
    scale_x_continuous(limits=c(-3,3),breaks=-2:2,expand=c(0,0.5)) +
    scale_y_continuous(limits=c(0,1200),breaks=seq(0,1200,by=200),expand=c(0,20)) +
    xlab("Standard Normal Quantiles") +
    ylab("Mass (g)") +
    coord_flip() +
          theme(
          panel.background=element_blank(),
          axis.line=element_line(colour="black"),
          axis.text=element_text(colour="black",size=14),
          axis.title=element_text(colour="black",size=14),
          axis.ticks=element_line(colour="black"),
          axis.ticks.length=unit(3,"mm")
          )


1.3 Figure 5.3 normal quantile-quantile plot

  • only add geom_abline() with intercept=mean(mass) and slope=sd(mass): the line goes through the point defined by the means of the x and y values, which is (mean(standard normal dist)=0, mean(mass)), therefore mean(mass) is the intercept; the sd of the standard normal is one, so the slope is sd(mass)/sd(standard normal dist)=sd(mass)
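  • in other words: if mass were exactly normally distributed, the plotted points would lie on the line mass = mean(mass) + sd(mass) * z, where z is the standard normal quantile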


ggplot(df,aes(sample=mass)) +
    geom_point(stat="qq",shape=1,size=3.5) +
    geom_abline(intercept=mean(mass),slope=sd(mass)) + 
    scale_x_continuous(limits=c(-3,3),breaks=-3:3,expand=c(0,0.1)) +
    scale_y_continuous(limits=c(0,1200),breaks=seq(0,1200,by=200),expand=c(0,20)) +
    xlab("Standard Normal Quantiles") +
    ylab("Mass (g)") +
          theme(
          panel.background=element_blank(),
          axis.line=element_line(colour="black"),
          axis.text=element_text(colour="black",size=14),
          axis.title=element_text(colour="black",size=14),
          axis.ticks=element_line(colour="black"),
          axis.ticks.length=unit(3,"mm")
          )


1.4 Figure 5.4 gamma quantile-quantile plot

  • first we calculate the estimates of the scale and shape parameters of the gamma distribution
  • then we only change the distribution to qgamma (distribution="qgamma") and set the parameters as a list (dparams=list(shape=sshape,scale=sscale))
  • last we adjust the x-axis (limits and breaks)
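  • these are the method-of-moments estimates: a gamma distribution with shape k and scale s has mean k*s and variance k*s^2, so s = var(mass)/mean(mass) and k = mean(mass)^2/var(mass) = (mean(mass)/sd(mass))^2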

sshape<-(mean(mass)/sd(mass))**2
sscale<-var(mass)/mean(mass)

ggplot(df,aes(sample=mass)) +
    geom_point(stat="qq",distribution="qgamma",dparams=list(shape=sshape,scale=sscale),shape=1,size=3.5) +
    scale_x_continuous(limits=c(0,1500),breaks=seq(0,1500,by=500),expand=c(0,20)) +
    scale_y_continuous(limits=c(0,1200),breaks=seq(0,1200,by=200),expand=c(0,20)) +
    xlab("Gamma Quantiles") +
    ylab("Mass (g)") +
          theme(
          panel.background=element_blank(),
          axis.line=element_line(colour="black"),
          axis.text=element_text(colour="black",size=14),
          axis.title=element_text(colour="black",size=14),
          axis.ticks=element_line(colour="black"),
          axis.ticks.length=unit(3,"mm")
          )


1.5 Figure 5.5 normal quantile-quantile plot

  • we only have to change the x-axis of plot 5.1
  • so set the limits to c(-4,4) and the breaks to the required x-values by using the qnorm() function and change the labels appropriately
  • finally change the title of the x-axis

ggplot(df,aes(sample=mass)) +
    geom_point(stat="qq",shape=1,size=3.5) +
    scale_x_continuous(limits=c(-4,4),
                       breaks=qnorm(c(0.0001,.01,.05,.25,.50,.75,.95,.99,.9999)),
                       labels=c(0.01,1,5,25,50,75,95,99,99.99),
                       expand=c(0,0.1)) +
    scale_y_continuous(limits=c(0,1200),breaks=seq(0,1200,by=200),expand=c(0,20)) +
    xlab("Normal Percentiles") +
    ylab("Mass (g)") +
          theme(
          panel.background=element_blank(),
          axis.line=element_line(colour="black"),
          axis.text=element_text(colour="black",size=14),
          axis.title=element_text(colour="black",size=14),
          axis.ticks=element_line(colour="black"),
          axis.ticks.length=unit(3,"mm")
          )


1.6 Figure 5.6 normal quantile-quantile plot

  • it is the same as plot 5.5 - only add coord_flip()

ggplot(df,aes(sample=mass)) +
    geom_point(stat="qq",shape=1,size=3.5) +
    scale_x_continuous(limits=c(-4,4),
                       breaks=qnorm(c(0.0001,.01,.05,.25,.50,.75,.95,.99,.9999)),
                       labels=c(0.01,1,5,25,50,75,95,99,99.99),
                       expand=c(0,0.1)) +
    scale_y_continuous(limits=c(0,1200),breaks=seq(0,1200,by=200),expand=c(0,20)) +
    xlab("Normal Percentiles") +
    ylab("Mass (g)") +
    coord_flip() +
          theme(
          panel.background=element_blank(),
          axis.line=element_line(colour="black"),
          axis.text=element_text(colour="black",size=14),
          axis.title=element_text(colour="black",size=14),
          axis.ticks=element_line(colour="black"),
          axis.ticks.length=unit(3,"mm")
          )


1.7 Figure 5.7 gamma quantile-quantile plot

  • from figure 5.4 we have to change the axis and flip the coordinates:
  • to calculate the breaks we use qgamma() with the appropriate values and the shape and scale parameter
  • then we change the labels accordingly
  • add coord_flip()

sshape<-(mean(mass)/sd(mass))**2
sscale<-var(mass)/mean(mass)

ggplot(df,aes(sample=mass)) +
    geom_point(stat="qq",distribution="qgamma",dparams=list(shape=sshape,scale=sscale),shape=1,size=3.5) +
    scale_x_continuous(limits=c(0,2400),
                       breaks=qgamma(c(0.01,.25,.50,.75,.95,.99,.999),shape=sshape,scale=sscale),
                       labels=c(1,25,50,75,95,99,99.9),expand=c(0,20)) +
    scale_y_continuous(limits=c(0,1200),breaks=seq(0,1200,by=200),expand=c(0,20)) +
    xlab("Gamma Quantiles") +
    ylab("Mass (g)") +
    coord_flip() +
          theme(
          panel.background=element_blank(),
          axis.line=element_line(colour="black"),
          axis.text=element_text(colour="black",size=14),
          axis.title=element_text(colour="black",size=14),
          axis.ticks=element_line(colour="black"),
          axis.ticks.length=unit(3,"mm")
          )


Sunday, November 2, 2014

Graphics for Statistics - figures with ggplot2 - chapter 4 ecdf plots

1 Keen Chapter 4

Graphics out of the book Graphics for Statistics and Data Analysis with R by Kevin Keen (book home page)

1.1 Figure 4.17 EDF plot

  • first set up the data frame:
    • we use the ecdf() function to create a step function ecdfmass()
    • from this function we can extract the knots (which will be mapped to the x-axis)
    • using these knots as arguments in ecdfmass() we get the corresponding probabilities (which we will map to the y-axis)
    • the end column in df contains the end points of the horizontal lines in the step function - it is simply the knots vector shifted by one element, with the last element set to NA

mass<-c(5.9,32.0,40.0,51.5,70.0,100.0,78.0,80.0,85.0,85.0,
        110.0,115.0,125.0,130.0,120.0,120.0,130.0,135.0,110.0,130.0,
        150.0,145.0,150.0,170.0,225.0,145.0,188.0,180.0,197.0,218.0,
        300.0,260.0,265.0,250.0,250.0,300.0,320.0,514.0,556.0,840.0,
        685.0,700.0,700.0,690.0,900.0,650.0,820.0,850.0,900.0,1015.0,
        820.0,1100.0,1000.0,1100.0,1000.0,1000.0)


ecdfmass <- ecdf(mass)
kn <- knots(ecdfmass)
ed <- ecdfmass(kn)

df <- data.frame(knots=kn,ed=ed,end=c(kn[-1],NA))
head(df)

  knots         ed  end
1   5.9 0.01785714 32.0
2  32.0 0.03571429 40.0
3  40.0 0.05357143 51.5
4  51.5 0.07142857 70.0
5  70.0 0.08928571 78.0
6  78.0 0.10714286 80.0

  • now we first set the aesthetics for the points: x to knots and y to ed
  • add the point layer and set the point size to 3: geom_point(size=3)
  • add the lines using a segment layer, setting the aesthetics xend and yend: geom_segment(aes(xend=end,yend=ed))
  • in the next step we add two additional segment layers, one for each arrow; you can also use annotate() to do this; inside these segments we use the arrow() function from the grid package, so we can define the appearance of our arrows
  • the next two lines change the appearance of the axes: setting the limits, the breaks and the expansion
  • then set the appropriate axes titles and customize axis elements and panel.background



require(grid) ## for the arrow() function
ggplot(df,aes(x=knots,y=ed)) +
    geom_point(size=3) +
    geom_segment(aes(xend=end,yend=ed)) +
    geom_segment(x=min(df$knots),xend=-45,y=0,yend=0,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_segment(x=max(df$knots),xend=1245,y=1,yend=1,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    scale_x_continuous(limits=c(-45,1245),breaks=seq(0,1200,by=200),expand=c(0,0)) +
    scale_y_continuous(limits=c(-0.01,1.05),breaks=seq(0,1,by=0.2),expand=c(0,0)) +
    xlab("Mass (g)") +
    ylab("Empirical Distribution Function") +
    theme(
        panel.background=element_blank(),
        axis.line=element_line(colour="black"),
        axis.text=element_text(colour="black",size=14),
        axis.title=element_text(colour="black",size=14),
        axis.ticks=element_line(colour="black"),
        axis.ticks.length=unit(3,"mm")
        )


1.2 Figure 4.18 EDF plot

  • we only have to change the axis breaks of the y-axis and add the horizontal lines
  • the first is done by changing the by argument of seq() to 0.25 inside scale_y_continuous()
  • then we add a hline layer


ggplot(df,aes(x=knots,y=ed)) +
    geom_point(size=3) +
    geom_segment(aes(xend=end,yend=ed)) +
    geom_segment(x=min(df$knots),xend=-45,y=0,yend=0,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_segment(x=max(df$knots),xend=1245,y=1,yend=1,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_hline(yintercept=c(0.25,0.5,0.75),linetype=2) +
    scale_x_continuous(limits=c(-45,1245),breaks=seq(0,1200,by=200),expand=c(0,0)) +
    scale_y_continuous(limits=c(-0.01,1.05),breaks=seq(0,1,by=0.25),expand=c(0,0)) +
    xlab("Mass (g)") +
    ylab("Empirical Distribution Function") +
    theme(
        panel.background=element_blank(),
        axis.line=element_line(colour="black"),
        axis.text=element_text(colour="black",size=14),
        axis.title=element_text(colour="black",size=14),
        axis.ticks=element_line(colour="black"),
        axis.ticks.length=unit(3,"mm")
        )



  • we can also use the panel grid lines (but then we also get lines at 0 and 1)


ggplot(df,aes(x=knots,y=ed)) +
    geom_point(size=3) +
    geom_segment(aes(xend=end,yend=ed)) +
    geom_segment(x=min(df$knots),xend=-45,y=0,yend=0,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_segment(x=max(df$knots),xend=1245,y=1,yend=1,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    scale_x_continuous(limits=c(-45,1245),breaks=seq(0,1200,by=200),expand=c(0,0)) +
    scale_y_continuous(limits=c(-0.01,1.05),breaks=seq(0,1,by=0.25),expand=c(0,0)) +
    xlab("Mass (g)") +
    ylab("Empirical Distribution Function") +
    theme(
        panel.background=element_blank(),
        panel.grid.major.x=element_blank(),
        panel.grid.major.y=element_line(linetype = 2,colour="grey50"),
        axis.line=element_line(colour="black"),
        axis.text=element_text(colour="black",size=14),
        axis.title=element_text(colour="black",size=14),
        axis.ticks=element_line(colour="black"),
        axis.ticks.length=unit(3,"mm")
        )


1.3 Figure 4.19 EDF plot

  • replace geom_point() by geom_step()
  • get rid of the horizontal lines
  • add another little segment which connects the left arrow with the step function: geom_segment(x=min(df$knots),xend=min(df$knots),y=0,yend=min(df$ed),size=0.4)
  • leave everything as it is


ggplot(df,aes(x=knots,y=ed)) +
    geom_step(direction = "hv") +
    geom_segment(x=min(df$knots),xend=-45,y=0,yend=0,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_segment(x=max(df$knots),xend=1245,y=1,yend=1,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_segment(x=min(df$knots),xend=min(df$knots),y=0,yend=min(df$ed),size=0.4) +
    scale_x_continuous(limits=c(-45,1245),breaks=seq(0,1200,by=200),expand=c(0,0)) +
    scale_y_continuous(limits=c(-0.01,1.05),breaks=seq(0,1,by=0.2),expand=c(0,0)) +
    xlab("Mass (g)") +
    ylab("Empirical Distribution Function") +
    theme(
        panel.background=element_blank(),
        axis.line=element_line(colour="black"),
        axis.text=element_text(colour="black",size=14),
        axis.title=element_text(colour="black",size=14),
        axis.ticks=element_line(colour="black"),
        axis.ticks.length=unit(3,"mm")
        )



  • instead of using the data frame created above you can use the original data (mass) and set stat to ecdf
  • but for the arrows you need to calculate the values anyway, which is why I use the data frame above


df3 <- data.frame(mass=mass)
ggplot(df,aes(x=knots,y=ed)) +
    geom_step(inherit.aes=F,stat="ecdf",data=df3,aes(x=mass)) +
    geom_segment(x=min(df$knots),xend=-45,y=0,yend=0,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_segment(x=max(df$knots),xend=1245,y=1,yend=1,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_segment(x=min(df$knots),xend=min(df$knots),y=0,yend=min(df$ed),size=0.4) +
    scale_x_continuous(limits=c(-45,1245),breaks=seq(0,1200,by=200),expand=c(0,0)) +
    scale_y_continuous(limits=c(-0.01,1.05),breaks=seq(0,1,by=0.2),expand=c(0,0)) +
    xlab("Mass (g)") +
    ylab("Empirical Distribution Function") +
    theme(
        panel.background=element_blank(),
        axis.line=element_line(colour="black"),
        axis.text=element_text(colour="black",size=14),
        axis.title=element_text(colour="black",size=14),
        axis.ticks=element_line(colour="black"),
        axis.ticks.length=unit(3,"mm")
        )



1.4 Figure 4.20 EDF plot

  • the same as the last plot, only with horizontal dashed lines added at the quartiles (and the y-axis breaks set to steps of 0.25)


ggplot(df,aes(x=knots,y=ed)) +
    geom_step(direction = "hv") +
    geom_segment(x=min(df$knots),xend=-45,y=0,yend=0,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_segment(x=max(df$knots),xend=1245,y=1,yend=1,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_segment(x=min(df$knots),xend=min(df$knots),y=0,yend=min(df$ed),size=0.4) +
    geom_hline(yintercept=c(0.25,0.5,0.75),linetype=2) +
    scale_x_continuous(limits=c(-45,1245),breaks=seq(0,1200,by=200),expand=c(0,0)) +
    scale_y_continuous(limits=c(-0.01,1.05),breaks=seq(0,1,by=0.25),expand=c(0,0)) +
    xlab("Mass (g)") +
    ylab("Empirical Distribution Function") +
    theme(
        panel.background=element_blank(),
        axis.line=element_line(colour="black"),
        axis.text=element_text(colour="black",size=14),
        axis.title=element_text(colour="black",size=14),
        axis.ticks=element_line(colour="black"),
        axis.ticks.length=unit(3,"mm")
        )



1.5 Figure 4.20 EDF plot with cumulative normal distribution function added

  • first we need to create a second data frame containing the values defining the curve; we choose to use 10000 points on the x-axis and use pnorm() to calculate the respective y values (using the empirical mean and the empirical sd of the vector mass)
  • then we add the layer (geom_line())
  • change the limits of the x-axis and the breaks inside scale_x_continuous()
  • change the length of the right arrow (setting xend to 1500)


mean_mass<-mean(mass)
sd_mass<-sd(mass)
min_mass<-min(mass)
max_mass<-1500

xx <- seq(0,10000,1)*(max_mass-min_mass)/10000.+min_mass
yy <- pnorm(xx,mean_mass,sd_mass)

df2 <- data.frame(xx=xx,yy=yy)

ggplot(df,aes(x=knots,y=ed)) +
    geom_step(direction = "hv") +
    geom_segment(x=min(df$knots),xend=-45,y=0,yend=0,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_segment(x=max(df$knots),xend=1500,y=1,yend=1,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_segment(x=min(df$knots),xend=min(df$knots),y=0,yend=min(df$ed),size=0.4) +
    geom_line(data=df2,aes(x=xx,y=yy)) +
    scale_x_continuous(limits=c(-45,1500),breaks=seq(0,1500,by=500),expand=c(0,0)) +
    scale_y_continuous(limits=c(-0.01,1.05),breaks=seq(0,1,by=0.2),expand=c(0,0)) +
    xlab("Mass (g)") +
    ylab("Empirical Distribution Function") +
    theme(
        panel.background=element_blank(),
        axis.line=element_line(colour="black"),
        axis.text=element_text(colour="black",size=14),
        axis.title=element_text(colour="black",size=14),
        axis.ticks=element_line(colour="black"),
        axis.ticks.length=unit(3,"mm")
        )


Graphics for Statistics - figures with ggplot2 - Chapter 3 part 2 Bar charts with errorbars, dot-whisker charts

1 Keen Chapter 3 part 2

Graphics out of the book Graphics for Statistics and Data Analysis with R by Kevin Keen (book home page)

1.1 Figure 3.8 Bar-whisker chart

  • we will work out the graphic on flipped axes and flip them later
  • set the aesthetics:
    • x to the names of the allergens
    • y to the percentage (prevalence)
    • ymin to y minus the standard error and
    • ymax to y plus the standard error
  • using again geom_bar() with the stat="identity" option because we have already aggregated data (so the height of the bars is set by the y aesthetic)
  • set the filling and the colour of the edges (fill and colour), finally adjust the width (width) of the bars
  • there is no axis title on the axis with the names, so set xlab("") (remember we will flip the axis later)
  • set the title of the continuous axis to Percent and
  • set the limits of the axis to 0 and 50
  • set its expansion to c(0,0) - because we do not want to expand the axis; it should end exactly at 0 and 50
  • now we flip the axes
  • and set the appearance of the text, axis and background elements

require(ggplot2)

names<-c("Epidermals","Dust Mites","Weeds","Grasses","Molds","Trees")
prevs<-c(38.2,37.8,31.1,31.1,29.3,26.7)
se<-c(3.2,3.2,3.1,3.1,3.0,2.9)

df <- data.frame(item=factor(1:6,labels=names),prevs=prevs,se=se)

ggplot(df,aes(x=item,y=prevs,ymin=prevs-se,ymax=prevs+se)) +
    geom_bar(stat="identity",fill="transparent",colour="black",width=0.7) +
    geom_errorbar(width=0.3) +
    xlab("") +
    ylab("Percent") +
    scale_y_continuous(limits=c(0,50),expand=c(0,0)) +
    coord_flip() +
    theme(
        panel.background=element_blank(),
        axis.line=element_line(colour="black"),
        axis.line.y=element_blank(),
        axis.text=element_text(colour="black",size=14),
        axis.ticks.x=element_line(colour="black"),
        axis.ticks.y=element_blank()
        )


1.2 Figure 3.9 Bar and single whisker chart

  • the same as above, only change the filling of the bars to black (you do not need the colour argument any more) and the width of the error bars to 0

ggplot(df,aes(x=item,y=prevs,ymin=prevs-se,ymax=prevs+se)) +
    geom_bar(stat="identity",fill="black",width=0.7) +
    geom_errorbar(width=0) +
    xlab("") +
    ylab("Percent") +
    scale_y_continuous(limits=c(0,50),expand=c(0,0)) +
    coord_flip() +
    theme(
        panel.background=element_blank(),
        axis.line=element_line(colour="black"),
        axis.line.y=element_blank(),
        axis.text=element_text(colour="black",size=14),
        axis.ticks.x=element_line(colour="black"),
        axis.ticks.y=element_blank()
        )


1.3 Figure 3.10 Dot-whisker chart

  • for the dot-whisker chart we replace geom_bar() by geom_point()
  • in geom_point() we set the point size to four and the colour to black
  • in geom_errorbar() we set the width to 0.25
  • add geom_vline() for the dotted lines and set the aesthetics xintercept to as.numeric(item) (because this aesthetic expects a numeric argument)
  • then change some elements in the theme section
    • set the colour in panel.border to black and do not forget to set fill to NA (you won't see anything if you don't)
    • remove the axis.line.y line


ggplot(df,aes(x=item,y=prevs,ymin=prevs-se,ymax=prevs+se)) +
    geom_point(colour="black",size=4) +
    geom_errorbar(width=0.25) +
    geom_vline(aes(xintercept=as.numeric(item)),linetype=3,size=0.4) +
    xlab("") +
    ylab("Percent") +
    scale_y_continuous(limits=c(0,50),expand=c(0,0)) +
    coord_flip() +
    theme(
        panel.background=element_blank(),
        panel.border=element_rect(colour="black",fill=NA),
        axis.line=element_line(colour="black"),
        axis.text=element_text(colour="black",size=14),
        axis.title=element_text(colour="black",size=14),
        axis.ticks.x=element_line(colour="black"),
        axis.ticks.y=element_blank()
        )


1.4 Figure 3.11 Dot-whisker chart

  • only minor changes to the previous plot
  • remove geom_vline()
  • adjust the widths of the error bars


ggplot(df,aes(x=item,y=prevs,ymin=prevs-se,ymax=prevs+se)) +
    geom_point(colour="black",size=4) +
    geom_errorbar(width=0.1) +
    xlab("") +
    ylab("Percent") +
    scale_y_continuous(limits=c(0,50),expand=c(0,0)) +
    coord_flip() +
    theme(
        panel.background=element_blank(),
        panel.border=element_rect(colour="black",fill=NA),
        axis.line=element_line(colour="black"),
        axis.text=element_text(colour="black",size=14),
        axis.title=element_text(colour="black",size=14),
        axis.ticks.x=element_line(colour="black"),
        axis.ticks.y=element_blank()
        )


1.5 Figure 3.12 Dot-whisker chart

  • only adjust the widths of the error bars


ggplot(df,aes(x=item,y=prevs,ymin=prevs-se,ymax=prevs+se)) +
    geom_point(colour="black",size=4) +
    geom_errorbar(width=0) +
    xlab("") +
    ylab("Percent") +
    scale_y_continuous(limits=c(0,50),expand=c(0,0)) +
    coord_flip() +
    theme(
        panel.background=element_blank(),
        panel.border=element_rect(colour="black",fill=NA),
        axis.line=element_line(colour="black"),
        axis.text=element_text(colour="black",size=14),
        axis.title=element_text(colour="black",size=14),
        axis.ticks.x=element_line(colour="black"),
        axis.ticks.y=element_blank()
        )


1.6 Figure 3.13 two-tiered dot-whisker chart

  • there are several possibilities
  • I decided to use two error bar layers, so first
  • I move the aesthetics for ymin and ymax into geom_errorbar() and set the width to 0.2
  • then I add a second geom_errorbar() with its own aesthetics, but now with ymin set to prevs-1.96*se and ymax to prevs+1.96*se


ggplot(df,aes(x=item,y=prevs)) +
    geom_point(colour="black",size=4) +
    geom_errorbar(aes(ymin=prevs-se,ymax=prevs+se),width=0.2) +
    geom_errorbar(aes(ymin=prevs-1.96*se,ymax=prevs+1.96*se),width=0) +
    xlab("") +
    ylab("Percent") +
    scale_y_continuous(limits=c(0,50),expand=c(0,0)) +
    coord_flip() +
    theme(
        panel.background=element_blank(),
        panel.border=element_rect(colour="black",fill=NA),
        axis.line=element_line(colour="black"),
        axis.text=element_text(colour="black",size=14),
        axis.title=element_text(colour="black",size=14),
        axis.ticks.x=element_line(colour="black"),
        axis.ticks.y=element_blank()
        )


Friday, October 31, 2014

R - plot with two different y-axes on top of each other

1 two plots on top of each other

  • first create two vectors of y-values
  • then set the margins appropriately
  • then we plot the first series of y-values and add the axis title with mtext()

y.plot1 <- sort(runif(10))
y.plot2 <- runif(10)*5
par(mar=c(5,6,2,4))
plot(y.plot1,lwd = 4,ann=F,las=2,type="l")
mtext("y.plot1",side = 2,line=3.5)



  • now we set the graphics parameter new to TRUE; this causes R not to clear the plotting device before the next high-level graphics command is executed
  • then we plot the second plot without annotations (because we want the axis on the right side, which would be plotted on the left side by default)
  • then we plot the y-axis on the right side of the plot (the four in axis(4, ...) indicates the side) and add the annotations

par(new=T)
plot(y.plot2,ann=F,axes = F,col="red",type="l",lwd = 4)
axis(4,col="red",col.ticks = "red",lwd=3,at=0:4)
mtext("y.plot2",side = 4,line=3,col="red")


R - get weather data from openweather api

1 get weather data from openweather via R

  • load the rjson package, which is able to parse JSON into R objects
  • set the file argument to the URI
  • within the URI you can specify the city or a weather station (see http://openweathermap.org/api for details); here is an example by station

require(rjson)
tt <- fromJSON(file=
  "http://api.openweathermap.org/data/2.5/history/station?id=7447&type=day")

  • the fromJSON() command converts the JSON formatted data into a list
  • the list element named list contains the weather data for the previous month (the free account provides one month of history)
  • the little function extract.temp() below extracts the min and max temperature and the wind speed (together with the time stamp)

extract.temp <- function(x){
    return(data.frame(min=x$temp$mi-273.15,
                      max=x$temp$ma-273.15,
                      wind=unlist(x$wind$speed[1]),
                      tstmp=x$dt))
}

temps <- Reduce(rbind,lapply(tt$list , extract.temp))
temps$tstmp <- as.Date(as.POSIXct(temps$tstmp,
                       origin = as.Date("1970-01-01")))
head(temps)

   min max wind      tstmp
v   16  26 1.42 2014-10-05
v1  12  23 2.31 2014-10-06
v2  14  24 2.12 2014-10-07
v3  15  25 3.14 2014-10-08
v4  15  25 3.33 2014-10-09
v5  16  25 3.14 2014-10-10

  • now we plot the data using the symbols() command, plotting wind speed on the y-axis and time on the x-axis (there is some mix-up because of the switch from daylight saving time to winter time on Oct 26th)


nod <- length(tt$list)

symbols(x=temps$tstmp,
        y=temps$wind,
        thermometers = cbind(.4,1.6,temps$min/40,temps$max/40),
        fg = "red",
        ylim = c(-0.5,max(temps$wind)+1),
        xaxt="n",
        xlab = "day", ylab="wind speed",
        main="Temperature of the last 30 days")
axis(1,at=temps$tstmp,labels=
                substr(as.character(temps$tstmp),6,10),las=2)

 text(x=temps$tstmp,
      y=temps$wind-0.5,rep(0,nod),adj=c(0.5,0))
 text(x=temps$tstmp,
      y=temps$wind+0.5,rep(40,nod),adj=c(0.5,0))
 text(x=temps$tstmp,
      y=temps$wind,rep(20,nod),adj=c(0.5,0.5))




  • with the little function below you can find the station ids by lat and lon


get.station <- function(lat,lon){
    tmpdat <- fromJSON(file=
     paste0("http://api.openweathermap.org/data/2.5/station/find?lat=",lat,
            "&lon=",lon,"&cnt=1"))
    print(paste("found:",tmpdat[[1]]$station$name,", id:",tmpdat[[1]]$station$id))
    tmpdat[[1]]$station$id
}

station <- get.station(lat=34,lon=118.3)
station

[1] "found: ZSNJ , id: 7447"
[1] 7447



get.station(51.5,0.1)

[1] "found: EGLC , id: 5091"
[1] 5091

Sunday, October 12, 2014

R - ROC curves

1 ROCR

  • load the package
  • the artificial data set ROCR.simple contains measurements and labels which you can consider as gold standard
  • the prediction() function takes the measurements and the gold standard as input
  • the value returned is an object containing inter alia:
    • the measurements and labels
    • a vector of cutoffs
    • the numbers of false positives, true positives, false negatives and true negatives corresponding to the respective cutoffs

require(ROCR,quietly=T)
data(ROCR.simple)
pred <- prediction( ROCR.simple$predictions, ROCR.simple$labels)
str(pred)

..$ : num [1:200] 0.613 0.364 0.432 0.14 0.385 ...
 ..@ labels     :List of 1
 .. ..$ : Ord.factor w/ 2 levels "0"<"1": 2 2 1 1 1 2 2 2 2 1 ...
 ..@ cutoffs    :List of 1
 .. ..$ : num [1:201] Inf 0.991 0.985 0.985 0.983 ...
 ..@ fp         :List of 1
 .. ..$ : num [1:201] 0 0 0 0 1 1 2 3 3 3 ...
 ..@ tp         :List of 1
 .. ..$ : num [1:201] 0 1 2 3 3 4 4 4 5 6 ...
 ..@ tn         :List of 1
 .. ..$ : num [1:201] 107 107 107 107 106 106 105 104 104 104 ...
 ..@ fn         :List of 1
 .. ..$ : num [1:201] 93 92 91 90 90 89 89 89 88 87 ...
 ..@ n.pos      :List of 1
 .. ..$ : int 93
 ..@ n.neg      :List of 1
 .. ..$ : int 107
 ..@ n.pos.pred :List of 1
 .. ..$ : num [1:201] 0 1 2 3 4 5 6 7 8 9 ...
 ..@ n.neg.pred :List of 1
 .. ..$ : num [1:201] 200 199 198 197 196 195 194 193 192 191 ...

  • now we can use performance() to calculate different performance measures
  • first we plot the classic ROC curve based on sensitivity and specificity:
    • plotting sensitivity (y-axis) against 1-specificity (x-axis), which is equivalent to
    • plotting the true positive rate against the false positive rate

perf <- performance(pred,"tpr","fpr")
plot(perf)



  • calculate the area under the curve (AUC) (the output is a bit messy, but you find the value of interest in the slot y.values)

print(performance(pred,"auc"))

An object of class "performance"
Slot "x.name":
[1] "None"

Slot "y.name":
[1] "Area under the ROC curve"

Slot "alpha.name":
[1] "none"

Slot "x.values":
list()

Slot "y.values":
[[1]]
[1] 0.8341875


Slot "alpha.values":
list()
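
  • if you only want the number, you can pull it out of that slot directly, for example:

slot(performance(pred,"auc"),"y.values")[[1]]

[1] 0.8341875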

  • find the best cutoff (best always depends on your preferences); here we put equal weight on sensitivity and specificity and maximize the sum of them (Youden Index)
  • we write a function which takes a prediction object, the names of two performance measurements and gives back the cutoff, the maximum of the sum and the respective values of the two performance measurements
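  • maximizing the sum of sensitivity and specificity is equivalent to maximizing the Youden index \(J = sensitivity + specificity - 1\), since the two differ only by a constant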

max.ss <- function(pred,meas.1,meas.2){
    meas.1 <- performance(pred,meas.1)
    meas.2 <- performance(pred,meas.2)
    x.vals <- slot(meas.1,'x.values')[[1]]
    y.vals <- slot(meas.1,'y.values')[[1]] + slot(meas.2,'y.values')[[1]]
    y1.vals <- slot(meas.1,'y.values')[[1]]
    y2.vals <- slot(meas.2,'y.values')[[1]]
    list(cut.off=x.vals[which.max(y.vals)],
                 max.sum=max(y.vals),
                 max.meas1=y1.vals[which.max(y.vals)],
                 max.meas2=y2.vals[which.max(y.vals)])
}

max.ss(pred,"sens","spec")

$cut.off
[1] 0.5014893

$max.sum
[1] 1.69993

$max.meas1
[1] 0.8494624

$max.meas2
[1] 0.8504673

  • here we get a cutoff of 0.5
  • the maximized sum is 1.70
  • the corresponding sensitivity is 0.85
  • the corresponding specificity is also 0.85
  • sometimes we have a clear preference because the cost of a false negative is much higher than the cost of a false positive (or vice versa)
  • therefore there exists a modified version of the Youden index which maximizes \(sensitivity + r\cdot specificity\), where \(r=\frac{1-prevalence}{cost\cdot prevalence}\), \(cost\) is the relative cost of a false negative (compared to a false positive) and \(prevalence\) is the prevalence in the population under consideration


max.ss <- function(pred,cost=1,prev=0.5){
    r <- (1-prev)/(cost*prev)
    sens <- performance(pred,"sens")
    spec <- performance(pred,"spec")
    x.vals <- slot(sens,'x.values')[[1]]
    y.vals <- slot(sens,'y.values')[[1]] + r*slot(spec,'y.values')[[1]]
    y1.vals <- slot(sens,'y.values')[[1]]
    y2.vals <- slot(spec,'y.values')[[1]]
    list(cut.off=x.vals[which.max(y.vals)],
                 sensitivity=y1.vals[which.max(y.vals)],
                 specificity=y2.vals[which.max(y.vals)])
}

max.ss(pred)

$cut.off
[1] 0.5014893

$sensitivity
[1] 0.8494624

$specificity
[1] 0.8504673

  • with the defaults cost=1 and prev=0.5 we get exactly the same result (because \(r=1\) in this case)
  • if we have a disease with a prevalence of 0.1 where false negatives (i.e. not detecting a true case) are much more expensive, we can set cost and prev accordingly
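  • for example, with cost=20 and prev=0.1 we get \(r=\frac{1-0.1}{20\cdot 0.1}=0.45\), so specificity gets less than half the weight of sensitivity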


max.ss(pred,cost=20,prev=0.1)

$cut.off
[1] 0.5014893

$sensitivity
[1] 0.8494624

$specificity
[1] 0.8504673