Thursday, July 9, 2015

label outlier in ggplot2 boxplot

  • function to add labels to outliers in a ggplot2 boxplot
  • the function add.outlier() takes a ggplot boxplot object as input
  • the second, optional input is a string with the name of the variable that holds the labels; by default the y value itself is used as the label
  • the function expects a unique mapping to x and y, where x is a factor variable
  • the data frame given to the ggplot object must contain the x, y, and labelling variables
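  • a value is flagged as an outlier if it lies at or below Q1 - 1.5*IQR or at or above Q3 + 1.5*IQR of its group, i.e. outside the usual boxplot whiskers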
require(ggplot2)

mtcars$cyl <- factor(mtcars$cyl)
mtcars$labels <- row.names(mtcars)

p <- ggplot(mtcars,aes(x=cyl,colour=cyl,y=qsec)) +
    geom_boxplot()



add.outlier <- function(p,labvar = as.character(p$mapping$y)){
      ## extract the y and x values from the plot's data via its aesthetic mapping
      df <- data.frame(y = with(p$data,eval(p$mapping$y)),
                       x = with(p$data,eval(p$mapping$x)))

      ## split the data by the x factor, i.e. one data frame per box
      df.l <- split(df,df$x)
      
      mm <- Reduce(rbind, lapply(df.l,FUN = function(df){
                                     ## keep only the values outside the whiskers (1.5 * IQR rule)
                                     out <- df$y <= (quantile(df$y)[2] - 1.5 * IQR(df$y)) |
                                         df$y >= (quantile(df$y)[4] + 1.5 * IQR(df$y))
                                     data.frame(y = df$y[out],
                                                x = df$x[out])})
                   )
  
      
      mm$x <- factor(mm$x,levels=sort(as.numeric(as.character(unique(p$data[,as.character(p$mapping$x)])))),
                     labels = levels(p$data[,as.character(p$mapping$x)])
                     )
      
      names(mm) <- c(as.character(p$mapping$y),as.character(p$mapping$x))
      mm <- merge(p$data[,c(names(mm),labvar)],mm)
      
      p + geom_text(data=mm,
                    aes_string(label=labvar),
                    vjust = -0.5)
}

add.outlier(p)





add.outlier(p,"labels")



Saturday, April 25, 2015

R - read excel files without java or perl dependencies

1 read Excel files without java or perl dependencies

With the new readxl package from Hadley Wickham it is finally possible to read excel files without java or perl dependencies, yeah… For now there are only two functions:

  • excel_sheets() to get all sheets contained in an excel file
  • read_excel() to read one sheet out of an excel file
  • get the filenames of excel files contained in the current directory

require(readxl)
dir(pattern = "xls")

[1] "20140927excelfile.xlsx" "example.xlsx"

  • get the names of the excel sheets contained in the excel file (only one in this case)
  • the function takes the file name or the whole path as its argument


excel_sheets("example.xlsx")

[1] "conspvalues"

  • read in the sheet, the function takes the path and the sheet as argument
  • the sheet can be given as position (first sheet = 1, second = 2, etc) or as name (the first sheet is the default)


xx <- read_excel("example.xlsx",1)
head(xx)

mpg cyl    disp  hp drat      wt  qsec vs am gear carb
1 21.0   6     160 110 3.90    2.62 16.46  0  1    4    4
2 21.0   6 missing 110 3.90   2.875 17.02  0  1    4    4
3 22.8   4     108  93 3.85    2.32 18.61  1  1    4    1
4 21.4   6     258 110 3.08 missing 19.44  1  0    3    1
5 18.7   8     360 175 3.15    3.44 17.02  0  0    3    2
6 18.1   6     225 105 2.76    3.46 20.22  1  0    3    1


xx <- read_excel("example.xlsx","conspvalues")
head(xx)

mpg cyl    disp  hp drat      wt  qsec vs am gear carb
1 21.0   6     160 110 3.90    2.62 16.46  0  1    4    4
2 21.0   6 missing 110 3.90   2.875 17.02  0  1    4    4
3 22.8   4     108  93 3.85    2.32 18.61  1  1    4    1
4 21.4   6     258 110 3.08 missing 19.44  1  0    3    1
5 18.7   8     360 175 3.15    3.44 17.02  0  0    3    2
6 18.1   6     225 105 2.76    3.46 20.22  1  0    3    1

  • there is an additional argument (na) to define which strings should be read as NA

xx <- read_excel("example.xlsx","conspvalues",na="missing")
head(xx)

mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 21.0   6   NA 110 3.90 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4 21.4   6  258 110 3.08    NA 19.44  1  0    3    1
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

  • with skip you can skip a given number of lines at the top of the sheet


xx <- read_excel("example.xlsx","conspvalues",skip=1)
head(xx)

21 6     160 110  3.9    2.62 16.46 0 1 4 4
1 21.0 6 missing 110 3.90   2.875 17.02 0 1 4 4
2 22.8 4     108  93 3.85    2.32 18.61 1 1 4 1
3 21.4 6     258 110 3.08 missing 19.44 1 0 3 1
4 18.7 8     360 175 3.15    3.44 17.02 0 0 3 2
5 18.1 6     225 105 2.76    3.46 20.22 1 0 3 1
6 14.3 8     360 245 3.21    3.57 15.84 0 0 3 4

  • and you can use col_names to tell the function whether or not the excel file contains the column names


xx <- read_excel("example.xlsx","conspvalues",skip=1, col_names=F)
head(xx)

X0 X1      X2  X3   X4      X5    X6 X7 X8 X9 X10
1 21.0  6     160 110 3.90    2.62 16.46  0  1  4   4
2 21.0  6 missing 110 3.90   2.875 17.02  0  1  4   4
3 22.8  4     108  93 3.85    2.32 18.61  1  1  4   1
4 21.4  6     258 110 3.08 missing 19.44  1  0  3   1
5 18.7  8     360 175 3.15    3.44 17.02  0  0  3   2
6 18.1  6     225 105 2.76    3.46 20.22  1  0  3   1


  • using col_types you can define the types of your columns (instead of letting the function guess); attention: NA values may be produced where a cell cannot be coerced to the given type
  • you need one type for each column


xx <- read_excel("example.xlsx","conspvalues",skip=1, col_types = rep("numeric",11))
head(xx)

Warning messages:
1: In read_xlsx_(path, sheet, col_names = col_names, col_types = col_types,  :
  [3, 3]: expecting numeric: got 'missing'
2: In read_xlsx_(path, sheet, col_names = col_names, col_types = col_types,  :
  [5, 6]: expecting numeric: got 'missing'
    21 6 160 110  3.9  2.62 16.46 0 1 4 4
1 21.0 6  NA 110 3.90 2.875 17.02 0 1 4 4
2 22.8 4 108  93 3.85 2.320 18.61 1 1 4 1
3 21.4 6 258 110 3.08    NA 19.44 1 0 3 1
4 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
5 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
6 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4

Sunday, March 15, 2015

function which runs two sample t-test on grouped data

1 function which runs a two sample t-test on a grouped data frame

  • the function takes a data frame as its first argument
  • the second argument group is the column which should be used for splitting the data frame
  • col indicates the numeric column to pass through to the t.test() function
  • indcol should be a binary variable, which is also passed through to t.test()


df.t.test <- function(df,group,col,indcol){
    t.test.helper <- function(x,col,indcol,group){
        tob <- t.test(x[,col] ~ x[,indcol])
        tmp <- data.frame(data = paste(col,"by",indcol),
                          group = x[1,group],
                          mean.group.1 = tob$estimate[1],
                          mean.group.2 = tob$estimate[2],
                          name.test.stat = tob$statistic,
                          conf.lower = tob$conf.int[1],
                          conf.upper = tob$conf.int[2],
                          pval = tob$p.value,
                          alternative = tob$alternative,
                          tob$method)
        names(tmp)[3:4] <- make.names(names(tob$estimate))
        row.names(tmp) <- x[1,group]
        tmp
    }
    df.l <- split(df[,c(col,indcol,group)],df[,group])
    Reduce(rbind,lapply(df.l,t.test.helper,col=col,indcol=indcol,group=group))}


## example data
examp.data <- data.frame(group=gl(10,100),
                         values=rnorm(1000),
                         t.group=sample(letters[1:2],1000,replace=T))
## example
df.t.test(examp.data,"group","values","t.group")

data group mean.in.group.a mean.in.group.b name.test.stat
1  values by t.group     1      0.06958824      0.02803721      0.2244456
2  values by t.group     2     -0.20944827     -0.06033410     -0.8368429
3  values by t.group     3     -0.20387479      0.07940850     -1.3245172
4  values by t.group     4      0.11406709      0.01975937      0.4220244
5  values by t.group     5      0.09060241     -0.10442099      1.0620544
6  values by t.group     6     -0.05623630     -0.07537593      0.1056388
7  values by t.group     7     -0.26081841     -0.02721652     -0.9533887
8  values by t.group     8     -0.04723535     -0.17205804      0.5662930
9  values by t.group     9      0.08185406      0.01676488      0.3033993
10 values by t.group    10     -0.41406196     -0.02193303     -2.1353113
   conf.lower  conf.upper       pval alternative              tob.method
1  -0.3263444  0.40944644 0.82293023   two.sided Welch Two Sample t-test
2  -0.5030591  0.20483073 0.40487276   two.sided Welch Two Sample t-test
3  -0.7083319  0.14176535 0.18876884   two.sided Welch Two Sample t-test
4  -0.3491642  0.53777963 0.67393371   two.sided Welch Two Sample t-test
5  -0.1693819  0.55942873 0.29082158   two.sided Welch Two Sample t-test
6  -0.3404432  0.37872241 0.91608663   two.sided Welch Two Sample t-test
7  -0.7200269  0.25282312 0.34281018   two.sided Welch Two Sample t-test
8  -0.3126875  0.56233286 0.57251104   two.sided Welch Two Sample t-test
9  -0.3606510  0.49082936 0.76222949   two.sided Welch Two Sample t-test
10 -0.7566487 -0.02760917 0.03527904   two.sided Welch Two Sample t-test

Monday, November 3, 2014

Graphics for Statistics - figures with ggplot2 - chapter 5 quantile-quantile plots

1 Keen Chapter 5

1.1 Figure 5.1 normal quantile-quantile plot

  • first set up the data vector
  • then we use a point layer with the quantile-quantile stat and change the size of the points: geom_point(stat="qq",shape=1,size=3.5)

require(ggplot2)
require(grid)
mass<-c(5.9,32.0,40.0,51.5,70.0,100.0,78.0,80.0,85.0,85.0,
              110.0,115.0,125.0,130.0,120.0,120.0,130.0,135.0,110.0,130.0,
              150.0,145.0,150.0,170.0,225.0,145.0,188.0,180.0,197.0,218.0,
              300.0,260.0,265.0,250.0,250.0,300.0,320.0,514.0,556.0,840.0,
              685.0,700.0,700.0,690.0,900.0,650.0,820.0,850.0,900.0,1015.0,
              820.0,1100.0,1000.0,1100.0,1000.0,1000.0)
df <- data.frame(y=mass)

ggplot(df,aes(sample=mass)) +
    geom_point(stat="qq",shape=1,size=3.5) +
    scale_x_continuous(limits=c(-3,3),breaks=-3:3,expand=c(0,0.1)) +
    scale_y_continuous(limits=c(0,1200),breaks=seq(0,1200,by=200),expand=c(0,20)) +
    xlab("Standard Normal Quantiles") +
    ylab("Mass (g)") +
          theme(
          panel.background=element_blank(),
          axis.line=element_line(colour="black"),
          axis.text=element_text(colour="black",size=14),
          axis.title=element_text(colour="black",size=14),
          axis.ticks=element_line(colour="black"),
          axis.ticks.length=unit(3,"mm")
          )


1.2 Figure 5.2 normal quantile-quantile plot

  • simply add coord_flip() to the previous plot (the breaks and expansion of the x-axis are also adjusted slightly)


ggplot(df,aes(sample=mass)) +
    geom_point(stat="qq",shape=1,size=3.5) +
    scale_x_continuous(limits=c(-3,3),breaks=-2:2,expand=c(0,0.5)) +
    scale_y_continuous(limits=c(0,1200),breaks=seq(0,1200,by=200),expand=c(0,20)) +
    xlab("Standard Normal Quantiles") +
    ylab("Mass (g)") +
    coord_flip() +
          theme(
          panel.background=element_blank(),
          axis.line=element_line(colour="black"),
          axis.text=element_text(colour="black",size=14),
          axis.title=element_text(colour="black",size=14),
          axis.ticks=element_line(colour="black"),
          axis.ticks.length=unit(3,"mm")
          )


1.3 Figure 5.3 normal quantile-quantile plot

  • only add geom_abline() with intercept=mean(mass) and slope=sd(mass): the line goes through the point defined by the means of the x and y values, which is (mean(standard normal dist)=0, mean(mass)), therefore mean(mass) is the intercept; the sd of the standard normal is one, so the slope is sd(mass)/sd(standard normal dist)=sd(mass)
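  • in other words: if mass were exactly normally distributed, the plotted points would lie on the line mass = mean(mass) + sd(mass) * z, where z is the standard normal quantile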


ggplot(df,aes(sample=mass)) +
    geom_point(stat="qq",shape=1,size=3.5) +
    geom_abline(intercept=mean(mass),slope=sd(mass)) + 
    scale_x_continuous(limits=c(-3,3),breaks=-3:3,expand=c(0,0.1)) +
    scale_y_continuous(limits=c(0,1200),breaks=seq(0,1200,by=200),expand=c(0,20)) +
    xlab("Standard Normal Quantiles") +
    ylab("Mass (g)") +
          theme(
          panel.background=element_blank(),
          axis.line=element_line(colour="black"),
          axis.text=element_text(colour="black",size=14),
          axis.title=element_text(colour="black",size=14),
          axis.ticks=element_line(colour="black"),
          axis.ticks.length=unit(3,"mm")
          )


1.4 Figure 5.4 gamma quantile-quantile plot

  • first we calculate the estimates of the scale and shape parameters of the gamma distribution
  • then we only change the distribution to qgamma (distribution="qgamma") and set the parameters as a list (dparams=list(shape=sshape,scale=sscale))
  • last we adjust the x-axis (limits and breaks)
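  • these are the method-of-moments estimates: a gamma distribution with shape k and scale s has mean k*s and variance k*s^2, so s = var(mass)/mean(mass) and k = mean(mass)^2/var(mass) = (mean(mass)/sd(mass))^2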

sshape<-(mean(mass)/sd(mass))**2
sscale<-var(mass)/mean(mass)

ggplot(df,aes(sample=mass)) +
    geom_point(stat="qq",distribution="qgamma",dparams=list(shape=sshape,scale=sscale),shape=1,size=3.5) +
    scale_x_continuous(limits=c(0,1500),breaks=seq(0,1500,by=500),expand=c(0,20)) +
    scale_y_continuous(limits=c(0,1200),breaks=seq(0,1200,by=200),expand=c(0,20)) +
    xlab("Gamma Quantiles") +
    ylab("Mass (g)") +
          theme(
          panel.background=element_blank(),
          axis.line=element_line(colour="black"),
          axis.text=element_text(colour="black",size=14),
          axis.title=element_text(colour="black",size=14),
          axis.ticks=element_line(colour="black"),
          axis.ticks.length=unit(3,"mm")
          )


1.5 Figure 5.5 normal quantile-quantile plot

  • we only have to change the x-axis of plot 5.1
  • so set the limits to c(-4,4) and the breaks to the required x-values by using the qnorm() function and change the labels appropriately
  • finally change the title of the x-axis

ggplot(df,aes(sample=mass)) +
    geom_point(stat="qq",shape=1,size=3.5) +
    scale_x_continuous(limits=c(-4,4),
                       breaks=qnorm(c(0.0001,.01,.05,.25,.50,.75,.95,.99,.9999)),
                       labels=c(0.01,1,5,25,50,75,95,99,99.99),
                       expand=c(0,0.1)) +
    scale_y_continuous(limits=c(0,1200),breaks=seq(0,1200,by=200),expand=c(0,20)) +
    xlab("Normal Percentiles") +
    ylab("Mass (g)") +
          theme(
          panel.background=element_blank(),
          axis.line=element_line(colour="black"),
          axis.text=element_text(colour="black",size=14),
          axis.title=element_text(colour="black",size=14),
          axis.ticks=element_line(colour="black"),
          axis.ticks.length=unit(3,"mm")
          )


1.6 Figure 5.6 normal quantile-quantile plot

  • it is the same as plot 5.5 - only add coord_flip()

ggplot(df,aes(sample=mass)) +
    geom_point(stat="qq",shape=1,size=3.5) +
    scale_x_continuous(limits=c(-4,4),
                       breaks=qnorm(c(0.0001,.01,.05,.25,.50,.75,.95,.99,.9999)),
                       labels=c(0.01,1,5,25,50,75,95,99,99.99),
                       expand=c(0,0.1)) +
    scale_y_continuous(limits=c(0,1200),breaks=seq(0,1200,by=200),expand=c(0,20)) +
    xlab("Normal Percentiles") +
    ylab("Mass (g)") +
    coord_flip() +
          theme(
          panel.background=element_blank(),
          axis.line=element_line(colour="black"),
          axis.text=element_text(colour="black",size=14),
          axis.title=element_text(colour="black",size=14),
          axis.ticks=element_line(colour="black"),
          axis.ticks.length=unit(3,"mm")
          )


1.7 Figure 5.7 gamma quantile-quantile plot

  • from figure 5.4 we have to change the axis and flip the coordinates:
  • to calculate the breaks we use qgamma() with the appropriate values and the shape and scale parameter
  • then we change the labels accordingly
  • add coord_flip()

sshape<-(mean(mass)/sd(mass))**2
sscale<-var(mass)/mean(mass)

ggplot(df,aes(sample=mass)) +
    geom_point(stat="qq",distribution="qgamma",dparams=list(shape=sshape,scale=sscale),shape=1,size=3.5) +
    scale_x_continuous(limits=c(0,2400),
                       breaks=qgamma(c(0.01,.25,.50,.75,.95,.99,.999),shape=sshape,scale=sscale),
                       labels=c(1,25,50,75,95,99,99.9),expand=c(0,20)) +
    scale_y_continuous(limits=c(0,1200),breaks=seq(0,1200,by=200),expand=c(0,20)) +
    xlab("Gamma Quantiles") +
    ylab("Mass (g)") +
    coord_flip() +
          theme(
          panel.background=element_blank(),
          axis.line=element_line(colour="black"),
          axis.text=element_text(colour="black",size=14),
          axis.title=element_text(colour="black",size=14),
          axis.ticks=element_line(colour="black"),
          axis.ticks.length=unit(3,"mm")
          )


Sunday, November 2, 2014

Graphics for Statistics - figures with ggplot2 - chapter 4 ecdf plots

1 Keen Chapter 4

Graphics out of the book Graphics for Statistics and Data Analysis with R by Kevin Keen (book home page)

1.1 Figure 4.17 EDF plot

  • first set up the data frame:
    • we use the ecdf() function to create a step function ecdfmass()
    • from this function we can extract the knots (which will be mapped to the x-axis)
    • using these knots as arguments in ecdfmass() we get the corresponding probabilities (which we will map to the y-axis)
    • the end column in df contains the end points of the horizontal lines in the step function - it is simply the knots vector shifted by one element, with the last element set to NA

mass<-c(5.9,32.0,40.0,51.5,70.0,100.0,78.0,80.0,85.0,85.0,
        110.0,115.0,125.0,130.0,120.0,120.0,130.0,135.0,110.0,130.0,
        150.0,145.0,150.0,170.0,225.0,145.0,188.0,180.0,197.0,218.0,
        300.0,260.0,265.0,250.0,250.0,300.0,320.0,514.0,556.0,840.0,
        685.0,700.0,700.0,690.0,900.0,650.0,820.0,850.0,900.0,1015.0,
        820.0,1100.0,1000.0,1100.0,1000.0,1000.0)


ecdfmass <- ecdf(mass)
kn <- knots(ecdfmass)
ed <- ecdfmass(kn)

df <- data.frame(knots=kn,ed=ed,end=c(kn[-1],NA))
head(df)

  knots         ed  end
1   5.9 0.01785714 32.0
2  32.0 0.03571429 40.0
3  40.0 0.05357143 51.5
4  51.5 0.07142857 70.0
5  70.0 0.08928571 78.0
6  78.0 0.10714286 80.0

  • now we first set the aesthetics for the points: x to knots and y to ed
  • add the point layer and set the point size to 3: geom_point(size=3)
  • add the lines using a segment layer, setting the aesthetics xend and yend: geom_segment(aes(xend=end,yend=ed))
  • in the next step we add two additional segment layers, one for each arrow; you can also use annotate() to do this; inside these segments we use the arrow() function from the grid package, so we can define the appearance of our arrows
  • the next two lines change the appearance of the axes: setting the limits, the breaks and the expansion
  • then set the appropriate axes titles and customize axis elements and panel.background



require(grid) ## for the arrow() function
ggplot(df,aes(x=knots,y=ed)) +
    geom_point(size=3) +
    geom_segment(aes(xend=end,yend=ed)) +
    geom_segment(x=min(df$knots),xend=-45,y=0,yend=0,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_segment(x=max(df$knots),xend=1245,y=1,yend=1,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    scale_x_continuous(limits=c(-45,1245),breaks=seq(0,1200,by=200),expand=c(0,0)) +
    scale_y_continuous(limits=c(-0.01,1.05),breaks=seq(0,1,by=0.2),expand=c(0,0)) +
    xlab("Mass (g)") +
    ylab("Empirical Distribution Function") +
    theme(
        panel.background=element_blank(),
        axis.line=element_line(colour="black"),
        axis.text=element_text(colour="black",size=14),
        axis.title=element_text(colour="black",size=14),
        axis.ticks=element_line(colour="black"),
        axis.ticks.length=unit(3,"mm")
        )


1.2 Figure 4.18 EDF plot

  • we only have to change the axis breaks of the y-axis and add the horizontal lines
  • the first is done by changing the by argument of seq() to 0.25 inside scale_y_continuous()
  • then we add a hline layer


ggplot(df,aes(x=knots,y=ed)) +
    geom_point(size=3) +
    geom_segment(aes(xend=end,yend=ed)) +
    geom_segment(x=min(df$knots),xend=-45,y=0,yend=0,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_segment(x=max(df$knots),xend=1245,y=1,yend=1,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_hline(yintercept=c(0.25,0.5,0.75),linetype=2) +
    scale_x_continuous(limits=c(-45,1245),breaks=seq(0,1200,by=200),expand=c(0,0)) +
    scale_y_continuous(limits=c(-0.01,1.05),breaks=seq(0,1,by=0.25),expand=c(0,0)) +
    xlab("Mass (g)") +
    ylab("Empirical Distribution Function") +
    theme(
        panel.background=element_blank(),
        axis.line=element_line(colour="black"),
        axis.text=element_text(colour="black",size=14),
        axis.title=element_text(colour="black",size=14),
        axis.ticks=element_line(colour="black"),
        axis.ticks.length=unit(3,"mm")
        )



  • we can also use the panel grid lines (but then we also get lines at 0 and 1)


ggplot(df,aes(x=knots,y=ed)) +
    geom_point(size=3) +
    geom_segment(aes(xend=end,yend=ed)) +
    geom_segment(x=min(df$knots),xend=-45,y=0,yend=0,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_segment(x=max(df$knots),xend=1245,y=1,yend=1,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    scale_x_continuous(limits=c(-45,1245),breaks=seq(0,1200,by=200),expand=c(0,0)) +
    scale_y_continuous(limits=c(-0.01,1.05),breaks=seq(0,1,by=0.25),expand=c(0,0)) +
    xlab("Mass (g)") +
    ylab("Empirical Distribution Function") +
    theme(
        panel.background=element_blank(),
        panel.grid.major.x=element_blank(),
        panel.grid.major.y=element_line(linetype = 2,colour="grey50"),
        axis.line=element_line(colour="black"),
        axis.text=element_text(colour="black",size=14),
        axis.title=element_text(colour="black",size=14),
        axis.ticks=element_line(colour="black"),
        axis.ticks.length=unit(3,"mm")
        )


1.3 Figure 4.19 EDF plot

  • replace geom_point() by geom_step()
  • get rid of the horizontal lines
  • add another little segment which connects the left arrow with the step function: geom_segment(x=min(df$knots),xend=min(df$knots),y=0,yend=min(df$ed),size=0.4)
  • leave everything as it is


ggplot(df,aes(x=knots,y=ed)) +
    geom_step(direction = "hv") +
    geom_segment(x=min(df$knots),xend=-45,y=0,yend=0,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_segment(x=max(df$knots),xend=1245,y=1,yend=1,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_segment(x=min(df$knots),xend=min(df$knots),y=0,yend=min(df$ed),size=0.4) +
    scale_x_continuous(limits=c(-45,1245),breaks=seq(0,1200,by=200),expand=c(0,0)) +
    scale_y_continuous(limits=c(-0.01,1.05),breaks=seq(0,1,by=0.2),expand=c(0,0)) +
    xlab("Mass (g)") +
    ylab("Empirical Distribution Function") +
    theme(
        panel.background=element_blank(),
        axis.line=element_line(colour="black"),
        axis.text=element_text(colour="black",size=14),
        axis.title=element_text(colour="black",size=14),
        axis.ticks=element_line(colour="black"),
        axis.ticks.length=unit(3,"mm")
        )



  • instead of using the data frame created above you can use the original data (mass) and set stat to ecdf
  • but for the arrows you need to calculate the values anyway, which is why I use the data frame above


df3 <- data.frame(mass=mass)
ggplot(df,aes(x=knots,y=ed)) +
    geom_step(inherit.aes=F,stat="ecdf",data=df3,aes(x=mass)) +
    geom_segment(x=min(df$knots),xend=-45,y=0,yend=0,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_segment(x=max(df$knots),xend=1245,y=1,yend=1,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_segment(x=min(df$knots),xend=min(df$knots),y=0,yend=min(df$ed),size=0.4) +
    scale_x_continuous(limits=c(-45,1245),breaks=seq(0,1200,by=200),expand=c(0,0)) +
    scale_y_continuous(limits=c(-0.01,1.05),breaks=seq(0,1,by=0.2),expand=c(0,0)) +
    xlab("Mass (g)") +
    ylab("Empirical Distribution Function") +
    theme(
        panel.background=element_blank(),
        axis.line=element_line(colour="black"),
        axis.text=element_text(colour="black",size=14),
        axis.title=element_text(colour="black",size=14),
        axis.ticks=element_line(colour="black"),
        axis.ticks.length=unit(3,"mm")
        )



1.4 Figure 4.20 EDF plot

  • the same as the last plot, only with horizontal dashed lines added at the quartiles (and the y-axis breaks set to steps of 0.25)


ggplot(df,aes(x=knots,y=ed)) +
    geom_step(direction = "hv") +
    geom_segment(x=min(df$knots),xend=-45,y=0,yend=0,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_segment(x=max(df$knots),xend=1245,y=1,yend=1,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_segment(x=min(df$knots),xend=min(df$knots),y=0,yend=min(df$ed),size=0.4) +
    geom_hline(yintercept=c(0.25,0.5,0.75),linetype=2) +
    scale_x_continuous(limits=c(-45,1245),breaks=seq(0,1200,by=200),expand=c(0,0)) +
    scale_y_continuous(limits=c(-0.01,1.05),breaks=seq(0,1,by=0.25),expand=c(0,0)) +
    xlab("Mass (g)") +
    ylab("Empirical Distribution Function") +
    theme(
        panel.background=element_blank(),
        axis.line=element_line(colour="black"),
        axis.text=element_text(colour="black",size=14),
        axis.title=element_text(colour="black",size=14),
        axis.ticks=element_line(colour="black"),
        axis.ticks.length=unit(3,"mm")
        )



1.5 Figure 4.20 EDF plot with cumulative normal distribution function added

  • first we need to create a second data frame containing the values defining the curve; we choose to use 10000 points on the x-axis and use pnorm() to calculate the respective y values (using the empirical mean and the empirical sd of the vector mass)
  • then we add the layer (geom_line())
  • change the limits of the x-axis and the breaks inside scale_x_continuous()
  • change the length of the right arrow (setting xend to 1500)


mean_mass<-mean(mass)
sd_mass<-sd(mass)
min_mass<-min(mass)
max_mass<-1500

xx <- seq(0,10000,1)*(max_mass-min_mass)/10000.+min_mass
yy <- pnorm(xx,mean_mass,sd_mass)

df2 <- data.frame(xx=xx,yy=yy)

ggplot(df,aes(x=knots,y=ed)) +
    geom_step(direction = "hv") +
    geom_segment(x=min(df$knots),xend=-45,y=0,yend=0,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_segment(x=max(df$knots),xend=1500,y=1,yend=1,arrow=arrow(length = unit(0.15,"cm")),size=c(0.4)) +
    geom_segment(x=min(df$knots),xend=min(df$knots),y=0,yend=min(df$ed),size=0.4) +
    geom_line(data=df2,aes(x=xx,y=yy)) +
    scale_x_continuous(limits=c(-45,1500),breaks=seq(0,1500,by=500),expand=c(0,0)) +
    scale_y_continuous(limits=c(-0.01,1.05),breaks=seq(0,1,by=0.2),expand=c(0,0)) +
    xlab("Mass (g)") +
    ylab("Empirical Distribution Function") +
    theme(
        panel.background=element_blank(),
        axis.line=element_line(colour="black"),
        axis.text=element_text(colour="black",size=14),
        axis.title=element_text(colour="black",size=14),
        axis.ticks=element_line(colour="black"),
        axis.ticks.length=unit(3,"mm")
        )


Graphics for Statistics - figures with ggplot2 - Chapter 3 part 2 Bar charts with errorbars, dot-whisker charts

1 Keen Chapter 3 part 2

Graphics out of the book Graphics for Statistics and Data Analysis with R by Kevin Keen (book home page)

1.1 Figure 3.8 Bar-whisker chart

  • we will work out the graphic on flipped axes and flip them later
  • set the aesthetics:
    • x to the names of the allergens
    • y to the percentage (prevalence)
    • ymin to y minus the standard error and
    • ymax to y plus the standard error
  • using again geom_bar() with the stat="identity" option because we have already aggregated data (so the height of the bars is set by the y aesthetic)
  • set the filling and the colour of the edges (fill and colour), finally adjust the width (width) of the bars
  • there is no axis title on the axis with the names, so set xlab("") (remember we will flip the axis later)
  • set the title of the continuous axis to Percent and
  • set the limits of the axis to 0 and 50
  • set its expansion to c(0,0) - because we do not want to expand the axis; it should end exactly at 0 and 50
  • now we flip the axes
  • and set the appearance of the text, axis and background elements

require(ggplot2)

names<-c("Epidermals","Dust Mites","Weeds","Grasses","Molds","Trees")
prevs<-c(38.2,37.8,31.1,31.1,29.3,26.7)
se<-c(3.2,3.2,3.1,3.1,3.0,2.9)

df <- data.frame(item=factor(1:6,labels=names),prevs=prevs,se=se)

ggplot(df,aes(x=item,y=prevs,ymin=prevs-se,ymax=prevs+se)) +
    geom_bar(stat="identity",fill="transparent",colour="black",width=0.7) +
    geom_errorbar(width=0.3) +
    xlab("") +
    ylab("Percent") +
    scale_y_continuous(limits=c(0,50),expand=c(0,0)) +
    coord_flip() +
    theme(
        panel.background=element_blank(),
        axis.line=element_line(colour="black"),
        axis.line.y=element_blank(),
        axis.text=element_text(colour="black",size=14),
        axis.ticks.x=element_line(colour="black"),
        axis.ticks.y=element_blank()
        )


1.2 Figure 3.9 Bar and single whisker chart

  • the same as above, only change the filling of the bars to black (you do not need the colour argument any more) and the width of the error bars to 0

ggplot(df,aes(x=item,y=prevs,ymin=prevs-se,ymax=prevs+se)) +
    geom_bar(stat="identity",fill="black",width=0.7) +
    geom_errorbar(width=0) +
    xlab("") +
    ylab("Percent") +
    scale_y_continuous(limits=c(0,50),expand=c(0,0)) +
    coord_flip() +
    theme(
        panel.background=element_blank(),
        axis.line=element_line(colour="black"),
        axis.line.y=element_blank(),
        axis.text=element_text(colour="black",size=14),
        axis.ticks.x=element_line(colour="black"),
        axis.ticks.y=element_blank()
        )


1.3 Figure 3.10 Dot-whisker chart

  • for the dot-whisker chart we replace geom_bar() by geom_point()
  • in geom_point() we set the point size to four and the colour to black
  • in geom_errorbar() we set the width to 0.25
  • add geom_vline() for the dotted lines and set the aesthetics xintercept to as.numeric(item) (because this aesthetic expects a numeric argument)
  • then change some elements in the theme section
    • set the colour in panel.border to black and do not forget to set fill to NA (you won't see anything if you don't)
    • remove the axis.line.y line


ggplot(df,aes(x=item,y=prevs,ymin=prevs-se,ymax=prevs+se)) +
    geom_point(colour="black",size=4) +
    geom_errorbar(width=0.25) +
    geom_vline(aes(xintercept=as.numeric(item)),linetype=3,size=0.4) +
    xlab("") +
    ylab("Percent") +
    scale_y_continuous(limits=c(0,50),expand=c(0,0)) +
    coord_flip() +
    theme(
        panel.background=element_blank(),
        panel.border=element_rect(colour="black",fill=NA),
        axis.line=element_line(colour="black"),
        axis.text=element_text(colour="black",size=14),
        axis.title=element_text(colour="black",size=14),
        axis.ticks.x=element_line(colour="black"),
        axis.ticks.y=element_blank()
        )


1.4 Figure 3.11 Dot-whisker chart

  • only minor changes to the previous plot
  • remove geom_vline()
  • adjust the widths of the error bars


ggplot(df,aes(x=item,y=prevs,ymin=prevs-se,ymax=prevs+se)) +
    geom_point(colour="black",size=4) +
    geom_errorbar(width=0.1) +
    xlab("") +
    ylab("Percent") +
    scale_y_continuous(limits=c(0,50),expand=c(0,0)) +
    coord_flip() +
    theme(
        panel.background=element_blank(),
        panel.border=element_rect(colour="black",fill=NA),
        axis.line=element_line(colour="black"),
        axis.text=element_text(colour="black",size=14),
        axis.title=element_text(colour="black",size=14),
        axis.ticks.x=element_line(colour="black"),
        axis.ticks.y=element_blank()
        )


1.5 Figure 3.12 Dot-whisker chart

  • only adjust the widths of the error bars


ggplot(df,aes(x=item,y=prevs,ymin=prevs-se,ymax=prevs+se)) +
    geom_point(colour="black",size=4) +
    geom_errorbar(width=0) +
    xlab("") +
    ylab("Percent") +
    scale_y_continuous(limits=c(0,50),expand=c(0,0)) +
    coord_flip() +
    theme(
        panel.background=element_blank(),
        panel.border=element_rect(colour="black",fill=NA),
        axis.line=element_line(colour="black"),
        axis.text=element_text(colour="black",size=14),
        axis.title=element_text(colour="black",size=14),
        axis.ticks.x=element_line(colour="black"),
        axis.ticks.y=element_blank()
        )


1.6 Figure 3.13 two-tiered dot-whisker chart

  • there are several possibilities
  • I decided to use two error bar layers, so first
  • I move the aesthetics for ymin and ymax into geom_errorbar() and set the width to 0.2
  • then I add a second geom_errorbar() with its own aesthetics, but now with ymin set to prevs-1.96*se and ymax to prevs+1.96*se


ggplot(df,aes(x=item,y=prevs)) +
    geom_point(colour="black",size=4) +
    geom_errorbar(aes(ymin=prevs-se,ymax=prevs+se),width=0.2) +
    geom_errorbar(aes(ymin=prevs-1.96*se,ymax=prevs+1.96*se),width=0) +
    xlab("") +
    ylab("Percent") +
    scale_y_continuous(limits=c(0,50),expand=c(0,0)) +
    coord_flip() +
    theme(
        panel.background=element_blank(),
        panel.border=element_rect(colour="black",fill=NA),
        axis.line=element_line(colour="black"),
        axis.text=element_text(colour="black",size=14),
        axis.title=element_text(colour="black",size=14),
        axis.ticks.x=element_line(colour="black"),
        axis.ticks.y=element_blank()
        )


Friday, October 31, 2014

R - plot with two different y-axes on top of each other

1 two plots on top of each other

  • first create two vectors of y-values
  • then set the margins appropriately
  • then we plot the first series of y-values and add the axis title with mtext()

y.plot1 <- sort(runif(10))
y.plot2 <- runif(10)*5
par(mar=c(5,6,2,4))
plot(y.plot1,lwd = 4,ann=F,las=2,type="l")
mtext("y.plot1",side = 2,line=3.5)



  • now we set the graphics parameter new to TRUE; this causes R not to clear the plotting device before the next high-level graphics command is executed
  • then we plot the second plot without annotations (because we want the axis on the right side, which would be plotted on the left side by default)
  • then we plot the y-axis on the right side of the plot (the four in axis(4, ...) indicates the side) and add the annotations

par(new=T)
plot(y.plot2,ann=F,axes = F,col="red",type="l",lwd = 4)
axis(4,col="red",col.ticks = "red",lwd=3,at=0:4)
mtext("y.plot2",side = 4,line=3,col="red")


R - get weather data from openweather api

1 get weather data from openweather via R

  • load the rjson package, which is able to parse JSON into R objects
  • set the file argument to the URI
  • within the URI you can specify the city or a weather station (see http://openweathermap.org/api for details); here is an example by station

require(rjson)
tt <- fromJSON(file=
  "http://api.openweathermap.org/data/2.5/history/station?id=7447&type=day")

  • the fromJSON() command converts the JSON formatted data into a list
  • the list element named list contains the weather data for the previous month (the free account provides one month of history)
  • the little function extract.temp() below extracts the min and max temperature and the wind speed (together with the time stamp)

extract.temp <- function(x){
    return(data.frame(min=x$temp$mi-273.15,
                      max=x$temp$ma-273.15,
                      wind=unlist(x$wind$speed[1]),
                      tstmp=x$dt))
}

temps <- Reduce(rbind,lapply(tt$list , extract.temp))
temps$tstmp <- as.Date(as.POSIXct(temps$tstmp,
                       origin = as.Date("1970-01-01")))
head(temps)

   min max wind      tstmp
v   16  26 1.42 2014-10-05
v1  12  23 2.31 2014-10-06
v2  14  24 2.12 2014-10-07
v3  15  25 3.14 2014-10-08
v4  15  25 3.33 2014-10-09
v5  16  25 3.14 2014-10-10

  • now we plot the data using the symbols() command, plotting wind speed on the y-axis and time on the x-axis (there is some mix-up because of the switch from daylight saving time to winter time on Oct 26th)


nod <- length(tt$list)

symbols(x=temps$tstmp,
        y=temps$wind,
        thermometers = cbind(.4,1.6,temps$min/40,temps$max/40),
        fg = "red",
        ylim = c(-0.5,max(temps$wind)+1),
        xaxt="n",
        xlab = "day", ylab="wind speed",
        main="Temperature of the last 30 days")
axis(1,at=temps$tstmp,labels=
                substr(as.character(temps$tstmp),6,10),las=2)

 text(x=temps$tstmp,
      y=temps$wind-0.5,rep(0,nod),adj=c(0.5,0))
 text(x=temps$tstmp,
      y=temps$wind+0.5,rep(40,nod),adj=c(0.5,0))
 text(x=temps$tstmp,
      y=temps$wind,rep(20,nod),adj=c(0.5,0.5))




  • with the little function below you can find the station ids by lat and lon


get.station <- function(lat,lon){
    tmpdat <- fromJSON(file=
     paste0("http://api.openweathermap.org/data/2.5/station/find?lat=",lat,
            "&lon=",lon,"&cnt=1"))
    print(paste("found:",tmpdat[[1]]$station$name,", id:",tmpdat[[1]]$station$id))
    tmpdat[[1]]$station$id
}

station <- get.station(lat=34,lon=118.3)
station

[1] "found: ZSNJ , id: 7447"
[1] 7447



get.station(51.5,0.1)

[1] "found: EGLC , id: 5091"
[1] 5091

Sunday, October 12, 2014

R - ROC curves

1 ROCR

  • load the package
  • the artificial data set ROCR.simple contains measurements and labels which you can consider as gold standard
  • the prediction() function takes the measurements and the gold standard as input
  • the value returned is an object containing inter alia:
    • the measurements and labels
    • a vector of cutoffs
    • the numbers of false positives, true positives, false negatives and true negatives corresponding to the respective cutoffs

require(ROCR,quietly=T)
data(ROCR.simple)
pred <- prediction( ROCR.simple$predictions, ROCR.simple$labels)
str(pred)

..$ : num [1:200] 0.613 0.364 0.432 0.14 0.385 ...
 ..@ labels     :List of 1
 .. ..$ : Ord.factor w/ 2 levels "0"<"1": 2 2 1 1 1 2 2 2 2 1 ...
 ..@ cutoffs    :List of 1
 .. ..$ : num [1:201] Inf 0.991 0.985 0.985 0.983 ...
 ..@ fp         :List of 1
 .. ..$ : num [1:201] 0 0 0 0 1 1 2 3 3 3 ...
 ..@ tp         :List of 1
 .. ..$ : num [1:201] 0 1 2 3 3 4 4 4 5 6 ...
 ..@ tn         :List of 1
 .. ..$ : num [1:201] 107 107 107 107 106 106 105 104 104 104 ...
 ..@ fn         :List of 1
 .. ..$ : num [1:201] 93 92 91 90 90 89 89 89 88 87 ...
 ..@ n.pos      :List of 1
 .. ..$ : int 93
 ..@ n.neg      :List of 1
 .. ..$ : int 107
 ..@ n.pos.pred :List of 1
 .. ..$ : num [1:201] 0 1 2 3 4 5 6 7 8 9 ...
 ..@ n.neg.pred :List of 1
 .. ..$ : num [1:201] 200 199 198 197 196 195 194 193 192 191 ...

  • now we can use performance() to calculate different performance measures
  • first we plot the classic ROC curve based on sensitivity and specificity:
    • plotting sensitivity (y-axis) against 1-specificity (x-axis), which is equivalent to
    • plotting the true positive rate against the false positive rate

perf <- performance(pred,"tpr","fpr")
plot(perf)



  • calculate the area under the curve (AUC) (the output is a bit messy, but you find the value of interest in the slot y.values)

print(performance(pred,"auc"))

An object of class "performance"
Slot "x.name":
[1] "None"

Slot "y.name":
[1] "Area under the ROC curve"

Slot "alpha.name":
[1] "none"

Slot "x.values":
list()

Slot "y.values":
[[1]]
[1] 0.8341875


Slot "alpha.values":
list()
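
  • if you only want the number, you can pull it out of that slot directly, for example:

slot(performance(pred,"auc"),"y.values")[[1]]

[1] 0.8341875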

  • find the best cutoff (best always depends on your preferences); here we put equal weight on sensitivity and specificity and maximize the sum of them (Youden Index)
  • we write a function which takes a prediction object, the names of two performance measurements and gives back the cutoff, the maximum of the sum and the respective values of the two performance measurements
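  • maximizing the sum of sensitivity and specificity is equivalent to maximizing the Youden index \(J = sensitivity + specificity - 1\), since the two differ only by a constant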

max.ss <- function(pred,meas.1,meas.2){
    meas.1 <- performance(pred,meas.1)
    meas.2 <- performance(pred,meas.2)
    x.vals <- slot(meas.1,'x.values')[[1]]
    y.vals <- slot(meas.1,'y.values')[[1]] + slot(meas.2,'y.values')[[1]]
    y1.vals <- slot(meas.1,'y.values')[[1]]
    y2.vals <- slot(meas.2,'y.values')[[1]]
    list(cut.off=x.vals[which.max(y.vals)],
                 max.sum=max(y.vals),
                 max.meas1=y1.vals[which.max(y.vals)],
                 max.meas2=y2.vals[which.max(y.vals)])
}

max.ss(pred,"sens","spec")

$cut.off
[1] 0.5014893

$max.sum
[1] 1.69993

$max.meas1
[1] 0.8494624

$max.meas2
[1] 0.8504673

  • here we get a cutoff of 0.5
  • the maximized sum is 1.70
  • the corresponding sensitivity is 0.85
  • the corresponding specificity is also 0.85
  • sometimes we have a clear preference because the cost of a false negative is much higher than the cost of a false positive (or vice versa)
  • therefore there exists a modified version of the Youden index which maximizes \(sensitivity + r\cdot specificity\), where \(r=\frac{1-prevalence}{cost\cdot prevalence}\), \(cost\) is the relative cost of a false negative (compared to a false positive) and \(prevalence\) is the prevalence in the population under consideration


max.ss <- function(pred,cost=1,prev=0.5){
    r <- (1-prev)/(cost*prev)
    sens <- performance(pred,"sens")
    spec <- performance(pred,"spec")
    x.vals <- slot(sens,'x.values')[[1]]
    y.vals <- slot(sens,'y.values')[[1]] + r*slot(spec,'y.values')[[1]]
    y1.vals <- slot(sens,'y.values')[[1]]
    y2.vals <- slot(spec,'y.values')[[1]]
    list(cut.off=x.vals[which.max(y.vals)],
                 sensitivity=y1.vals[which.max(y.vals)],
                 specificity=y2.vals[which.max(y.vals)])
}

max.ss(pred)

$cut.off
[1] 0.5014893

$sensitivity
[1] 0.8494624

$specificity
[1] 0.8504673

  • with the defaults cost=1 and prev=0.5 we get exactly the same result (because \(r=1\) in this case)
  • if we have a disease with a prevalence of 0.1 where false negatives (i.e. not detecting a true case) are much more expensive, we can set cost and prev accordingly
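  • for example, with cost=20 and prev=0.1 we get \(r=\frac{1-0.1}{20\cdot 0.1}=0.45\), so specificity gets less than half the weight of sensitivity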


max.ss(pred,cost=20,prev=0.1)

$cut.off
[1] 0.5014893

$sensitivity
[1] 0.8494624

$specificity
[1] 0.8504673