Friday, October 31, 2014

R - plot with two different y-axes on top of each other

1 two plots on top of each other

  • first create two vectors of y-values
  • then set the margins appropriately (extra room on the left and on the right for the two axis titles)
  • then we plot the first series of y-values and add the axis title

y.plot1 <- sort(runif(10))
y.plot2 <- runif(10)*5
par(mar=c(5,6,2,4))
plot(y.plot1,lwd = 4,ann=F,las=2,type="l")
mtext("y.plot1",side = 2,line=3.5)



  • now we set the graphics parameter new to TRUE; this tells R not to clear the existing plot before the next high-level graphics command is executed
  • then we plot the second series without annotations and axes (we want the y-axis on the right side, but by default it would be drawn on the left)
  • then we draw the y-axis on the right side of the plot (the 4 in axis(4, ...) indicates the side) and add the annotations

par(new=T)                                         # keep the existing plot
plot(y.plot2,ann=F,axes = F,col="red",type="l",lwd = 4)
axis(4,col="red",col.ticks = "red",lwd=3,at=0:4)   # side 4 = right
mtext("y.plot2",side = 4,line=3,col="red")


R - get weather data from openweather api

1 get weather data from openweather via R

  • load the rjson package, which can parse JSON into R data structures
  • set the file argument to the URI
  • within the URI you can specify a city or a weather station (see http://openweathermap.org/api for details); here is an example querying by station id

require(rjson)
tt <- fromJSON(file=
  "http://api.openweathermap.org/data/2.5/history/station?id=7447&type=day")

  • the fromJSON() command converts the JSON formatted data into a list
  • the list element named list contains the weather data of the previous month (that is what the free account provides)
  • the little function extract.temp() below extracts the min and max temperature and the wind speed (together with the timestamp)

extract.temp <- function(x){
    ## temperatures come in Kelvin, so subtract 273.15 to get degrees Celsius
    return(data.frame(min=x$temp$mi-273.15,
                      max=x$temp$ma-273.15,
                      wind=unlist(x$wind$speed[1]),
                      tstmp=x$dt))
}

temps <- Reduce(rbind,lapply(tt$list , extract.temp))
temps$tstmp <- as.Date(as.POSIXct(temps$tstmp,
                       origin = as.Date("1970-01-01")))
head(temps)

   min max wind      tstmp
v   16  26 1.42 2014-10-05
v1  12  23 2.31 2014-10-06
v2  14  24 2.12 2014-10-07
v3  15  25 3.14 2014-10-08
v4  15  25 3.33 2014-10-09
v5  16  25 3.14 2014-10-10

  • now we plot the data using the symbols() command, plotting wind speed on the y-axis and time on the x-axis (there is some mix-up because of the switch from daylight saving time to winter time on October 26th)


nod <- length(tt$list)    # number of observations (days) returned

## thermometer columns: width, height, lower and upper fill proportion;
## dividing the temperatures by 40 puts them on a 0-40 degree Celsius scale
symbols(x=temps$tstmp,
        y=temps$wind,
        thermometers = cbind(.4,1.6,temps$min/40,temps$max/40),
        fg = "red",
        ylim = c(-0.5,max(temps$wind)+1),
        xaxt="n",
        xlab = "day", ylab="wind speed",
        main="Temperature of the last 30 days")
axis(1,at=temps$tstmp,labels=
                substr(as.character(temps$tstmp),6,10),las=2)

## label the 0, 20 and 40 degree marks next to each thermometer
text(x=temps$tstmp,
     y=temps$wind-0.5,rep(0,nod),adj=c(0.5,0))
text(x=temps$tstmp,
     y=temps$wind+0.5,rep(40,nod),adj=c(0.5,0))
text(x=temps$tstmp,
     y=temps$wind,rep(20,nod),adj=c(0.5,0.5))




  • with the little function below you can find station ids by latitude and longitude


get.station <- function(lat,lon){
    tmpdat <- fromJSON(file=
     paste0("http://api.openweathermap.org/data/2.5/station/find?lat=",lat,
            "&lon=",lon,"&cnt=1"))
    print(paste("found:",tmpdat[[1]]$station$name,", id:",tmpdat[[1]]$station$id))
    tmpdat[[1]]$station$id
}

station <- get.station(lat=34,lon=118.3)
station

[1] "found: ZSNJ , id: 7447"
[1] 7447



get.station(51.5,0.1)

[1] "found: EGLC , id: 5091"
[1] 5091
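
  • the returned station id can be plugged straight back into the history query from the beginning of this post; a sketch (same openweathermap endpoints as above):

id <- get.station(51.5, 0.1)
tt <- fromJSON(file = paste0(
    "http://api.openweathermap.org/data/2.5/history/station?id=", id, "&type=day"))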

Sunday, October 12, 2014

R - ROC curves

1 ROCR

  • load the package
  • the artificial data set ROCR.simple contains measurements and labels which you can consider as gold standard
  • the prediction() function takes the measurements and the gold standard as input
  • the value returned is an object of class prediction containing, inter alia:
    • the measurements and labels
    • a vector of cutoffs and, for each cutoff, the corresponding numbers of
      • false positives
      • true positives
      • false negatives
      • true negatives

require(ROCR,quietly=T)
data(ROCR.simple)
pred <- prediction( ROCR.simple$predictions, ROCR.simple$labels)
str(pred)

 ..@ predictions:List of 1
 .. ..$ : num [1:200] 0.613 0.364 0.432 0.14 0.385 ...
 ..@ labels     :List of 1
 .. ..$ : Ord.factor w/ 2 levels "0"<"1": 2 2 1 1 1 2 2 2 2 1 ...
 ..@ cutoffs    :List of 1
 .. ..$ : num [1:201] Inf 0.991 0.985 0.985 0.983 ...
 ..@ fp         :List of 1
 .. ..$ : num [1:201] 0 0 0 0 1 1 2 3 3 3 ...
 ..@ tp         :List of 1
 .. ..$ : num [1:201] 0 1 2 3 3 4 4 4 5 6 ...
 ..@ tn         :List of 1
 .. ..$ : num [1:201] 107 107 107 107 106 106 105 104 104 104 ...
 ..@ fn         :List of 1
 .. ..$ : num [1:201] 93 92 91 90 90 89 89 89 88 87 ...
 ..@ n.pos      :List of 1
 .. ..$ : int 93
 ..@ n.neg      :List of 1
 .. ..$ : int 107
 ..@ n.pos.pred :List of 1
 .. ..$ : num [1:201] 0 1 2 3 4 5 6 7 8 9 ...
 ..@ n.neg.pred :List of 1
 .. ..$ : num [1:201] 200 199 198 197 196 195 194 193 192 191 ...
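
  • the slots can be accessed directly with @; as a quick cross-check (a sketch, not part of the original post) the true and false positive rates can be computed by hand from them

tpr <- pred@tp[[1]] / pred@n.pos[[1]]   # sensitivity at each cutoff
fpr <- pred@fp[[1]] / pred@n.neg[[1]]   # 1 - specificity at each cutoff
head(cbind(cutoff = pred@cutoffs[[1]], fpr, tpr))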

  • now we can use performance() to calculate different performance measures
  • first we plot the classic ROC curve based on sensitivity and specificity:
    • sensitivity (y-axis) is plotted against 1-specificity (x-axis), which is equivalent to
    • plotting the true positive rate against the false positive rate

perf <- performance(pred,"tpr","fpr")
plot(perf)
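
  • optionally (not part of the original post) the curve can be colorized by cutoff and a diagonal reference line added; colorize is an option of ROCR's plot method for performance objects, abline() is base R

plot(perf, colorize = TRUE)
abline(a = 0, b = 1, lty = 2)   # diagonal = expected curve of a random classifier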



  • calculate the area under the curve (AUC); the output is a bit messy, but you find the value of interest in the slot y.values (see the extraction one-liner after the output)

print(performance(pred,"auc"))

An object of class "performance"
Slot "x.name":
[1] "None"

Slot "y.name":
[1] "Area under the ROC curve"

Slot "alpha.name":
[1] "none"

Slot "x.values":
list()

Slot "y.values":
[[1]]
[1] 0.8341875


Slot "alpha.values":
list()
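
  • to pull out just the number, the y.values slot can be accessed directly

performance(pred,"auc")@y.values[[1]]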

  • find the best cutoff (what is best always depends on your preferences); here we put equal weight on sensitivity and specificity and maximize their sum (Youden index)
  • we write a function which takes a prediction object and the names of two performance measures and returns the cutoff, the maximum of the sum, and the corresponding values of the two measures

max.ss <- function(pred,meas.1,meas.2){
    meas.1 <- performance(pred,meas.1)
    meas.2 <- performance(pred,meas.2)
    x.vals <- slot(meas.1,'x.values')[[1]]
    y.vals <- slot(meas.1,'y.values')[[1]] + slot(meas.2,'y.values')[[1]]
    y1.vals <- slot(meas.1,'y.values')[[1]]
    y2.vals <- slot(meas.2,'y.values')[[1]]
    list(cut.off=x.vals[which.max(y.vals)],
                 max.sum=max(y.vals),
                 max.meas1=y1.vals[which.max(y.vals)],
                 max.meas2=y2.vals[which.max(y.vals)])
}

max.ss(pred,"sens","spec")

$cut.off
[1] 0.5014893

$max.sum
[1] 1.69993

$max.meas1
[1] 0.8494624

$max.meas2
[1] 0.8504673

  • here we get a cutoff of 0.5
  • the maximized sum is 1.70
  • the resp. sensitivity is 0.85
  • the resp. specificity is also 0.85
  • sometimes we have a clear preference because the cost of a false negative is much higher than the cost of a false positive (or vice versa)
  • therefore there exists a modified version of the Youden index which maximizes \(sensitivity + r\cdot specificity\), where \(r=\frac{1-prevalence}{cost\cdot prevalence}\), \(cost\) is the cost of a false negative relative to that of a false positive, and \(prevalence\) is the prevalence in the population under consideration


max.ss <- function(pred,cost=1,prev=0.5){
    r <- (1-prev)/(cost*prev)
    sens <- performance(pred,"sens")
    spec <- performance(pred,"spec")
    x.vals <- slot(sens,'x.values')[[1]]
    y.vals <- slot(sens,'y.values')[[1]] + r*slot(spec,'y.values')[[1]]
    y1.vals <- slot(sens,'y.values')[[1]]
    y2.vals <- slot(spec,'y.values')[[1]]
    list(cut.off=x.vals[which.max(y.vals)],
                 sensitivity=y1.vals[which.max(y.vals)],
                 specificity=y2.vals[which.max(y.vals)])
}

max.ss(pred)

$cut.off
[1] 0.5014893

$sensitivity
[1] 0.8494624

$specificity
[1] 0.8504673

  • with the defaults cost=1 and prev=0.5 we get exactly the same result (because \(r=1\) in this case)
  • if we have a disease with a prevalence of 0.1 where false negatives (i.e. not detecting a true case) are 20 times as expensive as false positives, we get \(r=\frac{1-0.1}{20\cdot 0.1}=0.45\), i.e. specificity gets less than half the weight of sensitivity:


max.ss(pred,cost=20,prev=0.1)

$cut.off
[1] 0.5014893

$sensitivity
[1] 0.8494624

$specificity
[1] 0.8504673

Wednesday, October 1, 2014

R - confidence intervals for odds ratio in matched pair

  • the 95% confidence interval for the odds ratio in matched pairs is \(\frac{B}{C}\cdot\exp\left(\pm 1.96\cdot\sqrt{\frac{1}{B}+\frac{1}{C}}\right)\), where \(B\) and \(C\) are the counts of the discordant pairs
  • here is an example with anemia in mothers ~ low birth weight


bw <- as.table(matrix(c(13,27,8,22),
                      dimnames=list(c("anemia","no anemia"),
                                    c("low bw","normal bw")),
                      byrow=T,nrow=2))
bw

          low bw normal bw
anemia        13        27
no anemia      8        22

  • do the calculations by hand:


print(paste("odds ratio:",27/8))
27/8 * exp(c(-1,1)*1.96*sqrt(1/27 + 1/8))

[1] "odds ratio: 3.375"
[1] 1.533297 7.428844
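
  • the hand calculation can also be wrapped in a small helper (just a sketch of the formula above; B and C are the discordant counts, conf the confidence level)

or.ci.mp <- function(B, C, conf = 0.95){
    z <- qnorm(1 - (1 - conf)/2)     # 1.96 for conf = 0.95
    list(or = B/C,
         ci = B/C * exp(c(-1, 1) * z * sqrt(1/B + 1/C)))
}
or.ci.mp(27, 8)    # essentially reproduces the interval computed by hand above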

  • or use the oddsratioci.mp() function from the PropCIs package


require(PropCIs,quietly=T)
oddsratioci.mp(8,27,0.95)

data:  

95 percent confidence interval:
 1.562965 7.287833