Monday, July 22, 2013

R ggbio - visualize a single chromosome

  • plotIdeogram() is a wrapper function, synonyms to plotSingleChrom()
  • requires packages rtracklayer, GenomicRanges, BioGenerics, parallel
  • takes a name of a genome as argument and downloads the cytoband table from UCSC
require(ggbio)
p <- plotIdeogram()
1: hg19       2: hg18       3: hg17       4: hg16       5: vicPac1
  6: dasNov3    7: otoGar3    8: papHam1    9: papAnu2   10: felCat5
 11: felCat4   12: felCat3   13: panTro4   14: panTro3   15: panTro2
 16: panTro1   17: bosTau7   18: bosTau6   19: bosTau4   20: bosTau3
 21: bosTau2   22: canFam3   23: canFam2   24: canFam1   25: turTru2
 26: loxAfr3   27: musFur1   28: nomLeu3   29: nomLeu2   30: nomLeu1
 31: gorGor3   32: cavPor3   33: eriEur1   34: equCab2   35: equCab1
 36: dipOrd1   37: triMan1   38: calJac3   39: calJac1   40: pteVam1
 41: myoLuc2   42: mm10      43: mm9       44: mm8       45: mm7    
 46: micMur1   47: hetGla2   48: hetGla1   49: monDom5   50: monDom4
 51: monDom1   52: ponAbe2   53: ailMel1   54: susScr3   55: susScr2
 56: ochPri2   57: ornAna1   58: oryCun2   59: rn5       60: rn4    
 61: rn3       62: rheMac3   63: rheMac2   64: proCap1   65: oviAri1
 66: sorAra1   67: choHof1   68: speTri2   69: saiBol1   70: sorAra1
 71: sarHar1   72: echTel1   73: tupBel1   74: macEug2   75: cerSim1
 76: gadMor1   77: melUnd1   78: galGal4   79: galGal3   80: galGal2
 81: latCha1   82: fr3       83: fr2       84: fr1       85: petMar2
 86: petMar1   87: anoCar2   88: anoCar1   89: oryLat2   90: geoFor1
 91: oreNil2   92: chrPic1   93: gasAcu1   94: tetNig2   95: tetNig1
 96: melGal1   97: xenTro3   98: xenTro2   99: xenTro1  100: taeGut1
101: danRer7  102: danRer6  103: danRer5  104: danRer4  105: danRer3
106: ci2      107: ci1      108: braFlo1  109: strPur2  110: strPur1
111: apiMel2  112: apiMel1  113: anoGam1  114: droAna2  115: droAna1
116: droEre1  117: droGri1  118: dm3      119: dm2      120: dm1    
121: droMoj2  122: droMoj1  123: droPer1  124: dp3      125: dp2    
126: droSec1  127: droSim1  128: droVir2  129: droVir1  130: droYak2
131: droYak1  132: caePb2   133: caePb1   134: cb3      135: cb1    
136: ce10     137: ce6      138: ce4      139: ce2      140: caeJap1
141: caeRem3  142: caeRem2  143: priPac1  144: aplCal1  145: sacCer3
146: sacCer2  147: sacCer1  

Selection: 1
Loading…
Done
use chr1 automatically

  • plot chromosom 1 from the hg19 genome (chromosom 1 is the default)
p <- plotIdeogram(genome="hg19")
p

  • plot second chromosome
p <- plotIdeogram(genome="hg19",subchr="chr2" )
p

  • get rid of the cytoband
p <- plotIdeogram(genome="hg19",subchr="chr2",cytoband=F)
p

  • or change aspect ratio (you can change it also afterwards via + theme(aspect.ratio= )
p <- plotIdeogram(genome="hg19",subchr="chr2",aspect.ratio=1/10 )
p

  • have a closer look at the structure of p
attributes(p)
$.S3Class
[1] "gg"     "ggplot"

$class
[1] "ideogram"
attr(,"package")
[1] "ggbio"

$names
[1] "data"        "layers"      "scales"      "mapping"     "theme"      
[6] "coordinates" "facet"       "plot_env"    "labels"     

$xlab
[1] ""

$main
[1] ""

$ideogram

$xlabel
[1] FALSE

$ideogram.data
GRanges with 62 ranges and 2 metadata columns:
       seqnames                 ranges strand   |     name gieStain
          <Rle>              <IRanges>  <Rle>   | <factor> <factor>
   [1]     chr2   [       0,  4400000]      *   |    p25.3     gneg
   [2]     chr2   [ 4400000,  7100000]      *   |    p25.2   gpos50
   [3]     chr2   [ 7100000, 12200000]      *   |    p25.1     gneg
   [4]     chr2   [12200000, 16700000]      *   |    p24.3   gpos75
   [5]     chr2   [16700000, 19200000]      *   |    p24.2     gneg
   ...      ...                    ...    ... ...      ...      ...
  [58]     chr2 [225200000, 226100000]      *   |    q36.2     gneg
  [59]     chr2 [226100000, 231000000]      *   |    q36.3  gpos100
  [60]     chr2 [231000000, 235600000]      *   |    q37.1     gneg
  [61]     chr2 [235600000, 237300000]      *   |    q37.2   gpos50
  [62]     chr2 [237300000, 243199373]      *   |    q37.3     gneg
  ---
  seqlengths:
    chr1 chr10 chr11 chr12 chr13 chr14 ...  chr6  chr7  chr8  chr9  chrX  chrY
      NA    NA    NA    NA    NA    NA ...    NA    NA    NA    NA    NA    NA

$subchr
[1] "chr2"

$aspect.ratio
[1] 0.1

$color
[1] "red"

$fill
[1] "red"

$alpha
[1] 0.7

$size
[1] 1

$zoom.offset
[1] 0.1


  • so we see: data are stored within p, so it is not necessary to download them again
  • change aspect ratio (you can change it also afterwards via + theme(aspect.ratio= )
md <- attr(p,"ideogram.data")
plotIdeogram(md, aspect.ratio = 0.3)

  • so you can always download data and load it later using the getIdeogram() function, which has three arguments:
    • genome must be one of the result from ucscGenomes()$db
    • subchr defines a subset (character vectors)
    • cytobands must be TRUE or FALSE

  • you can also use the underlying functions available in the rtracklayer package
  • ucscGenomes() returns a data frame with information about the available UCSC genomes (identifier, specie, date, official name)
head(ucscGenomes())
db   species      date                               name
1    hg19     Human Feb. 2009 Genome Reference Consortium GRCh37
2    hg18     Human Mar. 2006                    NCBI Build 36.1
3    hg17     Human  May 2004                      NCBI Build 35
4    hg16     Human Jul. 2003                      NCBI Build 34
5 vicPac1    Alpaca Jul. 2008          Broad Institute VicPac1.0
6 dasNov3 Armadillo Dec. 2011            Broad Institute DasNov3
  • plot ideogram of hg19 (data are contained in the biovizBase package)
data(hg19IdeogramCyto, package="biovizBase" )
plotIdeogram(hg19IdeogramCyto)

  • add zoom.region
plotIdeogram(hg19IdeogramCyto,zoom.region=c(1e7,5e7) )

  • change color of the zoom region (border and filling)
plotIdeogram(hg19IdeogramCyto,zoom.region=c(1e7,5e7),fill="blue",color="blue" )

  • change zoom offset (zoom.offset) and size of the frame (size)
p <- plotIdeogram(hg19IdeogramCyto,zoom.region=c(1e7,5e7),fill="blue",color="blue",zoom.offset=4,size=2)
p

  • visualize chromosome 2 instead
p + xlim(GRanges("chr2", IRanges(1e7, 5e7)))

  • add labels on the x-axis
plotIdeogram(hg19IdeogramCyto,zoom.region=c(1e7,5e7),fill="blue",color="blue",zoom.offset=4,size=2,xlabel=T)

  • plot overview
autoplot(seqinfo(hg19Ideogram)[paste("chr",c(14:20,"X","Y"),sep="")])

Friday, July 19, 2013

R ggbio - tracks

  • container for a plot you want to align
  • constructor tracks()
  • first we produce two plots:
require(ggbio)
require(gridExtra)
df1 <- data.frame(time = 1:100, score = sin((1:100)/20)*10)
p1 <- ggplot(data = df1, aes(x = time, y = score)) +
      geom_line()
df2 <- data.frame(time = 30:120, score = sin((30:120)/20)*10, value = rnorm(120-30 + 1))
p2 <- ggplot(data = df2, aes(x = time, y = score)) +
    geom_line() +
    geom_point(size = 2, aes(color = value))
grid.arrange(p1,p2)

  • the two plots have different scale on the x-axis
  • we want to align them on exactly the same x-axis to make it more easy to compare each other
tracks(p1,p2)

  • now add a theme (included in the ggbio package)
tracks(time1=p1,time2=p2) +
    xlim(1,40) +
    theme_tracks_sunset()

  • add a title
tracks(time1=p1,time2=p2,title="My title")

  • set background colors (for each of the single plots)
tracks(time1=p1,time2=p2,track.plot.color=c("darkred","darkgreen"))

  • set background color (for the whole tracks)
tracks(time1=p1,time2=p2,track.bg.color="darkred")

  • set color of the label borders and the filling
tracks(time1=p1,time2=p2,label.bg.color="darkred",label.bg.fill="lightblue")

  • set color and size of the label text and the width of the labels
tracks(time1=p1,time2=p2,label.text.color="midnightblue",label.text.cex=2,label.width=unit(2.5,"cm"))

  • other zoom methods:
    • scale along with the limits of a GRange

require(GenomicRanges)
gr <- GRanges("chr", IRanges(1, 40))
tracks(time1 = p1, time2 = p2) + xlim(gr)

  • scale along with the limits of a IRange
require(GenomicRanges)
gr <- GRanges("chr", IRanges(1, 40))
tracks(time1 = p1, time2 = p2) + xlim(ranges(gr))

  • change limits afterwards
require(GenomicRanges)
gr <- GRanges("chr", IRanges(1, 40))
trks <- tracks(time1 = p1, time2 = p2) + xlim(ranges(gr))
xlim(trks)
[1]  1 40
xlim(trks) <- c(1,30)
trks

Thursday, July 18, 2013

R - multiple comparisons

along with chapter 16 of Draghici: Statistics and Data Analysis for Microarrays Using R and Bioconductor
  • main function in R to obtain corrected p-vals p.adjust
  • to get available methods type:
p.adjust.methods
[1] "holm"       "hochberg"   "hommel"     "bonferroni" "BH"        
[6] "BY"         "fdr"        "none"
  • simulate 10 raw p-vals on gene level
raw.p.vals <- runif(10,0,0.5)
names(raw.p.vals) <- paste("gene",1:length(raw.p.vals),sep="")
raw.p.vals
     gene1      gene2      gene3      gene4      gene5      gene6      gene7 
0.30128864 0.45386933 0.11451989 0.17423164 0.36300090 0.32262681 0.24504471 
     gene8      gene9     gene10 
0.27808264 0.08773618 0.49969999
  • get the corrected vals for example
    • Holm's stepwise correction

p.adjust(raw.p.vals,method="holm")
    gene1     gene2     gene3     gene4     gene5     gene6     gene7     gene8 
1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 
    gene9    gene10 
0.8773618 1.0000000
  • False Discovery Rate
p.adjust(raw.p.vals,method="fdr")
    gene1     gene2     gene3     gene4     gene5     gene6     gene7     gene8 
0.4537511 0.4997000 0.4537511 0.4537511 0.4537511 0.4537511 0.4537511 0.4537511 
    gene9    gene10 
0.4537511 0.4997000
  • the multtest package (available from bioconductor) contains also a function for correction (mt.rawp2adjp())
require(multtest)
mymethods <- c("Bonferroni", "Holm", "Hochberg", "SidakSS", "SidakSD","BH", "BY","ABH","TSBH")
mt.rawp2adjp(raw.p.vals,proc=mymethods)
$adjp
            rawp Bonferroni      Holm Hochberg   SidakSS   SidakSD        BH BY
 [1,] 0.08773618  0.8773618 0.8773618   0.4997 0.6007872 0.6007872 0.4537511  1
 [2,] 0.11451989  1.0000000 1.0000000   0.4997 0.7036615 0.6653358 0.4537511  1
 [3,] 0.17423164  1.0000000 1.0000000   0.4997 0.8525712 0.7837949 0.4537511  1
 [4,] 0.24504471  1.0000000 1.0000000   0.4997 0.9398532 0.8602188 0.4537511  1
 [5,] 0.27808264  1.0000000 1.0000000   0.4997 0.9615519 0.8602188 0.4537511  1
 [6,] 0.30128864  1.0000000 1.0000000   0.4997 0.9722682 0.8602188 0.4537511  1
 [7,] 0.32262681  1.0000000 1.0000000   0.4997 0.9796633 0.8602188 0.4537511  1
 [8,] 0.36300090  1.0000000 1.0000000   0.4997 0.9890001 0.8602188 0.4537511  1
 [9,] 0.45386933  1.0000000 1.0000000   0.4997 0.9976397 0.8602188 0.4997000  1
[10,] 0.49969999  1.0000000 1.0000000   0.4997 0.9990176 0.8602188 0.4997000  1
      ABH TSBH_0.05
 [1,]  NA 0.4537511
 [2,]  NA 0.4537511
 [3,]  NA 0.4537511
 [4,]  NA 0.4537511
 [5,]  NA 0.4537511
 [6,]  NA 0.4537511
 [7,]  NA 0.4537511
 [8,]  NA 0.4537511
 [9,]  NA 0.4997000
[10,]  NA 0.4997000

$index
 [1]  9  3  4  7  8  1  6  5  2 10

$h0.ABH
[1] NA

$h0.TSBH
h0.TSBH_0.05 
          10 

Warnmeldung:
In min(which(diff(h0.m, na.rm = TRUE) > 0), na.rm = TRUE) :
  kein nicht-fehlendes Argument für min; gebe Inf zurück

Wednesday, July 17, 2013

Ubuntu - merge several pdf

gs -dNOPAUSE -sDEVICE=pdfwrite -sOUTPUTFILE=app.pdf -dBATCH application.pdf 201302cv.pdf 

Sunday, July 7, 2013

R - Representation of microarray data

along with chapter 7 of Draghici: Statistics and Data Analysis for Microarrays Using R and Bioconductor
  • using the data of T. Golub's paper from 1999 on leukemia classification
  • the data is contained in a package and can be installed
source("http://bioconductor.org/biocLite.R")
biocLite("golubEsets")
  • then load the package and the data
require(golubEsets)
data(Golub_Merge)
Golub_Merge  
ExpressionSet (storageMode: lockedEnvironment)
assayData: 7129 features, 72 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: 39 40 ... 33 (72 total)
  varLabels: Samples ALL.AML ... Source (11 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
  pubMedIds: 10521349 
Annotation: hu6800

  • Golub_Merge is of class ExpressionSet
  • the ExpressionSet class is part of the Biobase package
  • this class is designed to combine several different sources of information into one single structure
  • is input for many Bioconductor functions
  • it consists of:
    • expression data from microarray experiments (assayData)
    • meta data describing samples in experiments (phenoData)
    • annotations and meta-data about the features on the chip or technology used for the experiment (featureData,annotation)
    • information related to the protocol used for processing each sample (and usually extracted from manufacturer files, protocolData)
    • and a exible structure to describe the experiment (experimentData)

  • so we can get experiment-level metadata along with the pubmed ID
experimentData(Golub_Merge)
Experiment data
  Experimenter name: Golub TR et al. 
  Laboratory: Whitehead 
  Contact information: 
 
  Title: ALL/AML discrimination 
  URL: www-genome.wi.mit.edu/mpr/data_set_ALL_AML.html 
  PMIDs: 10521349 

  Abstract: A 133 word abstract is available. Use 'abstract' method.

  • show the first part of the abstract
substr(abstract(Golub_Merge),1,102)
[1] "Although cancer classification has improved over the past 30 years, there has been no general approach"
  • one get get the dimension of the expression data
dim(exprs(Golub_Merge))
  • look at the five rows of the first 5 columns
exprs(Golub_Merge)[1:5,1:5]
                 39   40   42   47   48
AFFX-BioB-5_at -342  -87   22 -243 -130
AFFX-BioB-M_at -200 -248 -153 -218 -177
AFFX-BioB-3_at   41  262   17 -163  -28
AFFX-BioC-5_at  328  295  276  182  266
AFFX-BioC-3_at -224 -226 -211 -289 -170
  • retrieve information on experimental phenotypes (again we look only at the first five samples/rows)
pData(Golub_Merge)[1:5,]
Samples ALL.AML BM.PB T.B.cell  FAB      Date Gender pctBlasts Treatment
39      39     ALL    BM   B-cell <NA>                F        NA      <NA>
40      40     ALL    BM   B-cell <NA> 5/16/1980      F        NA      <NA>
42      42     ALL    BM   B-cell <NA>      <NA>      F        NA      <NA>
47      47     ALL    BM   B-cell <NA>  9/5/1986      M        NA      <NA>
48      48     ALL    BM   B-cell <NA> 2/28/1992      F        NA      <NA>
     PS Source
39 0.78   DFCI
40 0.68   DFCI
42 0.42   DFCI
47 0.81   DFCI
48 0.94   DFCI

  • how are the assay reporters named?
featureNames(Golub_Merge)[1:5]
[1] "AFFX-BioB-5_at" "AFFX-BioB-M_at" "AFFX-BioB-3_at" "AFFX-BioC-5_at"
[5] "AFFX-BioC-3_at"
  • how are the samples named?
sampleNames(Golub_Merge)
 [1] "39" "40" "42" "47" "48" "49" "41" "43" "44" "45" "46" "70" "71" "72" "68"
[16] "69" "67" "55" "56" "59" "52" "53" "51" "50" "54" "57" "58" "60" "61" "65"
[31] "66" "63" "64" "62" "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11"
[46] "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26"
[61] "27" "34" "35" "36" "37" "38" "28" "29" "30" "31" "32" "33"
  • show the distribution of the primary outcome
table(Golub_Merge$ALL.AML)
ALL AML 
 47  25

R - annotation of a microarray platform

along with chapter 7 of Draghici: Statistics and Data Analysis for Microarrays Using R and Bioconductor

  • use the annotation() accessor to get the type of microarray used

annotation(Golub_Merge)

  • so we use the hu6800.db package, which provides detailed information about the hu6800 platform. The package is updated biannually.
  • install the package via biocLite("hu6800.db")
  • if you want to see what objects this package supports type

ls("package:hu6800.db")

[1] "hu6800"              "hu6800ACCNUM"        "hu6800ALIAS2PROBE"  
 [4] "hu6800CHR"           "hu6800CHRLENGTHS"    "hu6800CHRLOC"       
 [7] "hu6800CHRLOCEND"     "hu6800.db"           "hu6800_dbconn"      
[10] "hu6800_dbfile"       "hu6800_dbInfo"       "hu6800_dbschema"    
[13] "hu6800ENSEMBL"       "hu6800ENSEMBL2PROBE" "hu6800ENTREZID"     
[16] "hu6800ENZYME"        "hu6800ENZYME2PROBE"  "hu6800GENENAME"     
[19] "hu6800GO"            "hu6800GO2ALLPROBES"  "hu6800GO2PROBE"     
[22] "hu6800MAP"           "hu6800MAPCOUNTS"     "hu6800OMIM"         
[25] "hu6800ORGANISM"      "hu6800ORGPKG"        "hu6800PATH"         
[28] "hu6800PATH2PROBE"    "hu6800PFAM"          "hu6800PMID"         
[31] "hu6800PMID2PROBE"    "hu6800PROSITE"       "hu6800REFSEQ"       
[34] "hu6800SYMBOL"        "hu6800UNIGENE"       "hu6800UNIPROT"

  • to get the mappings between manufacturer identifiers and gene abbreviations you can use hu6800SYMBOL and mget()
mget(ls(hu6800SYMBOL)[1:5],hu6800SYMBOL )
$A28102_at
[1] "GABRA3"

$AB000114_at
[1] "OMD"

$AB000115_at
[1] "IFI44L"

$AB000220_at
[1] "SEMA3C"

$AB000381_s_at
[1] "GML"
  • you can also use the DBI package infrastructure
    • list the tables available for NCBI Entrez Gene metadata

require(DBI)
hh <- org.Hs.eg_dbconn()
dbListTables(hh)
[1] "accessions"            "alias"                 "chrlengths"           
 [4] "chromosome_locations"  "chromosomes"           "cytogenetic_locations"
 [7] "ec"                    "ensembl"               "ensembl2ncbi"         
[10] "ensembl_prot"          "ensembl_trans"         "gene_info"            
[13] "genes"                 "go"                    "go_all"               
[16] "go_bp"                 "go_bp_all"             "go_cc"                
[19] "go_cc_all"             "go_mf"                 "go_mf_all"            
[22] "kegg"                  "map_counts"            "map_metadata"         
[25] "metadata"              "ncbi2ensembl"          "omim"                 
[28] "pfam"                  "prosite"               "pubmed"               
[31] "refseq"                "sqlite_stat1"          "ucsc"                 
[34] "unigene"               "uniprot"
  • query for genes annotated as sarcoma oncogenes
SQL <- "select * from gene_info gi, genes g where gi._id=g._id and gene_name like '%arcoma%oncogene%'"
sops <- dbGetQuery(hh,SQL)
head(sops)
_id                                                     gene_name symbol
1  314              v-raf murine sarcoma 3611 viral oncogene homolog   ARAF
2  559                v-raf murine sarcoma viral oncogene homolog B1   BRAF
3 1156             v-crk sarcoma virus CT10 oncogene homolog (avian)    CRK
4 1157        v-crk sarcoma virus CT10 oncogene homolog (avian)-like   CRKL
5 1834                                       feline sarcoma oncogene    FES
6 1859 Gardner-Rasheed feline sarcoma viral (v-fgr) oncogene homolog    FGR
   _id gene_id
1  314     369
2  559     673
3 1156    1398
4 1157    1399
5 1834    2242
6 1859    2268
  • find the LYN gene
sops[grep("LYN",sops$symbol),]
  • so we see it's in the 13th row and the gene id is 4067
sops[grep("LYN",sops$symbol),]
  • the we can use get() to get the probe set identifier
get("4067",revmap(hu6800ENTREZID))
[1] "M16038_at"
  • display the expression data of the gene by disease group
boxplot(exprs(Golub_Merge)["M16038_at",]~Golub_Merge$ALL.AML,col=c("darkgreen","gold"),notch=T)

  • and the ggplot2 version
require(ggplot2)
gmdata <- data.frame(expr=exprs(Golub_Merge)["M16038_at",],ALL.AML=Golub_Merge$ALL.AML)
ggplot(gmdata,aes(x=ALL.AML,y=expr,fill=ALL.AML)) +
    geom_boxplot(notch=T) +
    scale_fill_manual(values=c("darkgreen","gold")) +
    xlab("")
ggsave("ggboxpl.png")

Thursday, June 20, 2013

ls - show only folders

use du

du ~/ --max-depth=1
22576   /home/mandy/.mozilla
4       /home/mandy/Öffentlich
2956    /home/mandy/.config
28      /home/mandy/.icons
16      /home/mandy/.adobe
4714980 /home/mandy/klinik
4       /home/mandy/Vorlagen
4       /home/mandy/fserver
12      /home/mandy/.dbus
4       /home/mandy/Dokumente
19628   /home/mandy/Downloads
120     /home/mandy/.rstudio-desktop
36      /home/mandy/.compiz
5258420 /home/mandy/ggmbh
2940    /home/mandy/.local
44      /home/mandy/.emacs.d
8       /home/mandy/.ssh
8       /home/mandy/.gnome2
44      /home/mandy/.gnupg
4       /home/mandy/Videos
356     /home/mandy/Arbeitsfläche
52      /home/mandy/.gconf
15245532        /home/mandy/fileserver
4       /home/mandy/.hplip
95084   /home/mandy/.cpan
4       /home/mandy/Musik
64      /home/mandy/.macromedia
4       /home/mandy/Ubuntu One
8590040 /home/mandy/servercn
628     /home/mandy/Bilder
4       /home/mandy/.gnome2_private
4       /home/mandy/server170
110976  /home/mandy/R
107880  /home/mandy/.cache
36      /home/mandy/.cpanm
28      /home/mandy/.shutter
92      /home/mandy/.thumbnails
34173652        /home/mandy/

Wednesday, June 19, 2013

R - Grammar of Graphics Figure 2.1


par(mar=c(1,1,1,1))
openplotmat()
elpos <- coordinates(3)
fromto <- matrix(ncol=2,byrow = T,data=c(1,2,2,3))
nr <- nrow(fromto)
arrpos <- matrix(ncol=2,nrow=nr)

for(i in 1:nr){
    arrpos[i,] <- straightarrow(to=elpos[fromto[i,2],],
                                from=elpos[fromto[i,1],],
                                lwd = 2, arr.pos = 0.68,
                                arr.length = 0.5)
}

textrect(elpos[1,],0.1,0.09,lab="Source",shadow.col = NULL)
textellipse(elpos[2,],0.1,0.09,lab="Make a pie",shadow.col = NULL)
textrect(elpos[3,],0.1,0.09,lab="Render",shadow.col = NULL)

text(arrpos[1,1]-0.07,arrpos[1,2]-0.08,"Data")
text(arrpos[2,1]-0.06,arrpos[2,2]-0.08,"Graphics")


if installing perl modules via CPAN does not work


  • check if proxy configured correctly(o conf init /proxy/)
  • maybe you should remove files in ~/.cpan/sources/modules

Tuesday, June 18, 2013

R - Grammar of Graphics Figure 1.1 redone with ggplot


An Example (1.4)


  • first we have to get the data which are birth and death rates of the year 1990, therefore we use the world bank data (and of course, there is a package WDI providing direct access)
  • so load the package and download a list of indicators (WDIcache)
  • the resulting data frame contains indicator and name, and additional information (data description, source)

require(WDI)
indicators <- as.data.frame(WDIcache()$series)
names(indicators)
[1] "indicator"          "name"               "description"       
[4] "sourceDatabase"     "sourceOrganization"
  • then we have to find our two parameters of interest: the crude birth and the crude death rate:
    • we use grep on the column name to find them

grep("crude",indicators$name)
[1] 6553 6554
  • so we know that there are two lines containing the string crude in the name variable, so let's show them
indicators[grep("crude",indicators$name),]
indicator                                 name
6553 SP.DYN.CBRT.IN Birth rate, crude (per 1,000 people)
6554 SP.DYN.CDRT.IN Death rate, crude (per 1,000 people)
                                                                                                                                                                                                                                                                                                   description
6553 Crude birth rate indicates the number of live births occurring during the year, per 1,000 population estimated at midyear. Subtracting the crude death rate from the crude birth rate provides the rate of natural increase, which is equal to the rate of population change in the absence of migration.
6554      Crude death rate indicates the number of deaths occurring during the year, per 1,000 population estimated at midyear. Subtracting the crude death rate from the crude birth rate provides the rate of natural increase, which is equal to the rate of population change in the absence of migration.
                   sourceDatabase
6553 World Development Indicators
6554 World Development Indicators
                                                                                                                                                                                                                                                                                                                                                                                                                         sourceOrganization
6553 (1) United Nations Population Division. World Population Prospects, (2) United Nations Statistical Division. Population and Vital Statistics Reprot (various years), (3) Census reports and other statistical publications from national statistical offices, (4) Eurostat: Demographic Statistics, (5) Secretariat of the Pacific Community: Statistics and Demography Programme, and (6) U.S. Census Bureau: International Database.
6554 (1) United Nations Population Division. World Population Prospects, (2) United Nations Statistical Division. Population and Vital Statistics Reprot (various years), (3) Census reports and other statistical publications from national statistical offices, (4) Eurostat: Demographic Statistics, (5) Secretariat of the Pacific Community: Statistics and Demography Programme, and (6) U.S. Census Bureau: International Database.
  • now we now the indicators - and we can download the desired data (and just in case: we download them for the years 1980–2012)
rate.data <- WDI(country="all",indicator=indicators[grep("crude",indicators$name),1],start=1980,end=2012)
rate.data.1990 <- rate.data[rate.data$year==1990,]
head(rate.data.1990)
iso2c                                 country year SP.DYN.CBRT.IN
11     1A                              Arab World 1990       34.71824
44     1W                                   World 1990       25.85034
77     4E   East Asia & Pacific (developing only) 1990       22.84601
110    7E Europe & Central Asia (developing only) 1990       17.90782
143    8S                              South Asia 1990       32.91213
176    AD                                 Andorra 1990             NA
    SP.DYN.CDRT.IN
11        8.262026
44        9.266023
77        6.997506
110      10.065765
143      10.703840
176             NA


  • now we have to remove aggregations from these data (such as the European Union, Middle East, etc)
    • we use grepl on the iso2c column (which does the same as grep but returns logical values (of course you can also use grep again))
    • the iso2c is a two character, standardized country code
    • we eliminate all rows with iso2c starting with X, equals EU, containing a number, or equals ZQ, ZJ, ZG, or ZF

rate.data.1990 <- rate.data.1990[!grepl("^X|EU|\\d|Z[QJGF]",rate.data.1990$iso2c,perl=T),]
head(rate.data.1990)


    iso2c              country year SP.DYN.CBRT.IN SP.DYN.CDRT.IN
176    AD              Andorra 1990             NA             NA
209    AE United Arab Emirates 1990         25.916          2.794
242    AF          Afghanistan 1990         52.449         22.062
275    AG  Antigua and Barbuda 1990         20.100          6.800
308    AL              Albania 1990         24.610          5.909
341    AM              Armenia 1990         21.215          7.738
  • so that's the data frame
  • now the graphics
require(ggplot2)
require(gridExtra)
require(scales)

rate.data.1990$lab <- ifelse(rbinom(size=1,n=nrow(rate.data.1990),prob=0.2)==1,rate.data.1990$country,"")
ggplot(rate.data.1990,aes(x=SP.DYN.CBRT.IN,y=SP.DYN.CDRT.IN)) +
    stat_density2d(aes(colour= ..level..),bins=6,h=c(11,9),geom="density2d") +
    scale_x_continuous("Birth Rate", limits = c(0,60),breaks=seq(0,60,by=10),expand=c(0,1)) +
    scale_y_continuous("Death Rate", limits = c(0,30),breaks=seq(0,30,by=10),expand=c(0,1)) +
    scale_colour_gradientn(colours=c("blue","green","lightgreen","red"),guide="none") +
    geom_text(aes(label=lab),size=3,position = "jitter") +
    geom_abline(intersect=0,slope=1) +
    coord_fixed() +
    annotate("text",x=20,y=20,angle=45, label="Zero Population Growth",vjust=-0.5,size=4) +
    theme(
        panel.background=element_blank(),
        panel.border=element_rect(colour="black",fill="transparent"),
        axis.text = element_text(colour="black"),
        axis.ticks = element_line(colour="black"),
        axis.line = element_line(colour="black")
        ) 


  • note:
    • the plot in chapter 1.4 of the book uses an epanechnikov kernel, whereas ggplot uses a normal one
    • I choose randomly about 20% of the country names as labels (because I was to lazy to pick up exactly those used in the book)


Sunday, June 16, 2013

R - First steps working with microarray data

Working with Microarray Data

along with chapter 19 Lewis, R for Medicine and Biology

raw CEL files

  • the bioconductor packages affy and simpleaffy allow to work with raw microarray data
  • also carry out quality control and simple data analysis
#source("http://bioconductor.org/biocLite.R")
#biocLite("affy")
#biocLite("simpleaffy")
org_babel_R_eoe
  • get some raw data

download.file(url="ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE5nnn/GSE5090/suppl/GSE5090_RAW.tar",destfile="arraydata.tar")
versuche URL 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE5nnn/GSE5090/suppl/GSE5090_RAW.tar'
ftp data connection made, file length 63569920 bytes
URL geöffnet
=================================================
downloaded 60.6 Mb
  • read in data
setwd("~/affyfiles")
 data.raw <- read.affy("covdesc")
setwd("~/")
  • the function read.affy() gives back a object of class AffyBatch (which extends the eSet class - Biobase package)
  • the cdfName slot: Affymetrix array plotform used
  • nrow slot: number of rows in the Affymetrix chip
data.raw@cdfName
[1] "HG-U133A"
data.raw@nrow
Rows 
 712
data.raw@ncol

Cols 
 712
  • so the HG-U133A array has 712 rows and 712 columns -> 506944 spots per chip
  • obtain probe-level data from each individual spot using the exprs() function
    • e.g. intesity data for the last four spots on the first array of the data.raw object

intensity(data.raw)[506940:506944,1]
506940  506941  506942  506943  506944 
  109.0 10380.5   175.3 10044.3   158.0

Normalizing/Quality Control

  • QC methods:
    • density plots of PM intensities
    • RNA degradation plots
    • MvA plots
  • MvA plots

    • assess how similar probe-level data is between samples in the same subgroup, detection of any bias in intensities between samples by assessment of the curve produced
    • M represents log fold change between two samples (y-axis)
    • A represents the mean absolute log expression between samples (x-axis)
    setwd("~/affyfiles")
    require(simpleaffy)
    require(affy)
    data.raw <- read.affy("covdesc")
    setwd("../")
    
    MAplot(data.raw[c(1,4)],pairs=T)
    

    • normalize data and plot again
    data.norm <- normalize(data.raw)
    MAplot(data.norm[c(1,4)],pairs=T)
    

From probe-level data to expression data

  • create expression levels from the raw data
  • create a PMA detection call and the associated p-values from the first array
#  data.mas5 <- call.exprs(data.raw,"mas5")
#  sample1 <- detection.p.val(data.raw[,1],calls=T)
  cbind(sample1$call[1:10],sample1$pval[1:10])
Warnmeldung:
In data.raw[c(1, 4)] :
  The use of abatch[i,] and abatch[i] is deprecated. Please use abatch[,i] instead.
X11cairo 
       2
      [,1] [,2]                 
 [1,] "P"  "0.00261701434183198"
 [2,] "P"  "0.00486279317241156"
 [3,] "P"  "0.00564280932389143"
 [4,] "P"  "0.00418036711173471"
 [5,] "A"  "0.378184514727312"  
 [6,] "P"  "0.00306673915877408"
 [7,] "P"  "0.00564280932389143"
 [8,] "A"  "0.0894050784997029" 
 [9,] "P"  "0.00116478895452843"
[10,] "A"  "0.0813367142140589"
  • qc() generates commonly used quality control metrics incl:
    • average background
    • scale factor
    • number of genes called present
    • 3' to 5' end ratios of beta-actin and GADPH genes

data.qc <- qc(data.raw,data.mas5)
plot(data.qc)



create a list of differentially expressed genes


  • data.mas5 is an instance of the AffyBatch class, so we could apply AffyBatch functions
  • exprs() returns expression-level data different probes
    • e.g. expression levels computed by the Mas 5.0 algorithm for the first 10 probes on the array for four control samples
exprs(data.mas5)[1:10,1:4]
GSM114834.CEL GSM114840.CEL GSM114841.CEL GSM114842.CEL
1007_s_at      7.623914      7.767622      7.799031      7.991894
1053_at        4.748147      5.127184      5.149059      4.975323
117_at         5.552385      6.005922      5.376999      5.947452
121_at         7.789410      7.900269      7.845280      8.188128
1255_g_at      1.254626      3.157348      3.842650      3.800311
1294_at        6.594926      6.445978      7.237356      6.657093
1316_at        5.118859      5.160522      4.894248      4.714638
1320_at        5.110600      5.188031      5.133438      5.185311
1405_i_at      6.766398      6.799344      5.578731      5.805039
1431_at        2.900998      3.487406      4.493393      3.641552
  • the function parwise.comparison() generates fold changes, t-tests, and means for pairs of experimental groups
  • pairwise.comparison() takes as input an exprSet object, value is an instance of the PairComp class
  • pairwise.filter() takes as input an PairComp object, filter regarding to:
    • minimum expression level
    • minimum number of arrays in which a gene is called present across all groups or by group
    • minimum fold change
    • maximum t-test p-val for a gene

  • example: filter results of pairwise comparison using:
    • genes must be called present in at least three samples in one of the groups
    • t-test p-val must be less then 0.01
    • fold change between means of the two groups must be at least 1.5
pair <- pairwise.comparison(data.mas5,group="disease",members=c("control", "polycystic_ovary_syndrome"),spots=data.raw)
pair.filt <- pairwise.filter(pair,min.present.no=3,present.by.group=T,fc=log2(1.5),tt=0.05)
[1] "Checking member control in group: ' disease '"
[1] "Checking member polycystic_ovary_syndrome in group: ' disease '"
  • the pair.filt object is also an instance of the PairComp class
  • how many genes have been returned?:
nrow(pair.filt@means)
[1] 54
pair.filt@means[1:10,1:2]
control polycystic_ovary_syndrome
200879_s_at  7.198598                  6.585865
200951_s_at  4.417044                  5.126597
200974_at   10.134294                  9.549187
201242_s_at  7.949515                  8.556699
201468_s_at  8.909398                  9.676864
201496_x_at  6.486854                  5.289261
201497_x_at  9.179405                  8.257583
202040_s_at  6.770394                  7.379680
202104_s_at  6.888542                  6.274534
202274_at    6.554181                  5.746637
  • view the PMA detection calls using the calls slot
pair.filt@calls[1:10,]
GSM114834.CEL.present GSM114840.CEL.present GSM114841.CEL.present
200879_s_at "P"                   "P"                   "P"                  
200951_s_at "P"                   "P"                   "A"                  
200974_at   "P"                   "P"                   "P"                  
201242_s_at "P"                   "P"                   "P"                  
201468_s_at "P"                   "P"                   "P"                  
201496_x_at "P"                   "P"                   "P"                  
201497_x_at "P"                   "P"                   "P"                  
202040_s_at "P"                   "P"                   "P"                  
202104_s_at "P"                   "P"                   "P"                  
202274_at   "P"                   "A"                   "P"                  
            GSM114842.CEL.present GSM114843.CEL.present GSM114844.CEL.present
200879_s_at "P"                   "P"                   "P"                  
200951_s_at "P"                   "A"                   "A"                  
200974_at   "P"                   "P"                   "P"                  
201242_s_at "P"                   "P"                   "P"                  
201468_s_at "P"                   "P"                   "P"                  
201496_x_at "P"                   "P"                   "P"                  
201497_x_at "P"                   "P"                   "P"                  
202040_s_at "P"                   "P"                   "P"                  
202104_s_at "P"                   "A"                   "P"                  
202274_at   "A"                   "P"                   "P"                  
            GSM114845.CEL.present GSM114846.CEL.present GSM114847.CEL.present
200879_s_at "P"                   "P"                   "P"                  
200951_s_at "A"                   "P"                   "A"                  
200974_at   "P"                   "P"                   "P"                  
201242_s_at "P"                   "P"                   "P"                  
201468_s_at "P"                   "P"                   "P"                  
201496_x_at "P"                   "P"                   "A"                  
201497_x_at "P"                   "P"                   "P"                  
202040_s_at "P"                   "P"                   "P"                  
202104_s_at "M"                   "P"                   "P"                  
202274_at   "P"                   "P"                   "A"                  
            GSM114848.CEL.present GSM114849.CEL.present GSM114850.CEL.present
200879_s_at "P"                   "P"                   "P"                  
200951_s_at "A"                   "A"                   "A"                  
200974_at   "P"                   "P"                   "P"                  
201242_s_at "P"                   "P"                   "P"                  
201468_s_at "P"                   "P"                   "P"                  
201496_x_at "P"                   "P"                   "P"                  
201497_x_at "P"                   "P"                   "P"                  
202040_s_at "P"                   "P"                   "P"                  
202104_s_at "M"                   "P"                   "P"                  
202274_at   "A"                   "P"                   "A"                  
            GSM114851.CEL.present GSM114852.CEL.present GSM114853.CEL.present
200879_s_at "P"                   "P"                   "P"                  
200951_s_at "A"                   "A"                   "P"                  
200974_at   "P"                   "P"                   "P"                  
201242_s_at "P"                   "P"                   "P"                  
201468_s_at "P"                   "P"                   "P"                  
201496_x_at "P"                   "P"                   "A"                  
201497_x_at "P"                   "P"                   "A"                  
202040_s_at "P"                   "P"                   "P"                  
202104_s_at "P"                   "P"                   "P"                  
202274_at   "P"                   "P"                   "A"                  
            GSM114854.CEL.present GSM114855.CEL.present
200879_s_at "P"                   "P"                  
200951_s_at "A"                   "A"                  
200974_at   "P"                   "P"                  
201242_s_at "P"                   "P"                  
201468_s_at "P"                   "P"                  
201496_x_at "M"                   "P"                  
201497_x_at "M"                   "P"                  
202040_s_at "P"                   "P"                  
202104_s_at "P"                   "M"                  
202274_at   "A"                   "P"
  • fold change between the two groups for each gene fc slot
pair.filt@fc[1:10]
200879_s_at 200951_s_at   200974_at 201242_s_at 201468_s_at 201496_x_at 
  0.6127330  -0.7095537   0.5851071  -0.6071841  -0.7674657   1.1975929 
201497_x_at 202040_s_at 202104_s_at   202274_at 
  0.9218226  -0.6092852   0.6140086   0.8075434
  • t-test p-vals
pair.filt@tt[1:10]
200879_s_at 200951_s_at   200974_at 201242_s_at 201468_s_at 201496_x_at 
0.045650045 0.023617284 0.044967617 0.046132070 0.037101434 0.024593573 
201497_x_at 202040_s_at 202104_s_at   202274_at 
0.047164613 0.001084647 0.017975104 0.033348173

Wednesday, June 12, 2013

R - change wilcard pattern into regular expression

  • the command glob2rx takes a string containing wildcards (such as * or ?) into an equivalent regular expression
glob2rx("pa??ern")
^pa..ern$
  • the arguments trim.head and trim.tail could be set to determine whether or not the leading "^" or the trailing "$" should be trimmed from the result
glob2rx("*pa??ern",trim.head=TRUE)
pa..ern$
glob2rx("pa??e.n*",trim.tail=TRUE)

Monday, June 10, 2013

R - first steps with GEOquery

Along with chapter 18 of Lewis, R for Medicine and Biology

  • Installing the packages

    source("http://bioconductor.org/biocLite.R")
    biocLite("simpleaffy")
    biocLite("GEOquery")
    
    GEOquery
    
  • Gene Expression Omnibus Repository (GEO)

    • public repository
    • allows submitting of high-throughput experimental data for free access by others
    • includes single- and dual-channel microarrays, measuring mRNA, miRNA, genomic DNA (incl. arrayCGH and SNP) and protein abundance
    • also contains a collection of web-based tools for querying and downloading of datasets
    • for R there exists the GEOquery package
  • sample dataset: Polycystic ovary syndrome: adipose tissue

    • get data
    library(GEOquery)
    gds <- getGEO("GDS2084")
    Meta(gds)
    
    Using locally cached version of GDS2084 found here:
    /tmp/Rtmpa2mlCM/GDS2084.soft.gz
    $channel_count
    [1] "1"
    
    $dataset_id
    [1] "GDS2084" "GDS2084"
    
    $description
    [1] "Analysis of omental adipose tissues of morbidly obese patients with polycystic ovary syndrome (PCOS). PCOS is a common hormonal disorder among women of reproductive age, and is characterized by hyperandrogenism and chronic anovulation. PCOS is associated with obesity."
    [2] "control"                                                                                                                                                                                                                                                                     
    [3] "polycystic ovary syndrome"                                                                                                                                                                                                                                                   
    
    $email
    [1] "geo@ncbi.nlm.nih.gov"
    
    $feature_count
    [1] "22283"
    
    $institute
    [1] "NCBI NLM NIH"
    
    $name
    [1] "Gene Expression Omnibus (GEO)"
    
    $order
    [1] "none"
    
    $platform
    [1] "GPL96"
    
    $platform_organism
    [1] "Homo sapiens"
    
    $platform_technology_type
    [1] "in situ oligonucleotide"
    
    $pubmed_id
    [1] "17062763"
    
    $ref
    [1] "Nucleic Acids Res. 2005 Jan 1;33 Database Issue:D562-6"
    
    $reference_series
    [1] "GSE5090"
    
    $sample_count
    [1] "15"
    
    $sample_id
    [1] "GSM114841,GSM114844,GSM114845,GSM114849,GSM114851,GSM114854,GSM114855"          
    [2] "GSM114834,GSM114842,GSM114843,GSM114847,GSM114848,GSM114850,GSM114852,GSM114853"
    
    $sample_organism
    [1] "Homo sapiens"
    
    $sample_type
    [1] "RNA"
    
    $title
    [1] "Polycystic ovary syndrome: adipose tissue"
    
    $type
    [1] "Expression profiling by array" "disease state"                
    [3] "disease state"                
    
    $update_date
    [1] "Mar 21 2007"
    
    $value_type
    [1] "count"
    
    $web_link
    [1] "http://www.ncbi.nlm.nih.gov/geo"
    


    • display individual expression values
      • first column (ID_REF): probe ID assigned by Affymetrix for each probe
      • second column (IDENTIFIER): ID for the corresponding transcript
      • remaining columns: expression values returned for each array

    Table(gds)[1:10,1:5]
    
    ID_REF IDENTIFIER GSM114841 GSM114844 GSM114845
    1  1007_s_at       DDR1     222.6     252.7     219.3
    2    1053_at       RFC2      35.5      24.5      23.4
    3     117_at      HSPA6      41.5      53.3      31.3
    4     121_at       PAX8     229.8     419.6     274.5
    5  1255_g_at     GUCA1A      14.3        13      29.6
    6    1294_at       UBA7     150.8       116      89.9
    7    1316_at       THRA      29.7      35.4        53
    8    1320_at     PTPN21      35.1      44.8      28.8
    9  1405_i_at       CCL5      47.8      53.2      41.7
    10   1431_at     CYP2E1      22.5      24.9      38.7
    
    • get disease status associated with each sample
    Columns(gds)[,1:2]
    
    sample             disease.state
    1  GSM114841                   control
    2  GSM114844                   control
    3  GSM114845                   control
    4  GSM114849                   control
    5  GSM114851                   control
    6  GSM114854                   control
    7  GSM114855                   control
    8  GSM114834 polycystic ovary syndrome
    9  GSM114842 polycystic ovary syndrome
    10 GSM114843 polycystic ovary syndrome
    11 GSM114847 polycystic ovary syndrome
    12 GSM114848 polycystic ovary syndrome
    13 GSM114850 polycystic ovary syndrome
    14 GSM114852 polycystic ovary syndrome
    15 GSM114853 polycystic ovary syndrome
    
    • information about source of samples
    Columns(gds)[,3]  
    
    [1] "Value for GSM114841: EP3_adipose_control; src: Omental adipose tissue"      
     [2] "Value for GSM114844: EP23_adipose_control; src: Omental adipose tissue"     
     [3] "Value for GSM114845: EP31_adipose_control_rep1; src: Omental adipose tissue"
     [4] "Value for GSM114849: EP37_adipose_control; src: Omental adipose tissue"     
     [5] "Value for GSM114851: EP49_adipose_control; src: Omental adipose tissue"     
     [6] "Value for GSM114854: EP69_adipose_control; src: Omental adipose tissue"     
     [7] "Value for GSM114855: EP71_adipose_control; src: Omental adipose tissue"     
     [8] "Value for GSM114834: EP1_adipose_pcos_rep1; src: Omental adipose tissue"    
     [9] "Value for GSM114842: EP10_adipose_pcos; src: Omental adipose tissue"        
    [10] "Value for GSM114843: EP18_adipose_pcos; src: Omental adipose tissue"        
    [11] "Value for GSM114847: EP32_adipose_pcos; src: Omental adipose tissue"        
    [12] "Value for GSM114848: EP34_adipose_pcos; src: Omental adipose tissue"        
    [13] "Value for GSM114850: EP47_adipose_pcos; src: Omental adipose tissue"        
    [14] "Value for GSM114852: EP55_adipose_pcos; src: Omental adipose tissue"        
    [15] "Value for GSM114853: EP66_adipose_pcos; src: Omental adipose tissue"
    
  • get information in more detail for the first sample (GSM114841)

    gsm <- getGEO("GSM114841")
    Meta(gsm)
    
    Using locally cached version of GSM114841 found here:
    /tmp/Rtmpa2mlCM/GSM114841.soft
    $biomaterial_provider_ch1
    [1] "Ramón y Cajal Hospital, Madrid, Spain"
    
    $channel_count
    [1] "1"
    
    $characteristics_ch1
    [1] "Morbidly obese control subject"
    
    $contact_address
    [1] "ARTURO DUPERIER"
    
    $contact_city
    [1] "MADRID"
    
    $contact_country
    [1] "Spain"
    
    $contact_email
    [1] "bperal@iib.uam.es"
    
    $contact_fax
    [1] "34 91 5854401"
    
    $contact_institute
    [1] "INSTITUTO DE INVESTIGACIONES BIOMEDICAS, CSIC-UAM"
    
    $contact_name
    [1] "BELEN,,PERAL"
    
    $contact_phone
    [1] "34 91 5854478"
    
    $contact_state
    [1] "MADRID"
    
    $`contact_zip/postal_code`
    [1] "28029"
    
    $data_processing
    [1] "MAS 5.0, scaled to 100 and RMA"
    
    $data_row_count
    [1] "22283"
    
    $description
    [1] "Total RNA was extracted from omental  adipose tissue from a control subject"
    
    $geo_accession
    [1] "GSM114841"
    
    $label_ch1
    [1] "Biotin"
    
    $last_update_date
    [1] "Jun 16 2006"
    
    $molecule_ch1
    [1] "total RNA"
    
    $organism_ch1
    [1] "Homo sapiens"
    
    $platform_id
    [1] "GPL96"
    
    $series_id
    [1] "GSE5090"
    
    $source_name_ch1
    [1] "Omental adipose tissue"
    
    $status
    [1] "Public on Jun 17 2006"
    
    $submission_date
    [1] "Jun 16 2006"
    
    $supplementary_file
    [1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM114nnn/GSM114841/suppl/GSM114841.CEL.gz"
    [2] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM114nnn/GSM114841/suppl/GSM114841.EXP.gz"
    
    $taxid_ch1
    [1] "9606"
    
    $title
    [1] "EP3_adipose_control"
    
    $type
    [1] "RNA"
    
    • look at the additional data
    Columns(gsm)
    
    ID_REF
    VALUESignal intensity - MAS 5.0, scaled to 100 and RMA
    ABS_CALLPresence/absence of gene transcript in sample; the call in an absolute analysis that indicates if the transcript was present (P), absent (A), marginal (M), or no call (NC)
    Detection p-valuep-value that indicates the significance level of the detection call
    • so there are four columns
      • probe ID
      • expression value (output from a Mas 5.0 analysis software)
      • Detection Call
      • p-val (Detection p-val)
    Table(gsm)[500:510,]
    
    Column
    1            ID_REF
    2             VALUE
    3          ABS_CALL
    4 Detection p-value
                                                                                                                                                                      Description
    1                                                                                                                                                                            
    2                                                                                                                           Signal intensity - MAS 5.0, scaled to 100 and RMA
    3 Presence/absence of gene transcript in sample; the call in an absolute analysis that indicates if the transcript was present (P), absent (A), marginal (M), or no call (NC)
    4                                                                                                         p-value that indicates the significance level of the detection call
            ID_REF VALUE ABS_CALL Detection p-value
    500   33307_at  47.8        P          0.039365
    501   33304_at  35.3        A          0.339558
    502   33197_at  58.4        P          0.017001
    503   33148_at  21.4        P          0.019304
    504   33132_at  15.8        A          0.189687
    505   32837_at   274        P          0.000959
    506   32836_at 319.8        P          0.006532
    507   32811_at 213.3        P          0.024711
    508   32723_at    35        P          0.004863
    509 32699_s_at  43.9        A          0.162935
    510   32625_at  67.1        A           0.11716
    
  • get a series record

    • contains lists for GPL platform object and the GSM sample record object
    • retrieve GSE data structure for our example in one go
    gse <- getGEO("GSE5090")
     gse
    
    Found 1 file(s)
    GSE5090_series_matrix.txt.gz
    Using locally cached version: /tmp/Rtmpa2mlCM/GSE5090_series_matrix.txt.gz
    Using locally cached version of GPL96 found here:
    /tmp/Rtmpa2mlCM/GPL96.soft
    $GSE5090_series_matrix.txt.gz
    ExpressionSet (storageMode: lockedEnvironment)
    assayData: 22283 features, 17 samples 
      element names: exprs 
    protocolData: none
    phenoData
      sampleNames: GSM114834 GSM114840 ... GSM114855 (17 total)
      varLabels: title geo_accession ... data_row_count (30 total)
      varMetadata: labelDescription
    featureData
      featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (22283 total)
      fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
      fvarMetadata: Column Description labelDescription
    experimentData: use 'experimentData(object)'
    Annotation: GPL96
    

Ubuntu - Samba file sharing



  • install samba
  • config file in: etc/samba (examples)
  • create the corresponding smb user: sudo smbpasswd -a user

Simple Example (German) here

Saturday, April 13, 2013

R - Hardin/Hilbe Generalized Linear Models and Extensions Chapter2 - R code

various choices of variance functions


  • function in 2.2 to illustrate the the effect of the assumed variance function (using the identity link for all functions)
x <- 1:3
y <- c(1,2,9)
glm(y~x,family=gaussian(link=identity))
Call:  glm(formula = y ~ x, family = gaussian(link = identity))

Coefficients:
(Intercept)            x  
         -4            4  

Degrees of Freedom: 2 Total (i.e. Null);  1 Residual
Null Deviance:      38 
Residual Deviance: 6    AIC: 16.59
glm(y~x,family=poisson(link=identity))
Call:  glm(formula = y ~ x, family = poisson(link = identity))

Coefficients:
(Intercept)            x  
       -2.4          3.2  

Degrees of Freedom: 2 Total (i.e. Null);  1 Residual
Null Deviance:      9.052 
Residual Deviance: 1.69         AIC: 14.36
glm(y~x,family=Gamma(link=identity))
Call:  glm(formula = y ~ x, family = Gamma(link = identity))

Coefficients:
(Intercept)            x  
     -1.799        2.744  

Degrees of Freedom: 2 Total (i.e. Null);  1 Residual
Null Deviance:      2.537 
Residual Deviance: 0.4385       AIC: 14.6
glm(y~x,family=inverse.gaussian(link=identity))
Call:  glm(formula = y ~ x, family = inverse.gaussian(link = identity))

Coefficients:
(Intercept)            x  
     -1.370        2.353  

Degrees of Freedom: 2 Total (i.e. Null);  1 Residual
Null Deviance:      0.8611 
Residual Deviance: 0.1181       AIC: 13.48

model on page 16 to illustrate the use of an offset



require(foreign)
medpar <- read.dta("glmext3/medpar.dta")
m <- glm(los ~ hmo + white + type2 + type3, data=medpar, family=poisson(link="log"))
summary(m)
Warnmeldung:
In read.dta("glmext3/medpar.dta") :
  Faktorbeschriftungen von Stata-5-Dateien können nicht gelesen werden

Call:
glm(formula = los ~ hmo + white + type2 + type3, family = poisson(link = "log"), 
    data = medpar)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.3063  -1.8259  -0.6319   1.0081  15.3827  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.33293    0.02721  85.744  < 2e-16 ***
hmo         -0.07155    0.02394  -2.988  0.00281 ** 
white       -0.15387    0.02741  -5.613 1.99e-08 ***
type2        0.22165    0.02105  10.529  < 2e-16 ***
type3        0.70948    0.02614  27.146  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 8901.1  on 1494  degrees of freedom
Residual deviance: 8142.7  on 1490  degrees of freedom
AIC: 13868

Number of Fisher Scoring iterations: 5
  • Wald test whether the coefficient on white is equal to -0.2
require(survey)
regTermTest(m,test.terms = "white",null=-0.2,method="Wald")
Wald test for white
 in glm(formula = los ~ hmo + white + type2 + type3, family = poisson(link = "log"), 
    data = medpar)
F =  2.831664  on  1  and  1490  df: p= 0.092632
  • define model with offset variable white=-0.2
  • do the likelihood-ratio test
m1 <- glm(los ~ hmo + type2 + type3 + offset(-0.2*white), data=medpar, family=poisson(link="log"))
anova(m1,m,test="Chisq")
Analysis of Deviance Table

Model 1: los ~ hmo + type2 + type3 + offset(-0.2 * white)
Model 2: los ~ hmo + white + type2 + type3
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
1      1491     8145.5                       
2      1490     8142.7  1   2.8657  0.09049 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Sunday, April 7, 2013

R - drop unused levels from factors

  • drops unused levels from factors
  • can be used on a factor or a data frame (which makes just sense when the data frame contains any factor)
  • syntax: droplevels(df)
  • create an example data frame
exdf <- data.frame(x=factor(1:10,labels=letters[1:10]),y=rbinom(10,1,0.5))
exdf
x y
1  a 0
2  b 1
3  c 1
4  d 0
5  e 1
6  f 1
7  g 0
8  h 0
9  i 1
10 j 1
  • subset exdf by y==1
exdf <- exdf[exdf$y==1,]
summary(exdf$x)
a b c d e f g h i j 
0 1 1 0 1 1 0 0 1 1
  • we see that all levels of y are present in the summary although several are not used anymore
exdf <- droplevels(exdf)
summary(exdf$x)
b c e f i j 
1 1 1 1 1 1

R - abbreviate strings automatically


  • abbreviate strings automatically
  • option minlength sets the number of characters
  • abbreviate in a way that they remain unique
data(armd.wide,package="nlmeU")
names(armd.wide)
 [1] "subject"  "lesion"   "line0"    "visual0"  "visual4"  "visual12"
 [7] "visual24" "visual52" "treat.f"  "miss.pat"
abbreviate(names(armd.wide))
 subject   lesion    line0  visual0  visual4 visual12 visual24 visual52 
  "sbjc"   "lesn"   "lin0"   "vsl0"   "vsl4"   "vs12"   "vs24"   "vs52" 
 treat.f miss.pat 
  "trt."   "mss."

Sunday, March 17, 2013

R Knoblauch/Maloney - MPDiR - Figure 2.1

  • rebuild some graphics from the book Knoblauch/Maloney: Modeling Psychophysical Data
  • from chapter 2 (Modeling), plotting age dependence of threshold for each level combination of Mtype and Sex using ggplot2 (page 27)
  • instead of the two panels we use two smooth() layers, loess is the default method for group sizes < 1000, so we only have to tell ggplot to use "lm" as method for one layer

  • we set se to F (se=F) in both of them which prevents the convidence intervals from being plotted

  • size, linetype, and colour are the corresponding parameters to lwd, lty, and col respectively

  • a nice list of line types and shapes of points

  • to plot a separate figure for every combination we use facet_wrap()

library(MPDiR)
data(Motion)
library(ggplot2)
ggplot(Motion, aes(x=LnAge,y=LnThresh)) + 
  geom_point() +
  geom_smooth(method="lm",se=F) +
  geom_smooth(se=F, size=2, linetype=2, colour="black") +
  facet_grid(Sex ~ Mtype,as.table=T)