docker4seq/supplementary_data.html at master · amparore/docker4seq · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

<title>Introduction</title>

<script type="text/javascript">
window.onload = function() {
  var imgs = document.getElementsByTagName('img'), i, img;
  for (i = 0; i < imgs.length; i++) {
    img = imgs[i];
    // center an image if it is the only element of its parent
    if (img.parentElement.childElementCount === 1)
      img.parentElement.style.textAlign = 'center';
  }
};
</script>


<style type="text/css">
body, td {
   font-family: sans-serif;
   background-color: white;
   font-size: 13px;
}

body {
  max-width: 800px;
  margin: auto;
  padding: 1em;
  line-height: 20px;
}

tt, code, pre {
   font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace;
}

h1 {
   font-size:2.2em;
}

h2 {
   font-size:1.8em;
}

h3 {
   font-size:1.4em;
}

h4 {
   font-size:1.0em;
}

h5 {
   font-size:0.9em;
}

h6 {
   font-size:0.8em;
}

a:visited {
   color: rgb(50%, 0%, 50%);
}

pre, img {
  max-width: 100%;
}
pre {
  overflow-x: auto;
}
pre code {
   display: block; padding: 0.5em;
}

code {
  font-size: 92%;
  border: 1px solid #ccc;
}

code[class] {
  background-color: #F8F8F8;
}

table, td, th {
  border: none;
}

blockquote {
   color:#666666;
   margin:0;
   padding-left: 1em;
   border-left: 0.5em #EEE solid;
}

hr {
   height: 0px;
   border-bottom: none;
   border-top-width: thin;
   border-top-style: dotted;
   border-top-color: #999999;
}

@media print {
   * {
      background: transparent !important;
      color: black !important;
      filter:none !important;
      -ms-filter: none !important;
   }

   body {
      font-size:12pt;
      max-width:100%;
   }

   a, a:visited {
      text-decoration: underline;
   }

   hr {
      visibility: hidden;
      page-break-before: always;
   }

   pre, blockquote {
      padding-right: 1em;
      page-break-inside: avoid;
   }

   tr, img {
      page-break-inside: avoid;
   }

   img {
      max-width: 100% !important;
   }

   @page :left {
      margin: 15mm 20mm 15mm 10mm;
   }

   @page :right {
      margin: 15mm 10mm 15mm 20mm;
   }

   p, h2, h3 {
      orphans: 3; widows: 3;
   }

   h2, h3 {
      page-break-after: avoid;
   }
}
</style>


</head>

<body>
<h2>Introduction</h2>

<p>The docker4seq package was developed to facilitate the use of computing demanding applications in the field of NGS data analysis.</p>

<p>The docker4seq package uses <a href="https://www.docker.com/what-docker">docker</a> <a href="https://www.docker.com/what-container">containers</a> that embed demanding computing tasks, e.g. short reads mapping. </p>

<p>This approach provides multiple advantages: </p>

<ul>
<li><p>user does not need to install all the software on its local server</p></li>
<li><p>results generated by different containers can be organized in pipelines</p></li>
<li><p>reproducible research is guarantee by the possibility of sharing the docker containers used for the analysis </p></li>
</ul>

<h2>Requirements</h2>

<p>The minimal hardware requirements are a 4 core 64 bits linux computer, 32 Gb RAM, one SSD 250GB, with a folder with read/write permission for any users (chmod 777), and <a href="https://www.docker.com/">docker</a> installed.</p>

<p><a href="https://github.com/kendomaniac/docker4seq"><strong>docker4seq</strong></a> and its graphical interface <a href="https://github.com/mbeccuti/4SeqGUI"><strong>4SeqGUI</strong></a> can fit ideally in the <a href="https://www.intel.com/content/www/us/en/products/boards-kits/nuc/kits/nuc6i7kyk.html">NUC6I7KYK, Intel mini-computer</a> equipped with Kingston Technology HyperX Impact 32GB Kit (2x16GB), 2133MHz DDR4 CL13 260-Pin SODIMM and Samsung 850 EVO - 250GB - M.2 SATA III Internal SSD.</p>

<p><strong>MANDATORY:</strong> The first time <em>docker4seq</em> is installed the <strong>downloadContainers</strong> function needs to be executed to download in the local repository the docker containers that are needed by <em>docker4seq</em>.</p>

<pre><code class="r">library(docker4seq)
downloadContainers(group=&quot;docker&quot;)
</code></pre>

<h2>Dockers containers</h2>

<p>At the present time all functions requiring some sort of calculation are embedded in the following docker containers:</p>

<ul>
<li><p>docker.io/rcaloger/annotate.2017.01 used by rnaseqCounts, rsemanno </p></li>
<li><p>docker.io/rcaloger/bwa.2017.01 used by bwaIndexUcsc, bwa </p></li>
<li><p>docker.io/rcaloger/chipseq.2017.01 used by chipseqCounts, chipseq</p></li>
<li><p>docker.io/rcaloger/r332.2017.01 used by experimentPower, sampleSize, wrapperDeseq2</p></li>
<li><p>docker.io/rcaloger/mirnaseq.2017.01 used by mirnaCounts</p></li>
<li><p>docker.io/rcaloger/rsemstar.2017.01 used by rnaseqCounts, rsemstarIndex, rsemstarUscsIndex</p></li>
<li><p>docker.io/rcaloger/skewer.2017.01 used by skewer</p></li>
</ul>

<h3>docker container nomenclature</h3>

<p>In case of updates required to solve bugs, which do not affect the calculation docker.io/rcaloger/XXXXX.YYYY.ZZ the fiels ZZ will be updated.</p>

<p>In case of updates which affect the calculation, e.g. new release of Bioconductor libraries, the field YYYY will be updated. Previous versions will be maintained to allow reproducibility.</p>

<h3>Reproducibility</h3>

<p>Within any folder generated with docker4seq functions it is saved the file <strong>containers.txt</strong>, which indicates the containers available in the local release of docker4seq.</p>

<p>In case, user would like to download a set of dockers containers different from those provided as part of the package those needs to be described in a file with the following format, <strong>docker.repository/user/docker.name</strong> which is passed to downloadContainers:</p>

<pre><code class="r">downloadContainers(group=&quot;docker&quot;, containers.file=&quot;my_containers.txt&quot;)
#an example of the my_containers.txt file content
docker.io/rcaloger/bwa.2017.01
docker.io/rcaloger/chipseq.2017.01
docker.io/rcaloger/r340.2017.01
</code></pre>

<h2>Available workflows</h2>

<p>At the present time are available the following workflows:</p>

<ul>
<li><strong>mRNAseq</strong>, which allows:

<ul>
<li>adapter trimming with <a href="https://github.com/relipmoc/skewer">skewer</a></li>
<li>mapping with <a href="https://github.com/alexdobin/STAR">STAR</a></li>
<li>counting genes and isoforms with <a href="http://deweylab.github.io/RSEM/">RSEM</a></li>
<li>ENSEMBL gene annotation.</li>
<li>organizing the output of RSEM in tables to be used for differential expression analysis</li>
<li>visualizing experiment data with PCA</li>
<li>evaluating experiment power and sample size</li>
<li>detecting differentially expressed genes/isoforms</li>
</ul></li>
<li><strong>miRNAseq</strong>, which executes the workflow described in Cordero et al. PLoS One. 2012;7(2):e31630, embedding the following steps:

<ul>
<li>trimming adapters with <a href="http://cutadapt.readthedocs.io/en/stable/guide.html">cutadapt</a></li>
<li>miRNAs mapping on <a href="http://www.mirbase.org/">mirbase</a> hairpins using <a href="http://compbio.cs.toronto.edu/shrimp/">SHRiMP</a></li>
<li>quantification of mature miRNAs.</li>
<li>visualizing experiment data with PCA</li>
<li>evaluating experiment power and sample size</li>
<li>detecting differentially expressed miRNAs</li>
</ul></li>
<li><strong>ChIPseq</strong>, which allows:

<ul>
<li>adapter trimming with <a href="https://github.com/relipmoc/skewer">skewer</a></li>
<li>mapping with <a href="http://bio-bwa.sourceforge.net/bwa.shtml">BWA</a></li>
<li>peak calling using either MACS v 1.4 or SICER v 1.1</li>
<li>associating peaks to the nearest gene, UCSC annotation</li>
<li>full annotation of the nearest gene</li>
</ul></li>
</ul>

<p>The most computing expensive steps of the analyses are embedded in the following docker4seq functions: <strong>rnaseqCounts</strong>, <strong>mirnaCounts</strong>, <strong>chipseqCounts</strong>. These functions are also the only one that have RAM and computing power requirements not usually available in consumer computers. Below its shown the time required to run the above three functions increasing the number of sequenced reads.</p>

<h3>rnaseqCounts performances on different hardware configurations</h3>

<p>The conversion between fastq files to counts is the most time consuming step in RNAseq data analysis and it normally requires high-end server. We tested <strong>rnaseqCounts</strong> on different hardware:</p>

<pre><code>+ SeqBox: NUC6I7KYK CPU i7-6770HQ 3.5 GHz (1 core, 8 threads), 32 Gb RAM, HD 250 Gb SSD
+ SGI UV200 server: CPU E5-4650 v2 2.40GHz (8 cores, 160 threads), 1 Tb RAM, RAID 6, 100 Tb SATA
</code></pre>

<p>We run respctively 26, 52, 78, and 105 million reads increasing the number of threads till reaching the limit of the hardware or up to 64 threads, i.e. vaules shown in parenthesis in next figure. It is notable that SeqBox, mapping in 5 hours more than 100 milion reads, is able to handle can handle, in 20 hours, the throughput of the Illumina benchtop sequencer NextSeq 500, which produces up to 400 milion reads in a 30 hours run.</p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/mrna_performace.jpg" alt="rnaseqCounts overall performance"></p>

<h3>mirnaCounts performances on different hardware configurations</h3>

<p>We run resepctively 3, 6, 12, and 24 miRNA samples in parallel using <strong>mirnaCounts</strong>, increasing the number of threads till reaching the limit of the hardware or 24 threads, i.e. vaules shown in parenthesis in next figure.</p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/mirna_performace.jpg" alt="mirnaCounts overall performance"> </p>

<h3>chipseqCounts performances on different hardware configurations</h3>

<p>We run resepctively 37, 70, 111, and 149 million reads increasing the number of threads till reaching the limit of the hardware or up to 64 threads, i.e. vaules shown in parenthesis in next figure.</p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/chipseq_performance.jpg" alt="chipseqCounts overall performance"> </p>

<p>From the point of view of parallelization the <strong>rnaseqCounts</strong> is the one that embeds the most parallelized tools: i) mapping with STAR and ii) quantifying transcripts with RSEM. Both these tools have a massive I/O requirement. On the basis of the results shown above parallelization does not improve very much the overall performaces, but compensate the poor I/O of the RAID based on SATA disk array. On the other side the presence of an SSD with very high I/O compensate in the limited amount of cores of SeqBox.</p>

<p>In the case of <strong>mirnaCounts</strong> and <strong>chipseqCounts</strong> the parallelization is very little and it is only available for the reads mapping procedure. On the otherside both functions have a massive I/O. The reduced parallelization of these two analyses combined with the higher I/O of the SSD with respect to the SATA array makes SeqBox extremely effective even with very high number of reads to be processed.</p>

<h2>RNAseq workflow: Howto</h2>

<p>The mRNAseq workflow can be run using <strong>4SeqGUI</strong> graphical interface (linux/MAC):</p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/rnaseq1.jpeg" alt="mRNAseq workflow"></p>

<p>Sample quantification is made of these steps:</p>

<ul>
<li><p>Creating a genome index for STAR (see end of this paragraph)</p></li>
<li><p>Running removing sequencing adapters</p></li>
<li><p>Mapping reads to the reference genome</p></li>
<li><p>Quantify gene and transcript expression level</p></li>
<li><p>Annotating genes.</p></li>
</ul>

<p>All the parameters can be setup using 4SeqGUI</p>

<h3>Creating a STAR index file for mRNAseq:</h3>

<p>The index can be easily created using the graphical interface:</p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/rnaseq2.jpeg" alt="Creating a STAR genome index"></p>

<p>A detailed description of the parameters is given below.</p>

<h4>Creating a STAR index file by line command</h4>

<p>\fontsize{8}{8}\selectfont</p>

<pre><code class="r">rsemstarIndex(group=&quot;docker&quot;,genome.folder=&quot;/data/scratch/hg38star&quot;,
ensembl.urlgenome=&quot;ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz&quot;,
ensembl.urlgtf=&quot;ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.gtf.gz&quot;)
</code></pre>

<p>In brief, <strong>rsemstarIndex</strong> uses ENSEMBL genomic data. User has to provide the URL (<strong>ensembl.urlgenome</strong>) for the file XXXXX_dna.toplevel.fa.gz related to the organism of interest, the URL (<strong>ensembl.urlgtf</strong>) for the annotation GTF XXX.gtf.gz and the path to the folder where the index will be generated (<strong>genome.folder</strong>). The parameter <strong>threads</strong> indicate the number of cores dedicated to this task.</p>

<h3>Quantifying genes/isoforms:</h3>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/rnaseq3.jpeg" alt="Gene, Isoform counting"></p>

<p>A detailed description of the parameters is given below.</p>

<h4>Sample quantification by line command</h4>

<p>The sample quantification can be also executed using R and it is completely embedded in a single function:</p>

<pre><code class="r">#test example
system(&quot;wget http://130.192.119.59/public/test.mrnaCounts.zip&quot;)
unzip(&quot;test.mrnaCounts.zip&quot;)
setwd(&quot;./test.mrnaCounts&quot;)
library(docker4seq)
rnaseqCounts(group=&quot;docker&quot;,fastq.folder=getwd(), scratch.folder=getwd(),
adapter5=&quot;AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT&quot;,
adapter3=&quot;AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT&quot;,
seq.type=&quot;se&quot;, threads=8,  min.length=40,
genome.folder=&quot;/data/scratch/mm10star&quot;, strandness=&quot;none&quot;, save.bam=FALSE,
org=&quot;mm10&quot;, annotation.type=&quot;gtfENSEMBL&quot;)
</code></pre>

<p>User needs to create the <strong>fastq.folder</strong>, where the fastq.gz file(s) for the sample under analysis are located.
The <strong>scratch.folder</strong> is the location where temporary data are created. The results will be then saved in the <strong>fastq.folder</strong>.</p>

<p>User needs to provide also the sequence of the sequencing adapters, <strong>adapter5</strong> and <strong>adapter3</strong> parameters. In case Illumina platform is used the adapters sequences can be easily recovered <a href="https://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/experiment-design/illumina-adapter-sequences_1000000002694-01.pdf"><strong>here</strong></a>. </p>

<p><strong>seq.type</strong> indicates if single end (se) or pair-end (pe) data are provided, <strong>threads</strong> indicates the max number of cores used by <em>skewer</em> and <em>STAR</em>, all the other steps are done on a single core.</p>

<p>The <strong>min.length</strong> refers to the minimal length that a reads should have after adapters trimming. Since today the average read length for a RNAseq experiment is 50 or 75 nts would be better to bring to 40 nts the min.length parameter to increase the precision in assigning the correct position on the genome.</p>

<p>The <strong>genome.folder</strong> parameter refers to the location of the genomic index generated by STAR using the <em>docker4seq</em> function <strong>rsemstarIndex</strong>.
The generation of the genome index is very simple and it is highlited at the end of this paragraph.</p>

<p><strong>strandness</strong>, is a parameter referring to the kit used for the library prep. If the kit does not provide strand information it is set to &quot;none&quot;, if provides strand information is set to &quot;forward&quot; for  Illumina stranded kit and it set to &quot;reverse&quot; for Illumina ACCESS kit. <strong>save.bam</strong> set to TRUE indicates that genomic bam file and transcriotomic bam files are also saved at the end of the analysis. <strong>annotation.type</strong> refers to the type of available gene-level annotation. At the present time is only available ENSEMBL annotation defined by the gtf downloaded during the creation of the indexed genome files, see paragraph <em>at the end*Creating a STAR index file for mRNAseq</em>.</p>

<h3>Sample quantification output files</h3>

<p>The mRNAseq workflow produces the following output files: </p>

<pre><code>+ XXXXX-trimmed.log, containing the information related to the adapters trimming
+ gtf_annotated_genes.results, the output of RSEM gene quantification with gene-level annotation
+ Log.final.out, the statistics of the genome mapping generated by STAR
+ rsem.info, summary of the parameters used in the run
+ genes.results, the output of RSEM gene quantification
+ isoforms.results, the output of RSEM isoform quantification
+ run.info, some statistics on the run
+ skewerd_xxxxxxxxxxxx.log, log of the skewer docker container
+ stard.yyyyyyyyyyyy.log, log of the star docker container
</code></pre>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/rnaseq7.jpeg" alt="gtf_annotated_genes.results"></p>

<p>The first column in <strong>gtf_annotated_genes.results</strong> is the ensembl gene id, the second is the <a href="http://www.ensembl.org/Help/Faq?id=468">biotype</a>,
the 3rd is the annotation source, the 4th contains the set of transcripts included in the ensembl gene id. Then there is the length of the gene, the lenght of the gene to which is subtracted the average length of the sequenced fragments, the expected counts are the couts to be used for differentiual expression analysis. <a href="https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/">TPM</a> and <a href="https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/">FPM</a> are normalized gene quantities to be used only for visualization purposes.</p>

<h3>From samples to experiment</h3>

<p>The RSEM output is sample specific, thus it is necessary to assemble the single sample in an experiment table including in the header of the column both the covariates and the batch, if any.
The header sample name is separated by the covariate with an underscore, e.g. mysample1_Cov1, mysample2_Cov2. </p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/counts1.jpeg" alt="counts table with covariates"></p>

<p>In case also a batch is present also this is separated by a further underscore, e.g. mysample1_Cov1_batch1, mysample2_Cov_batch2.</p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/counts2.jpeg" alt="counts table with covariates and batch"></p>

<p>The addition of the covariates to the various samples can be done using the <strong>4seqGUI</strong> using the button: <em>From samples to experiment</em>.
Covariates are added to the column name, i.e. sample name, using _, e.g. mysample_Cov.1. Batches are added after covariates with _, e.g. mysample_Cov.1_batch.1.</p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/counts3.jpeg" alt="generating a table with covariates"></p>

<h4>From samples to experiments by line command</h4>

<pre><code class="r">#test example
system(&quot;wget http://130.192.119.59/public/test.samples2experiment.zip&quot;)
unzip(&quot;test.samples2experiment.zip&quot;)
setwd(&quot;test.samples2experiment&quot;)
library(docker4seq)
sample2experiment(sample.folders=c(&quot;./e1g&quot;,&quot;./e2g&quot;,&quot;./e3g&quot;,
&quot;./p1g&quot;, &quot;./p2g&quot;, &quot;./p3g&quot;),
covariates=c(&quot;Cov.1&quot;,&quot;Cov.1&quot;,&quot;Cov.1&quot;,&quot;Cov.2&quot;,&quot;Cov.2&quot;,&quot;Cov.2&quot;),
bio.type=&quot;protein_coding&quot;, output.prefix=&quot;.&quot;)
</code></pre>

<p>User needs to provide the paths of the samples, <strong>sample.folder</strong> parameter, a vector of the covariates, <strong>covariates</strong>, and the biotype(s) of interest, <strong>bio.type</strong> parameter. The parameter <strong>output.prefix</strong> refers to the path where the output will be created, as default this is the actual R working folder.</p>

<h4>From samples to experiments output files</h4>

<p>The samples to experiments produces the following output files: </p>

<pre><code>+ _counts.txt: gene-level raw counts table for differential expression analysis
+ _isoforms_counts.txt: isoform-level raw counts table for differential expression analysis
+ _isoforms_log2TPM.txt: isoform-level log2TPM for visualization purposes
+ _log2TPM.txt: gene-level log2TPM for visualization purposes
+ _isoforms_log2FPKM.txt: isoform-level log2FPKM for visualization purposes
+ _log2FPKM.txt: gene-level log2FPKM for visualization purposes
+ XXXXX.Rout: logs of the execution
</code></pre>

<h3>Visualizing experiment data with PCA</h3>

<p><a href="https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/">PCA</a> finds the principal components of data. Principal components are the underlying structure in the data. They are the directions where there is the most variance, the directions where the data is most spread out.
<strong>4SeqGUI</strong> provides an interface to the generation experiment samples PCA</p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/pca1.jpeg" alt="PCA"></p>

<p>The plot is saved in <strong>pca.pdf</strong> in the selected folder.</p>

<h4>PCA by line command</h4>

<pre><code class="r">#test example
system(&quot;wget 130.192.119.59/public/test.analysis.zip&quot;)
unzip(&quot;test.analysis.zip&quot;)
setwd(&quot;test.analysis&quot;)
library(docker4seq)
pca(experiment.table=&quot;_log2FPKM.txt&quot;, type=&quot;FPKM&quot;, legend.position=&quot;topleft&quot;, covariatesInNames=FALSE, principal.components=c(1,2), pdf = TRUE, output.folder=getwd())
</code></pre>

<p>User needs to provide the paths of experiment table, <strong>experiment.table</strong> parameter, i.e. the file generated using the samples2experiment function. The <strong>type</strong> parameter indicates if FPKM, TPM or counts are used for the PCA geenration. The parameter <strong>legend.position</strong> defines where to locate the covariates legend. The parameter <strong>covariatesInNames</strong> indicates if the header of the experiment table contains or not covariate information. The parameter <strong>principal.components</strong> indicates which principal components should be plotted. <strong>output.folder</strong> indicates where to save the pca.pdf file.</p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/pca2.jpeg" alt="pca.pdf"></p>

<p>The values in parentesis on x and y axes are the amount of variance explained by each principal component.</p>

<p>IMPORTANT: The above analysis is suitable also for miRNAseq data</p>

<h3>Evaluating experiment power and sample size</h3>

<p><a href="https://www.ncbi.nlm.nih.gov/pubmed/25246651">RnaSeqSampleSize</a> Bioconductor package provides the possibility to calculate, from a pilot experiment, the statistical power and to define the optimal sample size.
<strong>4SeqGUI</strong> provides an interface to sample size estimation: </p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/ss.jpeg" alt="sample size estimation"></p>

<p>and to statistical power estimation:</p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/es.jpeg" alt="stat power estimation"></p>

<h4>Sample size estimation by line command</h4>

<p>Sample size estimation is an important issue in the design of RNA sequencing experiments. We have implemented a wrapper function for sample size estimation using the bioconductor package RnaSeqSampleSize.</p>

<pre><code class="r">#test example
system(&quot;wget 130.192.119.59/public/test.analysis.zip&quot;)
unzip(&quot;test.analysis.zip&quot;)
setwd(&quot;test.analysis&quot;)
library(docker4seq)
sampleSize(group=&quot;docker&quot;, filename=&quot;_counts.txt&quot;, power=0.80, FDR=0.1, genes4dispersion=200, log2fold.change=1)
</code></pre>

<p>The requested parameters are the path to the counts experiment table generated by <strong>samples2experiment</strong> function. The param <strong>power</strong> indicates the expecte fraction of differentially expressed gene, e.g 0.80. <strong>FDR</strong> and <strong>log2fold.change</strong>  are the two thresholds used to define the set of differentially expressed genes of interest. </p>

<p>The output file is <strong>sample_size_evaluation.txt</strong> is saved in the R working folder, below an example of the file content:</p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/samplesize1.jpeg" alt="sample_size_evaluation.txt"></p>

<p>IMPORTANT: The above analysis is suitable also for miRNAseq data</p>

<h4>Experiment statistical power estimation by line command</h4>

<p>Experiment power provides an indication of which is the fraction of differentially expressed genes that can be detected given a specific number of samples and differential expression detection thresholds. We have implemented a wrapper function for experiment power estimation using the bioconductor package RnaSeqSampleSize.</p>

<pre><code class="r">#test example
system(&quot;wget 130.192.119.59/public/test.analysis.zip&quot;)
unzip(&quot;test.analysis.zip&quot;)
setwd(&quot;test.analysis&quot;)
library(docker4seq)
experimentPower(group=&quot;docker&quot;, filename=&quot;_counts.txt&quot;,replicatesXgroup=7, FDR=0.1, genes4dispersion=200, log2fold.change=1)
</code></pre>

<p>The requested parameters are the path to the counts experiment table generated by <strong>samples2experiment</strong> function. The param <strong>replicatesXgroup</strong> indicates the number of sample associated to each of the two covariates. <strong>FDR</strong> and <strong>log2fold.change</strong>  are the two thresholds used to define the set of differentially expressed genes of interest. <strong>genes4dispersion</strong> indicates the number of genes used in estimation of read counts and dispersion distribution.</p>

<p>The output file is <strong>power_estimation.txt</strong> is saved in the R working folder, below an example of the file content: </p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/expp1.jpeg" alt="power_estimation.txt"></p>

<p>IMPORTANT: The above analysis is suitable also for miRNAseq data</p>

<h3>Differential expression analysis with DESeq2</h3>

<p>A basic task in the analysis of count data from RNA-seq is the detection of differentially expressed genes.
<strong>4SeqGUI</strong> provides an interface to DESeq2 to simplify differential expression analysis.</p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/de1.jpeg" alt="DESeq2"></p>

<p>The output files are:</p>

<p><strong>DEfull.txt</strong> containing the full set of results generated by DESeq2</p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/de2.jpeg" alt="DEfull.txt"></p>

<p><strong>DEfiltered_log2fc_X_fdr_Y.Y.txt</strong> containing the set of differentially expressed genes passing the indicated thresholds</p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/de3.jpeg" alt="DEfiltered_log2fc_1_fdr_0.1.txt"></p>

<p><strong>genes4david.txt</strong> a file containing only the gene symbols to be used as input for <a href="https://david.ncifcrf.gov/">DAVID</a> or <a href="http://amp.pharm.mssm.edu/Enrichr/">ENRICHR</a></p>

<p>log2normalized_counts.txt, DESeq2 log2 library size normalized counts to be used for visualizaiton only instead of log2 FPKM or log2 TPM.</p>

<h4>DESeq2 by line command</h4>

<pre><code class="r">#test example
system(&quot;wget 130.192.119.59/public/test.analysis.zip&quot;)
unzip(&quot;test.analysis.zip&quot;)
setwd(&quot;test.analysis&quot;)
library(docker4seq)
wrapperDeseq2(output.folder=getwd(), group=&quot;docker&quot;, experiment.table=&quot;_counts.txt&quot;, log2fc=1, fdr=0.1, ref.covar=&quot;Cov.1&quot;, type=&quot;gene&quot;, batch=FALSE)
</code></pre>

<p>User needs to provide experiment table, <strong>experiment.table</strong> param, i.e. the counts table generated with <strong>samples2experiment</strong> function, the thresholds for the differential expression analysis, <strong>log2fc</strong> and <strong>fdr</strong> params, the reference covariate, <strong>ref.covar</strong> param, i.e. the covariate that is used as reference for differential expression detection, the <strong>type</strong> param, whihc refers to the type of experiment table in use: <em>gene</em>, <em>isoform</em>, <em>mirna</em>, <strong>batch</strong> parameter that indicates, it it is set to <strong>TRUE</strong> that the header of the experiment table also contains the extra information for the batch effect (see above). </p>

<p>IMPORTANT: the above analysis can be also applied to miRNAseq data.</p>

<h2>miRNAseq workflow</h2>

<p>The miRNAseq workflow can be run using <strong>4SeqGUI</strong> graphical interface:</p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/mirna1.jpeg" alt="miRNAseq workflow"></p>

<p>The miRNAseq docker container executes the following steps:</p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/mirna3.jpeg" alt="miRNAseq workflow"></p>

<p>The full workflow is described in <a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0031630">Cordero et al. Plos ONE 2012</a>. In brief, fastq files are trimmed using <a href="https://github.com/marcelm/cutadapt">cutadapt</a> and the trimmed reads are mapped on miRNA precursors, i.e. harpin.fa file, from <a href="http://www.mirbase.org/ftp.shtml">miRBase</a> using <a href="http://compbio.cs.toronto.edu/shrimp/">SHRIMP</a>. Using the location of the mature miRNAs in the precursor, countOverlaps function, from the Bioconductor package GenomicRanges is used to quantify the reads mapping on mature miRNAs. </p>

<p>All the parameters needed to run the miRNAseq workflow can be setup using 4SeqGUI</p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/mirna2.jpeg" alt="miRNAseq parameters"></p>

<p>A detailed description of the parameters is given below.</p>

<h3>miRNAseq workflow by line command</h3>

<p>The miRNAseq workflow can be also executed using R and it is completely embedded in a unique function:</p>

<pre><code class="r">#test example
system(&quot;wget 130.192.119.59/public/test.mirnaCounts.zip&quot;)
unzip(&quot;test.mirnaCounts.zip&quot;)
setwd(&quot;test.mirnaCounts&quot;)
library(docker4seq)
mirnaCounts(group=&quot;docker&quot;,fastq.folder=getwd(), scratch.folder=&quot;/data/scratch&quot;,
            mirbase.id=&quot;hsa&quot;,download.status=FALSE, adapter.type=&quot;NEB&quot;, trimmed.fastq=FALSE)
</code></pre>

<p>User needs to create the <strong>fastq.folder</strong>, where the fastq.gz files for all miRNAs under analysis are located.
The <strong>scratch.folder</strong> is the location where temporary data are created. The results will be then saved in the <strong>fastq.folder</strong>.
User needs to provide also the identifier of the miRBase organism, e.g. <strong>hsa</strong> for Homo sapiens, <strong>mmu</strong> for Mus musculus.  If the <strong>download.status</strong> is set to FALSE, mirnaCounts uses miRBase release 21, if it is set to TRUE the lastest version of precursor and mature miRNAs will be downloaded from miRBase. Users need to provide the name of the producer of the miRNA library prep kit to identify which adapters need to be provided to cutadapt, <strong>adapter.type</strong> parameter. The available adapters are NEB and Illumina, but, upon request, we can add other adapters. Finally, if the <strong>trimmed.fastq</strong> is set to FALSE the trimmed fastq are not saved at the end of the analysis.</p>

<h3>miRNAseq workflow output files</h3>

<p>The miRNAseq workflow produces the following output files: </p>

<pre><code>+ README: A file describing the content of the data folder
+ all.counts.txt: miRNAs raw counts, to be used for differential expression analysis
+ trimmimg.log: adapters trimming statistics
+ shrimp.log: mapping statistics
+ all.counts.Rda: miRNAs raw counts ready to be loaded in R.
+ analysis.log: logs of the full analysis pipeline
</code></pre>

<h2>Adding covariates and batches to mirnaCounts output: all.counts.txt</h2>

<p>The funnction <strong>mirnaCovar</strong> add to the header of all.counts.txt covariates and batches or covariates only.</p>

<pre><code class="r">#test example
system(&quot;wget 130.192.119.59/public/test.mirna.analysis.zip&quot;)
unzip(&quot;test.mirna.analysis.zip&quot;)
setwd(&quot;test.mirna.analysis&quot;)
library(docker4seq)
mirnaCovar(experiment.folder=getwd(),
     covariates=c(&quot;Cov.1&quot;, &quot;Cov.1&quot;, &quot;Cov.1&quot;, &quot;Cov.1&quot;, &quot;Cov.1&quot;, &quot;Cov.1&quot;,
                  &quot;Cov.2&quot;, &quot;Cov.2&quot;, &quot;Cov.2&quot;, &quot;Cov.2&quot;, &quot;Cov.2&quot;, &quot;Cov.2&quot;),
     batches=c(&quot;bath.1&quot;, &quot;bath.1&quot;, &quot;bath.2&quot;, &quot;bath.2&quot;, &quot;batch.1&quot;, &quot;batch.1&quot;,
               &quot;batch.2&quot;, &quot;batch.2&quot;,&quot;batch.1&quot;, &quot;batch.1&quot;,&quot;bath.2&quot;, &quot;bath.2&quot;))
</code></pre>

<h2>chipseq workflow</h2>

<p>The chipseq workflow can be run using <strong>4SeqGUI</strong> graphical interface:</p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/chipseq0.jpeg" alt="ChIPseq workflow"></p>

<p>The ChIPseq is made of two main steps:</p>

<ul>
<li><p>Creating a genome index for BWA (see end of this paragraph)</p></li>
<li><p>Running MACS or SICER analysis</p></li>
</ul>

<h3>Creating a BWA index file for Chipseq:</h3>

<p>The index can be easily created using the graphical interface:</p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/chipseq1.jpeg" alt="Creating a BWA index with Genome indexing BWA"></p>

<pre><code class="r">bwaIndexUcsc(group=&quot;sudo&quot;,genome.folder=&quot;/sto2/data/scratch/mm10bwa&quot;, uscs.urlgenome=
&quot;http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz&quot;,
gatk=FALSE)
</code></pre>

<p>In brief, <strong>bwaIndexUcsc</strong> uses UCSC genomic data. User has to provide the URL (<strong>uscs.urlgenome</strong>) for the file chromFa.tar.gz related to the organism of interest and the path to the folder where the index will be generated (<strong>genome.folder</strong>). The parameter <strong>gatk</strong> has to be set to FALSE because is not used for a genomic index used for ChIPseq.</p>

<h3>Calling peaks and annotating:</h3>

<p>All the parameters needed to run MACS or SICER can be setup using 4SeqGUI</p>

<p><img src="/Users/raffaelecalogero/Dropbox/data/docker/stable/docker4seq/inst/img/chipseq3.jpeg" alt="MACS and SICER analysis"></p>

<p>A detailed description of the parameters is given below.</p>

<h3>Chipseq workflow by line command</h3>

<p>The chipseq workflow can be also executed using R and it is completely embedded in a unique function:</p>

<pre><code class="r">system(&quot;wget 130.192.119.59/public/test.chipseqCounts.zip&quot;)
unzip(&quot;test.chipseqCounts.zip&quot;)
setwd(&quot;test.chipseqCounts&quot;)
library(docker4seq)
chipseqCounts(group = &quot;docker&quot;, output.folder = &quot;./prdm51.igg&quot;,
  mock.folder=&quot;./igg&quot;, test.folder=&quot;./prdm51&quot;, scratch.folder=getwd(),
  adapter5 = &quot;AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT&quot;,
  adapter3 = &quot;AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT&quot;,
  threads = 8, min.length = 30, genome.folder,
  mock.id = &quot;igg&quot;, test.id = &quot;tf&quot;, genome, read.size = 50,
  tool = &quot;macs&quot;, macs.min.mfold = 10, macs.max.mfold = 30,
  macs.pval = &quot;1e-5&quot;, sicer.wsize = 200, sicer.gsize = 200,
  sicer.fdr = 0.1, tss.distance = 0, max.upstream.distance = 10000,
  remove.duplicates = &quot;N&quot;)
</code></pre>

<p>Specifically user needs to create three folders:</p>

<pre><code>+ mock.folder, where the fastq.gz file for the control sample is located. For control sample we refer to ChIP with IgG only or input DNA.
+ test.folder, where the fastq.gz file for the ChIP of the sample to be analysed.
+ output.folder, where the R script embedding the above script is located.
</code></pre>

<p>The <strong>scratch.folder</strong> can be the same as the <strong>output.folder</strong>. However, if the system in use has a high speed disk for temporary calculation, e.g. a SSD disk, the location of the scratch.folder on the SSD will reduce significantly the computing time.</p>

<p>User needs to provide also the sequence of the sequencing adapters, <strong>adapter5</strong> and <strong>adapter3</strong> parameters. In case Illumina platform is used the adapters sequences can be easily recovered <a href="https://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/experiment-design/illumina-adapter-sequences_1000000002694-01.pdf">here</a>. </p>

<p><strong>Threads</strong> indicates the max number of cores used by <em>skewer</em> and <em>bwa</em>, all the other steps are done on a single core.
The <strong>min.length</strong> refers to the minimal length that a reads should have after adapters trimming. Since today the average read length for a ChIP experiment is 50 or 75 nts would be better to bring to 40 nts the min.length parameter to increase the precision in assigning the correct position on the genome.</p>

<p>The <strong>genome.folder</strong> parameter refers to the location of the genomic index generated by bwa using the <em>docker4seq</em> function <strong>bwaIndexUcsc</strong>.
The generation of the genome index for ChIP experiment is very simple and it is highlited at the end of this paragraph.</p>

<p><strong>mock.id</strong> and <strong>test.id</strong> identify the type of sample and are assigned to the ID parameter in the RG field of the bam file.</p>

<p><strong>genome</strong> is the parameter referring to the annotation used to associate ChIP peaks to genes. In the present implemetation are available hg38, hg19 for human and mm10 and mm9 for mouse annotations.</p>

<p><strong>read.size</strong> is a parameter requested by MACS and SICER for their analysis.
<strong>macs.min.mfold</strong>, <strong>macs.max.mfold</strong>, <strong>macs.pval</strong>  are the deafult parameters requested for  peaks definition for more info please refer to the documetation of MACS 1.4.
<strong>sicer.wsize</strong>, <strong>sicer.gsize</strong>, <strong>sicer.fdr</strong> are the deafult parameters requested for  peaks definition for more info please refer to the documetation ofSICER 1.1.
<strong>Important</strong>: The optimal value for <strong>sicer.gsize</strong> in case of H3K4Me3 ChIP is 200 and in case of ChIP H3K27Me3 is 600.   </p>

<p><strong>tss.distance</strong> and <strong>max.upstream.distance</strong> are parameters required by ChIPseqAnno, which is the bioconducto package used to assign the peaks to specific genes. Specifically max.upstream.distance refers to the max distance in nts that allow to associate a peak to a gene.</p>

<p><strong>remove.duplicates</strong> is the parameter that indicates if duplicates need to be removed or not. It has two options: <strong>N</strong> duplicates are not removed, <strong>Y</strong> duplicates are removed.</p>

<h3>Chipseq workflow output files</h3>

<p>The chipseq workflow produces the following output files: </p>

<pre><code>+ README: A file describing the content of the data folder
+ mypeaks.xls: All detected peaks alongside the nearest gene and its annotation
+ mytreat.counts: The total reads count for the provided treatment file
+ mycontrol.counts: The total reads count for the provided control/background file
+ peak_report.xls: Aggregate information regarding the peak and their position relative to the nearest gene
+ chromosome_distribution.pdf: Barplot of the distribution of the peaks on the chromosomes
+ relative_position_distribution.pdf: Barplot of the distribution of the peaks positions relative to their nearest gene
+ peak_width_distribution.pdf: Histogram of the distribution of the width of the peaks
+ distance_from_nearest_gene_distribution.pdf: Histogram of the distribution of the distance of each peak from its nearest gene
+ cumulative_coverage_total.pdf: Cumulative normalized gene coverage
+ cumulative_coverage_chrN.pdf: Cumulative normalized gene coverage for the specific chromosome
+ mycontrol_sorted.bw: bigWig file for UCSC Genome Browser visualization
+ mytreat_sorted.bw: bigWig file for UCSC Genome Browser visualization
</code></pre>

</body>

</html>