count
This command counts the number of sequences and prints the number to STDOUT. Advanced grouping of sequences is possible by supplying one or more key strings containing variables (-k).
Usage:
st count [options] [-l <list>...] [-k <key>...] [<input>...]
st count (-h | --help)
Options:
-k, --key <key> Summarize over a variable key or a string containing variables.
For numeric key insert 'n:' before. Values are counted
in intervals of 1. To change, specify 'n:<interval>:<key>'.
Example: 'n:10:{s:seqlen}'
-n, --no-int Don't print intervals when using the 'n:<interval>:<key>' syntax,
instead only upper limits (e.g. '5' instead of '(1,5]')
See this page for the options common to all commands.
By default, the count command will return the global count for all files in the input:
st count *.fastq
10648515
If the count for each file is needed, use the filename variable:
st count -k filename *.fastq
file1.fastq 6474547
file2.fastq 2402290
file3.fastq 1771678
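As a quick sanity check, the per-file counts should add up to the global count reported without `-k`. The sketch below is a hypothetical helper, not part of the tool: `sum_counts` is an illustrative name, and it assumes only that the count is the last whitespace-separated column of each output line.

```shell
# Hypothetical helper (not part of st): sum the last whitespace-separated
# column of a count table read from stdin.
sum_counts() {
    awk 'NF > 0 { total += $NF } END { printf "%d\n", total }'
}

# Per-file counts copied from the example above.
sum_counts <<'EOF'
file1.fastq 6474547
file2.fastq 2402290
file3.fastq 1771678
EOF
# prints 10648515, the same number as the global count
```

In practice the helper would read the command output directly, for example `st count -k filename *.fastq | sum_counts`.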
It is possible to use multiple keys. Consider the example for the find command where the primer names and number of mismatches are annotated as attributes. Now, the mismatch distribution for each primer can be analysed:
st count -k {a:f_primer} -k n:{a:f_dist} seqs.fa
primer1 0 249640
primer1 1 23831
primer1 2 2940
primer1 3 123
primer1 4 36
primer1 5 2
primer2 0 448703
primer2 1 60373
primer2 2 8996
primer2 3 691
primer2 4 34
primer2 5 7
primer2 6 1
N/A 5029
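A table like this can be summarized further with standard text tools. The following sketch totals the counts per primer. `totals_by_first_key` is a hypothetical helper name, and the code assumes whitespace-separated columns with the key in the first column and the count in the last, so rows with fewer fields (such as the N/A row as shown here) are still handled.

```shell
# Hypothetical helper (not part of st): total the counts per value of the
# first column. The count is read from the last column of each row.
totals_by_first_key() {
    awk 'NF > 1 { total[$1] += $NF }
         END { for (key in total) printf "%s\t%d\n", key, total[key] }' | sort
}

# Rows copied from the example output above.
totals_by_first_key <<'EOF'
primer1 0 249640
primer1 1 23831
primer1 2 2940
primer1 3 123
primer1 4 36
primer1 5 2
primer2 0 448703
primer2 1 60373
primer2 2 8996
primer2 3 691
primer2 4 34
primer2 5 7
primer2 6 1
N/A 5029
EOF
```

For the rows shown, the totals are 276572 for primer1, 518805 for primer2 and 5029 for N/A.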
If primers on both ends were searched, it might make sense to use a math expression to get the sum of distances for both primers.
st count -k {a:f_primer} -k {a:r_primer} -k "n:{{a:f_dist + a:r_dist}}" primer_trimmed.fq.gz
f_primer1 r_primer1 0 3457490
f_primer1 r_primer1 1 491811
f_primer1 r_primer1 2 6374
f_primer1 r_primer1 3 420
f_primer1 r_primer1 4 10
(...)
The curly braces are only needed when a key is composed of several variables and/or text. The n: prefix tells the tool that the distance is numeric, which is useful for correct sorting.
With numeric keys, it is possible to summarize over intervals by adding an n:<interval>: prefix. This example shows the GC content summarized over 10% windows:
st count -k n:10:{s:gc} seqs.fa
(20,30] 2
(30,40] 15
(40,50] 193
(50,60] 984
(60,70] 7
The intervals (start,end] are open at the start and closed at the end, meaning that start < value <= end.
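To make the interval labels concrete, here is a minimal sketch that maps a value to its interval, reading `(start,end]` as open at the start and closed at the end. It illustrates the labelling only and is not the tool's implementation. `interval_label` is a hypothetical name, and an integer interval width is assumed.

```shell
# Hypothetical helper (not part of st): print the (start,end] interval of the
# given integer width that contains a value.
interval_label() {
    awk -v value="$1" -v width="$2" 'BEGIN {
        quotient = value / width
        upper = int(quotient)
        if (upper < quotient) upper++   # round up to the next interval boundary
        printf "(%d,%d]\n", (upper - 1) * width, upper * width
    }'
}

interval_label 25 10     # prints (20,30]
interval_label 30 10     # prints (20,30], the upper limit is included
interval_label 30.5 10   # prints (30,40]
```

Under this reading, a value that sits exactly on a boundary is counted in the interval that ends at that boundary, not in the one that starts there.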