markschl edited this page Aug 14, 2018 · 6 revisions

This command counts the number of sequences and prints the result to STDOUT. Sequences can also be grouped by supplying one or more key strings containing variables (-k).

Usage:
    st count [options] [-l <list>...] [-k <key>...] [<input>...]
    st count (-h | --help)

Options:
    -k, --key <key>     Summarize over a variable key or a string containing variables.
                        For numeric key insert 'n:' before. Values are counted
                        in intervals of 1. To change, specify 'n:<interval>:<key>'.
                        Example: 'n:10:{s:seqlen}'
    -n, --no-int        Don't print intervals when using the 'n:<interval>:<key>' syntax,
                        instead only upper limits (e.g. '5' instead of '(1,5]')

See this page for the options common to all commands.

Description

By default, the count command will return the global count for all files in the input:

st count *.fastq
10648515

If the count for each file is needed, use the filename variable:

st count -k filename *.fastq
file1.fastq    6474547
file2.fastq    2402290
file3.fastq    1771678

It is possible to use multiple keys. Consider the example for the find command where the primer names and number of mismatches are annotated as attributes. Now, the mismatch distribution for each primer can be analysed:

st count -k {a:f_primer} -k n:{a:f_dist} seqs.fa
primer1	0	249640
primer1	1	23831
primer1	2	2940
primer1	3	123
primer1	4	36
primer1	5	2
primer2	0	448703
primer2	1	60373
primer2	2	8996
primer2	3	691
primer2	4	34
primer2	5	7
primer2	6	1
N/A	5029
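
Grouping by several keys can be pictured as counting tuples of key values. The following Python sketch only illustrates that idea and is not how the tool is implemented; the record dictionaries and the `count_by_keys` helper are hypothetical.

```python
from collections import Counter

def count_by_keys(records, keys):
    """Count records grouped by the tuple of values found under `keys`.

    `records` is an iterable of dicts (hypothetical stand-ins for annotated
    sequences); a missing key is reported as "N/A".
    """
    counts = Counter()
    for rec in records:
        counts[tuple(rec.get(k, "N/A") for k in keys)] += 1
    return counts

# Hypothetical toy data: primer name and mismatch count per sequence
records = [
    {"f_primer": "primer1", "f_dist": 0},
    {"f_primer": "primer1", "f_dist": 0},
    {"f_primer": "primer1", "f_dist": 1},
    {"f_primer": "primer2", "f_dist": 2},
]

counts = count_by_keys(records, ["f_primer", "f_dist"])
# One tab-separated row per key combination, in order of first appearance
for key, n in counts.items():
    print(*key, n, sep="\t")
```

Each additional key adds one column to the grouping tuple, which matches the extra columns in the output above.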

If primers on both ends were searched, it might make sense to use a math expression to get the sum of distances for both primers.

st count -k {a:f_primer} -k {a:r_primer} -k "n:{{a:f_dist + a:r_dist}}" primer_trimmed.fq.gz
f_primer1	r_primer1	0	3457490
f_primer1	r_primer1	1	491811
f_primer1	r_primer1	2	6374
f_primer1	r_primer1	3	420
f_primer1	r_primer1	4	10
(...)

The curly braces are actually only needed if a string of multiple variables and/or text is composed. The n: prefix tells the tool that the distance is numeric, which is useful for correct sorting.
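
Why the numeric hint matters for sorting can be seen in a small Python sketch. It is an illustration with made-up values, not the tool's code: compared as text, '10' sorts before '2'.

```python
# Hypothetical distance values as they would appear as text keys
keys = ["0", "1", "10", "2"]

as_text = sorted(keys)              # lexicographic comparison
as_numbers = sorted(keys, key=int)  # numeric comparison

print(as_text)     # ['0', '1', '10', '2']
print(as_numbers)  # ['0', '1', '2', '10']
```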

With numeric keys, values can be summarized over intervals by using an n:<interval>: prefix. This example shows the GC content summarized in 10% windows:

st count -k n:10:{s:gc} seqs.fa
(20,30]	2
(30,40]	15
(40,50]	193
(50,60]	984
(60,70]	7

The intervals (start,end] are open at the start and closed at the end, meaning that start < value <= end.
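
Reading the (start,end] notation in its usual mathematical sense, a value equal to an upper limit falls into the bin that ends there. The Python sketch below shows one way to assign such a label; whether the tool computes its bins exactly like this is an assumption, and the `interval_label` function and its `upper_only` parameter (loosely mirroring --no-int) are hypothetical.

```python
import math

def interval_label(value, interval, upper_only=False):
    """Return the left-open, right-closed bin (start,end] containing `value`."""
    # Round up to the next multiple of `interval` to get the closed upper limit
    end = math.ceil(value / interval) * interval
    start = end - interval
    return str(end) if upper_only else f"({start},{end}]"

print(interval_label(25, 10))                   # (20,30]
print(interval_label(30, 10))                   # (20,30]
print(interval_label(30.5, 10))                 # (30,40]
print(interval_label(25, 10, upper_only=True))  # 30
```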
