variables
Many commands can use variables, and some of them will
also provide part of their output as variables. See below
for a list of global variables. They are normally written in
curly braces: {variable}. The following example recodes
sequence IDs to seq_1/2/3...:
st set -i seq_{num} seqs.fa > renamed.fa

The variables are grouped into categories. Aside from
'basic' variables, each category has its own prefix, separated from the
variable name by a colon (<prefix>:<variable>). Categories:
- 'Basic' variables (id, desc, num, filename, ...): no prefix
- Sequence attributes in the form 'key=value': a:<key>
- Metadata from associated lists: l:<fieldname> or l:<column_index>
- Sequence statistics: s:<name> (also available in the dedicated stat command)
- Variables provided by commands, currently: find (f:) and split (split:)
The prefix makes it possible to e.g. have list fields and attributes with the same name.
See below for a full list of all available variables.
Note that the variable is written in between curly brackets: {a:otu}.
The curly brackets are also required when variables are used in attributes.
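
To illustrate how `{variable}` placeholders with optional prefixes could be expanded, here is a minimal Python sketch. It is not how seqtool implements this; `expand`, `PLACEHOLDER` and the `record` dictionary are hypothetical names, and the sample values are made up.

```python
import re

# Matches one placeholder such as {num}, {a:otu} or {f:match::1}
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")

def expand(template, values):
    """Replace every {name} in template with str(values[name])."""
    def substitute(match):
        name = match.group(1)
        # Assumption of this sketch: unknown names are left untouched
        return str(values.get(name, match.group(0)))
    return PLACEHOLDER.sub(substitute, template)

# Hypothetical values for the third record of a file
record = {"id": "seqname.1234", "num": 3, "a:otu": "otu5"}

print(expand("seq_{num}", record))      # seq_3
print(expand("{id}_{num}", record))     # seqname.1234_3
print(expand("otu={a:otu}", record))    # otu=otu5
```

Because the prefix is part of the lookup key, an attribute `a:otu` and a list field `l:otu` would not collide in such a table.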
Variables provided by commands (and all others) can be written to the output in two ways: attributes and CSV/TXT output. This example uses regex matching:
st find -ir "([^\.]+).*" seqs.fa -p id={f:match::1}
# returns `>seqname.1234 id=seqname`
st find -ir "([^\.]+).*" seqs.fa --to-txt id,f:match::1,seq
# returns `seqname.1234 seqname SEQ`
# Note: curly brackets are not necessary here.

Mathematical expressions are written with double curly brackets. This example calculates the length of a match found by the find command.
st find -d3 GCATATCAATAAGCGGAGGA seqs.fa \
  -p match_len="{{f:end - f:start + 1}}"

If compiled with ExprTk support (which is the default for the provided binaries), filtering expressions are also possible using the filter command:

st filter "s:seqlen >= 100" input.fa > filtered.fa

ExprTk expressions can also handle strings. String variables have to be
explicitly marked as such using a preceding dot (.variable).

st filter ".id == 'id1' or .id == 'id2'" input.fa > filtered.fa

| variable | description |
|---|---|
| id | Record ID (in FASTA/FASTQ: everything before first space) |
| desc | Record description (everything after first space) |
| seq | Record sequence |
| num | Sequence number starting with 1 |
| path | Path to the current input file (or '-' if reading from STDIN) |
| filename | Name of the current input file with extension (or '-') |
| filestem | Name of the current input file without extension (or '-') |
| extension | Extension of the current input file. |
| dirname | Name of the base directory of the current file (or '') |
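
The descriptions in the table above can be sketched in Python as follows. This is only an approximation of the documented behaviour: `split_header` and `file_variables` are hypothetical helpers, and whether `extension` includes the leading dot is an assumption made here.

```python
import os.path

def split_header(header):
    """Split a FASTA/FASTQ header (without '>' or '@') at the first space."""
    record_id, _, desc = header.partition(" ")
    return record_id, desc

def file_variables(path):
    """Derive filename, filestem, extension and dirname from an input path."""
    filename = os.path.basename(path)
    filestem, ext = os.path.splitext(filename)
    return {
        "path": path,
        "filename": filename,
        "filestem": filestem,
        # Assumption: the extension is reported without the leading dot
        "extension": ext.lstrip("."),
        "dirname": os.path.dirname(path),
    }

print(split_header("seqname.1234 size=3 otu=otu5"))
print(file_variables("data/run1/seqs.fasta"))
```

For a file in the current directory, `os.path.dirname` returns an empty string, which matches the '' noted for dirname in the table.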

Adding the sequence number to the ID:

st set -i {id}_{num}

Counting the number of records per file in the input:

st count -k filename *.fasta

| variable | description |
|---|---|
| s:seqlen | Sequence length |
| s:ungapped_len | Sequence length without gaps (-) |
| s:gc | GC content as percentage of total bases. Lowercase (=masked) letters / characters other than ACGTU are not counted. |
| s:count | Count the occurrences of one or more characters. Usage: s:count:<characters>. Note that some characters (like '-') cannot be specified in math expressions. |
| s:exp_err | Total number of errors expected in the sequence, calculated from the quality scores as the sum of all error probabilities. For FASTQ, make sure to specify the correct format (--fmt) in case the scores are not in the Sanger/Illumina 1.8+ format. |
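
The expected error count described for s:exp_err can be written out as a short Python sketch: each quality character is converted to a Phred score Q, and the error probabilities 10^(-Q/10) are summed. The function name `expected_errors` is hypothetical, and the default offset of 33 corresponds to the Sanger/Illumina 1.8+ encoding; other encodings would need a different offset.

```python
def expected_errors(quality_string, offset=33):
    """Sum of per-base error probabilities, p = 10^(-Q/10)."""
    total = 0.0
    for char in quality_string:
        q = ord(char) - offset      # Phred score of this base
        total += 10 ** (-q / 10)
    return total

# With offset 33, 'I' encodes Q40 (p = 0.0001) and '5' encodes Q20 (p = 0.01)
print(expected_errors("IIII"))
print(expected_errors("55"))
```

Reading the scores with the wrong offset shifts every Q by a constant and changes the result by orders of magnitude, which is why the format has to be specified correctly.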

Get absolute GC content (not relative to sequence length):

st stat count:GC input.fa

Attributes stored in FASTA/FASTQ headers in the form key=value.

Summarizing over an attribute in the FASTA header >id size=3:

st count -k a:size seqs.fa

Adding the sequence length to the header as attribute:

st . -a seqlen={s:seqlen} seqs.fa

Fields from associated lists (-l argument). Specify either a column number,
e.g. {l:4}, or a column name ({l:<fieldname>}) if there is a header. With
multiple -l arguments, the lists can be selected in the same order
using l:<field>, l2:<field>, l3:<field>, and so on.
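
The two ways of addressing a list field (by column number or by column name) can be sketched in Python as below. This is an illustration only: `list_field` is a hypothetical helper, the sample row is made up, and column numbers are assumed to be 1-based, as the use of {l:2} and {l:3} for BED start and end coordinates suggests.

```python
def list_field(row, key, header=None):
    """Return one field of a list row, by 1-based column number or by name."""
    if key.isdigit():
        return row[int(key) - 1]          # "2" -> second column
    if header is None:
        raise KeyError(f"column name {key!r} needs a header line")
    return row[header.index(key)]

# A BED-like row: sequence ID, start, end (header names are hypothetical)
header = ["id", "start", "end"]
row = ["seq1", "10", "25"]

print(list_field(row, "2"))               # 10
print(list_field(row, "end", header))     # 25
```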
Extracting sequences with coordinates stored in a BED file:
st trim -l coordinates.bed -0 {l:2}..{l:3} input.fa > output.fa

Math expressions with variables. Common operators and functions can be used (+, -, *, /, %, ^, min, max, sqrt, abs, exp, trigonometric functions, ...). Boolean expressions are possible with common operators and keywords (and/or/not/...). See http://www.partow.net/programming/exprtk/ and https://github.com/ArashPartow/exprtk/blob/master/readme.txt for more information. Math expressions are also used by the 'filter' command.
Setting a GC content attribute as fraction instead of percentage:
st . -p gc={{s:gc / 100}} seqs.fa

Removing DNA sequences with more than 10% of ambiguous bases (that is, keeping those in which at least 90% of the characters are A, T, G or C):

st filter 's:count:ATGC / s:seqlen >= 0.9' input.fa

Selecting IDs with a certain pattern:
st filter ".id like 'AB*'" input.fa

Selecting IDs from a list:

st filter -uml id_list.txt 'def(l:1)' seqs.fa
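
The logic of the last two filters can be approximated in Python as follows. This is a sketch under assumptions, not seqtool or ExprTk code: `fnmatchcase` stands in for the `like` operator (only the `*` wildcard is relied upon here), and `def(l:1)` is assumed to be true when the record ID has an entry in the list. The helper names and sample IDs are hypothetical.

```python
from fnmatch import fnmatchcase

def matches_pattern(record_id, pattern):
    """Case-sensitive wildcard match; '*' stands for any run of characters."""
    return fnmatchcase(record_id, pattern)

def in_id_list(record_id, wanted_ids):
    """True if the record ID occurs in the set of wanted IDs."""
    return record_id in wanted_ids

ids = ["AB001", "AB002", "CD001"]
wanted = {"AB002", "CD001"}      # hypothetical content of an ID list

print([i for i in ids if matches_pattern(i, "AB*")])   # ['AB001', 'AB002']
print([i for i in ids if in_id_list(i, wanted)])       # ['AB002', 'CD001']
```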