variables

markschl edited this page Aug 19, 2018 · 9 revisions

Variables

Many commands can use variables, and some of them will also provide part of their output as variables. See below for a list of global variables. They are normally written in curly braces: {variable}. The following example recodes sequence IDs to seq_1/2/3...:

st set -i seq_{num} seqs.fa > renamed.fa

The variables fall into several categories. Apart from 'basic' variables, each category has its own prefix, separated from the variable name by a colon (<prefix>:<variable>). Categories:

  • 'Basic' variables (id, desc, num, filename, ...): no prefix
  • Sequence attributes in the form 'key=value': a:<key>
  • Metadata from associated lists: l:<fieldname> or l:<column_index>
  • Sequence statistics: s:<name> (also available in dedicated stat command)
  • Variables provided by commands, currently: find (f:) and split (split:)

The prefix makes it possible to e.g. have list fields and attributes with the same name.

See below for a full list of all available variables.

Note that variables are written between curly brackets, e.g. {a:otu}. The brackets are also required when variables are used in attributes.
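To make the placeholder syntax concrete, here is a minimal Python sketch of how a template with {variable} and {prefix:variable} placeholders could be expanded for one record. It is only an illustration of the notation, not seqtool's actual implementation; the `expand` function and the `record` dictionary are hypothetical names.

```python
import re

# Hypothetical values for a single record; the keys mirror the
# variable categories described above (assumption, for illustration).
record = {
    "id": "seq7",       # basic variable, no prefix
    "num": 3,           # basic variable, no prefix
    "a:otu": "otu12",   # attribute 'otu=otu12' from the header
    "s:seqlen": 250,    # sequence statistic
}

def expand(template, values):
    """Replace each {name} or {prefix:name} placeholder with its value."""
    def lookup(match):
        return str(values[match.group(1)])
    # Handles single curly brackets only; {{...}} math expressions
    # are outside the scope of this sketch.
    return re.sub(r"\{([^{}]+)\}", lookup, template)

print(expand("seq_{num}", record))                        # seq_3
print(expand("{id} otu={a:otu} len={s:seqlen}", record))  # seq7 otu=otu12 len=250
```

Because the prefix is part of the lookup key, an attribute and a list field with the same name would not collide in such a scheme.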

Writing to output

Variables provided by commands (and all others) can be written to the output in two ways: attributes and CSV/TXT output. This example uses regex matching:

st find -ir "([^\.]+).*" seqs.fa -p id={f:match::1}
# returns `>seqname.1234 id=seqname`

st find -ir "([^\.]+).*" seqs.fa --to-txt id,f:match::1,seq
# returns `seqname.1234 seqname SEQ`
# Note: curly brackets are not necessary here.

Math expressions

Mathematical expressions are written with double curly brackets. This example calculates the length of a match found by the find command.

st find -d3 GCATATCAATAAGCGGAGGA seqs.fa \
  -p match_len="{{f:end - f:start + 1}}"
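The "+ 1" in the expression above suggests that f:start and f:end are 1-based, inclusive coordinates. Under that assumption (it is not stated explicitly here), the arithmetic can be checked with a short Python sketch that uses an exact string search as a stand-in for the find command:

```python
seq = "TTTTGCATATCAATAAGCGGAGGATT"
pattern = "GCATATCAATAAGCGGAGGA"

idx = seq.find(pattern)        # 0-based position of the exact match
start = idx + 1                # assumed 1-based start, like f:start
end = idx + len(pattern)       # assumed 1-based inclusive end, like f:end

match_len = end - start + 1    # same formula as in the math expression
print(start, end, match_len)   # 5 24 20
```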

If compiled with ExprTk support (which is the default for the provided binaries), filtering expressions are also possible using the filter command:

st filter "s:seqlen >= 100" input.fa > filtered.fa

String variables

ExprTk expressions can also handle strings. String variables have to be explicitly marked as such using a preceding dot (.variable).

st filter ".id == 'id1' or .id == 'id2'" input.fa > filtered.fa

List of variables

Standard variables without prefix. Usage: <variable>

variable   description
id         Record ID (in FASTA/FASTQ: everything before first space)
desc       Record description (everything after first space)
seq        Record sequence
num        Sequence number starting with 1
path       Path to the current input file (or '-' if reading from STDIN)
filename   Name of the current input file with extension (or '-')
filestem   Name of the current input file without extension (or '-')
extension  Extension of the current input file
dirname    Name of the base directory of the current file (or '')

Examples:

Adding the sequence number to the ID:

st set -i {id}_{num}

Counting the number of records per file in the input:

st count -k filename *.fasta

Sequence statistics. Usage: s:<variable>[:opts]

variable        description
s:seqlen        Sequence length
s:ungapped_len  Sequence length without gaps (-)
s:gc            GC content as percentage of total bases. Lowercase (=masked) letters / characters other than ACGTU are not counted.
s:count         Number of occurrences of one or more characters. Usage: s:count:<characters>. Note that some characters (like '-') cannot be specified in math expressions.
s:exp_err       Total number of errors expected in the sequence, calculated from the quality scores as the sum of all error probabilities. For FASTQ, make sure to specify the correct format (--fmt) in case the scores are not in the Sanger/Illumina 1.8+ format.
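The "sum of all error probabilities" behind s:exp_err can be sketched in Python as follows. This uses the standard Phred relation p = 10^(-Q/10) and assumes an ASCII offset of 33 (Sanger/Illumina 1.8+); the function name `expected_errors` is made up for this illustration and is not part of seqtool.

```python
def expected_errors(quality_line, offset=33):
    """Sum the per-base error probabilities of a FASTQ quality line.

    Each character encodes a Phred score Q = ord(char) - offset,
    and the corresponding error probability is 10 ** (-Q / 10).
    """
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality_line)

# 'I' = Q40 (p = 0.0001), '5' = Q20 (p = 0.01), '+' = Q10 (p = 0.1)
print(round(expected_errors("I5+"), 4))  # 0.1101
```

Decoding the same quality line with a different offset (for example 64, as in older Illumina formats) gives very different probabilities, which is why the format has to be specified correctly.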

Example:

Get absolute GC content (not relative to sequence length):

st stat count:GC input.fa

Attributes. Usage: a:<name>

Attributes stored in FASTA/FASTQ headers in the form key=value

Examples:

Summarizing over an attribute in the FASTA header >id size=3:

st count -k a:size seqs.fa

Adding the sequence length to the header as attribute:

st . -a seqlen={s:seqlen} seqs.fa

Entries of associated lists. Usage: l:<field>

Fields from associated lists (-l argument). Specify either a column number, e.g. {l:4}, or a column name ({l:<fieldname>}) if there is a header. With multiple -l arguments, the lists can be selected in the order given using l:<field>, l2:<field>, l3:<field>, and so on.

Example:

Extracting sequences with coordinates stored in a BED file:

st trim -l coordinates.bed -0 {l:2}..{l:3} input.fa > output.fa
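As a rough picture of what this command does, the following Python sketch looks up the start and end columns of a BED-like list for each record and cuts the sequence accordingly. It assumes that the first list column holds the sequence ID and that -0 means 0-based coordinates with an exclusive end (the BED convention). The data and the `trim` helper are invented for illustration.

```python
import csv
import io

# Hypothetical BED content: ID, start (0-based), end (exclusive)
bed_text = "seq1\t2\t6\nseq2\t0\t3\n"
records = [("seq1", "AACCGGTT"), ("seq2", "TTTGGG")]

# Column 2 and 3 of the list correspond to {l:2} and {l:3}
coords = {}
for row in csv.reader(io.StringIO(bed_text), delimiter="\t"):
    coords[row[0]] = (int(row[1]), int(row[2]))

def trim(records, coords):
    """Yield (id, subsequence) using 0-based, end-exclusive ranges."""
    for seq_id, seq in records:
        start, end = coords[seq_id]
        yield seq_id, seq[start:end]

for seq_id, seq in trim(records, coords):
    print(seq_id, seq)
# seq1 CCGG
# seq2 TTT
```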

Math expressions. Usage: {{<expression>}}

Math expressions can contain variables. Common operators and functions can be used (+, -, *, /, %, ^, min, max, sqrt, abs, exp, trigonometric functions, ...). Boolean expressions are possible with common operators and keywords (and/or/not/...). See http://www.partow.net/programming/exprtk/ and https://github.com/ArashPartow/exprtk/blob/master/readme.txt for more information. Math expressions are also used by the 'filter' command.

Examples:

Setting a GC content attribute as fraction instead of percentage:

st . -p gc={{s:gc / 100}} seqs.fa

Removing DNA sequences with more than 10% ambiguous bases (i.e. keeping those in which at least 90% of the bases are A, T, G or C):

st filter 's:count:ATGC / s:seqlen >= 0.9' input.fa
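The intent stated above (dropping records with more than 10% ambiguous bases) can be sketched in Python as a check on the fraction of unambiguous bases. The helper names are hypothetical, and the sketch assumes that counting is case-sensitive, so lowercase bases would not be counted.

```python
def unambiguous_fraction(seq, alphabet="ATGC"):
    """Fraction of characters in seq that belong to alphabet."""
    if not seq:
        return 0.0
    return sum(seq.count(base) for base in alphabet) / len(seq)

def keep(seq, min_fraction=0.9):
    """True if at most 10% of the bases are ambiguous."""
    return unambiguous_fraction(seq) >= min_fraction

print(keep("ACGTACGTAC"))  # True  (no ambiguous bases)
print(keep("ACGTACGTNN"))  # False (20% ambiguous)
```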

Selecting IDs with a certain pattern:

st filter ".id like 'AB*'" input.fa

Selecting IDs from a list:

st filter -uml id_list.txt 'def(l:1)' seqs.fa
