-
Notifications
You must be signed in to change notification settings - Fork 2
filter
Filters sequences by a mathematical expression which may contain any variable.
Usage:
st filter [options][-a <attr>...][-l <list>...] <expression> [<input>...]
st filter (-h | --help)
st filter --help-vars
Options:
--dropped <file> Output file for sequences that were removed by filtering.
The extension is autorecognized if possible, fallback
is the input format.
See this page for the options common to all commands.
Removing sequences shorter than 100 bp:
st filter "s:seqlen >= 100" input.fa > filtered.faRemoving DNA sequences with more than 10% of ambiguous bases:
st filter "s:count:ATGC / s:seqlen >= 0.1" input.fa > filtered.faQuick and easy way to select certain sequences (for more advanced filtering using lists, see below):
st filter ".id == 'id1' or .id like 'AB*'" input.fa > filtered.faNote the . before the variable name. This indicates that this is a
string comparison.
Keeping only sequences with less than five primer mismatches (stored in the
f_dist attribute, see example for the find command),
and which are also long enough:
st filter "a:f_dist < 5 and s:seqlen > 100" primer_search.fa > filtered.faFairly advanced expressions, even small scripts are possible. An overview of the syntax can be found here.
Note: This command is only available if compiling with the exprtk feature.
(cargo build --release --features=exprtk) to activate usage of the
ExprTk C++ library.
However, the provided binaries include this feature by default.
The exp_err statistics variable represents the total expected number of errors
in a sequence, as provided by the quality scores. See here
for more information on reading them.
This example removes sequences with less than one expected error. The
output is the same as from fastq_filter USEARCH.
st filter 's:exp_err >= 1' input.fq > filtered.fqNormalization according to sequence length is easily possible with
a math formula (corresponding to -fastq_maxee_rate of USEARCH).
st filter 's:exp_err / s:seqlen >= 0.002' input.fq > filtered.fqUndefined variables can occur if a record could not
be found in an associated list, an attribute was not present
in the header, or the requested match/match group was not found by
the find command. Whether a variable is defined can be checked
using the def() function. It returns true if the given variable is defined for
a given sequence.
Example for retrieving sequences stored in a list of IDs (see also here):
st filter -uml id_list.txt "def(l:1)" seqs.fa > in_list.faNote that empty strings are also treated as undefined. Consider this FASTA file:
>id1
SEQUENCE
>id2 value=20
SEQUENCE
>id3 value=
SEQUENCE
The following command does an additional check if the attribute value
is defined or not:
st filter "def(a:value) and a:value > 5" seqs.faOutput:
>id2 value=20
SEQ
Since seq3 has an empty value attribute, it is also removed by filtering.
Also note: Undefined variables are represented by NaN. In ExprTk,
comparisons to NaN always result in false. Therefore, the check using
def() is not actually necessary in this case because a:value > 5 would
anyway return false for seq1 and seq3.