FDSTools Tools
This page gives a brief description of each tool in FDSTools. Descriptions of the command line arguments are included. The same information can be obtained from the FDSTools command line by running the command fdstools --help (for the list of tools) or fdstools --help toolname (for a description of that particular tool).
fdstools
Data analysis tools for Massively Parallel Sequencing of forensic DNA markers, including tools for characterisation and filtering of PCR stutter artefacts and other systemic noise, and for automatic detection of the alleles in a sample.
usage: fdstools [-h] [-v] [-d] TOOL ...
optional arguments:
  -h, --help  show this help message, or help for the specified TOOL, and exit
  -v, --version  show version number and exit
  -d, --debug  if specified, additional debug output is given
available tools:
  TOOL  specify which tool to run
    allelefinder  Find true alleles in reference samples and detect possible contaminations.
    bganalyse  Analyse the amount of noise in reference samples.
    bgcorrect  Match background noise profiles (obtained from e.g., bgestimate) to samples.
    bgestimate  Estimate allele-centric background noise profiles (means) from reference samples.
    bghomraw  Compute noise ratios for all noise detected in homozygous reference samples.
    bghomstats  Compute allele-centric statistics for background noise in homozygous reference samples (min, max, mean, sample variance).
    bgmerge  Merge multiple files containing background noise profiles.
    bgpredict  Predict background profiles of new alleles based on a model of stutter occurrence obtained from stuttermodel.
    findnewalleles  Mark all sequences that are not in another list of sequences.
    libconvert  Convert a legacy TSSV library file (tab-separated) to the FDSTools library format (ini-style).
    library  Create an empty FDSTools library file.
    mps2ce  Convert sample data file to GeneMapper/GeneMarker format, so that MPS data can be used with tools developed for CE data.
    pipeline  Automatically run complete, predefined analysis pipelines. Recommended starting point for new users.
    samplestats  Compute various statistics for each sequence in the given sample data file and perform threshold-based allele calling.
    seqconvert  Convert between raw sequences, TSSV-style sequences, and allele names.
    stuttermark  Mark potential stutter products by assuming a fixed maximum percentage of stutter product vs the parent sequence.
    stuttermodel  Train a stutter prediction model using homozygous reference samples.
    tssv  Link raw reads in a FastA or FastQ file to markers and count the number of reads for each unique sequence.
    vis  Create a data visualisation web page or Vega graph specification.
allelefinder
Find true alleles in reference samples and detect possible contaminations.
In each sample, the sequences with the highest read counts of each marker are called alleles, with a user-defined maximum number of alleles per marker. The allele balance is kept within given bounds. If the highest non-allelic sequence exceeds a given limit, no alleles are called for this marker. If this happens for multiple markers in one sample, no alleles are called for this sample at all.
If the input file contains a 'flags' column, any sequences with a flag starting with 'STUTTER' will be ignored. Therefore, it is highly recommended to run Allelefinder on the output of Stuttermark.
The allele list obtained from allelefinder should always be checked carefully before using it as the input of various other tools operating on reference samples. These tools rely heavily on the correctness of this file to do their job. One may use the allelefinder report (-R/--report output argument) and the bganalyse tool to get a quick overview of what might be wrong.
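The calling rules described above can be illustrated with a minimal Python sketch. This is not the FDSTools implementation: the function names call_alleles and sample_rejected are hypothetical, the input is simplified to a mapping of sequence to read count for a single marker, and the way the thresholds interact (in particular how --min-reads-lowest is applied) is an assumption. The default values mirror the defaults of the -m, -M, -n, -N and -x options of allelefinder.

```python
def call_alleles(read_counts, max_alleles=2, min_allele_pct=30.0,
                 max_noise_pct=10.0, min_reads=50, min_reads_lowest=15):
    """Return the called alleles of one marker, or [] if the marker is rejected.

    read_counts maps each sequence of the marker to its number of reads.
    Simplified illustration only; not the actual FDSTools code.
    """
    ranked = sorted(read_counts.items(), key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] < min_reads:
        return []  # the highest sequence has too few reads
    top = ranked[0][1]
    alleles = [ranked[0][0]]
    for sequence, reads in ranked[1:]:
        pct_of_top = 100.0 * reads / top
        if (len(alleles) < max_alleles and reads >= min_reads_lowest
                and pct_of_top >= min_allele_pct):
            alleles.append(sequence)
        elif pct_of_top >= max_noise_pct:
            return []  # highest non-allelic sequence exceeds the noise limit
        else:
            break  # sorted order: all remaining sequences are lower still
    return alleles


def sample_rejected(noisy_markers, total_markers, max_noisy=0.1):
    """Apply the max-noisy rule: a fraction if below 1, else an absolute number."""
    limit = max_noisy * total_markers if max_noisy < 1 else max_noisy
    return noisy_markers > limit
```

For example, a marker with read counts 1000, 800 and 50 yields two alleles in this sketch, whereas read counts of 1000 and 200 yield none: the second sequence is too low to be an allele (20% of the highest) but too high to be accepted as noise.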
usage: fdstools allelefinder [-h] [-v] [-d] [-o FILE] [-R FILE] [-e REGEX] [-f EXPR] [-m PCT] [-M PCT] [-n N] [-N N] [-a N] [-x X] [-F FORMAT] [-l LIBRARY] [FILE ...]
positional arguments:
  FILE  the sample data file(s) to process (default: read from stdin)
optional arguments:
  -h, --help  show this help message and exit
  -v, --version  show version number and exit
  -d, --debug  if specified, additional debug output is given
output file options:
  -o, --output FILE  file to write output to (default: write to stdout)
  -R, --report FILE  file to write a report to (default: write to stderr)
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check https://docs.python.org/howto/regex
  -e, --tag-expr REGEX  regular expression that captures (using one or more capturing groups) the sample tags from the file names; by default, the entire file name except for its extension (if any) is captured
  -f, --tag-format EXPR  format of the sample tags produced; a capturing group reference like '\n' refers to the n-th capturing group in the regular expression specified with -e/--tag-expr (the default of '\1' simply uses the first capturing group); with a single sample, you can enter the sample tag here explicitly
filtering options:
  -m, --min-allele-pct PCT  call heterozygous if the second allele is at least this percentage of the highest allele of a marker (default: 30.0)
  -M, --max-noise-pct PCT  a sample is considered contaminated/unsuitable for a marker if the highest non-allelic sequence is at least this percentage of the highest allele of that marker (default: 10.0)
  -n, --min-reads N  require at least this number of reads for the highest allele of each marker (default: 50)
  -N, --min-reads-lowest N  require at least this number of reads for the lowest allele of each marker (default: 15)
  -a, --max-alleles N  allow no more than this number of alleles per marker; if unspecified, the amounts given in the library file are used, which have a default value of 1 for markers on the mitochondrial genome and Y chromosome, or 2 otherwise
  -x, --max-noisy X  entirely reject a sample if more than this fraction of markers (if less than 1) or absolute number of markers (if 1 or more) have a high non-allelic sequence (default: 0.1)
sequence format options:
  -F, --sequence-format FORMAT  convert sequences to the specified format: one of raw, tssv, allelename (default: no conversion)
  -l, --library LIBRARY  library file with marker definitions; custom file or built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB', 'ForenSeqB-UAS', 'ID-OmniSTR', 'PowerSeq46GY'
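The -e/--tag-expr and -f/--tag-format options recur in many of the tools below. The following Python sketch shows how a capturing-group expression and a format string such as '\1' could combine to turn a file name into a sample tag. The function name sample_tag and the default expression are assumptions made for illustration; the expression only approximates the documented default behaviour (the file name without its extension), and FDSTools may implement this differently.

```python
import os
import re


def sample_tag(path, tag_expr=r"^(.*?)(?:\.[^.]+)?$", tag_format=r"\1"):
    """Derive a sample tag from a file name (illustrative sketch only).

    tag_expr plays the role of -e/--tag-expr and tag_format that of
    -f/--tag-format; '\\1' refers to the first capturing group.
    """
    name = os.path.basename(path)
    match = re.search(tag_expr, name)
    if match is None:
        raise ValueError("tag expression does not match %r" % name)
    # Substitute the capturing-group references in the format string.
    return match.expand(tag_format)
```

With the assumed default expression, sample_tag("/data/run7/sample01_R1.csv") gives "sample01_R1". Passing the expression r"^(\w+?)_R(\d)" together with the format r"\1-\2" gives "sample01-1" for the same file.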
bganalyse
Analyse the amount of noise in reference samples.
Use this tool after correcting the reference samples with BGCorrect to analyse the amount of remaining noise after correction. This way, potentially contaminated or otherwise 'dirty' reference samples can be detected. The highest amount of remaining noise can be interpreted as a lower bound to the reliable detection of a minor contributor's alleles in mixed DNA samples.
In the default mode ('full'), the lowest, highest, and total number of background/noise reads, as well as the respective percentages w.r.t. the number of allelic reads of each marker in each sample, are printed. This data can be visualised using fdstools vis bganalyse.
In the alternative 'percentiles' mode, the highest and the total number of background reads as a percentage of the number of allelic reads for each marker is given at selected percentiles of the samples. I.e., it gives the highest and total remaining noise considering only the cleanest x% of samples, for different values of x.
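To make the 'percentiles' mode more concrete, the Python sketch below takes the remaining noise of one marker across samples (as a percentage of the allelic reads) and reports the value reached when only the cleanest x% of samples are considered. The function name noise_at_percentiles is hypothetical, and the nearest-rank percentile definition is an assumption; bganalyse may compute percentiles differently.

```python
import math


def noise_at_percentiles(noise_pcts, percentiles=(100, 99, 95, 90)):
    """Return {percentile: noise percentage} using the nearest-rank method.

    noise_pcts holds one value per sample, e.g. the highest remaining
    noise of a marker as a percentage of the allelic reads.
    """
    if not noise_pcts:
        raise ValueError("at least one sample is required")
    ordered = sorted(noise_pcts)  # cleanest sample first
    result = {}
    for pct in percentiles:
        # Number of samples that make up the cleanest pct% (at least one).
        rank = max(1, math.ceil(pct * len(ordered) / 100))
        result[pct] = ordered[rank - 1]
    return result
```

With ten samples of which one is clearly dirtier than the rest, the value at the 90th percentile in this sketch excludes that sample, while the values at the 95th, 99th and 100th percentiles include it.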
usage: fdstools bganalyse [-h] [-v] [-d] [-m MODE] [-p PCT] [-o FILE] [-e REGEX] [-f EXPR] [-a ALLELEFILE] [-c COLNAME] [-F FORMAT] [-l LIBRARY] [FILE ...]
positional arguments:
  FILE  the sample data file(s) to process (default: read from stdin)
optional arguments:
  -h, --help  show this help message and exit
  -v, --version  show version number and exit
  -d, --debug  if specified, additional debug output is given
  -m, --mode MODE  controls what kind of information is printed; 'full' (the default) prints the lowest, highest, and total number of background reads as well as the respective percentages w.r.t. the number of allelic reads of each marker in each sample; 'percentiles' prints the highest and the total number of background reads as a percentage of the number of allelic reads for each marker at given percentiles
  -p, --percentiles PCT  comma-separated list of percentiles to report when -m/--mode is set to 'percentiles' (default: 100,99,95,90)
output file options:
  -o, --output FILE  file to write output to (default: write to stdout)
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check https://docs.python.org/howto/regex
  -e, --tag-expr REGEX  regular expression that captures (using one or more capturing groups) the sample tags from the file names; by default, the entire file name except for its extension (if any) is captured
  -f, --tag-format EXPR  format of the sample tags produced; a capturing group reference like '\n' refers to the n-th capturing group in the regular expression specified with -e/--tag-expr (the default of '\1' simply uses the first capturing group); with a single sample, you can enter the sample tag here explicitly
allele detection options:
  -a, --allelelist ALLELEFILE  file containing a list of the true alleles of each sample (e.g., obtained from allelefinder)
  -c, --annotation-column COLNAME  name of a column in the sample files, which contains a value beginning with 'ALLELE' for the true alleles of the sample
sequence format options:
  -F, --sequence-format FORMAT  convert sequences to the specified format: one of raw, tssv, allelename (default: raw)
  -l, --library LIBRARY  library file with marker definitions; custom file or built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB', 'ForenSeqB-UAS', 'ID-OmniSTR', 'PowerSeq46GY'
bgcorrect
Match background noise profiles (obtained from e.g., bgestimate) to samples.
Eleven new columns are added to the output giving, for each sequence, the number of reads attributable to noise from other sequences (_noise columns) and the number of noise reads caused by the presence of this sequence (_add columns), as well as the resulting number of reads after correction (_corrected columns: original minus _noise plus _add).
The correction_flags column contains one of the following values:
  'not_corrected': no background noise profile was available for this marker;
  'not_in_ref_db': the sequence was not present in the noise profiles given;
  'corrected_as_background_only': the sequence was present in the noise profiles given, but only as noise and not as genuine allele;
  'corrected_bgpredict': the sequence was present in the noise profiles as a genuine allele, but its noise profile consists entirely of predictions as opposed to direct observations;
  'corrected_bgestimate'/'corrected_bghomstats': the sequence was present in the noise profiles as a genuine allele and at least part of its noise profile was based on direct observations.
Finally, the weight column gives the number of times that the noise profile of that allele fitted in the sample.
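As a worked illustration of the relation between the _noise, _add and _corrected columns, the Python sketch below applies "original minus noise plus add" to two rows. The keys 'total', 'noise' and 'add' are simplified placeholders rather than the actual FDSTools column names, and the numbers are made up for the example.

```python
def apply_correction(rows):
    """Return copies of the rows with a 'corrected' value added.

    corrected = original reads - reads attributed to noise from other
    sequences + noise reads caused by the presence of this sequence.
    """
    return [dict(row, corrected=row["total"] - row["noise"] + row["add"])
            for row in rows]


# Hypothetical example: 70 of the 80 reads of a stutter product are
# attributed to its parent allele and are therefore added back to it.
rows = [
    {"sequence": "allele", "total": 1000, "noise": 0, "add": 70},
    {"sequence": "stutter", "total": 80, "noise": 70, "add": 0},
]
corrected = apply_correction(rows)
```

In this example the allele ends up with 1070 corrected reads and the stutter product with 10.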
usage: fdstools bgcorrect [-h] [-v] [-d] [-i IN [IN ...]] [-o OUT [OUT ...]] [-e REGEX] [-f EXPR] [-C] [-M MARKER] [-F FORMAT] [-l LIBRARY] PROFILES [IN] [OUT]
positional arguments:
  PROFILES  file containing background noise profiles to match
optional arguments:
  -h, --help  show this help message and exit
  -v, --version  show version number and exit
  -d, --debug  if specified, additional debug output is given
  -C, --combine-strands  if specified, stutter noise correction will be done on the total number of reads, instead of separately for either strand
input file options:
  IN  single sample data file to process (default: read from stdin)
  -i, --input IN [IN ...]  multiple sample data files to process (use with -o/--output)
output file options:
  OUT  the file to write the output to (default: write to stdout)
  -o, --output OUT [OUT ...]  list of names of output files to match with input files specified with -i/--input, or a format string to construct file names from sample tags; e.g., the default value is '\1-bgcorrect.out', which expands to 'sampletag-bgcorrect.out'
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check https://docs.python.org/howto/regex
  -e, --tag-expr REGEX  regular expression that captures (using one or more capturing groups) the sample tags from the file names; by default, the entire file name except for its extension (if any) is captured
  -f, --tag-format EXPR  format of the sample tags produced; a capturing group reference like '\n' refers to the n-th capturing group in the regular expression specified with -e/--tag-expr (the default of '\1' simply uses the first capturing group); with a single sample, you can enter the sample tag here explicitly
filtering options:
  -M, --marker MARKER  work only on MARKER
sequence format options:
  -F, --sequence-format FORMAT  convert sequences to the specified format: one of raw, tssv, allelename (default: no conversion)
  -l, --library LIBRARY  library file with marker definitions; custom file or built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB', 'ForenSeqB-UAS', 'ID-OmniSTR', 'PowerSeq46GY'
bgestimate
Estimate allele-centric background noise profiles (means) from reference samples.
Compute a profile of recurring background noise for each unique allele in the database of reference samples. The profiles obtained can be used by bgcorrect to filter background noise from samples.
usage: fdstools bgestimate [-h] [-v] [-d] [-o FILE] [-R FILE] [-e REGEX] [-f EXPR] [-a ALLELEFILE] [-c COLNAME] [-C] [-m PCT] [-n N] [-s N] [-S PCT] [-g N] [-p FILE] [-M MARKER] [-H] [-l LIBRARY] [FILE ...]
positional arguments:
  FILE  the sample data file(s) to process (default: read from stdin)
optional arguments:
  -h, --help  show this help message and exit
  -v, --version  show version number and exit
  -d, --debug  if specified, additional debug output is given
  -C, --combine-strands  if specified, noise profiles will be calculated for the total number of reads, instead of separately for either strand
output file options:
  -o, --output FILE  file to write output to (default: write to stdout)
  -R, --report FILE  file to write a report to (default: write to stderr)
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check https://docs.python.org/howto/regex
  -e, --tag-expr REGEX  regular expression that captures (using one or more capturing groups) the sample tags from the file names; by default, the entire file name except for its extension (if any) is captured
  -f, --tag-format EXPR  format of the sample tags produced; a capturing group reference like '\n' refers to the n-th capturing group in the regular expression specified with -e/--tag-expr (the default of '\1' simply uses the first capturing group); with a single sample, you can enter the sample tag here explicitly
allele detection options:
  -a, --allelelist ALLELEFILE  file containing a list of the true alleles of each sample (e.g., obtained from allelefinder)
  -c, --annotation-column COLNAME  name of a column in the sample files, which contains a value beginning with 'ALLELE' for the true alleles of the sample
filtering options:
  -m, --min-pct PCT  minimum amount of background to consider, as a percentage of the highest allele (default: 0.50)
  -n, --min-abs N  minimum amount of background to consider, as an absolute number of reads for at least one orientation (default: 5)
  -s, --min-samples N  require this minimum number of samples for each true allele (default: 2)
  -S, --min-sample-pct PCT  require this minimum number of samples for each background product, as a percentage of the number of samples with a particular true allele (default: 80.0)
  -g, --min-genotypes N  require this minimum number of unique heterozygous genotypes for each allele for which no homozygous samples are available (default: 3)
  -p, --profiles FILE  use the given noise profiles file as a starting point
  -M, --marker MARKER  work only on MARKER
  -H, --homozygotes  if specified, only homozygous samples will be considered
sequence format options:
  -l, --library LIBRARY  library file with marker definitions; custom file or built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB', 'ForenSeqB-UAS', 'ID-OmniSTR', 'PowerSeq46GY'
bghomraw
Compute noise ratios for all noise detected in homozygous reference samples.
With this tool, separate data points are produced for each sample, which can be visualised using fdstools vis bgraw. Use bghomstats or bgestimate to compute aggregate statistics on noise instead.
usage: fdstools bghomraw [-h] [-v] [-d] [-o FILE] [-e REGEX] [-f EXPR] [-a ALLELEFILE] [-c COLNAME] [-C] [-m PCT] [-n N] [-s N] [-S PCT] [-M MARKER] [-F FORMAT] [-l LIBRARY] [FILE ...]
positional arguments:
  FILE  the sample data file(s) to process (default: read from stdin)
optional arguments:
  -h, --help  show this help message and exit
  -v, --version  show version number and exit
  -d, --debug  if specified, additional debug output is given
  -C, --combine-strands  if specified, noise ratios will be calculated for the total number of reads, instead of separately for either strand
output file options:
  -o, --output FILE  file to write output to (default: write to stdout)
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check https://docs.python.org/howto/regex
  -e, --tag-expr REGEX  regular expression that captures (using one or more capturing groups) the sample tags from the file names; by default, the entire file name except for its extension (if any) is captured
  -f, --tag-format EXPR  format of the sample tags produced; a capturing group reference like '\n' refers to the n-th capturing group in the regular expression specified with -e/--tag-expr (the default of '\1' simply uses the first capturing group); with a single sample, you can enter the sample tag here explicitly
allele detection options:
  -a, --allelelist ALLELEFILE  file containing a list of the true alleles of each sample (e.g., obtained from allelefinder)
  -c, --annotation-column COLNAME  name of a column in the sample files, which contains a value beginning with 'ALLELE' for the true alleles of the sample
filtering options:
  -m, --min-pct PCT  minimum amount of background to consider, as a percentage of the highest allele (default: 0.50)
  -n, --min-abs N  minimum amount of background to consider, as an absolute number of reads (default: 5)
  -s, --min-samples N  require this minimum number of samples for each true allele (default: 2)
  -S, --min-sample-pct PCT  require this minimum number of samples for each background product, as a percentage of the number of samples with a particular true allele (default: 80.0)
  -M, --marker MARKER  work only on MARKER
sequence format options:
  -F, --sequence-format FORMAT  convert sequences to the specified format: one of raw, tssv, allelename (default: raw)
  -l, --library LIBRARY  library file with marker definitions; custom file or built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB', 'ForenSeqB-UAS', 'ID-OmniSTR', 'PowerSeq46GY'
bghomstats
Compute allele-centric statistics for background noise in homozygous reference samples (min, max, mean, sample variance).
Compute a profile of recurring background noise for each unique allele in the database of reference samples. The profiles obtained can be used by bgcorrect to filter background noise from samples. If many reference samples are heterozygous (as is usually the case with forensic STR markers), it is preferable to use bgestimate instead, since it can handle heterozygous samples as well.
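The kind of allele-centric statistics meant here (min, max, mean, sample variance) can be sketched in Python as follows. The function name noise_stats and the input layout (one sequence-to-reads mapping per homozygous sample) are assumptions for illustration. The sketch expresses noise as a percentage of the allele's reads and counts a noise sequence that is absent from a sample as zero; it ignores strands and all filtering options, so it is not a substitute for bghomstats.

```python
import statistics


def noise_stats(samples, allele):
    """Compute per-sequence noise statistics for one homozygous allele.

    samples is a list of dicts mapping sequence -> number of reads; every
    sample is assumed to be homozygous for `allele`.
    """
    noise_sequences = {seq for reads in samples for seq in reads if seq != allele}
    stats = {}
    for sequence in sorted(noise_sequences):
        # Noise as a percentage of the allele's reads; absent counts as 0.
        ratios = [100.0 * reads.get(sequence, 0) / reads[allele]
                  for reads in samples]
        stats[sequence] = {
            "min": min(ratios),
            "max": max(ratios),
            "mean": statistics.mean(ratios),
            # Sample variance needs at least two samples.
            "variance": statistics.variance(ratios) if len(ratios) > 1 else None,
        }
    return stats
```

Note that the sample variance is only defined for two or more samples, which is consistent with the default of 2 for -s/--min-samples.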
usage: fdstools bghomstats [-h] [-v] [-d] [-o FILE] [-e REGEX] [-f EXPR] [-a ALLELEFILE] [-c COLNAME] [-C] [-m PCT] [-n N] [-s N] [-S PCT] [-M MARKER] [-F FORMAT] [-l LIBRARY] [FILE ...]
positional arguments:
  FILE  the sample data file(s) to process (default: read from stdin)
optional arguments:
  -h, --help  show this help message and exit
  -v, --version  show version number and exit
  -d, --debug  if specified, additional debug output is given
  -C, --combine-strands  if specified, noise statistics will be calculated for the total number of reads, instead of separately for either strand
output file options:
  -o, --output FILE  file to write output to (default: write to stdout)
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check https://docs.python.org/howto/regex
  -e, --tag-expr REGEX  regular expression that captures (using one or more capturing groups) the sample tags from the file names; by default, the entire file name except for its extension (if any) is captured
  -f, --tag-format EXPR  format of the sample tags produced; a capturing group reference like '\n' refers to the n-th capturing group in the regular expression specified with -e/--tag-expr (the default of '\1' simply uses the first capturing group); with a single sample, you can enter the sample tag here explicitly
allele detection options:
  -a, --allelelist ALLELEFILE  file containing a list of the true alleles of each sample (e.g., obtained from allelefinder)
  -c, --annotation-column COLNAME  name of a column in the sample files, which contains a value beginning with 'ALLELE' for the true alleles of the sample
filtering options:
  -m, --min-pct PCT  minimum amount of background to consider, as a percentage of the highest allele (default: 0.50)
  -n, --min-abs N  minimum amount of background to consider, as an absolute number of reads (default: 5)
  -s, --min-samples N  require this minimum number of samples for each true allele (default: 2)
  -S, --min-sample-pct PCT  require this minimum number of samples for each background product, as a percentage of the number of samples with a particular true allele (default: 80.0)
  -M, --marker MARKER  work only on MARKER
sequence format options:
  -F, --sequence-format FORMAT  convert sequences to the specified format: one of raw, tssv, allelename (default: no conversion)
  -l, --library LIBRARY  library file with marker definitions; custom file or built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB', 'ForenSeqB-UAS', 'ID-OmniSTR', 'PowerSeq46GY'
bgmerge
Merge multiple files containing background noise profiles.
Background noise profiles are merged in the order in which they are specified. If multiple files specify a different value for the same allele and sequence, the value of the first file is used.
It is convenient to pipe the output of bgpredict and/or bgestimate into bgmerge to merge that with an existing file containing background profiles. Specify '-' as one of the input files to read from stdin (i.e., read input from a pipe). If only one input file is specified, '-' is implicitly used as the second input file. Note that as a result, in case of conflicting values, the value in the specified input file will take precedence over the value in the data that was piped in.
Example: fdstools bgpredict ... | fdstools bgmerge old.txt > out.txt
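The precedence rule (the first file wins) can be sketched in Python as follows. The function name merge_profiles and the representation of a profile set as a dict keyed by (marker, allele, sequence) are assumptions made for illustration; the actual file format and merging code of bgmerge are not reproduced here.

```python
def merge_profiles(*profile_sets):
    """Merge profile sets in the order given; earlier values take precedence.

    Each profile set is a dict mapping (marker, allele, sequence) to a
    noise value.
    """
    merged = {}
    for profiles in profile_sets:
        for key, value in profiles.items():
            # Keep the value from the first set that defines this key.
            merged.setdefault(key, value)
    return merged


# Hypothetical data: 'old' stands for the file named on the command line,
# 'piped' for the data arriving on stdin (implicitly the second input).
old = {("MarkerX", "alleleA", "seq1"): 5.0}
piped = {("MarkerX", "alleleA", "seq1"): 6.0, ("MarkerX", "alleleA", "seq2"): 1.0}
merged = merge_profiles(old, piped)
```

Here the conflicting value for seq1 is taken from 'old', while seq2, which occurs only in the piped data, is added to the result.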
usage: fdstools bgmerge [-h] [-v] [-d] [-o FILE] [-l LIBRARY] FILE [FILE ...]
positional arguments:
  FILE  files containing the background noise profiles to combine; if a single file is given, it is merged with input from stdin; use '-' to use stdin as an explicit input source
optional arguments:
  -h, --help  show this help message and exit
  -v, --version  show version number and exit
  -d, --debug  if specified, additional debug output is given
output file options:
  -o, --output FILE  file to write output to (default: write to stdout)
sequence format options:
  -l, --library LIBRARY  library file with marker definitions; custom file or built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB', 'ForenSeqB-UAS', 'ID-OmniSTR', 'PowerSeq46GY'
bgpredict
Predict background profiles of new alleles based on a model of stutter occurrence obtained from stuttermodel.
This tool can be used to compute background noise profiles for alleles for which no reference samples are available. The profiles are predicted using a model of stutter occurrence that must have been created previously using stuttermodel. A list of sequences should be given; bgpredict will predict a background noise profile for each of the provided sequences separately. The prediction is based completely on the provided stutter model.
The predicted background noise profiles obtained from bgpredict can be combined with the output of bgestimate and/or bghomstats using bgmerge.
It is possible to use an entire forensic case sample as the SEQS input argument of bgpredict to obtain a predicted background noise profile for each sequence detected in the sample. When the background noise profiles thus obtained are combined with those obtained from bgestimate, bgcorrect may subsequently produce 'cleaner' results if the sample contained alleles for which no reference samples were available.
usage: fdstools bgpredict [-h] [-v] [-d] [-C] [-M MARKER] [-A] [-n PCT] [-t N] [-l LIBRARY] STUT SEQS [OUT]
positional arguments:
  STUT  file containing a trained stutter model
  SEQS  file containing the sequences for which a profile should be predicted
  OUT  the file to write the output to (default: write to stdout)
optional arguments:
  -h, --help  show this help message and exit
  -v, --version  show version number and exit
  -d, --debug  if specified, additional debug output is given
  -C, --combine-strands  if specified, stutter will be modeled for the total number of reads, instead of separately for either strand
  -M, --marker MARKER  assume the specified marker for all sequences
  -A, --use-all-data  if specified, the 'All data' model is used to predict stutter whenever no marker-specific model is available for a certain repeat unit
filtering options:
  -n, --min-pct PCT  minimum amount of background to consider, as a percentage of the highest allele (default: 0.50)
  -t, --min-r2 N  minimum required r-squared score (default: 0.0)
sequence format options:
  -l, --library LIBRARY  library file with marker definitions; custom file or built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB', 'ForenSeqB-UAS', 'ID-OmniSTR', 'PowerSeq46GY'
findnewalleles
Mark all sequences that are not in another list of sequences.
If not present, a new column 'flags' is added to the output. Any sequence that does not occur in the provided list of known sequences is flagged 'novel'.
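The flagging rule can be sketched in Python as follows. The function name flag_novel, the row layout (dicts with 'marker', 'sequence' and an optional 'flags' entry) and the use of a comma as the separator between flags are assumptions for illustration; they are not taken from the FDSTools file format.

```python
def flag_novel(rows, known):
    """Return copies of the rows in which unknown sequences are flagged 'novel'.

    known is a collection of (marker, sequence) pairs; a 'flags' entry is
    added to rows that do not have one yet.
    """
    known = set(known)
    flagged = []
    for row in rows:
        flags = [flag for flag in row.get("flags", "").split(",") if flag]
        if (row["marker"], row["sequence"]) not in known and "novel" not in flags:
            flags.append("novel")
        flagged.append(dict(row, flags=",".join(flags)))
    return flagged
```

A row whose sequence occurs in the known list keeps its existing flags unchanged in this sketch, and a row without a 'flags' entry receives an empty one.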
usage: fdstools findnewalleles [-h] [-v] [-d] [-r] [-i IN [IN ...]] [-o OUT [OUT ...]] [-e REGEX] [-f EXPR] [-l LIBRARY] KNOWN [IN] [OUT]
positional arguments:
  KNOWN  file containing a list of known allelic sequences
optional arguments:
  -h, --help  show this help message and exit
  -v, --version  show version number and exit
  -d, --debug  if specified, additional debug output is given
  -r, --remove-allele-flags  remove the 'allele' flag from the alleles that are marked 'novel'
input file options:
  IN  single sample data file to process (default: read from stdin)
  -i, --input IN [IN ...]  multiple sample data files to process (use with -o/--output)
output file options:
  OUT  the file to write the output to (default: write to stdout)
  -o, --output OUT [OUT ...]  list of names of output files to match with input files specified with -i/--input, or a format string to construct file names from sample tags; e.g., the default value is '\1-findnewalleles.out', which expands to 'sampletag-findnewalleles.out'
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check https://docs.python.org/howto/regex
  -e, --tag-expr REGEX  regular expression that captures (using one or more capturing groups) the sample tags from the file names; by default, the entire file name except for its extension (if any) is captured
  -f, --tag-format EXPR  format of the sample tags produced; a capturing group reference like '\n' refers to the n-th capturing group in the regular expression specified with -e/--tag-expr (the default of '\1' simply uses the first capturing group); with a single sample, you can enter the sample tag here explicitly
sequence format options:
  -l, --library LIBRARY  library file with marker definitions; custom file or built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB', 'ForenSeqB-UAS', 'ID-OmniSTR', 'PowerSeq46GY'
libconvert
Convert a legacy TSSV library file (tab-separated) to the FDSTools library format (ini-style).
This is a convenience tool for users migrating from the standalone 'TSSV' programme. Use the 'library' tool if you wish to create a new, empty FDSTools library file to start with.
Both FDSTools and the standalone 'TSSV' programme use a library file to store the names, flanking (primer) sequences, and STR repeat structure of forensic STR markers. However, the TSSV library file format is not well suited for non-STR markers and automatic generation of allele names. FDSTools therefore employs a different (ini-style) library file format that can store more details about the markers used. The libconvert tool can be used to convert old library files to the new format.
Please refer to the help of the 'library' tool for more information about FDSTools library files.
usage: fdstools libconvert [-h] [-v] [-d] [IN] [OUT]
positional arguments:
  IN  input library in the legacy TSSV format (default: read from stdin)
  OUT  the file to write the FDSTools library to (default: write to stdout)
optional arguments:
  -h, --help  show this help message and exit
  -v, --version  show version number and exit
  -d, --debug  if specified, additional debug output is given
library
Create an empty FDSTools library file.
An FDSTools library file contains various details about the forensic markers used in the analysis, such as the genomic location, expected number of alleles, expected length range of alleles, etc. FDSTools primarily uses library files for configuring STRNaming, which is responsible for converting sequences to allele names and vice versa. This is true even for non-STR markers and fragments on the mitochondrial genome.
In its simplest form, the library file only contains the positions (on the human genome reference sequence, GRCh38) of the reported range of each marker. This is referred to as a 'smart' library file. Alternatively, markers can be explicitly configured, which was the default prior to FDSTools version 2.0. Explicit configuration is currently required when the analysed markers are non-human.
Users migrating from the standalone 'TSSV' programme may use the libconvert tool to convert their TSSV library file to FDSTools format.
usage: fdstools library [-h] [-v] [-d] [-t TYPE] [-m] [-b NAME] [OUT]
positional arguments:
  OUT  the file to write the output to (default: write to stdout)
optional arguments:
  -h, --help  show this help message and exit
  -v, --version  show version number and exit
  -d, --debug  if specified, additional debug output is given
  -t, --type TYPE  the type of markers that this library file will be used for; with 'smart' (the default), only the genomic positions of the analysed ranges (i.e., the amplicon excluding the primers) need to be specified and FDSTools will automatically detect and configure allele naming using STRNaming (currently only supported for markers in the human genome); 'full' will create a library file with all possible sections; 'str' or 'non-str' will only output sections used to explicitly define STR and non-STR markers, respectively
  -m, --microhaplotypes  if specified, the [microhaplotype_positions] section is included, which can be used to configure allele calling for microhaplotype targets
  -b, --builtin NAME  start with a built-in library file, choose from 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB', 'ForenSeqB-UAS', 'ID-OmniSTR', 'PowerSeq46GY'
mps2ce
Convert sample data file to GeneMapper/GeneMarker format, so that MPS data can be used with tools developed for CE data.
usage: fdstools mps2ce [-h] [-v] [-d] [-i IN [IN ...]] [-o OUT [OUT ...]] [-e REGEX] [-f EXPR] [-s SEPARATOR] [-p] [-a] [-r] [-F FORMAT] [-l LIBRARY] [IN] [OUT]
optional arguments: -h, --help show this help message and exit -v, --version show version number and exit -d, --debug if specified, additional debug output is given
input file options: IN single sample data file to process (default: read from stdin) -i, --input IN [IN ...] multiple sample data files to process (use with -o/--output)
output file options: OUT the file to write the output to (default: write to stdout) -o, --output OUT [OUT ...] list of names of output files to match with input files specified with -i/--input, or a format string to construct file names from sample tags; e.g., the default value is '\1-mps2ce.out', which expands to 'sampletag-mps2ce.out'
sample tag parsing options: for details about REGEX syntax and capturing groups, check https://docs.python.org/howto/regex -e, --tag-expr REGEX regular expression that captures (using one or more capturing groups) the sample tags from the file names; by default, the entire file name except for its extension (if any) is captured -f, --tag-format EXPR format of the sample tags produced; a capturing group reference like '\n' refers to the n-th capturing group in the regular expression specified with -e/--tag-expr (the default of '\1' simply uses the first capturing group); with a single sample, you can enter the sample tag here explicitly
output file format options: -s, --separator SEPARATOR delimiter used to separate the columns in the output file (default: a tab character) -p, --pair-columns-by-allele by default, all Height columns come after all Allele columns; specify this option to place each Height column directly after the corresponding Allele column
filtering options: -a, --alleles-only if specified, only sequences flagged as 'allele' are included in the output -r, --remove-non-str-markers if specified, non-STR markers are excluded from the output
sequence format options: -F, --sequence-format FORMAT convert sequences to the specified format: one of raw, tssv, allelename, ce (default: ce) -l, --library LIBRARY library file with marker definitions; custom file or built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB', 'ForenSeqB-UAS', 'ID-OmniSTR', 'PowerSeq46GY'
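The sample tag parsing options above (-e/--tag-expr and -f/--tag-format) and the output file name format string can be illustrated with Python's re module. This is a minimal sketch and not FDSTools' own code: the helper names, the example file name, and the example regular expression are assumptions made for illustration, and FDSTools may behave differently in edge cases.

```python
import re

def sample_tag(filename, tag_expr, tag_format=r"\1"):
    """Capture a sample tag from a file name (illustrative helper)."""
    match = re.search(tag_expr, filename)
    if match is None:
        raise ValueError(f"{filename!r} does not match {tag_expr!r}")
    # '\1', '\2', ... in tag_format refer to the capturing groups of tag_expr.
    return match.expand(tag_format)

def output_name(tag, out_format=r"\1-mps2ce.out"):
    """Fill the sample tag into an output file name format string."""
    return out_format.replace(r"\1", tag)

# Hypothetical file name and expression: capture the run and sample numbers.
tag = sample_tag("run7_sample12.csv", r"run(\d+)_sample(\d+)", r"S\2-R\1")
print(tag)               # S12-R7
print(output_name(tag))  # S12-R7-mps2ce.out
```

Here the second capturing group (the sample number) is placed before the first (the run number) in the resulting tag, which is then substituted into the default output format of mps2ce.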
pipeline
Automatically run complete, predefined analysis pipelines. Recommended starting point for new users.
This tool runs one of three default analysis pipelines automatically, given a configuration file with tool options and input/output file names. The three available analysis options are 'reference-sample', analysing a single reference sample with TSSV and Stuttermark; 'reference-database', analysing a collection of reference samples with BGEstimate and Stuttermodel; and 'case-sample', analysing a single case sample with TSSV, BGPredict, BGMerge, BGCorrect, and Samplestats. All results are visualised in interactive graphical reports for presentation and further interpretation.
This tool takes a single mandatory argument: the name of an INI configuration file that contains the analysis settings to use. An easy way to obtain such an INI file with default values for all settings is to run fdstools pipeline your-filename.ini --analysis case-sample. This will create the file 'your-filename.ini' and fill it with default values for the given analysis type (in this example: case-sample analysis).
All settings in the configuration file correspond to options of various tools in FDSTools. Please refer to the tool-specific help for a full description of each tool. Type fdstools -h TOOL to get help with the given TOOL.
usage: fdstools pipeline [-h] [-v] [-d] [-a ANALYSIS] [-e REGEX] [-f EXPR] [-l LIBRARY] [-s FASTA] [-m STUT] [-p PROFILES] [-r] [-S SAMPLE [SAMPLE ...]] [-A ALLELEFILE] [-P PREFIX] [-C] INI
positional arguments: INI pipeline configuration file; if it does not exist, a new file with default settings will be created
optional arguments: -h, --help show this help message and exit -v, --version show version number and exit -d, --debug if specified, additional debug output is given -a, --analysis ANALYSIS controls which predefined analysis pipeline will be run; 'reference-sample' runs a single sample's FastA/FastQ file through TSSV and Stuttermark to prepare it for the reference-database analysis; 'reference-database' runs a collection of reference samples through Allelefinder, BGEstimate, and Stuttermodel to create a reference database of systemic noise; 'case-sample' runs a single sample's FastA/FastQ file through TSSV, BGPredict, BGCorrect, and Samplestats
sample tag parsing options: these options are used to extract sample tags (names) from their file names; for details about REGEX syntax and capturing groups, check https://docs.python.org/howto/regex -e, --tag-expr REGEX regular expression that captures (using one or more capturing groups) the sample tags from the file names; by default, the entire file name except for its extension (if any) is captured -f, --tag-format EXPR format of the sample tags produced; a capturing group reference like '\n' refers to the n-th capturing group in the regular expression specified with -e/--tag-expr (the default of '\1' simply uses the first capturing group); with a single sample, you can enter the sample tag here explicitly
input/output file options: words in [brackets] indicate applicable analysis types; all of these values can also be specified in the [pipeline] section of the INI file -l, --in-library LIBRARY library file containing marker definitions -s, --in-sample-raw FASTA [ref-sample, case-sample] FastA or FastQ file containing raw sequence data of the sample -m, --in-stuttermodel STUT [case-sample] file containing a trained stutter model -p, --in-bgprofiles PROFILES [case-sample] file containing noise profiles from BGEstimate -r, --store-predictions [case-sample] if this option is specified, output files named 'sampletag-bgpredict.txt' and 'sampletag-bgmerge.txt' will be created if applicable; these files contain predicted stutter amounts for the sequences in the sample based on the given stutter model -S, --in-samples SAMPLE [SAMPLE ...] [ref-database] file names of reference sample data files ('.csv' output files of the 'reference-sample' analysis) -A, --in-allelelist ALLELEFILE [ref-database] file containing a list of the true alleles of each sample; if not given, Allelefinder will be run as part of the pipeline to create this file; it is ESSENTIAL that you check the correctness and completeness of the allele list -P, --prefix PREFIX [ref-database] if specified, all output file names are prefixed with this value -C, --combine-strands [ref-database, case-sample] if specified, noise analysis will be performed on the total number of reads, instead of separately for either strand
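The three analysis types are typically run in sequence: reference samples first, then the reference database, then case samples. The sketch below only builds and prints the corresponding command lines from the long options documented above; it does not run FDSTools. The helper function and all file names are hypothetical. Note that, as described above, an INI file that does not yet exist is created with default settings.

```python
import shlex

def pipeline_command(ini, analysis, **options):
    """Build an 'fdstools pipeline' command as a list of arguments.

    Keyword names map to the documented long options, e.g.
    in_library -> --in-library. (Illustrative helper, not part of FDSTools.)
    """
    args = ["fdstools", "pipeline", ini, "--analysis", analysis]
    for name, value in options.items():
        args.append("--" + name.replace("_", "-"))
        args.extend(value if isinstance(value, (list, tuple)) else [value])
    return args

# Hypothetical file names throughout.
ref_sample = pipeline_command(
    "ref-sample.ini", "reference-sample",
    in_library="library.ini", in_sample_raw="sample1.fastq")
ref_database = pipeline_command(
    "ref-db.ini", "reference-database",
    in_library="library.ini", in_samples=["sample1.csv", "sample2.csv"])
case_sample = pipeline_command(
    "case.ini", "case-sample",
    in_library="library.ini", in_sample_raw="case1.fastq",
    in_stuttermodel="stuttermodel.txt", in_bgprofiles="bgprofiles.txt")

for command in (ref_sample, ref_database, case_sample):
    print(shlex.join(command))
```

The INI file name is placed before the options so that -S/--in-samples, which accepts several values, cannot absorb it.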
samplestats
Compute various statistics for each sequence in the given sample data file and perform threshold-based allele calling.
Updates the 'flags' column (or adds it, if it was not present in the input data) to include 'allele' for all sequences that meet various allele calling thresholds.
Adds the following columns to the input data. Some columns may be omitted from the output if the input does not contain the required columns. In the column names below, 'X' is a placeholder for 'forward', 'reverse', and 'total', which refers to the strand of DNA for which the statistics are calculated. 'Y' is a placeholder for 'corrected' (statistics calculated on data after noise correction by e.g., BGCorrect), 'noise' (statistics calculated on the number of reads attributed to noise), and 'add' (statistics calculated on the number of reads recovered through noise correction). Wherever the 'Y' part of the column name is omitted, the values in the column are computed on data prior to noise correction.
X_Y: The number of Y reads of this sequence on the X strand (this column is not added by Samplestats, but should be present in the input).
X_Y_mp_sum: The value of X_Y, as a percentage of the sum of the X_Y of the marker.
X_Y_mp_max: The value of X_Y, as a percentage of the maximum X_Y of the marker.
forward_Y_pct: The number of Y reads on the forward strand, as a percentage of the total number of Y reads of this sequence.
X_correction_pct: The difference between the values of X_corrected and X, as a percentage of the value of X.
X_removed_pct: The value of X_noise, as a percentage of the value of X.
X_added_pct: The value of X_add, as a percentage of the value of X.
X_recovery: The value of X_add, as a percentage of the value of X_corrected.
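The column definitions above amount to a handful of percentage calculations. The following sketch works them out for made-up read counts; the data, the variable names, and the choice to return 0.0 for a zero denominator are assumptions for illustration and do not reflect how Samplestats is implemented.

```python
def pct(part, whole):
    """Return part as a percentage of whole; 0.0 when whole is zero (an assumption)."""
    return 100.0 * part / whole if whole else 0.0

# Hypothetical read counts for one marker: sequence -> (forward, reverse).
reads = {"seqA": (600, 400), "seqB": (450, 450), "seqC": (60, 40)}

totals = {seq: fwd + rev for seq, (fwd, rev) in reads.items()}
marker_sum = sum(totals.values())  # 2000 reads on this marker
marker_max = max(totals.values())  # 1000 reads for the highest sequence

for seq, (fwd, rev) in reads.items():
    total = totals[seq]
    print(seq,
          pct(total, marker_sum),  # total_mp_sum
          pct(total, marker_max),  # total_mp_max
          pct(fwd, total))         # forward_pct

# Hypothetical noise correction result for seqA.
total, total_noise, total_add, total_corrected = 1000, 100, 300, 1200
total_correction_pct = pct(total_corrected - total, total)  # 20.0
total_removed_pct = pct(total_noise, total)                 # 10.0
total_added_pct = pct(total_add, total)                     # 30.0
total_recovery = pct(total_add, total_corrected)            # 25.0
```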
usage: fdstools samplestats [-h] [-v] [-d] [-i IN [IN ...]] [-o OUT [OUT ...]] [-e REGEX] [-f EXPR] [-U [only]] [-n N] [-b N] [-m PCT] [-p PCT] [-c PCT] [-y PCT] [-E N] [-D PCT] [-G [N]] [-a ACTION] [-A] [-N N] [-B N] [-M PCT] [-P PCT] [-C PCT] [-Y PCT] [-F FORMAT] [-l LIBRARY] [IN] [OUT]
optional arguments: -h, --help show this help message and exit -v, --version show version number and exit -d, --debug if specified, additional debug output is given
input file options: IN single sample data file to process (default: read from stdin) -i, --input IN [IN ...] multiple sample data files to process (use with -o/--output)
output file options: OUT the file to write the output to (default: write to stdout) -o, --output OUT [OUT ...] list of names of output files to match with input files specified with -i/--input, or a format string to construct file names from sample tags; e.g., the default value is '\1-samplestats.out', which expands to 'sampletag-samplestats.out'
sample tag parsing options: for details about REGEX syntax and capturing groups, check https://docs.python.org/howto/regex -e, --tag-expr REGEX regular expression that captures (using one or more capturing groups) the sample tags from the file names; by default, the entire file name except for its extension (if any) is captured -f, --tag-format EXPR format of the sample tags produced; a capturing group reference like '\n' refers to the n-th capturing group in the regular expression specified with -e/--tag-expr (the default of '\1' simply uses the first capturing group); with a single sample, you can enter the sample tag here explicitly
interpretation options: sequences that match the -c or -y option (or both) and all of the other settings are marked as 'allele' -U, --uncall-alleles [only] if specified and the input contains sequences with the 'allele' flag, the flag will be removed for sequences not meeting the requirements; with the optional keyword 'only', no 'allele' flags will be added to any sequences that do meet the criteria -n, --min-reads N the minimum number of reads (default: 30) -b, --min-per-strand N the minimum number of reads in both orientations (default: 0) -m, --min-pct-of-max PCT the minimum percentage of reads w.r.t. the highest allele of the marker (default: 2.0) -p, --min-pct-of-sum PCT the minimum percentage of reads w.r.t. the marker's total number of reads (default: 1.5) -c, --min-correction PCT the minimum percentage change in read count due to correction by e.g., bgcorrect (total_correction column; default: 0) -y, --min-recovery PCT the minimum number of reads that was recovered thanks to noise correction (by e.g., bgcorrect), as a percentage of the total number of reads after correction (total_recovery column; default: 0) -E, --min-allele-reads N force a minimum total number of reads for all alleles on a marker; don't call any alleles otherwise (default: 0) -D, --max-nonallele-pct PCT drop all allele markings if the highest non-allelic sequence is at least this percentage of the total number of reads for all alleles on that marker (default: 100.0) -G, --max-alleles [N] if specified, do not mark any alleles on a marker if more than N alleles meet the criteria; without N, the amounts given in the library file are used, which have a default value of 1 for markers on the mitochondrial genome and Y chromosome, or 2 otherwise (Note: don't forget to provide -l/--library!)
filtering options: sequences that match the -C or -Y option (or both) and all of the other settings are retained, all others are filtered -a, --filter-action ACTION filtering mode: 'off', disable filtering; 'combine', replace filtered sequences by a single line with aggregate values per marker; 'delete', remove filtered sequences without leaving a trace (default: off) -A, --filter-absolute if specified, apply filters to absolute read counts (i.e., with the sign removed), which may keep over-corrected sequences that would otherwise be filtered out -N, --min-reads-filt N the minimum number of reads (default: 1) -B, --min-per-strand-filt N the minimum number of reads in both orientations (default: 0) -M, --min-pct-of-max-filt PCT the minimum percentage of reads w.r.t. the highest allele of the marker (default: 0.0) -P, --min-pct-of-sum-filt PCT the minimum percentage of reads w.r.t. the marker's total number of reads (default: 0.0) -C, --min-correction-filt PCT the minimum percentage change in read count due to correction by e.g., bgcorrect (total_correction column; default: 0) -Y, --min-recovery-filt PCT the minimum number of reads that was recovered thanks to noise correction (by e.g., bgcorrect), as a percentage of the total number of reads after correction (total_recovery column; default: 0)
sequence format options: -F, --sequence-format FORMAT convert sequences to the specified format: one of raw, tssv, allelename (default: no conversion) -l, --library LIBRARY library file with marker definitions; custom file or built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB', 'ForenSeqB-UAS', 'ID-OmniSTR', 'PowerSeq46GY'
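To make the interplay of the allele calling thresholds concrete, the sketch below applies three of them (-n/--min-reads, -m/--min-pct-of-max, and -p/--min-pct-of-sum) with their documented defaults to made-up read counts. It is a simplification under stated assumptions: the function name and data are hypothetical, the highest sequence stands in for "the highest allele", and the strand, correction, recovery, and per-marker options (-b, -c, -y, -E, -D, -G) are left out.

```python
def call_alleles(reads, min_reads=30, min_pct_of_max=2.0, min_pct_of_sum=1.5):
    """Return the sequences of one marker that pass three calling thresholds.

    reads maps sequence -> total read count for a single marker.
    """
    marker_sum = sum(reads.values())
    marker_max = max(reads.values())
    called = []
    for seq, count in reads.items():
        if (count >= min_reads
                and 100.0 * count >= min_pct_of_max * marker_max
                and 100.0 * count >= min_pct_of_sum * marker_sum):
            called.append(seq)
    return called

# Hypothetical marker with two strong sequences and two weak ones.
marker = {"seq1": 1000, "seq2": 900, "seq3": 28, "seq4": 12}
print(call_alleles(marker))  # ['seq1', 'seq2']
```

In this example seq3 and seq4 fail the minimum of 30 reads, and both also fall below 1.5% of the marker's 1940 reads (29.1 reads).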
seqconvert
Convert between raw sequences, TSSV-style sequences, and allele names.
FDSTools was built to be compatible with TSSV, which writes sequences of known STR alleles in a shortened form referred to as 'TSSV-style sequences'. At the same time, FDSTools supports the creation of human-readable allele names which are more suitable for display.
For example, the raw sequence 'AGCGTAAGATAGATAGATAGATAGATAGATACCTACCTACCTCTAGCT' might be rewritten as the TSSV-style sequence 'AGCGTA(1)AGAT(6)ACCT(3)CTAGCT(1)', or as the allele name 'CE9_AGAT[6]ACCT[3]'.
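The relation between the TSSV-style notation and the raw sequence in this example can be checked with a few lines of Python. This sketch only expands a TSSV-style sequence back into raw bases; the function name is hypothetical, and the reverse direction (and allele naming) depends on the repeat structure defined in a library file, which is not modelled here.

```python
import re

def tssv_to_raw(tssv):
    """Expand a TSSV-style sequence such as 'AGAT(6)ACCT(3)' into raw bases."""
    blocks = re.findall(r"([ACGT]+)\((\d+)\)", tssv)
    return "".join(unit * int(count) for unit, count in blocks)

raw = tssv_to_raw("AGCGTA(1)AGAT(6)ACCT(3)CTAGCT(1)")
print(raw)
print(len(raw))  # 48
```

Each block contributes its unit repeated the given number of times: 6 + 24 + 12 + 6 = 48 nucleotides in total.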
Seqconvert can be used to explicitly convert all sequences in a file to the same output format. Conversions are done using a library file, see the help text of the library tool for details.
You can specify multiple input files using the -i/--input option. This is especially useful when generating allele names for many samples that have many sequences in common. To call the variants in the allele names, FDSTools needs to do sequence alignments which can be rather slow. When generating allele names for many input files at once, the results of the alignments are cached which may give a significant speed-up compared to generating allele names for each sample separately.
Seqconvert can also be used with two different library files to rewrite the allele names or TSSV-style sequences after a library update. Currently, the only limitation to this is that the ending position of the left flank and the starting position of the right flank must be the same.
Note that FDSTools makes no assumptions about the sequence format in its input files; instead it automatically performs any required conversions while running any tool. Explicitly running seqconvert is never a necessity; use this tool for your own convenience.
usage: fdstools seqconvert [-h] [-v] [-d] [-i IN [IN ...]] [-o OUT [OUT ...]] [-e REGEX] [-f EXPR] [-m COLNAME] [-a COLNAME] [-c COLNAME] [-M MARKER] [-l LIBRARY] [-L LIBRARY] [-r MARKER [MARKER ...]] FORMAT [IN] [OUT]
positional arguments: FORMAT the format to convert to: one of raw, tssv, allelename
optional arguments: -h, --help show this help message and exit -v, --version show version number and exit -d, --debug if specified, additional debug output is given -m, --marker-column COLNAME name of the column that contains the marker name (default: 'marker') -a, --allele-column COLNAME name of the column that contains the sequence (default: 'sequence') -c, --output-column COLNAME name of the column to write the output to (default: same as -a/--allele-column) -M, --marker MARKER assume the specified marker for all sequences -l, --library LIBRARY library file with marker definitions; custom file or built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB', 'ForenSeqB-UAS', 'ID-OmniSTR', 'PowerSeq46GY' -L, --library2 LIBRARY second library file to use for output; if specified, allele names can be conveniently updated to fit this new library file -r, --reverse-complement MARKER [MARKER ...] to be used together with -L/--library2; specify the markers for which the sequences are reverse-complemented in the new library
input file options: IN single sample data file to process (default: read from stdin) -i, --input IN [IN ...] multiple sample data files to process (use with -o/--output)
output file options: OUT the file to write the output to (default: write to stdout) -o, --output OUT [OUT ...] list of names of output files to match with input files specified with -i/--input, or a format string to construct file names from sample tags; e.g., the default value is '\1-seqconvert.out', which expands to 'sampletag-seqconvert.out'
sample tag parsing options: for details about REGEX syntax and capturing groups, check https://docs.python.org/howto/regex -e, --tag-expr REGEX regular expression that captures (using one or more capturing groups) the sample tags from the file names; by default, the entire file name except for its extension (if any) is captured -f, --tag-format EXPR format of the sample tags produced; a capturing group reference like '\n' refers to the n-th capturing group in the regular expression specified with -e/--tag-expr (the default of '\1' simply uses the first capturing group); with a single sample, you can enter the sample tag here explicitly
stuttermark
Mark potential stutter products by assuming a fixed maximum percentage of stutter product vs the parent sequence.
If the input has no 'flags' column, Stuttermark adds one to the output. The flags column will contain 'STUTTER' for possible stutter products. A sequence is considered a possible stutter product if its total read count is less than or equal to the maximum number of expected stutter reads. The maximum number of stutter reads is computed by assuming a fixed percentage of stutter product compared to the originating sequence.
Stuttermark requires TSSV-style sequences (automatically converting sequences to this format if necessary) and detects possible stutter products by comparing sequences that have the same repeat blocks but different numbers of repeats for one or more of their blocks.
The STUTTER annotation contains additional information. For example: 'STUTTER:146.6x1(2-1):10.4x2(2-1x9-1)'. This is a stutter product for which at most 146.6 reads have come from the first sequence in the output file ('146.6x1') and at most 10.4 reads have come from the second sequence in the output file ('10.4x2'). This sequence differs from the first sequence in the output file by a loss of one repeat of the second repeat block ('2-1') and it differs from the second sequence by the loss of one repeat in the second block and one repeat in the ninth block ('2-1x9-1'). (Had this sequence had more than 157 reads, the sum of 146.6 and 10.4, it would not have been marked.)
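For downstream scripting, an annotation of this form can be taken apart with a regular expression. The parser below is a sketch based solely on the example above; the function name is hypothetical, and the assumption that a gained repeat would be written with a plus sign (e.g., '2+1') is not confirmed by this page.

```python
import re

def parse_stutter_flag(flag):
    """Split a STUTTER annotation into (max_reads, parent_index, changes) tuples."""
    parts = flag.split(":")
    if parts[0] != "STUTTER":
        raise ValueError("not a STUTTER annotation")
    sources = []
    for part in parts[1:]:
        match = re.fullmatch(r"([\d.]+)x(\d+)\((.+)\)", part)
        if match is None:
            raise ValueError(f"cannot parse {part!r}")
        max_reads, parent, changes = match.groups()
        # Each change is a block number followed by a signed repeat difference.
        steps = [(int(block), int(diff))
                 for block, diff in re.findall(r"(\d+)([+-]\d+)", changes)]
        sources.append((float(max_reads), int(parent), steps))
    return sources

sources = parse_stutter_flag("STUTTER:146.6x1(2-1):10.4x2(2-1x9-1)")
for max_reads, parent, steps in sources:
    print(parent, max_reads, steps)

# The marking threshold is the sum of the per-parent maxima.
print(round(sum(s[0] for s in sources), 1))  # 157.0
```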
usage: fdstools stuttermark [-h] [-v] [-d] [-i IN [IN ...]] [-o OUT [OUT ...]] [-e REGEX] [-f EXPR] [-s DEF] [-m N] [-n N] [-r N] [-l LIBRARY] [IN] [OUT]
optional arguments: -h, --help show this help message and exit -v, --version show version number and exit -d, --debug if specified, additional debug output is given -s, --stutter DEF Define maximum expected stutter percentages. The default value of '-1:15,+1:4' sets -1 stutter (loss of one repeat) to 15%, +1 stutter (gain of one repeat) to 4%. Any unspecified stutter amount is assumed not to occur directly but e.g., a -2 stutter may still be recognised as two -1 stutters stacked together. NOTE: It may be necessary to specify this option as '-s=-1:15,+1:4' (note the equals sign instead of a space).
input file options: IN single sample data file to process (default: read from stdin) -i, --input IN [IN ...] multiple sample data files to process (use with -o/--output)
output file options: OUT the file to write the output to (default: write to stdout) -o, --output OUT [OUT ...] list of names of output files to match with input files specified with -i/--input, or a format string to construct file names from sample tags; e.g., the default value is '\1-stuttermark.out', which expands to 'sampletag-stuttermark.out'
sample tag parsing options: for details about REGEX syntax and capturing groups, check https://docs.python.org/howto/regex -e, --tag-expr REGEX regular expression that captures (using one or more capturing groups) the sample tags from the file names; by default, the entire file name except for its extension (if any) is captured -f, --tag-format EXPR format of the sample tags produced; a capturing group reference like '\n' refers to the n-th capturing group in the regular expression specified with -e/--tag-expr (the default of '\1' simply uses the first capturing group); with a single sample, you can enter the sample tag here explicitly
filtering options: -m, --min-reads N set minimum number of reads to evaluate (default: 2) -n, --min-repeats N set minimum number of repeats of a block that can possibly stutter (default: 3) -r, --min-report N a sequence is only annotated as a stutter of some other sequence if the expected number of stutter occurrences of this other sequence is above this value (default: 0.1)
sequence format options: -l, --library LIBRARY library file with marker definitions; custom file or built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB', 'ForenSeqB-UAS', 'ID-OmniSTR', 'PowerSeq46GY'
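The value of -s/--stutter is a comma-separated list of 'change:percentage' pairs. The sketch below parses such a value and computes the maximum expected number of stutter reads for a parent sequence with a made-up read count. The function names are hypothetical, and how Stuttermark combines stacked stutters (such as a -2 stutter built from two -1 stutters) is not covered.

```python
def parse_stutter_def(definition="-1:15,+1:4"):
    """Parse a -s/--stutter value into {repeat_change: max_percentage}."""
    result = {}
    for item in definition.split(","):
        change, percentage = item.split(":")
        result[int(change)] = float(percentage)
    return result

def max_stutter_reads(parent_reads, percentage):
    """Maximum expected stutter reads for a parent with parent_reads reads."""
    return parent_reads * percentage / 100.0

stutter = parse_stutter_def()
print(stutter)                                # {-1: 15.0, 1: 4.0}
print(max_stutter_reads(1000, stutter[-1]))   # 150.0
print(max_stutter_reads(1000, stutter[1]))    # 40.0
```

With the default definition, a sequence one repeat shorter than a parent with 1000 reads is a possible stutter product if it has at most 150 reads.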
stuttermodel
Train a stutter prediction model using homozygous reference samples.
The model obtained from this tool can be used by bgpredict to predict background noise profiles of alleles for which no reference samples are available.
usage: fdstools stuttermodel [-h] [-v] [-d] [-o FILE] [-e REGEX] [-f EXPR] [-a ALLELEFILE] [-c COLNAME] [-C] [-T THREADS] [-m PCT] [-n N] [-L N] [-s N] [-M MARKER] [-t N] [-O] [-D N] [-S] [-z] [-u N] [-r RAWFILE] [-l LIBRARY] [FILE ...]
positional arguments: FILE the sample data file(s) to process (default: read from stdin)
optional arguments: -h, --help show this help message and exit -v, --version show version number and exit -d, --debug if specified, additional debug output is given -C, --combine-strands if specified, stutter will be modeled for the total number of reads, instead of separately for either strand -T, --num-threads THREADS number of worker threads to use (default: 1) -D, --degree N degree of polynomials to fit (default: 2) -S, --same-shape if specified, the polynomials of all markers will have equal coefficients, except for a vertical shift -z, --ignore-zeros if specified, samples exhibiting no stutter are ignored -u, --max-unit-length N investigate stutter of repeats of units of up to this number of nucleotides in length (default: 6) -r, --raw-outfile RAWFILE write raw data points to this file, for use in stuttermodel visualisations (specify '-' to write to stdout; normal output on stdout is then suppressed)
output file options: -o, --output FILE file to write output to (default: write to stdout)
sample tag parsing options: for details about REGEX syntax and capturing groups, check https://docs.python.org/howto/regex -e, --tag-expr REGEX regular expression that captures (using one or more capturing groups) the sample tags from the file names; by default, the entire file name except for its extension (if any) is captured -f, --tag-format EXPR format of the sample tags produced; a capturing group reference like '\n' refers to the n-th capturing group in the regular expression specified with -e/--tag-expr (the default of '\1' simply uses the first capturing group); with a single sample, you can enter the sample tag here explicitly
allele detection options: -a, --allelelist ALLELEFILE file containing a list of the true alleles of each sample (e.g., obtained from allelefinder) -c, --annotation-column COLNAME name of a column in the sample files, which contains a value beginning with 'ALLELE' for the true alleles of the sample
filtering options: -m, --min-pct PCT minimum amount of background to consider, as a percentage of the highest allele (default: 0.00) -n, --min-abs N minimum amount of background to consider, as an absolute number of reads (default: 1) -L, --min-lengths N require this minimum number of unique repeat lengths (default: 5) -s, --min-samples N require this minimum number of samples for each true allele (default: 1) -M, --marker MARKER work only on MARKER -t, --min-r2 N minimum required r-squared score (default: 0.5) -O, --orphans if specified, a fit on one strand is reported even if no fit was obtained on the other strand for the same marker, unit, and stutter depth
sequence format options: -l, --library LIBRARY library file with marker definitions; custom file or built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB', 'ForenSeqB-UAS', 'ID-OmniSTR', 'PowerSeq46GY'
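The options -D/--degree and -t/--min-r2 refer to a polynomial fit and its r-squared score. The following pure-Python sketch fits a second-degree polynomial by least squares and computes r-squared for made-up data. It assumes, for illustration only, that the stutter ratio is modelled as a function of repeat length; the data, the function names, and the fitting procedure are not taken from FDSTools.

```python
def polyfit(xs, ys, degree=2):
    """Least-squares polynomial fit; returns coefficients, lowest power first."""
    n = degree + 1
    # Normal equations A c = b, A[i][j] = sum(x^(i+j)), b[i] = sum(y * x^i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for row in range(n - 1, -1, -1):
        rest = sum(a[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (b[row] - rest) / a[row][row]
    return coeffs

def r_squared(xs, ys, coeffs):
    """Coefficient of determination of the fitted polynomial."""
    mean = sum(ys) / len(ys)
    ss_tot = sum((y - mean) ** 2 for y in ys)
    ss_res = sum((y - sum(c * x ** i for i, c in enumerate(coeffs))) ** 2
                 for x, y in zip(xs, ys))
    return 1.0 - ss_res / ss_tot

# Hypothetical data: six unique repeat lengths and observed stutter ratios.
lengths = [8, 9, 10, 11, 12, 13]
ratios = [0.040, 0.052, 0.066, 0.079, 0.097, 0.112]
coeffs = polyfit(lengths, ratios, degree=2)
r2 = r_squared(lengths, ratios, coeffs)
print(coeffs, r2)
print("fit accepted" if r2 >= 0.5 else "fit rejected")  # threshold from -t/--min-r2
```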
tssv
Link raw reads in a FastA or FastQ file to markers and count the number of reads for each unique sequence.
Scans a FastA or FastQ file, finding the sequences from the 'flanks' section in the provided library file. Each time a pair of flanks is found, that sequence read is linked to the corresponding marker and the portion of the sequence between the flanks is extracted. The number of times each such extracted sequence was encountered is counted, along with the orientation (strand) in which it was found in the input file. The output is a list of unique sequences found for each marker including the corresponding counts.
By default, a small number of mismatches is allowed when aligning the flanks to the reads. This can be controlled with the -m/--mismatches option. Furthermore, when the portion of the sequence to which a flank aligns is completely written in lowercase letters in the input file, that match is discarded. This way, FDSTools works well together with the paired-end read merging tool FLASH, version 1.2.11/lo, which (optionally) writes the non-overlapping portion of the reads in lowercase [1]. Together, this ensures repetitive sequences (such as STRs) are not truncated when the paired-end reads are merged.
The sequences thus obtained are subsequently filtered in three ways. First, the 'expected_allele_length' section in the library file may be used to specify hard limits on the acceptable sequence length for each marker. Any unexpectedly short or long sequence is removed. Second, any sequence with an ambiguous base (i.e., not A, C, G, or T) is removed. Finally, the -a/--minimum option can be used to filter out sequences that have been seen only rarely. Unless the -B/--no-aggregate-filtered option is given, all filtered sequences of each marker are aggregated and reported as 'Other sequences'.
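The three filters can be sketched as a single pass over the sequence counts of one marker. This is an illustration under assumptions, not the tool's implementation: the function name, the example data, and the length limits are made up, and the aggregate parameter is a stand-in for the tool's aggregation behaviour.

```python
def filter_sequences(counts, min_length, max_length, minimum=2, aggregate=True):
    """Apply length, ambiguous-base, and minimum-count filters to one marker.

    counts maps sequence -> number of reads.
    """
    kept, other = {}, 0
    for seq, n in counts.items():
        passes = (min_length <= len(seq) <= max_length  # expected allele length
                  and set(seq) <= set("ACGT")            # no ambiguous bases
                  and n >= minimum)                      # seen often enough
        if passes:
            kept[seq] = n
        else:
            other += n
    if aggregate and other:
        kept["Other sequences"] = other
    return kept

# Hypothetical counts: one good sequence, one with an N, one too short, one rare.
counts = {"AGATAGAT": 50, "AGATAGNT": 7, "AGAT": 3, "AGATAGATAGAT": 1}
print(filter_sequences(counts, min_length=8, max_length=12))
# {'AGATAGAT': 50, 'Other sequences': 11}
print(filter_sequences(counts, min_length=8, max_length=12, aggregate=False))
# {'AGATAGAT': 50}
```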
This tool is an evolution of the original TSSV program [2].
References: [1] https://github.com/Jerrythafast/FLASH-lowercase-overhang [2] https://github.com/jfjlaros/tssv
usage: fdstools tssv [-h] [-v] [-d] [-F FORMAT] [-R FILE] [-L N] [-D DIR] [-T THREADS] [-X] [-m MISMATCHES] [-n N] [-a N] [-B] [-M ACTION] LIBRARY [IN] [OUT]
positional arguments: LIBRARY library file with marker definitions; custom file or built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB', 'ForenSeqB-UAS', 'ID-OmniSTR', 'PowerSeq46GY' IN the sample data file to process (default: read from stdin) OUT the file to write the output to (default: write to stdout)
optional arguments: -h, --help show this help message and exit -v, --version show version number and exit -d, --debug if specified, additional debug output is given -L, --flank-length N length of anchor (flanking) sequences to use (default: 16) -D, --dir DIR output directory for verbose output; when given, a subdirectory will be created for each marker, each with a separate sequences.csv file and a number of FASTA/FASTQ files containing unrecognised reads (unknown.fa), recognised reads (Marker/paired.fa), and reads that lack one of the flanks of a marker (Marker/noend.fa and Marker/nostart.fa) -T, --num-threads THREADS number of worker threads to use (default: 1) -X, --no-deduplicate disable deduplication of reads; by setting this option, memory usage will be reduced at the expense of longer running time
sequence format options: -F, --sequence-format FORMAT convert sequences to the specified format: one of raw, tssv, allelename (default: raw)
output file options: -R, --report FILE file to write a report to (default: write to stderr)
filtering options: -m, --mismatches MISMATCHES number of mismatches (per nucleotide of flanking sequence if less than 1, else absolute) to allow in flanking sequences, rounded upward (default: 0.1) -n, --indel-score N insertions and deletions in the flanking sequences are penalised this number of times more heavily than mismatches (default: 2) -a, --minimum N report only sequences with this minimum number of reads (default: 2) -B, --no-aggregate-filtered by default, sequences that have been filtered (as per the -a/--minimum option, the expected_allele_length section in the library file, as well as all sequences with ambiguous bases) will be aggregated per marker and reported as 'Other sequences'; specify this option to remove such sequences entirely -M, --missing-marker-action ACTION action to take when no sequences are linked to a marker: one of include, exclude, halt (default: include)
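The -m/--mismatches value is interpreted per nucleotide when it is below 1 and as an absolute number otherwise, rounded upward in both cases. A small worked example, assuming the flank length equals the -L/--flank-length default of 16 (the function name is hypothetical):

```python
import math

def allowed_mismatches(mismatches, flank_length=16):
    """Number of mismatches allowed in one flank, rounded upward."""
    if mismatches < 1:
        # Fraction of the flank length, e.g. 0.1 * 16 = 1.6 -> 2.
        return math.ceil(mismatches * flank_length)
    return math.ceil(mismatches)

print(allowed_mismatches(0.1))      # 2
print(allowed_mismatches(0.1, 25))  # 3
print(allowed_mismatches(3))        # 3
```

With the defaults, up to two mismatches are therefore tolerated in a 16-nucleotide flank.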
vis
Create a data visualisation web page or Vega graph specification.
With no optional arguments specified, a self-contained web page (HTML file) is produced. You can open this file in a web browser to view interactive visualisations of your data. The web page contains a file selection element which can be used to select the data to be visualised.
Visualisations make use of the Vega JavaScript library (https://vega.github.io). The required JavaScript libraries (Vega and D3) are embedded in the generated HTML file. With the -O/--online option specified, the HTML file will instead link to the latest version of these libraries on the Internet.
Vega supports generating visualisations on the command line. By default, FDSTools produces a full-featured HTML file. Specify the -V/--vega option if you wish to obtain a bare Vega graph specification (a JSON file) instead. You can pass this file through Vega to generate a PNG or SVG image file.
If an input file is specified, the visualisation will be set up specifically to visualise the contents of this file. To this end, the entire file contents are embedded in the generated visualisation.
usage: fdstools vis [-h] [-v] [-d] [-V] [-t] [-T TITLE] [-n N] [-m PCT] [-S PCT] [-s N] [-B N] [-c] [-M MARKER] [-U UNIT] [-A] [-a] [-I FILE] [-L] [-b N] [-p N] [-w N] [-H N] [-x N] [-j N] [-N N] [-X PCT] [-Q PCT] [-C N] [-Y N] [-Z N] TYPE [IN] [OUT]
positional arguments: TYPE the type of data to visualise; use 'sample' to visualise sample data files and bgcorrect output; use 'profile' to visualise background noise profiles obtained with bgestimate, bghomstats, and bgpredict; use 'bgraw' to visualise raw background noise data obtained with bghomraw; use 'stuttermodel' to visualise models of stutter obtained from stuttermodel; use 'bganalyse' to visualise data obtained from bganalyse; use 'allele' to visualise the allele list obtained from allelefinder IN file containing the data to embed in the visualisation file; if not specified, HTML visualisation files will contain a file selection control, and Vega visualisation files will load data from a file called 'data.csv' OUT file to write output to (default: write to stdout)
optional arguments: -h, --help show this help message and exit -v, --version show version number and exit -d, --debug if specified, additional debug output is given -V, --vega by default, a full-featured HTML file offering an interactive visualisation is created; if this option is specified, only a bare Vega graph specification (JSON file) is produced instead -t, --tidy tidily indent the generated JSON -T, --title TITLE prepend the given value to the title of HTML visualisations (default: prepend name of data file if given)
visualisation options: words in [brackets] indicate applicable visualisation types -n, --min-abs N [sample, profile, bgraw] only show sequences with this minimum number of reads (default: 5) -m, --min-pct-of-max PCT [sample, profile, bgraw] for sample: only show sequences with at least this percentage of the number of reads of the highest allele of a marker; for profile and bgraw: at least this percentage of the true allele (default: 0.5) -S, --min-pct-of-sum PCT [sample] only show sequences with at least this percentage of the total number of reads of a marker (default: 0.0) -s, --min-per-strand N [sample] only show sequences with this minimum number of reads for both orientations (forward/reverse) (default: 0) -B, --bias-threshold N [sample] mark sequences that have less than this percentage of reads on one strand (default: 0.0) -c, --no-ce-length-sort [sample] if specified, do not sort STR alleles by length -M, --marker MARKER [sample, profile, bgraw, stuttermodel, bganalyse] only show graphs for the markers that contain the given value in their name; separate multiple values with spaces; prepend any value with '=' for an exact match (default: show all markers) -U, --repeat-unit UNIT [stuttermodel] only show graphs for the repeat units that contain the given value; separate multiple values with spaces; prepend any value with '=' for an exact match (default: show all repeat units) -A, --no-alldata [stuttermodel] if specified, show only marker-specific fits -a, --no-aggregate [sample] if specified, do not replace filtered sequences with a per-marker aggregate 'Other sequences' entry -I, --input2 FILE [profile, stuttermodel] raw data points file to overlay on the background noise profiles or stutter model graphs (as obtained from bghomraw or the -r/--raw-outfile option of stuttermodel); if not specified, HTML visualisation files will contain a file selection control
display options: -L, --log-scale [sample, profile, bgraw, bganalyse] use logarithmic scale (for sample and bganalyse: square root scale) instead of linear scale -b, --bar-width N [sample, profile, bgraw, bganalyse] width of the bars in pixels (default: 15) -p, --padding N [sample, profile, bgraw, stuttermodel] amount of padding (in pixels) between graphs of different markers/alleles (default: 70) -w, --width N [sample, profile, bgraw, stuttermodel, bganalyse, allele] width of the graph area in pixels (default: 600) -H, --height N [stuttermodel, allele] height of the graph area in pixels (default: 400) -x, --max-seq-len N [sample] truncate long sequences to this number of characters (default: 70) -j, --jitter N [stuttermodel] apply this amount of jitter to raw data points (between 0 and 1, default: 0.25)
allele calling options: for sample visualisations only; sequences that match the -C or -Y option (or both) and all of the other settings are marked as 'allele' -N, --allele-min-abs N the minimum number of reads (default: 30) -X, --allele-min-pct-of-max PCT the minimum percentage of reads w.r.t. the highest allele of the marker (default: 2.0) -Q, --allele-min-pct-of-sum PCT the minimum percentage of reads w.r.t. the marker's total number of reads (default: 1.5) -C, --allele-min-correction N the minimum change in read count due to correction by e.g., bgcorrect (default: 0) -Y, --allele-min-recovery N the minimum number of reads that was recovered thanks to noise correction (by e.g., bgcorrect), as a percentage of the total number of reads after correction (default: 0) -Z, --allele-min-per-strand N the minimum number of reads in both orientations (default: 0)
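The matching rules of the -M/--marker and -U/--repeat-unit filters (substring match, space-separated alternatives, '=' prefix for an exact match) can be expressed compactly. The function below is a sketch of those rules as described above; its name is hypothetical, and case-sensitive matching is an assumption.

```python
def name_matches(name, filter_value):
    """Check a marker or repeat unit name against a -M/-U style filter value."""
    terms = filter_value.split()
    if not terms:
        return True  # no filter given: show everything
    for term in terms:
        if term.startswith("="):
            if name == term[1:]:   # '=' prefix: exact match only
                return True
        elif term in name:         # otherwise: substring match
            return True
    return False

print(name_matches("D3S1358", "D3"))           # True
print(name_matches("D13S317", "=D13"))         # False
print(name_matches("D1S1656", "D3 =D1S1656"))  # True
```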