FDSTools Tools

This page briefly describes each tool in FDSTools, including its command line arguments. The same information can be obtained from the FDSTools command line by running the command fdstools --help (for the list of tools) or fdstools --help toolname (for a description of that particular tool).

fdstools

Data analysis tools for Massively Parallel Sequencing of forensic DNA markers, including tools for characterisation and filtering of PCR stutter artefacts and other systemic noise, and for automatic detection of the alleles in a sample.

usage: fdstools [-h] [-v] [-d] TOOL ...
options:
  -h, --help      show this help message, or help for the specified TOOL, and
                  exit
  -v, --version   show version number and exit
  -d, --debug     if specified, additional debug output is given
available tools:
  TOOL            specify which tool to run
    allelefinder  Find true alleles in reference samples and detect possible
                  contaminations.
    bganalyse     Analyse the amount of noise in reference samples.
    bgcorrect     Match background noise profiles (obtained from e.g.,
                  bgestimate) to samples.
    bgestimate    Estimate allele-centric background noise profiles (means)
                  from reference samples.
    bghomraw      Compute noise ratios for all noise detected in homozygous
                  reference samples.
    bghomstats    Compute allele-centric statistics for background noise in
                  homozygous reference samples (min, max, mean, sample
                  variance).
    bgmerge       Merge multiple files containing background noise profiles.
    bgpredict     Predict background profiles of new alleles based on a model
                  of stutter occurrence obtained from stuttermodel.
    findnewalleles
                  Mark all sequences that are not in another list of
                  sequences.
    libconvert    Convert a legacy TSSV library file (tab-separated) to the
                  FDSTools library format (ini-style).
    library       Create an empty FDSTools library file.
    pipeline      Automatically run complete, predefined analysis pipelines.
                  Recommended starting point for new users.
    samplestats   Compute various statistics for each sequence in the given
                  sample data file and perform threshold-based allele calling.
    seqconvert    Convert between raw sequences, TSSV-style sequences, and
                  allele names.
    stuttermark   Mark potential stutter products by assuming a fixed maximum
                  percentage of stutter product vs the parent sequence.
    stuttermodel  Train a stutter prediction model using homozygous reference
                  samples.
    tssv          Link raw reads in a FastA or FastQ file to markers and count
                  the number of reads for each unique sequence.
    vis           Create a data visualisation web page or Vega graph
                  specification.

allelefinder

Find true alleles in reference samples and detect possible contaminations.

In each sample, the sequences with the highest read counts of each marker are called alleles, with a user-defined maximum number of alleles per marker. The allele balance is kept within given bounds. If the highest non-allelic sequence exceeds a given limit, no alleles are called for this marker. If this happens for multiple markers in one sample, no alleles are called for this sample at all.

If the input file contains a 'flags' column, any sequences with a flag starting with 'STUTTER' will be ignored. Therefore, it is highly recommended to run allelefinder on the output of stuttermark.
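
The calling rule described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the allelefinder implementation: the function name, the (sequence, reads, flags) row layout, and the choice to leave out the read-count thresholds (-n and -N) are all made up for this example. The default percentages mirror -m/--min-allele-pct and -M/--max-noise-pct.

```python
def call_alleles(rows, min_allele_pct=30.0, max_noise_pct=10.0, max_alleles=2):
    """Call alleles for one marker of one sample (hypothetical helper).

    rows is a list of (sequence, reads, flags) tuples, where flags is a
    list of strings. Returns the called sequences, or [] if the marker
    is rejected.
    """
    # Ignore anything flagged as stutter, e.g. by stuttermark.
    candidates = sorted(
        ((seq, reads) for seq, reads, flags in rows
         if not any(flag.startswith("STUTTER") for flag in flags)),
        key=lambda item: item[1], reverse=True)
    if not candidates or candidates[0][1] == 0:
        return []
    top_reads = candidates[0][1]

    # The highest sequences are alleles, as long as they stay within the
    # allele balance bound and the maximum number of alleles.
    alleles = [seq for seq, reads in candidates[:max_alleles]
               if reads >= top_reads * min_allele_pct / 100]

    # Reject the marker if the highest non-allelic sequence is too high.
    rest = [reads for seq, reads in candidates if seq not in alleles]
    if rest and max(rest) >= top_reads * max_noise_pct / 100:
        return []
    return alleles
```

With the defaults, a marker with read counts 1000, 800 and 20 yields two alleles, whereas read counts 1000 and 200 yield none: the second sequence is too low to be an allele (below 30%) but too high to be noise (at least 10%).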

The allele list obtained from allelefinder should always be checked carefully before using it as the input of various other tools operating on reference samples. These tools rely heavily on the correctness of this file to do their job. One may use the allelefinder report (-R/--report output argument) and the bganalyse tool to get a quick overview of what might be wrong.

usage: fdstools allelefinder [-h] [-v] [-d] [-o FILE] [-R FILE] [-e REGEX]
                             [-f EXPR] [-m PCT] [-M PCT] [-n N] [-N N] [-a N]
                             [-x X] [-F FORMAT] [-l LIBRARY]
                             [FILE ...]
positional arguments:
  FILE                  the sample data file(s) to process (default: read from
                        stdin)
options:
  -h, --help            show this help message and exit
  -v, --version         show version number and exit
  -d, --debug           if specified, additional debug output is given
output file options:
  -o, --output FILE     file to write output to (default: write to stdout)
  -R, --report FILE     file to write a report to (default: write to stderr)
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e, --tag-expr REGEX  regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly
filtering options:
  -m, --min-allele-pct PCT
                        call heterozygous if the second allele is at least
                        this percentage of the highest allele of a marker
                        (default: 30.0)
  -M, --max-noise-pct PCT
                        a sample is considered contaminated/unsuitable for a
                        marker if the highest non-allelic sequence is at least
                        this percentage of the highest allele of that marker
                        (default: 10.0)
  -n, --min-reads N     require at least this number of reads for the highest
                        allele of each marker (default: 50)
  -N, --min-reads-lowest N
                        require at least this number of reads for the lowest
                        allele of each marker (default: 15)
  -a, --max-alleles N   allow no more than this number of alleles per marker;
                        if unspecified, the amounts given in the library file
                        are used, which have a default value of 1 for markers
                        on the mitochondrial genome and Y chromosome, or 2
                        otherwise
  -x, --max-noisy X     entirely reject a sample if more than this fraction of
                        markers (if less than 1) or absolute number of markers
                        (if 1 or more) have a high non-allelic sequence
                        (default: 0.1)
sequence format options:
  -F, --sequence-format FORMAT
                        convert sequences to the specified format: one of raw,
                        tssv, allelename (default: no conversion)
  -l, --library LIBRARY
                        library file with marker definitions; custom file or
                        built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB',
                        'ForenSeqB-UAS', 'PowerSeq46GY'
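
The two interpretations of the -x/--max-noisy value (a fraction when below 1, an absolute count otherwise) can be illustrated with a small Python sketch. The function name and arguments are hypothetical; only the interpretation rule and the default of 0.1 come from the help text above.

```python
def sample_rejected(noisy_markers, total_markers, max_noisy=0.1):
    """Return True if a sample has too many noisy markers (hypothetical helper)."""
    if max_noisy < 1:
        # Below 1: the limit is a fraction of all markers in the sample.
        return noisy_markers / total_markers > max_noisy
    # 1 or more: the limit is an absolute number of markers.
    return noisy_markers > max_noisy
```

For a sample with 20 markers, the default of 0.1 tolerates two noisy markers and rejects the sample at three.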

bganalyse

Analyse the amount of noise in reference samples.

Use this tool after correcting the reference samples with bgcorrect to analyse the amount of remaining noise after correction. This way, potentially contaminated or otherwise 'dirty' reference samples can be detected. The highest amount of remaining noise can be interpreted as a lower bound to the reliable detection of a minor contributor's alleles in mixed DNA samples.

In the default mode ('full'), the lowest, highest, and total number of background/noise reads, as well as the respective percentages w.r.t. the number of allelic reads, are printed for each marker in each sample. This data can be visualised using fdstools vis bganalyse.

In the alternative 'percentiles' mode, the highest and the total number of background reads as a percentage of the number of allelic reads for each marker is given at selected percentiles of the samples. I.e., it gives the highest and total remaining noise considering only the cleanest x% of samples, for different values of x.
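
As an illustration of the 'percentiles' mode, the following Python sketch takes the highest remaining noise percentage of each sample for one marker and reports the value reached within the cleanest x% of samples. The function name and the nearest-rank way of selecting the sample are assumptions of this example; bganalyse may compute percentiles differently (e.g., with interpolation).

```python
import math

def noise_at_percentiles(noise_pcts, percentiles=(100, 99, 95, 90)):
    """Highest noise among the cleanest x% of samples (hypothetical helper).

    noise_pcts holds one noise percentage per sample for a single marker.
    """
    ordered = sorted(noise_pcts)  # cleanest samples first
    result = {}
    for pct in percentiles:
        # Number of samples in the cleanest pct% (nearest rank, at least one).
        count = max(1, math.ceil(len(ordered) * pct / 100))
        result[pct] = ordered[count - 1]
    return result
```

With ten samples of which nine stay below 5% noise and one reaches 12%, the 100th percentile reports 12% while the 90th percentile reports the highest value among the nine clean samples, which makes a single 'dirty' reference sample easy to spot.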

usage: fdstools bganalyse [-h] [-v] [-d] [-m MODE] [-p PCT] [-o FILE]
                          [-e REGEX] [-f EXPR] [-a ALLELEFILE] [-c COLNAME]
                          [-F FORMAT] [-l LIBRARY]
                          [FILE ...]
positional arguments:
  FILE                  the sample data file(s) to process (default: read from
                        stdin)
options:
  -h, --help            show this help message and exit
  -v, --version         show version number and exit
  -d, --debug           if specified, additional debug output is given
  -m, --mode MODE       controls what kind of information is printed; 'full'
                        (the default) prints the lowest, highest, and total
                        number of background reads as well as the respective
                        percentages w.r.t. the number of allelic reads of each
                        marker in each sample; 'percentiles' prints the
                        highest and the total number of background reads as a
                        percentage of the number of allelic reads for each
                        marker at given percentiles
  -p, --percentiles PCT
                        comma-separated list of percentiles to report when
                        -m/--mode is set to 'percentiles' (default:
                        100,99,95,90)
output file options:
  -o, --output FILE     file to write output to (default: write to stdout)
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e, --tag-expr REGEX  regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly
allele detection options:
  -a, --allelelist ALLELEFILE
                        file containing a list of the true alleles of each
                        sample (e.g., obtained from allelefinder)
  -c, --annotation-column COLNAME
                        name of a column in the sample files, which contains a
                        value beginning with 'ALLELE' for the true alleles of
                        the sample
sequence format options:
  -F, --sequence-format FORMAT
                        convert sequences to the specified format: one of raw,
                        tssv, allelename (default: raw)
  -l, --library LIBRARY
                        library file with marker definitions; custom file or
                        built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB',
                        'ForenSeqB-UAS', 'PowerSeq46GY'

bgcorrect

Match background noise profiles (obtained from e.g., bgestimate) to samples.

Eleven new columns are added to the output giving, for each sequence, the number of reads attributable to noise from other sequences (_noise columns) and the number of noise reads caused by the presence of this sequence (_add columns), as well as the resulting number of reads after correction (_corrected columns: original minus _noise plus _add).

The correction_flags column contains one of the following values:
  'not_corrected'
                        no background noise profile was available for this
                        marker
  'not_in_ref_db'
                        the sequence was not present in the noise profiles
                        given
  'corrected_as_background_only'
                        the sequence was present in the noise profiles given,
                        but only as noise and not as genuine allele
  'corrected_bgpredict'
                        the sequence was present in the noise profiles as a
                        genuine allele, but its noise profile consists
                        entirely of predictions as opposed to direct
                        observations
  'corrected_bgestimate'/'corrected_bghomstats'
                        the sequence was present in the noise profiles as a
                        genuine allele and at least part of its noise profile
                        was based on direct observations

Finally, the weight column gives the number of times that the noise profile of that allele fitted in the sample.
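
The arithmetic behind the _corrected columns (original minus _noise plus _add) can be written out as a short Python sketch. The column names used here (forward, reverse and total, each with _noise, _add and _corrected variants) are an assumption about the file layout made for this example, and the function is hypothetical.

```python
def add_corrected_columns(row):
    """Return a copy of row with the *_corrected values filled in.

    row maps column names to read counts (hypothetical layout).
    """
    result = dict(row)
    for strand in ("forward", "reverse", "total"):
        # corrected = original - noise from other sequences
        #             + noise caused by this sequence
        result[strand + "_corrected"] = (
            row[strand] - row[strand + "_noise"] + row[strand + "_add"])
    return result
```

For example, a sequence with 1000 total reads, of which 20 are attributed to noise from other sequences, and which itself caused 100 noise reads elsewhere, ends up with 1080 corrected reads.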

usage: fdstools bgcorrect [-h] [-v] [-d] [-i IN [IN ...]] [-o OUT [OUT ...]]
                          [-e REGEX] [-f EXPR] [-C] [-M MARKER] [-F FORMAT]
                          [-l LIBRARY]
                          PROFILES [IN] [OUT]
positional arguments:
  PROFILES              file containing background noise profiles to match
options:
  -h, --help            show this help message and exit
  -v, --version         show version number and exit
  -d, --debug           if specified, additional debug output is given
  -C, --combine-strands
                        if specified, stutter noise correction will be done on
                        the total number of reads, instead of separately for
                        either strand
input file options:
  IN                    single sample data file to process (default: read from
                        stdin)
  -i, --input IN [IN ...]
                        multiple sample data files to process (use with
                        -o/--output)
output file options:
  OUT                   the file to write the output to (default: write to
                        stdout)
  -o, --output OUT [OUT ...]
                        list of names of output files to match with input
                        files specified with -i/--input, or a format string to
                        construct file names from sample tags; e.g., the
                        default value is '\1-bgcorrect.out', which expands to
                        'sampletag-bgcorrect.out'
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e, --tag-expr REGEX  regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly
filtering options:
  -M, --marker MARKER   work only on MARKER
sequence format options:
  -F, --sequence-format FORMAT
                        convert sequences to the specified format: one of raw,
                        tssv, allelename (default: no conversion)
  -l, --library LIBRARY
                        library file with marker definitions; custom file or
                        built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB',
                        'ForenSeqB-UAS', 'PowerSeq46GY'
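
The interplay of -e/--tag-expr, -f/--tag-format and the -o/--output format string can be tried out with Python's re module, whose syntax the help text refers to. The sketch below is an approximation under stated assumptions: both helper functions, the default pattern shown, and the plain substitution of '\1' in the output name are made up for illustration and may differ from what FDSTools does internally.

```python
import re

# Assumed stand-in for the default: the file name without its extension.
DEFAULT_TAG_EXPR = r"^(.*?)(?:\.[^.]+)?$"

def sample_tag(filename, tag_expr=DEFAULT_TAG_EXPR, tag_format=r"\1"):
    """Extract a sample tag from a file name (hypothetical helper)."""
    match = re.search(tag_expr, filename)
    if match is None:
        return None
    # '\n' in tag_format is replaced by the n-th capturing group.
    return match.expand(tag_format)

def output_name(tag, output_format=r"\1-bgcorrect.out"):
    """Build an output file name from a sample tag (hypothetical helper)."""
    return output_format.replace(r"\1", tag)
```

With these helpers, sample_tag('sampleA.csv') gives 'sampleA', and a custom expression such as 'run(\d+)_(\w+?)_R1' combined with the format '\2-run\1' turns 'run7_sampleA_R1.txt' into 'sampleA-run7'.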

bgestimate

Estimate allele-centric background noise profiles (means) from reference samples.

Compute a profile of recurring background noise for each unique allele in the database of reference samples. The profiles obtained can be used by bgcorrect to filter background noise from samples.

usage: fdstools bgestimate [-h] [-v] [-d] [-o FILE] [-R FILE] [-e REGEX]
                           [-f EXPR] [-a ALLELEFILE] [-c COLNAME] [-C]
                           [-m PCT] [-n N] [-s N] [-S PCT] [-g N] [-p FILE]
                           [-M MARKER] [-H] [-l LIBRARY]
                           [FILE ...]
positional arguments:
  FILE                  the sample data file(s) to process (default: read from
                        stdin)
options:
  -h, --help            show this help message and exit
  -v, --version         show version number and exit
  -d, --debug           if specified, additional debug output is given
  -C, --combine-strands
                        if specified, noise profiles will be calculated for
                        the total number of reads, instead of separately for
                        either strand
output file options:
  -o, --output FILE     file to write output to (default: write to stdout)
  -R, --report FILE     file to write a report to (default: write to stderr)
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e, --tag-expr REGEX  regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly
allele detection options:
  -a, --allelelist ALLELEFILE
                        file containing a list of the true alleles of each
                        sample (e.g., obtained from allelefinder)
  -c, --annotation-column COLNAME
                        name of a column in the sample files, which contains a
                        value beginning with 'ALLELE' for the true alleles of
                        the sample
filtering options:
  -m, --min-pct PCT     minimum amount of background to consider, as a
                        percentage of the highest allele (default: 0.50)
  -n, --min-abs N       minimum amount of background to consider, as an
                        absolute number of reads for at least one orientation
                        (default: 5)
  -s, --min-samples N   require this minimum number of samples for each true
                        allele (default: 2)
  -S, --min-sample-pct PCT
                        require this minimum number of samples for each
                        background product, as a percentage of the number of
                        samples with a particular true allele (default: 80.0)
  -g, --min-genotypes N
                        require this minimum number of unique heterozygous
                        genotypes for each allele for which no homozygous
                        samples are available (default: 3)
  -p, --profiles FILE   use the given noise profiles file as a starting point
  -M, --marker MARKER   work only on MARKER
  -H, --homozygotes     if specified, only homozygous samples will be
                        considered
sequence format options:
  -l, --library LIBRARY
                        library file with marker definitions; custom file or
                        built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB',
                        'ForenSeqB-UAS', 'PowerSeq46GY'
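
To make the sample-count thresholds more concrete, the following Python sketch applies -s/--min-samples and -S/--min-sample-pct to one true allele. It is a simplified illustration: the function, its input layout, and the assumption that a background product has already passed the -m/-n thresholds in the counted samples are not taken from the bgestimate source.

```python
def recurring_products(observations, n_samples, min_samples=2,
                       min_sample_pct=80.0):
    """Select recurring background products of one allele (hypothetical helper).

    observations maps each background sequence to the number of samples
    with this allele in which it was seen above the noise thresholds;
    n_samples is the number of samples that have the allele.
    """
    if n_samples < min_samples:
        # Too few samples with this allele to build a profile.
        return []
    needed = n_samples * min_sample_pct / 100
    return sorted(seq for seq, count in observations.items()
                  if count >= needed)
```

With ten samples sharing an allele and the default of 80.0, a background product seen in eight samples is kept and one seen in three samples is dropped.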

bghomraw

Compute noise ratios for all noise detected in homozygous reference samples.

With this tool, separate data points are produced for each sample, which can be visualised using fdstools vis bgraw. Use bghomstats or bgestimate to compute aggregate statistics on noise instead.

usage: fdstools bghomraw [-h] [-v] [-d] [-o FILE] [-e REGEX] [-f EXPR]
                         [-a ALLELEFILE] [-c COLNAME] [-C] [-m PCT] [-n N]
                         [-s N] [-S PCT] [-M MARKER] [-F FORMAT] [-l LIBRARY]
                         [FILE ...]
positional arguments:
  FILE                  the sample data file(s) to process (default: read from
                        stdin)
options:
  -h, --help            show this help message and exit
  -v, --version         show version number and exit
  -d, --debug           if specified, additional debug output is given
  -C, --combine-strands
                        if specified, noise ratios will be calculated for the
                        total number of reads, instead of separately for
                        either strand
output file options:
  -o, --output FILE     file to write output to (default: write to stdout)
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e, --tag-expr REGEX  regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly
allele detection options:
  -a, --allelelist ALLELEFILE
                        file containing a list of the true alleles of each
                        sample (e.g., obtained from allelefinder)
  -c, --annotation-column COLNAME
                        name of a column in the sample files, which contains a
                        value beginning with 'ALLELE' for the true alleles of
                        the sample
filtering options:
  -m, --min-pct PCT     minimum amount of background to consider, as a
                        percentage of the highest allele (default: 0.50)
  -n, --min-abs N       minimum amount of background to consider, as an
                        absolute number of reads (default: 5)
  -s, --min-samples N   require this minimum number of samples for each true
                        allele (default: 2)
  -S, --min-sample-pct PCT
                        require this minimum number of samples for each
                        background product, as a percentage of the number of
                        samples with a particular true allele (default: 80.0)
  -M, --marker MARKER   work only on MARKER
sequence format options:
  -F, --sequence-format FORMAT
                        convert sequences to the specified format: one of raw,
                        tssv, allelename (default: raw)
  -l, --library LIBRARY
                        library file with marker definitions; custom file or
                        built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB',
                        'ForenSeqB-UAS', 'PowerSeq46GY'

bghomstats

Compute allele-centric statistics for background noise in homozygous reference samples (min, max, mean, sample variance).

Compute a profile of recurring background noise for each unique allele in the database of reference samples. The profiles obtained can be used by bgcorrect to filter background noise from samples. If many reference samples are heterozygous (as is usually the case with forensic STR markers), it is preferable to use bgestimate instead, since it can handle heterozygous samples as well.
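
The four statistics named above can be reproduced with Python's statistics module. The sketch assumes, for illustration only, that the noise of one background product has already been expressed as a percentage of the allele's reads in each homozygous sample; the function name and input layout are hypothetical.

```python
import statistics

def noise_statistics(ratios):
    """Summarise per-sample noise ratios of one background product.

    ratios holds one percentage per homozygous sample; at least two
    values are needed for the sample variance.
    """
    return {
        "min": min(ratios),
        "max": max(ratios),
        "mean": statistics.mean(ratios),
        # statistics.variance computes the sample variance (n - 1).
        "variance": statistics.variance(ratios),
    }
```

Because the sample variance divides by n - 1, it is undefined for a single sample, which fits the default of at least two samples per true allele (-s/--min-samples).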

usage: fdstools bghomstats [-h] [-v] [-d] [-o FILE] [-e REGEX] [-f EXPR]
                           [-a ALLELEFILE] [-c COLNAME] [-C] [-m PCT] [-n N]
                           [-s N] [-S PCT] [-M MARKER] [-F FORMAT]
                           [-l LIBRARY]
                           [FILE ...]
positional arguments:
  FILE                  the sample data file(s) to process (default: read from
                        stdin)
options:
  -h, --help            show this help message and exit
  -v, --version         show version number and exit
  -d, --debug           if specified, additional debug output is given
  -C, --combine-strands
                        if specified, noise statistics will be calculated for
                        the total number of reads, instead of separately for
                        either strand
output file options:
  -o, --output FILE     file to write output to (default: write to stdout)
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e, --tag-expr REGEX  regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly
allele detection options:
  -a, --allelelist ALLELEFILE
                        file containing a list of the true alleles of each
                        sample (e.g., obtained from allelefinder)
  -c, --annotation-column COLNAME
                        name of a column in the sample files, which contains a
                        value beginning with 'ALLELE' for the true alleles of
                        the sample
filtering options:
  -m, --min-pct PCT     minimum amount of background to consider, as a
                        percentage of the highest allele (default: 0.50)
  -n, --min-abs N       minimum amount of background to consider, as an
                        absolute number of reads (default: 5)
  -s, --min-samples N   require this minimum number of samples for each true
                        allele (default: 2)
  -S, --min-sample-pct PCT
                        require this minimum number of samples for each
                        background product, as a percentage of the number of
                        samples with a particular true allele (default: 80.0)
  -M, --marker MARKER   work only on MARKER
sequence format options:
  -F, --sequence-format FORMAT
                        convert sequences to the specified format: one of raw,
                        tssv, allelename (default: no conversion)
  -l, --library LIBRARY
                        library file with marker definitions; custom file or
                        built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB',
                        'ForenSeqB-UAS', 'PowerSeq46GY'

bgmerge

Merge multiple files containing background noise profiles.

Background noise profiles are merged in the order in which they are specified. If multiple files specify different values for the same allele and sequence, the value of the first file is used.

It is convenient to pipe the output of bgpredict and/or bgestimate into bgmerge to merge that with an existing file containing background profiles. Specify '-' as one of the input files to read from stdin (i.e., read input from a pipe). If only one input file is specified, '-' is implicitly used as the second input file. Note that as a result, in case of conflicting values, the value in the specified input file will take precedence over the value in the data that was piped in.

Example: fdstools bgpredict ... | fdstools bgmerge old.txt > out.txt
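
The precedence rule (the first file wins) can be sketched in Python as follows. The dictionary layout, keyed by (allele, sequence) pairs, and the function name are assumptions of this example and say nothing about the actual file format.

```python
def merge_profiles(*profile_sets):
    """Merge noise profiles; earlier inputs take precedence (hypothetical helper).

    Each input maps (allele, sequence) pairs to a noise value.
    """
    merged = {}
    for profiles in profile_sets:
        for key, value in profiles.items():
            # Keep the value that was seen first, ignore later ones.
            merged.setdefault(key, value)
    return merged
```

In terms of the command above, the profiles in old.txt would be passed first and the piped-in predictions second, so a predicted value never overwrites a value already present in old.txt.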

usage: fdstools bgmerge [-h] [-v] [-d] [-o FILE] [-l LIBRARY] FILE [FILE ...]
positional arguments:
  FILE                  files containing the background noise profiles to
                        combine; if a single file is given, it is merged with
                        input from stdin; use '-' to use stdin as an explicit
                        input source
options:
  -h, --help            show this help message and exit
  -v, --version         show version number and exit
  -d, --debug           if specified, additional debug output is given
output file options:
  -o, --output FILE     file to write output to (default: write to stdout)
sequence format options:
  -l, --library LIBRARY
                        library file with marker definitions; custom file or
                        built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB',
                        'ForenSeqB-UAS', 'PowerSeq46GY'

bgpredict

Predict background profiles of new alleles based on a model of stutter occurrence obtained from stuttermodel.

This tool can be used to compute background noise profiles for alleles for which no reference samples are available. The profiles are predicted using a model of stutter occurrence that must have been created previously using stuttermodel. A list of sequences should be given; bgpredict will predict a background noise profile for each of the provided sequences separately. The prediction is based completely on the provided stutter model.

The predicted background noise profiles obtained from bgpredict can be combined with the output of bgestimate and/or bghomstats using bgmerge.

It is possible to use an entire forensic case sample as the SEQS input argument of bgpredict to obtain a predicted background noise profile for each sequence detected in the sample. When the background noise profiles thus obtained are combined with those obtained from bgestimate, bgcorrect may subsequently produce 'cleaner' results if the sample contained alleles for which no reference samples were available.

usage: fdstools bgpredict [-h] [-v] [-d] [-C] [-M MARKER] [-A] [-n PCT] [-t N]
                          [-l LIBRARY]
                          STUT SEQS [OUT]
positional arguments:
  STUT                  file containing a trained stutter model
  SEQS                  file containing the sequences for which a profile
                        should be predicted
  OUT                   the file to write the output to (default: write to
                        stdout)
options:
  -h, --help            show this help message and exit
  -v, --version         show version number and exit
  -d, --debug           if specified, additional debug output is given
  -C, --combine-strands
                        if specified, stutter will be modeled for the total
                        number of reads, instead of separately for either
                        strand
  -M, --marker MARKER   assume the specified marker for all sequences
  -A, --use-all-data    if specified, the 'All data' model is used to predict
                        stutter whenever no marker-specific model is available
                        for a certain repeat unit
filtering options:
  -n, --min-pct PCT     minimum amount of background to consider, as a
                        percentage of the highest allele (default: 0.50)
  -t, --min-r2 N        minimum required r-squared score (default: 0.0)
sequence format options:
  -l, --library LIBRARY
                        library file with marker definitions; custom file or
                        built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB',
                        'ForenSeqB-UAS', 'PowerSeq46GY'

findnewalleles

Mark all sequences that are not in another list of sequences.

If not present, a new column 'flags' is added to the output. Any sequence that does not occur in the provided list of known sequences is flagged 'novel'.
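
The flagging rule described above can be sketched in Python as follows. This is a minimal illustration, not the FDSTools implementation: the function name flag_novel, the dict-based rows, the marker name 'MarkerA' and the comma-separated layout of the flags value are all assumptions made for the example.

```python
def flag_novel(rows, known):
    """Return copies of rows with 'novel' added to the flags of unknown sequences.

    rows:  list of dicts with 'marker' and 'sequence' keys (hypothetical layout)
    known: set of (marker, sequence) pairs that count as known alleles
    """
    flagged = []
    for row in rows:
        row = dict(row)  # work on a copy so the caller's data is untouched
        # A missing 'flags' column is treated as empty and added to the output.
        flags = [f for f in row.get("flags", "").split(",") if f]
        if (row["marker"], row["sequence"]) not in known and "novel" not in flags:
            flags.append("novel")
        row["flags"] = ",".join(flags)
        flagged.append(row)
    return flagged


known = {("MarkerA", "CE9_AGAT[6]ACCT[3]")}
rows = [
    {"marker": "MarkerA", "sequence": "CE9_AGAT[6]ACCT[3]", "flags": "allele"},
    {"marker": "MarkerA", "sequence": "CE10_AGAT[7]ACCT[3]"},
]
result = flag_novel(rows, known)
```

In this sketch the first row keeps its existing flags, while the second row, which is absent from the known list, gains a 'flags' value of 'novel'.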

usage: fdstools findnewalleles [-h] [-v] [-d] [-r] [-i IN [IN ...]]
                               [-o OUT [OUT ...]] [-e REGEX] [-f EXPR]
                               [-l LIBRARY]
                               KNOWN [IN] [OUT]
positional arguments:
  KNOWN                 file containing a list of known allelic sequences
options:
  -h, --help            show this help message and exit
  -v, --version         show version number and exit
  -d, --debug           if specified, additional debug output is given
  -r, --remove-allele-flags
                        remove the 'allele' flag from the alleles that are
                        marked 'novel'
input file options:
  IN                    single sample data file to process (default: read from
                        stdin)
  -i, --input IN [IN ...]
                        multiple sample data files to process (use with
                        -o/--output)
output file options:
  OUT                   the file to write the output to (default: write to
                        stdout)
  -o, --output OUT [OUT ...]
                        list of names of output files to match with input
                        files specified with -i/--input, or a format string to
                        construct file names from sample tags; e.g., the
                        default value is '\1-findnewalleles.out', which
                        expands to 'sampletag-findnewalleles.out'
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e, --tag-expr REGEX  regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly
sequence format options:
  -l, --library LIBRARY
                        library file with marker definitions; custom file or
                        built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB',
                        'ForenSeqB-UAS', 'PowerSeq46GY'
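
The interplay of -e/--tag-expr, -f/--tag-format and the -o/--output format string can be illustrated with Python's re module, whose syntax these options follow. The sketch below is an approximation under stated assumptions: TAG_EXPR is a hypothetical stand-in for the default expression (the help text only says that the file name without its extension is captured), and output_name is an illustrative helper, not an FDSTools function.

```python
import re

# Hypothetical stand-in for the default -e/--tag-expr: capture the file name
# up to, but not including, its last extension (if any).
TAG_EXPR = r"^(.+?)(?:\.[^.]+)?$"


def output_name(filename, tag_expr=TAG_EXPR, tag_format=r"\1",
                out_format=r"\1-findnewalleles.out"):
    """Derive the sample tag and the output file name for one input file."""
    match = re.search(tag_expr, filename)
    if match is None:
        raise ValueError(f"tag expression does not match {filename!r}")
    # -f/--tag-format may refer to any capturing group of the tag expression.
    tag = match.expand(tag_format)
    # In the -o/--output format string, \1 stands for the finished sample tag.
    return tag, out_format.replace(r"\1", tag)


default_result = output_name("sample1.csv")
custom_result = output_name("0042_caseA.csv",
                            tag_expr=r"^(\d+)_(\w+)\.csv$",
                            tag_format=r"\2-\1")
```

With the defaults, 'sample1.csv' yields the tag 'sample1' and the output name 'sample1-findnewalleles.out'. The second call shows how two capturing groups can be reordered into the tag 'caseA-0042'.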

libconvert

Convert a legacy TSSV library file (tab-separated) to the FDSTools library format (ini-style).

This is a convenience tool for users migrating from the standalone 'TSSV' programme. Use the 'library' tool if you wish to create a new, empty FDSTools library file to start with.

Both FDSTools and the standalone 'TSSV' programme use a library file to store the names, flanking (primer) sequences, and STR repeat structure of forensic STR markers. However, the TSSV library file format is not well suited for non-STR markers and automatic generation of allele names. FDSTools therefore employs a different (ini-style) library file format that can store more details about the markers used. The libconvert tool can be used to convert old library files to the new format.

Please refer to the help of the 'library' tool for more information about FDSTools library files.

usage: fdstools libconvert [-h] [-v] [-d] [IN] [OUT]
positional arguments:
  IN             input library in the legacy TSSV format (default: read from
                 stdin)
  OUT            the file to write the FDSTools library to (default: write to
                 stdout)
options:
  -h, --help     show this help message and exit
  -v, --version  show version number and exit
  -d, --debug    if specified, additional debug output is given

library

Create an empty FDSTools library file.

An FDSTools library file contains various details about the forensic markers used in the analysis, such as the genomic location, expected number of alleles, expected length range of alleles, etc. FDSTools primarily uses library files for configuring STRNaming, which is responsible for converting sequences to allele names and vice versa. This is true even for non-STR markers and fragments on the mitochondrial genome.

In its simplest form, the library file only contains the positions (on the human genome reference sequence, GRCh38) of the reported range of each marker. This is referred to as a 'smart' library file. Alternatively, markers can be explicitly configured, which was the default prior to FDSTools version 2.0. Explicit configuration is currently required when the analysed markers are non-human.

Users migrating from the standalone 'TSSV' programme may use the libconvert tool to convert their TSSV library file to FDSTools format.

usage: fdstools library [-h] [-v] [-d] [-t TYPE] [-m] [-b NAME] [OUT]
positional arguments:
  OUT                   the file to write the output to (default: write to
                        stdout)
options:
  -h, --help            show this help message and exit
  -v, --version         show version number and exit
  -d, --debug           if specified, additional debug output is given
  -t, --type TYPE       the type of markers that this library file will be
                        used for; with 'smart' (the default), only the genomic
                        positions of the analysed ranges (i.e., the amplicon
                        excluding the primers) need to be specified and
                        FDSTools will automatically detect and configure
                        allele naming using STRNaming (currently only
                        supported for markers in the human genome); 'full'
                        will create a library file with all possible sections;
                        'str' or 'non-str' will only output sections used to
                        explicitly define STR and non-STR markers,
                        respectively
  -m, --microhaplotypes
                        if specified, the [microhaplotype_positions] section
                        is included, which can be used to configure allele
                        calling for microhaplotype targets
  -b, --builtin NAME    start with a built-in library file, choose from
                        'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB',
                        'ForenSeqB-UAS', 'PowerSeq46GY'

pipeline

Automatically run complete, predefined analysis pipelines. Recommended starting point for new users.

This tool runs one of three default analysis pipelines automatically, given a configuration file with tool options and input/output file names. The three available analysis options are 'reference-sample', analysing a single reference sample with TSSV and Stuttermark; 'reference-database', analysing a collection of reference samples with BGEstimate and Stuttermodel; and 'case-sample', analysing a single case sample with TSSV, BGPredict, BGMerge, BGCorrect, and Samplestats. All results are visualised in interactive graphical reports for presentation and further interpretation.

This tool takes a single mandatory argument: the name of an INI configuration file that contains the analysis settings to use. An easy way to obtain such an INI file with default values for all settings is to run fdstools pipeline your-filename.ini --analysis case-sample. This will create the file 'your-filename.ini' and fill it with default values for the given analysis type (in this example: case-sample analysis).

All settings in the configuration file correspond to options of various tools in FDSTools. Please refer to the tool-specific help for a full description of each tool. Type fdstools -h TOOL to get help with the given TOOL.

usage: fdstools pipeline [-h] [-v] [-d] [-a ANALYSIS] [-e REGEX] [-f EXPR]
                         [-l LIBRARY] [-s FASTA] [-m STUT] [-p PROFILES] [-r]
                         [-S SAMPLE [SAMPLE ...]] [-A ALLELEFILE] [-P PREFIX]
                         [-C]
                         INI
positional arguments:
  INI                   pipeline configuration file; if it does not exist, a
                        new file with default settings will be created
options:
  -h, --help            show this help message and exit
  -v, --version         show version number and exit
  -d, --debug           if specified, additional debug output is given
  -a, --analysis ANALYSIS
                        controls which predefined analysis pipeline will be
                        run; 'reference-sample' runs a single sample's
                        FastA/FastQ file through TSSV and Stuttermark to
                        prepare it for the reference-database analysis;
                        'reference-database' runs a collection of reference
                        samples through Allelefinder, BGEstimate, and
                        Stuttermodel to create a reference database of
                        systemic noise; 'case-sample' runs a single sample's
                        FastA/FastQ file through TSSV, BGPredict, BGCorrect,
                        and Samplestats
sample tag parsing options:
  these options are used to extract sample tags (names) from their file
  names; for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e, --tag-expr REGEX  regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly
input/output file options:
  words in [brackets] indicate applicable analysis types; all of these
  values can also be specified in the [pipeline] section of the INI file

  -l, --in-library LIBRARY
                        library file containing marker definitions
  -s, --in-sample-raw FASTA
                        [ref-sample, case-sample] FastA or FastQ file
                        containing raw sequence data of the sample
  -m, --in-stuttermodel STUT
                        [case-sample] file containing a trained stutter model
  -p, --in-bgprofiles PROFILES
                        [case-sample] file containing noise profiles from
                        BGEstimate
  -r, --store-predictions
                        [case-sample] if this option is specified, output
                        files named 'sampletag-bgpredict.txt' and
                        'sampletag-bgmerge.txt' will be created if applicable;
                        these files contain predicted stutter amounts for the
                        sequences in the sample based on the given stutter
                        model
  -S, --in-samples SAMPLE [SAMPLE ...]
                        [ref-database] file names of reference sample data
                        files ('.csv' output files of the 'reference-sample'
                        analysis)
  -A, --in-allelelist ALLELEFILE
                        [ref-database] file containing a list of the true
                        alleles of each sample; if not given, Allelefinder
                        will be run as part of the pipeline to create this
                        file; it is ESSENTIAL that you check the correctness
                        and completeness of the allele list
  -P, --prefix PREFIX   [ref-database] if specified, all output file names are
                        prefixed with this value
  -C, --combine-strands
                        [ref-database, case-sample] if specified, noise
                        analysis will be performed on the total number of
                        reads, instead of separately for either strand

samplestats

Compute various statistics for each sequence in the given sample data file and perform threshold-based allele calling.

Updates the 'flags' column (or adds it, if it was not present in the input data) to include 'allele' for all sequences that meet various allele calling thresholds.

Adds the following columns to the input data. Some columns may be omitted from the output if the input does not contain the required columns. In the column names below, 'X' is a placeholder for 'forward', 'reverse', and 'total', which refers to the strand of DNA for which the statistics are calculated. 'Y' is a placeholder for 'corrected' (statistics calculated on data after noise correction by e.g., BGCorrect), 'noise' (statistics calculated on the number of reads attributed to noise), and 'add' (statistics calculated on the number of reads recovered through noise correction). Wherever the 'Y' part of the column name is omitted, the values in the column are computed on data prior to noise correction.

X_Y: The number of Y reads of this sequence on the X strand (this column is not added by Samplestats, but should be present in the input).
X_Y_mp_sum: The value of X_Y, as a percentage of the sum of the X_Y of the marker.
X_Y_mp_max: The value of X_Y, as a percentage of the maximum X_Y of the marker.
forward_Y_pct: The number of Y reads on the forward strand, as a percentage of the total number of Y reads of this sequence.
X_correction_pct: The difference between the values of X_corrected and X, as a percentage of the value of X.
X_removed_pct: The value of X_noise, as a percentage of the value of X.
X_added_pct: The value of X_add, as a percentage of the value of X.
X_recovery: The value of X_add, as a percentage of the value of X_corrected.
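
As a worked illustration of the column definitions above, the following Python sketch computes the statistics for X = 'total' on a made-up marker with two sequences. The read counts are invented for the example, the function names are hypothetical, and returning 0.0 when a denominator is zero is an assumption; the source does not say how Samplestats handles that case.

```python
def pct(part, whole):
    """Return part as a percentage of whole (0.0 if whole is zero; an assumption)."""
    return 100.0 * part / whole if whole else 0.0


def total_strand_stats(rows):
    """Compute the X = 'total' statistics for all sequences of one marker."""
    total_sum = sum(r["total"] for r in rows)
    total_max = max(r["total"] for r in rows)
    stats = []
    for r in rows:
        stats.append({
            "total_mp_sum": pct(r["total"], total_sum),
            "total_mp_max": pct(r["total"], total_max),
            "forward_pct": pct(r["forward"], r["total"]),
            "total_correction_pct": pct(r["total_corrected"] - r["total"], r["total"]),
            "total_removed_pct": pct(r["total_noise"], r["total"]),
            "total_added_pct": pct(r["total_add"], r["total"]),
            "total_recovery": pct(r["total_add"], r["total_corrected"]),
        })
    return stats


# Invented read counts: a likely allele and a sequence that is mostly noise.
rows = [
    {"forward": 600, "total": 1000, "total_corrected": 1100,
     "total_noise": 20, "total_add": 120},
    {"forward": 60, "total": 100, "total_corrected": 10,
     "total_noise": 90, "total_add": 0},
]
stats = total_strand_stats(rows)
```

For the first sequence, total_correction_pct is (1100 - 1000) / 1000 = 10% and total_recovery is 120 / 1100, or roughly 10.9%. For the second sequence, total_removed_pct is 90% and total_correction_pct is -90%.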

usage: fdstools samplestats [-h] [-v] [-d] [-i IN [IN ...]] [-o OUT [OUT ...]]
                            [-e REGEX] [-f EXPR] [-U [only]] [-n N] [-b N]
                            [-m PCT] [-p PCT] [-c PCT] [-y PCT] [-E N]
                            [-D PCT] [-G [N]] [-a ACTION] [-A] [-N N] [-B N]
                            [-M PCT] [-P PCT] [-C PCT] [-Y PCT] [-F FORMAT]
                            [-l LIBRARY]
                            [IN] [OUT]
options:
  -h, --help            show this help message and exit
  -v, --version         show version number and exit
  -d, --debug           if specified, additional debug output is given
input file options:
  IN                    single sample data file to process (default: read from
                        stdin)
  -i, --input IN [IN ...]
                        multiple sample data files to process (use with
                        -o/--output)
output file options:
  OUT                   the file to write the output to (default: write to
                        stdout)
  -o, --output OUT [OUT ...]
                        list of names of output files to match with input
                        files specified with -i/--input, or a format string to
                        construct file names from sample tags; e.g., the
                        default value is '\1-samplestats.out', which expands
                        to 'sampletag-samplestats.out'
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e, --tag-expr REGEX  regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly
interpretation options:
  sequences that match the -c or -y option (or both) and all of the other
  settings are marked as 'allele'

  -U, --uncall-alleles [only]
                        if specified and the input contains sequences with the
                        'allele' flag, the flag will be removed for sequences
                        not meeting the requirements; with the optional
                        keyword 'only', no 'allele' flags will be added to any
                        sequences that do meet the criteria
  -n, --min-reads N     the minimum number of reads (default: 30)
  -b, --min-per-strand N
                        the minimum number of reads in both orientations
                        (default: 0)
  -m, --min-pct-of-max PCT
                        the minimum percentage of reads w.r.t. the highest
                        allele of the marker (default: 2.0)
  -p, --min-pct-of-sum PCT
                        the minimum percentage of reads w.r.t. the marker's
                        total number of reads (default: 1.5)
  -c, --min-correction PCT
                        the minimum percentage change in read count due to
                        correction by e.g., bgcorrect (total_correction
                        column; default: 0)
  -y, --min-recovery PCT
                        the minimum number of reads that was recovered thanks
                        to noise correction (by e.g., bgcorrect), as a
                        percentage of the total number of reads after
                        correction (total_recovery column; default: 0)
  -E, --min-allele-reads N
                        force a minimum total number of reads for all alleles
                        on a marker; don't call any alleles otherwise
                        (default: 0)
  -D, --max-nonallele-pct PCT
                        drop all allele markings if the highest non-allelic
                        sequence is at least this percentage of the total
                        number of reads for all alleles on that marker
                        (default: 100.0)
  -G, --max-alleles [N]
                        if specified, do not mark any alleles on a marker if
                        more than N alleles meet the criteria; without N, the
                        amounts given in the library file are used, which have
                        a default value of 1 for markers on the mitochondrial
                        genome and Y chromosome, or 2 otherwise (Note: don't
                        forget to provide -l/--library!)
filtering options:
  sequences that match the -C or -Y option (or both) and all of the other
  settings are retained, all others are filtered

  -a, --filter-action ACTION
                        filtering mode: 'off', disable filtering; 'combine',
                        replace filtered sequences by a single line with
                        aggregate values per marker; 'delete', remove filtered
                        sequences without leaving a trace (default: off)
  -A, --filter-absolute
                        if specified, apply filters to absolute read counts
                        (i.e., with the sign removed), which may keep over-
                        corrected sequences that would otherwise be filtered
                        out
  -N, --min-reads-filt N
                        the minimum number of reads (default: 1)
  -B, --min-per-strand-filt N
                        the minimum number of reads in both orientations
                        (default: 0)
  -M, --min-pct-of-max-filt PCT
                        the minimum percentage of reads w.r.t. the highest
                        allele of the marker (default: 0.0)
  -P, --min-pct-of-sum-filt PCT
                        the minimum percentage of reads w.r.t. the marker's
                        total number of reads (default: 0.0)
  -C, --min-correction-filt PCT
                        the minimum percentage change in read count due to
                        correction by e.g., bgcorrect (total_correction
                        column; default: 0)
  -Y, --min-recovery-filt PCT
                        the minimum number of reads that was recovered thanks
                        to noise correction (by e.g., bgcorrect), as a
                        percentage of the total number of reads after
                        correction (total_recovery column; default: 0)
sequence format options:
  -F, --sequence-format FORMAT
                        convert sequences to the specified format: one of raw,
                        tssv, allelename (default: no conversion)
  -l, --library LIBRARY
                        library file with marker definitions; custom file or
                        built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB',
                        'ForenSeqB-UAS', 'PowerSeq46GY'
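
The interpretation options of samplestats can be read as a decision rule: a sequence is called an allele if it meets the -c or -y threshold (or both) and all of the other thresholds. The Python sketch below encodes that reading with the documented default values. It is a simplification under several assumptions: the names Thresholds and call_alleles are hypothetical, thresholds are taken to be inclusive, percentages are computed against the highest and summed read counts of all sequences of the marker, and the -E, -D, -G and -U options are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    min_reads: int = 30          # -n/--min-reads
    min_per_strand: int = 0      # -b/--min-per-strand
    min_pct_of_max: float = 2.0  # -m/--min-pct-of-max
    min_pct_of_sum: float = 1.5  # -p/--min-pct-of-sum
    min_correction: float = 0.0  # -c/--min-correction
    min_recovery: float = 0.0    # -y/--min-recovery


def call_alleles(rows, t=Thresholds()):
    """Return one boolean per sequence of a single marker: True means 'allele'."""
    totals = [r["forward"] + r["reverse"] for r in rows]
    marker_max = max(totals)
    marker_sum = sum(totals)
    if marker_sum == 0:
        return [False] * len(rows)
    calls = []
    for r, total in zip(rows, totals):
        # At least one of the two noise-correction criteria must be met ...
        passes_c_or_y = (r["correction_pct"] >= t.min_correction
                         or r["recovery"] >= t.min_recovery)
        # ... together with every one of the remaining criteria.
        passes_rest = (
            total >= t.min_reads
            and min(r["forward"], r["reverse"]) >= t.min_per_strand
            and 100.0 * total / marker_max >= t.min_pct_of_max
            and 100.0 * total / marker_sum >= t.min_pct_of_sum
        )
        calls.append(passes_c_or_y and passes_rest)
    return calls


# Invented read counts for three sequences of one marker.
rows = [
    {"forward": 600, "reverse": 400, "correction_pct": 0.0, "recovery": 0.0},
    {"forward": 10, "reverse": 8, "correction_pct": 0.0, "recovery": 0.0},
    {"forward": 25, "reverse": 15, "correction_pct": 0.0, "recovery": 0.0},
]
default_calls = call_alleles(rows)
strict_calls = call_alleles(rows, Thresholds(min_pct_of_max=5.0))
```

With the defaults, the second sequence is rejected because its 18 reads fall below the minimum of 30. Raising the -m threshold to 5.0 also rejects the third sequence, which has 4% of the reads of the highest sequence.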

seqconvert

Convert between raw sequences, TSSV-style sequences, and allele names.

FDSTools was built to be compatible with TSSV, which writes sequences of known STR alleles in a shortened form referred to as 'TSSV-style sequences'. At the same time, FDSTools supports the creation of human-readable allele names which are more suitable for display.

For example, the raw sequence 'AGCGTAAGATAGATAGATAGATAGATAGATACCTACCTACCTCTAGCT' might be rewritten as the TSSV-style sequence 'AGCGTA(1)AGAT(6)ACCT(3)CTAGCT(1)', or as the allele name 'CE9_AGAT[6]ACCT[3]'.
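
The relationship between the TSSV-style and raw forms in this example can be checked with a few lines of Python. The helper tssv_to_raw is a hypothetical name used for illustration and is not part of FDSTools. The reverse conversion and the generation of allele names both depend on the marker definitions in a library file and are therefore not sketched here.

```python
import re


def tssv_to_raw(seq):
    """Expand a TSSV-style sequence such as 'AGAT(6)ACCT(3)' into a raw sequence."""
    # Reject anything that is not a plain series of UNIT(COUNT) blocks.
    if not re.fullmatch(r"(?:[ACGT]+\(\d+\))+", seq):
        raise ValueError(f"not a TSSV-style sequence: {seq!r}")
    blocks = re.findall(r"([ACGT]+)\((\d+)\)", seq)
    # Repeat every unit the stated number of times and join the results.
    return "".join(unit * int(count) for unit, count in blocks)


raw = tssv_to_raw("AGCGTA(1)AGAT(6)ACCT(3)CTAGCT(1)")
```

Here raw equals 'AGCGTAAGATAGATAGATAGATAGATAGATACCTACCTACCTCTAGCT', the raw sequence given in the example above.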

Seqconvert can be used to explicitly convert all sequences in a file to the same output format. Conversions are done using a library file, see the help text of the library tool for details.

You can specify multiple input files using the -i/--input option. This is especially useful when generating allele names for many samples that have many sequences in common. To call the variants in the allele names, FDSTools needs to perform sequence alignments, which can be rather slow. When generating allele names for many input files at once, the results of the alignments are cached, which may give a significant speed-up compared to generating allele names for each sample separately.

Seqconvert can also be used with two different library files to rewrite the allele names or TSSV-style sequences after a library update. Currently, the only limitation to this is that the ending position of the left flank and the starting position of the right flank must be the same.

Note that FDSTools makes no assumptions about the sequence format in its input files; instead it automatically performs any required conversions while running any tool. Explicitly running seqconvert is never a necessity; use this tool for your own convenience.

usage: fdstools seqconvert [-h] [-v] [-d] [-i IN [IN ...]] [-o OUT [OUT ...]]
                           [-e REGEX] [-f EXPR] [-m COLNAME] [-a COLNAME]
                           [-c COLNAME] [-M MARKER] [-l LIBRARY] [-L LIBRARY]
                           [-r MARKER [MARKER ...]]
                           FORMAT [IN] [OUT]
positional arguments:
  FORMAT                the format to convert to: one of raw, tssv, allelename
options:
  -h, --help            show this help message and exit
  -v, --version         show version number and exit
  -d, --debug           if specified, additional debug output is given
  -m, --marker-column COLNAME
                        name of the column that contains the marker name
                        (default: 'marker')
  -a, --allele-column COLNAME
                        name of the column that contains the sequence
                        (default: 'sequence')
  -c, --output-column COLNAME
                        name of the column to write the output to (default:
                        same as -a/--allele-column)
  -M, --marker MARKER   assume the specified marker for all sequences
  -l, --library LIBRARY
                        library file with marker definitions; custom file or
                        built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB',
                        'ForenSeqB-UAS', 'PowerSeq46GY'
  -L, --library2 LIBRARY
                        second library file to use for output; if specified,
                        allele names can be conveniently updated to fit this
                        new library file
  -r, --reverse-complement MARKER [MARKER ...]
                        to be used together with -L/--library2; specify the
                        markers for which the sequences are reverse-
                        complemented in the new library
input file options:
  IN                    single sample data file to process (default: read from
                        stdin)
  -i, --input IN [IN ...]
                        multiple sample data files to process (use with
                        -o/--output)
output file options:
  OUT                   the file to write the output to (default: write to
                        stdout)
  -o, --output OUT [OUT ...]
                        list of names of output files to match with input
                        files specified with -i/--input, or a format string to
                        construct file names from sample tags; e.g., the
                        default value is '\1-seqconvert.out', which expands to
                        'sampletag-seqconvert.out'
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e, --tag-expr REGEX  regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly

stuttermark

Mark potential stutter products by assuming a fixed maximum percentage of stutter product vs the parent sequence.

If not present, Stuttermark adds a new column named 'flags' to the output. The flags column will contain 'STUTTER' for possible stutter products. A sequence is considered a possible stutter product if its total read count is less than or equal to the maximum number of expected stutter reads. The maximum number of stutter reads is computed by assuming a fixed percentage of stutter product compared to the originating sequence.

Stuttermark requires TSSV-style sequences (automatically converting sequences to this format if necessary) and detects possible stutter products by comparing sequences that have the same repeat blocks but different numbers of repeats for one or more of their blocks.

The STUTTER annotation contains additional information. For example: 'STUTTER:146.6x1(2-1):10.4x2(2-1x9-1)'. This is a stutter product for which at most 146.6 reads have come from the first sequence in the output file ('146.6x1') and at most 10.4 reads have come from the second sequence in the output file ('10.4x2'). This sequence differs from the first sequence in the output file by a loss of one repeat of the second repeat block ('2-1') and it differs from the second sequence by the loss of one repeat in the second block and one repeat in the ninth block ('2-1x9-1'). The two contributions add up to 146.6 + 10.4 = 157 reads; if this sequence had more than 157 reads, it would not have been marked.
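
The structure of the annotation can be made explicit with a small parser. The Python sketch below is based only on the example annotation above, so the grammar it accepts is an assumption rather than a specification, and parse_stutter_flag is a hypothetical helper name.

```python
import re


def parse_stutter_flag(flag):
    """Split a STUTTER annotation into (max_reads, source_index, changes) tuples.

    changes is a list of (repeat_block, repeat_delta) pairs.
    """
    label, *parts = flag.split(":")
    if label != "STUTTER":
        raise ValueError(f"not a STUTTER annotation: {flag!r}")
    sources = []
    for part in parts:
        m = re.fullmatch(r"(\d+(?:\.\d+)?)x(\d+)\(([^)]+)\)", part)
        if m is None:
            raise ValueError(f"unexpected annotation part: {part!r}")
        changes = []
        # Within the parentheses, block changes are separated by 'x'.
        for change in m.group(3).split("x"):
            c = re.fullmatch(r"(\d+)([+-]\d+)", change)
            if c is None:
                raise ValueError(f"unexpected block change: {change!r}")
            changes.append((int(c.group(1)), int(c.group(2))))
        sources.append((float(m.group(1)), int(m.group(2)), changes))
    return sources


sources = parse_stutter_flag("STUTTER:146.6x1(2-1):10.4x2(2-1x9-1)")
max_stutter_reads = round(sum(reads for reads, _, _ in sources), 1)
```

For the example annotation, sources holds two entries: one for the first sequence with the change (2, -1), and one for the second sequence with the changes (2, -1) and (9, -1). The sum max_stutter_reads is the 157-read limit mentioned above.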

usage: fdstools stuttermark [-h] [-v] [-d] [-i IN [IN ...]] [-o OUT [OUT ...]]
                            [-e REGEX] [-f EXPR] [-s DEF] [-m N] [-n N] [-r N]
                            [-l LIBRARY]
                            [IN] [OUT]
options:
  -h, --help            show this help message and exit
  -v, --version         show version number and exit
  -d, --debug           if specified, additional debug output is given
  -s, --stutter DEF     Define maximum expected stutter percentages. The
                        default value of '-1:15,+1:4' sets -1 stutter (loss of
                        one repeat) to 15%, +1 stutter (gain of one repeat) to
                        4%. Any unspecified stutter amount is assumed not to
                        occur directly but e.g., a -2 stutter may still be
                        recognised as two -1 stutters stacked together. NOTE:
                        It may be necessary to specify this option as
                        '-s=-1:15,+1:4' (note the equals sign instead of a
                        space).
input file options:
  IN                    single sample data file to process (default: read from
                        stdin)
  -i, --input IN [IN ...]
                        multiple sample data files to process (use with
                        -o/--output)
output file options:
  OUT                   the file to write the output to (default: write to
                        stdout)
  -o, --output OUT [OUT ...]
                        list of names of output files to match with input
                        files specified with -i/--input, or a format string to
                        construct file names from sample tags; e.g., the
                        default value is '\1-stuttermark.out', which expands
                        to 'sampletag-stuttermark.out'
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e, --tag-expr REGEX  regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly
filtering options:
  -m, --min-reads N     set minimum number of reads to evaluate (default: 2)
  -n, --min-repeats N   set minimum number of repeats of a block that can
                        possibly stutter (default: 3)
  -r, --min-report N    a sequence is only annotated as a stutter of some
                        other sequence if the expected number of stutter
                        occurrences of this other sequence is above this value
                        (default: 0.1)
sequence format options:
  -l, --library LIBRARY
                        library file with marker definitions; custom file or
                        built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB',
                        'ForenSeqB-UAS', 'PowerSeq46GY'

stuttermodel

Train a stutter prediction model using homozygous reference samples.

The model obtained from this tool can be used by bgpredict to predict background noise profiles of alleles for which no reference samples are available.

usage: fdstools stuttermodel [-h] [-v] [-d] [-o FILE] [-e REGEX] [-f EXPR]
                             [-a ALLELEFILE] [-c COLNAME] [-C] [-T THREADS]
                             [-m PCT] [-n N] [-L N] [-s N] [-M MARKER] [-t N]
                             [-O] [-D N] [-S] [-z] [-u N] [-r RAWFILE]
                             [-l LIBRARY]
                             [FILE ...]
positional arguments:
  FILE                  the sample data file(s) to process (default: read from
                        stdin)
options:
  -h, --help            show this help message and exit
  -v, --version         show version number and exit
  -d, --debug           if specified, additional debug output is given
  -C, --combine-strands
                        if specified, stutter will be modeled for the total
                        number of reads, instead of separately for either
                        strand
  -T, --num-threads THREADS
                        number of worker threads to use (default: 1)
  -D, --degree N        degree of polynomials to fit (default: 2)
  -S, --same-shape      if specified, the polynomials of all markers will have
                        equal coefficients, except for a vertical shift
  -z, --ignore-zeros    if specified, samples exhibiting no stutter are
                        ignored
  -u, --max-unit-length N
                        investigate stutter of repeats of units of up to this
                        number of nucleotides in length (default: 6)
  -r, --raw-outfile RAWFILE
                        write raw data points to this file, for use in
                        stuttermodel visualisations (specify '-' to write to
                        stdout; normal output on stdout is then suppressed)
output file options:
  -o, --output FILE     file to write output to (default: write to stdout)
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e, --tag-expr REGEX  regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly
allele detection options:
  -a, --allelelist ALLELEFILE
                        file containing a list of the true alleles of each
                        sample (e.g., obtained from allelefinder)
  -c, --annotation-column COLNAME
                        name of a column in the sample files, which contains a
                        value beginning with 'ALLELE' for the true alleles of
                        the sample
filtering options:
  -m, --min-pct PCT     minimum amount of background to consider, as a
                        percentage of the highest allele (default: 0.00)
  -n, --min-abs N       minimum amount of background to consider, as an
                        absolute number of reads (default: 1)
  -L, --min-lengths N   require this minimum number of unique repeat lengths
                        (default: 5)
  -s, --min-samples N   require this minimum number of samples for each true
                        allele (default: 1)
  -M, --marker MARKER   work only on MARKER
  -t, --min-r2 N        minimum required r-squared score (default: 0.5)
  -O, --orphans         if specified, a fit on one strand is reported even if
                        no fit was obtained on the other strand for the same
                        marker, unit, and stutter depth
sequence format options:
  -l, --library LIBRARY
                        library file with marker definitions; custom file or
                        built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB',
                        'ForenSeqB-UAS', 'PowerSeq46GY'
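
To make the -D/--degree and -t/--min-r2 options more concrete, the sketch below fits a polynomial by least squares and computes its r-squared score. It assumes that the fit relates repeat length (x) to the amount of stutter (y), which the option descriptions suggest but this page does not state explicitly. The functions polyfit and r_squared and the data points are hypothetical, and the code is not taken from FDSTools.

```python
def polyfit(xs, ys, degree=2):
    """Least-squares polynomial fit; returns coefficients, lowest power first."""
    n = degree + 1
    # Normal equations (V^T V) c = V^T y, with V the Vandermonde matrix of xs.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coef = [0.0] * n
    for i in reversed(range(n)):
        s = sum(a[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (b[i] - s) / a[i][i]
    return coef

def r_squared(xs, ys, coef):
    """Coefficient of determination of the fitted polynomial."""
    mean = sum(ys) / len(ys)
    ss_tot = sum((y - mean) ** 2 for y in ys)
    ss_res = sum(
        (y - sum(c * x ** i for i, c in enumerate(coef))) ** 2
        for x, y in zip(xs, ys)
    )
    return 1 - ss_res / ss_tot

# Synthetic, noise-free data: stutter ratio = 0.001 * (repeat length)^2.
lengths = [8, 9, 10, 11, 12, 13, 14]
ratios = [0.001 * x ** 2 for x in lengths]

coef = polyfit(lengths, ratios, degree=2)
score = r_squared(lengths, ratios, coef)
print(score >= 0.5)   # a fit is kept only if it reaches the minimum r-squared
```

Real data are noisy, so the score will be below 1; fits scoring under the -t/--min-r2 threshold would not be reported.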

tssv

Link raw reads in a FastA or FastQ file to markers and count the number of reads for each unique sequence.

Scans a FastA or FastQ file, finding the sequences from the 'flanks' section in the provided library file. Each time a pair of flanks is found, that sequence read is linked to the corresponding marker and the portion of the sequence between the flanks is extracted. The number of times each such extracted sequence was encountered is counted, along with the orientation (strand) in which it was found in the input file. The output is a list of unique sequences found for each marker including the corresponding counts.
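
The scanning step described above can be sketched as follows. This is a simplified illustration, not the tssv implementation: it uses exact string matching instead of alignment with mismatches, handles a single marker, and all function names, flanks, and reads are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def extract(read, left_flank, right_flank):
    """Return the sequence between the flanks, or None if a flank is missing."""
    start = read.find(left_flank)
    if start == -1:
        return None
    start += len(left_flank)
    end = read.find(right_flank, start)
    if end == -1:
        return None
    return read[start:end]

def count_sequences(reads, left_flank, right_flank):
    """Count extracted sequences per strand: {sequence: [forward, reverse]}."""
    counts = {}
    for read in reads:
        # Try the read as given (forward), then its reverse complement.
        for strand, oriented in enumerate((read, reverse_complement(read))):
            seq = extract(oriented, left_flank, right_flank)
            if seq is not None:
                counts.setdefault(seq, [0, 0])[strand] += 1
                break
    return counts

# Hypothetical flanks and reads, for illustration only.
left, right = "ACGTAC", "GGTTCA"
forward_read = "TT" + left + "AGATAGAT" + right + "AA"
reads = [forward_read, forward_read, reverse_complement(forward_read), "CCCCCCCC"]

print(count_sequences(reads, left, right))   # {'AGATAGAT': [2, 1]}
```

The read without flanks is not linked to the marker and therefore does not appear in the counts.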

By default, a small number of mismatches is allowed when aligning the flanks to the reads. This can be controlled with the -m/--mismatches option. Furthermore, when the portion of the sequence to which a flank aligns is completely written in lowercase letters in the input file, that match is discarded. This way, FDSTools works well together with the paired-end read merging tool FLASH, version 1.2.11/lo, which (optionally) writes the non-overlapping portion of the reads in lowercase [1]. Together, this ensures repetitive sequences (such as STRs) are not truncated when the paired-end reads are merged.
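
The two rules above can be sketched in a few lines. The interpretation of -m/--mismatches follows its option description (a rate per nucleotide when below 1, otherwise an absolute number, rounded upward). The function names are hypothetical and the code is an illustration, not FDSTools' implementation.

```python
import math

def allowed_mismatches(mismatches, flank_length):
    """Interpret -m/--mismatches: a per-nucleotide rate if below 1, else a count."""
    if mismatches < 1:
        return math.ceil(mismatches * flank_length)
    return math.ceil(mismatches)

def keep_flank_match(aligned_portion):
    """A match is discarded if the aligned part of the read is entirely lowercase."""
    return not aligned_portion.islower()

print(allowed_mismatches(0.1, 16))    # 2: 0.1 * 16 = 1.6, rounded upward
print(allowed_mismatches(3, 16))      # 3: values of 1 or more are absolute
print(keep_flank_match("acgtacgt"))   # False: fully lowercase, match discarded
print(keep_flank_match("acgtACGT"))   # True: partly uppercase, match kept
```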

The sequences thus obtained are subsequently filtered in three ways. First, the 'expected_allele_length' section in the library file may be used to specify hard limits on the acceptable sequence length for each marker. Any unexpectedly short or long sequence is removed. Second, any sequence with an ambiguous base (i.e., not A, C, G, or T) is removed. Finally, the -a/--minimum option can be used to filter out sequences that have been seen only rarely. By default, all filtered sequences of each marker are aggregated and reported as 'Other sequences'; specify the -B/--no-aggregate-filtered option to remove them entirely.
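
The three filters can be sketched as a single pass over the sequence counts of one marker. This is a minimal illustration under stated assumptions, not the tssv code: the function filter_sequences, its parameters, and the example counts are hypothetical, and sequences are assumed to be uppercase.

```python
def filter_sequences(counts, min_length=None, max_length=None,
                     minimum=2, aggregate=True):
    """Filter {sequence: reads} for one marker.

    Filtered sequences are summed into 'Other sequences' when aggregate is
    True, and dropped otherwise.
    """
    kept, other = {}, 0
    for seq, reads in counts.items():
        too_short = min_length is not None and len(seq) < min_length
        too_long = max_length is not None and len(seq) > max_length
        ambiguous = any(base not in "ACGT" for base in seq)
        if too_short or too_long or ambiguous or reads < minimum:
            other += reads
        else:
            kept[seq] = reads
    if aggregate and other:
        kept["Other sequences"] = other
    return kept

# Hypothetical counts: one good sequence, one too short, one with an
# ambiguous base, and one seen only once.
counts = {"AGATAGAT": 120, "AGAT": 40, "AGATNGAT": 7, "AGATAGAC": 1}

print(filter_sequences(counts, min_length=6, max_length=12))
# {'AGATAGAT': 120, 'Other sequences': 48}
print(filter_sequences(counts, min_length=6, max_length=12, aggregate=False))
# {'AGATAGAT': 120}
```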

This tool is an evolution of the original TSSV program [2].

References:
[1] https://github.com/Jerrythafast/FLASH-lowercase-overhang
[2] https://github.com/jfjlaros/tssv

usage: fdstools tssv [-h] [-v] [-d] [-F FORMAT] [-R FILE] [-L N] [-D DIR]
                     [-T THREADS] [-X] [-m MISMATCHES] [-n N] [-a N] [-B]
                     [-M ACTION]
                     LIBRARY [IN] [OUT]
positional arguments:
  LIBRARY               library file with marker definitions; custom file or
                        built-in: 'ForenSeqA', 'ForenSeqA-UAS', 'ForenSeqB',
                        'ForenSeqB-UAS', 'PowerSeq46GY'
  IN                    the sample data file to process (default: read from
                        stdin)
  OUT                   the file to write the output to (default: write to
                        stdout)
options:
  -h, --help            show this help message and exit
  -v, --version         show version number and exit
  -d, --debug           if specified, additional debug output is given
  -L, --flank-length N  length of anchor (flanking) sequences to use, if not
                        specified in the library file (default: 16)
  -D, --dir DIR         output directory for verbose output; when given, a
                        subdirectory will be created for each marker, each
                        with a separate sequences.csv file and a number of
                        FASTA/FASTQ files containing unrecognised reads
                        (unknown.fa), recognised reads (Marker/paired.fa), and
                        reads that lack one of the flanks of a marker
                        (Marker/noend.fa and Marker/nostart.fa)
  -T, --num-threads THREADS
                        number of worker threads to use (default: 1)
  -X, --no-deduplicate  disable deduplication of reads; by setting this
                        option, memory usage will be reduced at the expense
                        of longer running time
sequence format options:
  -F, --sequence-format FORMAT
                        convert sequences to the specified format: one of raw,
                        tssv, allelename (default: raw)
output file options:
  -R, --report FILE     file to write a report to (default: write to stderr)
filtering options:
  -m, --mismatches MISMATCHES
                        number of mismatches (per nucleotide of flanking
                        sequence if less than 1, else absolute) to allow in
                        flanking sequences, rounded upward (default: 0.1)
  -n, --indel-score N   insertions and deletions in the flanking sequences are
                        penalised this number of times more heavily than
                        mismatches (default: 2)
  -a, --minimum N       report only sequences with this minimum number of
                        reads (default: 2)
  -B, --no-aggregate-filtered
                        by default, sequences that have been filtered (as per
                        the -a/--minimum option, the expected_allele_length
                        section in the library file, as well as all sequences
                        with ambiguous bases) will be aggregated per marker
                        and reported as 'Other sequences'; specify this option
                        to remove such sequences entirely
  -M, --missing-marker-action ACTION
                        action to take when no sequences are linked to a
                        marker: one of include, exclude, halt (default:
                        include)

vis

Create a data visualisation web page or Vega graph specification.

With no optional arguments specified, a self-contained web page (HTML file) is produced. You can open this file in a web browser to view interactive visualisations of your data. The web page contains a file selection element which can be used to select the data to be visualised.

Visualisations make use of the Vega JavaScript library (https://vega.github.io). The required JavaScript libraries (Vega and D3) are embedded in the generated HTML file. With the -O/--online option specified, the HTML file will instead link to the latest version of these libraries on the Internet.

Vega supports generating visualisations on the command line. By default, FDSTools produces a full-featured HTML file. Specify the -V/--vega option if you wish to obtain a bare Vega graph specification (a JSON file) instead. You can pass this file through Vega to generate a PNG or SVG image file.

If an input file is specified, the visualisation will be set up specifically to visualise the contents of this file. To this end, the entire file contents are embedded in the generated visualisation.

usage: fdstools vis [-h] [-v] [-d] [-V] [-t] [-T TITLE] [-n N] [-m PCT]
                    [-S PCT] [-s N] [-B N] [-c] [-M MARKER] [-U UNIT] [-A]
                    [-a] [-I FILE] [-L] [-b N] [-p N] [-w N] [-H N] [-x N]
                    [-j N] [-N N] [-X PCT] [-Q PCT] [-C N] [-Y N] [-Z N]
                    TYPE [IN] [OUT]
positional arguments:
  TYPE                  the type of data to visualise; use 'sample' to
                        visualise sample data files and bgcorrect output; use
                        'profile' to visualise background noise profiles
                        obtained with bgestimate, bghomstats, and bgpredict;
                        use 'bgraw' to visualise raw background noise data
                        obtained with bghomraw; use 'stuttermodel' to
                        visualise models of stutter obtained from
                        stuttermodel; use 'bganalyse' to visualise data
                        obtained from bganalyse; use 'allele' to visualise
                        the allele list obtained from allelefinder
  IN                    file containing the data to embed in the visualisation
                        file; if not specified, HTML visualisation files will
                        contain a file selection control, and Vega
                        visualisation files will load data from a file called
                        'data.csv'
  OUT                   file to write output to (default: write to stdout)
options:
  -h, --help            show this help message and exit
  -v, --version         show version number and exit
  -d, --debug           if specified, additional debug output is given
  -V, --vega            by default, a full-featured HTML file offering an
                        interactive visualisation is created; if this option
                        is specified, only a bare Vega graph specification
                        (JSON file) is produced instead
  -t, --tidy            tidily indent the generated JSON
  -T, --title TITLE     prepend the given value to the title of HTML
                        visualisations (default: prepend name of data file if
                        given)
visualisation options:
  words in [brackets] indicate applicable visualisation types

  -n, --min-abs N       [sample, profile, bgraw] only show sequences with this
                        minimum number of reads (default: 5)
  -m, --min-pct-of-max PCT
                        [sample, profile, bgraw] for sample: only show
                        sequences with at least this percentage of the number
                        of reads of the highest allele of a marker; for
                        profile and bgraw: at least this percentage of the
                        true allele (default: 0.5)
  -S, --min-pct-of-sum PCT
                        [sample] only show sequences with at least this
                        percentage of the total number of reads of a marker
                        (default: 0.0)
  -s, --min-per-strand N
                        [sample] only show sequences with this minimum number
                        of reads for both orientations (forward/reverse)
                        (default: 0)
  -B, --bias-threshold N
                        [sample] mark sequences that have less than this
                        percentage of reads on one strand (default: 0.0)
  -c, --no-ce-length-sort
                        [sample] if specified, do not sort STR alleles by
                        length
  -M, --marker MARKER   [sample, profile, bgraw, stuttermodel, bganalyse] only
                        show graphs for the markers that contain the given
                        value in their name; separate multiple values with
                        spaces; prepend any value with '=' for an exact match
                        (default: show all markers)
  -U, --repeat-unit UNIT
                        [stuttermodel] only show graphs for the repeat units
                        that contain the given value; separate multiple values
                        with spaces; prepend any value with '=' for an exact
                        match (default: show all repeat units)
  -A, --no-alldata      [stuttermodel] if specified, show only marker-specific
                        fits
  -a, --no-aggregate    [sample] if specified, do not replace filtered
                        sequences with a per-marker aggregate 'Other
                        sequences' entry
  -I, --input2 FILE     [profile, stuttermodel] raw data points file to
                        overlay on the background noise profiles or stutter
                        model graphs (as obtained from bghomraw or the
                        -r/--raw-outfile option of stuttermodel); if not
                        specified, HTML visualisation files will contain a
                        file selection control
display options:
  -L, --log-scale       [sample, profile, bgraw, bganalyse] use logarithmic
                        scale (for sample and bganalyse: square root scale)
                        instead of linear scale
  -b, --bar-width N     [sample, profile, bgraw, bganalyse] width of the bars
                        in pixels (default: 15)
  -p, --padding N       [sample, profile, bgraw, stuttermodel] amount of
                        padding (in pixels) between graphs of different
                        markers/alleles (default: 70)
  -w, --width N         [sample, profile, bgraw, stuttermodel, bganalyse,
                        allele] width of the graph area in pixels (default:
                        600)
  -H, --height N        [stuttermodel, allele] height of the graph area in
                        pixels (default: 400)
  -x, --max-seq-len N   [sample] truncate long sequences to this number of
                        characters (default: 70)
  -j, --jitter N        [stuttermodel] apply this amount of jitter to raw data
                        points (between 0 and 1, default: 0.25)
allele calling options:
  for sample visualisations only; sequences that match the -C or -Y option
  (or both) and all of the other settings are marked as 'allele'

  -N, --allele-min-abs N
                        the minimum number of reads (default: 30)
  -X, --allele-min-pct-of-max PCT
                        the minimum percentage of reads w.r.t. the highest
                        allele of the marker (default: 2.0)
  -Q, --allele-min-pct-of-sum PCT
                        the minimum percentage of reads w.r.t. the marker's
                        total number of reads (default: 1.5)
  -C, --allele-min-correction N
                        the minimum change in read count due to correction by
                        e.g., bgcorrect (default: 0)
  -Y, --allele-min-recovery N
                        the minimum number of reads that was recovered thanks
                        to noise correction (by e.g., bgcorrect), as a
                        percentage of the total number of reads after
                        correction (default: 0)
  -Z, --allele-min-per-strand N
                        the minimum number of reads in both orientations
                        (default: 0)
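
The way the allele calling options combine can be read as a small predicate: a sequence must satisfy -C or -Y (or both), and additionally all of -N, -X, -Q, and -Z. The sketch below encodes that reading. It is an illustration only: the function is_allele and its parameter names are hypothetical, the thresholds are assumed to be inclusive minimums, and the code is not taken from the visualisation itself.

```python
def is_allele(reads, pct_of_max, pct_of_sum, correction, recovery,
              min_strand_reads,
              min_abs=30, min_pct_of_max=2.0, min_pct_of_sum=1.5,
              min_correction=0, min_recovery=0, min_per_strand=0):
    """Return True if a sequence would be marked as 'allele'.

    min_strand_reads is the lower of the forward and reverse read counts.
    Default thresholds are the defaults listed for -N, -X, -Q, -C, -Y, -Z.
    """
    # At least one of the two correction-based criteria must hold.
    meets_c_or_y = correction >= min_correction or recovery >= min_recovery
    # All of the remaining criteria must hold.
    meets_others = (
        reads >= min_abs
        and pct_of_max >= min_pct_of_max
        and pct_of_sum >= min_pct_of_sum
        and min_strand_reads >= min_per_strand
    )
    return meets_c_or_y and meets_others

# Hypothetical sequences, for illustration only.
print(is_allele(reads=250, pct_of_max=100.0, pct_of_sum=45.0,
                correction=0, recovery=0, min_strand_reads=100))   # True
print(is_allele(reads=12, pct_of_max=4.8, pct_of_sum=2.2,
                correction=0, recovery=0, min_strand_reads=5))     # False
```

With the default values of 0 for -C and -Y, the first condition holds for any non-negative input, so only the other thresholds decide the outcome; the second sequence fails because it has fewer than 30 reads.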