FDSTools Tools

This page gives a brief description of each tool in FDSTools. Descriptions of the command line arguments are included. The same information can be obtained from the FDSTools command line by running the command fdstools --help (for the list of tools) or fdstools --help toolname (for a description of that particular tool).

We will extend this page with practical examples of each tool in the future.

FDSTools

Data analysis tools for Next Generation Sequencing of forensic DNA markers, including tools for characterisation and filtering of PCR stutter artefacts and other systemic noise, and for automatic detection of the alleles in a sample.

usage: fdstools [-h ...] [-v ...] [-d] TOOL ...
optional arguments:
  -h ..., --help ...    show this help message, or help for the specified
                        TOOL, and exit
  -v ..., --version ...
                        show version number and exit
  -d, --debug           if specified, additional debug output is given
available tools:
  TOOL                  specify which tool to run
    allelefinder        Find true alleles in reference samples and detect
                        possible contaminations.
    bganalyse           Analyse the amount of noise in reference samples.
    bgcorrect           Match background noise profiles (obtained from e.g.,
                        bgestimate) to samples.
    bgestimate          Estimate allele-centric background noise profiles
                        (means) from reference samples.
    bghomraw            Compute noise ratios for all noise detected in
                        homozygous reference samples.
    bghomstats          Compute allele-centric statistics for background noise
                        in homozygous reference samples (min, max, mean,
                        sample variance).
    bgmerge             Merge multiple files containing background noise
                        profiles.
    bgpredict           Predict background profiles of new alleles based on a
                        model of stutter occurrence obtained from
                        stuttermodel.
    findnewalleles      Mark all sequences that are not in another list of
                        sequences.
    libconvert          Convert between TSSV (tab-separated) and FDSTools
                        (ini-style) library formats.
    library             Create an empty FDSTools library file.
    pipeline            Automatically run complete, predefined analysis
                        pipelines. Recommended starting point for new users.
    samplestats         Compute various statistics for each sequence in the
                        given sample data file and perform threshold-based
                        allele calling.
    seqconvert          Convert between raw sequences, TSSV-style sequences,
                        and allele names.
    stuttermark         Mark potential stutter products by assuming a fixed
                        maximum percentage of stutter product vs the parent
                        sequence.
    stuttermodel        Train a stutter prediction model using homozygous
                        reference samples.
    tssv                Link raw reads in a FastA or FastQ file to markers and
                        count the number of reads for each unique sequence.
    vis                 Create a data visualisation web page or Vega graph
                        specification.

allelefinder

Find true alleles in reference samples and detect possible contaminations.

In each sample, the sequences with the highest read counts of each marker are called alleles, with a user-defined maximum number of alleles per marker. The allele balance is kept within given bounds. If the highest non-allelic sequence exceeds a given limit, no alleles are called for this marker. If this happens for multiple markers in one sample, no alleles are called for this sample at all.
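
The calling rule described above can be sketched in Python for a single marker. This is a simplified illustration, not the FDSTools implementation: the function name call_alleles and its dictionary input are hypothetical, the parameters mirror the defaults of -m, -M, -n and -a listed below, and the per-sample rejection controlled by -x is left out.

```python
def call_alleles(read_counts, min_allele_pct=30.0, max_noise_pct=10.0,
                 min_reads=50, max_alleles=2):
    """Return the called alleles of one marker, or [] if none are called.

    read_counts maps each sequence to its total number of reads
    (a hypothetical input layout chosen for this sketch).
    """
    # Rank the sequences from highest to lowest read count.
    ranked = sorted(read_counts.items(), key=lambda item: item[1],
                    reverse=True)
    if not ranked or ranked[0][1] < min_reads:
        return []  # the highest sequence has too few reads

    top = ranked[0][1]
    alleles = [ranked[0][0]]
    rest = ranked[1:]

    # Accept further alleles while they satisfy the allele balance bound
    # and the maximum number of alleles has not been reached.
    while (rest and len(alleles) < max_alleles
           and rest[0][1] * 100.0 >= min_allele_pct * top):
        alleles.append(rest.pop(0)[0])

    # Call nothing if the highest non-allelic sequence is too high.
    if rest and rest[0][1] * 100.0 >= max_noise_pct * top:
        return []
    return alleles


print(call_alleles({"seqA": 1000, "seqB": 800, "seqC": 20}))
print(call_alleles({"seqA": 1000, "seqB": 200}))
```

In the second call, seqB reaches only 20% of seqA: too low to be called as an allele with the default of 30%, but above the 10% noise limit, so no alleles are called for the marker.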

The allele list obtained from allelefinder should always be checked carefully before using it as the input of various other tools operating on reference samples. These tools rely heavily on the correctness of this file to do their job. One may use the allelefinder report (-R/--report output argument) and the bganalyse tool to get a quick overview of what might be wrong.

usage: fdstools allelefinder [-h] [-v] [-d] [-o FILE] [-R FILE] [-e REGEX]
                             [-f EXPR] [-m PCT] [-M PCT] [-n N] [-a N] [-x N]
                             [-c COLNAME] [-F FORMAT] [-l LIBRARY]
                             [FILE [FILE ...]]
positional arguments:
  FILE                  the sample data file(s) to process (default: read from
                        stdin)
optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -d, --debug           if specified, additional debug output is given
output file options:
  -o FILE, --output FILE
                        file to write output to (default: write to stdout)
  -R FILE, --report FILE
                        file to write a report to (default: write to stderr)
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e REGEX, --tag-expr REGEX
                        regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f EXPR, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly
filtering options:
  -m PCT, --min-allele-pct PCT
                        call heterozygous if the second allele is at least
                        this percentage of the highest allele of a marker
                        (default: 30.0)
  -M PCT, --max-noise-pct PCT
                        a sample is considered contaminated/unsuitable for a
                        marker if the highest non-allelic sequence is at least
                        this percentage of the highest allele of that marker
                        (default: 10.0)
  -n N, --min-reads N   require at least this number of reads for the highest
                        allele of each marker (default: 50)
  -a N, --max-alleles N
                        allow no more than this number of alleles per marker;
                        if unspecified, the amounts given in the library file
                        are used, which have a default value of 2
  -x N, --max-noisy N   entirely reject a sample if more than this number of
                        markers have a high non-allelic sequence (default: 2)
  -c COLNAME, --stuttermark-column COLNAME
                        name of column with Stuttermark output; if specified,
                        sequences for which the value in this column does not
                        start with ALLELE are ignored
sequence format options:
  -F FORMAT, --sequence-format FORMAT
                        convert sequences to the specified format: one of raw,
                        tssv, allelename (default: no conversion)
  -l LIBRARY, --library LIBRARY
                        library file for sequence format conversion
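
The interplay of -e/--tag-expr and -f/--tag-format shown above can be tried out with Python's re module. The following is a minimal sketch, assuming the regular expression is applied to the bare file name; the helper sample_tag, the file names, and the regular expressions are made up for illustration and are not taken from FDSTools.

```python
import re


def sample_tag(filename, tag_expr, tag_format=r"\1"):
    """Build a sample tag from a file name (hypothetical helper)."""
    match = re.search(tag_expr, filename)
    if match is None:
        return None  # the expression did not match this file name
    # Replace '\n' references with the text of the n-th capturing group.
    return match.expand(tag_format)


# One capturing group and the default format '\1'.
print(sample_tag("sampleA.csv", r"^(.+)\.csv$"))
# Two capturing groups, reordered by the format.
print(sample_tag("run7_sampleA.csv", r"^(run\d+)_(\w+)\.csv$", r"\2-\1"))
# A format without group references yields an explicit tag.
print(sample_tag("sampleA.csv", r"^(.+)\.csv$", "MySample"))
```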

bganalyse

Analyse the amount of noise in reference samples.

Use this tool after correcting the reference samples with bgcorrect to analyse the amount of remaining noise after correction. This way, potentially contaminated or otherwise 'dirty' reference samples can be detected. The highest amount of remaining noise can be interpreted as a lower bound to the reliable detection of a minor contributor's alleles in mixed DNA samples.

In the default mode ('full'), the lowest, highest, and total number of background/noise reads as well as the respective percentages w.r.t. the number of allelic reads of each marker in each sample are printed. This data can be visualised using fdstools vis bganalyse.

In the alternative 'percentiles' mode, the highest and the total number of background reads as a percentage of the number of allelic reads for each marker is given at selected percentiles of the samples. I.e., it gives the highest and total remaining noise considering only the cleanest x% of samples, for different values of x.
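
The idea behind the 'percentiles' mode can be illustrated with a small Python sketch. It assumes a nearest-rank percentile over the samples of one marker; how FDSTools itself computes or interpolates percentiles is not stated here, and the function name noise_at_percentiles and the example values are hypothetical.

```python
import math


def noise_at_percentiles(noise_pcts, percentiles=(100, 99, 95, 90)):
    """Highest noise percentage among the cleanest x% of samples.

    noise_pcts holds one noise percentage per sample for one marker.
    """
    ordered = sorted(noise_pcts)  # cleanest samples first
    result = {}
    if not ordered:
        return result
    for pct in percentiles:
        # Nearest rank: how many samples make up the cleanest pct%.
        keep = max(1, math.ceil(len(ordered) * pct / 100))
        result[pct] = ordered[keep - 1]
    return result


# Ten samples, one of which is much noisier than the others.
noise = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 12.0]
print(noise_at_percentiles(noise))
```

With these made-up values, the noisy sample dominates every percentile above 90, while the cleanest 90% of the samples stay at or below 4.5%.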

usage: fdstools bganalyse [-h] [-v] [-d] [-m MODE] [-p PCT] [-o FILE]
                          [-e REGEX] [-f EXPR] [-a ALLELEFILE] [-c COLNAME]
                          [-F FORMAT] [-l LIBRARY]
                          [FILE [FILE ...]]
positional arguments:
  FILE                  the sample data file(s) to process (default: read from
                        stdin)
optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -d, --debug           if specified, additional debug output is given
  -m MODE, --mode MODE  controls what kind of information is printed; 'full'
                        (the default) prints the lowest, highest, and total
                        number of background reads as well as the respective
                        percentages w.r.t. the number of allelic reads of each
                        marker in each sample; 'percentiles' prints the
                        highest and the total number of background reads as a
                        percentage of the number of allelic reads for each
                        marker at given percentiles
  -p PCT, --percentiles PCT
                        comma-separated list of percentiles to report when
                        -m/--mode is set to 'percentiles' (default:
                        100,99,95,90)
output file options:
  -o FILE, --output FILE
                        file to write output to (default: write to stdout)
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e REGEX, --tag-expr REGEX
                        regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f EXPR, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly
allele detection options:
  -a ALLELEFILE, --allelelist ALLELEFILE
                        file containing a list of the true alleles of each
                        sample (e.g., obtained from allelefinder)
  -c COLNAME, --annotation-column COLNAME
                        name of a column in the sample files, which contains a
                        value beginning with 'ALLELE' for the true alleles of
                        the sample
sequence format options:
  -F FORMAT, --sequence-format FORMAT
                        convert sequences to the specified format: one of raw,
                        tssv, allelename (default: raw)
  -l LIBRARY, --library LIBRARY
                        library file for sequence format conversion

bgcorrect

Match background noise profiles (obtained from e.g., bgestimate) to samples.

Eleven new columns are added to the output giving, for each sequence, the number of reads attributable to noise from other sequences (_noise columns) and the number of noise reads caused by the presence of this sequence (_add columns), as well as the resulting number of reads after correction (_corrected columns: original minus _noise plus _add).
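
As a worked example of the arithmetic behind the _corrected columns, the sketch below applies 'original minus _noise plus _add' to one row. The column prefixes forward, reverse and total, the helper apply_correction, and the numbers are assumptions made for illustration only.

```python
def apply_correction(row, prefixes=("forward", "reverse", "total")):
    """Add a <prefix>_corrected value for each read count column."""
    out = dict(row)
    for prefix in prefixes:
        out[prefix + "_corrected"] = (
            row[prefix] - row[prefix + "_noise"] + row[prefix + "_add"])
    return out


row = {
    "forward": 100, "forward_noise": 12.5, "forward_add": 3.0,
    "reverse": 90, "reverse_noise": 10.0, "reverse_add": 2.5,
    "total": 190, "total_noise": 22.5, "total_add": 5.5,
}
corrected = apply_correction(row)
print(corrected["forward_corrected"], corrected["total_corrected"])
```

A sequence that consists mostly of noise ends up with a corrected count close to zero, whereas a genuine allele gains the reads that it lost to its own noise products.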

The correction_flags column contains one of the following values: 'not_corrected', no background noise profile was available for this marker; 'not_in_ref_db', the sequence was not present in the noise profiles given; 'corrected_as_background_only', the sequence was present in the noise profiles given, but only as noise and not as genuine allele; 'corrected_bgpredict', the sequence was present in the noise profiles as a genuine allele, but its noise profile consists entirely of predictions as opposed to direct observations; 'corrected_bgestimate'/'corrected_bghomstats', the sequence was present in the noise profiles as a genuine allele and at least part of its noise profile was based on direct observations.

Finally, the weight column gives the number of times that the noise profile of that allele fitted in the sample.

usage: fdstools bgcorrect [-h] [-v] [-d] [-i IN [IN ...]] [-o OUT [OUT ...]]
                          [-e REGEX] [-f EXPR] [-M MARKER] [-F FORMAT]
                          [-l LIBRARY]
                          PROFILES [IN] [OUT]
positional arguments:
  PROFILES              file containing background noise profiles to match
optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -d, --debug           if specified, additional debug output is given
input file options:
  IN                    single sample data file to process (default: read from
                        stdin)
  -i IN [IN ...], --input IN [IN ...]
                        multiple sample data files to process (use with
                        -o/--output)
output file options:
  OUT                   the file to write the output to (default: write to
                        stdout)
  -o OUT [OUT ...], --output OUT [OUT ...]
                        list of names of output files to match with input
                        files specified with -i/--input, or a format string to
                        construct file names from sample tags; e.g., the
                        default value is '\1-bgcorrect.out', which expands to
                        'sampletag-bgcorrect.out'
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e REGEX, --tag-expr REGEX
                        regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f EXPR, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly
filtering options:
  -M MARKER, --marker MARKER
                        work only on MARKER
sequence format options:
  -F FORMAT, --sequence-format FORMAT
                        convert sequences to the specified format: one of raw,
                        tssv, allelename (default: no conversion)
  -l LIBRARY, --library LIBRARY
                        library file for sequence format conversion

bgestimate

Estimate allele-centric background noise profiles (means) from reference samples.

Compute a profile of recurring background noise for each unique allele in the database of reference samples. The profiles obtained can be used by bgcorrect to filter background noise from samples.

usage: fdstools bgestimate [-h] [-v] [-d] [-o FILE] [-R FILE] [-e REGEX]
                           [-f EXPR] [-a ALLELEFILE] [-c COLNAME] [-m PCT]
                           [-n N] [-s N] [-S PCT] [-g N] [-p FILE] [-M MARKER]
                           [-H] [-l LIBRARY] [-Q N] [-x N]
                           [FILE [FILE ...]]
positional arguments:
  FILE                  the sample data file(s) to process (default: read from
                        stdin)
optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -d, --debug           if specified, additional debug output is given
output file options:
  -o FILE, --output FILE
                        file to write output to (default: write to stdout)
  -R FILE, --report FILE
                        file to write a report to (default: write to stderr)
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e REGEX, --tag-expr REGEX
                        regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f EXPR, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly
allele detection options:
  -a ALLELEFILE, --allelelist ALLELEFILE
                        file containing a list of the true alleles of each
                        sample (e.g., obtained from allelefinder)
  -c COLNAME, --annotation-column COLNAME
                        name of a column in the sample files, which contains a
                        value beginning with 'ALLELE' for the true alleles of
                        the sample
filtering options:
  -m PCT, --min-pct PCT
                        minimum amount of background to consider, as a
                        percentage of the highest allele (default: 0.50)
  -n N, --min-abs N     minimum amount of background to consider, as an
                        absolute number of reads for at least one orientation
                        (default: 5)
  -s N, --min-samples N
                        require this minimum number of samples for each true
                        allele (default: 2)
  -S PCT, --min-sample-pct PCT
                        require this minimum number of samples for each
                        background product, as a percentage of the number of
                        samples with a particular true allele (default: 80.0)
  -g N, --min-genotypes N
                        require this minimum number of unique heterozygous
                        genotypes for each allele for which no homozygous
                        samples are available (default: 3)
  -p FILE, --profiles FILE
                        use the given noise profiles file as a starting point
  -M MARKER, --marker MARKER
                        work only on MARKER
  -H, --homozygotes     if specified, only homozygous samples will be
                        considered
sequence format options:
  -l LIBRARY, --library LIBRARY
                        library file for sequence format conversion
random subsampling options (advanced):
  -Q N, --limit-reads N
                        simulate lower sequencing depth by randomly dropping
                        reads down to this maximum total number of reads for
                        each sample
  -x N, --drop-samples N
                        randomly drop this fraction of input samples
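
The effect of -Q/--limit-reads can be pictured with the following Python sketch, which randomly drops reads of one sample until a maximum total remains. It is only an illustration of the idea under the assumption that every read is equally likely to be dropped; the function limit_reads and its input layout are hypothetical and do not reflect how FDSTools performs the subsampling internally.

```python
import random
from collections import Counter


def limit_reads(read_counts, max_total, rng=random):
    """Randomly drop reads so that at most max_total reads remain."""
    total = sum(read_counts.values())
    if total <= max_total:
        return dict(read_counts)  # nothing to drop
    # Expand to one entry per read and keep a uniform random subset.
    pool = [seq for seq, count in read_counts.items()
            for _ in range(count)]
    kept = Counter(rng.sample(pool, max_total))
    return {seq: kept.get(seq, 0) for seq in read_counts}


counts = {"seqA": 700, "seqB": 300}
subsampled = limit_reads(counts, 100, random.Random(1))
print(sum(subsampled.values()))
```

Passing a seeded random.Random instance makes a run reproducible; the exact counts per sequence still depend on the seed.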

bghomraw

Compute noise ratios for all noise detected in homozygous reference samples.

With this tool, separate data points are produced for each sample, which can be visualised using fdstools vis bgraw. Use bghomstats or bgestimate to compute aggregate statistics on noise instead.

usage: fdstools bghomraw [-h] [-v] [-d] [-o FILE] [-e REGEX] [-f EXPR]
                         [-a ALLELEFILE] [-c COLNAME] [-m PCT] [-n N] [-s N]
                         [-S PCT] [-M MARKER] [-F FORMAT] [-l LIBRARY]
                         [FILE [FILE ...]]
positional arguments:
  FILE                  the sample data file(s) to process (default: read from
                        stdin)
optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -d, --debug           if specified, additional debug output is given
output file options:
  -o FILE, --output FILE
                        file to write output to (default: write to stdout)
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e REGEX, --tag-expr REGEX
                        regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f EXPR, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly
allele detection options:
  -a ALLELEFILE, --allelelist ALLELEFILE
                        file containing a list of the true alleles of each
                        sample (e.g., obtained from allelefinder)
  -c COLNAME, --annotation-column COLNAME
                        name of a column in the sample files, which contains a
                        value beginning with 'ALLELE' for the true alleles of
                        the sample
filtering options:
  -m PCT, --min-pct PCT
                        minimum amount of background to consider, as a
                        percentage of the highest allele (default: 0.50)
  -n N, --min-abs N     minimum amount of background to consider, as an
                        absolute number of reads (default: 5)
  -s N, --min-samples N
                        require this minimum number of samples for each true
                        allele (default: 2)
  -S PCT, --min-sample-pct PCT
                        require this minimum number of samples for each
                        background product, as a percentage of the number of
                        samples with a particular true allele (default: 80.0)
  -M MARKER, --marker MARKER
                        work only on MARKER
sequence format options:
  -F FORMAT, --sequence-format FORMAT
                        convert sequences to the specified format: one of raw,
                        tssv, allelename (default: raw)
  -l LIBRARY, --library LIBRARY
                        library file for sequence format conversion

bghomstats

Compute allele-centric statistics for background noise in homozygous reference samples (min, max, mean, sample variance).

Compute a profile of recurring background noise for each unique allele in the database of reference samples. The profiles obtained can be used by bgcorrect to filter background noise from samples. If many reference samples are heterozygous (as is usually the case with forensic STR markers), it is preferable to use bgestimate instead, since it can handle heterozygous samples as well.
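
The four statistics named above can be reproduced for a list of noise ratios with Python's statistics module. This is a minimal sketch with made-up numbers; the function name noise_stats is hypothetical, and the sample variance (dividing by n - 1) needs at least two observations.

```python
import statistics


def noise_stats(ratios):
    """Min, max, mean and sample variance of one noise product."""
    return {
        "min": min(ratios),
        "max": max(ratios),
        "mean": statistics.mean(ratios),
        # statistics.variance computes the sample variance (n - 1).
        "variance": statistics.variance(ratios),
    }


# Noise ratios (in percent) of one product in three homozygous samples.
print(noise_stats([1.0, 2.0, 3.0]))
```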

usage: fdstools bghomstats [-h] [-v] [-d] [-o FILE] [-e REGEX] [-f EXPR]
                           [-a ALLELEFILE] [-c COLNAME] [-m PCT] [-n N] [-s N]
                           [-S PCT] [-M MARKER] [-F FORMAT] [-l LIBRARY]
                           [-Q N] [-x N]
                           [FILE [FILE ...]]
positional arguments:
  FILE                  the sample data file(s) to process (default: read from
                        stdin)
optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -d, --debug           if specified, additional debug output is given
output file options:
  -o FILE, --output FILE
                        file to write output to (default: write to stdout)
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e REGEX, --tag-expr REGEX
                        regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f EXPR, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly
allele detection options:
  -a ALLELEFILE, --allelelist ALLELEFILE
                        file containing a list of the true alleles of each
                        sample (e.g., obtained from allelefinder)
  -c COLNAME, --annotation-column COLNAME
                        name of a column in the sample files, which contains a
                        value beginning with 'ALLELE' for the true alleles of
                        the sample
filtering options:
  -m PCT, --min-pct PCT
                        minimum amount of background to consider, as a
                        percentage of the highest allele (default: 0.50)
  -n N, --min-abs N     minimum amount of background to consider, as an
                        absolute number of reads (default: 5)
  -s N, --min-samples N
                        require this minimum number of samples for each true
                        allele (default: 2)
  -S PCT, --min-sample-pct PCT
                        require this minimum number of samples for each
                        background product, as a percentage of the number of
                        samples with a particular true allele (default: 80.0)
  -M MARKER, --marker MARKER
                        work only on MARKER
sequence format options:
  -F FORMAT, --sequence-format FORMAT
                        convert sequences to the specified format: one of raw,
                        tssv, allelename (default: no conversion)
  -l LIBRARY, --library LIBRARY
                        library file for sequence format conversion
random subsampling options (advanced):
  -Q N, --limit-reads N
                        simulate lower sequencing depth by randomly dropping
                        reads down to this maximum total number of reads for
                        each sample
  -x N, --drop-samples N
                        randomly drop this fraction of input samples

bgmerge

Merge multiple files containing background noise profiles.

Background noise profiles are merged in the order in which they are specified. If multiple files specify different values for the same allele and sequence, the value of the first file is used.

It is convenient to pipe the output of bgpredict and/or bgestimate into bgmerge to merge that with an existing file containing background profiles. Specify '-' as one of the input files to read from stdin (i.e., read input from a pipe). If only one input file is specified, '-' is implicitly used as the second input file. Note that as a result, in case of conflicting values, the value in the specified input file will take precedence over the value in the data that was piped in.

Example: fdstools bgpredict ... | fdstools bgmerge old.txt > out.txt
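
The precedence rule can be sketched in Python as follows. Profiles are represented here as dictionaries keyed by (allele, sequence) pairs, which is a simplification chosen for this illustration and not the actual file format; the function name merge_profiles and the values are hypothetical.

```python
def merge_profiles(*profile_sets):
    """Merge profiles in order; the first value given for a key wins."""
    merged = {}
    for profiles in profile_sets:
        for key, value in profiles.items():
            # setdefault keeps an existing value and only adds new keys.
            merged.setdefault(key, value)
    return merged


old = {("alleleA", "seqB"): 1.0}
piped = {("alleleA", "seqB"): 2.0, ("alleleA", "seqC"): 0.5}

# Mirrors the example above: the named file comes first, stdin second.
print(merge_profiles(old, piped))
```

The conflicting value from the piped data (2.0) is discarded in favour of the value in the named file (1.0), while the entry that only exists in the piped data is added.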

usage: fdstools bgmerge [-h] [-v] [-d] [-o FILE] [-l LIBRARY] FILE [FILE ...]
positional arguments:
  FILE                  files containing the background noise profiles to
                        combine; if a single file is given, it is merged with
                        input from stdin; use '-' to use stdin as an explicit
                        input source
optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -d, --debug           if specified, additional debug output is given
output file options:
  -o FILE, --output FILE
                        file to write output to (default: write to stdout)
sequence format options:
  -l LIBRARY, --library LIBRARY
                        library file for sequence format conversion

bgpredict

Predict background profiles of new alleles based on a model of stutter occurrence obtained from stuttermodel.

This tool can be used to compute background noise profiles for alleles for which no reference samples are available. The profiles are predicted using a model of stutter occurrence that must have been created previously using stuttermodel. A list of sequences should be given; bgpredict will predict a background noise profile for each of the provided sequences separately. The prediction is based completely on the provided stutter model.

The predicted background noise profiles obtained from bgpredict can be combined with the output of bgestimate and/or bghomstats using bgmerge.

It is possible to use an entire forensic case sample as the SEQS input argument of bgpredict to obtain a predicted background noise profile for each sequence detected in the sample. When the background noise profiles thus obtained are combined with those obtained from bgestimate, bgcorrect may subsequently produce 'cleaner' results if the sample contained alleles for which no reference samples were available.

usage: fdstools bgpredict [-h] [-v] [-d] [-M MARKER] [-A] [-n PCT] [-t N]
                          [-l LIBRARY]
                          STUT SEQS [OUT]
positional arguments:
  STUT                  file containing a trained stutter model
  SEQS                  file containing the sequences for which a profile
                        should be predicted
  OUT                   the file to write the output to (default: write to
                        stdout)
optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -d, --debug           if specified, additional debug output is given
  -M MARKER, --marker MARKER
                        assume the specified marker for all sequences
  -A, --use-all-data    if specified, the 'All data' model is used to predict
                        stutter whenever no marker-specific model is available
                        for a certain repeat unit
filtering options:
  -n PCT, --min-pct PCT
                        minimum amount of background to consider, as a
                        percentage of the highest allele (default: 0.50)
  -t N, --min-r2 N      minimum required r-squared score (default: 0.8)
sequence format options:
  -l LIBRARY, --library LIBRARY
                        library file for sequence format conversion

findnewalleles

Mark all sequences that are not in another list of sequences.

Adds a new column 'new_allele' to the input data. An asterisk (*) will be placed in this column for any sequence that does not occur in the provided list of known sequences.
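
In Python, the marking step could look like the sketch below. It assumes that each row is a dictionary with a 'sequence' column and that sequences are compared as plain strings; the helper mark_new_alleles and the example sequences are hypothetical.

```python
def mark_new_alleles(rows, known_sequences):
    """Return copies of the rows with a 'new_allele' column added."""
    known = set(known_sequences)
    return [
        # An asterisk marks sequences missing from the known list.
        dict(row, new_allele="" if row["sequence"] in known else "*")
        for row in rows
    ]


rows = [{"sequence": "AGAT"}, {"sequence": "AGAC"}]
for row in mark_new_alleles(rows, ["AGAT"]):
    print(row)
```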

usage: fdstools findnewalleles [-h] [-v] [-d] [-i IN [IN ...]]
                               [-o OUT [OUT ...]] [-e REGEX] [-f EXPR]
                               [-M MARKER] [-l LIBRARY]
                               KNOWN [IN] [OUT]
positional arguments:
  KNOWN                 file containing a list of known allelic sequences
optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -d, --debug           if specified, additional debug output is given
input file options:
  IN                    single sample data file to process (default: read from
                        stdin)
  -i IN [IN ...], --input IN [IN ...]
                        multiple sample data files to process (use with
                        -o/--output)
output file options:
  OUT                   the file to write the output to (default: write to
                        stdout)
  -o OUT [OUT ...], --output OUT [OUT ...]
                        list of names of output files to match with input
                        files specified with -i/--input, or a format string to
                        construct file names from sample tags; e.g., the
                        default value is '\1-findnewalleles.out', which
                        expands to 'sampletag-findnewalleles.out'
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e REGEX, --tag-expr REGEX
                        regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f EXPR, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly
filtering options:
  -M MARKER, --marker MARKER
                        work only on MARKER
sequence format options:
  -l LIBRARY, --library LIBRARY
                        library file for sequence format conversion
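The interplay of -e/--tag-expr, -f/--tag-format, and the -o/--output format string can be sketched with Python's re module. This is a rough illustration under stated assumptions: the default regular expression shown is a hypothetical stand-in (the tool's actual default pattern is not given here), and stripping directory names before matching is also an assumption.

```python
import os
import re

# Hypothetical stand-in for the default -e/--tag-expr: capture the file
# name without its extension. The tool's actual default may differ.
DEFAULT_TAG_EXPR = r"^(.*?)(?:\.[^.]+)?$"


def sample_tag(path, tag_expr=DEFAULT_TAG_EXPR, tag_format=r"\1"):
    """Derive a sample tag from a file name."""
    name = os.path.basename(path)  # assumption: directories are ignored
    match = re.search(tag_expr, name)
    if match is None:
        raise ValueError(f"tag expression does not match {name!r}")
    # '\1', '\2', ... in the format refer to the capturing groups.
    return match.expand(tag_format)


def output_name(tag, out_format=r"\1-findnewalleles.out"):
    """Build an output file name from a sample tag."""
    return out_format.replace(r"\1", tag)


tag = sample_tag("data/sample01.csv")
custom = sample_tag("caseA_S12_L001.fastq", r"^([^_]+)_S(\d+)", r"\1-\2")
print(tag, custom, output_name(tag))
```

With the defaults, 'sample01.csv' yields the tag 'sample01' and the output name 'sample01-findnewalleles.out'; the custom expression combines two capturing groups into the tag 'caseA-12'.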

libconvert

Convert between TSSV (tab-separated) and FDSTools (ini-style) library formats.

This is a convenience tool for users migrating from the standalone 'TSSV' programme. Use the 'library' tool if you wish to create a new, empty FDSTools library file to start with.

Both FDSTools and the standalone 'TSSV' programme use a library file to store the names, flanking (primer) sequences, and STR repeat structure of forensic STR markers. However, the TSSV library file format is not suitable for non-STR markers and automatic generation of allele names. FDSTools therefore employs a different (ini-style) library file format that can store more details about the markers used. The libconvert tool can be used to convert between the two formats.

Please refer to the help of the 'library' tool for more information about FDSTools library files.

usage: fdstools libconvert [-h] [-v] [-d] [-a] [IN] [OUT]
positional arguments:
  IN             input library file, the format is automatically detected
                 (default: read from stdin)
  OUT            the file to write the output to (default: write to stdout)
optional arguments:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit
  -d, --debug    if specified, additional debug output is given
  -a, --aliases  when converting to TSSV format, aliases in FDSTools libraries
                 are converted to separate markers in the output library when
                 this option is specified; otherwise, they are merged into
                 their respective markers; when converting to FDSTools format,
                 the [aliases] section is included in the output if this
                 option is specified

library

Create an empty FDSTools library file.

An FDSTools library file contains various details about the forensic markers used in the analysis, such as the flanking (primer) sequences, general STR structure or reference sequence of non-STR markers, genomic location, expected number of alleles, expected length range of alleles, etc. FDSTools uses library files for tasks such as linking raw sequence reads to markers and converting sequences to allele names or vice versa.

In FDSTools, sequences of STR alleles are split up into three parts: a prefix, the STR, and a suffix. The prefix and suffix are optional and are meant to fill the gap between the STR and the primer binding sites. The primer binding sites are called 'flanks' in the library file. For non-STR markers, FDSTools library files simply contain the reference sequence of the region between the flanks.

Allele names typically consist of an allele number compatible with those obtained from Capillary Electrophoresis (CE), followed by the STR sequence in a shortened form and any substitutions or other variants that occur in the prefix and suffix. The first prefix/suffix in the library file is used as the reference sequence for calling variants.

Special alleles, such as the 'X' and 'Y' allele from the Amelogenin gender test, may be given an explicit allele name by specifying an Alias in the FDSTools library file.

Users migrating from the standalone 'TSSV' programme may use the libconvert tool to convert their TSSV library file to FDSTools format.

usage: fdstools library [-h] [-v] [-d] [-t TYPE] [-a] [OUT]
positional arguments:
  OUT                   the file to write the output to (default: write to
                        stdout)
optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -d, --debug           if specified, additional debug output is given
  -t TYPE, --type TYPE  the type of markers that this library file will be
                        used for; 'full' (the default) will create a library
                        file with all possible sections, whereas 'str' or
                        'non-str' will only output the sections applicable to
                        STR and non-STR markers, respectively
  -a, --aliases         if specified, the [aliases] section is included, which
                        can be used to explicitly assign allele names to
                        specific sequences of specific markers

pipeline

Automatically run complete, predefined analysis pipelines. Recommended starting point for new users.

This tool runs one of three default analysis pipelines automatically, given a configuration file with tool options and input/output file names. The three available analysis options are 'reference-sample', analysing a single reference sample with TSSV and Stuttermark; 'reference-database', analysing a collection of reference samples with BGEstimate and Stuttermodel; and 'case-sample', analysing a single case sample with TSSV, BGPredict, BGMerge, BGCorrect, and Samplestats. All results are visualised in interactive graphical reports for presentation and further interpretation.

This tool takes a single mandatory argument: the name of an INI configuration file that contains the analysis settings to use. An easy way to obtain such an INI file with default values for all settings is to run fdstools pipeline your-filename.ini --analysis case-sample. This will create the file 'your-filename.ini' and fill it with default values for the given analysis type (in this example: case-sample analysis).

All settings in the configuration file correspond to options of various tools in FDSTools. Please refer to the tool-specific help for a full description of each tool. Type fdstools -h TOOL to get help with the given TOOL.

usage: fdstools pipeline [-h] [-v] [-d] [-a ANALYSIS] [-e REGEX] [-f EXPR]
                         [-l LIBRARY] [-s FASTA] [-m STUT] [-p PROFILES] [-r]
                         [-S SAMPLE [SAMPLE ...]] [-A ALLELEFILE] [-P PREFIX]
                         INI
positional arguments:
  INI                   pipeline configuration file; if it does not exist, a
                        new file with default settings will be created
optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -d, --debug           if specified, additional debug output is given
  -a ANALYSIS, --analysis ANALYSIS
                        controls which predefined analysis pipeline will be
                        run; 'reference-sample' runs a single sample's
                        FastA/FastQ file through TSSV and Stuttermark to
                        prepare it for the reference-database analysis;
                        'reference-database' runs a collection of reference
                        samples through Allelefinder, BGEstimate, and
                        Stuttermodel to create a reference database of
                        systemic noise; 'case-sample' runs a single sample's
                        FastA/FastQ file through TSSV, BGPredict, BGCorrect,
                        and Samplestats
sample tag parsing options:
  these options are used to extract sample tags (names) from their file
  names; for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e REGEX, --tag-expr REGEX
                        regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f EXPR, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly
input/output file options:
  words in [brackets] indicate applicable analysis types; all of these
  values can also be specified in the [pipeline] section of the INI file

  -l LIBRARY, --in-library LIBRARY
                        library file containing marker definitions
  -s FASTA, --in-sample-raw FASTA
                        [ref-sample, case-sample] FastA or FastQ file
                        containing raw sequence data of the sample
  -m STUT, --in-stuttermodel STUT
                        [case-sample] file containing a trained stutter model
  -p PROFILES, --in-bgprofiles PROFILES
                        [case-sample] file containing noise profiles from
                        BGEstimate
  -r, --store-predictions
                        [case-sample] if this option is specified, output
                        files named 'sampletag-bgpredict.txt' and 'sampletag-
                        bgmerge.txt' will be created if applicable; these
                        files contain predicted stutter amounts for the
                        sequences in the sample based on the given stutter
                        model
  -S SAMPLE [SAMPLE ...], --in-samples SAMPLE [SAMPLE ...]
                        [ref-database] file names of reference sample data
                        files ('.csv' output files of the 'reference-sample'
                        analysis)
  -A ALLELEFILE, --in-allelelist ALLELEFILE
                        [ref-database] file containing a list of the true
                        alleles of each sample; if not given, Allelefinder
                        will be run as part of the pipeline to create this
                        file; it is ESSENTIAL that you check the correctness
                        and completeness of the allele list
  -P PREFIX, --prefix PREFIX
                        [ref-database] if specified, all output file names are
                        prefixed with this value

samplestats

Compute various statistics for each sequence in the given sample data file and perform threshold-based allele calling.

Updates the 'flags' column (or adds it, if it was not present in the input data) to include 'allele' for all sequences that meet various allele calling thresholds.

Adds the following columns to the input data. Some columns may be omitted from the output if the input does not contain the required columns. In the column names below, 'X' is a placeholder for 'forward', 'reverse', or 'total', indicating the strand of DNA for which the statistics are calculated. 'Y' is a placeholder for 'corrected' (statistics calculated on data after noise correction by e.g., BGCorrect), 'noise' (statistics calculated on the number of reads attributed to noise), or 'add' (statistics calculated on the number of reads recovered through noise correction). Wherever the 'Y' part of the column name is omitted, the values in the column are computed on data prior to noise correction.

X_Y: The number of Y reads of this sequence on the X strand (this column is not added by Samplestats, but should be present in the input).
X_Y_mp_sum: The value of X_Y, as a percentage of the sum of the X_Y of the marker.
X_Y_mp_max: The value of X_Y, as a percentage of the maximum X_Y of the marker.
forward_Y_pct: The number of Y reads on the forward strand, as a percentage of the total number of Y reads of this sequence.
X_correction_pct: The difference between the values of X_corrected and X, as a percentage of the value of X.
X_removed_pct: The value of X_noise, as a percentage of the value of X.
X_added_pct: The value of X_add, as a percentage of the value of X.
X_recovery: The value of X_add, as a percentage of the value of X_corrected.
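As a worked illustration of these definitions, the following Python sketch computes a few of the columns for the 'total' strand of a single marker. The read counts are made up, and returning 0 when a denominator is zero is a choice made for this sketch; how Samplestats itself handles that case is not stated here.

```python
def pct(part, whole):
    """Express `part` as a percentage of `whole` (0.0 if `whole` is 0)."""
    return 100.0 * part / whole if whole else 0.0


def add_total_stats(rows):
    """Add a few of the 'total' statistics to the rows of ONE marker."""
    marker_sum = sum(row["total"] for row in rows)
    marker_max = max(row["total"] for row in rows)
    for row in rows:
        row["total_mp_sum"] = pct(row["total"], marker_sum)
        row["total_mp_max"] = pct(row["total"], marker_max)
        row["forward_pct"] = pct(row["forward"], row["total"])
        row["total_correction_pct"] = pct(
            row["total_corrected"] - row["total"], row["total"])
        row["total_recovery"] = pct(row["total_add"], row["total_corrected"])
    return rows


# Made-up read counts for two sequences of the same marker.
rows = add_total_stats([
    {"forward": 300, "total": 600, "total_corrected": 660, "total_add": 60},
    {"forward": 100, "total": 200, "total_corrected": 150, "total_add": 0},
])
for row in rows:
    print(row)
```

The first sequence holds 75% of the marker's reads and gained 10% through correction; the second lost 25% of its reads, which gives it a negative total_correction_pct.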

usage: fdstools samplestats [-h] [-v] [-d] [-i IN [IN ...]] [-o OUT [OUT ...]]
                            [-e REGEX] [-f EXPR] [-n N] [-b N] [-m PCT]
                            [-p PCT] [-c PCT] [-y PCT] [-a ACTION] [-A] [-N N]
                            [-B N] [-M PCT] [-P PCT] [-C PCT] [-Y PCT]
                            [IN] [OUT]
optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -d, --debug           if specified, additional debug output is given
input file options:
  IN                    single sample data file to process (default: read from
                        stdin)
  -i IN [IN ...], --input IN [IN ...]
                        multiple sample data files to process (use with
                        -o/--output)
output file options:
  OUT                   the file to write the output to (default: write to
                        stdout)
  -o OUT [OUT ...], --output OUT [OUT ...]
                        list of names of output files to match with input
                        files specified with -i/--input, or a format string to
                        construct file names from sample tags; e.g., the
                        default value is '\1-samplestats.out', which expands
                        to 'sampletag-samplestats.out'
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e REGEX, --tag-expr REGEX
                        regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f EXPR, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly
interpretation options:
  sequences that match the -c or -y option (or both) and all of the other
  settings are marked as 'allele'

  -n N, --min-reads N   the minimum number of reads (default: 30)
  -b N, --min-per-strand N
                        the minimum number of reads in both orientations
                        (default: 1)
  -m PCT, --min-pct-of-max PCT
                        the minimum percentage of reads w.r.t. the highest
                        allele of the marker (default: 2.0)
  -p PCT, --min-pct-of-sum PCT
                        the minimum percentage of reads w.r.t. the marker's
                        total number of reads (default: 1.5)
  -c PCT, --min-correction PCT
                        the minimum change in read count due to correction by
                        e.g., bgcorrect (default: 0)
  -y PCT, --min-recovery PCT
                        the minimum number of reads that was recovered thanks
                        to noise correction (by e.g., bgcorrect), as a
                        percentage of the total number of reads after
                        correction (default: 0)
filtering options:
  sequences that match the -C or -Y option (or both) and all of the other
  settings are retained, all others are filtered

  -a ACTION, --filter-action ACTION
                        filtering mode: 'off', disable filtering; 'combine',
                        replace filtered sequences by a single line with
                        aggregate values per marker; 'delete', remove filtered
                        sequences without leaving a trace (default: off)
  -A, --filter-absolute
                        if specified, apply filters to absolute read counts
                        (i.e., with the sign removed), which may keep over-
                        corrected sequences that would otherwise be filtered
                        out
  -N N, --min-reads-filt N
                        the minimum number of reads (default: 1)
  -B N, --min-per-strand-filt N
                        the minimum number of reads in both orientations
                        (default: 1)
  -M PCT, --min-pct-of-max-filt PCT
                        the minimum percentage of reads w.r.t. the highest
                        allele of the marker (default: 0.0)
  -P PCT, --min-pct-of-sum-filt PCT
                        the minimum percentage of reads w.r.t. the marker's
                        total number of reads (default: 0.0)
  -C PCT, --min-correction-filt PCT
                        the minimum change in read count due to correction by
                        e.g., bgcorrect (default: 0)
  -Y PCT, --min-recovery-filt PCT
                        the minimum number of reads that was recovered thanks
                        to noise correction (by e.g., bgcorrect), as a
                        percentage of the total number of reads after
                        correction (default: 0)
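One possible reading of the interpretation options is sketched below in Python, using the default thresholds listed above. It is a simplified illustration rather than the tool's code: which read counts (before or after correction) each threshold is applied to is not specified here, so the sketch simply uses the values it is given, and the dictionary keys are hypothetical names.

```python
# Defaults taken from the interpretation options in the help text.
DEFAULTS = {
    "min_reads": 30,          # -n
    "min_per_strand": 1,      # -b
    "min_pct_of_max": 2.0,    # -m
    "min_pct_of_sum": 1.5,    # -p
    "min_correction": 0,      # -c
    "min_recovery": 0,        # -y
}


def is_allele(seq, marker_max, marker_sum, opts=DEFAULTS):
    """Return True if `seq` meets -c or -y (or both) and all other settings."""
    pct_of_max = 100.0 * seq["total"] / marker_max
    pct_of_sum = 100.0 * seq["total"] / marker_sum
    basic = (
        seq["total"] >= opts["min_reads"]
        and min(seq["forward"], seq["reverse"]) >= opts["min_per_strand"]
        and pct_of_max >= opts["min_pct_of_max"]
        and pct_of_sum >= opts["min_pct_of_sum"]
    )
    correction = (
        seq["correction_pct"] >= opts["min_correction"]
        or seq["recovery"] >= opts["min_recovery"]
    )
    return basic and correction


# Made-up sequences of one marker.
seqs = [
    {"forward": 300, "reverse": 300, "total": 600,
     "correction_pct": 0.0, "recovery": 0.0},
    {"forward": 6, "reverse": 6, "total": 12,
     "correction_pct": 0.0, "recovery": 0.0},
    {"forward": 40, "reverse": 0, "total": 40,
     "correction_pct": 0.0, "recovery": 0.0},
]
marker_max = max(s["total"] for s in seqs)
marker_sum = sum(s["total"] for s in seqs)
calls = [is_allele(s, marker_max, marker_sum) for s in seqs]
print(calls)
```

In this example only the first sequence is called: the second has fewer than 30 reads, and the third has no reads on the reverse strand. The filtering options (-N, -B, -M, -P, -C, -Y) could be modelled the same way with their own defaults.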

seqconvert

Convert between raw sequences, TSSV-style sequences, and allele names.

FDSTools was built to be compatible with TSSV, which writes sequences of known STR alleles in a shortened form referred to as 'TSSV-style sequences'. At the same time, FDSTools supports the creation of human-readable allele names which are more suitable for display.

For example, the raw sequence 'AGCGTAAGATAGATAGATAGATAGATAGATACCTACCTACCTCTAGCT' might be rewritten as the TSSV-style sequence 'AGCGTA(1)AGAT(6)ACCT(3)CTAGCT(1)', or as the allele name 'CE9_AGAT[6]ACCT[3]'.
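The relationship between the TSSV-style notation and the raw sequence can be shown with a short Python sketch. It only covers the purely mechanical direction (expanding blocks into raw sequence, and joining known blocks into TSSV-style notation); splitting a raw sequence into blocks requires the marker definitions from a library file and is not attempted here. The function names are hypothetical.

```python
import re

# One block of a TSSV-style sequence: a unit followed by a repeat count.
BLOCK = re.compile(r"([ACGT]+)\((\d+)\)")


def tssv_to_raw(tssv):
    """Expand a TSSV-style sequence into the raw sequence."""
    return "".join(unit * int(count) for unit, count in BLOCK.findall(tssv))


def blocks_to_tssv(blocks):
    """Join (unit, count) pairs into a TSSV-style sequence."""
    return "".join(f"{unit}({count})" for unit, count in blocks)


# The blocks of the example above.
blocks = [("AGCGTA", 1), ("AGAT", 6), ("ACCT", 3), ("CTAGCT", 1)]
tssv = blocks_to_tssv(blocks)
raw = tssv_to_raw(tssv)
print(tssv)
print(raw, len(raw))
```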

Seqconvert can be used to explicitly convert all sequences in a file to the same output format. Conversions are done using a library file; see the help text of the libconvert tool for details.

You can specify multiple input files using the -i/--input option. This is especially useful when generating allele names for many samples that have many sequences in common. To call the variants in the allele names, FDSTools needs to perform sequence alignments, which can be rather slow. When generating allele names for many input files at once, the results of the alignments are cached, which may give a significant speed-up compared to generating allele names for each sample separately.

Seqconvert can also be used with two different library files to rewrite the allele names or TSSV-style sequences after a library update. Currently, the only limitation to this is that the ending position of the left flank and the starting position of the right flank must be the same.

Note that FDSTools makes no assumptions about the sequence format in its input files; instead it automatically performs any required conversions while running any tool. Explicitly running seqconvert is never a necessity; use this tool for your own convenience.

usage: fdstools seqconvert [-h] [-v] [-d] [-i IN [IN ...]] [-o OUT [OUT ...]]
                           [-e REGEX] [-f EXPR] [-m COLNAME] [-a COLNAME]
                           [-c COLNAME] [-M MARKER] [-l LIBRARY] [-L LIBRARY]
                           [-r MARKER [MARKER ...]]
                           FORMAT [IN] [OUT]
positional arguments:
  FORMAT                the format to convert to: one of raw, tssv, allelename
optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -d, --debug           if specified, additional debug output is given
  -m COLNAME, --marker-column COLNAME
                        name of the column that contains the marker name
                        (default: 'marker')
  -a COLNAME, --allele-column COLNAME
                        name of the column that contains the sequence
                        (default: 'sequence')
  -c COLNAME, --output-column COLNAME
                        name of the column to write the output to (default:
                        same as -a/--allele-column)
  -M MARKER, --marker MARKER
                        assume the specified marker for all sequences
  -l LIBRARY, --library LIBRARY
                        library file for sequence format conversion
  -L LIBRARY, --library2 LIBRARY
                        second library file to use for output; if specified,
                        allele names can be conveniently updated to fit this
                        new library file
  -r MARKER [MARKER ...], --reverse-complement MARKER [MARKER ...]
                        to be used together with -L/--library2; specify the
                        markers for which the sequences are reverse-
                        complemented in the new library
input file options:
  IN                    single sample data file to process (default: read from
                        stdin)
  -i IN [IN ...], --input IN [IN ...]
                        multiple sample data files to process (use with
                        -o/--output)
output file options:
  OUT                   the file to write the output to (default: write to
                        stdout)
  -o OUT [OUT ...], --output OUT [OUT ...]
                        list of names of output files to match with input
                        files specified with -i/--input, or a format string to
                        construct file names from sample tags; e.g., the
                        default value is '\1-seqconvert.out', which expands to
                        'sampletag-seqconvert.out'
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e REGEX, --tag-expr REGEX
                        regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f EXPR, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly

stuttermark

Mark potential stutter products by assuming a fixed maximum percentage of stutter product vs the parent sequence.

Stuttermark adds a new column (named 'annotation' by default) to the output. The new column contains 'STUTTER' for possible stutter products, or 'ALLELE' otherwise. Lines that were not evaluated are annotated as 'UNKNOWN'. A sequence is considered a possible stutter product if its total read count is less than or equal to the maximum number of expected stutter reads. The maximum number of stutter reads is computed by assuming a fixed percentage of stutter product compared to the originating sequence.
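The fixed-percentage rule can be illustrated with a few lines of Python. The sketch parses a stutter definition in the format of the -s/--stutter option (its default value is '-1:15,+1:4') and compares a read count against the resulting ceiling. The read counts are made up, and the stacking of multiple stutters and the contributions of several parent sequences are deliberately left out.

```python
def parse_stutter_def(definition="-1:15,+1:4"):
    """Map a repeat change (e.g. -1) to its maximum stutter percentage."""
    limits = {}
    for item in definition.split(","):
        change, percentage = item.split(":")
        limits[int(change)] = float(percentage)
    return limits


def max_stutter_reads(parent_reads, change, limits):
    """Maximum number of stutter reads expected from one parent sequence."""
    return parent_reads * limits[change] / 100.0


def annotate(reads, ceiling):
    """Possible stutter if the read count does not exceed the ceiling."""
    return "STUTTER" if reads <= ceiling else "ALLELE"


limits = parse_stutter_def()
parent_reads = 1000  # made-up read count of the parent sequence
ceiling = max_stutter_reads(parent_reads, -1, limits)

print(ceiling, annotate(120, ceiling), annotate(151, ceiling))
```

With a parent of 1000 reads and 15% allowed for -1 stutter, a sequence one repeat shorter is a possible stutter product as long as it has at most 150 reads.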

Stuttermark requires TSSV-style sequences as input (automatically converting sequences to this format if necessary) and detects possible stutter products by comparing sequences that have the same repeat blocks but different numbers of repeats for one or more of their blocks.

The STUTTER annotation contains additional information. For example: 'STUTTER:146.6x1(2-1):10.4x2(2-1x9-1)'. This is a stutter product for which at most 146.6 reads have come from the first sequence in the output file ('146.6x1') and at most 10.4 reads have come from the second sequence in the output file ('10.4x2'). This sequence differs from the first sequence in the output file by the loss of one repeat in the second repeat block ('2-1'), and it differs from the second sequence by the loss of one repeat in the second block and one repeat in the ninth block ('2-1x9-1'). (If this sequence had more than 157 reads, that is, more than 146.6 + 10.4, it would be annotated as 'ALLELE' instead.)
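For downstream scripting, an annotation of this form can be taken apart with a small parser. The Python sketch below is based solely on the example above; the function name and the dictionary keys are hypothetical, and annotations with a different layout are not handled.

```python
import re

# '146.6x1(2-1)': maximum reads, 'x', source sequence number, changes.
CONTRIBUTION = re.compile(r"([\d.]+)x(\d+)\(([^)]*)\)")
# '2-1': repeat block number followed by the signed change in repeats.
CHANGE = re.compile(r"(\d+)([+-]\d+)")


def parse_stutter_annotation(annotation):
    """Split a STUTTER annotation into its per-source contributions."""
    label, *parts = annotation.split(":")
    if label != "STUTTER":
        return []
    contributions = []
    for part in parts:
        reads, source, changes = CONTRIBUTION.fullmatch(part).groups()
        contributions.append({
            "max_reads": float(reads),
            "source": int(source),
            "changes": [(int(block), int(delta))
                        for block, delta in CHANGE.findall(changes)],
        })
    return contributions


parsed = parse_stutter_annotation("STUTTER:146.6x1(2-1):10.4x2(2-1x9-1)")
# Total number of reads that can be explained as stutter.
threshold = round(sum(item["max_reads"] for item in parsed), 1)
print(parsed)
print(threshold)
```

The sum of the per-source maxima gives the 157 reads mentioned in the example.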

usage: fdstools stuttermark [-h] [-v] [-d] [-i IN [IN ...]] [-o OUT [OUT ...]]
                            [-e REGEX] [-f EXPR] [-s DEF] [-c COLNAME] [-m N]
                            [-n N] [-r N] [-l LIBRARY]
                            [IN] [OUT]
optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -d, --debug           if specified, additional debug output is given
  -s DEF, --stutter DEF
                        Define maximum expected stutter percentages. The
                        default value of '-1:15,+1:4' sets -1 stutter (loss of
                        one repeat) to 15%, +1 stutter (gain of one repeat) to
                        4%. Any unspecified stutter amount is assumed not to
                        occur directly but e.g., a -2 stutter may still be
                        recognised as two -1 stutters stacked together. NOTE:
                        It may be necessary to specify this option as
                        '-s=-1:15,+1:4' (note the equals sign instead of a
                        space).
  -c COLNAME, --column-name COLNAME
                        name of the newly added column (default: 'annotation')
input file options:
  IN                    single sample data file to process (default: read from
                        stdin)
  -i IN [IN ...], --input IN [IN ...]
                        multiple sample data files to process (use with
                        -o/--output)
output file options:
  OUT                   the file to write the output to (default: write to
                        stdout)
  -o OUT [OUT ...], --output OUT [OUT ...]
                        list of names of output files to match with input
                        files specified with -i/--input, or a format string to
                        construct file names from sample tags; e.g., the
                        default value is '\1-stuttermark.out', which expands
                        to 'sampletag-stuttermark.out'
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e REGEX, --tag-expr REGEX
                        regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f EXPR, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly
filtering options:
  -m N, --min-reads N   set minimum number of reads to evaluate (default: 2)
  -n N, --min-repeats N
                        set minimum number of repeats of a block that can
                        possibly stutter (default: 3)
  -r N, --min-report N  a sequence is only annotated as a stutter of some
                        other sequence if the expected number of stutter
                        occurrences of this other sequence is above this value
                        (default: 0.1)
sequence format options:
  -l LIBRARY, --library LIBRARY
                        library file for sequence format conversion

stuttermodel

Train a stutter prediction model using homozygous reference samples.

The model obtained from this tool can be used by bgpredict to predict background noise profiles of alleles for which no reference samples are available.
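The fitting procedure itself is not described on this page; the options of this tool only mention polynomials of a given degree (-D), a minimum r-squared score (-t), and a minimum number of unique repeat lengths (-L). The Python sketch below shows what such a fit could look like, assuming that the model relates the length of a repeat to the observed stutter ratio. That assumption, the data points, and the function names are all illustrative and not taken from FDSTools.

```python
def fit_polynomial(xs, ys, degree=2):
    """Least-squares polynomial fit; coefficients are lowest order first."""
    n = degree + 1
    # Normal equations (V^T V) c = V^T y for the Vandermonde matrix V.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def r_squared(xs, ys, coeffs):
    """Coefficient of determination of the fitted polynomial."""
    mean = sum(ys) / len(ys)
    fitted = [sum(c * x ** i for i, c in enumerate(coeffs)) for x in xs]
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Made-up data: repeat length (nucleotides) versus stutter ratio.
lengths = [8, 12, 16, 20, 24, 28]
ratios = [0.01, 0.02, 0.035, 0.06, 0.09, 0.13]

coeffs = fit_polynomial(lengths, ratios, degree=2)
score = r_squared(lengths, ratios, coeffs)
# Mirror the defaults of -L/--min-lengths (5) and -t/--min-r2 (0.8).
accepted = len(set(lengths)) >= 5 and score >= 0.8
print(coeffs, score, accepted)
```

A fit that falls below the minimum r-squared score, or that is based on too few distinct repeat lengths, would be rejected under this reading of the filtering options.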

usage: fdstools stuttermodel [-h] [-v] [-d] [-o FILE] [-e REGEX] [-f EXPR]
                             [-a ALLELEFILE] [-c COLNAME] [-m PCT] [-n N]
                             [-L N] [-s N] [-M MARKER] [-t N] [-O] [-D N] [-S]
                             [-z] [-u N] [-r RAWFILE] [-l LIBRARY] [-Q N]
                             [-x N]
                             [FILE [FILE ...]]
positional arguments:
  FILE                  the sample data file(s) to process (default: read from
                        stdin)
optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -d, --debug           if specified, additional debug output is given
  -D N, --degree N      degree of polynomials to fit (default: 2)
  -S, --same-shape      if specified, the polynomials of all markers will have
                        equal coefficients, except for a vertical shift
  -z, --ignore-zeros    if specified, samples exhibiting no stutter are
                        ignored
  -u N, --max-unit-length N
                        investigate stutter of repeats of units of up to this
                        number of nucleotides in length (default: 6)
  -r RAWFILE, --raw-outfile RAWFILE
                        write raw data points to this file (specify '-' to
                        write to stdout; normal output on stdout is then
                        suppressed)
output file options:
  -o FILE, --output FILE
                        file to write output to (default: write to stdout)
sample tag parsing options:
  for details about REGEX syntax and capturing groups, check
  https://docs.python.org/howto/regex

  -e REGEX, --tag-expr REGEX
                        regular expression that captures (using one or more
                        capturing groups) the sample tags from the file names;
                        by default, the entire file name except for its
                        extension (if any) is captured
  -f EXPR, --tag-format EXPR
                        format of the sample tags produced; a capturing group
                        reference like '\n' refers to the n-th capturing group
                        in the regular expression specified with -e/--tag-expr
                        (the default of '\1' simply uses the first capturing
                        group); with a single sample, you can enter the sample
                        tag here explicitly
allele detection options:
  -a ALLELEFILE, --allelelist ALLELEFILE
                        file containing a list of the true alleles of each
                        sample (e.g., obtained from allelefinder)
  -c COLNAME, --annotation-column COLNAME
                        name of a column in the sample files, which contains a
                        value beginning with 'ALLELE' for the true alleles of
                        the sample
filtering options:
  -m PCT, --min-pct PCT
                        minimum amount of background to consider, as a
                        percentage of the highest allele (default: 0.00)
  -n N, --min-abs N     minimum amount of background to consider, as an
                        absolute number of reads (default: 1)
  -L N, --min-lengths N
                        require this minimum number of unique repeat lengths
                        (default: 5)
  -s N, --min-samples N
                        require this minimum number of samples for each true
                        allele (default: 1)
  -M MARKER, --marker MARKER
                        work only on MARKER
  -t N, --min-r2 N      minimum required r-squared score (default: 0.8)
  -O, --orphans         if specified, a fit on one strand is reported even if
                        no fit was obtained on the other strand for the same
                        marker, unit, and stutter depth
sequence format options:
  -l LIBRARY, --library LIBRARY
                        library file for sequence format conversion
random subsampling options (advanced):
  -Q N, --limit-reads N
                        simulate lower sequencing depth by randomly dropping
                        reads down to this maximum total number of reads for
                        each sample
  -x N, --drop-samples N
                        randomly drop this fraction of input samples
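
As a rough illustration of how the -e/--tag-expr and -f/--tag-format options described above work together, the following Python sketch derives a sample tag from a file name. It is not taken from the FDSTools source: the function name sample_tag, the default expression shown, and the choice to match against the base name of the file only are assumptions made for this example.

```python
import re
from pathlib import Path

# Assumed default: capture the entire file name except for its extension.
DEFAULT_TAG_EXPR = r"^(.*?)(?:\.[^.]+)?$"
# Default format from the help text: use the first capturing group.
DEFAULT_TAG_FORMAT = r"\1"


def sample_tag(filename, tag_expr=DEFAULT_TAG_EXPR, tag_format=DEFAULT_TAG_FORMAT):
    """Return the sample tag for filename, or None if tag_expr does not match.

    Hypothetical helper: the expression is matched against the base name
    (an assumption), and '\\n' references in tag_format are replaced by
    the n-th capturing group.
    """
    match = re.search(tag_expr, Path(filename).name)
    if match is None:
        return None
    return match.expand(tag_format)


# Default behaviour: strip the extension.
default_tag = sample_tag("data/sample01.csv")

# Two capturing groups, reordered by the tag format.
custom_tag = sample_tag(
    "run12_sampleA_R1.csv",
    tag_expr=r"run(\d+)_(\w+?)_R1",
    tag_format=r"\2-\1",
)

# With a single sample, the tag can be given literally in the format.
explicit_tag = sample_tag("anything.csv", tag_format="MySample")
```

Under these assumptions, default_tag is "sample01", custom_tag is "sampleA-12", and explicit_tag is "MySample".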

tssv

Link raw reads in a FastA or FastQ file to markers and count the number of reads for each unique sequence.

This tool is a wrapper around the 'tssvl' program that adds direct support for FDSTools library files and allele name generation.

usage: fdstools tssv [-h] [-v] [-d] [-F FORMAT] [-R FILE] [-q] [-D DIR]
                     [-T THREADS] [-m MISMATCHES] [-n N] [-a N] [-A]
                     [-M ACTION]
                     LIBRARY [IN] [OUT]
positional arguments:
  LIBRARY               library file with marker definitions
  IN                    the sample data file to process (default: read from
                        stdin)
  OUT                   the file to write the output to (default: write to
                        stdout)
optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -d, --debug           if specified, additional debug output is given
  -D DIR, --dir DIR     output directory for verbose output; when given, a
                        subdirectory will be created for each marker, each
                        with a separate sequences.csv file and a number of
                        FASTA/FASTQ files containing unrecognised reads
                        (unknown.fa), recognised reads (Marker/paired.fa), and
                        reads that lack one of the flanks of a marker
                        (Marker/noend.fa and Marker/nostart.fa)
  -T THREADS, --num-threads THREADS
                        number of worker threads to use (default: 1)
  -X, --no-deduplicate  disable deduplication of reads; by setting this
                        option, memory usage will be reduced at the expense
                        of a longer running time
sequence format options:
  -F FORMAT, --sequence-format FORMAT
                        convert sequences to the specified format: one of raw,
                        tssv, allelename (default: raw)
output file options:
  -R FILE, --report FILE
                        file to write a report to (default: write to stderr)
filtering options:
  -m MISMATCHES, --mismatches MISMATCHES
                        number of mismatches per nucleotide to allow in
                        flanking sequences (default: 0.1)
  -n N, --indel-score N
                        insertions and deletions in the flanking sequences are
                        penalised this number of times more heavily than
                        mismatches (default: 2)
  -a N, --minimum N     report only sequences with this minimum number of
                        reads (default: 1)
  -A, --aggregate-filtered
                        if specified, sequences that have been filtered (as
                        per the -a/--minimum option, the
                        expected_allele_length section in the library file, as
                        well as all sequences with ambiguous bases) will be
                        aggregated per marker and reported as 'Other
                        sequences'
  -M ACTION, --missing-marker-action ACTION
                        action to take when no sequences are linked to a
                        marker: one of include, exclude, halt (default:
                        include)
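
To make the -m/--mismatches and -n/--indel-score options more concrete, the sketch below scores a flanking sequence against a read with a generic weighted edit distance. It is an illustration only and not the algorithm used by tssvl: the function names are hypothetical, and comparing the cost against an unrounded budget of flank length times the -m value is an assumption, since the help text does not say how the budget is rounded.

```python
def best_flank_cost(read, flank, indel_score=2):
    """Lowest cost of aligning the whole flank to any substring of read.

    A mismatch costs 1; an insertion or deletion costs indel_score.
    The flank may start and end anywhere in the read at no cost.
    """
    # Row 0: an empty flank prefix matches at every read position for free.
    prev = [0] * (len(read) + 1)
    for i in range(1, len(flank) + 1):
        # Column 0: flank[:i] against an empty read needs i deletions.
        cur = [prev[0] + indel_score]
        for j in range(1, len(read) + 1):
            substitution = prev[j - 1] + (flank[i - 1] != read[j - 1])
            flank_base_missing_in_read = prev[j] + indel_score
            extra_base_in_read = cur[j - 1] + indel_score
            cur.append(min(substitution,
                           flank_base_missing_in_read,
                           extra_base_in_read))
        prev = cur
    # The flank may end at any read position.
    return min(prev)


def flank_matches(read, flank, mismatches=0.1, indel_score=2):
    """True if the flank is found in the read within the mismatch budget."""
    budget = len(flank) * mismatches  # assumption: budget is not rounded
    return best_flank_cost(read, flank, indel_score) <= budget


flank = "ACGTACGTAC"  # 10 nt, so the default budget is 1 mismatch
read_with_substitution = "TTTTACGTTCGTACGGGG"
read_with_deletion = "TTTTACGTCGTACGGGG"

substitution_ok = flank_matches(read_with_substitution, flank)
deletion_ok = flank_matches(read_with_deletion, flank)
deletion_ok_cheap_indels = flank_matches(read_with_deletion, flank, indel_score=1)
```

In this sketch, a single substitution fits the default budget, whereas a single deletion costs 2 and is rejected unless the indel score is lowered to 1.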

vis

Create a data visualisation web page or Vega graph specification.

With no optional arguments specified, a self-contained web page (HTML file) is produced. You can open this file in a web browser to view interactive visualisations of your data. The web page contains a file selection element which can be used to select the data to be visualised.

Visualisations make use of the Vega JavaScript library (https://vega.github.io). The required JavaScript libraries (Vega and D3) are embedded in the generated HTML file. With the -O/--online option specified, the HTML file will instead link to the latest version of these libraries on the Internet.

Vega supports generating visualisations on the command line. By default, FDSTools produces a full-featured HTML file. Specify the -V/--vega option if you wish to obtain a bare Vega graph specification (a JSON file) instead. You can pass this file through Vega to generate a PNG or SVG image file.

If an input file is specified, the visualisation will be set up specifically to visualise the contents of this file. To this end, the entire file contents are embedded in the generated visualisation.
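
The different output modes described above can be summarised in a small helper that assembles an fdstools vis command line. This is a sketch for use from a Python script: the function name build_vis_command is hypothetical, only options listed in the usage text below are used, and it assumes that fdstools is available on the PATH when the command is eventually run.

```python
# Visualisation types listed in the help text for the TYPE argument.
VIS_TYPES = ("sample", "profile", "bgraw", "stuttermodel", "bganalyse", "allele")


def build_vis_command(vis_type, infile=None, outfile=None,
                      vega=False, online=False, tidy=False, title=None):
    """Return an argument list for 'fdstools vis' (hypothetical helper)."""
    if vis_type not in VIS_TYPES:
        raise ValueError("unknown visualisation type: %r" % (vis_type,))
    if outfile is not None and infile is None:
        # OUT is the second optional positional, so it cannot be given alone.
        raise ValueError("OUT can only be given together with IN")
    cmd = ["fdstools", "vis"]
    if vega:
        cmd.append("-V")  # bare Vega graph specification (JSON)
    if online:
        cmd.append("-O")  # link to the libraries instead of embedding them
    if tidy:
        cmd.append("-t")  # tidily indent the generated JSON
    if title is not None:
        cmd += ["-T", title]
    cmd.append(vis_type)
    if infile is not None:
        cmd.append(infile)
    if outfile is not None:
        cmd.append(outfile)
    return cmd


# Self-contained HTML page with the sample data embedded.
html_cmd = build_vis_command("sample", "sample.csv", "sample.html")

# Bare, indented Vega specification for the same data.
vega_cmd = build_vis_command("sample", "sample.csv", "sample.json",
                             vega=True, tidy=True)
```

The resulting lists can be passed to subprocess.run(cmd, check=True). The file names sample.csv, sample.html, and sample.json are placeholders.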

usage: fdstools vis [-h] [-v] [-d] [-V] [-O] [-t] [-T TITLE] [-n N] [-m PCT]
                    [-S PCT] [-s N] [-B N] [-c] [-M MARKER] [-U UNIT] [-A]
                    [-a] [-I FILE] [-L] [-b N] [-p N] [-w N] [-H N] [-x N]
                    [-j N] [-N N] [-X PCT] [-Q PCT] [-C N] [-Y N] [-Z N]
                    TYPE [IN] [OUT]
positional arguments:
  TYPE                  the type of data to visualise; use 'sample' to
                        visualise sample data files and bgcorrect output; use
                        'profile' to visualise background noise profiles
                        obtained with bgestimate, bghomstats, and bgpredict;
                        use 'bgraw' to visualise raw background noise data
                        obtained with bghomraw; use 'stuttermodel' to
                        visualise models of stutter obtained from
                        stuttermodel; use 'bganalyse' to visualise data
                        obtained from bganalyse; use 'allele' to visualise
                        the allele list obtained from allelefinder
  IN                    file containing the data to embed in the visualisation
                        file; if not specified, HTML visualisation files will
                        contain a file selection control, and Vega
                        visualisation files will load data from a file called
                        'data.csv'
  OUT                   file to write output to (default: write to stdout)
optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -d, --debug           if specified, additional debug output is given
  -V, --vega            by default, a full-featured HTML file offering an
                        interactive visualisation is created; if this option
                        is specified, only a bare Vega graph specification
                        (JSON file) is produced instead
  -O, --online          when generating an HTML visualisation file, required
                        JavaScript libraries (D3 and Vega) are embedded in the
                        file; if this option is specified, the HTML file will
                        instead link to these libraries on the Internet,
                        thereby always using the latest versions of D3 and
                        Vega
  -t, --tidy            tidily indent the generated JSON
  -T TITLE, --title TITLE
                        prepend the given value to the title of HTML
                        visualisations (default: prepend name of data file if
                        given)
visualisation options:
  words in [brackets] indicate applicable visualisation types

  -n N, --min-abs N     [sample, profile, bgraw] only show sequences with this
                        minimum number of reads (default: 5)
  -m PCT, --min-pct-of-max PCT
                        [sample, profile, bgraw] for sample: only show
                        sequences with at least this percentage of the number
                        of reads of the highest allele of a marker; for
                        profile and bgraw: at least this percentage of the
                        true allele (default: 0.5)
  -S PCT, --min-pct-of-sum PCT
                        [sample] only show sequences with at least this
                        percentage of the total number of reads of a marker
                        (default: 0.0)
  -s N, --min-per-strand N
                        [sample] only show sequences with this minimum number
                        of reads for both orientations (forward/reverse)
                        (default: 0)
  -B N, --bias-threshold N
                        [sample] mark sequences that have less than this
                        percentage of reads on one strand (default: 25.0)
  -c, --no-ce-length-sort
                        [sample] if specified, do not sort STR alleles by
                        length
  -M MARKER, --marker MARKER
                        [sample, profile, bgraw, stuttermodel, bganalyse] only
                        show graphs for the markers that contain the given
                        value in their name; separate multiple values with
                        spaces; prepend any value with '=' for an exact match
                        (default: show all markers)
  -U UNIT, --repeat-unit UNIT
                        [stuttermodel] only show graphs for the repeat units
                        that contain the given value; separate multiple values
                        with spaces; prepend any value with '=' for an exact
                        match (default: show all repeat units)
  -A, --no-alldata      [stuttermodel] if specified, show only marker-specific
                        fits
  -a, --no-aggregate    [sample] if specified, do not replace filtered
                        sequences with a per-marker aggregate 'Other
                        sequences' entry
  -I FILE, --input2 FILE
                        [profile, stuttermodel] raw data points file to
                        overlay on the background noise profiles or stutter
                        model graphs; if not specified, HTML visualisation
                        files will contain a file selection control
display options:
  -L, --log-scale       [sample, profile, bgraw, bganalyse] use logarithmic
                        scale (for sample and bganalyse: square root scale)
                        instead of linear scale
  -b N, --bar-width N   [sample, profile, bgraw, bganalyse] width of the bars
                        in pixels (default: 15)
  -p N, --padding N     [sample, profile, bgraw, stuttermodel] amount of
                        padding (in pixels) between graphs of different
                        markers/alleles (default: 70)
  -w N, --width N       [sample, profile, bgraw, stuttermodel, bganalyse,
                        allele] width of the graph area in pixels (default:
                        600)
  -H N, --height N      [stuttermodel, allele] height of the graph area in
                        pixels (default: 400)
  -x N, --max-seq-len N
                        [sample] truncate long sequences to this number of
                        characters (default: 70)
  -j N, --jitter N      [stuttermodel] apply this amount of jitter to raw data
                        points (between 0 and 1, default: 0.25)
allele calling options:
  for sample visualisations only; sequences that match the -C or -Y option
  (or both) and all of the other settings are marked as 'allele'

  -N N, --allele-min-abs N
                        the minimum number of reads (default: 30)
  -X PCT, --allele-min-pct-of-max PCT
                        the minimum percentage of reads w.r.t. the highest
                        allele of the marker (default: 2.0)
  -Q PCT, --allele-min-pct-of-sum PCT
                        the minimum percentage of reads w.r.t. the marker's
                        total number of reads (default: 1.5)
  -C N, --allele-min-correction N
                        the minimum change in read count due to correction by
                        e.g., bgcorrect (default: 0)
  -Y N, --allele-min-recovery N
                        the minimum number of reads that were recovered thanks
                        to noise correction (by e.g., bgcorrect), as a
                        percentage of the total number of reads after
                        correction (default: 0)
  -Z N, --allele-min-per-strand N
                        the minimum number of reads in both orientations
                        (default: 1)
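
The way the allele calling options combine can be sketched in Python as follows. This is an interpretation of the help text above, not the code used by the visualisation: the function name is_allele and the dictionary keys (forward, reverse, correction, recovery) are hypothetical, and the read counts of the highest allele and of the whole marker are assumed to be positive.

```python
def is_allele(seq, max_reads, marker_total,
              min_abs=30, min_pct_of_max=2.0, min_pct_of_sum=1.5,
              min_correction=0, min_recovery=0, min_per_strand=1):
    """Decide whether a sequence would be marked as 'allele'.

    seq is a dict with hypothetical keys: 'forward' and 'reverse' read
    counts, 'correction' (change in read count due to correction) and
    'recovery' (recovered reads as a percentage of the corrected total).
    """
    total = seq["forward"] + seq["reverse"]
    # -N, -X, -Q and -Z must all be satisfied.
    passes_general = (
        total >= min_abs
        and 100.0 * total / max_reads >= min_pct_of_max
        and 100.0 * total / marker_total >= min_pct_of_sum
        and min(seq["forward"], seq["reverse"]) >= min_per_strand
    )
    # At least one of -C and -Y must be satisfied.
    passes_correction = (
        seq["correction"] >= min_correction
        or seq["recovery"] >= min_recovery
    )
    return passes_general and passes_correction


balanced = {"forward": 40, "reverse": 35, "correction": 0, "recovery": 0}
one_strand_only = {"forward": 50, "reverse": 0, "correction": 0, "recovery": 0}

balanced_called = is_allele(balanced, max_reads=1000, marker_total=2000)
one_strand_called = is_allele(one_strand_only, max_reads=1000, marker_total=2000)
```

With the default thresholds, the balanced sequence is called as an allele, while the sequence seen on one strand only fails the -Z/--allele-min-per-strand requirement.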