Understanding the analysis
DAFF Taxonomic assignment workflow

While the workflow report aims to be self-documenting, there are many analytical details which the user may wish to understand in further detail. The purpose of this document is to provide this enhanced understanding of what exactly happens during the analysis, and how outcomes are derived from the resulting data.

To set up and run the workflow, visit the Nextflow workflow repository: qcif/nf-daff-biosecurity-wf2.
For analysis code (used in the above workflow), see the Python modules repository: qcif/daff-biosecurity-wf2
For reference, an example workflow report is available here

Interpretting the workflow report
Reference data
1. BLAST
2. BOLD
Input files
1. FASTA file
2. Metadata CSV file
BLAST - parsing the XML output
1. Extracted values
2. Calculated values
BLAST - Extracting taxonomic metadata
BOLD - submitting sequences to ID Engine
Assigning taxonomic identity
1. Checking preliminary morphology ID
2. Checking taxa of interest
Phylogenetic analysis
Assessment of supporting publications
Assessment of database coverage

Interpretting the workflow report

The workflow report aims to deliver clear outcomes with supporting evidence. The taxonomic identity (or lack thereof) is the first thing to be reported. The remainder of the report serves to corroborate supporting evidence for that claim.

It is crucial that the reader pays attention to supporting evidence when evaluating the report. Even when a positive identity is reported, there are other aspects of the analysis that can subtract from the credibility of such an outcome. Most notably, it is possible that a positive species identification could be based on a single reference sequence with an incorrect taxonomic annotation. In this case the result would obviously be erroneous. The potential for this error would be raised in the section "Publications supporting taxonomic association", which would flag that there is only one independent publication source supporting the outcome.

Results overview

Check the taxonomic outcome (the first item under "Result overview")
If positive, the Preliminary ID will be either confirmed or rejected
1. If rejected, you may wish to check the database coverage (square button) to ensure that the taxon is represented in the reference database
If Taxa of Interest (TOIs) were provided, check whether they were detected.
1. If not detected, you may wish to check the database coverage to ensure that the taxon is represented in the reference database. A brief overview of TOI coverage is provided beneath the TOI item; this shows the highest observed warning level from each of the provided TOIs. For the full database coverage reports, go to the "Taxa of interest" report section.

Candidate species

First take a look at the hit classification table (top-right). This shows how many hits were returned, and how they have been classified for the purpose of the analysis. You will typically see a few candidate species and many that are classified as NO MATCH. The first row of STRONG MATCH and MODERATE MATCH with 1+ species determines the identity threshold for identifying candidates.
Check the "Candidate species" table. For each species, you should take note of:
1. Number of hits - >5 hits gives more confidence in the taxonomic annotation of reference sequences
2. Top identity - with multiple candidates, a stark drop in identity between species provides higher confidence
3. Median identity - if >0.5% lower than the top identity, this indicates high intraspecific diversity or possibly misidentified reference sequences.
4. Database coverage - if the identity is not 100% and there are species in the genus with no database representation, it's possible that the true identity of the sample could be one of the unrepresented species.
If there are multiple candidates, the report will prompt the analyst to make a genus-level assignment for the sample. To aid in this process the analyst should refer to:
1. The identities (max and median) of each candidate
2. The phylogenetic tree

Taxa of interest

See Checking taxa of interest

Reference data

BLAST

The reference data used by the workflow depends entirely on the deployment - ask your platform administrator if you are unsure. For the BLAST version of the workflow, the reference data will be a BLAST database of sequence records that is held on the analysis server - by default this is NCBI Core Nt, but it is possible to run with a different reference database. The workflow report specifies the database name in the database coverage report (admins can set this manually with BLAST_DATABASE_NAME).

BOLD

By default this is set to COX1_SPECIES_PUBLIC (admins can override this by setting BOLD_DATABASE).

Input files

FASTA file

A FASTA file containing sample sequences to be analysed. Multiple sequences per sample can be used, but the FASTA header for each sequence must be unique and match an entry in the metadata.csv input. The following constraints apply to this input:

Seq IDs must be unique
Seq IDs must match metadata.csv input
Maximum query sequences: 150
Minimum seq length: 20nt
Max length of any sequence: 3000nt
All residues must be valid nucleotide (ambiguous IUPAC DNA: ATGCRYSWKMBDHVN)

Metadata CSV file

This file provides metadata for each query sequence, with the following fields:

Field	Required	Description
sample_id	yes	Must match the header of one FASTA sequence
locus	yes	Must be in the list of allowed Loci or `NA` for virus or BOLD queries
preliminary_id	yes	A suggested taxonomic identity based on sample morphology
taxa_of_interest	no	A pipe-delimited list of taxa to be evaluated against the sample. Can be at rank species, genus, family, order, class, phylum, kingdom or domain.
country	no	The sample country of origin
host	no	The host or commodity that the sample was extracted from

From v1.1 any additional fields in this file will be displayed in the workflow report

BLAST - parsing the XML output

BLAST search is performed using a local (meaning run on the same machine as the workflow) BLASTN from the NCBI BLAST+ toolkit; the version is specified in the workflow report. This command-line BLASTN process produces a series of alignments for each query sequence, with each alignment relating to a BLAST "hit" against a sequence in the reference database.

This step forks a single BLAST result into a series of query directories. From here each query's results are analysed in parallel.

Extracted values

The following values are extracted verbatim from BLAST XML fields:

Hit identifier (GenBank ID)
Hit definition (GenBank record title)
Hit NCBI accession
Hit subject length (nt)
High-scoring pairs (HSPs; each represents a segment of the alignment):
- bitscore
- evalue
- identity
- query_start
- query_end
- subject_start
- subject_end
- alignment_length

Calculated values

These values are not present in the BLAST XML and are calculated from the extracted values:

Value	Description	Equation
Alignment length	The total non-overlapping length of all HSPs.
Hit bitscore	A score which takes into account both alignment strength and length. Calculated as the sum of bitscores across all HSPs.	\( \sum_{HSP \in \text{HSPs}} \text{bits}(HSP) \)
Hit E-value	An expression of probability that the alignment occurred due to random chance, often expressed as an exponent to distinguish between very low numbers. If there is only one HSP, the `hsp.evalue` will be used. Otherwise, a formula is used.	\( \text{ess} \cdot 2^{-\sum_{HSP \in \text{hit.HSPs}} \text{bits}(HSP)} \) Where `ess` is the effective search space specified in the BLAST XML output.
Hit identity	The proportion of nucleotides which match between query and subject in the alignment. This is calculated as the weighted identity of HSPs (high-scoring pairs), clipped to a maximum of 1.	\( \frac{\sum_{HSP \in \text{HSPs}} \text{identities}(HSP)}{\sum_{HSP \in \text{HSPs}} \text{alignment length}(HSP)} \)
Query coverage	The proportion of the query sequence that is covered by the alignment with the reference sequence.	\( \frac{\text{alignment length}}{\text{query length}} \)

BLAST - Extracting taxonomic metadata

BLAST results do not include structured taxonomic information. This data is extracted for each BLAST hit subject using taxonkit, a command-line tool which can retrieve taxonomic records from NCBI's taxdump archive. Taxids for each hit are extracted from the local BLAST database using blastdbcmd, another tool in the BLAST+ suite. This results in the following fields being collected for each hit:

Taxid
Domain
Superkingdom
Kingdom
Phylum
Class
Order
Family
Genus
Species

BOLD - submitting sequences to ID Engine

When the workflow is run in --bold mode, the search method changes to use the BOLD ID Engine through the BOLD API (https://v4.boldsystems.org/resources/api). Since BOLD requires query DNA sequences that are correctly orientated (i.e. not antisense), we attempt to orientate the query sequences before submission. Query sequences are then submitted to the ID Engine API on-by-one. BOLD then returns a set of match statistics similar to BLAST for each query.

This step forks each BOLD into a series of query directories. From here each query's results are analysed in parallel.

Sequence orientation

Each DNA sequence is translated all three translation frames in both the forward and reverse directions. This results in six translated amino acid sequences for each query in frames 1, 2, 3, -1, -2, -3.

To orientate each query sequence, we then use the hmmsearch tool (part of the HMMER suite) locally to determine whether any the translation frames contain any of the following HMM profiles:

pf00115.hmm - Cytochrome C and Quinol oxidase polypeptide I
pf00116.hmm - Cytochrome C oxidase subunit II, periplasmic domain
pf02790.hmm - Cytochrome C oxidase subunit II, transmembrane domain

A match is accepted when the E-value is below 1e-05. The first frame which is predicted to encode one of these domains dictates the orientation that will then be submitted to BOLD. For query sequences with no matches, both the forward and reverse orientations are submitted to BOLD and the one which returns hit(s) is assumed to be in the correct orientation (the other orientation's result is discarded).

Submitting to ID Engine

Orientated query sequences are submitted to the ID Engine API sequentially, and the requests run in parallel to increase throughput. The following data are parsed directly from the API response:

Query title
Query length
Query frame
Query sequence
Hits:
- Hit identifier (BOLD ID)
- Hit sequence description
- Hit taxonomic identification (species)
- Hit similarity (used in place of identity)
- Hit URL (a link to the record on https://boldsystems.org)
- Hit nucleotide sequence
- Hit collectors (BOLD database submitter(s))

Requesting additional metadata

For each hit subject, additional metadata are then requested from the "Full data retrieval" BOLD API endpoint:

Accession (GenBank accession)
Phylum
Class
Order
Family
Genus
Species

The above fields are then used to fetch a kingdom classification (not included in BOLD response data) from the GBIF API. The phylum given above is used to fetch a the associated phylum record from the GBIF species search API, and the kingdom field is extracted from the response data.

Assigning taxonomic identity

This is a critical stage in the analysis. Hits returned from BLAST/BOLD are filtered and a list of candidate species is extracted from those hits. Filtering is applied as follows. For BOLD search, the process is identical, with similarity being used in place of identity.

All hits which are below 300nt AND 85.0% query coverage are excluded from the entire analysis. These are referred to as "filtered hits".
The identity threshold for candidate hits is either:
- 98.5% (if any filtered hits meet this threshold) - defined as a STRONG MATCH
- OR 93.5% - defined as a MODERATE MATCH
Resulting candidate hits are then aggregated by species and the top hit per-species is identified. These species are what you see reported in the "Candidates" section of the workflow report.
The "No. hits" shown for each candidate includes all filtered hits, not just the candidate hits.
The "Median identity" shown in the "Candidate species" table is derived from the identity (%) of all filtered hits. If there is a wide distribution of hit identities, the median will be reduced. If the median drops below the candidate threshold, the badge will turn from green to yellow. If it drops to less than 95.0% of the threshold (i.e. <93.6% for a threshold of 98.5%), the badge will turn red.

The candidate species identified above are then cross-checked against the Preliminary Morphology ID and Taxa of Interest (if provided).

Checking preliminary morphology ID

The PMI provided by the user is checked against each taxonomic rank from each candidate species. If the provided name matches any of the fields, this is regarded as a match. The user should be aware that in some edge-cases this can result in a mis-match due to the existence of taxonomic homonyms (Morus, for example, is both a plant and bird genus).

Checking taxa of interest

This process is identical to that described above, except that a little more information is collected for display in the "Taxa of interest" section of the workflow report. For each taxon of interest, the best-scoring species that matches the taxonomy is reported. It is possible for a TOI to match multiple candidates, but only the top candidate will be shown.

Boxplot of identity distribution

When the analysis identifies more than candidate species, the analyst is prompted to make a subjective genus-level taxonomic assignment. To assist in this, a boxplot which shows the distribution of hit identities per-species is included in the report.

Phylogenetic analysis

Subject sequences are selected from filtered hits extracted previously. The selection process is a little complex, as it aims to strike a balance between reasonable coverage of the genetic diversity present in BLAST/BOLD hit subjects, while also trying to minimize the number of sequences that need to go through alignment and analysis. Building trees with 100+ sequences is SLOW and the resulting tree is often ugly, so we do our best to avoid that.

Hits are collected in order of descending identity until both:
- The identity of the hit drops to 95.0%
- AND at least 10 hits have been collected

This means that candidate hits are always sampled, and some filtered hits are usually included too.

Next, if there are more than 1000 hits for the species, a representative sample of hits for each species is taken:

Filtered hits for the species are ordered by identity
A systematic sample of 1000 hits is taken, including the first and last hit

The aim is that a species with 200 hits of alignment identities between 90-95% would result in a sample of hits with identities similar to the following (assuming a bimodal distribution representing two taxonomic clades, and sample size of n=6):

90.0%
90.3%
90.5%
94.8%
94.8%
95.0%

The workflow has a default value of PHYLOGENY_MAX_HITS_PER_SPECIES=1000 so that the sampling described above is not implemented unless the workflow administrator decides to limit this number to constrain tree size. Setting this sample size too low would result in poor quality trees that may give a false impression of genetic diversity to the user.

A FASTA sequence is then written by extracting the nucleotide sequence from each of the selected hits, and adding the query sequence. This FASTA file is then alignment with MAFFT, and then a tree is computed from the alignment with FastME.

Assessment of supporting publications

An important drawback of searching against the large Non-redundant (Nr) BLAST database is that this database contains many sequences which are not very reputable. Since anyone can submit sequences to GenBank there are many sequences with incorrect taxonomic annotation. This analysis presents a measure of confidence in the integrity of the reference sequences supporting the conclusions. Are the candidate reference sequences supported by numerous publications? Great, that means that the taxonomic annotation has been corroborated by multiple studies. Do we have 5 reference sequences that were all submitted to GenBank by the same author(s)? That casts some doubt over the integrity of the taxonomic annotation.

Even when the workflow report is "green" in all other sections of the report, caution is advised if there is only one independent publication corroborating the reference sequences. Further investigation may be required to confirm the credibility of the reference sequence source.

The analysis involves clusting of GenBank publication records based on the provided metadata (author, title, journal). An independent analysis is carried out per-candidate species:

For each candidate hit, a list of publications is extracted from the corresponding GenBank record using the Entrez efetch API.
Each publication is annotated with:
1. The NCBI accession number
2. A source tag, which is a token derived (by stripping non-alphanumeric characters and converting to lowercase) from one of the following fields (the field with a value):
  1. Author list
  2. Publication title ("Direct submission" titles are ignored here as they are computed-generated and not a real publication)
  3. Journal
3. Whether the record appears to have been automatically generated - determined by the presence of the string ##Genome-Annotation-Data-START## in the GenBank record text.
Publications are then clustered into groups of "independent publications":
1. If a publication source tag matches one in an existing group, it is added to that group
2. Otherwise, it is added to a new group
3. Records with no publications are allocated to a single group
The groups are shown as "independent sources" in the "Publications supporting reference sequences" section of the workflow report.

Assessment of database coverage

An analysis of taxonomic coverage of the reference database is carried out for each of the following taxa:

Candidate species (only when n ≤ 3)
Preliminary ID taxon (provided in metadata.csv)
Taxa of interest (provided in metadata.csv; only when n ≤ 10)

Each of these is referred to as a "target taxon" or, more concisely, "target". For each target, three analyses may be performed:

5.1: number of GenBank records for target taxon at the given locus (if provided)
5.2: number of species in target genus which have 1+ GenBank records at the given locus (if provided)
5.3: as for 5.2, but with species limited to those which occur in the country of origin (according to GBIF occurrence records)

Analyses 5.2 and 5.3 are only run when the target is of rank genus or species. This is because there would be far too much data to analyse at higher ranks.

To obtain "species in genus" in analyses 5.2 and 5.3, we use the GBIF API:

A GBIF record is retrieved from the /species/suggest API using the target taxon as the search query. If the target matches our list of canonical taxa, the taxonomic rank specified in that list is set as an API request parameter. This prevents the rather annoying issue of the query "Bacteria" matching the genus "Bacteria" of the Diapheromeridae (a family of Arthropoda).
The genus key from the target record is then used to fetch all matching records at rank "species" from the /species/match API endpoint.
Returned speces records are filtered to exclude extinct species, and include only those where status is one of ['ACCEPTED', 'DOUBTFUL'].

Enumerating GenBank records

For each species identified, the Entrez API is used to query GenBank records that match that species at the given locus.The query is dynamically generated to include all synonyms for the locus specified in the workflow's loci.json file. the taxid is extracted from NCBI taxonomies using taxonkit.

For example, the following locus and taxon "Homo sapiens" (taxid 9606):

# loci.json entry for "COI"
{
    "ambiguous_synonyms": [
        "coi",
        "co1",
        "cox",
        "cox1"
    ],
    "non_ambiguous_synonyms": [
        "cytochrome oxidase subunit 1"
    ]
}

...would result in this GenBank query being rendered:

txid9606[Organism] AND (coi[Title]) OR (coi[GENE]) OR (co1[Title]) OR (co1[GENE]) OR (cox[Title]) OR (cox[GENE]) OR (cox1[Title]) OR (cox1[GENE]) OR (cytochrome oxidase subunit 1)

The response returned from Entrez includes a count field - this is the data that we use to enumerate records for each species.

Occurrence maps

In addition to the analyses above, a geographic distribution map is also generated based on occurrence records fetched from the GBIF occurrence API. The number of occurrence records fetched is limited to 5000, because some taxa (e.g. "arthropoda") have far too many occurrence records to fetch in a reasonable length of time.

Understanding the analysis DAFF Taxonomic assignment workflow

Table of contents