Environment variables

Many aspects of the analysis can be controlled through environment variables, which are read in the config.py module. Most of these are optional, but some scripts require certain variables to be set (this is documented in the main README.md).
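As a rough illustration of this pattern (not the actual contents of config.py), a module could read optional variables with defaults and fail early on required ones. The helper names `env_int`, `env_flag`, and `require` are hypothetical, and the optional `env` argument exists only to make the sketch easy to try out. The variable names and defaults are taken from this page.

```python
import os


def env_int(name: str, default: int, env=os.environ) -> int:
    """Read an integer variable, falling back to the default when unset or empty."""
    raw = env.get(name, "").strip()
    return int(raw) if raw else default


def env_flag(name: str, env=os.environ) -> bool:
    """Treat NAME=1 as enabled; anything else (including unset) is disabled."""
    return env.get(name, "0").strip() == "1"


def require(name: str, env=os.environ) -> str:
    """Return a required variable, or raise an error that names it."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} must be set")
    return value


if __name__ == "__main__":
    # Optional variables fall back to their documented defaults.
    print(env_int("BLAST_MAX_TARGET_SEQS", 2000))
    print(env_flag("LOGGING_DEBUG"))
```

A script that needs the query FASTA file could then call `require("INPUT_FASTA_FILEPATH")` at start-up, so that a missing variable is reported before any analysis begins.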

| Variable | Default | Description |
|---|---|---|
| **Credentials** | | |
| USER_EMAIL | | Used to authenticate with the NCBI Entrez API if NCBI_API_KEY is not provided. Also used to rate-limit requests from different users on the same system. |
| NCBI_API_KEY | | Used to authenticate with the NCBI Entrez API for an increased rate limit on requests. |
| **Reference data** | | |
| TAXONKIT_DATA | $HOME/.taxonkit | The directory where NCBI's taxdump files can be found. |
| ALLOWED_LOCI_FILE | ./loci.json | A JSON file which describes permitted loci (barcoding regions) and their synonyms. The default can be overridden with a file in the same format. The ambiguous_synonyms listed in this file are locus synonyms which may be ambiguous in a GenBank query. |
| BOLD_DATABASE | COX1_SPECIES_PUBLIC | |
| **User inputs** | | |
| INPUT_FASTA_FILEPATH | | The query FASTA file containing the user's sample DNA sequences (required throughout the modules). |
| INPUT_METADATA_CSV_FILEPATH | | The metadata CSV file containing the user's sample metadata (required throughout the modules). |
| **Working directories** | | |
| OUTPUT_DIR | ./output/ | Where output data should be written for the Nextflow run. |
| QUERY_DIR | | The directory containing output data for the query currently being analysed. |
| **Parameters** | | |
| BLAST_MAX_TARGET_SEQS | 2000 | The maximum number of hits collected for each query sequence in the BLAST search. Not used for analysis but rendered in the report. |
| DB_COVERAGE_TOI_LIMIT | 10 | The maximum number of TOIs that will be analysed by P5 (database coverage). |
| GBIF_LIMIT_RECORDS | 500 | The maximum number of records per request to the GBIF API. More records than this will be fetched in batches. |
| GBIF_MAX_OCCURRENCE_RECORDS | 5000 | The maximum number of GBIF records that will be fetched for plotting the occurrence distribution map. |
| GBIF_ACCEPTED_STATUS | accepted,doubtful | Only GBIF records with these statuses will be retained when fetching related species (comma-separated). |
| PHYLOGENY_MIN_HIT_IDENTITY | 0.95 | Minimum hit identity for inclusion in the phylogenetic tree. |
| PHYLOGENY_MIN_HIT_SEQUENCES | 10 | The minimum number of sequences to be included (non-candidate sequences will be included until this limit is reached). |
| PHYLOGENY_MAX_HITS_PER_SPECIES | 1000 | Maximum number of hits to be included for each species. This is useful where candidate species have 30+ sequences, which is more than is required for building the tree. Reducing this parameter to 10 would increase tree clarity while also dramatically reducing run time. In this case, a stratified sample of sequences (based on identity) is taken to represent the range of diversity present in each species' sequences. |
| BLAST_DATABASE_NAME | NCBI Core Nt | Shown in the report. |
| FACILITY_NAME | | Shown in the report. |
| ANALYST_NAME | | Shown in the report. |
| REPORT_DEBUG | 0 | If REPORT_DEBUG=1, the timestamp in the report file name is replaced with DEBUG so that the report can be easily reloaded in the browser after re-rendering. |
| SKIP_ORIENTATION | 0 | If SKIP_ORIENTATION=1, BOLD runs will skip orientation of query sequences and submit both forward and reverse sequences to the ID Engine API (for developers, this removes the need for a local installation of Hmmsearch). |
| LOGGING_DEBUG | 0 | If LOGGING_DEBUG=1, verbose log statements will be emitted to help diagnose issues. |
| GENBANK_CONCURRENCY_TEST | 0 | For unit tests only. If GENBANK_CONCURRENCY_TEST=1, the GenBank concurrency test is enabled. This test sends a lot of API requests and takes a while to complete. |
| KEEP_OUTPUTS | 0 | For integration tests only. If KEEP_OUTPUTS=1, the temp working directory is retained for inspection of output files after test completion. |
| SKIP_PASSED_TESTS | 0 | For integration tests only. If SKIP_PASSED_TESTS=1, only the tests which have not yet passed are re-run. |
| RUN_TEST_CASE | | For integration tests only. Specify a single test case to run. Typically used when debugging a specific test case. |
| **Analysis/filtering criteria** | | |
| MIN_NT | 300 | Minimum alignment length for a BLAST hit to be considered for candidate screening (nucleotides). |
| MIN_Q_COVERAGE | 0.85 | Minimum query coverage for a BLAST hit to be considered for candidate screening (decimal proportion). |
| MIN_IDENTITY | 0.935 | Minimum identity for a BLAST hit to be considered for candidate screening (decimal proportion). |
| MIN_IDENTITY_STRICT | 0.985 | Minimum hit identity to be considered a STRONG candidate (decimal proportion). |
| MEDIAN_IDENTITY_WARNING_FACTOR | 0.95 | Minimum proportion of the candidate identity threshold for a median identity to receive WARNING level instead of DANGER level (decimal proportion). For example, if the median identity is >95% of the candidate identity threshold, it will be marked as WARNING instead of DANGER. |
| MAX_CANDIDATES_FOR_ANALYSIS | 3 | The maximum number of candidate species that will proceed to further analysis (P4/5). When this threshold is reached, a boxplot showing identity distributions is shown in the report "Candidates" section. |
| MIN_SOURCE_COUNT | 5 | Minimum number of independent publications required for a candidate species to receive Flag 4A. |
| DB_COV_MIN_A | 5 | Minimum number of GenBank records to receive Flag 5.1A. |
| DB_COV_MIN_B | 1 | Minimum number of GenBank records to receive Flag 5.1B. |
| DB_COV_RELATED_MIN_A | 90 | Minimum percent species coverage of GenBank records to receive Flag 5.2A. |
| DB_COV_RELATED_MIN_B | 10 | Minimum percent species coverage of GenBank records to receive Flag 5.2B. |
| DB_COV_COUNTRY_MISSING_A | 1 | Minimum number of species WITHOUT GenBank records to receive Flag 5.3B. |
| **Output file names** | | |
| TIMESTAMP_FILENAME | timestamp.txt | |
| ACCESSIONS_FILENAME | accessions.txt | |
| TAXONOMY_FILENAME | taxonomy.csv | |
| QUERY_TITLE_FILENAME | query_title.txt | |
| HITS_JSON_FILENAME | hits.json | |
| HITS_FASTA_FILENAME | hits.fasta | |
| TAXONOMY_ID_CSV_FILENAME | assigned_taxonomy.csv | |
| CANDIDATES_FASTA_FILENAME | candidates.fasta | |
| CANDIDATES_CSV_FILENAME | candidates.csv | |
| CANDIDATES_JSON_FILENAME | candidates.json | |
| CANDIDATES_COUNT_FILENAME | candidates_count.txt | |
| CANDIDATES_SOURCES_JSON_FILENAME | candidates_sources.json | |
| INDEPENDENT_SOURCES_JSON_FILENAME | aggregated_sources.json | |
| TOI_DETECTED_CSV_FILENAME | taxa_of_concern_detected.csv | |
| PMI_MATCH_CSV_FILENAME | preliminary_id_match.csv | |
| BOXPLOT_IMG_FILENAME | boxplot.png | |
| TREE_NWK_FILENAME | candidates.nwk | |
| DB_COVERAGE_JSON_FILENAME | db_coverage.json | |
| BOLD_TAXON_COUNT_JSON | bold_taxon_counts.json | |
| BOLD_TAXON_COLLECTORS_JSON | bold_taxon_collectors.json | |
| BOLD_TAXONOMY_JSON | bold_taxonomy.json | |
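
To make the identity thresholds under "Analysis/filtering criteria" more concrete, the sketch below shows one way they might be applied, using the documented defaults. It is an illustration under assumptions and not the workflow's actual code: the function names and the "strong"/"candidate" labels are hypothetical, the comparisons are assumed to be inclusive, and the "candidate identity threshold" used for the median check is assumed to be MIN_IDENTITY. The alignment length and query coverage filters (MIN_NT, MIN_Q_COVERAGE) are left out because this page does not say how they are combined.

```python
# Documented defaults; in practice these would come from the environment.
MIN_IDENTITY = 0.935
MIN_IDENTITY_STRICT = 0.985
MEDIAN_IDENTITY_WARNING_FACTOR = 0.95


def classify_hit(identity: float):
    """Label a BLAST hit by identity (labels are illustrative only)."""
    if identity >= MIN_IDENTITY_STRICT:
        return "strong"
    if identity >= MIN_IDENTITY:
        return "candidate"
    return None  # below the screening threshold


def median_identity_level(
    median_identity: float,
    threshold: float = MIN_IDENTITY,  # assumed reference threshold
    factor: float = MEDIAN_IDENTITY_WARNING_FACTOR,
) -> str:
    """Return WARNING when the median exceeds factor * threshold, else DANGER."""
    return "WARNING" if median_identity > factor * threshold else "DANGER"


if __name__ == "__main__":
    for identity in (0.99, 0.95, 0.90):
        print(identity, classify_hit(identity))
    # With the defaults, the WARNING cut-off is 0.95 * 0.935 = 0.88825.
    print(median_identity_level(0.90))
    print(median_identity_level(0.85))
```

With the defaults, a median identity of 0.90 sits above the 0.88825 cut-off and would be marked WARNING, whereas 0.85 would be marked DANGER.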