Environment variables

Many aspects of the analysis can be controlled through environment variables, which are read in the config.py module. Most of these are optional, but some scripts require certain variables to be set (this is documented in the main README.md).
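As a rough illustration of how a module like config.py might read these variables, the sketch below parses a handful of them and falls back to the defaults documented in the table below. The names `load_settings`, `_flag` and `_csv` are hypothetical and not taken from the project; only the variable names and default values come from this section.

```python
import os


def _flag(env, name, default="0"):
    """Treat the documented `NAME=1` convention as a boolean switch."""
    return env.get(name, default).strip() == "1"


def _csv(env, name, default):
    """Split a comma-separated value into a list of non-empty items."""
    raw = env.get(name, default)
    return [item.strip() for item in raw.split(",") if item.strip()]


def load_settings(env=None):
    """Hypothetical loader: read a few variables with their documented defaults."""
    env = os.environ if env is None else env
    return {
        "OUTPUT_DIR": env.get("OUTPUT_DIR", "./output/"),
        "MIN_NT": int(env.get("MIN_NT", "300")),
        "MIN_IDENTITY": float(env.get("MIN_IDENTITY", "0.935")),
        "GBIF_ACCEPTED_STATUS": _csv(env, "GBIF_ACCEPTED_STATUS", "accepted,doubtful"),
        "REPORT_DEBUG": _flag(env, "REPORT_DEBUG"),
        # No default is documented for these, so they stay None when unset.
        "NCBI_API_KEY": env.get("NCBI_API_KEY"),
        "USER_EMAIL": env.get("USER_EMAIL"),
    }
```

Passing a plain dictionary instead of the real environment, for example `load_settings({"REPORT_DEBUG": "1"})`, makes it easy to check how an override would be interpreted without exporting anything in the shell.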

Variable	Default	Description
Credentials
`USER_EMAIL`		Used to authenticate with the NCBI Entrez API if `NCBI_API_KEY` is not provided. Also used to rate limit requests from different users on the same system.
`NCBI_API_KEY`		Used to authenticate with NCBI Entrez API for an increased rate limit on requests.
Reference data
`TAXONKIT_DATA`	`$HOME/.taxonkit`	The directory where NCBI's taxdump files can be found.
`ALLOWED_LOCI_FILE`	`./loci.json`	A JSON file which describes permitted loci (barcoding regions) and their synonyms. The default can be overridden with a file in the same format. The `ambiguous_synonyms` listed in this file are locus synonyms which may be ambiguous in a GenBank query.
`BOLD_DATABASE`	`COX1_SPECIES_PUBLIC`
User inputs
`INPUT_FASTA_FILEPATH`		The query FASTA file containing the user's sample DNA sequences (required throughout the modules).
`INPUT_METADATA_CSV_FILEPATH`		The metadata CSV file containing the user's sample metadata (required throughout the modules).
Working directories
`OUTPUT_DIR`	`./output/`	The directory where output data for the Nextflow run is written.
`QUERY_DIR`		The directory containing output data for the query currently being analysed.
Parameters
`BLAST_MAX_TARGET_SEQS`	`2000`	The maximum number of hits collected for each query sequence in the BLAST search. Not used for analysis but rendered in the report.
`DB_COVERAGE_TOI_LIMIT`	`10`	The maximum number of TOIs that will be analysed by P5 (database coverage).
`GBIF_LIMIT_RECORDS`	`500`	The maximum number of records per request to the GBIF API. If more records than this are needed, they will be fetched in batches.
`GBIF_MAX_OCCURRENCE_RECORDS`	`5000`	The maximum number of GBIF records that will be fetched for plotting the occurrence distribution map.
`GBIF_ACCEPTED_STATUS`	`accepted,doubtful`	Only GBIF records with these statuses will be retained when fetching related species (comma-separated).
`PHYLOGENY_MIN_HIT_IDENTITY`	`0.95`	Minimum identity for a hit to be included in the phylogenetic tree.
`PHYLOGENY_MIN_HIT_SEQUENCES`	`10`	The minimum number of sequences to be included in the phylogenetic tree (non-candidate sequences will be included until this limit is reached).
`PHYLOGENY_MAX_HITS_PER_SPECIES`	`1000`	Maximum number of hits to be included for each species. This is useful where candidate species have 30+ sequences, which is more than is required for building the tree. Reducing this parameter to 10 would increase tree clarity while also dramatically reducing run time. When the limit applies, a stratified sample of sequences (based on identity) is taken to represent the range of diversity present in each species' sequences.
`BLAST_DATABASE_NAME`	`NCBI Core Nt`	The BLAST database name. This will be shown in the report.
`FACILITY_NAME`		This will be shown in the report.
`ANALYST_NAME`		This will be shown in the report.
`REPORT_DEBUG`	`0`	If `REPORT_DEBUG=1` this replaces the timestamp in the report file name with `DEBUG` so that it can be easily reloaded in the browser after re-rendering.
`SKIP_ORIENTATION`	`0`	If `SKIP_ORIENTATION=1` then BOLD runs will skip orientation of query sequences and submit both forward and reverse sequences to the ID Engine API (for developers, this removes the need for local installation of Hmmsearch).
`LOGGING_DEBUG`	`0`	If `LOGGING_DEBUG=1` then verbose log statements will be emitted to help diagnose issues.
`GENBANK_CONCURRENCY_TEST`	`0`	For unit tests only. If `GENBANK_CONCURRENCY_TEST=1` this will enable the GenBank concurrency test, which sends a lot of API requests and takes a while to complete.
`KEEP_OUTPUTS`	`0`	For integration tests only. If `KEEP_OUTPUTS=1` this will retain the temp working directory for inspection of output files after test completion.
`SKIP_PASSED_TESTS`	`0`	For integration tests only. If `SKIP_PASSED_TESTS=1` this will re-run only the tests which have not yet passed.
`RUN_TEST_CASE`		For integration tests only. Specify a single test case to run. Typically used when debugging a specific test case.
Analysis/filtering criteria
`MIN_NT`	`300`	Minimum alignment length for a BLAST hit to be considered for candidate screening (nucleotides).
`MIN_Q_COVERAGE`	`0.85`	Minimum query coverage for a BLAST hit to be considered for candidate screening (decimal proportion).
`MIN_IDENTITY`	`0.935`	Minimum identity for a BLAST hit to be considered for candidate screening (decimal proportion).
`MIN_IDENTITY_STRICT`	`0.985`	Minimum hit identity to be considered a STRONG candidate (decimal proportion).
`MEDIAN_IDENTITY_WARNING_FACTOR`	`0.95`	Minimum proportion of the candidate identity threshold that a median identity must attain to receive WARNING level instead of DANGER level (decimal proportion). For example, with the default, a median identity that is >95% of the candidate identity threshold is marked as WARNING instead of DANGER.
`MAX_CANDIDATES_FOR_ANALYSIS`	`3`	The maximum number of candidate species that will proceed to further analysis (P4/5). When this threshold is reached, a boxplot showing identity distributions is shown in the report "Candidates" section.
`MIN_SOURCE_COUNT`	`5`	Minimum number of independent publications required for a candidate species to receive Flag 4A.
`DB_COV_MIN_A`	`5`	Minimum number of GenBank records to receive Flag 5.1A.
`DB_COV_MIN_B`	`1`	Minimum number of GenBank records to receive Flag 5.1B.
`DB_COV_RELATED_MIN_A`	`90`	Minimum percent species coverage of GenBank records to receive Flag 5.2A.
`DB_COV_RELATED_MIN_B`	`10`	Minimum percent species coverage of GenBank records to receive Flag 5.2B.
`DB_COV_COUNTRY_MISSING_A`	`1`	Minimum number of species WITHOUT GenBank records to receive Flag 5.3B.
Output file names
`TIMESTAMP_FILENAME`	`timestamp.txt`
`ACCESSIONS_FILENAME`	`accessions.txt`
`TAXONOMY_FILENAME`	`taxonomy.csv`
`QUERY_TITLE_FILENAME`	`query_title.txt`
`HITS_JSON_FILENAME`	`hits.json`
`HITS_FASTA_FILENAME`	`hits.fasta`
`TAXONOMY_ID_CSV_FILENAME`	`assigned_taxonomy.csv`
`CANDIDATES_FASTA_FILENAME`	`candidates.fasta`
`CANDIDATES_CSV_FILENAME`	`candidates.csv`
`CANDIDATES_JSON_FILENAME`	`candidates.json`
`CANDIDATES_COUNT_FILENAME`	`candidates_count.txt`
`CANDIDATES_SOURCES_JSON_FILENAME`	`candidates_sources.json`
`INDEPENDENT_SOURCES_JSON_FILENAME`	`aggregated_sources.json`
`TOI_DETECTED_CSV_FILENAME`	`taxa_of_concern_detected.csv`
`PMI_MATCH_CSV_FILENAME`	`preliminary_id_match.csv`
`BOXPLOT_IMG_FILENAME`	`boxplot.png`
`TREE_NWK_FILENAME`	`candidates.nwk`
`DB_COVERAGE_JSON_FILENAME`	`db_coverage.json`
`BOLD_TAXON_COUNT_JSON`	`bold_taxon_counts.json`
`BOLD_TAXON_COLLECTORS_JSON`	`bold_taxon_collectors.json`
`BOLD_TAXONOMY_JSON`	`bold_taxonomy.json`
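To make the identity thresholds in the "Analysis/filtering criteria" group more concrete, here is a minimal sketch of how they could be applied. It is not the workflow's actual implementation: the function names are hypothetical, the comparisons are assumed to be inclusive for the "minimum" thresholds, the "candidate identity threshold" is assumed to be `MIN_IDENTITY`, and the additional `MIN_NT` and `MIN_Q_COVERAGE` screening is deliberately left out.

```python
def classify_identity(identity, min_identity=0.935, min_identity_strict=0.985):
    """Label a hit by identity alone (assumed inclusive thresholds)."""
    if identity >= min_identity_strict:
        return "strong"
    if identity >= min_identity:
        return "candidate"
    return None  # below MIN_IDENTITY: not a candidate


def median_identity_level(median_identity, threshold=0.935, warning_factor=0.95):
    """Choose WARNING or DANGER for a median identity.

    Assumes the candidate identity threshold is MIN_IDENTITY. Whether a
    level is raised at all is decided elsewhere and is outside this sketch.
    """
    cutoff = threshold * warning_factor  # about 0.888 with the defaults
    return "WARNING" if median_identity > cutoff else "DANGER"
```

With the default values, the WARNING cutoff works out to roughly 0.888, so lowering `MIN_IDENTITY` also lowers the point at which a median identity drops from WARNING to DANGER.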