Preprocessing
Input Format
SILO ingests data in NDJSON format (Newline-Delimited JSON). One JSON object per line describes a single sequence record. There is no separate TSV/FASTA input mode.
.zst and .xz compressed NDJSON files are detected and decompressed transparently.
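As a rough illustration of the format, the following Python sketch serializes a few records as NDJSON and compresses the result as `.xz` using only the standard library. The record contents and the helper name `to_ndjson` are placeholders for illustration, not part of SILO.

```python
import json
import lzma


def to_ndjson(records):
    """Serialize records as NDJSON: one compact JSON object per line."""
    return "".join(json.dumps(record, separators=(",", ":")) + "\n" for record in records)


# Hypothetical records; the field names are placeholders.
records = [
    {"primaryKey": "seq_001", "country": "Switzerland"},
    {"primaryKey": "seq_002", "country": "Germany"},
]

ndjson_text = to_ndjson(records)

# lzma.compress produces the .xz container format by default.
compressed = lzma.compress(ndjson_text.encode("utf-8"))

# Round trip: decompress and parse one JSON object per line.
decoded = [
    json.loads(line)
    for line in lzma.decompress(compressed).decode("utf-8").splitlines()
]
```

Writing `compressed` to a file with an `.xz` suffix would give an input file of the kind described above. Producing `.zst` files from Python needs a third-party ZSTD binding, which this sketch leaves out.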
Preprocessing Configuration
The preprocessing configuration is a YAML file that controls where SILO reads its input and writes its
output. All keys are optional unless noted otherwise. Filenames are resolved relative to inputDirectory.
| Key | Type | Default | Default in Docker image | Description |
|---|---|---|---|---|
| `inputDirectory` | path | `./` | `/preprocessing/input/` | Directory containing the input files. |
| `outputDirectory` | path | `./output/` | `/preprocessing/output/` | Directory where SILO writes the preprocessed database state. |
| `ndjsonInputFilename` | path | (none; required) | | NDJSON file with the input records, relative to `inputDirectory`. SILO refuses to start preprocessing if this is unset. |
| `databaseConfig` | path | `database_config.yaml` | | The database configuration file, relative to `inputDirectory`. |
| `referenceGenomeFilename` | path | `reference_genomes.json` | | The reference genomes file, relative to `inputDirectory`. |
| `lineageDefinitionFilenames` | list | (absent) | | A list of lineage-definition file names (see Lineage Definition Files), relative to `inputDirectory`. |
| `phyloTreeFilename` | path | (absent) | | A phylogenetic-tree file (see Phylogenetic Tree File), relative to `inputDirectory`. |
| `withoutUnalignedSequences` | boolean | `false` | | If `true`, SILO omits the unaligned-sequence column for each aligned nucleotide sequence. |
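To make the resolution rule concrete, here is a minimal Python sketch of a pre-flight check that applies the non-Docker defaults listed above and resolves each filename against `inputDirectory`. The function `resolve_input_files` is hypothetical and only mirrors the rules stated in the table; it is not SILO's own loader.

```python
from pathlib import Path

# Defaults copied from the "Default" column above (not the Docker image column).
DEFAULTS = {
    "inputDirectory": "./",
    "databaseConfig": "database_config.yaml",
    "referenceGenomeFilename": "reference_genomes.json",
}


def resolve_input_files(config):
    """Resolve the input files named in a preprocessing config dict.

    Hypothetical helper: raises ValueError if the required key is missing.
    """
    if not config.get("ndjsonInputFilename"):
        raise ValueError("ndjsonInputFilename is required")

    merged = {**DEFAULTS, **config}
    input_dir = Path(merged["inputDirectory"])

    resolved = {
        key: input_dir / merged[key]
        for key in ("ndjsonInputFilename", "databaseConfig", "referenceGenomeFilename")
    }
    # Optional keys are only resolved when present.
    if "phyloTreeFilename" in merged:
        resolved["phyloTreeFilename"] = input_dir / merged["phyloTreeFilename"]
    resolved["lineageDefinitionFilenames"] = [
        input_dir / name for name in merged.get("lineageDefinitionFilenames", [])
    ]
    return resolved


paths = resolve_input_files(
    {"inputDirectory": "/data/in", "ndjsonInputFilename": "input.ndjson.zst"}
)
```

The config is passed as an already-parsed dict so that the sketch does not depend on a particular YAML library.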
NDJSON Record Schema
Each line in the NDJSON file is a flat JSON object. The top-level keys must include:
- One entry for every metadata field declared in the database_config.yaml, using the same name and the type indicated in the schema.
- One entry for every nucleotide segment and amino acid gene declared in the reference genomes file. The value is a sequence object, or null if the sequence is missing.
Additionally, raw (unaligned) nucleotide sequences may be provided under keys prefixed with unaligned_.
Unknown top-level keys are ignored with a warning. Missing required fields cause an error.
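The key rules above can be sketched as a small Python check that reports missing and unknown top-level keys for a single record. The function `check_record` and its arguments are hypothetical; it only compares key names and does not verify value types or reproduce SILO's actual validation.

```python
def check_record(record, metadata_fields, nucleotide_segments, genes):
    """Return (missing, unknown) top-level keys for one parsed NDJSON record."""
    required = set(metadata_fields) | set(nucleotide_segments) | set(genes)
    # Raw nucleotide sequences may appear under keys prefixed with "unaligned_".
    optional = {f"unaligned_{name}" for name in nucleotide_segments}

    keys = set(record)
    missing = sorted(required - keys)
    unknown = sorted(keys - required - optional)
    return missing, unknown


# Hypothetical record: the gene "E" is absent and "note" is not declared anywhere.
record = {
    "primaryKey": "seq_001",
    "date": "2021-03-18",
    "country": "Switzerland",
    "age": 54,
    "main": {"sequence": "ACGTACGT", "insertions": []},
    "unaligned_main": "ACGTACGT",
    "note": "extra",
}

missing, unknown = check_record(
    record,
    metadata_fields=["primaryKey", "date", "country", "age"],
    nucleotide_segments=["main"],
    genes=["E"],
)
```

In this example `missing` contains `"E"` (an error case) and `unknown` contains `"note"` (a warning case).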
Sequence Object
A sequence object has the following structure:

```json
{
  "sequence": "ACGTACGT",
  "insertions": ["214:ACGT"],
  "offset": 0
}
```

| Key | Type | Description |
|---|---|---|
| `sequence` | string | The aligned sequence as a string of valid symbols. |
| `sequenceCompressed` | string | Alternative to `sequence`: a base64-encoded, ZSTD-compressed sequence. The ZSTD dictionary must be the column’s reference sequence. Takes precedence over `sequence` if both are present. |
| `insertions` | array of strings | Insertions in the form `<position>:<symbols>`. The position is the index of the symbol after which the insertion is placed; position 0 inserts before the first symbol. |
| `offset` | integer | Optional offset into the reference (default: 0). |
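The `<position>:<symbols>` convention can be illustrated with a short Python sketch that parses insertion entries and splices them into a sequence string. Both helpers are hypothetical. The sketch ignores `offset` and assumes positions index directly into the string it is given, which may differ from how SILO stores insertions internally.

```python
def parse_insertion(entry):
    """Split '<position>:<symbols>' into (position, symbols)."""
    position, _, symbols = entry.partition(":")
    if not (position.isascii() and position.isdigit()) or not symbols:
        raise ValueError(f"malformed insertion: {entry!r}")
    return int(position), symbols


def apply_insertions(sequence, insertions):
    """Insert each symbol run after the symbol at its 1-based position.

    Position 0 places the symbols before the first symbol.
    """
    # Work from the highest position down so earlier splices do not shift later ones.
    parsed = sorted((parse_insertion(entry) for entry in insertions), reverse=True)
    result = sequence
    for position, symbols in parsed:
        result = result[:position] + symbols + result[position:]
    return result


print(apply_insertions("ACGTACGT", ["4:CC"]))  # ACGTCCACGT
```

Here `CC` lands after the fourth symbol, between the two `ACGT` halves.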
Example Record
Given a database config with metadata fields primaryKey, date, country, age, and a reference
genome with one nucleotide segment main and one gene E, a valid NDJSON line looks like:

```json
{ "primaryKey": "seq_001", "date": "2021-03-18", "country": "Switzerland", "age": 54, "main": { "sequence": "ACGTACGT", "insertions": ["4:CC"] }, "E": { "sequence": "MYSF*", "insertions": [] } }
```

Lineage Definition Files
A lineage-indexed metadata field (generateLineageIndex in the database config) requires a YAML file
describing the lineage hierarchy. Multiple lineage systems can be declared via the
lineageDefinitionFilenames list in the preprocessing config.
Each top-level key in the YAML is a lineage label. Per label you can specify:
- parents: a list of parent lineage labels (omit, set to null, or use [] to mark a root).
- aliases: a list of alternative names for the lineage.
Minimal example:

```yaml
A:
  aliases:
    - Root
B:
  parents:
    - A
C:
  parents:
    - A
E:
  parents: [B, C]
  aliases:
    - LeafE
```

SILO verifies that the lineage labels are unique and that the relationships form a directed acyclic graph
(no cycles). It makes no further assumptions about the lineage system. See
documentation/lineage_definitions.md
in the SILO repository for the authoritative spec.
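The acyclicity requirement can be sketched in Python with a Kahn-style topological sort over an already-parsed lineage definition. The function `lineage_order` is hypothetical and is not SILO's implementation. It assumes that parents are referenced by their top-level label rather than by an alias, and it treats an undefined parent as an error, which the text above does not state.

```python
def lineage_order(lineages):
    """Return lineage labels with parents before children; raise ValueError on a cycle."""
    # A label may map to None, or have parents omitted, null, or [] to mark a root.
    remaining = {
        label: set((spec or {}).get("parents") or [])
        for label, spec in lineages.items()
    }
    for label, parents in remaining.items():
        for parent in parents:
            if parent not in remaining:
                # Assumption of this sketch: every parent must be a defined label.
                raise ValueError(f"{label}: unknown parent {parent}")

    order = []
    while remaining:
        # Labels whose parents have all been emitted are ready.
        ready = sorted(label for label, parents in remaining.items() if not parents)
        if not ready:
            raise ValueError("cycle among: " + ", ".join(sorted(remaining)))
        for label in ready:
            order.append(label)
            del remaining[label]
        for parents in remaining.values():
            parents.difference_update(ready)
    return order


# The minimal example from above, as a parsed mapping.
example = {
    "A": {"aliases": ["Root"]},
    "B": {"parents": ["A"]},
    "C": {"parents": ["A"]},
    "E": {"parents": ["B", "C"], "aliases": ["LeafE"]},
}
print(lineage_order(example))  # ['A', 'B', 'C', 'E']
```

Label uniqueness is not checked here, because a parsed mapping cannot hold duplicate keys; a check for duplicates would have to happen while reading the YAML file.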
Phylogenetic Tree File
A phylogenetic-tree-indexed metadata field (isPhyloTreeField in the database config) requires a tree
file referenced by the phyloTreeFilename preprocessing-config key.
SILO accepts two formats:
All nodes — internal and leaves — must be uniquely labelled. See
documentation/phylogenetic_queries.md
in the SILO repository for the authoritative spec.
Incremental Preprocessing
In addition to building a database from scratch, SILO supports appending new records to an existing
database state via the silo append command. See
documentation/incremental_preprocessing.md
in the SILO repository for details.