Hash-based phonemic identifiers

Accepts FASTA or plain sequence input. Normalises case, whitespace and gaps. Uses the sars-cov-2-s.yaml scheme.

Konstel normalises and hashes a given string or biological sequence, then encodes the hash digest as a human-friendly phonemic word. This allows privacy-preserving confirmation of input equality, and may be thought of as a nomenclature equivalent of URL shortening. In the context of a viral pandemic, this approach can alleviate some of the challenges associated with restricted genomic data access: I can find out whether my sequence is the same as yours without you having to share your sequence with me. Custom schemes can be defined in YAML format.
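
The normalise, hash and encode steps described above can be illustrated with a minimal Python sketch. This is not Konstel's actual implementation: the function names, the consonant and vowel alphabets, and the four-syllable output length are all hypothetical choices made for illustration. Only the general idea (normalise, take a SHA-256 digest, render it as a pronounceable word) comes from the description.

```python
import hashlib

# Hypothetical alphabets, chosen for illustration only (not Konstel's).
CONSONANTS = "bdfghjklmnprstvz"  # 16 consonants
VOWELS = "aiou"                  # 4 vowels


def normalise(text: str) -> str:
    """Uppercase the input and drop all whitespace."""
    return "".join(text.split()).upper()


def phonemic_id(text: str, syllables: int = 4) -> str:
    """Hash the normalised input and render part of the digest as
    alternating consonant-vowel syllables."""
    digest = hashlib.sha256(normalise(text).encode("utf-8")).digest()
    number = int.from_bytes(digest, "big")
    word = []
    for _ in range(syllables):
        # Peel off one consonant index and one vowel index per syllable.
        number, c = divmod(number, len(CONSONANTS))
        number, v = divmod(number, len(VOWELS))
        word.append(CONSONANTS[c] + VOWELS[v])
    return "".join(word)


if __name__ == "__main__":
    mine = phonemic_id("acgt acgt\n")
    yours = phonemic_id("ACGTACGT")
    # Equal inputs (after normalisation) give equal identifiers,
    # so only the short word needs to be exchanged.
    print(mine, yours, mine == yours)
```

Two parties can compare the resulting words instead of the underlying sequences. A short word uses only a fraction of the digest, so a match suggests, rather than proves, that the inputs are identical.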

A scheme is provided for identifying SARS-CoV-2 spike protein 'constellations' from both spike protein sequences and whole genome nucleotide sequences, from which the spike gene is automatically extracted and translated. Choose between two identifiers derived from the same SHA-256 hash: one that is shorter and another that is easier to pronounce. Protein sequences must be unambiguous and free of gaps; stop codons and whitespace are stripped. Nucleotide sequences must be free of gaps in the spike gene. Please refer to my blog post and GitHub for further details. Feedback and contributions are welcome through GitHub or Twitter.
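
The protein input rules above (whitespace and stop codons stripped, gaps and ambiguous residues rejected) might be sketched as follows. This is an illustration under stated assumptions, not the scheme's real code: the function name is hypothetical, stop codons are assumed to be written as "*", input is assumed to be uppercased, and "unambiguous" is assumed to mean the 20 standard amino acid letters.

```python
# Assumption: "unambiguous" means the 20 standard amino acid letters.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def normalise_protein(sequence: str) -> str:
    """Strip whitespace and stop codons, then reject anything that is
    not a standard residue (gaps such as '-', ambiguity codes such as 'X')."""
    # Assumption: stop codons are written as '*'.
    cleaned = "".join(sequence.split()).upper().replace("*", "")
    invalid = sorted(set(cleaned) - AMINO_ACIDS)
    if invalid:
        raise ValueError(f"Unsupported characters in protein sequence: {invalid}")
    return cleaned


if __name__ == "__main__":
    print(normalise_protein("mfv fl\nVLL*"))
    try:
        normalise_protein("MFV-FLVLL")
    except ValueError as error:
        print(error)
```

Rejecting gapped or ambiguous input, rather than silently repairing it, keeps the identifier a function of one well-defined sequence.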

2021-05-09: Support for arbitrary strings and biological sequences
2021-05-07: Larger Google Cloud instance class for improved performance

Uses data from the GISAID Initiative (last updated 2021-05-05)