How to: Compute a seqcol digest given a sequence collection

Use case

One of the most common uses of the seqcol specification is to compute a standard, universal digest for a particular sequence collection. You have a collection of sequences, like a reference genome or transcriptome, and you want to determine its seqcol digest. There are two ways to approach this: 1. Using an existing implementation; 2. Implement the seqcol digest algorithm yourself (it's not that hard).

1. Using existing implementations

The refget Python package provides an implementation for local computation of digests: https://refgenie.org/refget/.

2. Implement the seqcol digest algorithm yourself

Follow the procedure under the section for Encoding. Briefly, the steps are:

Step 1. Organize the sequence collection data into canonical seqcol object representation.
Step 2. Apply RFC-8785 JSON Canonicalization Scheme (JCS) to canonicalize the value associated with each attribute individually.
Step 3. Digest each canonicalized attribute value using the GA4GH digest algorithm.
Step 4. Apply RFC-8785 JSON Canonicalization Scheme again to canonicalize the JSON of new seqcol object representation.
Step 5. Digest the final canonical representation again.

Details on each step can be found in the specification.

Example Python code for computing a seqcol encoding

# Demo for encoding a sequence collection

import base64
import hashlib
import json

def canonical_str(item: dict) -> bytes:
    """Convert a dict into a canonical string representation"""
    return json.dumps(
        item, separators=(",", ":"), ensure_ascii=False, allow_nan=False, sort_keys=True
    ).encode("utf8")

def sha512t24u_digest(seq: bytes) -> str:
    """ GA4GH digest function """
    offset = 24
    digest = hashlib.sha512(seq).digest()
    tdigest_b64us = base64.urlsafe_b64encode(digest[:offset])
    return tdigest_b64us.decode("ascii")

# 1. Get data as canonical seqcol object representation

seqcol_obj = {
  "lengths": [
    248956422,
    133797422,
    135086622
  ],
  "names": [
    "chr1",
    "chr2",
    "chr3"
  ],
  "sequences": [
    "SQ.2648ae1bacce4ec4b6cf337dcae37816",
    "SQ.907112d17fcb73bcab1ed1c72b97ce68",
    "SQ.1511375dc2dd1b633af8cf439ae90cec"
  ]
}

# Step 1a: Remove any non-inherent attributes,
# so that only the inherent attributes contribute to the digest.
# Only names and sequences are inherent.

seqcol_obj_inherent = {
  "names": seqcol_obj["names"],
  "sequences": seqcol_obj["sequences"]
}

# Step 2: Apply RFC-8785 to canonicalize the value 
# associated with each attribute individually.

seqcol_obj2 = {}
for attribute in seqcol_obj_inherent:
    seqcol_obj2[attribute] = canonical_str(seqcol_obj_inherent[attribute])
seqcol_obj2  # visualize the result

# Step 3: Digest each canonicalized attribute value
# using the GA4GH digest algorithm.

seqcol_obj3 = {}
for attribute in seqcol_obj2:
    seqcol_obj3[attribute] = sha512t24u_digest(seqcol_obj2[attribute])
print(json.dumps(seqcol_obj3, indent=2))  # visualize the result

# Step 4: Apply RFC-8785 again to canonicalize the JSON 
# of new seqcol object representation.

seqcol_obj4 = canonical_str(seqcol_obj3)
seqcol_obj4  # visualize the result

# Step 5: Digest the final canonical representation again.

seqcol_digest = sha512t24u_digest(seqcol_obj4)
# Answer: 'KxZO6qIbVNCIKtQj0WR3fwzg2rsJLlC3'

# Equivalent to:
# import refget
# sc = refget.SequenceCollection.from_dict(seqcol_obj)
# sc.digest