Refget specifications
What is refget?
Refget is a set of GA4GH standards for identifying and distributing reference biological sequences. It consists of these standards:
Standard | Description | Status |
---|---|---|
Refget sequences | For individual sequences | |
Refget sequence collections | For collections of sequences | |
Refget pangenomes | For collections of sequence collections | Currently in process |
What is the main purpose of the refget project?
Refget standards help to identify, retrieve, and compare reference sequences, such as those in a reference genome. Key principles include:
- Reference data, including sequences and collections of sequences, are identified using cryptographic digest-based identifiers that are derived from the data itself. This allows reference data to be identified without requiring a centralized accessioning authority.
- Refget standards can be used for any type of sequence (DNA, RNA, protein, etc.): anything that can be represented as a string of characters.
- Refget standards also specify retrieval APIs, providing a mechanism for retrieving a sequence or collection if you have its identifier.
- The Refget sequence collections standard also provides a programmatic approach to assessing compatibility among sequence collections.
For more information about use cases, see the use cases section of the sequence collections specification.
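As a rough illustration of the digest-based identifiers described above, the sketch below derives an identifier from the sequence content alone. It assumes the GA4GH `sha512t24u` digest (SHA-512, truncated to the first 24 bytes, base64url-encoded) and an `SQ.` prefix for sequences. The function names and the uppercase normalization step are illustrative assumptions, not text taken from the specification.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, base64url-encoded (32 characters)."""
    digest = hashlib.sha512(data).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def sequence_identifier(sequence: str) -> str:
    """Hypothetical helper: derive an identifier from the sequence itself.

    Assumption: the sequence is normalized to uppercase before digesting.
    """
    normalized = sequence.upper().encode("ascii")
    return "SQ." + sha512t24u(normalized)


if __name__ == "__main__":
    # The same sequence always yields the same identifier,
    # with no central accessioning authority involved.
    print(sequence_identifier("ACGT"))
    print(sequence_identifier("acgt"))  # same identifier after normalization
```

Because the identifier depends only on the content, two parties holding the same sequence compute the same identifier independently.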
How do the standards work together?
The Refget Sequences standard is used by the Sequence Collections standard, and the Sequence Collections standard forms the basis of the Pangenomes standard. First, sequences are digested to yield a deterministic identifier. These sequence identifiers are then used, together with their names, to create an identifier for a collection.
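The layering described above can be sketched as follows: each sequence is digested first, and the resulting identifiers are combined with the sequence names to produce one identifier for the whole collection. This is a simplified sketch, not the normative algorithm. It uses only the two ingredients mentioned here (names and sequence identifiers), and it uses `json.dumps` with sorted keys and compact separators as a stand-in for whatever canonical serialization the specification requires. The function names `canonical_json` and `collection_identifier` are hypothetical.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(data).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def canonical_json(value) -> bytes:
    """Assumed stand-in for a canonical JSON serialization."""
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")


def collection_identifier(named_sequences: list[tuple[str, str]]) -> str:
    """Hypothetical helper: digest a list of (name, sequence) pairs."""
    names = [name for name, _ in named_sequences]
    # Step 1: digest each sequence to get its deterministic identifier.
    sequence_ids = [
        "SQ." + sha512t24u(seq.upper().encode("ascii"))
        for _, seq in named_sequences
    ]
    # Step 2: digest each attribute array, then digest the combined object.
    attribute_digests = {
        "names": sha512t24u(canonical_json(names)),
        "sequences": sha512t24u(canonical_json(sequence_ids)),
    }
    return sha512t24u(canonical_json(attribute_digests))


if __name__ == "__main__":
    collection = [("chr1", "ACGTACGT"), ("chr2", "GGCC")]
    print(collection_identifier(collection))
```

In this sketch, changing a sequence, a name, or the order of entries changes the collection identifier, since all three feed into the final digest.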