Seqcol rationale
Introduction
The Refget Sequence Collections standard specification is written mostly to answer the question of how to implement the standard, rather than explaining why we designed the standard the way we did. But because there are many different use cases and perspectives with an interest in sequence collections, several readers of the specification have completed reading it and understand how to do it, but are confused about why we made some of the decisions we did, particularly because the specification doesn’t perfectly fit with their expectations of the easiest way to solve their specific use case. This document is my attempt to broaden the perspective, provide the rationale for why on some of the more controversial decisions, and hopefully convince readers that the years of thought and discussion that went into sequence collections was useful and created a standard that, although not necessarily the easiest or simplest way to solve one particular use case, is an elegant balance between complexity and simplicity that allows it to solve a huge number of related use cases while maintaining interoperability wherever possible and flexibility when truly needed.
More specifically, this document attempts to answer these questions:
- Why should the specification include a specific recommended schema, with a defined set of terms, including which are required (must exist) and which are inherent (affect the outcome of the digest)?
- Why is there not a digest for
<insert my use case here>
, so I can just compare two strings to see if the sequence collections are the same? Why do you need these more complex digests that make my comparison harder? - What is the purpose of the comparison function, which seems more complicated than a simple string match?
- Why do you organize the sequence collection as a set of arrays, instead of an array of annotated sequence objects?
- Why do you do repeated layers of digests, instead of just building one string for the collection and digesting it, which seems simpler?
- Why is it important to include names, and lengths, instead of just the sequences themselves?
A tale of three use cases
A major need that drove seqcol development is the need to see if two sequence collections match. This need is repeated across many use cases, and a main motivating factor in choosing a deterministic, content-derived identifier, because it lets you know if the content is identical by simply asking whether the digests are identical. Early on in our discussions of sequence collection use cases, it was clear that different use cases would have slightly different formulations of what is important in a sequence collection. For example, a first use case is that of the archiver: we just need to assess whether the content of the sequences in two sets are the same. The archiver doesn't care about the order of the sequences, or the names of the sequences, or any other metadata attributes. In this case, it would be simple to just digest each sequence, sort the digests, concatenate them, and then digest that string -- and in fact, this was an early simple proposal for the sequence collection standard. But while this would work perfectly fine for the simple use case of asserting two sets of sequences match in content, this is not the only use case for sequence collections.
Another use case is that of the aligner: say we want to reproduce a computational pipeline that does sequence alignment. We need an identifier of the reference genome (sequence collection) used for the alignment, so that when we repeat it, we guarantee we are using the same reference. In this scenario, the sorted, concatenated sequence digests wouldn't work for two reasons: First, for many aligners, the order of the sequences matter. Therefore, the final digest representing the collection should differ if the underlying sequences had different order. Second, the output of the aligner is going to specify the location of each read with respect to the name of a sequence. A sequence collection with the same sequences, even in the same order, would yield a different alignment output file if the sequences have different names. Therefore, the final digest should differ if the names of the sequences do not match. To generate a digest useful for this use case, we could adjust the prior proposal by 1) not sorting the sequence digests prior to concatenation; and 2) adding in a names vector. Digesting this would yield a content-derived identifier that would match only if the analysis could be reproduced entirely. Thus, already we have two use cases with different immediate answers to question; in the first, we imagine an identifier computed from sorted sequence digests, and in the second, we imagine an identifier computed from sequences plus their names in their existing order.
Now, here's yet another use case not served by either of these identifiers: the analyst. An analyst, further downstream, uses data aligned to a reference genome and summarized into genomic annotations. Say they have ChIP-seq data, which defines the binding locations of various transcription factors stored in BED file format. These datasets are columns of a sequence name, plus coordinate start and end for regions of interest. More than 80,000 BED files have been posted on GEO, as these results vary by cell-type, treatment, species, age, etc. The user now is interested in integrating BED files from different studies, either to visualize or analyze them together -- but it only makes sense to integrate them if the BED coordinates are defined on the same coordinate system. If defined on different coordinate systems, a position in one does not correspond to a position in another, and they cannot be easily integrated; additional processing would be required, depending on how different the reference genomes are. We would like the digest to somehow inform us as to whether the sequence collections are compatible at the level of a coordinate system. In this use case, the underlying sequence content is irrelevant, as long as the coordinate systems match. Therefore, a digest should consider the names and lengths of the sequences, but not the actual sequence content. This leads to a third vision of a sequence collection but where the actual base pairs don't matter; what's important is the names and lengths of those sequences.
Answering the question
It's clear from these examples that we have multiple common and reasonable use cases that use collections of sequences in different ways and therefore lead to a different optimal identifier. How could a standard accommodate all these different viewpoints?
An immediate solution is to let each use case define its own identifier. That would solve each problem individually; and yet, it is unsatisfying because these resulting identifiers would not be interoperable. The way the archiver computes the identifier wouldn't be useful for the aligner; the aligner's identifier wouldn't be useful for the epigenome analyst, and so on. So these one-off digests would not be able to become the universal identifiers we envisioned at the outset. We wondered if there could be a way to solve all these different needs, but in a way that maintained some interoperability so the identifiers could somehow be used together within a common ecosystem. Clearly, some aspects of the use cases are shared -- could we use this to do better than the individual solutions that neglect interoperability?
To do this, we employed several strategies in the sequence collections standard:
- First, we use a schema to separate the question of "what gets included" from the rest of the standard.
- Second, we deploy the comparison function, a more powerful way of comparing two sequence collections.
- Third, we specify a layered algorithm for computing digests, where individual attributes are digested separately and then these are digested again to make the final digest. This provides intermediate digests that can be used for different purposes and also makes it easy to define a custom internal digests.
These strategies don’t solve all the problems individually, but taken together, they allowed us to design an elegant, powerful, and extensible structure that provides reasonably easy solutions to all our use cases, while maintaining a degree of interoperability among identifiers. Let's dive into each of these strategies in detail to see how it helps us get the best of both worlds.
1. Handling divergent needs by splitting the standard into two parts
Our first strategy splits the identifier definition into two parts: 1) the algorithm by which the identifier will be computed, and 2) the list of what content contributes to the digest. The algorithm means the sorting approach, hash function, delimiters, concatenation process, etc. -- how to construct the string that gets digested to make the identifier. The second part is the list of content, which asks what attributes affect the digest; that is, do we include the names, sequences, etc in the string to digest? The division is useful because many of the differences come from different choice of content, not from a different algorithm, meaning the algorithm could be consistent even in situations where the content is not. For example, the second use case requires sequences, but the third use case does not. The algorithm can be kept the same.
Separating these two tasks also has a conceptual benefit: it isolates the critical question: exactly which attributes of a sequence collection should contribute to the digest computation? The standard abstracts away this question by allowing an implementation to specify which attributes contribute to the digest through the "inherent" property in the schema. This lets us publish a proposal for the algorithm, but leave the final decision open to the community (or even to different versions to be used by different communities). So people can use the sequence collections standard with whatever schema they want, providing flexibility to handle a wide variety of use cases by changing which attributes are listed as inherent in the schema.
The split into two parts thus provides some important modularity, but it doesn't solve all the issues. Two important challenges remain: First, it does not provide flexibility to adjust whether the content is sorted, since that's part of the algorithm, not part of the schema. Since this also differs between the first two use cases (the archiver use case wants to sort the sequence digests, but the aligner use case cannot), it means just splitting out the list of inherent attributes is insufficient for accommodating all the use cases. The other, more important issue that remains unsolved is that the identifiers computed with different schemas will, by definition, be different -- meaning we haven't solved the interoperability challenge. Thus, our first strategy helps solve some of the needs to be applied to different use cases, but it doesn't do enough to maintain interoperability.
2. The comparison function
The second strategy is the comparison function. In short, the goal here is to move away from comparing collections by simply checking if their digests are identical; instead, we want a more powerful comparison that can answer multiple questions using a single digest.
In more detail: each of the scenarios described above can be viewed as constructing a digest to make it really easy to ask a particular comparison question. For example, in the first use case, the question is "do these two collections have exactly the same sequence content, regardless of order?" In the second example, the question is more general: "do these two collections have exactly the same sequence content and sequence names, in the same order?". The third question is "Do these collections have the same coordinate system, in the same order?". To answer any of these questions, if you had the bespoke digest, you'd simply see if two digests are identical. If they are, the comparison question is satisfied. This is convenient if your question happens to be the one used to construct the digest. But the problem is that the final digest representing a sequence collection can only answer one such question. A simple string check approach simply offers only a single comparison: are these two things identical, or not? Therefore, to accommodate our complex use cases, where we have multiple things we want to compare in different scenarios, we need something more than just comparing digests. We need a single digest to be able to answer all of the above questions, and more.
Here's our alternative: instead of a digest-matching query, we design a comparison function. The comparison function goes beyond simply comparing digest strings; it provides a comparison of every attribute in the collection, including how many elements match, whether they are in the same order, and more. The output lets you answer all the questions posed above, plus more. You can tell if two collections have the same sequences, whether they are in the same order, whether their names match, whether the sequences differ but the lengths match, etc. It also allows you to determine more complex comparisons: is one sequence collection a subset of another? Do they have at least some sequences or names in common? Are their coordinate systems compatible, even if the sequence content differs? All of these questions are immediately answerable from the result of the comparison function.
To give a specific example of why the comparison function is better than relying on digest matching, let's revisit the first and second use cases. Whereas the original digest matching approach would require a separate identifier for each use case, the comparison function allows us to answer both questions without requiring separate identifiers; we can use a single, general-purpose identifier. We first build a single identifier (the more specific one): it does not sort the sequences, and does include names. This identifier, as before, needs only a string match for the second use case. To answer the question of the first use case, checking if the strings match is not enough, since it can't confirm the content is the same even without a matching order, since the digests were computed using the given order. Thus, to make this comparison, you'd need to go one further step and use the comparison function. Luckily, this is very simple: you run the comparison function on the two digests, and the result immediately tells you for the sequences attribute, whether they have the same sequences in a different order, independent of names, answering your question.
So, the vision here is that we've moved away from a quick check to see if two strings match, into a more information-rich function where you look at the output of the comparison function to answer the question. This added complexity is required if we want to use sequence collections to solve more than one problem. Admittedly, by necessity this makes the standard more complex for a single use case alone. However, it also makes the standard more general, meaning it is more likely to be used for a variety of different use cases. In other words, it maintains interoperability.
The risk here is that users who only see a single use case could balk and say, “Well, this is too complex for what we need. Our needs are solved by a simple algorithm; we don't need to define all these other terms, use a comparison function, etc.” But while that may be true, this leads to fragmentation, with digests minted and useful only for a single purpose. The unfortunate truth is that we cannot have it both ways: it's simply impossible to define a single digest that can address all the comparisons we need at the level of a string match. We face a tradeoff: either define individual identifiers for each use case at the cost of interoperability, or define a more general identifier with a more complex comparison function, maintaining interoperability at the cost of increased complexity. By adding in a bit more complexity, you're able to buy into an ecosystem where a community can define and provide tools that will support the increased complexity, leading to better interoperability and ease of use in the long run. Eventually, the benefits of doing something that solves multiple problems will outweigh the additional cost over the simple solution.
To mitigate the costs of the increased complexity of the comparison function, we've spent years optimizing and simplifying it to make it as simple as possible while retaining enough power to make it really useful. We tried to hit a sweet spot, where this information is 1) very fast to compute; 2) reasonably easy to code (my implementation is only about 150 lines of Python code); 3) simple enough to be understood relatively easily; and 4) information-rich enough to answer a huge number of comparison questions, making it very generally useful. The result returned by a comparison is a pretty understandable JSON document, which we've documented extensively in the specification (under the HowTo on comparing sequence collections). For me, the interoperability benefits and increased power are clearly worth the added complexity.
The comparison function is extremely powerful. Even if you don't think you need it for your use case, it will open the opportunity to do much more interesting comparisons of sequence collections that go beyond exact identity. But there are two limitations with the comparison function: First, it can't answer every possible question; it can only compare to the level specified. Second, it comes with the increased cost of having to compute a function. The third and final strategy addresses both of these limitations.
3. The layered approach
In some situations it may truly be necessary to answer a comparison question with a faster string-match check. For example, if you need to compare a huge number of collections, maybe the time cost of running the comparison function is intractable. This seems unlikely, since the comparison function is extremely fast, and it does not increase the algorithmic complexity... If you're doing pairwise comparisons between many sequence collections, the time complexity will grow exponentially with the number of collections; the comparison function will only be a linear factor. Nevertheless, there may be a rare case that requires a specific digest for a single comparison.
Our third strategy is what we call the "layered approach," because Sequence Collections uses a layered, recursive method of computing digests. First, each attribute is digested independently, and then these digests are re-digested to create the final identifier. These internal digests -- the first independent digest, which we call "level 1" -- can be useful for some questions on their own. In some cases, it may be possible to make the comparison of interest by a string comparison of the level 1 digests, bypassing the need for the comparison function. For example, if you don't care about names, you could simply compare the sequence
attribute-level digest. This gives you the same, fast, string-match check to answer a question, but in a framework that maintains the interoperability of the primary identifier.
To make this even more useful, the flexibility becomes complete with custom ancillary non-inherent attributes. In short, it works like this: if you really need a particular digest, then just compute it and add it to the sequence collection as a non-inherent attribute. You can then use it the way you would outside of the standard. As an example, the standard specifies a recommended ancillary attribute called sorted_name_length_pairs
. This attribute solves a specific use case of coordinate system compatibility. Two sequence collections that have matching sorted_name_length_pairs
attributes will have identical coordinate systems. The details of how to construct this attribute are included in the specification. We considered this use case common enough that we included this as a recommended ancillary attribute in the standard. It also provides an example of how you could construct other ancillary attributes to answer other specific questions that are not already answered by the basic comparison function or the level 1 digests. The point is that custom attributes like this can be easily added by a particular implementation to answer a specific question that is not already answered by the comparison function, or if the situation requires a string match instead of calling an external function for performance reasons. This provides ultimately flexibility for any comparison use case we have not already imagined, all within a framework that maintains interoperability of the final sequence collection identifiers.
Our proposal for inherent attributes
These strategies are major steps toward balancing interoperability and specific use cases, but there's one last major issue before we're all the way there: we need a proposal for a minimal, universal set of inherent attributes. One that could be the catch-all. Here's the problem. In our first strategy, we described how users could change the inherent attributes in the schema, depending on a particular use case. That's great for accommodating different needs while maintaining a common framework, but it also leads to a potential challenge: If two different implementations define different inherent attributes, then the digests of the same sequence collection will not match between these two implementations. That's part of the definition of inherent. This means if people use different sets of inherent attributes, we've again lost the interoperability that we've sought. By making the specification flexible enough that people can change which attributes contribute to the digest, we also open up the possibility that people will choose different inherent attributes, costing interoperability.
Our solution to this is that we should provide a general standard list of inherent attributes, an "accepted" or "approved" schema, as a recommended part of the specification. This will encourage people to adopt this set of attributes, preserving interoperability for as many implementations as possible. But it also maintains the flexibility for users who need to implement something that breaks that interoperability. Buying into a general-purpose standardized schema is the only way to get interoperability at the level of matching digests across use cases. Thus, the specification includes a RECOMMENDED set of inherent attributes.
Don't get caught up in the structure
One final important point. Sometimes people seeing the standard for the first time are confused at the way we structure sequence collections when we compute the digest. Specifically, why we structure the collection as an object of arrays, rather than an array of objects. In some use cases, it's more natural to think of a sequence with its metadata (name, etc) as an object, and a sequence collection as an array of such objects. Flipping the axis to group the information by attribute, instead of by sequence isn't the way they've thought about collections of objects and so it's unnatural, and this turns them off to the standard.
For reasons outlined in the specification, for the actual computing of the identifier, it's important to use the array-based structure -- this is what enables us to use the "level 1" digests for certain comparison questions, and also provides critical performance benefits for extremely large sequence collections. But don't let this dissuade you! My critical point is this: the way to compute the interoperable identifier does not force you to structure your data in a certain way in your service -- it's simply the way you structure the data when you compute its identifier. You are, of course, free to store a collection however you want, in whatever structure makes sense for you. You'd just need to structure it according to the standard if you wanted to implement the algorithm for computing the digest. In fact, my implementation provides a way to retrieve the collection information in either structure.
Sequence collections without sequences
Typically, we think of a sequence collection as consisting of real sequences, but in fact, sequence collections can also be used to specify collections where the actual sequence content is irrelevant. Since this concept can be a bit abstract for those not familiar, we'll try here to explain the rationale and benefit of this. First, consider that in a sequence comparison, for some use cases, we may be primarily concerned only with the length of the sequence, and not the actual sequence of characters. For example, BED files provide start and end coordinates of genomic regions of interest, which are defined on a particular sequence. On the surface, it seems that two genomic regions are only comparable if they are defined on the same sequence. However, this not strictly true; in fact, really, as long as the underlying sequences are homologous, and the position in one sequence references an equivalent position in the other, then it makes sense to compare the coordinates. In other words, even if the underlying sequences aren't exactly the same, as long as they represent something equivalent, then the coordinates can be compared. A prerequisite for this is that the lengths of the sequence must match; it wouldn't make sense to compare position 5,673 on a sequence of length 8,000 against the same position on a sequence of length 9,000 because those positions don't clearly represent the same thing; but if the sequences have the same length and represent a homology statement, then it may be meaningful to compare the positions.
We realized that we could gain a lot of power from the seqcol comparison function by comparing just the name and length vectors, which typically correspond to a coordinate system. Thus, actual sequence content is optional for sequence collections. We still think it's correct to refer to a sequence-content-less sequence collection as a "sequence collection" -- because it is still an abstract concept that is representing a collection of sequences: we know their names, and their lengths, we just don't care about the actual characters in the sequence in this case. Thus, we can think of these as a sequence collection without sequence characters.
An example of a canonical representation (level 2) of a sequence collection with unspecified sequences would be:
{
"lengths": [
"1216",
"970",
"1788"
],
"names": [
"A",
"B",
"C"
]
}