Introduction

Currently, this is the home of the GA4GH Tool Registry API proposal. Tools or workflows (both terms are often used interchangeably in this series of documents and in the community) will be defined as a data manipulation step or a series of steps respectively. These steps can be linearly sequential or fan out in a tree and then aggregate the results in what is often called a DAG. Each individual step usually executes an individual script or binary executable and thus data is fed from overall input files, is manipulated by each step, and eventually results in output files.

Workflows are stored in a variety of different locations with incompatible metadata (such as authorship, tagged versions, working parameter sets), are written in a variety of workflow languages, and often do not have their dependencies documented in a consistent way. This makes it difficult for researchers from different groups (or even in some cases, for researchers from the same group at a different time) to re-run workflows.

This proposal is part of an effort to create a standard and interoperable way to build and exchange workflows between groups so that developers that build and contribute workflows to the community can have confidence that other groups working in a very different cloud environment can run their workflow.

Initially, the registry API is intended on being a minimal common API for describing tools that we propose for implementation by registries like Dockstore and Agora. The intent is to balance the need for a simple set of requirements for implementors while still being useful for users of the API. The API is intended to be used for tasks like exchanging, indexing, and searching workflows from the bioinformatics community and beyond.

assets/trs.png

With a number of different repositories for storing bioinformatics workflows, it would be useful to settle on a way for different repositories to communicate with one another while recognizing that there are many valid different approaches and design decisions that can go into a registry project such as:

  1. where are the containers hosted (a private registry, a public registry, built on demand?)
  2. how are descriptors for those containers stored (source control, database)
  3. what is the format for descriptors?
  4. how is a registry secured (or not!)

Our intent is to allow registries built with different assumptions to communicate with one another and be discoverable by third party indexing or aggregation services.

Use Cases

Users

Users of tools (such as researchers that wish to reproduce/confirm results) will be able to access a larger universe of tools while still having the ability to go to the authoritative source for a tool for full information or to contact the author of a tool.

Tool Developers

Tool developers (writers of workflows) will be able register their tools and automatically have them visible on a number of different sites that exchange metadata via this API, allowing them to reach a larger audience.

Registry Developers

Registry developers will be able to develop services to either aggregate, index, or mirror tools found on different registries. This mechanism would be similar to how regular software developers working in languages such as Java or Javascript download dependencies from multiple repositories using the remote repositories mechanism in Maven or in npm respectively.

In summary, we envision a world where bioinformatics workflows (and workflows from other interested fields) can be freely exchanged with no technical barriers between a variety of cloud environments, allowing researchers to easily re-use workflows and reduce the effort of reimplementing workflows for use in different platforms.

Discussion of the models and endpoints in details continues at Data Model

TRS V1 vs TRS V2

TRS V1 is the initial release of the schema. It is available at https://github.com/ga4gh/tool-registry-service-schemas/releases/tag/1.0.0 . TRS V2 is a newer version of the schema, it proposes a number of changes both major and minor. V2 is recommended. As of the current date (December 9, 2020) it includes the following changes

As a brief summary, some changes include

Breaking Changes
  • conversion of many parameters and object properties from kebab case (hyphens) to snake case (underscores) to better support Javascript tooling and to better match other GA4GH cloud workstream APIs
  • removal of protobuf support and many of the conversions that were inherited from that choice
  • ToolDescriptor.descriptor is optional if ToolDescriptor.url is provided (21)
  • Some variable name changes to better account for possibility of support for Singularity or other Docker alternatives
  • allow for multiple container images, multiple authors, multiple verification sources for versions of tools
  • the last point is also related to a number of variable name changes suggested by the TRS PRC
Substantial Changes
  • addition of a documentation build system supporting https://ga4gh.github.io/tool-registry-service-schemas/Introduction/ (this site)
  • the addition of Nextflow as a descriptor format
  • ` /tools/{id}/versions/{version_id}/{type}/files` endpoint added to describe all avaliable files for a tool
  • checker workflows (26)
  • process for auto-generation of OpenAPI 3 copy of the schema
  • removal of the /metadata endpoint in favor of a /serivce-info endpoint to be added in 2.0.1