Architectural Decision Record

This document serves as a record of decisions made throughout the product development process. Each entry outlines a decision that has been formally approved by members of the WGS-QC workgroup. Together, these ADRs capture the institutional memory of key decisions and their underlying rationales. The aim is to preserve the context behind these choices and share this knowledge with the broader community.

Contents

2025-05-27

Decision

Finalization of NPM-sample-qc release v0.13.4

  • Addressed precision inconsistencies in the cross-contamination rate matrix (from 2 to 4 decimal places)
  • Resolved inclusion logic for multi-allelic sites in sample QC
  • Fixed incorrect count inflation due to string length checks in metrics involving alternate alleles
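A minimal sketch of why the precision fix in the first item matters, using a hypothetical contamination estimate (the value and formatting calls are illustrative, not taken from the release):

```python
rate = 0.000437  # hypothetical low-level cross-contamination estimate

# At 2 decimal places, real contamination rounds away entirely;
# the agreed 4 decimal places preserve it.
print(f"{rate:.2f}")  # -> 0.00
print(f"{rate:.4f}")  # -> 0.0004
```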

Alignment with Argo Repository

  • Verified functional equivalence between NPM and Argo pipelines, except for “percentage of reads properly paired”
  • Agreed to update the functionally equivalent implementation and clearly document exceptions in the metric definitions
  • Decided to link the Argo reference implementation in the GA4GH repository alongside NPM as part of the formal documentation set

Approval Process for Product Release

  • Timeline and document preparation agreed:
    • Finalize: Product submission form, REWS form, and Security form
    • Select PRC members and initiate public feedback channels
    • Public comment period: July 15 – August 15, 2025
    • PRC meeting: To be held during the same window (July 15–August 15)
    • Target release: October 29, 2025 (GA4GH Plenary Session)

Metric Definition Improvements

  • Each metric matrix will now include its field type to eliminate ambiguity
  • Plan to clarify the distinction between nominal, continuous, and derived metrics across pipelines
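As a rough illustration of what attaching a field type to each definition could look like (the metric names and the exact type vocabulary here are assumptions, not the workgroup's final wording):

```python
# Hypothetical metric-definition entries carrying an explicit field type,
# distinguishing nominal, continuous, and derived metrics.
metric_definitions = {
    "pct_reads_properly_paired": {"field_type": "continuous", "unit": "%"},
    "reference_genome_build":    {"field_type": "nominal"},
    "titv_ratio":                {"field_type": "derived"},  # ratio of two counts
}
```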

Roadmap Additions

Confirmed inclusion of the following deliverables in the roadmap:

  • A Common Practice Resource Guide for adopters
  • A Peer-reviewed Publication to increase visibility and adoption of the standard
  • Committed to updating the roadmap document accordingly

Rationale

  • Ensures technical precision, consistency, and compatibility across platforms (NPM, Argo)
  • Formalizes the approval and feedback process in accordance with GA4GH standards
  • Supports adoption through documentation, tooling references, and academic publication
  • Enhances transparency and interpretability of metrics by standardizing field definitions and formats

2025-04-22

Decision

Include Multi-Allelic Indels in QC Metrics

  • Update the NPM QC pipeline to count multi-allelic sites in indel metrics (insertion/deletion counts and ratios).
  • Remove existing filters that exclude multi-allelic sites to ensure biological relevance and data fidelity.
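A minimal sketch of the intended counting behaviour, assuming indel type is classified per ALT allele by length comparison against REF, so a multi-allelic site contributes every qualifying allele instead of being filtered out:

```python
def count_indel_alleles(ref: str, alts: tuple[str, ...]) -> tuple[int, int]:
    """Count insertion and deletion alleles at one site, multi-allelic included."""
    insertions = sum(1 for alt in alts if len(alt) > len(ref))
    deletions = sum(1 for alt in alts if len(alt) < len(ref))
    return insertions, deletions

# Bi-allelic site: one insertion.
print(count_indel_alleles("A", ("AT",)))         # (1, 0)
# Multi-allelic site: both alleles now counted rather than the site being skipped.
print(count_indel_alleles("ACGT", ("A", "AC")))  # (0, 2)
```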

Define Somatic Variant Metrics

  • Specify required input files (e.g., VCF, BAM) for somatic mutation QC calculations.
  • Collaborate with Argo to integrate their VCF QC pipeline as a reference implementation.
  • Address tumor purity/ploidy dependencies in metric definitions.

Expand Ecosystem Integration

  • Align with GA4GH Data Connect and Damasc for federated querying of QC metrics.
  • Engage Nomad and EGA to promote adoption beyond GA4GH.
  • Plan ISO TC 215 alignment post-version 1 release.

Roadmap Prioritization

  • Version 1 (Short-Read WGS): Finalize multi-allelic and somatic updates.
  • Version 2: Add long-read sequencing and advanced somatic QC.

Rationale

Multi-Allelic Indels

  • Biological Relevance: Excluding multi-allelic sites risks omitting true variants (e.g., complex indels in cancer/population genomics).
  • Data Integrity: Metrics should reflect raw input without arbitrary filtering.
  • Implementation Impact: Minimal pipeline adjustments needed (removes a filter, no computational overhead).

Somatic Variants

  • Standardization Need: Somatic pipelines lack unified QC metrics (e.g., tumor-normal concordance, artifact detection).
  • Argo’s Benchmark: Their VCF QC pipeline provides a validated starting point.
  • Clarity: Explicit input requirements (e.g., "Tumor VCF + BAM") prevent ambiguity.

Ecosystem Integration

  • Interoperability: Data Connect enables centralized QC metric querying across repositories.
  • Adoption: Partnerships with Nomad/EGA ensure real-world utility.
  • Regulatory Alignment: ISO TC 215 collaboration future-proofs the standard.

Roadmap

  • Short-read focus (v1): Ensures deliverability while deferring complex long-read/somatic challenges to v2

2025-02-25

Decision

1. NPM Repository

  • Confirm that there are no open issues or pull requests.
  • Issues #136 and #138 are closed, resolving the cross-contamination rate calculation and updating the README.
  • Pull Requests #137 and #141 are merged, updating the Nextflow manifest version and adding license details to the README.
  • Pull Request #144 is merged, correcting the cross-contamination metric name and updating unit test results.

2. GA4GH Repository

  • Acknowledge pending issues:
    • Issue #40 (printing median value) – pending.
    • Issue #46 (copyright update) – pending.
    • Issue #50 (update reference implementation and Dockstore links) – pending.
  • Closed/linked issues:
    • Issue #45 (linked to NPM #137) – resolved in NPM and merged to GA4GH release 13.2.
    • Issue #55 closed, correcting the cross-contamination metric name.
  • Pending pull requests:
    • Issue #56 awaiting Argo VCF QC implementation results.

3. Argo Implementation

  • Accept Argo’s completion of the VCF QC reference implementation, with 100-sample symmetric results expected in early March for correlation with the NPM reference implementation.
  • Note Argo’s ongoing work on somatic mutation, with updates expected early March.
  • Accept external contribution for minor license file fixes.

4. Product Approval and Security Forms

  • Complete and submit the Product Approval Submission Form and Security Form to the GA4GH committee.
  • Prepare a Public Review Committee (PRC) with members from multiple categories.
  • Publish a recorded product presentation for a 30-day public comment period before PRC review.

5. GA4GH Connect Session

  • Present WGS QC roadmap and interoperability plans on April 4, 2025, 11:15 ET.
  • Rename current document as Version 1.0 and the forthcoming document as Version 2.0.
  • Include updates in Roadmap v2: QC for long-read germline post-alignment, long-read germline post-variant calling, structural variation, and somatic mutation.

Rationale

  • Repository alignment: Closing, merging, and linking related issues ensures consistent metric naming, updated documentation, and maintained compatibility across NPM and GA4GH repositories.

  • Reference implementation verification: Coordinating NPM and Argo results strengthens confidence in the specification’s reproducibility and accuracy.

  • Compliance and transparency: Completing GA4GH product approval requirements, including security forms and a public review period, ensures the process meets governance standards.

  • Forward planning: Versioning and roadmap updates maintain clarity on current vs. future scope, enabling structured evolution of the specification.

2025-01-21

Decision

1. Finalize Version 1.0 Release

  • Close remaining PRs: Merge PR #57 (cross-contamination rate fix) and update reference implementation links (Issue #50).
  • Publish release notes: Document all changes (closed Issues #45, #55, #136, #138; merged PRs #137, #141, #144).

2. Integrate Argo’s VCF QC Implementation

  • Include as a reference implementation after March testing completion.
  • Add somatic mutation metrics (under development) to Version 2.0 roadmap.

3. Prepare for GA4GH Product Approval

  • Complete submission and security forms for committee review.
  • Form Public Review Committee (PRC) with cross-workstream members.
  • Share recorded presentation for 30-day public comment (pre-PRC review).

4. Plan Version 2.0 & Connect Session

  • Scope: Long-read QC (germline/somatic), structural variants, GA4GH ecosystem interoperability.
  • April 4 Connect Session: Present roadmap and align with Data Connect/Damasc.

Rationale

1. Version 1.0 Readiness

  • Closed Issues/PRs: Critical fixes (e.g., cross-contamination precision, license updates) ensure stability.
  • Argo’s Contribution: VCF QC implementation expands use cases beyond NPM’s pipeline.

2. Product Approval

  • GA4GH Compliance: Formal review ensures alignment with federation standards.
  • Transparency: Public comment period invites community feedback.

3. Version 2.0 Priorities

  • Long-Read/Somatic Gaps: Address emerging sequencing technologies and cancer genomics needs.
  • Ecosystem Integration: Data Connect/Damasc compatibility enhances adoption.

4. Connect Session

  • Visibility: Showcases progress to GA4GH members and external stakeholders (e.g., EGA, Nomad)

2024-11-26

Decision

1. Separate QC Workflows by File Type

  • BAM/CRAM Workflow: Alignment/post-alignment QC (NPM-led).
  • VCF Workflow: Variant calling QC (Argo-led).
  • Rationale: Simplifies implementation, testing, and adoption for distinct pipeline stages.

2. Close Pending Issues

  • Issue #41: Dockstore registration (close).
  • Issue #48: README updates (close).
  • Issue #44: Median value implementation (defer to Version 2.0).

3. Align NPM/Argo Metrics

  • Resolve Issue #45 (decimal precision) via cross-validation with GATK.
  • Action: Justin/Linda to finalize comments.

4. Version 2.0 Scope

  • Somatic Workflow: Argo to lead (align with Issue #46).
  • Long-Read QC: Seek public benchmarks (e.g., Platinum dataset).
  • Single Metric Document: Unify definitions across workflows.

5. GA4GH Approval Timeline

  • Target April–July 2025 for full approval.
  • Next Steps: Complete submission documents (Thomas to guide).

Rationale

1. Workflow Separation

  • Modularity: Enables independent updates (e.g., VCF metrics without BAM changes).
  • User Flexibility: Labs can adopt workflows relevant to their pipelines.
  • Validation Efficiency: Simplifies benchmarking (e.g., GIAB for VCF, Platinum for long-reads).

2. Version 2.0 Priorities

  • Somatic Focus: Addresses critical gap in cancer genomics.
  • Long-Reads: Prepares for emerging technologies (ONT/PacBio).
  • Unified Docs: Reduces redundancy and improves clarity.

3. Approval Process

  • GA4GH Alignment: Ensures interoperability with Data Connect/Damasc.
  • Community Review: Incorporates feedback via public comment period

2024-10-22

Decision

1. Diversify Benchmark Resources

  • Merge PR #42: Expand 100-sample benchmark to include diverse ethnicities.
  • Action: Argo to run pipeline on updated sample list for cross-validation.

2. Standardize Metric Precision

  • Decimal Consistency: Align NPM/Argo cross-contamination rates (avoid rounding to 0%).
  • Documentation: Add precision requirements to metric definitions (GitHub Issue #TBD).

3. VCF QC Implementation

  • Reference Pipeline: Argo’s VCF QC workflow is now production-ready for testing.
  • Deprecate MultiQC: NPM v0.13 removes MultiQC dependency (simplifies outputs).

4. Somatic/Long-Read Roadmap

  • Somatic QC: Collaborate with Dragen to align with GA4GH metrics.
  • Long-Reads: Use Platinum dataset (bioRxiv 10.1101/2024.10.02.616333) as benchmark.

5. GA4GH Ecosystem Integration

  • Data Connect: Engage leads to incorporate QC metrics into federated queries.
  • Publication: Draft GA4GH best practices paper for QC standards.

Rationale

1. Benchmark Diversity

  • Representation Matters: Ethnic diversity reduces bias in QC thresholds.
  • Validation: Parallel runs (NPM/Argo) ensure metric robustness across populations.

2. Metric Precision

  • Clinical Relevance: Rounding contamination rates to 0% masks critical quality issues.
  • Interoperability: Consistent precision enables cross-pipeline comparisons.

3. VCF QC Readiness

  • User Need: Separate VCF workflow simplifies adoption for variant-focused labs.
  • Maintenance: Removing MultiQC reduces dependency conflicts.

4. Future-Proofing

  • Somatic/Dragen: Industry alignment accelerates adoption.
  • Long-Reads: Platinum dataset provides gold-standard validation

2024-09-24

Decision

1. Two-Phase Implementation Approach

  • Phase 1 (v1): Finalize short-read WGS QC standard and achieve GA4GH production status
  • Phase 2 (v2): Expand scope to:
    • Somatic mutation QC (Argo-led implementation)
    • Long-read sequencing QC / Structural variant calling QC

2. Documentation Strategy

  • Maintain single unified metric definition document
  • Add versioned appendices for new data types (somatic/long-read)

3. Pipeline Architecture

  • Develop specialized QC pipelines for:
    • Short-read alignment (BAM/CRAM)
    • Variant calling (VCF)
    • Somatic mutation
    • Long-read data
  • Preserve interoperability through shared JSON schema
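A sketch of how a shared schema could keep the specialized pipelines interoperable. The schema fields and metric names are illustrative assumptions, and the `jsonschema` package stands in for whatever validator an implementation actually uses:

```python
from jsonschema import validate  # pip install jsonschema

# Hypothetical shared envelope that every pipeline's QC output must satisfy.
SHARED_SCHEMA = {
    "type": "object",
    "required": ["sample_id", "qc_metrics"],
    "properties": {
        "sample_id": {"type": "string"},
        "qc_metrics": {
            "type": "object",
            "properties": {
                "mean_autosome_coverage": {"type": "number"},
                "cross_contamination_rate": {"type": "number"},
            },
        },
    },
}

report = {"sample_id": "HG002", "qc_metrics": {"mean_autosome_coverage": 31.4}}
validate(instance=report, schema=SHARED_SCHEMA)  # raises ValidationError on mismatch
```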

4. Ethnicity Considerations

  • Formalize ethnicity-specific benchmarking in v1.1 update
  • Document expected metric ranges by population group

5. Governance

  • Form working subgroups for:
    • Somatic QC (Argo chair)
    • Long-read QC (open chair)
  • Monthly cross-group syncs to maintain alignment

Rationale

1. Progressive Standardization

  • Completing short-read standard first provides stable foundation
  • Parallel development risks scope creep and delays

2. Technical Pragmatism

  • Separate pipelines allow:
    • Technology-specific optimizations
    • Independent versioning
    • Gradual adoption
  • Shared schema ensures interoperability

3. Biological Realism

  • Ethnicity-specific ranges reflect true genetic diversity
  • Prevents false quality flags in diverse cohorts

4. Community Alignment

  • Argo's clinical genomics expertise suits somatic leadership
  • Open chair encourages long-read community participation

2024-07-23

Decision

1. Update Post-Alignment QC Metrics Definitions

  • Refine functional equivalence for DRAGEN v3.7 and other aligners.
  • Focus on "not hard-to-sequence" regions (ensure benchmark regions exclude problematic genomic areas).
  • Clarify implementation requirements for:
    • Coverage depth uniformity
    • Mapping quality distribution
    • Insert size metrics

2. Expand Benchmark Dataset Diversity

  • Add ethnically diverse samples to the 100-sample benchmark set.
  • Prioritize inclusion of:
    • African, East Asian, South Asian, and admixed populations
    • Samples with varying GC content and complexity

3. Resolve Pending Pull Requests

  • PR #21, #22, #31, #32, #33: Finalize reviews and merges.
  • PR #22: Update comparison analysis between NPM and DRAGEN v3.7.

4. Improve Visualization & Reporting

  • Enhance Jupyter notebooks for:
    • Super-population stratification
    • Side-by-side tool comparisons (NPM vs. DRAGEN)
  • Standardize plots for GA4GH documentation.

5. Collaborate with GA4GH DRS Group

  • Define standardized region sets for benchmarking (e.g., "not hard-to-sequence" BED files).
  • Ensure compatibility with GA4GH Data Repository Service (DRS) for metric sharing.

Rationale

1. Metric Refinements

  • DRAGEN v3.7 Alignment: Ensures interoperability with industry-standard pipelines.
  • "Not Hard-to-Sequence" Regions: Reduces false QC failures in problematic genomic areas (e.g., high homology regions).
  • Functional Equivalence: Allows fair comparison across tools (e.g., NPM vs. DRAGEN).

2. Diverse Benchmarking

  • Ethnic Representation: Prevents population-specific biases in QC thresholds.
  • Genomic Diversity: Captures wider range of technical challenges (e.g., GC extremes).

3. Visualization & Reporting

  • Population Stratification: Helps labs interpret metric ranges for their cohorts.
  • Tool Comparisons: Identifies systematic differences between implementations.

4. DRS Integration

  • Reproducibility: Standardized regions enable cross-study comparisons.
  • FAIR Compliance: DRS enables federated querying of QC metrics

2024-06-25

Decision

1. Investigate Sample Stratification in QC Plots

  • Hypothesis Testing: Correlate the two observed sample clusters with 1KGP ethnicity data (super-populations).
  • Action: Update Jupyter notebooks to color-code samples by ethnicity.

2. DRAGEN v3.7.8 Functional Equivalence

  • Root Cause: NPM’s region exclusion vs. DRAGEN’s lack of equivalent metrics.
  • Short-Term Fix:
    • Run DRAGEN with whitelist regions (if supported).
    • Document workarounds for missing metrics (e.g., coverage uniformity).
  • Long-Term Proposal:
    • Request DRAGEN to adopt whitelist by default via Illumina collaboration.

3. VCF QC Tool Standardization

  • Clarify Tools: Specify whether bcftools, GATK, or custom scripts are used for VCF QC.
  • Add to Docs: Update README with tool versions and parameters.

4. Plotting Improvements

  • Replot Correlation Charts:
    • Fix axis scaling (log where appropriate).
    • Add jitter to overlapping ratio points.
  • Ethnicity Annotation: Label clusters if ethnicity-linked.
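A sketch of the two plotting fixes above on synthetic data (all values and styling choices here are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
titv = np.round(rng.normal(2.05, 0.02, 200), 2)              # overlapping ratio values
depth = rng.lognormal(mean=np.log(30), sigma=0.5, size=200)  # skewed depth distribution

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Jitter: spread identical ratio values horizontally so point density stays visible.
ax1.scatter(rng.uniform(-0.2, 0.2, titv.size), titv, s=10, alpha=0.4)
ax1.set_xticks([])
ax1.set_ylabel("Ti/Tv")

# Log scale: reveals structure in skewed distributions such as depth.
ax2.hist(depth, bins=30)
ax2.set_xscale("log")
ax2.set_xlabel("Mean depth (log scale)")

fig.tight_layout()
plt.show()
```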

5. Plenary Attendance Plan

  • Remote Participation: Prioritize key discussions for asynchronous input (Slack/email).
  • Delegate Presentation: Assign on-site attendees (e.g., Justin, Nicolas) to represent the group.

Rationale

1. Ethnicity-Driven QC Variation

  • Biological Reality: Metric distributions often differ by ancestry (e.g., AFR vs. EUR).
  • Mitigation: Annotate plots to prevent misinterpretation as technical artifacts.

2. DRAGEN Compatibility

  • User Need: Labs using DRAGEN require comparable QC outputs.
  • Advocacy: Pushing for whitelist adoption improves future interoperability.

3. Visualization Clarity

  • Jitter: Resolves overplotting in ratio metrics (e.g., Ti/Tv).
  • Log Scales: Reveals patterns in skewed distributions (e.g., depth).

4. Remote Collaboration

  • Inclusivity: Ensures absent members can contribute to decisions

2024-05-28

Decision

1. Functional Equivalence Confirmed

  • AWS 1KGP DRAGEN vs. NPM QC:
    • 3 metrics initially discrepant due to partial BAM processing.
    • 64 samples reanalyzed; results now show 100% correlation for yield/alignment metrics.
  • Action:
    • Fix BAM loading issue in pipeline (GitHub Issue #TBD).
    • Document DRAGEN-NPM benchmark protocol.

2. Variant Calling QC Metrics

  • PR #31 Review: Finalize definitions for:
    • Transition/Transversion (Ti/Tv) ratios
    • Indel size distributions
    • Multiallelic site frequencies
  • Action: All members review proposed metrics.

3. ISO Standardization Alignment

  • Clarify Engagement:
    • TC 140 (NGS): Focus on wet-lab protocols.
    • Bioinformatics TC: Target for QC metric standards.
  • Action: Draft GA4GH-ISO crosswalk document (Thomas/Linda).

4. GA4GH Product Integration

  • DRS Collaboration: Ensure QC metrics are queryable via Data Repository Service.
  • Roadmap: Add interoperability milestone for Q3 2024.

Rationale

1. Tool Validation

  • Reproducibility: Confirms NPM and DRAGEN outputs are comparable when processed fully.
  • Transparency: Documents edge cases (partial BAM loads) for future debugging.

2. VC Metric Standardization

  • Clinical Relevance: Ti/Tv ratios detect sequencing artifacts.
  • Biological Insight: Indel distributions vary by disease context.

3. ISO Harmonization

  • Regulatory Impact: Aligns GA4GH standards with international frameworks.
  • Scope Clarity: Separates wet-lab (TC 140) from bioinformatics standards.

4. Ecosystem Integration

  • FAIR Data: DRS enables federated QC metric access.

2024-03-26

Decision

1. Define Functional Equivalence Thresholds

  • High Agreement Metrics (✅):
    • Require <5% relative difference (e.g., mean coverage, properly paired reads).
  • Disputed Metrics (❌):
    • % Reads Mapped: Investigate mapping algorithm differences.
    • Insert Size Std Dev: Exclude from cross-tool comparisons.
  • Non-Equivalent Metrics:
    • Label clearly in reports (e.g., "DRAGEN lacks PCT_Autosome_15x").
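A sketch of the <5% rule from the first bullet group; the record does not pin down the denominator convention, so using the mean of the two values here is an assumption:

```python
def functionally_equivalent(a: float, b: float, tol: float = 0.05) -> bool:
    """True when two pipelines' values fall within the relative-difference tolerance."""
    denom = (abs(a) + abs(b)) / 2  # assumed convention: mean of the two values
    return denom == 0 or abs(a - b) / denom < tol

print(functionally_equivalent(31.2, 30.5))  # True: ~2.3% apart (e.g., mean coverage)
print(functionally_equivalent(0.95, 0.85))  # False: ~11% apart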

2. Address Cell Line vs. Primary Sample Bias

  • Annotate 1KGP Data: Flag immortalized cell line status in benchmarks.
  • Supplement with Primary Samples: Use Ivo’s inter-lab comparison data where possible.

3. Variant Calling QC Standardization

  • Core Metrics:
    • Ti/Tv ratio (tumor/normal concordance for somatic).
    • Indel size spectrum.
  • Exclusions:
    • DRAGEN-specific metrics without NPM equivalents.

4. Pipeline Improvements

  • DRAGEN Whitelisting: Test --region parameter for assembly gap exclusion.
  • NPM Updates: Add warning for cell line artifacts.

Rationale

1. Threshold Clarity

  • 5% threshold balances biological variability with technical reproducibility.
  • Explicit non-equivalence prevents false comparisons.

2. Sample Type Transparency

  • Cell lines show +15% false positives vs. primary samples (per Ivo’s data).
  • Annotations ensure appropriate interpretation.

3. Variant QC Focus

  • Ti/Tv ratios detect PCR artifacts; indel sizes reveal alignment issues

2024-02-27

Decision

1. VCF-Centric QC Contributions

  • Open Collaboration: Invite all members to propose metric definitions via GitHub markdown.
  • Linking Pipelines: Use .gitmodules to reference external QC implementations (e.g., Argo’s pipeline).

2. Workflow Versioning

  • NPM v0.12: Now reflects Picard updates.
  • Argo Pipeline: Pending PR for updates (MH to guide GitHub submodule process).

3. Long-Read Terminology

  • Scope: Officially include long-read QC in Roadmap v2.
  • Clarification: Define "long-read" as:
    • PacBio HiFi: ≥Q20, ≥10 kb reads.
    • Oxford Nanopore: ≥Q15, ultra-long reads.

4. Registry Integration

  • Dockstore Registration: Accelerate via GA4GH Tool Registry Service.
  • Metadata Alignment: Sync with GA4GH Metadata Schema group.

Rationale

1. Transparent Contributions

  • GitHub markdown ensures traceability; .gitmodules enables modular pipeline reuse.

2. Forward Compatibility

  • Submodule approach allows independent pipeline updates without breaking references.

3. Technology Neutrality

  • Explicit long-read definitions prevent ambiguity in v2 planning.

4. Ecosystem Alignment

  • Dockstore/Data Registry integration enables federated QC queries

2024-01-30

Decision

1. Roadmap Document Updates

  • Adopt new table of contents structure (per Maxime's proposal)
  • Rename "Expected Completion Date" → "Completed Date"
  • Add checkboxes for tracking progress
  • Mark completed items as we progress

2. Milestone Adjustments

Schema Integration

  • Connect with Experiment Metadata group
  • Engage DRS team

Reference Tools

  • PR #19 ready
  • Merge PR
  • Continue Argo improvements

Metric Finalization

  • Identify responsible parties
  • Schedule alignment meeting

3. 2024 Priority Projects

  • Post-Variant Calling QC Metrics
    • Select 5-7 key metrics (Ti/Tv ratio, indel spectrum, etc.)
    • Formalize definitions in GitHub markdown
    • Implement in ≥2 workflows (NPM + Argo)
  • Reference Implementations
    • Formalize Argo workflow as second independent implementation
    • Establish validation protocol
  • GA4GH Integration
    • Experiment Metadata WG (primary)
    • DRS standard (secondary)

4. Benchmarking Resources

  • Current: DRAGEN v3.7.6 dataset available
  • Needed: Additional version comparisons
  • Action: Nominate benchmark coordinator

Rationale

1. Transparent Tracking

  • Checkboxes and renamed dates improve accountability
  • Clear documentation of completed work

2. Accelerating Progress

  • Merging PR #19 unblocks dependent work
  • Parallel development of Argo implementation saves time

3. Strategic Focus

  • VCF metrics address urgent community needs
  • Dual implementations ensure robustness
  • Metadata/DRS integration enables FAIR QC data

4. Resource Optimization

  • Leverage existing DRAGEN data while planning expansions
  • Technical paper will attract collaborators

2023-11-28

Decision

  • The WG agreed that the reference implementation will use Picard to compute coverage-based metrics with the default parameters (exclude unmapped reads and bases with BQ < 20).
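A hedged sketch of that setup: Picard's CollectWgsMetrics defaults already apply the stated filters (per the decision's description), so no extra flags are needed; file paths are placeholders.

```python
import subprocess

# CollectWgsMetrics with default parameters; per the decision, the defaults
# exclude unmapped reads and bases with BQ < 20. Paths are placeholders.
subprocess.run(
    [
        "java", "-jar", "picard.jar", "CollectWgsMetrics",
        "I=sample.bam",
        "O=sample.wgs_metrics.txt",
        "R=reference.fasta",
    ],
    check=True,
)
```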

  • The WG sees the benefit of running the pipeline on the 3,202 samples from the 1000 Genomes Project in order to define the range of accepted values for functional equivalence.

  • The Argo pipeline team will work on Dockstore registration.

  • Align Argo and NPM before running the 1000 Genomes dataset.

  • Participate in the Experiments Metadata Standard effort and invite a representative to our meetings.

2023-10-24

Decision

Is the current process sufficient to ensure implementations work as advertised?

With only 100 samples tested, we already see some divergence between the implementations; testing may not need to go further.

2023-09-26

Decision

1. Scope Expansion for Metrics

  • Primary Focus: Germline short-read WGS QC (current priority).
  • Future Extensions:
    • Variant Types: SNPs + small indels (Phase 1) → Structural Variants (Phase 2).
    • Read-Level QC: Deferred (e.g., FASTQ-style metrics, vendor quality tags).

2. Vendor Engagement

  • DRAGEN (Illumina):
    • Request alignment of DRAGEN 3.7.8+ with GA4GH metric definitions.
    • Collaborate on standardizing variant-level QC flags.
  • Unified Output Format: Advocate for a consolidated file (JSON/CRAM-tag) for all metrics.

3. NIH Metric Framework Integration

  • Adopt/adapt NIH’s germline QC metrics where applicable.
  • SV-CNV Cross-Validation: Explore in Phase 2 (per DC’s suggestion).

4. Working Group Priorities

  • Short-Term: Finalize SNP/small-indel VCF metrics (Q2 2024).
  • Long-Term:
    • Structural variant QC (2025).
    • Vendor tag standardization (post-2024).

Rationale

1. Focus on Germline Short-Reads

  • Addresses immediate community needs (clinical/large-scale genomics).
  • Provides foundation for future expansions.

2. Vendor Collaboration

  • DRAGEN Ubiquity: Ensures broad adoption of standards.
  • Unified Outputs: Simplifies integration for end users.

3. NIH Alignment

  • Leverages existing NIH/CSER work to avoid duplication.
  • SV-CNV sanity checks improve variant reliability.

4. Phased Approach

  • Prevents scope creep while planning for SVs/long-reads

2023-08-22

Decision

1. VCF-Centric QC Metrics Framework

  • Scope: Focus on germline short variants (SNPs/indels <50bp) from standard callers (e.g., DeepVariant, GATK).

Core Metrics:

| Metric | Definition | Filter |
| --- | --- | --- |
| Insertion Count | PASS variants only | Exclude SVs |
| Deletion Count | PASS variants only | Exclude SVs |
| Ti/Tv Ratio | All variants | – |
  • Filtering: Standardize on PASS-only for critical metrics (per Linda/Jukka consensus).
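A sketch of the three core metrics under the filters in the table above, using pysam. The <50 bp cutoff comes from the stated scope, and treating symbolic alleles as the SV exclusion is an assumption:

```python
import pysam  # pip install pysam

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def core_vcf_metrics(vcf_path: str) -> dict:
    ins = dels = ti = tv = 0
    with pysam.VariantFile(vcf_path) as vcf:
        for rec in vcf:
            is_pass = "PASS" in rec.filter.keys()
            for alt in rec.alts or ():
                if alt.startswith("<"):  # symbolic allele -> treated as SV, excluded
                    continue
                if len(rec.ref) == 1 and len(alt) == 1:
                    # Ti/Tv counts all variants, per the table.
                    if (rec.ref.upper(), alt.upper()) in TRANSITIONS:
                        ti += 1
                    else:
                        tv += 1
                elif is_pass and 0 < abs(len(alt) - len(rec.ref)) < 50:
                    # Indel counts: PASS only, <50 bp per the stated scope.
                    if len(alt) > len(rec.ref):
                        ins += 1
                    else:
                        dels += 1
    return {"insertion_count": ins, "deletion_count": dels,
            "titv_ratio": ti / tv if tv else None}
```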

2. Implementation & Validation

  • Reference Workflows:
    • NPM (Nextflow): Current benchmark.
    • Argo (GATK-based): To be formalized.
  • Benchmarking:
    • Run on 10K samples (Oliver’s bad-quality WGS + 1KG).
    • Compare via Dockstore for reproducibility.

3. PR #7 Merge

  • Approve major revision of metric definitions for:
    • BAM/CRAM QC (mature).
    • VCF QC (initial draft).

4. Somatic/Long-Read Deferral

  • Postpone until germline metrics are stable.
  • Exception: Basic Ti/Tv for tumor/normal pairs.

Rationale

1. Practical Focus

  • PASS-only filtering reflects clinical/lab standards.
  • Size cutoff (<50bp) aligns with major variant callers.

2. Scalable Validation

  • 10K samples stress-test metric robustness.
  • Dockstore ensures workflow portability.

3. Ecosystem Alignment

  • PR #7 harmonizes with NIH/CSER frameworks.
  • Delaying SVs avoids dilution of effort.

2023-03-28

Decision

1. Metric Definition Updates

  • Adopt PR revisions to clarify:
    • Remove ambiguous booleans (e.g., "include duplicates?") → Replace with plain-language descriptions.
    • Standardize insert size reporting (mean + median absolute deviation).
    • Rename "% read map" (was "read mapping quality") for clarity.
  • Implementation Details Section:
    • Add tool-specific parameters (e.g., verifyBAMID2 for contamination).
    • Document functionally equivalent thresholds (e.g., 1000 Genomes benchmarks).

2. Coverage Metrics

  • Retain 15x autosome coverage (vs. AoU’s 20x):
    • Balances clinical utility and sequencing economics.
  • Evenness Calculation:
    • Standardize on median/mean coverage ratio (simple, reproducible).
    • Future: Evaluate dispersion-based methods for tumor samples.
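A minimal sketch of the agreed median/mean evenness calculation on hypothetical per-window depths (1.0 is perfectly even; a coverage spike drags the ratio down):

```python
from statistics import mean, median

def coverage_evenness(depths: list[float]) -> float:
    """Evenness as the median/mean coverage ratio."""
    return median(depths) / mean(depths)

print(round(coverage_evenness([30, 31, 29, 30, 32]), 2))   # 0.99: uniform coverage
print(round(coverage_evenness([30, 30, 30, 30, 300]), 2))  # 0.36: spike inflates the mean
```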

3. Contamination Rate

  • Include as optional metric:
    • Tools: verifyBAMID2 (alignment) or GATK CalculateContamination (post-calling).
    • No universal threshold: Labs set own cutoffs (e.g., ≤1% for clinical).
    • Exclude ACMG panel coverage: Too project-specific.

4. VCF-Centric Additions

  • Required: Ti/Tv ratio, indel spectrum.
  • Optional: Cross-sample contamination (tumor/normal).

Rationale

1. Clarity vs. Flexibility

  • Plain-language definitions reduce implementation drift.
  • Optional contamination accommodates diverse use cases (research vs. clinical).

2. Coverage Practicality

  • 15x covers most germline needs; 20x adds marginal value.
  • Median/mean ratio is robust for evenness (per ICGC experience).

3. Ecosystem Alignment

  • AoU metrics inform but don’t dictate standards.
  • Tumor-specific methods deferred to somatic roadmap.

2023-02-21

Decision

1. Repository Structure & Governance

  • Folder Organization:
    • /roadmap: High-level deliverables
    • /definitions: Metric specifications (one .md per metric)
    • /implementations: Reference workflows (Singapore NPM, Argo)
  • Approval Process: Merge PRs after 72-hour review window.

2. Metric Definition Standards

  • Template:

```markdown
### `ID`
`mean_autosome_coverage`

### `Description`
Mean coverage depth across autosomes, excluding assembly gaps.

### `Implementation`
- **Tools**: mosdepth (default), samtools
- **Parameters**: `-Q 13`, `--region autosomes.bed`
- **Validated Ranges**: 25–35x (WGS)

### `Notes`
Aligns with All of Us 20x threshold but uses 15x for broader utility.
```

Rationale

1. Clarity & Reproducibility

  • Plain-English definitions prevent tool-specific assumptions.
  • Dockstore integration enables cloud-scale validation.

2. Biological Utility

  • Dual mapping metrics address distinct quality aspects.
  • Whitelists ensure consistent autosomal coverage calculations.

3. Ecosystem Alignment

  • AoU/ICGC cross-pollination avoids duplication.
  • Modular structure accommodates future expansions (e.g., somatic)

2023-01-17

NA

2022-12-13

Decision

1. GitHub Governance

  • Protected main branch: Require 2-3 reviews before merging (Maxime/Lindsay to configure)
  • Process documentation: Adopt conventions from other GA4GH Work Stream repos (Lindsay to draft)
  • Repository naming: Retain current name (wgs-sample-qc), clarify scope in README

2. SC Proposal Preparation

  • Submit questionnaire by Jan 12, with slides to follow
  • Key deliverables:
    • Standardized JSON schemas for 7 core metrics
    • Reference implementations (Singapore-NPM, AGHA)
    • Benchmarking protocol (GIAB + 10X Linked-Read datasets)

3. Technical Scope

  • Exclude sample/tumor metadata: QC outputs will not include genetic or clinical data
  • API development: Defer to later phase; focus on file-based JSON outputs initially

4. Functional Equivalence

  • Use metrics to compare pipelines (e.g., via statistical thresholds)
  • Adopt alignment statistics framework from alignstats as reference

Rationale

  • Controlled Collaboration: Protected branches ensure stability while allowing community input.
  • Clear Boundaries: Separating QC metrics from sample metadata prevents misuse and simplifies adoption.
  • SC Alignment: Explicit deliverables (JSON schemas, implementations) demonstrate production readiness.
  • Interoperability: File-first approach ensures compatibility with existing workflows; API can be added later.

2022-09-13

Decision

1. Finalize Reference Implementations

  • Singapore-NPM: Complete DSL2 transition and local resource handling (Nicolas/Maxime)
  • AGHA: Test and provide feedback on implementation (Sehrish)

2. Standardize Benchmarking Data

  • Use GIAB Baid dataset (PCR-free/PCR-plus WGS/WES) as primary benchmark
  • Supplement with AGHA’s 10X Linked-Read (poor-quality data)
  • Host datasets on AWS Open Data/Google Cloud for accessibility (Oliver)

3. Refine Metric Definitions

  • Revise descriptions for the 7 core metrics (Nicolas/Maxime)
  • Align JSON schemas with GA4GH standards

4. Adjust Roadmap

  • Move ontology development to a later milestone
  • Add GA4GH Standards Interoperability as a near-term milestone
  • Prioritize pipeline validation over theoretical standardization

Rationale

Practical Focus:

Core metrics and implementations provide immediate value for cross-pipeline comparison.

Data Diversity:

Combining GIAB (gold standard) and problematic datasets (10X Linked-Read) ensures robust validation.

Technical Scalability:

DSL2 improves workflow portability. Local resource handling reduces runtime dependencies.

Roadmap Realism:

Separating ontology work allows focus on production-ready deliverables.

2022-07-26

Decision

1. Prioritize 5 core metrics for initial standardization:

  • Mean autosomal coverage
  • Mapping quality distribution
  • Contamination estimate
  • Insert size metrics
  • Read depth distribution

2. Maintain two implementation tracks:

  • Minimal implementation: Core 5 metrics only (for benchmarking/comparison)
  • Extended implementation: Full metric set (for exploratory analysis)

3. Improve workflow portability:

  • Resolve environment-specific hardcoding in Nextflow/Docker
  • Implement versioned releases with changelog
  • Adopt GA4GH tool registry (Dockstore) for sharing

4. Curate test datasets:

  • Include both high-quality (1000 Genomes) and problematic samples
  • AGHA to share failed runs (Azure/AWS)
  • Explore ICGC blacklisted samples where possible

Rationale

1. Practical adoption:

Core metrics address critical QC needs while remaining manageable. Dual implementation balances standardization with flexibility.

2. Technical robustness:

Focused metrics simplify cross-tool comparison. Version control ensures reproducibility.

3. Validation requirements:

Problematic samples are essential for testing metric sensitivity. A public/private dataset mix provides comprehensive coverage.

2022-05-17

Decision

1. Defer large-scale ontology development and focus on:

  • Finalizing JSON schemas for key metrics (e.g., coverage, contamination)
  • Publishing reference implementations (Nextflow/Docker workflows)
  • Documenting minimum metadata requirements (e.g., refget IDs, tool versions)

2. Adopt a hybrid approach for ontologies:

  • Use existing terms (e.g., GenEpiO) where possible.
  • Define only critical new terms (e.g., "autosomal_depth_of_coverage").
  • Link JSON fields to ontology IDs (e.g., "metric_type": "GENEPIO:XXXX").

3. Curate benchmarking datasets:

  • Prioritize real-world problematic samples (AGHA to share failed runs).
  • Supplement with ICGC/CINECA data where accessible.
  • Avoid synthetic data (per Ivo’s feedback on biological realism).

4. Align with GA4GH tooling:

  • Use Dockstore for workflow sharing.
  • Adopt refget for reference standardization.

Rationale

  • Practical adoption: JSON schemas and workflows provide immediate value to labs.
  • Resource efficiency: Ontology work is time-intensive; defer to post-implementation.
  • Real-world validation: Problematic samples (not just GIAB) are critical for QC use cases.
  • Interoperability: Hybrid JSON+ontology approach balances rigor and flexibility

2022-04-12

Decision

1. Adopt OBO Foundry principles for ontology development:

  • Reuse existing terms from GenEpiO where applicable
  • Create new terms following best practices (precise labels, technical definitions)
  • Maintain separation between concepts and implementation details

2. Pilot with mean autosomal coverage:

  • Create 3 new terms (depth/breadth of autosome coverage)
  • Reuse 2 existing GenEpiO terms
  • Structure as:

```
sequencing metric
└── coverage metric
    ├── depth of autosome coverage (new)
    └── breadth of autosome coverage (new)
```

3. Standardize parameters:

Use "reference genome accession" (refget-compliant) Replace "BED file name" with "genomic region filter URI" Rename "minimum mapping quality score" to "minimum mapping quality threshold"

4. Implement layered specification:

  • Ontology layer: Universal definitions (e.g., "depth of coverage").
  • Application profile: Required fields for implementations (JSON schema).
  • Guidance documentation: Tool-specific parameters (e.g., mosdepth settings).

Rationale

1. Interoperability:

  • OBO Foundry compliance ensures compatibility with existing biomedical ontologies
  • GenEpiO provides established framework for genomic quality terms

2. Precision:

  • Technical definitions disambiguate metrics (e.g., "autosomal" vs whole-genome)
  • Parameter standardization prevents implementation drift (e.g., refget for references)

3. Scalability:

  • Pilot approach validates methodology before expanding to other metrics
  • Layered design accommodates both universal concepts and tool-specific needs

4. FAIR compliance:

  • URIs for region filters enable reproducibility
  • Versioned terms maintain backward compatibility

2022-03-08

Decision

1. Adopt GENEPIO ontology as foundation for QC metric standardization, with initial focus on:

  • Mean autosomal coverage as pilot term
  • Alignment with existing GENEPIO quality terms (e.g., GENEPIO_0000087)

2. Develop layered implementation:

  • Reference ontology: Generic metric definitions (platform/algorithm-agnostic)
  • Application profiles: JSON schemas specifying required fields for implementations
  • Controlled vocabularies: For parameters (e.g., mapping quality thresholds)

3. Curate reference datasets from:

  • Genome in a Bottle (GIAB) Ashkenazim Trio
  • ILMN Dragen 1kGP AWS dataset
  • CINECA synthetic data (supplemental)

4. Establish governance model:

  • GENEPIO team maintains core ontology
  • GA4GH community extends through application profiles
  • Implementation feedback loop via GitHub

Rationale

1. Avoids duplication by building on established GENEPIO framework

2. Balances flexibility and rigor:

  • Ontology provides semantic backbone
  • JSON profiles enable practical adoption

3. Comprehensive validation:

  • GIAB data provides gold-standard references
  • Multiple platforms reveal edge cases

4. Sustainable maintenance:

  • Leverages existing GENEPIO infrastructure
  • Clear separation between universal definitions and implementation-specific needs

2022-01-18

Decision

1. Expand proof-of-concept analysis to:

  • Include Genome in a Bottle trio datasets
  • Compare Dragen vs. non-Dragen implementations (via CWL workflows)
  • Analyze diagnostic vs. research pipeline differences (SickKids use case)

2. Develop a QC metrics ontology with:

  • Standardized JSON schema (initially non-LD, transitionable to JSON-LD)
  • Controlled vocabularies for key fields (reference, filters, warning flags)
  • Alignment with GA4GH standards (refget for references, future BED file standards)

3. Define implementation priorities:

  • Core utility tools for metric calculation (independent of alignment/variant calling)
  • Warning/flagging system for outlier metrics
  • Top-level metadata (pipeline version, reference checksums, UMI info if available)

4. Engage GA4GH work streams to:

  • Formalize ontology with Melanie Courtot (EBI)
  • Align with refget and file format teams

Rationale

Evidence-based standardization:

  • SickKids comparison showed pipeline differences are smaller than metric definition differences (e.g., coverage calculations).
  • Expanding to public datasets (GIAB) ensures broader applicability.

Interoperability:

  • Plain JSON allows immediate implementation while enabling future transition to JSON-LD.
  • Refget checksums provide precise reference tracking without redundant metadata.

Practical adoption:

  • Decoupled QC tools enable integration with diverse pipelines (Dragen/non-Dragen).
  • Warning flags address real-world needs for outlier detection (e.g., diagnostic labs).

Community alignment:

  • Leveraging GA4GH ontologies (DUO) ensures compatibility with existing standards.
  • CWL workflows support reproducibility across platforms.

2021-11-30

Decision

1. Adopt an ontology-based approach for metric standardization, leveraging Melanie Courtot's expertise from EBI and building on DUO ontology experience

2. Structure implementation documentation with:

  • Clear reporting guidelines with examples
  • Implementation-specific sections (organized as a table of contents for submitted tools/scripts)
  • Controlled vocabulary section for standardized fields

3. Implement reference standardization using:

  • Mandatory refget checksums for precise reference identification
  • Optional human-readable reference descriptions

4. Expand proof-of-concept with SickKids diagnostic vs genome center comparisons to identify:

  • Pipeline version differences
  • Unexplained metric variations

5. Separate metadata architecture:

  • Top-level metadata section for global properties
  • Metric-specific fields (marked N/A when not applicable)

Rationale

Ontology benefits:

Enables semantic interoperability across implementations. Leverages existing GA4GH expertise (DUO experience). Provides structured framework for community contributions.

Implementation clarity:

Multiple real-world examples (SickKids, PRECISE) demonstrate practical adaptation. Structured documentation supports diverse tool integration.

Reference standardization:

Refget ensures precise reproducibility. Maintains flexibility for different genome builds. Complements ongoing GA4GH work on BED file standardization.

Evidence-based development:

Diagnostic vs research pipeline comparisons reveal practical challenges. Guides refinement of controlled vocabulary.

Scalable architecture:

Separated metadata accommodates both global and metric-specific needs. The "N/A" convention maintains structure while allowing implementation flexibility.

2021-10-26

Decision

1. Adopt JSON as the primary output format for QC metrics, structured as:

  • sample_id,
  • qc_metrics (aligned with the minimal definitions document),
  • reference_metadata (e.g., refget checksum, BED file path).
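A sketch of one such output; only the three top-level keys come from the decision, and every metric name and value below is an illustrative assumption:

```python
import json

report = {
    "sample_id": "NA12878",
    "qc_metrics": {
        "mean_autosome_coverage": 31.4,
        "pct_reads_properly_paired": 97.2,
    },
    "reference_metadata": {
        "refget_checksum": "placeholder-refget-digest",
        "bed_file": "autosomes.bed",
    },
}
print(json.dumps(report, indent=2))
```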

2. Separate global BAM properties (e.g., duplicate marking, UMI usage) from per-metric calculations, with:

  • A header section documenting filters/assumptions (e.g., clipping, overlapping bases),
  • Optional tool-specific annotations (e.g., samtools flags).

3. Prioritize runtime efficiency:

  • Allow streaming computation (e.g., samtools mpileup),
  • Avoid reprocessing BAMs when possible (leverage existing headers/tools like mosdepth).

4. Benchmark implementations using NA12878, focusing on:

  • Coverage discrepancies (autosomes vs. full genome),
  • Variant counting methods (SNPs/indels categorization),
  • Contamination metric harmonization.

5. Engage GA4GH Work Streams (e.g., VRS, refget) to align with broader standards, but defer BAM header embedding due to impracticality.

Rationale

Interoperability:

JSON ensures machine-readability and tool-agnostic adoption.

Transparency:

Global headers clarify assumptions (e.g., filters applied) without bloating per-metric outputs.

Performance:

Streaming-compatible tools (e.g., samtools) enable scalable QC for large cohorts.

Validation:

NA12878 comparisons expose definitional nuances (e.g., coverage thresholds, overlapping bases).

Community Alignment:

Leveraging GA4GH standards (e.g., refget) future-proofs the framework, while avoiding premature BAM header modifications.

2021-09-14

Decision

1. Define a core set of QC metrics for germline WGS, covering:

  • Alignment (e.g., coverage, mapping quality),
  • Variant calling (high-level metrics only),
  • Sample contamination (with explicit type definitions),
  • Yield & quality (e.g., read quality, duplication).

2. Standardize metric definitions with:

  • Unique names/IDs,
  • Clear descriptions (including assumptions like clipping or filters),
  • Preferred open-source tools (avoiding proprietary solutions).

3. Allow flexible implementations but require documentation of methodology (e.g., reference regions, tool versions).

4. Optimize for runtime—avoid recomputing metrics when avoidable (e.g., reuse BAM/CRAM stats).

5. Validate via GitHub collaboration—test implementations on NA12878 and refine definitions iteratively.

Rationale

Comparability: Ensures metrics are interpretable across pipelines.

Practicality: Prioritizes widely supported tools and efficient computations.

Adaptability: Definitions accommodate multiple technologies (e.g., short/long reads).

Transparency: Documenting tool-specific assumptions mitigates misinterpretation (e.g., rounding differences).

Community-driven: GitHub-based validation ensures real-world usability.

2021-08-10

Decision

The group decided to define a foundational set of WGS QC metrics, focusing first on those that are:

  • Technically feasible to obtain from aligned data (BAM/CRAM)
  • Widely used and accepted across institutions
  • Relevant for determining sample usability and data quality
  • Able to be calculated independently of platform or variant calling pipeline (where possible)

Initial Metric Categories and Examples

| Category | Example Metrics (Proposed) |
| --- | --- |
| Coverage | Mean read depth, % of genome covered at ≥X (e.g., 20x), genome completeness, absolute/effective coverage |
| Mappability | % mappable reads, % mapped, % aligned, cross-species contamination |
| Library/Read Quality | % duplication, median fragment length, insert size, ATCG dropout, uniformity of coverage |
| Yield | Total bases > Q30 (e.g., Gb > Q30), total reads |
| Contamination | Estimated contamination (e.g., verifyBAMID, % mixed reads), noise proxy metrics |
| Callable Regions | % callable bases (e.g., genome, OMIM, ClinVar), reference genome version and regions covered |
| Technology-specific flags | Low-quality base prevalence, read length bias, differences in Q-score distribution (e.g., Illumina vs PacBio) |

Additional distinctions:

Metric format: scalar (e.g., % mapped) vs. distributional (e.g., insert size histogram)

Metric derivation: Derived directly from aligned reads, with clear documentation of computational choices (e.g., read depth calculation approach)

Standardization approach: Each metric will be defined both in plain language and optionally mapped to a reference implementation (e.g., Nextflow module)

Rationale

1. Wide applicability: The selected metrics are broadly relevant and feasible for institutions to report without needing extensive pipeline changes.

2. Reproducibility: Focusing on BAM/CRAM-level metrics avoids reliance on downstream tools that vary more widely (e.g., variant callers).

3. Interoperability: Enables initial comparison across institutions even with varied processing approaches.

4. Incremental adoption: Allows organizations to begin contributing metrics from their current pipelines and participate in convergence over time.

5. Clarity: A living document and GitHub repository will track metric definitions, including optional reference code.

2021-06-29

Decision

1: Start with a Minimal, Agreed-Upon Set of QC Metrics

2: Develop Clear Definitions for Each Metric

3: Separate File Format/Schema from QC Computation Pipeline

4: Use the GitHub Repository and a Shared Google Doc for Collaborative Metric Listing

5: Exclude Somatic Sequencing from Scope

6: Ensure QC Metric Definitions are Build-Agnostic Where Possible

7: Encourage Metrics to Be Accompanied by Contextual Information

8: Create a Mailing List for Ongoing Communication

Rationale

1.

  • Initiatives have diverse pipelines and stages of QC implementation.
  • A common denominator will enable cross-project communication and comparability without requiring changes to local pipelines.
  • Enables tracking and understanding across systems with varied infrastructure maturity (e.g., ingest-only QC vs. end-of-project QC).

2.

  • Different initiatives use the same metric names with differing meanings or computational methods.
  • Ambiguity in metric definitions leads to miscommunication both within and across initiatives.
  • Needed for interoperability and trust when sharing data or results.

3.

  • Some participants only need a standardized way to share and store metrics (file format/schema).
  • Others also want a pipeline to calculate them.
  • Decoupling these allows flexibility: adoption of the schema without forcing a specific implementation.

4.

Facilitates asynchronous collaboration. Enables commenting and harmonization of metric names, definitions, and formats. Promotes transparency and traceability of decisions made.

5.

  • Somatic sequencing introduces additional complexity.
  • Current efforts focus on germline WGS; including somatic would dilute the effort and make consensus more difficult.

6.

  • Some initiatives use different reference genomes.
  • Metrics need to remain interpretable across builds for consistent reporting and downstream use.

7.

  • Tool outputs can differ even for the “same” metric, depending on implementation details or assumptions.
  • Making the computation method transparent helps others interpret the results accurately and decide on fitness for use.

8.

  • Supports continuous engagement.
  • Avoids fragmented updates and misaligned efforts between meetings.