Under development
The cadd-snv datasets are still being refined, and information on this page is liable to change.
cadd-snv¶
tags: deleteriousness, variant interpretation, perturbation, CADD, SNV
tl;dr¶
cadd-snv is a single-nucleotide variant modeling task, where models attempt to separate a set of ‘benign’ variants from a set of ‘deleterious’ (simulated) variants, as in the original CADD.
overview¶
Aligning a personal (i.e. real) genome to the a reference genome creates a variant call file – a list of all divergences from the reference sequence. These divergences can often be atomized into variants, as many (not all) regions of the genome are additive (i.e. non-epistatic). Single-nucleotide variants, being the shortest and simplest type of variant (just 1 bp substitutions), have absorbed the bulk of prior variant effect research.
- The original CADD model demonstrated that deleterious (or ‘evolutionarily harmful’) variants can be modeled in simulation by contrasting:
proxy benign variants: alleles derived from the human-chimp common ancestor, versus
proxy deleterious variants: simulated random mutations
This allows for the construction of an ancestrally-biased deleteriousness estimator – one independent of large-scale population databases like gnomAD. While such datasets can provide invaluable information about variant frequency (frequent variants are almost cetainly not deleterious), they themselves are often biased due to volunteerism and historical biases in medicicine.
build details¶
[Under development]
appears in¶
original citations¶
Kircher M, Witten DM, Jain P, O’Roak BJ, Cooper GM, Shendure J. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet. 2014 Feb 2. doi: 10.1038/ng.2892. PubMed PMID: 24487276.
Rentzsch P, Witten D, Cooper GM, Shendure J, Kircher M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 2018 Oct 29. doi: 10.1093/nar/gky1016. PubMed PMID: 30371827.
Schubach M, Maass T, Nazaretyan L, Röner S, Kircher M. CADD v1.7: Using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions. Nucleic Acids Res. 2024 Jan 5. doi: 10.1093/nar/gkad989. PubMed PMID: 38183205.