dnase_propensity

tags: DNase hypersensitive sites, DHS, accessibility, functional genomics, sequence function, reference sequences

tl;dr

dnase_propensity (aka dnase_prop) measures the tendency of a DNA sequence to be accessible, i.e. cut or cleaved by the DNase enzyme. Unlike the typical concept of a DNase Hypersensitive Site (DHS), this propensity, or disposition, is agnostic to cell-types and instead measures basal sequence accessibility across tissues and organs.

overview

For a DNA sequence to function, it cannot be densely packed away in closed chromatin – it must be ‘unpacked,’ i.e. accessible. The simplest measure of ‘unpackedness’ is a sequence’s ability to be digested by a DNase enzyme, which naturally slices up any ‘loose’ DNA. The observation that certain regions of the genome are regularly, if not always, accessible led to the idea of (functional) elements being located in “DNase-hypersensitive” positions across the genome, hence the task’s name.

dnase_prop was constructed on the human reference genome hg38 based on a combination of DHS tracks from the SCREEN v2 database, a subset of ENCODE. Sequence labels \(y\) correspond to the likelihood a sequence would be annotated as accessible (a DHS) – an essential indicator of sequence function. Although dnase_prop is cell-type-agnostic, cell-type-specific models will perform well on this task due to the nature of its construction. The high-level statistical model is \(y \sim p(\textrm{DHS} \ | \ X)\) for \(X\) a given DNA sequence.

Class labels are ordinal, i.e. they represent a ranking, and they range from zero (0, completely inaccessible) to four (4, almost always accessible, at least across measured cell types). Intermediate scores of one to three (1-3) represent some degree of cell-type-specific accessibility, although the label does not say which cell types.
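
Because the labels are ordinal, a rank-based metric is a natural way to score predictions. The sketch below assumes that the \(\rho\) reported under example models is Spearman's rank correlation scaled by 100; this page does not state that explicitly. The function names are illustrative and not part of any released tooling.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(y_true, y_pred):
    """Pearson correlation of the rank-transformed inputs.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(y_true), average_ranks(y_pred)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# hypothetical labels (0-4) and real-valued model outputs
labels = [0, 1, 2, 3, 4, 4]
scores = [0.1, 0.3, 0.2, 0.8, 0.9, 0.95]
print(100 * spearman_rho(labels, scores))  # assumed x100 scaling
```

Only the ordering of the model outputs matters here, so a model does not need to emit values on the 0-4 scale to be scored this way.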

example models

| model | \(\rho\) |
| --- | --- |
| Borzoi | 70.6892 |
| SEI | 70.4688 |
| Enformer (Pre-output) | 68.0818 |
| Beluga | 64.9577 |
| Enformer | 64.9164 |
| 5-mer LinSVR (baseline) | 36.3022 |
| GC-content (baseline) | 20.5533 |
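
The two baselines rely on simple sequence features. The sketch below shows one plausible way to compute G/C content and normalized 5-mer frequencies; the exact featurization behind the reported baselines is not described on this page, so the function names and normalization choices here are assumptions.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G/C among unambiguous (A/C/G/T) bases; 0.0 if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def kmer_frequencies(seq, k=5):
    """Frequencies of all 4**k k-mers, in lexicographic A/C/G/T order.

    k-mers containing ambiguous bases such as N are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in vocab)
    return [counts[kmer] / total if total else 0.0 for kmer in vocab]


features = kmer_frequencies("ACGTACGTACGGTTAACC", k=5)
print(len(features), gc_content("ACGTACGTACGGTTAACC"))
```

A vector like `features` could then be passed to any linear regressor; which regressor settings were used for the 5-mer LinSVR baseline is not stated here.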

interpretation

Accessibility is one of the most fundamental features of sequence function, and models scoring well on dnase_prop likely recognize functional genomic elements (including cis-regulatory elements) in the reference genome. That being said, while a dnase_prop label corresponds to the degree of cell-type-specific accessibility, it does not indicate which cell types a sequence is accessible in, nor does it directly measure a model’s ability to predict a specific cell type’s DHS track.

Additionally, while accessibility suggests sequence function, a model that can predict dnase_prop is not guaranteed to possess functional element understanding; such a model may not be able to infer sequence function in the reference genome (including function indicated by histone marks, e.g. H3K4me3 for promoters), much less be able to estimate the effects of variants.

Supervised models like DeepSEA, Enformer, Borzoi, etc. are prime examples of models built for dnase_prop: their own training data includes the same DHS tracks that make up dnase_prop. To restate, dnase_prop is in-distribution for supervised models trained on ENCODE DHS tracks, and a score below 100 from such a model should be understood as the model having underfit its training data.

see also

The most closely related task is ccre_propensity, which builds on top of dnase_prop’s measure of accessibility to assess sequence function.

example usage

first, clone the dataset from huggingface (make sure you have Git LFS installed):

git clone https://huggingface.co/datasets/guanine/dnase_propensity

then, read the file into main memory with your favorite file parser

loading with pandas
import pandas as pd

# 1per is the recommended few-shot training split
train_dat = pd.read_csv('dnase_propensity/bed/1per/1per.bed', sep='\t')
train_dat.head()

finally, splice the sequence out with your preferred genome reader, e.g. twobitreader

accessing sequences with twobitreader
from twobitreader import TwoBitFile

# download from https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.2bit
hg38 = TwoBitFile('hg38.2bit')

CONTEXT_SIZE = 8192 # change this for your model

row = train_dat.iloc[0]
ch = row['#chr']
st = row['center']-CONTEXT_SIZE//2
en = row['center']+CONTEXT_SIZE//2

seq = hg38[ch][st:en]

# optionally convert your sequence to uppercase before tokenizing it, etc
seq = seq.upper()
assert len(seq)==CONTEXT_SIZE # we recommend checking for truncation
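
The truncation check above will fail for loci that sit within half a context window of a chromosome end. One way to handle those cases, sketched here under the assumption that padding with N is acceptable for your model, is to clamp the slice to the chromosome and pad the remainder. The helper `extract_window` is hypothetical; it accepts any object that supports slicing into a string, and takes the chromosome length as an optional argument in case the object does not support `len()`.

```python
def extract_window(chrom, center, context_size, chrom_len=None):
    """Return exactly `context_size` uppercase bases centered on `center`.

    Positions that fall off either end of the chromosome are filled with 'N'.
    """
    if chrom_len is None:
        chrom_len = len(chrom)
    start = center - context_size // 2
    end = start + context_size  # keeps the length exact for odd sizes too
    left_pad = max(0, -start)
    right_pad = max(0, end - chrom_len)
    core = chrom[max(0, start):min(end, chrom_len)]
    return ("N" * left_pad + core + "N" * right_pad).upper()


# toy chromosome standing in for a real genome sequence
toy_chrom = "acgtacgtac"
print(extract_window(toy_chrom, center=1, context_size=6))
```

With a real genome reader, the first argument would be the per-chromosome sequence object (e.g. `hg38[ch]` in the snippet above) rather than a plain string.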

build details

Basal accessibility is approximated by integrating out the signal unique to cell types, i.e. \(\int_{c \in C} \ p(\textrm{DHS} \ | \ X, c)\) for \(c\) a given cell line or cell type. Specifically, for the over 700 DHS tracks in SCREEN v2, we consider the discrete sum \(y(X) \propto \sum_{t \in \textrm{DHS tracks}} \ \alpha_t \ \cdot \ \textbf{1}_{t, \textrm{DHS}}(X)\), where the boolean indicator function \(\textbf{1}_{t, \textrm{DHS}}(X)\) records whether the signal at the locus of sequence \(X\) is called as a peak (1) or non-peak (0) in track \(t\), in typical DHS-track fashion. Of note is the weighting \(\alpha_t\), which allows us to downweight cancerous cell lines by half (to help mitigate cancer-specific accessibility signals).

Because SCREEN v2 makes use of consensus peak calling, the peak-called loci for reference sequences are the same across tracks – this allows for an otherwise difficult-to-pinpoint function to be well-defined.

Finally, for ease of modeling, the raw \(y(X)\) scores are binned from one to four (1-4), and a partially G/C-balanced control set of inaccessible sequences is added to the dataset with labels of zero (0).
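
As a rough illustration of the construction described above, the sketch below computes the weighted sum over per-track peak calls and then bins the result into labels 1-4. The half weight for cancer lines comes from the text; the equal-width bin edges, the toy data, and the function names are assumptions, since the actual binning scheme is not given here.

```python
def raw_scores(peak_calls, is_cancer, cancer_weight=0.5):
    """Weighted count of tracks calling a peak at each locus.

    peak_calls[t][i] is 1 if locus i is a peak in track t, else 0.
    is_cancer[t] marks tracks from cancerous cell lines (alpha_t = 0.5).
    """
    weights = [cancer_weight if cancer else 1.0 for cancer in is_cancer]
    n_loci = len(peak_calls[0])
    return [
        sum(w * track[i] for w, track in zip(weights, peak_calls))
        for i in range(n_loci)
    ]


def bin_score(score, max_score, n_bins=4):
    """Map a raw score to a label in 1..n_bins (equal-width bins, an assumption)."""
    frac = score / max_score
    return 1 + min(n_bins - 1, int(frac * n_bins))


# toy example: 4 tracks (2 from cancer lines) x 3 consensus loci
peak_calls = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]
is_cancer = [False, False, True, True]

scores = raw_scores(peak_calls, is_cancer)
max_score = sum(0.5 if cancer else 1.0 for cancer in is_cancer)
labels = [bin_score(s, max_score) for s in scores]
print(scores, labels)
```

Control sequences with label 0 are not produced by this computation; per the text, they come from a separate set of inaccessible sequences.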

controlled factors

  • repetitive elements (partial)

  • G/C content (partial)

  • immortalized cancer line accessibility (partial)

appears in

GUANinE v1.0

original citation

The ENCODE Project Consortium, Moore, J.E., Purcaro, M.J. et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710 (2020). https://doi.org/10.1038/s41586-020-2493-4