gpra-c¶

tags: gene expression, promoter sequences, GPRA, dual-reporter assay, yeast, synthetic biology

tl;dr¶

gpra-complex (aka gpra-c) is a synthetic biology task measuring gene expression in yeast. gpra-c solves the issue of finite natural gene-promoter diversity by relying on random oligonucleotides.

overview¶

The central dogma suggests that genes exist to produce RNA, which is then used to manufacture proteins. Upstream promoter elements heavily regulate the amount of RNA a gene produces, known as its gene expression. Because genes (and their adjacent promoters) tend to be constrained evolutionarily, there is limited sequence diversity in and around gene regions – including promoters. This means that only a small, finite set of natural promoter sequences (~20k in the human reference genome, hg38) is ‘trainable’ for any given model. To circumvent this bottleneck, the de Boer lab has developed ways of injecting randomized promoter sequences (complete with their own control) into yeast, known as Gigantic Parallel Reporter Assays (GPRAs).

Class labels are ordinal and range from zero (0, minimal expression) to seventeen (17, maximal expression). Intermediate scores of one to sixteen (1-16) represent increasing levels of gene expression.

example models¶

model	\(\rho\)
T5 (baseline)	84.6738
nt-v2-500m	72.6726
Evo2_1B_base	72.6487
nt-v2-250m	72.436
Caduceus-PS	72.4355
5-mer LinSVR (baseline)	36.3022
GC-content (baseline)	20.5533
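
As a rough illustration of how a score like those above might be computed, the sketch below derives a GC-content feature and rank-correlates it with ordinal labels. It assumes that \(\rho\) denotes Spearman's rank correlation reported on a 0-100 scale, which the table itself does not state. The toy sequences, toy labels, and helper names are hypothetical, and this is not necessarily how the GC-content baseline was built.

```python
def gc_content(seq):
    """Fraction of G/C bases in a DNA string (case-insensitive)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # extend the run while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# toy data, invented for illustration only
toy_seqs = ["ATATATAT", "ATGCATAT", "GGCCATAT", "GGCCGGCC"]
toy_labels = [0, 5, 9, 17]

features = [gc_content(s) for s in toy_seqs]
score = 100 * spearman_rho(features, toy_labels)  # assumed 0-100 scaling
```

Because only ranks matter, any monotonic rescaling of the predictions leaves the score unchanged. Note that `spearman_rho` divides by zero if either input is constant, so a real evaluation would need to guard against that case.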

interpretation¶

gpra-c is a difficult but insightful task. Although its dynamic range (the typical upper and lower bounds of scores) is slightly constrained, it nonetheless produces rankings that correlate with model quality and with performance on other tasks (i.e. the newest, fanciest models tend to do better).

example usage¶

first, clone the dataset from huggingface (make sure you have Git LFS installed):

git clone https://huggingface.co/datasets/guanine/gpra_c

then, read the file into main memory with your favorite file parser

loading with pandas¶

import pandas as pd

# 1per is the recommended few-shot training split
# there are no bed files for GPRA, as its random sequences are not in a reference genome
train_dat = pd.read_csv('gpra_c/1per/1per.csv', sep=',') # csv
train_dat.head()

finally, read the sequence straight from the dataframe; because GPRA sequences are not in a reference genome, no genome reader (e.g. twobitreader) is needed

sequences are directly available¶

CONTEXT_SIZE = 8192 # change this for your model

row = train_dat.iloc[0]
seq = row['seq']

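# we recommend pre/appending a yeast scaffold for large context models, e.g.
# (scaf_a and scaf_b are not defined here; supply your own flanking sequences)
seq = scaf_a + seq + scaf_b

# we recommend checking for truncation; this assumes the scaffolds
# pad seq to exactly CONTEXT_SIZE
assert len(seq) == CONTEXT_SIZE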
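
One way the scaffold step above could be made to hit the context size exactly is sketched below. The function name `embed_in_scaffold`, the choice to split the padding evenly between the two flanks, and the toy scaffolds are all assumptions made for illustration; the placeholder scaffolds are not real yeast sequence.

```python
def embed_in_scaffold(seq, scaf_a, scaf_b, context_size):
    """Flank seq with scaffold bases so the result is exactly context_size long.

    Uses the bases of scaf_a nearest the insert (its 3' end) and the bases of
    scaf_b nearest the insert (its 5' end); raises rather than truncating seq.
    """
    pad = context_size - len(seq)
    if pad < 0:
        raise ValueError("seq is longer than context_size and would be truncated")
    left = pad // 2
    right = pad - left
    if left > len(scaf_a) or right > len(scaf_b):
        raise ValueError("scaffolds are too short to fill the context window")
    upstream = scaf_a[len(scaf_a) - left:]  # avoids the scaf_a[-0:] pitfall
    downstream = scaf_b[:right]
    return upstream + seq + downstream


# placeholder scaffolds, NOT real yeast sequence
toy_scaf_a = "A" * 20
toy_scaf_b = "C" * 20
toy_seq = "GATTACA"

padded = embed_in_scaffold(toy_seq, toy_scaf_a, toy_scaf_b, context_size=16)
```

Raising an error when the scaffolds are too short, instead of silently returning a shorter string, keeps the length check explicit for whichever model context size is in use.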

build details¶

Compared to the source dataset, gpra-c has undergone slight refinement. Specifically, datapoints of non-canonical length (i.e. those differing from the standard 80 bp of randomized sequence) have been pruned. While the variability in length likely represents some biological signal, it trivialized a significant portion of the final scoring (as non-80 lengths clustered heavily in particular class labels, making length alone predictive).
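
The pruning described above can be sketched as a simple length filter. This is a minimal illustration and not the actual build script: it assumes each record's `seq` field holds only the randomized region (the real data may include constant flanks), and the `label` field name and toy records are hypothetical.

```python
from collections import Counter

CANONICAL_LENGTH = 80  # bp of randomized sequence, per the text above


def prune_noncanonical(records, length=CANONICAL_LENGTH):
    """Keep only records whose sequence has the canonical length."""
    return [rec for rec in records if len(rec["seq"]) == length]


# toy records, invented for illustration; 'label' is a hypothetical field name
toy_records = [
    {"seq": "A" * 80, "label": 3},
    {"seq": "C" * 79, "label": 0},
    {"seq": "G" * 80, "label": 11},
    {"seq": "T" * 81, "label": 17},
]

# tally lengths first to see how much data the filter would drop
length_counts = Counter(len(rec["seq"]) for rec in toy_records)
kept = prune_noncanonical(toy_records)
```

Tallying lengths per class label before filtering is one way to check whether length alone predicts the label, which is the leakage the pruning is meant to remove.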

controlled factors¶

sequence length

appears in¶

GUANinE v1.0

original citation¶

Eeshit Dhaval Vaishnav, Carl de Boer, & Aviv Regev. (2022). The evolution, evolvability and engineering of gene regulatory DNA. https://doi.org/10.1038/s41586-022-04506-6