Ever since listening to Jacob Kimmel on Dwarkesh’s podcast, I have been wondering about the following question: given that you inject an LNP-mRNA vaccine (e.g. the BioNTech-Pfizer vaccine) at a random location, for example your arm, how do you make sure that whatever protein is encoded on your mRNA is only expressed in the right cell type?

This actually matters quite a lot because your LNP will concentrate in a few off-target locations, and high expression in those cell types may cause unwanted toxicity. From a BioNTech EU report:

Over 48 hours, distribution was mainly observed to liver, adrenal glands, spleen and ovaries, with maximum concentrations observed at 8-48 hours post-dose.

NewLimit, Jacob Kimmel’s longevity company, cleverly exploits this skewed biodistribution by targeting liver cells for their first program. However, all current and future programs targeting other cell types will have to deal with this issue in the near term. The chemists among you would raise a finger and say: “Oh, but we will make sure the LNP doesn’t go into the wrong cells in the first place!”. This is a very valid argument, but it is just one lever we can pull. The other lever that I got interested in is: assuming your LNP does go to the wrong cell, how do we minimize payload expression while we wait for the mRNA to get degraded?

Cell-type-specific mRNA sequence generation as latent-space optimization

At a recent hackathon, a few smart teams found a bunch of interesting solutions to the problem of how to design mRNA sequences with cell-type-specific expression. One of them was a form of latent-space Bayesian optimization with a simple MLP encoder-decoder setup and RiboNN as the translational efficiency (TE) predictor. Apparently this works fairly well. One worry with this kind of setup, though, is that you are inherently assuming your TE predictor is reliable. This is not true in general, and you might end up generating sequences that live in regions where your predictor is inaccurate. Sequence-to-expression models are generally known to do a lot of “look-up” predictions, where the worst sequence to predict is one with enough sequence identity to fool the predictor and just enough difference to cause totally different expression. This sounds an awful lot like what happens in latent-space optimization, where you start from a known sequence and potentially push it towards extreme expression configurations where your predictor is uncertain. One potential way around this would be to bake into the model that generated sequences have to follow the natural distribution of mRNA sequences.
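To make the mechanics concrete, here is a stripped-down sketch of latent-space optimization. All components are untrained toys: the encoder, decoder, and TE predictor are tiny placeholder networks standing in for the real MLP autoencoder and RiboNN, and the sketch does plain gradient ascent on the predictor rather than full Bayesian optimization with an acquisition function. It only illustrates the loop: encode a known sequence, optimize its latent code against the predictor, decode back.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins (untrained): a tiny autoencoder over one-hot-ish mRNA
# sequences and a linear "TE predictor" in place of RiboNN.
SEQ_LEN, VOCAB, LATENT = 50, 4, 16

encoder = nn.Sequential(nn.Flatten(), nn.Linear(SEQ_LEN * VOCAB, LATENT))
decoder = nn.Sequential(nn.Linear(LATENT, SEQ_LEN * VOCAB),
                        nn.Unflatten(1, (SEQ_LEN, VOCAB)))
te_predictor = nn.Sequential(nn.Flatten(), nn.Linear(SEQ_LEN * VOCAB, 1))

# Start from a known sequence and ascend the predicted TE in latent space.
x0 = torch.randn(1, SEQ_LEN, VOCAB)              # placeholder starting sequence
z = encoder(x0).detach().requires_grad_(True)    # latent code is the free variable
opt = torch.optim.Adam([z], lr=0.1)
for _ in range(100):
    opt.zero_grad()
    loss = -te_predictor(decoder(z).softmax(-1)).mean()  # maximize predicted TE
    loss.backward()
    opt.step()

optimized_seq = decoder(z).argmax(-1)  # decode back to nucleotide indices
```

Note that nothing in this loop stops `z` from drifting into regions the predictor has never seen, which is exactly the reliability worry described above.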

Generating RNA sequences autoregressively, conditioned on protein sequence and translation efficiency

This is how the idea for tropical was born: an autoregressive transformer model that generates mRNA sequences conditioned on a protein sequence and translation efficiencies. From a user perspective, you give it a (short or empty) mRNA sequence prompt, a protein sequence representing your payload, and a dictionary of translation efficiencies for all cell types as the input, and you get an optimized mRNA sequence as output.

If we did not condition on protein sequence and TE, this would just be a straightforward decoder-only transformer. I am going to explain conditioning on protein sequence and TE in more detail, but in short, tropical uses cross-attention to condition on the protein sequence and adaLayerNorm to condition on a translational efficiency vector (one dimension per cell type). The beautiful thing about this setup is that we keep the loss and training task the same: autoregressive, self-supervised next-nucleotide prediction. The intuition for why this works is that conditioning boils down to giving the model hints that help it predict the next token more accurately. To illustrate this, think about which task would be harder: 1) generate the correct RNA sequence from scratch, or 2) generate it when you know it should encode a given protein sequence. 2) is easier because the known protein massively restricts the set of plausible RNA sequences.
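The key point above is that the training objective is untouched by the conditioning. Here is a minimal sketch of what that looks like: a dummy module (its embedding-plus-head body stands in for the real stack of causal self-attention, cross-attention, and adaLayerNorm blocks; the conditioning inputs are accepted but unused here) trained with an ordinary shifted next-token cross-entropy loss. Shapes and names are illustrative, not tropical's actual code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB = 4  # A, C, G, U

rna_tokens = torch.randint(0, VOCAB, (2, 20))   # (batch, seq)
protein_emb = torch.randn(2, 10, 32)            # context for cross-attention
te_vec = torch.randn(2, 8)                      # one TE per cell type, for adaLayerNorm

class DummyConditionedLM(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB, 32)
        self.head = torch.nn.Linear(32, VOCAB)

    def forward(self, tokens, protein_emb, te_vec):
        # The real model would run causal self-attention blocks that
        # cross-attend to protein_emb and are modulated by te_vec.
        return self.head(self.emb(tokens))

model = DummyConditionedLM()
logits = model(rna_tokens, protein_emb, te_vec)  # (batch, seq, vocab)

# Standard autoregressive objective: predict token t+1 from tokens <= t.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB),
                       rna_tokens[:, 1:].reshape(-1))
```

Swapping the dummy body for a conditioned transformer changes nothing about this loss; the hints only make the prediction problem easier.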

Conditioning on protein sequence

If you are already familiar with cross-attention, skip to here.

🚧 To be written

Conditioning on translational efficiency with AdaLayerNorm

To understand what AdaLayerNorm is and how it enables TE conditioning, it helps to first understand what LayerNorm really does and why it exists in the first place.

$$ \text{LayerNorm}(x) = \gamma \odot \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} + \beta $$

$x$ is one input sample, which is standardized to zero mean and unit variance. In contrast to BatchNorm, LayerNorm doesn’t compute mean and variance across samples but across other dimensions within each sample. For transformers, that usually means normalising across the embedding dimension (i.e. you get one mean and one variance for each (sample, token_pos) pair).
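You can verify the normalisation axis directly. At initialisation, $\gamma = 1$ and $\beta = 0$, so each token's embedding should come out with mean close to 0 and variance close to 1, independently per (sample, token_pos) pair:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 5, 8)   # (batch, token_pos, emb_dim)
y = nn.LayerNorm(8)(x)     # normalises over the last (embedding) dimension

# One mean/variance per (sample, token_pos) pair:
print(y.mean(dim=-1).abs().max())            # close to 0
print(y.var(dim=-1, unbiased=False).mean())  # close to 1 (eps pulls it slightly below)
```

Once trained, $\gamma$ and $\beta$ move the statistics away from (0, 1) again; the zero-mean, unit-variance property only holds for the normalised tensor before the affine transform.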

To get the full picture, let’s have a look at the PyTorch API:

class torch.nn.LayerNorm(
	normalized_shape, 
	eps=1e-05, 
	elementwise_affine=True, 
	bias=True, 
	device=None, 
	dtype=None
)

We’re using LayerNorm like this:

nn.LayerNorm(emb_dim, elementwise_affine=False)
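With `elementwise_affine=False`, the learned $\gamma$ and $\beta$ are dropped, which frees up that slot for conditioning: AdaLayerNorm predicts a per-sample scale and shift from a conditioning vector instead of learning fixed ones. Here is a minimal sketch of that idea, assuming the conditioning vector is the TE-per-cell-type vector; the module name, the `1 + scale` convention, and the single linear projection are common adaLN choices, not necessarily tropical's exact implementation:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class AdaLayerNorm(nn.Module):
    """Sketch: LayerNorm whose scale/shift are predicted from a conditioning vector."""

    def __init__(self, emb_dim: int, cond_dim: int):
        super().__init__()
        # Plain normalisation, no learned gamma/beta.
        self.norm = nn.LayerNorm(emb_dim, elementwise_affine=False)
        # Predict a scale and shift from the conditioning vector
        # (in practice often zero-initialised so training starts as plain LayerNorm).
        self.to_scale_shift = nn.Linear(cond_dim, 2 * emb_dim)

    def forward(self, x, cond):
        # x: (batch, seq_len, emb_dim); cond: (batch, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        # Broadcast one (scale, shift) pair per sample across all token positions.
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

te_vec = torch.randn(2, 8)    # hypothetical TE vector over 8 cell types
x = torch.randn(2, 16, 32)    # token embeddings
out = AdaLayerNorm(32, 8)(x, te_vec)
```

So the TE vector never enters through the token stream at all; it modulates every layer's activations through these predicted scale/shift pairs.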