There’s a 2019 ICML paper from Stern and colleagues which introduces a modification of the Transformer that generates an output sequence by inserting tokens into an input sequence.


Paper: https://proceedings.mlr.press/v97/stern19a.html

Code: https://github.com/tdsone/insertion_transformer

Why bother?

By ML standards, a 2019 paper is probably severely outdated, yet I found it to contain an idea very relevant to the task of DNA sequence generation. The DNA design world today is dominated by autoregressive models (e.g. Evo), BERT-style transformers (DNABERT) and diffusion models (e.g. Disc-Diff). From a practical perspective though, few biologists will ever make a sequence and have it do its thing as a linear piece of DNA floating around in the cell. Instead, most DNA sequences are either on a plasmid or integrated somewhere in the genome. Both cases demand that you can incorporate upstream/left and downstream/right context into your design. In addition, aside from budget constraints, end users don’t really care about the length of the promoter or enhancer; they just want whatever works (ideally the shortest and thus cheapest version). This motivates why we would want something like an Insertion Transformer that can insert a sequence of arbitrary length while using bidirectional context.

Quick Intuition

Instead of modelling a sequence autoregressively from left to right like decoder-only transformer models, we model the joint distribution of an inserted token and an insertion location, conditioned on the starting sequence and the current sequence we are inserting into:

$$ p(c, l \mid x, \hat{y}_t) $$

where $c \in C$ is a token from the vocabulary $C$, $l \in [0, |\hat{y}_t|]$ is the insertion location, $\hat{y}_t$ is the current sequence we are inserting into, and $x$ is the starting sequence (which can be empty).
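
To make the joint distribution concrete, here is a minimal sketch in PyTorch (my own illustration, not the code from the linked repo) of one way to parameterize $p(c, l \mid x, \hat{y}_t)$: the decoder produces one hidden state per insertion slot, each slot is projected to vocabulary logits, and a single softmax over the flattened (location, token) space gives the joint distribution. The class name, `d_model`, and the toy vocabulary size are illustrative assumptions.

```python
# Minimal sketch (assumed names, not the authors' implementation): form
# p(c, l | x, y_hat_t) by projecting one hidden state per insertion slot
# to vocabulary logits and normalizing jointly over (location, token).
import torch
import torch.nn as nn


class JointInsertionHead(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, slot_states: torch.Tensor) -> torch.Tensor:
        """slot_states: (batch, num_slots, d_model), one vector per insertion
        slot, where num_slots = |y_hat_t| + 1. Returns log p(c, l) of shape
        (batch, num_slots, vocab_size), normalized over both axes jointly."""
        logits = self.proj(slot_states)              # (B, L+1, V)
        flat = logits.view(logits.size(0), -1)       # (B, (L+1)*V)
        log_joint = torch.log_softmax(flat, dim=-1)  # joint over (l, c)
        return log_joint.view_as(logits)


# Usage: greedily pick the most likely (location, token) pair to insert next.
head = JointInsertionHead(d_model=64, vocab_size=6)  # e.g. A, C, G, T + specials
slot_states = torch.randn(1, 5, 64)                  # 4-token partial sequence -> 5 slots
log_p = head(slot_states)
flat_idx = log_p.view(1, -1).argmax(dim=-1)
loc, tok = divmod(flat_idx.item(), log_p.size(-1))
print(f"insert token {tok} at slot {loc}")
```

Modelling content and location jointly (rather than picking a location first and a token second) means a single softmax scores every possible insertion at once, which is what lets decoding proceed in any order, including inserting into the middle of an existing left/right context.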