Learnings from building sequence-to-function models

[WIP: Look at your data]

Do it for longer than you think
- not a useful statement → what are good questions to ask?

[WIP: Splitting genomes into useful training examples]

Coverage ≠ input
how many genes?
- yeast vs. mammalian
look at the samples in IGV with BED tracks

[WIP: On Pearson Correlation]

Pearson correlation is the most commonly reported performance metric for sequence-to-function models. To compute it, take your coverage profiles, form the pairs of true $x$ and predicted $y$ coverage values, $\{(x_1,y_1), ..., (x_n, y_n)\}$, over all $n$ positions, and compute Pearson's $r$ as usual.
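As a minimal sketch of that computation, assuming the true and predicted coverage are available as NumPy arrays of the same shape (the function name `pearson_r` and the toy values are made up for illustration):

```python
import numpy as np

def pearson_r(true_cov, pred_cov):
    """Pearson r over all positions of a true/predicted coverage profile pair."""
    # Flatten so that every position contributes one (x, y) pair.
    x = np.asarray(true_cov, dtype=float).ravel()
    y = np.asarray(pred_cov, dtype=float).ravel()
    # Center both profiles on their own mean.
    xc = x - x.mean()
    yc = y - y.mean()
    # Covariance divided by the product of standard deviations
    # (the 1/n factors cancel).
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Toy coverage profile, purely illustrative.
true_cov = np.array([0.0, 1.0, 4.0, 4.0, 1.0, 0.0])
pred_cov = np.array([0.2, 0.8, 3.5, 4.2, 1.5, 0.1])
r = pearson_r(true_cov, pred_cov)
print(f"Pearson r = {r:.3f}")
```

This should agree with `np.corrcoef(true_cov, pred_cov)[0, 1]`; writing it out by hand just makes the mean-centering explicit, which matters for the discussion that follows.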

The core issue with Pearson correlation is not new but worth repeating: every position is weighted by its distance from the mean. Recall that $r = \mathrm{Cov}(X, Y)/(\sigma_X \sigma_Y)$ and $\mathrm{Cov}(X, Y) = \frac{1}{n}\sum_i (x_i - \mu_x)(y_i - \mu_y)$, which means that positions with high expression, far above the mean, dominate the covariance. Since, for example, RNA-seq values can have a large dynamic range within the same window (consider a window containing a low- and a high-expression gene), you can get into trouble with this. To drive this point home, consider this plot of two model predictions for the same coverage profile:

In this case the green model is clearly preferable to the orange one, but our Pearson correlation is mostly dominated by the high-expression gene. Personally, I would prefer a metric that discriminates between the two models much better!

As you might have guessed, the different distances of each gene’s expression to the mean result in very different contributions to the global Pearson because they are ~squared by the covariance. Long, low-expression genes paired with high-expression genes make this problem even worse, because length linearly moves the mean downwards, which quadratically blows up the contribution of high-expression genes.
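To make this concrete, here is a small synthetic sketch. All numbers are invented for illustration: a 1000-position window with one long, low-expression gene (coverage 1) and one short, high-expression gene (coverage 100), plus two hypothetical models, one that gets both genes right up to a scale factor and one that misses the low-expression gene entirely. The helper names are assumptions, not part of any library.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r over all positions."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def covariance_share(x, y, region):
    """Fraction of the summed covariance terms that comes from `region`."""
    terms = (x - x.mean()) * (y - y.mean())
    return float(terms[region].sum() / terms.sum())

# Synthetic window: long low-expression gene, short high-expression gene.
n = 1000
low_gene = slice(100, 600)    # 500 positions, coverage 1
high_gene = slice(700, 800)   # 100 positions, coverage 100
true_cov = np.zeros(n)
true_cov[low_gene] = 1.0
true_cov[high_gene] = 100.0

# Hypothetical model 1: both genes predicted, slightly underscaled.
pred_good = 0.9 * true_cov
# Hypothetical model 2: same, but the low-expression gene is missed completely.
pred_bad = pred_good.copy()
pred_bad[low_gene] = 0.0

r_good = pearson_r(true_cov, pred_good)
r_bad = pearson_r(true_cov, pred_bad)
share_bad_high = covariance_share(true_cov, pred_bad, high_gene)

print(f"r (both genes predicted):      {r_good:.4f}")
print(f"r (low-expression gene missed): {r_bad:.4f}")
print(f"covariance share of the high-expression gene: {share_bad_high:.2f}")
```

In this toy setup the two correlations are nearly indistinguishable, even though the second model ignores an entire gene. The positions of the missed gene even contribute positively to the covariance, simply because both the true and the predicted values sit below their respective means there.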

Another unintuitive failure mode of Pearson correlation (that Carl de Boer pointed out to me) is what happens if you have e.g. two gene bodies in a prediction window:

[IMAGE]

Now, there is another way you can compute the Pearson correlation

Dataset leakage


Measure performance on OOD datasets