Look at your data
[On Pearson Correlation]
Pearson correlation is the most commonly reported performance metric for sequence-to-function models. To compute it, take your coverage profiles, build the set of pairs of true $x$ and predicted $y$ coverage values, $\{(x_1,y_1), ..., (x_n, y_n)\}$, over all $n$ positions, and compute the Pearson r as usual.
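As a minimal sketch of that recipe, here is the computation written out from the definition in plain Python. The toy profiles `true_cov` and `pred_cov` are made up for illustration; in practice you would more likely call an existing routine such as `numpy.corrcoef` or `scipy.stats.pearsonr` on your real tracks.

```python
import math

def pearson_r(true, pred):
    """Pearson r between two equal-length coverage profiles."""
    n = len(true)
    mu_x = sum(true) / n
    mu_y = sum(pred) / n
    # Covariance and standard deviations, all with the same 1/n normalisation
    cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(true, pred)) / n
    sd_x = math.sqrt(sum((x - mu_x) ** 2 for x in true) / n)
    sd_y = math.sqrt(sum((y - mu_y) ** 2 for y in pred) / n)
    return cov / (sd_x * sd_y)

# Hypothetical per-position coverage values (not real data)
true_cov = [0.0, 1.0, 2.0, 1.0, 0.0, 10.0, 12.0, 10.0]
pred_cov = [0.1, 0.8, 1.5, 1.2, 0.2, 9.0, 13.0, 9.5]

r = pearson_r(true_cov, pred_cov)
print(f"Pearson r = {r:.3f}")
```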
The core issue with Pearson correlation is not new but worth repeating: each position's contribution is weighted by its distance from the mean. Recall that $r = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$ and $Cov(X, Y) = \frac{1}{n}\sum_i (x_i - \mu_x)(y_i - \mu_y)$, which means that positions with high expression values will dominate your covariance. Since RNA-seq values, for example, can have a large dynamic range within the same window (consider a window containing both a low- and a high-expression gene), this can get you into trouble. To drive this point home, consider these two plots of two model predictions for the same coverage profile:

In this case the green model is clearly preferable to the orange one, but our Pearson correlation is mostly dominated by the high-expression gene. Personally, I would prefer a metric that discriminates between the two models much better!

As you might have guessed, the different distances of each gene’s expression from the mean result in very different contributions to the global Pearson correlation, because the covariance roughly squares them. Long, low-expression genes paired with high-expression genes make this problem even worse, because every extra low-expression position pulls the mean further down, which quadratically blows up the contribution of the high-expression gene.
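To put rough numbers on this, here is a toy sketch (not the data behind the plots). It assumes a hypothetical 100-position window with some intergenic background, a long low-expression gene and a short high-expression gene, and compares a prediction that captures both genes with one that misses the low gene entirely. All values and names are invented for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson r computed directly from the definition."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y))
    var_x = sum((a - mu_x) ** 2 for a in x)
    var_y = sum((b - mu_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical window: 20 intergenic positions, a long low-expression gene
# (60 positions at coverage 2) and a short high-expression gene (20 at 100).
truth = [0.0] * 20 + [2.0] * 60 + [100.0] * 20
good_pred = [0.0] * 20 + [2.0] * 60 + [90.0] * 20  # captures both genes
bad_pred = [0.0] * 20 + [0.0] * 60 + [90.0] * 20   # misses the low gene entirely

r_good = pearson_r(truth, good_pred)
r_bad = pearson_r(truth, bad_pred)

# Share of the total squared deviation from the mean carried by the high gene,
# which occupies only the last 20 of the 100 positions.
mu = sum(truth) / len(truth)
total_sq_dev = sum((t - mu) ** 2 for t in truth)
high_share = sum((t - mu) ** 2 for t in truth[80:]) / total_sq_dev

print(f"r (both genes captured): {r_good:.5f}")
print(f"r (low gene missed):     {r_bad:.5f}")
print(f"high-gene share of squared deviation: {high_share:.2f}")
```

In this toy setup both correlations land above 0.999 even though one prediction drops a gene completely, and the high-expression gene carries roughly 80% of the squared deviation with only 20% of the positions.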
Another unintuitive failure mode of Pearson correlation (that Carl pointed out to me) is what happens if you have e.g. two gene bodies in a prediction window:
[IMAGE]
Now, there is another way you can compute the Pearson correlation
Dataset leakage
[]
Measure performance on OOD datasets
[Tale of high R, low generality]
Spearman correlations on sums of non-affinely transformed values
To measure OOD performance, we often sum the predicted coverage values over some region (e.g. the coding sequence) and use that scalar as a proxy to correlate with an experimentally measured value. To check whether your model can distinguish variants, you can use rank correlations like Spearman. In addition, some sequence-to-expression models work on transformed labels, i.e. they predict in some squashed space. If you are lazy, you might end up computing Spearman correlations on sums of those values without undoing the transformation first. The issue with this is that

$$\sum_i T(x_i) \neq T\left(\sum_i x_i\right)$$

where $T$ is the transformation and $x_i$ is your coverage value at position $i$. For a monotonic $T$, the right-hand side ranks regions exactly like the raw sum $\sum_i x_i$, but for a non-affine $T$ the left-hand side can order them differently, and the order is all Spearman cares about. A real example of this is the Borzoi transformation:
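As a separate toy illustration of the rank flip, the sketch below uses a plain square root as a stand-in squashing function. This is an assumption chosen for easy arithmetic, not the Borzoi transformation. The four coverage profiles and the measured values are invented, and the small `spearman` helper does not handle ties; in practice `scipy.stats.spearmanr` would be the usual choice.

```python
import math

def ranks(values):
    """1-based ranks; assumes no ties (enough for this toy example)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = float(rank)
    return out

def spearman(a, b):
    """Spearman rho as the Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mu_a, mu_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(ra, rb))
    var_a = sum((x - mu_a) ** 2 for x in ra)
    var_b = sum((y - mu_b) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical raw coverage over a 4-position region for four sequences:
# two with one sharp peak, two with flat coverage.
profiles = [
    [100.0, 0.0, 0.0, 0.0],
    [16.0, 16.0, 16.0, 16.0],
    [9.0, 9.0, 9.0, 9.0],
    [49.0, 0.0, 0.0, 0.0],
]
# Hypothetical measurements that track the raw coverage sum exactly.
measured = [100.0, 64.0, 36.0, 49.0]

raw_sums = [sum(p) for p in profiles]
# Sum taken in the squashed space, i.e. without undoing T(x) = sqrt(x).
squashed_sums = [sum(math.sqrt(x) for x in p) for p in profiles]

rho_raw = spearman(raw_sums, measured)
rho_squashed = spearman(squashed_sums, measured)

print(f"Spearman on raw sums:      {rho_raw:.2f}")
print(f"Spearman on squashed sums: {rho_squashed:.2f}")
```

The peaked profiles lose far more under the square root than the flat ones, so summing in the squashed space reorders the sequences and the rank correlation with the measurements collapses, even though the raw sums rank them perfectly.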
