Abstract:
Expression levels determined by RNA-seq are count-based. For
expression-level analysis, the logarithm of read counts is used more
often than the raw read counts.
The normal distribution is the most common distributional assumption
for parametric significance analysis. Unfortunately, the expression
levels of a given gene determined by RNA-seq may not fit a normal
distribution. Read counts should therefore be handled with care when
fitting a linear regression model, which is why we discuss the
logarithm of read counts here. For a given gene measured in 200
samples, the log2 read counts are much closer to a normal distribution
than the untransformed read counts (Figure 1).
Figure 1. QQ plots of read counts and log2 read counts
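The comparison behind a QQ plot can be sketched numerically. The example below is a minimal illustration on simulated data, not the data behind Figure 1: the counts are drawn from a hypothetical log-normal model, and the helper `qq_correlation` (a name chosen here) reports the correlation between the sorted values and the matching standard normal quantiles. A value close to 1 means the QQ points lie near a straight line.

```python
import math
import random
from statistics import NormalDist, mean


def qq_correlation(values):
    """Correlation between sorted values and standard normal quantiles."""
    n = len(values)
    observed = sorted(values)
    std_normal = NormalDist()
    # Plotting positions (i + 0.5) / n for i = 0 .. n-1 keep quantiles finite.
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    mo, mt = mean(observed), mean(theoretical)
    cov = sum((o - mo) * (t - mt) for o, t in zip(observed, theoretical))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_t = sum((t - mt) ** 2 for t in theoretical)
    return cov / math.sqrt(var_o * var_t)


# Assumption: simulated read counts for one gene in 200 samples,
# drawn from a log-normal model (parameters are illustrative only).
rng = random.Random(0)
counts = [max(1, round(rng.lognormvariate(5.0, 1.0))) for _ in range(200)]
log2_counts = [math.log2(c) for c in counts]

r_raw = qq_correlation(counts)
r_log = qq_correlation(log2_counts)
print(f"QQ correlation, read counts:      {r_raw:.3f}")
print(f"QQ correlation, log2 read counts: {r_log:.3f}")
```

On data like these, the log2 counts should give the higher correlation, which is the numerical counterpart of the straighter QQ line.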
This poor fit to the normal distribution is due to the wide dynamic
range of read counts determined by RNA-seq. The coefficient of
variation shows little correlation with expression level (Figure 2).
In addition, setting a noise cutoff to remove low read counts does not
always change the overall distribution of read counts.
Figure 2. Dynamics of expression levels of 1500 genes determined by RNA-seq
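As a reminder of the quantity involved, the coefficient of variation (CV) of a gene is its standard deviation divided by its mean. The sketch below computes it for a few simulated genes. The gene names, the expression matrix, and the log-normal parameters are all hypothetical and stand in for real RNA-seq data.

```python
import math
import random
from statistics import mean, stdev


def coefficient_of_variation(counts):
    """Sample SD divided by the mean; NaN when the mean is zero."""
    m = mean(counts)
    return stdev(counts) / m if m > 0 else math.nan


# Assumption: 5 simulated genes x 50 samples, each gene with its own
# expression level on the log scale (values are illustrative only).
rng = random.Random(1)
genes = {}
for g in range(5):
    mu = rng.uniform(2.0, 8.0)
    genes[f"gene_{g}"] = [round(rng.lognormvariate(mu, 0.8)) for _ in range(50)]

for name, counts in genes.items():
    cv = coefficient_of_variation(counts)
    print(f"{name}: mean = {mean(counts):10.1f}, CV = {cv:.2f}")
```

Because the CV is scale-free, it can be compared across genes whose mean read counts differ by orders of magnitude.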
We compared the goodness-of-fit of distributions whose means and
standard deviations (SD) were estimated from read counts and from the
logarithm of read counts. The theoretical curve from the log-normal
distribution fits the data better than the curve from the normal
distribution (Figure 3).
Figure 3. Goodness-of-fit of normal and log-normal distributions. The
means of the observed and theoretical read counts are indicated by
gray and blue lines, respectively. The red lines mark the read-count
cutoffs for significance analysis at a p value of 0.05.
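The comparison in Figure 3 can be approximated with a short sketch. It rests on several assumptions that are not stated in the text: the data are simulated from a log-normal model, goodness-of-fit is measured with a Kolmogorov-Smirnov distance (the original measure is not specified), and the p = 0.05 cutoff is taken as a one-sided upper-tail quantile.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def ks_statistic(values, cdf):
    """Largest gap between the empirical CDF and a model CDF."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d


# Assumption: simulated read counts for one gene in 200 samples.
rng = random.Random(2)
counts = [rng.lognormvariate(5.0, 1.0) for _ in range(200)]

# Normal model: mean and SD estimated from the read counts themselves.
normal_fit = NormalDist(mean(counts), stdev(counts))

# Log-normal model: mean and SD estimated from the natural log of counts.
logs = [math.log(c) for c in counts]
log_fit = NormalDist(mean(logs), stdev(logs))

ks_normal = ks_statistic(counts, normal_fit.cdf)
ks_lognormal = ks_statistic(counts, lambda x: log_fit.cdf(math.log(x)))

# Theoretical mean of read counts under each model.
mean_normal = normal_fit.mean
mean_lognormal = math.exp(log_fit.mean + log_fit.variance / 2)

# Assumed one-sided upper-tail cutoff at p = 0.05 (95th percentile).
cutoff_normal = normal_fit.inv_cdf(0.95)
cutoff_lognormal = math.exp(log_fit.inv_cdf(0.95))

print(f"KS distance: normal = {ks_normal:.3f}, log-normal = {ks_lognormal:.3f}")
print(f"Observed mean = {mean(counts):.1f}")
print(f"Theoretical mean: normal = {mean_normal:.1f}, log-normal = {mean_lognormal:.1f}")
print(f"Cutoff (p = 0.05): normal = {cutoff_normal:.1f}, log-normal = {cutoff_lognormal:.1f}")
```

A smaller KS distance indicates a closer fit, so on log-normally distributed counts the log-normal model is expected to score lower than the normal model.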
Writing date: 2014.10.2, 2015.02.20