Friday, February 20, 2015

NGS: Why take the logarithm of expression levels determined by RNA-seq?


Abstract: Expression levels determined by RNA-seq are count-based. For expression level analysis, the logarithm of the read counts is used more often than the raw read counts.



The normal distribution is the most common distributional assumption in parametric significance analysis. Unfortunately, the expression levels of a given gene determined by RNA-seq may not follow a normal distribution. Read counts should therefore be handled with care when fitting a linear regression model, which is why we discuss the logarithm of read counts here. For a given gene measured in 200 samples, the log2 read counts are much closer to a normal distribution than the untransformed read counts (Figure 1).



Figure 1. QQ plots of read counts and log2 read counts
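The comparison behind a QQ plot can be summarized as a single number: the correlation between the sorted values and the theoretical normal quantiles. The sketch below is only an illustration under stated assumptions. The read counts are simulated from a log-normal distribution (not the data used in this post), the pseudocount of 1 before taking log2 is an assumption, and `qq_correlation` is a hypothetical helper name.

```python
import math
import random
from statistics import NormalDist, fmean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def qq_correlation(values):
    """Correlation between sorted values and standard normal quantiles.

    A value close to 1 means the points of the QQ plot lie near a
    straight line, i.e. the data look approximately normal.
    """
    n = len(values)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return pearson(sorted(values), theoretical)


# Hypothetical data: one gene, 200 samples, simulated log-normal counts.
rng = random.Random(0)
counts = [round(rng.lognormvariate(6.0, 1.0)) for _ in range(200)]

# log2 with a pseudocount of 1 (an assumption, to guard against zeros).
log2_counts = [math.log2(c + 1) for c in counts]

r_raw = qq_correlation(counts)
r_log = qq_correlation(log2_counts)
print(f"QQ correlation, raw counts:  {r_raw:.3f}")
print(f"QQ correlation, log2 counts: {r_log:.3f}")
```

For data of this kind the log2 counts should give a correlation much closer to 1 than the raw counts, which is the numerical counterpart of the straighter line in the QQ plot.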

This poor fit to the normal distribution is due to the wide dynamic range of read counts determined by RNA-seq. The coefficient of variation has little correlation with the expression level (Figure 2). In addition, setting a noise threshold to remove low read counts does not always change the overall distribution of read counts.




Figure 2. Dynamics of expression levels of 1500 genes determined by RNA-seq
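To make the relation between expression level and variability concrete, the sketch below computes the mean, SD and coefficient of variation (CV = SD / mean) per gene. It is a simulation under assumptions, not the data of Figure 2: each of 1500 hypothetical genes has log-normal counts in 200 samples, with a gene-specific location and a shared log-scale SD (`SIGMA`, an assumed value). Under such a model the SD grows with the mean, while the CV depends only on `SIGMA`.

```python
import math
import random
from statistics import fmean, stdev


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)
N_GENES, N_SAMPLES = 1500, 200
SIGMA = 0.5  # assumed log-scale SD, shared by all simulated genes

gene_means, gene_sds, gene_cvs = [], [], []
for _ in range(N_GENES):
    # Gene-specific expression level on the natural-log scale (assumed range).
    mu = rng.uniform(4.0, 10.0)
    counts = [round(rng.lognormvariate(mu, SIGMA)) for _ in range(N_SAMPLES)]
    m, s = fmean(counts), stdev(counts)
    gene_means.append(m)
    gene_sds.append(s)
    gene_cvs.append(s / m)

log_means = [math.log2(m) for m in gene_means]
log_sds = [math.log2(s) for s in gene_sds]

r_sd = pearson(log_means, log_sds)   # SD versus expression level
r_cv = pearson(log_means, gene_cvs)  # CV versus expression level
print(f"correlation of log2 mean with log2 SD: {r_sd:.3f}")
print(f"correlation of log2 mean with CV:      {r_cv:.3f}")
```

In this simulation the SD tracks the mean closely, whereas the CV is nearly uncorrelated with the expression level. Real RNA-seq data can deviate from this, for example at low counts where sampling noise dominates.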

We compared the goodness of fit obtained when the mean and standard deviation (SD) are estimated from the read counts and from the logarithm of the read counts. The theoretical curve from the log-normal distribution fits the data better than the curve from the normal distribution (Figure 3).



Figure 3. Goodness of fit of the normal and log-normal distributions. The means of the observed and theoretical read counts are indicated by gray and blue lines, respectively. The red lines are the read-count cutoffs for significance analysis at a p value of 0.05.
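One way to compare the two fits numerically is to estimate the mean and SD on each scale and compare the log-likelihoods of the resulting normal and log-normal models. The sketch below does this on simulated counts, which are an assumption and not the data of Figure 3. The cutoff is computed as the upper 95th percentile of each fitted model; whether the figure uses a one-sided or two-sided cutoff is not stated, so the one-sided choice here is also an assumption.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def normal_logpdf(x, mu, sd):
    """Log density of a normal distribution with mean mu and SD sd."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mu) ** 2 / (2 * sd * sd)


# Hypothetical data: one gene, 200 samples, simulated log-normal counts.
rng = random.Random(2)
counts = [max(1, round(rng.lognormvariate(6.0, 1.0))) for _ in range(200)]
log_counts = [math.log(c) for c in counts]

# Estimate mean and SD on the raw scale and on the log scale.
mu_raw, sd_raw = fmean(counts), stdev(counts)
mu_log, sd_log = fmean(log_counts), stdev(log_counts)

# Log-likelihood of the raw counts under each fitted model.
# Log-normal density: f(x) = normal_pdf(ln x) / x, hence the "- ln x" term.
ll_normal = sum(normal_logpdf(c, mu_raw, sd_raw) for c in counts)
ll_lognormal = sum(
    normal_logpdf(lc, mu_log, sd_log) - lc for lc in log_counts
)

# Upper cutoffs at p = 0.05 (one-sided, an assumption) on the count scale.
cutoff_normal = NormalDist(mu_raw, sd_raw).inv_cdf(0.95)
cutoff_lognormal = math.exp(NormalDist(mu_log, sd_log).inv_cdf(0.95))

print(f"log-likelihood, normal:     {ll_normal:.1f}")
print(f"log-likelihood, log-normal: {ll_lognormal:.1f}")
print(f"p = 0.05 cutoff, normal:     {cutoff_normal:.0f}")
print(f"p = 0.05 cutoff, log-normal: {cutoff_lognormal:.0f}")
```

A higher log-likelihood indicates the better-fitting model. The two cutoffs generally differ, so the choice of distribution changes which read counts would be called significant.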

Writing date: 2014.10.2, 2015.02.20