Compositional data arise in statistics when we describe the parts or constituents of some discrete sample space. A typical example of compositional data is encountered in transcription factor binding site (TFBS) models in genetics, where data are often reported as positional frequencies of the four bases \(A\), \(C\), \(G\) and \(T\) at each position of a TF binding site. In this case, knowing the frequency or proportion of the \(A\) base at a specific position is only informative when viewed relative to the frequencies or proportions of the other bases \(C\), \(G\) and \(T\). This is a typical characteristic of compositional data. Another example is protein sequence data, where each position in a protein sequence is a composition of 20 amino acids.
Why is shrinkage important for compositional data in the TFBS setting? Suppose the compositional count data for a specific position of a TFBS is
\[ (A, C, G, T) : = (6, 1, 2, 1) \]
and in another case it is
\[ (A, C, G, T) : = (600, 100, 200, 100) \]
Usually the background probability at the position is assumed to be equal, or close to equal, for the four bases. The position weight matrix (PWM) entries, i.e. the estimated proportions of A, C, G and T, are the same in both of the above scenarios. However, in the second case the proportion estimates are drawn from 100 times more data than in the first case; in fact, the estimated proportions in the first case are based on just 10 sites.
In such a scenario, one would want to shrink the estimated proportions towards \((0.25, 0.25, 0.25, 0.25)\) more strongly in the first case than in the second, since the evidence from the data is strong in the second case and much weaker in the first.
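As a quick illustration of why the amount of data should matter, the following R sketch compares the raw proportion estimates with the posterior means under a fixed symmetric Dirichlet(1, 1, 1, 1) prior (this particular prior is chosen purely for illustration; it is not the dash prior developed below):

```r
## The two count vectors give identical raw proportion estimates, but a fixed
## Dirichlet(1,1,1,1) prior (illustration only) shrinks the low-count case far more.
counts_small <- c(A = 6,   C = 1,   G = 2,   T = 1)
counts_large <- c(A = 600, C = 100, G = 200, T = 100)

counts_small / sum(counts_small)            # raw estimates: 0.6 0.1 0.2 0.1
counts_large / sum(counts_large)            # identical:     0.6 0.1 0.2 0.1

(counts_small + 1) / sum(counts_small + 1)  # pulled noticeably towards 0.25
(counts_large + 1) / sum(counts_large + 1)  # essentially unchanged
```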
But the big question is: what should the amount of shrinkage be in the two cases? Can we provide an adaptive approach that automatically learns the amount of shrinkage in each case and scales it with the amount of data available?
The answer to this question lies in our new approach called dash.
We define dash for generic compositional data, not restricted to TFBS or protein sequence data. Let us assume that there are \(L\) constituents in the compositional mix: \(L\) equals \(4\) (corresponding to the A, C, G and T bases) for the transcription factor data and \(20\) (corresponding to the amino acids) for the protein sequence data.
Say we have observed the counts of the \(L\) constituent categories for \(n\) compositional samples. Here \(n\) represents the number of binding site positions, either for a specific TF or for a number of TFs. \(n\) therefore corresponds to the number of i.i.d. compositional samples available to us.
We model these compositional count vectors as follows
\[ (c_{n1}, c_{n2}, \cdots, c_{nL}) \sim Mult \left ( c_{n+} : p_{n1}, p_{n2}, \cdots, p_{nL} \right ) \]
where \(c_{n+} = \sum_{l=1}^{L} c_{nl}\) is the total count across the constituents of the compositional data observed for the \(n\)th sample. \(p_{nl}\) represents the compositional probabilities, satisfying
\[ p_{nl} \geq 0 \hspace{1cm} \sum_{l=1}^{L} p_{nl} = 1 \]
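As a concrete picture of this sampling model, here is a minimal R sketch that simulates a few compositional count vectors; the sample sizes `c_plus` and the probabilities `p` are made-up values for illustration only:

```r
## Simulate N compositional count vectors from the multinomial model above.
set.seed(1)
L <- 4                                   # constituents (A, C, G, T)
N <- 5                                   # number of compositional samples
c_plus <- c(10, 50, 100, 500, 1000)      # total count c_{n+} for each sample
p <- matrix(0.25, nrow = N, ncol = L)    # compositional probabilities p_{nl}

counts <- t(sapply(seq_len(N), function(i) {
  rmultinom(1, size = c_plus[i], prob = p[i, ])
}))
counts   # one row per sample, one column per constituent
```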
Now we assume a prior distribution on the vector of compositional probabilities \((p_{n1}, p_{n2}, \cdots, p_{nL})\). A typical choice that maintains conjugacy of the posterior distribution is a Dirichlet prior. However, the choice of the concentration parameter of the Dirichlet prior can greatly impact the posterior, and hence the amount of shrinkage. To bypass this problem, we choose a mixture of known Dirichlet priors, each having the same mean as the background but with varying amounts of concentration, together with unknown mixture proportions to be estimated from the data.
\[ \left ( p_{n1}, p_{n2}, \cdots, p_{nL} \right ) \sim \sum_{k=1}^{K} \pi_{k} Dir \left (\alpha_{k} \mu_{k1}, \alpha_{k} \mu_{k2}, \cdots, \alpha_{k} \mu_{kL} \right ) \hspace{1 cm} \alpha_{k} > 0 \hspace{1 cm} \sum_{l=1}^{L} \mu_{kl} = 1 \] We assume a Dirichlet prior on the mixture proportions \(\pi = (\pi_{1}, \pi_{2}, \cdots, \pi_{K})\),
\[ f(\pi) \propto \prod_{k=1}^{K} {\pi_{k}}^{\lambda_{k}-1} \] Such a prior is similar to the ash prior introduced by Stephens (2016) for modeling false discovery rates in normal data, and we call it the dash prior.
In this specification of the dash prior, all Dirichlet components have the same mean. However, one can add components with other means, for instance corresponding to the corners of the simplex, like \((1, 0, \cdots, 0)\), in which case the prior has multiple modes. For now, we focus on the unimodal version of the dash prior; all the following calculations go through as is for the multimodal versions as well.
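To make the shape of this prior concrete, the sketch below draws from a small dash-type mixture of Dirichlets sharing the background mean; the grid `alpha`, the weights `pi_k` and the helper `rdirichlet1` are illustrative assumptions rather than fitted or packaged quantities:

```r
## Draw from a mixture of Dirichlets with a common mean mu and varying
## concentrations alpha_k (a toy version of the dash prior).
rdirichlet1 <- function(a) { x <- rgamma(length(a), shape = a); x / sum(x) }

mu    <- c(0.25, 0.25, 0.25, 0.25)   # common background mean
alpha <- c(100, 10, 2, 1)            # concentration grid (toy, finite values only)
pi_k  <- c(0.4, 0.3, 0.2, 0.1)       # mixture proportions (assumed known here)

draw_dash_prior <- function() {
  k <- sample(length(pi_k), 1, prob = pi_k)  # pick a mixture component
  rdirichlet1(alpha[k] * mu)                 # draw p from that component
}
p_draws <- t(replicate(1000, draw_dash_prior()))
colMeans(p_draws)   # close to mu: every component shares the same mean
```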
The marginal distribution of the counts is given by
\[ f (c_{n*} | \pi, \mu, \alpha) = \int f(c_{n*}| p_{n*}) f (p_{n*} | \pi, \mu, \alpha) d p_{n*} \] Let
\[ l_{nk} : = \frac{ c_{n+} ! \Gamma (\delta_{0k})}{ \Gamma (c_{n+} + \delta_{0k})} \prod_{l=1}^{L} \frac{\Gamma (c_{nl} + \alpha_{k}\mu_{kl})}{c_{nl} ! \Gamma (\alpha_{k}\mu_{kl})} \hspace{1 cm} \text{where} \hspace{1 cm} \delta_{0k} = \alpha_{k} \sum_{l=1}^{L} \mu_{kl}\]
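For finite \(\alpha_{k}\), this Dirichlet-multinomial quantity is easiest to evaluate on the log scale; a possible R sketch (the function name and arguments here are illustrative) is:

```r
## Component marginal likelihood l_{nk} on the log scale, using lgamma()
## to avoid overflow; valid for finite alpha_k.
log_l_nk <- function(counts, alpha_k, mu_k) {
  c_plus  <- sum(counts)
  delta0k <- alpha_k * sum(mu_k)
  lgamma(c_plus + 1) + lgamma(delta0k) - lgamma(c_plus + delta0k) +
    sum(lgamma(counts + alpha_k * mu_k) - lgamma(counts + 1) - lgamma(alpha_k * mu_k))
}

## Example: one sample, one component with alpha_k = 10 and the background mean
exp(log_l_nk(c(6, 1, 2, 1), alpha_k = 10, mu_k = rep(0.25, 4)))
```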
\[ f(c_{n*} | \pi, \mu, \alpha) = \sum_{k=1}^{K} \pi_{k} l_{nk} \] We then use an EM algorithm or convex programming to estimate the mixture proportions \(\pi_{k}\), which is equivalent to maximizing the penalized log-likelihood
\[ \log L (\pi) + \log h (\pi) = \sum_{n} \log \left (\sum_{k=1}^{K} \pi_{k} l_{nk} \right) + \sum_{k} (\lambda_{k} - 1) \log(\pi_{k}) \] Once we have estimated \(\pi\) by maximizing the above objective, we define the posterior weight of sample \(n\) in mixture component \(k\) to be
\[ \omega_{nk} : = \frac{\hat{\pi}_{k} l_{nk}}{\sum_{k'} \hat{\pi}_{k'} l_{nk'}} \]
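A bare-bones EM routine for this penalized objective might look as follows; `lik` is an assumed \(N \times K\) matrix of precomputed component likelihoods \(l_{nk}\), and the function is a sketch of the idea rather than the full dash implementation:

```r
## EM for the mixture proportions pi under a Dirichlet(lambda) penalty.
## E-step: omega[n, k] = pi[k] * lik[n, k] / sum over k' of pi[k'] * lik[n, k'].
## M-step: pi[k] proportional to sum over n of omega[n, k] + lambda[k] - 1.
fit_pi_em <- function(lik, lambda, n_iter = 200) {
  K <- ncol(lik)
  pi_hat <- rep(1 / K, K)
  for (it in seq_len(n_iter)) {
    w      <- sweep(lik, 2, pi_hat, `*`)    # pi_k * l_nk
    omega  <- w / rowSums(w)                # posterior weights, rows sum to 1
    pi_hat <- colSums(omega) + lambda - 1   # penalized update
    pi_hat <- pi_hat / sum(pi_hat)
  }
  list(pi = pi_hat, omega = omega)
}
```

In practice one would work with the \(l_{nk}\) on the log scale (normalizing each row before exponentiating) for numerical stability; the convex-programming route would maximize the same objective directly.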
The posterior can be computed similarly as
\[ f(p_{n*} | \hat{\pi}, c_{n*}) : = \sum_{k=1}^{K} \omega_{nk} f_{k} (p_{n*} | c_{n*}) \]
where \(f_{k} (p_{n*} | c_{n*})\) is the posterior component obtained when the prior is the \(k\)th component of the dash prior. This posterior component is equal to
\[ f_{k} (p_{n*} | c_{n*}) : = Dir \left ( c_{n1} + \alpha_{k}\mu_{k1}, c_{n2} + \alpha_{k}\mu_{k2}, \cdots, c_{nL} + \alpha_{k}\mu_{kL} \right) \] The posterior mean therefore is equal to
\[ E(p_{n*} | \hat{\pi}, c_{n*}) := \sum_{k=1}^{K} \omega_{nk} \frac{c_{n*} + \alpha_{k}\mu_{k*}}{\sum_{l=1}^{L} (c_{nl} + \alpha_{k} \mu_{kl})} \]
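The posterior mean for a single sample can be assembled directly from these pieces; in the sketch below `mu` is an assumed \(K \times L\) matrix whose \(k\)th row is \(\mu_{k*}\), and `omega_n` is the vector of posterior weights for that sample:

```r
## Posterior mean E(p_n | pi_hat, c_n): a weighted average of the
## component-wise Dirichlet posterior means.
posterior_mean <- function(counts, omega_n, alpha, mu) {
  comp_means <- sapply(seq_along(alpha), function(k) {
    if (is.infinite(alpha[k])) return(mu[k, ])   # point-mass component
    (counts + alpha[k] * mu[k, ]) / sum(counts + alpha[k] * mu[k, ])
  })                                             # L x K matrix
  as.vector(comp_means %*% omega_n)
}
```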
To get an idea of how concentrated a sample is around the prior mean, we compute \(\omega_{n1}\), the posterior probability that the sample comes from the first component, where the first component corresponds to \(\alpha_{k} = \infty\).
We can also compute the corner posterior probability of each sample as the sum of the posterior probabilities corresponding to the components with concentration parameter less than 1. Similarly, the center posterior probability of each sample can be computed as the sum of the posterior probabilities corresponding to the components with a very high concentration parameter (say greater than 50).
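Given the matrix of posterior weights, these summaries are simple row sums over subsets of the grid; the thresholds below (concentration below 1 for "corner", above 50 for "center") follow the text, while the object names are illustrative:

```r
## Corner and center posterior probabilities from the N x K weight matrix omega,
## with alpha the concentration grid (Inf allowed for the point-mass component).
corner_prob <- function(omega, alpha) rowSums(omega[, alpha < 1,  drop = FALSE])
center_prob <- function(omega, alpha) rowSums(omega[, alpha > 50, drop = FALSE])
```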
One pertinent issue is how to choose \(K\). In general, \(K\) can be chosen as large as desired, but adding more components beyond a point is futile and computationally expensive.
We choose the default set of \(\alpha_{k}\) to be \((\infty, 100, 50, 20, 10, 2, 1, 0.1, 0.01)\). Here \(\alpha_{k}=\infty\) corresponds essentially to a point mass at the prior mean, and the subsequent choices of \(\alpha_{k}\) have progressively lower degrees of concentration. \(\alpha_{k} = 1\) corresponds to the most uniform scenario, whereas \(\alpha_{k} < 1\) corresponds to cases with probability mass at the edges of the simplex but with the mean still at the prior mean. These components help push points that lie close to the corners further towards the corners and away from the center, resulting in a clearer separation of the points close to the mean from those away from it. We choose the prior amount of shrinkage of \(\pi_{k}\), namely \(\lambda_{k}\), to be \(\left( 10, 1, 1, \cdots, 1 \right )\).
For \(\alpha_{k} = \infty\), we use Stirling's formula (ref) in the limit \(\alpha_{k} \rightarrow \infty\) to approximate the Gamma functions. For the other components, we use the LaplacesDemon R package to evaluate the Gamma functions on the log scale.
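The limit is convenient because, by Stirling's approximation, \(\Gamma(c_{nl} + \alpha_{k}\mu_{kl})/\Gamma(\alpha_{k}\mu_{kl}) \approx (\alpha_{k}\mu_{kl})^{c_{nl}}\) and \(\Gamma(\delta_{0k})/\Gamma(c_{n+} + \delta_{0k}) \approx \alpha_{k}^{-c_{n+}}\) as \(\alpha_{k} \rightarrow \infty\), so \(l_{nk}\) for that component reduces to a plain multinomial likelihood evaluated at the prior mean. A small R sketch of the defaults and of this limiting component (variable names are illustrative):

```r
## Default grids described above, and the alpha_k = Inf (point-mass) component,
## whose marginal likelihood is just a multinomial at the prior mean mu_k.
alpha_grid <- c(Inf, 100, 50, 20, 10, 2, 1, 0.1, 0.01)  # default concentrations
lambda     <- c(10, rep(1, length(alpha_grid) - 1))     # prior weights on pi

log_l_pointmass <- function(counts, mu_k) {
  dmultinom(counts, prob = mu_k, log = TRUE)
}
log_l_pointmass(c(6, 1, 2, 1), rep(0.25, 4))
```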