Fundamental Definitions
Probability theory is a mathematical framework that allows us to reason about experiments whose outcomes are uncertain. A probabilistic model is a mathematical model of a probabilistic experiment that satisfies certain mathematical properties (the axioms of probability theory), and which allows us to calculate probabilities and to reason about the likely outcomes of the experiment. Formally, a probabilistic model is a triple $(\Omega, \mathcal{F}, \mathbb{P})$, where: Source: Link
- $\Omega $ is the sample space, the set of possible experiment outcomes.
- $\mathcal{F} $ is a $\sigma $-algebra, a collection of subsets of $\Omega $.
- $\mathbb{P} $ is a probability measure, a function that assigns a nonnegative probability to every set in the $\sigma $-algebra $\mathcal{F} $. Source: Link
We will not give a measure-theoretic introduction here. For a measure-theoretic treatment of probability theory, please see 2.
Bayes' theorem states that $$ p(A|B) = \frac{p(B|A)\,p(A)}{p(B)}. $$ Here:
- $p(A|B) $ is the probability of event $A $ occurring given that $B $ is true. It is also called the posterior probability of $A $ given $B $. $p(B|A) $ is the probability of event $B $ occurring given that $A $ is true. It can also be interpreted as the likelihood of $A $ for a fixed $B $.
- $p(A) $ and $p(B) $ are the probabilities of observing $A $ and $B $ respectively, without any conditions; they are known as the prior probability of $A $ and the marginal probability of $B $. Source: Link
Suppose that we have a medical test with the following properties:
- The disease occurs in $1\% $ of the general population,
- If you have the disease, the test shows that you do with probability $80\% $,
- If you don't have the disease, it shows that you do with probability $9.6\% $.
Let's use the following notation: $H $ is being healthy, $S $ is being sick, $+ $ is a positive test result, and $- $ is a negative test result. Then the probability of being sick given a positive test can be calculated as $$ \begin{aligned} p(S|+) &= \frac{p(+|S)p(S)}{p(+)} \\ &= \frac{0.8\times 0.01}{0.01\times0.8 + 0.99\times0.096} \\ &\approx 0.078 \end{aligned} $$ So, the probability of being sick is about $7.8\%$. Suppose the doctor wants to re-test the patient to be sure. If we update the prior probability of being sick to $7.8\%$, then the probability of being sick after a second positive test becomes $$ p(S|+) = \frac{0.8\times 0.078}{0.078\times0.8 + 0.922\times0.096} \approx 0.41 $$ So if the second test is also positive, the probability of being sick is nearly $41\%$. This example shows the power of Bayes' theorem: it allows us to update our beliefs as new data arrives.
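The two updates above can be reproduced with a short sketch in pure Python (the function and variable names are my own):

```python
def bayes_update(prior, sensitivity, false_positive):
    """Posterior probability of disease after one positive test."""
    evidence = sensitivity * prior + false_positive * (1 - prior)
    return sensitivity * prior / evidence

p1 = bayes_update(0.01, 0.80, 0.096)  # first positive test
p2 = bayes_update(p1, 0.80, 0.096)    # second test, prior replaced by p1
print(round(p1, 3), round(p2, 3))     # prints 0.078 0.412
```

Note that the second call simply feeds the first posterior back in as the new prior, which is exactly the belief-updating step described above.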
There is another way to interpret Bayes' theorem: the diachronic interpretation. Suppose that $H $ is our hypothesis and $D $ is our data. Then Bayes' theorem can be written as $$ p(H|D) = \frac{p(D|H)p(H)}{p(D)}. $$ Here Source: Link:
- $p(H) $ is the prior probability, which is the probability of hypothesis $H $ before data $D $ is observed or collected.
- $p(D|H) $ is the probability of the data under the hypothesis, called the likelihood.
- $p(D) $ is the probability of the data under any hypothesis, called the normalizing constant. This constant makes the posterior density integrate to one.
- $p(H|D) $ is the probability of the hypothesis after data is collected, called the posterior probability.
As a second example, let's consider the famous Monty Hall problem. The problem was stated as follows in Parade magazine in 1990:
Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, "Do you want to pick door No. 2?" Is it to your advantage to switch your choice? Source: Link
Let's look at the probability table of the problem:
Door 1 | Door 2 | Door 3 | Result if you don't switch | Result if you switch |
---|---|---|---|---|
Car | Goat | Goat | Wins Car | Loses Car |
Goat | Car | Goat | Loses Car | Wins Car |
Goat | Goat | Car | Loses Car | Wins Car |
If you switch doors, you will win the car with probability $2/3$, so switching is a better strategy than staying with your original door.
Let's examine the problem with Bayes' theorem now. Assume that you choose Door 1, and the host opens Door 2. Let $A $, $B $, and $C $ denote the hypotheses that the car is behind Door 1, Door 2, and Door 3 respectively, and let $D $ be the event that the host opens Door 2. Then:
- If the car is behind Door 1, the host can open Door 2 or Door 3, each with probability $\frac{1}{2}$. So $$ p(A|D)=\frac{\frac{1}{2}\times\frac{1}{3}}{\frac{1}{2}\times\frac{1}{3}+0\times\frac{1}{3}+1\times\frac{1}{3}}=\frac{1}{3}. $$
- If the car is behind Door 2, the host will not open Door 2. So $$ p(B|D)=\frac{0\times\frac{1}{3}}{\frac{1}{2}\times\frac{1}{3}+0\times\frac{1}{3}+1\times\frac{1}{3}}=0. $$
- If the car is behind Door 3, the host will open Door 2 every time. So $$ p(C|D)=\frac{1\times\frac{1}{3}}{\frac{1}{2}\times\frac{1}{3}+0\times\frac{1}{3}+1\times\frac{1}{3}}=\frac{2}{3}. $$
Hypothesis | Prior probability $p(H) $ | Likelihood $p(D|H) $ | Posterior probability $p(H|D) $ |
---|---|---|---|
A | 1/3 | 1/2 | 1/3 |
B | 1/3 | 0 | 0 |
C | 1/3 | 1 | 2/3 |
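The posterior column of the table can be reproduced numerically; a minimal sketch using exact fractions (the dictionary layout is my own):

```python
from fractions import Fraction

# D = the host opens Door 2 after we picked Door 1.
priors = {"A": Fraction(1, 3), "B": Fraction(1, 3), "C": Fraction(1, 3)}
# Likelihood of the host opening Door 2 under each hypothesis.
likelihoods = {"A": Fraction(1, 2), "B": Fraction(0), "C": Fraction(1)}

evidence = sum(priors[h] * likelihoods[h] for h in priors)  # p(D) = 1/2
posteriors = {h: priors[h] * likelihoods[h] / evidence for h in priors}
print(posteriors)  # A: 1/3, B: 0, C: 2/3
```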
Differences between Frequentist and Bayesian Statistics
Regardless of the statistical approach, any inference paradigm deals with three things:
- The quantities which we want to estimate or test, called parameters.
- Observed and/or collected data.
- Models which help us to relate data and parameters.
In frequentist statistics:
- Data are random because they are observations of stochastic processes.
- Model parameters are fixed because they remain constant during the sampling process.
In Bayesian statistics:
- Data are fixed.
- Model parameters are treated and described probabilistically; they are defined via probability distributions.
Bayesian Inference
Bayesian inference can be described in three steps: Source: A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari,D. B. Rubin, Bayesian data analysis, third edition, 2013.
- Setting up a full probability model: a joint probability distribution must be defined for all observed and unobserved quantities. Suppose $y $ is the observed data and $\theta $ is the parameters. Then the joint probability distribution is defined as $$ p(\theta, y) = p(y|\theta)p(\theta). $$ Here, $p(\theta) $ is the prior distribution and $p(y|\theta) $ is the sampling distribution.
- Calculating the posterior distribution: after the data are observed, the posterior distribution is calculated with Bayes' theorem: $$ p(\theta|y)=\frac{p(\theta)p(y|\theta)}{p(y)}\propto p(\theta)p(y|\theta) $$
- Model diagnostics: How well does the model fit the data? How sensitive are the results to the modeling assumptions in step 1? We will dive into tests a few chapters later.
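As a toy illustration of the first two steps, here is a Beta-Binomial model in which the posterior is available in closed form (the hyperparameters and data are made up for illustration):

```python
# Step 1: full probability model -- a Beta(a, b) prior on theta and a
# Binomial(n, theta) sampling distribution for the observed successes y.
a, b = 2, 2    # prior hyperparameters (an assumption for illustration)
y, n = 7, 10   # observed data: 7 successes in 10 trials

# Step 2: posterior via Bayes' theorem. The Beta prior is conjugate to the
# Binomial likelihood, so p(theta | y) is Beta(a + y, b + n - y) exactly.
a_post, b_post = a + y, b + n - y
posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, round(posterior_mean, 3))  # prints 9 5 0.643
```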
The posterior distribution can be summarized with point estimates and intervals:
- Posterior mode $$ \hat{\theta}=\arg\max_\theta\, p(\theta|y) $$
- Posterior mean $$ \hat{\theta}=\mathbb{E}[\theta|y]=\int \theta\, p(\theta|y)\, d\theta $$
- (Credible interval) A $100(1-\alpha)\%$ credible interval for a parameter $\theta $ is an interval $[a, b] $ such that the probability that $\theta $ lies in the interval is $1-\alpha$ Source: Link: $$ p(\theta\in[a,b])=1-\alpha $$ If the parameter vector has dimension greater than one, this is called a credible region.
- The highest density region is the region containing $\theta$ with $100(1-\alpha)\%$ probability that has the shortest total length.
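Given posterior draws (e.g. from an MCMC sampler), these summaries can be estimated directly. A sketch with simulated stand-in draws, using an equal-tailed interval rather than an HDI:

```python
import random

random.seed(1)
# Stand-in posterior draws; in practice these come from a sampler.
draws = sorted(random.gauss(5.0, 1.0) for _ in range(10_000))

post_mean = sum(draws) / len(draws)  # posterior mean estimate
alpha = 0.05                         # 95% equal-tailed credible interval
lo = draws[int(len(draws) * alpha / 2)]
hi = draws[int(len(draws) * (1 - alpha / 2)) - 1]
print(round(post_mean, 2), round(lo, 1), round(hi, 1))
```

The interval here cuts $\alpha/2$ probability from each tail; an HDI would instead search for the shortest interval with $1-\alpha$ coverage, which coincides with this one only for symmetric unimodal posteriors.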
Prior Distribution Recommendations
Stan (a Bayesian inference package) has a great wiki about selecting prior distributions for analyses.
Historically, due to a lack of computational power and methods, conjugate priors were mostly selected for ease of calculating the posterior. Today, with faster computers and MCMC sampling methods, nearly any posterior can be sampled.
Model Diagnostics
Due to the nature of MCMC algorithms, we need to run some posterior diagnostic tests. These tests are:
- Checking traceplots: sampling is done with multiple chains, and all chains are expected to give similar results.
- Checking convergence: All chains must converge.
- Effective sample size (ESS): Suppose $m$ is the number of chains, $n $ is the length of a chain, $\hat{\rho}_t$ is the estimated autocorrelation at lag $t $, and $T $ is the first positive odd integer for which $\hat{\rho}_{T+1}+\hat{\rho}_{T+2}$ is negative. Then the effective sample size is defined as $$ \hat{n}_\text{eff}=\frac{mn}{1+2\sum_{t=1}^T\hat{\rho}_t}. $$ In practice, it is expected that $\frac{\hat{n}_\text{eff}}{mn}>0.001.$
- Gelman-Rubin statistic: it is defined as the ratio of the between-chain variance to the within-chain variance. We should run our chains until this value is close to 1.
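A minimal sketch of the classic Gelman-Rubin statistic, without the split-chain refinement used by modern samplers (all names are my own):

```python
import random
import statistics

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat for m chains of length n."""
    n = len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    W = statistics.fmean(statistics.variance(c) for c in chains)  # within-chain
    B = n * statistics.variance(means)                            # between-chain
    var_hat = (n - 1) / n * W + B / n  # pooled posterior variance estimate
    return (var_hat / W) ** 0.5

random.seed(0)
# Four well-mixed chains drawn from the same distribution -> R-hat near 1.
chains = [[random.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
rhat = gelman_rubin(chains)
print(round(rhat, 3))
```

For chains stuck in different modes, the between-chain term $B$ inflates `var_hat` and R-hat rises well above 1, which is the signal to keep sampling or rethink the model.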
Information Criteria
To compare and select the best model for our data, we need to score models. To compare Bayesian models, the following statistics can be used:
- Within-sample predictive accuracy (log pointwise predictive density) $$ \text{lppd}=\sum_{i=1}^N \log\int p(y_i|\theta)\,p_{\text{post}}(\theta)\,d\theta\approx\sum_{i=1}^N \log\left(\frac{1}{M}\sum_{j=1}^M p(y_i|\theta^{(j)})\right) $$
- Widely Applicable Information Criterion (WAIC) $$ \text{WAIC}=-2(\text{lppd} - p_{\text{WAIC}}) $$ where the effective number of parameters is $$ p_{\text{WAIC}}=2\sum_{i=1}^N\left[\log\left(\frac{1}{M}\sum_{j=1}^M p(y_i|\theta^{(j)})\right)-\frac{1}{M}\sum_{j=1}^M \log p(y_i|\theta^{(j)})\right] $$
- Leave-One-Out Cross-Validation (LOO) $$ \text{loo} = \sum_{i=1}^n \log p(y_i|y_{-i}) $$ where $$ p(y_i|y_{-i})=\int p(y_i|\theta)p(\theta|y_{-i})d\theta $$ is the leave-one-out predictive density given the data without the $i$-th data point.
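The lppd and $p_{\text{WAIC}}$ sums can be estimated directly from posterior draws. A toy sketch with a Normal likelihood and stand-in draws (all names and data are made up; a real analysis would use actual sampler output):

```python
import math
import random

random.seed(0)
data = [random.gauss(0, 1) for _ in range(50)]     # toy observations
mus = [random.gauss(0, 0.1) for _ in range(200)]   # stand-in posterior draws of mu

def norm_pdf(x, mu):
    """Normal(mu, 1) density."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

lppd = 0.0
p_waic = 0.0
for y_i in data:
    dens = [norm_pdf(y_i, mu) for mu in mus]            # p(y_i | theta^(j))
    log_mean = math.log(sum(dens) / len(dens))          # log of posterior-mean density
    mean_log = sum(math.log(d) for d in dens) / len(dens)
    lppd += log_mean
    p_waic += 2 * (log_mean - mean_log)

waic = -2 * (lppd - p_waic)  # deviance-scale WAIC: lower is better
print(round(lppd, 1), round(p_waic, 2), round(waic, 1))
```

By Jensen's inequality the per-point term `log_mean - mean_log` is nonnegative, so `p_waic` always penalizes rather than rewards.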
Hypothesis Tests & Significance Tests
Suppose that $H_0 $ is the null hypothesis and $H_1 $ is the alternative hypothesis. From Bayes' theorem, we can calculate the corresponding posterior probabilities, $p(H_0|D)=\frac{p(D|H_0)p(H_0)}{p(D)}$ and $p(H_1|D)=\frac{p(D|H_1)p(H_1)}{p(D)}$, respectively. We can then calculate the posterior odds $$ \frac{p(H_0|D)}{p(H_1|D)} = \frac{p(D|H_0)}{p(D|H_1)}\frac{p(H_0)}{p(H_1)} $$ In this expression, the ratio $\frac{p(D|H_0)}{p(D|H_1)}$ is called the Bayes factor. In the table below, you can find interpretations of Bayes factors.
$\log_{10}(\text{Bayes Factor})$ | Bayes Factor | Evidence (for $H_1$) |
---|---|---|
[-Inf,0[ | [0,1[ | Negative |
[0, 0.5[ | [1, 3.2[ | Weak |
[0.5, 1[ | [3.2, 10[ | Substantial |
[1, 1.5[ | [10, 32[ | Strong |
[1.5, 2[ | [32, 100[ | Very Strong |
[2, +Inf[ | [100, +Inf[ | Decisive |
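As a toy example of a Bayes factor for two simple (point) hypotheses with a binomial likelihood (the hypotheses and data are made up):

```python
from math import comb

# Data: y = 6 heads in n = 10 flips.
# H0: fair coin (theta = 0.5); H1: biased coin (theta = 0.7).
def binom_lik(y, n, theta):
    """Binomial likelihood p(y | n, theta)."""
    return comb(n, y) * theta ** y * (1 - theta) ** (n - y)

bf_01 = binom_lik(6, 10, 0.5) / binom_lik(6, 10, 0.7)  # Bayes factor, H0 vs H1
posterior_odds = bf_01 * 1.0  # equal prior odds p(H0)/p(H1) = 1
print(round(bf_01, 2))  # slightly above 1: the data barely favor H0
```

With equal prior odds the posterior odds equal the Bayes factor, so here neither hypothesis gains meaningful support ("negative/weak" territory in the table above).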
For testing parameter significance, we can use the region of practical equivalence (ROPE). In this method, the null value is declared rejected if the (89% or 95%) HDI falls completely outside the ROPE, and accepted if the HDI falls completely inside the ROPE. But how do we set the limits of the ROPE? How big is "practically equivalent to the null value"? Since it depends on the particular application domain and its current best practices, there is typically no single right answer to this question. Source: Link
Bayesian Regression Models
Linear regression is the basis of many models in statistical analysis, so we will start with it. A linear regression model with one dependent and one independent variable can be written as $$ y=\beta_0 + \beta_1 x + \epsilon, \quad \epsilon\sim\text{Normal}(0,\sigma^2) $$ Here, $\epsilon $ is the error term, estimated by the residuals $y-\hat{y} $. In Bayesian linear regression, prior distributions for the intercept and coefficients are often chosen as Normal or Student-t distributions.
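A grid-approximation sketch of Bayesian linear regression on simulated data (the priors, grid, and data are assumptions for illustration; a real analysis would use an MCMC sampler):

```python
import random

random.seed(0)
# Simulated data from y = 1 + 2x + Normal(0, 1) noise.
xs = [i / 10 for i in range(50)]
ys = [1 + 2 * x + random.gauss(0, 1) for x in xs]

# Grid approximation of the posterior over (b0, b1), with Normal(0, 10)
# priors on both coefficients and known noise sigma = 1.
grid = [i / 20 for i in range(-40, 81)]  # candidate values in [-2, 4]

def log_post(b0, b1):
    lp = -(b0 ** 2 + b1 ** 2) / (2 * 10 ** 2)  # log prior (up to a constant)
    lp += sum(-0.5 * (y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))  # log lik
    return lp

# Posterior mode over the grid; should land near the true (1, 2).
b0_hat, b1_hat = max(((b0, b1) for b0 in grid for b1 in grid),
                     key=lambda p: log_post(*p))
print(b0_hat, b1_hat)
```

Normalizing `log_post` over the grid would also give approximate posterior means and credible intervals, not just the mode.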
ANOVA models can be seen as linear regression models where the independent variables are dummy variables. Similarly, ANCOVA models can be seen as linear regression models where the independent variables may be both dummy and continuous variables. For Bayesian ANOVA and ANCOVA models, the R2 family of priors can be used. Source: Link
Linear mixed effect models are used to analyze grouped or hierarchical data. These models can be classified as follows Source: F. Korner-Nievergelt, T. Roth, S. von Felten, J. Guélat, B. Al-masi, P. Korner-Nievergelt, Bayesian Data Analysis in Ecology Using Linear Models with R, BUGS, and Stan, 2015. :
- Complete pooling: Group structure is omitted. $$ \begin{aligned} \hat{y}_i &= \beta_0 \\ y_i&\sim \text{Normal}(\hat{y}_i, \sigma^2) \end{aligned} $$
- Partial pooling: Group means are weighted averages of the population mean and the unpooled group means. $$ \begin{aligned} \hat{y}_i &= \beta_0 + b_{g_i} \\ y_i&\sim \text{Normal}(\hat{y}_i, \sigma^2) \\ b_g&\sim \text{Normal}(0, \sigma_b^2) \end{aligned} $$
- No pooling: Group means are estimated separately for each group. $$ \begin{aligned} \hat{y}_i &= \beta_{g_i} \\ y_i&\sim \text{Normal}(\hat{y}_i, \sigma_{g_i}^2) \end{aligned} $$ Hyperpriors or multivariate distributions can be used as prior distributions for the parameters of these models.
What if our dependent variables are not continuous but binary variables, ratio variables, or count variables? In this case, generalized linear models can be used to model our variables. A generalized linear model consists of three elements Source: Link:
- A distribution for modelling the dependent variable $Y$,
- A linear predictor $\eta=X\beta$,
- A link function $g $ such that $\mathbb{E}(Y|X)=\mu=g^{-1}(\eta) $.
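A minimal numeric illustration of the three elements for logistic regression (the coefficients are hypothetical):

```python
import math

# Element 1: Y is Bernoulli-distributed with mean mu.
# Element 3: the logit link, whose inverse maps eta to mu in (0, 1).
def inv_logit(eta):
    return 1 / (1 + math.exp(-eta))

beta = [-1.0, 0.8]           # hypothetical intercept and slope
x = 2.5
eta = beta[0] + beta[1] * x  # Element 2: linear predictor eta = X beta
mu = inv_logit(eta)          # E(Y | X) = g^{-1}(eta)
print(round(mu, 3))          # prints 0.731
```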
This text only examines models for cross-sectional data. Time-series and longitudinal data models will not be explained here.
Thanks for reading! I hope you found this article informative and insightful. If you have any questions or feedback, please don't hesitate to reach out to me. I always appreciate hearing from my readers and learning how I can improve my writing. If you notice any errors or inaccuracies, please notify me, and I'll make the necessary corrections. How to contact me: E-mail LinkedIn