Fundamental Definitions
Probability theory is a mathematical framework that allows us to reason about experiments whose outcomes are uncertain. A probabilistic model is a mathematical model of a probabilistic experiment that satisfies certain mathematical properties (the axioms of probability theory), and which allows us to calculate probabilities and to reason about the likely outcomes of the experiment. Formally, a probabilistic model is a triple $(\Omega, \mathcal{F}, \mathbb{P})$, where: Source: Link
- $\Omega $ is the sample space, the set of possible experiment outcomes.
- $\mathcal{F} $ is a $\sigma $-algebra, a collection of subsets of $\Omega $.
- $\mathbb{P} $ is a probability measure, a function that assigns a nonnegative probability to every set in the $\sigma $-algebra $\mathcal{F} $. Source: Link
We will not give a measure-theoretic introduction here. For a measure-theoretic treatment of probability theory, please see 2.
Bayes' theorem states that $$ p(A|B) = \frac{p(B|A)\,p(A)}{p(B)}. $$ Here:
- $p(A|B) $ is the probability of event $A $ occurring given that $B $ is true. It is also called the posterior probability of $A $ given $B $. $p(B|A) $ is the probability of event $B $ occurring given that $A $ is true. It can also be interpreted as the likelihood of $A $ for a fixed $B $.
- $p(A) $ and $p(B) $ are the probabilities of observing $A $ and $B $ respectively, without any conditions; they are known as the prior probability of $A $ and the marginal probability of $B $. Source: Link
Suppose that we have a medical test with the following properties:
- The disease occurs in $1\% $ of the general population,
- If you have the disease, the test shows that you do with probability $80\% $,
- If you don't have the disease, it shows that you do with probability $9.6\% $.
Let's use the following notation: $H $ is being healthy, $S $ is being sick, $+ $ is a positive test result, and $- $ is a negative test result. Then the probability of being sick given a positive test can be calculated as $$ \begin{aligned} p(S|+) &= \frac{p(+|S)p(S)}{p(+)} \\ &= \frac{0.8\times 0.01}{0.01\times0.8 + 0.99\times0.096} \\ &\approx 0.078 \end{aligned} $$ So, the probability of being sick is about $7.8\%$. Suppose the doctor wants to re-test the patient to be sure. If we update the prior probability of being sick to $7.8\%$, then the probability of being sick after a second positive test becomes $$ p(S|+) = \frac{0.8\times 0.078}{0.078\times0.8 + 0.922\times0.096} \approx 0.41 $$ So if the second test is also positive, the probability of being sick is nearly $41\%$. This example shows the power of Bayes' theorem: it allows us to update our beliefs as new data arrives.
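The two updates above can be reproduced with a short sketch in pure Python (the function and variable names are my own):

```python
def bayes_update(prior, sensitivity, false_positive):
    """Posterior probability of disease after one positive test."""
    evidence = sensitivity * prior + false_positive * (1 - prior)
    return sensitivity * prior / evidence

p1 = bayes_update(0.01, 0.80, 0.096)  # first positive test
p2 = bayes_update(p1, 0.80, 0.096)    # second test, prior replaced by p1
print(round(p1, 3), round(p2, 3))     # prints 0.078 0.412
```

Note that the second call simply feeds the first posterior back in as the new prior, which is exactly the belief-updating step described above.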
There is another way to interpret Bayes' theorem: the diachronic interpretation. Suppose that $H $ is our hypothesis and $D $ is our data. Then Bayes' theorem can be written as $$ p(H|D) = \frac{p(D|H)p(H)}{p(D)}. $$ Here Source: Link:
- $p(H) $ is the prior probability, which is the probability of hypothesis $H $ before data $D $ is observed or collected.
- $p(D|H) $ is the probability of the data under the hypothesis, called the likelihood.
- $p(D) $ is the probability of the data under any hypothesis, called the normalizing constant. This constant makes the posterior density integrate to one.
- $p(H|D) $ is the probability of the hypothesis after data is collected, called the posterior probability.
As a second example, let's consider the famous Monty Hall problem. The problem was stated as follows in Parade magazine in 1990:
Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, "Do you want to pick door No. 2?" Is it to your advantage to switch your choice? Source: Link
Let's look at the probability table of the problem:
Door 1 | Door 2 | Door 3 | Result if you don't switch | Result if you switch |
---|---|---|---|---|
Car | Goat | Goat | Wins Car | Loses Car |
Goat | Car | Goat | Loses Car | Wins Car |
Goat | Goat | Car | Loses Car | Wins Car |
If you switch doors, you will win the car with probability $2/3$, so switching is a better strategy than staying with your original door.
Let's examine the problem with Bayes' theorem now. Assume that you choose Door 1, and the host opens Door 2. Let $A $, $B $, and $C $ denote the hypotheses that the car is behind Door 1, Door 2, and Door 3 respectively, and let $D $ be the event that the host opens Door 2. Then:
- If the car is behind Door 1, the host can open Door 2 or Door 3, each with probability $\frac{1}{2}$. So $$ p(A|D)=\frac{\frac{1}{2}\times\frac{1}{3}}{\frac{1}{2}\times\frac{1}{3}+0\times\frac{1}{3}+1\times\frac{1}{3}}=\frac{1}{3}. $$
- If the car is behind Door 2, the host will not open Door 2. So $$ p(B|D)=\frac{0\times\frac{1}{3}}{\frac{1}{2}\times\frac{1}{3}+0\times\frac{1}{3}+1\times\frac{1}{3}}=0. $$
- If the car is behind Door 3, the host will open Door 2 every time. So $$ p(C|D)=\frac{1\times\frac{1}{3}}{\frac{1}{2}\times\frac{1}{3}+0\times\frac{1}{3}+1\times\frac{1}{3}}=\frac{2}{3}. $$
Hypothesis | Prior probability $p(H) $ | Likelihood $p(D|H) $ | Posterior probability $p(H|D) $ |
---|---|---|---|
A | 1/3 | 1/2 | 1/3 |
B | 1/3 | 0 | 0 |
C | 1/3 | 1 | 2/3 |
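The posterior column of the table can be reproduced numerically; a minimal sketch using exact fractions (the dictionary layout is my own):

```python
from fractions import Fraction

# D = the host opens Door 2 after we picked Door 1.
priors = {"A": Fraction(1, 3), "B": Fraction(1, 3), "C": Fraction(1, 3)}
# Likelihood of the host opening Door 2 under each hypothesis.
likelihoods = {"A": Fraction(1, 2), "B": Fraction(0), "C": Fraction(1)}

evidence = sum(priors[h] * likelihoods[h] for h in priors)  # p(D) = 1/2
posteriors = {h: priors[h] * likelihoods[h] / evidence for h in priors}
print(posteriors)  # A: 1/3, B: 0, C: 2/3
```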
Differences between Frequentist and Bayesian Statistics
Regardless of the statistical approach, any inference paradigm deals with three things:
- The quantities which we want to estimate or test, called parameters.
- Observed and/or collected data.
- Models which help us to relate data and parameters.
In frequentist statistics:
- Data are random because they are observations of stochastic processes.
- Model parameters are fixed because they remain constant during the sampling process.
In Bayesian statistics:
- Data are fixed.
- Model parameters are treated and described probabilistically; they are defined via probability distributions.
Bayesian Inference
Bayesian inference can be described in three steps: Source: A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari,D. B. Rubin, Bayesian data analysis, third edition, 2013.
- Setting up a full probability model: a joint probability distribution must be defined for all observed and unobserved quantities. Suppose $y $ is the observed data and $\theta $ is the parameters. Then the joint probability distribution is defined as $$ p(\theta, y) = p(y|\theta)p(\theta). $$ Here, $p(\theta) $ is the prior distribution and $p(y|\theta) $ is the sampling distribution.
- Calculating the posterior distribution: after the data are observed, the posterior distribution is calculated with Bayes' theorem: $$ p(\theta|y)=\frac{p(\theta)p(y|\theta)}{p(y)}\propto p(\theta)p(y|\theta) $$
- Model diagnostics: How well does the model fit the data? How sensitive are the results to the modeling assumptions in step 1? We will dive into tests a few chapters later.
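As a toy illustration of the first two steps, here is a Beta-Binomial model in which the posterior is available in closed form (the hyperparameters and data are made up for illustration):

```python
# Step 1: full probability model -- a Beta(a, b) prior on theta and a
# Binomial(n, theta) sampling distribution for the observed successes y.
a, b = 2, 2    # prior hyperparameters (an assumption for illustration)
y, n = 7, 10   # observed data: 7 successes in 10 trials

# Step 2: posterior via Bayes' theorem. The Beta prior is conjugate to the
# Binomial likelihood, so p(theta | y) is Beta(a + y, b + n - y) exactly.
a_post, b_post = a + y, b + n - y
posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, round(posterior_mean, 3))  # prints 9 5 0.643
```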
The posterior distribution can be summarized with point estimates and intervals:
- Posterior mode $$ \hat{\theta}=\arg\max_\theta\, p(\theta|y) $$
- Posterior mean $$ \hat{\theta}=\mathbb{E}[\theta|y]=\int \theta\, p(\theta|y)\, d\theta $$
- (Credible interval) A $100(1-\alpha)\%$ credible interval for a parameter $\theta $ is an interval $[a, b] $ such that the probability that $\theta $ lies in the interval is $1-\alpha$ Source: Link: $$ p(\theta\in[a,b])=1-\alpha $$ If the parameter vector has dimension greater than one, this is called a credible region.
- The highest density region is the region containing $\theta$ with $100(1-\alpha)\%$ probability that has the shortest total length.
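Given posterior draws (e.g. from an MCMC sampler), these summaries can be estimated directly. A sketch with simulated stand-in draws, using an equal-tailed interval rather than an HDI:

```python
import random

random.seed(1)
# Stand-in posterior draws; in practice these come from a sampler.
draws = sorted(random.gauss(5.0, 1.0) for _ in range(10_000))

post_mean = sum(draws) / len(draws)  # posterior mean estimate
alpha = 0.05                         # 95% equal-tailed credible interval
lo = draws[int(len(draws) * alpha / 2)]
hi = draws[int(len(draws) * (1 - alpha / 2)) - 1]
print(round(post_mean, 2), round(lo, 1), round(hi, 1))
```

The interval here cuts $\alpha/2$ probability from each tail; an HDI would instead search for the shortest interval with $1-\alpha$ coverage, which coincides with this one only for symmetric unimodal posteriors.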
Prior Distribution Recommendations
Stan (a Bayesian inference package) has a great wiki about selecting prior distributions for analyses.
Historically, due to a lack of computational power and methods, conjugate priors were mostly selected for ease of calculating the posterior. Today, with faster computers and MCMC sampling methods, nearly any posterior can be sampled.
Model Diagnostics
Due to the nature of MCMC algorithms, we need to run some posterior diagnostic tests. These tests are:
- Checking traceplots: sampling is done with multiple chains, and all chains are expected to give similar results.
- Checking convergence: All chains must converge.
- Effective sample size (ESS): Suppose $m$ is the number of chains, $n $ is the length of a chain, $\hat{\rho}_t$ is the estimated autocorrelation at lag $t $, and $T $ is the first positive odd integer for which $\hat{\rho}_{T+1}+\hat{\rho}_{T+2}$ is negative. Then the effective sample size is defined as $$ \hat{n}_\text{eff}=\frac{mn}{1+2\sum_{t=1}^T\hat{\rho}_t}. $$ In practice, it is expected that $\frac{\hat{n}_\text{eff}}{mn}>0.001.$
- Gelman-Rubin statistic: it is defined as the ratio of the between-chain variance to the within-chain variance. We should run our chains until this value is close to 1.
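A minimal sketch of the classic Gelman-Rubin statistic, without the split-chain refinement used by modern samplers (all names are my own):

```python
import random
import statistics

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat for m chains of length n."""
    n = len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    W = statistics.fmean(statistics.variance(c) for c in chains)  # within-chain
    B = n * statistics.variance(means)                            # between-chain
    var_hat = (n - 1) / n * W + B / n  # pooled posterior variance estimate
    return (var_hat / W) ** 0.5

random.seed(0)
# Four well-mixed chains drawn from the same distribution -> R-hat near 1.
chains = [[random.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
rhat = gelman_rubin(chains)
print(round(rhat, 3))
```

For chains stuck in different modes, the between-chain term $B$ inflates `var_hat` and R-hat rises well above 1, which is the signal to keep sampling or rethink the model.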
Information Criteria
To compare and select the best model for our data, we need to score models. To compare Bayesian models, the following statistics can be used:
- Within-sample predictive accuracy (log pointwise predictive density) $$ \text{lppd}=\sum_{i=1}^N \log\int p(y_i|\theta)\,p_{\text{post}}(\theta)\,d\theta\approx\sum_{i=1}^N \log\left(\frac{1}{M}\sum_{j=1}^M p(y_i|\theta^{(j)})\right) $$
- Widely Applicable Information Criterion (WAIC) $$ \text{WAIC}=-2(\text{lppd} - p_{\text{WAIC}}) $$ where the effective number of parameters is $$ p_{\text{WAIC}}=2\sum_{i=1}^N\left[\log\left(\frac{1}{M}\sum_{j=1}^M p(y_i|\theta^{(j)})\right)-\frac{1}{M}\sum_{j=1}^M \log p(y_i|\theta^{(j)})\right] $$
- Leave-One-Out Cross-Validation (LOO) $$ \text{loo} = \sum_{i=1}^n \log p(y_i|y_{-i}) $$ where $$ p(y_i|y_{-i})=\int p(y_i|\theta)p(\theta|y_{-i})d\theta $$ is the leave-one-out predictive density given the data without the $i$-th data point.
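The lppd and $p_{\text{WAIC}}$ sums can be estimated directly from posterior draws. A toy sketch with a Normal likelihood and stand-in draws (all names and data are made up; a real analysis would use actual sampler output):

```python
import math
import random

random.seed(0)
data = [random.gauss(0, 1) for _ in range(50)]     # toy observations
mus = [random.gauss(0, 0.1) for _ in range(200)]   # stand-in posterior draws of mu

def norm_pdf(x, mu):
    """Normal(mu, 1) density."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

lppd = 0.0
p_waic = 0.0
for y_i in data:
    dens = [norm_pdf(y_i, mu) for mu in mus]            # p(y_i | theta^(j))
    log_mean = math.log(sum(dens) / len(dens))          # log of posterior-mean density
    mean_log = sum(math.log(d) for d in dens) / len(dens)
    lppd += log_mean
    p_waic += 2 * (log_mean - mean_log)

waic = -2 * (lppd - p_waic)  # deviance-scale WAIC: lower is better
print(round(lppd, 1), round(p_waic, 2), round(waic, 1))
```

By Jensen's inequality the per-point term `log_mean - mean_log` is nonnegative, so `p_waic` always penalizes rather than rewards.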
Hypothesis Tests & Significance Tests
Suppose that $H_0 $ is the null hypothesis and $H_1 $ is the alternative hypothesis. From Bayes' theorem, we can calculate the corresponding posterior probabilities, $p(H_0|D)=\frac{p(D|H_0)p(H_0)}{p(D)}$ and $p(H_1|D)=\frac{p(D|H_1)p(H_1)}{p(D)}$, respectively. We can then calculate the posterior odds $$ \frac{p(H_0|D)}{p(H_1|D)} = \frac{p(D|H_0)}{p(D|H_1)}\frac{p(H_0)}{p(H_1)} $$ In this expression, the ratio $\frac{p(D|H_0)}{p(D|H_1)}$ is called the Bayes factor. In the table below, you can find interpretations of Bayes factors.
$\log_{10}(\text{Bayes Factor})$ | Bayes Factor | Evidence (for $H_1$) |
---|---|---|
[-Inf,0[ | [0,1[ | Negative |
[0, 0.5[ | [1, 3.2[ | Weak |
[0.5, 1[ | [3.2, 10[ | Substantial |
[1, 1.5[ | [10, 32[ | Strong |
[1.5, 2[ | [32, 100[ | Very Strong |
[2, +Inf[ | [100, +Inf[ | Decisive |
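As a toy example of a Bayes factor for two simple (point) hypotheses with a binomial likelihood (the hypotheses and data are made up):

```python
from math import comb

# Data: y = 6 heads in n = 10 flips.
# H0: fair coin (theta = 0.5); H1: biased coin (theta = 0.7).
def binom_lik(y, n, theta):
    """Binomial likelihood p(y | n, theta)."""
    return comb(n, y) * theta ** y * (1 - theta) ** (n - y)

bf_01 = binom_lik(6, 10, 0.5) / binom_lik(6, 10, 0.7)  # Bayes factor, H0 vs H1
posterior_odds = bf_01 * 1.0  # equal prior odds p(H0)/p(H1) = 1
print(round(bf_01, 2))  # slightly above 1: the data barely favor H0
```

With equal prior odds the posterior odds equal the Bayes factor, so here neither hypothesis gains meaningful support ("negative/weak" territory in the table above).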
For testing parameter significance, we can use the region of practical equivalence (ROPE). In this method, the null value is declared rejected if the (89% or 95%) HDI falls completely outside the ROPE, and accepted if the HDI falls completely inside the ROPE. But how do we set the limits of the ROPE? How big is "practically equivalent to the null value"? Since it depends on the particular application domain and its current best practices, there is typically no single right answer to this question. Source: Link
Bayesian Regression Models
Linear regression is the basis of many models in statistical analysis, so we will start with it. A linear regression model with one dependent and one independent variable can be written as $$ y=\beta_0 + \beta_1 x + \epsilon, \quad \epsilon\sim\text{Normal}(0,\sigma^2) $$ Here, $\epsilon $ is the error term, estimated by the residuals $y-\hat{y} $. In Bayesian linear regression, prior distributions for the intercept and coefficients are often chosen as Normal or Student-t distributions.
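A grid-approximation sketch of Bayesian linear regression on simulated data (the priors, grid, and data are assumptions for illustration; a real analysis would use an MCMC sampler):

```python
import random

random.seed(0)
# Simulated data from y = 1 + 2x + Normal(0, 1) noise.
xs = [i / 10 for i in range(50)]
ys = [1 + 2 * x + random.gauss(0, 1) for x in xs]

# Grid approximation of the posterior over (b0, b1), with Normal(0, 10)
# priors on both coefficients and known noise sigma = 1.
grid = [i / 20 for i in range(-40, 81)]  # candidate values in [-2, 4]

def log_post(b0, b1):
    lp = -(b0 ** 2 + b1 ** 2) / (2 * 10 ** 2)  # log prior (up to a constant)
    lp += sum(-0.5 * (y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))  # log lik
    return lp

# Posterior mode over the grid; should land near the true (1, 2).
b0_hat, b1_hat = max(((b0, b1) for b0 in grid for b1 in grid),
                     key=lambda p: log_post(*p))
print(b0_hat, b1_hat)
```

Normalizing `log_post` over the grid would also give approximate posterior means and credible intervals, not just the mode.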
ANOVA models can be seen as linear regression models where the independent variables are dummy variables. Similarly, ANCOVA models can be seen as linear regression models where the independent variables may be both dummy and continuous variables. For Bayesian ANOVA and ANCOVA models, the R2 family of priors can be used. Source: Link
Linear mixed effect models are used to analyze grouped or hierarchical data. These models can be classified as follows Source: F. Korner-Nievergelt, T. Roth, S. von Felten, J. Guélat, B. Al-masi, P. Korner-Nievergelt, Bayesian Data Analysis in Ecology Using Linear Models with R, BUGS, and Stan, 2015. :
- Complete pooling: Group structure is omitted. $$ \begin{aligned} \hat{y}_i &= \beta_0 \\ y_i&\sim \text{Normal}(\hat{y}_i, \sigma^2) \end{aligned} $$
- Partial pooling: Group means are weighted averages of the population mean and the unpooled group means. $$ \begin{aligned} \hat{y}_i &= \beta_0 + b_{g_i} \\ y_i&\sim \text{Normal}(\hat{y}_i, \sigma^2) \\ b_g&\sim \text{Normal}(0, \sigma_b^2) \end{aligned} $$
- No pooling: Group means are estimated separately for each group. $$ \begin{aligned} \hat{y}_i &= \beta_{g_i} \\ y_i&\sim \text{Normal}(\hat{y}_i, \sigma_{g_i}^2) \end{aligned} $$ Hyperpriors or multivariate distributions can be used as prior distributions for the parameters of these models.
What if our dependent variables are not continuous but binary variables, ratio variables, or count variables? In this case, generalized linear models can be used to model our variables. A generalized linear model consists of three elements Source: Link:
- A distribution for modelling the dependent variable $Y$,
- A linear predictor $\eta=X\beta$,
- A link function $g $ such that $\mathbb{E}(Y|X)=\mu=g^{-1}(\eta) $.
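A minimal numeric illustration of the three elements for logistic regression (the coefficients are hypothetical):

```python
import math

# Element 1: Y is Bernoulli-distributed with mean mu.
# Element 3: the logit link, whose inverse maps eta to mu in (0, 1).
def inv_logit(eta):
    return 1 / (1 + math.exp(-eta))

beta = [-1.0, 0.8]           # hypothetical intercept and slope
x = 2.5
eta = beta[0] + beta[1] * x  # Element 2: linear predictor eta = X beta
mu = inv_logit(eta)          # E(Y | X) = g^{-1}(eta)
print(round(mu, 3))          # prints 0.731
```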
This text only examines models for cross-sectional data. Time-series and longitudinal data models will not be explained here.
Thanks for reading! I hope you found this article informative and insightful. If you have any questions or feedback, please don't hesitate to reach out to me. I always appreciate hearing from my readers and learning how I can improve my writing. If you notice any errors or inaccuracies, please notify me, and I'll make the necessary corrections. How to contact me: E-mail LinkedIn