Probabilistic Graphical Models Revision Notes

[Last Updated: 2019.10.27]

This note summarises the online course, Probabilistic Graphical Models Specialization on Coursera.
Any comments and suggestions are most welcome!


Table of Contents

Back to Table of Contents


Representations

Bayesian Network (directed graph)

Defination

A directed acyclic graph (DAG) whose nodes represents the random variables, $X_1$, $X_2$, … $X_n$; For each node, $X_i$, we have a conditional probability distribution (CPD): $p(x_i|Par_G(Xi))$.

$Par_{G(Xi)}$ is the parents of $X_i$.

The bayesian network represents a joint distribution via the chain rule:
$p(X_1, X_2, …, X_n) = \prod_i p(X_i | Par_G(X_i))$

Example:
Example
$p(D,I,G,S,L)=p(D)P(I)P(G|I,D)P(S|I)P(L|G)$
It is a legal distribution: $p \ge 0$ and $\sum_{D,I,G,S,L}p(D,I,G,S,L)=1$

Back to Table of Contents

Reasoning Patterns in Bayesian Network

Example
Causal Reasoning (top to bottom)
$p(l^1) \approx 0.5$, the probability of getting the reference letter(l=1) for a student when we don’t know any information about him, is 0.5.
$p(l^1 \mid i^0) \approx 0.39$, if we know the student’s intelligence is not excellent (i=0), the probablility of getting the reference letter (l=1) is lower than that of when we don’t know his intelligence level.

Evidence Reasoning (bottom to top)
Given the student’s grade of a course, that is grade C (g=3) which is not a good performance, we can infer the probability of, the course is a difficult one (d=1) and the student has high-level intelligence (i=1) as follows:
$p(d^1\mid g^3) \approx 0.63$
$p(i^1\mid g^3) \approx 0.08$

Intercausal Reasoning (intercausal reasoning between causes)
$p(i^1 \mid g^3, d^1) \approx 0.11$, given the student get grade C and the course is hard. Here is another example which is also the example for the case of “explaining away“:
$p(i^1) \approx 0.3$, if we don’t know any information about the student;
$p(i^1 \mid g^2) \approx 0.175$, given the student got grade B;
$p(i^1 \mid g^2,d^1) \approx 0.34$, given the student got grade B for this course and the course is hard.

We can see that: the variables D and I become conditionally dependent given their commen child G is observed, even if they are marginally independent, i.e. $p(D,I)=p(D)p(I)$ (when the commen child G is not observed).

Longer Path of Graph
We can also have longer reasoning on a graph:
Example1
$p(d^1)=0.4$
$p(d^1 \mid g^3) \approx 0.63$
$p(d^1 \mid g^3,s^1)\approx0.76$, $s^1$ is the student got a high SAT score.

Exmaple2
$p(i^1)=0.3$
$p(i^1\mid g^3)\approx0.08$
$p(i^1\mid g^3,s^1)\approx0.58$

Back to Table of Contents

Flow of Probabilistic Influence (Active Trial)

When can X influence Y (condition on X can change the beliefs\probablities of Y)?

X CAN influence Y:
$X\rightarrow Y$,
$X\leftarrow Y$
$X\rightarrow W \rightarrow Y$
$X\leftarrow W \leftarrow Y$
$X\leftarrow W \rightarrow Y$

X CANNOT influence Y:
$X\rightarrow W \leftarrow Y$ (V Structure)

Active Trails
A trial $X_1-X_2-…-X_n$ is active if it has no V-structures ($X_{i-1}\rightarrow X_i \leftarrow X_{i+1}$).

When can X infludence Y given evidence about Z which is observed?

X CAN influence Y:
$X\rightarrow Y$
$X\leftarrow Y$
$X\rightarrow W \rightarrow Y$ ($W\not\in Z$, i.e. $W$ is not observed)
$X\leftarrow W \leftarrow Y$ ($W\not\in Z$, i.e. $W$ is not observed)
$X\leftarrow W \rightarrow Y$ ($W\not\in Z$, i.e. $W$ is not observed)
$X\rightarrow W \leftarrow Y$ ($W\in Z$, i.e. $W$ is observed), either if $W$ or one of its descendants is in $Z$.

X CANNOT influence Y:
$X\rightarrow W \rightarrow Y$ ($W\in Z$, i.e. $W$ is observed)
$X\leftarrow W \leftarrow Y$ ($W\in Z$, i.e. $W$ is observed)
$X\leftarrow W \rightarrow Y$ ($W\in Z$, i.e. $W$ is observed)
$X\rightarrow W \leftarrow Y$ ($W\not\in Z$, i.e. $W$ is not observed), if $W$ and all its descendants are not observed.

Active Trails
A trial $X_1-X_2-…-X_n$ is active given $Z$ if:

  • for any V-structure ($X_{i-1}\rightarrow X_i \leftarrow X_{i+1}$), we have that $X_i$ or one of its descendants $\in Z$
  • no other $X_i$ is in $Z$ (i.e. $X_i$ is not in V-structure).
    Back to Table of Contents
Independencies
  • For events $\alpha$ and $\beta$, $p \models \alpha \perp \beta$ ($\models$: satisfied; $\perp$: independent) if:
    • $p(\alpha, \beta)=p(\alpha)p(\beta)$
    • $p(\alpha \mid \beta)=p(\alpha)$
    • $p(\beta \mid \alpha)=p(\beta)$
  • For random variables X and Y, $p \models X\perp Y$ if:
    • $p(X, Y)=p(X)p(Y)$
    • $p(X \mid Y)=p(X)$
    • $p(Y \mid X)=p(Y)$

Conditional Independencies

  • For random variables X, Y, Z, $p \models (X \perp Y \mid Z)$ ($P(X,Y,Z)\propto \phi_1(X,Z)\phi_2(Y,Z)$) if:
    • $p(X, Y \mid Z)=p(X\mid Z)p(Y\mid Z)$
    • $p(X \mid Y, Z)=p(X\mid Z)$
    • $p(Y \mid X, Z)=p(Y\mid Z)$

Example
Example
$p \nvDash X_1 \perp X_2$
But, $p \models (X_1 \perp X_2 \mid Coin)$
(Also, conditionaing can also lose independences)

d-seperation

Definiation: X and Y are d-seperated in a directed graph G given Z if there is no active trial in G between X and Y given Z.
Notation: $d$-$sep_G(X,Y\mid Z)$

Any node is d-seperated from its non-descendants given its partens, $p(X_1,X_2,…,X_n)=\prod_i p(X_i\mid Par_G(X_i))$

I-Maps (Indenpendency Map)

d-seperation in G $\Rightarrow$ P, a distribution, satisfies corresponding independence statement
$I(G)={(X\perp Y \mid Z):d\text{-}sep_G(X,Y|Z)}$ (all the independences)

Definiation: if P satisfied I(G), we say that G is an I-map (Independency map) of P
Back to Table of Contents

Factorisation and I-Maps

Theorem:

  • if P factorises over G, then G is an I-map for P
  • if G is an I-map for P, then P factorises over G

2 equivalent views of graph structure:

  • Factorisation: G allows P to be represented
  • I-map: Independencies encoded by G hold in P
Naive Bayes

Example
$X_1 \dots X_N$: observation (features)
($X_i \perp X_j \mid C$) for all $X_i, X_j$

$P(C, X_1, X_2, …, X_N)$ = $p(C)\prod_{i=1}^Np(X_i\mid C)$

$\frac{p(C=c1\mid X_1, X_2, \dots, X_N)}{p(C=c2\mid X_1, X_2, \dots, X_N)}=\frac{p(C=c1)}{p(C=c2)}\prod_{i=1}^N\frac{p(X_i\mid C=c1)}{p(X_i\mid C=c2)}$

Example: Bernoulli Naive Bayes for Text Classification
Example

Example: MultinomialNaive Bayes for Text Classification
Example
$N$: the length of this document
Probabilities
In order to obtain the probabilities, firstly, calculate the word frequency in each document category based on the dataset; Secondly, normalise each row therefore each element in this big table would be valid probability.

  • This model was surprisingly effective in domains with many weakly relevant features.
  • Strong independence assumptions reduce performance when many features are strongly correlated.
    Back to Table of Contents
    Template Models
    Temporal Models (involve over time)
    Markov Assumption
    $p(X^{0:T})=p(X^0)\prod_{t=0}^{T-1}p(x^{t-1}\mid x^{0:t})$

The assumption is that: ($X^{t+1} \perp X^{0:t-1} \mid X^t$)

Therefore, the joint distribution over the entire sequence will be:
$p(X^{0:T})=p(X^0)\prod_{t=0}^{T-1}p(x^{t-1}\mid x^{t})$

Example:
Graphical Model
$p(w’,v’,l’,f’,o’|w,v,l,f)=p(w’|w)p(v’|w,v)p(l’|l,v)p(f’|f,w)p(o’|l’,f’)$
Initial state distribution:
Initial state distribution
$p(w^0,v^0,l^0,f^0,o^0)=p(w^0)p(v^0|l^0)p(l^0)p(f^0)p(o^0|l^0,f^0)$

2 Time-Slice Bayesian Network (2TBN)

A Template variable $X(u_1,u_2,…,u_k)$ is instantiated duplicated multiple times and share parameters (conditional probability distribution, CPD).
Example:
Difficulty(course), Intelligence(Student), Grade(Course, Student)
2 Time-Slice Bayesian Network (2TBN)
A transioin model (2TBN) over template variables $X_1,X_2,…,X_N$ is specified as a Bayesian network fragment such that:

  • the nodes include $X_1’,…,X_N’$ at time t+1 and a subset of $X_1,…,X_N$ (the time t variables directly affect the state of t+1)
  • only the nodes $X_1’,…,X_N’$ have parents and CPD, conditional probability distribution.
    Example:
    HMM
    Back to Table of Contents
    Plate Models
    Parameter Sharing
    Parameter Sharing
    Nested Plates
    Nested Plates
    Overlapping Plates
    Overlapping Plates
    Conditional Probability Distribution (CPD)
    General CPD
  • CPD $p(X|Y_1,Y_2,…,Y_K)$ specifies distribution over X for each assignment $y_1,y_2,…,y_k$
  • can use any function to specify a factor $\phi(x,y_1,y_2,…,y_k)$ such that $\sum_X\phi(X,Y_1,…,Y_K)=1$ for all $y_1,…,y_k$
Table-based CPD

A table-based representation of a CPD in a Bayesian network has a size that grows exponentially in the number of parents. There are a variety of other form of CPD that explorit some type of structure in the dependency model to allow for a much more compact representation.
Back to Table of Contents

Context-specific Independence

$p \models (X \perp_c Y | Z, c)$, X a set of variables and c a particular assignment.
$p(X,Y|Z,c) = p(X|Z,c)p(Y|Z,c)$
$p(X|Y,Z,c)=p(X|Z,c)$
$p(Y|X,Z,c)=p(Y|Z,c)$

Tree-Structured CPD

Tree-Structured CPD

Multiplexer CPD

Multiplexer CPD

Noise OR CPD

Y is true if someone succeed in making it true.
Noise OR CPD

Sigmoid CPD

Sigmoid CPD

Back to Table of Contents

Continuous Variables

Image
Linear Gaussian
Linear Gaussian
Conditional Linear Gaussian
Conditional Linear Gaussian
Non-Linear Gaussians
Non-Linear Gaussians
Back to Table of Contents

Markov Network (undirected graph)

Markov Network Fundamentals
Pairwise Markov Networks

Example
A Pairwise Markov Network is an undirected graph whose nodes are $x_1,…,x_u$.

Each edge $x_i - x_j$ is associated with a factor (potential):

$\phi_{ij}(i,j)$

General Gibbs Distribution (a more general expression)

Even for a fully connected pairwise markov network, it is not fully expressive (i.e., it can not represent any probability distribution over random variables, not sufficiently expressive to capture all probability distribution).

Example:
Example
For the above fully connected pairwise Markov network, we will have $C_4^2 edges * 2^2$ assignments, $O(n^2d^2)$. n is the number of edges and d is the number of how many possible values each variable can take on. However, for a general markov network, the complexity of assignments is $O(d^n)$ which is much larger than $O(n^2d^2)$. Therefore, we need a general representation method to increase the coverage.

Gibbs Distribution (represents distribution as a product of factors)
Gibbs Distribution

Back to Table of Contents

Induced Markov Network (connects every pair of nodes that are in the same factor)

Example:
Example
More general:
more general
Induced Markov Network $H_{\phi}$ has an edge $X_i-X_j$ whenever there exists $\phi_m \in \phi, s.t., X_i, X_j \in D_m$
Back to Table of Contents

Factorization

P, a probability distribution, factorised over H (the induced graph for $\phi$) if:

there exists $\phi={\phi_i(D_i)}={\phi_1(D_1),\phi_2(D_2),…,\phi_k(D_k)}$

such that:
$p=p_{\phi}$ (normalised product of factors)

Active Trials in Markov Network: A trial $X_1-…-X_N$ is active given $Z$ (observed) if no $X_i$ is in Z.

Conditional Random Fields

Not to model p(X,Y) but trying to model p(Y|X). X is the input and Y is the target variable.
CRF representation
CRF
CRF and Logistic Model (an example of CRF representation)
CRF and logistic model
CRF for languages
CRF for languages
The features could be:
is word capitalised, word in atlas or name list, the previous word is “Mrs”, the next word is “Times” etc.
the goal of CRF for languages is p(labels|words)
Summary for CRF

  • A CRF is parameterised the same as a Gibbs distribution, but normalised differently p(Y|X)
  • The CRF model do not need to model distribution over variables, we only care the prediction
  • allows models with highly expressive features, without worrying about wrong independencies
    summary

Back to Table of Contents

Independencies in Markov Networks

Definition:
X and Y are seperated in H (Induced Markov Network Graph) given Z, if there is no active trial in H between X and Y given Z (i.e., no node along trial in Z)
Example:
Example
Theorem:
if a probability distribution P factorised over H, and $sep_H$(X,Y|Z)

then P satisfied $(X\perp Y|Z)$

For the set which includes independencies, $I(H)={(X\perp Y|Z):sep_{H}(X,Y|Z)}$, if P satisfies I(H), we say that H is a I-map (independency map) of P
Theorem: if P factorises over H, the H is an I-Map of P
Theorem (Independence => factorisation):
For a positive distribution P (p(x)>0), if H is an I-map of P, the P factorises over H.
We can summarise that:

  • Factorisation: H allows P to be represented.
  • I-map: Independencies encoded by H hold in P

I-maps and perfect maps
Perfet maps capture independencies in a distribution P, $I(P)={(X\perp Y|Z): P \models (X\perp Y|Z)}$.
P factorises over G => G is an I-map for P, $I(G)\subseteq I(P)$
However, not always vice versa, there can be independencies in I(P) that are not in I(G).
If the graph encodes more independencies,

  • it is sparser (has few parameters. We want a sparse graph actually)
  • and more informative (we want a graph that captures as much of the structure in P as possible)

Minimal I-map
A minimal I-map does not have redundant edges.
Example
A minimal I-map may still cannot capture I(P). A minimal I-map may fail to capture a lot of structure even if they are presented in the distribution and even if it is representable as a Bayes net or as a graphical model.
Example:
Example
if remove the red edge, $D\perp G$; if remove the green one, $I\perp G|D$; if remove the purple one, $D\perp I | G$. They are not the case in our original distribution. None of them can be removed. Therefore, this is also a minimal I-map.
A perfect map $I(G)=I(P)$, means G perfectly captures independencies in P. Unfortunately, a perfect map are hard to come by.
Example of doesn’t have a perfect map:
IF we consider using Bayes net as perfect maps:
Example
Another imperfect map:
Example
IF we consider Markov network as a perfect map:
Perfect map: $I(H)=I(P)$, (here we use a different symbol replace G by H), H perfectly captures independencies in P (a perfect map is great, but may not exist)
Example
Converting BNs <—> MNs losses independencies:

  • BN to MN: loss independencies in V-structure
  • MN to BN: must add triangulating edges to loops

Uniqueness of perfect map
Example
Formal Definition: I-equivalence
Two graphs G1 and G2 over X1 … Xn are I-equivalent if I(G1)=I(G2) (examples above)
Most Gs have many I-equivalent variants.
I-equivalence is an important notion. It tells us: there are certain aspects of the graphical model are unidentifiable, which means that if we end up for whatever reason thinking this is the graph that represent our probability distribution, it could just as easyly as this one or that one. So without prior knowledge of some kind or another, for example that we prefer X to be a parent of Y, there is no way for us to select among these different choices.

Back to Table of Contents

Local Structure in Markov Networks
Log-linear Models (CRF, Ising Model, Metric MRFs)

The local structure means we do not need full table representations in both directed and undirected models. Here we will discuss the local structure in undirected models.

local structure

  • each feature $f_j$ has a score $D_j$
  • different featrues can have same scope

Example 1:
Example
Example 2 (CRF):
Example
Example 3 (Ising Model, pair-wise Merkov Network, Joint Spins):
Example
Example 4 (Metric MRFs):
All $X_i$ take values in label space V
Example
Distance function $\mu: V*V\to R^{+}$ (non-negative)
Example
We have a feature:

we want lower distantce (higher probability).

lower probability, if the values of Xi and Xj far in $\mu$.

Examples of $\mu$:

Example
$\mu(v_k,v_l)=min(|v_k-v_l|,d)$
Example

Back to Table of Contents

Shared features in Log-Linear Models
In most MRFs, same features and weights can be used over many scopes.

  • Ising Model
    Ising Model
    Here, $X_iX_j$ is the feature function, $f(X_i,X_j)$.
  • CRF
    CRF

1) same energy terms $w_k f_k(x_i,y_i)$ repeat for all positions i in the sequence.

2) same energy terms $w_m f_m(y_i,y_{i+1})$ repeat for all positions i

Summary:
Repeated Features:
1) need to specify for each feature $f_k$ a set of scopes, Scopes[$f_k$]
2) for each $D_k \in Scope[f_k]$ we have a term $w_kf_k(D_k)$ in the energy function: $w_k\sum f_k(D_k), D_k \in Scopes(f_k)$. Example: Scope[$f_k$] = {Yi,Yj; i and j are adjacient}

Back to Table of Contents

Decision Making
Maximum Expected Utility

Simple Decision Making:
A simple dicision making situation D:

  • A set of possible actions Val(A)={$a^1,…,a^k$}, different choices
  • A set of states Val(X)={$x_1,…,x_N$}, states of the world
  • A distribution p(X|A)
  • A utlity function U(X,A)

The expected utility:
EU

D is the situation and a is the one of the actions.

We want to choose action a that maximise EU:
$a^*=\text{argmax}_a EU[D(a)]$

Simple Influence Diagram

  • Note that: Action is not a random variable, so it does not have a CPD (conditional probability distribution)

Simple Influence Diagram

EU[$f_0$]=0
*EU[$f_1$]=0.5(-7)+0.35+0.2*20=2

Back to Table of Contents

More complex Influence Diagram
Example
$V_G, V_S$ and $V_Q$ represent different components of the utility function (a decomposed utility).
$V=V_G + V_S + V_Q$

Information Edges
Information Edges
The survey here means that conducting a survey to investigate the market demand.

Expected utility with Information:
EU[D[$\delta_A$]]=$\sum_{x,a}P_{\delta_A}(X,a)U(X,a)$, a joint probability distribution over X $\cup$ {A}.

Decision rule $\delta$ at action node A is a CPD, P(A|parents(A)), here is P(F|S). $\delta_A$ is a decision rule for an action.

We want to choose the decision rule $\delta_A$ that maximises the expected utility $argmax_{\delta_A}EU[D[\delta_A]]$. (MEU(D)=$\max_{\delta_{A}}EU[D[\delta_{A}]]$).

Example:
Example
To maximise the summination: the optimal decision rule is:
0+1.15+2.1=3.25 (the agent overall expected utility in this case)

More generally
More Generally
A: actions

Summary

  • treat A as a random variable with arbitrary (unknown) CPD, $\delta_A(A|Z)$
  • introduce utility factor with scope $P_{a_U}$, P is parent, u is utility
  • eliminate all variables except A, Z (A’s parents) to produce factor $\mu(A,Z)$
  • for each Z (observation) set: we choose the optimal decision rule:

Back to Table of Contents

Utility Functions

Lotteries Example:
Lotteries
we can use expected utility to decide which to buy between 2 different lotteries.
1) 0.2U($4) + 0.8U($0)
2) 0.25U($3) + 0.75U($0)

Utility Curve
Example
Example
Multi-Attribute Utilities

  • All attributes affecting preferences (e.g., money, time, pleasure, …) must be integrated into one utility function;
  • Example: Micromorts 1/1000000 chance of death worth ($\approx $$20, 1980); QALY (quality-adjusted life year)

Example (prenatal diagnosis):
Example
Break down the utility function as a sum of utilities, $U_1(T) + U_2(K) + U_3(D,L) + U_4(L,F)$

Back to Table of Contents

Value of Perfect Information

(Another question: which observations should I even make before making a decision? Which one is worthwhile and which one is not?)

  • VPI(A|X) is the value of observing X before choosing an action at A
  • $D$: original influence diagram
  • $D_{x\to A}$: influence diagram with edge $x \to A$

, MEU is maximum expected utility

Example:
First find the MEU decision rules for both and then compute $MEU(D_{x\to A})-MEU(D)$.

As shown below: $MEU(D_{x\to A})-MEU(D)=3.25-2=1.25$, which means the agent should be willing to pay anthing up to 1.25 utility points in order to conduct the survey.

Example

Theorem
$VPI(A|X) := MEU(D_{x\to A})-MEU(D), VPI(A|X) >= 0$

$VPI(A|X)=0$, if and only if the optimal decision rule for D is still optimal for $D_{x\to A}$.

Detailed Example:
Example
If the agent does not get any information (observation):

$EU(D[C_1])=0.1\times0.1+0.2\times0.4+0.7\times0.9=0.72$

$EU(D[C_2])=0.04+0.2+0.09=0.33$

What if the agent get to make an observation?
Example
The expected utility (EU) (with observation) is:

  • if the agent choose company2, and its state is poor ($S_2=poor$), the EU is 0.1 (stick the original idea, which prefers company1, because the EU is lowere than 0.72)
  • if the agent choose company2, and its state is moderate, the EU is 0.4 (stick the original idea, which prefers company1, because the EU is lowere than 0.72)
  • if the agent choose company2, and its state is great, the EU is 0.9 (prefer company2, change mind to company2)

Therefore, the optimal decision rule is:

$\delta_A(C|S_2)= P(C^2)=1, if S_2=S^3 (great)$

$\delta_A(C|S_2)= P(C^1)=1, otherwise$

In this scenario, the MEU value is 0.743.

$MEU(D_{S_2\to C})=\sum_{S_2,C}\delta(C|S_2)\mu(S_2,C)=0.743$

0.743 is not a significant improvement over our original MEU value. If observing, the agent shouldn’t be willing to pay his company too much money in order to get information about the detail.

Another situation (neither company doing great):
Example

$EU(D[C_1])=0.35$
$EU(D[C_2])=0.04+0.2+0.09=0.33$

$\delta_A(C|S_2)= P(C^2)=1, if S_2=S^2, S^3 (great)$

$\delta_A(C|S_2)= P(C^1)=1, otherwise$

$MEU(D_{S_2\to C})=\sum_{S_2,C}\delta(C|S_2)\mu(S_2,C)=0.43$

0.43 is much more significant increase.

Third situation (neither company doing great):
Example

$EU(D[C_1])=0.788$
$EU(D[C_2])=0.779$

$\delta_A(C|S_2)= P(C^2)=1, if S_2=S^2, S^3 (great)$

$\delta_A(C|S_2)= P(C^1)=1, otherwise$

$MEU(D_{S_2\to C})=\sum_{S_2,C}\delta(C|S_2)\mu(S_2,C)=0.8142$

In this situation, in the bubble days of the Internet boom and pretty much companys gets funded, with a pretty high probability even if their business models are dubious.
(only a small fairly small increase over 0.788 that they could have already guaranteed themselves without making that observation)

Back to Table of Contents

Knowledge Engineering
Generative vs. Discriminative
  • Generative: when don’t have a preditermined task (task shifts). Example: medical diagnosis pack: every patient present differently, each patient case, we have different subset of things happend to know (symptoms and tests), we want to measure some variables and predict others; easy to train in certain regimes. (where data is not fully labelled)
  • Discriminative: particalar prediction task, need richly expressive features (avoid dealing with corelations), and can achive high performance.
Designing a graphical model (variable types)
  • Target: there are the ones we care about, e.g., a set of diseases in the diagnosis setting
  • Observed: not necessary care predicting them, like symptoms and test results in the medical setting
  • Latent/hidden: which can simply our structure, example below:
    Example

Back to Table of Contents

Structure
  • Causal versus non-causal ordering.
    Do the arrows in directed graph corresponding to causality? (Yes and No)
    No: $X\to Y$ any distribution that we can model on this graphical model where X is a parent of Y, we can equally well model in a model $Y\to X$. In some examples, we can reverse edges and have a model that’s equally expressive. (But the model might be nasty, examples below. Thus the causal ordring is generally more sparser, intuitive and more easier to parameterise)
    Example
Parameters: Local Structure
Context-Specific Aggregating
Discrete tree-CPDs sigmoid CPDs
Continuous regression tree (continues version of tree CPD, breaks up the context based on some thresholds on the continuous variables) Linear Gaussian
Iterative Refinement
  • Model testing (ask queries and see whether the answers coming out are reasonable)
  • sensitivity analysis for parameter: look at a given query, and ask which parameters have the biggest different on the value of the query, and that means those are probably the ones we should fine tune in order to get best results)
  • Error Analysis
    • add features
    • add dependencies

Back to Table of Contents

Inference

Overview: how to use the these declarative representation to answer actual queries.

Conditional Probability Queries (Overview)

Evidence: E = e, set of observations
Query: a subset of variables Y
Task: compute p(Y|E=e) (NP-Hardness)

Sum-Product in Bayes Network and Markov Network

Given a PGM $P_{\phi}$ (defined by a set of variables X and factors), value $x \in Val(X)$, and we need compute $P_{\phi}(X=x)$
Example:
Example
Note that each of the CPDs converts a factor over the scope of the family (e.g., P(G|I,D)->$\phi_G(G,I,D)$).

If we need to know P(J), we need to compute the follows:
P(J)

For the sum-product in MN:
Sum-Product in MN

Back to Table of Contents

Evidence: Reduced Factors

$p(Y|E=e)=\frac{P(Y,E=e)}{P(E=e)}$
Define: W = {$X_1,X_2,…,X_n$} - Y - E
Evidence
Example:
Example
If we want to compute p(J|I=i, H=h), we can just compute the above equation and re-normalise.

Summary of Sum-Product Algorithm

$p(Y|E=e)=\frac{P(Y,E=e)}{P(E=e)}$, Y is the query.
numerator: $p(Y,E=e)=\sum_w\frac{1}{Z}\prod_k\phi_k’(D_k’)$, reduced by evidence.
denominator: $p(E=e)=\sum_Y\sum_w\frac{1}{Z}\prod_k\phi_k’(D_k’)$
In practice, we can compute $\sum_w\prod_k\phi’_k(D’_k)$ and renormalise.

There are many algorithms for computing conditional probability queries.
The algorithm list of conditional probability:

  • Push summations into factor product
    • Variable elimination (dynamic programming)
  • Message passing over a graph
    • Belief propagation (exact inference)
    • Variational approximations (approximate inference)
  • Random sampling instantiations (approximate inference)
    • Markov Chain Mente Carlo (MCMC)
    • Importance Sampling

Back to Table of Contents

MAP (Maximum A Posterior) Inference (Overview)

Evidence: E = e (observations)
Query: all other variables Y (Y = {$X_1, X_2,…,X_n$} - E)
Task: compute MAP = $argmax_YP(Y|E=e)$

Note that:
1) there may be more than one possible solution
2) MAP $\neq$ Max over margins. Example below:
Example
The CPDs are:
P(A=0)=0.4
P(A=1)=0.6
P(B=0|A=0)=0.1
P(B=1|A=0)=0.9
P(B=0|A=1)=0.5
P(B=1|A=1)=0.5
Therefore we can get the joint distribution:
P(B=0,A=0)=0.04
P(B=1,A=0)=0.36
P(B=0,A=1)=0.3
P(B=1,A=1)=0.3

Obviously, the assignement which MAP(A,B) is A=0,B=1. Because its has the hightest probability 0.36. However, if we look at the two variables seperately, the A should be assigned as A=1 which is opposite to the A=0. Therefore, we cannot look seperately at the marginal over A and over B.

Compute MAP is a NP-Hardness problem.

Given a PGM $P_{\phi}$, find a joint assignment X with the hightest probability $P_{\phi}(X)$.

Back to Table of Contents

Max-Product

Example:
$p(J)=argmax_{C,D,I,G,S,L,J,H}\phi_C(C)\phi_D(C,D)\phi_I(I)\phi_G(G,I,D)\phi_S(S,I)\phi_L(L,G)\phi_J(J,L,S)\phi_H(H,D,J)$
Example

Algorithms: MAP (there are also different algorithms to solve the problem)

  • Push maximisation into factor product
    • variable elimination
  • Message passing over a graph
    • Max-product belief propagation
  • Using methods for integer programming (a general class of optimisation which is over discreet spaces)
  • For some networks: graph-cut methods
  • Combinatorial search (use standard search techniques over combinatorial search spaces)

Back to Table of Contents

Variable Elimination

The simplest and the most fundamental algorithm.

Variable Elimination Algorithm
Elimination in Chains

Elimination in Chains
Ultimately, we will end up with an expression that involves only the variable E.

Elimination in a more complicated BN

Elimination in a more complicated BN

Variable Elimination with Evidence

Variable Elimination with Evidence
If we need to compute P(J|I=i,H=h), we just renormalise the above results.

Variable Elimination in MNs

Variable Elimination in MNs

Summary Variable Elimination Algorithm

Summary

  • reduce all factors by evidence
    • get a set of factors $\Phi$
  • For each non-query variable Z
    • Eliminate variable Z from $\Phi$
  • multiply all remaining factors
  • renormalising to get distribution

Back to Table of Contents

Complexity of Variable Elimination

Factor Product and Marginalisation
Factor Product:
Factor Product
Factor Marginalisation:
Factor Marginalisation
Details:

  • Start with m factors
    • for Bayesian Networks, m <= n (the number of variables). m=n when one factor for each variable
    • for Markov Networks, m can be larger than the number of variables
  • At each elimination step generate a factor
  • at most n elimination steps
  • total number of factors $m^*\leqslant m+n$
  • product operations: $\sum_k(m_k-1)N_k\leqslant N\sum_k(m_k-1)\leqslant Nm^*$, $N=\max(N_k)=$size of the largest factor. Each factor multiply in at most once.
  • sum operations $\leqslant \sum_kN_k\leqslant N\cdot$number of elimination steps $\leqslant$ $Nn$. (Total work is linear in N and $m^*$)

$N_k=|Val(X_k)|=O(d^{rk})$ where: (exponential blow up)

  • $d=\max(|Val(X_i)|)$, d values in their scope
  • $r_k=|X_k|$ (number of variables in k-th factor) =cardinality of the scope of the k-th factor

Complexity of algorithm depends heavily on the elimination ordering
Depends on the elimination ordering

Back to Table of Contents

Graph-based Perspective on Variable Elimination

View a BN as a undirected graph:
View a BN as a undirected graph
all variables connected to I become connected directly.
View a BN as a undirected graph
Its a pretty good elimination ordering in this case.

To summarise:
Induced Graph
The induced graph $I_{\Phi,\alpha}$ over factors $\Phi$ and ordering $\alpha$:

  • undirected graph
  • $X_i$ and $X_j$ are connceted if they appeared in the same factor in a run of the variable elimination (eliminate factors generate new factors) algorithm using $\alpha$ as the ordering

Cliques in the Induced Graph
Theorem: Every factor produced during VE is a clique in the induced graph.
Every factor produced during VE
A clique is a maximal fully connected sub-graph.
Example:
Example
(fully connected each other)
It is maximal because we cannot add any other variable to this and still have that property hold. For example, add node S, but S not connects D.

Theorem: every (maximal) clique in the induced graph is a factor produced during VE.

Induced width:

  • the width of an induced graph is the number of nodes in the largest clique in the graph minus 1.
  • minimal induced width of a graph K is $\min_{\alpha}$(width$(I_{k,\alpha})$), $\alpha$ over all possible elimination orders.
  • it provides a lower bound on best perfromance (the best achievable complecity) of VE to a model fatorizing over K (graph K)

Back to Table of Contents

Finding Elimination Orderings
  • Greedy seach using heuristic cost function
    • at each point, eliminate node with smallest cost
  • possible cost functions:
    • min-neighbors: number of neighbors in current graph
      • pick the node has the smallest/minimal number of neighbors
      • corresponding to the smallest factor
      • the one whose cardinality (the number of variables in this factor) is smallest
    • min-weight: weight (e.g., number of values) of factor formed
    • min-fill: number of new fill edges
    • weighted min-fill: total weight of new fill edges (edge weight = product of weights of the 2 nodes)

Theorem: the induced graph is triangulated.

  • no loops of length $>$ 3 without a “bridge”
    Example
  • can find elimination ordering by finding a low-width (little clique) triangulation of original graph $H_{\phi}$
Variable Elimination (Summary)
  • Finding the optimal elimination ordering is NP-hard.
  • Simple heuristics that try to keep induced graph small often provide reasonable performance.

Back to Table of Contents

Learning

Belief Propagation Algorithms

Message Passing in Cluster Graphs

Build a cluster graph:
Example
Passing Messages:
Example
Each cluster is going to send messages to its adjacent clusters that reflect the same process as above.
Example

Cluster Graphs (Generalisation Definition)

  • Undirected graph

    • Nodes: clusters $C_i\subseteq{X_1,…,X_n}$. $C_i$ is a subset of variables
    • Edges: between $C_i$ and $C_j$ associated with subset $S_{i,j} \subseteq C_i \cap C_j$. $S_{i,j}$ is the variables that they talk about
  • Given a set of factors $\Phi$, assign each $\phi_k$ to a cluster $C_{\alpha(k)}$ (one and only one cluster) s.t. Scope[$\phi_k$] $\subseteq C_{\alpha(k)}$ (a subset of $C_{\alpha(k)}$)

  • Define $\psi_i(C_i)=\prod_{k:\alpha(k)=i}\phi_k$

    • define a initial belief of a particular cluster the product of all factors that assigned to it
    • and some clusters might not have have factors assigned to them, in this case equal to 1.

Example Cluster Graph
Example
Initially we have to figure out for each of those factors a cluster in which to put it.

  • For A,B,C: there is only one choice because there is only one cluster in the enitre graph that understands abouts all of A, B and C.
  • For B,C: however, it has two choinces. It can go into cluster one or two. Both are fine.
  • For B,D: also has two choinces.
  • For D,E and B,E: they only have one choice.
  • For B,D: it has 2 choices.
  • For B,D,F: there is only one choice.
    This is one possible way of assigning the cluster, the factor Flow of Probabilistic Influence (Active Trial)

When can X influence Y (condition on X can change the beliefs\probablities of Y)?

X CAN influence Y:
$X\rightarrow Y$,
$X\leftarrow Y$
$X\rightarrow W \rightarrow Y$
$X\leftarrow W \leftarrow Y$
$X\leftarrow W \rightarrow Y$

X CANNOT influence Y:
$X\rightarrow W \leftarrow Y$ (V Structure)

Active Trails
A trial $X_1-X_2-…-X_n$ is active if it has tno clusters.

Different Cluster Graph
For the exact same set of factors, the clusters have not changed, but the edge changes.
Example

Message Passing
Example
General:
General

Back to Table of Contents

Belief Propagation Algorithm
  • Assign each factor $\phi_k\in\Phi$ to a cluster $C_{\alpha(k)}$
  • Construct initial potentials $\Psi_i(C_i)=\prod_{k:\alpha(k)=i}\phi_k$
  • Initialise all messages to be 1
  • Repeat: (until When and how to select edge - different variants of algorithms will be described later on)
    • select edge (i,j) and pass message. $N_i$ is the neighbour.
      Example
  • Compute, belief of this cluster, $\beta_i(C_i) = \Psi_i \times \prod_{k\in N_i} \delta_{k \to i}$. $N_i$ is all the neighbors.

Belief propagation run (approximate)
Example:
Example
The resulting beliefs are pseudo-marginals, but the algorithm actually performs well in a range of practical applications.

Back to Table of Contents

Properties of Cluster Graphs

Family Preservation
Given a set of factors $\Phi$, we assign each $\phi_k$ to a cluster $C_{\alpha(k)}$, s.t. Scope[$\phi_k$]$\in C_{\alpha(k)}$
For each factor $\phi_k$ in $\Phi$, there exists a cluster $C_i$ s.t. Scope[$\phi_k$] $\subseteq C_i$

Runing Itersection Property
For each pair of clusters $C_i$ and $C_j$, and variable $X\in C_i\cap C_j$, there exists a unique path between $C_i$ and $C_j$ for which all clusters and sepsets (edge) contain X.
Example

Alternative defination of Running Itersection Property —- Equivalently: for any variable X, the set of clusters and sepsets (edges) containing X from a tree. (each variable inducing its own little tree across which information about that variable flows in the graph.)
Example

If X and Y are very strongly correlated, BP does poorly when have sharing parameters, due to the feedback loops. The more skewed the probabilities in your graphical mode, the harder time belief propagation has in terms of the results that it gets.
Example

Examples of Illegal Cluster Graphs:
Example

For B, there is no way to connect the B in cluster 1 and 2 or 2 and 3. This graph violates the existence.
Example

This vailates the uniqueness. (the loop, 2 ways between cluster 1 and 4.)
Examples of Valid Graph:
Examples of Valid Graph

Bethe Cluster Graph (How to construct a cluster graph that has the desired properties?)

  • Big Cluster: for each $\phi_k\in\Phi$, a factor cluster $C_k$ = Scope[$\phi_k$]
  • Small Cluster: for each $X_i$ a singleton cluster {$X_i$} (variable)
  • Edge $C_k$—-$X_i$ if $X_i\in C_k$
    Bethe Cluster Graph

Summary

  • Cluster graph must satisfy:
    • family preservation: allow $\Phi$ to be encoded
    • running intersection: connect all information about any variable, but without feedback loops
  • Bethe cluster graph is often first default
  • Richer cluster graph stuctures (e.g., more edges) can offer different trade-offs. wrt. computational cost (increasing the amount of sizes of the messages that are passed that can grow more expensive) but at the same time imprve the preservation of dependencies as messages are passed in the graph so that more information is actually kept and not lost in this message passing process.

Back to Table of Contents

Properties of Belief Propagation

Calibration
Calibration
A cluster graph is calibrated if every pair of adjacent clusters $C_i$ and $C_j$ agree on their sepset $S_{i,j}$:

$\sum_{C_i-S_{ij}}\beta_i(C_i)=\sum_{C_j-S_{ij}}\beta_j(C_j)$. In other words, the marginilised beliefs are equal.

Convergence ($\to$ Calibration)
$\delta_{i\to j}(S_{ij})=\delta’_{i\to j}(S_{ij})$, equal the message at previous time step.
Convergence
The expression above is corresponds to the sepset belief:
$\mu_{i,j}(S_{i,j})=\delta_{j\to i}\delta_{i\to j}=\sum_{C_j-S_{i,j}}\beta_j(C_j)$

Reparameterisation
Reparameterisation
If we write in this form:
Reparameterisation
It can be seen that it is simply the unnormalised measure. It implicates the term/ratio is simply a different set of parameters that captures the original un-normalised measure that defined our distribution.
Therefore, we have not lost information as a result of belief propagation algorithm. The distribution is still there, just a different set of parameters.

Summary
Cluster graph beliefs are an alternative, calibrated, parametorisation of the original unnormalised density.

  • No information is lost by message passing.
  • Reparametorised the original distribution into a more convenient and easily usable form.

Back to Table of Contents

Clique Trees (faster and exact answer of inference)

A speical case of cluster graph which has the performance guarantee. (faster, exact answer of inference).