EE 278 Lecture Notes #3, Winter 2010–2011
© R.M. Gray 2011

Random Variables

Random variables, vectors, and processes

Recall the probability space (Ω, F, P).

A (real-valued) random variable is a real-valued function defined on Ω, with a technical condition (to be stated). It is common to use upper-case letters: e.g., a random variable X is a function X : Ω → R; Y, Z, U, V, Θ, ... are also common.

Also common: a random variable may take on values only in some subset Ω_X ⊂ R (sometimes called the alphabet of X; A_X and 𝒳 are also common notations).

Intuition: the randomness is in the experiment, which produces an outcome ω according to the probability measure P; the random variable reports the value X(ω) ∈ Ω_X ⊂ R.
Examples

Consider (Ω, F, P) with Ω = R and P determined by the uniform pdf on [0, 1) ⊂ R.

Coin flip from earlier: X : Ω → {0, 1} defined by

    X(r) = 0 if r ≤ 0.5, 1 otherwise

⇒ the random variable outcome is the coin flip. We observe X; we do not observe the outcome of the fair spin.

Lots of possible random variables on the same space, e.g., W(r) = r², Z(r) = eʳ, V(r) = r, L(r) = −r ln r (requires r ≥ 0), Y(r) = cos(2πr), etc.

Functions of random variables

Suppose that X is a rv defined on (Ω, F, P) and that g : Ω_X → R is another real-valued function. Then the function g(X) : Ω → R defined by g(X)(ω) = g(X(ω)) is also a real-valued mapping of Ω, i.e., a real-valued function of a random variable is a random variable.

Can express the previous examples as W = V², Z = e^V, L = −V ln V, Y = cos(2πV).

Similarly, 1/W, sinh(Y), L³ are all random variables.

Can think of rvs as observations or measurements made on an underlying experiment.
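As a quick illustration (not part of the original notes), here is a minimal Python sketch that realizes the underlying experiment as uniform draws on [0, 1) and evaluates the quantizer and several functions of it; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# The underlying experiment: omega drawn uniformly on [0, 1).
omega = rng.uniform(0.0, 1.0, size=100_000)

# The binary quantizer from the coin-flip example: X(r) = 0 if r <= 0.5, else 1.
X = np.where(omega <= 0.5, 0, 1)

# Functions of the identity rv V(r) = r are themselves random variables.
V = omega
W = V**2                            # W = V^2
Y = np.cos(2 * np.pi * V)           # Y = cos(2*pi*V)
L = -V * np.log(V + 1e-300)         # L = -V ln V (limit 0 as V -> 0)

# Empirical check: X behaves like a fair coin flip.
print("Pr(X = 1) ~", X.mean())      # ~0.5
```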
Random vectors and random processes

A finite collection of random variables defined on a common probability space (Ω, F, P) is a random vector. E.g., (X, Y), (X₀, X₁, ..., X_{k−1}).

An infinite collection of random variables (defined on a common probability space) is a random process. E.g., {Xₙ; n = 0, 1, 2, ...}, {X(t); t ∈ (−∞, ∞)}.

So the theory of random vectors and random processes mostly boils down to the theory of random variables.

Derived distributions

In general: an "input" probability space (Ω, F, P) plus a random variable X ⇒ an "output" probability space, say (Ω_X, B(Ω_X), P_X), where Ω_X ⊂ R and P_X(F) = Pr(X ∈ F) is the distribution of X.

Typically P_X is described by a pmf p_X or a pdf f_X.

For the binary quantizer we derived P_X as a special case.

The idea generalizes and forces a technical condition on the definition of a random variable (and hence also on random vectors and random processes).
Inverse image formula

Given (Ω, B(Ω), P) and a random variable X, find P_X.

Basic method: P_X(F) = the probability, computed using P, of all the original sample points that are mapped by X into the subset F:

    P_X(F) = P({ω : X(ω) ∈ F})

Shorthand way to write the formula in terms of the inverse image of an event F ∈ B(Ω_X) under the mapping X : Ω → Ω_X, X⁻¹(F) = {ω : X(ω) ∈ F}:

    P_X(F) = P(X⁻¹(F))

Written informally as P_X(F) = Pr(X ∈ F) = P{X ∈ F} = "the probability that the random variable X assumes a value in F."

Inverse image method:

    Pr(X ∈ F) = P({ω : X(ω) ∈ F}) = P(X⁻¹(F))

The inverse image formula is fundamental to probability, random processes, and signal processing: it shows how to compute probabilities of output events in terms of the input probability space.

But does the definition make sense? I.e., is P_X(F) = P(X⁻¹(F)) well-defined for all output events F?

Yes, if we include a requirement in the definition of a random variable:
Careful definition of a random variable

Given a probability space (Ω, F, P), a (real-valued) random variable X is a function X : Ω → Ω_X ⊂ R with the property that

    if F ∈ B(Ω_X), then X⁻¹(F) ∈ F.

Notes:

• In English: X : Ω → Ω_X ⊂ R is a random variable iff the inverse image of every output event is an input event, and therefore P_X(F) = P(X⁻¹(F)) is well-defined for all events F.

• Another name for a function with this property: measurable function.

• Most every function we encounter is measurable, but the calculus of probability rests on this property, and advanced courses prove the measurability of important functions.

In the simple binary quantizer example, X is measurable (easy to show since F = B([0, 1)) contains the intervals). Recall

    P_X({0}) = P({r : X(r) = 0}) = P(X⁻¹({0})) = P({r : 0 ≤ r ≤ 0.5}) = P([0, 0.5]) = 0.5
    P_X({1}) = P(X⁻¹({1})) = P((0.5, 1.0]) = 0.5
    P_X(Ω_X) = P_X({0, 1}) = P(X⁻¹({0, 1})) = P([0, 1)) = 1
    P_X(∅) = P(X⁻¹(∅)) = P(∅) = 0

In general, find P_X by computing the pmf or pdf, as appropriate. There are many shortcuts, but the basic approach is the inverse image formula.
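A small numerical check of the inverse image formula for the binary quantizer, assuming P is the uniform measure on [0, 1); a sketch, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
omega = rng.uniform(0.0, 1.0, size=1_000_000)
X = np.where(omega <= 0.5, 0, 1)

# Output event F = {0}. Its inverse image is {r : 0 <= r <= 0.5}.
# P_X({0}) computed two ways: directly from X, and via the inverse image.
print(np.mean(X == 0))         # ~0.5 = P_X({0})
print(np.mean(omega <= 0.5))   # ~0.5 = P(X^{-1}({0}))
```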
Random vectors

All the theory, calculus, and applications of individual random variables are useful for studying random vectors and random processes, since random vectors and processes are simply collections of random variables. One k-dimensional random vector = k 1-dimensional random variables defined on a common probability space.

Earlier example: two coin flips, k coin flips (the first k binary coefficients of the fair spinner).

Several notations are used, e.g., Xᵏ = (X₀, X₁, ..., X_{k−1}), which is shorthand for Xᵏ(ω) = (X₀(ω), X₁(ω), ..., X_{k−1}(ω)); or X; or {Xₙ; n = 0, 1, ..., k − 1}; or {Xₙ; n ∈ Z_k}.

A random vector can be discrete (described by a multidimensional pmf), continuous (e.g., described by a multidimensional pdf), or mixed.

Recall that a real-valued function of a random variable is a random variable. Similarly, a real-valued function of a random vector (several random variables) is a random variable. E.g., if X₀, X₁, ..., X_{n−1} are random variables, then the sample mean

    Sₙ = (1/n) Σ_{k=0}^{n−1} X_k

is a random variable, defined by

    Sₙ(ω) = (1/n) Σ_{k=0}^{n−1} X_k(ω)
Inverse image formula for random vectors

A random vector is a finite collection of rvs defined on a common probability space. The inverse image formula becomes

    P_X(F) = P(X⁻¹(F)) = P({ω : X(ω) ∈ F}) = P({ω : (X₀(ω), X₁(ω), ..., X_{k−1}(ω)) ∈ F})

where the various forms are equivalent and all stand for Pr(X ∈ F).

Technically, the formula holds for suitable events F ∈ B(R)ᵏ, the Borel field of Rᵏ (or some suitable subset). See the book for discussion.

One multidimensional event of particular interest is a Cartesian product of 1D events (called a rectangle): F = ×_{i=0}^{k−1} F_i = {xᵏ : x_i ∈ F_i; i = 0, ..., k − 1}, for which

    P_X(F) = P({ω : X₀(ω) ∈ F₀, X₁(ω) ∈ F₁, ..., X_{k−1}(ω) ∈ F_{k−1}})

Random processes

A random process is an infinite family of rvs defined on a common probability space. Also called a stochastic process. Many types:

    {Xₙ; n = 0, 1, 2, ...}   (discrete-time, one-sided)
    {Xₙ; n ∈ Z}              (discrete-time, two-sided)
    {X_t; t ∈ [0, ∞)}        (continuous-time, one-sided)
    {X_t; t ∈ R}             (continuous-time, two-sided)

In general: {X_t; t ∈ T} or {X(t); t ∈ T}, where T is the index set.
Always: a random process is an indexed family of random variables.

Keep in mind the suppressed argument ω, e.g., each X_t is X_t(ω), a function defined on the sample space. X(t) is X(t, ω); it can be viewed as a function of two arguments.

Other notations: {X(t)}, {X[n]} (for discrete time). Sloppy but common: X(t), where context tells that it is a rp and not a single rv. Discrete-time random processes are also called time series.

For each t, X_t is a random variable, and all X_t are defined on a common probability space. The index is usually time; in some applications it is space, e.g., a random field {X(t, s); t, s ∈ [0, 1)} models a random image, and {V(x, y, t); x, y ∈ [0, 1); t ∈ [0, ∞)} models analog video.

Have seen one example: fair coin flips, a Bernoulli random process.

Another, simpler, example: random sinusoids. Suppose that A and Θ are two random variables with joint pdf f_{A,Θ}(a, θ) = f_A(a) f_Θ(θ). For example, Θ ∼ U([0, 2π)) and A ∼ N(0, σ²). Define a continuous-time random process X(t) for all t ∈ R by

    X(t) = A cos(2πt + Θ)

Or, making the dependence on ω explicit,

    X(t, ω) = A(ω) cos(2πt + Θ(ω))
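A sketch of how one might simulate the random sinusoid: each draw of (A, Θ) fixes one sample function X(t, ω). The parameter choices (σ = 1, three paths) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0.0, 2.0, 501)           # time grid

for _ in range(3):                        # three sample functions X(t, omega)
    A = rng.normal(0.0, 1.0)              # A ~ N(0, 1)
    theta = rng.uniform(0.0, 2*np.pi)     # Theta ~ U([0, 2pi)), independent of A
    x_t = A * np.cos(2*np.pi*t + theta)   # one realized waveform
    print(f"A={A:+.2f}, Theta={theta:.2f}, X(0)={x_t[0]:+.2f}")
```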
Derived distributions for random variables

General problem: given a probability space (Ω, F, P) and a random variable X with range space (alphabet) Ω_X, find the distribution P_X.

If Ω_X is discrete, then P_X is described by a pmf p_X(x) = P(X⁻¹({x})) = P({ω : X(ω) = x}), and

    P_X(F) = Σ_{x ∈ F} p_X(x)

This is a probability, and the inverse image formula works directly.

If Ω_X is continuous, then we need a pdf. But a pdf is not a probability, so the inverse image formula does not apply immediately ⇒ alter the approach.

Cumulative distribution functions

Define the cumulative distribution function (cdf) by

    F_X(x) ≡ Pr(X ≤ x) = P(X⁻¹((−∞, x])) = ∫_{−∞}^{x} f_X(r) dr

and from calculus

    f_X(x) = dF_X(x)/dx

So first find the cdf F_X(x), then differentiate to find the pdf f_X(x).

Notes:

• If a ≥ b, then since (−∞, a] = (−∞, b] ∪ (b, a] is a union of disjoint intervals, F_X(a) = F_X(b) + P_X((b, a]) ≥ F_X(b), and hence F_X(x) is monotonically nondecreasing.

• For a ≤ b, ∫_a^b f_X(x) dx = F_X(b) − F_X(a) ⇒ P_X((a, b]) = F_X(b) − F_X(a).

• If the original space (Ω, F, P) is a discrete probability space, then a rv X defined on (Ω, F, P) is also discrete, and the inverse image formula gives

    p_X(x) = P_X({x}) = P(X⁻¹({x})) = Σ_{ω : X(ω) = x} p(ω)

• The cdf is well defined for discrete rvs:

    F_X(r) = Pr(X ≤ r) = Σ_{x : x ≤ r} p_X(x),

but it is not as useful, and it is not needed for derived distributions.
Example: discrete derived distribution

Ω = Z₊ = {1, 2, ...}, with P determined by the geometric pmf p(k) = p(1 − p)^{k−1}, k ∈ Z₊.

Define a random variable Y by

    Y(ω) = 1 if ω even, 0 if ω odd

Using the inverse image formula for the pmf of Y, for Y(ω) = 1:

    p_Y(1) = Σ_{k=2,4,...} p(1 − p)^{k−1} = p(1 − p) Σ_{k=0}^{∞} ((1 − p)²)^k
           = p(1 − p) / (1 − (1 − p)²) = (1 − p)/(2 − p)

and similarly

    p_Y(0) = Σ_{k=1,3,...} p(1 − p)^{k−1} = p Σ_{k=0}^{∞} ((1 − p)²)^k
           = p / (1 − (1 − p)²) = 1/(2 − p)
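The two geometric sums above are easy to confirm by simulation; a minimal sketch assuming p = 0.3.

```python
import numpy as np

p = 0.3
rng = np.random.default_rng(3)

# Geometric on {1, 2, ...}: number of Bernoulli(p) trials up to first success.
omega = rng.geometric(p, size=1_000_000)
Y = (omega % 2 == 0).astype(int)       # Y = 1 if omega even, 0 if odd

print("empirical p_Y(1):", Y.mean())           # ~ (1-p)/(2-p)
print("formula   p_Y(1):", (1 - p) / (2 - p))
print("formula   p_Y(0):", 1 / (2 - p))
```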
Suppose the original space is (Ω, F, P) = (R, B(R), P), where P is described by a pdf g:

    P(F) = ∫_F g(r) dr; F ∈ B(R),

and X is a rv. The inverse image formula ⇒

    P_X(F) = P(X⁻¹(F)) = ∫_{r : X(r) ∈ F} g(r) dr

If X is discrete, find the pmf

    p_X(x) = ∫_{r : X(r) = x} g(r) dr

The quantizer example did this. If X is continuous, we want the pdf: first find the cdf, then differentiate.

Example: continuous derived distribution

Square of a random variable: (R, B(R), P) with P induced by a Gaussian pdf g. Define W : R → R by W(r) = r²; r ∈ R. Find the pdf f_W.

First find the cdf F_W, then differentiate. If w < 0, F_W(w) = 0. If w ≥ 0,

    F_W(w) = Pr(W ≤ w) = P({ω : W(ω) = ω² ≤ w}) = P([−w^{1/2}, w^{1/2}]) = ∫_{−w^{1/2}}^{w^{1/2}} g(r) dr

This can be complicated, but we do not need to plug in g yet.
Use the integral differentiation formula to get the pdf directly:

    (d/dw) ∫_{a(w)}^{b(w)} g(r) dr = g(b(w)) db(w)/dw − g(a(w)) da(w)/dw

In our example, b(w) = w^{1/2} and a(w) = −w^{1/2}, so

    f_W(w) = g(w^{1/2}) (w^{−1/2}/2) + g(−w^{1/2}) (w^{−1/2}/2)

E.g., if g = N(0, σ²), then

    f_W(w) = (w^{−1/2}/√(2πσ²)) e^{−w/2σ²}; w ∈ [0, ∞),

a chi-squared pdf with one degree of freedom.

The max and min functions

Let X ∼ f_X(x) and Y ∼ f_Y(y) be independent, so that f_{X,Y}(x, y) = f_X(x) f_Y(y). Define

    U = max{X, Y}, V = min{X, Y}

where

    max(x, y) = x if x ≥ y, y otherwise
    min(x, y) = y if x ≥ y, x otherwise
Find the pdfs of U and V.

To find the pdf of U, we first find its cdf. U ≤ u iff both X and Y are ≤ u, so using independence

    F_U(u) = Pr(U ≤ u) = Pr(X ≤ u, Y ≤ u) = F_X(u) F_Y(u)

Using the product rule for derivatives,

    f_U(u) = f_X(u) F_Y(u) + f_Y(u) F_X(u)

To find the pdf of V, first find the cdf. V ≤ v iff either X ≤ v or Y ≤ v, so using independence

    F_V(v) = Pr(X ≤ v or Y ≤ v) = 1 − Pr(X > v, Y > v) = 1 − (1 − F_X(v))(1 − F_Y(v))

Thus

    f_V(v) = f_X(v) + f_Y(v) − f_X(v) F_Y(v) − f_Y(v) F_X(v)
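The f_U and f_V formulas can be checked by Monte Carlo; a sketch assuming X and Y are iid exponential(1), so that f and F have closed forms. The density is estimated with a small histogram bin, so agreement is approximate.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

# Assume X, Y iid Exponential(1): f(v) = e^{-v}, F(v) = 1 - e^{-v}, v >= 0.
X = rng.exponential(1.0, n)
Y = rng.exponential(1.0, n)
U, V = np.maximum(X, Y), np.minimum(X, Y)

f = lambda v: np.exp(-v)
F = lambda v: 1.0 - np.exp(-v)

v, h = 0.7, 0.01    # evaluate the densities at one point via a narrow bin
fU = f(v)*F(v) + f(v)*F(v)                  # f_X(u)F_Y(u) + f_Y(u)F_X(u)
fV = 2*f(v) - f(v)*F(v) - f(v)*F(v)         # f_X + f_Y - f_X F_Y - f_Y F_X
print(np.mean(np.abs(U - v) < h/2) / h, "vs", fU)
print(np.mean(np.abs(V - v) < h/2) / h, "vs", fV)
```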
Directly-given random variables

All the named examples of pmfs (uniform, Bernoulli, binomial, geometric, Poisson) and pdfs (uniform, exponential, Gaussian, Laplacian, chi-squared, etc.) and the probability spaces they imply can be considered as describing random variables: suppose (Ω, F, P) = (Ω_X, B(Ω_X), P_X) is a probability space. Define a random variable V : Ω → Ω by

    V(ω) = ω,

the identity mapping: the random variable just reports the original sample value ω. This implies the output probability space in a trivial way:

    P_V(F) = P(V⁻¹(F)) = P(F)

A random variable is said to be Bernoulli, binomial, etc. if its distribution is determined by a Bernoulli, binomial, etc. pmf (or pdf).

Two random variables V and X (possibly defined on different experiments) are said to be equivalent or identically distributed if P_V = P_X, i.e., P_V(F) = P_X(F) for all events F; e.g., both continuous with the same pdf, or both discrete with the same pmf.

Example: a binary random variable defined as the quantization of a fair spinner vs. directly given as above.

Note: two ways to describe random variables:

1. Describe a probability space (Ω, F, P) and define a function X on it. Together these imply the distribution P_X for the rv (by a pmf or pdf).

2. (Directly given) Describe the distribution P_X directly (by a pmf or pdf) ⇒ (Ω_X, B(Ω_X), P_X) and X(ω) = ω.

Both representations are useful.

Derived distributions: random vectors

If the original space is discrete (continuous), so is the random variable, and the random variable is described by a pmf (pdf). As in the scalar case, the distribution can be described by probability functions: cdfs and either pmfs or pdfs (or both).

If a random vector has a discrete range space, then the distribution can be described by a multidimensional pmf p_X(x) = P_X({x}) = Pr(X = x), as

    P_X(F) = Σ_{x∈F} p_X(x) = Σ_{(x0,x1,...,xk−1)∈F} p_{X0,X1,...,Xk−1}(x0, x1, ..., xk−1)

If the random vector X has a continuous range space, then the distribution can be described by a multidimensional pdf f_X:

    P_X(F) = ∫_F f_X(x) dx

Use the multidimensional cdf to find the pdf.
Given a k-dimensional random vector X, define the cumulative distribution function (cdf) F_X by

    F_X(x) = F_{X0,X1,...,Xk−1}(x0, x1, ..., xk−1)
           = P({ω : Xi(ω) ≤ xi; i = 0, 1, ..., k − 1})
           = Pr(Xi ≤ xi; i = 0, 1, ..., k − 1)

Other ways to express the multidimensional cdf:

    F_X(x) = P_X({α : αi ≤ xi; i = 0, 1, ..., k − 1}) = P_X(×_{i=0}^{k−1} (−∞, xi]) = P(∩_{i=0}^{k−1} Xi⁻¹((−∞, xi]))

Integration and differentiation are inverses of each other ⇒

    F_X(x) = ∫_{−∞}^{x0} ∫_{−∞}^{x1} ··· ∫_{−∞}^{xk−1} f_{X0,X1,...,Xk−1}(α0, α1, ..., αk−1) dα0 dα1 ··· dαk−1

    f_{X0,X1,...,Xk−1}(x0, x1, ..., xk−1) = ∂ᵏ F_{X0,X1,...,Xk−1}(x0, x1, ..., xk−1) / (∂x0 ∂x1 ··· ∂xk−1)
Joint and marginal distributions

A random vector X = (X0, X1, ..., Xk−1) is a collection of random variables defined on a common probability space (Ω, F, P).

Alternatively, X is a random vector that takes on values randomly as described by a probability distribution P_X, without explicit reference to the underlying probability space.

Either the original probability measure P or the induced distribution P_X can be used to compute probabilities of events involving the random vector, e.g., to find the distributions of individual components of the random vector.

For example, if X = (X0, X1, ..., Xk−1) is discrete, described by a pmf p_X, then the distribution P_{X0} is described by the pmf p_{X0}(x0), which can be computed as

    p_{X0}(x0) = P({ω : X0(ω) = x0})
               = P({ω : X0(ω) = x0, Xi(ω) ∈ Ω_X; i = 1, 2, ..., k − 1})
               = Σ_{x1,x2,...,xk−1} p_X(x0, x1, x2, ..., xk−1)

In general, we have for cdfs that

    F_{X0}(x0) = P({ω : X0(ω) ≤ x0}) = P({ω : X0(ω) ≤ x0, Xi(ω) ∈ Ω_X; i = 1, ..., k − 1}) = F_X(x0, ∞, ∞, ..., ∞)

In English, all of these are Pr(X0 = x0) (pmf) or Pr(X0 ≤ x0) (cdf).

⇒ if the pdfs exist,

    f_{X0}(x0) = ∫ f_X(x0, x1, x2, ..., xk−1) dx1 dx2 ··· dxk−1

Can find the distribution of any of the components in this way:

    F_{Xi}(α) = F_X(∞, ..., ∞, α, ∞, ..., ∞) = Pr(Xi ≤ α and Xj ≤ ∞, all j ≠ i)

or

    p_{Xi}(α) = Σ_{x0,...,xi−1,xi+1,...,xk−1} p_{X0,...,Xk−1}(x0, ..., xi−1, α, xi+1, ..., xk−1)

or

    f_{Xi}(α) = ∫ dx0 ··· dxi−1 dxi+1 ··· dxk−1 f_{X0,...,Xk−1}(x0, ..., xi−1, α, xi+1, ..., xk−1)

Sum or integrate over all of the dummy variables corresponding to the unwanted random variables in the vector to obtain the pmf or pdf for the random variable Xi.

Similarly, one can find cdfs/pmfs/pdfs for any pairs or triples of random variables in the random vector, or any other subvector (at least in theory). These relations are called consistency relationships: a random vector distribution implies many other distributions, and these must be consistent with each other.
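Marginalization is just summing out the unwanted coordinates. A sketch with a made-up 2D joint pmf stored as an array; the values are illustrative only.

```python
import numpy as np

# A joint pmf p_{X0,X1} on {0,1,2} x {0,1}; rows indexed by x0, columns by x1.
p_joint = np.array([[0.10, 0.20],
                    [0.25, 0.15],
                    [0.05, 0.25]])
assert np.isclose(p_joint.sum(), 1.0)

p_x0 = p_joint.sum(axis=1)   # sum over x1 -> marginal pmf of X0
p_x1 = p_joint.sum(axis=0)   # sum over x0 -> marginal pmf of X1
print(p_x0, p_x1)
```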
2D random vectors

The ideas are clearest with only 2 rvs: (X, Y) a random vector.

The marginal distribution of X is obtained from the joint distribution of X and Y by leaving Y unconstrained:

    P_X(F) = P_{X,Y}({(x, y) : x ∈ F, y ∈ R}); F ∈ B(R)

The marginal cdf of X is

    F_X(α) = F_{X,Y}(α, ∞)

If the range space of the vector (X, Y) is discrete,

    p_X(x) = Σ_y p_{X,Y}(x, y)

If the range space of the vector (X, Y) is continuous and the cdf is differentiable so that f_{X,Y}(x, y) exists,

    f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy,

with similar expressions for the distribution of the rv Y.

Joint distributions imply marginal distributions. The opposite is not true without additional assumptions, e.g., independence.
Examples of joint and marginal distributions

Suppose the rvs X and Y are such that the random vector (X, Y) has a pmf of the form p_{X,Y}(x, y) = r(x) q(y), where r and q are both valid pmfs (p_{X,Y} is a product pmf). Then

    p_X(x) = Σ_y p_{X,Y}(x, y) = Σ_y r(x) q(y) = r(x) Σ_y q(y) = r(x)

Thus in the special case of a product distribution, knowing the marginal pmfs is enough to know the joint distribution. Thus marginal distributions + independence ⇒ the joint distribution.

Example

A pair of fair coins provides an example:

    p_{XY}(x, y) = p_X(x) p_Y(y) = 1/4; x, y = 0, 1
    p_X(x) = p_Y(y) = 1/2; x = 0, 1

Example of where marginals are not enough

Flip two fair coins connected by a piece of flexible rubber ⇒

    p_XY(x, y)    y = 0    y = 1
    x = 0          0.4      0.1
    x = 1          0.1      0.4

⇒ p_X(x) = p_Y(y) = 1/2, x = 0, 1.

Not a product distribution, but the same marginals as in the product-distribution case. Quite different joints can yield the same marginals: marginals alone do not tell the whole story.

Another example

A loaded pair of six-sided dice has the property that the sum of the two dice = 7 on every roll. All 6 possible combinations ((1,6), (2,5), (3,4), (4,3), (5,2), (6,1)) have equal probability.

Suppose the outcome of one die is X, the other is Y. (X, Y) is a random vector taking values in {1, 2, ..., 6}²:

    p_{X,Y}(x, y) = 1/6, x + y = 7, (x, y) ∈ {1, 2, ..., 6}²

Find the marginal pmfs.
    p_X(x) = Σ_y p_{XY}(x, y) = p_{XY}(x, 7 − x) = 1/6, x = 1, 2, ..., 6

The same as if it were a product distribution ⇒ marginals alone do not imply the joint.

Continuous example

(X, Y) a random vector with a pdf that is constant on the unit disk in the XY plane:

    f_{X,Y}(x, y) = C if x² + y² ≤ 1; 0 otherwise

Find the marginal pdfs. Is it a product pdf? Need C:

    ∫∫_{x²+y²≤1} C dx dy = 1

The integral = the area of a circle multiplied by C ⇒ πC = 1 ⇒ C = 1/π.

    f_X(x) = ∫_{−√(1−x²)}^{+√(1−x²)} C dy = 2C √(1 − x²) = (2/π) √(1 − x²), x² ≤ 1

Could also find C by a second integration:

    2C ∫_{−1}^{1} √(1 − x²) dx = πC = 1 ⇒ C = 1/π

By symmetry Y has the same pdf. f_{X,Y} is not a product pdf.

Note that the marginal pdf is not constant, even though the joint pdf is.
Joints and marginals: Gaussian pair

2D Gaussian pdf with k = 2, m = (0, 0)ᵗ, and Λ = {λ(i, j) : λ(1,1) = λ(2,2) = 1, λ(1,2) = λ(2,1) = ρ}.

The inverse matrix is

    Λ⁻¹ = (1/(1 − ρ²)) [  1  −ρ ]
                       [ −ρ   1 ]

so the joint pdf for the random vector (X, Y) is

    f_{X,Y}(x, y) = (1/(2π√(1 − ρ²))) exp{ −(x² + y² − 2ρxy) / (2(1 − ρ²)) }, (x, y) ∈ R²

ρ is called the "correlation coefficient." Need ρ² < 1 for Λ to be positive definite.

To find the pdf of X, integrate the joint over y. Do this using the standard trick: complete the square:

    x² + y² − 2ρxy = (y − ρx)² − ρ²x² + x² = (y − ρx)² + (1 − ρ²)x²

so

    f_{X,Y}(x, y) = [ (1/√(2π(1 − ρ²))) exp{ −(y − ρx)²/(2(1 − ρ²)) } ] × [ (1/√(2π)) exp{ −x²/2 } ]

Part of the joint is N(ρx, 1 − ρ²), which integrates to 1. Thus

    f_X(x) = (2π)^{−1/2} e^{−x²/2}

Note that the marginals are the same regardless of ρ!
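A quick simulation of the last point, that the marginal of X is N(0, 1) for any ρ; a sketch using an assumed set of ρ values.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500_000

for rho in (0.0, 0.5, 0.9):
    cov = [[1.0, rho], [rho, 1.0]]
    xy = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    x = xy[:, 0]
    # Marginal of X should be N(0, 1) regardless of rho.
    print(f"rho={rho}: mean={x.mean():+.3f}, var={x.var():.3f}")
```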
Consistency & directly given processes

Have seen two ways to describe (specify) a random variable: as a probability space plus a function (random variable), or as a directly given rv (a distribution, i.e., a pdf or pmf). The same idea works for random vectors. What about random processes? E.g., a direct definition of the fair coin flipping process.

For simplicity, consider a discrete-time, discrete-alphabet random process, say {Xn}. Given the random process, can use the inverse image formula to compute the pmf for any finite collection of samples (X_{k1}, X_{k2}, ..., X_{kK}), e.g.,

    p_{Xk1,Xk2,...,XkK}(x1, x2, ..., xK) = Pr(X_{ki} = xi; i = 1, ..., K) = P({ω : X_{ki}(ω) = xi; i = 1, ..., K})

For example, in the fair coin flipping process

    p_{Xk1,Xk2,...,XkK}(x1, x2, ..., xK) = 2^{−K}, all (x1, x2, ..., xK) ∈ {0, 1}^K

The axioms of probability ⇒ these pmfs, for any choice of K and k1, ..., kK, must be consistent in the sense that if any of the pmfs is used to compute the probability of an event, the answer must be the same. E.g.,

    p_{X1}(x1) = Σ_{x2} p_{X1,X2}(x1, x2) = Σ_{x0,x2} p_{X0,X1,X2}(x0, x1, x2) = Σ_{x3,x5} p_{X1,X3,X5}(x1, x3, x5)

since all of these computations yield the same probability in the original probability space: Pr(X1 = x1) = P({ω : X1(ω) = x1}).

Bottom line: if given a discrete-time, discrete-alphabet random process {Xn; n ∈ Z}, then for any finite K and collection of K sample times k1, ..., kK one can find the joint pmf p_{Xk1,Xk2,...,XkK}(x1, x2, ..., xK), and this collection of pmfs must be consistent. The same result holds for continuous-time random processes and for continuous-alphabet processes (a family of pdfs).

Kolmogorov proved a converse to this idea, now called the Kolmogorov extension theorem, which provides the most common method for describing a random process: to completely describe a random process, you need only provide a formula for a consistent family of pmfs for finite collections of samples.

Theorem. Kolmogorov extension theorem for discrete time processes. Given a consistent family of finite-dimensional pmfs p_{Xk1,Xk2,...,XkK}(x1, x2, ..., xK) for all dimensions K and sample times k1, ..., kK, there is a random process {Xn; n ∈ Z} described by these marginals.

Difficult to prove, but the most common way to specify a model: the Kolmogorov or directly-given representation of a random process describes a consistent family of vector distributions. For completeness:

Theorem. Kolmogorov extension theorem. Suppose that one is given a consistent family of finite-dimensional distributions P_{Xt0,Xt1,...,Xtk−1} for all positive integers k and all possible sample times ti ∈ T; i = 0, 1, ..., k − 1. Then there exists a random process {Xt; t ∈ T} that is consistent with this family. In other words, to describe a random process completely, it is sufficient to describe a consistent family of finite-dimensional distributions of its samples.

Example: given a pmf p, define a family of vector pmfs by

    p_{Xk1,Xk2,...,XkK}(x1, x2, ..., xK) = ∏_{i=1}^{K} p(xi);

then there is a random process {Xn} having these vector pmfs for finite collections of samples. A process of this form is called an iid process.

The continuous-alphabet analog is defined in terms of a pdf f: define the vector pdfs by

    f_{Xk1,Xk2,...,XkK}(x1, x2, ..., xK) = ∏_{i=1}^{K} f(xi)

A discrete-time, continuous-alphabet process is iid if its joint pdfs factor in this way.
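The iid construction is exactly how one simulates a directly-given process: specify a single pmf p and draw every sample independently. A sketch with an assumed three-letter alphabet and pmf.

```python
import numpy as np

rng = np.random.default_rng(6)

alphabet = np.array([0, 1, 2])
p = np.array([0.5, 0.3, 0.2])        # the pmf that defines the iid process

# Any finite collection of samples has the product pmf; simulate 20 samples.
x = rng.choice(alphabet, size=20, p=p)
print(x)

# Joint pmf of (X_3, X_7, X_9) at (0, 1, 0) under the iid specification:
print(p[0] * p[1] * p[0])
```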
Independent random variables

Return to the definition of independent rvs, with more explanation.

The definition of independent random variables is an application of the definition of independent events. We defined events F and G to be independent if

    P(F ∩ G) = P(F) P(G)

Two random variables X and Y defined on a probability space are independent if the events X⁻¹(F) and Y⁻¹(G) are independent for all F and G in B(R), i.e., if

    P(X⁻¹(F) ∩ Y⁻¹(G)) = P(X⁻¹(F)) P(Y⁻¹(G))

Equivalently, Pr(X ∈ F, Y ∈ G) = Pr(X ∈ F) Pr(Y ∈ G), or P_{XY}(F × G) = P_X(F) P_Y(G).

If X, Y are discrete, choosing F = {x}, G = {y} ⇒

    p_{XY}(x, y) = p_X(x) p_Y(y), all x, y

Conversely, if the joint pmf = the product of the marginals, then evaluate Pr(X ∈ F, Y ∈ G) as

    P(X⁻¹(F) ∩ Y⁻¹(G)) = Σ_{x∈F, y∈G} p_{XY}(x, y) = Σ_{x∈F, y∈G} p_X(x) p_Y(y)
                       = ( Σ_{x∈F} p_X(x) )( Σ_{y∈G} p_Y(y) ) = P(X⁻¹(F)) P(Y⁻¹(G))

⇒ independent by the general definition.

For general random variables, consider F = (−∞, x], G = (−∞, y]. Then if X, Y are independent, F_{XY}(x, y) = F_X(x) F_Y(y) for all x, y. If pdfs exist, this implies that f_{XY}(x, y) = f_X(x) f_Y(y). Conversely, if this relation holds for all x, y, then P(X⁻¹(F) ∩ Y⁻¹(G)) = P(X⁻¹(F)) P(Y⁻¹(G)), and hence X and Y are independent.

A collection of rvs {Xi; i = 0, 1, ..., k − 1} is independent or mutually independent if all collections of events of the form {Xi⁻¹(Fi); i = 0, 1, ..., k − 1} are mutually independent for any Fi ∈ B(R); i = 0, 1, ..., k − 1.

A collection of discrete random variables Xi; i = 0, 1, ..., k − 1 is mutually independent iff

    p_{X0,...,Xk−1}(x0, ..., xk−1) = ∏_{i=0}^{k−1} p_{Xi}(xi), ∀ xi

A collection of continuous random variables is independent iff the joint pdf factors as

    f_{X0,...,Xk−1}(x0, ..., xk−1) = ∏_{i=0}^{k−1} f_{Xi}(xi)
A collection of general random variables is independent iff the joint cdf factors as

    F_{X0,...,Xk−1}(x0, ..., xk−1) = ∏_{i=0}^{k−1} F_{Xi}(xi); (x0, x1, ..., xk−1) ∈ Rᵏ

The random vector is independent, identically distributed (iid) if the components are independent and the marginal distributions are all the same.

Conditional distributions

Apply conditional probability to distributions: one can express joint probabilities as products even if the rvs are not independent. E.g., the distribution of the input given the observed output (for inference).

There are many types: conditional pmfs, conditional pdfs, conditional cdfs; and elementary and nonelementary conditional probability.
Discrete conditional distributions

Simplest case: a direct application of elementary conditional probability to pmfs.

Consider a 2D discrete random vector (X, Y) with alphabet A_X × A_Y, joint pmf p_{X,Y}(x, y), and marginal pmfs p_X and p_Y.

Define for each x ∈ A_X for which p_X(x) > 0 the conditional pmf

    p_{Y|X}(y|x) = P(Y = y | X = x)
                 = P(Y = y, X = x) / P(X = x)
                 = P({ω : Y(ω) = y} ∩ {ω : X(ω) = x}) / P({ω : X(ω) = x})
                 = p_{X,Y}(x, y) / p_X(x),

the elementary conditional probability that Y = y given X = x.
Properties of conditional pmfs

For fixed x, p_{Y|X}(·|x) is a pmf:

    Σ_{y∈A_Y} p_{Y|X}(y|x) = Σ_{y∈A_Y} p_{X,Y}(x, y)/p_X(x) = (1/p_X(x)) Σ_{y∈A_Y} p_{X,Y}(x, y) = p_X(x)/p_X(x) = 1

The joint pmf can be expressed as a product:

    p_{X,Y}(x, y) = p_{Y|X}(y|x) p_X(x)

If X and Y are independent, then p_{Y|X}(y|x) = p_Y(y).

Can compute conditional probabilities by summing conditional pmfs:

    P(Y ∈ F | X = x) = Σ_{y∈F} p_{Y|X}(y|x)

Can write probabilities of events of the form X ∈ G, Y ∈ F (rectangles) as

    P(X ∈ G, Y ∈ F) = Σ_{x,y: x∈G, y∈F} p_{X,Y}(x, y) = Σ_{x∈G} p_X(x) Σ_{y∈F} p_{Y|X}(y|x) = Σ_{x∈G} p_X(x) P(F | X = x)

Later: define nonelementary conditional probability to mimic this formula.
Given p_{Y|X} and p_X, Bayes' rule for pmfs:

    p_{X|Y}(x|y) = p_{X,Y}(x, y)/p_Y(y) = p_{Y|X}(y|x) p_X(x) / Σ_u p_{Y|X}(y|u) p_X(u),

a result often referred to as Bayes' rule.

Example of Bayes rule: Binary Symmetric Channel

Consider the following binary communication channel:

                     Z ∈ {0, 1}
                         |
                         v
    X ∈ {0, 1} ----->   (+)   -----> Y ∈ {0, 1}

The bit sent is X ∼ Bern(p), 0 ≤ p ≤ 1; the noise is Z ∼ Bern(ε), 0 ≤ ε ≤ 0.5; the bit received is Y = (X + Z) mod 2 = X ⊕ Z; and X and Z are independent.

Find 1) p_{X|Y}(x|y), 2) p_Y(y), and 3) Pr{X ≠ Y}, the probability of error.
1. To find p_{X|Y}(x|y), use Bayes' rule:

    p_{X|Y}(x|y) = p_{Y|X}(y|x) p_X(x) / Σ_{x′∈A_X} p_{Y|X}(y|x′) p_X(x′)

We know p_X(x), but we need to find p_{Y|X}(y|x):

    p_{Y|X}(y|x) = Pr{Y = y | X = x} = Pr{X ⊕ Z = y | X = x} = Pr{x ⊕ Z = y | X = x}
                 = Pr{Z = y ⊕ x | X = x} = Pr{Z = y ⊕ x} = p_Z(y ⊕ x)

since Z and X are independent. Therefore

    p_{Y|X}(0|0) = p_Z(0 ⊕ 0) = p_Z(0) = 1 − ε
    p_{Y|X}(0|1) = p_Z(0 ⊕ 1) = p_Z(1) = ε
    p_{Y|X}(1|0) = p_Z(1 ⊕ 0) = p_Z(1) = ε
    p_{Y|X}(1|1) = p_Z(1 ⊕ 1) = p_Z(0) = 1 − ε
Plugging into Bayes' rule:

    p_{X|Y}(0|0) = p_{Y|X}(0|0) p_X(0) / (p_{Y|X}(0|0) p_X(0) + p_{Y|X}(0|1) p_X(1)) = (1 − ε)(1 − p) / ((1 − ε)(1 − p) + εp)
    p_{X|Y}(1|0) = 1 − p_{X|Y}(0|0) = εp / ((1 − ε)(1 − p) + εp)
    p_{X|Y}(0|1) = p_{Y|X}(1|0) p_X(0) / (p_{Y|X}(1|0) p_X(0) + p_{Y|X}(1|1) p_X(1)) = ε(1 − p) / (ε(1 − p) + (1 − ε)p)
    p_{X|Y}(1|1) = 1 − p_{X|Y}(0|1) = (1 − ε)p / (ε(1 − p) + (1 − ε)p)

2. We already found p_Y(y) as

    p_Y(y) = p_{Y|X}(y|0) p_X(0) + p_{Y|X}(y|1) p_X(1)
           = (1 − ε)(1 − p) + εp   for y = 0
           = ε(1 − p) + (1 − ε)p   for y = 1

3. Now, to find the probability of error Pr{X ≠ Y}, consider

    Pr{X ≠ Y} = p_{X,Y}(0, 1) + p_{X,Y}(1, 0) = p_{Y|X}(1|0) p_X(0) + p_{Y|X}(0|1) p_X(1) = ε(1 − p) + εp = ε

An interesting special case is ε = 1/2. Here Pr{X ≠ Y} = 1/2, which is the worst possible (no information is sent), and

    p_Y(0) = (1/2)p + (1/2)(1 − p) = 1/2 = p_Y(1)

Therefore Y ∼ Bern(1/2), independent of the value of p! In this case, the bit sent X and the bit received Y are independent (check this).
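A sketch checking the BSC formulas numerically, with assumed values p = 0.3 and ε = 0.1.

```python
import numpy as np

p, eps = 0.3, 0.1                   # assumed Bern(p) input, Bern(eps) noise

# Output pmf p_Y from total probability.
pY0 = (1 - eps) * (1 - p) + eps * p
pY1 = eps * (1 - p) + (1 - eps) * p

# Posteriors from Bayes' rule, and the error probability.
print("p_{X|Y}(0|0) =", (1 - eps) * (1 - p) / pY0)
print("p_{X|Y}(1|1) =", (1 - eps) * p / pY1)
print("Pr{X != Y}   =", eps * (1 - p) + eps * p)   # = eps

# Monte Carlo check.
rng = np.random.default_rng(7)
X = rng.random(1_000_000) < p
Z = rng.random(1_000_000) < eps
Y = X ^ Z                           # Y = X xor Z
print("empirical Pr{X != Y}:", np.mean(X != Y))
```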
Conditional pmfs for vectors

Random vector (X0, X1, ..., Xk−1) with pmf p_{X0,X1,...,Xk−1}.

Define conditional pmfs (assuming the denominators are ≠ 0):

    p_{Xl|X0,...,Xl−1}(xl|x0, ..., xl−1) = p_{X0,...,Xl}(x0, ..., xl) / p_{X0,...,Xl−1}(x0, ..., xl−1)

⇒ chain rule:

    p_{X0,X1,...,Xn−1}(x0, x1, ..., xn−1)
      = [ p_{X0,...,Xn−1}(x0, ..., xn−1) / p_{X0,...,Xn−2}(x0, ..., xn−2) ] p_{X0,...,Xn−2}(x0, ..., xn−2)
      = ···
      = p_{X0}(x0) ∏_{i=1}^{n−1} p_{X0,...,Xi}(x0, ..., xi) / p_{X0,...,Xi−1}(x0, ..., xi−1)
      = p_{X0}(x0) ∏_{l=1}^{n−1} p_{Xl|X0,...,Xl−1}(xl|x0, ..., xl−1)

The formula plays an important role in characterizing memory in processes. It can be used to construct joint pmfs, and to specify a random process.
Continuous conditional distributions

Continuous distributions are more complicated: the conditioning event has probability 0, so elementary conditional probability does not work.

Given X, Y with joint pdf f_{X,Y} and marginal pdfs f_X, f_Y, define the conditional pdf

    f_{Y|X}(y|x) ≡ f_{X,Y}(x, y) / f_X(x)

It is analogous to a conditional pmf but, unlike a conditional pmf, it is not a conditional probability! It is a density of conditional probability.
The conditional pdf is a pdf: for fixed x,

    ∫ f_{Y|X}(y|x) dy = ∫ [ f_{X,Y}(x, y)/f_X(x) ] dy = (1/f_X(x)) ∫ f_{X,Y}(x, y) dy = f_X(x)/f_X(x) = 1,

provided we require f_X(x) > 0 over the region of integration.

Nonelementary conditional probability

Given a conditional pdf f_{Y|X}, define the (nonelementary) conditional probability that Y ∈ F given X = x by

    P(Y ∈ F | X = x) ≡ ∫_F f_{Y|X}(y|x) dy

This resembles the discrete form. But does it make sense as an appropriate definition of conditional probability given an event of zero probability?

Observe that, analogous to the corresponding result for pmfs (assuming the pdfs all make sense),

    P(X ∈ G, Y ∈ F) = ∫∫_{x∈G, y∈F} f_{X,Y}(x, y) dx dy
                    = ∫_{x∈G} f_X(x) [ ∫_{y∈F} f_{Y|X}(y|x) dy ] dx
                    = ∫_{x∈G} f_X(x) P(F | X = x) dx
Our definition is ad hoc. The careful mathematical definition of conditional probability P(F | X = x) for an event of zero probability is made not by a formula such as we have used to define conditional pmfs, conditional pdfs, and elementary conditional probability, but by its behavior inside an integral (like the Dirac delta). In particular, P(F | X = x) is defined as any measurable function satisfying the above equation for all events F and G, which our definition does.

Bayes rule for pdfs

    f_{X|Y}(x|y) = f_{X,Y}(x, y)/f_Y(y) = f_{Y|X}(y|x) f_X(x) / ∫ f_{Y|X}(y|u) f_X(u) du

Example of conditional pdfs: 2D Gaussian

U = (X, Y), a Gaussian pdf with mean vector (m_X, m_Y)ᵗ and covariance matrix

    Λ = [ σ²_X      ρσ_Xσ_Y ]
        [ ρσ_Xσ_Y   σ²_Y    ]
Algebra ⇒

    det(Λ) = σ²_X σ²_Y (1 − ρ²)

    Λ⁻¹ = (1/(1 − ρ²)) [ 1/σ²_X        −ρ/(σ_Xσ_Y) ]
                       [ −ρ/(σ_Xσ_Y)   1/σ²_Y     ]

so the joint pdf

    f_{XY}(x, y) = (1/(2π√(det Λ))) exp{ −½ (x − m_X, y − m_Y) Λ⁻¹ (x − m_X, y − m_Y)ᵗ }
                 = (1/(2πσ_Xσ_Y√(1 − ρ²))) exp{ −(1/(2(1 − ρ²))) [ ((x − m_X)/σ_X)² − 2ρ((x − m_X)/σ_X)((y − m_Y)/σ_Y) + ((y − m_Y)/σ_Y)² ] }

Integrate the joint over y (completing the square, as before) ⇒

    f_X(x) = (1/√(2πσ²_X)) e^{−(x − m_X)²/2σ²_X}

Rearrange ⇒

    f_{XY}(x, y) = [ (1/√(2πσ²_X)) exp{ −½ ((x − m_X)/σ_X)² } ]
                   × [ (1/√(2πσ²_Y(1 − ρ²))) exp{ −½ ( (y − m_Y − (ρσ_Y/σ_X)(x − m_X)) / (√(1 − ρ²) σ_Y) )² } ]

so

    f_{Y|X}(y|x) = (1/√(2πσ²_Y(1 − ρ²))) exp{ −½ ( (y − m_Y − (ρσ_Y/σ_X)(x − m_X)) / (√(1 − ρ²) σ_Y) )² },

Gaussian with conditional variance σ²_{Y|X} ≡ σ²_Y(1 − ρ²) and conditional mean m_{Y|X} ≡ m_Y + ρ(σ_Y/σ_X)(x − m_X).

Similarly, f_Y(y) and f_{X|Y}(x|y) are also Gaussian. Note: X and Y jointly Gaussian ⇒ both individually and conditionally Gaussian!
Chain rule for pdfs

Assume f_{X0,X1,...,Xi}(x0, x1, ..., xi) > 0. Then

    f_{X0,X1,...,Xn−1}(x0, x1, ..., xn−1)
      = [ f_{X0,...,Xn−1}(x0, ..., xn−1) / f_{X0,...,Xn−2}(x0, ..., xn−2) ] f_{X0,...,Xn−2}(x0, ..., xn−2)
      = ···
      = f_{X0}(x0) ∏_{i=1}^{n−1} f_{X0,...,Xi}(x0, ..., xi) / f_{X0,...,Xi−1}(x0, ..., xi−1)
      = f_{X0}(x0) ∏_{i=1}^{n−1} f_{Xi|X0,...,Xi−1}(xi|x0, ..., xi−1)
Statistical detection and classification

A simple application of conditional probability mass functions describing discrete random vectors.

Transmitted: discrete rv X, pmf p_X (e.g., one sample of a binary random process). Received: rv Y, with conditional pmf (noisy channel) p_{Y|X}(y|x).

Given the observation Y, what is the best guess X̂(Y) of the transmitted value? X̂ is a decision rule or detection rule.

Measure quality by the probability that the guess is correct:

    P_c(X̂) = Pr(X = X̂(Y)) = 1 − P_e, where P_e(X̂) = Pr(X̂(Y) ≠ X)

A decision rule is optimal if it yields the smallest possible P_e, or the maximum possible P_c.

More specific example as a special case: X Bernoulli with parameter p, p_X(1) = p, and a binary symmetric channel (BSC):

    p_{Y|X}(y|x) = ε if x ≠ y; 1 − ε if x = y

Now

    Pr(X̂ = X) = 1 − P_e(X̂) = Σ_{(x,y): X̂(y)=x} p_{X,Y}(x, y) = Σ_{(x,y): X̂(y)=x} p_{X|Y}(x|y) p_Y(y)
              = Σ_y p_Y(y) Σ_{x: X̂(y)=x} p_{X|Y}(x|y) = Σ_y p_Y(y) p_{X|Y}(X̂(y)|y)

To maximize the sum, maximize p_{X|Y}(X̂(y)|y) for each y. Accomplished by

    X̂(y) ≡ arg max_u p_{X|Y}(u|y),

which yields p_{X|Y}(X̂(y)|y) = max_u p_{X|Y}(u|y). This is the maximum a posteriori (MAP) detection rule.

In the binary example: choose X̂(y) = y if ε < 1/2 and X̂(y) = 1 − y if ε > 1/2 ⇒ the minimum (optimal) error probability over all possible rules is min(ε, 1 − ε).

In the general nonbinary case, statistical detection is statistical classification: the unseen X might be the presence or absence of a disease, the observation Y the results of various tests.

General Bayesian classification allows weighting of the cost of different kinds of errors (Bayes risk), so one minimizes a weighted average (expected cost) instead of only the probability of error.
Additive noise: discrete random variables

Common setup in communications, signal processing, statistics: an original signal X has random noise W (independent of X) added to it; we observe Y = X + W. Typically we use the observation Y to make an inference about X.

Begin by deriving the conditional distributions.

Discrete case: we have independent rvs X and W with pmfs p_X and p_W. Form Y = X + W. Find p_Y.

Use the inverse image formula:

    p_{X,Y}(x, y) = Pr(X = x, Y = y) = Pr(X = x, X + W = y)
                  = Σ_{α,β: α=x, α+β=y} p_{X,W}(α, β) = p_{X,W}(x, y − x) = p_X(x) p_W(y − x)

Note: the formula only makes sense if y − x is in the range space of W. Thus

    p_{Y|X}(y|x) = p_{X,Y}(x, y)/p_X(x) = p_W(y − x)

Intuitive!

Marginal for Y:

    p_Y(y) = Σ_x p_{X,Y}(x, y) = Σ_x p_X(x) p_W(y − x),

a discrete convolution.
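The discrete convolution can be computed directly; a sketch with assumed pmfs on small integer supports starting at 0.

```python
import numpy as np

# Assumed pmfs: X on {0,1,2}, W on {0,1}; array index = value.
p_X = np.array([0.2, 0.5, 0.3])
p_W = np.array([0.6, 0.4])

# p_Y(y) = sum_x p_X(x) p_W(y - x): a discrete convolution.
p_Y = np.convolve(p_X, p_W)
print(p_Y, p_Y.sum())    # pmf of Y = X + W on {0,1,2,3}; sums to 1
```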
The above uses ordinary real arithmetic. Similar results hold for other definitions of addition, e.g., modulo-2 arithmetic for binary X, W.

As with linear systems, convolutions can usually be evaluated easily in the transform domain. Will do so shortly.

Additive noise: continuous random variables

f_{X,W}(x, w) = f_X(x) f_W(w) (independent), Y = X + W. Find f_{Y|X} and f_Y.

Since everything is continuous, find the joint pdf by first finding the joint cdf:

    F_{X,Y}(x, y) = Pr(X ≤ x, Y ≤ y) = Pr(X ≤ x, X + W ≤ y)
                  = ∫∫_{α,β: α≤x, α+β≤y} f_{X,W}(α, β) dα dβ
                  = ∫_{−∞}^{x} dα ∫_{−∞}^{y−α} dβ f_X(α) f_W(β)
                  = ∫_{−∞}^{x} dα f_X(α) F_W(y − α)
Taking derivatives:

    f_{X,Y}(x, y) = f_X(x) f_W(y − x)

⇒ f_{Y|X}(y|x) = f_W(y − x)

⇒ f_Y(y) = ∫ f_{X,Y}(x, y) dx = ∫ f_X(x) f_W(y − x) dx,

a convolution integral of the pdfs f_X and f_W.

The pdf f_{X|Y} follows from Bayes' rule:

    f_{X|Y}(x|y) = f_X(x) f_W(y − x) / ∫ f_X(α) f_W(y − α) dα
Additive Gaussian noise

Assume f_X = N(0, σ²_X), f_W = N(0, σ²_W), f_{X,W}(x, w) = f_X(x) f_W(w), Y = X + W. Then

    f_{Y|X}(y|x) = f_W(y − x) = (2πσ²_W)^{−1/2} e^{−(y−x)²/2σ²_W},

which is N(x, σ²_W).

To find f_{X|Y} using Bayes' rule, we need f_Y:

    f_Y(y) = ∫_{−∞}^{∞} f_{Y|X}(y|α) f_X(α) dα
           = ∫_{−∞}^{∞} (2πσ²_W)^{−1/2} exp{−(y − α)²/2σ²_W} (2πσ²_X)^{−1/2} exp{−α²/2σ²_X} dα
           = (1/(2πσ_Xσ_W)) ∫_{−∞}^{∞} exp{ −½ [ (y² − 2αy + α²)/σ²_W + α²/σ²_X ] } dα
           = (1/(2πσ_Xσ_W)) exp{−y²/2σ²_W} ∫_{−∞}^{∞} exp{ −½ [ α²(1/σ²_X + 1/σ²_W) − 2αy/σ²_W ] } dα

Can integrate by completing the square (later we will see an easier way using transforms, but this trick is not difficult).

The integrand resembles exp{−½ ((α − m)/σ)²}, which has integral

    ∫_{−∞}^{∞} exp{ −½ ((α − m)/σ)² } dα = √(2πσ²)

(a Gaussian pdf integrates to 1). Compare

    α²(1/σ²_X + 1/σ²_W) − 2αy/σ²_W   vs.   ((α − m)/σ)² = α²/σ² − 2αm/σ² + m²/σ²
The bracketed terms will be the same if we choose

    1/σ² = 1/σ²_W + 1/σ²_X  ⇒  σ² = σ²_Xσ²_W/(σ²_X + σ²_W)
    m/σ² = y/σ²_W           ⇒  m = (σ²/σ²_W) y

"completing the square." Then

    ∫_{−∞}^{∞} exp{ −½ [ α²(1/σ²_X + 1/σ²_W) − 2αy/σ²_W ] } dα = e^{m²/2σ²} ∫_{−∞}^{∞} exp{ −½ ((α − m)/σ)² } dα = √(2πσ²) e^{m²/2σ²}

so

    f_Y(y) = (1/(2πσ_Xσ_W)) e^{−y²/2σ²_W} √(2πσ²) e^{m²/2σ²} = (2π(σ²_X + σ²_W))^{−1/2} exp{ −½ y²/(σ²_X + σ²_W) }

So f_Y = N(0, σ²_X + σ²_W): the sum of two independent zero-mean Gaussian rvs is another zero-mean Gaussian rv, and the variance of the sum = the sum of the variances.
For the a posteriori probability f_{X|Y}, use Bayes' rule + algebra:

    f_{X|Y}(x|y) = f_{Y|X}(y|x) f_X(x) / f_Y(y)
                 = [ (2πσ²_W)^{−1/2} e^{−(y−x)²/2σ²_W} ] [ (2πσ²_X)^{−1/2} e^{−x²/2σ²_X} ] / [ (2π(σ²_X + σ²_W))^{−1/2} e^{−y²/2(σ²_X+σ²_W)} ]
                 = (2πσ²_Xσ²_W/(σ²_X + σ²_W))^{−1/2} exp{ −(x − yσ²_X/(σ²_X + σ²_W))² / (2σ²_Xσ²_W/(σ²_X + σ²_W)) }

and thus

    f_{X|Y}(x|y) = N( yσ²_X/(σ²_X + σ²_W), σ²_Xσ²_W/(σ²_X + σ²_W) )

The mean of a conditional distribution is called a conditional mean; the variance of a conditional distribution is called a conditional variance.
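A Monte Carlo sanity check of f_Y = N(0, σ²_X + σ²_W) and of the conditional mean and variance, with assumed variances; the conditioning on Y = y is approximated by a narrow bin around y.

```python
import numpy as np

rng = np.random.default_rng(8)
sx2, sw2 = 2.0, 1.0                  # assumed variances sigma_X^2, sigma_W^2
n = 2_000_000

X = rng.normal(0.0, np.sqrt(sx2), n)
W = rng.normal(0.0, np.sqrt(sw2), n)
Y = X + W

print("var(Y):", Y.var(), "vs", sx2 + sw2)

# Conditional mean/variance of X given Y near y.
y = 1.5
sel = np.abs(Y - y) < 0.02
print("E[X | Y ~ y]  :", X[sel].mean(), "vs", y * sx2 / (sx2 + sw2))
print("var(X | Y ~ y):", X[sel].var(), "vs", sx2 * sw2 / (sx2 + sw2))
```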
Continuous additive noise with discrete input

The most important case of mixed distributions in communications applications.

Typical: binary random variable X, Gaussian random variable W, X and W independent, Y = X + W.

The previous examples do not work: one rv is discrete, the other continuous. The signal processing issue is similar: observe Y, guess X. As before, this may be one sample of a random process; in practice we have {Xn}, {Wn}, {Yn}, and at time n we observe Yn and guess Xn.

The conditional cdf F_{Y|X}(y|x) for Y given X = x is an elementary conditional probability, analogous to the purely discrete and purely continuous cases:

    F_{Y|X}(y|x) = Pr(Y ≤ y | X = x) = Pr(X + W ≤ y | X = x) = Pr(x + W ≤ y | X = x)
                 = Pr(W ≤ y − x | X = x) = Pr(W ≤ y − x) = F_W(y − x)

Differentiating,

    f_{Y|X}(y|x) = dF_{Y|X}(y|x)/dy = dF_W(y − x)/dy = f_W(y − x)

The joint distribution is described by a combination of pmf and pdf:

    Pr(X ∈ F and Y ∈ G) = Σ_{x∈F} p_X(x) ∫_G f_{Y|X}(y|x) dy = Σ_{x∈F} p_X(x) ∫_G f_W(y − x) dy

Choosing F to be the entire input alphabet yields

    Pr(Y ∈ G) = ∫_G [ Σ_x p_X(x) f_W(y − x) ] dy  ⇒  f_Y(y) = Σ_x p_X(x) f_W(y − x),

a convolution, analogous to the pure discrete and pure continuous cases. Choosing G = (−∞, y] yields the cdf F_Y(y).

Continuing the analogy, Bayes' rule suggests a conditional pmf:

    p_{X|Y}(x|y) = f_{Y|X}(y|x) p_X(x) / f_Y(y) = f_{Y|X}(y|x) p_X(x) / Σ_α p_X(α) f_{Y|X}(y|α),

but this is not an elementary conditional probability: the conditioning event has probability 0!

It can be justified in a similar way to conditional pdfs:

    Pr(X ∈ F and Y ∈ G) = Σ_{x∈F} p_X(x) ∫_G f_{Y|X}(y|x) dy
                        = ∫_G dy f_Y(y) Σ_{x∈F} p_{X|Y}(x|y)
                        = ∫_G dy f_Y(y) Pr(X ∈ F | Y = y)
so that p_{X|Y}(x|y) satisfies

    Pr(X ∈ F | Y = y) = Σ_{x∈F} p_{X|Y}(x|y)

Binary detection in Gaussian noise

Apply this to a binary input and Gaussian noise: the conditional pmf of the binary input given the noisy observation is

    p_{X|Y}(x|y) = f_W(y − x) p_X(x) / f_Y(y) = f_W(y − x) p_X(x) / Σ_α p_X(α) f_W(y − α); y ∈ R, x ∈ {0, 1}

The derivation of the MAP detector or classifier extends immediately to a binary input random variable and independent Gaussian noise. As in the purely discrete case, the MAP detector X̂(y) of X given Y = y is

    X̂(y) = argmax_x p_{X|Y}(x|y) = argmax_x f_W(y − x) p_X(x) / Σ_α p_X(α) f_W(y − α)

The denominator of the conditional pmf does not depend on x, so it has no effect on the maximization:

    X̂(y) = argmax_x p_{X|Y}(x|y) = argmax_x f_W(y − x) p_X(x)

Can now solve classical binary detection in Gaussian noise. Assume for simplicity that X is equally likely to be 0 or 1:

    X̂(y) = argmax_x p_{X|Y}(x|y) = argmax_x (2πσ²_W)^{−1/2} exp{ −(x − y)²/2σ²_W } = argmin_x |x − y|

Minimum distance or nearest neighbor decision: choose the x closest to y ⇒

    X̂(y) = 0 if y < 0.5; 1 if y > 0.5

A threshold detector.
Error probability of the optimal detector:

    P_e = Pr(X̂(Y) ≠ X)
        = Pr(X̂(Y) ≠ 0 | X = 0) p_X(0) + Pr(X̂(Y) ≠ 1 | X = 1) p_X(1)
        = Pr(Y > 0.5 | X = 0) p_X(0) + Pr(Y < 0.5 | X = 1) p_X(1)
        = Pr(W + X > 0.5 | X = 0) p_X(0) + Pr(W + X < 0.5 | X = 1) p_X(1)
        = Pr(W > 0.5 | X = 0) p_X(0) + Pr(W + 1 < 0.5 | X = 1) p_X(1)
        = Pr(W > 0.5) p_X(0) + Pr(W < −0.5) p_X(1)

using the independence of W and X. In terms of the Φ function:

    P_e = ½ [1 − Φ(0.5/σ_W)] + ½ Φ(−0.5/σ_W) = Φ(−1/(2σ_W))
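A sketch evaluating the threshold detector's error probability both from the Φ formula and by simulation; σ_W = 0.4 is an assumed value.

```python
import math
import numpy as np

def Phi(z):                          # standard Gaussian cdf
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

sw = 0.4                             # assumed noise std sigma_W
print("theory    P_e =", Phi(-1.0 / (2.0 * sw)))

# Monte Carlo: equiprobable X in {0,1}, W ~ N(0, sw^2), threshold at 0.5.
rng = np.random.default_rng(9)
n = 2_000_000
X = rng.integers(0, 2, n)
Y = X + rng.normal(0.0, sw, n)
Xhat = (Y > 0.5).astype(int)
print("empirical P_e =", np.mean(Xhat != X))
```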
Statistical estimation

In detection/classification problems, the goal is to guess which of a discrete set of possibilities is true; the MAP rule is an intuitive solution. Things differ if (X, Y) is continuous: observe Y and guess X. Examples:

X, W independent Gaussian, Y = X + W. What is the best guess of X given Y?

{Xn} is a continuous-alphabet random process (perhaps Gaussian). Observe X_{n−1}; what is the best guess for Xn? What if we observe X0, X1, X2, ..., X_{n−1}?

The quality criterion for the discrete case no longer works: Pr(X̂(Y) = X) = 0 in general.

Will later introduce another quality measure (MSE) and optimize. For now, mention other approaches. These are examples of estimation or regression instead of detection.

MAP Estimation

Mimic MAP detection: maximize the conditional probability function

    X̂_MAP(y) = argmax_x f_{X|Y}(x|y)

Easy to describe, an application of conditional pdfs + Bayes. But one cannot argue that it is "optimal" in the sense of maximizing quality.

Example: Gaussian signal plus noise. We found f_{X|Y}(x|y) = Gaussian with mean yσ²_X/(σ²_X + σ²_W). A Gaussian pdf is maximized at its mean ⇒ the MAP estimate of X given Y = y is the conditional mean yσ²_X/(σ²_X + σ²_W).

Maximum Likelihood Estimation

The maximum likelihood (ML) estimate of X given Y = y is the value of x that maximizes the conditional pdf f_{Y|X}(y|x) (instead of the a posteriori pdf f_{X|Y}(x|y)):

    X̂_ML(y) = argmax_x f_{Y|X}(y|x)

Advantage: no need to know the prior f_X and use Bayes to find f_{X|Y}(x|y). Simple.

In the Gaussian case, X̂_ML(y) = y.

Will return to estimation when we consider expectations in more detail.
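A sketch comparing the MAP and ML estimates for the Gaussian signal-plus-noise model, with assumed variances.

```python
sx2, sw2 = 2.0, 1.0          # assumed signal and noise variances

def x_map(y):
    # MAP estimate = posterior mean for Gaussian signal plus Gaussian noise.
    return y * sx2 / (sx2 + sw2)

def x_ml(y):
    # ML maximizes f_{Y|X}(y|x) = N(x, sw2) over x, giving x = y.
    return y

for y in (-1.0, 0.5, 2.0):
    print(f"y={y:+.1f}: MAP={x_map(y):+.3f}, ML={x_ml(y):+.3f}")
```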
Characteristic functions

When summing independent random variables, the derived distribution is found by convolution of pmfs or pdfs. This can be complicated, but is avoidable using transforms, as in linear systems.

Summing independent random variables arises frequently in signal analysis problems. E.g., an iid random process {Xk} is put into a linear filter to produce the output Yn = Σ_{k=1}^{n} h_{n−k} Xk. What is the distribution of Yn? The n-fold convolution is a mess. Describe a shortcut.

Transforms of probability functions are called characteristic functions, a variation on Fourier/Laplace transforms. Notation varies.

For a discrete rv with pmf p_X, define the characteristic function M_X by

    M_X(ju) = Σ_x p_X(x) e^{jux}

where u is usually assumed to be real. A discrete exponential transform. Sometimes φ or Φ is used, and sometimes the j is not included (∼ notational differences in Fourier transforms).

Alternative useful form: recall the definition of the expectation of a random variable g defined on a discrete probability space described by a pmf p: E(g) = Σ_ω p(ω) g(ω).

Consider the probability space (Ω_X, B(Ω_X), P_X) with P_X described by the pmf p_X. This is the directly-given representation for the rv X: X is the identity function on Ω_X, X(x) = x.

Define the random variable g(X) on this space by g(X)(x) = e^{jux}. Then E[g(X)] = Σ_x p_X(x) e^{jux}, so that

    M_X(ju) = E[e^{juX}]

Characteristic functions, like probabilities, can be viewed as special cases of expectation.

M_X resembles the discrete-time Fourier transform

    F_ν(p_X) = Σ_x p_X(x) e^{−j2πνx}

and the z-transform

    Z_z(p_X) = Σ_x p_X(x) z^x:

    M_X(ju) = F_{−u/2π}(p_X) = Z_{e^{ju}}(p_X)

Properties of characteristic functions follow from those of Fourier/Laplace/z/exponential transforms.
Can recover the pmf from M_X by suitable inversion. E.g., given p_X(k); k ∈ Z_N:

    (1/2π) ∫_{−π}^{π} M_X(ju) e^{−juk} du = Σ_x p_X(x) (1/2π) ∫_{−π}^{π} e^{ju(x−k)} du = Σ_x p_X(x) δ_{k−x} = p_X(k)

But one usually inverts by inspection or from tables, avoiding inverse transforms.

Characteristic functions and summing independent rvs

Two independent random variables X, W with pmfs p_X, p_W and characteristic functions M_X, M_W; Y = X + W.

To find the characteristic function of Y,

    M_Y(ju) = Σ_y p_Y(y) e^{juy},

use the inverse image formula

    p_Y(y) = Σ_{x,w: x+w=y} p_{X,W}(x, w)
to obtain

    M_Y(ju) = Σ_y [ Σ_{x,w: x+w=y} p_{X,W}(x, w) ] e^{juy}
            = Σ_y Σ_{x,w: x+w=y} p_{X,W}(x, w) e^{ju(x+w)}
            = Σ_{x,w} p_{X,W}(x, w) e^{ju(x+w)} = Σ_{x,w} p_X(x) p_W(w) e^{jux} e^{juw}

The last sum factors:

    M_Y(ju) = ( Σ_x p_X(x) e^{jux} )( Σ_w p_W(w) e^{juw} ) = M_X(ju) M_W(ju)

⇒ the transform of the pmf of the sum of independent random variables is the product of their transforms.

Iterate:

Theorem 1. If {Xi; i = 1, ..., N} are independent random variables with characteristic functions M_{Xi}, then the characteristic function of the random variable Y = Σ_{i=1}^{N} Xi is

    M_Y(ju) = ∏_{i=1}^{N} M_{Xi}(ju)

If the Xi are independent and identically distributed with common characteristic function M_X, then

    M_Y(ju) = M_X(ju)^N
Example: X Bernoulli with parameter p = p_X(1) = 1 - p_X(0):

M_X(ju) = \sum_{k=0}^{1} e^{juk} p_X(k) = (1-p) + p e^{ju}

{X_i; i = 1,...,n} iid Bernoulli random variables, Y_n = \sum_{i=1}^{n} X_i, then

M_{Y_n}(ju) = [(1-p) + p e^{ju}]^n

⇒ with binomial theorem

p_{Y_n}(k) = \binom{n}{k} (1-p)^{n-k} p^k ; \quad k ∈ Z_{n+1}.

Uniqueness of transforms

As a check, compute the transform of the binomial pmf directly:

M_{Y_n}(ju) = \sum_{k=0}^{n} p_{Y_n}(k) e^{juk} = \sum_{k=0}^{n} \binom{n}{k} (1-p)^{n-k} p^k e^{juk} = ((1-p) + p e^{ju})^n.

Matching a transform to a known pmf is legitimate because a transform uniquely determines the pmf. Will later see simple and general proof.
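A numerical check of the inversion formula and of uniqueness (our sketch, with hypothetical n and p): recover the binomial pmf from its transform by averaging over one period.

```python
import numpy as np
from math import comb

n, p = 8, 0.3                      # hypothetical parameters
u = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
M = ((1 - p) + p * np.exp(1j * u)) ** n   # transform of Y_n

for k in range(n + 1):
    # (1/2pi) * integral over [-pi, pi] of M(ju) e^{-juk} du, as a mean over one period
    est = (M * np.exp(-1j * u * k)).mean().real
    exact = comb(n, k) * (1 - p) ** (n - k) * p ** k
    assert abs(est - exact) < 1e-10
```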
Same idea works for continuous rvs ⇒ For a continuous random variable X with pdf f_X, define the characteristic function M_X of the random variable (or of the pdf) as in the discrete case:

M_X(ju) = E[e^{juX}] = \int f_X(x) e^{jux} \, dx.

Relates to the continuous-time Fourier transform

F_\nu(f_X) = \int f_X(x) e^{-j2\pi\nu x} \, dx

and the Laplace transform

L_s(f_X) = \int f_X(x) e^{-sx} \, dx

by M_X(ju) = F_{-u/2\pi}(f_X) = L_{-ju}(f_X).

Hence can apply results from Fourier/Laplace transform theory. E.g., given a well-behaved density f_X(x); x ∈ R ↔ M_X(ju), can invert transform:

f_X(x) = (1/2\pi) \int_{-\infty}^{\infty} M_X(ju) e^{-jux} \, du.

Consider again two independent random variables, X and W, with pdfs f_X and f_W and characteristic functions M_X and M_W, and Y = X + W. Paralleling the discrete case,

M_Y(ju) = M_X(ju) M_W(ju).
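The continuous inversion formula can also be checked numerically; a sketch of ours, using M_X(ju) = e^{-u^2/2} for N(0,1) (this transform is derived just below):

```python
import numpy as np

# Invert the N(0, 1) characteristic function numerically on a wide grid
u = np.linspace(-40.0, 40.0, 8001)
M = np.exp(-u ** 2 / 2)

for x in (0.0, 0.5, 2.0):
    f_est = np.trapz(M * np.exp(-1j * u * x), u).real / (2 * np.pi)
    f_exact = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
    assert abs(f_est - f_exact) < 1e-8   # matches the Gaussian density
```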
As in the discrete case, iterating gives result for many independent rvs: If {X_i; i = 1,...,N} are independent random variables with characteristic functions M_{X_i}, then the characteristic function of the random variable Y = \sum_{i=1}^{N} X_i is

M_Y(ju) = \prod_{i=1}^{N} M_{X_i}(ju).

If the X_i are independent and identically distributed with common characteristic function M_X, then M_Y(ju) = M_X(ju)^N.

Summing Independent Gaussian rvs

X ∼ N(m, σ²). Characteristic function found by completing the square:

E(e^{juX}) = \int_{-\infty}^{\infty} (2\pi\sigma^2)^{-1/2} e^{-(x-m)^2/2\sigma^2} e^{jux} \, dx
= \int_{-\infty}^{\infty} (2\pi\sigma^2)^{-1/2} e^{-(x^2 - 2mx - 2\sigma^2 jux + m^2)/2\sigma^2} \, dx
= e^{jum - u^2\sigma^2/2} \int_{-\infty}^{\infty} (2\pi\sigma^2)^{-1/2} e^{-(x-(m+ju\sigma^2))^2/2\sigma^2} \, dx
= e^{jum - u^2\sigma^2/2}.

Thus N(m, σ²) ↔ e^{jum - u^2\sigma^2/2}.
{X_i; i = 1,...,n} iid Gaussian random variables with pdfs N(m, σ²), Y_n = \sum_{k=1}^{n} X_k ⇒

M_{Y_n}(ju) = [\, e^{jum - u^2\sigma^2/2} \,]^n = e^{ju(nm) - u^2(n\sigma^2)/2}

= characteristic function of N(nm, nσ²).

Moral: Use characteristic functions to derive distributions of sums of independent rvs.
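A Monte Carlo sanity check of this moral (ours, with arbitrary parameters): the empirical mean and variance of Y_n should approach nm and nσ².

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma, n, trials = 1.0, 2.0, 10, 200_000   # hypothetical parameters

# Each row is one realization of (X_1, ..., X_n); sum the rows to get Y_n
Y = rng.normal(m, sigma, size=(trials, n)).sum(axis=1)

print(Y.mean(), Y.var())            # ~ n*m = 10 and n*sigma^2 = 40
assert abs(Y.mean() - n * m) < 0.1
assert abs(Y.var() - n * sigma**2) < 1.0
```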
Gaussian random vectors

A random vector is Gaussian if its density is Gaussian; component rvs are jointly Gaussian. Description is complicated, but many nice properties. Multidimensional characteristic functions help derivation. Random vector X = (X_0,...,X_{n-1}), vector argument u = (u_0,...,u_{n-1}).
n-dimensional characteristic function:

M_X(ju) = M_{X_0,...,X_{n-1}}(ju_0,...,ju_{n-1}) = E\left[\exp\left(j \sum_{k=0}^{n-1} u_k X_k\right)\right] = E\left[e^{ju^t X}\right]

Can be shown using multivariable calculus: Gaussian rv with mean vector m and covariance matrix Λ has characteristic function

M_X(ju) = e^{ju^t m - u^t \Lambda u / 2} = \exp\left(j \sum_{k=0}^{n-1} u_k m_k - \frac{1}{2} \sum_{k=0}^{n-1} \sum_{l=0}^{n-1} u_k \Lambda(k,l) u_l\right)

Same basic form as Gaussian pdf, but depends directly on Λ, not Λ^{-1}. So exists more generally, only need Λ to be nonnegative definite (instead of strictly positive definite). Define Gaussian rv more generally as a rv having a characteristic function of this form (inverse transform will have singularities).
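The closed form can be checked against a Monte Carlo estimate of E[e^{ju^t X}]; a sketch of ours with an arbitrary mean vector and covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
m = np.array([1.0, -2.0])
L = np.array([[2.0, 0.5],
              [0.5, 1.0]])        # covariance matrix Lambda (positive definite)
u = np.array([0.3, -0.7])

closed_form = np.exp(1j * u @ m - u @ L @ u / 2)

X = rng.multivariate_normal(m, L, size=500_000)   # rows are samples of X
mc = np.exp(1j * X @ u).mean()                    # Monte Carlo E[e^{j u.X}]

print(closed_form, mc)            # the two should agree to ~1e-2
assert abs(closed_form - mc) < 1e-2
```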
Further examples of random processes

Have seen two ways to define rps: indirectly in terms of an underlying probability space or directly (Kolmogorov representation) by describing a consistent family of joint distributions (via pmfs, pdfs, or cdfs). Used to define discrete time iid processes and processes which can be constructed from iid processes by coding or filtering.

Introduce more classes of processes and develop some properties for various examples. In particular: Gaussian random processes and Markov processes.

Gaussian random processes

A random process {X_t; t ∈ T} is Gaussian if the random vectors (X_{t_0}, X_{t_1},..., X_{t_{k-1}}) are Gaussian for all positive integers k and all possible sample times t_i ∈ T; i = 0, 1,..., k-1.

Works for continuous and discrete time. Consistent family? Yes if all mean vectors and covariance matrices are drawn from a common mean function m(t); t ∈ T and covariance function Λ(t, s); t, s ∈ T; i.e., for any choice of sample times t_0,..., t_{k-1} the random vector (X_{t_0}, X_{t_1},..., X_{t_{k-1}}) is Gaussian with mean (m(t_0), m(t_1),..., m(t_{k-1})) and covariance matrix Λ = { Λ(t_l, t_j); l, j ∈ Z_k }.

Gaussian random processes in both discrete and continuous time are extremely common in analysis of random systems and have many nice properties.

Discrete time Markov processes

An iid process is memoryless because present independent of past. A Markov process allows dependence on the past in a structured way. Introduce via example.
A binary Markov process

{X_n; n = 0, 1,...} is a Bernoulli process with

p_{X_n}(x) = \begin{cases} p & x = 1 \\ 1-p & x = 0 \end{cases}

with p ∈ (0, 1) a fixed parameter. Since the pmf p_{X_n}(x) does not depend on n, abbreviate to p_X:

p_X(x) = p^x (1-p)^{1-x}; \quad x = 0, 1.

Since process iid,

p_{X^n}(x^n) = \prod_{i=0}^{n-1} p_X(x_i) = p^{w(x^n)} (1-p)^{n - w(x^n)},

where w(x^n) = Hamming weight of the binary vector x^n.

Let {X_n} be input to a device which produces an output binary process {Y_n} defined by

Y_n = \begin{cases} Y_0 & n = 0 \\ X_n \oplus Y_{n-1} & n = 1, 2,... \end{cases}

where Y_0 is a binary equiprobable random variable ( p_{Y_0}(0) = p_{Y_0}(1) = 0.5 ), independent of all of the X_n, and addition ⊕ is mod 2 (a linear filter using mod 2 arithmetic).
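A direct simulation of this construction (our sketch; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
p, N = 0.1, 30                       # small p => long runs of 0s and 1s

X = (rng.random(N) < p).astype(int)  # iid Bernoulli(p) inputs X_1, ..., X_N
Y = np.empty(N + 1, dtype=int)
Y[0] = rng.integers(0, 2)            # equiprobable, independent Y_0
for n in range(1, N + 1):
    Y[n] = X[n - 1] ^ Y[n - 1]       # mod 2 addition (XOR)

print("".join(map(str, Y)))          # long runs appear when p is small
```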
Alternatively:

Y_n = \begin{cases} 1 & \text{if } X_n \neq Y_{n-1} \\ 0 & \text{if } X_n = Y_{n-1} \end{cases}

This process is called a binary autoregressive process. As will be seen, it is also called the symmetric binary Markov process.

Unlike X_n, Y_n depends strongly on past values. If p < 1/2, Y_n is more likely to equal Y_{n-1} than not. If p is small, Y_n is likely to have long runs of 0s and 1s.

Task: Find joint pmfs for new process: p_{Y^n}(y^n) = Pr(Y^n = y^n).

Use inverse image formula:

p_{Y^n}(y^n) = Pr(Y_0 = y_0, Y_1 = y_1, Y_2 = y_2, ..., Y_{n-1} = y_{n-1})
= Pr(Y_0 = y_0, X_1 \oplus Y_0 = y_1, X_2 \oplus Y_1 = y_2, ..., X_{n-1} \oplus Y_{n-2} = y_{n-1})
= Pr(Y_0 = y_0, X_1 = y_1 \oplus y_0, X_2 = y_2 \oplus y_1, ..., X_{n-1} = y_{n-1} \oplus y_{n-2})
= p_{Y_0}(y_0) \prod_{i=1}^{n-1} p_X(y_i \oplus y_{i-1}).

Used the facts that (1) a ⊕ b = c iff a = b ⊕ c, (2) Y_0, X_1, X_2, ..., X_{n-1} mutually independent, and (3) X_n are iid.

Plug in specific forms of p_{Y_0} and p_X:

p_{Y^n}(y^n) = \frac{1}{2} \prod_{i=1}^{n-1} p^{y_i \oplus y_{i-1}} (1-p)^{1 - (y_i \oplus y_{i-1})}.

Note: Would not be the same with different initialization, e.g., Y_0 = 0.

Marginal pmfs for Y_n evaluated by summing out the joints (total probability), e.g.,

p_{Y_1}(y_1) = \sum_{y_0} p_{Y_0,Y_1}(y_0, y_1) = \sum_{y_0} \frac{1}{2} p^{y_1 \oplus y_0} (1-p)^{1 - (y_1 \oplus y_0)} = \frac{1}{2}; \quad y_1 = 0, 1.

In a similar fashion it can be shown that the marginals for Y_n are all the same:

p_{Y_n}(y) = \frac{1}{2}; \quad y = 0, 1; \ n = 0, 1, 2, ...

Hence drop subscript and abbreviate pmf to p_Y.

Unlike the iid {X_n} process,

p_{Y^n}(y^n) \neq \prod_{i=0}^{n-1} p_Y(y_i) \quad \text{(provided } p \neq 1/2\text{)}

⇒ {Y_n} not iid.

Joint not product of marginals, but can use chain rule with conditional probabilities to write as product of conditional pmfs, given by

p_{Y_l | Y_0, Y_1, ..., Y_{l-1}}(y_l | y_0, y_1, ..., y_{l-1}) = \frac{p_{Y^{l+1}}(y^{l+1})}{p_{Y^l}(y^l)} = p_X(y_l \oplus y_{l-1}).
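The marginal and conditional formulas just derived can be confirmed empirically; a sketch of ours: across many simulated paths, Pr(Y_n = 1) ≈ 1/2 and Pr(Y_l ≠ Y_{l-1}) = p_X(1) ≈ p.

```python
import numpy as np

rng = np.random.default_rng(3)
p, N, paths = 0.2, 50, 100_000

X = (rng.random((paths, N)) < p).astype(int)
Y = np.empty((paths, N + 1), dtype=int)
Y[:, 0] = rng.integers(0, 2, size=paths)
for n in range(1, N + 1):
    Y[:, n] = X[:, n - 1] ^ Y[:, n - 1]

marginal = Y[:, N].mean()                  # P(Y_N = 1) ~ 1/2
flips = (Y[:, 1:] != Y[:, :-1]).mean()     # P(Y_l != Y_{l-1}) ~ p
print(marginal, flips)
assert abs(marginal - 0.5) < 0.01 and abs(flips - p) < 0.01
```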
Note: Conditional probability of current output Y_l given entire past Y_i; i = 0, 1, ..., l-1 depends only on the most recent past output Y_{l-1}! This property can be summarized nicely by also deriving the conditional pmf

p_{Y_l | Y_{l-1}}(y_l | y_{l-1}) = \frac{p_{Y_{l-1}, Y_l}(y_{l-1}, y_l)}{p_{Y_{l-1}}(y_{l-1})} = p^{y_l \oplus y_{l-1}} (1-p)^{1 - (y_l \oplus y_{l-1})}

⇒

p_{Y_l | Y_0, Y_1, ..., Y_{l-1}}(y_l | y_0, y_1, ..., y_{l-1}) = p_{Y_l | Y_{l-1}}(y_l | y_{l-1}).

A discrete time random process with this property is called a Markov process or Markov chain. By definition, the binary autoregressive process is a Markov process!

The binomial counting process

Next filter binary Bernoulli process using ordinary arithmetic. {X_n} iid binary random process with marginal pmf p_X(1) = p = 1 - p_X(0).

Y_n = \begin{cases} 0 & n = 0 \\ \sum_{k=1}^{n} X_k = Y_{n-1} + X_n & n = 1, 2, ... \end{cases}

Y_n = output of a discrete time time-invariant linear filter with Kronecker delta response h_k given by h_k = 1 for k ≥ 0 and h_k = 0 otherwise.

Y_n = Y_{n-1} or Y_n = Y_{n-1} + 1; n = 2, 3, ...

A discrete time process with this property is called a counting process. Will later see a continuous time counting process which also can only increase by 1.

Already found marginal pmf p_{Y_n}(k) using transforms to be binomial ⇒ binomial counting process.

To completely describe this process need a formula for the joint pmfs

p_{Y_1,...,Y_n}(y_1, ..., y_n) = p_{Y_1}(y_1) \prod_{l=2}^{n} p_{Y_l | Y_1,...,Y_{l-1}}(y_l | y_1, ..., y_{l-1}).

Find conditional pmfs, which imply joints via chain rule:

Pr(Y_n = y_n | Y_l = y_l; l = 1, ..., n-1)
= Pr(X_n = y_n - y_{n-1} | Y_l = y_l; l = 1, ..., n-1)
= Pr(X_n = y_n - y_{n-1} | X_1 = y_1, X_i = y_i - y_{i-1}; i = 2, 3, ..., n-1)

Follows since the conditioning event {Y_i = y_i; i = 1, 2, ..., n-1} is the event {X_1 = y_1, X_i = y_i - y_{i-1}; i = 2, 3, ..., n-1} and, given this event, the event Y_n = y_n is the event X_n = y_n - y_{n-1}. Thus

p_{Y_n | Y_{n-1},...,Y_1}(y_n | y_{n-1}, ..., y_1) = p_{X_n | X_{n-1},...,X_2,X_1}(y_n - y_{n-1} | y_{n-1} - y_{n-2}, ..., y_2 - y_1, y_1).
X_n iid ⇒

p_{Y_n | Y_{n-1},...,Y_1}(y_n | y_{n-1}, ..., y_1) = p_X(y_n - y_{n-1}).

Conditioning event depends only on values of X_k for k < n, hence

p_{Y_n | Y_{n-1}}(y_n | y_{n-1}) = p_X(y_n - y_{n-1}) ⇒ {Y_n} is Markov.

Hence chain rule + definition y_0 = 0 ⇒

p_{Y_1,...,Y_n}(y_1, ..., y_n) = \prod_{i=1}^{n} p_X(y_i - y_{i-1}).

For binomial counting process, use Bernoulli p_X:

p_{Y_1,...,Y_n}(y_1, ..., y_n) = \prod_{i=1}^{n} p^{(y_i - y_{i-1})} (1-p)^{1 - (y_i - y_{i-1})},

where y_i - y_{i-1} = 0 or 1, i = 1, 2, ..., n; y_0 = 0.

Similar derivation works for sum of iid rvs with any pmf p_X to show that p_{Y_n | Y_{n-1},...,Y_1}(y_n | y_{n-1}, ..., y_1) = p_{Y_n | Y_{n-1}}(y_n | y_{n-1}) or, equivalently,

Pr(Y_n = y_n | Y_i = y_i; i = 1, ..., n-1) = Pr(Y_n = y_n | Y_{n-1} = y_{n-1}) = Pr(X_n = y_n - y_{n-1} | Y_{n-1} = y_{n-1})

⇒ Markov.
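A short simulation of the binomial counting process (ours, with arbitrary parameters), checking the binomial marginal found earlier by transforms:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(4)
p, n, paths = 0.3, 12, 200_000

X = (rng.random((paths, n)) < p).astype(int)
Y = X.cumsum(axis=1)                  # Y_k = X_1 + ... + X_k, a counting process

# Empirical pmf of Y_n vs binomial(n, p)
emp = np.bincount(Y[:, -1], minlength=n + 1) / paths
exact = np.array([comb(n, k) * (1 - p)**(n - k) * p**k for k in range(n + 1)])
print(np.abs(emp - exact).max())      # small, on the order of 1/sqrt(paths)
```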
Discrete random walk

Slight variation: Let X_n be binary iid with alphabet {1, -1} and Pr(X_n = -1) = p.

Y_n = \begin{cases} 0 & n = 0 \\ \sum_{k=1}^{n} X_k & n = 1, 2, ... \end{cases}

Also has autoregressive format: Y_n = Y_{n-1} + X_n, n = 1, 2, ...

Transform of the iid random variables is

M_X(ju) = (1-p) e^{ju} + p e^{-ju},

so with binomial theorem ⇒

M_{Y_n}(ju) = ((1-p) e^{ju} + p e^{-ju})^n
= \sum_{k=0}^{n} \binom{n}{k} (1-p)^{n-k} p^k e^{ju(n-2k)}
= \sum_{k=-n,-n+2,...,n-2,n} \binom{n}{(n-k)/2} (1-p)^{(n+k)/2} p^{(n-k)/2} e^{juk}

⇒

p_{Y_n}(k) = \binom{n}{(n-k)/2} (1-p)^{(n+k)/2} p^{(n-k)/2}, \quad k = -n, -n+2, ..., n-2, n.
Note that Y_n must be even or odd depending on whether n is even or odd. This follows from the nature of the increments.
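The walk is a one-line change to the counting-process simulation (again a sketch of ours): map the inputs to ±1 before summing; the parity fact is then easy to confirm.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, paths = 0.5, 11, 100_000

steps = np.where(rng.random((paths, n)) < p, -1, 1)   # Pr(X = -1) = p
Y = steps.cumsum(axis=1)

# Y_n and n always have the same parity (n = 11 is odd here)
assert ((Y[:, -1] - n) % 2 == 0).all()
print(np.unique(Y[:, -1]))   # only odd values -11, -9, ..., 9, 11 appear
```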
The discrete time Wiener process

{X_n} iid N(0, σ²). As with the counting process, define

Y_n = \begin{cases} 0 & n = 0 \\ \sum_{k=1}^{n} X_k & n = 1, 2, ... \end{cases}

the discrete time Wiener process. Handle in essentially the same way, but use cdfs and then pdfs. Previously found marginal f_{Y_n} using transforms to be N(0, nσ²).

To find the joint pdfs use conditional pdfs and chain rule
⇒

f_{Y_1,...,Y_n}(y_1, ..., y_n) = \prod_{l=1}^{n} f_{Y_l | Y_1,...,Y_{l-1}}(y_l | y_1, ..., y_{l-1}),

where the l = 1 factor is f_{Y_1}(y_1).

To find conditional pdf f_{Y_n | Y_1,...,Y_{n-1}}(y_n | y_1, ..., y_{n-1}), first find conditional cdf P(Y_n ≤ y_n | Y_{n-i} = y_{n-i}; i = 1, 2, ..., n-1). Analogous to the discrete case:

P(Y_n \le y_n | Y_{n-i} = y_{n-i}; i = 1, 2, ..., n-1)
= P(X_n \le y_n - y_{n-1} | Y_{n-i} = y_{n-i}; i = 1, 2, ..., n-1)
= P(X_n \le y_n - y_{n-1})
= F_X(y_n - y_{n-1}).

Differentiating the conditional cdf to obtain the conditional pdf:

f_{Y_n | Y_1,...,Y_{n-1}}(y_n | y_1, ..., y_{n-1}) = \frac{\partial F_X(y_n - y_{n-1})}{\partial y_n} = f_X(y_n - y_{n-1}),

and the pdf chain rule ⇒

f_{Y_1,...,Y_n}(y_1, ..., y_n) = \prod_{i=1}^{n} f_X(y_i - y_{i-1}).
If f_X = N(0, σ²) pdf,

f_{Y_n | Y_1,...,Y_{n-1}}(y_n | y_1, ..., y_{n-1}) = \frac{\exp\left(-(y_n - y_{n-1})^2 / 2\sigma^2\right)}{\sqrt{2\pi\sigma^2}}

and hence

f_{Y^n}(y^n) = \frac{e^{-y_1^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}} \prod_{i=2}^{n} \frac{\exp\left(-(y_i - y_{i-1})^2 / 2\sigma^2\right)}{\sqrt{2\pi\sigma^2}}
= (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\left(\sum_{i=2}^{n} (y_i - y_{i-1})^2 + y_1^2\right)\right).

This is a joint Gaussian pdf with mean vector 0 and covariance matrix K_Y(m, n) = σ² min(m, n), m, n = 1, 2, ...

A similar argument implies that

f_{Y_n | Y_{n-1}}(y_n | y_{n-1}) = f_X(y_n - y_{n-1}).

As in the discrete alphabet case, a process with this property is called a Markov process.

Combine the discrete alphabet and continuous alphabet definitions into a common definition: a discrete time random process {Y_n} is said to be a Markov process if the conditional cdfs satisfy the relation

Pr(Y_n \le y_n | Y_{n-i} = y_{n-i}; i = 1, 2, ...) = Pr(Y_n \le y_n | Y_{n-1} = y_{n-1})

for all y_{n-1}, y_{n-2}, ...

More specifically, such a {Y_n} is frequently called a first-order Markov process because it depends on only the most recent past value. An extended definition to nth-order Markov processes can be made in the obvious fashion.
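Finally, a simulation sketch of ours for the discrete time Wiener process, with an empirical check of the covariance K_Y(m, n) = σ² min(m, n):

```python
import numpy as np

rng = np.random.default_rng(6)
sigma, N, paths = 1.5, 20, 200_000

X = rng.normal(0.0, sigma, size=(paths, N))
Y = X.cumsum(axis=1)                  # Y_n = X_1 + ... + X_n (discrete Wiener)

m, n = 5, 12                          # sample times (1-indexed, as in the notes)
emp = np.mean(Y[:, m - 1] * Y[:, n - 1])   # E[Y_m Y_n]; the process is zero mean
print(emp, sigma**2 * min(m, n))           # ~ sigma^2 * min(m, n) = 11.25
assert abs(emp - sigma**2 * min(m, n)) < 0.2
```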