Numerical Methods for Unconstrained Optimization
Cheng-Liang Chen
PSE
LABORATORY
Department of Chemical Engineering, National Taiwan University
Chen CL
1
Analytical vs. Numerical?

In Analytical Methods, we write the necessary conditions and solve them (analytically or numerically?) for candidate local minimum designs.
Some Difficulties:
The number of design variables in constraints can be large
Functions for the design problem can be highly nonlinear
In many applications, cost and/or constraint functions can be implicit in terms of design variables

Numerical Methods: estimate an initial design and improve it until the optimality conditions are satisfied
Chen CL
2
Unconstrained Optimization
Chen CL
3
General Concepts Related to Numerical Algorithms
A General Algorithm

Current estimate: x^(k)
Subproblem 1: d^(k) : feasible search direction
Subproblem 2: α_k : (positive scalar) step size

⇒ New estimate: x^(k+1) = x^(k) + α_k d^(k) = x^(k) + Δx^(k),   k = 0, 1, 2, ···
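A minimal sketch of this general loop in Python (the helpers `search_direction` and `step_size` are hypothetical placeholders for Subproblems 1 and 2; any of the methods on later slides can be plugged in):

```python
import numpy as np

def general_descent(f, grad, x0, search_direction, step_size,
                    eps=1e-6, max_iter=100):
    """Generic iterative scheme: x(k+1) = x(k) + alpha_k * d(k)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        c = grad(x)                      # gradient c(k) at the current estimate
        if np.linalg.norm(c) < eps:      # optimality (stationarity) check
            break
        d = search_direction(x, c)       # Subproblem 1: search direction d(k)
        alpha = step_size(f, x, d)       # Subproblem 2: step size alpha_k
        x = x + alpha * d                # new estimate
    return x
```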
Chen CL
4
Chen CL
5
Descent Step Idea

current estimate x^(k), new estimate x^(k+1) = x^(k) + α_k d^(k):

f(x^(k)) > f(x^(k+1)) = f(x^(k) + α_k d^(k)) ≈ f(x^(k)) + α_k ∇f(x^(k))^T d^(k) = f(x^(k)) + α_k c^(k) · d^(k)

⇒ α_k c^(k) · d^(k) < 0

∇f(x^(k))^T d^(k) = c^(k) · d^(k) < 0 : descent condition

Angle between c^(k) and d^(k) must be between 90° and 270°
Chen CL
6
Example: check the descent condition

f(x) = x1² − x1x2 + 2x2² − 2x1 + e^(x1+x2)

Verify whether d1 = (1, 2) and d2 = (1, 0) at (0, 0) are descent directions:

c = ∇f = ( 2x1 − x2 − 2 + e^(x1+x2),  −x1 + 4x2 + e^(x1+x2) ) |_(0,0) = (−1, 1)

c · d1 = −1 + 2 = 1 > 0   (not a descent direction)
c · d2 = −1 + 0 = −1 < 0  (a descent direction)
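A quick numerical check of this example (a sketch; the gradient is hand-coded from the expression above):

```python
import numpy as np

def grad_f(x1, x2):
    # gradient of f = x1^2 - x1*x2 + 2*x2^2 - 2*x1 + exp(x1 + x2)
    e = np.exp(x1 + x2)
    return np.array([2*x1 - x2 - 2 + e, -x1 + 4*x2 + e])

c = grad_f(0.0, 0.0)                                   # c = (-1, 1)
for d in (np.array([1.0, 2.0]), np.array([1.0, 0.0])):
    print(d, c @ d, "descent" if c @ d < 0 else "not descent")
```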
Chen CL
7
One-Dimensional Minimization: Reduction to a Function of a Single Variable

Assume: a descent direction d^(k) has been found. Then

f(x) = f(x^(k) + α d^(k)) ≈ f(x^(k)) + α ∇f^T(x^(k)) d^(k) = f̄(α),   with ∇f^T(x^(k)) d^(k) = c · d < 0

f(α) < f(0) = f(x^(k)) : a small move along d^(k) reduces f
f′(0) = c^(k) · d^(k) < 0 ⇒ d should be a descent direction
Chen CL
8
Analytical Method to Compute Step Size

If d^(k) is a descent direction ⇒ α_k > 0 with

df(α_k)/dα = 0,   d²f(α_k)/dα² > 0

0 = df(x^(k+1))/dα = [df(x^(k+1))/dx]^T dx^(k+1)/dα = ∇f^T(x^(k+1)) d^(k) = c^(k+1) · d^(k)

The gradient of the cost function at the NEW point, c^(k+1), is orthogonal to the current search direction d^(k)
Chen CL
9
Example: analytical step size determination

f(x) = 3x1² + 2x1x2 + 2x2² + 7,   d^(k) = (−1, −1) at x^(k) = (1, 2)

c^(k) = ∇f(x^(k)) = ( 6x1 + 2x2, 2x1 + 4x2 ) = (10, 10)
c^(k) · d^(k) = −10 − 10 = −20 < 0   (descent direction)

x^(k+1) = x^(k) + α d^(k) = (1 − α, 2 − α)
f(x^(k+1)) = 3(1 − α)² + 2(1 − α)(2 − α) + 2(2 − α)² + 7 = 7α² − 20α + 22 ≡ f(α)
Chen CL
10
NC: df/dα = 14α_k − 20 = 0 ⇒ α_k = 10/7
d²f/dα² = 14 > 0

x^(k+1) = (1, 2) + (10/7)(−1, −1) = (−3/7, 4/7)

f(x^(k+1)) = 54/7 < 22 = f(x^(k))
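A small numerical check of this result (a sketch; it evaluates f along the given direction and confirms the analytic step size α = 10/7):

```python
import numpy as np

f = lambda x: 3*x[0]**2 + 2*x[0]*x[1] + 2*x[1]**2 + 7
x_k = np.array([1.0, 2.0])
d_k = np.array([-1.0, -1.0])

alpha = 10.0 / 7.0                  # analytic step size from the necessary condition
x_next = x_k + alpha * d_k          # (-3/7, 4/7)
print(f(x_k), f(x_next))            # 22.0 and 54/7 = 7.714...
```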
Chen CL
11
Numerical Methods to Compute Step Size

Most one-dimensional search methods work only for unimodal functions, searching over α_ℓ = 0 ≤ α ≤ ᾱ = α_u (α_u − α_ℓ ≡ the interval of uncertainty)
Chen CL
12
Unimodal Function
Unimodal function: f(x) is a unimodal function (with minimum at x*) if
x1 < x2 < x* implies f(x1) > f(x2), and x* < x3 < x4 implies f(x3) < f(x4)
Chen CL
13
Unimodal Function
Outcome of two experiments at 0 < x1 < x2 < 1, with x* ∈ [0, 1]:

f1 < f2 ⇒ x* ∈ [0, x2]
f1 > f2 ⇒ x* ∈ [x1, 1]
f1 = f2 ⇒ x* ∈ [x1, x2]
Chen CL
14
Equal Interval Search
To successively reduce the interval of uncertainty, I = α_u − α_ℓ, to a small acceptable value.

Evaluate the function at α = 0, δ, 2δ, 3δ, ···
If f((q + 1)δ) < f(qδ), continue
If f((q + 1)δ) > f(qδ), then α_ℓ = (q − 1)δ, α_u = (q + 1)δ, and α* ∈ [α_ℓ, α_u]
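A minimal sketch of this bracketing phase (assuming f(α) is unimodal along the search direction); repeating it with a smaller δ from the left end of the bracket reproduces the refinement phases in the example that follows:

```python
def equal_interval_bracket(f, delta, alpha0=0.0, max_steps=10000):
    """March forward by delta until f increases; return [alpha_l, alpha_u]."""
    f_prev = f(alpha0)
    for q in range(1, max_steps + 1):
        a = alpha0 + q * delta
        fa = f(a)
        if fa > f_prev:                            # first increase of f detected
            return max(a - 2 * delta, alpha0), a   # interval containing alpha*
        f_prev = fa
    raise RuntimeError("minimum was not bracketed")
```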
Chen CL
15
Chen CL
16
Equal Interval Search: Example
f(x) = x(x − 1.5), x* ∈ [0, 1] ⇒ x* ∈ [x7, x8] = [0.7, 0.8]

 i    xi    f(xi)
 1    0.1   −0.14
 2    0.2   −0.26
 3    0.3   −0.36
 4    0.4   −0.44
 5    0.5   −0.50
 6    0.6   −0.54
 7    0.7   −0.56
 8    0.8   −0.56
 9    0.9   −0.54
Chen CL
17
Equal Interval Search: Example

f(α) = 2 − 4α + e^α,   δ = 0.5,   ε = 0.001

Phase 1 (δ = 0.5, start from α = 0):
  trial α:  0.000000  0.500000  1.000000  1.500000  2.000000
  f(α):     3.000000  1.648721  0.718282  0.481689  1.389056
  ⇒ α_ℓ = 1.0, α_u = 2.0

Phase 2 (δ = 0.05, start from α = 1.0):
  trial α:  1.050000  1.100000  1.150000  1.200000  1.250000  1.300000  1.350000  1.400000  1.450000
  f(α):     0.657651  0.604166  0.558193  0.520117  0.490343  0.469297  0.457426  0.455200  0.463115
  ⇒ α_ℓ = 1.35, α_u = 1.45

Phase 3 (δ = 0.005, start from α = 1.35):
  trial α:  1.355000  1.360000  1.365000  1.370000  1.375000  1.380000  1.385000  1.390000
  f(α):     0.456761  0.456193  0.455723  0.455351  0.455077  0.454902  0.454826  0.454850
  ⇒ α_ℓ = 1.380, α_u = 1.390

Phase 4 (δ = 0.0005, start from α = 1.38):
  trial α:  1.380500  1.381000  1.381500  1.382000  1.382500  1.383000  1.383500  1.384000  1.384500  1.385000  1.385500  1.386000  1.386500  1.387000
  f(α):     0.454890  0.454879  0.454868  0.454859  0.454851  0.454844  0.454838  0.454833  0.454829  0.454826  0.454824  0.454823  0.454823  0.454824
  ⇒ α* ≈ 1.386
Chen CL
18
Equal Interval Search: 3 Interior Points

x* ∈ [a, b]; three tests x1, x0, x2 ⇒ three possibilities
Chen CL
19
Equal Interval Search: 2 Interior Points
α_a = α_ℓ + (1/3)(α_u − α_ℓ) = (1/3)(α_u + 2α_ℓ)
α_b = α_ℓ + (2/3)(α_u − α_ℓ) = (1/3)(2α_u + α_ℓ)

Case 1: f(α_a) < f(α_b) ⇒ α_ℓ < α* < α_b
Case 2: f(α_a) > f(α_b) ⇒ α_a < α* < α_u

I′ = (2/3)I : reduced interval of uncertainty
Chen CL
20
Golden Section Search
Problem with Equal Interval Search (n = 2): the known midpoint is not used in the next iteration

Solution: Golden Section Search

Fibonacci sequence: F0 = 1;  F1 = 1;  Fn = Fn−1 + Fn−2, n = 2, 3, ···
1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ···
Fn/Fn−1 → 1.618,   Fn−1/Fn → 0.618   as n → ∞
Chen CL
21
Golden Section Search: Initial Bracketing of Minimum

Starting at α = 0, evaluate

α_q = δ Σ_{j=0}^{q} (1.618)^j = α_{q−1} + (1.618)^q δ,   q = 0, 1, 2, ···

q = 0:  α_0 = δ
q = 1:  α_1 = δ + 1.618δ = 2.618δ;          α_1 − α_0 = 1.618(α_0 − 0)
q = 2:  α_2 = 2.618δ + (1.618)²δ = 5.236δ;  α_2 − α_1 = 1.618(α_1 − α_0)
q = 3:  α_3 = 5.236δ + (1.618)³δ = 9.472δ;  α_3 − α_2 = 1.618(α_2 − α_1)
...
Chen CL
22
Golden Section Search: Initial Bracketing of Minimum

If f(α_{q−2}) > f(α_{q−1}) and f(α_{q−1}) < f(α_q), then α_{q−2} < α* < α_q

α_u = α_q = δ Σ_{j=0}^{q} (1.618)^j
α_ℓ = α_{q−2} = δ Σ_{j=0}^{q−2} (1.618)^j

I = α_u − α_ℓ = (1.618)^q δ + (1.618)^{q−1} δ = 2.618 (1.618)^{q−1} δ
Chen CL
23
Golden Section Search: Reduction of Interval of Uncertainty

Given α_u, α_ℓ, select α_a, α_b such that

α_u − α_a = α_b − α_ℓ = τI,   α_a − α_ℓ = α_u − α_b = (1 − τ)I,   I = α_u − α_ℓ

Suppose f(α_b) > f(α_a) ⇒ α* ∈ [α_ℓ, α_b]; delete [α_b, α_u]:
α_ℓ′ = α_ℓ,  α_u′ = α_b,  α_b′ = α_a
I′ = α_u′ − α_ℓ′ = τI,   α_b′ − α_ℓ′ = τI′,   α_u′ − α_b′ = (1 − τ)I′
Chen CL
24
Golden Section Search: Reduction of Interval of Uncertainty

Require I′ = τI and α_b′ − α_ℓ′ = τI′ = τ(τI), i.e. the same proportions on the reduced interval:

τ² + τ − 1 = 0 ⇒ τ = (−1 + √5)/2 = 0.618 = 1/1.618

α_a − α_ℓ = (1 − τ)I = 0.382I,   α_u − α_a = 0.618I = 1.618 × 0.382I

(α_u − α_a)/(α_a − α_ℓ) = (α_q − α_{q−1})/(α_{q−1} − α_{q−2}) = 0.618I/0.382I = 1.618

⇒ the ratio of increase of the trial step size is 1.618
Chen CL
25
Golden Section Search: Algorithm

Step 1: choose δ; bracket the minimum ⇒ q, α_ℓ = α_{q−2}, α_u = α_q, I = α_u − α_ℓ
Step 2: α_a = α_ℓ + 0.382I, α_b = α_ℓ + 0.618I; compute f(α_a), f(α_b)
Step 3: compare f(α_a), f(α_b); go to Step 4, 5, or 6
Step 4: if f(α_a) < f(α_b) ⇒ α_ℓ < α* < α_b:
        α_ℓ = α_ℓ, α_u = α_b, α_b = α_a, α_a = α_ℓ + 0.382(α_u − α_ℓ); go to Step 7
Step 5: if f(α_a) > f(α_b) ⇒ α_a < α* < α_u:
        α_ℓ = α_a, α_u = α_u, α_a = α_b, α_b = α_ℓ + 0.618(α_u − α_ℓ); go to Step 7
Step 6: if f(α_a) = f(α_b) ⇒ α_a < α* < α_b:
        α_ℓ = α_a, α_u = α_b; return to Step 2
Step 7: if I′ = α_u − α_ℓ < ε, set α* = (α_u + α_ℓ)/2 and Stop; otherwise return to Step 3
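A compact sketch of Steps 2-7 (it assumes the bracket [α_ℓ, α_u] has already been found, e.g. by the δ(1.618)^q marching above); for f(α) = 2 − 4α + e^α it converges to α* ≈ 1.386, matching the example on the next slide:

```python
import math

def golden_section(f, a_l, a_u, eps=1e-3):
    """Shrink [a_l, a_u] by the factor 0.618 per iteration (Steps 2-7)."""
    tau = (math.sqrt(5.0) - 1.0) / 2.0          # 0.618...
    a_a = a_l + (1.0 - tau) * (a_u - a_l)       # alpha_a = alpha_l + 0.382 I
    a_b = a_l + tau * (a_u - a_l)               # alpha_b = alpha_l + 0.618 I
    f_a, f_b = f(a_a), f(a_b)
    while (a_u - a_l) > eps:
        if f_a < f_b:                           # Step 4: minimum in [a_l, a_b]
            a_u, a_b, f_b = a_b, a_a, f_a
            a_a = a_l + (1.0 - tau) * (a_u - a_l)
            f_a = f(a_a)
        else:                                   # Step 5: minimum in [a_a, a_u]
            a_l, a_a, f_a = a_a, a_b, f_b
            a_b = a_l + tau * (a_u - a_l)
            f_b = f(a_b)
    return 0.5 * (a_l + a_u)                    # Step 7
```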
Chen CL
26
Golden Section Search: Example

f(α) = 2 − 4α + e^α,   δ = 0.5,   ε = 0.001
Chen CL
27
Golden Section Search: Example

f(x) = x(x − 1.5)

Initial Bracketing of Minimum:
 No.  Trial x     f(x)
 1    0.000000    0.000000
 2    0.250000   −0.312500
 3    0.500000   −0.500000   ← X_ℓ = 0.5
 4    0.750000   −0.562500   ← f_min
 5    1.000000   −0.500000   ← X_u = 1.0

Reducing Interval of Uncertainty (each entry: x value, with f(x) in parentheses):
 No. 1: X_ℓ = 0.5000000 (−0.5000000),  X_a = 0.6910000 (−0.5590190),  X_b = 0.8090000 (−0.5590190),  X_u = 1.0000000 (−0.5000000),  I = 0.50000000
 No. 2: X_ℓ = 0.6910000 (−0.5590190),  X_a = 0.7360760 (−0.5623061),  X_b = 0.7639240 (−0.5623061),  X_u = 0.8090000 (−0.5590190),  I = 0.11800000
 No. 3: X_ℓ = 0.7360760 (−0.5623061),  X_a = 0.7469139 (−0.5624892),  X_b = 0.7532861 (−0.5624892),  X_u = 0.7639240 (−0.5623061),  I = 0.02784800
 No. 4: X_ℓ = 0.74691393 (−0.562489202),  X_a = 0.74922448 (−0.562499399),  X_b = 0.75077551 (−0.562499399),  X_u = 0.75328606 (−0.562489202),  I = 0.00657212
 No. 5: X_ℓ = 0.7492244890 (−0.562493399),  X_a = 0.7498169790 (−0.562499967),  X_b = 0.7501830210 (−0.562499967),  X_u = 0.7507755110 (−0.562499399),  I = 0.001551022
 No. 6: X_ℓ = 0.7498467900 (−0.562499966),  X_a = 0.7499566900 (−0.562499998),  X_b = 0.7500431210 (−0.562499998),  X_u = 0.7501830210 (−0.562499967),  I = 0.000366231
Chen CL
28
Polynomial Interpolation: Quadratic Curve Fitting

q(α) = a0 + a1 α + a2 α²   (approximating quadratic function)

f(α_ℓ) = q(α_ℓ) = a0 + a1 α_ℓ + a2 α_ℓ²
f(α_i) = q(α_i) = a0 + a1 α_i + a2 α_i²
f(α_u) = q(α_u) = a0 + a1 α_u + a2 α_u²

⇒ a2 = [ (f(α_u) − f(α_ℓ))/(α_u − α_ℓ) − (f(α_i) − f(α_ℓ))/(α_i − α_ℓ) ] / (α_u − α_i)
   a1 = (f(α_i) − f(α_ℓ))/(α_i − α_ℓ) − a2 (α_ℓ + α_i)
   a0 = f(α_ℓ) − a1 α_ℓ − a2 α_ℓ²

dq(ᾱ)/dα = a1 + 2 a2 ᾱ = 0 ⇒ ᾱ = −a1/(2 a2),   if d²q/dα² = 2 a2 > 0
Chen CL
29
Computational Algorithm:

Step 1: locate the initial interval of uncertainty (α_ℓ, α_u)
Step 2: select α_ℓ < α_i < α_u and compute f(α_i)
Step 3: compute a0, a1, a2, ᾱ, f(ᾱ)
Step 4: compare f(α_i) and f(ᾱ):
  if f(α_i) < f(ᾱ) and α_i < ᾱ ⇒ α* ∈ [α_ℓ, ᾱ];  new points α_ℓ, α_i, ᾱ
  if f(α_i) < f(ᾱ) and ᾱ < α_i ⇒ α* ∈ [ᾱ, α_u];  new points ᾱ, α_i, α_u
  if f(α_i) > f(ᾱ) and α_i < ᾱ ⇒ α* ∈ [α_i, α_u]; new points α_i, ᾱ, α_u
  if f(α_i) > f(ᾱ) and ᾱ < α_i ⇒ α* ∈ [α_ℓ, α_i]; new points α_ℓ, ᾱ, α_i
Step 5: Stop if two successive estimates of the minimum point of f(α) are sufficiently close. Otherwise relabel the new points as α_ℓ, α_i, α_u and return to Step 2
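A sketch of one quadratic-fit step (Steps 2-3), using the a0, a1, a2 formulas from the previous slide; applied to f(α) = 2 − 4α + e^α with (α_ℓ, α_i, α_u) = (0.5, 1.309017, 2.618034) it returns ᾱ ≈ 1.2077, as in the example that follows:

```python
def quadratic_fit_step(f, a_l, a_i, a_u):
    """Fit q(a) = a0 + a1*a + a2*a^2 through three points and return the
    minimizer of the fitted parabola (valid when a2 > 0)."""
    f_l, f_i, f_u = f(a_l), f(a_i), f(a_u)
    a2 = ((f_u - f_l) / (a_u - a_l) - (f_i - f_l) / (a_i - a_l)) / (a_u - a_i)
    a1 = (f_i - f_l) / (a_i - a_l) - a2 * (a_l + a_i)
    return -a1 / (2.0 * a2)
```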
Chen CL
30
Example: f(α) = 2 − 4α + e^α,   δ = 0.5

Iteration 1:  α_ℓ = 0.5, α_i = 1.309017, α_u = 2.618034
f(α_ℓ) = 1.648721,  f(α_i) = 0.466464,  f(α_u) = 5.236610

a2 = (1/1.30902) [ 3.5879/2.1180 − (−1.1823/0.80902) ] = 2.410
a1 = −1.1823/0.80902 − (2.41)(1.80902) = −5.821
a0 = 1.648721 − (−5.821)(0.50) − 2.41(0.25) = 3.957

ᾱ = −a1/(2a2) = 1.2077 < α_i,   f(ᾱ) = 0.5149 > f(α_i)
⇒ α_ℓ′ = ᾱ = 1.2077,  α_i′ = α_i = 1.309017,  α_u′ = α_u = 2.618034

Iteration 2:  α_ℓ = 1.2077, α_i = 1.309017, α_u = 2.618034
f(α_ℓ) = 0.5149,  f(α_i) = 0.466464,  f(α_u) = 5.236610

a2 = 5.3807,  a1 = −7.30547,  a0 = 2.713
ᾱ = 1.3464,  f(ᾱ) = 0.4579
Chen CL
31
Multi-Dimensional Minimization: Powell’s Conjugate Directions Method
Conjugate Directions

Let A be an n × n symmetric matrix. A set of n vectors (directions) {S_i} is said to be A-conjugate if

S_i^T A S_j = 0   for i, j = 1, ···, n;  i ≠ j

Note: orthogonal directions are a special case of conjugate directions (A = I)
Chen CL
32
Multi-Dimensional Minimization: Powell’s Conjugate Directions Method
Quadratically Convergent Method: if a minimization method, using exact arithmetic, can find the minimum point in n steps while minimizing a quadratic function in n variables, the method is called a quadratically convergent method.

Theorem: Given a quadratic function of n variables and two parallel hyperplanes 1 and 2 of dimension k < n, let the constrained stationary points of the quadratic function in the hyperplanes be X_1 and X_2, respectively. Then the line joining X_1 and X_2 is conjugate to any line parallel to the hyperplanes.
Chen CL
33
Multi-Dimensional Minimization: Powell’s Conjugate Directions Method
Proof:

Q(X) = ½ X^T A X + B^T X + C
∇Q(X) = A X + B   (n × 1)

Search from X_a along S ⇒ constrained stationary point X_1
Search from X_b along S ⇒ constrained stationary point X_2
S is orthogonal to ∇Q(X_1) and ∇Q(X_2):

S^T ∇Q(X_1) = S^T A X_1 + S^T B = 0
S^T ∇Q(X_2) = S^T A X_2 + S^T B = 0
⇒ S^T [∇Q(X_1) − ∇Q(X_2)] = S^T A (X_1 − X_2) = 0
Chen CL
34
Multi-Dimensional Minimization: Powell’s Conjugate Directions Method
Meaning: If X_1 and X_2 are the minima of Q obtained by searching along the direction S from two different starting points X_a and X_b, respectively, then the line (X_1 − X_2) will be conjugate to S
Chen CL
35
Multi-Dimensional Minimization: Powell’s Conjugate Directions Method

Theorem: If a quadratic function

Q(X) = ½ X^T A X + B^T X + C

is minimized sequentially, once along each direction of a set of n mutually conjugate directions, the minimum of the function Q will be found at or before the nth step irrespective of the starting point.

Proof: Let X* = X_1 + Σ_{j=1}^{n} β_j S_j, where the S_j are conjugate directions with respect to A.

∇Q(X*) = B + A X* = 0
0 = B + A X_1 + A Σ_{j=1}^{n} β_j S_j

Premultiplying by S_i^T:
0 = S_i^T (B + A X_1) + S_i^T A Σ_{j=1}^{n} β_j S_j = S_i^T (B + A X_1) + β_i S_i^T A S_i

⇒ β_i = − (B + A X_1)^T S_i / (S_i^T A S_i)
Chen CL
36
Multi-Dimensional Minimization: Powell’s Conjugate Directions Method

Note: X_{i+1} = X_i + λ_i* S_i, i = 1, ···, n, where λ_i* is found by minimizing Q(X_i + λ_i S_i), so that

0 = S_i^T ∇Q(X_{i+1})
∇Q(X_{i+1}) = B + A X_{i+1} = B + A (X_i + λ_i* S_i)
0 = S_i^T {B + A (X_i + λ_i* S_i)} = (B + A X_i)^T S_i + λ_i* S_i^T A S_i
⇒ λ_i* = − (B + A X_i)^T S_i / (S_i^T A S_i)

Since X_i = X_1 + Σ_{j=1}^{i−1} λ_j* S_j,
X_i^T A S_i = X_1^T A S_i + Σ_{j=1}^{i−1} λ_j* S_j^T A S_i = X_1^T A S_i

⇒ λ_i* = − (B + A X_i)^T S_i / (S_i^T A S_i) = − (B + A X_1)^T S_i / (S_i^T A S_i) = β_i
Chen CL
37
Powell’s Conjugate Directions: Example

f(x1, x2) = 6x1² + 2x2² − 6x1x2 − x1 − 2x2
          = ½ [x1 x2] [ 12 −6; −6 4 ] [x1; x2] + [−1 −2] [x1; x2]

If S_1 = (1, 2), find S_2 = (s1, s2) such that S_1^T A S_2 = 0:

[1 2] [ 12 −6; −6 4 ] [s1; s2] = [0 2] [s1; s2] = 2 s2 = 0 ⇒ S_2 = (1, 0)

Starting from X_1 = (0, 0):

λ_1* = − (B + A X_1)^T S_1 / (S_1^T A S_1) = − [−1 −2][1; 2] / ( [1 2][ 12 −6; −6 4 ][1; 2] ) = 5/4

X_2 = X_1 + λ_1* S_1 = (0, 0) + (5/4)(1, 2) = (5/4, 5/2)
Chen CL
38
λ_2* = β_2 = − (B + A X_1)^T S_2 / (S_2^T A S_2) = − [−1 −2][1; 0] / ( [1 0][ 12 −6; −6 4 ][1; 0] ) = 1/12

X_3 = X_2 + λ_2* S_2 = (5/4, 5/2) + (1/12)(1, 0) = (4/3, 5/2) = X* (?)
Chen CL
39
Powell’s Algorithm
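A minimal sketch of one common form of the basic Powell cycle (univariate searches along the current direction set, a pattern move, then replacement of the oldest direction by the pattern direction); `minimize_along` is a hypothetical line-search helper returning the optimal step, and this is an interpretation of the algorithm rather than a transcription of the flowchart:

```python
import numpy as np

def powell(f, x0, minimize_along, n_cycles=20, eps=1e-6):
    """Basic Powell cycle with direction replacement."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    dirs = [np.eye(n)[i] for i in range(n)]        # start with coordinate directions
    for _ in range(n_cycles):
        x_start = x.copy()
        for d in dirs:                              # univariate searches
            x = x + minimize_along(f, x, d) * d
        s = x - x_start                             # pattern (conjugate) direction
        if np.linalg.norm(s) < eps:
            break
        x = x + minimize_along(f, x, s) * s         # pattern move
        dirs = dirs[1:] + [s / np.linalg.norm(s)]   # drop oldest, append S
    return x
```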
Chen CL
40
Progress of Powell’s Method
Cycle 1 directions:  u_1, u_2, ···, u_{n−1}, u_n          ⇒ pattern direction S^(1)
Cycle 2 directions:  u_2, u_3, ···, u_n, S^(1)            ⇒ pattern direction S^(2)
Cycle 3 directions:  u_3, ···, u_n, S^(1), S^(2)          ⇒ ···
···
Cycle n directions:  u_n, S^(1), S^(2), ···, S^(n−1)

⇒ (u_n, S^(1)), (S^(1), S^(2)), ···  are A-conjugate
Chen CL
41
Powell’s Conjugate Directions: Example

Min: f(x1, x2) = x1 − x2 + 2x1² + 2x1x2 + x2²,   X_1 = [0 0]^T
Chen CL
42
Cycle 1: Univariate Search

along u_2:  f(X_1 + λ u_2) = f(0, λ) = λ² − λ
  df/dλ = 0 ⇒ λ* = 1/2,   X_2 = X_1 + λ* u_2 = (0, 0.5)

along −u_1:  f(X_2 − λ u_1) = f(−λ, 0.5) = 2λ² − 2λ − 0.25
  df/dλ = 0 ⇒ λ* = 1/2,   X_3 = X_2 − λ* u_1 = (−0.5, 0.5)

along u_2:  f(X_3 + λ u_2) = f(−0.5, 0.5 + λ) = λ² − λ − 0.75
  df/dλ = 0 ⇒ λ* = 1/2,   X_4 = X_3 + λ* u_2 = (−0.5, 1)
Chen CL
43
Cycle 2: Pattern Search

S^(1) = X_4 − X_2 = (−0.5, 1) − (0, 0.5) = (−0.5, 0.5)

f(X_4 + λ S^(1)) = f(−0.5 − 0.5λ, 1 + 0.5λ) = 0.25λ² − 0.5λ − 1
  df/dλ = 0 ⇒ λ* = 1.0

X_5 = X_4 + λ* S^(1) = (−1.0, 1.5)
Chen CL
44
Simplex Method
Chen CL
45
Simplex Method
Chen CL
46
Simplex Method
Chen CL
47
Simplex Method
Chen CL
48
Properties of Gradient Vector
c ≡ ∇f(x^(k)) = [ ∂f/∂x1, ···, ∂f/∂xn ]^T evaluated at x^(k),   i.e.  c_i^(k) = ∂f(x^(k))/∂x_i
Chen CL
49
Property 1: The gradient vector c of a function f(x1, ···, xn) at the point x* = (x1*, ···, xn*) is orthogonal (normal) to the tangent plane of the surface f(x1, ···, xn) = constant.

C is any curve on the surface through x*
T is a vector tangent to curve C at x*
⇒ c · T = 0
Chen CL
50
Proof: let s be any parameter along C; then

T = [ ∂x1/∂s, ···, ∂xn/∂s ]^T at x = x*   (a unit tangent vector along C at x*)

f(x) = constant along C ⇒ df/ds = 0:

0 = df/ds = (∂f/∂x1)(∂x1/∂s) + ··· + (∂f/∂xn)(∂xn/∂s) = c^T T = c · T
Chen CL
51
Property 2: The gradient represents a direction of maximum rate of increase for f(x) at x*

Proof: let u be a unit vector in any direction not tangent to C, and let t be a parameter along u.

df/dt = lim_{ε→0} [ f(x + εu) − f(x) ] / ε
f(x + εu) = f(x) + ε ( u1 ∂f/∂x1 + ··· + un ∂f/∂xn ) + O(ε²)
f(x + εu) − f(x) = ε Σ_{i=1}^{n} u_i ∂f/∂x_i + O(ε²)

df/dt = lim_{ε→0} [ f(x + εu) − f(x) ] / ε = Σ_{i=1}^{n} u_i ∂f/∂x_i = c · u = c^T u = ||c|| ||u|| cos θ

(maximum rate of increase when θ = 0)
Chen CL
52
Property 3: The maximum rate of change of f(x) at any point x* is the magnitude of the gradient vector

max df/dt = ||c||,  attained when u is in the direction of the gradient vector (θ = 0)
Chen CL
53
Verify Properties of Gradient Vector

f(x) = 25x1² + x2²,   x^(0) = (0.6, 4),   f(x^(0)) = 25

c = ∇f = ( 50x1, 2x2 ) ⇒ ∇f(0.6, 4) = (30, 8)

C = c/||c|| = (30, 8)/√(30² + 8²) = (0.966235, 0.257663)

A vector tangent to the curve 25x1² + x2² = 25 at x^(0):  t = (−4, 15)
T = t/||t|| = (−4, 15)/√((−4)² + 15²) = (−0.257663, 0.966235)
Chen CL
54
Property 1:  C · T = 0

Slope of tangent:  m1 = dx2/dx1 = −50x1/(2x2) = 15/(−4) = −3.75
Slope of gradient: m2 = c2/c1 = 8/30 = 1/3.75
⇒ m1 m2 = −1, so the gradient is normal to the tangent

Property 2: choose an arbitrary direction D = (0.501034, 0.865430) and α = 0.1

along C:  x^(1) = x^(0) + αC = (0.6 + 0.1 × 0.966235, 4.0 + 0.1 × 0.257663) = (0.6966235, 4.0257663),  f(x^(1)) = 28.3389
along D:  x^(1) = x^(0) + αD = (0.6501034, 4.0854300),  f(x^(1)) = 27.2566
⇒ the increase of f along the gradient direction C exceeds that along D

Property 3:  C · C = 1.00 > C · D = 0.7071059
⇒ the rate of change along the gradient direction is the largest
Chen CL
55
Steepest Descent Algorithm
Steepest Descent Direction: let f(x) be a differentiable function w.r.t. x. The direction of steepest descent for f(x) at any point is d = −c

Steepest Descent Algorithm:
Step 1: a starting design x^(0), k = 0, ε
Step 2: c^(k) = ∇f(x^(k)); stop if ||c^(k)|| < ε
Step 3: d^(k) = −c^(k)
Step 4: calculate α_k to minimize f(x^(k) + α d^(k))
Step 5: x^(k+1) = x^(k) + α_k d^(k), k = k + 1 ⇒ Step 2
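A direct transcription of the five steps (a sketch; `line_search` stands for any of the one-dimensional methods above):

```python
import numpy as np

def steepest_descent(f, grad, x0, line_search, eps=1e-3, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        c = grad(x)                          # Step 2
        if np.linalg.norm(c) < eps:
            break
        d = -c                               # Step 3: steepest descent direction
        alpha = line_search(f, x, d)         # Step 4
        x = x + alpha * d                    # Step 5
    return x
```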
Chen CL
56
Notes:

d = −c ⇒ c · d = −||c||² < 0

The successive directions of steepest descent are normal to each other:
d^(k) · d^(k+1) = c^(k) · c^(k+1) = 0

proof:
0 = df(x^(k+1))/dα = [∂f(x^(k+1))/∂x]^T ∂x^(k+1)/∂α = c^(k+1)T ∂(x^(k) + αd^(k))/∂α
  = c^(k+1) · d^(k) = −c^(k+1) · c^(k) = −d^(k+1) · d^(k)
Chen CL
57
Steepest Descent: Example

f(x1, x2) = x1² + x2² − 2x1x2,   x^(0) = (1, 0)

Step 1: x^(0) = (1, 0), k = 0, ε
Step 2: c^(0) = ∇f(x^(0)) = (2x1 − 2x2, 2x2 − 2x1) = (2, −2);  ||c^(0)|| = 2√2 ≠ 0
Step 3: d^(0) = −c^(0) = (−2, 2)
Step 4: minimize f(x^(0) + αd^(0)) = f(1 − 2α, 2α):
  f(α) = (1 − 2α)² + (2α)² − 2(1 − 2α)(2α) = 16α² − 8α + 1
  df/dα = 32α − 8 = 0 ⇒ α_0 = 0.25,   d²f/dα² = 32 > 0
Step 5: x^(1) = x^(0) + α_0 d^(0) = (1 − 0.25(2), 0 + 0.25(2)) = (0.5, 0.5)
  c^(1) = (0, 0) ⇒ stop
Chen CL
58
Steepest Descent: Example

f(x1, x2, x3) = x1² + 2x2² + 2x3² + 2x1x2 + 2x2x3,   x^(0) = (2, 4, 10),   x* = (0, 0, 0)

Step 1: k = 0, ε = 0.005 (δ = 0.05, ε = 0.0001 for the golden section line search)
Step 2: c^(0) = ∇f(x^(0)) = (2x1 + 2x2, 4x2 + 2x1 + 2x3, 4x3 + 2x2) = (12, 40, 48);  ||c^(0)|| = √4048 = 63.6 > ε
Step 3: d^(0) = −c^(0) = (−12, −40, −48)
Step 4: minimize f(x^(0) + αd^(0)) by golden section ⇒ α_0 = 0.158718
Step 5: x^(1) = x^(0) + α_0 d^(0) = (0.0954, −2.348, 2.381)
  c^(1) = (−4.5, −4.438, 4.828);  ||c^(1)|| = 7.952 > ε
  Note: c^(1) · d^(0) = 0 (perfect line search)
Chen CL
59
Chen CL
60
Steepest Descent: Disadvantages
Slow to converge, especially when approaching the optimum ⇒ a large number of iterations

Information calculated at previous iterations is not used; each iteration is started independently of the others
Chen CL
61
Scaling of Design Variables
The steepest descent method converges in only one iteration for a positive definite quadratic function with a unit condition number of the Hessian matrix

To accelerate the rate of convergence ⇒ scale the design variables such that the condition number of the new Hessian matrix is unity
Chen CL
62
Example:  Min: f(x1, x2) = 25x1² + x2²,   H = [ 50 0; 0 2 ],   x^(0) = (1, 1)

let x = Dy,   D = [ 1/√50  0;  0  1/√2 ]
⇓
Min: f(y1, y2) = ½ (y1² + y2²),   y^(0) = (√50, √2)
Chen CL
63
Chen CL
64
Example:  Min: f(x1, x2) = 6x1² − 6x1x2 + 2x2² − 5x1 + 4x2 + 2

H = [ 12 −6; −6 4 ]
λ1,2 = 0.7889, 15.211 (eigenvalues)
v1 = (0.4718, 0.8817),  v2 = (−0.8817, 0.4718)  (eigenvectors)

let x = Qy,   Q = [ v1 v2 ] = [ 0.4718 −0.8817; 0.8817 0.4718 ]
Min: f(y1, y2) = 0.5(0.7889y1² + 15.211y2²) + 1.1678y1 + 6.2957y2 + 2

let y = Dz,   D = [ 1/√0.7889  0;  0  1/√15.211 ]
Min: f(z1, z2) = 0.5(z1² + z2²) + 1.3148z1 + 1.6142z2 + 2

x^(0) = (−1, −2) ⇒ z* = (−1.3148, −1.6142),   x* = QDz* = (−1/3, −2/3)
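A quick numerical check of this two-stage scaling (numpy's `eigh` supplies the eigenvalues and eigenvectors used to build Q and D; the transformed Hessian becomes the identity, i.e. condition number 1):

```python
import numpy as np

H = np.array([[12.0, -6.0], [-6.0, 4.0]])
lam, Q = np.linalg.eigh(H)              # eigenvalues ~ (0.7889, 15.211)
D = np.diag(1.0 / np.sqrt(lam))
T = Q @ D                               # combined transformation x = T z
print(np.round(T.T @ H @ T, 6))         # (QD)^T H (QD) = identity matrix
```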
Chen CL
65
Conjugate Gradient Method
Fletcher and Reeves (1964)

Steepest descent: directions are orthogonal at consecutive steps ⇒ converges, but slowly

Conjugate gradient method: modify the current steepest descent direction by adding a scaled previous direction ⇒ cuts diagonally through the orthogonal steepest descent directions

Conjugate gradient directions: d^(i), d^(j) are orthogonal w.r.t. a symmetric and positive definite matrix A:
d^(i)T A d^(j) = 0
Chen CL
66
Conjugate Gradient Method: algorithm
Step 1: k = 0; x^(0); d^(0) = −c^(0) = −∇f(x^(0)); stop if ||c^(0)|| < ε, otherwise go to Step 4
Step 2: c^(k) = ∇f(x^(k)); stop if ||c^(k)|| < ε
Step 3: d^(k) = −c^(k) + β_k d^(k−1),   β_k = ( ||c^(k)|| / ||c^(k−1)|| )²
Step 4: compute α_k = α to minimize f(x^(k) + αd^(k))
Step 5: x^(k+1) = x^(k) + α_k d^(k), k = k + 1, go to Step 2

Note:
Finds the minimum in n iterations for positive definite quadratic forms having n design variables
With inexact line search or non-quadratic forms ⇒ restart every n + 1 iterations for computational stability (x^(0) = x^(n+1))
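A sketch of the Fletcher-Reeves iteration above (the periodic restart mentioned in the note is omitted for brevity; `line_search` is a generic one-dimensional minimizer):

```python
import numpy as np

def conjugate_gradient(f, grad, x0, line_search, eps=1e-3, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    c = grad(x)
    d = -c                                        # Step 1
    for k in range(max_iter):
        if np.linalg.norm(c) < eps:
            break
        alpha = line_search(f, x, d)              # Step 4
        x = x + alpha * d                         # Step 5
        c_new = grad(x)                           # Step 2
        beta = (np.linalg.norm(c_new) / np.linalg.norm(c)) ** 2
        d = -c_new + beta * d                     # Step 3
        c = c_new
    return x
```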
Chen CL
67
Example:

Min: f(x) = x1² + 2x2² + 2x3² + 2x1x2 + 2x2x3,   x^(0) = (2, 4, 10),   f(x^(0)) = 332.0

c^(0) = (12, 40, 48);  ||c^(0)|| = 63.6
x^(1) = (0.0956, −2.348, 2.381)
c^(1) = (−4.5, −4.438, 4.828);  ||c^(1)|| = 7.952;  f(x^(1)) = 10.75

β_1 = ( ||c^(1)|| / ||c^(0)|| )² = (7.952/63.6)² = 0.015633
d^(1) = −c^(1) + β_1 d^(0) = (4.500, 4.438, −4.828) + 0.015633 (−12, −40, −48) = (4.31241, 3.81268, −5.57838)
Chen CL
68
x^(2) = x^(1) + αd^(1) = (0.0956, −2.348, 2.381) + α (4.31241, 3.81268, −5.57838)

Minimizing f(x^(1) + αd^(1)) ⇒ α = 0.3156

x^(2) = (1.4566, −1.1447, 0.6205)
c^(2) = (0.6238, −0.4246, 0.1926),  ||c^(2)|| = 0.7788

Note: c^(2) · d^(1) = 0
Chen CL
69
Newton Method: A Second-Order Method

x : current estimate of x*  ⇒  x* ≈ x + Δx (desired)

f(x + Δx) = f(x) + c^T Δx + ½ Δx^T H Δx

NC: ∂f/∂Δx = c + H Δx = 0
⇒ Δx = −H^(−1) c
⇒ Δx = −α H^(−1) c   (modified)
Chen CL
70
Steps (modified Newton method):

Step 1: k = 0; estimate x^(0); select ε
Step 2: c_i^(k) = ∂f(x^(k))/∂x_i, i = 1 ~ n; stop if ||c^(k)|| < ε
Step 3: H(x^(k)) = [ ∂²f/∂x_i∂x_j ]
Step 4: d^(k) = −H^(−1) c^(k), or solve H d^(k) = −c^(k)
  Note: for computational efficiency, a system of linear simultaneous equations is solved instead of evaluating the inverse of the Hessian
Step 5: compute α_k = α to minimize f(x^(k) + αd^(k))
Step 6: x^(k+1) = x^(k) + α_k d^(k), k = k + 1, go to Step 2

Note: unless H is positive definite, d^(k) will not be a descent direction for f:
c^(k)T d^(k) = −α_k c^(k)T H^(−1) c^(k) < 0  ⇔  c^(k)T H^(−1) c^(k) > 0, i.e. H positive definite
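A sketch of the modified Newton iteration, solving H d = −c as the note recommends instead of inverting H:

```python
import numpy as np

def modified_newton(f, grad, hess, x0, line_search, eps=1e-4, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        c = grad(x)                               # Step 2
        if np.linalg.norm(c) < eps:
            break
        H = hess(x)                               # Step 3
        d = np.linalg.solve(H, -c)                # Step 4: solve H d = -c
        alpha = line_search(f, x, d)              # Step 5
        x = x + alpha * d                         # Step 6
    return x
```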
Chen CL
71
Example: f(x) = 3x1² + 2x1x2 + 2x2² + 7,   x^(0) = (5, 10),   ε = 0.0001

c^(0) = (6x1 + 2x2, 2x1 + 4x2) = (50, 50);   ||c^(0)|| = 50√2

H^(0) = [ 6 2; 2 4 ],   H^(0)−1 = (1/20) [ 4 −2; −2 6 ]

d^(0) = −H^(0)−1 c^(0) = −(1/20) [ 4 −2; −2 6 ] (50, 50) = (−5, −10)

x^(1) = x^(0) + αd^(0) = (5 − 5α, 10 − 10α)

df/dα = 0, i.e. ∇f(x^(1)) · d^(0) = 0:
∇f(x^(1)) = ( 6(5 − 5α) + 2(10 − 10α), 2(5 − 5α) + 4(10 − 10α) ) = (50 − 50α, 50 − 50α)
∇f(x^(1)) · d^(0) = −5(50 − 50α) − 10(50 − 50α) = 0 ⇒ α = 1
Chen CL
72
x^(1) = (5 − 5α, 10 − 10α) = (0, 0)
c^(1) = (50 − 50α, 50 − 50α) = (0, 0)  ⇒ the minimum of this quadratic function is reached in one step
Chen CL
73
Example: f(x) = 10x1⁴ − 20x1²x2 + 10x2² + x1² − 2x1 + 5,   x^(0) = (−1, 3)

c = ∇f(x) = ( 40x1³ − 40x1x2 + 2x1 − 2, −20x1² + 20x2 )

H = ∇²f(x) = [ 120x1² − 40x2 + 2   −40x1 ;  −40x1   20 ]
Chen CL
74
Chen CL
75
Comparison of Steepest Descent, Conjugate Gradient, and Newton Methods

f(x) = 50(x2 − x1²)² + (2 − x1)²,   x^(0) = (5, −5),   x* = (2, 4)
Chen CL
76
Chen CL
77
Chen CL
78
Newton Method
Advantage: quadratic convergence rate

Disadvantages:
Calculation of second-order derivatives at each iteration
A system of simultaneous linear equations needs to be solved
The Hessian of the function may be singular at some iterations
Memoryless method: each iteration is started afresh
Not convergent unless the Hessian remains positive definite and a step size determination scheme is used
Chen CL
79
Marquardt Modification (1963)

d^(k) = −(H + λI)^(−1) c^(k)
Far away from the solution point ⇒ behaves like Steepest Descent
Near the solution point ⇒ behaves like the Newton Method

Step 1: k = 0; estimate x^(0); select ε; λ_0 (= 10000, large)
Step 2: c_i^(k) = ∂f(x^(k))/∂x_i, i = 1 ~ n; stop if ||c^(k)|| < ε
Step 3: H(x^(k)) = [ ∂²f/∂x_i∂x_j ]
Step 4: d^(k) = −(H + λ_k I)^(−1) c^(k)
Step 5: if f(x^(k) + d^(k)) < f(x^(k)), go to Step 6; otherwise let λ_k = 2λ_k and go to Step 4
Step 6: set x^(k+1) = x^(k) + d^(k), λ_{k+1} = 0.5λ_k, k = k + 1 and go to Step 2
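A sketch of the Marquardt steps (the full step x + d is taken, with λ doubled until the step actually decreases f):

```python
import numpy as np

def marquardt(f, grad, hess, x0, eps=1e-4, lam=10000.0, max_iter=200):
    x = np.asarray(x0, dtype=float)
    n = x.size
    for k in range(max_iter):
        c = grad(x)                                        # Step 2
        if np.linalg.norm(c) < eps:
            break
        H = hess(x)                                        # Step 3
        while True:
            d = np.linalg.solve(H + lam * np.eye(n), -c)   # Step 4
            if f(x + d) < f(x):                            # Step 5
                break
            lam = 2.0 * lam
        x = x + d
        lam = 0.5 * lam                                    # Step 6
    return x
```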
Chen CL
80
Quasi-Newton Methods
Steepest Descent:
Uses only first-order information ⇒ poor rate of convergence
Each iteration is started with new design variables without using any information from previous iterations

Newton Method:
Uses second-order derivatives ⇒ quadratic convergence rate
Requires calculation of n(n+1)/2 second-order derivatives!
Difficulties if the Hessian is singular
Not a learning process
Chen CL
81
Quasi-Newton Methods
Quasi-Newton Methods (Update Methods):

Use first-order derivatives to generate approximations for the Hessian ⇒ combine desirable features of both steepest descent and Newton's method
Use information from previous iterations to speed up convergence (learning processes)
Several ways to approximate the (updated) Hessian or its inverse
Preserve properties of symmetry and positive definiteness
82
Davidon-Fletcher-Powell (DFP) Method
Davidon (1959), Fletcher and Powell (1963)
Approximate the Hessian inverse using only first derivatives:

Δx = −αH^(−1)c ≈ −αAc,   A ≈ H^(−1)

find A by using only first-order information
Chen CL
83
DFP Procedure (A ≈ H^(−1)):

Step 1: k = 0; estimate x^(0), select ε; A^(0) (= I ≈ H^(−1))
Step 2: c^(k) = ∇f(x^(k)); stop if ||c^(k)|| < ε
Step 3: d^(k) = −A^(k) c^(k)
Step 4: compute α_k = α to minimize f(x^(k) + αd^(k))
Step 5: x^(k+1) = x^(k) + α_k d^(k)
Step 6: update A^(k):

A^(k+1) = A^(k) + B^(k) + C^(k)
B^(k) = s^(k) s^(k)T / (s^(k) · y^(k))
C^(k) = −z^(k) z^(k)T / (y^(k) · z^(k))
s^(k) = α_k d^(k)   (change in design)
y^(k) = c^(k+1) − c^(k)   (change in gradient)
c^(k+1) = ∇f(x^(k+1))
z^(k) = A^(k) y^(k)

Step 7: set k = k + 1 and go to Step 2
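A sketch of the DFP iteration with the A-update written exactly as in Step 6 (`line_search` is a generic one-dimensional minimizer):

```python
import numpy as np

def dfp(f, grad, x0, line_search, eps=1e-3, max_iter=200):
    x = np.asarray(x0, dtype=float)
    A = np.eye(x.size)                        # Step 1: A(0) = I ~ inverse Hessian
    c = grad(x)
    for k in range(max_iter):
        if np.linalg.norm(c) < eps:           # Step 2
            break
        d = -A @ c                            # Step 3
        alpha = line_search(f, x, d)          # Step 4
        x_new = x + alpha * d                 # Step 5
        c_new = grad(x_new)
        s = alpha * d                         # change in design
        y = c_new - c                         # change in gradient
        z = A @ y
        A = A + np.outer(s, s) / (s @ y) - np.outer(z, z) / (y @ z)   # Step 6
        x, c = x_new, c_new                   # Step 7
    return x
```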
Chen CL
84
DFP Properties:

Matrix A^(k) is always positive definite ⇒ the method always converges to a local minimum, provided α > 0, since

d/dα f(x^(k) + αd^(k)) |_{α=0} = −c^(k)T A^(k) c^(k) < 0

When applied to a positive definite quadratic form, A^(k) converges to the inverse of the Hessian of the quadratic form
Chen CL
85
DFP Example: f(x) = 5x1² + 2x1x2 + x2² + 7,   x^(0) = (1, 2)

1-1. x^(0) = (1, 2);  A^(0) = I;  k = 0, ε = 0.001
1-2. c^(0) = (10x1 + 2x2, 2x1 + 2x2) = (14, 6),  ||c^(0)|| = √(14² + 6²) = 15.232 > ε
1-3. d^(0) = −c^(0) = (−14, −6)
1-4. x^(1) = x^(0) + αd^(0) = (1 − 14α, 2 − 6α)
     f(x^(1)) = f(α) = 5(1 − 14α)² + 2(1 − 14α)(2 − 6α) + (2 − 6α)² + 7
     df/dα = 5(2)(−14)(1 − 14α) + 2(−14)(2 − 6α) + 2(−6)(1 − 14α) + 2(−6)(2 − 6α) = 0
     ⇒ α_0 = 0.0988,   d²f/dα² = 2348 > 0
1-5. x^(1) = x^(0) + α_0 d^(0) = (1 − 14α_0, 2 − 6α_0) = (−0.386, 1.407)
Chen CL
86
1-6. s^(0) = α_0 d^(0) = (−1.386, −0.593);  c^(1) = (−1.046, 2.042)
     y^(0) = c^(1) − c^(0) = (−15.046, −3.958);  z^(0) = A^(0) y^(0) = y^(0)
     s^(0) · y^(0) = 23.20;  y^(0) · z^(0) = 242.05

     s^(0) s^(0)T = [ 1.921 0.822; 0.822 0.352 ]    ⇒ B^(0) = [ 0.0828 0.0354; 0.0354 0.0152 ]
     z^(0) z^(0)T = [ 226.40 59.55; 59.55 15.67 ]   ⇒ C^(0) = −[ 0.935 0.246; 0.246 0.065 ]

     A^(1) = A^(0) + B^(0) + C^(0) = [ 0.148 −0.211; −0.211 0.950 ]
Chen CL
87

2-2. ||c^(1)|| = √(1.046² + 2.042²) = 2.29 > ε
2-3. d^(1) = −A^(1) c^(1) = (0.586, −1.719)
2-4. x^(2) = x^(1) + αd^(1);  α_1 = 0.776 (minimizing f(x^(1) + αd^(1)))
2-5. x^(2) = x^(1) + α_1 d^(1) = (−0.386, 1.407) + (0.455, −1.334) = (0.069, 0.073)
Chen CL
88
2-6. s^(1) = α_1 d^(1) = (0.455, −1.334);  c^(2) = (0.836, 0.284)
     y^(1) = c^(2) − c^(1) = (1.882, −1.758);  z^(1) = A^(1) y^(1) = (0.649, −2.067)
     s^(1) · y^(1) = 3.201;  y^(1) · z^(1) = 4.855

     s^(1) s^(1)T = [ 0.207 −0.607; −0.607 1.780 ]    ⇒ B^(1) = [ 0.0647 −0.19; −0.19 0.556 ]
     z^(1) z^(1)T = [ 0.421 −1.341; −1.341 4.272 ]    ⇒ C^(1) = −[ 0.0867 −0.276; −0.276 0.880 ]

     A^(2) = A^(1) + B^(1) + C^(1) = [ 0.126 −0.125; −0.125 0.626 ]
Chen CL
89
Broyden-Fletcher-Goldfarb-Shanno (BFGS) Method
Directly update the Hessian using only first derivatives:

Δx = −αH^(−1)c  ⇒  H Δx = −αc  ≈  A Δx = −αc,   A ≈ H

find A by using only first-order information
Chen CL
90
BFGS Procedure (H^(0) ≈ H):

Step 1: k = 0; estimate x^(0), select ε; H^(0) (= I ≈ H)
Step 2: c^(k) = ∇f(x^(k)); stop if ||c^(k)|| < ε
Step 3: solve H^(k) d^(k) = −c^(k) to obtain d^(k)
Step 4: compute α_k = α to minimize f(x^(k) + αd^(k))
Step 5: x^(k+1) = x^(k) + α_k d^(k)
Step 6: update H^(k):

H^(k+1) = H^(k) + D^(k) + E^(k)
D^(k) = y^(k) y^(k)T / (y^(k) · s^(k))
E^(k) = c^(k) c^(k)T / (c^(k) · d^(k))
s^(k) = α_k d^(k)   (change in design)
y^(k) = c^(k+1) − c^(k)   (change in gradient)
c^(k+1) = ∇f(x^(k+1))

Step 7: set k = k + 1 and go to Step 2
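A sketch of the BFGS iteration as stated above, keeping the direct Hessian update and solving H d = −c at each step:

```python
import numpy as np

def bfgs(f, grad, x0, line_search, eps=1e-3, max_iter=200):
    x = np.asarray(x0, dtype=float)
    H = np.eye(x.size)                        # Step 1: H(0) = I ~ Hessian
    c = grad(x)
    for k in range(max_iter):
        if np.linalg.norm(c) < eps:           # Step 2
            break
        d = np.linalg.solve(H, -c)            # Step 3
        alpha = line_search(f, x, d)          # Step 4
        x_new = x + alpha * d                 # Step 5
        c_new = grad(x_new)
        s = alpha * d                         # change in design
        y = c_new - c                         # change in gradient
        H = H + np.outer(y, y) / (y @ s) + np.outer(c, c) / (c @ d)   # Step 6
        x, c = x_new, c_new                   # Step 7
    return x
```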
Chen CL
91
BFGS Example: f(x) = 5x1² + 2x1x2 + x2² + 7,   x^(0) = (1, 2)

1-1. x^(0) = (1, 2);  H^(0) = I;  k = 0, ε = 0.001
1-2. c^(0) = (10x1 + 2x2, 2x1 + 2x2) = (14, 6),  ||c^(0)|| = √(14² + 6²) = 15.232 > ε
1-3. d^(0) = −c^(0) = (−14, −6)
1-4. x^(1) = x^(0) + αd^(0) = (1 − 14α, 2 − 6α)
     f(x^(1)) = f(α) = 5(1 − 14α)² + 2(1 − 14α)(2 − 6α) + (2 − 6α)² + 7
     df/dα = 5(2)(−14)(1 − 14α) + 2(−14)(2 − 6α) + 2(−6)(1 − 14α) + 2(−6)(2 − 6α) = 0
     ⇒ α_0 = 0.0988,   d²f/dα² = 2348 > 0
1-5. x^(1) = x^(0) + α_0 d^(0) = (1 − 14α_0, 2 − 6α_0) = (−0.386, 1.407)
Chen CL
92
1-6. s^(0) = α_0 d^(0) = (−1.386, −0.593);  c^(1) = (−1.046, 2.042)
     y^(0) = c^(1) − c^(0) = (−15.046, −3.958)
     y^(0) · s^(0) = 23.20;  c^(0) · d^(0) = −232.0

     y^(0) y^(0)T = [ 226.40 59.55; 59.55 15.67 ]   ⇒ D^(0) = [ 9.760 2.567; 2.567 0.675 ]
     c^(0) c^(0)T = [ 196 84; 84 36 ]               ⇒ E^(0) = −[ 0.845 0.362; 0.362 0.155 ]

     H^(1) = H^(0) + D^(0) + E^(0) = [ 9.915 2.205; 2.205 0.520 ]
Chen CL
93

2-2. ||c^(1)|| = √(1.046² + 2.042²) = 2.29 > ε
2-3. solve H^(1) d^(1) = −c^(1) ⇒ d^(1) = (17.20, −76.77)
2-4. x^(2) = x^(1) + αd^(1);  α_1 = 0.018455 (minimizing f(x^(1) + αd^(1)))
2-5. x^(2) = x^(1) + α_1 d^(1) = (−0.0686, −0.0098)