Optimization for Data Science
Lecture 1: Introduction to Statistics
Lecture 1
contextualizes statistics and optimization within modern AI applications. It contrasts probability,
which mathematically describes data generation, with statistics, which infers unknown parameters
from observed data. The session reviews fundamental probability concepts including discrete
and continuous random variables, density functions, expectation, variance, and independence.
These concepts establish the framework for statistical inference, demonstrated through
parameter estimation in Binomial, Normal, and multiple linear regression models.
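As an illustrative sketch of such parameter estimation (the data below are simulated for this summary, not taken from the lecture), the sample proportion recovers the Binomial parameter, and the sample mean and variance recover the Normal parameters:

```python
import random

random.seed(0)  # reproducible simulated data

# Simulated coin flips: 1000 draws with success probability p = 0.3.
true_p = 0.3
flips = [1 if random.random() < true_p else 0 for _ in range(1000)]

# The sample proportion is the natural (maximum likelihood) estimate of p.
p_hat = sum(flips) / len(flips)

# Simulated Normal data: the sample mean and variance estimate mu and sigma^2.
sample = [random.gauss(5.0, 2.0) for _ in range(1000)]
mu_hat = sum(sample) / len(sample)
var_hat = sum((x - mu_hat) ** 2 for x in sample) / (len(sample) - 1)
```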
Lecture 2: Optimization for Statistics
Lecture 2
outlines statistical inference via point estimation and confidence intervals grounded in
the Law of Large Numbers and the Central Limit Theorem.
It establishes parameter optimization using Maximum Likelihood Estimation and
the least squares principle. The session distinguishes supervised regression and
classification from unsupervised learning. These foundations are applied
to ordinary least squares regression and Principal Component Analysis.
Model optimization is formalized through risk functions and negative log-likelihoods.
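A minimal sketch of the least squares principle on a toy data set (the numbers are illustrative): for simple linear regression, the optimal slope and intercept have a closed form.

```python
# Least squares fit of y = a + b*x; the data lie exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Closed-form minimizers of the squared-error risk.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar
```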
Lecture 3: Optimization and Convexity
Lecture 3
introduces the optimization framework, defining global, local,
and approximate minima. It details convex sets, convex functions,
Jensen's inequality, and operations preserving convexity.
The core theorem proves that every local minimum of a convex function is inherently
a global minimum. The lecture establishes the first-order condition, gradient monotonicity,
and second-order positive semidefinite Hessian characterizations of convexity.
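The first-order condition can be checked numerically on a toy convex function (f(x) = x^4 is an illustrative choice here): every tangent line must lie below the graph.

```python
# First-order characterization: f convex iff f(y) >= f(x) + f'(x)*(y - x).
def f(x):
    return x ** 4

def fprime(x):
    return 4 * x ** 3

# Verify the tangent-line inequality on a small grid of points.
points = [-2.0, -0.5, 0.0, 0.7, 1.5]
first_order_holds = all(
    f(y) >= f(x) + fprime(x) * (y - x) - 1e-12
    for x in points for y in points
)
```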
Lecture 4: Optimality Condition and Strict/Strong Convexity
Lecture 4
outlines the fundamentals of convex optimization, emphasizing that every
local minimum is a global minimum. It details the precise first-order optimality
conditions required to identify global optima across unconstrained and
constrained problems. The session covers the rigorous definitions alongside the first-
and second-order characterizations of strict and strong convexity.
It establishes the mathematical hierarchy among standard, strict,
and strong convexity, applying these principles to quadratic functions.
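For quadratics this hierarchy is concrete: f(x) = (1/2) x^T A x is strongly convex exactly when the smallest eigenvalue of A is positive. A small sketch with an illustrative 2x2 matrix:

```python
import math

# f(x) = 0.5 * x^T A x is m-strongly convex with m = lambda_min(A).
A = [[3.0, 1.0],
     [1.0, 2.0]]  # symmetric positive definite (illustrative choice)

# Eigenvalues of a symmetric 2x2 matrix from its trace and determinant.
tr = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
lam_min = (tr - math.sqrt(tr * tr - 4 * det)) / 2
lam_max = (tr + math.sqrt(tr * tr - 4 * det)) / 2

strongly_convex = lam_min > 0
```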
Lecture 5: Optimization for Exponential Family
Lecture 5
outlines the fundamentals of the exponential family, emphasizing the inherent convexity
of their log-likelihoods. It details the canonical formulations required to frame Maximum
Likelihood Estimation as a convex optimization problem across one- and multi-parameter
distributions. The session covers rigorous definitions, moment generating functions,
and convexity proofs using Hölder's Inequality,
applying these mathematical principles to standard statistical models.
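A concrete instance: for Poisson counts, the negative log-likelihood is convex in the canonical parameter theta = log(lambda), and setting its derivative to zero gives the sample mean as the MLE (the counts below are illustrative).

```python
import math

# Poisson NLL in the canonical parameter theta = log(lambda), dropping
# the constant log(x_i!) terms: nll(theta) = n*exp(theta) - theta*sum(x_i).
data = [2, 3, 1, 4, 2, 3]  # illustrative count data
n, s = len(data), sum(data)

def nll(theta):
    return n * math.exp(theta) - theta * s

# The stationarity condition n*exp(theta) = s gives theta* = log(mean),
# i.e. the MLE of lambda is the sample mean.
theta_star = math.log(s / n)
lam_mle = math.exp(theta_star)
```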
Lecture 6: Ordinary Least Squares and Gradient Descent
Lecture 6
examines Ordinary Least Squares (OLS) optimization, focusing on the geometric perspective of
its closed-form solution as an orthogonal projection onto a column space.
The lecture transitions from theoretical foundations to practical computation,
contrasting exact algebraic techniques like QR decomposition with iterative solvers.
Specifically, it establishes strict convergence bounds for first-order gradient descent
and demonstrates the single-step efficiency of second-order Newton's method for quadratic
objectives, reinforcing these concepts through functional Python implementations.
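A 1-dimensional sketch of the contrast (constants chosen for illustration): gradient descent approaches the minimizer of a quadratic geometrically, while a single Newton step lands on it exactly.

```python
# Gradient descent vs. a single Newton step on the quadratic
# f(w) = 0.5*a*w^2 - b*w, whose exact minimizer is w* = b / a.
a, b = 4.0, 2.0
w_star = b / a  # = 0.5

# First-order: gradient descent with a small fixed step size.
w = 0.0
for _ in range(50):
    w -= 0.1 * (a * w - b)  # gradient of f is a*w - b

# Second-order: Newton's step w - f''(w)^{-1} f'(w) solves a quadratic exactly.
w0 = 0.0
w_newton = w0 - (a * w0 - b) / a
```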
Lecture 7: Gradient Descent: Theoretical Analysis
Lecture 7
explores the theoretical convergence of gradient descent, demonstrating
how different properties of an objective function impact algorithmic speed.
The lecture highlights that a basic analysis provides weak guarantees, meaning
stronger assumptions are necessary to prove efficiency. It establishes a clear
hierarchy: while Lipschitz convex functions suffer from very slow convergence,
introducing smoothness significantly accelerates the process. Ultimately, functions
that are both smooth and strongly convex achieve rapid, linear convergence,
drastically minimizing the total iterations needed to find the optimal solution.
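The linear rate can be seen directly on a 1-dimensional quadratic (a sketch with illustrative constants): with step size 1/L, the error contracts by the fixed factor 1 - mu/L every iteration.

```python
# GD on f(w) = 0.5 * mu * w^2 with step 1/L: error shrinks by (1 - mu/L).
mu, L = 1.0, 10.0  # strong-convexity and smoothness constants (illustrative)

w = 1.0
errors = []
for _ in range(5):
    w -= (1.0 / L) * (mu * w)  # gradient is mu * w
    errors.append(abs(w))

ratios = [errors[i + 1] / errors[i] for i in range(len(errors) - 1)]
```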
Lecture 8: Line Search Methods: Gradient Descent Step Size Selection
Lecture 8
shifts focus to the practical selection of step sizes in gradient descent. Because
fixed step sizes risk either sluggish progress or unstable oscillations,
the lecture introduces adaptive line search strategies. It moves past exact line
search due to its high computational cost, emphasizing practical, inexact
alternatives instead. The core discussion details the Wolfe conditions,
which enforce both a sufficient decrease and adequate curvature, alongside the
backtracking method that iteratively shrinks an initial trial step until these
acceptable criteria are met.
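A minimal sketch of backtracking with the Armijo sufficient-decrease test (the full Wolfe conditions add a curvature check on the new gradient; the function and constants here are illustrative):

```python
# Backtracking: shrink a trial step until the Armijo condition holds.
def f(x):
    return x ** 4

def grad(x):
    return 4 * x ** 3

def backtracking(x, t=1.0, beta=0.5, c=1e-4):
    g = grad(x)
    # Armijo: f(x - t*g) <= f(x) - c * t * g^2
    while f(x - t * g) > f(x) - c * t * g * g:
        t *= beta  # shrink the trial step
    return t

x0 = 2.0
t = backtracking(x0)
x1 = x0 - t * grad(x0)  # guaranteed to decrease f
```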
Lecture 9: Projected GD, Coordinate descent, Proximal GD and SGD
Lecture 9
explores four efficient extensions of standard gradient descent tailored for complex
optimization challenges. The session covers Projected Gradient Descent for enforcing
parameter constraints via projection steps, alongside Coordinate Descent, which
reduces high-dimensional problems into cheaper, single-variable updates. To address
composite objectives with non-smooth penalties, it introduces Proximal Gradient
Descent, and concludes with Stochastic Gradient Descent, a method that accelerates
large-scale training by substituting expensive full gradients with cost-effective,
unbiased single-sample estimators.
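Projected gradient descent, the first of these, can be sketched on a 1-dimensional toy problem (illustrative objective and constraint): each gradient step is followed by a Euclidean projection back onto the feasible set.

```python
# Minimize f(w) = (w - 2)^2 subject to 0 <= w <= 1. The unconstrained
# minimizer w = 2 is infeasible, so the iterates settle at the boundary w = 1.
def project(w, lo=0.0, hi=1.0):
    return max(lo, min(hi, w))  # Euclidean projection onto [lo, hi]

w = 0.0
for _ in range(100):
    grad = 2.0 * (w - 2.0)
    w = project(w - 0.1 * grad)
```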
Lecture 10: Newton’s method
Lecture 10
explores Newton's method as a powerful second-order technique for statistical
optimization. The session begins with the 1-dimensional case, demonstrating
how tangent lines are used to find zero gradients for function minimization.
It then extends this approach to high-dimensional spaces, where it utilizes both
gradient and curvature information—specifically the inverse of the Hessian
matrix—to minimize a local second-order Taylor expansion of the objective function.
The lecture concludes by establishing the method's theoretical guarantees, proving
its rapid quadratic convergence rate under the conditions of bounded inverse and
Lipschitz continuous Hessians.
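A 1-dimensional sketch (the objective exp(x) - 2x is an illustrative choice): Newton's update x <- x - f'(x)/f''(x) drives the gradient to zero, and a handful of iterations already reach machine-level accuracy.

```python
import math

# Minimize f(x) = exp(x) - 2x; the gradient exp(x) - 2 vanishes at x* = log 2.
def fprime(x):
    return math.exp(x) - 2.0

def fsecond(x):
    return math.exp(x)

x = 0.0
for _ in range(6):
    x -= fprime(x) / fsecond(x)  # Newton step: quadratic convergence near x*
```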
Lecture 11: Lagrange duality
Lecture 11
explores Lagrange duality and the Karush-Kuhn-Tucker (KKT) conditions
as fundamental theoretical frameworks for constrained statistical optimization.
The session begins by defining the Lagrangian and its dual function,
establishing the principles of weak and strong duality alongside the concept
of a zero duality gap. It then details the KKT conditions—encompassing primal
feasibility, dual feasibility, complementary slackness, and stationarity—as necessary
and sufficient requirements for global optimality under convexity. The lecture
concludes by demonstrating practical applications of these theoretical principles,
utilizing the Lagrangian formulation to derive elegant closed-form solutions for
projection onto the unit ball, simplex projection, and mirror descent updates via
KL divergence.
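The unit-ball projection mentioned above admits a two-line implementation once the KKT analysis is done: the point is kept if it is already feasible and rescaled to the boundary otherwise. A minimal sketch:

```python
import math

# Euclidean projection onto the unit ball {x : ||x|| <= 1}. The Lagrangian/KKT
# analysis shows the solution is x itself if ||x|| <= 1, else x / ||x||.
def project_unit_ball(x):
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= 1.0:
        return list(x)
    return [v / norm for v in x]

p = project_unit_ball([3.0, 4.0])  # norm 5, so it is rescaled to norm 1
```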
Lecture 12: Penalty and Augmented Lagrangian Methods
Lecture 12
compares three methods for solving constrained optimization problems.
While the dual-ascent method risks instability and the pure quadratic penalty
method causes severe ill-conditioning at high values, the Augmented Lagrangian
Method (ALM) elegantly combines their strengths. By pairing a stabilizing quadratic
penalty with explicit multiplier updates, ALM achieves accurate, feasible solutions
without the numerical drawbacks of extreme penalty parameters.
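A minimal 1-dimensional sketch of ALM (problem and penalty parameter chosen for illustration): minimize x^2 subject to x = 1, alternating a closed-form primal step with a multiplier update.

```python
# Augmented Lagrangian: L_rho(x, lam) = x^2 + lam*(x - 1) + (rho/2)*(x - 1)^2.
rho, lam, x = 10.0, 0.0, 0.0
for _ in range(50):
    # Primal step has a closed form: 2x + lam + rho*(x - 1) = 0.
    x = (rho - lam) / (2.0 + rho)
    # Multiplier step: gradient ascent on the constraint residual.
    lam += rho * (x - 1.0)
```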
Lecture 13: Alternating Direction Method of Multipliers
Lecture 13
introduces the Alternating Direction Method of Multipliers as a powerful technique
for solving structured optimization problems. Specifically, it addresses scenarios
where the main objective can be cleanly separated into two independent components
that are linked by explicit linear equations rather than abstract constraints.
While the traditional Augmented Lagrangian Method forces these components to be
optimized simultaneously—which creates computational bottlenecks and destroys
separability—this new approach resolves the issue by separating the optimization
process into alternating steps. By updating one component at a time, the method
successfully restores decomposability, making complex systems significantly easier
to compute and implement in practice.
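A minimal sketch of the alternating updates on a lasso-style toy problem (an illustrative choice): minimize (x - 3)^2 + |z| subject to x = z, whose solution is x = z = 2.5.

```python
# ADMM with scaled dual variable u: alternate a closed-form x-step, a
# proximal (soft-threshold) z-step, and a dual update on the residual x - z.
def soft_threshold(v, t):
    return max(v - t, 0.0) if v > 0 else min(v + t, 0.0)

rho, x, z, u = 1.0, 0.0, 0.0, 0.0
for _ in range(200):
    x = (6.0 + rho * (z - u)) / (2.0 + rho)  # minimize (x-3)^2 + (rho/2)(x-z+u)^2
    z = soft_threshold(x + u, 1.0 / rho)     # proximal step for |z|
    u += x - z                               # scaled dual (multiplier) update
```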