
Zijian Guo

Optimization for Data Science

Lecture 1: Introduction to Statistics

    Lecture 1 contextualizes statistics and optimization within modern AI applications. It contrasts probability, which mathematically describes data generation, with statistics, which infers unknown parameters from observed data. The session reviews fundamental probability concepts including discrete and continuous random variables, density functions, expectation, variance, and independence. These concepts establish the framework for statistical inference, demonstrated through parameter estimation in Binomial, Normal, and multiple linear regression models.
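
    As a minimal illustration of the parameter-estimation theme, the sketch below (not from the lecture) estimates the success probability of a Binomial model from simulated 0/1 data; the true probability and sample size are illustrative choices.

```python
import numpy as np

# Minimal sketch: estimating the success probability p of a Binomial(1, p)
# model from simulated data. The true value p_true and the sample size n are
# illustrative assumptions, not values from the lecture.
rng = np.random.default_rng(0)
p_true, n = 0.3, 1000
x = rng.binomial(1, p_true, size=n)   # observed 0/1 data

p_hat = x.mean()                      # sample proportion estimates p
print(f"estimate {p_hat:.3f} vs truth {p_true}")
```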

Lecture 2: Optimization for Statistics

    Lecture 2 outlines statistical inference via point estimation and confidence intervals grounded in the Law of Large Numbers and the Central Limit Theorem. It establishes parameter optimization using Maximum Likelihood Estimation and the least squares principle. The session distinguishes supervised regression and classification from unsupervised learning. These foundations are applied to ordinary least squares regression and Principal Component Analysis. Model optimization is formalized through risk functions and negative log-likelihoods.
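
    A brief sketch of two of the estimators mentioned above, ordinary least squares and PCA, on assumed simulated data; the dimensions and the true coefficient vector are illustrative, not the lecture's.

```python
import numpy as np

# Minimal sketch of two estimators from the lecture summary: ordinary least
# squares and principal component analysis. Data sizes and the true
# coefficient vector are illustrative assumptions.
rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Least squares estimate: argmin_beta ||y - X beta||^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# PCA: leading eigenvector of the sample covariance matrix
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
first_pc = eigvecs[:, -1]             # direction of maximal variance

print("OLS estimate:", beta_hat)
print("first principal component:", first_pc)
```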

Lecture 3: Optimization and Convexity

    Lecture 3 introduces the optimization framework, defining global, local, and approximate minima. It details convex sets, convex functions, Jensen's inequality, and operations preserving convexity. The core theorem proves every local minimum of a convex function is inherently a global minimum. The lecture establishes the first-order condition, gradient monotonicity, and second-order positive semidefinite Hessian characterizations of convexity.
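
    The second-order characterization can be checked numerically; the sketch below assumes a small quadratic with an illustrative matrix Q and verifies positive semidefiniteness of its (constant) Hessian.

```python
import numpy as np

# Minimal sketch of the second-order convexity check: a twice-differentiable
# f is convex iff its Hessian is positive semidefinite everywhere. For the
# quadratic f(x) = 0.5 * x^T Q x the Hessian is the constant matrix Q, so a
# single eigenvalue check suffices. Q below is an illustrative example.
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])
eigenvalues = np.linalg.eigvalsh(Q)
print("Hessian eigenvalues:", eigenvalues)
print("convex:", np.all(eigenvalues >= 0))          # PSD -> convex
print("strictly convex:", np.all(eigenvalues > 0))  # PD  -> strictly convex
```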

Lecture 4: Optimality Condition and Strict/Strong Convexity

    Lecture 4 outlines the fundamentals of convex optimization, emphasizing that every local minimum is a global minimum. It details the precise first-order optimality conditions required to identify global optima across unconstrained and constrained problems. The session covers the rigorous definitions alongside the first- and second-order characterizations of strict and strong convexity. It establishes the mathematical hierarchy among standard, strict, and strong convexity, applying these principles to quadratic functions.
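
    The sketch below illustrates these ideas on an assumed quadratic: the first-order condition identifies the unique global minimizer, and the strong convexity constant is read off as the smallest Hessian eigenvalue.

```python
import numpy as np

# Minimal sketch for the quadratic f(x) = 0.5 x^T Q x - b^T x with Q symmetric
# positive definite (an illustrative choice). The first-order condition
# grad f(x*) = Q x* - b = 0 identifies the unique global minimum, and the
# strong convexity constant equals the smallest eigenvalue of Q.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

x_star = np.linalg.solve(Q, b)        # stationary point = global minimizer
mu = np.linalg.eigvalsh(Q).min()      # strong convexity parameter

print("minimizer:", x_star)
print("gradient at minimizer:", Q @ x_star - b)   # ~ 0
print("strong convexity constant mu:", mu)
```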

Lecture 5: Optimization for Exponential Family

    Lecture 5 outlines the fundamentals of the exponential family, emphasizing the inherent convexity of their log-likelihoods. It details the canonical formulations required to frame Maximum Likelihood Estimation as a convex optimization problem across one- and multi-parameter distributions. The session covers rigorous definitions, moment generating functions, and convexity proofs using Hölder's Inequality, applying these mathematical principles to standard statistical models.
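
    As a hedged illustration, the sketch below frames Poisson maximum likelihood in the canonical parameter as a convex scalar minimization and compares the numerical solution with the closed-form sample mean; the data-generating rate is an assumption.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Minimal sketch: MLE in a one-parameter exponential family. For Poisson data
# with canonical parameter theta = log(lambda), the negative log-likelihood
# n*exp(theta) - theta*sum(x) (up to constants) is convex in theta. The data
# are simulated with an illustrative rate.
rng = np.random.default_rng(2)
x = rng.poisson(lam=4.0, size=500)

def neg_log_lik(theta):
    # convex in theta; constants (log x!) are dropped
    return x.size * np.exp(theta) - theta * x.sum()

res = minimize_scalar(neg_log_lik)
print("MLE of lambda via optimization:", np.exp(res.x))
print("closed-form MLE (sample mean):", x.mean())
```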

Lecture 6: Ordinary Least Squares and Gradient Descent

    Lecture 6 examines Ordinary Least Squares (OLS) optimization, focusing on the geometric perspective of its closed-form solution as an orthogonal projection onto a column space. The lecture transitions from theoretical foundations to practical computation, contrasting exact algebraic techniques like QR decomposition with iterative solvers. Specifically, it establishes strict convergence bounds for first-order gradient descent and demonstrates the single-step efficiency of second-order Newton's method for quadratic objectives, reinforcing these concepts through functional Python implementations.
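
    The sketch below (an illustrative reconstruction, not the lecture's code) contrasts an exact QR-based OLS solve with plain gradient descent on the same objective, and shows that a single Newton step from the origin already lands on the least-squares solution because the objective is quadratic.

```python
import numpy as np

# Minimal sketch contrasting an exact OLS solve (QR) with iterative gradient
# descent on the objective 0.5/n * ||y - X beta||^2. Data sizes, the noise
# level, and the step size are illustrative assumptions.
rng = np.random.default_rng(3)
n, p = 300, 4
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

# Exact solution via QR decomposition: solve R beta = Q^T y
Qmat, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Qmat.T @ y)

# Gradient descent on the same objective with step size 1/L
step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
beta = np.zeros(p)
for _ in range(1000):
    grad = X.T @ (X @ beta - y) / n
    beta -= step * grad

# Newton's method: from beta = 0, one step with the exact Hessian X^T X / n
# reaches the least-squares solution since the objective is quadratic.
H = X.T @ X / n
grad0 = -X.T @ y / n                          # gradient at beta = 0
beta_newton = np.zeros(p) - np.linalg.solve(H, grad0)

print("GD matches QR:", np.allclose(beta_qr, beta))
print("one Newton step matches QR:", np.allclose(beta_qr, beta_newton))
```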

Lecture 7: Gradient Descent: Theoretical Analysis

    Lecture 7 explores the theoretical convergence of gradient descent, demonstrating how different properties of an objective function impact algorithmic speed. The lecture highlights that a basic analysis provides weak guarantees, meaning stronger assumptions are necessary to prove efficiency. It establishes a clear hierarchy: while Lipschitz convex functions suffer from very slow convergence, introducing smoothness significantly accelerates the process. Ultimately, functions that are both smooth and strongly convex achieve rapid, linear convergence, drastically minimizing the total iterations needed to find the optimal solution.
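
    A small experiment along these lines: on an assumed smooth and strongly convex quadratic, the gradient descent iterates shrink toward the optimum by a constant factor per step, consistent with the linear rate.

```python
import numpy as np

# Minimal sketch of the linear (geometric) convergence of gradient descent on
# a smooth and strongly convex quadratic f(x) = 0.5 x^T Q x. The theory
# predicts that ||x_k - x*|| shrinks by a constant factor per iteration; Q and
# the iteration count are illustrative choices.
Q = np.diag([1.0, 10.0])          # mu = 1 (strong convexity), L = 10 (smoothness)
step = 1.0 / 10.0                 # step size 1/L
x = np.array([5.0, 5.0])
for k in range(1, 31):
    x = x - step * (Q @ x)        # gradient of f is Qx; the minimizer is x* = 0
    if k % 10 == 0:
        print(f"iteration {k:2d}, distance to optimum {np.linalg.norm(x):.2e}")
```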

Lecture 8: Line Search Methods: Gradient Descent Step Size Selection

    Lecture 8 shifts focus to the practical selection of step sizes in gradient descent. Because fixed step sizes risk either sluggish progress or unstable oscillations, the lecture introduces adaptive line search strategies. It moves past exact line search due to its high computational cost, emphasizing practical, inexact alternatives instead. The core discussion details the Wolfe conditions, which enforce both a sufficient decrease and adequate curvature, alongside the backtracking method that iteratively shrinks an initial trial step until these acceptance criteria are met. Supplementary code accompanies this lecture.
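
    The following is a minimal backtracking sketch enforcing the sufficient-decrease (Armijo) part of the conditions discussed above; the test function, constants, and starting point are illustrative assumptions, and this is not the lecture's supplementary code.

```python
import numpy as np

# Minimal sketch of backtracking line search with the sufficient-decrease
# (Armijo) condition. The constants c and rho, the test function, and the
# starting point are illustrative choices.
def backtracking(f, grad_f, x, c=1e-4, rho=0.5, t0=1.0):
    g = grad_f(x)
    t = t0
    # shrink t until f(x - t g) <= f(x) - c * t * ||g||^2
    while f(x - t * g) > f(x) - c * t * (g @ g):
        t *= rho
    return t

f = lambda x: np.exp(x[0] + x[1]) + x @ x            # smooth convex test function
grad_f = lambda x: np.exp(x[0] + x[1]) * np.ones(2) + 2 * x

x = np.array([1.0, 1.0])
for _ in range(50):
    t = backtracking(f, grad_f, x)
    x = x - t * grad_f(x)
print("approximate minimizer:", x, "objective:", f(x))
```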

Lecture 9: Projected GD, Coordinate descent, Proximal GD and SGD

    Lecture 9 explores four efficient extensions of standard gradient descent tailored for complex optimization challenges. The session covers Projected Gradient Descent for enforcing parameter constraints via projection steps, alongside Coordinate Descent, which reduces high-dimensional problems into cheaper, single-variable updates. To address composite objectives with non-smooth penalties, it introduces Proximal Gradient Descent, and concludes with Stochastic Gradient Descent, a method that accelerates large-scale training by substituting expensive full gradients with cost-effective, unbiased single-sample estimators. Supplementary code accompanies this lecture.
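
    As a hedged example of the composite case, the sketch below runs proximal gradient descent (ISTA) on a lasso objective with soft-thresholding as the proximal step; the problem sizes, penalty level, and step size are assumptions.

```python
import numpy as np

# Minimal sketch of proximal gradient descent (ISTA) for the composite lasso
# objective 0.5/n * ||y - X beta||^2 + lam * ||beta||_1. Problem sizes, lam,
# and the step size are illustrative assumptions.
rng = np.random.default_rng(4)
n, p, lam = 200, 10, 0.1
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]                      # sparse truth
y = X @ beta_true + 0.1 * rng.normal(size=n)

def soft_threshold(z, tau):
    # proximal operator of tau * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()    # 1/L for the smooth part
beta = np.zeros(p)
for _ in range(500):
    grad = X.T @ (X @ beta - y) / n                   # gradient of the smooth part
    beta = soft_threshold(beta - step * grad, step * lam)

print("estimated coefficients:", np.round(beta, 2))
```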

Lecture 10: Newton’s method

    Lecture 10 explores Newton's method as a powerful second-order technique for statistical optimization. The session begins with the 1-dimensional case, demonstrating how tangent lines are used to find zero gradients for function minimization. It then extends this approach to high-dimensional spaces, where it utilizes both gradient and curvature information—specifically the inverse of the Hessian matrix—to minimize a local second-order Taylor expansion of the objective function. The lecture concludes by establishing the method's theoretical guarantees, proving its rapid quadratic convergence rate under the conditions of bounded inverse and Lipschitz continuous Hessians.
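
    A minimal multivariate sketch of the update x <- x - H(x)^{-1} grad f(x) on an assumed smooth, strictly convex test function; the rapid decay of the gradient norm reflects the quadratic rate.

```python
import numpy as np

# Minimal sketch of Newton's method: repeatedly minimize the local quadratic
# model via x <- x - H(x)^{-1} grad(x). The test function below (a smooth,
# strictly convex toy example) and the starting point are illustrative.
def f(x):
    return np.exp(x[0]) + np.exp(-x[1]) + 0.5 * x @ x

def grad(x):
    return np.array([np.exp(x[0]), -np.exp(-x[1])]) + x

def hess(x):
    return np.diag([np.exp(x[0]), np.exp(-x[1])]) + np.eye(2)

x = np.array([2.0, -2.0])
for k in range(6):
    x = x - np.linalg.solve(hess(x), grad(x))    # Newton step
    print(f"iter {k + 1}: gradient norm {np.linalg.norm(grad(x)):.2e}")
```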

Lecture 11: Lagrange duality

    Lecture 11 explores Lagrange duality and the Karush-Kuhn-Tucker (KKT) conditions as fundamental theoretical frameworks for constrained statistical optimization. The session begins by defining the Lagrangian and its dual function, establishing the principles of weak and strong duality alongside the concept of a zero duality gap. It then details the KKT conditions—encompassing primal feasibility, dual feasibility, complementary slackness, and stationarity—as necessary and sufficient requirements for global optimality under convexity. The lecture concludes by demonstrating practical applications of these theoretical principles, utilizing the Lagrangian formulation to derive elegant closed-form solutions for projection onto the unit ball, simplex projection, and mirror descent updates via KL divergence.
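
    The closed forms mentioned above can be written in a few lines; the sketch below implements projection onto the Euclidean unit ball and the sort-based simplex projection obtained from the KKT system, applied to an illustrative test vector.

```python
import numpy as np

# Minimal sketches of two projections whose closed forms follow from the KKT
# conditions: projection onto the Euclidean unit ball and onto the probability
# simplex. The test vector is illustrative.
def project_unit_ball(v):
    # KKT solution: rescale only if v lies outside the ball
    return v / max(1.0, np.linalg.norm(v))

def project_simplex(v):
    # Sort-based solution of the KKT system for
    # min ||x - v||^2 subject to x >= 0, sum(x) = 1
    u = np.sort(v)[::-1]
    cssv = np.cumsum(u) - 1.0
    rho = np.nonzero(u > cssv / np.arange(1, len(v) + 1))[0][-1]
    theta = cssv[rho] / (rho + 1.0)               # Lagrange multiplier
    return np.maximum(v - theta, 0.0)

v = np.array([0.5, 2.0, -1.0])
print("unit-ball projection:", project_unit_ball(v))
print("simplex projection:", project_simplex(v))
```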

Lecture 12: Penalty and Augmented Lagrangian Methods

    Lecture 12 compares three methods for solving constrained optimization problems. While the dual-ascent method risks instability and the pure quadratic penalty method causes severe ill-conditioning at high values, the Augmented Lagrangian Method (ALM) elegantly combines their strengths. By pairing a stabilizing quadratic penalty with explicit multiplier updates, ALM achieves accurate, feasible solutions without the numerical drawbacks of extreme penalty parameters.
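
    A minimal ALM sketch on an assumed equality-constrained quadratic: each x-update minimizes the augmented Lagrangian exactly via a linear solve, and the multiplier update drives the iterates toward feasibility; A, b, and the penalty parameter are illustrative choices.

```python
import numpy as np

# Minimal sketch of the Augmented Lagrangian Method on the toy problem
# min 0.5 ||x||^2 subject to Ax = b. Each x-update minimizes the augmented
# Lagrangian exactly (a linear solve); A, b, and rho are illustrative.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])
rho = 10.0

lam = np.zeros(2)
x = np.zeros(3)
for k in range(20):
    # x-update: argmin_x 0.5||x||^2 + lam^T (Ax - b) + rho/2 ||Ax - b||^2
    x = np.linalg.solve(np.eye(3) + rho * A.T @ A, rho * A.T @ b - A.T @ lam)
    # multiplier update keeps pushing the iterates toward feasibility
    lam = lam + rho * (A @ x - b)

print("solution:", x)
print("constraint violation:", np.linalg.norm(A @ x - b))
```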

Lecture 13: Alternating Direction Method of Multipliers

    Lecture 13 introduces the Alternating Direction Method of Multipliers as a powerful technique for solving structured optimization problems. Specifically, it addresses scenarios where the main objective can be cleanly separated into two independent components that are linked by explicit linear equations rather than abstract constraints. While the traditional Augmented Lagrangian Method forces these components to be optimized simultaneously—which creates computational bottlenecks and destroys separability—this new approach resolves the issue by separating the optimization process into alternating steps. By updating one component at a time, the method successfully restores decomposability, making complex systems significantly easier to compute and implement in practice.
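
    As a hedged illustration, the sketch below applies ADMM to a lasso-style split, alternating a quadratic x-update, a soft-thresholding z-update, and a dual update; the data and the parameters lam and rho are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of ADMM on the lasso, split as f(x) = 0.5 ||Ax - b||^2 and
# g(z) = lam ||z||_1 with the linking constraint x - z = 0. Each block update
# is cheap on its own; A, b, lam, and rho are illustrative choices.
rng = np.random.default_rng(5)
n, p, lam, rho = 100, 8, 0.5, 1.0
A = rng.normal(size=(n, p))
b = A @ np.array([3.0, -2.0] + [0.0] * (p - 2)) + 0.1 * rng.normal(size=n)

x = z = u = np.zeros(p)
AtA, Atb = A.T @ A, A.T @ b
for _ in range(200):
    # x-update: quadratic subproblem, a single linear solve
    x = np.linalg.solve(AtA + rho * np.eye(p), Atb + rho * (z - u))
    # z-update: soft-thresholding, the proximal operator of the l1 penalty
    z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)
    # dual update on the scaled multiplier
    u = u + x - z

print("ADMM lasso estimate:", np.round(z, 2))
```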