Statistics 860 - Estimation of Functions from Data a.k.a. statistical machine learning.

T Th 4:00-5:15 Fall 2016, Room 133 SMI (MED SC CTR, 1300 University Ave)
Stat 709-10 NOT required.

Grace Wahba, Instructor

Short description:

1. Reproducing Kernel Hilbert Spaces from the point of view of supervised machine learning and statistical model building. Penalized likelihood, support vector machines and related regularization methods. Bayesian connections. The representer theorem.

2. Smoothing splines; thin plate splines, radial basis functions. ANOVA splines.

3. Degrees of freedom for signal and the bias-variance tradeoff, Bayesian confidence intervals; variable and model selection methods. Tuning methods: GCV, GACV, BGACV, unbiased risk, AIC, BIC and their properties, cross validation. Randomized trace estimates for df signal.

4. The LASSO PatternSearch algorithm. The partitioned LPS. Issues regarding prediction vs. sparse variable selection.

6. Regularized Kernel Estimation, Robust Manifold Unfolding, and Distance Correlation - Assimilation of pairwise distance/dissimilarity information into regression, classification, clustering and variable selection algorithms with attribute and other information. The Distance Covariance Variable Selection Theorem - open questions in pairwise distance methods and variable selection.

7. Multiple, complex input structures, multivariate correlated Bernoulli outcomes, soft classification, multicategory support vector machines.

8. Applications in risk factor analysis in medical data analysis with genetic, pedigree, covariate and other sources of information. Applications in data mining and machine learning. *Selected recent additions to the supervised machine learning literature.*

Prerequisites: Statistics majors: multivariate analysis, or some exposure to Hilbert spaces, or consent of instructor. Those unfamiliar with Hilbert spaces will be asked to read the first 33 pages of Akhiezer and Glazman, Theory of Linear Operators in Hilbert Spaces, vol. I, at the beginning of the course. Graduate students in Biostatistics, CS, AOS and other physical sciences, engineering, economics, animal science, political science, social science and business may find some of the techniques studied here useful. They are welcome to sit in, or to take the course for credit if they have exposure to linear algebra, sufficient math background to read Akhiezer and Glazman, and familiarity with the basic properties of the multivariate normal distribution, as found, e.g., in Anderson, Multivariate Analysis, or Wilks, Mathematical Statistics. Otherwise, the development will be self-contained. If in doubt, please contact the instructor by e-mail or come to the first class.

This will be a seminar-type course. There will be no sit-down exams. Students taking the course for credit will be expected to do several small computer projects studying the behavior of some of the methods discussed on simulated or experimental data, and one or two projects in an area of application of their choice. One possible project is the presentation of a lecture in class on a recent paper or recent research.

Text: Wahba: Spline Models for Observational Data, SIAM (1990), as well as selected papers, including some from recent conferences (e.g. NIPS, ICML, JSM).

NOTE: An online version of "Spline Models" is available through the SIAM e-books at the university library; search for "Spline" in the title.