SDS 293 - Machine Learning


Course Number SDS 293
Semester Fall 2017
Hours MW 9:00-10:20
Location McConnell 404

Instructor R. Jordan Crouser
Slack @jcrouser
Office Ford 355
Office Hours M 10:30am-noon & by appointment

Teaching Assistant Krishna Kompella
Slack @krishna

Discussion: sds293.slack.com


Course Description

The field of statistical learning encompasses a variety of computational tools for modeling and understanding complex data. In this introductory course, we will explore many of the most popular of these tools, such as sparse regression, classification trees, boosting and support vector machines. In addition to unpacking the mathematics underlying the computational methods, students will also gain hands-on experience in applying these techniques to real datasets using R.


Prerequisite: MTH 220 (or an equivalent introductory statistics course), or permission of the instructor.


Schedule

Date | Topic | Lab | Guest | Assignments
09-11 | Introduction to Machine Learning (p.1-28) | | |
09-13 | Evaluating Models (p.29-51) | R and RMarkdown Demo | |
09-18 | Simple and Multiple Linear Regression (p.59-82) | Introduction to Python | |
09-20 | Assumptions and Other Potential Problems (p.82-119) | Linear Regression (R, Python) | | A1 out
09-25 | Intro. to Classification, KNN (p.129-130, p.37-42, p.104-109) | K-Nearest Neighbors (R, Python) | |
09-27 | Logistic Regression (p.130-138) | Logistic Regression (R, Python) | | A2 out, A1 due
10-02 | Discriminant Analysis (p.138-150) | LDA/QDA | Ben Miller, MITLL |
10-04 | Classification Wrap-up (p.151-154) | Comparing methods | S. Chaitanya, UMass | A3 out, A2 due
10-09 | NO CLASSES - FALL BREAK | | |
10-11 | Resampling Methods (p.175-190) (PDF) | CV & bootstrap | | A4 out, A3 due
10-16 | Best Subset and Stepwise Selection (p.205-210) | Subset Selection | |
10-18 | Estimating Error w/ Cross-Validation (p.210-214) | Selection by CV | | A5 out, A4 due
10-23 | Ridge Regression and the Lasso (p.214-228) | RR & the Lasso | |
10-25 | PCR and PLS (p.228-244) | Dimension Reduction | | A6 out, A5 due
10-30 | Machine Learning in the Wild | | R. Caceres, MITLL |
11-01 | Final Project Workshop I | | | FP1 out, A6 due
11-06 | Polynomial Regression and Step Functions (p.265-270) (PDF) | Polynomials & Step Functions | |
11-08 | Splines and GAMs (p.271-287) | Splines & GAMs | | A7 out, FP1 due
11-13 | Decision and Classification Trees (p.303-316) | | |
11-15 | Bagging, Random Forests, and Boosting (p.316-324) (PDF) | Decision Trees | | FP2 out, A7 due
11-20 | NO CLASSES - Jordan Sick | | |
11-22 | NO CLASSES - THANKSGIVING | | |
11-27 | Maximal Margin and Support Vector Classifiers (p.337-355) | FP Workshop II | | FP3 out, FP2 due
11-29 | Multiclass SVMs (p.355-359) | SVMs for Classification | |
12-04 | K-Means and Hierarchical Clustering (p.385-401) | Clustering | | A8 out, FP3 due
12-06 | Neural Networks | | G. Grinstein | FP3 due
12-11 | Advanced Topics | | | A8 due
12-13 | Final Project Demonstrations | | |


Assignments and Deliverables

A problem set will be assigned at the end of each section, for a total of 8 assignments. Each problem set is due the following week. The course will culminate in a final project applying statistical learning techniques to a dataset of your choice.


Late submissions will be assessed a penalty of 10% per day. Extensions must be requested 48 hours in advance or be accompanied by notification from the student's class dean.



Labs

To help students gain hands-on experience in applying statistical learning techniques, this course will include many in-class lab sessions. The labs will be conducted primarily in R, with some supplemental Python exercises at the instructor's discretion. Students are encouraged to work in pairs during these labs.


Lab responses are due 24 hours after the lab is released.



Resources

RStudio is great for statistical analysis.

Python is useful for data ingest, cleaning, formatting, and general wrangling.

Students enrolled in this course have free, unlimited access to DataCamp, generously provided by DataCamp for the Classroom.

The Spinelli Center for Quantitative Learning is a great place to get help brushing up on stats.


Required Reading
R1 An Introduction to Statistical Learning with Applications in R
by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
Free (pdf) | (supplemental material)


Grading

Assignments 40%
Labs 20%
Final Project 20%
Class Participation 20%
Total 100%
Note that the final grade is based on my judgment of your work. Although the grade will be largely based on the percentages shown above, I will give extra credit for excellent work and out-of-the-box thinking. Similarly, while "class participation" is somewhat subjective and not one-size-fits-all, I will take note of contributions in class that demonstrate intellectual curiosity or a clear understanding of a topic, as well as comments that help others in class learn a difficult concept.


Accommodation

Smith is committed to providing support services and reasonable accommodations to all students with disabilities. To request an accommodation, please register with the Disability Services Office at the beginning of the semester. To do so, call (413) 585-2071 to arrange an appointment with Laura Rauscher, Director of Disability Services.


Acknowledgement

Some of the materials used in this course are derived from lectures, notes, or similar courses taught elsewhere. Appropriate references will be included on all such material.