Machine Learning and Computation Statistics

This is a teaching partnership between the Africa Center of Excellence in Data Science at the University of Rwanda and the Institute for Applied Computational Science at Harvard University.

View the Project on GitHub onefishy/Rwanda-Data-Science

Course Syllabus

What is this Course?

This course covers the fundamental concepts of machine learning, including classification, regression, dimensionality reduction, clustering and elements of deep learning. The course has a dual focus: the key mathematical concepts underlying machine learning models, and the application of machine learning algorithms to real-world data science problems. It serves as a foundation for students who are interested in advanced courses and further independent study in machine learning and data science.

Learning Outcomes

After successful completion of this course, you will be able to:

Implement basic machine learning models for regression, classification, clustering and data completion tasks using Python libraries.

Evaluate the usefulness and appropriateness of your models using a range of metrics.

Interpret your models in terms of the task and application domain.

Map machine learning models and methods to meaningful real-life data science problems.

Make use of a number of free online tools for enhancing and advancing your study of data science.

Course Format

This course follows a flipped classroom structure. The lectures are pre-recorded and made available at the beginning of each week. Students are expected to watch the relevant lecture videos and study the lecture materials before the class meeting. Each class meeting will consist of 1) a discussion portion where students discuss the materials that they have studied, and 2) a practical exercise portion where students work in small teams on a coding exercise applying the concepts from lecture/readings to a real dataset. Students are expected to participate actively in both the class discussion and the practical exercise. You will not be able to complete the exercise if you do not study the lecture videos and materials before class!

The pre-recorded lectures will be broken into segments. After each segment there will be a short concept quiz based on the lecture material.

The in-class practical exercises will be collected at the end of each class and graded. Students have the option to resubmit the exercise with improved solutions before the next class. The re-submission grade will be averaged with the original grade for each practical exercise.

There will be two homework assignments in the form of Kaggle competitions, where students build predictive models for a single dataset. For full credit, each student must beat the performance of the baseline model (the baseline performance will be provided for each competition). The top three scoring students will receive small scholarship prizes.

Graded Components

Course Materials

You are encouraged to study from “An Introduction to Statistical Learning” for the theoretical portion of the course (this textbook is free for you to download). For implementation, you are encouraged to refer to “Hands on Machine Learning” and “Python for Data Analysis”.

Class meetings will be virtual and facilitated by Zoom. Students are expected to be familiar with the Zoom interface by the start of the course.

All in-class exercises will be completed in Google Colab. Students are expected to have a Google account and be familiar with the Colab interface by the start of the course. You can find tutorials on Python, Colab and the various Python libraries that we’ll need for the course in the “Preparing for the Course” section of the course Moodle.

Homework: Kaggle Competitions

There are two homework assignments in the form of Kaggle competitions. Kaggle is a platform for data scientists around the world to collaborate in solving difficult data science challenges. Datasets are often published and made publicly available on Kaggle; users are invited to submit models, and the best models are selected based on a pre-determined metric. Students in this course will be expected to participate in two Kaggle competitions, one in the middle of the course and one at the end. You will download a dataset we specify, build predictive models for this data and submit your model to Kaggle to be evaluated. To receive full credit, students must submit their model (in the form of a Colab notebook) and must beat the performance of a basic model (the baseline for each competition will be made available).
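To make "beating the baseline" concrete, here is a minimal sketch. It assumes a regression task scored by mean squared error on held-out data; the synthetic data, the mean-prediction baseline, the fitted line and the metric are all assumptions made for illustration. Each competition will specify its own dataset, baseline and metric.

```python
import numpy as np

# Synthetic regression data, made up for illustration only.
rng = np.random.default_rng(seed=0)
x = rng.uniform(0.0, 10.0, size=200)
y = 3.0 * x + 1.0 + rng.normal(scale=2.0, size=x.shape)

# Hold out the last 50 points for evaluation.
x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]


def mean_squared_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Average squared difference between targets and predictions."""
    return float(np.mean((y_true - y_pred) ** 2))


# Assumed baseline: always predict the mean of the training targets.
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_mse = mean_squared_error(y_test, baseline_pred)

# Candidate model: least-squares line fitted on the training split only.
slope, intercept = np.polyfit(x_train, y_train, deg=1)
model_mse = mean_squared_error(y_test, slope * x_test + intercept)

print(f"baseline MSE: {baseline_mse:.2f}")
print(f"model MSE:    {model_mse:.2f}")
print("beats baseline:", model_mse < baseline_mse)
```

The point of the sketch is the comparison itself: both models are scored with the same metric on the same held-out data, and the candidate counts as an improvement only if its score is better than the baseline's.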

In-Class Exercises

Students will work in small teams to complete an in-class practical exercise during each class meeting. These exercises focus on model implementation using Python libraries, model analysis and interpretation. In each exercise, students will be asked to apply concepts they’ve learned in lecture to a real dataset. At the end of each class, these exercises will be collected and graded. Students have the option to improve their answers and resubmit before the next class meeting. If you choose to resubmit, the grade of the re-submission will be averaged with your original grade.
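As a small illustration of the resubmission rule, the sketch below averages an original grade with a re-submission grade. The function name `final_exercise_grade` and the 0-100 scale are assumptions made for this example; the syllabus does not specify a grading scale.

```python
from typing import Optional


def final_exercise_grade(original: float, resubmission: Optional[float] = None) -> float:
    """Return the recorded grade for one in-class exercise.

    Hypothetical helper: with no re-submission the original grade stands;
    otherwise the two grades are averaged, as described above.
    """
    if resubmission is None:
        return original
    return (original + resubmission) / 2


# Example on an assumed 0-100 scale.
print(final_exercise_grade(70.0))        # no re-submission
print(final_exercise_grade(70.0, 90.0))  # average of 70 and 90
```

Because the two grades are averaged, a re-submission raises the recorded grade only if it scores higher than the original.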

Concept Quizzes

Before each class meeting, students are expected to watch a set of lecture videos or read a set of relevant materials. After each video or reading there will be a short quiz on the key concepts covered in the lecture or reading. These quizzes must be completed before the class meeting.

Participation

Students are expected to participate actively during class meetings 1) by asking questions about the lecture or reading materials, and 2) by working productively in a team to complete the in-class practical exercises. Each student receives a participation score of 1 to 5 for each class meeting, with 5 being Excellent and 1 being Unsatisfactory. If a student is absent from a class meeting without notifying the instructors in advance, they will receive a 0 for participation.

Expectations and Policies

Students are expected to have a working knowledge of Python programming. In particular, familiarity with the Python libraries pandas, numpy and matplotlib is assumed. Students should be comfortable manipulating numpy arrays and pandas dataframes, and should be familiar with the basic plotting methods of matplotlib (histogram, line chart, bar chart, scatter plot, shaded line chart). You can prepare or refresh your skills for this course by following the tutorials in the “Preparing for the Course” section of the course Moodle.
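As an informal self-check of these prerequisites, you might try reading the short sketch below and predicting what each line does before running it. It is not an official course exercise; the synthetic data and all variable names are made up for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data, made up for this self-check only.
rng = np.random.default_rng(seed=0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + rng.normal(loc=0.0, scale=1.0, size=x.shape)

# numpy: boolean masking followed by a reduction.
mean_y_upper_half = y[x > 5.0].mean()

# pandas: build a dataframe, add a derived column, aggregate by group.
df = pd.DataFrame({"x": x, "y": y})
df["group"] = np.where(df["x"] > 5.0, "high", "low")
group_means = df.groupby("group")["y"].mean()

# matplotlib: one panel for each of the five plot types listed above.
fig, axes = plt.subplots(1, 5, figsize=(20, 3))
axes[0].hist(y, bins=10)
axes[0].set_title("histogram")
axes[1].plot(x, y)
axes[1].set_title("line chart")
axes[2].bar(group_means.index, group_means.values)
axes[2].set_title("bar chart")
axes[3].scatter(x, y)
axes[3].set_title("scatter plot")
axes[4].plot(x, 2.0 * x)
axes[4].fill_between(x, 2.0 * x - 1.0, 2.0 * x + 1.0, alpha=0.3)
axes[4].set_title("shaded line chart")
fig.tight_layout()
```

If any of these calls are unfamiliar, that is a good signal for which tutorial to work through first.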

Late work will not be accepted and absences will not be excused, except in exceptional situations.

Help for the Course

The official office hour for the course is held on Wednesday from 2pm to 5pm (Kigali time). During this time, you are encouraged to ask questions about class materials or machine learning concepts. You are also expected to participate actively on the course discussion forum on Piazza, where you may post questions about homework or in-class exercises and your peers and instructors will answer them. You are welcome to contact the instructors by writing to the official course email, but the fastest way to have your questions answered is through office hours and Piazza.