WiDS Cambridge Datathon 2021

The WiDS Cambridge Datathon Workshop is an annual workshop preceding the WiDS Cambridge Conference. The workshop aims to provide mentorship and training for those interested in participating in the WiDS Datathon Challenge, and, more generally, anyone with a strong interest in data science.

View the Project on GitHub onefishy/wids_datathon

Preparing for the Workshop

Because the 2021 Datathon workshop will take place virtually, pedagogical materials covering the technical background needed for the Datathon challenge will be delivered through pre-recorded videos and structured notebooks. We’ve prepared instructional materials at a range of depths. We ask that participants review the materials that complement their backgrounds before coming to the workshop. Depending on your background, you may find that the materials cover concepts you already know, or that only some of them cover concepts that are new to you. The best way to see if you need to brush up on a concept is to take the concept quiz (which follows every topic) or run through the exploratory Deepnote notebooks (which follow every 2-3 topics).

About the WiDS 2021 Datathon

This year, participants will again work with the WiDS 2020 Datathon data from MIT’s GOSSIS (Global Open Source Severity of Illness Score) dataset. Last year, the challenge was to predict patient survival; this year, the WiDS Datathon will focus on creating classifiers to determine whether patients have a type of diabetes, a determination that could potentially inform later treatment in the ICU.

The WiDS Datathon is hosted on Kaggle, an online community of data scientists.
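
As a rough illustration of the kind of task described above (building a classifier for a yes/no label), here is a minimal sketch using scikit-learn. It is not the official starter code: the data is synthetic, the two features and the rule that generates the labels are made up, and logistic regression is just one possible model choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)

# Synthetic stand-in for patient data (NOT the GOSSIS dataset):
# two made-up numeric features per patient.
n_patients = 500
X = rng.normal(size=(n_patients, 2))

# Made-up rule: the label is more likely to be 1 when the first feature is large.
logits = 2.0 * X[:, 0] - 1.0
y = (rng.random(n_patients) < 1 / (1 + np.exp(-logits))).astype(int)

# Hold out a quarter of the rows to check the model on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Probability of the positive class for each held-out patient.
predicted_probability = model.predict_proba(X_test)[:, 1]

# Fraction of held-out patients classified correctly.
test_accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {test_accuracy:.2f}")
```

With the real data, `X` and `y` would come from the competition files instead of a random number generator, and most of the work would go into cleaning and selecting features.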

Resources for Exploring and Understanding the Data

The following are resources for getting started with Kaggle competitions and with working with this year’s data:

  1. Getting Started with Kaggle (Video)
  2. Getting Started with the Datathon Challenge: Diabetes Prediction for ICU Patients (Video)
  3. An Introduction to Exploratory Data Analysis (Notebook)

The following is starter code for this year’s datathon challenge:

  1. WiDS Global 2021 Starter Code (Notebook)
  2. Video Walk through of the Official WiDS Cambridge 2021 Starter Code (Video)
  3. The Official WiDS Cambridge 2021 Starter Code (Notebook)

Important: We ask that all participants work through the Video Walk through of the Official WiDS Cambridge 2021 Starter Code and the Official WiDS Cambridge 2021 Starter Code notebook prior to attending the workshop meetings.

Technical Background for the Datathon Challenge

I. Programming

It would be helpful for participants to have some familiarity with Python programming. You can familiarize yourself with Python by completing online tutorials, for example: https://www.learnpython.org. In particular, it’s helpful for participants to be able to manipulate data using pandas DataFrames, perform basic operations with numpy arrays and make use of basic plotting functions (e.g. line chart, histogram, scatter plot, bar chart) from matplotlib:

  1. pandas Basics
  2. numpy Basics
  3. matplotlib Basics
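
To give a sense of the level we have in mind, here is a small sketch that touches all three libraries. The table is toy data: the column names (`age`, `bmi`, `has_diabetes`) and values are made up for illustration and are not taken from the datathon dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy data: column names and values are invented for this example.
df = pd.DataFrame({
    "age": [34, 51, 67, 45, 72, 29],
    "bmi": [22.1, 30.4, 27.8, np.nan, 33.0, 24.5],
    "has_diabetes": [0, 1, 0, 0, 1, 0],
})

# pandas: fill the missing value with the column median, then filter rows.
df["bmi"] = df["bmi"].fillna(df["bmi"].median())
older = df[df["age"] > 50]

# pandas: group rows by the label and average a column within each group.
mean_bmi_by_label = df.groupby("has_diabetes")["bmi"].mean()

# numpy: vectorized arithmetic on an array (no explicit loop needed).
ages = df["age"].to_numpy()
standardized_ages = (ages - ages.mean()) / ages.std()

# matplotlib: a histogram and a scatter plot side by side.
fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))
ax_hist.hist(df["bmi"], bins=5)
ax_hist.set_xlabel("bmi")
ax_hist.set_ylabel("count")
ax_scatter.scatter(df["age"], df["bmi"], c=df["has_diabetes"])
ax_scatter.set_xlabel("age")
ax_scatter.set_ylabel("bmi")
fig.tight_layout()
```

If each line here looks familiar, you can probably skim the three tutorials above; if not, they are a good place to start.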

For this workshop, we will be using Deepnote, a free cloud computing service that comes with pre-installed machine learning tools. Deepnote lets you work on your data science projects with friends and colleagues in real time and in one place. It also lets you create and share documents that contain live code, equations, visualizations and narrative text. You can familiarize yourself with the interface of Deepnote notebooks by reading the following tutorials (remember, you don’t need to install anything!):

  1. Deepnote: the modern way to teach Data Science
  2. Deepnote: Documentation

Finally, we will be working with random variables in this workshop and reasoning about them through their distributions, but we’ll only need a bit of familiarity with these concepts:

  1. Probability distributions in python
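
As a quick self-check on these ideas, the sketch below defines one continuous and one discrete random variable with `scipy.stats`, evaluates their distributions, and draws samples. The specific distributions and parameters (a normal with mean 170 and standard deviation 10, a Bernoulli with p = 0.2) are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# A continuous random variable: normal with mean 170 and standard deviation 10.
height = stats.norm(loc=170, scale=10)
density_at_mean = height.pdf(170)      # value of the density at 170
prob_below_180 = height.cdf(180)       # P(height <= 180), one std above the mean
height_samples = height.rvs(size=10_000, random_state=rng)

# A discrete random variable: Bernoulli with success probability 0.2.
label = stats.bernoulli(p=0.2)
prob_of_one = label.pmf(1)             # P(label = 1)
label_samples = label.rvs(size=10_000, random_state=rng)

# With many samples, empirical averages should land close to the true means.
print(f"P(height <= 180)   = {prob_below_180:.3f}")
print(f"sample mean height = {height_samples.mean():.1f}")
print(f"fraction of ones   = {label_samples.mean():.3f}")
```

The point to take away is the distinction between a distribution (the `pdf`, `pmf` and `cdf` calls) and samples drawn from it (the `rvs` calls).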

Skills Check for Participants

You might find the following Deepnote notebook useful for getting a sense of the types of computational tasks we will perform during the workshop:

Introduction to Deepnote Notebook and python Libraries for Data Science

Instructions:

  1. Open the notebook. In the drop-down menu next to the project name, 0-preparing-for-the-workshop, choose Duplicate to copy the project. The copy will open in the same tab.
  2. If you want to share your work with others: under ‘Share’ (upper right-hand side of your screen), change the permissions to “Public access: On” and share the link, or invite specific collaborators.

II. Machine Learning

The materials are broken down into a sequence of bite-sized concepts. Each concept is introduced in a short 10-20 minute video; following each video, there is a short concept-check quiz for viewers to test their understanding. For each topic, we have selected some supplementary readings that may be helpful. After every two or three topics, there is a Deepnote notebook with starter code and experiments to help you explore these topics further.

These materials are taken from DSC6232 Machine Learning and Computational Statistics, an intensive summer data science course run by IACS and the University of Rwanda. You can find the complete set of lecture materials on the course website.

  1. Topics on Regression
  2. Topics on Uncertainty, Variance and Bias
  3. Topics on Classification: Logistic Regression
  4. Topics on Classification: Additional Models
  5. Topics on Neural Networks for Regression
  6. Topics on Neural Networks for Classification
  7. Topics on Transforming and Manipulating Data