Since the 2021 Datathon workshop will be happening virtually, pedagogical materials addressing the technical background necessary for the Datathon challenge will be delivered through pre-recorded videos and structured notebooks. We’ve prepared instructional materials at a range of depths. We ask that participants review the materials that complement their backgrounds before coming to the workshop. Depending on your background, you may find that many of the materials cover concepts you already know and that only some introduce ideas that are new to you. The best way to see if you need to brush up on a concept is to take the concept quiz (which follows every topic) or run through the exploratory Deepnote notebooks (which follow every 2-3 topics).
This year, participants will work again with the WiDS 2020 Datathon data from MIT’s GOSSIS (Global Open Source Severity of Illness Score) dataset. Last year, the challenge was to predict patient survival; this year, the WiDS Datathon will focus on creating classifiers to determine whether patients have a type of diabetes, information that could potentially inform later treatment in the ICU.
The WiDS Datathon is hosted on Kaggle, an online community of data scientists.
The following are resources for getting started with Kaggle competitions and with working with this year’s data:
The following is starter code for this year’s datathon challenge:
It would be helpful for participants to have some familiarity with Python programming. You can familiarize yourself with Python by completing online tutorials, for example: https://www.learnpython.org. In particular, it’s helpful for participants to be able to manipulate data using DataFrames, perform basic operations with numpy arrays, and make use of basic plotting functions (e.g. line chart, histogram, scatter plot, bar chart).
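As a rough self-check, the short sketch below touches each of the skills mentioned above: filtering and extending a DataFrame, basic array operations, and simple plots. It assumes pandas, numpy, and matplotlib are available (the choice of matplotlib for plotting is an assumption made for this example), and the column names and values are invented toy data, not columns from the datathon dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy data invented for illustration; NOT taken from the datathon dataset.
df = pd.DataFrame({
    "age": [34, 51, 67, 45, 72, 29],
    "heart_rate": [72.0, 88.0, 95.0, 80.0, 101.0, 66.0],
})

# DataFrame manipulation: keep rows matching a condition, then add a derived column.
older = df[df["age"] > 40].copy()
older["heart_rate_centered"] = older["heart_rate"] - older["heart_rate"].mean()

# Basic numpy operations on the values of one column.
rates = df["heart_rate"].to_numpy()
mean_rate = rates.mean()
max_rate = rates.max()

# Basic plotting: a histogram and a scatter plot side by side.
fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))
ax_hist.hist(rates, bins=5)
ax_hist.set_xlabel("heart rate")
ax_scatter.scatter(df["age"], df["heart_rate"])
ax_scatter.set_xlabel("age")
ax_scatter.set_ylabel("heart rate")
fig.tight_layout()
```

If every line here reads naturally to you, you likely have enough Python background for the workshop; if not, the tutorial linked above is a good place to start.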
For this workshop, we will be using Deepnote, a free cloud computing service that comes with pre-installed machine learning tools. Deepnote lets you work on data science projects with friends and colleagues in real time and in one place, and lets you create and share documents that contain live code, equations, visualizations and narrative text. You can familiarize yourself with the interface of Deepnote notebooks by reading the following tutorials (remember you don’t need to install anything!):
Finally, we will be working with random variables in this workshop and will be reasoning about them through their distributions. But we’ll only need a bit of familiarity with these concepts:
You might find the following Deepnote notebook useful to get a sense of the types of computational tasks that we will be using during the workshop: 0-preparing-for-the-workshop. To work with the notebook, choose to Duplicate the project. The copy will open in the same tab.
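As one small illustration of reasoning about a random variable through its distribution, the sketch below draws samples from a normal distribution and compares sample statistics with the parameters used to generate them. The distribution, its parameters, and the seed are arbitrary choices made for this example, not values taken from the workshop materials.

```python
import numpy as np

# Arbitrary, illustrative parameters for a normal random variable.
true_mean = 2.0
true_std = 1.5

# A seeded generator makes the draw reproducible.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=true_mean, scale=true_std, size=10_000)

# Sample statistics should land close to the generating parameters.
sample_mean = samples.mean()
sample_std = samples.std(ddof=1)

# Empirical estimate of P(X < true_mean); for a symmetric distribution this is about 0.5.
frac_below_mean = np.mean(samples < true_mean)

print(f"sample mean: {sample_mean:.3f}, sample std: {sample_std:.3f}")
print(f"fraction below the true mean: {frac_below_mean:.3f}")
```

Increasing `size` should bring the sample statistics closer to the generating parameters, which is the kind of intuition about distributions that is sufficient for the workshop.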
The materials will be broken down into a sequence of bite-sized concepts. Each concept will be introduced in a short 10-20 minute video; following each video, there will be a short concept-check quiz for the viewer to test their understanding. For each topic, we have selected some supplementary readings that may be helpful. After every two or three topics, there will be a Deepnote notebook with starter code and experiments to help you further explore these topics.
These materials are taken from DSC6232 Machine Learning and Computational Statistics, an intensive summer data science course run by IACS and the University of Rwanda. You can find the complete set of lecture materials on the course website.