Weeks 2 and 3 at Metis

Jan 29, 2016    

Week 3 of the Metis Data Science Bootcamp wrapped up today with presentations of our second project. My project on movie data will be featured in my next post, but here’s what else we’ve done in the past 2 weeks, including linear regression, web scraping, and statistics…

Supervised Learning

Week 2 kicked off with an immediate deep dive into supervised learning: modeling a data set against defined targets. With a well-trained model, we can then predict the outcome for new inputs and test those predictions.

The first topic was regression:

Linear Regression

I have some experience with linear regression, and its concepts are easy enough to visualize with 1 or 2 features. But I really appreciated that the class emphasized the assumptions and shortcomings of linear regression - such as the bias-variance tradeoff, overfitting, the curse of dimensionality, and the concern of heteroscedasticity (non-constant error variance) in our data.

As with any machine learning algorithm, splitting our data into training and test sets is critical. We looked at cross-validation with leave-one-out (LOOCV), k-fold, and random sampling as ways to improve the testing of our linear regression model.
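To make the idea concrete, here is a minimal sketch of k-fold cross-validation in plain Python, with leave-one-out as the special case where k equals the number of data points. The helper names (`k_fold_indices`, `fit_line`, `cross_val_mse`) and the toy data are made up for illustration; this is not the code we used in class.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def cross_val_mse(xs, ys, k):
    """Average held-out mean squared error over k folds."""
    fold_errors = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        # Fit only on the points that are NOT in the current fold
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        b0, b1 = fit_line(train_x, train_y)
        # Score on the held-out fold
        errs = [(ys[i] - (b0 + b1 * xs[i])) ** 2 for i in test_idx]
        fold_errors.append(sum(errs) / len(errs))
    return sum(fold_errors) / k

# Made-up toy data, roughly y = 2x + 1 plus a little noise
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.2, 16.9, 19.1, 21.0]

print("5-fold MSE:", cross_val_mse(xs, ys, k=5))
print("LOOCV MSE: ", cross_val_mse(xs, ys, k=len(xs)))  # leave-one-out is k = n
```

Every point is used for testing exactly once, so the averaged error is a fairer estimate than a single train/test split on a small data set.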

To find the best features for our linear regression model, we learned several regularization methods such as lasso and ridge regression.
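A rough way to see the difference between the two penalties is the single-feature case, where both have a closed-form answer. The sketch below assumes centered data, no intercept, and the objectives written in the docstrings; the function names and numbers are made up for illustration.

```python
def ridge_coef(xs, ys, lam):
    """Minimizer of sum((y - b*x)^2) + lam * b^2 for a single coefficient b."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)

def lasso_coef(xs, ys, lam):
    """Minimizer of 0.5 * sum((y - b*x)^2) + lam * |b| for a single coefficient b."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Soft-thresholding: shrink toward zero, and clip small effects to exactly zero
    if sxy > lam:
        return (sxy - lam) / sxx
    if sxy < -lam:
        return (sxy + lam) / sxx
    return 0.0

# Made-up, centered toy data: one informative feature and one weak one
strong_x = [-2, -1, 0, 1, 2]
weak_x = [2, -1, -2, 0, 1]
y = [-4.1, -1.8, 0.1, 2.1, 3.7]

for name, feature in [("strong", strong_x), ("weak", weak_x)]:
    print(name,
          "ridge:", round(ridge_coef(feature, y, lam=5.0), 3),
          "lasso:", round(lasso_coef(feature, y, lam=5.0), 3))
```

Ridge shrinks both coefficients but keeps them nonzero, while lasso sets the weak feature's coefficient to exactly zero, which is why lasso is the one usually associated with feature selection.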

What I’ve noticed while learning to apply linear regression with Python is a lack of documentation and examples that fully explain how to do it. So I plan on writing a few articles with examples of applying linear regression with the Python Statsmodels package.

Statistics and Probability

We also began studying statistics and probability. Starting with a review of basic and continuous probability, we moved from frequentist statistics to the interesting realm of Bayesian statistics with conditional probability and Bayes’ theorem.
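As a quick illustration of Bayes’ theorem, P(H | E) = P(E | H) P(H) / P(E), here is a small sketch. The scenario and all the numbers are hypothetical, chosen only to show how a rare event stays fairly unlikely even after positive evidence.

```python
def posterior(p_h, p_e_given_h, p_e_given_not_h):
    """Return P(H | E) given P(H), P(E | H), and P(E | not H)."""
    # Total probability of seeing the evidence at all
    p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
    return p_e_given_h * p_h / p_e

# Hypothetical numbers: 1% of movies are hits; a classifier flags
# 90% of the hits but also 5% of the non-hits.
p = posterior(p_h=0.01, p_e_given_h=0.90, p_e_given_not_h=0.05)
print(f"P(hit | flagged) = {p:.3f}")
```

Even with a 90% detection rate, a flagged movie is a hit only about 15% of the time here, because non-hits vastly outnumber hits.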

On the statistics side, we explored null hypothesis testing to determine whether an outcome is statistically significant or could plausibly have happened by random chance. Now armed with the tools to measure the significance of our data science models, we can immediately apply them to our project.
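One simple way to ask "could this have happened by chance?" is a permutation test, sketched below in plain Python. This is just one kind of null hypothesis test and not necessarily the one we covered in class; the function name and the ratings are made up for illustration.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    as_extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(mean_a - mean_b) >= observed:
            as_extreme += 1
    return as_extreme / n_permutations

# Made-up ratings for two groups of movies
ratings_a = [7.9, 8.1, 7.5, 8.4, 7.8, 8.0]
ratings_b = [6.1, 6.5, 5.9, 6.8, 6.2, 6.4]

p_value = permutation_p_value(ratings_a, ratings_b)
print("p-value:", p_value)
```

A small p-value means that randomly relabeled groups almost never show a gap as large as the observed one, so the difference is unlikely to be due to chance alone.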

Project 2 - Linear Regression

For this project, codenamed “Project Luther” after the dedicated TV detective, we scraped and analyzed movie data to answer an interesting question. Instead of providing everyone with a clean set of data to work with, we were tasked with formulating our own hypothesis and then individually collecting the data that we needed for our analysis.

This was really cool as it let us experience what data scientists have to do to collect, wrangle, and clean data before it can be used effectively to build models.

Here’s a write up of my project: The Roger-Ebertron.

Web Scraping with Beautiful Soup and Selenium

Fortunately, Metis did show us how to gather data by scraping the web. After a crash course in HTML and CSS, we saw live demonstrations of Python tools that make our lives much easier: Beautiful Soup for parsing HTML and Selenium for web automation.
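Here is a minimal sketch of the Beautiful Soup side of this, assuming the `bs4` package is installed. The HTML fragment, class names, and movie data are made up; a real page would be fetched first and would have its own structure.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment standing in for a scraped page
html = """
<table id="movies">
  <tr class="movie"><td class="title">Movie A</td><td class="gross">$1,200,000</td></tr>
  <tr class="movie"><td class="title">Movie B</td><td class="gross">$350,000</td></tr>
</table>
"""

# "html.parser" is the parser that ships with Python, so no extra install is needed
soup = BeautifulSoup(html, "html.parser")

movies = []
for row in soup.find_all("tr", class_="movie"):
    title = row.find("td", class_="title").get_text(strip=True)
    gross_text = row.find("td", class_="gross").get_text(strip=True)
    # Turn "$1,200,000" into the integer 1200000
    gross = int(gross_text.replace("$", "").replace(",", ""))
    movies.append({"title": title, "gross": gross})

print(movies)
```

Most of the work in practice is finding the right tags and classes to search for, then cleaning the extracted strings into usable numbers.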

By the end of these 2 weeks, my web scraping skills have improved by leaps and bounds. It also helps enormously when web designers label their HTML and CSS elements well.

However, I did find the learning curve of Beautiful Soup pretty steep, and I could not find an easy tutorial online. So I’ll try to provide some concrete examples on my blog.

Minimum Viable Product

Web scraping did take most students a while. After a session on creating a Minimum Viable Product, we were tasked on the same day with creating a presentation for our project based on what we had gathered so far. That was a challenging but ultimately beneficial exercise: it helped me organize what I had so far and check whether my question was actually interesting and presentable.

Here’s a post on my MVP.

Cool Stuff:

  • Guest speaker Liv Buli of Next Big Sound came to speak to us on Thursday 1/28 on “Story Telling + Data Science + Data Visualization = Data Journalism”. It was extremely insightful, and I discuss her talk and my thoughts in this post.
  • Metis Open House was this Tuesday night, and I had a chance to meet some Metis alumni to see what they’re doing now, and to share my experiences so far with prospective students.
  • Stats Worksheet: Our TA Reshma gave us some excellent resources in statistics and linear regression, including a worksheet where we got to walk through a simple regression problem with pencil and a calculator to see each step of the process. We had to manually compute the regression beta values, find the residuals, calculate our R-squared values, and then test our model for significance. It was extremely helpful for understanding regression.
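The worksheet steps above (betas, residuals, R-squared, significance) can be sketched in a few lines of plain Python. The five data points are made up and are not the worksheet's actual numbers; the critical value is the usual two-sided 5% value from a t table for 3 degrees of freedom.

```python
import math

# Made-up worksheet-sized data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Step 1: regression betas from the sums of squares
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1 = sxy / sxx               # slope
b0 = mean_y - b1 * mean_x    # intercept

# Step 2: residuals (observed minus fitted)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Step 3: R-squared = 1 - SSE / SST
sse = sum(r ** 2 for r in residuals)
sst = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - sse / sst

# Step 4: t-test for the slope against the null hypothesis b1 = 0
std_err_b1 = math.sqrt(sse / (n - 2) / sxx)
t_stat = b1 / std_err_b1
t_critical = 3.182  # two-sided, alpha = 0.05, n - 2 = 3 degrees of freedom (t table)
significant = abs(t_stat) > t_critical

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, R^2 = {r_squared:.2f}")
print(f"t = {t_stat:.3f}, significant at 5%: {significant}")
```

With only five points, even a slope that explains 60% of the variance does not clear the 5% significance bar, which is a nice reminder of how much sample size matters.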

Concerns:

  • Just one area that I recommended to the Metis leaders: there’s no expectation of linear algebra for the course, and some students had not seen the summation symbol before. When I started learning data science on my own, I found linear algebra to be indispensable in understanding what happens behind some machine learning models. I’d recommend including a crash course in at least linear algebra notation, perhaps even in the pre-work before new students start the bootcamp.

As always, I try to share my work on my GitHub repository. There you can find my pair programming work, investigations, and projects (when they’re made presentation ready).

Take a look, and as always, let me know what you think!
