Analyzing the NYC Subway Dataset

Section 1. Statistical Test

Which statistical test did you use?

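Because both the rainy and non-rainy data sets are non-normally distributed, I used the Mann-Whitney U-test (MWUt) to compare the two samples. Strictly speaking, it tests whether one distribution tends to produce larger values than the other, rather than measuring a difference in means.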

Why is this statistical test appropriate or applicable to the dataset?

The MWUt is a non-parametric test that does not assume any particular distribution, unlike Welch’s t-test, which assumes normally distributed data. The histogram of the data is clearly non-normal, so the MWUt is the better fit for the NYC subway data set.

What results did you get from this statistical test?

The MWUt returned a p-value of 0.025, which is below the conventional 0.05 threshold, so we reject the null hypothesis that the rainy and non-rainy samples come from the same distribution. In other words, the two distributions are statistically different.
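To make the mechanics of the test concrete, here is a minimal pure-Python sketch of the U statistic with a two-sided p-value from the normal approximation. It is not the code used for this report: the function name `mann_whitney_u` is an illustrative assumption, and no tie or continuity correction is applied, so a library routine such as `scipy.stats.mannwhitneyu` would be the better choice in practice.

```python
import math

def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value) using the normal approximation.

    Tied values receive average ranks. No tie or continuity correction is
    applied, so this is only a rough check that suits large samples.
    """
    n1, n2 = len(x), len(y)
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    # Assign 1-based ranks, averaging over runs of tied values.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    # Under the null hypothesis U is approximately normal with this mean and sd.
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_two_sided

# Toy samples (not the subway data): every value in `low` is below every value in `high`.
low = [1, 2, 3]
high = [4, 5, 6]
u_stat, p_value = mann_whitney_u(low, high)
```

A small p-value only says that the two distributions differ; the direction of the difference has to be read from the data itself, for example from the sample means.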

What is the significance of these results?

These results show that subway usage differs between rainy and non-rainy periods in a statistically significant way. The sample means indicate the direction: on average, there are about 15 more entries per hour when it rains.

Section 2. Linear Regression

What approach did you use to compute the coefficients theta and produce predictions in your regression model:

Gradient descent (as implemented in exercise 3.5)
OLS using Statsmodels
Or something different?

Both gradient descent (GD) and OLS models were used to run linear regression on the NYC subway dataset. Both models look for linear relationships between the features and the predicted values of NYC subway rides.
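For readers unfamiliar with the first approach, the following is a generic sketch of batch gradient descent for linear regression. It is not the exercise 3.5 implementation, which is not reproduced in this report; the function name, the learning rate `alpha`, and the toy data are all illustrative assumptions. OLS solves the same least-squares problem in closed form rather than by iteration.

```python
def gradient_descent(features, values, alpha=0.1, num_iterations=1000):
    """Fit values ~ theta . features by batch gradient descent on squared error."""
    m = len(values)          # number of observations
    n = len(features[0])     # number of features (including the intercept column)
    theta = [0.0] * n
    for _ in range(num_iterations):
        predictions = [sum(t * f for t, f in zip(theta, row)) for row in features]
        errors = [p - v for p, v in zip(predictions, values)]
        # Gradient of (1 / 2m) * sum(errors^2) with respect to each theta_j.
        gradient = [
            sum(e * row[j] for e, row in zip(errors, features)) / m
            for j in range(n)
        ]
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta

# Toy data generated from y = 2 + 3x; the first column of ones is the intercept.
features = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
values = [2.0, 5.0, 8.0, 11.0]
theta = gradient_descent(features, values, alpha=0.1, num_iterations=5000)
```

On real data the features usually need to be normalized first, otherwise a single learning rate may either diverge or converge very slowly.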

What features did you use in your model?

In the GD model, the features used were: rain, precipitation (precipi), hour of the day (Hour), mean temperature (meantempi) and dummy variables for individual stations (UNIT). In the OLS model, the features used were: rain, mean temperature (meantempi), dummy variables for stations (UNIT) and dummy variables for hours of the day (Hour).
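As an illustration of what "dummy variables" means here, the sketch below turns a categorical column into 0/1 indicator columns. The helper `one_hot` and the station labels are hypothetical; the report's own models may well have used a library helper such as `pandas.get_dummies` instead.

```python
def one_hot(labels, drop_first=True):
    """Return (column_names, rows) of 0/1 indicators for a categorical column."""
    categories = sorted(set(labels))
    if drop_first:
        # The first category becomes the baseline absorbed by the intercept.
        categories = categories[1:]
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows

# Hypothetical station identifiers, standing in for the UNIT column.
units = ["R001", "R002", "R001", "R003"]
names, rows = one_hot(units)
```

Each station then gets its own coefficient, which lets the model learn a separate baseline ridership level per station.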

Why are these features appropriate?

After trying various combinations of features, these proved the most relevant, based on their explanatory power and statistical significance. I favored the simplest model possible that did not lose too much explanatory power (R^2).

What is your model’s R2 (coefficient of determination) value?

The R squared for the GD model is 0.461. The R squared for the OLS is 0.525.

What does this R2 value mean for the goodness of fit for your regression model?

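The R squared for the OLS model is 0.525, which means the model explains about 52.5% of the variability in the data. This is not the same as predicting NYC subway entries with 52.5% accuracy: R squared measures the share of variance explained, not how often or how closely individual predictions are correct, and the remaining 47.5% of the variance is left unexplained.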
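The definition behind this number can be written in a few lines. This is a generic sketch of the standard formula, R^2 = 1 - SS_res / SS_tot, with an illustrative function name and toy data rather than the subway data.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(actual, actual)        # predictions match exactly
baseline = r_squared(actual, [2.5] * 4)    # always predicting the mean
```

A model that always predicts the mean scores 0 and a perfect model scores 1, so 0.525 sits roughly halfway between those two reference points.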

Section 3. Visualization

Please include two visualizations that show the relationships between two or more variables in the NYC subway data. You should feel free to implement something that we discussed in class (e.g., scatterplots, line plots, or histograms) or attempt to implement something more advanced if you'd like.

One visualization should be two histograms of ENTRIESn_hourly for rainy days and non-rainy days

One visualization can be more freeform. Some suggestions are:

Ridership by time-of-day or day-of-week
How ridership varies by subway station
Which stations have more exits or entries at different times of day

Section 4. Conclusion

From your analysis and interpretation of the data, do more people ride the NYC subway when it is raining versus when it is not raining?

Yes. Depending on the method, the estimate ranges from about 15 to about 100 more entries when it rains than when it does not. These numbers come from a simple comparison of means and from linear regressions fitted with gradient descent and OLS. In the mean comparison, we see a difference of 15 entries per hour, while in the gradient descent model the theta for the rain variable was 104.5. Because the rain variable is boolean, its theta is interpreted as follows: when it rains (rain = 1), the model predicts on average 104.5 more entries, with the other features held fixed.
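The link between a boolean coefficient and a difference in means can be checked directly. In a regression with only an intercept and one 0/1 feature, the least-squares slope equals the difference between the two group means; once other features are added, the coefficient becomes a difference conditional on those features, which is one reason the regression thetas need not match the raw 15-entry gap. The sketch below uses an illustrative helper `simple_ols` and made-up numbers.

```python
from statistics import mean

def simple_ols(x, y):
    """Closed-form least squares for y = intercept + slope * x (one feature)."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (
        sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        / sum((xi - x_bar) ** 2 for xi in x)
    )
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up hourly entries: three dry observations and two rainy ones.
rain = [0, 0, 0, 1, 1]
entries = [10.0, 12.0, 14.0, 20.0, 30.0]
intercept, slope = simple_ols(rain, entries)

dry_mean = mean(e for r, e in zip(rain, entries) if r == 0)
wet_mean = mean(e for r, e in zip(rain, entries) if r == 1)
```

Here the intercept recovers the dry-weather mean and the slope recovers the gap between the rainy and dry means.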

What analyses lead you to this conclusion?

The comparison of the two samples using the Mann-Whitney U-test gives us good reason to believe that there is a statistically significant difference between the two data distributions. Further evidence shows up in the gradient descent and OLS models, where the rain feature had a positive theta of 104.5 and 54.3 respectively.

Section 5. Reflection

Please discuss potential shortcomings of the dataset and the methods of your analysis.

One possible shortcoming of the dataset is that there might be omitted variables, such as holidays or event dates, closures for maintenance, etc. Another important point is that both linear models use a lot of dummy variables, which consumes many degrees of freedom and increases the chance of multicollinearity. For example, I was unable to include three sets of dummy variables (hours of the day, days of the week and stations) in the same model.
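One well-known way dummy variables produce multicollinearity is the so-called dummy variable trap: if a model has an intercept and an indicator column for every category, the indicator columns add up to the intercept column, so the design matrix is not full rank. Whether this was the specific problem encountered above is not established; the sketch below, with an illustrative helper `full_dummies` and toy hours, only shows the mechanism.

```python
def full_dummies(labels):
    """One 0/1 indicator column per category, with no baseline dropped."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]

# Toy hours of the day, standing in for the Hour column.
hours = [0, 4, 8, 4, 0, 8]
rows = full_dummies(hours)

# Each row has exactly one indicator set, so the dummy columns sum to a
# column of ones, which is exactly what an intercept column contains.
intercept_column = [1] * len(hours)
row_sums = [sum(row) for row in rows]
is_collinear = row_sums == intercept_column
```

Dropping one category per dummy set, so that it acts as the baseline, removes this exact linear dependence.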

References

GGPlot (http://ggplot.yhathq.com/docs/index.html)
GraphPad (http://graphpad.com/guides/prism/6/statistics/index.htm?how_the_mann-whitney_test_works.htm)
statsmodels.regression.linear_model.OLS (http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.OLS.html)