
NYC Subway Traffic

  • Writer: Chintan
  • Oct 30, 2017
  • 3 min read

Updated: Nov 26, 2017



During the first week of my stint at Metis, we were handed an assignment from WomenTechWomenYes (WTWY). The organization aims to gather signatures from New York City (NYC) residents for its initiatives, and wants Metis to use data to map optimal volunteer placement locations and schedules. NYC, with its population of 8.5 million residents, is fairly dense, and there is anecdotal evidence that it ranks relatively high for public-transit usage. To get started, we looked at data from various sources, but quickly found that the MTA publishes weekly data from 6 subway-lines, 378 stations, 733 control areas and 114 lines. The data is quite extensive and, in combination, consists of over 50 million data points! Thanks to modern data analysis tools, a thorough analysis of such large datasets is feasible.


I ended up using IPython Notebook, and I was quick to realize its utility over conventional Python IDEs such as Spyder or PyCharm. When working with large data-sets, it is desirable to execute code in small pieces, which saves the processing time otherwise spent re-running manipulations of individual DataFrames and dictionaries. It is important to note that conventional ‘for’ loops are impractically slow for slicing and analyzing such large data-sets. Another key attribute of such massive data is the presence of errors and inconsistencies, which, given the sheer scale of the data, can render conventional spreadsheet analysis useless. While I used to work with experimental methods to measure acidity constants during my PhD, I did not encounter such massive volumes of data. Even the combined size of all of my spectroscopic data was barely 1 TB. Analysis of large data-sets is uniquely interesting and challenging. You could poke the data in different ways, but still not grasp the intricacies of the underlying information.
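
To illustrate the point about avoiding ‘for’ loops and handling inconsistencies, here is a minimal sketch rather than my actual code. It assumes each turnstile reports a cumulative entry counter, so per-interval entries come from differencing consecutive readings. The column names (`station`, `turnstile`, `timestamp`, `entries_counter`) and all values are hypothetical. Treating negative differences as invalid is one possible cleaning choice, not the only one.

```python
import pandas as pd

# Hypothetical miniature of a turnstile table; names and values are made up.
df = pd.DataFrame({
    "station": ["A", "A", "A", "B", "B", "B"],
    "turnstile": ["t1", "t1", "t1", "t2", "t2", "t2"],
    "timestamp": pd.to_datetime([
        "2017-09-04 00:00", "2017-09-04 04:00", "2017-09-04 08:00",
        "2017-09-04 00:00", "2017-09-04 04:00", "2017-09-04 08:00",
    ]),
    # Assumed cumulative counters; the second reading at station B
    # simulates a counter reset (an inconsistency in the data).
    "entries_counter": [1000, 1010, 1050, 500, 20, 45],
})

# Sort so that consecutive rows of each turnstile are in time order.
df = df.sort_values(["station", "turnstile", "timestamp"])

# Vectorized difference within each turnstile, with no explicit 'for' loop.
delta = df.groupby(["station", "turnstile"])["entries_counter"].diff()

# The first reading of each turnstile has no predecessor (NaN), and
# negative differences are treated as invalid and masked out as NaN.
df["entries"] = delta.where(delta >= 0)

# Total valid entries per station; NaN values are skipped by sum().
per_station = df.groupby("station")["entries"].sum()
print(per_station)
```

Running a cell like this on its own, and inspecting `df` before aggregating, is the kind of step-by-step execution that makes a notebook convenient here.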


A quick analysis showed the median traffic (entries by commuters) on a typical weekday to be 5.1 million (calculated with Pandas and Numpy). If you assume 1-2 entries per commuter, it can be estimated that 30-60% of NYC residents use the subway on any given weekday. However, as NYC is a top tourist destination, we had to conduct certain tests to gauge the impact of tourists on our flow calculations. I looked at the ratio of weekday entries to weekend entries (let’s call it ‘R’) in my data. If R>1 for a specific station, it is reasonable to infer that the station is frequented mostly by local residents, whereas if R<1, one can assume it to be a tourist-heavy station. When I plotted R against flow for all stations, I found that R was greater than one for most stations in NYC, and hence one can treat the effect of tourists on this study as minimal.
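
The two calculations above can be sketched as follows. This is an illustration under stated assumptions, not the exact code behind the numbers: the daily totals, station labels and column names are hypothetical, and only the 5.1 million and 8.5 million figures come from the text.

```python
import pandas as pd

# Hypothetical daily entry totals for two stations (values are made up).
# 2017-09-04 and 2017-09-05 are weekdays; 2017-09-09 and 2017-09-10 are
# a Saturday and a Sunday.
daily = pd.DataFrame({
    "station": ["X", "X", "X", "X", "Y", "Y", "Y", "Y"],
    "date": pd.to_datetime([
        "2017-09-04", "2017-09-05", "2017-09-09", "2017-09-10",
        "2017-09-04", "2017-09-05", "2017-09-09", "2017-09-10",
    ]),
    "entries": [1000, 1200, 500, 600, 300, 500, 700, 900],
})

# dayofweek: Monday is 0, so 5 and 6 are Saturday and Sunday.
daily["day_type"] = daily["date"].dt.dayofweek.map(
    lambda d: "weekend" if d >= 5 else "weekday"
)

# Median entries per station for each day type, one column per day type.
medians = daily.groupby(["station", "day_type"])["entries"].median().unstack()

# R = median weekday entries / median weekend entries, per station.
R = medians["weekday"] / medians["weekend"]
print(R)


def share_of_residents(weekday_entries, population, entries_per_commuter):
    """Rough share of residents riding, given assumed entries per commuter."""
    return weekday_entries / (population * entries_per_commuter)


# Figures from the text: 5.1 million entries, 8.5 million residents.
high = share_of_residents(5.1e6, 8.5e6, 1)  # one entry per commuter
low = share_of_residents(5.1e6, 8.5e6, 2)   # two entries per commuter
print(f"{low:.0%} to {high:.0%}")
```

In this toy data, station X has R above one (commuter-like) and station Y has R below one (tourist-like), which mirrors the interpretation of R used above.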


R (Weekday/Weekend) vs. Median Weekday Flow (Entries): Note that R>1 for most stations!


While our team’s presentation is available online, you can take a look at some of my code on this data-set. Note that I use dictionaries extensively, partly because dictionary lookups are fast, and partly because dictionaries convert easily to and from Pandas DataFrames and Series. Below, I have the list of top 20 weekday and weekend stations (by the median number of entries).
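
As a small illustration of moving between dictionaries and Pandas objects to rank stations, here is a hedged sketch. The station labels and median values are placeholders I made up for the example; they are not figures from the MTA data.

```python
import pandas as pd

# Hypothetical mapping of station label to median weekday entries.
median_entries = {
    "station_a": 42000,
    "station_b": 150000,
    "station_c": 98000,
    "station_d": 12000,
}

# Dictionary to Series: keys become the index, values become the data.
series = pd.Series(median_entries, name="median_entries")

# Rank stations by median entries and keep the top N (N=3 here; the post
# uses 20 on the full data-set).
top = series.sort_values(ascending=False).head(3)
print(top)

# Series back to a plain dictionary for further manipulation.
top_dict = top.to_dict()
```

The same pattern extends to DataFrames through `pd.DataFrame.from_dict` and `DataFrame.to_dict`.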


Top Weekday Stations


I found a large difference between the number of visitors on weekends vs. weekdays. Thus, even though a station such as 34 St-Penn gets a lot of non-local travelers, one can assert that a significant majority of its travelers are local. While this is still a basic analysis at this point, here’s a quick suggestion (map below) of top locations for anyone who wants to reach out to commuters directly.


Mapping the Top 15 Stations


Unsurprisingly, a majority of them are located in Manhattan. While one can scrape the web for top locations associated with tech, the above analysis is a more direct strategy to maximize outreach.


The author welcomes any positive or negative feedback.

 
 
 
