Use rvest to download traffic data from the Caltrans Performance Measurement System

Recently I helped a friend download some traffic time-series data from the Caltrans Performance Measurement System (PeMS). We needed the traffic data from all the major traffic census stations on the I-405 freeway, covering a time span of a couple of months. After searching online for a couple of days and asking a few questions on Stack Overflow (1, 2, 3), I finally assembled a piece of R code that does what we needed.

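library(rvest)  # html_session(), html_form(), html_nodes(), html_table(), %>%
library(httr)   # content()

getTable <- function(resp) {
  # Extract the tables with class "inlayTable" from a session's response
  pg <- content(resp$response)
  tab <- html_nodes(pg, 'table.inlayTable') %>% html_table()
  return(tab)  # a list of data frames, one per matching table
}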
generateURL <- function(siteID) {
  # Build the report URL for one station: prefix + siteID + query string
  urlPart1 <- ""  # URL prefix up to the station ID (left blank here)
  urlPart2 <- "&s_time_id=1369094400&s_time_id_f=05%2F21%2F2013&e_time_id=1371772740&e_time_id_f=06%2F20%2F2013&tod=all&tod_from=0&tod_to=0&dow_5=on&dow_6=on&tmg_sub_id=all&q=obs_flow&gn=hour&html.x=34&html.y=8"
  url <- paste(urlPart1, toString(siteID), urlPart2, sep = '')
  return(url)
}
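siteIDList <- c(74250, 75020, 74020)
mainURL <- ""  # login page URL (left blank here)
# Note: in rvest >= 1.0 these functions are named session(),
# html_form_set(), session_submit() and session_jump_to()
pgsession <- html_session(mainURL)
pgform <- html_form(pgsession)[[1]]  # the login form is the first form on the page
filled_form <- set_values(pgform,
                          'username' = '',
                          'password' = 'house6y')
# slog is the logged-in session that can be reused
slog <- submit_form(pgsession, filled_form)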
# Loop through siteIDList to scrape all the tables
vectorOfTables <- vector(mode = 'list', length = length(siteIDList))
for (i in seq_along(siteIDList)) {
  siteID <- siteIDList[i]
  message("Working on site: ", siteID)
  newsession <- jump_to(slog, generateURL(siteID))
  # getTable() returns a list of tables; keep the first match
  vectorOfTables[[i]] <- getTable(newsession)[[1]]
}
# Show the first table in vectorOfTables
vectorOfTables[[1]]
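
The time span is hard-coded in urlPart2. The s_time_id and e_time_id values look like Unix timestamps in UTC that match the formatted dates next to them (05/21/2013 at 00:00 and 06/20/2013 at 23:59), so a different window could probably be generated rather than typed by hand. Here is a minimal sketch under that assumption. buildTimeParams is a hypothetical helper of my own, not something from rvest or PeMS, and whether the server accepts other windows in this form is untested.

```r
# Hypothetical helper: builds the date part of the query string.
# Assumes s_time_id / e_time_id are Unix timestamps in UTC, with the
# window running from 00:00 on the start date to 23:59 on the end date.
buildTimeParams <- function(startDate, endDate) {
  startTime <- as.POSIXct(paste(startDate, "00:00:00"), tz = "UTC")
  endTime   <- as.POSIXct(paste(endDate, "23:59:00"), tz = "UTC")
  paste0(
    "&s_time_id=", format(as.numeric(startTime), scientific = FALSE),
    # URLencode(reserved = TRUE) turns "/" into "%2F"
    "&s_time_id_f=", utils::URLencode(format(startTime, "%m/%d/%Y"), reserved = TRUE),
    "&e_time_id=", format(as.numeric(endTime), scientific = FALSE),
    "&e_time_id_f=", utils::URLencode(format(endTime, "%m/%d/%Y"), reserved = TRUE)
  )
}

buildTimeParams("2013-05-21", "2013-06-20")
```

For these two dates the result matches the first four parameters of urlPart2, so the rest of the query string (tod, dow_5, q, gn and so on) can be appended unchanged.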

And remember to always use caution when scraping!
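
One way to act on that caution is to pause between requests and to keep going when a single station fails. This is a sketch rather than a definitive recipe: scrapePolitely and its pauseSeconds argument are hypothetical names, the two-second default is an arbitrary guess, and fetchOne stands for whatever function fetches one site, for example function(id) getTable(jump_to(slog, generateURL(id))).

```r
# Hypothetical wrapper: calls fetchOne(id) for each ID, waits between
# requests, and leaves NULL in the slot of a site whose request fails.
scrapePolitely <- function(ids, fetchOne, pauseSeconds = 2) {
  results <- vector(mode = "list", length = length(ids))
  names(results) <- as.character(ids)
  for (i in seq_along(ids)) {
    result <- tryCatch(
      fetchOne(ids[i]),
      error = function(e) {
        message("Site ", ids[i], " failed: ", conditionMessage(e))
        NULL
      }
    )
    # Assigning NULL with [[<- would drop the slot, so only store real results
    if (!is.null(result)) results[[i]] <- result
    # Wait before the next request, but not after the last one
    if (i < length(ids)) Sys.sleep(pauseSeconds)
  }
  results
}
```

The result is a list named by site ID, so a failed station shows up as a NULL entry instead of stopping the whole run.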

