
Hello World!

This is my first blog entry!

I plan to use this blog to post my data science code, exercises, projects, and ideas from time to time. The languages I use include Python, R, and MATLAB.

It is still a work in progress and I hope these personal notes can benefit other data scientists somehow.

Use Rvest to download traffic data from Caltrans Performance Measurement System

Recently I helped a friend of mine download some traffic time-series data from the Caltrans Performance Measurement System. Basically, we needed to download the traffic data from all the major traffic census stations on the I-405 freeway, with a time span covering a couple of months. After searching online for a couple of days and asking a few questions on Stack Overflow (1, 2, 3), I finally assembled a piece of R code to accomplish what we needed.

rm(list=ls())
library(rvest)
library(httr)
 
getTable <- function(resp){
  # This function extracts the table from a response
  pg <- content(resp$response)
  html_nodes(pg, 'table.inlayTable') %>% html_table() -> tab
  return(tab) # return the content of table
}
 
generateURL <- function(siteID){
  # This function generates a URL for each input siteID
  urlPart1 = "http://pems.dot.ca.gov/?report_form=1&dnode=tmgs&content=tmg_volumes&tab=tmg_vol_ts&export=&tmg_station_id="
  urlPart2 = "&s_time_id=1369094400&s_time_id_f=05%2F21%2F2013&e_time_id=1371772740&e_time_id_f=06%2F20%2F2013&tod=all&tod_from=0&tod_to=0&dow_5=on&dow_6=on&tmg_sub_id=all&q=obs_flow&gn=hour&html.x=34&html.y=8"
  url = paste(urlPart1, toString(siteID), urlPart2, sep = '')
  return (url)
}
 
siteIDList = c(74250, 75020, 74020)
mainURL = "http://pems.dot.ca.gov/"
pgsession <- html_session(mainURL)
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform,
                          'username' = 'your_username',   # replace with your PeMS account email
                          'password' = 'your_password')   # replace with your PeMS password
 
# slog is the logged-in session that can be reused
slog <- submit_form(pgsession, filled_form) 
 
# loop thru siteIDList to scrape all the tables
vectorOfTables <- vector(mode = 'list', length = length(siteIDList))
i = 1
for (siteID in siteIDList){
   print ("Working on site:", quote = F)
   print (siteID)
   newsession = jump_to(slog, generateURL(siteID))
   vectorOfTables[i] = getTable(newsession)
   i = i+1
}
 
# Show the first table in vectorOfTables
vectorOfTables[1]

And remember to always use caution when scraping!

How to Download Your Fitbit Second-Level Data Without Coding

If you are a Fitbit user who wants to save a copy of your Fitbit data on your computer but doesn’t have advanced programming skills, this tutorial is right for you! You don’t need to do any coding at all to save your second-level data. I had been struggling to get all the so-called ‘intraday data’ for quite a while. I found many useful resources online, for example, Paul’s tutorial, Collin Chaffin’s PowerShell module, and the Fitbit-Python API, but they are all somewhat complicated and I just could not get any of them to work smoothly for me. Recently I finally figured out a way to download this Fitbit data without any coding. Are you ready?

Step 1: Register an app on http://dev.fitbit.com

First you need to register an account on dev.fitbit.com, and then click ‘MANAGE YOUR APPS’ in the top right area. Next, click the ‘Register a new app’ button at the top right to get started. This step is simple; just make sure the OAuth 2.0 Application Type is set to ‘Personal’ and that the Callback URL is complete, including the ‘http://’ part and a ‘/’ at the end. Here I used ‘http://google.com/’ as an example. I have also used this blog’s URL, ‘http://shishu.info/’, which worked just fine.

[Screenshot: app registration form]

After you click the red ‘Register’ button, you will be able to see the credentials for the app you just registered. The ‘OAuth 2.0 Client ID’ and ‘Client Secret’ will be used in the next step.

[Screenshot: app credentials]

 

Step 2: Use the OAuth 2.0 tutorial page

Next, right-click the ‘OAuth 2.0 tutorial page’ link and open it in a new tab, so that you can easily look back at your app’s credentials. Make sure ‘Implicit Grant Flow’ is chosen instead of ‘Authorization Code Flow’; this will make things much easier! After you copy and paste the ‘Client ID’ and ‘Client Secret’ into the blanks and fill in the Redirect URI, click the auto-generated link.

[Screenshot: OAuth 2.0 tutorial page]

Then you will see the confirmation page. Just click ‘Allow’.

[Screenshot: authorization confirmation page]

And you will be led to the ‘Redirect URI’, which is ‘http://google.com/’ in this case.  But the address bar now shows a very long string which is the token for your app.

[Screenshot: redirect URI with the token in the address bar]

Next, copy everything in the address bar except the starting part (https://www.google.com/), paste it into the ‘Parse response’ section, and hit the Enter key once. This way you can clearly see what the token is, what the scope is, and how long the token is valid. In this case, the token is valid for one week, which equals 604,800 seconds. Pretty good, right?

[Screenshot: parsed token response]

 

Step 3: Make request and get the data!

After you are done with the ‘Parse response’ section, the next step is ready for you automatically.

[Screenshot: Make Request section]

Just click the ‘Send to Hurl.it’ link and then ‘Launch Request’ on the new page. Make sure to confirm that you are not a robot, too.

[Screenshot: Hurl.it request page]

After that, if you see the ‘200 OK’ status, everything worked fine, and you can find the data you want in the lower half of the page, like this:

[Screenshot: response body]

If you click ‘view raw’ on the right side, the ‘BODY’ section changes to raw text, which you can simply copy and paste into a text editor. And that’s all you need to do to download your Fitbit second-level data! As I promised, you don’t need any programming skills to accomplish this. Isn’t that cool?

Additional tips:

tip 1: other data types

If you want some data other than the ‘user profile’, you can simply change the ‘API endpoint URL’ in the ‘Make Request’ step. According to the Fitbit API documentation, you can get heart rate data by using this URL:

https://api.fitbit.com/1/user/-/activities/heart/date/today/1d.json

Or the sleep data by using:

https://api.fitbit.com/1/user/28H22H/sleep/date/2014-09-01.json
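
As a side note, if you are comfortable running a couple of lines of Python, any of these endpoint URLs can also be fetched directly with the requests library; the token value below is a placeholder for the one you obtained in Step 2, and the same pattern is used in the longer script in tip 3:

import requests

token = ''  # paste the OAuth 2.0 token from Step 2 between the quotes
url = 'https://api.fitbit.com/1/user/-/activities/heart/date/today/1d.json'

# the Bearer token goes in the Authorization header
response = requests.get(url, headers={'Authorization': 'Bearer ' + token})
print(response.json())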

tip 2: save json file in an easier way

If you think the ‘Send to Hurl.it’ method is not fast enough, you can copy the ‘curl’ command auto-generated in the ‘Make Request’ step and run it in a terminal (on Windows, that is the ‘cmd’ window). Add the following to the end of the copied ‘curl’ command so that the data is saved to your disk:

>> data_file.json

tip 3: save multiple days’ data automatically

I think Fitbit provides the functionality to let users download multiple days’ data with one command, by specifying the date range in the curl request. However, I could not make it work for some unknown reason. Therefore I used a piece of Python code to download multiple days’ data for me. Here’s the code, which I think is pretty self-explanatory.

import requests
import json
import pandas as pd
from time import sleep
 
# put the token for your app in between the single quotes
token = ''
 
# make a list of dates 
# ref: http://stackoverflow.com/questions/993358/creating-a-range-of-dates-in-python
# You can change the start and end date as you want
# Just make sure to use the yyyy-mm-dd format
start_date = '2015-12-28'
end_date = '2016-06-14'
datelist = pd.date_range(start = pd.to_datetime(start_date),
                         end = pd.to_datetime(end_date)).tolist()
 
'''
The code below uses a for loop to generate one URL for each day in datelist,
then requests each day's data and saves it into an individual json file.
Because Fitbit limits requests to 150 per hour, the code sleeps for 30 seconds
between requests to stay under this limit.
'''
for ts in datelist:
    date = ts.strftime('%Y-%m-%d')
    url = 'https://api.fitbit.com/1/user/-/activities/heart/date/' + date + '/1d/1sec/time/00:00/23:59.json'
    filename = 'HR'+ date +'.json'
    response = requests.get(url=url, headers={'Authorization':'Bearer ' + token})
 
    if response.ok:
        with open(filename, 'w') as f:
            f.write(response.text)  # write the raw JSON text to disk
        print (date + ' is saved!')
        sleep(30)
    else:
        print ('The file for %s was not saved due to an error!' % date)
        sleep(30)

Happy hacking!

How to Get 0.99+ Accuracy in Kaggle Digit Recognizer Competition

Recently I have spent a lot of time working on the Kaggle Digit Recognizer competition and finally reached an accuracy higher than 0.99. I am quite happy with it and would like to share with everyone how I did it. Basically, I used TensorFlow to build a neural network with these ‘highlights’:

  1. three hidden layers, with some dropout between each layer, but no convolution in them
  2. a training data set 25 times larger than the original, generated by nudging the original training images up, down, left, and right by 1 pixel each
  3. an exponentially decaying learning rate (a sketch of this schedule follows right after this list)
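
To illustrate the third highlight, here is a minimal sketch of the kind of exponential-decay schedule I mean, written as plain Python rather than as the exact TensorFlow call in my code; the names initial_lr, decay_steps, and decay_rate are only illustrative:

def exponential_decay(initial_lr, global_step, decay_steps, decay_rate):
    # the learning rate shrinks by a factor of decay_rate every decay_steps steps
    return initial_lr * decay_rate ** (global_step / float(decay_steps))

# for example, start at 0.5 and halve the learning rate every 1000 training steps
for step in (0, 1000, 2000, 3000):
    print(step, exponential_decay(0.5, step, 1000, 0.5))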

You can find the code here.

Unlike some other scripts on Kaggle.com, like this one and this one, my neural network does not use convolution, mainly because I do not have a GPU and do not want to pay for AWS… However, I think my neural network did just as good a job as those two.

[Screenshot: Kaggle leaderboard, rank 163, 06/02/2016]

I learned a lot about machine learning through this 101-level Kaggle competition. The biggest lesson I learned is that picking the right model is more important than fine-tuning the parameters within a model. Just like any other task, finding the right tool is always the very first step. Before I decided to use a neural network, I had already tried several other models, including logistic regression, SVM, and k-nearest neighbors, but the accuracy never went above 0.97, no matter how hard I tried to fine-tune the model parameters.
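
For a concrete idea of what those baselines looked like, here is a rough scikit-learn sketch; it is not my exact code, X_train, y_train, X_val, and y_val stand for flattened digit images and labels you would load yourself, and the hyperparameters shown are only starting points:

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# X_train, y_train, X_val, y_val are assumed to hold flattened digit images and labels
baselines = {
    'logistic regression': LogisticRegression(C=1.0),
    'k-nearest neighbors': KNeighborsClassifier(n_neighbors=5),
    'SVM with RBF kernel': SVC(C=10.0, gamma=0.01),
}

for name, model in baselines.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_val, y_val))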

The second biggest lesson is that using a larger training data set really helps improve a neural network’s accuracy. I adopted this ‘nudging images’ idea from an example on scikit-learn.org. I really like this idea and will probably keep using it for other projects.
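
If you are curious what the nudging looks like in code, below is a minimal sketch of the idea in plain NumPy. It only applies the four 1-pixel shifts (so it grows the data 5 times, counting the originals); the exact combination of shifts used to reach the larger set in my run differed, and X_train and y_train here stand for hypothetical arrays of flattened 28x28 images and their labels:

import numpy as np

def nudge(image, dx, dy):
    # shift a 2-D image by (dx, dy) pixels, filling the vacated edge with zeros
    padded = np.pad(image, 1, mode='constant')
    h, w = image.shape
    return padded[1 - dy:1 - dy + h, 1 - dx:1 - dx + w]

shifts = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]  # original plus four 1-pixel nudges

def augment(X, y, side=28):
    # each flattened image becomes len(shifts) images; labels are repeated to match
    X_aug = np.array([nudge(img.reshape(side, side), dx, dy).ravel()
                      for img in X for dx, dy in shifts])
    y_aug = np.repeat(y, len(shifts))
    return X_aug, y_aug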

Last but not least, I realized that it is not good to work only on your own. I need to read about what other people have done, talk to different people, even machine learning laymen, and listen to other people’s criticism whenever they are kind enough to offer it. So, what is your criticism of this neural network?

TensorFlow on Windows 10 Using Docker Installation Method

I am taking an online Deep Learning course now and it requires me to use TensorFlow. I spent a lot of time searching around and testing different things, and finally managed to run TensorFlow on my Windows 10 laptop. So I think I should write a post to remind myself, just in case I need to do it again in the future. I hope this post can save someone else’s time too.

The overview section of the Download and Setup page says there are four different ways to install TensorFlow:

  1. Pip install
  2. Virtualenv install
  3. Anaconda install
  4. Docker install

Since I had heard about Docker for a while but never got a chance to use it, I thought this was a great opportunity to learn how to use it. So I chose the Docker install method. It looks pretty simple, only three steps.

[Screenshot: the three installation steps]

However, these three steps took me a whole morning…

First, I went to the Install Docker for Windows page and followed the instructions. I had no idea whether virtualization was enabled on my laptop, and my Task Manager looks different from the image shown on the instruction page.

[Screenshot: Task Manager]

I struggled with this and the BIOS for a while, and found out from System Information (by running the msinfo32 command) that virtualization IS enabled.

[Screenshot: System Information (msinfo32)]

Next, I installed the Docker Toolbox, since I am pretty sure I am using 64-bit Windows. This process is very easy and straightforward. After installing Docker Toolbox, three more icons showed up on my desktop.

[Screenshot: desktop shortcuts]

I launched the Docker Toolbox terminal by double-clicking the Quickstart Terminal icon and ran my very first Docker command: “docker run hello-world”. So far so good.

[Screenshot: Docker Quickstart Terminal]

 

So now I had finished the first step: “Install Docker on your machine”. But I had no idea how to do the second step: “Create a Docker group to allow launching containers without sudo”. The big lesson I learned here is that this second step is NOT necessary, at least in my case. I skipped it and went ahead to the third step: “Launch a Docker container with the TensorFlow image”.

I first tried “$ docker run -it gcr.io/tensorflow/tensorflow” and everything looked good from the terminal, which said “The Jupyter Notebook is running at http://[all ip addresses on your system]:8888/”. Wait, what are “all ip addresses”? I typed “localhost:8888” into my Chrome address bar, but the Jupyter Notebook did not load…

[Screenshot: localhost:8888 not loading]

Once again, a post on Stack Overflow was my life-saver. I followed the answer and everything worked out. First I ran the command “$ docker-machine ip default” and found that the IP address should be 192.168.99.100. Then I started the TensorFlow Docker container again using the command “$ docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow”. Now the Jupyter Notebook is working at 192.168.99.100:8888.

[Screenshot: Jupyter Notebook at 192.168.99.100:8888]

I opened the first notebook and made a test run on the first cell. It worked!
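
For reference, a minimal check of this kind (not necessarily the exact cell in the notebook) is enough to confirm TensorFlow works inside the container; this sketch assumes the TensorFlow 0.x/1.x API that shipped in the image at the time:

import tensorflow as tf

# build a trivial graph and run it to confirm the installation works
hello = tf.constant('Hello, TensorFlow!')
a = tf.constant(2)
b = tf.constant(3)

with tf.Session() as sess:
    print(sess.run(hello))
    print(sess.run(a + b))  # should print 5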

This is how I installed TensorFlow on my laptop via Docker. I hope it is useful to you. Feel free to leave your comments or questions below!

 

My Experience with Udacity Data Analyst Nano-degree

After spending most of my spare time on it for the past 8 months, I finally graduated from the Udacity Data Analyst Nano-degree program! Before I started the program, I spent many hours searching online for reviews and discussions about it. Now I would like to share my whole experience with the internet and hope it is helpful to someone like me. Since I have also taken other courses on coursera.org and edx.org, I can make some direct comparisons, which should be helpful too.

First, I would say it really requires a lot of time to finish the degree. I spent roughly 15+ hours each week on this program over the past 8 months. This may not sound like a lot of time to you, but it is, especially if you have another full-time job. So don’t jump into it if you can’t afford the time. As for tuition, I paid $1,600 for the program, but Udacity will refund half of it because I finished the program within 12 months and paid all of my tuition out of my own pocket. I haven’t received the refund yet because Udacity told me it takes 4 to 8 weeks to process. Just don’t forget to submit a request for this refund; Udacity will not refund it automatically.

The 8 projects covered a wide range of aspects of the data science field, including statistics, Python programming, R programming, machine learning, and D3.js data visualization. The Python and R programming focused on data manipulation, wrangling, and visualization. The machine learning course is really condensed and does not go deep into algorithms and theory, compared to other machine learning courses. Overall, this nano-degree really focuses on analysis skills such as processing data and finding interesting stories. If you want to be a data scientist instead of a data analyst, this nano-degree is probably not the best choice for you.

There are many things I really liked about this program. First, Udacity has an amazing ‘customer support’ team. The coaches provide 1-on-1 help sessions. Of course these coaching sessions need to be reserved first, which is fairly easy to do. Each help session is scheduled to be 20 minutes long, but a coach once chatted with me for more than an hour, until I really solved the problem. I only used online text chatting, but it seems the coaches are open to other communication methods such as video chat or phone calls as well. In addition, the discussion forum is a good resource that helped me finish all the projects. The coaches reply to questions VERY quickly, usually in 30 minutes or less, and they are always very patient! The coaches also review project submissions in great detail, give constructive feedback, and encourage the students all the time. I think this coaching team is the factor that makes this nano-degree program stand out compared to other MOOC courses or specializations.

However, I believe this program still has some room for improvement. My biggest frustrations came from the course videos. Maybe it is because Udacity only considers the course videos to be supporting materials, or maybe it is because the courses are taught by mentors from industry, but I felt that the course videos are nothing like a real class. For a substantial portion, the videos are just two or more mentors talking. The course videos did not really help me much in finishing my projects. I like the course videos on coursera.org much better because they are better organized and the contents are taught systematically. That is not the case with the Udacity courses, at least for the Data Analyst Nano-degree.

Another question people care about is whether this program really helps students find a job. Well, I can’t tell, because I am just me, one sample, and there is not even a control sample. But at least the program gave me something to talk about regarding data analysis during my interviews, so I would say, yes, it is useful.

Please feel free to comment below if you would like to take the program, are in the middle of the program, or have graduated. I’d be happy to answer any questions about this Udacity Data Analyst Nano-degree.

Have Flight Delays Decreased Over Time?

I think it is not just me who would assume that, as technology advances, the carriers are able to reduce arrival delays over time. Is that really the case? I looked into the historical flight data and found something surprising and interesting.

First I downloaded all the historical flight data from stat-computing.org. These 22 years (1987-2008) of data add up to more than 10 GB. So I wrote a piece of R code that reads each csv file and extracts the information I need for each carrier in each year. I tried many different ways to aggregate the data, for example, using the yearly average, the 75% quantile, the 99% quantile, and the yearly maximum. Since there are so many flight records for each carrier each year, only the yearly maximum arrival delays showed clear trends over time.
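
My processing was done in R, but a rough sketch of the same aggregation in Python/pandas is below; the per-year file names and the 'Year', 'UniqueCarrier', and 'ArrDelay' column names follow the stat-computing.org files as I remember them, so adjust them if your copy differs:

import pandas as pd

frames = []
for year in range(1987, 2009):
    # each yearly file is large, so read only the columns needed
    df = pd.read_csv('%d.csv' % year, usecols=['Year', 'UniqueCarrier', 'ArrDelay'])
    # yearly maximum arrival delay for each carrier
    frames.append(df.groupby(['Year', 'UniqueCarrier'])['ArrDelay'].max().reset_index())

yearly_max = pd.concat(frames, ignore_index=True)
print(yearly_max.head())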

Surprisingly, this exploratory data analysis suggested that the yearly maximum arrival delays increased rather than decreased over these 22 years. This is somewhat counter-intuitive to me, because I thought information technology has developed so much that it should have helped reduce flight delays. Anyway, I used D3.js to create an interactive scatter-line plot to show these trends. Below is a thumbnail of the plot, which links to the html page where the plot is hosted. The legend is clickable, so you can select which carriers’ data you would like to see.

[Thumbnail: interactive time-series plot of yearly maximum arrival delays by carrier]

After making this plot and looking back into the data set, I realized it is reasonable that the yearly maximum arrival delays have increased. The major reasons I can speculate on include:

  1. More and more people are traveling by airplane, so each carrier has a lot more flights to manage.
  2. The number of long-distance flights has increased, and with it the chance of a long arrival delay.
  3. The yearly maximum delays are probably caused by extreme weather conditions or natural disasters, which seem to happen more frequently in recent years.

If you come up with other possible reasons, please leave them in the comments!