Reading data from webpages using R


The story behind it

A couple of weeks ago, I wanted to test some time series models on marine fish landings data from India. I knew that the Central Marine Fisheries Research Institute (CMFRI, India) publishes the estimated values in its annual reports. I also found the data published on their website (http://www.cmfri.org.in/annual-data.html), but compiling all of it manually would have been a tedious job.

I asked my friends at CMFRI for an electronic version of the data, but I was told I would have to pay for it. Since the data is already published, I thought there had to be another way.

Technical part

After a little research I found the XML package for R. This package provides many approaches for both reading and creating XML (and HTML) documents, whether they are local or accessible via HTTP or FTP. The function that reads the tables of an HTML document (a webpage) is readHTMLTable("The URL").

1. You have to install the XML package first. (See this post for installing R Packages)

2. Now load the package using the library function
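Steps 1 and 2 can be sketched as follows. This is only a minimal example: the CRAN mirror URL is my assumption (any mirror works), and the check with requireNamespace is there just to avoid reinstalling the package on every run.

```r
# Install the XML package only if it is missing.
# The mirror URL below is an assumption; any CRAN mirror can be used.
if (!requireNamespace("XML", quietly = TRUE)) {
  install.packages("XML", repos = "https://cloud.r-project.org")
}

# Attach the package so that readHTMLTable() becomes available.
library(XML)
```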

3. Read the data from the CMFRI website and name it 'mydata'.

mydata <- readHTMLTable("http://www.cmfri.org.in/annual-data.html")

This reads the page and returns its tables in the form of a list. The following is the output in R:

[Screenshot: the contents of mydata printed in the R console]

However, this looks messy. We have to clean it up to obtain the data in the required format.
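Before cleaning anything, it helps to look at what readHTMLTable actually returned. The sketch below does not touch the CMFRI site; it parses a small, made-up HTML string (the `page` object and its values are invented purely for illustration) so that it can be run offline. The same inspection calls should work on a list read from a real URL.

```r
library(XML)

# Hypothetical stand-in for a downloaded page: two small HTML tables
# with invented values.
page <- paste0(
  "<html><body>",
  "<table><tr><th>Item</th><th>Value</th></tr>",
  "<tr><td>a</td><td>1</td></tr><tr><td>b</td><td>2</td></tr></table>",
  "<table><tr><th>Item</th><th>Value</th></tr>",
  "<tr><td>c</td><td>3</td></tr></table>",
  "</body></html>"
)

# Parse the string as HTML, then read every <table> in it.
doc <- htmlParse(page, asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

length(tables)      # how many tables were found on the page
class(tables[[1]])  # each list element is one table
str(tables[[1]])    # column names and types of the first table
head(tables[[1]])   # first few rows of the first table
```

Printing the row numbers with head() or by printing the whole table is one way to work out which rows to keep in the next step.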

4. So I cropped the first table in the list down to the rows I needed, using the following code:

# Keep the selected rows (and all columns) of the first table in the list
cleanup.data <- as.data.frame(mydata[[1]][c(7, 9:13, 15:19, 21:30, 32:40, 42:47, 49:51, 53:54, 56:59, 61:69, 71:73, 75:79, 81:87, 89), ])
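Picking rows by position works, but the numbers have to be found by eye and will break if the page changes. As an alternative (not what I did above), rows can be kept based on their content. The sketch below uses a tiny invented data frame, `raw`, in place of the real table, and assumes that a data row is one whose second column can be read as a number; both the data and that rule are assumptions for illustration.

```r
# Invented stand-in for a scraped table: group headings and blank rows
# are mixed in with the actual data rows.
raw <- data.frame(
  V1 = c("Group A", "item 1", "item 2", "", "Group B", "item 3"),
  V2 = c("", "120", "85", "", "", "40"),
  stringsAsFactors = FALSE
)

# Selection by position, as in the step above.
by_position <- raw[c(2:3, 6), ]

# Selection by content: keep rows whose second column parses as a number.
is_data_row <- !is.na(suppressWarnings(as.numeric(raw$V2)))
by_content <- raw[is_data_row, ]

# Convert the kept values from text to numbers.
by_content$V2 <- as.numeric(by_content$V2)
```

On this toy table both approaches keep the same rows; on a real page the rule for spotting a data row would need to be adapted to the table at hand.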

The output is shown below:

[Screenshot: cleanup.data printed in the R console]

5. Now we can save this data as a CSV file on our computer. The following code does that. You have to give the path to the location where you want the data to be saved.

write.csv(cleanup.data, file = "C:\\give.location.here\\data.csv", row.names = FALSE)
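If you would rather not type a Windows path with doubled backslashes, file.path can build the path for you. The sketch below is only an illustration: it writes a small invented data frame to R's temporary directory (my choice here, so the example runs anywhere) and reads the file back to confirm that it was written.

```r
# Invented stand-in for the cleaned table.
cleanup.data <- data.frame(
  item = c("a", "b"),
  value = c(1, 2),
  stringsAsFactors = FALSE
)

# Build the output path; replace tempdir() with your own folder.
out_file <- file.path(tempdir(), "data.csv")

# Save without row names, as in the step above.
write.csv(cleanup.data, file = out_file, row.names = FALSE)

# Read the file back to check what was saved.
check <- read.csv(out_file, stringsAsFactors = FALSE)
```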

The following is a snapshot of the CSV file opened in MS Excel:

[Screenshot: the CSV file opened in MS Excel]

So that is how I managed to get the data from the website, with no manual transcription errors and without paying for it. Thanks for reading. Happy Coding…

About Deepak George Pazhayamadom

I'm a fish biologist and a mathematical modeller. I have a wide range of research interests, mostly centered on fisheries resource management.

Posted on June 23, 2013, in Uncategorized.
