Reading data from webpages using R
The story behind it
A couple of weeks back, I wanted to test some time series models with marine fish landings data from India. I knew that the Central Marine Fisheries Research Institute (CMFRI, India) publishes the estimated values through annual reports, and I found the data published on their website (http://www.cmfri.org.in/annual-data.html). Compiling all of it manually, however, would have been a tedious job.
I asked my friends at CMFRI for an electronic version of the data, but I was told I would have to pay for it. Since the data was already published on the website, I felt there had to be another way to get it.
After a little bit of research I found the XML R package. This package provides many approaches for both reading and creating XML (and HTML) documents, whether local or accessible via HTTP or FTP. The function to read HTML tables from a webpage is readHTMLTable("the URL").
1. You have to install the XML package first. (See this post for installing R Packages)
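Installing the package is a one-liner at the R prompt (the package name is the one mentioned above):

```r
# One-time step: install the XML package from CRAN
install.packages("XML")
```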
2. Now load the package using the library function
3. Read the data from the CMFRI website and name it `mydata`.
This will read the data and return the tables on the page in the form of a list. Following is the output in R.
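Steps 2 and 3 can be sketched as follows; the URL is the CMFRI annual-data page linked above, and `mydata` is the name used in this post:

```r
# Load the XML package (step 2)
library(XML)

# readHTMLTable() parses every <table> element on the page and
# returns them as a list of data frames (step 3)
mydata <- readHTMLTable("http://www.cmfri.org.in/annual-data.html")

# Inspect how many tables were found and what they look like
length(mydata)
str(mydata, max.level = 1)
```

Because the function returns a list, the table you actually want usually has to be picked out by its position or name in that list.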
4. So I cropped the data using the following code:
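The exact list index and row/column ranges used in the original post are not shown here, so the values below are illustrative placeholders; the general pattern is to pull the wanted table out of the list, subset it, and write it out as a CSV file:

```r
# Assume the landings figures are in the first table on the page;
# adjust the list index to match the actual page structure
landings <- mydata[[1]]

# Crop to the rows and columns of interest
# (these index ranges are placeholders, not the ones from the post)
landings <- landings[2:nrow(landings), 1:3]

# Save the cleaned table so it can be opened in MS Excel
write.csv(landings, "cmfri_landings.csv", row.names = FALSE)
```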
The output is shown below:
Following is a snapshot of the CSV file opened in MS Excel.
So that is how I managed to get the data from the website, with no manual errors and without paying for it. Thanks for reading. Happy coding!