Easy R for M.F.Sc students – Part 5 (Organizing and Importing Data)
1. Organizing Data
OK guys. The first part of this series was only about getting you excited about working with R. We made a small data set, imported it into R and plotted a simple graph. But doing the same with your own data would still be challenging if you are a beginner. To apply this method to different types of data, you should first understand how R works with numbers and, more importantly, how you should organize your data.
Interestingly, I have seen many students organize their data in a form that is very pleasing to themselves. This was true even for me when I was doing my Masters. But such layouts do not suit the statistical software available in the market. Statistical packages work only if they get your data in a specific format. If you know this beforehand, you can save a lot of time at the end of the year during analysis and thesis writing.
In general, most software (SAS, SPSS etc.) prefers a data format in which your variables and attributes go into columns and the observations from each sample go into rows. So if you have 12 observations on 5 parameters, ideally your data should have 12 rows and 5 columns.
Let’s make it simple.
Suppose we are collecting observations on temperature, salinity and oxygen from four oceanographic locations of India: (1) Mumbai, (2) Cochin, (3) Madras and (4) Paradeep. Mumbai and Cochin belong to the West coast, and Madras and Paradeep belong to the East coast of India. So here we have three quantitative variables (Temperature, Salinity and Oxygen) and two qualitative variables or attributes (Location and Coast). In total, our ideal data matrix should have five columns. If we collect two observations from each location, then the total number of rows of our ideal data matrix would be eight.
If we had collected five observations from each location, the total number of rows would have been 20. In the figure above, we have 8 rows of data plus one extra row on the top as heading.
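If you want to check the row and column counts above, a little arithmetic in R will do it. This is only a sketch, and the object names (n_locations and so on) are made up for this illustration.

```r
# Hypothetical names, chosen only for this illustration
n_locations    <- 4   # Mumbai, Cochin, Madras, Paradeep
n_quantitative <- 3   # temperature, salinity, oxygen
n_qualitative  <- 2   # location, coast

n_columns     <- n_quantitative + n_qualitative
rows_two_obs  <- n_locations * 2   # two observations per location
rows_five_obs <- n_locations * 5   # five observations per location

print(c(columns = n_columns, rows_2 = rows_two_obs, rows_5 = rows_five_obs))
```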
2. Importing Data to R
I always use MS Excel to enter my data and save the file in TXT or CSV format. This won’t necessarily be the best option for you, but I have found it more trustworthy. Keeping data in Excel format itself can sometimes be risky if the numbers in it are derived from formulas. Excel might recalculate them every time you open the file, and at times you will be surprised to see numbers that are not familiar to you.
However, I’m going to show you easy ways of getting data into R. If your data matrix is small, just like the example above, you can easily type it into R. To start with, let’s create a variable ‘A’.
Open R, type the following and press the ENTER key:
A <- 32.5
The symbol <- means ‘assign’ in R: the value on the right is stored under the name on the left. Alternatively you can also use the = symbol. But it is better to use <- to avoid confusion in future when you develop your code for doing specific tasks. Now type ‘A’ and press the ENTER key:
> A
[1] 32.5
So now R understands that A is equal to 32.5, which is the first reading of salinity in the example above. Now to make a vector of all the readings for salinity, type the following and press the ENTER key:
A <- c (32.5, 32.6, 31.5, 31.7, 32.8, 33.5, 33.1, 30.5)
Now type ‘A’ in R and press the ENTER key:
> A
[1] 32.5 32.6 31.5 31.7 32.8 33.5 33.1 30.5
That was simple, wasn’t it? Any number of values inside the concatenate command “c ( )”, separated by commas, will form a vector in R. You should be careful with the case of the letter / word. R is case sensitive. So if you type a small letter a instead of the capital letter A, you will end up getting an error as shown below:
> a
Error: object 'a' not found
Let’s now create the whole data set, and this time I’m going to name the variables sensibly:
temperature <- c (10, 10.5, 12, 11.7, 11.9, 10.5, 23.5, 20.5)
salinity <- c (32.5, 32.6, 31.5, 31.7, 32.8, 33.5, 33.1, 30.5)
oxygen <- c (9, 8.3, 6.5, 7.7, 6.8, 6.2, 5.4, 6.6)
Now we have the observations of three parameters in three vectors. Let’s combine them into a single data matrix. To combine the vectors, we use the command “data.frame ( )”, which combines the vectors inside the brackets, column wise, into a data frame.
mydata <- data.frame (temperature, salinity, oxygen)
Now if you call mydata by typing it in R and pressing the ENTER key, you should get:
> mydata
  temperature salinity oxygen
1        10.0     32.5    9.0
2        10.5     32.6    8.3
3        12.0     31.5    6.5
4        11.7     31.7    7.7
5        11.9     32.8    6.8
6        10.5     33.5    6.2
7        23.5     33.1    5.4
8        20.5     30.5    6.6
Wow!! Half the job is done. Now we have to add the attributes, Coast and Location. This is similar to what we did before, but R handles them differently. The difference here is that they are not numbers but words, which are called ‘strings’. In R, strings are written between double quotation marks " ". So in the above example it will be:
coast <- c ("West", "West", "West", "West", "East", "East", "East", "East")
location <- c ("Mumbai", "Mumbai", "Cochin", "Cochin", "Madras", "Madras", "Paradeep", "Paradeep")
A smarter way of doing this is with the repeat command “rep ( )” in R. The first argument of the “rep” command is the character or number that you want to repeat, and the second argument is the number of times you want to repeat it (for arguments, read Part 1). So the same output can be generated in the following way:
coast <- c (rep ("West", 4), rep ("East", 4))
location <- c (rep ("Mumbai", 2), rep ("Cochin", 2), rep ("Madras", 2), rep ("Paradeep", 2))
Now the next step is to combine these attributes with the existing data using the data.frame ( ) command.
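As a possible shortcut, “rep ( )” also accepts an each = argument, which repeats every element of a vector a given number of times. The sketch below compares it with the approach described above; the names coast_long, coast_short and location_short are made up for this illustration.

```r
# The approach from the text: one rep() call per value, joined with c()
coast_long <- c(rep("West", 4), rep("East", 4))

# Same result in one call: repeat each element of the vector 4 times
coast_short <- rep(c("West", "East"), each = 4)

# Two observations per location
location_short <- rep(c("Mumbai", "Cochin", "Madras", "Paradeep"), each = 2)

identical(coast_long, coast_short)   # checks that both ways agree
print(location_short)
```

Either way is fine. The each = form is just less typing when every value is repeated the same number of times.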
mydata <- data.frame (coast, location, mydata)
This should get you the following when you call mydata in R:
> mydata
  coast location temperature salinity oxygen
1  West   Mumbai        10.0     32.5    9.0
2  West   Mumbai        10.5     32.6    8.3
3  West   Cochin        12.0     31.5    6.5
4  West   Cochin        11.7     31.7    7.7
5  East   Madras        11.9     32.8    6.8
6  East   Madras        10.5     33.5    6.2
7  East Paradeep        23.5     33.1    5.4
8  East Paradeep        20.5     30.5    6.6
Yes, now we have done that. R likes to work with numbers, and they are easy to handle. So I would recommend avoiding strings or characters in your data. In this example it is better to code the entries for Coast and Location. Say “1” represents West and “2” represents East in the case of Coast. Similarly, 1, 2, 3 and 4 stand for the locations Mumbai, Cochin, Madras and Paradeep. To modify the data, create the vectors again:
coast <- c (rep (1, 4), rep (2, 4))
location <- c (rep (1, 2), rep (2, 2), rep (3, 2), rep (4, 2))
Combine these vectors with the data as we did before. This time we can’t combine them with the existing mydata, since it already contains the text columns that we want to avoid. So we should combine all the variables together again.
mydata <- data.frame (coast, location, temperature, salinity, oxygen)
This should give you the following:
  coast location temperature salinity oxygen
1     1        1        10.0     32.5    9.0
2     1        1        10.5     32.6    8.3
3     1        2        12.0     31.5    6.5
4     1        2        11.7     31.7    7.7
5     2        3        11.9     32.8    6.8
6     2        3        10.5     33.5    6.2
7     2        4        23.5     33.1    5.4
8     2        4        20.5     30.5    6.6
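Once the data frame exists, a few standard R functions let you check that it has the shape we planned in the first section (8 rows, 5 columns). The sketch below rebuilds the coded data set from this tutorial so that it runs on its own.

```r
# Rebuild the coded data set from the tutorial
coast       <- c(rep(1, 4), rep(2, 4))
location    <- c(rep(1, 2), rep(2, 2), rep(3, 2), rep(4, 2))
temperature <- c(10, 10.5, 12, 11.7, 11.9, 10.5, 23.5, 20.5)
salinity    <- c(32.5, 32.6, 31.5, 31.7, 32.8, 33.5, 33.1, 30.5)
oxygen      <- c(9, 8.3, 6.5, 7.7, 6.8, 6.2, 5.4, 6.6)
mydata <- data.frame(coast, location, temperature, salinity, oxygen)

dim(mydata)       # number of rows, then number of columns
names(mydata)     # the column names
str(mydata)       # the type of each column
mydata$salinity   # one column, pulled out as a vector

# Only the rows from the West coast (coded as 1)
west <- mydata[mydata$coast == 1, ]
print(west)
```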
Great!!! Now let’s see how to get this data into R if it is already in MS Excel.
Write “NA” if you have any missing values in your data. Avoid special characters like ?\@# etc. in your variable names, since they can cause errors or be altered when R reads the file (the # symbol, for example, starts a comment). Don’t keep any space between the words of a single variable name. For example, if air temperature is the variable, you should write “AirTemperature” without any gaps instead of “Air Temperature”. If there is a gap / space, R will read the name as two variables instead of one.
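To see what R does with an “NA” entry, here is a minimal sketch. The three rows of numbers are invented just for this illustration, and the text = argument of read.table is used so that the example runs without any file on disk.

```r
# A small made-up table with one missing oxygen reading, written as NA
raw_text <- "AirTemperature Oxygen
28.5 6.1
29.0 NA
27.8 5.9"

# text = lets read.table() read from a string instead of a file
small <- read.table(text = raw_text, header = TRUE)

is.na(small$Oxygen)                # TRUE marks the missing value
mean(small$Oxygen)                 # gives NA, because one value is missing
mean(small$Oxygen, na.rm = TRUE)   # average of the values that are present
```

Many R functions take na.rm = TRUE in this way, so a properly marked missing value does not stop your analysis.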
The first step is to save your MS Excel file as a TXT file. In R, we can import this file using the “read.table ( )” command / function.
The first argument of this command is the location of the file. Suppose the file is a TXT file in a folder named “descrambler” on the “C” drive of the computer. Then the location of the file is “C:/descrambler/mydata.txt”. If you choose to use backslashes instead of forward slashes, each one must be doubled: “C:\\descrambler\\mydata.txt”. For a CSV file, it would be “C:\\descrambler\\mydata.csv”. But for a CSV file, you will need an additional argument to make it work. I will come to that later in this chapter.
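The two ways of writing a file location can be compared in R itself. This is only a sketch: the “descrambler” folder is the example from the text and most likely does not exist on your computer, so file.exists ( ) will probably answer FALSE for you.

```r
# The folder and file names are the example ones from the text
path_forward <- "C:/descrambler/mydata.txt"
path_back    <- "C:\\descrambler\\mydata.txt"

# file.path() joins the pieces with forward slashes for you
path_built <- file.path("C:", "descrambler", "mydata.txt")
identical(path_built, path_forward)

# Each \\ typed in the code is a single backslash in the actual string
cat(path_back, "\n")

# file.exists() tells you whether R can see the file before you import it
file.exists(path_forward)
```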
Each entry within the round brackets of an R command or function is an argument. By default, the arguments are arranged in a specific order. To see the complete list of arguments of the read.table function, type “?read.table” in R and press the ENTER key.
read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", row.names, col.names,
           as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown")
You can see here that the first argument is “file”, which asks for the location of the file to be imported. The second argument is “header”, which is FALSE by default and tells R that the data don’t have a row containing labels. In our example, the variable names are present in the first row, so we should set this argument to TRUE. Explanations for all the other arguments are given in the help page.
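If you prefer to check the order of the arguments from inside R rather than from the help page, the built-in functions formals ( ) and args ( ) can list them. A small sketch (the name arg_names is made up for this illustration):

```r
# Names of all the arguments of read.table, in their default order
arg_names <- names(formals(read.table))
head(arg_names, 2)            # the first two arguments

# The default value of the header argument
formals(read.table)$header

# args() prints the whole argument list in one go
args(read.table)
```

The exact list depends on your version of R, so it may differ a little from the one printed above.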
If you supply the inputs in the default order, you don’t need to name the arguments. You can change the order only if you name them explicitly. For example:
mydata <- read.table ("C:\\descrambler\\mydata.txt", TRUE)
In this case, the function works perfectly since the inputs are in the right places. But the following won’t work and you will get an error message:
mydata <- read.table (TRUE,"C:\\descrambler\\mydata.txt")
In this case, you have to specify which input belongs to which argument, as follows:
mydata <- read.table (header = TRUE, file = "C:\\descrambler\\mydata.txt")
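You can try the three cases yourself without creating any folder. The sketch below writes the first two rows of our example to a temporary file (tempfile ( ) picks a throw-away location), so the names tmp, by_position, by_name and wrong_order are just for this illustration.

```r
# Write a tiny table to a temporary file so the example runs anywhere
tmp <- tempfile(fileext = ".txt")
writeLines(c("temperature salinity", "10.0 32.5", "10.5 32.6"), tmp)

# Inputs in the default order: file first, header second
by_position <- read.table(tmp, TRUE)

# Inputs in a different order, which works because they are named
by_name <- read.table(header = TRUE, file = tmp)

identical(by_position, by_name)

# Wrong order without names: R takes TRUE as the file and stops with an error.
# tryCatch() turns that error into the word "error" so the script keeps running.
wrong_order <- tryCatch(read.table(TRUE, tmp), error = function(e) "error")
print(wrong_order)
```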
For CSV files, you have to tell R that the values are separated by commas. This can be done with the argument “sep” as follows:
mydata <- read.table ("C:\\descrambler\\mydata.csv", header = TRUE, sep = ",")
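R also has a ready-made function, read.csv ( ), which is read.table ( ) with header = TRUE and sep = "," already set as defaults. A minimal sketch, using a temporary CSV file with two rows taken from our example (the object names are made up for this illustration):

```r
# Write a small CSV file to a temporary location
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("coast,location,salinity",
             "West,Mumbai,32.5",
             "East,Madras,32.8"), tmp_csv)

# The way shown in the text
with_read_table <- read.table(tmp_csv, header = TRUE, sep = ",")

# The shortcut: header and sep are already set for CSV files
with_read_csv <- read.csv(tmp_csv)

all.equal(with_read_table, with_read_csv)   # compares the two results
print(with_read_csv)
```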
So now we have learned how to organize data, create it in R and import it into R if it is available as a TXT or CSV file, or as an Excel spreadsheet saved in one of those formats.
The code we used for the whole exercise is compiled below, and this is how it should appear in Tinn R (read about Tinn R in Part 1). Always write your code in Tinn R and source it to R. When you close R, it will ask you whether to save the workspace. This is not necessary unless you are in a situation where you can’t reproduce the same result by running the R script again. I will come to this in detail in the next chapter. So for the moment, don’t save anything when asked while closing R.
If you create the dataset in R, your Tinn R should look like this:
temperature <- c (10, 10.5, 12, 11.7, 11.9, 10.5, 23.5, 20.5)
salinity <- c (32.5, 32.6, 31.5, 31.7, 32.8, 33.5, 33.1, 30.5)
oxygen <- c (9, 8.3, 6.5, 7.7, 6.8, 6.2, 5.4, 6.6)
#coast <- c ("West", "West", "West", "West", "East", "East", "East", "East")
#location <- c ("Mumbai", "Mumbai", "Cochin", "Cochin", "Madras", "Madras", "Paradeep", "Paradeep")
coast <- c (rep ("West", 4), rep ("East", 4))
location <- c (rep ("Mumbai", 2), rep ("Cochin", 2), rep ("Madras", 2), rep ("Paradeep", 2))
#coast <- c (rep (1, 4), rep (2, 4))
#location <- c (rep (1, 2), rep (2, 2), rep (3, 2), rep (4, 2))
mydata <- data.frame (coast, location, temperature, salinity, oxygen)
print (mydata)
I added a “print ( )” command at the end of the script so that you see the data set immediately after you source the file in R. This saves some time, since you don’t need to type “mydata” and press the ENTER key to see the data set you created.
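Sourcing can also be done by hand from the R console with the source ( ) function. The sketch below writes a two-line script to a temporary file, which stands in for the file you would save from Tinn R, and then runs it. The name script_path is made up for this illustration.

```r
# Write a two-line script to a temporary file; in practice this would be
# the script file you saved from Tinn R
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "salinity <- c (32.5, 32.6, 31.5, 31.7)",
  "print (salinity)"
), script_path)

# Run every line of the script, just as if you had typed it in R
source(script_path)

# The object created by the script is now available in R
length(salinity)
```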
In R, any characters or numbers after the # symbol on a line are ignored. So you can use it to write your own notes within the script. It can also be used to skip a few lines of the script while running it in R, so you don’t need to delete a line that is unwanted now but may be needed in future. Simply add the # symbol at the beginning of the line. In the above Tinn R example, 4 lines are skipped from running. You can always use those lines again by taking off the # symbol. I would suggest that you play around with the above script and see what changes when you alter the code.
Good luck 🙂 and bye for this chapter. I hope this tutorial was helpful, and do leave your feedback.
In the next chapter, we will see how to install a package and how to find help with R.