Easy R for M.F.Sc students – Part 5 (Organizing data)


Welcome to chapter 5 of the Easy R tutorial. Many basic functions in R analyse data only if they are in a particular format and type. So it is important to learn how to convert data from one type to another and how to arrange the rows and columns so that a function works in the desired way.

Dataframe – Matrix – Dataframe

Data frames sometimes get converted to matrices while merging or slicing. You might then need to convert them back to a data frame before using them in a statistical test function in R. I will demonstrate with an example data set.

First, download the data from here: Click

This is a CSV file with details on a few parameters of a fish species sampled in two periods from four locations in India (Mumbai, Madras, Calcutta and Cochin). The readings we have are the sex, total length (tl), standard length (sl), fork length (fl) and weight (wt) of the fish.

Let’s import this file to R now as a data frame. I decided to call the data “fish”.

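# stringsAsFactors = TRUE is needed on R 4.0.0 and later to get the Factor columns shown below
fish <- read.table("C:\\Users\\Pazhayamadom\\Desktop\\easyR.csv", header = TRUE, sep = ",", quote = "", dec = ".", comment.char = "", stringsAsFactors = TRUE)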

Details on the syntax for importing a CSV file to R have been covered in my earlier chapters, so read those first if you don’t understand this line. After importing, the data will be of data frame type.
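If the CSV file is not at hand, the import step can still be tried on a few rows typed inline. This is only a minimal sketch: the three rows are copied from the printed output later in this chapter, and csv_text and fish_demo are names made up here. read.csv( ) is read.table( ) with header = TRUE and sep = "," already preset.

```r
# Three rows copied from the output shown later in this chapter (not the full file)
csv_text <- "location,sampling,sex,tl,sl,fl,wt
Madras,1,M,283,228,251,133.320
Madras,1,F,321,264,287,227.210
Mumbai,1,F,323,256,287,237.160"

# text = reads from a character string instead of a file on disk
# stringsAsFactors = TRUE turns the text columns into factors
fish_demo <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Inspect the imported structure
str(fish_demo)
```

The same str( ) check works on the real file once it has been imported.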

Each column of the data can be extracted by calling the name of the data followed by a $ sign and then the name of the column. For example:

> fish$location
 [1] Madras   Madras   Madras   Madras   Madras   Madras   Madras   Madras
 [9] Madras   Madras   Calcutta Calcutta Calcutta Calcutta Calcutta Calcutta
[17] Calcutta Calcutta Calcutta Calcutta Cochin   Cochin   Cochin   Cochin
[25] Cochin   Cochin   Cochin   Cochin   Cochin   Cochin   Mumbai   Mumbai
[33] Mumbai   Mumbai   Mumbai   Mumbai   Mumbai   Mumbai   Mumbai   Mumbai
[41] Madras   Madras   Madras   Madras   Madras   Madras   Madras   Madras
[49] Madras   Madras   Calcutta Calcutta Calcutta Calcutta Calcutta Calcutta
[57] Calcutta Calcutta Calcutta Calcutta Cochin   Cochin   Cochin   Cochin
[65] Cochin   Cochin   Cochin   Cochin   Cochin   Cochin   Mumbai   Mumbai
[73] Mumbai   Mumbai   Mumbai   Mumbai   Mumbai   Mumbai   Mumbai   Mumbai
Levels: Calcutta Cochin Madras Mumbai

Similarly:

> fish$tl
 [1] 283 321 302 262 274 252 309 316 315 218 282 261 237 260 295 235 207 240 228
[20] 220 230 226 205 215 233 231 238 217 240 225 323 307 330 309 303 315 259 298
[39] 290 300 312 308 292 312 298 275 236 260 282 307 285 306 269 285 253 262 291
[58] 274 305 311 228 230 227 229 221 225 230 220 222 225 224 229 212 203 213 196
[77] 237 240 212 249
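The $ sign is not the only way to pull out a column by name. Here is a small sketch on a toy data frame (fish_demo is a made-up name, with three values copied from the output above) showing three spellings that should all return the same vector.

```r
# Toy stand-in for the fish data, first three fish only
fish_demo <- data.frame(location = c("Madras", "Madras", "Madras"),
                        tl = c(283, 321, 302))

by_dollar  <- fish_demo$tl        # name after the $ sign
by_double  <- fish_demo[["tl"]]   # name inside double square brackets
by_bracket <- fish_demo[, "tl"]   # name in the column position of [row, column]

# All three give the same numeric vector
identical(by_dollar, by_double) && identical(by_double, by_bracket)
```

The bracket forms are handy when the column name is stored in another variable.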

To investigate the structure of a data frame, use the str( ) function. This was also discussed in our earlier chapters.

> str(fish)
'data.frame':   80 obs. of  7 variables:
 $ location: Factor w/ 4 levels "Calcutta","Cochin",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ sampling: int  1 1 1 1 1 1 1 1 1 1 ...
 $ sex     : Factor w/ 3 levels "F","M","U": 2 1 1 2 1 2 2 2 1 2 ...
 $ tl      : int  283 321 302 262 274 252 309 316 315 218 ...
 $ sl      : int  228 264 244 209 220 202 242 256 249 117 ...
 $ fl      : int  251 287 267 227 241 224 268 283 279 186 ...
 $ wt      : num  133.3 227.2 162.7 83.3 101.5 ...

The structure looks fine except for sampling. It is shown as an integer in the output, but we know the sampling period is really a factor. So we need to add an extra column to fish that holds sampling as a factor. Let’s call the new column fsampling.

fish$fsampling<-as.factor(fish$sampling)

This creates a new column named fsampling and appends it to the existing fish data. Now call fish and have a look at the last column. Check the structure of the fish data again and you will see the new column as a factor with two levels.

> str(fish)
'data.frame':   80 obs. of  8 variables:
 $ location : Factor w/ 4 levels "Calcutta","Cochin",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ sampling : int  1 1 1 1 1 1 1 1 1 1 ...
 $ sex      : Factor w/ 3 levels "F","M","U": 2 1 1 2 1 2 2 2 1 2 ...
 $ tl       : int  283 321 302 262 274 252 309 316 315 218 ...
 $ sl       : int  228 264 244 209 220 202 242 256 249 117 ...
 $ fl       : int  251 287 267 227 241 224 268 283 279 186 ...
 $ wt       : num  133.3 227.2 162.7 83.3 101.5 ...
 $ fsampling: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
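A factor stores its values as text labels with hidden integer codes, which matters if you ever convert back to numbers. The sketch below uses made-up sampling years (not part of the fish data) because the difference is easier to see when the values are not simply 1 and 2.

```r
# Made-up sampling years, used only for illustration
sampling_year <- c(2011, 2011, 2012, 2012, 2011)
fyear <- as.factor(sampling_year)

levels(fyear)   # the distinct values, stored as text labels
table(fyear)    # number of observations per level

# as.numeric() on a factor returns the internal level codes, not the years
codes <- as.numeric(fyear)

# Converting through character first recovers the original numbers
back <- as.numeric(as.character(fyear))
```

Here codes holds 1s and 2s while back holds the years again, so going through as.character( ) is the safer route.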

You can also re-arrange the columns of the existing data by defining a new data frame:

fish1 <- data.frame(fish$tl, fish$sl, fish$fl, fish$wt, fish$location, fish$sampling, fish$sex)

Call fish1 and see what the new data frame looks like. Note that the columns are now named fish.tl, fish.sl and so on.
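Another way to re-arrange columns, sketched here on a toy data frame (fish_demo and fish1_demo are made-up names), is to index with a vector of column names. This keeps the original names, whereas building the columns with data.frame( ) and $ changes them.

```r
# Toy stand-in with two fish copied from the chapter's data
fish_demo <- data.frame(location = c("Madras", "Mumbai"),
                        sex = c("M", "F"),
                        tl = c(283, 323),
                        wt = c(133.320, 237.160))

# A character vector of names reorders the columns and keeps the names
fish1_demo <- fish_demo[, c("tl", "wt", "location", "sex")]
names(fish1_demo)

# Building from $ columns works too, but the names pick up the data name as a prefix
renamed <- data.frame(fish_demo$tl, fish_demo$wt)
names(renamed)
```

Which one to use is a matter of taste; indexing by name saves you from renaming columns afterwards.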

Some vector techniques can be applied to data frames as well as matrices to organize your data. They are easy and flexible to handle. However, after using vector techniques, the result might turn out to be a matrix or a plain vector rather than a data frame. In that case you need to convert it back to a data frame before you can use it in functions that expect one, such as many statistical test or graph functions in R.
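To see how a matrix can sneak in, here is a minimal sketch with two short vectors (values copied from the first three fish; the names m and d are made up). Binding plain vectors as columns gives a matrix, and as.data.frame( ) turns it back into a data frame.

```r
tl <- c(283, 321, 302)   # total length of the first three fish
sl <- c(228, 264, 244)   # standard length of the same fish

# Binding plain vectors as columns produces a matrix, not a data frame
m <- cbind(tl, sl)
is.matrix(m)
is.data.frame(m)

# Convert back before passing it to a function that expects a data frame
d <- as.data.frame(m)
is.data.frame(d)
names(d)
```

The column names are taken from the vector names, so d ends up with columns tl and sl.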

Vector techniques

By vector technique, you can easily slice off strips of data from the main data frame. To gather only the standard length of the fish, which is the fifth column:

> fish[,5]
 [1] 228 264 244 209 220 202 242 256 249 117 229 211 193 228 240 192 174 193 184
[20] 178 183 177 167 173 187 186 190 173 194 178 256 248 274 250 252 252 205 247
[39] 236 247 247 246 233 249 238 217 186 204 223 241 232 249 222 236 205 217 236
[58] 217 248 256 184 186 179 184 177 181 183 126 175 183 182 184 171 170 170 163
[77] 193 192 173 199

This is similar to what we did before. In the earlier method, it would have been fish$sl. You can extract any element of a data frame or matrix using [x,y] after the data name, where x corresponds to the row and y corresponds to the column. For example, to extract the total length of the first fish in the data, a male from Madras weighing approx 133 grams:

> fish[1,4]
[1] 283

To extract a full column, for example the second column, keep the x element empty. That tells R to select all rows from the yth column.

> fish[,2]
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[39] 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[77] 2 2 2 2

Similarly, to select all the columns of one particular row of the data, keep the y element empty and specify the row inside the square brackets. For the 25th row of the data:

> fish[25,]
   location sampling sex  tl  sl  fl     wt fsampling
25   Cochin        1   M 233 187 205 51.945         1
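Notice that a single column came back as a plain vector while a single row stayed a data frame. If you want a one-column data frame instead, the drop = FALSE argument of the square brackets does that. A small sketch on toy data (fish_demo is a made-up name):

```r
# Toy stand-in with the first three fish
fish_demo <- data.frame(tl = c(283, 321, 302), sl = c(228, 264, 244))

one_col_vec <- fish_demo[, 2]                # simplified to a plain vector
one_col_df  <- fish_demo[, 2, drop = FALSE]  # stays a one-column data frame
one_row     <- fish_demo[2, ]                # a single row stays a data frame

is.vector(one_col_vec)
is.data.frame(one_col_df)
is.data.frame(one_row)
```

This is useful when the next function in your analysis insists on a data frame.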

You can also create a completely new customized data set. For example, suppose you want only the total length of all observations from Madras and Mumbai in the first sampling period.

In our fish data, the first sampling period ends at the 40th row. So first slice that out. The second step is to slice out the locations and the total length. By vector technique, this is very easy. Here we will use two functions named rbind( ) and cbind( ). The row bind function (rbind) merges all the vectors inside the round brackets as rows and the column bind function (cbind) merges all the vectors inside the round brackets as columns. Let’s try this now.

# This will slice all the rows from 1 to 40 representing first sampling period
first<-fish[1:40,]

Our next step is to slice the data from Madras and Mumbai. The respective rows are 1 to 10 and 31 to 40.

# Slice rows from 1 to 10 from the data 'first' created in the last step
> madras<-first[1:10,]
> madras
   location sampling sex  tl  sl  fl      wt fsampling
1    Madras        1   M 283 228 251 133.320         1
2    Madras        1   F 321 264 287 227.210         1
3    Madras        1   F 302 244 267 162.710         1
4    Madras        1   M 262 209 227  83.263         1
5    Madras        1   F 274 220 241 101.475         1
6    Madras        1   M 252 202 224  73.021         1
7    Madras        1   M 309 242 268 143.946         1
8    Madras        1   M 316 256 283 167.025         1
9    Madras        1   F 315 249 279 152.830         1
10   Madras        1   M 218 117 186  44.190         1
> mumbai<-first[31:40,]
> mumbai
   location sampling sex  tl  sl  fl      wt fsampling
31   Mumbai        1   F 323 256 287 237.160         1
32   Mumbai        1   F 307 248 282 236.566         1
33   Mumbai        1   F 330 274 298 228.312         1
34   Mumbai        1   F 309 250 277 186.554         1
35   Mumbai        1   F 303 252 275 195.396         1
36   Mumbai        1   F 315 252 282 195.471         1
37   Mumbai        1   M 259 205 224 113.226         1
38   Mumbai        1   M 298 247 274 175.207         1
39   Mumbai        1   M 290 236 257 180.908         1
40   Mumbai        1   M 300 247 275 195.635         1

Now we need only the location, total length and sampling period from both of these data sets, which are columns 1, 4 and 8. So slice again.

> newmadras<-madras[,c(1,4,8)]
> newmadras
   location  tl fsampling
1    Madras 283         1
2    Madras 321         1
3    Madras 302         1
4    Madras 262         1
5    Madras 274         1
6    Madras 252         1
7    Madras 309         1
8    Madras 316         1
9    Madras 315         1
10   Madras 218         1

Do the same for the mumbai data.

> newmumbai<-mumbai[,c(1,4,8)]
> newmumbai
   location  tl fsampling
31   Mumbai 323         1
32   Mumbai 307         1
33   Mumbai 330         1
34   Mumbai 309         1
35   Mumbai 303         1
36   Mumbai 315         1
37   Mumbai 259         1
38   Mumbai 298         1
39   Mumbai 290         1
40   Mumbai 300         1

The next job is to combine both of them into one single data set using rbind( ).

> newfish<-rbind(newmadras,newmumbai)
> newfish
   location  tl fsampling
1    Madras 283         1
2    Madras 321         1
3    Madras 302         1
4    Madras 262         1
5    Madras 274         1
6    Madras 252         1
7    Madras 309         1
8    Madras 316         1
9    Madras 315         1
10   Madras 218         1
31   Mumbai 323         1
32   Mumbai 307         1
33   Mumbai 330         1
34   Mumbai 309         1
35   Mumbai 303         1
36   Mumbai 315         1
37   Mumbai 259         1
38   Mumbai 298         1
39   Mumbai 290         1
40   Mumbai 300         1
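The same result can usually be reached in one step by describing the rows you want instead of counting them. This is a sketch on a toy data frame (fish_demo, keep and newfish_demo are made-up names, with a few values picked from the chapter's data), not a replacement for the steps above.

```r
# Toy stand-in: five fish from different locations and sampling periods
fish_demo <- data.frame(
  location = c("Madras", "Calcutta", "Mumbai", "Madras", "Mumbai"),
  sampling = c(1, 1, 1, 2, 2),
  tl       = c(283, 282, 323, 312, 224)
)

# TRUE for rows from the first sampling period that are from Madras or Mumbai
keep <- fish_demo$sampling == 1 & fish_demo$location %in% c("Madras", "Mumbai")

# Rows chosen by the logical vector, columns chosen by name
newfish_demo <- fish_demo[keep, c("location", "tl")]
newfish_demo
```

The advantage is that this still works if the rows of the file are not sorted by period and location.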

The cbind( ) function works in a similar way but binds data or vectors as columns. Try it with the same madras and mumbai data sets and see what turns up!

Check the result with the str( ) function.
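As a rough idea of what to expect, here is a sketch with two tiny data frames (madras_demo, mumbai_demo and side are made-up names, with values copied from the first two fish of each location). Binding two data frames side by side gives a data frame, and the column names simply repeat.

```r
# Two toy data frames with the same number of rows
madras_demo <- data.frame(location = c("Madras", "Madras"), tl = c(283, 321))
mumbai_demo <- data.frame(location = c("Mumbai", "Mumbai"), tl = c(323, 307))

# Bind them side by side as columns
side <- cbind(madras_demo, mumbai_demo)

str(side)     # still a data frame, now with four columns
names(side)   # the names location and tl appear twice
```

Repeated column names make fish$tl style access ambiguous, so it is worth renaming the columns if you keep such a result.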

What if you want to choose only the male fish? In the earlier case it was easy, since the required data was already in order. For males, it would be tedious to count and pick each row. In such cases we have another function called order( ), which gives the row order that sorts the data by one particular variable. Let’s order our fish data using the order function.

> order1<-order(fish$sex)

> newfish<-fish[order1,]

Try the above and you will get new data named “newfish” ordered according to the sex of the fish. In fact, ordering is not really required to slice out only the males. You can skip that step with the following trick.

newfish<-fish[fish$sex=="M",]

This trick is telling R to choose only those rows in which the sex column of fish data is equal to “M”. Try and see the results.
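Two small extensions of the ideas above, sketched on toy data (fish_demo, by_sex_tl and males are made-up names): order( ) accepts more than one variable, and subset( ) is another way to write the row selection.

```r
# Toy stand-in: the first four fish from the chapter's data
fish_demo <- data.frame(sex = c("M", "F", "F", "M"),
                        tl  = c(283, 321, 302, 262))

# Sort by sex first, then by total length from largest to smallest
by_sex_tl <- fish_demo[order(fish_demo$sex, -fish_demo$tl), ]
by_sex_tl

# subset() picks rows by a condition written with bare column names
males <- subset(fish_demo, sex == "M")
males
```

subset( ) reads nicely at the console, while the square bracket form is the one you will see most often inside other people's code.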

If by any chance your data is not a data frame, you can convert it using the as.data.frame function. Similarly, you can convert a data frame to a matrix using the as.matrix function. Let’s create a matrix to demonstrate this (note that this overwrites the fish data imported earlier):

> fish<-matrix(c(1:130),10,13)
> fish
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
 [1,]    1   11   21   31   41   51   61   71   81    91   101   111   121
 [2,]    2   12   22   32   42   52   62   72   82    92   102   112   122
 [3,]    3   13   23   33   43   53   63   73   83    93   103   113   123
 [4,]    4   14   24   34   44   54   64   74   84    94   104   114   124
 [5,]    5   15   25   35   45   55   65   75   85    95   105   115   125
 [6,]    6   16   26   36   46   56   66   76   86    96   106   116   126
 [7,]    7   17   27   37   47   57   67   77   87    97   107   117   127
 [8,]    8   18   28   38   48   58   68   78   88    98   108   118   128
 [9,]    9   19   29   39   49   59   69   79   89    99   109   119   129
[10,]   10   20   30   40   50   60   70   80   90   100   110   120   130

Now convert to data frame using as.data.frame function:

> fish<-as.data.frame(fish)
> fish
   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
1   1 11 21 31 41 51 61 71 81  91 101 111 121
2   2 12 22 32 42 52 62 72 82  92 102 112 122
3   3 13 23 33 43 53 63 73 83  93 103 113 123
4   4 14 24 34 44 54 64 74 84  94 104 114 124
5   5 15 25 35 45 55 65 75 85  95 105 115 125
6   6 16 26 36 46 56 66 76 86  96 106 116 126
7   7 17 27 37 47 57 67 77 87  97 107 117 127
8   8 18 28 38 48 58 68 78 88  98 108 118 128
9   9 19 29 39 49 59 69 79 89  99 109 119 129
10 10 20 30 40 50 60 70 80 90 100 110 120 130

Notice the difference by checking them with the str( ) function. Let’s convert it back to matrix now.

> fish<-as.matrix(fish)
> fish
      V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
 [1,]  1 11 21 31 41 51 61 71 81  91 101 111 121
 [2,]  2 12 22 32 42 52 62 72 82  92 102 112 122
 [3,]  3 13 23 33 43 53 63 73 83  93 103 113 123
 [4,]  4 14 24 34 44 54 64 74 84  94 104 114 124
 [5,]  5 15 25 35 45 55 65 75 85  95 105 115 125
 [6,]  6 16 26 36 46 56 66 76 86  96 106 116 126
 [7,]  7 17 27 37 47 57 67 77 87  97 107 117 127
 [8,]  8 18 28 38 48 58 68 78 88  98 108 118 128
 [9,]  9 19 29 39 49 59 69 79 89  99 109 119 129
[10,] 10 20 30 40 50 60 70 80 90 100 110 120 130

Now check again with the str( ) function. The str function is very useful when dealing with data frames.
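One thing to watch for: a matrix can hold only one type of value. The all-numeric example above converts cleanly, but a data frame that mixes text and numbers, like the real fish data, becomes a matrix of text. A minimal sketch (mixed, m_all and m_num are made-up names):

```r
# Toy data frame mixing a text column and a numeric column
mixed <- data.frame(location = c("Madras", "Mumbai"), tl = c(283, 323))

# Converting everything forces the numbers to become text
m_all <- as.matrix(mixed)
is.character(m_all)

# Converting only the numeric column keeps the values numeric
m_num <- as.matrix(mixed[, "tl", drop = FALSE])
is.numeric(m_num)
```

So pick out the numeric columns first if you need a matrix for calculations.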

That is all for now. This chapter will be the most important one when you start analyzing your own data. In the learning phase, most of your time will be lost on getting the data into the right format for analysis, and you would otherwise spend a lot of time on the internet looking for ways to do things similar to what I have explained in this chapter. So enjoy the notes and do leave feedback, so that I know how useful they are for you. Thanks 🙂

About Deepak George Pazhayamadom

I'm a fish biologist and a mathematical modeller. I have a wide range of research interests, mostly centered on fisheries resource management.

Posted on February 6, 2012, in Bio-Statistics and Analysis.