Organizing data for analysis in R

Many basic functions in R analyse data only if it is stored as a data frame (see my chapter about Data Types in R). It is important to understand how rows and columns should be arranged while preparing your data for analysis.

This is not specific to R: most analytical software works only if the data is arranged in the format described below.
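As a minimal sketch of what a data frame looks like, the snippet below builds a small one by hand. The column names and values are made up for illustration and are not taken from the example data.

```r
# Hypothetical values, for illustration only
fish <- data.frame(
  Location = c("Mumbai", "Cochin", "Madras"),
  TL = c(12.4, 15.1, 13.8),  # total length
  BW = c(20.5, 31.2, 25.0)   # body weight
)

class(fish)  # "data.frame"
dim(fish)    # number of rows (observations), then number of columns (variables)
str(fish)    # one line per column, showing its type
```

Each column holds one variable and each row holds one observation, which is the arrangement this chapter is about.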

Important steps:

1. Open a new MS Excel file to prepare the data.

2. Enter the data on one sheet only.

3. Enter each variable in its own column.

4. Enter each observation in a separate row.

5. Enter ‘NA’ for missing values or empty observations. Some people use zero instead, but a zero is a recorded value and does not mean the observation is missing. Zero is appropriate in presence-absence data, where presence is represented by ‘1’ and absence by ‘0’ (e.g. infected by a virus or not?).

6. Delete all empty columns and rows that were used at least once before (very important). Cells that once held data can still be exported as blank rows or columns.

7. Save the MS Excel sheet as a .txt or .csv file (CSV stands for Comma Separated Values).
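Once the file is saved, it can be read into R as a data frame. The sketch below is only an illustration: it writes a tiny stand-in CSV with made-up values to a temporary file (so that it runs anywhere) and reads it back with `read.csv()`. The column names are hypothetical.

```r
# Write a tiny stand-in CSV to a temporary file (hypothetical values)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Location,TL,Infected",
  "Mumbai,12.4,1",
  "Cochin,NA,0",
  "Madras,13.8,NA"
), csv_path)

# read.csv() returns a data frame; the text "NA" is read as a missing value by default
dat <- read.csv(csv_path, header = TRUE)

str(dat)                              # check the column types
colSums(is.na(dat))                   # number of missing values in each column
mean(dat$TL)                          # NA, because one value is missing
mean(dat$TL, na.rm = TRUE)            # mean of the recorded values only
sum(dat$Infected == 0, na.rm = TRUE)  # zeros are real observations (absence)
```

This is why step 5 matters: a missing length entered as 0 would be averaged in as a real measurement, whereas ‘NA’ is recognised as missing and can be excluded with `na.rm = TRUE`.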


Download the example data from here: Click

The data is in a CSV file (Comma Separated Values) and can be opened with any text editor or MS Excel.

The attributes or qualitative variables in this data are Location, Sampling period and Sex (categorical data). The data also contain a few quantitative variables, i.e. Total Length (TL), Standard Length (SL), Fork Length (FL) and Body Weight (BW) of a fish species. Both qualitative and quantitative variables are arranged in columns. Each fish is an observation, so the details of each fish are entered in a single row.

1. The categories in ‘Location’ are the four places from which the fish were collected (Mumbai, Madras, Calcutta and Cochin).

2. The categories in ‘Sampling period’ are 1 and 2, representing the two sampling months, i.e. October and February. This was done intentionally, since analysis is often easier with numbers than with the text itself.

3. The categories in ‘Sex’ are Male, Female and Indeterminate. As mentioned before, it is easier to analyse the data if the categories are represented by numbers: Male = 1, Female = 2 and Indeterminate = 3.
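One caveat with numeric codes is that R treats them as ordinary numbers unless told otherwise. A minimal sketch of converting such codes into categories with `factor()` is shown below. The vectors hold made-up values and the variable names are hypothetical; only the coding scheme follows the description above.

```r
# Hypothetical coded values, following the coding described above
period_code <- c(1, 2, 2, 1)
sex_code    <- c(1, 2, 3, 2)

# factor() maps each numeric code to a readable category label
period <- factor(period_code, levels = c(1, 2),
                 labels = c("October", "February"))
sex <- factor(sex_code, levels = c(1, 2, 3),
              labels = c("Male", "Female", "Indeterminate"))

levels(period)  # the category labels, in code order
table(sex)      # number of fish in each category
```

After the conversion, functions such as `table()` and `summary()` report counts per category instead of treating the codes as measurements.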


About Deepak George Pazhayamadom

I'm a fish biologist and a mathematical modeller. I have a wide range of research interests, mostly centered on fisheries resource management.

Posted on October 13, 2012, in Uncategorized.
