Blog Archives

Can we detect a trend with no data?


[Figure: fish cartoon]

Well, the answer is not an absolute no. You do need a few observations, but a long time series is not necessary. My recent paper discusses the potential application of the Self-Starting CUSUM in fisheries research and investigates whether a fishery can be managed when no historical data are available. The fish cartoon above is a symbolic representation of the key messages.

1. Number of samples: A minimum of three observations can be enough to detect a trend.

2. Reference point: A reference point (a value indicating ‘OK’) is not required to detect a trend.

3. Large fish indicator: The large fish in a population indicate the health of the fish stock.

What is a CUSUM?

CUSUM refers to the Cumulative Sum control chart in ‘Statistical Process Control’ (SPC) theory. CUSUM charts are used for monitoring and help detect whether the observations in a time series deviate consistently from a desired level of performance. The graphical version of SPC is known as a control chart, and the simplest version is the ‘Shewhart’ chart. In a Shewhart chart, the data are monitored by checking whether the observations cross an upper or lower limit (red lines). The example below shows pH observations in an aquarium where the desired level of performance (or reference point) is pH = 7, marked by the blue line (neither acidic nor alkaline).

The CUSUM control chart is a modified version of the Shewhart chart: it computes the deviation of each observation from the reference point and accumulates these deviations up to the last observation. As a result, CUSUM has the advantage of detecting small, gradual shifts in the time series (i.e., trends) soon after they occur (see the graph below).

[Figure: Shewhart and CUSUM control charts of the aquarium pH example]
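As a minimal sketch (the pH readings here are made up for illustration, not taken from the figure), the basic CUSUM computation described above amounts to a one-liner in R:

```r
# Illustrative pH readings from an aquarium (made-up numbers)
ph  <- c(7.1, 6.9, 7.0, 7.2, 7.3, 7.4, 7.5)
ref <- 7  # reference point: the desired level of performance

# CUSUM: cumulative sum of the deviations from the reference point
cusum <- cumsum(ph - ref)
round(cusum, 2)

# A sustained drift above pH = 7 makes the CUSUM climb steadily,
# which is how small, gradual shifts are detected early.
```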

What is a Self-Starting CUSUM?

The SS-CUSUM works on the same principle, but no reference point is used in the computation, i.e., you do not have to say that the desired level of performance is pH = 7 (and hence you do not need that information to detect a trend). Instead, SS-CUSUM uses a ‘running mean’, which is estimated from the data itself and updated as new observations are added to the time series. In the long run, the running mean approaches the reference point, so the SS-CUSUM comes to resemble a basic CUSUM (see the graphs below).
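A simplified sketch of the same idea without a reference point (illustrative data again; the full SS-CUSUM also standardises the deviations, which is omitted here):

```r
# Illustrative observations (made up); no reference point is needed
x <- c(7.0, 7.1, 6.9, 7.2, 7.3, 7.5, 7.6)

# Each new observation is compared against the running mean of all
# PREVIOUS observations, and the deviations are accumulated as in a
# basic CUSUM.
run_mean <- cumsum(x) / seq_along(x)      # running mean after each observation
dev      <- x[-1] - run_mean[-length(x)]  # new observation minus previous mean
ss_cusum <- cumsum(dev)
round(ss_cusum, 2)
```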

[Figure: basic CUSUM versus SS-CUSUM]

Limitation of SS-CUSUM?

Obviously, the running mean will drive you bananas if the initial observations are far from the reference point. However, SS-CUSUM is an excellent method if the user is sure about the distributional properties of the data, i.e., that the initial observations are not outliers.

What is the story of my paper?

In the context of a data-poor situation, my paper investigated whether SS-CUSUM can be used for monitoring indicators such as mean length and mean weight, so that a change in the state of the fish stock can be detected as early as possible. The performance of SS-CUSUM is displayed below in the form of ‘Receiver Operating Characteristic’ (ROC) curves. The closer the apex of an ROC curve is to the upper-left corner, the better SS-CUSUM performs in detecting a change in the state of the stock.

[Figure: ROC curves]

We found that a trend can be detected even if there are only three observations in the time series, and the best performance was obtained with the Large Fish Indicators (LFIs) 🙂

The freq.class function of TFSA


1. Download, Install and Open R

2. Install the R packages lattice and TFSA (R > Packages > Install packages from local zip files). This is required only once. If the package is successfully installed, you will get the message “package ‘TFSA’ successfully unpacked and MD5 sums checked”.

3. Load both packages using the library function.
Example: library(lattice); library(TFSA)

The freq.class function

The freq.class function builds a frequency distribution table from a vector of data by grouping the values into class intervals. The frequency of a class interval is the number of observations that fall in that predefined interval. For example, if 20 fish samples in our data are aged between 3 and 5, the frequency for the 3–5 interval is 20. The endpoints of a class interval are the lowest and highest values that a variable can take within it. The class interval width is the difference between the lower endpoint of an interval and the lower endpoint of the next interval.
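The idea behind the function can be reproduced with base R alone (a rough sketch using cut() and table() on made-up lengths; freq.class wraps this kind of computation and adds the table layout and plotting):

```r
# Made-up fish lengths for illustration
len <- c(131, 140, 152, 155, 163, 168, 171, 174, 182, 195)

# Class intervals from 125 to 200 with a width of 25
breaks <- seq(125, 200, by = 25)
freq   <- table(cut(len, breaks = breaks, right = FALSE))
freq
# → counts of 2, 6 and 2 for [125,150), [150,175) and [175,200)
```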

Mandatory arguments (inputs):

'x'- A vector of data

Mandatory arguments for specifying class interval (inputs):

'll'- Lowest endpoint of class intervals

'ul'- Uppermost endpoint of class intervals

'cl'- Preferred width of the class interval.

The above arguments work only if all three of them are used together. Unless specified, the function categorizes the data into 10 classes.

Optional arguments (inputs):

'groups'- A factor (categorical variable) to use if the data belong to more than one group. Default is NA

'scales'- If TRUE, bar graph for groups will be plotted independent of their relation in size. Default is FALSE

'density.plot'- If TRUE, density graph will be plotted instead of bar graphs. Default is FALSE

Protocol to use freq.class function

1. Prepare data with variables in each column (qualitative and quantitative)

2. Import your data into R using the function read.table(), or, for learning purposes, use the ‘fishdata’ dataset that comes with the TFSA package. This dataset has 506 observations of fish length.

> data(fishdata)
> fishdata

3. Find the lowest and highest values of the variables using the functions max() and min().

> max(fishdata)
[1] 299
> min(fishdata)
[1] 127

4. Decide on the width of the class interval.

In deciding on the width of the class intervals, you will have to find a compromise: intervals short enough that not all of the observations fall in the same interval, but long enough that you do not end up with only one observation per interval.
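One common rule of thumb for striking that compromise is Sturges' rule (a general statistical guideline, not something built into freq.class):

```r
# Sturges' rule: number of classes ≈ 1 + log2(n)
n     <- 506                       # observations in 'fishdata'
k     <- ceiling(1 + log2(n))      # suggested number of classes
width <- ceiling((299 - 127) / k)  # data range from max() and min() above
c(classes = k, width = width)
# For 'fishdata' this suggests 10 classes of width 18,
# matching the function's default of 10 classes.
```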

Usage:

1. To get a quick result, use freq.class() with the data vector. This will categorize data into 10 classes.

> freq.class(fishdata)

 Frequency Distribution Table 

   Lower Upper Midclass Frequency
1    127   144    135.5         1
2    144   161    152.5        18
3    161   178    169.5       104
4    178   195    186.5       137
5    195   212    203.5        63
6    212   229    220.5        68
7    229   246    237.5        57
8    246   263    254.5        45
9    263   280    271.5         8
10   280   297    288.5         4

2. To customize the class intervals, use the arguments ‘ll=’, ‘ul=’ and ‘cl=’. The fish data range from 127 to 299, so let’s set the lowest endpoint to 125 and the highest to 300, with a class interval width of 7.

> freq.class(fishdata,ll=125,ul=300,cl=7)

 Frequency Distribution Table 

   Lower Upper Midclass Frequency
1    125   132    128.5         1
2    132   139    135.5         0
3    139   146    142.5         0
4    146   153    149.5         5
5    153   160    156.5         9
6    160   167    163.5        23
7    167   174    170.5        48
8    174   181    177.5        61
9    181   188    184.5        76
10   188   195    191.5        37
11   195   202    198.5        22
12   202   209    205.5        30
13   209   216    212.5        25
14   216   223    219.5        31
15   223   230    226.5        27
16   230   237    233.5        22
17   237   244    240.5        25
18   244   251    247.5        26
19   251   258    254.5        21
20   258   265    261.5         8
21   265   272    268.5         2
22   272   279    275.5         2
23   279   286    282.5         3
24   286   293    289.5         1
25   293   300    296.5         1

3. If the data belong to different groups, this can be specified with the argument ‘groups=’.

The TFSA package comes with a second example dataset known as ‘fishgrp’, containing fish lengths measured at different commercial landing centres in India. The data have two columns: a factor (landing centre) and a quantitative variable (fish length).

If you are using your own data, make sure it is formatted in the same standard way.

> data(fishgrp)

> str(fishgrp)
'data.frame':   506 obs. of  2 variables:
 $ Stock: Factor w/ 4 levels "Calcutta","Gujarat",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ SL   : int  228 264 244 209 220 202 242 256 249 170 ...

> freq.class(fishgrp$SL,groups=fishgrp$Stock)

 Frequency Distribution Table 

   Lower Upper Midclass Calcutta Gujarat Mumbai Orissa
1    127   144    135.5        0       0      0      1
2    144   161    152.5        1       4      1     12
3    161   178    169.5       36      25      1     42
4    178   195    186.5       81      33      3     20
5    195   212    203.5       15      21     14     13
6    212   229    220.5        0      11     33     24
7    229   246    237.5        0       7     32     18
8    246   263    254.5        0      11     25      9
9    263   280    271.5        0       1      5      2
10   280   297    288.5        0       0      4      0

4. In the above graph, Calcutta has a frequency of over 80 in one particular class. Because of this, the true nature of the frequency distributions from the other locations is not clear. The argument ‘scales=TRUE’ plots the graphs on independent scales, which gives a clearer picture.

> freq.class(fishgrp$SL,groups=fishgrp$Stock,scales=TRUE)

5. If you are interested in a density plot instead of a bar plot, the argument ‘density.plot=’ can be used.

> freq.class(fishgrp$SL,groups=fishgrp$Stock,density.plot=TRUE)

Please leave a comment or email if you find a bug. Thanks for reading 🙂

Developing TFSA package for R


I have been writing R functions since 2009 as part of my PhD in Fisheries Management. Since then, I have been longing to pull the useful functions together into an R package so that others can use them. I named it TFSA (Tropical Fish Stock Assessment) since my intention is to make the functions available to students who work with tropical fisheries, where length data are more reliable than age estimation. TFSA is at a very early stage and under development. A few reasons why I was inspired and motivated to develop TFSA are listed below.

1. There are a few useful R packages (the FLR project, FSA, fishmethods, fsap) for carrying out and evaluating fish stock assessments for research and teaching. However, from my understanding, the utility of these packages in a tropical system (e.g., the Indian fisheries context) is limited, since age data are highly unreliable and difficult to standardise due to the lack of strong seasons. The only reliable data are the sizes of fish in terms of length and weight.

2. The number of R users has been increasing in recent years, and the change is dramatic in marine biology. Perhaps people are realizing the power of R in terms of its flexibility (through seminars or social networking) and the growing number of R books. However, there is no R book written for fisheries research, despite the number of R packages on CRAN.

3. Learning R is not easy compared to user-friendly statistical software like Excel or SPSS, but fisheries biologists now seem to be showing more interest after realising its power and flexibility in modelling. However, the resources for learning R are very limited from a biologist’s perspective. I conducted a training programme in India in April 2012, and all the participants (professors and researchers in biology) except one were unaware of R. The one person who knew of R’s existence had not managed to make use of it for her research work.

I’m looking for people who can contribute to and help with developing this package. Of course, they will be acknowledged in the package description.

The TFSA package is available to download from here. At the moment, TFSA only has the following functions:

1. vbgf – Computes the length of a fish given the parameters (k, t0 and Linf). This is based on the von Bertalanffy growth function.

2. freq.class – Builds a frequency distribution table based on specified class intervals. Useful for building length-frequency tables. The function can also handle groupings (e.g., length data from different months).
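The growth curve behind vbgf is the standard von Bertalanffy equation, L(t) = Linf × (1 − e^(−k(t − t0))). A minimal stand-alone version (the parameter values are made up for illustration, and this sketch need not match vbgf’s exact interface):

```r
# von Bertalanffy growth function: expected length at age t
vbgf_length <- function(t, Linf, k, t0) {
  Linf * (1 - exp(-k * (t - t0)))
}

# Illustrative parameters (not from any real stock)
ages <- 1:5
round(vbgf_length(ages, Linf = 300, k = 0.3, t0 = -0.2), 1)
# Length rises quickly at young ages and approaches the asymptote Linf.
```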

More ideas and suggestions are welcome. A tutorial on using these functions will be published soon.