DIGRESSION ON R FUNCTION CALLED GL
Analysis of variance with Replications R program with example
gl GL lower case. Generate factors by specifying the pattern of their levels.
Usage:
gl(n, k, length = n*k, labels = 1:n, ordered = FALSE)
Arguments:
n: an integer giving the number of levels.
k: an integer giving the number of replications.
length: an integer giving the length of the result.
labels: an optional vector of labels for the resulting factor levels.
ordered: a logical indicating whether the result should be ordered or
not (do not specify this as TRUE).
END OF DIGRESSION ----- - - -
#page 660 of the Hawkes-Marsh text (on reserve in Library)
#banana grower has 3 fertilizers to choose from
#he believes it matters whether the plot is on North, S, W or East side
# or direction since sunlight etc might differ
#Tree yields are tabulated
row FertA FertB FertC
S 53 31 58
N 48 47 53
W 50 48 56
E 50 47 54
#we collect all yields row-wise into a long array of length 4 times 3.
y=c(53, 31, 58, 48, 47, 53, 50, 48, 56, 50, 47, 54)
#there are 3 fertilizers with following labels with 12 in all
#gl is a command in R to Generate factors by specifying the
#pattern of their levels.
#first number in gl is No of levels, second is replications and last the length
# 3,1,12 here we have 3 levels, only 1 replication and 12 length
Fert=gl(3,1,12,labels=c("FertA","FertB","FertC")) #column effects
#there are 4 directions S, N W and E with 12 in all
direction=gl(4,1,12,labels=c("South","North","West","East")) #row effects
data.frame(y,Fert,direction)
# y Fert direction following is not correct representation of data
#1 53 FertA South
#2 31 FertB North #this is wrong
#3 58 FertC West
#4 48 FertA East
#5 47 FertB South
#6 53 FertC North
#7 50 FertA West
#8 48 FertB East
#9 56 FertC South
#10 50 FertA North
#11 47 FertB West
#12 54 FertC East
direction2=sort(direction) #we need to sort the direction values now the data are correct
#data.frame(y,Fert,direction2)
# y Fert direction2
#1 53 FertA South
#2 31 FertB South
#3 58 FertC South
#4 48 FertA North
#5 47 FertB North
#6 53 FertC North
#7 50 FertA West
#8 48 FertB West
#9 56 FertC West
#10 50 FertA East
#11 47 FertB East
#12 54 FertC East
plot(Fert,y) #figure shows "between group variability" not so big
#compared to "within group variability"
#The Fertilizer differences may be due to chance variation
plot(direction2,y)#figure shows between diff not so big
#compared to within group variability
#blocking by the direction of plots is not that useful
anova(lm(y~Fert+direction2))
#Analysis of Variance Table
#
#Response: y
# Df Sum Sq Mean Sq F value Pr(>F)
#Fert 2 290.667 145.333 4.3168 0.06893 .
#direction2 3 26.250 8.750 0.2599 0.85195
#Residuals 6 202.000 33.667
#Fertilizer is along columns their F has large P value >0.05 we accept Mull
#Null for columns is that all columns have indistinguishable means
#location plot is along rows their F has large P value >0.05 we accept Null
#Null for columns is that all rows have indistinguishable means
#page 661 of the Hawkes-Marsh text
#Comparison of on-time performance of airports and airlines
AlineA AlineB AlineC AlineD
AirportA 87 82 79 81
AirportB 88 84 81 82
AirportC 89 84 83 82
AirportD 90 86 85 83
#suppose we entered y numbers by columns as follows, it is OK.
y=c(87,88,89,90, 82,84,84,86, 79,81,83,85, 81,82,82,83)
Aline=gl(4,1,16, labels=c("AirlineA","AirlineB","AirlineC","AirlineD"))
#following Aline has sort and gl command together
#I find it convenient to apply the sort to ROW characteristic
Aport=sort(gl(4,1,16, labels=c("AirportA","AirportB","AirportC","AirportD")))
data.frame(y,Aline,Aport)
plot(Aline,y)
plot(Aport,y)
anova(lm(y~Aline+Aport)) #Aports and Alines are jointly significantly different
anova(lm(y~Aline)) # test if Alines are significantly different? NO
anova(lm(y~Aport)) # test if Aports are significantly different? Yes
If you are confused about gl and setup of y, you can use
the following trick. You fill the y values only after
defining the two factor variables and sorting one of them.
For the above example we begin with counting the number
of cells. With 4 rows and 4 columns we have 16 cells
Now we write a temporary place-holder statement for y
y=101:116 #these are not actual y values but temporary values
y
Aline=gl(4,1,16, labels=c("AlineA","AlineB","AlineC","AlineD"))
Aline
Aport=gl(4,1,16, labels=c("portA","portB","portC","portD"))
Aport
Aport2=sort(Aport) #define new object Aport2 for sorted values
#Remember to use the Aport2 in the data.frame function of R
data.frame(y,Aline,Aport2) #important to have Aport2 not Aport1
> data.frame(y,Aline,Aport2)
y Aline Aport2
1 101 AlineA AportA
2 102 AlineB AportA
3 103 AlineC AportA
4 104 AlineD AportA
5 105 AlineA AportB
6 106 AlineB AportB
7 107 AlineC AportB
8 108 AlineD AportB
9 109 AlineA AportC
10 110 AlineB AportC
11 111 AlineC AportC
12 112 AlineD AportC
13 113 AlineA AportD
14 114 AlineB AportD
15 115 AlineC AportD
16 116 AlineD AportD
looking at above table it is clear which particular
numbers from the data table should
replace the temporary y values 101 to 116.
AlineA AlineB AlineC AlineD
AirportA 87 82 79 81
AirportB 88 84 81 82
AirportC 89 84 83 82
AirportD 90 86 85 83
for example, 101 should be 87, 102 should be 82 etc along first row.
Once the y array and two factors for rows and columns are
defined, it is a simple matter to do the regression:
reg1=lm(y~Aline+Aport2)
and then do
summary(reg1)
anova(reg1)
anova(lm(y~Aline+Aport))#test if
# Aports and Alines are jointly significantly different
anova(lm(y~Aline)) # test if Alines are significantly different
anova(lm(y~Aport)) # test if Aports are significantly different
All the computational drudgery is removed by lm and anova functions of R.
ANSWER to Questions by Students
Question 1)
There are three possible anova's:
lm(y~Row)
lm(y~Column)
lm(y~Column+Row)
The output numbers seem inconsistent.
For example, the F value given by (y~Row) is different from
the F value listed for Row under (y~Column+Row).
Answer to Question 1:
The F values are supposed to be different. They are certainly
not inconsistent. Think of this as in regression of y on x and z.
The coefficients and t statistics will be different when you regress
y on x alone and y on x and z. Remember that F stats for individual
slopes are simply squares of t-stats.
Question 2)
If I wanted to determine whether the Row effect was
significant, do I use the F-value (or P-value) listed
under lm(y~Row) or the one for lm(y~Column+Row)?
Answer to Question 2:
Use the latter lm(y~Column+Row).
lm(y~Row) alone ignores the existence of Column effects.
Column effects might be like "control" factors. One wants
to focus on Row effects without being unduly influenced by
Column effects. Again this is similar to regression.
Regression coefficients estimate the effect on one while
holding the effect of the other variable constant.
Question 3)
Is it correct that
Pr(>F) value is greater than the alpha, the null should be accepted?
Answer to Question 3)
The null hypothesis is that the effect is zero (null). Yes it is
correct that when the p-value exceeds alpha (=0.05) we accept the null.