DIGRESSION ON R FUNCTION CALLED GL Analysis of variance with Replications R program with example gl GL lower case. Generate factors by specifying the pattern of their levels. Usage: gl(n, k, length = n*k, labels = 1:n, ordered = FALSE) Arguments: n: an integer giving the number of levels. k: an integer giving the number of replications. length: an integer giving the length of the result. labels: an optional vector of labels for the resulting factor levels. ordered: a logical indicating whether the result should be ordered or not (do not specify this as TRUE). END OF DIGRESSION ----- - - - #page 660 of the Hawkes-Marsh text (on reserve in Library) #banana grower has 3 fertilizers to choose from #he believes it matters whether the plot is on North, S, W or East side # or direction since sunlight etc might differ #Tree yields are tabulated row FertA FertB FertC S 53 31 58 N 48 47 53 W 50 48 56 E 50 47 54 #we collect all yields row-wise into a long array of length 4 times 3. y=c(53, 31, 58, 48, 47, 53, 50, 48, 56, 50, 47, 54) #there are 3 fertilizers with following labels with 12 in all #gl is a command in R to Generate factors by specifying the #pattern of their levels. #first number in gl is No of levels, second is replications and last the length # 3,1,12 here we have 3 levels, only 1 replication and 12 length Fert=gl(3,1,12,labels=c("FertA","FertB","FertC")) #column effects #there are 4 directions S, N W and E with 12 in all direction=gl(4,1,12,labels=c("South","North","West","East")) #row effects data.frame(y,Fert,direction) # y Fert direction following is not correct representation of data #1 53 FertA South #2 31 FertB North #this is wrong #3 58 FertC West #4 48 FertA East #5 47 FertB South #6 53 FertC North #7 50 FertA West #8 48 FertB East #9 56 FertC South #10 50 FertA North #11 47 FertB West #12 54 FertC East direction2=sort(direction) #we need to sort the direction values now the data are correct #data.frame(y,Fert,direction2) # y Fert direction2 #1 53 FertA South #2 31 FertB South #3 58 FertC South #4 48 FertA North #5 47 FertB North #6 53 FertC North #7 50 FertA West #8 48 FertB West #9 56 FertC West #10 50 FertA East #11 47 FertB East #12 54 FertC East plot(Fert,y) #figure shows "between group variability" not so big #compared to "within group variability" #The Fertilizer differences may be due to chance variation plot(direction2,y)#figure shows between diff not so big #compared to within group variability #blocking by the direction of plots is not that useful anova(lm(y~Fert+direction2)) #Analysis of Variance Table # #Response: y # Df Sum Sq Mean Sq F value Pr(>F) #Fert 2 290.667 145.333 4.3168 0.06893 . #direction2 3 26.250 8.750 0.2599 0.85195 #Residuals 6 202.000 33.667 #Fertilizer is along columns their F has large P value >0.05 we accept Mull #Null for columns is that all columns have indistinguishable means #location plot is along rows their F has large P value >0.05 we accept Null #Null for columns is that all rows have indistinguishable means #page 661 of the Hawkes-Marsh text #Comparison of on-time performance of airports and airlines AlineA AlineB AlineC AlineD AirportA 87 82 79 81 AirportB 88 84 81 82 AirportC 89 84 83 82 AirportD 90 86 85 83 #suppose we entered y numbers by columns as follows, it is OK. y=c(87,88,89,90, 82,84,84,86, 79,81,83,85, 81,82,82,83) Aline=gl(4,1,16, labels=c("AirlineA","AirlineB","AirlineC","AirlineD")) #following Aline has sort and gl command together #I find it convenient to apply the sort to ROW characteristic Aport=sort(gl(4,1,16, labels=c("AirportA","AirportB","AirportC","AirportD"))) data.frame(y,Aline,Aport) plot(Aline,y) plot(Aport,y) anova(lm(y~Aline+Aport)) #Aports and Alines are jointly significantly different anova(lm(y~Aline)) # test if Alines are significantly different? NO anova(lm(y~Aport)) # test if Aports are significantly different? Yes If you are confused about gl and setup of y, you can use the following trick. You fill the y values only after defining the two factor variables and sorting one of them. For the above example we begin with counting the number of cells. With 4 rows and 4 columns we have 16 cells Now we write a temporary place-holder statement for y y=101:116 #these are not actual y values but temporary values y Aline=gl(4,1,16, labels=c("AlineA","AlineB","AlineC","AlineD")) Aline Aport=gl(4,1,16, labels=c("portA","portB","portC","portD")) Aport Aport2=sort(Aport) #define new object Aport2 for sorted values #Remember to use the Aport2 in the data.frame function of R data.frame(y,Aline,Aport2) #important to have Aport2 not Aport1 > data.frame(y,Aline,Aport2) y Aline Aport2 1 101 AlineA AportA 2 102 AlineB AportA 3 103 AlineC AportA 4 104 AlineD AportA 5 105 AlineA AportB 6 106 AlineB AportB 7 107 AlineC AportB 8 108 AlineD AportB 9 109 AlineA AportC 10 110 AlineB AportC 11 111 AlineC AportC 12 112 AlineD AportC 13 113 AlineA AportD 14 114 AlineB AportD 15 115 AlineC AportD 16 116 AlineD AportD looking at above table it is clear which particular numbers from the data table should replace the temporary y values 101 to 116. AlineA AlineB AlineC AlineD AirportA 87 82 79 81 AirportB 88 84 81 82 AirportC 89 84 83 82 AirportD 90 86 85 83 for example, 101 should be 87, 102 should be 82 etc along first row. Once the y array and two factors for rows and columns are defined, it is a simple matter to do the regression: reg1=lm(y~Aline+Aport2) and then do summary(reg1) anova(reg1) anova(lm(y~Aline+Aport))#test if # Aports and Alines are jointly significantly different anova(lm(y~Aline)) # test if Alines are significantly different anova(lm(y~Aport)) # test if Aports are significantly different All the computational drudgery is removed by lm and anova functions of R. ANSWER to Questions by Students Question 1) There are three possible anova's: lm(y~Row) lm(y~Column) lm(y~Column+Row) The output numbers seem inconsistent. For example, the F value given by (y~Row) is different from the F value listed for Row under (y~Column+Row). Answer to Question 1: The F values are supposed to be different. They are certainly not inconsistent. Think of this as in regression of y on x and z. The coefficients and t statistics will be different when you regress y on x alone and y on x and z. Remember that F stats for individual slopes are simply squares of t-stats. Question 2) If I wanted to determine whether the Row effect was significant, do I use the F-value (or P-value) listed under lm(y~Row) or the one for lm(y~Column+Row)? Answer to Question 2: Use the latter lm(y~Column+Row). lm(y~Row) alone ignores the existence of Column effects. Column effects might be like "control" factors. One wants to focus on Row effects without being unduly influenced by Column effects. Again this is similar to regression. Regression coefficients estimate the effect on one while holding the effect of the other variable constant. Question 3) Is it correct that Pr(>F) value is greater than the alpha, the null should be accepted? Answer to Question 3) The null hypothesis is that the effect is zero (null). Yes it is correct that when the p-value exceeds alpha (=0.05) we accept the null.