Predict Directive

What are associated factors?

Sometimes there is a hierarchical structure to factors which needs to be recognised. Common examples are Genotypes grouped into Families and Locations grouped by Region. We call these associated factors. Special care is required with associated factors, especially if prediction is required (see !ASSOCIATE qualifier). The key characteristic of associated factors is that they are coded such that the levels of one are uniquely nested in the levels of another. If one is unknown (coded as missing), all associated factors must be unknown for that data record. It is typically unnecessary to interact associated factors except when required to adequately define the variance structure.

Predicting with associated factors

It is necessary to correctly associate the levels of associated factors when predicting them or averaging over them.

!ASSOCIATE factors facilitates prediction when the levels of one factor group or classify the levels of another. factors is a list of factors in the model which have this hierarchical relationship. Typical examples are 1000 individually named lines which represent 100 families, typically with unequal numbers of lines per family, or a total of 100 trials conducted across three regions in a total of 17 locations.

Declaring factors as associated allows ASReml to combine the levels of the factors appropriately. In the preceding example, when predicting a trial mean, ASReml adds the effects of the location and region where the trial was conducted. When identifying which levels are associated, ASReml checks that the association is strictly hierarchical. That is, each location is associated with only one region, and each trial with only one location. If a level code is missing for one component, it must be missing for all.
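The two conditions described above (strict nesting, and missing codes occurring together) can be illustrated outside ASReml. The following Python sketch is not ASReml's own check; the function names is_nested and missing_consistent are hypothetical, and the factor codes are those of the 15-trial example in Table 1, with None assumed to stand for a missing code.

```python
# Factor codes for the 15 trials of Table 1 (None would mark a missing code).
trial = list(range(1, 16))
location = [1, 1, 2, 2, 2, 3, 4, 4, 5, 5, 5, 6, 6, 7, 8]
region = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]


def is_nested(child, parent):
    """True if every child level occurs with exactly one parent level."""
    seen = {}
    for c, p in zip(child, parent):
        # Remember the first parent seen for this child; any other is a violation.
        if seen.setdefault(c, p) != p:
            return False
    return True


def missing_consistent(*factors):
    """True if, in each record, the codes are either all missing or all present."""
    return all(len({code is None for code in record}) == 1
               for record in zip(*factors))


print(is_nested(trial, location), is_nested(location, region))
print(missing_consistent(region, location, trial))
```

Swapping the arguments, as in is_nested(region, location), fails the check because region 1 contains several locations, which is the sense in which the association has a direction.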

Averaging of associated factors will generally give differing results depending on the order in which the averaging is performed. We explore this with the following extended example. Consider the mean yields from 15 trials classified by region and location in Table 1.

Table 1. Trial means classified by region and location.
trial 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
region 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2
location 1 1 2 2 2 3 4 4 5 5 5 6 6 7 8
yield 10 12 11 12 13 13 11 13 11 12 13 10 12 10 10

Assuming a simplified linear model
yield ~ mu region location trial
the predict statement
predict trial !associate region location trial
will reconstruct the 15 trial means from the fitted trial, location and region effects.

Given these trial means, it is fairly natural to form location means by averaging the trials in each location to get the location means in Table 2.

Table 2. Location means classified by region.
location 1 2 3 4 5 6 7 8
region 1 1 1 2 2 2 2 2
yield 11 12 13 12 12 11 10 10

These are given by
predict location !associate region location trial
or equivalently
predict location !associate region location trial !ASAVERAGE trial
Note that without the !associate clause, ASReml would add the average of all the trial effects into all of the location means, which is not appropriate. With !associate, it knows which trials to average to form each location mean. We use the alternate spelling of the !AVERAGE qualifier name to highlight that this is averaging by association and not simple averaging.
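Averaging by association amounts to averaging only those trials that belong to each location. As a plain arithmetic check on Table 1 (this is a Python sketch, not ASReml code, and the variable names are illustrative), the location means of Table 2 can be rebuilt as follows.

```python
from collections import defaultdict

# Table 1: location code and mean yield for trials 1..15.
location = [1, 1, 2, 2, 2, 3, 4, 4, 5, 5, 5, 6, 6, 7, 8]
trial_mean = [10, 12, 11, 12, 13, 13, 11, 13, 11, 12, 13, 10, 12, 10, 10]

# Collect the trial means that are associated with each location.
by_location = defaultdict(list)
for loc, y in zip(location, trial_mean):
    by_location[loc].append(y)

# Average within each location only, never across all trials.
location_mean = {loc: sum(ys) / len(ys)
                 for loc, ys in sorted(by_location.items())}

for loc, m in location_mean.items():
    print(loc, m)
```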

However, for region means, we have a choice. We can average the trial means in Table 1 according to region, obtaining region means of 11.83 and 11.33, or we can average the location means in Table 2 to get region means of 12 and 11.

The former is the default in ASReml, produced by
predict region !associate region location trial
or equivalently by
predict region !present region location trial
We call this base averaging.

The latter implies sequential or hierarchical averaging and is given by
predict region !assoc region location trial !ASAVE location

Similarly, an overall hierarchical mean of 11.5 is given by
predict mu !assoc region location trial !ASAVE reg locat trial
while
predict mu !assoc region location trial !ASAVE reg
gives a value of 11.58, being the average of the region means 11.83 and 11.33 obtained by averaging trials within regions from Table 1, and
predict mu !associate region location trial !ASAVE location
predicts mu as 11.38, the average of the 8 location means in Table 2.
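The dependence on averaging order can be verified by hand from Table 1. The following Python sketch only repeats that arithmetic on the tabulated means and is not ASReml code; the helper group_means and the variable names are made up for illustration.

```python
# Table 1: region, location and mean yield for trials 1..15.
region = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]
location = [1, 1, 2, 2, 2, 3, 4, 4, 5, 5, 5, 6, 6, 7, 8]
trial_mean = [10, 12, 11, 12, 13, 13, 11, 13, 11, 12, 13, 10, 12, 10, 10]


def group_means(values, groups):
    """Average values within each level of groups; returns {level: mean}."""
    out = {}
    for g in sorted(set(groups)):
        members = [v for v, k in zip(values, groups) if k == g]
        out[g] = sum(members) / len(members)
    return out


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Base averaging: trials averaged directly within region.
region_base = group_means(trial_mean, region)

# Hierarchical averaging: trials -> locations -> regions.
loc_mean = group_means(trial_mean, location)
loc_region = dict(zip(location, region))  # each location has one region
region_hier = group_means(list(loc_mean.values()),
                          [loc_region[loc] for loc in loc_mean])

# Three routes to an overall mean.
mu_hier = mean(region_hier.values())      # trials -> locations -> regions -> mu
mu_region = mean(region_base.values())    # trials -> regions -> mu
mu_location = mean(loc_mean.values())     # trials -> locations -> mu

print(region_base, region_hier)
print(mu_hier, mu_region, mu_location)
```

The three overall means differ only because the groups are unbalanced; with equal numbers of trials per location and locations per region they would coincide.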

Further discussion of associated factors

The user may specify their own weights, using file input if necessary. The statement
predict region ... !ASAVE location {1 2 3}/6 {1 1 1 2 1}/6
would give region predictions of 12.33 and 10.83 respectively, derived from the location predictions in Table 2. Note that because location is nested in region, the location weights must sum to 1.0 within levels of region. The alternate form of the !AVE (!ASAVE) qualifiers allows the weights to be read from a file which the user can create elsewhere. Thus the code
!ASAVERAGE trial 'Tweight.csv',2
will read the weights from the second field of file Tweight.csv. Without the column specification, ASReml reads all the values in the file. The user must ensure the weights are in the coding order ASReml uses (trial order in this instance, given in the .sln file or by using the TABULATE command).
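Before supplying weights, it can help to confirm that they sum to one within each region and to see what they imply for the Table 2 location means. The Python sketch below is an illustration only, not ASReml code. It assumes that {1 2 3}/6 denotes the weights 1/6, 2/6, 3/6 applied to locations in coding order, and the function name weighted_region_means is hypothetical.

```python
from fractions import Fraction

# Table 2: location means and the region of each location, in location order.
loc_mean = [11, 12, 13, 12, 12, 11, 10, 10]
loc_region = [1, 1, 1, 2, 2, 2, 2, 2]

# Assumed reading of {1 2 3}/6 {1 1 1 2 1}/6: one weight per location.
weights = [Fraction(w, 6) for w in (1, 2, 3, 1, 1, 1, 2, 1)]


def weighted_region_means(means, regions, weights):
    """Weighted mean per region; rejects weights that do not sum to 1."""
    totals, weight_sums = {}, {}
    for m, r, w in zip(means, regions, weights):
        totals[r] = totals.get(r, 0) + w * m
        weight_sums[r] = weight_sums.get(r, 0) + w
    bad = [r for r, s in weight_sums.items() if s != 1]
    if bad:
        raise ValueError(f"weights do not sum to 1 within region(s) {bad}")
    return totals


result = weighted_region_means(loc_mean, loc_region, weights)
for r, value in result.items():
    print(r, float(value))
```

Exact fractions are used so that the sum-to-one test is not affected by floating-point rounding.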

It was noted that all !ASSOCIATE factors are included in the hyper table. If the lowest stratum is random, it may be appropriate to ignore it. Omitting it from the !ASSOCIATE list will allow it to reenter the Ignore set. Specifying it with the !IGNORE qualifier will exclude its effects from the prediction but not ignore the structural information implied by the association.

Normally it is not necessary for any model term to involve more than one of the associated factors. One exception is if an interaction is required so that the variance can differ between sections. For example, fitting the terms at(region).trial as random effects would allow the trials in region 1 to have a different variance component to those in region 2. Prediction in these cases is more complicated and has only been implemented for this specific case and the analogous region.trial case. The associated factors must occur together in this order for the prediction to give correct answers.

The !ASSOCIATE effect (with base averaging) can usually be achieved with the !PRESENT qualifier, except when the factors have so many levels that the product of levels exceeds 2 147 000 000. !PRESENT fails in this case because the KEY for identifying the cells present is a simple combination of the levels and is stored as a normal (32-bit) integer. However, !ASSOCIATE is preferred because it formally checks that there is an associated structure as well as allowing averaging at a higher level.
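The limit quoted above is close to the largest signed 32-bit integer, 2 147 483 647. As a rough guide, the product of the level counts can be compared with that bound before relying on !PRESENT. The Python sketch below does only that comparison; it does not reproduce how ASReml builds its key, the function name present_key_fits is hypothetical, and the second set of level counts is invented purely to show a failing case.

```python
import math

INT32_MAX = 2**31 - 1  # 2147483647, largest signed 32-bit integer


def present_key_fits(level_counts):
    """True if the product of the factor level counts fits in 32 bits."""
    return math.prod(level_counts) <= INT32_MAX


# Level counts from the regions/locations/trials example on this page.
print(present_key_fits([3, 17, 100]))

# Invented counts for two large factors: the product is 2 500 000 000.
print(present_key_fits([50000, 50000]))
```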

Two !ASSOCIATE clauses may be specified, for example
PRED entry !ASSOC family entry !ASSOC reg loc trial !ASAVE reg loc

Only one member of an !ASSOCIATE list may also appear in a !PRESENT list. If one member appears in the classify set, only that member may appear in the !PRESENT list. For example
yield ~ region !r region.family entry
PREDICT entry !ASSOCIATE family entry !PRESENT entry region
Association averaging is used to form the cells in the PRESENT table and PRESENT averaging is then applied.
