Many issues of model selection were raised in the preceding subsection. In brief, if there are several candidate genes, how can one select the ``best'' model? That is, the final model should explain a substantial portion of the total variation and include all genes that appear to have a significant effect on the trait in question. However, there is a risk of overfitting by including genes of no, or negligible, effect. In practice, one adopts a parsimonious model that balances simplicity against explanatory power. This subsection considers the problem of deciding how many genes can be supported by the evidence in the data.
Suppose there are 10 candidate genes of known genotypes, but only 2 of them are ``real''. A natural approach to model selection, following the development in the preceding subsections, would fit all 10 single-gene models, select the best, and then fit all two-gene models that include that best gene. This forward selection idea, borrowed from multiple regression, may in fact find the best two. However, if any candidate genes are linked, the correlation between their genotypes across the offspring can interfere with selection of the ``best'' model. For instance, if the two real genes are linked, and a third fake gene is linked to both of them, forward selection could choose only this fake gene, since it may give the best fit of any single candidate. The significance of either of the two real genes, once this ``ghost'' is in the model, may be negligible [Haley and Knott, 1992].
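The forward selection idea can be sketched as follows. This is only an illustration under stated assumptions, not a procedure taken from the cited work: genotypes are coded 0/1 as in a backcross (one parameter per gene), the candidates are simulated as unlinked, and the entry threshold `f_enter` and the helper names are hypothetical choices.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares for y regressed on an intercept plus the chosen genes."""
    cols = np.asarray(cols, dtype=int)
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_select(X, y, f_enter=12.0):
    """Greedy forward selection; f_enter is an assumed F-to-enter threshold."""
    n, m = X.shape
    selected = []
    current = rss(X, y, selected)          # intercept-only model
    while len(selected) < m:
        # Try adding each remaining candidate to the current model.
        trials = {g: rss(X, y, selected + [g])
                  for g in range(m) if g not in selected}
        best = min(trials, key=trials.get)
        df = n - (len(selected) + 1) - 1   # residual df after adding the gene
        f_stat = (current - trials[best]) / (trials[best] / df)
        if f_stat < f_enter:
            break                          # no candidate improves the fit enough
        selected.append(best)
        current = trials[best]
    return selected

# Simulated backcross: 10 unlinked candidates, genes 0 and 1 are "real".
rng = np.random.default_rng(0)
n = 200
X = rng.integers(0, 2, size=(n, 10)).astype(float)
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=n)

selected = forward_select(X, y)
print("genes in order of entry:", selected)
```

With unlinked candidates and large effects, the two real genes should enter first. The ghost-gene problem described above arises only when the columns of `X` are strongly correlated, which this simulation deliberately avoids.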
An alternative approach, advocated by Jansen and Stam [1994], places all 10 candidates in the model initially. The least significant candidate gene (in a Type III adjusted sense) is dropped and the model with the remaining 9 is refit. One proceeds with this backward elimination until all remaining genes have significant effects when adjusted for the others present. This approach seems preferable to forward selection, as it avoids the model bias that arises from omitting minor genes, as mentioned in the previous subsection.
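A minimal sketch of backward elimination, under the same illustrative assumptions (0/1 backcross coding, one parameter per gene, simulated unlinked candidates), might look like the following. For a one-parameter gene, the Type III adjusted test is equivalent to the t statistic of that gene's coefficient in the model containing all current genes. The cutoff `t_keep` and the function names are hypothetical.

```python
import numpy as np

def adjusted_t(X, y):
    """t statistic of each gene's effect, adjusted for all other genes in X."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - A.shape[1])   # residual mean square
    cov = sigma2 * np.linalg.inv(A.T @ A)       # covariance of the estimates
    return beta[1:] / np.sqrt(np.diag(cov)[1:]) # skip the intercept

def backward_eliminate(X, y, t_keep=3.0):
    """Drop the least significant gene until all |t| reach the assumed cutoff."""
    genes = list(range(X.shape[1]))
    while genes:
        t = np.abs(adjusted_t(X[:, genes], y))
        worst = int(np.argmin(t))
        if t[worst] >= t_keep:
            break                               # every remaining gene is significant
        genes.pop(worst)                        # drop it and refit the rest
    return genes

# Simulated backcross: 10 unlinked candidates, genes 0 and 1 are "real".
rng = np.random.default_rng(0)
n = 200
X = rng.integers(0, 2, size=(n, 10)).astype(float)
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=n)

kept = backward_eliminate(X, y)
print("genes retained:", kept)
```

Because every gene is tested in the presence of all the others, a real gene is not masked by a candidate that entered the model earlier, which is the motivation given above for preferring this direction of search.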
The ideal model should have good predictive power for future experiments. A model that overfits the data at hand, by including too many negligible genes, will be a poor predictor because it relies on genes that carry no information about the trait. However, a model that underfits, by leaving out important genes, will be poor because it fails to use valuable information. Parsimony consists in finding a balance between these extremes. One approach is to use a measure of parsimony such as Mallows' $C_p$,
\[
C_p = (2pk - n) + \mathrm{RSS}_p / \mathrm{MSE} ,
\]
in which MSE is the mean square error from the ``full'' model with all 10 candidate genes, $n$ is the number of offspring, $k$ is the number of parameters per gene ($k=1$ for BC and DH breeding systems, $k=2$ for F2), and $p$ is the number of genes considered for a particular model. The residual sum of squares for a model with $p$ genes is denoted by $\mathrm{RSS}_p$. For $p$ too small, $C_p$ is large due to bias. Ideally, $C_p$ and $kp$ should agree. This suggests taking the model with the smallest number of genes $p$ such that $C_p$ and $kp$ are close. Ultimately, this is a judgement call.
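The calculation can be sketched numerically as below. This is an illustration under assumptions not stated in the text: the models include an intercept, the full-model MSE uses $n - kP - 1$ residual degrees of freedom for $P$ candidate genes, and the design matrix stores the $k$ columns of each gene contiguously. The function names and the simulated backcross data are hypothetical.

```python
import numpy as np

def rss(A_genes, y):
    """Residual sum of squares for y on an intercept plus the given gene columns."""
    A = np.column_stack([np.ones(len(y)), A_genes])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def mallows_cp(X_full, y, genes, k=1):
    """Cp = (2*p*k - n) + RSS_p / MSE, with MSE taken from the full model."""
    n = len(y)
    n_genes = X_full.shape[1] // k
    # Assumed residual df for the full model: n - k*P - 1 (intercept included).
    mse = rss(X_full, y) / (n - k * n_genes - 1)
    p = len(genes)
    # Each gene occupies k adjacent columns of the design matrix.
    cols = [k * g + j for g in genes for j in range(k)]
    return (2 * p * k - n) + rss(X_full[:, cols], y) / mse

# Simulated backcross (k = 1): 10 unlinked candidates, genes 0 and 1 are "real".
rng = np.random.default_rng(0)
n = 200
X = rng.integers(0, 2, size=(n, 10)).astype(float)
y = 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=n)

cp_one = mallows_cp(X, y, [0])           # underfit: one real gene missing
cp_two = mallows_cp(X, y, [0, 1])        # both real genes
cp_full = mallows_cp(X, y, list(range(10)))
print(cp_one, cp_two, cp_full)
```

Under these assumptions the full model always gives $C_p = kP - 1$, because the intercept is not counted among the $kp$ gene parameters, so agreement between $C_p$ and $kp$ should be read as approximate. The underfit one-gene model should show a much larger $C_p$ than the two-gene model, reflecting the bias from the omitted gene.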
Detecting more than a few genes requires large effects or many offspring, even if many genes actively affect the trait. That is, it would be unusual to find 3 or more significant genes for a trait measured on 100-200 offspring. It may take thousands of offspring to detect 6 or more genes. This may seem prohibitive, but in fact it is not. Plant breeders would usually want to verify the presence of genes controlling a trait by fixing genes in subsequent generations. Thus today, 2 major genes might be fixed. Next year, two generations later, a QTL analysis could uncover 2 more genes with significant effects in descendants of the fixed lines.