Prédiction de l'interaction génotype × environnement par linéarisation et régression PLS-mixte, by Ibnou DIENG, Université Montpellier II, doctoral thesis, 2007
Appendix B: Article in C. R. Biologies
http://france.elsevier.com/direct/CRASS3/
Biomodélisation / Biological modelling

Linéarisation autour d'un témoin pour prédire la réponse de cultures

Ibnou Dieng a,*, Éric Gozé b, Robert Sabatier c
a Centre d'étude régional pour l'amélioration de l'adaptation à la sécheresse, BP 3320, Thiès-Escale, Thiès, Sénégal
* Corresponding author. E-mail addresses: ibnou.dieng@ceraas.org, dieng_ibnou@yahoo.fr (I. Dieng), eric.goze@cirad.fr (É. Gozé), sabatier@univ-montp1.fr (R. Sabatier).

Received 18 April 2005; accepted 17 January 2006

Résumé (English translation)
A new method for modelling genotype × environment interactions: APLAT. The yield of genotypes predicted by a crop-simulation model is expanded as a first-order Taylor series in the neighbourhood of the parameter vector of a reference genotype. With this local linearisation, the parameters of these genotypes are estimated by linear regression of the observed yields on the sensitivity of the crop-simulation model outputs to the parameters.
To cite this article: I. Dieng et al., C. R. Biologies 329 (2006).
© 2006 Académie des sciences. Published by Elsevier SAS. All rights reserved.

Abstract
Prediction of crop response by linearisation about control approximation. A new method for modelling genotype × environment interaction: APLAT. The yield predicted by a crop-simulation model is expanded as a Taylor series in the neighbourhood of the parameter vector of a control genotype. With this local linearisation, the parameters of the other genotypes can be estimated by a linear regression of the observed yield on the derivatives of the crop-simulation model predictions with respect to its parameters.

Keywords: Linearisation; Crop response prediction; Control; Genotype × environment interaction

Abridged English version
In the Sahel, genotype × environment interactions are often large, which is the justification for multilocation and multi-year trials.
Because of these sizeable environment effects and interactions, the prediction of an expected yield with a linear mixed model is generally imprecise. The prediction can be improved by modelling the environment effect with a crop-simulation model such as DHC, IRSIS, SarraH, etc., which shifts part of that effect from the random part to the fixed part of the mixed model. This is not possible with empirical methods for analysing genotype × environment interaction such as AMMI and joint regression, which do not make use of environmental variables. Factorial regression does use environmental variables, but it requires their effect on production to be linear, which might not be the case.

Unfortunately, most crop-simulation models have a number of parameters whose estimation requires a specific and costly experiment. As a consequence, these parameters are usually known only for a small set of reference genotypes. It would not be sensible to invest in a parameter-estimation experiment for every new genotype proposed for selection. To overcome this problem, note that multisite experiments usually share a control variety whose parameters have already been estimated. In this paper, we propose to expand the modelled response as a Taylor series about the parameters of this control genotype. The parameters of the other genotypes can then be estimated by a linear regression of the observed yields on the sensitivities to the parameters, i.e., on the derivatives of the response with respect to the parameters. With these estimates, one can predict the responses of the new genotypes in environments where they have not been tested. At a given location, historical climate records can additionally be used to estimate a distribution of probable yields.

Let $f(Z_j, \theta_i)$ denote the yield of genotype $i$ predicted by a crop-simulation model in environment $j$, and $Y_{ij}$ the observed yield.
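We can write:

$$Y_{ij} = f(Z_j, \theta_i) + \xi_j + u_{ij}$$

where $Z_j$ is the vector of regressors of the $j$th environment and $\theta_i$ the $P$-vector of parameters of the $i$th genotype. The bias $\xi_j$ is that of the crop-simulation model. We suppose that it depends only on the environment and is the same for all genotypes in a given environment. The error term $u_{ij}$ is supposed random with expectation 0 and variance $\sigma_u^2$.

Let us consider a control genotype, i.e., a genotype whose parameters are known or at least already estimated. Let $\theta_0$ be the vector of parameters of this control genotype, and let us suppose that $f$ is of class $C^1$ in a neighbourhood of $\theta_0$ and that $f'$ is differentiable in this neighbourhood. Moreover, let us suppose that $\theta_i$ lies in this neighbourhood of $\theta_0$. Then a Taylor series expansion yields:

$$f(Z_j, \theta_i) = f(Z_j, \theta_0) + \sum_{p=1}^{P} \left[ \left. \frac{\partial f}{\partial \theta^{(p)}} \right|_{\theta=\theta_0,\, Z=Z_j} \times \left( \theta_i^{(p)} - \theta_0^{(p)} \right) \right] + O\left[ (\theta_i - \theta_0)'(\theta_i - \theta_0) \right]$$

with $\theta_i^{(p)}$ the $p$th component of the parameter vector of the $i$th genotype and $\theta_0^{(p)}$ that of the control genotype.

Let $X_j^{(p)} = \left. \frac{\partial f}{\partial \theta^{(p)}} \right|_{\theta=\theta_0,\, Z=Z_j}$ and $\beta_i^{(p)} = \theta_i^{(p)} - \theta_0^{(p)}$ for $p = 1, \ldots, P$. As $f$ is not known in closed form, its derivatives have to be evaluated numerically. $X_j^{(p)}$ is a function of environment $j$ only, while $\beta_i^{(p)}$ is a function of genotype $i$. Subtracting the control response from that of genotype $i$ in the same environment cancels the bias $\xi_j$, and the local linearisation yields:

$$Y_{ij} - Y_{0j} = \sum_{p=1}^{P} X_j^{(p)} \cdot \beta_i^{(p)} + \varepsilon_{ij}$$

where $Y_{0j}$ is the control response in environment $j$ and $\varepsilon_{ij} = u_{ij} - u_{0j}$. So $E(\varepsilon_{ij}) = 0$, $V(\varepsilon_{ij}) = 2\sigma_u^2$ and $\mathrm{Cov}(\varepsilon_{ij}, \varepsilon_{i'j'}) = 0$ for $j \neq j'$, but $\mathrm{Cov}(\varepsilon_{ij}, \varepsilon_{i'j}) = \sigma_u^2$ for $i \neq i'$, because two differences taken in the same environment share the control error $u_{0j}$. This equation can be put in the form of a linear model with correlated errors:

$$Y - (Y_0 \otimes 1_I) = X \cdot \beta + \varepsilon$$

In this equation, $Y$ is the vector of responses of the $I$ genotypes in the $J$ environments, $Y_0' = (Y_{01} \cdots Y_{0J})$, and $1_I$ is the unit vector of size $I \times 1$. The symbol $\otimes$ indicates the Kronecker product and $\varepsilon$ is a random error vector. Its covariance matrix is $\sigma_u^2 \Omega$ where:

$$\Omega = \begin{pmatrix} \omega_1 & & & & 0 \\ & \ddots & & & \\ & & \omega_j & & \\ & & & \ddots & \\ 0 & & & & \omega_J \end{pmatrix} \quad \text{and} \quad \omega_j = \begin{pmatrix} 2 & 1 & \cdots & 1 \\ 1 & 2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 1 \\ 1 & \cdots & 1 & 2 \end{pmatrix}$$

The numbers of columns of the square matrices $\Omega$ and $\omega_j$ are, respectively, the number of observations over all the environments and the number of observations for environment $j$. Also,

$$X = \left[ X^{(1)} \otimes I_I \; \cdots \; X^{(P)} \otimes I_I \right]$$

where $X^{(p)} = [X_1^{(p)} \cdots X_J^{(p)}]'$ is a $J \times 1$ vector and $I_I$ is the $I \times I$ unit matrix. The dimension of $X$ is then $IJ \times PI$. Finally, $\beta' = [\beta^{(1)\prime} \cdots \beta^{(P)\prime}]$ where $\beta^{(p)\prime} = [\beta_1^{(p)} \cdots \beta_I^{(p)}]$.

We call this method APLAT, for Approximation Par Linéarisation Autour d'un Témoin. Because of the large number of columns of $X$, a dimension-reduction method such as Partial Least Squares (PLS) regression is necessary. The dimension of the space spanned by the regressors is then reduced from the rank of $X$ to $k$, the number of PLS components. PLS regression is usually carried out with the NIPALS (Nonlinear estimation by Iterative Partial Least Squares) algorithm, in which the components are computed through a set of regressions by ordinary least squares. Here the error covariance matrix is $\sigma_u^2 \Omega$, not $\sigma_u^2 I_{IJ}$, so generalized least squares should be used instead. As $\Omega$ is symmetric and positive definite, a workaround consists in factorizing its inverse, i.e., finding a matrix $\eta$ such that $\eta'\eta = \Omega^{-1}$. Estimating $\beta$ by PLS with generalized-least-squares regressions is then equivalent to applying PLS with ordinary-least-squares regressions to the transformed model:

$$\eta Y - \eta (Y_0 \otimes 1_I) = \eta X \cdot \beta + \eta \varepsilon$$

The resulting estimate is denoted $\hat\beta_{PLS}$. The number of components is chosen to minimize the PRESS (Prediction Error Sum of Squares). To obtain confidence intervals for the coefficients, we used a bootstrap technique. Let $z_{i,PLS}^{(p)*b}$ be the random variable defined by:

$$z_{i,PLS}^{(p)*b} = \frac{\hat\beta_{i,PLS}^{(p)*b} - \hat\beta_{i,PLS}^{(p)}}{s^*\left( \hat\beta_{i,PLS}^{(p)*b} \right)}$$

where $\hat\beta_{i,PLS}^{(p)}$ is the $(p \cdot i)$th element of $\hat\beta_{PLS}$, $\hat\beta_{i,PLS}^{(p)*b}$ is the estimate obtained at the $b$th bootstrap draw, $b = 1, \ldots, B$, and $s^*(\hat\beta_{i,PLS}^{(p)*b})$ is its standard error. Let $\hat F_B$ be the empirical distribution function of the $z_{i,PLS}^{(p)*b}$. The fractile $\hat F_B^{-1}(\alpha)$ is estimated by $\hat t(\alpha)$ such that $\#\{ z_{i,PLS}^{(p)*b} \leq \hat t(\alpha) \} = \alpha B$.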
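The fractile rule above amounts to picking an order statistic of the B studentised bootstrap replicates. The following is a minimal Python sketch under stated assumptions: the function name `percentile_t_interval` and its arguments are hypothetical, the bootstrap estimates and their standard errors are taken as given arrays (how they are obtained from the PLS fit is not shown), and the numbers in the usage lines are invented. Its last step combines the two fractiles into the standard percentile-t interval.

```python
import math
import numpy as np

def percentile_t_interval(beta_hat, se_hat, beta_boot, se_boot, alpha=0.05):
    """Percentile-t bootstrap interval for one coefficient (illustrative sketch).

    beta_hat, se_hat   : estimate and standard error on the original sample
    beta_boot, se_boot : length-B arrays of bootstrap estimates and of their
                         standard errors (assumed to be computed elsewhere)
    """
    beta_boot = np.asarray(beta_boot, dtype=float)
    se_boot = np.asarray(se_boot, dtype=float)
    # Studentised replicates z*b = (beta*b - beta_hat) / s*(beta*b), sorted.
    z = np.sort((beta_boot - beta_hat) / se_boot)
    n_draws = z.size

    def fractile(a):
        # The ceil(a*B)-th order statistic, so that #{z*b <= t} = a*B when
        # a*B is an integer; the small offset guards against round-up noise.
        k = math.ceil(a * n_draws - 1e-9)
        k = min(max(k, 1), n_draws)
        return z[k - 1]

    t_low, t_high = fractile(alpha), fractile(1.0 - alpha)
    # Standard percentile-t form: the upper fractile gives the lower bound.
    return beta_hat - se_hat * t_high, beta_hat - se_hat * t_low


# Toy usage with made-up numbers (not data from the article).
z_grid = np.arange(-50, 50) / 10.0   # 100 values: -5.0, -4.9, ..., 4.9
lower, upper = percentile_t_interval(
    beta_hat=1.0, se_hat=0.5,
    beta_boot=1.0 + 0.5 * z_grid, se_boot=np.full(100, 0.5),
    alpha=0.05,
)
print(lower, upper)
```

Because the replicates are studentised, the interval need not be symmetric about the estimate: in the toy usage the replicates are deliberately skewed towards negative values, which pushes the upper bound further from the estimate than the lower one.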
A percentile-t confidence interval for the $(p \cdot i)$th element of $\beta$ then takes the following form:

$$\left[ \hat\beta_{i,PLS}^{(p)} - s\left( \hat\beta_{i,PLS}^{(p)} \right) \cdot \hat t(1 - \alpha), \;\; \hat\beta_{i,PLS}^{(p)} - s\left( \hat\beta_{i,PLS}^{(p)} \right) \cdot \hat t(\alpha) \right]$$

To evaluate the quality of the new model, we compared its MSEP (Mean Squared Error of Prediction) with that of the average model, defined for our data as follows:

$$Y_{ij} = m + g_i + E_j + \delta_{ij}$$

where $m$ is the population mean and $g_i$ the genotype effect. The term $E_j$ is the year effect; it is assumed random with expectation 0 and variance $\sigma_E^2$. The errors $\delta_{ij}$ are distributed independently with expectation 0 and variance $\sigma_\delta^2$. The terms $E_j$ and $\delta_{ij}$ are assumed to be mutually independent.

The data set consists of yields of 26 groundnut genotypes. The experiments were carried out at Bambey (14°42'N, 16°28'W) in Senegal over five years, from 1994 to 1998. The data of each year were held out in turn as a test sample. Yields are expressed in kilograms of pods per hectare. We used SarraH, a crop-simulation model developed by CIRAD in collaboration with CERAAS, to calculate $X$. Given the amount of data available, we estimated two of its varietal parameters.

The PRESS is minimal with six components for the models fitted without the data of 1994, 1995 and 1997. For each of the others, the PRESS is minimal with nine components. However, we decided to keep only five components, as the PRESS was then not very different from its minimum value. The APLAT MSEPs are lower than the average-model MSEP, except for the prediction of the 1998 data. The yield prediction by APLAT was thus better than that of the average model four times out of five.

With the APLAT method, predicting a genotype in a new environment comes at a relatively low price, since it uses mostly available data. The exception is the environmental data, which have to be recorded at every site of the experiment according to the needs of the crop-simulation model. The method seems promising, but requires additional studies with larger data sets.

1. Introduction

In the Sahel, the genotype × environment interactions observed in multilocation and multi-year trials are generally large. For the mean responses per variety and per environment, the linear model generally adopted is:

$$Y_{ij} = m + g_i + E_j + (gE)_{ij} + e_{ij} \qquad (1)$$

where $Y_{ij}$ is the response of genotype $i$ in environment $j$, $m$ the overall mean and $g_i$ the fixed effect of genotype $i$. The effect $E_j$ of environment $j$ and the interaction $(gE)_{ij}$ can be fixed or random. When the aim is to predict genotype responses over the whole set of potential environments for which they are intended, the random viewpoint is more relevant. We therefore assume that these two effects and the error term $e_{ij}$ are random, iid and mutually independent, with $E(E_j) = E[(gE)_{ij}] = E(e_{ij}) = 0$ and $V(E_j) = \sigma_E^2$, $V[(gE)_{ij}] = \sigma_{gE}^2$ and $V(e_{ij}) = \sigma_e^2$, where $E(\cdot)$ and $V(\cdot)$ denote expectation and variance.

Choosing a genotype $i$ in an environment $j$ requires estimating its expected performance in $j$. The precision of this estimate depends on $\sigma_E^2$, $\sigma_{gE}^2$ and $\sigma_e^2$. In this part of the Sahel the environment is variable, i.e., $\sigma_E^2$ and $\sigma_{gE}^2$ are large, which degrades this precision. One way to improve it is to model the variations of $Y_{ij}$ as a function of the environment using crop-simulation models such as DHC [1], IRSIS [2], SarraH [3], etc. Part of the random environment effect is thereby transferred to the fixed part of the model. This approach is not possible with the classical models of genotype × environment interaction. Indeed, the AMMI method (Additive Main effects and Multiplicative Interactions) [4] and joint regression [5,6] do not take new environments into account when predicting genotype responses there.
Factorial regression [4,5] does take them into account, but assumes that the environmental variables act linearly on production, which is not certain. However, most parameters of crop-simulation models are known only for a small number of genotypes, because evaluating them requires a specific experiment and costly measurements. The objective of this study can therefore be stated as follows: how can the behaviour of genotypes in new environments be predicted, taking those environments into account, without excessive cost?
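To make the question concrete, here is a minimal numerical sketch of the linearisation idea outlined in the abridged English version: sensitivities of a model around the control genotype's parameters are computed by finite differences, and the yield differences to the control are regressed on them. Everything named here is an assumption for illustration only. `toy_crop_model` is a made-up saturating response (not SarraH or any real crop model), the parameter values and rainfall-like environments are invented, the data are noise-free, a single new genotype is fitted, and ordinary least squares stands in for the PLS and generalized-least-squares estimation used in the article.

```python
import numpy as np

def toy_crop_model(rain, theta):
    """Hypothetical saturating yield response; NOT SarraH or any real model."""
    return theta[0] * rain / (theta[1] + rain)

def sensitivities(model, envs, theta0, rel_step=1e-4):
    """Central finite differences for X_j^(p) = df/dtheta^(p) at theta0.

    Assumes every component of theta0 is non-zero (relative step size).
    """
    theta0 = np.asarray(theta0, dtype=float)
    X = np.empty((len(envs), theta0.size))
    for p in range(theta0.size):
        h = rel_step * abs(theta0[p])
        up, down = theta0.copy(), theta0.copy()
        up[p] += h
        down[p] -= h
        X[:, p] = [(model(z, up) - model(z, down)) / (2.0 * h) for z in envs]
    return X

# Invented environments (one rainfall-like regressor each) and parameters.
envs = np.array([200.0, 300.0, 400.0, 500.0, 600.0, 700.0])
theta_control = np.array([3000.0, 300.0])             # known control genotype
theta_new = theta_control + np.array([30.0, 3.0])     # "unknown" new genotype

# Noise-free simulated yields of the control and of the new genotype.
y_control = np.array([toy_crop_model(z, theta_control) for z in envs])
y_new = np.array([toy_crop_model(z, theta_new) for z in envs])

# Regress the yield differences on the sensitivities to estimate
# beta = theta_new - theta_control (OLS here, as a stand-in).
X = sensitivities(toy_crop_model, envs, theta_control)
beta_hat, *_ = np.linalg.lstsq(X, y_new - y_control, rcond=None)
theta_hat = theta_control + beta_hat
print(theta_hat)
```

The recovered parameters are only approximate even without noise, because the first-order expansion ignores the curvature of the model; the approximation degrades as the new genotype moves further from the control.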