Centre for Biodiversity Theory and Modelling
Station d’Ecologie Théorique et Expérimentale du CNRS
09200 Moulis, France
Phone: +33 (0) 5 61 04 05 84
and its environmental and societal perspectives
I am interested in understanding major evolutionary phenomena, such as the eukaryote/prokaryote transition or the consequences of Snowball Earth for biodiversity, with a particular emphasis on the role of simplification and mutation. From a practical point of view, my expertise is in bioinformatics tools and models of sequence evolution used to infer phylogenomic trees, but my objective is to include as much ecological theory as possible in the evolutionary analysis of genomic data. From a philosophical perspective, I am fascinated by the strong positive correlation between human knowledge of the environment and the degradation of that environment. I am therefore interested in knowing whether this correlation is fortuitous or whether the quest for better knowledge actively contributes to the degradation. To this end, I try to evaluate the environmental footprint of scientific research, which is a direct but insufficient cause, and to look for possible indirect effects of improved knowledge.
Need of transdisciplinary approaches and their implications
The fast pace of ecological and evolutionary changes requires us to study ecological and evolutionary processes simultaneously. Major advances in the acquisition of molecular data, in particular in nucleic acid sequencing, are providing an extraordinary wealth of information that has already revolutionized the way we conduct research. For example, DNA barcoding, metagenomics, comparative genomics and phylogenomics all contribute to a much deeper knowledge of extant biodiversity. Synergistically, the miniaturization of measurement systems supplies us with a flood of ecological data at a scale comparable to that of DNA sequencing. Together, these theoretical and technological advances open up promising perspectives for improving our understanding of Earth’s biodiversity.
Nevertheless, challenges are at least as important as promises, and these can be broadly classified into three categories:
- Technical: Acquiring and managing large amounts of heterogeneous data (genomic sequences, epigenetic modifications, expression levels of genes and proteins, animal behavior, temperatures, etc.) is very difficult, especially when storage alone requires terabytes (and even petabytes) of disk space. One step in particular is becoming increasingly difficult to perform efficiently: error detection and elimination. Historically, optimal error handling has always implied some level of manual control, and abandoning that control is bound to lead to a decrease in scientific quality. That is why erroneous results are beginning to crop up in the literature, for instance in transcriptomics and phylogenomics.
- Conceptual: As with any transdisciplinary research, the specialized background of most researchers comes at the expense of misunderstanding at least some of the domains of interest (here, ecology, evolution, molecular biology, statistics, mathematics, algorithmics). Such narrow training not only prevents the emergence of a holistic view that adequately incorporates the contributions of the diverse disciplines at play, but also generates major mental blocks stemming from incompatible cognitive processes and epistemologies (particularly between naturalists and mathematicians).
- Environmental: Ever more resources (ores, fossil fuels, etc.) are spent to acquire and analyze these high-throughput data, which increasingly pollutes our environment and directly contributes to the erosion of biodiversity.
Reinforcing connections between molecular evolution and ecosystem functioning
Functional constraints are a key concept in molecular evolution for studying natural selection. In practice, however, they are formalized only at the molecular (e.g., 3D structure of proteins) and organismal (e.g., development) levels, and very rarely at the ecosystem level (e.g., symbiotic relationships). This results in major conceptual shortcomings, such as considering the adaptive landscape as relatively fixed over time (landscape versus seascape concepts). Moreover, the overarching prominence of the reductionist approach among molecular biologists generally leads them to assume that characters are independent, and thus to overlook the role of epistasis, which, in my opinion, strongly limits both the interpretation and the exploitation of genomic data.
I plan to use ecological thinking to better account for epistasis in statistical models of molecular evolution, with the aim of reducing tree reconstruction artifacts and improving the detection of positive selection. This modeling could be mechanistic, in the spirit of the models accounting for the 3D structure of proteins that we have developed previously. However, based on this past experience, it will more probably be phenomenological (it will be a matter of adding flexibility in the right place without complicating the model too much).
Taking into account the heterogeneity in macroevolution
The homogeneity of the objects under study is an implicit hypothesis often assumed in molecular evolution and macroevolution, probably due to the strong influence of physics on molecular biology and to the idea that individual variation is unimportant at large evolutionary scales, respectively. Hence, intraspecific polymorphism, the study of which is the "raison d’être" of population genetics, is considered a negligible parameter, or even a nuisance, in macroevolutionary studies. In fact, the heterogeneity inherent to living systems is extremely important, since it is the sine qua non condition for their evolution. Models of sequence evolution presuppose that a mutation has to be fixed in the population before another mutation occurs, which is very far from reality. For example, such an assumption cannot model a deleterious mutation being fixed thanks to a compensatory mutation. My objective is thus to try to incorporate intraspecific polymorphism into the models of molecular evolution used in macroevolutionary studies.
In addition to this intraspecific genetic polymorphism, other types of polymorphism are often neglected: the genetic polymorphism within a multicellular organism, the epigenetic polymorphism, the polymorphism of gene expression level among genetically identical cells and the polymorphism of proteins within a cell. The latter can be generated by errors of the RNA polymerase, of the mRNA maturation machinery (e.g., the spliceosome) and of the ribosome, as well as by post-translational modifications. Its role is currently overlooked, which I think is a mistake. For example, in the insect endosymbiont Buchnera, the lack of fidelity of the RNA polymerase can overcome the frameshift caused by a single nucleotide deletion in the middle of a gene coding for an essential protein. I plan to examine the evolutionary potential of this intracellular polymorphism by relying on codon usage as a proxy (the poorer the codon usage of a gene, the larger the protein polymorphism). Such a study would be inspired by ecology, which has already developed numerous concepts and methods to handle highly heterogeneous systems.
Phylogenomics to solve the last disputed nodes of the Tree of Life
The use of genome-scale data has drastically reduced the effect of stochastic error in phylogenetic inference and has thereby resolved many important questions. However, with phylogenomics, the importance of data errors and, as expected, of systematic errors increases. Data error is mainly caused by contamination and frameshifts, but also by undetected paralogy and by data-processing mistakes of any kind; it can deeply flaw phylogenetic inference, as we have shown in the case of the sister group of land plants. Systematic error is due to the inconsistency of the (necessarily) oversimplified methods we use; it particularly affects long, unbroken branches, such as those of ctenophores, microsporidia, or platyhelminths.
We are working on bioinformatics protocols to detect and correct data errors. The difficulty is that detecting contamination requires prior phylogenetic knowledge (and, more generally, biological knowledge, since a horizontally transferred sequence will behave like a contaminant sequence); we will therefore try to avoid introducing any bias during dataset construction.
To avoid systematic error, we will continue our effort to identify the non-modeled complexities of the evolutionary process (e.g., heteropecilly) that most affect the accuracy of phylogenetic inference. Then, in the short run, we will discard the positions that violate model assumptions the most, and in the long run, we will try to develop new models that handle these properties. These approaches will be applied to the challenging parts of the Tree of Life, such as the position of Acoelomorpha (i.e., are they highly simplified deuterostomes?), of Ctenophora (i.e., did the nervous system appear twice?), of Microsporidia or of algae with complex plastids.
Heterogeneity of the substitution rate and molecular dating
The vast majority of species do not fossilize. Consequently, we need accurate molecular dating approaches for linking diversification/extinction events to geological and paleontological data, which in turn requires adequate modeling of the variation of the substitution rate. It is thus surprising that so little effort has been made in this direction, especially given the high sensitivity of molecular dating methods to species sampling and to paleontological calibrations. Probably under the influence of the molecular clock hypothesis proposed by the neutral theory of Motoo Kimura, almost all modeling efforts in molecular evolution are based on Poisson processes (the same probability of change at all times), these being more or less modulated to accommodate rate changes. However, analysis of evolutionary rates over short intervals of time, in particular morphological rates, shows a very large variance. That is why I plan to develop models of sequence evolution in which the rate of evolution would be allowed to be much more chaotic, so as to improve the accuracy of molecular dating estimates.
These new methods will be applied to the evolution of eukaryotes, for which large amounts of sequence data allow precise inference of both the phylogenetic tree and the diversification dynamics. In this respect, the long stasis at the dawn of the three primary photosynthetic lineages (green plants, red algae and glaucocystophytes) is very intriguing. Albeit impossible to carry out for the moment, correlation with paleontological and geological data (Snowball Earth, oxygenation of the oceans, or diversification of animals) is needed to understand the characteristics of the tree of eukaryotes. This is all the more interesting as the scale of the changes that Earth is presently undergoing (global warming, loss of biodiversity) is comparable only to some of these major events, all quite ancient. The knowledge that we can build from such an analysis will be de facto limited, as is the quantity of molecular and paleontological information on which it would be based. Nonetheless, this insight will be unique, especially because of the conceptual integration of numerous phenomena that took place at a slow pace over a very long time scale.
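The contrast between a Poisson process and a more variable rate can be illustrated with a small simulation. The sketch below is only an illustration, not one of the models planned here: it compares substitution counts under a strict clock with counts whose rate is drawn per lineage from a gamma distribution (an assumption chosen for simplicity), and summarizes each with the variance-to-mean ratio. All function names and parameter values are hypothetical.

```python
import random
import statistics


def poisson_count(mean, rng):
    """Draw a Poisson count by counting unit-rate exponential waiting
    times that fit within an interval of length `mean`."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_counts(n_lineages, mean_rate, duration, rate_shape=None, seed=0):
    """Simulate the number of substitutions along independent lineages.

    rate_shape=None -> strict clock: every lineage evolves at mean_rate.
    rate_shape=k    -> each lineage draws its own rate from a gamma
                       distribution with shape k and mean mean_rate
                       (an illustrative assumption, not a fitted model).
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_lineages):
        if rate_shape is None:
            rate = mean_rate
        else:
            rate = rng.gammavariate(rate_shape, mean_rate / rate_shape)
        counts.append(poisson_count(rate * duration, rng))
    return counts


def dispersion_index(counts):
    """Variance-to-mean ratio; close to 1 for a pure Poisson process."""
    return statistics.pvariance(counts) / statistics.mean(counts)


if __name__ == "__main__":
    strict = simulate_counts(5000, mean_rate=1.0, duration=20.0)
    variable = simulate_counts(5000, mean_rate=1.0, duration=20.0, rate_shape=0.5)
    print("strict clock  :", round(dispersion_index(strict), 2))
    print("variable rate :", round(dispersion_index(variable), 2))
```

Under the strict clock the ratio stays near 1, whereas rate variation pushes it far above 1 while leaving the expected number of substitutions unchanged. This is the kind of excess variance that a Poisson-based dating model cannot absorb.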
Environmental impact of scientific research
The parallel accumulation of large amounts of molecular and ecological data, as well as major developments in information technology, brings the tools needed to answer numerous questions pertaining to biodiversity, from its mere description to its short- and even long-term evolution. At a time when the extraction of natural resources is becoming harder and harder, and thus more and more expensive, this positive prospect is counterbalanced by a negative outcome, i.e., the need for ever more resources (both energy and minerals) not only to make clever use of this new knowledge but also to generate it in the first place! For example, the pollution associated with the mining activities that supply the ores required by our computing infrastructures clearly damages Earth’s biodiversity. Therefore, I wish to search for an answer to this provocative question: does better knowledge of biodiversity serve its protection enough to compensate for the damage caused by its very study?
The first step will consist in estimating the environmental footprint of the acquisition of scientific knowledge on biodiversity. In collaboration with specialists in environmental footprint calculations, we will study the footprint of current phylogenetics, and its evolution since 1859, when Charles Darwin founded this research domain. To allow their wide adoption, we will develop calculation methods that are as simple as possible. Then, to obtain a global estimate, we will strive to convince the research community to indicate in every article on biodiversity the environmental footprint of the published work. My bet is that, by raising the environmental awareness of biodiversity researchers, such formalization would help them reduce the impact of their work more effectively. The second step will consist in estimating the environmental footprint of the storage, and especially of the exploitation, of the knowledge so created, which should be much more complicated. Indeed, beyond increasingly heavy infrastructures, it will be necessary to include the cost of training scientists in ever more complex and multidisciplinary knowledge. Once available, these estimates will be compared to those of specialists in biodiversity preservation, who assess the effects of new protective approaches. Altogether, these two steps should thus allow a cost/benefit analysis of current practices in biodiversity research.
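As an indication of how simple such a calculation method could be, here is a minimal sketch that covers only the electricity used by one computation. It assumes the footprint is approximated as energy consumed multiplied by the carbon intensity of the electricity supply. The class, its fields and the numbers in the example are hypothetical placeholders, not measured values, and the sketch deliberately leaves out hardware manufacturing, mining, storage and training costs discussed above.

```python
from dataclasses import dataclass


@dataclass
class ComputeJob:
    """Hypothetical description of one analysis run on a cluster."""
    core_hours: float          # CPU core-hours consumed by the job
    watts_per_core: float      # assumed average power draw per core (W)
    pue: float                 # power usage effectiveness of the data centre
    grid_kgco2_per_kwh: float  # carbon intensity of the electricity supply


def job_energy_kwh(job: ComputeJob) -> float:
    """Electricity used by the job, including data-centre overhead."""
    return job.core_hours * job.watts_per_core / 1000.0 * job.pue


def job_footprint_kgco2(job: ComputeJob) -> float:
    """Operational carbon footprint of the job, in kg CO2-equivalent."""
    return job_energy_kwh(job) * job.grid_kgco2_per_kwh


if __name__ == "__main__":
    # Placeholder values chosen only to make the arithmetic easy to follow.
    example = ComputeJob(core_hours=1000.0, watts_per_core=10.0,
                         pue=1.5, grid_kgco2_per_kwh=0.1)
    print(job_energy_kwh(example), "kWh")
    print(job_footprint_kgco2(example), "kg CO2e")
```

A per-article figure could then be obtained by summing such values over all the analyses reported, provided the authors record the core-hours and the infrastructure parameters of each run.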