Considering the whole of the past thirty years, it is fairly safe to say that studies showing improved longevity in animal models have a terrible track record when it comes to the reproduction of findings. Small gains in life expectancy reported in one study promptly evaporate when other groups attempt to replicate them. Very few approaches to slowing aging can be reliably reproduced, and the most well-studied of those, calorie restriction, is probably the cause of many of the early failures. It used to be the case that all too few researchers controlled for the effects of calorie restriction: it is easy to make animals eat more or less as a consequence of pharmaceutical interventions, and the results due to changed calorie intake are larger than the results due to most of the interventions tested. The Interventions Testing Program has spent much of the last fifteen years demonstrating that most prior mouse studies of interventions thought to modestly slow aging should be taken with a grain of salt. The same is true of studies in lower animals, species with much shorter life spans in which length of life is affected to a greater proportional degree by environmental influences.
The authors of the open access paper below try to put some numbers to the difficulties involved in picking out small changes in the aging process due to an experimental intervention. The animals involved tend to have quite variable life spans, and are very prone to life expectancy changes based on the details of their environment. This state of affairs requires larger numbers of animals and better statistical approaches to have any confidence in sifting out useful data. But the wrong conclusions are drawn, I think. The point of view of these researchers is that the way forward is to keep on chasing small effects on aging, and to improve the state of experimental design in order to make it more practical to find those small effects.
This is a ridiculous position. What should in fact happen is for the research community to put aside the lines of work that produce only small and erratic effects, stop digging into the biochemistry of exercise and calorie restriction, and focus instead on biotechnologies with results that are reliable, reproducible, and large enough to be clearly identified even given the challenges. Today that means senolytics capable of clearing senescent cells, cell therapies, amyloid clearance, other line items resulting from the SENS approach of damage repair, and little beyond that short list. If the last few decades have taught us anything, it should be that attempts to tinker with the operation of metabolism in order to slightly slow aging by recapturing some of the effects of calorie restriction are expensive, unreliable, and produce only small gains. Why then is this metabolic tinkering with poor outcomes still the primary choice for most of the research community? It makes little sense, at least to those of us interested in the development of working, effective therapies that can produce rejuvenation in old humans.
Over the last few years, science has been plagued by a reproducibility crisis. This crisis has also taken root in the aging research community, with several high-profile controversies regarding lifespan extensions. Frequently cited reasons for the failure of a result to reproduce are substandard technical ability, lack of attention to detail, failure to control environmental factors, or an initial positive result that was a statistical outlier and never real in the first place. One way to address these reproducibility problems would be to list the numerous controversies, attempt to identify the individual underlying causes, and provide a possible explanation for each. This would be a long and arduous task that would result in largely speculative explanations and do little to help resolve future controversies. An alternative way would be to assume that these controversies arise mostly through honest disputes between scientists standing by their results. If so, their frequency would suggest an underlying technical problem with standard practices in the field that fosters such disputes. We decided to take the alternative way and to ask how reproducible lifespan experiments are under ideal conditions, in silico, which allows us to control every environmental and technical aspect.
One important experimental consideration to minimize both false positive and false negative results is the power of detection (POD), or statistical power, of a given experimental design. POD is defined as the probability of correctly rejecting the null hypothesis in favor of the alternative hypothesis. For lifespan experiments, where the null hypothesis is that there is no effect on lifespan, the POD is the probability of correctly detecting a true lifespan extension. Power calculations are a statistical tool to determine whether the experimental design is sufficient to detect the expected effect size. Power calculations are widely used in long-term, expensive mouse experiments and in clinical trials to ensure that the planned experiments have the necessary power to detect the expected effect. However, power calculations are rarely employed in experiments that measure the effects of genetic or environmental perturbations that could affect lifespan in invertebrate model organisms such as C. elegans.
In this study, we asked how POD is influenced by different experimental practices and how likely it is that underpowered experiments lead to scientific disputes between two groups conducting identical experiments. To address these questions, we generated a parametric model based on the Gompertz equation using lifespan data of 5,026 C. elegans. We then used this model to simulate lifespan experiments with different conditions to determine how experimental parameters affect the ability to detect lifespan increases of certain sizes. We considered two important experimental features that contribute to the workload of lifespan experiments: frequency of scoring and number of animals in each cohort. Our data show that the POD is greatly affected by the number of animals in each group, but less so by scoring frequency. We further show how inappropriately powered experiments negatively affect reproducibility. Our results make clear that current standard practices are unlikely to produce consistently reproducible results for real longevity effects below 20%, even under ideal conditions.
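To make the idea of power of detection concrete, the following is a minimal sketch of this kind of simulation, not the authors' actual model or code. It assumes a Gompertz hazard with illustrative parameters `a` and `b` that were chosen by hand to give a median lifespan of roughly 18 days (they are not the values fitted to the 5,026 worms), it assumes that an intervention simply stretches every lifespan by a fixed percentage, and it uses a two-sided log-rank test as the detection criterion. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def gompertz_lifespan(rng, a, b):
    """Draw one lifespan (days) by inverting the Gompertz survival function
    S(t) = exp(-(a / b) * (exp(b * t) - 1))."""
    u = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
    return math.log(1.0 - (b / a) * math.log(u)) / b


def simulate_cohort(rng, n, a, b, extension=0.0, scoring_interval=1.0):
    """Simulate n animals. Lifespans are stretched by `extension` (assumption:
    a simple time-scaling effect) and deaths are only observed at the next
    scoring time point."""
    times = []
    for _ in range(n):
        t = gompertz_lifespan(rng, a, b) * (1.0 + extension)
        times.append(math.ceil(t / scoring_interval) * scoring_interval)
    return times


def logrank_p(control, treated):
    """Two-sided log-rank p-value for two groups with no censoring."""
    d1, d2 = Counter(control), Counter(treated)
    n1, n2 = len(control), len(treated)
    obs_minus_exp, var = 0.0, 0.0
    for t in sorted(set(d1) | set(d2)):
        n = n1 + n2            # animals still at risk just before time t
        d = d1[t] + d2[t]      # deaths scored at time t
        obs_minus_exp += d1[t] - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)
        n1 -= d1[t]
        n2 -= d2[t]
    if var == 0.0:
        return 1.0
    z = obs_minus_exp / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2.0))  # normal approximation


def estimate_power(n_per_group, extension, trials=300, alpha=0.05,
                   a=0.001, b=0.3, scoring_interval=1.0, seed=0):
    """Fraction of simulated experiments in which the test rejects the null.
    The defaults for a and b are illustrative assumptions, not fitted values."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        control = simulate_cohort(rng, n_per_group, a, b, 0.0, scoring_interval)
        treated = simulate_cohort(rng, n_per_group, a, b, extension, scoring_interval)
        if logrank_p(control, treated) < alpha:
            hits += 1
    return hits / trials


def disagreement_probability(power):
    """Chance that two independent labs running the same experiment reach
    opposite conclusions, given that each detects the effect with
    probability `power`."""
    return 2.0 * power * (1.0 - power)


if __name__ == "__main__":
    for n in (20, 50, 100, 200):
        p = estimate_power(n, extension=0.10)
        print(n, round(p, 2), round(disagreement_probability(p), 2))
```

Under these assumptions, varying `n_per_group` and `scoring_interval` gives a rough feel for which of the two matters more for a given effect size. The `disagreement_probability` helper shows why underpowered designs breed disputes: when the power is near 50%, two honest groups running the identical experiment will disagree about half of the time.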