What You Need to Know About Adaptive Trials

July 1, 2006
Pharmaceutical Executive

Adaptive trials aren't just for propeller-heads anymore. They're one of the issues that need to be top-of-mind for the whole executive suite, as a driver of new processes and timelines, as a hot-spot on the budget, and as a battleground where public policy on drug safety and efficacy will be fought out.

Everyone in the industry knows how clinical trials for efficacy are usually done: You take your compound at the dose you worked out in Phase I, establish your "null hypothesis" (typically that your drug is no better than a placebo or the current standard of care) and you start collecting data, in the hopes that you'll be able to reject it. Everything stays carefully blinded. The investigators have no idea what they're administering and the patients have no idea what they're taking until a pre-determined endpoint; to do otherwise would destroy the statistics. At the end, the data are brought together and worked up while everyone waits to see if the efficacy measures reached significance, or, to be more precise, if the data were strong enough to reject the hypothesis of no efficacy at a statistically significant level. If they were, then things go on according to plan, and if they weren't, you can start thinking about a different patient population or dosing protocol and start the process again. (Naturally, you can also think about giving up completely.)

It's been clear for some time that this system is far from optimal. A widely noted survey by Accenture provided some alarming figures a few years ago: Eighty-nine percent of all drug candidates from the initiation of Phase I through FDA approval failed in the clinic. These figures, it should be noted, cover the 1990s, and they are unlikely to have improved significantly since then. The Accenture study provided another reason for worry, suggesting that the primary reasons for failure actually changed during the period surveyed. Pharmacokinetic (PK) and bioavailability problems gave way to efficacy as the largest hurdle, while toxicological problems increased their share. Clearly, any techniques that could give an earlier read on these issues would be valuable.

If the writer's definition of a novel is a long work of fiction that has something wrong with it, the clinician's definition of a trial is a large body of drug data in humans that should have been collected differently—at a different dose, in different patients, with different endpoints, for a different length of time. In too many cases, the chief result of a trial is to show that the trial itself was set up wrong, in ways that only became clear after the data were unblinded. Did the numbers show that your dosage was suboptimal partway into a two-year trial? Too bad—you probably weren't allowed to know that. Were several arms of your study obviously pointless from the start? Even if you knew, what could you do about it without harming the validity of the whole effort? Over the last few years, such concerns have stimulated an unprecedented amount of work on new approaches. Ideas have come from industry, academia, and regulatory agencies (such as FDA's Critical Path initiative). Given the situation, anything has the potential to help—if 10 percent of the trials that fail today were to succeed, that would nearly double today's success rate.
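For readers who want to check that last bit of arithmetic, here is a back-of-the-envelope sketch in Python. It uses only the 89 percent failure figure quoted from the Accenture survey and the hypothetical "10 percent of failures rescued" scenario; the variable names are made up for illustration.

```python
# Figure quoted in the text: 89% of candidates fail between Phase I and approval.
failure_rate = 0.89
success_rate = 1 - failure_rate

# Hypothetical scenario from the text: 10% of today's failures succeed instead.
rescued_fraction = 0.10
new_success_rate = success_rate + rescued_fraction * failure_rate

ratio = new_success_rate / success_rate
print(f"success rate: {success_rate:.1%} -> {new_success_rate:.1%} ({ratio:.2f}x)")
# prints: success rate: 11.0% -> 19.9% (1.81x)
```

In other words, rescuing one failure in ten takes the success rate from about 11 percent to about 20 percent, which is where "nearly double" comes from.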

A common theme in these efforts has been a move toward adaptive clinical trials. The term "adaptive" covers a lot of ground, from methods that are already fairly standard in certain clinical areas to controversial new designs that require sophisticated modeling and massive computing power. They have several things in common: First, they offer a potential solution to pharma's twin problems of too-long trials and too-low success rates. They are, in many cases, a good fit with the goals of FDA's Critical Path initiative and could allow companies to see the benefits of a new generation of biomarkers currently under development. Most important, after decades of adaptive techniques requiring a level of computing power that put them out of reach, they have started to look practical.

All these factors working together mean that adaptive trials aren't just for propeller-heads anymore. In the coming decade or so, they're one of the issues that need to be top-of-mind for the whole executive suite, as a driver of new processes and timelines, as a hot-spot on the budget (where big dollars can be spent—and saved), and as a battleground where public policy on drug safety and efficacy will be fought out.

What are adaptive trials? How do they work? And why do they matter so much? Let's take a look at the basics.

STAGE ONE

The simplest forms of adaptive trials are known as "staged" protocols or "group sequential" trials, and some of them are already well known in clinical oncology and some other indications. The most familiar example is the "3+3" Phase I trial design for finding a maximum-tolerated-dose (MTD). In a 3+3 trial, three patients start at a given dose, and if no dose-limiting toxic effects are seen, three more patients are added to the trial at a higher dose. If there is one instance of limiting toxicity in the first group, three more patients are added at the same dose. If two (or all three) in any cohort show dose-limiting toxicity, the next lower dose is declared to be the maximum tolerated. There are many variations on this sort of trial, with different numbers of stages and varying endpoints, but they tend to work in the same framework, no matter which clinical phase is being addressed. Generally, as with this MTD example, the decision that's being addressed at each stage is whether to continue or stop the study.
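The 3+3 rules are simple enough to simulate in a few lines. The Python sketch below is one reading of the rules as described above, not a validated protocol: the toxicity probabilities are invented, "two or more" is counted across the (up to six) patients treated at a dose, and a trial that clears the top dose simply reports that dose as the MTD.

```python
import random
from collections import Counter

def run_3plus3(true_dlt_probs, rng):
    """Simulate one 3+3 trial.

    true_dlt_probs: hypothetical true probability of a dose-limiting
    toxicity (DLT) at each dose level, lowest dose first.
    Returns the index of the declared MTD, or None if even the lowest
    dose turns out to be too toxic.
    """
    level = 0
    while level < len(true_dlt_probs):
        p = true_dlt_probs[level]
        dlts = sum(rng.random() < p for _ in range(3))   # first cohort of three
        if dlts == 1:
            # One DLT: add three more patients at the same dose.
            dlts += sum(rng.random() < p for _ in range(3))
        if dlts >= 2:
            # Too toxic: the next lower dose is declared the MTD.
            return level - 1 if level > 0 else None
        level += 1                                       # otherwise escalate
    # Assumption: if the top dose is cleared, it is reported as the MTD.
    return len(true_dlt_probs) - 1

rng = random.Random(2006)
probs = [0.05, 0.10, 0.20, 0.35, 0.50]   # made-up dose/toxicity curve
n_sims = 10_000
tally = Counter(run_3plus3(probs, rng) for _ in range(n_sims))
for mtd in [None] + list(range(len(probs))):
    print(mtd, tally[mtd] / n_sims)
```

Running many simulated trials against a known toxicity curve shows how widely the declared MTD can scatter from one trial to the next.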

Essentially, group-sequential trials are fragmented versions of the classic trial design, giving the investigators more opportunities for decision points rather than waiting to see the whole picture at the end. Helpful as that is in theory, in practice, there are some concerns. One particularly important issue: The endpoints and number of patients at each stage have to be chosen to ensure that there is sufficient statistical power to actually answer the questions the trial is supposed to answer at each stage. For example, the sample size in the first stage needs to be large enough to give a low probability of a false-negative result, that is, of halting the trial of a compound that was actually efficacious. On the other hand, it is important to guarantee that the number of patients in the control groups is large enough to ensure that trials of non-efficacious compounds terminate swiftly, especially in later stages. In fact, it is well known that simple staged designs like the 3+3 are statistically underwhelming: They persist mainly because there's no consensus on what to replace them with.

Part of this problem is built into the very nature of staged trials: When the data are examined at several stages, each additional look is another chance for an error, so the overall false-positive and false-negative rates of a study inevitably grow. When a trial is broken into three parts, for example, the chances for a false positive readout can more than double. In 1989, Richard Simon of the National Cancer Institute proposed templates for staged trials that maximized the power for both positive and negative determinations, but the patient requirements on each side can be rather different, and sometimes mutually exclusive. Compromises are common, and several graphical and numerical methods have recently been proposed for finding designs that minimize sample sizes while maintaining as much statistical power as possible.
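The inflation from repeated looks is easy to see in simulation. The Python sketch below assumes a deliberately simple setup that is not taken from any particular trial: a drug with no effect at all, equally sized stages, and a naive rule that declares success whenever the running z-statistic crosses the usual 1.96 threshold at any look. The function name and defaults are illustrative.

```python
import math
import random

def false_positive_rate(n_looks, n_per_stage=50, n_trials=20_000,
                        z_crit=1.96, seed=0):
    """Estimate how often a no-effect drug 'wins' when tested at every look."""
    rng = random.Random(seed)
    # The sum of n_per_stage standard-normal outcomes is normal with this sd.
    stage_sd = math.sqrt(n_per_stage)
    false_positives = 0
    for _ in range(n_trials):
        total, n = 0.0, 0
        for look in range(n_looks):
            total += rng.gauss(0.0, stage_sd)   # data from one more stage
            n += n_per_stage
            if abs(total / math.sqrt(n)) > z_crit:
                false_positives += 1            # spurious "significant" result
                break
    return false_positives / n_trials

one_look = false_positive_rate(1)
three_looks = false_positive_rate(3)
print(one_look, three_looks)
```

With a single look the rate should sit near the nominal 5 percent; with three uncorrected looks it should land a little above 10 percent. Practical group-sequential designs deal with this by making the threshold at each look stricter.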

INTO THE BAYESIAN UNIVERSE

This whole question of trial design and statistical power illustrates a fundamental issue with staged trials: When you conduct a trial using a classic (frequentist) statistical approach, you only have so much maneuvering room, and there can be limits to the interpretation of the data as well.

Bill Gillespie of Pharsight, a clinical consulting firm, points out that the hypothesis-testing mode of frequentist statistics still leaves the eventual decision-making in a binary state: "You set up a null hypothesis and hope to reject it. If you do, you end up with a fairly strong statement that you're better than nothing, but you don't know how much better you are." The data collection often has to be done in a binary mode as well, classifying patients, for example, as responders or non-responders according to pre-set criteria. In many cases, a more finely tuned readout would be helpful.

Such issues are why the quest for more complex and powerful trial designs leads, in many cases, to the alternate universe of Bayesian statistics. Gillespie is an advocate of this approach in adaptive clinical trials, which he says allows for much more flexibility. It's important to remember that the word "adaptive" isn't always a synonym for "Bayesian," but in moving to higher-level adaptive designs, the topic will always come up, since many of these designs are indeed easier to deal with in a Bayesian framework.

That's because the Bayesian approach was developed specifically to deal with new data as they come in, and to update the probabilities under investigation. Instead of determining the likelihood that a drug's efficacy could have happened by chance (which is what the frequentist approach does), a Bayesian trial will give a probability that the drug was effective. The usual question at this point is: Effective compared to what? And, the answer is: To the probability you calculated, before the data from the current trial, of the drug being effective. This comparison of (initial) prior probabilities and (updated) posterior probabilities is either one of the great strengths of Bayesian statistics or one of its greatest flaws, depending on which statistician you ask. For a pro-Bayesian viewpoint, consider the designs being explored at the MD Anderson Cancer Center in Houston.
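To make the prior-to-posterior idea concrete, here is a minimal Python sketch using the simplest textbook case: a responder/non-responder endpoint with a Beta prior on each arm's response rate. The flat Beta(1, 1) prior, the patient counts, and the function name are all assumptions for illustration, not anything drawn from the MD Anderson designs.

```python
import random

def prob_drug_better(drug_resp, drug_n, ctrl_resp, ctrl_n,
                     prior=(1.0, 1.0), draws=100_000, seed=0):
    """Posterior probability that the drug's response rate beats control's.

    With a Beta(a, b) prior and s responders out of n patients, the
    posterior for the response rate is Beta(a + s, b + n - s). The
    comparison between arms is estimated by sampling both posteriors.
    """
    rng = random.Random(seed)
    a, b = prior
    wins = 0
    for _ in range(draws):
        p_drug = rng.betavariate(a + drug_resp, b + drug_n - drug_resp)
        p_ctrl = rng.betavariate(a + ctrl_resp, b + ctrl_n - ctrl_resp)
        wins += p_drug > p_ctrl
    return wins / draws

# Invented interim data: the same calculation is simply rerun as patients accrue.
print(prob_drug_better(12, 20, 8, 20))    # after 20 patients per arm
print(prob_drug_better(25, 40, 15, 40))   # after 40 patients per arm
```

The output is a direct statement of the form "given the prior and the data so far, the chance that the drug beats control is X," and it can be recomputed each time new results arrive.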

A Bayesian alternative to the 3+3 MTD-finding design is the continual reassessment method (CRM). Before the trial begins, researchers develop a model of toxicity in relation to dose. Patients are initially assigned doses based on this model, but as the data come in, the probabilities are recalculated, and later doses close in on an MTD value more effectively than the 3+3 design. As with all Bayesian methods, it is crucial that the model—the prior hypothesis, in Bayesian jargon—be as well informed as possible. Some conservative hybrid designs start with a short 3+3-style dosing round to constrain the dose/toxicity curve before switching to the full CRM technique. There are many other CRM variations, which address issues such as selection of starting dosages, the size of the dose escalation steps, and the number of patients per dose.
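A stripped-down version of the CRM idea can be sketched in Python. This uses one common textbook form, a one-parameter "power" model in which the toxicity at each dose is a prior guess (the "skeleton") raised to the power exp(a), with a normal prior on a. The skeleton values, target rate, prior width, and function name are all assumptions, and practical designs add safeguards (such as not skipping dose levels) that are left out here.

```python
import math

def crm_next_dose(skeleton, target, doses_given, dlt_flags, prior_sd=1.0):
    """Recommend the dose whose estimated toxicity is closest to the target.

    skeleton:    prior guesses of the DLT probability at each dose level
    doses_given: dose index given to each patient treated so far
    dlt_flags:   1 if that patient had a dose-limiting toxicity, else 0
    """
    grid = [-4 + 8 * k / 400 for k in range(401)]    # grid of values for a
    weights = []
    for a in grid:
        log_w = -0.5 * (a / prior_sd) ** 2           # normal prior (unnormalized)
        for d, y in zip(doses_given, dlt_flags):
            p = skeleton[d] ** math.exp(a)           # model toxicity at that dose
            log_w += math.log(p if y else 1 - p)     # likelihood of the outcome
        weights.append(math.exp(log_w))
    total = sum(weights)
    # Posterior mean toxicity at every dose level.
    post_tox = [
        sum(w * skeleton[i] ** math.exp(a) for a, w in zip(grid, weights)) / total
        for i in range(len(skeleton))
    ]
    best = min(range(len(skeleton)), key=lambda i: abs(post_tox[i] - target))
    return best, post_tox

skeleton = [0.05, 0.10, 0.20, 0.35, 0.50]            # made-up prior curve
dose, est = crm_next_dose(skeleton, 0.25, doses_given=[2, 2, 2], dlt_flags=[0, 1, 1])
print(dose, [round(t, 2) for t in est])
```

Each new patient's outcome is appended to the two lists and the function is called again, which is the "continual reassessment" in the method's name.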

Bayesian designs can provide for a transition from merely sequential to continuous monitoring of trial data, but they can also allow for a wide range of other parameters to be changed. Designs can be developed that can, on the fly, vary the number of patients needed, eligibility for joining the trial, how patients are to be divided between arms of the study, and what doses of the investigational drug they'll receive.

There are several ways to implement these response-adaptive patient randomizations. One well-known technique is "Random Play-the-Winner," one of the "urn" methods—so called because they can be modeled after different ways of pulling variously colored balls from an urn. Play-the-Winner mathematically weights the treatment arms that have produced the fewest adverse events and/or the most positive data so that more patients are assigned to them. Both the degree of positive or negative response in a given group and the number of patients already assigned to it have to be taken into account to ensure that all possibilities are being considered, while at the same time, responding to real differences in outcomes. A similar "Drop-the-Loser" rule can also be used, which means that in practice, entire dosage groups or efficacy arms can be added or dropped as the data develop.
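The urn picture translates almost directly into code. The Python sketch below is a two-arm version of a randomized play-the-winner rule under some loudly simplifying assumptions: each patient's outcome is a simple success or failure, it is known before the next patient is assigned, and the response rates are invented. The function and parameter names are illustrative only.

```python
import random

def randomized_play_the_winner(success_probs, n_patients,
                               initial_balls=1, reward=1, seed=0):
    """Assign patients to two arms by drawing balls from an urn.

    A success on an arm adds balls for that arm; a failure adds balls
    for the other arm, so the better-performing arm is drawn more often.
    Returns the number of patients assigned to each arm.
    """
    rng = random.Random(seed)
    urn = [initial_balls, initial_balls]
    assigned = [0, 0]
    for _ in range(n_patients):
        # Draw a ball (with replacement) to pick the arm.
        arm = 0 if rng.random() < urn[0] / (urn[0] + urn[1]) else 1
        assigned[arm] += 1
        success = rng.random() < success_probs[arm]
        urn[arm if success else 1 - arm] += reward
    return assigned

# Hypothetical response rates: arm 0 is the better treatment.
print(randomized_play_the_winner((0.6, 0.3), n_patients=200))
```

Because every draw is still random, the weaker arm keeps receiving some patients, which is how the scheme continues to consider all possibilities while leaning toward the arm that is doing better.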

It is obvious why this sort of adaptive randomization would be useful to a pharmaceutical company. But in the real world, it may take a long time to see a meaningful response in patients—too long for the data to effectively influence the course of the trial. A number of the more theoretical treatments of adaptive trial methods seem to have glossed over this difficulty, but as such techniques have moved into real practice, the situation has improved, with several statistical techniques proposed to deal with data from partial courses of treatment. Still, there's no doubt that adaptive designs are simplified considerably if the expected clinical readout is rapid—something that should happen more often as biomarkers take on a larger role in clinical research.

A Bayesian approach has the potential to make big changes in the way R&D is conducted. Even the basic Phase I/II/III divisions that the industry works with could easily blur together. Early dose-finding studies could settle on an optimum level, with more patients being added to that arm of the study while other dosage groups drop away. Any efficacy data that could be collected to that point wouldn't be lost, but would bolster the statistics of the trial as it shifted to Phase II concerns.

There is a price to be paid for these designs, though. They are computationally and logistically complex. One reason that Bayesian statistics were little used for decades was that their number-crunching demands could quickly go beyond anyone's practical capabilities. Modern hardware and software have put them in reach, but the amount of work needed is still substantial. The convolutions of the more powerful designs require a good amount of simulation to check them out before real-world implementation, sometimes tens of thousands of runs. Current computing power is finally able to meet that challenge. But all of this means a certain loss of obviousness, with schemes that are much harder to reduce to the back of an envelope (or a slide presentation laden with bullet points).

The sheer logistics of a high-level adaptive design also require careful thought. Quick and reliable electronic data collection would seem to be mandatory for a trial that is dependent on constant updating. A great deal of that data is going to have to be unblinded while such a trial goes on, in direct proportion to its continuously adaptive nature, and with all the complications that that entails. And adaptive designs present risks of accidental unblinding that are unknown in traditional designs—for example, if patients are to be enrolled at different rates to different treatment arms while the trial is going on. It's easy to envision a situation where people at each study site have access to unblinded results, but that is far too risky, argues Pharsight's Bill Gillespie. A safer alternative would be a central clearinghouse for preparation of study packs and evaluation of data, backed by a reliable inventory and distribution system.

ADAPTING TO THE REAL WORLD

To date, the best example in the open literature of a full-scale, real-world Bayesian trial is Pfizer's ASTIN trial (Acute Stroke Therapy by Inhibition of Neutrophils), published in the journal Stroke in 2003. The trial incorporated, among other features, adaptive allocation of patients to different dose groups and the possibility of a seamless transition to Phase III in the event of strong efficacy readout. As it happened, the trial seems to have been a good one with which to meet the realities of clinical research.

Stroke therapies are difficult to assess for efficacy, with many weeks of treatment and/or follow-up monitoring. This led to the problem of ensuring that data could be meaningfully incorporated back into the trial, which was addressed in this case by modeling outcomes based on early clinical signs. In addition to this anticipated problem, though, several unexpected factors put the design to the test.

The placebo response, for example, was much greater than anticipated, which led the data monitoring committee to stop the trial for futility at the first possible point (which was, to be sure, at the 48-week interval, reflecting the slow progression of any stroke therapy). Another unexpected complication was, paradoxically, the enthusiastic response of the trial centers to the adaptive design. The study's authors later estimated that the pace of recruiting actually outran the adaptive features of the trial, with more patients eventually recruited than would have been statistically required under more measured conditions. The preparation and distribution of clinical supplies seems also to have been a logistical challenge, given that more than 100 study centers worldwide participated.

Still, ASTIN seems to have done what it was designed to do, which in this case meant killing the drug promptly and decisively. "Pfizer seems to have walked away from it feeling that they'd had a complete answer," says Gillespie. This is an advantage that may have been under-appreciated. Too many negative trials end with one party or another feeling that if something had been done differently (longer, at a higher dose, etc.), the program might have survived. Adaptive designs, built to explore clinical data more thoroughly while the trial is still under way, might leave more researchers satisfied that all the alternatives had been given a fair hearing.

There are other factors that the new designs can help clarify. Consider the two sorts of benefits to be sought with any clinical trial design: You can seek to improve the situation that the clinicians face during the studies, or seek to improve the situation of the eventual patients when the drug reaches the market. These two goals aren't always mutually compatible. For example, it may be more valuable, in the clinical setting, to investigate doses both higher and lower than will likely be used in real-world practice. Similarly, a narrowly focused efficacy trial might give cleaner data from a regulatory standpoint, at the expense of the broader information that would help the physicians who will eventually prescribe the drug.

Another consideration is the overlap between the statistical power of a trial and its scientific and clinical power. A large, expensive trial can, after all, prove what it set out to prove in terms of statistical significance, but end up confirming a too-small difference from the null hypothesis—one that won't be enough to convince physicians and patients that the new therapy is of any real value. Disturbingly, the larger and more powerful the trial, the greater the chances of this outcome if the issue is not addressed early in its design.
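A short calculation shows why bigger trials make this more likely. The Python sketch below assumes a plain two-arm comparison of means with a known standard deviation, and asks what p-value results if the observed difference is exactly a small, fixed amount. The numbers (a 1-point difference against a standard deviation of 10) are invented for illustration.

```python
import math

def two_sided_p(delta, sd, n_per_arm):
    """Two-sided p-value of a z-test when the observed difference is delta."""
    z = delta / (sd * math.sqrt(2.0 / n_per_arm))
    return math.erfc(abs(z) / math.sqrt(2.0))

# Same tiny effect, ever larger trials.
for n in (100, 1_000, 10_000):
    print(n, two_sided_p(delta=1.0, sd=10.0, n_per_arm=n))
```

The effect is the same in every row; only the trial size changes. At 100 patients per arm it is nowhere near significant, while at 1,000 and beyond it clears the conventional 0.05 bar, even though a tenth of a standard deviation may mean little to a prescribing physician.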

These same concerns apply to adaptive designs, whose flexibility actually makes these tradeoffs more explicit. The balance between them varies depending on the clinical stage. In Phase I, the benefits sought are almost exclusively for the clinical investigators, such as dose-finding. Only in oncology trials, where the Phase I patients are often not healthy volunteers, is there a greater chance for patient/physician benefits to be a consideration. Phase II is probably the stage when these two endpoints are most evenly matched.

From the outside perspective, the main point of a Phase II trial is to determine whether or not the drug benefits patients with the targeted disease. From the inside, though, an equally pressing reason for Phase II is to position things for the best chance of success in the far more costly Phase III. A design that allows for a seamless Phase II/III transition has the potential to allow statistically robust efficacy determinations, while allowing patients to directly benefit from their own participation in the trial, a benefit which is arguably impossible to realize with many standard designs. (This was probably one reason for the high enrollment rates seen in Pfizer's ASTIN trial.)

As Pharsight's Gillespie notes, though, the situation today is that explicitly adaptive techniques are found most commonly in Phase I, along with group-sequential designs in Phase III, which leaves Phase II trials as the highest-value opportunity for adaptive designs that isn't currently being taken advantage of. He sees these as particularly valuable in situations where prior knowledge in the field is weak, with a corresponding need to learn as rapidly and efficiently as possible. Therapeutic areas that have difficulties in proving mechanisms in animal models might be a good fit.

Adaptive designs have the potential to change the way clinical research is conducted. But any such power has to be used wisely. As with every other stage of drug development, no magic is on offer here—a bad design cannot be made good by making it adaptive. Careful thought about the purpose and execution of an adaptive trial is needed to keep it from becoming an exercise in self-deception, something the industry is already well stocked with. Still, given the number of times that dosages, toxicities, and efficacies have been wrongly estimated preclinically, there are clear advantages to methods that can use fewer patients when an effect is greater than expected—and give greater statistical power when it turns out to be less. The next step is getting more drug candidates worthy of their trial designs.

Derek Lowe is the author of In the Pipeline, an industry blog. He can be reached at derek-lowe@sbcglobal.net