                            Overview of evaluation approaches for scenario projections    
    When evaluating the distance between a scenario projection and    an observation, there are two potential factors at play: (1)    the scenario assumptions may not match reality (e.g.,    scenario-specified vaccine uptake may underestimate realized    uptake), and (2) if there were to be alignment between the    scenario specifications and reality, model predictions may be    imperfect due to miscalibration. The difference between    projections and observations is a complex combination of both    sources of disagreement, and importantly, observing projections    that are close to observations does not necessarily imply    projections are well-calibrated (i.e., for scenarios very far    from reality, we might expect projections to deviate from    observations). To address both components, we evaluated the    plausibility of COVID-19 Scenario Modeling Hub (SMH) scenarios    and the performance of SMH projections (ensemble and component    models). A similar approach has been proposed by Hausfather et    al.45. Below, we    describe in turn the component models contributing to SMH, the    construction of the ensemble, the evaluation of scenario    assumptions, and our approaches to estimating model calibration    and SMH performance.  
    SMH advertised new rounds of scenario projections across various modeling channels, using an open call to elicit projections from independent modeling teams. Scenario specifications were designed in collaboration with public health partners and modeling teams, and final specifications were published on a public GitHub repository (https://github.com/midas-network/covid19-scenario-modeling-hub). Teams then submitted projections to this same repository. For additional discussion of the philosophy and history of SMH, as well as details about the SMH process, see Loo et al.43.
    Over the course of the first sixteen rounds of SMH, thirteen independent models submitted projections, most of them to multiple rounds. Prior experience in public health modeling varied substantially across participating teams, ranging from teams with models newly built to address the COVID-19 pandemic to teams with long-established relationships with local, state, and national public health agencies. The majority of submitting models were mechanistic compartmental models, though there was one semi-mechanistic model and two agent-based models. Some models were calibrated to, and made projections at, the county level, whereas others were calibrated to and made projections at the state level; many, but not all, had age structure. We have provided an overview of each model in Table S1. As models changed each round to accommodate different scenarios and adapt to the evolving pandemic context, we chose not to focus here on model-specific differences (in structure, parameters, or performance). For more information on round-specific implementations, we direct readers to other publications with details5,22.
    Our analysis included state- and national-level projections of weekly incident cases, hospitalizations, and deaths from individual models and various ensembles for fourteen of the first sixteen rounds of SMH (Rounds 8 and 10 were not released publicly and are therefore not included; see also Table S2 for a list of jurisdictions included). Each round included projections from between 4 and 9 individual models as well as ensembles. For a given round, modeling teams submitted projections for all weeks of the projection period, all targets (i.e., incident or cumulative cases, hospitalizations, and deaths), all four scenarios, and at least one location (i.e., states, territories, and national). Here, we evaluated only individual models that provided national projections in addition to state-level projections (i.e., excluding individual models that did not submit a national projection, though projections from these models are still included in the state-level ensembles that were evaluated). Submitted projections that did not comply with SMH conditions (e.g., for quantifying uncertainty or defining targets) were also excluded (0.8% of all submitted projections). A detailed description of exclusions can be found in Table S2.
    Modeling teams submitted probabilistic projections for each target via 23 quantiles (e.g., teams provided projected weekly incident cases for Q1, Q2.5, Q5, Q10, Q20, ..., Q80, Q90, Q95, Q97.5, and Q99). We evaluated 3 methods for aggregating projections: untrimmed-LOP, trimmed-LOP (variations of probability averaging or linear opinion pool18, LOP), and median-Vincent (a variation of quantile or Vincent averaging36,37, which is also used by other hubs11).
    The untrimmed-LOP is calculated by taking an equally weighted average of cumulative probabilities across individual models at a single value. Because teams submitted projections for fixed quantiles, we used linear interpolation between these value-quantile pairs to ensure that all model projections were defined for the same values. We assumed that all projected cumulative probabilities jump to 0 and 1 outside of the defined value-quantile pairs (i.e., Q1-Q99). In other words, for a projection defined by cumulative distribution \(F(x)\) with quantile function \(F^{-1}(x)\), we assume that \(F(x)=0\) for all \(x < F^{-1}(0.01)\) and \(F(x)=1\) for all \(x > F^{-1}(0.99)\).
    The trimmed-LOP uses exterior cumulative distribution function (CDF) trimming54 of the two outermost values to reduce the variance of the aggregate relative to the untrimmed-LOP (i.e., the prediction intervals are narrower). To implement this method, we follow the same procedure as for the untrimmed-LOP, but instead of taking an equally weighted average over all models, we exclude the highest and lowest cumulative probabilities at a given value and equally weight the remaining probabilities in the average. Under this trimming method, the exclusions at different values may come from different teams.
    The median-Vincent aggregate is calculated by taking the median    value for each specified quantile. These methods were    implemented using the CombineDistributions    package19 for the R    statistical software55.  
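    To make the three aggregation rules concrete, the base R sketch below builds each aggregate from a set of quantile-value submissions. The quantile levels match the 23 levels described above, but the three stand-in component models and the interpolation grid are hypothetical; the published analysis used the CombineDistributions package rather than this code.

```r
# Illustrative sketch of the three aggregation rules, assuming each component
# model is represented by its 23 submitted quantile-value pairs.
q_levels <- c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99)  # 23 levels

# Hypothetical component-model submissions (values ordered to match q_levels).
projections <- list(
  modelA = qlnorm(q_levels, meanlog = 6.0, sdlog = 0.4),
  modelB = qlnorm(q_levels, meanlog = 6.2, sdlog = 0.3),
  modelC = qlnorm(q_levels, meanlog = 5.8, sdlog = 0.5)
)

# Cumulative probability of one model at arbitrary values x, by linear
# interpolation between its quantile-value pairs; 0 below Q1 and 1 above Q99.
cdf_at <- function(values, x) {
  approx(x = values, y = q_levels, xout = x, yleft = 0, yright = 1,
         ties = "ordered")$y
}

# LOP: equally weighted average of the component CDFs at each value;
# with trim = TRUE, drop the lowest and highest cumulative probability
# at each value before averaging (exterior CDF trimming).
lop_cdf <- function(x, trim = FALSE) {
  probs <- matrix(sapply(projections, cdf_at, x = x), nrow = length(x))
  if (trim) {
    apply(probs, 1, function(p) mean(sort(p)[-c(1, length(p))]))
  } else {
    rowMeans(probs)
  }
}

# Median-Vincent: the aggregate value at each quantile level is the median
# of the component models' values at that level.
vincent_values <- apply(do.call(cbind, projections), 1, median)

# Recover the LOP aggregates' quantiles by inverting the averaged CDFs on a grid.
grid <- seq(min(unlist(projections)), max(unlist(projections)), length.out = 2000)
untrimmed_quantiles <- approx(x = lop_cdf(grid), y = grid,
                              xout = q_levels, ties = mean)$y
trimmed_quantiles   <- approx(x = lop_cdf(grid, trim = TRUE), y = grid,
                              xout = q_levels, ties = mean)$y
```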
    Projections in each SMH round were made for 4 distinct scenarios that detailed potential interventions, changes in behavior, or epidemiologic situations (Fig. 1). Scenario design was guided by one or more primary purposes56, which were often informed by public health partners and our hypotheses about the most important uncertainties at the time. SMH scenarios were designed approximately one month before projections were submitted, and therefore 4-13 months before the end of the projection period, depending on the round's projection horizon. Scenario assumptions, especially those about vaccine efficacy or properties of emerging viral variants, were based on the best data and estimates available at the time of scenario design (these were often highly uncertain). Here, our purpose was to evaluate SMH scenario assumptions using the best data and estimates currently available, after the projection period had passed. We assessed SMH scenarios from two perspectives:
        based on their prospective purpose: we identified whether scenarios bracketed reality along each uncertainty axis (i.e., one axis of the 2x2 table defining scenarios, based on one key source of uncertainty for the round). Scenarios in most SMH rounds were designed to bracket true values of key epidemic drivers (though the true value was not known at the time of scenario design). In other words, along each uncertainty axis in an SMH round, scenarios specified two levels along this axis (e.g., optimistic and pessimistic assumptions). Here we tested whether the realized value fell between those two assumptions (if so, we called this bracketing).
        for retrospective evaluation of calibration: we        identified the set of plausible scenario-weeks for each        round. One of our primary goals in this analysis was to        assess and compare the calibration of different approaches        (e.g., individual models, SMH ensemble, comparator models).        To assess this in the most direct way possible, we chose        scenarios and projection weeks that were close to what        actually happened (i.e., we isolated error due to        calibration by minimizing deviation between scenarios and        reality; see Overview of evaluation approaches for        scenario projections for details).      
    An evaluable scenario axis was defined as an axis for which assumptions could be confronted with subsequently observed data on epidemic drivers; in some instances, we could not find relevant data and the axis was not considered evaluable (e.g., NPIs, see below). To evaluate scenario assumptions, we used external data sources and literature (Table S3). Due to differences across these sources, we validated each type of scenario assumption differently (vaccination, NPIs, and variant characteristics; Fig. 2), as described in detail below and in Table S3. Vaccine specifications and realized coverage are shown in Figs. S2-S5, while details of our round-by-round evaluation are provided below.
    Rounds 1-4 concentrated on the early roll-out of the vaccine in the US and compliance with NPIs. To evaluate our vaccine assumptions in these rounds, we used data on reported uptake from the US Centers for Disease Control and Prevention database57. For these rounds, scenarios prescribed monthly national coverage (state-specific uptake was intentionally left to the discretion of the modeling teams), so we only used national uptake to evaluate the plausibility of each vaccination scenario (Fig. S2). In these scenarios, bracketing was defined as reality falling between cumulative coverage in the optimistic and pessimistic scenarios for 50% or more of all projection weeks. The plausible scenario was the scenario with the smallest absolute difference between cumulative coverage in the final projection week (or, in cases of variant emergence, the last week of projections before emergence; details below) and the observed cumulative coverage. We also considered choosing the plausible scenario via the cumulative difference between observed and scenario-specified coverage over the entire projection period; this always led to selecting the same scenario as plausible.
    When scenarios specified a coverage threshold, we compared assumptions with the reported fraction of people vaccinated at the end of the projection period. For instance, in Round 2 scenarios C and D, we stipulated that coverage would not exceed 50% in any priority group, but reported vaccination exceeded this threshold. In Rounds 3-4, the prescribed thresholds were not exceeded during the truncated projection period.
    By Round 5 (May 2021), vaccine uptake had started to saturate. Accordingly, in Rounds 5-7, vaccine assumptions were based on high and low saturation thresholds that should not be exceeded for the duration of the projection period, rather than monthly uptake curves. For these rounds, we evaluated which of the prescribed thresholds was closest to the reported cumulative coverage at the end of the projection period (Fig. S3). Later rounds took similar approaches to specifying uptake of childhood vaccination (Round 9) and bivalent boosters (Rounds 14-16). Rounds 9 (Fig. S4) and 14-15 (Fig. S5) specified weekly coverage, and Round 16 specified a coverage threshold; we followed similar approaches in evaluating these scenarios.
    For vaccine efficacy assumptions, we consulted population-level studies conducted during the period in which the round's most prevalent variant circulated (Table S3). Similarly, for scenarios about emerging viral variants (regarding transmissibility increases, immune escape, and severity) and waning immunity, we used values from the literature as a ground truth for these scenario assumptions. We identified the most realistic scenario as that with the assumptions closest to the literature value (or the average of literature values if multiple were available, Table S3).
    Rounds 1-4 included assumptions about NPIs. We could not identify a good source of information on the efficacy of, and compliance with, NPIs that would match the specificity prescribed in the scenarios (despite the availability of mobility and policy data, e.g., Hallas et al.58). Rounds 13-15 included assumptions about immune escape and severity of hypothetical variants that may have circulated in the post-Omicron era. Round 16 considered broad variant categories based on similar levels of immune escape, in response to the increasing genetic diversity of SARS-CoV-2 viruses circulating in fall 2022. There were no data available for evaluation of immune escape assumptions after the initial Omicron BA.1 wave. As such, NPI scenarios in Rounds 1-4 and immune escape variant scenarios in Rounds 13-16 were not evaluable for bracketing analyses, and we therefore considered all scenarios realistic in these cases. Overall, across 14 publicly released rounds, we identified a single most realistic scenario in 7 rounds, and two most realistic scenarios in the other 7.
    Finally, in some rounds, a new viral variant emerged during the    projection period that was not specified in the scenarios for    that round. We considered this emergence to be an invalidation    of scenario assumptions, and removed these weeks from the set    of plausible scenario-weeks. Specifically, emergence was    defined as the week after prevalence exceeded 50% nationally    according to outbreak.info variant reports59,60,61, accessed via    outbreak.info R client62. Accordingly,    the Alpha variant (not anticipated in Round 1 scenarios)    emerged on 3 April 2021, the Delta variant (not anticipated in    Rounds 2-5) emerged on 26 June 2021, and the Omicron variant    (not anticipated in Round 9) emerged on 25 December 2021.  
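    As a small illustration of this emergence rule, assuming a data frame prev of national weekly prevalence for a given variant with columns week (a Date marking the week) and prop (proportion of sequenced samples); both the data frame and its column names are hypothetical:

```r
# Emergence week: the week after national prevalence first exceeded 50%
# (hypothetical data frame `prev` with columns `week` and `prop`).
emergence_week <- min(prev$week[prev$prop > 0.5]) + 7
```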
    To assess the added value of SMH projections against plausible    alternative sources of information, we also assessed comparator    models or other benchmarks. Comparator models based on    historical data were not available here (e.g., there was no    prior observation of COVID-19 in February in the US when we    projected February 2021). There are many potential    alternatives, and here we used three comparative models: naive,    4-week forecast, and trend-continuation.  
    The baseline naive model was generated by carrying recent observations forward, with variance based on historical patterns (Figs. S13-S15). We used the 4-week ahead baseline model forecast from the COVID-19 Forecast Hub11 for the first week of the projection period as the naive model, and assumed this projection held for the duration of the projection period (i.e., this forecast was the naive projection for all weeks during the projection period). Because the COVID-19 Forecast Hub collects daily forecasts for hospitalizations, we drew 1000 random samples from each daily distribution in a given week and summed those samples to obtain a prediction for weekly hospitalizations. The naive model is flat and has relatively large prediction intervals in some instances.
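    The conversion from daily hospitalization forecasts to a weekly total can be sketched as below, assuming each daily forecast is given as the same 23 quantile-value pairs and that inverse-transform sampling with linear interpolation between submitted quantiles is an acceptable approximation. The quantile levels and example distributions are ours, and summing independent draws across days ignores any within-week correlation the Forecast Hub samples may carry.

```r
# Turn seven daily forecast distributions (quantile-value pairs) into a weekly
# total by sampling each day 1000 times and summing. Example values are hypothetical.
set.seed(1)
q_levels <- c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99)
daily_forecasts <- lapply(1:7, function(d) qnorm(q_levels, mean = 200 + 5 * d, sd = 30))

sample_from_quantiles <- function(values, n = 1000) {
  u <- runif(n, min = 0.01, max = 0.99)  # stay within the defined range Q1-Q99
  approx(x = q_levels, y = values, xout = u, ties = "ordered")$y
}

daily_samples    <- sapply(daily_forecasts, sample_from_quantiles, n = 1000)  # 1000 x 7
weekly_samples   <- rowSums(daily_samples)
weekly_quantiles <- quantile(weekly_samples, probs = q_levels)
```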
    As a forecast-based comparator, we used the COVID-19 Forecast Hub COVIDhub-4_week_ensemble model (Figs. S7-S9). This model includes forecasts (made every week) from multiple component models (e.g., on average 41 component models between January and October 202111). We obtained weekly hospitalization forecasts from the daily forecasts of the COVID-19 Forecast Hub using the same method as for the naive model. This 4-week forecast model is particularly skilled at death forecasts11; however, in practice, there is a mismatch in timing between when these forecasts were made and when SMH projections were made. For most SMH projection weeks, forecasts from this model would not yet be available (i.e., projection horizons more than 4 weeks into the future); yet, for the first 4 weeks of the SMH projection period, SMH projections may have access to more recent data. It should also be noted that the team running the COVID-19 Forecast Hub has flagged the 4-week ahead predictions of cases and hospitalizations as unreliable63. Further, SMH may be given an advantage by the post-hoc selection of plausible scenario-weeks based on the validity of scenario assumptions.
    Finally, the trend-continuation model was based on a statistical generalized additive model (Figs. S10-S12). The model was fit to the square root of the 14-day moving average, with cubic spline terms for time, and was fit separately for each location. We considered including seasonal terms, but there were not enough historical data to meaningfully estimate seasonality. For each round, we used only one year of data to fit the model, and projected forward for the duration of the projection period. The SMH ensemble consistently outperformed this alternative comparator model (see Figs. S16-S21).
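    A rough sketch of this comparator for a single location and target is given below, assuming daily observations in a data frame obs with columns date and inc; the horizon, the spline settings, and the restriction to the central trend are our simplifications rather than the exact specification used for Figs. S10-S12.

```r
library(mgcv)

# 14-day moving average and a numeric time index.
obs$ma14 <- as.numeric(stats::filter(obs$inc, rep(1 / 14, 14), sides = 1))
obs$t    <- as.numeric(obs$date - min(obs$date))

# Fit sqrt(moving average) with a cubic regression spline in time,
# using only the most recent year of data.
fit_data <- tail(obs[!is.na(obs$ma14), ], 365)
fit      <- gam(sqrt(ma14) ~ s(t, bs = "cr"), data = fit_data)

# Project forward (e.g., ~26 weeks) and back-transform from the sqrt scale.
future       <- data.frame(t = max(fit_data$t) + 1:182)
future$trend <- predict(fit, newdata = future)^2
```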
    Prediction performance is typically based on a measure of distance between projections and ground truth observations. We used the Johns Hopkins CSSE dataset64 as a source of ground truth data on reported COVID-19 cases and deaths, and the U.S. Health and Human Services Protect Public Data Hub65 as a source of reported COVID-19 hospitalizations. These sources were also used for calibration of the component models. CSSE data were only produced through 4 March 2023, so our evaluation of Rounds 13-16 ended at this date (1 week before the end of the 52-week projection period in Round 13, 11 weeks before the end of the 52-week projection period in Round 14, 9 weeks before the end of the 40-week projection period in Round 15, and 8 weeks before the end of the 26-week projection period in Round 16).
    We used two metrics to measure the performance of probabilistic projections, both common in the evaluation of infectious disease predictions. To define these metrics, let \(F\) be the projection of interest (approximated by a set of 23 quantile-value pairs) and \(o\) be the corresponding observed value. The \(\alpha\)% prediction interval is the interval within which we expect the observed value to fall with \(\alpha\)% probability, given reality perfectly aligns with the specified scenario.
        Ninety-five percent (95%) coverage measures the        percent of projections for which the observation falls        within the 95% projection interval. In other words, 95%        coverage is calculated as      
        $$C_{95\%}(F,o)=\frac{1}{N}\sum_{i=1}^{N}1\left(F^{-1}(0.025)\le o\le F^{-1}(0.975)\right)$$
        (1)
        where \(1(\cdot)\) is the indicator function, i.e., \(1\left(F^{-1}(0.025)\le o\le F^{-1}(0.975)\right)=1\) if the observation falls between the values corresponding to Q2.5 and Q97.5, and is 0 otherwise. We calculated coverage over multiple locations for a given week (i.e., \(i=1,\ldots,N\) for \(N\) locations), or across all weeks and locations.
        Weighted interval score (WIS) measures the extent to which a projection captures an observation, and penalizes for wider prediction intervals35. First, given a projection interval (with uncertainty level \(\alpha\)) defined by upper and lower bounds, \(u=F^{-1}\left(1-\frac{\alpha}{2}\right)\) and \(l=F^{-1}\left(\frac{\alpha}{2}\right)\), the interval score is calculated as

        $$\mathrm{IS}_{\alpha}(F,o)=(u-l)+\frac{2}{\alpha}(l-o)\,1(o<l)+\frac{2}{\alpha}(o-u)\,1(u<o)$$
        (2)
    where again, \(1(\cdot)\) is the indicator function. The first term of \(\mathrm{IS}_{\alpha}\) represents the width of the prediction interval, and the last two terms are penalties for over- and under-prediction, respectively. Then, using weights that approximate the continuous ranked probability score66, the weighted interval score is calculated as

      $$\mathrm{WIS}(F,o)=\frac{1}{K+1/2}\left(\frac{1}{2}\left|o-F^{-1}(0.5)\right|+\sum_{k=1}^{K}\frac{\alpha_{k}}{2}\,\mathrm{IS}_{\alpha_{k}}(F,o)\right)$$
      (3)
    Each projection is defined by 23 quantiles comprising 11 intervals (plus the median), which we used for the calculation of WIS (i.e., we calculated \(\mathrm{IS}_{\alpha}\) for \(\alpha=0.02, 0.05, 0.1, 0.2, \ldots, 0.8, 0.9\) and \(K=11\)). It is worth noting that these metrics do not account for measurement error in the observations.
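    As a concrete illustration of Eqs. (1)-(3), the sketch below scores a single projection, given as 23 quantile-value pairs, against one observation. The projection and observation are hypothetical stand-ins, not SMH output.

```r
# Score one projection against an observation, following Eqs. (1)-(3).
q_levels    <- c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99)
proj_values <- qnorm(q_levels, mean = 1000, sd = 150)  # stand-in projection
obs         <- 1210                                    # stand-in observation

# 95% coverage for one projection: does the observation fall in [Q2.5, Q97.5]?
# Averaging this indicator over locations/weeks gives C_95 in Eq. (1).
covered_95 <- obs >= proj_values[q_levels == 0.025] & obs <= proj_values[q_levels == 0.975]

# Interval score for a central interval with uncertainty level alpha, Eq. (2).
interval_score <- function(values, o, alpha) {
  l <- values[which.min(abs(q_levels - alpha / 2))]        # F^{-1}(alpha/2)
  u <- values[which.min(abs(q_levels - (1 - alpha / 2)))]  # F^{-1}(1 - alpha/2)
  (u - l) + (2 / alpha) * (l - o) * (o < l) + (2 / alpha) * (o - u) * (u < o)
}

# Weighted interval score, Eq. (3), with K = 11 intervals plus the median.
wis <- function(values, o) {
  alphas <- c(0.02, 0.05, seq(0.1, 0.9, by = 0.1))
  med    <- values[which.min(abs(q_levels - 0.5))]
  is_k   <- sapply(alphas, function(a) interval_score(values, o, a))
  (0.5 * abs(o - med) + sum(alphas / 2 * is_k)) / (length(alphas) + 0.5)
}

covered_95
wis(proj_values, obs)
```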
    WIS values are on the scale of the observations, and therefore comparison of WIS across different locations or phases of the pandemic is not straightforward (e.g., the scale of case counts is very different between New York and Vermont). For this reason, we generated multiple variations of WIS metrics to account for variation in the magnitude of observations. First, for average normalized WIS (Fig. 3b), we calculated the standard deviation of WIS, \(\sigma_{s,w,t,r}\), across all scenarios and models for a given week, location, target, and round, and divided the WIS by this standard deviation (i.e., \(\mathrm{WIS}/\sigma_{s,w,t,r}\)). Doing so accounts for the scale of that location, week, target, and round, a procedure implemented in analyses of climate projections67. Then, we averaged normalized WIS values across strata of interest (e.g., across all locations, or all locations and weeks). Other standardization approaches that compute WIS on a log scale have been proposed68, though they may not be as well suited for our analysis, which focuses on planning and decision making.
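    The normalization step can be sketched as follows, assuming a data frame scores with one row per scored projection and columns location, week, target, round, scenario, model, and wis; the column names are ours, not the hub's.

```r
# Standard deviation of WIS across all scenarios and models within each
# location-week-target-round stratum, then divide WIS by it.
sigma           <- ave(scores$wis,
                       scores$location, scores$week, scores$target, scores$round,
                       FUN = sd)
scores$norm_wis <- scores$wis / sigma

# Average normalized WIS over strata of interest, e.g., by model and round.
avg_norm <- aggregate(norm_wis ~ model + round, data = scores, FUN = mean)
```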
    An alternative rescaling introduced by Cramer et al.11, relative WIS, compares the performance of a set of projections to an average projection. This metric is designed to compare performance across predictions from varying pandemic phases. The relative WIS for model \(i\) is based on pairwise comparisons (to all other models, \(j\)) of average WIS. We calculated the average WIS across all projections in common between model \(i\) and model \(j\), where \(\mathrm{WIS}(i)\) and \(\mathrm{WIS}(j)\) are the average WIS of these projections (either within one round, or across all rounds for the overall measure) for model \(i\) and model \(j\), respectively. Then, the relative WIS is the geometric average of the ratio, or
      $$\mathrm{relative\ WIS}=\left(\prod_{j=1}^{N}\frac{\mathrm{WIS}(i)}{\mathrm{WIS}(j)}\right)^{1/N}$$
      (4)
    When comparing only two models that have made projections for all the same targets, weeks, locations, rounds, etc., the relative WIS is equivalent to a simpler metric, the ratio of the average WIS for each model (i.e., \(\mathrm{WIS}(i)/\mathrm{WIS}(j)\)). We used this metric to compare each scenario from the SMH ensemble to the 4-week forecast model (Fig. 4). For this scenario comparison, we provided bootstrap intervals by recalculating the ratio with an entire week of projections excluded (all locations and scenarios for that week). We repeated this for all weeks, and randomly drew from these leave-one-week-out ratios 1000 times with replacement. From these draws we calculated the 5th and 95th percentiles to derive the 90% bootstrap interval, and we assumed performance was significantly better for one scenario over the others if the 90% bootstrap intervals did not overlap. We also used this metric to compare the ensemble projections to each of the comparative models (Fig. S22).
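    A sketch of the pairwise calculation in Eq. (4) is given below, reusing the hypothetical scores data frame from above; pairwise averages are taken only over the projections the two models have in common.

```r
# Relative WIS of model i: geometric mean of pairwise ratios of average WIS,
# each ratio computed on the projections shared by models i and j.
relative_wis <- function(scores, i) {
  models <- unique(scores$model)
  key <- function(d) paste(d$target, d$week, d$location, d$scenario, d$round)
  ratios <- sapply(models, function(j) {
    si <- scores[scores$model == i, ]
    sj <- scores[scores$model == j, ]
    shared <- intersect(key(si), key(sj))
    mean(si$wis[key(si) %in% shared]) / mean(sj$wis[key(sj) %in% shared])
  })
  prod(ratios)^(1 / length(ratios))
}
# e.g., relative_wis(scores, "modelA")
```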
    In addition to traditional forecast evaluation metrics, we assessed the extent to which SMH projections predicted the qualitative shape of incident trajectories (i.e., whether trends would increase or decrease). We modified a method from McDonald et al.40 to classify observations and projections as increasing, flat, or decreasing. First, we calculated the percent change in observed incident trajectories on a two-week lag (i.e., \(\log(o_{T}+1)-\log(o_{T-2}+1)\) for each state and target). We took the distribution of percent change values across all locations for a given target and set the thresholds for a decrease or increase assuming that 33% of observations would be flat (Fig. S23). Based on this approach, decreases were defined as those weeks with a percent change value below -23% for incident cases, -17% for incident hospitalizations, and -27% for incident deaths, respectively; increases were those with a percent change value above 14%, 11%, and 17%, respectively. See Fig. S34 for classification results with a one-week lag and different assumptions about the percent of observations that are flat.
    Then, to classify trends in projections, we again calculated the percent change on a two-week lag of the projected median (we also considered the 75th and 95th quantiles, because our aggregation method is known to generate a flat median when asynchrony between component models is high). For the first two projection weeks of each round, we calculated the percent change relative to the observations one and two weeks prior (as there are no projections to use as a reference one and two weeks before the projection start date). We applied the same thresholds from the observations to classify a projection, and compared this classification to the observed classification. This method accounts for instances when SMH projections anticipated a change in trajectory but not the magnitude of that change (see Fig. S44), but it does not account for instances when SMH projections anticipated a change but missed the timing of that change (this occurred to some extent in Rounds 6 and 7, during the Delta variant wave). See Figs. S24-S33 for classifications of all observations and projections.
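    A minimal sketch of this classification rule for one location and target is given below, assuming obs_weekly and proj_median_weekly are time-aligned weekly series; the cutoffs shown are the incident-case thresholds quoted above (with the decrease cutoff taken as negative), and the special handling of the first two projection weeks and the 75th/95th-quantile variants are omitted.

```r
# Percent change on a two-week lag, computed as a difference of logs.
log_change <- function(x, lag = 2) {
  c(rep(NA, lag), log(tail(x, -lag) + 1) - log(head(x, -lag) + 1))
}

# Classify each week as decreasing, flat, or increasing.
classify_trend <- function(x, lower = -0.23, upper = 0.14, lag = 2) {
  pc <- log_change(x, lag)
  ifelse(pc < lower, "decreasing", ifelse(pc > upper, "increasing", "flat"))
}

obs_class  <- classify_trend(obs_weekly)           # observed trends
proj_class <- classify_trend(proj_median_weekly)   # same rule on projected medians
agreement  <- mean(obs_class == proj_class, na.rm = TRUE)
```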
    We assessed how well SMH projections captured incident trends using precision and recall, two metrics commonly used to evaluate classification tasks, here with three classes: increasing, flat, and decreasing41. To calculate these metrics, we grouped all projections by the projected and the observed trend (as in Fig. 5d). Let \(N_{po}\) be the number of projections classified by SMH as trend \(p\) (rows of Fig. 5d) for which the corresponding observation was trend \(o\) (columns of Fig. 5d). All possible combinations are provided in Table 2. Then, for class \(c\) (either decreasing, flat, or increasing),
        precision is the fraction of projections correctly classified as \(c\), out of the total number of projections classified as \(c\), or

        $$\mathrm{precision}_{c}=\frac{N_{cc}}{\sum_{j=1}^{3}N_{cj}}$$
        (5)

        For example, the precision of increasing trends is the number of correctly classified increases (\(N_{II}\)) divided by the total number of projections classified as increasing (\(N_{ID}+N_{IF}+N_{II}\)).
        recall is the fraction of projections correctly classified as \(c\), out of the total number of projections observed as \(c\), or

        $$\mathrm{recall}_{c}=\frac{N_{cc}}{\sum_{j=1}^{3}N_{jc}}$$
        (6)

    For example, the recall of increasing trends is the number of correctly classified increases (\(N_{II}\)) divided by the total number of observations that increased (\(N_{DI}+N_{FI}+N_{II}\)).
    In some instances, we provide precision and recall summarized across all three classes; to do so, we averaged precision or recall across the three projected classes (decreasing, flat, increasing). The code and data to reproduce all analyses can be found in the public GitHub repository69.
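    Given the 3x3 table of projected versus observed trends (as in Fig. 5d), Eqs. (5)-(6) and the class averages reduce to a few lines; the counts below are hypothetical.

```r
# Confusion matrix: rows = projected trend, columns = observed trend (hypothetical counts).
conf <- matrix(c(30, 10,  5,
                  8, 40, 12,
                  4,  9, 35),
               nrow = 3, byrow = TRUE,
               dimnames = list(projected = c("decreasing", "flat", "increasing"),
                               observed  = c("decreasing", "flat", "increasing")))

precision <- diag(conf) / rowSums(conf)  # N_cc / sum_j N_cj, Eq. (5)
recall    <- diag(conf) / colSums(conf)  # N_cc / sum_j N_jc, Eq. (6)

macro_precision <- mean(precision)       # summary across the three classes
macro_recall    <- mean(recall)
```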
    Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.