In-Person School? More Research Is Needed
Whether to keep schools that are already in-person open, or to reopen schools that are currently remote, is of fundamental importance to many communities, families, and educators. A study published in December 2020 used county- and district-level data from Washington and Michigan to investigate whether in-person or hybrid instruction was correlated with increased spread of COVID-19. Efforts along these lines are highly valuable, and data-based research ought to be what drives public policy.
However, this particular study has serious flaws, which, as usual, were not widely reported in the popular press. Instead, the ballyhoo was, at its worst and most oversimplified: “in-person school does not increase community spread of COVID-19.” Slightly more nuanced reporting included the key qualifier: “As long as existing community spread is low (a relative, qualitative term never well defined by the study’s authors), in-person school does not increase community spread.” What went virtually unreported were three key facts about this study:
- The method used by the researchers was multivariate linear regression, probably not the most informative statistical analysis to use, given the nature of the data and the research question.
- A key finding was that “simple, naive” regression analyses showed *very high correlation* between in-person instruction and increased community spread of COVID-19, regardless of rates of community spread.
- Although the researchers included covariate analysis and used a fixed-effects strategy, they neglected a crucial step before running their regressions: testing the selected variables for variance inflation, and eliminating multicollinearity and interaction effects among them.
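The check described in that last point is mechanical to run. Here is a minimal sketch of a variance-inflation test on synthetic data; the variable names (`density`, `mobility`, `in_person`) are invented for illustration and are not the study’s actual covariates:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of design matrix X.

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all the other columns (plus an intercept).
    """
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
density = rng.normal(size=200)
mobility = rng.normal(size=200)
# Built mostly from density, so it is near-collinear by construction.
in_person = 0.95 * density + 0.05 * rng.normal(size=200)

X = np.column_stack([density, mobility, in_person])
print(vif(X))  # density and in_person inflate sharply; mobility stays near 1
```

A common rule of thumb treats VIF above 5 or 10 as a red flag; coefficients on such variables (and any correlations inferred from them) are unstable and should not drive policy conclusions.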
The study was released as a white paper and, as far as I can tell, was not peer reviewed. Reading it critically, as a peer reviewer would, I would definitely recommend several revisions before publication. Here is the list of major revisions I would have suggested:
- Employ at least one method of analyzing the multivariate data beyond OLS regression, *after* testing all variables for variance inflation. The data lend themselves to a variety of other statistical analyses, and the days of simply fitting a multivariate OLS linear regression and claiming to have found informative results are long gone, or they should be. One good approach would be to set a threshold for a binary outcome and then use boosted regression trees to estimate the probability of increased spread given school modality. Another would be Bayesian analysis, for which these data are particularly well suited: the large body of pre-existing information could be encoded as priors, perhaps yielding strong posterior probabilities. The point is, it is all too easy to keep adding variables to a multivariate linear regression, without accounting for multicollinearity or variance inflation, and arrive at whatever correlation one wants. I trust their “simple and naive” regression *more* than their more complex regression equation. A hard, data-driven rationale for their choice of variables is never provided.
- There is little to no actual epidemiology or biology in this study. Were epidemiologists part of the research team? How do these results make sense from an epidemiological perspective? Why would low pre-existing rates of community spread explain the apparently low correlation between in-person learning and additional community spread? Is there an epidemiological or biological mechanism behind this? This is where another method, network analysis (graph theory), would have been highly informative. Modeling how “low” rates of existing community spread propagate through school communities would have offered some epidemiological rationale for these findings.
- It is extremely important to state the limitations of a study like this much more assertively in the discussion section. There is mention of “caution” in interpreting the results, of a lack of data on “minority” populations, and of other limitations. However, the authors should realize that, in our current climate, *any* data-based research that seems to offer a way out of the logjam of public-health restrictions will be seized upon very optimistically by a weary and exasperated public. Actual lives are in the balance.
- What are the specific policy recommendations? If the study is strong, make some actual hard recommendations: clear, simple threshold case rates per 100,000, firm masking requirements, distancing rules, building-ventilation parameters, and so on. If the results are not strong enough to support such recommendations, revisit the data, methods, results, and conclusions, and produce something more useful. As it stands, this study has been thoroughly propagandized.
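To make the Bayesian suggestion above concrete, here is a toy conjugate Beta-Binomial update. The case counts are entirely invented for illustration, not taken from the study; the point is only that prior information plus observed counts yields a full posterior over infection rates per modality, rather than a single regression coefficient:

```python
import math

def beta_posterior(prior_a, prior_b, cases, trials):
    """Conjugate Beta-Binomial update.

    With a Beta(a, b) prior on the infection rate and `cases` observed
    out of `trials`, the posterior is Beta(a + cases, b + trials - cases).
    Returns the posterior mean and standard deviation.
    """
    a = prior_a + cases
    b = prior_b + (trials - cases)
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd

# Made-up weekly counts per tracked population, for illustration only,
# starting from a flat Beta(1, 1) prior.
in_person = beta_posterior(1, 1, cases=120, trials=10000)
remote = beta_posterior(1, 1, cases=95, trials=10000)
print("in-person:", in_person)
print("remote:   ", remote)
```

With real priors built from existing surveillance data, the overlap (or separation) of the two posteriors would directly quantify how confident we should be that modality matters, at any given level of background spread.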
As I mention above, data-based research on how major public policy decisions affect the public health situation around COVID-19 is crucially important. There has been a concerted campaign of disinformation regarding this pandemic, and data-based analyses that offer strong results, arrived at through a variety of well-tested statistical methods, are to be applauded. It is all the more unfortunate, then, to encounter a data-based study with serious flaws, one that ought to have been refocused and significantly strengthened in several ways before becoming public.