A Bayesian Analysis Of One Aspect Of The SARS-CoV-2 Origin Story — Where The First Recorded Outbreak Occurred.
This article applies Bayesian analysis to the hypothesis that the proximal origin of SARS-CoV-2 was an uncontrolled release from a laboratory using, as evidence, one aspect of the SARS-CoV-2 origin story — where the first recorded outbreak occurred.
The goal is not to derive a single definitive estimate of the probability that the proximal origin of SARS-CoV-2 was a laboratory or was natural — any given Bayesian analysis is only as unbiased as its priors and the priors are only as unbiased as those selecting them.
Instead, we sample 2 of the parameters across 2 orders of magnitude and the 3rd across a plausible range, and show how the posterior probabilities vary as the parameters vary across these orders of magnitude. The article shows that even if you do not believe the laboratory origin hypothesis, the location of the first recorded outbreak is still highly relevant to a rational revision of your initial belief in that hypothesis.
Bayes Theorem
The eponymously named Bayes Theorem was discovered by the Reverend Thomas Bayes in the 1700s and saved for posterity by an archivist of his papers who discovered the work posthumously. In common language, it provides a rational technique for revising a prior belief in light of new evidence. The equation for Bayes Theorem is given below:
P(H|E) = P(E|H) × P(H) / P(E)
where:
- H is the statement of the hypothesis of interest
- P(H) is the prior probability that the hypothesis is true, independent of the evidence.
- E is the evidence being used to revise the belief in the hypothesis
- P(E) is the marginal likelihood of the evidence, independent of the hypothesis
- P(E|H) is the likelihood of the evidence, given that the hypothesis is true
- P(H|E) is the posterior probability of the hypothesis, given the evidence.
P(E) is sometimes difficult to estimate, but the following identity must hold:
P(E) = P(E|H) × P(H) + P(E|^H) × P(^H)
Here P(E|^H) is the probability of the evidence, assuming the hypothesis is false and P(^H) is the probability the hypothesis is false which is the same as 1-P(H). Estimating the two conditional probabilities P(E|H) and P(E|^H) is generally easier than estimating the unconditional probability, P(E).
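The two relationships above can be combined into a single small function. The sketch below is only an illustration: the function name `posterior` and its argument names are assumptions chosen for this example, and the numbers in the usage line are hypothetical values, not estimates from this article.

```python
def posterior(p_h: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Return P(H|E) using Bayes Theorem.

    P(E) is not supplied directly; it is expanded as
    P(E) = P(E|H) * P(H) + P(E|^H) * P(^H), with P(^H) = 1 - P(H).
    """
    p_not_h = 1.0 - p_h
    p_e = p_e_given_h * p_h + p_e_given_not_h * p_not_h
    return p_e_given_h * p_h / p_e


# Hypothetical values: the evidence is equally likely whether or not H is true,
# so it carries no information and the posterior should equal the prior.
uninformative = posterior(0.2, 0.5, 0.5)
print(uninformative)
```

Writing the function in terms of P(E|H) and P(E|^H), rather than P(E), mirrors the point made above that the two conditional probabilities are usually the easier quantities to estimate.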
The Scenarios
The Hypothesis
The hypothesis, H, for the main scenario is:
That the outbreak of SARS-CoV-2 in the human population was caused by an uncontrolled release of the virus from a virology laboratory in China.
By using the term uncontrolled release, we are specifically excluding from consideration the possibility that the pathogen was deliberately released from the laboratory.
The hypothesis is stated without reference to the particular city in which the outbreak occurred nor to a particular laboratory. The hypothesis only assumes the same state of knowledge that would be shared by anyone learning, for the first time, about the outbreak of a novel coronavirus in China in early 2020.
The Evidence
The evidence, E, for the main scenario is:
The first recorded outbreak of SARS-CoV-2 in the human population occurred in a city that is also home to a virology laboratory that actively performs research on closely related viruses.
In this case, the city is Wuhan and the virology laboratory is run by the Wuhan Institute of Virology.
Supporting Facts
These are the facts of the matter that are relevant to the analysis. These are by no means all the relevant facts to the origin of SARS-CoV-2, but are sufficient to conduct this Bayesian analysis of the factual scenario and a selected counter-factual scenario.
- COVID-19 is a disease known to be caused by SARS-CoV-2, a species of betacoronavirus.
- Wuhan is home to the Wuhan Institute of Virology, which is a well-known centre of expertise about bat coronaviruses in general and about one of the closest known relatives of SARS-CoV-2, RaTG13, in particular.
- the two closest known wild relatives of SARS-CoV-2, RaTG13 and RmYN02, were discovered in Yunnan province, near Mojiang
- the population of China is 1.3b people
- population of Wuhan prefecture, 11m ~ 0.8% of Chinese population
- area of Wuhan prefecture, 8494km²
- distance from Mojiang to Wuhan, capital of Hubei province ~ 1800km
- area of Kunming prefecture, 21000km²
- population of Kunming prefecture, 6.6m ~ 0.5% of Chinese population
- distance from Mojiang to Kunming, capital of Yunnan province ~ 250km
The Parameters
The three parameters to estimate are: P(H), P(E|H) and P(E|^H). P(E) can be calculated from P(H), P(E|H) and P(E|^H) given the identity above.
P(H)
P(H) is the prior probability that the hypothesis is true, independent of the evidence.
Estimating the prior probability of the hypothesis is difficult to do accurately and care must be taken to discount, for now, any knowledge one might actually have about the relative location of the Wuhan Institute of Virology to the location of the outbreak, since that “knowledge” will be provided by the statement of evidence.
Is a laboratory outbreak impossible? No, clearly not. Laboratory outbreaks have occurred before and they will occur in the future. For the purposes of illustration, I have selected 3 different values to be considered: 1/10000, 1/1000 and 1/100, which span 2 orders of magnitude. In other words, the prior probability of the hypothesis is no more than 1% but may be as small as 0.01%.
Ardent believers in the laboratory origin hypothesis might argue that the prior probability of the hypothesis is more than 1%. Likewise, ardent sceptics might argue that it can’t possibly be greater than 0.01%. However, the range chosen does cover 2 orders of magnitude and those that wish to argue that this range does not include the actual prior probability of the hypothesis being true will need to make their own arguments about why this is so.
Is there a rational basis for excluding smaller or larger probabilities? I don’t see the need to include possibilities less than 1/10000 in the absence of an argument that demonstrates the need to do so. Likewise, I can’t provide an argument for considering possibilities of more than 1 in 100, since laboratory outbreaks are far from everyday occurrences.
So, for the purposes of this analysis, we are assuming that:
P(H) is one of 0.0001, 0.001 or 0.01
P(E|^H)
P(E|^H) is the conditional probability of the evidence, given that the hypothesis is actually false.
If a laboratory wasn’t responsible for the uncontrolled release of SARS-CoV-2, then what is the chance that the first recorded outbreak of the virus would occur in Wuhan, home of the Wuhan Institute of Virology?
There might be many ways to answer that. With perfect knowledge and unlimited computing resources, one could enumerate all possible transmission chains from the most likely natural reservoir of SARS-CoV-2 (or its close predecessors) and then count up how many of those end up causing the first recorded outbreak in Wuhan rather than elsewhere in China (or, indeed, the rest of the world). This would be a pretty good estimate of the chance that the first recorded outbreak occurs in Wuhan.
In the absence of perfect knowledge, we can find other ways to produce an estimate.
One way that springs to mind is to consider the relative area of Wuhan prefecture within the 1800km radius circle centred on Mojiang caves in Yunnan.
So, if we estimate the probability that the first recorded SARS-CoV-2 outbreak would occur in Wuhan rather than anywhere else within the 1800km radius around Mojiang caves in this way, then that probability is the vanishingly small 0.00083, or close to 1/1000.
However, that estimate is not fair to the natural origin hypothesis because the population is not uniformly distributed across the larger area; indeed, some of that area is indubitably over the ocean.
An option that is fairer to the natural origin hypothesis is to estimate the probability by assuming that everyone in China had an equal chance of being the initial vector of SARS-CoV-2 and then counting how many of those people live in Wuhan.
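The area-based figure can be reproduced from the supporting facts listed earlier. This is a minimal sketch: the constant names are assumptions for illustration, and the circle is treated as a flat disc, which is a simplification.

```python
import math

# Figures taken from the Supporting Facts section.
WUHAN_AREA_KM2 = 8494   # area of Wuhan prefecture
RADIUS_KM = 1800        # distance from Mojiang to Wuhan

# Area of a flat circle of that radius centred on Mojiang (simplifying assumption).
circle_area_km2 = math.pi * RADIUS_KM ** 2

# Fraction of the circle occupied by Wuhan prefecture.
area_ratio = WUHAN_AREA_KM2 / circle_area_km2

print(f"circle area: {circle_area_km2:,.0f} km²")
print(f"area ratio:  {area_ratio:.5f}")
```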
This estimate has the advantage of better taking account of population density, but it is biased against the laboratory origin hypothesis because it assumes that everyone in Wuhan was just as likely as everyone in Kunming to be the first to contract SARS-CoV-2, even though Kunming is at least 6 times closer to Mojiang than Wuhan is. Nevertheless, since Wuhan is home to about 0.8% of the Chinese population, we will select P(E|^H) = 0.01 as the centre point of our estimates for this parameter and choose estimates of 0.001 and 0.1 to account for the extremes on either side.
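The population-based estimate, and the distance comparison used to argue that it favours the natural origin hypothesis, can be checked with a few lines of arithmetic. The variable names below are assumptions for illustration; the figures are those listed under Supporting Facts.

```python
# Figures taken from the Supporting Facts section.
CHINA_POPULATION = 1.3e9
WUHAN_POPULATION = 11e6
KUNMING_POPULATION = 6.6e6
MOJIANG_TO_WUHAN_KM = 1800
MOJIANG_TO_KUNMING_KM = 250

# Chance that a uniformly chosen resident of China lives in each prefecture.
wuhan_share = WUHAN_POPULATION / CHINA_POPULATION
kunming_share = KUNMING_POPULATION / CHINA_POPULATION

# How many times further Wuhan is from Mojiang than Kunming is.
distance_ratio = MOJIANG_TO_WUHAN_KM / MOJIANG_TO_KUNMING_KM

print(f"Wuhan share of population:   {wuhan_share:.4f}")
print(f"Kunming share of population: {kunming_share:.4f}")
print(f"Wuhan is {distance_ratio:.1f} times further from Mojiang than Kunming")
```

Rounding the Wuhan share up to 0.01 gives the central value used for P(E|^H).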
P(E|H)
P(E|H) is the conditional probability of the evidence, given the hypothesis is true.
In other words, if there was an uncontrolled release of a virus from a laboratory, what are the chances that the resulting outbreak occurs in the same city as the laboratory itself?
If the disease caused by the virus had an incubation period of years and the infectious agent was primarily transmitted by contact with bodily fluids such as during relatively infrequent episodes of sexual contact, one could imagine scenarios where a travelling scientist catches it in a laboratory in one city and then travels to another city and only then do circumstances arise where an outbreak occurs in a city remote to the lab. For hypothetical diseases of this kind, you might assign a low to moderate value of P(E|H) to reflect the possibility that evidence might escape detection in the vicinity of the laboratory where zoonotic transfer first occurred.
In most other cases, such as this one, where the infectious agent is known to be highly contagious and the incubation period is relatively short, the chance that the first outbreak occurs in the immediate vicinity of the laboratory where the zoonotic transfer first occurred is much greater.
One way to estimate this number more rigorously might be, for example, to tabulate all the known cases of zoonotic transfer involving laboratory animals that have caused outbreaks and to identify the percentage of those cases in which the outbreak occurred in the same city as the zoonotic transfer. We could then use this percentage as an estimate of P(E|H).
For the reasons given above, it seems likely that this number is greater than 0.5 for SARS-CoV-2 and must be less than or equal to 1. To obtain a middle estimate which is fairer to the natural origin case than not, we will use the geometric mean of 0.51 and 1, so 0.71.
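The central value can be checked directly. In this sketch the variable names are assumptions for illustration, and the arithmetic mean is computed only for comparison.

```python
import math

LOW = 0.51   # just above the 0.5 lower bound argued for above
HIGH = 1.0   # upper bound on any probability

# Geometric mean of two values: the square root of their product.
geometric_mean = math.sqrt(LOW * HIGH)

# Arithmetic mean, for comparison only.
arithmetic_mean = (LOW + HIGH) / 2

print(f"geometric mean:  {geometric_mean:.2f}")
print(f"arithmetic mean: {arithmetic_mean:.3f}")
```

The geometric mean is never larger than the arithmetic mean, so choosing it gives a lower central value of P(E|H), which is the direction that favours the natural origin case.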
Populating The Cube
With 3 estimates for each of the 3 unknown parameters of the posterior probability equation, we can calculate a cube of estimates for P(H|E) as follows.
Each of the green squares can be considered as a layer of a cube. The values of P(E|H) correspond to a coordinate of the top-bottom axis, the values of P(H) correspond to a coordinate of the left-right axis and the values of P(E|^H) correspond to the coordinates of a back-front axis.
Parameter values are arranged so that values that increase P(H|E) run from left to right, top to bottom or back to front. Sceptical estimates therefore tend to be in the back, upper, left corner of the cube, and susceptible estimates tend to be in the front, lower, right corner of the cube.
The ardent sceptic’s revised estimate is no more than 0.05% or 5/10000. It applies to someone who was initially very sceptical about a lab origin, who believes there is no more than 51% chance that an uncontrolled release of a highly contagious disease would lead to a local outbreak and who thinks there was at least a 10% chance that a natural outbreak of a virus native to Yunnan would have occurred in Wuhan before any place else.
At the other extreme is the ardent believer, who started with at least a 1% belief in a laboratory outbreak, is 100% certain that an uncontrolled laboratory release would result in a local outbreak and believes that the probability that a natural outbreak of a virus native to Yunnan would occur in Wuhan before any place else is less than 0.1%. The ardent believer’s revised belief is that the probability that the Wuhan outbreak was caused by an uncontrolled laboratory release is at least 91%.
In the centre is the so-called “central” observer, who accepts that the central values for each of the parameter ranges are reasonable estimates of the true values of the probabilities being estimated. The central observer started with an initially sceptical belief in the hypothesis of 0.1%, believes that the average citizen in Wuhan was as likely as any other citizen of China to be the initial vector of the virus into the human population and believes that there is no more or less than a 71% chance that an uncontrolled release from a laboratory of a highly contagious pathogen such as SARS-CoV-2 would result in a local outbreak as opposed to an outbreak in some other location. The central observer’s revised belief in the hypothesis is 6.6%.
It should be noted that the “central” observer isn’t necessarily the same thing as the unbiased or neutral observer. In particular, the central values for all 3 parameters are likely, in the opinion of this author, to be at least slightly biased towards the natural origin position.
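The full cube, and the three observers described above, can be recomputed from the parameter values chosen earlier. This is a sketch under stated assumptions: the function and variable names are illustrative, and the cube is stored as a dictionary keyed by (P(H), P(E|H), P(E|^H)) rather than in any layout the original figure may have used.

```python
from itertools import product


def posterior(p_h, p_e_given_h, p_e_given_not_h):
    """P(H|E) by Bayes Theorem, with P(E) expanded by total probability."""
    p_e = p_e_given_h * p_h + p_e_given_not_h * (1.0 - p_h)
    return p_e_given_h * p_h / p_e


# Parameter values from the text, ordered from sceptical to susceptible.
P_H = [0.0001, 0.001, 0.01]
P_E_GIVEN_H = [0.51, 0.71, 1.0]
P_E_GIVEN_NOT_H = [0.1, 0.01, 0.001]

# One posterior for each of the 27 parameter combinations.
cube = {
    (p_h, p_eh, p_enh): posterior(p_h, p_eh, p_enh)
    for p_h, p_eh, p_enh in product(P_H, P_E_GIVEN_H, P_E_GIVEN_NOT_H)
}

sceptic = cube[(0.0001, 0.51, 0.1)]
central = cube[(0.001, 0.71, 0.01)]
believer = cube[(0.01, 1.0, 0.001)]

# Print one layer per value of P(E|^H): rows are P(E|H), columns are P(H).
for p_enh in P_E_GIVEN_NOT_H:
    print(f"P(E|^H) = {p_enh}")
    for p_eh in P_E_GIVEN_H:
        row = "  ".join(f"{cube[(p_h, p_eh, p_enh)]:8.4%}" for p_h in P_H)
        print(f"  P(E|H) = {p_eh:<4}: {row}")
```

The three named corners come out at roughly 0.05%, 6.6% and 91%, in line with the figures quoted for the ardent sceptic, the central observer and the ardent believer.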
A Counterfactual Scenario
To illustrate how important the parameter P(E|^H) is to the posterior probability of the hypothesis, we can apply the same analysis to a counterfactual scenario in which the probability of the evidence being true, given that the hypothesis is false, is far higher: a scenario where the first outbreak occurred in Kunming.
Kunming is the capital city of Yunnan province. It doesn’t necessarily have a world-leading bat coronavirus research lab, but it does host the Kunming Institute of Zoology (KIZ), which, at least in principle, might conduct research into viruses.
However, the key fact that increases P(E|^H) in this case is the sheer proximity of Kunming to the natural reservoir of the two closest known relatives (RaTG13, RmYN02) of SARS-CoV-2 (~250 km), compared to Wuhan, which is ~1800 km away. It is quite easy to believe there is a multitude of pathways from the caves of Mojiang to the first recorded outbreak in the provincial capital. For this reason, we can assign P(E|^H) a range of probabilities between 50% and 100% with a central assignment of 71%.
If we do this, we get a probability cube with much reduced values of P(H|E) as shown below:
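The counterfactual cube can be recomputed in the same way as the Wuhan cube. The sketch below assumes that the values of P(H) and P(E|H) are unchanged from the Wuhan scenario and that only P(E|^H) moves to the 50% to 100% range; the function and variable names are illustrative.

```python
from itertools import product


def posterior(p_h, p_e_given_h, p_e_given_not_h):
    """P(H|E) by Bayes Theorem, with P(E) expanded by total probability."""
    p_e = p_e_given_h * p_h + p_e_given_not_h * (1.0 - p_h)
    return p_e_given_h * p_h / p_e


# Assumed unchanged from the Wuhan scenario.
P_H = [0.0001, 0.001, 0.01]
P_E_GIVEN_H = [0.51, 0.71, 1.0]

# Kunming counterfactual: a natural outbreak there is far more plausible.
P_E_GIVEN_NOT_H_KUNMING = [1.0, 0.71, 0.5]

kunming_cube = {
    (p_h, p_eh, p_enh): posterior(p_h, p_eh, p_enh)
    for p_h, p_eh, p_enh in product(P_H, P_E_GIVEN_H, P_E_GIVEN_NOT_H_KUNMING)
}

kunming_central = kunming_cube[(0.001, 0.71, 0.71)]
kunming_max = max(kunming_cube.values())

print(f"central observer:   {kunming_central:.4%}")
print(f"largest posterior:  {kunming_max:.4%}")
```

For the central observer the evidence is equally likely under either hypothesis, so the posterior stays at the 0.1% prior. Under these assumptions, even the most susceptible corner of the cube remains below 2%.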
Strength Of Belief Revision
The point of the counterfactual is to illustrate that the sheer unlikelihood of an outbreak occurring in Wuhan if the hypothesis is false is enough to revise the initial strength of one’s belief in the hypothesis by between 1 and 2 orders of magnitude. This is illustrated by this table, which shows the ratio between P(H|E) and P(H) for the Wuhan scenario:
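The ratio between P(H|E) and P(H) can be tabulated with a short script. This is a sketch only: the names are assumptions for illustration, and the layout of the printed table is not necessarily that of the original.

```python
from itertools import product


def posterior(p_h, p_e_given_h, p_e_given_not_h):
    """P(H|E) by Bayes Theorem, with P(E) expanded by total probability."""
    p_e = p_e_given_h * p_h + p_e_given_not_h * (1.0 - p_h)
    return p_e_given_h * p_h / p_e


# Wuhan scenario parameter values from the text.
P_H = [0.0001, 0.001, 0.01]
P_E_GIVEN_H = [0.51, 0.71, 1.0]
P_E_GIVEN_NOT_H = [0.1, 0.01, 0.001]

# Strength of belief revision: how many times larger the posterior is than the prior.
ratios = {
    (p_h, p_eh, p_enh): posterior(p_h, p_eh, p_enh) / p_h
    for p_h, p_eh, p_enh in product(P_H, P_E_GIVEN_H, P_E_GIVEN_NOT_H)
}

central_ratio = ratios[(0.001, 0.71, 0.01)]

# One block per value of P(E|^H): rows are P(E|H), columns are P(H).
for p_enh in P_E_GIVEN_NOT_H:
    print(f"P(E|^H) = {p_enh}")
    for p_eh in P_E_GIVEN_H:
        row = "  ".join(f"{ratios[(p_h, p_eh, p_enh)]:7.1f}" for p_h in P_H)
        print(f"  P(E|H) = {p_eh:<4}: {row}")
```

For the central observer the ratio is about 66, which lies between 1 and 2 orders of magnitude.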
Discussion
This post has not attempted to derive the one true and absolute probability that the first outbreak of SARS-CoV-2 in Wuhan originated from an uncontrolled release from a laboratory.
It has presented a Bayesian analysis, with sceptical priors, and a plausible range of likelihood estimates that should encompass the positions of most sceptics, believers and neutral observers. By tabulating these values and the ratio between P(H|E) and P(H), it is possible to see how important the fact that the first outbreak occurred in Wuhan is to a rational revision of any prior belief about the likelihood of an uncontrolled release from a laboratory.
As a sanity check that the analysis is reasonable, this post has also presented a counterfactual which shows how weak a posterior belief in a lab origin would remain if the outbreak had instead occurred in Kunming. The only change of relevance in the counterfactual analysis was the far greater likelihood that the evidence would occur in Kunming, even if the hypothesis was false, due to its relative proximity to the natural reservoirs of the closest known relatives of SARS-CoV-2.
Postscript
Since beginning to write this post in late November, the work [1] of Gilles Demaneuf and Rodolphe De Maistre has come to my attention. At the time of writing this postscript, I had not read past the abstract of that work, but I understand those authors also selected, using different arguments and counter arguments, a conservative value of 1% for the parameter P(E|^H). I decided not to read their paper until I had finally published this post so that I could truthfully claim that my analysis had not been influenced by theirs, but now look forward to reading what they have written.
[1] Outlines of a probabilistic evaluation of possible SARS-CoV-2 origins