In the spirit of learning everything I can about COVID-19, I attended the day-long COVID-19 and AI: A Virtual Conference organized by the Stanford Human-AI (HAI) group. One of the speakers was Prof. Nigam Shah, who spoke about his Medical Center's Data Science Response to the Pandemic, and described the types of Data Science models that can inform policy to combat the virus. In addition, he wrote this Medium post about Profiling presenting symptoms of patients screened for SARS-Cov-2, where he used the same diagram for his unified model, which is what caught my eye. Hat tip to my colleague Helena Deus for finding the article and posting the link on our internal Slack channel.
In any case, the Medium post describes a text processing pipeline designed by Prof. Shah's group to extract clinical observations from notes written by care providers at the Emergency Department of Stanford Health Care when screening patients for COVID-19. The pipeline is built using what look like rules based on the NegEx algorithm, among other things, and uses Snorkel to train models that recognize these observations in text from these noisy rules. The frequencies of these observations were then tabulated and probabilities calculated, ultimately leading to an Excel spreadsheet, which Prof. Shah and his team were kind enough to share with the world.
There were 895 patients considered for this dataset, of which 64 tested positive for SARS-CoV-2 (the virus that causes COVID-19) and 831 tested negative. So at this point in time, the prevalence of COVID-19 in the cohort (and by extension, possibly in the broader community) was 64/895, or about 7.2%. The observations considered in the model were the ones that occurred at least 10 times across all the patient notes.
So what can we do with this data? My first thought was a symptom checker, which would compute the probability that a particular patient tests positive given one or more of the observations (or symptoms, although I’m using the term a bit loosely, since quite a few of the observations here are not symptoms). For example, if we wanted to compute the probability of the patient testing positive given that the patient exhibits only cough and no other symptom, we would denote this as P(D=True|S0=True, S1=False, …, S49=False).
Of course, this depends on the simplifying (and very likely incorrect) assumption that the observations are independent, i.e., the fact that a patient has a cough is independent of the fact that they have a sore throat. The other thing to remember is that predictions from the symptom checker will depend on having the correct value of the current disease prevalence rate. The 7.2% value we have is only correct for the time and place where the data was collected, so it will need to be updated accordingly if we wish to use the checker, even with all its limitations. Here is a schematic of the model.
Implementation-wise, I initially considered a Bayesian Network, using SQL tables to model it as taught by Prof. Gautam Shroff in his now-defunct Web Intelligence and Big Data course on Coursera (here's a quick note on how to use SQL tables to model Bayesian Networks, since the technique, even though it's super cool, does not appear to be mainstream). However, I realized (thanks to this Math StackExchange discussion on expressing Conditional Probability given multiple independent events) that the formulation could be much more straightforward, as shown below, so I used this instead.
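For reference, the usual naive Bayes form of this relationship can be written out in the notation used above, with D for the test result and S_k for the k-th observation. This is a sketch of the standard textbook form under the independence assumption already mentioned (more precisely, it assumes the observations are independent conditional on D), and it is not a copy of any formula from the Medium post or the StackExchange thread.

```latex
% Proportionality form: posterior is proportional to prior times per-observation likelihoods
P(D \mid S_0, \dots, S_{n-1}) \;\propto\; P(D) \prod_{k=0}^{n-1} P(S_k \mid D)

% Normalized form for a positive test
P(D{=}\mathrm{True} \mid \cap S_k) =
  \frac{P(D{=}\mathrm{True}) \prod_k P(S_k \mid D{=}\mathrm{True})}
       {P(D{=}\mathrm{True}) \prod_k P(S_k \mid D{=}\mathrm{True})
        + P(D{=}\mathrm{False}) \prod_k P(S_k \mid D{=}\mathrm{False})}
```

Here P(D=True) is the prevalence, so the result shifts whenever the prevalence estimate changes.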
The idea of using the proportionality relationship is to compute the two unnormalized terms P(∩Sk|D=True) · P(D=True) and P(∩Sk|D=False) · P(D=False), and then divide the first by their sum to get the probability of a positive test given a set of symptoms. Once that was done, it led to several more interesting questions. First, what happens to the probability as we add more and more symptoms? Second, what happens to the probability with different prevalence rates? Finally, what is the "symptom profile" for a typical COVID-19 patient based on the data? Answers to these questions, and the code to get to them, can be found in my GitHub Gist here.
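A minimal sketch of the normalization step described above might look like the following. It is not the code from the Gist: the function name `posterior_positive` is made up for illustration, and the per-observation probabilities are hypothetical placeholders, not values from the shared spreadsheet. Only the prevalence (64 of 895) comes from the dataset described earlier.

```python
from math import prod


def posterior_positive(present, p_obs_pos, p_obs_neg, prevalence):
    """P(D=True | observations) under the naive independence assumption.

    present[k]   -- True if observation k was noted for the patient
    p_obs_pos[k] -- P(S_k=True | D=True)
    p_obs_neg[k] -- P(S_k=True | D=False)
    prevalence   -- P(D=True)
    """
    # Likelihood of the full observation vector under each test outcome;
    # an absent observation contributes (1 - p).
    like_pos = prod(p if s else 1.0 - p for s, p in zip(present, p_obs_pos))
    like_neg = prod(p if s else 1.0 - p for s, p in zip(present, p_obs_neg))
    # Unnormalized terms, then divide by their sum.
    num_pos = like_pos * prevalence
    num_neg = like_neg * (1.0 - prevalence)
    return num_pos / (num_pos + num_neg)


# HYPOTHETICAL likelihoods for three observations (not from the spreadsheet).
p_obs_pos = [0.6, 0.4, 0.3]
p_obs_neg = [0.4, 0.2, 0.3]
prevalence = 64 / 895  # cohort prevalence reported above

# Question 1: how does the probability move as observations are added?
cough_only = posterior_positive([True, False, False], p_obs_pos, p_obs_neg, prevalence)
cough_plus_one = posterior_positive([True, True, False], p_obs_pos, p_obs_neg, prevalence)

# Question 2: how does it move with different prevalence rates?
sweep = [
    posterior_positive([True, False, False], p_obs_pos, p_obs_neg, prev)
    for prev in (0.01, prevalence, 0.20)
]

print(cough_only, cough_plus_one, sweep)
```

With these placeholder numbers, adding an observation that is more common among positives raises the estimate, and the same observation vector yields a higher estimate as the assumed prevalence goes up. An observation that is equally likely in both groups leaves the estimate unchanged.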
I’ve said it before, and given that people might tend to grasp at straws because of the pandemic situation, I’m going to say it again. This is just a model, and very likely an imperfect one. Conclusions from such models are not a substitute for medical advice. I know most of you realize this already, but just in case, please do not use the conclusions from this model to make any real-life decisions without independent verification.