4 Awesome COVID Machine Learning Projects

Forward thinking ways to apply Machine Learning in a Pandemic

Photo by Neil Thomas on Unsplash

Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice. To learn more about the coronavirus pandemic, you can click here.


The pandemic has changed our lives a lot. On all sides, the lives we lived before are no longer the same. Our workplaces are different, our families are different, and our expectations are different too.

Given that most of us are working from home, I’ve put together 4 interesting COVID-based machine learning projects below, and they’re all worth checking out! Each of these has its own place and some are more practical than others. However, in terms of the raw application of knowledge, these are all great!

Let’s get right to it!


Face Mask Facial Recognition

Facial recognition is a huge field and it’s only set to grow in the coming months and years. Computer Vision is developing rapidly as technology in this space, including autonomous driving and identification, becomes more and more widespread.

At scale, Coronavirus has resulted in a demographic and societal change whereby people have to physically change their actions. Given that, masks are becoming compulsory in a huge number of countries and as such, the ability to identify whether people are wearing masks is also growing in demand.

Photo by Flavio Gasperini on Unsplash

Building a system that can determine if you’re wearing a mask or not is awfully similar to the problem of Facial Recognition, so the solution to this problem isn’t that difficult to create. Given that, the following sources are those that I’ve found quite useful in researching into it:

I really appreciated the work by PyImageSearch and even implemented the framework on my own home computer. It worked well because two scripts are provided, meaning that you can do less of the fiddly stuff and more of the playing around:

  1. Face mask recognition in images
  2. Face mask recognition in videos

Definitely worth a play around at home!

Photo by bruce mars on Unsplash

Social Distance Recognition

Following on from the mask recognition project: social distancing is one of the key themes of 2020. In the UK, for example, you have to remain at a distance of more than 2 metres from people outside of your ‘bubble’, and this distance varies between regions in Europe.

The trouble is implementing this in a way that doesn’t require new hardware. Existing cameras don’t really have an innate concept of distance, so two markers are usually set to tell the program what approximately constitutes a safe distance.

Given that, the following sources will help you to develop your own Social Distance Recognition tool!

Blog by Aqeel Anwar: source and Github
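To make the two-marker idea concrete, here is a minimal sketch of how the calibration could work. It is not taken from the blog above: the function names, marker positions and detections are all hypothetical, and it assumes a flat scene where a single metres-per-pixel scale is good enough. A real camera view has perspective, which a proper implementation would need to correct for.

```python
from itertools import combinations
from math import dist


def metres_per_pixel(marker_a, marker_b, real_distance_m):
    """Scale factor from two reference markers a known distance apart."""
    return real_distance_m / dist(marker_a, marker_b)


def too_close(people, scale, min_distance_m=2.0):
    """Return index pairs of people standing closer than min_distance_m."""
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(people), 2)
        if dist(p, q) * scale < min_distance_m
    ]


# Hypothetical calibration: markers 200 px apart in the image, 2 m apart on the ground
scale = metres_per_pixel((100, 400), (300, 400), real_distance_m=2.0)

# Hypothetical detections: (x, y) pixel position of each person
people = [(120, 380), (180, 390), (600, 410)]
violations = too_close(people, scale)
print(violations)
```

In practice the pixel positions would come from a person detector running on each frame; only the distance check is shown here.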

Photo by Joshua Earle on Unsplash

Symptom Checker

Say you’re coming down with a cold, getting a fever and generally feeling a bit run over. Should you worry?

Yes. Get a COVID test.

But if you can muster some energy, you can always use machine learning to aid in the determination of how likely you are to have COVID (or so the theory goes).

Using a sample data set generated here, you can quite easily throw it into a Random Forest and understand (a) how likely you are to have coronavirus and (b) how much you should be worried about each symptom.

Blog by Tanveer Hurra: source

Also, if you do have symptoms, go get checked and isolate!

However, the trick with this project is getting the symptom data. It’s not easy, but the more symptom data you get, the better your predictions!
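Since real symptom data is hard to come by, here is a rough sketch of the Random Forest idea on randomly generated data. Everything about the data is an assumption: the symptom names are placeholders, and the labelling rule is made up purely so the model has something to learn. It uses scikit-learn’s RandomForestClassifier and says nothing about real COVID symptoms.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
symptoms = ["fever", "dry_cough", "loss_of_smell", "runny_nose"]  # hypothetical columns

# Synthetic data only: 1 = symptom present, 0 = absent
X = rng.integers(0, 2, size=(500, len(symptoms)))
# Made-up rule so there is a pattern to find: fever plus loss of smell -> positive
y = (X[:, 0] & X[:, 2]).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# (a) estimated probability of a positive label for one new patient
patient = np.array([[1, 1, 1, 0]])
prob_positive = model.predict_proba(patient)[0, 1]

# (b) how much each symptom contributes to the model's decisions
importance = dict(zip(symptoms, model.feature_importances_))
print(prob_positive, importance)
```

With real data you would also hold back a test set to check how well the predictions generalise before trusting either number.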

Graph Databases

Social distancing is a big deal in the pandemic because the virus can transfer from person to person quite quickly over short distances. Given that, if an individual tests positive, it’s important to understand (a) who is in their network of people (which is actually easy to identify) and (b) how likely each person is to have been infected. This allows policy makers to easily trace who may be infected and to isolate such people.

Given that, Nebula Graph is an open source project that allows users to generate graphs and determine connections between entities based on arbitrary settings, in this case people and places. A graph is loaded with data on both sick and healthy people, along with the addresses that people were travelling to, in the hope of answering how people get sick when no one they came in contact with was sick at the time of contact.

The blog by Min Wu is really insightful here, and despite it not coming with code, it’s not a difficult project to translate into Python.

My recommendation would be to first build a model that works with randomly generated data, and then to find a real data set or generate your own within your network!
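As a starting point for the randomly generated version, here is a small sketch in plain Python. It is not a translation of the Nebula Graph blog: the names, the number of people and places, and the ranking rule are all assumptions, and it ignores the timing of visits entirely.

```python
import random
from collections import defaultdict

random.seed(42)
people = [f"person_{i}" for i in range(20)]  # hypothetical names
places = [f"place_{i}" for i in range(5)]
sick = set(random.sample(people, 3))

# Edges of a people-to-places graph: each person visits two random places
visits = {p: set(random.sample(places, 2)) for p in people}

# Invert the edges: which people were seen at each place?
visitors = defaultdict(set)
for person, visited in visits.items():
    for place in visited:
        visitors[place].add(person)

# Healthy people who share at least one place with a sick person
exposed = defaultdict(set)
for s in sick:
    for place in visits[s]:
        for contact in visitors[place] - sick:
            exposed[contact].add(place)

# Rank by number of shared places as a crude proxy for exposure
ranking = sorted(exposed.items(), key=lambda kv: len(kv[1]), reverse=True)
for contact, shared in ranking[:5]:
    print(contact, sorted(shared))
```

A natural next step would be to attach a time window to each visit, so that two people only count as connected if they were at the same place at the same time.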

Photo by Nicholas Sampson on Unsplash

Despite us all being in lockdown, there’s a surge in creativity in the space of machine learning as lots of new problems are being posed. New problems require smart solutions, and thankfully Machine Learning is able to play its part.

Hopefully, you’ve looked into the above and tried to take a stab at some of the projects. Some are easier than the others, but any forward steps you do make can surely make a huge difference!


Thanks for reading again!! Let me know if you have any questions and I’ll be happy to help.

Keep up to date with my latest work here!

Gross Failures in COVID Reporting

Photo by Deniz Fuchidzhiev on Unsplash

Reporting Inextricable Statistics is a Problem

If its use of these items is typical of the NHS at large, the range of daily demand would be between 7.5 million to 12 million, more than the 5.5 million actually supplied. [Source]


It’s been quite clear from the beginning of the epidemic that statistical modelling is not the forte of the UK government. From expected infection counts to levels of social distancing, unexpected care home fatalities to PPE requirements: the UK Government has really struggled to put forward numbers that can be used as a basis of comparison, and when it does, it’s undoubtedly too late. Moreover, the inextricable statistics it presents can be used against the Government to stoke fear.

As a statistician, it’s imperative to me that the public should understand how to interpret this data properly.

The UK Government is not alone in inextricable reporting though. Governments across the world have been guilty of this, and they should make the extra effort to report relative statistics. If a statistic is being used as the basis for solving a problem, then the statistic reported must be relative to the problem. Don’t tell us how many items of PPE you’ve sourced: tell us what percentage of demand you have met.

Statistics used for comparisons must be relative

One thing that stuck with me from having been lucky enough to study under Professor Sir David MacKay was that Statistics needs to be approachable to make a difference. Back of the envelope statistics can really carry weight; however, you have to recognise which question your statistic is answering. In what follows, I demonstrate how making simple adjustments to commonly reported figures helps in answering pretty big questions.

The Case of Infection Counts

The data I use is from the most reliable source: Johns Hopkins University. More on this at the end.

Although it seems like a simple task (it’s not), counting the number of infection cases is important for healthcare organisations to monitor the spread of a pandemic. However, once the counts get quite large, is the raw count still meaningful?

Let’s look at this common chart: here we have the infections per country. They all look to be increasing and slightly flattening around the 225,000 mark (France and Italy a bit lower at 160kish, with Sweden down at 25kish).

Figure 1. Infections per country: The code used to create this chart is provided at the end of the article

Now as we look at this chart, we could say from it that Sweden is doing the best of all countries, for it has the fewest cases. However, Sweden’s population is almost 10x smaller than that of the other countries, so we could expect (assuming a similar rate of spreading) that the number of cases would be 10x lower by virtue of population size alone.

Therefore to compare how well one country is doing in relation to another country, we should be comparing a relative statistic: we should look at a statistic that has been adjusted for population to monitor the relative risk of being infected. This is known as Period prevalence:

Period prevalence is the number of individuals identified as cases during a specified period of time, divided by the total number of people in that population.
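The definition above is just a division, scaled to a convenient unit. As a quick illustration, with made-up figures that are not taken from the data set, the calculation per million people looks like this:

```python
def cases_per_million(cases, population):
    """Period prevalence scaled to cases per one million people."""
    return cases / population * 1_000_000


# Hypothetical figures, chosen only to illustrate the calculation
large = cases_per_million(cases=200_000, population=60_000_000)
small = cases_per_million(cases=25_000, population=6_000_000)

print(round(large), round(small))
```

In this invented example the smaller country has far fewer cases in absolute terms, yet a higher number of cases per million, which is exactly the kind of reversal the adjusted chart is meant to reveal.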

The following chart shows exactly this and the perspective entirely changes:

Figure 2. Infections per million, per country: The code used to create this chart is provided at the end of the article

By monitoring period prevalence, we can now see that Spain’s infection count per million is much higher than in other countries, and therefore its situation is much worse. Moreover, from this perspective, Sweden does not seem to be doing the best of all the countries; that actually looks to be Germany.

Reporting absolute case counts under-represents the significance of the problem in Sweden.

By transforming our absolute count statistic to a normalised measure, we can more effectively monitor relative risk and make better judgements about how one country performs in relation to another country.


Following on from this, we know that the epidemic is spreading widely and governments are reacting, but how well are they reacting and how useful have their actions been? To monitor this, we can look at how period prevalence changes between two reasonable time periods.

You would hope that as a country goes into lockdown, fewer people get infected. To check this, we can look at the changes in infection rate to see how well governments are dealing with the spread, and how this has changed through time.

To monitor change, we need to pick a time period that is robust. We know of issues of weekend seasonality and the front-loading of US case counts, so in the following chart I take a 10-day difference of the infections per million count to smooth over these features and more effectively monitor the rate of growth:

Note: the result from this chart is robust to different time gaps, and the reader is encouraged to experiment. Spoiler alert: the result is largely the same.
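For anyone who wants to experiment with the time gap, here is a tiny sketch of what an n-day difference does to a cumulative series. The numbers are invented for illustration, and the helper is a plain Python stand-in for the `.diff(10)` call used in the code at the end of the article.

```python
def n_day_change(cumulative, n=10):
    """Difference between each value and the value n days earlier."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[n:])]


# Hypothetical cumulative infections per million, one value per day
per_million = [0, 5, 5, 20, 40, 45, 45, 80, 110, 115, 115, 160, 200]

change_10 = n_day_change(per_million, n=10)
change_7 = n_day_change(per_million, n=7)
print(change_10, change_7)
```

Because each value spans several days, flat spells in the daily reporting (such as weekends) are averaged out rather than showing up as sudden drops.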

Figure 3. 10-day change in infections per million, per country: The code used to create this chart is provided at the end of the article

So, in the past 10 days, the UK has reported 750 more cases per million population, compared with Spain, which has reported an increase of 250 per million. This tells us that the virus is still spreading faster in the UK than in Spain. It actually seems that the UK is currently in the worst position of the major European countries and Sweden looks to be in the second worst position, owing largely to not going into lockdown. This is something you cannot tell from looking at Figure 1.


In the above article, I show that by dividing by population and calculating differences over time, we can form a picture that provides us with as much insight as other, more academically thorough statistics. Other metrics (like the R0 and others here) can often be seen to be inextricable because of their complex derivations and concepts. However, the public need to be given simple mathematics to quickly gauge and understand the severity of the problem.

Everyone, and I mean everyone, can learn something from looking at these simple numbers and looking at statistics more relatively.




Data

Academics at Johns Hopkins University bring together data from various reliable sources (including all major health organisations) to track the rate of growth of Coronavirus. The dataset I look at (time_series_covid19_confirmed_global) is updated at regular intervals during the day and can be accessed with a simple read_csv function from pandas (in Python) to import the data into a data frame. I remove the present day due to the data being updated intraday.


Code

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
from countryinfo import CountryInfo

# Countries I want to compare
eu_list = ['United Kingdom','France','Germany','Spain','Italy','Sweden']

# Download Johns Hopkins data
fp = 'https://data.humdata.org/hxlproxy/api/data-preview.csv?url=https%3A%2F%2Fraw.githubusercontent.com%2FCSSEGISandData%2FCOVID-19%2Fmaster%2Fcsse_covid_19_data%2Fcsse_covid_19_time_series%2Ftime_series_covid19_confirmed_global.csv&filename=time_series_covid19_confirmed_global.csv'
df = pd.read_csv(fp).drop(['Province/State','Lat','Long'], axis=1)

# Sum across countries/regions, then take daily differences
df = df.groupby('Country/Region').sum().T.diff()

# Get the population of each country
pop = {}
for c in eu_list:
    pop[c] = CountryInfo(c).population()

# One row per country; the 'country' column holds its population
pop = pd.Series(pop).to_frame('country')

# Adjust all statistics by population (.loc replaces the removed .ix indexer)
pop['multiplier'] = 1000000. / pop['country']
df2 = df.copy()
for k in eu_list:
    df2[k] = df2[k] * pop.loc[k, 'multiplier']

# Figure 1: infections per country
df[eu_list].cumsum().plot(figsize=(15,7), title='Infections per country [Updated up to 20200508]').grid()
plt.show()

# Figure 2: cases per million population
df2[eu_list].cumsum().plot(figsize=(15,7), title='Infections per million, per country [Updated up to 20200508]').grid()
plt.show()

# Figure 3: 10-day change in cases per million population
df2[eu_list].cumsum().diff(10).plot(figsize=(15,7), title='10-day change in infections Per Million People [Updated up to 20200508]').grid()
plt.show()