Why RepNet is so important

Using Deep Learning to Count Repetitions

Photo by Efe Kurnaz on Unsplash

In our daily lives, repeating actions occur frequently. This ranges from organic cycles such as heartbeats and breathing, through programming and manufacturing, to planetary cycles like day-night rotation and seasons.

Recognising these repetitions, particularly in video, calls for a system that can both identify and count them. Think of exercising: how many repetitions have you just done?

The Unsolved Problem

Isolating repeating actions is a difficult task. I know, it seems pretty straightforward when you see someone in front of you jumping up and down, but translating that into a machine learning problem makes it much more difficult. How do you teach a computer what a jumping jack looks like from every angle? How can you generalise any inference from video?

Previous work in the space analysed videos at a fine-grained level, using a cycle-consistency constraint across different videos of the same action. Reading the paper on that earlier model, you can see that you are essentially building a model that compares frames across a collection of videos:

Temporal Cycle-Consistency Learning: [source]

However, real-world videos come with problems such as camera motion, objects in the field of view that obscure the action, and changes in the form of the repeating motion: the challenge is to compute features that are invariant to such noise. The existing process also required a lot of effort to densely label data, and it would be far more practical if an algorithm could learn the periodicity from a single video.

That’s where RepNet comes in

RepNet solves the problem of counting repetitions in real-world videos, which come with noise ranging from camera motion and obscured views to drastic scale differences and changes in form.

Unlike past approaches that addressed the problem by directly comparing pixel intensities across frames, RepNet works on a single video containing periodic action and returns the number of repetitions in that video.

A RepNet is composed of three components: a frame encoder, a temporal self-similarity matrix as an intermediate representation, and a period predictor.

Its frame encoder generates embeddings by feeding each frame of the video through a ResNet-based encoder.

Then the temporal self-similarity matrix (TSM) can be calculated by comparing each frame with every other frame in the video.
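To make the idea concrete, here is a minimal NumPy sketch (my own illustration, not RepNet's actual code) of how such a matrix could be built from per-frame embeddings: negative pairwise squared distances, normalised row-wise with a softmax. The temperature value and the exact distance measure RepNet uses may differ.

import numpy as np

def self_similarity(embeddings, temperature=1.0):
    # embeddings: (num_frames, embedding_dim) array of per-frame features
    sq_norms = (embeddings ** 2).sum(axis=1)
    # pairwise squared Euclidean distances between all frames
    dists = sq_norms[:, None] + sq_norms[None, :] - 2 * embeddings @ embeddings.T
    sims = -dists / temperature
    # row-wise softmax: each row says "how similar is frame i to every other frame"
    sims -= sims.max(axis=1, keepdims=True)
    tsm = np.exp(sims)
    return tsm / tsm.sum(axis=1, keepdims=True)

frames = np.random.rand(64, 512)   # stand-in for ResNet frame embeddings
tsm = self_similarity(frames)      # (64, 64) self-similarity matrix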

The result is a matrix that subsequent modules can easily analyse to count repetitions. A transformer then predicts the period directly from the sequence of similarities in the TSM.

Once the period is obtained, the repetition count follows by dividing the number of frames in the periodic segment by the period length.
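As a sketch of that final step (the names below are illustrative, not RepNet's actual outputs), the count can be accumulated frame by frame: each frame judged to be inside a periodic segment contributes 1 / period_length repetitions.

import numpy as np

def count_repetitions(period_length, periodicity, threshold=0.5):
    # period_length: predicted period (in frames) for each frame
    # periodicity: per-frame score saying how likely the frame is part of a repetition
    period_length = np.asarray(period_length, dtype=float)
    periodic = np.asarray(periodicity) > threshold
    per_frame = np.where(periodic, 1.0 / np.maximum(period_length, 1.0), 0.0)
    return per_frame.sum()

# e.g. 60 frames, each with a predicted period of 10 frames -> about 6 repetitions
print(count_repetitions([10] * 60, [0.9] * 60))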

An advantage of this representation is that the model's interpretability is baked into the network architecture: the network is forced to predict the period from the self-similarity matrix alone, rather than inferring it from high-dimensional latent features (such as the frames themselves).

Temporal Self Similarity: [Source]

Note that this architecture also lets the model account for speed changes in the repetition, as well as distortions of the repeating sequence (for example, a video that rotates while showing a repeated task). That matters because it demonstrates a model that generalises, and models that generalise can be applied to a much wider array of problems: a great leap forward in ML.

You can use the following resource for more information, including a downloadable pre-trained RepNet: source


The ability to capture repeating tasks matters. Current systems still struggle with relatively well-posed questions like "how many push-ups am I doing?", and we want to build on this to infer the more complicated repeated actions that we see.

Given that foundation, it's only a matter of time until we can characterise more complicated dynamics in video, and it's exciting to see the steps researchers are making daily. Google alone is publishing papers at such a rate that the pace of development, and of knowledge progress, is remarkable.

This article highlights one piece of the AI puzzle. More pieces are yet to come!


Thanks for reading! If you have any messages, please let me know!

Keep up to date with my latest articles here!

The Sampling Distribution of Pearson’s Correlation

Pearson’s Correlation reflects the dispersion of a linear relationship (top row), but not the angle of that relationship (middle), nor many aspects of nonlinear relationships (bottom). [Source]

How a Data Scientist can get the most out of this statistic

People are quite familiar with the colloquial usage of the term 'correlation': it describes a phenomenon where 'things' move together. If a pair of variables is said to be correlated, then when one variable goes up, it's quite likely the other will go up as well.

Pearson's Correlation is the simplest form of the mathematical definition: it uses the covariance between two variables to discern a statistical, albeit linear, relationship. It takes the dot product between the two (mean-centred) vectors of data and normalises the summation, giving a statistic bounded between -1 and +1. A positive (negative) correlation indicates that the variables move in the same (opposite) direction, with +1 (or -1) at the extreme indicating that the variables move in perfect harmony. For reference:

Pearson's Correlation: ⍴(x, y) = σ_xy / (σ_x σ_y), i.e. the covariance between x and y over the standard deviation of x multiplied by the standard deviation of y.

Where x and y are the two variables, ⍴ is the correlation statistic, σ_xy is the covariance, and σ_x and σ_y are the standard deviations.

It's also interesting to note that the OLS beta and Pearson's Correlation are intrinsically linked. Beta is mathematically defined as the covariance between the two variables over the variance of the regressor: it attempts to uncover the linear relationship between them.

OLS Beta is intrinsically linked to Pearson’s Correlation

The only difference between the two metrics is a scaling factor based on the standard deviation of each variable: multiplying beta by (σ_x / σ_y) normalises it to lie between -1 and +1, thereby giving us the correlation metric.
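A quick sketch with toy data shows the link: rescaling the OLS beta of y on x by σ_x / σ_y recovers the correlation.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000)
y = 0.5 * x + rng.normal(size=1_000)

beta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)      # OLS slope of y on x
rho = beta * (np.std(x, ddof=1) / np.std(y, ddof=1))
print(rho, np.corrcoef(x, y)[0, 1])                # the two values should match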

Let’s now move onto the Sampling Distribution of Pearson’s Correlation

Expectation of Pearson’s Correlation

Now, we know that a sample variance adjusted with Bessel's correction is an unbiased estimator. Pearson's correlation effectively involves three such sample (co)variance terms, so the metric itself is, to a good approximation, unbiased (strictly, a ratio of unbiased estimators is not exactly unbiased, but the bias shrinks rapidly as the sample size grows):

Expectation of Pearson’s Correlation

Standard Error of Pearson's Correlation

A problem with the correlation coefficient arises when we sample from a distribution of highly correlated variables, not to mention the changing dynamics as the number of observations changes.

These two quantities, the strength of the correlation and the sample size, are intrinsic to the calculation of the correlation coefficient and can really complicate matters at the extremes, which is why empirical approaches like permutation or bootstrap methods are used to derive a standard error.

These are both relatively straightforward and can be referenced elsewhere (note here and here). Let's quickly go through a bootstrap method.

Say we have two time series of length 1,000 (x and y). We take a random sample of n points from x, with replacement, together with the corresponding points from y, giving us x* and y*. From these we calculate a Pearson's correlation: one bootstrapped data point. We re-run this, say, 10,000 times and end up with a vector of 10,000 bootstrapped correlations. The standard deviation of those 10,000 values is the standard error of our metric.

As an example, I take two random normal variables of length 10,000, subsample 100 data points, calculate their correlation, and repeat this 1,000 times. The standard deviation of the resulting values gives an empirically derived standard error of 0.1 (code below):

Now this empirical derivation is definitely practical when you're unsure of the underlying distribution of the pair of variables. Given that we're sampling from bivariate normal and independent data, we can also approximate the standard error using Fisher's transform as follows:

Standard Error of Fisher's Transform: SE(z) ≈ 1 / √(n − 3)

which in the above case is approximately 0.10. A great reference on Fisher's transform is here; it deserves an article in itself, so I will not go into detail.

Moreover, if we want to retain the functional form of the correlation coefficient itself, we can also use the asymptotic standard error of the correlation coefficient:

Standard Error of Pearson’s Correlation

which is approximately √((1 − ⍴²)/n); with ⍴ = 0 this reduces to √(1/n) = 1/√n = 0.1. Again, I will not go into the (lengthy) derivation here, but another great reference is here.
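A quick sketch of both analytic approximations for the setup above (n = 100, ⍴ ≈ 0) shows they land right next to the bootstrapped 0.1:

import numpy as np

n, rho = 100, 0.0
se_fisher = 1 / np.sqrt(n - 3)           # standard error of the Fisher-transformed r
se_pearson = np.sqrt((1 - rho**2) / n)   # asymptotic standard error of r itself
print(se_fisher, se_pearson)             # ~0.102 and 0.100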

The important thing to note is that these three results converge because the series we're testing have an underlying normal distribution. If this weren't the case, the standard error approximated under normality and the standard error from Fisher's transform would begin to diverge from the empirical one.

Note: I implore the reader to focus on empirical methods when estimating standard errors for correlations. Distributions can shift under the surface, so it's far more reliable to use empirical methods like permutation or bootstrapping.


The correlation coefficient is an incredibly powerful tool for discerning relationships between variables. It presents information in a useful manner although, as discussed, its distributional properties are quite sensitive to the underlying data. If you can write a function that samples quickly, the empirical methods will really set you in the right direction.


Thanks again! Please message me if you need anything 🙂


Code

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Two independent standard normal variables of length 10,000
all_data = pd.DataFrame(np.random.randn(10000, 2))
n = 100          # size of each bootstrap sample
nsamples = 1000  # number of bootstrap replications
# Resample with replacement and record the correlation each time
bootstraps = pd.DataFrame([all_data.sample(n, replace=True).corr()[0][1] for m in range(nsamples)])
sns.distplot(bootstraps)
plt.title('Standard Error is [{:.3f}]'.format(bootstraps[0].std()))
plt.grid()

Plotting with Seaborn in Python

Figure 0: Pair Plot using Seaborn — [more information]

4 Reasons Why and 3 Examples How

import seaborn as sns

Finding a pattern can sometimes be the easy bit when researching so let’s be honest: conveying a pattern to the team or your customers is sometimes a lot more difficult than it should be. Not only are some patterns hard to explain in layman terms (try explaining a principal component to non-mathematicians) but sometimes, you’re trying to signify the dependency on a conditional joint distribution…say what?

Charting is imperative to our job as researchers, so we need to convey our story well. Without good visuals our knowledge and findings carry much less weight; with them, we can be sure our message lands.

In the following article, I’ll discuss Seaborn and why I prefer it to other libraries. I’ll also give my top 3 charts that I use daily.

Why Seaborn

Popular Python charting libraries are surprisingly few and far between, because it's hard to build a one-size-fits-all package: think of Matplotlib, designed to mirror Matlab's output, or ggplot, ported over from its R namesake.

As to reasons why I prefer Seaborn against other top libraries:

  1. Seaborn requires a lot less code than Matplotlib to make similar high-quality output
  2. Chartify's visuals aren't that great (sorry Spotify — it's just a bit too blocky).
  3. ggplot doesn’t seem to be native to Python so it feels like I’m always stretching to make it work for me.
  4. Plotly has a 'community edition', which leaves me uncomfortable about licensing, so I generally stay away from anything involving legal sign-offs. Design-wise and functionally it's actually pretty good, with a broad set of offerings, but for the added headache it's not much (if at all) better than Seaborn.

Most importantly, a researcher spends a lot of their time plotting distributions, and if you can't plot distributions easily, your plotting package is essentially redundant. Seaborn overlays histograms and KDEs perfectly, something other packages really struggle to do (Plotly is the exception here).

Finally, Seaborn has the whole design side of things covered which leaves you, the researcher, with more time to research. Matplotlib sucks for visuals and Chartify is too blocky for my liking.

I’m going to keep my conclusion short and sweet: Seaborn is awesome. There’s no hiding that I use it a lot more than other libraries and recommend you to do the same. Let’s now move onto some charts that I use daily.


Univariate Distribution

If you've found a random variable whose distribution makes for an interesting story, then Seaborn's distplot function works great. It helps to convey the picture by showing:

  1. The underlying empirical distribution in the form of a histogram
  2. A Kernel that’s been approximated over the top to give a smoothed picture

The colours (a nice translucent unoffensive blue) with the grid lines and clear fonts make for a simple and effective offering!

Figure 1: Univariate Distribution of Random Numbers

Joint Distribution

Here we try to convey a bit more of a complicated dynamic. We have two variables that we feel should be related but how can we visualise this relationship?

The two distributions plotted on the sides of the chart are great for seeing how the marginal distributions look, while the density contours in the centre are perfect for identifying where concentrations of density arise.

Figure 2: Joint Distribution between two Random Variables

I use this plot in both my research and in my decks, as it keeps the univariate dynamics (via the kernel plots) and the joint dynamics at the forefront of my thoughts and my audience's: all while conveying the story I'm trying to paint. It's been super useful for layering discussion and I'd highly recommend it.


Box and Whisker Plots

The problem with distributional plots is that they can often be skewed by outliers, which really distorts a story unless you know those outliers exist and deal with them in advance.

Box plots are used so widely because they're an effective way to display robust metrics like the median and the interquartile range, which are much more resilient to outliers (due to their high breakdown point).

Seaborn's implementation of the box plot looks fantastic: it conveys a fairly complicated story by highlighting a number of dimensions while looking good enough for an academic journal. Moreover, Seaborn does a great job of keeping the code concise, so the researcher doesn't have to spend time tweaking the plot to make it readable.

Figure 4: Box and Whisker Plot

Being able to discern and discuss a multitude of features and patterns at the same time is imperative to the success of your research, so I highly recommend using this chart. At the same time, you need to make sure you target the chart for your audience: at times you don’t want to go into too much detail!


In the above article, I broadly discussed why, for me, Seaborn is the best plotting package, and gave my top 3 examples of charts that I use. I'm a strong believer in conveying a message in an easy and understandable manner: the fewer words the better! Cogency is key!

These charts make it so much easier for you to do that and so if you’re a visual thinker, a storyteller, or if you love to see the big picture, then Seaborn is for you.


Thanks again, message me if you have any questions!


Code

The following pieces of code are the simple snippets to recreate the awesome charts above!

Figure 0: Pair Plotting

import seaborn as sns
df = sns.load_dataset("iris")      # built-in iris dataset
sns.pairplot(df, hue="species")

Figure 1: Univariate Distribution

import numpy as np
import seaborn as sns
x = np.random.normal(size=100)
sns.distplot(x)

Figure 2: Joint Distribution

import numpy as np
import pandas as pd
import seaborn as sns
mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)
df = pd.DataFrame(data, columns=["x", "y"])
sns.jointplot(x="x", y="y", data=df, kind="kde")

Figure 3: Lots of Joint Distributions

import seaborn as sns
iris = sns.load_dataset('iris')
g = sns.PairGrid(iris)
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.kdeplot, n_levels=6)

Figure 4: Box and Whisker Plot

import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks")
# Initialize the figure with a logarithmic x axis
f, ax = plt.subplots(figsize=(7, 6))
ax.set_xscale("log")
# Load the example planets dataset
planets = sns.load_dataset("planets")
# Plot the orbital distance with horizontal boxes (whiskers span the full range)
sns.boxplot(x="distance", y="method", data=planets, whis=[0, 100], palette="vlag")
# Add in points to show each observation
sns.swarmplot(x="distance", y="method", data=planets, size=2, color=".3", linewidth=0)
# Tweak the visual presentation
ax.xaxis.grid(True)
ax.set(ylabel="")
sns.despine(trim=True, left=True)

The Student t-Distribution

Probability Density Function for the Student t-Distribution.

For the Sake of Statistics, forget the Normal Distribution.

To be clear: this is targeted at Data Scientists/Machine Learning Researchers and not at Physicists.

Statistical normality is overused. It's not as common as assumed and only really arises in impractical 'limits' [[2][3][4]]. To invoke normality you need a substantial, well-behaved, independent dataset (the CLT), but most research projects face the opposite: small sample sizes and dependent data. We tend to have messy data that we fudge to look normal, when in fact those 'anomalies' in the extremes are telling us something is up.

Lots of things are ‘approximately’ normal. That’s where the danger is.

The point of this article is not to talk about kurtosis, but to discuss why phenomena in society do not follow the normal distribution, why extreme events are more likely than we expect, and why relaxing certain constraints makes you realise that the Student t-distribution is more prevalent than thought. Most importantly, why do researchers assume normality when the data just isn't normal?

Unlikely events are more likely than we expect, and not because of kurtosis, but because we are modelling the data wrongly to begin with. Say we forecast the weather from a 100-year record: is 100 years of data enough to assume normality? No. The world has been around for millions of years, and we underestimate the tails simply because we are not considering the limits of our data. One hundred years is just a sample of the entire history, most of which we will never see.

Moreover, Limpert and Stahl in 2011 (which we discuss later: [4]) describe this exact phenomenon: how often a symmetric normal distribution is assumed, tests are run and conclusions drawn, yet researchers have clearly misinterpreted results because of their trust in symmetric normality.

As a statistician, it's important to know what you don't know, and to know the distribution you're basing your inferences on. It's important to remember that we primarily use a normal distribution for inference because it offers tidy closed-form solutions (think OLS regression, Gaussian Processes, etc.), but in reality, the very difficulty of working with harder distributions is why, at times, they make better predictions.

Firstly, let’s talk about the Mathematics of t-Distributions.

Please skip the maths to jump to more discourse


Where did the Student t-distribution come from?

At the Guinness Brewery in Dublin, Ireland, William Sealy Gosset published a paper in 1908 under the pseudonym 'Student', detailing his statistical work on the "frequency distribution of standard deviations of samples drawn from a normal population". There is some debate about who actually came up with the t-distribution first, with related work by Helmert and Lüroth coming a few decades earlier, so let's focus on the maths.

Assume we have a random variable drawn from a normal distribution with mean μ and variance σ². If we compute a sample mean (call it x̄) from n observations, then the standardised variable z = (x̄ − μ) / (σ/√n) is also normally distributed, but now with a mean of 0 and a variance of 1: we have normalised, or standardised, it.

However, imagine we have a relatively small sample (say n < 30) drawn from a greater population. Our estimate of the mean is computed as before, but we no longer know σ and must use the sample standard deviation s, whose calculation has a denominator of n − 1 (Bessel's correction).

Because of this, our attempt to standardise the variable no longer yields a standard normal; it yields a variable with a different distribution, namely a Student t-distribution of the form:

t = (x̄ − μ) / (s/√n), where x̄ is the sample mean, μ is the population mean, s is the sample standard deviation, and n is the number of samples.

This is significant because it tells us that even when the underlying data is normally distributed, the dynamics of sampling from it in small samples change completely, a change largely characterised by Bessel's correction.

I find it amazing that a small nuance in the variance formula has such far-reaching consequences for the distributional properties of the statistic.
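To see how much the small-sample dynamics change, here is a minimal simulation sketch: t-statistics built from tiny normal samples (n = 5) exceed the 'normal' 1.96 cut-off far more often than 5% of the time.

import numpy as np

rng = np.random.default_rng(0)
n, reps = 5, 100_000
samples = rng.normal(size=(reps, n))
t_stats = samples.mean(axis=1) / (samples.std(axis=1, ddof=1) / np.sqrt(n))
print((np.abs(t_stats) > 1.96).mean())                 # noticeably more than 0.05
print((np.abs(rng.normal(size=reps)) > 1.96).mean())   # ~0.05 for a true normal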


What are Degrees of Freedom?

Degrees of freedom are a combination of how much data you have and how many parameters you need to estimate. They reflect how much independent information goes into a parameter estimate. In this light, you want a lot of information going into your estimates, to obtain more precise estimates and more powerful hypothesis tests (think of the sample variance in relation to the number of samples). So to make better estimates you want many degrees of freedom; usually, the degrees of freedom track your sample size (more precisely, the n − 1 of Bessel's correction).

Varying the number of degrees of freedom of a Student t-distribution

Degrees of Freedom are important to Student t-Distributions as they characterise the shape of the curve. The more degrees of freedom you have, the more your curve looks bell shaped and converges to a standard normal distribution.


Proof of Convergence to Normal Distribution

The probability density function for the t-distribution is complex but here it is:

f(t) = Γ((ν + 1)/2) / (√(νπ) Γ(ν/2)) · (1 + t²/ν)^(−(ν + 1)/2), where ν is the number of degrees of freedom and Γ is the gamma function.

Its properties can be quite fascinating: the Student t-distribution with ν = 1 is exactly the Cauchy distribution, and at the other end of the spectrum the t-distribution approaches a normal distribution as ν grows (ν > 30 is the commonly quoted threshold). The proof is as follows. If Xn is a t-distributed variable, it can be rearranged and written as follows:

Xn = μ + σ · Y / √(χ²n / n)

where Y is a standard normal variable and χ²n is a chi-squared random variable with n degrees of freedom, independent of Y. Separately, we know that χ²n can be written as a sum of squares of n independent standard normal variables Z1, …, Zn:

and when n tends to infinity, the ratio χ²n / n = (1/n) Σ Zi² converges in probability to E[Zi²] = 1 by the law of large numbers.

Moreover, as a consequence of Slutsky's theorem, Xn converges in distribution to X = μ + σY, which is thus normal. Further supplementary material can be found here.
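A small simulation sketch of exactly this construction shows the convergence: draws of Y / √(χ²n/n) have a spread that shrinks towards 1 (the standard normal's) as n grows.

import numpy as np

rng = np.random.default_rng(0)
for n in (5, 10, 30, 300):
    y = rng.normal(size=200_000)
    chi2 = rng.chisquare(df=n, size=200_000)
    t = y / np.sqrt(chi2 / n)
    # the variance of a t-distribution is n / (n - 2), which tends to 1
    print(n, t.std())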


Examples of failures of the Normal Distribution

I explained in the introduction why the normal distribution often isn't relevant, and I've shown how the Student t-distribution is intrinsically related to both the Cauchy and the normal distributions. Now let's look at examples where the normal distribution (and its dynamics) was assumed to be fundamental but where, in reality, skewed results call any reliance on normal assumptions into question.

The article here discusses the case of being '95% confident'. Assuming a normal distribution, 95% of your results should lie within roughly 2σ of the mean (symmetrically), so anything outside that range is anomalous. Limpert and Stahl (2011) show that skewness in real data breaks the symmetry postulated by authors across many fields, and that the resulting miscalculation of likelihoods is an issue plaguing several of them.

Fields where this has caused problems are as wide-ranging as you can imagine:

Potvin and Roff (1993) argue that non-normality is prevalent in ecological data and look at alternative non-parametric statistical methods, while Micceri (1989) compares the prevalence of the normal distribution in psychometric measures to that of the unicorn and other improbable creatures.

These are serious accusations, and with a fine-tooth comb it becomes pretty clear that all that seems normal is not.


In short, I've explained why statistical normality is overused and why it has caused so many failed out-of-sample experiments. We assume too much and only reluctantly let the data speak. At other times we let the data speak too much and ignore its limitations.

We need to think more about the practical limitations of our data, but also the fundamental limitations of any distribution we assume. By making a realistic assumption on the full shape of data, we would see that making more conservative estimates would tend to perform significantly better out of sample.


Thanks for reading! Please message me if you have any more questions!


References

  1. Helmert FR (1875). “Über die Berechnung des wahrscheinlichen Fehlers aus einer endlichen Anzahl wahrer Beobachtungsfehler”.
  2. Potvin, C. & Roff, D.A. (1993). Distribution-free and robust statistical methods: Viable alternatives to parametric statistics? Ecology 74 (6), 1617–1628. Read
  3. Micceri, T. (1989) The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin 105 (1), 156–166. Read [free pdf]
  4. Limpert, Stahl (2011): Problems with Using the Normal Distribution — and Ways to Improve Quality and Efficiency of Data Analysis

Robust Statistical Methods

Anomalies hidden in plain sight. Chart from Liu and Nielsen (2016)

Methods that Data Scientists Should Love

A robust statistic is a type of estimator used when the distribution of the data set is not certain, or when egregious anomalies exist. If we’re confident on the distributional properties of our data set, then traditional statistics like the Sample Mean are well positioned. However, if our data has some underlying bias or oddity, is our Sample Mean still the right estimator to use?

Let’s imagine a situation where the data isn’t so friendly.

Let's take an example involving the sample mean estimator. The sample mean gives every data point a weight of 1/N, which means that if a single data point is infinite, the sample mean will also go to infinity, as that point contributes ∞/N = ∞ to the sum.

This is at odds with our sample median, which is barely affected by any single value being ±∞, because the sample median does not place weight on every data point. In fact, we say the sample median is resistant to gross errors whereas the sample mean is not.

A gross error is a data point that is misleading (usually 3σ or more)

In fact, the median will tolerate up to 50% gross errors before it can be made arbitrarily large; we say its breakdown point is 50% whereas that for the sample mean is 0%.

The breakdown point of an estimator is the proportion of gross errors an estimator can withstand before giving an abnormal result.

Robust statistics are often favoured over traditional sample estimators due to their higher breakdown point. It's not unusual for data to contain anomalies when recording involves manual effort; ordinarily the mean and the median should be quite close, so if you suspect your underlying data contains gross errors, it's worthwhile using a robust statistic.
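A tiny sketch of the breakdown idea: corrupt a single point and watch the mean move while the median barely notices.

import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(size=1_000)
dirty = clean.copy()
dirty[0] = 1e6                              # a single gross error
print(clean.mean(), dirty.mean())           # the mean jumps by roughly 1e6 / 1000
print(np.median(clean), np.median(dirty))   # the median is essentially unchanged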

Let’s first look at what outliers mean in terms of relative efficiency.


Relative Efficiency

Relative efficiency is the comparison between the variances of sample estimators. We previously saw that if data is well behaved, the variance of a sample estimator goes to 0 as n goes to ∞, and that for normally distributed data the sample median has a higher variance (lower efficiency) than the sample mean. But what if the data is not normally distributed?

If we have Student t-distributed data with 5 degrees of freedom, the sample median becomes far more competitive: its Asymptotic Relative Efficiency (ARE) against the sample mean rises to around 96%, and with even heavier tails it overtakes the mean as an estimator of the population centre.

Let's take an example involving stock returns: stock returns are roughly Student t-distributed with about 5–7 degrees of freedom, so given the above discussion, the median is a rather good metric here.

For fat-tailed financial data, the sample median is nearly as efficient as the sample mean, and far more robust
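As a rough check of that claim, here is a simulation sketch comparing the spread of the two estimators when the data is Student t-distributed with 5 degrees of freedom:

import numpy as np

rng = np.random.default_rng(0)
draws = rng.standard_t(df=5, size=(20_000, 100))   # 20,000 samples of size 100
print(draws.mean(axis=1).std())        # spread of the sample mean
print(np.median(draws, axis=1).std())  # spread of the sample median: very close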

If you can smell something fishy in your data, I recommend using methods with higher degrees of efficiency and higher breakdown points. Let’s look at robust regression methods.


M-Estimators in Robust Regression

OLS Regression applies a certain amount of weight to every datapoint:

Closed form derivation of OLS regression coefficient [source]

Say X ~ N(0,1) and Y ~ N(0,1). If X₁ = 1, its contribution to beta would be (X₁·Y₁)/(X₁·X₁) = Y₁. As Y₁ is also standard normal, we'd expect the beta to land somewhere between -1 and +1 (both series have the same variance, so the regression coefficient is equivalent to the correlation).

However, say Y₁ was accidentally stored as 10,000 (you can blame the intern): that single point's contribution to the beta estimate jumps from roughly 1 to 10,000! That's clearly not desired.

Regressions are thus very sensitive to anomalous data points (the squared loss means an outlier's influence grows quadratically with its size), and given the above discussion we'd prefer an estimator with a higher breakdown point and a higher degree of efficiency. This ensures our estimator doesn't get thrown around by rogue data points, so if a potential lack of normality in the data is worrying, the researcher should use robust estimation methods:

M-estimators are variants of Maximum Likelihood Estimation (MLE) methods. MLE methods attempt to maximise the joint likelihood, whereas M-estimators try to minimise a function ⍴ as follows:

Solving the problem of M-Estimators [source]

The astute reader will quickly see that Linear Regression is actually a type of M-Estimator (minimise the sum of squared residuals) but it’s not fully robust. Below we have 4 other types of M estimators and more can be found here:

Different choices of functions for your M-Estimator [source]

As an example, Least Absolute Deviation (LAD) estimates the coefficients that minimise the sum of absolute residuals rather than the sum of squared errors. LAD therefore has the advantage of being resistant to outliers and to departures from the normality assumption, despite being computationally more expensive.
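Here is a minimal sketch (my own toy example) comparing the two on data with a single corrupted observation: the closed-form OLS slope gets dragged around, while a numerically fitted LAD slope stays put.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 * x + rng.normal(size=200)
y[0] = 10_000                                    # the intern strikes again
beta_ols = (x @ y) / (x @ x)                     # closed-form OLS slope (no intercept)
lad_loss = lambda b: np.abs(y - b[0] * x).sum()  # sum of absolute residuals
beta_lad = minimize(lad_loss, x0=[0.0], method="Nelder-Mead").x[0]
print(beta_ols)   # dragged far away from 1 by the outlier
print(beta_lad)   # stays close to 1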

As a practitioner, I would encourage researchers to try multiple methods because there's no hard and fast rule. It's much more convincing to show several estimators giving similar results than a sporadic and unexplainable set of results.

As a final point, remember that M-estimators are only asymptotically normal, so even with large samples the approximation can still be very poor. It all depends on the type and size of the anomaly!


In the above article, we broadly discussed the field of robust statistics and why a practitioner should approach it with caution. Normal data may exist, but at the limit, kurtosis plagues reality. Experiments on fatter-tailed (Student t-distributed) data highlight that the sample median holds up much better than the sample mean, though I generally like to put both side by side to spot any noticeable differences. Further, robust regression methods offer a higher breakdown point and more realistic estimates, but are slower to compute.

Robust Statistics are a bit of an art because sometimes you need them and sometimes you don’t. Ultimately every data point is important so leaving some out (or down weighting certain ones) is rarely desirable. Given that limitation, I always encourage researchers to use multiple statistics in the same experiment so that you can compare results and get a better feel for relationships because after all, one ‘good’ result may just be lucky.


Thanks for reading! If you have any questions please message — always happy to help!


References

  1. Huber, Peter J. (1981), Robust statistics
  2. Little, T. The Oxford Handbook of Quantitative Methods in Psychology. Retrieved October 14, 2019
  3. Liu, X., & Nielsen, P.S. (2016). Regression-based Online Anomaly Detection for Smart Grid Data. ArXiv, abs/1606.05781.

The Sampling Distribution of OLS Estimators

OLS Regression on sample data [source]

Details, details: it’s all about the details!

Ordinary Least Squares (OLS) is usually the first method every student learns as they embark on a journey of statistical euphoria. It is a method that, quite simply, finds the line of best fit through a dataset. The assumptions behind the model, along with the derivations, are widely covered online, but what isn't actively covered is the sampling distribution of the estimator itself.

The sampling distribution is important because it tells the researcher how accurate the estimator is for a given sample size and, moreover, lets us determine how the estimator behaves as the number of data points increases.

To determine the behaviour of the sampling distribution, let’s first derive the expectation of the estimator itself.


Expectation of OLS Estimator

Remember that the OLS Coefficient is traditionally calculated as follows:

Closed form derivation of OLS regression coefficient [source]

where y = Xβ + e. Substituting this expression for y into the formula above and expanding gives:

The expectation of the beta estimate is β itself, so the estimator is unbiased [source]

Again, the estimate of beta has a closed-form solution: replacing y with Xβ + e gives the first line, β̂ = (X'X)⁻¹X'(Xβ + e) = β + (X'X)⁻¹X'e. Working this through and remembering that E[e] = 0, we find that our OLS estimator is unbiased: E[β̂] = β.


Variance of your OLS Estimator

Now that we have an understanding of the expectation of our estimator, let’s look at the variance of our estimator.

The variance of the beta estimator goes to 0 as n goes to infinity. [source]

To get to the first line you have to remember that your sample estimator (beta hat) can be expanded and simplified as follows:

where e ~ N(0, σ²I). From this, E[ee'] = σ²I, a constant that can be pulled out of the expression; the remaining X terms multiply out and cancel to leave Var(β̂) = σ²(X'X)⁻¹.

Ultimately this leaves σ²(X'X)⁻¹, which goes to 0 asymptotically: as n increases, σ² stays the same but X'X grows without bound (on the order of n), so the variance of your OLS estimator shrinks towards 0.


Sampling Distribution

Now that we’ve characterised the mean and the variance of our sample estimator, we’re two-thirds of the way on determining the distribution of our OLS coefficient.

Remember that, as part of the fundamental OLS assumptions, the errors in our regression equation have mean zero, constant variance, and are normally distributed: e ~ N(0, σ²). The OLS coefficient is simply a linear combination of these 'disturbances' and is therefore driven by these normal disturbances. Therefore:

Distribution of the OLS Coefficient: [source]

And there we have it! (a) I've derived the expectation of the OLS estimator and shown that it is unbiased. (b) I've derived the variance of the sample estimator and shown that it shrinks to 0 asymptotically. And (c), we used the distribution of the error term to infer the sampling distribution of our estimator. (Note that for sample sizes greater than around 30, the sampling distribution would be approximately normal anyway because of the Central Limit Theorem.)
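As a sanity-check sketch, a small Monte Carlo confirms both results: the average of the estimated slopes sits on the true value, and their spread shrinks like 1/√n.

import numpy as np

rng = np.random.default_rng(0)
beta_true = 2.0
for n in (50, 500, 5_000):
    betas = []
    for _ in range(2_000):
        x = rng.normal(size=n)
        e = rng.normal(size=n)           # disturbances with mean zero
        y = beta_true * x + e
        betas.append((x @ y) / (x @ x))  # closed-form OLS slope (no intercept)
    betas = np.array(betas)
    print(n, betas.mean(), betas.std())  # mean ~2.0, std ~1/sqrt(n)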

On the whole, I hope that the reader has a much deeper awareness and understanding of their beta coefficient. The information above can be used in a powerful way to make robust estimates of relationships: moreover showing the importance of increasing the number of samples to decrease the variance of your sample estimator.

Ultimately, the insights you gain from understanding fundamental details will shape the way you think when experimenting!


Thanks for reading and hope I helped! Please message me if you need any help!

Asymptotic Distributions

Euler's Infinity [source]

Infinity (and beyond…)

The study of asymptotic distributions looks to understand how the distribution of a phenomenon changes as the number of samples goes from n → ∞. Say we're trying to make a binary guess on where the stock market will close tomorrow (like a Bernoulli trial): how does the sampling distribution change if we ask 10, 20, 50 or even 1 billion experts?

The understanding of asymptotic distributions has enhanced several fields so its importance is not to be understated. Everything from Statistical Physics to the Insurance industry has benefitted from theories like the Central Limit Theorem (which we cover a bit later).

However, something that is not well covered is that the CLT assumes independent data: what if your data isn’t independent? The views of people are often not independent, so what then?

Let’s first cover how we should think about asymptotic analysis in a single function.

Asymptotic Analysis

At first glance, looking towards the limit means asking what happens to our function or process when we push a variable to its highest value: ∞.

As an example, assume we're trying to understand the limits of the function f(n) = n² + 3n. The function f(n) is said to be "asymptotically equivalent to n²" because as n → ∞, n² dominates 3n; at the extreme, the function is pulled far more strongly by the n² term than by the 3n term. We therefore say "f(n) is asymptotic to n²", often written symbolically as f(n) ~ n².

Conceptually this is quite simple, so let's make it a bit more difficult. Say we have a group of similar functions, each a draw from some distribution we're unsure of, e.g. the height of a bouncing ball. How does it behave? What's the average height of 1 million bounced balls? Let's see how the sampling distribution changes as n → ∞.


Asymptotic Distribution

An Asymptotic Distribution is known to be the limiting distribution of a sequence of distributions.

Imagine you plot a histogram of 100,000 numbers generated from a random number generator: that's probably quite close to the parent distribution which characterises the generator. This intuition sits behind the Law of Large Numbers, but it doesn't tell us much about what the distribution of an estimator converges to at infinity (it merely approximates the parent).

For that, the Central Limit Theorem comes into play. In a previous blog (here) I explain a bit of the concept. The theorem states that the (suitably scaled) sum of a series of draws converges to a normal distribution, a result that is independent of the parent distribution. So whether the parent distribution is normal, Bernoulli, chi-squared, or anything else: when enough independent draws are added together, the result is approximately normal.

To readers who are familiar with the Central Limit Theorem: remember that this theorem relies heavily on the data being IID. What if it's not, what if the data points depend on each other? Stock prices are dependent on each other: does that mean a portfolio of stocks has a normal distribution?

The answer is no.

Ledoit and Crack (2009) assume a stochastic process which is not independent:

A non-IID data-generating Gaussian process

As we can see, the functional form of Xt is the simplest example of a non-IID generating process given its autoregressive nature. The distribution of the sample mean is then derived in the paper (a very involved derivation) to show that the asymptotic distribution is close to normal, but only at the limit:

Sampling Distribution of a non-IID estimator [source]

however, for all finite values of N (and for every reasonable N you can imagine), the variance of the estimator is biased by the correlation exhibited within the parent population.

“You may then ask your students to perform a Monte-Carlo simulation of the Gaussian AR(1) process with ρ ≠ 0, so that they can demonstrate for themselves that they have statistically significantly underestimated the true standard error.”

This demonstrates that when data is dependent, the variance of our estimator is significantly wider and it becomes much more difficult to approximate the population parameter. The sampling distribution starts to look more like a Student t-distribution than a normal distribution.
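Here is a sketch of the Monte Carlo suggested in the quote above: for a Gaussian AR(1) process with ρ ≠ 0, the naive σ/√N standard error of the sample mean badly understates the spread we actually observe.

import numpy as np

rng = np.random.default_rng(0)
rho, n, reps = 0.7, 500, 2_000
means = []
for _ in range(reps):
    e = rng.normal(size=n)
    x = np.empty(n)
    x[0] = e[0]
    for t in range(1, n):
        x[t] = rho * x[t - 1] + e[t]     # AR(1) recursion
    means.append(x.mean())
print(np.std(means))          # the standard error we actually observe
print(x.std() / np.sqrt(n))   # the naive IID-style estimate: much smaller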

Given this, what should we look for in an estimator when there is a dependency structure within the data? Ideally, we want a consistent and efficient estimator:


Asymptotic Consistent Estimators

In terms of probability, an estimator is said to be asymptotically consistent when, as the number of samples increases, the resulting sequence of estimates converges in probability to the true parameter value.

An estimator is said to be consistent when as the number of samples goes to infinity, the estimated parameter value converges to the true parameter [source]

Let’s say that our ‘estimator’ is the average (or sample mean) and we want to calculate the average height of people in the world. Now we’d struggle for everyone to take part but let’s say 100 people agree to be measured.

Now, we've previously established that the variance of the sample estimate depends on N: as N increases it decreases, so the sample estimate converges to the true parameter. So if we instead take the average of 1,000 people, or 10,000 people, our estimate gets closer to the true parameter value as the variance of our sample estimate shrinks.

A really interesting thing to note is that an estimator can be biased and still consistent. For example, take an estimator of the mean with some built-in bias, say f(x) = x̄ + 1/N. As N → ∞, 1/N goes to 0 and f(x) → μ, so the estimator is consistent.

This is why in some use cases, even though your metric may not be perfect (and biased): you can actually get a pretty accurate answer with enough sample data.
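A two-line sketch of that idea: the bias of x̄ + 1/N is visible in small samples and vanishes as N grows.

import numpy as np

rng = np.random.default_rng(0)
for n in (10, 1_000, 100_000):
    x = rng.normal(loc=5.0, size=n)
    print(n, x.mean() + 1.0 / n)   # converges to the true mean of 5 despite the bias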


Asymptotic Efficiency

An estimator is said to be efficient if it is unbiased and its variance attains the Cramér–Rao lower bound (the lower bound on the variance of an unbiased estimator). A weaker condition is also useful: if an unbiased estimator has a lower variance than all other unbiased estimators but does not attain the Cramér–Rao bound, it is called the Minimum Variance Unbiased Estimator (MVUE).

Take the sample mean and the sample median and also assume the population data is IID and normally distributed (μ=0, σ²=1). We know from the central limit theorem that the sample mean has a distribution ~N(0,1/N) and the sample median is ~N(0, π/2N).

Now we can compare the variances side by side. For the sample mean you have 1/N, but for the median you have π/2N = (π/2) × (1/N) ≈ 1.57 × (1/N). So the variance of the sample median is approximately 57% greater than the variance of the sample mean.

At this point we can say that the sample mean is the more efficient estimator (indeed the MVUE here), as its variance is lower than that of the sample median. This tells us that if we're trying to estimate the average of a population, the sample mean converges more quickly to the true population parameter, and we therefore need less data before we can say "I'm 99% sure the population parameter is around here".

Example of overlapping distributions where one has a higher variance than the other but both have the same mean at 0 [source]. Samples from the thinner distribution are more likely to be close to the mean.
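A quick simulation sketch of the comparison above (normal data, N = 100 per sample) shows the two spreads landing right on 1/√N and √(π/2N):

import numpy as np

rng = np.random.default_rng(0)
draws = rng.normal(size=(20_000, 100))   # 20,000 samples of size N = 100
print(draws.mean(axis=1).std())          # ~0.100, i.e. 1/sqrt(100)
print(np.median(draws, axis=1).std())    # ~0.125, i.e. sqrt(pi / 200)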

As such, when you look towards the limit, it’s imperative to look at how the second moment of your estimator reacts as your sample size increases — as it can make life easier (or more difficult!) if you choose correctly!


In a number of ways, the above article has described the process by which the reader should think about asymptotic phenomena. First, consider what the underlying data is like and how that would affect the distributional properties of sample estimators as the number of samples grows.

Secondly, consider what you're trying to measure and which estimator is best for it. In some cases a median beats a mean (e.g. for data with outliers); in others you would go for the mean (it converges more quickly to the true population mean).

In either case, as Big Data becomes a bigger part of our lives — we need to be cognisant that the wrong estimator can bring about the wrong conclusion. This can cause havoc as the number of samples goes from 100, to 100 million. Therefore, it’s imperative to get this step right.


Message if you have any questions — always happy to help!


Code for Central Limit Theorem with Independent Data

# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Generate Population Data
df = pd.DataFrame(np.random.randn(10000,1)) #normal
#df = pd.DataFrame(np.random.randint(0,100,size=(10000,1))) #uniform
pop_mu = df.mean(axis=0)
pop_st = df.std(axis=0)
# Generate Sample Means and Standard Deviations
s_mu = [df.sample(100).mean()[0] for i in range(1000)]
# Plot Sample Means
plt.figure(figsize=(20,10))
sns.distplot(s_mu).grid()
plt.title('Sampling Distribution of the Sample Mean (1,000 sample means of N = 100)')
plt.axvline(x=np.mean(s_mu), label='Mean of Sample Means')
plt.axvline(x=np.mean(s_mu) + np.std(s_mu), label='Std of Sample means', color='r')
plt.axvline(x=np.mean(s_mu) - np.std(s_mu), label='Std of Sample means', color='r')
plt.legend()
plt.show()

The Distribution of the Sample Mean

One-sided test on a distribution that is shaped like a bell curve. [Image from Jill Mac, Source (CC0)]

All Machine Learning Researchers should know this

Most machine learning and mathematical problems involve extrapolating a subset of data to infer for a global population. As an example, we may only get 100 replies on a survey to our new website, whereas our target market is 10 million customers. It’s infeasible to ask all 10 million customers what they think, so we have to use the feedback from the 100 to infer.

Probability distributions tell us the likelihood of different things happening. Conceptually, they can tell us "Event A is probably not going to happen, but Event B is much more likely", or "the likelihood of Z > 5 is quite low". We generally think of distributions as applying to data, but when you delve deeper into the theoretical side of statistics, you find that estimators themselves have distributions too.

As a practitioner of Machine Learning and Data Science: day in and day out we use the Sample Mean. Therefore, we need to be at one with the dynamics and limitations of our most-used tool.

With the above example, say we have a new feature we want to incorporate into our business. We can't ask all 10 million customers what they think before we integrate it, so instead we find a small group (our sample) of customers and calculate the mean result for that group. But would the results change if we measured another group of customers? What if we found 100 groups of 20 customers: would each group give the same result?

In this example, we are ‘sampling’ a small subset (through groups) of our population (of 10 million customers) to try approximate how the population distribution thinks. We use the sample mean to approximate the population mean, and as it’s an approximation, it has its own distribution.

Now, since a distribution is largely characterised by its mean and variance, we can go a long way towards defining the distribution of the sample mean by first deriving its mean and variance.

From this, we can achieve our goal in saying: “From our experiment, the average customer is X% likely to approve the new feature and we are confident on that percentage within +/- Y%”
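As a concrete preview with made-up numbers (the 100 replies and 64 approvals below are my own illustrative figures), the normal approximation to a sample proportion gives exactly that kind of statement:

import numpy as np

n, approvals = 100, 64
p_hat = approvals / n
se = np.sqrt(p_hat * (1 - p_hat) / n)     # standard error of a sample proportion
print(f"{p_hat:.0%} approve, +/- {1.96 * se:.0%} at roughly 95% confidence")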

Let’s begin.

Mean of the Sample Mean

To calculate the expectation of any statistic, you can simply wrap an expectation around the functional form of the statistic:

Derivation proving that our mean statistic is unbiased: more work can be found in Reference

The derivation continues by calculating the expectation of the formula. We take out the constant (1/n) and are left with a sum of expectations of the variables X (which are all independent). These are all the same (μ), so we have E[x̄] = (1/n)(nμ) = μ, and thus we are left with the parameter we set out to estimate.

This result is huge because it proves that the sample mean we derive is directly approximating the population mean. So for example, if we have a group of 100 customers from a population of a 10 million independent customers (samples), then the mean of this small sample is a very good estimate of the population mean (within a certain range). We can now make more powerful predictions with significantly less data.

Now that we’ve derived the first moment of our distribution, let’s move onto the second moment.

Variance of the Sample Mean

The proof for the variance for the sample mean is equally as simple.

Derivation of Variance of Sample Mean: Reference.

Again we see that the variance operator applied to the sample mean pulls out a factor of 1/n². We are left with our X variables which, because they are independent, have variances that simply add together. From here, it's a simple substitution and rearrangement to arrive at Var(x̄) = σ²/n.

The end result tells us that the variance of the sample mean equals the variance of the underlying data divided by the number of data points in each sample. This is another huge result, as it tells us exactly how much the variance of our mean estimate decreases as N increases, so we can almost fully describe how the distribution of our sample mean should look. For example, with normally distributed data (variance = 1) and N = 100, the sample mean has a standard deviation of 0.1 (since 1 divided by the square root of 100 is 0.1), so it will typically lie within about ±0.1 of the true mean.

Distribution of the Sample Mean: Central Limit Theorem

The expressions derived above for the mean and variance of the sampling distribution are neither difficult to derive nor new. What really stands out is the Central Limit Theorem: regardless of the shape of the parent population (whatever its distribution), the sampling distribution of the sample mean approaches a normal distribution.

The Central Limit Theorem (CLT) proves (under certain conditions) that when independent random variables are added together, the distribution of their (scaled) sum converges towards a normal distribution at the limit, even if the original variables themselves are not normally distributed. In practice this kicks in surprisingly quickly, with N > 25 or so often quoted as a rule of thumb (proof and demonstration).

Note: The proof of the CLT is not a short proof so I’ve left it out from this article.

The mean and variance derived above characterise the shape of the distribution, and now that we know the asymptotic distribution, we can infer even more with even less data. The characteristics of the normal distribution are extremely well covered, and we can use that knowledge to better understand the dynamics of our sample mean estimate.

Bootstrapping

Now, given that we have proven that our sample mean statistic has a mean of μ and a variance of σ²/n, let us show in practice what happens when we repeatedly calculate the sample mean, and whether the result looks like a normal distribution or not.

Bootstrap methods use Monte Carlo simulation to approximate a distribution. In the example below (and as per the code at the end), we generate a population of 10,000 values from a normal random number generator. From this population we repeatedly draw samples of 100 numbers, calculate and record each sample mean, and repeat (100 times here).

Bar Chart of 100 Sample Means (where N = 100). Its shape is similar to a bell curve. Code at end.

It's great to see that the mean of the sample means is quite close to the mean of the underlying normal distribution (0), as expected given that the expectation of a sample mean approximates the population mean, which we know to be 0. Moreover, the standard deviation of the sample means is 0.1, which is also correct since the standard deviation = √(σ²/N) = 1/√100 = 1/10 = 0.1. Further, the shape of the distribution looks like a bell curve, and if we increase the number of sample means we draw (from 100 to, say, 10,000) the distribution looks even better:

Bar Chart of 10,000 Sample Means (where N = 100). Its shape is almost identical now to a bell curve. Code at end.

As another example, if we change the underlying population data to be uniformly distributed (say, between 0 and 100) and calculate the sample mean 10,000 times, where each sample mean is computed from 100 points, we again converge to a normal distribution, just as the Central Limit Theorem says we should:

Bar Chart of 10,000 Sample Means (where N = 100) and underlying data is uniformly distributed. Its shape is again almost identical now to a bell curve. Code at end.

Conclusion

In the above article, I derived the mean and variance of the sample mean, I then used bootstrap techniques to highlight the central limit theorem: all of which we can use to aid our understanding of the sample mean, which in turn helps us to approximate the dynamics of the underlying population.

Now as a reader, I hope you understand that the distributional properties of such a simple statistic allow for very powerful inferences. Knowing the limitations of your estimates becomes more important in times of stress. From a business perspective, there's no point building new features or changing strategy if you know your goal lies outside the limits your current business operates within. As such, you can use these methods to guide your inference and help you make sensible decisions as you go.


Reference Code

# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Generate Population Data
df = pd.DataFrame(np.random.randn(10000,1)) #normal
#df = pd.DataFrame(np.random.randint(0,100,size=(10000,1))) #uniform
pop_mu = df.mean(axis=0)
pop_st = df.std(axis=0)
# Generate Sample Means and Standard Deviations
s_mu = [df.sample(100).mean()[0] for i in range(1000)]
# Plot Sample Means
plt.figure(figsize=(20,10))
sns.distplot(s_mu).grid()
plt.title('Sampling Distribution of the Sample Mean (1,000 sample means of N = 100)')
plt.axvline(x=np.mean(s_mu), label='Mean of Sample Means')
plt.axvline(x=np.mean(s_mu) + np.std(s_mu), label='Std of Sample means', color='r')
plt.axvline(x=np.mean(s_mu) - np.std(s_mu), label='Std of Sample means', color='r')
plt.legend()
plt.show()

Parts-based learning by Non-Negative Matrix Factorisation

Visualising the principal components of portrait facial images. ‘Eigenfaces’ are the decomposed images in the direction of largest variance.

Why we can’t relate to eigenfaces

Traditional methods like Principal Component Analysis (PCA) decompose a dataset into some form of latent representation, e.g. eigenvectors, which can be meaningless when visualised — what actually is my first principal component? Non-Negative Matrix Factorisation (NNMF), popularised by Lee and Seung in their 1999 Nature paper, showed that data (e.g. a set of facial portraits) could instead be deconstructed into parts, extracting features like the nose, the eyes, and a smile.

The difference between PCA and NNMF arises from a constraint in their derivation: NNMF does not allow the weights of different features to be negative. PCA, on the other hand, approximates each image as a linear combination of all of the basis images. And although eigenfaces are the directions of largest variance, they are hard to interpret: what exactly is the first principal component showing me?

Eigenfaces are meaningless directions of largest variance

A linear combination of distributed components can be difficult to relate to when you try to visualise the data. NNMF is interesting because it extracts parts from the data that, when visualised, we can actually relate to.

In their seminal paper, Lee and Seung take a series of facial portraits and, by running the NNMF algorithm, decompose the images into the following features:

An image taken from “Learning the parts of objects by non-negative matrix factorisation” (Lee, Seung, 1999) [1]

We can see (moving from the top left to the bottom right) that each feature looks like a different facial part: the eyes, the jaw structure, the nose, the T-zone and the eyebrows all come out. Compare this to the figure at the top of this article and you'll see how differently the two models explain the structure of the same data.
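If you want to try this yourself before digging into the mathematics, the sketch below is one way to obtain a similar parts-based decomposition with scikit-learn. Note the caveats: it uses the Olivetti faces dataset rather than the database from the paper, scikit-learn's NMF minimises a Frobenius objective by default rather than the divergence used by Lee and Seung, and the number of components is my own choice.

from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import NMF
import matplotlib.pyplot as plt

faces = fetch_olivetti_faces().data          # (400, 4096): each row is a 64x64 face
model = NMF(n_components=49, init='nndsvd', max_iter=500)
W = model.fit_transform(faces)               # per-image encodings
H = model.components_                        # 49 non-negative basis images ('parts')

# Plot the learned parts in a 7x7 grid
fig, axes = plt.subplots(7, 7, figsize=(8, 8))
for ax, part in zip(axes.ravel(), H):
    ax.imshow(part.reshape(64, 64), cmap='gray')
    ax.axis('off')
plt.show()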

Likewise, this method can also be applied to text, where you can decompose a collection of articles into topics:

An example of topic modelling by using NNMF: link

On the right we have our data matrix M, which is decomposed (in the same manner as the image example) into the topics matrix A and the weights matrix W. (Note: reference [3] provides a list of common stop words to remove.)
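As a rough sketch of how this might look in practice (using scikit-learn, with the 20 newsgroups corpus standing in for the articles in the linked example and the number of topics picked arbitrarily):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = fetch_20newsgroups(remove=('headers', 'footers', 'quotes')).data
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
M = tfidf.fit_transform(docs)                # data matrix: documents x terms

nmf = NMF(n_components=10, init='nndsvd')
W = nmf.fit_transform(M)                     # document-topic weights
H = nmf.components_                          # topic-term matrix

terms = tfidf.get_feature_names_out()        # requires scikit-learn >= 1.0
for i, topic in enumerate(H):
    top = topic.argsort()[-8:][::-1]         # eight highest-weighted terms per topic
    print(f'Topic {i}:', ', '.join(terms[j] for j in top))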

To appreciate how such a simple idea can give such good results, let's first go over the mathematics:

Mathematics of Non-Negative Matrix Factorisation

The goal of NNMF is to decompose an image database (a matrix V) into two smaller matrices W and H, with the added constraint that W ≥ 0 and H ≥ 0:

V is a matrix of our Image database. The r columns of W are called basis images. Each column of H is called an encoding and is in one-to-one correspondence with a face in V. Check reference [1] for more information.

Let V be our matrix of data. In the case of images, you would vectorise each image and concatenate the vectors side by side, giving an (N x M) matrix with N pixels per image and M images in total. The iterative procedure used to derive an approximation for V is as follows:

Source: Reference 1

We initialise W and H randomly and then iterate as follows:

  1. Normalise the H matrix
  2. Incorporate this normalised H matrix into the update for W
  3. Normalise the W matrix
  4. Calculate the improvement in the following objective function (if the improvement is still above a threshold, go back to step 1; otherwise stop):
Source: Reference 1

Unlike PCA, NNMF has no closed-form solution, so this iterative method (similar in spirit to gradient descent) keeps updating until the improvements become marginal. Note: code is at the end.
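For readers who prefer Python, the two update rules can also be transliterated into a few lines of numpy. This is only a sketch on random data (the matrix sizes, rank and iteration count are arbitrary), not a drop-in replacement for the Matlab code at the end.

import numpy as np

rng = np.random.default_rng(0)
V = rng.random((400, 100)) + 1e-9       # n x m data matrix, kept strictly positive
n, m = V.shape
r = 10                                  # number of basis images
W = rng.random((n, r))
H = rng.random((r, m))

for _ in range(200):
    H *= (W.T @ (V / (W @ H))) / W.sum(axis=0)[:, None]   # update and normalise H
    W *= ((V / (W @ H)) @ H.T) / H.sum(axis=1)[None, :]   # update and normalise W

print(np.abs(V - W @ H).mean())         # rough check of reconstruction quality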

In reality, the specific objective function and update rules are not the important point here. The observant reader will notice that it is the non-negativity constraint that characterises the difference between PCA and NNMF, and it is this constraint that yields an approachable latent representation, i.e. learning by parts. In NNMF, the latent features are added together (each with a non-negative weight) to recreate each data sample, and many of these weights can be 0.

PCA, on the other hand, can be cast in the same framework: it constrains the columns of W to be orthonormal and the rows of H to be orthogonal to each other. This gives a distributed representation rather than an additive one, hence the difference in the latent representations produced by the two models.
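A quick way to see this difference empirically is to compare the signs of the learned components; the data below are random and purely illustrative.

import numpy as np
from sklearn.decomposition import PCA, NMF

rng = np.random.default_rng(1)
X = rng.random((200, 50))               # non-negative data

pca = PCA(n_components=5).fit(X)
nmf = NMF(n_components=5, init='nndsvd', max_iter=500).fit(X)

print(pca.components_.min())            # negative: PCA mixes additive and subtractive directions
print(nmf.components_.min())            # >= 0: NNMF is purely additive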


The model you use makes a huge difference to its interpretability, its usefulness and the inferences you draw from it. I'm a proponent of using various methods as benchmarks, because without measuring performance against some form of null hypothesis you can't really judge whether performance is effective.

That said, each model has its own marginal benefits, so it also makes sense to try to learn something from each one. In the case above, simply adding the non-negativity constraint gives the user an entirely different result. I encourage the reader to take the code below away and see what other features they can extract!


References:

  1. Lee, D. D. & Seung, H. S. (1999) Learning the parts of objects by non-negative matrix factorization. Nature.
  2. Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977) Maximum likelihood from incomplete data via the EM algorithm.
  3. List of common ("stop") words: http://xpo6.com/list-of-englishstop-words/

Code for NNMF

The code below is in Matlab (don't ask…) but it should be easy enough to translate into any other language.

% V is the (n x m) non-negative data matrix (e.g. vectorised images)
[n,m]=size(V);
R=49;          % number of basis images (rank); 49 matches the 7x7 grid in the paper's figure - adjust as needed
stopconv=40;   % stopping criterion (can be adjusted)
niter=1000;    % maximum number of iterations (can be adjusted)

cons=zeros(m,m);
consold=cons;
inc=0;

% counter
j=0;

% initialise random W and H
W=rand(n,R);
H=rand(R,m);

for i=1:niter
    % update equation for H as per the Nature paper, with normalisation
    x1=repmat(sum(W,1)',1,m);
    H=H.*(W'*(V./(W*H)))./x1;

    % update equation for W as per the Nature paper, with normalisation
    x2=repmat(sum(H,2)',n,1);
    W=W.*((V./(W*H))*H')./x2;

    % test convergence every 10 iterations
    if(mod(i,10)==0)
        j=j+1;

        % adjust small values to avoid underflow
        H=max(H,eps);
        W=max(W,eps);

        % construct connectivity matrix
        [~,index]=max(H,[],1);     % find the largest factor for each sample
        mat1=repmat(index,m,1);    % spread index down
        mat2=repmat(index',1,m);   % spread index right
        cons=mat1==mat2;

        if(sum(sum(cons~=consold))==0)  % connectivity matrix has not changed
            inc=inc+1;                  % accumulate count
        else
            inc=0;                      % else restart count
        end

        % print iteration, stable count and number of changing elements
        fprintf('\t%d\t%d\t%d\n',i,inc,sum(sum(cons~=consold)));

        if(inc>stopconv)
            break;   % assume convergence if connectivity stops changing
        end

        consold=cons;
    end
end

Gross Failures in COVID Reporting

Photo by Deniz Fuchidzhiev on Unsplash

Reporting Inextricable Statistics is a Problem

If its use of these items is typical of the NHS at large, the range of daily demand would be between 7.5 million to 12 million, more than the 5.5 million actually supplied. [Source]


It's been quite clear from the beginning of the epidemic that statistical modelling is not the forte of the UK Government. From expected infection counts to levels of social distancing, unexpected care home fatalities to PPE requirements: the UK Government has really struggled to put forward numbers that can be used as a basis for comparison, and when it does, the numbers arrive too late. Moreover, the inextricable statistics that are presented can at times be used against the Government to stoke fear.

As a statistician, it’s imperative to me that the public should understand how to interpret this data properly.

The UK Government is not alone in inextricable reporting, though. Governments across the world have been guilty of this, and they should make the extra effort to report relative statistics. If these statistics are to be used as the basis for solving a problem, then the statistic reported must be relative to that problem. Don't tell us how many items of PPE you've sourced: tell us what percentage of demand you have met.

Statistics used for comparisons must be relative

One thing that stuck with me from being lucky enough to study under Professor Sir David MacKay was that statistics needs to be approachable to make a difference. Back-of-the-envelope statistics can really carry weight; however, you have to recognise which question your statistic is answering. In what follows, I demonstrate how simple adjustments to commonly reported figures help answer some pretty big questions.

The Case of Infection Counts

The data I use is from the most reliable source: Johns Hopkins University. More on this at the end.

Despite seeming a simple task (it's not), counting the number of infection cases is important for healthcare organisations to monitor the spread of a pandemic. However, once the counts get quite large, are they still meaningful?

Let's look at this common chart: here we have the cumulative infections per country. They all look to be increasing and flattening slightly around the 225,000 mark (France and Italy a bit lower at around 160k, with Sweden down at around 25k).

Figure 1. Infections per country: The code used to create this chart is provided at the end of the article

Looking at this chart, we could conclude that Sweden is doing the best of all of these countries, since it has the fewest cases. However, Sweden's population is almost 10x smaller than the other countries', so (assuming a similar rate of spread) we would expect its case count to be roughly 10x lower purely by virtue of population size.

Therefore, to compare how well one country is doing relative to another, we should use a relative statistic: one adjusted for population size, so that it tracks the relative risk of being infected. This is known as period prevalence:

Period prevalence is the number of individuals identified as cases during a specified period of time, divided by the total number of people in that population.

The following chart shows exactly this and the perspective entirely changes:

Figure 2. Infections per million, per country: The code used to create this chart is provided at the end of the article

By monitoring period prevalence, we can now see that Spain's infection count per million is much higher than in the other countries, and therefore the situation there is much worse. Moreover, from this perspective Sweden no longer appears to be doing the best of all the countries; that now looks to be Germany.

Reporting absolute case counts under-represents the significance of the problem in Sweden.

By transforming our absolute count statistic to a normalised measure, we can more effectively monitor relative risk and make better judgements about how one country performs in relation to another country.
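As a toy illustration of the adjustment itself (the case counts and populations below are made up purely to show the arithmetic, not real data):

import pandas as pd

cases = pd.Series({'Country A': 225_000, 'Country B': 40_000})               # cumulative cases (hypothetical)
population = pd.Series({'Country A': 66_000_000, 'Country B': 10_000_000})   # populations (hypothetical)

cases_per_million = cases / population * 1_000_000
print(cases_per_million)  # B has far fewer absolute cases, yet a higher count per million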


Following on from this: we know the epidemic is spreading widely and that governments are reacting, but how well are they reacting, and how useful have their actions been? To monitor this, we can look at how period prevalence changes between two sensible points in time.

You would hope that as a country goes into lockdown, fewer people get infected. To monitor this, we can look at changes in the infection rate to see how well governments are dealing with the spread, and how this has changed through time.

To monitor change, we need to pick a time period that is robust. We know about weekend seasonality and the front-loading of US case counts, so in the following chart I take a 10-day difference of the infections-per-million count to smooth over these features and monitor the rate of growth more effectively:

Note: the result from this chart is robust to different time gaps — the user is encouraged to experiment. Spoiler alert — the result is largely the same.

Figure 3. 10-day change in infections per million, per country: The code used to create this chart is provided at the end of the article

So, over the past 10 days, the UK has reported around 750 more cases per million population, compared to Spain, which has reported an increase of around 250 per million. This tells us that the virus is still spreading more in the UK than in Spain. In fact, the UK currently appears to be in the worst position of the major European countries, with Sweden in the second-worst position, owing largely to not going into lockdown: something you cannot tell from looking at Figure 1.
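The transformation behind Figure 3 is simply a lagged difference of the cumulative per-million series. Here is a toy sketch of the idea (the daily counts are invented; the full reproduction script is at the end of the article):

import numpy as np
import pandas as pd

dates = pd.date_range('2020-04-01', periods=40)
daily_per_million = pd.DataFrame(
    {'Country X': np.linspace(80, 20, 40),   # an outbreak that is slowing down
     'Country Y': np.full(40, 75.0)},        # an outbreak that is not
    index=dates)

ten_day_change = daily_per_million.cumsum().diff(10)
print(ten_day_change.tail(1))  # Country Y is still adding far more cases per million every 10 days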


In the above article, I show that by dividing by population and calculating differences over time, we can form a picture that gives us as much insight as other, more academically thorough statistics. Other metrics (like R0 and the others here) can often be seen as inextricable because of their complex derivations and underlying concepts. However, the public needs simple mathematics with which to quickly gauge and understand the severity of the problem.

Everyone, and I mean everyone, can learn something from looking at these simple numbers and viewing statistics more relatively.




Data

Academics at Johns Hopkins University bring together data from various reliable sources (including all major health organisations) to track the growth of coronavirus cases. The dataset I use (covid-19_confirmed_global) is updated at regular intervals during the day and can be read into a data frame with a simple pandas read_csv call (in Python). I exclude the present day because the data is still being updated intraday.


Code

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
from countryinfo import CountryInfo

# Countries I want to compare
eu_list = ['United Kingdom','France','Germany','Spain','Italy','Sweden']

# Download the Johns Hopkins data
fp = 'https://data.humdata.org/hxlproxy/api/data-preview.csv?url=https%3A%2F%2Fraw.githubusercontent.com%2FCSSEGISandData%2FCOVID-19%2Fmaster%2Fcsse_covid_19_data%2Fcsse_covid_19_time_series%2Ftime_series_covid19_confirmed_global.csv&filename=time_series_covid19_confirmed_global.csv'
df = pd.read_csv(fp).drop(['Province/State','Lat','Long'], axis=1)

# Sum across countries/regions, then take daily differences of the cumulative counts
df = df.groupby(['Country/Region']).sum().T.diff()

# Get the population of each country
pop = {c: CountryInfo(c).population() for c in eu_list}
pop = pd.DataFrame.from_dict(pop, orient='index', columns=['population'])

# Adjust all statistics by population (cases per million people)
pop['multiplier'] = 1000000. / pop['population']
df2 = df.copy()
for k in eu_list:
    df2[k] = df2[k] * pop.loc[k, 'multiplier']

# Figure 1: cumulative infections per country
df[eu_list].cumsum().plot(figsize=(15,7), title='Infections per country [Updated up to 20200508]').grid(); plt.show()
# Figure 2: cumulative infections per million, per country
df2[eu_list].cumsum().plot(figsize=(15,7), title='Infections per million, per country [Updated up to 20200508]').grid(); plt.show()
# Figure 3: 10-day change in infections per million, per country
df2[eu_list].cumsum().diff(10).plot(figsize=(15,7), title='10-day change in infections per million people [Updated up to 20200508]').grid(); plt.show()