An Introduction to Neural Style Transfer for Data Scientists

Text-based neural style transfer can alter the style of your text

Photo by h heyerlein on Unsplash

If 2Pac had only been allowed to release music on the condition that his style matched the Queen’s English, the world would have been a significantly worse place.

The advent of style transfer (the ability to project the style of one text onto another) means that it’s now possible for a neural network to change the feel of a text.

As you can probably guess, this technology would be useful in a number of different settings. A simple first application would be to make an article sound more formal:

Informal : I’d say it is punk though.

Formal : However, I do believe it to be punk.

Beyond that, this technology could be used to help people with conditions like dyslexia.

More recently, the news that Microsoft was laying off journalists wasn’t groundbreaking news: advertising revenues are down across the board and newspapers are generally struggling to be as profitable as they were before (which was already a bit of a struggle). However, the news that they were to replace this team with AI is what startled people.

I’ve always loved writing but I’ve always sucked at it. My English teacher refused to let me answer questions because, undoubtedly, my answer would be wrong.

Fast forward 15 years and I’m building machine learning tools to solve just about any problem I can think of. More importantly, neural networks have recently found a new domain to improve. Microsoft Word now incorporates an AI that can offer full rewrite suggestions, rather than just simple spelling and grammatical fixes.

Have you ever been unable to express something in a given way?

Being unable to phrase something in a certain tone, or to give off a certain impression, is something that many writers struggle with. By saving time, focus and energy, this tool helps writers captivate their audience more effectively by tilting the wording in the right direction. That’s what Microsoft aimed to fix here, and in what follows I’ll explain how. Microsoft have said:

“In internal evaluations, it was nearly 15 percent more effective than previous approaches in catching mistakes commonly made by people who have dyslexia.”

Neural Style Transfer

The updates that Microsoft have recently incorporated are broadly similar to the product that Grammarly is well known for. Both sets of researchers are taking advantage of recent developments in the field of style transfer.

Neural style transfer was initially used between images, whereby the composition or style of one image could be projected onto another.

Neural Style Transfer between Images [Source]

However, this technique has recently been adapted for text style transfer. To do this, researchers took advantage of neural machine translation models to do the style transferring. Think about it: a certain ‘tone’ or ‘style’ can be seen as another language, and therefore:

“We create the largest corpus for a particular stylistic transfer (formality) and show that techniques from the machine translation community can serve as strong baselines for future work”

The baseline model in neural machine translation is based on Yoshua Bengio’s paper here, building upon Sutskever’s work on sequence-to-sequence learning. A neural network is formed as an RNN encoder-decoder, which works as follows.


Here, a phrase is passed into the encoder, which converts the string into a vector. This vector is effectively a latent representation of the phrase, which is then translated by a decoder. This is called an ‘encoder-decoder architecture’, and in this manner Neural Machine Translation (NMT) can tackle translation problems.

An example of Neural Machine Translation from Sutskever et al. (2014). We can see that the Japanese text is encoded into h values, which are then decoded into English.

Neural machine translation uses a bidirectional RNN to process the source sentence into vectors (encoding), along with a second RNN to predict words in the target language (decoding). This process, while differing in method from phrase-based models, proves to be comparable in speed and accuracy.

Creating a model

To create a neural style transfer model, we generally have 3 key steps that we have to take:

1) Embedding

Words are categorical in nature, so the model must first embed them, finding an alternative representation that can be used in the network. A vocabulary (of size V) is selected with only frequent words treated as unique; all other words are converted to an “unknown” token and share the same embedding. The embedding weights, one set per language, are usually learned during training.

# Embedding: one row of the embedding matrix per vocabulary entry
embedding_encoder = variable_scope.get_variable(
    "embedding_encoder", [src_vocab_size, embedding_size], ...)
# Look up the embedding vectors for the encoder input token ids
encoder_emb_inp = embedding_ops.embedding_lookup(
    embedding_encoder, encoder_inputs)

2) Encoding

Once the word embeddings are retrieved, they are fed as input into the main model, which consists of two multi-layer RNNs: an encoder for the source language and a decoder for the target language. In practice, these two RNNs have different parameters (such models do a better job of fitting large training datasets).

# Build the encoder RNN cell
encoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
# Run a dynamic RNN over the embedded source sentence
encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
    encoder_cell, encoder_emb_inp,
    sequence_length=source_sequence_length, time_major=True)

A reader paying attention to the code will notice that sentences can have different lengths; to avoid wasting computation, we tell dynamic_rnn the exact source sentence lengths through source_sequence_length, and since our input is time-major, we set time_major=True.

3) Decoding

The decoder needs to have access to source information. A simple way to achieve this is to initialise it with the last hidden state of the encoder, encoder_state.

# Build the decoder RNN cell
decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
# Helper that feeds the ground-truth target tokens during training
helper = tf.contrib.seq2seq.TrainingHelper(
    decoder_emb_inp, decoder_lengths, time_major=True)
# Decoder, initialised with the final encoder state
decoder = tf.contrib.seq2seq.BasicDecoder(
    decoder_cell, helper, encoder_state, output_layer=projection_layer)
# Dynamic decoding
outputs, _ = tf.contrib.seq2seq.dynamic_decode(decoder, ...)
logits = outputs.rnn_output

Lastly, we haven’t yet mentioned projection_layer, a dense layer that turns the top hidden states into logit vectors of dimension V:

projection_layer = layers_core.Dense(tgt_vocab_size, use_bias=False)

and finally, given the logits above, we are now ready to compute our training loss:

crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=decoder_outputs, logits=logits)
train_loss = tf.reduce_sum(crossent * target_weights) / batch_size

We have now defined the forward pass of our NMT model. Computing the backpropagation pass is just a matter of a few lines of code:

# Calculate and clip gradients
params = tf.trainable_variables()
gradients = tf.gradients(train_loss, params)
clipped_gradients, _ = tf.clip_by_global_norm(gradients, max_gradient_norm)

Now from here, you’re ready to begin the optimisation procedures behind creating your own neural style transfer model!

Note: the code above was largely taken from the TensorFlow GitHub documentation, and more information about this procedure can be found online.
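
Before training, the clipped gradients still need to be applied by an optimiser. Below is a minimal sketch of that step in the same TF1 style; the choice of Adam and the learning rate are my own illustrative assumptions, not part of the excerpt above.

# Apply the clipped gradients (optimiser and learning rate are assumptions)
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
update_step = optimizer.apply_gradients(zip(clipped_gradients, params))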


The theory behind neural style transfer for text has quite an extensive history, and it’s taken a while for academia to reach the perch it currently sits upon. Translation is a notoriously difficult task, not just because of grammatical problems but also because the output needs to sound somewhat human: somewhat colloquial.

The progress that’s been made is fantastic, and it will be even better if it keeps developing.


Thanks for reading! If you have any messages, please let me know!

Keep up to date with my latest articles here!


Code Support:

  1. https://www.tensorflow.org/tutorials/generative/style_transfer
  2. https://github.com/fuzhenxin/Style-Transfer-in-Text
  3. https://www.groundai.com/project/what-is-wrong-with-style-transfer-for-texts/1

This Python Library Will Help You Build Scalable Data Science Projects

Pytest: A Testing Framework for Python Code

Photo by Ratanjot Singh on Unsplash

How can you check that your code changes actually achieve what they’re meant to?

Ensuring your code has integrity is actually quite difficult, especially at scale. Usually you’ll work in a large team with different people working on different parts of the system. Everyone is tinkering with something, and if you’re using an agile methodology, you’ll be committing code multiple times a day. So how can you make sure that all your changes are backwards compatible? How can you make sure that your changes maintain at least the same functionality as the code you’re removing (without the bad bits)?

You can test code in a number of ways. A lot of it tends to be sensibility testing: coming up with situations or extreme cases in which the code will definitely fail, and then narrowing the scope.

It’s a long process, but this is so important because some code holds a lot of responsibility: at times, faulty code can bring down the company.

That’s not a joke, check these stories:

  1. Knight Capital lost around $440m in trading caused by a computer error (and was rescued through a merger with GETCO to form KCG)
  2. Y2K Bug that reportedly cost the industry upwards of $300bn to resolve (that’s billion with a b!).
  3. AT&T goes down for 9 hours and 75 million calls go unanswered — what caused it? A software update

These stories broke headlines and broke companies but just think about it as a customer as well: would you use an application that was super buggy? No, I wouldn’t either.

The Python community has appreciated testing for a while and pretty much all developers should know how to test their code. In what follows, we’ll be discussing the library pytest.

Photo by beuwy.com Alexander Pütter on Unsplash

So what is pytest?

Pytest is a framework that makes it easy to write small tests, yet scales to support complex functional testing for applications and libraries.

Pytest has been built over a number of years and it’s been so popular for the following reasons:

  1. easy and simple syntax
  2. run specific tests or a subset of tests in parallel
  3. built-in automatic detection of tests and the ability to skip tests
  4. encourages test parametrisation and gives useful information on failure (see the sketch after this list)
  5. encompasses minimal boilerplate
  6. makes testing easy by providing special routines and extensibility (many plugins, hooks, etc. are available)
  7. open-source i.e. allows contribution from the larger community
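
As a small illustration of point 4, a parametrised test runs the same assertion over several input/expected pairs (the values below are purely illustrative):

import pytest

# One test function, run once per (value, expected) pair
@pytest.mark.parametrize("value, expected", [(2, 3), (5, 6), (-1, 0)])
def test_increment(value, expected):
    assert value + 1 == expected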

These are but a few points that make pytest easy to use. Note that testing also forms an integral part of your continuous integration and continuous delivery (CI/CD) process, but this will be covered elsewhere.


Getting Started with PyTest

To install pytest, open up your command line and run the following command:

pip install pytest

You can also check whether the version is installed correctly with:

$ pytest --version

Your First Test

Before getting started, create a folder named “Sample Test” and make a Python file called testing_sample.py.

Assert statements are used to check that an expression in a test evaluates to the expected truth value.

Now in what follows, we’ll produce your first test in just 4 lines of code:

# testing_sample.py
def func(y):
    return y + 1


def test_answer():
    assert func(2) == 4

On execution of the above test function with:

$ pytest testing_sample.py

It returns a failure report because func(2) is not equal to 4 (i.e. 3 != 4). Additionally, pytest shows you exactly where and why the test failed. However, if you change the assertion to func(2) == 3, the test passes as the expression is True. Make sure to correct this before progressing.

Note: with standard test discovery rules, you can store multiple test files in your current directory and its subdirectories, and pytest will run through them all.

Assert Exception

To check that a piece of code raises an exception, you can write a test using the pytest.raises helper as shown:

# testing_sysexit.py
import pytest


def p():
    raise SystemExit(1)


def test_mytest():
    with pytest.raises(SystemExit):
        p()

This test results in an exception being thrown, but as it’s expected (and captured by pytest.raises), the test passes and execution carries on.

Multiple Tests Class Grouping

Now if you want to run multiple tests, you can choose to group them as a class as such:

# testing_class.py
class TestClass:
    def test_one(self):
        y = "this"
        assert "h" in y

    def test_two(self):
        y = "hello"
        assert hasattr(y, "check")

Pytest discovers tests following its naming conventions: test_-prefixed functions inside a Test-prefixed class, as used here. We can now run the code with:

$ pytest testing_class.py

You’ll see that the first test passes while the second fails, and pytest reports the reason for the failure so you can understand exactly what went wrong.

Fixtures

Pytest fixtures are functions marked with a decorator, which essentially run before the tests that request them. A pytest fixture is declared using the @pytest.fixture decorator as follows:

@pytest.fixture
def abc():
    ...

The implementation in pytest offers dramatic improvements over the classic xUnit style of setup/teardown functions because primarily, fixtures have explicit names and are activated by declaring their use from test functions. Fixtures are also modular, so each fixture name triggers a fixture function which can itself use other fixtures. Finally, fixture management scales from a simple unit test to more complex parametrised fixtures. With high-grade production code in a big organisation, the ability to configure and re-use fixtures is imperative.

In layman’s terms, fixtures are generally used to set up the data for a test, e.g. opening a database connection or connecting to a URL: generally, preparing some sort of input data. So instead of repeating the same setup code in every test, we attach the fixture function to the tests and it runs and returns its data to each test before that test executes.

Let’s look at a real example. Create a file test_div.py and add the code below to it:

# test_div.py
import pytest


@pytest.fixture
def input_value():
    input = 39
    return input


def test_divisible_by_3(input_value):
    assert input_value % 3 == 0


def test_divisible_by_6(input_value):
    assert input_value % 6 == 0

When running this file, we will get a pass in the first test but a fail in the second test as 39 is not divisible by 6.

Now, this approach comes with its own limitation: a fixture function defined inside a test file can only be used within that file (as that’s its scope). To make a fixture available to multiple test files, we have to define the fixture function in a file called conftest.py.
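
As a minimal sketch (reusing the fixture from above purely for illustration), moving the fixture into conftest.py makes it available to every test file in the directory without any import:

# conftest.py
import pytest


@pytest.fixture
def input_value():
    return 39

Any test in the same directory (or a subdirectory) can now take input_value as an argument, exactly as test_div.py did.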


The coding community feels strongly about testing because it really does save a lot of time and hassle when you’re making changes as part of a wider web of code. We’ve all been in that situation where you make a small change and suddenly everything breaks and there’s no clear reason why. Testing ensures that we stay on top of all problems and can isolate and fix them quickly.

I’d recommend it as a practitioner because testing allows you to build a solid foundation for any new framework. You can at times test things to death, but that’s better than not testing at all. After all, you want to ensure that your code has the highest degree of integrity possible — not only for when you merge it, but thereafter.


Thanks for reading again!! Let me know if you have any questions and I’ll be happy to help.

Keep up to date with my latest work here!

The Top 4 Virtual Environments in Python for Data Scientists

Which Environment Is Yours?

Photo by Shahadat Rahman on Unsplash

Virtual environments are a relatively difficult thing for new programmers to understand. One problem I had in understanding virtual environments was that I could see my environment existed within a macOS framework, I was using PyCharm and my code was running: what else did I need?

However, as your career as a Data Scientist or Machine Learning Engineer progresses, you realise that you get these annoying-as-hell dependency issues between projects, and as an amateur who’s self-taught in this space (like many readers here), it just takes forever to figure them out.

In what follows, I go through the most common virtual environments and why/when you should use which. To be honest, you should probably use Docker as it’s the latest technology and it’s what everyone is using (and if you’re interviewing, you’ll be asked about it). I talk about Docker here.

However, it’s super important to appreciate existing technology and how it works. Here it goes!


VENV

Photo by Lucrezia Carnelos on Unsplash

Virtualenv (or venv for short) was (and kind of still is) the default virtual environment tool for most programmers. Strictly speaking, venv is the module that ships with Python 3, while the older virtualenv package can be installed using pip as follows:

pip install virtualenv

and once it’s installed, go to your chosen directory and run the following command to create a virtual environment:

python3 -m venv env

Before you can start installing or using packages in your virtual environment you’ll need to activate it. Activating a virtual environment will put the virtual environment-specific python and pip executables into your shell’s PATH.

source env/bin/activate

And now that you’re in an activated virtual environment, you can start installing libraries as normal:

pip install requests

Finally, to make your repo reusable, create a record of everything that’s installed in your new environment by running:

pip freeze > requirements.txt

If you are creating a new virtual environment from a requirements.txt file, you can run

pip install -r requirements.txt

If you open your requirements file, you will see a package and its version on each line.

Finally, to deactivate the virtual environment, you can simply use the deactivate command to close the virtual environment. If you want to re-enter the virtual environment just follow the same instructions above about activating a virtual environment. There’s no need to re-create the virtual environment.

So far, we’ve had to manually create a virtual environment, activate it, and then also freeze the session and save everything into a requirements.txt file to make it portable. But what if we didn’t have to do this multi-step process?

Enter pipenv.


PipEnv

While venv is still the official virtual environment tool that ships with the latest version of Python, Pipenv is gaining ground in the Python Community.

For example, with venv as just described, in order to create virtual environments so you could run multiple projects on the same computer, you’d need:

  • A tool for creating a virtual environment (like venv)
  • A utility for installing packages (like pip or easy_install)
  • A tool/utility for managing virtual environments (like virtualenvwrapper or pyenv)

Pipenv includes all of the above, and more, out of the box.

Moreover, Pipenv handles dependency management really well compared to requirements.txt and pip freeze. Pipenv works the same as pip when it comes to installing dependencies and if you get a conflict you still have to manage it (although you can issue pipenv graph to view a full dependency tree, which should help).

But once you‘ve solved the issue, Pipfile.lock keeps track of all of your application’s interdependencies, including their versions, for each environment so you can basically forget about interdependencies. This is really a step up.

To install pipenv, you need to install pip first. Then do

pip install pipenv

Next, you create a new environment by using

pipenv install

This will look for a Pipfile; if one doesn’t exist, it will create a new environment and activate it.

To activate you can simply run the following command:

pipenv shell

To install new packages in this environment you can simply use pipenv install <package>, and pipenv will automatically add the package to the environment file, called the Pipfile.

You can also install a package for just the dev environment by calling:

pipenv install <package> --dev

And once you’re ready to ship to production, all you do is:

pipenv lock

This will create/update your Pipfile.lock, which you’ll never need to edit manually. You should always use the generated file. Now, once you get your code and Pipfile.lock in your production environment, you should install the last successful environment recorded:

pipenv install --ignore-pipfile 

This tells pipenv to ignore the pipfile for installation and use what’s in the Pipfile.lock. Given this Pipfile.lock, pipenv will create the exact same environment you had when you ran pipenv lock, sub-dependencies and all.

The lock file enables deterministic builds by taking a snapshot of all the versions of packages in an environment (similar to the result of a pip freeze).

There you have it! Now we’ve compared pipenv and venv and shown that pipenv is a much easier solution.


Conda Environment

Photo by Marius Masalar on Unsplash

Anaconda is a distribution of Python that makes it simple to install packages, and it’s generally a good place for Python beginners to start. At the same time, Anaconda also has its own virtual environment system, conda. Similar to the above, to create an environment:

conda create --name environment_name python=3.6

You can save all the info necessary to recreate the environment in a file by calling

conda env export > environment.yml

To recreate the environment you can do the following:

conda env create -f environment.yml

Last, you can activate your environment with the invocation:

conda activate environment_name

And deactivate it with:

conda deactivate

Environments created with conda live by default in the envs/ folder of your Conda directory.

Now, in my experience conda is OK, but I prefer the approach taken by venv for two reasons. Firstly, it makes it easy to tell whether a project uses an isolated environment, because the environment lives as a sub-directory of the project.

Secondly, it allows you to use the same name (e.g. env) for all of your environments, meaning you can activate each with the same command. Conda, by contrast, keeps environments in one central folder rather than inside the project, so each one needs its own distinct name.


Docker

Photo by Iswanto Arif on Unsplash

In a previous blog post I talk about Docker and go into detail explaining how to use it, so I won’t bore you here.

Docker is a tool that creates and runs containers. These containers are built from images of an entire operating system, whereas virtualenv only looks at the dependency structure of your Python project. So a virtualenv only encapsulates Python dependencies; a Docker container encapsulates an entire OS.

Because of this, with a Python virtualenv, you can easily switch between Python versions and dependencies, but you’re stuck with your host OS. However with a docker image, you can swap out the entire OS — install and run Python on Ubuntu, Debian, Alpine, even Windows Server Core.

There are Docker images out there with every combination of OS and Python versions you can think of, ready to pull down and use on any system with docker installed.


If you think about each of the environments listed above, you’ll realise that there’s a natural divide between them. Conda is better suited (naturally) to those who are using the Anaconda distribution (so mostly beginners in Python), whereas pipenv and venv are for those who are more seasoned and know the ropes. Of these two, if you’re starting something from scratch I’d really recommend going with pipenv, as it’s been built with the difficulties of venv in mind.

However, Docker is both easy to use and has such widespread recognition that you just have to know how it works. They all work out of the box and do what they need to, but portability between operating systems is what makes Docker the real stand-out: when it comes to production, you don’t need to worry about the OS on your server, as the container has it all sorted for you.


Thanks for reading! If you have any messages, please let me know!

Keep up to date with my latest articles here!

You’re living in 1985 if you don’t use Docker for your Data Science Projects

What is Docker and how to use it with Python

Photo by Iswanto Arif on Unsplash

One of the hardest problems that new programmers face is understanding the concept of an ‘environment’. An environment is, you could say, the system that you code within. In principle it sounds easy, but later on in your career you begin to understand just how difficult it is to maintain.

The reason is that libraries, IDEs and even Python itself go through updates and version changes; sometimes you’ll update one library, a separate piece of code will fail, and you’ll need to go back and fix it.

Moreover, if we have multiple projects being developed at the same time, there can be dependency conflicts, which is when things really get ugly as code fails directly because of another piece of code.

Also, say you want to share a project with a teammate working on a different OS, or even ship a project you’ve built on your Mac to a production server running a different OS: would you have to reconfigure your code? Yes, you probably would.

So to mitigate any of these issues, containers were proposed as a method to separate projects and the environments that they exist within. A container is basically a place where an environment can run, separate to everything else on the system. Once you define what’s in your container, it becomes so much easier to recreate the environment, and even share the project with teammates.

Requirements

To get started, the main thing you need installed is Docker itself (plus a local Python installation if you want to run the service outside a container first).

Containerise a Python service

Let’s imagine we’re creating a Flask service called server.py and let’s say the contents of the file are as follows:

from flask import Flask

server = Flask(__name__)


@server.route("/")
def hello():
    return "Hello World!"


if __name__ == "__main__":
    server.run(host='0.0.0.0')

Now as I said above, we need to keep a record of the dependencies for our code so for this, we can create a requirements.txt file that can contain the following requirement:

Flask==1.1.1

So our package has the following structure:

app
├─── requirements.txt
└─── src
└─── server.py

The structure is pretty logical (the source is kept in a separate directory). To execute our Python program, all that’s left to do is install a Python interpreter and run it.

Now to run the program, we could run it locally but suppose we have 15 projects we’re working through — it makes sense to run it in a container to avoid any conflicts with any other projects.

Let’s move onto containerisation.

Photo by Victoire Joncheray on Unsplash

Dockerfile

To run the Python code in a container, we pack it into a Docker image and then run a container based on that image. The steps are as follows:

  1. Create a Dockerfile that contains instructions needed to build the image
  2. Then create an image by the Docker builder
  3. The simple docker run <image> command then creates a container that is running an app

Analysis of a Dockerfile

A Dockerfile is a file that contains instructions for assembling a Docker image (saved as myimage):

# set base image (host OS)
FROM python:3.8
# set the working directory in the container
WORKDIR /code
# copy the dependencies file to the working directory
COPY requirements.txt .
# install dependencies
RUN pip install -r requirements.txt
# copy the content of the local src directory to the working directory
COPY src/ .
# command to run on container start
CMD [ "python", "./server.py" ]

A Dockerfile is processed line by line: for each instruction the builder generates an image layer and stacks it upon the previous ones.

We can also observe in the output of the build command the Dockerfile instructions being executed as steps.

$ docker build -t myimage .
Sending build context to Docker daemon 6.144kB
Step 1/6 : FROM python:3.8
3.8.3-alpine: Pulling from library/python

Status: Downloaded newer image for python:3.8.3-alpine
---> 8ecf5a48c789
Step 2/6 : WORKDIR /code
---> Running in 9313cd5d834d
Removing intermediate container 9313cd5d834d
---> c852f099c2f9
Step 3/6 : COPY requirements.txt .
---> 2c375052ccd6
Step 4/6 : RUN pip install -r requirements.txt
---> Running in 3ee13f767d05

Removing intermediate container 3ee13f767d05
---> 8dd7f46dddf0
Step 5/6 : COPY ./src .
---> 6ab2d97e4aa1
Step 6/6 : CMD python server.py
---> Running in fbbbb21349be
Removing intermediate container fbbbb21349be
---> 27084556702b
Successfully built 70a92e92f3b5
Successfully tagged myimage:latest

Then, we can see that the image is in the local image store:

$ docker images
REPOSITORY   TAG      IMAGE ID       CREATED         SIZE
myimage      latest   70a92e92f3b5   8 seconds ago   991MB

During development, we may need to rebuild the image for our Python service multiple times and we want this to take as little time as possible.

Note: Docker and virtualenv are quite similar but different. Virtualenv only allows you to switch between Python Dependencies but you’re stuck with your host OS. However with Docker, you can swap out the entire OS — install and run Python on any OS (think Ubuntu, Debian, Alpine, even Windows Server Core). Therefore if you work in a team and want to future proof your technology, use Docker. If you don’t care about it — venv is fine, but remember it’s not future proof. Please reference this if you still want more information.


There you have it! We’ve shown how to containerise a Python service. Hopefully this process makes things a lot easier and gives your project a longer shelf life, as it’ll be less likely to break as its dependencies change.

Thanks for reading, and please let me know if you have any questions!

Keep up to date with my latest articles here!

The Future of AI is in Model Compression

New research can reduce the size of your neural net in a super easy way

Photo by Markus Spiske on Unsplash

The future looks towards running deep learning algorithms on more compact devices as any improvements in this space make for big leaps in the usability of AI.

If a Raspberry Pi could run large neural networks, then artificial intelligence could be deployed in a lot more places.

Recent research in the field of economising AI has led to a surprisingly easy solution to reduce the size of large neural networks. It’s so simple, it could fit in a tweet (a rough code sketch follows the list):

  1. Train the Neural Network to Completion
  2. Globally prune the 20% of weights with the lowest magnitudes.
  3. Retrain with learning rate rewinding for the original training time.
  4. Iteratively repeat steps 2 and 3 until the desired sparsity is reached.
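
As a minimal sketch of the recipe above, assuming a PyTorch model and a train() routine you already have (both are placeholders here, not the researchers’ actual code):

import torch
import torch.nn.utils.prune as prune


def iterative_prune(model, train, rounds=5, amount=0.2):
    # Weight tensors of the conv and linear layers are the pruning targets
    params = [(m, "weight") for m in model.modules()
              if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))]
    for _ in range(rounds):
        # Step 2: globally remove the 20% of remaining weights with the
        # lowest magnitudes
        prune.global_unstructured(
            params, pruning_method=prune.L1Unstructured, amount=amount)
        # Step 3: retrain for the original training time, with the
        # learning-rate schedule rewound to its starting point
        train(model)
    return model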

Further, if you keep repeating this procedure, you can get the model as tiny as you want. However, it’s pretty certain that you’ll lose some model accuracy along the way.

This line of research grew out of an ICLR paper last year (Frankle and Carbin’s Lottery Ticket Hypothesis), which showed that a DNN could perform with only 1/10th of the number of connections if the right subnetwork was found in training.

The timing of this finding coincides well with us reaching new limits in computational requirements. Yes, you can send a model to train in the cloud, but for seriously big networks, once you factor in training time, infrastructure and energy usage, more efficient methods are desirable because they’re just easier to handle and manage.

Bigger AI models are more difficult to train and to use, so smaller models are preferred.

Following this desire for compression, pruning algorithms came back into the picture after the success of the ImageNet competition. Higher-performing models were getting bigger and bigger, and many researchers proposed techniques to try to keep them smaller.

Photo by Yuhan Du on Unsplash

Song Han of MIT developed a pruning algorithm for neural networks called AMC (AutoML for Model Compression), which removes redundant neurons and connections; the model is then retrained to regain its initial accuracy. Frankle developed this further by rewinding the pruned model to its initial weights and retraining it at its faster, initial rate. Finally, in the ICLR study above, the researchers found that the model could simply be rewound to its early training rate without touching any parameters or weights.

Generally, as the model gets smaller the accuracy gets worse; however, this proposed method performs better than both Han’s AMC and Frankle’s rewinding method.

Now, it’s unclear why this method works as well as it does, but it’s easy to implement and doesn’t require time-consuming tuning. Frankle says: “It’s clear, generic, and drop-dead simple.”


Model compression, and the concept of economising machine learning algorithms, is an important field where further gains can be made. Leaving models too large reduces their applicability and usability (sure, you can keep your algorithm sitting behind an API in the cloud), but there are many constraints on keeping them local.

For most industries, models are often limited in their usability because they may be too big or too opaque. The ability to discern why a model works so well will not only enhance the ability to make better models, but also more efficient models.

Neural nets are so big because you want the model to develop connections naturally, driven by the data. It’s hard for a human to interpret these connections, but regardless, pruning shows that many of them can be chopped out without hurting performance.

The golden nugget would be to have a model that can reason — so a neural network which trains connections based on logic, thereby reducing the training time and final model size, however, we’re some time away from having an AI that controls the training of AI.


Thanks for reading, and please let me know if you have any questions!

Keep up to date with my latest articles here!

The Difference between “is” and “==” in Python

Equality and Identity

Photo by Mr Xerty on Unsplash

Python is full of neat tips and tricks, and something worth noting is the difference between the two ways of indicating equality.

The == and is operators both indicate some form of equality and are often used interchangeably. However, this isn’t exactly correct. To be clear, the == operator checks for equality of values, whereas the is operator compares the identities of the objects.

In what follows, I’ll quickly explain the difference between the two, including code examples.


To understand this better, let’s look at some Python Code. First, let’s create a new list object and name it a, and then define another variable b that points to the same list object:

>>> a = [5, 5, 1]
>>> b = a

Let’s first print out these two variables to visually confirm that they look similar:

>>> a
[5, 5, 1]
>>> b
[5, 5, 1]

As the two objects look the same we’ll get the expected result when we compare them for equality using the == operator:

>>> a == b
True

This is because the == operator is looking for equality. On the other hand, that doesn’t tell us if a and b are pointing to the same object.

Photo by ETA+ on Unsplash

Now, we know they do because we set them as such earlier, but imagine a situation where we didn’t know—how can we find out?

If we simply compare both variables with the is operator, then we can confirm that both variables are in fact pointing to the same object:

>>> a is b
True

Digging deeper with examples, let’s see what happens when we make a copy of our object. We can do this by calling list() on the existing list to create a copy that we’ll name c:

>>> c = list(a)

Now again, you’ll see that the new object we just created looks identical to the list object pointed to by a and b:

>>> c
[5, 5, 1]

Now this is where it gets interesting: what happens when we compare our copy c with the initial list a using the == operator? What answer do you expect to see?

>>> a == c
True

This is expected because the contents of the objects are identical, and as such they’re considered equal by Python. However, they are actually two different objects, as we can confirm using the is operator:

>>> a is c
False
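
If you want to see what is is comparing under the hood, the built-in id() function returns an object’s identity, so the same checks can be written as:

>>> id(a) == id(b)
True
>>> id(a) == id(c)
False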

Being able to differentiate between identity and equality is a simple but important step in learning the complete scope of Python. These neat tips and tricks have helped me as a Data Scientist improve not only my coding skills, but also my analytics.

Thanks again for taking the time to read! Please message me if you have any questions, always happy to help!

Keep up to date with my latest work here!

The Future of GIT (2020)

Opinion

5 Predictions of what Data Scientists can expect

Photo by Yancy Min on Unsplash

Version control is a pretty boring topic for most people, but for coders and researchers it’s imperative to understand. The importance of version control really becomes clear when you work in a big team on a big project. With multiple users working on the same files at the same time, it’s a crowd you’re trying to control while ensuring they’re all working towards the same goal.

As we know it today, version control plays an integral part in our coding ecosystem and in all honesty, a lot of people are generally happy with it. I mean there are kinks and quirks that we could improve on but the fact that nearly every single coding team I’ve been part of uses git — that says something.

Given its widespread and integral use in the coding society and how the sphere of coding and technology has changed so much since 2005, the following are my top predictions as to how git will improve in the coming decade.

Photo by Franck V. on Unsplash

Prediction 1: GIT GOES USER FRIENDLY

My first prediction is going to be short and sweet. Beginners always struggle to learn git. Even people who’ve known git for 5 years+ still mess up in rebasing or changing branches and lose work along the way.

I’ll be honest, I’ve been using source control systems for almost 10 years and I only became comfortable with using git through PyCharm. It’s embarrassing but true. Without my DevOps team at the moment, I’d be lost.

Prediction 2: GIT GOES REAL TIME

The fact that git can tell you who has made what change is both a good and a bad thing. It’s good because it tells you who has done what (well, that is its purpose after all), but, it doesn’t tell you who is working on what at any point in time.

Generally speaking that isn’t a problem, but often two coders are working on the same file at the same time. It would be useful to know whether the functional changes both coders are making to the same file will interfere with each other, so they don’t have to go through the awkward dance of merging their commits. It’d certainly be helpful to know if another coder is working on the same file as you, and on which branch.

Prediction 3: GIT GOES CONNECTED

Why do we do git fetch still? Why do we do git pull still? There has to be a better way.

It’s second nature for coders who actively work in a shared environment to update their repository frequently during the day, but for those of us who sit in a research role or a quasi-coding position, it’s considered ‘good practice’ to update your branch regularly: but what is ‘good practice’?

In reality, it means updating it as often as the core developers would, but for those of us less well read in DevOps, shouldn’t git take this into account? Shouldn’t it say “Hey, this code (or your project) has changed considerably, you should definitely refresh your project”? Wouldn’t that be helpful?

Photo by NordWood Themes on Unsplash

Prediction 4: GIT GOES DIRTY

A piece of code can be considered dirty if it’s not committed. Code which isn’t committed can often fall between the cracks if the computer shuts down or a session ends before you stash it or commit it — as it isn’t really saved anywhere but locally.

However, sometimes you don’t want to commit code because you’re not finished, and you really want to go get your lunch. What would your commit message even say?

I guess this is what stashing is good for but it’d be great if git had a dirty mode, which you could switch on and it would auto-stash every few minutes to ensure that any faults in your local system were completely covered.

Prediction 5: GIT GOES AI COMMIT MESSAGE

Let’s be honest — there’s an art to writing commit messages and I have not mastered this art at all. I’m really, really bad at them.

You can even reference this post that complains about them.

I’m in two minds about this because I love reading awful commit messages but wouldn’t it be awesome if you didn’t have to write commit messages, and rather, the engine could determine what change you made and leave the notes instead?

For one, it’d probably be more informative and more transparent. Further, the coder may even spot issues if the generated commit message differs from what they were expecting.


Predicting the future of git is hard because in reality, who knows what it’s going to look like. It already does a pretty good job and despite the complaints on the internet about all its faults, there aren’t that many real competitors.

Tides are changing and people are starting to look more towards CICD frameworks, where the repository plays an integral role and given that, I expect a lot of improvements to come our way.

Especially with everyone in lock down, what else do we have to think about?


Thanks for reading again!! Let me know if you have any questions and I’ll be happy to help.

Keep up to date with my latest work here!


References:

  1. https://ohshitgit.com/
  2. https://stevebennett.me/2012/02/24/10-things-i-hate-about-git/
  3. https://medium.com/better-programming/stop-writing-bad-commit-messages-8df79517177d

How to Deploy Streamlit on Heroku

Opinion

For Endless Possibilities in Data Science

Photo by Kevin Ku on Unsplash

In a previous post, I predicted that the popularity of Flask would really take a hit once Streamlit comes more into the mainstream. I also made the comment that I would never use Flask again.

I still stand by both of these comments.

In that time, I’ve made four different machine learning projects, all of which are being used by family and friends on a daily basis, including:

  1. A COVID Dashboard for my local friends and family that focuses on our local area
  2. A simple application for restaurants to take bookings (to help out my father’s business)
  3. A small game involving a face mask recognition system for my nephew

None of these are for financial gain; rather, they’re made exactly for what I’ve always enjoyed about artificial intelligence: they’re fun, novel and creative projects that are just cool.

I can now develop, code and deploy novel applications in less than a couple of hours.

For those who don’t know, Heroku is a platform as a service (PaaS) that enables developers to build, run, and operate applications entirely in the cloud. It’s super useful when you want to build something small that scales, but more importantly, it’s really helpful for small pet projects.

The free tier is something that I really value and would really recommend readers to deploy more projects so that they’re free for everyone to play with. What’s the point of making cool stuff if you can’t show anyone?

Photo by Paul Green on Unsplash

In what follows, I’ll take you through the different steps required. The first or second time it may be a bit slow, but after, you’ll fly through it.

Note: the following instructions are only verbose to a degree to help you deploy simple applications, without much thought of scalability etc. My advice is targeted at individuals who want to build small applications, so if you expect to have 100m users, this is not really the tutorial for you

Firstly: make sure you have a Heroku login! Use this link and navigate through making a free tier. If you struggle here: I’ll be very disappointed.

Requirements.txt

Make sure that your terminal is in your project folder. When you launch your project into the cloud, you need to create a requirements.txt file so that the server knows what to download to run your code. The library pipreqs can autogenerate requirements (it’s pretty handy) so I’d recommend installing it as follows:

pip install pipreqs

Then once it’s downloaded, just step out of the folder, run the following command, and in the folder, you should find your requirements.txt file.

pipreqs <directory path>

It should contain libraries and their versions in the following format:

numpy==1.15.0
pandas==0.23.4
streamlit==0.60.0
xlrd==1.2.0

setup.sh and Procfile

Now, the next step is a little bit messy but bear with me. You need to set up two more things: a setup.sh file, and a Procfile.

The setup.sh file contains some commands to set things up on the Heroku side, so create a setup.sh file (you can use the nano command) and save the following in it (change the email in the middle of the file to your correct email):

mkdir -p ~/.streamlit/

echo "\
[general]\n\
email = \"your@domain.com\"\n\
" > ~/.streamlit/credentials.toml

echo "\
[server]\n\
headless = true\n\
enableCORS=false\n\
port = $PORT\n\
" > ~/.streamlit/config.toml

Nice!

Now (and I’m cheating a little bit here), make a Procfile using the command:

nano Procfile

and from there, you want to insert the following line (remember to replace [name-of-app].py with whatever your app is called; most of mine are just app.py):

web: sh setup.sh && streamlit run [name-of-app].py
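
For completeness, here is a minimal placeholder Streamlit app of the kind you might deploy this way (a sketch only; your real [name-of-app].py will contain your own project):

# app.py - a tiny placeholder app, purely for illustration
import streamlit as st

st.title("Hello from Heroku")
st.write("If you can read this, the deployment worked!")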

Moving the files across with Git

Heroku builds systems using Git and it’s insanely easy to get set up. Git is a version control system and runs as default on a lot of operating systems. Check if you have it, if not, install it.

Once you’re happy with your installation, you’ll need to make a git repository in your project folder. To do this, make sure you are in your project folder and run the following command:

git init

This initialises a git repository in your project folder and you should see something like the following print out:

Initialized empty Git repository in /Users/…

Nice!

The first time you do this, you’ll need to click here and install the Heroku CLI. We’re using the free tier of Heroku, which is great but naturally has drawbacks, as certain desirable features aren’t included. Those features matter more for larger projects (like SSL and scalability; also, free dynos go to sleep if they’re idle for more than 30 minutes) but hey, it’s free!

Once you’ve downloaded the Heroku CLI, run the following login command:

C:\Users\...> heroku login

This opens up a browser window, from which you can log in.

Once you’re in, it’s time to create your cloud instance. Run the following command

C:\Users\...> heroku create

and you’ll see that Heroku will create some oddly named instance (don’t worry, it’s just how it is):

Creating app… done, ⬢ true-poppy-XXXX
https://true-poppy-XXXXX.herokuapp.com/ | https://git.heroku.com/true-poppy-XXXXX.git

So Heroku created an app called ‘true-poppy’ for me. Not sure why, but I’ll take it. Now all that’s remaining is to move the code across, so in our project folder we run the following commands:

git add .
git commit -m "Enter your message here"
git push heroku master

Once it’s merged, the Heroku application will start downloading and installing everything on the server side. This takes a couple of minutes, but if all is good, you should see something like:

remote: Verifying deploy… done.

Now if you run the following:

heroku ps:scale web=1

Your job is done! If you copy the url it gives you in the command line into your browser, you’ll see that you can now run your application online. You can even check it on your mobile phone, it’s perfect!


Streamlit and Heroku make a phenomenal combination. Once you’ve done the actual hard-work of creating a machine learning model that makes sense and generates sensible results, the difficult part should never be the deployment. You should want people to play with a working product. Streamlit takes care of so much of the aesthetic and Heroku takes care of the rest.

Yes, there are several drawbacks to the above methodology in that it’s limited in scope, design and scalability but I challenge you to find a quicker way of deploying a half-decent and fully functional MVP that users can interact with. Even better, I’ll even do a race!

I really recommend these pragmatic approaches because without users interacting with your product — you’ll just never know if it’s any good.

Give it a go. Surprise yourself.


Thanks for reading! If you have any messages, please let me know!

Keep up to date with my latest articles here!

The difference between ‘git pull’ and ‘git fetch’?

The question we secretly ask

Photo by Kristina Flour on Unsplash

This is a brief explanation of the difference between git pull and git fetch (then merge). It’s a question that a lot of people want answered, being the 4th most upvoted question on Stack Overflow.

The reason so many people get confused is that upon first glance, they seem to do the same thing (fetching is kind of the same as pulling, right?), but, each has a distinctly different job.

git is what you would call a version control system. It tracks changes in code for software development, ensuring that there’s one central source of truth for the code, and that any changes to it are accurately recorded. It’s designed to coordinate work amongst programmers, but can be used to track any set of files really. It’s pretty handy and a lot of people use it!

Note: In a coming article, I talk about git in a lot more detail, so I’ll be leaving out an introduction to it in this post. If you have any questions though, please leave a comment at the bottom!

In Brief:

If you want to update your local repo from the remote repo, but, you don’t want to merge any differences, then you can use:

  • git fetch

Then after we download the updates, we can check for any differences as follows:

  • git diff master origin/master

After which, if we’re happy with any differences, then we can simply merge the differences as follows:

  • git merge

On the other hand, we can simply fetch and merge at the same time, where any differences would need to be solved. This can be done as follows:

  • git pull

Which to use:

How quickly you work and how your project is set up will determine whether you want to use git fetch or git pull. I tend to use git pull more because I’m generally working from a fresh and clean project.

As git pull attempts to merge remote changes with your local ones, you’re often at risk of creating a ‘merge conflict’. Merge conflicts arise when your local branch and the remote branch differ in overlapping places, so those differences need to be resolved before the merge can complete. You can always use git stash and un-stash in the face of differences (making conflict resolution a bit easier to deal with).


Other situations as to why you may want to use git fetch instead of pull are as follows:

a) You want to check which new commits have been made upstream after you’ve made some local changes:

git fetch origin master
git cherry FETCH_HEAD HEAD

b) You want to apply a commit from the master of another remote repository to your local branch.

git fetch <url_of_another_repository> master
git cherry-pick commit_abc

c) Your colleague has made a commit to a ref refs/reviews/XXX and asks you for a review, but you don’t have a web service to do it online. So you fetch it and check it out.

git fetch origin refs/reviews/XXX
git checkout FETCH_HEAD
Photo by Giorgio Trovato on Unsplash

git isn’t the easiest product to get the hang of, but after a while of playing around with it and fixing some mistakes, you’ll learn to depend on it quite a lot. Especially when you’re working with a big team, you’ll need to depend on git to coordinate code integration. Given that, the decision between git pull and git fetch becomes quite important, and hopefully you’ll now know which to use!


Thanks for reading! If you have any messages, please let me know!

Keep up to date with my latest articles here!

4 Awesome COVID Machine Learning Projects

Forward thinking ways to apply Machine Learning in a Pandemic

Photo by Neil Thomas on Unsplash

Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice. To learn more about the coronavirus pandemic, you can click here.


The pandemic has changed our lives: a lot. From all sides, the lives we lived before are no longer the same as they once were. Our workplaces are different; our families are different, our expectations are different too.

Given that most of us are working from home, I’ve put together 4 interesting COVID-based machine learning projects below; they’re all worth checking out! Each of these has its own place and some are more practical than others. However, in terms of the raw application of knowledge, these are all great!

Let’s get right to it!


Face Mask Facial Recognition

Facial recognition is a huge field and it’s only set to grow in the coming months and years. Computer vision is developing rapidly as technologies in this space, including autonomous driving and identification, become more and more widespread.

At scale, Coronavirus has resulted in a demographic and societal change whereby people have to physically change their actions. Given that, masks are becoming compulsory in a huge number of countries and as such, the ability to identify whether people are wearing masks is also growing in demand.

Photo by Flavio Gasperini on Unsplash

Building a system that can determine whether you’re wearing a mask is awfully similar to the problem of facial recognition, so the solution isn’t that difficult to create. Given that, the following sources are those that I’ve found quite useful in researching it:

I really appreciated the work by PyImageSearch and even implemented the framework on my own home computer. It worked well because two scripts are provided, meaning you can do less of the fiddly stuff and more of the playing around:

  1. Face mask recognition in images
  2. Face mask recognition in videos

Definitely worth a play around at home!

Photo by bruce mars on Unsplash

Social Distance Recognition

Following on from the mask recognition project: social distancing is one of the key themes of 2020. In the UK for example, you have to remain at a distance of more than 2 metres from people outside of your ‘bubble’, not to mention this distance varies between regions in Europe.

The trouble is implementing it in a way that doesn’t require new hardware. Existing cameras don’t really have an innate concept of distance, so two markers are usually set to tell the program what approximately constitutes a safe distance.

Given that, the following sources will help you to develop your own Social Distance Recognition tool!

Blog by Aqeel Anwar: source and Github

Photo by Joshua Earle on Unsplash

Symptom Checker

Say you’re coming down with a cold, getting a fever and generally feeling a bit run down. Should you worry?

Yes. Get a COVID test.

But if you can muster some energy, you can always use machine learning to aid in the determination of how likely you are to have COVID (or so the theory goes).

Using a sample data set such as the one generated here, you can quite easily throw it into a Random Forest and understand (a) how likely you are to have coronavirus and (b) how much each symptom should worry you.

Blog by Tanveer Hurra: source

Also, if you do have symptoms, go get tested and isolate!

However, the trick with this project is getting the symptom data. It’s not easy, but the more symptom data you get, the better your predictions!
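
Once you do have a labelled symptom data set, the modelling step itself is short. A minimal sketch follows; the symptoms.csv file and the covid_positive label column are hypothetical names, purely for illustration:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical file with one column per symptom and a binary label
df = pd.read_csv("symptoms.csv")
X, y = df.drop(columns="covid_positive"), df["covid_positive"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
print("Hold-out accuracy:", model.score(X_test, y_test))
# Feature importances hint at how much each symptom drives the prediction
print(dict(zip(X.columns, model.feature_importances_)))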

Graph Databases

Social distancing is a big deal in the pandemic because the virus can transfer from person to person quite quickly over short distances. Given that, if an individual tests positive, it’s important to understand (a) who is in their network of people (which is actually easy to identify) and (b) how likely each of those people is to have been infected. This allows policymakers to trace who may be infected and isolate them.

Nebula Graph is an open source project that allows users to build graphs and determine connections between entities, in this case people and places. A graph is loaded with data on both sick and healthy people, along with the addresses they travelled to, hoping to answer how people get sick when no one they came into contact with was sick at the time.

The blog by Min Wu is really insightful here, and despite it not coming with code, it’s not a difficult project to translate into Python.

My recommendation would be to first build a model working with randomly generated data, then find a real data set or generate your own from within your network!
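Nebula Graph is what the blog itself uses; as a quick, Python-only starting point with randomly generated contacts, a sketch with networkx might look like this (all names and the infection flag are invented):

import random
import networkx as nx

random.seed(0)
people = [f"person_{i}" for i in range(20)]
places = [f"place_{i}" for i in range(5)]

# Contact graph: people are linked to the places they visited
G = nx.Graph()
for person in people:
    for place in random.sample(places, k=2):
        G.add_edge(person, place)

infected = {"person_3"}  # assumed known positive case

# Anyone who shares a place with an infected person is a first-degree contact
at_risk = set()
for case in infected:
    for place in G.neighbors(case):
        at_risk.update(p for p in G.neighbors(place) if p not in infected)

print(sorted(at_risk))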

Photo by Nicholas Sampson on Unsplash

Despite us all being in lockdown, there’s a surge of creativity in the machine learning space as lots of new problems are being posed. New problems require smart solutions, and thankfully Machine Learning is able to play its part.

Hopefully, you’ll look into the above and take a stab at some of the projects. Some are easier than others, but any forward steps you make can surely make a huge difference!


Thanks for reading again!! Let me know if you have any questions and I’ll be happy to help.

Keep up to date with my latest work here!

4 Sorting Algorithms in Python

Including Time Complexities and Code

Photo by Olav Ahrens Røtne on Unsplash

Have you ever tried to sort a deck of cards by hand? You probably, intuitively, used the insertion sort algorithm. The good news is that you do end up with a sorted set; the bad news is that it can, unfortunately, take a while! The following article explains why this algorithm works and how long it takes.


Computational efficiency is key in Software Engineering, and a common time sink is the routine task of sorting. As an example, the following operations are directly impacted by the ability to sort something quickly:

  1. Searching: searching is much more efficient on a sorted set
  2. Selecting: selecting the n’th element is much more efficient on a sorted set
  3. Duplicates: identifying duplicates is much quicker on a sorted set
  4. Distributional Analysis: much quicker on a sorted set.

Moreover, as we enter an age of Big Data, we need to make sure we’re using the right sorting algorithm to efficiently manage our data to help us, as researchers and practitioners, get to conclusions quicker.

Trust me, it can take a while to sort millions of rows if you use the wrong algorithm.

Now before we get into the specifics of each of these algorithms, we first need to decide on a metric to compare their speed, or rather, their worst-case behaviour. Let’s therefore first go over the concept of Time Complexity.


Time Complexity and Big O Notation

In layman’s terms, computational complexity is the amount of time it takes to run an algorithm. Often, this is represented in what’s called “Big O Notation”, which is a way of writing the limiting (or worst-case) behaviour of an algorithm.

So for example, if you wanted to count the number of items in a list, you could just go through the list one by one and this would take n steps. Even for a list that’s monumentally large, this algorithm would take n steps. Therefore the notation, in this case, would be O(n).

Here are some other common examples:

O(n²): This is easier to explain with an example. Say you’re checking whether any element in your list is duplicated: for each element, you have to compare it against every other element in the list, so you’re doing n checks across n items, i.e. n*n = n².

O(log n): Say you have a list: you cut it in half, then in half again, and again, until you only have one item left. For a list of n items, that takes roughly log₂(n) steps.
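As a quick, purely toy illustration of these two cases in Python:

def has_duplicates(items):
    # O(n²): compare every element against every other element
    for i in range(len(items)):
        for j in range(len(items)):
            if i != j and items[i] == items[j]:
                return True
    return False

def halving_steps(n):
    # O(log n): keep cutting the problem in half until one item remains
    steps = 0
    while n > 1:
        n //= 2
        steps += 1
    return steps

print(has_duplicates([3, 1, 4, 1, 5]))  # True
print(halving_steps(1024))              # 10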

The following chart helps you visualise the speeds of different time complexities; you can see that O(log n) is quicker than O(n), which is quicker than O(n²).

Number of Steps in different time scales [source]

Now we’ve covered the concept of Time Complexity and Big O Notation, let’s move onto our algorithms.


Bubble Sort

The O.G. (maybe a bit of a stretch…) of sorting algorithms is Bubble Sort. It’s the most commonly known sorting algorithm, and you’ll probably have been asked about it in an interview or two.

Photo by Braedon McLeod on Unsplash

The algorithm takes repeated passes through a list, compares adjacent items and swaps them if they’re in the wrong order, looping through the list until it is ordered. Depending on how the list is initially arranged relative to the order you want (highest to lowest or vice versa), this naturally leads to differences in how long the sort takes.

Assuming everything is in exactly the wrong order, then for every swap you make you have to check every other item in your list, so in this unfortunate setting Bubble Sort takes roughly O(n²) steps. Therefore, Bubble Sort is not a practical sorting algorithm for large lists.

Let’s move onto something a bit better.

Example code at the end


Merge Sort

Here we begin to get a bit fancier. Merge Sort works by recursively splitting an unsorted list in half until there is one element per group. Then, you compare the elements and merge the groups back together, repeating until the list is fully merged and sorted.

This sounds a bit complicated, but Swfung8 has created a great gif below to show how it works in practice:

Credit to Swfung8 — Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=14961648

Now the overall time complexity of Merge Sort is O(n log n): the splitting produces roughly log n levels of sub-lists and the merging requires n steps at each level, coming together to give O(n log n), which is much faster than Bubble Sort.

Note that if we want to get even fancier, we can also talk about the space complexity of an algorithm (the amount of additional memory it requires). For Merge Sort, this is O(n), which means the algorithm needs a full extra copy’s worth of space and can become a bottleneck for very large data sets.

Example code at the end


Quick Sort

Like Merge Sort, Quick Sort is a Divide and Conquer algorithm, but it’s famous for being about two or three times quicker in practice. It picks an element as a pivot and partitions the list around that pivot. The sub-arrays are then sorted recursively, which can be done in-place, requiring only a small additional amount of memory.

Credit to : RolandH, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=1965827

In most standard implementations it performs significantly faster than Merge Sort and rarely reaches its worst-case complexity of O(n²); generally speaking, its average case is O(n log n) with smaller constant factors.

Example code at the end


Insertion Sort

Now Insertion Sort deserves a quick shout-out here because it reflects how people sort cards manually. On each iteration, you remove one card from the unsorted pile, find where it belongs within the sorted pile and insert it in its place, repeating until the whole deck is sorted.

Insert Sort: [source]

The time complexity, as you can guess, is not great: in the worst case, for every element you insert you have to scan through the entire sorted portion, and so you can reach the worst-case quadratic time complexity of O(n²).

Note that this algorithm has an extremely useful property: because it’s so simple, it can be coded very efficiently, and it also works ‘online’, in that it can sort a list as it receives it. However, it’s still pretty slow, so I’d recommend Quick Sort over it for anything large.

Example code at the end


So as you can see, the story kind of develops from Bubble Sort to the widely used Quick Sort algorithm, with Insertion Sort showing us the benefits of a code-efficient and intuitive construction, despite it being relatively time inefficient.

A lot of these algorithms are built into your language or libraries (for example, the sort functions in pandas default to Quick Sort), so I wouldn’t worry about implementing them yourself too much. Although, if you expect your data is already partially sorted, then selectively choosing which sorting algorithm you use can definitely improve efficiency.

Either way, by taking a few extra minutes to think about the problem you’re trying to solve, you can definitely make headway in making quicker and more efficient programs =]


Thanks for reading again!! Let me know if you have any questions and I’ll be happy to help.

Keep up to date with my latest work here!


Code

Bubble Sort

def bubble_sort(array):
    n = len(array)
    for i in range(n):
        # Assume the list is sorted until a swap proves otherwise
        already_sorted = True
        for j in range(n - i - 1):
            if array[j] > array[j + 1]:
                array[j], array[j + 1] = array[j + 1], array[j]
                already_sorted = False
        # If no swaps happened in this pass, the list is already sorted
        if already_sorted:
            break
    return array

Merge Sort

Note the base case: lists of length 0 or 1 are returned as they are.

def merge_list(left_list, right_list):
    result = []
    i, j = 0, 0
    # Repeatedly take the smaller of the two front elements
    while i < len(left_list) and j < len(right_list):
        if left_list[i] < right_list[j]:
            result.append(left_list[i])
            i += 1
        else:
            result.append(right_list[j])
            j += 1
    # Append whatever is left over from either list
    result.extend(left_list[i:])
    result.extend(right_list[j:])
    return result

def merge_sort(data):
    # Base case: a list of 0 or 1 elements is already sorted
    if len(data) <= 1:
        return data
    middle = len(data) // 2
    left_data = merge_sort(data[:middle])
    right_data = merge_sort(data[middle:])
    return merge_list(left_data, right_data)

Quick Sort

def quicksort(array):
    # Base case: a list of 0 or 1 elements is already sorted
    if len(array) <= 1:
        return array
    less = []
    equal = []
    greater = []
    pivot = array[0]
    # Partition the list around the pivot
    for x in array:
        if x < pivot:
            less.append(x)
        elif x == pivot:
            equal.append(x)
        else:
            greater.append(x)
    # Recursively sort the partitions and stitch them back together
    return quicksort(less) + equal + quicksort(greater)

Insertion Sort

def insertion_sort(array):
    for i in range(1, len(array)):
        current = array[i]
        j = i
        # Shift larger elements one position to the right
        while j > 0 and array[j - 1] > current:
            array[j] = array[j - 1]
            j -= 1
        # Insert the current element into its correct position
        array[j] = current
    return array

How to Join DataFrames in Pandas in Python

{inner, outer, left, right}

Photo by Sid Balachandran on Unsplash

In 2008, Wes McKinney was at the hedge fund AQR and developed a small piece of software which became the precursor to Pandas, developing and finalising it later on. Since then, Pandas has become one of the most important Python libraries, one that most (if not all) Data Scientists use on a daily basis.

It allows users to manage and manipulate data with levels of efficiency not seen before. From loading data and managing missing values to pivoting and reshaping: Pandas doesn’t (really) achieve anything new, but existing tools were inefficient, whereas Pandas makes light work of these complicated tasks.

Now regardless of whether you use SQL or Pandas, you need to know how to join tables. A table join is a process by which you combine two separate ‘tables’ (or in Pandas land, DataFrames) together.

Different types of joins [source]

Let’s play with some actual data and say we have the following DataFrame (or in SQL land, Table):

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3'],'C': ['C0', 'C1', 'C2', 'C3'],'D': ['D0', 'D1', 'D2', 'D3']},index=[0, 1, 2, 3])

which looks something like this:

Note that the index values have been set to [0, 1, 2, 3]. Remember that an index is an identifier for the location of a particular row in a table. For example, your office may be on the 3rd floor of a building, but some other buildings may call the 3rd floor something arbitrary like “Floor C”. In this case, Floor C and Floor 3 are equivalent: they’re just identifiers of a location, like an index.

Say we want to combine df1 with the following DataFrame, df4:

df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'], 'D': ['D2', 'D3', 'D6', 'D7'],'F': ['F2', 'F3', 'F6', 'F7']}, index=[2, 3, 6, 7])

which looks like this:

Now, how we want to combine the data determines what type of join we use. Note that here we ‘join’ the DataFrames based on their index.

If we want to combine the data so we only have index rows that are in both A and B, then we would use an ‘inner’ join.

Inner Join

So as you can see, here we simply use the pd.concat function to bring the data together, setting the join argument to 'inner':

result = pd.concat([df1, df4], axis=1, join='inner')

which, as can be seen, only has index rows [2, 3]. That’s because index values 2 and 3 exist in both df1 and df4.

All the other rows (or index values) in each DataFrame that are not in the other DataFrame are thrown away (e.g. df1 contains index values [0, 1] which are not in df4, so those rows are dropped).

Now if we want to retain all the index values from both DataFrames, we can use an Outer Join.

Outer Join

So when we change the join argument to 'outer', we can see that all index values are now present ({0, 1, 2, 3} from df1 and {2, 3, 6, 7} from df4).
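The corresponding code simply changes the join argument:

result = pd.concat([df1, df4], axis=1, join='outer')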

Let’s now move onto something which some of you may have deduced:

Left Join

Say we want to stick to the index values in df1, for that, we would then do a left join:

df1.merge(df4, how='left', left_index=True, right_index=True)

This would give us the corresponding DataFrame:

and likewise on the other side:

Right Join

Say we want to stick to the index values in df4, for that, we would then do a right join:

df1.merge(df4, how='right', left_index=True, right_index=True)

This would give us the corresponding DataFrame:


So there you go! Definitely take this away and have a go yourself. Joins are a bit confusing and it takes a few tries to fully get your head around them, but try to remember the following if you have two tables A and B:

  1. Inner joins only keep index rows that are in both A and B
  2. Outer joins make a new table that contains index values from either A or B, and set the missing segments to NaN.
  3. Left joins make a new table that contains index values from only the ‘left’ table, which in our case would be df1, or A.
  4. Right joins make a new table that contains index values from only the ‘right’ table, which in our case would be df4, or B.

Thanks for reading again!! Let me know if you have any questions and I’ll be happy to help.

Keep up to date with my latest work here!

Photo by Charles Ray on Unsplash

Sorry, the TensorFlow Developer Certificate is Pointless

Plenty of Better Alternatives Exist to Prove your Skillset

Photo by Daniel Mingook Kim on Unsplash

Google’s overall openness and investment in the space of AI has been phenomenal. I really think that unequivocally, the whole world has a lot to thank them for. Academic breakthroughs are published and code is often made free on GitHub. What more could anyone ask for?

Google are impressive.

Given my respect for their work, I was taken by surprise when I heard about the new TensorFlow certificate. So were responders on Quora. So were a few people on Reddit.

Now two achievements that changed my life were (a) teaching myself to program and (b) teaching myself enough Machine Learning to get into my chosen masters program. My academic background at that point was Mathematical Economics for which ‘STATA’ was the closest I’d been to programming — so the learning curve was steep and fast.

Teaching myself AI was hard

It’s no surprise when I say that AI (whatever you want to call it) is a gruelling discipline. Few other subjects cross the divide between Mathematics, Statistics, Computer Science — not to mention domain specialities within Linguistics, Imagery and more.

Moreover, experienced domain experts can attest that every new AI problem is different from existing ones. So to realise great results, you have to be creative with the deepest parts of academic knowledge. ‘Curriculums’ rarely go that far.

In what follows, I give 4 issues that I have with this certification.


Issue 1 : The Certificate Involves No Mathematics

Feed a man a fish, he’ll eat for a day. Teach a man to fish, he’ll eat for life.

Studying the Mathematics behind Machine Learning and AI is really, really hard because the level of understanding required is very advanced. However without going to these depths, you’ll simply make mistakes.

For example, understanding the distributional assumptions behind a model is imperative to making good judgements about which model to use. Changing assumptions about the distributional properties of your noise, for example, can alter your entire optimisation procedure (from closed-form solutions assuming a normal distribution to Laplace approximations assuming Laplace distributions).

Moreover, certain models are not suited for certain problems and you can only understand this if you understand why. Take the example of vanishing gradients: this invalidates the use of Neural Networks in certain problems.

Issue 2: Citizens who live in Sanctioned Countries are Excluded

This is something a bit more political but the beauty of MOOC’s and general online learning is that it doesn’t discriminate on location.

However, residents in the following “sanctioned countries” are not eligible to take this exam: ● Cuba ● Iran ● Syria ● Sudan ● North Korea ● Crimea Region of Ukraine.

Now, these individuals are free to take the exam outside of their country, but I don’t think it’s as easy as ‘just leaving the country for an exam’; many people in these countries would struggle to even do so.

I understand that these countries are on a sanctioned list but I don’t see why education is being restricted here. If the law is really inhibiting them being examined, an alternative should be put in place in the spirit of complete open accreditations.

This is from TensorFlow’s Handbook [Page 7]

I really don’t understand this position, as you could argue that students in these ineligible countries deserve these exams more than those in developed nations.

This sanctioned list defeats the purpose of open accreditations.

Knowledge doesn’t acknowledge borders and neither should we.

Photo by Timo Volz on Unsplash

Issue 3: The Certificate has Fees

Fees are a great divider in society and, despite the stipend provided by Google to help students from disadvantaged backgrounds, the author of this article still spent over $200 for the whole package.

It’s not that I have a problem with charging people for education: I think Skillshare and Coursera are fantastic. However, the educational benefit has to be exceptionally clear, whereas in this case it’s not. There are a variety of Summer Schools and MOOCs with much better value propositions, and these should definitely be explored.

Issue 4: There are better ways to prove your worth

Ultimately, spending over $100 on an exam, plus a lot of hours trying to pass it, for an unproven edge is a decision that each person has to make independently.

However, having been through a lot of interviews and having interviewed a lot of candidates, I can firmly tell you that if you nail a few of the following, they’ll be much more useful:

Summer Schools or MOOC’s

These are worthwhile assuming you take away the mathematics required to build ML models from scratch, build a network whilst studying and reach out to key people in these ecosystems. Even for MOOCs, large online forums exist where you can network and make friends with people who can help.

Kaggle Competitions

There’s a lot of kudos in the industry for scoring highly here. Seriously, it goes a long way to be able to say you’re in the top 1% of Kaggle for a particular competition.

GitHub

From a company’s perspective, there’s nothing better than showing off what you can do for a prospective employer. Make a Jupyter Notebook and show how you would solve a problem in their industry. Can’t get real data? Generate it using Monte Carlo simulation. Recreate results from academic papers and show how you’d improve on them. This is what you would do on the job anyway.

Contributing to Libraries

People who contribute to core ML libraries are rated very, very highly in industry. Yes, you have to be a good coder but reaching out and offering your time/fixing things goes a long way here.


As with all decisions in life, if the situation you’re in forces you to take the exam then you should definitely take it. However, if your argument is ‘I think this will help me but I’m not sure exactly how’ then I think you should look at the issues raised and see which alternatives make more sense.

I think the rationale for the qualification may have some merit (to unify accreditation in an open-source manner), but its obscure syllabus that lacks any form of mathematics is something I really can’t get along with.

The high bar of requirement in Artificial Intelligence is there because the work is hard. It takes a long time, grit and determination to make it through the process.

Trust me: it’s worth taking the long-way round and doing it properly. Your life will change for the better.


Thanks for reading again!! Let me know if you have any questions and I’ll be happy to help.

Keep up to date with my latest work here!


Links to TensorFlow’s Developer Certificate

How to Remove Racial Discrimination from Data Science

On Finding and Fixing Latent Racial Bias

Photo by Tim Mossholder on Unsplash

The recent protests which unfolded across the United States (and more recently across the World) reminded us how important it is to acknowledge and resolve both unfair and undue bias from society.

It’s important that events like these teach and remind us to take a look at what we create to ensure that we don’t make the same mistakes as the mistakes made before us.

Models that are developed (and currently deployed), unfortunately, can also fall foul of these biases. Now, this is an egregious thought but the following examples show that the problem is real:

  1. What if an A.I. job candidate selector preferred names associated with a certain gender?
  2. What if autonomous vehicles were less safe around a certain race?
  3. What if your image classifier changed labels with race?
  4. What if an airport facial scanner discriminated against colour?

Isn’t this exactly what we don’t want?

When I studied Machine Learning, I was taught methods to remove latent bias: for example, the importance of ‘balanced datasets’ and of ‘ensuring your training dataset is representative of your domain’. But often, in the process of increasing accuracy, we can fall foul of developing a model that doesn’t generalise well.

As an example, say your model is attempting to predict who should be the next United States President from a list of candidates. If you look at the training data set for this problem, well, the United States has never had a female President. So if the gender of the individual is an input into the model, could this cause an unfair bias? What about their name? Or even their height?

The problem here is that if the domain of the problem changes, can our models recognise it and cope?

This was really exemplified by the conversational chat-bot developed at Microsoft and deployed on Twitter, Tay bot. This conversational agent was a massive technical leap forward but despite this, the progression was overshadowed by its egregious replies.

On one side, many reporters noted (here) that the conversational bot made these terrible replies in response to terrible questions, but a reply such as “Jews did 9/11” is nothing but awful and, in reality, the model should have stopped well short of making such accusatory and dangerous comments.

Yes, AI agents have the ability to discern understanding from complex patterns, but falsely labelling or recognising a phenomenon is something that has to be fixed in this space. There are just some things we can’t afford to get wrong.

In what follows, I cover a few methodologies to remove these biases from machine learning datasets, and how these rules can be implemented.


Method 1: Incorporating Adversarial Inputs into Training

Work by both Google and OpenAI has highlighted that even the best ML models are susceptible to robustness issues, which is to say that if you slightly alter the input to the model, the output can differ greatly.

Take the following example from OpenAI:

Figure 1: Add a bit of static white noise and the AI model completely misclassifies: [OpenAI Source]

Here we see that a latent phenomenon (in the form of white noise) is not only changing the classification of a sample, but also altering the model’s confidence in its suggestion.

Now, this error can happen because the model is being trained on data in such a way that (a) it overfits the training sample but also (b) the model is unable to generalise well.

These issues are widely known and managing them requires adding equal parts robustness measures and adversarial examples into your training procedures. These can be of the form of:

  1. Altering your training samples to create more training samples (e.g. a rotated image should be treated as equivalent; see the sketch after this list). This solution is naturally domain specific, but the researcher should think responsibly and really hard here. Should a CV scanner really treat identical CVs of different genders differently? Should gender even be a feature at all?
  2. Sampling techniques to ensure your training dataset is balanced
  3. Optimisation and fitting methods that are more robust and more stable to slight changes in the input space.
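As a minimal sketch of point 1 for image data, using nothing more than plain NumPy flips and rotations (a real pipeline would use a proper augmentation library and domain-appropriate transforms):

import numpy as np

def augment(image):
    """Return the original image plus a few label-preserving variants."""
    variants = [image]
    variants.append(np.fliplr(image))       # horizontal mirror
    variants.append(np.rot90(image, k=1))   # 90 degree rotation
    variants.append(np.rot90(image, k=3))   # 270 degree rotation
    return variants

# Toy example: one 4x4 greyscale 'image' becomes four training samples
image = np.arange(16).reshape(4, 4)
augmented = augment(image)
print(len(augmented))  # 4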

The methods listed above are not exhaustive, nor do they intend to be. Training an ML model is domain specific, but the methods above can certainly help in removing latent bias.

Great work in this space is being led by Google and some examples of their findings can be found here and here.


Method 2: Create tests and actively use them

There are some mistakes that we can live with, but there are others that we simply cannot.

Put simply, some models cannot show bias and should be tested in anger.

Models that contain parameters for features such as Race, Gender and other discriminatory angles should be addressed. In more complex models (such as in deep learning methods), a “never-wrong” data set should be used to ensure that the model is responsible and cognisant. It’s a minimum requirement, and we owe it to society to have one.

Chris Rock: “In some jobs, everybody gotta be good…”

Models are often incomplete and problems may slip through the cracks, but if the model maker is not held accountable and the construction of these models is not improved to account for these known issues, then the problem will continue.

Yes, we expect our test set to be representative of the domain, but if our model does unexpected things (as with Microsoft’s Tay bot), then the fault is still with the model maker and the model is not fit for purpose.

In these really weird instances, I’m a firm believer in being more conservative. As per the lessons of the bias-variance trade-off: sometimes being a bit more conservative in your estimates can result in a significantly better outcome, not just for some, but for everyone.

Airport scanners should not discriminate on colour. They just shouldn’t.


Method 3: Surprise Detection and ‘Unknown Unknowns’

Another key issue is that a machine learning model has to understand what its remit is. Take the example from Figure 1 above: by adding noise to the problem, the confidence in its guess actually increases! This tells us that either the measure of confidence in this space is clearly miscalculated, or, the model can easily be confused.

Generally in image classification it’s quite common to use monochrome images; however, this process removes a number of features which are incredibly valuable. Colour, for example, is just one of many features you can use to discern an anomalous sample.

Anomalies will always exist and we need to integrate more ways to deal with them.

When you move from training to cross-validation to testing, the dynamics of the domain can shift. Practitioners often assume that the underlying dynamics are all the same, e.g. that the sampling distribution of age/colour/gender remains constant, but we know this is not always the case. For example, a training set may only contain young people, whilst the test set may only contain old people.

Given this reality, it would be preferable if an algorithm said “I can’t process this job applicant as I don’t recognise line 10 on their C.V.” rather than scoring the sample incorrectly. An incorrect score can result in both positive and negative discrimination, which simply isn’t right.

Actively measuring the latent dynamics in your problem space and comparing them to the domain your model was trained on is not only good practice, it could save a lot of pain down the road. What use is an English-to-French translation engine if you now feed it German phrases? Likewise, if a feature is not familiar, the difference should be measured and acted upon.


We cannot let the same problems that exist within society affect the way technology develops. Given that we recognise these problems exist, we have a responsibility to ensure that such models are not deployed until they are proven to be unbiased.

The European Commission released a study in early January 2020 highlighting a significant number of issues and challenges facing the deployment of AI. The examples at the beginning of this article echo this study but ultimately, a lot has to be done to fix these problems.

We need to constantly measure and improve our technology. We have to be responsible for this change.


Thanks for reading! Please message for any questions!

Keep up to date with my latest work here!

Top 5 Linux Commands for Beginners

Data Science on the Command Line

Photo by Nathan Dumlao on Unsplash

As data sets get larger and more prevalent, researchers are having to do a lot more of the legwork with regard to core programming, thereby spending more time with tools like Git and Linux (something we rarely had to do before!).

For the software engineers reading this post: you probably won’t find the following super useful but as someone who’s been through those early self-taught days as a junior researcher, I feel the pain of budding Data Scientists or ML researchers!

Given all that, I thought about which commands I use daily and which I wish I had known earlier. From that, I now present the top 5 Linux commands that have helped me in my career!


Command 1: grep

grep sounds like the noise frogs make, but actually it stands for Global regular expression print. That long phrase doesn’t make much sense outright, but the essential use case for the grep command is to search for a particular string in a given file.

The command is fairly quick and incredibly helpful when you’re trying to diagnose an issue on your production box, where, for example, you may suspect a text file has some bad data.

As an example, say we’re searching for the string 'this’ in any file which begins with the name 'demo_’:

$ grep "this" demo_*
demo_file:this line is the 1st lower case line in this file.
demo_file:Two lines above this line is empty.
demo_file:And this is the last line.
demo_file1:this line is the 1st lower case line in this file.
demo_file1:Two lines above this line is empty.
demo_file1:And this is the last line.

Not so bad, huh? We can see on the left-hand side that there are two files that begin with demo_ (demo_file and demo_file1).

Command 2: wget

Now we move onto something a little bit more sophisticated but still something we use quite a lot. The wget command is a useful utility for downloading files from the internet. It works non-interactively, so it can be used in scripts and cron jobs.

The utility is called as follows:

wget <URL> -O <file_name>

For example, to download a file:

wget https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.17.2.tar.xz
Photo by Aziz Acharki on Unsplash

Command 3: wc

Often you have a file of arbitrary length and something smells fishy: maybe the size of the file seems too small for the number of rows you expect, or maybe you’re just curious how many words are in it. Either way, you want to inspect it a bit more and need a command to do so.

The wc command helps out in that it essentially counts a few different things for the file in reference:

# wc --help

Usage: wc [OPTION]... [FILE]...
-c, --bytes print the byte counts
-m, --chars print the character counts
-l, --lines print the newline counts
-L, --max-line-length print the length of the longest line
-w, --words print the word counts
--help display this help and exit
--version output version information and exit

So, say we want to count the number of lines in a file:

wc -l tecmint.txt

16 tecmint.txt

or maybe the number of characters:

wc -m tecmint.txt

112 tecmint.txt

Awesome!

Command 4: Vi

The vi command is super helpful as it allows you to open and explore a file. The command works as follows:

vi [filepath]

And it takes you into a text editor. In this editor, you can use the following keys to navigate:

k    Up one line  
j Down one line
h Left one character
l Right one character (or use <Spacebar>)
w Right one word
b Left one word

However, in reality, you’ll pick up navigation pretty naturally. The following commands will be the most useful:

ZZ     Write (if there were changes), then quit
:wq Write, then quit
:q Quit (will only work if file has not been changed)
:q! Quit without saving changes to file

You’ll learn to love vi, I swear!

Command 5: CTRL+R

So I’ve saved the best for last as I really do use this one a lot. CTRL+R isn’t really a command but more of a shortcut. It allows you to search your command history by typing in something which resembles the command, and similar commands that you’ve used before come up!

For example, say you’ve just run a really long command and for whatever reason your terminal session breaks and you have to re-run the command again. With this command, you can quickly search for it again instead of reconstructing the command from scratch!

Let’s say I’m trying to remember a command that begins with hi, but I can’t remember it all. I type in ctrl+r and then I see what it recommends:

$ history
bck-i-search: his_

Perfect! The command history has been recommended and that’s exactly the command we were looking for. If you press tab at this point, the autocomplete fills in the line:

$ history
Photo by Wil Stewart on Unsplash

I’ve always struggled with both Linux and Git but, over time, I’ve managed to remember a few key commands that have helped my development as an independent researcher. I can work fairly independently now and it’s thanks to the above command line tools that I’m able to do so.

Therefore, I really recommend spending a few hours getting used to Linux, as the small lessons you take in now will really help progress your use of the system going forward. It’s pure upside!


Thanks again! If you have any questions or need any help, please message =]

Keep up to date with my latest work here!

What does the keyword “yield” do in Python?

Handling Python Memory Issues when faced with Big Data

Smileys [Pixabay]

As the programming language Python develops over time, added functionality improves both its usability and performance. Python has become one of (if not) the foremost languages in Data Science, and its handling of big data sets is one of the reasons why.

It’s no wonder that the language is highly favoured, with libraries like Pandas and core functionality like generators that are both so easy to use. Commentary does exist about newer languages like Julia or Go taking precedence but, regardless, Python is here for the long run.


Researchers from all corners of academia complain about memory issues. Large data sets are inherent to fields like Bioinformatics, Finance and, broadly, Machine Learning, so efficient and effective memory handling is required as standard.

With Big Data comes Big Memory Issues

Let’s take the following example. Say we want to count how many rows are in a file so we write some inefficient code as follows:

def read_csv(file_name):
    # Naive version: read every line of the file into one big list in memory
    with open(file_name, "r") as f:
        return f.readlines()

open_file = read_csv("standard_csv_file.csv")
row_count = 0

for row in open_file:
    row_count += 1

print("Row count is {}".format(row_count))

Now looking at this example, the function read_csv opens the file and loads all the contents into open_file. Then the program iterates over this list and increments row_count.

However, once the file size grows big (let’s say, to 2GB), does this piece of code still work as well? What if the file size is larger than the memory you have available?

Now, unfortunately, if the file size is stupendously large, you’ll probably notice your computer slows to a halt as you try to load it in. You might even need to kill the program entirely.

So, how can you handle these huge data files?

Generator functions allow you to declare a function that behaves like an iterator. Once an item has been yielded from the iterator, it’s not expected to be used again and can be cleared from memory. That means at any one point in time you only hold one item in memory, rather than the entire data set.

More Smileys [Pixabay]

So in terms of counting rows in a file, we now load up one row at a time instead of loading the whole file all at once. To do this, we can simply rework the code and introduce the keyword yield:

def read_csv(file_name):
    for row in open(file_name, "r"):
        yield row

By introducing the keyword yield, we’ve essentially turned the function into a generator function. This new version of our code opens a file, loops through each line, and yields each row.
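Putting it together, the row counting loop barely changes, but now only one row is ever held in memory at a time (assuming the same file name as before):

row_count = 0
for row in read_csv("standard_csv_file.csv"):  # read_csv is now a generator
    row_count += 1

print("Row count is {}".format(row_count))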

The Python Yield Statement

When the code you’ve written reaches the yield statement, the program will suspend execution there and return the corresponding value to you. Now when a function is suspended in this case, the state of the function is saved somewhere magical. Everything linked to the state of that function is saved, including any variable bindings local to the generator, the instruction pointer, the internal stack, and any exception handling.

Let’s make a concrete example. If the yield statement pauses the code and suspends execution, then asking for the next item should continue where it left off, so let’s make a function with multiple yield statements:

>>> def double_yield():
...     yield "This will print string number one"
...     yield "This will print string number two"

>>> double_obj = double_yield()
>>> print(next(double_obj))
This will print string number one
>>> print(next(double_obj))
This will print string number two
>>> print(next(double_obj))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

Given the advent of Big Data, large data sets are incredibly prevalent these days so memory-efficient coding is a must for Data Scientists and Machine Learning practitioners alike.

The article above highlights the key benefits of using generators, and the example shows a case where a generator is clearly favourable, greatly improving memory handling when faced with large data sets.

Generators have become an integral part of my coding and as a practitioner myself, I encourage you to try them out!


Thanks again! Please send me a message if you have any questions! =]

How to Derive an OLS Estimator in 3 Easy Steps

Mohammad Hasan on [Pixabay]

A Data Scientist’s Must-Know

OLS Estimation was originally derived in 1795 by Gauss. Aged 17 at the time, the genius mathematician was attempting to define the dynamics of planetary orbits and comets and, in the process, derived much of modern day statistics. Now the methodology I show below is a lot simpler than the method he used (a form of Maximum Likelihood Estimation) but can be shown to be equivalent.

I, as a Statistician, owe a lot to the forefathers of Physics.

They derived much of what we know out of necessity. A lot of assumptions had to be made because of their imprecise measuring instruments: unlike today, they couldn’t measure very much, or very well at all.

The advances they made in Mathematics and Statistics are almost holy given the depth they explored with so few resources. At the time, very few other people understood their work, but it’s because of their advances that we are where we are today.

To the present: OLS Regression is something I actually learned in my second year of undergraduate studies which, as a Mathematical Economist, felt pretty late but I’ve used it ever since.

I like the matrix form of OLS Regression because it has quite a simple closed-form solution (thanks to being a sum of squares problem) and as such, a very intuitive logic in its derivation (that most statisticians should be familiar with).

Moreover, knowing the assumptions and facts behind it has helped in my studies and my career. So from my experience at least, it’s worth knowing really well.

So, from the godfathers of modern Physics and Statistics:

I give to you, OLS Regression.


The goal of OLS Regression is to define the linear relationship between our X and y variables, where we can pose the problem as follows:
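In matrix notation (with y an n x 1 vector of observations, X an n x k matrix of regressors, Beta a k x 1 coefficient vector and an error term), the model can be written as:

y = X\beta + \varepsilon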

Now we can observe y and X, but we cannot observe Beta. OLS Regression attempts to estimate Beta.

Beta is very important.

It explains the linear relationship between X and y, which is easy to visualise directly:

The red line is also known as the ‘line of best fit’ where the slope of the red line is what we’re trying to define. [source]

Beta essentially answers the question: “if X goes up, how much can we expect y to go up by?”. For example, how much does the weight of a person go up if they grow taller?

5 OLS Assumptions

Now before we begin the derivation to OLS, it’s important to be mindful of the following assumptions:

  1. The model is linear in the parameters
  2. No Endogeneity in the model (independent variable X and e are not correlated)
  3. Errors are normally distributed with constant variance
  4. No autocorrelation in the errors
  5. No Multicollinearity between variables

Note: I will not explore these assumptions now, but if you are unfamiliar with them, please look into them or message me as I look to cover them in another article! You can reference this in the meantime.

Now, onto the derivation.

GDJ on [pixebay]

Step 1 : Form the problem as a Sum of Squared Residuals

In any form of estimation or model, we attempt to minimise the errors present so that our model has the highest degree of accuracy.

OLS Regression is shown to be MVUE (explained here), and the rationale as to why we minimise the sum of squared (as opposed to, say, cubed) residuals is both simple and complicated (here and here), but it boils down to maximising the likelihood of the parameters given our sample data, which gives an equivalent result (albeit via a more complicated derivation).

With this understanding, we can now formulate an expression for the matrix method derivation of the linear regression problem:

which is easy to expand:
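In standard notation, the sum of squared residuals and its expansion are:

S(\beta) = (y - X\beta)^\top (y - X\beta) = y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X \beta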

Step 2: Differentiate with respect to Beta

As we are attempting to minimise the sum of squared errors, which is a convex function of Beta, we can differentiate with respect to Beta and equate the result to zero:
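In symbols, setting the gradient to zero gives:

\frac{\partial S(\beta)}{\partial \beta} = -2 X^\top y + 2 X^\top X \beta = 0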

Step 3: Rearrange to equal Beta

Now that we have our differentiated function, we can then rearrange it as follows:

and rearrange again to derive our Beta with a nice closed form solution.
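The two rearrangements are, in standard notation:

X^\top X \hat{\beta} = X^\top y \quad \Rightarrow \quad \hat{\beta} = (X^\top X)^{-1} X^\top y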

And there you have it!

The beauty of OLS regression is that because we’re minimising the sum of squared residuals (to the power 2), the solution is closed form. If it wasn’t to the power 2, we would have to use alternative methods (like optimisers) to solve for Beta. Moreover, changing the power alters how much it weights each datapoint and therefore alters the robustness of a regression problem.


Ultimately, this derivation hinges on the problem being a sum of squares problem and on the OLS assumptions holding, although these are rarely reasons not to use the method. Most problems are defined as such and therefore the above methodology can be (and is) used widely.

However, it’s important to recognise these assumptions exist in case features within the data allude to different underlying distributions or assumptions. For example, if your underlying data has a lot of anomalies, it may be worthwhile using a more robust estimator (like Least Absolute Deviation) than OLS.


Hope you enjoyed reading and thanks again! If you have any questions, please let me know and leave a comment!

Flask’s Latest Rival in Data Science

Photo by Fotis Fotopoulos on Unsplash

Streamlit Is The Game Changing Python Library That We’ve Been Waiting For

Developing a user interface is not easy. I’ve always been a mathematician and, for me, coding was a functional tool to solve an equation and to create a model, rather than to provide the user with an experience. I’m not artsy, nor am I actually that bothered by it. As a result of this, my projects always remained, well, projects. It’s a bit of a problem.

As one’s own journey goes, I often need to do a task that’s outside of my domain: usually to deploy code in a manner by which other people can use it. Not even to launch the next ‘big thing’, but just to have my mother, sister or father use a cool little app I built that recommends new places to eat. The answer always required more effort than I wanted to put in, and it used to look like this:

  1. Develop a novel solution (my speciality) — [This I can do]
  2. Design a website using a variety of frameworks that require months of education [This I cannot do]
  3. Deploy code to a server on some web domain [This I can do]

So (2) is where I always lacked motivation because in reality, it wasn’t my speciality. Even if I did find the motivation to deploy some code, the aesthetics of my work would render it unusable.

The problem with using a framework like Flask is that it just requires way too much from the individual. Check this blog here: clearly it’s a lot to have to navigate just to build a small, nifty website that can deploy some code. It would take ages.

And that’s why Streamlit is here.


On the comparison between Flask and Streamlit: a reader noted that Flask has capabilities well in excess of Streamlit’s. I appreciate this point and would encourage users to look at their use cases and pick the right technology. For users who need a tool to deploy models for their team or clients, Streamlit is very efficient; for users who require more advanced solutions, Flask is probably better. Competitors of Streamlit also include Bokeh and Dash.

Streamlit

This is where Streamlit comes into its own, and why they just raised $6m to get the job done. They created a library, off the back of existing Python tooling, that allows users to deploy functional code. Kind of similar to how TensorFlow works: each new element in the Streamlit UI corresponds to a new function being called in the Python script.

Take, for example, the following 6 lines of code, in which I call Streamlit’s “title” method, a “write” method, a “selectbox” method and another “write” method:

import streamlit as st
st.title('Hello World')
st.write('Pick an option')
keys = ['Normal', 'Uniform']
dist_key = st.selectbox('Which Distribution do you want?', keys)
st.write('You have chosen {}'.format(dist_key))

Save that into a file called “test.py”, then run “streamlit run test.py” and it produces the following in your browser at http://localhost:8501:

The code above produced this. Fantastic how efficient Streamlit’s library makes UI programming.

Now this is awesome. It’s both clean to look at and clearly efficient to create.

Jupyter Notebooks are another successful “alternative”, but they’re a bit different. Notebooks are better as a framework for research or report writing; however, there’s little you can do in the way of actually letting someone else use your code, as it’s impractical to hand someone a notebook. Colab kind of bridges that gap, but it’s still not as clean.

Streamlit fills this void by giving the user the ability to deploy code easily so the client can use the product. For those of us who like making small things, this has always been an issue.

Ease of Use

Ok so let’s create something that we may actually want someone else to use.

Let’s say I want to teach my nephew about distributions. I want to make an app that he can use where he selects a distribution, and then it draws a line chart of it. Something as simple as the following:

Code showing how to create this is provided below

In this example, you can see that the user has a choice between two items in a drop-down menu, and when he selects either, the line chart updates accordingly. Taking a step back, I’m providing the user with:

  1. Some Information about a problem
  2. The user then has the ability to make a choice
  3. The corresponding chart is then returned to the user

Now in Flask, something like the above could easily require hundreds of lines of code (before even getting to the aesthetics), however Streamlit achieves it in a negligible amount. Note that the above required only the following ~11 lines of code:

import streamlit as st
import numpy as np

# Write a title and a bit of a blurb
st.title('Distribution Tester')
st.write('Pick a distribution from the list and we shall draw a line chart of a random sample from the distribution')

# Make some choices for a user to select
keys = ['Normal', 'Uniform']
dist_key = st.selectbox('Which Distribution do you want to plot?', keys)

# Logic of our program
if dist_key == 'Normal':
    nums = np.random.randn(1000)
elif dist_key == 'Uniform':
    nums = np.array([np.random.randint(100) for i in range(1000)])

# Display the result to the user
st.line_chart(nums)

I find it amazing because the amount of code required is so small to produce something that actually looks and works pretty well.

For anyone who’s played around with UI development before, you’ll know how difficult it is to achieve something of this quality. With Streamlit producing an open-source framework for researchers and teams alike, development time has been immensely reduced. I cannot emphasise this point enough.

Given this, no Data Scientist or Machine Learning Researcher can ever complain about not being able to deploy work. Nor can they complain about getting an MVP running. Streamlit have done all the hard work.

Amazing job guys!


Thanks for reading! If you have any messages, please let me know!

Keep up to date with my latest articles here!

The Sampling Distribution of Pearson’s Correlation

Pearson’s Correlation reflects the dispersion of a linear relationship (top row), but not the angle of that relationship (middle), nor many aspects of nonlinear relationships (bottom). [Source]

How a Data Scientists can get the most of this statistic

People are quite familiar with the colloquial usage of the term ‘correlation’: it tends to describe a phenomenon where ‘things’ move together. If a pair of variables are said to be (positively) correlated, then when one variable goes up, it’s quite likely that the second variable will go up as well.

Pearson’s Correlation is the simplest form of the mathematical definition in that it uses the covariance between two variables to discern a statistical, albeit linear, relationship. It looks at the dot product between the two (centred) vectors of data and normalises this summation: the resulting metric is a statistic bounded between +/- 1. A positive (negative) correlation indicates that the variables move in the same (opposite) direction, with +/- 1 at the extremes indicating that the variables move in perfect harmony. For reference:

Pearson’s Correlation is the covariance between x and y, over the standard deviation of x multiplied by the standard deviation of y.

Where x and y are the two variables, ⍴ is the correlation statistic, σ_x and σ_y are the standard deviations, and σ_xy is the covariance.
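In symbols:

\rho_{x,y} = \frac{\sigma_{xy}}{\sigma_x \, \sigma_y}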

It’s also interesting to note that the OLS Beta and Pearson’s Correlation are intrinsically linked. Beta is mathematically defined as the covariance between two variables over the variance of the first: it attempts to uncover the linear relationship between a set of variables.

OLS Beta is intrinsically linked to Pearson’s Correlation

The only difference between the two metrics is a scaling ratio based on the standard deviation of each variable: (sigma x / sigma y). This normalises the boundaries of the beta coefficient to +/- 1, thereby giving us the correlation metric.
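Concretely:

\beta = \frac{\sigma_{xy}}{\sigma_x^2}, \qquad \rho_{x,y} = \beta \cdot \frac{\sigma_x}{\sigma_y}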

Let’s now move onto the Sampling Distribution of Pearson’s Correlation

Expectation of Pearson’s Correlation

Now we know that a sample variance adjusted with Bessel’s Correction is an unbiased estimator. Pearson’s correlation is effectively a ratio of such quantities and, while a ratio of unbiased estimators is not exactly unbiased, the bias is small and shrinks as the sample size grows, so in expectation the sample correlation sits approximately at the true value:

Expectation of Pearson’s Correlation

Standard Error of Pearsons Correlation

A problem with the correlation coefficient occurs when we sample from a distribution involving highly correlated variables, not to mention the changing dynamics as the number of observations changes.

These two factors, which are intrinsic to the calculation of the correlation coefficient, can really complicate matters at the extremes, which is why empirical methods like permutation or bootstrap methods are used to derive a standard error.

These are both relatively straightforward and can be referenced elsewhere (note here and here). Let’s quickly go through a bootstrap method.

Say we have two time series of length 1000 (x and y). We take a random subset of n samples from x, and the corresponding samples from y, to get x* and y*, which are subsets of the original data (drawn with replacement). From there, we calculate a Pearson’s correlation to make one single data point. We re-run this, say, 10,000 times and now have a vector of 10,000 bootstrapped correlations. From this, we calculate the standard deviation of these 10,000 samples, which gives the standard error of our metric.

As an example, here I take two random normal variables of length 10,000. I subsample 100 data points and calculate the correlation 1000 times over. I then calculate the standard deviation of these estimates and empirically derive a standard error of roughly 0.1 (code at the end):

Now this empirical derivation is definitely practical when you’re unsure of the underlying distribution of the pairs of variables. Given that we’re sampling from bivariate normal and independent data, we can also approximate the standard error by looking at Fisher’s Transform as follows:

Standard Error of Fisher’s Transform
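In standard notation, the transform and its approximate standard error are:

z = \operatorname{arctanh}(\rho) = \tfrac{1}{2} \ln\!\left(\frac{1+\rho}{1-\rho}\right), \qquad \operatorname{SE}(z) \approx \frac{1}{\sqrt{n-3}}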

which in the above case would be approximately 0.10. A great reference on Fisher’s Transform is here; it deserves an article in itself, so I will not go into detail.

Moreover, if we want to retain the functional form of the correlation coefficient itself, we can also derive the empirical standard error of the correlation coefficient as:

Standard Error of Pearson’s Correlation

which is approximately sqrt((1-⍴²)/n); with ⍴ = 0 and n = 100, this is sqrt(1/n) = 1/sqrt(n) = 0.1. Again, I will not go into detail here as this derivation is lengthy, but another great reference is here.

The important thing to note here is that these three results converge because the series we’re testing have an underlying normal distribution. If this wasn’t the case, then you would see the analytical approximations (the normal-based formula and the one from Fisher’s transform) begin to diverge from the empirical estimate.

Note: I implore the reader to focus on empirical methods when estimating standard errors for correlations. Distributions can shift under the surface, so it’s much more reliable to use empirical methods like permutation or bootstrapping.


The correlation coefficient is an incredibly powerful tool for discerning relationships between variables. It provides information in a useful manner although, as discussed, its distributional properties are quite sensitive to the underlying data. If you can write a function that samples quickly, then the empirical methods will really set you in the right direction.


Thanks again! Please message me if you need anything 🙂


Code

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Two independent standard normal variables of length 10,000
all_data = pd.DataFrame(np.random.randn(10000, 2))
n = 100          # size of each bootstrap subsample
nsamples = 1000  # number of bootstrap replications

# The correlation of each subsample (drawn with replacement) forms one bootstrap draw
bootstraps = pd.DataFrame([all_data.sample(n, replace=True).corr()[0][1] for m in range(nsamples)])

sns.distplot(bootstraps)
plt.title('Standard Error is=[{:.3f}]'.format(bootstraps[0].std()))
plt.grid()

Plotting with Seaborn in Python

Figure 0: Pair Plot using Seaborn — [more information]

4 Reasons Why and 3 Examples How

import seaborn as sns

Finding a pattern can sometimes be the easy bit of research, so let’s be honest: conveying a pattern to the team or your customers is often a lot more difficult than it should be. Not only are some patterns hard to explain in layman’s terms (try explaining a principal component to non-mathematicians), but sometimes you’re trying to convey the dependency of a conditional joint distribution… say what?

Charting is imperative to our job as researchers, so we need to be able to convey our story well. Without this, our knowledge and findings carry much less weight; with the best visuals, we can be sure our story lands as well as it possibly can.

In the following article, I’ll discuss Seaborn and why I prefer it to other libraries. I’ll also give my top 3 charts that I use daily.

Why Seaborn

Popular Python charting libraries are surprisingly few and far between because it's hard to make a one-size-fits-all package: think of Matplotlib, designed to mirror Matlab's output, or ggplot, ported over from its R counterpart.

As to reasons why I prefer Seaborn against other top libraries:

  1. Seaborn requires a lot less code than Matplotlib to make similar high-quality output
  2. Chartify's visuals aren't that great (sorry Spotify — it's just a bit too blocky).
  3. ggplot doesn’t seem to be native to Python so it feels like I’m always stretching to make it work for me.
  4. Plotly has a 'community edition', which makes me uncomfortable because of the licensing worry, so I generally stay away from anything involving legal sign-offs. In terms of design and functionality it's actually pretty good and has a broad set of offerings, but for the added headache, it's not that much (if at all) better than Seaborn.

Most importantly, a researcher spends a lot of their time plotting distributions, and if you can't plot distributions easily, your plotting package is essentially redundant. Seaborn overlays histograms and KDEs perfectly, which other packages really struggle to do (Plotly is the exception here).

Finally, Seaborn has the whole design side of things covered which leaves you, the researcher, with more time to research. Matplotlib sucks for visuals and Chartify is too blocky for my liking.

I’m going to keep my conclusion short and sweet: Seaborn is awesome. There’s no hiding that I use it a lot more than other libraries and recommend you to do the same. Let’s now move onto some charts that I use daily.


Univariate Distribution

If you've found a random variable whose distribution makes for an interesting story, then Seaborn's distplot function works great. It helps to convey the picture by showing:

  1. The underlying empirical distribution in the form of a histogram
  2. A Kernel that’s been approximated over the top to give a smoothed picture

The colours (a nice translucent unoffensive blue) with the grid lines and clear fonts make for a simple and effective offering!

Figure 1: Univariate Distribution of Random Numbers — 

Joint Distribution

Here we try to convey a bit more of a complicated dynamic. We have two variables that we feel should be related but how can we visualise this relationship?

The two distributions plotted on the sides of the chart are great for seeing what the marginal distributions look like, while the central density plot is perfect for identifying those areas where a concentration of density arises.

Figure 2: Joint Distribution between two Random Variables — 

I use this plot in both my research and my decks, as it allows me to keep the univariate dynamics (with the kernel plots) and the joint dynamics at the forefront of my thoughts and my audience's: all whilst conveying the story I'm trying to paint. It's been super useful in layering discussion and I'd highly recommend it.


Box and Whisker Plots

The problem with distributional plots is that they can often get skewed by outliers, which really distorts a story unless you know those outliers exist and deal with them in advance.

Box plots are so widely used because they're an effective way to display robust metrics like the median and the interquartile range, which are much more resilient to outliers (due to their high breakdown point).

Seaborn's implementation of the box plot looks fantastic as it's able to convey a fairly complicated story by highlighting a number of dimensions, whilst also looking good enough to be fit for an academic journal. Moreover, Seaborn does a fantastic job of keeping the code incredibly concise, so the researcher doesn't have to spend time tweaking the plot to make it readable.

Figure 4: Box and Whisker Plot — 

Being able to discern and discuss a multitude of features and patterns at the same time is imperative to the success of your research, so I highly recommend using this chart. At the same time, you need to make sure you target the chart for your audience: at times you don’t want to go into too much detail!


In the above article, I broadly discussed why, for me, Seaborn is the best plotting package, and I gave my top 3 examples of charts that I use. I'm a strong believer in conveying a message in an easy and understandable manner: the fewer words the better! Cogency is key!

These charts make it so much easier for you to do that and so if you’re a visual thinker, a storyteller, or if you love to see the big picture, then Seaborn is for you.


Thanks again, message me if you have any questions!


Code

The following pieces of code are the simple snippets to recreate the awesome charts above!

Figure 0: Pair Plotting

import seaborn as sns
df = sns.load_dataset("iris")
sns.pairplot(df, hue="species")

Figure 1: Univariate Distribution

import numpy as np
import seaborn as sns
x = np.random.normal(size=100)
sns.distplot(x)

Figure 2: Joint Distribution

import numpy as np
import pandas as pd
import seaborn as sns
mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)
df = pd.DataFrame(data, columns=["x", "y"])
sns.jointplot(x="x", y="y", data=df, kind="kde")

Figure 3: Lots of Joint Distributions

import seaborn as sns
iris = sns.load_dataset('iris')
g = sns.PairGrid(iris)
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.kdeplot, n_levels=6)

Figure 4: Box and Whisker Plot

import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks")
# Initialize the figure with a logarithmic x axis
f, ax = plt.subplots(figsize=(7, 6))
ax.set_xscale("log")
# Load the example planets dataset
planets = sns.load_dataset("planets")
# Plot the distance with horizontal boxes
sns.boxplot(x="distance", y="method", data=planets, whis="range", palette="vlag")
# Add in points to show each observation
sns.swarmplot(x="distance", y="method", data=planets, size=2, color=".3", linewidth=0)
# Tweak the visual presentation
ax.xaxis.grid(True)
ax.set(ylabel="")
sns.despine(trim=True, left=True)

The Power-Law Distribution

Pareto’s Power-Law Distribution [source]

Explaining the Laws of Nature (Including the Golden Ratio)

The laws of nature are complicated, and throughout time scientists from all corners of the world have attempted to model and re-engineer what they see around them to extract some value from it. Quite often we see a pattern that comes up time and time again: be it the golden ratio, or be it that fractal spiral.

In its empirical form, the power law describes how, most of the time, events are small, yet every so often they span an enormous range of magnitudes. Think about the following relatable examples:

  1. Number of Comments on a Post
  2. Number of Social Media Followers
  3. Money Grossed at Box Offices
  4. Books sold for Top Authors
  5. Market Capitalisation of American Companies

They all seem to fit the pattern, but we can also see these examples widely in the form of natural phenomena:

  1. Ratio of Surface Area to Volume
  2. Fractal Geometry
  3. Initial Mass Function of Stars
  4. Distribution of Wealth

There are plenty more examples, but the real question is: what is actually driving these skewed phenomena? Why is there so much density at low values, and why, at other times, do we get incidences far out in the tail?

Let's first cover the mathematics of it, before discussing it in more detail.


Probability Distribution of Power-Laws

A random variable that follows a power-law distribution has a density of the form p(x) = C·x^(−α) for x ≥ x_min,

where C is the normalisation constant, C = (α−1)·x_min^(α−1), and x_min is the smallest value for which the power law holds. This equation only really makes sense if α > 1, but in a more compact form we can also write it as follows:

Note that if we apply a logarithmic function to this functional form, it becomes linear: log p(x) = log C − α·log x, i.e. a straight line with slope −α on a log-log plot.

Scale Invariance of Power Law Distributions

So if we compare the densities at p(x) and at some other p(c x), where c is a constant, we find that they’re always proportional. That is, p(c x) ∝ p(x). This behaviour shows us that the relative likelihood between small and large events is the same, no matter what choice of “small” we make.

Note: I’m not going to go into detail about the moments of Power Distributions because the phenomena surrounding these ‘infinite’ moments is fascinating and requires an article of its own!

Let’s look back at some examples and see what we find: Pinto et al (2012) show a few examples where we can really see the distribution come into its own:

Log-log plot of distribution of wealth: Pinto et al (2012): [Source]

and here for forest fires:

Log-log plot of distribution of forest fires: Pinto et al (2012): [Source]

It's striking how well the power-law distribution fits these phenomena and how linear the log-log plots are. I encourage the reader to try this out for themselves: it's always surprising what you find!
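As a starting point, here is a minimal sketch (assuming numpy and matplotlib; the exponent α and cut-off x_min are purely illustrative) that samples from a power law and plots the empirical density on log-log axes, where it should appear roughly as a straight line:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
alpha, x_min = 2.5, 1.0   # illustrative exponent and lower cut-off

# numpy's pareto draws a Lomax variable; (1 + draw) * x_min follows p(x) ~ x^(-alpha)
samples = x_min * (1 + rng.pareto(alpha - 1, size=100_000))

# Histogram on logarithmically spaced bins, plotted on log-log axes
bins = np.logspace(np.log10(samples.min()), np.log10(samples.max()), 50)
density, edges = np.histogram(samples, bins=bins, density=True)
centres = np.sqrt(edges[:-1] * edges[1:])

plt.loglog(centres, density, 'o')
plt.xlabel('x')
plt.ylabel('p(x)')
plt.title('Power-law samples form a straight line on a log-log plot')
plt.show()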


Golden Ratio

The 'Golden Ratio', or '80–20' rule, exists as a colloquial natural phenomenon. It postulates things like: 20% of the world's population own 80% of the wealth. Let's assume for a second that wealth follows the power law above, characterised by some α. What fraction W of the total wealth is held by the richest fraction P of the population?

Now we can integrate the power-law density above to derive the fraction of the population whose wealth is at least x, given by the complementary cumulative distribution function: P = (x/x_min)^(−(α−1)).

Moreover, the fraction of total wealth held by those people is given by: W = (x/x_min)^(−(α−2)),

where α > 2. If we now solve the first equation for x/x_min and substitute it into the second, we find an expression that does not depend on wealth (x) at all: W = P^((α−2)/(α−1)).

Now this is crazy to me: by making small assumptions about the distributional properties of wealth distribution, we can remove wealth from the equation and still show how wealth is spread. This extreme top-heaviness is sometimes called the “80–20 rule,” meaning that 80% of the wealth is in the hands of the richest 20% of people.

As an example, say we want to know how much wealth the richest 20% of people own. Let's set α = 2.2, so (α−2)/(α−1) = 0.2/1.2 ≈ 0.167. Then we set P = 20%, so W = 0.2^0.167 ≈ 76%, which is not far off 80%! Funnily enough, this is actually a pretty good fit for society.
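As a quick sanity check on that arithmetic, here is a tiny sketch (plain Python; the α values are illustrative) that evaluates the relation W = P^((α−2)/(α−1)) from above:

# Fraction of total wealth W held by the richest fraction P of the population,
# using the relation W = P ** ((alpha - 2) / (alpha - 1)) derived above
def wealth_share(P, alpha=2.2):
    return P ** ((alpha - 2) / (alpha - 1))

print(f"{wealth_share(0.20):.2f}")              # ~0.76: the richest 20% hold roughly 76% of the wealth
print(f"{wealth_share(0.20, alpha=2.05):.2f}")  # alpha closer to 2 is even more top-heavy (~0.93)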

Note that the relationship skews as we change the value of α: as α approaches 2, the exponent goes to zero and the distribution becomes so extreme that essentially all of the wealth is held by a vanishingly small fraction of the population.

It's exactly because this functional form is so prevalent in nature and so elegant that we can summarise its characteristics as '80–20'. It's not an exact science, but social science rarely is. However, deriving an α for these social dynamics goes a long way in telling us exactly how these natural phenomena arise and behave.


The Power-Law Distribution is phenomenal because small insights into its functional form can lead to incredibly detailed explanations of natural phenomena.

From statistical physics to man-made artefacts like comments on blogs, these phenomena all seem to share an underlying structure. Moreover, its scale invariance and its linearity on a log-log plot show how accessible these seemingly complicated models can be.

I’ve only scratched the surface of this model, but please keep reading as there’s so much more out there.


Thanks again for reading! If you have any questions, please message!

The Student t-Distribution

Probability Density Function for the Student t-Distribution.

For the Sake of Statistics, forget the Normal Distribution.

To be clear: This is targeted at Data Scientists/Machine Learning Researchers and not at Physicists

Statistical normality is overused. It's not as common as we pretend and only really occurs in impractical 'limits' [2][3][4]. To obtain normality, you need a substantial, well-behaved, independent dataset (the CLT), but most research projects face small sample sizes and dependent data. We tend to have messy data that we fudge to look normal when, in fact, those 'anomalies' in the extremes are telling us something is up.

Lots of things are ‘approximately’ normal. That’s where the danger is.

The point of this article is not to talk about kurtosis, but rather to discuss why phenomena within society do not follow the normal distribution, why extreme events are more likely, and why relaxing certain constraints makes you realise that the Student t-distribution is more prevalent than you might think. Most importantly, it asks why researchers assume normality when the data just isn't normal.

Unlikely events are actually more likely than we expect and not because of kurtosis, but because we’re modelling the data wrong to begin with. Say that we run a forecast on the weather over a 100 year period: is 100 years of data enough to assume normality? Yes? No! The world has been around for millions of years. We are underestimating the tails simply because we’re not considering the limits of our data. 100 years is just a sample of the entire historical data: much of which we’ll never see.

Moreover, Limpert and Stahl (2011, which we discuss later: [4]) describe this exact phenomenon: a symmetric normal distribution is often assumed, tests are run and conclusions are drawn, yet researchers have clearly misinterpreted results because of their trust in symmetric normality.

As a statistician, it’s important to know what you don’t know and the distribution that you’re basing your inferences on. It’s important to remember we primarily use a normal distribution for inferences because it offers a tidy closed form solution (note OLS Regression, Gaussian Processes etc), but in reality, the difficulties in solving harder distributions are why, at times, they make better predictions.

Firstly, let’s talk about the Mathematics of t-Distributions.

Please skip the maths to jump to more discourse


Where did the Student t-distribution come from?

While working at the Guinness Brewery in Dublin, Ireland, William Sealy Gosset published a paper in 1908 under the pseudonym 'Student' detailing his statistical work on the "frequency distribution of standard deviations of samples drawn from a normal population". There's a debate about who actually came up with the Student t-distribution first (work by Helmert and Lüroth in the 1870s came a bit earlier), so let's focus on the maths.

Assume we have a random variable with mean μ and variance σ², drawn from a normal distribution. If we take the sample mean of n observations (say x̄), then the variable z = (x̄ − μ) / (σ/√n) is normally distributed, but now with a mean of 0 and a variance of 1. We've normalised the variable, or, we've standardised it.

However, imagine we have a relatively small sample size (say n < 30) drawn from a greater population and we don't know σ. Our estimate of the mean stays the same, but we have to plug in the sample standard deviation instead, which has a denominator of n−1 (Bessel's correction).

Because of this, our attempt to normalise our random variable has not resulted in a standard normal, but rather has resulted in a variable with a different distribution, namely: a Student-t Distribution of the form:

t = (x̄ − μ) / (s/√n), where x̄ is the sample mean, μ is the population mean, s is the sample standard deviation, and n is the number of samples.

This is significant because it tells us that even though something may be normally distributed, in small samples the dynamics of sampling from this distribution change completely, driven largely by having to estimate the standard deviation from the sample itself (with Bessel's correction).

I find it amazing that a small nuance in the formula for the variance has such far-reaching consequences for the distributional properties of the statistic.
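To see this in action, here is a minimal sketch (assuming numpy and scipy; the sample size n = 5 is deliberately tiny and purely illustrative) that standardises many small-sample means with the sample standard deviation and shows the tails are fatter than the standard normal would suggest:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5                 # deliberately small sample size
n_sims = 100_000

# Draw many small samples from a normal population and standardise each mean
# with the *sample* standard deviation (Bessel-corrected, ddof=1)
samples = rng.normal(size=(n_sims, n))
t_stats = samples.mean(axis=1) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

# Tail probability beyond |2|: noticeably larger than the normal value
print("empirical P(|t| > 2) :", np.mean(np.abs(t_stats) > 2))
print("Student t (df = 4)   :", 2 * stats.t.sf(2, df=n - 1))
print("standard normal      :", 2 * stats.norm.sf(2))

With df = 4, roughly 11–12% of the statistics land beyond ±2, against about 4.6% for a standard normal.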


What are Degrees of Freedom?

Degrees of freedom are a combination of how much data you have and how many parameters you need to estimate. They show how much independent information goes into a parameter estimate. In this light, you want a lot of information going into parameter estimates in order to obtain more precise estimates and more powerful hypothesis tests (note how the sample variance behaves in relation to the number of samples). So to make better estimates you want many degrees of freedom; usually, the degrees of freedom correspond to your sample size (more precisely, to the n−1 of Bessel's correction).

Varying the number of degrees of freedom of a Student t-distribution

Degrees of Freedom are important to Student t-Distributions as they characterise the shape of the curve. The more degrees of freedom you have, the more your curve looks bell shaped and converges to a standard normal distribution.


Proof of Convergence to Normal Distribution

The probability density function for the t-distribution is complex, but here it is:

f(t) = Γ((ν+1)/2) / (√(νπ)·Γ(ν/2)) · (1 + t²/ν)^(−(ν+1)/2)

where ν is the number of degrees of freedom and Γ is the gamma function.

Its properties can be quite fascinating: in fact, the Student t-distribution with ν = 1 is exactly the Cauchy distribution, while at the other end of the spectrum the t-distribution approaches a normal distribution as ν grows large (in practice, ν > 30). The proof is as follows. If Xn is a t-distributed variable, it can be rearranged to show that the variable can be written as follows:

I wish I could make these formulae smaller

where Y is a standard normal variable and X²n is a chi-squared random variable with n degrees of freedom, independent of Y. Separately, we know that X²n can be written as a sum of squares of n independent standard normal variables Z₁, …, Zₙ:

and when n tends to infinity, the following ratio of the chi-squared variable

converges in probability to E[Zᵢ²] = 1 by the law of large numbers.

Moreover, as a consequence of Slutsky’s theorem, Xn converges in distribution to X=μ+σY, which, thus, is normal. Further supplementary material can be found here.


Examples of failures of the Normal Distribution

I've explained in the introduction why the normal distribution isn't relevant a lot of the time, and I've shown how the Student t-distribution is intrinsically related to both the Cauchy and the normal distribution. Now let's look at examples where the normal distribution (and its dynamics) is assumed to be fundamental but where, in reality, results are skewed, which calls into question any reliance on normal assumptions.

The article here discusses the case of being '95% confident'. Assuming a normal distribution, 95% of your results should fall within 2σ of your mean (symmetrically), so any results outside this range are anomalous. Limpert and Stahl (2011) show that skewness in real data distorts the symmetry assumed by authors across a number of different fields, so that miscalculated likelihoods plague several disciplines.

Fields where this has caused problems are as wide-ranging as you can imagine.

Moreover, Potvin and Roff (1993) argue the case for non-normality being more prevalent in ecological data and look at alternative non-parametric statistical methods, with Micceri (1989) close behind, comparing the prevalence of the normal distribution in psychometric measures to that of the unicorn and other improbable creatures.

These are serious accusations, and once you go through the literature with a fine-tooth comb, it becomes pretty clear that not all that seems normal is.


In short, I've explained why statistical normality is overused and why it has caused so many failed out-of-sample experiments. We assume too much and are reluctant to let the data speak. Other times, we let the data speak too much and ignore its limitations.

We need to think more about the practical limitations of our data, but also the fundamental limitations of any distribution we assume. By making realistic assumptions about the full shape of the data, we would see that more conservative estimates tend to perform significantly better out of sample.


Thanks for reading! Please message me if you have any more questions!


References

  1. Helmert FR (1875). “Über die Berechnung des wahrscheinlichen Fehlers aus einer endlichen Anzahl wahrer Beobachtungsfehler”.
  2. Potvin, C. & Roff, D.A. (1993). Distribution-free and robust statistical methods: Viable alternatives to parametric statistics? Ecology 74 (6), 1617–1628. Read
  3. Micceri, T. (1989) The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin 105 (1), 156–166. Read [free pdf]
  4. Limpert, Stahl (2011): Problems with Using the Normal Distribution — and Ways to Improve Quality and Efficiency of Data Analysis

Robust Statistical Methods

Anomalies hidden in plain sight. Chart from Liu and Nielsen (2016)

Methods that Data Scientists Should Love

A robust statistic is a type of estimator used when the distribution of the data set is not certain, or when egregious anomalies exist. If we’re confident on the distributional properties of our data set, then traditional statistics like the Sample Mean are well positioned. However, if our data has some underlying bias or oddity, is our Sample Mean still the right estimator to use?

Let’s imagine a situation where the data isn’t so friendly.

Let's take an example that involves the sample mean estimator. We know that the sample mean gives every data point a 1/N weight, which means that if a single data point is infinity, then the sample mean will also go to infinity, as that data point contributes ∞/N = ∞ to the sum.

This is at odds with our sample median, which is barely affected by any single value being ±∞. That's because the sample median does not apply weight to every data point. In fact, we can say that the sample median is resistant to gross errors whereas the sample mean is not.

A gross error is a data point that is misleading (usually 3σ or more)

In fact, the median will tolerate up to 50% gross errors before it can be made arbitrarily large; we say its breakdown point is 50% whereas that for the sample mean is 0%.

The breakdown point of an estimator is the proportion of gross errors an estimator can withstand before giving an abnormal result.
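Here is a minimal sketch (assuming numpy; the corrupted value is illustrative) of how a single gross error affects the two estimators very differently:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000)
print(np.mean(data), np.median(data))   # both close to 0

# Corrupt a single observation with a gross error
data[0] = 1e9
print(np.mean(data), np.median(data))   # the mean explodes (~1e6), the median barely moves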

Robust statistics are often favoured over traditional sample estimators due to their higher breakdown point. It's not unusual for data to contain anomalies if recording it involves some manual effort; under clean conditions, though, the mean and median should normally be quite close. So if you suspect that your underlying data contains some gross errors, it's worthwhile using a robust statistic.

Let’s first look at what outliers mean in terms of relative efficiency.


Relative Efficiency

Relative efficiency is the comparison between the variances of sample estimators. We previously saw that if data is well behaved, the variance of a sample estimator should go to 0 as n goes to ∞. We also saw that for normally distributed data, the sample mean has a higher efficiency (a lower variance) than the sample median. But what if the data is not normally distributed?

If we have Student t-distributed data with 5 degrees of freedom, the gap closes dramatically: the sample median's Asymptotic Relative Efficiency (ARE) relative to the sample mean rises to around 96%, and for even heavier tails the median becomes the better estimator of the population mean.

Let’s say we’re doing an example on stock returns: Stock returns have roughly student t-distributed data with about 5–7 degrees of freedom so given the above discussion, the median is a rather good metric here.

The Sample Median has a much higher degree of efficiency than the Sample Mean for Financial Data

If you can smell something fishy in your data, I recommend using methods with higher degrees of efficiency and higher breakdown points. Let’s look at robust regression methods.


M-Estimators in Robust Regression

OLS Regression applies a certain amount of weight to every datapoint:

Closed form derivation of OLS regression coefficient [source]

Say X ~ N(0,1) and Y is also ~ N(0,1). If X¹ = 1, its contribution to beta would be (X¹·Y¹)/(X¹·X¹) = Y¹. As Y¹ is also standard normal, we would expect this contribution to be of the order of ±1 (both variables have the same variance, so the regression coefficient behaves like a correlation).

However, say Y¹ was accidentally stored as 10,000 (you can blame the intern): the contribution of this single point to the beta estimate would jump from around 1 to 10,000! That's clearly not desired.

Regressions are thus very sensitive to anomalous data points (the influence of an error grows with its square), and given the above discussion, we would prefer an estimator with a higher breakdown point and a higher degree of efficiency. This ensures that our estimator doesn't get thrown around by rogue data points, so if a potential lack of normality in the data is worrying, the researcher should use robust estimation methods:

M-estimators are variants of Maximum Likelihood Estimation (MLE) methods. MLE methods attempt to maximise the joint-probability distribution whereas M-estimators try to minimise a function ⍴ as follows:

Solving the problem of M-Estimators [source]

The astute reader will quickly see that Linear Regression is actually a type of M-Estimator (minimise the sum of squared residuals) but it’s not fully robust. Below we have 4 other types of M estimators and more can be found here:

Different choices of functions for your M-Estimator [source]

As an example, Least Absolute Deviation (LAD) estimates the coefficients that minimise the sum of the absolute residuals, as opposed to the sum of squared errors. This means LAD has the advantage of being resistant to outliers and to departures from the normality assumption, despite being computationally more expensive.
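As a hedged illustration of the difference a robust loss makes, here is a small sketch using statsmodels (the data, the single corrupted point and the choice of a Huber M-estimator are all illustrative assumptions, not the article's own example):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x = np.linspace(-3, 3, n)
y = 1.0 * x + rng.normal(size=n)   # true slope of 1
y[0] = 10_000                      # a single gross error ("blame the intern")

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
huber = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()   # a Huber M-estimator

print("OLS slope  :", ols.params[1])     # dragged far from the true value of 1
print("Huber slope:", huber.params[1])   # stays close to 1

LAD itself could be fitted in a similar spirit via median (quantile) regression; the point is simply that the robust fit is barely moved by the gross error while OLS is dragged far away.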

As a practitioner, I would encourage researchers to try multiple methods because there's no hard and fast rule. It's much more convincing to demonstrate that several estimators give similar results than to present a sporadic and unexplainable set of results.

As a final point, we have to remember that M-estimators are only asymptotically normal, so even when samples are large the approximation can still be very poor. It all depends on the type and size of the anomaly!


In the above article, we broadly discussed the field of robust statistics and why a practitioner should approach their data with caution. Normal data may exist, but in practice kurtosis plagues reality. Experiments on fatter-tailed (Student t-distributed) data highlight that the sample median is far more competitive with the sample mean than it is under normality, and I generally like to put both side by side to see any noticeable differences. Further, robust regression methods offer a higher breakdown point and more realistic estimates, but are slower to compute.

Robust Statistics are a bit of an art because sometimes you need them and sometimes you don’t. Ultimately every data point is important so leaving some out (or down weighting certain ones) is rarely desirable. Given that limitation, I always encourage researchers to use multiple statistics in the same experiment so that you can compare results and get a better feel for relationships because after all, one ‘good’ result may just be lucky.


Thanks for reading! If you have any questions please message — always happy to help!


References

  1. Huber, Peter J. (1981), Robust statistics
  2. Little, T. The Oxford Handbook of Quantitative Methods in Psychology. Retrieved October 14, 2019
  3. Liu, X., & Nielsen, P.S. (2016). Regression-based Online Anomaly Detection for Smart Grid Data. ArXiv, abs/1606.05781.

The Sampling Distribution of OLS Estimators

OLS Regression on sample data [source]

Details, details: it’s all about the details!

Ordinary Least Squares (OLS) is usually the first method every student learns as they embark on a journey of statistical euphoria. It’s a method that quite simply finds the line of best fit within a two dimensional dataset. Now the assumptions behind the model, along with the derivations are widely covered online, but what isn’t actively covered is the sampling distribution of the estimator itself.

The sampling distribution is important because it informs the researcher how accurate the estimator is for a given sample size, and more so, it allows us to determine how the estimator behaves as the number of data points increase.

To determine the behaviour of the sampling distribution, let’s first derive the expectation of the estimator itself.


Expectation of OLS Estimator

Remember that the OLS Coefficient is traditionally calculated as follows:

Closed form derivation of OLS regression coefficient [source]

where Y = Xβ + e. Substituting this expression for Y into the formula above, the derivation continues below:

The expectation of the Beta coefficient is Beta, thereby also being unbiased [source]

Again, we know that the estimate of beta has a closed-form solution, β̂ = (X'X)⁻¹X'y; replacing y with Xβ + e gives the first line. Working through the algebra and remembering that E[e] = 0, we find that E[β̂] = β, i.e. our OLS estimator is unbiased.


Variance of your OLS Estimator

Now that we have an understanding of the expectation of our estimator, let’s look at the variance of our estimator.

The variance of the beta estimator goes to 0 as n goes to infinity. [source]

To get to the first line you have to remember that your sample estimator (beta hat) can be expanded and simplified as follows:

where e ~ N(0, σ²I). From this, we can determine that E[ee'] = σ²I, which is a constant and can therefore be moved out of the expectation; the remaining X terms multiply and cancel to leave σ²(X'X)⁻¹.

Ultimately, this leaves σ²(X'X)⁻¹, which goes to 0 asymptotically: as n increases, σ² stays the same but (X'X) grows with the number of observations, so the variance of your OLS estimator shrinks towards 0.
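A minimal simulation sketch (assuming numpy; the true slope, noise level and sample sizes are illustrative) makes both properties visible at once: the estimate stays centred on the true coefficient while its spread shrinks roughly like 1/√n:

import numpy as np

rng = np.random.default_rng(0)
true_beta, sigma = 2.0, 1.0   # illustrative true slope and noise level

def beta_hat(n):
    """One simulated OLS slope estimate (single regressor, no intercept)."""
    x = rng.normal(size=n)
    y = true_beta * x + rng.normal(scale=sigma, size=n)
    return (x @ y) / (x @ x)   # closed-form OLS: (X'X)^-1 X'y

for n in [10, 100, 1000, 10_000]:
    estimates = [beta_hat(n) for _ in range(2000)]
    print(f"n={n:>6}: mean={np.mean(estimates):.3f}, std={np.std(estimates):.4f}")
# The mean stays near the true value of 2 (unbiased) while the std shrinks roughly like 1/sqrt(n)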


Sampling Distribution

Now that we’ve characterised the mean and the variance of our sample estimator, we’re two-thirds of the way on determining the distribution of our OLS coefficient.

Remember that, as part of the fundamental OLS assumptions, the errors in our regression equation should have a mean of zero, a constant variance, and be normally distributed: e ~ N(0, σ²). The OLS coefficient is simply a linear combination of these 'disturbances', and so it is itself driven by normal disturbances. Therefore:

Distribution of the OLS Coefficient: [source]

And there we have it! I've (a) derived the expectation of the OLS estimator and shown that it is unbiased; (b) derived the variance of the sample estimator and shown that it goes to 0 asymptotically; and (c) used the intuition behind the distribution of the error term to infer the sampling distribution of our estimator. (Note that for sample sizes greater than around 30, the sampling distribution would be approximately normal anyway because of the Central Limit Theorem.)

On the whole, I hope that the reader has a much deeper awareness and understanding of their beta coefficient. The information above can be used in a powerful way to make robust estimates of relationships: moreover showing the importance of increasing the number of samples to decrease the variance of your sample estimator.

Ultimately, the insights you gain from understanding fundamental details will shape the way you think when experimenting!


Thanks for reading and hope I helped! Please message me if you need any help!

Asymptotic Distributions

Euler's Infinity [source]

Infinity (and beyond…)

The study of asymptotic distributions looks to understand how the distribution of a phenomena changes as the number of samples taken into account goes from n → ∞. Say we’re trying to make a binary guess on where the stock market is going to close tomorrow (like a Bernoulli trial): how does the sampling distribution change if we ask 10, 20, 50 or even 1 billion experts?

The understanding of asymptotic distributions has enhanced several fields, so its importance should not be understated. Everything from statistical physics to the insurance industry has benefitted from results like the Central Limit Theorem (which we cover a bit later).

However, something that is not well covered is that the CLT assumes independent data: what if your data isn’t independent? The views of people are often not independent, so what then?

Let’s first cover how we should think about asymptotic analysis in a single function.

Asymptotic Analysis

At first glance, looking towards the limit means trying to see what happens to our function or process as we push a variable to its extreme value: ∞.

As an example, assume that we're trying to understand the limits of the function f(n) = n² + 3n. The function f(n) is said to be asymptotically equivalent to n² because, as n → ∞, n² dominates 3n: at the extreme, the function is pulled far more strongly by the n² term than by the 3n term. Therefore, we say "f(n) is asymptotic to n²", often written symbolically as f(n) ~ n².

Conceptually, this is quite simple, so let's make it a bit more difficult. Let's say we have a group of random variables that are all kind of similar, each drawn from a distribution we're unsure of, e.g. the height of a bouncing ball. How does it behave? What's the average height of 1 million bounced balls? Let's see how the sampling distribution changes as n → ∞.


Asymptotic Distribution

An Asymptotic Distribution is known to be the limiting distribution of a sequence of distributions.

Imagine you plot a histogram of 100,000 numbers generated from a random number generator: that's probably quite close to the parent distribution which characterises the random number generator. This intuition sits behind the Law of Large Numbers, but it doesn't really say much about what the distribution of an estimator converges to at infinity (it kind of just approximates it).

For that, the Central Limit Theorem comes into play. In a previous blog (here) I explain a bit behind the concept. This theorem states that the (suitably scaled) sum of a series of random variables converges to a normal distribution: a result that is independent of the parent distribution. So whether the parent distribution is normal, Bernoulli, chi-squared, or anything else for that matter: when enough independent draws are added together, the result is approximately normal.

For readers who are already familiar with the Central Limit Theorem, though, remember that the theorem relies strongly on the data being IID: but what if it's not, what if the data points depend on each other? Stock prices are dependent on each other: does that mean a portfolio of stocks has a normal distribution?

The answer is no.

Ledoit and Crack (2009) assume a stochastic process which is not independent:

A non-IID data-generating Gaussian process

As we can see, the functional form of Xt is the simplest example of a non-IID generating process, given its autoregressive structure. The distribution of the sample mean is then derived later in the paper (it's quite involved) to show that the asymptotic distribution is close to normal, but only at the limit:

Sampling Distribution of a non-IID estimator [source]

however, for all finite values of N (and for any reasonable N you can imagine), the usual variance estimate is now biased by the correlation exhibited within the parent process.

“You may then ask your students to perform a Monte-Carlo simulation of the Gaussian AR(1) process with ρ ≠ 0, so that they can demonstrate for themselves that they have statistically significantly underestimated the true standard error.”

This demonstrates that when data is dependent, the variance of our estimator is significantly larger and it becomes much more difficult to approximate the population parameter. The sampling distribution begins to look a bit more like a Student t-distribution than a normal distribution.
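In the spirit of the quote above, here is a minimal Monte Carlo sketch (assuming numpy; ρ, the series length and the number of simulations are illustrative) of a Gaussian AR(1) process, comparing the standard error the IID formula would report with the one actually observed across simulations:

import numpy as np

rng = np.random.default_rng(0)
rho, n, n_sims = 0.8, 200, 5000   # illustrative autocorrelation, series length, simulations

def ar1_path():
    """One simulated Gaussian AR(1) path with unit innovation variance."""
    eps = rng.normal(size=n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = rho * x[t - 1] + eps[t]
    return x

paths = [ar1_path() for _ in range(n_sims)]
sample_means = np.array([p.mean() for p in paths])
naive_ses = np.array([p.std(ddof=1) / np.sqrt(n) for p in paths])  # what the IID formula reports

print("average naive (IID) standard error :", naive_ses.mean())
print("observed standard error of the mean:", sample_means.std())  # considerably larger when rho != 0

With ρ = 0.8 the IID formula understates the true standard error of the sample mean by roughly a factor of three.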

Given this, what should we look for in an estimator when there is a dependency structure within the data? Ideally, we'd want a consistent and efficient estimator:


Asymptotic Consistent Estimators

Now, in terms of probability, an estimator is said to be asymptotically consistent when, as the number of samples increases, the resulting sequence of estimates converges in probability to the true parameter value.

An estimator is said to be consistent when as the number of samples goes to infinity, the estimated parameter value converges to the true parameter [source]

Let’s say that our ‘estimator’ is the average (or sample mean) and we want to calculate the average height of people in the world. Now we’d struggle for everyone to take part but let’s say 100 people agree to be measured.

Now, we've previously established that the variance of the sample mean depends on N: as N increases, the variance of the sample estimate decreases, so the sample estimate converges towards the true value. So if we now take an average of 1,000 people, or 10,000 people, our estimate will be closer to the true parameter value, as the variance of our sample estimate decreases.

Now, a really interesting thing to note is that an estimator can be both biased and consistent. For example, take an estimator of the mean with some bias, e.g. f(x) = x̄ + 1/N. As N → ∞, the 1/N term goes to 0 and f(x) converges to μ, so the estimator is consistent.

This is why in some use cases, even though your metric may not be perfect (and biased): you can actually get a pretty accurate answer with enough sample data.
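A tiny sketch of that biased-but-consistent estimator (assuming numpy; the true mean of 5 is illustrative):

import numpy as np

rng = np.random.default_rng(0)

def biased_mean(sample):
    """A deliberately biased estimator of the mean: sample mean + 1/N."""
    return sample.mean() + 1.0 / len(sample)

for n in [10, 100, 10_000, 1_000_000]:
    print(n, biased_mean(rng.normal(loc=5.0, size=n)))
# The 1/N bias term vanishes as N grows, so the estimates converge to the true mean of 5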


Asymptotic Efficiency

An estimator is said to be efficient if it is unbiased and its variance attains the Cramér–Rao lower bound (the lower bound on the variance of an unbiased estimator). A weaker condition can also be met if the estimator has a lower variance than all other unbiased estimators without attaining the Cramér–Rao bound: it is then called the Minimum Variance Unbiased Estimator (MVUE).

Take the sample mean and the sample median and also assume the population data is IID and normally distributed (μ=0, σ²=1). We know from the central limit theorem that the sample mean has a distribution ~N(0,1/N) and the sample median is ~N(0, π/2N).

Now we can compare the variances side by side. For the sample mean, you have 1/N but for the median, you have π/2N=(π/2) x (1/N) ~1.57 x (1/N). So the variance for the sample median is approximately 57% greater than the variance of the sample mean.

At this point, we can say that the sample mean is the MVUE as its variance is lower than the variance of the sample median. This tells us that if we are trying to estimate the average of a population, our sample mean will actually converge quicker to the true population parameter, and therefore, we’d require less data to get to a point of saying “I’m 99% sure that the population parameter is around here”.
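Here is a minimal simulation sketch (assuming numpy; the sample size and number of simulations are illustrative) that checks the π/2 variance ratio between the sample median and the sample mean for normal data:

import numpy as np

rng = np.random.default_rng(0)
n, n_sims = 1000, 5000   # illustrative sample size and number of simulations

samples = rng.normal(size=(n_sims, n))
var_mean = samples.mean(axis=1).var()
var_median = np.median(samples, axis=1).var()

print("variance of sample mean  :", var_mean)               # ~1/n
print("variance of sample median:", var_median)             # ~(pi/2)/n
print("ratio (median / mean)    :", var_median / var_mean)  # ~1.57 for normal data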

Example of overlapping distributions where one has a higher variance than the other but both have the same mean of 0 [source]. We can infer that samples from the narrower distribution are more likely to be close to the mean.

As such, when you look towards the limit, it’s imperative to look at how the second moment of your estimator reacts as your sample size increases — as it can make life easier (or more difficult!) if you choose correctly!


In a number of ways, the above article has described the process by which the reader should think about asymptotic phenomena. First, you should consider what the underlying data is like and how that would affect the distributional properties of sample estimators as the number of samples grows.

Secondly, you would then consider for what you’re trying to measure, which estimator would be best for you. In some cases, a median is better than a mean (e.g. for data with outliers), but in other cases, you would go for the mean (converges quicker to the true population mean).

In either case, as Big Data becomes a bigger part of our lives — we need to be cognisant that the wrong estimator can bring about the wrong conclusion. This can cause havoc as the number of samples goes from 100, to 100 million. Therefore, it’s imperative to get this step right.


Message if you have any questions — always happy to help!


Code for Central Limit Theorem with Independent Data

# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Generate Population Data
df = pd.DataFrame(np.random.randn(10000,1)) #normal
#df = pd.DataFrame(np.random.randint(0,100,size=(10000,1))) #uniform
pop_mu = df.mean(axis=0)
pop_st = df.std(axis=0)
# Generate the sampling distribution of the sample mean
s_mu = [df.sample(100).mean()[0] for i in range(1000)]  # 1,000 sample means, each from a sample of N = 100 points
# Plot Sample Means
plt.figure(figsize=(20,10))
sns.distplot(s_mu).grid()
plt.title('Sampling Distribution of the Sample Mean (1,000 sample means where N = 100)')
plt.axvline(x=np.mean(s_mu), label='Mean of Sample Means')
plt.axvline(x=np.mean(s_mu) + np.std(s_mu), label='Std of Sample means', color='r')
plt.axvline(x=np.mean(s_mu) - np.std(s_mu), label='Std of Sample means', color='r')
plt.legend()
plt.show()

The Distribution of the Sample Mean

One-sided test on a distribution that is shaped like a Bell Curve. [Image from Jill Mac from Source (CC0) ]

All Machine Learning Researchers should know this

Most machine learning and mathematical problems involve extrapolating a subset of data to infer for a global population. As an example, we may only get 100 replies on a survey to our new website, whereas our target market is 10 million customers. It’s infeasible to ask all 10 million customers what they think, so we have to use the feedback from the 100 to infer.

Probability distributions tell us the likelihood of different things happening. Conceptually, they can tell us "Event A is probably not going to happen, but Event B is much more likely", or "The likelihood of Z > 5 is quite low". Now, we generally think of distributions as applying to data; however, when you delve deeper into the theoretical side of statistics, you find that every statistic has its own distribution too.

As a practitioner of Machine Learning and Data Science: day in and day out we use the Sample Mean. Therefore, we need to be at one with the dynamics and limitations of our most-used tool.

With the above example, say we have a new feature that we want to incorporate into our business. We can't ask all 10 million customers what they think of the new feature before we integrate it, so instead we find a small group (our sample) of customers and calculate the mean result of that group. Then we wonder: would the results change if we measured this on another group of customers? What if we found 100 groups of 20 customers: would each group give the same result?

In this example, we are 'sampling' a small subset (through groups) of our population (of 10 million customers) to try to approximate what the population thinks. We use the sample mean to approximate the population mean, and as it's an approximation, it has its own distribution.

Now, as many distributions are characterised by their mean and variance, we can take a big step towards defining the distribution of the sample mean by first deriving its mean and variance.

From this, we can achieve our goal in saying: “From our experiment, the average customer is X% likely to approve the new feature and we are confident on that percentage within +/- Y%”

Let’s begin.

Mean of the Sample Mean

To calculate the expectation of any statistic, you can simply wrap an expectation around the functional form of the statistic:

Derivation proving that our mean statistic is unbiased: more work can be found in Reference

The derivation continues by taking the expectation of the formula. We pull out the constant (1/n) and are left with a sum of expectations of the variables Xᵢ (which are all independent). These are all the same (μ), so we have (nμ)/n = μ, and thus the expectation of the sample mean is exactly the population mean.

This result is huge because it proves that the sample mean we derive directly approximates the population mean. So, for example, if we have a group of 100 customers from a population of 10 million independent customers, then the mean of this small sample is a very good estimate of the population mean (within a certain range). We can now make more powerful predictions with significantly less data.

Now that we’ve derived the first moment of our distribution, let’s move onto the second moment.

Variance of the Sample Mean

The proof for the variance for the sample mean is equally as simple.

Derivation of Variance of Sample Mean: Reference.

Again we see that the variance operator feeds through the sample mean statistic, this time extracting a factor of 1/n². We are left with our X variables, and because they are independent, their variances can be added together to give nσ². From here, it's a simple substitution and rearrangement to arrive at σ²/n.

The end result tells us that the variance of the sample mean is the variance of the underlying data divided by the number of data points in each sample. This is another huge result, as it tells us by how much the variance of our mean estimate decreases as N increases. We can almost fully approximate how the distribution of our sample mean should look. For example, with normally distributed data (variance = 1) and samples of N = 100 points, the standard error of the sample mean is 0.1 (as 1 divided by the square root of 100 = 0.1), so a sample mean will typically fall within about ±0.1 of the true mean.
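A minimal sketch of that shrinkage (assuming numpy; the population and sample sizes are illustrative) compares the empirical standard error of the sample mean with σ/√N for a few values of N:

import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(size=1_000_000)   # population with variance ~1

for n in [25, 100, 400, 1600]:
    means = [rng.choice(population, size=n).mean() for _ in range(2000)]
    print(f"N={n:>5}: empirical SE={np.std(means):.3f}, sigma/sqrt(N)={1/np.sqrt(n):.3f}")
# Quadrupling N halves the standard error of the sample mean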

Distribution of the Sample Mean: Central Limit Theorem

The expressions derived above for the mean and variance of the sampling distribution are neither difficult to derive nor new. However, the simplicity of it all really stands out with the Central Limit Theorem: regardless of the shape of the parent population (whatever its distribution), the sampling distribution of the sample mean approaches a normal distribution.

Now, the Central Limit Theorem (CLT) proves (under certain conditions) that when random variables are added together, the distribution of their (suitably scaled) sum converges towards a normal distribution at the limit, even if the original variables themselves are not normally distributed. This is generally a good approximation once N > 25 (proof and demonstration).

Note: The proof of the CLT is not a short proof so I’ve left it out from this article.

The mean and variance derived above characterise the shape of the distribution, and given that we now know the asymptotic distribution, we can infer even more with even less data. The characteristics of the normal distribution are extremely well covered, and we can use that knowledge to better understand the dynamics of our sample mean estimate.

Bootstrapping

Now, given that we have proven that our sample mean statistic has a mean of μ and a variance of σ²/n, let us show in practice what happens when we repeatedly calculate the sample mean, and whether the result looks like a normal distribution or not.

Bootstrap methods use Monte Carlo simulation to approximate a distribution. In the example below (and as per the code at the end), we generate a population of 10,000 numbers from a normal random number generator. From this population, we calculate the mean of a sample of 100 numbers, record this mean, and do it again (100 times for the first chart below).

Bar Chart of 100 Sample Means (where N = 100). Its shape is similar to a bell curve. Code at end.

Now, it's awesome to see that the mean of the sample means is quite close to 0, which we expected given that the expectation of the sample mean approximates the population mean, which we know to be 0. Moreover, the standard deviation of the sample means is 0.1, which is also correct, as the standard deviation = √(σ²/N) = 1/√100 = 1/10 = 0.1. Further, the shape of the distribution looks like a bell curve, and if we increase the number of sample means (from 100 to, say, 10,000), the distribution looks even better:

Bar Chart of 10,000 Sample Means (where N = 100). Its shape is almost identical now to a bell curve. Code at end.

As another example, if we change the underlying population data to be uniformly distributed (say, between 0 and 100) and calculate the sample mean 10,000 times, where each sample mean contains 100 points, then, as predicted by the Central Limit Theorem, we again converge to a normal distribution:

Bar Chart of 10,000 Sample Means (where N = 100) and underlying data is uniformly distributed. Its shape is again almost identical now to a bell curve. Code at end.

Conclusion

In the above article, I derived the mean and variance of the sample mean, I then used bootstrap techniques to highlight the central limit theorem: all of which we can use to aid our understanding of the sample mean, which in turn helps us to approximate the dynamics of the underlying population.

Now, as a reader, I hope that you understand that the distributional properties of such a simple statistic allow for very powerful inferences. Knowing the limitations of your estimates becomes more important in times of stress. From a business perspective, there's no point building new features or changing a business strategy if you know that your goal lies outside the limits your current business can reach. As such, you can use these methods to guide your inference and help you make sensible decisions as you go along.


Reference Code

# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Generate Population Data
df = pd.DataFrame(np.random.randn(10000,1)) #normal
#df = pd.DataFrame(np.random.randint(0,100,size=(10000,1))) #uniform
pop_mu = df.mean(axis=0)
pop_st = df.std(axis=0)
# Generate the sampling distribution of the sample mean
s_mu = [df.sample(100).mean()[0] for i in range(1000)]  # 1,000 sample means, each from a sample of N = 100 points
# Plot Sample Means
plt.figure(figsize=(20,10))
sns.distplot(s_mu).grid()
plt.title('Sampling Distribution of the Sample Mean (1,000 sample means where N = 100)')
plt.axvline(x=np.mean(s_mu), label='Mean of Sample Means')
plt.axvline(x=np.mean(s_mu) + np.std(s_mu), label='Std of Sample means', color='r')
plt.axvline(x=np.mean(s_mu) - np.std(s_mu), label='Std of Sample means', color='r')
plt.legend()
plt.show()

Parts-based learning by Non-Negative Matrix Factorisation

Visualising the principal components of portrait facial images. ‘Eigenfaces’ are the decomposed images in the direction of largest variance.

Why we can’t relate to eigenfaces

Traditional methods like Principal Component Analysis (PCA) decompose a dataset into some form of latent representation, e.g. eigenvectors, which at times can be meaningless when visualised — what actually is my first principal component? Non-Negative Matrix Factorisation (NNMF), popularised by Lee and Seung in their 1999 Nature paper, showed that data (e.g. a set of facial portraits) could instead be deconstructed into parts, extracting features like the nose, the eyes, and a smile.

The differences between PCA and NNMF arise from a constraint in their derivation: namely, NNMF does not allow the features or their weights to be negative. PCA, on the other hand, approximates each image as a linear combination of all of its basis images. Although eigenfaces are directions of largest variance, they are hard to interpret: what is the first principal component actually showing me?

Eigenfaces are meaningless directions of largest variance

A linear combination of distributed components can feel difficult to relate to when attempting to visualise the data. NNMF is interesting because it extracts parts from the data that, when visualised, we can actually relate to.

In the seminal paper by Lee and Seung, they take a series of facial portraits and by running the NNMF algorithm, they decompose the images to extract the following features:

An image taken from “Learning the parts of objects by non-negative matrix factorisation” (Lee, Seung, 1999) [1]

We can see (moving from the top left to the bottom right) that each feature looks rather like a different facial part: the eyes, the jaw structure, the nose, the T-zone and the eyebrows all come out. Compare this to the figure at the top of this article and you'll see how differently the two models explain the structure of the same data.

Likewise, this method can also be applied to articles of text, where you can decompose each article into topics:

An example of topic modelling by using NNMF: link

On the right we have our data matrix M, which is decomposed (in the same manner as the image example) into the topics matrix A and the weights matrix W. (Note: reference [3] will help you remove stop words)

To appreciate how such a simple method can give such good results, let's first go over the mathematics:

Mathematics of Non-Negative Matrix Factorisation

The goal of NNMF is to decompose an image database (matrix V) into two smaller matrices W and H, with the added constraint that W ≥ 0 and H ≥ 0:

V is a matrix of our Image database. The r columns of W are called basis images. Each column of H is called an encoding and is in one-to-one correspondence with a face in V. Check reference [1] for more information.

Let V be our matrix of data. In the case of images, you would vectorise each image and concatenate them sideways to have something that’s (N x M) with N total pixels and M total images. The iterative procedure used to derive an approximation for V is as follows:

Source: Reference 1

We initialise W and H randomly and then iterate as follows:

  1. Normalise the H matrix
  2. Incorporate this normalised H matrix into the update for W
  3. Normalise the W matrix
  4. Calculate the improvement in the following objective function (if the improvement is still above a threshold, go back to 1; otherwise stop):
Source: Reference 1

Unlike PCA, NNMF has no closed-form solution, so this iterative method (similar to gradient descent) keeps working until the improvements are marginal. Note — code is at the end.

In reality, the exact objective function and update rules are not the important part here. The observant reader will notice that it is the non-negativity constraint that characterises the difference between PCA and NNMF, and that it is this which results in an approachable latent representation, i.e. learning by parts. In NNMF, the latent features are added together (with non-negative weights) to recreate each sample, and many of these weights can be 0.

PCA, on the other hand, can be framed in the same way but with the constraint that the columns of W are orthonormal and the rows of H are orthogonal. This allows for a distributed representation but not an additive one, hence the difference in the latent representations of the two models.


The model you use makes a huge difference to its interpretability, its usefulness and the inferences you draw from it. I'm a proponent of using various methods as benchmarks, because without measuring performance against some form of null hypothesis, you can't really monitor effectiveness.

Given that, each model also has its marginal benefits, so it makes sense to try to learn something from each one. In the above case, simply changing the non-negativity constraint gives the user an entirely different result. I encourage the reader to take the code below away and see what other features they can come up with!


References:

  1. DD Lee, HS Seung. 1999 “Learning the parts of objects by non-negative matrix factorization”
  2. Dempster, A. P., Laired, N. M. & Rubin, D. B. (1977) Maximum likelihood from incomplete data via the EM algorithm.
  3. Reference for list of common (“Stop”) words, referenced from: http://xpo6.com/list-of-englishstop-words/

Code for NNMF

The coding below is in Matlab (don’t ask…) but it should be easy enough to translate to any language.

% V is the (n x m) non-negative data matrix and R is the number of basis images;
% both must be defined before running this script.
[n,m]=size(V);
stopconv=40;  % Stopping criterion (can be adjusted)
niter=1000;   % Maximum number of iterations (can be adjusted)

cons=zeros(m,m);
consold=cons;
inc=0;

% Counter
j=0;

% initialize random w and h
W=rand(n,R);
H=rand(R,m);

for i=1:niter
% This is the update equation for H as per the Nature paper, with
% normalisation
x1=repmat(sum(W,1)',1,m);
H=H.*(W'*(V./(W*H)))./x1;

% This is the update equation for W as per the Nature paper with
% normalisation
x2=repmat(sum(H,2)',n,1);
W=W.*((V./(W*H))*H')./x2;

% test convergence every 10 iterations
if(mod(i,10)==0)
j=j+1;

% adjust small values to avoid underflow
H=max(H,eps);
W=max(W,eps);

% construct connectivity matrix
[~,index]=max(H,[],1); %find largest factor
mat1=repmat(index,m,1); % spread index down
mat2=repmat(index',1,m); % spread index right
cons=mat1==mat2;

if(sum(sum(cons~=consold))==0) % connectivity matrix has not changed
inc=inc+1; %accumulate count
else
inc=0; % else restart count
end

% prints number of changing elements
fprintf('\t%d\t%d\t%d\n',i,inc,sum(sum(cons~=consold)));

if(inc>stopconv)
break; % assume convergence if connectivity stops changing
end

consold=cons;
end
end
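For readers who would rather stay in Python, here is a rough, hedged equivalent using scikit-learn's NMF class (assuming scikit-learn is available; the random matrix V below is just a stand-in for a real pixels-by-images matrix, and R plays the same role as in the Matlab script). The multiplicative-update solver with a Kullback–Leibler loss broadly corresponds to the divergence-based updates used above:

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = np.abs(rng.normal(size=(64, 200)))   # stand-in for an (n pixels x m images) non-negative matrix
R = 10                                   # number of basis images / components

model = NMF(n_components=R, init='random', solver='mu',
            beta_loss='kullback-leibler', max_iter=1000, random_state=0)
W = model.fit_transform(V)   # (n x R) basis images
H = model.components_        # (R x m) encodings

print(W.shape, H.shape)
print("reconstruction error:", model.reconstruction_err_)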