This Python Library Will Help You Build Scalable Data Science Projects

Pytest: A Testing Framework for Python Code

Photo by Ratanjot Singh on Unsplash

How can you check that your code changes actually achieve what they’re meant to?

Ensuring your code has integrity is quite difficult, especially at scale. Usually you’ll work in a large team with different people working on different parts of the system. Everyone is tinkering with something, and if you’re using an agile methodology, you’ll be committing code multiple times a day. So how can you be sure that all your changes are backwards compatible? How can you be sure that your code changes maintain at least the same functionality as the code you’re removing (without the bad bits)?

You can test code in a number of ways. A lot of it tends to be sanity testing: coming up with situations or extreme cases in which the code will definitely fail, and then gradually narrowing the scope.

It’s a long process, but this is so important because some code holds a lot of responsibility: at times, faulty code can bring down the company.

That’s not a joke, check these stories:

  1. Knight Capital lost around $440m in trading caused by a computer error (and was rescued through a merger with GETCO, forming KCG)
  2. Y2K Bug that reportedly cost the industry upwards of $300bn to resolve (that’s billion with a b!).
  3. AT&T went down for 9 hours and 75 million calls went unanswered. What caused it? A software update

These stories broke headlines and broke companies but just think about it as a customer as well: would you use an application that was super buggy? No, I wouldn’t either.

The Python community has appreciated testing for a while and pretty much all developers should know how to test their code. In what follows, we’ll be discussing the library pytest.

Photo by beuwy.com Alexander Pütter on Unsplash

So what is pytest?

Pytest is a framework that makes it easy to write small tests, yet scales to support complex functional testing for applications and libraries.

Pytest has been built over a number of years and has become so popular for the following reasons:

  1. easy and simple syntax
  2. run specific tests or a subset of tests in parallel
  3. built-in automatic detection of tests and the ability to skip tests
  4. encourages test parametrisation and gives useful information on failure (see the sketch just after this list)
  5. encompasses minimal boilerplate
  6. makes testing easy by providing special routines and extensibility (many plugins, hooks, etc. are available)
  7. open-source i.e. allows contribution from the larger community
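To give a flavour of point 4, here is a minimal sketch of test parametrisation: one decorated test function that pytest expands into several test cases (the increment function is just an illustrative stand-in, not part of pytest):

# test_parametrise.py (illustrative sketch)
import pytest

@pytest.mark.parametrize("value, expected", [
    (1, 2),
    (2, 3),
    (10, 11),
])
def test_increment(value, expected):
    # one decorated function expands into three separate test cases
    assert value + 1 == expected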

These are but a few points that make pytest easy to use. Note that testing also forms an integral part of your continuous integration and continuous delivery (CI/CD) process, but this will be covered elsewhere.


Getting Started with PyTest

To install pytest, open up your command line and run the following command:

pip install pytest

You can also check that it installed correctly by printing the version:

$ pytest --version

Your First Test

Before getting started, create a folder named “Sample Test” and make a Python file called testing_sample.py.

The assert statement is used to check that an expression inside a test evaluates to the expected truth value.

Now in what follows, we’ll produce your first test in just 4 lines of code:

# testing_sample.py
def func(y):
    return y + 1

def test_answer():
    assert func(2) == 4

On execution of the above test function with:

$ pytest testing_sample.py

It returns a failure report because func(2) is not equal to 4 (i.e. 3 != 4). Additionally, pytest shows you exactly where and why the assertion failed. If you change the assertion to func(2) == 3, the test then passes because the expression is True. Make sure to correct this before progressing.

Note: with pytest’s standard test discovery rules, you can store multiple test files in your current directory and its subdirectories, and running pytest with no arguments will run through them all.

Assert Exception

To check that a piece of code raises an exception, you can write a test that uses the pytest.raises helper as shown:

# testing_sysexit.py
import pytest

def p():
    raise SystemExit(1)

def test_mytest():
    with pytest.raises(SystemExit):
        p()

This test results in an exception being thrown, but as it’s expected (and declared via pytest.raises), the test passes and the run carries on as normal.
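If you also want to inspect the exception itself (for example, the exit code passed to SystemExit), pytest.raises can be used as a context manager that captures it. A small sketch building on the example above:

# testing_sysexit.py (extended sketch)
import pytest

def p():
    raise SystemExit(1)

def test_exit_code():
    with pytest.raises(SystemExit) as excinfo:
        p()
    # the captured exception is available on excinfo.value
    assert excinfo.value.code == 1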

Grouping Multiple Tests in a Class

Now if you want to run multiple related tests, you can group them in a class like so:

# testing_class.py
class TestClass:
    def test_one(self):
        y = "this"
        assert "h" in y

    def test_two(self):
        y = "hello"
        assert hasattr(y, "check")

Pytest discovers every method whose name starts with test inside a class whose name starts with Test, as used above. We can now run the code with:

$ pytest testing_class.py

You’ll see that the first test passes while the second fails, and pytest reports the reason for the failure so you can understand exactly what went wrong.

FIXTURES

Pytest fixtures are functions marked with a decorator, and they essentially run before the tests that request them. A pytest fixture is declared by decorating a function as follows:

@pytest.fixture
def abc():
    ...

The implementation in pytest offers dramatic improvements over the classic xUnit style of setup/teardown functions because primarily, fixtures have explicit names and are activated by declaring their use from test functions. Fixtures are also modular, so each fixture name triggers a fixture function which can itself use other fixtures. Finally, fixture management scales from a simple unit test to more complex parametrised fixtures. With high-grade production code in a big organisation, the ability to configure and re-use fixtures is imperative.

In more layman’s terms, fixtures are generally used to set up the data for a test, e.g. opening a database connection or connecting to a URL to test: generally, providing some sort of input data. So instead of repeating the same setup code in every test, we attach the fixture function to the tests, and it runs and returns its data to each test before that test executes.

Let’s look at a real example. Create a file test_div.py and add the code below to it:

# test_div.py
import pytest

@pytest.fixture
def input_value():
    return 39

def test_divisible_by_3(input_value):
    assert input_value % 3 == 0

def test_divisible_by_6(input_value):
    assert input_value % 6 == 0

When running this file, we will get a pass in the first test but a fail in the second test as 39 is not divisible by 6.

Now this approach comes with its own limitation: a fixture function defined inside a test file can only be used within that file (as that is its scope). To make a fixture available to multiple test files, we have to define the fixture function in a file called conftest.py.
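As a minimal sketch (the file names here are just examples), the divisibility fixture above could be moved into conftest.py and then used from any test file in the same directory, without any import:

# conftest.py
import pytest

@pytest.fixture
def input_value():
    return 39

# test_div.py -- no import needed, pytest injects the fixture by name
def test_divisible_by_3(input_value):
    assert input_value % 3 == 0

# test_mod.py -- a second test file can reuse the exact same fixture
def test_divisible_by_13(input_value):
    assert input_value % 13 == 0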


The coding community feels strongly about testing because it really does save a lot of time and hassle when you’re making changes as part of a wider web of code. We’ve all been in that situation where you make a small change and suddenly everything breaks with no clear reason why. Testing ensures that we stay on top of all problems and can isolate and fix them quickly.

I’d recommend it as a practitioner because testing allows you to build a solid foundation for any new framework. You can at times test things to death, but that’s better than not testing at all. After all, you want to ensure that your code has the highest degree of integrity possible, not only when you merge it, but thereafter.


Thanks for reading again!! Let me know if you have any questions and I’ll be happy to help.

Keep up to date with my latest work here!

The Top 4 Virtual Environments in Python for Data Scientists

Which Environment Is Yours?

Photo by Shahadat Rahman on Unsplash

Virtual environments are a relatively difficult thing for new programmers to understand. One problem I had in understanding them was that my environment already existed within a macOS setup, I was using PyCharm and my code was running; what else did I need?

However, as your career as a Data Scientist or Machine Learning Engineer progresses, you realise that you get these annoying-as-hell dependency issues between projects, and as an amateur who’s self-taught in this space (as many readers here are), it just takes forever to figure them out.

In what follows, I go through the most common virtual environments and why/when you should use which. To be honest, you should probably use Docker as it’s the latest technology and it’s what everyone is using (and if you’re interviewing, you’ll be asked about it). I talk about Docker here.

However, it’s super important to appreciate existing technology and how it works. Here it goes!


VENV

Photo by Lucrezia Carnelos on Unsplash

virtualenv (and the built-in venv module that grew out of it) was, and kind of still is, the default virtual environment tool for most programmers. On older Pythons you can install virtualenv using pip as follows (on Python 3, the venv module used below already ships with the interpreter):

pip install virtualenv

and once it’s installed, go to your chosen directory and, to create a virtual environment, run the following command:

python3 -m venv env

Before you can start installing or using packages in your virtual environment you’ll need to activate it. Activating a virtual environment will put the virtual environment-specific python and pip executables into your shell’s PATH.

source env/bin/activate

And now that you’re in an activated virtual environment, you can start installing libraries as normal:

pip install requests

Then, to make your repo reusable, create a record of everything that’s installed in your new environment by running:

pip freeze > requirements.txt

If you are creating a new virtual environment from a requirements.txt file, you can run

pip install -r requirements.txt

If you open your requirements file you will see one package, with its version, per line.

Finally, you can simply use the deactivate command to close the virtual environment. If you want to re-enter it, just follow the activation instructions above; there’s no need to re-create the virtual environment.

So far, we’ve had to manually create a virtual environment, activate it, and then freeze the session and save everything into a requirements.txt file to make it portable. But what if we didn’t have to do this multi-step process?

Enter pipenv.


PipEnv

While venv is still the official virtual environment tool that ships with the latest version of Python, Pipenv is gaining ground in the Python Community.

For example, with the venv approach we just described, in order to run multiple projects on the same computer you’d need:

  • A tool for creating a virtual environment (like venv)
  • A utility for installing packages (like pip or easy_install)
  • A tool/utility for managing virtual environments (like virtualenvwrapper or pyenv)

Pipenv includes all of the above, and more, out of the box.

Moreover, Pipenv handles dependency management really well compared to requirements.txt and pip freeze. Pipenv works the same as pip when it comes to installing dependencies and if you get a conflict you still have to manage it (although you can issue pipenv graph to view a full dependency tree, which should help).

But once you‘ve solved the issue, Pipfile.lock keeps track of all of your application’s interdependencies, including their versions, for each environment so you can basically forget about interdependencies. This is really a step up.

To install pipenv, you need to install pip first. Then do

pip install pipenv

Next, you create a new environment by using

pipenv install

This will look for a Pipfile; if one doesn’t exist, it will create a new environment along with a Pipfile for it.

To activate you can simply run the following command:

pipenv shell

To install new packages in this environment you can simply use pipenv install <package>, and pipenv will automatically add the package to the file called Pipfile.

You can also install a package for just the dev environment by calling

pipenv install <package> --dev

And once you’re ready to ship to production, all you do is:

pipenv lock

This will create/update your Pipfile.lock, which you’ll never need to edit manually. You should always use the generated file. Now, once you get your code and Pipfile.lock in your production environment, you should install the last successful environment recorded:

pipenv install --ignore-pipfile 

This tells pipenv to ignore the pipfile for installation and use what’s in the Pipfile.lock. Given this Pipfile.lock, pipenv will create the exact same environment you had when you ran pipenv lock, sub-dependencies and all.

The lock file enables deterministic builds by taking a snapshot of all the versions of packages in an environment (similar to the result of a pip freeze).

There you have it! Now we’ve compared pipenv and venv and shown that pipenv is a much easier solution.


Conda Environment

Photo by Marius Masalar on Unsplash

Anaconda is a distribution of Python that makes it simple to install packages, and it’s generally a good place for Python beginners to start. Anaconda also has its own environment manager, conda. Similar to the above, to create an environment:

conda create --name environment_name python=3.6

You can save all the info necessary to recreate the environment in a file by calling

conda env export > environment.yml

To recreate the environment you can do the following:

conda env create -f environment.yml

Last, you can activate your environment with the invocation:

conda activate environment_name

And deactivate it with:

conda deactivate

Environments created with conda live by default in the envs/ folder of your Conda directory.

Now in my experience, conda is OK but I prefer the approach taken by venv for two reasons. Firstly, venv makes it easy to tell that a project uses an isolated environment, because the environment lives as a sub-directory of the project.

Secondly, venv lets you use the same name (such as env) for all of your environments, meaning you can activate each one with the same command. Conda, by contrast, keeps environments in one central folder, so each needs a unique name, although to be fair that central store does make environments easy to create and list.


Docker

Photo by Iswanto Arif on Unsplash

In a previous blog post I talk about Docker and go into detail explaining how to use it, so I won’t bore you here.

Docker is a platform that builds and runs containers. A container is launched from an image describing an entire operating system environment, whereas virtualenv only captures the dependency structure of your Python project. So a virtualenv only encapsulates Python dependencies; a Docker container encapsulates an entire OS.

Because of this, with a Python virtualenv, you can easily switch between Python versions and dependencies, but you’re stuck with your host OS. However with a docker image, you can swap out the entire OS — install and run Python on Ubuntu, Debian, Alpine, even Windows Server Core.

There are Docker images out there with every combination of OS and Python versions you can think of, ready to pull down and use on any system with docker installed.


If you think about each of the environments listed above, you’ll realise that there’s a natural divide between them. Conda is better suited (naturally) for those who are using the Anaconda distribution (so mostly beginners in Python), whereas pipenv and venv are for those who are more seasoned and know the ropes. Of these two, if you’re starting something from scratch I’d really recommend going with pipenv, as it was built with the difficulties of venv in mind.

However, Docker is both easy to use and has such widespread recognition that you just have to know how it works. All of these tools work out of the box and do what they need to, but the portability between operating systems is what makes Docker the real stand-out: when it comes to production, you don’t need to worry about the OS on your server, as the container has it all sorted for you.


Thanks for reading! If you have any messages, please let me know!

Keep up to date with my latest articles here!

The Future of AI is in Model Compression

New research can reduce the size of your neural net in a super easy way

Photo by Markus Spiske on Unsplash

The future looks towards running deep learning algorithms on more compact devices as any improvements in this space make for big leaps in the usability of AI.

If a Raspberry Pi could run large neural networks, then artificial intelligence could be deployed in a lot more places.

Recent research in the field of economising AI has led to a surprisingly easy solution to reduce the size of large neural networks. It’s so simple, it could fit in a tweet:

  1. Train the neural network to completion.
  2. Globally prune the 20% of weights with the lowest magnitudes.
  3. Retrain with learning rate rewinding for the original training time.
  4. Iteratively repeat steps 2 and 3 until the desired sparsity is reached.
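As a rough, framework-agnostic sketch of step 2, here is global magnitude pruning written in plain NumPy (the random “layers” are a stand-in for a trained network; steps 3 and 4 would retrain and repeat, which is only indicated in comments here):

import numpy as np

def global_magnitude_prune(weights, fraction=0.2):
    # zero out the `fraction` of weights with the smallest absolute values,
    # measured globally across every layer (step 2 of the recipe above)
    magnitudes = np.concatenate([np.abs(w).ravel() for w in weights])
    threshold = np.quantile(magnitudes, fraction)
    return [np.where(np.abs(w) < threshold, 0.0, w) for w in weights]

# toy example: two random "layers" standing in for a trained network
rng = np.random.default_rng(0)
layers = [rng.normal(size=(4, 4)), rng.normal(size=(4, 2))]

pruned = global_magnitude_prune(layers, fraction=0.2)
# steps 3 and 4 would now retrain `pruned` with the original learning-rate
# schedule and repeat the prune/retrain loop until the target sparsity
print(sum(int((w == 0).sum()) for w in pruned))  # roughly 20% of all weights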

Further, if you keep repeating this procedure, you can get the model as tiny as you want. However, it’s pretty certain that you’ll lose some model accuracy along the way.

This line of research grew out of an ICLR paper last year (Frankle and Carbin’s Lottery Ticket Hypothesis), which showed that a DNN could perform with only 1/10th of the number of connections if the right subnetwork was found in training.

The timing of this finding coincides well with the industry hitting new limits on computational requirements. Yes, you can send a model to train on the cloud, but for seriously big networks, once you factor in training time, infrastructure and energy usage, more efficient methods are desirable simply because they’re easier to handle and manage.

Bigger AI models are more difficult to train and to use, so smaller models are preferred.

Following this desire for compression, pruning algorithms came back into the picture after the success of the ImageNet competition. Higher-performing models were getting bigger and bigger, but many researchers proposed techniques to try to keep them smaller.

Photo by Yuhan Du on Unsplash

Song Han of MIT developed a pruning algorithm for neural networks called AMC (AutoML for Model Compression), which removes redundant neurons and connections; the model is then retrained to regain its initial accuracy level. Frankle took this method further by rewinding the pruned model to its initial weights and retraining it at its faster, early-stage rate. Finally, in the ICLR study above, the researchers found that the model could simply be rewound to its early training rate, without touching any parameters or weights.

Generally, as the model gets smaller, the accuracy gets worse; however, this proposed method performs better than both Han’s AMC and Frankle’s rewinding method.

Now, it’s unclear why this approach works as well as it does, but its simplicity makes it easy to implement, and it doesn’t require time-consuming tuning. Frankle says: “It’s clear, generic, and drop-dead simple.”


Model compression and the concept of economising machine learning algorithms is an important field in which we can make further gains. Leaving models too large reduces their applicability and usability: you can keep your algorithm sitting behind an API in the cloud, but there are so many constraints on running them locally.

For most industries, models are often limited in their usability because they may be too big or too opaque. The ability to discern why a model works so well will not only enhance the ability to make better models, but also more efficient models.

For neural nets, the models are so big because you want the model to develop connections naturally, driven by the data. It’s hard for a human to understand these connections, but regardless, with enough understanding of the model we can chop out the useless ones.

The golden nugget would be to have a model that can reason, i.e. a neural network which trains connections based on logic, thereby reducing the training time and final model size. However, we’re some time away from having an AI that controls the training of AI.


Thanks for reading, and please let me know if you have any questions!

Keep up to date with my latest articles here!

Python’s Raise Keyword

How to manually throw an exception in Python

Photo by Jonathan Daniels on Unsplash

Exception handling in Python can be daunting. I find it particularly difficult because as a researcher, I’m just not very good at thinking like a ‘programmer’ should. I’m thinking more about the speed of my optimisation procedures, rather than ‘is my code handling all edge cases’.

For better or worse, that’s my flaw in coding. However, over time I’ve picked up a few tricks that’ve made life easier in terms of writing more robust code.

Keywords in Python make life a lot easier, and I’ve previously covered ternary conditional operators and the keyword yield. Let’s now cover the keyword ‘raise’.


The keyword raise is used when the coder wants to throw an exception when a particular condition occurs. The syntax is as follows:

if test:
    raise Exception(Message)

Given this, we could use it as follows:

# Input:
string = "Ciao"

if string == "Ciao" or string == "Howdy" or string == "Bye":
    raise Exception("This word is not allowed")

# Output:
Exception: This word is not allowed

As you can see here, if the condition passes (in this case, the string does indeed say “Ciao”), then the exception is raised. Likewise:

# Input a positive number and raise an exception
# if the input is a negative value

num = int(input("Enter a positive number: "))

if num < 0:
    raise Exception("Please input only positive value ")

print("num = ", num)

which will give the result:

First run:
Enter a positive number: 20
num = 20

Second run:
Enter a positive number: -10
Traceback (most recent call last):
File "/home/main.py", line 10, in <module>
raise Exception("Please input only positive value ")
Exception: Please input only positive value

So as you can see, once the user inputs a negative number, the specific exception that we defined is raised. This is great because now we have a handle on a specific, common problem that can arise, giving us more control over our code.

The keyword raise is super easy to use. All you need to do is be clear on what you’re trying to test for and what you want to do once you’ve found it.
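And because a raised exception is just an object, the calling code stays in control of what happens next. Here is a small, purely hypothetical sketch pairing raise with try/except:

def check_positive(num):
    # raise a specific, descriptive exception rather than a bare Exception
    if num < 0:
        raise ValueError("Please input only positive values")
    return num

try:
    check_positive(-10)
except ValueError as err:
    # the caller decides how to recover instead of crashing
    print("Rejected input:", err)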


Python has been made with simplicity in mind and the keyword raise makes it easier for coders to be able to control their handling of exceptions.

Using this method has made my work a lot better defined and in particular, has meant that I’ve had more control over my code. This really helps when you look back on code after a couple of months.


Thanks for reading! If you have any messages, please let me know!

Keep up to date with my latest articles here!

The Difference between “is” and “==” in Python

Equality and Identity

Photo by Mr Xerty on Unsplash

Python is full of neat tips and tricks, and something worth noting is the different ways to indicate equality, and how these two specific ways differ.

The == and is operators both indicate some form of equality and are often used interchangeably. However, this isn’t exactly correct. To be clear, the == operator checks for equality of value, while the is operator compares the identities of the objects.

In what follows, I’ll quickly explain the difference between the two, including code examples.


To understand this better, let’s look at some Python Code. First, let’s create a new list object and name it a, and then define another variable b that points to the same list object:

>>> a = [5, 5, 1]
>>> b = a

Let’s first print out these two variables to visually confirm that they look similar:

>>> a
[5, 5, 1]
>>> b
[5, 5, 1]

As the two objects look the same we’ll get the expected result when we compare them for equality using the == operator:

>>> a == b
True

This is because the == operator checks for equality of value. However, that doesn’t tell us whether a and b are pointing to the same object.

Photo by ETA+ on Unsplash

Now, we know they do because we set them as such earlier, but imagine a situation where we didn’t know: how can we find out?

If we simply compare both variables with the is operator, then we can confirm that both variables are in fact pointing to the same object:

>>> a is b
True

Digging deeper with examples, let’s see what happens when we make a copy of our object. We can do this by calling list() on the existing list to create a copy that we’ll name c:

>>> c = list(a)

Now again, you’ll see that the new object we just created looks identical to the list object pointed to by a and b:

>>> c
[5, 5, 1]

Now this is where it gets interesting: we compare our copy c with the initial list a using the == operator. What answer do you expect to see?

>>> a == c
True

This is expected because the contents of the objects are identical, and as such, they’re considered equivalent by Python. However, they are actually pointing to different objects, which we can confirm by using the is operator:

>>> a is c
False
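Under the hood, is compares object identity, which you can inspect yourself with the built-in id() function. A quick check in the same session looks like this:

>>> id(a) == id(b)   # same identity, so `a is b` is True
True
>>> id(a) == id(c)   # equal contents, but c is a distinct object
False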

Being able to differentiate between identity and equality is a simple but important step in learning the complete scope of Python. These neat tips and tricks have helped me as a Data Scientist improve not only my coding skills, but also my analytics.

Thanks again for taking the time to read! Please message me if you have any questions, always happy to help!

Keep up to date with my latest work here!

What is CICD? Where is it in 2020?

Opinion

An in-depth look at the growing market

Photo by Robynne Hu on Unsplash

CICD is a development methodology which has become more important over time. In today’s software driven world, development teams are tasked with delivering applications quickly, consistently, and error-free: every single time.

While the challenges are plentiful, CI/CD is simple at its core.

For many organisations, achieving true continuous delivery is near impossible. Development teams are quickly getting more agile while the rest of the organisation struggles to adapt.

What Is CICD

CICD is an acronym for continuous integration (and) continuous delivery.

The CI portion reflects a consistent and automated way to build, package and test applications. A consistent process here allows teams to commit code changes more frequently, encouraging better collaboration and software.

On the flip side, continuous delivery automates the process of delivering an application to selected infrastructure environments. As teams develop in any number of environments (e.g. dev, test): CD makes sure that there’s an automated way to push changes through.

If you’ve heard of the following companies and tools:

  • Jenkins
  • Gitlab
  • CircleCI
  • TravisCI

Then you’re probably a little aware of CICD.

Photo by Arif Riyanto on Unsplash

The Benefits of CICD

CICD improves efficiency and deployment times across the DevOps board, having originally been designed to increase the speed of software delivery. According to this DZone report, three-quarters of DevOps respondents have benefitted, not to mention the shortened development cycle time and the increase in release frequency.

A 75% success rate is very, very good.

What the DevOps Market Feels

Small to Medium sized Enterprises have begun to ramp up their investment in CICD over the last three years and are starting to compete with their larger peers.

According to DZone’s 2020 study on CICD, Jenkins remains the dominant CICD platform, but GitLab has been gaining ground over the past couple of years, not to mention CircleCI.

At each stage of the CICD pipeline, the report also indicated that the majority of developers said they have automation built into the process to test code and deploy it to the next stage.

Now despite the importance of automation being built into the CICD pipeline, it’s still possible for teams to get lazy and to rely too heavily on the automation.

As your team’s responsibilities shift and new tasks arise, it is easy to automate processes too soon, just for the sake of time.

Automating poorly designed processes may save time in the short term, but in the long term, it can swell into a major bottleneck that is difficult to fix.

To avoid this, teams have to be mindful to properly resolve process bottlenecks before automating, and if anything does arise, they need to strip it out and fix it fully.

Moreover, developers should audit their automated protocols regularly to ensure they maintain accuracy, while also testing current processes for efficacy.

All of this takes time to resolve, but the effort is worth it.

Photo by Christopher Gower on Unsplash

The future in CDaas?

We’ve seen the benefits of CICD, but the DZone report highlighted that almost 45% of respondents had environments hosted on site.

Now an emerging solution for organisations is to leverage micro-services and containers to allow customer facing applications to scale.

For this, Continuous Delivery-as-a-Service (CDaaS) is seen as an emerging solution, with almost 50% of those respondents considering moving across.

I’d be interested to hear from any users of CDaaS and their experiences thus far.


CICD is a growing market and the past couple of years have seen (a) more market competitors and (b) more solutions in the field. The benefits in this space are blindingly clear and it’s encouraging to see so many respondents in the space moving towards active integration of these technologies.

Trends of questions on Stack Overflow [Source]

Who knows what the future holds, but it certainly looks to be in CICD.


Thanks for reading again!! Let me know if you have any questions and I’ll be happy to help.

Keep up to date with my latest articles here!

The Future of GIT (2020)

Opinion

5 Predictions of what Data Scientists can expect

Photo by Yancy Min on Unsplash

Version control is a pretty boring topic for most people, but for coders and researchers it’s imperative to understand. The importance of version control is really understood when you work in a big team on a big project. With multiple users working on the same files at the same time, it’s a crowd you’re trying to control while ensuring that they’re all working towards the same goal.

As we know it today, version control plays an integral part in our coding ecosystem and in all honesty, a lot of people are generally happy with it. I mean there are kinks and quirks that we could improve on but the fact that nearly every single coding team I’ve been part of uses git — that says something.

Given its widespread and integral use in the coding community, and how much the sphere of coding and technology has changed since 2005, the following are my top predictions as to how git will improve in the coming decade.

Photo by Franck V. on Unsplash

Prediction 1: GIT GOES USER FRIENDLY

My first prediction is going to be short and sweet. Beginners always struggle to learn git. Even people who’ve known git for 5+ years still mess up a rebase or a branch switch and lose work along the way.

I’ll be honest, I’ve been using source control systems for almost 10 years and I only became comfortable with using git through PyCharm. It’s embarrassing but true. Without my DevOps team at the moment, I’d be lost.

Prediction 2: GIT GOES REAL TIME

The fact that git can tell you who has made what change is both a good and a bad thing. It’s good because it tells you who has done what (well, that is its purpose after all), but, it doesn’t tell you who is working on what at any point in time.

Generally speaking that isn’t a problem, but often two coders are working on the same file at the same time. That may not be a big deal, though it would be useful to know whether the functional changes both coders are making to the same file will interfere with each other, so they don’t have to go through the awkward dance of merging their commits. It’d certainly be helpful to know if another coder is working on the same file as you, and on which branch.

Prediction 3: GIT GOES CONNECTED

Why do we do git fetch still? Why do we do git pull still? There has to be a better way.

It’s second nature for coders who actively work in a shared environment to update their repository frequently during the day, but for those of us who sit in a research role or a quasi-coding position, it’s considered ‘good practice’ to update your branch regularly: but what is ‘good practice’?

In reality, it means updating it as often as the core developers would, but for those of us less well read in DevOps, shouldn’t git take this into account? Shouldn’t it say “Hey, this code (or your project) has changed considerably, you should definitely refresh your project”? Wouldn’t that be helpful?

Photo by NordWood Themes on Unsplash

Prediction 4: GIT GOES DIRTY

A piece of code can be considered dirty if it’s not committed. Code which isn’t committed can often fall between the cracks if the computer shuts down or a session ends before you stash it or commit it — as it isn’t really saved anywhere but locally.

However, sometimes you don’t want to commit code because you’re not finished, and you really want to go get your lunch. What would your commit message even say?

I guess this is what stashing is good for but it’d be great if git had a dirty mode, which you could switch on and it would auto-stash every few minutes to ensure that any faults in your local system were completely covered.

Prediction 5: GIT GOES AI COMMIT MESSAGE

Let’s be honest — there’s an art to writing commit messages and I have not mastered this art at all. I’m really, really bad at them.

You can even reference this post that complains about them.

I’m in two minds about this because I love reading awful commit messages but wouldn’t it be awesome if you didn’t have to write commit messages, and rather, the engine could determine what change you made and leave the notes instead?

For one, it’d probably be more informative and more transparent. Further, the coder might even spot issues if the generated commit message differs from what they were expecting.


Predicting the future of git is hard because in reality, who knows what it’s going to look like. It already does a pretty good job and despite the complaints on the internet about all its faults, there aren’t that many real competitors.

Tides are changing and people are starting to look more towards CICD frameworks, where the repository plays an integral role and given that, I expect a lot of improvements to come our way.

Especially with everyone in lock down, what else do we have to think about?


Thanks for reading again!! Let me know if you have any questions and I’ll be happy to help.

Keep up to date with my latest work here!


References:

  1. https://ohshitgit.com/
  2. https://stevebennett.me/2012/02/24/10-things-i-hate-about-git/
  3. https://medium.com/better-programming/stop-writing-bad-commit-messages-8df79517177d

How to Deploy Streamlit on Heroku

Opinion

For Endless Possibilities in Data Science

Photo by Kevin Ku on Unsplash

In a previous post, I predicted that the popularity of Flask would really take a hit once Streamlit comes more into the mainstream. I also made the comment that I would never use Flask again.

I still stand by both of these comments.

In that time, I’ve made four different machine learning projects, all of which are being used by family and friends on a daily basis, including:

  1. A COVID dashboard for my local friends and family that focuses on our local area
  2. A simple application for restaurants to take bookings (to help out my father’s business)
  3. A small game involving a face mask recognition system for my nephew

None of these are for financial gain; rather, they’re made for exactly what I’ve always enjoyed about artificial intelligence: they’re fun, novel and creative projects that are just cool.

I can now develop, code and deploy novel applications in less than a couple of hours.

For those who don’t know, Heroku is a platform as a service (PaaS) that enables developers to build, run, and operate applications entirely in the cloud. It’s super useful when you want to build something small that scales, but more importantly, it’s really helpful for small pet projects.

The free tier is something that I really value and would really recommend readers to deploy more projects so that they’re free for everyone to play with. What’s the point of making cool stuff if you can’t show anyone?

Photo by Paul Green on Unsplash

In what follows, I’ll take you through the different steps required. The first or second time it may be a bit slow, but after, you’ll fly through it.

Note: the following instructions are only detailed enough to help you deploy simple applications, without much thought for scalability etc. My advice is targeted at individuals who want to build small applications, so if you expect to have 100m users, this is not really the tutorial for you.

Firstly: make sure you have a Heroku login! Use this link and navigate through making a free tier. If you struggle here: I’ll be very disappointed.
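Throughout this walkthrough I’ll assume you already have a working Streamlit script to deploy. If you don’t, a minimal, purely hypothetical app.py is enough to test the pipeline:

# app.py -- a tiny placeholder Streamlit app, just so there's something to deploy
import streamlit as st

st.title("My first Heroku deployment")
name = st.text_input("What's your name?", "world")
st.write(f"Hello, {name}!")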

requirements.txt

Make sure that your terminal is in your project folder. When you launch your project into the cloud, you need to create a requirements.txt file so that the server knows what to download to run your code. The library pipreqs can autogenerate requirements (it’s pretty handy) so I’d recommend installing it as follows:

pip install pipreqs

Then once it’s installed, step out of the folder and run the following command; back in the project folder, you should find your requirements.txt file.

pipreqs <directory path>

It should contain libraries and their versions in the following format:

numpy==1.15.0
pandas==0.23.4
streamlit==0.60.0
xlrd==1.2.0

setup.sh and Procfile

Now, the next step is a little bit messy but bear with me. You need to set up two more things: a setup.sh file, and a Procfile.

The setup.sh file contains some commands to set things up on the Heroku side, so create a setup.sh file (you can use the nano command) and save the following in it (change the email in the middle of the file to your own):

mkdir -p ~/.streamlit/
echo "\
[general]\n\
email = \"your@domain.com\"\n\
" > ~/.streamlit/credentials.toml
echo "\
[server]\n\
headless = true\n\
enableCORS=false\n\
port = $PORT\n\
" > ~/.streamlit/config.toml

Nice!

Now (and I’m cheating a little bit here by using nano again), make a Procfile using the command:

nano Procfile

and from there, insert the following line (remember to replace [name-of-app].py with whatever your app is called; most of mine are just app.py):

web: sh setup.sh && streamlit run [name-of-app].py

Moving the files across with Git

Heroku builds your app from a Git repository, and it’s insanely easy to get set up. Git is a version control system and comes pre-installed on a lot of operating systems. Check if you have it; if not, install it.

Once you’re happy with your installation, you’ll need to make a git repository in your project folder. To do this, make sure you are in your project folder and run the following command:

git init

This initialises a git repository in your project folder and you should see something like the following print out:

Initialized empty Git repository in /Users/…

Nice!

The first time you do this, you’ll need to click here and install the Heroku CLI. We’re using the free tier of Heroku, which is great but naturally has drawbacks, as certain desirable features aren’t included. Those features matter more for larger projects (like SSL and scalability; also, free dynos go to sleep if they’re idle for more than 30 minutes), but hey, it’s free!

Once you’ve downloaded the Heroku CLI, run the following login command:

C:\Users\...> heroku login

This opens up a browser window, from which you can log in.

Once you’re in, it’s time to create your cloud instance. Run the following command

C:\Users\...> heroku create

and you’ll see that Heroku will create some oddly named instance (don’t worry, it’s just how it is):

Creating app… done, ⬢ true-poppy-XXXX
https://true-poppy-XXXXX.herokuapp.com/ | https://git.heroku.com/true-poppy-XXXXX.git

So Heroku created an app called ‘true-poppy’ for me. Not sure why, but I’ll take it. Now all that’s remaining is to move the code across, so in our project folder we run the following commands:

git add .
git commit -m "Enter your message here"
git push heroku master

Once it’s pushed, Heroku will start downloading and installing everything on the server side. This takes a couple of minutes, but if all is good, then you should see something like:

remote: Verifying deploy… done.

Now if you run the following:

heroku ps:scale web=1

Your job is done! If you copy the url it gives you in the command line into your browser, you’ll see that you can now run your application online. You can even check it on your mobile phone, it’s perfect!


Streamlit and Heroku make a phenomenal combination. Once you’ve done the actual hard-work of creating a machine learning model that makes sense and generates sensible results, the difficult part should never be the deployment. You should want people to play with a working product. Streamlit takes care of so much of the aesthetic and Heroku takes care of the rest.

Yes, there are several drawbacks to the above methodology in that it’s limited in scope, design and scalability but I challenge you to find a quicker way of deploying a half-decent and fully functional MVP that users can interact with. Even better, I’ll even do a race!

I really recommend these pragmatic approaches because without users interacting with your product — you’ll just never know if it’s any good.

Give it a go. Surprise yourself.


Thanks for reading! If you have any messages, please let me know!

Keep up to date with my latest articles here!

The difference between ‘git pull’ and ‘git fetch’?

The question we secretly ask

Photo by Kristina Flour on Unsplash

This is a brief explanation of the difference between git pull and git fetch followed by merge. It’s a question a lot of people want the answer to, being the 4th most upvoted question on Stack Overflow.

The reason so many people get confused is that, at first glance, they seem to do the same thing (fetching is kind of the same as pulling, right?), but each has a distinctly different job.

git is what you would call a version control system. It tracks changes in code for software development, ensuring that there’s one central source of truth and that any changes to it are accurately recorded. It’s designed to coordinate work amongst programmers, but can be used to track any set of files really. It’s pretty handy and a lot of people use it!

Note: In a coming article, I talk about git in a lot more detail, so I’ll be leaving out an introduction to it in this post. If you have any questions though, please leave a comment at the bottom!

In Brief:

If you want to update your local repo from the remote repo, but, you don’t want to merge any differences, then you can use:

  • git fetch

Then after we download the updates, we can check for any differences as follows:

  • git diff master origin/master

After which, if we’re happy with the differences, we can simply merge them as follows:

  • git merge

On the other hand, we can simply fetch and merge at the same time, where any differences would need to be solved. This can be done as follows:

  • git pull

Which to use:

How quickly you work and how your project is set up will determine whether you want to use git fetch or git pull. I tend to use git pull more because I’m generally working from a fresh and clean project.

As git pull attempts to merge remote changes with your local ones, you’re often at risk of creating a ‘merge conflict’. Merge conflicts arise where your local branch and the remote branch differ, so the differences need to be resolved before the merge can complete. You can always use git stash and then un-stash in the face of differences (making conflict resolution a bit easier to deal with).


Other situations as to why you may want to use git fetch instead of pull are as follows:

a) You want to check which new commits have been made after you’ve made some local changes:

git fetch origin master
git cherry FETCH_HEAD HEAD

b) You want to apply a commit from the master of another remote repository to your local branch.

git fetch <url_of_another_repository> master
git cherry-pick commit_abc

c) Your colleague has made a commit to a ref refs/reviews/XXX and asks you to review it, but you don’t have a web service to do it online. So you fetch it and check it out.

git fetch origin refs/reviews/XXX
git checkout FETCH_HEAD
Photo by Giorgio Trovato on Unsplash

git isn’t the easiest product to get the hang of, but after a while of playing around with it and fixing some mistakes, you’ll learn to depend on it quite a lot. Especially when you’re working with a big team, you’ll need to depend on git to coordinate code integration. Given that, decisions as to whether you do git pull or git fetch become quite important, and hopefully you’ll now know which to do!


Thanks for reading! If you have any messages, please let me know!

Keep up to date with my latest articles here!

PyCharm vs VSCode

Opinion

Is it time to change your IDE?

Photo by Thao Le Hoang on Unsplash

Maybe I’m a bit behind the curve, or maybe it’s because JetBrains have such a big hold on the Python IDE market, but it became clear to me in a previous post that a lot more Python coders are using VSCode than I was expecting.

Now I’ve used a combination of PyCharm and Notebooks for a while and I’m super happy with it. I love that if I have some data I want to explore then Notebooks is pretty easy to navigate, keep track of my work and also visualise data. On the other hand, PyCharm is just a pure machine when it comes to production: it’s never let me down and helps me churn through most tasks.

I also like the fact that the makers of PyCharm (JetBrains) are not some big American Goliath (like Microsoft), but come from a much more humble region.

Either way, Visual Studio Code (or VSCode for short) is Microsoft’s open-source IDE. Its initial release was in 2015 and since then (according to Stack Overflow) it’s become the most in-demand IDE.

Given that I’ve never really spent much time using VSCode and exploring what it offers, I’ve decided to put it next to PyCharm to try to figure out which is better, and which I should use.

PyCharm > VSCode

One would expect that developing code would feel more natural in a purpose-built IDE, and PyCharm was created with the sole purpose of coding in Python. Does that make a difference?

Let’s take the example of autocomplete support. VSCode struggles with it at times, whereas in PyCharm it works nearly perfectly in every instance. My personal experience of VSCode is that autocomplete can work great one moment and not the next. It’s not just me, though; people on Reddit complain about the same thing: it’s oddly temperamental.

Further, VSCode struggles to load extensions at times. I thought it may have been me; however, this seems to be a bit of a recurring theme, as it’s been reported multiple times: here, and here, and here, and here, and here, and here, and the issue is still present.

Now at first, you’re thinking “Oh awesome, I can customise my VSCode to be exactly how I want” but in reality, it never works that well and you end up having to spend a lot more time trying to fix the bug and less time developing, which is something you just don’t need to worry about in PyCharm.

So for those reasons, PyCharm being native to Python and built to really capitalise on that gives it a huge edge over VSCode. However, VSCode has a lot to offer as well.

Photo by i yunmai on Unsplash

VSCode > PyCharm

First and most importantly, VSCode is free. Yup, completely. The pure editor is pretty simple and you can expand its capabilities by installing plugins. PyCharm Professional, on the other hand, isn’t exactly cheap.

There is a free version of PyCharm (called the Community Edition) but it has fewer functionalities: it doesn’t include tools for developing databases or web related things, nor does it include advanced features such as performance profiling and remote debugging. VSCode has way more functionality than the free PyCharm Community edition, so let’s keep our focus on PyCharm Professional.

Now, something that PyCharm users are aware of is how big its footprint is. At the upper limit, it can take up to 1.5 GB of disk space, and that does have a knock-on effect on your coding experience. If your computer can’t handle that, it’ll take ages to load up and sometimes a bit longer to get through basic tasks: no one likes that!

Visual Studio Code has a much smaller footprint for memory consumption and physical disk space, about 30% that of PyCharm. So as VSCode is relatively lightweight, it’s a particularly good editor for smaller projects or applications, and when performing quick edits to one or more files.

Finally, people generally seem OK with having to build a custom IDE in VSCode, as compared to PyCharm which works great out of the box and you don’t really need to do much more to it. However with VSCode, you have to build it from the beginning with plugins to even get Python working on it, so users are already comfortable with upgrading its functionality with plugins. This means that these users are also thinking about further enhancements which over time, leads to more development and a better coding experience, whereas with PyCharm, it’s mostly left to JetBrains.

Which is best?

Both PyCharm and VSCode allow the community to create plugins to enhance the user experience. Both are full-blown IDEs and really do tick all the boxes in terms of what you need and want, although neither is entirely perfect. Both have a strong community behind them, and despite VSCode not being around for as long as PyCharm, both have fairly mature systems in terms of technical capability.

I think it ultimately comes down to you. Do you want to pay for PyCharm professional and have a more specialised experience, or, would you rather have the free VSCode experience with a little bit less specialism, but, potentially more extensibility?

So what does my gut say?

Stick to PyCharm if you only code in Python. If not, VSCode.


The decision is ultimately up to you, but the IDE you use can really alter your perception and experience of a coding language. I would expect advanced programmers to be using a variety of IDE’s depending on the project in hand (not to mention the number of languages coders jump between), so being flexible with your tools definitely makes life easier.

Despite all that: I’ll probably stick to my Jupyter Notebooks and PyCharm combination, but I’d be interested to hear from any full-time VSCode users as to why they won’t be switching any time soon!


Thanks for reading again!! Let me know if you have any questions and I’ll be happy to help.

Keep up to date with my latest work here!

4 Awesome COVID Machine Learning Projects

Forward thinking ways to apply Machine Learning in a Pandemic

Photo by Neil Thomas on Unsplash

Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice. To learn more about the coronavirus pandemic, you can click here.


The pandemic has changed our lives: a lot. From all sides, the lives we lived before are no longer the same as they once were. Our workplaces are different; our families are different, our expectations are different too.

Given that most of us are working from home, I’ve put together 4 interesting machine learning COVID-based projects below; they’re all worth checking out! Each of these has its own place and some are more practical than others. However, in terms of the raw application of knowledge, these are all great!

Let’s get right to it!


Face Mask Facial Recognition

Facial recognition is a huge field and it’s only set to grow in the coming months and years. Computer vision is developing rapidly as technologies in this space, including autonomous driving and identification, become more and more widespread.

At scale, Coronavirus has resulted in a demographic and societal change whereby people have to physically change their actions. Given that, masks are becoming compulsory in a huge number of countries and as such, the ability to identify whether people are wearing masks is also growing in demand.

Photo by Flavio Gasperini on Unsplash

Building a system that can determine if you’re wearing a mask or not is awfully similar to the problem of Facial Recognition, so the solution to this problem isn’t that difficult to create. Given that, the following sources are those that I’ve found quite useful in researching into it:

I really appreciated the work by PyImageSearch and even implemented the framework on my own home computer. It works so well because two scripts are provided, meaning that you can do less of the fiddly stuff and more of the playing around:

  1. Face mask recognition in images
  2. Face mask recognition in videos

Definitely worth a play around at home!

Photo by bruce mars on Unsplash

Social Distance Recognition

Following on from the mask recognition project: social distancing is one of the key themes of 2020. In the UK for example, you have to remain at a distance of more than 2 metres from people outside of your ‘bubble’, not to mention this distance varies between regions in Europe.

The trouble is implementing it in a way that doesn’t require new hardware. Existing cameras don’t really have an innate concept of distance, so two markers are usually set to inform the program of what approximately constitutes a safe distance.

Given that, the following sources will help you to develop your own Social Distance Recognition tool!

Blog by Aqeel Anwar: source and Github

Photo by Joshua Earle on Unsplash

Symptom Checker

Say you’re coming down with a cold, getting a fever and generally feeling a bit run down. Should you worry?

Yes. Get a COVID test.

But if you can muster some energy, you can always use machine learning to aid in the determination of how likely you are to have COVID (or so the theory goes).

Using a sample data set such as the one generated here, you can quite easily throw it into a random forest and understand (a) how likely you are to have coronavirus and (b) how much you should worry about each symptom (as sketched below).

Blog by Tanveer Hurra: source

Also, if you do have symptoms, go get checked and isolate!

However, the trick with this project is getting the symptom data. It’s not easy, but the more symptom data you get, the better your predictions!
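Assuming you have managed to pull together a table of binary symptom flags and test outcomes, a minimal scikit-learn sketch might look like the following (the column names and the tiny data set are entirely made up for illustration):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# hypothetical toy data: 1 = symptom present / positive test, 0 = absent
df = pd.DataFrame({
    "fever":           [1, 1, 0, 0, 1, 0],
    "dry_cough":       [1, 0, 0, 1, 1, 0],
    "fatigue":         [1, 1, 1, 0, 0, 0],
    "tested_positive": [1, 1, 0, 0, 1, 0],
})

X = df.drop(columns="tested_positive")
y = df["tested_positive"]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# (a) how likely a new set of symptoms is to be a positive case
new_case = pd.DataFrame([{"fever": 1, "dry_cough": 0, "fatigue": 1}])
print(model.predict_proba(new_case)[0, 1])

# (b) which symptoms the model leans on most
print(dict(zip(X.columns, model.feature_importances_)))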

Graph Databases

Social distancing is a big deal in the pandemic because the virus can transfer from person to person quite quickly over short distances. Given that, if an individual tests positive, it’s important to understand (a) who is in their network of people (which is actually fairly easy to identify) and (b) how likely each of those people is to have been infected. This allows policymakers to trace who may be infected and to isolate them quickly.

Given that, Nebula Graph is an open source project that allows users to generate graphs and determine connections between entities, in this case people and places. A graph is loaded with data on both sick and healthy people, along with the addresses people were travelling to, hoping to answer how people get sick when no one they came into contact with was sick at the time of contact.

The blog by Min Wu is really insightful here, and despite it not coming with code, it’s not a difficult project to translate into Python.

My recommendation would be to first build a model with randomly generated data, and then to find a real data set or generate your own within your network!

Photo by Nicholas Sampson on Unsplash

Despite us all being in lock down, there’s a surge in creativity in the space of machine learning as lots of new problems are being posed. New problems require smart solutions, and thankfully Machine Learning is able to play its part.

Hopefully, you’ve looked into the above and tried to take a stab at some of the projects. Some are easier than the others, but any forward steps you do make can surely make a huge difference!


Thanks for reading again!! Let me know if you have any questions and I’ll be happy to help.

Keep up to date with my latest work here!

Rating the 8 Top Python IDEs in 2020

PyCharm has Competition

Photo by Hybrid on Unsplash

The IDE you use can completely change your experience when programming. Especially in the early days when you’re learning, you can find it quite challenging if the IDE you use isn’t geared towards solving the problem that you face.

At its best, programming is an expression of creativity, because we, as researchers, are trying to solve big problems. And it’s that expression of creativity which keeps us wanting to solve problems, so we need good tools to do so.

An IDE (integrated development environment) is a software application that provides facilities to programmers for software development.

It’s what Microsoft Word is to Writers. It’s what Adobe Photoshop is to Creators. It’s where we do our work.

My own journey in programming started with years and years of IDLE, before moving on to the Sublime text editor, then PyCharm, and then Notebooks. However, there are a number of other IDEs, listed below, that are worth exploring.

In the following article, I’ll cover the following IDEs, each given a score based on my opinion. If you disagree with any, let me know!

  1. IDLE (5/10)
  2. Jupyter Notebooks (7/10)
  3. PyCharm (9/10)
  4. Sublime (6/10)
  5. Spyder (4/10)
  6. Atom (7/10)
  7. Eric (8/10)
  8. VScode (8.5/10)
Photo by 青 晨 on Unsplash

IDLE

IDLE was my first development environment when I started programming. I favoured it for a long time, partly because it was already installed on my computer, and also because it was just easy to use.

My local Python IDLE IDE

As a beginner, you want to be able to see the fruits of your labour quite quickly, and the command line interface allows for just that. Using the IDE as a quasi-calculator and quasi-script-runner meant I could physically see what I was creating and how every line of my code mattered.
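
For instance, a couple of throwaway lines in the IDLE shell might look like this:

>>> 2 + 2
4
>>> round(3.14159, 2)
3.14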

Now IDLE stands for “Integrated Development and Learning Environment”. It’s coded in 100% pure Python (using tkinter), and is cross-platform: working mostly the same on Windows, Unix, and macOS. Its functionalities are as basic as they come, but include:

  • Colorising of code input, output, and error messages
  • A multi-window text editor with multiple undo, smart indent, call tips, auto-completion, and other features
  • A debugger with persistent breakpoints, stepping, and viewing of global and local namespaces

If you’re just starting off on your journey into programming, I’d highly recommend using IDLE because you see the fruits of your labour quite quickly, and as a beginner you just want to build quickly, fail quickly, and iterate.

However, if you want to build anything substantive, it’s just a bit limited in what it offers. You’ll see later that your IDE should be geared towards the type of project you have (I’d split general coding into either scientific computing or production software), and IDLE sits somewhere in between. Code debugging, project management, quick searching, and visual displays are all tasks we regularly perform when coding, and IDLE just doesn’t provide much in the way of these.

Given that, I give it a low score of 5/10. Easy to use, but not that expansive.


For the Visualisers: Jupyter Notebooks

Now if you want to work in a more structured manner (as, I think, most of the Data Science community does), I would highly recommend Jupyter Notebooks.

Jupyter runs in your browser and is super lightweight. Its purpose is to present and structure your code in a report-like format, which is aesthetically quite pleasing. Its interface is actually quite similar to Mathematica and SageMath, but it has become much more popular.

Screenshot from my local Notebooks

Functionally, Jupyter does have limitations: you can’t really use anything you make here in a production environment (unless you ship it across to .py files), as Jupyter Notebooks are stored in a JSON format, so you need purpose-built Python files for anything you want to take away. Moreover, broader functionality that is less about research and more about software engineering is where Notebooks really falls short.
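
For example, if you do want to ship a notebook across to a plain .py file, Jupyter’s bundled nbconvert tool can do it from the command line (the notebook name here is just an example):

$ jupyter nbconvert --to script my_notebook.ipynb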

Take version control. It isn’t really a thing with notebooks (it’s not very natural, at least). For example, if you want to share some code, you could send the notebook: but what if you update something on your side, or your colleague updates something on their side? Would you have to keep sending notebooks to each other?

Moreover, features like autocomplete, automatic code refactoring, code profiling, version control integration, and database tools are all things you just don’t get in Notebooks. But do you care?

At the end of the day, it depends on how you use an IDE. For me, I use Notebooks more than anything else because I need to visualise my results and monitor them continually in a manageable way. Given how comfortable I am with Notebooks, and given that it’s so geared towards research rather than production (if at all), scoring it on production-based tasks is futile. As a pure research development environment, Notebooks is awesome. 8/10.


For the Production User: PyCharm

PyCharm is an IDE that’s been built to make programming in Python as efficient as possible. From searching through entire repositories, to debugging to deployment, PyCharm is built with programmers in mind. Hand on heart: PyCharm is a fantastic IDE.


As I’ve said before, every person codes a little differently, but for me, I use PyCharm to write my production software. The reason is that tasks like debugging, testing, profiling, integrating, and all the other things involved in creating production-level code just work right out of the box. You actually need to set up very little.

For example, PyCharm even has a shortcut to reformat code to make it more readable. It’s something I feel really strongly about, so it makes me happy to see that the engineers at PyCharm do too.

Note: PyCharm has both a Community and a Professional edition, and if you can afford it, the Professional edition is worth it. However, the Community edition is still fantastic, and I would recommend learning to use it.

Now, PyCharm is a bit difficult to get used to. I’d consider myself an OK coder at best, and it took me a long time to fully get to grips with debugging. It’s not that it’s particularly difficult: it’s more that PyCharm has so many features that at times you can feel overwhelmed.

However, over time you’ll learn more and more of PyCharm and eventually you won’t be able to live without it. I work symbiotically between PyCharm and Notebooks, which works very well for me. PyCharm even has a new native Notebooks tool (that, admittedly, I’ve not spent much time with). Given that PyCharm can do everything you want and doesn’t cut corners anywhere, I really do think it’s fantastic, so I give it a 9/10.


Other Python IDEs

Sublime

Sublime is a text editor that sits somewhere between PyCharm and IDLE. It has quite a few impressive tricks, like multiple selection, split editing, impressive performance, and cross-platform support. However, its breadth of functionality is nothing compared to PyCharm’s.

When you first play with Sublime, you’ll find yourself loving the feel of coding in it. Everything works quickly and it’s really easy to write a lot of code in it. This makes me wish IDLE would use more of what Sublime has to offer, but, to me, Sublime comes up a bit short in that it’s just not a native Python IDE.

For example, you can’t really do step-by-step debugging as you would in say PyCharm. After a while, this can become quite frustrating, especially when your projects are at an industrial scale. You’ll always find yourself coming back to PyCharm for one functionality or another.

It absolutely smashes the aesthetics of coding, so it scores highly on that front, but it suffers a bit on breadth of functionality. It’s more comparable to PyCharm than to Notebooks, I’d say, and as such, I’d have to give it 6/10.

Spyder

Visually, Spyder is an awful lot like Matlab. It has the same variable explorer frame in the top right-hand corner, a place for charts in the bottom right, and the coding pane on the left. It’s intended for scientific computing with Python, which is reflected in its feature set, its packaging, and the overall behaviour of the IDE. For me, though, the whole feel of the product can often be clunky compared to Notebooks or PyCharm.

Atom

Now, Atom is something I haven’t used myself, but I’ve read fantastic reviews about it. Atom describes itself as a “hackable text editor for the 21st Century”. It’s maintained by GitHub, so as you’d expect it can do pretty much anything you can imagine. However, Atom isn’t really lightweight (it’s about 400MB including its dependencies), but even on weaker systems it runs fine, assuming you can spare the memory!

All in all, Atom looks great in the beginning and you could use it instead of, say, Sublime or IDLE. Atom works with a lot of plug-ins, so as you learn, it makes sense to search for and install these plug-ins one by one. That helps you appreciate the significance of each, instead of being thrown into the deep end like in PyCharm.

Given that Atom is clean to use and beginner friendly, if you’re thinking about using Sublime, it’s definitely worth trying Atom as well. 7/10.

Eric

Eric is designed to be an everyday editor as well as a professional project management tool. Its offering is pretty strong: it offers real-time collaboration on code (how awesome?) and includes a plugin system that makes it easy to extend the IDE’s functionality with plugins downloadable from the internet.

Now, the IDE is a bit busy, but it’s packed full of features. It supports standard tasks like code folding, code completion, and brace matching. It also has an integrated class browser and a pretty strong code debugger, plus support for unit tests, and it can debug both multithreaded and multiprocessing programs. Moreover, it supports version control natively with Mercurial and SVN, and Git through plugins.

Given how broad its feature set is and how easily you can expand it with plugins, you’d want to compare it to PyCharm. But the problem with this comparison is that PyCharm is just so good: it’s the FC Barcelona or Michael Jordan of IDEs. Eric would struggle to beat it given the resources that have been ploughed into PyCharm.

Nevertheless, Eric might be tough to get off the ground, but once you get going it’s good, seriously good. 8/10.

Photo by Michael Lee on Unsplash

Updated to Include: VSCode

I’d like to thank the readers of the first edition of my article who noted that I hadn’t included VSCode. Here it is!

VSCode is a free, open-source editor developed by Microsoft. Natively, it supports a few languages, but with Microsoft’s Python extension you get full Python support.

Here is where it gets interesting.

VSCode is intended to have a broad feature set, so PyCharm is its natural competitor. Both have intelligent code completion, full-text search, syntax highlighting and bracket matching, Git integration, code formatting and linting, debugging, and a lot more.

However, PyCharm is packed full of features and because of that, it runs with quite a high memory requirement: VSCode runs with about 30% less!

Moreover, PyCharm is part of the JetBrains family, so plugins largely have to go through the JetBrains ecosystem; over 3,000 are currently on their website. VSCode, on the other hand, was designed as a barebones editor that becomes a full IDE via its extensions. Because of how it’s developed, VSCode can be customised much more easily by the user.

This is quite an important point because VSCode genuinely has extensions for everything. I’ve looked into ease of Docker container usage and IPython Notebook extensions, and both are comfortable in both IDEs. Even Reddit can’t decide which is better.

PyCharm is really the full production package, and if your computer isn’t too bothered by the memory requirement, it’s probably the better bet just because it’s an industry standard. However, if you prefer something a bit more lightweight and versatile, VSCode is great.

Great, and maturing with time: 8.5/10


Given that, if you’re just starting out in coding, you should really take a look at the projects you want to complete. If you want to build some funky deep learning networks and research the latest tech, Jupyter Notebooks is going to be for you; generally speaking, it’s the most widely used interface for research purposes.

However, if you intend to deploy software for a client and need robust code that’s going to work 24/7, you’re definitely better off using PyCharm. You’ll need the broad functionality and the integrations with various databases, version control systems, and libraries, and PyCharm makes all of this super easy. Shout out to VSCode, which is also very good, extensible, and free!

The IDE you choose is really important because, depending on how you expect to code, an IDE can make things easier or a lot more difficult for you. For a long time, my research was impeded by the fact that I was using IDLE rather than something like Jupyter Notebooks. I even almost left Python altogether and used Matlab for a long period of time because it just felt so much more natural to code in.

So think about it a bit: what do you actually want to build?


Thanks for reading again!! Let me know if you have any questions and I’ll be happy to help.

Keep up to date with my latest work here!

Can you spot a DeepFake?

Facebook’s Contest proves it’s tougher than you think!

Can you figure out which are real, and which are fake? Answer at the end of the article. [Image Credits: Facebook]

As a machine learning enthusiast and practitioner, news of the results of Facebook’s AI contest pricked up my ears.

Note: Facebook’s Public Announcement of the Results

Starting back in September 2019, Facebook AI challenged some of the best academic institutions to develop an algorithm that could identify whether a video was generated by AI or was real.

Universities including Oxford and Berkeley started with a training dataset of over 4,000 videos; by November 2019, 115,000 videos had been released and the competition had been expanded onto Kaggle.

Ultimately, the most competitive model could identify about 85% of fakes in-sample, but out of sample this dropped considerably to 65%! That’s certainly better than chance, but it’s not as good as we would have hoped.

The reasons why models perform differently in and out of sample are complicated, but they come down to how well the machine learning model can generalise. If it recognises a certain image, a good model should also recognise that image when it is rotated. A model that cannot generalise, however, will not be able to recognise unfamiliar samples.

Now remember, Facebook are clever.

Facebook had generated fake videos in a variety of ways so they could reflect the diversity of how deepfakes are currently made: methods such as image enhancement, plus additional augmentations and distractors such as blur, frame-rate modification, and overlays.

They also took advice from the universities on how to make the deepfakes even harder to identify. All in all, they made the problem difficult not in just one or two ways, but in a wide variety of ways: enough that it’s difficult to hard-code every permutation.

In the testing phase, each participant had to submit their code into a black-box environment, and from there a further 10,000 videos were passed through each contestant’s model to see how well it would perform.

Here is how it becomes tricky

Videos were then altered in ways outside the scope of the training data set, e.g. by adding random images to each frame and changing the frame rate and resolution. These are common methods for distorting images, and they were used to increase the difficulty level. The results indicate that the models developed could not fully adapt to these new settings.


Methods that Competitors Used

Attention Dropping

Microsoft Research developed a “Weakly Supervised Data Augmentation Network (WS-DAN)” that explored the potential of data augmentation, whereby each training image is first represented in terms of its object’s discriminative parts, and then augmented in ways that include attention cropping and attention dropping. This guides the learning procedure not to overfit, as more discriminative features are being identified.

Secondly, the attention regions provide an accurate location of the object, which allows the model to look at the object more closely and further improve performance.

See Better Before Looking Closer: [paper]

In relation to this problem, this allows the model to ‘see’ the pictures better and in more detail to discern discriminative face parts. These types of fine-grained visual classification seem to provide an edge.


Gaining an edge in their Custom Architecture

Architecturally as well, many of the participants had used pretrained EfficientNet networks, but some found an edge in the manner in which they combined predictions from an ensemble.

Ensemble methods are common in machine learning and the higher performers in this challenge showed that an ensemble approach is also useful for dealing with deepfakes.
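
As a toy sketch of what that can look like in code, here is simple unweighted prediction averaging across an ensemble; the models and their predict method are hypothetical stand-ins rather than any competitor’s actual pipeline.

import numpy as np

def ensemble_predict(models, video_frames, threshold=0.5):
    probs = [model.predict(video_frames) for model in models]  # one fake-probability per model
    avg_prob = float(np.mean(probs))                           # simple unweighted average
    return ("fake" if avg_prob >= threshold else "real"), avg_prob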

Photo by Christian Gertenbach on Unsplash

Non-Learned Enhancements

Finally, an interesting point: none of the top performers used any investigative methods such as searching for noise fingerprints or other characteristics that derive from the image creation process. Given that none of the finalists used these methods, it suggests they either aren’t useful or just aren’t widespread. Either way, there’s scope for research in this space.


The results of the competition showed that deepfake videos are hard to identify because detecting them requires well-generalised models. We’ve seen time and time again that machine learning models are often over-fitted to a certain problem, so that if the input to a model is altered (such as an image being rotated), the model can no longer identify what the image is.

That being said, robustness methods are in growing demand to ensure that models keep working, and a lot of work is being done in this space by the big players in the field. Progress here will be quick, but as lockdowns around the world continue and more people spend even more time on the internet, demand for this technology can only increase.

In the first video above, clips 1, 4, and 6 are original, unmodified videos. Clips 2, 3, and 5 are deepfakes created for the Deepfake Detection Challenge.


Thanks for reading again!! Let me know if you have any questions and I’ll be happy to help.

Keep up to date with my latest work here!

Ternary Conditional Operators in Python

Mastering Efficient List and Dictionary Comprehension

Photo by Belinda Fewings on Unsplash

Python is versatile, and its goal is to make development easier for the user. Compared to C# or Java, which are notoriously cumbersome to master, Python is relatively easy to get good at. Moreover, it’s relatively easy to get pretty damn good at.

List and dictionary comprehension are widely used, but something I find a bit less used (especially by beginners) is the ternary conditional operator. It really streamlines your code, making it both more readable and more concise. Just don’t make it too complicated!

They’re pretty easy to get your head around, so let’s get into it.


Ternary Conditional Operators

The rationale behind these operators is quite simple. Beforehand, if you had an if-else type of problem, you would have to write the following pretty chunky piece of code:

import numpy as np

if np.random.rand() > 0.5:
    x = 1
else:
    x = 0

However, we’re all so familiar with this paradigm and it’s used so often that surely there must be a quicker way. With ternary conditional operators, we can condense the code to the following:

x = 1 if np.random.rand()>0.5 else 0

Which, just like the chunky version above, sets x = 1 if the condition is True and x = 0 otherwise.

Pretty easy right?
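
One word of caution on the “don’t make it too complicated” point: ternaries can be chained, but readability drops off quickly. A quick sketch (score is just a made-up variable):

score = 0.7
label = "high" if score > 0.8 else "medium" if score > 0.5 else "low"  # evaluated left to right
print(label)   # "medium"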


List (and Dictionary) Comprehension with Ternary Conditional Operators

Most people who use Python are familiar with List comprehension. List comprehension is an elegant way to define and create lists based on existing lists.

Say we wanted to apply a function to every item in a list (a very common problem!). An inelegant way to do it would be as follows:

new_array = []
for x in array:
    new_array.append(custom_function(x))

With a list comprehension, this can be made a lot more concise:

new_array = [custom_function(x) for x in array]

But quite often, the problem may be a little bit more complicated, so for that, we could look towards the Ternary Conditional Operator.

So take the following problem, where we loop through the items of an array and, given a condition, set a new value for each item:

new_array = []                 # Define a new array
for x in random_array:         # Loop over the list
    if x > 0.5:                # Condition
        new_array.append(1)    # Result One
    else:                      # Otherwise...
        new_array.append(0)    # Result Two

This can then be efficiently condensed to the following:

new_array = [1 if x>0.5 else 0 for x in random_array]

See how much easier and efficient that is?

Likewise (and for added bonus points), here is example code for a dictionary comprehension with a ternary conditional operator. We’re simply creating a mapping dictionary which is 1 if an item of Z is in Y, and 0 otherwise:

Y = ['h','e','l','l','o']
Z = ['a','e','i','o','u']
new_dict = {x:1 if x in Y else 0 for x in Z}

where new_dict now looks like the following:

{'a': 0, 'e': 1, 'i': 0, 'o': 1, 'u': 0}

Ta da! Super quick, easy, and clean!

You can see that the code is so much tidier, as you contain the problem in a pretty elegant one-liner, not to mention the reduced number of headaches and comments you would have needed before.

Photo by Pineapple Supply Co. on Unsplash

Coding efficiently is the whole point of Python. It’s not the fastest language (some benchmarks put it at 44 times slower than C#) and arguably not the most widely used (that’s debatably JavaScript), so with this design ethos in mind, it’s worthwhile making your code as readable as possible.

Methods like Ternary Conditional Operators and List Comprehensions make your job as a Researcher or Scientist 100x easier because you can just fly through your research.

For me, my research process usually involves a few steps, and each one can be small but significant. Ensuring your code is airtight and efficient is by far the best way to stay focused on the bigger picture rather than getting bogged down in massive amounts of code.

I would definitely encourage you to use this method!


Thanks again! Please message me if you have any questions, always happy to help!

Keep up to date with my latest work here!

What does “__name__” mean in Python?

How and why we use __name__=='__main__'

Photo by Nick Fewings on Unsplash

Python is such a popular language because it’s easy to use and it’s comprehensive. Previously, I covered what the keyword yield is useful for, and after some great feedback, I’ve decided to tackle another key feature: __name__.

Let’s get straight to it.

Now, quite often you’ll see a file in the following format:

import library

def main():
    # some functionality...
    return library.function()

if __name__ == '__main__':
    main()

At first, I struggled to understand why the file wasn’t simply executed from top to bottom, but after learning about it, I began to appreciate its simplicity.

It comes down to a few things:

  1. You have some functions that people can use elsewhere
  2. You also want to run that file by itself at times

So it’s a bit of a quasi-library type of file. To appreciate it fully, you should first understand what values __name__ can take.

What values can __name__ take? How does it work?

Let’s say that we have a file called foo.py and the only thing in the file is the following:

print(__name__)

Now if we run the following line on the command line:

python foo.py

We will get the following result:

__main__

But on the other hand, say we have another file bar.py that contains:

import foo
print(foo.__name__)

And if you run python bar.py you will get the following output:

foo

See the difference?

If you run the file itself, then __name__ == '__main__'; but if the file is imported by another file, its __name__ is set to its module name instead.

So __name__ gets its value depending on how the containing script is executed. Only the file that you run directly has __name__ set to '__main__'.

Photo by Keith Luke on Unsplash

So what’s the purpose of __name__?

This feature allows you to create a script that both runs by itself and lends its functionality to other scripts, minimising duplicated code.

So when you write a script containing a library of functions that can be useful elsewhere, this feature comes in really handy.

Moreover, it allows you to create code that’s more efficient and dynamic for multiple users. You definitely want code that can easily scale as your team grows.
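
To make that concrete, here is a small sketch; the file name and functions are made up. The same file can be imported for its helper or run directly as a script:

# maths_utils.py (hypothetical example)
def add(a, b):
    return a + b

def main():
    print(add(2, 3))          # quick manual check

if __name__ == '__main__':
    main()                    # runs only with `python maths_utils.py`, not on import

Another file can now import maths_utils and call maths_utils.add(...) without triggering main().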


Python is such a popular language because features like this are well thought out and make life for the coder significantly easier. As more languages streamline their functionality, coders have to carry less of the load and the language can do more of the work.

Given that, features like the one I’ve just explained can make life a hell of a lot easier, so I encourage you to use them!


Thanks again for reading! If you have any questions, please message =]

Keep up to date with my latest work here!
