Codebox Software

Using a Neural Network to Detect Daleks


I've trained a neural network to recognise Daleks, and used it to locate them in video footage of old Doctor Who episodes. I think this might be the nerdiest thing I've ever done, which is no small claim.


Before I could build the network I needed two sets of training images: one containing Daleks, and one without. I used ffmpeg to extract screenshots at regular intervals from various Doctor Who episodes, then manually identified the images that contained Daleks, cropping and saving them one at a time. I'm not going to lie, this was a very dull process. The result was many images like these:
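The extraction step can be sketched roughly like this, assuming ffmpeg is installed. The file names and the five-second interval are illustrative, not the values used in the post:

```python
import subprocess

def ffmpeg_screenshot_cmd(video_path, out_pattern, every_seconds=5):
    """Build an ffmpeg command that saves one frame every `every_seconds`."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps=1/{every_seconds}",  # one frame per interval
        out_pattern,                      # e.g. frames/shot_%04d.png
    ]

cmd = ffmpeg_screenshot_cmd("episode.mp4", "frames/shot_%04d.png")
# subprocess.run(cmd, check=True)  # uncomment to actually run ffmpeg
```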

[Image captions: A Dalek · Another Dalek · And another one · Yep, another Dalek · Dalek · Kittens!! Just kidding, it's a Dalek]

To create the second set of images, I wrote a Python script that iterated through the screenshots I had identified as Dalek-free. The script selected small rectangular segments from each image, at random positions and with a variety of aspect ratios. These segments were saved as image files and became the second half of the training set:
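A minimal sketch of that negative-sample script, using Pillow. The crop-size fractions and image dimensions here are illustrative assumptions:

```python
import random
from PIL import Image

def random_crops(image, n=10, min_frac=0.1, max_frac=0.5):
    """Yield n randomly positioned crops with varied sizes and aspect ratios."""
    width, height = image.size
    for _ in range(n):
        w = int(width * random.uniform(min_frac, max_frac))   # random width
        h = int(height * random.uniform(min_frac, max_frac))  # random height
        x = random.randint(0, width - w)                      # random position
        y = random.randint(0, height - h)
        yield image.crop((x, y, x + w, y + h))

img = Image.new("RGB", (640, 480))  # stand-in for a Dalek-free screenshot
crops = list(random_crops(img, n=5))
print([c.size for c in crops])  # each crop would be saved as a training image
```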

[Image captions: Not a Dalek · Still not a Dalek · No Daleks here · Nope, still no Daleks · Dalek-free again · OMG Daleks!! jk its just some dude]

Rather than training the network from scratch, I used Transfer Learning to speed things up. This is a machine-learning technique in which the lower layers of a neural network pre-trained for one task are re-used for a different task, reducing training time. In my case I used the Xception model, pre-trained by the Keras development team on the ImageNet database. I removed the model's output layer and replaced it with a new fully-connected layer, which would learn to distinguish Daleks from non-Daleks.
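In Keras, that setup looks roughly like this. The input size, pooling choice, optimiser and single-unit sigmoid head are assumptions on my part, not the exact configuration from the post:

```python
from tensorflow import keras

def build_dalek_classifier(weights="imagenet"):
    """Frozen Xception base with a new binary Dalek / not-Dalek head."""
    base = keras.applications.Xception(
        weights=weights,       # "imagenet" loads the pre-trained lower layers
        include_top=False,     # drop the original 1000-class output layer
        input_shape=(299, 299, 3),
        pooling="avg",         # global average pool to a feature vector
    )
    base.trainable = False     # re-use the pre-trained features as-is
    model = keras.Sequential([
        base,
        keras.layers.Dense(1, activation="sigmoid"),  # Dalek vs not-Dalek
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

Freezing the base means only the new output layer is trained, which is why training can finish so quickly.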

The network took surprisingly little time to train, achieving a validation accuracy of over 99% in less than an hour on my rather old laptop.

The next step was to find the position of a Dalek within a larger image. Some new and very powerful techniques for this have been developed in the last few years, but because I had no particular performance requirements for this project I decided to keep things simple. I used a 'Sliding Window' approach, sampling rectangular regions of the image one at a time, moving across the image from left to right and top to bottom. The number of possible rectangles that could be tested this way was very large, so I experimented with different step sizes for moving the window across the image until I found a good compromise between accuracy and performance, at around 25 pixels. I also restricted the aspect ratios of the window regions so that they roughly matched the shape of a Dalek.
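The scan can be sketched as a simple generator of candidate boxes. The 25-pixel step comes from the post; the window heights and the roughly Dalek-shaped 0.75 width-to-height ratio are illustrative assumptions:

```python
def sliding_windows(frame_w, frame_h, heights=(120, 180, 240),
                    aspect=0.75, step=25):
    """Yield (x, y, w, h) boxes covering the frame left-to-right, top-to-bottom."""
    for h in heights:
        w = int(h * aspect)  # keep the box roughly Dalek-shaped
        for y in range(0, frame_h - h + 1, step):
            for x in range(0, frame_w - w + 1, step):
                yield (x, y, w, h)

boxes = list(sliding_windows(640, 480))
```

Each box would then be cropped from the frame, resized to the network's input size, and classified as Dalek or not-Dalek.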

In order to indicate where a Dalek had been located in the image I created a copy of the input image, and drew red rectangles over the areas where the network had detected one. The rectangles were transparent, and were overlaid upon each other to produce a heatmap effect like this:
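The overlay effect can be reproduced with Pillow: each detection composites one translucent red rectangle, so overlapping detections build up into hotter areas. The alpha value and box coordinates here are made up for illustration:

```python
from PIL import Image, ImageDraw

def draw_heatmap(frame, boxes, alpha=40):
    """Composite one translucent red rectangle per detected box."""
    out = frame.convert("RGBA")
    for (x, y, w, h) in boxes:
        layer = Image.new("RGBA", out.size, (0, 0, 0, 0))
        draw = ImageDraw.Draw(layer)
        draw.rectangle([x, y, x + w, y + h], fill=(255, 0, 0, alpha))
        out = Image.alpha_composite(out, layer)  # overlaps grow redder
    return out.convert("RGB")

frame = Image.new("RGB", (640, 480), "grey")  # stand-in for a video frame
result = draw_heatmap(frame, [(100, 100, 90, 120), (110, 105, 90, 120)])
```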

Dalek heatmap

Areas of the image with a lot of red indicated that the network had found many image segments in that location that contained a full or partial Dalek.

I extracted the frames from a couple of short video clips, and applied the process described above to each one, before stitching them back together to create new videos. You can see the results below.

Video 1 - Dalek Detected!

This video came out quite well: the Dalek is well localised by the network and is tracked correctly as it moves within the frame. For some reason there are a few false-positive detections where the network highlights areas of the desk and floor. I suspect these were caused by the training images I used - Daleks are often seen on or near flat white surfaces, and the network may have learnt to match these as well as the Daleks themselves.

Video 2 - Multiple Daleks!

This video is more interesting, but I don't think the network performs as well. The areas of the video that are highlighted most strongly do contain Daleks, but they are quite large and take in a lot of the surrounding area as well. I think this was caused by the sliding-window sizes I chose: smaller windows would give better localisation, but would require more steps to cover the entire frame, increasing render time.

It is interesting to see how the highlighted areas shrink and grow as the Dalek on the right of the frame is occluded by the figures walking in front of it. Also, when the third Dalek appears from the left, the network only highlights its upper portion. I think this is because the design of this Dalek is unusual: it has a dark-coloured lower half, which was probably not often seen in the training images.