Solving XKCD 1683

xkcd 1683 - Digital Data

xkcd 1683 - “Digital Data”

Let’s say you are building an archive of the Internet, including all of the cat pictures and funny memes. You only have a limited amount of space, so you really want to use that space in the best way you can.

As illustrated, one way that digital data can “degrade” is through reposts and re-uploads: sharing images and videos often involves a process that performs a lossy transformation, such as re-uploading to media hosting platforms which re-compress all uploaded files, or saving the image by taking a screenshot or recording a video on their device. (This phenomenon is known as generation loss.) So, if you have many copies of the same image of various quality, you want to save the one closest to the original, i.e. the one with the least amount of noise. Looking at millions of cat pictures would be a nice way to spend a lifetime but would take too long to actually archive them—can we build a program to do that for us instead?

Let’s look at some simple approaches first:

  • Look at the resolution? Unfortunately no, because images are often upscaled, sometimes even in a lossy way.
  • Look at the file size, and pick the biggest file? No, because if you save a JPEG as a PNG, or even as a maximum-quality JPEG, you will get a bigger file with no gain in quality. You will also spend precious bits on carefully preserving worthless compression artifacts from the previous generation.
  • Look at the date, and pick the earliest seen file? That might work if that’s available and accurate, but it often isn’t or is not a reliable indicator.
  • Look at the pixels, and look for compression or rescaling artifacts? That sounds promising, but how exactly would we do that?

Analyzing images in order to look for patterns is adjacent to computer vision / machine learning, so let’s try that. We have:

  • Lots of easily acquired memes and cat pictures, which we can use for training;
  • A few reposts and their originals seen in the wild, which we can use for final evaluation.

To create the training data, take the training images, and apply some set of random lossy transformations to each one, such as:

  • Cropping
  • Resizing (using different scaling algorithms and their parameters)
  • Quantization (reducing the amount of colors, with or without dither)
  • Saving as various lossy formats (JPEG, GIF…) at various quality levels.

You can do all of the above with ImageMagick.

In this way, we can create pairs of images, in which one is closer to the original and one further. If our algorithm is to output a number N indicating how much “noise” there is in the image, we want it to produce a higher number for our edited version.

Next comes a preprocessing step to prepare the data into a raw format that can be fed into a neural network. Here I used D (with my functional image processing library), which performed really well for this task. Highly recommended!

I went into this knowing absolutely nothing about machine learning, so from here on it was a cycle of running into a problem and trying random stuff until it goes away. In rough order, the problems and what worked to solve them:

  • Problem #1: I know nothing about machine learning and don’t know where to start!

    Solution: My first attempt was using Netflix’s Vectorflow, as it was written in D, but it only runs on the CPU. You almost certainly want to train your networks on a GPU though, so I quickly migrated to Python / TensorFlow instead.

    The best way I found to get started is to grab and play around with the Keras examples. Get them working and get a feel for their requirements and environment, then pick one that’s closest to your problem and begin adapting it towards it.

  • Problem #2: How to use an AMD GPU on Linux for machine learning?

    Solution: Machine learning environments seem to heavily favor NVIDIA hardware currently, but AMD provides ROCm as a CUDA alternative. The easiest and most reliable way to run it that I found is to use the Docker images. Not only is it very easy to set up, but it also has the benefit of pinning the entire userspace, which is pretty important considering that the very rapidly evolving world of machine learning software packages currently have lots of fragile inter-dependencies.

  • Problem #3: Great! Now what?

    Solution: Let’s start by breaking the image into 1x8 chunks and feeding that into a simple deep neural network. For now, let’s not worry about how to properly model a “which is better” problem, so let’s just fit the network to produce a single number per image, which we want to approach 0 for our original image and 1 for our edit. We fit the network for each chunk towards the desired result, and then use the average of the output as the image score. We’ll use 75% of our generated pairs for fitting and 25% for validation.

  • Problem #4: It sort of works (better than a coin flip, at least), but the results seem to vary wildly based on the image’s resolution.

    Solution: Let’s put a limit on the number of chunks, so that the network has to process a constant amount of data regardless of the image size.

  • Problem #5: It works a little better, but it should be doing better at recognizing JPEG compression artifacts.

    Solution: Let’s make the chunks 8x8 (the size of the JPEG minimum coding unit), and pre-apply a discrete cosine transform (DCT) on them.

    DCT is a step in JPEG’s compression algorithm, so doing the transform ourselves allows us to see more clearly the amount of information that was preserved following the JPEG compression. To illustrate:

    A photo of a European wildcat with the JPEG quality increasing, from left to right. Image source.
    A visualization of the same image after applying a DCT. Note how the information density increases from left to right.

    Feeding the DCT output into the network should help it figure out how much information was present in the original image, and how much “detail” it sees in the raw pixels is actually due to JPEG compression artifacts.

  • Problem #6: It works a little better, but it should be doing better at recognizing resizing artifacts. For example:

    Original image.
    Simple point sampling creates noticeable artifacts in the form of repeating or missing rows, especially visible at non-power-of-two ratios.
    Some algorithms, such as Lanczos, produce ringing artifacts, e.g. pixels that are darker or brighter than any of the pixels in the input image.

    Solution: Let’s add the raw image pixels back, and add some 2D convolution layers to process them. Since it doesn’t make sense to calculate a 2D convolution over DCT-transformed data, we can feed that into a separate branch of the network as a new input, then combine the result.

  • Problem #7: It works a little better, but it should be doing better at recognizing e.g. cropped images (saved with lossless compression).

    Solution: Let’s add some metadata about the entire image, such as resolution, file size, and image format. The metadata can be provided as a third input fed into its own branch. It will have the same values for all samples taken from the same image file.

  • Problem #8: Now it seems to be doing much better on the training data than on the validation data.

    Solution: The network is being overfit to recognize specific images based on their metadata. Add some randomness to metadata values and vary it per-sample to prevent this.

  • Problem #9: It seems to work poorly for images with lots of blank space at the top.

    Solution: Select the chunks randomly (or, at least, evenly) from the entire image, duh.

  • Problem #10: It seems to work poorly for photos with a lot of film grain, classifying them as having more noise than shrunk edits.

    Solution: Add a total color count metadata input, to help the NN distinguish them from images that have had a lossy quantization applied to them.

    Natural film grain.
    Quantization with Floyd-Steinberg dithering. Note the rough similarity.
  • Problem #11: If the network’s output is smaller than zero or greater than one, we shouldn’t penalize it for “overshooting” our goal.

    Solution: Use the sigmoid activation function on the final layer.

  • Problem #12: The validation accuracy is significantly lower than the training accuracy.

    Solution: The model is overfitting again. No problem, just generate more samples. (64 ➙ 512)

  • Problem #13: It sort of works, but it’s still nowhere close to accurate, and I have no idea what to do next. There are so many different kinds of layers, optimizers, activation functions, and other hyper-parameters, how do I know which ones are best to use for solving this problem?

    Solution: Well, this is probably extremely stupid, but … why not build a neural network to build our neural network? The algorithm is as follows:

    • Describe the structure of our main problem’s neural network as some discrete representation, such as a 2D array. Each cell in this 2D array represents a decision regarding one specific element of it, such as which activation function to use for a particular layer.

    • Encode our current model into this representation as a starting point.

    • Create a new model which we will use for evaluating hyper-parameters for our main model. This hyper-search model accepts as input the above 2D array, and produces as output a single number, which is a metric for how well it predicts this model will perform.

    • In a loop, do this:

      1. Fit the hyper-search model using all configurations that we tried so far, and their performance.
      2. Generate 10,000 random configurations, and ask our hyper-search model to predict their performance.
      3. Pick the random configuration that has the highest predicted performance score.
      4. Run it for 10 minutes, and make note of its final result. Use the validation accuracy for this purpose. (I found that a model’s performance within a small fixed time frame does roughly correspond to its performance if left to fit until it plateaus.)
      5. Add the configuration and its outcome to the history used in step 1.
      6. Repeat.

      Because in each iteration the hyper-search model starts with random parameters, it will quickly “try out” all parameters it hasn’t tried yet. In the first many iterations, it will predict astronomical results for non-sensical models, but eventually it learns better.

      Here is my result:

      Notable surprises:

      • So many dropout layers!

      • Wow, the length of the JPEG DCT branch is pretty long. Probably totally not because, as I found out much later, I messed up my DCT inputs (but not enough to prevent the model from wrenching some information out of them anyhow).

  • Problem #14: Wow, that actually worked!? It seems to be working really well (86% accuracy on validation data)! But, when I try to use it on my real-world final evaluation samples, the result is much worse!

    Solution: Using the validation data to tune the hyper-parameters must have caused it to over-fit over the validation data—maybe I shouldn’t have done that. Oh wait, that’s not what happened at all—it performs equally well on completely different (but generated in the same way) inputs. The actual reason is that the lossy transformations we used to generate the inputs that we trained the network on don’t correspond well to lossy transformations actually seen in the world. Have a closer look at real-world data, try to model it more accurately when generating our training inputs, and retrain the network.

  • Problem #15: That works better, but it seems to be doing poorly for images with lots of blank space (in general, not just at the top).

    Solution: Let’s calculate an “entropy” metric per sample, which represents the amount of data per 8x8 block. We can use the sum of the absolute differences of adjacent pixels for this purpose. Sort the chunks by entropy (descending), and pick the top ones as the samples to use.

    This image is rather monotonous, with little information to work with in most areas. Image source.
    By adding up changes from one pixel to the next, we can select blocks with more information.
    The top 512 blocks picked.

  • Problem #16: That works better, but using the average of the result for all samples still seems sub-optimal. Not all samples are equally useful, surely we can ask the model to itself decide how much information to use from every sample?

    Solution: I don’t know how to model a network that accepts an unordered set as input, so I used an LSTM layer (which takes ordered inputs). Results from the main model are fed into it in increasing entropy order (most entropy last). Maybe analyzing an unordered set could be done by repeating the same (i.e. with shared weights) dense layer n times (n = input length) and making it output two values, output and weight, but I don’t know how to implement that in Keras.

  • Problem #17: But what if the number of timesteps is lower than the layer sample size?

    Solution: Add a “presence” input, which is 1 when the other inputs are populated, and 0 when they’re not.

  • Problem #18: That works much better (92%), but is there a better way to model the problem of comparing two images? After all, if we compress and recompress the same image five times, all of the intermediate steps are going to appear on both sides of the input, meaning we would be trying to fit the network with the same inputs but towards polar opposite labels.

    Solution: We train our network on pairs of images, but what we actually want in the end is a model that outputs one number given one image. We can achieve both by manually creating an artificial comparator layer:

    The structure of DNN1 and DNN2 is the same, and each layer has the same weights as its counterpart. So, each branch above the comparator is an instance of a sub-model that processes a single image. Importantly, the output is a number between 0 and 1 (so probably the last layer should use sigmoid activation).

    As for the comparator, it is a dense layer with two inputs and one output, using sigmoid activation. The weights are [+1e9, -1e9] and the bias is 0. This way, it outputs 1 when the first number is bigger than the second, and 0 when the second number is bigger than the first. We freeze the weights when adding the layer to the model.

    In this way, we can actually describe the problem that we are trying to solve to the system, without resorting to approximations as before. Instead of pulling images we edited towards 1 and images we didn’t edit towards 0, we are now pulling them away from each other, which is what we actually want to do. And, when we want to evaluate one single image’s score, we just have to take the left branch of the model, chopping off the comparator, and use that.

  • Problem #19: That’s great and everything, but now the model is not fitting at all - the loss remains constant!

    Solution: Oh, that’s because our weights are so big, the gradient becomes too small when back-propagating. Lowering them from 1e9 to 1e3 solves this, though it does cause an unfairly high loss when comparing numbers that are within about 1e-3 of each other (the result will be close to 0.5 instead of either extreme). Perhaps more elegant solution exists, such as a special function layer for this purpose.

After all of the above, I get 96.5% accuracy on the generated validation data, and 100% accuracy on my real-world final evaluation samples, which is much, much better than I would have anticipated. The model actually pointed out a few bogus entries in my evaluation data (such as image pairs where I thought one was an edit of another, but actually both were edits with different lossy transformations of an original I did not have).

You can find my code and my model on GitHub. (Though, if you intend to use this for anything serious, you may want to re-run the hyper-parameter search, or at least retrain the model on your data set.)

With the model working on our test data and evaluation samples, we can now run some experiments on synthetic tests. One such test would be to repeat the experiment on Wikipedia’s generation loss article: recompress the same image multiple times (rotating it by 90 degrees every time), and query the model’s results for the resulting images.

100 iterations
300 iterations
1000 iterations

Curiously, performing this experiment on the European wildcat photo causes it to converge towards monochrome stripes rather than gray noise, even though the same software was used (ImageMagick).

Let’s see what our model thinks of these images:

Pretty good! It unquestionably identifies the original, and close successors, as having the least noise. It does apparently get confused a bit at around 100-200 iterations when artifacts begin to manifest due to repeated compression.

Finally, how does it fare with the problem statement itself?

Before we do that, it’s important to note that the model is not trained for comparing the quality of images of different things - it would be comparing apples and oranges. A badly over-compressed photo of an apple may score better than a perfectly crisp drawing of an orange, so the score is only meaningful when compared to scores from other versions of the same image.

With that out of the way, let’s give it a go:



The author would like to thank Sphinx, BeanLearning, kigaltan, and 4LostSoulsinaBowl for providing valuable feedback on previous versions of this article.