Solving XKCD 1683

xkcd 1683 - “Digital Data”
Let’s say you are building an archive of the Internet, including all of the cat pictures and funny memes. You only have a limited amount of space, so you really want to use that space in the best way you can.
As illustrated, one way that digital data can “degrade” is through reposts and re-uploads: sharing images and videos often involves a process that performs a lossy transformation, such as re-uploading to media hosting platforms which re-compress all uploaded files, or saving the image by taking a screenshot or recording a video on their device. (This phenomenon is known as generation loss.) So, if you have many copies of the same image of various quality, you want to save the one closest to the original, i.e. the one with the least amount of noise. Looking at millions of cat pictures would be a nice way to spend a lifetime but would take too long to actually archive them—can we build a program to do that for us instead? Continue reading →