File recognition

One of the issues we all run into as developers is how to recognize a file.
Most people look first at the file extension (if it’s there). We can all recognize a .zip extension as a compressed file. Unfortunately, this way of detecting a file type can’t always be trusted: the extension is just part of the file name, and anyone can change it. Attackers like to abuse this trust to crash your application or, even worse, to break into it.

So how do you catch those ‘camouflaged’ files?
Luckily there is another (better) option: looking at the file header (magic bytes). This ‘header’ is nothing more than the first couple of bytes of the file. For an overview of those magic bytes, you can look some of them up on Wikipedia or consult the more extensive list from Gary Kessler. In most cases this is enough to determine the technical type of the file.
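
To make the idea concrete, here is a minimal sketch in Python that compares the first bytes of a file against a handful of well-known signatures. The names SIGNATURES, sniff_type and sniff_file are made up for this example, and the table is deliberately tiny; a real one would be filled from the lists mentioned above.

```python
from typing import Optional

# A few well-known signatures (magic bytes). Illustrative only, far from complete.
SIGNATURES = {
    b"PK\x03\x04": "zip",
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_type(data: bytes) -> Optional[str]:
    """Return a label for the first matching signature, or None if unknown."""
    for magic, label in SIGNATURES.items():
        if data.startswith(magic):
            return label
    return None


def sniff_file(path: str) -> Optional[str]:
    """Read only the first bytes of the file and sniff its type."""
    with open(path, "rb") as handle:
        return sniff_type(handle.read(16))
```

Note that the extension plays no role here at all: a file named report.pdf that starts with PK\x03\x04 is reported as a zip.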

Why did I emphasize technical here, you might ask? Well, nowadays a lot of file types are really compressed containers; see for example the Microsoft Office files (e.g. DOCX and the other OOXML file types).
Those are nothing more than zip files, so you must look inside them to see which type they really are.
If you open such a file as a zip file (by renaming it from .docx to .zip, or by opening it with a decompression utility), you will find multiple files and directories inside.
If we take the docx type as an example, you can find its ISO description via the following link. On that page there is a subsection named “File type signifiers and format identifiers” with detailed information on how to detect the file.
For a couple of other file formats you can find the same information on this website (link); unfortunately this list isn’t complete.
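
Looking inside the container can be sketched with Python’s standard zipfile module. This is a heuristic of my own choosing, not the detection procedure from the format description: it assumes an OOXML package has a [Content_Types].xml entry at its root and that the main part sits at its conventional path (word/document.xml, xl/workbook.xml or ppt/presentation.xml). The names OOXML_MAIN_PARTS and sniff_ooxml are hypothetical.

```python
import io
import zipfile
from typing import Optional

# Conventional main-part paths; an assumption that holds for typical Office files.
OOXML_MAIN_PARTS = {
    "word/document.xml": "docx",
    "xl/workbook.xml": "xlsx",
    "ppt/presentation.xml": "pptx",
}


def sniff_ooxml(data: bytes) -> Optional[str]:
    """Return 'docx', 'xlsx' or 'pptx' if the bytes look like that package."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return None  # not a zip container at all
    if "[Content_Types].xml" not in names:
        return None  # a zip, but not an OOXML package
    for part, label in OOXML_MAIN_PARTS.items():
        if part in names:
            return label
    return None
```

A plain zip archive and a .docx both start with the same magic bytes, so this second step is what tells them apart.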

OK, you ask now, if there isn’t a complete list, how can I detect all those files?
Well, I would take the short route and use a whitelist of magic bytes: only accept the file types whose signatures you know and expect, and reject everything else. That way you should be relatively safe.
I say “relatively” because there could always be a catch in the file itself, of course.
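
A whitelist check could look something like the sketch below. The ALLOWED_SIGNATURES table and the is_allowed function are hypothetical names, and the three entries are only examples; you would list exactly the types your own application expects.

```python
# Whitelist: extension -> signatures we are willing to accept for it.
# Example entries only; anything not listed here is rejected.
ALLOWED_SIGNATURES = {
    "pdf": [b"%PDF-"],
    "png": [b"\x89PNG\r\n\x1a\n"],
    "jpeg": [b"\xff\xd8\xff"],
}


def is_allowed(data: bytes, claimed_extension: str) -> bool:
    """Accept the upload only if its first bytes match the claimed, whitelisted type."""
    extension = claimed_extension.lower().lstrip(".")
    signatures = ALLOWED_SIGNATURES.get(extension)
    if signatures is None:
        return False  # type is not on the whitelist
    return any(data.startswith(magic) for magic in signatures)
```

A matching signature only says the file starts like a PDF or a PNG; the rest of the content can still be malformed or malicious, which is the “relatively” part.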

Happy coding!
