{"id":23,"date":"2020-04-11T11:13:38","date_gmt":"2020-04-11T11:13:38","guid":{"rendered":"https:\/\/blog.paulkaspers.nl\/?p=23"},"modified":"2020-05-09T11:30:20","modified_gmt":"2020-05-09T11:30:20","slug":"file-recognition","status":"publish","type":"post","link":"https:\/\/blog.paulkaspers.nl\/index.php\/2020\/04\/11\/file-recognition\/","title":{"rendered":"File recognition"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">One of the issues we all run into as developers is how to recognize a file.<br>Most people look first at the file extension (if there is one). We can all recognize a <strong>.zip<\/strong> extension as a compressed file. Unfortunately, this way of detecting a file type can&#8217;t always be trusted: crackers and the like exploit this trust to crash your application or, even worse, to hack into it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So how do you catch those &#8216;camouflaged&#8217; files?<br>Luckily there is another (better) option: looking at the file header (the magic bytes). This &#8216;header&#8217; is nothing more than the first few bytes of the file. For an overview of those magic bytes, see the list on <a href=\"https:\/\/en.wikipedia.org\/wiki\/List_of_file_signatures\">Wikipedia<\/a> or the more extensive list from <a href=\"https:\/\/www.garykessler.net\/library\/file_sigs.html\">Gary Kessler<\/a>. In most cases this is enough to determine the <em>technical<\/em> type of the file.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Why did I emphasize <em>technical<\/em> here, you might ask? Nowadays a lot of file types are compressed containers; take, for example, the Microsoft Office files (e.g. 
DOCX and the other OOXML file types).<br>Those are nothing more than zip files, so you must look inside them to see which type they really are.<br>If you open such a file as a zip file (by renaming it from .docx to .zip, or by opening it with a decompression utility), you will find multiple files and directories inside.<br>Taking the docx type as an example, you can find its ISO description via the following <a href=\"https:\/\/www.loc.gov\/preservation\/digital\/formats\/fdd\/fdd000397.shtml\">link<\/a>. On this page there is a subsection named &#8220;File type signifiers and format identifiers&#8221; with detailed information on how to detect the file.<br>Descriptions of a number of other file formats can be found on the same website (<a href=\"https:\/\/www.loc.gov\/preservation\/digital\/formats\/fdd\/descriptions.shtml\">link<\/a>); unfortunately, this list isn&#8217;t complete.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">OK, you might ask now, if there isn&#8217;t a complete list, how can I detect all those files?<br>I would take the short route and use a whitelist of magic bytes; that way you should be relatively safe.<br>I say &#8220;relatively&#8221; because there can always be a catch inside the file itself, of course.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Happy coding!<br><\/p>\n","protected":false},"excerpt":{"rendered":"<p>One of the issues we all run into as developers is how to recognize a file. Most people look first at the file extension (if there is one). We can all recognize a .zip extension as a compressed file. 
Unfortunately, this way of detecting a file type can&#8217;t always be trusted: crackers and their kind like to misuse [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_et_pb_use_builder":"off","_et_pb_old_content":"","_et_gb_content_width":"","footnotes":""},"categories":[6],"tags":[3,4,5],"class_list":["post-23","post","type-post","status-publish","format-standard","hentry","category-universal","tag-file-format","tag-file-recognition","tag-magic-bytes"],"_links":{"self":[{"href":"https:\/\/blog.paulkaspers.nl\/index.php\/wp-json\/wp\/v2\/posts\/23","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.paulkaspers.nl\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.paulkaspers.nl\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.paulkaspers.nl\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.paulkaspers.nl\/index.php\/wp-json\/wp\/v2\/comments?post=23"}],"version-history":[{"count":3,"href":"https:\/\/blog.paulkaspers.nl\/index.php\/wp-json\/wp\/v2\/posts\/23\/revisions"}],"predecessor-version":[{"id":29,"href":"https:\/\/blog.paulkaspers.nl\/index.php\/wp-json\/wp\/v2\/posts\/23\/revisions\/29"}],"wp:attachment":[{"href":"https:\/\/blog.paulkaspers.nl\/index.php\/wp-json\/wp\/v2\/media?parent=23"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.paulkaspers.nl\/index.php\/wp-json\/wp\/v2\/categories?post=23"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.paulkaspers.nl\/index.php\/wp-json\/wp\/v2\/tags?post=23"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}