A collection of files gathered from different sources to be used for tests that compare mimetype with the UNIX file utility.
TLDR: ~97% of samples identified correctly
Of the roughly 3% of files counted as misidentified, most are genuine
misidentifications, but some are counted only because mimetype
identifies the format more precisely than file:
- XML-based file formats, like GML and GPX, are seen as generic text/xml by file
- mimetype identifies subtitles as text/vtt, while file sees them just as text/plain
- mimetype identifies text/tab-separated-values, while file sees just text/plain
- etc.
Results show the latest percentage of misidentified files and a breakdown of the most frequently misidentified formats. If you want to run the tests, use these commands.
- testfiles contains all the test files (around 50 000 entries)
- zipshuffler.go reads zip files and then creates random permutations of the files inside the zip.
- truncate.go creates 3KB truncated copies of all the files
- main.go iterates over all files and compares our results with the results of file --mime