91

Linux determines a file's type via code in the file's header. This process doesn't depend on file extensions to know which software to use for opening the file.

(That's what I remember from my education. Please correct me in case I'm wrong!)

Having worked a bit with Ubuntu systems recently, I see a lot of files on them with extensions like .sh, .txt, .o, .c.

Now I'm wondering: Is the purpose of these extensions to merely help people understand what sort of file they happen to be looking at? Or do they have some purpose for the operating system also?

mizech
  • 1,279

7 Answers

82

There is no 100% black or white answer here.

Usually Linux does not rely on file names (or file extensions, i.e. the part of the file name after the last period) and instead determines the file type by examining the first few bytes of the file's content and comparing them to a list of known magic numbers.

For example, all Bitmap image files (usually with the name extension .bmp) must start with the letters BM in their first two bytes. Scripts in most scripting languages like Bash, Python, Perl, AWK, etc. (basically everything that treats lines starting with # as comments) may contain a shebang like #!/bin/bash as their first line. This special comment tells the system which interpreter to run the file with.
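To make the magic-number idea concrete, here is a minimal sketch in Python. The SIGNATURES table and the sniff function are hypothetical names made up for this illustration, and the table holds only a handful of well-known signatures; real tools such as the file command consult a far larger database.

```python
# Hypothetical, deliberately tiny signature table: leading bytes -> type label.
SIGNATURES = {
    b"BM": "BMP image",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"GIF89a": "GIF image",
    b"\x7fELF": "ELF executable",
    b"#!": "script with shebang",
}


def sniff(data):
    """Guess a file type from the leading bytes only; the name is never consulted."""
    for magic, label in SIGNATURES.items():
        if data.startswith(magic):
            return label
    return "unknown"


print(sniff(b"BM\x46\x00\x00\x00"))       # starts with the two BMP bytes
print(sniff(b"#!/bin/bash\necho hi\n"))   # starts with a shebang
print(sniff(b"just some plain text"))     # no known signature
```

Note that the function never sees a file name, which is the whole point: the guess comes from content alone.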

So normally the operating system relies on the file content and not its name to determine the file type, but stating that file extensions are never needed on Linux is only half of the truth.


Applications may of course implement their file checks however they want, which includes verifying the file name and extension. An example is the Eye of Gnome (eog, standard picture viewer) which determines the image format by the file extension and throws an error if it does not match the content. Whether this is a bug or a feature can be discussed...

However, even some parts of the operating system rely on file name extensions, e.g. when parsing your software sources files in /etc/apt/sources.list.d/: only files with the *.list extension (and, in newer apt versions, *.sources) get parsed; all others are ignored. The extension is perhaps not mainly used to determine the file type here but rather to enable or disable parsing of some files, but it's still a file extension that affects how the system treats a file.
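A suffix filter of this kind is easy to sketch. The following Python snippet is only an illustration of the idea, not how apt is implemented; parsed_files and the file names are made up, and it works on a temporary directory rather than the real /etc/apt/sources.list.d/.

```python
import tempfile
from pathlib import Path


def parsed_files(directory, suffix=".list"):
    """Return the names of the files a suffix-filtering loader would read."""
    return sorted(
        p.name for p in directory.iterdir()
        if p.is_file() and p.suffix == suffix
    )


with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    # Hypothetical file names; only the first one ends in ".list".
    for name in ("ppa.list", "ppa.list.disabled", "notes"):
        (d / name).write_text("# placeholder\n")
    result = parsed_files(d)

print(result)
```

Renaming ppa.list to ppa.list.disabled is enough to take it out of the result, which is exactly the enable/disable effect described above.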

And of course the human user profits most from file extensions as that makes the type of a file obvious and also allows multiple files with the same base name and different extensions like site.html, site.php, site.js, site.css etc. The disadvantage is of course that file extension and the actual file type/content do not necessarily have to match.

Additionally, extensions are needed for cross-platform interoperability, as e.g. Windows will not know what to do with a readme file, but only with a readme.txt.

techraf
  • 3,316
Byte Commander
  • 110,243
48

Linux determines the type of a file via a code in the file header. It doesn't depend on file extensions to know which software to use for opening the file.

That's what I remember from my education. Please correct me in case I'm wrong!

  • correctly remembered.

Are these extensions meant only for humans?

  • Yes, with a but.

When you interact with other operating systems that do depend on extensions, it is smarter to use them.

In Windows, opening software is attached to the extensions.

Opening a text file named "file" is harder in Windows than opening the same file named "file.txt" (you will need to switch the file open dialog from *.txt to *.* every time). The same goes for TAB and semi-colon separated text files. The same goes for importing and exporting e-mails (.mbox extension).

This matters in particular when you write software: working with a file named "software1" that is an HTML file and "software2" that is a JavaScript file is more difficult than working with "software.html" and "software.js".


If there is a system in place in Linux where file extensions are important, I would call that a bug. When software depends on file extensions, that is exploitable. We use an interpreter directive to identify what a file is: "the first two bytes in a file can be the characters "#!", which constitute a magic number (hexadecimal 23 and 21, the ASCII values of "#" and "!") often referred to as shebang".

The most famous problem with file extensions was LOVE-LETTER-FOR-YOU.TXT.vbs on Windows. This is a Visual Basic script that was shown in the file explorer as a text file.

In Ubuntu, when you start a file from Nautilus you get a warning about what it is going to do. A script that wants to start some other software when the file is supposed to open in gedit is obviously a problem, and we get a warning about it.

On the command line, when you execute something you can see what the extension is. If it ends in .vbs I would start to become suspicious (not that .vbs is executable on Linux. At least not without some more effort ;) ).

TimWolla
  • 288
Rinzwind
  • 309,379
25

As mentioned by others, in Linux an interpreter directive method is used (storing some metadata in a file as a header or magic number so the correct interpreter can be told to read it) rather than the filename extension association method used by Windows.

This means you can create a file with almost any name you like, with a few exceptions.

However

I would like to add a word of caution.

If you have some files on your system from a system that uses filename association, the files may not have those magic numbers or headers. Filename extensions are used to identify these files by applications that are able to read them, and you may experience some unexpected effects if you rename such files. For example:

If you rename a file My Novel.doc to My-Novel, LibreOffice will still be able to open it, but it will open as 'Untitled' and you will have to name it again in order to save it (LibreOffice adds an extension by default, so you would then have two files, My-Novel and My-Novel.odt, which could be annoying).

More seriously, if you rename a file My Spreadsheet.xlsx to My-Spreadsheet, then try to open it with xdg-open My-Spreadsheet you will get this (because it's actually a compressed file):

And if you rename a file My Spreadsheet.xls to My-Spreadsheet, when you xdg-open My-Spreadsheet you get an error saying

error opening location: No application is registered as handling this file

(Although in both these cases it works OK if you do soffice My-Spreadsheet)

If you then rename the extensionless file to My-Spreadsheet.ods with mv and try to open it you will get this:

(repair fails)

And you will have to put the original extension back on to open the file correctly (you can then convert the format if you wish)

TL;DR:

If you have non-native files with name extensions, don't remove the extensions assuming everything will be OK!

Zanna
  • 72,312
24

I'd like to take a different approach to this from other answers, and challenge the notion that "Linux" or "Windows" have anything to do with this (bear with me).

The concept of a file extension can be simply expressed as "a convention for identifying the type of a file based on part of its name". The other common conventions for identifying the type of a file are comparing its contents against a database of known signatures (the "magic number" approach), and storing it as an extra attribute on the file system (the approach used in the original MacOS).

Since every file on a Windows or a Linux system has both a name and contents, processes which want to know the file type can use either the "extension" or the "magic number" approaches as they see fit. The metadata approach is not generally available, as there is no standard place for this attribute on most file systems.

On Windows, there is a strong tradition of using the file extension as the primary means of identifying a file; most visibly, the graphical file browser (File Manager on Windows 3.1 and Explorer on modern Windows) uses it when you double-click on a file to determine which application to launch. On Linux (and, more generally, Unix-based systems), there is more tradition for inspecting the contents; most notably, the kernel looks at the beginning of a file executed directly to determine how to run it; script files can indicate an interpreter to use by starting with #! followed by the path to the interpreter.
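As a rough illustration of the #! convention, the sketch below pulls the interpreter path out of a script's first line. It is a simplified model written for this answer (interpreter_for is a hypothetical helper), not the kernel's actual logic, which has further rules about arguments and line length.

```python
def interpreter_for(first_line):
    """Return the interpreter path named by a shebang line, or None if there is none."""
    if not first_line.startswith(b"#!"):
        return None
    # Everything after "#!" is the interpreter path, optionally followed by an argument.
    parts = first_line[2:].strip().split(None, 1)
    return parts[0].decode() if parts else None


print(interpreter_for(b"#!/bin/bash\n"))
print(interpreter_for(b"#! /usr/bin/env python3\n"))
print(interpreter_for(b"\x7fELF\x02\x01\x01"))  # a binary, so no shebang
```

The file name plays no part here either: a script called backup, backup.sh or backup.txt is treated the same as long as its first line is the same.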

These traditions influence UI design of programs written for each system, but there are plenty of exceptions, because each approach has pros and cons in different situations. Reasons to use file extensions rather than examining contents include:

  • examining file contents is fairly costly compared to examining file names; so for instance "find all files named *.conf" will be a lot quicker than "find all files whose first line matches this signature"
  • file contents can be ambiguous; many file formats are actually just text files treated in a special way, many others are specially-structured zip files, and defining accurate signatures for these can be tricky
  • a file can genuinely be valid as more than one type; an HTML file may also be valid XML, a zip file and a GIF concatenated together remain valid for both formats
  • magic number matching might lead to false positives; a file format that has no header might happen to begin with the bytes "GIF89a" and be misidentified as a GIF image
  • renaming a file can be a convenient way to mark it as "disabled"; e.g. changing "foo.conf" to "foo.conf~" to indicate a backup is easier than editing the file to comment out all of its directives, and more convenient than moving it out of an autoloaded directory; similarly, renaming a .php file to .txt will tell Apache to serve its source as plain text, rather than passing it to the PHP engine
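The cost and false-positive points in the list above can be illustrated with a short Python sketch. It uses the standard mimetypes module for the name-based guess; looks_like_gif and the sample file name and content are invented for the example.

```python
import mimetypes


def type_from_name(filename):
    """Name-based guess: a table lookup on the extension, no file I/O at all."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime


def looks_like_gif(data):
    """Content-based guess: check for either GIF version string at the start."""
    return data.startswith((b"GIF87a", b"GIF89a"))


# A hypothetical plain-text note that happens to begin with the GIF signature.
name = "notes.txt"
content = b"GIF89a is the version string that newer GIF files begin with.\n"

print(type_from_name(name))     # what the name suggests
print(looks_like_gif(content))  # what a naive signature check concludes
```

The two approaches disagree about this file, and here it is the extension that gives the sensible answer.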

Examples of Linux programs which use file names by default (but may have other modes):

  • gzip and gunzip have special handling of any file ending ".gz"
  • gcc will handle ".c" files as C, and ".cc" or ".C" as C++
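The gcc behaviour can be pictured as a lookup on the file suffix. The table below is a small subset of the suffixes gcc documents, and language_for with its "unknown" fallback is a hypothetical helper for illustration only, not gcc's real handling of unrecognised names.

```python
from pathlib import PurePosixPath

# Subset of suffix conventions documented for gcc; the real list is longer.
LANGUAGE_BY_SUFFIX = {
    ".c": "C",
    ".C": "C++",   # the lookup is case-sensitive
    ".cc": "C++",
    ".cpp": "C++",
    ".cxx": "C++",
}


def language_for(filename):
    """Pick a source language from the file suffix alone."""
    suffix = PurePosixPath(filename).suffix
    return LANGUAGE_BY_SUFFIX.get(suffix, "unknown")


for name in ("main.c", "main.C", "main.cc", "main"):
    print(name, "->", language_for(name))
```

Note that main.c and main.C map to different languages, so on a case-sensitive file system even the capitalisation of the extension carries meaning.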
IMSoP
  • 1,549
16

Actually, some technologies do rely on file extensions, so if you use those technologies in Ubuntu, you'll have to rely on extensions too. A few examples:

  • gcc uses extensions to distinguish between C and C++ files. Without the extension it's pretty much impossible to differentiate them (imagine a C++ file with no classes).
  • many files (docx, jar, apk) are just particularly structured ZIP archives. While you can usually infer the type from the content, it may not always be possible (e.g. Java Manifest is optional in jar files).

Not using file extensions in such cases will only be possible with hacky workarounds and is likely to be very error-prone.
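To see why ZIP-based formats are hard to tell apart by signature, here is a small Python sketch that builds two archives in memory with the standard zipfile module. The member names are placeholders chosen to resemble a jar and a docx; these are not valid files of either format.

```python
import io
import zipfile


def make_archive(members):
    """Build a ZIP archive in memory from a {member name: text} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in members.items():
            archive.writestr(name, text)
    return buffer.getvalue()


jar_like = make_archive({"Main.class": "placeholder"})
docx_like = make_archive({"word/document.xml": "<doc/>"})

# Both archives begin with the same ZIP local-file-header signature.
print(jar_like[:4], docx_like[:4])

# Telling them apart means opening the archive and inspecting its members.
print(zipfile.ZipFile(io.BytesIO(docx_like)).namelist())
```

A check of the first few bytes sees the same thing for both, so any content-based identification has to dig into the archive, which is more work and, as noted above, not always conclusive.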

Dmitry Grigoryev
  • 1,960
6

Your first assumption is correct: extensions on Linux do not matter and are only useful for humans (and for other, non-Unix-like operating systems that care about extensions). The type of a file is determined by the first few bytes of data in the file, which are known as a magic number. This is why shell scripts need a #! line: to tell the operating system which interpreter to call. Without it, the shell script is just a text file.

As far as file managers go, they do want to know the extensions of some files, such as .desktop files, which are basically the same as Windows shortcuts but with more capabilities. But as far as the OS is concerned, it needs to know what's in the file, not what's in its name.

5

This answer is too big for a comment.

Keep in mind that even "extension" has a lot of different meanings.

What you're talking about seems to be the 3 letters after the period. DOS made the 8.3 format really popular, and Windows uses the .3 part to this day.

Linux has a lot of files like .conf or .list or .d or .c that have meaning, but are not really extensions in the 8.3 sense. For example, Apache looks at /etc/apache2/sites-enabled/website.conf for its configuration directives. While the system uses MIME types, content headers and whatnot to determine that it's a text file, Apache (by default) still isn't going to load it unless its name ends in .conf.

.c is another great one. Yep, it's a text file, but gcc depends on main.c becoming main.o and finally main (after linking). At no time does the system use the .c, the .o or the missing extension to determine the content, but the stuff after the period does have some meaning. You would probably set up your SCM to ignore main.o and main.

The point is this: extensions are not used the way they are in Windows. The kernel will not execute a .txt file just because you remove the .txt part of the name. It is also very happy to execute a .txt file if the execute permission is set. That being said, extensions do have meaning, and are still used on a "computer level" for many things.
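The claim about the execute permission is easy to try out. The Python sketch below assumes a Linux system with /bin/sh and a temporary directory that is not mounted noexec; the file name hello.txt and its contents are made up for the demonstration.

```python
import os
import stat
import subprocess
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "hello.txt")
    # A shell script stored under a deliberately "wrong" extension.
    with open(script, "w") as handle:
        handle.write("#!/bin/sh\necho hello from a .txt file\n")
    # Set the execute bit for the owner; the name is left untouched.
    os.chmod(script, os.stat(script).st_mode | stat.S_IXUSR)
    # Run the file directly, so the kernel picks the interpreter from the #! line.
    output = subprocess.run(
        [script], capture_output=True, text=True, check=True
    ).stdout

print(output)
```

The .txt suffix makes no difference to whether the file runs; the execute bit and the #! line are what count.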

coteyr
  • 18,724