
Not sure if this is an Ubuntu or OS X question, but I'll start here. I'll leave it to the mods to move the question to Ask Different if that is more appropriate.

I moved a file from Ubuntu to OS X using scp on the Apple machine. I edited the file on the Apple machine. Then I moved the file back, again using scp on the Apple machine.

The filename of the source file was Documents/trettiårsfirarätare.

  • Sourcecode: Documents/trettiårsfirarätare

The filename I got back had the name Documents/trettiårsfirarätare.

  • Sourcecode: Documents/trettia˚rsfirara¨tare

While these might look similar, the letters å and ä are actually different between them. At no point did I change the name of the file.

This makes little technical difference to me; I just changed the name of the file back to what Ubuntu considers å and ä. But it tickled my curiosity.

Can you explain to me why this happened?

Takkat
azzid

1 Answer


In the original name “Documents/trettiårsfirarätare”, the letter “å” is internally represented as U+00E5 LATIN SMALL LETTER A WITH RING ABOVE. This is the common representation of this character. In the filename you got back, it has been turned into the character pair U+0061 LATIN SMALL LETTER A U+030A COMBINING RING ABOVE. This is permissible but not common; it means decomposing “å” into the base character “a” and a combining diacritic mark. Unicode declares these two representations canonically equivalent, which means that the visual presentation is normally expected to be the same, though it need not be (here, at SO, as viewed in Firefox, it is not; this depends on the font and on the rendering software). Programs may treat them as equivalent, but they need not. In a file system, for example, they might well be treated as different.

Similarly, the letter “ä” gets decomposed to U+0061 LATIN SMALL LETTER A U+0308 COMBINING DIAERESIS.
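To see the two representations side by side, here is a minimal Python sketch using the standard `unicodedata` module. The helper name `describe` is made up for this illustration, and the filename is written with escapes so that the editor's own encoding cannot silently change which form is used.

```python
import unicodedata

# Precomposed spelling of the filename, written with escapes.
composed = "tretti\u00e5rsfirar\u00e4tare"

# Canonical decomposition (NFD) splits each accented letter into a base
# letter followed by a combining mark.
decomposed = unicodedata.normalize("NFD", composed)


def describe(text):
    """Return 'U+XXXX NAME' for every non-ASCII code point in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch)}"
        for ch in text
        if ord(ch) > 127
    ]


print(describe(composed))
print(describe(decomposed))

# The strings render alike but compare unequal, code point by code point.
print(composed == decomposed)
print(len(composed), len(decomposed))
```

The decomposed string is two code points longer, because each of “å” and “ä” becomes a plain “a” plus one combining mark.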

The reason for this is not obvious. Possibly some software “thinks” it should convert strings to a normalization form that decomposes all decomposable characters, probably Unicode Normalization Form D (NFD).
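If the name really was converted to NFD, composing it again with NFC should give back the original spelling. The following is only a sketch under that assumption; `plan_nfc_renames` is a hypothetical helper, and it only reports what would change without touching any files.

```python
import unicodedata


def plan_nfc_renames(names):
    """Return (old, new) pairs for the names that change under NFC."""
    plan = []
    for name in names:
        composed = unicodedata.normalize("NFC", name)
        if composed != name:
            plan.append((name, composed))
    return plan


# Example listing: one decomposed (NFD) name and one plain ASCII name.
listing = ["trettia\u030arsfirara\u0308tare", "notes.txt"]

for old, new in plan_nfc_renames(listing):
    print(f"{old!r} -> {new!r}")
```

To act on the plan you could feed it the result of `os.listdir()` and pass each pair to `os.rename()`, after checking that the composed name does not already exist.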

The rest is a bit more mysterious. In what you specify as “Sourcecode” for the filename you got back, “Documents/trettia˚rsfirara¨tare”, the decomposed forms have been munged: the combining diacritic marks have been replaced by their spacing clones, the characters “˚” and “¨”. This is not normal, and it changes both the identity of the data and its rendering.
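The difference between a combining mark and its spacing clone can be checked in Python as well. This sketch assumes the munged name contains U+02DA RING ABOVE and U+00A8 DIAERESIS, which is what the characters quoted above look like; the variable names are only for illustration.

```python
import unicodedata

composed = "tretti\u00e5rsfirar\u00e4tare"
# Assumed content of the munged name: spacing clones after a plain "a".
munged = "trettia\u02darsfirara\u00a8tare"

# Combining marks have a non-zero combining class; spacing clones have 0.
for ch in "\u030a\u02da\u0308\u00a8":
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)} "
        f"combining_class={unicodedata.combining(ch)}"
    )

# Because the clones are ordinary spacing characters, NFC cannot merge
# them with the preceding "a", so the original name is not recovered.
print(unicodedata.normalize("NFC", munged) == composed)
```

This is why the munged form is a real change of data: unlike the NFD form, normalization does not bring it back to the original name.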