SORTING TEXT WITH DIACRITICAL (8-bit ASCII) CHARACTERS
or: How to write your own collating sequences

by Eric Pement
25 Oct. 2002

Recently, I tried to solve the problem of how to sort lines with
diacritical characters (characters using the acute or grave accent,
tilde, circumflex, diaeresis or umlaut, etc.). I use a Windows
2000/CMD/4DOS environment, and the accented symbols are on the "high"
end of the ASCII table -- generally speaking, in the region from 0x80
to 0xA7. (In this notation, "0x82" represents 82 in hexadecimal, which
is also 202 in octal, and 130 in decimal.)

The problem is, the standard GNU 'sort' utility sorts characters in
strict ASCII sequence. I wanted to see this:

   baza
   beza     # accented 'e'
   biza
   boza

where the 'e' on line #2 is a single character with acute accent (in
HTML, &eacute;). However, the ASCII value of this character is greater
than 'z', because all accented symbols have the 8th bit set in the
IBM/OEM character set. Therefore, my sort was coming out like this:

   baza
   biza
   boza
   beza     # accented 'e'

which was not what I wanted.

I tried setting the environment variable LC_COLLATE, which was
supposed to configure the local "collating sequence" for the GNU sort
utility, but changing this value had no effect. Accented characters
still sort after unaccented characters. There was nothing I could do
to change that ... or so I thought.

Sed to the rescue!

It was a tip from Sven Guckes that started me on the right track to
solving this problem. I posted my question on the sed-users mailing
list, and Sven's mail client apparently displayed my accented "e" as
two separate characters, e' (e, apostrophe). The letter "e" with a
grave accent was displayed as e`. He helpfully tried to explain that I
could use sed to convert the two-character string back to a
single-byte character, by using the sed substitution command.
I didn't think much of it at the time, but while trying to solve the
problem, I discovered that GNU sort has a -d switch, which ignores
embedded punctuation marks while sorting. This led to the solution.

THE SOLUTION:

I can use one sed script to change the input file, altering all
high-bit characters to 3-character equivalents. I change (acute-e) to
(e+'), (grave-e) to (e+`), (cedilla-c) to (c+,), and so forth.

Then I use sort -d to sort the new file in dictionary order. The -d
switch causes 'sort' to ignore any punctuation marks or symbols when
sorting the line.

Finally, I use another sed script to change the 3-character symbols
back into the form I started with, converting (e+') to (acute-e),
(e+`) back to (grave-e), (c+,) back to (cedilla-c), etc.

The first script should look like this:

   # filename: accent_to_3.sed
   #
   # GNU sed script changes IBM/OEM accented alphabetical characters
   # to three-character equivalents. Note that this method relies on
   # the plus sign (+) not being used in the input file! GNU sed or
   # ssed will recognize \xHH as hex values.
   #
   s/\x80/C+,/g;    # cap C cedilla
   s/\x81/u+e/g;    # German umlaut u = 'ue' in English
   s/\x82/e+'/g;    # e acute
   s/\x83/a+^/g;    # a circumflex
   s/\x84/a+:/g;    # a diaeresis
   s/\x85/a+`/g;    # a grave
   s/\x86/a+@/g;    # a with ring above
   s/\x87/c+,/g;    # c cedilla
   s/\x88/e+^/g;    # e circumflex
   s/\x89/e+:/g;    # e diaeresis
   s/\x8A/e+`/g;    # e grave
   s/\x8B/i+:/g;    # i diaeresis
   s/\x8C/i+^/g;    # i circumflex
   s/\x8D/i+`/g;    # i grave
   s/\x8E/A+:/g;    # A diaeresis
   s/\x8F/A+@/g;    # A with ring above
   s/\x90/E+'/g;    # E acute
   s/\x91/a+e/g;    # ligature ae
   s/\x92/A+E/g;    # ligature AE
   s/\x93/o+^/g;    # o circumflex
   s/\x94/o+e/g;    # German umlaut o = 'oe' in English
   s/\x95/o+`/g;    # o grave
   s/\x96/u+^/g;    # u circumflex
   s/\x97/u+`/g;    # u grave
   s/\x98/y+:/g;    # y diaeresis
   s/\x99/O+e/g;    # German umlaut O = 'Oe' in English
   s/\x9A/U+e/g;    # German umlaut U = 'Ue' in English
   #
   # Break - nonalphabetic symbols between 0x9B and 0x9F
   #
   s/\xA0/a+'/g;    # a acute
   s/\xA1/i+'/g;    # i acute
   s/\xA2/o+'/g;    # o acute
   s/\xA3/u+'/g;    # u acute
   s/\xA4/n+~/g;    # n tilde
   s/\xA5/N+~/g;    # N tilde
   s/\xA6/a+-/g;    # feminine ordinal (superscript a)
   s/\xA7/o+-/g;    # masculine ordinal (superscript o)
   #
   s/\xE1/s+s/g;    # (Gk. beta) German eszett = "ss" in English
   #---end of script---

And the second script should look like this:

   # filename: 3_to_accent.sed
   #
   # GNU sed script changes 3-character strings to one-character
   # 8-bit symbols in the IBM/OEM character set. Note that we put
   # the '+' within [square brackets] so it will represent a plus
   # sign, regardless of whether the sed switches -r or -R are used
   # or not. For the same reason the caret is written as \^, since
   # a bare ^ in mid-pattern acts as an anchor in extended (-r)
   # syntax. Obviously, the plus sign in the input file must be a
   # reserved character, used only for 3-to-one character mapping.
   #
   s/C[+],/\x80/g;    # cap C cedilla
   s/u[+]e/\x81/g;    # 'u+e' maps to German umlaut u
   s/e[+]'/\x82/g;    # e acute
   s/a[+]\^/\x83/g;   # a circumflex
   s/a[+]:/\x84/g;    # a diaeresis
   s/a[+]`/\x85/g;    # a grave
   s/a[+]@/\x86/g;    # a with ring above
   s/c[+],/\x87/g;    # c cedilla
   s/e[+]\^/\x88/g;   # e circumflex
   s/e[+]:/\x89/g;    # e diaeresis
   s/e[+]`/\x8A/g;    # e grave
   s/i[+]:/\x8B/g;    # i diaeresis
   s/i[+]\^/\x8C/g;   # i circumflex
   s/i[+]`/\x8D/g;    # i grave
   s/A[+]:/\x8E/g;    # A diaeresis
   s/A[+]@/\x8F/g;    # A with ring above
   s/E[+]'/\x90/g;    # E acute
   s/a[+]e/\x91/g;    # ligature ae
   s/A[+]E/\x92/g;    # ligature AE
   s/o[+]\^/\x93/g;   # o circumflex
   s/o[+]e/\x94/g;    # 'o+e' maps to German umlaut o
   s/o[+]`/\x95/g;    # o grave
   s/u[+]\^/\x96/g;   # u circumflex
   s/u[+]`/\x97/g;    # u grave
   s/y[+]:/\x98/g;    # y diaeresis
   s/O[+]e/\x99/g;    # 'O+e' maps to German umlaut O
   s/U[+]e/\x9A/g;    # 'U+e' maps to German umlaut U
   #
   # Break - nonalphabetic symbols between 0x9B and 0x9F
   #
   s/a[+]'/\xA0/g;    # a acute
   s/i[+]'/\xA1/g;    # i acute
   s/o[+]'/\xA2/g;    # o acute
   s/u[+]'/\xA3/g;    # u acute
   s/n[+]~/\xA4/g;    # n tilde
   s/N[+]~/\xA5/g;    # N tilde
   s/a[+]-/\xA6/g;    # feminine ordinal (superscript a)
   s/o[+]-/\xA7/g;    # masculine ordinal (superscript o)
   #
   s/s[+]s/\xE1/g;    # "s+s" is mapped to German eszett ( = Greek beta!)
   #---end of script---

Finally, the solution script or batch file to run everything is below.
It uses GNU sed, ssed, HHsed, or any version of sed which has support
for \xHH hexadecimal notation.

   sed -f accent_to_3.sed input.file >temp.one
   sort -d temp.one >temp.two
   sed -f 3_to_accent.sed temp.two >output.file
   erase temp.one temp.two
   echo Sorted input file is now in 'output.file'

This took several hours to work on, but it solves a particular
web-related sorting problem that I have. Hope it helps someone else.

FURTHER REFLECTIONS:

The astute sed-user will recognize that this system allows them to
write their own collating sequences. For example, I wanted the German
umlauted-u (0x81) to be collated as "ue", so it appears as 'u+e' in
the first sed script. If you want 0x81 to be collated as "zork", just
map it to "z+o+r+k" in the first sed script, and use an inverse
mapping in the second sed script.

As long as you make sure that mappings do not clash or overwrite one
another, this method enables you to write your own collating
sequences, giving you even more power than LC_COLLATE would give.

It also is worth mentioning that this same principle would work on
HTML character entities. And to be absolutely honest, this is how I
started on this project in the first place. I have a database where
certain fields contain HTML character entities, like this (showing
only the key field, which consists of personal names):

   Marin, Louis
   B&eacute;za, Theodore
   G&uuml;ttgemanns, Erhardt
   Backus, Irena
   G&eacute;rard, Fran&ccedil;ois C.
   Hadidian, Dikran

I knew that I could easily use sed to convert the database from HTML
character entities to IBM 8-bit characters, legible as normal text.
But then how could I use the "sort" utility to sort these names once
they were in 8-bit ASCII? Sort doesn't really understand the collating
sequence of the accented characters above 7F hex.

So, this little exercise was really a bit more involved than I let on
at first. I kept it simple, just asking how to persuade 'sort' to
recognize accented symbols. But my real motivation is the one I have
presented here. And for the sed-users mailing list, now you know why I
always used the string 'B(acute-e)za' in my sorting examples.
It was a short word, and I wanted the accented form of "Beza" to come
after "Benz" and before "Bishop" when the database is sorted.

That's it! If you need further explanation or notice something that
needs adjustment, please write me. Thanks!

--
Eric Pement - pemente@northpark.edu
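
POSTSCRIPT:

The remark under FURTHER REFLECTIONS, that the same principle would
work on HTML character entities, can be sketched roughly as follows.
This is only an illustration under stated assumptions, not the
scripts actually used on the database: it is written for a POSIX
shell rather than CMD/4DOS, the function names entity_to_3 and
three_to_entity are made up for this example, only three entities are
mapped, and the plus sign is assumed never to occur in the data.

```shell
#!/bin/sh
# Sketch: sort names containing HTML entities so that accented letters
# collate next to their base letters. Assumes '+' never occurs in the data.
LC_ALL=C; export LC_ALL    # plain byte-order collation, for repeatable results

# Hypothetical helper: entity -> 3-character (or longer) sortable string.
entity_to_3() {
    # Only three entities are mapped here; a real table would list them all.
    sed -e "s/&eacute;/e+'/g" \
        -e "s/&uuml;/u+e/g" \
        -e "s/&ccedil;/c+,/g"
}

# Hypothetical helper: the inverse mapping, applied after sorting.
three_to_entity() {
    # '&' is special in a sed replacement, so it is written as '\&'.
    sed -e "s/e[+]'/\&eacute;/g" \
        -e "s/u[+]e/\&uuml;/g" \
        -e "s/c[+],/\&ccedil;/g"
}

# Sample key field from the article, run through map / sort -d / unmap.
sorted=$(
    printf '%s\n' \
        'Marin, Louis' \
        'B&eacute;za, Theodore' \
        'G&uuml;ttgemanns, Erhardt' \
        'Backus, Irena' \
        'G&eacute;rard, Fran&ccedil;ois C.' \
        'Hadidian, Dikran' |
    entity_to_3 | sort -d | three_to_entity
)
printf '%s\n' "$sorted"
```

With this mapping the entity form of "Beza" is compared as "Beza" and
the umlauted "Guttgemanns" as "Guettgemanns", so each name lands among
its unaccented neighbours. Piping the three stages together also
avoids the temporary files, at the cost of not being able to inspect
the intermediate output.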