Code Conversion

Although future computers will undoubtedly use Unicode, most machines continue to muddle along with 8-bit coding, allowing a maximum of 256 characters in each character set. And in plain ASCII text files, 7-bit coding is used, restricting the user to fewer than 128 characters.

Although different 8-bit coding systems are used on alternative computer platforms, most modern word-processing applications use a standardised character set within a proprietary document format, allowing files to be opened on any kind of machine. In fact, many Western applications use files that employ the ISO 8859-1 (Latin-1) character set, whatever computer is used.

Text File Conversion

Although a plain ASCII text file can be read on any computer, other text files should use the same character set as the recipient’s computer platform. If such files are incorrectly coded you’ll still be able to read the text, although many of the non-ASCII codes will give the wrong characters.
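
As a small illustration of this effect, the following Python sketch writes a word using the Mac OS character set and then reads the bytes back as if they were Windows text. It assumes that Python's built-in mac_roman and cp1252 codecs are acceptable stand-ins for the two platforms' character sets, and the function name misread is purely illustrative.

```python
def misread(text, written_as="mac_roman", read_as="cp1252"):
    """Encode text with one 8-bit character set and decode it with another."""
    raw = text.encode(written_as)
    # errors="replace" guards against the few codes that the reading
    # character set leaves undefined.
    return raw.decode(read_as, errors="replace")


# The ASCII letters survive, but the accented letter comes out wrong.
print(misread("café"))
```

The ASCII part of the word is unchanged, whilst the accented letter is shown as a different character.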

Minor differences in a character set can be corrected by repeated use of the find and replace feature found in most text editor applications. Alternatively, you can ‘fudge’ the problem by using special fonts. For example, if you want to read Windows documents on a Mac OS computer you can select a special font on the Mac OS machine that displays the Windows character set.

However, such messy solutions can be avoided by using special code conversion software. For example, the Classic Mac OS incorporates Apple’s Text Encoding Converter (TEC) mechanism, which is available to conversion utilities such as Cyclone (Tomasz Kukielka). You can also obtain special conversion software for Windows and other computer platforms.
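
Where a scripting language is to hand, a similar conversion can be sketched in a few lines. The example below is not the TEC mechanism or Cyclone; it simply assumes that Python's built-in mac_roman and cp1252 codecs are suitable, and the function name mac_to_windows is hypothetical.

```python
def mac_to_windows(raw: bytes) -> bytes:
    """Convert Mac OS Roman bytes to Windows-1252 bytes via Unicode."""
    text = raw.decode("mac_roman")  # 8-bit Mac OS codes -> Unicode
    # Raises UnicodeEncodeError if a character has no Windows equivalent.
    return text.encode("cp1252")    # Unicode -> 8-bit Windows codes


# Mac OS code 142 (e acute) becomes Windows code 233.
print(list(mac_to_windows(bytes([99, 97, 102, 142]))))
```

Going through Unicode means that only characters present in both sets are converted; anything else raises an error rather than being silently changed.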

Unfortunately, the following characters in the Latin Mac OS character set, as well as the standard Apple symbol, aren’t available in the 8-bit Windows 1252 or ISO Latin-1 sets:-

≠  ∞  ≤  ≥  ∂  ∑  ∏  π  ∫  Ω  √  ≈  ∆  ◊  ﬁ  ﬂ  ı  ˘  ˙  ˚  ˝  ˛  ˇ

During code conversion such characters are often replaced by alternative symbols, such as ? (question mark) or _ (underscore). In some instances, the offending symbol is replaced by a similar ASCII character or a special string inside angle brackets, such as <pi> in place of the ‘pi’ character.
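
Both kinds of replacement can be sketched in Python. The example assumes the built-in cp1252 codec as the target set; the SYMBOL_NAMES dictionary and the named_replace handler are hypothetical names introduced for illustration, and the list of symbols is deliberately short.

```python
import codecs

# Hypothetical names for a few symbols; extend as required.
SYMBOL_NAMES = {"π": "pi", "Ω": "Omega", "∑": "Sigma"}


def named_replace(error):
    """Replace each unencodable character with <name>, or ? if unnamed."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    bad = error.object[error.start:error.end]
    replacement = "".join(
        "<%s>" % SYMBOL_NAMES[ch] if ch in SYMBOL_NAMES else "?" for ch in bad
    )
    # Resume encoding just after the characters that were replaced.
    return replacement, error.end


codecs.register_error("named_replace", named_replace)

text = "2πr ≈ 6.28r"
print(text.encode("cp1252", errors="replace"))        # question marks only
print(text.encode("cp1252", errors="named_replace"))  # <pi> where a name exists
```

The standard "replace" handler gives a question mark for every missing character, whereas the custom handler keeps a readable name where one has been supplied.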

The following table shows how non-ASCII codes are converted when translating Mac OS text into the Windows format:-

Mac Win   Mac Win   Mac Win   Mac Win   Mac Win   Mac Win   Mac Win
128 196   129 197   130 199   131 201   132 209   133 214   134 220
135 225   136 224   137 226   138 228   139 227   140 229   141 231
142 233   143 232   144 234   145 235   146 237   147 236   148 238
149 239   150 241   151 243   152 242   153 244   154 246   155 245
156 250   157 249   158 251   159 252   160 134   161 176   162 162
163 163   164 167   165 149   166 182   167 223   168 174   169 169
170 153   171 180   172 168   173 141   174 198   175 216   176 144
177 177   178 143   179 142   180 165   181 181   182 240   183 221
184 222   185 254   186 138   187 170   188 186   189 253   190 230
191 248   192 191   193 161   194 172   195 175   196 131   197 188
198 208   199 171   200 187   201 133   202 160   203 192   204 195
205 213   206 140   207 156   208 173   209 151   210 147   211 148
212 145   213 146   214 247   215 215   216 255   217 159   218 158
219 164   220 139   221 155   222 128   223 129   224 135   225 183
226 130   227 132   228 137   229 194   230 202   231 193   232 203
233 200   234 205   235 206   236 207   237 204   238 211   239 212
240 157   241 210   242 218   243 219   244 217   245 166   246 136
247 152   248 150   249 154   250 178   251 190   252 184   253 189
254 179   255 185

Similarly, the following table shows the conversion from Windows to Mac OS text:-

Win Mac   Win Mac   Win Mac   Win Mac   Win Mac   Win Mac   Win Mac
128 222   129 223   130 226   131 196   132 227   133 201   134 160
135 224   136 246   137 228   138 186   139 220   140 206   141 173
142 179   143 178   144 176   145 212   146 213   147 210   148 211
149 165   150 248   151 209   152 247   153 170   154 249   155 221
156 207   157 240   158 218   159 217   160 202   161 193   162 162
163 163   164 219   165 180   166 245   167 164   168 172   169 169
170 187   171 199   172 194   173 208   174 168   175 195   176 161
177 177   178 250   179 254   180 171   181 181   182 166   183 225
184 252   185 255   186 188   187 200   188 197   189 253   190 251
191 192   192 203   193 231   194 229   195 204   196 128   197 129
198 174   199 130   200 233   201 131   202 230   203 232   204 237
205 234   206 235   207 236   208 198   209 132   210 241   211 238
212 239   213 205   214 133   215 215   216 175   217 244   218 242
219 243   220 134   221 183   222 184   223 167   224 136   225 135
226 137   227 139   228 138   229 140   230 190   231 141   232 143
233 142   234 144   235 145   236 147   237 146   238 148   239 149
240 182   241 150   242 152   243 151   244 153   245 155   246 154
247 214   248 191   249 157   250 156   251 158   252 159   253 189
254 185   255 216
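
Tables of this kind lend themselves to a simple byte-for-byte translation. The Python sketch below uses only five pairs copied from the Mac OS to Windows table, so it is an illustration rather than a complete converter; the names MAC_TO_WIN_SAMPLE, build_table and invert are hypothetical.

```python
# A few Mac OS -> Windows pairs copied from the first table above.
# A real converter would list every code from 128 to 255.
MAC_TO_WIN_SAMPLE = {128: 196, 138: 228, 142: 233, 159: 252, 167: 223}


def build_table(pairs):
    """Return a 256-byte translation table; unlisted codes map to themselves."""
    table = bytearray(range(256))
    for source, target in pairs.items():
        table[source] = target
    return bytes(table)


def invert(pairs):
    """Swap each pair to obtain the mapping for the opposite direction."""
    return {target: source for source, target in pairs.items()}


MAC_TO_WIN = build_table(MAC_TO_WIN_SAMPLE)
WIN_TO_MAC = build_table(invert(MAC_TO_WIN_SAMPLE))

mac_bytes = bytes([109, 138, 159, 167])
win_bytes = mac_bytes.translate(MAC_TO_WIN)
print(list(win_bytes))
print(win_bytes.translate(WIN_TO_MAC) == mac_bytes)
```

Because each table assigns every code to a different code, swapping the pairs of one table gives the other, which is why the sample round trip returns the original bytes.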

Downward Conversion

With suitable software, you can convert 7-bit or 8-bit material into Unicode format or convert 7-bit data into 8-bit format. However, converting Unicode material to 8-bit form or converting 8-bit material down to a 7-bit set is far trickier, since there aren’t sufficient codes to represent all the characters.

Fortunately, if the material is suitable for presentation on a Web browser, you can convert the text into HTML form. Then, assuming the conversion software works properly and you have a modern browser, you should see most characters correctly. Unfortunately, some older browsers don’t recognise all HTML entities or character codes, so some characters can still look wrong.
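
One way of producing such HTML is to replace every non-ASCII character with a numeric character reference. The following Python sketch assumes this approach, using the standard html module and the "xmlcharrefreplace" error handler; the function name to_ascii_html is hypothetical.

```python
import html


def to_ascii_html(text: str) -> str:
    """Return 7-bit ASCII text with other characters as numeric references."""
    escaped = html.escape(text, quote=False)  # protect &, < and >
    return escaped.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_ascii_html("café ≠ <b>"))
```

The result contains only ASCII codes, so it survives any 7-bit or 8-bit transfer, leaving the browser to restore the original characters.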

If you can’t use HTML, you’ll have to resort to using alternative characters or a string of characters to replace some of the original Unicode or 8-bit characters. The following table shows the standard Latin-1 character set, complete with the non-standard codes that exist between 128 and 159:-

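Dec  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
 32     !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /
 48  0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?
 64  @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
 80  P  Q  R  S  T  U  V  W  X  Y  Z  [  \  ]  ^  _
 96  `  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o
112  p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~
128  €     ‚  ƒ  „  …  †  ‡  ˆ  ‰  Š  ‹  Œ     Ž
144     ‘  ’  “  ”  •  –  —  ˜  ™  š  ›  œ     ž  Ÿ
160     ¡  ¢  £  ¤  ¥  ¦  §  ¨  ©  ª  «  ¬  -  ®  ¯
176  °  ±  ²  ³  ´  µ  ¶  ·  ¸  ¹  º  »  ¼  ½  ¾  ¿
192  À  Á  Â  Ã  Ä  Å  Æ  Ç  È  É  Ê  Ë  Ì  Í  Î  Ï
208  Ð  Ñ  Ò  Ó  Ô  Õ  Ö  ×  Ø  Ù  Ú  Û  Ü  Ý  Þ  ß
224  à  á  â  ã  ä  å  æ  ç  è  é  ê  ë  ì  í  î  ï
240  ð  ñ  ò  ó  ô  õ  ö  ÷  ø  ù  ú  û  ü  ý  þ  ÿ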

Those characters outside the normal ASCII character set can be replaced as follows:-

Dec0123456789101112131415
128eur,f,,,...+.++^%oS'<OEZ'
144''""**---~(TM)s'>oez'Y"
160 !!centpndcurryen|sect"(C)a/<<-,-(R)-
176deg+-^2^3'mu|P.,^1o/>>1/41/23/4??
192A`A'A^A~A"A*AEC,E`E'E^E"I`I'I^I"
208-DN~O`O'O^O~O"xO/U`U'U^U"Y'p|Ss
224a`a'a^a~a"a*aec,e`e'e^e"i`i'i^i"
240o+n~o`o'o^o~o"÷o/u`u'u^u"y'P|y"

Such replacements avoid loss of information, although some characters can be ambiguous.
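
A replacement scheme of this kind can be sketched in Python with a translation table. The strings below follow a handful of the suggestions above as best they can be read; the names REPLACEMENTS and to_seven_bit are hypothetical, and any character without an entry simply becomes a question mark.

```python
# A few replacement strings in the spirit of the table above.
REPLACEMENTS = str.maketrans({
    "©": "(C)", "®": "(R)", "™": "(TM)",
    "¼": "1/4", "½": "1/2", "¾": "3/4",
    "Æ": "AE", "æ": "ae", "Œ": "OE", "œ": "oe",
    "°": "deg", "±": "+-",
})


def to_seven_bit(text: str) -> str:
    """Swap known characters for ASCII strings, then mark the rest with ?."""
    replaced = text.translate(REPLACEMENTS)
    return replaced.encode("ascii", errors="replace").decode("ascii")


print(to_seven_bit("©2004 ½ œuvre"))
```

A fuller table would need an entry for every code from 128 to 255.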

Special characters in the Mac OS character set can also be replaced, as in the suggestions shown below:-

Character  Replacement    Character  Replacement
≠          /=             Ω          Omega
∞          o=o            √          v/
≤          <=             ≈          ~=
≥          >=             ∆          Delta
∂          delta          ◊          <>
∑          Sigma          ﬁ          fi
∏          Pi             ﬂ          fl
π          pi             (Apple)    Apple
∫          Int            ı          <1>

Line Breaks

Different computer platforms use different codes for a line break, also known as a line ending or end-of-line (EOL). Most systems employ either the CR (carriage return) or LF (line feed) codes or a combination of both, as shown below:-

System      End of Line
Mac OS •    CR
MS-DOS      CR followed by LF
UNIX        LF
VAX         CR followed by LF

• Mac OS X is based on Unix and can use LF or CR.
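
Converting between these conventions is straightforward. The Python sketch below is one possible approach, working on raw bytes so that the character set doesn't matter; the function name normalise_newlines and its eol parameter are hypothetical.

```python
def normalise_newlines(data: bytes, eol: bytes = b"\n") -> bytes:
    """Convert CR LF, lone CR and lone LF line breaks to a single style."""
    # Handle the two-byte CR LF pair first so it isn't counted twice.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified if eol == b"\n" else unified.replace(b"\n", eol)


print(normalise_newlines(b"one\r\ntwo\rthree\nfour"))
print(normalise_newlines(b"one\ntwo", eol=b"\r\n"))
```

Passing b"\r" or b"\r\n" as the second argument gives Classic Mac OS or MS-DOS line breaks instead of the UNIX form.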

The Use of Line Breaks

Early computer systems could only store a specified number of characters in each line. Indeed, many machines used punched cards that held a maximum of 80 characters, corresponding to a line of text in a fixed-width window. Hence a line break was inserted at the end of every visible line of text.

Nowadays, text isn’t stored in chunks of a fixed size and can automatically wrap within a chosen size of viewing window. Although this removes some of the presentational control provided by fixed line breaks, it allows the break characters to usefully separate paragraphs of text. In fact, in the Mac OS, a CR code is often represented by a ¶ (paragraph symbol).

Unfortunately, material from the Internet or a PC often has a line break at the end of every line. This means that you can end up with both a hard wrap, provided by the line break characters, and a soft wrap, provided by automatic wrapping in the window. This causes the text to appear as a series of broken lines, varying with the width of the window.

If you encounter this problem, you’ll need a special utility to remove the unwanted line breaks. This should remove breaks within a paragraph whilst ensuring that the breaks used for separating paragraphs are kept in place. Ideally, it should also remove any unwanted spaces from the text.
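
The job of such a utility can be sketched in Python. This minimal example assumes that paragraphs are separated by at least one blank line, which won't be true of every file; the function name unwrap is hypothetical.

```python
import re


def unwrap(text: str) -> str:
    """Join hard-wrapped lines, keeping blank lines as paragraph breaks."""
    # Treat CR LF and lone CR as LF so that all files are handled alike.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # One or more blank lines mark the boundary between paragraphs.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Splitting on whitespace and rejoining removes the line breaks inside
    # a paragraph along with any surplus spaces.
    cleaned = [" ".join(paragraph.split()) for paragraph in paragraphs]
    return "\n\n".join(cleaned)


sample = "This is a\nhard wrapped\r\nparagraph.\n\nSecond   one\nhere.\n"
print(unwrap(sample))
```

Text that marks paragraphs in some other way, such as an indented first line, would need a different rule for finding the boundaries.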

©Ray White 2004.