Code Conversion

Although future computers will undoubtedly use Unicode, most machines continue to muddle along with 8-bit coding, allowing a maximum of 256 characters in each character set. And in plain ASCII text files, 7-bit coding is used, restricting the user to fewer than 128 characters.

Although different 8-bit coding systems are used on alternative computer platforms, most modern word-processing applications use a standardised character set within a proprietary document format, allowing files to be opened on any kind of machine. In fact, many Western applications use files that employ the ISO 8859-1 (Latin-1) character set, whatever computer is used.

Text File Conversion

Although a plain ASCII text file can be read on any computer, other text files should use the same character set as the recipient’s computer platform. If such files are incorrectly coded you’ll still be able to read the text, although many of the non-ASCII codes will give the wrong characters.
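
The effect of reading a file with the wrong character set can be sketched in Python. This is only an illustration: it assumes that Python's built-in mac_roman and cp1252 codecs are reasonable stand-ins for the Mac OS and Windows character sets, and the sample word is invented for the demonstration.

```python
# Text saved on a Mac: encode it with the Mac OS Roman character set.
original = "café"
mac_bytes = original.encode("mac_roman")

# Reading the same bytes as Windows-1252 keeps the ASCII letters intact
# but turns the accented letter into a different character.
misread = mac_bytes.decode("cp1252")

print(misread)
```

The first three letters survive because both sets share the ASCII codes, whilst the code used for the accented letter means something else in the Windows set.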

Minor differences in a character set can be corrected by repeated use of the find and replace feature found in most text editor applications. Alternatively, you can ‘fudge’ the problem by using special fonts. For example, if you want to read Windows documents on a Mac OS computer you can select a special font on the Mac OS machine that displays the Windows character set.

However, such messy solutions can be avoided by using special code conversion software. For example, the Classic Mac OS incorporates Apple’s Text Encoding Converter (TEC) mechanism, which is available to conversion utilities such as Cyclone (Tomasz Kukielka). You can also obtain special conversion software for Windows and other computer platforms.

Unfortunately, the following characters in the Latin Mac OS character set, as well as the standard Apple symbol, aren’t available in the 8-bit Windows 1252 or ISO Latin-1 sets:-

≠ ∞ ≤ ≥ ∂ ∑ ∏ π ∫ Ω √ ≈ ∆ ◊ ﬁ ﬂ ı ˘ ˙ ˚ ˝ ˛ ˇ

During code conversion such characters are often replaced by alternative symbols, such as ? (question mark) or _ (underscore). In some instances, the offending symbol is replaced by a similar ASCII character or a special string inside angle brackets, such as <pi> in place of the ‘pi’ character.
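
As a rough sketch of the question mark behaviour, the following Python fragment encodes text for Windows and substitutes ? for anything the target set lacks. The function name to_windows and the sample string are hypothetical, and the cp1252 codec is assumed to represent the Windows 1252 set.

```python
def to_windows(text: str) -> bytes:
    """Encode text as Windows-1252, replacing unavailable characters with '?'."""
    return text.encode("cp1252", errors="replace")


sample = "2π ≈ 6.28"
converted = to_windows(sample)

# Both the pi and the 'almost equal' symbols are missing from the
# Windows set, so each one comes out as a question mark.
print(converted)
```

Real conversion utilities may choose other substitutes, such as an underscore or a bracketed name.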

The following table shows how non-ASCII codes are converted when translating Mac OS text into the Windows format:-

Mac  Win    Mac  Win    Mac  Win
128  196    129  197    130  199
131  201    132  209    133  214
134  220    135  225    136  224
137  226    138  228    139  227
140  229    141  231    142  233
143  232    144  234    145  235
146  237    147  236    148  238
149  239    150  241    151  243
152  242    153  244    154  246
155  245    156  250    157  249
158  251    159  252    160  134
161  176    162  162    163  163
164  167    165  149    166  182
167  223    168  174    169  169
170  153    171  180    172  168
173  141    174  198    175  216
176  144    177  177    178  143
179  142    180  165    181  181
182  240    183  221    184  222
185  254    186  138    187  170
188  186    189  253    190  230
191  248    192  191    193  161
194  172    195  175    196  131
197  188    198  208    199  171
200  187    201  133    202  160
203  192    204  195    205  213
206  140    207  156    208  173
209  151    210  147    211  148
212  145    213  146    214  247
215  215    216  255    217  159
218  158    219  164    220  139
221  155    222  128    223  129
224  135    225  183    226  130
227  132    228  137    229  194
230  202    231  193    232  203
233  200    234  205    235  206
236  207    237  204    238  211
239  212    240  157    241  210
242  218    243  219    244  217
245  166    246  136    247  152
248  150    249  154    250  178
251  190    252  184    253  189
254  179    255  185
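
A conversion of this kind can be approximated in Python by passing through Unicode. The sketch below is not the method used to build the table: it assumes Python's mac_roman and cp1252 codecs, and the function name mac_to_windows is hypothetical. It agrees with the table for characters that exist in both sets, but it gives ? where the table assigns a Mac-only character to an otherwise unused Windows code.

```python
def mac_to_windows(data: bytes) -> bytes:
    """Convert Mac OS Roman bytes to Windows-1252 bytes via Unicode."""
    return data.decode("mac_roman").encode("cp1252", errors="replace")


# Spot-check a few (Mac code, Windows code) pairs taken from the table.
for mac_code, win_code in [(128, 196), (142, 233), (160, 134), (170, 153)]:
    assert mac_to_windows(bytes([mac_code])) == bytes([win_code])

# Mac code 173 has no Windows equivalent, so this sketch yields '?'
# rather than the code 141 listed in the table.
print(mac_to_windows(bytes([173])))
```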

Downward Conversion

With suitable software, you can convert 7-bit or 8-bit material into Unicode format or convert 7-bit data into 8-bit format. However, converting Unicode material to 8-bit form or converting 8-bit material down to a 7-bit set is far trickier, since there aren’t sufficient codes to represent all the characters.

Fortunately, if the material is suitable for presentation on a Web browser, you can convert the text into HTML form. Then, assuming the conversion software works properly and you have a modern browser, you should see most characters correctly. Unfortunately, some older browsers don’t recognise all HTML entities or character codes, so some characters can still look wrong.
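
One way of producing such HTML is to replace every non-ASCII character with a numeric character reference, as in the Python sketch below. The function name to_html_ascii is hypothetical, and the sketch deliberately ignores the separate matter of escaping the &, < and > characters, which a real converter would also need to handle.

```python
def to_html_ascii(text: str) -> str:
    """Return 7-bit text with non-ASCII characters as numeric references."""
    # 'xmlcharrefreplace' swaps each unencodable character for &#NNN;
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_html_ascii("π ≤ 4"))   # prints: &#960; &#8804; 4
```

The result contains only 7-bit codes, so it can be stored in a plain ASCII text file.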

If you can’t use HTML, you’ll have to resort to using alternative characters or a string of characters to replace some of the original Unicode or 8-bit characters. The following table shows the standard Latin-1 character set, complete with the non-standard codes that exist between 128 and 159:-

Latin Character Set

Those characters outside the normal ASCII character set can be replaced as follows:-

Non-ASCII Replacements

Such replacements avoid loss of information, although some characters can be ambiguous.

Special characters in the Mac OS character set can also be replaced, as in the suggestions shown below:-

Character       Replacement
≠               /=
∞               o=o
≤               <=
≥               >=
∂               delta
∑               Sigma
∏               Pi
π               pi
∫               Int
Ω               Omega
√               v/
≈               ~=
∆               Delta
◊               <>
ﬁ               fi
ﬂ               fl
Apple symbol    Apple
ı               <1>
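
Replacements of this kind are easily automated. The following Python sketch covers only some of the suggestions, pairing each replacement with its character on the assumption that the suggestions follow the same order as the list of Mac OS characters given earlier. The names REPLACEMENTS and replace_specials are hypothetical.

```python
# A partial mapping of special characters to ASCII replacement strings.
REPLACEMENTS = {
    "≠": "/=",
    "∞": "o=o",
    "≤": "<=",
    "≥": ">=",
    "π": "pi",
    "Ω": "Omega",
    "√": "v/",
    "≈": "~=",
}

# str.maketrans accepts a dictionary of single characters to strings.
TABLE = str.maketrans(REPLACEMENTS)


def replace_specials(text: str) -> str:
    """Swap each listed special character for its ASCII replacement."""
    return text.translate(TABLE)


print(replace_specials("x ≤ π"))   # prints: x <= pi
```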

Line Breaks

Different computer platforms use different codes for a line break, also known as a line ending or end-of-line (EOL). Most systems employ either the CR (carriage return) or LF (line feed) codes or a combination of both, as shown below:-

System      End of Line
Mac OS •    CR
MS-DOS      CR followed by LF
UNIX        LF
VAX         CR followed by LF

• Mac OS X is based on Unix and can use LF or CR
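
Converting between these conventions is straightforward. The Python sketch below, with the hypothetical function name normalise_line_breaks, first reduces every kind of break to LF and then, if required, substitutes the break wanted by the target system.

```python
def normalise_line_breaks(text: str, eol: str = "\n") -> str:
    """Convert CR LF, CR and LF line breaks to a single chosen form."""
    # Handle CR LF first so that it isn't treated as two separate breaks.
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    return unified if eol == "\n" else unified.replace("\n", eol)


mixed = "one\r\ntwo\rthree\nfour"
print(repr(normalise_line_breaks(mixed)))            # UNIX form
print(repr(normalise_line_breaks(mixed, "\r\n")))    # MS-DOS form
```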

The Use of Line Breaks

Early computer systems could only store a specified number of characters in each line. Indeed, many machines used punched cards that held a maximum of 80 characters, corresponding to a line of text in a fixed-width window. Hence a line break was inserted at the end of every visible line of text.

Nowadays, text isn’t stored in chunks of a fixed size and can automatically wrap within a chosen size of viewing window. Although this removes some of the presentational control provided by fixed line breaks, it allows the break characters to usefully separate paragraphs of text. In fact, in the Mac OS, a CR code is often represented by a ¶ (paragraph symbol).

Unfortunately, material from the Internet or a PC often has a line break at the end of every line. This means that you can end up with both a hard wrap, provided by the line break characters, and a soft wrap, provided by automatic wrapping in the window. This causes the text to appear as a series of broken lines, varying with the width of the window.

If you encounter this problem, you’ll need a special utility to remove the unwanted line breaks. This should remove breaks within a paragraph whilst ensuring that the breaks used for separating paragraphs are kept in place. Ideally, it should also remove any unwanted spaces from the text.
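
The job done by such a utility can be sketched in a few lines of Python. This is a simplified illustration rather than a description of any particular product: it assumes that paragraphs are separated by a blank line, which isn't true of all hard-wrapped text, and the function name unwrap is hypothetical.

```python
import re


def unwrap(text: str) -> str:
    """Join hard-wrapped lines whilst keeping blank-line paragraph breaks."""
    # Reduce all line breaks to LF before looking for paragraphs.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # split() with no argument discards line breaks and surplus spaces alike.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


wrapped = "one  two\nthree\n\nfour\nfive\n"
print(unwrap(wrapped))
```

Each paragraph comes out as a single line, leaving the viewing window to provide the soft wrap.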

©Ray White 2004.