Text

The process of transferring information by means of computer is sometimes known as electronic data interchange (EDI). The most common kind of data is text, which is usually stored in a text file. Each byte (or pair of bytes) in such a file represents a character in the complete string of text.

The following kinds of applications can be used to create and modify text files:-

Hex Editors

A hex editor can be used to examine the raw data within any kind of file, including a text file. Unlike more sophisticated software, this type of program doesn’t decode the data structure of the file. Instead, it simply provides a list of all the data bytes that exist inside the file.

The following diagram shows how a text file can appear when viewed in HexEdit, a popular hex editor used in the Classic Mac OS. As you can see, the hex values for each byte appear in the middle part of the window while the corresponding characters are shown to the right. The column on the extreme left provides a byte count, in hex, for the number of bytes up to the start of the current line.

The document shown here is a 16-bit Unicode Transformation Format (UTF-16) file, which means each character is represented by a double-byte code. As a result, the ASCII characters that make up most of this file appear double-spaced, while the non-ASCII characters at the end of the document are shown as dots. In this example the two bytes for the letter M are highlighted, together with the letter itself in the right-hand column.
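The three-column layout described above can be imitated in a few lines of Python. This is only a rough sketch: the function name hex_dump and the exact column widths are assumptions made for illustration, and are not taken from HexEdit itself:-

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Return a hex-editor style listing of data (the layout is illustrative)."""
    pad = width * 3 - 1  # room for 'width' two-digit values separated by spaces
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        # Middle column: each byte as two hex digits.
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Right column: printable ASCII as-is, everything else as a dot.
        text_part = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)
        # Left column: count of bytes before this line, in hex.
        lines.append(f"{offset:08X}  {hex_part:<{pad}}  {text_part}")
    return "\n".join(lines)


# Two bytes per character, upper byte first, as in a UTF-16 file.
sample = "Mary".encode("utf-16-be")
print(hex_dump(sample))
```

The right-hand column for this sample reads .M.a.r.y, which is the double-spaced effect mentioned above: each 00 byte is shown as a dot.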

Text Processors

A text processor is similar to a word processor, except that it doesn’t let you change the layout of the material using tabs or permit any variation in fonts, font sizes or font styles. In other words, it only works with unstyled text, which must use a single font throughout the document.

And unlike a hex editor, a text processing application usually decodes the data structure of the file. Here’s the text file shown above, as seen in BBEdit, a powerful text processor for the Mac OS:-

As you can see, this application shows the characters with the correct spacing and also displays the non-ASCII characters at the end of the document. Unfortunately, not all text processors are as good as BBEdit, so you may find double-spaced characters or incorrect characters in some documents.

The WorldText application, supplied with some versions of the Classic Mac OS, gives a similar result, with all the characters correctly represented. Unfortunately, SimpleText, the basic text editor supplied as part of the Classic Mac OS, refuses UTF-16 files, since they have a type code of utxt instead of TEXT. If you change the type code, SimpleText gives the following result:-

which illustrates how badly things can go wrong. Although the characters don’t appear double-spaced, extra symbols have been added and the non-ASCII characters are shown incorrectly.

Styled Text Editors

A styled text editor lets you use different fonts, font sizes or font styles, although you can’t usually change the layout with tabs in the way that you would with a word processor. Since a standard text file can’t store such information some kind of styled text file must be used.

Many computer platforms, including Mac OS X, employ Rich Text Format (RTF) files that are understood by numerous word processing applications, although there are variations in this format.

The Classic Mac OS employs the SimpleText variety of styled text file, which can be opened using SimpleText or TeachText. If such a file is copied to another computer platform, all of the style information and other special content is lost. However, any textual material is retained as plain text.

Other proprietary file formats that accommodate styled text usually require a matching application. However, some programs, including WorldText, can handle both their own documents and RTF files.

Markup Language Editors

A markup language allows extra information, contained in text strings known as tags, to be kept in a plain text file or UTF file. Common languages include Hypertext Markup Language (HTML), as used in the World Wide Web (WWW), and Extensible Markup Language (XML).
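To illustrate how tags sit alongside ordinary text, the following sketch uses the HTMLParser class from Python's standard library to separate the two. The names TagCollector and split_markup are invented for this example, and the approach is only one of many possible:-

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects opening tag names and plain text separately."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Called once for each opening tag, such as <p> or <b>.
        self.tags.append(tag)

    def handle_data(self, data):
        # Called for the text found between tags.
        self.text.append(data)


def split_markup(source: str):
    """Return (list of opening tag names, text with the tags removed)."""
    parser = TagCollector()
    parser.feed(source)
    parser.close()
    return parser.tags, "".join(parser.text)


tags, text = split_markup("<p>Mary had a <b>little</b> lamb</p>")
print(tags)  # ['p', 'b']
print(text)  # Mary had a little lamb
```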

The creation of a text file containing a markup language can be approached in two ways. By using an advanced text-processor, such as BBEdit, you can work directly on the raw text, although this can be time-consuming and difficult. Alternatively, you can use a WYSIWYG editor suited to your chosen markup language, allowing you to use intuitive methods to create text and other elements.

Using Text for Graphics

Early computer systems were limited to plain ASCII text characters, a restriction that still applies to some basic e-mail systems. Many users have circumvented this limitation by using combinations of characters as graphical elements, in some instances creating quite elaborate ‘images’, although these are sometimes difficult to discern. However, the majority of ‘textual graphics’ are in the form of emoticons, the most common being the smiley face, which looks like this :-).

Text File Formats

A text file uses specific numerical codes for each kind of character, including standard control codes, such as CR (carriage return), LF (line feed) and HT (Horizontal Tab). Other control codes aren’t usually conveyed in a text file since they can cause disruption to software.

Typically, the content of a text file is of the form:-

Mary had a little lamb<CR><LF>

<HT>Its fleece was white as snow<CR><LF><EOF>

which contains the following control codes:-

Code  Meaning          Decimal  Hex
CR    Carriage Return  13       0D
LF    Line Feed        10       0A
HT    Horizontal Tab   9        09
EOF   End of File      26       1A
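As a rough illustration, the example text can be built up byte by byte in Python and then searched for its control codes. The names CONTROL_CODES, build_sample and find_control_codes are assumptions made for this sketch:-

```python
# Control codes from the table above, as byte values.
CONTROL_CODES = {"HT": 0x09, "LF": 0x0A, "CR": 0x0D, "EOF": 0x1A}


def build_sample() -> bytes:
    """Assemble the two-line example, including its control codes."""
    cr, lf, ht, eof = (
        bytes([CONTROL_CODES[name]]) for name in ("CR", "LF", "HT", "EOF")
    )
    return (
        b"Mary had a little lamb" + cr + lf
        + ht + b"Its fleece was white as snow" + cr + lf + eof
    )


def find_control_codes(data: bytes):
    """Return (position, name) for each known control code in data."""
    names = {value: name for name, value in CONTROL_CODES.items()}
    return [(i, names[b]) for i, b in enumerate(data) if b in names]


sample = build_sample()
print(find_control_codes(sample)[:3])  # [(22, 'CR'), (23, 'LF'), (24, 'HT')]
```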

However, the EOF byte is often omitted, as on a PC where the file size is rounded up to the nearest 128 bytes or so. Sadly, different platforms use alternative codes for a line break, also known as a line ending or end-of-line (EOL). Most systems use CR or LF codes, as shown below:-

System    End of Line Codes
Mac OS •  CR
MS-DOS    CR followed by LF
Unix •    LF
VAX       CR followed by LF

• Mac OS X is based on Unix and can use LF or CR
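Because of these differences, a file often needs its line endings converted when it moves between platforms. The following is a minimal sketch of one way to do this; the function name normalise_line_endings and its eol parameter are invented for the example:-

```python
def normalise_line_endings(data: bytes, eol: bytes = b"\n") -> bytes:
    """Convert CR, LF and CR LF line breaks to a single chosen form."""
    # Deal with CR LF first, so its two bytes aren't counted as two breaks.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", eol)


mixed = b"Classic Mac\rMS-DOS\r\nUnix\n"
print(normalise_line_endings(mixed))            # b'Classic Mac\nMS-DOS\nUnix\n'
print(normalise_line_endings(mixed, b"\r\n"))   # b'Classic Mac\r\nMS-DOS\r\nUnix\r\n'
```

The order of the replacements matters: if lone CR codes were converted first, each CR LF pair would end up as two line breaks.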

Fortunately, markup languages such as HTML, as well as most forms of RTF and many proprietary documents, ignore these standard codes, using special tags or strings to identify the end of each line.

The following list describes some of the more common text formats, each shown with the appropriate filename extension and Classic Mac OS type code:-

Plain Text File   .txt  TEXT

A file of this kind contains a succession of bytes, each of which represents a specific textual character or control code. The text is conveyed on its own without extra information about fonts, font styles, tab positions or page layout. In other words, the file only contains plain text, also known as pure text.

Traditionally, such a document only uses the 7-bit codes in the ASCII character set and is known as an ASCII text file. In practice, some modern text files also use 8-bit codes, although different codes can be used to suit the computer’s character set, making such files unsuitable for some email systems. A file containing the Mac OS character set is shown below, as viewed in HexEdit:-

In this example, the highlighted character is a space, which is represented by hex 20. Although the non-ASCII characters can’t be seen here they do appear if such a file is opened in a text application.
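The difference between 7-bit and 8-bit codes can be demonstrated with a short Python sketch. The function name is_ascii_text is an assumption, while mac_roman and cp1252 are the names Python gives to its codecs for the Mac OS and Windows character sets:-

```python
def is_ascii_text(data: bytes) -> bool:
    """True if every byte is a 7-bit code (hex 00 to 7F)."""
    return all(b < 0x80 for b in data)


raw = b"caf\x8e"  # the last byte is an 8-bit code

print(is_ascii_text(b"Mary"))   # True
print(is_ascii_text(raw))       # False
print(raw.decode("mac_roman"))  # café
# The same byte gives a different character in the Windows character set.
print(raw.decode("mac_roman") == raw.decode("cp1252"))  # False
```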

Some text files contain Unicode Transformation Format (UTF) data, either in 7-bit form (UTF-7) or 8-bit form (UTF-8). This ensures that all characters, including non-Roman and other international characters, appear the same on all computer platforms. It should be noted, however, that there’s no advantage in using UTF files if you only want to convey ASCII characters.

In a UTF-8 file, ASCII characters still use one byte while other characters use between two and four bytes. The data in a UTF-8 file that contains the Mac OS character set is shown below, as viewed in HexEdit:-

Once again, the highlighted character is a space, represented by hex 20. The file is identified as a UTF-8 document by the EF, BB and BF codes that precede this character. The codes for non-ASCII characters are all 8-bit values (higher than 7F), thereby avoiding confusion with the ASCII codes.
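These points can be checked using Python's built-in UTF-8 codec. In this sketch, the function name strip_utf8_bom is invented for the example, while codecs.BOM_UTF8 is the standard library's constant for the EF, BB and BF codes:-

```python
import codecs


def strip_utf8_bom(data: bytes):
    """Return (data without the EF BB BF prefix, whether it was present)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):], True
    return data, False


print(" ".encode("utf-8").hex())       # 20     (ASCII: one byte, unchanged)
print("\u00e9".encode("utf-8").hex())  # c3a9   (e acute: two bytes)
print("\u20ac".encode("utf-8").hex())  # e282ac (euro sign: three bytes)

encoded = codecs.BOM_UTF8 + "a \u00e9".encode("utf-8")
body, had_bom = strip_utf8_bom(encoded)
print(had_bom, body.decode("utf-8"))
```

Every byte used for a non-ASCII character is higher than 7F, so it can never be mistaken for an ASCII code.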

UTF-16 Text File   .txt  utxt

This is a variation of plain text file (see above) that only contains 16-bit values, also known as double-byte codes. This makes this kind of file suitable for conveying non-Roman or other international characters. The upper byte for every ASCII character is set to 00, as shown in this UTF-16 file that contains the Mac OS character set, as viewed from within HexEdit:-

The highlighted character is a space, which is hex 00 20. Although the non-ASCII characters aren’t shown in HexEdit’s window, they do appear correctly in an application that accepts UTF-16 files. The first character is preceded by FE and FF, the byte order marks, indicating that the byte order is normal, with the upper byte of each pair first (big-endian). Some processors prefer a swapped (little-endian) byte order, as shown below:-

As you can see, all the pairs of data bytes are now reversed, including the byte order marks. Hence the normal code of 00 20 for a space is replaced by 20 00 and so on.
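An application can use the byte order marks to decide how to read the file. The following is a minimal sketch of that decision; the function name decode_utf16 is invented, and treating an unmarked file as having the normal byte order is an assumption rather than a rule followed by every application:-

```python
import codecs


def decode_utf16(data: bytes, default: str = "utf-16-be") -> str:
    """Decode UTF-16 data, using the byte order marks if they are present."""
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF: normal byte order
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE: swapped byte order
        return data[2:].decode("utf-16-le")
    # No marks: fall back to an assumed byte order.
    return data.decode(default)


print(" ".encode("utf-16-be").hex())  # 0020
print(" ".encode("utf-16-le").hex())  # 2000
print(decode_utf16(b"\xfe\xff\x00M\x00a"))  # Ma
print(decode_utf16(b"\xff\xfeM\x00a\x00"))  # Ma
```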

In the third variation of UTF-16 the byte order marks are omitted entirely. Whatever version of file is used, a text application, such as BBEdit, should present the characters correctly, as shown below:-

Rich Text Format (RTF)   .rtf  TEXT

This format is used with word processors, usually for interchanging files with Microsoft Word. It consists of normal text interspersed with special character strings, similar to those in a markup language (see below), which represent information about font styles and formatting. RTFs are also supported by many other applications, including ClarisWorks, MacWrite II, Works and WriteNow.

There are several variations in the RTF standard, causing some applications to reject specific files. Information about a document can be gleaned by examining it with a text editor such as BBEdit. The type of file is indicated in the first line of text, sometimes known as a file header. For example, a document that uses the Windows (ANSI) character set usually begins with:-

{\rtf0\ansi

while one that uses the Mac OS character set should begin with:-

{\rtf1\mac
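The same check can be made by a program. The sketch below reads the version number and character set keyword from the first line of an RTF file; the name read_rtf_header is invented, and the pattern assumes that the character set keyword immediately follows the version number, which may not hold for every variation of RTF:-

```python
import re

# Matches an opening such as {\rtf1\mac and captures the two parts.
HEADER = re.compile(r"^\{\\rtf(\d*)\\(ansi|mac|pca|pc)\b")


def read_rtf_header(first_line: str):
    """Return (version, character set keyword), or None if not recognised."""
    match = HEADER.match(first_line)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(read_rtf_header(r"{\rtf0\ansi"))  # ('0', 'ansi')
print(read_rtf_header(r"{\rtf1\mac"))   # ('1', 'mac')
print(read_rtf_header("Mary had a little lamb"))  # None
```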

Mac OS Styled Text File   .txt  TEXT/ttro

This kind of file, unique to the Classic Mac OS, can be created or modified using SimpleText, formerly known as TeachText. The text is kept in a plain text file, in this context known as the data fork, while the styles are kept as a styl resource in a separate file known as the resource fork. However, both files are presented as a single document via the workings of the Classic Mac OS.

Markup Text Files   Various  TEXT

A text file can contain additional information in the form of a markup language. The file used for this purpose can be a plain text file or one of the variations of UTF. However, it’s worth noting that some applications, including older Web browsers, can’t accommodate every kind of UTF file.

The oldest languages, such as Hypertext Markup Language (HTML), are derived from the Standard Generalised Markup Language (SGML), using tags to convey style and formatting. More recent varieties are based on the Extensible Markup Language (XML).

The following older formats are rarely encountered:-

DCA-RFT   .rft  TEXT

The Document Content Architecture-Revisable Form Text format is a PC standard that originates from IBM’s Distributed Office Support System (DISOSS) and Systems Network Architecture (SNA). This kind of file contains limited formatting information, including details of margins and page width, but doesn’t accommodate different fonts, font sizes or graphic elements. Text or data fields from other documents may be incorporated into such a file.

DCA-FFT   .fft  TEXT

The Document Content Architecture-Final Form Text format is a less common form of DCA (see above) and is used for storing a completed document. This kind of file contains limited formatting information, including details of margins, line spacing, font and justification, together with indicators that mark any text that needs an underscore or overstrike.

©Ray White 2004.