4. HTML Basics

Each web page on the internet is, at its simplest, a plain text file written in Hypertext Markup Language (HTML). This is one of a family of markup languages used to store information on computer systems. Other varieties include Extensible Markup Language (XML) and server-parsed HTML (SHTML), which is ordinary HTML containing instructions that the web server processes before sending the page to the browser.

The File

A text document used for a web page normally has a filename extension (the dot and letters at the end of the name) of .htm or .html. There are many different types of text file, known as encodings, some designed for specific computer operating systems, but the variety almost universally used for web pages is the 8‑bit Unicode Transformation Format (UTF‑8).

For the technically minded, this type of file should not have a Byte Order Mark (BOM).

Any kind of text, in any language, including pictographic characters and emojis, can be entered into a UTF‑8 document. If you create a web page using a UTF‑8 text file, all the characters you type or enter will be faithfully reproduced when viewed in a modern web browser.

Only UTF‑8 files should be used to create web pages. Other types should be avoided.
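A page can also state its encoding explicitly, so that the browser does not have to guess. The following is a minimal sketch of a complete page, assuming the file is saved as UTF‑8; the title and paragraph text are placeholders chosen for illustration.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <!-- Declares the encoding; it should match how the file is actually saved -->
  <meta charset="utf-8">
  <title>Placeholder title</title>
</head>
<body>
  <p>Café, 日本語 and 🙂 are all stored as UTF-8.</p>
</body>
</html>
```

The declaration is best placed at the very start of the head, since browsers look for it only within the first 1024 bytes of the file.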

To create a web page you’ll need a text editor, ideally a specialised one that incorporates utilities for processing HTML content, as well as a preview facility, allowing you to view the page as seen in a web browser.

In macOS you can use BBEdit, a highly versatile text editor for HTML and coding.

To create a web page, open a new text document, ensuring that the formatting options are set to Unicode (UTF‑8) and that the line endings are set to Unix (LF). Line Feed (LF) is the control code that begins a new line in text files on the Unix operating system, which is widely used for web servers.

The Language

HTML, like the other markup languages in its family, uses angle bracket characters, ＜ (also known as ‘less than’) and ＞ (also known as ‘greater than’), to delineate tags. Here’s a simple example:

＜p＞This is ＜em＞emphasised＜/em＞ text.＜/p＞

Seen in a browser, this appears as a paragraph of text with a portion emphasised, like this:

This is emphasised text.

The spacing above and below the paragraph, the size and style of the font, and the variation caused by emphasis are all determined by Cascading Style Sheets (CSS), usually kept in a separate file. Should such a file not exist, the content is displayed using the browser’s defaults, which can vary, depending on the software or the device.
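As a rough illustration, style rules can also be written inside the page itself rather than in a separate file. The spacing, size and colour values below are arbitrary choices for this sketch, not recommendations.

```html
<style>
  /* Space above and below each paragraph, plus the font size */
  p  { margin: 1em 0; font-size: 16px; }
  /* How emphasised text is shown; most browsers default to italic */
  em { font-style: italic; color: #a00000; }
</style>

<p>This is <em>emphasised</em> text.</p>
```

When the rules live in a separate file, the page refers to it with a ＜link rel="stylesheet"＞ element in its head, so that many pages can share the same styles.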

A section of HTML code consisting of a matching opening tag and closing tag, such as ＜em＞ and ＜/em＞, together with the content between them, is called an element.

Closing tags should be included in all elements, except those that can never have content, such as:
＜br＞ (line break)
＜hr＞ (horizontal rule)
＜wbr＞ (optional word break)
＜img＞ (image)
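A short sketch of the first two of these in use, written with ordinary angle brackets as they would be typed in a file; the address and the second paragraph are invented for the example.

```html
<p>
  12 Example Street<br>
  Exampletown
</p>
<!-- A horizontal rule separates the two paragraphs; it has no closing tag -->
<hr>
<p>This paragraph appears below the rule.</p>
```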

White Spaces

Browsers usually treat white space characters, such as tabs and line feeds, which don't represent any visible character, as a normal space between words. Similarly, a run of several spaces, or of spaces mixed with other white space characters, is displayed as a single space. This means that line feed characters are best kept to the ends of real lines, and other white space characters avoided.
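The effect can be seen in a small sketch. Assuming the browser’s default styles are in force, the two paragraphs below look identical on screen, despite the extra spaces, tab and line feeds in the second.

```html
<!-- Both paragraphs display as: This is some text. -->
<p>This is some text.</p>

<p>This    is
	some
      text.</p>
```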

A non-breaking space inserted between words makes a length of text behave as one word without any white spaces. On the other hand, the introduction of a ＜wbr＞ tag in the middle of a word or string of text allows the browser to wrap the text at that point, should there be insufficient space in the window.
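A minimal sketch of both techniques follows. The non-breaking space is written here as the character entity ＆nbsp; so that it is visible in the source, although the character itself can also be typed directly; the distance and the web address are made up for the example.

```html
<!-- The non-breaking space keeps "10" and "km" together on one line -->
<p>The town is 10&nbsp;km away.</p>

<!-- Each wbr tag marks a point where the long address may wrap -->
<p>www.example.com/<wbr>a-very-long-path/<wbr>page.html</p>
```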

Formatting

Raw HTML can end up being difficult to read, especially in a large and complex document, so the text is often formatted to ease editing, although this makes no difference to the appearance of the page as seen in a browser. The option to do this should be available in your text editor. Here’s the example above in a formatted style:

＜p＞ ␊
␉ This is ␊
␉ ␉ ＜em＞ ␊
␉ ␉ ␉ emphasised ␊
␉ ␉ ＜/em＞ ␊
␉ text. ␊
＜/p＞ ␊

As you can see, ␉ (horizontal tab) and ␊ (line feed) characters have been added. Various kinds of formatting exist, so you’re free to use whatever format is easiest for you. Note, however, that the more sophisticated styles of formatting may add line feeds in places where they produce unwanted spaces in your final content.

Optimisation

Formatting an HTML file can increase its size considerably, in some cases almost doubling it, which means it takes up more space on the server and takes longer to download, and so appears more slowly in a browser. To minimise size, your text editor should also have a facility to optimise the document, removing all the unwanted control codes that were introduced by formatting.

The ＜ Character

The beginning of a tag in HTML is marked by the ＜ character, meaning it can't be typed directly elsewhere in your text. Normally this isn’t a problem, since the character isn’t that commonly used.

When you really need to use a ＜ character there are two options:

1. Use one of the other Unicode variations of ‘less than’ characters, along with the matching ‘greater than’ symbols.

2. Employ a character entity, a mechanism that dates from before UTF‑8 came into general use. This tells the browser to treat a ＜ or ＞ as an ordinary character, not as part of HTML. To do this, simply type ＆lt; or ＆gt; in their place.
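The two options can be sketched side by side, using an invented sentence. Note that the characters in the first paragraph only resemble the ordinary symbols; they are different Unicode characters, so a search for the ordinary ‘less than’ sign will not find them.

```html
<!-- Option 1: fullwidth look-alike characters, typed directly -->
<p>If a ＜ b and b ＞ c, then b is the largest.</p>

<!-- Option 2: character entities, which the browser shows as the ordinary symbols -->
<p>If a &lt; b and b &gt; c, then b is the largest.</p>
```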

Avoid using ＆ (ampersand) where it might create other character entities by mistake. Such entities consist of ＆ followed by other characters and/or numbers and ending in a ; (semicolon). In practice this rarely happens.

Because of the use of ＆ in character entities, some older validation software may object to the existence of ＆ itself within an HTML document. If this happens you should replace all instances of ＆ with ＆amp; or avoid the character entirely within the content of your text.
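A brief sketch of the replacement, using invented text and a hypothetical address at example.com. The same substitution applies to an ampersand inside an attribute value, such as a web address.

```html
<!-- Displays as: Fish & chips -->
<p>Fish &amp; chips</p>

<!-- The link still leads to an address containing a plain ampersand -->
<a href="https://www.example.com/search?q=tea&amp;lang=en">Search results</a>
```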

Hyperlinks

In almost every web page you can click on hyperlinks, portions of ‘clickable’ text that take you to other web locations. Each hyperlink requires an anchor tag, represented as ＜a＞, as in:

Please click ＜a href="https://www.apple.com"＞here＜/a＞

where ‘here’ is the hyperlink text. Such text can also be a partial or complete URL, such as:

Go to ＜a href="https://www.apple.com"＞https://www.apple.com＜/a＞

or:

More at ＜a href="https://www.apple.com"＞apple.com＜/a＞

You can also use an image as a ‘clickable’ object, as in:

＜a href="https://www.apple.com"＞＜img src="apple.png" alt="Link to Apple"＞＜/a＞

which includes the alt attribute, a text description that is read out by screen readers for those with visual impairments and is shown if the image fails to load.

Links can also be made to bookmarks, which lead to other parts of the same page, parts of another page on the same site or even parts of a page on another site. You can do this with something like:

＜a href="#id12"＞Memories of Menlo Park＜/a＞

which provides a link to the following ＜h2＞ heading elsewhere on the page, identified by its id:

＜h2＞＜a id="id12"＞Chapter 12＜/a＞＜/h2＞

To bookmark a location in another page, the link can be of the form:

＜a href="html_demo.html#C4"＞Chapter 4＜/a＞

where C4 is the id of the required location on the page at html_demo.html.

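An id value must be unique within its page and must not contain spaces; a combination of letters and numbers, beginning with a letter, is the safest choice. Ids such as top and bottom can be used to allow navigation to either end of a document. Modern browsers scroll to the start of the page for a link to #top even when no element has that id, but a link to #bottom works only if an element at the end of the page has been given that id.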
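A minimal sketch of links to both ends of a page, assuming the first and last elements are given the ids top and bottom explicitly; the heading and link text are placeholders. It also shows that an id can be placed on any element, not only on an anchor.

```html
<h1 id="top">Placeholder heading</h1>
<p><a href="#bottom">Jump to the end</a></p>

<!-- The rest of the page goes here -->

<p id="bottom"><a href="#top">Back to the start</a></p>
```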