A few days ago I checked the HTML validity of my own blog. I have to admit that I hadn’t done that after writing the jugglingPostLang plugin for WordPress (described in my blog post Simple Language-Defined Posts in WordPress). It turned out that the plugin produces invalid HTML in many cases:
The issue is that `<span>` elements forbid block-level content, while `<div>` elements are not allowed as children of any element that only accepts phrasing content (roughly: running text and the inline markup inside it).
To solve that, I read through the HTML specifications, but unfortunately there is no simple solution in sight.
In the following I’m going to summarize what the HTML specification says on the topic and give a rough sketch of an algorithm that determines a valid wrapping node.
Specifications referred to in this Article
Once upon a time the W3C was the only organization that specified what HTML had to look like to be called „valid“. These specifications were valuable during the browser war, when browser vendors competed for more and better features while ignoring the specification. Today that war is fought more on performance and usability, in compliance with web standards. But the W3C has often been criticised as being slow and not very innovative: additional features take ages to be added to the standards.
As a reaction, browser vendors founded the WHATWG to speed up innovation while keeping it within a common standard that allows multiple compatible browsers on the market. This working group started to specify HTML5, and eventually the W3C dropped its XHTML2 effort and continued with the HTML5 specification started by the WHATWG.
Both organizations publish an HTML5 specification. Due to the collaboration they should not differ much, but the WHATWG HTML5 specification is updated quite actively, so I can’t be sure. The W3C variant of the HTML5 specs is available here.
For the purpose of this article I checked both, and they agreed on the issue discussed here.
HTML elements are „categorized“ into so-called content categories, and every element has a content model expressed in terms of those categories (link to corresponding WHATWG section). A content model basically defines what content is allowed inside a given node.
The categories are not disjoint sets; they overlap.
Flow content contains most elements from the body part of a typical HTML document, basically anything that takes part in the document layout flow.
Phrasing content is a strict subset of flow content. That is: wherever an element requires its children to be flow content, any phrasing content element fulfills this condition as well.
For elements with a transparent content model (for example `<a>`, `<ins>` and `<del>`), the category depends on the actual content. These elements accept whatever their parent would accept, and the children they actually contain determine how they are categorized. If an element with a transparent content model contains only children from the phrasing content category, it is itself taken as phrasing content. If it contains any flow content element that is not phrasing content, it is taken as flow content.
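To make the three rules above concrete, here is a minimal sketch in Python. The tag sets are small illustrative subsets that I picked for the example, not the complete lists from the specification, and the `(tag, children)` tuple representation and the function name `category` are assumptions made only for this sketch.

```python
# Illustrative subsets only; the full category lists live in the HTML specification.
PHRASING = {"span", "em", "strong", "code", "abbr", "img", "br"}
TRANSPARENT = {"a", "ins", "del"}


def category(node):
    """Classify a node as 'phrasing' or 'flow'.

    A node is either a text string or a (tag, children) tuple.
    """
    if isinstance(node, str):
        return "phrasing"  # plain text counts as phrasing content
    tag, children = node
    if tag in PHRASING:
        return "phrasing"
    if tag in TRANSPARENT:
        # Transparent: phrasing only if every child is phrasing.
        if all(category(child) == "phrasing" for child in children):
            return "phrasing"
        return "flow"
    return "flow"  # everything else is treated as flow-only in this sketch


print(category(("a", ["read ", ("em", ["more"])])))  # phrasing
print(category(("a", [("div", ["teaser"])])))        # flow
```

Treating every unknown tag as flow-only is a deliberately conservative choice: in doubt, the sketch prefers the category that is allowed in fewer places.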
Semantics of HTML elements
Too often HTML is taken as a means to lay out content. For decades (which is basically ever since the beginning of HTML) HTML has been misused to get the right visual shape. In contrast, the markup language HTML started from a semantic perspective. Elements like `<p>`, `<h1>` to `<h6>`, `<ul>` and `<ol>` or `<a>` carry a dedicated semantic meaning: the `<h*>` tags are headlines of different levels that structure the document (in HTML5 this is extended by the sectioning content), `<p>` is a paragraph, `<ul>` and `<ol>` model list content. The `<a>` tag mirrors the core idea of hypertext through its dual semantic meaning as an anchor that is either the source or the target of a hyperlink (as a source it usually takes an `href` attribute to reference the target, as a target it has an `id` attribute that makes it a unique target together with the document’s URL).
Two HTML elements, `<div>` and `<span>`, don’t carry any semantics. Compared to the content model, which is a syntactical perspective, these two could be seen as „semantically transparent“. There are two major differences between them:
- By default a `<div>` element is a block-level element, while a `<span>` is an inline element (and displayed as such unless overridden by CSS), and
- their content models differ: a `<div>` may contain flow content, a `<span>` only phrasing content.
Some attributes in HTML are „universal“ (the specification calls them global attributes), as they can be used on nearly any node. The best known of those are `class`, `id` and `style`.
The `lang` attribute is another one, though it isn’t used very often. It specifies the language of the element’s content and has widespread, though mostly hidden, practical uses:
- Screen readers may adapt how the text is pronounced based on the language (imagine how English sounds when read with French pronunciation rules, or German read like English).
- For search engines it’s easier to understand the meaning of an article if at least the language is known (although the determination of the language by heuristics may be a minor problem in the overall process).
- Browsers, plugins or dedicated tools that provide automatic translations for websites can be sure about the source language of the translation (try to figure out whether something is American or British English).
My personal background is a rather basic knowledge of software accessibility in general, so I would like to „feed“ screen readers, for example, with the best information they can get.
Algorithm Sketch: What Content Model does a given HTML substring Comply with?
As input, an HTML substring is given. The input may consist of one or more HTML nodes, including text nodes.
As long as the input represents valid content for any HTML element (which means it is or may be part of a valid HTML document), the output of the algorithm should be valid HTML as well.
- Take the input string and parse it as an HTML node list, or alternatively (because it’s better supported by APIs), wrap the input in another node `<root>` and parse the result as XML.
- Traverse the document depth-first with early pruning to determine the content model. For each child, apply the first rule that matches:
  1. If a child C is flow content but not phrasing content, we can no longer wrap the whole input in a `span`, so we return `div` and are finished.
  2. If a child is phrasing content, we can skip it because, assuming the input is valid, its content can be phrasing content only.
  3. If a child has a transparent content model, we have to call the algorithm recursively on its children and use the result to handle this child with (1) and (2).
- If we have not returned before, we return `span`, as we didn’t find any child element that is not phrasing content.
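The steps above could be sketched in Python roughly as follows. This is not the plugin's code (the plugin is written for WordPress), only an illustration under a few assumptions: I swap the `<root>`-plus-XML parsing for Python's standard `html.parser`, because it tolerates void elements such as `<br>`; the tag sets are incomplete illustrative subsets; and the names `wrapping_tag`, `_wrapper_for` and `_TreeBuilder` are made up for this sketch.

```python
from html.parser import HTMLParser

# Illustrative subsets only; unknown tags are treated as flow-only.
PHRASING = {"span", "em", "strong", "code", "abbr", "img", "br", "q", "small"}
TRANSPARENT = {"a", "ins", "del"}
VOID = {"br", "img", "hr", "input", "wbr"}  # elements without an end tag


class _TreeBuilder(HTMLParser):
    """Builds a minimal tree of (tag, children) tuples; text is ignored."""

    def __init__(self):
        super().__init__()
        self.root = ("root", [])
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self._stack[-1][1].append(node)
        if tag not in VOID:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open element, if there is one.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break


def _wrapper_for(children):
    for tag, grandchildren in children:
        if tag in PHRASING:
            continue  # rule 2: prune, valid phrasing holds only phrasing
        if tag in TRANSPARENT:
            # rule 3: the category depends on the element's own children
            if _wrapper_for(grandchildren) == "div":
                return "div"
            continue
        return "div"  # rule 1: flow content that is not phrasing content
    return "span"  # nothing but phrasing content was found


def wrapping_tag(html):
    """Return 'span' or 'div' as a valid wrapper for the given fragment."""
    builder = _TreeBuilder()
    builder.feed(html)
    builder.close()
    return _wrapper_for(builder.root[1])


print(wrapping_tag("Hello <em>world</em>"))             # span
print(wrapping_tag('<a href="#"><h2>Title</h2></a>'))   # div
```

Text nodes are not stored at all, since text is always phrasing content and can never force a `div`.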
As an alternative approach, I could have implemented an algorithm that does not wrap the content but manipulates it instead: wrapping top-level text nodes in `span` nodes and adding a `lang` attribute to every other top-level node of the given text.
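A rough sketch of that alternative, again in Python and again only as an illustration: it follows the `<root>` wrapping idea and therefore assumes the fragment is well-formed XML (so `<br/>` rather than `<br>`). The function name `add_lang` and the decision to leave an already present `lang` attribute untouched are my own assumptions for this sketch.

```python
import xml.etree.ElementTree as ET


def _span(text, lang):
    """Serialize a text node wrapped in <span lang="...">."""
    el = ET.Element("span", {"lang": lang})
    el.text = text
    return ET.tostring(el, encoding="unicode", method="html")


def _text(text, lang):
    if text is None:
        return ""
    if not text.strip():
        return text  # keep pure whitespace as it is
    return _span(text, lang)


def add_lang(fragment, lang):
    """Tag every top-level node of a well-formed fragment with `lang`."""
    root = ET.fromstring(f"<root>{fragment}</root>")
    out = [_text(root.text, lang)]  # text before the first element
    for child in root:
        tail, child.tail = child.tail, None  # serialize the tail separately
        if child.get("lang") is None:  # assumption: keep an existing lang
            child.set("lang", lang)
        out.append(ET.tostring(child, encoding="unicode", method="html"))
        out.append(_text(tail, lang))  # text after this element
    return "".join(out)


print(add_lang("Hallo <em>Welt</em>!", "de"))
# <span lang="de">Hallo </span><em lang="de">Welt</em><span lang="de">!</span>
```

The result needs no decision between `span` and `div`, but it changes the author's markup and produces more nodes than a single wrapper would.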