apart() · Mecha CMS

`apart()`

Splits an HTML/XML string into tokens.

Table of Contents

Description
Example
Notes

Description

apart(string $value, array $raw = [], array $void = []): array;

This function splits an HTML/XML string into an array of tokens, where each entry contains at least the token string and the token type. Additional information may also be present. It is usually used in combination with the pair() function.

Example

$tokens = apart('asdf <asdf>asdf &amp; asdf</asdf> ');

// [
//     ['asdf ', 0],
//     ['<asdf>', 2, 6],
//     ['asdf ', 0],
//     ['&amp;', -1],
//     [' asdf', 0],
//     ['</asdf>', 2, 7],
//     [' ', 0]
// ]
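As a minimal sketch of how the token data can be read, the snippet below works on the token list from the example above, copied in as a literal array so that it runs without Mecha. It joins the first entry of every token back into a string and counts the tokens by type. That the joined string equals the input is an observation about this particular example, not a documented guarantee.

```php
// Token list copied from the example above (not produced by calling apart() here).
$tokens = [
    ['asdf ', 0],
    ['<asdf>', 2, 6],
    ['asdf ', 0],
    ['&amp;', -1],
    [' asdf', 0],
    ['</asdf>', 2, 7],
    [' ', 0],
];

// Entry 0 of each token is the token string; join them back together.
$joined = implode('', array_column($tokens, 0));

// Entry 1 of each token is the token type; count the tokens per type.
$counts = array_count_values(array_column($tokens, 1));

var_dump($joined === 'asdf <asdf>asdf &amp; asdf</asdf> '); // bool(true)
print_r($counts); // 4 tokens of type 0, 2 tokens of type 2, 1 token of type -1
```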

The second entry of each token holds the token type. The type does not identify a particular token pattern; it tells you whether the token needs further processing to be complete. Type 2 marks an incomplete token and is applied to opening and closing HTML/XML tags. You will probably need it when you want to build a nested HTML/XML element as a single token from the token data array:

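$i = -1; // Index of the current entry in `$lot`
$lot = []; // Resulting token list, with nested elements merged
$stack = 0; // Number of currently open elements
foreach (apart($value) as $v) { // `$value` is the HTML/XML string to process
    // Inside an open element: append every token to the current entry
    if ($stack > 0) {
        if (2 === $v[1]) {
            // A closing tag (`</…>`) decreases the depth, an opening tag increases it
            $stack += '/' === $v[0][1] ? -1 : 1;
        }
        $lot[$i][0] .= $v[0];
        continue;
    }
    // Outside any element: only an opening tag starts a new group, so that
    // a closing tag that is missing its opening tag does not swallow what follows
    if (2 === $v[1] && '/' !== $v[0][1]) {
        $stack += 1;
    }
    // Keep the length of the opening tag, or of the whole token if none is given
    $lot[++$i] = [$v[0], $v[1], $v[2] ?? strlen($v[0])];
}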

test($lot);
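To see the depth counter at work, here is a sketch that wraps the loop above in a function and runs it on a nested input. The function name `group_elements()` is hypothetical, and the token list is hand-written in the documented format for `<a><b>x</b></a><c/>` rather than produced by apart(). The extra check for closing tags at depth 0 is an assumption about the desired behavior.

```php
// Hypothetical helper wrapping the loop above; it takes a token list instead of calling apart().
function group_elements(array $tokens): array
{
    $i = -1;
    $lot = [];
    $stack = 0; // Current nesting depth
    foreach ($tokens as $v) {
        if ($stack > 0) {
            if (2 === $v[1]) {
                $stack += '/' === $v[0][1] ? -1 : 1;
            }
            $lot[$i][0] .= $v[0];
            continue;
        }
        // Assumption: a closing tag found at depth 0 should not open a group.
        if (2 === $v[1] && '/' !== $v[0][1]) {
            $stack += 1;
        }
        $lot[++$i] = [$v[0], $v[1], $v[2] ?? strlen($v[0])];
    }
    return $lot;
}

// Hand-written tokens in the documented format for `<a><b>x</b></a><c/>`.
$tokens = [
    ['<a>', 2, 3],
    ['<b>', 2, 3],
    ['x', 0],
    ['</b>', 2, 4],
    ['</a>', 2, 4],
    ['<c/>', 1, 4, true],
];

$lot = group_elements($tokens);

// [
//     ['<a><b>x</b></a>', 2, 3],
//     ['<c/>', 1, 4]
// ]
```

The merged entry keeps type 2 and the length of its opening tag, because the loop only appends to the token string and never touches the other entries.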

Type 1 marks a complete token. It is applied to void XML elements, and to HTML elements that look like an opening XML tag but have their tag name listed in the $void parameter.

The $void parameter specifies the HTML5 tag names that should be treated as void element names. This list is required because HTML5 does not require authors to end void elements with />; a > alone is enough to close the specified void elements:

$tokens = apart('<asdf/><asdf></asdf><br/><br></br>');

// [
//     ['<asdf/>', 1, 7, true],
//     ['<asdf>', 2, 6],
//     ['</asdf>', 2, 7],
//     ['<br/>', 1, 5, true],
//     ['<br>', 2, 4], ← Treated as an open tag
//     ['</br>', 2, 5] ← Treated as a close tag
// ]

$tokens = apart('<asdf/><asdf></asdf><br/><br></br>', void: ['br']);

// [
//     ['<asdf/>', 1, 7, true],
//     ['<asdf>', 2, 6],
//     ['</asdf>', 2, 7],
//     ['<br/>', 1, 5, true],
//     ['<br>', 1, 4], ← Treated as a void tag
//     ['</br>', 2, 5] ← Treated as a close tag that is missing its open tag
// ]
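For regular HTML5 input you will usually want to pass more than one name. The sketch below builds a list from the void elements defined by the HTML Living Standard. The variable names are made up for this example, and whether Mecha ships a list of its own is not covered here, so treat this as an assumption. The call is guarded so the snippet also runs outside of Mecha.

```php
// Void element names as listed in the HTML Living Standard.
$void = [
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
    'input', 'link', 'meta', 'source', 'track', 'wbr',
];

$html = '<p>asdf<br>asdf<img src="asdf.png"></p>';

// Only call apart() where it exists, so that this sketch also runs without Mecha.
$tokens = function_exists('apart') ? apart($html, void: $void) : null;
```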

Character data sections, comments, document type declarations, and processing instructions are always treated as complete tokens. No further processing is required, and there is no configuration to treat them as a different kind of token. All of them are given a type of 1:

$tokens = apart('<!--<asdf>--><![CDATA[<asdf>]]><!asdf><?asdf?>');

// [
//     ['<!--<asdf>-->', 1],
//     ['<![CDATA[<asdf>]]>', 1],
//     ['<!asdf>', 1],
//     ['<?asdf?>', 1]
// ]
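Because all four kinds share type 1, you have to look at the token string itself to tell them apart. A minimal sketch, assuming the prefixes shown in the example above are enough to distinguish them; `markup_kind()` is a hypothetical helper, not part of Mecha:

```php
// Hypothetical helper: name the kind of a type 1 token from the start of its string.
function markup_kind(string $value): string
{
    // Order matters: `<!--` and `<![CDATA[` must be tested before the generic `<!`.
    if (str_starts_with($value, '<!--')) {
        return 'comment';
    }
    if (str_starts_with($value, '<![CDATA[')) {
        return 'cdata';
    }
    if (str_starts_with($value, '<?')) {
        return 'instruction';
    }
    if (str_starts_with($value, '<!')) {
        return 'doctype';
    }
    return 'other';
}

// Token list copied from the example above.
$tokens = [
    ['<!--<asdf>-->', 1],
    ['<![CDATA[<asdf>]]>', 1],
    ['<!asdf>', 1],
    ['<?asdf?>', 1],
];

$kinds = array_map(fn (array $token) => markup_kind($token[0]), $tokens);

// ['comment', 'cdata', 'doctype', 'instruction']
```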

Some HTML elements, such as <script>…</script> and <style>…</style>, may contain tokens that look like HTML/XML elements but should not be treated as such. This is not an issue with the XHTML standard, since it requires a character data section in those elements:

$tokens = apart(trim(
<<<HTML
<script>
const input = jQuery('<input/>').prop('disabled', true);
</script>
HTML
));

// [
//     ['<script>', 2, 8],
//     ['
// const input = jQuery(\'', 0],
//     ['<input/>', 1, 8, true], ← This is a false positive
//     ['\').prop(\'disabled\', true);
// ', 0],
//     ['</script>', 2, 9]
// ]

$tokens = apart(trim(
<<<XHTML
<script>
<![CDATA[
const input = jQuery('<input/>').prop('disabled', true);
]]>
</script>
XHTML
));

// [
//     ['<script>', 2, 8],
//     ['
// ', 0],
//     ['<![CDATA[
// const input = jQuery(\'<input/>\').prop(\'disabled\', true);
// ]]>', 1],
//     ['
// ', 0],
//     ['</script>', 2, 9]
// ]

To avoid false positives like the one in the first example, add 'script' as a raw element tag name to the $raw parameter. When the parser then finds an opening <script> tag, it jumps straight to the closing tag and ignores whatever is inside the element:

$tokens = apart(trim(
<<<HTML
<script>
const input = jQuery('<input/>').prop('disabled', true);
</script>
HTML
), raw: ['script']);

// [
//     ['<script>
// const input = jQuery(\'<input/>\').prop(\'disabled\', true);
// </script>', 1, 8],
// ]

You can see that it produces a single <script>…</script> token. Since that token is a complete HTML element, it gets a type of 1.

Notes

Tokens of type 2, and tokens of type 1 that are not a character data section, comment, document type declaration, or processing instruction, carry the token’s character length as an additional entry:

$tokens = [
    ['<asdf>', 2, 6],
    ['</asdf>', 2, 7],
    ['<asdf/>', 1, 7]
];

If a tag name is listed in the $raw parameter and a complete element with that name is found in the HTML/XML string, the token type is set to 1, but the length entry remains the length of the opening HTML/XML tag only, not the length of the entire element token:

// `$tokens = apart('<asdf asdf="asdf asdf">…</asdf>', raw: ['asdf']);`
$tokens = [
    ['<asdf asdf="asdf asdf">…</asdf>', 1, 23],
];

Preserving the opening tag’s length is useful for separating the opening and closing tags from the element’s content when you want to process those parts separately:

$token = $tokens[0];

$element_name = substr($token[0], 1); // Removes `<` from the start of the token
$element_name = strtok($element_name, " \n\r\t>"); // Gets the token just before the first white-space (or just before the `>` if it has no attributes) as the element name

$element_open = substr($token[0], 0, $token[2]);
$element_content = substr($token[0], $token[2], -strlen('</' . $element_name . '>'));

$element_open = process_element_open($element_open);
$element_content = process_element_content($element_content);

echo $element_open . $element_content . '</' . $element_name . '>';

Void elements that end in '/>' carry an additional boolean entry that indicates the presence of that ending:

// `$tokens = apart('<asdf><asdf/><asdf />', void: ['asdf']);`
$tokens = [
    ['<asdf>', 1, 6],
    ['<asdf/>', 1, 7, true],
    ['<asdf />', 1, 8, true],
];
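Taken together, the type, the length entry, and the token string are enough to tell the different tag tokens apart. The sketch below is one way to do it and is not part of Mecha; `tag_role()` is a hypothetical name. It assumes the length entry counts bytes, as the use of strlen() and substr() elsewhere on this page suggests.

```php
// Hypothetical helper: name the role of a token from its type, length entry, and string.
function tag_role(array $token): string
{
    [$value, $type] = $token;
    $length = $token[2] ?? null;
    if (2 === $type) {
        // Incomplete tokens are either closing tags (`</…>`) or opening tags.
        return '/' === $value[1] ? 'close' : 'open';
    }
    if (1 === $type && null !== $length) {
        // A void tag is as long as its length entry; a raw element is longer,
        // because its length entry only covers the opening tag.
        return strlen($value) === $length ? 'void' : 'raw';
    }
    // Text, special characters, comments, character data sections, etc.
    return 'other';
}

// Tokens copied from the examples on this page.
$roles = array_map('tag_role', [
    ['<asdf>', 2, 6],
    ['</asdf>', 2, 7],
    ['<asdf/>', 1, 7, true],
    ['<asdf>', 1, 6],
    ['<asdf asdf="asdf asdf">…</asdf>', 1, 23],
    ['<!--<asdf>-->', 1],
]);

// ['open', 'close', 'void', 'void', 'raw', 'other']
```

To find out whether a void token was written with the '/>' ending, check `$token[3] ?? false`.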

HTML/XML special characters are given a type of -1. Everything else is given a type of 0 and is treated as plain text:

$tokens = [
    ['&#9829;', -1],
    ['&#x2665;', -1],
    ['&hearts;', -1],
    ['♥', 0]
];
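One use for the -1 type is to decode only those tokens when you need the plain text of a token list. A minimal sketch using PHP’s html_entity_decode() on the tokens above; the decoding step is an illustration and not something apart() does for you:

```php
// Token list copied from the example above.
$tokens = [
    ['&#9829;', -1],
    ['&#x2665;', -1],
    ['&hearts;', -1],
    ['♥', 0],
];

$text = '';
foreach ($tokens as [$value, $type]) {
    // Decode special characters (type -1) and keep plain text (type 0) as it is.
    $text .= -1 === $type
        ? html_entity_decode($value, ENT_QUOTES | ENT_HTML5, 'UTF-8')
        : $value;
}

echo $text; // ♥♥♥♥
```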

This function works by loosely grouping tokens based on the presence of HTML/XML special characters. It does not take into account whether a tag name is valid and conforms to the tag name specification in HTML/XML or not. This also applies to the special character sequences. HTML/XML special characters in quoted attribute values also don’t have to be escaped. This does not conform to the HTML/XML attribute value specification, but may be very useful in the future to create HTML/XML-based template engines. Developers who want to use this function to validate their own HTML/XML code will have to perform additional processing:

$tokens = apart('<*** ***="a < b && c > d ? \'<*>\' : \'\'"></***>***&123;');

// [
//     ['<*** ***="a < b && c > d ? \'<*>\' : \'\'">', 2, 39],
//     ['</***>', 2, 6],
//     ['***', 0],
//     ['&123;', -1]
// ]
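As an example of such additional processing, the sketch below checks the tag name of each tag token against a deliberately conservative pattern. Both the helper name `tag_name_is_plausible()` and the pattern are assumptions made for this example. The pattern is a simplification and does not implement the full HTML or XML name rules.

```php
// Hypothetical helper: check the tag name of an opening, closing, or void tag token.
function tag_name_is_plausible(string $tag): bool
{
    // Drop the leading `<` or `</`, then read up to the first white-space, `/`, or `>`.
    $name = strtok(ltrim($tag, '</'), " \n\r\t/>");
    // Simplified rule: an ASCII letter followed by letters, digits, `:`, `.`, `_`, or `-`.
    return false !== $name && 1 === preg_match('/^[A-Za-z][A-Za-z0-9:._-]*$/', $name);
}

// Tag tokens copied from the examples on this page.
$tokens = [
    ['<asdf>', 2, 6],
    ['</asdf>', 2, 7],
    ['<*** ***="a < b && c > d ? \'<*>\' : \'\'">', 2, 39],
    ['</***>', 2, 6],
];

foreach ($tokens as $token) {
    echo $token[0] . ' => ' . (tag_name_is_plausible($token[0]) ? 'ok' : 'invalid') . "\n";
}
```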
