apart()

Splits an HTML/XML string into tokens.

Table of Contents
  1. Description
  2. Example
  3. Notes

Description

apart(string $value, array $raw = [], array $void = []): array;

This function splits an HTML/XML string into an array of tokens. Each entry contains at least the token string and the token type, and may carry additional data. It is usually used in combination with the pair() function.

Example

$tokens = apart('asdf <asdf>asdf &amp; asdf</asdf> ');

// [
//     ['asdf ', 0],
//     ['<asdf>', 2, 6],
//     ['asdf ', 0],
//     ['&amp;', -1],
//     [' asdf', 0],
//     ['</asdf>', 2, 7],
//     [' ', 0]
// ]

The second entry in each token holds the token type. The type does not identify a particular token pattern; it indicates whether the token requires additional processing to be complete. Type 2 marks an incomplete token and is applied to opening and closing HTML/XML tags. You will probably need it if you want to assemble a nested HTML/XML element into a single token from the token data:

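$i = -1; // Index of the current entry in `$lot`
$lot = [];
$stack = 0; // Number of elements that are currently open
foreach (apart($value) as $v) {
    if ($stack > 0) {
        // Inside an element: track the nesting depth and append the token to the current entry
        if (2 === $v[1]) {
            $stack += '/' === $v[0][1] ? -1 : 1;
        }
        $lot[$i][0] .= $v[0];
        continue;
    }
    // Only an opening tag starts a nested element; a stray closing tag must not open one
    if (2 === $v[1] && '/' !== $v[0][1]) {
        $stack += 1;
    }
    $lot[++$i] = [$v[0], $v[1], $v[2] ?? strlen($v[0])];
}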

test($lot);

Type 1 marks a complete token. It is applied to void XML elements, and to HTML elements that look like an opening XML tag but whose tag name is listed in the $void parameter.

The $void parameter specifies the HTML5 tag names that should be treated as void elements. The list is required because HTML5 does not require authors to end void elements with />. A > alone is enough to close the specified void elements:

$tokens = apart('<asdf/><asdf></asdf><br/><br></br>');

// [
//     ['<asdf/>', 1, 7, true],
//     ['<asdf>', 2, 6],
//     ['</asdf>', 2, 7],
//     ['<br/>', 1, 5, true],
//     ['<br>', 2, 4], ← Treated as an open tag
//     ['</br>', 2, 5] ← Treated as a close tag
// ]

$tokens = apart('<asdf/><asdf></asdf><br/><br></br>', void: ['br']);

// [
//     ['<asdf/>', 1, 7, true],
//     ['<asdf>', 2, 6],
//     ['</asdf>', 2, 7],
//     ['<br/>', 1, 5, true],
//     ['<br>', 1, 4], ← Treated as a void tag
//     ['</br>', 2, 5] ← Treated as a close tag that is missing its open tag
// ]

Character data sections, comments, document type declarations, and processing instructions are assumed to be complete tokens. No further processing is required, and there is no option to treat them as a different kind of token. All of them are given a type of 1:

$tokens = apart('<!--<asdf>--><![CDATA[<asdf>]]><!asdf><?asdf?>');

// [
//     ['<!--<asdf>-->', 1],
//     ['<![CDATA[<asdf>]]>', 1],
//     ['<!asdf>', 1],
//     ['<?asdf?>', 1]
// ]
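
Because all of these tokens share type 1, telling them apart is left to the caller. The following is a minimal sketch that inspects the leading characters of each token. The token_kind() helper and the kind names it returns are hypothetical and are not part of the library; the token data is copied from the example above:

```php
<?php

// Hypothetical helper (not part of the library): names the kind of a token
// by looking at its type and at its leading characters.
function token_kind(array $token): string {
    [$value, $type] = $token;
    if (-1 === $type) {
        return 'reference';
    }
    if (0 === $type) {
        return 'text';
    }
    if (2 === $type) {
        return '/' === ($value[1] ?? "") ? 'close' : 'open';
    }
    // From here on, the token is of type 1
    if (0 === strpos($value, '<!--')) {
        return 'comment';
    }
    if (0 === strpos($value, '<![CDATA[')) {
        return 'cdata';
    }
    if (0 === strpos($value, '<!')) {
        return 'doctype';
    }
    if (0 === strpos($value, '<?')) {
        return 'instruction';
    }
    return 'element';
}

// Token data as shown in the example above
$tokens = [
    ['<!--<asdf>-->', 1],
    ['<![CDATA[<asdf>]]>', 1],
    ['<!asdf>', 1],
    ['<?asdf?>', 1]
];

$kinds = array_map('token_kind', $tokens);
```

The comment check has to come before the generic '<!' check, since a comment also starts with '<!'.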

Some HTML elements, such as <script>…</script> and <style>…</style>, may contain tokens that look like HTML/XML elements but should not be treated as such. This is not an issue with the XHTML standard, since it requires a character data section in those elements:

$tokens = apart(trim(
<<<HTML
<script>
const input = jQuery('<input/>').prop('disabled', true);
</script>
HTML
));

// [
//     ['<script>', 2, 8],
//     ['
// const input = jQuery(\'', 0],
//     ['<input/>', 1, 8, true], ← This is a false positive
//     ['\').prop(\'disabled\', true);
// ', 0],
//     ['</script>', 2, 9]
// ]

$tokens = apart(trim(
<<<XHTML
<script>
<![CDATA[
const input = jQuery('<input/>').prop('disabled', true);
]]>
</script>
XHTML
));

// [
//     ['<script>', 2, 8],
//     ['
// ', 0],
//     ['<![CDATA[
// const input = jQuery(\'<input/>\').prop(\'disabled\', true);
// ]]>', 1],
//     ['
// ', 0],
//     ['</script>', 2, 9]
// ]

To avoid false positives like the one in the first example, add 'script' to the $raw parameter as a raw element tag name. When the parser finds an opening <script> tag, it jumps straight to the closing tag and ignores everything inside the element:

$tokens = apart(trim(
<<<HTML
<script>
const input = jQuery('<input/>').prop('disabled', true);
</script>
HTML
), raw: ['script']);

// [
//     ['<script>
// const input = jQuery(\'<input/>\').prop(\'disabled\', true);
// </script>', 1, 8],
// ]

As you can see, this produces a single <script>…</script> token. Since that token is a complete HTML element, it gets a type of 1.
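
For HTML5 input, both parameters are typically set together. The sketch below assumes the void element names defined by the HTML standard and the two raw text elements mentioned above. The variable names are only illustrative, and the call is guarded so that the snippet also runs where apart() is not loaded:

```php
<?php

// Void element names as listed in the HTML standard
$void = [
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
    'input', 'link', 'meta', 'source', 'track', 'wbr'
];

// Elements whose content should not be tokenized
$raw = ['script', 'style'];

$tokens = [];

// Assumption: `apart()` accepts the named parameters shown in the signature above
if (function_exists('apart')) {
    $tokens = apart('<p>asdf<br>asdf</p><style>a > b {}</style>', raw: $raw, void: $void);
}
```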

Notes

Tokens of type 2, and tokens of type 1 that are not a character data section, comment, document type declaration, or processing instruction, carry an additional entry that holds the token’s character length:

$tokens = [
    ['<asdf>', 2, 6],
    ['</asdf>', 2, 7],
    ['<asdf/>', 1, 7]
];

If a tag name is listed in the $raw parameter and a complete HTML/XML element with that name is found in the string, the token type is set to 1, but the length entry still holds the length of the opening tag only, not the length of the entire element:

// `$tokens = apart('<asdf asdf="asdf asdf">…</asdf>', raw: ['asdf']);`
$tokens = [
    ['<asdf asdf="asdf asdf">…</asdf>', 1, 23],
];

Keeping the length of the opening tag makes it possible to separate the opening and closing tags from the element’s content, in case you want to process those parts separately:

$token = $tokens[0];

$element_name = substr($token[0], 1); // Removes `<` from the start of the token
$element_name = strtok($element_name, " \n\r\t>"); // Gets the token just before the first white-space (or just before the `>` if it has no attributes) as the element name

$element_open = substr($token[0], 0, $token[2]);
$element_content = substr($token[0], $token[2], -strlen('</' . $element_name . '>'));

$element_open = process_element_open($element_open);
$element_content = process_element_content($element_content);

echo $element_open . $element_content . '</' . $element_name . '>';

Void elements that end with '/>' carry an additional boolean entry that indicates the presence of that ending:

// `$tokens = apart('<asdf><asdf/><asdf />', void: ['asdf']);`
$tokens = [
    ['<asdf>', 1, 6],
    ['<asdf/>', 1, 7, true],
    ['<asdf />', 1, 8, true],
];
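
One possible use of that boolean entry is to normalize void elements so that all of them end with '/>'. This is only a sketch: the close_void() helper is hypothetical, and it assumes that the length entry is a byte length as returned by strlen():

```php
<?php

// Hypothetical helper (not part of the library): rewrites a void element
// token so that it ends with `/>`. Other tokens are returned unchanged.
function close_void(array $token): array {
    [$value, $type] = $token;
    // Only a void element token has a length entry that covers the whole token.
    // Raw element tokens keep the length of their opening tag only.
    if (1 !== $type || !isset($token[2]) || $token[2] !== strlen($value)) {
        return $token;
    }
    if (!empty($token[3])) {
        return $token; // Already ends with `/>`
    }
    $value = substr($value, 0, -1) . '/>';
    return [$value, 1, strlen($value), true];
}

// Token data as shown in the example above
$tokens = [
    ['<asdf>', 1, 6],
    ['<asdf/>', 1, 7, true],
    ['<asdf />', 1, 8, true]
];

$closed = array_map('close_void', $tokens);
```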

HTML/XML special characters, that is, character references such as &amp;, are given a type of -1. Everything else is given a type of 0 and is treated as plain text:

$tokens = [
    ['&#9829;', -1],
    ['&#x2665;', -1],
    ['&hearts;', -1],
    ['♥', 0]
];
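
The two types can be combined to extract readable text from a token list. The sketch below keeps tokens of type 0 as they are, decodes tokens of type -1 with PHP’s html_entity_decode(), and skips everything else. The tokens_to_text() helper is hypothetical, and the token data is copied from the first example on this page:

```php
<?php

// Hypothetical helper (not part of the library): joins the text of a token
// list and decodes the character references along the way.
function tokens_to_text(array $tokens): string {
    $out = "";
    foreach ($tokens as $token) {
        if (0 === $token[1]) {
            $out .= $token[0];
        } else if (-1 === $token[1]) {
            $out .= html_entity_decode($token[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
        }
        // Tokens of type 1 and 2 are markup, so they are skipped
    }
    return $out;
}

// Token data as shown in the first example
$tokens = [
    ['asdf ', 0],
    ['<asdf>', 2, 6],
    ['asdf ', 0],
    ['&amp;', -1],
    [' asdf', 0],
    ['</asdf>', 2, 7],
    [' ', 0]
];

$text = tokens_to_text($tokens); // 'asdf asdf & asdf '
```

References that html_entity_decode() does not recognize are left as they are.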

This function works by loosely grouping tokens based on the presence of HTML/XML special characters. It does not check whether a tag name is valid according to the HTML/XML specification; the same applies to special character sequences. HTML/XML special characters in quoted attribute values do not have to be escaped either. This does not conform to the HTML/XML attribute value specification, but may prove useful for building HTML/XML-based template engines. Developers who want to use this function to validate their own HTML/XML code will have to perform additional processing:

$tokens = apart('<*** ***="a < b && c > d ? \'<*>\' : \'\'"></***>***&123;');

// [
//     ['<*** ***="a < b && c > d ? \'<*>\' : \'\'">', 2, 39],
//     ['</***>', 2, 6],
//     ['***', 0],
//     ['&123;', -1]
// ]
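
As an illustration of such additional processing, the sketch below collects the tag tokens whose names do not look like valid names. Both helpers are hypothetical, and the pattern is a simplified, ASCII-only approximation of an XML name rather than the full grammar:

```php
<?php

// Hypothetical helper: extracts the tag name from an opening, closing, or void tag token
function tag_name(string $value): string {
    return (string) strtok(ltrim($value, '</'), " \n\r\t/>");
}

// Hypothetical helper: simplified, ASCII-only check (the real XML name grammar allows more)
function is_valid_tag_name(string $name): bool {
    return 1 === preg_match('/^[A-Za-z_][A-Za-z0-9_.:-]*$/', $name);
}

// Token data as shown in the example above
$tokens = [
    ['<*** ***="a < b && c > d ? \'<*>\' : \'\'">', 2, 39],
    ['</***>', 2, 6],
    ['***', 0],
    ['&123;', -1]
];

$invalid = [];
foreach ($tokens as $token) {
    if (!isset($token[2])) {
        continue; // Only tag tokens carry a length entry
    }
    if (!is_valid_tag_name(tag_name($token[0]))) {
        $invalid[] = $token[0];
    }
}
```

With the data above, $invalid ends up holding both the opening and the closing '***' tag.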
