Mecha CMS

Mecha CMS blog and documentation.

Common Problems in Making the Shortcode Parser

Updated: Sunday, 07 August 2016

The first problem is not a bug. It’s a security standard.

Problem 1: Shortcode Phrase with Descendant Definitions

Let’s say you want to create a shortcode parser with additional features, such as the ability to recognize specific descendant patterns on a parent shortcode phrase. You then decide to use this kind of pattern:

$input = '{{parent.child}}the content{{/parent}}';

$data = Converter::attr(
    $input,
    array('{{', '}}', ' '),
    array('"', '"', '=')
);

var_dump($data); // test!

which will produce this result:

false

Here, you expect the code to produce an array containing the element name, the element attributes and the element content, as described in this article. Like this:

array(
    'element' => 'parent.child',
    'attributes' => null,
    'content' => 'the content'
);

So why did the translation process fail on this pattern?

This is because the opening and closing elements do not have the same name. You will get the expected result if you change the $input value to this (note the difference in the closing element name):

$input = '{{parent.child}}the content{{/parent.child}}';
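The name check can be pictured with a small stand-in. The sketch below is not Mecha’s Converter::attr(); strict_parse() is a hypothetical helper that only assumes the behaviour described above: the closing name must repeat the opening name exactly, otherwise the result is false.

```php
<?php
// Hypothetical stand-in for a strict parser; this is NOT Mecha's Converter::attr().
function strict_parse($input) {
    // `\1` forces the closing name to be identical to the opening name.
    $pattern = '#^\{\{([\w.]+)(?:[ ]+(.*?))?\}\}([\s\S]*?)\{\{/\1\}\}$#';
    if ( ! preg_match($pattern, $input, $m)) {
        return false; // names differ, or the input is not a shortcode at all
    }
    return array(
        'element' => $m[1],
        'attributes' => $m[2] !== '' ? $m[2] : null, // raw attribute string, if any
        'content' => $m[3]
    );
}

var_dump(strict_parse('{{parent.child}}the content{{/parent}}')); // bool(false)

$data = strict_parse('{{parent.child}}the content{{/parent.child}}');
echo $data['element'], ' => ', $data['content'], "\n"; // parent.child => the content
```

The backreference is what makes the parser strict: a closing tag named only parent can never satisfy it when the opening tag is named parent.child.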

Sample of the Actual Shortcode Parser

The input:

$input = 'A very very very long {{parent.child}}text{{/parent}} with a shortcode.';

The parser:

$pattern = '%
  (?<!`)                    # not preceded by a “`”
    \{\{                    # begin open–shortcode tag → `{{`
      parent                # the parent shortcode key phrase → `parent`
      (\.(?:[\w.]+))?       # optional descendant definition → `.child`
    (?:\}\}|[ ]+.*?\}\})    # end open–shortcode tag → `}}` or ` attr="value"}}`
  (?!`)                     # not followed by a “`”
    ([\s\S]*?)              # the content → `text`
  (?<!`)                    # not preceded by a “`”
    \{\{\/                  # begin close–shortcode tag → `{{/`
      parent                # the parent shortcode key phrase → `parent`
      (\1)?                 # optional descendant definition → `.child`
    \}\}                    # end close–shortcode tag → `}}`
  (?!`)                     # not followed by a “`”
%x';

$results = preg_replace_callback($pattern, function($matches) {
    $data = Converter::attr(
        $matches[0],
        array('{{', '}}', ' '),
        array('"', '"', '=')
    );
    extract($data);
    return Cell::unit($element, $content, $attributes);
}, $input);

echo $results; // display the result!

The expected result:

A very very very long <parent.child>text</parent.child> with a shortcode.

The actual result:

A very very very long <></> with a shortcode.

The Converter::attr() method will not parse an element whose closing element name is invalid. The {{parent}} … {{/parent}} and {{parent.child}} … {{/parent.child}} patterns are okay, but the {{parent.child}} … {{/parent}} pattern is not. You can fix this problem by changing the closing element name to match the opening element name before the conversion process starts:

// `$matches[1]` holds the descendant definition of the opening tag → `.child`
if( ! empty($matches[1])) {
    $matches[0] = preg_replace('#\{\{\/parent\}\}$#', '{{/parent' . $matches[1] . '}}', $matches[0]);
}

This way, the $matches[0] value is changed from {{parent.child}}text{{/parent}} to {{parent.child}}text{{/parent.child}}, which is valid.
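To see the rewrite in isolation, here is a minimal sketch using plain strings in place of the $matches array. The normalize_close() name is made up for this example; the regular expression is the same one used in the fix.

```php
<?php
// Illustrative helper (hypothetical name): make a bare `{{/parent}}` close tag
// carry the same descendant suffix as the opening tag.
function normalize_close($whole, $child) {
    return preg_replace('#\{\{/parent\}\}$#', '{{/parent' . $child . '}}', $whole);
}

$fixed = normalize_close('{{parent.child}}text{{/parent}}', '.child');
echo $fixed, "\n"; // {{parent.child}}text{{/parent.child}}

// A close tag that already matches is left untouched.
$same = normalize_close('{{parent.child}}text{{/parent.child}}', '.child');
echo $same, "\n"; // {{parent.child}}text{{/parent.child}}
```

Because the pattern only matches a trailing {{/parent}}, applying it to an already valid shortcode is harmless.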

Conclusion

$input = 'A very very very long {{parent.child}}text{{/parent}} with shortcodes.';

/**
 * Matched with ...
 * ----------------
 *
 * {{parent}} … {{/parent}}
 * {{parent.child}} … {{/parent}}
 * {{parent.child}} … {{/parent.child}}
 * {{parent attr="value"}} … {{/parent}}
 * {{parent.child attr="value"}} … {{/parent}}
 * {{parent.child attr="value"}} … {{/parent.child}}
 *
 */

$pattern = '%
  (?<!`)                    # not preceded by a “`”
    \{\{                    # begin open–shortcode tag → `{{`
      parent                # the parent shortcode key phrase → `parent`
      (\.(?:[\w.]+))?       # optional descendant definition → `.child`
    (?:\}\}|[ ]+.*?\}\})    # end open–shortcode tag → `}}` or ` attr="value"}}`
  (?!`)                     # not followed by a “`”
    ([\s\S]*?)              # the content → `text`
  (?<!`)                    # not preceded by a “`”
    \{\{\/                  # begin close–shortcode tag → `{{/`
      parent                # the parent shortcode key phrase → `parent`
      (\1)?                 # optional descendant definition → `.child`
    \}\}                    # end close–shortcode tag → `}}`
  (?!`)                     # not followed by a “`”
%x';

$results = preg_replace_callback($pattern, function($matches) {
    // var_dump($matches);
    // `$matches[1]` holds the descendant definition of the opening tag → `.child`
    if( ! empty($matches[1])) {
        $matches[0] = preg_replace('#\{\{\/parent\}\}$#', '{{/parent' . $matches[1] . '}}', $matches[0]);
    }
    $data = Converter::attr(
        $matches[0],
        array('{{', '}}', ' '),
        array('"', '"', '=')
    );
    extract($data);
    return Cell::unit($element, $content, $attributes);
}, $input);

echo $results; // display the result!
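The list of matched forms can be checked without Mecha at all. The sketch below is a compact pattern written for this test, in which the closing tag refers back to capture 1 (the descendant definition). It is run against the six forms plus two that should be rejected; the sample strings are made up for the test.

```php
<?php
$pattern = '%
  (?<!`)\{\{parent(\.[\w.]+)?(?:\}\}|[ ]+.*?\}\})(?!`)   # open tag, capture 1 = descendant
  ([\s\S]*?)                                              # capture 2 = content
  (?<!`)\{\{\/parent(\1)?\}\}(?!`)                        # close tag, descendant optional
%x';

$samples = array(
    '{{parent}}text{{/parent}}',
    '{{parent.child}}text{{/parent}}',
    '{{parent.child}}text{{/parent.child}}',
    '{{parent attr="value"}}text{{/parent}}',
    '{{parent.child attr="value"}}text{{/parent}}',
    '{{parent.child attr="value"}}text{{/parent.child}}',
    '`{{parent}}text{{/parent}}`',           // escaped with backticks
    '{{parent.child}}text{{/parent.other}}'  // descendant names differ
);

$matched = array();
foreach ($samples as $sample) {
    $matched[$sample] = preg_match($pattern, $sample) === 1;
    echo str_pad($sample, 54), $matched[$sample] ? 'match' : 'no match', "\n";
}
```

The first six samples should print match and the last two no match.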

Problem 2: Cannot Escape the Shortcode

The current stable release registers the basic shortcode parser with a priority of 10. That parser is also responsible for removing the backtick characters from escaped shortcodes. If your own parser runs after the basic parser has finished, it is too late: any attempt to skip the escaped shortcodes will fail, because the backticks that marked them as escaped are already gone.

Fix this problem by giving your custom filter a priority lower than 10, which is the default priority of the Filter class, so that your function is executed before the basic parser:

Filter::add('shortcode', function($content) {
    …
}, 9);

This bug will be gone in the next version of Mecha. I just need to change the default priority of the basic shortcode parser to a value greater than 10, so in the future you will not need to pass the number 9 (or anything lower) after the callback function.
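The effect of the priority can be reproduced with a tiny filter queue. This sketch does not use Mecha’s Filter class: run_filters() and both callbacks are hypothetical stand-ins, and the only assumption is that lower priority numbers run first.

```php
<?php
// Hypothetical miniature filter queue; this is NOT Mecha's Filter class.
function run_filters(array $filters, $content) {
    // Lower priority numbers run first.
    usort($filters, function($a, $b) {
        return $a[0] - $b[0];
    });
    foreach ($filters as $filter) {
        $content = call_user_func($filter[1], $content);
    }
    return $content;
}

// Stand-in for the basic parser: strips the backticks around escaped shortcodes.
$basic = function($content) {
    return preg_replace('#`(\{\{.*?\}\})`#', '$1', $content);
};

// Custom parser: expands `{{hello}}` unless it is wrapped in backticks.
$custom = function($content) {
    return preg_replace('#(?<!`)\{\{hello\}\}(?!`)#', 'Hello!', $content);
};

$input = 'Write `{{hello}}` to print {{hello}}';

$early = run_filters(array(array(10, $basic), array(9, $custom)), $input);
echo $early, "\n"; // Write {{hello}} to print Hello!

$late = run_filters(array(array(10, $basic), array(11, $custom)), $input);
echo $late, "\n"; // Write Hello! to print Hello!
```

With priority 9 the custom parser still sees the backticks and leaves the escaped shortcode alone. With priority 11 the backticks have already been stripped, so the escaped shortcode is expanded by mistake.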

Update: 2016/06/06

There is an easy way to parse a shortcode tag with namespaces without having to validate the closing element name. Just parse the opening tag only, so that Mecha treats it as a stand-alone shortcode tag without content data, and take the content from the regular expression match instead:

$pattern = '%
  ( # --------------------> # begin capture 1 …
    \{\{                    # begin open–shortcode tag → `{{`
      parent                # the parent shortcode key phrase → `parent`
      (\.(?:[\w.]+))?       # optional descendant definition → `.child`
    (?:\}\}|[ ]+.*?\}\})    # end open–shortcode tag → `}}` or ` attr="value"}}`
  ) # --------------------> # capture 1: the open-shortcode tag
  ([\s\S]*?) # -----------> # capture 3: the content → `text`
  \{\{\/                    # begin close–shortcode tag → `{{/`
    parent                  # the parent shortcode key phrase → `parent`
    (\2)?                   # optional descendant definition → `.child`
  \}\}                      # end close–shortcode tag → `}}`
%x';

$results = preg_replace_callback($pattern, function($matches) {
    $data = Converter::attr(
        $matches[1], // parse only the open-shortcode tag
        array('{{', '}}', ' '),
        array('"', '"', '=')
    );
    extract($data);
    $content = $matches[3]; // because `$content` is `null`
    return Cell::unit($element, $content, $attributes);
}, $input);

echo $results; // display the result!
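The same idea can be tried outside Mecha. In the sketch below, parse_open_tag() is a hypothetical replacement for Converter::attr() that only understands a single opening tag, and the bracketed output format is invented for the example. The capture numbering follows the pattern above: 1 is the opening tag, 2 the descendant definition, 3 the content.

```php
<?php
// Hypothetical helper; this is NOT Mecha's Converter::attr().
// It parses one opening tag such as `{{parent.child id="x"}}`.
function parse_open_tag($tag) {
    if ( ! preg_match('#^\{\{([\w.]+)(?:[ ]+(.*?))?\}\}$#', $tag, $m)) {
        return false;
    }
    $attributes = array();
    // `$m[2]` is only set when the tag carries an attribute string.
    if (isset($m[2]) && preg_match_all('#([\w-]+)="([^"]*)"#', $m[2], $pairs, PREG_SET_ORDER)) {
        foreach ($pairs as $pair) {
            $attributes[$pair[1]] = $pair[2];
        }
    }
    return array('element' => $m[1], 'attributes' => $attributes);
}

$pattern = '%(\{\{parent(\.[\w.]+)?(?:\}\}|[ ]+.*?\}\}))([\s\S]*?)\{\{\/parent(\2)?\}\}%';
$input = 'A {{parent.child id="x"}}text{{/parent}} here.';

$output = preg_replace_callback($pattern, function($matches) {
    $data = parse_open_tag($matches[1]); // the closing tag is never validated
    return '[' . $data['element'] . ' id=' . $data['attributes']['id'] . '] ' . $matches[3];
}, $input);

echo $output, "\n"; // A [parent.child id=x] text here.
```

Since only the opening tag reaches the parser, a closing tag with or without the descendant part is accepted by the regular expression alone.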

Note: In version 1.2.7 you can omit the look-behind and look-ahead parts from the regular expression.
