PHP: A Good Algorithm Is Not Always Better

I was looking at how to improve the performance of Slim’s Http\Cookies::parseHeader() function. As a C developer, I find it “too heavy”, and here is why:

$header = rtrim($header, "\r\n");
$pieces = preg_split('@[;]\s*@', $header);

trim() will heave to create a copy of the original string if it has to modify it. Regular expressions in general and preg_split in particular are a kind of heavy metal. You know,

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

$cookie = explode('=', $cookie, 2);

if (count($cookie) === 2) {
    $key = urldecode($cookie[0]);
    $value = urldecode($cookie[1]);

So you need to split a string into two parts at a given character. explode() is good, but it results in unnecessary memory allocations (the resulting array). The same can be achieved with just:

$p = strpos($cookie, '=');
if (false !== $p) {
    $key   = urldecode(substr($cookie, 0, $p));
    $value = urldecode(substr($cookie, $p + 1));

It does the same but does not allocate an extra array (please note that I am talking as a C developer here, but as you will learn soon, C experience is not always a good fit for PHP).

So, if I code in C, how would I implement the following PHP code?

$header = rtrim($header, "\r\n");
$pieces = preg_split('@[;]\s*@', $header);
$cookies = [];
foreach ($pieces as $cookie) {
    $cookie = explode('=', $cookie, 2);
    if (count($cookie) === 2) {
        $key = urldecode($cookie[0]);
        $value = urldecode($cookie[1]);
        if (!isset($cookies[$key])) {
            $cookies[$key] = $value;
        }
    }
}

return $cookies;

Something like this:

array_init(return_value);

/* 
 * PHP internally uses zend_string* to represent strings.
 * We need const char* pointers (zend_string's are immutable) for the standard C library
 */
const char* header_start = Z_STRVAL_P(zheader);
const char* header_end   = header_start + Z_STRLEN_P(zheader);
const char* s            = header_start;

/*
 *  $header = rtrim($header, "\r\n");
 *
 * We don't want to create unnecessary string copies, therefore we work with pointers.
 * We use a bit different `rtrim()` condition, as spaces and semicolons will be ignored anyway
 * by the regular expression in `preg_split`
 *
 * is_terminator() function is defined as
 *
 * static inline int is_terminator(char c) { return isspace(c) || c == ';'; }
 */
while (header_end > header_start && is_terminator(*(header_end-1))) {
    --header_end;
}

/*
 *  $pieces = preg_split('@[;]\s*@', $header);
 *  foreach ($pieces as $cookie) {
 */
while (s < header_end) {
    /*
     * Every cookie is terminated either by a semicolon or the end of the string
     */
    char* semicolon = memchr(s, ';', header_end - s);
    size_t n = semicolon ? semicolon - s : header_end - s;
    if (n) {
        /*
         * $cookie = explode('=', $cookie, 2);
         * if (count($cookie) === 2) {
         */
        char* equals = memchr(s, '=', n);

        if (equals) {
            /*
             * $key   = urldecode($cookie[0]);
             * $value = urldecode($cookie[1]);
             */
            zend_string* name  = zend_string_init(s, equals - s, 0);
            /* Deal with empty value case as well */
            zend_string* value = (equals + 1 < s + n) ? zend_string_init(equals + 1, s + n - equals - 1, 0) : ZSTR_EMPTY_ALLOC();

            if (memchr(s, '%', n)) {
                ZSTR_LEN(name)  = php_url_decode(ZSTR_VAL(name),  ZSTR_LEN(name));
                ZSTR_LEN(value) = php_url_decode(ZSTR_VAL(value), ZSTR_LEN(value));
            }

            /*
             *  if (!isset($cookies[$key])) {
             *      $cookies[$key] = $value;
             *  }
             */
            if (!zend_hash_exists(Z_ARRVAL_P(return_value), name)) {
                zval z;
                ZVAL_STR(&z, value);
                if (UNEXPECTED(!zend_hash_add_new(Z_ARRVAL_P(return_value), name, &z))) {
                    zend_string_release(value);
                }
            }

            zend_string_release(name);
        }

        s += n;
    }

    ++s; /* Account for ; */
    /* `\s*` part in preg_split's regular expression */
    while (s < header_end && isspace(*s)) {
        ++s;
    }
}

The C code is a bit longer 🙂

But:

It does not use regular expressions at all: memchr() in the outer while loop searches for the semicolon that separates cookies from each other, and isspace() in the bottom of the loop skips whitespace characters following the semicolon. Yes, we implemented [;]\s* regular expression with two fast functions.
Pointer arithmetics is a powerful tool: combined with the standard C library, it allows for manipulating substrings without making any copies (that is, unnecessary memory allocations). We were able to identify all cookies, their names, and values without allocating any memory.
We had to allocate memory for the cookie name and value to insert them into the resulting array, but that’s because of the Zend API requirements.

Looks cool and works blazing fast. But what if we try to implement this in PHP?

$header  = rtrim($header, "\r\n;");
$len     = strlen($header);
$pos     = 0;
$cookies = [];

while ($pos < $len) {
    $x = strpos($header, ";", $pos);
    $n = (false === $x) ? $len - $pos : $x - $pos;
    if ($n) {
        $cookie = substr($header, $pos, $n);
        $pos   += $n;

        $q = strpos($cookie, '%');
        $p = strpos($cookie, '=');
        if (false !== $p) {
            $key   = substr($cookie, 0, $p);
            $value = substr($cookie, $p + 1);
            if (false !== $q) {
                $key   = urldecode($key);
                $value = urldecode($value);
            }

            if (!isset($cookies[$key])) {
                $cookies[$key] = $value;
            }
        }
    }

    ++$pos;
    $pos += strspn($header, " \r\n\t", $pos);
}

return $cookies;

In PHP, we cannot work with substrings as a part of a larger string; we have to copy the substring to a different place and work with it there. This is because of the limitations of the standard PHP library: PHP string functions may have an optional $offset argument (where the substring starts), but they do not have an argument telling them where to stop (they stop at the end of the string). Therefore, we cannot optimize rtrim() by playing with the “end of string” pointer; neither can we avoid a call to substr() to extract the found cookie because strpos() does not have a limit parameter.

Another thing to consider is that function calls in PHP are relatively expensive: in C we could use while (s < header_end && isspace(*s)) ++s; to skip whitespaces, but in PHP this will be inefficient, therefore we have to resort to strspn() function (my benchmarks show that the solution with ctype_space() is slower than with strspn(), and it slows down as the number of spaces to skip increases. However, we have eliminated the need for the regular expression. So, how fast is the code? PHP 7.2.5 (release build, no xdebug), 1,000,000 iterations give something like this:

Cookie	Original Code	Our Version
	0.108	0.055
`a=b`	0.336	0.244
`cookie1=value1`	0.359	0.282
`cookie1=value1; cookie2=value2`	0.635	0.508
`cookie1=value1; cookie2=value2; cookie3=value3`	0.882	0.710
`a=b; c=d;e=fgh;`	0.771	0.667
`fr=XXXXXXXXXX27iPkjD.XXXXICElYAaI5c3wytYjV3ZXXXX.XXXXXX.xx.XXX.X.X.XXXXXX.XXXXXXXX; sb=XXXXXXXX3VoZgX2CWNUKB7CH; wd=XXXXxXXX; datr=XXXXXXXX2r3SisVFaElX18Yy; c_user=1000173XXXXXXXX; xs=XX%3Ay-XXXX2AczqXXX%3AX%3AXXX1143173%3AXX030%3AXX699; pl=n; act=XXXXXX6726382%2F8; presence=XXXXXXtimeFXXXXXX8795EuserFXXXXXX39742XXX9A2EstateFXXXFXXXXXX8795XX8CEchFDp_5f1B1XX974203XXXXXX`	2.758	2.289

OK, so we have outperformed the original implementation, and we can be proud of ourselves. But this would be too easy.

Let us try to apply some improvements to the original implementation, but do not touch preg_split():

$header = rtrim($header, "\r\n;");
$pieces = preg_split('@;[ \t]*@', $header, -1, PREG_SPLIT_NO_EMPTY);
$cookies = [];

foreach ($pieces as $cookie) {
    $q = strpos($cookie, '%');
    $p = strpos($cookie, '=');
    if (false !== $p) {
        $key   = (substr($cookie, 0, $p));
        $value = (substr($cookie, $p + 1));
        if (false !== $q) {
            $key   = urldecode($key);
            $value = urldecode($value);
        }

        if (!isset($cookies[$key])) {
            $cookies[$key] = $value;
        }
    }
}

return $cookies;

We eliminated the call to explode() and called urldecode() only if the cookie contained the % sign.

Now, if we run the benchmark, we will get something like this:

Cookie	Original Code	Our Version	Modified
	0.108	0.055	0.110
a=b	0.336	0.244	0.247
cookie1=value1	0.359	0.282	0.283
cookie1=value1;	0.370	0.293	0.299
cookie1=value1; cookie2=value2	0.635	0.508	0.481
cookie1=value1; cookie2=value2; cookie3=value3	0.882	0.710	0.650
a=b; c=d;e=fgh;	0.771	0.677	0.588
fr=XXXXXXXXXX…	2.758	2.289	2.085

This function is on par with the original when there is only one cookie, and outperforms the “regexless” version when there are more than one.

These results make me think that in PHP, the fewer function calls you have, the faster the algorithm is. In C, you can replace one “heavy” regular expression function with two light functions, and this gives you quite a significant performance boost. In the PHP world, a relatively slow C code seems to be faster than a fast and clever PHP code.

PHP: A Good Algorithm Is Not Always Better

Tagged on: C optimization performance php

Wild Wild Wolf

PHP: A Good Algorithm Is Not Always Better

Leave a Reply Cancel reply