PHP: A Good Algorithm Is Not Always Better

I was looking how to improve performance of Slim’s Http\Cookies::parseHeader() function. As a C developer, I find it “too heavy”, and here is why:

$header = rtrim($header, "\r\n");
$pieces = preg_split('@[;]\s*@', $header);

trim() will heave to create a copy of the original string if it has to modify it, regular expressions in general and preg_split in particular are a kind of heavy metal. You know,

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

$cookie = explode('=', $cookie, 2);

if (count($cookie) === 2) {
    $key = urldecode($cookie[0]);
    $value = urldecode($cookie[1]);

So you need to split a string into two by a character. explode() is good, but it results in unnecessary memory allocations (the resulting array). The same can be achieved with just

$p = strpos($cookie, '=');
if (false !== $p) {
    $key   = urldecode(substr($cookie, 0, $p));
    $value = urldecode(substr($cookie, $p + 1));

which does the same but does not allocate an extra array (please note that I am talking as a C developer here, but as you will learn soon, C experience is not always good with PHP).

So, if I code in C, how would I implement the following PHP code?

$header = rtrim($header, "\r\n");
$pieces = preg_split('@[;]\s*@', $header);
$cookies = [];
foreach ($pieces as $cookie) {
    $cookie = explode('=', $cookie, 2);
    if (count($cookie) === 2) {
        $key = urldecode($cookie[0]);
        $value = urldecode($cookie[1]);
        if (!isset($cookies[$key])) {
            $cookies[$key] = $value;
        }
    }
}

return $cookies;

Something like this:

array_init(return_value);

/* 
 * PHP internally uses zend_string* to represent strings.
 * We need const char* pointers (zend_string's are immutable) for the standard C library
 */
const char* header_start = Z_STRVAL_P(zheader);
const char* header_end   = header_start + Z_STRLEN_P(zheader);
const char* s            = header_start;

/*
 *  $header = rtrim($header, "\r\n");
 *
 * We don't want to create unnecessary string copies, therefore we work with pointers.
 * We use a bit different <code>{{EJS3}}</code> condition, as spaces and semicolons will be ignored anyway
 * by the regular expression in <code>{{EJS4}}</code>
 *
 * is_terminator() function is defined as
 *
 * static inline int is_terminator(char c) { return isspace(c) || c == ';'; }
 */
while (header_end > header_start && is_terminator(*(header_end-1))) {
    --header_end;
}

/*
 *  $pieces = preg_split('@[;]\s*@', $header);
 *  foreach ($pieces as $cookie) {
 */
while (s < header_end) {
    /*
     * Every cookie is terminated either by a semicolon or the end of the string
     */
    char* semicolon = memchr(s, ';', header_end - s);
    size_t n = semicolon ? semicolon - s : header_end - s;
    if (n) {
        /*
         * $cookie = explode('=', $cookie, 2);
         * if (count($cookie) === 2) {
         */
        char* equals = memchr(s, '=', n);

        if (equals) {
            /*
             * $key   = urldecode($cookie[0]);
             * $value = urldecode($cookie[1]);
             */
            zend_string* name  = zend_string_init(s, equals - s, 0);
            /* Deal with empty value case as well */
            zend_string* value = (equals + 1 < s + n) ? zend_string_init(equals + 1, s + n - equals - 1, 0) : ZSTR_EMPTY_ALLOC();

            if (memchr(s, '%', n)) {
                ZSTR_LEN(name)  = php_url_decode(ZSTR_VAL(name),  ZSTR_LEN(name));
                ZSTR_LEN(value) = php_url_decode(ZSTR_VAL(value), ZSTR_LEN(value));
            }

            /*
             *  if (!isset($cookies[$key])) {
             *      $cookies[$key] = $value;
             *  }
             */
            if (!zend_hash_exists(Z_ARRVAL_P(return_value), name)) {
                zval z;
                ZVAL_STR(&z, value);
                if (UNEXPECTED(!zend_hash_add_new(Z_ARRVAL_P(return_value), name, &z))) {
                    zend_string_release(value);
                }
            }

            zend_string_release(name);
        }

        s += n;
    }

    ++s; /* Account for ; */
    /* <code>{{EJS5}}</code> part in preg_split's regular expression */
    while (s < header_end && isspace(*s)) {
        ++s;
    }
}

The C code is a bit longer 🙂

But:

It does not use regular expressions at all: memchr() in the outer while loop searches for the semicolon which separates cookies from each other, and isspace() in the bottom of the loop skips whitespace characters following the semicolon: yes, we implemented [;]\s* regular expression with two fast functions;
Pointer arithmetics is a powerful tool: combined with the standard C library, it allows for manipulating with substrings without making any copies (that is, unnecessary memory allocations). We were able to identify all cookies, their names and values without making any memory allocation;
We had to do memory allocations for cookie name and value in order to insert them into the resulting array, but that’s because of the requirements of the Zend API.

Looks cool and works blazing fast. But what if we try to implement this in PHP?

$header  = rtrim($header, "\r\n;");
$len     = strlen($header);
$pos     = 0;
$cookies = [];

while ($pos < $len) {
    $x = strpos($header, ";", $pos);
    $n = (false === $x) ? $len - $pos : $x - $pos;
    if ($n) {
        $cookie = substr($header, $pos, $n);
        $pos   += $n;

        $q = strpos($cookie, '%');
        $p = strpos($cookie, '=');
        if (false !== $p) {
            $key   = substr($cookie, 0, $p);
            $value = substr($cookie, $p + 1);
            if (false !== $q) {
                $key   = urldecode($key);
                $value = urldecode($value);
            }

            if (!isset($cookies[$key])) {
                $cookies[$key] = $value;
            }
        }
    }

    ++$pos;
    $pos += strspn($header, " \r\n\t", $pos);
}

return $cookies;

In PHP, we cannot work with substrings as a part of a larger strings, we have to copy the substring to a different place and work with it there. This is because of the limitations of the standard PHP library: PHP string function may have an optional $offset argument (where the substring starts) but they do not have an argument telling them where to stop (they stop at the end of the string). Therefore we cannot optimize rtrim() by playing with the “end of string” pointer; neither can we avoid a call to substr() to extract the found cookie because strpos() does not have a limit parameter.

Another thing to consider is that function calls in PHP are relatively expensive: in C we could use while (s < header_end && isspace(*s)) ++s; to skip whitespaces, but in PHP this will be inefficient, therefore we have to resort to strspn() function (my benchmarks show that the solution with ctype_space() is slower than with strspn(), and it becomes more and more slower depending on the number of spaces to skip).

However, we have eliminated the need in the regular expression. So, how faster the code is?

PHP 7.2.5 (release build, no xdebug), 1,000,000 iterations give something like this:

Cookie	Original	Our version
	0.108	0.055
a=b	0.336	0.244
cookie1=value1	0.359	0.282
cookie1=value1;	0.370	0.293
cookie1=value1; cookie2=value2	0.635	0.508
cookie1=value1; cookie2=value2; cookie3=value3	0.882	0.710
a=b; c=d;e=fgh;	0.771	0.677
fr=XXXXXXXXXX27iPkjD.XXXXICElYAaI5c3wytYjV3ZXXXX.XXXXXX.xx.XXX.X.X.XXXXXX.XXXXXXXX; sb=XXXXXXXX3VoZgX2CWNUKB7CH; wd=XXXXxXXX; datr=XXXXXXXX2r3SisVFaElX18Yy; c_user=1000173XXXXXXXX; xs=XX%3Ay-XXXX2AczqXXX%3AX%3AXXX1143173%3AXX030%3AXX699; pl=n; act=XXXXXX6726382%2F8; presence=XXXXXXtimeFXXXXXX8795EuserFXXXXXX39742XXX9A2EstateFXXXFXXXXXX8795XX8CEchFDp_5f1B1XX974203XXXXXX	2.758	2.289

OK, so we have outperformed the original implementation, and we can be proud of ourselves. But this would be too easy.

Let us try to apply some improvements to the original implementation but do not touch preg_split():

$header = rtrim($header, "\r\n;");
$pieces = preg_split('@;[ \t]*@', $header, -1, PREG_SPLIT_NO_EMPTY);
$cookies = [];

foreach ($pieces as $cookie) {
    $q = strpos($cookie, '%');
    $p = strpos($cookie, '=');
    if (false !== $p) {
        $key   = (substr($cookie, 0, $p));
        $value = (substr($cookie, $p + 1));
        if (false !== $q) {
            $key   = urldecode($key);
            $value = urldecode($value);
        }

        if (!isset($cookies[$key])) {
            $cookies[$key] = $value;
        }
    }
}

return $cookies;

What we did is eliminated the call to explode() and call urldecode() only if the cookie contains % sign.

Now, if we run the benchmark, we will get something like this:

Cookie	Original	Our version	Modified
	0.108	0.055	0.110
a=b	0.336	0.244	0.247
cookie1=value1	0.359	0.282	0.283
cookie1=value1;	0.370	0.293	0.299
cookie1=value1; cookie2=value2	0.635	0.508	0.481
cookie1=value1; cookie2=value2; cookie3=value3	0.882	0.710	0.650
a=b; c=d;e=fgh;	0.771	0.677	0.588
fr=XXXXXXXXXX…	2.758	2.289	2.085

This function is on par with the original when there is only one cookie, and outperforms the “regexless” version when there is more than one cookie.

These results make me think that in PHP, the less function calls you have, the faster the algorithm is. In C, you can replace one “heavy” regular expression function with two light functions, and this gives you quite significant performance boost. In PHP world, a relatively slow C code seems to be faster than a fast and clever PHP code.

PHP: A Good Algorithm Is Not Always Better

Tagged on: C optimization performance php

Wild Wild Wolf

PHP: A Good Algorithm Is Not Always Better

Leave a Reply Cancel reply