Use Remex in Sanitizer::stripAllTags()
author Roan Kattouw <roan.kattouw@gmail.com>
Tue, 14 Nov 2017 22:22:31 +0000 (14:22 -0800)
committer James D. Forrester <jforrester@wikimedia.org>
Thu, 16 Nov 2017 01:31:31 +0000 (17:31 -0800)
commit ddb4913f53624c8ee0a2a91bd44bf750e378569d
tree 3ff0618683b270ab2a57785585c726296db30ec6
parent 7980e38a8405293dfab02c5260bb7d5a368ac7e8

Using a real HTML tokenizer fixes bugs when < or > appear in attribute
values. The old implementation used delimiterReplace(), which didn't
handle this case:

    > print Sanitizer::stripAllTags( '<p data-foo="a&lt;b>c">Hello</p>' );
    c">Hello
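
The failure above can be reproduced outside MediaWiki with a minimal sketch. The function naiveStripTags() below is a hypothetical stand-in, not the real delimiterReplace(); it assumes only that the old approach removed everything between a "<" and the next ">" without looking at quotes.

```php
<?php
// Hypothetical stand-in for delimiter-based stripping.
// This is NOT MediaWiki's delimiterReplace(); it only mimics the
// "cut from '<' to the next '>'" behaviour described above.
function naiveStripTags( string $html ): string {
    // Remove each span that starts at '<' and ends at the first '>',
    // ignoring whether that '>' sits inside a quoted attribute value.
    $stripped = preg_replace( '/<[^>]*>/', '', $html );
    // Decode character references afterwards, as a sanitizer would.
    return html_entity_decode( $stripped, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
}

// The '>' inside data-foo ends the match early, so the tail of the
// attribute value leaks into the output.
print naiveStripTags( '<p data-foo="a&lt;b>c">Hello</p>' ) . "\n";
// prints: c">Hello
```

The match stops at the ">" inside the attribute value, so the rest of that value is treated as text.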

We also can't use PHP's built-in strip_tags() because it mishandles
<?php and <? inside attribute values, dropping the text that follows:

    > print strip_tags('1<span class="<?php">2</span>3');
    1
    > print strip_tags('1<span class="<?">2</span>3');
    1
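
For comparison, here is a rough sketch of why tracking tokenizer state avoids both problems. It is not Remex and not the RemexStripTagHandler added by this commit; quoteAwareStripTags() is a hypothetical helper that only tracks "inside a tag" and "inside a quoted attribute value". It assumes every "<" starts a tag and ignores comments, CDATA and unquoted edge cases, which a real HTML tokenizer handles.

```php
<?php
// Hypothetical helper for illustration only; a real tokenizer such as
// Remex follows the full HTML tokenization rules, this does not.
function quoteAwareStripTags( string $html ): string {
    $out = '';
    $inTag = false;   // true between '<' and the '>' that closes the tag
    $quote = null;    // quote character currently open inside a tag, if any
    $len = strlen( $html );
    for ( $i = 0; $i < $len; $i++ ) {
        $ch = $html[$i];
        if ( !$inTag ) {
            if ( $ch === '<' ) {
                $inTag = true;      // assumption: every '<' opens a tag
            } else {
                $out .= $ch;        // plain text is kept
            }
        } elseif ( $quote !== null ) {
            if ( $ch === $quote ) {
                $quote = null;      // closing quote of the attribute value
            }
            // anything else inside quotes, including '<' and '>', is skipped
        } elseif ( $ch === '"' || $ch === "'" ) {
            $quote = $ch;           // opening quote of an attribute value
        } elseif ( $ch === '>' ) {
            $inTag = false;         // unquoted '>' ends the tag
        }
    }
    return html_entity_decode( $out, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
}

print quoteAwareStripTags( '<p data-foo="a&lt;b>c">Hello</p>' ) . "\n";
// prints: Hello
print quoteAwareStripTags( '1<span class="<?php">2</span>3' ) . "\n";
// prints: 123
```

Because "<" and ">" inside quotes never change state, neither the data-foo case nor the <?php case truncates the output.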

Bug: T179978
Change-Id: I53b98e6c877c00c03ff110914168b398559c9c3e
autoload.php
includes/parser/RemexStripTagHandler.php [new file with mode: 0644]
includes/parser/Sanitizer.php
tests/phpunit/includes/parser/SanitizerTest.php