Protect legacy URL parameter syntax in link and alt options
authorC. Scott Ananian <cscott@cscott.net>
Mon, 26 Nov 2018 17:53:51 +0000 (12:53 -0500)
committerC. Scott Ananian <cscott@cscott.net>
Tue, 27 Nov 2018 15:12:05 +0000 (10:12 -0500)
HTML doesn't allow certain semicolon-less HTML entities in attribute
values to avoid breaking legacy markup like:
  <a href="http://example.com?foo&param=bar">...</a>
(Note that the & in that URL is not properly entity-escaped as `&amp;`.)

Unlike wikitext, HTML generally allows semicolon-less legacy entities
in text.

Our alt and link option processing shove text through
Sanitizer::stripAllTags, which does entity decoding including these
legacy semicolon-less entities.  Wikitext doesn't allow semicolon-less
entities, so escape & characters where appropriate to protect alt/link
options and avoid breaking URLs.

This was a "regression" in how alt options were handled starting in
ddb4913f53624c8ee0a2a91bd44bf750e378569d when we switched to using
Remex for Sanitizer::stripAllTags -- semicolon-less entities (previously
invalid in wikitext) were now being decoded when stripAllTags was
called on alt text.  This change became a problem when
ad80f0bca27c2b0905b2b137977586bfab80db34 sent link option text through
Sanitizer::stripAllTags (with the new semicolon-less entity decode)
instead of PHP's strip_tags (which, in addition to its other faults,
doesn't do entity decode at all).  This suddenly started decoding
"non-wikitext" entities like `&para` inside URLs, breaking links.
Filed T210437 as a follow-up to consider changing the behavior
of Sanitizer::stripAllTags() globally to prevent it from decoding
semicolon-less entities for all callers.

Bug: T209236
Change-Id: I5925e110e335d83eafa9de935c4e06806322f4a9

includes/parser/Parser.php
tests/parser/parserTests.txt

index 11825fa..353afde 100644 (file)
@@ -5524,6 +5524,40 @@ class Parser {
                # that are later expanded to html- so expand them now and
                # remove the tags
                $tooltip = $this->mStripState->unstripBoth( $tooltip );
+               # Compatibility hack!  In HTML certain entity references not terminated
+               # by a semicolon are decoded (but not if we're in an attribute; that's
+               # how link URLs get away without properly escaping & in queries).
+               # But wikitext has always required semicolon-termination of entities,
+               # so encode & where needed to avoid decode of semicolon-less entities.
+               # See T209236 and
+               # https://www.w3.org/TR/html5/syntax.html#named-character-references
+               # T210437 discusses moving this workaround to Sanitizer::stripAllTags.
+               $tooltip = preg_replace( "/
+                       &                       # 1. entity prefix
+                       (?=                     # 2. followed by:
+                       (?:                     #  a. one of the legacy semicolon-less named entities
+                               A(?:Elig|MP|acute|circ|grave|ring|tilde|uml)|
+                               C(?:OPY|cedil)|E(?:TH|acute|circ|grave|uml)|
+                               GT|I(?:acute|circ|grave|uml)|LT|Ntilde|
+                               O(?:acute|circ|grave|slash|tilde|uml)|QUOT|REG|THORN|
+                               U(?:acute|circ|grave|uml)|Yacute|
+                               a(?:acute|c(?:irc|ute)|elig|grave|mp|ring|tilde|uml)|brvbar|
+                               c(?:cedil|edil|urren)|cent(?!erdot;)|copy(?!sr;)|deg|
+                               divide(?!ontimes;)|e(?:acute|circ|grave|th|uml)|
+                               frac(?:1(?:2|4)|34)|
+                               gt(?!c(?:c|ir)|dot|lPar|quest|r(?:a(?:pprox|rr)|dot|eq(?:less|qless)|less|sim);)|
+                               i(?:acute|circ|excl|grave|quest|uml)|laquo|
+                               lt(?!c(?:c|ir)|dot|hree|imes|larr|quest|r(?:Par|i(?:e|f|));)|
+                               m(?:acr|i(?:cro|ddot))|n(?:bsp|tilde)|
+                               not(?!in(?:E|dot|v(?:a|b|c)|)|ni(?:v(?:a|b|c)|);)|
+                               o(?:acute|circ|grave|rd(?:f|m)|slash|tilde|uml)|
+                               p(?:lusmn|ound)|para(?!llel;)|quot|r(?:aquo|eg)|
+                               s(?:ect|hy|up(?:1|2|3)|zlig)|thorn|times(?!b(?:ar|)|d;)|
+                               u(?:acute|circ|grave|ml|uml)|y(?:acute|en|uml)
+                       )
+                       (?:[^;]|$))     #  b. and not followed by a semicolon
+                       # S = study, for efficiency
+                       /Sx", '&amp;', $tooltip );
                $tooltip = Sanitizer::stripAllTags( $tooltip );
 
                return $tooltip;
index d2fbd8d..bfa7de9 100644 (file)
@@ -15476,6 +15476,37 @@ File:Foobar.jpg|link=Foo<nowiki>''s_bar''</nowiki>s|caption
 </ul>
 !! end
 
+!! test
+HTML entity prefix in link markup (T209236)
+!! wikitext
+[[File:Foobar.jpg|link=https://example.com?foo&params=bar]]
+
+<!-- consistency with gallery extension -->
+<gallery>
+File:Foobar.jpg|link=https://example.com?foo&params=bar
+</gallery>
+!! html/php+tidy
+<p><a href="https://example.com?foo&amp;params=bar" rel="nofollow"><img alt="Foobar.jpg" src="http://example.com/images/3/3a/Foobar.jpg" width="1941" height="220" /></a>
+</p>
+<ul class="gallery mw-gallery-traditional">
+               <li class="gallerybox" style="width: 155px"><div style="width: 155px">
+                       <div class="thumb" style="width: 150px;"><div style="margin:68px auto;"><a href="https://example.com?foo&amp;params=bar"><img alt="Foobar.jpg" src="http://example.com/images/thumb/3/3a/Foobar.jpg/120px-Foobar.jpg" width="120" height="14" srcset="http://example.com/images/thumb/3/3a/Foobar.jpg/180px-Foobar.jpg 1.5x, http://example.com/images/thumb/3/3a/Foobar.jpg/240px-Foobar.jpg 2x" /></a></div></div>
+                       <div class="gallerytext">
+                       </div>
+               </div></li>
+</ul>
+!! html/parsoid
+<p><figure-inline class="mw-default-size" typeof="mw:Image"><a href="https://example.com?foo&amp;params=bar"><img resource="./File:Foobar.jpg" src="//example.com/images/3/3a/Foobar.jpg" data-file-width="1941" data-file-height="220" data-file-type="bitmap" height="220" width="1941"/></a></figure-inline></p>
+
+<!-- consistency with gallery extension -->
+<ul class="gallery mw-gallery-traditional" typeof="mw:Extension/gallery" data-mw='{"name":"gallery","attrs":{},"body":{"extsrc":"\nFile:Foobar.jpg|link=https://example.com?foo&amp;params=bar\n"}}'>
+<li class="gallerybox">
+<div class="thumb"><figure-inline typeof="mw:Image"><a href="https://example.com?foo&amp;params=bar"><img resource="./File:Foobar.jpg" src="//example.com/images/thumb/3/3a/Foobar.jpg/120px-Foobar.jpg" data-file-width="1941" data-file-height="220" data-file-type="bitmap" height="14" width="120"/></a></figure-inline></div>
+<div class="gallerytext"></div>
+</li>
+</ul>
+!! end
+
 !! test
 Image with table with attributes in caption
 !! options