Improve linkprefix regular expressions
author Brad Jorsch <bjorsch@wikimedia.org>
Tue, 27 Aug 2013 19:28:52 +0000 (15:28 -0400)
committer Tim Starling <tstarling@wikimedia.org>
Thu, 24 Oct 2013 09:44:33 +0000 (09:44 +0000)
commit c8006382739518fda279344638b8e263d84d72dc
tree b7948d59b7c064088440eb96b335f2dcc28aa9e4
parent de7af7ac2c651d747221dd322fa9e40956681cb9

The regular expression in the linkprefix message is run against all of
the page text preceding each wikilink. It is expected to capture two
groups: one with everything except the prefix, and one with only the
prefix. On long pages that is a lot of text, so an inefficient regular
expression causes problems.

The current regex is this:

  /^(.*?)([a-zA-Z\\x80-\\xff]+)$/sD

This is not efficient: it scans through the string trying to match
against every run of one or more letters/non-ASCII characters,
backtracking at every one except possibly the last. The only reason this
hasn't been a huge problem everywhere is that only a few languages have
this feature enabled.
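
To make the two-group contract concrete, here is a minimal sketch that runs a Python translation of the current regex. The translation is an assumption made for illustration: a bytes pattern stands in for PCRE's byte-oriented matching, `re.S` for the `s` modifier, and `\Z` for `$` under the `D` modifier. The name `split_old` is hypothetical.

```python
import re

# Hypothetical Python rendering of the current pattern (not MediaWiki code).
# Bytes mode: \x80-\xff then covers every byte of a UTF-8 multibyte character.
OLD_LINK_PREFIX = re.compile(rb'^(.*?)([a-zA-Z\x80-\xff]+)\Z', re.S)

def split_old(text: bytes):
    """Return (everything_but_prefix, prefix), or None if there is no prefix."""
    m = OLD_LINK_PREFIX.match(text)
    return m.groups() if m else None

# The text that precedes a wikilink, with a trailing run of letters:
print(split_old(b"Some text before the pre"))
# The text ends in a space, so there is no prefix to capture:
print(split_old(b"No prefix here. "))
```

The lazy `(.*?)` is what forces the engine to retry at each successive position, which is the cost described above.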

This change replaces it with the following regex:

  /^((?>.*(?<![a-zA-Z\\x80-\\xff])))(.+)$/sD

This is rather more efficient: it will grab the whole string (which is
actually fast even for huge strings), then back off character by
character until it finds one that is neither a letter nor non-ASCII.
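
The back-off strategy can be mimicked by hand, which may help show how little work it involves. The following is only a sketch of the scan, not the regex engine itself; `is_prefix_byte` and `split_link_prefix` are hypothetical names, and the input is treated as bytes on the assumption that the pattern is matched bytewise.

```python
def is_prefix_byte(b: int) -> bool:
    # Same class as [a-zA-Z\x80-\xff], tested on a single byte value.
    return (0x41 <= b <= 0x5A) or (0x61 <= b <= 0x7A) or b >= 0x80

def split_link_prefix(text: bytes):
    """Start at the end of the text and back off over prefix bytes.

    Returns (everything_but_prefix, prefix), or None when the text does
    not end in a prefix byte (the case where the regex fails to match).
    """
    end = len(text)
    while end > 0 and is_prefix_byte(text[end - 1]):
        end -= 1
    if end == len(text):
        return None
    return text[:end], text[end:]

print(split_link_prefix(b"foo bar"))   # splits before the trailing "bar"
print(split_link_prefix(b"foo  "))     # ends in a space: no prefix
```

The loop only ever visits the trailing run of prefix characters, so its cost depends on the length of the prefix rather than the length of the page.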

Note that the above could be simplified somewhat:

  /^((?>.*[^a-zA-Z\\x80-\\xff]|))(.+)$/sD

The performance gain from this simplification is minor, and Gujarati,
Church Slavic, Udmurt, and Ukrainian would still need the look-behind
style for their current implementations.

For Gujarati, we also use another regex trick: a look-behind assertion
in PCRE must be fixed length, so something like (?<!a|bb) won't work.
But that regex fragment is equivalent to (?<!a)(?<!bb) which is allowed,
so we use that instead.
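
The rewrite rests on "not (A or B)" being the same as "(not A) and (not B)". A small sketch of that equivalence follows, using Python's `re` module purely as a stand-in: it also insists on fixed-width look-behind, but its rules are not identical to PCRE's. The helper name `x_allowed` is hypothetical.

```python
import re

# Python's re rejects a look-behind whose alternatives differ in length.
try:
    re.compile(r'(?<!a|bb)x')
    combined_ok = True
except re.error:
    combined_ok = False

# Two fixed-length look-behinds in sequence: both must hold at the same
# position, so "x" matches only when preceded by neither "a" nor "bb".
SPLIT = re.compile(r'(?<!a)(?<!bb)x')

def x_allowed(s: str) -> bool:
    return SPLIT.search(s) is not None

print(combined_ok)
for sample in ("cx", "ax", "bbx", "bx"):
    print(sample, x_allowed(sample))
```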

Bug: 52865
Change-Id: Iaa7eaa446b3f045a9ce970affcb2a889f44bdefd
36 files changed:
languages/messages/MessagesAry.php
languages/messages/MessagesAz.php
languages/messages/MessagesCe.php
languages/messages/MessagesCrh_cyrl.php
languages/messages/MessagesCrh_latn.php
languages/messages/MessagesCu.php
languages/messages/MessagesCv.php
languages/messages/MessagesEn.php
languages/messages/MessagesGa.php
languages/messages/MessagesGu.php
languages/messages/MessagesId.php
languages/messages/MessagesIs.php
languages/messages/MessagesKa.php
languages/messages/MessagesKaa.php
languages/messages/MessagesKiu.php
languages/messages/MessagesKm.php
languages/messages/MessagesLtg.php
languages/messages/MessagesMk.php
languages/messages/MessagesMs.php
languages/messages/MessagesMt.php
languages/messages/MessagesNe.php
languages/messages/MessagesNn.php
languages/messages/MessagesRo.php
languages/messages/MessagesRoa_tara.php
languages/messages/MessagesSc.php
languages/messages/MessagesSi.php
languages/messages/MessagesSr_ec.php
languages/messages/MessagesSr_el.php
languages/messages/MessagesTl.php
languages/messages/MessagesTt_cyrl.php
languages/messages/MessagesTt_latn.php
languages/messages/MessagesUdm.php
languages/messages/MessagesUg_arab.php
languages/messages/MessagesUk.php
languages/messages/MessagesUz.php
languages/messages/MessagesWar.php