Use PHP 7 "\u{NNNN}" Unicode codepoint escapes in string literals
author Bartosz Dziewoński <matma.rex@gmail.com>
Sat, 7 Oct 2017 00:26:23 +0000 (02:26 +0200)
committer Jforrester <jforrester@wikimedia.org>
Mon, 4 Jun 2018 16:20:13 +0000 (16:20 +0000)
commit 0313128b1038de8f2ee52a181eafdee8c5e430f7
tree 3367d299f6f27af7d2006f6390944aa9edd1ad31
parent 4d5b2473a41208816001c6d20fa5c093e9d7615b

In cases where we're operating on text data (and not binary data),
use e.g. "\u{00A0}" to refer directly to the Unicode character
'NO-BREAK SPACE' instead of "\xc2\xa0" to specify the bytes C2h A0h
(which correspond to the UTF-8 encoding of that character). This
makes it easier to look up those mysterious sequences, as not all
are as recognizable as the no-break space.
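
The equivalence between the two spellings can be checked with a short
sketch. It is written in Ruby rather than PHP (Ruby's double-quoted
strings happen to accept the same "\u{NNNN}" syntax), and the helper
name utf8_bytes is made up for illustration:

```ruby
# Sketch in Ruby, not PHP: show that the code point escape and the raw
# byte escapes describe the same UTF-8 text.
def utf8_bytes(str)
  # Render each byte of the string as two uppercase hex digits.
  str.bytes.map { |b| format('%02X', b) }.join(' ')
end

by_codepoint = "\u{00A0}"                          # U+00A0 NO-BREAK SPACE
by_bytes     = "\xC2\xA0".force_encoding('UTF-8')  # the same character, byte by byte

puts utf8_bytes(by_codepoint)   # C2 A0
puts by_codepoint == by_bytes   # true
```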

PHP does not enforce this, but I think we should write the code points
in uppercase, zero-padded to at least four digits, as the Unicode
standard does.

Note that not all "\xNN" escapes can be automatically replaced:
* We can't use Unicode escapes for binary data that is not UTF-8
  (e.g. in code converting from legacy encodings or testing the
  handling of invalid UTF-8 byte sequences).
* '\xNN' escapes in regular expressions in single-quoted strings
  are actually handled by PCRE and have to be dealt with carefully
  (those regexps should probably be changed to use the /u modifier).
* "\xNN" referring to ASCII characters ("\x7F" and lower) should
  probably be left as-is.
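
Two of the caveats above can be approximated mechanically. The
following is a rough sketch in Ruby, with a hypothetical helper name
(classify_escape); it is not part of this commit. It cannot see whether
a string is a PCRE pattern, and byte data that merely happens to be
valid UTF-8 would still need a human decision:

```ruby
# Hypothetical helper: decide what to do with the bytes that a run of
# "\xNN" escapes produces. Only an approximation of the rules above.
def classify_escape(bytes)
  text = bytes.dup.force_encoding('UTF-8')
  return :leave_as_is unless text.valid_encoding? == false || !text.ascii_only?
  return :binary_data unless text.valid_encoding?  # not UTF-8, keep the "\xNN" form
  :replaceable                                     # non-ASCII, valid UTF-8 text
end

puts classify_escape("\xC2\xA0".b)  # replaceable
puts classify_escape("\xFF\xFE".b)  # binary_data
puts classify_escape("\x7F".b)      # leave_as_is
```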

The replacements in this commit were done semi-manually by piping
the existing "\xNN" escapes through the following terrible Ruby
script I devised:

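  # Evaluate the "\xNN" escapes given as the first argument, then treat the bytes as UTF-8.
  chars = eval('"' + ARGV[0] + '"').force_encoding('utf-8')
  # Print every character as an uppercase, zero-padded "\u{NNNN}" escape.
  puts chars.split('').map{|char|
    '\\u{' + char.ord.to_s(16).upcase.rjust(4, '0') + '}'
  }.join('')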
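
For anyone repeating the conversion, a variant that avoids eval might
look like the sketch below. It is an assumption-laden rewrite, not the
script that was used for this commit: the function name
to_unicode_escapes is hypothetical, it reads only "\xNN" escapes and
silently ignores any other text in the input, and it rejects byte
sequences that are not valid UTF-8.

```ruby
# Hypothetical eval-free variant of the script above.
def to_unicode_escapes(input)
  # Collect the hex digits of every \xNN escape found in the input.
  hex_pairs = input.scan(/\\x(\h{2})/).flatten
  # Turn them into raw bytes and interpret those bytes as UTF-8.
  text = hex_pairs.map { |h| h.to_i(16) }.pack('C*').force_encoding('UTF-8')
  raise ArgumentError, 'not valid UTF-8' unless text.valid_encoding?
  # Emit one uppercase, zero-padded escape per character.
  text.each_char.map { |c| format('\\u{%04X}', c.ord) }.join
end

puts to_unicode_escapes('\xc2\xa0')      # \u{00A0}
puts to_unicode_escapes('\xe2\x80\x93')  # \u{2013}
```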

Change-Id: Idc3dee3a7fb5ebfaef395754d8859b18f1f8769a
57 files changed:
includes/cache/MessageCache.php
includes/collation/IcuCollation.php
includes/installer/Installer.php
includes/json/FormatJson.php
includes/libs/CSSMin.php
includes/specials/formfields/Licenses.php
includes/tidy/RemexCompatFormatter.php
languages/Language.php
languages/data/Names.php
languages/messages/MessagesAf.php
languages/messages/MessagesBe.php
languages/messages/MessagesBe_tarask.php
languages/messages/MessagesBg.php
languages/messages/MessagesBr.php
languages/messages/MessagesCs.php
languages/messages/MessagesEo.php
languages/messages/MessagesEs.php
languages/messages/MessagesEt.php
languages/messages/MessagesFi.php
languages/messages/MessagesFr.php
languages/messages/MessagesFrp.php
languages/messages/MessagesFur.php
languages/messages/MessagesHu.php
languages/messages/MessagesHy.php
languages/messages/MessagesIa.php
languages/messages/MessagesIt.php
languages/messages/MessagesKaa.php
languages/messages/MessagesKk_cyrl.php
languages/messages/MessagesKk_latn.php
languages/messages/MessagesKsh.php
languages/messages/MessagesLa.php
languages/messages/MessagesLbe.php
languages/messages/MessagesLn.php
languages/messages/MessagesLt.php
languages/messages/MessagesLv.php
languages/messages/MessagesMr.php
languages/messages/MessagesNb.php
languages/messages/MessagesNn.php
languages/messages/MessagesOc.php
languages/messages/MessagesPl.php
languages/messages/MessagesPt.php
languages/messages/MessagesPt_br.php
languages/messages/MessagesRu.php
languages/messages/MessagesSe.php
languages/messages/MessagesSk.php
languages/messages/MessagesSv.php
languages/messages/MessagesTa.php
languages/messages/MessagesTe.php
languages/messages/MessagesUdm.php
languages/messages/MessagesUk.php
languages/messages/MessagesUz.php
languages/messages/MessagesWa.php
maintenance/generateSitemap.php
maintenance/language/languages.inc
tests/phpunit/includes/collation/CustomUppercaseCollationTest.php
tests/phpunit/includes/libs/CSSMinTest.php
tests/phpunit/languages/LanguageTest.php