mirror of git://git.sv.gnu.org/emacs.git synced 2026-03-06 14:02:07 -08:00
emacs/admin/unidata/README
Michal Nazarewicz b3b9b258c4 Support casing characters which map into multiple code points (bug#24603)
Implement unconditional special casing rules defined in the Unicode standard.

Among other things, they deal with cases when a single code point is
replaced by multiple ones because a single character does not exist (e.g.
‘ﬁ’ ligature turning into ‘FI’) or is not commonly used (e.g. ß turning
into SS).

* admin/unidata/SpecialCasing.txt: New data file pulled from Unicode
standard distribution.
* admin/unidata/README: Mention SpecialCasing.txt.

* admin/unidata/unidata-gen.el (unidata-gen-table-special-casing,
unidata-gen-table-special-casing--do-load): New functions generating
‘special-uppercase’, ‘special-lowercase’ and ‘special-titlecase’
character Unicode properties built from the SpecialCasing.txt Unicode
data file.
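
As a rough illustration of the data these properties are built from, the
sketch below reads a few lines in the SpecialCasing.txt layout and keeps
only the unconditional rules.  It is written in Python purely for
illustration and is not the Emacs Lisp code in unidata-gen.el; the names
SAMPLE, decode and parse_unconditional are hypothetical.

```python
# Illustrative sketch only; not the generator used by Emacs.
# A few lines in the SpecialCasing.txt layout:
#   <code>; <lower>; <title>; <upper>; (<condition_list>;)? # <comment>
SAMPLE = """\
# Unconditional mappings
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
FB01; FB01; 0046 0069; 0046 0049; # LATIN SMALL LIGATURE FI
# Conditional mapping (skipped by this sketch)
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
"""


def decode(field):
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def parse_unconditional(text):
    """Map each character to its lower/title/upper strings,
    ignoring every rule that carries a condition list."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue  # blank or comment-only line
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 4:
            continue  # not a mapping line
        code, lower, title, upper = fields[:4]
        if any(fields[4:]):
            continue  # conditional rule, e.g. Final_Sigma
        table[chr(int(code, 16))] = {
            "lower": decode(lower),
            "title": decode(title),
            "upper": decode(upper),
        }
    return table


if __name__ == "__main__":
    for ch, forms in parse_unconditional(SAMPLE).items():
        print(f"U+{ord(ch):04X}", forms)
```

Skipping conditional rules mirrors the "unconditional" scope stated in
this commit message; how the real generator stores the result is not
shown here.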

* src/casefiddle.c (struct casing_str_buf): New structure for
representing short strings used to handle one-to-many character
mappings.

(case_character_impl): New function which can handle one-to-many
character mappings.
(case_character, case_single_character): Wrappers for the above
functions.  The former may map one character to multiple (or no)
code points while the latter does what the former used to do (i.e.
handles one-to-one mappings only).
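
The split between the two wrappers can be pictured with the following
Python sketch.  It is not a translation of src/casefiddle.c: the function
names upcase_full and upcase_single are hypothetical, Python's str.upper
stands in for the one-to-many mapping, and the sketch assumes that a
character without a one-to-one result is simply left unchanged.

```python
# Illustrative sketch only; names and fallback behavior are assumptions.

def upcase_full(ch):
    """One-to-many variant: the result is a string that may hold
    more than one code point (Python applies full case mappings)."""
    return ch.upper()


def upcase_single(ch):
    """One-to-one variant: always returns exactly one character.
    Assumption: when the full mapping is not a single character,
    the input is returned unchanged."""
    mapped = upcase_full(ch)
    return mapped if len(mapped) == 1 else ch


if __name__ == "__main__":
    for ch in ("a", "\u00df", "\ufb01"):
        print(repr(ch), repr(upcase_full(ch)), repr(upcase_single(ch)))
```

A caller that must store the result in a single character slot would use
the one-to-one variant, while string and region operations can use the
one-to-many variant and accept that the text length may change.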

(do_casify_natnum, do_casify_unibyte_string,
do_casify_unibyte_region): Use case_single_character.
(do_casify_multibyte_string, do_casify_multibyte_region): Support new
features of case_character.
(do_casify_region): Update to reflect do_casify_multibyte_string
changes.

(casify_word): Handle the situation where the character length of a
word changes, which affects where the end of the word is.

(upcase, capitalize, upcase-initials): Update documentation to mention
limitations when working on characters.

* test/src/casefiddle-tests.el (casefiddle-tests-char-properties):
Add test cases for the newly introduced character properties.
(casefiddle-tests-casing): Update test cases which are now passing.

* test/lisp/char-fold-tests.el (char-fold--ascii-upcase,
char-fold--ascii-downcase): New functions which behave like old ‘upcase’
and ‘downcase’.
(char-fold--test-match-exactly): Use the new functions.  This is needed
because otherwise ‘ﬁ’ and similar characters are turned into their
multi-character representation.

* doc/lispref/strings.texi: Describe issue with casing characters versus
strings.
* doc/lispref/nonascii.texi: Describe the new character properties.
2017-04-06 20:54:58 +02:00


Some files in this directory are taken from the Unicode Character
Database and the Unicode Ideographic Variation Database. These files
are governed by the Unicode Terms of Use contained in the file
copyright.html.

The names, URLs, and dates for these files are as follows.

BidiMirroring.txt
http://www.unicode.org/Public/UNIDATA/BidiMirroring.txt
2013-12-17

IVD_Sequences.txt
http://www.unicode.org/ivd/data/2014-05-16/IVD_Sequences.txt
2014-05-16

UnicodeData.txt
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
2014-03-10

Blocks.txt
http://www.unicode.org/Public/8.0.0/ucd/Blocks.txt
2014-11-10

NormalizationTest.txt
http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt
2016-07-16

SpecialCasing.txt
http://unicode.org/Public/UNIDATA/SpecialCasing.txt
2016-03-03