133 Commits

Author SHA1 Message Date
Victor Stinner
4a21e57fe5 bpo-40268: Remove unused structmember.h includes (GH-19530)
If only offsetof() is needed: include stddef.h instead.

When structmember.h is used, add a comment explaining that
PyMemberDef is used.
2020-04-15 02:35:41 +02:00
Serhiy Storchaka
cd8295ff75 bpo-39943: Add the const qualifier to pointers on non-mutable PyUnicode data. (GH-19345) 2020-04-11 10:48:40 +03:00
Andy Lester
982307b9cc bpo-39943: Remove unused self from find_nfc_index() (GH-18973) 2020-03-17 17:38:12 +01:00
Benjamin Peterson
051b9d08d1 closes bpo-39926: Update Unicode to 13.0.0. (GH-18910) 2020-03-10 20:41:34 -07:00
Dong-hee Na
1b55b65638 bpo-39573: Clean up modules and headers to use Py_IS_TYPE() function (GH-18521) 2020-02-17 11:09:15 +01:00
Victor Stinner
d2ec81a8c9 bpo-39573: Add Py_SET_TYPE() function (GH-18394)
Add Py_SET_TYPE() function to set the type of an object.
2020-02-07 09:17:07 +01:00
Jordon Xu
2ec7010206 bpo-37752: Delete redundant Py_CHARMASK in normalizestring() (GH-15095) 2019-09-10 17:04:08 +01:00
Greg Price
7669cb8b21 bpo-38043: Use bool for boolean flags on is_normalized_quickcheck. (GH-15711) 2019-09-09 02:16:31 -07:00
Greg Price
2f09413947 closes bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558)
The purpose of the `unicodedata.is_normalized` function is to answer
the question `str == unicodedata.normalized(form, str)` more
efficiently than writing just that, by using the "quick check"
optimization described in the Unicode standard in UAX #15.

However, it turns out the code doesn't implement the full algorithm
from the standard, and as a result we often miss the optimization and
end up having to compute the whole normalized string after all.

Implement the standard's algorithm.  This greatly speeds up
`unicodedata.is_normalized` in many cases where our partial variant
of quick-check had been returning MAYBE and the standard algorithm
returns NO.

At a quick test on my desktop, the existing code takes about 4.4 ms/MB
(so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
has to do the slow normalize-and-compare:

  $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
      -- 'unicodedata.is_normalized("NFD", s)'
  50 loops, best of 5: 4.39 msec per loop

With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:

  $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
      -- 'unicodedata.is_normalized("NFD", s)'
  5000000 loops, best of 5: 58.2 nsec per loop

This restores a small optimization that the original version of this
code had for the `unicodedata.normalize` use case.

With this, that case is actually faster than in master!

$ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
    -- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 561 usec per loop

$ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
    -- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 512 usec per loop
2019-09-03 19:45:44 -07:00
Jeroen Demeyer
530f506ac9 bpo-36974: tp_print -> tp_vectorcall_offset and tp_reserved -> tp_as_async (GH-13464)
Automatically replace
tp_print -> tp_vectorcall_offset
tp_compare -> tp_as_async
tp_reserved -> tp_as_async
2019-05-30 19:13:39 -07:00
Inada Naoki
6fec905de5 bpo-36642: make unicodedata const (GH-12855) 2019-04-17 08:40:34 +09:00
Max Bélanger
2810dd7be9 closes bpo-32285: Add unicodedata.is_normalized. (GH-4806) 2018-11-04 15:58:24 -08:00
Wonsup Yoon
d134809cd3 bpo-29456: Fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3 (GH-1958)
Hangul composition check boundaries are wrong for the second character
([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3)
instead of [0x11A7, 0x11C3]).
2018-06-15 20:03:14 +08:00
Benjamin Peterson
7c69c1c0fb update to Unicode 11.0.0 (closes bpo-33778) (GH-7439)
Also, standardize indentation of generated tables.
2018-06-06 20:14:28 -07:00
luzpaz
a5293b4ff2 Fix miscellaneous typos (#4275) 2017-11-05 15:37:50 +02:00
Benjamin Peterson
279a96206f bpo-30736: upgrade to Unicode 10.0 (#2344)
Straightforward. While we're at it, though, strip trailing whitespace from generated tables.
2017-06-22 22:31:08 -07:00
Serhiy Storchaka
f8d7d41507 Issue #28511: Use the "U" format instead of "O!" in PyArg_Parse*. 2016-10-23 15:12:25 +03:00
Christian Heimes
0202c347bc Add an extra byte for null in case we ever get very long unicode names. 2016-09-23 20:21:20 +02:00
Christian Heimes
2f366cab48 Add an extra byte for null in case we ever get very long unicode names. 2016-09-23 20:20:27 +02:00
Benjamin Peterson
6775231597 Unicode 9.0.0
Not completely mechanical since support for East Asian Width changes—emoji
codepoints became Wide—had to be added to unicodedata.
2016-09-14 23:53:47 -07:00
Christian Heimes
8cee10386e Restrict name_length to NAME_MAXLEN in unicodedata_UCD_lookup() 2016-09-14 10:25:54 +02:00
Christian Heimes
7ce201322e Restrict name_length to NAME_MAXLEN in unicodedata_UCD_lookup() 2016-09-14 10:25:46 +02:00
Serhiy Storchaka
2d06e84455 Issue #25923: Added the const qualifier to static constant arrays. 2015-12-25 19:53:18 +02:00
Benjamin Peterson
4801383c29 upgrade to Unicode 8.0.0 2015-06-27 15:45:56 -05:00
Larry Hastings
38337d1e15 Issue #24000: Improved Argument Clinic's mapping of converters to legacy
"format units".  Updated the documentation to match.
2015-05-07 23:30:09 -07:00