Obtain cleaner coding and a system wide
performance boost by using the fast, pre-parsed
PyArg_Unpack function instead of PyArg_ParseTuple
function which is driven by a format string.
Armin Rigo's Draconian but effective fix for
SF bug 453523: list.sort crasher
slightly fiddled to catch more cases of list mutation. The dreaded
internal "immutable list type" is gone! OTOH, if you look at a list
*while* it's being sorted now, it will appear to be empty. Better
than a core dump.
This is friendlier for caches.
2. Cut MIN_GALLOP to 7, but added a per-sort min_gallop vrbl that adapts
the "get into galloping mode" threshold higher when galloping isn't
paying, and lower when it is. There's no known case where this hurts.
It's (of course) neutral for /sort, \sort and =sort. It also happens
to be neutral for !sort. It cuts a tiny # of compares in 3sort and +sort.
For *sort, it reduces the # of compares to better than what this used to
do when MIN_GALLOP was hardcoded to 10 (it did about 0.1% more *sort
compares before, but given how close we are to the limit, this is "a
lot"!). %sort used to do about 1.5% more compares, and ~sort about
3.6% more. Here are exact counts:
i *sort 3sort +sort %sort ~sort !sort
15 449235 33019 33016 51328 188720 65534 before
448885 33016 33007 50426 182083 65534 after
0.08% 0.01% 0.03% 1.79% 3.65% 0.00% %ch from after
16 963714 65824 65809 103409 377634 131070
962991 65821 65808 101667 364341 131070
0.08% 0.00% 0.00% 1.71% 3.65% 0.00%
17 2059092 131413 131362 209130 755476 262142
2057533 131410 131361 206193 728871 262142
0.08% 0.00% 0.00% 1.42% 3.65% 0.00%
18 4380687 262440 262460 421998 1511174 524286
4377402 262437 262459 416347 1457945 524286
0.08% 0.00% 0.00% 1.36% 3.65% 0.00%
19 9285709 524581 524634 848590 3022584 1048574
9278734 524580 524633 837947 2916107 1048574
0.08% 0.00% 0.00% 1.27% 3.65% 0.00%
20 19621118 1048960 1048942 1715806 6045418 2097150
19606028 1048958 1048941 1694896 5832445 2097150
0.08% 0.00% 0.00% 1.23% 3.65% 0.00%
3. Added some key asserts I overlooked before.
4. Updated the doc file.
directly when no comparison function is specified. This saves a layer
of function call on every compare then. Measured speedups:
i 2**i *sort \sort /sort 3sort +sort %sort ~sort =sort !sort
15 32768 12.5% 0.0% 0.0% 100.0% 0.0% 50.0% 100.0% 100.0% -50.0%
16 65536 8.7% 0.0% 0.0% 0.0% 0.0% 0.0% 12.5% 0.0% 0.0%
17 131072 8.0% 25.0% 0.0% 25.0% 0.0% 14.3% 5.9% 0.0% 0.0%
18 262144 6.3% -10.0% 12.5% 11.1% 0.0% 6.3% 5.6% 12.5% 0.0%
19 524288 5.3% 5.9% 0.0% 5.6% 0.0% 5.9% 5.4% 0.0% 2.9%
20 1048576 5.3% 2.9% 2.9% 5.1% 2.8% 1.3% 5.9% 2.9% 4.2%
The best indicators are those that take significant time (larger i), and
where sort doesn't do very few compares (so *sort and ~sort benefit most
reliably). The large numbers are due to roundoff noise combined with
platform variability; e.g., the 14.3% speedup for %sort at i=17 reflects
a printed elapsed time of 0.18 seconds falling to 0.17, but a change in
the last digit isn't really meaningful (indeed, if it really took 0.175
seconds, one electron having a lazy nanosecond could shift it to either
value <wink>). Similarly the 25% at 3sort i=17 was a meaningless change
from 0.05 to 0.04. However, almost all the "meaningless changes" were
in the same direction, which is good. The before-and-after times for
*sort are clearest:
before after
0.18 0.16
0.25 0.23
0.54 0.50
1.18 1.11
2.57 2.44
5.58 5.30
listsort. If the former calls itself recursively, they're a waste of
time, since it's called on a random permutation of a random subset of
elements. OTOH, for exactly the same reason, they're an immeasurably
small waste of time (the odds of finding exploitable order in a random
permutation are ~= 0, so the special-case loops looking for order give
up quickly). The point is more for conceptual clarity.
Also changed some "assert comments" into real asserts; when this code
was first written, Python.h didn't supply assert.h.
introduced, list.sort() was rewritten to use only the "< or not <?"
distinction. After rich comparisons were introduced, docompare() was
fiddled to translate a Py_LT Boolean result into the old "-1 for <,
0 for ==, 1 for >" flavor of outcome, and the sorting code was left
alone. This left things more obscure than they should be, and turns
out it also cost measurable cycles.
So: The old CMPERROR novelty is gone. docompare() is renamed to islt(),
and now has the same return conditinos as PyObject_RichCompareBool. The
SETK macro is renamed to ISLT, and is even weirder than before (don't
complain unless you want to maintain the sort code <wink>).
Overall, this yields a 1-2% speedup in the usual (no explicit function
passed to list.sort()) case when sorting arrays of floats (as sortperf.py
does). The boost is higher for arrays of ints.
The staticforward define was needed to support certain broken C
compilers (notably SCO ODT 3.0, perhaps early AIX as well) botched the
static keyword when it was used with a forward declaration of a static
initialized structure. Standard C allows the forward declaration with
static, and we've decided to stop catering to broken C compilers. (In
fact, we expect that the compilers are all fixed eight years later.)
I'm leaving staticforward and statichere defined in object.h as
static. This is only for backwards compatibility with C extensions
that might still use it.
XXX I haven't updated the documentation.