* bpo-36520: reset the encoded word offset when starting a new
line during an email header folding operation
* 📜🤖 Added by blurb_it.
* bpo-36520: add an additional test case, and provide descriptive
comments for the test_folding_with_utf8_encoding_* tests
* bpo-36520: fix whitespace issue
* bpo-36520: changes per reviewer request -- remove extraneous
backslashes; add whitespace between terminating quotes and
line-continuation backslashes; use "bpo-" instead of
"issue #" in comments
* bpo-21315: Fix parsing of encoded words with missing leading ws.
Because of missing leading whitespace, encoded word would get parsed as
unstructured token. This patch fixes that by looking for encoded words when
splitting tokens with whitespace.
Missing trailing whitespace around encoded word now register a defect
instead.
Original patch suggestion by David R. Murray on bpo-21315.
* bpo-35805: Add parser for Message-ID header.
This parser is based on the definition of Identification Fields from RFC 5322
Sec 3.6.4.
This should also prevent folding of Message-ID header using RFC 2047 encoded
words and hence fix bpo-35805.
* Prevent folding of non-ascii message-id headers.
* Add fold method to MsgID token to prevent folding.
Two kind of mistakes:
1. Missed space. After concatenating there is no space between words.
2. Missed comma. Causes unintentional concatenating in a list of strings.
The original algorithm tried to delegate the folding to the tokens so
that those tokens whose folding rules differed could specify the
differences. However, this resulted in a lot of duplicated code because
most of the rules were the same.
The new algorithm moves all folding logic into a set of functions
external to the token classes, but puts the information about which
tokens can be folded in which ways on the tokens...with the exception of
mime-parameters, which are a special case (which was not even
implemented in the old folder).
This algorithm can still probably be improved and hopefully simplified
somewhat.
Note that some of the test expectations are changed. I believe the
changes are toward more desirable and consistent behavior: in general
when (re) folding a line the canonical version of the tokens is
generated, rather than preserving errors or extra whitespace.
Leading whitespace was incorrectly dropped during folding of certain lines in the _header_value_parser's folding algorithm. This makes the whitespace handling code consistent.
This could use more edge case tests, but the basic functionality is tested.
(Note that this changeset does not add tailored support for the RFC 6532
message/global MIME type, but the email package generic facilities will handle
it.)
Reviewed by Maciej Szulik.
This mimics get_param's error handling for the most part. It is slightly
better in some regards as get_param can produce some really weird results for
duplicate *0* parts. It departs from get_param slightly in that if we have a
mix of non-extended and extended pieces for the same parameter name, the new
parser assumes they were all supposed to be extended and concatenates all the
values, whereas get_param always picks the non-extended parameter value. All
of this error recovery is pretty much arbitrary decisions...
This applies only to the new parser. The old parser decodes encoded words
inside quoted strings already, although it gets the whitespace wrong
when it does so.
This version of the patch only handles the most common case (a single encoded
word surrounded by quotes), but I haven't seen any other variations of this in
the wild yet, so its good enough for now.