xml.c: do not convert UTF-16 surrogate pairs to an invalid sequence - sfeed

commit 12b279581fbbcde2b36eb4b78d70a1c52d4a209a
parent 57d341d9826ff742b5f69cab8228d0d06c3997a3
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Fri, 22 Jan 2021 01:11:19 +0100

xml.c: do not convert UTF-16 surrogate pairs to an invalid sequence

Simple way to reproduce:

	printf '<item><title>&#xdc00;</title></item>' | sfeed | iconv -t utf-8

Result:

	iconv: (stdin):1:8: cannot convert

Output of sfeed itself:

	printf '<item><title>&#xdc00;</title></item>' | sfeed

Before:

00000000  09 ed b0 80 09 09 09 09  09 09 09 0a              |............|
0000000c

After:

00000000  09 26 23 78 64 63 30 30  3b 09 09 09 09 09 09 09  |.&#xdc00;.......|
00000010  0a                                                |.|
00000011

The entity is output as a literal string. This makes it easier to see what is
wrong and to debug the feed, and it is consistent with the current behaviour
for invalid named entities (&bla;). An alternative could be to output the
Unicode replacement character (code point 0xfffd).
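
The "Before" bytes (ed b0 80) are what the plain 3-byte UTF-8 bit layout gives
for 0xdc00 when nothing checks for surrogates. A minimal sketch of that
arithmetic follows; naive3byte() is a hypothetical helper for illustration and
is not sfeed's codepointtoutf8():

```c
#include <stddef.h>

/* Hypothetical helper, not part of sfeed: encode a code point in the range
 * 0x800..0xffff using the plain 3-byte UTF-8 layout (1110xxxx 10xxxxxx
 * 10xxxxxx). It deliberately does NOT reject surrogates, to show how the
 * ill-formed sequence arises. Returns the number of bytes written. */
static size_t
naive3byte(long cp, unsigned char *buf)
{
	buf[0] = 0xe0 | ((cp >> 12) & 0x0f); /* top 4 bits */
	buf[1] = 0x80 | ((cp >> 6) & 0x3f);  /* middle 6 bits */
	buf[2] = 0x80 | (cp & 0x3f);         /* low 6 bits */
	return 3;
}
```

With this layout 0xdc00 becomes ed b0 80, which UTF-8 decoders such as iconv
reject, while the replacement character 0xfffd becomes the valid sequence
ef bf bd.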

Reference: https://unicode.org/faq/utf_bom.html, specifically:

"Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?"
"A: A different issue arises if an unpaired surrogate is encountered when
converting ill-formed UTF-16 data. By representing such an unpaired surrogate
on its own as a 3-byte sequence, the resulting UTF-8 data stream would become
ill-formed. While it faithfully reflects the nature of the input, Unicode
conformance requires that encoding form conversion always results in a valid
data stream. Therefore a converter must treat this as an error. [AF]"
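
The rule can be sketched as a standalone check that treats the surrogate block
(0xd800 through 0xdfff) as an error, like any other invalid code point. This
is only an illustration: parsenumericentity() is a hypothetical name, and it
assumes the input is the text following "&#" (for example "xdc00;"):

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper, not part of sfeed: parse the text after "&#",
 * e.g. "xdc00;" (hex) or "9731;" (decimal). Returns the code point, or -1
 * if the entity is malformed, out of range or a UTF-16 surrogate. */
static long
parsenumericentity(const char *e)
{
	char *end;
	long l;

	errno = 0;
	/* hex (16) or decimal (10) */
	if (*e == 'x')
		l = strtol(++e, &end, 16);
	else
		l = strtol(e, &end, 10);
	/* invalid value, not a well-formed entity, or invalid code point */
	if (errno || e == end || *end != ';' || l < 0 || l > 0x10ffff ||
	    (l >= 0xd800 && l <= 0xdfff))
		return -1;
	return l;
}
```

A caller that gets -1 can then fall back to printing the entity text
literally, which matches the "After" output shown above.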

Diffstat:
M xml.c  | 3 ++-

1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/xml.c b/xml.c
@@ -252,7 +252,8 @@ numericentitytostr(const char *e, char *buf, size_t bufsiz)
 	else
 		l = strtol(e, &end, 10);
 	/* invalid value or not a well-formed entity or invalid code point */
-	if (errno || e == end || *end != ';' || l < 0 || l > 0x10ffff)
+	if (errno || e == end || *end != ';' || l < 0 || l > 0x10ffff ||
+	    (l >= 0xd800 && l <= 0xdfff))
 		return -1;
 	len = codepointtoutf8(l, buf);
 	buf[len] = '\0';

	sfeed simple feed reader - forked from git.codemadness.org/sfeed
	git clone git://src.gearsix.net/sfeed