sfeed

simple feed reader - forked from git.codemadness.org/sfeed
git clone git://src.gearsix.net/sfeed

commit 71815afcf0577ff1fb5e65fcdb3b8041c72c7623
parent 928a3ca62f748e7301fd2fa978b64c9a36f5d2d1
Author: gearsix@tuta.io <gearsix@tuta.io>
Date:   Thu, 15 Apr 2021 16:31:15 +0100

squash merge from upstream

Squashed commit of the following:

commit 4e96b1f3f90c86589006070fe739ee9a6c5d06de
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sat Mar 13 13:22:10 2021 +0100

    bump version to 0.9.22

commit 99a8e4deeb81480f2c0c2798256fe9c2df006ff1
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Fri Mar 12 13:11:17 2021 +0100

    sfeed_web.1, sfeed_xmlenc.1: remove unneeded mdoc escape sequence

commit 317d08eee3494234e1e5f84fca0240abfdc8e2f5
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Wed Mar 3 18:12:34 2021 +0100

    sfeed_update: return instead of exit in main() on success

    This is useful so the script can be included, call main() and then run
    additional post-main functionality.

commit ceefac3e911b84c8aa8d215a3f7ec651aaeb4800
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Tue Mar 2 13:13:19 2021 +0100

    README: workaround empty fields with *BSD xargs -0

    Work around it by setting the empty "middle" fields to some value. The last
    field can be empty.

    Some feeds were incorrectly using the wrong base URL if the `baseurl` field was
    empty but the encoding field was set. So it incorrectly used the encoding field
    instead.

    Only now noticed that some feeds were failing because the baseURL has been
    validated since commit f305b032bc19b4e81c0dd6c0398370028ea910ca and an
    invalid one returns a non-zero exit status.

    This doesn't happen with GNU xargs, busybox or toybox xargs.
    Affected (at least): OpenBSD, NetBSD, FreeBSD and DragonFlyBSD xargs, which
    share similar code.

    Simple way to reproduce the difference:

    	printf 'a\0\0c\0' | xargs -0 echo

    Prints "a c" on *BSD.
    Prints "a  c" on GNU xargs (and some other implementations).

commit f0e0326248e3a8c148eea64f77a5a1f84cf8d80b
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Mon Mar 1 22:27:11 2021 +0100

    sfeed_update: fix baseurl substitution

    Follow-up from a rushed commit:

    commit 58555779d123be68c0acf9ea898931d656ec6d63
    Author: Hiltjo Posthuma <hiltjo@codemadness.org>
    Date:   Sun Feb 28 13:33:21 2021 +0100

        sfeed_update: simplify, use feedurl directly

        This also makes it possible to use non-authoritative URLs as a baseurl,
        like "magnet:" URLs.

commit 16b7cc14e0681531d0d0e7c73e7c9ee903d0b1a9
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Mon Mar 1 18:50:43 2021 +0100

    util.c: uri_makeabs: check initial base URI field, not dest `a` (style)

    No functional difference because the base URI host is copied beforehand.

commit fef85e3c39946b50b337cdf57a4a3d530eb85ac2
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sun Feb 28 13:37:21 2021 +0100

    sfeed.1: reference sfeed_update and sfeedrc

    The shellscript is optional, but reference it in the documentation.

commit 58555779d123be68c0acf9ea898931d656ec6d63
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sun Feb 28 13:33:21 2021 +0100

    sfeed_update: simplify, use feedurl directly

    This also makes it possible to use non-authoritative URLs as a baseurl, like
    "magnet:" URLs.

commit f305b032bc19b4e81c0dd6c0398370028ea910ca
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Tue Feb 16 18:38:56 2021 +0100

    util: improve/refactor URI parsing and formatting

    Removed/rewrote the functions absuri(), parseuri() and encodeuri() (used for
    percent-encoding).

    The functionality is now split into separate functions, each with a single
    purpose:

    - uri_format: format struct uri into a string.
    - uri_hasscheme: quick check if a string is absolute or not.
    - uri_makeabs: make a URI absolute using a base URI and the original URI.
    - uri_parse: parse a string into a struct uri.

    The following URLs are better parsed:

    - URLs with extra "/"s prepended in the path are kept as-is, and no "/" is
      added for empty paths.
    - URLs like "http://codemadness.org" are not changed to
      "http://codemadness.org/" anymore (paths are kept as-is, unless they are
      non-empty and do not start with "/").
    - Paths are not percent-encoded anymore.
    - URLs with userinfo field (username, password) are parsed.
      like: ftp://user:password@[2001:db8::7]:2121/rfc/rfc1808.txt
    - Non-authoritative URLs like mailto:some@email.org, magnet URIs and ISBN
      URIs/URNs, like urn:isbn:0-395-36341-1, are allowed and parsed correctly.
    - Both local (file:///) and non-local (file://) file URIs are supported.
    - A base URL port is now only used when the relative URL has neither a host
      nor a port set; this follows RFC 3986 section 5.2.2 more closely.
    - Numeric port parsing: the port is parsed as a signed long and rejected when
      <= 0; an empty port is allowed.
    - Parsing URIs containing query, fragment, but no path separator (/) will now
      parse the component properly.

    For sfeed:

    - Parse the baseURI only once (no need to do it every time for making absolute
      URIs).
    - If a link/enclosure is absolute already or if there is no base URL specified
      then just print the link directly. There have also been other small performance
      improvements related to handling URIs.

    References:
    - https://tools.ietf.org/html/rfc3986
      - Section "5.2.2. Transform References" have also been helpful.

commit 30476d22307aaa38170da5241a5d5e9864c4e76d
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Thu Feb 18 14:14:49 2021 +0100

    README: combine bandwidth saving options into one section

    Combine E-Tags and If-Modified-Since into one section. Also mention the curl
    --compressed option (typically gzip decompression).

    Note that E-Tags were broken in curl <7.73 due to a bug with "weak" e-tags.
    https://github.com/curl/curl/issues/5610

    From a question/feedback by e-mail from Hadrien Lacour, thanks.

commit afc604c4c9c91febde2a313f731ff59974a533d5
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Fri Feb 5 01:25:53 2021 +0100

    sfeed_update: $SFEED_UPDATE_INCLUDE: be a bit more precise/pedantic

commit fee20df52a9091ee3a3efeaf3ed63b6940fb5be5
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Thu Feb 4 01:19:31 2021 +0100

    sfeed.c: fix time parsing regression with non-standard date format

    The commit that introduced the regression was:

    commit 33c50db302957bca2a850ac8d0b960d05ee0520e
    Author: Hiltjo Posthuma <hiltjo@codemadness.org>
    Date:   Mon Oct 12 18:55:35 2020 +0200

        simplify time parsing

    Noticed on a RSS feed with the following date:

    	<pubDate>2021-02-03 05:13:03</pubDate>

    This format is non-standard, but sfeed should support it.
    A standard format would be (for Atom): 2021-02-03T05:13:03Z

    Partially revert it.

commit 76274ca7980b2f71fc07b2dc59123e3f66b8fd9d
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Thu Jan 28 17:05:32 2021 +0100

    README: fix xargs -P example when there are no feeds

    Kind of a non-issue, but if there's an sfeedrc with no feeds then xargs will
    still be executed and give an error. The xargs -r option (GNU extension)
    fixes this:

    From the OpenBSD xargs(1) man page:
    "-r  Do not run the command if there are no arguments.  Normally the
    command is executed at least once even if there are no arguments."

    Reproducible with the sfeedrc:

    	feeds() {
    		true
    	}

commit 438634fec934102839e05b5c587d7eb3fdf17571
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Wed Jan 27 18:04:32 2021 +0100

    sfeed_update: $SFEED_UPDATE_INCLUDE: be a bit more precise/pedantic

commit 645ef7420056796e6d2716bf920b8704451912ac
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Wed Jan 27 13:07:45 2021 +0100

    typofixes

commit 2f8a83288d91ea0abc2e4ebd6754513ee3ad37ec
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Mon Jan 25 19:36:41 2021 +0100

    README: add an example script to reuse the sfeed_update code

    This code uses the non-portable xargs -P option to more efficiently process
    feeds in parallel.

commit 4d9f922c8396bada73fb0b1e318c8b947f0f606b
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Mon Jan 25 19:35:02 2021 +0100

    sfeed_update: allow to reuse the code more easily as an included script

    This adds a main() function.  When the environment variable
    $SFEED_UPDATE_INCLUDE is set then it will not execute the main handler.  The
    other functions are included and can be reused.  This is also useful for
    unit-testing.

commit ddda48dac8e373f9bc0884ffeb84605925d988e5
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Mon Jan 25 19:33:01 2021 +0100

    sfeed_update: separate code of parallel execution and feed() into a _feed() handler

    This is useful to be able to reuse the code (together with using sfeed_update
    as an included script, coming in the next commit).

commit 1b3f13b1ba5a3e40d4c9266f16091ad347aec333
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Mon Jan 25 19:30:21 2021 +0100

    sfeed_update: shuffle code getting the path of the feedurl to make the basesiteurl

    Move it closer before it is used.

commit f1883579dfcb256b514defcece7bf88fe65ee659
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Mon Jan 25 19:29:03 2021 +0100

    sfeed_update: change parse failure error message

    "(FAIL CONVERT)" -> "(FAIL PARSE)". Convert may be too similar to text encoding
    conversion.

commit 1a17ad48876172fadb6782f1edee2058819770dc
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Mon Jan 25 19:27:04 2021 +0100

    sfeed_update: add an overridable parse() function, using sfeed(1) by default

    This can be useful to write connector scripts more cleanly.
    The output does not necessarily even have to be in the sfeed(5) format.

commit 650adaf537c2baea0d8ad1d11d16286c798cd640
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sun Jan 24 14:27:26 2021 +0100

    sfeed_opml_export: fix typos in comment

commit df74ba274c4ea5d9b7388c33500ba601ed0c991d
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sun Jan 24 00:27:50 2021 +0100

    sfeed_update: print the filename again as passed as a parameter

    ... and do not show stderr of readlink.

commit 55d64a0d370b140547fdbab88d271cd6b92265d2
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sat Jan 23 14:48:27 2021 +0100

    bump version to 0.9.21

commit 2e00d9c904e6b03a6198651d15dc827526fa319c
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Fri Jan 22 22:40:40 2021 +0100

    xml.c: fix typo / regression in checking codepoint range for utf-16 surrogate pair

    Regression in commit 12b279581fbbcde2b36eb4b78d70a1c52d4a209a

    0xdffff should be 0xdfff.

    printf '<item><title>&#x1f448;</title></item>' | sfeed

    Before (bad):
    	&#x1f448;

    After:
    	👈
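
    The fix is a one-character change in a range test. A standalone illustration
    of the corrected check (not the actual xml.c code):

    	#include <stdio.h>

    	/* UTF-16 surrogate halves are U+D800..U+DFFF; the regression
    	 * compared against 0xdffff, one nibble too wide. */
    	static int
    	issurrogate(long cp)
    	{
    		return cp >= 0xd800 && cp <= 0xdfff;
    	}

    	int
    	main(void)
    	{
    		printf("%d\n", issurrogate(0x1f448)); /* 0: valid codepoint */
    		printf("%d\n", issurrogate(0xdc00));  /* 1: lone surrogate */
    		return 0;
    	}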

commit 81499ca75a8c1cb2440c89241b5f5d06e5c44f5a
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Fri Jan 22 22:16:34 2021 +0100

    sfeed: fix regression with parsing content fields

    This regression was introduced in commit e43b7a48 on Tue Oct 6 18:51:33 2020 +0200.

    After a content tag was parsed the "iscontenttag" variable was not reset.

    This caused 2 regressions:
    - It ignored other tags such as links after it.
    - It incorrectly set the content-type of a lesser priority field.

    Thanks to pazz0 for reporting it!

commit 5940c794323d58dd94bdabc2e7282e90f15dfeb4
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Fri Jan 22 15:02:56 2021 +0100

    README: tested with lacc

    Interesting C compiler project:

    lacc: A simple, self-hosting C compiler:
    https://github.com/larmel/lacc

commit 12b279581fbbcde2b36eb4b78d70a1c52d4a209a
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Fri Jan 22 01:11:19 2021 +0100

    xml.c: do not convert UTF-16 surrogate pairs to an invalid sequence

    Simple way to reproduce:

    	printf '<item><title>&#xdc00;</title></item>' | sfeed | iconv -t utf-8

    Result:
    	iconv: (stdin):1:8: cannot convert

    Output result:

    	printf '<item><title>&#xdc00;</title></item>' | sfeed

    Before:

    00000000  09 ed b0 80 09 09 09 09  09 09 09 0a              |............|
    0000000c

    After:

    00000000  09 26 23 78 64 63 30 30  3b 09 09 09 09 09 09 09  |.&#xdc00;.......|
    00000010  0a                                                |.|
    00000011

    The entity is output as a literal string. This makes it easier to see what is
    wrong and to debug the feed, and it is consistent with the current behaviour
    for invalid named entities (&bla;). An alternative could be a UTF-8
    replacement symbol (codepoint 0xfffd).

    Reference: https://unicode.org/faq/utf_bom.html , specifically:

    "Q: How do I convert an unpaired UTF-16 surrogate to UTF-8? "
    "A: A different issue arises if an unpaired surrogate is encountered when
    converting ill-formed UTF-16 data. By representing such an unpaired surrogate
    on its own as a 3-byte sequence, the resulting UTF-8 data stream would become
    ill-formed. While it faithfully reflects the nature of the input, Unicode
    conformance requires that encoding form conversion always results in a valid
    data stream. Therefore a converter must treat this as an error. [AF]"

commit 57d341d9826ff742b5f69cab8228d0d06c3997a3
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sat Jan 16 00:02:12 2021 +0100

    sfeed_update: typo in comment

commit a94f19744a2b62b021fcd3741ec2e4d5316ed49a
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Tue Jan 12 01:02:37 2021 +0100

    sfeed_update: improve consistency of feed creation and merging

    - Improve feed creation with empty results and new feed files.
      Always make sure the file is created, even when it is new and there are no
      items left (after filtering).

    - Consistency: always use the same feed file for merging.
      Do not use "/dev/null" when it is a new file. This works using sort, but is
      ugly when the merge() function is overridden and does something else. It
      should always be the feed file.

commit 7270aee2452804631741c971d3b4bc6d52ca72a6
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Tue Jan 12 00:50:04 2021 +0100

    sfeed_update: make convertencoding() consistent with other overridable functions

    This adds the name as the first parameter for the convertencoding() function,
    like filter, merge, order, etc.

    This can be useful to make an exception rule for text decoding in a cleaner
    way.

commit 4c35fd89e81ee83b1174dd6cfed3d6beaf25dc63
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sat Jan 16 00:01:09 2021 +0100

    sfeed_opml_import: minor code-style improvements

commit 3260b503e5a4c1d4ccabed0b2672e2f49c38df05
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Fri Jan 15 18:35:34 2021 +0100

    sfeed_opml_import.1: clarify it handles OPML _subscription_ lists specifically

    OPML is a more generic format; this tool is specifically for "rss" types and
    subscription lists.

commit 7feab0fd885c1c93344fd427904eae91943493b7
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Fri Jan 15 18:31:52 2021 +0100

    README: newsboat sqlite3 export script: improvements

    - Export read/unread state to a separate plain-text "urls" file, line by line.

    - Handle white-space control-chars better.

    From the sfeed(1) man page:

    "    The fields: title, id, author are not allowed to have newlines and TABs,
         all whitespace characters are replaced by a single space character.
         Control characters are removed."

    So do the reverse for newsboat as well: change white-space characters which
    are also control-characters (such as TABs and newlines) to a single space
    character.

commit f18f4818ed2c992aa9b7b91c74bb9ce7cc1bc745
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sun Jan 10 16:57:53 2021 +0100

    optimize converting UNIX timestamp to localtime

    Makes a huge difference (cuts the time in half to process the same amount of
    lines) on at least glibc 2.30 on Void Linux.  Seems to make no difference on
    OpenBSD.

    - This removes at least one heap allocation per line (checked with valgrind).
      This is because glibc will strdup() the environment variable $TZ and free it
      each time, which is pointless here and wasteful.
    - localtime_r is not required to set variables like tzname.

    In glibc-2.30/time/tzset.c in __tz_convert is the following code and comment:

      /* Update internal database according to current TZ setting.
         POSIX.1 8.3.7.2 says that localtime_r is not required to set tzname.
         This is a good idea since this allows at least a bit more parallelism.  */
      tzset_internal (tp == &_tmbuf && use_localtime);

    This makes localtime() always call tzset() and inspect the environment ($TZ
    etc.), while with localtime_r it is only initialized once:

    	static void tzset_internal (int always) {
    	[...]
    	if (is_initialized && !always)
    		return;
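
    A minimal sketch of the pattern (not the sfeed code itself): call tzset()
    once up front and then use localtime_r() per line, instead of localtime()
    which may inspect $TZ on every call:

    	#include <stdio.h>
    	#include <time.h>

    	int
    	main(void)
    	{
    		time_t t = time(NULL);
    		struct tm tm;
    		char buf[32];

    		tzset(); /* read $TZ once */
    		if (localtime_r(&t, &tm)) {
    			strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M", &tm);
    			puts(buf);
    		}
    		return 0;
    	}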

commit c1b44cf790f8090ff25a2ff268c3f7a8d53e1bcf
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sat Jan 9 16:05:27 2021 +0100

    printutf8pad: fix byte-seek issue with negative width codepoints in the range >= 127

    For example: "\xef\xbf\xb7" (codepoint 0xfff7), returns wcwidth(wc) == -1.
    The next byte was incorrectly seeked, but the codepoint itself was valid
    (according to mbtowc).

commit 6403eadff853f470a859317f605dd306eb427bba
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sat Jan 9 14:58:40 2021 +0100

    printutf8pad: small code-style/clarify changes

commit d95499b2d8e777fcb82fd3637044c3e70118a06b
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Fri Jan 8 19:38:58 2021 +0100

    sfeed_atom: code-style: use conditional with pledge, like the other tools

commit 04b832539cd5b5392c56ef238ec9b42b689de3ae
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Fri Jan 8 19:08:59 2021 +0100

    util.c: printutf8pad(): improve padded printing and printing invalid unicode characters

    This affects sfeed_plain.

    - Use unicode replacement character (codepoint 0xfffd) when a codepoint is
      invalid and proceed printing the rest of the characters.

    - When a codepoint is invalid reset the internal state of mbtowc(3), from the
      OpenBSD man page:

      "  If a call to mbtowc() resulted in an undefined internal state, mbtowc()
         must be called with s set to NULL to reset the internal state before it
         can safely be used again."

    - Optimize for the common ASCII case and use a macro to print the character
      instead of a wasteful fwrite() function call. With 250k lines (+- 350MB) this
      improves printing performance from 1.7s to 1.0s on my laptop. On another
      system it improved by +- 25%.  Tested with clang and gcc and also tested the
      worst-case (non-ASCII) with no penalty.

    To test:

    	printf '0\tabc\xc3 def' | sfeed_plain

    Before:

      1970-01-01 01:00  abc

    After:

      1970-01-01 01:00  abc� def
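
    A reduced sketch of the recovery logic described above (assuming a UTF-8
    locale; this is not the actual printutf8pad() code):

    	#include <locale.h>
    	#include <stdio.h>
    	#include <stdlib.h>

    	/* print a possibly invalid UTF-8 string; on an invalid sequence
    	 * print U+FFFD, reset the mbtowc() state and continue. */
    	static void
    	printsafe(const char *s, size_t len)
    	{
    		wchar_t wc;
    		size_t i;
    		int rl;

    		for (i = 0; i < len; ) {
    			rl = mbtowc(&wc, &s[i], len - i);
    			if (rl == -1) {
    				mbtowc(NULL, NULL, 0); /* reset internal state */
    				fputs("\xef\xbf\xbd", stdout); /* U+FFFD */
    				i++; /* seek one byte, then try again */
    			} else if (rl == 0) {
    				break; /* NUL terminator */
    			} else {
    				fwrite(&s[i], 1, rl, stdout);
    				i += rl;
    			}
    		}
    	}

    	int
    	main(void)
    	{
    		setlocale(LC_CTYPE, "");
    		printsafe("abc\xc3 def", 8);
    		putchar('\n');
    		return 0;
    	}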

commit c7e3ec5f37738c43b3918cba6977fa51631a23af
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Fri Jan 8 12:05:14 2021 +0100

    sfeed_gopher: optimize common output character function

    Same reason as the previous commit (allow it to expand to a macro).

commit bb34ab8d50cbe4c9525d06e4cb67fb58e48ae8b8
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Fri Jan 8 11:58:48 2021 +0100

    xmlencode: optimize common character output function

    Use putc instead of fputc; it can be optimized to a macro.

    From the OpenBSD man page:

    "    putc() acts essentially identically to fputc(), but is a macro that
         expands in-line.  It may evaluate stream more than once, so arguments
         given to putc() should not be expressions with potential side effects."

    sfeed_atom, sfeed_frames and sfeed_html are using this function.

    Mini-benchmarked sfeed_html and it went from 1.45s to 1.0s with feed files in
    total 250k lines (+- 350MB). Tested with clang and gcc on OpenBSD on an older
    laptop.
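
    The quoted caveat matters in practice; a small illustration (not from the
    sfeed code):

    	#include <stdio.h>

    	int
    	main(void)
    	{
    		FILE *fps[] = { stdout, stderr };
    		int i = 0;

    		putc('a', stdout);    /* fine: the stream is a plain variable */
    		fputc('b', fps[i++]); /* fputc is a real function: always safe */
    		/* putc('c', fps[i++]); would be risky: the macro may evaluate
    		 * the stream argument more than once. */
    		return 0;
    	}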

commit b829948d9da8dbbea6d7275ebc1021000114ba15
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sun Jan 3 14:46:01 2021 +0100

    man pages: add more real world examples to the man pages

commit 8c85cbf85c42eca17f18ffe45ffc6bde695a4e26
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Fri Jan 1 23:23:06 2021 +0100

    sfeed.1/sfeed_plain.1: add example, improve quoting the url for sfeed_web.1

commit aec6b5c35ac33736e6b94e1a613666fe19ebb2d4
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Fri Jan 1 22:38:10 2021 +0100

    sfeed_gopher: tighten filesystem permissions on OpenBSD using unveil(2)

    sfeed_gopher must be able to write in the current directory, but does not need
    write permissions outside it. It could read from any place in the filesystem
    (to read feed files).

    Prompted by a suggestion from vejetaryenvampir, thanks!

commit e09c96a75edf81b3d219223b6d3564ad7df10004
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Fri Jan 1 22:36:18 2021 +0100

    README: add text about page redirects + tweak some words.

    ... move sections around in a more logical order and tweak some words.

    Prompted by a question and feedback from Aleksei, thanks!

commit 3ef73a907e65cad2a250e2e9d345c58f19ae4ca0
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Fri Jan 1 22:36:00 2021 +0100

    README: tested on MIPS32 (big-endian)

commit cd3ef9def6e39c57236c53e7a3bfe3903132c6a2
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Fri Jan 1 22:35:48 2021 +0100

    LICENSE: bump year

commit 7e4c07313301fdc43bd60f328dba8bf638813a37
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Fri Jan 1 22:34:04 2021 +0100

    sfeed_update: if baseurl is empty then use the path from the feed by default

    Feeds should contain absolute URLs, but if they do not then this makes such
    feeds more convenient to configure.

commit cfcdae2a394a2ac0d180ff34cae17e30422d0800
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Mon Nov 9 19:16:09 2020 +0100

    bump version to 0.9.20

commit 7664faf66555c5c5b1221322bff7e49722ea36c7
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sun Nov 1 16:53:40 2020 +0100

    sfeed_xmlenc: be more paranoid in printing encoding names

    sfeed_xmlenc is used automatically in sfeed_update for detecting the encoding.

    In particular do not allow slashes anymore either.  For example "//IGNORE" and
    "//TRANSLIT" which are normally allowed.

    Some iconv implementations might allow other funky names or even pathnames
    too, so disallow that.

    See also the notes about the "frommap" for the "-f" option.
    https://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html

    + some minor parsing handling improvements.
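
    A sketch of the kind of validation this implies (the exact allowed character
    set here is an assumption for illustration, not a copy of sfeed_xmlenc.c):

    	#include <ctype.h>
    	#include <stdio.h>

    	/* allow only conservative characters in an encoding name before
    	 * passing it to iconv; slashes would permit "//IGNORE",
    	 * "//TRANSLIT" or even pathnames with some implementations. */
    	static int
    	validencoding(const char *s)
    	{
    		if (!*s)
    			return 0;
    		for (; *s; s++)
    			if (!isalnum((unsigned char)*s) &&
    			    *s != '-' && *s != '_' && *s != '.' && *s != ':')
    				return 0;
    		return 1;
    	}

    	int
    	main(void)
    	{
    		printf("%d\n", validencoding("UTF-8"));           /* 1 */
    		printf("%d\n", validencoding("UTF-8//TRANSLIT")); /* 0 */
    		return 0;
    	}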

commit 134a1ac3372fe1eae6bc5c6acd12666c17e82696
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sat Oct 31 19:51:17 2020 +0100

    sfeed_web: improve parsing a <link> if it has no type attribute

    This happens because the previous link type is not reset when a <link> tag
    starts again, but it is reset when a type attribute starts.

    Found on the Spanish newspaper site: elpais.com

    Input:

    	<link rel="alternate" href="https://feeds.elpais.com/mrss-s/pages/ep/site/elpais.com/portada" type="application/rss+xml" title="RSS de la portada de El País"/>
    	<link rel="canonical" href="https://elpais.com"/>

    It would print (the second line is incorrect):

    	https://feeds.elpais.com/mrss-s/pages/ep/site/elpais.com/portada	application/rss+xml
    	https://elpais.com/	application/rss+xml

    Now prints:

    	https://feeds.elpais.com/mrss-s/pages/ep/site/elpais.com/portada	application/rss+xml

    Fix: reset it also at the start of a <link> tag in this case (for <base href />
    it is still not wanted).

commit 6a7229149f03a54d7d63241c4cbc1c83aa9831f0
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sat Oct 24 16:20:38 2020 +0200

    bump version to 0.9.19

commit d7b910dcaea6027b5c11f027e3af7eb4ac8962b2
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Thu Oct 22 19:54:21 2020 +0200

    sfeed_web: whoops, fix bug mentioned in the previous commit

    (ascii.jp)

commit e771e43d51830ec7d2a19d9d4e67cded83c1b302
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Wed Oct 21 22:06:58 2020 +0200

    sfeed_web: attribute parsing improvements, improve man page

    Fix attribute parsing and now decode entities. The following now works (from
    helsinkitimes.fi):

    	<base href="https://www.helsinkitimes.fi/" />
    	<link href="/?format=feed&amp;type=rss" rel="alternate" type="application/rss+xml" title="RSS 2.0" />
    	<link href="/?format=feed&amp;type=atom" rel="alternate" type="application/atom+xml" title="Atom 1.0" />

    Properly associate attributes with the actual tag; this now parses properly
    (from ascii.jp):

    	<link rel="apple-touch-icon-precomposed" href="/img/apple-touch-icon.png" />
    	<link rel="alternate" type="application/rss+xml"  />

commit f054e581dac4921b302e0459a40d1b4f1fbd28ae
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Thu Oct 22 18:18:32 2020 +0200

    Do not change the referenced matched tag data (from gettag()).

    Fixes a regression introduced in the refactor in commit
    e43b7a48b08a6bbcb4e730e80395b3257681b33e

    Now copy the data by value. This structure is small and no performance
    regression has been seen.

    This was because the tag ID was modified, which made subsequently parsed tags
    of this type behave strangely:

    	ctx.tag->id = RSSTagGuidPermalinkTrue;

    Input data to reproduce:

    <rss>
    <channel>

    <item>
    <guid isPermaLink="false">https://def/</guid>
    </item>

    <item>
    <guid>https://abc/</guid>
    </item>

    </channel>
    </rss>
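
    A hypothetical reduction of the fix (not the sfeed code): copy the matched
    tag by value so that modifying it cannot corrupt the shared lookup table:

    	typedef struct { int id; } FeedTag;

    	static FeedTag tags[] = { { 1 }, { 2 } };

    	struct ctx {
    		FeedTag tag; /* was: FeedTag *tag, pointing into tags[] */
    	};

    	int
    	main(void)
    	{
    		struct ctx ctx;

    		ctx.tag = tags[0]; /* copy by value */
    		ctx.tag.id = 99;   /* tags[0].id stays untouched */
    		return 0;
    	}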

commit 0f76bbf2d809e93903facaa0da9cfd51cabf93fe
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Wed Oct 21 17:30:28 2020 +0200

    README: filter example, filter Google Analytics utm_* parameters

    https://support.google.com/analytics/answer/1033867?hl=nl

commit c30806b15999b4c35649ddea54b2be6ac9a14620
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Wed Oct 21 00:31:51 2020 +0200

    sfeed_web: reset feedlink buffer

    Noticed strange output on the site ascii.jp:

    The site HTML contained:
    	<link rel="apple-touch-icon-precomposed" href="/img/apple-touch-icon.png" />
    	<link rel="alternate" type="application/rss+xml"  />

    This would print:
    	"/img/apple-touch-icon.png       application/rss+xml"

    Now it prints:

    	"	application/rss+xml"

commit 3d4ab8553f7f2867b1ee5526a6785a2877ec5648
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sun Oct 18 16:22:57 2020 +0200

    README: improve etag example with escaping of the filename

    Use the same base filename as the feed file, because sfeed_update replaces '/'
    in names with '_':

    	filename="$(printf '%s' "$1" | tr '/' '_')"

    This fixes the example for fetching feeds with names containing '/'.

    Reported by __20h__, thanks!

commit 27a14a3d46603ab452dba10796a24591db096d76
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sun Oct 18 13:31:11 2020 +0200

    README: add example to support ETag caching

commit bffe41223467cb6d9a87c1a057966924fbda6c40
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sun Oct 18 00:00:58 2020 +0200

    xml.c: initialize i = 0

    Forgot it in the cleanup commit 37afcf334fa1ba0b668bde08e8fcaaa9fd7dfa0d

commit 3bce637a2a610c7991b7cbf87c530d87b019c3d6
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Fri Oct 16 11:24:28 2020 +0200

    README.xml: reference examples, ANSI compatible, mention original parser

commit 2cda3937b6e8cd333e4a4388553a17339fce82ac
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Fri Oct 16 11:18:51 2020 +0200

    README: fix unescaped character in regex in awk in filter example

    Found by testing using mawk.

commit 65df3f2e9a3961541ac33f1a147d373f814aedf9
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Mon Oct 12 20:42:17 2020 +0200

    add a comment about the intended date priority

commit db9a53d95dddb5b668be498d9274ba700ac3b529
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Mon Oct 12 20:40:50 2020 +0200

    Revert "RSS: give Dublin Core <dc:date> higher priority over <pubDate>"

    This reverts commit a1516cb7869a0dd99ebaacf846ad4161f2b9b9a2.

commit 74df6f66ab2b6cf1857658151f9076c585911b7f
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Mon Oct 12 20:10:44 2020 +0200

    README: filter example: strip Facebook fbclid parameter

commit 33c50db302957bca2a850ac8d0b960d05ee0520e
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Mon Oct 12 18:55:35 2020 +0200

    simplify time parsing

commit 0ea6495717a3245d1da079b5d9570a6cf776ef1a
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Mon Oct 12 18:49:56 2020 +0200

    remove unneeded check for NUL terminator

commit a1516cb7869a0dd99ebaacf846ad4161f2b9b9a2
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Mon Oct 12 18:41:31 2020 +0200

    RSS: give Dublin Core <dc:date> higher priority over <pubDate>

    This way dc:date could be the updated time of the item. For Atom there is
    <published> and <updated> with the same logic.

commit e43b7a48b08a6bbcb4e730e80395b3257681b33e
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Tue Oct 6 18:51:33 2020 +0200

    parse categories, add multiple field values support (for categories)

    Fields with multiple values are separated by '|'. In the future multiple
    enclosure support might be added.

    The category tags are now parsed. This feature is useful for filtering and
    categorizing.

    Parsing of nested tags such as <author><name> has been improved. This code has
    been refactored.

    RSS <guid> isPermaLink is now also handled differently and will prefer a
    permalink with "true" (link) over the ID. In practice multiple <guid> tags in
    an item do not occur.
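
    A small sketch of how a multi-value field is emitted (illustrative only, not
    the sfeed code; see string_print_trimmed_multi() in the diff below):

    	#include <stdio.h>

    	int
    	main(void)
    	{
    		const char *categories[] = { "unix", "c", "feeds" };
    		size_t i, n = sizeof(categories) / sizeof(categories[0]);

    		for (i = 0; i < n; i++) {
    			if (i)
    				putchar('|'); /* multi-value separator */
    			fputs(categories[i], stdout);
    		}
    		putchar('\n'); /* prints: unix|c|feeds */
    		return 0;
    	}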

commit 37afcf334fa1ba0b668bde08e8fcaaa9fd7dfa0d
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Fri Oct 9 11:43:52 2020 +0200

    xml: remove unused code for sfeed

commit 3a5295f595acee81170565ba0405a64d7110a229
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Tue Oct 6 18:45:54 2020 +0200

    fix counting due to uninitialized variable when the time could not be parsed

    Since commit 276d5789fd91d1cbe84b7baee736dea28b1e04c0 if the time is empty or
    could not be parsed then it is shown/aligned as a blank space instead of being
    skipped.

    An oversight in this change was that such items should still be counted and
    have `isnew` set.

    This commit fixes the uninitialized variable and possible miscounting.

commit 08add9b95e58bc4e7ece8137efe8dc0905a94467
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Tue Oct 6 18:45:26 2020 +0200

    xml.h: minor comment rewording

commit c0e9fea2a9661f3fc77a1a608e5ca456972321d9
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Tue Oct 6 18:45:09 2020 +0200

    sfeed: parse day with max 2 digits (instead of 4)

commit ac9fbcd7c4c5603d0a430b82021e6b45c8a02836
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Tue Oct 6 18:44:35 2020 +0200

    sfeed: support the ISO8601 time format without separators

    For example "19720229T132245Z" is now supported.
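
    One way to accept both forms is to parse runs of digits and treat the
    separators as optional (a sketch, not the actual sfeed.c parser):

    	#include <ctype.h>
    	#include <stdio.h>

    	/* parse up to `max` digits from *s into *v, return the count. */
    	static int
    	getdigits(const char **s, int max, int *v)
    	{
    		int n = 0;

    		for (*v = 0; n < max && isdigit((unsigned char)**s); n++, (*s)++)
    			*v = *v * 10 + (**s - '0');
    		return n;
    	}

    	int
    	main(void)
    	{
    		const char *s = "19720229T132245Z";
    		int Y, M, D, h, m, sec;

    		getdigits(&s, 4, &Y); if (*s == '-') s++;
    		getdigits(&s, 2, &M); if (*s == '-') s++;
    		getdigits(&s, 2, &D); if (*s == 'T' || *s == ' ') s++;
    		getdigits(&s, 2, &h); if (*s == ':') s++;
    		getdigits(&s, 2, &m); if (*s == ':') s++;
    		getdigits(&s, 2, &sec);
    		printf("%04d-%02d-%02d %02d:%02d:%02d\n", Y, M, D, h, m, sec);
    		return 0;
    	}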

commit 688fbed347554944aa8f9d0bbd9f66694a8fa353
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Tue Oct 6 18:40:30 2020 +0200

    README: tested with cproc and sdcc on Z80 emulator, for fun

    cproc: https://github.com/michaelforney/cproc
    qbe: https://c9x.me/compile/

    z80 (sfeed base program)
    fuzix: http://www.fuzix.org/
    RC2014 emulator: https://github.com/EtchedPixels/RC2014
    sdcc: http://sdcc.sourceforge.net/

commit 64ee7df437b5ee497cac0a7ac7d382306a5cd159
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Tue Oct 6 18:39:55 2020 +0200

    man pages: tweak alignment of lists

commit 95f025d9bcc4b3d1e09a5790bb9ee2cb89ad4cf1
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sat Sep 19 16:38:21 2020 +0200

    xml.c: remove buffering of comment data, which is unused anyway

commit 88208d5ffda5cfbf03901038b41afae703f0d172
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sat Sep 19 16:37:44 2020 +0200

    xml.h: add underscore for #ifdef guard

    This is the common style.

commit 7779e6a74852a7655210abb686287a4b59e75fdc
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sat Sep 19 16:35:33 2020 +0200

    XML cdata callback: handle CDATA as data

    This improves handling CDATA, for example in Atom feeds with:
    <author><email><![CDATA[abc]]><name><![CDATA[[person]]></name></author>

commit 5b130adb977ba0ba233ccc5b2bdfc6d6e37563dd
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Mon Jul 6 17:38:17 2020 +0200

    bump version to 0.9.18

commit 17898c52f631c3dd10c11167aaa702e5756dde04
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sun Jul 5 16:46:40 2020 +0200

    sfeed_atom: minor simplification, gmtime_r is not needed here

commit e8bbe3e72d14a3d81188ea3df4c2c14a5ab90f98
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sun Jul 5 16:41:38 2020 +0200

    README: reference sfeed_curses

commit 92be24ba540d9c7a0462ea7744803165aba26849
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sun Jul 5 16:10:23 2020 +0200

    README: improvements

    - Add an example to optimize bandwidth use with the curl -z option.
    - Add a note about CDNs blocking based on the User-Agent (based on a question
      mailed to me).
    - Add a script to convert existing newsboat items to the sfeed(5) TSV format.

commit 276d5789fd91d1cbe84b7baee736dea28b1e04c0
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sun Jul 5 15:57:02 2020 +0200

    format tools: don't skip items with a missing/invalid timestamp field

    Handle it appropriately in the context of each format tool. Output the item
    but keep the field blank.

    NOTE: maybe in sfeed_twtxt it should use the current time instead?

commit a13633a0a6b4e43d7809919ad70e085b9ecdff47
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sun Jul 5 15:54:28 2020 +0200

    sfeed_mbox: don't ignore items with a missing/invalid timestamp

    The Date header is mandatory.  Use the current time if it is missing/invalid.

commit d425b524f5ce06e6d9cbb6838c4bf8b66b73ee06
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sun Jul 5 15:53:37 2020 +0200

    sfeed_atom: the updated field is mandatory: use the current time...

    ... if it is missing/invalid.

commit b722b45e5468af3e9405652b4ca57c10c376ba8b
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sun Jul 5 14:05:04 2020 +0200

    sfeed_atom: fix timezone, output if timestamp is set

    Timezone should be GMT (as intended), do not convert to localtime.

commit affe5b6b64d57e56eb14132fa67c8f9e3800af6c
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Thu Jun 25 19:50:43 2020 +0200

    README: small tweaks and a filter example improvement

    This is a "quick&dirty" regex to block some of the typical 1px width or height
    tracking pixels.

commit 1b14e601434435f48dfe1027a117c2af3dac325b
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sun Jun 21 18:08:41 2020 +0200

    sfeed_html/sfeed_frames: simplify struct feed allocation

    There's no need for a dynamic struct feed **. The required size is known
    (argc). Just allocate it in one go.
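
    A sketch of the simplification (hypothetical names, not the actual code):

    	#include <stdlib.h>

    	struct feed { char *name; unsigned long totalnew; };

    	int
    	main(int argc, char *argv[])
    	{
    		struct feed *feeds;

    		/* the required count is known up front: one per argument,
    		 * so a single allocation replaces growing a struct feed **. */
    		if (!(feeds = calloc(argc, sizeof(*feeds))))
    			return 1;
    		free(feeds);
    		return 0;
    	}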

commit b4830a4a3a98874293e20b309e083a144c1b3609
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date:   Sun Jun 21 18:01:52 2020 +0200

    Makefile: tiny compatibility improvement for tar -cf

Diffstat:
M LICENSE | 2 +-
M Makefile | 4 ++--
M README | 260 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------
M README.xml | 10 ++++++++--
M sfeed.1 | 23 +++++++++++++++++------
M sfeed.5 | 8 +++++---
M sfeed.c | 507 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------
M sfeed_atom.c | 26 +++++++++++---------------
M sfeed_frames.c | 43 ++++++++++++++++++++-----------------------
M sfeed_gopher.1 | 7 ++++++-
M sfeed_gopher.c | 66 +++++++++++++++++++++++++++++++++++++++++++++---------------------
M sfeed_html.c | 45 +++++++++++++++++++++------------------------
M sfeed_mbox.c | 33 ++++++++++++++++++---------------
M sfeed_opml_export | 2 +-
M sfeed_opml_export.1 | 6 +++++-
M sfeed_opml_import.1 | 8 ++++++--
M sfeed_opml_import.c | 19 ++++++++++---------
M sfeed_plain.1 | 6 +++++-
M sfeed_plain.c | 26 +++++++++++++-------------
M sfeed_twtxt.c | 16 ++++++++--------
M sfeed_update | 118 +++++++++++++++++++++++++++++++++++++++++++------------------------------------
M sfeed_update.1 | 17 +++++++++++++----
M sfeed_web.1 | 29 ++++++++++++++---------------
M sfeed_web.c | 111 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------
M sfeed_xmlenc.1 | 11 ++++++-----
M sfeed_xmlenc.c | 16 +++++++++++++---
M sfeedrc.5 | 25 ++++++++++++++++++++-----
M util.c | 336 ++++++++++++++++++++++++++++++++++++++++++++++---------------------------
M util.h | 22 +++++++++++++++-------
M xml.c | 48 +++++-------------------------------------------
M xml.h | 15 ++++-----------
31 files changed, 1181 insertions(+), 684 deletions(-)

diff --git a/LICENSE b/LICENSE
@@ -1,6 +1,6 @@
 ISC License
 
-Copyright (c) 2011-2020 Hiltjo Posthuma <hiltjo@codemadness.org>
+Copyright (c) 2011-2021 Hiltjo Posthuma <hiltjo@codemadness.org>
 
 Permission to use, copy, modify, and/or distribute this software for any
 purpose with or without fee is hereby granted, provided that the above
diff --git a/Makefile b/Makefile
@@ -1,7 +1,7 @@
 .POSIX:
 
 NAME = sfeed
-VERSION = 0.9.17
+VERSION = 0.9.22
 
 # paths
 PREFIX = /usr/local
@@ -97,7 +97,7 @@ dist:
 		sfeedrc.example style.css \
 		"${NAME}-${VERSION}"
 	# make tarball
-	tar -cf - "${NAME}-${VERSION}" | \
+	tar cf - "${NAME}-${VERSION}" | \
 		gzip -c > "${NAME}-${VERSION}.tar.gz"
 	rm -rf "${NAME}-${VERSION}"
 
diff --git a/README b/README
@@ -79,6 +79,9 @@ Gopher, SSH, etc.
 See the section "Usage and examples" below and the man-pages for more
 information how to use sfeed(1) and the additional tools.
 
+A separate curses UI front-end called sfeed_curses is available at:
+https://codemadness.org/sfeed_curses.html
+
 Dependencies
 ------------
 
@@ -108,20 +111,21 @@ Optional dependencies
 
 OS tested
 ---------
 
-- Linux (compilers: clang, gcc, pcc, tcc, libc: glibc, musl).
+- Linux (compilers: clang, cproc, gcc, lacc, pcc, tcc, libc: glibc, musl).
 - OpenBSD (clang, gcc).
 - NetBSD
 - FreeBSD
 - DragonFlyBSD
 - Windows (cygwin gcc, mingw).
 - HaikuOS (using libbsd).
-- FreeDOS (djgpp)
+- FreeDOS (djgpp).
+- FUZIX (sdcc -mz80).
 
 
 Architectures tested
 --------------------
 
-amd64, ARM, aarch64, HPPA, i386, SPARC64.
+amd64, ARM, aarch64, HPPA, i386, MIPS32-BE, SPARC64, Z80.
 
 
 Files
@@ -139,8 +143,8 @@ sfeed_mbox - Format feed data (TSV) to mbox.
 sfeed_plain - Format feed data (TSV) to a plain-text list.
 sfeed_twtxt - Format feed data (TSV) to a twtxt feed.
 sfeed_update - Update feeds and merge items.
-sfeed_web - Find urls to RSS/Atom feed from a webpage.
-sfeed_xmlenc - Detect character-set encoding from XML stream.
+sfeed_web - Find URLs to RSS/Atom feed from a webpage.
+sfeed_xmlenc - Detect character-set encoding from a XML stream.
 sfeedrc.example - Example config file. Can be copied to $HOME/.sfeed/sfeedrc.
 style.css - Example stylesheet to use with sfeed_html(1) and
             sfeed_frames(1).
@@ -152,7 +156,7 @@ Files read at runtime by sfeed_update(1)
 sfeedrc - Config file. This file is evaluated as a shellscript in
           sfeed_update(1).
 
-Atleast the following functions can be overridden per feed:
+At least the following functions can be overridden per feed:
 
 - fetch: to use wget(1), OpenBSD ftp(1) or an other download program.
 - filter: to filter on fields.
@@ -186,7 +190,7 @@ man 1 sfeed
 
 Usage and examples
 ------------------
 
-Find RSS/Atom feed urls from a webpage:
+Find RSS/Atom feed URLs from a webpage:
 
 	url="https://codemadness.org"; curl -L -s "$url" | sfeed_web "$url"
@@ -222,7 +226,7 @@ View formatted output in your editor:
 
 - - -
 
 Example script to view feed items in a vertical list/menu in dmenu(1). It opens
-the selected url in the browser set in $BROWSER:
+the selected URL in the browser set in $BROWSER:
 
 	#!/bin/sh
 	url=$(sfeed_plain "$HOME/.sfeed/feeds/"* | dmenu -l 35 -i | \
@@ -248,7 +252,7 @@ argument is optional):
 
 - - -
 
 The filter function can be overridden in your sfeedrc file. This allows
-filtering items per feed. It can be used to shorten urls, filter away
+filtering items per feed. It can be used to shorten URLs, filter away
 advertisements, strip tracking parameters and more.
 
 	# filter fields.
@@ -285,14 +289,14 @@ advertisements, strip tracking parameters and more.
 		}
 
 		# shorten feedburner links.
-		if (match(s, /^(http|https):\/\/[^/]+\/~r\/.*\/~3\/[^\/]+\//)) {
+		if (match(s, /^(http|https):\/\/[^\/]+\/~r\/.*\/~3\/[^\/]+\//)) {
 			s = substr($3, RSTART, RLENGTH);
 		}
 
 		# strip tracking parameters
 		# urchin, facebook, piwik, webtrekk and generic.
-		gsub(/\?(ad|campaign|pk|tm|wt)_([^&]+)/, "?", s);
-		gsub(/&(ad|campaign|pk|tm|wt)_([^&]+)/, "", s);
+		gsub(/\?(ad|campaign|fbclid|pk|tm|utm|wt)_([^&]+)/, "?", s);
+		gsub(/&(ad|campaign|fbclid|pk|tm|utm|wt)_([^&]+)/, "", s);
 
 		gsub(/\?&/, "?", s);
 		gsub(/[\?&]+$/, "", s);
@@ -303,24 +307,15 @@ advertisements, strip tracking parameters and more.
 		$3 = filterlink($3); # link
 		$8 = filterlink($8); # enclosure
 
+		# try to remove tracking pixels: <img/> tags with 1px width or height.
+		gsub("<img[^>]*(width|height)[\s]*=[\s]*[\"'"'"' ]?1[\"'"'"' ]?[^0-9>]+[^>]*>", "", $4);
+
 		print $0;
 	}'
 }
 
 - - -
 
-The fetch function can be overridden in your sfeedrc file. This allows to
-replace the default curl(1) for sfeed_update with any other client to fetch the
-RSS/Atom data:
-
-	# fetch a feed via HTTP/HTTPS etc.
-	# fetch(name, url, feedfile)
-	fetch() {
-		hurl -m 1048576 -t 15 "$2" 2>/dev/null
-	}
-
-- - -
-
 Aggregate feeds. This filters new entries (maximum one day old) and sorts them
 by newest first. Prefix the feed name in the title. Convert the TSV output data
 to an Atom XML feed (again):
@@ -372,7 +367,7 @@ cut -b is used to trim the "N " prefix of sfeed_plain(1).
 
 - - -
 
 For some podcast feed the following code can be used to filter the latest
-enclosure url (probably some audio file):
+enclosure URL (probably some audio file):
 
 	awk -F '\t' 'BEGIN { latest = 0; }
 	length($8) {
@@ -543,6 +538,221 @@ Now run:
 
 Now you can view feeds in mutt(1) for example.
 
+- - -
+
+The fetch function can be overridden in your sfeedrc file. This allows to
+replace the default curl(1) for sfeed_update with any other client to fetch the
+RSS/Atom data or change the default curl options:
+
+	# fetch a feed via HTTP/HTTPS etc.
+	# fetch(name, url, feedfile)
+	fetch() {
+		hurl -m 1048576 -t 15 "$2" 2>/dev/null
+	}
+
+- - -
+
+Caching, incremental data updates and bandwidth-saving
+
+For servers that support it some incremental updates and bandwidth-saving can
+be done by using the "ETag" HTTP header.
+
+Create a directory for storing the ETags per feed:
+
+	mkdir -p ~/.sfeed/etags/
+
+The curl ETag options (--etag-save and --etag-compare) can be used to store and
+send the previous ETag header value. curl version 7.73+ is recommended for it
+to work properly.
+
+The curl -z option can be used to send the modification date of a local file as
+a HTTP "If-Modified-Since" request header. The server can then respond if the
+data is modified or not or respond with only the incremental data.
+
+The curl --compressed option can be used to indicate the client supports
+decompression. Because RSS/Atom feeds are textual XML content this generally
+compresses very well.
+
+These options can be set by overriding the fetch() function in the sfeedrc
+file:
+
+	# fetch(name, url, feedfile)
+	fetch() {
+		etag="$HOME/.sfeed/etags/$(basename "$3")"
+		curl \
+			-L --max-redirs 0 -H "User-Agent:" -f -s -m 15 \
+			--compressed \
+			--etag-save "${etag}" --etag-compare "${etag}" \
+			-z "${etag}" \
+			"$2" 2>/dev/null
+	}
+
+These options can come at a cost of some privacy, because it exposes
+additional metadata from the previous request.
+
+- - -
+
+CDNs blocking requests due to a missing HTTP User-Agent request header
+
+sfeed_update will not send the "User-Agent" header by default for privacy
+reasons. Some CDNs like Cloudflare don't like this and will block such HTTP
+requests.
+
+A custom User-Agent can be set by using the curl -H option, like so:
+
+	curl -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0'
+
+The above example string pretends to be a Windows 10 (x86-64) machine running
+Firefox 78.
+
+- - -
+
+Page redirects
+
+For security and efficiency reasons by default redirects are not allowed and
+are treated as an error.
+
+For example to prevent hijacking an unencrypted http:// to https:// redirect or
+to not add time of an unnecessary page redirect each time. It is encouraged to
+use the final redirected URL in the sfeedrc config file.
+
+If you want to ignore this advise you can override the fetch() function in the
+sfeedrc file and change the curl options "-L --max-redirs 0".
+
+- - -
+
+Shellscript to update feeds in parallel more efficiently using xargs -P.
+
+It creates a queue of the feeds with its settings, then uses xargs to process
+them in parallel using the common, but non-POSIX -P option. This is more
+efficient than the more portable solution in sfeed_update which can stall a
+batch of $maxjobs in the queue if one item is slow.
+
+sfeed_update_xargs shellscript:
+
+	#!/bin/sh
+	# update feeds, merge with old feeds using xargs in parallel mode (non-POSIX).
+
+	# include script and reuse its functions, but do not start main().
+	SFEED_UPDATE_INCLUDE="1" . sfeed_update
+	# load config file, sets $config.
+	loadconfig "$1"
+
+	# process a single feed.
+	# args are: config, tmpdir, name, feedurl, basesiteurl, encoding
+	if [ "${SFEED_UPDATE_CHILD}" = "1" ]; then
+		sfeedtmpdir="$2"
+		_feed "$3" "$4" "$5" "$6"
+		exit $?
+	fi
+
+	# ...else parent mode:
+
+	# feed(name, feedurl, basesiteurl, encoding)
+	feed() {
+		# workaround: *BSD xargs doesn't handle empty fields in the middle.
+		name="${1:-$$}"
+		feedurl="${2:-http://}"
+		basesiteurl="${3:-${feedurl}}"
+		encoding="$4"
+
+		printf '%s\0%s\0%s\0%s\0%s\0%s\0' "${config}" "${sfeedtmpdir}" \
+			"${name}" "${feedurl}" "${basesiteurl}" "${encoding}"
+	}
+
+	# fetch feeds and store in temporary directory.
+	sfeedtmpdir="$(mktemp -d '/tmp/sfeed_XXXXXX')"
+	# make sure path exists.
+	mkdir -p "${sfeedpath}"
+	# print feeds for parallel processing with xargs.
+	feeds | SFEED_UPDATE_CHILD="1" xargs -r -0 -P "${maxjobs}" -L 6 "$(readlink -f "$0")"
+	# cleanup temporary files etc.
+	cleanup
+
+- - -
+
+Shellscript to export existing newsboat cached items from sqlite3 to the sfeed
+TSV format.
+
+	#!/bin/sh
+	# Export newsbeuter/newsboat cached items from sqlite3 to the sfeed TSV format.
+	# The data is split per file per feed with the name of the newsboat title/url.
+	# It writes the URLs of the read items line by line to a "urls" file.
+	#
+	# Dependencies: sqlite3, awk.
+	#
+	# Usage: create some directory to store the feeds, run this script.
+	#
+	# Assumes "html" for content-type (Newsboat only handles HTML content).
+	# Assumes feednames are unique and a feed title is set.
+
+	# newsboat cache.db file.
+	cachefile="$HOME/.newsboat/cache.db"
+	test -n "$1" && cachefile="$1"
+
+	# dump data.
+	# .mode ascii: Columns/rows delimited by 0x1F and 0x1E
+	# get the first fields in the order of the sfeed(5) format.
+	sqlite3 "$cachefile" <<!EOF |
+	.headers off
+	.mode ascii
+	.output
+	SELECT
+		i.pubDate, i.title, i.url, i.content, i.guid, i.author,
+		i.enclosure_url,
+		f.rssurl AS rssurl, f.title AS feedtitle, i.unread --,
+		-- i.id, i.enclosure_type, i.enqueued, i.flags, i.deleted,
+		-- i.base
+	FROM rss_feed f
+	INNER JOIN rss_item i ON i.feedurl = f.rssurl
+	ORDER BY
+		i.feedurl ASC, i.pubDate DESC;
+	.quit
+	!EOF
+	# convert to sfeed(5) TSV format.
+	awk '
+	BEGIN {
+		FS = "\x1f";
+		RS = "\x1e";
+	}
+	# normal non-content fields.
+	function field(s) {
+		gsub("^[[:space:]]*", "", s);
+		gsub("[[:space:]]*$", "", s);
+		gsub("[[:space:]]", " ", s);
+		gsub("[[:cntrl:]]", "", s);
+		return s;
+	}
+	# content field.
+	function content(s) {
+		gsub("^[[:space:]]*", "", s);
+		gsub("[[:space:]]*$", "", s);
+		# escape chars in content field.
+		gsub("\\\\", "\\\\", s);
+		gsub("\n", "\\n", s);
+		gsub("\t", "\\t", s);
+		return s;
+	}
+	function feedname(url, title) {
+		gsub("/", "_", title);
+		return title;
+	}
+	{
+		fname = feedname($8, $9);
+		if (!feed[fname]++) {
+			print "Writing file: \"" fname "\" (title: " $9 ", url: " $8 ")" > "/dev/stderr";
+		}
+
+		print $1 "\t" field($2) "\t" field($3) "\t" content($4) "\t" \
+			"html" "\t" field($5) "\t" field($6) "\t" field($7) \
+			> fname;
+
+		# write URLs of the read items to a file line by line.
+		if ($10 == "0") {
+			print $3 > "urls";
+		}
+	}'
+
 License
 -------
diff --git a/README.xml b/README.xml
@@ -3,11 +3,14 @@ XML parser
 
 A small XML parser.
 
+For the original version see:
+https://git.codemadness.org/xmlparser/
+
 Dependencies
 ------------
 
-- C compiler (C99).
+- C compiler (ANSI).
 
 
 Features
@@ -51,7 +54,7 @@ Caveats
   HTML <pre> tags.
 - The XML specification has no limits on tag and attribute names. For
   simplicity/sanity sake this XML parser takes some liberties. Tag and
-  attribute names are truncated.
+  attribute names are truncated if they are excessively long.
 - Entity expansions are not parsed aswell as DOCTYPE, ATTLIST etc.
@@ -72,6 +75,9 @@ Examples
 
 sfeed_opml_import.c or sfeed_web.c or sfeed_xmlenc.c
 
+See skeleton.c in the original xmlparser repository for a base program to start
+quickly.
+
 License
 -------
diff --git a/sfeed.1 b/sfeed.1
@@ -1,4 +1,4 @@
-.Dd March 15, 2020
+.Dd February 28, 2021
 .Dt SFEED 1
 .Os
 .Sh NAME
@@ -13,8 +13,11 @@ reads RSS or Atom feed data (XML) from stdin.
 It writes the feed data in a TAB-separated format to stdout.
 A
 .Ar baseurl
-can be specified if the links in the feed are relative urls.
-It is recommended to always have absolute urls in your feeds.
+can be specified if the links or enclosures in the feed are relative URLs.
+If the
+.Ar baseurl
+is a valid absolute URL then the relative links or enclosures will be
+made absolute.
 .Sh TAB-SEPARATED FORMAT FIELDS
 The items are output per line in a TSV-like format.
 .Pp
@@ -29,13 +32,13 @@ Other whitespace characters except spaces are removed.
 Control characters are removed.
 .Pp
 The order and content of the fields are:
-.Bl -tag -width 17n
+.Bl -tag -width 12n
 .It timestamp
 UNIX timestamp in UTC+0, empty if missing or on parse failure.
 .It title
 Title text, HTML code in titles is ignored and is treated as plain-text.
 .It link
-Absolute url, unsafe characters are encoded.
+Link
 .It content
 Content, can have plain-text or HTML code depending on the content-type field.
 .It content-type
@@ -46,12 +49,20 @@ RSS item GUID or Atom id.
 .It author
 Item author.
 .It enclosure
 Item, first enclosure.
+.It category
+Item, categories, multiple values are separated by |.
 .El
 .Sh EXIT STATUS
 .Ex -std
+.Sh EXAMPLES
+.Bd -literal
+curl -s 'https://codemadness.org/atom.xml' | sfeed
+.Ed
 .Sh SEE ALSO
 .Xr sfeed_plain 1 ,
-.Xr sfeed 5
+.Xr sfeed_update 1 ,
+.Xr sfeed 5 ,
+.Xr sfeedrc 5
 .Sh AUTHORS
 .An Hiltjo Posthuma Aq Mt hiltjo@codemadness.org
 .Sh CAVEATS
diff --git a/sfeed.5 b/sfeed.5
@@ -1,4 +1,4 @@
-.Dd July 20, 2019
+.Dd February 19, 2021
 .Dt SFEED 5
 .Os
 .Sh NAME
@@ -23,13 +23,13 @@ Other whitespace characters except spaces are removed.
 Control characters are removed.
 .Pp
 The order and content of the fields are:
-.Bl -tag -width 17n
+.Bl -tag -width 12n
 .It timestamp
 UNIX timestamp in UTC+0, empty if missing or on parse failure.
 .It title
 Title text, HTML code in titles is ignored and is treated as plain-text.
 .It link
-Absolute url, unsafe characters are encoded.
+Link
 .It content
 Content, can have plain-text or HTML code depending on the content-type field.
 .It content-type
@@ -40,6 +40,8 @@ RSS item GUID or Atom id.
 Item author.
 .It enclosure
 Item, first enclosure.
+.It category
+Item, categories, multiple values are separated by |.
 .El
 .Sh SEE ALSO
 .Xr sfeed 1 ,
diff --git a/sfeed.c b/sfeed.c
@@ -14,6 +14,10 @@
 #define ISINCONTENT(ctx) ((ctx).iscontent && !((ctx).iscontenttag))
 #define ISCONTENTTAG(ctx) (!((ctx).iscontent) && (ctx).iscontenttag)
+
+/* these feed fields support multiple separated values */
+#define ISFEEDFIELDMULTI(t) ((t) == FeedFieldCategory)
+
 /* string and byte-length */
 #define STRP(s) s,sizeof(s)-1
@@ -42,23 +46,27 @@ typedef struct string {
 enum TagId {
 	TagUnknown = 0,
 	/* RSS */
-	RSSTagDcdate, RSSTagPubdate,
+	RSSTagDcdate, RSSTagPubdate, /* creation date has higher priority */
 	RSSTagTitle,
 	RSSTagMediaDescription, RSSTagDescription, RSSTagContentEncoded,
 	RSSTagGuid,
+	RSSTagGuidPermalinkFalse,
+	RSSTagGuidPermalinkTrue, /* must be defined after GUID, because it can be a link (isPermaLink) */
 	RSSTagLink,
 	RSSTagEnclosure,
 	RSSTagAuthor, RSSTagDccreator,
+	RSSTagCategory,
 	/* Atom */
-	AtomTagUpdated, AtomTagPublished,
+	AtomTagUpdated, AtomTagPublished, /* creation date has higher priority */
 	AtomTagTitle,
 	AtomTagMediaDescription, AtomTagSummary, AtomTagContent,
 	AtomTagId,
 	AtomTagLink, AtomTagLinkAlternate, AtomTagLinkEnclosure,
-	AtomTagAuthor,
+	AtomTagAuthor, AtomTagAuthorName,
+	AtomTagCategory,
 	TagLast
 };
@@ -75,18 +83,19 @@ typedef struct field {
 enum {
 	FeedFieldTime = 0, FeedFieldTitle, FeedFieldLink, FeedFieldContent,
-	FeedFieldId, FeedFieldAuthor, FeedFieldEnclosure, FeedFieldLast
+	FeedFieldId, FeedFieldAuthor, FeedFieldEnclosure, FeedFieldCategory,
+	FeedFieldLast
 };
 
 typedef struct feedcontext {
 	String *field;                 /* current FeedItem field String */
 	FeedField fields[FeedFieldLast]; /* data for current item */
-	FeedTag *tag;                  /* unique current parsed tag */
+	FeedTag tag;                   /* unique current parsed tag */
 	int iscontent;                 /* in content data */
 	int iscontenttag;              /* in content tag */
 	enum ContentType contenttype;  /* content-type for item */
 	enum FeedType feedtype;
-	int attrcount;                 /* count item HTML element attributes */
+	int attrcount; /* count item HTML element attributes */
 } FeedContext;
 
 static long long datetounix(long long, int, int, int, int, int);
@@ -102,6 +111,7 @@ static void string_clear(String *);
 static void string_print_encoded(String *);
 static void string_print_timestamp(String *);
 static void string_print_trimmed(String *);
+static void string_print_trimmed_multi(String *);
 static void string_print_uri(String *);
 static void xmlattr(XMLParser *, const char *, size_t, const char *, size_t,
 		const char *, size_t);
@@ -111,7 +121,6 @@ static void xmlattrend(XMLParser *, const char *, size_t,
 		const char *, size_t);
 static void xmlattrstart(XMLParser *, const char *, size_t,
 		const char *, size_t);
-static void xmlcdata(XMLParser *, const char *, size_t);
 static void xmldata(XMLParser *, const char *, size_t);
 static void xmldataentity(XMLParser *, const char *, size_t);
 static void xmltagend(XMLParser *, const char *, size_t, int);
@@ -122,6 +131,7 @@ static void xmltagstartparsed(XMLParser *, const char *, size_t, int);
 
 /* RSS, must be alphabetical order */
 static FeedTag rsstags[] = {
 	{ STRP("author"), RSSTagAuthor },
+	{ STRP("category"), RSSTagCategory },
 	{ STRP("content:encoded"), RSSTagContentEncoded },
 	{ STRP("dc:creator"), RSSTagDccreator },
 	{ STRP("dc:date"), RSSTagDcdate },
@@ -134,10 +144,11 @@ static FeedTag rsstags[] = {
 	{ STRP("pubdate"), RSSTagPubdate },
 	{ STRP("title"), RSSTagTitle }
 };
+
 /* Atom, must be alphabetical order */
 static FeedTag atomtags[] = {
-	/* <author><name></name></author> */
 	{ STRP("author"), AtomTagAuthor },
+	{ STRP("category"), AtomTagCategory },
 	{ STRP("content"), AtomTagContent },
 	{ STRP("id"), AtomTagId },
 	/* Atom: <link href="" />, RSS has <link></link> */
@@ -148,46 +159,58 @@ static FeedTag atomtags[] = {
 	{ STRP("title"), AtomTagTitle },
 	{ STRP("updated"), AtomTagUpdated }
 };
+
+/* special case: nested <author><name> */
+static FeedTag atomtagauthor = { STRP("author"), AtomTagAuthor };
+static FeedTag atomtagauthorname = { STRP("name"), AtomTagAuthorName };
+
+/* reference to no / unknown tag */
 static FeedTag notag = { STRP(""), TagUnknown };
 
 /* map TagId type to RSS/Atom field, all tags must be defined */
 static int fieldmap[TagLast] = {
-	[TagUnknown] = -1,
+	[TagUnknown] = -1,
 	/* RSS */
-	[RSSTagDcdate] = FeedFieldTime,
-	[RSSTagPubdate] = FeedFieldTime,
-	[RSSTagTitle] = FeedFieldTitle,
-	[RSSTagMediaDescription] = FeedFieldContent,
-	[RSSTagDescription] = FeedFieldContent,
-	[RSSTagContentEncoded] = FeedFieldContent,
-	[RSSTagGuid] = FeedFieldId,
-	[RSSTagLink] = FeedFieldLink,
-	[RSSTagEnclosure] = FeedFieldEnclosure,
-	[RSSTagAuthor] = FeedFieldAuthor,
-	[RSSTagDccreator] = FeedFieldAuthor,
+	[RSSTagDcdate] = FeedFieldTime,
+	[RSSTagPubdate] = FeedFieldTime,
+	[RSSTagTitle] = FeedFieldTitle,
+	[RSSTagMediaDescription] = FeedFieldContent,
+	[RSSTagDescription] = FeedFieldContent,
+	[RSSTagContentEncoded] = FeedFieldContent,
+	[RSSTagGuid] = -1,
+	[RSSTagGuidPermalinkTrue] = FeedFieldId, /* special-case: both a link and an id */
+	[RSSTagGuidPermalinkFalse] = FeedFieldId,
+	[RSSTagLink] = FeedFieldLink,
+	[RSSTagEnclosure] = FeedFieldEnclosure,
+	[RSSTagAuthor] = FeedFieldAuthor,
+	[RSSTagDccreator] = FeedFieldAuthor,
+	[RSSTagCategory] = FeedFieldCategory,
 	/* Atom */
-	[AtomTagUpdated] = FeedFieldTime,
-	[AtomTagPublished] = FeedFieldTime,
-	[AtomTagTitle] = FeedFieldTitle,
-	[AtomTagMediaDescription] = FeedFieldContent,
-	[AtomTagSummary] = FeedFieldContent,
-	[AtomTagContent] = FeedFieldContent,
-	[AtomTagId] = FeedFieldId,
-	[AtomTagLink] = -1,
-	[AtomTagLinkAlternate] = FeedFieldLink,
-	[AtomTagLinkEnclosure] = FeedFieldEnclosure,
-	[AtomTagAuthor] = FeedFieldAuthor
+	[AtomTagUpdated] = FeedFieldTime,
+	[AtomTagPublished] = FeedFieldTime,
+	[AtomTagTitle] = FeedFieldTitle,
+	[AtomTagMediaDescription] = FeedFieldContent,
+	[AtomTagSummary] = FeedFieldContent,
+	[AtomTagContent] = FeedFieldContent,
+	[AtomTagId] = FeedFieldId,
+	[AtomTagLink] = -1,
+	[AtomTagLinkAlternate] = FeedFieldLink,
+	[AtomTagLinkEnclosure] = FeedFieldEnclosure,
+	[AtomTagAuthor] = -1,
+	[AtomTagAuthorName] = FeedFieldAuthor,
+	[AtomTagCategory] = FeedFieldCategory
 };
 
 static const int FieldSeparator = '\t';
-static const char *baseurl = "";
+/* separator for multiple values in a field, separator should be 1 byte */
+static const char *FieldMultiSeparator = "|";
+static struct uri baseuri;
+static const char *baseurl;
 
-static FeedContext ctx = { .tag = &notag };
+static FeedContext ctx;
 static XMLParser parser; /* XML parser state */
-
-static String atomlink;
-static enum TagId atomlinktype;
-static int rssidpermalink;
+static String tmpstr;
+static enum ContentType tmpcontenttype; /* content-type for item */
 
 int
 tagcmp(const void *v1, const void *v2)
@@ -222,7 +245,7 @@ gettag(enum FeedType feedtype, const char *name, size_t namelen)
 static char *
 ltrim(const char *s)
 {
-	for (; *s && isspace((unsigned char)*s); s++)
+	for (; isspace((unsigned char)*s); s++)
 		;
 	return (char *)s;
 }
@@ -309,19 +332,13 @@ string_print_encoded(String *s)
 	}
 }
 
-/* Print text, replace TABs, carriage return and other whitespace with ' '.
- * Other control chars are removed. Remove leading and trailing whitespace. */
 static void
-string_print_trimmed(String *s)
+printtrimmed(char *s)
 {
 	char *p, *e;
 
-	if (!s->data || !s->len)
-		return;
-
-	p = ltrim(s->data);
+	p = ltrim(s);
 	e = rtrim(p);
-
 	for (; *p && p != e; p++) {
 		if (isspace((unsigned char)*p))
 			putchar(' '); /* any whitespace to space */
@@ -331,23 +348,74 @@ string_print_trimmed(String *s)
 	}
 }
 
-/* always print absolute urls (using global baseurl) */
+/* Print text, replace TABs, carriage return and other whitespace with ' '.
+ * Other control chars are removed. Remove leading and trailing whitespace. */
+static void
+string_print_trimmed(String *s)
+{
+	if (!s->data || !s->len)
+		return;
+
+	printtrimmed(s->data);
+}
+
 void
-string_print_uri(String *s)
+string_print_trimmed_multi(String *s)
 {
-	char link[4096], *p, *e;
+	char *p, *e;
 	int c;
 
 	if (!s->data || !s->len)
 		return;
 
-	p = ltrim(s->data);
+	for (p = s->data; ; p = e + 1) {
+		if ((e = strstr(p, FieldMultiSeparator))) {
+			c = *e;
+			*e = '\0';
+			printtrimmed(p);
+			*e = c; /* restore NUL byte to original character */
+			fputs(FieldMultiSeparator, stdout);
+		} else {
+			printtrimmed(p);
+			break;
+		}
+	}
+}
+
+/* always print absolute URLs (using global baseurl) */
+void
+printuri(char *s)
+{
+	char link[4096], *p, *e;
+	struct uri newuri, olduri;
+	int c, r = -1;
+
+	p = ltrim(s);
+	e = rtrim(p);
+	c = *e;
+	*e = '\0';
+
+	if (baseurl && !uri_hasscheme(p) &&
+	    uri_parse(p, &olduri) != -1 && !olduri.proto[0] &&
+	    uri_makeabs(&newuri, &olduri, &baseuri) != -1 && newuri.proto[0])
+		r = uri_format(link, sizeof(link), &newuri);
+
+	if (r >= 0 && (size_t)r < sizeof(link))
+		printtrimmed(link);
+	else
+		printtrimmed(p);
+
+	*e = c; /* restore NUL byte to original character */
+}
+
+/* always print absolute URLs (using global baseurl) */
+void
+string_print_uri(String *s)
+{
+	if (!s->data || !s->len)
+		return;
+
+	printuri(s->data);
 }
 
 /* print as UNIX timestamp, print nothing if the parsed time is invalid */
@@ -443,20 +511,20 @@ gettzoffset(const char *s)
 	long tzhour = 0, tzmin = 0;
 	size_t i;
 
-	for (; *s && isspace((unsigned char)*s); s++)
+	for (; isspace((unsigned char)*s); s++)
 		;
 	switch (*s) {
 	case '-': /* offset */
 	case '+':
-		for (i = 0, p = s + 1; i < 2 && *p && isdigit((unsigned char)*p); i++, p++)
+		for (i = 0, p = s + 1; i < 2 && isdigit((unsigned char)*p); i++, p++)
 			tzhour =
(tzhour * 10) + (*p - '0'); if (*p == ':') p++; - for (i = 0; i < 2 && *p && isdigit((unsigned char)*p); i++, p++) + for (i = 0; i < 2 && isdigit((unsigned char)*p); i++, p++) tzmin = (tzmin * 10) + (*p - '0'); return ((tzhour * 3600) + (tzmin * 60)) * (s[0] == '-' ? -1 : 1); default: /* timezone name */ - for (i = 0; *s && isalpha((unsigned char)s[i]); i++) + for (i = 0; isalpha((unsigned char)s[i]); i++) ; if (i != 3) return 0; @@ -492,32 +560,35 @@ parsetime(const char *s, time_t *tp) int va[6] = { 0 }, i, j, v, vi; size_t m; - for (; *s && isspace((unsigned char)*s); s++) + for (; isspace((unsigned char)*s); s++) ; if (!isdigit((unsigned char)*s) && !isalpha((unsigned char)*s)) return -1; - if (strspn(s, "0123456789") == 4) { - /* format "%Y-%m-%d %H:%M:%S" or "%Y-%m-%dT%H:%M:%S" */ + if (isdigit((unsigned char)s[0]) && + isdigit((unsigned char)s[1]) && + isdigit((unsigned char)s[2]) && + isdigit((unsigned char)s[3])) { + /* formats "%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S" or "%Y%m%d%H%M%S" */ vi = 0; } else { /* format: "[%a, ]%d %b %Y %H:%M:%S" */ /* parse "[%a, ]%d %b %Y " part, then use time parsing as above */ - for (; *s && isalpha((unsigned char)*s); s++) + for (; isalpha((unsigned char)*s); s++) ; - for (; *s && isspace((unsigned char)*s); s++) + for (; isspace((unsigned char)*s); s++) ; if (*s == ',') s++; - for (; *s && isspace((unsigned char)*s); s++) + for (; isspace((unsigned char)*s); s++) ; - for (v = 0, i = 0; *s && i < 4 && isdigit((unsigned char)*s); s++, i++) + for (v = 0, i = 0; i < 2 && isdigit((unsigned char)*s); s++, i++) v = (v * 10) + (*s - '0'); va[2] = v; /* day */ - for (; *s && isspace((unsigned char)*s); s++) + for (; isspace((unsigned char)*s); s++) ; /* end of word month */ - for (j = 0; *s && isalpha((unsigned char)s[j]); j++) + for (j = 0; isalpha((unsigned char)s[j]); j++) ; /* check month name */ if (j < 3 || j > 9) @@ -533,22 +604,25 @@ parsetime(const char *s, time_t *tp) } if (m >= 12) return -1; /* no month found */ - for (; *s && isspace((unsigned char)*s); s++) + for (; isspace((unsigned char)*s); s++) ; - for (v = 0, i = 0; *s && i < 4 && isdigit((unsigned char)*s); s++, i++) + for (v = 0, i = 0; i < 4 && isdigit((unsigned char)*s); s++, i++) v = (v * 10) + (*s - '0'); va[0] = v; /* year */ - for (; *s && isspace((unsigned char)*s); s++) + for (; isspace((unsigned char)*s); s++) ; /* parse only regular time part, see below */ vi = 3; } - /* parse time part */ + /* parse time parts (and possibly remaining date parts) */ for (; *s && vi < 6; vi++) { - for (i = 0, v = 0; *s && i < 4 && isdigit((unsigned char)*s); s++, i++) + for (i = 0, v = 0; i < ((vi == 0) ? 
4 : 2) && + isdigit((unsigned char)*s); s++, i++) { v = (v * 10) + (*s - '0'); + } va[vi] = v; + if ((vi < 2 && *s == '-') || (vi == 2 && (*s == 'T' || isspace((unsigned char)*s))) || (vi > 2 && *s == ':')) @@ -557,7 +631,7 @@ parsetime(const char *s, time_t *tp) /* skip milliseconds in for example: "%Y-%m-%dT%H:%M:%S.000Z" */ if (*s == '.') { - for (s++; *s && isdigit((unsigned char)*s); s++) + for (s++; isdigit((unsigned char)*s); s++) ; } @@ -594,6 +668,8 @@ printfields(void) string_print_trimmed(&ctx.fields[FeedFieldAuthor].str); putchar(FieldSeparator); string_print_uri(&ctx.fields[FeedFieldEnclosure].str); + putchar(FieldSeparator); + string_print_trimmed_multi(&ctx.fields[FeedFieldCategory].str); putchar('\n'); } @@ -620,7 +696,7 @@ xmlattr(XMLParser *p, const char *t, size_t tl, const char *n, size_t nl, return; } - if (!ctx.tag->id) + if (!ctx.tag.id) return; /* content-type may be: Atom: text, xhtml, html or mime-type. @@ -631,39 +707,49 @@ xmlattr(XMLParser *p, const char *t, size_t tl, const char *n, size_t nl, isattr(v, vl, STRP("xhtml")) || isattr(v, vl, STRP("text/html")) || isattr(v, vl, STRP("text/xhtml"))) { - ctx.contenttype = ContentTypeHTML; + tmpcontenttype = ContentTypeHTML; } else if (isattr(v, vl, STRP("text")) || isattr(v, vl, STRP("plain")) || isattr(v, vl, STRP("text/plain"))) { - ctx.contenttype = ContentTypePlain; + tmpcontenttype = ContentTypePlain; } } return; } if (ctx.feedtype == FeedTypeRSS) { - if (ctx.tag->id == RSSTagEnclosure && - isattr(n, nl, STRP("url")) && ctx.field) { - string_append(ctx.field, v, vl); - } else if (ctx.tag->id == RSSTagGuid && - isattr(n, nl, STRP("ispermalink")) && - !isattr(v, vl, STRP("true"))) { - rssidpermalink = 0; + if (ctx.tag.id == RSSTagEnclosure && + isattr(n, nl, STRP("url"))) { + string_append(&tmpstr, v, vl); + } else if ((ctx.tag.id == RSSTagGuid || + ctx.tag.id == RSSTagGuidPermalinkFalse || + ctx.tag.id == RSSTagGuidPermalinkTrue) && + isattr(n, nl, STRP("ispermalink"))) { + if (isattr(v, vl, STRP("true"))) + ctx.tag.id = RSSTagGuidPermalinkTrue; + else + ctx.tag.id = RSSTagGuidPermalinkFalse; } } else if (ctx.feedtype == FeedTypeAtom) { - if (ctx.tag->id == AtomTagLink && - isattr(n, nl, STRP("rel"))) { - /* empty or "alternate": other types could be - "enclosure", "related", "self" or "via" */ - if (!vl || isattr(v, vl, STRP("alternate"))) - atomlinktype = AtomTagLinkAlternate; - else if (isattr(v, vl, STRP("enclosure"))) - atomlinktype = AtomTagLinkEnclosure; - else - atomlinktype = TagUnknown; - } else if (ctx.tag->id == AtomTagLink && - isattr(n, nl, STRP("href"))) { - string_append(&atomlink, v, vl); + if (ctx.tag.id == AtomTagLink || + ctx.tag.id == AtomTagLinkAlternate || + ctx.tag.id == AtomTagLinkEnclosure) { + if (isattr(n, nl, STRP("rel"))) { + /* empty or "alternate": other types could be + "enclosure", "related", "self" or "via" */ + if (!vl || isattr(v, vl, STRP("alternate"))) + ctx.tag.id = AtomTagLinkAlternate; + else if (isattr(v, vl, STRP("enclosure"))) + ctx.tag.id = AtomTagLinkEnclosure; + else + ctx.tag.id = AtomTagLink; /* unknown */ + } else if (ctx.tag.id != AtomTagLink && + isattr(n, nl, STRP("href"))) { + string_append(&tmpstr, v, vl); + } + } else if (ctx.tag.id == AtomTagCategory && + isattr(n, nl, STRP("term"))) { + string_append(&tmpstr, v, vl); } } } @@ -682,7 +768,7 @@ xmlattrentity(XMLParser *p, const char *t, size_t tl, const char *n, size_t nl, return; } - if (!ctx.tag->id) + if (!ctx.tag.id) return; /* try to translate entity, else just pass as data to @@ -696,35 +782,30 @@ 
xmlattrentity(XMLParser *p, const char *t, size_t tl, const char *n, size_t nl, static void xmlattrend(XMLParser *p, const char *t, size_t tl, const char *n, size_t nl) { - if (!ISINCONTENT(ctx) || ctx.contenttype != ContentTypeHTML) + if (ISINCONTENT(ctx)) { + if (ctx.contenttype == ContentTypeHTML) { + /* handles transforming inline XML to data */ + xmldata(p, "\"", 1); + ctx.attrcount = 0; + } return; - - /* handles transforming inline XML to data */ - xmldata(p, "\"", 1); - ctx.attrcount = 0; + } } static void xmlattrstart(XMLParser *p, const char *t, size_t tl, const char *n, size_t nl) { - if (!ISINCONTENT(ctx) || ctx.contenttype != ContentTypeHTML) - return; - - /* handles transforming inline XML to data */ - if (!ctx.attrcount) - xmldata(p, " ", 1); - ctx.attrcount++; - xmldata(p, n, nl); - xmldata(p, "=\"", 2); -} - -static void -xmlcdata(XMLParser *p, const char *s, size_t len) -{ - if (!ctx.field) + if (ISINCONTENT(ctx)) { + if (ctx.contenttype == ContentTypeHTML) { + /* handles transforming inline XML to data */ + if (!ctx.attrcount) + xmldata(p, " ", 1); + ctx.attrcount++; + xmldata(p, n, nl); + xmldata(p, "=\"", 2); + } return; - - string_append(ctx.field, s, len); + } } /* NOTE: this handler can be called multiple times if the data in this @@ -735,9 +816,9 @@ xmldata(XMLParser *p, const char *s, size_t len) if (!ctx.field) return; - /* add only data from <name> inside <author> tag - * or any other non-<author> tag */ - if (ctx.tag->id != AtomTagAuthor || istag(p->tag, p->taglen, STRP("name"))) + if (ISFEEDFIELDMULTI(fieldmap[ctx.tag.id])) + string_append(&tmpstr, s, len); + else string_append(ctx.field, s, len); } @@ -761,11 +842,11 @@ xmldataentity(XMLParser *p, const char *data, size_t datalen) static void xmltagstart(XMLParser *p, const char *t, size_t tl) { - enum TagId tagid; + FeedTag *f; if (ISINCONTENT(ctx)) { - ctx.attrcount = 0; if (ctx.contenttype == ContentTypeHTML) { + ctx.attrcount = 0; xmldata(p, "<", 1); xmldata(p, t, tl); } @@ -781,73 +862,87 @@ xmltagstart(XMLParser *p, const char *t, size_t tl) return; } - /* field tagid already set, nested tags are not allowed: return */ - if (ctx.tag->id) - return; + /* field tagid already set or nested tags. 
*/ + if (ctx.tag.id) { + /* nested <author><name> for Atom */ + if (ctx.tag.id == AtomTagAuthor && + istag(t, tl, STRP("name"))) { + memcpy(&(ctx.tag), &atomtagauthorname, sizeof(ctx.tag)); + } else { + return; /* other nested tags are not allowed: return */ + } + } /* in item */ - if (!(ctx.tag = gettag(ctx.feedtype, t, tl))) - ctx.tag = &notag; - tagid = ctx.tag->id; - - /* without a rel attribute the default link type is "alternate" */ - if (tagid == AtomTagLink) { - atomlinktype = AtomTagLinkAlternate; - string_clear(&atomlink); /* reuse and clear temporary link */ - } else if (tagid == RSSTagGuid) { + if (ctx.tag.id == TagUnknown) { + if (!(f = gettag(ctx.feedtype, t, tl))) + f = &notag; + memcpy(&(ctx.tag), f, sizeof(ctx.tag)); + } + + switch (ctx.tag.id) { + case AtomTagLink: + /* without a rel attribute the default link type is "alternate" */ + ctx.tag.id = AtomTagLinkAlternate; + break; + case RSSTagGuid: /* without a ispermalink attribute the default value is "true" */ - rssidpermalink = 1; + ctx.tag.id = RSSTagGuidPermalinkTrue; + break; + case RSSTagContentEncoded: + case RSSTagDescription: + tmpcontenttype = ContentTypeHTML; /* default content-type */ + break; + case RSSTagMediaDescription: + case AtomTagContent: + case AtomTagMediaDescription: + case AtomTagSummary: + tmpcontenttype = ContentTypePlain; /* default content-type */ + break; + default: + break; + } + ctx.iscontenttag = (fieldmap[ctx.tag.id] == FeedFieldContent); +} + +static void +xmltagstartparsed(XMLParser *p, const char *t, size_t tl, int isshort) +{ + enum TagId tagid; + + if (ISINCONTENT(ctx)) { + if (ctx.contenttype == ContentTypeHTML) { + if (isshort) + xmldata(p, "/>", 2); + else + xmldata(p, ">", 1); + } + return; } + tagid = ctx.tag.id; + /* map tag type to field: unknown or lesser priority is ignored, when tags of the same type are repeated only the first is used. */ if (fieldmap[tagid] == -1 || - tagid <= ctx.fields[fieldmap[tagid]].tagid) { - ctx.field = NULL; + (!ISFEEDFIELDMULTI(fieldmap[tagid]) && + tagid <= ctx.fields[fieldmap[tagid]].tagid)) { return; } - if (fieldmap[tagid] == FeedFieldContent) { - /* handle default content-type per tag, Atom, RSS, MRSS. */ - switch (tagid) { - case RSSTagContentEncoded: - case RSSTagDescription: - ctx.contenttype = ContentTypeHTML; - break; - default: - ctx.contenttype = ContentTypePlain; - } - ctx.iscontenttag = 1; - } else { - ctx.iscontenttag = 0; - } - - ctx.field = &(ctx.fields[fieldmap[tagid]].str); - ctx.fields[fieldmap[tagid]].tagid = tagid; - /* clear field */ - string_clear(ctx.field); -} - -static void -xmltagstartparsed(XMLParser *p, const char *tag, size_t taglen, int isshort) -{ if (ctx.iscontenttag) { ctx.iscontent = 1; ctx.iscontenttag = 0; - return; + ctx.contenttype = tmpcontenttype; } - /* don't read field value in Atom <link> tag */ - if (ctx.tag->id == AtomTagLink) - ctx.field = NULL; - - if (!ISINCONTENT(ctx) || ctx.contenttype != ContentTypeHTML) - return; + ctx.field = &(ctx.fields[fieldmap[tagid]].str); + ctx.fields[fieldmap[tagid]].tagid = tagid; - if (isshort) - xmldata(p, "/>", 2); - else - xmldata(p, ">", 1); + /* clear field if it is overwritten (with a priority order) for the new + value, if the field can have multiple values then do not clear it. 
*/
+	if (!ISFEEDFIELDMULTI(fieldmap[ctx.tag.id]))
+		string_clear(ctx.field);
 }
 
 static void
@@ -860,7 +955,7 @@ xmltagend(XMLParser *p, const char *t, size_t tl, int isshort)
 	if (ISINCONTENT(ctx)) {
 		/* not close content field */
-		if (!istag(ctx.tag->name, ctx.tag->len, t, tl)) {
+		if (!istag(ctx.tag.name, ctx.tag.len, t, tl)) {
 			if (!isshort && ctx.contenttype == ContentTypeHTML) {
 				xmldata(p, "</", 2);
 				xmldata(p, t, tl);
@@ -868,23 +963,18 @@ xmltagend(XMLParser *p, const char *t, size_t tl, int isshort)
 			}
 			return;
 		}
-	} else if (ctx.tag->id == AtomTagLink) {
-		/* map tag type to field: unknown or lesser priority is ignored,
-		   when tags of the same type are repeated only the first is used. */
-		if (atomlinktype && atomlinktype > ctx.fields[fieldmap[atomlinktype]].tagid) {
-			string_append(&ctx.fields[fieldmap[atomlinktype]].str,
-			              atomlink.data, atomlink.len);
-			ctx.fields[fieldmap[atomlinktype]].tagid = atomlinktype;
-		}
-	} else if (ctx.tag->id == RSSTagGuid && rssidpermalink) {
-		if (ctx.tag->id > ctx.fields[FeedFieldLink].tagid) {
+	} else if (ctx.tag.id && istag(ctx.tag.name, ctx.tag.len, t, tl)) {
+		/* matched tag end: close it */
+		/* copy also to the link field if the attribute isPermaLink="true"
+		   and it is not set by a tag with higher prio. */
+		if (ctx.tag.id == RSSTagGuidPermalinkTrue && ctx.field &&
+		    ctx.tag.id > ctx.fields[FeedFieldLink].tagid) {
 			string_clear(&ctx.fields[FeedFieldLink].str);
 			string_append(&ctx.fields[FeedFieldLink].str,
-			              ctx.fields[FeedFieldId].str.data,
-			              ctx.fields[FeedFieldId].str.len);
-			ctx.fields[FeedFieldLink].tagid = ctx.tag->id;
+			              ctx.field->data, ctx.field->len);
+			ctx.fields[FeedFieldLink].tagid = ctx.tag.id;
 		}
-	} else if (!ctx.tag->id && ((ctx.feedtype == FeedTypeAtom &&
+	} else if (!ctx.tag.id && ((ctx.feedtype == FeedTypeAtom &&
 	           istag(t, tl, STRP("entry"))) || /* Atom */
 	          (ctx.feedtype == FeedTypeRSS &&
 	           istag(t, tl, STRP("item"))))) /* RSS */
@@ -900,14 +990,33 @@ xmltagend(XMLParser *p, const char *t, size_t tl, int isshort)
 		ctx.contenttype = ContentTypeNone;
 		/* allow parsing of Atom and RSS concatenated in one XML stream. */
 		ctx.feedtype = FeedTypeNone;
-	} else if (!ctx.tag->id ||
-	           !istag(ctx.tag->name, ctx.tag->len, t, tl)) {
-		/* not end of field */
-		return;
+	} else {
+		return; /* not end of field */
+	}
+
+	/* temporary string: for fields that cannot be processed
+	   directly and need more context, for example by its tag
+	   attributes, like the Atom link rel="alternate|enclosure". 
*/ + if (tmpstr.len && ctx.field) { + if (ISFEEDFIELDMULTI(fieldmap[ctx.tag.id])) { + if (ctx.field->len) + string_append(ctx.field, FieldMultiSeparator, 1); + string_append(ctx.field, tmpstr.data, tmpstr.len); + } else { + string_clear(ctx.field); + string_append(ctx.field, tmpstr.data, tmpstr.len); + } } + /* close field */ + string_clear(&tmpstr); /* reuse and clear temporary string */ + + if (ctx.tag.id == AtomTagAuthorName) + memcpy(&(ctx.tag), &atomtagauthor, sizeof(ctx.tag)); /* outer tag */ + else + memcpy(&(ctx.tag), &notag, sizeof(ctx.tag)); + ctx.iscontent = 0; - ctx.tag = &notag; ctx.field = NULL; } @@ -917,14 +1026,20 @@ main(int argc, char *argv[]) if (pledge("stdio", NULL) == -1) err(1, "pledge"); - if (argc > 1) - baseurl = argv[1]; + if (argc > 1) { + if (uri_parse(argv[1], &baseuri) != -1 && baseuri.proto[0]) + baseurl = argv[1]; + else + errx(1, "baseurl incorrect or too long"); + } + + memcpy(&(ctx.tag), &notag, sizeof(ctx.tag)); parser.xmlattr = xmlattr; parser.xmlattrentity = xmlattrentity; parser.xmlattrend = xmlattrend; parser.xmlattrstart = xmlattrstart; - parser.xmlcdata = xmlcdata; + parser.xmlcdata = xmldata; parser.xmldata = xmldata; parser.xmldataentity = xmldataentity; parser.xmltagend = xmltagend; diff --git a/sfeed_atom.c b/sfeed_atom.c @@ -7,6 +7,7 @@ #include "util.h" +static struct tm tmnow; static time_t now; static char *line; static size_t linesize; @@ -47,12 +48,6 @@ printfeed(FILE *fp, const char *feedname) line[--linelen] = '\0'; parseline(line, fields); - parsedtime = 0; - if (strtotime(fields[FieldUnixTimestamp], &parsedtime)) - continue; - if (!(tm = localtime(&parsedtime))) - err(1, "localtime"); - fputs("<entry>\n\t<title>", stdout); if (feedname[0]) { fputs("[", stdout); @@ -78,9 +73,15 @@ printfeed(FILE *fp, const char *feedname) xmlencode(fields[FieldEnclosure], stdout); fputs("\" />\n", stdout); } + + parsedtime = 0; + if (strtotime(fields[FieldUnixTimestamp], &parsedtime) || + !(tm = gmtime(&parsedtime))) + tm = &tmnow; fprintf(stdout, "\t<updated>%04d-%02d-%02dT%02d:%02d:%02dZ</updated>\n", tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday, tm->tm_hour, tm->tm_min, tm->tm_sec); + if (fields[FieldAuthor][0]) { fputs("\t<author><name>", stdout); xmlencode(fields[FieldAuthor], stdout); @@ -110,18 +111,13 @@ main(int argc, char *argv[]) char *name; int i; - if (argc == 1) { - if (pledge("stdio", NULL) == -1) - err(1, "pledge"); - } else { - if (pledge("stdio rpath", NULL) == -1) - err(1, "pledge"); - } + if (pledge(argc == 1 ? "stdio" : "stdio rpath", NULL) == -1) + err(1, "pledge"); if ((now = time(NULL)) == -1) err(1, "time"); - if (!(tm = gmtime(&now))) - err(1, "gmtime"); + if (!(tm = gmtime_r(&now, &tmnow))) + err(1, "gmtime_r"); fputs("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" "<feed xmlns=\"http://www.w3.org/2005/Atom\">\n" diff --git a/sfeed_frames.c b/sfeed_frames.c @@ -8,7 +8,7 @@ #include "util.h" -static struct feed **feeds; +static struct feed *feeds; static char *line; static size_t linesize; static time_t comparetime; @@ -20,7 +20,7 @@ printfeed(FILE *fpitems, FILE *fpin, struct feed *f) char *fields[FieldLast]; ssize_t linelen; unsigned int isnew; - struct tm *tm; + struct tm rtm, *tm; time_t parsedtime; /* menu if not unnamed */ @@ -41,19 +41,20 @@ printfeed(FILE *fpitems, FILE *fpin, struct feed *f) parseline(line, fields); parsedtime = 0; - if (strtotime(fields[FieldUnixTimestamp], &parsedtime)) - continue; - if (!(tm = localtime(&parsedtime))) - err(1, "localtime"); - - isnew = (parsedtime >= comparetime) ? 
1 : 0; - totalnew += isnew; - f->totalnew += isnew; + if (!strtotime(fields[FieldUnixTimestamp], &parsedtime) && + (tm = localtime_r(&parsedtime, &rtm))) { + isnew = (parsedtime >= comparetime) ? 1 : 0; + totalnew += isnew; + f->totalnew += isnew; + fprintf(fpitems, "%04d-%02d-%02d&nbsp;%02d:%02d ", + tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday, + tm->tm_hour, tm->tm_min); + } else { + isnew = 0; + fputs(" ", fpitems); + } f->total++; - fprintf(fpitems, "%04d-%02d-%02d&nbsp;%02d:%02d ", - tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday, - tm->tm_hour, tm->tm_min); if (fields[FieldLink][0]) { fputs("<a href=\"", fpitems); xmlencode(fields[FieldLink], fpitems); @@ -82,7 +83,7 @@ main(int argc, char *argv[]) if (pledge("stdio rpath wpath cpath", NULL) == -1) err(1, "pledge"); - if (!(feeds = calloc(argc, sizeof(struct feed *)))) + if (!(feeds = calloc(argc, sizeof(struct feed)))) err(1, "calloc"); if ((comparetime = time(NULL)) == -1) @@ -111,20 +112,16 @@ main(int argc, char *argv[]) "<body class=\"frame\"><div id=\"items\">", fpitems); if (argc == 1) { - if (!(feeds[0] = calloc(1, sizeof(struct feed)))) - err(1, "calloc"); - feeds[0]->name = ""; - printfeed(fpitems, stdin, feeds[0]); + feeds[0].name = ""; + printfeed(fpitems, stdin, &feeds[0]); } else { for (i = 1; i < argc; i++) { - if (!(feeds[i - 1] = calloc(1, sizeof(struct feed)))) - err(1, "calloc"); name = ((name = strrchr(argv[i], '/'))) ? name + 1 : argv[i]; - feeds[i - 1]->name = name; + feeds[i - 1].name = name; if (!(fp = fopen(argv[i], "r"))) err(1, "fopen: %s", argv[i]); - printfeed(fpitems, fp, feeds[i - 1]); + printfeed(fpitems, fp, &feeds[i - 1]); if (ferror(fp)) err(1, "ferror: %s", argv[i]); fclose(fp); @@ -142,7 +139,7 @@ main(int argc, char *argv[]) "<body class=\"frame\">\n<div id=\"sidebar\">\n", fpmenu); for (i = 1; i < argc; i++) { - f = feeds[i - 1]; + f = &feeds[i - 1]; if (f->totalnew) fputs("<a class=\"n\" href=\"items.html#", fpmenu); else diff --git a/sfeed_gopher.1 b/sfeed_gopher.1 @@ -1,4 +1,4 @@ -.Dd March 15, 2020 +.Dd January 3, 2021 .Dt SFEED_GOPHER 1 .Os .Sh NAME @@ -46,6 +46,11 @@ The default is "70". .El .Sh EXIT STATUS .Ex -std +.Sh EXAMPLES +.Bd -literal +SFEED_GOPHER_HOST="codemadness.org" SFEED_GOPHER_PATH="/feeds/" \\ + sfeed_gopher ~/.sfeed/feeds/* +.Ed .Sh SEE ALSO .Xr sfeed 1 , .Xr sfeed 5 diff --git a/sfeed_gopher.c b/sfeed_gopher.c @@ -28,7 +28,7 @@ gophertext(FILE *fp, const char *s) fputs(" ", fp); break; default: - fputc(*s, fp); + putc(*s, fp); break; } } @@ -38,10 +38,11 @@ static void printfeed(FILE *fpitems, FILE *fpin, struct feed *f) { struct uri u; - char *fields[FieldLast], *itemhost, *itemport, *itempath; + char *fields[FieldLast]; + char *itemhost, *itemport, *itempath, *itemquery, *itemfragment; ssize_t linelen; unsigned int isnew; - struct tm *tm; + struct tm rtm, *tm; time_t parsedtime; int itemtype; @@ -55,29 +56,24 @@ printfeed(FILE *fpitems, FILE *fpin, struct feed *f) line[--linelen] = '\0'; parseline(line, fields); - parsedtime = 0; - if (strtotime(fields[FieldUnixTimestamp], &parsedtime)) - continue; - if (!(tm = localtime(&parsedtime))) - err(1, "localtime"); - - isnew = (parsedtime >= comparetime) ? 
1 : 0; - f->totalnew += isnew; - f->total++; - itemhost = host; itemport = port; itemtype = 'i'; itempath = fields[FieldLink]; + itemquery = ""; + itemfragment = ""; if (fields[FieldLink][0]) { itemtype = 'h'; + /* if it's a gopher URL then change it into a direntry */ if (!strncmp(fields[FieldLink], "gopher://", 9) && - parseuri(fields[FieldLink], &u, 0) != -1) { + uri_parse(fields[FieldLink], &u) != -1) { itemhost = u.host; itemport = u.port[0] ? u.port : "70"; itemtype = '1'; itempath = u.path; + itemquery = u.query; + itemfragment = u.fragment; if (itempath[0] == '/') { itempath++; @@ -89,16 +85,35 @@ printfeed(FILE *fpitems, FILE *fpin, struct feed *f) } } - fprintf(fpitems, "%c%c %04d-%02d-%02d %02d:%02d ", - itemtype, - isnew ? 'N' : ' ', - tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday, - tm->tm_hour, tm->tm_min); + parsedtime = 0; + if (!strtotime(fields[FieldUnixTimestamp], &parsedtime) && + (tm = localtime_r(&parsedtime, &rtm))) { + isnew = (parsedtime >= comparetime) ? 1 : 0; + f->totalnew += isnew; + + fprintf(fpitems, "%c%c %04d-%02d-%02d %02d:%02d ", + itemtype, + isnew ? 'N' : ' ', + tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday, + tm->tm_hour, tm->tm_min); + } else { + fprintf(fpitems, "%c ", itemtype); + } + f->total++; + gophertext(fpitems, fields[FieldTitle]); fputs("\t", fpitems); if (itemtype == 'h' && fields[FieldLink] == itempath) fputs("URL:", fpitems); gophertext(fpitems, itempath); + if (itemquery[0]) { + fputs("?", fpitems); + gophertext(fpitems, itemquery); + } + if (itemfragment[0]) { + fputs("#", fpitems); + gophertext(fpitems, itemfragment); + } fprintf(fpitems, "\t%s\t%s\r\n", itemhost, itemport); } fputs(".\r\n", fpitems); @@ -111,8 +126,17 @@ main(int argc, char *argv[]) char *name, *p, path[PATH_MAX + 1]; int i, r; - if (pledge(argc == 1 ? "stdio" : "stdio rpath wpath cpath", NULL) == -1) - err(1, "pledge"); + if (argc == 1) { + if (pledge("stdio", NULL) == -1) + err(1, "pledge"); + } else { + if (unveil("/", "r") == -1) + err(1, "unveil"); + if (unveil(".", "rwc") == -1) + err(1, "unveil"); + if (pledge("stdio rpath wpath cpath", NULL) == -1) + err(1, "pledge"); + } if ((comparetime = time(NULL)) == -1) err(1, "time"); diff --git a/sfeed_html.c b/sfeed_html.c @@ -8,7 +8,7 @@ #include "util.h" -static struct feed **feeds; +static struct feed *feeds; static int showsidebar; static char *line; static size_t linesize; @@ -19,7 +19,7 @@ static void printfeed(FILE *fp, struct feed *f) { char *fields[FieldLast]; - struct tm *tm; + struct tm rtm, *tm; time_t parsedtime; unsigned int isnew; ssize_t linelen; @@ -41,20 +41,21 @@ printfeed(FILE *fp, struct feed *f) parseline(line, fields); parsedtime = 0; - if (strtotime(fields[FieldUnixTimestamp], &parsedtime)) - continue; - if (!(tm = localtime(&parsedtime))) - err(1, "localtime"); - - isnew = (parsedtime >= comparetime) ? 1 : 0; - totalnew += isnew; - f->totalnew += isnew; + if (!strtotime(fields[FieldUnixTimestamp], &parsedtime) && + (tm = localtime_r(&parsedtime, &rtm))) { + isnew = (parsedtime >= comparetime) ? 
1 : 0; + totalnew += isnew; + f->totalnew += isnew; + + fprintf(stdout, "%04d-%02d-%02d&nbsp;%02d:%02d ", + tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday, + tm->tm_hour, tm->tm_min); + } else { + isnew = 0; + fputs(" ", stdout); + } f->total++; - fprintf(stdout, "%04d-%02d-%02d&nbsp;%02d:%02d ", - tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday, - tm->tm_hour, tm->tm_min); - if (fields[FieldLink][0]) { fputs("<a href=\"", stdout); xmlencode(fields[FieldLink], stdout); @@ -83,7 +84,7 @@ main(int argc, char *argv[]) if (pledge(argc == 1 ? "stdio" : "stdio rpath", NULL) == -1) err(1, "pledge"); - if (!(feeds = calloc(argc, sizeof(struct feed *)))) + if (!(feeds = calloc(argc, sizeof(struct feed)))) err(1, "calloc"); if ((comparetime = time(NULL)) == -1) err(1, "time"); @@ -106,21 +107,17 @@ main(int argc, char *argv[]) fputs("\t\t<div id=\"items\" class=\"nosidebar\">\n", stdout); if (argc == 1) { - if (!(feeds[0] = calloc(1, sizeof(struct feed)))) - err(1, "calloc"); - feeds[0]->name = ""; - printfeed(stdin, feeds[0]); + feeds[0].name = ""; + printfeed(stdin, &feeds[0]); if (ferror(stdin)) err(1, "ferror: <stdin>:"); } else { for (i = 1; i < argc; i++) { - if (!(feeds[i - 1] = calloc(1, sizeof(struct feed)))) - err(1, "calloc"); name = ((name = strrchr(argv[i], '/'))) ? name + 1 : argv[i]; - feeds[i - 1]->name = name; + feeds[i - 1].name = name; if (!(fp = fopen(argv[i], "r"))) err(1, "fopen: %s", argv[i]); - printfeed(fp, feeds[i - 1]); + printfeed(fp, &feeds[i - 1]); if (ferror(fp)) err(1, "ferror: %s", argv[i]); fclose(fp); @@ -132,7 +129,7 @@ main(int argc, char *argv[]) fputs("\t<div id=\"sidebar\">\n\t\t<ul>\n", stdout); for (i = 1; i < argc; i++) { - f = feeds[i - 1]; + f = &feeds[i - 1]; if (f->totalnew > 0) fputs("<li class=\"n\"><a href=\"#", stdout); else diff --git a/sfeed_mbox.c b/sfeed_mbox.c @@ -9,7 +9,7 @@ static char *line; static size_t linesize; -static char host[256], *user, mtimebuf[32]; +static char host[256], *user, dtimebuf[32], mtimebuf[32]; static unsigned long djb2(unsigned char *s, unsigned long hash) @@ -36,17 +36,18 @@ printfeed(FILE *fp, const char *feedname) hash = djb2((unsigned char *)line, 5381UL); parseline(line, fields); - parsedtime = 0; - if (strtotime(fields[FieldUnixTimestamp], &parsedtime)) - continue; - if (!(tm = gmtime(&parsedtime))) - err(1, "gmtime"); - if (!strftime(timebuf, sizeof(timebuf), "%a, %d %b %Y %H:%M:%S +0000", tm)) - errx(1, "strftime"); - /* mbox + mail header */ printf("From MAILER-DAEMON %s\n", mtimebuf); - printf("Date: %s\n", timebuf); + + parsedtime = 0; + if (!strtotime(fields[FieldUnixTimestamp], &parsedtime) && + (tm = gmtime(&parsedtime)) && + strftime(timebuf, sizeof(timebuf), "%a, %d %b %Y %H:%M:%S +0000", tm)) { + printf("Date: %s\n", timebuf); + } else { + printf("Date: %s\n", dtimebuf); /* invalid/missing: use current time */ + } + printf("From: %s <sfeed@>\n", fields[FieldAuthor][0] ? 
fields[FieldAuthor] : feedname); printf("To: %s <%s@%s>\n", user, user, host); printf("Subject: %s\n", fields[FieldTitle]); @@ -68,8 +69,8 @@ printfeed(FILE *fp, const char *feedname) int main(int argc, char *argv[]) { - struct tm tm; - time_t t; + struct tm tmnow; + time_t now; FILE *fp; char *name; int i; @@ -81,11 +82,13 @@ main(int argc, char *argv[]) user = "you"; if (gethostname(host, sizeof(host)) == -1) err(1, "gethostname"); - if ((t = time(NULL)) == -1) + if ((now = time(NULL)) == -1) err(1, "time"); - if (!gmtime_r(&t, &tm)) + if (!gmtime_r(&now, &tmnow)) err(1, "gmtime_r: can't get current time"); - if (!strftime(mtimebuf, sizeof(mtimebuf), "%a %b %d %H:%M:%S %Y", &tm)) + if (!strftime(mtimebuf, sizeof(mtimebuf), "%a %b %d %H:%M:%S %Y", &tmnow)) + errx(1, "strftime: can't format current time"); + if (!strftime(dtimebuf, sizeof(dtimebuf), "%a, %d %b %Y %H:%M:%S +0000", &tmnow)) errx(1, "strftime: can't format current time"); if (argc == 1) { diff --git a/sfeed_opml_export b/sfeed_opml_export @@ -22,7 +22,7 @@ loadconfig() { fi } -# override feeds function to ouput opml XML. +# override feed function to output OPML XML. # feed(name, feedurl, [basesiteurl], [encoding]) feed() { # NOTE: TABs in field values are unsupported, be sane. diff --git a/sfeed_opml_export.1 b/sfeed_opml_export.1 @@ -1,4 +1,4 @@ -.Dd March 15, 2020 +.Dd January 3, 2021 .Dt SFEED_OPML_EXPORT 1 .Os .Sh NAME @@ -24,6 +24,10 @@ section for more information. .El .Sh EXIT STATUS .Ex -std +.Sh EXAMPLES +.Bd -literal +sfeed_opml_export ~/.sfeed/sfeedrc > opmlfile.xml +.Ed .Sh SEE ALSO .Xr sfeed_update 1 , .Xr sfeedrc 5 diff --git a/sfeed_opml_import.1 b/sfeed_opml_import.1 @@ -1,9 +1,9 @@ -.Dd March 15, 2020 +.Dd January 15, 2021 .Dt SFEED_OPML_IMPORT 1 .Os .Sh NAME .Nm sfeed_opml_import -.Nd generate a sfeedrc config file from an OPML file +.Nd generate a sfeedrc config file from an OPML subscription list .Sh SYNOPSIS .Nm .Sh DESCRIPTION @@ -11,6 +11,10 @@ reads the OPML XML data from stdin and writes the config file text to stdout. 
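
As the sfeed_opml_import.c hunks below show, the tool walks each OPML <outline>
element (attribute names are matched case-insensitively), prefers the title
attribute over text for the feed name, and emits one feed line per outline that
carries an xmlUrl. A minimal sketch of the mapping, with a hypothetical input
file:

    <opml version="1.0">
      <body>
        <outline text="codemadness" type="rss"
                 xmlUrl="https://codemadness.org/atom.xml"/>
      </body>
    </opml>

becomes, in the generated sfeedrc:

    feed 'codemadness' 'https://codemadness.org/atom.xml'
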
.Sh EXIT STATUS .Ex -std +.Sh EXAMPLES +.Bd -literal +sfeed_opml_import < opmlfile.xml > ~/.sfeed/sfeedrc +.Ed .Sh SEE ALSO .Xr sfeed_update 1 , .Xr sfeedrc 5 diff --git a/sfeed_opml_import.c b/sfeed_opml_import.c @@ -7,7 +7,7 @@ #include "xml.h" static XMLParser parser; /* XML parser state */ -static char url[2048], text[256], title[256]; +static char text[256], title[256], xmlurl[2048]; static void printsafe(const char *s) @@ -27,7 +27,7 @@ printsafe(const char *s) static void xmltagstart(XMLParser *p, const char *t, size_t tl) { - url[0] = text[0] = title[0] = '\0'; + text[0] = title[0] = xmlurl[0] = '\0'; } static void @@ -36,8 +36,9 @@ xmltagend(XMLParser *p, const char *t, size_t tl, int isshort) if (strcasecmp(t, "outline")) return; - if (url[0]) { + if (xmlurl[0]) { fputs("\tfeed '", stdout); + /* prefer title over text attribute */ if (title[0]) printsafe(title); else if (text[0]) @@ -45,11 +46,11 @@ xmltagend(XMLParser *p, const char *t, size_t tl, int isshort) else fputs("unnamed", stdout); fputs("' '", stdout); - printsafe(url); + printsafe(xmlurl); fputs("'\n", stdout); } - url[0] = text[0] = title[0] = '\0'; + text[0] = title[0] = xmlurl[0] = '\0'; } static void @@ -59,12 +60,12 @@ xmlattr(XMLParser *p, const char *t, size_t tl, const char *n, size_t nl, if (strcasecmp(t, "outline")) return; - if (!strcasecmp(n, "title")) - strlcat(title, v, sizeof(title)); - else if (!strcasecmp(n, "text")) + if (!strcasecmp(n, "text")) strlcat(text, v, sizeof(text)); + else if (!strcasecmp(n, "title")) + strlcat(title, v, sizeof(title)); else if (!strcasecmp(n, "xmlurl")) - strlcat(url, v, sizeof(url)); + strlcat(xmlurl, v, sizeof(xmlurl)); } static void diff --git a/sfeed_plain.1 b/sfeed_plain.1 @@ -1,4 +1,4 @@ -.Dd March 15, 2020 +.Dd January 1, 2021 .Dt SFEED_PLAIN 1 .Os .Sh NAME @@ -40,6 +40,10 @@ and .Xr wcwidth 3 . 
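
The wcwidth-based padding matters because alignment is computed in terminal
columns, not bytes. A small sketch of the helper in use, assuming util.h from
this tree and a UTF-8 locale (the feed names are made up):

    #include <locale.h>
    #include <stdio.h>

    #include "util.h"

    int
    main(void)
    {
            setlocale(LC_CTYPE, ""); /* required for mbtowc()/wcwidth() */
            /* pad to 15 columns, as sfeed_plain pads feed names */
            printutf8pad(stdout, "codemadness", 15, ' ');
            puts("|");
            printutf8pad(stdout, "bücher", 15, ' ');
            puts("|");
            return 0;
    }

Both "|" characters line up even though "bücher" is seven bytes but only six
columns wide.
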
.Sh EXIT STATUS .Ex -std +.Sh EXAMPLES +.Bd -literal +curl -s 'https://codemadness.org/atom.xml' | sfeed | sfeed_plain +.Ed .Sh SEE ALSO .Xr sfeed 1 , .Xr sfeed_html 1 , diff --git a/sfeed_plain.c b/sfeed_plain.c @@ -16,7 +16,7 @@ static void printfeed(FILE *fp, const char *feedname) { char *fields[FieldLast]; - struct tm *tm; + struct tm rtm, *tm; time_t parsedtime; ssize_t linelen; @@ -26,19 +26,19 @@ printfeed(FILE *fp, const char *feedname) parseline(line, fields); parsedtime = 0; - if (strtotime(fields[FieldUnixTimestamp], &parsedtime)) - continue; - if (!(tm = localtime(&parsedtime))) - err(1, "localtime"); - - if (parsedtime >= comparetime) - fputs("N ", stdout); - else - fputs(" ", stdout); + if (!strtotime(fields[FieldUnixTimestamp], &parsedtime) && + (tm = localtime_r(&parsedtime, &rtm))) { + if (parsedtime >= comparetime) + fputs("N ", stdout); + else + fputs(" ", stdout); + fprintf(stdout, "%04d-%02d-%02d %02d:%02d ", + tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday, + tm->tm_hour, tm->tm_min); + } else { + fputs(" ", stdout); + } - fprintf(stdout, "%04d-%02d-%02d %02d:%02d ", - tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday, - tm->tm_hour, tm->tm_min); if (feedname[0]) { printutf8pad(stdout, feedname, 15, ' '); fputs(" ", stdout); diff --git a/sfeed_twtxt.c b/sfeed_twtxt.c @@ -24,14 +24,14 @@ printfeed(FILE *fp, const char *feedname) parseline(line, fields); parsedtime = 0; - if (strtotime(fields[FieldUnixTimestamp], &parsedtime)) - continue; - if (!(tm = gmtime(&parsedtime))) - err(1, "gmtime"); - - fprintf(stdout, "%04d-%02d-%02dT%02d:%02d:%02dZ\t", - tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday, - tm->tm_hour, tm->tm_min, tm->tm_sec); + if (!strtotime(fields[FieldUnixTimestamp], &parsedtime) && + (tm = gmtime(&parsedtime))) { + fprintf(stdout, "%04d-%02d-%02dT%02d:%02d:%02dZ\t", + tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday, + tm->tm_hour, tm->tm_min, tm->tm_sec); + } else { + fputs("\t", stdout); + } if (feedname[0]) printf("[%s] ", feedname); fputs(fields[FieldTitle], stdout); diff --git a/sfeed_update b/sfeed_update @@ -15,7 +15,7 @@ loadconfig() { # allow to specify config via argv[1]. if [ "$1" != "" ]; then # get absolute path of config file. - config=$(readlink -f "$1") + config=$(readlink -f "$1" 2>/dev/null) else # default config location. config="$HOME/.sfeed/sfeedrc" @@ -25,7 +25,7 @@ loadconfig() { if [ -r "${config}" ]; then . "${config}" else - echo "Configuration file \"${config}\" does not exist or is not readable." >&2 + echo "Configuration file \"$1\" does not exist or is not readable." >&2 echo "See sfeedrc.example for an example." >&2 exit 1 fi @@ -45,16 +45,22 @@ fetch() { } # convert encoding from one encoding to another. -# convertencoding(from, to) +# convertencoding(name, from, to) convertencoding() { - if [ "$1" != "" ] && [ "$2" != "" ] && [ "$1" != "$2" ]; then - iconv -cs -f "$1" -t "$2" 2> /dev/null + if [ "$2" != "" ] && [ "$3" != "" ] && [ "$2" != "$3" ]; then + iconv -cs -f "$2" -t "$3" 2> /dev/null else - # else no convert, just output + # else no convert, just output. cat fi } +# parse and convert input, by default XML to the sfeed(5) TSV format. +# parse(name, feedurl, basesiteurl) +parse() { + sfeed "$3" +} + # filter fields. # filter(name) filter() { @@ -73,25 +79,21 @@ order() { sort -t ' ' -k1rn,1 } -# fetch and parse feed. -# feed(name, feedurl, [basesiteurl], [encoding]) -feed() { - # wait until ${maxjobs} are finished: will stall the queue if an item - # is slow, but it is portable. 
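
The idiom this comment describes survives in the new feed() wrapper further
down: whenever the job counter reaches a multiple of ${maxjobs}, the script
calls wait before spawning the next background job. A standalone POSIX-sh
sketch of the pattern, with a hypothetical work() function standing in for
_feed():

    maxjobs=8
    curjobs=0
    for url in "$@"; do
            # block until the current batch of background jobs finishes
            [ $((curjobs % maxjobs)) -eq 0 ] && wait
            curjobs=$((curjobs + 1))
            work "$url" &   # hypothetical background job
    done
    wait

As the comment notes, one slow job stalls the whole batch, but a plain wait is
portable where counting individual finished children is not.
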
- [ ${signo} -ne 0 ] && return - [ $((curjobs % maxjobs)) -eq 0 ] && wait - [ ${signo} -ne 0 ] && return - curjobs=$((curjobs + 1)) - - (name="$1" - filename="$(printf '%s' "$1" | tr '/' '_')" +# internal handler to fetch and process a feed. +# _feed(name, feedurl, [basesiteurl], [encoding]) +_feed() { + name="$1" feedurl="$2" basesiteurl="$3" encoding="$4" + filename="$(printf '%s' "${name}" | tr '/' '_')" sfeedfile="${sfeedpath}/${filename}" tmpfeedfile="${sfeedtmpdir}/${filename}" + # if file does not exist yet create it. + [ -e "${sfeedfile}" ] || touch "${sfeedfile}" 2>/dev/null + if ! fetch "${name}" "${feedurl}" "${sfeedfile}" > "${tmpfeedfile}.fetch"; then log "${name}" "FAIL (FETCH)" return @@ -100,14 +102,15 @@ feed() { # try to detect encoding (if not specified). if detecting the encoding fails assume utf-8. [ "${encoding}" = "" ] && encoding=$(sfeed_xmlenc < "${tmpfeedfile}.fetch") - if ! convertencoding "${encoding}" "utf-8" < "${tmpfeedfile}.fetch" > "${tmpfeedfile}.utf8"; then + if ! convertencoding "${name}" "${encoding}" "utf-8" < "${tmpfeedfile}.fetch" > "${tmpfeedfile}.utf8"; then log "${name}" "FAIL (ENCODING)" return fi rm -f "${tmpfeedfile}.fetch" - if ! sfeed "${basesiteurl}" < "${tmpfeedfile}.utf8" > "${tmpfeedfile}.tsv"; then - log "${name}" "FAIL (CONVERT)" + # if baseurl is empty then use feedurl. + if ! parse "${name}" "${feedurl}" "${basesiteurl:-${feedurl}}" < "${tmpfeedfile}.utf8" > "${tmpfeedfile}.tsv"; then + log "${name}" "FAIL (PARSE)" return fi rm -f "${tmpfeedfile}.utf8" @@ -124,14 +127,7 @@ feed() { return fi - # if file does not exist yet "merge" with /dev/null. - if [ -e "${sfeedfile}" ]; then - oldfile="${sfeedfile}" - else - oldfile="/dev/null" - fi - - if ! merge "${name}" "${oldfile}" "${tmpfeedfile}.filter" > "${tmpfeedfile}.merge"; then + if ! merge "${name}" "${sfeedfile}" "${tmpfeedfile}.filter" > "${tmpfeedfile}.merge"; then log "${name}" "FAIL (MERGE)" return fi @@ -152,7 +148,19 @@ feed() { # OK log "${name}" "OK" - ) & +} + +# fetch and process a feed in parallel. +# feed(name, feedurl, [basesiteurl], [encoding]) +feed() { + # wait until ${maxjobs} are finished: will stall the queue if an item + # is slow, but it is portable. + [ ${signo} -ne 0 ] && return + [ $((curjobs % maxjobs)) -eq 0 ] && wait + [ ${signo} -ne 0 ] && return + curjobs=$((curjobs + 1)) + + _feed "$@" & } cleanup() { @@ -173,26 +181,30 @@ feeds() { echo "See sfeedrc.example for an example." >&2 } -# job counter. -curjobs=0 -# signal number received for parent. -signo=0 -# SIGINT: signal to interrupt parent. -trap -- "sighandler 2" "INT" -# SIGTERM: signal to terminate parent. -trap -- "sighandler 15" "TERM" -# load config file. -loadconfig "$1" -# fetch feeds and store in temporary directory. -sfeedtmpdir="$(mktemp -d '/tmp/sfeed_XXXXXX')" -# make sure path exists. -mkdir -p "${sfeedpath}" -# fetch feeds specified in config file. -feeds -# wait till all feeds are fetched (concurrently). -[ ${signo} -eq 0 ] && wait -# cleanup temporary files etc. -cleanup -# on signal SIGINT and SIGTERM exit with signal number + 128. -[ ${signo} -ne 0 ] && exit $((signo+128)) -exit 0 +main() { + # job counter. + curjobs=0 + # signal number received for parent. + signo=0 + # SIGINT: signal to interrupt parent. + trap -- "sighandler 2" "INT" + # SIGTERM: signal to terminate parent. + trap -- "sighandler 15" "TERM" + # load config file. + loadconfig "$1" + # fetch feeds and store in temporary directory. + sfeedtmpdir="$(mktemp -d '/tmp/sfeed_XXXXXX')" + # make sure path exists. 
+ mkdir -p "${sfeedpath}" + # fetch feeds specified in config file. + feeds + # wait till all feeds are fetched (concurrently). + [ ${signo} -eq 0 ] && wait + # cleanup temporary files etc. + cleanup + # on signal SIGINT and SIGTERM exit with signal number + 128. + [ ${signo} -ne 0 ] && exit $((signo+128)) + return 0 +} + +[ "${SFEED_UPDATE_INCLUDE}" = "1" ] || main "$@" diff --git a/sfeed_update.1 b/sfeed_update.1 @@ -1,4 +1,4 @@ -.Dd March 15, 2020 +.Dd February 5, 2021 .Dt SFEED_UPDATE 1 .Os .Sh NAME @@ -18,14 +18,14 @@ The directory can be changed in the .Xr sfeedrc 5 file. .Sh OPTIONS -.Bl -tag -width 17n +.Bl -tag -width Ds .It Ar sfeedrc Config file. The default is .Pa $HOME/.sfeed/sfeedrc . .El .Sh FILES READ -.Bl -tag -width 17n +.Bl -tag -width Ds .It Ar sfeedrc This file is evaluated as a shellscript in .Nm . @@ -34,7 +34,7 @@ See also the man page for a detailed description of the format and an example file. .El .Sh FILES WRITTEN -.Bl -tag -width 17n +.Bl -tag -width Ds .It feedname TAB-separated .Xr sfeed 5 @@ -44,6 +44,15 @@ The script merges new items with this file. The feedname cannot contain '/' characters, they will be replaced with '_'. .El +.Sh ENVIRONMENT VARIABLES +.Bl -tag -width Ds +.It SFEED_UPDATE_INCLUDE +When set to "1" +.Nm +can be sourced as a script, but it won't run the +.Fn main +entry-point. +.El .Sh EXIT STATUS .Ex -std .Sh EXAMPLES diff --git a/sfeed_web.1 b/sfeed_web.1 @@ -1,42 +1,41 @@ -.Dd March 15, 2020 +.Dd March 12, 2021 .Dt SFEED_WEB 1 .Os .Sh NAME .Nm sfeed_web -.Nd finds urls to feeds from a HTML webpage +.Nd finds URLs to feeds from a HTML webpage .Sh SYNOPSIS .Nm .Op Ar baseurl .Sh DESCRIPTION .Nm reads the HTML website as XML or HTML data from stdin and writes the found -urls to stdout. +URLs to stdout. .Sh OPTIONS .Bl -tag -width 8n .It Ar baseurl -Optional base url to use for found feed urls that are relative. +Optional base URL to use for found feed URLs that are relative. .El .Sh OUTPUT FORMAT -url<TAB>content\-type<newline> +url<TAB>content-type<newline> .Bl -tag -width Ds .It url -Found relative or absolute url. -If the url is relative and the +Found relative or absolute URL. +.Pp +For relative URLs if a <base href="..." /> tag is found it will be used, +otherwise if the .Ar baseurl -option is -specified then the url is made absolute. -If the url is relative and no -.Ar baseurl -option is specified then it is printed as is. -.It content\-type +option is specified then that is used, if neither are set then the relative URL +is printed. +.It content-type Usually application/atom+xml or application/rss+xml. 
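
Returning to the sfeed_update changes above: with main() factored out and the
SFEED_UPDATE_INCLUDE guard at the bottom of the script, another shellscript can
source sfeed_update as a library, run main itself and then do extra work. A
hedged sketch (the post-processing step is made up; assumes sfeed_update is in
$PATH):

    #!/bin/sh
    SFEED_UPDATE_INCLUDE="1"
    . sfeed_update
    main "$@"
    # additional post-main functionality, for example:
    sfeed_html ~/.sfeed/feeds/* > ~/.sfeed/feeds.html
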
.El .Sh EXIT STATUS .Ex -std .Sh EXAMPLES -Get urls from xkcd website: +Get URLs from xkcd website: .Bd -literal -curl -s -L http://www.xkcd.com | sfeed_web "http://www.xkcd.com/" +curl -s -L 'http://www.xkcd.com/' | sfeed_web 'http://www.xkcd.com/' .Ed .Sh SEE ALSO .Xr sfeed_update 1 , diff --git a/sfeed_web.c b/sfeed_web.c @@ -10,64 +10,115 @@ #define STRP(s) s,sizeof(s)-1 static XMLParser parser; -static int isbase, islink, isfeedlink; -static char abslink[4096], feedlink[4096], basehref[4096], feedtype[256]; +static int isbasetag, islinktag, ishrefattr, istypeattr; +static char linkhref[4096], linktype[256], basehref[4096]; static void -printfeedtype(const char *s, FILE *fp) +printvalue(const char *s) { for (; *s; s++) - if (!isspace((unsigned char)*s)) - fputc(*s, fp); + if (!iscntrl((unsigned char)*s)) + putchar(*s); } static void xmltagstart(XMLParser *p, const char *t, size_t tl) { - isbase = islink = isfeedlink = 0; + isbasetag = islinktag = 0; - if (!strcasecmp(t, "base")) - isbase = 1; - else if (!strcasecmp(t, "link")) - islink = 1; + if (!strcasecmp(t, "base")) { + isbasetag = 1; + } else if (!strcasecmp(t, "link")) { + islinktag = 1; + linkhref[0] = '\0'; + linktype[0] = '\0'; + } } static void xmltagstartparsed(XMLParser *p, const char *t, size_t tl, int isshort) { - if (!isfeedlink) + struct uri baseuri, linkuri, u; + char buf[4096]; + int r = -1; + + if (!islinktag) + return; + + if (strncasecmp(linktype, STRP("application/atom")) && + strncasecmp(linktype, STRP("application/xml")) && + strncasecmp(linktype, STRP("application/rss"))) return; - if (absuri(abslink, sizeof(abslink), feedlink, basehref) != -1) - fputs(abslink, stdout); + /* parse base URI each time: it can change. */ + if (basehref[0] && + uri_parse(linkhref, &linkuri) != -1 && !linkuri.proto[0] && + uri_parse(basehref, &baseuri) != -1 && + uri_makeabs(&u, &linkuri, &baseuri) != -1 && u.proto[0]) + r = uri_format(buf, sizeof(buf), &u); + + if (r >= 0 && (size_t)r < sizeof(buf)) + printvalue(buf); else - fputs(feedlink, stdout); + printvalue(linkhref); + putchar('\t'); - printfeedtype(feedtype, stdout); + printvalue(linktype); putchar('\n'); } static void +xmlattrstart(XMLParser *p, const char *t, size_t tl, const char *a, size_t al) +{ + ishrefattr = istypeattr = 0; + + if (!isbasetag && !islinktag) + return; + + if (!strcasecmp(a, "href")) { + ishrefattr = 1; + if (isbasetag) + basehref[0] = '\0'; + else if (islinktag) + linkhref[0] = '\0'; + } else if (!strcasecmp(a, "type") && islinktag) { + istypeattr = 1; + linktype[0] = '\0'; + } +} + +static void xmlattr(XMLParser *p, const char *t, size_t tl, const char *n, size_t nl, const char *v, size_t vl) { - if (isbase) { - if (!strcasecmp(n, "href")) - strlcpy(basehref, v, sizeof(basehref)); - } else if (islink) { - if (!strcasecmp(n, "type")) { - if (!strncasecmp(v, STRP("application/atom")) || - !strncasecmp(v, STRP("application/xml")) || - !strncasecmp(v, STRP("application/rss"))) { - isfeedlink = 1; - strlcpy(feedtype, v, sizeof(feedtype)); - } - } else if (!strcasecmp(n, "href")) { - strlcpy(feedlink, v, sizeof(feedlink)); - } + if (isbasetag && ishrefattr) { + strlcat(basehref, v, sizeof(basehref)); + } else if (islinktag) { + if (ishrefattr) + strlcat(linkhref, v, sizeof(linkhref)); + else if (istypeattr) + strlcat(linktype, v, sizeof(linktype)); } } +static void +xmlattrentity(XMLParser *p, const char *t, size_t tl, const char *a, size_t al, + const char *v, size_t vl) +{ + char buf[16]; + int len; + + if (!ishrefattr && !istypeattr) + return; + + /* try 
to translate entity, else just pass as data to + * xmlattr handler. */ + if ((len = xml_entitytostr(v, buf, sizeof(buf))) > 0) + xmlattr(p, t, tl, a, al, buf, (size_t)len); + else + xmlattr(p, t, tl, a, al, v, vl); +} + int main(int argc, char *argv[]) { @@ -78,6 +129,8 @@ main(int argc, char *argv[]) strlcpy(basehref, argv[1], sizeof(basehref)); parser.xmlattr = xmlattr; + parser.xmlattrentity = xmlattrentity; + parser.xmlattrstart = xmlattrstart; parser.xmltagstart = xmltagstart; parser.xmltagstartparsed = xmltagstartparsed; diff --git a/sfeed_xmlenc.1 b/sfeed_xmlenc.1 @@ -1,19 +1,20 @@ -.Dd March 15, 2020 +.Dd March 12, 2021 .Dt SFEED_XMLENC 1 .Os .Sh NAME .Nm sfeed_xmlenc -.Nd get text\-encoding from XML +.Nd get text-encoding from XML .Sh SYNOPSIS .Nm .Sh DESCRIPTION .Nm -reads XML data from stdin and writes the found text\-encoding to stdout. -It translates the characters to lowercase and strips control characters. +reads XML data from stdin and writes the found text-encoding to stdout. +It converts characters in the encoding-name to lowercase and strips characters +which are not allowed in an encoding-name. .Sh EXIT STATUS .Ex -std .Sh EXAMPLES -Get text\-encoding from xkcd Atom feed: +Get text-encoding from xkcd Atom feed: .Bd -literal curl -s -L http://www.xkcd.com/atom.xml | sfeed_xmlenc .Ed diff --git a/sfeed_xmlenc.c b/sfeed_xmlenc.c @@ -14,7 +14,7 @@ static void xmltagstart(XMLParser *p, const char *t, size_t tl) { /* optimization: try to find a processing instruction only at the - start of the data. */ + start of the data at the first few starting tags. */ if (tags++ > 3) exit(0); } @@ -26,11 +26,19 @@ xmlattr(XMLParser *p, const char *t, size_t tl, const char *n, size_t nl, if (strcasecmp(t, "?xml") || strcasecmp(n, "encoding")) return; - /* output lowercase, no control characters */ for (; *v; v++) { - if (!iscntrl((unsigned char)*v)) + if (isalpha((unsigned char)*v) || + isdigit((unsigned char)*v) || + *v == '.' || *v == ':' || *v == '-' || *v == '_') putchar(tolower((unsigned char)*v)); } +} + +static void +xmlattrend(XMLParser *p, const char *t, size_t tl, const char *n, size_t nl) +{ + if (strcasecmp(t, "?xml") || strcasecmp(n, "encoding")) + return; putchar('\n'); exit(0); } @@ -42,6 +50,8 @@ main(void) err(1, "pledge"); parser.xmlattr = xmlattr; + parser.xmlattrentity = xmlattr; /* no entity conversion */ + parser.xmlattrend = xmlattrend; parser.xmltagstart = xmltagstart; /* NOTE: getnext is defined in xml.h for inline optimization */ diff --git a/sfeedrc.5 b/sfeedrc.5 @@ -1,4 +1,4 @@ -.Dd February 4, 2020 +.Dd January 26, 2021 .Dt SFEEDRC 5 .Os .Sh NAME @@ -37,13 +37,13 @@ Name of the feed, this is also used as the filename for the TAB-separated feed file. The feedname cannot contain '/' characters, they will be replaced with '_'. .It Fa feedurl -Url to fetch the RSS/Atom data from, usually a HTTP or HTTPS url. +URL to fetch the RSS/Atom data from, usually a HTTP or HTTPS URL. .It Op Fa basesiteurl Baseurl of the feed links. This argument allows to fix relative item links. .Pp According to the RSS and Atom specification feeds should always have absolute -urls, however this is not always the case in practise. +URLs, however this is not always the case in practise. 
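
A compact feeds() sketch exercising these optional arguments (the second feed,
its base URL and its encoding are hypothetical):

    feeds() {
            feed 'codemadness' 'https://codemadness.org/atom.xml'
            # fix relative item links against the base URL and
            # convert the input from iso-8859-1 to UTF-8.
            feed 'example' 'https://example.org/rss.xml' 'https://example.org/' 'iso-8859-1'
    }
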
.It Op Fa encoding Feeds are decoded from this name to UTF-8, the name should be a usable character-set for the @@ -58,7 +58,7 @@ is a shellscript each function can be overridden to change its behaviour, notable functions are: .Bl -tag -width Ds .It Fn fetch "name" "url" "feedfile" -Fetch feed from url and writes data to stdout, its arguments are: +Fetch feed from URL and writes data to stdout, its arguments are: .Bl -tag -width Ds .It Fa name Specified name in configuration file (useful for logging). @@ -71,10 +71,12 @@ Used feedfile (useful for comparing modification times). By default the tool .Xr curl 1 is used. -.It Fn convertencoding "from" "to" +.It Fn convertencoding "name" "from" "to" Convert from text-encoding to another and writes it to stdout, its arguments are: .Bl -tag -width Ds +.It Fa name +Feed name. .It Fa from From text-encoding. .It Fa to @@ -84,6 +86,19 @@ To text-encoding. By default the tool .Xr iconv 1 is used. +.It Fn parse "name" "feedurl" "basesiteurl" +Parse and convert RSS/Atom XML to the +.Xr sfeed 5 +TSV format. +.Bl -tag -width Ds +.It Fa name +Name of the feed. +.It Fa feedurl +URL of the feed. +.It Fa basesiteurl +Baseurl of the feed links. +This argument allows to fix relative item links. +.El .It Fn filter "name" Filter .Xr sfeed 5 diff --git a/util.c b/util.c @@ -7,167 +7,203 @@ #include "util.h" +/* check if string has a non-empty scheme / protocol part */ int -parseuri(const char *s, struct uri *u, int rel) +uri_hasscheme(const char *s) { - const char *p = s, *b; - char *endptr = NULL; + const char *p = s; + + for (; isalpha((unsigned char)*p) || isdigit((unsigned char)*p) || + *p == '+' || *p == '-' || *p == '.'; p++) + ; + /* scheme, except if empty and starts with ":" then it is a path */ + return (*p == ':' && p != s); +} + +int +uri_parse(const char *s, struct uri *u) +{ + const char *p = s; + char *endptr; size_t i; - unsigned long l; + long l; - u->proto[0] = u->host[0] = u->path[0] = u->port[0] = '\0'; - if (!*s) - return 0; + u->proto[0] = u->userinfo[0] = u->host[0] = u->port[0] = '\0'; + u->path[0] = u->query[0] = u->fragment[0] = '\0'; - /* prefix is "//", don't read protocol, skip to domain parsing */ - if (!strncmp(p, "//", 2)) { + /* protocol-relative */ + if (*p == '/' && *(p + 1) == '/') { p += 2; /* skip "//" */ - } else { - /* protocol part */ - for (p = s; *p && (isalpha((unsigned char)*p) || isdigit((unsigned char)*p) || - *p == '+' || *p == '-' || *p == '.'); p++) - ; - if (!strncmp(p, "://", 3)) { - if ((size_t)(p - s) >= sizeof(u->proto)) - return -1; /* protocol too long */ - memcpy(u->proto, s, p - s); - u->proto[p - s] = '\0'; + goto parseauth; + } + + /* scheme / protocol part */ + for (; isalpha((unsigned char)*p) || isdigit((unsigned char)*p) || + *p == '+' || *p == '-' || *p == '.'; p++) + ; + /* scheme, except if empty and starts with ":" then it is a path */ + if (*p == ':' && p != s) { + if (*(p + 1) == '/' && *(p + 2) == '/') p += 3; /* skip "://" */ - } else { - p = s; /* no protocol format, set to start */ - /* relative url: read rest as path, else as domain */ - if (rel) - goto readpath; - } + else + p++; /* skip ":" */ + + if ((size_t)(p - s) >= sizeof(u->proto)) + return -1; /* protocol too long */ + memcpy(u->proto, s, p - s); + u->proto[p - s] = '\0'; + + if (*(p - 1) != '/') + goto parsepath; + } else { + p = s; /* no scheme format, reset to start */ + goto parsepath; + } + +parseauth: + /* userinfo (username:password) */ + i = strcspn(p, "@/?#"); + if (p[i] == '@') { + if (i >= sizeof(u->userinfo)) + return 
-1; /* userinfo too long */ + memcpy(u->userinfo, p, i); + u->userinfo[i] = '\0'; + p += i + 1; } + /* IPv6 address */ if (*p == '[') { - /* bracket not found or host too long */ - if (!(b = strchr(p, ']')) || (size_t)(b - p) < 3 || - (size_t)(b - p) >= sizeof(u->host)) + /* bracket not found, host too short or too long */ + i = strcspn(p, "]"); + if (p[i] != ']' || i < 3) return -1; - memcpy(u->host, p, b - p + 1); - u->host[b - p + 1] = '\0'; - p = b + 1; + i++; /* including "]" */ } else { /* domain / host part, skip until port, path or end. */ - if ((i = strcspn(p, ":/")) >= sizeof(u->host)) - return -1; /* host too long */ - memcpy(u->host, p, i); - u->host[i] = '\0'; - p = &p[i]; + i = strcspn(p, ":/?#"); } + if (i >= sizeof(u->host)) + return -1; /* host too long */ + memcpy(u->host, p, i); + u->host[i] = '\0'; + p += i; + /* port */ if (*p == ':') { - if ((i = strcspn(++p, "/")) >= sizeof(u->port)) + p++; + if ((i = strcspn(p, "/?#")) >= sizeof(u->port)) return -1; /* port too long */ memcpy(u->port, p, i); u->port[i] = '\0'; - /* check for valid port: range 1 - 65535 */ + /* check for valid port: range 1 - 65535, may be empty */ errno = 0; - l = strtoul(u->port, &endptr, 10); - if (errno || u->port[0] == '\0' || *endptr || - !l || l > 65535) - return -1; - p = &p[i]; - } -readpath: - if (u->host[0]) { - p = &p[strspn(p, "/")]; - strlcpy(u->path, "/", sizeof(u->path)); - } else { - /* absolute uri must have a host specified */ - if (!rel) + l = strtol(u->port, &endptr, 10); + if (i && (errno || *endptr || l <= 0 || l > 65535)) return -1; + p += i; } - /* treat truncation as an error */ - if (strlcat(u->path, p, sizeof(u->path)) >= sizeof(u->path)) - return -1; - return 0; -} -static int -encodeuri(char *buf, size_t bufsiz, const char *s) -{ - static const char *table = "0123456789ABCDEF"; - size_t i, b; +parsepath: + /* path */ + if ((i = strcspn(p, "?#")) >= sizeof(u->path)) + return -1; /* path too long */ + memcpy(u->path, p, i); + u->path[i] = '\0'; + p += i; - for (i = 0, b = 0; s[i]; i++) { - if ((unsigned char)s[i] <= ' ' || - (unsigned char)s[i] >= 127) { - if (b + 3 >= bufsiz) - return -1; - buf[b++] = '%'; - buf[b++] = table[((unsigned char)s[i] >> 4) & 15]; - buf[b++] = table[(unsigned char)s[i] & 15]; - } else if (b < bufsiz) { - buf[b++] = s[i]; - } else { - return -1; - } + /* query */ + if (*p == '?') { + p++; + if ((i = strcspn(p, "#")) >= sizeof(u->query)) + return -1; /* query too long */ + memcpy(u->query, p, i); + u->query[i] = '\0'; + p += i; + } + + /* fragment */ + if (*p == '#') { + p++; + if ((i = strlen(p)) >= sizeof(u->fragment)) + return -1; /* fragment too long */ + memcpy(u->fragment, p, i); + u->fragment[i] = '\0'; } - if (b >= bufsiz) - return -1; - buf[b] = '\0'; return 0; } -/* Get absolute uri; if `link` is relative use `base` to make it absolute. - * the returned string in `buf` is uri encoded, see: encodeuri(). */ +/* Transform and try to make the URI `u` absolute using base URI `b` into `a`. + Follows some of the logic from "RFC 3986 - 5.2.2. Transform References". + Returns 0 on success, -1 on error or truncation. 
diff --git a/util.h b/util.h
@@ -6,6 +6,7 @@
 #include <unistd.h>
 #else
 #define pledge(p1,p2) 0
+#define unveil(p1,p2) 0
 #endif

 #undef strlcat
@@ -20,22 +21,29 @@ struct feed {
 	unsigned long total;     /* total items */
 };

-/* uri */
+/* URI */
 struct uri {
-	char proto[48];
+	char proto[48];     /* scheme including ":" or "://" */
+	char userinfo[256]; /* username [:password] */
 	char host[256];
-	char path[2048];
-	char port[6];       /* numeric port */
+	char port[6];       /* numeric port */
+	char path[1024];
+	char query[1024];
+	char fragment[1024];
 };

 enum {
 	FieldUnixTimestamp = 0, FieldTitle, FieldLink, FieldContent,
-	FieldContentType, FieldId, FieldAuthor, FieldEnclosure, FieldLast
+	FieldContentType, FieldId, FieldAuthor, FieldEnclosure, FieldCategory,
+	FieldLast
 };

-int absuri(char *, size_t, const char *, const char *);
+int uri_format(char *, size_t, struct uri *);
+int uri_hasscheme(const char *);
+int uri_makeabs(struct uri *, struct uri *, struct uri *);
+int uri_parse(const char *, struct uri *);
+
 void parseline(char *, char *[FieldLast]);
-int parseuri(const char *, struct uri *, int);
 void printutf8pad(FILE *, const char *, size_t, int);
 int strtotime(const char *, time_t *);
 void xmlencode(const char *, FILE *);
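Because FieldCategory is added before FieldLast, the indices of the
existing fields are unchanged. A small hypothetical use of the
parseline() helper declared above (the TSV line is invented; the real
format is documented in sfeed(5), and fields missing from the line are
assumed to come back as empty strings):

	#include <stdio.h>

	#include "util.h"

	int
	main(void)
	{
		/* fields are TAB-separated */
		char line[] = "1302683088\tAn item title\thttps://example.org/item";
		char *fields[FieldLast];

		parseline(line, fields);
		printf("title: %s\nlink: %s\ncategory: %s\n",
		       fields[FieldTitle], fields[FieldLink],
		       fields[FieldCategory]);
		return 0;
	}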
diff --git a/xml.c b/xml.c
@@ -116,49 +116,18 @@ startvalue:
 static void
 xml_parsecomment(XMLParser *x)
 {
-	size_t datalen = 0, i = 0;
-	int c;
+	int c, i = 0;

-	if (x->xmlcommentstart)
-		x->xmlcommentstart(x);
 	while ((c = GETNEXT()) != EOF) {
-		if (c == '-' || c == '>') {
-			if (x->xmlcomment && datalen) {
-				x->data[datalen] = '\0';
-				x->xmlcomment(x, x->data, datalen);
-				datalen = 0;
-			}
-		}
-
 		if (c == '-') {
-			if (++i > 2) {
-				if (x->xmlcomment)
-					for (; i > 2; i--)
-						x->xmlcomment(x, "-", 1);
+			if (++i > 2)
 				i = 2;
-			}
 			continue;
 		} else if (c == '>' && i == 2) {
-			if (x->xmlcommentend)
-				x->xmlcommentend(x);
 			return;
 		} else if (i) {
-			if (x->xmlcomment) {
-				for (; i > 0; i--)
-					x->xmlcomment(x, "-", 1);
-			}
 			i = 0;
 		}
-
-		if (datalen < sizeof(x->data) - 1) {
-			x->data[datalen++] = c;
-		} else {
-			x->data[datalen] = '\0';
-			if (x->xmlcomment)
-				x->xmlcomment(x, x->data, datalen);
-			x->data[0] = c;
-			datalen = 1;
-		}
 	}
 }
@@ -168,8 +137,6 @@ xml_parsecdata(XMLParser *x)
 	size_t datalen = 0, i = 0;
 	int c;

-	if (x->xmlcdatastart)
-		x->xmlcdatastart(x);
 	while ((c = GETNEXT()) != EOF) {
 		if (c == ']' || c == '>') {
 			if (x->xmlcdata && datalen) {
@@ -188,8 +155,6 @@
 			}
 			continue;
 		} else if (c == '>' && i == 2) {
-			if (x->xmlcdataend)
-				x->xmlcdataend(x);
 			return;
 		} else if (i) {
 			if (x->xmlcdata)
@@ -287,7 +252,8 @@ numericentitytostr(const char *e, char *buf, size_t bufsiz)
 	else
 		l = strtol(e, &end, 10);
 	/* invalid value or not a well-formed entity or invalid code point */
-	if (errno || e == end || *end != ';' || l < 0 || l > 0x10ffff)
+	if (errno || e == end || *end != ';' || l < 0 || l > 0x10ffff ||
+	    (l >= 0xd800 && l <= 0xdfff))
 		return -1;
 	len = codepointtoutf8(l, buf);
 	buf[len] = '\0';
@@ -326,7 +292,7 @@ xml_parse(XMLParser *x)
 		if (c == '!') { /* cdata and comments */
 			for (tagdatalen = 0; (c = GETNEXT()) != EOF;) {
-				/* NOTE: sizeof(x->data) must be atleast sizeof("[CDATA[") */
+				/* NOTE: sizeof(x->data) must be at least sizeof("[CDATA[") */
 				if (tagdatalen <= sizeof("[CDATA[") - 1)
 					x->data[tagdatalen++] = c;
 				if (c == '>')
@@ -393,8 +359,6 @@ xml_parse(XMLParser *x)
 		} else {
 			/* parse tag data */
 			datalen = 0;
-			if (x->xmldatastart)
-				x->xmldatastart(x);
 			while ((c = GETNEXT()) != EOF) {
 				if (c == '&') {
 					if (datalen) {
@@ -441,8 +405,6 @@ xml_parse(XMLParser *x)
 					x->data[datalen] = '\0';
 					if (x->xmldata && datalen)
 						x->xmldata(x, x->data, datalen);
-					if (x->xmldataend)
-						x->xmldataend(x);
 					break;
 				}
 			}
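With the xmlcomment* callbacks removed, comment contents are now
consumed without being reported at all. A minimal sketch of driving the
parser after this change (hypothetical test program, not part of the
commit; with the default GETNEXT it reads from stdin):

	#include <stdio.h>

	#include "xml.h"

	static XMLParser x; /* zero-initialized: unused callbacks stay NULL */

	static void
	printdata(XMLParser *p, const char *d, size_t dlen)
	{
		(void)p;
		fwrite(d, 1, dlen, stdout);
	}

	int
	main(void)
	{
		x.xmldata = printdata;
		xml_parse(&x);
		return 0;
	}

Feeding it: printf '<p>a<!-- hidden -->b</p>' | ./a.out prints "ab";
before this change a registered xmlcomment callback would also have
received the comment contents.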
diff --git a/xml.h b/xml.h
@@ -1,5 +1,5 @@
-#ifndef _XML_H
-#define _XML_H
+#ifndef _XML_H_
+#define _XML_H_

 #include <stdio.h>

@@ -13,24 +13,17 @@ typedef struct xmlparser {
 	                 const char *, size_t);
 	void (*xmlattrentity)(struct xmlparser *, const char *, size_t,
 	                      const char *, size_t, const char *, size_t);
-	void (*xmlcdatastart)(struct xmlparser *);
 	void (*xmlcdata)(struct xmlparser *, const char *, size_t);
-	void (*xmlcdataend)(struct xmlparser *);
-	void (*xmlcommentstart)(struct xmlparser *);
-	void (*xmlcomment)(struct xmlparser *, const char *, size_t);
-	void (*xmlcommentend)(struct xmlparser *);
 	void (*xmldata)(struct xmlparser *, const char *, size_t);
-	void (*xmldataend)(struct xmlparser *);
 	void (*xmldataentity)(struct xmlparser *, const char *, size_t);
-	void (*xmldatastart)(struct xmlparser *);
 	void (*xmltagend)(struct xmlparser *, const char *, size_t, int);
 	void (*xmltagstart)(struct xmlparser *, const char *, size_t);
 	void (*xmltagstartparsed)(struct xmlparser *, const char *, size_t,
 	                          int);

 #ifndef GETNEXT
-	/* GETNEXT overridden for sfeed to reduce function call overhead and
-	   further context optimizations. */
+	/* GETNEXT overridden to reduce function call overhead and further
+	   context optimizations. */
 #define GETNEXT getchar
 #endif
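Since the GETNEXT hook is a plain macro, a compilation unit can swap the
input source without touching the parser itself. A sketch of one way to
do that (the names inputp and my_getnext are invented for the example):

	#include <stdio.h>

	static const char *inputp = "<feed><title>x</title></feed>";

	static int
	my_getnext(void)
	{
		if (*inputp == '\0')
			return EOF;
		return (unsigned char)*inputp++;
	}

	#define GETNEXT my_getnext
	#include "xml.c"

Defining GETNEXT before the #ifndef guard in xml.h is seen makes every
GETNEXT() call in the parser expand to the custom reader, which is the
function call overhead reduction the comment above refers to.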