commit 71815afcf0577ff1fb5e65fcdb3b8041c72c7623
parent 928a3ca62f748e7301fd2fa978b64c9a36f5d2d1
Author: gearsix@tuta.io <gearsix@tuta.io>
Date: Thu, 15 Apr 2021 16:31:15 +0100
squash merge from upstream
Squashed commit of the following:
commit 4e96b1f3f90c86589006070fe739ee9a6c5d06de
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sat Mar 13 13:22:10 2021 +0100
bump version to 0.9.22
commit 99a8e4deeb81480f2c0c2798256fe9c2df006ff1
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Fri Mar 12 13:11:17 2021 +0100
sfeed_web.1, sfeed_xmlenc.1: remove unneeded mdoc escape sequence
commit 317d08eee3494234e1e5f84fca0240abfdc8e2f5
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Wed Mar 3 18:12:34 2021 +0100
sfeed_update: return instead of exit in main() on success
This is useful so the script can be included, call main and then have
additional post-main functionality.
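A minimal sketch of the difference, not taken from sfeed_update: main_exit,
main_return and post_main are hypothetical names. The subshell stands in for a
script that includes the code, calls main and then wants to do more work.

```shell
# Hypothetical stand-ins for main() before and after this change.
main_exit() {
	exit 0      # terminates the calling shell as well
}

main_return() {
	return 0    # hands control back to the caller
}

# Call the given main function and then do a post-main step. The subshell
# keeps an exit from terminating the current shell.
post_main() {
	( "$1"; echo "post-main step" )
}
```

With main_return the post-main step runs; with main_exit it is never reached.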
commit ceefac3e911b84c8aa8d215a3f7ec651aaeb4800
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Tue Mar 2 13:13:19 2021 +0100
README: workaround empty fields with *BSD xargs -0
Workaround it by setting the empty "middle" fields to some value. The last
field can be empty.
Some feeds were incorrectly using the wrong base URL if the `baseurl` field was
empty but the encoding field was set. So it incorrectly used the encoding field
instead.
Only now noticed some feeds were failing because the baseURL is validated since
commit f305b032bc19b4e81c0dd6c0398370028ea910ca and returning a non-zero exit
status.
This doesn't happen with GNU xargs, busybox or toybox xargs.
Affected (at least): OpenBSD, NetBSD, FreeBSD and DragonFlyBSD xargs which share
similar code.
Simple way to reproduce the difference:
printf 'a\0\0c\0' | xargs -0 echo
Prints "a c" on *BSD.
Prints "a  c" on GNU xargs (and some other implementations).
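The workaround can be sketched as follows. This is an illustration and not the
README code: emit_fields and the default values are assumptions. Empty middle
fields get a non-empty default so *BSD xargs -0 does not drop them; only the
last field may stay empty.

```shell
# Hypothetical helper: emit four NUL-terminated fields for xargs -0.
emit_fields() {
	name="${1:-unnamed}"
	feedurl="${2:-http://}"
	baseurl="${3:-${feedurl}}"   # never empty: falls back to the feed URL
	encoding="$4"                # last field, may stay empty
	printf '%s\0%s\0%s\0%s\0' "$name" "$feedurl" "$baseurl" "$encoding"
}
```

The output can be inspected by piping it through tr '\0' '|'.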
commit f0e0326248e3a8c148eea64f77a5a1f84cf8d80b
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Mon Mar 1 22:27:11 2021 +0100
sfeed_update: fix baseurl substitution
Follow-up from a rushed commit:
commit 58555779d123be68c0acf9ea898931d656ec6d63
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sun Feb 28 13:33:21 2021 +0100
sfeed_update: simplify, use feedurl directly
This also makes it possible to use URLs without an authority component as a
baseurl, like "magnet:" URLs.
commit 16b7cc14e0681531d0d0e7c73e7c9ee903d0b1a9
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Mon Mar 1 18:50:43 2021 +0100
util.c: uri_makeabs: check initial base URI field, not dest `a` (style)
No functional difference because the base URI host is copied beforehand.
commit fef85e3c39946b50b337cdf57a4a3d530eb85ac2
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sun Feb 28 13:37:21 2021 +0100
sfeed.1: reference sfeed_update and sfeedrc
The shellscript is optional, but reference it in the documentation.
commit 58555779d123be68c0acf9ea898931d656ec6d63
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sun Feb 28 13:33:21 2021 +0100
sfeed_update: simplify, use feedurl directly
This also makes it possible to use URLs without an authority component as a
baseurl, like "magnet:" URLs.
commit f305b032bc19b4e81c0dd6c0398370028ea910ca
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Tue Feb 16 18:38:56 2021 +0100
util: improve/refactor URI parsing and formatting
Removed/rewritten the functions:
absuri, parseuri, and encodeuri() for percent-encoding.
The functions are now split separately with the following purpose:
- uri_format: format struct uri into a string.
- uri_hasscheme: quick check if a string is absolute or not.
- uri_makeabs: make a URI absolute using a base uri and the original URI.
- uri_parse: parse a string into a struct uri.
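The idea behind a scheme check can be sketched in shell. This is not the
util.c code; it follows the scheme grammar of RFC 3986 section 3.1 (a letter,
then letters, digits, "+", "-" or "."), followed by ":". The function only
shares its name with the C function for readability.

```shell
# Return 0 if the string starts with a valid scheme followed by ":".
uri_hasscheme() {
	scheme="${1%%:*}"
	# no ":" at all: nothing was stripped, so there is no scheme
	[ "$scheme" != "$1" ] || return 1
	case "$scheme" in
	""|[!A-Za-z]*|*[!A-Za-z0-9+.-]*)
		return 1 ;;
	esac
	return 0
}
```

A path such as "/path/a:b" is rejected because the part before the first ":"
contains a "/".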
The following URLs are better parsed:
- URLs with extra "/"'s in the path prepended are kept as is, no "/" is added
either for empty paths.
- URLs like "http://codemadness.org" are not changed to
"http://codemadness.org/" anymore (paths are kept as is, unless they are
non-empty and do not start with "/").
- Paths are not percent-encoded anymore.
- URLs with userinfo field (username, password) are parsed.
like: ftp://user:password@[2001:db8::7]:2121/rfc/rfc1808.txt
- URLs without an authority component like mailto:some@email.org, magnet URIs,
ISBN URIs/urn, like: urn:isbn:0-395-36341-1 are allowed and parsed correctly.
- Both local (file:///) and non-local (file://) are supported.
- Specifying a base URL with a port will now only use it when the relative URL
has no host and port set and follows RFC3986 5.2.2 more closely.
- Parsing numeric port: parse as signed long and check <= 0, empty port is
allowed.
- Parsing URIs containing query, fragment, but no path separator (/) will now
parse the component properly.
For sfeed:
- Parse the baseURI only once (no need to do it every time for making absolute
URIs).
- If a link/enclosure is absolute already or if there is no base URL specified
then just print the link directly. There have also been other small performance
improvements related to handling URIs.
References:
- https://tools.ietf.org/html/rfc3986
- Section "5.2.2. Transform References" has also been helpful.
commit 30476d22307aaa38170da5241a5d5e9864c4e76d
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Thu Feb 18 14:14:49 2021 +0100
README: combine bandwidth saving options into one section
Combine E-Tags, If-Modified-Since in one section. Also mention the curl
--compression option for typically GZIP decompression.
Note that E-Tags were broken in curl <7.73 due to a bug with "weak" e-tags.
https://github.com/curl/curl/issues/5610
From a question/feedback by e-mail from Hadrien Lacour, thanks.
commit afc604c4c9c91febde2a313f731ff59974a533d5
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Fri Feb 5 01:25:53 2021 +0100
sfeed_update: $SFEED_UPDATE_INCLUDE: be a bit more precise/pedantic
commit fee20df52a9091ee3a3efeaf3ed63b6940fb5be5
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Thu Feb 4 01:19:31 2021 +0100
sfeed.c: fix time parsing regression with non-standard date format
The commit that introduced the regression was:
commit 33c50db302957bca2a850ac8d0b960d05ee0520e
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Mon Oct 12 18:55:35 2020 +0200
simplify time parsing
Noticed on a RSS feed with the following date:
<pubDate>2021-02-03 05:13:03</pubDate>
This format is non-standard, but sfeed should support this.
A standard format would be (for Atom): 2021-02-03T05:13:03Z
Partially revert it.
commit 76274ca7980b2f71fc07b2dc59123e3f66b8fd9d
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Thu Jan 28 17:05:32 2021 +0100
README: fix xargs -P example when there are no feeds
Kind of a non-issue, but if there's a sfeedrc with no feeds then xargs will still
be executed and give an error. The xargs -r option (GNU extension) fixes this:
From the OpenBSD xargs(1) man page:
"-r Do not run the command if there are no arguments. Normally the
command is executed at least once even if there are no arguments."
Reproducible with the sfeedrc:
feeds() {
true
}
commit 438634fec934102839e05b5c587d7eb3fdf17571
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Wed Jan 27 18:04:32 2021 +0100
sfeed_update: $SFEED_UPDATE_INCLUDE: be a bit more precise/pedantic
commit 645ef7420056796e6d2716bf920b8704451912ac
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Wed Jan 27 13:07:45 2021 +0100
typofixes
commit 2f8a83288d91ea0abc2e4ebd6754513ee3ad37ec
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Mon Jan 25 19:36:41 2021 +0100
README: add an example script to reuse the sfeed_update code
This code uses the non-portable xargs -P option to more efficiently process
feeds in parallel.
commit 4d9f922c8396bada73fb0b1e318c8b947f0f606b
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Mon Jan 25 19:35:02 2021 +0100
sfeed_update: allow to reuse the code more easily as an included script
This adds a main() function. When the environment variable
$SFEED_UPDATE_INCLUDE is set then it will not execute the main handler. The
other functions are included and can be reused. This is also useful for
unit-testing.
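The guard can be sketched like this. It is a minimal illustration, not the
sfeed_update code: the main body and the wrapper name run_main_unless_included
are assumptions.

```shell
# Hypothetical stand-in for the real main handler.
main() {
	echo "main ran"
}

# Run main only when the script is executed directly. An including script
# sets SFEED_UPDATE_INCLUDE first and then calls the functions it needs.
run_main_unless_included() {
	[ -n "${SFEED_UPDATE_INCLUDE:-}" ] || main "$@"
}
```

A test script could set SFEED_UPDATE_INCLUDE=1, source the file and call the
individual functions without triggering a feed update.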
commit ddda48dac8e373f9bc0884ffeb84605925d988e5
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Mon Jan 25 19:33:01 2021 +0100
sfeed_update: separate code of parallel execution and feed() into a _feed() handler
This is useful to be able to reuse the code (together with using sfeed_update
as an included script, coming in the next commit).
commit 1b3f13b1ba5a3e40d4c9266f16091ad347aec333
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Mon Jan 25 19:30:21 2021 +0100
sfeed_update: shuffle code getting the path of the feedurl to make the basesiteurl
Move it closer before it is used.
commit f1883579dfcb256b514defcece7bf88fe65ee659
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Mon Jan 25 19:29:03 2021 +0100
sfeed_update: change parse failure error message
"(FAIL CONVERT)" -> "(FAIL PARSE)". Convert may be too similar to text encoding
conversion.
commit 1a17ad48876172fadb6782f1edee2058819770dc
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Mon Jan 25 19:27:04 2021 +0100
sfeed_update: add an overridable parse() function, using sfeed(1) by default
This can be useful to make connector scripts more cleanly.
The output does not necessarily even have to be in the sfeed(5) format.
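A sketch of such a connector, as it might appear in a sfeedrc. The input
format (one "title<TAB>link" pair per line) is invented for the example. The
arguments passed to parse() are not used, so nothing is assumed about them.

```shell
# Hypothetical override: stdin holds "title<TAB>link" lines instead of XML.
# Output: empty timestamp, title and link as TAB-separated fields. The other
# fields are left out to keep the sketch short.
parse() {
	awk -F '\t' '{ printf "\t%s\t%s\n", $1, $2 }'
}
```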
commit 650adaf537c2baea0d8ad1d11d16286c798cd640
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sun Jan 24 14:27:26 2021 +0100
sfeed_opml_export: fix typos in comment
commit df74ba274c4ea5d9b7388c33500ba601ed0c991d
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sun Jan 24 00:27:50 2021 +0100
sfeed_update: print the filename again as passed as a parameter
... and do not show stderr of readlink.
commit 55d64a0d370b140547fdbab88d271cd6b92265d2
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sat Jan 23 14:48:27 2021 +0100
bump version to 0.9.21
commit 2e00d9c904e6b03a6198651d15dc827526fa319c
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Fri Jan 22 22:40:40 2021 +0100
xml.c: fix typo / regression in checking codepoint range for utf-16 surrogate pair
Regression in commit 12b279581fbbcde2b36eb4b78d70a1c52d4a209a
0xdffff should be 0xdfff.
printf '<item><title>&#x1f448;</title></item>' | sfeed
Before (bad):
&#x1f448;
After:
👈
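The effect of the typo can be sketched with a range check. This is an
illustration in shell, not the xml.c code; is_surrogate and is_surrogate_typo
are made-up names. 0x1f448 is the codepoint of the character in the example
above.

```shell
# UTF-16 surrogates occupy the range 0xd800 up to and including 0xdfff.
is_surrogate() {
	[ "$1" -ge $((0xd800)) ] && [ "$1" -le $((0xdfff)) ]
}

# The regressed upper bound, kept here only to show the effect of the typo:
# valid codepoints up to 0xdffff are wrongly classified as surrogates.
is_surrogate_typo() {
	[ "$1" -ge $((0xd800)) ] && [ "$1" -le $((0xdffff)) ]
}
```

is_surrogate rejects 0x1f448, while is_surrogate_typo accepts it.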
commit 81499ca75a8c1cb2440c89241b5f5d06e5c44f5a
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Fri Jan 22 22:16:34 2021 +0100
sfeed: fix regression with parsing content fields
This regression was introduced in commit e43b7a48 on Tue Oct 6 18:51:33 2020 +0200.
After a content tag was parsed the "iscontenttag" variable was not reset.
This caused 2 regressions:
- It ignored other tags such as links after it.
- It incorrectly set the content-type of a lesser priority field.
Thanks to pazz0 for reporting it!
commit 5940c794323d58dd94bdabc2e7282e90f15dfeb4
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Fri Jan 22 15:02:56 2021 +0100
README: tested with lacc
Interesting C compiler project:
lacc: A simple, self-hosting C compiler:
https://github.com/larmel/lacc
commit 12b279581fbbcde2b36eb4b78d70a1c52d4a209a
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Fri Jan 22 01:11:19 2021 +0100
xml.c: do not convert UTF-16 surrogate pairs to an invalid sequence
Simple way to reproduce:
printf '<item><title>&#xdc00;</title></item>' | sfeed | iconv -t utf-8
Result:
iconv: (stdin):1:8: cannot convert
Output result:
printf '<item><title>&#xdc00;</title></item>' | sfeed
Before:
00000000 09 ed b0 80 09 09 09 09 09 09 09 0a |............|
0000000c
After:
00000000 09 26 23 78 64 63 30 30 3b 09 09 09 09 09 09 09 |.&#xdc00;.......|
00000010 0a |.|
00000011
The entity is output as a literal string. This makes it easier to see what's
wrong and debug the feed, and it is consistent with the current behaviour of
invalid named entities (&bla;). An alternative could be a UTF-8 replacement
symbol (codepoint 0xfffd).
Reference: https://unicode.org/faq/utf_bom.html , specifically:
"Q: How do I convert an unpaired UTF-16 surrogate to UTF-8? "
"A: A different issue arises if an unpaired surrogate is encountered when
converting ill-formed UTF-16 data. By representing such an unpaired surrogate
on its own as a 3-byte sequence, the resulting UTF-8 data stream would become
ill-formed. While it faithfully reflects the nature of the input, Unicode
conformance requires that encoding form conversion always results in a valid
data stream. Therefore a converter must treat this as an error. [AF]"
commit 57d341d9826ff742b5f69cab8228d0d06c3997a3
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sat Jan 16 00:02:12 2021 +0100
sfeed_update: typo in comment
commit a94f19744a2b62b021fcd3741ec2e4d5316ed49a
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Tue Jan 12 01:02:37 2021 +0100
sfeed_update: improve consistency of feed creation and merging
- Improve feed creation with empty results and new feed files.
Always make sure the file is created even when it is new and there are also no
items (after filtering).
- Consistency: always use the same feed file for merging.
Do not use "/dev/null" when it is a new file. This works using sort, but is
ugly when the merge() function is overridden and does something else. It should
be the feed file always.
commit 7270aee2452804631741c971d3b4bc6d52ca72a6
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Tue Jan 12 00:50:04 2021 +0100
sfeed_update: make convertencoding() consistent with other overridable functions
This adds the name as the first parameter for the convertencoding() function,
like filter, merge, order, etc.
This can be useful to make an exception rule for text decoding in a more clean
way.
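An exception rule could look like the sketch below, as a sfeedrc override.
The feed name "broken feed" is made up, and it is an assumption that the
second and third parameter are the source and target encoding.

```shell
# Hypothetical override: $1 = feed name, $2 = from encoding (assumed),
# $3 = to encoding (assumed).
convertencoding() {
	case "$1" in
	"broken feed")
		# exception: pass the data through untouched
		cat ;;
	*)
		if [ -n "$2" ] && [ -n "$3" ] && [ "$2" != "$3" ]; then
			iconv -c -f "$2" -t "$3"
		else
			cat
		fi ;;
	esac
}
```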
commit 4c35fd89e81ee83b1174dd6cfed3d6beaf25dc63
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sat Jan 16 00:01:09 2021 +0100
sfeed_opml_import: minor code-style improvements
commit 3260b503e5a4c1d4ccabed0b2672e2f49c38df05
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Fri Jan 15 18:35:34 2021 +0100
sfeed_opml_import.1: clarify it handles OPML _subscription_ lists specifically
OPML is a more generic format, this tool is specifically for "rss" types and
subscription lists.
commit 7feab0fd885c1c93344fd427904eae91943493b7
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Fri Jan 15 18:31:52 2021 +0100
README: newsboat sqlite3 export script: improvements
- Export read/unread state to a separate plain-text "urls" file, line by line.
- Handle white-space control-chars better.
From the sfeed(1) man page:
" The fields: title, id, author are not allowed to have newlines and TABs,
all whitespace characters are replaced by a single space character.
Control characters are removed."
So do the reverse for newsboat as well: change white-space characters which are
also control-characters (such as TABs and newlines) to a single space
character.
commit f18f4818ed2c992aa9b7b91c74bb9ce7cc1bc745
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sun Jan 10 16:57:53 2021 +0100
optimize converting UNIX timestamp to localtime
Makes a huge difference (cuts the time in half to process the same amount of
lines) on at least glibc 2.30 on Void Linux. Seems to make no difference on
OpenBSD.
- This removes at least one heap allocation per line (checked with valgrind).
This is because glibc will strdup() the environment variable $TZ and free it
each time, which is pointless here and wasteful.
- localtime_r does not require to set the variables like tzname.
In glibc-2.30/time/tzset.c in __tz_convert is the following code and comment:
/* Update internal database according to current TZ setting.
POSIX.1 8.3.7.2 says that localtime_r is not required to set tzname.
This is a good idea since this allows at least a bit more parallelism. */
tzset_internal (tp == &_tmbuf && use_localtime);
This makes it always tzset() and inspect the environment $TZ etc. While with
localtime_r it will only initialize it once:
static void tzset_internal (int always) {
[...]
if (is_initialized && !always)
return;
commit c1b44cf790f8090ff25a2ff268c3f7a8d53e1bcf
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sat Jan 9 16:05:27 2021 +0100
printutf8pad: fix byte-seek issue with negative width codepoints in the range >= 127
For example: "\xef\xbf\xb7" (codepoint 0xfff7), returns wcwidth(wc) == -1.
The next byte was incorrectly seeked, but the codepoint itself was valid
(mbtowc).
commit 6403eadff853f470a859317f605dd306eb427bba
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sat Jan 9 14:58:40 2021 +0100
printutf8pad: small code-style/clarify changes
commit d95499b2d8e777fcb82fd3637044c3e70118a06b
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Fri Jan 8 19:38:58 2021 +0100
sfeed_atom: code-style: use conditional with pledge, like the other tools
commit 04b832539cd5b5392c56ef238ec9b42b689de3ae
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Fri Jan 8 19:08:59 2021 +0100
util.c: printutf8pad(): improve padded printing and printing invalid unicode characters
This affects sfeed_plain.
- Use unicode replacement character (codepoint 0xfffd) when a codepoint is
invalid and proceed printing the rest of the characters.
- When a codepoint is invalid reset the internal state of mbtowc(3), from the
OpenBSD man page:
" If a call to mbtowc() resulted in an undefined internal state, mbtowc()
must be called with s set to NULL to reset the internal state before it
can safely be used again."
- Optimize for the common ASCII case and use a macro to print the character
instead of a wasteful fwrite() function call. With 250k lines (+- 350MB) this
improves printing performance from 1.7s to 1.0s on my laptop. On another
system it improved by +- 25%. Tested with clang and gcc and also tested the
worst-case (non-ASCII) with no penalty.
To test:
printf '0\tabc\xc3 def' | sfeed_plain
Before:
1970-01-01 01:00 abc
After:
1970-01-01 01:00 abc� def
commit c7e3ec5f37738c43b3918cba6977fa51631a23af
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Fri Jan 8 12:05:14 2021 +0100
sfeed_gopher: optimize common output character function
Same reason as the previous commit (allow to expand to macros).
commit bb34ab8d50cbe4c9525d06e4cb67fb58e48ae8b8
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Fri Jan 8 11:58:48 2021 +0100
xmlencode: optimize common character output function
Use putc instead of fputc, it can be optimized to macros.
From the OpenBSD man page:
" putc() acts essentially identically to fputc(), but is a macro that
expands in-line. It may evaluate stream more than once, so arguments
given to putc() should not be expressions with potential side effects."
sfeed_atom, sfeed_frames and sfeed_html are using this function.
Mini-benchmarked sfeed_html and it went from 1.45s to 1.0s with feed files in
total 250k lines (+- 350MB). Tested with clang and gcc on OpenBSD on an older
laptop.
commit b829948d9da8dbbea6d7275ebc1021000114ba15
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sun Jan 3 14:46:01 2021 +0100
man pages: add more real world examples to the man pages
commit 8c85cbf85c42eca17f18ffe45ffc6bde695a4e26
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Fri Jan 1 23:23:06 2021 +0100
sfeed.1/sfeed_plain.1: add example, improve quoting the url for sfeed_web.1
commit aec6b5c35ac33736e6b94e1a613666fe19ebb2d4
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Fri Jan 1 22:38:10 2021 +0100
sfeed_gopher: tighten filesystem permissions on OpenBSD using unveil(2)
sfeed_gopher must be able to write in the current directory, but does not need
write permissions outside it. It could read from any place in the filesystem
(to read feed files).
Prompted by a suggestion from vejetaryenvampir, thanks!
commit e09c96a75edf81b3d219223b6d3564ad7df10004
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Fri Jan 1 22:36:18 2021 +0100
README: add text about page redirects + tweak some words.
... move sections around in a more logical order and tweak some words.
Prompted by a question and feedback from Aleksei, thanks!
commit 3ef73a907e65cad2a250e2e9d345c58f19ae4ca0
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Fri Jan 1 22:36:00 2021 +0100
README: tested on MIPS32 (big-endian)
commit cd3ef9def6e39c57236c53e7a3bfe3903132c6a2
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Fri Jan 1 22:35:48 2021 +0100
LICENSE: bump year
commit 7e4c07313301fdc43bd60f328dba8bf638813a37
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Fri Jan 1 22:34:04 2021 +0100
sfeed_update: if baseurl is empty then use the path from the feed by default
Feeds should contain absolute urls, but if it does not have it then this makes
it more convenient to configure such feeds.
commit cfcdae2a394a2ac0d180ff34cae17e30422d0800
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Mon Nov 9 19:16:09 2020 +0100
bump version to 0.9.20
commit 7664faf66555c5c5b1221322bff7e49722ea36c7
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sun Nov 1 16:53:40 2020 +0100
sfeed_xmlenc: be more paranoid in printing encoding names
sfeed_xmlenc is used automatically in sfeed_update for detecting the encoding.
In particular do not allow slashes anymore either. For example "//IGNORE" and
"//TRANSLIT" which are normally allowed.
Some iconv implementations might allow other funky names or even pathnames too,
so disallow that.
See also the notes about the "frommap" for the "-f" option.
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html
+ some minor parsing handling improvements.
commit 134a1ac3372fe1eae6bc5c6acd12666c17e82696
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sat Oct 31 19:51:17 2020 +0100
sfeed_web: improve parsing a <link> if it has no type attribute
This happens because the previous link type is not reset when a <link> tag
starts again, but it is reset when a type attribute starts.
Found on the Spanish newspaper site: elpais.com
Input:
<link rel="alternate" href="https://feeds.elpais.com/mrss-s/pages/ep/site/elpais.com/portada" type="application/rss+xml" title="RSS de la portada de El País"/>
<link rel="canonical" href="https://elpais.com"/>
Would print (second line is incorrect).
https://feeds.elpais.com/mrss-s/pages/ep/site/elpais.com/portada application/rss+xml
https://elpais.com/ application/rss+xml
Now prints:
https://feeds.elpais.com/mrss-s/pages/ep/site/elpais.com/portada application/rss+xml
Fix: reset it also at the start of a <link> tag in this case (for <base href />
it is still not wanted).
commit 6a7229149f03a54d7d63241c4cbc1c83aa9831f0
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sat Oct 24 16:20:38 2020 +0200
bump version to 0.9.19
commit d7b910dcaea6027b5c11f027e3af7eb4ac8962b2
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Thu Oct 22 19:54:21 2020 +0200
sfeed_web: whoops, fix bug mentioned in the previous commit
(ascii.jp)
commit e771e43d51830ec7d2a19d9d4e67cded83c1b302
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Wed Oct 21 22:06:58 2020 +0200
sfeed_web: attribute parsing improvements, improve man page
Fix attribute parsing and now decode entities. The following now works (from
helsinkitimes.fi):
<base href="https://www.helsinkitimes.fi/" />
<link href="/?format=feed&amp;type=rss" rel="alternate" type="application/rss+xml" title="RSS 2.0" />
<link href="/?format=feed&amp;type=atom" rel="alternate" type="application/atom+xml" title="Atom 1.0" />
Properly associate attributes with the actual tag, this now parses properly
(from ascii.jp).
<link rel="apple-touch-icon-precomposed" href="/img/apple-touch-icon.png" />
<link rel="alternate" type="application/rss+xml" />
commit f054e581dac4921b302e0459a40d1b4f1fbd28ae
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Thu Oct 22 18:18:32 2020 +0200
Do not change the referenced matched tag data (from gettag()).
Fixes a regression introduced in the refactor in commit
e43b7a48b08a6bbcb4e730e80395b3257681b33e
Now copy the data by value. This structure is small and no performance
regression has been seen.
This was because the tag ID was modified which made subsequent parsed tags of
this type behave strangely:
ctx.tag->id = RSSTagGuidPermalinkTrue;
Input data to reproduce:
<rss>
<channel>
<item>
<guid isPermaLink="false">https://def/</guid>
</item>
<item>
<guid>https://abc/</guid>
</item>
</channel>
</rss>
commit 0f76bbf2d809e93903facaa0da9cfd51cabf93fe
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Wed Oct 21 17:30:28 2020 +0200
README: filter example, filter Google Analytics utm_* parameters
https://support.google.com/analytics/answer/1033867?hl=nl
commit c30806b15999b4c35649ddea54b2be6ac9a14620
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Wed Oct 21 00:31:51 2020 +0200
sfeed_web: reset feedlink buffer
Noticed strange output on the site ascii.jp:
The site HTML contained:
<link rel="apple-touch-icon-precomposed" href="/img/apple-touch-icon.png" />
<link rel="alternate" type="application/rss+xml" />
This would print:
"/img/apple-touch-icon.png application/rss+xml"
Now it prints:
" application/rss+xml"
commit 3d4ab8553f7f2867b1ee5526a6785a2877ec5648
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sun Oct 18 16:22:57 2020 +0200
README: improve etag example with escaping of the filename
Use the same base filename as the feed file, because sfeed_update replaces '/'
in names with '_':
filename="$(printf '%s' "$1" | tr '/' '_')"
This fixes the example for fetching feeds with names containing '/'.
Reported by __20h__, thanks!
commit 27a14a3d46603ab452dba10796a24591db096d76
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sun Oct 18 13:31:11 2020 +0200
README: add example to support ETag caching
commit bffe41223467cb6d9a87c1a057966924fbda6c40
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sun Oct 18 00:00:58 2020 +0200
xml.c: initialize i = 0
Forgot it in the cleanup commit 37afcf334fa1ba0b668bde08e8fcaaa9fd7dfa0d
commit 3bce637a2a610c7991b7cbf87c530d87b019c3d6
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Fri Oct 16 11:24:28 2020 +0200
README.xml: reference examples, ANSI compatible, mention original parser
commit 2cda3937b6e8cd333e4a4388553a17339fce82ac
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Fri Oct 16 11:18:51 2020 +0200
README: fix unescaped character in regex in awk in filter example
Found by testing using mawk.
commit 65df3f2e9a3961541ac33f1a147d373f814aedf9
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Mon Oct 12 20:42:17 2020 +0200
add a comment about the intended date priority
commit db9a53d95dddb5b668be498d9274ba700ac3b529
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Mon Oct 12 20:40:50 2020 +0200
Revert "RSS: give Dublin Core <dc:date> higher priority over <pubDate>"
This reverts commit a1516cb7869a0dd99ebaacf846ad4161f2b9b9a2.
commit 74df6f66ab2b6cf1857658151f9076c585911b7f
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Mon Oct 12 20:10:44 2020 +0200
README: filter example: strip Facebook fbclid parameter
commit 33c50db302957bca2a850ac8d0b960d05ee0520e
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Mon Oct 12 18:55:35 2020 +0200
simplify time parsing
commit 0ea6495717a3245d1da079b5d9570a6cf776ef1a
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Mon Oct 12 18:49:56 2020 +0200
remove unneeded check for NUL terminator
commit a1516cb7869a0dd99ebaacf846ad4161f2b9b9a2
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Mon Oct 12 18:41:31 2020 +0200
RSS: give Dublin Core <dc:date> higher priority over <pubDate>
This way dc:date could be the updated time of the item. For Atom there is
<published> and <updated> with the same logic.
commit e43b7a48b08a6bbcb4e730e80395b3257681b33e
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Tue Oct 6 18:51:33 2020 +0200
parse categories, add multiple field values support (for categories)
Fields with multiple values are separated by '|'. In the future multiple
enclosure support might be added.
The categories tags are now parsed. This feature is useful for filtering and
categorizing.
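For filtering, such a multi-value field can be split again in shell. A small
sketch; split_values is a made-up helper and the category names are examples.

```shell
# Print each value of a '|'-separated field on its own line.
split_values() {
	printf '%s\n' "$1" | tr '|' '\n'
}
```

For example split_values 'news|tech|linux' prints the three categories on
separate lines, ready for grep or a while-read loop.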
Parsing of nested tags such as <author><name> has been improved. This code has
been refactored.
RSS <guid> isPermaLink is now handled differently also and will now prefer a
permalink with "true" (link) over the ID. In practice multiple <guid> in an
item does not happen.
commit 37afcf334fa1ba0b668bde08e8fcaaa9fd7dfa0d
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Fri Oct 9 11:43:52 2020 +0200
xml: remove unused code for sfeed
commit 3a5295f595acee81170565ba0405a64d7110a229
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Tue Oct 6 18:45:54 2020 +0200
fix counting due to uninitialized variable when the time could not be parsed
Since commit 276d5789fd91d1cbe84b7baee736dea28b1e04c0 if the time is empty or
could not be parsed then it is shown/aligned as a blank space instead of being
skipped.
An oversight in this change was that items should be counted and set in
`isnew`.
This commit fixes the uninitialized variable and possible miscounting.
commit 08add9b95e58bc4e7ece8137efe8dc0905a94467
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Tue Oct 6 18:45:26 2020 +0200
xml.h: minor comment rewording
commit c0e9fea2a9661f3fc77a1a608e5ca456972321d9
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Tue Oct 6 18:45:09 2020 +0200
sfeed: parse day with max 2 digits (instead of 4)
commit ac9fbcd7c4c5603d0a430b82021e6b45c8a02836
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Tue Oct 6 18:44:35 2020 +0200
sfeed: support the ISO8601 time format without separators
For example "19720229T132245Z" is now supported.
commit 688fbed347554944aa8f9d0bbd9f66694a8fa353
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Tue Oct 6 18:40:30 2020 +0200
README: tested with cproc and sdcc on Z80 emulator, for fun
cproc:
cproc: https://github.com/michaelforney/cproc
qbe: https://c9x.me/compile/
z80 (sfeed base program)
fuzix: http://www.fuzix.org/
RC2014 emulator: https://github.com/EtchedPixels/RC2014
sdcc: http://sdcc.sourceforge.net/
commit 64ee7df437b5ee497cac0a7ac7d382306a5cd159
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Tue Oct 6 18:39:55 2020 +0200
man pages: tweak alignment of lists
commit 95f025d9bcc4b3d1e09a5790bb9ee2cb89ad4cf1
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sat Sep 19 16:38:21 2020 +0200
xml.c: remove buffering of comment data, which is unused anyway
commit 88208d5ffda5cfbf03901038b41afae703f0d172
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sat Sep 19 16:37:44 2020 +0200
xml.h: add underscore for #ifdef guard
This is the common style.
commit 7779e6a74852a7655210abb686287a4b59e75fdc
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sat Sep 19 16:35:33 2020 +0200
XML cdata callback: handle CDATA as data
This improves handling CDATA for example in Atom feeds with:
<author><email><![CDATA[abc]]><name><![CDATA[[person]]></name></author>
commit 5b130adb977ba0ba233ccc5b2bdfc6d6e37563dd
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Mon Jul 6 17:38:17 2020 +0200
bump version to 0.9.18
commit 17898c52f631c3dd10c11167aaa702e5756dde04
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sun Jul 5 16:46:40 2020 +0200
sfeed_atom: minor simplification, gmtime_r is not needed here
commit e8bbe3e72d14a3d81188ea3df4c2c14a5ab90f98
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sun Jul 5 16:41:38 2020 +0200
README: reference sfeed_curses
commit 92be24ba540d9c7a0462ea7744803165aba26849
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sun Jul 5 16:10:23 2020 +0200
README: improvements
- Add an example to optimize bandwidth use with the curl -z option.
- Add a note about CDNs blocking based on the User-Agent (based on a question
mailed to me).
- Add a script to convert existing newsboat items to the sfeed(5) TSV format.
commit 276d5789fd91d1cbe84b7baee736dea28b1e04c0
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sun Jul 5 15:57:02 2020 +0200
format tools: don't skip items with a missing/invalid timestamp field
Handle it appropriately in the context of each format tool. Output the item but
keep it blanked.
NOTE: maybe in sfeed_twtxt it should use the current time instead?
commit a13633a0a6b4e43d7809919ad70e085b9ecdff47
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sun Jul 5 15:54:28 2020 +0200
sfeed_mbox: don't ignore items with a missing/invalid timestamp
The Date header is mandatory. Use the current time if it is missing/invalid.
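
A minimal sketch of the fallback described above, assuming the timestamp
arrives as a decimal string in the first TSV field. The helper names
parse_timestamp() and timestamp_or_now() are hypothetical and are not taken
from sfeed_mbox.c:

```c
#include <errno.h>
#include <stdlib.h>
#include <time.h>

/* Parse a UNIX timestamp field. Returns 0 on success, -1 if the field is
 * empty or not a valid number. (Hypothetical helper, not from sfeed.) */
static int
parse_timestamp(const char *s, time_t *tp)
{
	char *end;
	long long v;

	if (!s || !*s)
		return -1;
	errno = 0;
	v = strtoll(s, &end, 10);
	if (errno || *end != '\0')
		return -1;
	*tp = (time_t)v;
	return 0;
}

/* Return the parsed timestamp, or the current time as a fallback so that a
 * mandatory date header can always be written. */
static time_t
timestamp_or_now(const char *s)
{
	time_t t;

	if (parse_timestamp(s, &t) == -1)
		t = time(NULL);
	return t;
}
```

With this shape the item is never skipped: the caller always gets some usable
time value to format.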
commit d425b524f5ce06e6d9cbb6838c4bf8b66b73ee06
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sun Jul 5 15:53:37 2020 +0200
sfeed_atom: the updated field is mandatory: use the current time...
... if it is missing/invalid.
commit b722b45e5468af3e9405652b4ca57c10c376ba8b
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sun Jul 5 14:05:04 2020 +0200
sfeed_atom: fix timezone, output if timestamp is set
Timezone should be GMT (as intended), do not convert to localtime.
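
As an illustration of the GMT behaviour, the following sketch formats a UNIX
timestamp with gmtime(3) rather than localtime(3), so the output does not
depend on the TZ of the machine. The function name format_atom_time() and the
exact format string are assumptions for this example, not the code in
sfeed_atom.c:

```c
#include <time.h>

/* Format a UNIX timestamp as an Atom-style date in UTC ("Z" suffix).
 * Returns 0 on success, -1 on failure. (Hypothetical helper.) */
static int
format_atom_time(time_t t, char *buf, size_t bufsiz)
{
	struct tm *tm;

	/* gmtime() keeps the time in UTC; localtime() would apply the
	 * local timezone offset, which is not wanted here. */
	if (!(tm = gmtime(&t)))
		return -1;
	if (!strftime(buf, bufsiz, "%Y-%m-%dT%H:%M:%SZ", tm))
		return -1;
	return 0;
}
```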
commit affe5b6b64d57e56eb14132fa67c8f9e3800af6c
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Thu Jun 25 19:50:43 2020 +0200
README: small tweaks and a filter example improvement
This is a "quick&dirty" regex to block some of the typical 1px width or height
tracking pixels.
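
The same idea can be sketched in C with POSIX regcomp(3). This is only an
approximation of the README's awk expression: it uses [[:space:]] instead of
\s, adds REG_ICASE, and only detects a match instead of removing the tag. The
function name has_tracking_pixel() is hypothetical:

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if the HTML fragment contains an <img> tag with a width or
 * height attribute of exactly 1, 0 if not, -1 if the pattern fails to
 * compile. (Hypothetical helper, "quick&dirty" like the original.) */
static int
has_tracking_pixel(const char *html)
{
	static const char *pattern =
	    "<img[^>]*(width|height)[[:space:]]*=[[:space:]]*"
	    "[\"' ]?1[\"' ]?[^0-9>]+[^>]*>";
	regex_t re;
	int r;

	if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE | REG_NOSUB))
		return -1;
	r = regexec(&re, html, 0, NULL, 0);
	regfree(&re);
	return r == 0;
}
```

The [^0-9>]+ part after the "1" is what keeps values such as 10 or 100 from
matching.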
commit 1b14e601434435f48dfe1027a117c2af3dac325b
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sun Jun 21 18:08:41 2020 +0200
sfeed_html/sfeed_frames: simplify struct feed allocation
There's no need for a dynamic struct feed **. The required size is known
(argc). Just allocate it in one go.
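
A rough sketch of the single allocation, assuming one feed per file argument.
The reduced struct feed and the function feeds_alloc() are illustrative only;
the real struct and call site in sfeed_html.c / sfeed_frames.c differ:

```c
#include <stdlib.h>

/* Reduced, hypothetical per-feed state; the real struct has more fields. */
struct feed {
	const char *name;
	unsigned long totalnew;
	unsigned long total;
};

/* Allocate all feed structures in one go. argc is a known upper bound on
 * the number of feeds, so no growing array of pointers is needed.
 * Returns NULL if argc is not positive or on allocation failure. */
static struct feed *
feeds_alloc(int argc, char *argv[])
{
	struct feed *feeds;
	int i;

	if (argc <= 0)
		return NULL;
	if (!(feeds = calloc((size_t)argc, sizeof(*feeds))))
		return NULL;
	/* argv[0] is the program name, feeds start at argv[1] */
	for (i = 1; i < argc; i++)
		feeds[i - 1].name = argv[i];
	return feeds;
}
```

A single free(feeds) then releases everything, instead of one free per feed
plus one for the pointer array.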
commit b4830a4a3a98874293e20b309e083a144c1b3609
Author: Hiltjo Posthuma <hiltjo@codemadness.org>
Date: Sun Jun 21 18:01:52 2020 +0200
Makefile: tiny compatibility improvement for tar -cf
Diffstat:
31 files changed, 1181 insertions(+), 684 deletions(-)
diff --git a/LICENSE b/LICENSE
@@ -1,6 +1,6 @@
ISC License
-Copyright (c) 2011-2020 Hiltjo Posthuma <hiltjo@codemadness.org>
+Copyright (c) 2011-2021 Hiltjo Posthuma <hiltjo@codemadness.org>
Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted, provided that the above
diff --git a/Makefile b/Makefile
@@ -1,7 +1,7 @@
.POSIX:
NAME = sfeed
-VERSION = 0.9.17
+VERSION = 0.9.22
# paths
PREFIX = /usr/local
@@ -97,7 +97,7 @@ dist:
sfeedrc.example style.css \
"${NAME}-${VERSION}"
# make tarball
- tar -cf - "${NAME}-${VERSION}" | \
+ tar cf - "${NAME}-${VERSION}" | \
gzip -c > "${NAME}-${VERSION}.tar.gz"
rm -rf "${NAME}-${VERSION}"
diff --git a/README b/README
@@ -79,6 +79,9 @@ Gopher, SSH, etc.
See the section "Usage and examples" below and the man-pages for more
information how to use sfeed(1) and the additional tools.
+A separate curses UI front-end called sfeed_curses is available at:
+https://codemadness.org/sfeed_curses.html
+
Dependencies
------------
@@ -108,20 +111,21 @@ Optional dependencies
OS tested
---------
-- Linux (compilers: clang, gcc, pcc, tcc, libc: glibc, musl).
+- Linux (compilers: clang, cproc, gcc, lacc, pcc, tcc, libc: glibc, musl).
- OpenBSD (clang, gcc).
- NetBSD
- FreeBSD
- DragonFlyBSD
- Windows (cygwin gcc, mingw).
- HaikuOS (using libbsd).
-- FreeDOS (djgpp)
+- FreeDOS (djgpp).
+- FUZIX (sdcc -mz80).
Architectures tested
--------------------
-amd64, ARM, aarch64, HPPA, i386, SPARC64.
+amd64, ARM, aarch64, HPPA, i386, MIPS32-BE, SPARC64, Z80.
Files
@@ -139,8 +143,8 @@ sfeed_mbox - Format feed data (TSV) to mbox.
sfeed_plain - Format feed data (TSV) to a plain-text list.
sfeed_twtxt - Format feed data (TSV) to a twtxt feed.
sfeed_update - Update feeds and merge items.
-sfeed_web - Find urls to RSS/Atom feed from a webpage.
-sfeed_xmlenc - Detect character-set encoding from XML stream.
+sfeed_web - Find URLs to RSS/Atom feed from a webpage.
+sfeed_xmlenc - Detect character-set encoding from a XML stream.
sfeedrc.example - Example config file. Can be copied to $HOME/.sfeed/sfeedrc.
style.css - Example stylesheet to use with sfeed_html(1) and
sfeed_frames(1).
@@ -152,7 +156,7 @@ Files read at runtime by sfeed_update(1)
sfeedrc - Config file. This file is evaluated as a shellscript in
sfeed_update(1).
-Atleast the following functions can be overridden per feed:
+At least the following functions can be overridden per feed:
- fetch: to use wget(1), OpenBSD ftp(1) or an other download program.
- filter: to filter on fields.
@@ -186,7 +190,7 @@ man 1 sfeed
Usage and examples
------------------
-Find RSS/Atom feed urls from a webpage:
+Find RSS/Atom feed URLs from a webpage:
url="https://codemadness.org"; curl -L -s "$url" | sfeed_web "$url"
@@ -222,7 +226,7 @@ View formatted output in your editor:
- - -
Example script to view feed items in a vertical list/menu in dmenu(1). It opens
-the selected url in the browser set in $BROWSER:
+the selected URL in the browser set in $BROWSER:
#!/bin/sh
url=$(sfeed_plain "$HOME/.sfeed/feeds/"* | dmenu -l 35 -i | \
@@ -248,7 +252,7 @@ argument is optional):
- - -
The filter function can be overridden in your sfeedrc file. This allows
-filtering items per feed. It can be used to shorten urls, filter away
+filtering items per feed. It can be used to shorten URLs, filter away
advertisements, strip tracking parameters and more.
# filter fields.
@@ -285,14 +289,14 @@ advertisements, strip tracking parameters and more.
}
# shorten feedburner links.
- if (match(s, /^(http|https):\/\/[^/]+\/~r\/.*\/~3\/[^\/]+\//)) {
+ if (match(s, /^(http|https):\/\/[^\/]+\/~r\/.*\/~3\/[^\/]+\//)) {
s = substr($3, RSTART, RLENGTH);
}
# strip tracking parameters
# urchin, facebook, piwik, webtrekk and generic.
- gsub(/\?(ad|campaign|pk|tm|wt)_([^&]+)/, "?", s);
- gsub(/&(ad|campaign|pk|tm|wt)_([^&]+)/, "", s);
+ gsub(/\?(ad|campaign|fbclid|pk|tm|utm|wt)_([^&]+)/, "?", s);
+ gsub(/&(ad|campaign|fbclid|pk|tm|utm|wt)_([^&]+)/, "", s);
gsub(/\?&/, "?", s);
gsub(/[\?&]+$/, "", s);
@@ -303,24 +307,15 @@ advertisements, strip tracking parameters and more.
$3 = filterlink($3); # link
$8 = filterlink($8); # enclosure
+ # try to remove tracking pixels: <img/> tags with 1px width or height.
+ gsub("<img[^>]*(width|height)[\s]*=[\s]*[\"'"'"' ]?1[\"'"'"' ]?[^0-9>]+[^>]*>", "", $4);
+
print $0;
}'
}
- - -
-The fetch function can be overridden in your sfeedrc file. This allows to
-replace the default curl(1) for sfeed_update with any other client to fetch the
-RSS/Atom data:
-
- # fetch a feed via HTTP/HTTPS etc.
- # fetch(name, url, feedfile)
- fetch() {
- hurl -m 1048576 -t 15 "$2" 2>/dev/null
- }
-
-- - -
-
Aggregate feeds. This filters new entries (maximum one day old) and sorts them
by newest first. Prefix the feed name in the title. Convert the TSV output data
to an Atom XML feed (again):
@@ -372,7 +367,7 @@ cut -b is used to trim the "N " prefix of sfeed_plain(1).
- - -
For some podcast feed the following code can be used to filter the latest
-enclosure url (probably some audio file):
+enclosure URL (probably some audio file):
awk -F '\t' 'BEGIN { latest = 0; }
length($8) {
@@ -543,6 +538,221 @@ Now run:
Now you can view feeds in mutt(1) for example.
+- - -
+
+The fetch function can be overridden in your sfeedrc file. This allows to
+replace the default curl(1) for sfeed_update with any other client to fetch the
+RSS/Atom data or change the default curl options:
+
+ # fetch a feed via HTTP/HTTPS etc.
+ # fetch(name, url, feedfile)
+ fetch() {
+ hurl -m 1048576 -t 15 "$2" 2>/dev/null
+ }
+
+- - -
+
+Caching, incremental data updates and bandwidth-saving
+
+For servers that support it some incremental updates and bandwidth-saving can
+be done by using the "ETag" HTTP header.
+
+Create a directory for storing the ETags per feed:
+
+ mkdir -p ~/.sfeed/etags/
+
+The curl ETag options (--etag-save and --etag-compare) can be used to store and
+send the previous ETag header value. curl version 7.73+ is recommended for it
+to work properly.
+
+The curl -z option can be used to send the modification date of a local file as
+a HTTP "If-Modified-Since" request header. The server can then respond if the
+data is modified or not or respond with only the incremental data.
+
+The curl --compressed option can be used to indicate the client supports
+decompression. Because RSS/Atom feeds are textual XML content this generally
+compresses very well.
+
+These options can be set by overriding the fetch() function in the sfeedrc
+file:
+
+ # fetch(name, url, feedfile)
+ fetch() {
+ etag="$HOME/.sfeed/etags/$(basename "$3")"
+ curl \
+ -L --max-redirs 0 -H "User-Agent:" -f -s -m 15 \
+ --compressed \
+ --etag-save "${etag}" --etag-compare "${etag}" \
+ -z "${etag}" \
+ "$2" 2>/dev/null
+ }
+
+These options can come at a cost of some privacy, because it exposes
+additional metadata from the previous request.
+
+- - -
+
+CDNs blocking requests due to a missing HTTP User-Agent request header
+
+sfeed_update will not send the "User-Agent" header by default for privacy
+reasons. Some CDNs like Cloudflare don't like this and will block such HTTP
+requests.
+
+A custom User-Agent can be set by using the curl -H option, like so:
+
+ curl -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0'
+
+The above example string pretends to be a Windows 10 (x86-64) machine running
+Firefox 78.
+
+- - -
+
+Page redirects
+
+For security and efficiency reasons by default redirects are not allowed and
+are treated as an error.
+
+For example to prevent hijacking an unencrypted http:// to https:// redirect or
+to not add time of an unnecessary page redirect each time. It is encouraged to
+use the final redirected URL in the sfeedrc config file.
+
+If you want to ignore this advise you can override the fetch() function in the
+sfeedrc file and change the curl options "-L --max-redirs 0".
+
+- - -
+
+Shellscript to update feeds in parallel more efficiently using xargs -P.
+
+It creates a queue of the feeds with its settings, then uses xargs to process
+them in parallel using the common, but non-POSIX -P option. This is more
+efficient than the more portable solution in sfeed_update which can stall a
+batch of $maxjobs in the queue if one item is slow.
+
+sfeed_update_xargs shellscript:
+
+ #!/bin/sh
+ # update feeds, merge with old feeds using xargs in parallel mode (non-POSIX).
+
+ # include script and reuse its functions, but do not start main().
+ SFEED_UPDATE_INCLUDE="1" . sfeed_update
+ # load config file, sets $config.
+ loadconfig "$1"
+
+ # process a single feed.
+ # args are: config, tmpdir, name, feedurl, basesiteurl, encoding
+ if [ "${SFEED_UPDATE_CHILD}" = "1" ]; then
+ sfeedtmpdir="$2"
+ _feed "$3" "$4" "$5" "$6"
+ exit $?
+ fi
+
+ # ...else parent mode:
+
+ # feed(name, feedurl, basesiteurl, encoding)
+ feed() {
+ # workaround: *BSD xargs doesn't handle empty fields in the middle.
+ name="${1:-$$}"
+ feedurl="${2:-http://}"
+ basesiteurl="${3:-${feedurl}}"
+ encoding="$4"
+
+ printf '%s\0%s\0%s\0%s\0%s\0%s\0' "${config}" "${sfeedtmpdir}" \
+ "${name}" "${feedurl}" "${basesiteurl}" "${encoding}"
+ }
+
+ # fetch feeds and store in temporary directory.
+ sfeedtmpdir="$(mktemp -d '/tmp/sfeed_XXXXXX')"
+ # make sure path exists.
+ mkdir -p "${sfeedpath}"
+ # print feeds for parallel processing with xargs.
+ feeds | SFEED_UPDATE_CHILD="1" xargs -r -0 -P "${maxjobs}" -L 6 "$(readlink -f "$0")"
+ # cleanup temporary files etc.
+ cleanup
+
+- - -
+
+Shellscript to export existing newsboat cached items from sqlite3 to the sfeed
+TSV format.
+
+ #!/bin/sh
+ # Export newsbeuter/newsboat cached items from sqlite3 to the sfeed TSV format.
+ # The data is split per file per feed with the name of the newsboat title/url.
+ # It writes the URLs of the read items line by line to a "urls" file.
+ #
+ # Dependencies: sqlite3, awk.
+ #
+ # Usage: create some directory to store the feeds, run this script.
+ #
+ # Assumes "html" for content-type (Newsboat only handles HTML content).
+ # Assumes feednames are unique and a feed title is set.
+
+ # newsboat cache.db file.
+ cachefile="$HOME/.newsboat/cache.db"
+ test -n "$1" && cachefile="$1"
+
+ # dump data.
+ # .mode ascii: Columns/rows delimited by 0x1F and 0x1E
+ # get the first fields in the order of the sfeed(5) format.
+ sqlite3 "$cachefile" <<!EOF |
+ .headers off
+ .mode ascii
+ .output
+ SELECT
+ i.pubDate, i.title, i.url, i.content, i.guid, i.author,
+ i.enclosure_url,
+ f.rssurl AS rssurl, f.title AS feedtitle, i.unread --,
+ -- i.id, i.enclosure_type, i.enqueued, i.flags, i.deleted,
+ -- i.base
+ FROM rss_feed f
+ INNER JOIN rss_item i ON i.feedurl = f.rssurl
+ ORDER BY
+ i.feedurl ASC, i.pubDate DESC;
+ .quit
+ !EOF
+ # convert to sfeed(5) TSV format.
+ awk '
+ BEGIN {
+ FS = "\x1f";
+ RS = "\x1e";
+ }
+ # normal non-content fields.
+ function field(s) {
+ gsub("^[[:space:]]*", "", s);
+ gsub("[[:space:]]*$", "", s);
+ gsub("[[:space:]]", " ", s);
+ gsub("[[:cntrl:]]", "", s);
+ return s;
+ }
+ # content field.
+ function content(s) {
+ gsub("^[[:space:]]*", "", s);
+ gsub("[[:space:]]*$", "", s);
+ # escape chars in content field.
+ gsub("\\\\", "\\\\", s);
+ gsub("\n", "\\n", s);
+ gsub("\t", "\\t", s);
+ return s;
+ }
+ function feedname(url, title) {
+ gsub("/", "_", title);
+ return title;
+ }
+ {
+ fname = feedname($8, $9);
+ if (!feed[fname]++) {
+ print "Writing file: \"" fname "\" (title: " $9 ", url: " $8 ")" > "/dev/stderr";
+ }
+
+ print $1 "\t" field($2) "\t" field($3) "\t" content($4) "\t" \
+ "html" "\t" field($5) "\t" field($6) "\t" field($7) \
+ > fname;
+
+ # write URLs of the read items to a file line by line.
+ if ($10 == "0") {
+ print $3 > "urls";
+ }
+ }'
+
License
-------
diff --git a/README.xml b/README.xml
@@ -3,11 +3,14 @@ XML parser
A small XML parser.
+For the original version see:
+https://git.codemadness.org/xmlparser/
+
Dependencies
------------
-- C compiler (C99).
+- C compiler (ANSI).
Features
@@ -51,7 +54,7 @@ Caveats
HTML <pre> tags.
- The XML specification has no limits on tag and attribute names. For
simplicity/sanity sake this XML parser takes some liberties. Tag and
- attribute names are truncated.
+ attribute names are truncated if they are excessively long.
- Entity expansions are not parsed aswell as DOCTYPE, ATTLIST etc.
@@ -72,6 +75,9 @@ Examples
sfeed_opml_import.c or sfeed_web.c or sfeed_xmlenc.c
+See skeleton.c in the original xmlparser repository for a base program to start
+quickly.
+
License
-------
diff --git a/sfeed.1 b/sfeed.1
@@ -1,4 +1,4 @@
-.Dd March 15, 2020
+.Dd February 28, 2021
.Dt SFEED 1
.Os
.Sh NAME
@@ -13,8 +13,11 @@ reads RSS or Atom feed data (XML) from stdin.
It writes the feed data in a TAB-separated format to stdout.
A
.Ar baseurl
-can be specified if the links in the feed are relative urls.
-It is recommended to always have absolute urls in your feeds.
+can be specified if the links or enclosures in the feed are relative URLs.
+If the
+.Ar baseurl
+is a valid absolute URL then the relative links or enclosures will be
+made absolute.
.Sh TAB-SEPARATED FORMAT FIELDS
The items are output per line in a TSV-like format.
.Pp
@@ -29,13 +32,13 @@ Other whitespace characters except spaces are removed.
Control characters are removed.
.Pp
The order and content of the fields are:
-.Bl -tag -width 17n
+.Bl -tag -width 12n
.It timestamp
UNIX timestamp in UTC+0, empty if missing or on parse failure.
.It title
Title text, HTML code in titles is ignored and is treated as plain-text.
.It link
-Absolute url, unsafe characters are encoded.
+Link
.It content
Content, can have plain-text or HTML code depending on the content-type field.
.It content-type
@@ -46,12 +49,20 @@ RSS item GUID or Atom id.
Item author.
.It enclosure
Item, first enclosure.
+.It category
+Item, categories, multiple values are separated by |.
.El
.Sh EXIT STATUS
.Ex -std
+.Sh EXAMPLES
+.Bd -literal
+curl -s 'https://codemadness.org/atom.xml' | sfeed
+.Ed
.Sh SEE ALSO
.Xr sfeed_plain 1 ,
-.Xr sfeed 5
+.Xr sfeed_update 1 ,
+.Xr sfeed 5 ,
+.Xr sfeedrc 5
.Sh AUTHORS
.An Hiltjo Posthuma Aq Mt hiltjo@codemadness.org
.Sh CAVEATS
diff --git a/sfeed.5 b/sfeed.5
@@ -1,4 +1,4 @@
-.Dd July 20, 2019
+.Dd February 19, 2021
.Dt SFEED 5
.Os
.Sh NAME
@@ -23,13 +23,13 @@ Other whitespace characters except spaces are removed.
Control characters are removed.
.Pp
The order and content of the fields are:
-.Bl -tag -width 17n
+.Bl -tag -width 12n
.It timestamp
UNIX timestamp in UTC+0, empty if missing or on parse failure.
.It title
Title text, HTML code in titles is ignored and is treated as plain-text.
.It link
-Absolute url, unsafe characters are encoded.
+Link
.It content
Content, can have plain-text or HTML code depending on the content-type field.
.It content-type
@@ -40,6 +40,8 @@ RSS item GUID or Atom id.
Item author.
.It enclosure
Item, first enclosure.
+.It category
+Item, categories, multiple values are separated by |.
.El
.Sh SEE ALSO
.Xr sfeed 1 ,
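
The category field documented above can hold several values separated by |.
A consumer of the TSV data might split it as in the sketch below. The
function split_categories() is hypothetical and not part of sfeed; it modifies
the string in place and assumes the field has already been cut out of the
line:

```c
#include <stddef.h>
#include <string.h>

/* Split a category field on '|' in place. Stores up to max pointers to the
 * separate values in out[] and returns how many were stored.
 * (Hypothetical helper for code that reads the sfeed(5) format.) */
static size_t
split_categories(char *s, char *out[], size_t max)
{
	size_t n = 0;
	char *e;

	if (!*s)
		return 0; /* empty field: no categories */
	for (; n < max; s = e + 1) {
		out[n++] = s;
		if (!(e = strchr(s, '|')))
			break; /* last value */
		*e = '\0'; /* terminate the current value */
	}
	return n;
}
```

For example a field containing "tech|linux|c" yields the three values "tech",
"linux" and "c".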
diff --git a/sfeed.c b/sfeed.c
@@ -14,6 +14,10 @@
#define ISINCONTENT(ctx) ((ctx).iscontent && !((ctx).iscontenttag))
#define ISCONTENTTAG(ctx) (!((ctx).iscontent) && (ctx).iscontenttag)
+
+/* these feed fields support multiple separated values */
+#define ISFEEDFIELDMULTI(t) ((t) == FeedFieldCategory)
+
/* string and byte-length */
#define STRP(s) s,sizeof(s)-1
@@ -42,23 +46,27 @@ typedef struct string {
enum TagId {
TagUnknown = 0,
/* RSS */
- RSSTagDcdate, RSSTagPubdate,
+ RSSTagDcdate, RSSTagPubdate, /* creation date has higher priority */
RSSTagTitle,
RSSTagMediaDescription, RSSTagDescription, RSSTagContentEncoded,
RSSTagGuid,
+ RSSTagGuidPermalinkFalse,
+ RSSTagGuidPermalinkTrue,
/* must be defined after GUID, because it can be a link (isPermaLink) */
RSSTagLink,
RSSTagEnclosure,
RSSTagAuthor, RSSTagDccreator,
+ RSSTagCategory,
/* Atom */
- AtomTagUpdated, AtomTagPublished,
+ AtomTagUpdated, AtomTagPublished, /* creation date has higher priority */
AtomTagTitle,
AtomTagMediaDescription, AtomTagSummary, AtomTagContent,
AtomTagId,
AtomTagLink,
AtomTagLinkAlternate,
AtomTagLinkEnclosure,
- AtomTagAuthor,
+ AtomTagAuthor, AtomTagAuthorName,
+ AtomTagCategory,
TagLast
};
@@ -75,18 +83,19 @@ typedef struct field {
enum {
FeedFieldTime = 0, FeedFieldTitle, FeedFieldLink, FeedFieldContent,
- FeedFieldId, FeedFieldAuthor, FeedFieldEnclosure, FeedFieldLast
+ FeedFieldId, FeedFieldAuthor, FeedFieldEnclosure, FeedFieldCategory,
+ FeedFieldLast
};
typedef struct feedcontext {
String *field; /* current FeedItem field String */
FeedField fields[FeedFieldLast]; /* data for current item */
- FeedTag *tag; /* unique current parsed tag */
+ FeedTag tag; /* unique current parsed tag */
int iscontent; /* in content data */
int iscontenttag; /* in content tag */
enum ContentType contenttype; /* content-type for item */
enum FeedType feedtype;
- int attrcount; /* count item HTML element attributes */
+ int attrcount; /* count item HTML element attributes */
} FeedContext;
static long long datetounix(long long, int, int, int, int, int);
@@ -102,6 +111,7 @@ static void string_clear(String *);
static void string_print_encoded(String *);
static void string_print_timestamp(String *);
static void string_print_trimmed(String *);
+static void string_print_trimmed_multi(String *);
static void string_print_uri(String *);
static void xmlattr(XMLParser *, const char *, size_t, const char *, size_t,
const char *, size_t);
@@ -111,7 +121,6 @@ static void xmlattrend(XMLParser *, const char *, size_t, const char *,
size_t);
static void xmlattrstart(XMLParser *, const char *, size_t, const char *,
size_t);
-static void xmlcdata(XMLParser *, const char *, size_t);
static void xmldata(XMLParser *, const char *, size_t);
static void xmldataentity(XMLParser *, const char *, size_t);
static void xmltagend(XMLParser *, const char *, size_t, int);
@@ -122,6 +131,7 @@ static void xmltagstartparsed(XMLParser *, const char *, size_t, int);
/* RSS, must be alphabetical order */
static FeedTag rsstags[] = {
{ STRP("author"), RSSTagAuthor },
+ { STRP("category"), RSSTagCategory },
{ STRP("content:encoded"), RSSTagContentEncoded },
{ STRP("dc:creator"), RSSTagDccreator },
{ STRP("dc:date"), RSSTagDcdate },
@@ -134,10 +144,11 @@ static FeedTag rsstags[] = {
{ STRP("pubdate"), RSSTagPubdate },
{ STRP("title"), RSSTagTitle }
};
+
/* Atom, must be alphabetical order */
static FeedTag atomtags[] = {
- /* <author><name></name></author> */
{ STRP("author"), AtomTagAuthor },
+ { STRP("category"), AtomTagCategory },
{ STRP("content"), AtomTagContent },
{ STRP("id"), AtomTagId },
/* Atom: <link href="" />, RSS has <link></link> */
@@ -148,46 +159,58 @@ static FeedTag atomtags[] = {
{ STRP("title"), AtomTagTitle },
{ STRP("updated"), AtomTagUpdated }
};
+
+/* special case: nested <author><name> */
+static FeedTag atomtagauthor = { STRP("author"), AtomTagAuthor };
+static FeedTag atomtagauthorname = { STRP("name"), AtomTagAuthorName };
+
+/* reference to no / unknown tag */
static FeedTag notag = { STRP(""), TagUnknown };
/* map TagId type to RSS/Atom field, all tags must be defined */
static int fieldmap[TagLast] = {
- [TagUnknown] = -1,
+ [TagUnknown] = -1,
/* RSS */
- [RSSTagDcdate] = FeedFieldTime,
- [RSSTagPubdate] = FeedFieldTime,
- [RSSTagTitle] = FeedFieldTitle,
- [RSSTagMediaDescription] = FeedFieldContent,
- [RSSTagDescription] = FeedFieldContent,
- [RSSTagContentEncoded] = FeedFieldContent,
- [RSSTagGuid] = FeedFieldId,
- [RSSTagLink] = FeedFieldLink,
- [RSSTagEnclosure] = FeedFieldEnclosure,
- [RSSTagAuthor] = FeedFieldAuthor,
- [RSSTagDccreator] = FeedFieldAuthor,
+ [RSSTagDcdate] = FeedFieldTime,
+ [RSSTagPubdate] = FeedFieldTime,
+ [RSSTagTitle] = FeedFieldTitle,
+ [RSSTagMediaDescription] = FeedFieldContent,
+ [RSSTagDescription] = FeedFieldContent,
+ [RSSTagContentEncoded] = FeedFieldContent,
+ [RSSTagGuid] = -1,
+ [RSSTagGuidPermalinkTrue] = FeedFieldId, /* special-case: both a link and an id */
+ [RSSTagGuidPermalinkFalse] = FeedFieldId,
+ [RSSTagLink] = FeedFieldLink,
+ [RSSTagEnclosure] = FeedFieldEnclosure,
+ [RSSTagAuthor] = FeedFieldAuthor,
+ [RSSTagDccreator] = FeedFieldAuthor,
+ [RSSTagCategory] = FeedFieldCategory,
/* Atom */
- [AtomTagUpdated] = FeedFieldTime,
- [AtomTagPublished] = FeedFieldTime,
- [AtomTagTitle] = FeedFieldTitle,
- [AtomTagMediaDescription] = FeedFieldContent,
- [AtomTagSummary] = FeedFieldContent,
- [AtomTagContent] = FeedFieldContent,
- [AtomTagId] = FeedFieldId,
- [AtomTagLink] = -1,
- [AtomTagLinkAlternate] = FeedFieldLink,
- [AtomTagLinkEnclosure] = FeedFieldEnclosure,
- [AtomTagAuthor] = FeedFieldAuthor
+ [AtomTagUpdated] = FeedFieldTime,
+ [AtomTagPublished] = FeedFieldTime,
+ [AtomTagTitle] = FeedFieldTitle,
+ [AtomTagMediaDescription] = FeedFieldContent,
+ [AtomTagSummary] = FeedFieldContent,
+ [AtomTagContent] = FeedFieldContent,
+ [AtomTagId] = FeedFieldId,
+ [AtomTagLink] = -1,
+ [AtomTagLinkAlternate] = FeedFieldLink,
+ [AtomTagLinkEnclosure] = FeedFieldEnclosure,
+ [AtomTagAuthor] = -1,
+ [AtomTagAuthorName] = FeedFieldAuthor,
+ [AtomTagCategory] = FeedFieldCategory
};
static const int FieldSeparator = '\t';
-static const char *baseurl = "";
+/* separator for multiple values in a field, separator should be 1 byte */
+static const char *FieldMultiSeparator = "|";
+static struct uri baseuri;
+static const char *baseurl;
-static FeedContext ctx = { .tag = &notag };
+static FeedContext ctx;
static XMLParser parser; /* XML parser state */
-
-static String atomlink;
-static enum TagId atomlinktype;
-static int rssidpermalink;
+static String tmpstr;
+static enum ContentType tmpcontenttype; /* content-type for item */
int
tagcmp(const void *v1, const void *v2)
@@ -222,7 +245,7 @@ gettag(enum FeedType feedtype, const char *name, size_t namelen)
static char *
ltrim(const char *s)
{
- for (; *s && isspace((unsigned char)*s); s++)
+ for (; isspace((unsigned char)*s); s++)
;
return (char *)s;
}
@@ -309,19 +332,13 @@ string_print_encoded(String *s)
}
}
-/* Print text, replace TABs, carriage return and other whitespace with ' '.
- * Other control chars are removed. Remove leading and trailing whitespace. */
static void
-string_print_trimmed(String *s)
+printtrimmed(char *s)
{
char *p, *e;
- if (!s->data || !s->len)
- return;
-
- p = ltrim(s->data);
+ p = ltrim(s);
e = rtrim(p);
-
for (; *p && p != e; p++) {
if (isspace((unsigned char)*p))
putchar(' '); /* any whitespace to space */
@@ -331,23 +348,74 @@ string_print_trimmed(String *s)
}
}
-/* always print absolute urls (using global baseurl) */
+/* Print text, replace TABs, carriage return and other whitespace with ' '.
+ * Other control chars are removed. Remove leading and trailing whitespace. */
+static void
+string_print_trimmed(String *s)
+{
+ if (!s->data || !s->len)
+ return;
+
+ printtrimmed(s->data);
+}
+
void
-string_print_uri(String *s)
+string_print_trimmed_multi(String *s)
{
- char link[4096], *p, *e;
+ char *p, *e;
int c;
if (!s->data || !s->len)
return;
- p = ltrim(s->data);
+ for (p = s->data; ; p = e + 1) {
+ if ((e = strstr(p, FieldMultiSeparator))) {
+ c = *e;
+ *e = '\0';
+ printtrimmed(p);
+ *e = c; /* restore NUL byte to original character */
+ fputs(FieldMultiSeparator, stdout);
+ } else {
+ printtrimmed(p);
+ break;
+ }
+ }
+}
+
+/* always print absolute URLs (using global baseurl) */
+void
+printuri(char *s)
+{
+ char link[4096], *p, *e;
+ struct uri newuri, olduri;
+ int c, r = -1;
+
+ p = ltrim(s);
e = rtrim(p);
c = *e;
*e = '\0';
- if (absuri(link, sizeof(link), p, baseurl) != -1)
- fputs(link, stdout);
- *e = c; /* restore */
+
+ if (baseurl && !uri_hasscheme(p) &&
+ uri_parse(p, &olduri) != -1 && !olduri.proto[0] &&
+ uri_makeabs(&newuri, &olduri, &baseuri) != -1 && newuri.proto[0])
+ r = uri_format(link, sizeof(link), &newuri);
+
+ if (r >= 0 && (size_t)r < sizeof(link))
+ printtrimmed(link);
+ else
+ printtrimmed(p);
+
+ *e = c; /* restore NUL byte to original character */
+}
+
+/* always print absolute URLs (using global baseurl) */
+void
+string_print_uri(String *s)
+{
+ if (!s->data || !s->len)
+ return;
+
+ printuri(s->data);
}
/* print as UNIX timestamp, print nothing if the parsed time is invalid */
@@ -443,20 +511,20 @@ gettzoffset(const char *s)
long tzhour = 0, tzmin = 0;
size_t i;
- for (; *s && isspace((unsigned char)*s); s++)
+ for (; isspace((unsigned char)*s); s++)
;
switch (*s) {
case '-': /* offset */
case '+':
- for (i = 0, p = s + 1; i < 2 && *p && isdigit((unsigned char)*p); i++, p++)
+ for (i = 0, p = s + 1; i < 2 && isdigit((unsigned char)*p); i++, p++)
tzhour = (tzhour * 10) + (*p - '0');
if (*p == ':')
p++;
- for (i = 0; i < 2 && *p && isdigit((unsigned char)*p); i++, p++)
+ for (i = 0; i < 2 && isdigit((unsigned char)*p); i++, p++)
tzmin = (tzmin * 10) + (*p - '0');
return ((tzhour * 3600) + (tzmin * 60)) * (s[0] == '-' ? -1 : 1);
default: /* timezone name */
- for (i = 0; *s && isalpha((unsigned char)s[i]); i++)
+ for (i = 0; isalpha((unsigned char)s[i]); i++)
;
if (i != 3)
return 0;
@@ -492,32 +560,35 @@ parsetime(const char *s, time_t *tp)
int va[6] = { 0 }, i, j, v, vi;
size_t m;
- for (; *s && isspace((unsigned char)*s); s++)
+ for (; isspace((unsigned char)*s); s++)
;
if (!isdigit((unsigned char)*s) && !isalpha((unsigned char)*s))
return -1;
- if (strspn(s, "0123456789") == 4) {
- /* format "%Y-%m-%d %H:%M:%S" or "%Y-%m-%dT%H:%M:%S" */
+ if (isdigit((unsigned char)s[0]) &&
+ isdigit((unsigned char)s[1]) &&
+ isdigit((unsigned char)s[2]) &&
+ isdigit((unsigned char)s[3])) {
+ /* formats "%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S" or "%Y%m%d%H%M%S" */
vi = 0;
} else {
/* format: "[%a, ]%d %b %Y %H:%M:%S" */
/* parse "[%a, ]%d %b %Y " part, then use time parsing as above */
- for (; *s && isalpha((unsigned char)*s); s++)
+ for (; isalpha((unsigned char)*s); s++)
;
- for (; *s && isspace((unsigned char)*s); s++)
+ for (; isspace((unsigned char)*s); s++)
;
if (*s == ',')
s++;
- for (; *s && isspace((unsigned char)*s); s++)
+ for (; isspace((unsigned char)*s); s++)
;
- for (v = 0, i = 0; *s && i < 4 && isdigit((unsigned char)*s); s++, i++)
+ for (v = 0, i = 0; i < 2 && isdigit((unsigned char)*s); s++, i++)
v = (v * 10) + (*s - '0');
va[2] = v; /* day */
- for (; *s && isspace((unsigned char)*s); s++)
+ for (; isspace((unsigned char)*s); s++)
;
/* end of word month */
- for (j = 0; *s && isalpha((unsigned char)s[j]); j++)
+ for (j = 0; isalpha((unsigned char)s[j]); j++)
;
/* check month name */
if (j < 3 || j > 9)
@@ -533,22 +604,25 @@ parsetime(const char *s, time_t *tp)
}
if (m >= 12)
return -1; /* no month found */
- for (; *s && isspace((unsigned char)*s); s++)
+ for (; isspace((unsigned char)*s); s++)
;
- for (v = 0, i = 0; *s && i < 4 && isdigit((unsigned char)*s); s++, i++)
+ for (v = 0, i = 0; i < 4 && isdigit((unsigned char)*s); s++, i++)
v = (v * 10) + (*s - '0');
va[0] = v; /* year */
- for (; *s && isspace((unsigned char)*s); s++)
+ for (; isspace((unsigned char)*s); s++)
;
/* parse only regular time part, see below */
vi = 3;
}
- /* parse time part */
+ /* parse time parts (and possibly remaining date parts) */
for (; *s && vi < 6; vi++) {
- for (i = 0, v = 0; *s && i < 4 && isdigit((unsigned char)*s); s++, i++)
+ for (i = 0, v = 0; i < ((vi == 0) ? 4 : 2) &&
+ isdigit((unsigned char)*s); s++, i++) {
v = (v * 10) + (*s - '0');
+ }
va[vi] = v;
+
if ((vi < 2 && *s == '-') ||
(vi == 2 && (*s == 'T' || isspace((unsigned char)*s))) ||
(vi > 2 && *s == ':'))
@@ -557,7 +631,7 @@ parsetime(const char *s, time_t *tp)
/* skip milliseconds in for example: "%Y-%m-%dT%H:%M:%S.000Z" */
if (*s == '.') {
- for (s++; *s && isdigit((unsigned char)*s); s++)
+ for (s++; isdigit((unsigned char)*s); s++)
;
}
@@ -594,6 +668,8 @@ printfields(void)
string_print_trimmed(&ctx.fields[FeedFieldAuthor].str);
putchar(FieldSeparator);
string_print_uri(&ctx.fields[FeedFieldEnclosure].str);
+ putchar(FieldSeparator);
+ string_print_trimmed_multi(&ctx.fields[FeedFieldCategory].str);
putchar('\n');
}
@@ -620,7 +696,7 @@ xmlattr(XMLParser *p, const char *t, size_t tl, const char *n, size_t nl,
return;
}
- if (!ctx.tag->id)
+ if (!ctx.tag.id)
return;
/* content-type may be: Atom: text, xhtml, html or mime-type.
@@ -631,39 +707,49 @@ xmlattr(XMLParser *p, const char *t, size_t tl, const char *n, size_t nl,
isattr(v, vl, STRP("xhtml")) ||
isattr(v, vl, STRP("text/html")) ||
isattr(v, vl, STRP("text/xhtml"))) {
- ctx.contenttype = ContentTypeHTML;
+ tmpcontenttype = ContentTypeHTML;
} else if (isattr(v, vl, STRP("text")) ||
isattr(v, vl, STRP("plain")) ||
isattr(v, vl, STRP("text/plain"))) {
- ctx.contenttype = ContentTypePlain;
+ tmpcontenttype = ContentTypePlain;
}
}
return;
}
if (ctx.feedtype == FeedTypeRSS) {
- if (ctx.tag->id == RSSTagEnclosure &&
- isattr(n, nl, STRP("url")) && ctx.field) {
- string_append(ctx.field, v, vl);
- } else if (ctx.tag->id == RSSTagGuid &&
- isattr(n, nl, STRP("ispermalink")) &&
- !isattr(v, vl, STRP("true"))) {
- rssidpermalink = 0;
+ if (ctx.tag.id == RSSTagEnclosure &&
+ isattr(n, nl, STRP("url"))) {
+ string_append(&tmpstr, v, vl);
+ } else if ((ctx.tag.id == RSSTagGuid ||
+ ctx.tag.id == RSSTagGuidPermalinkFalse ||
+ ctx.tag.id == RSSTagGuidPermalinkTrue) &&
+ isattr(n, nl, STRP("ispermalink"))) {
+ if (isattr(v, vl, STRP("true")))
+ ctx.tag.id = RSSTagGuidPermalinkTrue;
+ else
+ ctx.tag.id = RSSTagGuidPermalinkFalse;
}
} else if (ctx.feedtype == FeedTypeAtom) {
- if (ctx.tag->id == AtomTagLink &&
- isattr(n, nl, STRP("rel"))) {
- /* empty or "alternate": other types could be
- "enclosure", "related", "self" or "via" */
- if (!vl || isattr(v, vl, STRP("alternate")))
- atomlinktype = AtomTagLinkAlternate;
- else if (isattr(v, vl, STRP("enclosure")))
- atomlinktype = AtomTagLinkEnclosure;
- else
- atomlinktype = TagUnknown;
- } else if (ctx.tag->id == AtomTagLink &&
- isattr(n, nl, STRP("href"))) {
- string_append(&atomlink, v, vl);
+ if (ctx.tag.id == AtomTagLink ||
+ ctx.tag.id == AtomTagLinkAlternate ||
+ ctx.tag.id == AtomTagLinkEnclosure) {
+ if (isattr(n, nl, STRP("rel"))) {
+ /* empty or "alternate": other types could be
+ "enclosure", "related", "self" or "via" */
+ if (!vl || isattr(v, vl, STRP("alternate")))
+ ctx.tag.id = AtomTagLinkAlternate;
+ else if (isattr(v, vl, STRP("enclosure")))
+ ctx.tag.id = AtomTagLinkEnclosure;
+ else
+ ctx.tag.id = AtomTagLink; /* unknown */
+ } else if (ctx.tag.id != AtomTagLink &&
+ isattr(n, nl, STRP("href"))) {
+ string_append(&tmpstr, v, vl);
+ }
+ } else if (ctx.tag.id == AtomTagCategory &&
+ isattr(n, nl, STRP("term"))) {
+ string_append(&tmpstr, v, vl);
}
}
}
@@ -682,7 +768,7 @@ xmlattrentity(XMLParser *p, const char *t, size_t tl, const char *n, size_t nl,
return;
}
- if (!ctx.tag->id)
+ if (!ctx.tag.id)
return;
/* try to translate entity, else just pass as data to
@@ -696,35 +782,30 @@ xmlattrentity(XMLParser *p, const char *t, size_t tl, const char *n, size_t nl,
static void
xmlattrend(XMLParser *p, const char *t, size_t tl, const char *n, size_t nl)
{
- if (!ISINCONTENT(ctx) || ctx.contenttype != ContentTypeHTML)
+ if (ISINCONTENT(ctx)) {
+ if (ctx.contenttype == ContentTypeHTML) {
+ /* handles transforming inline XML to data */
+ xmldata(p, "\"", 1);
+ ctx.attrcount = 0;
+ }
return;
-
- /* handles transforming inline XML to data */
- xmldata(p, "\"", 1);
- ctx.attrcount = 0;
+ }
}
static void
xmlattrstart(XMLParser *p, const char *t, size_t tl, const char *n, size_t nl)
{
- if (!ISINCONTENT(ctx) || ctx.contenttype != ContentTypeHTML)
- return;
-
- /* handles transforming inline XML to data */
- if (!ctx.attrcount)
- xmldata(p, " ", 1);
- ctx.attrcount++;
- xmldata(p, n, nl);
- xmldata(p, "=\"", 2);
-}
-
-static void
-xmlcdata(XMLParser *p, const char *s, size_t len)
-{
- if (!ctx.field)
+ if (ISINCONTENT(ctx)) {
+ if (ctx.contenttype == ContentTypeHTML) {
+ /* handles transforming inline XML to data */
+ if (!ctx.attrcount)
+ xmldata(p, " ", 1);
+ ctx.attrcount++;
+ xmldata(p, n, nl);
+ xmldata(p, "=\"", 2);
+ }
return;
-
- string_append(ctx.field, s, len);
+ }
}
/* NOTE: this handler can be called multiple times if the data in this
@@ -735,9 +816,9 @@ xmldata(XMLParser *p, const char *s, size_t len)
if (!ctx.field)
return;
- /* add only data from <name> inside <author> tag
- * or any other non-<author> tag */
- if (ctx.tag->id != AtomTagAuthor || istag(p->tag, p->taglen, STRP("name")))
+ if (ISFEEDFIELDMULTI(fieldmap[ctx.tag.id]))
+ string_append(&tmpstr, s, len);
+ else
string_append(ctx.field, s, len);
}
@@ -761,11 +842,11 @@ xmldataentity(XMLParser *p, const char *data, size_t datalen)
static void
xmltagstart(XMLParser *p, const char *t, size_t tl)
{
- enum TagId tagid;
+ FeedTag *f;
if (ISINCONTENT(ctx)) {
- ctx.attrcount = 0;
if (ctx.contenttype == ContentTypeHTML) {
+ ctx.attrcount = 0;
xmldata(p, "<", 1);
xmldata(p, t, tl);
}
@@ -781,73 +862,87 @@ xmltagstart(XMLParser *p, const char *t, size_t tl)
return;
}
- /* field tagid already set, nested tags are not allowed: return */
- if (ctx.tag->id)
- return;
+ /* field tagid already set or nested tags. */
+ if (ctx.tag.id) {
+ /* nested <author><name> for Atom */
+ if (ctx.tag.id == AtomTagAuthor &&
+ istag(t, tl, STRP("name"))) {
+ memcpy(&(ctx.tag), &atomtagauthorname, sizeof(ctx.tag));
+ } else {
+ return; /* other nested tags are not allowed: return */
+ }
+ }
/* in item */
- if (!(ctx.tag = gettag(ctx.feedtype, t, tl)))
- ctx.tag = &notag;
- tagid = ctx.tag->id;
-
- /* without a rel attribute the default link type is "alternate" */
- if (tagid == AtomTagLink) {
- atomlinktype = AtomTagLinkAlternate;
- string_clear(&atomlink); /* reuse and clear temporary link */
- } else if (tagid == RSSTagGuid) {
+ if (ctx.tag.id == TagUnknown) {
+ if (!(f = gettag(ctx.feedtype, t, tl)))
+ f = &notag;
+ memcpy(&(ctx.tag), f, sizeof(ctx.tag));
+ }
+
+ switch (ctx.tag.id) {
+ case AtomTagLink:
+ /* without a rel attribute the default link type is "alternate" */
+ ctx.tag.id = AtomTagLinkAlternate;
+ break;
+ case RSSTagGuid:
/* without a ispermalink attribute the default value is "true" */
- rssidpermalink = 1;
+ ctx.tag.id = RSSTagGuidPermalinkTrue;
+ break;
+ case RSSTagContentEncoded:
+ case RSSTagDescription:
+ tmpcontenttype = ContentTypeHTML; /* default content-type */
+ break;
+ case RSSTagMediaDescription:
+ case AtomTagContent:
+ case AtomTagMediaDescription:
+ case AtomTagSummary:
+ tmpcontenttype = ContentTypePlain; /* default content-type */
+ break;
+ default:
+ break;
+ }
+ ctx.iscontenttag = (fieldmap[ctx.tag.id] == FeedFieldContent);
+}
+
+static void
+xmltagstartparsed(XMLParser *p, const char *t, size_t tl, int isshort)
+{
+ enum TagId tagid;
+
+ if (ISINCONTENT(ctx)) {
+ if (ctx.contenttype == ContentTypeHTML) {
+ if (isshort)
+ xmldata(p, "/>", 2);
+ else
+ xmldata(p, ">", 1);
+ }
+ return;
}
+ tagid = ctx.tag.id;
+
/* map tag type to field: unknown or lesser priority is ignored,
when tags of the same type are repeated only the first is used. */
if (fieldmap[tagid] == -1 ||
- tagid <= ctx.fields[fieldmap[tagid]].tagid) {
- ctx.field = NULL;
+ (!ISFEEDFIELDMULTI(fieldmap[tagid]) &&
+ tagid <= ctx.fields[fieldmap[tagid]].tagid)) {
return;
}
- if (fieldmap[tagid] == FeedFieldContent) {
- /* handle default content-type per tag, Atom, RSS, MRSS. */
- switch (tagid) {
- case RSSTagContentEncoded:
- case RSSTagDescription:
- ctx.contenttype = ContentTypeHTML;
- break;
- default:
- ctx.contenttype = ContentTypePlain;
- }
- ctx.iscontenttag = 1;
- } else {
- ctx.iscontenttag = 0;
- }
-
- ctx.field = &(ctx.fields[fieldmap[tagid]].str);
- ctx.fields[fieldmap[tagid]].tagid = tagid;
- /* clear field */
- string_clear(ctx.field);
-}
-
-static void
-xmltagstartparsed(XMLParser *p, const char *tag, size_t taglen, int isshort)
-{
if (ctx.iscontenttag) {
ctx.iscontent = 1;
ctx.iscontenttag = 0;
- return;
+ ctx.contenttype = tmpcontenttype;
}
- /* don't read field value in Atom <link> tag */
- if (ctx.tag->id == AtomTagLink)
- ctx.field = NULL;
-
- if (!ISINCONTENT(ctx) || ctx.contenttype != ContentTypeHTML)
- return;
+ ctx.field = &(ctx.fields[fieldmap[tagid]].str);
+ ctx.fields[fieldmap[tagid]].tagid = tagid;
- if (isshort)
- xmldata(p, "/>", 2);
- else
- xmldata(p, ">", 1);
+ /* clear field if it is overwritten (with a priority order) for the new
+ value, if the field can have multiple values then do not clear it. */
+ if (!ISFEEDFIELDMULTI(fieldmap[ctx.tag.id]))
+ string_clear(ctx.field);
}
static void
@@ -860,7 +955,7 @@ xmltagend(XMLParser *p, const char *t, size_t tl, int isshort)
if (ISINCONTENT(ctx)) {
/* not close content field */
- if (!istag(ctx.tag->name, ctx.tag->len, t, tl)) {
+ if (!istag(ctx.tag.name, ctx.tag.len, t, tl)) {
if (!isshort && ctx.contenttype == ContentTypeHTML) {
xmldata(p, "</", 2);
xmldata(p, t, tl);
@@ -868,23 +963,18 @@ xmltagend(XMLParser *p, const char *t, size_t tl, int isshort)
}
return;
}
- } else if (ctx.tag->id == AtomTagLink) {
- /* map tag type to field: unknown or lesser priority is ignored,
- when tags of the same type are repeated only the first is used. */
- if (atomlinktype && atomlinktype > ctx.fields[fieldmap[atomlinktype]].tagid) {
- string_append(&ctx.fields[fieldmap[atomlinktype]].str,
- atomlink.data, atomlink.len);
- ctx.fields[fieldmap[atomlinktype]].tagid = atomlinktype;
- }
- } else if (ctx.tag->id == RSSTagGuid && rssidpermalink) {
- if (ctx.tag->id > ctx.fields[FeedFieldLink].tagid) {
+ } else if (ctx.tag.id && istag(ctx.tag.name, ctx.tag.len, t, tl)) {
+ /* matched tag end: close it */
+ /* copy also to the link field if the attribute isPermaLink="true"
+ and it is not set by a tag with higher prio. */
+ if (ctx.tag.id == RSSTagGuidPermalinkTrue && ctx.field &&
+ ctx.tag.id > ctx.fields[FeedFieldLink].tagid) {
string_clear(&ctx.fields[FeedFieldLink].str);
string_append(&ctx.fields[FeedFieldLink].str,
- ctx.fields[FeedFieldId].str.data,
- ctx.fields[FeedFieldId].str.len);
- ctx.fields[FeedFieldLink].tagid = ctx.tag->id;
+ ctx.field->data, ctx.field->len);
+ ctx.fields[FeedFieldLink].tagid = ctx.tag.id;
}
- } else if (!ctx.tag->id && ((ctx.feedtype == FeedTypeAtom &&
+ } else if (!ctx.tag.id && ((ctx.feedtype == FeedTypeAtom &&
istag(t, tl, STRP("entry"))) || /* Atom */
(ctx.feedtype == FeedTypeRSS &&
istag(t, tl, STRP("item"))))) /* RSS */
@@ -900,14 +990,33 @@ xmltagend(XMLParser *p, const char *t, size_t tl, int isshort)
ctx.contenttype = ContentTypeNone;
/* allow parsing of Atom and RSS concatenated in one XML stream. */
ctx.feedtype = FeedTypeNone;
- } else if (!ctx.tag->id ||
- !istag(ctx.tag->name, ctx.tag->len, t, tl)) {
- /* not end of field */
- return;
+ } else {
+ return; /* not end of field */
+ }
+
+ /* temporary string: for fields that cannot be processed
+ directly and need more context, for example by it's tag
+ attributes, like the Atom link rel="alternate|enclosure". */
+ if (tmpstr.len && ctx.field) {
+ if (ISFEEDFIELDMULTI(fieldmap[ctx.tag.id])) {
+ if (ctx.field->len)
+ string_append(ctx.field, FieldMultiSeparator, 1);
+ string_append(ctx.field, tmpstr.data, tmpstr.len);
+ } else {
+ string_clear(ctx.field);
+ string_append(ctx.field, tmpstr.data, tmpstr.len);
+ }
}
+
/* close field */
+ string_clear(&tmpstr); /* reuse and clear temporary string */
+
+ if (ctx.tag.id == AtomTagAuthorName)
+ memcpy(&(ctx.tag), &atomtagauthor, sizeof(ctx.tag)); /* outer tag */
+ else
+ memcpy(&(ctx.tag), &notag, sizeof(ctx.tag));
+
ctx.iscontent = 0;
- ctx.tag = &notag;
ctx.field = NULL;
}
@@ -917,14 +1026,20 @@ main(int argc, char *argv[])
if (pledge("stdio", NULL) == -1)
err(1, "pledge");
- if (argc > 1)
- baseurl = argv[1];
+ if (argc > 1) {
+ if (uri_parse(argv[1], &baseuri) != -1 && baseuri.proto[0])
+ baseurl = argv[1];
+ else
+ errx(1, "baseurl incorrect or too long");
+ }
+
+ memcpy(&(ctx.tag), &notag, sizeof(ctx.tag));
parser.xmlattr = xmlattr;
parser.xmlattrentity = xmlattrentity;
parser.xmlattrend = xmlattrend;
parser.xmlattrstart = xmlattrstart;
- parser.xmlcdata = xmlcdata;
+ parser.xmlcdata = xmldata;
parser.xmldata = xmldata;
parser.xmldataentity = xmldataentity;
parser.xmltagend = xmltagend;
diff --git a/sfeed_atom.c b/sfeed_atom.c
@@ -7,6 +7,7 @@
#include "util.h"
+static struct tm tmnow;
static time_t now;
static char *line;
static size_t linesize;
@@ -47,12 +48,6 @@ printfeed(FILE *fp, const char *feedname)
line[--linelen] = '\0';
parseline(line, fields);
- parsedtime = 0;
- if (strtotime(fields[FieldUnixTimestamp], &parsedtime))
- continue;
- if (!(tm = localtime(&parsedtime)))
- err(1, "localtime");
-
fputs("<entry>\n\t<title>", stdout);
if (feedname[0]) {
fputs("[", stdout);
@@ -78,9 +73,15 @@ printfeed(FILE *fp, const char *feedname)
xmlencode(fields[FieldEnclosure], stdout);
fputs("\" />\n", stdout);
}
+
+ parsedtime = 0;
+ if (strtotime(fields[FieldUnixTimestamp], &parsedtime) ||
+ !(tm = gmtime(&parsedtime)))
+ tm = &tmnow;
fprintf(stdout, "\t<updated>%04d-%02d-%02dT%02d:%02d:%02dZ</updated>\n",
tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday,
tm->tm_hour, tm->tm_min, tm->tm_sec);
+
if (fields[FieldAuthor][0]) {
fputs("\t<author><name>", stdout);
xmlencode(fields[FieldAuthor], stdout);
@@ -110,18 +111,13 @@ main(int argc, char *argv[])
char *name;
int i;
- if (argc == 1) {
- if (pledge("stdio", NULL) == -1)
- err(1, "pledge");
- } else {
- if (pledge("stdio rpath", NULL) == -1)
- err(1, "pledge");
- }
+ if (pledge(argc == 1 ? "stdio" : "stdio rpath", NULL) == -1)
+ err(1, "pledge");
if ((now = time(NULL)) == -1)
err(1, "time");
- if (!(tm = gmtime(&now)))
- err(1, "gmtime");
+ if (!(tm = gmtime_r(&now, &tmnow)))
+ err(1, "gmtime_r");
fputs("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
"<feed xmlns=\"http://www.w3.org/2005/Atom\">\n"
diff --git a/sfeed_frames.c b/sfeed_frames.c
@@ -8,7 +8,7 @@
#include "util.h"
-static struct feed **feeds;
+static struct feed *feeds;
static char *line;
static size_t linesize;
static time_t comparetime;
@@ -20,7 +20,7 @@ printfeed(FILE *fpitems, FILE *fpin, struct feed *f)
char *fields[FieldLast];
ssize_t linelen;
unsigned int isnew;
- struct tm *tm;
+ struct tm rtm, *tm;
time_t parsedtime;
/* menu if not unnamed */
@@ -41,19 +41,20 @@ printfeed(FILE *fpitems, FILE *fpin, struct feed *f)
parseline(line, fields);
parsedtime = 0;
- if (strtotime(fields[FieldUnixTimestamp], &parsedtime))
- continue;
- if (!(tm = localtime(&parsedtime)))
- err(1, "localtime");
-
- isnew = (parsedtime >= comparetime) ? 1 : 0;
- totalnew += isnew;
- f->totalnew += isnew;
+ if (!strtotime(fields[FieldUnixTimestamp], &parsedtime) &&
+ (tm = localtime_r(&parsedtime, &rtm))) {
+ isnew = (parsedtime >= comparetime) ? 1 : 0;
+ totalnew += isnew;
+ f->totalnew += isnew;
+ fprintf(fpitems, "%04d-%02d-%02d %02d:%02d ",
+ tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday,
+ tm->tm_hour, tm->tm_min);
+ } else {
+ isnew = 0;
+ fputs(" ", fpitems);
+ }
f->total++;
- fprintf(fpitems, "%04d-%02d-%02d %02d:%02d ",
- tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday,
- tm->tm_hour, tm->tm_min);
if (fields[FieldLink][0]) {
fputs("<a href=\"", fpitems);
xmlencode(fields[FieldLink], fpitems);
@@ -82,7 +83,7 @@ main(int argc, char *argv[])
if (pledge("stdio rpath wpath cpath", NULL) == -1)
err(1, "pledge");
- if (!(feeds = calloc(argc, sizeof(struct feed *))))
+ if (!(feeds = calloc(argc, sizeof(struct feed))))
err(1, "calloc");
if ((comparetime = time(NULL)) == -1)
@@ -111,20 +112,16 @@ main(int argc, char *argv[])
"<body class=\"frame\"><div id=\"items\">", fpitems);
if (argc == 1) {
- if (!(feeds[0] = calloc(1, sizeof(struct feed))))
- err(1, "calloc");
- feeds[0]->name = "";
- printfeed(fpitems, stdin, feeds[0]);
+ feeds[0].name = "";
+ printfeed(fpitems, stdin, &feeds[0]);
} else {
for (i = 1; i < argc; i++) {
- if (!(feeds[i - 1] = calloc(1, sizeof(struct feed))))
- err(1, "calloc");
name = ((name = strrchr(argv[i], '/'))) ? name + 1 : argv[i];
- feeds[i - 1]->name = name;
+ feeds[i - 1].name = name;
if (!(fp = fopen(argv[i], "r")))
err(1, "fopen: %s", argv[i]);
- printfeed(fpitems, fp, feeds[i - 1]);
+ printfeed(fpitems, fp, &feeds[i - 1]);
if (ferror(fp))
err(1, "ferror: %s", argv[i]);
fclose(fp);
@@ -142,7 +139,7 @@ main(int argc, char *argv[])
"<body class=\"frame\">\n<div id=\"sidebar\">\n", fpmenu);
for (i = 1; i < argc; i++) {
- f = feeds[i - 1];
+ f = &feeds[i - 1];
if (f->totalnew)
fputs("<a class=\"n\" href=\"items.html#", fpmenu);
else
diff --git a/sfeed_gopher.1 b/sfeed_gopher.1
@@ -1,4 +1,4 @@
-.Dd March 15, 2020
+.Dd January 3, 2021
.Dt SFEED_GOPHER 1
.Os
.Sh NAME
@@ -46,6 +46,11 @@ The default is "70".
.El
.Sh EXIT STATUS
.Ex -std
+.Sh EXAMPLES
+.Bd -literal
+SFEED_GOPHER_HOST="codemadness.org" SFEED_GOPHER_PATH="/feeds/" \\
+ sfeed_gopher ~/.sfeed/feeds/*
+.Ed
.Sh SEE ALSO
.Xr sfeed 1 ,
.Xr sfeed 5
diff --git a/sfeed_gopher.c b/sfeed_gopher.c
@@ -28,7 +28,7 @@ gophertext(FILE *fp, const char *s)
fputs(" ", fp);
break;
default:
- fputc(*s, fp);
+ putc(*s, fp);
break;
}
}
@@ -38,10 +38,11 @@ static void
printfeed(FILE *fpitems, FILE *fpin, struct feed *f)
{
struct uri u;
- char *fields[FieldLast], *itemhost, *itemport, *itempath;
+ char *fields[FieldLast];
+ char *itemhost, *itemport, *itempath, *itemquery, *itemfragment;
ssize_t linelen;
unsigned int isnew;
- struct tm *tm;
+ struct tm rtm, *tm;
time_t parsedtime;
int itemtype;
@@ -55,29 +56,24 @@ printfeed(FILE *fpitems, FILE *fpin, struct feed *f)
line[--linelen] = '\0';
parseline(line, fields);
- parsedtime = 0;
- if (strtotime(fields[FieldUnixTimestamp], &parsedtime))
- continue;
- if (!(tm = localtime(&parsedtime)))
- err(1, "localtime");
-
- isnew = (parsedtime >= comparetime) ? 1 : 0;
- f->totalnew += isnew;
- f->total++;
-
itemhost = host;
itemport = port;
itemtype = 'i';
itempath = fields[FieldLink];
+ itemquery = "";
+ itemfragment = "";
if (fields[FieldLink][0]) {
itemtype = 'h';
+ /* if it's a gopher URL then change it into a direntry */
if (!strncmp(fields[FieldLink], "gopher://", 9) &&
- parseuri(fields[FieldLink], &u, 0) != -1) {
+ uri_parse(fields[FieldLink], &u) != -1) {
itemhost = u.host;
itemport = u.port[0] ? u.port : "70";
itemtype = '1';
itempath = u.path;
+ itemquery = u.query;
+ itemfragment = u.fragment;
if (itempath[0] == '/') {
itempath++;
@@ -89,16 +85,35 @@ printfeed(FILE *fpitems, FILE *fpin, struct feed *f)
}
}
- fprintf(fpitems, "%c%c %04d-%02d-%02d %02d:%02d ",
- itemtype,
- isnew ? 'N' : ' ',
- tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday,
- tm->tm_hour, tm->tm_min);
+ parsedtime = 0;
+ if (!strtotime(fields[FieldUnixTimestamp], &parsedtime) &&
+ (tm = localtime_r(&parsedtime, &rtm))) {
+ isnew = (parsedtime >= comparetime) ? 1 : 0;
+ f->totalnew += isnew;
+
+ fprintf(fpitems, "%c%c %04d-%02d-%02d %02d:%02d ",
+ itemtype,
+ isnew ? 'N' : ' ',
+ tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday,
+ tm->tm_hour, tm->tm_min);
+ } else {
+ fprintf(fpitems, "%c ", itemtype);
+ }
+ f->total++;
+
gophertext(fpitems, fields[FieldTitle]);
fputs("\t", fpitems);
if (itemtype == 'h' && fields[FieldLink] == itempath)
fputs("URL:", fpitems);
gophertext(fpitems, itempath);
+ if (itemquery[0]) {
+ fputs("?", fpitems);
+ gophertext(fpitems, itemquery);
+ }
+ if (itemfragment[0]) {
+ fputs("#", fpitems);
+ gophertext(fpitems, itemfragment);
+ }
fprintf(fpitems, "\t%s\t%s\r\n", itemhost, itemport);
}
fputs(".\r\n", fpitems);
@@ -111,8 +126,17 @@ main(int argc, char *argv[])
char *name, *p, path[PATH_MAX + 1];
int i, r;
- if (pledge(argc == 1 ? "stdio" : "stdio rpath wpath cpath", NULL) == -1)
- err(1, "pledge");
+ if (argc == 1) {
+ if (pledge("stdio", NULL) == -1)
+ err(1, "pledge");
+ } else {
+ if (unveil("/", "r") == -1)
+ err(1, "unveil");
+ if (unveil(".", "rwc") == -1)
+ err(1, "unveil");
+ if (pledge("stdio rpath wpath cpath", NULL) == -1)
+ err(1, "pledge");
+ }
if ((comparetime = time(NULL)) == -1)
err(1, "time");
diff --git a/sfeed_html.c b/sfeed_html.c
@@ -8,7 +8,7 @@
#include "util.h"
-static struct feed **feeds;
+static struct feed *feeds;
static int showsidebar;
static char *line;
static size_t linesize;
@@ -19,7 +19,7 @@ static void
printfeed(FILE *fp, struct feed *f)
{
char *fields[FieldLast];
- struct tm *tm;
+ struct tm rtm, *tm;
time_t parsedtime;
unsigned int isnew;
ssize_t linelen;
@@ -41,20 +41,21 @@ printfeed(FILE *fp, struct feed *f)
parseline(line, fields);
parsedtime = 0;
- if (strtotime(fields[FieldUnixTimestamp], &parsedtime))
- continue;
- if (!(tm = localtime(&parsedtime)))
- err(1, "localtime");
-
- isnew = (parsedtime >= comparetime) ? 1 : 0;
- totalnew += isnew;
- f->totalnew += isnew;
+ if (!strtotime(fields[FieldUnixTimestamp], &parsedtime) &&
+ (tm = localtime_r(&parsedtime, &rtm))) {
+ isnew = (parsedtime >= comparetime) ? 1 : 0;
+ totalnew += isnew;
+ f->totalnew += isnew;
+
+ fprintf(stdout, "%04d-%02d-%02d %02d:%02d ",
+ tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday,
+ tm->tm_hour, tm->tm_min);
+ } else {
+ isnew = 0;
+ fputs(" ", stdout);
+ }
f->total++;
- fprintf(stdout, "%04d-%02d-%02d %02d:%02d ",
- tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday,
- tm->tm_hour, tm->tm_min);
-
if (fields[FieldLink][0]) {
fputs("<a href=\"", stdout);
xmlencode(fields[FieldLink], stdout);
@@ -83,7 +84,7 @@ main(int argc, char *argv[])
if (pledge(argc == 1 ? "stdio" : "stdio rpath", NULL) == -1)
err(1, "pledge");
- if (!(feeds = calloc(argc, sizeof(struct feed *))))
+ if (!(feeds = calloc(argc, sizeof(struct feed))))
err(1, "calloc");
if ((comparetime = time(NULL)) == -1)
err(1, "time");
@@ -106,21 +107,17 @@ main(int argc, char *argv[])
fputs("\t\t<div id=\"items\" class=\"nosidebar\">\n", stdout);
if (argc == 1) {
- if (!(feeds[0] = calloc(1, sizeof(struct feed))))
- err(1, "calloc");
- feeds[0]->name = "";
- printfeed(stdin, feeds[0]);
+ feeds[0].name = "";
+ printfeed(stdin, &feeds[0]);
if (ferror(stdin))
err(1, "ferror: <stdin>:");
} else {
for (i = 1; i < argc; i++) {
- if (!(feeds[i - 1] = calloc(1, sizeof(struct feed))))
- err(1, "calloc");
name = ((name = strrchr(argv[i], '/'))) ? name + 1 : argv[i];
- feeds[i - 1]->name = name;
+ feeds[i - 1].name = name;
if (!(fp = fopen(argv[i], "r")))
err(1, "fopen: %s", argv[i]);
- printfeed(fp, feeds[i - 1]);
+ printfeed(fp, &feeds[i - 1]);
if (ferror(fp))
err(1, "ferror: %s", argv[i]);
fclose(fp);
@@ -132,7 +129,7 @@ main(int argc, char *argv[])
fputs("\t<div id=\"sidebar\">\n\t\t<ul>\n", stdout);
for (i = 1; i < argc; i++) {
- f = feeds[i - 1];
+ f = &feeds[i - 1];
if (f->totalnew > 0)
fputs("<li class=\"n\"><a href=\"#", stdout);
else
diff --git a/sfeed_mbox.c b/sfeed_mbox.c
@@ -9,7 +9,7 @@
static char *line;
static size_t linesize;
-static char host[256], *user, mtimebuf[32];
+static char host[256], *user, dtimebuf[32], mtimebuf[32];
static unsigned long
djb2(unsigned char *s, unsigned long hash)
@@ -36,17 +36,18 @@ printfeed(FILE *fp, const char *feedname)
hash = djb2((unsigned char *)line, 5381UL);
parseline(line, fields);
- parsedtime = 0;
- if (strtotime(fields[FieldUnixTimestamp], &parsedtime))
- continue;
- if (!(tm = gmtime(&parsedtime)))
- err(1, "gmtime");
- if (!strftime(timebuf, sizeof(timebuf), "%a, %d %b %Y %H:%M:%S +0000", tm))
- errx(1, "strftime");
-
/* mbox + mail header */
printf("From MAILER-DAEMON %s\n", mtimebuf);
- printf("Date: %s\n", timebuf);
+
+ parsedtime = 0;
+ if (!strtotime(fields[FieldUnixTimestamp], &parsedtime) &&
+ (tm = gmtime(&parsedtime)) &&
+ strftime(timebuf, sizeof(timebuf), "%a, %d %b %Y %H:%M:%S +0000", tm)) {
+ printf("Date: %s\n", timebuf);
+ } else {
+ printf("Date: %s\n", dtimebuf); /* invalid/missing: use current time */
+ }
+
printf("From: %s <sfeed@>\n", fields[FieldAuthor][0] ? fields[FieldAuthor] : feedname);
printf("To: %s <%s@%s>\n", user, user, host);
printf("Subject: %s\n", fields[FieldTitle]);
@@ -68,8 +69,8 @@ printfeed(FILE *fp, const char *feedname)
int
main(int argc, char *argv[])
{
- struct tm tm;
- time_t t;
+ struct tm tmnow;
+ time_t now;
FILE *fp;
char *name;
int i;
@@ -81,11 +82,13 @@ main(int argc, char *argv[])
user = "you";
if (gethostname(host, sizeof(host)) == -1)
err(1, "gethostname");
- if ((t = time(NULL)) == -1)
+ if ((now = time(NULL)) == -1)
err(1, "time");
- if (!gmtime_r(&t, &tm))
+ if (!gmtime_r(&now, &tmnow))
err(1, "gmtime_r: can't get current time");
- if (!strftime(mtimebuf, sizeof(mtimebuf), "%a %b %d %H:%M:%S %Y", &tm))
+ if (!strftime(mtimebuf, sizeof(mtimebuf), "%a %b %d %H:%M:%S %Y", &tmnow))
+ errx(1, "strftime: can't format current time");
+ if (!strftime(dtimebuf, sizeof(dtimebuf), "%a, %d %b %Y %H:%M:%S +0000", &tmnow))
errx(1, "strftime: can't format current time");
if (argc == 1) {
diff --git a/sfeed_opml_export b/sfeed_opml_export
@@ -22,7 +22,7 @@ loadconfig() {
fi
}
-# override feeds function to ouput opml XML.
+# override feed function to output OPML XML.
# feed(name, feedurl, [basesiteurl], [encoding])
feed() {
# NOTE: TABs in field values are unsupported, be sane.
diff --git a/sfeed_opml_export.1 b/sfeed_opml_export.1
@@ -1,4 +1,4 @@
-.Dd March 15, 2020
+.Dd January 3, 2021
.Dt SFEED_OPML_EXPORT 1
.Os
.Sh NAME
@@ -24,6 +24,10 @@ section for more information.
.El
.Sh EXIT STATUS
.Ex -std
+.Sh EXAMPLES
+.Bd -literal
+sfeed_opml_export ~/.sfeed/sfeedrc > opmlfile.xml
+.Ed
.Sh SEE ALSO
.Xr sfeed_update 1 ,
.Xr sfeedrc 5
diff --git a/sfeed_opml_import.1 b/sfeed_opml_import.1
@@ -1,9 +1,9 @@
-.Dd March 15, 2020
+.Dd January 15, 2021
.Dt SFEED_OPML_IMPORT 1
.Os
.Sh NAME
.Nm sfeed_opml_import
-.Nd generate a sfeedrc config file from an OPML file
+.Nd generate a sfeedrc config file from an OPML subscription list
.Sh SYNOPSIS
.Nm
.Sh DESCRIPTION
@@ -11,6 +11,10 @@
reads the OPML XML data from stdin and writes the config file text to stdout.
.Sh EXIT STATUS
.Ex -std
+.Sh EXAMPLES
+.Bd -literal
+sfeed_opml_import < opmlfile.xml > ~/.sfeed/sfeedrc
+.Ed
.Sh SEE ALSO
.Xr sfeed_update 1 ,
.Xr sfeedrc 5
diff --git a/sfeed_opml_import.c b/sfeed_opml_import.c
@@ -7,7 +7,7 @@
#include "xml.h"
static XMLParser parser; /* XML parser state */
-static char url[2048], text[256], title[256];
+static char text[256], title[256], xmlurl[2048];
static void
printsafe(const char *s)
@@ -27,7 +27,7 @@ printsafe(const char *s)
static void
xmltagstart(XMLParser *p, const char *t, size_t tl)
{
- url[0] = text[0] = title[0] = '\0';
+ text[0] = title[0] = xmlurl[0] = '\0';
}
static void
@@ -36,8 +36,9 @@ xmltagend(XMLParser *p, const char *t, size_t tl, int isshort)
if (strcasecmp(t, "outline"))
return;
- if (url[0]) {
+ if (xmlurl[0]) {
fputs("\tfeed '", stdout);
+ /* prefer title over text attribute */
if (title[0])
printsafe(title);
else if (text[0])
@@ -45,11 +46,11 @@ xmltagend(XMLParser *p, const char *t, size_t tl, int isshort)
else
fputs("unnamed", stdout);
fputs("' '", stdout);
- printsafe(url);
+ printsafe(xmlurl);
fputs("'\n", stdout);
}
- url[0] = text[0] = title[0] = '\0';
+ text[0] = title[0] = xmlurl[0] = '\0';
}
static void
@@ -59,12 +60,12 @@ xmlattr(XMLParser *p, const char *t, size_t tl, const char *n, size_t nl,
if (strcasecmp(t, "outline"))
return;
- if (!strcasecmp(n, "title"))
- strlcat(title, v, sizeof(title));
- else if (!strcasecmp(n, "text"))
+ if (!strcasecmp(n, "text"))
strlcat(text, v, sizeof(text));
+ else if (!strcasecmp(n, "title"))
+ strlcat(title, v, sizeof(title));
else if (!strcasecmp(n, "xmlurl"))
- strlcat(url, v, sizeof(url));
+ strlcat(xmlurl, v, sizeof(xmlurl));
}
static void
diff --git a/sfeed_plain.1 b/sfeed_plain.1
@@ -1,4 +1,4 @@
-.Dd March 15, 2020
+.Dd January 1, 2021
.Dt SFEED_PLAIN 1
.Os
.Sh NAME
@@ -40,6 +40,10 @@ and
.Xr wcwidth 3 .
.Sh EXIT STATUS
.Ex -std
+.Sh EXAMPLES
+.Bd -literal
+curl -s 'https://codemadness.org/atom.xml' | sfeed | sfeed_plain
+.Ed
.Sh SEE ALSO
.Xr sfeed 1 ,
.Xr sfeed_html 1 ,
diff --git a/sfeed_plain.c b/sfeed_plain.c
@@ -16,7 +16,7 @@ static void
printfeed(FILE *fp, const char *feedname)
{
char *fields[FieldLast];
- struct tm *tm;
+ struct tm rtm, *tm;
time_t parsedtime;
ssize_t linelen;
@@ -26,19 +26,19 @@ printfeed(FILE *fp, const char *feedname)
parseline(line, fields);
parsedtime = 0;
- if (strtotime(fields[FieldUnixTimestamp], &parsedtime))
- continue;
- if (!(tm = localtime(&parsedtime)))
- err(1, "localtime");
-
- if (parsedtime >= comparetime)
- fputs("N ", stdout);
- else
- fputs(" ", stdout);
+ if (!strtotime(fields[FieldUnixTimestamp], &parsedtime) &&
+ (tm = localtime_r(&parsedtime, &rtm))) {
+ if (parsedtime >= comparetime)
+ fputs("N ", stdout);
+ else
+ fputs(" ", stdout);
+ fprintf(stdout, "%04d-%02d-%02d %02d:%02d ",
+ tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday,
+ tm->tm_hour, tm->tm_min);
+ } else {
+ fputs(" ", stdout);
+ }
- fprintf(stdout, "%04d-%02d-%02d %02d:%02d ",
- tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday,
- tm->tm_hour, tm->tm_min);
if (feedname[0]) {
printutf8pad(stdout, feedname, 15, ' ');
fputs(" ", stdout);
diff --git a/sfeed_twtxt.c b/sfeed_twtxt.c
@@ -24,14 +24,14 @@ printfeed(FILE *fp, const char *feedname)
parseline(line, fields);
parsedtime = 0;
- if (strtotime(fields[FieldUnixTimestamp], &parsedtime))
- continue;
- if (!(tm = gmtime(&parsedtime)))
- err(1, "gmtime");
-
- fprintf(stdout, "%04d-%02d-%02dT%02d:%02d:%02dZ\t",
- tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday,
- tm->tm_hour, tm->tm_min, tm->tm_sec);
+ if (!strtotime(fields[FieldUnixTimestamp], &parsedtime) &&
+ (tm = gmtime(&parsedtime))) {
+ fprintf(stdout, "%04d-%02d-%02dT%02d:%02d:%02dZ\t",
+ tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday,
+ tm->tm_hour, tm->tm_min, tm->tm_sec);
+ } else {
+ fputs("\t", stdout);
+ }
if (feedname[0])
printf("[%s] ", feedname);
fputs(fields[FieldTitle], stdout);
diff --git a/sfeed_update b/sfeed_update
@@ -15,7 +15,7 @@ loadconfig() {
# allow to specify config via argv[1].
if [ "$1" != "" ]; then
# get absolute path of config file.
- config=$(readlink -f "$1")
+ config=$(readlink -f "$1" 2>/dev/null)
else
# default config location.
config="$HOME/.sfeed/sfeedrc"
@@ -25,7 +25,7 @@ loadconfig() {
if [ -r "${config}" ]; then
. "${config}"
else
- echo "Configuration file \"${config}\" does not exist or is not readable." >&2
+ echo "Configuration file \"$1\" does not exist or is not readable." >&2
echo "See sfeedrc.example for an example." >&2
exit 1
fi
@@ -45,16 +45,22 @@ fetch() {
}
# convert encoding from one encoding to another.
-# convertencoding(from, to)
+# convertencoding(name, from, to)
convertencoding() {
- if [ "$1" != "" ] && [ "$2" != "" ] && [ "$1" != "$2" ]; then
- iconv -cs -f "$1" -t "$2" 2> /dev/null
+ if [ "$2" != "" ] && [ "$3" != "" ] && [ "$2" != "$3" ]; then
+ iconv -cs -f "$2" -t "$3" 2> /dev/null
else
- # else no convert, just output
+ # else no convert, just output.
cat
fi
}
+# parse and convert input, by default XML to the sfeed(5) TSV format.
+# parse(name, feedurl, basesiteurl)
+parse() {
+ sfeed "$3"
+}
+
# filter fields.
# filter(name)
filter() {
@@ -73,25 +79,21 @@ order() {
sort -t ' ' -k1rn,1
}
-# fetch and parse feed.
-# feed(name, feedurl, [basesiteurl], [encoding])
-feed() {
- # wait until ${maxjobs} are finished: will stall the queue if an item
- # is slow, but it is portable.
- [ ${signo} -ne 0 ] && return
- [ $((curjobs % maxjobs)) -eq 0 ] && wait
- [ ${signo} -ne 0 ] && return
- curjobs=$((curjobs + 1))
-
- (name="$1"
- filename="$(printf '%s' "$1" | tr '/' '_')"
+# internal handler to fetch and process a feed.
+# _feed(name, feedurl, [basesiteurl], [encoding])
+_feed() {
+ name="$1"
feedurl="$2"
basesiteurl="$3"
encoding="$4"
+ filename="$(printf '%s' "${name}" | tr '/' '_')"
sfeedfile="${sfeedpath}/${filename}"
tmpfeedfile="${sfeedtmpdir}/${filename}"
+ # if file does not exist yet create it.
+ [ -e "${sfeedfile}" ] || touch "${sfeedfile}" 2>/dev/null
+
if ! fetch "${name}" "${feedurl}" "${sfeedfile}" > "${tmpfeedfile}.fetch"; then
log "${name}" "FAIL (FETCH)"
return
@@ -100,14 +102,15 @@ feed() {
# try to detect encoding (if not specified). if detecting the encoding fails assume utf-8.
[ "${encoding}" = "" ] && encoding=$(sfeed_xmlenc < "${tmpfeedfile}.fetch")
- if ! convertencoding "${encoding}" "utf-8" < "${tmpfeedfile}.fetch" > "${tmpfeedfile}.utf8"; then
+ if ! convertencoding "${name}" "${encoding}" "utf-8" < "${tmpfeedfile}.fetch" > "${tmpfeedfile}.utf8"; then
log "${name}" "FAIL (ENCODING)"
return
fi
rm -f "${tmpfeedfile}.fetch"
- if ! sfeed "${basesiteurl}" < "${tmpfeedfile}.utf8" > "${tmpfeedfile}.tsv"; then
- log "${name}" "FAIL (CONVERT)"
+ # if baseurl is empty then use feedurl.
+ if ! parse "${name}" "${feedurl}" "${basesiteurl:-${feedurl}}" < "${tmpfeedfile}.utf8" > "${tmpfeedfile}.tsv"; then
+ log "${name}" "FAIL (PARSE)"
return
fi
rm -f "${tmpfeedfile}.utf8"
@@ -124,14 +127,7 @@ feed() {
return
fi
- # if file does not exist yet "merge" with /dev/null.
- if [ -e "${sfeedfile}" ]; then
- oldfile="${sfeedfile}"
- else
- oldfile="/dev/null"
- fi
-
- if ! merge "${name}" "${oldfile}" "${tmpfeedfile}.filter" > "${tmpfeedfile}.merge"; then
+ if ! merge "${name}" "${sfeedfile}" "${tmpfeedfile}.filter" > "${tmpfeedfile}.merge"; then
log "${name}" "FAIL (MERGE)"
return
fi
@@ -152,7 +148,19 @@ feed() {
# OK
log "${name}" "OK"
- ) &
+}
+
+# fetch and process a feed in parallel.
+# feed(name, feedurl, [basesiteurl], [encoding])
+feed() {
+ # wait until ${maxjobs} are finished: will stall the queue if an item
+ # is slow, but it is portable.
+ [ ${signo} -ne 0 ] && return
+ [ $((curjobs % maxjobs)) -eq 0 ] && wait
+ [ ${signo} -ne 0 ] && return
+ curjobs=$((curjobs + 1))
+
+ _feed "$@" &
}
cleanup() {
@@ -173,26 +181,30 @@ feeds() {
echo "See sfeedrc.example for an example." >&2
}
-# job counter.
-curjobs=0
-# signal number received for parent.
-signo=0
-# SIGINT: signal to interrupt parent.
-trap -- "sighandler 2" "INT"
-# SIGTERM: signal to terminate parent.
-trap -- "sighandler 15" "TERM"
-# load config file.
-loadconfig "$1"
-# fetch feeds and store in temporary directory.
-sfeedtmpdir="$(mktemp -d '/tmp/sfeed_XXXXXX')"
-# make sure path exists.
-mkdir -p "${sfeedpath}"
-# fetch feeds specified in config file.
-feeds
-# wait till all feeds are fetched (concurrently).
-[ ${signo} -eq 0 ] && wait
-# cleanup temporary files etc.
-cleanup
-# on signal SIGINT and SIGTERM exit with signal number + 128.
-[ ${signo} -ne 0 ] && exit $((signo+128))
-exit 0
+main() {
+ # job counter.
+ curjobs=0
+ # signal number received for parent.
+ signo=0
+ # SIGINT: signal to interrupt parent.
+ trap -- "sighandler 2" "INT"
+ # SIGTERM: signal to terminate parent.
+ trap -- "sighandler 15" "TERM"
+ # load config file.
+ loadconfig "$1"
+ # fetch feeds and store in temporary directory.
+ sfeedtmpdir="$(mktemp -d '/tmp/sfeed_XXXXXX')"
+ # make sure path exists.
+ mkdir -p "${sfeedpath}"
+ # fetch feeds specified in config file.
+ feeds
+ # wait till all feeds are fetched (concurrently).
+ [ ${signo} -eq 0 ] && wait
+ # cleanup temporary files etc.
+ cleanup
+ # on signal SIGINT and SIGTERM exit with signal number + 128.
+ [ ${signo} -ne 0 ] && exit $((signo+128))
+ return 0
+}
+
+[ "${SFEED_UPDATE_INCLUDE}" = "1" ] || main "$@"
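Note on the feed() wrapper in the hunk above: it starts each _feed in the background and calls wait every time ${maxjobs} jobs have been started. The following is a minimal, self-contained sketch of that batching pattern only. job(), run(), outdir and the value of maxjobs are hypothetical stand-ins for this demo, and the signal checks from the real function are left out.

```shell
#!/bin/sh
# Sketch of the batching pattern: run jobs in the background and wait
# each time ${maxjobs} jobs have been started.
maxjobs=2                # assumed value for the demo
curjobs=0
outdir="$(mktemp -d)"    # hypothetical scratch directory

# job(name): hypothetical stand-in for _feed(); records that it ran.
job() {
	echo "done" > "${outdir}/$1"
}

# run(name): wait first when the previous batch is full, then start the job.
run() {
	[ $((curjobs % maxjobs)) -eq 0 ] && wait
	curjobs=$((curjobs + 1))
	job "$@" &
}

for n in a b c d e; do
	run "$n"
done
wait # wait for the last (possibly partial) batch.

count="$(ls "${outdir}" | wc -l | tr -d ' ')"
echo "jobs finished: ${count}"
rm -rf "${outdir}"
```

As the comment in the diff says, one slow job stalls its whole batch, but the approach only needs the portable wait builtin.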
diff --git a/sfeed_update.1 b/sfeed_update.1
@@ -1,4 +1,4 @@
-.Dd March 15, 2020
+.Dd February 5, 2021
.Dt SFEED_UPDATE 1
.Os
.Sh NAME
@@ -18,14 +18,14 @@ The directory can be changed in the
.Xr sfeedrc 5
file.
.Sh OPTIONS
-.Bl -tag -width 17n
+.Bl -tag -width Ds
.It Ar sfeedrc
Config file.
The default is
.Pa $HOME/.sfeed/sfeedrc .
.El
.Sh FILES READ
-.Bl -tag -width 17n
+.Bl -tag -width Ds
.It Ar sfeedrc
This file is evaluated as a shellscript in
.Nm .
@@ -34,7 +34,7 @@ See also the
man page for a detailed description of the format and an example file.
.El
.Sh FILES WRITTEN
-.Bl -tag -width 17n
+.Bl -tag -width Ds
.It feedname
TAB-separated
.Xr sfeed 5
@@ -44,6 +44,15 @@ The
script merges new items with this file.
The feedname cannot contain '/' characters, they will be replaced with '_'.
.El
+.Sh ENVIRONMENT VARIABLES
+.Bl -tag -width Ds
+.It SFEED_UPDATE_INCLUDE
+When set to "1"
+.Nm
+can be sourced as a script, but it won't run the
+.Fn main
+entry-point.
+.El
.Sh EXIT STATUS
.Ex -std
.Sh EXAMPLES
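Note on SFEED_UPDATE_INCLUDE as documented in the hunk above: the sketch below shows the include-guard pattern in isolation. It writes a hypothetical stand-in script to a temporary file instead of sourcing the real sfeed_update, so the script contents and the "post-main step" are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical stand-in for sfeed_update, written to a temporary file so
# the demo does not need sfeed installed.
script="$(mktemp)"
cat > "${script}" <<'EOF'
main() {
	echo "main called with: $*"
	return 0
}
[ "${SFEED_UPDATE_INCLUDE}" = "1" ] || main "$@"
EOF

# With the variable set to "1", sourcing only defines the functions.
SFEED_UPDATE_INCLUDE="1"
. "${script}"

# The caller decides when main runs and can add steps after it, which
# works because main returns instead of calling exit on success.
out="$(main "example")"
echo "${out}"
echo "post-main step"
rm -f "${script}"
```

With the real tool the same idea would presumably be to set the variable, source sfeed_update, call main with the sfeedrc path and then run any extra steps.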
diff --git a/sfeed_web.1 b/sfeed_web.1
@@ -1,42 +1,41 @@
-.Dd March 15, 2020
+.Dd March 12, 2021
.Dt SFEED_WEB 1
.Os
.Sh NAME
.Nm sfeed_web
-.Nd finds urls to feeds from a HTML webpage
+.Nd finds URLs to feeds from a HTML webpage
.Sh SYNOPSIS
.Nm
.Op Ar baseurl
.Sh DESCRIPTION
.Nm
reads the HTML website as XML or HTML data from stdin and writes the found
-urls to stdout.
+URLs to stdout.
.Sh OPTIONS
.Bl -tag -width 8n
.It Ar baseurl
-Optional base url to use for found feed urls that are relative.
+Optional base URL to use for found feed URLs that are relative.
.El
.Sh OUTPUT FORMAT
-url<TAB>content\-type<newline>
+url<TAB>content-type<newline>
.Bl -tag -width Ds
.It url
-Found relative or absolute url.
-If the url is relative and the
+Found relative or absolute URL.
+.Pp
+For relative URLs if a <base href="..." /> tag is found it will be used,
+otherwise if the
.Ar baseurl
-option is
-specified then the url is made absolute.
-If the url is relative and no
-.Ar baseurl
-option is specified then it is printed as is.
-.It content\-type
+option is specified then that is used, if neither are set then the relative URL
+is printed.
+.It content-type
Usually application/atom+xml or application/rss+xml.
.El
.Sh EXIT STATUS
.Ex -std
.Sh EXAMPLES
-Get urls from xkcd website:
+Get URLs from xkcd website:
.Bd -literal
-curl -s -L http://www.xkcd.com | sfeed_web "http://www.xkcd.com/"
+curl -s -L 'http://www.xkcd.com/' | sfeed_web 'http://www.xkcd.com/'
.Ed
.Sh SEE ALSO
.Xr sfeed_update 1 ,
diff --git a/sfeed_web.c b/sfeed_web.c
@@ -10,64 +10,115 @@
#define STRP(s) s,sizeof(s)-1
static XMLParser parser;
-static int isbase, islink, isfeedlink;
-static char abslink[4096], feedlink[4096], basehref[4096], feedtype[256];
+static int isbasetag, islinktag, ishrefattr, istypeattr;
+static char linkhref[4096], linktype[256], basehref[4096];
static void
-printfeedtype(const char *s, FILE *fp)
+printvalue(const char *s)
{
for (; *s; s++)
- if (!isspace((unsigned char)*s))
- fputc(*s, fp);
+ if (!iscntrl((unsigned char)*s))
+ putchar(*s);
}
static void
xmltagstart(XMLParser *p, const char *t, size_t tl)
{
- isbase = islink = isfeedlink = 0;
+ isbasetag = islinktag = 0;
- if (!strcasecmp(t, "base"))
- isbase = 1;
- else if (!strcasecmp(t, "link"))
- islink = 1;
+ if (!strcasecmp(t, "base")) {
+ isbasetag = 1;
+ } else if (!strcasecmp(t, "link")) {
+ islinktag = 1;
+ linkhref[0] = '\0';
+ linktype[0] = '\0';
+ }
}
static void
xmltagstartparsed(XMLParser *p, const char *t, size_t tl, int isshort)
{
- if (!isfeedlink)
+ struct uri baseuri, linkuri, u;
+ char buf[4096];
+ int r = -1;
+
+ if (!islinktag)
+ return;
+
+ if (strncasecmp(linktype, STRP("application/atom")) &&
+ strncasecmp(linktype, STRP("application/xml")) &&
+ strncasecmp(linktype, STRP("application/rss")))
return;
- if (absuri(abslink, sizeof(abslink), feedlink, basehref) != -1)
- fputs(abslink, stdout);
+ /* parse base URI each time: it can change. */
+ if (basehref[0] &&
+ uri_parse(linkhref, &linkuri) != -1 && !linkuri.proto[0] &&
+ uri_parse(basehref, &baseuri) != -1 &&
+ uri_makeabs(&u, &linkuri, &baseuri) != -1 && u.proto[0])
+ r = uri_format(buf, sizeof(buf), &u);
+
+ if (r >= 0 && (size_t)r < sizeof(buf))
+ printvalue(buf);
else
- fputs(feedlink, stdout);
+ printvalue(linkhref);
+
putchar('\t');
- printfeedtype(feedtype, stdout);
+ printvalue(linktype);
putchar('\n');
}
static void
+xmlattrstart(XMLParser *p, const char *t, size_t tl, const char *a, size_t al)
+{
+ ishrefattr = istypeattr = 0;
+
+ if (!isbasetag && !islinktag)
+ return;
+
+ if (!strcasecmp(a, "href")) {
+ ishrefattr = 1;
+ if (isbasetag)
+ basehref[0] = '\0';
+ else if (islinktag)
+ linkhref[0] = '\0';
+ } else if (!strcasecmp(a, "type") && islinktag) {
+ istypeattr = 1;
+ linktype[0] = '\0';
+ }
+}
+
+static void
xmlattr(XMLParser *p, const char *t, size_t tl, const char *n, size_t nl,
const char *v, size_t vl)
{
- if (isbase) {
- if (!strcasecmp(n, "href"))
- strlcpy(basehref, v, sizeof(basehref));
- } else if (islink) {
- if (!strcasecmp(n, "type")) {
- if (!strncasecmp(v, STRP("application/atom")) ||
- !strncasecmp(v, STRP("application/xml")) ||
- !strncasecmp(v, STRP("application/rss"))) {
- isfeedlink = 1;
- strlcpy(feedtype, v, sizeof(feedtype));
- }
- } else if (!strcasecmp(n, "href")) {
- strlcpy(feedlink, v, sizeof(feedlink));
- }
+ if (isbasetag && ishrefattr) {
+ strlcat(basehref, v, sizeof(basehref));
+ } else if (islinktag) {
+ if (ishrefattr)
+ strlcat(linkhref, v, sizeof(linkhref));
+ else if (istypeattr)
+ strlcat(linktype, v, sizeof(linktype));
}
}
+static void
+xmlattrentity(XMLParser *p, const char *t, size_t tl, const char *a, size_t al,
+ const char *v, size_t vl)
+{
+ char buf[16];
+ int len;
+
+ if (!ishrefattr && !istypeattr)
+ return;
+
+ /* try to translate entity, else just pass as data to
+ * xmlattr handler. */
+ if ((len = xml_entitytostr(v, buf, sizeof(buf))) > 0)
+ xmlattr(p, t, tl, a, al, buf, (size_t)len);
+ else
+ xmlattr(p, t, tl, a, al, v, vl);
+}
+
int
main(int argc, char *argv[])
{
@@ -78,6 +129,8 @@ main(int argc, char *argv[])
strlcpy(basehref, argv[1], sizeof(basehref));
parser.xmlattr = xmlattr;
+ parser.xmlattrentity = xmlattrentity;
+ parser.xmlattrstart = xmlattrstart;
parser.xmltagstart = xmltagstart;
parser.xmltagstartparsed = xmltagstartparsed;
diff --git a/sfeed_xmlenc.1 b/sfeed_xmlenc.1
@@ -1,19 +1,20 @@
-.Dd March 15, 2020
+.Dd March 12, 2021
.Dt SFEED_XMLENC 1
.Os
.Sh NAME
.Nm sfeed_xmlenc
-.Nd get text\-encoding from XML
+.Nd get text-encoding from XML
.Sh SYNOPSIS
.Nm
.Sh DESCRIPTION
.Nm
-reads XML data from stdin and writes the found text\-encoding to stdout.
-It translates the characters to lowercase and strips control characters.
+reads XML data from stdin and writes the found text-encoding to stdout.
+It converts characters in the encoding-name to lowercase and strips characters
+which are not allowed in an encoding-name.
.Sh EXIT STATUS
.Ex -std
.Sh EXAMPLES
-Get text\-encoding from xkcd Atom feed:
+Get text-encoding from xkcd Atom feed:
.Bd -literal
curl -s -L http://www.xkcd.com/atom.xml | sfeed_xmlenc
.Ed
diff --git a/sfeed_xmlenc.c b/sfeed_xmlenc.c
@@ -14,7 +14,7 @@ static void
xmltagstart(XMLParser *p, const char *t, size_t tl)
{
/* optimization: try to find a processing instruction only at the
- start of the data. */
+ start of the data at the first few starting tags. */
if (tags++ > 3)
exit(0);
}
@@ -26,11 +26,19 @@ xmlattr(XMLParser *p, const char *t, size_t tl, const char *n, size_t nl,
if (strcasecmp(t, "?xml") || strcasecmp(n, "encoding"))
return;
- /* output lowercase, no control characters */
for (; *v; v++) {
- if (!iscntrl((unsigned char)*v))
+ if (isalpha((unsigned char)*v) ||
+ isdigit((unsigned char)*v) ||
+ *v == '.' || *v == ':' || *v == '-' || *v == '_')
putchar(tolower((unsigned char)*v));
}
+}
+
+static void
+xmlattrend(XMLParser *p, const char *t, size_t tl, const char *n, size_t nl)
+{
+ if (strcasecmp(t, "?xml") || strcasecmp(n, "encoding"))
+ return;
putchar('\n');
exit(0);
}
@@ -42,6 +50,8 @@ main(void)
err(1, "pledge");
parser.xmlattr = xmlattr;
+ parser.xmlattrentity = xmlattr; /* no entity conversion */
+ parser.xmlattrend = xmlattrend;
parser.xmltagstart = xmltagstart;
/* NOTE: getnext is defined in xml.h for inline optimization */
diff --git a/sfeedrc.5 b/sfeedrc.5
@@ -1,4 +1,4 @@
-.Dd February 4, 2020
+.Dd January 26, 2021
.Dt SFEEDRC 5
.Os
.Sh NAME
@@ -37,13 +37,13 @@ Name of the feed, this is also used as the filename for the TAB-separated
feed file.
The feedname cannot contain '/' characters, they will be replaced with '_'.
.It Fa feedurl
-Url to fetch the RSS/Atom data from, usually a HTTP or HTTPS url.
+URL to fetch the RSS/Atom data from, usually a HTTP or HTTPS URL.
.It Op Fa basesiteurl
Baseurl of the feed links.
This argument allows to fix relative item links.
.Pp
According to the RSS and Atom specification feeds should always have absolute
-urls, however this is not always the case in practise.
+URLs, however this is not always the case in practise.
.It Op Fa encoding
Feeds are decoded from this name to UTF-8, the name should be a usable
character-set for the
@@ -58,7 +58,7 @@ is a shellscript each function can be overridden to change its behaviour,
notable functions are:
.Bl -tag -width Ds
.It Fn fetch "name" "url" "feedfile"
-Fetch feed from url and writes data to stdout, its arguments are:
+Fetch feed from URL and writes data to stdout, its arguments are:
.Bl -tag -width Ds
.It Fa name
Specified name in configuration file (useful for logging).
@@ -71,10 +71,12 @@ Used feedfile (useful for comparing modification times).
By default the tool
.Xr curl 1
is used.
-.It Fn convertencoding "from" "to"
+.It Fn convertencoding "name" "from" "to"
Convert from text-encoding to another and writes it to stdout, its arguments
are:
.Bl -tag -width Ds
+.It Fa name
+Feed name.
.It Fa from
From text-encoding.
.It Fa to
@@ -84,6 +86,19 @@ To text-encoding.
By default the tool
.Xr iconv 1
is used.
+.It Fn parse "name" "feedurl" "basesiteurl"
+Parse and convert RSS/Atom XML to the
+.Xr sfeed 5
+TSV format.
+.Bl -tag -width Ds
+.It Fa name
+Name of the feed.
+.It Fa feedurl
+URL of the feed.
+.It Fa basesiteurl
+Baseurl of the feed links.
+This argument allows to fix relative item links.
+.El
.It Fn filter "name"
Filter
.Xr sfeed 5
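Note on the convertencoding change documented above: overrides in sfeedrc now receive the feed name as the first argument. A minimal sketch of such an override follows. The logging line and the feed name "examplefeed" are assumptions for illustration, and the iconv flags are one possible choice rather than the tool's defaults.

```shell
#!/bin/sh
# convertencoding(name, from, to): sketch of an override using the
# three-argument form. Reads data from stdin and writes it to stdout.
convertencoding() {
	if [ "$2" != "" ] && [ "$3" != "" ] && [ "$2" != "$3" ]; then
		# log which feed is converted; stderr keeps stdout clean.
		echo "convert $1: $2 -> $3" >&2
		iconv -c -f "$2" -t "$3"
	else
		cat # nothing to convert, pass the data through.
	fi
}

# usage: the encodings are equal, so the data passes through unchanged.
out="$(printf 'hello\n' | convertencoding "examplefeed" "utf-8" "utf-8")"
echo "${out}"
```

Overrides written for the old two-argument form need their positional parameters shifted by one, since "from" and "to" are now $2 and $3.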
diff --git a/util.c b/util.c
@@ -7,167 +7,203 @@
#include "util.h"
+/* check if string has a non-empty scheme / protocol part */
int
-parseuri(const char *s, struct uri *u, int rel)
+uri_hasscheme(const char *s)
{
- const char *p = s, *b;
- char *endptr = NULL;
+ const char *p = s;
+
+ for (; isalpha((unsigned char)*p) || isdigit((unsigned char)*p) ||
+ *p == '+' || *p == '-' || *p == '.'; p++)
+ ;
+ /* scheme, except if empty and starts with ":" then it is a path */
+ return (*p == ':' && p != s);
+}
+
+int
+uri_parse(const char *s, struct uri *u)
+{
+ const char *p = s;
+ char *endptr;
size_t i;
- unsigned long l;
+ long l;
- u->proto[0] = u->host[0] = u->path[0] = u->port[0] = '\0';
- if (!*s)
- return 0;
+ u->proto[0] = u->userinfo[0] = u->host[0] = u->port[0] = '\0';
+ u->path[0] = u->query[0] = u->fragment[0] = '\0';
- /* prefix is "//", don't read protocol, skip to domain parsing */
- if (!strncmp(p, "//", 2)) {
+ /* protocol-relative */
+ if (*p == '/' && *(p + 1) == '/') {
p += 2; /* skip "//" */
- } else {
- /* protocol part */
- for (p = s; *p && (isalpha((unsigned char)*p) || isdigit((unsigned char)*p) ||
- *p == '+' || *p == '-' || *p == '.'); p++)
- ;
- if (!strncmp(p, "://", 3)) {
- if ((size_t)(p - s) >= sizeof(u->proto))
- return -1; /* protocol too long */
- memcpy(u->proto, s, p - s);
- u->proto[p - s] = '\0';
+ goto parseauth;
+ }
+
+ /* scheme / protocol part */
+ for (; isalpha((unsigned char)*p) || isdigit((unsigned char)*p) ||
+ *p == '+' || *p == '-' || *p == '.'; p++)
+ ;
+ /* scheme, except if empty and starts with ":" then it is a path */
+ if (*p == ':' && p != s) {
+ if (*(p + 1) == '/' && *(p + 2) == '/')
p += 3; /* skip "://" */
- } else {
- p = s; /* no protocol format, set to start */
- /* relative url: read rest as path, else as domain */
- if (rel)
- goto readpath;
- }
+ else
+ p++; /* skip ":" */
+
+ if ((size_t)(p - s) >= sizeof(u->proto))
+ return -1; /* protocol too long */
+ memcpy(u->proto, s, p - s);
+ u->proto[p - s] = '\0';
+
+ if (*(p - 1) != '/')
+ goto parsepath;
+ } else {
+ p = s; /* no scheme format, reset to start */
+ goto parsepath;
+ }
+
+parseauth:
+ /* userinfo (username:password) */
+ i = strcspn(p, "@/?#");
+ if (p[i] == '@') {
+ if (i >= sizeof(u->userinfo))
+ return -1; /* userinfo too long */
+ memcpy(u->userinfo, p, i);
+ u->userinfo[i] = '\0';
+ p += i + 1;
}
+
/* IPv6 address */
if (*p == '[') {
- /* bracket not found or host too long */
- if (!(b = strchr(p, ']')) || (size_t)(b - p) < 3 ||
- (size_t)(b - p) >= sizeof(u->host))
+ /* bracket not found, host too short or too long */
+ i = strcspn(p, "]");
+ if (p[i] != ']' || i < 3)
return -1;
- memcpy(u->host, p, b - p + 1);
- u->host[b - p + 1] = '\0';
- p = b + 1;
+ i++; /* including "]" */
} else {
/* domain / host part, skip until port, path or end. */
- if ((i = strcspn(p, ":/")) >= sizeof(u->host))
- return -1; /* host too long */
- memcpy(u->host, p, i);
- u->host[i] = '\0';
- p = &p[i];
+ i = strcspn(p, ":/?#");
}
+ if (i >= sizeof(u->host))
+ return -1; /* host too long */
+ memcpy(u->host, p, i);
+ u->host[i] = '\0';
+ p += i;
+
/* port */
if (*p == ':') {
- if ((i = strcspn(++p, "/")) >= sizeof(u->port))
+ p++;
+ if ((i = strcspn(p, "/?#")) >= sizeof(u->port))
return -1; /* port too long */
memcpy(u->port, p, i);
u->port[i] = '\0';
- /* check for valid port: range 1 - 65535 */
+ /* check for valid port: range 1 - 65535, may be empty */
errno = 0;
- l = strtoul(u->port, &endptr, 10);
- if (errno || u->port[0] == '\0' || *endptr ||
- !l || l > 65535)
- return -1;
- p = &p[i];
- }
-readpath:
- if (u->host[0]) {
- p = &p[strspn(p, "/")];
- strlcpy(u->path, "/", sizeof(u->path));
- } else {
- /* absolute uri must have a host specified */
- if (!rel)
+ l = strtol(u->port, &endptr, 10);
+ if (i && (errno || *endptr || l <= 0 || l > 65535))
return -1;
+ p += i;
}
- /* treat truncation as an error */
- if (strlcat(u->path, p, sizeof(u->path)) >= sizeof(u->path))
- return -1;
- return 0;
-}
-static int
-encodeuri(char *buf, size_t bufsiz, const char *s)
-{
- static const char *table = "0123456789ABCDEF";
- size_t i, b;
+parsepath:
+ /* path */
+ if ((i = strcspn(p, "?#")) >= sizeof(u->path))
+ return -1; /* path too long */
+ memcpy(u->path, p, i);
+ u->path[i] = '\0';
+ p += i;
- for (i = 0, b = 0; s[i]; i++) {
- if ((unsigned char)s[i] <= ' ' ||
- (unsigned char)s[i] >= 127) {
- if (b + 3 >= bufsiz)
- return -1;
- buf[b++] = '%';
- buf[b++] = table[((unsigned char)s[i] >> 4) & 15];
- buf[b++] = table[(unsigned char)s[i] & 15];
- } else if (b < bufsiz) {
- buf[b++] = s[i];
- } else {
- return -1;
- }
+ /* query */
+ if (*p == '?') {
+ p++;
+ if ((i = strcspn(p, "#")) >= sizeof(u->query))
+ return -1; /* query too long */
+ memcpy(u->query, p, i);
+ u->query[i] = '\0';
+ p += i;
+ }
+
+ /* fragment */
+ if (*p == '#') {
+ p++;
+ if ((i = strlen(p)) >= sizeof(u->fragment))
+ return -1; /* fragment too long */
+ memcpy(u->fragment, p, i);
+ u->fragment[i] = '\0';
}
- if (b >= bufsiz)
- return -1;
- buf[b] = '\0';
return 0;
}
-/* Get absolute uri; if `link` is relative use `base` to make it absolute.
- * the returned string in `buf` is uri encoded, see: encodeuri(). */
+/* Transform and try to make the URI `u` absolute using base URI `b` into `a`.
+ Follows some of the logic from "RFC 3986 - 5.2.2. Transform References".
+ Returns 0 on success, -1 on error or truncation. */
int
-absuri(char *buf, size_t bufsiz, const char *link, const char *base)
+uri_makeabs(struct uri *a, struct uri *u, struct uri *b)
{
- struct uri ulink, ubase;
- char tmp[4096], *host, *p, *port;
- int c, r;
- size_t i;
+ char *p;
+ int c;
- buf[0] = '\0';
- if (parseuri(base, &ubase, 0) == -1 ||
- parseuri(link, &ulink, 1) == -1 ||
- (!ulink.host[0] && !ubase.host[0]))
- return -1;
+ strlcpy(a->fragment, u->fragment, sizeof(a->fragment));
- if (!strncmp(link, "//", 2)) {
- host = ulink.host;
- port = ulink.port;
- } else {
- host = ulink.host[0] ? ulink.host : ubase.host;
- port = ulink.port[0] ? ulink.port : ubase.port;
+ if (u->proto[0] || u->host[0]) {
+ strlcpy(a->proto, u->proto[0] ? u->proto : b->proto, sizeof(a->proto));
+ strlcpy(a->host, u->host, sizeof(a->host));
+ strlcpy(a->userinfo, u->userinfo, sizeof(a->userinfo));
+ strlcpy(a->host, u->host, sizeof(a->host));
+ strlcpy(a->port, u->port, sizeof(a->port));
+ strlcpy(a->path, u->path, sizeof(a->path));
+ strlcpy(a->query, u->query, sizeof(a->query));
+ return 0;
}
- r = snprintf(tmp, sizeof(tmp), "%s://%s%s%s",
- ulink.proto[0] ?
- ulink.proto :
- (ubase.proto[0] ? ubase.proto : "http"),
- host,
- port[0] ? ":" : "",
- port);
- if (r < 0 || (size_t)r >= sizeof(tmp))
- return -1; /* error or truncation */
-
- /* relative to root */
- if (!ulink.host[0] && ulink.path[0] != '/') {
- /* relative to base url path */
- if (ulink.path[0]) {
- if ((p = strrchr(ubase.path, '/'))) {
- /* temporary null-terminate */
- c = *(++p);
- *p = '\0';
- i = strlcat(tmp, ubase.path, sizeof(tmp));
- *p = c; /* restore */
- if (i >= sizeof(tmp))
- return -1;
- }
- } else if (strlcat(tmp, ubase.path, sizeof(tmp)) >=
- sizeof(tmp)) {
- return -1;
+
+ strlcpy(a->proto, b->proto, sizeof(a->proto));
+ strlcpy(a->host, b->host, sizeof(a->host));
+ strlcpy(a->userinfo, b->userinfo, sizeof(a->userinfo));
+ strlcpy(a->host, b->host, sizeof(a->host));
+ strlcpy(a->port, b->port, sizeof(a->port));
+
+ if (!u->path[0]) {
+ strlcpy(a->path, b->path, sizeof(a->path));
+ } else if (u->path[0] == '/') {
+ strlcpy(a->path, u->path, sizeof(a->path));
+ } else {
+ a->path[0] = (b->host[0] && b->path[0] != '/') ? '/' : '\0';
+ a->path[1] = '\0';
+
+ if ((p = strrchr(b->path, '/'))) {
+ c = *(++p);
+ *p = '\0'; /* temporary NUL-terminate */
+ if (strlcat(a->path, b->path, sizeof(a->path)) >= sizeof(a->path))
+ return -1;
+ *p = c; /* restore */
}
+ if (strlcat(a->path, u->path, sizeof(a->path)) >= sizeof(a->path))
+ return -1;
}
- if (strlcat(tmp, ulink.path, sizeof(tmp)) >= sizeof(tmp))
- return -1;
- return encodeuri(buf, bufsiz, tmp);
+ if (u->path[0] || u->query[0])
+ strlcpy(a->query, u->query, sizeof(a->query));
+ else
+ strlcpy(a->query, b->query, sizeof(a->query));
+
+ return 0;
+}
+
+int
+uri_format(char *buf, size_t bufsiz, struct uri *u)
+{
+ return snprintf(buf, bufsiz, "%s%s%s%s%s%s%s%s%s%s%s%s",
+ u->proto,
+ u->userinfo[0] ? u->userinfo : "",
+ u->userinfo[0] ? "@" : "",
+ u->host,
+ u->port[0] ? ":" : "",
+ u->port,
+ u->host[0] && u->path[0] && u->path[0] != '/' ? "/" : "",
+ u->path,
+ u->query[0] ? "?" : "",
+ u->query,
+ u->fragment[0] ? "#" : "",
+ u->fragment);
}
/* Splits fields in the line buffer by replacing TAB separators with NUL ('\0')
@@ -204,7 +240,7 @@ strtotime(const char *s, time_t *t)
if (errno || *s == '\0' || *e)
return -1;
/* NOTE: assumes time_t is 64-bit on 64-bit platforms:
- long long (atleast 32-bit) to time_t. */
+ long long (at least 32-bit) to time_t. */
if (t)
*t = (time_t)l;
@@ -222,7 +258,7 @@ xmlencode(const char *s, FILE *fp)
case '\'': fputs("&#39;", fp); break;
case '&': fputs("&amp;", fp); break;
case '"': fputs("&quot;", fp); break;
- default: fputc(*s, fp);
+ default: putc(*s, fp);
}
}
}
@@ -234,29 +270,49 @@ printutf8pad(FILE *fp, const char *s, size_t len, int pad)
{
wchar_t wc;
size_t col = 0, i, slen;
- int rl, w;
+ int inc, rl, w;
if (!len)
return;
slen = strlen(s);
- for (i = 0; i < slen; i += rl) {
- rl = w = 1;
- if ((unsigned char)s[i] < 32)
- continue;
- if ((unsigned char)s[i] >= 127) {
- if ((rl = mbtowc(&wc, s + i, slen - i < 4 ? slen - i : 4)) <= 0)
+ for (i = 0; i < slen; i += inc) {
+ inc = 1; /* next byte */
+ if ((unsigned char)s[i] < 32) {
+ continue; /* skip control characters */
+ } else if ((unsigned char)s[i] >= 127) {
+ rl = mbtowc(&wc, s + i, slen - i < 4 ? slen - i : 4);
+ inc = rl;
+ if (rl < 0) {
+ mbtowc(NULL, NULL, 0); /* reset state */
+ inc = 1; /* invalid, seek next byte */
+ w = 1; /* replacement char is one width */
+ } else if ((w = wcwidth(wc)) == -1) {
+ continue;
+ }
+
+ if (col + w > len || (col + w == len && s[i + inc])) {
+ fputs("\xe2\x80\xa6", fp); /* ellipsis */
+ col++;
break;
- if ((w = wcwidth(wc)) == -1)
+ } else if (rl < 0) {
+ fputs("\xef\xbf\xbd", fp); /* replacement */
+ col++;
continue;
- }
- if (col + w > len || (col + w == len && s[i + rl])) {
- fputs("\xe2\x80\xa6", fp);
+ }
+ fwrite(&s[i], 1, rl, fp);
+ col += w;
+ } else {
+ /* optimization: simple ASCII character */
+ if (col + 1 > len || (col + 1 == len && s[i + 1])) {
+ fputs("\xe2\x80\xa6", fp); /* ellipsis */
+ col++;
+ break;
+ }
+ putc(s[i], fp);
col++;
- break;
}
- fwrite(&s[i], 1, rl, fp);
- col += w;
+
}
for (; col < len; ++col)
putc(pad, fp);
diff --git a/util.h b/util.h
@@ -6,6 +6,7 @@
#include <unistd.h>
#else
#define pledge(p1,p2) 0
+#define unveil(p1,p2) 0
#endif
#undef strlcat
@@ -20,22 +21,29 @@ struct feed {
unsigned long total; /* total items */
};
-/* uri */
+/* URI */
struct uri {
- char proto[48];
+ char proto[48]; /* scheme including ":" or "://" */
+ char userinfo[256]; /* username [:password] */
char host[256];
- char path[2048];
- char port[6]; /* numeric port */
+ char port[6]; /* numeric port */
+ char path[1024];
+ char query[1024];
+ char fragment[1024];
};
enum {
FieldUnixTimestamp = 0, FieldTitle, FieldLink, FieldContent,
- FieldContentType, FieldId, FieldAuthor, FieldEnclosure, FieldLast
+ FieldContentType, FieldId, FieldAuthor, FieldEnclosure, FieldCategory,
+ FieldLast
};
-int absuri(char *, size_t, const char *, const char *);
+int uri_format(char *, size_t, struct uri *);
+int uri_hasscheme(const char *);
+int uri_makeabs(struct uri *, struct uri *, struct uri *);
+int uri_parse(const char *, struct uri *);
+
void parseline(char *, char *[FieldLast]);
-int parseuri(const char *, struct uri *, int);
void printutf8pad(FILE *, const char *, size_t, int);
int strtotime(const char *, time_t *);
void xmlencode(const char *, FILE *);
diff --git a/xml.c b/xml.c
@@ -116,49 +116,18 @@ startvalue:
static void
xml_parsecomment(XMLParser *x)
{
- size_t datalen = 0, i = 0;
- int c;
+ int c, i = 0;
- if (x->xmlcommentstart)
- x->xmlcommentstart(x);
while ((c = GETNEXT()) != EOF) {
- if (c == '-' || c == '>') {
- if (x->xmlcomment && datalen) {
- x->data[datalen] = '\0';
- x->xmlcomment(x, x->data, datalen);
- datalen = 0;
- }
- }
-
if (c == '-') {
- if (++i > 2) {
- if (x->xmlcomment)
- for (; i > 2; i--)
- x->xmlcomment(x, "-", 1);
+ if (++i > 2)
i = 2;
- }
continue;
} else if (c == '>' && i == 2) {
- if (x->xmlcommentend)
- x->xmlcommentend(x);
return;
} else if (i) {
- if (x->xmlcomment) {
- for (; i > 0; i--)
- x->xmlcomment(x, "-", 1);
- }
i = 0;
}
-
- if (datalen < sizeof(x->data) - 1) {
- x->data[datalen++] = c;
- } else {
- x->data[datalen] = '\0';
- if (x->xmlcomment)
- x->xmlcomment(x, x->data, datalen);
- x->data[0] = c;
- datalen = 1;
- }
}
}
@@ -168,8 +137,6 @@ xml_parsecdata(XMLParser *x)
size_t datalen = 0, i = 0;
int c;
- if (x->xmlcdatastart)
- x->xmlcdatastart(x);
while ((c = GETNEXT()) != EOF) {
if (c == ']' || c == '>') {
if (x->xmlcdata && datalen) {
@@ -188,8 +155,6 @@ xml_parsecdata(XMLParser *x)
}
continue;
} else if (c == '>' && i == 2) {
- if (x->xmlcdataend)
- x->xmlcdataend(x);
return;
} else if (i) {
if (x->xmlcdata)
@@ -287,7 +252,8 @@ numericentitytostr(const char *e, char *buf, size_t bufsiz)
else
l = strtol(e, &end, 10);
/* invalid value or not a well-formed entity or invalid code point */
- if (errno || e == end || *end != ';' || l < 0 || l > 0x10ffff)
+ if (errno || e == end || *end != ';' || l < 0 || l > 0x10ffff ||
+ (l >= 0xd800 && l <= 0xdfff))
return -1;
len = codepointtoutf8(l, buf);
buf[len] = '\0';
@@ -326,7 +292,7 @@ xml_parse(XMLParser *x)
if (c == '!') { /* cdata and comments */
for (tagdatalen = 0; (c = GETNEXT()) != EOF;) {
- /* NOTE: sizeof(x->data) must be atleast sizeof("[CDATA[") */
+ /* NOTE: sizeof(x->data) must be at least sizeof("[CDATA[") */
if (tagdatalen <= sizeof("[CDATA[") - 1)
x->data[tagdatalen++] = c;
if (c == '>')
@@ -393,8 +359,6 @@ xml_parse(XMLParser *x)
} else {
/* parse tag data */
datalen = 0;
- if (x->xmldatastart)
- x->xmldatastart(x);
while ((c = GETNEXT()) != EOF) {
if (c == '&') {
if (datalen) {
@@ -441,8 +405,6 @@ xml_parse(XMLParser *x)
x->data[datalen] = '\0';
if (x->xmldata && datalen)
x->xmldata(x, x->data, datalen);
- if (x->xmldataend)
- x->xmldataend(x);
break;
}
}
diff --git a/xml.h b/xml.h
@@ -1,5 +1,5 @@
-#ifndef _XML_H
-#define _XML_H
+#ifndef _XML_H_
+#define _XML_H_
#include <stdio.h>
@@ -13,24 +13,17 @@ typedef struct xmlparser {
const char *, size_t);
void (*xmlattrentity)(struct xmlparser *, const char *, size_t,
const char *, size_t, const char *, size_t);
- void (*xmlcdatastart)(struct xmlparser *);
void (*xmlcdata)(struct xmlparser *, const char *, size_t);
- void (*xmlcdataend)(struct xmlparser *);
- void (*xmlcommentstart)(struct xmlparser *);
- void (*xmlcomment)(struct xmlparser *, const char *, size_t);
- void (*xmlcommentend)(struct xmlparser *);
void (*xmldata)(struct xmlparser *, const char *, size_t);
- void (*xmldataend)(struct xmlparser *);
void (*xmldataentity)(struct xmlparser *, const char *, size_t);
- void (*xmldatastart)(struct xmlparser *);
void (*xmltagend)(struct xmlparser *, const char *, size_t, int);
void (*xmltagstart)(struct xmlparser *, const char *, size_t);
void (*xmltagstartparsed)(struct xmlparser *, const char *,
size_t, int);
#ifndef GETNEXT
- /* GETNEXT overridden for sfeed to reduce function call overhead and
- further context optimizations. */
+ /* GETNEXT overridden to reduce function call overhead and further
+ context optimizations. */
#define GETNEXT getchar
#endif