diff --git a/v2/content/posts/2022/generate-rss-from-webpage/index.md b/v2/content/posts/2022/generate-rss-from-webpage/index.md
new file mode 100644
index 0000000..bf5a69e
--- /dev/null
+++ b/v2/content/posts/2022/generate-rss-from-webpage/index.md
@@ -0,0 +1,188 @@
++++
+title = "Generating RSS Feeds From Web Pages With RSS Please"
+date = 2022-07-04T09:54:29+10:00
+
+[extra]
+#updated = 2022-01-27T21:07:32+10:00
++++
+
+Sometimes I come across a web page that I'd like to revisit when there's
+new content. Typically, I do this by subscribing to the [RSS feed][feed] in
+[Feedbin]. Unfortunately, some sites don't provide an RSS feed, which is why I
+built [RSS Please][rsspls] (`rsspls`). RSS Please allows you to generate an RSS
+feed by extracting specific parts of a web page. In this post I give a bit of
+background on the tool and how I'm running it in my Docker infrastructure.
+
+
+
+### Background
+
+Sometimes an RSS feed isn't available on a website. If the site is open source
+I will often try to [open a PR to add or enable one][rss-pr]. That's not always
+possible though. Other times the page may not be one that you would naturally think to provide a feed for, but one would still be useful.
+
+As an example, when we were looking to buy a house I noticed that listings
+would often go live on agents' websites several days or more before they were
+published to the big aggregators. The market was very competitive so I was
+regularly visiting all the real estate agent websites to run my search and
+check for new listings. At the time I used [Feedfry] to create RSS feeds from
+the search results. I could then subscribe to them in [Feedbin]. Paired with
+the [Feedbin Notifier app][notifier] I received a notification on my phone
+whenever there was a new listing matching my search criteria from any of the
+agents.
+
+Feedfry is free with ads or via a paid subscription. I paid while house shopping but
+let that lapse afterwards. I don't begrudge them funding the service with ads or
+subscriptions but I figured I could probably put something together and
+self-host it, while also gaining a bit more control over how the
+elements of the page are extracted to generate the feed. [RSS Please][rsspls]
+is the result.
+
+RSS Please is an open-source command line application implemented in Rust. It
+has no runtime dependencies and runs on UNIX-like platforms including FreeBSD,
+Linux, and macOS. Once I resolve [this issue][windows-issue] it will run on
+Windows too. The following sections describe how it's configured and how I'm
+running it on my server.
+
+### Configuration
+
+The `rsspls` configuration file allows a number of feeds to be defined. It
+uses [CSS Selectors][css] to describe how parts of each page will be extracted
+to produce a feed. As an example, here's a configuration that builds a feed
+from this site, although I already have an RSS feed at
+ if you want to subscribe.
+
+```toml
+# The configuration must start with the [rsspls] section
+[rsspls]
+output = "/tmp"
+
+[[feed]]
+# The title of the channel in the feed
+title = "Example WezM.net Feed"
+# The output filename within the output directory to write this feed to.
+filename = "wezm.rss"
+
+[feed.config]
+url = "https://www.wezm.net/"
+item = "article"
+heading = "h3 a"
+summary = ".post-body"
+date = "time"
+```
+
+The configuration format is [TOML]. The `item` key selects `article` elements
+from the page. `heading`, `summary`, and `date` are selectors applied within the
+element selected by `item`. `summary` and `date` are optional. `heading` is expected to
+select an element with an `href` attribute, which is used as the link for the
+item in the feed.
+
+### Running It
+
+Once installed, running `rsspls` will update the configured feeds. Caching is
+used to skip updates when the origin server indicates nothing has changed since
+last time. By default `rsspls` looks for its configuration file in
+`$XDG_CONFIG_HOME/rsspls/feeds.toml`, defaulting to
+`~/.config/rsspls/feeds.toml` if `XDG_CONFIG_HOME` is not set. Alternatively
+the path can be supplied with `--config`.
+
+{{ figure(image="posts/2022/generate-rss-from-webpage/rsspls-output.png", link="posts/2022/generate-rss-from-webpage/rsspls-output.png", alt="Screenshot of the output when running rsspls. It has several log messages prefixed with INFO describing the actions taken", caption="rsspls prints informational messages when updating feeds", width=335, border=false) }}
+
+### Deployment
+
+Since I
+[host my things with Docker + Compose](@/posts/2022/alpine-linux-docker-infrastructure-three-years/index.md)
+I'm running `rsspls` with Docker as well, but that's not required. There are
+plenty of other ways you could go about it. E.g. you could have `cron` run
+`rsspls` on your computer and `rsync` the feeds to a server. Some RSS aggregators
+like [Liferea] even let you subscribe to local files.
+
+I create a Docker image from the `rsspls` binaries I publish:
+
+```dockerfile
+FROM wezm-alpine:3.16.0
+
+# UID needs to match owner of /home/rsspls/feeds volume
+ARG PUID=1000
+ARG PGID=1000
+ARG USER=rsspls
+
+RUN addgroup -g ${PGID} ${USER} && \
+    adduser -D -u ${PUID} -G ${USER} -h /home/${USER} ${USER}
+
+ARG RSSPLS_VERSION=0.2.0
+
+RUN cd /usr/local/bin && \
+    wget -O - https://releases.wezm.net/rsspls/${RSSPLS_VERSION}/rsspls-${RSSPLS_VERSION}-x86_64-unknown-linux-musl.tar.gz | tar zxf - && \
+    mkdir /home/${USER}/feeds && \
+    chown ${USER}:${USER} /home/${USER}/feeds
+
+COPY ./entrypoint.sh /home/${USER}/entrypoint.sh
+
+WORKDIR /home/${USER}
+
+USER ${USER}
+
+VOLUME ["/home/rsspls/feeds"]
+ENTRYPOINT ["./entrypoint.sh"]
+```
+
+It uses my standard [Alpine] base image which is built from the "Mini root
+filesystem" they publish and does not require any other packages to be
+installed.
+
+I use an entry point script to run `rsspls` every 12 hours:
+
+```sh
+#!/bin/sh
+
+set -e
+
+trap 'exit' TERM INT
+
+while true; do
+    rsspls --config /etc/rsspls.toml
+    sleep 43200 # 12 hours (12 * 60 * 60 seconds)
+done
+```
+
+In my `docker-compose.yml` I have the following:
+
+```yaml
+  rsspls:
+    image: example.com/rsspls
+    volumes:
+      - ./rsspls/rsspls.toml:/etc/rsspls.toml:ro
+      - ./volumes/www/rsspls.wezm.net:/home/rsspls/feeds
+    restart: unless-stopped
+```
+
+The `./volumes/www/rsspls.wezm.net` path is shared with the container running `nginx`, so
+the generated feeds are accessible at `rsspls.wezm.net`, although I'm not making them
+obvious to visitors (there's no directory index so visiting that domain will just give
+a 403 Forbidden error).
+
+### Conclusion
+
+This was a fun project to put together over a weekend. I get a lot of
+satisfaction from building and self-hosting tools to solve my own problems. Not
+everyone has the time or desire to do that though, so if you're looking for
+similar functionality check out [Feed43] and [Feedfry].
+
+As mentioned, the tool is open-source (MIT or Apache 2.0). Check out the repo at
+ and if you like what you see maybe give it a
+star.
+
+
+[Alpine]: https://alpinelinux.org/
+[css]: https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors
+[Feed43]: https://feed43.com/
+[feed]: https://en.wikipedia.org/wiki/RSS
+[Feedbin]: https://feedbin.com/
+[Feedfry]: https://feedfry.com/
+[notifier]: https://feedbin.com/notifier
+[rss-pr]: https://github.com/pulls?q=is%3Apr+author%3Awezm++rss+is%3Aclosed+
+[rsspls]: https://github.com/wezm/rsspls
+[windows-issue]: https://github.com/wezm/rsspls/issues/4
+[TOML]: https://toml.io/
+[Liferea]: http://lzone.de/liferea/
diff --git a/v2/content/posts/2022/generate-rss-from-webpage/rsspls-output.png b/v2/content/posts/2022/generate-rss-from-webpage/rsspls-output.png
new file mode 100644
index 0000000..aa22384
Binary files /dev/null and b/v2/content/posts/2022/generate-rss-from-webpage/rsspls-output.png differ