+++
title = "Generating RSS Feeds From Web Pages With RSS Please"
date = 2022-07-04T09:54:29+10:00

[extra]
updated = 2022-07-24T09:28:15+10:00
+++

Sometimes I come across a web page that I'd like to revisit when there's new content. Typically, I do this by subscribing to the RSS feed in Feedbin. Unfortunately some sites don't provide an RSS feed, which is why I built RSS Please (rsspls). RSS Please allows you to generate an RSS feed by extracting specific parts of a web page. In this post I give a bit of background on the tool and how I'm running it in my Docker infrastructure.

Background

Sometimes an RSS feed isn't available on a website. If the site is open source I will often try to open a PR to add or enable one. That's not always possible though. Other times the page may be one that the author wouldn't naturally think to provide a feed for, but one would still be useful.

As an example, when we were looking to buy a house I noticed that listings would often go live on agents' websites several days or more before they were published to the big aggregators. The market was very competitive, so I was regularly visiting all the real estate agent websites to run my search and check for new listings. At the time I used Feedfry to create RSS feeds from the search results, which I could then subscribe to in Feedbin. With the Feedbin Notifier app, I received a notification on my phone whenever there was a new listing matching my search criteria.

Feedfry is free with ads, or available through a paid subscription. I paid while house shopping but let that lapse afterwards. I don't begrudge them funding the service with ads or subscriptions, but I figured I could probably put something together and self-host it, while also gaining a bit more control over how the elements of the page are extracted to generate the feed. RSS Please is the result.

RSS Please is an open-source command line application implemented in Rust. It has no runtime dependencies and runs on UNIX-like platforms including FreeBSD, Linux, and macOS. Once I resolve this issue it will run on Windows too. The following sections describe how it's configured and how I'm running it on my server.

Configuration

The rsspls configuration file allows a number of feeds to be defined. It uses CSS Selectors to describe how parts of each page will be extracted to produce a feed. As an example here's a configuration that builds a feed from this site—although I already have an RSS feed at https://www.wezm.net/v2/rss.xml if you want to subscribe.

# The configuration must start with the [rsspls] section
[rsspls]
output = "/tmp"

[[feed]]
# The title of the channel in the feed
title = "Example WezM.net Feed"
# The output filename within the output directory to write this feed to.
filename = "wezm.rss"

[feed.config]
url = "https://www.wezm.net/"
item = "article"
heading = "h3 a"
summary = ".post-body"
date = "time"

The configuration format is TOML. In this example, the item key selects article elements from the page. The heading, summary, and date keys are selectors applied within each element selected by item; summary and date are optional. heading is expected to select an element with an href attribute, which is used as the link for the item in the feed.
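Since summary and date are optional, a feed can be described with just item and heading. The sketch below writes such a minimal configuration to a file. The URL, selectors, title, and filenames are hypothetical placeholders for illustration, not taken from a real site.

```shell
#!/bin/sh
# Write a minimal rsspls config to the path given as $1.
# All values below are hypothetical; adjust them to the page you want to follow.
write_minimal_config() {
  cat > "$1" <<'EOF'
[rsspls]
output = "/tmp"

[[feed]]
title = "Example Listings"
filename = "listings.rss"

[feed.config]
url = "https://example.com/listings"
# Each <li class="listing"> on the page becomes one feed item
item = "li.listing"
# A link inside the item; its href is used as the item link
heading = "a"
EOF
}

write_minimal_config "${TMPDIR:-/tmp}/feeds-minimal.toml"
# Then run: rsspls --config "${TMPDIR:-/tmp}/feeds-minimal.toml"
```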

Running It

Once installed, running rsspls will update the configured feeds. Caching is used to skip updates when the origin server indicates nothing has changed since the last run. By default, rsspls looks for its configuration file at $XDG_CONFIG_HOME/rsspls/feeds.toml, falling back to ~/.config/rsspls/feeds.toml if XDG_CONFIG_HOME is not set. Alternatively, the path can be supplied with --config.
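The default lookup can be mirrored in shell, which is handy for checking which file will be read. This is only a sketch of the rule described above, not rsspls's own code, and the function name is made up for illustration. It also treats an empty XDG_CONFIG_HOME as unset, which is an assumption.

```shell
#!/bin/sh
# Print the default config path following the lookup rule described above.
# Assumption: an empty XDG_CONFIG_HOME is treated the same as an unset one.
rsspls_config_path() {
  printf '%s\n' "${XDG_CONFIG_HOME:-$HOME/.config}/rsspls/feeds.toml"
}

rsspls_config_path
```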

{{ figure(image="posts/2022/generate-rss-from-webpage/rsspls-output.png", link="posts/2022/generate-rss-from-webpage/rsspls-output.png", alt="Screenshot of the output when running rsspls. It has several log messages prefixed with INFO describing the actions taken", caption="rsspls prints informational messages when updating feeds", width=335, border=false) }}

Deployment

Since I host my things with Docker + Compose I'm running rsspls with Docker as well, but that's not required. There are plenty of other ways you could go about it. For example, you could have cron run rsspls on your computer and rsync the feeds to a server. Some RSS aggregators like Liferea even let you subscribe to local files.
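The cron and rsync approach might look something like the following wrapper. It is a rough sketch: the function name, paths, and host are hypothetical, and it assumes the feed directory passed in matches the output directory set in the rsspls configuration.

```shell
#!/bin/sh
# update_feeds CONFIG FEED_DIR DEST
# Regenerate the feeds, then copy them to DEST only if rsspls succeeded.
update_feeds() {
  config=$1
  feed_dir=$2
  dest=$3
  rsspls --config "$config" || return 1
  # The trailing slash copies the directory contents rather than the directory
  rsync -a "$feed_dir/" "$dest"
}

# Run only when called with all three arguments, e.g. from a crontab entry
# (hypothetical paths and host) such as:
# 0 */12 * * * /home/me/bin/update-feeds.sh /home/me/.config/rsspls/feeds.toml /home/me/feeds server.example.com:/var/www/feeds
if [ "$#" -eq 3 ]; then
  update_feeds "$@"
fi
```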

I create a Docker image from the rsspls binaries I publish:

FROM wezm-alpine:3.16.0

# UID needs to match owner of /home/rsspls/feeds volume
ARG PUID=1000
ARG PGID=1000
ARG USER=rsspls

RUN addgroup -g ${PGID} ${USER} && \
    adduser -D -u ${PUID} -G ${USER} -h /home/${USER} ${USER}

ARG RSSPLS_VERSION=0.2.0

RUN cd /usr/local/bin && \
    wget -O - https://releases.wezm.net/rsspls/${RSSPLS_VERSION}/rsspls-${RSSPLS_VERSION}-x86_64-unknown-linux-musl.tar.gz | tar zxf - && \
    mkdir /home/${USER}/feeds && \
    chown ${USER}:${USER} /home/${USER}/feeds

COPY ./entrypoint.sh /home/${USER}/entrypoint.sh

WORKDIR /home/${USER}

USER ${USER}

VOLUME ["/home/rsspls/feeds"]
ENTRYPOINT ["./entrypoint.sh"]

It uses my standard Alpine base image, which is built from the "Mini root filesystem" that Alpine publishes, and no other packages need to be installed.

I use an entry point script to run rsspls every 12 hours:

#!/bin/sh

set -e

trap 'exit' TERM INT

while true; do
  rsspls --config /etc/rsspls.toml
  # 12 hours; sleep in the background so the trap can interrupt the wait
  sleep 43200 &
  wait $!
done
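One way to keep the interval and its comment from drifting apart is to derive the number of seconds from the number of hours. This is a small sketch with illustrative variable names.

```shell
#!/bin/sh
# Derive the sleep interval in seconds from a number of hours.
hours=12
interval=$((hours * 60 * 60))
echo "$interval" # prints 43200
```

The loop could then call `sleep "$interval"` instead of using a hard-coded number.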

In my docker-compose.yml I have the following:

  rsspls:
    image: example.com/rsspls
    volumes:
      - ./rsspls/rsspls.toml:/etc/rsspls.toml:ro
      - ./volumes/www/rsspls.wezm.net:/home/rsspls/feeds
    restart: unless-stopped

The ./volumes/www/rsspls.wezm.net path is shared with the container running nginx, so the generated feeds are accessible at rsspls.wezm.net—although I'm not making them obvious to visitors (there's no directory index so visiting that domain will just give a 403 Forbidden error).

Conclusion

This was a fun project to put together over a weekend. I get a lot of satisfaction from building and self-hosting tools to solve my own problems. Not everyone has the time or desire to do that though, so if you're looking for similar functionality, check out Feed43 and Feedfry.

As mentioned the tool is open-source (MIT or Apache 2.0). Check out the repo at https://github.com/wezm/rsspls and if you like what you see maybe give it a star.