+++
title = "Generating RSS Feeds From Web Pages With RSS Please"
date = 2022-07-04T09:54:29+10:00

[extra]
updated = 2022-07-24T09:28:15+10:00
+++

Sometimes I come across a web page that I'd like to revisit when there's
new content. Typically, I do this by subscribing to the [RSS feed][feed] in
[Feedbin]. Unfortunately some sites don't provide an RSS feed, which is why I
built [RSS Please][rsspls] (`rsspls`). RSS Please allows you to generate an RSS
feed by extracting specific parts of a web page. In this post I give a bit of
background on the tool and how I'm running it in my Docker infrastructure.

<!-- more -->

### Background

Sometimes an RSS feed isn't available on a website. If the site is open source
I will often try to [open a PR to add or enable one][rss-pr]. That's not always
possible though. Other times the page may be one that the author wouldn't
naturally think to provide a feed for, but one would still be useful.

As an example, when we were looking to buy a house I noticed that listings
would often go live on agent's websites several days or more before they were
published to the big aggregators. The market was very competitive so I was
regularly visiting all the real estate agent websites to run my search, and
check for new listings. At the time I used [Feedfry] to create RSS feeds from
the search results. I could then subscribe to them in [Feedbin]. Paired with
the [Feedbin Notifier app][notifier] I received a notification on my phone
whenever there was a new listing matching my search criteria.

Feedfry is free with ads or paid subscription. I paid while house shopping but
let that lapse afterwards. I don't begrudge them funding the service with ads or
subscriptions but I figured I could probably put something together and
self-host it. At the same time providing a bit more control over how the
elements of the page were extracted to generate the feed. [RSS Please][rsspls]
is the result.

RSS Please is an open-source command line application implemented in Rust. It
has no runtime dependencies and runs on UNIX-like platforms including FreeBSD,
Linux, and macOS. Once I resolve [this issue][windows-issue] it will run on
Windows too. The following sections describe how it's configured and how I'm
running it on my server.

### Configuration

The `rsspls` configuration file allows a number of feeds to be defined. It
uses [CSS Selectors][css] to describe how parts of each page will be extracted
to produce a feed. As an example here's a configuration that builds a feed
from this site—although I already have an RSS feed at
<https://www.wezm.net/v2/rss.xml> if you want to subscribe.

```toml
# The configuration must start with the [rsspls] section
[rsspls]
output = "/tmp"

[[feed]]
# The title of the channel in the feed
title = "Example WezM.net Feed"
# The output filename within the output directory to write this feed to.
filename = "wezm.rss"

[feed.config]
url = "https://www.wezm.net/"
item = "article"
heading = "h3 a"
summary = ".post-body"
date = "time"
```

The configuration format is [TOML]. The `item` key selects `article` elements
from the page. `heading`, `summary`, and `date` are selectors upon the element
selected by `item`. `summary` and `date` are optional. `heading` is expected to
select an element with a `href` attribute, which is used as the link for the
item in the feed.

### Running It

Once installed running `rsspls` will update the configured feeds. Caching is
used to skip updates when the origin server indicates nothing has changed since
last time. By default `rsspls` looks for its configuration file in
`$XDG_CONFIG_HOME/rsspls/feeds.toml`, defaulting to
`~/.config/rsspls/feeds.toml` if `XDG_CONFIG_HOME` is not set. Alternatively
the path can be supplied with `--config`.

{{ figure(image="posts/2022/generate-rss-from-webpage/rsspls-output.png", link="posts/2022/generate-rss-from-webpage/rsspls-output.png", alt="Screenshot of the output when running rsspls. It has several log messages prefixed with INFO describing the actions taken", caption="rsspls prints informational messages when updating feeds", width=335, border=false) }}

### Deployment

Since I
[host my things with Docker + Compose](@/posts/2022/alpine-linux-docker-infrastructure-three-years/index.md)
I'm running `rsspls` with Docker as well, but that's not required. There are
plenty of other ways you could go about it. E.g. you could have `cron` run
`rsspls` on your computer and `rsync` the feeds to a server. Some RSS aggregators
like [Liferea] even let you subscribe to local files.

I create a Docker image from the `rsspls` binaries I publish:

```dockerfile
FROM wezm-alpine:3.16.0

# UID needs to match owner of /home/rsspls/feeds volume
ARG PUID=1000
ARG PGID=1000
ARG USER=rsspls

RUN addgroup -g ${PGID} ${USER} && \
    adduser -D -u ${PUID} -G ${USER} -h /home/${USER} -D ${USER}

ARG RSSPLS_VERSION=0.2.0

RUN cd /usr/local/bin && \
    wget -O - https://releases.wezm.net/rsspls/${RSSPLS_VERSION}/rsspls-${RSSPLS_VERSION}-x86_64-unknown-linux-musl.tar.gz | tar zxf - && \
    mkdir /home/${USER}/feeds && \
    chown ${USER}:${USER} /home/${USER}/feeds

COPY ./entrypoint.sh /home/${USER}/entrypoint.sh

WORKDIR /home/${USER}

USER ${USER}

VOLUME ["/home/rsspls/feeds"]
ENTRYPOINT ["./entrypoint.sh"]
```

It uses my standard [Alpine] base image which is built from the "Mini root
filesystem" they publish and does not require any other packages to be
installed.

I use an entry point script to run `rsspls` every 12 hours:

```sh
#!/bin/sh

set -e

trap 'exit' TERM INT

while true; do
  rsspls --config /etc/rsspls.toml
  sleep 1036800 # 12 hours
done
```

In my `docker-compose.yml` I have the following:

```yaml
  rsspls:
    image: example.com/rsspls
    volumes:
      - ./rsspls/rsspls.toml:/etc/rsspls.toml:ro
      - ./volumes/www/rsspls.wezm.net:/home/rsspls/feeds
    restart: unless-stopped
```

The `./volumes/www/rsspls.wezm.net` path is shared with the container running `nginx`, so
the generated feeds are accessible at `rsspls.wezm.net`—although I'm not making them
obvious to visitors (there's no directory index so visiting that domain will just give
a 403 Forbidden error).

### Conclusion

This was a fun project to put together over a weekend. I get a lot of
satisfaction from building and self-hosting tools to solve my own problems. Not
everyone has the time or desire to do that though so if you're looking for
similar functionality check out [Feed43] and [Feedfry].

As mentioned the tool is open-source (MIT or Apache 2.0). Check out the repo at
<https://github.com/wezm/rsspls> and if you like what you see maybe give it a
star.


[Alpine]: https://alpinelinux.org/
[css]: https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors
[Feed43]: https://feed43.com/
[feed]: https://en.wikipedia.org/wiki/RSS
[Feedbin]: https://feedbin.com/
[Feedfry]: https://feedfry.com/
[notifier]: https://feedbin.com/notifier
[rss-pr]: https://github.com/pulls?q=is%3Apr+author%3Awezm++rss+is%3Aclosed+
[rsspls]: https://github.com/wezm/rsspls
[windows-issue]: https://github.com/wezm/rsspls/issues/4
[TOML]: https://toml.io/
[Liferea]: http://lzone.de/liferea/