+++
title = "Generating RSS Feeds From Web Pages With RSS Please"
date = 2022-07-04T09:54:29+10:00
[extra]
updated = 2022-07-24T09:28:15+10:00
+++
Sometimes I come across a web page that I'd like to revisit when there's
new content. Typically, I do this by subscribing to the [RSS feed][feed] in
[Feedbin]. Unfortunately, some sites don't provide an RSS feed, which is why I
built [RSS Please][rsspls] (`rsspls`). RSS Please allows you to generate an RSS
feed by extracting specific parts of a web page. In this post I give a bit of
background on the tool and how I'm running it in my Docker infrastructure.
<!-- more -->
### Background
Sometimes an RSS feed isn't available on a website. If the site is open source
I will often try to [open a PR to add or enable one][rss-pr]. That's not always
possible though. Other times the page may be one that the author wouldn't
naturally think to provide a feed for, but one would still be useful.

As an example, when we were looking to buy a house I noticed that listings
would often go live on agents' websites several days or more before they were
published to the big aggregators. The market was very competitive so I was
regularly visiting all the real estate agent websites to run my search, and
check for new listings. At the time I used [Feedfry] to create RSS feeds from
the search results. I could then subscribe to them in [Feedbin]. Paired with
the [Feedbin Notifier app][notifier] I received a notification on my phone
whenever there was a new listing matching my search criteria.

Feedfry is free with ads or available as a paid subscription. I paid while
house shopping but let that lapse afterwards. I don't begrudge them funding the
service with ads or subscriptions, but I figured I could put something together
and self-host it, while also getting a bit more control over how the elements
of the page are extracted to generate the feed. [RSS Please][rsspls] is the
result.

RSS Please is an open-source command line application implemented in Rust. It
has no runtime dependencies and runs on UNIX-like platforms including FreeBSD,
Linux, and macOS. Once I resolve [this issue][windows-issue] it will run on
Windows too. The following sections describe how it's configured and how I'm
running it on my server.
### Configuration
The `rsspls` configuration file allows a number of feeds to be defined. It
uses [CSS Selectors][css] to describe how parts of each page will be extracted
to produce a feed. As an example here's a configuration that builds a feed
from this site—although I already have an RSS feed at
<https://www.wezm.net/v2/rss.xml> if you want to subscribe.
```toml
# The configuration must start with the [rsspls] section
[rsspls]
output = "/tmp"
[[feed]]
# The title of the channel in the feed
title = "Example WezM.net Feed"
# The output filename within the output directory to write this feed to.
filename = "wezm.rss"
[feed.config]
url = "https://www.wezm.net/"
item = "article"
heading = "h3 a"
summary = ".post-body"
date = "time"
```
The configuration format is [TOML]. The `item` key selects `article` elements
from the page. `heading`, `summary`, and `date` are selectors applied within
each element selected by `item`. `summary` and `date` are optional. `heading`
is expected to select an element with a `href` attribute, which is used as the
link for the item in the feed.
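
Since the file can hold several feeds, adding another page is a matter of adding another `[[feed]]` table. Here's a minimal sketch that writes a two-feed configuration to a scratch directory from the shell. The second site, its URL, and its selectors (`example.com`, `.news-item`, `h2 a`) are made-up placeholders, and it leaves out the optional `summary` and `date` keys.

```sh
#!/bin/sh
set -e

# Scratch directory to hold the config and the generated feeds
workdir=$(mktemp -d)

# Unquoted heredoc so that $workdir is expanded into the output path
cat > "$workdir/feeds.toml" <<EOF
[rsspls]
output = "$workdir"

[[feed]]
title = "Example WezM.net Feed"
filename = "wezm.rss"

[feed.config]
url = "https://www.wezm.net/"
item = "article"
heading = "h3 a"
summary = ".post-body"
date = "time"

# Hypothetical second feed: the URL and selectors are placeholders
[[feed]]
title = "Example News Feed"
filename = "news.rss"

[feed.config]
url = "https://example.com/news"
item = ".news-item"
heading = "h2 a"
EOF

echo "Wrote $workdir/feeds.toml"
```

Running `rsspls --config "$workdir/feeds.toml"` afterwards should write `wezm.rss` and `news.rss` into the scratch directory, assuming both pages are reachable and the selectors match.
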
### Running It
Once installed, running `rsspls` will update the configured feeds. Caching is
used to skip updates when the origin server indicates nothing has changed since
last time. By default `rsspls` looks for its configuration file in
`$XDG_CONFIG_HOME/rsspls/feeds.toml`, defaulting to
`~/.config/rsspls/feeds.toml` if `XDG_CONFIG_HOME` is not set. Alternatively
the path can be supplied with `--config`.

{{ figure(image="posts/2022/generate-rss-from-webpage/rsspls-output.png", link="posts/2022/generate-rss-from-webpage/rsspls-output.png", alt="Screenshot of the output when running rsspls. It has several log messages prefixed with INFO describing the actions taken", caption="rsspls prints informational messages when updating feeds", width=335, border=false) }}
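
The default lookup described above can be sketched in a few lines of shell. `default_config_path` is a hypothetical helper for illustration, not something `rsspls` provides. It also treats an empty `XDG_CONFIG_HOME` the same as an unset one, which is an assumption on my part here.

```sh
#!/bin/sh

# Print the default config path: use XDG_CONFIG_HOME when it is set
# (and non-empty), otherwise fall back to ~/.config.
# `default_config_path` is a hypothetical helper, not part of rsspls.
default_config_path() {
  printf '%s\n' "${XDG_CONFIG_HOME:-$HOME/.config}/rsspls/feeds.toml"
}

default_config_path
```

To use a file that lives somewhere else, pass it explicitly, for example `rsspls --config /etc/rsspls.toml`.
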
### Deployment
Since I
[host my things with Docker + Compose](@/posts/2022/alpine-linux-docker-infrastructure-three-years/index.md)
I'm running `rsspls` with Docker as well, but that's not required. There are
plenty of other ways you could go about it. For example, you could have `cron`
run `rsspls` on your computer and `rsync` the feeds to a server. Some RSS
aggregators like [Liferea] even let you subscribe to local files.

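
As a rough sketch of the `cron` plus `rsync` approach, something like the following could work. The helper name `update_and_publish`, the paths, and the destination host are all hypothetical, and the local feeds directory is assumed to match the `output` setting in the configuration file.

```sh
#!/bin/sh
set -e

# Update the feeds locally, then copy them to a web server.
# Arguments: config file, local output directory, rsync destination.
# `update_and_publish` is a hypothetical helper name.
update_and_publish() {
  config=$1
  feeds_dir=$2
  dest=$3

  rsspls --config "$config"
  # The trailing slash copies the directory contents, not the directory itself
  rsync -a "$feeds_dir/" "$dest"
}
```

Saved as a script that ends with a call such as `update_and_publish "$HOME/.config/rsspls/feeds.toml" "$HOME/feeds" user@example.com:/var/www/feeds/`, it could be scheduled twice a day with a crontab entry like `0 */12 * * * /path/to/script.sh`.
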
I create a Docker image from the `rsspls` binaries I publish:
```dockerfile
FROM wezm-alpine:3.16.0
# UID needs to match owner of /home/rsspls/feeds volume
ARG PUID=1000
ARG PGID=1000
ARG USER=rsspls
RUN addgroup -g ${PGID} ${USER} && \
    adduser -D -u ${PUID} -G ${USER} -h /home/${USER} ${USER}
ARG RSSPLS_VERSION=0.2.0
RUN cd /usr/local/bin && \
    wget -O - https://releases.wezm.net/rsspls/${RSSPLS_VERSION}/rsspls-${RSSPLS_VERSION}-x86_64-unknown-linux-musl.tar.gz | tar zxf - && \
    mkdir /home/${USER}/feeds && \
    chown ${USER}:${USER} /home/${USER}/feeds
COPY ./entrypoint.sh /home/${USER}/entrypoint.sh
WORKDIR /home/${USER}
USER ${USER}
VOLUME ["/home/rsspls/feeds"]
ENTRYPOINT ["./entrypoint.sh"]
```
It uses my standard [Alpine] base image, which is built from the "Mini root
filesystem" the Alpine project publishes. No other packages need to be
installed.

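
The `PUID` and `PGID` build arguments in the Dockerfile above need to match the owner of the directory mounted at `/home/rsspls/feeds`. One way to look those numbers up on the host is sketched below. `owner_ids` is a hypothetical helper that relies on the numeric owner and group being the third and fourth fields of `ls -nd` output.

```sh
#!/bin/sh

# Print the numeric UID and GID that own a directory, space separated.
# `owner_ids` is a hypothetical helper; it parses `ls -nd`, whose third and
# fourth fields are the numeric owner and group.
owner_ids() {
  ls -nd "$1" | awk '{ print $3, $4 }'
}
```

The two numbers printed by `owner_ids ./volumes/www/rsspls.wezm.net` can then be passed to `docker build` as `--build-arg PUID=...` and `--build-arg PGID=...` if they differ from the default of 1000.
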
I use an entry point script to run `rsspls` every 12 hours:
```sh
#!/bin/sh
set -e
trap 'exit' TERM INT
while true; do
  rsspls --config /etc/rsspls.toml
  sleep 43200 # 12 hours (12 * 60 * 60 seconds)
done
```
In my `docker-compose.yml` I have the following:
```yaml
rsspls:
  image: example.com/rsspls
  volumes:
    - ./rsspls/rsspls.toml:/etc/rsspls.toml:ro
    - ./volumes/www/rsspls.wezm.net:/home/rsspls/feeds
  restart: unless-stopped
```
The `./volumes/www/rsspls.wezm.net` path is shared with the container running `nginx`, so
the generated feeds are accessible at `rsspls.wezm.net`—although I'm not making them
obvious to visitors (there's no directory index so visiting that domain will just give
a 403 Forbidden error).
### Conclusion
This was a fun project to put together over a weekend. I get a lot of
satisfaction from building and self-hosting tools to solve my own problems. Not
everyone has the time or desire to do that though, so if you're looking for
similar functionality check out [Feed43] and [Feedfry].

As mentioned, the tool is open source (MIT or Apache 2.0). Check out the repo at
<https://github.com/wezm/rsspls> and if you like what you see maybe give it a
star.

[Alpine]: https://alpinelinux.org/
[css]: https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors
[Feed43]: https://feed43.com/
[feed]: https://en.wikipedia.org/wiki/RSS
[Feedbin]: https://feedbin.com/
[Feedfry]: https://feedfry.com/
[notifier]: https://feedbin.com/notifier
[rss-pr]: https://github.com/pulls?q=is%3Apr+author%3Awezm++rss+is%3Aclosed+
[rsspls]: https://github.com/wezm/rsspls
[windows-issue]: https://github.com/wezm/rsspls/issues/4
[TOML]: https://toml.io/
[Liferea]: http://lzone.de/liferea/