forked from wezm/wezm.net
189 lines
7 KiB
Markdown
189 lines
7 KiB
Markdown
|
+++
|
||
|
title = "Generating RSS Feeds From Web Pages With RSS Please"
|
||
|
date = 2022-07-04T09:54:29+10:00
|
||
|
|
||
|
[extra]
|
||
|
#updated = 2022-01-27T21:07:32+10:00
|
||
|
+++
|
||
|
|
||
|
Sometimes I come across a web page that I'd like to revisit when there's
|
||
|
new content. Typically, I do this by subscribing to the [RSS feed][feed] in
|
||
|
[Feedbin]. Unfortunately some sites don't provide an RSS feed, which is why I
|
||
|
built [RSS Please][rsspls] (`rsspls`). RSS Please allows you to generate an RSS
|
||
|
feed by extracting specific parts of a web page. In this post I give a bit of
|
||
|
background on the tool and how I'm running it in my Docker infrastructure.
|
||
|
|
||
|
<!-- more -->
|
||
|
|
||
|
### Background
|
||
|
|
||
|
Sometimes an RSS feed isn't available on a website. If the site is open source
|
||
|
I will often try to [open a PR to add or enable one][rss-pr]. That's not always
|
||
|
possible though. Other time the page may be one that would naturally think to provide a feed for, but one would still be useful.
|
||
|
|
||
|
As an example, when we were looking to buy a house I noticed that listings
|
||
|
would often go live on agent's websites several days or more before they were
|
||
|
published to the big aggregators. The market was very competitive so I was
|
||
|
regularly visiting all the real estate agent websites to run my search, and
|
||
|
check for new listings. At the time I used [Feedfry] to create RSS feeds from
|
||
|
the search results. I could then subscribe to them in [Feedbin]. Paired with
|
||
|
the [Feedbin Notifier app][notifier] I received a notification on my phone
|
||
|
whenever there was a new listing matching my search criteria from any of the
|
||
|
agents.
|
||
|
|
||
|
Feedfry is free with ads or paid subscription. I paid while house shopping but
|
||
|
let that lapse afterwards. I don't begrudge them funding the service with ads or
|
||
|
subscriptions but I figured I could probably put something together and
|
||
|
self-host it. At the same time providing a bit more control over how the
|
||
|
elements of the page were extracted to generate the feed. [RSS Please][rsspls]
|
||
|
is the result.
|
||
|
|
||
|
RSS Please is an open-source command line application implemented in Rust. It
|
||
|
has no runtime dependencies and runs on UNIX-like platforms including FreeBSD,
|
||
|
Linux, and macOS. Once I resolve [this issue][windows-issue] it will run on
|
||
|
Windows too. The following sections describe how it's configured and how I'm
|
||
|
running it on my server.
|
||
|
|
||
|
### Configuration
|
||
|
|
||
|
The `rsspls` configuration file allows a number of feeds to be defined. It
|
||
|
uses [CSS Selectors][css] to describe how parts of each page will be extracted
|
||
|
to produce a feed. As an example here's a configuration that builds a feed
|
||
|
from this site—although I already have an RSS feed at
|
||
|
<https://www.wezm.net/v2/rss.xml> if you want to subscribe.
|
||
|
|
||
|
```toml
|
||
|
# The configuration must start with the [rsspls] section
|
||
|
[rsspls]
|
||
|
output = "/tmp"
|
||
|
|
||
|
[[feed]]
|
||
|
# The title of the channel in the feed
|
||
|
title = "Example WezM.net Feed"
|
||
|
# The output filename within the output directory to write this feed to.
|
||
|
filename = "wezm.rss"
|
||
|
|
||
|
[feed.config]
|
||
|
url = "https://www.wezm.net/"
|
||
|
item = "article"
|
||
|
heading = "h3 a"
|
||
|
summary = ".post-body"
|
||
|
date = "time"
|
||
|
```
|
||
|
|
||
|
The configuration format is [TOML]. The `item` key selects `article` elements
|
||
|
from the page. `heading`, `summary`, and `date` are selectors upon the element
|
||
|
selected by `item`. `summary` and `date` are optional. `heading` is expected to
|
||
|
select an element with a `href` attribute, which is used as the link for the
|
||
|
item in the feed.
|
||
|
|
||
|
### Running It
|
||
|
|
||
|
Once installed running `rsspls` will update the configured feeds. Caching is
|
||
|
used to skip updates when the origin server indicates nothing has changed since
|
||
|
last time. By default `rsspls` looks for its configuration file in
|
||
|
`$XDG_CONFIG_HOME/rsspls/feeds.toml`, defaulting to
|
||
|
`~/.config/rsspls/feeds.toml` if `XDG_CONFIG_HOME` is not set. Alternatively
|
||
|
the path can be supplied with `--config`.
|
||
|
|
||
|
{{ figure(image="posts/2022/generate-rss-from-webpage/rsspls-output.png", link="posts/2022/generate-rss-from-webpage/rsspls-output.png", alt="Screenshot of the output when running rsspls. It has several log messages prefixed with INFO describing the actions taken", caption="rsspls prints informational messages when updating feeds", width=335, border=false) }}
|
||
|
|
||
|
### Deployment
|
||
|
|
||
|
Since I
|
||
|
[host my things with Docker + Compose](@/posts/2022/alpine-linux-docker-infrastructure-three-years/index.md)
|
||
|
I'm running `rsspls` with Docker as well, but that's not required. There are
|
||
|
plenty of other ways you could go about it. E.g. you could have `cron` run
|
||
|
`rsspls` on your computer and `rsync` the feeds to a server. Some RSS aggregators
|
||
|
like [Liferea] even let you subscribe to local files.
|
||
|
|
||
|
I create a Docker image from the `rsspls` binaries I publish:
|
||
|
|
||
|
```dockerfile
|
||
|
FROM wezm-alpine:3.16.0
|
||
|
|
||
|
# UID needs to match owner of /home/rsspls/feeds volume
|
||
|
ARG PUID=1000
|
||
|
ARG PGID=1000
|
||
|
ARG USER=rsspls
|
||
|
|
||
|
RUN addgroup -g ${PGID} ${USER} && \
|
||
|
adduser -D -u ${PUID} -G ${USER} -h /home/${USER} -D ${USER}
|
||
|
|
||
|
ARG RSSPLS_VERSION=0.2.0
|
||
|
|
||
|
RUN cd /usr/local/bin && \
|
||
|
wget -O - https://releases.wezm.net/rsspls/${RSSPLS_VERSION}/rsspls-${RSSPLS_VERSION}-x86_64-unknown-linux-musl.tar.gz | tar zxf - && \
|
||
|
mkdir /home/${USER}/feeds && \
|
||
|
chown ${USER}:${USER} /home/${USER}/feeds
|
||
|
|
||
|
COPY ./entrypoint.sh /home/${USER}/entrypoint.sh
|
||
|
|
||
|
WORKDIR /home/${USER}
|
||
|
|
||
|
USER ${USER}
|
||
|
|
||
|
VOLUME ["/home/rsspls/feeds"]
|
||
|
ENTRYPOINT ["./entrypoint.sh"]
|
||
|
```
|
||
|
|
||
|
It uses my standard [Alpine] base image which is built from the "Mini root
|
||
|
filesystem" they publish and does not require any other packages to be
|
||
|
installed.
|
||
|
|
||
|
I use an entry point script to run `rsspls` every 12 hours:
|
||
|
|
||
|
```sh
|
||
|
#!/bin/sh
|
||
|
|
||
|
set -e
|
||
|
|
||
|
trap 'exit' TERM INT
|
||
|
|
||
|
while true; do
|
||
|
rsspls --config /etc/rsspls.toml
|
||
|
sleep 1036800 # 12 hours
|
||
|
done
|
||
|
```
|
||
|
|
||
|
In my `docker-compose.yml` I have the following:
|
||
|
|
||
|
```yaml
|
||
|
rsspls:
|
||
|
image: example.com/rsspls
|
||
|
volumes:
|
||
|
- ./rsspls/rsspls.toml:/etc/rsspls.toml:ro
|
||
|
- ./volumes/www/rsspls.wezm.net:/home/rsspls/feeds
|
||
|
restart: unless-stopped
|
||
|
```
|
||
|
|
||
|
The `./volumes/www/rsspls.wezm.net` path is shared with the container running `nginx`, so
|
||
|
the generated feeds are accessible at `rsspls.wezm.net`—although I'm not making them
|
||
|
obvious to visitors (there's no directory index so visiting that domain will just give
|
||
|
a 403 Forbidden error).
|
||
|
|
||
|
### Conclusion
|
||
|
|
||
|
This was a fun project to put together over a weekend. I get a lot of
|
||
|
satisfaction from building and self-hosting tools to solve my own problems. Not
|
||
|
everyone has the time or desire to do that though so if you're looking for
|
||
|
similar functionality check out [Feed43] and [Feedfry].
|
||
|
|
||
|
As mentioned the tool is open-source (MIT or Apache 2.0). Check out the repo at
|
||
|
<https://github.com/wezm/rsspls> and if you like what you see maybe give it a
|
||
|
star.
|
||
|
|
||
|
|
||
|
[Alpine]: https://alpinelinux.org/
|
||
|
[css]: https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors
|
||
|
[Feed43]: https://feed43.com/
|
||
|
[feed]: https://en.wikipedia.org/wiki/RSS
|
||
|
[Feedbin]: https://feedbin.com/
|
||
|
[Feedfry]: https://feedfry.com/
|
||
|
[notifier]: https://feedbin.com/notifier
|
||
|
[rss-pr]: https://github.com/pulls?q=is%3Apr+author%3Awezm++rss+is%3Aclosed+
|
||
|
[rsspls]: https://github.com/wezm/rsspls
|
||
|
[windows-issue]: https://github.com/wezm/rsspls/issues/4
|
||
|
[TOML]: https://toml.io/
|
||
|
[Liferea]: http://lzone.de/liferea/
|