+++
title = "Documentation"
description = "Documentation"
weight = 2
+++
### Configuration
Unless a path is given via the `--config` command line option, `rsspls` reads
its configuration from one of the following paths:
* UNIX-like systems:
    * `$XDG_CONFIG_HOME/rsspls/feeds.toml`
    * `~/.config/rsspls/feeds.toml` if `XDG_CONFIG_HOME` is unset.
* Windows:
    * `C:\Users\You\AppData\Roaming\rsspls\feeds.toml`
The configuration file is in [TOML][toml] format.
The parts of the page to extract for the feed are specified using [CSS
selectors][selectors].
#### Annotated Sample Configuration
The sample file below demonstrates all the parts of the configuration.
```toml
# The configuration must start with the [rsspls] section
[rsspls]
# Optional output directory to write the feeds to. If not specified it must be supplied via
# the --output command line option.
output = "/tmp"
# Optional proxy address. If specified, all requests will be routed through it.
# The address needs to be in the format: protocol://ip_address:port
# The supported protocols are: http, https, socks and socks5h.
# It can also be specified as environment variable `http_proxy` or `HTTPS_PROXY`.
# The config file takes precedence, then the env vars in the above order.
# proxy = "socks5://10.64.0.1:1080"
# Next is the array of feeds, each one starts with [[feed]]
[[feed]]
# The title of the channel in the feed
title = "My Great RSS Feed"
# The output filename without the output directory to write this feed to.
# Note: this is a filename only, not a path. It should not contain slashes.
filename = "wezm.rss"
# Optional User-Agent header to be set for the HTTP request.
# user_agent = "Mozilla/5.0"
# The configuration for the feed
[feed.config]
# The URL of the web page to generate the feed from.
url = "https://www.wezm.net/"
# A CSS selector to select elements on the page that represent items in the feed.
item = "article"
# A CSS selector relative to `item` to an element that will supply the title for the item.
heading = "h3"
# A CSS selector relative to `item` to an element that will supply the link for the item.
# Note: This element must have a `href` attribute.
# Note: If not supplied, rsspls will attempt to use the heading selector for the link, for
# backwards compatibility with earlier versions. A message will be emitted in this case.
link = "h3 a"
# Optional CSS selector relative to `item` that will supply the content of the RSS item.
summary = ".post-body"
# Optional CSS selector relative to `item` that supplies media content (audio, video, image)
# to be added as an RSS enclosure.
# Note: The media URL must be given by the `src` or `href` attribute of the selected element.
# Note: Currently if the item does not match the media selector then it will be skipped.
# media = "figure img"
# Optional CSS selector relative to `item` that supplies the publication date of the RSS item.
date = "time"
# Alternatively for more control `date` can be specified as a table:
# [feed.config.date]
# selector = "time"
# # Optional type of value being parsed.
# # Defaults to DateTime, can also be Date if you're parsing a value without a time.
# type = "Date"
# # format of the date to parse. See the following for the syntax
# # https://time-rs.github.io/book/api/format-description.html
# format = "[day padding:none]/[month padding:none]/[year]" # will parse 1/2/1934 style dates
# A second example feed
[[feed]]
title = "Example Site"
filename = "example.rss"
[feed.config]
url = "https://example.com/"
item = "div"
heading = "a"
```
The first example above (for my blog WezM.net) matches HTML that looks like this:
```html
<section class="posts-section">
<h2>Recent Posts</h2>
<article id="garage-door-monitor">
<h3><a href="https://www.wezm.net/v2/posts/2022/garage-door-monitor/">Monitoring My Garage Door With a Raspberry Pi, Rust, and a 13Mb Linux System</a></h3>
<div class="post-metadata">
<div class="date-published">
<time datetime="2022-04-20T06:38:27+10:00">20 April 2022</time>
</div>
</div>
<div class="post-body">
<p>I've accidentally left our garage door open a few times. To combat this I built
a monitor that sends an alert via Mattermost when the door has been left open
for more than 5 minutes. This turned out to be a super fun project. I used
parts on hand as much as possible, implemented the monitoring application in
Rust, and then built a stripped down Linux image to run it.
</p>
</div>
<a href="https://www.wezm.net/v2/posts/2022/garage-door-monitor/">Continue Reading →</a>
</article>
<article id="monospace-kobo-ereader">
<!-- another article -->
</article>
<!-- more articles -->
<a href="https://www.wezm.net/v2/posts/">View more posts →</a>
</section>
```
#### More Detail on Date Handling
The `date` key in the configuration can be a string or a table. If it's a
string then it's used as a selector to find the element containing the date and
`rsspls` will attempt to automatically parse the value. If automatic parsing
fails you can manually specify the format using the table form of `date`:
```toml
[feed.config.date]
selector = "time" # required
type = "Date"
format = "[day padding:none]/[month padding:none]/[year]"
```
* `type` is `Date` when you want to parse just a date. Use `DateTime` if you're
parsing a date and time with the format. Defaults to `DateTime`.
* `format` is a format description using the syntax described on this page:
<https://time-rs.github.io/book/api/format-description.html>.
If the element matched by the `date` selector is a `<time>` element then
`rsspls` will first try to parse the value in the `datetime` attribute if
present. If the attribute is missing or the element is not a `time` element
then `rsspls` will use the supplied format or attempt automatic parsing of the
text content of the element.
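
As a further illustration of the table form, the sketch below assumes a hypothetical page where each item carries its date as plain text, such as `<span class="date">20 April 2022</span>`, rather than in a `<time>` element with a `datetime` attribute. The URL, filename, and selectors are placeholders. The format string uses the `[month repr:long]` component from the format description syntax linked above to match a full English month name.

```toml
# Hypothetical feed: assumes each item contains markup like
#   <span class="date">20 April 2022</span>
# The URL, filename, and selectors are placeholders, not a real site.
[[feed]]
title = "Example Site"
filename = "example.rss"

[feed.config]
url = "https://example.com/"
item = "article"
heading = "h3"
link = "h3 a"

[feed.config.date]
selector = ".date"
# Only a date is present in the text, so use Date rather than DateTime.
type = "Date"
# Day without padding, full month name, four-digit year.
format = "[day padding:none] [month repr:long] [year]"
```

If the text also included a time of day, `type` would be left at its `DateTime` default and the format extended with time components.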
### Hosting
It is expected that `rsspls` will be run on a web server that is serving the
directory the feeds are written to. `rsspls` only generates the feeds; it is
not a server. For the feeds to stay up to date you will need to arrange for
`rsspls` to be run periodically. You might do this with [cron], [systemd
timers][timers], or the Windows equivalent.
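
One way to arrange this is to point the `output` setting straight at a directory the web server already serves. The fragment below is a minimal sketch; the path is an assumption and should be replaced with your server's actual document root or a subdirectory of it.

```toml
[rsspls]
# Hypothetical path: substitute the directory your web server actually serves.
output = "/var/www/example.com/feeds"
```

With this setting, a feed configured with `filename = "wezm.rss"` would be written to `/var/www/example.com/feeds/wezm.rss`. The URL it is reachable at depends on how the web server maps that directory.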
### Caveats
`rsspls` just fetches and parses the HTML of the web page you specify. It does
not run JavaScript. If the website is entirely generated by JavaScript (such as
Twitter) then `rsspls` will not work.
### Caching
When a website responds with cache headers, `rsspls` will make a conditional
request on subsequent runs and will not regenerate the feed if the server
responds with 304 Not Modified. Cache data is stored in
`$XDG_CACHE_HOME/rsspls`, which defaults to `~/.cache/rsspls` on UNIX-like
systems or `C:\Users\You\AppData\Local\rsspls` on Windows.