mirror of
https://github.com/wezm/wezm.net.git
synced 2024-11-18 04:42:47 +00:00
Add YouTube to OPML post
parent
07b8cc089f
commit
3acf555ada
3 changed files with 282 additions and 0 deletions
Binary file not shown.
After Width: | Height: | Size: 108 KiB |
BIN
v2/content/posts/2024/youtube-subscriptions-opml/feedme.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 486 KiB |
282
v2/content/posts/2024/youtube-subscriptions-opml/index.md
Normal file
@ -0,0 +1,282 @@
+++
title = "Exporting YouTube Subscriptions to OPML and Watching via RSS"
date = 2024-05-06T10:38:22+10:00

#[extra]
#updated = 2024-02-21T10:05:19+10:00
+++

This post describes how I exported my 500+ YouTube subscriptions to an OPML
file so that I could import them into my RSS reader. I go into fine detail
about the scripts and tools I used. If you just want to see the end result, the
code is in [this repository][repo], which describes the steps needed to run it.

I was previously a YouTube Premium subscriber but I cancelled it when they
jacked up the already high prices. Since then I've been watching videos in
[NewPipe] on my Android tablet or via an [Invidious] instance on real
computers.

<!-- more -->

To import my subscriptions into NewPipe I was able to use the
`subscriptions.csv` file included in the Google Takeout dump of my YouTube
data. This worked fine initially but imposed some friction when adding new
subscriptions.

If I only subscribed to new channels in NewPipe they were only
accessible on my tablet. If I added them to YouTube then I had to remember to
also add them in NewPipe, which was inconvenient if I wasn't using the tablet
at the time. Inevitably the subscriptions would drift out of sync and I would
have to periodically re-import the subscriptions from YouTube into NewPipe.
This was cumbersome as NewPipe doesn't seem to have a way to import
incrementally. Last time I had to nuke all its data in order to re-import.

To solve these problems I wanted to manage my subscriptions in my RSS reader,
[Feedbin]. This way Feedbin would track my subscriptions and new/viewed videos
in a way that would sync between all my devices. Notably this is possible
because Google actually publishes an RSS feed for each YouTube channel.

To do that I needed to export all my subscriptions to an OPML file that Feedbin
could import. I opted to do that without requesting another Google Takeout dump
as they take a long time to generate and also result in multiple gigabytes of
archives I have to download (they include all the videos I've uploaded to my
personal account) just to get at the `subscriptions.csv` file within.

### Generating OPML

I started by visiting my [subscriptions page][subscriptions] and using some
JavaScript in the browser console to generate a JSON array of all the channels
I am subscribed to:

```javascript
copy(JSON.stringify(Array.from(new Set(Array.prototype.map.call(document.querySelectorAll('a.channel-link'), (link) => link.href))).filter((x) => !x.includes('/channel/')), null, 2))
```

This snippet:

- Queries the page for all channel links
- Gets the link URL of each matching element
- Creates a `Set` from them to de-duplicate them
- Turns the set back into an `Array`
- Filters out ones that contain `/channel/` to exclude some links like Trending
  that also appear on that page
- Turns the `Array` into pretty printed JSON
- Copies it to the clipboard
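
For comparison, the same de-duplicate, filter, and serialise steps could be sketched in Python. This is only an illustration of the logic: the `channel_urls_to_json` name and the sample `hrefs` list are made up for the example, and the actual links still have to come from the page itself.

```python
import json


def channel_urls_to_json(hrefs):
    """De-duplicate channel links, drop /channel/ ones, pretty print as JSON."""
    # dict.fromkeys de-duplicates like a Set while keeping first-seen order
    unique = list(dict.fromkeys(hrefs))
    # Links containing /channel/ (such as Trending) are not subscriptions
    subscriptions = [url for url in unique if "/channel/" not in url]
    return json.dumps(subscriptions, indent=2)


# Hypothetical sample of what the page query might return
hrefs = [
    "https://www.youtube.com/@mooretech",
    "https://www.youtube.com/@mooretech",
    "https://www.youtube.com/channel/UCexample",
]
print(channel_urls_to_json(hrefs))
```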

With the list of channel URLs on my clipboard I pasted it into a
`subscriptions.json` file. The challenge now was that these URLs were of the
channel pages like:

`https://www.youtube.com/@mooretech`

but the RSS URL of a channel is like:

`https://www.youtube.com/feeds/videos.xml?channel_id=<CHANNEL_ID>`,

which means I needed to determine the channel id for each page. To do that
without futzing around with Google API keys and APIs I needed to download the
HTML of each channel page.

For the download I generated a config file for `curl` from the JSON file:

    jaq --raw-output '.[] | (split("/") | last) as $name | "url \(.)\noutput \($name).html"' subscriptions.json > subscriptions.curl

[jaq] is an alternative implementation of [jq] that I use. This `jaq` expression does the following:

- `.[]` iterate over each element of the `subscriptions.json` array.
- `(split("/") | last) as $name` split the URL on `/` and take the last element, storing it in a variable called `$name`.
  - for a URL like `https://www.youtube.com/@mooretech` this stores `@mooretech` in `$name`.
- `"url \(.)\noutput \($name).html"` generates the output text interpolating the channel page url and channel name.

This results in lines like this for each entry in `subscriptions.json`, output
to `subscriptions.curl`:

    url https://www.youtube.com/@mooretech
    output @mooretech.html
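
To make the transformation concrete, here is a minimal Python sketch that produces the same `url`/`output` pairs from a list of channel URLs. The `curl_config` function is a hypothetical helper written for this example; it is not part of the repository.

```python
def curl_config(urls):
    """Return curl config text with a url/output pair for each channel URL."""
    lines = []
    for url in urls:
        name = url.split("/")[-1]  # last path element, e.g. "@mooretech"
        lines.append(f"url {url}")
        lines.append(f"output {name}.html")
    return "\n".join(lines) + "\n"


# One sample entry, as it would appear in subscriptions.json
urls = ["https://www.youtube.com/@mooretech"]
print(curl_config(urls), end="")
```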

Curl was then run against this file to download all the pages:

    curl --location --output-dir html --create-dirs --rate 1/s --config subscriptions.curl

- `--location` tells curl to follow redirects. For some reason three of my subscriptions redirected to alternate names when accessed.
- `--output-dir` tells curl to output the files into the `html` directory.
- `--create-dirs` tells curl to create output directories if they don't exist (just the `html` one in this case).
- `--rate 1/s` tells curl to only download at a rate of 1 page per second. I was concerned YouTube might block me if I requested the pages too quickly.
- `--config subscriptions.curl` tells curl to read additional command line arguments from the `subscriptions.curl` file generated above.

Now that I had the HTML for each channel I needed to extract the channel id
from it. While I was processing each HTML file I also extracted the channel
title for use later. I ran the following script, which I called
`generate-json-opml`, once for each channel:

```sh
#!/bin/sh

set -eu

URL="$1"
NAME=$(echo "$URL" | awk -F / '{ print $NF }')
HTML="html/${NAME}.html"
CHANNEL_ID=$(scraper -a content 'meta[property="og:url"]' < "$HTML" | awk -F / '{ print $NF }')
TITLE=$(scraper -a content 'meta[property="og:title"]' < "$HTML")
XML_URL="https://www.youtube.com/feeds/videos.xml?channel_id=${CHANNEL_ID}"

json_escape() {
    echo "$1" | jaq --raw-input .
}

JSON_TITLE=$(json_escape "$TITLE")
JSON_XML_URL=$(json_escape "$XML_URL")
JSON_URL=$(json_escape "$URL")

printf '{"title": %s, "xmlUrl": %s, "htmlUrl": %s}\n' "$JSON_TITLE" "$JSON_XML_URL" "$JSON_URL" > json/"$NAME".json
```

Let's break that down:

- The channel URL is stored in `URL`.
- The channel name is determined by using `awk` to split the URL on `/` and take the last element.
- The path to the downloaded HTML page is stored in `HTML`.
- The channel id is determined by finding the `<meta>` tag in the HTML with a `property` attribute of `og:url` (the [OpenGraph metadata][OpenGraph] URL property). This URL is again split on `/` and the last element stored in `CHANNEL_ID`.
- Querying the HTML is done with a tool called [scraper] that allows you to use CSS selectors to extract parts of an HTML document.
- The channel title is extracted similarly, from the value of the `og:title` metadata.
- The URL of the RSS feed for the channel is stored in `XML_URL` using `CHANNEL_ID`.
- A function to escape strings destined for JSON is defined. This makes use of `jaq`.
- `TITLE`, `XML_URL`, and `URL` are escaped.
- Finally we generate a JSON object with the title, URL, and RSS URL and write it into a `json` directory under the name of the channel.
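
The extraction steps above can also be sketched with only the Python standard library. This is an illustrative alternative, not the script that was used: `OpenGraphParser` and `channel_info` are hypothetical names, and the sample HTML is a heavily trimmed stand-in that assumes the `og:url` value ends in the channel id, as described above.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect the content of <meta property="og:..."> tags."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property")
        if prop and prop.startswith("og:") and attrs.get("content") is not None:
            # Keep the first occurrence of each property
            self.properties.setdefault(prop, attrs["content"])


def channel_info(html, page_url):
    """Build the title/xmlUrl/htmlUrl record for one channel page."""
    parser = OpenGraphParser()
    parser.feed(html)
    # The channel id is the last path element of the og:url value
    channel_id = parser.properties["og:url"].split("/")[-1]
    return {
        "title": parser.properties["og:title"],
        "xmlUrl": f"https://www.youtube.com/feeds/videos.xml?channel_id={channel_id}",
        "htmlUrl": page_url,
    }


# Hypothetical, heavily trimmed channel page
sample = """<html><head>
<meta property="og:title" content="MooreTech">
<meta property="og:url" content="https://www.youtube.com/channel/UCLi0H57HGGpAdCkVOb_ykVg">
</head></html>"""
print(channel_info(sample, "https://www.youtube.com/@mooretech"))
```

Passing the resulting dictionary to `json.dumps` would take care of the escaping that the shell script delegates to `jaq`.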

Ok, almost there. That script had to be run for each of the channel URLs.
First I generated a file with just a plain text list of the channel URLs:

    jaq --raw-output '.[]' subscriptions.json > subscriptions.txt

Then I used `xargs` to process them in parallel:

    xargs -n1 --max-procs=$(nproc) --arg-file subscriptions.txt --verbose ./generate-json-opml

This does the following:

- `-n1` pass one argument (one line of `subscriptions.txt`) to each invocation of `generate-json-opml`.
- `--max-procs=$(nproc)` run up to as many processes in parallel as my machine has cores.
- `--arg-file subscriptions.txt` read arguments for `generate-json-opml` from `subscriptions.txt`.
- `--verbose` show the commands being run.
- `./generate-json-opml` the command to run (this is the script above).
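
A rough Python analogue of this fan-out is sketched below. Note the differences: it uses a thread pool rather than separate processes, and `process_all` and `channel_name` are hypothetical stand-ins for `xargs` and the per-channel script.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def process_all(urls, worker, max_workers=None):
    """Call worker(url) for each URL, running up to max_workers at once."""
    max_workers = max_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input URLs
        return list(pool.map(worker, urls))


# Hypothetical stand-in for the per-channel work done by generate-json-opml
def channel_name(url):
    return url.split("/")[-1]


print(process_all(["https://www.youtube.com/@mooretech"], channel_name))
```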

Finally all those JSON files need to be turned into an OPML file. For this I
used Python:

```python
#!/usr/bin/env python

import email.utils
import glob
import json
import xml.etree.ElementTree as ET

opml = ET.Element("opml")

# <head> holds the document title and creation date
head = ET.SubElement(opml, "head")
title = ET.SubElement(head, "title")
title.text = "YouTube Subscriptions"
dateCreated = ET.SubElement(head, "dateCreated")
dateCreated.text = email.utils.formatdate(timeval=None, localtime=True)

# All channels are grouped under a single "YouTube" outline
body = ET.SubElement(opml, "body")
youtube = ET.SubElement(body, "outline", {"title": "YouTube", "text": "YouTube"})

for path in glob.glob("json/*.json"):
    with open(path) as f:
        info = json.load(f)
    # The keys of info (title, xmlUrl, htmlUrl) become attributes of the outline
    ET.SubElement(youtube, "outline", info, type="rss", text=info["title"])

ET.indent(opml)
print(ET.tostring(opml, encoding="unicode", xml_declaration=True))
```

This generates an OPML file (which is XML) using the ElementTree library. The
OPML file has this structure:

```xml
<?xml version='1.0' encoding='utf-8'?>
<opml>
  <head>
    <title>YouTube Subscriptions</title>
    <dateCreated>Sun, 05 May 2024 15:57:23 +1000</dateCreated>
  </head>
  <body>
    <outline title="YouTube" text="YouTube">
      <outline title="MooreTech" xmlUrl="https://www.youtube.com/feeds/videos.xml?channel_id=UCLi0H57HGGpAdCkVOb_ykVg" htmlUrl="https://www.youtube.com/@mooretech" type="rss" text="MooreTech" />
    </outline>
  </body>
</opml>
```

The script does the following:

- Generates the top level OPML structure.
- Reads and parses each JSON file, then uses it to generate an `outline` entry for that channel.
- Indents the OPML document.
- Writes it to stdout as a Unicode string with an XML declaration (`<?xml version='1.0' encoding='utf-8'?>`).
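
One way to sanity check a generated file before importing it is to parse it back and list the feeds it contains. The following is a minimal sketch under that assumption; `list_feeds` is a hypothetical helper and the sample document is a cut-down version of the structure shown above.

```python
import xml.etree.ElementTree as ET


def list_feeds(opml_text):
    """Return (title, xmlUrl) for every outline that has an xmlUrl attribute."""
    root = ET.fromstring(opml_text)
    return [
        (outline.get("title"), outline.get("xmlUrl"))
        for outline in root.iter("outline")
        if outline.get("xmlUrl")
    ]


sample = """<opml>
  <head><title>YouTube Subscriptions</title></head>
  <body>
    <outline title="YouTube" text="YouTube">
      <outline title="MooreTech" xmlUrl="https://www.youtube.com/feeds/videos.xml?channel_id=UCLi0H57HGGpAdCkVOb_ykVg" htmlUrl="https://www.youtube.com/@mooretech" type="rss" text="MooreTech" />
    </outline>
  </body>
</opml>"""

for title, xml_url in list_feeds(sample):
    print(title, xml_url)
```

The grouping `YouTube` outline has no `xmlUrl`, so it is skipped and only the channel entries are listed.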

Whew, that was a lot! With the OPML file generated I was finally able to import
all my subscriptions into Feedbin.

All the code is available in [this
repository](https://forge.wezm.net/wezm/youtube-to-opml). In practice I used a
`Makefile` to run the various commands so that I didn't have to remember them.

### Watching videos from Feedbin

Now that Feedbin is the source of truth for subscriptions, how do I actually
watch the videos? I set up the [FeedMe] app on my Android tablet. In the
settings I enabled the NewPipe integration and set it to open the video page
when tapped:

{{ figure(image="posts/2024/youtube-subscriptions-opml/feedme-settings.png", link="posts/2024/youtube-subscriptions-opml/feedme-settings.png", alt='Screenshot of the FeedMe integration settings. There are lots of apps listed. The entry for NewPipe is turned on.', caption="Screenshot of the FeedMe integration settings") }}

Now when viewing an item in FeedMe there is a NewPipe button that I can tap to
watch it:

{{ figure(image="posts/2024/youtube-subscriptions-opml/feedme.png", link="posts/2024/youtube-subscriptions-opml/feedme.png", alt='Screenshot of FeedMe viewing a video item. In the top left there is a NewPipe button, which when tapped opens the video in NewPipe.', caption="Screenshot of FeedMe viewing a video item") }}

### Closing Thoughts

Could I have done all the processing to generate the OPML file with a single
Python file? Yes, but I rarely write Python so I preferred to just cobble
things together from tools I already knew.

Should I ever become a YouTube Premium subscriber again I can continue to
use this workflow and watch the videos from the YouTube embeds that
Feedbin generates, or open the item in the YouTube app instead of NewPipe.

Lastly, what about desktop usage? When I'm on a real computer I read my RSS via
the Feedbin web app. It supports [custom sharing
integrations][feedbin-sharing]. In order to open a video on an Invidious
instance I need to rewrite its URL from one like:

<https://www.youtube.com/watch?v=u1wfCnRINkE>

to one like:

<https://invidious.perennialte.ch/watch?v=u1wfCnRINkE>.

I can't do that directly with a Feedbin custom sharing service definition but
it would be trivial to set up a little redirector application to do it. I even
published [a video on building a very similar thing][url-shortener] last year.
Alternatively I could install a [redirector browser
plugin](https://docs.invidious.io/redirector/), although that would require
setup on each of the computers and OS installs I use so I prefer the former
option.
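
The core of such a redirector is just a host swap. As a minimal sketch (the `to_invidious` function is hypothetical, and a real redirector would wrap it in a small web application):

```python
from urllib.parse import urlsplit, urlunsplit

INVIDIOUS_HOST = "invidious.perennialte.ch"


def to_invidious(url, host=INVIDIOUS_HOST):
    """Swap the host of a YouTube URL for an Invidious instance, keeping the rest."""
    parts = urlsplit(url)
    if parts.hostname not in ("www.youtube.com", "youtube.com"):
        raise ValueError(f"not a YouTube URL: {url}")
    # Path and query string (the video id) are left untouched
    return urlunsplit(parts._replace(netloc=host))


print(to_invidious("https://www.youtube.com/watch?v=u1wfCnRINkE"))
```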

[url-shortener]: https://www.youtube.com/watch?v=d-tsfUVg4II
[Invidious]: https://invidious.io/
[Feedbin]: https://feedbin.com/
[scraper]: https://github.com/causal-agent/scraper
[repo]: https://forge.wezm.net/wezm/youtube-to-opml
[NewPipe]: https://newpipe.net/
[subscriptions]: https://www.youtube.com/feed/channels
[jaq]: https://github.com/01mf02/jaq
[jq]: https://jqlang.github.io/jq/
[FeedMe]: https://play.google.com/store/apps/details?id=com.seazon.feedme
[feedbin-sharing]: https://feedbin.com/help/sharing-read-it-later-services/
[OpenGraph]: https://ogp.me/