mirror of https://github.com/wezm/wezm.net.git
synced 2024-11-10 01:42:32 +00:00

Add YouTube to OPML post

parent 07b8cc089f
commit 3acf555ada
3 changed files with 282 additions and 0 deletions

+++
title = "Exporting YouTube Subscriptions to OPML and Watching via RSS"
date = 2024-05-06T10:38:22+10:00

#[extra]
#updated = 2024-02-21T10:05:19+10:00
+++

This post describes how I exported my 500+ YouTube subscriptions to an OPML
file so that I could import them into my RSS reader. I go into fine detail
about the scripts and tools I used. If you just want to see the end result,
the code is in [this repository][repo], which describes the steps needed to
run it.

I was previously a YouTube Premium subscriber but I cancelled it when they
jacked up the already high prices. Since then I've been watching videos in
[NewPipe] on my Android tablet or via an [Invidious] instance on real
computers.

<!-- more -->

To import my subscriptions into NewPipe I was able to use the
`subscriptions.csv` file included in the Google Takeout dump of my YouTube
data. This worked fine initially but imposed some friction when adding new
subscriptions.

If I only subscribed to new channels in NewPipe they were only accessible on
my tablet. If I added them on YouTube then I had to remember to also add them
in NewPipe, which was inconvenient if I wasn't using the tablet at the time.
Inevitably the subscriptions would drift out of sync and I would have to
periodically re-import the subscriptions from YouTube into NewPipe. This was
cumbersome as NewPipe doesn't seem to have a way to do this incrementally;
last time I had to nuke all its data in order to re-import.

To solve these problems I wanted to manage my subscriptions in my RSS reader,
[Feedbin]. This way Feedbin would track my subscriptions and new/viewed videos
in a way that would sync between all my devices. Notably this is possible
because Google actually publishes an RSS feed for each YouTube channel.

To do that I needed to export all my subscriptions to an OPML file that Feedbin
could import. I opted to do that without requesting another Google Takeout
dump, as they take a long time to generate and also result in multiple
gigabytes of archives I have to download (it includes all the videos I've
uploaded to my personal account) just to get at the `subscriptions.csv` file
within.

### Generating OPML

I started by visiting my [subscriptions page][subscriptions] and using some
JavaScript to generate a JSON array of all the channels I am subscribed to:

```javascript
copy(JSON.stringify(Array.from(new Set(Array.prototype.map.call(document.querySelectorAll('a.channel-link'), (link) => link.href))).filter((x) => !x.includes('/channel/')), null, 2))
```

This snippet:

- queries the page for all channel links
- gets the link URL of each matching element
- creates a `Set` from them to de-duplicate them
- turns the set back into an `Array`
- filters out URLs that contain `/channel/` to exclude links like Trending
  that also appear on that page
- turns the array into pretty-printed JSON
- copies it to the clipboard

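The same de-duplicate and filter steps can be sketched in Python; the sample
links below are hypothetical stand-ins for the hrefs collected from the page:

```python
import json

# Hypothetical hrefs gathered from document.querySelectorAll('a.channel-link').
links = [
    "https://www.youtube.com/@mooretech",
    "https://www.youtube.com/@mooretech",        # duplicate link on the page
    "https://www.youtube.com/channel/trending",  # non-channel link to exclude
]

# De-duplicate with a set, then drop any URL containing "/channel/".
urls = [u for u in set(links) if "/channel/" not in u]
print(json.dumps(urls, indent=2))
```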
With the list of channel URLs on my clipboard I pasted it into a
`subscriptions.json` file. The challenge now was that these URLs were for the
channel pages, like:

`https://www.youtube.com/@mooretech`

but the RSS URL of a channel looks like:

`https://www.youtube.com/feeds/videos.xml?channel_id=<CHANNEL_ID>`

This means I needed to determine the channel id for each page. To do that
without futzing around with Google API keys and APIs I needed to download the
HTML of each channel page.

To do that I generated a config file for `curl` from the JSON file:

```sh
jaq --raw-output '.[] | (split("/") | last) as $name | "url \(.)\noutput \($name).html"' subscriptions.json > subscriptions.curl
```

[jaq] is an alternative implementation of [jq] that I use. This `jaq`
expression does the following:

- `.[]` iterate over each element of the `subscriptions.json` array.
- `(split("/") | last) as $name` split the URL on `/` and take the last element, storing it in a variable called `$name`.
  - for a URL like `https://www.youtube.com/@mooretech` this stores `@mooretech` in `$name`.
- `"url \(.)\noutput \($name).html"` generate the output text, interpolating the channel page URL and channel name.

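The same transformation can be sketched in Python (the single-entry
`subscriptions` list here stands in for the real `subscriptions.json`):

```python
# For each channel URL, take the last path segment as the output file name
# and emit the "url …"/"output …" pair that curl's config format expects.
subscriptions = ["https://www.youtube.com/@mooretech"]  # stand-in data

lines = []
for url in subscriptions:
    name = url.split("/")[-1]  # e.g. "@mooretech"
    lines.append(f"url {url}\noutput {name}.html")

print("\n".join(lines))
# → url https://www.youtube.com/@mooretech
#   output @mooretech.html
```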
This results in lines like this for each entry in `subscriptions.json`, output
to `subscriptions.curl`:

```
url https://www.youtube.com/@mooretech
output @mooretech.html
```

Curl was then run against this file to download all the pages:

```sh
curl --location --output-dir html --create-dirs --rate 1/s --config subscriptions.curl
```

- `--location` tells curl to follow redirects; for some reason three of my subscriptions redirected to alternate names when accessed.
- `--output-dir` tells curl to output the files into the `html` directory.
- `--create-dirs` tells curl to create output directories if they don't exist (just the `html` one in this case).
- `--rate 1/s` tells curl to download at a rate of at most one page per second; I was concerned YouTube might block me if I requested the pages too quickly.
- `--config subscriptions.curl` tells curl to read additional command line arguments from the `subscriptions.curl` file generated above.

Now that I had the HTML for each channel I needed to extract the channel id
from it. While processing each HTML file I also extracted the channel title
for use later. I ran the following script, which I called `generate-json-opml`,
on each HTML file:

```sh
#!/bin/sh

set -eu

URL="$1"
NAME=$(echo "$URL" | awk -F / '{ print $NF }')
HTML="html/${NAME}.html"
CHANNEL_ID=$(scraper -a content 'meta[property="og:url"]' < "$HTML" | awk -F / '{ print $NF }')
TITLE=$(scraper -a content 'meta[property="og:title"]' < "$HTML")
XML_URL="https://www.youtube.com/feeds/videos.xml?channel_id=${CHANNEL_ID}"

json_escape() {
    echo "$1" | jaq --raw-input .
}

JSON_TITLE=$(json_escape "$TITLE")
JSON_XML_URL=$(json_escape "$XML_URL")
JSON_URL=$(json_escape "$URL")

printf '{"title": %s, "xmlUrl": %s, "htmlUrl": %s}\n' "$JSON_TITLE" "$JSON_XML_URL" "$JSON_URL" > json/"$NAME".json
```

Let's break that down:

- The channel URL is stored in `URL`.
- The channel name is determined by using `awk` to split the URL on `/` and take the last element.
- The path to the downloaded HTML page is stored in `HTML`.
- The channel id is determined by finding the `<meta>` tag in the HTML with a `property` attribute of `og:url` (the [OpenGraph metadata][OpenGraph] URL property). This URL is again split on `/` and the last element stored in `CHANNEL_ID`.
  - Querying the HTML is done with a tool called [scraper] that allows you to use CSS selectors to extract parts of an HTML document.
- The channel title is extracted similarly, from the value of the `og:title` metadata.
- The URL of the RSS feed for the channel is stored in `XML_URL` using `CHANNEL_ID`.
- A function to escape strings destined for JSON is defined. This makes use of `jaq`.
- `TITLE`, `XML_URL`, and `URL` are escaped.
- Finally we generate a JSON object with the title, URL, and RSS URL and write it into a `json` directory under the name of the channel.

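As a rough illustration of what the script pulls out of each page, here is the
same OpenGraph lookup sketched in Python, using a naive regular expression in
place of `scraper` and run over a hypothetical page fragment (the channel id
below is invented):

```python
import json
import re

# Hypothetical fragment of a downloaded channel page; real pages contain
# these <meta> tags among much else, and this channel id is made up.
html = (
    '<meta property="og:title" content="MooreTech">'
    '<meta property="og:url" content="https://www.youtube.com/channel/UC0123456789abcdef">'
)

def og_content(prop):
    # Naive stand-in for: scraper -a content 'meta[property="og:…"]'
    match = re.search(rf'<meta property="og:{prop}" content="([^"]*)"', html)
    return match.group(1)

# Split the og:url on "/" and take the last element, as the script's awk does.
channel_id = og_content("url").split("/")[-1]
entry = {
    "title": og_content("title"),
    "xmlUrl": f"https://www.youtube.com/feeds/videos.xml?channel_id={channel_id}",
    "htmlUrl": "https://www.youtube.com/@mooretech",
}
print(json.dumps(entry))
```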
Ok, almost there. That script had to be run for each of the channel URLs.
First I generated a file with just a plain text list of the channel URLs:

```sh
jaq --raw-output '.[]' subscriptions.json > subscriptions.txt
```

Then I used `xargs` to process them in parallel:

```sh
xargs -n1 --max-procs=$(nproc) --arg-file subscriptions.txt --verbose ./generate-json-opml
```

This does the following:

- `-n1` read one line at a time from `subscriptions.txt` to be passed as the argument to `generate-json-opml`.
- `--max-procs=$(nproc)` run up to as many processes in parallel as my machine has cores.
- `--arg-file subscriptions.txt` read arguments for `generate-json-opml` from `subscriptions.txt`.
- `--verbose` show the commands being run.
- `./generate-json-opml` the command to run (this is the script above).

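For comparison, the same fan-out pattern can be sketched in Python with
`concurrent.futures`; this is not the workflow I used, and the URLs and
`process` function are stand-ins for the real list and script:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def process(url):
    # Stand-in for invoking ./generate-json-opml with the URL as its argument;
    # a real version might use subprocess.run(["./generate-json-opml", url]).
    return url.split("/")[-1]

# Hypothetical subscriptions.txt contents.
urls = ["https://www.youtube.com/@mooretech", "https://www.youtube.com/@example"]

# Run up to one task per CPU core, like xargs --max-procs=$(nproc).
with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    names = list(pool.map(process, urls))

print(names)
```

Threads suffice here because the real work would be subprocess-bound rather
than CPU-bound in Python itself.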
Finally all those JSON files need to be turned into an OPML file. For this I
used Python:

```python
#!/usr/bin/env python

import email.utils
import glob
import json
import xml.etree.ElementTree as ET

opml = ET.Element("opml")

head = ET.SubElement(opml, "head")
title = ET.SubElement(head, "title")
title.text = "YouTube Subscriptions"
dateCreated = ET.SubElement(head, "dateCreated")
dateCreated.text = email.utils.formatdate(timeval=None, localtime=True)

body = ET.SubElement(opml, "body")
youtube = ET.SubElement(body, "outline", {"title": "YouTube", "text": "YouTube"})

for path in glob.glob("json/*.json"):
    with open(path) as f:
        info = json.load(f)
    ET.SubElement(youtube, "outline", info, type="rss", text=info["title"])

ET.indent(opml)
print(ET.tostring(opml, encoding="unicode", xml_declaration=True))
```

This generates an OPML file (which is XML) using the ElementTree library. The
OPML file has this structure:

```xml
<?xml version='1.0' encoding='utf-8'?>
<opml>
  <head>
    <title>YouTube Subscriptions</title>
    <dateCreated>Sun, 05 May 2024 15:57:23 +1000</dateCreated>
  </head>
  <body>
    <outline title="YouTube" text="YouTube">
      <outline title="MooreTech" xmlUrl="https://www.youtube.com/feeds/videos.xml?channel_id=UCLi0H57HGGpAdCkVOb_ykVg" htmlUrl="https://www.youtube.com/@mooretech" type="rss" text="MooreTech" />
    </outline>
  </body>
</opml>
```

It does the following:

- Generates the top level OPML structure.
- For each JSON file, reads and parses the JSON and then uses that to generate an `outline` entry for that channel.
- Indents the OPML document.
- Writes it to stdout using a Unicode encoding with an XML declaration (`<?xml version='1.0' encoding='utf-8'?>`).

Whew, that was a lot! With the OPML file generated I was finally able to import
all my subscriptions into Feedbin.

All the code is available in [this
repository](https://forge.wezm.net/wezm/youtube-to-opml). In practice I used a
`Makefile` to run the various commands so that I didn't have to remember them.

### Watching videos from Feedbin

Now that Feedbin is the source of truth for subscriptions, how do I actually
watch them? I set up the [FeedMe] app on my Android tablet. In the settings I
enabled the NewPipe integration and set it to open the video page when tapped:

{{ figure(image="posts/2024/youtube-subscriptions-opml/feedme-settings.png", link="posts/2024/youtube-subscriptions-opml/feedme-settings.png", alt='Screenshot of the FeedMe integration settings. There are lots of apps listed. The entry for NewPipe is turned on.', caption="Screenshot of the FeedMe integration settings") }}

Now when viewing an item in FeedMe there is a NewPipe button that I can tap to
watch it:

{{ figure(image="posts/2024/youtube-subscriptions-opml/feedme.png", link="posts/2024/youtube-subscriptions-opml/feedme.png", alt='Screenshot of FeedMe viewing a video item. In the top left there is a NewPipe button, which when tapped opens the video in NewPipe.', caption="Screenshot of FeedMe viewing a video item") }}

### Closing Thoughts

Could I have done all the processing to generate the OPML file with a single
Python file? Yes, but I rarely write Python so I preferred to just cobble
things together from tools I already knew.

Should I ever become a YouTube Premium subscriber again I can continue to
use this workflow and watch the videos from the YouTube embeds that
Feedbin generates, or open the item in the YouTube app instead of NewPipe.

Lastly, what about desktop usage? When I'm on a real computer I read my RSS via
the Feedbin web app. It supports [custom sharing
integrations][feedbin-sharing]. In order to open a video on an Invidious
instance I need to rewrite a URL like:

<https://www.youtube.com/watch?v=u1wfCnRINkE>

to one like:

<https://invidious.perennialte.ch/watch?v=u1wfCnRINkE>

I can't do that directly with a Feedbin custom sharing service definition, but
it would be trivial to set up a little redirector application to do it. I even
published [a video on building a very similar thing][url-shortener] last year.
Alternatively I could install a [redirector browser
plugin](https://docs.invidious.io/redirector/), although that would require
setup on each of the computers and OS installs I use, so I prefer the former
option.

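The rewrite itself is mechanical: a hypothetical redirector only needs to swap
the host while keeping the path and query string, something like:

```python
from urllib.parse import urlsplit, urlunsplit

# Swap the host of a YouTube watch URL for an Invidious instance,
# keeping the scheme, path, and query string intact.
def to_invidious(url, host="invidious.perennialte.ch"):
    s = urlsplit(url)
    return urlunsplit((s.scheme, host, s.path, s.query, s.fragment))

print(to_invidious("https://www.youtube.com/watch?v=u1wfCnRINkE"))
# → https://invidious.perennialte.ch/watch?v=u1wfCnRINkE
```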
[url-shortener]: https://www.youtube.com/watch?v=d-tsfUVg4II
[Invidious]: https://invidious.io/
[Feedbin]: https://feedbin.com/
[scraper]: https://github.com/causal-agent/scraper
[repo]: https://forge.wezm.net/wezm/youtube-to-opml
[NewPipe]: https://newpipe.net/
[subscriptions]: https://www.youtube.com/feed/channels
[jaq]: https://github.com/01mf02/jaq
[jq]: https://jqlang.github.io/jq/
[FeedMe]: https://play.google.com/store/apps/details?id=com.seazon.feedme
[feedbin-sharing]: https://feedbin.com/help/sharing-read-it-later-services/
[OpenGraph]: https://ogp.me/