Add turning-one-hundred-tweets-into-a-blog-post post

2020-11-03 11:41:30 +11:00 · 2020-11-03 11:41:30 +11:00 · 8acfbbcba6
commit 8acfbbcba6
parent 59c8c21b14
2 changed files with 319 additions and 1 deletions
--- a/v2/content/posts/2020/turning-one-hundred-tweets-into-a-blog-post.md
+++ b/v2/content/posts/2020/turning-one-hundred-tweets-into-a-blog-post.md
@@ -0,0 +1,318 @@
++++
+title = "Turning One Hundred Tweets Into a Blog Post"
+date = 2020-11-03T11:40:00+11:00
+
+[extra]
+#updated = 2020-06-19T09:30:00+10:00
++++
+
+Near the conclusion of my [#100binaries] Twitter series I started working on
+[the blog post that contained all the tweets](@/posts/2020/100-rust-binaries/index.md).
+It ended up posing a number of interesting challenges and design decisions, and
+produced a couple of Rust binaries. Whilst I don't think the process was
+necessarily optimal I thought I'd share it to show my approach to
+solving the problem. Perhaps the tools used and approach taken will be
+interesting to others.
+
+<!-- more -->
+
+My initial plan was to use Twitter embeds. Given a tweet URL it's relatively
+easy to turn it into some HTML markup. By including Twitter's embed JavaScript
+on the page the markup turns into a rich Twitter embed. However there were a few
+things I didn't like about this option:
+
+* The page was going to end up massive, even split across a couple of pages,
+  because the Twitter JS was loading all the images for each tweet up front.
+* I didn't like relying on JavaScript for the page to render media.
+* I didn't really want to include Twitter's JavaScript (it's likely it would be
+  blocked by visitors with an ad blocker anyway).
+
+So I decided I'd render the content myself. I also decided that I'd host the
+original screenshots and videos instead of saving them from the tweets. This
+was relatively time consuming as they were across a couple of computers and
+not named well but I found them all in the end.
+
+To ensure the page wasn't enormous I used the [`loading="lazy"`][lazy-loading]
+attribute on images. This is a relatively new attribute that tells the browser
+to delay loading of images until they're within some threshold of the
+viewport. It currently works in Firefox and Chrome.
+
+I used `preload="none"` on videos to ensure video data was only loaded if the
+visitor attempted to play it.
+
+To prevent the blog post from being too long/heavy I split it across two pages.
+
+### Collecting All the Tweet URLs
+
+With the plan in mind the first step was getting the full list of tweets. For
+better or worse I decided to avoid using any of Twitter's APIs that require
+authentication. Instead I turned to [nitter] (an alternative Twitter
+front-end) for its simple markup and JS-free rendering.
+
+For each page of [search results for '#100binaries from:@wezm'][search] I ran
+the following in the JS Console in Firefox:
+
+```javascript
+tweets = []
+document.querySelectorAll('.tweet-date a').forEach(a => tweets.push(a.href))
+copy(tweets.join("\n"))
+```
+
+and pasted the result into [tweets.txt] in Neovim.
+
+When all pages had been processed I turned the nitter.net URLs into twitter.com URLs:
+`:%s/nitter\.net/twitter.com/`.
+
+This tells Neovim: for every line (`%`) substitute (`s`) `nitter.net` with `twitter.com`.
+
+### Turning Tweet URLs Into Tweet Content
+
+Now I needed to turn the tweet URLs into tweet content. In hindsight it may
+have been better to use [Twitter's GET statuses/show/:id][get-status] API to do
+this (possibly via [twurl]) but that is not what I did. Onwards!
+
+I used the unauthenticated [oEmbed API][oembed] to get some markup for each
+tweet. `xargs` was used to take a line from `tweets.txt` and make the API
+(HTTP) request with `curl`:
+
+```
+xargs -I '{url}' -a tweets.txt -n 1 curl https://api.twitter.com/1/statuses/oembed.json\?omit_script\=true\&dnt\=true\&lang\=en\&url\=\{url\} > tweets.json
+```
+
+This tells `xargs` to replace occurrences of `{url}` in the command with a line
+(`-n 1`) read from `tweets.txt` (`-a tweets.txt`).
+
+The result of one of these API requests is JSON like this (formatted with
+[`jq`][jq] for readability):
+
+```json
+{
+  "url": "https://twitter.com/wezm/status/1322855912076386304",
+  "author_name": "Wesley Moore",
+  "author_url": "https://twitter.com/wezm",
+  "html": "<blockquote class=\"twitter-tweet\" data-lang=\"en\" data-dnt=\"true\"><p lang=\"en\" dir=\"ltr\">Day 100 of <a href=\"https://twitter.com/hashtag/100binaries?src=hash&amp;ref_src=twsrc%5Etfw\">#100binaries</a><br><br>Today I&#39;m featuring the Rust compiler — the binary that made the previous 99 fast, efficient, user-friendly, easy-to-build, and reliable binaries possible.<br><br>Thanks to all the people that have worked on it past, present, and future. <a href=\"https://t.co/aBEdLE87eq\">https://t.co/aBEdLE87eq</a> <a href=\"https://t.co/jzyJtIMGn1\">pic.twitter.com/jzyJtIMGn1</a></p>&mdash; Wesley Moore (@wezm) <a href=\"https://twitter.com/wezm/status/1322855912076386304?ref_src=twsrc%5Etfw\">November 1, 2020</a></blockquote>\n",
+  "width": 550,
+  "height": null,
+  "type": "rich",
+  "cache_age": "3153600000",
+  "provider_name": "Twitter",
+  "provider_url": "https://twitter.com",
+  "version": "1.0"
+}
+```
+
+The output from `xargs` is lots of these JSON objects all concatenated
+together. I needed to turn [tweets.json] into an array of objects to make it
+valid JSON. I opened up the file in Neovim and:
+
+* Added commas between the JSON objects: `%s/}{/},\r{/g`.
+  * That is, substitute `}{` with `},`, a newline (`\r`), and `{`, for every match on a line (`/g`).
+* Added `[` and `]` to the start and end of the file.
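+
+As a rough illustration, the two manual edits above could also be scripted.
+The sketch below is hypothetical (the `concat_to_array` name is made up for
+this example) and naive: it assumes `}{` only ever appears between objects,
+never inside a JSON string.
+
+```rust
+/// Turn concatenated JSON objects (`{...}{...}`) into a JSON array.
+/// Naive: assumes `}{` only ever occurs between two objects.
+fn concat_to_array(input: &str) -> String {
+    // Insert a comma and newline between objects, then wrap in brackets.
+    format!("[{}]", input.trim().replace("}{", "},\n{"))
+}
+
+fn main() {
+    let raw = r#"{"url":"a"}{"url":"b"}"#;
+    println!("{}", concat_to_array(raw));
+}
+```
+
+For input that might contain `}{` inside strings, a real JSON parser would be
+the safer choice.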
+
+I then reversed the order of the objects and formatted the document with `jq` (from within Neovim): `%!jq '.|reverse' -`.
+
+This filters the whole file through a command (`%!`). The command is `jq` and it
+filters the entire document `.`, read from stdin (`-`), through the `reverse`
+filter to reverse the order of the array. `jq` automatically pretty prints.
+
+It would have been better to have reversed `tweets.txt` but I didn't
+realise they were in reverse chronological ordering until this point and
+doing it this way avoided making another 100 HTTP requests.
+
+### Rendering tweets.json
+
+I created a custom [Zola shortcode][shortcode], [tweet_list] that reads
+`tweets.json` and renders each item in an ordered list. It evolved over time as
+I kept adding more information to the JSON file. It allowed me to see how
+the blog post looked as I implemented the following improvements.
+
+### Expanding t.co Links
+
+{% aside(title="You used Rust for this!?", float="right") %}
+This is the sort of thing that would be well suited to a scripting language
+too. These days I tend to reach for Rust, even for little tasks like this.
+It's what I'm most familiar with nowadays and I can mostly write a "script"
+like this off the cuff with little need to refer to API docs.
+{% end %}
+
+The markup Twitter returns is full of `t.co` redirect links. I wanted to avoid
+sending my visitors through the Twitter redirect so I needed to expand these
+links to their target. I whipped up a little Rust program to do this:
+[expand-t-co]. It finds all `t.co` links with a regex
+(`https://t\.co/[a-zA-Z0-9]+`) and replaces each occurrence with the target
+of the link.
+
+The target URL is determined by making an HTTP HEAD request for the
+`t.co` URL and noting the value of the `Location` header. The tool
+caches the result in a `HashMap` to avoid repeating a request for
+the same `t.co` URL if it's encountered again.
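+
+The approach can be sketched with only the standard library. This is not the
+code from [expand-t-co]: `expand_links` and the `resolve` closure are
+hypothetical names, and the closure stands in for the HTTP HEAD request so the
+example runs without network access. The `t.co` IDs are found with a
+hand-rolled scan instead of the regex.
+
+```rust
+use std::collections::HashMap;
+
+/// Replace every `https://t.co/<id>` in `text` with its target.
+/// `resolve` stands in for the HTTP HEAD request; results are cached.
+fn expand_links<F>(text: &str, cache: &mut HashMap<String, String>, mut resolve: F) -> String
+where
+    F: FnMut(&str) -> String,
+{
+    const PREFIX: &str = "https://t.co/";
+    let mut out = String::new();
+    let mut rest = text;
+    while let Some(pos) = rest.find(PREFIX) {
+        out.push_str(&rest[..pos]);
+        let after = &rest[pos + PREFIX.len()..];
+        // The link ID is the run of ASCII alphanumerics after the prefix.
+        let id_len = after
+            .find(|c: char| !c.is_ascii_alphanumeric())
+            .unwrap_or(after.len());
+        let link = &rest[pos..pos + PREFIX.len() + id_len];
+        if id_len == 0 {
+            // A bare prefix with no ID is left untouched.
+            out.push_str(link);
+        } else {
+            // Only call `resolve` the first time a link is seen.
+            let target = cache
+                .entry(link.to_string())
+                .or_insert_with(|| resolve(link));
+            out.push_str(target);
+        }
+        rest = &rest[pos + PREFIX.len() + id_len..];
+    }
+    out.push_str(rest);
+    out
+}
+
+fn main() {
+    let mut cache = HashMap::new();
+    let mut requests = 0;
+    let html = "<a href=\"https://t.co/aBEdLE87eq\">https://t.co/aBEdLE87eq</a>";
+    let expanded = expand_links(html, &mut cache, |_link| {
+        requests += 1;
+        "https://example.com/target".to_string() // placeholder target
+    });
+    println!("{expanded} ({requests} request)");
+}
+```
+
+The same link appears twice in the sample markup, so the second occurrence is
+served from the cache and the stand-in resolver only runs once.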
+
+I used the [ureq] crate to make the HTTP requests. Arguably it would have been
+better to use an async client so that more requests were made in parallel but
+that was added complexity I didn't want to deal with for a mostly one-off
+program.
+
+### Adding the Media
+
+At this point I did a lot of manual work to find all the screenshots and videos
+that I shared in the tweets and [added them to my blog][media-files]. I also
+renamed them after the tool they depicted. As part of this process I noted the
+source of media files that I didn't create in a `"media_source"` key in
+`tweets.json` so that I could attribute them. I also added a `"media"` key with
+the name of the media file for each binary.
+
+Some of the externally sourced images were animated GIFs, which lack
+playback controls and are very inefficient in terms of file size. Whenever I encountered an
+animated GIF I converted it to an MP4 with `ffmpeg`, resulting in large space savings:
+
+```
+ffmpeg -i ~/Downloads/so.gif -movflags faststart -pix_fmt yuv420p -vf "scale=trunc(iw/2)*2:trunc(ih/2)*2" so.mp4
+```
+
+This converts `so.gif` to `so.mp4` and ensures the dimensions are divisible
+by 2, which is apparently a requirement of H.264 streams encapsulated in MP4. I
+worked out how to do this from: <https://unix.stackexchange.com/a/294892/5444>
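+
+The `scale` expression is plain integer rounding. A small sketch of the same
+arithmetic, using a hypothetical `round_down_to_even` helper:
+
+```rust
+/// Round a dimension down to the nearest even number,
+/// the integer equivalent of ffmpeg's `trunc(iw/2)*2`.
+fn round_down_to_even(pixels: u32) -> u32 {
+    (pixels / 2) * 2
+}
+
+fn main() {
+    for (w, h) in [(641, 480), (1280, 721)] {
+        println!("{w}x{h} -> {}x{}", round_down_to_even(w), round_down_to_even(h));
+    }
+}
+```
+
+Odd dimensions lose one pixel and even ones are unchanged, so a 641x480 GIF
+would be encoded at 640x480.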
+
+I also wanted to know the media dimensions for each file so that I could have them
+scaled properly on the page — most images are HiDPI and need to be presented at
+half their pixel width to appear the right size.
+
+For this I used `ffprobe`, which is part of `ffmpeg`. I originally planned to
+use another tool to handle images (as opposed to videos) but it turns out
+`ffprobe` handles them too.
+
+Since I wanted to update the values of JSON objects in `tweets.json` I opted to
+parse the JSON this time. Again I whipped up a little Rust "script":
+[add-media-dimensions]. It parses `tweets.json` and for each object in the
+array runs `ffprobe` on the media file, like this:
+
+```
+ffprobe -v quiet -print_format json -show_format -show_streams file.mp4
+```
+
+I learned how to do this from: <https://stackoverflow.com/a/11236144/38820>
+
+With this invocation `ffprobe` produces JSON so `add-media-dimensions` also
+parses that and adds the width and height values to `tweets.json`. At the end
+the updated JSON document is printed to stdout. This turned out to be a handy
+sanity check as it detected a couple of copy/paste errors and typos in the
+manually added `"media"` values.
+
+### Cleaning Up pic.twitter.com Links
+
+The oEmbed markup that Twitter returns includes links for each piece of media. Now that
+I was handling the media myself these could be deleted. I used Neovim for this:
+
+```
+:%s/ <a href=\\"https:\/\/twitter\.com[^"]\+\(photo\|video\)[^"]\+">pic.twitter.com[^<]\+<\/a>//
+```
+
+For each line of the file (`%`) substitute (`s`) matches with nothing. And that
+took care of them. Yes I'm matching HTML with a regex, no you shouldn't do this
+for something that's part of a program. For one-off text editing it's fine
+though, especially since you can eyeball the differences with `git diff`, or in
+my case `tig status`.
+
+### Adding a HiDPI Flag
+
+I initially tried using a heuristic in `tweet_list` to determine if a media
+file was HiDPI or not but there were a few exceptions to the rule. I decided to
+add a `"hidpi"` value to the JSON to indicate if it was HiDPI media or not. A
+bit of trial and error with [jq] led to this:
+
+```
+jq 'map(. + if .width > 776 then {hidpi: true} else {hidpi:false} end)' tweets.json > tweets-hidpi.json
+```
+
+If the image is greater than 776 pixels wide then set the `hidpi` property to
+`true`, otherwise `false`. 776 was picked via visual inspection of the rendered
+page. Once satisfied with the threshold I examined the rendered page and flipped
+the `hidpi` value on some items where the heuristic was wrong.
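+
+Expressed outside of `jq`, the heuristic is a single comparison. In this
+sketch `is_hidpi` and `display_width` are hypothetical names, and halving the
+width for HiDPI media is an assumption about how a template might use the
+flag, not a copy of `tweet_list`:
+
+```rust
+/// Threshold from the jq filter; anything wider is treated as HiDPI.
+const HIDPI_THRESHOLD: u32 = 776;
+
+fn is_hidpi(width: u32) -> bool {
+    width > HIDPI_THRESHOLD
+}
+
+/// Assumption: HiDPI media is presented at half its pixel width.
+fn display_width(width: u32, hidpi: bool) -> u32 {
+    if hidpi { width / 2 } else { width }
+}
+
+fn main() {
+    for width in [600, 776, 1552] {
+        let hidpi = is_hidpi(width);
+        println!("{width}px wide: hidpi={hidpi}, shown at {}px", display_width(width, hidpi));
+    }
+}
+```
+
+Keeping the flag as stored data rather than computing it at render time is
+what makes the manual overrides possible.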
+
+### Adding alt Text
+
+[Di], ever my good conscience when it comes to such things, enquired at one
+point if I'd added `alt` text to the images. I was on the fence since the
+images were mostly there to show what the tools looked like — I didn't think
+they were really essential content — but she made a good argument for including
+some `alt` text even if it was fairly simplistic.
+
+I turned to `jq` again to add a basic `"media_description"` to the JSON,
+which `tweet_list` would include as `alt` text:
+
+```
+jq 'map(. + {media_description: ("Screenshot of " + (.media // "????" | sub(".(png|gif|mp4|jpg)$"; "")) + " running in a terminal.")})' tweets.json > tweets-alt.json
+```
+
+For each object in the JSON array it adds a `media_description` key with a
+value derived from the `media` key (the file name with the extension removed).
+If the object doesn't have a `media` value then it defaults to "????"
+(`.media // "????"`).
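+
+For comparison, the same derivation can be sketched in Rust. The
+`media_description` function is hypothetical, and it is slightly stricter
+than the `jq` regex: it only strips an extension preceded by a literal dot,
+whereas the unescaped `.` in the regex matches any character.
+
+```rust
+/// Build a default description from an optional media file name:
+/// strip a known extension, falling back to "????" when there is no name.
+fn media_description(media: Option<&str>) -> String {
+    let name = media.unwrap_or("????");
+    let stem = [".png", ".gif", ".mp4", ".jpg"]
+        .iter()
+        .find_map(|&ext| name.strip_suffix(ext))
+        .unwrap_or(name);
+    format!("Screenshot of {stem} running in a terminal.")
+}
+
+fn main() {
+    println!("{}", media_description(Some("so.mp4")));
+    println!("{}", media_description(None));
+}
+```
+
+The "????" placeholder makes missing `media` values easy to spot in the
+rendered page.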
+
+After these initial descriptions were added I went through the rendered page and
+updated the text of items where the description was incorrect or inadequate.
+
+### Video Poster Images
+
+As it stood all the videos were just white boxes with playback controls since I
+had used `preload="none"` to limit the data usage of the page. I decided to pay
+the cost of the larger page weight and add poster images to each of the videos.
+I used `ffmpeg` to extract the first frame of each video as a PNG:
+
+```
+for m in *.mp4; do ffmpeg -i $m -vf "select=1" -vframes 1 $m.png; done
+```
+
+I learned how to do this from: <https://superuser.com/a/1010108>
+
+I then converted the PNGs to JPEGs for smaller files. I could have generated
+JPEGs directly from `ffmpeg` but I didn't know how to control the quality — I
+wanted a relatively low quality for smaller files.
+
+```
+for f in *.mp4.png; do convert "$f" -quality 60 $f.jpg ; done
+```
+
+This produced files named `filename.mp4.png.jpg`. I'm yet to memorise how to
+manipulate file extensions in `zsh`, despite having [been told how to do
+it][zsh-ext], so I did a follow-up step to rename them:
+
+```
+for f in *.mp4; do mv $f.png.jpg $f.jpg ; done
+```
+
+### Wrapping Up
+
+Lastly I ran [`pngcrush`][pngcrush] on all of the PNGs. It reliably reduces the file size
+in a lossless manner:
+
+```
+for f in *.png; do pngcrush -reduce -ow $f; done
+```
+
+With that I did some styling tweaks, added a little commentary and published
+[the page](@/posts/2020/100-rust-binaries/index.md).
+
+If you made it this far, thanks for sticking with it to the end. I'm not sure
+how interesting or useful this post is but if you liked it let me know and I
+might do more like it in the future.
+
+[#100binaries]: https://twitter.com/search?q=%23100binaries%20from%3A%40wezm&src=typed_query&f=live
+[nitter]: https://nitter.net/about
+[search]: https://nitter.net/search?f=tweets&q=%23100binaries+from%3A%40wezm
+[tweets.txt]: https://github.com/wezm/wezm.net/blob/master/v2/content/posts/2020/100-rust-binaries/tweets.txt
+[tweets.json]: https://github.com/wezm/wezm.net/blob/master/v2/content/posts/2020/100-rust-binaries/tweets.json
+[get-status]: https://developer.twitter.com/en/docs/twitter-api/v1/tweets/post-and-engage/api-reference/get-statuses-show-id
+[twurl]: https://github.com/twitter/twurl
+[oembed]: https://developer.twitter.com/en/docs/twitter-api/v1/tweets/post-and-engage/api-reference/get-statuses-oembed
+[jq]: https://stedolan.github.io/jq/
+[expand-t-co]: https://github.com/wezm/expand-t-co
+[media-files]: https://github.com/wezm/wezm.net/tree/master/v2/content/posts/2020/100-rust-binaries
+[add-media-dimensions]: https://github.com/wezm/add-media-dimensions
+[shortcode]: https://www.getzola.org/documentation/content/shortcodes/
+[tweet_list]: https://github.com/wezm/wezm.net/blob/master/v2/templates/shortcodes/tweet_list.html
+[Di]: https://didoesdigital.com/
+[zsh-ext]: https://twitter.com/Sasha_Boyd/status/1300666988608454656
+[ureq]: https://github.com/algesten/ureq
+[lazy-loading]: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/img#attr-loading
+[pngcrush]: https://pmt.sourceforge.io/pngcrush/index.html
--- a/v2/sass/screen.scss
+++ b/v2/sass/screen.scss
@@ -31,7 +31,7 @@ body.home {
 pre, code {
  font-family: "Pragmata Pro", "Pragmata Pro Mono", "JetBrains Mono", "Iosevka", "Consolas", monospace;
 }
-code {
+:not(pre) > code {
  background-color: #ffedf0;
  padding: 0.1em 0.2em;
  font-size: 16px;