
Add turning-one-hundred-tweets-into-a-blog-post post

Wesley Moore 2020-11-03 11:41:30 +11:00
parent 59c8c21b14
commit 8acfbbcba6
2 changed files with 319 additions and 1 deletions


@@ -0,0 +1,318 @@
+++
title = "Turning One Hundred Tweets Into a Blog Post"
date = 2020-11-03T11:40:00+11:00
[extra]
#updated = 2020-06-19T09:30:00+10:00
+++
Near the conclusion of my [#100binaries] Twitter series I started working on
[the blog post that contained all the tweets](@/posts/2020/100-rust-binaries/index.md).
It ended up posing a number of interesting challenges and design decisions, and
spawned a couple of Rust binaries along the way. Whilst I don't think the
process was necessarily optimal, I thought I'd share it to show my approach to
solving the problem. Perhaps the tools used and the approach taken will be
interesting to others.
<!-- more -->
My initial plan was to use Twitter embeds. Given a tweet URL it's relatively
easy to turn it into some HTML markup. By including Twitter's embed JavaScript
on the page the markup turns into a rich Twitter embed. However, there were a few
things I didn't like about this option:
* The page was going to end up massive, even split across a couple of pages,
because the Twitter JS was loading all the images for each tweet up front.
* I didn't like relying on JavaScript for the page to render media.
* I didn't really want to include Twitter's JavaScript (it's likely it would be
blocked by visitors with an ad blocker anyway).
So I decided I'd render the content myself. I also decided that I'd host the
original screenshots and videos instead of saving them from the tweets. This
was relatively time-consuming as they were spread across a couple of computers
and not named well, but I found them all in the end.
To ensure the page wasn't enormous I used the [`loading="lazy"`][lazy-loading]
attribute on images. This is a relatively new attribute that tells the browser
to delay loading of images until they're within some threshold of the
viewport. It currently works in Firefox and Chrome.
I used `preload="none"` on videos to ensure video data was only loaded if the
visitor attempted to play it.
To prevent the blog post from being too long/heavy I split it across two pages.
### Collecting All the Tweet URLs
With the plan in mind the first step was getting the full list of tweets. For
better or worse I decided to avoid using any of Twitter's APIs that require
authentication. Instead I turned to [nitter] (an alternative Twitter
front-end) for its simple markup and JS free rendering.
For each page of [search results for '#100binaries from:@wezm'][search] I ran
the following in the JS Console in Firefox:
```javascript
tweets = []
document.querySelectorAll('.tweet-date a').forEach(a => tweets.push(a.href))
copy(tweets.join("\n"))
```
and pasted the result into [tweets.txt] in Neovim.
When all pages had been processed I turned the nitter.net URLs into twitter.com URLs:
`:%s/nitter\.net/twitter.com/`.
This tells Neovim: for every line (`%`) substitute (`s`) `nitter.net` with `twitter.com`.
### Turning Tweet URLs Into Tweet Content
Now I needed to turn the tweet URLs into tweet content. In hindsight it may
have been better to use [Twitter's GET statuses/show/:id][get-status] API to do
this (possibly via [twurl]) but that is not what I did. Onwards!
I used the unauthenticated [oEmbed API][oembed] to get some markup for each
tweet. `xargs` was used to take each line from `tweets.txt` and make the API
(HTTP) request with `curl`:
```
xargs -I '{url}' -a tweets.txt -n 1 curl https://api.twitter.com/1/statuses/oembed.json\?omit_script\=true\&dnt\=true\&lang\=en\&url\=\{url\} > tweets.json
```
This tells `xargs` to replace occurrences of `{url}` in the command with a line
(`-n 1`) read from `tweets.txt` (`-a tweets.txt`).
The result of one of these API requests is JSON like this (formatted with
[`jq`][jq] for readability):
```json
{
"url": "https://twitter.com/wezm/status/1322855912076386304",
"author_name": "Wesley Moore",
"author_url": "https://twitter.com/wezm",
"html": "<blockquote class=\"twitter-tweet\" data-lang=\"en\" data-dnt=\"true\"><p lang=\"en\" dir=\"ltr\">Day 100 of <a href=\"https://twitter.com/hashtag/100binaries?src=hash&amp;ref_src=twsrc%5Etfw\">#100binaries</a><br><br>Today I&#39;m featuring the Rust compiler — the binary that made the previous 99 fast, efficient, user-friendly, easy-to-build, and reliable binaries possible.<br><br>Thanks to all the people that have worked on it past, present, and future. <a href=\"https://t.co/aBEdLE87eq\">https://t.co/aBEdLE87eq</a> <a href=\"https://t.co/jzyJtIMGn1\">pic.twitter.com/jzyJtIMGn1</a></p>&mdash; Wesley Moore (@wezm) <a href=\"https://twitter.com/wezm/status/1322855912076386304?ref_src=twsrc%5Etfw\">November 1, 2020</a></blockquote>\n",
"width": 550,
"height": null,
"type": "rich",
"cache_age": "3153600000",
"provider_name": "Twitter",
"provider_url": "https://twitter.com",
"version": "1.0"
}
```
The output from `xargs` is lots of these JSON objects all concatenated
together. I needed to turn [tweets.json] into an array of objects to make it
valid JSON. I opened up the file in Neovim and:
* Added commas between the JSON objects: `%s/}{/},\r{/g`.
  * That is, substitute `}{` with `},` followed by a newline (`\r`) and `{`, as many times as it occurs (`/g`).
* Added `[` and `]` to the start and end of the file.
I then reversed the order of the objects and formatted the document with `jq` (from within Neovim): `%!jq '.|reverse' -`.
This filters the whole file through a command (`%!`). The command is `jq` and it
filters the entire document (`.`), read from stdin (`-`), through the `reverse`
filter to reverse the order of the array. `jq` automatically pretty-prints.
It would have been better to have reversed `tweets.txt` but I didn't
realise they were in reverse chronological ordering until this point and
doing it this way avoided making another 100 HTTP requests.
### Rendering tweets.json
I created a custom [Zola shortcode][shortcode], [tweet_list] that reads
`tweets.json` and renders each item in an ordered list. It evolved over time as
I kept adding more information to the JSON file. It allowed me to see how
the blog post looked as I implemented the following improvements.
### Expanding t.co Links
{% aside(title="You used Rust for this!?", float="right") %}
This is the sort of thing that would be well suited to a scripting language
too. These days I tend to reach for Rust, even for little tasks like this.
It's what I'm most familiar with nowadays and I can mostly write a "script"
like this off the cuff with little need to refer to API docs.
{% end %}
The markup Twitter returns is full of `t.co` redirect links. I wanted to avoid
sending my visitors through the Twitter redirect so I needed to expand these
links to their target. I whipped up a little Rust program to do this:
[expand-t-co]. It finds all `t.co` links with a regex
(`https://t\.co/[a-zA-Z0-9]+`) and replaces each occurrence with the target
of the link.
The target URL is determined by making an HTTP HEAD request for the
`t.co` URL and noting the value of the `Location` header. The tool
caches the result in a `HashMap` to avoid repeating a request for
the same `t.co` URL if it's encountered again.
I used the [ureq] crate to make the HTTP requests. Arguably it would have been
better to use an async client so that more requests were made in parallel but
that was added complexity I didn't want to deal with for a mostly one-off
program.
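The core of the program boils down to something like this (a from-memory
sketch rather than the actual [expand-t-co] source, assuming the `regex` crate
and ureq 2):

```rust
use std::collections::HashMap;

use regex::{Captures, Regex};

/// Replace every t.co link in `html` with the URL it redirects to.
/// Sketch only: the real tool also handles errors and file I/O.
fn expand_t_co_links(html: &str) -> String {
    let re = Regex::new(r"https://t\.co/[a-zA-Z0-9]+").unwrap();
    // Don't follow redirects automatically; we want to read the
    // Location header ourselves.
    let agent = ureq::AgentBuilder::new().redirects(0).build();
    // Cache targets so each t.co URL is only requested once.
    let mut cache: HashMap<String, String> = HashMap::new();

    re.replace_all(html, |caps: &Captures| {
        let short = caps[0].to_string();
        cache
            .entry(short.clone())
            .or_insert_with(|| {
                // HEAD request; the target is in the Location header.
                agent
                    .head(&short)
                    .call()
                    .ok()
                    .and_then(|resp| resp.header("Location").map(str::to_string))
                    .unwrap_or(short)
            })
            .clone()
    })
    .into_owned()
}
```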
### Adding the Media
At this point I did a lot of manual work to find all the screenshots and videos
that I shared in the tweets and [added them to my blog][media-files]. I also
renamed them after the tool they depicted. As part of this process I noted the
source of media files that I didn't create in a `"media_source"` key in
`tweets.json` so that I could attribute them. I also added a `"media"` key with
the name of the media file for each binary.
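After this step each entry in `tweets.json` carried the original oEmbed fields
plus the new keys, roughly like this (abridged; the file name and source URL
here are illustrative, and `"media_source"` is only present on externally
sourced media):

```json
{
  "url": "https://twitter.com/wezm/status/…",
  "html": "<blockquote class=\"twitter-tweet\" …>…</blockquote>",
  "media": "some-tool.png",
  "media_source": "https://example.com/original-source"
}
```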
Some of the externally sourced images were animated GIFs, which lack
playback controls and are very inefficient in terms of file size. Whenever I
encountered an animated GIF I converted it to an MP4 with `ffmpeg`, resulting in
large space savings:
```
ffmpeg -i ~/Downloads/so.gif -movflags faststart -pix_fmt yuv420p -vf "scale=trunc(iw/2)*2:trunc(ih/2)*2" so.mp4
```
This converts `so.gif` to `so.mp4` and ensures the dimensions are divisible
by 2, which is apparently a requirement of H.264 streams encapsulated in MP4. I
worked out how to do this from: <https://unix.stackexchange.com/a/294892/5444>
I also wanted to know the media dimensions for each file so that I could have them
scaled properly on the page — most images are HiDPI and need to be presented at
half their pixel width to appear the right size.
For this I used `ffprobe`, which is part of `ffmpeg`. I originally planned to
use another tool to handle images (as opposed to videos) but it turns out
`ffprobe` handles them too.
Since I wanted to update the values of JSON objects in `tweets.json` I opted to
parse the JSON this time. Again I whipped up a little Rust "script":
[add-media-dimensions]. It parses `tweets.json` and for each object in the
array runs `ffprobe` on the media file, like this:
```
ffprobe -v quiet -print_format json -show_format -show_streams file.mp4
```
I learned how to do this from: <https://stackoverflow.com/a/11236144/38820>
With this invocation `ffprobe` produces JSON so `add-media-dimensions` also
parses that and adds the width and height values to `tweets.json`. At the end
the updated JSON document is printed to stdout. This turned out to be a handy
sanity check as it detected a couple of copy/paste errors and typos in the
manually added `"media"` values.
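The heart of it reduces to something like the function below (again a
simplified sketch of [add-media-dimensions], assuming the `serde_json` crate):

```rust
use std::process::Command;

use serde_json::Value;

/// Ask ffprobe for the dimensions of an image or video file.
/// Sketch only: the real tool merges the result back into tweets.json.
fn media_dimensions(path: &str) -> Option<(u64, u64)> {
    let output = Command::new("ffprobe")
        .args(&[
            "-v", "quiet", "-print_format", "json",
            "-show_format", "-show_streams", path,
        ])
        .output()
        .ok()?;
    let probe: Value = serde_json::from_slice(&output.stdout).ok()?;
    // The dimensions are reported on the media stream.
    probe["streams"]
        .as_array()?
        .iter()
        .find_map(|s| Some((s["width"].as_u64()?, s["height"].as_u64()?)))
}
```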
### Cleaning Up pic.twitter.com Links
The oEmbed markup that Twitter returns includes links for each piece of media. Since
I was now handling the media myself these could be deleted. I used Neovim for this:
```
:%s/ <a href=\\"https:\/\/twitter\.com[^"]\+\(photo\|video\)[^"]\+">pic.twitter.com[^<]\+<\/a>//
```
For each line of the file (`%`) substitute (`s`) matches with nothing. And that
took care of them. Yes, I'm matching HTML with a regex; no, you shouldn't do this
for something that's part of a program. For one-off text editing it's fine
though, especially since you can eyeball the differences with `git diff`, or in
my case `tig status`.
### Adding a HiDPI Flag
I initially tried using a heuristic in `tweet_list` to determine whether a media
file was HiDPI but there were a few exceptions to the rule. So I decided to
add a `"hidpi"` value to the JSON to indicate whether or not it was HiDPI media. A
bit of trial and error with [jq] led to this:
```
jq 'map(. + if .width > 776 then {hidpi: true} else {hidpi:false} end)' tweets.json > tweets-hidpi.json
```
If the image is greater than 776 pixels wide then set the `hidpi` property to
`true`, otherwise `false`. 776 was picked via visual inspection of the rendered
page. Once satisfied with the heuristic I examined the rendered result and flipped
the `hidpi` value on some items where it was wrong.
### Adding alt Text
[Di], ever my good conscience when it comes to such things, enquired at one
point whether I'd added `alt` text to the images. I was on the fence since the
images were mostly there to show what the tools looked like — I didn't think
they were really essential content — but she made a good argument for including
some `alt` text, even if it was fairly simplistic.
I turned to `jq` again to add a basic `"media_description"` to the JSON,
which `tweet_list` would include as `alt` text:
```
jq 'map(. + {media_description: ("Screenshot of " + (.media // "????" | sub(".(png|gif|mp4|jpg)$"; "")) + " running in a terminal.")})' tweets.json > tweets-alt.json
```
For each object in the JSON array it adds a `media_description` key with a
value derived from the `media` key (the file name with the extension removed).
If the object doesn't have a `media` value then it is defaulted to "????"
(`.media // "????"`).
After these initial descriptions were added I went through the rendered page and
updated the text of items where the description was incorrect or inadequate.
### Video Poster Images
As it stood all the videos were just white boxes with playback controls, since I
had used `preload="none"` to limit the data usage of the page. I decided to pay
the cost of the larger page weight and add poster images to each of the videos.
I used `ffmpeg` to extract the first frame of each video as a PNG:
```
for m in *.mp4; do ffmpeg -i $m -vf "select=1" -vframes 1 $m.png; done
```
I learned how to do this from: <https://superuser.com/a/1010108>
I then converted the PNGs to JPEGs for smaller files. I could have generated
JPEGs directly from `ffmpeg` but I didn't know how to control the quality — I
wanted a relatively low quality to keep the files small.
```
for f in *.mp4.png; do convert "$f" -quality 60 $f.jpg ; done
```
This produced files named `filename.mp4.png.jpg`. I'm yet to memorise how to
manipulate file extensions in `zsh`, despite having [been told how to do
it][zsh-ext], so I did a follow-up step to rename them:
```
for f in *.mp4; do mv $f.png.jpg $f.jpg ; done
```
### Wrapping Up
Lastly I ran [`pngcrush`][pngcrush] on all of the PNGs. It reliably reduces the file size
in a lossless manner:
```
for f in *.png; do pngcrush -reduce -ow $f; done
```
With that I did some styling tweaks, added a little commentary and published
[the page](@/posts/2020/100-rust-binaries/index.md).
If you made it this far, thanks for sticking with it to the end. I'm not sure
how interesting or useful this post is but if you liked it let me know and I
might do more like it in the future.
[#100binaries]: https://twitter.com/search?q=%23100binaries%20from%3A%40wezm&src=typed_query&f=live
[nitter]: https://nitter.net/about
[search]: https://nitter.net/search?f=tweets&q=%23100binaries+from%3A%40wezm
[tweets.txt]: https://github.com/wezm/wezm.net/blob/master/v2/content/posts/2020/100-rust-binaries/tweets.txt
[tweets.json]: https://github.com/wezm/wezm.net/blob/master/v2/content/posts/2020/100-rust-binaries/tweets.json
[get-status]: https://developer.twitter.com/en/docs/twitter-api/v1/tweets/post-and-engage/api-reference/get-statuses-show-id
[twurl]: https://github.com/twitter/twurl
[oembed]: https://developer.twitter.com/en/docs/twitter-api/v1/tweets/post-and-engage/api-reference/get-statuses-oembed
[jq]: https://stedolan.github.io/jq/
[expand-t-co]: https://github.com/wezm/expand-t-co
[media-files]: https://github.com/wezm/wezm.net/tree/master/v2/content/posts/2020/100-rust-binaries
[add-media-dimensions]: https://github.com/wezm/add-media-dimensions
[shortcode]: https://www.getzola.org/documentation/content/shortcodes/
[tweet_list]: https://github.com/wezm/wezm.net/blob/master/v2/templates/shortcodes/tweet_list.html
[Di]: https://didoesdigital.com/
[zsh-ext]: https://twitter.com/Sasha_Boyd/status/1300666988608454656
[ureq]: https://github.com/algesten/ureq
[lazy-loading]: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/img#attr-loading
[pngcrush]: https://pmt.sourceforge.io/pngcrush/index.html


@@ -31,7 +31,7 @@ body.home {
 pre, code {
   font-family: "Pragmata Pro", "Pragmata Pro Mono", "JetBrains Mono", "Iosevka", "Consolas", monospace;
 }
-code {
+:not(pre) > code {
   background-color: #ffedf0;
   padding: 0.1em 0.2em;
   font-size: 16px;