+++ title = "Turning One Hundred Tweets Into a Blog Post" date = 2020-11-03T11:40:00+11:00 [extra] #updated = 2020-06-19T09:30:00+10:00 +++ Near the conclusion of my [#100binaries] Twitter series I started working on [the blog post that contained all the tweets](@/posts/2020/100-rust-binaries/index.md). It ended up posing a number of interesting challenges and design decisions, as well as a couple of Rust binaries. Whilst I don't think the process was necessary optimal I thought I'd share the process to show my approach to solving the problem. Perhaps the tools used and approach taken is interesting to others. My initial plan was to use Twitter embeds. Given a tweet URL it's relatively easy to turn it into some HTML markup. By including Twitter's embed JavaScript on the page the markup turns into rich Twitter embed. However there were a few things I didn't like about this option: * The page was going to end up massive, even split across a couple of pages because the Twitter JS was loading all the images for each tweet up front. * I didn't like relying on JavaScript for the page to render media. * I didn't really want to include Twitter's JavaScript (it's likely it would be blocked by visitors with an ad blocker anyway). So I decided I'd render the content myself. I also decided that I'd host the original screenshots and videos instead of saving them from the tweets. This was relatively time consuming as they were across a couple of computers and not named well but I found them all in the end. To ensure the page wasn't enormous I used the [`loading="lazy"`][lazy-loading] attribute on images. This is a relatively new attribute that tells the browser to delay loading of images until they're within some threshold of the view port. It currently works in Firefox and Chrome. I used `preload="none"` on videos to ensure video data was only loaded if the visitor attempted to play it. To prevent the blog post from being too long/heavy I split it across two pages. ### Collecting All the Tweet URLs With the plan in mind the first step was getting the full list of tweets. For better or worse I decided to avoid using any of Twitter's APIs that require authentication. Instead I turned to [nitter] (an alternative Twitter front-end) for its simple markup and JS free rendering. For each page of [search results for '#100binaries from:@wezm'][search] I ran the following in the JS Console in Firefox: ```javascript tweets = [] document.querySelectorAll('.tweet-date a').forEach(a => tweets.push(a.href)) copy(tweets.join("\n")) ``` and pasted the result into [tweets.txt] in Neovim. When all pages had be processed I turned the nitter.net URLs in to twitter.com URLs: `:%s/nitter\.net/twitter.com/`. This tells Neovim: for every line (`%`) substitute (`s`) `nitter.net` with `twitter.com`. ### Turning Tweet URLs Into Tweet Content Now I needed to turn the tweet URLs into tweet content. In hindsight it may have been better to use [Twitter's GET statuses/show/:id][get-status] API to do this (possibly via [twurl]) but that is not what I did. Onwards! I used the unauthenticated [oEmbed API][oembed] to get some markup for each tweet. `xargs` was used to take a line from `tweets.txt` and make the API (HTTP) request with `curl`] ``` xargs -I '{url}' -a tweets.txt -n 1 curl https://api.twitter.com/1/statuses/oembed.json\?omit_script\=true\&dnt\=true\&lang\=en\&url\=\{url\} > tweets.json ``` This tells `xargs` to replace occurrences of `{url}` in the command with a line (`-n 1`) read from `tweets.txt` (`-a tweets.txt`). 
The result of one of these API requests is JSON like this (formatted with [`jq`][jq] for readability):

```json
{
  "url": "https://twitter.com/wezm/status/1322855912076386304",
  "author_name": "Wesley Moore",
  "author_url": "https://twitter.com/wezm",
  "html": "<blockquote class=\"twitter-tweet\" data-dnt=\"true\"><p lang=\"en\" dir=\"ltr\">Day 100 of <a href=\"https://twitter.com/hashtag/100binaries?src=hash&amp;ref_src=twsrc%5Etfw\">#100binaries</a><br><br>Today I&#39;m featuring the Rust compiler — the binary that made the previous 99 fast, efficient, user-friendly, easy-to-build, and reliable binaries possible.<br><br>Thanks to all the people that have worked on it past, present, and future. <a href=\"https://t.co/aBEdLE87eq\">https://t.co/aBEdLE87eq</a> <a href=\"https://t.co/jzyJtIMGn1\">pic.twitter.com/jzyJtIMGn1</a></p>&mdash; Wesley Moore (@wezm) <a href=\"https://twitter.com/wezm/status/1322855912076386304?ref_src=twsrc%5Etfw\">November 1, 2020</a></blockquote>\n",
\n", "width": 550, "height": null, "type": "rich", "cache_age": "3153600000", "provider_name": "Twitter", "provider_url": "https://twitter.com", "version": "1.0" } ``` The output from `xargs` is lots of these JSON objects all concatenated together. I needed to turn [tweets.json] into an array of objects to make it valid JSON. I opened up the file in Neovim and: * Added commas between the JSON objects: `%s/}{/},\r{/g`. * This is, substitute `}{` with `},{` and a newline (`\r`), multiple times (`/g`). * Added `[` and `]` to start and end of the file. I then reversed the order of the objects and formatted the document with `jq` (from within Neovim): `%!jq '.|reverse' -`. This filters the whole file though a command (`%!`). The command is `jq` and it filters the entire document `.`, read from stdin (`-`), through the `reverse` filter to reverse the order of the array. `jq` automatically pretty prints. It would have been better to have reversed `tweets.txt` but I didn't realise they were in reverse chronological ordering until this point and doing it this way avoided making another 100 HTTP requests. ### Rendering tweets.json I created a custom [Zola shortcode][shortcode], [tweet_list] that reads `tweets.json` and renders each item in an ordered list. It evolved over time as I kept adding more information to the JSON file. It allowed me to see how the blog post looked as I implemented the following improvements. ### Expanding t.co Links {% aside(title="You used Rust for this!?", float="right") %} This is the sort of thing that would be well suited to a scripting language too. These days I tend to reach for Rust, even for little tasks like this. It's what I'm most familiar with nowadays and I can mostly write a "script" like this off the cuff with little need to refer to API docs. {% end %} The markup Twitter returns is full of `t.co` redirect links. I wanted to avoid sending my visitors through the Twitter redirect so I needed to expand these links to their target. I whipped up a little Rust program to do this: [expand-t-co]. It finds all `t.co` links with a regex (`https://t\.co/[a-zA-Z0-9]+`) and replaces each occurrence with the target of the link. The target URL is determined by making making a HTTP HEAD request for the `t.co` URL and noting the value of the `Location` header. The tool caches the result in a `HashMap` to avoid repeating a request for the same `t.co` URL if it's encountered again. I used the [ureq] crate to make the HTTP requests. Arguably it would have been better to use an async client so that more requests were made in parallel but that was added complexity I didn't want to deal with for a mostly one-off program. ### Adding the Media At this point I did a lot of manual work to find all the screenshots and videos that I shared in the tweets and [added them to my blog][media-files]. I also renamed them after the tool they depicted. As part of this process I noted the source of media files that I didn't create in a `"media_source"` key in `tweets.json` so that I could attribute them. I also added a `"media"` key with the name of the media file for each binary. Some of the externally sourced images were animated GIFs, which lack playback controls and are very inefficient file size wise. 
### Adding the Media

At this point I did a lot of manual work to find all the screenshots and videos that I shared in the tweets and [added them to my blog][media-files]. I also renamed them after the tool they depicted. As part of this process I noted the source of media files that I didn't create in a `"media_source"` key in `tweets.json` so that I could attribute them. I also added a `"media"` key with the name of the media file for each binary.

Some of the externally sourced images were animated GIFs, which lack playback controls and are very inefficient in terms of file size. Whenever I encountered an animated GIF I converted it to an MP4 with `ffmpeg`, resulting in large space savings:

```
ffmpeg -i ~/Downloads/so.gif -movflags faststart -pix_fmt yuv420p -vf "scale=trunc(iw/2)*2:trunc(ih/2)*2" so.mp4
```

This converts `so.gif` to `so.mp4` and ensures the dimensions are divisible by 2, which is apparently a requirement of H.264 streams encapsulated in MP4. I worked out how to do this from: