Ep 022: Evidence of Attempted Posting

March 29, 2019


Christoph questions his attempts to post to Twitter.

This week, we continue to dig into the "Twitter problem": we want to post to Twitter on a schedule.
"Writing code to help out with laziness."
Start with data to keep track of: inside (our data) and outside (Twitter data)
"Data from a foreign land."
We need to determine our "working view" of Twitter's data.
What is in our data? For each "scheduled tweet":
- Text to post: the "status"
- Timestamp of when to post
Timestamps are nice
- Milliseconds since the epoch
- Universal instant
- Allows the client to localize
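A minimal sketch of what one scheduled tweet could look like as data. The key names (`:status`, `:post-at`) and the `localize` helper are illustrative assumptions, not names from the episode. It shows why an epoch-millisecond timestamp is convenient: it is one universal instant, and the client picks the time zone only when displaying it.

```clojure
(import '(java.time Instant ZoneId))

;; Hypothetical shape for one scheduled tweet; the key names are assumptions.
(def scheduled-tweet
  {:status  "New episode is out!"
   :post-at 1553860800000}) ; milliseconds since the epoch (a UTC instant)

(defn localize
  "Render an epoch-millisecond timestamp in the named time zone."
  [epoch-ms zone-name]
  (str (.atZone (Instant/ofEpochMilli epoch-ms) (ZoneId/of zone-name))))

;; The same instant, viewed from two places.
(println (localize (:post-at scheduled-tweet) "America/Denver"))
(println (localize (:post-at scheduled-tweet) "Europe/Berlin"))
```

The stored value never changes; only the rendering does.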
How do we know a scheduled tweet has been posted? A "posted?" boolean?
Boolean says, "Yes! It has been posted somewhere on the Internet."
Correlating identifiers are more useful than a Boolean.
The tweet ID is a correlating identifier. We can use it to look up all of Twitter's data about it.
"We don't need to store all of Twitter in our database."
What is the story you need to tell about what happened?
- A record of all the attempts allows us to tell a story about what happened.
- Useful to have the timestamp of when our application posted it.
Make a separate log for attempts.
- Attempting to post is a separate concern from what to post.
- Don't complicate the scheduled tweet information by embedding the log.
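One way to keep the attempt log apart from the scheduled tweets is sketched below. The atom, the key names, and both helper functions are hypothetical; the only idea taken from the discussion is that attempts live in their own log and point back at the scheduled tweet by ID.

```clojure
;; Hypothetical attempt log, kept separate from the scheduled tweets.
(def attempts (atom []))

(defn record-attempt!
  "Append one attempt to the log. `outcome` is any map describing what happened."
  [log scheduled-id now-ms outcome]
  (swap! log conj (merge {:scheduled-id scheduled-id
                          :attempted-at now-ms}
                         outcome)))

(defn attempts-for
  "All logged attempts for one scheduled tweet, oldest first."
  [log scheduled-id]
  (filter #(= scheduled-id (:scheduled-id %)) @log))

;; A failure followed by a success tells the story of what happened.
(record-attempt! attempts 17 1553860800000 {:success? false :error "timeout"})
(record-attempt! attempts 17 1553860860000 {:success? true :tweet-id 1111})
```

Because the scheduled tweet itself is untouched, new questions (how many failures, how long until success) only need the log.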
"Once you have all the data, it allows you to ask new questions you didn't originally think of."
Clojure makes it easy to work with a large tree of data that came from an external source. We don't have to care about the structure of that data. We can just write it down.
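As a rough illustration of "just write it down", the sketch below serializes a nested map with `pr-str` and reads it back with `clojure.edn/read-string`. The `response` map is made-up sample data, not Twitter's actual response format.

```clojure
(require '[clojure.edn :as edn])

;; Made-up stand-in for a large response from an external service.
(def response
  {:id       1111
   :text     "New episode is out!"
   :user     {:screen_name "example" :followers_count 42}
   :entities {:hashtags [] :urls []}})

;; Write the whole tree down as a string without caring about its shape.
(def stored (pr-str response))

;; Later, read it back and ask a question nobody planned for.
(def restored (edn/read-string stored))
(println (get-in restored [:user :followers_count]))
```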
Simply attempt to post the next scheduled tweet that does not have a Twitter ID recorded.
If it fails, just record the attempt, and go back to sleep.
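The worker step described above might be sketched like this. Everything here is an assumption made for illustration: the key names, the "is it due yet" filter, and `post-fn`, a caller-supplied function standing in for whatever actually talks to Twitter. It returns an attempt record either way and never retries.

```clojure
(defn next-unposted
  "Earliest due scheduled tweet that has no Twitter ID recorded, or nil."
  [scheduled now-ms]
  (->> scheduled
       (remove :twitter-id)
       (filter #(<= (:post-at %) now-ms))
       (sort-by :post-at)
       first))

(defn attempt-next
  "Try to post the next unposted tweet once. Returns an attempt record,
  or nil when nothing is due. `post-fn` is hypothetical: it takes the
  status text and returns a response map, or throws on failure."
  [scheduled now-ms post-fn]
  (when-let [tweet (next-unposted scheduled now-ms)]
    (try
      (let [response (post-fn (:status tweet))]
        {:scheduled-id (:id tweet) :attempted-at now-ms
         :success? true :tweet-id (:id response)})
      (catch Exception e
        ;; No retry logic: record the failure and let the next wake-up try again.
        {:scheduled-id (:id tweet) :attempted-at now-ms
         :success? false :error (.getMessage e)}))))
```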
"Handle the brick in front of you, and if you keep doing that, you'll eventually build the wall."
What if we don't hear the success response from Twitter, but it did get posted?
Idea: Try to detect if a tweet has already been posted.
If we can uniquely identify something by its content, we can know two things are the same without having a common ID.
Problem: Twitter can alter the contents.
Idea: fuzzy "measure of similarity" between our recent tweets and the next scheduled tweet.
We can record the fuzzy match in our attempt log too!
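The episode does not name a particular similarity measure. As one possible stand-in, the sketch below uses Jaccard similarity over lowercase word tokens; the function names and the choice of measure are assumptions. The resulting score is plain data, so it can be stored in an attempt record alongside everything else.

```clojure
(require '[clojure.string :as str]
         '[clojure.set :as set])

(defn tokens
  "Set of lowercase alphanumeric words in s."
  [s]
  (set (re-seq #"[a-z0-9]+" (str/lower-case s))))

(defn similarity
  "Jaccard similarity of the word sets of a and b, from 0.0 to 1.0."
  [a b]
  (let [ta    (tokens a)
        tb    (tokens b)
        union (set/union ta tb)]
    (if (empty? union)
      1.0
      (double (/ (count (set/intersection ta tb)) (count union))))))

(defn best-match
  "Most similar recent text to candidate, with its score, or nil if none."
  [candidate recent-texts]
  (when (seq recent-texts)
    (apply max-key :score
           (map (fn [t] {:text t :score (similarity candidate t)})
                recent-texts))))
```

Where to draw the "same tweet" threshold is a judgment call; a link appended or rewritten by Twitter lowers the score without making the tweets different in substance.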
If we can correlate by contents, we could even identify when we manually post in advance.
As soon as you can determine equality by the substance of the thing itself, you can have more than one writer.
How "recent" is "recent"? Is it 100? Is it 200? Is it 500?
Even better, fetch all the tweets since the last ID we recorded.
- we know we're seeing all of the tweets
- can scan each of those for a match (in the case of a manual post)
- know when the tweet stream ends, so we can know a posting is still needed
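The scan over "everything since the last recorded ID" could look roughly like the sketch below. The fetch itself is not shown: `tweets-since` is assumed to be the already-fetched sequence of tweet maps, and the key names and the simple normalized-equality check are illustrative choices.

```clojure
(require '[clojure.string :as str])

(defn normalize
  "Trim and lowercase text so trivial differences do not block a match."
  [s]
  (str/lower-case (str/trim s)))

(defn find-posted
  "First tweet in tweets-since whose text matches status, or nil.
  tweets-since is assumed to hold every tweet after the last recorded ID."
  [tweets-since status]
  (first (filter #(= (normalize status) (normalize (:text %)))
                 tweets-since)))

(defn reconcile
  "Attach the Twitter ID when a match exists (for example a manual post);
  otherwise return the scheduled tweet unchanged, meaning it still needs posting."
  [tweets-since scheduled]
  (if-let [match (find-posted tweets-since (:status scheduled))]
    (assoc scheduled :twitter-id (:id match))
    scheduled))
```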
The worker will get there eventually. Can just give up on an error. No complex retry and recovery logic.
With more than one writer, we still can have a race condition. Ultimately Twitter has to deal with deduplication to avoid a double post in a short interval.

Message Queue discussion:

Namespacing in a map is really useful
A flat, namespaced map is easier to traverse than a nested map.
One use for namespaces: indicate the origin of the data
E.g., :twitter/id, :twitter/status vs. :local/id, :local/text
You see the namespace in your code, so it makes the data origin very visible.
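A small sketch of a flat, namespaced map built from the keys mentioned above. The values and the `from-origin` helper are made up for illustration; `namespace` is the core function that returns a keyword's namespace as a string.

```clojure
;; One flat map; the namespace of each key shows where the value came from.
(def tweet-record
  {:local/id       17
   :local/text     "New episode is out!"
   :twitter/id     1111
   :twitter/status "New episode is out! https://t.co/abc"})

(defn from-origin
  "Only the entries whose key namespace equals origin, e.g. \"twitter\"."
  [m origin]
  (into {} (filter (fn [[k _]] (= origin (namespace k)))) m))

;; No nested traversal needed: every value is one lookup away.
(println (:twitter/id tweet-record))
(println (from-origin tweet-record "local"))
```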

Related episodes:

Schemas for data. "Internal" vs "external" data.
- 007: Input Overflow
Creating a "big bag of data" and asking it questions.

Related projects:

Clojure in this episode:

pr-str
