Ep 022: Evidence of Attempted Posting
Christoph questions his attempts to post to Twitter.
- This week, continuing to dig into the "Twitter problem". We want to post to Twitter on a schedule.
- "Writing code to help out with laziness."
- Start with data to keep track of: inside (our data) and outside (Twitter data)
- "Data from a foreign land."
- We need to determine our "working view" of Twitter's data.
- What is in our data? For each "scheduled tweet":
- Text to post: the "status"
- Timestamp of when to post
- Timestamps are nice
- Milliseconds since the epoch
- Universal instant
- Allows the client to localize
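A minimal sketch of the scheduled tweet described above, assuming a plain map with the hypothetical keys `:local/text` and `:local/post-at` (the episode does not name them). It stores the time as milliseconds since the epoch and leaves localization to whoever reads it:

```clojure
(import '(java.time Instant ZoneId))

;; Hypothetical scheduled tweet: the status text plus a universal instant.
(def scheduled-tweet
  {:local/text    "Ep 022 is out!"
   :local/post-at 1546300800000}) ; ms since the epoch (2019-01-01T00:00:00Z)

(defn post-instant
  "Turns the stored epoch milliseconds into a java.time.Instant."
  [tweet]
  (Instant/ofEpochMilli (:local/post-at tweet)))

(defn localized
  "Renders the instant in a client's time zone, e.g. \"America/Denver\"."
  [tweet zone-id]
  (.atZone (post-instant tweet) (ZoneId/of zone-id)))

(defn due?
  "True once the current time (also epoch ms) has reached the post time."
  [tweet now-ms]
  (<= (:local/post-at tweet) now-ms))
```

Comparing two epoch-millisecond values is a simple number comparison, so the scheduling check never has to think about time zones.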
- How do we know a scheduled tweet has been posted? A "posted?" boolean?
- Boolean says, "Yes! It has been posted somewhere on the Internet."
- Correlating identifiers are more useful than a Boolean.
- The tweet ID is a correlating identifier. We can use it to look up all of Twitter's data about it.
- "We don't need to store all of Twitter in our database."
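One way to sketch the "identifier instead of Boolean" idea, assuming the scheduled tweet is a map and that the hypothetical function names below are ours: "posted?" becomes a question we answer from the recorded ID, not a flag we store.

```clojure
;; Hypothetical shape: :local/* keys are ours, :twitter/id comes from Twitter.
(def scheduled-tweet
  {:local/id   1
   :local/text "Ep 022 is out!"})

(defn mark-posted
  "Records the ID Twitter returned for this tweet."
  [tweet twitter-id]
  (assoc tweet :twitter/id twitter-id))

(defn posted?
  "Derived from the identifier, so there is no separate Boolean to keep in sync."
  [tweet]
  (contains? tweet :twitter/id))
```

With the ID on hand, anything else we want to know about the tweet can be fetched from Twitter later, so it does not need to live in our database.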
- What is the story you need to tell about what happened?
- A record of all the attempts allows us to tell a story about what happened.
- Useful to have the timestamp of when our application posted it.
- Make a separate log for attempts.
- Attempting to post is a separate concern from what to post.
- Don't complicate the scheduled tweet information by embedding the log.
- "Once you have all the data, it allows you to ask new questions you didn't originally think of."
- Clojure makes it easy to work with a large tree of data that came from an external source. We don't have to care about the structure of that data. We can just write it down.
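A rough sketch of a separate attempt log, assuming an in-memory atom and hypothetical `:attempt/*` keys (the episode does not specify storage). The external response is written down as-is with `pr-str`, without modeling its structure, and can be read back later with `clojure.edn/read-string`:

```clojure
(require '[clojure.edn :as edn])

(defn record-attempt!
  "Appends one attempt to the log. The response is stored as printed EDN text."
  [log local-id now-ms response]
  (swap! log conj {:attempt/tweet-id local-id
                   :attempt/at       now-ms
                   :attempt/response (pr-str response)}))

(defn attempts-for
  "All recorded attempts for one scheduled tweet, oldest first."
  [log local-id]
  (filter #(= local-id (:attempt/tweet-id %)) @log))

(defn read-response
  "Recovers the original data so new questions can be asked of it later."
  [attempt]
  (edn/read-string (:attempt/response attempt)))
```

Because the log refers to the scheduled tweet only by ID, the scheduled tweet itself stays small no matter how many attempts pile up.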
- Simply attempt to post the next scheduled tweet that does not have a Twitter ID recorded.
- If it fails, just record the attempt, and go back to sleep.
- "Handle the brick in front of you, and if you keep doing that, you'll eventually build the wall."
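The single-brick worker step could look something like this. It is a sketch under assumptions: `post!` is a hypothetical function that takes the status text and either returns a map containing `:id` or throws, the `:local/post-at` and `:attempt/*` keys are invented for illustration, and the "is it due yet?" check is an addition the notes do not spell out.

```clojure
(defn next-unposted
  "Earliest due scheduled tweet with no Twitter ID recorded, or nil."
  [tweets now-ms]
  (->> tweets
       (remove :twitter/id)
       (filter #(<= (:local/post-at %) now-ms))
       (sort-by :local/post-at)
       first))

(defn record-id
  "Attaches the Twitter ID to the scheduled tweet with the given local ID."
  [tweets local-id twitter-id]
  (mapv (fn [t]
          (if (= local-id (:local/id t))
            (assoc t :twitter/id twitter-id)
            t))
        tweets))

(defn step
  "One pass of the worker: try the next tweet, log the attempt, and return.
  On failure nothing is retried; the next pass will pick the tweet up again."
  [state now-ms post!]
  (if-let [tweet (next-unposted (:tweets state) now-ms)]
    (let [outcome (try
                    {:attempt/ok?      true
                     :attempt/response (post! (:local/text tweet))}
                    (catch Exception e
                      {:attempt/ok?   false
                       :attempt/error (.getMessage e)}))
          attempt (assoc outcome
                         :attempt/tweet-id (:local/id tweet)
                         :attempt/at       now-ms)]
      (cond-> (update state :attempts conj attempt)
        (:attempt/ok? outcome)
        (update :tweets record-id (:local/id tweet)
                (get-in outcome [:attempt/response :id]))))
    state))
```

Passing `post!` in as an argument keeps the step testable without touching the network; a caller would invoke `step` on a timer and sleep in between.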
- What if we don't hear the success response from Twitter, but it did get posted?
- Idea: Try to detect if a tweet has already been posted.
- If we can uniquely identify something by its content, we can know two things are the same without having a common ID.
- Problem: Twitter can alter the contents.
- Idea: fuzzy "measure of similarity" between our recent tweets and the next scheduled tweet.
- We can record the fuzzy match in our attempt log too!
- If we can correlate by contents, we could even identify when we manually post in advance.
- As soon as you can determine equality by the substance of the thing itself, you can have more than one writer.
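The episode does not say which similarity measure to use. As one illustrative stand-in, this sketch uses Jaccard similarity over lower-cased words (shared words divided by all distinct words), which tolerates small changes such as a rewritten link. The function names and the `:match/score` key are hypothetical.

```clojure
(require '[clojure.set :as set]
         '[clojure.string :as str])

(defn words
  "Lower-cased set of whitespace-separated words in the text."
  [text]
  (->> (str/split (str/lower-case text) #"\s+")
       (remove str/blank?)
       set))

(defn similarity
  "Jaccard similarity of the two texts' word sets, from 0.0 to 1.0."
  [a b]
  (let [wa  (words a)
        wb  (words b)
        all (set/union wa wb)]
    (if (empty? all)
      1.0
      (double (/ (count (set/intersection wa wb)) (count all))))))

(defn best-match
  "Scores each recent tweet against the scheduled status and returns the
  closest one as {:twitter/id .. :match/score ..}, or nil if there are none."
  [recent status]
  (when (seq recent)
    (apply max-key :match/score
           (for [t recent]
             {:twitter/id  (:twitter/id t)
              :match/score (similarity status (:twitter/status t))}))))
```

The map returned by `best-match` is small enough to store alongside the attempt, and the cutoff for "close enough" is a judgment call that would need tuning against real tweets.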
- How "recent" is "recent"? Is it 100? Is it 200? Is it 500?
- Even better, fetch all the tweets since the last ID we recorded.
- We know we're seeing all of the tweets.
- We can scan each of those for a match (in the case of a manual post).
- We know when the tweet stream ends, so we can know a posting is still needed.
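A sketch of that reconciliation step, assuming the timeline has already been fetched as a sequence of maps with `:twitter/id` and `:twitter/status`, and that tweet IDs grow over time so the largest recorded ID marks where to resume. `same?` is a hypothetical pluggable comparison: exact equality here, but a fuzzy match would slot in the same way.

```clojure
(defn last-recorded-id
  "Largest Twitter ID we have recorded so far, or 0 if none."
  [scheduled]
  (->> scheduled (keep :twitter/id) (reduce max 0)))

(defn reconcile
  "Attaches a Twitter ID to any unposted scheduled tweet whose text matches
  a tweet in the timeline (for example, one that was posted manually)."
  [scheduled timeline same?]
  (mapv (fn [tweet]
          (if (:twitter/id tweet)
            tweet
            (if-let [hit (first (filter #(same? (:local/text tweet)
                                                (:twitter/status %))
                                        timeline))]
              (assoc tweet :twitter/id (:twitter/id hit))
              tweet)))
        scheduled))

(defn still-needed
  "Scheduled tweets that remain unposted after scanning the whole timeline."
  [scheduled timeline same?]
  (remove :twitter/id (reconcile scheduled timeline same?)))
```

Since the timeline passed in covers everything after the last recorded ID, a scheduled tweet with no match really is unposted, not merely outside some "recent" window.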
- The worker will get there eventually. Can just give up on an error. No complex retry and recovery logic.
- With more than one writer, we can still have a race condition. Ultimately Twitter has to deal with deduplication to avoid a double post in a short interval.
Message Queue discussion:
- Namespacing in a map is really useful
- A flat, namespaced map is easier to traverse than a nested map.
- One use for namespaces: indicate the origin of the data
- E.g. :twitter/id, :twitter/status vs. :local/id, :local/text
- You see the namespace in your code, so it makes the data origin very visible.
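To make the flat, namespaced map concrete, here is a small sketch using the keys mentioned above. The values and the `from-origin` helper are hypothetical:

```clojure
;; One flat map; the namespace of each key says where the value came from.
(def tweet
  {:local/id       7
   :local/text     "Ep 022 is out!"
   :twitter/id     1234
   :twitter/status "Ep 022 is out!"})

;; The same data nested by origin, for comparison.
(def nested-tweet
  {:local   {:id 7 :text "Ep 022 is out!"}
   :twitter {:id 1234 :status "Ep 022 is out!"}})

(defn from-origin
  "Keeps only the entries whose key has the given namespace, e.g. \"twitter\"."
  [m origin]
  (into {} (filter (fn [[k _]] (= origin (namespace k)))) m))
```

A flat lookup is `(:twitter/id tweet)`, while the nested form needs `(get-in nested-tweet [:twitter :id])`; either way the origin is visible at the call site, but the flat map needs no path.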
Related episodes:
- Schemas for data. "Internal" vs "external" data.
- Creating a "big bag of data" and asking it questions.
Clojure in this episode:
pr-str