Ep 106: Robustify!
Each week, we discuss a different topic about Clojure and functional programming.
This week, the topic is: "building up reliability". We push our software to reach out to the real world and the real world pushes back.
Our discussion includes:
- Progress on the Sportify implementation.
- How do you keep development rapid as functionality grows?
- How can you learn and iterate quickly as you encounter failures and edge cases?
- The real world vs the portrayed world. ("The map is not the territory.")
- How to handle configuration during development.
- How to handle large downloads.
- What to do with I/O exceptions.
- How do you make software reliable?
- Where should you invest your effort for reliability?
- What learning is the most important in the tracer-bullet phase?
- How do you handle intermediate artifacts?
- What are different ways to handle failure?
- How to work with temp files.
- Exception handling tips and tricks.
- Why "what ifs" are an appealing trap.
- What type systems cannot help with.
- Finding joy in errors and failures.
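Two of the topics above, large downloads and temp files, fit together in a common pattern. The episode doesn't show code for this, so what follows is only a minimal Python sketch under our own assumptions. The hypothetical `download_to` helper streams any readable byte source (an HTTP response, for example) in fixed-size chunks into a temporary file next to the destination, then renames it into place. If anything fails partway through, it deletes the partial file, so the destination holds either a complete file or nothing.

```python
import io
import os
import tempfile
from typing import BinaryIO

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary, assumed value


def download_to(source: BinaryIO, dest_path: str, chunk_size: int = CHUNK_SIZE) -> int:
    """Stream `source` into `dest_path` without holding it all in memory.

    Hypothetical helper: writes to a temp file in the destination directory,
    then renames it into place. Returns the number of bytes written.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Same directory as the destination, so os.replace is a same-filesystem
    # rename (atomic on POSIX) rather than a copy.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".partial")
    total = 0
    try:
        with os.fdopen(fd, "wb") as out:
            while True:
                chunk = source.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
                total += len(chunk)
            out.flush()
            os.fsync(out.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Any failure (I/O error, Ctrl-C): remove the partial file so a
        # rerun starts from a clean state, then let the error surface.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return total


# Usage: any object with a .read(n) method works as the source.
target = os.path.join(tempfile.mkdtemp(), "data.bin")
written = download_to(io.BytesIO(b"x" * 3_000_000), target)  # 3000000
```

Writing to a `.partial` file first means a power cut or crash never leaves a half-written file under the final name. At worst, it leaves a stray temp file that is safe to delete.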
This is very incremental. We are rapidly accreting functionality. We're bringing it all together in a very interactive, REPL-driven way of getting things done.
There's no reason we would ever need to modify this code again, because everything will go smoothly! I'm sure nobody will ever change their mind on functionality, and nobody will ever make a mistake in any of the other systems either!
You always have to deal with uncertainty over time, not just uncertainty in the requirements.
One of the great things about this interactive REPL-driven way is that we are exploring the real world. We're not exploring somebody's documentation or somebody's article or somebody's representation of the real world. We're actually interacting with real systems, and we're looking at real data, and we're figuring out the real situation.
We're not trying to make the ideal version for production. We are trying to get a fully automated solution end to end to understand all the specific situations we have to handle.
That S3 function is built on a tower of abstractions. Some of those abstractions involve the network, and others involve other companies. Any of them might fail, for reasons ranging from geopolitical to network-based.
The human retry loop is a completely valid solution at this point in time. It's actually a valid solution for a lot of things.
Every time a human has to retry (and that human is us), we learn something. We're learning how the systems fail, and that's just as important as the happy path.
Over time, we're going to accrete more and more reliability in the system by handling more and more things.
Deleting the temporary files is a return to known state. It's a pretty harsh return to known state, but it is a return to known state. The initial state of having nothing is a sound state to return to.
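One way to make that harsh but sound "return to known state" automatic, sketched in Python under our own assumptions (the episode works in Clojure and doesn't show this code): run all of a job's steps inside a fresh scratch directory, and if any step raises, delete the whole directory so the next run starts from nothing. `run_in_scratch_dir` and the step functions are hypothetical names.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Optional

# A hypothetical step: does some work inside the scratch directory.
Step = Callable[[Path], None]


def run_in_scratch_dir(steps: Iterable[Step], base_dir: Optional[str] = None) -> Path:
    """Run each step in a brand-new scratch directory.

    On success, return the directory so the caller can keep the artifacts.
    On any exception, delete the directory entirely (back to the known
    "nothing exists yet" state) and re-raise so a human can retry.
    """
    scratch = Path(tempfile.mkdtemp(prefix="job-", dir=base_dir))
    try:
        for step in steps:
            step(scratch)
    except BaseException:
        shutil.rmtree(scratch, ignore_errors=True)
        raise
    return scratch
```

The design choice here is to spend no effort on partial recovery. Rather than figuring out which intermediate artifacts are still valid, a failed run leaves nothing behind and a rerun starts from scratch.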
I/O is the greatest source of failures when you're automating processes.
Nobody asked your program for permission to turn off the power.
The boss man says, "Make some more! Make some more!"
Let's reduce the recovery time as opposed to trying to avoid the need to recover.
We can convince ourselves that work is needed because we have evidence of failure over time. We're growing functionality on demand as needed. It's very lean.
We're building the right software just in time. Not only are you iterating quickly, so there isn't a long time between each change and each rerun, but you're also building the right thing every time. It is in the realm of the world, not in the realm of what you think the world is going to do. The actual world is there.
The situation we're in is the real world. It's not something that could happen or maybe happen or "what if" happened.
I/O is the source of pain in our lives, but it's the source of actually making useful software, so it's worth it.
Even if you're in a programming language like Haskell, which uses its type system to track and reason about the I/O in your logic, it still can't save you from the fact that I/O is going to blow up on you! I/O failures happen.
What happens if you need to retry the retry and then retry that retry? There's only so far you can go with this imperative assembling of the application.
We're letting the reality of our situation dictate where we apply our effort.
We still have learning to do.
Well, that was exceptionally fun!
We're still having fun even though we're encountering errors. Both of those things can happen at the same time!
It's just delightful to see progress. You're always feeling progress! That's a big goal: feel the wind at your back!