Idempotence / self-healing: the system should be built in such a way that it tries to reach the correct end state, even if the current state is wrong. For instance, every time our system gets an update, it will re-evaluate the calculation from first principles, instead of doing a diff based on what was there before. This prevents bad data from snowballing and becoming a catastrophe.
Giving yourself knobs to twiddle in production: at work we have ways of triggering functionality in the system on request. Basically calling a method directly on the running process. This is so, so useful in prod issues, especially when combined with the above. We can basically tell the system “reprocess this action/command/message” at any time and it will do it again from first principles.
Debugging: I always first try and find a way to replicate it quickly. Then, I try and simplify it one tiny step at a time until it’s small enough I can understand in one go. I never combine multiple steps per re-run and always verify whether the bug is there or not at every single stage. This can be quite a slow approach but it also means I am always making progress towards finding the answer, instead of coming up with theories which are often wrong, and getting lost in the process.
Would you be willing to give an example of the second? I feel like my boss would throw a shitfit if I told him I wrote anything that even remotely alter prod
Certainly! The line we don’t cross is that we don’t directly edit data. Every record in our database must be generated by the system itself. But, we can re-trigger behaviour, or select different flows, or tweak properties around the edges as much as we want.
For example:
Reflows - for every message that enters or leaves our system, we store it in a table. We can then reflow the message either into our system or to our downstreams. This means if there was a transient error or a code change since we received the message, we can replay it again without having to involve anyone else.
Triggers - i.e. ask the system to regenerate its output based on its inputs again. This is useful if there’s a bug that’s only hit in certain situations.
Migration - we have lots of different flows and some are triggered only on some accounts. We have some scripts that lets us turn on/off migration and then automatically reflow all the different messages.
Idempotence / self-healing: the system should be built in such a way that it tries to reach the correct end state, even if the current state is wrong. For instance, every time our system gets an update, it will re-evaluate the calculation from first principles, instead of doing a diff based on what was there before. This prevents bad data from snowballing and becoming a catastrophe.
Giving yourself knobs to twiddle in production: at work we have ways of triggering functionality in the system on request. Basically calling a method directly on the running process. This is so, so useful in prod issues, especially when combined with the above. We can basically tell the system “reprocess this action/command/message” at any time and it will do it again from first principles.
Debugging: I always first try and find a way to replicate it quickly. Then, I try and simplify it one tiny step at a time until it’s small enough I can understand in one go. I never combine multiple steps per re-run and always verify whether the bug is there or not at every single stage. This can be quite a slow approach but it also means I am always making progress towards finding the answer, instead of coming up with theories which are often wrong, and getting lost in the process.
Would you be willing to give an example of the second? I feel like my boss would throw a shitfit if I told him I wrote anything that even remotely alter prod
Certainly! The line we don’t cross is that we don’t directly edit data. Every record in our database must be generated by the system itself. But, we can re-trigger behaviour, or select different flows, or tweak properties around the edges as much as we want.
For example: