Building resilient services

I am surprised by how many programmers assume that the external services they depend on are always available, when in reality nothing has 100% uptime.

If you want to make your system reliable, you must plan for worst-case scenarios and implement a default retry mechanism in your code, along with some kind of fallback. Throwing a fatal error is not going to make your solution reliable.
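To make the idea concrete, here is a minimal sketch of a retry helper with exponential backoff and an optional fallback. The name call_with_retry and all of its parameters are made up for illustration. It also catches every exception for brevity, whereas real code should retry only on errors that are likely to be temporary.

```python
import time


def call_with_retry(operation, attempts=3, base_delay=1.0, fallback=None,
                    sleep=time.sleep):
    """Call operation(), retrying on failure with exponential backoff.

    Hypothetical helper, for illustration only. If every attempt fails,
    return fallback() when a fallback is given, otherwise re-raise the
    last error.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:  # assumption: any error may be temporary
            last_error = exc
            if attempt < attempts - 1:
                # Wait 1x, 2x, 4x ... the base delay between attempts.
                sleep(base_delay * 2 ** attempt)
    if fallback is not None:
        return fallback()
    raise last_error
```

A caller could pass something like fallback=lambda: cached_value to keep serving stale data while the remote service is down, instead of failing outright.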

For this reason I implemented an automatic retry mechanism inside the jira-python library, and thanks to it we are much less affected by network outages and temporary downtime of the services we rely on. The feature is enabled by default, but if you do not like it you can disable it with a simple parameter.

This is extremely easy to implement, especially if you are working with REST calls over HTTP.

You can add retries without modifying your HTTP calls, just by changing the behavior of the underlying HTTP library.

For example, the solution I used enables retries for anything that uses the standard Python urllib.
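As a rough illustration of patching the HTTP layer rather than every call site, the sketch below wraps urllib.request.urlopen from the standard library. This is not the code used in jira-python. The names with_retries and install_urlopen_retries are hypothetical, and the choice to retry only on connection problems and 5xx responses is my assumption.

```python
import time
import urllib.error
import urllib.request


def with_retries(func, attempts=3, delay=0.5):
    """Wrap func so that temporary network errors are retried.

    Hypothetical helper: retries on connection failures and on HTTP 5xx
    responses, waiting delay, 2*delay, 4*delay ... seconds in between.
    """
    def wrapper(*args, **kwargs):
        for attempt in range(1, attempts + 1):
            try:
                return func(*args, **kwargs)
            except urllib.error.HTTPError as exc:
                # Client errors (4xx) will not fix themselves: do not retry.
                if exc.code < 500 or attempt == attempts:
                    raise
            except (urllib.error.URLError, ConnectionError, TimeoutError):
                if attempt == attempts:
                    raise
            time.sleep(delay * 2 ** (attempt - 1))
    return wrapper


def install_urlopen_retries(attempts=3, delay=0.5):
    """Replace urllib.request.urlopen with a retrying version.

    Only code that looks up urllib.request.urlopen at call time picks up
    the patched function.
    """
    urllib.request.urlopen = with_retries(urllib.request.urlopen,
                                          attempts, delay)
```

After calling install_urlopen_retries() once at startup, existing calls to urllib.request.urlopen(...) stay unchanged in the source and gain the retry behavior. Monkey-patching a module like this is a blunt tool, so it is worth keeping a way to turn it off.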

A few questions you should ask:
* What if NIS or NFS is down for 10 minutes?
* What if a specific web server is down for a few minutes, or two hours?