Google Cloud Platform

When you outsource a component of your architecture to a cloud company, you are consciously making a trade-off: you often get a cheaper (and hopefully better and faster) service, but you yield control to the service provider. This trade-off just became personal with Google Pub/Sub.

We've been using Google Pub/Sub at work for a distributed application that can handle 1 minute of Pub/Sub downtime just fine, but 5 minutes of downtime starts setting off alarms. We've been experiencing sporadic outages longer than 5 minutes, roughly 10-minute windows here and there. I can't be 100% sure these were all the fault of Pub/Sub, but I recently refactored my app to just die at the first sign of trouble; an automatic restart wipes the slate clean and opens new connections. That means any downtime that persists after a restart is very likely a fault with Pub/Sub. And today we had 23 minutes of subscriber downtime with errors:

{ [Error: The service was unable to fulfill your request. Please try again.]
errors: undefined,
code: 504,
message: 'The service was unable to fulfill your request. Please try again.',
response: undefined }
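
For readers curious what "die at the first sign of trouble" might look like, here is a minimal sketch rather than my actual code. It assumes the subscription object is an event emitter that fires 'error' and 'message' events; a plain EventEmitter stands in for the real client here, and the name dieOnFirstError is made up for illustration. It also assumes something external (systemd, Docker, Kubernetes, pm2, or similar) restarts the process after it exits.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: exit on any subscription error instead of trying to
// recover in-process. The supervisor's restart then opens fresh connections.
// `exit` is injectable so the behavior can be exercised without killing the process.
function dieOnFirstError(subscription, exit = process.exit) {
  subscription.on('error', (err) => {
    console.error('subscription error, exiting for a clean restart:', err.code, err.message);
    exit(1);
  });
  return subscription;
}

// Stand-in for a real subscription (assumption: it emits 'message' and 'error').
const subscription = dieOnFirstError(new EventEmitter());
subscription.on('message', (msg) => console.log('received', msg));
```

The point of the design is that the app holds no recovery logic of its own, so if errors continue after a restart, stale connections on my side are an unlikely explanation.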

I confirmed with a very nice and helpful Google engineer that there were "more errors than expected" with the Pub/Sub service during that window. He apologized for the downtime and pointed to the SLA, under which we could get a partial refund. Of course, you're probably using this service because its value heavily outweighs its cost. Conversely, the loss of the service heavily outweighs the value of a refund, so a partial refund is not very comforting.

So the moral of the story, in my experience: if you need "pubsub" with a guaranteed latency better than 1 hour, consider rolling your own or using another hosted solution. But if you can handle stretches of downtime, Google Pub/Sub is great. When it's running, latency is usually quite low (sub-second, I believe, though I haven't measured).

It's certainly possible, perhaps probable, that the Google Engineers will overcome whatever growing pains they are going through and that the service will become much more reliable in the future. Until then I need to consider regaining some control.

