Sitting in on our co-founder Eric’s presentation last week gave me a new appreciation for the value of real-time. We use the word a lot, and it means different things to different folks. I am sure that NASA expects absolute real-time telemetry from its satellites in orbit, so that action can be taken immediately. That’s just a given for the business of keeping satellites in orbit.
So here we all are in 2011, using a variety of programs almost every minute of our waking day. How many of us get angry when we click on something like “reset my password” while trying to log into a web app, and that email does not show up in our inbox within a few seconds? In our world today, we have gotten used to things happening right now. We watch movies on demand. We text back and forth with no apparent delay. We watch tweet feeds come in. These things happen in seconds. If they don’t happen the way we expect, we walk away, perhaps for good.
For those of us who create applications that people pay for or depend on, this is not a good thing. If our app does not respond, our customers can just walk away. That translates into lost revenue for us and lost productivity for them.
However, many of you that I talk with every day say that monitoring every minute or so is OK. But when I looked at the slides Eric presented at the Cloud User Group meeting (below), I saw that if a monitoring system only measures every minute (slide 17), the severity of an issue gets averaged out. It’s even worse if you measure only every 5 minutes (slide 16). We lose perspective on the true intensity of what happened in the 25 or so seconds that the system was totally pegged (slide 18). The customers who were using the system at that time definitely did not lose perspective. They absolutely knew that it was not responding. So it seems there is a gap between what our customers expect and what we as application developers are delivering.
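To make the averaging effect concrete, here is a rough sketch in Python. The numbers are my own assumptions, not figures from Eric’s slides: a system idling at 20% utilization that gets pegged at 100% for 25 seconds somewhere inside a 5-minute window. The sketch compares what you would see with per-second samples against 1-minute and 5-minute averages.

```python
from statistics import mean

# All values below are hypothetical, chosen only to illustrate the effect.
BASELINE = 20.0    # assumed normal utilization, in percent
PEGGED = 100.0     # utilization while the system is totally pegged
SPIKE_START = 130  # second at which the spike begins
SPIKE_LEN = 25     # spike duration in seconds
WINDOW = 300       # total observation window: 5 minutes


def build_samples():
    """Return one utilization sample per second for the whole window."""
    return [
        PEGGED if SPIKE_START <= t < SPIKE_START + SPIKE_LEN else BASELINE
        for t in range(WINDOW)
    ]


def bucket_averages(samples, size):
    """Average the per-second samples in consecutive buckets of `size` seconds."""
    return [mean(samples[i:i + size]) for i in range(0, len(samples), size)]


samples = build_samples()
per_second_peak = max(samples)              # what a per-second view shows
one_minute = bucket_averages(samples, 60)   # what a 1-minute view shows
five_minute = bucket_averages(samples, 300) # what a 5-minute view shows

print(f"per-second peak:       {per_second_peak:.1f}%")
print(f"worst 1-minute bucket: {max(one_minute):.1f}%")
print(f"5-minute average:      {five_minute[0]:.1f}%")
```

With these made-up numbers, the per-second view shows the system at 100%, the worst 1-minute bucket shows about 53%, and the 5-minute average shows about 27%. Same outage, three very different stories.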
To put this in perspective, below is a screen capture of one of our systems, viewed through the cloud monitoring product, RevealCloud. Notice that the system "staging" goes completely south in just 25 seconds. Within that window the decline was relatively gradual, passing through a warning state (yellow) where corrective action could have been taken before things became critical (red).
Seconds count, folks.
Screenshots from Eric’s presentation showing the averaging out of a severe performance spike: