At AOL, When Service Shuts Down
John-Paul Carriveau is a Mobile KDS Analyst at Keynote with a unique perspective. He was formerly a member of the Mobile Partner Services team at AOL. While there he used Keynote's Mobile Device Perspective (MDP) to monitor and help troubleshoot AOL transactions when there were problems. He now brings his understanding of the customer experience to the work he does at Keynote. Today we asked John-Paul to give us his perspective on application monitoring: what works and what doesn't, why it matters, and why it makes sense for a large mobile portal like AOL.
Benchmark: What did you do at AOL?
Carriveau: One of my key responsibilities was monitoring and troubleshooting for AOL’s mobile products. I used both internal and end-to-end metrics to identify and resolve issues that kept end users from using the mobile products and services. Like any online communications and media provider, AOL makes money when people are able to communicate with friends and access relevant mobile content quickly and easily. When AOL products were unavailable to customers, we lost money.
Benchmark: What kinds of things kept you up at night?
Carriveau: Monitoring the network side of how the site was performing was reasonably easy, but what we really needed was to see how the applications were performing. Each transaction was quite complicated on the back end. A simple user request could entail several transactions back and forth. If we couldn’t see them end-to-end, we couldn’t understand the monetary impact of network problems. Also, because our internal measurements alone had no way of determining real-world impact, we risked allowing a high-profile outage to go on too long because we didn’t realize just how widespread the impact was.
Benchmark: Tell us more.
Carriveau: I was a member of the team responsible for assuring AOL network availability to our mobile carrier partners. I needed to know when something broke or precisely when an error started so that I could correlate that data with application changes, service outages and other known events. There is a huge difference between how an event impacts a server or traffic and how it impacts the end user. End-to-end testing is the only way to really understand and estimate the impact of an event on the end user. We needed the end-to-end metrics as well as internal message and network traffic data to get a complete view of issues that were impacting our users.
Benchmark: What was a typical day like?
Carriveau: I checked our internal metrics to ensure the network was healthy and traffic levels were normal. I correlated anomalies I found internally with data collected from the Keynote MDP end-to-end monitoring system to validate the existence of issues, and assess their impact to customers. I also provided monthly reporting on our products’ end-to-end performance. The detail provided in the script results gave me the information I needed to speak accurately to any issues we reported.
Benchmark: How did you prioritize issues for resolution?
Carriveau: Customer impact was our first concern—and where we wanted to spend our time. When end-to-end data showed a considerable drop in transaction success rates, even if internal systems detected only a small change in traffic, that data was invaluable in helping us assign the proper urgency to issues and get them resolved. On the other hand, when internal systems would show a problem, but our end-to-end data would show good results, we could avoid spending cycles on a lower priority issue.
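The triage rule described here—end-to-end customer impact outranks internal-only signals—can be sketched in a few lines. This is an illustrative sketch, not AOL's or Keynote's actual logic; the threshold values and function names are assumptions.

```python
def triage_priority(internal_anomaly: bool,
                    e2e_success_rate: float,
                    baseline_success_rate: float = 0.95,
                    drop_threshold: float = 0.05) -> str:
    """Hypothetical triage rule: prioritize by end-to-end (e2e)
    customer impact. Thresholds are illustrative only."""
    # A meaningful drop in end-to-end success means users are failing.
    customer_impact = (baseline_success_rate - e2e_success_rate) >= drop_threshold
    if customer_impact:
        return "urgent"   # real users affected, regardless of internal view
    if internal_anomaly:
        return "low"      # internal blip, but end-to-end results look good
    return "none"
```

For example, an internal alarm with healthy end-to-end results yields `"low"`, while a success-rate drop to 70% yields `"urgent"` even with no internal signal.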
Benchmark: How does end-to-end testing differ from other types of monitoring?
Carriveau: As an applications developer, it’s not easy to see how an average Joe actually experiences your product. For example, if you have an application where users can find restaurants close to a certain ZIP code, the only way to see how often their requests succeed is through end-to-end testing in a live production environment. That means using actual handsets to obtain the results. Accurate results also require sophisticated scripting—something I preferred to entrust to a firm that specializes in writing and maintaining these scripts.
Benchmark: How often did you run your tests?
Carriveau: At AOL, we tested each of our services once an hour, on 12 different handsets, 24x7. The data was rolled up into a monthly report. You have to test regularly if you really want to know what the end user success rates are in the real world. There is a huge difference between looking at individual components of a transaction and understanding whether all components are functioning correctly together at any given time.
Benchmark: Give us an example of how application diagnostics can have an impact on service and revenues.
Carriveau: One time a new handset was launched with AOL’s messaging client. The handset was expected to be hot, and it was, but when it was launched, we ran into an immediate problem. The client was written and tested by an outsourced team that didn’t have all the information necessary about how to interpret AOL’s protocol. The result was that the client was generating massive amounts of data—way too many transactions at the same time for our servers to handle. We were experiencing a transaction volume that we had projected we wouldn’t reach in two years, let alone overnight. Here was the challenge: some internal graphs indicated that we had a problem, but others didn’t. It was impossible to tell what was going on until we added our end-to-end testing data to the mix.
Benchmark: How did end-to-end testing data change the equation?
Carriveau: We brought actionable data to the table. We were able to separate performance data by carrier, even though all the traffic was going through the same pipe. We could show that the day before the launch, transactions on that carrier were succeeding at a 95% rate, while after the launch, the success rate dropped to 70%. In addition, we could show that traffic had increased by 35-40% and failures were occurring. We could calculate the daily hit to the bottom line if the problem were not fixed quickly. The data demonstrated that we were experiencing high failure rates that could be associated with a single event—the launch. With this kind of information on hand, we could focus on fixing the problem rather than spending time debating the cause.
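The "daily hit to the bottom line" is straightforward arithmetic once you have the success rates. The 95% and 70% figures come from the interview; the transaction volume and per-transaction revenue below are purely illustrative assumptions.

```python
def daily_revenue_at_risk(daily_transactions: float,
                          revenue_per_transaction: float,
                          baseline_rate: float = 0.95,
                          current_rate: float = 0.70) -> float:
    """Estimate daily revenue lost to a success-rate drop.

    The 95% -> 70% rates are from the interview; transaction volume
    and per-transaction revenue are hypothetical inputs.
    """
    # Share of transactions that now fail but previously succeeded.
    failed_share = baseline_rate - current_rate
    return daily_transactions * failed_share * revenue_per_transaction

# With an assumed 1M transactions/day at $0.01 each, a 25-point
# success-rate drop puts $2,500/day at risk.
print(daily_revenue_at_risk(1_000_000, 0.01))
```

Framing the outage in dollars per day, rather than error counts, is what let the team assign urgency and skip the debate over root cause.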
Benchmark: Are there a lot of options for end-to-end monitoring solutions?
Carriveau: We had tried a variety of solutions, and most were not that attractive. We didn’t want to reinvent the wheel in-house by buying and maintaining a bank of computers and phones, and we certainly didn’t believe it was cost-effective to depend on an army of human testers checking out possibilities. There was another company providing an automated end-to-end monitoring product, but you had to write the scripts and run them yourself. Keynote provided the one solution that delivered technical expertise in writing and maintaining the scripts as well as in executing the monitoring.
Benchmark: Why did you find the script support so attractive?
Carriveau: Like most content providers, our expertise at AOL was in the content we provided and the audiences we served, not in the mobile phones and network infrastructure required to deliver that content. It’s like driving a car or flying in an airplane. Your expertise is in getting where you want to go. You don’t want or need to know what’s under the hood, how the route is maintained and other infrastructure details; you just want everything to work the way it is supposed to.
As a customer, I wanted to know that we were getting round-the-clock monitoring and alerts, along with diagnostic information, including screen shots, when something went wrong. It was the best possible way to ensure uptime.
Benchmark: What’s different about your job at Keynote from what you were doing at AOL?
Carriveau: It’s a difference in perspective. At AOL I wanted to make sure my applications were performing in an acceptable manner—and positively impacting the revenue stream. At Keynote, I combine my experience as a customer with an in-depth knowledge of scripting strategies. I can provide each customer with the testing capabilities that will give them accurate, actionable data to provide the best possible end user experience.
Changes in application performance can have a huge impact on the bottom line. As John-Paul’s experience demonstrates, consistent end-to-end monitoring, with the ability to troubleshoot when things go wrong, is insurance that you and your company cannot afford to be without.