A Three-Screen Perspective from the World’s Largest Retailer
A Webcast Q&A Follow-Up with Walmart’s Cliff Crocker & Aaron Kulick
Keynote Systems and Walmart Labs recently collaborated on a webinar, “What Retailers Need to Know About Site Performance in a Three-Screen World,” presented by Cliff Crocker, senior manager of performance and reliability, global e-commerce for Walmart Labs; Aaron Kulick, senior software engineer from Walmart Labs; and Ben Rushlo, director of performance consulting at Keynote Systems.
Three-screen site performance is an exciting and timely topic for anyone involved in the Web, whether that involvement includes retail or not. As might be expected, the audience had many questions — too many for the one-hour webinar. Cliff, Aaron and Ben agreed to do a follow-up session to answer the rest of the audience’s questions. This is the transcript of that follow-up session.
UPDATE: After the webinar and follow-up session took place, Cliff Crocker left Walmart to join another firm.
Benchmark: Question number one: What kind of requests does synthetic measure well? Script requests? Others?
Cliff Crocker: This is a great question and I think we probably all have some opinion about this. Synthetic is actually great for measuring anything over HTTP, to be quite honest. I think we find it pretty sensible to measure most any HTTP traffic that we do need to monitor.
Aaron Kulick: I’d have to agree. Specifically, one of the things I think script is targeting is user flow. If you have sufficient instrumentation of the Web event then you can actually get really good stitching. However, users may not follow the path you wish, and scripts which are very dialed in and exercise explicit functionalities are great for both availability and performance monitoring in a consistent environment. And synthetic will never be replaced for that particular piece.
Benchmark: Next question. Has Walmart considered using responsive design, where one interface works across all devices and resolutions?
Cliff Crocker: I think Aaron and I both have strong opinions on this one and we’ve absolutely considered it. Our new platform that we are designing and coming out with now is definitely well-positioned to be able to handle this, but for the current site we’re not using it.
I think it’s great. I think that it’s extremely extensible. It definitely helps serve many devices and many platforms on the client side. And I think we’re going to see more and more of it, so I’m actually really excited about that one. Aaron, I know you’ve actually used this.
Aaron Kulick: Yes, I agree. Use it, love it, wish everybody did, wish we already did. It would make our lives significantly simpler than having to maintain distinctly separate code bases and repositories, not to mention that our responsive design solution avoids all of the inherent difficulty of having to identify the agent before, at the server side, in order to serve them the specific page so as not to redirect them. Gone are the days of m-sites and mobile sites. Unified sites are totally the future.
Ben Rushlo: It sounds like this is nirvana for you. So what’s the downside? There’s a downside to every kind of technical architecture, so what’s the negative?
Aaron Kulick: Well, the downside is always going to be that, in a responsive design that’s truly browser-compliant, you can’t do as much as you would like to do for everybody. Or that you might be able to do when you target someone specifically, like Flash. Some of the libraries can auto-adjust, like little tiny issues like formatting.
Libraries are not perfect and you will find edge cases and you will find them quickly. I found them the moment I started using them. You have to be conscientious of those limitations and continuously be aware of them when you go about organizing and orchestrating your site. And I think many of them can be worked around.
But as you hinted there, it’s a compromise, so it makes no one happy. But if you’re looking for a best-of-breed, you can often find it in some of the responsive designs. I love Boilerplate. I love the Twitter Boilerplate implementation — it’s gorgeous and it works. It’s so easy to make it work. But on the flipside, in my own personal use, I get some like, oh, I used a big font here and it doesn’t display right when it’s on a phone because the shadow box doesn’t trap the whole text. You just have to know what those limitations are. You need to be prepared to handle those defects.
Ben Rushlo: Not to mention legacy browser support as well. We still absolutely have to think about that, and there’s still a lot that was not considered around IE 7, and even IE 8 in cases. They just don’t support this as well.
Aaron Kulick: Yes, Boilerplate has got some nice functionality to handle IE 6, but it’s not perfect and will never be perfect. And you have to still test for every device you intend to serve. Otherwise you won’t know where it’s breaking. Or you will when you find out you don’t have those customers anymore.
Benchmark: Is it sensitive to the touch interface on phones and tablets? Does it support swiping and touch sensitivity?
Aaron Kulick: It can. It depends on the sophistication of the library. The nice thing about Boilerplate is it’s extensible, so once it figures out you’ve got an enhanced agent or a capable one, you can extend it and say, oh, I know you have more stuff. So responsive design rolls it up nicely into progressive enhancement, which is the opposite of say, graceful degradation, which is, we’ll ship you everything and fall back to whatever we can use.
Instead, you can have a uniform code base and then extend upward into a particular vertical. It’s still, technically, designing to a particular device or a particular use case, but at least what you have is a generalized code base that provides consistency. There’s a huge thing to be said for being able to provide a uniform experience across multiple platforms.
Benchmark: Okay, question number three: For synthetic testing, do you find that WebPagetest offers more accurate results than other platforms, since its nodes are not located on backbones and include last-mile latency?
Ben Rushlo: I think first you have to understand the framework of what it means to do last-mile testing, or what it means to do performance testing. In terms of understanding design architecture issues, point-in-time measurements can be very accurate, but I think you have to first differentiate point-in-time versus ongoing, and what the value is from ongoing.
So the question kind of mixes the two, but we view WebPagetest — my team uses it at Keynote — as extremely useful. I love what they’re doing. It’s an amazing tool. Great for single point-in-time tests. I know Cliff uses it in a slightly different way, but it’s great for scheduling and running maybe competitive measurements, again point-in-time, one data point in a day.
And you could say that, HTTP-Watch, HP Analyzer, Firebug, Google Page Speed — all these tools which can give you a view of a page’s architecture, the waterfalls, integration of third-parties, application calls — they’re useful for that sort of thing.
The problem is when you get into understanding architecture issues that only occur over time such as, I might have an application call that slows down every day at 8 o’clock, or slows down under load, or in a month slows down when we do a mailer and people are coming to the site — which is what one of our customers is experiencing now — and operational issues which would have a big effect on performance management. How do you catch them with single-point-in-time measurements?
So first we have to say, what tools are good for single-point-in-time, what tools are good for ongoing, and I think tools like Keynote and our true competitors — which I wouldn’t put WebPagetest in that category — really are about collecting thousands and thousands of data points and the ability to slice and dice and be more granular at that level. I think this is actually quite useful compared to a single-point-in-time test which has some benefits.
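To illustrate why ongoing measurement catches what a single test cannot, here is a minimal Python sketch that groups repeated measurements by hour of day and flags hours that run slow. The function names, the 1.5x threshold, and the sample data are all made up for illustration; they are not taken from Keynote or any other product.

```python
from collections import defaultdict
from statistics import median

def median_by_hour(samples):
    """Group (hour_of_day, seconds) samples and return the median per hour."""
    buckets = defaultdict(list)
    for hour, seconds in samples:
        buckets[hour].append(seconds)
    return {hour: median(values) for hour, values in sorted(buckets.items())}

def slow_hours(samples, factor=1.5):
    """Return hours whose median exceeds `factor` times the overall median."""
    overall = median(seconds for _, seconds in samples)
    per_hour = median_by_hour(samples)
    return [hour for hour, value in per_hour.items() if value > factor * overall]

# Made-up data: three days of hourly measurements, slow only at 08:00.
samples = [(hour, 4.0 if hour == 8 else 1.0)
           for _day in range(3) for hour in range(24)]
print(slow_hours(samples))
```

A single test run at any hour other than 08:00 would report a healthy 1.0 second, which is the gap the speaker is describing.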
The other piece of the question says, well, they’re in last-mile nodes. Okay, Keynote has, Gomez has last-mile nodes, so that’s just a confusion of what the product space offers. All synthetic monitoring companies offer the ability to do last-mile in different ways — but with real end-user connections, DSL connections, end-user PCs potentially. So the question is, is it better to measure that way?
I personally think it’s interesting, but it provides a lot of additional variability and so, if you have a thousand dollars to spend, would I spend it on last-mile measurement? I wouldn’t and I would say that, I’ve seen customers do that and they aren’t very happy with the result, because what you end up doing is tracking issues with that cable provider, that DSL provider, whatever, and you end up chasing things that really have no effect on what you can control.
So from my perspective, first is, separate what is the point-in-time tool good for — which is really where I put WebPagetest, but Cliff and Aaron might disagree — and then, what is the value of ongoing measurement? They’re different. And then within ongoing measurement, last-mile versus backbone versus first-mile.
They all have their place. The question is, where do you want to focus and for us, in doing this for quite a long time, our value really comes I think from getting out on the Internet, which is backbone, but removing as much variability as you can, which is not doing last-mile for your primary measurement, I think is kind of the sweet spot for the work that we do, at least.
Cliff Crocker: Yes, I agree with Ben on most all those points. I think that it’s definitely a great point-in-time tool. We are big fans and big supporters and definitely use the tool a lot internally and do it for more of our competitive benchmarking and things like that.
There are great APIs, there are great ways to schedule a test with WebPagetest, whether you’re using a private instance or the public instances and have an API key or something to actually run these tests on. But I would defer more back to Ben, and I think what a lot of the synthetic tools that Keynote and the competitors in that space offer is more of the clean lab, predictable, consistent environment which makes it more attractive.
But having said that I know Aaron has got a really good point on WebPagetest as well.
Aaron Kulick: I agree completely with everything you’ve said and before everybody runs out and bombards Patrick Meenan with an API request, realize real fast that those API keys are few and precious. You’re never going to get an API key that will give you sufficient volume to be able to do as much as we want. And Patrick is a great guy and he does this because he loves it, and it’s fortunate that he has Google behind him to help provide additional support for it. But it’s still a free service and there’s no way that he can handle everybody’s monitoring requests sufficiently in order to use it as anything other than a point-in-time solution, as you point out. One of the things that I would also point out about WebPagetest, specifically to the last-mile component that the questioner specifically calls out, is that I can tell you very factually and very truly that that is not actually always true.
Specifically, like with the Dulles instance case — the primary node for WebPagetest — the Dulles instance is actually located on a FIOS line in a basement in a home which, while technically last mile, has throughput and bandwidth available to it that far outstrip a true DSL connection.
When you are using WebPagetest and you select DSL or one of the different or custom throughput models, you’re actually imposing a physical delay via ipfw, which is a Unix-type firewall which actually does packet munging and simply spin-loops the packet for 50 milliseconds and then says, ‘my total throughput available to you is one and a half megabits per second down and Number of Kbits per second up.’ So you have to be very conscientious of the fact that the DSL requirement is an artificial construct. It’s not actually DSL, it’s simulated DSL. This is true for almost all of the nodes in the WebPagetest pool, because they’re actually, usually, traditionally served from either virtual machines or in datacenters, or in corporate offices that have large amounts of bandwidth, which they’re donating. And I think that it’s unfair to necessarily say that their last mile is superior to a different last mile.
It’s a simulated last mile and it helps you get a last mile or what might be a DSL-like view, and it’s very useful to know when you’re saturating your connection when you’re trying to profile for performance, because many times you do, and that can be extremely telling when you are basically trying to push a lot of data down the DSL pipe. It can be very helpful. But at the same time, most of the ‘last-mile view’ that you’re seeing here are actually artificial constructs in the case of WebPagetest. They’re simulated last mile.
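As a rough illustration of what a shaped connection implies, the Python sketch below estimates fetch time from an added delay and a bandwidth cap. The 50 ms delay and 1.5 Mbps downstream figure come from the discussion above; the model itself, the function name, and the page size and round-trip count are assumptions chosen only to make the arithmetic visible.

```python
def shaped_transfer_seconds(payload_bytes, down_bits_per_sec,
                            added_latency_sec, round_trips=1):
    """Rough time to fetch a payload over a bandwidth-shaped link.

    Simplified model: each round trip pays the added latency once, and the
    payload then drains at the capped downstream rate. It ignores TCP slow
    start, packet loss, and parallel connections.
    """
    latency_cost = round_trips * added_latency_sec
    transfer_cost = (payload_bytes * 8) / down_bits_per_sec
    return latency_cost + transfer_cost

# Figures mentioned in the discussion: 50 ms of delay, 1.5 Mbps down.
DSL_DOWN_BPS = 1_500_000
DSL_DELAY_SEC = 0.050

# Hypothetical page: 750,000 bytes fetched with 10 round trips.
estimate = shaped_transfer_seconds(750_000, DSL_DOWN_BPS, DSL_DELAY_SEC,
                                   round_trips=10)
print(f"{estimate:.2f} s")
```

In this toy case the bandwidth cap accounts for 4 of the 4.5 seconds, which is the kind of saturation the speaker says a simulated DSL profile helps reveal.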
Ben Rushlo: I think that’s an interesting point. I assumed they were using some bandwidth-shaping model, which sounds like what you’re saying.
Aaron Kulick: It’s exactly what they’re doing.
Ben Rushlo: And that’s interesting for people to understand. Again, that’s great in a point-in-time scenario. It’s great for sort of lab experimentation, as you said — mess around with that. But it’s not going to be completely accurate. It’s not going to have all the nuances of taking a real measurement over DSL — thousands of data points and two weeks, and then understanding the real nuances of DSL, what happens when the DSL terminates with co-lo and then that gets overloaded.
It’s knowing where your tools play, and we would never say at Keynote, ‘dissuade people from using WebPagetest,’ or Firebug or the 10,000 tools — well, there’s probably more like five or ten — that people use. You’ve got to know what you’re using it for, and I think that’s something that is an education in the industry.
You can’t replace ongoing synthetic monitoring with WebPagetest, just like you can’t replace RUM with Keynote synthetic monitoring, and you have to know how each plays and then what you do with each of them in terms of informing your business and technology decisions.
I think the question’s great in terms of, it helps us talk about that, which really more people should be talking about, basically.
Benchmark: Our next question from the audience is, what technology and methodology do you use for real-user monitoring or ‘RUM,’ and what is the typical RUM test case?
Aaron Kulick: The current implementation that we’re using is actually based on Boomerang JS, which was built by Philip Tellis while he was at Yahoo. It was open-sourced at Velocity about three years ago. He basically wrote it and open-sourced it in the same breath. And there are one or two others out there, Episodes is one.
But the second half of the question — what is the typical RUM test case? I would argue that RUM becomes desirable in any ‘two’ scenarios. In other words, B2C, C2C, B2B — if you have a customer with a browser on the other end who is using it in something other than a simple API which can be timed, when it’s not server-to-server communication and you can’t instrument the call, RUM has the use case. But at the same time it won’t fill in all your gaps.
Cliff Crocker: A typical RUM test case would be — with us, we actually use RUM and we tag everything possible, right. So we aim to have 100 percent of our traffic tagged in the interim. And whether or not we actually get all of the beacons back for all of our traffic — typically we do. We’re estimating that we get between 96 and 98 percent of beacons back for all our traffic. So I would say it’s less of a test case and more of, hey we tag everything. We want to monitor everything. We use this to drive our analytics as well. So our implementation is more widespread — 100 percent of traffic instead of a sample.
Aaron Kulick: Test case also is just a curious term. After you have RUM and you’ve instrumented your site, you’ll find out what are representative cases inside it. But until you have it, you won’t really have much visibility into your customer patterns and their actual experience, until that’s in place.
Ben Rushlo: It can be that the person asking this question is on this Keynote webinar and coming with the Keynote mindset of synthetic test cases, where we think about user journeys and paths and pages on a retail site — you pick one product, you pick one search term. And so RUM is a whole paradigm shift. RUM is about — if you do it right — it’s about tagging everything, it’s about measuring timers on every product page, on every permutation of search.
Aaron Kulick: I agree completely. I would also point out that RUM will drive you crazy, primarily because it’s going to generate a deluge of data, and you have to dig through it. Otherwise it’s just actually going to create noise and will distract you if you’re not prepared to consume it. And because RUM can’t really do technical instrumentation of the page itself, you’ll see outliers and you’ll wonder was that me, was that them, was it the Internet, was it a guy with a backhoe in the middle of the freeway?
You know, you can end up second guessing yourself aggressively. You have to be prepared for statistics and lots of them.
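A small Python sketch of the kind of statistics this implies: it summarizes a set of load times with the mean, the median, and the 95th percentile. The sample values are invented, and the choice of these three statistics is an assumption for illustration, not a description of Walmart's pipeline.

```python
from statistics import mean, median, quantiles

def summarize(load_times):
    """Return mean, median, and 95th percentile for a list of load times."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = quantiles(load_times, n=100, method="inclusive")[94]
    return {"mean": mean(load_times), "median": median(load_times), "p95": p95}

# Made-up beacons: most pages load in about 2 s, one outlier takes 60 s.
load_times = [2.0] * 99 + [60.0]
summary = summarize(load_times)
print(summary)
```

With this data the single 60-second outlier pulls the mean above the typical experience while the median stays at 2 seconds, which is one reason a lone outlier should not trigger a hunt for the backhoe.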
Benchmark: Are the tags you are using for RUM similar to Google Analytics, that is, added to the header?
Aaron Kulick: Yes, it is very similar to Google Analytics. In fact, our RUM snippet is loaded very similarly to the Google Analytics snippet, which is to say it’s asynchronously called. Right now we have, I would almost say, two pieces of RUM in the header. One is, we start a timer that has no regard for RUM itself. It’s just so that we know when we started the page load almost immediately after the HTML stanza starts.
And then we actually load the library later and do a call back to know when it’s available. So yes, it is in the header. Do we always have to load it in the header? No, you could easily push it down to the bottom. But since we choose to load it in an async manner, we’d like it to be available as soon as possible, but still deferring to the user experience as a priority.
So, it tries to be as well-behaved as Google Analytics and in most cases that is 100 percent true.
Benchmark: Who pressed for implementing real user monitoring and how was it justified?
Cliff Crocker: I’ll take credit for that. When I first came to Walmart, it was something that I was really dying to do, and Ben knows this as well around my past, just really wanting to get to a different level of performance monitoring and understanding user experience and client-side performance. I’ll say that I didn’t have to push super hard.
When I got here, finding the right team and finding the right people to help me get it started and get implemented was probably the biggest challenge, but once we actually got moving with it, we were able to move very fast, because we had great business partners and great executive sponsorships that saw the value, and that knew that fast was important, that knew that speed was a feature, and the only way we were going to start to be able to quantify that was with the real user monitoring solution.
So the justification was fairly easy when you’ve got the business on board and actually behind you and pushing for it. At the end of the day, engineering is going to play ball as well. So we were able to move pretty fast in a pretty large organization and get it pushed out in a pretty short manner.
I think it paid off in spades. I think we know our customer a lot more than we did nine months ago, and truly are getting some very granular understanding, which has given us a very big generation skip in terms of how we do performance monitoring in e-commerce.
Benchmark: One more question about RUM. What are some of the open source tools available for real-user monitoring?
Cliff Crocker: Boomerang JS. Episodes is another one that I know was around either before or right at the time that Boomerang came out.
Aaron Kulick: That was Souders.
Cliff Crocker: Souders, yes.
Aaron Kulick: He had actually written one beforehand that is open source, but I know it’s orphanware, but it has been extended and there are actually people using it.
Cliff Crocker: Yes, the other thing a lot of people do is just simply put a start timer at the beginning of their page and a stop timer — a very generic implementation of RUM. This can be extremely useful, just to have that start tag at the start of your HTML and just set a timer.
And now obviously, it’s not even that it’s open source, but with the Navigation Timing API, you’ve got all that there in the actual browser, so you’re able to pull those measurements out without any RUM implementation if you want to just measure based on navigation timings which, while I think it’s a great start, I don’t think that it captures 100 percent of the traffic, and 100 percent of the use cases that we want. Anyway I think people are rolling their own, I think also your Boomerang and Episodes are existing and have support and have extensibility and pretty wide adoption.
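To make the navigation timing idea concrete, here is a minimal Python sketch of server-side processing for a beacon that carries raw Navigation Timing timestamps. The attribute names (`navigationStart`, `responseStart`, `loadEventEnd`, and so on) are the ones defined by the W3C Navigation Timing specification; the output labels such as `page_load_ms`, the beacon layout, and the sample values are assumptions made up for this example.

```python
def derive_timings(nav):
    """Turn raw Navigation Timing timestamps (ms since epoch) into durations."""
    start = nav["navigationStart"]
    return {
        "dns_ms": nav["domainLookupEnd"] - nav["domainLookupStart"],
        "connect_ms": nav["connectEnd"] - nav["connectStart"],
        "time_to_first_byte_ms": nav["responseStart"] - start,
        "dom_content_loaded_ms": nav["domContentLoadedEventEnd"] - start,
        "page_load_ms": nav["loadEventEnd"] - start,
    }

# Hypothetical beacon payload with small made-up timestamps.
beacon = {
    "navigationStart": 1000,
    "domainLookupStart": 1005,
    "domainLookupEnd": 1025,
    "connectStart": 1025,
    "connectEnd": 1065,
    "responseStart": 1200,
    "domContentLoadedEventEnd": 1900,
    "loadEventEnd": 2600,
}
timings = derive_timings(beacon)
print(timings)
```

Browsers that do not expose these attributes send nothing, which is consistent with the speaker's caution that navigation timing alone does not cover all traffic.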
Aaron Kulick: One other use case or available solution — Google Analytics, up until recently, also had what was a limited RUM solution, through which they gave you anonymized timing data up to a certain number of queries per month or requests per day. They have since lifted that cap, but again, that is data that Google controls and it’s close to them and it’s only available through them and you have to mine through them. So not truly an open source solution in that sense.
Benchmark: The next question is a few questions in one, about data. Keynote delivers a lot of data. How are you mining it? How are you benchmarking it? What tools do you use to represent the data for distribution to internal stakeholders?
Cliff Crocker: I’ll be honest with you and say that Keynote does deliver a lot of data and they’ve got some next-generation APIs that they’re coming out with that I think are going to make it easier to consume and start to pull that data down. And while we do use some of the data feed pushed out from Keynote, we don’t use it enough. What I would say to take this a little bit further, though, is just to talk about data in general, and where we pull performance data in, how are we mining it, how are we storing it, and what are we using to visualize that data?
I can answer those questions, but I think that we have plans to start leveraging more of the synthetic data that we pull in and use it a little bit more effectively. Typically, you know, we like to use Hadoop for a long-term store and also MapReduce to really run over larger data sets.
In the case of synthetic monitoring, it’s not always a big data-type scenario because really, you’re not looking at as many data points as RUM. And I’d say in terms of tools to re-represent the data, we’re big fans of visualization and Aaron typically shows me a tool a week that I like. We do POCs and use cases against these tools to see which ones are a great fit.
And there’s a lot of other great visualization tools out there including the Google Charts API, which we’ve used and still use quite a bit. I’d say on the other side of that, there are some other tools, and I won’t go into vendor specifics here, but there are some other tools that you can use to do visualization on top of very large data sets that’s more throwaway — not long-term, not a lot of development, but more point-and-click and drag-and-drop, create a graph and see if it looks good.
Aaron Kulick: And a lot of that was iterative. Specifically to the data that Keynote delivers, a lot of the data that we’re very, very interested in is actually some of the things that are uniquely provided by synthetic — in particular, things that we want to be able to compare release-to-release, day-to-day for general consistency, they become gates for QA and other things as well.
And so we extend them to production logically as well — rules like how many objects were downloaded, how many requests, how many connections, DOM elements, the number of elements on the page (usually being not the same), how long they took, et cetera, et cetera. There’s a tremendous amount of data, and to Cliff’s point, it’s one of those things that we want to have more of, because there are some very key points that we would like to be able to synthesize.
Cliff Crocker: It would be great to see long-term trending of third-party performance and be able to rehash back over that data over months and years. You can get some good stuff from the portal itself and from the Keynote portal around component-level stuff by domain. But it’s limited to however long Keynote holds the data. We want to keep data. We want to keep everything forever.
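One way to picture long-term trending of third-party performance is a simple roll-up by month and domain over data you keep yourself. The Python sketch below is only an illustration: the record layout, the `example.com` domains, the dates, and the timings are all hypothetical, and it does not reflect the Keynote data feed format.

```python
from collections import defaultdict
from statistics import median

def monthly_median_by_domain(records):
    """records: iterable of (iso_date, domain, seconds).

    Returns {(month, domain): median seconds}, with month as "YYYY-MM".
    """
    buckets = defaultdict(list)
    for iso_date, domain, seconds in records:
        month = iso_date[:7]  # keep only the "YYYY-MM" prefix
        buckets[(month, domain)].append(seconds)
    return {key: median(values) for key, values in sorted(buckets.items())}

# Made-up component timings for two hypothetical third-party domains.
records = [
    ("2012-01-03", "tags.example.com", 0.20),
    ("2012-01-17", "tags.example.com", 0.30),
    ("2012-02-02", "tags.example.com", 0.60),
    ("2012-02-20", "tags.example.com", 0.80),
    ("2012-01-03", "cdn.example.com", 0.10),
    ("2012-02-02", "cdn.example.com", 0.10),
]
trend = monthly_median_by_domain(records)
for (month, domain), value in trend.items():
    print(month, domain, round(value, 2))
```

In this invented data the tag domain roughly triples from one month to the next while the CDN stays flat, a pattern that only shows up if the older data is still around to compare against.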
Ben Rushlo: I think what people who are listening or reading this need to understand also is that Walmart is a very advanced customer. If you look across Keynote’s customers and put them in a distribution, Walmart is at the one percentile or at the 99th depending on what side you want to be on, Aaron and Cliff. So what I would say about the first piece is — data visualization and how you consume Keynote data is an interesting question, because data can just become a mass of confusion. So whether it’s Google Analytics data, RUM data, Keynote data or god forbid, our competitor’s data, whatever, WebPagetest data, you can amass piles of data and you can apply pretty charts and graphs on top, which you see customers are doing, and then you don’t know what the heck to do with it.
You don’t know the context of it, you don’t know how to fix it, you’ve kind of created — not Walmart — but I have customers that have created very sophisticated reporting and SLA management and it’s kind of all meaningless and they don’t understand, first of all, how to apply that to Internet data, which has a very unique distribution, and they don’t know what metrics to use.
Then beyond that, they don’t know how to answer questions with it, which is really what we should be doing with all this data. As much as I like playing with data and visualizing data and trying new programs, we really need to be answering business questions, which are:
- what do I invest,
- how do I improve,
- what do I need my development team to do,
- how do I make user experience better?
That’s really what Keynote data should do, RUM data should do, Google Analytics data should do, et cetera. That’s how I would also challenge the listener or reader: before you start saying, well our data is limited in terms of visualization or some other tool’s data is limited in terms of visualization, so let’s spend a lot of money applying visualization to it — the first thing you should say is, what business question am I trying to answer, and how am I deriving that?
And if I don’t have the expertise to do it, there’s lots of teams you can partner with, including mine, that can help you with that. But I think that’s really also the challenge, is that people have to jump to visualization without really understanding what am I trying to actually answer and how do I do that. OK, that’s my rant on that point.
Aaron Kulick: In regard to your specific comment there — hear, hear and bravo! If you’re worried about visualization and you can’t find actionable tasks or motivators or create questions that have explicit answers that will require work from just the Keynote scatter plot, then you don’t need to do any of the stuff we’re talking about.
You should leverage what is right there in front of you. We use and look at Keynote scatter plots all the time. Just because we have fancy pretty dashboards to go with a lot of our data does not mean it’s a replacement for the visualization component that already exists there. They help broaden our audience, they help enhance our audience, and they help us ask new questions to be able to find new answers.
It’s about the evolving question, like you said. If you can’t do something with your data, you’re wasting time on data. Work instead on something of value.
Benchmark: Next question: How have you fixed problems with Transaction Timed Out error or Page Completion Condition Changed errors?
Ben Rushlo: So this is one of the negatives about synthetics. The more accurate your synthetic measurement is, in terms of the higher up on the stack it interacts with the site — we interact at the application layer with Keynote, we’re actually clicking in the page and that sort of thing — and as pages have gotten more complex, where they’re doing lazy loading, they’re doing async loading, they’re loading third-party tags, they’re firing lots of things after the onload event — you basically have a situation where it’s not as easy to measure pages or especially transactions — sets of pages together — as it was ten years ago certainly, or even five years ago.
And so the questioner is getting to the point about, wow, we probably have some measurements and they’re hard to maintain. We get these false positives. And my answer is, that’s true. And you need to have a way of approaching that, whether it’s with your team building up that expertise, the customer themselves, or working with Keynote. But there is the reality of, if you want very highly accurate measurements, that those measurements especially in the synthetic space, can be fragile, and they can be fragile because the site changes, because the third-party tags are changing, because content’s changing. And as I say it, it sounds very doom and gloom like, how do you ever use it?
The reality is, if you apply the right approaches, you can stabilize it, you can get out that one or two percent noise, you can do some interesting things to exclude it, you can use some things to make the script more resilient.
So there’s hope. But I understand the pain, and so if someone who is listening to or reading this is hearing that, and that was your question, what do I do? Reach out to me directly, reach out to Keynote and we can help you work through that pain. But it comes from the fact that you’re using highly accurate, interactive data that is working at a high layer on the site, and because of that can be somewhat sensitive or somewhat brittle.
Cliff Crocker: The only thing I’d follow up there, Ben, and I think you said it all perfectly, but the thing I’d add there is, don’t always assume it’s a script issue, right — that tends to be the tendency.
Those are real errors sometimes, so there’s definitely times where it’s not a problem and people just resort to, oh it’s broken, it’s broken, it’s the script, it’s the script. You know, that’s why we are trying to measure and that’s why we get all the data behind the failures and the screen captures and everything else to actually validate that they are in fact errors, and frankly, in a lot of cases, they are.
Ben Rushlo: Yes, great point. I sort of assumed the worst case but you’re right, you should start with understanding, are those even script fragility issues? And if they are, then apply what I was talking about. And if not, yes you need to definitely look at what’s going on.
Aaron Kulick: I would also add that a lot of the reasons why synthetic testing has a little bit of, as you referred to, brittleness or fragility, is because of the complexity of the navigation itself. And that begs an immediate question: Why is it so hard to script? If it’s so hard to script, ask yourself first, did I make it harder than it needs to be to navigate because I’ve decided to build my entire site out of Flash navigation?
To be fair, there are sometimes where I looked at some of these issues before, both here and at other sites, where it’s like, we have built ourselves a problem. It’s beautiful, it’s gorgeous. We have no way to test it. Thank you.
Benchmark: Okay, next question. The user is looking for some specific advice. My problem is we don't have a live environment and the pre-production environment is 1:32 scaled down to live; very difficult to do meaningful performance testing. Any suggestions?
Aaron Kulick: I can speak to this one specifically. I don’t think I’ve worked anywhere where I’ve had a QA environment or a performance environment or pre-production environment that was 100 percent of live, except when I was working for the government.
That is a very real problem. I think one of the big things you need to do is be able to first have consistency in that environment — this person has an actual ratio, which is really awesome. In some cases, you can’t be that specific. But what you need to do is have a baseline and be comparative.
You need to know how long it takes to run the same request in prod versus how long it takes to run in your test environment, and you need to know that you’re running the same code in the same way with the same people and the same devices. It’s merely a matter of having a baseline.
And one of the other things is, when you have a scaled-down pre-production or QA environment, even an unreliable one or one that is in no way representative of live, you can still extract a lot of meaningful performance data from the things that would come from synthetic monitoring, like element counts, connections and responses. They’re very simple gates that you can use to block release of bad code: ‘hey, you’ve made this page N bytes bigger, or added this many more requests, and yet it does nothing for functionality,’ or ‘well, that’s a great feature, but you added half a meg.’ Great. Have a nice day.
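The kind of release gate Aaron describes could be sketched roughly as follows. This is a minimal illustration, not Walmart's tooling: the `PageStats` type, the `budget_violations` function and the 5 percent and two-request tolerances are all hypothetical, and the byte and request figures would come from whatever synthetic measurement is already in place.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PageStats:
    """Counts that do not depend on how big the test environment is."""
    total_bytes: int
    request_count: int


def budget_violations(baseline: PageStats, candidate: PageStats,
                      max_byte_growth: float = 0.05,
                      max_new_requests: int = 2) -> List[str]:
    """Compare a new build against the previous build's baseline.

    The tolerances are illustrative assumptions, not recommended values.
    """
    problems = []
    byte_limit = baseline.total_bytes * (1 + max_byte_growth)
    if candidate.total_bytes > byte_limit:
        added = candidate.total_bytes - baseline.total_bytes
        problems.append(f"page grew by {added} bytes")
    request_limit = baseline.request_count + max_new_requests
    if candidate.request_count > request_limit:
        added = candidate.request_count - baseline.request_count
        problems.append(f"page makes {added} more requests")
    return problems


if __name__ == "__main__":
    previous = PageStats(total_bytes=1_000_000, request_count=40)
    current = PageStats(total_bytes=1_500_000, request_count=41)
    for problem in budget_violations(previous, current):
        print("BLOCK RELEASE:", problem)
```

Because the check only compares counts from build to build, it gives the same answer in a 1:32 environment as it would in production.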
Cliff Crocker: I think Aaron is exactly right. Having a good baseline, having consistency and baselining from release-to-release or from build-to-build is really what you need to do in that pre-production environment. So it doesn’t actually have to scale.
However, this is more of a scalability question than a page speed question, which is mostly what we’re talking about here in this forum. And I would say, and our CTO will hate me for saying this, that there’s really nothing like production, to be honest.
I think that a lot of people, retailers included, that have to deal with the flash crowds on Black Friday and have to deal with very large events that drive traffic to the site, need to test in production.
They need to test everything that’s out there including the network, network compliance, the firewalls, and all your carrier networks and all the other things that you just can’t simulate well in a staged environment.
So I guess that’s more of a callout for, where you can and if you can, test in production or test with real live traffic.
Aaron Kulick: Or percentages of production. That’s what A-B tests are for.
Cliff Crocker: Yes, exactly.
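One common way to send only a percentage of production traffic to a test is to hash a stable user identifier into a bucket. The sketch below assumes that approach; the speakers do not say how Walmart assigns A-B groups, and the function name `in_test_group` and its parameters are hypothetical.

```python
import hashlib


def in_test_group(user_id: str, experiment: str, percent: float) -> bool:
    """Return True if this user falls in the first `percent` of traffic.

    Hashing the experiment name together with the user id keeps the
    assignment stable across requests and independent across experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    exposed = sum(in_test_group(u, "new-checkout", 5) for u in users)
    print(f"{exposed} of {len(users)} users see the new code path")
```

The same user always lands in the same group, so a slow or broken experience stays confined to a known slice of real traffic.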
Benchmark: Very good. Next question: What does Walmart consider acceptable page load time for desktops, smartphones and tablets?
Cliff Crocker: I think the easy answer to this one is ‘faster than it was yesterday.’ We have a moving target that’s always getting faster. And if we’re not meeting that, then we’re outside of SLA. And we do readjust our SLAs on a month-to-month basis or release-to-release basis, with the objective of always being faster.
I’m not going to bite on that one and say how we went from eight seconds to three seconds to two seconds and what’s really acceptable. I’ll just say that it’s not fast enough, and it needs to always be faster, specifically if you look at smartphones and tablets, and industry experience that you’re getting there, and the users’ expectation that that’s going to be or should be as fast as FIOS in terms of the desktop. So it’s a moving target. It’s never good enough and that’s why we still have jobs.
Aaron Kulick: I can’t agree more. The way I understood what our acceptable page load times were was the first day we measured. That became the ceiling and we never looked back at it. And it just keeps crawling down toward the floor. And the moment you start measuring, that becomes your ceiling and from then on you should always be looking to go faster, faster, faster.
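The "first measurement becomes the ceiling" idea can be expressed as a ratchet that only ever moves down. This is a sketch under assumptions: the function name `track_releases` is hypothetical, one load-time figure per release is assumed, and the numbers are made up for illustration.

```python
from typing import List, Optional, Tuple


def track_releases(load_times: List[float]) -> Tuple[Optional[float], List[Tuple[int, float]]]:
    """Walk through per-release load times in order.

    The first measurement sets the ceiling. A faster release lowers the
    ceiling; a slower one is reported as a regression and leaves it unchanged.
    Returns the final ceiling and a list of (release_index, load_time) regressions.
    """
    ceiling: Optional[float] = None
    regressions = []
    for index, seconds in enumerate(load_times):
        if ceiling is None or seconds <= ceiling:
            ceiling = seconds
        else:
            regressions.append((index, seconds))
    return ceiling, regressions


if __name__ == "__main__":
    ceiling, regressions = track_releases([4.2, 3.8, 4.0, 3.5])
    print("current ceiling:", ceiling)
    print("releases outside SLA:", regressions)
```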
Ben Rushlo: I think that’s great. I wish that everyone had that perspective. What we hear from most of our customers is, well, five seconds seems reasonable — it’s like the targets are based on ad hoc, subjective understanding.
The way you guys are doing it at Walmart is actually excellent. ‘Hey, here’s where we started. Let’s just drive it down. And as we add functionality, add releases and add whatever, we’re still going to drive it down.’ And that to me is the right way to do performance management. I think a lot of people, especially business people but I think even technology people, get kind of sucked into ‘well, anecdotally, I don’t think five seconds is slow, I don’t think three seconds is slow.’
They also sometimes use wrong data. So they look at their data using the average, and say ‘well that’s fine.’ But when you guys showed that presentation, you guys were looking at the 95th percentile, which I think is great.
You want to help all users, not just the average customer. So yes, you need to set realistic targets based on the context of other folks in the industry, based on understanding the shape of the data, and making sure you’re looking at all customers. I think your idea is great, and I actually want to start using it: ‘wherever you are now, a year from now you should be 10 percent faster, 20 percent faster, including all the additional functionality that you’ve added.’ That’s great. That’s what needs to happen for some of these sites to really improve.
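A small worked example shows why the average can hide the slow experiences that the 95th percentile exposes. The sample data is invented for illustration, and `percentile` here is a hypothetical helper using the nearest-rank method, which is one of several common percentile definitions.

```python
import math
import statistics
from typing import List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Eighteen fast page loads and two very slow ones, in seconds.
    load_times = [1.8] * 18 + [9.0, 12.0]
    print("mean:", round(statistics.mean(load_times), 2))
    print("95th percentile:", percentile(load_times, 95))
```

With this data the mean stays under three seconds while the 95th percentile is nine seconds, so a target based on the average would look fine even though some users wait far longer.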
Cliff Crocker: I think the other thing is that, if you can quantify the relationship between speed and the conversion ratio, which we’re doing and which a lot of sites are starting to do with their analytics or with RUM, that tells you. If conversion drops by one percent when the page goes from three seconds to four seconds, then that tells you, quantifiably, that four seconds is too slow.
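Quantifying speed against conversion can start with something as simple as grouping sessions by load time. The sketch below assumes each RUM or analytics record reduces to a (load time in seconds, converted or not) pair; that record shape, the function name `conversion_by_load_time` and the sample sessions are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def conversion_by_load_time(sessions: Iterable[Tuple[float, bool]],
                            bucket_seconds: float = 1.0) -> Dict[float, float]:
    """Group sessions into load-time buckets and return the conversion rate per bucket.

    Keys are the lower edge of each bucket, in seconds.
    """
    totals = defaultdict(int)
    converted = defaultdict(int)
    for load_time, did_convert in sessions:
        bucket = int(load_time // bucket_seconds)
        totals[bucket] += 1
        if did_convert:
            converted[bucket] += 1
    return {bucket * bucket_seconds: converted[bucket] / totals[bucket]
            for bucket in sorted(totals)}


if __name__ == "__main__":
    sample = [(2.4, True), (2.9, False), (3.1, False),
              (3.7, False), (3.2, True), (3.9, False)]
    for lower_edge, rate in conversion_by_load_time(sample).items():
        print(f"{lower_edge:.0f}s to {lower_edge + 1:.0f}s: {rate:.0%} converted")
```

A real analysis would need far more sessions per bucket before any difference between buckets could be trusted.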
And also, as you alluded to, benchmarking against your competitors and just understanding where they’re at. Or, if you’re like us, or you’re like many companies that have many sites, benchmark against each other and have a wall of shame — who’s faster, who’s slower and use that to drive it competitively as well.
Ben Rushlo: I like that wall of shame. That’s good.
Aaron Kulick: Our internal index for that is actually called ‘Wall of Shame.’ It’s in the database as Wall of Shame.
Benchmark: When you’re looking at tablets, are you viewing them as primarily a Wi-Fi device or a wireless device? Are you looking at a typical Wi-Fi connection or are you measuring that as a 3G device? Or do you look at it both ways?
Cliff Crocker: I think we look at it both ways, but we assume the worst. We assume the worst possible bandwidth. What do you think, Aaron?
Aaron Kulick: I would say both ways. It’s pretty straightforward. We know that they come in both ways, but the breadth of our data is sufficient that we know that some of our lesser experiences, or poor experiences, are definitely attributable to the fact that carriers are just not as fast when compared to the desktop. So yes, both ways.
Benchmark: How do you simulate device and network when doing smartphone/tablet performance testing?
Cliff Crocker: I think you simulate it, but you also don’t simulate it. I think you try to use real devices and that gets really interesting. There’s some great stuff happening in this area right now for mobile devices. We are using what is now an open source project around MobiTest — it’s built on top of WebPagetest. We use that, we use Keynote’s more simulated Mobile Device Perspective or no, I’m sorry — what’s the other one?
Ben Rushlo: Mobile Web Perspective.
Cliff Crocker: Thank you, Mobile Web Perspective. For us, it’s less of a simulation and a lot of times, unfortunately, we have to stick to more Wi-Fi networks to do this testing. But when we want to test over an actual carrier, we do that as well. So it’s less of a simulation and more of what’s really happening on a real device.
Aaron Kulick: I would also add that, while it’s hard to do and the expenditure may not be acceptable to everyone, there was a great blog article by one of the larger names in the Web performance community who basically said that for about $5K you can get a representative set of devices on a couple of different contracts, enough to sufficiently represent the broader spectrum of footprints out there: an iPhone, an iPad, an Android, an Android tablet.
It wasn’t ridiculously expensive, and if you’ve got a large site you probably spend that on a server and it doesn’t seem all that much. The other thing is, don’t hesitate to simulate. I’m a big fan of connecting your smartphone or tablet or whatever to a Wi-Fi connection, which in turn goes through a traffic-shaping router so that you can actually simulate carrier performance using a device without being on contract. Don’t hesitate to do that. I encourage it. It’s a great habit.
And you can probably go around your office and ask people for devices to borrow. It’s a better reason to encourage your users, who should also be your engineers, to use their own dog food, because if they won’t use it on a particular platform, you have a problem because they’re a savvy audience and most people will get frustrated. I personally get very frustrated on mobile devices because the carrier performance is abysmal.
Ben Rushlo: And let me just give a quick shout-out to Keynote DeviceAnywhere, which was DeviceAnywhere when we purchased them. And Aaron’s idea of going out and buying phones, I like that idea because I’d like to have about 10 phones in my office. But the problem is that phones change so much over time. Even though we’re consolidating into a few major platforms, the Android space is still changing.
A company like DeviceAnywhere is worth looking at, because what it allows you to do as a QA tester is check out endless numbers of real devices on real carrier connections, and they’re always updating their device list. They’re basically doing what Aaron’s suggesting, but for a larger audience. You can pay even less than five thousand dollars to check out a device and make sure that your site runs, and you can actually move up the chain and do scheduled QA testing and a bunch of other things.
But I think it’s an interesting idea. With the mobile space, it’s no longer test it in Chrome, test it in Firefox, test a few different OSs. It’s test it across all these different devices, and it becomes much more complicated. So yes, buying I think is great, but obviously a service like Keynote DeviceAnywhere might be another solution where you can leverage that piece.
Aaron Kulick: Yes, I didn’t mean to imply it was trivial. If you’re a developer and you have some QA engineers, you can probably justify the expense, and you can probably just go around the office and beg or borrow people’s old devices when they upgrade. It’s like ‘dude, I want your device when you break it.’ Or ‘you cracked the screen, we’ll buy it.’ But you make an excellent point. I’m a huge fan of the DeviceAnywhere product. I introduced it at every place I’ve been. There are other services like it, too, and they’re very useful and allow you to bridge the gap if you don’t want to play a technology race with the ever-evolving mobile industry.
Benchmark: Any closing comments or general advice for the audience?
Cliff Crocker: Somebody had a great question about which screen is the most important to start measuring, and where do you start monitoring if you can’t afford all three? I think you start somewhere: wherever you see the majority of your traffic, whatever niche you actually want to go after and target. The important thing is just to get started and not to throw your hands up or fix yourself into one specific solution. It’s really about the right tool for the job. I think that, while we’re huge fans of RUM, there’s obviously a great place for synthetic as well as simulated and even just pure availability monitoring.
So build yourself a quiver and start from there. Otherwise, I think that as technology changes and as the development community continues to get more and more creative, just trying to stay on top of this is interesting. It’s definitely a moving space, where the leaders have yet to be decided in this long run in terms of e-commerce.
Aaron Kulick: I would add that, with respect to three screens, Cliff’s comment is actually very true. The other thing I’d like to address is, don’t be forced into inaction or stalled by data. Don’t be intimidated by it, but also don’t worry about it if you don’t have it. Start with whatever you have and work from there.
Specific to Cliff’s comment of ‘build a quiver’: RUM is not for everybody, and it is a non-trivial exercise. It is simple but it is also very complex, like anything else. And if you can’t ask the right question, then you can’t do anything with that data.
Don’t worry about trying to be like us or like anybody or worry about what somebody else is doing. Instead, worry about identifying, as Cliff said, your customer, what they want, and what they do and improving their experience, and over time, you will eventually evolve to where you need to be.
I worry sometimes that we push this agenda that, oh, you’ve got to hold data and you’ve got to have it forever. I would argue, what are you going to do with it two years from now except maybe do some trend-based comparisons? Other than that, you can be blinded by it. It works just like snow. You can’t see through a lot of it.
Ben Rushlo: My first parting comment is, it’s very rare to find three people that are actually so on the same page, and we didn’t plan that. That’s pretty exciting because I totally agree with what Cliff and Aaron just said.
For me, I think the biggest parting thought is to make performance a business metric, make performance part of the culture, and then everything else will follow. So what tools to use, as they said, is going to be different for every client. How much you spend is going to be different for every client. How much you invest in RUM versus synthetic versus mobile versus tablet is going to be different for every client. What tools you end up with in terms of Keynote versus our competitors — all of that’s going to be different.
But the clients that we have seen being successful, and I would say Walmart’s one of those, are clients that, as you can tell from listening to Cliff and Aaron, care deeply about performance. They’re passionate about performance. And it’s not just one guy, but someone at a high enough level that there’s a culture of performance that spans operations, development, and business. Performance is a business metric. It’s tied to business metrics. Admittedly, it’s very challenging to get a large organization that doesn’t have that culture already to change. But if you can work on that as the overarching thing, and then bring in tools and people and processes as it makes sense, I think that’s what makes people successful. That’s really the key for us: begin with the culture, and make performance something you’re passionate about at all levels of the business. And things will change.
Benchmark: That sounds like a great wrap. Thank you very much for your time.