Article originally posted on InfoQ. Visit InfoQ
Anderson: Welcome to this talk about optimizing your web frontend performance. How to Measure web performance is a broad, acknowledged topic. Over the years, while we’ve been cracking at it at Trainline, we’ve scored some wins, but also made a number of mistakes, many of which are common, even considered best practices in the industry, and which I’ll share with you towards the end.
My name is Carl Anderson. I’m the Director of Engineering for frontends at Trainline. That means that I’m responsible for our applications, iOS and Android, and our website’s APIs, pretty much everything that’s externally facing at Trainline.
Trainline’s Purpose and Vision
Trainline’s mission is to empower people to make greener travel choices, that is, hopping in trains rather than taking the car or the plane. To do that, we’re building the world’s number one rail platform, which basically takes form in a website and applications to which we’ve connected to over 270 rail and coach carriers. That’s in the UK, but it’s all across Europe, serving 45 countries. We got roughly 30,000 stations that we’re serving. To give you an idea of scale, pre-COVID, we had roughly 90 million visitors per month, and our apps were downloaded over 30 million times. That’s what we do. That’s for context.
Why Speed Matters
The first one is why speed matters. I think it obviously matters because it’s good karma. As a developer, I think none of us wake up in the morning, thinking, “I’m going to do a slow website.” As developers, we’re all striving for performance. We’re all striving to create some sleek, nice UIs and UX for users. Beyond that, from a personal standpoint, we all know that the internet consumes a lot of energy. Actually, today, powering the internet consumes more than the energy consumed by the UK. This electricity consumption actually has an enormous carbon footprint because all that electricity is not entirely green. The less energy we consume, the faster our product is, the less time the users use it, and the faster our server actually serves back the results to it, the lesser our footprint. That’s good karma. If we take a step back, and this ties back to our mission at Trainline, transportation is the fourth emitter of greenhouse gases in the world. If you look at that category, half of it is cars, 10% is airplanes, and trains is actually the greenest means of transportation in the world right now. The count is at less than 1%. The more we can get people to forget about air travel, or their car and actually commute by train, the better it is for the planet. We can do that at Trainline by just providing a really smooth, really sleek user experience that just makes using trains as easy as hopping in your car. That’s what we’re aiming for.
That’s not just good karma to create a fast and sleek website, what’s great is that it’s also aligned with your business goals. I think one of the first thing that we did when we started working actively on performance a couple years ago, is we started looking at different metrics. One of them was the time to first byte. That’s what you’re seeing in the graph right there. Do you see the green line and the trend, it’s indexed on the time that it takes, and the spike of the best step conversion on that page is roughly around 200 milliseconds. As your time to first byte gets longer to get your data across the wire, actually, your step conversion drastically reduces, and actually, it gets divided by two. Because here, at the beginning of the step conversion is at 40%, but if your time to first byte is around 2 seconds, you’re now down to 20%. People just don’t go, just don’t convert, just go back because your website is too slow.
You get the same idea here with another performance metric named the cumulative layout shift, CLS, which is basically the amount of change in your page. It’s a score. The amount of change in your page where, as things load, the layout changes and things move, which can be quite annoying as a user. We’ve all been faced with those moments where you’re trying to click on something, and that thing changes and/or moves, and then you end up clicking on something else, which is quite annoying. That’s what the cumulative layout shift score measures. Same with that one. It starts at roughly 6% step conversion, but as the cumulative layout score increases, which means that more things are moving on the page, you can see the step conversion drastically reducing as well. Creating a fast and great page experience in your product is just good for your business as well.
Very recently, because actually, this is happening right now, there’s the web vitals introduced by Google, which is a Google initiative. I think it was introduced probably last year. I think the difference is that this year, what they call the core web vitals is now becoming a ranking factor in the Google search, so that if two websites that provide the same relevance of information, one has a better page experience than the other one, is going to have a higher ranking in the Google search algorithm. This is actually an opportunity for those that have fast websites to actually gain in page ranking. It’s even more important more than other. This is actually rolling out right now, in May 2021.
How to Get Started
I think we all bought into the fact that we want to create fast websites. I think the first thing to get started with is that you need to measure. There’s a famous saying, I think it’s Peter Drucker, the management consultant, said, “If you can’t measure it, you can’t improve it.” Actually, you can even go further that whatever you measure, you improve, or, we optimize what we measure, is really the idea behind it.
Build a Baseline: Google Core Web Vitals
The first thing is to build a baseline, but building a baseline is super hard with web performance, because there’s not a single speed metric. It’s a mix of lots of different signals, and lots of different metrics actually got to go. I think if you’re starting on this journey, the best way to get started is obviously the Google Core Web Vitals, because they’re so important in your ranking algorithm at the minute, or it’s becoming increasingly important. It already was. Basically, the way Google’s structured it, again, is measuring the page experience. It’s a balancing act in between three pillars, loading, so how fast you’re loading your content, interactivity, and then visual stability. Because if you only take one of these, if you optimize for the load time, you can go super-fast to actually show content, but then it might not be interactive, because this is where the first input delay comes in is they’re going to measure when you tap or click on an item. If your CPU is still running and still processing other things, it’s not idle yet, it cannot take your interaction into account. It’s very annoying as a customer, you tap on the thing and it doesn’t take that into account.
Then the opposite, if you go for interactivity where you want to build everything and make sure that everything is ready before you actually paint anything on the screen, you end up having your customers just staring at a blank page for seconds, which is not a great experience. You need to find the right balance. The same with visual stability. It’s in between the two. You got the interactivity ready. You load all the content. If you keep moving things as the content pops in the page, because it’s always asynchronous, each and every individual component is actually interactive, but they pop and they move stuff, it’s hard for the user to actually start interacting with your page as it builds, which then becomes really annoying. Building a baseline with the Google Core Web Vitals is a great start to drive your page experience.
Find Your Own Metrics
The next step, I think, is you shouldn’t just stop there. They’re great to start, but there are lots of other metrics that you can use to drive great performance within your web application. An example that we’d already talked about was the time to first byte, a bit earlier, is the first byte that you send over the wire, you want that to be as fast as possible. If you look at that curve, you can see that the sweet spot seems to be around 200 milliseconds. If you look at the histogram, you want to shift that curve to that sweet spot of 200 milliseconds here. Same with the first contentful paint, so that’s the first thing that a user sees, so going from a blank screen to something. You want that something to be meaningful, first of all. Then, same, you want to get it where your step conversion is on that example, so you want to shift it all to the left. Measuring is the first key thing.
Then, I think what’s also very interesting with what we just talked about, and with those graphs is that, it doesn’t just measure performance and pure tech performance, it actually links it to business metrics, so you can see the impact. I think really the key thing, what we started in our journey was really about measuring all the things that we can measure, let’s link them to different business metrics, such as the conversion rates, such as the bounce rate. Then just really see, is there any correlation, and let’s find the right tech metrics that have a direct business impact, because those are the ones you want to go after. I think another quick tip that we’ve learned from going onto that journey and starting doing that is, there’s lots of different tools to actually measure your performance. They don’t all have the same definition of the same metric, or they measure it slightly differently. Really, I think compare apples with apples is really just try to use the same tool with everything so you can have that overlay over your business metric and your tech metrics. Then, it’s all comparable in between the different graphs that you have, because it all uses the same way of measuring a specific metric.
I think also what’s very important to record as much as possible are metrics, waterfalls, and do that continuously over time, so that you can see the trend. I think a key thing that we had at Trainline, when we started this, is we’re starting as well on those graphs, we started to set markers each time we were doing a release. You would see the graph, you would see the release, and you could see the curve then either go up or down so you can see the impact of what you just released. That was a very insightful thing, because then we would build our hypothesis. We’d go, “Let’s see what impact it will have.” That was also very interesting.
Create Your Own
We started with our own metrics, and the Google Core Web Vitals came in. That was quite interesting. Still, as we started going into optimizing for it, we actually realized that there’s still some gaps, because those metrics are generic and standard. Google needs a standard way to measure the performance of websites from various companies, but no one else but us knows better than know how our product works. I think this is what we started doing is that if you can see on the left, there’s quite a gap. That’s the waterfall of a web page loading. There’s quite a gap in between the early metrics and then the later metrics that you get when a web page gets constructed. What we started doing is we started increasing granularity with our own metrics as to the time to boot our React.JS application, the time for the single page application to be loaded, and time to a duration, and all those metrics. That was quite interesting, because there are all those times where as an engineer, you can actually input in to try various things to have an impact on those.
It’s Not About Speed, It’s About Users Experience
Then I’ll talk a little bit about home ready metrics. That was also a key finding for us was, again, nobody knows your product as well as you do. To measure your page experience the best that you can, what we soon started to do is identify with our product teams, what’s the key action of that page. Then, the tech teams, what we started doing is measuring when that action actually was ready. By ready, I mean when it’s ready to interact with, so it’s painted. The JS events are there, just ready for the user to consume it. That was a critical one, because then it gave us the actual moment when the page is just ready to go, and is what the user expects in there. This created the graph that you’re seeing here, is that then we have that booking flow, where we go from loading the application, like a hard navigation going when we load the single page application, all the way to when the home is ready, which basically means that the search widget is interactable. It can still mean that we’re still loading asynchronously in the background, extra things, but at least we can now search from London to Cambridge, for instance.
Then you type in your stuff. You hit search. You arrive on the search results page. Then that next marker is when all the results are actually displayed on the screen and ready to interact. Then that’s the next marker. Then we’re going to measure from that moment you click search to that moment that those are ready. That’s an extra wait time. We’re going to do that for every step of your flow. At the end, it allows us to build that graph that you’re seeing at the bottom of the screen, which is the total wait time that a user sees, or experiences, after going through a flow. Obviously, our goal is to make that as small as possible. It also actually highlights the fact that performance is a very cumulative sport. It’s not just frontend and size, not just backend, you want to have that entire stack to be as fast as possible.
Synthetics or RUM
At that stage, you’ve got a set of metrics that you’re monitoring to measure performance. We talked about Google Web Vitals, and other technical ones that are interesting to you, such as time to first byte, or first contentful paint. We’ve also talked about your own custom metrics to measure, really, the moment where a user can actually interact with the web page. Now, there’s, how do you monitor those? There’s always that debate of, shall we go synthetics, shall we go real user monitoring, RUM? Do we need both? The answer is definitely both, because they both serve a very specific purpose. Synthetics is lab data. It’s the one that you’ve got the waterfalls, you’ve got all your profiler information. It’s a control test environment. It’s very consistent, so you can run the same test over and over. You’re really testing your actual code in a controlled environment. You can see a trend, you can see progression through the updates that you’re doing in your code.
Real user monitoring is the key one, because that’s still data. That’s what users really experience, so that’s the end goal. The end goal is to improve and to create the best experience possible for these users. Those are great to use as KPIs. Then what you’re doing is that you’re using synthetics to test your hypothesis, to build a backlog, then you ship those. Then you’re monitoring in production. See, did it have the impact that I thought it was going to be? Really, you need a combination of both.
How to Measure
Now you have all these metrics, you’re measuring synthetics, you’re measuring in production. That’s a lot of data. How do you look at that data? I think, to that, the first thing, the first trap, don’t use averages ever, because it just paints a crooked picture of the world. If you compute an average, it will paint a picture of the world where all your requests run at the same speed. That’s obviously not true. I think one mistake that we’ve made at the very beginning when we started that journey of optimizing our web performance, we started at the 95th percentile to scrape off oddities or the outliers at the very end of the scope where they have network issues or connectivity issues. We said, let’s just drop the 5 last percent, but then let’s just focus on that value, which 95% of our customers are actually experiencing or below.
That was a mistake, retrospectively, because if you look at the median where most of our users are, and you have a significant degradation, that’s the example on the right, often, you won’t actually see anything in the 90th percentile. Because the value is so high compared to where the bulk of your users are, where actually you found out that it was better to have two separate buckets, and always look at your median. That’s what most of your customers are experiencing, and really monitor that space. Then have a separate one for the 90th percentile, because usually at the 90th percentile it’s so high, so slow compared to the median, that if you lose 1 second out of a 10-second wait, you sure got to look into it, but that’s not the top priority. Whereas if you’re losing, as in the example on the slide, 100 milliseconds, for half of your customers for the bulk of your requests, then you definitely should hop on to that one and identify what’s going on in there. Having a mix of the 50th and the 90th percentiles definitely was a big win for us to help us improve performance.
Key Learnings – Do’s and Don’ts
What we could do is talk about key learnings that we got on top of stuff, because right now you’re in a world where you’re monitoring all those metrics, you get the right signals, and not the noise, but then, how do you optimize? I think one of the first mistakes, what we’ve done on Trainline is, before we have the ready metrics, there’s a metric named time to interactive, which just sounds like the ready metric. That moment where the website is interactive for your users. That sounds great. Actually, in practice, that metric doesn’t really work. It does work and doesn’t work for optimization, because it doesn’t paint what the user actually experiences. That’s due to the fact of how it’s calculated. You’ve got the definition on the right. Actually, I like the original name of that metric, which was the time to consistently interactive, which is basically that moment where your website is interactive, and nothing else happens. It literally is idle. The trick with the TTI is that if ever there is an asynchronous operation happening in the background, where it’s actually still running, even though you don’t really care because your search widget is already there, loaded, and the customer is already able to interact with. It doesn’t really matter, actually. From that, this is how we went to dropping that metric, and actually going and focusing more into the page ready metrics.
Another key learning that we got from that journey was to not aim for 100% Lighthouse score. As a developer, that one is very tempting. You got that great lab tool that tells you, “You got to do this, and that. Then you’re going to be perfect from a performance perspective.” It’s very interesting. It’s old lab data in Lighthouse, so that might not be what your user experienced. Actually, you might have in production, some other pitfalls, or some other things that you can optimize that are actually of a higher value than going for the one in Lighthouse. In terms of prioritization, that’s not the best one as well. Then, also, it’s lab data. You’re really focusing on the first experience, because in the world, if you have a lot of repeated customers, for instance, that just come back to your website, they have a lot of things already cached in their browser, so it might not be as bad as Lighthouse paints.
Do the Obvious Stuff
Obviously, we focused on the good practice and how you set up your monitoring so that you pick up the signal and focus on the experience. Out of this, you’re going to pick up a lot of stuff. All these are actually fairly no-brainers. It’s just in between all these feature development. What I want to say and what I was trying to convey, is rather than giving one-size-fits-all solutions, just set the right monitoring, and then you’ll find those out yourself. Then there’s just the obvious stuff as always, that performance is cumulative. It’s pointless to have a great, fast, sleek website if you have a very slow backend. That’s what you want to tackle. If you can skip entire steps of a flow to optimize for specific users that have some details or travel options with their profile, you should definitely do so. This is where you’re going to shave off more wait time for your users, rather than just go full technical.
Size does matter. Look at all your assets. Really, don’t load what you don’t need to, I’m thinking all those third-party dependencies. Lazy loading is always a good thing. Then, when you load something, you really only load what you use. If you can split your bundles so you don’t have to load the entire application on your first page, because most of the users are actually not going to use the rest of the application, it’s always a win. I’ve just put a few best practices in there like connection pooling, caching. When I talk about caching, it’s everything pulled at the platform level, or within dev level, you optimize the payloads. All those basic, obvious stuff, definitely do so. Having the right monitoring is going to make sure that you keep the attention to all those little details that actually do matter in the end.
Then another one that was actually quite interesting is by setting this monitoring, sometimes in between releases, we get the spikes or this performance degradation, and we didn’t understand where it was coming from, and it was an advertising campaign, or it was the marketing team that introduced yet another tracking pixel with a synchronous external dependency on another server. Watch out for those. Definitely look at it. Have a process so that you’re working jointly with the marketing team to make sure that those don’t happen, because again, if you shave off a second out of there, and then they introduce a new tracking pixel that just wipes it off, it’s just such a bummer.
Don’t Focus on a Metric, Focus on User Experience
If there’s one key thing to take away, is just don’t focus on a single metric. A single speed metric does not exist. What we should be focusing on is really the user experience. To do that you’re going to need multiple metrics to find the right balance in between the load time, interactivity, layout shifts, but also making sure that the caches of each and every of your page becomes interactive as fast as possible.
What we’ve seen and talked about in here was about creating fast websites. It’s actually good karma for devs, users, and the planet. You should definitely spend some time having the right monitoring and focused to make it happen. Page experience is becoming a ranking factor in Google search, so it’s important to you, and it actually creates new opportunities for us. Start measuring your website performance because we optimize what we measure. There isn’t a single speed metric, find yours. It’s about finding balance. Then, I think the key one really, you have to collect data continuously, and link it to releases to be able to compare and iterate over your performance backlog, globally. The gist of it is focusing on the user experience. The way we did it and what worked best for us really was with the ready metrics. Then, in terms of how you monitor them, you need both synthetics and real user monitoring with a mix of 50th and 90th percentile is what worked best for Trainline.
Questions and Answers
Humble: Can you give us a high level overview of what Trainline’s architecture looks like?
Anderson: It’s a microservice architecture. We have about 350, 400 services in the background. Those can be of multiple languages, either .NET, or Ruby, mostly. Some Node.js as well. Then on top of that, we got various frontends consuming either .NET APIs that serve as an orchestration layer, or Node.js APIs doing the same work, than for React.JS single page applications or mobile applications.
Humble: Would you expect the method you were describing to optimize web performance, would you expect that to work with any architecture?
Anderson: Yes, absolutely. Here, what we talked about really was, I think, a recipe to measure performance in a frontend application, which is very different from a backend application where you can just look at the transaction time, the request time, the time for a request that comes in and out of the box, whereas here, in the frontend, there are numerous signals. We talked about, don’t just optimize for the load time. Having a fast and nice web experience is much more than that.
Humble: You mentioned you were using Lighthouse, but obviously not to optimize for 100% Lighthouse score. Is there other tooling that you’re using specifically around optimizing for the core web vitals ranking trend that’s coming in?
Anderson: Yes, absolutely. I think the key tool that we’re using is speed curve. The first graph that you’ve seen with the black background, and orange, that’s the speed curve that we’re using. What’s great about a speed curve is that it provides synthetics checks as well as LUX. It has a very nice waterfall. You can compare between two deployments, two versions of your websites with the waterfall, and you can get videos of how fast they load, and compare side by side. That’s actually quite nice. I really like that one. I think the other graphs that you’ve seen at the top, actually just came out from New Relic. It’s us having our own metrics and raising that through New Relic events, so that we can create those wait time graphs that we had. It’s a mix of standard tooling and in-house.
Humble: Could adding that much monitoring lead to performance issues in itself?
Humble: When you talk about caching as a good practice, what kinds of caching are you referring to?
Anderson: Here I was talking about caching generally. The challenge with caching is always the same thing, is, when do you invalidate your cache? You got to be always super careful with that one. It’s at all layers, basically. You got platform caching, where you cache your train on your CDN, so it doesn’t go straight to your server, so you improve your time to first byte. It’s also caching of data within your application. Whenever you can spare yourself a round trip to the backend, it’s always a good thing. Again, be careful with the invalidation. Because definitely, our users expect to have a real time and the correct price on the tickets, so, today, if they click on something and then turn out that it’s 20 quid more expensive later on, it’s not going to be a good experience. You got to be careful with those, because you cannot cache all the data. The more you do, and the more round trips you spare your users, the better it is for sure. It’s the same for assets and pretty much everything.
Humble: Do you have a sense of how much impact you think core web vitals will have in terms of a ranking factor? I remember the SSL one which I have sites that are still not running on SSL, and to be honest, I can’t measure the difference. I’m curious to know what you think of this one. Do you think it will actually have a significant impact?
Anderson: It’s hard to say. I don’t work at Google. I think the answer is, I don’t know yet. What’s pretty clear is that they are going to be rolling it out. I think it got postponed a little bit in June now. The rollout is going to be gradual. We’re going to see. My thinking is that it’s so impactful that they’re going to take it slowly. They’re probably going to start ramping up on English languages first, and then Europe usually comes a few weeks after. It’s usually two to three weeks behind. I think over the summer, we’re going to start seeing how impactful it is. I think it’s fair to say that I think the example that you took with HTTPS is probably a right one in the beginning. Then we’ll see, because it’s just one indicator out of many others. Relevancy is still going to be the top main thing that you expect as a ranking factor.
See more presentations with transcripts