Article originally posted on InfoQ.
Transcript
Zeigen: My name is Jenna Zeigen. This talk is, several components are rendering. I am a staff software engineer at Slack, on our client performance infrastructure team. The team is only about 2 years old at this point. I was one of the founding members. I’ve been working on performance at Slack full time for a little bit longer than that. Before I was on the client performance infrastructure team, I was on Slack’s search team, where I worked a lot on the desktop autocomplete experience that you may know and love. It was on that team that I really cut my teeth doing render performance and JavaScript runtime performance, since that feature does more work on the frontend than you would expect an autocomplete to do, and it has to do it in a very short amount of time. I had a lot of fun doing that and decided that performance was the thing for me.
Performance
What is performance? In short, we want to make the app go fast, reduce latency, have buttery smooth interactions, get rid of jank, no dropped frames. There’s all sorts of goals for performance work. As I like to say, my number one performance rule about how to make performance happen is to do less work. It all comes down to no matter what your strategy, you’re trying to do less work. Then the why, which is really why I wanted to have this slide. It seemed like there needed to be more words on it. The why is so that our users have a great experience. It’s really easy to get bogged down when you’re doing performance work in the numbers. You want that number of milliseconds to be a smaller number of milliseconds. You want the graph to go in the right direction. It’s important to keep in mind that we are doing performance work, because we want our users to have a good experience. In Slack’s case, we want channel switches to be fast. We want typing and input to feel smooth. We don’t want there to be input delay. It should feel instantaneous. Keep that in mind as we go through this talk and try to stay rooted in that idea of why we are doing all of this and why I’m talking about this.
Slack (React App on Desktop)
First, some stuff about Slack. Slack, at its core is a channel-based collaboration tool. There’s a lot of text. There’s also a lot of images, and files, and now video conferencing. The Slack desktop app isn’t native, it’s an Electron app, which means it’s a web application like you would have in a browser being rendered by a special Chromium instance via the Electron framework. This means the Slack desktop app is the same application that you would use in Chrome, or Firefox, or Safari, or your browser of choice. It’s using good-old HTML, CSS, and JavaScript, just like in a browser. This means that we’re also subject to the same performance constraints and challenges as we face when we are coding frontends for browsers.
How Do Browsers Even? (JavaScript is Single Threaded)
Now I’m going to talk a little bit about browsers. What’s going on inside of a browser? One of the jobs of a browser is to convert the code that we send it over the wire into pixels on a page. It does this by creating some trees and turning those trees into different trees. It’s going to take the HTML and it’s going to turn it into the DOM tree, the document object model. It’s also going to do something similar to the CSS. Then, by their powers combined, you get the render tree. Then the browser is going to take the render tree and go through three more phases. First, the layout phase: we still need to figure out where all of the elements are going to go on the page and how big they’re supposed to be. Then we need to paint those things: the painting phase represents them as pixels on the screen as a series of layers, which then get sent to the GPU for compositing, or smooshing all the layers together. The browser will try its best to do this 60 times per second, provided there’s something that has changed that it needs to animate. We’re trying for 60 frames per second, or 16.66 milliseconds, and 16 milliseconds is a magic number in frontend performance. Try and keep that number in mind as we go through these slides.
These 60 frames per second only happen in the most perfect of conditions, for you see, renders are constrained by the speed of your JavaScript. JavaScript is single threaded, running on the browser’s main thread along with all of the repainting, layout, and compositing that has to happen. Everything that gets called in JavaScript in the browser is going to get thrown onto the stack. Synchronous calls are going to go right on, and async callbacks like event handlers, input handlers, and click handlers are going to get thrown into a callback queue. Then they get moved to that stack by the event loop once all the synchronous JavaScript is done happening. There’s also that render queue that’s trying to get stuff done, 60 frames per second, but renders can’t happen if there’s anything still on the JavaScript call stack. To put it differently, the browser won’t complete a repaint while there’s any JavaScript still left to run.
That means that if your JavaScript takes longer than 16 milliseconds to run, you could potentially end up with dropped frames, or laggy inputs if the browser has something to animate, or if you’re trying to type into an input.
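To make that concrete, here is a small sketch, runnable in Node or a browser console, of how a long synchronous task delays anything queued behind it, the same way it would delay a pending repaint. The 50-millisecond figure is just an example duration, not a measurement from Slack:

```typescript
// Burn the single thread synchronously; nothing else can run meanwhile.
function busyWork(ms: number): void {
  const start = Date.now();
  while (Date.now() - start < ms) {
    // spin
  }
}

function demo(onLate: (delayMs: number) => void): void {
  const queuedAt = Date.now();
  // This callback waits in the callback queue until the stack is empty...
  setTimeout(() => onLate(Date.now() - queuedAt), 0);
  // ...because this synchronous work holds the main thread for ~50 ms,
  // just as it would hold up a pending repaint in the browser.
  busyWork(50);
}
```

Even though the timeout was scheduled for 0 milliseconds, the callback reports a delay of 50 milliseconds or more, because the event loop could not get to it sooner.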
Performance, a UX Perspective
Performance is about user experience. Let’s take it back to that. Google’s done a lot of research, as they do, on browsers and user experience. They’ve come up with a model of user experience called the RAIL model. This work was also informed by Jakob Nielsen and some of his research on how users perceive delay. According to the RAIL model, you want to, R, respond to users’ actions within 100 milliseconds, or your users are going to start feeling the lag. This means that practically, you need to process actions within 50 milliseconds to give time for other work. The browser has a lot of stuff to do. It’s a busy girl. You got to give it some breathing room on either side to get your JavaScript done and get all the work done that you ask it to do. On the animation frame side, the A in RAIL is for animation, you need to produce that animation frame in 16 milliseconds, that magic 16 milliseconds, or you could end up dropping frames and blocking the loop, and animations could start to feel choppy. This practically means that you need to get all that setup done in 10 milliseconds, since the browser needs about 6 milliseconds to actually render the frame.
Frontend Performance
I would be remiss if I didn’t take this slight detour. A few years ago, I was reading this book called, “The Every Computer Performance Book.” It said that, in the author’s experience, the code is rarely rewritten unless the inefficiency is egregious and the fix is easy and obvious. The alternative presented in this book was to move the code to a faster machine, or split the code over several machines. We’re talking about the client here. We’re talking about people’s laptops. We don’t have that luxury. That’s just simply a nonstarter for frontend. Unlike on the backend, we don’t have control over the computers that our code is running on. We can’t mail our users laptops that are up to spec. People can have anything from the most souped-up M2 all the way down to a 2-core machine, with who knows what other software running on that computer, competing for resources, especially if it’s a corporate owned laptop. We still got to get our software to perform well, no matter what. That’s part of the thrill of frontend performance.
React and Redux 101
I mentioned earlier that Slack is a React app, so some details about React. React is a popular, well maintained, and easy to use component-based UI framework. It promotes modularity by letting engineers write their markup and JavaScript side by side. It’s used across a wide variety of applications, from toy apps through enterprise software like Slack, since it allows you to iterate and scale quickly. Its popularity also means that it’s well documented and there’s solid developer tooling and a lot of libraries that we can bring in if we need to. Six years ago, when Slack was going through a huge rewrite and rearchitecture, it was like the thing to choose. Some details about React that are going to come in handy to know: components get data as props, or they can store data in component state. As you see here, the avatar component gets person and size as props. You can see in the second code block there, it’s receiving Taylor Swift and 1989 as some props. There’s not an example here of storing component state, but that’s also another way that it can deal with its data. Then, a crucial detail, a core bit about React, is that changes to props are going to cause components to re-render. When a component says, ok, one of my props is different, it’s going to then re-render so it can redisplay the updated data to you, the user.
In a large application, like Slack, this fragmented type of storing data in a component itself, or even just purely passing data down via props, could get unwieldy. A central datastore is quite appealing. We decided to use a state management library called Redux, that’s a popular companion to React and is used to supplement component state. Instead, there’s a central store that components can connect to. Then data is read from Redux via selectors, which aid in computing connected props. A component can connect to Redux. You see that useSelector example there on the code block. We passed it the prop of ID and the component is using that ID prop to then say, Redux, give me that person by ID. That is a connected prop making avatar now a connected component.
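As a rough sketch of that selector pattern — the state shape and names here are invented for illustration, not Slack’s actual store:

```typescript
// Hypothetical state shape; a real Slack store is far larger.
interface Person { id: string; name: string }
interface AppState { people: Record<string, Person> }

// A selector derives data from the central store for a connected prop.
const selectPersonById = (state: AppState, id: string): Person | undefined =>
  state.people[id];

// Inside a component, react-redux's useSelector wires this up, roughly:
//   const person = useSelector((state) => selectPersonById(state, id));

const state: AppState = {
  people: { U1: { id: "U1", name: "Taylor Swift" } },
};
```

The component never touches the store’s internals directly; it only asks a selector for what it needs, which is what makes the component “connected.”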
Let’s explain this with a diagram. You have Redux in the middle, it’s the central datastore. Then there are a whole bunch of connected components that are reminiscent of Slack. Actions are going to get dispatched to Redux, which causes reducers to run, which causes the Redux state to get updated. Dispatches are the result of interacting with the app or receiving information over the wire, like from an API or over the WebSocket. Actions will continue to get dispatched, which, again, updates Redux. Then, when Redux changes, it sends out a notification to all the components that subscribe to it. I like to call this the Redux bat signal. Redux will send out its bat signal to all of the components that are connected to it. Then, everything that’s connected to Redux, every single component, is going to rerun all of its connections. All of the connected components are going to recalculate, to see if any of them have updated. There is one caveat: it will only do this if it has verified that state has changed. That’s at least one thing. Then, a central tenet of React: components with changed props will re-render. Again, if a component thinks that its data is now different, it will re-render. Here’s a different view: actions cause reducers to run, which then updates the store. The store then sends out the subscriber notification to the components, which then re-render. Then actions can be sent from components or over the wire via API handlers. This process, this loop, this Redux loop as I like to call it, is going to happen every single time there is a dispatch. Every single time Redux gets updated, that whole thing happens.
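The loop can be sketched with a toy store. This is not Redux’s actual implementation, just the dispatch-reducer-notify shape described above:

```typescript
type Action = { type: string; payload?: unknown };
type Reducer<S> = (state: S, action: Action) => S;

// A toy store showing the loop: dispatch -> reducer -> notify every subscriber.
function createToyStore<S>(reducer: Reducer<S>, initial: S) {
  let state = initial;
  const subscribers: Array<() => void> = [];
  return {
    getState: () => state,
    subscribe: (fn: () => void) => { subscribers.push(fn); },
    dispatch: (action: Action) => {
      const next = reducer(state, action);
      if (next !== state) {                // only notify if state actually changed
        state = next;
        subscribers.forEach((fn) => fn()); // the "bat signal"
      }
    },
  };
}
```

Note that every subscriber runs on every state change — which is why, at hundreds of dispatches per second, this gets expensive.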
You might start to see how this could go wrong and start to cause performance issues. On top of that, we were seeing that Redux loops were just taking way too long. Unsurprisingly, even at rest — you don’t even have to be doing anything, you could have your hands off the keyboard, with the app just receiving notifications and such over the WebSocket or via API — even at p50, we were seeing the Redux loop take 25 milliseconds, which is more than 16. We know that we’re already potentially dropping at least one frame at least 50% of the time; that’s what p50 means. Then at p99, so 1% of the time, we are taking more than 200 milliseconds — 210 milliseconds, in fact — to do all of this work, which means we’ve doubled the threshold at which humans are able to tell that something is going wrong. We’re going to start to drop frames. If users try to type into the input, they’re going to feel it.
What did we do? Like any good performance team, we profiled. The classic story here is you profile, you find the worst offending function, the thing that’s taking up the most amount of time. You rinse, repeat until your app is faster. In this case, what we had here was a classic death by a thousand cuts. You might say, there’s those little yellow ones, and that purple one. The yellow ones are garbage collections. The purple one is, we did something that caused the browser to be a little bit mad at us, we’ll just say that it’s a recalculation of styles. Otherwise, it’s just this pile of papercuts. We had to take some novel approaches to figuring out how to take the bulk out of this. Because there wasn’t anything in particular that was taking a long time, it was just a lot of stuff.
How can we, again, make less work happen during every loop? We took a step back and figured out where performance was breaking down. Ultimately, it comes down to three main categories of things that echo the stages of the loop. One, every change to Redux results in a Redux subscriber notification firing. That’s the core problem with Redux. Two, we spend way too long running selectors. There are a lot of components on the page, they all have a lot of connections, and too much work is happening on every loop just checking to see if we need to re-render. Three, we are spending too long re-rendering components that decide that they need to re-render, often unnecessarily. The first thing is too many dispatches. For large enterprises, we can be dispatching hundreds of times per second. If you’re in a chatty channel, maybe with a bot in it that talks a lot, you can be receiving a lot every second. Redux out of the box is just going to keep saying, dispatch, update all the components, dispatch, update all the components. That just means a lot of updates. Every API call, WebSocket event, any click, switching channels, sending messages, receiving messages, reacjis — everything in Slack: Redux, Redux, Redux. This leads to, again, every connection running every time Redux notifies. Practically — we ran some ad hoc logging that we would never put in production — it’s 5,000 to 25,000 connected props in every loop. This is just how Redux works, and this is why scaling it is starting to get to us. Even at 5,000, if every check takes 0.1 milliseconds, that’s 500 milliseconds — we’ve blown way through 50 milliseconds, and 50 milliseconds is a long task. Once you get to 50 milliseconds, the browser performance APIs, if you hook into them, will tell you that maybe you should make that a shorter task. Yes, just, again, way too much work happening.
Then all of this extra work is leading to even more extra work. Because, as I said, we’re having unnecessary re-renders, which is a common problem in React land, but we just have a lot of them. This happens because components are receiving or calculating props that fail equality checks but are deep-equal, or that are unstable. This can happen, for example, if you calculate a prop via map, filter, or reduce, because what you get from a selector right out of the Redux store isn’t exactly what you need. Say you want to filter out everyone who isn’t the current member from this list of members. If you run a map, as you might know, that returns a new array. Every single time you do it, that is a new array that is going to fail reference equality checks. That means the component thinks something is different and it needs to re-render. Bad for performance. There are all sorts of variations of this type of issue happening. Basically, components think that data is different when it actually isn’t.
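Here is a minimal illustration of that reference-equality trap, with invented member data:

```typescript
interface Member { id: string; name: string }

const members: Member[] = [
  { id: "U1", name: "Jenna" },
  { id: "U2", name: "Taylor" },
];

// Deriving a connected prop with .filter returns a brand-new array on
// every Redux loop, even when the underlying data hasn't changed at all.
const selectOtherMembers = (all: Member[], currentId: string): Member[] =>
  all.filter((m) => m.id !== currentId);

const first = selectOtherMembers(members, "U1");
const second = selectOtherMembers(members, "U1");

// Deep-equal, but not reference-equal -> React thinks the prop changed.
const looksTheSame = JSON.stringify(first) === JSON.stringify(second);
const isTheSameReference = first === second;
```

The two results contain identical data, yet fail the `===` check a memoized component relies on, so the component re-renders for nothing.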
Improving Performance by Doing Less Work
How are we making this better? Actually doing less work. There are two attacks that we’ve been taking here. First, we’re going to target some problem components. There are components that we know are contributing to this pile of papercuts more than others. Also, we know that these mistakes are happening everywhere, so we need to take a more broad-spectrum approach as well. First, targeting problem components. There’s one in particular that I’m going to talk about. What do you think is the most performance-offending component in Slack? It is on the screen right now. It’s not messages. It’s not the search bar. It is the channel sidebar. We had a hunch about this. If you look at it, it doesn’t look that complicated, but we had a hunch that it might be a naughty component. Through some natural experiments, we discovered that neutralizing the sidebar — removing it from being rendered — alleviated the problems for people who were having particularly bad performance. Kind of a surprise. Naively, at face value, the sidebar looks like a simple list of channels: some icons, maybe some section headings, a channel name, and then maybe another icon on the right. Yet it was taking 115 milliseconds. This was me; I put my sidebar into bad mode, which was showing all of the conversations. Usually I have it on unreads only — performance tip, have your sidebar in unreads only — so to make it bad, I made my sidebar a lot longer. We found that there’s a lot of React and Redux work happening. This was a bit of a surprise to me. I knew the sidebar was bad, but I thought it was going to be the calculating-what-to-render part that would stick out, not all this React and Redux work. Calculating what to render was taking 15 milliseconds, which is almost 16 milliseconds. Either way, this is not fun for anyone.
There was definitely some extra work happening in that first green section, the React and Redux stuff.
Again, lots of selectors. We found through that same ad hoc logging that folks who had 20,000 selector calls dropped to 2,000 when we got rid of their sidebar. That is a 90% reduction. That made us realize there are some opportunities there. This is mainly because inefficiencies in lists compound quickly. There are 40 connected prop calculations in every sidebar item component — the canonical channel name with the icon — and that doesn’t even count all of the child components of that connected channel.
Forty times, if you have 400 things in your sidebar, that’s 16,000. A lot of work to try and dig into. We found that a lot of those selector calls were unnecessary. Surprise — isn’t it revolutionary that we were doing work that we didn’t need to do? One of my specific pet peeves, which is why it’s on this slide, is checking experiments — whether someone had a feature flag on or didn’t. Instead of doing that once at the list level, we were doing it in every single connected channel component. Maybe that was a reasonable thing for them to do at the time; maybe the experiment had something to do with showing the tooltip on a channel under some certain condition. But we didn’t need to be checking the experiment in every single one.
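A hypothetical sketch of that fix — the flag-checking function and channel list are made up, but the shape of the change is hoisting one lookup out of 400 item renders:

```typescript
// Hypothetical feature-flag lookup; we count calls to show the difference.
let experimentChecks = 0;
function isTooltipExperimentOn(): boolean {
  experimentChecks++;
  return true;
}

const channelNames = Array.from({ length: 400 }, (_, i) => `channel-${i}`);

// Before: every sidebar item checks the flag itself -> 400 lookups per render.
channelNames.map((name) => ({ name, tooltip: isTooltipExperimentOn() }));
const checksPerItem = experimentChecks;

// After: check once at the list level and pass the result down -> 1 lookup.
experimentChecks = 0;
const tooltipOn = isTooltipExperimentOn();
channelNames.map((name) => ({ name, tooltip: tooltipOn }));
const checksHoisted = experimentChecks;
```

Same rendered output, 399 fewer lookups per Redux loop — one papercut class removed across the whole list.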
Then, also, there were some cases where we were calculating data that was irrelevant to that type of channel. For instance, if you have the pref on to show whether someone’s typing in a DM in your sidebar — we only do that for DMs, it has nothing to do with channels. Yet we would go and grab that pref and check whether a person had it on, even though it was a public channel and we were never going to use that data. We were just going to drop it on the floor. Surprise: we moved repeated work up a level where we could, calling it once instead of 400 times, and created more specialized components. Then we found some props that were unused. All of this was fairly banal work — I’m not standing up here saying we did some amazing revolutionary stuff — but it ended up creating a 30% improvement in sidebar performance, which I thought was pretty rad. We didn’t even touch that 15-millisecond bar on the right. Then, ok, but how did that impact overall Redux subscriber notification time? It made a sizable impact, over 10% across the board over the month that we were doing it. That was pretty neat to see: just focusing on a particular component that we knew was bad had such a noticeable impact. People, even anecdotally, were saying that the performance was feeling better.
What’s Next: List Virtualization
We’re not done. We want to try revirtualizing the sidebar. This technique — only rendering what’s going to be on the screen, with a little bit of buffering to allow for smooth scrolling — actually has a tradeoff with scroll speed. There are some issues that you see if you scroll too quickly. At the time we decided virtualization wasn’t worth it and chose to favor scroll speed. Now we’re seeing that maybe we took the wrong side of the tradeoff, so we want to try turning list virtualization back on. List virtualization will be good for React and Redux performance, because fewer components being rendered means fewer connected props being calculated on every Redux loop — there are fewer components on the page trying to figure out if they need to re-render.
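The heart of list virtualization is a windowing calculation like this sketch. Fixed-height rows and a small buffer are assumed; a real implementation (including whatever Slack ships) handles much more, like variable heights and scroll anchoring:

```typescript
// Given scroll position, compute which rows to actually render,
// plus a small buffer above and below for smoother scrolling.
function visibleRange(
  scrollTop: number,
  viewportHeight: number,
  rowHeight: number,
  rowCount: number,
  buffer = 3,
): { start: number; end: number } {
  const firstVisible = Math.floor(scrollTop / rowHeight);
  const visibleCount = Math.ceil(viewportHeight / rowHeight);
  return {
    start: Math.max(0, firstVisible - buffer),
    end: Math.min(rowCount, firstVisible + visibleCount + buffer),
  };
}
```

With 400 channels at 26-pixel rows in a 780-pixel viewport, only about 33 to 36 rows get mounted instead of 400, so only those rows run their connected prop calculations on each Redux loop.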
What’s Next: State Shapes and Storage
Another thing that we want to try, which really targets that 15-millisecond section that we didn’t touch with this work, is to figure out if we can store our data closer to the shape that we need for the sidebar. We store data like it’s the backend, which is reasonable. You get a list of channels over the wire, and you just put it into Redux that way, then munge it however you need for the use case that you in particular have. Engineers also tend to be afraid of storage. I think it’s a reasonable fear people have: “No, memory. If I store this thing with 100 entries in it, we might start to feel it in the memory footprint.” When in fact, memory isn’t the constrained side of the tradeoff at that point. We have this question of how we can store data so it serves our UI better, so we aren’t spending so much time deriving the data that we need on every Redux loop. For example, why are we recalculating what should be shown in every channel section on every loop? Also, we store every single channel that you come across. This might be the fault of autocomplete — I’m not blaming myself or anything. We say, give me all the channels that match this particular query, and then you get a pile of channels back over the wire. They’re not all channels that you’re in, and we store them anyway. Then, to make your channel sidebar, we have to iterate through all of the channels that you have in Redux, while really the only ones we’re ever going to need for your channel sidebar are the ones that you’re actually in. Little fun tidbits like that.
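A toy illustration of that reshaping, with hypothetical channel data: maintain a membership index at write time, so the sidebar read becomes a cheap lookup instead of a scan over every cached channel on every loop:

```typescript
interface Channel { id: string; name: string; isMember: boolean }

// Before: one big map of everything we've ever seen (autocomplete included),
// so building the sidebar means scanning the whole cache.
const allChannels: Record<string, Channel> = {
  C1: { id: "C1", name: "general", isMember: true },
  C2: { id: "C2", name: "random", isMember: false }, // cached by a search, never joined
  C3: { id: "C3", name: "perf", isMember: true },
};

const sidebarByScan = Object.values(allChannels).filter((c) => c.isMember);

// After: keep an index of joined channel ids, updated when channels are
// written, so reads skip everything the sidebar will never show.
const memberChannelIds: string[] = ["C1", "C3"];
const sidebarByIndex = memberChannelIds.map((id) => allChannels[id]);
```

Both produce the same sidebar, but the indexed read does work proportional to the channels you are in, not to everything autocomplete ever fetched.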
Solutions: Batched Updates, Codemods, and Using Redux Less
As I said, just focusing on problem components isn’t going to solve the whole problem. We have a scaling problem with Redux, and we need to figure out what to do about it. It’s diffuse, so it’s everywhere — across components well beyond the sidebar. We need some broader-spectrum solutions. The first one, which seems really intuitive — in fact, React 18 does this by default — is to batch updates. Out of the box, every single time a dispatch happens and a reducer runs, Redux is going to send out that bat signal to all of the components. Instead, we added a little bit of code to batch the updates. We’ve wrapped a call to a half-secret React DOM API that flushes the updates.
We wrap this in a requestAnimationFrame call, which is our way of saying: browser, right before you’re about to do all that work to figure out where all the pixels have to go on the screen, run this code. It’s a browser-aware way of debouncing the subscriber notification, essentially. I like to joke that this is called the batch signal. Another thing that I think is pretty cool that we’re doing is codemods for performance. Performance problems can be detected and fixed at the AST level. We found that we can detect when there are static unstable props. If someone is creating a static object that they then pass as a prop to a child component, you can pull that out so it’s not getting remade on every single loop. Similarly, we could rewrite prop calculation for the selectors in a way that facilitates memoization. Another example is replacing empty values. An empty array does not reference-equal another empty array, or an empty object another empty object, for that matter. We can replace them with constants that will always have reference equality.
We’re also trying to use Redux less. You might be wondering when this was going to come up. We are investigating using IndexedDB to store items that we evict from the in-memory cache. Less data in Redux means fewer loops spent keeping the items in the cache fresh. Every time something gets stale, we need to get its update over the wire, which causes a dispatch, which causes the subscriber notification. Cache eviction is fun, but we could also just not store stuff in Redux that we’re never going to use again. Finer-grained subscription would be cool, but it’s harder than it sounds. It would be great if we could say, this component only cares about data in the channel store, so let’s get it to subscribe only to the channel store. With Redux, it’s all or nothing.
Why React and Redux, Still?
Why are we still using Redux? This is a question that we’ve been asking a lot over the past year. We’ve started to tinker with proofs of concept and found that, yes, this toy thing we made with finer-grained subscription, or with a totally different model of subscription than Redux, is a lot faster. But scale is our problem to begin with. Why are we still sticking with React and Redux at this point? React is popular, well maintained, and easy to use. It has thus far served us pretty well, letting our ever-growing team — 200 frontend engineers at this point — build features with confidence and velocity. We chose a system that fits our engineering culture. It’s people friendly. On the flip side, it allows people to remain agnostic of the internals of the framework, which for the most part works for your average, everyday feature work. This is breaking down at scale as we push the limits. These problems are systemic and architectural. We’re faced with this question: either we change the architecture, or we put in the time to fix what we have now with mitigation and education. We could put in all the effort of choosing a new framework, or write our own thing that solves the problems that we have. That too would be a multiyear project for the team to undertake. We’d have to stop everything we’re doing and switch to another thing if we really wanted to commit to this. There is no silver bullet out there. Changing everything would be a huge lift with 200 frontend engineers and counting. We want our associate engineers to be able to understand things. We want them to be able to find documentation and help themselves. Redux prefers consistency and approachability over performance. That’s the bottom line.
Every other architecture makes different tradeoffs, and we can’t be sure how anything else would break down at scale either, once we started to see the emergent properties of whatever architecture we chose.
Fighting a Problem of Scale at Scale
Where are we now? People love performance. No engineer says, I don’t want my app to be fast. They want the things that they write to work fast. We’re engineers; I think it’s in our blood, essentially. Let’s set our engineers up for success by giving them tools and helping them learn more about performance. I’ve really been taking the tack of creating a performance culture through education, tooling, and evangelism. React and Redux abstract away the internals, so you don’t really need to know about that whole loop. Some of you probably know more about how React and Redux work now than some frontend engineers. I believe that understanding the system contextualizes and motivates performance work, which is why I spent a lot of time in this talk explaining those things to you. The first thing that we can do to make people aware of the issues is to show them when they’re writing code that could be a performance liability. We’ve done this via lint rules — remember, as with the codemods, you can find this stuff via the AST and static analysis. We bring this into their VS Code with lint rules that show them when unstable properties are being passed to children, when unstable properties are being computed, and when they’re passing functions or values that break memoization but don’t have to. Not everything can be done via static analysis, though. We’ve created some warnings that go into people’s development consoles. While you might not be able to tell from the code itself when props are going to be unstable, we know what the values are at runtime, and we can compare the previous to the current value and say, those were deep-equal, you might want to think about stabilizing them. We can also tell, based on what’s in Redux, when an experiment, for example, is finished and we don’t need to be checking it on every single Redux loop. That’s just one less papercut in the pile.
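A simplified sketch of how such a runtime warning could work — the deep-equality check here is a crude JSON comparison for illustration only, and the function names are invented:

```typescript
// Dev-only check: props that are deep-equal but not reference-equal
// defeat memoization and cause unnecessary re-renders.
const roughlyDeepEqual = (a: unknown, b: unknown): boolean =>
  JSON.stringify(a) === JSON.stringify(b); // simplistic; fine for a sketch

function findUnstableProps(
  prev: Record<string, unknown>,
  next: Record<string, unknown>,
): string[] {
  return Object.keys(next).filter(
    (key) => prev[key] !== next[key] && roughlyDeepEqual(prev[key], next[key]),
  );
}
```

In development, a wrapper around the component would call this on each render and log something like "prop `memberIds` was deep-equal but not reference-equal; consider stabilizing it" when the list is nonempty.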
Once they know what problems are in their code, we’ve been telling them how to fix these things. These tips and tricks are fairly lightweight; none of them is particularly challenging to understand. Wrap your component in React.memo if it’s re-rendering too much. Use that EMPTY_ARRAY constant instead of creating a new empty array on every single loop. Same with the other examples here. We’ve taken those lint rules and those console warnings and beaconed them to the backend, so now we have some burndown charts. While making people aware of the issues is the first step, sometimes you also need a good burndown chart. I’m not going to say that having a graph isn’t motivational — that’s why we all love graphs in performance land. Sometimes, yes, you do need a good chart that is going down to help light a fire under people’s tuchus. The graph is going in the right direction; it’s working. I like to joke that performance work is like fighting entropy: more stuff is going to keep popping up, it’s like Whack-a-Mole. For the most part, we’re heading in the right direction in the lint rules and in the console warnings. That’s been really great to see.
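The EMPTY_ARRAY tip, for instance, comes down to reference identity. Here is a sketch with invented function names:

```typescript
// A shared constant always passes reference-equality checks...
const EMPTY_ARRAY: readonly string[] = [];

// ...while a fresh literal never does, so memoized components re-render.
function getMentionsUnstable(found?: string[]): readonly string[] {
  return found ?? []; // new array identity on every call
}

function getMentionsStable(found?: string[]): readonly string[] {
  return found ?? EMPTY_ARRAY; // same identity every time
}
```

A component receiving the stable version as a prop sees the same reference loop after loop and skips re-rendering; the unstable version looks "new" every time even though it is always empty.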
I found in my career that performance has been built up as a problem for experts. There’s this hero culture that surrounds it. You go into the performance channel and you say, “I shaved 5 milliseconds off of this API call, look at me, look at how great I am.” This isn’t necessarily a good thing. It’s great to celebrate your wins. It’s great that you’re improving performance and reducing latency. But we’re doing ourselves a disservice if we keep performance inaccessible, and if we keep this air around it that the only way we solve performance issues is with huge architectural feats — that we have to change the whole architecture if we want to fix performance problems. Because, again, engineers — your teammates — care about performance, and they want to help and pitch in. Let’s get a lot of people to fix our performance problems. The problem that came out of the scale of our app was all of these papercuts. If we get 200 engineers to fix 200 performance problems, that is 200 fewer papercuts in the pile.
Conclusion
As I was putting this story down on the slides, this kept rattling around in my head: it takes a lot of work to do less work. We could have taken the other side of the story and put in the work to rearchitect the whole application. That would be a lot of work for my team, and it would be a lot of work for the other engineers to readapt, change their way of working, and change the code to use a “more performant framework.” Or we could trust our coworkers and teach them — they’re good engineers. They work at your company for a reason. Let’s help them understand the systems that they are working on. Then they might start to see how the systems break down, and they could start to fix the system when it breaks down.