Article originally posted on InfoQ.
Transcript
Calçado: My name is Phil. I’ve given talks at a few different QCons and other conferences about various topics, but mostly microservices. That’s what I’ve been working with. That’s what I’ve been around in terms of community, technology, process, everything, for the better part of the last 10 years. When thinking about microservices, there’s different stages of growth and understanding as an industry of where we’ve been. I think probably now, we’re late stage microservices. We look at the consequences of our actions. We’re looking at the consequences of the decisions we made 10, 5, whatever many years ago, and how to evolve all these different pieces of technology and understanding to something more sustainable. This is interesting to me in particular, because historically, like I said, I’ve been in the microservices scene for quite a while. I was lucky enough to work in some of the pioneers in this space, being Thoughtworks, SoundCloud, DigitalOcean, and many others. A lot of what my experience has been so far has been a little bit been there for the pregame, or the party, like just before things start heating up, either as a consultant. I would come and help clients and figure out that actually the best way was to split big services into smaller ones, or growing a company like SoundCloud, DigitalOcean, and a few others, in terms of, we are growing a lot, our business is growing in interesting ways. We need to adopt an architecture that allows for us to grow the way we need to.
This often has to do with a little bit of hyperscale, growing like crazy, hiring too many people. One thing I’ve learned going through this process a few times, is that the most important thing you can do, like really the basics, is that, first, you really should think about the whole situation as you step into a more distributed architecture style as a city planner, not a bricklayer. This is from SimCity. Basically, if you haven’t played SimCity ever, what you do is you manage a city. What the name says. It’s a city simulator, but you don’t manage any specific buildings. This is a screenshot of the game, like a little animated thing. You have no control over any of these particular buildings, cars, what have you. All you can do is just say, this is a residential area. There’s a fire station here. There’s a police station over there. There’s a park here. You cannot build this way. A road goes from A to B. You don’t have control over what gets built or what motivation people have to go along with exploring the city, you just frame the problem and let people go. This is a lot of what I have experienced building a company based on microservices, or product engineering especially. Specifically, a product engineer organization from microservices, building from the ground up. The thing to me is really to think about the whole problem as a city planner, not a bricklayer. You really need to think about how you define your whole city, which eventually is your company.
Procedural Generation
Another metaphor or analogy I like to use is procedural content generation, which is something that I find fascinating. Basically, procedural content generation is a technique by which you feed a computer some constraints, some logic, some rules, an algorithm, and a random number generator, and just let it go. Then you produce maybe dungeons for your RPG game, or levels for your first-person shooter, or whatever it is, with the idea that a person wouldn’t have to explicitly draw levels for a game, or what have you. The computer can generate that based on these rules that you’ve created. To me, this is so similar to what happens when you are starting a company from the ground up, or maybe growing a company using microservices. What you need to do is define a set of constraints that set the boundaries of the problem and let people go, because you’re hiring thousands of people every month or every year. You will absolutely not be able to control what these people are going to do. These people are going to go. You hire them for their actual ability to build wonderful things, but you really need these things to talk to each other, to have some level of governance, so that the whole situation is manageable.
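To make the analogy concrete, here is a minimal sketch of procedural generation in Python. It is not taken from any particular game: the `Room` type, the map size, and the "rooms must not overlap" rule are all assumptions chosen for illustration. The point is only that the author writes the constraints and the random number generator decides the layout.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Room:
    x: int
    y: int
    width: int
    height: int

    def overlaps(self, other: "Room") -> bool:
        # Two rectangles overlap unless one lies entirely to one side of the other.
        return not (
            self.x + self.width <= other.x
            or other.x + other.width <= self.x
            or self.y + self.height <= other.y
            or other.y + other.height <= self.y
        )


def generate_dungeon(seed, map_size=40, max_rooms=8, min_side=3, max_side=8, attempts=200):
    """Place random rooms on a square map, keeping only those that obey the rules."""
    rng = random.Random(seed)  # seeded, so the same seed gives the same dungeon
    rooms = []
    for _ in range(attempts):
        if len(rooms) == max_rooms:
            break
        width = rng.randint(min_side, max_side)
        height = rng.randint(min_side, max_side)
        # Constraint 1: the room must fit inside the map.
        x = rng.randint(0, map_size - width)
        y = rng.randint(0, map_size - height)
        candidate = Room(x, y, width, height)
        # Constraint 2: the room must not overlap any room already placed.
        if not any(candidate.overlaps(room) for room in rooms):
            rooms.append(candidate)
    return rooms
```

Nothing in `generate_dungeon` says where any room goes. Changing the seed changes the dungeon, while the constraints keep every result valid, which is the property the analogy relies on.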
Ultimately, you’re not controlling each specific piece. This is extremely important. It’s very hard for somebody with a technical background, somebody who is used to writing code, to understand that, no, you don’t own it anymore. That’s how you do it. You define the rules, and you let the organization go. Sometimes it’s called enterprise architecture, although there’s different flavors, but it’s how I see it. The interesting thing is that, like I was saying, in procedural content generation you let the computer’s randomness go about it. When you’re talking about microservices, you let your organization’s randomness go about it, and that randomness exists: people come, people go, different projects are staffed or killed, these things come and go. What you need to do is define the rules that will let the company generate the content, the content here being the microservices or what have you. That’s how I’ve been successful running companies based on these highly distributed architectures we call microservices.
PicPay (Digital Wallet)
Then more recently, I’ve been facing a new challenge. I work for this company called PicPay. PicPay is the largest digital wallet in Latin America. We do peer to peer payments, merchant payments, everything around finance, both for consumers and enterprises. It’s a lot of products, basically what we usually call a Super App, focused on Brazil. If you’re Brazilian, you for sure know the app and what it does, because that’s what everybody down there uses for payments and various other things. It’s a very famous brand within this demographic. The most important thing for this talk is not so much what PicPay does. You can think of it just like a payments application, a little bit of Cash App and Square here in the West, Venmo, AppBrain, similar to what we do down in Brazil. Basically, the important thing here is how big PicPay grew. I’m used to wild growth and what we generally call hyper-growth. This is a whole new level, at least to me. This is a chart of our team size. We started with a few people, as small offices often do. It didn’t happen quickly, because these people had been working on this app for a very long time, like six, seven years. Then eventually, market conditions, funding, everything comes together and we need to scale this business up. We go from one product to over 30, and quickly move from 50 people in engineering to about 2000 now. We are, I believe, 3500 people right now across the whole business, including the business side, support, everything. We are about 2000 engineers or people who work in engineering capacities, managers, and data scientists. This is wild. To me, it’s a good representation of the challenges we have right now, which is really that we’ve scaled so much over such a quick period of time. We had to make a lot of decisions. We had to go about things in a way that allowed us to scale like that.
The decisions we had to make as we’re growing, the decisions we had to make during this process are not necessarily the best decisions for the current phase of the company, which is a little bit of the reason I and a few other people were hired and put into this place. It was like, ok, we need folks to look at what we have and put us back on track to keep growing as we’re doing, but in a more sustainable way.
This is an interesting challenge. The way I see this is really that for the first time I’m invited to the after party, because I’ve been playing this startup game for a little while. You more often than not, are there from the beginning, the company succeeds or not, maybe makes money, maybe doesn’t make money. There’s a lot of different things, and your interest changes. Maybe after a few years, it was like, ok, I’ve done what I had to do here and move on to a different thing. This time, my team and I are facing the challenge of a highly successful company that grew extremely quickly, and now we need to make sure that we keep executing at a pace that allows for us to keep playing the game in the financial marketplace and the financial business that we are at, which is highly competitive. There’s old school banks. There’s new banks popping up every day. There’s new apps. There’s regulation. There’s all sorts of different things. We really need to make sure that we are on the right track when it comes to our experimentation, in our product quality.
1. Stop the Bleed
It’s important for me, also, to give you our disclaimer, this is all work in progress. I’ve been at PicPay since January. This is my three and a half month mark now. I think I know most people’s names. I’m happy to share a few of the things that we’ve been doing, that we started doing. If you catch up with me in about a year, I will probably update you on what of this maybe have changed, maybe didn’t work out so well. What we keep investing on. What are the things we found out that work and that don’t work in this scenario? Because there’s so much to talk about, there’s so many areas, from an organization’s technology, to architecture, to many things, I want to focus on three pieces of advice that I’ve been giving people in similar situations. Also, when I hire, I’m hiring a lot of more senior leaders to help me join the team and build a lot of the organization. This is the advice I’ve been giving them in coming to PicPay, what I think is the right thing to do joining and during the after party. The first thing I think is the most important is stop the bleed. A lot has happened in this company and possibly in your company. There’s a lot of different teams, projects, things going up. In a company with 2000 people, at any given time, there’s going to be 10 new projects popping up, there’s going to be 10 new systems popping up. There’s things that you possibly can’t know about that are being done right now that are going to impact your architecture, your technology, your strategy, really quickly in the future. The most important thing to do as a leader coming in, being a senior engineer, a manager, a director, a CTO, whatever are you, really, the first thing you should focus on, is to stop the bleed. Make sure that we’re not making the problem worse, every day. Whatever new systems, whatever new things pop up, should not be taking a step back, they should move you forward.
The first thing or one of the things that you need to do around this is to make explicit the rules that you want to follow, even if they’re not strictly enforced. The last part is very important, but the first part too. Make the rules explicit. It’s something that even when you’re growing with microservices, when you grow in a more organized way from the beginning, again, it’s the procedural code generation we’re talking about, I believe you need to do. You need to make sure that the constraints you want to follow are very explicit, and encourage people to follow them. What are these constraints? For example, you could decide that we use RESTful designs for our applications here, we use HTTP and all the niceties around that. Or, no, actually, we use gRPC and don’t want to see any HTTP endpoint ever. All these decisions that you need to make around observability, telemetry, how things talk to each other, how things are deployed, those are the rules I’m talking about here. Make sure that you have them explicit, either if you’re growing with microservices, or if you’re refactoring an existing microservices setup, you really need to make sure that these are clear.
There’s one difference between growing and refactoring that I found. When you’re growing with microservices, it’s very easy for you to nip in the bud when things are going awry. If I decide that my service will use Thrift instead of gRPC, everything else using gRPC, it’s a lot easier as you’re growing to identify that and say, “No, you cannot do that. This is not what we do here. That’s not what we’re going to do here.” Because you only have maybe a few services, everybody knows everybody else. The company is still growing. When you’re refactoring the organization, it’s a little harder to enforce things so strictly, because maybe, yes, there’s a new service that’s going to be produced, and it’s going to use Thrift instead of gRPC. You’ve decided to use gRPC, they’re going to use Thrift. You could go there and say, “You cannot do this. It is not allowed.” Maybe that will be ok, but there’s so much entropy. Maybe you’re talking about an organization with 100 engineers that have always used Thrift on that side. Your decisions are between migrating everything that exists to use gRPC or what have you, or allowing just one case to use Thrift. Then there’s another case more. It’s an interesting balance between these things. In my experience, at first, you shouldn’t really focus too much on trying to strictly enforce the rules that you’ve set for your organization. What you should focus on instead is to promote the things you want. It’s fine that maybe some people are going to use different RPC styles. You really should make sure that whoever decides to use the flavor that you have picked, let’s say in this case, gRPC, have a very seamless experience. Which often means that you will have some platform team, tooling team focused on making sure that the tools used in the golden path, on the paved way, they are the best tools. They are better than the tools that people can use elsewhere.
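One way to picture "promote, don’t strictly enforce" is a check that reports deviations from the golden path without ever blocking anyone. The sketch below is only an illustration: the `GOLDEN_PATH` table, the topic names, and the `advise` function are hypothetical, and the gRPC preference is borrowed from the example in the talk.

```python
# Hypothetical explicit rules: one preferred answer per topic.
GOLDEN_PATH = {
    "rpc": "grpc",
    "transport_security": "tls",
}


def advise(service_name, choices):
    """Return advisory notes for a service; never raises and never blocks a deploy."""
    notes = []
    for topic, preferred in GOLDEN_PATH.items():
        actual = choices.get(topic)
        if actual is None:
            notes.append(
                f"{service_name}: no declared choice for {topic}; the golden path is {preferred}"
            )
        elif actual != preferred:
            notes.append(
                f"{service_name}: uses {actual} for {topic}; the golden path is {preferred}"
            )
    return notes
```

A team that stays on Thrift still ships, but the deviation is now visible, and the platform team knows where making the paved road more attractive would pay off.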
There’s a few different sets of rules you can adopt on this. There’s various different flavors. My recommendation would definitely be to at least follow some of what we call the prerequisites for microservices, make sure these things are well established as rules in your organization. This is a list that I’ve compiled based on the list that originally Martin Fowler had, called microservices prerequisites. I think this is the basic you need to do, to do microservices well at scale. If you find yourself in a situation like this, make sure that you have an answer to at least each one of these topics that you want to promote and defend and want your people to use. Similar to this, in this same vein, make the work visible is very important. When you have a very big organization and you have a lot of different services, hundreds of services, hundreds of people, maybe hundreds of teams, there’s a lot happening. It’s never going to hit you until you know something’s in production, or when it’s generally way too late. One thing I really like is to foster or promote a culture of people sharing ideas in different ways. They share what they’ve been thinking about and the design of their systems as widely as possible.
There’s a few different ways to go about this. One way that I definitely recommend is a structured RFC process. This is one that I particularly like. It’s one I’ve been using for almost a decade now, with a lot of iteration. Basically, the idea here is that you ask people to write in a document with a specific format that drives to some kind of interactions or thinking that are considered to be good. You have people commenting on that. There’s like different logistic ways. A few of the interesting things is that first this is not a decision making process, this is a knowledge sharing process. What that means is that the person writing the RFC is accountable for the system being proposed. That person needs to be able to make decisions. You do not want to put anyone in a position where they are accountable for something, but they don’t have autonomy over it. Somebody else makes the decision, but if things break in the middle of the night, I’m the one responsible for that. That’s the worst of all worlds. I really would like to avoid that as much as possible. It’s really about writing down what you’re thinking, getting input from people all over the company. That’s why it’s really important to have a mailing list, a Slack channel, whatever forum your organization prefers to publicize this idea. Give it a timeframe. It’s like, we have one week to comment on this idea. Know that the decision is still with the person. The decision still needs to be with whoever is accountable for that. This person should receive feedback from the whole organization and act accordingly on this. There’s a few interesting things like expiry date, review process, and things like that. Definitely recommend you look at this process.
It doesn’t matter if you will use this process or not, what matters to me is that you have some way so that people working on different systems, different services across the company are broadcasting that to others to get input, which is very important, especially because talent disparity often occurs. Also, to share knowledge, so that next time I need some things like, I remember that two weeks ago, somebody was talking about a system that uses Kafka and I want to use Kafka now in my system. Maybe I should read that RFC, talk to that person. One last bit on this is that I find this to be extremely important in a more distributed world like we are, like everybody’s remote, where people don’t go for beers together, or for coffee so often, where you have the serendipity of exchanging information. This is a way to force information to be shared in a way that even if you don’t immediately need or care about that, it stays in the back of your head, and you can quickly search your email, or Slack, or what have you, to find out information about something that was built a while ago.
2. Don’t Miss the Forest for the Trees
The second is, don’t miss the forest for the trees. This is extremely important, again, when you’re growing an organization based on microservices. I think it’s even more important when you’re joining a company that has already adopted them at a large scale. Because the first thing you want to do as an engineer is to go like, give me access to GitHub, I want to read the code of everything. It’s just a natural instinct. That’s how we used to do things before these highly distributed applications were so prevalent. You would spend some time reading through at least the most important parts of the code of any system that you’ve inherited. In my experience, this doesn’t work so well with microservices, because everything is very distributed and the value of looking inside each box is not as big as you might think. It’s a lot more interesting to look at the whole system, so the forest here, not the trees. It’s an interesting analogy. Sometimes it breaks in different and funny ways. Basically, that’s the main idea: we should be looking at the whole ecosystem, the forest, and not so much at the trees themselves.
Around this thing, there’s always a few exceptions. The first one I want to acknowledge is that, sure, forest for the trees, but you really need to make sure that you identify and fix any of the hot potatoes immediately. What I mean by hot potato is that those systems that clearly are causing trouble. Those systems are clearly breaking in production, or are not performing very well, are burning too much money, which is a common issue across microservices. These systems that require attention, yes, do look at those systems immediately, don’t waste any time. Only those that are really worth the time, like something that’s critical, or something that’s on fire all the time and is causing a lot of problems, because you should really avoid getting distracted with each individual system.
Going back to our SimCity analogy, you really should start thinking about this as a city. Basically, you inherited this township or this community of people and it grew organically. Like any organic community, or think of a market, if people would start building a market from scratch, there’s not going to be a lot of organization. You’re probably going to have the person who sells raw fish next to the person who sells some toxic cleaning product. It’s probably not a good idea. You need really to step in and look at, I have all these things. It’s a vibrant marketplace of ideas, of projects, of different things. How can I help these folks organize the work they’re doing or structure this community in a way that makes sense?
Clay-to-Rocks Layering Model
I’ve found a few things in my career that I think make good sense in this scenario. One of them is what I call the clay-to-rocks model. It’s a layering model. Layers are a very powerful idea in software engineering and in many other things in life, but software engineering, especially, where you pile your software in a way where there’s, again, layers, groupings of systems with similar properties that you want to grab together and think of them as one thing. The clay-to-rocks model, the idea is that you have systems that are clay, and these systems that are high churn with a small blast radius. Basically, systems that are more malleable, that you can change all the time, they will change all the time. These tend to be close to the use cases, tend to be closer to the user. Maybe that’s the one feature your product team wants to experiment with. Nobody knows it’s going to be successful or not. There’s no point in wasting two years building the perfect version of this if you don’t even know if it’s going to be successful. Just wire some stuff together, put it out. See if it sticks. Evolve after that, like we bake the clay. Don’t spend too much time thinking about the quality of the system. That’s where lead time and cycle time even is so much more valuable than anything else.
You have other systems that are rocks. By this, I mean that they are lower in the layering scheme. They enable almost every other feature in the system. This would be the system that provides you information about authentication, or your user database. Or maybe, in our case, for example, the system that performs financial transactions. If you want to move money from account A to account B, there’s a million reasons why we could do this. Maybe I want to move money from my savings account to my checking account. Maybe you want to move money from your account to my account to pay me for the pizza we just had, or whatever it is. There’s a million different use cases, but there should be one system that has a basic interface like: from account, to account, amount of money, maybe a reason. This is how these rock systems work. They need to be super stable because, again, they empower so many different use cases, and they often are the heart of your operation. If they are out, the whole company is out. There’s only so many of them. Usually, you have a lot more clay systems than rock systems, and they have different characteristics.
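As a rough illustration of how small the surface of a rock system can be, here is a sketch of a single transfer operation in Python. It is not PicPay’s actual interface: the `Ledger` class, the in-memory balances, and the `InsufficientFunds` error are assumptions made for the example. Only the shape of the call (from account, to account, amount, optional reason) comes from the talk.

```python
from dataclasses import dataclass, field
from decimal import Decimal


class InsufficientFunds(Exception):
    """Raised when the source account cannot cover the transfer."""


@dataclass
class Ledger:
    # Hypothetical in-memory state; a real system would use durable storage.
    balances: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def transfer(self, from_account, to_account, amount, reason=""):
        """Move `amount` between two accounts; the one operation every use case shares."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        if from_account == to_account:
            raise ValueError("accounts must differ")
        if self.balances.get(from_account, Decimal("0")) < amount:
            raise InsufficientFunds(from_account)
        self.balances[from_account] -= amount
        self.balances[to_account] = self.balances.get(to_account, Decimal("0")) + amount
        # The reason is recorded but never interpreted: use cases live in the clay layer.
        self.history.append((from_account, to_account, amount, reason))
```

Paying someone back for pizza and moving savings to checking are both just calls to `transfer` with a different `reason`, which is why this layer can stay stable while the features above it churn.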
Drawing the picture a little bit, this is from the original article I wrote about this layering scheme when I was working at meetup.com. That’s the example we’ve been using here where there’s various different systems or services within meetup.com. You have information about groups, information about membership, information about users, and information about events. Then, this is the bottom, as you go up the stack, you have a lot more specialized things. You have the user profile service that provides you with data about how that’s displayed on your screen, on your app, or browser when you go to meetup.com to check out who used this profile. My profile would have some information coming from that. If you look at this situation, you can specify which of these systems are more like clay and more like rocks. The systems that are clay and coming from this particular experience in Meetup, this is where we’re always exploring, changing, churning, experimenting with. Our user profile page used to change every week, because some product person, some designer person, some marketing person will have an idea or wanted to promote something else. Maybe they move things around. Maybe they add more data, remove data from that particular user experience. The actual membership service, the group service, the user service did not change at all. It’s just how the data provided by the services was being used.
When you’re coming from a situation where you inherit a lot of services such as this, and you need to put them into shape, this is a little bit of what happens. You need to start figuring out which systems need to be more stable, and which systems don’t have to be so stable. Invest a lot of time, effort, energy, maybe put your best people on it, to make sure that your rock services, the ones that are really important to your normal operation, are working to the level that you need them to work. These fundamental systems cannot go out, so you probably will be optimizing for stability and performance, not so much for developer productivity and other things. Maybe even the code review constraints for these systems are more specific or more stringent than for others. Meanwhile, identify the clay systems. When you have thousands of engineers working on thousands of things, you really have only so much headspace to worry about each specific thing. In this situation, building a map like this allows you to prioritize putting your attention and your best people on the rocks layer, and to deprioritize the amount of effort you need to put on the clay layer for now, and then maybe handle it on a case by case basis. This is an incredibly helpful model in many different ways. I found it invaluable in this particular situation where we’re refactoring an existing architecture.
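One possible way to get a first draft of such a map is to count how many other services depend on each one, directly or indirectly. This is a heuristic assumed for illustration, not the method described in the talk. The dependency edges below are invented around the Meetup service names mentioned earlier, and `event-page` is a hypothetical service.

```python
from collections import defaultdict

# Hypothetical "service -> services it calls" edges.
DEPENDS_ON = {
    "user-profile": ["users", "membership", "groups"],
    "event-page": ["events", "groups"],
    "membership": ["users", "groups"],
    "users": [],
    "groups": [],
    "events": [],
}


def blast_radius(depends_on):
    """Count, for each service, how many services depend on it directly or transitively."""
    dependents = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(service)

    def reach(service, seen):
        for dependent in dependents[service]:
            if dependent not in seen:
                seen.add(dependent)
                reach(dependent, seen)
        return seen

    return {service: len(reach(service, set())) for service in depends_on}


def classify(depends_on, rock_threshold=1):
    """Label a service 'rock' if at least `rock_threshold` services rely on it, else 'clay'."""
    radius = blast_radius(depends_on)
    return {s: ("rock" if n >= rock_threshold else "clay") for s, n in radius.items()}
```

With these edges, `users`, `groups`, `events`, and `membership` come out as rocks, while `user-profile` and `event-page` come out as clay. A count like this only suggests where to look first; the final call on which systems deserve the most attention is still a judgment about the business.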
Apply the Edge-Internal-External Layering Model
Still within layering, there’s another model I like that’s also part of the same article and comes from the same experience, which is the edge-internal-external. It’s a very basic idea. There’s nothing new here. Basically, the idea is that you need to flag which of your services are part of the edge. What I mean by part of the edge, means the services that are exposed to the outside world, services that have received inbound connections from outside. Probably, it’s the public internet, probably the way things have to work these days, it’s very likely that even if there’s only a subset of people who can access your service, you still expose it to the internet and use other security things to make sure they’re only accessible to the right people. It could be also that it’s only accessible from your partners or in some network that you might own. Edge means external systems that you don’t have control over access this thing. Services sometimes like API gateways, BFFs, what have you, play a good role here. I also put into this, all the supporting systems for that: authentication, authorization, rate limiting, throttling, all these things belong on this layer. There’s specific things that you want or you don’t want in systems in this layer.
The other one, which is the most commonly used, where most or the vast majority of your systems are going to run, is the internal layer: systems that are only used inside your VPN. I’m using VPN here loosely. The idea here is that these are really the systems that you develop and that talk to each other, so upstream and downstream systems within your network. Obviously, it is 2022, these systems should talk via HTTPS, TLS of some sort. They should have security inside the network. There’s almost no reason not to have security inside your network, even if it’s your own network. Plain HTTP traffic has been a big no-no for five, six years now. Still, there are things that you are more willing to accept inside this network than in inbound communication that comes from the outside. Then there’s the third one, which is a weird overlap, but if you structure things well it ends up being its own layer. It is external services, in the sense of services that may call other services or systems outside your VPN. It’s interesting because you could say, but isn’t that just an internal service that maybe makes a call to the world outside the VPN? Yes. At the same time, I think it’s important to acknowledge that a well-structured architecture of services wouldn’t allow any service to just make random calls outside your VPN. Again, VPN here is being used very loosely, mostly meaning your internal systems. I would really encourage you all to make sure that the systems you own that make external calls are the gateways to something else. Maybe it’s the push notification system, maybe it’s a third party service that you use, or whatever it is. Make sure that they are isolated, basically that they’re the only pieces that can actually make calls to the outside world, and everything else is forbidden from making these calls. You have a few systems that can make calls to the outside world.
Then, if you apply this lens to a structure like this that you’ve inherited, you have a lot of different services up and down that look the same, because again, we’re thinking forest, not trees. I’m not even going to care which of these systems does what. You start identifying that some of these systems belong to what we’re calling the edge layer. Some others belong to what we’ll call the internal layer and the external layer. That’s when things get a little more interesting, because systems that belong to the edge layer definitely need treatment similar to the rock systems we were talking about. There’s an overlap between this pattern and the other. There’s an interesting interplay. The edge layer is mostly rock systems, although some of them are going to have high churn. For example, the API gateway will probably have new endpoints coming and going. There’s interesting ways to make sure that this is not too much of a problem. Your authentication system, your authorization system, your rate limiting system, those things need to just work. If you change the authorization system every week, or every iteration, or every month, there’s something wrong. You really need to get it going, either using a third party service or building your own. You get it going once and you barely touch it, except for some security patch or some small evolution here and there. Changing systems in the edge layer should be a big deal. If you have a company that has 2000 people, or if you have enough people in your platform and infrastructure organizations, I strongly recommend that you have a team fully dedicated just to the edge layer. It doesn’t need to be a big team. Those folks should be looking a lot at your network infrastructure, you’re probably talking about CDN as well, and various other things.
Then immediately below we have the internal layer, where requests hitting this layer have already been sanitized in different ways. Again, that doesn’t mean that we can drop all security measures and use plaintext HTTP or anything like that. We definitely are a little more secure in this layer, though. This is where high churn happens. Systems come and systems go. The tooling you have for deployment, support, monitoring, and telemetry should be optimized for this layer, because the vast majority of your systems are going to be here. Then you have the external layer, which is these systems that make calls to systems outside. Again, I do recommend that you completely isolate and explicitly acknowledge the systems that are allowed to make calls to external systems or external services. There’s a few different reasons. The main one is, if a service does not belong to the external layer, it should not have access to the public internet. If it tries to make a call to the public internet, it should just be blocked. There’s a lot of interesting things that would be avoided by this. The Log4j bug more recently is one case that would be mitigated by having such a rule in place, because even if you receive something that makes your system try to contact the outside world, it wouldn’t be allowed to, because it is not part of the external layer, so it should be forbidden. Even thinking about these layers in different ways, there’s different things you can do in your tooling. Maybe use manifest files, maybe use different things to explicitly acknowledge the layer each system belongs to, so that your infrastructure tooling, your security tooling, and many other things can automatically detect if one of the systems is not behaving the way it should. Those are the two layering models, still very much in the city planning mindset.
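The manifest idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Manifest` and `Connection` types, the field names, and the idea of checking observed connections after the fact. A real setup would more likely enforce the rule in network policy, with a check like this as a safety net.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    EDGE = "edge"
    INTERNAL = "internal"
    EXTERNAL = "external"


@dataclass(frozen=True)
class Manifest:
    # Hypothetical per-service manifest: each service declares its own layer.
    name: str
    layer: Layer


@dataclass(frozen=True)
class Connection:
    # Hypothetical record of an observed outbound call.
    source: str
    destination: str
    destination_is_public: bool


def find_violations(manifests, connections):
    """Flag outbound calls to the public internet from services outside the external layer."""
    layer_of = {m.name: m.layer for m in manifests}
    violations = []
    for conn in connections:
        layer = layer_of.get(conn.source)
        if layer is None:
            violations.append(f"{conn.source}: no manifest, layer unknown")
        elif conn.destination_is_public and layer is not Layer.EXTERNAL:
            violations.append(
                f"{conn.source} ({layer.value}) called public host {conn.destination}"
            )
    return violations
```

For example, given a `push-gateway` declared as external and a `ledger` declared as internal, a public call from `push-gateway` passes while the same call from `ledger` is reported. That is the kind of automatic detection the manifest makes possible.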
3. Don’t Panic
As somebody who, again, just joined this role three and a half months ago, the most important thing to me is, don’t panic. It’s crazy when you join a company. There are thousands of people working on thousands of things, and you’re like, I don’t even know what’s going on. It can be very overwhelming. There’s a whole different talk here around how to build your support systems: how to make sure that you, as a manager, as a leader, feel empowered, and have the right tools, the right team, the right mindset to tackle this problem. Thinking a little bit more on the technical side, don’t panic, to me, comes down to a few things, not a lot of them. The most important really is, don’t try to boil the ocean. This is the Windows defrag, something that anyone who was born in the ’80s and/or grew up very close to computers in the ’90s is probably very familiar with. The idea here is that, for various reasons, Windows had file systems, I think it was in fact a family of file systems, that would gradually fragment. You would have one thing that you think of as a file, but it would actually be distributed all over your hard drive. You would have those spinning hard drives that were really slow. Eventually, because the one file you wanted to access was spread across multiple different sectors on that disk, your computer would get extremely slow, and you had to run this tool that came with the system to defrag, so basically to regroup things slowly. This is an animation that it would actually display to you while it was doing the job, showing how it was grouping the sectors together and finding bad sectors and things like that.
One of the reasons that I like this, and I use this image all the time, is because this is how I have seen my job for a very long time, not just this particular job. It’s really like I’m that Windows defrag manager. I’m not here to magically solve all problems at once. My role as the leader of an organization going through this transformation is to gradually defrag, sector by sector, file by file, whatever it is. Maybe your company is a mess today. That’s fine. Maybe your company is a mess tomorrow, it’s fine. As long as the mess you have tomorrow has less entropy than what you had yesterday or what you had today, you’re going in the right direction. You’re doing well. Don’t panic. Things are going to be ok. I keep saying that management is one of the most lonely professions, and I know leadership can be a big drain on people.
The main idea I have, or the main thing I want to share with you all now is, you really need to think of this problem in a more strategic way. Take steps back. Don’t think as an engineer. As you might see by my bookshelf right here, I’m a big nerd. I love programming languages. I love coding, but in a role like the one I’m in right now, this is not my job. My job is to look at the whole ecosystem as if it’s a city. I’m really thinking about it like a city planner, not a bricklayer, not somebody who’s building one building. I’m building the whole organization, and there are different ways to go about this.
Questions and Answers
Ignatowicz: The practices that you explained to drive the hyper-growth of the company are really interesting, but I’m looking at this from a more practical perspective. Imagine that I’m just joining a company and the team does not do well on some of the points; for instance, it doesn’t have a culture of design docs or RFCs to document decisions. Where would you advise me to start without being too pushy, without trying to impose the same frame as my previous company? Besides trying not to be too pushy, I also need to deliver results, because we have a window to impress people and to show what the company hired you and pays you to do. How do you balance changing the direction of a team you just joined with adapting to the team’s culture?
Calçado: Actually, I think this is a little bit of the mindset that I was trying to talk about when I mentioned the defrag component. I have a standing window every day for office hours where I talk to random people throughout the organization. Oftentimes, new hires come in, and that’s a question I get a lot, like, “I’m new here. It’s a really big organization. My team has practices that may not be the ones I think are best. What should I do?” My recommendation in that case is really to think about this the way I look at it. As in any big company, there are initiatives to improve, standardize, and change the whole thing; again, the defrag mindset. In a company this big, you really need to think globally, act locally. What are the things that you can do to improve your team? You’re not going to go from 3 to 10, but how can you go from 3 to 4, and talk to other people? This is one thing that’s been really challenging to me, because I started this job remotely. I’m actually going to go to our headquarters for the first time in a few weeks. I’m so used to having the buzz and the marketplace of ideas happening there, where you get to know what other people are doing just by bumping into them. We’ve been trying more and more to have more social channels where people can exchange ideas about what different teams are doing. I think in this case, the most important thing really is to focus on improving your team, and then sharing those ideas. And, obviously, stealing ideas from other teams. The one thing that I am really worried about in a company this big, and even in a company that’s smaller, even if we’re talking about 500 people, 200 people, is, don’t try to boil the ocean. Again, don’t try to solve the process for the whole company. Solve for your team, see what works, see what doesn’t work. Then let’s talk as a collective of engineers: what are the things that we can extract and make into a process that works for more people?
I think it’s really important, like you’re saying, also to deliver within that one team. With thousands of people, if you try to work out what works for everyone, you’re never going to do anything. Think globally, but act locally is the main motto there.
Ignatowicz: If I’m trying to implement an incident response system with people on on-call in my company, what would be your advice to start this process?
Calçado: The first one is, don’t reinvent the wheel. There’s 35 new processes. I think PagerDuty has a really comprehensive guide on that. Even if you don’t use PagerDuty, we don’t use PagerDuty at PicPay at the moment, the guide is great. I really appreciate it. With a lot of these processes, try to implement them as-is, and iterate over time. Do not try to create a very sophisticated thing that will take forever, and will require a full team just to maintain the process, especially if you’re smaller. There are only a few things you need in an incident management process: communications, an incident commander, and things like that. Everything else is nice, but not necessary. Stick to a model that’s known to work. Again, my recommendation is the PagerDuty model. I know there are others. I think that the people behind FireHydrant also have a different model that probably works better with their tool, but can be generalized as well.
Ignatowicz: When you refactor systems at a wide level, do you have any tips to incentivize developers to stop using old APIs, services, calls, you are trying to get rid of?
Calçado: Make them super slow. Although this works. I think a lot of it has to do with managing the lifecycle of an API. It’s always complicated. My main recommendation would be, and it might be a little late for this, to think about it from the beginning: minimize the number of dependencies on things that are not published interfaces. We have this concept that I really like, Martin Fowler wrote an article many years ago about the distinction between public and published interfaces, in the sense that a public interface is like a public method in a programming language. You can call it, but that doesn’t necessarily mean you should call it. Maybe you’re calling a private API that just happens to be public. A published interface is something that’s well supported, well documented, and all this stuff. My main recommendation in this regard is to make sure that you have a good distinction between public and published interfaces. Whatever you call published interfaces, you really support them well. I’m a leader now, I can go and hit people in the head with an order and say you need to move from A to B in five months. Not only is that not a great working experience for people, but there are other priorities, there are compromises people need to make. In my own experience, the best way to drive wide refactoring or wide change around an organization is to make sure that whatever you want them to do is much easier, a much better experience, much better supported than the bad way. The bad way can rot and go bad. Obviously, you need to maintain some level of support, but make sure that the new way is awesome. Then, in my experience, you gradually see that people, especially the same newcomers we were talking about, will want to use this new stuff; they will not want to use the old ways of doing things. You might need a project or two to migrate from old to new, but I think the whole thing about stop the bleed is very important.
By providing a better alternative that is better to them, not just to the teams providing the API, you end up in a better situation that way.
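To make the public versus published distinction concrete, here is a small Python sketch. It is only one possible way to express the idea, not a method described in the talk. The `published` and `deprecated` decorators, the `PUBLISHED` registry, and the balance functions are all hypothetical names invented for the example. The old entry point keeps working, but points callers at the supported replacement.

```python
import functools
import warnings

# Hypothetical registry of the functions the owning team commits to supporting.
PUBLISHED = set()


def published(func):
    """Mark a function as a published (supported, documented) interface."""
    PUBLISHED.add(func.__name__)
    return func


def deprecated(replacement):
    """Keep an old entry point working, but point callers at the new one."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated; use {replacement} instead",
                DeprecationWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


def load_balance_row(account_id: str) -> int:
    # Public in the language sense (anyone can call it), but NOT published:
    # it is an implementation detail that may change without notice.
    return {"acc-1": 100}.get(account_id, 0)


@published
def get_balance(account_id: str) -> int:
    """Published interface: the supported way to read a balance."""
    return load_balance_row(account_id)


@deprecated("get_balance")
def fetch_balance_v1(account_id: str) -> int:
    """Old entry point, kept alive during the migration."""
    return get_balance(account_id)
```

A registry like this could also feed tooling that reports which callers still depend on interfaces that are not published, which helps size the migration project mentioned above.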
Ignatowicz: Two questions on the hyper-growth factor. The first one is how, as PicPay scaled to thousands of developers, you preserved the culture of the company. Secondly, in this hot market, how do you actually find people to scale in the decision role?
Calçado: The first one, I can tell you what we’ve done at PicPay. Before I joined, we did not preserve the culture at all. If you look at PicPay right now, it still is very much divided in eight different business units or groups that have different cultures. Some come from more of a banking/FinTech background, some come from more of a startup, consumer-driven company. What we’re trying now is to find what works for the company as a whole. Then, again, I don’t think that we can apply the same rule to everyone: somebody who’s building our social graph or a messaging system shouldn’t necessarily have the same ways and the same culture as somebody who’s doing a deep banking system that integrates with bigger banks that move very slowly. We need to have some baseline, but each team needs a little bit of autonomy to decide what works best for them. I think the most important thing looking at the big picture is really thinking about the interfaces between these teams, like these human interfaces, the protocols. Even for iteration planning, if I need something on your backlog, how can I get that? Do I have to talk to people? Is there a Jira ticket I open? How does that work? All the way to using the same RPC protocol, and making sure our APIs look the same. In a company that’s doing many different things, some of which are really different, you really need to allow for a little bit of cultural difference amongst various teams.
I definitely recommend that you have one funnel for hiring. I think this is something that larger companies, Google, Facebook, and the like do well, in that there’s one funnel for people coming in. Because in a company like this, a team or a whole division that exists today might not exist next month, or might be an experiment. You really don’t want to get to a situation where you hyper-specialize hiring for one particular team, and you feel like you can’t move these people. This applies to managers and engineers very much. On the hiring side, it’s an interesting situation; like everybody else, we’re hiring like crazy. A lot of my mindset comes from my experience when I was actually working in Europe, for SoundCloud, where remote working wasn’t really a thing. This was 2010 to 2015. What we had to do was to attract people from Silicon Valley, really, people who had grown companies before the way we were growing SoundCloud, and get them to make the move to Berlin and make less money. Because you can make a lot more money in the U.S., more equity, more everything. Some people would love to move to Berlin because they like the lifestyle. Some people would not. How could we attract these people there? Making sure that the culture inside the company and the work we were doing were super interesting was fundamental in all that.
If you look at the published material we had between 2011 and 2015 at SoundCloud, a lot of it was us making sure that people all over the world knew that we were doing cool stuff. You could go to Google and work on YouTube’s credit card form for eight months, like a famous Hacker News thread from the other day, or you could come to work for us and build the whole ad targeting system from scratch. I think finding what your competitive advantage is, is fundamental that way. Obviously, you need to have equal pay and allow for various other things. You really need to find a competitive advantage, which, as a big company, is something we’re still figuring out at PicPay: finding out what the sweet spot is at a company this big.