Mini book: The Platform Engineering Guide: Principles and Best Practices

MMS Founder
MMS InfoQ

Article originally posted on InfoQ. Visit InfoQ

Platform Engineering has quickly become one of the hottest topics in DevOps. The explosion of new technologies has made developing software more interesting but has substantially increased the number of things that development teams need to understand and own. Couple this with the ever-increasing cost of development cycles, and organizations are interested in anything that will make them more efficient.

The weight of understanding ever-changing technologies coupled with a “you build it, you run it” approach has led to many development teams seeing their velocity crash. This heavy cognitive load can have severe negative impacts on our ability to effectively complete complex tasks, like coding. Reducing the cognitive pressure on stream-aligned development teams enables them to focus more readily on the code underlying the business’s core products. One of the most important resources within a tech company is developer cycles, so finding ways to maximize that investment into the core product is essential.

The recent Puppet State of DevOps report found that organizations with platform teams increased developer velocity, improved system reliability, obtained greater productivity, and implemented better workflow standards. However, creating a healthy, effective platform is not necessarily straightforward.

Understanding how to improve the day-to-day lives of developers at your own company can be more challenging than it sounds. The platform that is built has to not only streamline development but also be low-friction and low-cost while reducing the cognitive load on its users. This is why many advocate for treating the platform as a product with an empowered product manager at the helm.

Rather than taking a trial-and-error approach, we have distilled and collected the expertise from many different software leaders to help readers in building, operating, and evolving their own platforms and platform teams.

We would love to receive your feedback about this eMag via editors@infoq.com or on Twitter. I hope you have a great time reading it!



JetBrains Releases Rider 2023.1 EAP 4

MMS Founder
MMS Robert Krzaczynski

Article originally posted on InfoQ. Visit InfoQ

JetBrains released Rider 2023.1 EAP 4 on February 12. The latest Early Access Program build for Rider contains features such as the ability to debug startup code for WASM .NET applications, support for the Astro tool, full IDE zoom, automatic imports in Angular templates, and support for TypeScript in Vue template expressions.

In Rider 2023.1 EAP 4, the IDE can debug the startup code of WebAssembly .NET applications. This is achieved by changing the order in which the IDE launches the application: Rider waits for the initialisation of the page target, connects to the debugger and only then starts loading the application properly. Previously, for most projects, the WASM debugger could only attach after the application was initialised, as computing ports or initialising the connection took some time. The solution added to EAP 4 eliminates that delay and makes it possible to capture breakpoints set early in the application initialisation logic.

This new version of Rider is the first to support Astro, an open-source tool that can create static HTML sites using popular JavaScript frameworks such as React or Vue while loading fully interactive components when required. The Astro plugin for Rider can be downloaded from the JetBrains Marketplace or installed directly from the IDE. The plugin includes basic features such as code completion with automatic import, syntax highlighting, refactoring, navigation or correct code formatting. JetBrains plans to add more advanced Astro support in future releases.

EAP 4 for Rider contains support for TypeScript in Vue template expressions. Vue template expressions are now synchronised with lang=”ts” when it is added to script tags. This allows Rider to better evaluate TypeScript, providing users with appropriate suggestions and refactorings that match what is inside the script tag.

Another feature relates to Angular: while working with global and exported symbols, Rider ensures that their imports are automatically added to components during code completion or while using ReSharper quick-fixes.

The last highlighted feature is full IDE zoom: developers can increase or decrease the size of all UI elements at once.

The community reacted positively to the new release, especially to information about the Astro plugin. Davor Pihač, a senior software engineer, applauded support for the Astro plugin in the Twitter thread about Rider 2023.1 EAP 4.

The entire changelog of this release is available on YouTrack.



Ably Terraform Provider Aims to Power Realtime Architectures Using Infrastructure as Code

MMS Founder
MMS Sergio De Simone

Article originally posted on InfoQ. Visit InfoQ

Developed by Ably in partnership with HashiCorp, the Ably Terraform Provider enables using Terraform, a popular open-source infrastructure as code (IaC) tool, to provision and manage Ably solutions programmatically.

With its Terraform Provider, Ably brings the flexibility of IaC to its own realtime platform, thus making it possible to configure and provision it in a coherent and synchronized way with the rest of your IaC-managed infrastructure.

Ably Terraform Provider is built on top of the Ably Control API. It enables managing an Ably account and automating recurrent operations like enumerating messaging queues, creating integration rules with external services, and defining namespaces using Terraform configuration files.

You can dynamically create Ably apps, configure them, and delete them if necessary. You can implement multi-tenancy solutions for your customers, and create configuration-driven environments that can easily be replicated under programmatic control.

The Ably Terraform Provider includes a number of integration rules for distinct services, like AWS Lambda, Kafka, Azure Functions, Google Functions, Cloudflare Workers, and more. The following snippet shows how you can configure an AWS Lambda resource using Terraform’s configuration language.

resource "ably_rule_lambda" "rule0" {
  app_id = ably_app.app0.id
  status = "enabled"
  source = {
    channel_filter = "^my-channel.*",
    type           = "channel.message"
  }
  target = {
    region        = "us-west-1",
    function_name = "rule0",
    enveloped     = false,
    format        = "json"
    authentication = {
      mode              = "credentials",
      access_key_id     = "hhhh"
      secret_access_key = "ffff"
    }
  }
}

Once you have described the desired state of your infrastructure, you can let Terraform plan the required changes and apply them automatically.
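
For example, the standard Terraform workflow applies here as well (this sketch assumes the Ably provider is already configured with your Ably account token):

terraform init    # download the Ably provider
terraform plan    # preview the changes Terraform would make
terraform apply   # create or update the Ably resources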

The Ably platform aims to provide a solution for realtime experiences including pub/sub messaging and data delivery, with support for a number of guarantees, including exactly-once semantics, ordered delivery, message delta compression and so on. It supports multiple realtime protocols, such as WebSockets, MQTT, and Server-Sent Events. Additionally, it also supports the delivery of native push notifications for iOS and Android apps.

HashiCorp Terraform is an infrastructure as code tool that lets you define cloud and on-premises resources in configuration files using a declarative, human-readable language. Terraform can be used with a variety of platforms or services through providers that serve as wrappers around their APIs. Besides Ably, available providers include Amazon Web Services (AWS), Azure, Google Cloud Platform (GCP), Kubernetes, Helm, GitHub, Splunk, DataDog, and many more.



OpenAI is Rolling Out ChatGPT Plus as a Premium Service

MMS Founder
MMS Daniel Dominguez

Article originally posted on InfoQ. Visit InfoQ

OpenAI has announced the release of ChatGPT Plus, a premium version of its well-liked ChatGPT chatbot. The new service intends to give users a premium experience that includes ChatGPT access during peak hours, priority access to new features and upgrades, and quicker response times.

ChatGPT Plus could be the first of several upcoming proposals aiming to monetize what’s become a viral phenomenon. In addition to an API, OpenAI is actively evaluating options for more affordable plans, business plans, and data packs.

Access to ChatGPT will continue to be free; however, the subscription pricing model will make it easier to keep free access widely available.

OpenAI has announced that it will use the knowledge it gained from ChatGPT’s research preview to continue developing the chatbot:

We launched ChatGPT as a research preview so we could learn more about the system’s strengths and weaknesses and gather user feedback to help us improve upon its limitations. Since then, millions of people have given us feedback, we’ve made several important updates and we’ve seen users find value across a range of professional use-cases, including drafting & editing content, brainstorming ideas, programming help, and learning new topics.

Early in January, OpenAI provided a sneak peek at ChatGPT Plus by stating that it was beginning to think about ways to monetize ChatGPT and published a survey outlining the potential costs and features of a ChatGPT Professional plan. 

Despite criticism and a number of suspensions, ChatGPT has proven to be a public relations success for OpenAI. By any standard, ChatGPT had an enviable user base of over a million members as of the beginning of December. Yet, running the service costs money. 

The operational costs of ChatGPT, which come to a few cents per chat in total compute costs, are “eye-watering,” says Sam Altman, co-founder and CEO of OpenAI. ChatGPT Plus currently costs $20/month and comes with faster response times and priority access.



AWS Releases New Graviton3-Based General Purpose (m7g) and Memory-Optimized (r7g) EC2 Instances

MMS Founder
MMS Steef-Jan Wiggers

Article originally posted on InfoQ. Visit InfoQ

Amazon Web Services (AWS) has announced the release of new Graviton3-based General Purpose (m7g) and Memory-Optimized (r7g) Amazon EC2 instances, providing customers with enhanced performance and cost savings.

The release of m7g and r7g follows the earlier release of C7g instances that were the first EC2 instances running on Graviton3 processors, which provide up to 25% better compute performance, up to 2x higher floating-point performance, and up to 2x faster cryptographic workload performance compared to AWS Graviton2 processors. 

Graviton3 brings more benefits than Graviton2 with support for DDR5 memory, which provides 50% more bandwidth than DDR4. In addition, AWS Graviton3 processors in m7g and r7g instances deliver up to 25% better performance than the equivalent sixth-generation (m6g and r6g) instances while reducing carbon emissions.


Source: https://aws.amazon.com/blogs/aws/new-graviton3-based-general-purpose-m7g-and-memory-optimized-r7g-amazon-ec2-instances/

According to AWS, the m7g and r7g instances built on the AWS Nitro System offer up to 64 vCPUs, up to 512 GB of memory, and up to 30 Gbps of network bandwidth, making them suitable for a wide range of workloads. The m7g instances are designed for general-purpose workloads, such as application servers, gaming servers, and microservices. At the same time, the r7g instances are optimized for memory-intensive workloads, such as in-memory databases and real-time big data analytics.

A respondent in a Reddit thread wrote:

The main advantage of Graviton is that you get a core per vCPU instead of a hyper thread and that they cost almost half as much overall. So yeah, you’re seeing an advantage if you need many cores for CPU-bound processes. For instance, now you can use r6g.xlarge or r7g.xlarge instead of m5.2xlarge.

Public cloud providers are increasingly adopting Arm-based CPUs; significant providers include Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform. For example, last year Microsoft announced instances based on Ampere Altra processors. Furthermore, Google offers Arm-based Compute Engine instances called Tau T2A (also built on Ampere Altra processors). Steve Brazier, president and CEO of market research firm Canalys, believes that by 2026 some 50 percent of CPUs sold to the public clouds will be Arm-based.

M7g and R7g instances are currently available in the US East (N. Virginia), US East (Ohio), US West (Oregon), and Europe (Ireland) AWS Regions in On-Demand, Spot, Reserved Instance, and Savings Plan form.

Lastly, for customers looking to move their applications to Graviton instances, AWS recommends the AWS Graviton Ready Program and other resources such as the Porting Advisor for Graviton and the Graviton Fast Start program.



AWS Creates New Policy-Based Access Control Language Cedar

MMS Founder
MMS Matt Campbell

Article originally posted on InfoQ. Visit InfoQ

AWS has created Cedar, a new language for defining access permissions using policies. Cedar is currently used within Amazon Verified Permissions and AWS Verified Access. Created by the AWS Automated Reasoning Group, Cedar is designed to be agnostic of AWS and to make the effects of policies simple to understand.

Cedar policies consist of an effect, a scope, and a condition clause. The effect is either permit or forbid. As with IAM policies, an explicit forbid overrides any permit statement. However, an explicit permit is required to gain access, as requests are implicitly denied.

permit(
    principal,
    action == Action::"connectDatabase",
    resource == Database::"db1"
) when {
    context.port == 5432
};

The scope is used to specify which principals, actions, and resources the policy affects. It can be left undefined, which causes the policy to be applied to all possible requests as long as the condition clause is met. Ian Mckay, Cloud Principal at Kablamo, notes that

The scope is generally used for role-based access control, where you would like to apply policies scoped to a specific defined or set of resources, actions, principals, or combination thereof.

The condition clause is used to specify the context in which the effect applies. Typically there will be one or no condition clauses, but Cedar supports any number of clauses. There is a basic set of operators available including comparison operators, Boolean operators, and collection operators such as exists, in, and has. Mckay notes that condition clauses are “intended to perform attribute-based access control”.

Cedar has built-in support for defining policy templates. A policy template can be used to simplify creating policies that have similar structures but differ only in principal or resource keywords. If the base template changes, any derived policies will be automatically updated. The question mark operator is used to provide a placeholder for later variable substitution:

permit(
    principal == ?principal,
    action == Action::"download",
    resource in ?resource
) when {
    context.mfa == true
};

Cedar also has built-in support for IP addresses and decimals through extensions. Extensions are called using a function-style syntax and must be operated on using the built-in methods. A policy that confirms that the IP address is within a valid CIDR range would look like this:

permit(
    principal,
    action,
    resource
) when {
    ip(context.client_ip).isInRange("10.0.0.0/8")
};

Designed by AWS’s Automated Reasoning Group, Cedar is built so that automated reasoning tools can be created to analyze policies. Reasoning, as it relates to policies, could mean attempting to deduce whether two policies are equivalent or whether a particular policy will grant access. According to Byron Cook, Distinguished Scientist at AWS,

An automated reasoning tool does this work for us: it attempts to answer questions about a program (or a logic formula) by using known techniques from mathematics.

Reception to the news was mixed on social media with user rendaw sharing that “rather than make a new language, they should have made a WASM or eBPF API and just let people use the full power of whatever language they want.” User Taikonerd worried that the agnostic nature of the language won’t be effective, stating that “my worry is that there will be statements that only make sense with one cloud provider”.

At the time of writing, Cedar is available within Amazon Verified Permissions, AWS Verified Access, and the Cedar playground.



Deploy MongoDB in a Container, Access It Outside the Cluster – The New Stack

MMS Founder
MMS RSS

Posted on nosqlgooglealerts. Visit nosqlgooglealerts


How to deploy a containerized version of MongoDB and connect to it from a machine or service outside of the hosting server.



At some point during your journey with container development, you’ll have to deploy a database to serve as a data storage facility for your applications. Or maybe you plan on using a NoSQL database for analytics and you’d rather deploy that database as a container, for its convenience.

That’s part of the beauty of Docker containers … they’re simple to deploy and use. Consider MongoDB. I’ve had numerous instances (especially when deploying a newer version to my go-to server distribution, Ubuntu), where the mongod service simply won’t start.

However, if I deploy that database as a container, I rarely (if ever) have problems. I can deploy that MongoDB container in seconds and connect to it from outside the cluster.

Hold up a moment. If you’re savvy enough, you know that it’s not so simple. No matter how you deploy MongoDB, the default configuration doesn’t allow connection from anywhere but localhost. So how in the world would you be able to deploy a containerized version of MongoDB and connect to it from a machine or service outside of the hosting server?

There’s a trick to that.

Now, I’m not saying this is the ideal method of using MongoDB as a container. What it does offer, however, is a really good insight into how Docker containers can be used and I’m going to show you how.

Are you ready for this?

Requirements

In order to successfully pull this off, you’ll need an operating system that supports Docker. I’ll demonstrate on Ubuntu Server 22.04, but you can use the platform of your choice. The only thing you’ll have to alter is the installation process for Docker (because I’m going to show you how to take care of that as well).

Let’s do it.

Installing Docker

In the name of not making you read yet another article, let me show you how to install Docker on Ubuntu Server. It’s actually quite simple…just cut and paste the following commands into the Linux terminal window.

To begin with, you must add the official Docker GPG key with the following command:
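
Docker's documented approach for Ubuntu is a reasonable sketch of what this command looks like (the keyring path below is the conventional location):

sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg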

The next step is adding the Docker repository, which can be achieved with:
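
Assuming the keyring location used above, the repository can be added like so:

echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null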

Install the necessary dependencies with:
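
The packages typically needed for this setup are:

sudo apt-get install apt-transport-https ca-certificates curl gnupg lsb-release -y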

Update apt with:
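
On Ubuntu, that is the standard package index refresh:

sudo apt-get update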

Install the Docker Community Edition with:
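
A typical install command for the Community Edition packages is:

sudo apt-get install docker-ce docker-ce-cli containerd.io -y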

Add your user to the Docker group with the command:
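
For the current user, that would be:

sudo usermod -aG docker $USER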

Log out and log back in so the changes take effect.

Outstanding work so far.

Deploy the MongoDB Container

It’s time to deploy the MongoDB container. Just for fun, let’s deploy the container with persistent storage. First, we’ll create a directory on the hosting machine to house the data with the command:
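
The directory name and location are up to you; as an example:

mkdir -p ~/mongodb/data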

With our directory to house persistent data ready, deploy the container with:
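
A command along these lines should work (the image tag, published port, and volume path are illustrative; the container name example-mongo is reused later in this walkthrough):

docker run -d --name example-mongo -p 27017:27017 -v ~/mongodb/data:/data/db mongo:latest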

To verify the database container is running, issue the command:
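
The docker ps command lists the running containers:

docker ps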

You should see something like this in the output:

Congratulations, your database container has been deployed and is ready for configuration.

Configuring the Database for External Connections

At this point, the container is up and running, but only accessible from within localhost. In other words, you can’t reach it from outside the container. Fortunately, this fix is rather simple. First, you must know how to access the Bash prompt of a running container. Remember, when we deployed our container above, we named it example-mongo with the option --name example-mongo. We’ll use that name to access the container with the command:
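
Assuming the container is named example-mongo as above:

docker exec -it example-mongo bash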

You should now find yourself at the Bash prompt of the container, which will be denoted by something like this:

Before we do anything, let’s install the nano text editor (because it’s far easier to use than the included vi). To do that, update apt with:
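
Inside the container (which runs as root, so no sudo is needed):

apt-get update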

Once apt is updated, install nano with:
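
Inside the container, that is simply:

apt-get install nano -y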

Open the MongoDB configuration file with the command:
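
The exact path is an assumption here; in the official MongoDB image, the packaged sample configuration file typically lives at /etc/mongod.conf.orig, so the command would look like:

nano /etc/mongod.conf.orig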

Locate the following section:
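
In a default mongod configuration, the relevant block looks roughly like this:

net:
  port: 27017
  bindIp: 127.0.0.1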

You must change that section to:
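
To accept connections on all interfaces, the block becomes:

net:
  port: 27017
  bindIp: 0.0.0.0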

The above configuration opens MongoDB to any connection. If you want to limit to a specific IP on your network, you could configure it with something like this:

Or, if you want to limit to all machines on your network, you could use something like the following:

Save and close the file with the keyboard combination [Ctrl]+[X]. Exit from the container with the exit command. Since we’ve made changes to the database configuration, the easiest method of restarting MongoDB is to simply restart the container itself. To do that, we first must locate the container ID, which is done with the command:
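
Back on the host, list the running containers again:

docker ps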

You’ll only need the first four characters of the container ID. With those in hand, restart the container with:
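
Substituting the ID you found:

docker restart ID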

Where ID is the ID of the example-mongo container. If you then access the container’s Bash prompt again, you can view the MongoDB configuration file and see your changes are still intact.

One thing I’ve found to be a big help with MongoDB is installing the MongoDB Compass app, which is a GUI tool for managing your databases. With that installed, you can connect to the containerized MongoDB with a connection string like:
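
Assuming SERVER_IP stands in for your hosting server's address and the default port mapping from earlier:

mongodb://SERVER_IP:27017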

From within Compass, you can create/edit databases/collections.

At this point, you can now access your MongoDB container from beyond the hosting server, which means you can use that database for a number of purposes. But, more than anything, you’ve learned how to install Docker, deploy a container with persistent storage, access the container, configure a service, and restart the container.

I would also be remiss if I didn’t mention this isn’t exactly the most secure method of using MongoDB. After all, you’d want to create a database user with access to a specific database, but that’s outside of the scope of this piece. However, if you know how to use MongoDB, you already know how to create an authenticated user.

Even so, this is a great way to learn the ins and outs of deploying a Docker container.


TNS owner Insight Partners is an investor in: Docker.



GitHub Enhances CodeQL, Extends Language Support, Available Queries, and More

MMS Founder
MMS Sergio De Simone

Article originally posted on InfoQ. Visit InfoQ

After adding support for Ruby at GitHub Universe 2022, CodeQL introduced Kotlin support in beta. Additionally, support for other languages has been extended to include more recent versions. GitHub has also extended available queries to fully cover several industry-wide vulnerability directories, and improved the CodeQL ecosystem.

CodeQL powers GitHub’s semantic code scanning. Based on the Semmle semantic code analysis engine, CodeQL allows you to define queries to find, triage, and prioritize fixes for issues, including security issues such as remote code execution (RCE), SQL injection, and cross-site scripting (XSS). CodeQL comes with a library of ready-to-use open-source queries that help identify coding patterns that hint at known vulnerabilities and their variants.

The use of queries might sound complex, but we make it easy by providing out-of-the-box queries, written and curated by GitHub researchers and community security researchers, covering everything from the most critical to common vulnerabilities.

Kotlin support is an extension of existing Java support, with the inclusion of a number of Android-specific queries related to intents, fragments, WebView validation, etc.

Kotlin marks our first investment in mobile application security, and beta support for Swift will be coming later this year.

As mentioned, existing support for other languages has been extended to fully support Java 19, Go 1.19, and Python 3.11.

To make CodeQL more effective, GitHub has extended its collection of queries, which now includes 318 security queries by default and can be brought up to 432 with an additional query pack. CodeQL can be used along with Dependabot alerts to increase the coverage of your code with respect to several vulnerability directories. This includes all applicable OWASP categories, the SANS CWE Top 25 most dangerous software errors, and 100% of the applicable Web Application Security Consortium (WASC) categories.

Other improvements to the CodeQL experience include CodeQL pack support on GitHub.com and GitHub Enterprise, support for query customization and filtering, and analysis speed that is now 16% faster.

As a final note, GitHub is also providing access to stored CodeQL databases for popular open source projects, which contain a representation of the codebase including its abstract syntax tree, the data flow graph, and the control flow graph. That information can be used by security researchers for variant analysis, useful to find similar problems in other codebases.



Podcast: Being Opinionated About the Engineering Culture you want to Build

MMS Founder
MMS Viraj Mody

Article originally posted on InfoQ. Visit InfoQ

Transcript

Shane Hastie: Hey, folks. QCon London is just around the corner. We’ll be back in person in London from March 27 to 29. Join senior software leaders at early adopter companies as they share how they’ve implemented emerging trends and best practices. You’ll learn from their experiences, practical techniques and pitfalls to avoid so you get assurance you’re adopting the right patterns and practices. Learn more at qconlondon.com. We hope to see you there.

Good day, folks. This is Shane Hastie from the InfoQ Engineering Culture Podcast. Today I’m sitting down with Viraj Mody. Viraj is the co-founder and CTO of Common Room. Viraj, thanks so much for taking the time to talk to us today.

Viraj Mody: Thanks for having me, Shane. I’m glad to be here.

Shane Hastie: My normal opening question is who’s Viraj?

Introductions [00:55]

Viraj Mody: So, professionally, I am co-founder and CTO of up-and-coming startup called Common Room. And then outside of work, obviously I have a beautiful family and a couple kids and I try to play the best dad and husband I can. And prior to my time at Common Room, I spent a bunch of years early on in my career at Microsoft and then a tiny startup of my own with a couple other co-founders that ended up being acquired by Dropbox. So spent a few years at Dropbox, that was a wild ride. And then I was really curious about how marketplace software works and so worked at this company called Convoy, which was a trucking marketplace and then ended up reuniting with one of my previous co-founders, but also a couple other brilliant folks I bumped into in Linda and Francis to start Common Room.

Shane Hastie: So, tell us a little bit about what Common Room is there for?

Viraj Mody: Common Room is an intelligent community growth platform that helps organizations deepen relationships with their customers and their community wherever it exists, with the end goal of helping to build better products and drive business impact. So, you can think of Common Room as bringing together community engagement, product usage, customer data and business context into one single spot. And then from there provide realtime intelligence to various people in your organization, a more holistic context about who they’re interacting with. So if you’re on product and engineering, you not only see how somebody’s using your product, you see what they’re saying about it, where they’re talking about it, what they like, what they don’t like. If you are in sales and marketing or go to market, you can get context right alongside your CRM platform that then ties into all of the other surfaces where your community exists already.

Shane Hastie: One of the things that attracted me to having this conversation was your focus on developer experience. What makes a great developer experience?

What makes a great developer experience? [02:57]

Viraj Mody: I guess there’s internal engineering culture aspects to developer experience that I spend a lot of time thinking about, as well as external like developer, customer relationship and developer experience as it relates to customers of your company. I think both of them are equally important. You can’t get one without the other. And in many ways, one reflects the other. So, I am a customer of many other companies’ APIs and developer platforms. And similarly, many other engineers are customers of my developer platform. So I think the number one factor for me is a deep empathy with the people you’re serving and building APIs for. As we work on exposing parts of a platform, whoever it is, not just Common Room, deeply understanding what the customer needs and what pain you’re solving for them, is probably the place to start. Instead of the goal of, “I want to publish a platform,” or, “I want to have APIs,” or, “I want to build something that helps me build relationships with developers.” Like that’s a means to an end but not the end in itself, if that makes sense.

Shane Hastie: So that’s one part, the deep understanding of the problem you are looking to solve when making something available for developers. What about the cultural aspects?

Being opinionated about building the culture you want [04:16]

Viraj Mody: Earlier in my career I had a very different approach to what good engineering culture looks like, and I’ll probably focus on the internal culture within a company, but I feel like you can derive from that. Over time though, I’ve become more and more convinced that opinionated engineering culture is the way to win. And I think it’s a direct function of the team you’ve built. And so being opinionated about the team you’re building, who is part of it, how you recruit, what are the characteristics of individuals that would be a good fit for your team, and on the flip side would not be a good fit for your team. And then using those to build an efficient and high velocity engineering organization. Now it’s easy to think of this as homogenous, that’s very different. I think there’s still room for a lot of diversity of perspectives and views and experiences and backgrounds while still building a culture that is convicted in how it operates. Happy to dig deeper into that, but that’s a pretty high level, maybe full of buzzwords answer.

Shane Hastie: Yeah, let’s dig deep. Let’s go down. So opinionated, when you say an opinionated culture, how does that show up in the world?

Viraj Mody: So for us at Common Room, it shows up across the stack all the way from our technology choices that we make. One example of a very opinionated approach for us is boring technology is good technology. We’re all fascinated by new developments in different parts of the tech stack. Everyone’s keeping track, there’s mind-blowing new capabilities that you see every day, but we’re building a business that serves real customers, has real SLAs that we need to go fulfil, and so tried and tested technology is what we’d bet on. That way we can move fast, we have good ecosystems we can rely on, we have good documentation tooling. A similar principle we use for opinionated tooling is just simplicity. When you work with really smart people, they can handle complex problems and they can deal with complexity. It takes even more layer of experience to realize that here’s complexity you want to avoid or look around corners.

And that leads to sort of the next opinionated approach around how we’ve built our team. I was a new grad with starry eyes once upon a time thinking I knew everything about everything, clearly I was wrong back then. But our team today is mostly made up of people who’ve been in the industry and have had experiences at various companies of various types in the past, and that’s very deliberate because of how we work. I think we would not be a great environment for somebody who doesn’t come with a bunch of experience because honestly for our size and scale, we haven’t invested in what it would take to onboard and make successful a new grad version of me. But with the team we have and the processes and tools we have, we are able to move extremely fast. We have a lot of guardrails in place, we have a lot of support in place, but it’s all geared for the kind of individuals we have. So that’s opinionated in terms of hiring and recruiting. Yeah. Does it give you a little bit of a flavor for what I meant?

Shane Hastie: So you touched earlier on and the risk here is the monoculture. How do you make sure when hiring in this opinionated way that you are, the buzzword term at the moment is hiring for culture add rather than culture fit?

Hire for competency and culture add [07:57]

Viraj Mody: That’s a great point. So my approach to that is thinking in terms of competencies and realizing what competencies we have a weakness in or have blind spots in and helping close those gaps. And a derivative of a lot of those competencies is the experiences that people have had to be able to develop those competencies. So your classic infrastructure, heavy experience engineer probably has a spot on our team, but you get a couple of those and you’re good enough and then you realize, “Look, now we lack competency in, I don’t know, front end.” And then on front end again, once you sort of marry the desire for the expertise you need along with the principle, the how you work, you end up creating a pretty clear picture of the kind of skill that is needed, the kind of person that might be a good fit for how you operate.

So a lot of times when I think of culture fit as you mentioned, it’s more about does the person’s ability to operate line up with how you operate? I don’t think it’s too much more than that, kind of reductive way of thinking about this is to break it up into components that are easy to stereotype. But if you stick with principles around scale and competency and ability to operate in a certain environment, that helps you keep a lot of the biases off the table. And then obviously making sure that the interview process is structured to extract maximum signal. Another opinionated approach if you will is, I want to hire people for their strengths, not for everything that may not be great about them. I’m sure all of us are great at a few things and terrible at many, many more things.

If we start looking for the things people are terrible at and then use that as a reason to exclude hiring them, that’s a pretty slippery slope to get down. And so once you have a clear picture of the scale and the kind of operating background or the ability for people to execute the way you want to, you can work into a pretty clear process that helps you evaluate their strengths that line up with those. So, I don’t know if this is infallible, but it’s been our approach and it’s helped us build a pretty good team that people from diverse genders, diverse backgrounds, diverse skill sets. And the bonus that I’ve observed over the last few years is we learn from each other because we come from a very similar place of how we work, but our experiences and skills are so different that that’s where we focus on how we learn from each other.

Shane Hastie: So that learning, engineers are by nature, broad generalization here, are curious, are interested in learning. How do we make space for that learning in the pressures of building a startup?

Support people to grow in areas they are interested in, even if it’s not their core competency [10:43]

Viraj Mody: You can try and aspire and often succeed and sometimes fail, that’s been my experience. So transparency I think is the baseline in terms of being transparent with your team about what you can and cannot offer them in terms of experiences and opportunities, and then expecting transparency from your team about what they want to learn and what they want to experience. And once you have that as a baseline, then it’s very easy to make those connections and open up opportunities for people to do stuff. A simple example, we have engineers who haven’t been exposed to the business side or the customer facing side of companies in the past, and they express the desire to learn more about those aspects of what it takes to build a company so that in the future if they want to do their own startup, they obviously have the technology side of it covered. How do they learn more about running a business?

Once you know that it’s very easy to make that opportunity available to the team because in the back of my mind I know it’s like, “Hey, we have this customer call coming up, this one particular engineer is really curious about how that works. Let’s get them on the call.” And at our scale it’s really easy to do this. There are very few silos, very few boundaries. And in many ways, Common Room, the product aspires to do this for companies at various types of scale. Because I remember from my time at Microsoft I learned a lot because it was my first job but I never interacted with the real customer other than myself. And that was really bizarre now that I think about it, it was just impossibly hard. I’m sure it is at other large companies too, to really speak with customers because there’s layers upon layers of product managers and product marketing and sales and who knows what, customer service.

But the ability to directly interact or hear from or see the raw words that customers are saying about stuff you’ve built, about stuff you are tasked with building can be game changing. Going back to your original question around building developer relations and that empathy component, when it’s filtered through different layers, it’s always hard to tell what the raw feedback was, what the raw sentiment was. But if you see it and you have the capability of seeing who said the thing, in what context, where and what were the exact words, that just helps you solve for their problems in an exponentially efficient way.

Shane Hastie: So let’s segue and talk about community. What does community look like in the developer experience world?

What makes a community? [13:17]

Viraj Mody: I feel like people try and silo that into a single platform or a single program or a forum or whatever is happening on your community.company.com domain. I think community encompasses everyone who’s engaging with your organization or your product or your APIs or your platform. Everywhere they’re engaging with you, it’s made up of your product users who are current customers, it’s made up of your biggest champions and fans, it’s made up of your detractors, it’s made up of people who may have heard about your product but aren’t using it yet.

And more importantly, it exists even if you don’t know it exists. I feel like that’s the one thing I see most over the years, is organizations or companies feel like they have to go find and develop a community but one exists already. You just have to start with it. You just have to discover where it is. There’s people talking about you on Twitter, there’s people talking about you on LinkedIn, there’s people talking about you in forums, there’s people trying out your product and leaving reviews. All of those are part of your community, whether you know it or not.

Shane Hastie: How do you tap into that community?

Viraj Mody: Well, ideally you would use Common Room as a starting point. Yes, I’ll stop shilling for my own product. But no, I think that is really the key question. I feel like today it’s incumbent that you understand your customers and build for them, otherwise you’re bound to fail. And if you try and bring your customers to a particular place, that’s not going to work. People are where they want to be and you have to meet them where they are. And I think that’s the big challenge of discovering and interacting with your customers today because the ecosystem is just so vast. Social media has made it possible for anybody to have an opinion, anybody has a blog, you produce a YouTube video criticizing a product, praising a product, tutorial on a product. And I think that is a key challenge that is emerging for fast growing companies is how do you keep up with all of these platforms?

It’s just physically impossible to have somebody watching everything. And then even if you are able to do that, how do you at scale separate noise from signal? I’ll give you an example. We have customers who have the craziest fans in the best way possible producing content on YouTube and writing articles on blogs or tweeting out tips and tricks on how to use products more efficiently, which is great signal. But at the same time you have a bunch of people retweeting that stuff that doesn’t really count because it’s already something you knew about or some bot producing spam of some sort that just happens to match keywords that you might be listening for, it can get overwhelming. It’s really tempting to try and want to keep up with everything and have this massive spreadsheet where you paste in a link to every tweet and every video.

But even if you’re moderately successful that thing’s not going to scale. So I think you’re correctly highlighting one of the biggest challenges I see is keeping up with what your community is saying where they’re saying it, while resisting the temptation or trying to funnel them into one spot. Right? Someone could very reasonably argue that the right strategy is no matter where I discover people talking about my product, I’m going to try to pull them into my forum software or my Slack channel or my Discord server. And I believe that is doomed to fail because in 2022, people expect to be where they’re comfortable versus trying to be forced to communicate with you in a certain place.

Shane Hastie: Shifting focus, you’ve got a solid career of building companies in that tech space. What advice would you give the new leader in a team? The recently, I was the best technologist and now I’m being promoted into the team lead role. What should that person be doing?

Advice for new technical leaders [17:05]

Viraj Mody: A few things. One that I suspect most, literature already goes into depth about for such transitions is understanding that this is a career shift instead of a progression. Being a great engineer and leading a great engineering team are two very different competencies, require two very different sets of skills and strengths. Just because you may be great at one does not automatically imply that you will be great at the other or you may be great at the other. So having that awareness, I think is important. Everybody including me, this is a transition that may or may not work, and knowing that is I think important. That will be probably the first piece of advice I’d give someone contemplating this. I would probably have more textbook advice that I’m sure people can go look up. So I’ll probably give a hot take version of it.

I think it’s important to believe in your instinct and your conviction once you’ve understood that this transition is for you. Because ultimately, you are still leading a team of engineers. And assuming you have the skill to be a leader and a manager, you still are a damn good engineer. Switching into this role does not make you any less of an engineer than you were before. I feel like there’s this trope in the industry that engineers should not be writing code and engineers should not engage in technical discussions, leave that to the tech leads or the engineering managers. I think that’s not the right way to build a company or a highly technical organization. If you are an engineering leader, you’re an engineering leader, not a manager. And therefore don’t fight your instincts about good engineering intuition. I see that as something that’s been propagated more recently in the industry.

As you know, a manager is a manager and they’re here to run processes and project manage and deal with career progression. Sure, but they’re still engineers first. So don’t hesitate to lean into your intuition and your technical strengths, would be another one. And then I think this is again generic but worth repeating for anybody considering a transition like this: your gratification model has to change. When you are an engineer, you know, you write code, you ship, it works and that’s how you get gratification. There was a bug, you solve it, you fix it, it’s no longer there. That’s your gratification. When you’re leading an organization or a team, you are a couple of degrees removed from that instant gratification cycle.

An engineer on your team fixes a bug and feels good about what they’ve done and somehow that has to translate to you feeling like you did your job, or people on your team solve problems and succeed in their careers and are successful. And that has to result in you feeling gratification for you doing your job. So the time delta between you doing a good job and you recognizing that it was effective, is much, much bigger as you transition more into leadership. Getting comfortable with that and recognizing that I think is important. And to the flip side, some of the more terrible managers are the ones who do not understand that and try to either steal glory from their team or deflect blame to their team, because it cuts both ways. You seek gratification from their success, but if things don’t work out, you’ve got to be the one on the front lines for it.

Shane Hastie: Thank you. Looking back over your own career, what’s the thing that has given you the most satisfaction?

Reflection on career satisfaction [20:40]

Viraj Mody: For me, knowing that I can take risk that’s appropriate to the reward without the fear of taking the risk, is probably the thing that I have seen myself do best and learned to really appreciate over the years. I have made a bunch of career transitions that many people would consider too risky to do, and my calculus has always been based on some understanding of risk and reward and over time that keeps getting more and more sophisticated.

So there’s obviously, I could speak loads about technical details or things I’ve learned or grown there, but I feel like a really high level representation that applies both to career things, technical things, leadership things, is just understanding what risk is worth it and what risk is not worth it. Again, this is very subjective, it’s very personal. Different people have different tolerances based on their life circumstances, based on their experiences. So yeah, I think knowing when to back convention if your gut says it’s the right thing to do, and then knowing when you know what, others may be doing a thing, but that’s not for me.

Shane Hastie: Viraj, thank you very much. If people want to continue the conversation, where do they find you?

Viraj Mody: I am on Twitter as @Virajm and email at virajmody@gmail.com. V-I-R-A-J-M-O-D-Y@gmail.com. Happy to chat with folks wherever.

Shane Hastie: Thank you so much.

Viraj Mody: Thank you.



Presentation: The After Party: Refactoring After 100x Hypergrowth

MMS Founder
MMS Phil Calcado

Article originally posted on InfoQ. Visit InfoQ

Transcript

Calçado: My name is Phil. I’ve given talks at a few different QCons and other conferences about various topics, but mostly microservices. That’s what I’ve been working with. That’s what I’ve been around in terms of community, technology, process, everything, for the better part of the last 10 years. When thinking about microservices, there’s different stages of growth and understanding as an industry of where we’ve been. I think probably now, we’re late stage microservices. We look at the consequences of our actions. We’re looking at the consequences of the decisions we made 10, 5, whatever many years ago, and how to evolve all these different pieces of technology and understanding to something more sustainable. This is interesting to me in particular, because historically, like I said, I’ve been in the microservices scene for quite a while. I was lucky enough to work in some of the pioneers in this space, being Thoughtworks, SoundCloud, DigitalOcean, and many others. A lot of what my experience has been so far has been a little bit been there for the pregame, or the party, like just before things start heating up, either as a consultant. I would come and help clients and figure out that actually the best way was to split big services into smaller ones, or growing a company like SoundCloud, DigitalOcean, and a few others, in terms of, we are growing a lot, our business is growing in interesting ways. We need to adopt an architecture that allows for us to grow the way we need to.

This often has to do with a little bit of hyperscale, growing like crazy, hiring too many people. One thing I’ve learned going through this process a few times, is that the most important thing you can do, like really the basics, is that, first, you really should think about the whole situation as you step into a more distributed architecture style as a city planner, not a bricklayer. This is from SimCity. Basically, if you haven’t played SimCity ever, what you do is you manage a city. What the name says. It’s a city simulator, but you don’t manage any specific buildings. This is a screenshot of the game, like a little animated thing. You have no control over any of these particular buildings, cars, what have you. All you can do is just say, this is a residential area. There’s a fire station here. There’s a police station over there. There’s a park here. You cannot build this way. A road goes from A to B. You don’t have control over what gets built or what motivation people have to go along with exploring the city, you just frame the problem and let people go. This is a lot of what I have experienced building a company based on microservices, or product engineering especially. Specifically, a product engineer organization from microservices, building from the ground up. The thing to me is really to think about the whole problem as a city planner, not a bricklayer. You really need to think about how you define your whole city, which eventually is your company.

Procedural Generation

Another image metaphor, analogy I like to use is a little bit like procedural content generation, it’s something that I find fascinating. Basically, procedural code generation or content generation is a technique by which you feed a computer with some constraints, some logic, some rules, and an algorithm and a random number generator, and just let it go. Then you produce maybe Dungeons for your RPG game. There is levels for your first-person shooter or whatever it is, with the idea that a person wouldn’t have to explicitly draw levels for a game, or what have you. The computer can generate that based on these rules that you’ve created. To me, this is so similar to what happens when you are starting a company, starting from the ground up, or maybe growing a company using microservices. What you need to do is you need to define a set of constraints to define the boundaries of the problem and let people go, because you’re hiring thousands of people every month or every year. You will absolutely not be able to control what these people are going to do. These people are going to go. You hire them for their actual ability to build wonderful things, but you really need these things to talk to each other, to have some level of governance, so that the whole situation is manageable.

Ultimately, you’re not controlling each specific piece. This is extremely important. It’s very hard as somebody with a technical background, somebody who is used to write code to understand that, no, you don’t own it anymore. That’s how you do. You define the rules, and you let the organization go. Sometimes it’s called enterprise architecture, although there’s different flavors, but it’s how I see it. The interesting thing that I was talking about how in procedural generation for content, you let the computer randomness go about it. Obviously, when you’re talking about microservices, you let your organization’s randomness which exists, just like people come, people go, different projects are staffed or killed, these things come and go. What you need to do is to define the rules that will let the company generate the content here being the microservices or what have you. That’s how I’ve been successful running companies based on these highly distributed architectures we call microservices.

PicPay (Digital Wallet)

Then more recently, I’ve been facing a new challenge. I work for this company called PicPay. PicPay is the largest digital wallet in Latin America. We do peer to peer payments, merchant payments, everything around finance, both for consumers and enterprises. It’s like a lot of projects, basically what we call usually a Super App focused in Brazil. If you’re Brazilian, you for sure know the app, and what it does, because that’s what everybody down there does use for payments and various other things. It’s a very famous brand within this demographic. The most important thing for this talk is not so much what PicPay does. You can think of it just like a payments application, a little bit of Cash App and Square here in the West, Venmo, AppBrain, similar to what we do down in Brazil. Basically, the important thing here is how big PicPay grew. I’m used to wild growth and what we generally call hyper-growth. This is a whole new level, at least to me. This is a chart of our team size. We started with a few people, as small offices often do. Not quickly at first, because these people had been working on this app for a very long time, like six, seven years. Then eventually, market conditions, funding, everything comes together and we need to scale this business up. We go from one product to over 30, and quickly moved from 50 people in engineering to about 2000 now. We are I believe 3500 people right now across the whole business, there’s like the business side, support, everything. We are about 2000 engineers or people who work in engineering capacities, managers, and data scientists. This is wild. To me, it’s a good representation of the challenges we have right now, which is really like, we’ve scaled so much. How did we do it over such a quick period of time? We had to make a lot of decisions. We had to go about things in a way that allowed for us to scale like that. The decisions we had to make as we’re growing, the decisions we had to make during this process are not necessarily the best decisions for the current phase of the company, which is a little bit of the reason I and a few other people were hired and put into this place. It was like, ok, we need folks to look at what we have and put us back on track to keep growing as we’re doing, but in a more sustainable way.

This is an interesting challenge. The way I see it, for the first time I've been invited to the after party, because I've been playing this startup game for a little while. More often than not, you are there from the beginning; the company succeeds or not, maybe makes money, maybe doesn't. There are a lot of different things, and your interest changes. Maybe after a few years it's like, ok, I've done what I had to do here, and you move on to a different thing. This time, my team and I are facing the challenge of a highly successful company that grew extremely quickly, and now we need to make sure we keep executing at a pace that allows us to keep playing the game in the financial business we're in, which is highly competitive. There are old-school banks. There are new banks popping up every day. There are new apps. There's regulation. There are all sorts of different things. We really need to make sure that we are on the right track when it comes to our experimentation and our product quality.

1. Stop the Bleed

It's important for me, also, to give you a disclaimer: this is all work in progress. I've been at PicPay since January; this is my three-and-a-half-month mark now. I think I know most people's names. I'm happy to share a few of the things that we've been doing, that we started doing. If you catch up with me in about a year, I can probably update you on which of these things changed or maybe didn't work out so well, what we keep investing in, and what we found works and doesn't work in this scenario. Because there's so much to talk about, so many areas, from an organization's technology, to architecture, to many other things, I want to focus on three pieces of advice that I've been giving people in similar situations. Also, I'm hiring a lot of senior leaders to join the team and help me build the organization, and this is the advice I've been giving them about coming to PicPay, what I think is the right thing to do when joining during the after party.

The first thing, and I think the most important, is stop the bleed. A lot has happened in this company and possibly in your company. There are a lot of different teams, projects, things going on. In a company with 2000 people, at any given time there are going to be 10 new projects popping up, 10 new systems popping up. There are things you possibly can't know about that are being done right now that are going to impact your architecture, your technology, your strategy, really quickly in the future. The most important thing to do as a leader coming in, whether you're a senior engineer, a manager, a director, a CTO, whatever you are, the first thing you should focus on is to stop the bleed. Make sure that we're not making the problem worse every day. Whatever new systems, whatever new things pop up, should not take you a step back; they should move you forward.

One of the first things you need to do around this is to make explicit the rules that you want people to follow, even if they're not strictly enforced. The last part is very important, but so is the first: make the rules explicit. It's something I believe you need to do even when you're growing with microservices, when you grow in a more organized way from the beginning; again, it's the procedural content generation we're talking about. You need to make sure that the constraints you want followed are very explicit, and encourage people to follow them. What are these constraints? For example, you could decide that we use RESTful designs for our applications here, we use HTTP and all the niceties around that. Or, no, actually, we use gRPC and don't want to see any HTTP endpoint ever. All the decisions you need to make around observability, telemetry, how things talk to each other, how things are deployed, those are the rules I'm talking about here. Whether you're growing with microservices or refactoring an existing microservices setup, you really need to make sure these rules are explicit and clear.
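
To make this concrete, here is a minimal sketch of what "making the rules explicit" could look like as a machine-readable artifact rather than just a wiki page. Everything in it (the rule names, the manifest fields, the example service) is hypothetical and not taken from PicPay or the talk; it simply illustrates rules that are declared and checked, but promoted rather than hard-blocked.

```python
# Hypothetical sketch: architectural rules captured as data so CI can flag
# (not block) services that drift from the paved path. All names are made up.

from dataclasses import dataclass

# The explicit rules the organization wants to promote.
RULES = {
    "rpc_style": "grpc",           # e.g., "we use gRPC, not ad-hoc HTTP endpoints"
    "telemetry": "opentelemetry",  # every service exports traces/metrics this way
    "deploy": "ci-pipeline",       # no manual deploys
}

@dataclass
class ServiceManifest:
    name: str
    rpc_style: str
    telemetry: str
    deploy: str

def check(manifest: ServiceManifest) -> list[str]:
    """Return a list of rule deviations; the point is visibility, not enforcement."""
    deviations = []
    for rule, expected in RULES.items():
        actual = getattr(manifest, rule)
        if actual != expected:
            deviations.append(
                f"{manifest.name}: {rule}={actual!r}, paved path is {expected!r}"
            )
    return deviations

if __name__ == "__main__":
    legacy = ServiceManifest(
        "payments-ledger", rpc_style="thrift",
        telemetry="opentelemetry", deploy="ci-pipeline",
    )
    print("\n".join(check(legacy)) or "all services on the paved path")
```

A report like this is enough to help stop the bleed: new services that deviate become visible immediately, and the conversation can happen before the deviation spreads.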

There's one difference between growing and refactoring that I found. When you're growing with microservices, it's very easy for you to nip things in the bud when they're going awry. If I decide that my service will use Thrift instead of gRPC, when everything else is using gRPC, it's a lot easier as you're growing to identify that and say, "No, you cannot do that. This is not what we do here." Because you only have maybe a few services, everybody knows everybody else, and the company is still growing. When you're refactoring the organization, it's a little harder to enforce things so strictly, because maybe, yes, there's a new service that's going to be built, and it's going to use Thrift instead of gRPC. You've decided to use gRPC, they're going to use Thrift. You could go there and say, "You cannot do this. It is not allowed." Maybe that would be ok, but there's so much entropy. Maybe you're talking about an organization with 100 engineers that has always used Thrift on that side. Your choice is between migrating everything that exists to gRPC or what have you, and allowing just one more case to use Thrift. And then another case after that. It's an interesting balance. In my experience, at first you shouldn't really focus too much on trying to strictly enforce the rules you've set for your organization. What you should focus on instead is promoting the things you want. It's fine that maybe some people are going to use different RPC styles. You really should make sure that whoever decides to use the flavor you have picked, in this case gRPC, has a very seamless experience. That often means you will have some platform team or tooling team focused on making sure that the tools used on the golden path, the paved way, are the best tools, better than the tools people can use elsewhere.

There are a few different sets of rules you can adopt here, various different flavors. My recommendation would definitely be to at least follow some of what we call the prerequisites for microservices, and make sure these things are well established as rules in your organization. This is a list that I've compiled based on the list Martin Fowler originally published, called microservice prerequisites. I think this is the baseline you need to do microservices well at scale. If you find yourself in a situation like this, make sure you have an answer for each of these topics that you want to promote and defend and want your people to follow. In the same vein, making the work visible is very important. When you have a very big organization with a lot of different services, hundreds of services, hundreds of people, maybe hundreds of teams, there's a lot happening. It's never going to hit you until something's in production, which is generally way too late. One thing I really like is to foster a culture of people sharing ideas in different ways, sharing what they've been thinking about and the design of their systems as widely as possible.

There are a few different ways to go about this. One way that I definitely recommend is a structured RFC process. This is one that I particularly like; it's one I've been using for almost a decade now, with a lot of iteration. Basically, the idea is that you ask people to write a document with a specific format that drives the kind of interactions and thinking that are considered good, and you have people commenting on it. There are different ways to handle the logistics. One of the interesting things is that, first, this is not a decision-making process, it's a knowledge-sharing process. What that means is that the person writing the RFC is accountable for the system being proposed, and that person needs to be able to make decisions. You do not want to put anyone in a position where they are accountable for something but don't have autonomy over it. Somebody else makes the decision, but if things break in the middle of the night, I'm the one responsible for it: that's the worst of all worlds, and I really would like to avoid it as much as possible. It's really about writing down what you're thinking and getting input from people all over the company. That's why it's really important to have a mailing list, a Slack channel, whatever forum your organization prefers, to publicize the idea. Give it a timeframe: we have one week to comment on this idea. Know that the decision still stays with the person; the decision still needs to be with whoever is accountable for it. This person should receive feedback from the whole organization and act accordingly. There are a few other interesting details, like expiry dates, review processes, and things like that. I definitely recommend you look at this process.

It doesn't matter whether you use this particular process or not; what matters to me is that you have some way for people working on different systems and services across the company to broadcast what they're doing and get input, which is very important, especially because talent disparity often occurs. It's also a way to share knowledge, so that next time I need something, I remember that two weeks ago somebody was talking about a system that uses Kafka, and I want to use Kafka now in my system; maybe I should read that RFC and talk to that person. One last bit on this: I find this to be extremely important in the more distributed world we're in, where everybody's remote, where people don't go for beers or coffee together so often, and where you lose the serendipity of exchanging information. This is a way to force information to be shared so that, even if you don't immediately need or care about it, it stays in the back of your head, and you can quickly search your email, or Slack, or what have you, to find information about something that was built a while ago.

2. Don’t Miss the Forest for the Trees

The second is: don't miss the forest for the trees. This is extremely important, again, when you're growing an organization based on microservices, and I think it's even more important when you're joining a company that has already adopted them at large scale. Because the first thing you want to do as an engineer is to go, give me access to GitHub, I want to read the code of everything. It's just a natural instinct. That's how we used to do things before these highly distributed applications were so prevalent: you would spend some time reading through at least the most important parts of the code of any system you inherited. In my experience, this doesn't work so well with microservices, because everything is very distributed and the value of looking inside each box is not as big as you might expect. It's a lot more interesting to look at the whole system, the forest here, not the trees. It's an interesting analogy; sometimes it breaks in different and funny ways. But that's the main idea: we really should be looking at the whole ecosystem, the forest, and not so much at the individual trees.

Around this, there are always a few exceptions. The first one I want to acknowledge is that, sure, forest over trees, but you really need to make sure that you identify and fix any hot potatoes immediately. What I mean by hot potato is those systems that are clearly causing trouble: systems that are clearly breaking in production, or not performing very well, or burning too much money, which is a common issue across microservices. Those systems require attention, so yes, do look at them immediately, don't waste any time. But only the ones that are really worth the time, something that's critical, or something that's on fire all the time and causing a lot of problems, because you should really avoid getting distracted by each individual system.

Going back to our SimCity analogy, you really should start thinking about this as a city. Basically, you've inherited this township, this community of people, and it grew organically. Like any organic community, think of a market: if people just start building a market from scratch, there's not going to be a lot of organization. You're probably going to have the person who sells raw fish next to the person who sells some toxic cleaning product, which is probably not a good idea. You really need to step in and look at it: I have all these things, it's a vibrant marketplace of ideas, of projects, of different things. How can I help these folks organize the work they're doing, or structure this community in a way that makes sense?

Clay-to-Rocks Layering Model

I've found a few things in my career that I think make good sense in this scenario. One of them is what I call the clay-to-rocks model. It's a layering model. Layers are a very powerful idea in software engineering, and in many other things in life, but especially in software engineering, where you organize your software into layers: groupings of systems with similar properties that you want to group together and think of as one thing. In the clay-to-rocks model, the idea is that you have systems that are clay: systems with high churn and a small blast radius. Basically, systems that are more malleable, that you can change all the time, and that will change all the time. These tend to be close to the use cases, closer to the user. Maybe it's the one feature your product team wants to experiment with; nobody knows whether it's going to be successful. There's no point in wasting two years building the perfect version of it if you don't even know whether it's going to succeed. Just wire some stuff together, put it out, see if it sticks, and evolve it after that, like we bake the clay. Don't spend too much time worrying about the quality of the system; that's where lead time and cycle time are so much more valuable than anything else.

You have other systems that are rocks. By this, I mean that they are lower in the layering scheme; they enable almost every other feature in the system. This would be the system that provides authentication, or your user database, or, in our case, the system that performs financial transactions. If you want to move money from account A to account B, there are a million reasons why you might do this: maybe I want to move money from my savings account to my checking account, maybe you want to move money from your account to my account to pay me for the pizza we just had, or whatever it is. There are a million different use cases, but there should be one system that has basic interfaces: from account, to account, amount of money, and so on. This is how these rock systems work. They need to be super stable, because they empower so many different use cases; they often are the heart of your operation. If they are down, the whole company is down. There are only so many of them: usually you have a lot more clay systems than rock systems, and they have different characteristics.

Drawing the picture a little, this is from the original article I wrote about this layering scheme when I was working at meetup.com. That's the example we've been using here: there are various systems or services within meetup.com. You have information about groups, information about membership, information about users, and information about events. These sit at the bottom; as you go up the stack, you have more specialized things. You have the user profile service that provides the data that's displayed on your screen, in your app or browser, when you go to meetup.com to check out someone's profile. My profile would have some information coming from that. If you look at this situation, you can specify which of these systems are more like clay and which are more like rocks. The clay systems, coming from this particular experience at Meetup, are the ones we were always exploring, changing, churning, experimenting with. Our user profile page used to change every week, because some product person, designer, or marketing person would have an idea or want to promote something else. Maybe they move things around, maybe they add or remove data from that particular user experience. The actual membership service, the group service, the user service did not change at all; what changed was how the data provided by those services was being used.

When you come into a situation where you inherit a lot of services like this and need to put them into shape, this is a little bit of what happens. You need to start figuring out which systems need to be more stable and which don't have to be. Invest a lot of time, effort, and energy, maybe assign your best people, to make sure that your rock services, the ones that are really important to your normal operation, work at the level you need them to. These fundamental systems cannot go down, so you will probably be optimizing for stability and performance, not so much for developer productivity and other things. Maybe even the code review constraints for these systems are more specific or more stringent than for others. Meanwhile, you don't need to mind the clay systems as much. When you have thousands of engineers working on thousands of things, you really have only so much headspace to worry about each specific thing. In this situation, building a map like this allows you to prioritize putting your attention and your best people on the rock layer, and to deprioritize the effort you put into the clay layer for now, revisiting it case by case. This is an incredibly helpful model in many different ways; I found it invaluable in this particular situation where we're refactoring an existing architecture.
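
As a sketch of how the clay-to-rocks idea can be made operational, here is a small, purely illustrative example: each service is tagged with a layer, and different expectations (review requirements, availability targets) are derived from that tag. The service names, numbers, and policy fields are assumptions for illustration, not anything from Meetup or PicPay.

```python
# Hypothetical sketch of the clay-to-rocks model: tag each service with a layer
# and derive different expectations from the tag. All names and numbers are made up.

LAYER_POLICY = {
    # Rocks: few of them; optimize for stability and performance.
    "rock": {"required_reviewers": 2, "availability_target": "99.99%"},
    # Clay: many of them; optimize for lead time and experimentation.
    "clay": {"required_reviewers": 1, "availability_target": "99.9%"},
}

SERVICES = {
    "user-service": "rock",
    "group-service": "rock",
    "payments-core": "rock",
    "profile-page-bff": "clay",
    "promo-banner-service": "clay",
}

def policy_for(service: str) -> dict:
    """Look up the expectations that apply to a service, based on its layer."""
    return LAYER_POLICY[SERVICES[service]]

if __name__ == "__main__":
    for name, layer in SERVICES.items():
        print(f"{name} ({layer}): {policy_for(name)}")
```

The value is not the code itself but the map: once every service carries a layer tag, both tooling and people know where to spend attention.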

Apply the Edge-Internal-External Layering Model

Still within layering, there's another model I like that's also part of the same article and comes from the same experience, which is edge-internal-external. It's a very basic idea; there's nothing new here. Basically, you need to flag which of your services are part of the edge. By part of the edge, I mean the services that are exposed to the outside world, services that receive inbound connections from outside. Usually that means the public internet; the way things work these days, even if only a subset of people can access your service, it's very likely you still expose it to the internet and use other security measures to make sure it's only accessible to the right people. It could also be that it's only accessible to your partners, or on some network you own. Edge means that external systems you don't control access these services. Services like API gateways, BFFs, what have you, play a big role here. I also put into this layer all the supporting systems: authentication, authorization, rate limiting, throttling, all these things belong here. There are specific things you do and don't want in systems in this layer.

The next one, which is the most common, where the vast majority of your systems are going to run, is the internal layer: systems that are only used inside your VPN. I'm using VPN loosely here. The idea is that these are the systems you develop that talk to each other, upstream and downstream systems within your network. Obviously, it is 2022, these systems should talk via HTTPS, TLS of some sort; they should have security inside the network. There's almost no reason not to have security inside your network, even if it's your own network; plain HTTP traffic has been a big no-no for five or six years now. Still, there are things you are willing to accept inside this network that you wouldn't accept for inbound communication from the outside. Then there's the external layer, which is a weird overlap, but if you structure things well it ends up being its own layer. These are external services in the sense that they are the services that make calls to systems outside your VPN. It's interesting because you might say, isn't that just an internal service that happens to make a call to the world outside the VPN? Yes. At the same time, I think it's important to acknowledge that a well-structured architecture wouldn't allow any service to just make random calls outside your VPN. Again, VPN here is used very loosely, mostly meaning your internal systems. I would really encourage you to make sure that the systems you own that make external calls are gateways to something else. Maybe it's the push notification system, maybe it's a third-party service that you use, whatever it is: they are isolated. Basically, they're the only pieces that can actually make calls to the outside world; everything else is forbidden from making those calls. You have a few systems that can make calls to the outside world.

Then, if you apply this lens to a structure like the one you've inherited, you have a lot of different services up and down that all look the same, because, again, we're thinking forest, not trees; I'm not even going to care which of these systems does what. You start identifying that some of these systems belong to what we're calling the edge layer, and others to the internal layer and the external layer. That's when things get a little more interesting, because systems belonging to the edge layer you definitely need to treat similarly to the rock systems we were talking about. There's an overlap between this pattern and the other one; there's an interesting interplay. The edge layer is mostly rock systems, even though some of them are going to have high churn. For example, an API gateway will probably have new endpoints coming and going; there are interesting ways to make sure this isn't too much of a problem. But your authentication system, your authorization system, your rate limiting system, those things need to just work. If you change the authorization system every week, or every iteration, or every month, there's something wrong. You really need to get it going, either using a third-party service or building your own; you get it going once and you barely touch it, except for some security patch or some small evolution here and there. Changing systems in the edge layer should be a big deal. If you have a company with 2000 people, or if you have enough people in your platform and infrastructure organizations, I strongly recommend having a team fully dedicated just to the edge layer. It doesn't need to be a big team. Those folks should be looking a lot at your network infrastructure; you're probably talking about a CDN as well, and various other things.

Then immediately below, we have the internal layer, where requests hitting this layer have already been sanitized in different ways. Again, that doesn't mean we can drop all security measures and use plaintext HTTP or anything like that; we are definitely still careful in this layer. This is where high churn happens: some systems come and some go, and the tooling you have for deployment, support, monitoring, and telemetry should be optimized for this layer, because the vast majority of your systems are going to be here. Then you have the external layer, the systems that make calls to systems outside. Again, I do recommend that you completely isolate and explicitly acknowledge the systems that are allowed to make calls to external systems or external services, for a few different reasons. The main one is that if a service does not belong to the external layer, it should not have access to the public internet; if it tries to make a call to the public internet, it should just be blocked. A lot of interesting problems are avoided by this. The Log4j bug more recently is one case that would be mitigated by having such a rule in place for most systems, because even if you receive something that makes your system try to contact the outside world, it wouldn't be allowed to, because it's not part of the external layer, so it should be forbidden. Thinking about these layers in different ways, there are different things you can do in your tooling: maybe use manifest files, maybe use other mechanisms to explicitly record the layer each system belongs to, so that your infrastructure tooling, your security tooling, and many other things can automatically detect when one of the systems is not behaving the way it should. Those are the two layering models, still very much in the city planning mindset.
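
A minimal sketch of how the external-layer rule could be expressed, assuming each service declares its layer in some manifest: only services declared as external may reach the public internet, and everything else is denied by default. The service names and the manifest shape are hypothetical; in practice this would live in network policy or egress proxy configuration rather than application code.

```python
# Hypothetical sketch: deny-by-default egress based on a declared layer.
# Only "external" services may call out of the network; all names are made up.

SERVICE_LAYERS = {
    "api-gateway": "edge",
    "auth-service": "edge",
    "checkout-service": "internal",
    "push-notification-gateway": "external",
}

def egress_allowed(service: str) -> bool:
    """Unknown or non-external services are blocked from the public internet."""
    return SERVICE_LAYERS.get(service) == "external"

if __name__ == "__main__":
    for svc in SERVICE_LAYERS:
        status = "allowed" if egress_allowed(svc) else "blocked"
        print(f"{svc}: outbound internet {status}")
```

With a rule like this in place, a compromised internal service that suddenly tries to phone home, the Log4j scenario mentioned above, simply has nowhere to go.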

3. Don’t Panic

As somebody who, again, just joined this role three and a half months ago, the most important thing to me is: don't panic. It's crazy when you join a company where there are thousands of people working on thousands of things, and you're like, I don't even know what's going on. It can be very overwhelming. There's a whole different talk here around how to build your support systems, how to make sure that you, as a manager, as a leader, feel empowered and have the right tools, the right team, the right mindset to tackle this problem. Thinking a little more on the technical side, don't panic, to me, means a few things, not a lot of them. The most important really is: don't try to boil the ocean. This is the Windows defrag screen, something that anyone who was born in the '80s or grew up close to computers in the '90s is very familiar with. For various reasons, Windows had file systems, I think it was in fact a family of file systems, that would gradually fragment. You'd have one thing that you think of as a file, but it would actually be scattered all over your hard drive, and you had those spinning hard drives that were really slow. Eventually, because accessing one file meant reading multiple different sectors across the disk, your computer would get extremely slow, and you had to run this tool that came with the system to defrag, basically to regroup things slowly. While it was doing the job, it would display an animation showing how it was grouping the sectors together and finding bad sectors and things like that.

One of the reasons I like this, and I use this image all the time, is that this is how I've seen my job for a very long time, not just this particular job. I'm that Windows defrag manager. I'm not here to magically solve all problems at once. My role as the leader of an organization going through this transformation is to gradually defrag, sector by sector, file by file. Maybe your company is a mess today; that's fine. Maybe your company is a mess tomorrow; that's fine too. As long as the mess you have tomorrow has less entropy than what you had yesterday or today, you're going in the right direction, you're doing well. Don't panic. Things are going to be ok. I keep saying that management is one of the loneliest professions, and I know leadership can be a big drain on people.

The main thing I want to share with you now is that you really need to think about this problem in a more strategic way. Take steps back. Don't think as an engineer. As you might see from my bookshelf right here, I'm a big nerd: I love programming languages, I love coding. But in a role like the one I'm in right now, this is not my job. My job is to look at the whole ecosystem as if it's a city. I'm really thinking about it like a city planner, not a bricklayer, not somebody who's building one building. I'm building the whole organization, and there are different ways to go about this.

Questions and Answers

Ignatowicz: The practices that you explained for driving the hyper-growth of the company are really interesting, but I'm looking at it from a more practical perspective. Imagine that I'm just joining a company and the team doesn't do well on some of these points, for instance, it doesn't have a culture of design docs or RFCs to document decisions. What would you advise me to start with, without being too pushy, without trying to impose the same framework on my new company? And beyond not being too pushy, I also need to deliver results, because there's a window to impress people and show what the company hired and pays you to do. How do you balance changing the direction of a team you've just joined with adapting to the team's culture?

Calçado: Actually, I think this is a little bit of the mindset I was trying to describe when I mentioned the defrag component. I have a standing window every day for office hours where I talk to random people throughout the organization. Oftentimes I get new hires coming in, and that's a question I get a lot: "I'm new here. It's a really big organization. My team has practices that may not be the ones I think are best. What should I do?" My recommendation in that case is really to think about it the way I do. As in any big company, there are initiatives to improve, standardize, and change the whole thing; again, the defrag mindset. In a company this big, you really need to think globally and act locally. What are the things you can do to improve your team? You're not going to go from 3 to 10, but how can you go from 3 to 4, and talk to other people? This is one thing that's been really challenging for me, because I started my work remote; I'm actually going to our headquarters for the first time in a few weeks. I'm so used to having the buzz and the marketplace of ideas happening in person, where you get to know what other people are doing just by bumping into them. We've been trying more and more to have social channels where people can exchange ideas about what different teams are doing. I think in this case, the most important thing really is to focus on improving your team, and then sharing those ideas, and obviously stealing ideas from other teams. The one thing that I am really worried about in a company this big, and even in a smaller company, even 500 people, 200 people, is: don't try to boil the ocean. Don't try to solve the process for the whole company. Solve it for your team, see what works and what doesn't. Then let's talk as a collective of engineers about what we can extract and turn into a process that works for more people. I think it's really important, as you're saying, also to deliver within that one team. With thousands of people, if you try to work out what works for everyone, you're never going to do anything. Think globally, but act locally is the main motto there.

Ignatowicz: If I’m trying to implement an incident response system with people on on-call in my company, what would be your advice to start this process?

Calçado: The first one is, don't reinvent the wheel; there are plenty of established processes out there. I think PagerDuty has a really comprehensive guide on this. Even if you don't use PagerDuty, and we don't use PagerDuty at PicPay at the moment, the guide is great; I really appreciate it. With a lot of these processes, try to implement them as-is, and iterate over time. Do not try to create a very sophisticated thing that will take forever and require a full team just to maintain the process, especially if you're smaller. There are only a few things you need in an incident management process, like communications, an incident commander, and things like that. Everything else is nice, but not necessary. Stick to a model that's known to work. Again, my recommendation is the PagerDuty model. I know there are others; I think the people behind FireHydrant also have a different model that probably works better with their tool, but can be generalized as well.

Ignatowicz: When you refactor systems at a wide level, do you have any tips for incentivizing developers to stop using the old APIs, services, and calls you are trying to get rid of?

Calçado: Make them super slow. That actually works, but I think a lot of it has to do with managing the lifecycle of an API, which is always complicated. My main recommendation, and it might be a little late for this, is to think about it from the beginning and minimize the number of dependencies on things that are not published interfaces. There's a concept I really like from an article Martin Fowler wrote many years ago, the distinction between public and published interfaces, in the sense that a public interface is like a public method in a programming language: you can call it, but that doesn't necessarily mean you should. Maybe you're calling what is effectively a private API that just happens to be public. A published interface is something that's well supported, well documented, and all this stuff. My main recommendation is to make sure you have a good distinction between public and published interfaces, and whatever you call published interfaces, you really support them well. I'm a leader now, I could go and hit people over the head with an order and say you need to move from A to B in five months. Not only is that not a great working experience for people, but there are other priorities and compromises people need to make. In my experience, the best way to drive wide refactoring or wide change across an organization is to make sure that whatever you want people to do is much easier, a much better experience, much better supported than the old way. The old way can rot; obviously, you need to maintain some level of support, but make sure that the new way is awesome. Then, in my experience, you gradually see that people, especially the newcomers we were talking about, will want to use the new stuff and will not want to use the old ways of doing things. You might need a project or two to migrate from old to new, but the whole idea of stopping the bleed is very important here. By providing an alternative that is better for them, not just for the teams providing the API, you end up in a better situation.
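
To illustrate the public-versus-published distinction, here is a tiny, hypothetical sketch: both handlers below are technically callable, but only the one explicitly marked as published carries a support commitment. The decorator, handler names, and payloads are invented for illustration; they are not an API from the talk or from any real system.

```python
# Hypothetical sketch: mark the interfaces you actually support as "published";
# everything else is public only in the sense that it still happens to exist.

PUBLISHED = set()

def published(fn):
    """Register a handler as a published (supported, documented) interface."""
    PUBLISHED.add(fn.__name__)
    return fn

@published
def transfer(from_account: str, to_account: str, amount_cents: int) -> str:
    return f"moved {amount_cents} cents from {from_account} to {to_account}"

def legacy_transfer_v0(payload: dict) -> str:
    # Still reachable, but not published: callers build on it at their own risk.
    return transfer(payload["from"], payload["to"], payload["amount"])

if __name__ == "__main__":
    print(sorted(PUBLISHED))  # only the supported surface: ['transfer']
```

Pairing a list like this with good tooling around the published surface is what makes the better alternative pull people away from the old calls.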

Ignatowicz: Two questions about the hyper-growth factor. The first one is, how did you scale, as PicPay scaled to thousands of developers, while preserving the culture of the company? Secondly, in this hot market, how do you actually find people to fill these decision-making roles as you scale?

Calçado: For the first one, I can tell you what we've done at PicPay. Before I joined, we did not preserve the culture at all. If you look at PicPay right now, it is still very much divided into eight different business units or groups that have different cultures. Some come from more of a banking/FinTech background, some from more of a startup, consumer-driven company. What we're trying now is to find what works for the company as a whole. Then again, I don't think we can apply the same rules to somebody who's building our social graph or a messaging system as to somebody who's building a deep banking system that integrates with bigger banks that move very slowly; they shouldn't necessarily have the same ways of working and the same culture. We need some baseline, but each team needs a little bit of autonomy to decide what works best for them. I think the most important thing, looking at the big picture, is thinking about the interfaces between these teams, the human interfaces, the protocols. Even for iteration planning: if I need something on your backlog, how can I get it? Do I have to talk to people? Is there a Jira ticket I open? How does that work? All the way to using the same RPC protocol and having our APIs look the same. In a company that's doing a lot of very different things, you really need to allow for a little bit of cultural difference among the various teams.

I definitely recommend that you have one funnel for hiring. I think this is something that larger companies, Google, Facebook, do well: there's one funnel for people coming in. In a company like this, a team or a whole division that exists today might not exist next month, or might just be an experiment. You really don't want to get to a situation where you've hyper-specialized hiring for one particular team and feel like you can't move people around. This applies to managers and engineers alike. On the hiring side, it's an interesting situation; like everybody else, we're hiring like crazy. A lot of my mindset comes from my experience working in Europe, for SoundCloud, where remote working wasn't really a thing; this was 2010 to 2015. What we had to do was attract people from Silicon Valley, people who had grown companies the way we were growing SoundCloud, and get them to make the move to Berlin and make less money, because you can make a lot more money in the U.S., more equity, more everything. Some people would love to move to Berlin because they like the lifestyle; some would not. How could we attract those people? Making sure that the culture inside the company and the work we were doing was super interesting was fundamental to all of that.

If you look at the material we published between 2011 and 2015 at SoundCloud, a lot of it was us making sure that people all over the world knew we were doing cool stuff. You could go to Google and work on YouTube's credit card form for eight months, like in a famous Hacker News thread from the other day, or you could come work for us and build the whole ad targeting system from scratch. Finding what your competitive advantage is, is fundamental here. Obviously, you need to have comparable pay and allow for various other things, but you really need to find a competitive advantage, which, for a big company, is something we're still figuring out at PicPay: what's the sweet spot for a company this size.
