
GitHub Was Down Multiple Times Last February: Here's Why

Sergio De Simone

Article originally posted on InfoQ. Visit InfoQ

GitHub has completed its internal investigation into the multiple service interruptions that affected its service last February for over 8 hours. The root cause was a combination of unexpected database load variation and database configuration issues.

All incidents affected GitHub’s main database cluster, mysql1, which was originally the only MySQL cluster at GitHub.

Over time, as mysql1 grew larger and busier, we split functionally grouped sets of tables into new clusters and created new clusters for new features. However, much of our core dataset still resides within that original cluster.

The first incident happened when GitHub engineers inadvertently sent a resource-intensive query to the database master instead of its read-only replicas. As a consequence, ProxySQL, which is responsible for connection pooling, was overloaded and started serving queries inconsistently.

A similar issue occurred two days later due to a planned master database promotion aimed at investigating potential issues when a master goes read-only for a while. This also generated unexpected load and a similar ProxySQL failure.

In both cases reducing the load was enough to fix the failure.

More critical was the third incident, which lasted over two hours. Again, ProxySQL was the point of failure, this time because the number of active database connections crossed a given threshold.

The major issue in this case was that active connections remained above the critical threshold after remediation, which put the system into a degraded service state. It turned out that ProxySQL was improperly configured: due to a clash between the system-wide and process-local LimitNOFILE settings, its file descriptor limit had been downgraded to 65536.
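To make the descriptor limit more concrete, the sketch below shows how a process on a Unix-like system can inspect its own open-file limit, which is the value a LimitNOFILE setting ultimately controls. It uses Python's standard resource module. The function name and the idea of checking at startup are assumptions for illustration, not GitHub's or ProxySQL's actual tooling.

```python
import resource


def nofile_limit_is_sufficient(required: int) -> bool:
    """Return True if this process's soft open-file limit is at least `required`."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means "no limit", so any requirement is satisfied.
    if soft == resource.RLIM_INFINITY:
        return True
    return soft >= required


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"soft limit: {soft}, hard limit: {hard}")
    # Hypothetical requirement: more descriptors than the 65536 mentioned above.
    if not nofile_limit_is_sufficient(65536 + 1):
        print("warning: open-file limit may be too low for the expected connections")
```

A check like this only reports the effective limit; it does not explain which of several conflicting settings produced it.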

The fourth incident was caused by a change in GitHub application logic. This change generated queries that rapidly increased load on mysql1 master, which affected all dependent services.

According to GitHub engineers, all issues were easy to remediate once their root cause was identified, but the interactions between systems were not always straightforward to understand. Consequently, they explain, a first focus area for improvement has been observability, followed by more thorough system integration and performance testing before deploying to production.

Going to the heart of the problem, though, requires improving data partitioning with the aim of reducing load on the mysql1 cluster master, the GitHub engineers continue. This would help fulfill the requirement of zero downtime and reduce the user impact of any future incidents.

In particular, they worked on one table, the abilities table, which is used with any authentication request and is thus central to performance. After two months of work, the abilities table now runs in its own cluster, which first required making it independent (read: JOIN-free) of any other table in the system. Thanks to this change alone, load on the mysql1 cluster master was reduced by 20%, say GitHub engineers.

While the effort to further partition their database is ongoing, with the aim of further reducing write load on the mysql1 cluster by 60% and moving 12 schema domains out of it, GitHub engineers have also identified a number of other initiatives to improve things. Those include reducing the number of reads from the master when the same data is available from replicas; using feature flags to disable code changes that might prove problematic; improving GitHub’s internal dashboard to better identify deployment problems; and using sharding to improve horizontal scalability.
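The idea of keeping reads off the master when replicas can serve them can be sketched as a small routing helper. This is a minimal illustration under stated assumptions, not GitHub's implementation: the QueryRouter class, the host names, and the require_fresh flag are all hypothetical, and a real router would need a more robust way to classify queries than checking for a leading SELECT.

```python
import itertools
from typing import Sequence


class QueryRouter:
    """Send writes to the primary and spread plain reads across replicas."""

    def __init__(self, primary: str, replicas: Sequence[str]) -> None:
        self.primary = primary
        # Fall back to the primary if no replicas are configured.
        self._replica_cycle = itertools.cycle(list(replicas) or [primary])

    def route(self, sql: str, require_fresh: bool = False) -> str:
        """Return the (hypothetical) host name that should serve `sql`."""
        is_read = sql.lstrip().upper().startswith("SELECT")
        # Reads that tolerate replication lag go to replicas, round-robin.
        if is_read and not require_fresh:
            return next(self._replica_cycle)
        # Writes, and reads that must see the latest data, go to the primary.
        return self.primary


router = QueryRouter("mysql1-primary", ["mysql1-replica-a", "mysql1-replica-b"])
print(router.route("SELECT * FROM abilities WHERE id = 1"))
print(router.route("UPDATE abilities SET level = 2 WHERE id = 1"))
```

The require_fresh escape hatch reflects the caveat in the text: only reads for which the same data is available from replicas should be moved off the master.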



Angular 9.1 Adds TypeScript 3.8 Support and Faster Builds

Dylan Schiemann

Article originally posted on InfoQ. Visit InfoQ

The Angular 9.1 release adds support for TypeScript 3.8 and reduces the time it takes to build an Angular application.

TypeScript 3.8 was a significant improvement to the language, adding several new ES2020 features like private class fields and top-level await, as well as type-only imports and exports. Many projects have recently released updates to support these advances in TypeScript.

With Angular 9.0, the long-awaited Ivy engine became the default compilation and rendering pipeline for Angular applications. To add compatibility with various Angular libraries, Angular provides the ngcc tool. With Angular 9.1, ngcc can now compile packages concurrently, lifting the previous limit of sequential compilation.

Angular 9.1 also adds support for defining new Angular components as having CSS block display rather than the default of inline.

Angular’s end-to-end testing support now allows passing the grep and invertGrep options to Protractor, a common approach with other frameworks also supported by Angular.

Angular 9.1 also updates its dependency on TSLint to version 6.1. Developers upgrading their projects from earlier Angular releases should first upgrade to Angular 9.1 and then run the following command:

ng update @angular/cli --migrate-only tslint-version-6

The TSLint update raised a common question in the community, which was expecting a switch from TSLint to the TypeScript version of ESLint, as TSLint was deprecated in early 2019. The Angular team is already working on version 10, which will include switching from TSLint to ESLint.

Angular is open-source software available under the MIT license via the Angular GitHub repository. Feedback and contributions are encouraged via the Angular contribution guidelines and code of conduct.



AlphaFold Algorithm Predicts COVID-19 Protein Structures

Faris Gulamali

Article originally posted on InfoQ. Visit InfoQ

In the wake of a growing number of cases of COVID-19, DeepMind has utilized their AlphaFold algorithm to predict a variety of protein structures associated with COVID-19. Given a sequence of amino acids, the building blocks for proteins, AlphaFold is able to predict a three-dimensional protein structure. Typically, going from a sequence of amino acids to a three-dimensional structure is a long and intensive process, requiring a wide variety of protein visualization techniques and structural analysis such as cryo-electron microscopy, nuclear magnetic resonance, and X-ray crystallography.

However, AlphaFold, which recently won the CASP13 competition (Critical Assessment of Techniques for Protein Structure Prediction), bypasses these techniques with a deep neural network that predicts distances and angles between amino acids, scored with gradient descent. It uses free-modeling, which means that it ignores similar structures when making predictions, which is particularly helpful for COVID-19, as few similar protein structures are readily available.

AlphaFold is composed of three distinct layers of deep neural networks. The first layer is composed of a variational autoencoder stacked with an attention model, which generates realistic-looking fragments based on a single sequence’s amino acids. The second layer is split into two sublayers. The first sublayer optimizes inter-residue distances using a 1D CNN on a contact map, a 2D representation of the distances between amino acid residues that is projected onto a single dimension before being fed into the CNN. The second sublayer optimizes a scoring network, which measures how much the generated substructures look like a protein, using a 3D CNN. After regularizing, they add a third neural network layer that scores the generated protein against the actual model.

The model was trained on the Protein Data Bank, a freely accessible database that contains the three-dimensional structures of large biological molecules such as proteins and nucleic acids. The model takes in a few inputs, including aatype, a one-hot encoding of amino acid type; the deletion probability, the fraction of sequences that had a deletion at this position; and a gap matrix, which gives an indication of the variance due to gap states. The output contains a distogram, which includes the predicted secondary structure and accessible surface area.
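As an illustration of what a one-hot encoding of amino acid type looks like, the sketch below turns a sequence in one-letter code into one row per residue, with a single 1 marking the residue's type. The alphabet ordering and the function name are assumptions made for this example; the actual aatype feature used by AlphaFold may use a different ordering or extra classes.

```python
from typing import List

# The 20 standard amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_aatype(sequence: str) -> List[List[int]]:
    """Encode each residue as a 20-element row containing a single 1."""
    rows = []
    for residue in sequence.upper():
        if residue not in AA_INDEX:
            raise ValueError(f"non-standard residue: {residue!r}")
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[residue]] = 1
        rows.append(row)
    return rows


encoded = one_hot_aatype("MKV")
print(len(encoded), len(encoded[0]))  # 3 residues, 20 classes each
```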

After cross-validating their results on the COVID-19 spike protein with the structures determined experimentally by the Francis Crick Institute, DeepMind submitted their predictions for the proteins whose structures are not readily determined. These proteins include the membrane protein, protein 3a, nsp2, nsp4, nsp6, and papain-like C-terminal domain. These protein structures can potentially contain docking sites for new drugs or therapeutics, and were intended to help with future drug development in the efforts to contain COVID-19.

Several other groups are applying AI technologies to assist in the fight against COVID-19. For example, a thoracic imaging group leveraged a ResNet50 backbone connected to a 3D CNN via a max pooling layer to distinguish COVID-19 from community-acquired pneumonia. BlueDot used an online natural language processing ML algorithm to predict the location of the next outbreak.



Google Introduces TensorFlow Developer Certification

Anthony Alford

Article originally posted on InfoQ. Visit InfoQ

Google has launched a certification program for its deep-learning framework TensorFlow. The certification exam is administered using a PyCharm IDE plugin, and candidates who pass can be listed in Google’s world-wide Certification Directory.

The new certification was announced in a blog post by TensorFlow program manager Alina Shinkarsky. Candidates are tested on their ability to develop and train deep-learning models using TensorFlow, and problem spaces include computer vision (CV), natural-language processing (NLP), and sequence modeling. The exam fee is $100, and the certification is valid for three years. Certified developers will receive an official certificate and will be entitled to include a badge on their social media pages. According to Shinkarsky,

This certificate in TensorFlow development is intended as a foundational certificate for students, developers, and data scientists who want to demonstrate practical machine learning skills through building and training of basic models using TensorFlow.

The certificate candidate handbook includes more technical details. Models must be built using Python 3.7 and TensorFlow 2.x. The exam runs in the PyCharm IDE using a special plugin and may be taken on any computer that supports the required software and is connected to the internet. Candidates are given five hours to complete the exam and must achieve a score of 90%. Those who do not pass may attempt the exam a total of three times in one year; the exam fee is required for each attempt. During the exam, test-takers must build five deep learning models in the following categories:

  • Basic model
  • Model from dataset
  • Convolutional neural network (CNN) model for real-world image data
  • Natural-language processing (NLP) model for real-world text data
  • Sequence model for real-world numeric data

Developers who pass the exam can join the Google Developers Certification Directory. The directory also includes engineers who attained two of Google’s other certifications: Associate Android Developer and Mobile Web Specialist. Google announced its Google Developer Certification program in 2016, with Associate Android Developer as the first certification. Later that year, Google also launched its certification program for Google Cloud, announcing Certified Professional exams for Cloud Architect and Data Engineer. However, unlike the Google Developer exams, which are “self-service” exams that candidates may take on their own computers, the Google Cloud certifications require proctored exams that are taken at dedicated test centers.

The other leading cloud providers, Amazon Web Services (AWS) and Microsoft Azure, have certification programs similar to the Google Cloud program, and include certifications focused on machine learning and AI. AWS announced its machine-learning specialty exam in late 2018 and Microsoft announced their AI and data science certifications in early 2019. Google Cloud does not have an AI-specific certification.

In a discussion thread on Reddit, one user described his experience with the TensorFlow exam:

I have recently taken the exam on my MacBook Pro 15″ with i9 CPU. Having GPU would have been beneficial because it’s faster but not I didn’t feel that GPU was necessary. If the problem requires a complex model (not really that complex. About 3~5M parameters), about 10 epochs are sufficient which took about less than 10 minutes.

Google does not offer training materials for the certification, but recommends a Coursera specialization, TensorFlow in Practice, for students who wish to prepare for the exam. The Google Developer TensorFlow Certificate site claims that additional certifications are being developed for “more advanced and specialized TensorFlow practitioners,” but no timeline has been announced.
 



Stork, a Rust/Wasm-Based Fast Full-Text Search for the JAMStack

Bruno Couriol

Article originally posted on InfoQ. Visit InfoQ

James Little, developer at Stripe, released Stork (in beta), a Rust/WebAssembly full-text search application. Stork targets static and JAMStack sites and strives to provide sites’ users with excellent search speed.

Developers using Stork operate in two steps. First, the content to be searched must be indexed by Stork, and the generated index must be uploaded to a URL. Second, the end-user search interface must be linked with the search capabilities provided by Stork. Search interfaces typically take the form of an input field with search-as-you-type functionality, with search results displayed below the input field.

To generate a search index, developers must first install Stork. On Mac, Stork can be installed with brew; on Windows, developers will need to install it from source with Rust’s Cargo. With Stork installed, the next step consists of running a command that builds the index:

stork --build federalist.toml

The TOML configuration file holds the list of files to index and the name of the generated index. Each file list item must include the file location on the local machine, the file’s URL for end-users to navigate to, and the file title, which will be displayed to end-users by the search interface:

base_directory = "test/federalist"
files = [
    {path = "federalist-1.txt", url = "/federalist-1/", title = "Introduction"},
    {path = "federalist-2.txt", url = "/federalist-2/", title = "Concerning Dangers from Foreign Force and Influence"},
(...)
    {path = "federalist-6.txt", url = "/federalist-6/", title = "Concerning Dangers from Dissensions Between the States"},
(...)
    {path = "federalist-8.txt", url = "/federalist-8/", title = "The Consequences of Hostilities Between the States"},
    {path = "federalist-9.txt", url = "/federalist-9/", title = "The Union as a Safeguard Against Domestic Faction and Insurrection"},
(...)

[output]
filename = "federalist.st"

The generated index (here federalist.st) can then be uploaded, for instance to a CDN, to speed up its subsequent download. In the context of static sites, which is Stork’s targeted use case, the index generation and upload can be automated so the index remains up-to-date with the site’s content.

On the front-end side, developers must include the stork script in their entry HTML, together with any CSS to personalize the search results’ appearance. The HTML will typically include a search input (<input /> tag) and must include an output element to which the search results will be anchored:

<html lang="en">
  <head>
    (...)
    <link rel="stylesheet" href="https://files.stork-search.net/basic.css" />
  </head>
  <body>
    <div class="stork-wrapper">
      <input data-stork="federalist" class="stork-input" />
      <div data-stork="federalist-output" class="stork-output"></div>
    </div>
    (...)
    <script src="https://files.stork-search.net/stork.js"></script>
    <script>
      stork.register(
        'federalist',
        'https://files.stork-search.net/federalist.st'
      )
    </script>
  </body>
</html>

In the previous example, some basic CSS is added in the head, and the stork script is located before the end of the body. A Stork instance is registered with parameters linking the index file to the DOM input field (the federalist string identifier links to the data-stork attributes of the input and output DOM elements).

The previous examples are enough to generate the following interactive full-text search:

[Figure: demo full-text search on the federalist corpus]

While there are other open-source full-text search utilities available with more sophisticated search abilities (like FlexSearch), Stork differentiates itself by the use cases for which it specializes, making it easy to add a search feature with good speed to an existing JAMStack site.

Stork is a beta project. Its design goals and roadmap towards a first major version are to reduce index size, keep the WebAssembly bundle size low, and extend the types of content it can index, while keeping developer ergonomics and search speed high.

Stork is available under the Apache 2.0 open-source license. Bugs and features are listed on the project’s GitHub Issues page. Feature requests are welcome and may be provided via the GitHub project.



Spectro Cloud Launches a Kubernetes-Based Hybrid Cloud Platform

Steef-Jan Wiggers

Article originally posted on InfoQ. Visit InfoQ

Spectro Cloud, an enterprise cloud-native infrastructure company, launched a platform for managing multiple distributions of Kubernetes. The platform bearing the company name gives customers fine-grained control, flexibility and multi-cloud capabilities for their Kubernetes stack, including the ease of use and scalability of a managed SaaS platform.

The founders of Spectro Cloud were previously part of CliQr, a company that helped customers manage applications across hybrid cloud environments. In 2016 they sold the company to Cisco, and they started developing Spectro Cloud in the spring of last year. The product has been in private preview since January of this year, and the company expects to bring it to general availability next quarter.

Furthermore, the company recently raised an additional $7.5 million in funding from Sierra Ventures with participation from Boldstart Ventures. Mark Fernandes, managing director at Sierra Ventures, said in the GlobeNewswire article:

The market for Kubernetes has crossed the chasm. What we’ve heard from our CXO Advisory Board of Global 1000 IT executives is that enterprises are still struggling with the operational complexity that comes with Kubernetes. Spectro Cloud’s team has a deep understanding of the needs of enterprises and has found a unique way to make Kubernetes easy to use for its rapidly growing customer base.

And Saad Malik, co-founder and CTO of Spectro Cloud, told InfoQ:

Spectro Cloud offers modern enterprise IT the flexibility to build and run their Kubernetes infrastructure with the control they need. An infrastructure that just works allows teams to focus on delivering business value – with increased ops efficiency, reduced cost and faster time to market.

Spectro Cloud has a declarative model to define cluster profiles, allowing users to manage Kubernetes with fewer coding skills required. They can define cluster profiles and use them to automate the deployment and maintenance of Kubernetes clusters across the extended enterprise. This approach not only enables teams to centralize the management of Kubernetes environments regardless of physical location, but also provides declarative tools that make it possible for IT administrators to manage those clusters.


Source: https://www.spectrocloud.com/

Malik told InfoQ:

Spectro Cloud helps the modern enterprise make cloud-native really work for them by giving them the control they need over their Kubernetes infrastructure. That allows enterprises to grow more confidently in any environment.

Spectro Cloud isn’t currently the only vendor focusing on Kubernetes-based IT management. Others, ranging from startups to large vendors like Microsoft and Google, have embraced hybrid cloud computing and made investments in products such as Azure Arc and Anthos.

Lastly, Malik told InfoQ what the future for Spectro Cloud would hold:

We plan on taking our existing foundation and building on it to solve nascent problems around how to make infrastructure disappear to the developer regardless of where that infrastructure exists.

In addition, Malik also said:

We will continue to empower developers and DevOps to build, run and scale cloud-native applications anywhere without worrying about managing and operating servers and infrastructure.



Article: Software Teams and Teamwork Trends Report Q1 2020


Key Takeaways

  • Remote work is suddenly the new normal due to the impact of COVID-19, and many teams are not fully ready for the change
  • The spread of agile ideas into other areas of organizations continues—business agility is becoming much more than just a buzzword
  • At the practices level, Wardley Mapping is one of the few truly new ideas that have come into this space recently. Invented by Simon Wardley in 2005, Wardley maps are gaining traction because they are a powerful tool for making sense of complexity.
  • The depth of impact that computing technology has on society has heightened the focus on ethical behavior and the move towards creating an ethical framework for software development, as well as growing concern about the environmental impact the industry has.
  • Diversity and inclusion efforts are moving forward, with a long way still to go
  • Practices and approaches that result in more humanistic workplaces, where people can express their whole selves, are recognized as important for attracting and retaining the best people and result in more sustainably profitable organizations

How do we cope with an environment that has been radically disrupted, where people are suddenly thrust into remote work in a chaotic state?  What are the emerging good practices and new ideas that are shaping the way in which software development teams work? What can we do to make the workplace a more secure and diverse one while increasing the productivity of our teams? This report aims to assist technical leaders in making mid- to long-term decisions that will have a positive impact on their organisations and teams and help individual contributors find the practices, approaches, tools, techniques and frameworks that can help them get a better experience at work – irrespective of where they are working from.

The big changes we see in 2020 are centered around more humanistic organizations and remote becoming the way of working for so many people so suddenly, given the impact of Covid-19 on societies, organizations and individuals.

Diversity and inclusion initiatives are continuing and there is a long, long way to go for information technology to become truly inclusive and welcoming.

If a topic is on the right-hand part of the graph, you will probably find lots of existing content about it on InfoQ—we covered it when it was new, and the lessons learned by the innovators and early adopters are available to help guide individuals, teams, and organizations as they adopt these ideas and practices.

The techniques and practices on the left-hand side are the ones we see as emerging now and being used by innovators and early adopters. We focus our reporting and content on bringing these ideas to our readers’ attention so they can decide if they want to explore (some of) these now or wait to see how they unfold.

We asked the editors to give their take on the important trends they’ve seen over the last year or so. We also asked Evan Leybourn, founder and Chief Executive of the Business Agility Institute, to give us his thoughts on the state of business agility in 2020.
 
What follows are the opinions of the Culture & Methods editorial team, looking back and looking forward in 2020 and beyond.

Shane Hastie — Lead Editor, Culture & Methods

Probably because we are focused on culture and methods in this team, I am very aware of the way organization culture impacts people’s lives. Over the last 19 years since the Agile Manifesto was written, we’ve seen steady adoption of agile approaches in information technology workplaces. Today the leading organizations don’t talk about agile because it’s just the normal way of working: short delivery cycles, cross-functional teams, empowered people, and technical excellence are the topics of conversation when setting up new IT teams. The best organizations recognize that culture is a competitive advantage: create a place where people can be themselves and find joy in the work they do, and sustainable profitability will follow.

Unfortunately faux-agile is still rife — organizations adopting agile practices or undergoing a “digital transformation” without the accompanying cultural shift. The science is clear — organizational performance is directly related to organizational culture, as indicated in this diagram from Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations by Nicole Forsgren PhD, Jez Humble and Gene Kim.

Forsgren and Humble also discussed this and related topics during their 2018 keynote at QCon San Francisco.

What’s still missing is addressing some of the structural dysfunctions that have become a part of the technology industry.

We have a very long way to go before tech is an inclusive and diverse environment where differences of every sort are recognized as bringing value to teams and organizations. Again the science is clear: diverse teams are more creative than homogeneous teams, but that message isn’t getting through to the people in our teams and organizations.

Ethics in software is another area where we need to catch up. Most professions have a recognized code of ethics which practitioners agree to and abide by. How many people on our teams know that there is an ACM code of ethics for computing, let alone have ever signed it?

But we also see hope: the emergence of ideas around applying agile thinking on a national scale. The TEDx Talk by Dr. Rashina Hoda on the traits of an agile nation stands out to me as one of the most important pointers to what society can become. The impact of COVID-19 makes this even more important.
 

The Four Values for an Agile Society:

    People & interactions over protocols and rules
    Community collaboration over closed decision making
    Policies and actions over speeches and promises
    Responding to change over following the status quo

Dr Rashina Hoda – https://www.infoq.com/news/2019/10/traits-agile-nation/

On a more local scale, business agility is moving beyond being a buzzword with leading-edge organizations from unexpected industries adopting agile ways of working, building on genuine culture shift towards more humanistic workplaces. Agile companies have higher employee engagement, disciplines like finance and HR are adopting new approaches, even procurement is becoming more collaborative. Autonomous teams, self-selection, and dynamic reteaming are examples of how people are being empowered and organizations are getting measurable benefits from trusting their people.

At the interpersonal level, remote teams are more and more the norm, and techniques like clean language and liberating structures have emerged to help remote team members communicate more effectively, overcoming the barrier that being remote often puts between people.

The recent outbreak of COVID-19 has resulted in more and more people suddenly having to work remotely, so any tools and techniques that can help remote teams gain cohesion are important. Teams are even using mob programming in remote settings.

Another significant shift is the slow move away from big frameworks to descaling rather than trying to scale up when faced with complex problems. Adoption of #NoProjects/value streams is happening in more and more organizations.

At the practices level, Wardley Mapping is one of the few truly new ideas that have come into this space recently. Invented by Simon Wardley in 2005, Wardley maps are gaining traction because they are a powerful tool for making sense of complexity. Amongst other places, Wardley maps are used within the UK government, notably within the Government Digital Service (GDS) for strategic planning and identifying the best targets for government digital service modernisation, and by the UN Global Platform.

InfoQ partnered with Map Camp last year, and filmed the talks at the beautiful Sadler’s Wells theatre in London.

Ben Linders — Trainer / Coach / Adviser / Author / Speaker and Editor for InfoQ

Awareness of how important facilitation can be is increasing. For team-based activities like agile retrospectives, team chartering, or product refinement, a collaborative and safe culture is essential. Good facilitators who are perceived as independent and neutral can greatly improve the outcomes of such events. Practices often used by facilitators are the core protocols, liberating structures, gamification, and self-selection.

Remote working, for example, people working from home, digital nomads, and (fully) remote teams, is on the rise. Techniques that are used in co-located teams are being adapted for usage in remote working settings; examples are remote agile retrospectives, remote pairing, remote mob programming, remote mob testing, and online training and coaching.

The number of certifications for agile practitioners (Certified Scrum Master, SAFe SPC, etc.) and people getting certifications is still growing and companies tend to prefer certified Scrum masters and agile coaches over those who don’t have certification. However, there is a shift where the value of those certifications is now in question. There are discussions about agile certification on social media where people state that certificates themselves don’t prove a person’s skills or abilities, and the impression arises that the certification industry has become more focused on quantity.

Raf Gemmail — Agile and DevOps Coach, Educator, and Editor for InfoQ

Fully Remote Working

I’ve dedicated much of the past year to remote working and collaboration. This new terrain in the greater history of work is key to competing in the fourth industrial revolution landscape. Remote work has rapidly gotten past the initial hurdle of being seen as second fiddle to co-location, as demonstrated in this year’s State of Remote Work report. Innovators are still working out issues with defining collaboration models across time-zones, finding sufficient face-to-face investment, surfacing learnings and formats for doing this. My own running of remote retros and collaboration sessions has demonstrated that for anything more creative than a fixed format retro, you still have to roll your own.

Tools and methods are going through the same evolution we went through in the physical world when discovering how to best use wall space and the available artifacts to reflect the context of teams. Remote pairing tools, conferencing capabilities, shared boards like Google Jamboard, and planning tools have greatly matured.

The collaboration tools, however, are better than they have ever been. This has been demonstrated by a forced pivot into remote working by the recent COVID-19 pandemic. Chinese firms have been forced to use remote working for business continuity and have seen productivity achieved within a short time.

Learning on Demand Equaling Higher Education

I have had a front-row seat, working with a French disrupter, in the evolution of traditional higher education certifications. MOOCs and online learning platforms combine the teaching of traditional disciplines with current industry-relevant application patterns. Further, the delivery mechanism is increasingly changing to one which is pull-based to make learning more accessible to teams and individuals. I have personally found that hiring people with a combination of diverse experiences and bootcamp training gives them a good grounding to grow and a broader set of drivers in delivering customer and business value. This ties in well with Ben’s point that we are improving how we provide remote teaching to remote workers.

Situational Leadership

There is continued use of Wardley Maps as well as organizational mindfulness techniques to make sense of the unique context of a particular product’s development. There is also a deepening appreciation of the impact that organizational complexity and cognitive load have on teams and their products, as covered in the recent book on Team Topologies.


I have also been experimenting with accounting for situational context to recognize holistic tensions within teams and organizations in relation to individuals, culture, mental models, and belief systems. I’ve seen a growing desire at the innovator level to apply a broader contextual understanding to coaching, as demonstrated by Spotify’s use of full-stack coaches. Such situational leadership challenges the urge to keep repeating cookie-cutter practices without a contextual insight.

Engineering Patterns

DevSecOps and dev first security continue impacting how we practice a more holistic approach to security, within a world where cybercrime is increasingly prolific.

At QCon London last year, Snyk’s founder Guy Podjarny spoke about DevSecOps being defined by a breadth of shared ownership, collaboration on securing infrastructure, and incorporating security in software methods. In a blog post he published earlier this year, Podjarny defined dev first security as the “majority of security work…done by developers.”

I’ve also seen an increasingly early engagement of security teams to pull security considerations in through security by design, lite threat modeling, and ongoing security championship initiatives. In his 2019 QCon New York talk on breaking into InfoSec, Ray Bango described Security Champion programs as the “biggest thing that’s happening right now,” for making Security accessible to developers who want to grow into InfoSec.

The role of the architect continues to evolve, in order to help set boundaries and direction for evolution. Deloitte recently wrote about the Architecture Awakens movement which is breaking down ivory towers in favor of collaboration.

My personal exposure has demonstrated a slight resurgence in risk-aversion leading to organizations reaching back for BDUF architectural practices. There is continued importance for coaching and advocacy of those still on the journey of achieving business agility through evidence-based increments.

For the conservative, alongside the existing scaled agile approaches, traditional architectural frameworks are also evolving to provide more boundaries within an Agile and learning-centric context. An example of this comes from TOGAF maintainers, The OpenGroup, who are now defining an Open Group Agile Architecture Framework standard.

Ethics in Software

Rapidly creeping into Early Majority. I have never known a more ethically conscious time in our industry when people across the board feel free to ask about the rights and wrongs of their organizations, projects, and patterns they observe. Shane Hastie recently spoke with Cat Swetal and Kingsley Davies, on the InfoQ Culture & Methods podcast, about ethics and the moral imperatives which come with living in the realm of ethics.

This is increasingly seen in organizations responding to the current whistleblower culture by creating channels for dealing with ethical and legal quandaries.

Last year, The Guardian reported on ex-Googler Jack Poulson starting Tech Inquiry to support whistle-blowing by ethical developers. It is now also completely unnatural to converse about data representations without asking ethical questions relating to the implications of bias, the use of personally identifiable information, and the legal consequences. This has also been hugely driven by legislative initiatives such as GDPR, which are being replicated the world over, most recently in California, where the equivalent California Consumer Privacy Act went into effect on January 1st.

The workplace is becoming increasingly socially conscious. I think this is a good thing.

Craig Smith — Agile Coach and Trainer and Editor for InfoQ

One of the things that makes being in the IT industry and the Agile community so interesting is that there are always new approaches to problems to learn and experiment with.
 
Innovators are starting to find benefits from the use of techniques such as liberating structures and Wardley Mapping and starting to get back to basics by challenging the use of large scaling frameworks with a focus on simplicity and descaling. This has been led by some of the early adopters who are challenging culture and focusing on creating joy in the workplace as well as the realization that agility thrives in organizations that tackle the wider organizational challenges of business agility and #noprojects.

33 Liberating Structures

Given the number of organizations that struggle with their agile and digital transformations, a spotlight has started to emerge on the roles that support these new ways of working, specifically the ethics around software development and Agile coaching as well as the maturity of true product management.

Interestingly, as the majority have well and truly embraced Agile techniques and the practice of cross-functional teams and full-stack development, we will hopefully start to see some real change as organizations realize that the principles at the heart of modern agile are more important than a strict adherence to Agile frameworks and practices.

Shaaron A Alvares — Agile and DevOps Coach and Editor for InfoQ

Inclusion and Belonging

While the focus has been on hiring diversity, it is moving towards developing cultures and management that prioritize inclusion and belonging. Inclusion is not just the responsibility of HR, it is owned by everyone at the workplace and teams are starting to understand the importance of inclusive collaboration. More organizations and product teams are leveraging agile practices and facilitation techniques to develop more inclusive teams. Managers and teams are also exploring the psychology and biology of fear to create more sustainable and authentic safe workplaces. Organizations recognize belonging as a basic safety need, and employees are invited to bring their whole self at work.

Situational Mapping

In an effort to understand and navigate complexity, a few organizations are exploring sense-making and situational organizational practices such as value stream mapping, value chain mapping for strategic decision making (or Wardley Mapping), and customer journey mapping, in order to keep the focus on the value for customers and end-users. These practices also aim at better analyzing organizations’ internal processes, breaking down silos, removing inefficiencies, and decreasing end-to-end lead time.

DevOps Upskilling Culture

DevOps, although always known as a culture, is focusing on a team culture of collaboration and cross-knowledge sharing across the organization to drive better outcomes between development and operations. DevOps upskilling and cross-skilling are becoming an integral part of organizations’ digital strategy.

The Five Ideals (Locality and Simplicity; Focus, Flow, and Joy; Improvement of Daily Work; Psychological Safety; and Customer Focus), proposed by Gene Kim in The Unicorn Project, confirm the importance of the DevOps movement as a better way of working and delivering better value, sooner, safer, and happier.

DevOps Dojos, still very few, started to break into organizations, as a safe place to try, test, train and onboard onto DevOps technology and culture. Few include product management and agile coaching.

Bad Agile versus Human-Centered Ways of Working

Too many large corporations prefer to keep their organizational structure and hierarchy at the status quo, with large teams and products. They adopt large off-the-shelf scaling frameworks heavily focused on known practices and techniques based on doing, rather than exploring unknown ways of working that can help them unlock their organizational culture, people, and human-centered potential and innovation.

Doug Talbot — Engineering Leader, Organizational Dynamics Leader and Editor for InfoQ

Although it’s not the biggest tech trend at the moment, I feel the relationship between Climate Change and Technology is a topic that is hugely important for the planet. We are starting to see the pressure increasing for the Tech industry to take responsibility for their code, infrastructure, the climate cost of their end product and which 3rd parties they use. Sal Freudenberg and Chris Adams spoke about this at MapCamp. Paul Johnston at QCon London pointed at the relative costs of using different cloud providers if you consider Climate cost as well. He demonstrated that Wardley Mapping (another emerging method) was useful for this cost analysis. Henrik Kniberg of Spotify Agile fame has been championing the climate cause and TED has recently started a huge campaign. Microsoft’s recent announcement that it aims to be carbon negative by 2030 can also be seen as a signal in this context. The issue has also started to seep into the public consciousness as this series from the BBC demonstrates. Paul Johnston and Anne Currie were both sources for the series.

The key emerging trends in technology leadership are focused on two main scenarios. Trend one: that leaders must now understand how to support the human community that they lead. Mairead O’Connor spoke on why people are more complex than computers at QCon London 2019. To really be excellent as a leader now means understanding systems thinking, psychology, sociology, and anthropology and effectively starting a new career path that involves many years of learning. Trend two: the leaders must be actively creating aligned organizational structures and services such as HR and Finance that do not break the new humanistic paradigms for knowledge workers. Both Spotify (Spotify Career path blog) and ING (HR change at ING) were early innovators of these trends in larger organizations but we are now seeing many more companies adopting these ideas.

Traditional HR practices such as CIPD (Chartered Institute of Personnel and Development) need to shift to allow for deliberate culture design, whether it is for Holacracy or more commonly for Agile practices such as cross-functional teams and empowerment. We have seen radically different HR models and the HR Agile manifesto moving into bigger companies such as ING, and this trend is accelerating.

I wanted to discuss one last trend — almost every organization has now declared they are doing a Digital transformation or “done” one. I would now put Digital transformation in Late majority. Digital transformation originally indicated restructuring the organization, its workforce, and processes to maximize the benefits of using new digital technologies and minimize costs by eliminating mechanical work and allowing customers to enjoy a modern digital interaction. However, as many organizations discovered their limited capacity to change old and entrenched structures, the Digital transformation became solely Digitization and we have seen a correlated trend in RPA implementations since 2017. The Digital transformation of an organization is tightly coupled with the trend of business agility as a concept which is the ultimate extension of Agile and DevOps / DevSecOps where all elements of a value stream are brought together and organized cross-functionally together.

Guest Contribution — Evan Leybourn

We spend more time at work than we do with anything else—including our family. This is not a Bad Thing™ in itself, but rather a fact of modern life. And yet, if work is the biggest part of our life, it should also be the best part of our life. Joy. Engagement. Purpose. These are the cultural ideals that employees and employers alike should be striving for.

In the Culture & Methods Graph above, there are some great ideas: liberating structures, self-selection, [Socio/Hola]cracy, coaching, even old-fashioned Agile. And there are hundreds more ideas out there. Every one of which has been created by people with the intention to make your workplace better. Put all together, this collective set of ideas, frameworks, methods, practices, and principles is referred to as business agility — and the possibilities are endless.

The Domains of Business Agility

Every department can change; from HR to sales to procurement to finance. You can change technology, healthcare, and even manufacturing companies. Product creation focuses more on the customer, team, and end-to-end flow of value. People can truly be engaged and incentivized to be their whole self at work.

So, let’s go back to the business agility cultural ideals I outlined above; Joy, Engagement, and Purpose. If you are doing Agile but it doesn’t bring you joy, if you have adopted self-selection but without a sense of purpose, if you are being coached but are not engaged—stop it and try something else.

Because that’s the secret to all of this; it doesn’t actually matter what you do, just do something. It’s not about following the rote ceremonies and practices. Take a look at all these ideas and, in service to your customers and colleagues, find the way that works for you. How will you make work one of the best parts of your life?

-Evan Leybourn
Founder and CEO, Business Agility Institute

At the Business Agility Institute, we believe the next generation of companies has arrived. They are agile, innovative, and dynamic — perfectly designed to thrive in today’s unpredictable markets. Our mission is to advocate for, connect, educate, and inspire people within these organizations, encouraging them to create an environment of shared knowledge and trust that will usher organizations around the world into the future of business.

About the Authors

Shane Hastie is the Director of Community Development for ICAgile, a global accreditation and certification body dedicated to improving the state of agile learning. Since first using XP in 2000 Shane’s been passionate about helping organizations and teams adopt sustainable, humanistic ways of working – irrespective of the brand or label they go by. Shane was a Director of the Agile Alliance from 2011 until 2016. Shane leads the Culture and Methods editorial team for InfoQ.com

Ben Linders is an Independent Consultant in Agile, Lean, Quality and Continuous Improvement, based in The Netherlands. Author of Getting Value out of Agile Retrospectives, Waardevolle Agile Retrospectives, What Drives Quality, The Agile Self-assessment Game and Continuous Improvement. As an adviser, coach and trainer he helps organizations by deploying effective software development and management practices. He focuses on continuous improvement, collaboration and communication, and professional development, to deliver business value to customers. Ben is an active member of networks on Agile, Lean and Quality, and a frequent speaker and writer. He shares his experience in a bilingual blog (Dutch and English) and as an editor for Agile at InfoQ. Follow him on Twitter: @BenLinders.

Rafiq Gemmail is a freelance technical coach, teacher and polyglot who has coached the principles of fast-feedback with leading firms in New Zealand. He is a passionate advocate for mob programming, having supported cross-functional teams through over a year of mobbing at New Zealand’s largest news site. Raf is also a champion for DevOps culture and one of the organisers of New Zealand’s DevOps days. He is also an ICAgile certified coach.

Craig Smith has been a software developer for over 15 years, specialising in a large number of technologies in that time. He has been an Agile practitioner for over 10 years, is a Certified Scrum Master and Certified ICAgile Professional and a member of both the Scrum Alliance and Agile Alliance and currently works as an Agile Coach, fulfilling technical lead, iteration manager and Agile coaching roles on technology and business projects. He has presented at many international conferences and is a reviewer of a number of Agile and software development books. In his spare time, Craig is an avid motorsport fan.

Shaaron A Alvares is a News Reporter and Editor for DevOps, Culture and Methods at InfoQ and works as an Agile Transformation Coach and Trainer at T-Mobile in Bellevue, Washington. She is Certified Agile Leadership, Certified Agile Coach from the International Consortium for Agile, and Agile Certified Practitioner, with a global work experience in technology and organizational transformation. She introduced lean agile product and software development practices within various global Fortune 500 companies in Europe, such as BNP-Paribas, NYSE-Euronext, ALCOA Inc. and has led significant lean agile and DevOps practice adoptions and transformations at Amazon.com, Expedia, Microsoft, T-Mobile. She focuses on introducing the Agile mindset and customized value-driven practices aligned with organizational performance goals. Blogger, writer and speaker at local organizations, she is board member, advisor and contributor to global agile organizations such as Scrum Total, Agnostic Agile. Shaaron published her MPhil and PhD theses with the French National Center for Scientific Research (CNRS).

Evan Leybourn pioneered the field of Agile Business Management; applying the successful concepts and practices from the Lean and Agile movements to corporate management. He keeps busy as a business leader, consultant, non-executive director, conference speaker, internationally published author and father. Evan has a passion for building effective and productive organisations, filled with actively engaged and committed people. Only through this, can organisations flourish. His experience while holding senior leadership and board positions in both private industry and government has driven his work in business agility and he regularly speaks on these topics at local and international industry conferences. As well as writing “Directing the Agile Organisation”, Evan currently works for IBM in Singapore to help them become a leading agile organisation. As always, all thoughts, ideas and comments are his own and do not represent his clients or employer.

Douglas Talbot is an experienced technology and product leader, specialising in creating and leading teams building complex, innovative products across tech, engineering and science boundaries. He has scaled agile and digital approaches to over one thousand people, distributed internationally, with 24/7 operations. He is known for a strength in building teams, great cultures, and attracting great talent.



Article: Software Teams and Teamwork Trends Report Q1 2020

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

The Culture & Methods editors team present their take on the topics that are at the front of the technology adoption curve. How to make teams and teamwork more effective, in person or remote, some new tools and techniques, some ideas that have been around for a while and are starting to gain traction, the push for professionalism, ethical behavior and being socially and environmentally aware.

By Ben Linders, Douglas Talbot, Evan Leybourn, Craig Smith, Shaaron A Alvares, Rafiq Gemmail, Shane Hastie



Presentation: Streaming a Million likes/second: Real-time Interactions on Live Video

MMS Founder
MMS Akhilesh Gupta

Article originally posted on InfoQ. Visit InfoQ

Transcript

Gupta: Anyone know what the largest live stream in the world was? Think of something that grinds an entire nation to a halt.

Participant 1: Royal wedding.

Gupta: That was the second largest.

Participant 2: Cricket match.

Gupta: Yes, cricket. Believe it or not, it was the semifinal of the Cricket World Cup last year between India and New Zealand. More than 25 million viewers watched the match at the same time. Also, overall the match crossed 100 million viewers. As this gentleman said, the second largest was the British royal wedding, which was more than 18 million viewers concurrently. Remember this, we’ll come back to this.

This is me and my team. We call ourselves the Unreal Real-Time Team, and we love cricket and we love coding. We believe in something. We believe that problems in distributed systems can be solved by starting small. Solve the first problem, and add simple layers in your architecture to solve bigger problems.

Today I’m going to tell you a story of how we built a platform called the Realtime Platform using that principle to solve how we can stream or have many people interact simultaneously on live videos. I hope that a lot of you will be able to learn something from this and also apply it to your own systems.

Real-Time Interactions on Live Video?

What is a live video? A live telecast, or live conference broadcast, a sports match, all of them are examples of live videos. How many of you have interacted with a live stream on YouTube or Facebook? Perfect, so you know the general idea. The difference that these streams have is that they allow viewers to interact with each other. What is real-time interaction on live video? The easiest way is to just look at it or try it on your phones with the demo. People are able to like, comment, and just ask questions, and interact with each other. This is an example of a LinkedIn iOS app doing a LinkedIn live video. Similarly, on desktop the experience is very similar. There are many rich interactions that happen there. This particular live stream is from the LinkedIn Talent Connect conference that happened in September of last year.

I want to talk about the simplest interaction here, which is, how do these likes get distributed to all these viewers in real time all at the same time? Let’s say that a sender S likes the video and wants to send it to receiver A. The sender S sends the like to the server with a simple HTTP request. How do we send the like from the server back to receiver A? Publishing data to clients is not that straightforward. This brings us to the first challenge. What is the delivery pipe to send stuff to the clients? By the way, this is how I’ve structured the presentation today. I’m going to introduce problems and I’m going to talk about how we solve them in our platform. These are all simple layers of architecture that we’ve added to a platform to solve each of them.

Challenge 1: The Delivery Pipe

As discussed before, sender S sends the like to what we call the likes backend, which is the backend that stores all these likes with a simple HTTP request. This backend system now needs to publish the like over to the real-time delivery system. Again, that happens with a simple HTTP request. The thing we need is a persistent connection between the real-time delivery system and receiver A. Let’s talk a little bit more about the nature of this connection, because that’s the one that we care about here.

Log in at linkedin.com, and go to this URL, tiny.cc/realtime. This time you will actually see what’s happening behind the scenes. You should be logged in for this. For those of you who were successful, you should see something like this on the screen. This is literally your persistent connection with LinkedIn. This is the pipe that we have to send data over to your phones or to your laptops. This thing is extremely simple. It’s a simple HTTP long poll, which is a regular HTTP connection where the server holds onto the request. It just doesn’t disconnect it.

Over this connection, we use a technology called server-sent events, and this allows us to stream chunks of data over what is called the EventSource interface. The client doesn’t need to make subsequent requests. We can just keep streaming data on the same open connection. The client makes a normal HTTP GET request. It’s as simple as a regular HTTP connection request. The only difference is that the Accept header says event-stream. That’s the only difference from a regular HTTP connection request.

The server responds with a normal HTTP 200 OK and sets the content type to event-stream, and the connection is not disconnected. Chunks of data are sent down without closing the connection. You might receive, for example, a “like” object, and later you might receive a “comment” object. Without closing the connection, the server is just streaming chunks of data over the same open HTTP connection request. Each chunk is processed independently on the client through what is called the EventSource interface, and as you can see, there is nothing terribly different from a regular HTTP connection, except that the Content-Type is different, and you can stream multiple chunks of bodies on the same open HTTP request.
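The chunks described above can be sketched as plain text written to the open response. This is a minimal illustration of the server-sent events wire format, not LinkedIn's code: the helper name `format_sse` and the payload fields (`videoId`, `count`, `text`) are assumptions made up for the example.

```python
import json

def format_sse(event: str, data: dict) -> str:
    """Encode one server-sent event: an 'event:' line, a 'data:' line, then a blank line."""
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"

# Headers the server sends once, before the first chunk. The connection stays open.
RESPONSE_HEADERS = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
}

# Two chunks written one after the other on the same open response.
# The payload fields here are hypothetical.
stream = format_sse("like", {"videoId": "red", "count": 1}) + format_sse(
    "comment", {"videoId": "red", "text": "Nice!"}
)
print(stream)
```

The blank line after each `data:` line is what tells the client that one event is complete, which is why each chunk can be processed independently.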

What does this look like on the client side on web? The client first creates an EventSource object with a target URL on the server. Then, it defines these event handlers, which will process each chunk of data independently on the client. Most browsers support the EventSource interface natively. On Android and iOS there are lightweight libraries that are available to implement the EventSource interface on these clients.
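In a browser this is done by the built-in EventSource object. To show what such a client does under the hood, here is a rough Python sketch that parses an event-stream body and hands each event to a handler registered for its type. The function names `parse_sse` and `dispatch`, and the sample payloads, are assumptions for illustration only.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_type, data) pairs from the lines of a text/event-stream body."""
    event_type, data_lines = "message", []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "":                      # a blank line ends one event
            if data_lines:
                yield event_type, "\n".join(data_lines)
            event_type, data_lines = "message", []
        elif line.startswith(":"):          # comment line, often used as keep-alive
            continue
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):       # one leading space is not part of the value
                value = value[1:]
            if field == "event":
                event_type = value
            elif field == "data":
                data_lines.append(value)

def dispatch(lines: Iterable[str], handlers: Dict[str, Callable[[dict], None]]) -> None:
    """Call the handler registered for each event type, if there is one."""
    for event_type, data in parse_sse(lines):
        handler = handlers.get(event_type)
        if handler is not None:
            handler(json.loads(data))

# Usage: a hypothetical stream with one like and one comment.
rendered: List[Tuple[str, str]] = []
body = [
    ": keep-alive",
    "event: like",
    'data: {"videoId": "red"}',
    "",
    "event: comment",
    'data: {"text": "Nice!"}',
    "",
]
dispatch(body, {
    "like": lambda e: rendered.append(("like", e["videoId"])),
    "comment": lambda e: rendered.append(("comment", e["text"])),
})
print(rendered)
```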

Challenge 2: Connection Management

We now know how to stream data from the server to the client, and we did this by using HTTP long poll with server-sent events. What is the next challenge? Think of all the thousands of Indians trying to watch cricket. The next challenge is multiple connections, maybe thousands of them, and we need to figure out how to manage these connections.

Connection management – At LinkedIn, we manage these connections using Akka. Akka is a toolkit for building highly concurrent, message-driven applications. Anyone familiar with Akka Actors? A roomful. Yes, not Hollywood actors, they’re this very simple thing. It’s this little, small guy. This is the only concept you need to know to understand the rest of the presentation. Akka Actors are objects which have some state, and they have some behavior. The behavior defines how the state should be modified when they receive certain messages. Each actor has a mailbox, and they communicate exclusively by exchanging messages.

An actor is assigned a lightweight thread every time there is a message to be processed. That thread will look at the behavior that is defined for the message and modify the state of the Akka Actor based on that definition. Then, once that is done this thread is actually free to be assigned to the next actor. Since actors are so lightweight, there can be millions of them in the system, and each can have their own state and their own behavior. A relatively small number of threads, which is proportionate to the number of cores, can be serving these millions of actors all at the same time, because a thread is assigned to an actor only when there is something to process.
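The scheduling idea can be illustrated without Akka. The sketch below is a toy actor in Python, written under the assumption that a plain thread pool stands in for Akka's dispatcher. The `Actor` class and its methods are hypothetical and are not Akka's API. Each actor borrows a pool thread only while its mailbox has messages, so four threads can serve ten thousand actors.

```python
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor

class Actor:
    """Toy actor: private state, a mailbox, and a behavior."""

    def __init__(self, pool: ThreadPoolExecutor) -> None:
        self._pool = pool
        self._mailbox: deque = deque()
        self._lock = threading.Lock()
        self._scheduled = False
        self.state = 0  # here: number of likes this actor has seen

    def tell(self, message: str) -> None:
        """Enqueue a message and borrow a thread only if none is working for us yet."""
        with self._lock:
            self._mailbox.append(message)
            if self._scheduled:
                return
            self._scheduled = True
        self._pool.submit(self._drain)

    def _drain(self) -> None:
        """Process queued messages one at a time, then hand the thread back."""
        while True:
            with self._lock:
                if not self._mailbox:
                    self._scheduled = False
                    return
                message = self._mailbox.popleft()
            self.receive(message)

    def receive(self, message: str) -> None:
        """Behavior: how the state changes for a given message."""
        if message == "like":
            self.state += 1

# Four threads serve ten thousand actors.
with ThreadPoolExecutor(max_workers=4) as pool:
    actors = [Actor(pool) for _ in range(10_000)]
    for actor in actors:
        actor.tell("like")
        actor.tell("like")
# Leaving the 'with' block waits for all submitted work to finish.
total = sum(actor.state for actor in actors)
print(total)
```

Because only one thread drains a given mailbox at a time, `receive` never runs concurrently for the same actor, so the state needs no locking of its own.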

In our case, each actor is managing one persistent connection, that’s the state that it is managing. As it receives an event, the behavior here is defining how to publish that event to the EventSource connection. Those many connections can be managed by the same machine using this concept of Akka Actors. Let’s look at how Akka Actors are assigned to an EventSource connection. Almost every major server framework supports the EventSource interface natively. At LinkedIn we use the Play Framework, and if you’re familiar with Play, we just use a regular Play controller to accept the incoming connection.

Then, we use the Play EventSource API to convert it into a persistent connection, and assign it a random connectionId. Now we need something to manage the lifecycle of these connections, and this is where Akka Actors fit in. This is where we create an Akka Actor to manage this connection, and we instantiate an Akka Actor with the connectionId, and the handle to the EventSource connection that it is supposed to manage. Let’s get [inaudible 00:11:35] and see how the concept of Akka Actors allows you to manage multiple connections at the same time.

Each client connection here is managed by its own Akka Actor, and each Akka actor in turn, all of them, are managed by an Akka supervisor actor. Let’s see how a like can be distributed to all these clients using this concept. The likes backend publishes the like object to the supervisor Akka Actor over a regular HTTP request. The supervisor Akka Actor simply broadcasts the like object to all of its child Akka Actors here. Then, these Akka Actors have a very simple thing to do. They just need to take the handle of the EventSource connection that they have and send the event down through that connection. For that, it looks something very simple. It’s eventSource.send, and the like object that they need to send. They will use that to send the like objects down to the clients. What does this look like on the client side? The client sees a new chunk of data, as you saw before, and will simply use that to render the like on the screen. It’s as simple as that.
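The supervisor and child relationship described above might be sketched as follows. This is a simplified, synchronous Python illustration rather than the Akka and Play code used at LinkedIn. `FakeEventSource`, `ConnectionActor`, and `Supervisor` are hypothetical names, and the fake connection records events instead of writing to a socket.

```python
from typing import Dict, List

class FakeEventSource:
    """Stand-in for a persistent connection handle. It records what was sent."""

    def __init__(self) -> None:
        self.sent: List[dict] = []

    def send(self, event: dict) -> None:
        self.sent.append(event)

class ConnectionActor:
    """Manages exactly one connection: its state is the connection handle."""

    def __init__(self, connection_id: int, event_source: FakeEventSource) -> None:
        self.connection_id = connection_id
        self.event_source = event_source

    def receive(self, event: dict) -> None:
        # Behavior: push the event down the connection this actor owns.
        self.event_source.send(event)

class Supervisor:
    """Owns all connection actors on this machine and broadcasts to them."""

    def __init__(self) -> None:
        self.children: Dict[int, ConnectionActor] = {}

    def register(self, connection_id: int, event_source: FakeEventSource) -> None:
        self.children[connection_id] = ConnectionActor(connection_id, event_source)

    def unregister(self, connection_id: int) -> None:
        self.children.pop(connection_id, None)

    def publish(self, event: dict) -> None:
        for child in self.children.values():
            child.receive(event)

# Usage: three clients connect, then the likes backend publishes one like.
supervisor = Supervisor()
connections = {cid: FakeEventSource() for cid in (1, 2, 3)}
for cid, conn in connections.items():
    supervisor.register(cid, conn)
supervisor.publish({"type": "like", "videoId": "red"})
print([len(conn.sent) for conn in connections.values()])
```

At this stage every connected client receives every like, which is exactly the limitation the next challenge addresses.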

In this section we saw how an EventSource connection can be managed using Akka Actors, and therefore, you can manage many connections on a single machine. What’s the next challenge?

Participant 3: Fan-out.

Gupta: Fan-out is one. Before that.

Participant 4: [inaudible 00:13:19]

Gupta: Mailbox [inaudible 00:13:22]. We’re now already talking about big, big scale. Even before that, something simple. I’ll give you a hint. My wife and I always want to watch different shows on Netflix.

Participant 5: [inaudible 00:13:36]

Gupta: Yes. The thing that we did just now is just broadcast the like blindly to everybody without knowing which particular live video they’re currently actually watching.

Challenge 3: Multiple Live Videos

We don’t know how to make sure that a like for, let’s say, the red live video goes to the red client, and the green live video goes to the green client. Let’s assume that this client here with connection id3 is watching the red live video, and this client here with connection id5 is watching the green live video. What we need is a concept of subscription, so the client can inform the server that this is the particular live video that they’re currently watching.

When client 3 starts watching the red live video, all it does is it sends a simple subscription request using a simple HTTP request to our server. The server will store the subscription in an in-memory subscriptions table. Now the server knows that the client with connection id3 is watching the red live video. Why does in-memory work? There are two reasons. The subscription table is completely local. It is only for the clients that are connected to this machine.

Secondly, the connections are strongly tied to the lifecycle of this machine. If the machine dies, the connection is also lost, and therefore, you can actually store these subscriptions in-memory inside these frontend nodes. We’ll talk a little bit more about this later.

Similarly, client 5 also subscribes to live video 2, which is the green live video. Once all the subscriptions are done, this is the state of the front end of the real-time delivery system. The server knows which clients are watching which live videos.

When the backend publishes a like for the green live video this time, all that the supervisor actor has to do is figure out which clients are subscribed to the green live video, which in this case is clients 1, 2, and 5. The corresponding Akka Actors are able to send the likes to just those client devices. Similarly, when a like happens on the red live video, these actors are able to decide that it is destined only for connection ids 3 and 4, and are able to send them the likes for the videos that they’re currently watching.
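As a rough sketch of the subscription idea, the in-memory table can be modeled as a mapping from live video to the set of connection ids watching it. The names `FrontendNode`, `subscribe`, and `publish` are illustrative assumptions, and `delivered` stands in for what would actually be written to each connection.

```python
from collections import defaultdict


class FrontendNode:
    """Hypothetical frontend node with an in-memory subscriptions table."""

    def __init__(self):
        self.subscriptions = defaultdict(set)  # video id -> connection ids
        self.delivered = defaultdict(list)     # connection id -> likes sent

    def subscribe(self, connection_id, video_id):
        self.subscriptions[video_id].add(connection_id)

    def disconnect(self, connection_id):
        # A lost connection simply disappears from every subscription.
        for subscribers in self.subscriptions.values():
            subscribers.discard(connection_id)

    def publish(self, video_id, like):
        # Send only to connections subscribed to this live video.
        for connection_id in self.subscriptions.get(video_id, ()):
            self.delivered[connection_id].append(like)


node = FrontendNode()
for connection_id in (1, 2, 5):
    node.subscribe(connection_id, "green")
for connection_id in (3, 4):
    node.subscribe(connection_id, "red")

node.publish("green", "like-on-green")
node.publish("red", "like-on-red")
```

Because the table only describes connections held by this machine, losing the machine loses the table and the connections together, which is why keeping it in memory is acceptable here.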

In this section we introduced the concept of subscription, and now we know how to make sure that clients are only receiving likes for the videos that they’re currently watching. What’s the next challenge? Now we can go back to the gentleman here. Somebody already said here that there could be millions and millions of connections. There are simply more connections than a single machine can handle. That’s the next challenge.

Challenge 4: 10K Concurrent Viewers

We thought really hard about this. This is where we were a little stuck, and that’s us thinking really hard. We finally did what every backend engineer does to solve scaling challenges. You already know. We added a machine. We add a machine and we start calling these frontend servers. We introduce a real-time dispatcher whose job is to dispatch a published event between the newly introduced frontend machines, because now we have more than one.

Now, can the dispatcher node simply send a published event to all the frontend nodes? Yes, it can. It’s not that hard. It can, but it turns out that it’s not really efficient if you have a small live video with only a few viewers that are connected to just a few frontend machines. There’s a second reason which I’ll come back to a little later, but for now, let’s assume that the dispatcher can’t simply send a like to all the frontend machines blindly.

Given that, the dispatcher now needs to know which frontend machines have connections that are subscribed to a particular live video. We need these frontend machines to tell the dispatcher whether they have connections that are subscribed to a particular live video. Let’s assume that frontend node1 here has connections that are subscribed to the red live video, and frontend node2 here has connections that are subscribed to both the red and the green live video. Frontend node1 would then send a simple subscription request, just like the clients were sending to the frontend servers, and tell the real-time dispatcher that it has connections that are watching the red live video. The dispatcher will create an entry in its own subscriptions table to figure out which frontend nodes are subscribed to which live videos. Similarly, node2 here subscribes to both the red live video and the green live video.

Let’s look at what happens when an event is published. After a few subscriptions, let’s assume that this is the state of the subscriptions in the real-time dispatcher, and note that a single frontend node could be subscribed to more than one live video. Now it can have connections that are watching multiple live videos at the same time. In this case, for example, node2 is subscribed to both the red live video and the green live video.

This time the likes backend publishes a like on the green live video to the real-time dispatcher, and the dispatcher is able to look up its local subscriptions table to know that nodes 2, 3, and 5 have connections that are subscribed to the green live video. It will dispatch them to those frontend nodes over a regular HTTP request. What happens next? That you’ve already seen. These frontend nodes will look up their own in-memory subscriptions table that is inside them to figure out which of their connections are watching the green live video and dispatch the likes to just those ones.
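The two-level fan-out just described can be sketched as follows. This is an in-process illustration under stated assumptions: the real system sends regular HTTP requests between machines, whereas here `Dispatcher.publish` calls the frontend object directly, and all class and method names are hypothetical.

```python
from collections import defaultdict


class FrontendNode:
    """Hypothetical frontend node: keeps its own in-memory subscriptions."""

    def __init__(self, name):
        self.name = name
        self.connections_by_video = defaultdict(set)
        self.delivered = []  # (connection id, like) pairs sent to clients

    def subscribe(self, connection_id, video_id):
        self.connections_by_video[video_id].add(connection_id)

    def receive(self, video_id, like):
        # Second-level fan-out: only connections watching this video.
        for connection_id in sorted(self.connections_by_video.get(video_id, ())):
            self.delivered.append((connection_id, like))


class Dispatcher:
    """Hypothetical dispatcher: maps each live video to subscribed nodes."""

    def __init__(self):
        self.nodes = {}
        self.nodes_by_video = defaultdict(set)

    def subscribe(self, node, video_id):
        self.nodes[node.name] = node
        self.nodes_by_video[video_id].add(node.name)

    def publish(self, video_id, like):
        # First-level fan-out: only nodes subscribed to this video.
        targets = sorted(self.nodes_by_video.get(video_id, ()))
        for name in targets:
            self.nodes[name].receive(video_id, like)
        return targets


node1, node2 = FrontendNode("node1"), FrontendNode("node2")
node1.subscribe(connection_id=3, video_id="red")
node2.subscribe(connection_id=4, video_id="red")
node2.subscribe(connection_id=5, video_id="green")

dispatcher = Dispatcher()
dispatcher.subscribe(node1, "red")
dispatcher.subscribe(node2, "red")
dispatcher.subscribe(node2, "green")

green_targets = dispatcher.publish("green", "like-on-green")
```

The dispatcher never touches node1 for the green like, so a small live video costs work only on the machines that actually hold its viewers.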

We now have this beautiful system where the system was able to dispatch between multiple frontend nodes, which are then able to dispatch to many, many clients that are connected to them. We can scale to almost any number of connections, but what is the bottleneck in the system? The dispatcher is the bottleneck in the system. It never ends. The next challenge is that we have this one node, which is what we’re calling the dispatcher, and if it gets a very high publish rate of events then it may not be able to cope.

Challenge 5: 100 Likes/Second

That takes us to challenge number 5, which is a very high rate of likes being published per second. Once again, how do we solve scaling challenges? You add a machine. Engineers just do the most lazy thing and it usually works out pretty well. We add another dispatcher node to handle the high rate of likes being published. Something to note here: the dispatcher nodes are completely independent of the frontend nodes. Any frontend node can subscribe to any dispatcher node, and any dispatcher node can publish to any frontend node. There are no persistent connections here. The persistent connections are only between frontend nodes and the clients, not here.

This results in another challenge: the subscriptions table can no longer be local to just one dispatcher node. Any dispatcher node should be able to access that subscriptions table to figure out which frontend node a particular published event is destined for. Secondly, I tricked you a little bit before. This subscriptions table can’t really live in-memory in the dispatcher node. It can live in-memory in the frontend node, but not in the dispatcher node. Why? Because even if a dispatcher node is lost, let’s say this one just dies, we can’t afford to lose this entire subscriptions data. For both of these reasons we pull out the subscriptions table into its own key value store which is accessible by any dispatcher node at any time.

Now, when a like is published by the likes backend for the red live video on a random dispatcher node, and the green live video on some other random dispatcher node, each of them is able to independently query the subscriptions table that is residing in the key value store. They’re able to do that because the subscriptions table is completely independent of these dispatcher nodes, and the data is safe there. Our dispatcher nodes dispatch the likes based on what is in the subscriptions table, over regular HTTP requests to the frontend nodes.
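A minimal sketch of why externalizing the table helps: once the subscriptions live in a shared store, dispatchers hold no state of their own and any of them can answer any lookup. The `KeyValueStore` class below is an in-process stand-in for whatever external store is used, and its method names are assumptions made for illustration, not a real client API.

```python
class KeyValueStore:
    """In-process stand-in for an external key value store."""

    def __init__(self):
        self._data = {}

    def add_to_set(self, key, member):
        self._data.setdefault(key, set()).add(member)

    def members(self, key):
        return set(self._data.get(key, ()))


class Dispatcher:
    """Stateless dispatcher: all subscription state lives in the store."""

    def __init__(self, store):
        self.store = store

    def subscribe(self, node_name, video_id):
        self.store.add_to_set(video_id, node_name)

    def targets_for(self, video_id):
        return sorted(self.store.members(video_id))


store = KeyValueStore()
dispatcher_a = Dispatcher(store)
dispatcher_b = Dispatcher(store)

dispatcher_a.subscribe("node1", "red")
dispatcher_b.subscribe("node2", "green")

del dispatcher_a  # simulate losing one dispatcher node
red_targets = dispatcher_b.targets_for("red")
```

The subscription recorded through the lost dispatcher is still visible to the surviving one, which is the property the in-memory table could not provide.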

Challenge 6: 100 Likes/S, 10K Viewers Distribution of 1M Likes/S

I think we now have all the components to show you how we can do what I promised in the title of this talk. If 100 likes are published per second by the likes backend to the dispatcher, and there are 10k viewers that are watching the live video at the same time, then we’re effectively distributing a million likes per second. I’m going to start from the beginning and show you everything in one flow, because everyone tells me that I’ve got to repeat myself if I’m going to make sure that you remember something when you walk out of this talk.

This is how a viewer starts to watch a live video, and at this time the first thing that the viewer needs to do is subscribe to the frontend node, and subscribe to the live video topic that they’re currently watching. The client sends a subscription request to the frontend node, and the frontend node stores the subscription in the in-memory subscriptions table. The same happens for all such subscriptions from all the clients. Let’s go back [inaudible 00:24:35].

Now the subscription has reached the frontend nodes. The frontend node, as I said before, now has to subscribe to the dispatcher nodes, because the dispatcher will need to know during the publish step which frontend nodes have connections that are subscribed to a particular live video, so let’s look at that flow. The frontend node sends a subscription request to the dispatcher, which creates an entry in the key value store that is accessible by any dispatcher node. In this case, node1 has subscribed to live video 1, and node2 is subscribing to live video 2. This is the end of the subscriptions flow, so now we need to look at what happens during the publish flow.

The publish flow starts when a viewer starts to actually like a live video, so different viewers are watching different live videos, and they’re continuously liking them. All these requests are sent over regular HTTP requests to the likes backend, which stores them and then dispatches them to the dispatcher.

It does so with a regular HTTP request to any random dispatcher node, and they look up the subscriptions table to figure out which frontend nodes are subscribed to those likes and dispatch them to the subscribed frontend nodes. The likes have now reached the frontend nodes, and we have the last step which we began the presentation with. They need to send it to the right client devices. Each frontend node will look up its local subscriptions table, and this is done by the supervisor Akka Actor to figure out which Akka Actors to send these like objects to. They will dispatch the likes to the appropriate connections based on what they see in the subscriptions table.

Done. We just distributed a million likes per second with a fairly straightforward and iteratively designed, scalable distributed system. This is the system that we call the Real-Time Platform at LinkedIn. By the way, it doesn’t just distribute likes. It can also do comments, typing indicators, seen receipts, all of our instant messaging works on this platform, and even presence. Those green online indicators that you see on LinkedIn are all driven by this system in Real-Time. Everything is great. We’re really happy, and then, LinkedIn adds another data center.

Bonus Challenge: Another Data Center

This made us really stressed. We don’t know what to do, so we went back to our principle. We said, “Ok, how can we use our principle to make sure that we can use our existing architecture and make it work with multiple data centers?” Let’s look at that. Let’s take the scenario where a like is published to a red live video in the first data center, so this is DC-1. Let’s just assume that this is the first data center. Let’s also assume that there are no viewers of the red live video in the first data center. Remember I spoke about subscriptions in the dispatcher? It helps here, because now we might prevent a lot of work in DC-1 because we know whether we have any subscriptions for the red live video in DC-1.

We also know that in this case there are no viewers for the red live video in DC-2, but there are viewers of the red live video in DC-3. Somehow we need to take this like and send it to this guy over here, really far away. Let’s start. The likes backend gets the like for the red live video from the viewer in DC-1, and it does exactly what it was doing before. It’s not the likes backend’s responsibility, it’s the platform’s responsibility. We are building a platform here, and therefore, hiding all the complexity of the multiple data centers from the users that are trying to use this platform. It will just publish the like to the dispatcher in the first data center just like it was doing before. Nothing changes there.

Now that the dispatcher in the first data center has received the like, the dispatcher will check for any subscriptions, again, just like before, in its local data center. This time it saved a ton of work because there are no viewers of the red live video in DC-1. How do we get the like across to all the viewers in the other data centers? That’s the challenge. Any guesses?

Participant 6: Add another dispatcher.

Gupta: No, don’t add another dispatcher. We already have too many dispatchers.

Participant 7: [inaudible 00:29:47]

Gupta: Ok, so we can do cross-colo subscriptions, cross data center subscriptions. What’s another idea?

Participant 8: You can broadcast to any DC.

Gupta: Good, broadcast to any DC. We’ll talk a little bit about the tradeoff between subscribing in a cross data center fashion versus publishing in a cross data center fashion. It turns out that publishing in a cross data center fashion is better here, and we’ll talk a little bit about that a little later. Yes, this is where we do a cross colo, or a cross data center publish to dispatchers in all of the peer data centers. We’re doing that so that we can capture viewers that are subscribed to the red live video in all the other data centers.

The dispatcher in the first data center simply dispatches the likes to all of its peer dispatchers in all the other data centers, and in this case, a subscriber is found in DC-3 but not in DC-2. By the way, this dispatcher is doing exactly what it would’ve done if it received this like locally in this data center. There’s nothing special that it is doing. It’s just that this dispatcher distributed the like all over to all the dispatchers in the peer data centers. The viewer in DC-3 simply gets the like just like it would normally do, because the dispatcher was able to find the subscription information in DC-3. This viewer with the green live video does not get anything.

This is how the platform can support multiple data centers across the globe by keeping subscriptions local to the data center, while doing a cross colo fan-out during publish.
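The cross data center publish can be sketched like this. It is a simplified, in-process model: each `DataCenterDispatcher` (a hypothetical name) stands for the dispatcher tier of one data center, and the `from_peer` flag is an assumption added here so that a forwarded like is not forwarded again; the talk does not say how the real system prevents re-broadcast.

```python
from collections import defaultdict


class DataCenterDispatcher:
    """Hypothetical per-data-center dispatcher with local subscriptions only."""

    def __init__(self, name):
        self.name = name
        self.peers = []
        self.local_subscribers = defaultdict(set)  # video id -> viewer ids
        self.delivered = []  # (viewer id, like) delivered in this DC

    def subscribe(self, viewer_id, video_id):
        self.local_subscribers[video_id].add(viewer_id)

    def publish(self, video_id, like, from_peer=False):
        # Always serve local subscribers, exactly as in the single-DC case.
        for viewer_id in sorted(self.local_subscribers.get(video_id, ())):
            self.delivered.append((viewer_id, like))
        # Only the originating DC forwards, so a like is never re-broadcast.
        if not from_peer:
            for peer in self.peers:
                peer.publish(video_id, like, from_peer=True)


dc1, dc2, dc3 = (DataCenterDispatcher(n) for n in ("DC-1", "DC-2", "DC-3"))
for dc in (dc1, dc2, dc3):
    dc.peers = [other for other in (dc1, dc2, dc3) if other is not dc]

dc2.subscribe("green-viewer", "green")
dc3.subscribe("red-viewer", "red")

dc1.publish("red", "like-on-red")
```

The likes backend only ever talks to its local dispatcher; the fan-out to peers is hidden inside the platform, as the talk emphasizes.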

Performance & Scale

Finally, I want to talk a little bit about the performance of the system. It looks like everybody is here because, hey, scale. We did this experiment where we kept adding more and more connections to the same frontend machine. We just kept on going, and wanted to figure out how many persistent connections a single machine can hold. Any guesses?

Participant 9: A million.

Gupta: No, not that many. We also are doing a lot of work. It turns out that we were able to have 100,000 connections on the same machine. Yes, you can go to a million, but at the same time, because we’re also doing all this work, and because we use the system not just for distributing likes but also for all the other things that LinkedIn has, we were able to get to 100,000 connections per frontend machine. Anyone remember the second largest live stream?

Participant 10: Royal wedding.

Gupta: The royal wedding had 18 million viewers at peak, so we could do that with just 180 machines. A single machine can do 100,000 connections, and so with 180 machines you’re able to have persistent connections for all the 18 million viewers that are currently streaming the royal wedding. Of course, we just didn’t get to this number easily, so we hit a bunch of file descriptor limits, port exhaustion, even memory limits. Luckily we documented all of that at this link, tiny.cc/linkedinscaling. I hope that you will be able to get something out of reading something like this, because it’s very interesting. It’s just like regular scaling challenges, it’s just that we hit it in context of trying to expand the number of connections that we could hold on a single machine.

How about other parts of the system? How many events per second can be published to the dispatcher node? Before you answer this question, I want to talk about something really important about the design of the system which makes it massively scalable. The dispatcher node only has to publish an incoming event to a maximum of the number of frontend machines. It doesn’t have to worry about all the connections that these frontend machines are in turn holding. It only cares about this green fan-out here, which is the number of frontend machines that this dispatcher can possibly publish an event to, but it doesn’t have to worry about this red fan-out. That’s the part that the frontend machines are handling, and they’re doing that with in-memory subscriptions, with Akka Actors, which are highly, highly efficient in this. Now with that context, what do you think is the maximum events that you can publish to this dispatcher per second?

Participant 11: Ten thousand.

Gupta: Very close. That’s a very good guess. It turns out for us that number turned out to be close to 5,000, so 5,000 events can be published per second to a single dispatcher node. Effectively, we can publish 50,000 likes per second to these frontend machines with just 10 dispatcher machines. By the way, this is just the first part of the fan-out. These 50,000 likes per second will then be fanned out even more by all the frontend machines that are able to do that very efficiently. That’s a multiplicative factor there, and that will result in millions of likes being distributed per second.
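The capacity figures quoted in this section reduce to simple arithmetic, worked through below. The two constants are the measured numbers from the talk; the function names are made up for this sketch.

```python
import math

CONNECTIONS_PER_FRONTEND = 100_000  # measured figure quoted in the talk
EVENTS_PER_DISPATCHER = 5_000       # measured figure quoted in the talk


def frontends_needed(viewers):
    """Frontend machines required to hold one connection per viewer."""
    return math.ceil(viewers / CONNECTIONS_PER_FRONTEND)


def dispatcher_throughput(dispatchers):
    """Events per second that a pool of dispatchers can accept."""
    return dispatchers * EVENTS_PER_DISPATCHER


def deliveries_per_second(likes_per_second, viewers):
    """Every like is delivered once to every viewer of that live video."""
    return likes_per_second * viewers


royal_wedding_frontends = frontends_needed(18_000_000)  # 180
ten_dispatchers = dispatcher_throughput(10)             # 50,000
headline_rate = deliveries_per_second(100, 10_000)      # 1,000,000
```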

Lastly, let’s look at the time, because everybody really cares about latency. You’re building a real-time system so you’ve got to make sure that things are super fast. Let’s talk about the end-to-end latency. Say we record the time, T1, at which the likes backend publishes the like to our real-time platform, which is the dispatcher machine, and we record the time, T2, at which point we have sent the like over the persistent connection to the clients. The reason we are measuring it there is because you can’t really control the latency outside your data center. I mean, you have some control over it, but that’s the one that the platform really cares about. Then, the delta between T1 and T2 turns out to be just 75 milliseconds at p90. The system is very fast, as there is just one key value lookup here and one in-memory lookup here, and the rest is just network calls, and very few network calls.
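To make the measurement concrete, here is a small sketch of computing a p90 from pairs of T1 and T2 timestamps. The sample values are invented for illustration, the nearest-rank percentile method is a choice made here rather than something the talk specifies, and the sketch assumes both timestamps are already on a comparable clock even though they are recorded on different machines.

```python
def p90(values):
    """Nearest-rank 90th percentile of a non-empty list."""
    ordered = sorted(values)
    # Integer form of ceil(0.9 * n), avoiding floating point rounding.
    rank = (9 * len(ordered) + 9) // 10
    return ordered[rank - 1]


# Hypothetical samples: (T1 publish time, T2 send time) in milliseconds.
samples = [(0, 10), (5, 25), (10, 40), (15, 55), (20, 70),
           (25, 85), (30, 100), (35, 115), (40, 130), (45, 145)]

deltas = [t2 - t1 for t1, t2 in samples]
latency_p90 = p90(deltas)
```

With these ten made-up samples the deltas run from 10 ms to 100 ms in steps of ten, so the nearest-rank p90 is the ninth value, 90 ms.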

These are some performance characteristics of the system. This end-to-end latency measurement is also a very interesting thing. How do you really do that? Most of you must be familiar with measuring latencies for a request response system. You send an incoming request and the same machine can measure when the response is sent out, and therefore, you can say that, “It took this much time.” In this case, there are multiple systems involved. You’re going from the dispatcher to the frontend node, and then to the client. How do you measure latencies for such one-way flows across many systems? That is also a very interesting problem, and we wrote about it. We wrote a system that we built using nearline processing, using Samza. Samza is another technology that we use at LinkedIn, and you can use that to measure latencies across end-to-end systems across many machines.

We wrote about it at tiny.cc/linkedinlatency. Don’t have the time to dive into it here, but I would love to, and I hope that you get something out of reading something like this. If you have a system where you want to measure latencies across many different parts of the stack, you can use something like this to measure latencies.

Why does the system scale? I think it scales because you can add more frontend machines or more dispatcher machines as your traffic increases. It’s just completely horizontally scalable.

The other thing that I mentioned at the beginning of this talk is that we also extended the system to build presence, which is this technology where you can understand when somebody goes online and offline. Now that we have these persistent connections we know when they were made. We know when they were disconnected, so we also know when somebody came online and when somebody went offline, but it isn’t that easy, because mobile devices are notorious. They will sometimes just have a bad network. They might disconnect and reconnect without any reason. How do we average out or reduce all that noise to figure out when somebody’s actually online and when they’re offline, and not just jitter all the way where you keep going offline and online, because you have connections and disconnections simply because of the network that you have?

We wrote about that at tiny.cc/linkedinpresence, where we used the concept of persistent connections to understand how somebody goes online and offline, and we built the presence technology on top of the exact same platform, so I hope that’s also useful to you.
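One common way to smooth out this kind of jitter is a grace period: a user is reported offline only after staying disconnected for longer than some threshold. The sketch below illustrates that general idea; it is an assumption for illustration and not necessarily the approach described in the linked write-up. `PresenceTracker` and `grace_seconds` are hypothetical names, and times are plain numbers in seconds.

```python
class PresenceTracker:
    """Hypothetical presence tracker that ignores brief disconnects."""

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        self.connected = set()
        self.last_disconnect = {}  # user -> time of most recent disconnect

    def connect(self, user, now):
        self.connected.add(user)
        self.last_disconnect.pop(user, None)

    def disconnect(self, user, now):
        self.connected.discard(user)
        self.last_disconnect[user] = now

    def is_online(self, user, now):
        if user in self.connected:
            return True
        last = self.last_disconnect.get(user)
        # Still counted as online while inside the grace period.
        return last is not None and now - last < self.grace_seconds


tracker = PresenceTracker(grace_seconds=5)
tracker.connect("alice", now=0)
tracker.disconnect("alice", now=10)
flicker = tracker.is_online("alice", now=12)     # brief drop, still online
tracker.connect("alice", now=13)
tracker.disconnect("alice", now=40)
really_gone = tracker.is_online("alice", now=46)  # past the grace period
```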

Key Takeaways

That is probably a lot to consume in the last few minutes, so I’ll try to see if I can help you remember some of this. Real-time content delivery can enable dynamic interactions between users of your apps. You can do likes, you can do comments, you can do polls, discussions. Very powerful stuff, because it really engages your users.

The first piece you need is a persistent connection. For that, there is built-in support for EventSource in most browsers, and also on most server frameworks. There are also easily available client libraries that you can use for iOS and Android. Play and Akka Actors are powerful frameworks to manage connections in a very efficient way that can allow millions of connections to be managed on your server side. Therefore, they can allow millions of viewers to interact with each other. Everyone remember, Akka Actors, way cooler than Hollywood actors.

The principle I started this presentation with is that challenges in distributed systems can be solved by starting small. Solve the first problem and then build on top of it. Add simple layers in your architecture to solve bigger challenges. This is all we did throughout this presentation. When you hit a limit, horizontally scaling the system is usually a good idea. Add a machine, distribute your work.

The Real-Time Platform that I described to you can be built on almost any server or storage technology. You can use Node.js. You can use Python. All of these server frameworks support some methodology of maintaining persistent connections. For the key value store, you can use Couchbase, Redis, MongoDB, anything that makes you the happiest, anything that you’re already using.

Most importantly, you can do the same for your app. Real-time interactions are very powerful, and I feel that if you use some of the principles that I shared with you, you can do some pretty interesting stuff, and pretty dynamic experiences in your own apps.

Thank you, everyone, for attending this session. I’m a proud Indian. I work at LinkedIn in the U.S., and I’m so glad that I got this opportunity to talk to you here at QCon London. This talk and all of its slides will be available at tiny.cc/qcon2020, I’m assuming very soon. There is also an AMA session at 1:40 p.m. where you can come and ask me anything, not just related to this, but anything else that you have in your mind. That’s happening at Guild.

Questions and Answers

Participant 12: Do you have any fallbacks for clients that don’t support server-sent events? Or do you just say modern browsers are our focus here?

Gupta: The beauty of server-sent events is that they’re literally a regular HTTP request. There’s absolutely no difference between what a regular HTTP connection would do. In fact, WebSockets are something that sometimes get blocked by firewalls in certain systems, and we have never experienced a case where server-sent events don’t work. Because it’s a regular HTTP connection, most firewalls will not block it. Most clients will understand it, and we have never seen a case where server-sent events don’t work.

Participant 13: How do you synchronize your video stream with likes with time, basically?

Gupta: I think the question here is that once these likes have happened, how do you make sure that the next time somebody watches this video the likes show up at the same time? Is that what you’re asking?

Participant 13: Yes, and also, the video streams are delayed a little bit on different servers, and your likes are happening in different places.

Gupta: Yes, I think you must have noticed here that there is a delay. I think the question here is that, “I liked at moment X, but maybe the broadcaster sees it at moment Y.” Yes, there is a delay, and some of it is simply because of natural causes, just the speed of light. The other is that there is also sometimes something that we do to make sure that the broadcaster can cut something off if something is seriously wrong. The good thing here is that once somebody has pressed like, it will show up to all the viewers almost instantaneously. You can actually try it right now. If you press Like you should actually be able to see it almost immediately. The distribution is real-time, but yes, there may be a delay between when you think that the broadcaster said something versus when you actually liked it, and that is natural. I think there are also security reasons to do so.

Participant 14: My question is, do you have any consistency guarantees, especially in view of dispatcher failure even across data centers?

Gupta: Yes, great question – what about consistency? How do you show guarantees? How do you make sure that a like will actually get to its destination? The short answer is that we don’t. Because in this case, what we are going for is speed, and we’re not going for complete guarantees for whether something will make it to the end. Having said that, we measure everything. We measure the cross-colo dispatch. We measure the dispatchers sending requests to the frontends, and we also measure whether something that was sent by the frontend was actually received by the client. If we see our [inaudible 00:46:27] falling, we will figure out what the cause is and we will fix it.

Now, I do want to share something else now that you asked this question, which is Kafka. A natural question is, “Why not just do this with Kafka?” If you do it with Kafka, then yes, you do get that, because the way you would do it with Kafka is that the likes backend would publish a like over to a live video topic that is defined in Kafka, and then each of these frontend machines would be consumers for all the live video topics that are currently live. You already see a little bit of a problem here, which is that these frontend servers are now responsible for consuming every single live video topic, and each of them needs to consume all of them because you never know which connection is subscribed to which live video, and connected to this frontend server.

What this gives you is guarantees. You cannot drop an event anywhere here in the stack, but you can drop an event when you send it to the client from the frontend server, but you can detect that. In fact, the EventSource interface provides built-in support for it. It has this concept of where you are. It’s like a number that tells you where you are in the stream, and then if things get dropped, the client, the next time it connects, will tell the frontend server that it was at point X, and the frontend server can start consuming from the topic at point X. What you give away here is speed, and also the fact that the frontend servers will stop scaling after a while, because each of them needs to consume from these streams, and as you add frontend machines that doesn’t help. Each frontend machine now needs to still consume all the events from all the Kafka topics.
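The resume mechanism mentioned here relies on each server-sent event carrying an `id:` field, which a reconnecting client reports back in the `Last-Event-ID` request header. The sketch below shows the server side of that idea with an in-memory log; `TopicLog` and `read_after` are hypothetical names standing in for whatever durable stream is being consumed.

```python
class TopicLog:
    """Hypothetical append-only log of events with increasing ids."""

    def __init__(self):
        self.events = []  # list of (event id, data)

    def append(self, data):
        event_id = len(self.events) + 1
        self.events.append((event_id, data))
        return event_id

    def read_after(self, last_event_id):
        # Everything the client has not yet seen.
        return [e for e in self.events if e[0] > last_event_id]


def format_sse(event_id, data):
    """Serialize one event in the text/event-stream wire format."""
    return f"id: {event_id}\ndata: {data}\n\n"


log = TopicLog()
for like in ("like-a", "like-b", "like-c"):
    log.append(like)

# The client reconnects and reports Last-Event-ID: 1.
missed = log.read_after(1)
replay = "".join(format_sse(event_id, data) for event_id, data in missed)
```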

Participant 15: You have 100 connections to the clients, and some of the clients are very slow. They might not be consuming your data properly on the pipe. How do you ensure that you don’t have memory exhaustion on the side of [crosstalk 00:48:41]?

Gupta: Notice that when the frontend server sends the data, or has the persistent connection to the client, it is actually a fire and forget. The frontend server itself is not blocking on sending the data to the client. It just shoves it into the pipe and forgets about it, so there is no process, or no thread that is waiting to figure out whether the data actually went to the client. Therefore, no matter what these clients are doing, they might be dropping events, they might not be accepting it because something is wrong on the client side. The frontend server is not impacted by that. The frontend server’s job is to just dispatch it over the connection and be done with it, again, because we are going for speed and not for [crosstalk 00:49:24].
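A minimal sketch of a non-blocking send, under a stated assumption: the talk only says the server does not wait on the client, so the bounded per-connection buffer and the drop-when-full policy below are illustrative choices for keeping memory bounded, not a description of the real system. `ClientConnection` is a hypothetical name.

```python
import queue


class ClientConnection:
    """Hypothetical connection with a small bounded outgoing buffer."""

    def __init__(self, max_buffered):
        self.outgoing = queue.Queue(maxsize=max_buffered)
        self.dropped = 0

    def send(self, event):
        # Never block the sender: if the client is too slow, drop the event.
        try:
            self.outgoing.put_nowait(event)
        except queue.Full:
            self.dropped += 1


slow_client = ClientConnection(max_buffered=3)
for n in range(5):
    slow_client.send(f"like-{n}")
```

A client that never drains its buffer costs the server at most `max_buffered` queued events; everything beyond that is counted and discarded, trading delivery guarantees for speed.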


Presentation: Secrets of a Strong Engineering Culture

MMS Founder
MMS Patrick Kua

Article originally posted on InfoQ. Visit InfoQ

Transcript

Kua: I’m very excited to share this talk. It’s a new one that I built specifically for QCon. Before I start, I want to talk about two significant experiences I had in my career very early on. I’ve been working in technology for almost 20 years, and so this experience happened just after the dot-com crash. Maybe some of you remember this. I had a job coming out of university, and actually, I lost that job before I even started because they shut their office down. That’s how it was where I was growing up in Australia. It was this 2001, 2002 period where things were a little bit uncertain, where the whole dot-com bubble had burst.

I had this experience working for a large company, I’ll call it Enterprise A. In Enterprise A I was working on a healthcare platform. This was exciting because health tech is quite important, and it felt like something that we could actually build and release and make changes to hospitals, to governments, to help people get access to a health-tech platform. As an engineer, I was spending three or four months on calls with San Francisco and India, going through 200-page specifications. You can imagine how fun that was as a developer. This healthcare platform had a two-year project plan, so this was constantly being updated. You could imagine this whole plan and these people building this plan of trying to get a healthcare platform early to test, get it regulated. It’s a lot of work, but we weren’t going to see the first customer using this platform for two years.

What was interesting is that everyone had opinions on this. If you’ve worked in technology, you have lots of people with opinions. There’s a saying about hippos, “The highest-paid [inaudible]” perspective. Obviously, this was the age before Agile, and so we had a lot of different handoffs, so 200-page specifications written by product managers. We still had a build-and-release team who would frantically put together build candidates. If you can imagine what that was like, or if you work in that environment, you know what that’s like.

Then, of course, that very rigorous QA intense phase where we have relatively stable builds, and intense testing. That was a really interesting life-forming moment that I’m lucky that I experienced very early in my career. Now, a couple of years later after I left the enterprise, I ended up joining a startup. This was interesting because startups, at least in Australia, were quite rare. Here, we were actually practicing a lot of the extreme programming principles and practices.

We actually had a user researcher with us who was actually doing user testing back then trying to test out new types of prototypes to understand what would be a lot more appealing to people. We actually released every two weeks. This is before cloud, we released every two weeks. We prepared a candidate, we got it ready, and actually put this out live. Then we actually tested to see how changes actually performed with users.

We had monthly ideation sessions. We didn’t have lots of people throwing ideas and backlogs of ideas. We all actually generated ideas or hypotheses about what we think would actually have the biggest movement being on customer signup, or customer conversion, and actually trying to test that out. We, as developers, were doing the full lifecycle thing of actually doing production support as well.

Now, these two contrasting experiences for me were really life-changing. As an engineer, I’m thinking, “What environment do I thrive in the best? Where do I get this full size [inaudible] sure that I’m actually working in the environment I want to have?”

This is the challenge that you will probably face when you’re actually working with recruiting people or keeping people employed. Engineers today have the choice of going wherever they like. They’re implicitly thinking about, “Is this the culture I want to work in?” Therefore, which of these two choices would you actually go with, or which of the many choices would they go with? You have an ability to influence this if you’re a technical leader in your organizations by trying to shape this engineering culture. Over my 20-year journey, I want to share some of those secrets with you.

As Cassy [Shum] mentioned, we have these quests for secrets, and I’ll be your guide today. A little bit about my background. I’ve been working for about 20 years. I did consulting with ThoughtWorks for about 14 years. Then over the last two and a half years, I was CTO and Chief Scientist for the challenger bank N26, where I could put into play a lot of the principles and practices we had helped a lot of clients with, growing an organization from 50 people to about 370 over two and a half years. That wasn’t by chance because when you build engineering cultures, you have to be really deliberate about it. I had those two life-changing experiences early on in my career, so I understand what developers want, what engineers want to have. Your responsibility as leaders or managers is to think about this environment. You don’t need to manage engineers. Your value add is managing the environment so those engineers can do the best things that they can. These are some of the lessons that I want to share. If you’re interested, I have a couple of other books. I run a course for tech leadership and also a newsletter for leaders in tech.

Today, I want to talk about the journey that we’re going to be going on, and really I want to talk about why this is quite important and why you should actually care. Then we’re going to look at these three secrets. We’ll look at maybe some of the pitfalls with where companies go wrong, and also what you can do for each of these items, and then also describe some specific things that you can take away and actually put this into practice.

Why?

Let’s start with the why. Let’s have a talk about why this is actually important. If you’ve ever had to do your own hiring, you know it’s really hard. As a startup CTO in Germany, starting with 50 people and having to grow really rapidly, I knew that engineers can actually choose to work wherever they like. Every team that I’ve talked to in most companies is trying to hire somewhere. Ideas are not the problem in our industry; our problems are actually being able to build something, to sustain it, and also to experiment. Ideas are cheap, implementation is a lot harder. We’re all trying to hire, maybe we’re trying to grow talent, which is also a great way, but we just don’t have enough. There’s a big supply and demand problem in our industry for engineering talent.

As I said, ideas are cheap and easy. You can brainstorm hundreds of ideas, but actually, it’s only one or two of those things that you’ll actually be able to put into production and actually test. Your responsibility is thinking about, “Where does software belong, and how do you help it grow in your company in a sustainable way?” That’s where engineering culture matters. This is where all of you as tech leaders have an ability to influence it because your ability to support this is what will matter when you’re trying to actually hire people. We have the supply of developers over here, and we have all these vacancies. We’re all competing almost with each other, and it’s your responsibility to decide what sort of engineering culture you have so that you have the best chance of attracting the developers that you’re looking for.

This is one of the exciting things about technical leadership – you get to craft the culture that you want to have to attract similar people or people who want to work in that culture and experience that culture, and you have the ability to influence it. It’s a really exciting opportunity to have.

Remember that everyone has not just two choices, but often 10 or 20 different choices about where they’d like to work. We work in a very global, connected, easily mobile industry, and it’s my experience that if you’re not looking after this engineering culture, a competitor will probably hire your engineers away from you. That’s why this is one of the biggest, most important things you can think about as you’re trying to cultivate the right culture in your area.

Impact

Let’s talk about three specific secrets because these are the things that you’re here for. We’ll look at each of these different elements, one by one. The first secret really is about making sure that you have impact. What do I mean by impact? I think there are three different elements of impact that make a difference to a strong engineering culture. The first one is that as an engineer, I want to have impact with the customers. I want to know that what I do is useful for the people that we’re actually selling our products to. I also want to make sure that I see the outcome of what I’m actually achieving, and that I’ve actually got some satisfaction that I’ve solved a certain puzzle. We’ll talk about each of these elements.

One of the interesting things I’ve seen with some customer impact stuff is some organizations set up developers to be very far away from customers. Maybe you have an army of user researchers and product people in the middle. In those poor organizations where you still get handed a 200-page specification, an engineer doesn’t feel that connection with that customer. This is one big trap that happens in a lot of organizations, and it’s hard to unwind because you have to often work through organizational hierarchies and different departments to actually move towards that. It’s probably one of the reasons why we’ve moved towards a lot more of a cross-functional role and really try to put that customer perspective as close as possible to the developers. Keeping developers far away from customers is something that you should avoid if you want to build a strong engineering culture.

Another idea is this thing about feature factories. I’ve seen a lot of organizations where you have backlogs for years. You don’t want to be in that world. Engineers definitely don’t want to be in that world [inaudible] asked you for. That’s the feature factory. Engineers join software because they want to solve problems. They don’t want to just implement code. They want to actually have a part in that process of taking a problem, trying to be part of that solution, and really thinking about what the simplest way is to solve that problem through maybe some coding application. If they just have features to implement, that’s no fun as an engineer. That’s something else to avoid.

Another perspective is sometimes people get too solution-focused. This happens when you have a lot of product managers building lots of ideas about what could be done. You know that there’s a customer need, but somebody has come up with that perfect solution about what needs to be done. There’s no negotiation around it. Once again, engineers want to be part of this problem-solving perspective.

What about seeing the outcome? What are some pitfalls that a lot of companies fall into here? Well, one is there are just too many handoffs. Product people are handing specifications off to developers, developers are handing things off to testers, testers are handing things off to poor SRE or DevOps people, and everyone’s just continuing to work. As a developer, I’m not getting to see whether I make good decisions because I don’t see the downstream implications of what I’m doing. Once again, this is a really hard thing to solve because you need to have an ability to influence the organizational structure. Too many handoffs is something that you really want to try to remove.

Another idea is slow feedback loops. This is one of the reasons why I think things like cloud architectures, continuous delivery have really helped to improve that. When I worked in that enterprise with the two-year project plan to get to a release, I was really frustrated as a developer because I want to see the value of what I’m actually doing come to life. When I worked in a startup where we were then releasing every two weeks, that’s such a big step change from two years to two weeks because I actually can see this feedback loop, and I can see the things that I do have been adding value. Work on improving these feedback loops and try to remove the slow feedback loops.

Another challenge, which happens when you have an organization with a small set of developers and maybe lots of product people, is that people are often allocated to different types of products or projects. Developers are often context switching, one project to another. This happens very much in early-stage startups because you’re often very lean in your engineering team. All this context switching means that a person who’s working on something probably isn’t getting feedback about whether what they built previously is good quality or actually solved the customer need, and they’re maybe not really focusing on what the biggest value is. Be careful of some of these pitfalls.

Then, engineers join engineering so that they can solve really interesting puzzles. Martin Fowler just published an article about outcomes over outputs. I get asked a lot by business people, “How do you measure the productivity of teams?” “Do you use velocity? Do you use points? ‘My team is getting more points than your team.'” That’s a really bad measure because it’s measuring output. A lot of people who haven’t been working in software don’t really understand that you can come up with a really simple solution. It may take a little bit more time, but it may be less work. The hard bit is in thinking about how you approach the problem and how you solve this problem elegantly.

If your organization is measuring people by the number of tickets they complete, by the number of story points your team delivers, by the number of features you pump out, who knows if those features are actually producing the outcomes your company wants, such as customer growth or customer engagement? Really, you want to invert this and stop focusing on the outputs. These are just byproducts of our process. They’re no real measure of what the outcome is that you have with that output.

I’m one of those people who actually tries to instigate some planning, so I actually don’t think all plans are bad. One of the difficult things is when people have put together a plan, they then want to stick to the plan that they’ve built. Good planning processes mean that you adapt. You know that you’re going to get your plan wrong. It gives you a basis for saying, “How far are we off our mark?” I believe in estimation, not because the estimates and the date matter, but because it gives us a good idea about, “Do we share the same assumptions?” and “What assumptions are changing as we validate our work?” If you have organizations that are just simply saying, “But you promised it by this certain date,” you’re not going to be building a really good engineering culture.

One of the things that often happens if you work in a feature factory focused on lots of work is this idea of just lack of celebration of feedback. “Let’s move on to the next task. Let’s forget that we’ve actually achieved something really great, and let’s just move on to the next feature, the next sprint, the next iteration.” These are all things that you can avoid if you’re focused on this. The principle I use for this is really cultivating impact by reducing feedback loops. Now, what does that mean? What are some concrete examples of what you can do for this? When I first came into N26, it was interesting because we’re a B2C bank, so all of you could sign up as a customer, and engineers could actually get closer to the customer. This is a little bit more difficult if you’re in a B2B platform where you’re selling to other companies, who then sell on to end customers. One of the practices I encouraged our teams to do is actually go sit next to our customer services people. They weren’t actually in the same building, but they were in the same city. It wasn’t much effort to go over for an afternoon, listen to some calls, and understand what problems people were actually having.

I remember one time. We had this activity called “Get Stuff Done Days.” Every six weeks, for two days, people can choose what they do. A team went over to customer services, they sat with customer service people, listened to some calls, and an engineer went, “I didn’t realize this was a problem, I can fix this right now,” and literally, in an afternoon, fixed the customer issue that was plaguing customer services. The further away you are from customers, the harder it is [inaudible] This is a practice of shadowing CS that you can have to make sure that developers get closer to customers and get quick feedback about how their software is actually progressing.

Another thing, and this is one of the reasons why I think Agile has become really successful: it’s not because of big scaling methods like SAFe or whatever, it’s because you get to deliver in small increments. Any large transition or migration project that you’ve ever worked on where you’re not actually delivering value in small increments is destined to fail. We’ve learned this through smaller sprints, smaller iterations. Make sure you deliver small outcomes because then we can use that to improve our plans. As developers, you get really good feedback about whether or not you’re actually building something that’s useful. I’ve worked on teams that have released something into production after, say, a month, only to find out users don’t use it. They could actually can the whole project rather than continue on with the plan because they’ve got real user feedback.

OKRs are a very controversial topic, and this is a whole other talk by itself. One of my pet peeves with how OKRs are used is that they’re often used as management by objectives, so management by metrics. If you use OKRs well, what you should be doing is making sure that OKRs are used for aligning the organization and making sure that people understand, “Here’s how my work contributes…” Most companies should have some company mission, a goal that they’re trying to reach. Probably every year or every six months, there are large initiatives or themes that your company is focused on improving. It’s maybe customer growth, stability, or customer conversion or activity.

Then when a developer is working on a task, it’s hard sometimes to understand how is this linked to the bigger, broader picture. This is where I think an effective use of OKRs is really important, because it’s helping connect this feature to some initiative to the company mission. Once again, engineers want to have impact. They want to have purpose. They want to understand why they’re working on something. OKRs, when done well, can actually have a very important effect on this.

The other thing is really about accelerating release speed and frequency. How many of you have read the book, “Accelerate?” Very good. If you haven’t, I highly recommend it. It comes from the research that has come from the “State of DevOps” report that has been running probably for the last four or five years. They actually come up with some good metrics about how you do measure the productivity of an engineering organization. One of them is really about this speed to release. If you can release every two weeks, if you can release daily, hourly, or on-demand, it’s even better.

You can’t just simply release really rapidly. You also need some counterbalancing metrics, like making sure that when you release, you don’t break everything. There are also things like change failure rate on release, so that you’re actually developing quality as well. This is one of the biggest lessons learned from both continuous delivery and that book, which is, the quicker that you accelerate this feedback loop, the better it is, and the better your engineering organization will be as a result. If you’re looking for a big lever, this is probably one of the biggest ones to start with. It’s a hard one if you’ve got really long release cycles, but it’s definitely worth it as well.
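As a rough illustration of pairing a speed measure with a counterbalancing quality measure, here is a minimal Python sketch that computes how often a team releases and what share of those releases caused a failure. The `Deployment` record, its `caused_failure` flag, and the sample data are all hypothetical, made up for this example; they are not taken from the talk or from any particular tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Deployment:
    day: date             # when the release went to production
    caused_failure: bool  # hypothetical flag: True if it needed a rollback or hotfix


def deployment_frequency(deployments, period_days):
    """Average number of production releases per day over the period."""
    return len(deployments) / period_days


def change_failure_rate(deployments):
    """Share of releases that led to a failure in production."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d.caused_failure)
    return failed / len(deployments)


# Made-up data: four releases in a 14-day window, one of which broke something.
releases = [
    Deployment(date(2020, 3, 2), False),
    Deployment(date(2020, 3, 5), True),
    Deployment(date(2020, 3, 9), False),
    Deployment(date(2020, 3, 12), False),
]

print(f"releases per day: {deployment_frequency(releases, period_days=14):.2f}")
print(f"change failure rate: {change_failure_rate(releases):.0%}")
```

Tracking the two numbers together is the point: releasing more often only counts as an improvement if the failure share does not climb with it.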

Choice

We looked at impact as one of the secrets. Let’s talk about another one of these secrets. This is really about choice. Engineers love choice, but maybe a little bit too much choice sometimes. When developers disagree, they love arguing over these abstract ideas. Choice is quite an important part because it’s part of this problem-solving perspective. Part of it is having some say and being able to say what the solution is, having a say in how I work as an engineer. This is, “I like to have a say about how I conduct my daily business,” and then also degrees of freedom. It’s interesting because I don’t say complete freedom, and we’ll come back to this in a bit.

Around solution choice, something to be careful of in a large organization is that it’s tempting to say, “You have to use x framework.” The larger your organization is, the more you want to have some consistency across it, so it seems to make sense that everyone has to use this framework. You have to be careful because that can backfire: it’s removing choice from developers. Often, people are making decisions and then developers get to just simply implement them; you get this if you have a separate architecture group. That’s really no fun if you’re an engineer, and you really want to make sure that engineers are involved early so that they’re part of that process of choosing how they build particular solutions. Then if the output is predetermined, so “Here are all the tasks that you have to implement for a particular feature,” that’s also not great from an engineering perspective.

In terms of how to work, one of the interesting things after working in Agile environments for a very long time is I’m really non-prescriptive. I’m not going to go to a team and say, “You have to do stand-ups at 9:00.” Even though I wrote a book on retrospectives, I’m not going to say, “You have to do a retrospective every Friday.” I don’t believe in that. If you have a single way of working across your entire organization, that’s a big smell you want to watch out for. It’s ok to have prescriptive processes at the start, but if you’re not actually improving your processes and changing them, you’re probably not actually living the Agile values, because those are about improving and adapting.

If you’re optimizing for an individual, this is really hard because as an individual engineer, I might have my personal preferences. As a leader or manager, you want to optimize for the team and for the organization, not for a particular individual. You have to be careful about boundaries of freedom because one individual’s choice might start to impact the productivity of your team. Degrees of freedom is hard because we want to give lots of choice. You hear this term about developer autonomy, giving them lots of power to choose.

You have to be careful about heavyweight rules, and we’re seeing a trend away from architecture rules to more principles to help people understand what a good decision looks like in their environment. I think there is something about too much freedom, so to say you can make any choice, whatever you like. I’ve seen a lot of companies that say you can use whatever programming languages you like and then ended up with 20 programming languages when they have 10 developers. You can imagine how that’s going to be maintenance-wise long term.

There’s an interesting balance between autonomy and alignment. My autonomy can start to step onto somebody else’s autonomy because I’m making a choice that affects more than me. This is where you as a leader are trying to find the right boundaries of both aligned decision making and autonomous decision making. Not having a clear decision-making process is another thing that fuzzies up the boundaries. As a person, you have a certain number of decisions that you can make for yourself. As soon as a decision starts to impact other people or other departments, your decision-making power starts to become weaker. You can give input, but you’re maybe not the decision-maker. To help improve this, you have to describe what those boundaries are for where you are.

The principle here is really cultivating choice by making the right thing easy to do. If you want people to use a preferred framework, make it easy for them to adopt it, make it valuable. That’s the interesting thing. If it’s useful, people will come to it. That’s one of the principles you can use here. The good thing here is you should automate the basics. This is one of the reasons we’re seeing a lot of stuff around developer productivity tools. Internal teams are really focused on building common libraries, not because everyone’s going to be forced to use these libraries, but just because they’ve solved this problem, and my team can benefit from that immediately. Scripts that automatically bootstrap new environments and new projects make it easy. These are examples of making the right thing easy to do, which then helps people build on the interesting choices.
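To make the idea of a bootstrap script concrete, here is a minimal Python sketch, assuming a team wants every new project to start with the same basic layout. The function name `bootstrap_project` and the chosen layout (`src`, `tests`, a README) are hypothetical, picked only for illustration; a real internal tool would encode whatever defaults the organization has agreed on.

```python
import tempfile
from pathlib import Path


def bootstrap_project(root, name):
    """Create a minimal project skeleton under root and return its path."""
    project = Path(root) / name
    # Standard folders every new project starts with (an assumed convention).
    (project / "src").mkdir(parents=True, exist_ok=True)
    (project / "tests").mkdir(exist_ok=True)
    # Only write the README if it does not exist, so re-running is safe.
    readme = project / "README.md"
    if not readme.exists():
        readme.write_text(f"# {name}\n\nGenerated by the bootstrap script.\n")
    return project


if __name__ == "__main__":
    # Try it out in a throwaway directory.
    with tempfile.TemporaryDirectory() as scratch:
        created = bootstrap_project(scratch, "demo-service")
        print(sorted(p.name for p in created.iterdir()))
```

Nothing forces a team to use a script like this; the design intent is that the preferred setup becomes the path of least resistance.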

Use architectural principles over rules. The 12-factor architecture is a really good example of 12 simple things that you can explain that talk about cloud-native or cloud-friendly architectures. That’s something we’re seeing a lot with different types of companies that are starting to adopt architecture principles. Your company is at a different stage, you’re optimizing for a different problem. These are things that you can help articulate. Make sure that you set up the goal, and step out of the way. This is really hard, because we’re often, as developers, people who want to maybe come up with a solution for other people. Actually, if we really want to engage with people, what we should really be talking about is, “Here is the problem that needs to be solved. Here are the constraints of how we’re trying to solve that problem.” You want to give that to the team and give them the ability to actually problem solve that.

Too many places start to take away that choice by actually already predetermining the solution for those teams. The best way that you can engage with developer choice is making sure that you’re clear about what it is that needs solving and also what constraints that solution should satisfy. Then to solve the decision-making boundary, be clear about your decision-making process. For a decision that affects the entire organization, you might decide that, rather than everyone having a democratic vote, everyone can give input. Actually, the choice will be made by a smaller group of people who are trying to balance out the needs of all these different teams and groups. This is ok. You don’t have to give everyone absolute input or choices for every decision that you make. You just need to be clear about where those boundaries are and what types of decisions you have.

I tend to think of technical decisions in terms of the things that affect a team, the things that affect a subpart of a product, and then the things that affect the whole platform as well. You want different levels of decision-making processes, depending on how irreversible the decision is and how wide that impact will be when you actually make that decision.

Improvement

We looked at impact, we’ve looked at choice, what is the third secret? Let’s have a look. This one is really about improvement. Developers want to continually see things improve. I had a hard time thinking about specifically what this meant to me, but one is definitely about personal improvement. As a developer, one of the common things that probably we all share is we want to learn, we want to grow. That’s a key thing. Also, if I’m working for a company, I want to make sure that our company is also improving. If we’re working for Tesla, we want to see movement towards automated [inaudible]

Then the other thing is really about the good work environment. Engineers will often complain, and I think one of the good things about a good engineering management group is making sure that people’s complaints get addressed. Engineers will often think about the things that they can’t control, things that are in their environment. To create a good engineering culture, you want to see improvements in that environment as well. I see that as the core responsibility of management.

Some traps around each of these different areas: if you don’t have any growth opportunities, engineers will leave your company. It’s a natural part of our process. You see this with startups, and you see this with companies who aren’t growing. People will want a bigger role, more responsibility. One of the challenges that you have as a leader is trying to find something that’s challenging for this person, and sometimes that will be across the boundaries of your company. They’ll have to join another company in order to have that opportunity to really grow.

Be careful about repetitive work: if something is repetitive, invest in automating it. Or, if it’s something that needs doing, rotate that role. Be careful about things that are just simply the same thing day in day out because people will get bored, and engineers need new challenges all the time so they feel that they’re actually learning. Another thing that happens quite a lot is really a lack of feedback and support. A common thing I hear from engineers is “How have I been doing? Have I been growing?” By the time that you get to an annual performance review of some sort, there shouldn’t be any surprises. If there are, it’s probably an indicator that your environment is not providing the right feedback or support. Make sure that you address each of these things if you really want to help engineers grow.

The other thing that can stop people is really this fear of failure. If somebody makes a mistake, and they get fired for it, how safe do you think that environment is for learning? You need to think about what causes that and how to create safety when people do make mistakes. Very rarely is it a person that makes that mistake; it’s more about the system that allowed that mistake to happen. That’s actually a leadership responsibility. Remember that bored people will quit. If you’re not actually encouraging them to grow and finding opportunities to grow, they will leave you, and they’ll find some other place where they can actually do that.

What about traps around the company mission? One of them is unclear priorities. It’s something that I see a lot in different organizations: if you go into an organization and you don’t have clear priorities, this is really confusing because you don’t really understand how your work connects to the broader mission. Early on in my time at N26, I actually visited Spotify in New York and talked to one of their senior directors. It was really interesting because they’ve got six levels of prioritization. That’s crazy. Then if you think about it, Spotify is something like 3,000 or 5,000 people.

Once again, it’s interesting because you have different hierarchies of prioritization, but it’s very clear in the order of those priorities. Me as an individual, I know if I’m working on something, and it’s not the topmost priority, I can go point to the common source and say, “I think we should be working on this thing instead,” so I know I’m actually helping the company move in a way that it’s going. This is something that you can actually influence.

Make sure that you re-emphasize this mission that people are doing, but also make sure that they can understand initiatives. Sometimes a mission can be too broad and abstract. Too much of a big leap. It’s like, “How do we know that we’re moving to perhaps autonomous vehicles?” Then you can help people understand, “Here’s the initiative that’s actually helping towards that, and this is why your task is actually helping move towards that task.” Try to address this gap.

Another thing that happens a lot is really non-transparent information. Maybe you have priorities, but maybe they’re stuck in JIRA somewhere. Nobody likes JIRA. One of your tasks is really to make sure that people understand where information is and how to pull that information so they get the best out of it.

Traps around the work environment: things to avoid here are these sayings about “That won’t work here.” You see this in very large organizations that aren’t used to lots of change. “We tried that, and it was really painful. No, that won’t work here again.” You need a supportive environment to try things out. Those things will sometimes fail, they’ll be a bit painful, but, actually, you need that environment to say, “We don’t know if it’s going to work out, but let’s see if it actually improves the environment.”

If people wait to ask, “Can I do this?”, this is a big indicator that people don’t feel safe, or they’re not encouraged to improve their environment. Rather than asking for permission, you really want people to ask for forgiveness. “I tried out this thing, I’m sorry, I think I stuffed up.” You want people to continually improve their environment.

Then, once again, this lack of feedback about what your company is actually doing to take in employee feedback. How many of you do something like Culture Amp, measuring what’s going on in your company? It’s about a third. Even when those core trend surveys come out, how many of those companies are actually taking that feedback in and saying, “We’re doing these initiatives based on your feedback”? Often, management will be so busy just getting those things done that they’ll forget to connect their initiatives to the feedback that actually happened. Make sure that you take in that feedback, but make sure that you act on it and make that connection to where that feedback comes from, because time will pass and people will forget.

The principle here is really cultivating improvements by making small changes easier. This is something that all of you can actually influence, be it in your team, be it with another team member, or across your organization. Some concrete ideas are really making sure that you take actions from retrospectives. My pet peeve: “We talk about this topic. It comes up again and again. Why does it keep coming up? Oh, we didn’t do that action from last time.” Making sure that people follow up on actions is one of your basic fundamentals, really. Making sure that you improve the system through post-mortems.

I see this with not just incident post-mortems, but also with project post-mortems. For some large-scale projects that maybe didn’t go so well, you need to take the time out to actually reflect on what things did go well, so you can amplify those things, and what things you would change for the next time your organization takes on a large initiative. This takes time out because rather than working on the next item, you’re really trying to focus on improving the system. That’s really worthwhile.

Creating forums is an important part. I don’t believe in pushing everyone to work in the same way. What I do believe in is creating ways that people can pull advice from other people who faced similar scenarios. Be it your guilds of people with the same functional skills, or be it at the end of an initiative, or some brown bag, or lunch-and-learn learning session, you really need to be thinking about what environments you create to amplify these so that people are aware, “Somebody in my organization has already faced this problem, and I know who to reach out to in order to understand how they approached it.” These are some of the things that you can actually do.

Also, make sure that you define growth paths. You do this a lot in early-stage startups where people are just doing everything that they need to do to keep the business running. Then people want to ask, “How do I grow? What’s my next step? What’s my next challenge?” Even in larger organizations, you may have these growth paths, but what you’re trying to do is help people find out how it applies to them in their environment. What’s the next opportunity based on what they want to grow in? Make sure that you talk about different growth paths, and not just a management track, but really an individual contributor track or a technical leadership track where people can stay technical and hands-on. That’s a really strong indicator of a good engineering culture.

How

We’ve looked at these three secrets, let’s quickly talk about how you can actually take some action from this. A very simple five-step recipe. First step is make sure that you gather input. You talk to your existing organization. You have engineers in your team, you should listen to why those engineers are still working for you. There’s probably some good reasons. Gather some input from those people.

Publish your tech culture. One of the interesting things about this is once you’ve talked to people, synthesize that information. What makes your environment different? Why do people choose to work for your company, rather than somebody like Google or Amazon? There are lots of great reasons. Maybe you get to work on a product that you’d never get a chance to work on in a big giant, perhaps you get to work in a much smaller team so that you know everyone. That can be really important for people as well, but make sure you can describe your tech culture to other people.

Then think about prioritizing key improvement areas. As you talk to people, there’ll definitely be pain points, which aren’t so ideal. Here’s what you want to actually talk about: “Ok, these are really good parts of tech culture, but here are some areas where we should definitely improve as part of that.” Then decide on actions as a leadership or a management group, and make sure you follow them up.

If it’s that people aren’t getting feedback on a regular basis, find out a way to make sure that they get feedback from peers on a monthly basis or something like that. Maybe people need feedback training, and so you have to actually decide how you’re actually going to fulfill that to improve this engineering culture.

Then, it’s as simple as repeating. It’s a five-step simple recipe about how you can improve your engineering culture.

We all have metrics, we all have good questions, so here are three simple questions you can use to ask yourself, “Are you building a strong engineering culture?” How many handoffs do you have between software engineers and customers? Here’s a hint, fewer is better. Are software engineers involved in how they solve a business problem, or are they just simply implementing a solution? What opportunities do software engineers have to grow? These three simple questions link back to the three simple secrets about impact, choice, and improvement.

Conclusion

You all have an ability to improve your engineering culture, regardless of where you are, and where you want to actually get to. I hope that you have a better understanding about where these different elements are that you can influence. Remember that engineers have these choices. You have the ability to steer how they choose this by really focusing on improving your engineering culture, so it becomes the appealing place for the types of people that you actually want to hire.

Questions and Answers

Moderator: I’ll start off with one question. One of the second things that you said was publishing your tech culture, and so a tactical question around that is, how do you publish it? Where does that get published? I’m assuming you’re taking inputs from the engineers, but what does that look like if I’m going to do that tomorrow?

Kua: What I would say with codifying your tech culture is starting to look at maybe other sources for inspiration. A lot of companies have published their company handbook. You can think about the “Valve Handbook,” or Amazon’s “12 Principles of Leadership.” You can use them as inspiration for thinking about how you want to codify your culture. I don’t think there’s a single correct place, so you might start by doing an internal blog post or something like that. There are so many different ways that you can implement it, but you have your own choice about how you’d like to do this.

As you start to write your culture, don’t write it and then publish it to the world. Publish it internally first. As you talk to people, as you try to work out what elements of your culture are unique, these are things to test out with your engineers. If it doesn’t resonate with them, then it won’t really fit. They’re not going to be supportive of it. This is one interesting thing about when I join a new company, or when I’m helping a new company, is I really try to listen to understand what makes that company unique, and every company is unique for its own reason. I’d start off with something simple like an internal blog post, maybe an internal presentation or info deck, but just a simple way to describe the company culture that is written down. That would be the most important part.

Participant 1: It would be interesting to hear your thoughts on how to foster a culture of learning. Currently, where I’m stuck, there’s a lot of, “This is the way we do things because we’ve always done it that way,” and I want to try to attack that and to break it down. People are fearful of change and of new technology, and I really don’t understand why people who are fearful of new technology are in tech, in general, because you should probably do something else. It would be interesting to hear your thoughts on that.

Kua: I think that’s a couple of things. I don’t believe in making people do things. Think about what stops them from taking part, and also what encourages them. They’re the two levers that I tend to think of. In terms of what’s stopping them, sometimes it’s maybe too much work. That’s a common thing; engineers often feel pressure to build a new feature, to complete the things that their team has been working on or what they’ve been allocated to. I think one thing is, as leaders, trying to make sure that there’s enough slack time or opportunities that people can take part without feeling pressure from normal work.

I think if everyone’s working 100% capacity of functional work, they have no time for learning. As a leader, I’m thinking about, “How do I create some time so that they can actually take part?” That might mean that you start off with lunch-and-learns. You do a learning activity during lunchtime, incentivized by bringing in free food – everyone loves free food – and then maybe try to make sure that you amplify that. This is also the encouraging stuff because I think a lot of it comes down to what people get rewarded by. In most organizations that don’t have a strong engineering culture, everyone gets rewarded by the feature or the output that they produce.

The other thing that you should be doing is making sure that you make learning a rewarded part of your culture. When I go into a company or to a team where learning hasn’t been a big thing, and I see somebody either trying to do a lunch-and-learn session or to do a book club, I amplify that message. I say, “This is a really great thing. This is important.” You as a leader can explicitly focus on those elements and amplify them to say that this is an important thing that you want to cultivate and encourage, and then you might be able to find time to actually explicitly give people that time. Companies do things like hackathons. Google was famous for its full day a week, so that people had one day for personal projects. They’re the concrete examples I would think of.

Participant 2: You mentioned about using a framework like OKRs for alignment, which imply that they should not perhaps be used for other things. I was wondering whether you could enumerate where you wouldn’t use OKRs, or things that you’ve seen before?

Kua: One of the things I’ve learned from systems thinking or Theory of Constraints, is the saying “Tell me how you’ll measure me, and I’ll tell you how I behave.” My pet peeve with OKRs is coming from particularly business people. Business people often want a metric, and then they want to manage people by that metric. If I’m a developer, and I’m measured by the amount of code I produce, I can write a lot of code. It may not be the most maintainable, it may not be the easiest to change in the future, but I can definitely play that metric. That’s the big pitfall that I would see with a lot of OKRs, is that a lot of people forget that when you talk about OKRs, the first thing to talk about is this objective, what’s the purpose? What’s the goal? A lot of people focus too much on the key result.

What number are we trying to touch, and how big are we going to get to that number? I think key results don’t always have to be a number as well. Key results can be like an initiative. You start a project and you deliver that project, for instance. A key result might be that that project is delivered. Or, from a leadership perspective, some key results that I’ve used in the past are like, “We’re going to publish a growth framework.” It’s not really a metric. To help engineers understand how they want to improve and grow, we want to be clear about what their potential growth paths are. Publishing that would be maybe a quarterly key result. That’s one of the big pet peeves that I have with OKRs: they’re often used for management through objectives, which is management through numbers and targets, forgetting that the whole purpose of OKRs is really about the objective. Making sure that this objective is contributing to the greater goal, and people forget that.

Participant 3: Excellent talk. Thank you for that. I have a question around processes. How do you decide as a leader which ones you should mandate and which ones you want to give autonomy for? For example, you want to say that these are the principles we will follow for code reviews, and that’s just good practice, good engineering culture. Let’s say, should we say things like, “This is how we’re going to write acceptance criteria across all the 50, 60 engineering teams,” and so on? What is your thought on that?

Kua: This one’s a tough one. It’s one that I still struggle with. When I make a choice about a process, what I’m trying to think about is optimizing for the entire system. That’s my guiding principle: if I make a process mandatory, does it improve everyone’s lives across the organization by either removing uncertainty, or by making something that is already a common default more explicit? Then I’m very careful about what processes we implement.

At N26, for instance, I came in, and we had a TypeScript back end, a Java back end, a Scala back end, I think one more, but the team had defaulted to a thing of, “We don’t want another programming language because we just can’t maintain another service given that we are all mostly Java programmers.” From that side, it was easy enough to say, “Ok, we’re not adding another language into the mix until maybe we reduce the mix or until we get a little bit larger,” because the operational complexity just became too much.

For that, it was easy because I was able to say, “Ok, this is a pain point for most of the team, and it seems to be the general consensus, so all I’m doing is really making the default sentiment more official.” I’m not coming in and saying, “Ok, you have to use .NET because that’s my preference.” I’m looking at the pain threshold and trying to optimize for the entire system. The way that I’d be doing that is really by just talking to your team. I’d be looking at, “Is this a pain point only for this team, or is it a pain point for this team, and that team, and for every other team?” Then I’d be looking at where that common pain point is across those teams to think about, “Ok, there’s maybe misclarity about how we do pull requests across teams.” That means that where there’s misclarity, there’s a chance for improving that process, and so that’s where I’d be listening for. That’s my general process.

Participant 4: I’d like to ask questions from those of us who work in B2B setups where some of the traps you mentioned are these trade out requirements of our customers. Have you got any hints, tricks, tips how to sell the technical culture to our non-technical partners?

Kua: When you’re in a B2B, you’re further away from the end customer. Even in a B2B place, I think you can still get access to customers. Even in your portfolio, you probably have a trusted partner who can then maybe give you access to customers. The other thing that I see with a lot of platform companies or B2B companies is that they have an internal team that is producing the B2C side, and then they have access to the customer. You can either do it through proxy, through another partner, and get access to their customers, or you can create it internally with a team of your own. That would be my advice.
