Java News Roundup: JDK 21 Updates, Spring Data 2023.0, JobRunr 6.2, Micronaut 4.0 Milestones

Michael Redlich

Article originally posted on InfoQ.

This week’s Java roundup for May 8th, 2023 features news from OpenJDK, JDK 21, GraalVM Native Build Tools 0.9.22, Spring Framework, Spring Data and Spring Shell releases, Micronaut 4.0-M3, Quarkus 3.0.3, Eclipse Vert.x releases, Micrometer Metrics and Tracing releases, Groovy 4.0.12, Tomcat releases, Maven 3.9.2, Piranha 23.5.0, Reactor 2022.0.7, JobRunr 6.2, JDKMon releases and Devoxx UK.

OpenJDK

JEP 448, Vector API (Sixth Incubator), has been promoted from Proposed to Target to Targeted for JDK 21. This JEP, under the auspices of Project Panama, incorporates enhancements in response to feedback from the previous five rounds of incubation: JEP 438, Vector API (Fifth Incubator), delivered in JDK 20; JEP 426, Vector API (Fourth Incubator), delivered in JDK 19; JEP 417, Vector API (Third Incubator), delivered in JDK 18; JEP 414, Vector API (Second Incubator), delivered in JDK 17; and JEP 338, Vector API (Incubator), delivered as an incubator module in JDK 16. This feature proposes to enhance the Vector API to load and store vectors to and from a MemorySegment as defined by JEP 424, Foreign Function & Memory API (Preview).
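
For readers unfamiliar with the API, the following is a minimal sketch (not code from the JEP) of loading a vector from a MemorySegment, squaring its lanes, and storing the result back. It assumes the incubating module is added via --add-modules jdk.incubator.vector on a recent JDK early-access build.

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

import java.lang.foreign.MemorySegment;
import java.nio.ByteOrder;
import java.util.Arrays;

public class VectorSegmentDemo {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    public static void main(String[] args) {
        float[] data = new float[SPECIES.length()];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        // Wrap the array as a heap MemorySegment, load it into a vector,
        // square the lanes, and write the result back into the same segment.
        MemorySegment segment = MemorySegment.ofArray(data);
        FloatVector v = FloatVector.fromMemorySegment(SPECIES, segment, 0, ByteOrder.nativeOrder());
        v.mul(v).intoMemorySegment(segment, 0, ByteOrder.nativeOrder());
        System.out.println(Arrays.toString(data));
    }
}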

JEP 441, Pattern Matching for switch, has been promoted from Proposed to Target to Targeted for JDK 21. This JEP finalizes this feature and incorporates enhancements in response to feedback from the previous four rounds of preview: JEP 433, Pattern Matching for switch (Fourth Preview), delivered in JDK 20; JEP 427, Pattern Matching for switch (Third Preview), delivered in JDK 19; JEP 420, Pattern Matching for switch (Second Preview), delivered in JDK 18; and JEP 406, Pattern Matching for switch (Preview), delivered in JDK 17. This feature enhances the language with pattern matching for switch expressions and statements. InfoQ will follow up with a more detailed news story.
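
As a brief illustration of the finalized feature, consider a sealed hierarchy whose cases are handled exhaustively by a pattern switch, including a guarded pattern (a minimal sketch, not code from the JEP):

sealed interface Shape permits Circle, Square {}
record Circle(double radius) implements Shape {}
record Square(double side) implements Shape {}

class Areas {
    static double area(Shape shape) {
        return switch (shape) {
            // Type patterns as case labels; the compiler checks exhaustiveness.
            case Circle c -> Math.PI * c.radius() * c.radius();
            case Square s when s.side() > 0 -> s.side() * s.side();  // guarded pattern
            case Square s -> 0.0;
        };
    }
}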

JEP 440, Record Patterns, has been promoted from Proposed to Target to Targeted for JDK 21. This JEP also finalizes this feature and incorporates enhancements in response to feedback from the previous two rounds of preview: JEP 432, Record Patterns (Second Preview), delivered in JDK 20; and JEP 405, Record Patterns (Preview), delivered in JDK 19. This feature enhances the language with record patterns to deconstruct record values. Record patterns may be used in conjunction with type patterns to “enable a powerful, declarative, and composable form of data navigation and processing.” Type patterns were recently extended for use in switch case labels via JEP 420, Pattern Matching for switch (Second Preview), delivered in JDK 18, and JEP 406, Pattern Matching for switch (Preview), delivered in JDK 17. The most significant change from JEP 432 is the removal of support for record patterns appearing in the header of an enhanced for statement. InfoQ will follow up with a more detailed news story.
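
A minimal sketch (not code from the JEP) of a nested record pattern deconstructing record values directly in an instanceof test:

record Point(int x, int y) {}
record Line(Point start, Point end) {}

class Lengths {
    static double length(Object obj) {
        // The nested record pattern binds all four coordinates in one step.
        if (obj instanceof Line(Point(int x1, int y1), Point(int x2, int y2))) {
            return Math.hypot(x2 - x1, y2 - y1);
        }
        return 0.0;
    }
}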

JEP 439, Generational ZGC, has been promoted from Proposed to Target to Targeted for JDK 21. This JEP proposes to “improve application performance by extending the Z Garbage Collector (ZGC) to maintain separate generations for young and old objects. This will allow ZGC to collect young objects, which tend to die young, more frequently.” InfoQ will follow up with a more detailed news story.

JEP 449, Deprecate the Windows 32-bit x86 Port for Removal, has been promoted from Candidate to Proposed to Target for JDK 21. This feature JEP, introduced by George Adams, senior program manager at Microsoft, proposes to deprecate the Windows x86-32 port with the intent to remove it in a future release. With no intent to implement JEP 436, Virtual Threads (Second Preview), on 32-bit platforms, removing support for this port will enable OpenJDK developers to accelerate the development of new features. The review is expected to conclude on May 18, 2023.

JEP 443, Unnamed Patterns and Variables (Preview), has been promoted from Candidate to Proposed to Target for JDK 21. This preview JEP proposes to “enhance the language with unnamed patterns, which match a record component without stating the component’s name or type, and unnamed variables, which can be initialized but not used.” Both of these are denoted by the underscore character as in r instanceof _(int x, int y) and r instanceof _. The review is expected to conclude on May 15, 2023.
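
A sketch of how the proposed preview syntax might look, based on the examples in the JEP; it assumes JDK 21 with --enable-preview:

record Point(int x, int y) {}
record ColoredPoint(Point point, String color) {}

class UnnamedDemo {
    static int xOnly(Object obj) {
        // The y coordinate and the color component are matched but never named.
        if (obj instanceof ColoredPoint(Point(int x, _), _)) {
            return x;
        }
        return -1;
    }

    static int countLines(Iterable<String> lines) {
        int count = 0;
        for (String _ : lines) {  // unnamed variable: declared but intentionally unused
            count++;
        }
        return count;
    }
}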

JEP 453, Structured Concurrency (Preview), has been promoted from its JEP Draft 8306641 to Candidate status. Formerly an incubating API, this initial preview incorporates enhancements in response to feedback from the previous two rounds of incubation: JEP 428, Structured Concurrency (Incubator), delivered in JDK 19; and JEP 437, Structured Concurrency (Second Incubator), delivered in JDK 20. The only significant change is that the fork() method defined in the StructuredTaskScope class now returns an instance of TaskHandle rather than a Future, since the get() method in the TaskHandle interface was restructured to behave the same as the resultNow() method in the Future interface.

JEP 452, Key Encapsulation Mechanism API, has been promoted from its JEP Draft 8301034 to Candidate status. This feature JEP proposes to: satisfy implementations of standard Key Encapsulation Mechanism (KEM) algorithms; satisfy use cases of KEM by higher-level security protocols; and allow service providers to plug in Java or native implementations of KEM algorithms. This draft was recently updated to include a major change that eliminates the DerivedKeyParameterSpec class in favor of placing fields in the argument list of the encapsulate(int from, int to, String algorithm) method.
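
The proposed API centers on a javax.crypto.KEM class with Encapsulator and Decapsulator views. The following is a minimal sketch of how the API is intended to be used, assuming the DHKEM algorithm over an X25519 key pair:

import javax.crypto.KEM;
import javax.crypto.SecretKey;
import java.security.KeyPair;
import java.security.KeyPairGenerator;

public class KemDemo {
    public static void main(String[] args) throws Exception {
        // The receiver publishes an X25519 public key.
        KeyPair receiverKeys = KeyPairGenerator.getInstance("X25519").generateKeyPair();

        // Sender side: derive a shared secret and an encapsulation to transmit.
        KEM kem = KEM.getInstance("DHKEM");
        KEM.Encapsulator encapsulator = kem.newEncapsulator(receiverKeys.getPublic());
        KEM.Encapsulated encapsulated = encapsulator.encapsulate();
        SecretKey senderSecret = encapsulated.key();

        // Receiver side: recover the same secret from the encapsulation bytes.
        KEM.Decapsulator decapsulator = kem.newDecapsulator(receiverKeys.getPrivate());
        SecretKey receiverSecret = decapsulator.decapsulate(encapsulated.encapsulation());
        // Both sides now hold the same shared secret.
    }
}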

JEP 451, Prepare to Disallow the Dynamic Loading of Agents, has been promoted from its JEP Draft 8306275 to Candidate status. Originally known as Disallow the Dynamic Loading of Agents by Default, and following the approach of JEP Draft 8305968, Integrity and Strong Encapsulation, this JEP has evolved from its original intent, to disallow the dynamic loading of agents into a running JVM by default, to instead issuing warnings when agents are dynamically loaded into a running JVM. Goals of this JEP include reassessing the balance between serviceability and integrity and ensuring that the majority of tools, which do not need to dynamically load agents, are unaffected.

The joint draft specification for JEP 440, Record Patterns, and JEP 441, Pattern Matching for switch, has been updated by Gavin Bierman, consulting member of technical staff at Oracle, for review by the Java community. Significant changes include: an update of the specification of type inference for record patterns; and removal of the non-denotable “any” patterns and the process of resolving patterns in favor of a compile-time notion of a type pattern being “null-matching” or not.

John Rose, JVM architect at Oracle, has published a whitepaper that outlines his concerns on how Project Lilliput, with a goal to reduce the object header to 64 bits, could affect development in Project Valhalla.

JDK 21

Build 22 of the JDK 21 early-access builds was also made available this past week featuring updates from Build 21 that include fixes to various issues. Further details on this build may be found in the release notes.

For JDK 21, developers are encouraged to report bugs via the Java Bug Database.

GraalVM Native Build Tools

On the road to version 1.0, Oracle Labs has released version 0.9.22 of Native Build Tools, a GraalVM project consisting of plugins for interoperability with GraalVM Native Image. This latest release provides notable changes such as: a fix for the URL lookup of the GraalVM Reachability Metadata Repository; support for the default-for attribute; and a dependency upgrade to Metadata 0.3.0. More details on this release may be found in the changelog.

Fabio Niephaus, principal researcher on the GraalVM team at Oracle Labs, has announced improvements to GraalVM memory usage in native image builds. In particular, the native image builder now only uses memory that is actually available, uses less memory overall, and builds large applications faster after the memory limit was raised to 32GB.

Spring Framework

The release of Spring Framework 6.0.9 delivers bug fixes, improvements in documentation, dependency upgrades and new features such as: consistent support for the MultiValueMap interface and common Map implementations in the CollectionFactory class; internal constants for implicit bounds in the TypeUtils class; and a new matchesProfiles() method in the Environment interface for profile expressions. More details on this release may be found in the release notes.
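
A small sketch of how the new method might be used, based on the description in the release notes; the profile expressions shown here are hypothetical, and the method is assumed to accept one or more expressions:

import org.springframework.core.env.Environment;

class ProfileCheck {
    void configure(Environment environment) {
        // True if the active profiles satisfy at least one of the given expressions.
        if (environment.matchesProfiles("production & cloud", "staging")) {
            // enable cloud-specific configuration here
        }
    }
}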

Spring Data 2023.0.0, codenamed Ullman, has been released featuring: new keyset-based scrolling for Spring Data MongoDB, Spring Data Neo4j and Spring Data JPA; improved support for AOT processing with Querydsl and Kotlin; and upgrades to the Spring Data sub-projects. More details on this release may be found in the release notes.
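
A sketch of what keyset-based scrolling looks like in a Spring Data JPA repository, using the Window and ScrollPosition types introduced with this release; the entity and method names are hypothetical and the API shapes follow the Spring Data documentation:

import org.springframework.data.domain.ScrollPosition;
import org.springframework.data.domain.Window;
import org.springframework.data.repository.Repository;

// A mapped entity is assumed; JPA annotations (@Entity, @Id) omitted for brevity.
class Person {
    Long id;
    String firstname;
    String lastname;
}

interface PersonRepository extends Repository<Person, Long> {
    Window<Person> findFirst10ByLastnameOrderByFirstnameAsc(String lastname, ScrollPosition position);
}

class Scrolling {
    void scrollAll(PersonRepository repository) {
        // Start at the beginning, then resume from the keyset of the last row of each window.
        Window<Person> window = repository.findFirst10ByLastnameOrderByFirstnameAsc("Doe", ScrollPosition.keyset());
        while (!window.isEmpty() && window.hasNext()) {
            window = repository.findFirst10ByLastnameOrderByFirstnameAsc("Doe", window.positionAt(window.size() - 1));
        }
    }
}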

Versions 2022.0.6 and 2021.2.12, both service releases of Spring Data, ship with bug fixes and dependency upgrades to sub-projects such as: Spring Data Commons 3.0.6 and 2.7.12; Spring Data MongoDB 4.0.6 and 3.4.12; Spring Data Elasticsearch 5.0.6 and 4.4.12; and Spring Data Neo4j 7.0.6 and 6.3.12.

Versions 3.1.0-RC1, 3.0.3 and 2.1.9 of Spring Shell have been released featuring: a migration of documentation to Asciidoctor Spring Backends; a dependency upgrade to JLine 3.23.0; and a backport of bug fixes. These versions build upon Spring Boot versions 3.1.0-RC2, 3.0.6 and 2.7.11, respectively. More details on these releases may be found in the release notes for version 3.1.0-RC1, version 3.0.3 and version 2.1.9.

Micronaut

The Micronaut Foundation has provided the second and third milestone releases of Micronaut Framework 4.0.0 featuring bug fixes, improvements and new features such as: new interfaces, MessageBodyWriter and MessageBodyReader, that can be used on both the client and the server as a single place to add custom writing and reading responses; support for annotation-based CORS configuration; additional configuration to endpoints for the service-http-client.enabled property with a default set to false; and support for compilation-time checked expressions in Micronaut annotations. More details on these releases may be found in the release notes for version 4.0.0-M2 and version 4.0.0-M3.

Quarkus

Quarkus 3.0.3.Final, the third maintenance release, delivers notable changes such as: a fix for an exception thrown due to a null parameter in the SimpleResourceInfo interface in a response filter; improved container runtime detection; a workaround for unnecessary information logs in Hibernate ORM; and a resolution for unexpected behavior in reactive native mode with Quarkus 3.0.1. More details on this release may be found in the changelog.

WildFly

The WildFly team has published the 2023-2024 release plan that includes beta and final releases of WildFly 29 through WildFly 34. Releases were temporarily moved to a feature-boxed approach during the transition to Jakarta EE 10. This new release schedule is a return to the previously-used time-boxed approach.

Eclipse Vert.x

Eclipse Vert.x 4.4.2 has been released with dependency upgrades and notable changes such as: a new messageHandler() method in the GraphQLWSHandler class to intercept messages; a resolution for erratic behavior using concurrent access in the SchemaRepository interface; and an improved toObservable() method in the SQLRowStream class that eliminates a potential NullPointerException. More details on this release may be found in the release notes and deprecations and breaking changes.

Similarly, Eclipse Vert.x 3.9.16 has been released, delivering notable fixes: metrics blocking the event loop after the update to Vert.x 3.9.14; and the STOMP server processing client frames from clients that had not first sent a CONNECT frame. The 3.9 release train reached end-of-life in 2022, but service releases will be maintained until the end of 2023. More details on this release may be found in the release notes.

Micrometer

Versions 1.11.0, 1.10.7 and 1.9.11 of Micrometer Metrics have been released with new features such as: a reduction of overall memory allocation while exporting metrics using the DynatraceExporterV2 class; compiler parameter metadata in the CountedAspect class is no longer required; and the addition of metrics for the total number of running application threads in the JVM. More details on these releases may be found in the release notes for version 1.11.0, version 1.10.7 and version 1.9.11.

Similarly, versions 1.1.1, 1.1.0, 1.0.6 and 1.0.5 of Micrometer Tracing have been released, providing notable changes that include: a new constructor in the ObservationAwareSpanThreadLocalAccessor class that accepts an instance of the ObservationRegistry interface; aligned annotations that match changes in Micrometer Metrics; no-op implementations for the Propagator, Propagator.Getter and Propagator.Setter interfaces; and an improved getBaggage() method in the BaggageManager interface that consistently returns an instance of the Baggage interface if baggage doesn’t exist. More details on these releases may be found in the release notes for version 1.1.1, version 1.1.0, version 1.0.6 and version 1.0.5.

Apache Software Foundation

The release of Apache Groovy 4.0.12 features bug fixes, dependency upgrades and improvements such as: a more detailed error message when applying an instance of the ClassNode class using generics in Abstract Syntax Tree (AST) Transformations; support for virtual threads in the Groovy-Integrated Query (GINQ); and bytecode optimizations for generated methods using Java records. More details on this release may be found in the release notes.

Versions 11.0.0-M6 and 9.0.75 of Apache Tomcat ship with notable changes such as: improvements to the JsonAccessLogValve class to support more patterns for headers and attributes; and improvements to the AccessLogValve class to change the output of the vertical tab character from \v to \u000b and to escape the timestamp output if an instance of the SimpleDateFormat class containing verbatim characters is used. The HTTP connector settings, rejectIllegalHeader and allowHostHeaderMismatch, were deprecated in version 9.0.75, removed in version 11.0.0-M6, and are now hard-coded to the previous default values. More details on these releases may be found in the release notes for version 11.0.0-M6 and version 9.0.75.

Apache Maven 3.9.2 has been released with improvements such as: issuing a warning if a plugin depends on Maven Compat; printing suppressed exceptions when a Maven Mojo fails; an improvement and extension of plugin validation; and displaying additional information when using the -Dmaven.repo.local.recordReverseTree=true command-line argument. More details on this release may be found in the release notes.

Piranha

The release of Piranha 23.5.0 provides notable changes such as: an update of external components; ensure JDK 18+ modules are released when executing the release with JDK 20; and change --ssl-keystore-file to --https-keystore-file. Also, the MimeTypeManager and LoggingManager interfaces, TEMPDIR extension and Piranha Naming modules were all deprecated. More details on this release may be found in their documentation and issue tracker.

Project Reactor

Project Reactor 2022.0.7, the seventh maintenance release, provides dependency upgrades to reactor-core 3.5.6, reactor-netty 1.1.7 and reactor-kafka 1.3.18. There was also a realignment to version 2022.0.7 with the reactor-pool 1.0.0, reactor-addons 3.5.1 and reactor-kotlin-extensions 1.2.2 artifacts that remain unchanged. More details on this release may be found in the changelog.

JobRunr

The release of JobRunr and JobRunr Pro 6.2.0 delivers: an important bugfix for JobRunr on the Windows platform; improved performance; and dependency upgrades to support Spring Boot 3.0 and Quarkus 3.0. More details on this release may be found in the release notes.

JDKMon

Versions 17.0.57, 17.0.55 and 17.0.53 of JDKMon, a tool that monitors and updates installed JDKs, have been made available this past week. Created by Gerrit Grunwald, principal engineer at Azul, these new versions provide changes such as: CVE detection now supports CVSS 2 and CVSS 3; the download dialog for builds of OpenJDK now supports standard C library (libc) selection, e.g., musl libc; a fix for the Linux script to build the application installer; and the addition of a Linux RPM build for the AArch64 architecture.

Devoxx United Kingdom

Devoxx United Kingdom was held at the Business Design Centre in London, England this past week featuring speakers from the Java community who delivered talks on topics such as: Java, Cloud, Data, AI, Machine Learning, Robotics, Programming Languages, Security, Architecture, Developer Practices and Culture.



AWS Verified Access Now GA with Support for WAF and Signed Identity Context

Renato Losio

Article originally posted on InfoQ.

AWS recently announced the general availability of Verified Access, a managed service that provides secure access to corporate applications without relying on a VPN. With the GA, the cloud provider introduced support for AWS WAF and the ability to pass signed identity context to end applications.

Released as a preview during the re:Invent conference, the new service can be used to support a work-from-anywhere model, evaluating each access request in real time based on the user’s identity and device, using fine-grained policies.

Reducing the risks associated with remote connectivity, Verified Access can help secure distributed users, manage corporate application access, and centralize access logs: the new service evaluates access requests and logs request data, supporting the analysis of security and connectivity incidents.

Built on Zero Trust principles, Verified Access has centralized policy enforcement to grant access to the application behind the service, with support for Cedar policies to permit or forbid access to specific applications. According to the cloud provider, corporate applications with Site-to-Site VPN and internet-facing corporate applications are the two most common enterprise architectures that can benefit from moving to the new managed option.

Verified Access now supports integration with AWS WAF to protect web applications from application-layer threats and can pass a signed identity context to an application endpoint. Riggs Goodman III, senior global tech lead at AWS, and Shovan Das, principal product manager, explain the benefits:

Previously, users would request access to the application behind Verified Access with both identity and device claims, but the claims were not available to the end applications. Verified Access now passes signed identity context, including things like email, username, and other attributes from the identity provider to the applications. This enables you to personalize your application using this context, eliminating the need to re-authenticate the user for personalization.

Customers are charged for the amount of data processed and pay an hourly fee for each application on Verified Access, starting at $0.02 per GB and $0.27/hr. The pricing model has received criticism from the community, with some users suggesting that Cloudflare VPN Replacement is often a cheaper solution.
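
As a rough, illustrative back-of-the-envelope using the listed rates (not an official AWS estimate): a single application attached for a 730-hour month would cost about 730 × $0.27 ≈ $197 in hourly charges, and 100 GB of processed data would add about 100 × $0.02 = $2, for a total near $199 before any other charges.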

Corey Quinn, chief cloud economist at The Duckbill Group, comments:

The last time a VPN-less service was put out by AWS was Amazon WorkLink, which has oh-so-very-quietly been deprecated in favor of ‘WorkSpaces Web,’ whatever that might be. Hopefully this one fares better.

Verified Access integrates with multiple third-party identity and device management services, including Beyond Identity, CrowdStrike, CyberArk, Cisco Duo, Jamf, JumpCloud, Okta, and Ping Identity. The service is currently available in ten AWS regions, including Northern Virginia, Frankfurt, and Dublin.



GitHub Overhauls Code Search Using New Search Engine

Sergio De Simone

Article originally posted on InfoQ.

GitHub has introduced its new code search feature, including a redesigned search interface, a new code view, and a search engine rebuilt from scratch to be faster, more capable, and to better understand code, says GitHub software engineer Colin Merkel.

Our goal with the new code search and code view is to enable developers to quickly search, navigate and understand their code, put critical information into context, and ultimately make them more productive.

According to Merkel, the new search engine is twice as fast as the previous one. It also provides more flexibility, supporting substring queries, regular expressions, and symbol search. For example, you could search for a string across all repos belonging to your organization without having to clone them beforehand:

org:my_org "string to look for"

You can also restrict your query to files written in a specific language or repo, exclude specific paths, or use many additional possibilities supported by GitHub search query syntax.
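
For instance, queries along these lines (the org, repo, paths, and search terms are placeholders) combine the language, path, regular-expression, and symbol capabilities of the new syntax:

org:my_org language:java "connection pool"
repo:my_org/my_repo path:src/ /TODO|FIXME/
org:my_org symbol:parseConfig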

The new code view integrates search with a file browser and supports code navigation and browsing, allowing developers to jump to symbol definitions in over 10 languages.

GitHub engineer Timothy Clem provided a detailed overview of how the new search engine works behind the scenes to achieve its goals in terms of flexibility, performance, and scalability.

At the heart of GitHub's search engine lies a powerful indexer, which is a prerequisite to being able to run queries fast. The search index is specialized for code, for example by being able to distinguish between programming languages, not ignoring punctuation, not stripping stop words, and so on. The index must also include ngrams, i.e. sequences of characters of a given length, to support substring queries.
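
To make the ngram idea concrete, here is a toy sketch (not GitHub's implementation) of a trigram index: every three-character sequence of a document maps to the set of documents containing it, a query intersects the posting lists of its own trigrams, and the surviving candidates are verified against the original text.

import java.util.*;

class TrigramIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    void add(String content) {
        int id = docs.size();
        docs.add(content);
        // Map every trigram of the document to the document id.
        for (int i = 0; i + 3 <= content.length(); i++) {
            postings.computeIfAbsent(content.substring(i, i + 3), k -> new HashSet<>()).add(id);
        }
    }

    // Intersect the posting lists for the query's trigrams, then confirm each candidate.
    List<Integer> search(String query) {
        Set<Integer> candidates = null;
        for (int i = 0; i + 3 <= query.length(); i++) {
            Set<Integer> ids = postings.getOrDefault(query.substring(i, i + 3), Set.of());
            if (candidates == null) {
                candidates = new HashSet<>(ids);
            } else {
                candidates.retainAll(ids);
            }
        }
        List<Integer> results = new ArrayList<>();
        if (candidates != null) {
            for (int id : candidates) {
                if (docs.get(id).contains(query)) {
                    results.add(id);
                }
            }
        }
        return results;
    }
}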

GitHub builds its search index by analyzing 45 million repositories, amounting to 115TB of content across 15.5 billion documents, which is a daunting task. Luckily, explains Clem, there are two factors that make it possible to reduce the amount of work to do: using Git blob object IDs to distribute unique documents evenly across shards, and the fact that GitHub hosts a lot of duplicate content.

When a new query is received, it is parsed into an abstract syntax tree and transformed into n concurrent requests sent to distinct shards in the search cluster. The shards carry out low-level processing such as translating regexes into substring queries on the ngram indices. Finally, the shards return their results to the query service, which aggregates them and selects the top 100.

Our p99 response times from individual shards are on the order of 100 ms, but total response times are a bit longer due to aggregating responses, checking permissions, and things like syntax highlighting. A query ties up a single CPU core on the index server for that 100 ms, so our 64 core hosts have an upper bound of something like 640 queries per second.

Thanks to this approach, GitHub can re-index the entire repository corpus in about 18 hours. The overall index size is 25TB, which is roughly a quarter of the original data size.

The new code search is available for free to all GitHub users.



Open Source MongoDB Alternative FerretDB Now Generally Available

Renato Losio

Article originally posted on InfoQ.

FerretDB, an open-source MongoDB alternative database, recently announced its general availability. Released under the Apache 2.0 license, the project allows developers to use existing PostgreSQL infrastructure to run MongoDB workloads.

FerretDB works as a proxy that translates MongoDB wire protocol queries to SQL, with PostgreSQL as the database backend. Started as an open-source alternative to MongoDB, FerretDB provides the same MongoDB APIs without developers needing to learn a new language or command. Peter Farkas, co-founder and CEO of FerretDB, explains:

We are creating a new standard for document databases with MongoDB compatibility. FerretDB is a drop-in replacement for MongoDB, but it also aims to set a new standard that not only brings easy-to-use document databases back to its open-source roots but also enables different database engines to run document database workloads using a standardized interface.

While FerretDB is built on PostgreSQL, the database is designed with a pluggable architecture to support other backends, with projects for Tigris, SAP HANA, and SQLite currently in the works. Written in Go, the project was originally started because the Server Side Public License (SSPL) that MongoDB adopted in 2018 does not meet all criteria for open-source software set by the Open Source Initiative. The team behind FerretDB writes:

Initially built as open-source software, MongoDB was a game-changer for many developers, enabling them to build fast and robust applications. Its ease of use and extensive documentation made it a top choice for many developers looking for an open-source database. However, all this changed when they switched to an SSPL license, moving away from their open-source roots.

According to Farkas, popular database management tools such as mongosh, MongoDB Compass, NoSQL Booster, and Mingo are already compatible with the current feature set of FerretDB. The Free Software Foundation comments in their monthly news digest:

For those who have been rightfully concerned over MongoDB’s change of license conditions in 2018, this is welcome news indeed.

In the 1.0 release, FerretDB added support for the createIndexes command, while version 1.1.0 includes the addition of renameCollection, support for projection field assignments, and the $project pipeline aggregation stage, as well as create and drop commands in the SAP HANA handler.
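
Since FerretDB speaks the MongoDB wire protocol, existing drivers should connect to it unchanged. The following sketch uses the MongoDB Java driver against a hypothetical local FerretDB instance; the connection string, credentials, and database name are assumptions for illustration:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class FerretDbDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; adjust host, credentials, and database for your setup.
        try (MongoClient client = MongoClients.create("mongodb://user:password@127.0.0.1:27017/ferretdb?authMechanism=PLAIN")) {
            MongoCollection<Document> books = client.getDatabase("ferretdb").getCollection("books");
            books.insertOne(new Document("title", "Designing Data-Intensive Applications").append("year", 2017));
            // Translated by FerretDB into the createIndexes command, supported since the 1.0 release.
            books.createIndex(Indexes.ascending("title"));
            System.out.println(books.find(new Document("year", 2017)).first());
        }
    }
}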

The general availability of a drop-in replacement for MongoDB received mixed feedback on Reddit, with some users supporting the abstraction layer and others not considering FerretDB a “real” database. Franck Pachot, developer advocate at YugabyteDB and AWS Data Hero, recently wrote an article on how to enable a MongoDB-compatible API on YugabyteDB:

Want a MongoDB-compatible API on top of your Distributed SQL database? Easy to connect FerretDB to YugabyteDB and all is open source.

FerretDB is not the only alternative to MongoDB: other schemaless document-based databases supporting MongoDB APIs are Amazon DocumentDB, Azure CosmosDB, and MariaDB MaxScale. In a recent webinar, David Murphy, principal database reliability engineer at Udemy, compares CosmosDB, DocumentDB, MongoDB, and FerretDB as document databases.

The FerretDB project and the roadmap are available on GitHub, with Docker images and RPM and DEB packages.



Google Previews Studio Bot, a Coding Bot for Android Development

Sergio De Simone

Article originally posted on InfoQ.

At Google I/O 2023, Google previewed Studio Bot, an AI-powered coding bot integrated into the latest version of Android Studio, codenamed Hedgehog. Studio Bot aims to help developers generate code, write unit tests, and fix errors.

Google Studio Bot leverages Codey, Google’s text-to-code foundational model, also introduced at Google I/O. Codey is based on the latest iteration of PaLM 2, the language model that powers Google’s experimental chatbot, Bard, as well as other services, and that Google is positioning as a direct competitor to OpenAI’s GPT-4.

Studio Bot is designed for native Android development, says Google, having been trained on a curated set of data representing best practices for Android development. While Studio Bot is integrated in Android Studio, Codey has a more general use model where you can use it via API, or embed it in an SDK or application.

Based on what Google told developers, Studio Bot will not require them to share their source code with Google, but they are encouraged to agree to send data about their conversations so the company can better understand how effective the tool is. Additionally, Studio Bot requires users to sign in to their Google accounts in order to be able to use it.

Using Studio Bot, developers will be able to ask questions like how to get the current geolocation, how to add camera support to an existing app, what dark theme is, and so on. The tool supports both Java and Kotlin and is able to translate code back and forth. Additionally, it is aware of recent Android technologies like Compose.

Studio Bot is still in its early days, Google warns, and it should be considered an experimental technology. In particular, Google says, it might provide inaccurate or outright false answers while presenting them with total confidence. This also implies that any generated code might not produce the expected result and could be low-quality or incomplete. Hence, Google says, generated code should be validated and possibly adapted before being used in a production context.

Studio Bot enters an arena where developers can count on a number of different tools for AI-based code generation, including GitHub Copilot, OpenAI Codex, and Amazon CodeWhisperer.

As a final remark, it is worth noting that Studio Bot is only available in the US at the moment.



Presentation: Innovating for the Future You’ve Never Seen: Distributed Systems Architecture & the Grid

Astrid Atkinson

Article originally posted on InfoQ.

Transcript

Atkinson: I’m Astrid Atkinson. I’m going to be talking about applying tech to what is, in my mind, the greatest problem of our generation and of the world as it stands today. Some of you may have heard of climate change. It’s been becoming more popular in the news. I think within the last 5 or 10 years, we’ve really seen a transition from a general conception of climate change from being an issue that might potentially affect our grandchildren maybe 50 or 100 years from now, to one that is immediately affecting our grandparents, happening today. Bringing about increasingly dramatic impacts to everyday life. This is obviously a really big problem. I talk a lot with folks who are looking to figure out how to direct their careers and their work towards trying to address it. I do also talk a lot with people who have pretty much already given up. I think it’s easy when you look at a problem of this magnitude, and one on which we’re clearly not quite yet on the right track to take a look, shake your head and say, maybe it’s already too late. Maybe the change is already locked in. In my mind, there’s basically two futures ahead of us. In one of them, we get it right. We do the work. We figure out what’s required. It’s technology. It’s policy. It’s politics. It’s people. We put in the technology, and the policy, and the political, and the economic investment that’s required to effectively bend the curve on climate change and keep our warming within about 2 degrees, which is a generally agreed on limit for a livable planet.

Our future is perhaps not the same as our past, but not substantially worse, it’s a recognizable world. I get to hang out with my grandkids, maybe we even get to go skiing. Most of the world gets to continue with something that looks like life today. In the other version of the future, we don’t do that. We cease to take action. We fail to do any better than we’re currently doing. We say that the policy change that’s in place will be good enough. We go on with life as usual, and we get the 3 to 4 degrees or more of change that’s locked in for plans as they stand today. That is potentially catastrophic. We don’t necessarily get to keep the civilization that we have in this model. Maybe some of us do, but definitely not all of us, and it doesn’t look great. When I think about this, personally, I would rather spend my entire life and my whole career working for the first version of the future. Because if we don’t choose to do that, the second one is inevitable. That’s where I stand. That is why I’m in front of you here talking about using technology applications and using technology talent to attempt to address this existential problem.

How to Decarbonize Our Energy System

When we think about decarbonization, about 60%, about two-thirds of the problem is our energy system. The other 30% is things like land use, industrial use, those sorts of things. Those are important too. Energy is a really good place to focus, because it’s a really large problem. The really oversimplified 2-step version of how we decarbonize our energy system is step 1, we electrify everything, so that we can use energy sources that are clean and efficient. Step 2, we decarbonize the grid, which is our delivery system for electricity. Now we have two problems. Talking about the change that’s going to be required just really quickly. This is an older view of energy flowing through just the U.S.’s energy systems. You can see that we have energy coming in from a number of clean sources. Generally speaking, everything that’s currently a fossil source needs to move into that orange box up there. That’s a pretty large change.

In any version of how this goes, the grid needs to do a lot more work. We need to be distributing a lot more energy around. There are plenty of versions of this too where there’s a lot more disconnected generation and so forth. Just to put a frame around this problem. I’m using two sets of numbers on this slide. One is from the International Energy Agency’s Net Zero by 2050 plan, which is effectively international plan of record. One thing to note in that 2050 plan is that buried in the details is the surprising note that industrialized nations, U.S., UK, Australia, Western Europe, supposed to be fully decarbonized by 2035. I’m not sure it’s quite on anybody’s radar to the extent that it needs to be today. The other set of numbers here is from an NREL study on what it would take to decarbonize our electrical system, and therefore energy system by 2035, just looking at the U.S. Of course, this is a national problem, but these numbers are nicely contained and related in that way.

They have multiple scenarios in this view of what the future could look like, ranging from high nuclear to brand new technologies. Any version of this solution set means that we need a lot more renewables. We potentially need a couple of new or expanded baseload technologies. We need a lot of load to be flexible. We need a lot of interconnection between the places where generation can happen and the places where energy is produced. It’s this last one that’s actually really tricky, because any version of this plan relies on somewhere between a 2x to 5x increase of our existing transmission network capacity. I’ll talk a little bit about the transmission network and the distribution network in a little bit. You could effectively think of the transmission network as like the internet backbone of the grid. It’s the large long-distance wires that connect across very long distances. In general, it’s primarily responsible for carrying energy from the places where it’s generated from large generators to local regions. I’ll talk a little bit about different parts of the network in a bit. Any version of this requires very large expansion in that capacity. That’s a really big problem, because it typically takes 5, 10, 15, 30 years to build one new line. That’s a pretty big hidden blocker in any such transition plan. It makes the role of the network as it stands today, increasingly important.

What Role Can Software Play?

Now, also, any version of that plan involves bringing together a lot of different types of technologies. There’s multiple types of generation technologies. There’s solar. There’s wind. There’s hydro, nuclear, geothermal, biomass. There’s also a lot of demand side technologies, and any brief encounter with grid technologies will show you a whole ton of these. In general, all of these demand side technologies, everything from like controllable in-house heat pumps, to electric vehicle chargers, to batteries, fall under the general heading of distributed energy resources. All that really means is just energy resources that are in your home or business. They’re at the edges of the grid. They’re located very close to a part of the demand fabric of the grid. There’s something that is growing in importance in the grid of today and will be a cornerstone of the grid of the future. Because the grid needs to do a slightly different job than it does today. The grid of today is basically responsible for delivering energy from the faraway places where it’s generated to the places where it’s used, must always be completely balanced in real-time for supply and demand. The grid of the future, those things are all true, but we also need to use it to balance and move energy around over time. We need to be able to get energy from the time and place where it’s produced to the time and place where it’s needed. That means moving supply and demand around. That means storage. It also means management. I’m going to talk about what the technology applications around that look like.

The big catch to all of this is that technology in the grid landscape is pretty outdated. There’s a few reasons for this. One is that when you are responsible for keeping things up and running, keeping the lights on, as many folks will have experience with, there’s an innate conservatism to that. You don’t want to mess with it. Another is just that this is a slow-moving legacy vertical with a lot of money and a lot of pressure on it. For various reasons, a lot of them being like conservatism around things like cybersecurity, almost all grid technology today is on-prem. That means that they don’t have access to the compute scale that you would need to handle real-time data from the grid. They think that data coming off the grid from something like a 15-minutely read on a smart meter is a lot. Anyone knows that that’s really not the case. It’s perhaps medium data, at best. It’s really not a great amount of data by distributed systems standards. If you’re running on-prem with just one computer, gets to be a lot.

The current state of the art in the grid space is basically fairly data sparse, not easily real-time for the majority of data sources. Really dependent on the idea of feeding in data from a small set of sources, to a physics model which emulates the grid, and then solves to tell you what’s happening at any particular point of it. This is nice, but it’s not at all the same thing as real-time monitoring. It is not the foundation for driving massive change, because the only thing you can know in that is the model. You cannot know the system. As anyone who’s worked with a distributed system knows, the data about the real-time system is the foundation of everything else that we do. That is how you know if things are working. It’s how you know how things are working. It’s how you plan for the future in an adaptive way.

Camus Energy – Zero-Carbon Grid Orchestration

That was really the genesis for me of founding my current company, which builds grid management software for people who operate grids, and that’s typically utilities. Suffice to say that, I think that there’s a really important and in fact urgent application of distributed systems technologies in this space, because we need these systems to be cloud native and we need that yesterday. We don’t need it because the cloud is someone else’s computer, we need it because the cloud is global scale or hyperscale computing. We need to be able to process large amounts of real-time data. We need to be able to solve complex real-time balancing optimization, machine learning AI type problems. We need to be able to do that in a very large scale and a very rapid way. That is the reason. That is the technology change that needs to happen in the grid space.

Distribution System vs. Distributed System

I’m going to talk a little bit about some terminology confusion here. I think of myself as a distributed systems engineer, or my work is distributed systems engineering. This is super confusing in the grid space, because people who work on the grid are used to thinking about the distribution system, which is the part of the grid that connects to your house, or your business, or whatever. Utilities like PG&E, ComEd, ConEd, SDG&E, are distribution utilities primarily. Whenever you say the word distribute or distributed, all they’re thinking about is a whole bunch of wires connecting to your house. Because of that, we’ve started to refer to what we do in the software space as hyperscale computing or cloud computing, just because it’s confusing. At its base, the technology change, and the transition that we need to make for the grid is really going from a centralized model of how the grid is operated and engineered, to a distributed model in which a lot of small resources play critical parts. Really, what we’re talking about here is making a distributed system for the distribution system. Again, confusing, but here we are. In fact, many of the terminology, many of the concepts and so forth, are surprisingly common. That’s really what I’m going to talk about, is the ways that we can apply lessons that we’ve learned from distributed systems design, architecture, engineering, and bring that into the grid space to accelerate the pace of change.

Background

I was at Google from about 2004. That was a really important and critical time for both Google but also for the industry. I was originally hired to work on Google’s Cloud Platform about two years before cloud platform was a thing. According to Wikipedia, the term the cloud was first used by Eric Schmidt in a public conference in 2006. In 2004, Google was rolling out its internal cloud and making a really significant transition from that centralized computing model with a few big resources to a highly distributed model of spreading work across millions of computers. This was something that was very new in the industry. We couldn’t hire anyone who had done it before. We were learning as we went. That was a really exciting time. This meme always makes me smile, because it’s from my time at Google. Also, because I put the flamingos on the dinosaur in about 2007 with other folks who are also working on this very large-scale distributed systems transition. It was actually intended to be a distributed systems joke. It was actually, like all the little systems coming to eat the dinosaur system. It was an April Fool’s Day joke. It was one that ended up having pretty significant legs. Years later, I would see someone come bringing flamingos to the dinosaur and just like popping on another flamingo. I love that because I think it also is like a nice metaphor for how systems become self-sustaining and evolve over time, once you see these really big paradigm shifts.

Reliability

The story of how you get from a few big servers to a bunch of little servers is a pretty well told one. Suffice to say that in any model, we’re going from a small number of reliable servers to a large number of distributed servers. This fabric of the system that distributes work across those machines, takes care of failures, lets you move work in a way that is aware of how that’s working within the system as a whole and lets you move work around under simple operator control. Is the fabric of this system that lets you get better reliability out of the system than you can get out of any individual piece of it. In the Google distributed systems context, and this is generally true for most cloud computing, there’s this set of backbone capabilities. I think of monitoring as being the foundation of all of them, because it is always the foundation of how to build a reliable system. There’s also ideas around orchestration, like getting work to machines, or getting work to places where work can be done in terms of load assignment, lifecycle management, fleet management. That’d be a container orchestration type system today. For Google it was Borg, but a Kubernetes or equivalent technology, I think, fits into that bucket. Load balancing, the ability to route work to be done to the capacity locations where work can be executed, is a fundamental glue technology of all distributed systems. As you get through larger systems, finding ways to introduce flexibility to that distribution of work really increases the reliability of the system overall.

As we were going through this process for Google, maintaining Google’s reliability at utility grade, five nines or better, was a core design requirement of all of the changes that we made. Thinking about how you carefully swap out one piece at a time in a system which has tens of thousands of microservice types, not just microservices or instances, and millions to billions of instances, is a high demand process, and one in which you need good foundations, and you need really good visibility. This isn’t just about the technology. It’s also about the tools which keep the system as a whole comprehensible and simple for operators, and allow a small group of operators to engage with it in a meaningful way.

Astrid’s Theory of Generalized Infra Development

Before we go on to talk about the grid, I just want to talk about patterns for how you build infrastructure in a generalizable way in this type of environment. This matters for the grid, because it’s actually very difficult to build systems without a utility. It’s also very difficult to build systems with a utility. In general, anytime you’re building a large-scale infrastructure system, you’re going to be working with real customers. You want to start with more than one, less than five. Very early in the lifecycle of an infrastructure development project, you want to go as big as you possibly intend to serve. Because if you don’t do it very early, you will never reach that scale. That was the successful repeatable pattern that I saw as we went through development of dozens to hundreds of infrastructure services at Google. It is also the one that we’re using in the grid landscape today.

What’s Hard About the Grid Today?

Let’s talk about the grid. Monitoring is a fairly fundamental part, I think, of any reliable and evolving distributed system. That’s actually one of the things that is really challenging in today’s grid landscape. I’ll just run through what that looks like in practice today. Here’s an example distribution grid. We’ve got basically a radial network that goes out to the edges. There’s a few different kinds of grid topologies, but this is a common one and the most cliched and complicated one. As we look at the parts of the grid, there’s the transmission network, which is the internet backbone of the grid today. A transmission network is actually fairly simple, it’s a mesh network, it does not have many nodes. It has thousands, not hundreds of thousands. It’s really well instrumented. It has real-time visibility. It is real-time operated today. Today’s independent system operators like CAISO, or MISO, or whoever, have pretty good visibility into what’s happening with very large-scale substations in a very large-scale network. Out here on the edges, there are some resources that participate in that ecosystem and those markets today, but they tend to be just a few. They’re really big, typically, commercial and industrial customers. They are also required to provide full scale real-time visibility to the transmission network in order to participate.

As we move forward, there’s a goal to have these distributed energy resources that are located at customer locations, become part of this infrastructure. No telemetry requirements yet, so it’s hard to see exactly what might be happening. It’s hard to say from anybody’s perspective what’s happening when they do this. This is the next step from the transmission network’s perspective. Recent Federal Energy Regulatory Commission Order 2222 mandates that this should be required, just doesn’t say how. Doesn’t say anything about telemetry, network management, integration. It’s a really good step, because forcing the outcome helps to force everybody to think about the mechanism.

On the distribution side, the data story is less good. For most distribution operators today, although they do have smart meters mostly, they typically can see what’s happening in any particular meter, 2 to 24 hours ago. This is due to slowness of data collection. They don’t typically have direct instrumentation on any part of the line below the substation. They also don’t necessarily have accurate models of connectivity for the meters to transformers, feeders, phases. Running that model I mentioned, is rather difficult if you don’t have either the data or the model. This is a really big blocker for adding any more stuff to the edges of the grid. It makes it very difficult to feel safe about any change you might possibly make, because you can’t see what is going on.

There’s lots of questions. Everything ok? What’s happening out there? Up until the last probably 5 or 10 years, planning and operations on the distribution network was literally forecasting load growth for the next 10 years. Then overbuilding the physical equipment of the grid by 10x. Then waiting for someone to call you, if anything caught on fire or the power was out. That was literally distribution operations. Obviously, if you’ve got a lot of stuff happening at the edges, that’s not so great. That is not necessarily sufficient. That’s not the model that we need if we want to be able to add a lot of solar, add a lot of battery, add a lot of EVs, whatever, but it is what we have had. The first step to being able to make any significant changes, is basically taking the data that we have, and starting to figure out like what can we do with it. This is the first place where that cloud scale, hyperscale distributed computing approach becomes really relevant. This also happens to be the first place where I’m going to talk about machine learning.

There is a lot of data out there, it’s just not very real-time. It’s not very comprehensive. For most utilities today, it’s also not correlated. When we look at the grid, being able to take the data that’s out there and get something like real-time is actually a really good application for machine learning technologies. There’s a bunch of things we can do. We can forecast and get a nowcast of what’s happening at any individual meter, both the demand and for solar generation. Once we have an accurate model of end loads, we can calculate midline loads. We can pull in third-party telemetry from the devices that are out there. Tesla have really good telemetry on their devices, and the ability to manage them in a large-scale way.

Then we can do a lot to patch together something usable out of the fabric that’s out there today. As we move forward, we need to do better. We need to get hardware instrumentation out there to the edges. We need real-time data from the meters. We need a lot of things. If we have to start there, we’re screwed because any project like that for a utility takes like 5 to 10 years. We need to start with the data that we have while we get the other stuff in place. As I talk about technology applications, yes, these will all be better with better data. We have to start with what we have. Figuring out how to start with what we have for the grid of today, and help take those steps that get us to the grid of tomorrow is nearly 100% of the work because, again, we want to be able to do this work on the grid as it stands by 2035. As far as grid technology goes, that is tomorrow.

The Promise of a 2-Way Grid

What are we building towards? Today’s distribution grid has a little bit of dynamism to it. The biggest factor that’s driving change in the last 5 to 10 years is the role of solar. Solar generation is not necessarily a large part of most grids today, but in some places like Australia and Hawaii, and so forth. Australia is actually the world leader on this, along with surprisingly Germany, where they have really a lot of rooftop solar. Energy supply provided by local rooftop solar is sometimes more than 50% of what’s used during the daytime, sometimes up to 80% or 90%. At that point, it causes really significant problems. The short thing is that as soon as you get above 10% or so, you start to see this curve and the pink line up here, this is called the duck curve. The more solar you get eating into the daytime portion of this curve, the deeper the back of the duck. In Australia they call it an Emu curve because Emus are birds with really long necks, also heard it called the dinosaur curve. This is the grid as it stands today. The steeper those ramps, in the morning and the evening, the larger the problem for ramping up and down traditional generation to fill in those gaps. Because, while those plants are considered to be reliable baseload and theoretically they’re manageable, and so forth, they also can take a day to ramp up and down, so they’re not very flexible. The other thing is that you start to move around things like voltage and frequency and so forth, as you get more of this solar present. The response from most utilities today has been like, slow down. We have an interconnection queue, we’ll do a study, or some version of no.

The grid of tomorrow needs to look quite different. We need to be supplying about three times the demand. We need a lot of that load to be flexible. We need a large role for battery. We need a lot of generation to be coming up locally, because we’re going to have trouble building all the transmission that we need. The more stuff happening locally, the better off we are. We need it to be manageable. We need it to be visible. We need it to be controllable. This is a really big step ahead of where we are today. If we’re talking about, how do we take steps to get from the grid as it stands to the grid as it needs to be? You can start with a few questions like, what’s out there? What’s it up to? Today, there’s not a lot. That’s really in large part, because this change is still starting to happen, because many utilities have pushed back on the addition of a lot of rooftop solar, or battery, or end user technology. It’s also because one of the big factors driving this change is going to be electric vehicles, and that’s really just starting to be a thing in a big way. For the grid as it stands today, there’s actually not that many network problems. Everyone’s been very successful in overbuilding the network, planning really effectively, keeping change slow enough that the pace of change within the utility can keep up with it.

As you start to get a little bit more, you do start to see places where individual parts of the network get stressed. In this picture, this little red triangle is a transformer connecting a couple of different locations where there are smart devices, let’s say EV services, and sometimes that transformer gets overloaded. We can replace it. It’s all right. The more devices we get out there, the more that starts to happen. This is particularly likely to be driven by EVs, but batteries are enough. Most houses run about 6-kilowatt peak load today. The kilowatt peak load of a Ford F-150 Lightning bidirectional charger is 19 kilowatts, which is about 3 to 4 houses worth of load all at once. This is enough to blow smoke out of transformers and catch them on fire.

Now we’ve got this question that utilities are starting to ask, which is, how do I stop that from happening? That is the first thing that they tend to come to. Then there’s, how do I know about this? How do I deal with it? Then you’ve got all these software companies, like mine showing up saying, “This is an easy problem for software, can totally manage this. We’ll just schedule in the vehicle’s charge, it’ll be fine.” A utility operator or engineer looks at this and they ask this question, “Software is great for software but these transformers can explode. If they explode, they catch stuff on fire.” This is true all the way up and down the grid. What you have here is not necessarily just a software problem, but now you have a systems problem, a culture problem, one that’s going to be familiar to anyone who’s worked on large scale systems transitions. This is also a trust problem.

Once you know what’s out there, you start to have some more questions, like, can I manage those resources? Could I schedule that EV charging? Could I get them to provide services to me? Can they help with the balancing or the frequency or the voltage problems that I might have? What’s the value of those services? Because I know, thinking about it for myself: if I install some batteries and PG&E comes knocking saying, “I want to manage your batteries,” I’m like, get out. If they show up and they’re like, “I have a program where we can automatically pay you for usage of those batteries, sometimes in a way that’s nondisruptive to your usage of that battery,” and if I have a good feeling about PG&E, which is a separate question, I might say yes to that. That’s not an unreasonable thing. Batteries are expensive; I might want to defray the cost. In any model, when I talk about control, bear in mind that a very large part of this is probably also price-signal driven. You’re ultimately going to be offering money to owners of resources, or aggregators of resources, in order to get services from those resources. It’s a little bit of a secondary or an indirect operational signal, but it’s nonetheless part of the fabric of the grid of the future.

Introducing Reliable Automation

We have a trust problem. I like to think of trust as a ladder. I’ve done a lot of automation of human-driven processes. This is something we had a lot of opportunity to do at Google as things were growing 10x, 100x, 1,000x, a million x, because the complexity of the system keeps scaling ahead of the ability of humans to reason about it. There’s a lot of value in having a few simple controls that an operator can depend on, can use in a predictable way, and understands the results of. Anytime you’re adding to that or automating something that they already do, that’s not something where you just go in and you’re like, “I solved it for you. Totally automated that problem.” If anyone here has done that, you know what response you get from the operations team; they’re like, get out. You have to come in a piece at a time. You need to keep it simple. You need to keep it predictable. Each step needs to be comprehensible, and it needs to be trustworthy.

Coming back to our little model of the grid, if we know a little bit about what’s happening, we’ve got something that we can start to work with. Do I know what’s happening at that transformer? Do I know what’s happening at the load points that are causing the problems? How much of it is controllable? Do I have some levers? This is something that we can start to work with. Just keep in mind that as we go through this, the goal of automation is not just to automate things, or to write software. The goal is to help humans understand the system, and continue to be able to understand the system as it changes. Adding automation in ways that are layered and principled also tends to map onto the layers of abstraction that help build comprehensible systems. Automation is always best when it is doing that.

We can move up this ladder of automation, or ladder of trust, for an individual device. I can say, ok, battery, I trust you to do your local job and back up the grid here. The easiest, most trustworthy way to have that happen is to say, I did an interconnection study for this battery, and I know that it is not capable of pushing more energy out at once than that transformer can manage. That’s what most utilities do today to keep a grip on it. That’s a default safety setting. However, as we start to get more of those, and those upgrades become very expensive, you need to start doing something a little bit smarter than that. You need to know what your available capacity allocation is at any particular point on the grid. This can be very simple: I added up the size of all those batteries and I know that I have just this much left. It can also be something more sophisticated, and this is where being able to process our real-time data becomes important. Because if I can get real-time data about the transformer, or synthesize it, I can get an idea of a capacity allocation that is generally safe outside of peak times, and then I can start to divide that up amongst the people who need to use it. Now I can safely give the operators the ability to call on these devices, in addition to the day-to-day work that the devices are doing. The simplest model for this is just to be able to get energy out or put energy in at a certain point in time: please charge during solar peak, please discharge in the evening during load peak. Once you have a default safe assumption that the device, when you call on it, won’t blow anything up, you begin to have the ability to get it to do other stuff for you. It is more trustworthy now.
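
As a minimal sketch of that capacity-allocation step, the following Python fragment divides a transformer's remaining headroom among flexible devices pro rata. All of the names, ratings, and load figures are invented for illustration; a real implementation would work from measured or forecast data and utility-specific safety margins.

    from dataclasses import dataclass

    @dataclass
    class Device:
        name: str
        requested_kw: float  # power the device would like to import or export

    def allocate_headroom(rating_kw: float, measured_load_kw: float,
                          devices: list) -> dict:
        """Split the transformer's remaining capacity pro rata across requests."""
        headroom = max(rating_kw - measured_load_kw, 0.0)
        total = sum(d.requested_kw for d in devices)
        if total <= headroom:
            return {d.name: d.requested_kw for d in devices}  # everyone fits
        scale = headroom / total
        return {d.name: d.requested_kw * scale for d in devices}

    fleet = [Device("battery-A", 5.0), Device("ev-charger-B", 19.0)]
    print(allocate_headroom(rating_kw=25.0, measured_load_kw=12.0, devices=fleet))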

That lets you start to look at it not just at the device level, but also at the system level. Now we know what our capacity is all the way upline: from the transformer to the conductor, which is the line, to the upline transformers, to the feeder head, to the substation. I can provide a dynamic allocation that basically lets me begin to virtualize my network capacity. This is not what we do in the grid today, but it is very much what we hope to do in the near future. This is the next step: being able to manage those devices and call upon them in a way that is respectful of limits all the way up and down the line, and then to do that in a way that manages the collective capacity of the grid while also calling on services. This is when we can start to do things like orchestration. I really think of orchestration as collective management to achieve a goal. It’s not just, can I turn a device on and off, or can I stop it from blowing up a transformer? It’s really, can I collectively look at this set of assets and call on it in a way that optimizes something about the system that I want to optimize (cost or carbon, in these particular cases) while maintaining reliability. This is the fundamental unit of automation that we need for the future grid.
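
A similarly hedged sketch of the system-level check might walk a dispatch request up the line and approve it only if every upstream element has headroom. The topology and numbers below are invented; a real feeder would use measured loads, power-flow models, and proper engineering limits.

    # Ordered from the device outward; every element must have headroom.
    UPLINE = [
        {"name": "service transformer", "rating_kw": 25.0,   "load_kw": 12.0},
        {"name": "feeder section",      "rating_kw": 400.0,  "load_kw": 310.0},
        {"name": "feeder head",         "rating_kw": 900.0,  "load_kw": 640.0},
        {"name": "substation bank",     "rating_kw": 5000.0, "load_kw": 4100.0},
    ]

    def dispatch_allowed(request_kw: float, upline=UPLINE) -> bool:
        """Approve a dispatch only if no upstream element would be overloaded."""
        for element in upline:
            headroom = element["rating_kw"] - element["load_kw"]
            if request_kw > headroom:
                print(f"blocked at {element['name']} "
                      f"(only {headroom:.0f} kW of headroom)")
                return False
        return True

    print(dispatch_allowed(10.0))  # fits at every level
    print(dispatch_allowed(19.0))  # blocked at the service transformer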

Coming back to our question: yes, of course software can be trustworthy. We do this every day. This is the work of most of the people in this room. We know that it can be, with caveats. It just means doing the work. It’s not just doing the technical work, though, and that’s the really important part here. It also means doing the systems work. One important side note is that whenever you’re doing systems-level work in systems that have a very low tolerance for failure, and this is true for things like self-driving cars, aerospace applications, self-landing rockets, all of these sorts of things, the role of the simulation environment goes up in importance. If you can’t test safely in the wild, you have to test in sim. This was a really big lesson in transitioning from an environment like Google, where in general you can test in the wild, to something like rocketry or the grid. I actually spent quite a lot of time talking to a friend who worked on aerospace applications to understand the appropriate role for simulation in guaranteeing safety in a dangerous physical system.

Once you have that, the process of rolling out change in a large-scale physical system like the grid is very similar to rolling out change in any distributed software environment. You’re going to test it. You’re going to canary it. You’re going to deploy it slowly. You’re going to monitor it. Now you have a repeatable pattern that lets you begin to engage with change at large scale, in a software-driven grid management environment. If you’re curious where we’re at with this, we have several grids up and running with control in this model. This has proven to be really useful in the grid landscape, and it’s a really important learning that we brought with us from the distributed systems space.

The Self-Driving Grid: Reliable High-Scale Automation

What this opens up is the opportunity to really start to look at very large-scale automation. Ultimately, this will be big. It isn’t today, but it needs to be in order to solve the problem that is ahead of us. We need millions of devices. We need a lot of load growth. We need a lot of flexibility. We need it to be simple. As we start to look at that, thinking about the kinds of patterns that make scale predictable, reliable, manageable, and ultimately simple for distributed systems, you’re looking at the unit of a cluster deployment, or a data center, or a region, or a service spread across multiple regions. You’re starting to think about the units of self-management or resilience that let you treat this not as a bunch of junk parts, but as a system that you can reason about, where you could basically set some pieces of it and expect them to run in a roughly self-sustaining way over a period of time, requiring only occasional operator intervention. Starting to think of regions within the grid as effectively a cluster-level deployment of a service, able to operate under a consistent policy within a local set of resources, is a way of thinking about grid scale that I think maps pretty well from the distributed systems environment up to the full grid environment. This isn’t just me, either. This is a pretty active line of inquiry amongst grid researchers and utilities. The model that we think can and will work is this idea of a fractal grid, or a hierarchical grid.

Critical to having the flexibility required to get useful outputs from all of these disparate services, though, is some easily operator-controllable flexibility resource. This is where I want to come back to the idea of caching. As we look at what it took to scale up our distributed systems in the web services or internet services space, we did start with a very centralized model of capacity allocation, or serving capacity. You’ve got a big data center, and you put all the load there. You might have a second one as a backup. When you start thinking about multiple data center locations, you’re also thinking about things like n plus 1 capacity and failover capacity. At first, you’re going to start with something relatively simple. This is basically where the grid is today: a few big generators, n plus 1. Planning for n plus 1 and n plus 2 is a really big thing in the grid, as it is for distributed systems. You forecast load in the grid today and turn generation up and down as needed.

Coming back to our distributed systems example, as you get more demand, the first step is adding more capacity locations, more data centers, more computers. The next step, almost immediately after that or alongside it, is the idea of caching: being able to have some lightweight, cost-optimal compute resources that can serve some fraction of your traffic. At the point where the constraint is not CPU or serving resource, but rather the network itself, the role of the cache changes. At Google, in our early days, when we were looking at scaling problems, we were really dealing with capacity-constrained scaling, because the constrained resource was compute for web search. The bandwidth piece of that is very small; a web search request or response is not terribly large. As soon as YouTube came along, suddenly we had a different problem. Now it’s a network cost problem. For Google, this was 100% driven by reliably serving cat videos, which is very important to all of us. I like dog videos, too.

This is the point at which we started to need to look at a distributed edge caching layer. What this gave Google was the ability not just to use caching to defray network costs, stopping requests from traversing all the way across the network, but also to use caching as a point of flexibility that increased the reliability of the system from the user’s perspective. If you think about it from the user’s perspective: your connection to the network might sometimes be down, but you can get a result from your browser cache. Your ISP’s connection to the network might be down, or Google may be down or flickering for a second or something (it could occasionally happen), but you can get a response from a local cache if there’s a response available there. It’s not something that replaces the reliability of the central resources, but it is something that substantially augments it. Starting to put these edge caching layers in place let us get from a point at Google where we were maintaining a central serving system at basically five nines, including the network, to a system that was ultimately better than six nines reliable and effectively 100% from a user perspective. There’s no reason that it must stay that way, but that was an outcome of this system.
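
For readers who have not internalized what those nines mean, the arithmetic is straightforward: each additional nine cuts the allowed downtime by a factor of ten. The short Python snippet below simply prints the yearly downtime budget at each level.

    MINUTES_PER_YEAR = 365.25 * 24 * 60

    for nines in (3, 4, 5, 6):
        unavailability = 10 ** -nines          # e.g. five nines -> 0.00001
        downtime_min = MINUTES_PER_YEAR * unavailability
        print(f"{nines} nines: ~{downtime_min:.2f} minutes of downtime per year")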

If we start to look at this from the grid perspective, what’s a cache? A battery is a cache. Thinking about the role of batteries in the future grid, it’s not just about energy storage, and being able to store that solar and use it again later. It’s also about providing the flexibility that allows us to use the network a lot more efficiently, and about providing a lot more control and a lot more margin for error for the control that we provide. This is something that I think will be really transformative for surprising resources. This is something that we start to see in the field today. In the limit, we start to have a model of the future grid that looks familiar. We’ve got a bunch of independent serving locations that are potentially getting policies or updates to provide effective collective management at the system level. Perhaps they can even entirely isolate themselves sometimes, or much of the time. Hopefully, they can also provide network services, if paid to do so.

I say hopefully, because there is a model where this all falls apart as well, where the economics of the collective grid get crushed under people’s desire to just defect and build their own power plants locally. That would be bad from a tragedy-of-the-commons perspective, because when you look at a model of the grid that is not providing any form of economic central connectivity, you’ve got a whole bunch of people who are no longer connected and can no longer effectively receive services. We like the model of the grid being available, because it lets us optimize a lot of things. It optimizes the use of the network. It optimizes the use of the resources. It lets us move energy from places where the wind is blowing to where it’s not, from where the sun is shining to where it’s not. It lets us move resources around and provides a substantial efficiency boost for the generation and the network resources. It isn’t a default outcome; we have to build this if we want it to happen.

Just bringing it back around to why we are talking about this in the first place: this is not an academic exercise. It’s not just because we want people to be able to get paid for use of their batteries. It’s not even just because we want transformers not to blow up. It’s because we really need the flexibility that customers and end users provide to deliver that last 20% of decarbonization of the grid and to increase our energy supply. If we can do that, we have a really good chance of making our deadlines. We might make it by 2035. We might make it by 2040. 2040 is not bad; 2030 would be the best. Every day sooner that we make it is another day that we’re not emitting the carbon that we have been in the past, and it’s one step towards a better future.

Is This a Job for AI?

As a closing thought, I’m just going to touch on whether or not we will eventually have a giant AI operating the grid. My short answer is: not one giant AI, probably, for a couple of reasons. One is that predictability and transparency are really important in a system like this, where there are many participants. It needs to be comprehensible and predictable, and the financial implications of any action taken need to be very clear. Having said that, it is a very complex problem, and there are a lot of opportunities for ML and AI to make it better. In the limit, I do expect a very large role for AI in optimization and scheduling, in continuing to understand and effectively manage the changing patterns of energy usage, and in doing all of the things that are on this slide. My happy message is that this is something we can evolve as we go. We can start from the tools that we have. There is an opportunity for everyone to find really interesting technical work in the fields that you’re interested in by engaging with the grid, whether it’s simply from a systems and operations perspective, or whether you’re an AI and ML researcher and you see ways to really dramatically improve the operations of these systems.

Conclusion

There are a couple of other resources that are really helpful for engineers who are looking to make a change into the climate space. We are not the only company working in this space; there are a number of very fine ones. I would really encourage you to consider moving in the climate and clean tech direction as you think about where your career goes in the future, because there is no more important work that we can be doing today.



Presentation: Reckoning with the Harm We Do: In Search of Restorative Just Culture in Software and Web Operations

MMS Founder
MMS Jessica DeVita

Article originally posted on InfoQ. Visit InfoQ

Transcript

DeVita: I’m Jessica DeVita. We’re talking a lot about Just Culture. Instead of sharing with you more about the theory of it, or what different people have said about it, I want to share with you the results of a study I did on the lived experiences of the people on the ground who are responding to incidents and outages. I’m inspired by the Just Culture manifesto. In particular, the first two commitments of the Just Culture manifesto are that people should feel free to work, speak up, and report harmful situations or incidents that they experience or are involved in without fear of unjust or unreasonable blame or punishment, and that organizations need to support the people who are involved in these incidents and outages; that, in fact, supporting people is the first priority after an unwanted event. The other commitments are important as well, but I wanted to focus on the first two.

Research Methods

I went to talk with folks to find out what’s happening for them. What are they experiencing? As a brief overview of the study, I asked participants whether they’d felt harmed or traumatized by their experiences, about the likelihood of being blamed or of blaming themselves, and whether they were considering leaving their jobs as a result. I invited them to describe what blameless and accountability meant to them. Before we go into the results, I’ll speak briefly about how I did the research. As a safety science researcher, I primarily focus on phenomenology, which emphasizes the importance of the lived experiences of people. For each interview that I do, I record it and transcribe it. Then I print those out, grab my highlighter and pen, and look for the significant statements that really speak to the research question that I have. This is often referred to as coding. We’re reviewing the transcripts, looking for larger themes, and grouping the codes into larger meaning units and themes. Some people will then take their handwritten notes into a qualitative data analysis tool, where they group these findings into those larger meaning units. This is what forms a codebook. The codebook is where we gain confidence that we’ve really captured what it’s like for people. The goal of phenomenology is that we would leave with a feeling that we better understand what it’s like for someone to experience that. In this study, I used a mixed methods approach, where I surveyed 31 people, held two focus groups, and had several individual conversations with folks.

Next, I’m going to give you a brief overview of the questions that we asked. Have you felt harmed or traumatized from your involvement in incidents and outages? What exactly is trauma? Trauma results from events that the individual experiences as physically or emotionally harmful, and it has lasting effects on people’s well-being. I was surprised, I must say, that very few people said that they had not experienced trauma. Only 7 out of 31 responders indicated that they have not felt harmed or traumatized by their experiences. Only one person left a qualitative comment. They said outages are considered an opportunity to learn as well as share what you learned. If an outage is called, people will volunteer to be the incident commander or chime in to indicate that they’re available to help. Even with upper management, there’s very little negativity or blame. Again, I was surprised; I was really hoping that with all the work that we’ve done, more people would report that they’d had better experiences. The vast majority of participants described experiencing harm or trauma. They may have used other words, and I invited them to describe their experiences. These folks are what we might call second victims. Second victim may be a term you haven’t heard before. Second victims suffer significant emotional harm regardless of whether their actions actually contributed to the incident. Second victims may also experience harm related to stress from adversarial root cause investigations. The issues that they experienced were not limited to them alone; they also affected their families.

Themes That People Shared

Next, let’s explore what people shared. Some of the larger themes and descriptions were that there was a punitive culture that impacted their mental and physical health. They reported a loss of sleep, disruptions to family time and personal time, and no time to fix underlying issues. We’ll go into some of the significant statements for each of those larger themes. As I mentioned, people are still experiencing this punitive culture. One participant shared, “A CTO told me to take the blame for an incident, or he’d fire me.” “I remember getting yelled at by the CEO that I wasn’t using the right tool.” “A lack of trust in the team is crippling, to be honest. People have memories like elephants. No one is ever satisfied.” “The process we have to deal with incidents, 5 whys based, inevitably leads to the determination of a root cause that more often than not ends with human error.” Postmortems were called recrimination meetings. One responder described how their COO flew into town after a particularly bad incident. They said, in the aftermath of that incident, the COO asked me, “Why did you release the software?” I told him we’d done all these tests, and we thought we were in good shape, but we missed it. He said again, “Why did you release the software?” I said, “I made a mistake, an error in judgment.” This is an hour-long meeting. The third time he asked again, “Why did you do this?” I said, “I don’t know what to tell you. I screwed up. I don’t know what more you want from me.” It wasn’t like I got fired or anything, but I definitely felt blamed.

The impact on their mental and physical health was the next theme that emerged. One respondent said, “I have been traumatized for being involved in incidents for the past five years, to the point that every time I see an alert today, I get an anxiety attack.” “Sometimes I can’t help but not hide it. It will explode in emotional bursts or manifest in bright red hives all over my body, or I’ll completely shut down.” “I remember my lower back hurting because of the amount of adrenaline I’d been running on for the past 36 hours.” “It was very stressful and it gave me a lot of anxiety, which led to a loss of sleep, and I would say a more profound sense of disengagement from the workplace.” Sleep loss was a very common theme among the responders. I asked people, what’s your relationship with sleep? One participant shared, “Sleep damage makes it very difficult for me to be on-call. When a page wakes me up, I generally do not sleep afterwards.” I asked if sleep was ever discussed by management. They said formal sleep discussion doesn’t happen. It’s a very dangerous discussion. Not being able to go to sleep at a reasonable hour takes its toll on your mental abilities. Another participant shared that they sleep with one eye open. “It’s like you’re asleep with one eye open. Don’t dolphins do that? It’s like half my brain is at work, ready and poised to respond to the pager, while the other half of my brain is trying to relax.” “Sleep is very holy to me. I’m not on-call right now. I have a hard time separating work and non-work stuff, and so being on-call right now, I couldn’t cope with it in a healthy way.”

Some participants also described that, working in Europe, there were a lot of questions and avoidance around something called the European Working Time Directive, and that it was awkward to meet the requirements of this directive in a modern 24/7 on-call team. The European Working Time Directive, as I learned, doesn’t cover sleep very much, but it does cover rest. It’s quite complicated, but essentially you’re not supposed to have less than 11 hours of rest in every 24 hours, and you’re not supposed to work more than 48 hours in any given week. Another participant shared that, “After the incident, when I tried to get back to sleep, my adrenaline was still going, so it took me a while to get back to sleep. It wiped me out for the whole day.” “There were some periods of time where I just would not respond. I wouldn’t even wake up to the pager. My partner would end up waking me up, ‘Your phone is going off, go get it.'”

This disruption to family life was a significant theme for many participants. One person described feeling exhausted, spent, drained, and guilty for working and being away from their children. “They were unable to count on me for being able to participate through the duration of some event, whether that be a meal or anything else.” That interruption became a highly negative, emotionally charged topic for this participant. “Scheduling of shifts is never done with any consideration for family events.” How are people coping? One participant shared that they are not coping well. “Having a good support network and therapy helps, but this industry can be absolute shit for them.” Some participants described disengagement as a coping mechanism; they would dissociate or leave the job. Another participant described using it as fuel. Another person shared that they mostly get angry and try to change our industry as a way to cope: “Using it as fuel so I feel better by at least knowing I have helped others to not deal with it alone. The resentment I felt from those experiences turned into fuel.”

Restorative Just Culture Checklist

Next, I want to talk a little bit about a resource that Sidney Dekker has shared on Just Culture. He has provided this Restorative Just Culture checklist. This is another set of questions that I asked some of the participants. First of all, it says that a Restorative Just Culture aims to repair trust and relationships damaged after an incident. It allows all parties to discuss how they’ve been affected and to collaboratively decide what should be done to repair the harm. Who has been hurt here? These second victims are the engineers who were involved in or just responding to the incident. I also want to mention that there is a third victim here: the incident analysts, those people who are tasked with supporting engineers after an incident and investigating what really happened. They are what’s known as third victims. Have we acknowledged how we’ve harmed people? Do we acknowledge them? Second victims can experience harm related to adversarial root cause analysis investigations. They can suffer significant emotional harm, again, whether or not their actions contributed to the incident or whether it was preventable at all. The impact on them can be severe, as we learned from some of the responders. It can take the form of signs and symptoms associated with acute stress syndrome or post-traumatic stress disorder.

Some of this comes out of research on patient safety professionals as the third victims of adverse events. These third victims, according to the research that Holden and Card did, are those who experience psychosocial harm as a result of indirect exposure to an incident, such as leading those incident investigations. Their study found that these third victims have an almost complete lack of emotional support, and a sense that the harm they experience goes unacknowledged. The harm is clearly real. Respondents experienced anxiety, lost sleep, emotional exhaustion, and a sense of being blamed by everyone for events they weren’t involved in. This led some of them to consider leaving the profession.

We often talk about the organization as a victim. While organizations can certainly suffer reputational harm after these events, Holden and Card found that corporate victimhood is qualitatively different from the harm that individuals experience. Organizations do not experience acute stress syndrome, although their employees might. Organizations do not burn out and leave the profession, although their employees might. Have we overfocused on the organizational and reputational harms and ignored the needs of our responders and incident analysts? The Just Culture checklist asks us: what do these folks need? I’ll share with you next what some of our study participants shared in response to that question, and to the question of how we could reduce the harm to them. Some of the larger themes that emerged are that we really need to educate management. They suggested we need to listen to engineers. We need to trust people. We need to take care of people. We need to slow down and alleviate the pressure to ship. They spoke about needing more training, more staffing, and more sustainable rotations. They asked us to focus on learning instead of blaming, and to use inclusive language. No more human error. No more root cause. No more fat fingering a change.

What People Needed

Let’s take a look at some of the significant statements that people shared about what they needed. They wanted us to invest in maintenance and plan for failure. Spend more time on maintenance. “I think what does harm at my company is the frequency of incidents and false alarms. Sometimes our SRE team gets paged for things that resolve themselves before the SRE even gets online to check things out.” “Make investing in resilience a priority for product teams. Make it clear to those teams as well. Usually, they want to improve things but they feel this external pressure to ship. Stop with the ‘Get it right the first time’ mentality.” One person shared that they wanted regular fire drills or empowerment training, so everyone is prepared and it doesn’t feel like a total shock and scare. Being unfamiliar when you’re a frontend engineer and suddenly having to parse through logs and APM like you’re a DevOps person at 1 a.m.

Educate management. “Just give us a break.” “Education for senior management around learning from incidents, accident models, and helpful and unhelpful behavior, including language.” “Better training for managers on how to manage their staff. Several managers simply shouldn’t be managing people.” Another respondent said that we would really need serious culture change from the top, but that there was no appetite for that. The next theme is to focus on learning instead of blaming. “Allowing for human nature to thrive instead of sanctioning individuals due to the complexity of systems.” “Adopting investigation approaches that are capable of uncovering systemic challenges far away from the frontline. If the tools you use only uncover causes close to the last people who were involved, then that’s all we’ll learn and focus on. Also focus on multiple perspectives.” One person said that if you make people mad and you chew them out, you’re not going to get good work out of them, so don’t do that. Another theme was the need to talk about what happened. “Personal outreach and giving people an opportunity to talk through things is very helpful, but also quite rare,” according to one participant. “Provide a safe space to talk about what happened.” “Build a space for them to share their concerns about the work they do at the sharp end.” The next theme was sustainable rotations. One person shared that it really has to be six people at a minimum for a somewhat humane rotation. I think managers get very upset when I say that.

What Is Blameless?

I invited participants to describe, what is blameless? What does it mean to them? What does it mean to you? Way back in 2012, John Allspaw was talking about blameless postmortems and a Just Culture at Etsy. He said that having a Just Culture means that you are making an effort to balance safety and accountability. He also described a blameless postmortem process as one where engineers whose actions have contributed to an accident can give a detailed account without fear of punishment or retribution. What did our participants have to say? They characterized blameless in a number of ways: as a behavior, as just something people say, as a doorway for inquiry, but mostly as just not pointing fingers.

Let’s go into some of the statements that people shared. Blameless is a behavior. “When faced with a surprise, often negative, the primary goal is not to blame individuals, although blameless does not mean sanction-less, but to learn and understand what can be improved with the goal of keeping our systems sustainable and adaptable in the long run.” “Blameless means not pointing fingers at people for causing an incident.” “Blameless means that we don’t accept so and so messed up as the root cause of an incident. We look at the system’s tools and procedures that failed to prevent the mistake or mitigate the consequences.” Blameless was just something people say. “It’s just something people say nowadays, like we do Agile or psychological safety. It would be weird not to do it. A lot of folks don’t understand what it takes. I’ve seen some very blameful conclusions come out of blameless postmortems. You can still see the blame in the language and the actions.” “Blameless is a squishy marketing term used by part of the safety community to try to make blame attributed to frontline workers go down. It’s hard to tell how well it succeeded.” “Blameless has been Agile-ified to mean whatever the person in charge wants it to mean.”

Blameless was a doorway for inquiry. “Blameless means the emptiness of blame, not the dissolution of it. It means accepting that blame will happen as a natural result of a fleeting human emotional reaction, and that we should see it as a doorway for inquiry.” “We look beyond human error to discover the cause of an incident.” “Approaching incidents from a perspective that all involved were doing the best that they knew how, and that incidents occur because of systems factors and systemic pressures, not individual mistakes.” Blameless recognizes that software is hard. Another participant shared that we need to recognize how blame occurs and what it can tell us about the way our brain is recognizing patterns, and to transform it into something more useful. Finally, they shared the importance of developing an empathy, an understanding of how a situation could unfold and why the decisions made were sound and reasonable from every operator’s perspective.

I asked participants how likely they were to be blamed for incidents at their workplace, and how likely they were to blame themselves. I was a bit surprised to see that a majority of participants said that they were very unlikely to be blamed. Just as surprising was that a lot of participants said that they would blame themselves. It’s interesting how the notion of blamelessness manifests and how people describe it: even in these blameless organizations that claim to hold these values, self-blame is a significant emotion for people. I asked people if they were likely to change or leave their jobs as a result. Another surprising result is that many people were not looking to leave their jobs, though a few of them had already left their jobs or were actively looking. Another person shared that they had lost two people from their team due to the heavy on-call burden. It’s a significant risk to lose these people.

What Is Accountability?

I next invited people to talk about, what is accountability? What does it mean? What does it mean to them? Accountability was described in a few themes. It was characterized as culpability, being on the hook; as a capability; as a plurality, in that it takes a team; and as prevention. Let’s look at some of the significant statements that people shared within these larger themes. Accountability as a capability: “The capability to accept your own role in an event, and your ability to go through restorative steps with other people involved.” “That people would have control over their work, but also their responsibility for it. Without both of those concepts, things fall apart.” “The power of agency supported by the reciprocity of others.” That you can be doing the best you can and still fall short of your goals, but that’s where we help each other out and don’t beat each other up when we miss. Accountability, it takes a team: “We all pitch in to get the service restored as quickly as possible, and to figure out how to improve the situation in the future.”

That there’s a group of individuals who are stewards of a given system and service and are committed to its improvement and long-term sustainability. Accountability as prevention: “Accepting that existing tools or procedures failed to prevent the incident, and spending time fixing those things as a team before we move on to more exciting work.” “Taking action so that the system cannot allow the mistake again.” Accountability as account giving: one person shared that it was about forthrightly being able to recount decisions and actions that were taken. Another participant described their unpopular opinion that accountability can only come from yourself; you can hold someone accountable, but that’s typically punitive in nature. Accountability being self-imposed meant, “I’m going to take steps to educate myself and hopefully others around me to the best of my abilities, such that future events are mitigated based on what I’ve learned.”

Is There Conflict Between Blameless and Accountability?

I asked them if there was conflict in these concepts, conflict between blameless and accountability. For one participant the answer was yes. They said, “Some events aren’t blameless. When someone intentionally violates a policy or is malicious, then there should be accountability. When policies and cultures don’t support staff in being successful, I can see where blameless may be ok.” For the majority of the participants, there wasn’t conflict between the two concepts. This person said that it’s possible to say that we as a team need to spend more time making our system more stable, without finger pointing at any particular person or team. Despite the fact that blame serves a social function, blameless and accountable are not necessarily at odds, especially if organizations want to learn from incidents, and explore why locally rational decisions that may have been successful until there was an incident, had surprising effects. Accountability is the thing revealed to us when blame is but a passing phase instead of a concrete resting point. We discover that accountability is a plurality. It takes a team to be accountable in complex systems. How can we treat blame as anything but a film that we allow ourselves to recognize, politely remove, and move on?

Study Conclusions

I want to share now some of the conclusions from the study. Accountability means different things to different people. There’s no one definition. Even in cultures that practice blamelessness or that claim to have a Just Culture, people are still being harmed. Sleep is holy, we need to talk about sleep. Why are we still waking people up in the middle of the night?

Closing Thoughts on Accountability and Just Culture

Some closing thoughts on accountability and Just Culture. The word accountability is the condition of being able to render an accounting of something to someone, according to Dubnick, whereas the idea of accountability is less amenable to easy definitions. Dubnick explains, crucially, “To perceive oneself as accountable is to accept the fact that there is an external reference point, a relevant other that must be taken into consideration. Being accountable is thus a social relationship.” Accountability can look forward instead of backwards. Sidney Dekker has written about Just Culture extensively. He described how, in backward-looking accountability, holding someone accountable is directed at events that have already happened, and how accountability can also be forward looking. He references Virginia Sharpe.

Restorative justice, according to Dekker, achieves accountability by listening to multiple accounts, and looking ahead at what must be done to repair the trust and the relationships that were harmed. Perhaps operators involved in mishaps could be held accountable by inviting them to tell their story, their account, and then systematizing and distributing the lessons in it. By using this to sponsor vicarious learning for all. Dekker says that perhaps such notions of accountability would be better able to move us in the direction of an as-yet elusive blame-free culture.

Dr. Richard Cook says, “There is no such thing as a Just Culture. There’s just culture. Where complex systems failure has occurred, culture plays out predictably. It’s more about the power dynamics; reserving the decision about what is acceptable, and calling that just, is a species of nonsense. Just Culture is almost entirely a fig leaf for the usual management blame assignment. Justice is in the eye of the beholder.” Culture is what you do, not what you say, as one participant shared. I hope this has offered you some insights into the lived experiences of people. We should still pursue the restorative approach to learning from incidents.

Questions and Answers

Brush: As a manager, I support folks doing incident analysis as well as responders. I’m curious if you can give me any tips and tricks on day-to-day things that managers can do to better support the emotional load.

DeVita: I think as managers we have that special opportunity, but every word we say carries extra meaning. I think probably modeling some of the blameless language helps, but also, day-to-day, what I’d love to see more managers doing is trying to check in with people after they’ve been on a long incident. It can take some time. Asking how folks are doing can be really impactful. It sounds simple, but if you ask it and really try to check in with people afterwards, I just think that goes a long way. Also, monitor how many times people are being woken up, and what you are doing about it. Stop what you’re doing and go fix that. It’s not worth it. The cost to humans is just incalculable.

Brush: If you’ve got a company that is offering 24/7 services, and they do need tight turnaround times from responders, what do you suggest to avoid waking people up at night?

DeVita: I’m not saying it’s going to be possible to avoid it entirely. I think when we discover that the entire thing is resting on these two people, that’s a burden that’s really pretty great for them to bear. I would just say, first of all, notice that that has happened, and show that you’re investing in it, because there’s fulfillment in doing your service really well. When you do get woken up for a problem, you definitely, I think, have a more connected experience with how it’s breaking. The breaks don’t respect the clock. They don’t care about our schedules. I would really try to think about, if this person were in Fiji enjoying the sunshine, would our other responders be able to do something to mitigate, at least until the next business hours? Like, what’s possible here? That’s a question I would ask people afterwards. When you find that your experts have been woken up, definitely reach out and try to talk with them. Maybe ask, if you have that good relationship with them, “If you hadn’t been able to pick up for some reason, do you think your team could have handled it for you? Would you have been able to enjoy your vacation,” hypothetically? That question can land very differently coming from a manager than from an independent incident analyst, if you have folks who are neutral, or strive to be neutral. Those are some questions I would ask, but knowing your power influences all of that. Dr. Cook reminds us at the end that the power dynamic may not be verbally spoken, but it is always there.

Brush: I hadn’t thought of that. That’s an interesting point about how the manager asking the question might get a less direct, not fully honest, more packaged answer about how this happened and what could happen instead.

The other thing that I was curious about is the opposite end of it; maybe this relates to it. What are some things that folks on the team will sometimes do inadvertently to make the whole experience somewhat worse? They’re well-meaning, but accidentally they make things worse.

DeVita: First of all, that is an observation, and it depends on who’s observing. When we have that observation, it’s probably in hindsight. It’s, ok, it looks like this made things worse, but we usually only know that in hindsight. We sometimes have well-meaning people who may join an incident call, and it’s not helping. I would just invite us to remember the incident manager or commander’s role: if you’re running that incident, please explicitly identify yourself as running it, so that anybody joining always knows who’s running the show. That person has a duty, I think, to say, “Unless you’re taking over as incident commander, this needs to be a voice-only bridge between the engineers; please take it to text only.”

Brush: No, I think it’s great, inserting the roles and responsibilities like that.

DeVita: Or if you have a relationship with that person, maybe you do that. It really depends on the situation. Incidents are never really scripted. They’re a bunch of people coordinating together who may not have experience working together. That’s why some of the incident management practices can be helpful, because even if you’ve never worked together, if you identify your roles, people know what to expect. Unless they’re going to relieve you, it might be just, “Whoever is speaking right now, can I ask you to go to text only?” Or you message that, or maybe your other helpers are messaging it; we’re all moderators, and we all have to help each other in these incidents.

Brush: I think this question is around strategic incompetence, or I’ve heard it called malicious incompetence. How do you get folks to widen their knowledge so they can be more effective overall as a team, in terms of redundancy?

DeVita: Remember the viewpoint that you’re taking. You are seeing that as strategic incompetence. I’m not challenging that you have that experience; it’s a judgment, essentially, that they are avoiding learning. I would invite you to be really curious about what’s going on there. I’m not a gambling woman, but I would say to you that there’s something very fearful going on there. If they are afraid to touch that system, you should probably slow down and fix that. Because if the system is so fragile that they’re like, “I’m not touching that at all,” I would not enjoy having that continue. If you just add the manager power dynamic and push them, “No, you have to go learn,” that won’t help; it seems like there are some safety issues there, maybe they’re afraid of the deployment system because they know it’s held together with bubblegum and toothpicks. Adding a manager push in there is absolutely not going to help. I would see if you can find a way to bring people together after an incident, and if you can start to build your trust with people so that it’s ok to be curious about other parts of the system that are opaque to them. Until they see a blameless experience, they won’t believe you.



ScyllaDB vs. MongoDB: When to Use Each – The New Stack

MMS Founder
MMS RSS

Posted on nosqlgooglealerts. Visit nosqlgooglealerts

Numberly has been using both ScyllaDB and MongoDB in production for over five years. Learn which NoSQL database it relies on for different use cases and why.



Within the NoSQL domain, ScyllaDB and MongoDB are two totally different animals. MongoDB needs no introduction. Its simple adoption and extensive community/ecosystem have made it the de facto standard for getting started with NoSQL and powering countless web applications. ScyllaDB’s close-to-the-metal architecture enables predictable low latency at high throughput. This is driving a surge of adoption across teams such as Discord, Tractian and many others that are scaling data-intensive applications and hitting the wall with their existing databases.

But database migrations are not the focus here. Instead, let’s look at how these two different databases might coexist within the same tech stack — how they’re fundamentally different, and the best use cases for each. Just like different shoes work better for running a marathon versus scaling Mount Everest versus attending a wedding, different databases work better for different use cases with different workloads and latency/throughput expectations.

So when should you use ScyllaDB vs. MongoDB, and why? Rather than provide the vendor perspective, we’re going to share insights from an open source enthusiast who has extensive experience using both ScyllaDB and MongoDB in production: Alexys Jacob, CTO at Numberly. Jacob shared his perspective at ScyllaDB Summit 2019, and the video has been trending ever since.

Here are three key takeaways from his detailed tech talk.

Scaling Writes Is More Complex on MongoDB

The base unit of a MongoDB topology is called a replica set, which is composed of one primary node and usually multiple secondary nodes (think of hot replicas). Only the primary node is allowed to write data. After you max out vertical write scaling on MongoDB, your only option to scale writes becomes what is called a sharded cluster. This requires adding new replica sets because you can’t have multiple primaries in a single replica set.

Sharding data across MongoDB’s replica sets requires using a special key (the shard key) to specify what data each replica set is responsible for, as well as creating a metadata replica set (the config servers) that tracks what slice of data lives on each replica (the blue triangle in the diagram below). Also, clients connecting to a MongoDB cluster need help determining what node to address. That’s why you also need to deploy and maintain MongoDB’s Smart Router (mongos) instances (represented by the rectangles at the top of the diagram) connected to the replica sets.
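
As a rough illustration of what that setup involves, the sketch below uses pymongo to enable sharding and shard a collection on a hashed key. It assumes a mongos router reachable on localhost, and the database, collection, and key names are made up for the example.

    from pymongo import MongoClient

    # Connect to a mongos router (not directly to a shard's replica set).
    client = MongoClient("mongodb://localhost:27017")

    # Enable sharding on the database, then shard the collection on a hashed
    # key so documents are spread across the underlying replica sets.
    client.admin.command("enableSharding", "mydb")
    client.admin.command("shardCollection", "mydb.events",
                         key={"user_id": "hashed"})

    # Application writes still go through mongos, which routes each document
    # to the shard that owns its chunk of the key space.
    client["mydb"]["events"].insert_one({"user_id": 42, "event": "page_view"})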

Having all these nodes leads to higher operational and maintenance costs as well as wasted resources, since you can’t tap the replica nodes’ IO for writes, which makes sharded MongoDB clusters the worst enemy of your total cost of ownership, as Jacob noted.

For ScyllaDB, scaling writes is much simpler. He explained that “on the ScyllaDB side, if you want to add more throughput, you just add nodes. End of story.”

Jacob tied up this scaling thread, saying:

“Avoid creating MongoDB clusters, please! I could write a book with war stories on this very topic. The main reason why is the fact that MongoDB does not bind the workload to CPUs. And the sharding, the distribution of data between replica sets in a cluster is done by a background job (the balancer). This balancer is always running, always looking at how sharding should be done and always ensuring that data is spread and balanced over the cluster. It’s not natural because it isn’t based on consistent hashing. It’s something that must be calculated over and over again. It splits the data into chunks and then moves it around. This has a direct impact on the performance of your MongoDB cluster because there is no isolation of this workload versus your actual production workload.”
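
To see why consistent hashing avoids the need for a balancer, consider the toy Python sketch below: because placement is a pure function of the key and the ring, adding a node only reassigns the keys that now hash to it. This is a simplification for illustration, not ScyllaDB's actual token and vnode implementation.

    import hashlib
    from bisect import bisect

    def token(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def build_ring(nodes):
        return sorted((token(n), n) for n in nodes)

    def owner(ring, key: str) -> str:
        tokens = [t for t, _ in ring]
        return ring[bisect(tokens, token(key)) % len(ring)][1]

    keys = [f"user-{i}" for i in range(10_000)]
    before = build_ring(["node1", "node2", "node3"])
    after = build_ring(["node1", "node2", "node3", "node4"])

    moved = sum(owner(before, k) != owner(after, k) for k in keys)
    print(f"{moved / len(keys):.1%} of keys changed owner after adding a node")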

MongoDB Favors Flexibility over Performance, While ScyllaDB Favors Consistent Performance over Versatility

ScyllaDB and MongoDB have different priorities when it comes to flexibility and performance.

On the data modeling front, MongoDB natively supports geospatial queries, text search, aggregation pipelines, graph queries and change streams. Although ScyllaDB, a wide-column store (aka key-key-value), supports user-defined types, counters and lightweight transactions, its data modeling options are more restricted than MongoDB’s. Jacob noted that “from a development perspective, interacting with a JSON object just feels more natural than interacting with a row.” Moreover, while MongoDB offers the option of enforcing schema validation before data insertion, ScyllaDB requires that data adhere to the defined schema.
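
The contrast shows up directly in code. The hedged sketch below creates a MongoDB collection with an optional $jsonSchema validator and an equivalent ScyllaDB table with a mandatory schema, using the standard pymongo driver and the Cassandra/ScyllaDB Python driver; the connection details, keyspace, and field names are illustrative.

    from pymongo import MongoClient
    from cassandra.cluster import Cluster  # ScyllaDB speaks the CQL protocol

    # MongoDB: schema validation is optional and attached to the collection.
    mongo = MongoClient("mongodb://localhost:27017")
    mongo["appdb"].create_collection(
        "users",
        validator={"$jsonSchema": {
            "bsonType": "object",
            "required": ["user_id", "email"],
            "properties": {"user_id": {"bsonType": "long"},
                           "email": {"bsonType": "string"}},
        }},
    )

    # ScyllaDB: the schema is mandatory and declared before any data is written.
    scylla = Cluster(["127.0.0.1"]).connect()
    scylla.execute("CREATE KEYSPACE IF NOT EXISTS appdb WITH replication = "
                   "{'class': 'SimpleStrategy', 'replication_factor': 1}")
    scylla.execute("""
        CREATE TABLE IF NOT EXISTS appdb.users (
            user_id bigint PRIMARY KEY,
            email   text
        )
    """)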

Querying is also simpler with MongoDB, since you’re just filtering and interacting with JSON. It’s also more flexible, for better or for worse. MongoDB lets you issue any type of query, including queries that cause suboptimal performance with your production workload. ScyllaDB won’t allow that. If you try, ScyllaDB will warn you. If you decide to proceed at your own risk, you can enter a qualifier indicating that you really do understand what you’re getting yourself into.
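
For example, filtering on a non-indexed field is a one-liner in MongoDB, while the equivalent CQL query is rejected unless you add the ALLOW FILTERING qualifier. The sketch below, with illustrative names and endpoints, shows both sides.

    from pymongo import MongoClient
    from cassandra.cluster import Cluster
    from cassandra import InvalidRequest

    mongo = MongoClient("mongodb://localhost:27017")
    # Runs regardless of indexes -- convenient, but potentially a full scan.
    list(mongo["appdb"]["users"].find({"email": "alice@example.com"}))

    scylla = Cluster(["127.0.0.1"]).connect("appdb")
    try:
        # email is not part of the primary key, so ScyllaDB refuses the query.
        scylla.execute("SELECT * FROM users WHERE email = 'alice@example.com'")
    except InvalidRequest as exc:
        print("rejected:", exc)

    # The explicit opt-in: you accept that this may scan a lot of data.
    scylla.execute("SELECT * FROM users "
                   "WHERE email = 'alice@example.com' ALLOW FILTERING")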

Jacob summed up the key differences from a development perspective:

“MongoDB favors flexibility over performance. It’s easy to interact with, and it will not get in your way. But it will have impacts on performance — impacts that are fine for some workloads but unacceptable for others. On the other hand, ScyllaDB favors consistent performance over versatility. It looks a bit more fixed and a bit more rigid on the outside. But once again, that’s for your own good so you can have consistent performance, operate well and interact well with the system. In my opinion, this makes a real difference when you have workloads that are latency- and performance-sensitive.”

It’s important to note that even queries that follow performance best practices will behave differently on MongoDB than on ScyllaDB. No matter how careful you are, you won’t overcome the performance penalty that stems from fundamental architectural differences.

Together, ScyllaDB and MongoDB Are a Great NoSQL Combo

“It’s not a death match. We are happy users of both MongoDB and ScyllaDB,” Jacob said.

Numberly selects the best database for each use case’s technical requirements.

At Numberly, MongoDB is used for two types of use cases:

  • Web backends with REST APIs and possibly flexible schemas.
  • Real-time queries over unpredictable behavioral data.

For example, some of Numberly’s applications get flooded with web tracking data that their clients collect and send (each client with their own internally developed applications). Numberly doesn’t have a way to impose a strict schema on that data, but it needs to be able to query and process it. In Jacob’s words, “MongoDB is fine here. Its flexibility is advantageous because it allows us to just store the data somewhere and query it easily.”

ScyllaDB is used for three types of use cases at Numberly:

  • Real-time latency-sensitive data pipelines. This involves a lot of data enrichment where there are multiple sources of data that need to be correlated, in real time, on the data pipelines. According to Jacob, “that’s tricky to do.” He said that “you need strong latency guarantees to not break the SLAs [service-level agreements] of the applications and data processes which your clients rely on down the pipe.”
  • Mixed batch and real-time workloads. Numberly also mixes a lot of batch and real-time workloads in ScyllaDB because it provides the best of both worlds (as Numberly shared previously). “We had Hive on one path and MongoDB on the other,” said Jacob. “We put everything on ScyllaDB and it’s sustaining Hadoop-like batch workloads and real-time pipeline workloads.”
  • Web backends using GraphQL, which imposes a strict schema. Some of Numberly’s web backends are implemented in GraphQL. When working with schema-based APIs, it makes perfect sense to have a schema-based database with low latency and high availability.

“A lot of our backend engineers, and frontend engineers as well, are adopting ScyllaDB,” said Jacob. “We see a trend of people adopting ScyllaDB, more and more tech people asking ‘I have this use case, would ScyllaDB be a good fit?’ Most of the time the answer is ‘yes.’ So ScyllaDB adoption is growing. MongoDB adoption is flat, but MongoDB is certainly here to stay because it has some really interesting features. Just don’t go as far as to create a MongoDB sharded cluster, please!”

Bonus: More Insights

Jacob is an extremely generous contributor to open source communities, with respect to both code and conference talks. See more of his contributions at https://ultrabug.fr/





Podcast: Hiring and Growing Great Site Reliability Engineers

MMS Founder
MMS Narayanan Raghavan

Article originally posted on InfoQ. Visit InfoQ

Transcript

Shane Hastie: Good day, folks. This is Shane Hastie for the InfoQ Engineering Culture Podcast. Today I’m sitting down with Narayanan Raghavan. Narayanan is the Senior Director for Site Reliability Engineering for Managed Services at Red Hat. Narayanan, welcome. Thanks for taking the time to talk to us today.

Narayanan Raghavan: Thank you for having me.

Shane Hastie: The way I typically start with my guests is who’s Narayanan?

Introductions [00:40]

Narayanan Raghavan: Like you mentioned, my current role is Senior Director of Site Reliability Engineering for Managed Services here at Red Hat. I’ve been with Red Hat 15 plus years. It’s a long time. Had different roles at Red Hat. Ended up in this space and this role as Red Hat was starting the Managed Services effort. And this was one of my initial challenges to say, “Come on in and build an SRE organization at Red Hat.” And that’s what got me interested in this space to say it’s an opportunity to build a new space, get into a new space, take on a new challenge, and been in this space for seven plus years and still enjoying it every day. Been at Red Hat 15 plus years like I mentioned, and every day is a new day, new challenge, a lot of learning. So it’s been good.

Shane Hastie: SRE, relatively new, probably seven years plus. Red Hat would’ve been fairly early in that. Why does it matter? Why did this emerge? Why wasn’t it just something we did?

Defining Site Reliability Engineering [01:38]

Narayanan Raghavan: First off, let’s take a step back and go, what really is SRE, site reliability engineering? Now you look at the Google definition, Google came out with their SRE book about seven plus years ago. They really talk about it being a team that balances the risk of unavailability with rapid innovation, ensuring that you’re focused on building systems that are automatic. Or, put differently, when you take a bunch of software engineers and put them in an infrastructure-specific role, how do they think about it? How do they approach it? At a high level, in my mind, it really boils down to: how do you focus on the reliability, scalability, security and performance of systems at scale? And for us, from an OpenShift perspective or managed services perspective, we have to think about scale cutting across different cloud providers, different hyperscalers. And fundamentally, how do we make platform systems boring so your application developers can focus on what they do best, adding business value, focusing on their applications, et cetera, more so than thinking about the underlying system.

Shane Hastie: What does it take to be a good site reliability engineer?

What makes a good site reliability engineer [02:48]

Narayanan Raghavan: That’s a great question. I want to answer that two ways. First off, first and foremost, I’ll say people. You need good engineers, you need good people to make site reliability engineering come to life. The second big piece is culture, because you’re not building a site reliability engineering organization with one person, you’re building it with a group of people. And when a group of people come together, the culture that you put in place, that’s what makes it tick. The individuals are the key, the culture is the fodder, so to speak, to get that team going. And the type of hiring you do, the type of engineer that you hire, goes a long way. Then I generally go, I’m looking for people that are software engineers with a systems mindset or systems engineers who have a software development background as well, because that mix is critical for an SRE, as you think about scale, as you think about reliability, continuity, et cetera.

But I’m also not looking for that perfect fit, not looking for someone who checks every single box on my list of things because that person, they might be great to start off with, but that person is going to get bored. That person is going to, two months down the road, probably look for other opportunities. So I’m looking for people that are not the perfect fit in many ways. I’m looking for people that are eager to learn, that have that curiosity to pick up new things and jump into a challenge. You might have an escalation, an incident or what have you. I’m going to explore and try and understand and debug what’s happening. So being willing and able to dig into the weeds, being willing and able to have a conversation about it without blaming people, that’s important. And the cultural aspect plays a big role in that.

The importance of a trust and assuming positive intent [04:48]

So giving people the confidence, for example, to say it’s okay to fail. And one of the first things I did when I started building this group was set some team principles to say, “First and foremost, it is okay to fail. It’s important we learn from our failures and we don’t repeat the failures, but it’s okay to fail.” I can’t tell you how many times I’ve apologized to stakeholders to say, “Whoop, sorry, we’ve made a mistake. We learned from it. It’s not going to happen again.” But being able to show the human side is important. It’s important because SRE is a high stress role. You’re dealing with incidents and escalations, crisis management essentially. So being okay to fail was our first principle. Our second principle was assuming positive intent. People are trying to do what’s right for the organization, the company, et cetera. It’s okay to ask questions and sometimes you might have a question where the answer might be confidential.

I can’t always share, but assume that I am doing what’s best for the team, that my manager and his manager, et cetera, all the way up to our CEO, doing what’s right for the company. That assuming that positive intent absolutely matters, especially in the space that SRE plays. Because SREs are going to be interacting with our customers, partners, cloud providers, internal teams. So you get to play in a space where there are multiple stakeholders, multiple people that are impacted, et cetera. The third principle is starting with trust and extending that trust. Very similar to assuming positive intent, except this is all about encouraging curiosity, asking questions. Let’s not make assumptions because when you’re trying to manage systems at scale, last thing I want to do is make some assumptions that somebody did something. I’m going to ask that question, but also trust that somebody else is doing the job the right way. So we’re all swimming in the same direction.

Disagree and commit [06:50]

The fourth principle for me is disagree and commit. And this is true with any software development. I can build something a thousand different ways, always argue about one way or another. Let’s not get caught up in analysis paralysis. Give people the opportunity to pick an option, make the changes. If it works, great. If it doesn’t work, remember that it is okay to fail. And remember that we’re all doing this with positive intent. So we trust each other, we do it with positive intent. If we fail, these are just bits and bytes, let’s rearrange them. Let’s acknowledge that we tried something, we learned from it, we’re going to take a different approach. So disagreeing and committing versus analysis paralysis is so important because that’s what allows us to keep the pace that’s important when development teams are pushing out change after change. How do you support that while also keeping systems reliable?

And then the last thing that bleeds out of that is communication. Communicate, communicate, communicate. I’d rather we over-communicate and step on each other’s toes versus not. I’d rather we build a culture of feedback where people are open to giving and receiving that feedback, for our own development too. I tell my team this: we spend more time with each other, either virtually or physically, than we spend with our significant others. So why erect those barriers? Why not lower those barriers and learn from each other, give each other that feedback? Because a team, in my mind, is a set of puzzle pieces coming together, and if they don’t fit right, you’re not going to enjoy that space.

So getting that team together, getting those puzzle pieces, in this case, people to come together through culture is absolutely important. You’re not expecting perfection, you’re expecting things will happen, incidents will happen, things will break. That’s okay. Learning from it, making mistakes, learning from it. Again, that’s life. And it’s important from an SRE perspective. Like I said, crisis management is important, so is culture and communication. So the soft skill pieces start to engage and that starts to matter as well.

Shane Hastie: How do we help the people on these teams build the resilience that they need?

Resilience comes from failure [09:11]

Narayanan Raghavan: I’m going to answer that in, I don’t want to say, a contradictory way, but resilience comes from failure. You’re not going to build that resilience if you haven’t failed before. And this is why that culture matters. This is why communication skills matter. This is why being calm during a crisis, that’s a skill and that’s a hard skill to find. And building that resilience starts from the top. When the leader shows that it’s okay to fail, I am human after all, and I am going to acknowledge that and I’m going to show that when I fail, I’m going to celebrate it with the team to say, “I’m not going to ding anybody just because they failed.” We’re not going to learn otherwise. And resilience is part of that learning that happens as a team comes together.

Shane Hastie: And where do you find these people?

Good people come from anywhere [10:04]

Narayanan Raghavan: Good people, from my perspective, come from anywhere. I’ve had people that have had a background in music, I’ve had people that have PhDs, and so good people can come from anywhere. Like I said, I’m not looking for the perfect fit. I am looking for potential. I’m looking for behaviors that show a learning mindset. I am looking for skills too. But many of the skills are teachable when somebody has a learning mindset and shows curiosity versus not. And part of that is the conversations we have. Part of that is the engagement we have with that individual. And some of these people are internal. Many of these people are internal and others are external hires. There are people from different organizations. Like I think I mentioned earlier, as an SRE function, we’re exposed to customers, partners, cloud providers, the infrastructure providers, communities, open source communities. So because we are exposed to so many different groups, that communication becomes important.

But you also start to look at: are there other groups that are exposed to different groups the way the SRE organization is? Sometimes there are people in the support organizations that have the skills with customers and understand the technology. You can find good people there. Sometimes there are good people in the engineering organization who are engaged, understand the stack and think about success of the entire stack versus success for just a layer of the stack. When somebody’s invested in “I am in it for the success, the business outcome,” versus it being very easy to say “it’s a networking problem,” when somebody’s exhibiting those types of behaviors, usually for me that’s a good indicator that these are people that have the curiosity, a learning mindset and take accountability for business outcomes, for the success of the product. They’re usually the ones that I’m looking for and trying to pull into the SRE space itself, because I can then invest in them. They may not be a perfect fit, but I invest in them and then they start to invest back into the company itself.

Shane Hastie: And that segues a bit into how do you keep these people?

Retaining people needs an environment where they can learn and grow [12:25]

Narayanan Raghavan: So hiring people is hard enough; keeping people is harder still. A lot of this again comes down to culture. It also comes down to giving people the opportunity to grow. Engineers are a curious bunch. Feeding that curiosity is important. Not just important, it’s vital. Making sure that we’re giving people opportunities to grow and learn becomes fodder for people to stay interested in wanting to learn and grow. From my perspective, even within my own teams, for example, growth is not just vertical growth. It’s not just promotions. It’s also horizontal growth. So if I can wake somebody up at 2:00 AM and they can, with one eye closed, solve a problem, I start to question, “What are you learning? You know everything like the back of your hand, you’re not learning anything.” So my ask, my challenge to them, is to go pick a different space.

And sometimes that means letting go of somebody who’s really good, because that’s what’s right for that individual. But when people see that you’re actually caring for their growth, for their development, they want to stay, because they see that as a leader, as a manager, you are invested in their success. And for me, that desire to learn translates to what sort of opportunities I can give them. What sort of projects can I put them on? Or, what sort of training can I have them take? And oftentimes engineers say, “Already busy. I can’t afford to take time off and my job is keeping me plenty busy.” My question back to the engineers is, “Does this mean you don’t take time off? Does this mean you don’t take paid time off and be with your family for a week and enjoy that time? Are you constantly working 24/7, 365?” The answer should be no.

And then I ask them the follow-up question is, “If you are taking time off, what happens when you are off? Your teammates step in, they cover for you, they help, they engage, et cetera. Why aren’t you doing the same thing with development? Why aren’t you treating development as a, I am going to take development time. I’m going to declare it upfront to my team, to my manager, et cetera, to say, the month of May, I’m going to be off for a week.” Great, take the time, focus on your development and then come back and put that knowledge that you’ve gained to use, because your team has supported you while you were out in training or what have you. Now you are coming back with additional knowledge to come and contribute back to the team.

So as a manager, that’s the investment. I am promising and committing to making sure that my team is growing and learning and has interesting work. And as a result, the team is committing back to saying, “I’m bringing this knowledge back into the team to think about automation, to think about scale, to think about reliability, think about building systems that can sustain failures, et cetera.”

Shane Hastie: Quite a lot of our audience are new to the leadership responsibilities. We take the best engineer and we promote them, and we give them no management training and they become the worst leader. What advice would you have for people who are new in this leadership responsibility to avoid becoming the worst leader?

Advice for new leaders [15:47]

Narayanan Raghavan: That’s a great question. I think first and foremost is acknowledging your humanity. Just because you’re a manager, a newly promoted manager or what have you, or an experienced manager, doesn’t make you right all the time. In fact, it doesn’t make you right most of the time. Acknowledging that upfront to say, “This is my first time stepping into a management role. I’m going to be making mistakes. I need your help, the team’s help, to learn because this is a new space for me. Help me learn. And I’m here to support you as well. So I am going to make sure that you are given the opportunities to learn and grow yourself, but without you helping me learn, I’m going to stumble. And as a new manager, I want to do what’s right for the team.” So I think calling that out upfront to say I am human after all, I’m going to make mistakes, that is important, if not vital, in my mind, because you need to gain your team’s trust.

And trust also starts with showing vulnerability; being vulnerable to your team, with your team, for your team is also important. The second part I’ll share here is, as a new manager, know that you’re not going to have all the answers. That is okay. I tell my teams this when we do reorgs: I’m going to make a mistake in a reorg, and we will do reorgs. Sometimes a reorg might make sense. Sometimes a reorg might not make sense. That is okay, because one of the team principles is that it is okay to fail. The analogy I give to engineers is: I wish I had a development environment to make a reorg, try it out before I roll it out to production. But I don’t. So when I make changes, I have to make them in production. I have to make them with real people, with feelings, with desires and aspirations, et cetera.

I don’t have a development environment, so give me some slack to say if I make a mistake or if something does not make sense, talk to me, call it out to me. But my commitment to you is when I make those changes, I’ll tell you why I’m making those changes. Because the why matters to people to say, “This has been my thinking, this is why I’m making the change.” The what then follows on, the how then follows on. I’m less concerned about that. My focus is a lot more on why am I making this change? What was my thinking? What were the drivers? And being able to do that as a manager I think is fundamental.

Shane Hastie: Some great advice and some really deep points for people to ponder. So Narayanan, if our audience want to continue the conversation, where do they find you?

Narayanan Raghavan: I’m not as active on social media, but I’d say LinkedIn is probably the best way to have a conversation. My Twitter handle is “bign_”. Like I said, I’m not very active on social media, so LinkedIn is probably the best option.

Shane Hastie: Narayanan, thank you so much for taking the time to talk to us today.

Narayanan Raghavan: Thank you for having me again. This has been great.




NoSQL Market Expected to Reach USD 22,087 Million by 2026 | Top Players such as MarkLogic, MongoDB and Objectivity

MMS Founder
MMS RSS

Posted on nosqlgooglealerts. Visit nosqlgooglealerts


The NoSQL database industry is driven by an increase in demand for e-commerce and online applications, which adds to total market demand.

PORTLAND, OR, UNITED STATES, May 12, 2023 /EINPresswire.com/ — Allied Market Research published a new report, titled, “The NoSQL Market Expected to Reach USD 22,087 Million by 2026 | Top Players such as – MarkLogic, MongoDB and Objectivity.” The report offers an extensive analysis of key growth strategies, drivers, opportunities, key segments, Porter’s Five Forces analysis, and the competitive landscape. This study is a helpful source of information for market players, investors, VPs, stakeholders, and new entrants to gain a thorough understanding of the industry and determine the steps to be taken to gain competitive advantage.

The NoSQL market size was valued at USD 2,410.5 million in 2018, and is projected to reach USD 22,087 million by 2026, growing at a CAGR of 31.4% from 2019 to 2026.

Request Sample Report (Get Full Insights in PDF – 266 Pages) at: https://www.alliedmarketresearch.com/request-sample/640

An increase in unstructured data, demand for data analytics, and a surge in application development activities across the globe propel the growth of the global NoSQL market. North America accounted for the highest market share in 2018, and will maintain its leadership status during the forecast period. Demand for online gaming and content consumption from OTT platforms increased significantly, so the demand for NoSQL increased for handling huge amounts of data.

The NoSQL market is segmented on the basis of type, application, industry vertical, and region. By type, it is categorized into key-value store, document database, column-based store, and graph database. On the basis of application, it is divided into data storage, mobile apps, data analytics, web apps, and others. Further, the data storage segment is sub-segmented into distributed data depository, cache memory, and metadata store. Depending on industry vertical, it is categorized into retail, gaming, IT, and others. By region, the NoSQL market is analyzed across North America, Europe, Asia-Pacific, and LAMEA.

Access full report summary at: https://www.alliedmarketresearch.com/NoSQL-market

Based on type, the key-value store segment accounted for more than two-fifths of the total market share in 2018, and is estimated to maintain its lead position through 2026. Conversely, the graph database segment is expected to grow at the fastest CAGR of 34.2% during the forecast period.

Based on vertical, the IT sector contributed to the highest market share in 2018, accounting for more than two-fifths of the total market share, and is estimated to maintain its highest contribution during the forecast period. However, the gaming segment is expected to grow at the highest CAGR of 34.8% from 2019 to 2026.

If you have any questions, Please feel free to contact our analyst at: https://www.alliedmarketresearch.com/connect-to-analyst/640

Based on region, North America accounted for the highest market share in 2018, contributing to more than two-fifths of the global NoSQL market share, and will maintain its leadership status during the forecast period. On the other hand, Asia-Pacific is expected to witness the highest CAGR of 35.5% from 2019 to 2026.

Leading players of the global NoSQL market analyzed in the research include Aerospike, Inc., DataStax, Inc., Amazon Web Services, Inc., Couchbase, Inc., Microsoft Corporation, MarkLogic Corporation, Google LLC, Neo Technology, Inc., MongoDB, Inc., and Objectivity, Inc.

Enquiry Before Buying: https://www.alliedmarketresearch.com/purchase-enquiry/640

Covid-19 Scenario:

● With lockdowns imposed by the governments of many countries, demand for online gaming, content consumption from OTT platforms, and activity on social media increased significantly. So, the demand for NoSQL increased for handling huge amounts of data.

● With organizations adopting a “work from home” strategy to ensure continuity of business processes, NoSQL databases would be needed to store and retrieve data.

Procure Complete Report (266 Pages PDF with Insights, Charts, Tables, and Figures) at:
https://www.alliedmarketresearch.com/checkout-final/668c39800f8d254d9387c591facde465

Thanks for reading this article; you can also get individual chapter-wise sections or region-wise report versions, such as North America, Europe, or Asia.

If you have any special requirements, please let us know and we will offer you the report as per your requirements.

Lastly, this report provides comprehensive market intelligence. The report structure has been designed to offer maximum business value. It provides critical insights into the market dynamics and will enable strategic decision-making for existing market players as well as those willing to enter the market.

About Us:

Allied Market Research (AMR) is a market research and business-consulting firm of Allied Analytics LLP, based in Portland, Oregon. AMR offers market research reports, business solutions, consulting services, and insights on markets across 11 industry verticals. Adopting extensive research methodologies, AMR is instrumental in helping its clients to make strategic business decisions and achieve sustainable growth in their market domains. We are equipped with skilled analysts and experts and have a wide experience of working with many Fortune 500 companies and small & medium enterprises.

Pawan Kumar, the CEO of Allied Market Research, is leading the organization toward providing high-quality data and insights. We are in professional corporate relations with various companies, and this helps us dig out market data that helps us generate accurate research data tables and confirm utmost accuracy in our market forecasting. Every data point presented in the reports published by us is extracted through primary interviews with top officials from leading companies in the domain concerned. Our secondary data procurement methodology includes deep online and offline research and discussion with knowledgeable professionals and analysts in the industry.

David Correa
Allied Analytics LLP
+1-800-792-5285
email us here
