Article: Step One to Successfully Building Your Platform: Building It Together

MMS Founder
MMS Lee Ditiangkin

Article originally posted on InfoQ. Visit InfoQ

Key Takeaways

  • When building your case for an IDP to the business, you need to frame your argument in terms the business will understand and care about
  • Time to Market is one of the most important metrics to use when framing up your business case
  • When talking about the operational efficiency an IDP can bring, tailor the message to the outcomes the business needs now
  • Organizations that don’t have an IDP need one full-time equivalent in charge of operations for every 5-7 person development team.
  • Measuring the impact of your platform is critical to proving the investment was worthwhile. Measuring market share, net promoter score, and onboarding time are all great places to start.

For the last two decades, I have built Internal Developer Platforms (IDPs) at companies like Apple, IBM, and Atlassian. One recurring question I see from practitioners in the platform engineering community is how to make the case to management for building an IDP. How can you create the urgency to build an IDP now and not in the next two quarters or two years? Here are some approaches I wish I had known about when starting my career.

First, you need to understand what your top management cares about. What are their current priorities? What is top of mind? Here are a few questions to ask and possibilities to consider:

  • They want to drive digital transformation across the engineering organization, modernize the stack and roll out the shiny new tech they read about on “cio.com”.
  • They want to build new, faster revenue streams to create business value.
  • They are concerned about mitigating security risks that damage the company’s reputation.
  • They are optimizing for operational efficiency, managing spending, or even cutting costs, e.g., cloud costs.
  • They are working to retain and attract the best talent in the market. 

Whatever the main focus is, make sure you understand it and align your platform initiative to management priorities. Build a business case that reflects the cost of waiting to build an IDP, but do it the smart way. As Galo Navarro explained, “We need this platform engineering initiative now; otherwise, we are doomed” is not a great approach when reaching out to management. They hear doomsday stories at least 20 times a day. Here are some arguments that will be effective with management:

Accelerate digital transformation

Imagine a large-scale enterprise that tries to move to a cloud-native setup, but the transition is unsuccessful because legacy systems still need to be supported. For example, Kubernetes adoption works well for small teams starting new projects, but things become incredibly complex at scale. Despite organizations’ expectations, Kubernetes is not the solution to all engineering challenges.

Some organizations underestimate the additional cognitive load Kubernetes adoption puts on developers, which leads to lower developer productivity. Some even think Kubernetes is a platform, but as Manuel Pais pointed out, it is only the foundation. In a case like this, you could argue: “We should build an IDP to provide abstraction layers on top of Kubernetes, which will mitigate the risk of our digital transformation failing.”

Build new and faster revenue streams

Many engineers don’t think about business metrics, but they’re important to incorporate into your argument for your IDP. One of the most important metrics is Time to Market (TTM), which measures the time from starting development of a new product or feature to entering the market and generating revenue. The longer your TTM, the greater the investment each new product and feature requires.

If shortening TTM is one of your top management’s goals, you could create a case like this: “Building an IDP could shorten our TTM by up to 40% by increasing developer productivity. With an IDP, developers can self-serve all the tech and tools they need, reducing time wasted waiting on Ops. IDPs also make it easier to make architectural changes to applications, which you need for rapid iterations. The result is a boost in productivity for developers and operations, enabling us to react faster to the competitive landscape.”

Mitigate security risks

Security and data breaches like those that recently impacted Uber and Slack have increased management’s awareness of security issues. While engineers focus on technical problems that cause risk, management will be more concerned with the damage to brand reputation and loss of market share that results from breaches. If security risks are a top priority for management, you could build a case like this: “An IDP will help us mitigate security risks and their consequences. A well-designed IDP enforces security best practices by driving standardization and detects security vulnerabilities automatically if security tools are integrated smartly.”

Optimize for operational efficiency

Almost every developer tool aims to optimize for operational efficiency, so if this is the case you want to make, you need to add some meat to the bone. If your management is worried about unnecessary outages or downtime due to manual, half-scripted workflows, your IDP story should highlight the increased reliability that comes from building a platform. Developers want fewer outages because it means fewer late-night calls and interruptions. The business wants fewer outages because it translates to better customer retention, and high availability is a strong selling point.

If high cloud costs are an issue, you could argue that an IDP makes it easier to manage your setup more efficiently. Small changes, like the ability to detect and pause unused environments, can bring down costs. Large enterprises might also benefit from cloud vendor-neutral platforms, enabling them to easily switch between different cloud vendors. In general, identify the most significant inefficiencies in your organization – developer workflows, key person dependencies, ops bottlenecks, etc. – and tailor your argument accordingly.

Attract and retain top talent

IDPs also help engineering organizations attract and retain top talent by improving the developer experience. High cognitive load causes DevOps burnout and high turnover. According to the recent Gartner report “A Software Engineering Leader’s Guide to Improving Developer Experience” (behind paywall), investing heavily in better developer experience is the best way to safeguard developers’ creative work and the key to their productivity. This results in happier engineers and helps your organization innovate faster.

Calculate an ROI for your Internal Developer Platform

If you are asking for a bigger budget to build your IDP, you’ll need to come up with an ROI calculation. In my experience, organizations that don’t have an IDP need one FTE (full-time equivalent) in charge of operations for every 5-7 person development team. Think about how quickly this compounds across organizations with dozens (or hundreds) of teams.
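To make the math concrete, here is a minimal back-of-the-envelope sketch; the team count, FTE cost, platform team size, and expected reduction in ops effort are illustrative assumptions, not figures from this article:

// Rough ROI sketch for an IDP business case (all inputs are hypothetical).
const devTeams = 40;
const opsFtePerTeamToday = 1;      // roughly 1 ops FTE per 5-7 person team
const opsFteCostPerYear = 150_000; // illustrative fully-loaded cost
const platformTeamSize = 6;        // illustrative platform team
const opsReduction = 0.5;          // illustrative: half the per-team ops effort removed

const costToday = devTeams * opsFtePerTeamToday * opsFteCostPerYear;                      // 6,000,000
const costWithIdp = costToday * (1 - opsReduction) + platformTeamSize * opsFteCostPerYear; // 3,900,000
console.log(`Yearly ops saving: ${costToday - costWithIdp}`);                              // 2,100,000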

Platform metrics

In addition to coming up with an ROI calculation, you should also consider the metrics you expect to improve once the platform is built. I recommend onboarding one lighthouse team first. Then benchmark some of their metrics against other teams that haven’t been onboarded to the platform. This will give you a good sense of how the new IDP fits the developers’ needs and allow you to measure productivity gains.

The first metrics that come to mind will be the classic DORA metrics: lead time, deployment frequency, change failure rate, and MTTR. You will also want a dashboard on top of your platform where you can observe these metrics constantly. Though it can be more difficult, you should also track time freed up for ops and reduced developer waiting times.
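As a rough illustration of how such a dashboard could derive a few of these metrics from raw deployment events (the event shape below is a made-up example, not any specific tool’s API):

// Minimal sketch: deployment frequency, median lead time, and change failure rate.
interface Deployment {
  commitTimestamp: number; // when the change was committed (ms since epoch)
  deployTimestamp: number; // when it reached production (ms since epoch)
  failed: boolean;         // whether the deployment caused a failure
}

function doraSnapshot(deployments: Deployment[], periodDays: number) {
  const leadTimesHours = deployments
    .map((d) => (d.deployTimestamp - d.commitTimestamp) / 36e5)
    .sort((a, b) => a - b);
  const medianLeadTimeHours = leadTimesHours[Math.floor(leadTimesHours.length / 2)] ?? 0;
  const failures = deployments.filter((d) => d.failed).length;
  return {
    deploymentFrequencyPerDay: deployments.length / periodDays,
    medianLeadTimeHours,
    changeFailureRate: deployments.length ? failures / deployments.length : 0,
  };
}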

The platform should also meet specific metrics like SLOs (service level objectives) based on SLAs (service level agreements). As Chris Ford and Cristóbal García García pointed out, your platform needs to be a reliable product. Otherwise, developers won’t trust it, will push back, and will find ways to avoid it or work around it. For this reason, the platform team should agree on ambitious SLOs before building the IDP and always keep reliability as a top priority.

Once the platform is rolled out across the entire engineering organization, consider platform adoption metrics. These will tell you whether the product (the platform) fits developers’ needs and whether you are on the right track with onboarding, training, and evangelizing the platform. Key metrics would be the following (a small calculation sketch follows the list):

  • Market share or the percentage of developers using your platform. If your IDP is optional, there will always be alternatives. For example, developers can access the underlying technologies directly and ignore the IDP.
  • Net Promoter Score (NPS) or the extent to which users recommend the platform to their peers.
  • Onboarding time for new developers. An IDP should accelerate the onboarding of new developers because they no longer need to learn the underlying technology. If your onboarding time does not go down, there is something wrong. Either your IDP is crap, or your new developers aren’t using it.
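As a minimal sketch of the first two metrics (NPS is conventionally the percentage of promoters, scores 9-10, minus the percentage of detractors, scores 0-6; the survey scores and developer counts below are illustrative):

// NPS on a 0-10 survey scale: % promoters (9-10) minus % detractors (0-6).
function netPromoterScore(scores: number[]): number {
  const promoters = scores.filter((s) => s >= 9).length;
  const detractors = scores.filter((s) => s <= 6).length;
  return Math.round(((promoters - detractors) / scores.length) * 100);
}

// "Market share": share of developers actively using the platform.
function platformMarketShare(activeUsers: number, totalDevelopers: number): number {
  return (activeUsers / totalDevelopers) * 100;
}

console.log(netPromoterScore([10, 9, 8, 7, 3, 10, 6])); // 14
console.log(platformMarketShare(120, 200));             // 60 (%)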

In the end, you will need to tie these metrics back to the storyline you initially drafted for your top management (faster TTM, more innovation, operational efficiency, and retaining talent). In addition to defining and tracking quantitative metrics, you should always continue doing qualitative research. Constantly run interviews with your customers and the developers. Michael Galloway shared some great examples here.

Building an IDP means investing in creativity and innovation

Investing in creativity and innovation has never been more urgent than now. Your management is aware of the potential financial crisis coming next year. Companies that invest in shorter innovation cycles and faster TTM will be better placed to meet customer demands and gain an edge over the competition. It’s up to you to demonstrate how an IDP will get your organization there: by reducing cognitive load on developers, enabling them to self-serve what they need to run their apps and services, driving standardization by design, and improving the developer experience.

This piece will help you emphasize the urgency of building an IDP to your management. If your management says “yes” and approves your budget (I’m keeping my fingers crossed for you on both counts), the journey doesn’t end there. Be careful to mitigate the most common fallacies in platform engineering, and start by building the house, not the front door. To put it plainly: think about the architectural design of your Internal Developer Platform before you start building, and always remember to prioritize developer experience for your developer community.



OpenAI Unleashes ChatGPT and Whisper APIs for Next-Gen Language Capabilities

MMS Founder
MMS Daniel Dominguez

Article originally posted on InfoQ. Visit InfoQ

OpenAI has announced that it’s now letting third-party developers integrate ChatGPT and Whisper into their apps and services via API, offering access to AI-powered language and speech-to-text capabilities. Compared to the company’s existing language model offerings, these APIs make it easier for businesses to integrate ChatGPT and Whisper into their platforms.

The new ChatGPT model, also known as gpt-3.5-turbo, costs $0.002 per 1,000 tokens, which is 10 times cheaper than the GPT-3.5 models currently in use. It is also the best model for many situations outside of conversation. GPT models typically consume unstructured text, supplied to the model as a series of tokens. ChatGPT models instead consume a series of messages together with their associated metadata.
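As a rough sketch of what such a request looks like (endpoint and payload shape per OpenAI’s public chat completions API; the prompt content and the Node 18+ setup are illustrative):

// Minimal sketch of a chat completion request with gpt-3.5-turbo.
// Each message carries a role ("system", "user", "assistant") plus its content.
const response = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
  },
  body: JSON.stringify({
    model: "gpt-3.5-turbo",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "Summarize what an API is in one sentence." },
    ],
  }),
});
const data = await response.json();
console.log(data.choices[0].message.content);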

Modern language processing techniques can produce human-like answers to inputs expressed in natural language. The model is an effective tool for creating conversational interfaces because it is capable of comprehending linguistic nuance, including idioms, slang, and colloquialisms. With ChatGPT, developers can build chatbots, virtual assistants, and other conversational interfaces that respond to users in a tailored and human-like manner. The most recent ChatGPT model is now substantially more affordable and open to third parties thanks to the dedicated API.

OpenAI has also unveiled a new API for Whisper, its speech-to-text technology. According to the company, you can use it to translate or transcribe audio for $0.006 per minute. The Whisper model itself is open source, so you can also run it on your own hardware without paying anything.
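A similarly minimal sketch of a transcription request (endpoint and fields per OpenAI’s public audio API; the file name is illustrative, and this assumes Node 18+ for the global fetch, FormData, and Blob):

// Minimal sketch: transcribe a local audio file with the Whisper API.
import { readFile } from "node:fs/promises";

const form = new FormData();
form.append("model", "whisper-1");
form.append("file", new Blob([await readFile("meeting.mp3")]), "meeting.mp3");

const response = await fetch("https://api.openai.com/v1/audio/transcriptions", {
  method: "POST",
  headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
  body: form,
});
const { text } = await response.json();
console.log(text); // the transcription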

Additionally, OpenAI is introducing certain policy adjustments that it claims are a result of developer input. One significant change is that it will no longer train its models on data provided over the API unless users specifically consent to that use.

Moreover, OpenAI is now providing dedicated instances for customers who desire greater control over the particular model version and system performance. Requests are typically processed using computing resources that are shared with other users and are charged separately. The API is hosted on Azure, and with dedicated instances, developers can purchase a time-limited allocation of compute resources specifically designated for handling their queries.

AI can provide incredible opportunities and economic empowerment to everyone, and the best way to achieve that is to allow everyone to build with it, says OpenAI.

The launch of these APIs is expected to have a significant impact on the developer community, as it provides new tools and capabilities for building more advanced and sophisticated language applications.



New JavaScript Incremental Computing Library Delivers Better UX for Single-Page Apps

MMS Founder
MMS Bruno Couriol

Article originally posted on InfoQ. Visit InfoQ

The team behind the collaborative whiteboard tldraw recently published a library that brings incremental computing to JavaScript. Signia seeks to overcome fundamental performance limitations of tldraw’s chosen UI and reactive framework and ultimately deliver more responsive interactive apps with a better user experience. Signia can, however, be used standalone or in conjunction with any UI framework.

The tldraw team explained their motivation as follows:

tldraw is a collaborative digital whiteboard built with React. It has an unusually active client state, with lots of in-memory data that changes often and much of which is derived from other data.
We spent several months building tldraw’s new version using a popular reactive state framework. However, we quickly ran into two big problems that made it impossible for us to scale to the number of shapes and multiplayer users that we knew browsers could handle.
[…] Both issues were fundamental limitations of the framework’s reactivity model.

Derived data would be recomputed every time its data dependencies changed. Those recomputations could be expensive for large derived collections. In some contexts, derived data may also be recomputed when it could instead be retrieved from a cache.

Highly interactive single-page web applications like tldraw generally consider the user experience to be a key component of their value proposition. The responsiveness to user inputs, itself a key component of the user experience, relates to the amount of computation that they trigger on the main thread.

A part of the computation occurs due to the framework at play. At some level of scale, the performance inefficiencies due to the framework may noticeably impact the user experience and must be addressed. Raph Levien, in his article Towards a unified theory of reactive UI, said:

React is most famous for introducing the “virtual DOM”. It works well for small systems but causes serious performance issues as the UI scales up. React has a number of techniques for bypassing full tree diffs, which is pragmatic but again, makes it harder to form a clean mental model.

Some developers may resort to reactive data libraries (e.g., jotai, recoil, zustand) to reduce the amount of unnecessary computation performed by the framework. They may alternatively resort to UI frameworks that already embed similar reactive data capabilities and perform fewer unnecessary computations by design.

Another part of the computation to perform relates to keeping dependent data synchronized with their dependencies. Efficiency can be gained by computing dependent data lazily (when needed) instead of eagerly (as soon as their dependencies change); once instead of every time one dependency changes (e.g., through topological sorting of the reactivity graph); or more efficiently. Incremental computing deals with computing dependent data faster.

As Denis Firsov and Wolfgang Jeltsch explained in their paper “Purely Functional Incremental Computing”:

Many applications have to maintain evolving data sources as well as views on these sources. If sources change, the corresponding views have to be adapted. Complete recomputation of views is typically too expensive. An alternative is to convert source changes into view changes and apply these to the views. This is the key idea of incremental computing.

Signia’s documentation gives an example of derived data that may benefit from incremental computation. Let’s assume an array arr of 10,000 values, and a derived variable y obtained as arr.map(f). When a new value val is pushed onto arr, a naïve way to recompute y is to rerun the map operation over the 10,001 values. With an incremental computing approach, f(val) is simply appended to the cached mapped array of 10,000 values. This leads to a single run of f vs. 10,001 runs in the case of the naïve approach. The incremental approach generalizes to filter, reduce, sort, and many other operations (see Self-adjusting computation, Umut A. Acar, 2005).
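A framework-agnostic sketch of the difference (hand-rolled code to illustrate the idea, not Signia’s actual API):

// Naïve recomputation: re-run f over all 10,001 elements on every change.
function recomputeNaive<T, U>(arr: T[], f: (x: T) => U): U[] {
  return arr.map(f); // O(n) work per change
}

// Incremental recomputation: reuse the cached mapped array and apply
// only the change that happened to the source (a push in this example).
function recomputeIncremental<T, U>(cachedMapped: U[], pushedValue: T, f: (x: T) => U): U[] {
  cachedMapped.push(f(pushedValue)); // a single call to f
  return cachedMapped;
}

const arr = Array.from({ length: 10_000 }, (_, i) => i);
const double = (x: number) => x * 2;
let y = recomputeNaive(arr, double);     // initial computation: 10,000 calls
arr.push(42);
y = recomputeIncremental(y, 42, double); // update: 1 call instead of 10,001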

Signia provides a reactive API that allows developers, among other things, to define atoms (independent data) and computed data (derived from atoms), together with the respective setters and getters:

import { computed, atom } from 'signia'

const firstName = atom('firstName', 'David')
firstName.set('John')
console.log(firstName.value)
firstName.update((value) => value.toUpperCase())
console.log(firstName.value)

firstName.set('John')
const lastName = atom('lastName', 'Bowie')
const fullName = computed('fullName', () => {
  return `${firstName.value} ${lastName.value}`
})

console.log(fullName.value)

To these reactive APIs, Signia adds incremental computing APIs that store a history of input changes. The API user can then incrementally compute the updated derived value from the cached derived value and the change history. Signia’s documentation provides an example that leverages the patch format of the immutable state library Immer. Changes are stored as operations (e.g., add, replace, remove) that are pattern-matched to the corresponding incremental computation (e.g., splice).
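The general shape of that pattern matching looks roughly like this (a hand-rolled illustration using Immer-style patches, not Signia’s actual API):

// Apply Immer-style patches to a cached derived (mapped) array incrementally.
interface Patch {
  op: "add" | "replace" | "remove";
  path: (string | number)[]; // e.g. [3] for index 3 of the source array
  value?: unknown;
}

function applyPatchesToMapped<T, U>(mapped: U[], patches: Patch[], f: (x: T) => U): U[] {
  for (const patch of patches) {
    const index = patch.path[0] as number;
    switch (patch.op) {
      case "add":
        mapped.splice(index, 0, f(patch.value as T)); // insert one newly mapped value
        break;
      case "replace":
        mapped[index] = f(patch.value as T);          // recompute a single element
        break;
      case "remove":
        mapped.splice(index, 1);                      // drop a single element
        break;
    }
  }
  return mapped;
}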

Incremental computing is not a new approach. JSON Patch can be used to avoid sending a whole document when only a part has changed. When used in combination with the HTTP PATCH method, it allows partial updates for HTTP APIs in a standards-compliant way. The D3 visualization library lets developers specify incrementally how to update a visualization on enter, update, and exit of input data.

Signia is however new in that it provides a generic JavaScript API for incremental computing. The trading firm Jane Street maintains a similar-minded OCaml library called Incremental. Yaron Minsky noted seven years ago:

Given that incrementality seems to show up in one form or another in all of these web frameworks, it’s rather striking how rarely it’s talked about. Certainly, when discussing virtual DOM, people tend to focus on the simplicity of just blindly generating your virtual DOM and letting the diff algorithm sort out the problems. The subtleties of incrementalization are left as a footnote.

That’s understandable, since for many applications you can get away without worrying about incrementalizing the computation of the virtual DOM. But it’s worth paying attention to nonetheless, since more complex UIs need incrementalization, and the incrementalization strategy affects the design of a UI framework quite deeply […] Thinking about Incremental in the context of GUI development has led us to some new ideas about how to build efficient JavaScript GUIs.

Signia is open-source software released under the MIT license. Feedback and contributions are welcome and should follow the contribution guidelines.



Cross-Industry Report Identifies Top 10 Open-Source Software Risks

MMS Founder
MMS Sergio De Simone

Article originally posted on InfoQ. Visit InfoQ

Promoted by Endor Labs and featuring contributions from over 20 industry experts, the new Endor Labs Station 9 report identifies the top operational and security risks in open-source software.

As Endor Labs lead security researcher Henrik Plate puts it, new applications make heavy use of open-source components and should take seriously any risks arising from their integration.

It’s well-known that open source is oftentimes more performant and secure than proprietary software, but it’s also clear that open source software comes as-is, without warranties of any kind, and any risk of using it being solely on downstream users.

Inspired by the OWASP Top Ten, a standard document for developers and web application security, the Endor Labs Station 9 report includes contributions from industry experts working at companies such as HashiCorp, Adobe, Palo Alto Networks, Discord, and others.

This report covers both operational and security issues to highlight the top 10 risks associated with the consumption of open source components, all leading to problems that can compromise systems, enable data breaches, undermine compliance or hamper availability.

The top risk in the report is, unsurprisingly, known vulnerabilities present in packages, followed by the possibility that legitimate packages are compromised through an attack on the distribution infrastructure in order to inject malicious code or resources. In third place come name confusion attacks, which rely on strategies like typo-squatting, brand-jacking, or combo-squatting to create malicious packages that may lead developers to inadvertently trust them.

While the possibility that open-source software contains known vulnerabilities is usually well-recognized in the industry, which also has many tools at its disposal to counter them, the other two risk factors are relatively new. They fall into the category of supply chain attacks and the industry is just starting to have its first tools to fight them, including Semgrep, Guarddog and others.

The next two risks in the list are unmaintained and outdated software. In both cases, there can be functional as well as security-related implications. For outdated software the fix is as straightforward as updating to the latest package version, although that might be troublesome in the face of accrued incompatibilities.

Other risks included in the list are license and regulatory risks; immature software, such as software that does not apply best practices or lacks unit tests; wrong versioning, leading to changes being introduced without developers being able to review or approve them; and under- or over-sized dependencies, i.e., packages providing very little or a great deal of functionality.

Those kinds of risks all fall into categories that seem more manageable with a traditional quality-oriented outlook, although they should not be overlooked in a comprehensive approach.

The Station 9 team aims to update the report regularly to reflect technological advances as well as any evolutions in the threat landscape.



Java News Roundup: JEP Updates, JReleaser 1.5, Spring Updates, Vert.x, Project Reactor, Ktor

MMS Founder
MMS Michael Redlich

Article originally posted on InfoQ. Visit InfoQ

This week’s Java roundup for February 27th, 2023 features news from OpenJDK, JDK 20, JDK 21, Spring Framework 6.0.6, Spring Boot 3.0.4, Spring Data 2022.0.3 and 2021.2.9, Spring Shell 3.1.0-M1, 3.0.1 and 2.1.7, Quarkus 2.16.4, Micronaut 3.8.6, Eclipse Vert.x 4.4.0, Project Reactor 2022.0.4, Apache Tomcat 9.0.73, Hibernate 6.2 CR3, JReleaser 1.5.0, Ktor 2.2.4 and Gradle 8.0.2.

OpenJDK

JEP 438, Vector API (Fifth Incubator), was quickly promoted from Draft to Candidate to Proposed to Target status for JDK 20 this past week. This JEP, under the auspices of Project Panama, incorporates enhancements in response to feedback from the previous four rounds of incubation: JEP 426, Vector API (Fourth Incubator), delivered in JDK 19; JEP 417, Vector API (Third Incubator), delivered in JDK 18; JEP 414, Vector API (Second Incubator), delivered in JDK 17; and JEP 338, Vector API (Incubator), delivered as an incubator module in JDK 16. JEP 438 proposes to enhance the Vector API to load and store vectors to and from a MemorySegment as defined by JEP 424, Foreign Function & Memory API (Preview). The review is expected to conclude on March 8, 2023.

JEP Draft 8303358, Scoped Values (Preview), was submitted by Andrew Haley and Andrew Dinn, both distinguished engineers at Red Hat. This JEP, formerly known as Extent-Local Variables (Incubator) and under the auspices of Project Loom, proposes to enable sharing of immutable data within and across threads. This is preferred to thread-local variables, especially when using large numbers of virtual threads. This draft evolves JEP 429, Scoped Values (Incubator), that will be delivered in the upcoming release of JDK 20.

Wei-Jun Wang, principal member of the technical staff at Oracle, has updated JEP Draft 8301034, Key Encapsulation Mechanism API, to include a major change that eliminates the DerivedKeyParameterSpec class in favor of placing fields in the argument list of the encapsulate(int from, int to, String algorithm) method. This draft proposes to: satisfy implementations of standard Key Encapsulation Mechanism (KEM) algorithms; satisfy use cases of KEM by higher level security protocols; and allow service providers to plug-in Java or native implementations of KEM algorithms.

JDK 20

JDK 20 remains in its release candidate phase with the anticipated GA release on March 21, 2023. Build 36 remains the current build in the JDK 20 early-access builds. More details on this build may be found in the release notes.

JDK 21

Build 12 of the JDK 21 early-access builds was also made available this past week featuring updates from Build 11 that include fixes to various issues. Further details on this build may be found in the release notes.

For JDK 20 and JDK 21, developers are encouraged to report bugs via the Java Bug Database.

Spring Framework

The release of Spring Framework 6.0.6 delivers new features such as: refine the invokeSuspendingFunction() method in the CoroutinesUtils class; deprecate the get(Context) method in favor of getExchange(ContextView) method in the ServerWebExchangeContextFilter class to better align with the deferContextual() and transformDeferredContextual() methods in the Mono class; and add missing @Nullable annotations to the overloaded format() methods in the LogMessage class. More details on this release may be found in the release notes.

The release of Spring Boot 3.0.4 ships with bug fixes, improvements in documentation and dependency upgrades such as: Spring Framework 6.0.6, Spring for Apache Kafka 3.0.4, Spring Data 2022.0.3, Project Reactor 2022.0.4 and Dropwizard Metrics 4.2.17. Further details on this release may be found in the release notes.

Versions 2022.0.3 and 2021.2.9 of Spring Data, both service releases, include bug fixes and upgrades to respective sub-project versions such as: Spring Data Commons 3.0.3 and 2.7.9; Spring Data Elasticsearch 5.0.3 and 4.4.9; Spring Data for Apache Cassandra 4.0.3 and 3.4.9; and Spring Data MongoDB 4.0.3 and 3.4.9. These two versions may be consumed with Spring Boot 3.0.4 and 2.7.x, respectively.

Versions 3.1.0-M1, 3.0.1 and 2.1.7 of Spring Shell were released this past week that address common issues such as: an error in which a negative number is one of the elements within an array that is passed into the @ShellOption annotation; a situation in which an implementing class of the Converter interface isn’t being called, possibly due to a regression in fixes made for options handling; and a situation in which the getOptions() method declared in the CommandRegistration interface always rebuilds options, making it difficult to discern the correct instance. Each version is built on Spring Boot 3.1.0-M1, 3.0.3 and 2.7.9, respectively. More details on these releases may be found in the release notes for version 3.1.0-M1, version 3.0.1 and version 2.1.7.

Quarkus

Red Hat has released Quarkus 2.16.4.Final featuring: add logging to the CompiledJavaVersionBuildStep class; propagate Quarkus-related failsafe system properties; provide more visibility for the OIDC connection error log messages; and return a null InputStream from REST Client when the HTTP server response returns status code 204. Further details on this release may be found in the changelog.

Micronaut

The Micronaut Foundation has released Micronaut 3.8.6 featuring bug fixes, improvements in documentation and updates to modules: Micronaut Security 3.9.3, and Micronaut AWS 3.10.9. More details on this release may be found in the release notes.

Eclipse Vert.x

Eclipse Vert.x 4.4.0 has been released with new features such as: a new implementation of OpenAPI using the latest JsonSchema API as a preview feature; support for using the io_uring interface of the Linux kernel; and enable TLS 1.3 by default and disable TLS 1.0/1.1. Further details on this release may be found in the release notes, deprecations and breaking changes, and complete list of new features.

Project Reactor

Project Reactor 2022.0.4, the fourth maintenance release, provides a dependency upgrade to reactor-netty 1.1.4.

Apache Software Foundation

The release of Apache Tomcat 9.0.73 features: a correction to a regression introduced in the fix for bug 66196 in which the HTTP headers and/or request line could get corrupted (one part overwriting another part) within a single request; provide a more appropriate HTTP server response (status codes 501, Not Implemented, rather than 400, Bad Request) when rejecting an HTTP request using the CONNECT method; and add support for txt: and rnd: rewrite map types from the mod_rewrite module. More details on this release may be found in the release notes.

Hibernate

The third release candidate of Hibernate 6.2 ships with bug fixes and resolutions to various issues. Developers can expect new features such as support for: Java records; STRUCT data types; table partitioning via the new @PartitionKey annotation; and improved generated values. Further details on this release may be found in the release notes.

JReleaser

Version 1.5.0 of JReleaser, a Java utility that streamlines creating project releases, has been released delivering updates such as: streamlined support for LinkedIn; add Azure as a deployer; display deprecation messages for the command line interface flags; and allow command hooks to be filtered by platform. More details on this release may be found in the release notes.

Ktor

JetBrains has released version 2.2.4 of Ktor, the asynchronous framework for creating microservices and web applications, that includes improvements and fixes such as: URLs containing an underscore failing to parse correctly in an HTTP client request; the value defined in the connectTimeoutMillis property not being respected when using the HttpTimeout plugin in parallel with the HttpRequestRetry plugin; and a situation in which the wrong content type is declared when defining two routes, resulting in an HTTP status code 405, Method Not Allowed, instead of the more accurate HTTP status code 415, Unsupported Media Type. Further details on this release may be found in the release notes.

Gradle

Gradle 8.0.2, a patch release, ships with fixes such as: Java and Scala builds with no explicit toolchain failing with Gradle 8.0.1 and Scala 2.13; dependencies from the already-resolved super configuration not being included in the sub-configuration; and the InstrumentingTransformer generating different class files in Gradle 8 versus 7.6. More details about Gradle 8.0 may be found in this InfoQ news story.



Microsoft Introduces Azure Operator Nexus to Simplify Deployment of Network Functions

MMS Founder
MMS Steef-Jan Wiggers

Article originally posted on InfoQ. Visit InfoQ

Microsoft recently announced the public preview of Azure Operator Nexus, a hybrid, AI-enabled, carrier-grade cloud platform designed to meet the specific needs of network operators.

Azure Operator Nexus is a service designed to bring the performance and resiliency of carrier-grade network functions to traditional cloud infrastructures. It provides a secure way to deploy mobile core and virtualized radio access networks (vRAN) network functions on-premises (far-edge, near-edge, core datacenters) or in Azure regions – delivering visibility into logging, monitoring, and alerting for infrastructure components and workloads.

Azure Operator Nexus builds on the functionality of its predecessor, the Azure Operator Distributed Services private preview, by integrating key Microsoft technologies such as Mariner Linux, Hybrid AKS, and Arc. Additionally, it leverages Microsoft services for security, lifecycle management, observability, DevOps, and automation. 


Source: https://techcommunity.microsoft.com/t5/azure-for-operators-blog/introducing-azure-operator-nexus/ba-p/3753393

Azure Operator Nexus supports two deployment models: On-premises and Public Region. In the On-premises scenario, operators can purchase the hardware from Microsoft partners. In contrast, in the Public Region scenario, Microsoft provides the infrastructure and allows the Operator to consume the resources on demand.

The company also introduced the Azure Operator Nexus Ready program, which provides a comprehensive ecosystem of certified network functions (CNFs) for operators. The program employs the Azure Operator Service Manager, which offers a consistent and scalable deployment experience for CNFs and VNFs from multiple vendors. In addition, there is an Operator Nexus System Integrator (SI) Program, which includes SIs trained and certified to deliver horizontal- and vertical-Integration services to the operators.

Yousef Khalidi, corporate vice president, Microsoft Azure Networking, outlined in a Microsoft Tech Community blog post the benefits of deploying Azure Operator Nexus:

  • Lower overall TCO  
  • Greater operations efficiency and resiliency through AI and automation 
  • Improved security for highly-distributed, software-based networks 

Igal Elbaz, senior vice president, Network CTO, AT&T, said in an Azure blog post:

As a pioneer in network virtualization and SDN, AT&T is confident in our decision to run our multi-vendor 5G Standalone Mobile Core on Operator Nexus platform while we continue to deploy and operate the platform in AT&T data centers. AT&T made the decision to adopt Operator Nexus platform over time with expectation to lower total cost of ownership (TCO), leverage the power of AI to simplify operations, improve time to market, and focus on our core competency of building the world’s best 5G service.

Lastly, Azure Operator Nexus’s pricing is unavailable as Microsoft will not publish it during the public preview. More information on the platform is available on the documentation landing page.



Podcast: Establishing SRE Foundations with Vladyslav Ukis

MMS Founder
MMS Vladyslav Ukis

Article originally posted on InfoQ. Visit InfoQ


Transcript

Shane Hastie: Hey folks, QCon London is just around the corner. We’ll be back in person in London from March 27 to 29. Join senior software leaders at early adopter companies as they share how they’ve implemented emerging trends and best practices.

You’ll learn from their experiences, practical techniques and pitfalls to avoid, so you get assurance you’re adopting the right patterns and practices. Learn more at qconlondon.com. We hope to see you there.

Good day, folks. This is Shane Hastie for the InfoQ Engineering Culture Podcast. It seems like a year ago that Vlad and I last sat down, and we were just checking the records: it was. We are talking today about Vladyslav Ukis’s new book, Establishing SRE Foundations. Vlad, welcome back. Nice to see you again.

Vladyslav Ukis: Thanks Shane. It’s great to be here again on the show, and it’s incredible that it was exactly a year ago that we recorded the first podcast, so now it’s the second one.

Shane Hastie: Indeed. For the benefit of our listeners who haven’t come across you in your work, give us the five-minute background. Who’s Vlad?

Introductions [01:13]

Vladyslav Ukis: I’ve been working in the healthcare industry for a long time and in the recent years I’ve been working on cloud computing. We’ve got a big platform for digital services in the healthcare domain, and this is where I’m running the R&D which includes development and operations.

That also leads us to the content of the book, which is about how you introduce operations in an organization that has never done that before, that only did development before. Overall, as part of that work over the last decade, we have introduced lots of new processes in an organization that, again, has never done those things before.

Like continuous delivery, releasing faster, operating the product, measuring the value of the services that are running in production and so on and so forth. It’s been a real transformation journey over the last decade.

Shane Hastie: Interesting stuff. Why do we need another SRE book?

Why another SRE book? [02:12]

Vladyslav Ukis: That was actually the question that I asked myself when I first got in touch with SRE because all the existing SRE books back in the day, they were written either by Googlers or ex-Googlers. Basically, the whole idea of SRE comes from Google and therefore also the publications are coming from the Google employees.

Therefore, also those publications, books and so on, they focus on, I’d say, advanced SRE. If you are a new organization, new to the topic of operating software, new to the topic of software as a service which requires you to operate the software, then you first need to build some foundations until you are ready for the advanced stuff.

So that’s what the book is about. It’s about helping the organizations to make the first couple of steps on the operations ladder before they are ready for the advanced stuff that is described in the original SRE books by Googlers.

Shane Hastie: What are some of those foundations? What are the things that organizations don’t have in place typically that they struggle with at the beginning?

Areas where organisations struggle at the beginning of an SRE journey [03:20]

Vladyslav Ukis: I think at the very beginning, there will be a lot of questions why development needs to do operations at all. In the traditional organizations, the typical setup is that there is the development department and then there is the operations department.

The development department does the development and the operations department does the operations, and that’s the way it’s always been. Therefore, it’s very difficult to get to a point where it becomes clear that you need a totally different approach if you are serious about running digital services in production.

As your traffic grows, as your frequency of delivery to production grows, you will see the need to have the services operated by the people who are actually developing them, because otherwise the handovers are too cumbersome: the handover from development to operations cannot happen frequently enough for the frequency of production updates.

Also, troubleshooting and the time to recovery from failures in production can only be fast if you are able to get those alerts that report something wrong in production to the people who can actually fix the problems fast enough.

Shane Hastie: This is a culture shift. How do we encourage that culture shift?

Making the challenges visible in order to address them [04:49]

Vladyslav Ukis: How it typically would happen is that the organization operating in a traditional way would come to a point where the limitations of the current approach will become evident. You’ll have, for example, lots of production outages or you’ll have long time to recover from outages and things like that.

With the current approaches, the organization cannot improve the situation so that will be a natural push to look for something new. Here the SRE as a methodology comes in handy because it’s about bringing alignment on operational concerns across the entire organization.

For the product owners, for the operations engineers, for the developers, there are things within the SRE framework that make them all think about the reliability that the services in production need to have and how much they’re willing to pay for it. How much they’re willing to pay for the level of reliability that they want to get.

Over time, I think it’ll need to lead to an organization-wide initiative to introduce Site Reliability Engineering because it literally affects all aspects of the product delivery organization. It affects product management, it affects development, it affects operations.

Therefore, it’ll need to be something that the organization puts on top of the list of organizational initiatives that they undertake. And then from there, once there is an agreement that you put SRE onto the list of initiatives, then at some point there is endorsement for that to happen.

And then the activities in different teams can start and slowly, over time, team by team, that transition can happen until the organization reaches a point where the foundations are established and optimization happens in all teams.

Shane Hastie: If I’m a middle level team leader, I’m being asked to do more and more of that, getting to the culture of you build it, you run it, bringing these things in, how do I encourage my leadership to invest in this?

How to influence change from the middle [07:02]

Vladyslav Ukis: I think there are a couple of things that a middle manager can do. On the one hand, they first of all need to be convinced that this is the right approach themselves. They need to educate themselves and know that this is the way to go from their point of view.

So then with that conviction, they can start talking to the leaders at the organization to basically gauge their temperature to see where they stand in their understanding. And then you can find probably a leader that would support this.

And then with that leader, you can then broaden the alignment because they would then talk to the others and so on. That way you create the understanding at the leadership level. Then once you go further, you need to sell the benefits that that would bring to the individual function, right?

What are the benefits, for example, to the development function? What are the benefits to the operations function? What are the benefits to the product management function? Basically, over time you build up the momentum until there is enough understanding in the organization to try that thing.

I think you definitely need to frame this as this is another experiment that we will run and see how that goes because this requires a profound socio-technical change in a socio-technical system.

Therefore, it all very much depends on people that are there and it might work or it might not work depending on the circumstances and the attitudes of people and the willingness to change and the necessity and so on.

Shane Hastie: Stepping back a bit, what are some of the key concepts in the book? I see the acronym SLO popping up over and over again. What is an SLO and why do I care?

Introducing Service Level Objectives [08:40]

Vladyslav Ukis: Typically, if you take an organization that has never done operations before, then they are not aware of the level of reliability that they provide to their users, to their customers. Typically, this reliability is not a quantified thing.

Obviously, everybody talks about our system needs to be stable and reliable, but this is typically not quantified and this is what those SLOs do. They let you quantify your reliability. That forces the organization to first of all come together and think about the level of reliability that you want to provide.

Therefore, the SLOs, that’s an acronym of Service Level Objective. What is the objective of the service level that we provide for this service and that service? If you aggregate the services to bigger digital services that you sell, then what is the level of reliability that you provide at that level?

If you break that down, then you start thinking about, or you are forced to think about, okay, so this Service Level Objective is for which dimension? Is it for availability? Then it is an availability SLO. Is it latency? Then a latency SLO. Is it, say, durability? Then a durability SLO, and so on.

Basically, it forces you to think about the dimensions that are important in terms of reliability and then it sets a numeric number for the reliability that you want to achieve in that dimension and this is the SLO then.

Typically, a service would have a set of SLOs across several dimensions, say availability, latency, and then in the SRE jargon, these dimensions, they are called Service Level Indicators. A typical SLI is then availability, another typical one is latency, and then for each of those SLIs that apply to a particular service, you set a goal and this is the objective or the SLO.

That’s why you would care about SLOs and that’s why SLOs can be seen as the alignment unit, so to speak, for the software delivery organization to align on the level of reliability that you want to achieve and then also what it would take to get there in terms of investment in reliability.
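To make the arithmetic concrete, here is a minimal sketch of an availability SLI checked against its SLO (the request counts and the 99.5% target are illustrative, not figures from the book):

// Availability SLI over a month, compared against the agreed SLO.
function availabilitySli(successfulRequests: number, totalRequests: number): number {
  return (successfulRequests / totalRequests) * 100;
}

const sloPercent = 99.5;                        // agreed Service Level Objective
const sli = availabilitySli(997_312, 999_026);  // measured Service Level Indicator (~99.83%)
console.log(sli >= sloPercent ? "SLO met" : "SLO breached");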

Shane Hastie: How do you avoid the “but of course we want 100% reliability” conversation?

Error budgets make improvement possible [11:05]

Vladyslav Ukis: That’s right. Yeah. That can come up very easily because of course we want to be reliable all the time. There is a nice concept for this within SRE and that concept is called error budget. Error budget is a calculation based on the SLO.

For example, if you’ve got your availability SLO set for 99%, then your error budget is calculated automatically by taking 100 and subtracting 99. That means your error budget is then 1%. So then let’s take this for a particular time period and then we would say, okay, so your availability SLO is to be 99% available within a month.

That means that your error budget is 1% non-availability, so unavailability within that month. That means you are allowed to be unavailable for 1% of the calls, for example, within that month. And then you can use that error budget but only what you’ve got inside the error budget.

So you’re not allowed to exceed the error budget. You can use the error budget done to do anything you want in terms of, for example, feature experimentation, some technical work that requires a little bit of downtime. Basically, anything you want you can do within the window of that error budget.

Now, imagine the SLO is a hundred percent; that means your error budget is what? A hundred minus a hundred is zero. You don’t have any error budget. That means that if you take this seriously, if you are strict about it, you actually don’t want to do anything to your service.

Because every time you try to deploy, every time you try to do something with the service, there is of course some likelihood that you will cause some downtime, and that will chip away at your error budget. If you’ve got none, then theoretically speaking you don’t want to do anything to your service.

You are basically back to the old days of never touch your running system because if the system is running and it’s running fine, then you don’t consume any error budget. The moment you try to do something to the service, try to update it, there is a possibility that you chip away from the error budget causing some downtime and you’ve got none.

Basically, you end up in a paralyzed state, so to speak. On the one hand you’ve got such a high target, a hundred percent availability. On the other hand, you don’t want to do anything to the service because that might then cause you to slip from your goal of a hundred percent.

This is of course not realistic because if you talk to product management then they will want to push features all the time and every time you push a feature, you deploy, you again are under risk of causing some downtime so basically it’s a nonsensical situation.

Therefore, to bring things to a point where you can on the one hand have some reasonable level of reliability, say 99.5 or so, and also have some reasonable level of error budget, that means being allowed to make some mistakes while you are working on the service, there is this concept of error budget where the organization is forced to agree on the SLO.

With that, the error budget is automatically granted and you then operate within the error budget and track your error budget consumption. Whenever you hit zero error budget, there also need to be some consequences, which are called the Error Budget Policy in SRE, and those consequences are defined team by team.

That way you’ve got a self-regulating system where on the one hand you’ve agreed on reliability, on the other hand you’ve got still some error budget and therefore you can deploy new features frequently and so on and you’ve got controls. You don’t want to exceed the error budget. If you do, then there is a policy that also was agreed before that you execute then.

Shane Hastie: When something goes wrong, what do you do? The incident response process.

Having a clear incident response process [15:04]

Vladyslav Ukis: When something goes wrong, then you need to have the ability to mobilize the organization in an efficient and effective manner. Efficient means you don’t want to page all the developers, you don’t want to go too broad.

But on the other hand, you also want to page a developer that can actually fix the problem instead of their neighbor, because the neighbor will only eventually get to the developer who can actually fix the problem. You want to be fast because you want to have a short time to recovery from incidents.

For that you set up a process. You need to be able to classify your incidents, one way or another. That’s another thing that you need to put in place. You need to be able to say, okay, that incident, okay fine, that’s a priority one and therefore we are on our incident response process say in the full-blown way.

Or that one, okay, that’s say priority two, therefore we are on the incident response process still, but it’s not the P1 level of mobilization of the people and so on. Of course, once you fix the incidents, then you also need to have some effective postmortem process where people will come together timely after the incident was fixed and then discuss what happened.

How you could improve in future on tech stuff, on culture stuff, on process stuff and so on so that you also have got then a tracking system for those action items that come out of the postmortems and then ensure that they are dealt with in a timely manner so that the whole process is then actually working.

Shane Hastie: How do you prevent the postmortem from degrading to blame?

Making postmortems safe and avoiding blamestorming [16:41]

Vladyslav Ukis: This is something that needs to be set up by the people who are running the postmortems. That means that the people who are designated as incident coordinators or incident commanders need to establish this as table stakes for taking part in the postmortem.

They need to tell the people that we are here not to blame each other, but there is a retrospective prime directive that you can look at and say this is what we actually want to do. We assume that everybody acted to the best of their abilities.

We are not here to blame each other, but we are here to find out the root causes and therefore let’s be open about sharing information and nobody will be punished in the end for this. That will not affect your performance review and so on.

That said, there are of course thorny interpersonal issues that might have gotten in the way during the incident resolution, and these things need to be dealt with before the postmortem meeting.

Before the postmortem meeting, the incident coordinator needs to ensure that they talk to the individual people who are affected by some conflict and then clarify things before the meeting takes place with more people. Therefore, in the book, the process is divided in three steps.

What do you do before postmortem meeting? What do you do during postmortem meeting and what do you do after the postmortem meeting? Astonishingly, in each of these three phases, there is a whole lot of things to do in order to make it work.

I think a typical misconception in the industry is that you just have a meeting and that’s the postmortem meeting and that’s it. This is just a tiny bit of the process. There is a whole lot to be done before, during and after to make it work.

Shane Hastie: Organization structures often get in the way of these types of improvements. What are the patterns? What are the common things that we need to look at to allow that change, and what are those changes going to look like?

There are a variety of team topologies that enable SRE [18:34]

Vladyslav Ukis: Thanks for that question because that points to a particular chapter in the book, which I think is becoming the most popular chapter in the book, which is organizational structure, how to organize for SRE. I just gave a talk at the DevOps Enterprise Summit US on this.

SRE team topologies and there were a lot of questions around this. I think there are lots of organizations that are grappling with this, how to organize for SRE, what is the right organizational structure.

I think this is also coming from the fact that the bigger internet companies like Google, Facebook, Amazon, they are organized differently and they all do either SRE directly like Google or something similar to SRE like Facebook and Amazon, but they are organized differently.

That means that there are different ways to organize well for SRE. In the book, I’ve got an entire chapter on this where I detail through all the different SRE team topologies that seem to be working in the industry.

You can have an organization that does SRE but doesn’t even have the SRE role because everything is done by the developers on rotation and everything is within the team so that’s one point on the spectrum.

Then at the opposite end of the spectrum, you can have an entire SRE organization, which is a function with its own line management and everything. That organization then runs services in production on condition that the services have a certain level of reliability, meaning they fulfill the SLOs and so on; then the organization runs the services.

If the services then fall below those reliability levels, then the SRE organization hands back the services for operations to the development teams until they improve the reliability and then the services fulfill higher SLOs at which point then the services can be operated by the dedicated SRE organization again.

And then there are different setups in between. I think to approach this, you need a several-step approach. First of all, you need to clarify the question of who runs the services, on a scale from the developers running the services themselves to a dedicated SRE organization running them.

There are also shared ways of running the services in between. For example, the SRE organization lends people to the development teams, and they stay there for a long time, effectively becoming team members; there are also a couple of other setups.

Once that is clarified, then the next step is to say, okay, with the setup that I’ve chosen or that I’m considering right now, what kind of incentives are still there with the developers to implement reliability as they implement features?

Because you want to maximize those incentives. Of course, with "you build it, you run it" you get the maximum incentive; with other setups, the incentives for the developers are a bit weaker.

But still, depending on the setup, you can find a balance between having a dedicated SRE organization, if that's important enough for the business, and still providing enough reliability incentives to the developers so that they implement reliability during feature implementation and not as an afterthought.

And then there are also a couple of other considerations. What kind of knowledge sharing is required between the different parties, between the different teams? Again, if you are doing everything within the development team, the knowledge sharing is natural because they're just working together as a team.

If it is another organization, a dedicated SRE organization for example, you require more synchronization. What is the cost of that synchronization, and so on?

Basically, there is a set of considerations that you need to work through before you make a decision, and the decision is basically two-dimensional. One dimension is the organizational structure itself, which is difficult to change; therefore, you should put a lot of thought into it before you declare a structure.

The other dimension is who runs the services, because that can change more easily. As I mentioned before, in the setup where the dedicated SRE organization runs the services, they've got the option to say, no, I don't run those services.

Then you go back to "you build it, you run it" automatically because the services fell below a certain reliability threshold. Yes, this is a question that many organizations are grappling with at the moment, and I hope that the detailed account of the options in the book will help different organizations with their decision making.

Shane Hastie: What is the role of the Site Reliability Engineer and what’s the career path?

The role and career path for a Site Reliability Engineer [23:29]

Vladyslav Ukis: That’s a very good question. Yeah, I think the typical starting point for the Site Reliability Engineer is a very technical role at the center of which is being on call. You are on call for services and you know how to fix the services when there is something wrong.

You know how to set up SLOs in the right way, you know how to agree on error budget policies, you know how to discipline the team to follow the SRE practices, and things like that. I think in order to be more effective, that role needs to extend further.

That role also needs to take part in the product management activities a little bit, bringing the reliability perspective in there. The role also needs to be involved in things like user story mapping, for example, where people discuss the individual stories and workflows, again bringing the reliability perspective in there.

Then during development, that role also needs to be there to coach the team in all things reliability, because as the team implements the features, this is where the actual reliability is decided: either certain reliability features are implemented as you go, or they're not there, and you discover later, once the services are running, that something is missing.

Therefore the role should be involved in development, although not necessarily implementing the reliability features themselves, but having an influence on how the team thinks about reliability and does the implementation, for example implementing stability patterns and circuit breakers where appropriate. This is important.

Then of course during operations, depending on the setup, if the team does "you build it, you run it", for example, and you are an SRE from an SRE organization, you can bring a wealth of experience to teach the team how to do this, especially teams that have never done something like this before.

I think it's a very wide range of activities that you need to span in order to be really effective at implementing reliability in the organization. I think there is also a lot of misconception now, especially if you look at the job adverts.

There is high demand for SREs, and you'll find that those job adverts are mostly focused on the very technical bits of the job, and as is often the case in the industry, there is also a bit of confusion with terminology.

You'll find job adverts that are not really SRE job adverts; they are talking about DevOps positions or just Operations Engineers and so on. But by and large, I think in order to be effective at installing reliability, you need to be active in all those areas of the product life cycle.

Shane Hastie: An interesting and wide-ranging conversation. We’ll include the link to the book in the show notes, but if people want to continue the conversation, where do they find you?

Vladyslav Ukis: They'll find me on LinkedIn. There are actually several conversation threads about the book going on right now. I'm looking forward to more folks joining those conversations, connecting and talking about this, because I think it is really important: the industry is moving towards software as a service more and more.

With software as a service comes the obligation for organizations to operate the software, and the question is how you do it: how to establish operations effectively as an organization. This is what the book is about. I think this is an important topic for the industry at large.

Shane Hastie: Vlad, thank you so much.

Vladyslav Ukis: Thank you very much, Shane.

Mentioned:

About the Author



Infrastructure as SQL on AWS: IaSQL Enters Beta Adding Multi-Region and Transactions

MMS Founder
MMS Renato Losio

Article originally posted on InfoQ. Visit InfoQ

The open-source service IaSQL recently announced its beta release. Designed to manage cloud infrastructure using SQL, IaSQL introduced support for AWS multi-region, new AWS services, and infrastructure changes as transactions.

Released as open-source software last year, IaSQL manages cloud infrastructure in a PostgreSQL database. The service treats infrastructure as data and maintains a two-way connection between the AWS account and the local database.

As part of the beta release, IaSQL added transactions, supporting batched and staged changes: the iasql_begin and iasql_commit functions allow developers to stage changes to an AWS deployment and apply them together, by temporarily disabling the normal propagation between the cloud and the infrastructure database.
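
As an illustration, assuming iasql_begin and iasql_commit are exposed as regular PostgreSQL functions, a staged change might look roughly like the sketch below. The resource table and columns are hypothetical and depend on the IaSQL modules installed; only the two transaction functions are named by the project.

SELECT iasql_begin();   -- pause the normal cloud sync so changes are only staged in the database

-- hypothetical table and columns, for illustration only; real tables come from the installed modules
INSERT INTO log_group (log_group_name, region) VALUES ('my-app-logs', 'us-east-1');

SELECT iasql_commit();  -- apply all staged changes to the connected AWS account in one go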

It is now possible to run IaSQL against multiple AWS regions, with the default region defined when connecting the database to the account. IaSQL can also be run locally using the Docker container available on Docker Hub:

docker run -p 9876:9876 -p 5432:5432 --name iasql iasql/iasql

Furthermore, IaSQL increased coverage for EC2, CodeDeploy, CodeBuild, CodePipeline, SNS, ACM, Route53 and other AWS services. Mapping cloud APIs to SQL tables, the service now provides modules aimed at simplifying the deployment of common use cases. LuisFer De Pombo, Co-Founder and CEO of IaSQL, writes:

For instance, deploying a docker container to ECS and exposing it to the internet which is not just ECS but involves ECR, ECS, ACM, and Route53. These simplified modules are written in pure SQL on top of the existing IaSQL modules and are meant to abstract the complexity of coordinating multiple AWS services.

In the article “Deploying to ECS, Simplified!”, Mohammad Teimori Pabandi highlights how modules still allow low-level access to resources:

What if you still need the level of control you had when you were doing all the steps manually? The traditional PaaS trade-off is that you can’t grow your app beyond the built-in limitations as you don’t have access to all the small details. The IaSQL approach is not limited in that way (…) you still have full control over all resources in the deepest details.

While some developers question the benefits of SQL over YAML-based traditional tools, user thythr writes on Reddit:

I feel a pleasant warmth in my soul when I get to use Postgres to do job X, no matter what X is. The more, the better!

IaSQL is not the only open-source project promoting SQL as a language for cloud infrastructure, with Steampipe providing an open-source CLI to query cloud APIs.

IaSQL is available on GitHub under an AGPL-3.0 license and as a hosted SaaS solution with a free tier available. AWS is currently the only cloud provider supported, with partial coverage for approximately 25 services.

About the Author



API Magic: Building Data Services with Apache Cassandra – The New Stack

MMS Founder
MMS RSS

Posted on nosqlgooglealerts. Visit nosqlgooglealerts


Using data APIs and advanced data modeling will make it far easier for JSON-oriented developers to connect to Cassandra.



All applications depend on data, yet application developers don’t like to think about databases. Learning the internals and query language of a particular database adds cognitive load, and requires context switching that detracts from productivity. Still, successful applications must be responsive, resilient and scalable — all characteristics that will be determined by choice of database. How, then, should an application developer balance these considerations?

What if we could shift the balance, providing a data service in developer-friendly idioms, rather than expecting developers to learn database-specific idioms?

At the Stargate project, the open source API data gateway designed to work with Apache Cassandra, we’re excited to start talking publicly about our upcoming JSON API that meets JSON-oriented developers on their terms. Not only is this great news for JSON-oriented developers, but the technique we’ve followed constitutes a new design pattern for leveraging data APIs and advanced data modeling to produce data services.

In this article I’ll discuss how to provide developer-friendly idioms using Cassandra together with Stargate, and how we’re working to do just that for JSON.

Data Models: Interoperability vs. Idiom

In the early days, Cassandra was sometimes described as “a machine for making indexes.” This was a testament to Cassandra’s inherent resilience and flexibility, a clay out of which more robust structures could be molded. Cassandra today is a richer clay with greater possibilities. Not only is it a great database, it’s a great machine for making databases. Here at the Stargate project, we’re using the JSON API to prove this as the first example of a new paradigm in database development.

It’s not unusual for one database to be built out of another. Even MongoDB is built on top of WiredTiger, if you dig deep enough. AWS is known for its extensive use of MySQL behind the scenes, including using the MySQL storage engine for DynamoDB. So the idea of using Cassandra, with its inherent scalability and performance, as the building block for other data systems makes sense.

Yet application developers don’t really interact with databases. Even if your organization manages its own database infrastructure and builds applications against that infrastructure, the first step is generally to define and implement the data models your applications require.

Those data models mediate between application and database. In some sense, data modeling limits a database; it takes unformed, and thus general-purpose, clay and molds it into something purpose-built for a particular application idiom. We sacrifice interoperability for something idiomatic.

Is it a good idea to trade for something idiomatic and give up something interoperable? If you want to beat the averages, then the answer is an emphatic “yes.” We don’t think this way much when choosing databases, but we’ve long thought this way when choosing programming languages.

This idea was well expressed by Paul Graham decades ago when he explained how Viaweb won the early dot-com race to create the first broadly successful, web-based e-commerce platform.

Viaweb wasn’t necessarily the fastest or most scalable e-commerce platform. In Graham’s words it was “reasonably efficient.” Instead, Graham argues that, for programming languages, on a scale of machine-readable to human-readable, the more human-readable (and thus higher-level) languages are more powerful because they improve developer productivity. And at the time of Viaweb, Graham thought the most powerful language was Lisp. The crux of Graham’s argument is this:

“Our hypothesis was that if we wrote our software in Lisp, we’d be able to get features done faster than our competitors, and also to do things in our software that they couldn’t do. And because Lisp was so high level, we wouldn’t need a big development team, so our costs would be lower. If this were so, we could offer a better product for less money and still make a profit. We would end up getting all the users, and our competitors would get none and eventually go out of business.”

Unlocking Developer Productivity

Graham wrote those words 20 years ago, and developer productivity remains the North Star that guides much of innovation in technology. Where Graham talks about the power of higher-level languages, we express that same concept as providing developers with tools that are more idiomatic to their software development experience.

Graham praises Lisp (rightly), and since the dot-com time we have seen a proliferation of new higher-level languages: Ruby and Rust, to name a couple. We’ve also seen the birth and proliferation of mobile device developer languages and frameworks, such as Swift, Flutter and Dart.

So why are languages like C and C++ still important? The old joke about C holds an important truth: “Combining the power of assembly language with the ease of use of assembly language.” If you want to write a compiler, then you need to move closer to the machine language idiom and away from the natural language idiom.

In other words, among other virtues, C and C++ are machines for building new languages. What’s easy to overlook in Graham’s praise of Lisp is that Lisp also has some of this “machine for building languages” characteristic.

Lisp was the first broadly used language to introduce the concept of macros, and it is often the concept of macros that trips up those new to Lisp. Once you understand macros, you understand that Lisp is more of a meta language than a language, and that macros can be used to build a purpose-built language for a specific problem domain.

Designing and creating an initial set of macros is hard, intellectually challenging work. But once Graham and the Viaweb team did that, in effect they had an e-commerce programming language to work with, and that unlocked the developer productivity that enabled them to outpace their competition.

Twenty years later, all of this seems clear enough in the context of programming languages. So, what has happened in the database world? The short answer is that databases have evolved more slowly.

The Data API Revolution

If tabular data is the assembly language of the database world, then SQL is the C/C++ of query languages. We developed tabular data structures and the concept of data normalization in an era when compute and storage were expensive, and for use cases that were well defined with relatively infrequent schema changes. In that context, to operate efficiently at any kind of scale, databases needed to closely mimic the way computers stored and accessed information.

Today’s world is the opposite, making that earlier time seem archaic by comparison: Compute and storage costs are highly commoditized, but in a world of real-time data combined with machine learning and artificial intelligence, use cases are open ended and schema changes are frequent.

The most recent revolution in database technology was the NoSQL revolution, a direct response to the canon of tabular, normalized data laid down by the high priests of the relational database world. When we say “NoSQL revolution,” we refer to the period from 2004, when Google released its MapReduce white paper, until 2007, when Amazon published its Dynamo white paper.

What emerged from this period was a family of databases that achieved unprecedented speed, scalability and resilience by dropping two cherished relational tenets: NoSQL databases favored denormalized data over data normalization, and favored eventual consistency over transactional consistency. Cassandra, first released in 2008, emerged out of this revolution.

Data APIs will be the next major revolution in database technology, a revolution that’s only just getting started. Changes in the database world tend to lag behind changes in programming languages and application development. So while RESTful APIs have been around for a while, and have helped usher in the architecture for distributed, service-oriented applications, we’re only just beginning to see data APIs manifest as a key part of application infrastructure.

To understand the significance of this revolution, and how, 20 years after Paul Graham’s proclamation, the database world is finally delivering on developer productivity, let’s look at Stargate’s own story. It starts by returning to the theme of interoperable versus idiomatic.

Stargate: A High-Fidelity, Idiomatic Developer Experience

When we decided that the Cassandra ecosystem needed a data gateway, we built the original set of Stargate APIs with a sense of urgency. That meant a monolithic architecture; monoliths are faster to build, yet not always better. We launched with a Cassandra Query Language (CQL) API, a REST API and a RESTful Document API. We quickly added GraphQL as an additional API. To date, Stargate has been interoperable; everything from Stargate is stored using a native CQL data model, so in principle, you could query any table from any API.

We’ve learned that in practice, no one really does this. Developers stick to their particular idiom. By favoring interoperability, we bled Cassandra-isms into the developer experience, thus impeding developer productivity. Because Stargate’s original version required developers to understand Cassandra’s wide-column tabular data structures, to understand keyspaces and partitions, we have anchored too close to the machine idiom and too far from the human idiom.

The interoperability trap is to favor general purpose over purpose-built in design thinking. We’ve pivoted to thinking in terms of purpose-built, which trades some general capability for a more specific mode of expression, moving us closer to the human idiom and further away from the machine idiom. And so we started to think: Could we deliver a high-fidelity idiomatic developer experience while retaining the virtues of Cassandra’s NoSQL foundations (scale, availability and performance)?

The key lies in data modeling. In order to turn Cassandra into the “Lisp of databases,” we needed a data model that could serve a purpose analogous to Lisp macros, together with a Stargate API that would enable developers to interact idiomatically with that data model. We started with JSON, the greatest common denominator of data structures among application developers, and thus started building the JSON API for Stargate. Then we had to figure out how best to model JSON in Cassandra.

Stargate already has a Document API, but in Stargate’s original Document API, we used a data model we call “shredding” to render a JSON document as a Cassandra table. This model maps one document to multiple rows in a single Cassandra table and preserves interoperability. If you use CQL to query the resulting table, you’ll get back meaningful results.

This original shredding data model has downsides. It does not preserve metadata about a document. For example, for any document containing arrays, once the document is written, we don’t know anything about array size without fully inspecting the document. More significantly, we’ve departed from Cassandra’s expectations about indexing. Cassandra indexes on rows, but we’ve now spread our document out across multiple rows, making a native Cassandra index of documents impossible.
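
Loosely, the original shredded layout can be pictured as in the sketch below; the exact table is not shown here, so the keyspace, table and column names are purely illustrative.

-- Illustrative CQL sketch of the original "shredding" idea: one row per leaf value of the document.
CREATE TABLE docs.shredded (
    doc_id     uuid,
    path       text,                 -- e.g. 'order.items.0.sku'
    leaf_value text,                 -- leaf value rendered as text, for illustration
    PRIMARY KEY ((doc_id), path)
);

Because each document is spread across many rows keyed by path, the data stays queryable from CQL, but, as noted above, document metadata such as array sizes is not preserved and whole documents cannot be indexed natively.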

To make Cassandra a suitable storage engine for JSON, we were going to need a new data model, something superior to shredding. We called it “super shredding.” You can learn more about super shredding at Aaron Morton’s Cassandra Summit talk in December, but here’s a bit of a teaser: We take advantage of Cassandra’s wide-column nature to store one document per row, knowing that a Cassandra row can handle even very large documents.

We also have a set of columns in that row that are explicitly for storing standard metadata characteristics of a JSON document. Now we have something more easily indexable, as well as a means of preserving and retrieving metadata.
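
To make the idea concrete, a super-shredded table might be sketched roughly as follows. This is an illustrative guess at the shape described above, not the actual Stargate schema; all table and column names are hypothetical.

-- Illustrative CQL sketch only: one JSON document per row, plus explicit metadata columns.
CREATE TABLE docs.collection1 (
    doc_id       uuid PRIMARY KEY,   -- one document per row
    doc_json     text,               -- the full JSON document
    array_sizes  map<text, int>,     -- hypothetical metadata: array lengths by field path
    field_types  map<text, text>,    -- hypothetical metadata: types of top-level fields
    updated_at   timestamp
);

Keeping the document in a single row restores Cassandra's natural row-based indexing, and the explicit metadata columns make properties such as array sizes retrievable without parsing the document.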

Contributing Back to Cassandra

Yes, to get all this to work at scale we will need some underlying changes to Cassandra. Accord, which Apple is contributing to Cassandra 5, will help us handle data changes in a more transactional manner. Storage-attached indexing (SAI) and Global Sort, which DataStax is contributing to Cassandra 5, will help us handle ranged queries against JSON documents in a more performant manner.

Cassandra is not a static piece of software; it's a vibrant and evolving open source project. So we're also continuing a longstanding Cassandra tradition of using requirements that emerge client-side to foster changes on the database side. User needs have prompted the proposals for Accord, SAI and Global Sort. These will not only make Stargate's JSON API better, but will make Cassandra better. This is a great reminder that data engineers and application developers are not two different communities, but complementary cohorts of the extended Cassandra community.

And JSON is just the first step. Essentially, we will have built a document database that you interact with through a JSON API out of Cassandra, Stargate and a reasonably efficient Cassandra data model. Super shredding is our macro. This approach turns Cassandra into a machine for making databases.

Could this approach be followed by another database besides Cassandra? Not easily, and here’s why. There’s a sort of database analog of the second law of thermodynamics that works in Cassandra’s favor. We start with something that is fast, scalable and resilient, but not very idiomatic to developers. Within the constraints of reasonable efficiency, we trade some of that speed, scale and resilience for a more idiomatic interface to present to developers. What one can’t easily do is go in the reverse direction. Starting with something that is highly idiomatic and then trying to figure out how to make it fast, scalable and resilient is a daunting task that might not even be possible.

That thermodynamic principle is why data APIs are the new database revolution, and why Cassandra is the database that will power this revolution.

Learn more at Cassandra Forward, a virtual event on March 14. Register now!

Learn more about DataStax.




Traffic Protocol Analyzer Wireshark Gets its Own Foundation

MMS Founder
MMS Sergio De Simone

Article originally posted on InfoQ. Visit InfoQ

The popular open-source protocol analyzer Wireshark has a new permanent home in the form of the Wireshark Foundation, which should provide the means for its further, long-term evolution, says Sysdig, which took over as the project's main sponsor in 2022.

The importance of a tool like Wireshark cannot be overstated, implies its creator and current lead developer Gerald Combs, since “modern society runs on computer networks and those networks need to be reliable, fast, and secure”.

The creation of the non-profit Wireshark Foundation responds to the goal of facilitating the tool's further development, providing a “permanent home” for the Wireshark community, and hosting a developer and user conference, dubbed SharkFest. Additionally, as Wireshark Foundation executive director Shari Najafi pointed out, the foundation aims to educate its community of users about network analysis and troubleshooting.

Wireshark could have been donated to an existing foundation, such as the Cloud Native Computing Foundation, but it was decided to create a new one to ensure the processes the Wireshark community has established over the years can be preserved. In this respect, Sysdig's CTO explained:

“Moving Wireshark to a foundation guarantees that Gerald and the rest of the core developers own and operate Wireshark. The open source users can count on the fact that Wireshark will remain an important industry standard for a long time, and that its development will continue to be driven by the community.”

Combs also stressed the role played by Sysdig in its effort to support security-oriented open-source tools, including Falco and a large set of eBPF libraries, both of which were donated to the CNCF.

“The Wireshark community and I look forward to investigating ways to extend Wireshark to address new challenges, including securing the Cloud.”

With over 2,000 contributors and 60 million downloads in the last five years, Wireshark is one of the most popular protocol analyzers. Originally created in 1998 under the name of Ethereal by Gerald Combs and rebranded as Wireshark in 2006, it has had a number of different sponsors over the years, including CACE Technologies, Riverbed Technology, and lately Sysdig.

About the Author
