Microsoft’s Customer Managed Planned Failover Type for Azure Storage Available in Public Preview

MMS • Steef-Jan Wiggers
Article originally posted on InfoQ. Visit InfoQ
Microsoft recently announced the public preview of customer managed planned failover, a new failover type for Azure Storage. It allows a storage account to fail over while maintaining geo-redundancy, with no data loss or additional cost.
Earlier, the company provided unplanned customer managed failover as a disaster recovery solution for geo-redundant storage accounts, allowing customers to fulfill their business needs for disaster recovery testing and compliance. In addition, a Microsoft-managed planned failover type is available. With the new customer managed planned failover type, users are no longer required to reconfigure geo-redundant storage (GRS) after a planned failover operation.
Responding to a Stack Overflow question about whether Azure Storage failover and failback require two replications, the company replied, hinting at the new failover type:
Yes, two replications would be required, and the process would work as documented. We currently have Planned Failover in private preview, allowing customers to test the Failover workflow while keeping geo-redundancy. As a result, there is no need to re-enable geo-redundancy after each failover and failback operation. This table highlights the key differences between Customer Managed Failover (currently GA) and Planned Failover (currently in the planned state).
If the storage service endpoints for the primary region become unavailable, users can fail over their entire geo-redundant storage account to the secondary region. During failover, the original secondary region becomes the new primary region, and all storage service endpoints are then redirected to the new primary region. After the storage service endpoint outage is resolved, users can perform another failover operation to fail back to the original primary region.
(Source: Microsoft Learn)
According to the company, there are various scenarios where Planned Failover can be used:
- Planned disaster recovery testing drills to validate business continuity and disaster recovery.
- Recovering from a partial outage in the primary region where storage is unaffected. For example, suppose storage service endpoints are healthy in both regions, but another Microsoft or third-party service faces an outage in the primary region. In that case, users can fail over their storage account along with the other affected services so the workload can continue to run.
- A proactive solution in preparation for large-scale disasters that may impact a region. To prepare for a disaster such as a hurricane, users can leverage Planned Failover to fail over to their secondary region and then fail back once things are resolved.
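For teams scripting the failover described above, a minimal sketch with the Azure.ResourceManager.Storage management SDK might look like the following. The subscription, resource group, and account names are placeholders, and the failover-type parameter is an assumption based on the preview announcement, so verify the exact overload against the current SDK reference.

```csharp
using Azure;
using Azure.Core;
using Azure.Identity;
using Azure.ResourceManager;
using Azure.ResourceManager.Storage;

// Placeholders for the geo-redundant storage account to fail over.
ResourceIdentifier accountId = StorageAccountResource.CreateResourceIdentifier(
    "<subscription-id>", "<resource-group>", "<storage-account>");

var armClient = new ArmClient(new DefaultAzureCredential());
StorageAccountResource account = armClient.GetStorageAccountResource(accountId);

// Trigger the failover and wait for completion; the planned failover type is an
// assumption from the preview REST API and may require a preview SDK version.
await account.FailoverAsync(WaitUntil.Completed, StorageAccountFailoverType.Planned);

// Failing back later is the same operation issued once the original region recovers.
```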
AWS and Google Cloud’s competing storage offerings provide similar disaster recovery options, such as multiple active regions (AWS) and dual-region or multi-region buckets (GCP).
Emphasizing that backups are essential, Fletus Poston tweeted:
You likely know having a backup is essential—but it’s not enough. A backup that hasn’t been validated could leave your business vulnerable when disaster strikes. Whether it’s a ransomware attack, hardware failure, or accidental deletion, relying on untested backups can lead to incomplete or corrupted data recovery.
The new failover type is available in Southeast Asia, East Asia, France Central, France South, India Central, and India West Azure regions.
.NET 9 Release Candidate 1: Approaching Final Release with Updates Across the Framework

MMS • Almir Vuk
Article originally posted on InfoQ. Visit InfoQ

Last week, Microsoft released the first release candidate for the upcoming .NET 9 framework, which includes a range of updates across its core components, such as the .NET Runtime, SDK, libraries, C#, and frameworks like ASP.NET Core and .NET MAUI.
Regarding the .NET libraries, new APIs were added to ClientWebSocketOptions and WebSocketCreationOptions, enabling developers to configure WebSocket pings and automatically terminate connections if no response is received within a specified timeframe.
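For instance, a minimal sketch of configuring a client socket with the new keep-alive timeout (the endpoint URL is a placeholder, and property names should be double-checked against the RC1 API reference):

```csharp
using System.Net.WebSockets;

using var ws = new ClientWebSocket();

// Send a keep-alive ping every 30 seconds...
ws.Options.KeepAliveInterval = TimeSpan.FromSeconds(30);
// ...and abort the connection if no pong is received within 10 seconds (new in .NET 9).
ws.Options.KeepAliveTimeout = TimeSpan.FromSeconds(10);

await ws.ConnectAsync(new Uri("wss://example.com/feed"), CancellationToken.None);
```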
Additionally, new types—ZLibCompressionOptions and BrotliCompressionOptions—have been introduced, providing more detailed control over compression levels and strategies. These additions offer greater flexibility compared to the previous CompressionLevel option.
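A hedged sketch of the new options type, assuming the .NET 9 stream constructor overloads that accept it; exact property names and overloads should be confirmed against the RC1 API surface.

```csharp
using System.IO.Compression;

// Raw zlib level 0-9 gives finer-grained control than the old CompressionLevel enum.
var options = new ZLibCompressionOptions { CompressionLevel = 6 };

await using FileStream output = File.Create("payload.zz");
await using var zlib = new ZLibStream(output, options); // assumed .NET 9 overload
await zlib.WriteAsync("hello compression"u8.ToArray());
```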
For users working with TAR archives, the public property DataOffset has been introduced on System.Formats.Tar.TarEntry, allowing access to the position of the data in the enclosing stream. It provides the location of the entry’s first data byte in the archive stream, making it easier to manage large TAR files, including scenarios involving concurrent access.
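For example, a short sketch that walks an archive and reports where each entry’s data begins (the archive name is a placeholder):

```csharp
using System.Formats.Tar;

using FileStream archive = File.OpenRead("backup.tar");
using var reader = new TarReader(archive);

while (reader.GetNextEntry() is TarEntry entry)
{
    // DataOffset is the position of the entry's first data byte in the enclosing stream,
    // or -1 when the entry has no data section.
    Console.WriteLine($"{entry.Name}: data starts at byte {entry.DataOffset}");
}
```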
Starting with this version, LogLevel.Trace events generated by HttpClientFactory exclude header values by default. However, developers can still log specific header values using the RedactLoggedHeaders helper method, keeping control over privacy and security.
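As a rough sketch, assuming the predicate overload of RedactLoggedHeaders, where returning false for a header name lets its value appear in Trace logs again (the client and header names are illustrative):

```csharp
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

services.AddHttpClient("backend")
    // true = redact the header value; returning false opts a header back into
    // LogLevel.Trace output now that redaction is the default.
    .RedactLoggedHeaders(headerName => headerName != "X-Correlation-Id");
```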
Furthermore, the new command dotnet workload history has been introduced. As explained, this command tracks the history of workload installations or modifications within a .NET SDK installation, offering insights into workload version changes over time. It is intended to help users manage workload versions more efficiently and is described as similar to Git’s reflog functionality.
The release candidate also introduces updates to ASP.NET Core, such as a keep-alive timeout for WebSockets, support for Keyed DI services in middleware, and improvements to SignalR distributed tracing, aiming to improve performance and simplify development workflows.
Interested readers can read more about the ASP.NET Core RC1 updates in a detailed InfoQ news article.
Regarding .NET MAUI, the release focuses on addressing issues and stabilizing the platform in preparation for the general availability (GA) release. Among the new features is HorizontalTextAlignment.Justify, which provides additional text alignment options for Labels. Updates to HybridWebView are also included, along with a guide for developers upgrading from .NET 9 Preview 7 to RC1, particularly regarding invoking JavaScript methods from C#.
.NET for Android and iOS are primarily focused on quality improvements in this release, which, as reported, requires Xcode 15.4 for building applications.
Looking into the community discussion for this release, an interesting conversation clarified that the out-of-proc Meter wildcard listening feature is new and was not previously available in either in-process or out-of-process scenarios.
Tarek Mahmoud Sayed, a collaborator on the .NET project, wrote the following:
The wildcard support is the new feature that was not supported in proc or out of proc. The sample showing the in-proc just for simplicity to show using the wildcard. You can still listen out-of-proc too and leverage the wildcard feature. We are working to make diagnostics tools like dotnet monitor to leverage the feature too. Let me know if there still anything unclear and I’ll be happy to clarify more.
Interested readers can find more information about this version in the official release notes. Lastly, the .NET 9 Release Candidate 1 download is available for Linux, macOS, and Windows.

MMS • Danilo Sato
Article originally posted on InfoQ. Visit InfoQ

Transcript
Sato: The track is about the architectures you’ve always wondered about, and I work in consulting, and sometimes, not always, I have lots of stories about architectures you probably don’t want to know about. I’ll try to pick the things that are interesting. One of the benefits of working in consulting is I see lots of things, and what I try to do with these topics is less about zeroing in on a specific solution or a specific way of doing things, but we like to look at things more like, what are the patterns, what are the different ways that people can do it maybe using different types of technology stacks?
What is this Organization’s Business?
To start with, because we’re talking about data architectures, I want to start with a question. I’ll show you a data architecture, and I’ll ask you like, help me find what is this organization’s business? This organization has a bunch of data sources, so they’ve got ERP systems, CRM system. They do their finance. They get external data. Then they ingest their data through a data lake or a storage. They do that through streaming. They do that through batch. They do a lot of processing of the data to cleanse it, to harmonize the data, to make it usable for people. Then they do serving of the data. They might put it in a data warehouse, on a data mart where lots of consumers can use that data. They put it on a dashboard to write reports, to train machine learning models, or to do exploratory data analysis.
Which company is that? Has anyone seen anything like this before? I call this the left to right data architecture. Probably most people will have something like this, or a variation of this. What I like about using data products, and some of the ideas in data mesh, is that we can talk about data architectures slightly like this. I’ll describe a different one now, and I’ll use these hexagons to represent data products, and then some square or rectangles for systems. This company has data about content. They call it supply chain, because they have to produce the content. They add metadata to the content.
Then they make that content available for viewers to watch that content. When they watch that content, they capture viewing activities. What are people actually watching? What are they not watching? When do they stop watching? When do they start watching? Then they do marketing about that to get people to go watch more of their content. They have this, they call them audiences. When we’re looking around different types of audiences, one of the things that marketing people like to do is to segment the audience so that we can target the message to reach people that might actually be more interested in a specific type of content than another one.
Which company is this? There’s still a few of them, but at least the industry should be much easier to understand now, because we’re talking about the domain. In this case, this was a project that we’ve done with ITV, which is here in the UK. I don’t think I need to introduce but it’s a broadcaster. They produce content. They distribute that content. They actually have a studios’ business as well. There’s way more of that, but this was a thin slice of the first version of when we started working with them to build the data mesh. The first use case was in marketing. What we were trying to do is bring some of that data they already had into this square here.
I didn’t put the label, but it was the CDP, the customer data platform, where they would do the audience segmentation, and then they could run their marketing campaigns. The interesting thing is, once you do segments, then the segments themselves is useful data for all the parts of the business as well to know about. We can describe architecture in a more domain specific way. These are the things that hopefully will be more meaningful than the things underneath. Because if we zoom in into one of these data products, you’ll see that it actually has a lot of the familiar things.
We’ll have ingestion of data coming in. It might come in batches. We might do it in streaming. We might have to store that data somewhere. We still do pipelines. We still process the data. We have to harmonize and clean it up. We will need to serve that data for different uses. That can be tied in to dashboards or reports, or more likely, it could also be useful for all the downstream data products to do what they need. When we look inside of the data product, it will look familiar with the technology stacks and maybe architectures that we’ve seen before, but data products gives us this new vocabulary or this new tool, how we describe the architecture that’s less about the technology and more about the domain.
Background
In this talk, I was looking at, where should I talk about? Because those two, to me, is very related. It’s like we could look inside and how we architect for this data product. The other thing that I’m quite excited about is that you can actually architect with data products, which is more about that bigger picture, looking at the bigger space. This is what I’ll try to do. I am the Global Head of Technology at Thoughtworks. I’ve been here for over 16 years, but you see, my background is actually in engineering. I joined as a developer, and I fell into this data and AI space while here at Thoughtworks.
I was here when Zhamak wrote the seminal article to define data mesh principles and early reviewer of the book, and we’ve been trying to apply some of these ideas. The reason why I picked the topic to be data products and not data mesh, is because what we found is, it’s a much easier concept for people to grasp. You have hints if you’ve read the data mesh book or the article. There’s lots of things that you see there that comes from that. Data product is one way that’s easy for people to engage with that content.
Shift in Data Landscape
To tell you a little bit about the history, like why we got to where we are, why is it hard to architect or to design data architectures these days. There’s been a shift. Way back, we would be able to categorize things about like, we’ve got things that happen in the operational world. We’ve got data that we need to run the business, that have operational characteristics, and then we have this big ETL to move data into the analytical space where we’ll do our reporting or our business intelligence. That’s where a lot of those left to right architectures came about. We can use different ways to do those things, but in terms of how we model things, like databases, when I came into the industry, there weren’t many options.
Maybe the option is like, which relational database am I going to use? It was always like, tables. Maybe we design tables in a specific way if we’re doing transactional workloads versus if we’re doing analytical workloads. That was the question. Like, is it analytical workloads? Is it transactional workload? Which relational database do we use? Early in the 2000s there were maybe a few new things coming up there, probably not as popular these days, but some people try and build object database or XML databases. Things have evolved, and these days, this is what the landscape looks like. This got published. Matt Turck does this every year. This started off as the big data landscape, I think 2006 or something like that.
Then it’s been evolving. Now it includes machine learning, AI, and data landscape, which he conveniently calls it the MAD Landscape. It is too much for anyone to use to understand all of these things.
The other thing that happened is that we see those worlds getting closer together. Whereas before we could think about what we do operationally and what we do for analytics purposes as separate concerns, they are getting closer together, especially if you’re trying to train more like predictive, prescriptive analytics. You train machine learning models, ideally you want the model to be used, backed by an operational system to do things like product recommendation or to optimize some part of your business. That distinction is getting more blurry, and there are needs for getting outputs of analytics back into the operational world, and the other way around.
A lot of these analytics needs good quality data to be able to train those models or get good results. These things are coming together. These were some of the trends where the original idea from data mesh came from. We were seeing lessons we learned for how to scale usage of software. A lot of the principles of data mesh came from how we learn to scale engineering organizations. Two in particular that I call out, they are principles there in the book as well, but one is the principle of data as a product. The principle means, let’s treat data as a product, meaning, who are the people that are going to use this data? How can we make this data useful for people?
How can we put the data in the right shape that makes it easy for people to use it? Not just something that got passed around from one team to another, and then you lose context of what the data means, which leads to a lot of the usual data quality problems that we see. In data mesh, at least, data product side is the building block for how we draw the architecture, like the ITV example I showed earlier on. The other thing around that is that if we’re treating as a product, we bring the product thinking as well.
In software these days we talk about product teams that own a piece of software for the long run, there’s less about those build and throw over the wall for other people to maintain. The data products will have similar like long-term ownership, and if we can help drive alignment between how we treat data, how we treat technology, and how we align that through the business.
The other big concepts that influenced a lot of data mesh was from domain-driven design. The principle talks about decentralizing ownership of data around domains. What Zhamak means with domains in that context is from Eric Evans’ book on domain-driven design, which originally was written about tackling complexity in software. A lot of those core ideas for how we think about the core domain, how do we model things? How do we drive ubiquitous language? Do we use the right terminology that reflects all the way from how people talk in the corridors all the way to the code that we write? How do we build explicit bounded contexts to deal with problems where people use the same name, but they mean different things in different parts of the organization? A lot of that, it’s coming from that, applied to data.
Modeling is Hard
The thing is, still today, modeling is hard. I like this quote from George Box, he says, “Every model is wrong, but some of them are useful.” To decide which one is useful, there’s no blanket criteria that we can use. It will depend on the type of problem that you’re trying to solve. We’ve seen on this track, I think most people said, with architecture, it’s always a tradeoff. With data architecture, it’s the same. The way that we model data as well, there’s multiple ways of doing things. This is a dumb example, just to make an illustration.
I remember when I started talking about these things, there were people saying, when we model software or when we model data, we should just try to model the real world. Like, this thing exists, let’s create a canonical representation of the thing. The problem is, let’s say, in this domain, we’ve got shapes with different colors. I could create a model that organizes them by shape. I’ve got triangles, squares, and circles, but I could just as easily organize things by color. You can’t tell me one is right, the other one is wrong, if you don’t know how I’m planning to use this. One might be more useful than the other depending on the context. These are both two valid models of this “real world”. Modeling is always this tradeoff.
How Do We Navigate?
We’ve got trends from software coming into data. We’ve got the landscape getting more complex. We’ve got some new, maybe thinking principles from data mesh. How can we navigate that world? The analogy that I like to use here is actually from Gregor Hohpe. He talks about the architect elevator, architect lift, if you’re here in the UK, would be the right terminology. In his description, he wrote a book about it, but he talks about how an architect needs to be able to navigate up and down. If you imagine this tall building as representing the organization, you might have the C-suite at the top, you might have the upper management, you might have the middle management, the lower management, and then you might have people doing the work at the basement somewhere.
As an architect, you need to be able to operate at these multiple levels, and sometimes your role is to bring a message from one level to the other, to either raise concerns when there’s something that’s not feasible to be done, or to bring a vision down from the top to the workers. I like to apply a similar metaphor to think about architecture concerns. I’m going to go from the bottom up. We can think about data architecture within a system. If we are thinking about how to store data or access data within an application, a service, or in this talk, more about within the data product, and then we’ve got data that flows between these systems. Now it’s about exposing data, consuming that data by other parts of the organization.
Then you’ve got enterprise level concerns about the data architecture, which might be, how do we align the organizational structure? How do we think about governance? In some cases, your data gets used outside of the organization, some big clients of ours, they actually have companies within companies, so even sharing data within their own companies might be different challenges.
Data Architecture Within a System
I’ll go from the bottom to the up to talk about a few things to think about. Start, like I said, from the bottom, and the first thing, it’s within a system, but which system are we talking about? Because if we talk about in the operational world, if we’re serving transactional workloads where there’s lots of read/writes and lots of low latency requirements, that choice there is not as much about data product, but it’s, which operational database am I going to use to solve that problem? When we talk about data product in data mesh, you will see, like in the book, the definition is about a socio-technical approach to manage and share analytical data at scale.
We’ve always been focusing data products on those analytical needs. It’s less about the read/write choices, but it’s more about using data for analytical needs. The interesting thing is that some folks that work in the analytics, we’re quite used about breaking encapsulation. I don’t know if you’ve got those folks in your organization, but they’re like, just give me access to your database, or just give me a feed to your data, and then we’ll sort out things. We don’t do that when we design our operational systems. We like to encapsulate and put APIs or services around how people use our things.
In analytics, at least historically, it’s been an accepted way of operation where we just let the data team come and suck the data out of our systems, and they will deal with it. You end up with that disparate thing, and when we break encapsulation, the side effect is that the knowledge gets lost. This is where we see a lot of data quality problems creep in, because the analytics team is going to have to try to rebuild that knowledge somehow with data that they don’t own, and the operational teams will keep evolving so they might change the schema. We create a lot of conflict. The data team doesn’t like the engineering teams, and vice versa.
In the operational database side, like any other architecture choice, it’s understanding what are the data characteristics that you need, or the way to think about the cross-functional requirements. From data, it might be like, how much volume are you expected to handle? How quickly is that data going to change? Aspects around consistency and availability, which we know it’s a tradeoff on its own, different access patterns. There will be lots of choices. If you think about all the different types of databases that you could choose, I would just quickly go through them. You’ve got the traditional relational database, which is still very widely used, where we organize data in tables.
We use SQL to query that data. You’ve got document databases where maybe the data is stored as a whole document that we can index and then retrieve the whole document all at once. We’ve got those key-value stores where, maybe, like some of the caching systems that we built, we just put some data in with a key value that we can retrieve fast. We’ve got some database there, those wide column style databases like Cassandra or HBase, where we model column families, and then you index the data in a different way depending on the types of the queries that you’re going to make. We’ve got graph databases where we can represent the data as graphs or relationships between nodes and entities, and infer or maybe navigate the graph with a different type of query.
Some people say like search database, like Elasticsearch, or things like that. We’re trying to index text to make it easier to search. We’ve got time-series databases, so maybe you’ve got high throughput temporal data that we might need to aggregate by different periods of time because we’re trying to do statistics or understand trends. We’ve got columnar databases, which is popular with analytical workloads, where maybe the way that we structure the data and the way that it’s stored, it’s more organized around column so it’s easier to run analytical queries against that. This thing keeps evolving. I was just revisiting these slides, and actually now, with all the new things around AI and new types of analytics, there’s even like new types of databases that are coming up.
Vector databases are quite popular now: if you’re trying to train an LLM or do RAG on the LLM, you might want to pull context from somewhere, and where we store our embeddings will be a vector database. Machine learning data scientists, they use feature stores where they can try to encapsulate how they calculate a feature that’s used for training their models, but they want that computation to be reused by other data scientists, and maybe even be reused when they’re training the model, and also at inference time, where they will have much more strict latency requirements on it.
There are some metrics databases coming up, like, how do we compute KPIs for our organization? People are talking about modeling that. Even relations, like, how can we store relations between concepts? How can we build like semantic layer around our data so it’s easier to navigate? This landscape keeps evolving, even if I take all the logos from who are the vendors that are actually doing this.
There’s more things. We could argue after, it’s like, actually, Elasticsearch claims they’re actually a document database as well, they’re not just a search database. I can query the SQL in a graph database as well, because there’s a dialect that’s compatible with that. There’s lots of nuances around that. There are some databases that claim to be multimodal. Like, it doesn’t matter how it’s going to get stored, you can use a product and it will give you the right shape that you need.
You have to choose whether you’re going to use the cloud services that are offered, whether you’re going to go open source, or whether you’re going to get a vendor. The interesting one like, sometimes you don’t even need a full database system running. When we talk about data, sometimes it’s like, I just want a snapshot of that data to be stored somewhere in the lake. It might be like a file. It might be a parquet file. It’s just sitting on a storage bucket somewhere.
Data Product
Now I’ll shift more to the data product, which is more on the analytical side. One of the questions I get a lot from clients is to understand like, but what is a data product? Or, is this thing that I’m doing a data product? The way that I like to think about it, I’ll ask you some questions here. If you’ve got a table, you’ve got a spreadsheet, do you think that’s a data product? The answer is probably not, but there’s other ways that we can have data. Maybe it’s an event stream or an API, or I’m plugging a dashboard into this, does this become a data product now? Who thinks that? It doesn’t stop there, because if we’re going to make it a data product, we need the data to be discoverable and the data to be described.
There will be schema associated with the data, how that data gets exposed, especially on the output side, and we want to capture metadata as well. That makes it more robust. Now, is that a data product? It doesn’t end there. We want the metadata and the data product to be discoverable somewhere. We’re probably going to have a data catalog somewhere that aggregates these things so people can find it. We also, like I said before, in the zoomed in version, there will be code that we write for how to transform, how to ingest data, how to serve data for this product. We need to manage that code. We need to choose what kind of storage for this data product we’re going to use, if we’re doing those intermediary steps. As we do transformation, we probably want to capture lineage of data, so that’s going to have to fit somewhere as well.
More than that, we also want to put controls around quality checks. Like, is this data good? Is it not good? What is the access policy? Who is going to be able to access this data? Even we’re talking about data observability these days, so we can monitor. If we define service level objectives or even agreements for our data, are we actually meeting them? Do we have data incidents that needs to be managed? This makes a little bit more of the complete picture. We want the data product to encapsulate the code, the data, the metadata that’s used to describe them. We use this language, we talk about input ports, output ports, and control ports. Then the input ports and outputs is how the data comes in and how the data gets exposed. It could be multiple formats. It might be a database-like table, maybe, as an output port. It might be a file in a storage somewhere. We might use event streams as an input or output. It could be an API as well.
Some people talk about data virtualization. We don’t want to move the physical data, but we want to allow people to connect to data from other places. There are different modes of how the data could be exposed. Then the control ports are the ones that connect to the platform, to either gather information about what’s happening inside the data product, to monitor it, or maybe to configure it.
The thing about that is, because it gets complex, what we see is, when we’re trying to build data products, a key piece of information that comes into play is the data product specification. We want to try to create a versionable way to describe what that product is, what is its inputs, what is its outputs? What’s the metadata? What’s the quality checks? What are the access controls that we can describe to the platform, how to manage, how to provision this data product. The other term that we use a lot when we talk about data products is the DATSIS principles.
Like, if we want to do a check mark, is this a data product or not? The D means discoverable. Can I find that data product in a catalog somewhere? Because if you build it, no one can find it, and it’s not discoverable. If I find it, is it addressable? If I need to actually use it, can I access it and consume it through some of these interfaces that it publishes? The T is for trustworthy. If it advertises this quality check, so the SLO or the lineage, it makes it easier for me, if I don’t know where the data is coming from, to understand, should I trust this data or not? Self-describing, so all that metadata about, where does it fit in the landscape? Who’s the owner of that data product? The specification, it’s part of the data product. It needs to be interoperable.
This gets a little bit more towards the next part of the presentation. When we talk about sharing data in other parts of the organization, do we use a formal language or knowledge representation for how we can allow people to connect to the data. Do we define standard schemas or protocols for how it’s going to happen? Then S is for secure, so it needs to have the security and access controls as part of that.
Data Modeling for (Descriptive/Diagnostics) Analytical Purposes
Quickly dive into when we talk about analytics, one of the modeling approaches. I put here in smaller caps, like this is maybe for more descriptive, diagnostics types of analytics. It’s not anything machine learning related. We usually see about this like, there’s different ways to do modeling for analytics. We can keep that raw data. We can use dimensional modeling. We can do data vault. We can do wide tables. Which one do we choose? Dimensional model is probably one of the most popular ones. We structure the data either in a star schema or a snowflake schema, where we describe facts with different dimensions. Data vault is interesting. It is more flexible. You’ve got this concept of hubs that define business keys for key entities that can be linked to each other.
Then both hubs and links can have these satellite tables that is actually where you store the descriptive data. It’s easier to extend, but it might be harder to work with, for the learning curve is a little bit higher. Then we’ve got the wide table, like one big table, just denormalize everything, because then you’ve got all the data that we need in one place. It’s easy to query. That’s another valid model for analytics. This is my personal view. You might disagree with this. If we think about some of the criteria around performance of querying, how easy it is to use, how flexible the model is, and how easy to understand? I put raw data as kind of like the low bar there, where, if you just got the raw data, it’s harder to work with because it’s not in the right shape.
Performance for querying is not good. It’s not super flexible, and it might be hard to understand. I think data vault improves on that, a lot of them especially on model flexibility. Because it’s easy to extend their model, but maybe not as easy to use as a dimensional model. The dimensional model might be less flexible, but maybe it’s easy to use and to understand. Many people are used to this. Performance might be better because you’re not doing too many joins when you’re trying to get information out of it. Then, wide tables, it’s probably like the best performance. Because everything is in a column somewhere, so it’s very quick to use and very easy, but maybe not super flexible. Because if we need to extend that, then we have to add more things and rethink about how that data is being used.
Data Architecture Between Systems
I’ll move up one level now to talk about data architecture between systems. There are two sides of this. I think the left one is probably the more important one. Once we talk about using data across systems, now, decisions that we make have a wider impact. If it was within my system, it’s well encapsulated. I could even choose to replace the database or come up with a way to migrate to a different database, but once the data gets exposed and used by other parts of the organization, now I’ve made a contract that I need to maintain with people.
Talking about, “What’s the API for your data product?” The same way that we manage APIs more seriously about, let’s not try to make breaking changes. Let’s treat versioning more as a first-class concern. That becomes important here. Then the other thing is how the data flows between the systems. Because there’s basically two different paradigms for handling the data in motion. Thinking about the data product API, because it needs more rigor, one of the key terms that are coming up now, people are talking about this, is data contracts.
We’re trying to formalize a little bit more, what are the aspects of this data that is in a contract now? We are making a commitment that I’m going to maintain this contract now. It’s not going to break as easily. Some of the elements of the contract, the format of the data, so, is my data product going to support one type or maybe multiple formats? Is there a standard across the organization? Examples may be Avro, Parquet, maybe JSON or XML, Protobuf, like that. There are different ways that you could choose. The other key element is the schema for the data, so that helps people understand the data that you’re exposing.
Then, how do we evolve that schema? Because if we don’t want to make breaking changes, we need to manage how the schema evolves in a more robust way as well. Depending which format you chose, then there’s different schema description languages or different tools that you could use to manage that evolution of the schema. The metadata is important, because it’s how it hooks into the platform for discoverability, to interop with the other products. Then, discoverability, how can other people find it? I mentioned before having something like a data catalog that makes it easy for people to find. They will learn about what the data is, how they can be used, and then decide to connect to the API.
When we talk about data in motion, like I said, there’s basically two main paradigms, and they’re both equally valid. The very common one is batch, which usually is when we’re processing a bounded dataset over a time period. We’ve got the data for that time period, and we just run a batch job to process that data. When we’re doing batch, usually going to have some workflow orchestrator for your jobs or for your pipelines, because oftentimes it’s not just one job, you might have to do preprocessing, processing, joining, and things like that. You might have to design multiple pipelines to do your batch job. Then the other one that’s popular now is streaming.
We’re trying to process an infinite dataset that keeps arriving. There are streaming engines for writing code to do that kind of processing. You have to think about more things when you’re writing these things. The data might arrive late, so you have to put controls around that. If the streaming job fails, how do you snapshot so that you can recover when things come back? It’s almost like you’ve got a streaming processing job that’s always on, and it’s receiving data continuously. There’s even, for instance, Apache Beam, they try to unify the model so you can write the code that could either be used for batch processing or stream processing as well.
Flavors of Data Products
The other thing, thinking about between systems, I was trying to catalog, because there are different flavors of the data products that will end up showing up in the architecture. When we read the book and when we read the first article from Zhamak, she talks about three types of data products. For example, we’ve got source-aligned data products. Those are data products that are very close to, let’s say, the operational system that generates that data. It makes that data usable for other parts of the organization. Then we may have aggregate data products. In this example here, I’m trying to use some of the modeling that I’ve talked about before.
Maybe we use a data vault type of modeling to aggregate some of these things into a data vault model as an aggregate. The dimensional model, in a way, is aggregating things as well. This is where it gets blurry, the dimensional model might also act as a consumer-aligned data product, because I might want to plug my BI visualization tool into my dimensional model for reporting or analytics. I might transform, maybe from data vaults into a dimensional model to make it easier for people to use, or maybe as a wide table format. Then the other one I added here, which is more, let’s say we’re trying to do machine learning now, I need to consume probably more raw data, but now the output of that trained model might be something that I need to consume back in the operational system.
We’ve got these three broad categories of flavors of data products. What I’ve done here, there are more examples of those types of products, so I was trying to catalog a few based on what we’ve seen. In the source-aligned, we’ve got the obvious ones, the ones that are linked to the operational system or applications. Sometimes what we see is like a connector to a commercial off-the-shelf system. The data is actually in your ERP that’s provided by vendor X. What we need is connectors to hook into that data to make it accessible for other things. The other one I put here, CDC anti-corruption layer. CDC stands for Change Data Capture, which is one way to do that, pull the rug of the data from the database from somewhere.
This is a way to try to avoid the leakage, where we can try to at least keep the encapsulation within that domain. We could say, we’ll use CDC because the data is in a vendor system that we don’t have access, but we’ll build a data product that acts as the anti-corruption layer for the data, for everyone else in the organization. Not everyone needs to solve that problem of understanding the CDC level event. Or, external data, if I bought data from outside or I’m consuming external feeds of data, maybe they might show up as a source-aligned in my mesh. The next one aggregate.
Aggregate, I keep arguing about with my teams as well, because lots of things are actually aggregates when we’re trying to do analytics. In a way, it’s like, is it really valid or not? We see just as an example like this, sometimes aggregates are closer to the producer. The example I have is, we’re working with the insurance company, and you can make claims on the website, so I can call on the phone. What happens behind the scenes is there’s different systems that manage the website and the phone. You’ve got claims data from two different sources, but for the rest of the organization, maybe it doesn’t matter where the claim comes from, so I might have a producer owned aggregate that combines and makes it easy for people downstream to use that. We might have this especially analytical models that are more like aggregates across domains.
When we’re trying to do KPIs, usually, business people ask about everything. You need to have somehow access to data that comes from everywhere. The common one is consumer owned. When we’re trying to do, maybe, like a data warehouse model, or something like that, you aggregate for that purpose. Then the other one, like a registry style. This one comes up when we’re trying to do, for instance, master data management in this world, where, let’s say we’ve got our entity resolution piece in the architecture, where we’ve got customer data everywhere. I aggregate into this registry to try to do the entity resolution, and then the output is like, how did I cluster those entities, that can be used either by other operational systems or for downstream analytics?
Then in consumer-aligned, there’s many of them. Data warehouse is a typical one. I put operational systems here as well because people talk about reverse ETL, where we take the output from analytics, put it back into the operational system. To me, that’s just an example. I’m using output from another product to feed back into the operational system. This was the example from the ITV one in the beginning, where we’re trying to feed that audience data into the CDP. It’s actually aggregating things and putting in the consumer for marketing.
I think metric store, feature store, building machine learning models, examples of things that are more on the consumer side. Or if we’re trying to build like an optimization type of product, maybe, that might need data from other places, and then the outputs might be the optimization. The interesting thing, if you think about this now, is that data products were born with the purpose of being analytical data products, but now we’re starting to feed things back into the operational world, which means the output ports can actually be subject to operational SLAs as well. This is a question that I get a lot.
It’s like, if I’m only using data products for analytics, how does it feed back into the operational world? In my view, I think that’s a valid pattern. Like the output port can be something that you need to maintain just as another operational endpoint that gives you, let’s say, model inference at the speed that the operational system needs.
Data Architecture at Enterprise Level
That leads me to the last level, at the enterprise level. Here we start getting more into organizational design. How is the business organized? How do we find those domains? A lot of that strategic design from DDD comes into play here, because we need to find, where are the seams between the different domains? Who’s going to own these different domains? One question that comes up a lot when we talk about data mesh specifically, is, data mesh is very opinionated about if you need to scale, if you got a lot of data, a lot of use cases, the centralized model will become a bottleneck at some point.
If you haven’t got to that point, maybe that’s a good way to start. If you’re trying to solve the problem of, my data teams are a bottleneck, then the data mesh answer is, look at a decentralized ownership model for data. A lot of that domain thinking data as a product is what enables you to decentralize ownership. This is a decision you can make. When we start mapping that to different domains or different teams, you’ve got, data products usually sit in a domain. Domain will have multiple data products.
The data products can be used both within that domain or across different domains. We will draw different team boundaries around that. The team could own one or multiple data products, which is another question I get a lot, like, if I’m going to do all this, do I need one team for each data product? Probably not. The team might be able to own multiple of them. Sometimes they can own all of the data products for a domain, if it’s something that fits in the team’s head. The key thing is they have that longer term ownership for those data products.
Then, this is another principle from data mesh to enable this. If you’re trying to decentralize, if you’re trying to treat data as a product, how do I avoid different teams building data in a different way, or building things that don’t connect to each other? The answer is, we need a self-serve platform that helps you build those in a more consistent way, and also to drive speed, because then you don’t need every team solving the same problem over again. The platform needs to think about, how does it support these different workloads, whether it’s batch or streaming? Is it analytical, or AI, operational things? The platform also enables effective data governance, so when we start tracking those things I was talking before about quality, lineage, access control. If the platform enables that, then it gives you a better means to implement your data governance policies.
It delivers more of this seamless experience, how data gets shared and used across the organization for both producers, consumers of data. It might be engineering teams, and eventually even business users. If you try to drive those democratizing of analytics, and we’ve got business people trained to use things, they could also be users in this world. How does the platform support these multiple workloads? Going back to the example I had before, I had the self-service BI here as an example of a platform capability, maybe. There’s actually more. In this case, I’ve picked a few examples.
I might have to pick different types of databases on the operational side. Maybe event stream platform to support how the events flow between the systems. Some kind of data warehouse technology here. I need a lot of the MLOps stack for how do I deploy the machine learning models? How do I manage all of that?
Another way to talk about, if you read the Data Mesh book, Zhamak talks about, data platform is actually multiplane. Oftentimes, when I go to clients, especially if they’re not doing data products or data mesh, they will already have something in place. They will have the analytics consumption plane and the infrastructure plane. Analytics is where they’re doing those visualizations or business reporting, or even, like I put data science, machine learning as a use case of data here that’s built on some infrastructure that they’ve been building, could be cloud, could be on-prem. Infrastructure will be the building blocks to actually store the data, process the data, put some controls around that data.
The thing that we also need, I think, in the book, she talks about the data mesh experience plane. If we’re abstracting away from data mesh, what that is, is really like a supervision plane. I’m trying to get a view of what’s across the entire landscape of data. The missing bit there is the data product experience plane. This is something that usually we don’t find because that is where we put capabilities around, how do I define the specification for the data product, the types of inputs and output ports that I will support? How do I support the onboarding and the lifecycle of managing these data products, metadata management? There’s a lot of things, even like, I think we’ve got data product blueprints.
When I was talking about the different flavors, if the organization has data products that are similarly flavored, one thing you can do is extract those as blueprints. When you need another one that’s very similar to that, then it’s even easier for the teams to create a new one. In the supervision plane, especially in data mesh, like we talk about computational governance, that’s the term that we use in the book. The idea is, actually we’re informing that, connecting to the infrastructure plane, and building that into the data product onboarding and the lifecycle management so that we can actually extract those things around data lineage, data quality, auditing.
If you need access audits for who’s doing what. This is a capability view, so I didn’t put any product or vendor here, but it’s a good way to think about that platform. You can see how it can quickly become very big. There’s lots of things that fall under this space. Usually, we will have one or multiple teams that are actually managing that platform as well, and they build and operate the infrastructure. The key thing is self-serve platform. If it’s not self-serve, then it means the teams that need things will have to ask the teams to build it. Here, self-service enables you to remove that bottleneck, so when they need it, they can get it.
One way that I really like to think about how to structure the teams and how to split the ownership is using team topologies. If you haven’t read this highly recommended book, it’s not just for data, it actually talks about software teams and infrastructure teams. The way that I really like is they use the idea of cognitive load as a way to identify like, is it too much for the team to understand? Maybe we can split things, split teams. How does value flow between these different teams? What’s the interaction modes between them? Here’s a quick hypothetical example using the team topologies terminology, if you’re used to it.
Maybe the self-serve data platform team offers the platform as a service, so that’s the interaction mode, and the data product teams are the stream-aligned teams from team topology terminology. Then we might have, let’s say, governance as an enabling team that can facilitate things within the product, and they might collaborate with the platform team to implement those computational aspects of governance into the platform itself.
The last bit I’ll talk about is governance, because when we talk about enterprise data usage, governance becomes important. How does the company deal with ownership, accessibility, security, quality of data? Governance goes beyond just the technology. This gets into the people and the processes around it. An example here, tying back to the data product space. Computational governance could be, we build a supervision dashboard like that, where we can see across the different domains, these are the data products that exist. I could see, do they meet the DATSIS principles or not? I even have some fitness function or tests to run against to see if the data product actually meets them or not.
Summary
Thinking about data, there’s a lot of things. I talked about data within the system, between systems, and at the enterprise level.
Questions and Answers
Participant: You mentioned the data platform, this self-service one, but that’s not something that you just get completely out of the blue one day. It needs to be built. It needs to be maintained. How do you get to that point? Especially considering like your data source is mostly operational data, do you use your operational data teams to be building that, or do you bring in data product teams, or a bit of mix and match? Because there are pros and cons, both of them.
Sato: I think if you’ve got ambition to have lots of data products then the platform becomes more important. When we’re starting projects, usually what we say is, even if you don’t start with all the teams that you envision to have, you should start thinking about the platform parts of it almost separate. We might build that together with the same team to start with, but very quickly, the platform team will have its own roadmap of things that they will need to evolve, and you will want to scale more data product teams that use it. Again, this is assuming you’ve got the scale and you’ve got the desire to do more with data. That’s the caveat on that. I think it’s fine to start with one team, but always keep in mind that those will diverge over time.

MMS • Daniel Dominguez
Article originally posted on InfoQ. Visit InfoQ

Stability AI has introduced three new text-to-image models to Amazon Bedrock: Stable Image Ultra, Stable Diffusion 3 Large, and Stable Image Core. These models focus on improving performance in multi-subject prompts, image quality, and typography. They are designed to generate high-quality visuals for various use cases in marketing, advertising, media, entertainment, retail, and more.
The new models offer a range of features: Stable Image Ultra delivers high-quality, photorealistic outputs, making it ideal for professional print media and large-format applications; Stable Diffusion 3 Large balances generation speed and output quality, suited for producing high-volume, high-quality digital assets such as websites, newsletters, and marketing materials; and Stable Image Core is optimized for fast, affordable image generation, perfect for rapidly iterating on concepts during the ideation phase.
These models address common challenges like rendering realistic hands and faces and offer advanced prompt understanding for spatial reasoning, composition, and style.
Stability AI Diffusion models can be used for inference calls using the InvokeModel and InvokeModelWithResponseStream operations. These models support various input and output modalities, which can be found in the Stability AI Diffusion prompt engineering guide.
To use a Stability AI Diffusion model, the model ID is required, which can be found in the Amazon Bedrock model IDs documentation. Some models also work with the Converse API, which can be checked in the Supported models and model features page.
It is important to note that the Stability AI Diffusion models support different features and are available in specific AWS Regions, which are listed in Model support by feature and Model support by AWS Region, respectively.
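As an illustration, a minimal InvokeModel call through the AWS SDK for .NET might look like the sketch below; the model ID, request payload, and response shape are assumptions to verify against the Amazon Bedrock model IDs and the Stability AI Diffusion prompt engineering guide.

```csharp
using System.Text;
using System.Text.Json;
using Amazon;
using Amazon.BedrockRuntime;
using Amazon.BedrockRuntime.Model;

var client = new AmazonBedrockRuntimeClient(RegionEndpoint.USWest2);

// Assumed text-to-image payload for the new Stability models.
string payload = JsonSerializer.Serialize(new { prompt = "A lighthouse at dawn, photorealistic" });

var response = await client.InvokeModelAsync(new InvokeModelRequest
{
    ModelId = "stability.stable-image-ultra-v1:0", // placeholder: check the Bedrock model IDs page
    ContentType = "application/json",
    Accept = "application/json",
    Body = new MemoryStream(Encoding.UTF8.GetBytes(payload))
});

// Assumed response shape: a list of base64-encoded images.
using JsonDocument doc = await JsonDocument.ParseAsync(response.Body);
string base64Image = doc.RootElement.GetProperty("images")[0].GetString()!;
await File.WriteAllBytesAsync("lighthouse.png", Convert.FromBase64String(base64Image));
```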
Stuart Clark, senior developer advocate at AWS, shared on his X profile:
Imagine creating stunning visuals with just text prompts, now you can!
User AI Master Tools commented:
AI in Creative Fields: Stability AI’s launch of a new model to Amazon BedRock and the debut of Luma AI’s Dream Machine 1.6 highlight the ongoing integration of AI into creative processes, from art to content generation.
The community’s response to Stability AI’s integration of its top three text-to-image models into Amazon Bedrock has been diverse, reflecting a mix of excitement, strategic insight, and critical perspectives. Enthusiasts are eager to see how this development will transform content creation across industries, while others are focused on the technical advantages and increased accessibility this integration provides. However, concerns about centralization, data privacy, and the impact on open-source AI remain part of the conversation.

MMS • Aditya Kulkarni
Article originally posted on InfoQ. Visit InfoQ

The Kubernetes project has recently announced the release of version 1.31, codenamed “Elli”. This version incorporates 45 enhancements, with 11 features reaching Stable status, 22 moving to Beta, and 12 new Alpha features introduced. Key features in this release include enhanced container security with AppArmor, improved reliability for load balancers, insights into PersistentVolume phase transitions, and support for OCI image volumes.
Matteo Bianchi, Edith (Edi) Puclla, Rashan Smith and Yiğit Demirbaş from the Kubernetes Release Communications team covered this announcement in a blog post. Kubernetes v1.31 marks the first release following the project’s 10th anniversary.
Among the stable features in this release, Kubernetes now fully supports AppArmor for enhanced container security. Engineers can use the appArmorProfile.type field in the container's securityContext for configuration. It is recommended to migrate from annotations (used before v1.30) to this new field.
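As a rough illustration of where the new field sits in a Pod spec, the sketch below builds a manifest as a plain Python dictionary and prints it as YAML; the container image and the AppArmor profile name are placeholders.

```python
# Minimal sketch of the stable appArmorProfile field under a container's
# securityContext. The profile name is an assumption for illustration.
import yaml  # PyYAML

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "apparmor-demo"},
    "spec": {
        "containers": [{
            "name": "app",
            "image": "nginx:1.27",
            "securityContext": {
                # Replaces the pre-v1.30 container.apparmor.security.beta.kubernetes.io annotations.
                "appArmorProfile": {
                    "type": "Localhost",
                    "localhostProfile": "k8s-apparmor-example-deny-write",
                },
            },
        }],
    },
}

print(yaml.safe_dump(pod, sort_keys=False))
```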
Kubernetes v1.31 now also offers stable improved ingress connectivity reliability for load balancers, minimizing traffic drops during node terminations. This feature requires kube-proxy as the default service proxy and a load balancer supporting connection draining. No additional configuration is needed as it’s been enabled by default since v1.30.
The latest release introduces a new feature to track the timing of PersistentVolume phase transitions. This is achieved through the addition of a lastTransitionTime field in the PersistentVolumeStatus, which records a timestamp whenever a PersistentVolume changes its phase (e.g., from Pending to Bound).
This information is valuable for measuring the duration it takes for a PersistentVolume to become available for use, thus aiding in monitoring and improving provisioning speed.
Furthermore, this feature provides valuable data that can be utilized to set up metrics and service level objectives (SLOs) related to storage provisioning in Kubernetes.
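A small sketch of how such a metric might be derived is shown below; it shells out to kubectl (assumed to be configured against a v1.31+ cluster where the field is populated) and reports how long each PersistentVolume has been in its current phase.

```python
# Report time spent in the current phase for each PersistentVolume, using the
# new status.lastTransitionTime field. Assumes kubectl access to a v1.31+ cluster.
import json
import subprocess
from datetime import datetime, timezone

raw = subprocess.run(
    ["kubectl", "get", "pv", "-o", "json"],
    check=True, capture_output=True, text=True,
).stdout

for pv in json.loads(raw).get("items", []):
    status = pv.get("status", {})
    transition = status.get("lastTransitionTime")
    if not transition:
        continue  # field absent on older clusters or before the first transition
    since = datetime.now(timezone.utc) - datetime.fromisoformat(transition.replace("Z", "+00:00"))
    print(f"{pv['metadata']['name']}: {status.get('phase')} for {since}")
```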
One of the features in the release, which is now in Alpha, is support for Open Container Initiative (OCI) compatible image volumes. Kubernetes v1.31 introduces an experimental feature that allows the direct use of OCI images as volumes within pods. This helps AI/ML workflows by enabling easier access to containerized data and models.
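The manifest shape below is a sketch based on the release's description of the alpha image volume source; it assumes the ImageVolume feature gate is enabled and uses a placeholder image reference for a packaged set of model weights.

```python
# Sketch of the alpha OCI image volume source: an OCI image is mounted
# read-only into the pod as a volume. The image reference is a placeholder.
import yaml  # PyYAML

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "model-server"},
    "spec": {
        "containers": [{
            "name": "inference",
            "image": "python:3.12-slim",
            "command": ["sleep", "infinity"],
            "volumeMounts": [{"name": "model-weights", "mountPath": "/models", "readOnly": True}],
        }],
        "volumes": [{
            "name": "model-weights",
            "image": {  # new OCI image volume source (alpha in v1.31)
                "reference": "registry.example.com/ml/llm-weights:v1",
                "pullPolicy": "IfNotPresent",
            },
        }],
    },
}

print(yaml.safe_dump(pod, sort_keys=False))
```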
The cloud native technology community showed particular excitement about this feature. Users of the Kubernetes subreddit took notice of the announcement post. One of the Reddit users expressed that this is a “very cool” feature, and in the same thread explained the benefits of having the model as an image.
AI and Kubernetes experts at Defense Unicorns (a Medium publication) also welcomed the use of OCI images to manage and share AI models, making the process smoother and more integrated with other tools.
The features that graduated to Beta include nftables API, the successor to the iptables API, that delivers improved performance and scalability. Notably, the nftables proxy mode processes service endpoint changes and packets more efficiently than iptables, particularly benefiting clusters with extensive service counts.
For further engagement, users can join the Kubernetes community on Slack or Discord, or post questions on Stack Overflow. Kubernetes v1.31 is available for download from the official website or GitHub.

MMS • Sergio De Simone
Article originally posted on InfoQ. Visit InfoQ

To efficiently and effectively investigate multi-tenant system performance, Netflix has been experimenting with eBPF to instrument the Linux kernel to gather continuous, deeper insights into how processes are scheduled and detect “noisy neighbors”.
Using eBPF, the Compute and Performance Engineering teams at Netflix aimed to circumvent a few issues that usually make noisy-neighbor detection hard. These include the overhead introduced by analysis tools like perf, which also means they are usually deployed only after the problem has occurred, and the level of engineering expertise required. What eBPF makes possible, according to Netflix engineers, is observing compute infrastructure with low performance impact, enabling continuous instrumentation of the Linux scheduler.
The key metric identified by Netflix engineers as an indicator of possible performance issues caused by noisy neighbors is process latency:
To ensure the reliability of our workloads that depend on low latency responses, we instrumented the run queue latency for each container, which measures the time processes spend in the scheduling queue before being dispatched to the CPU.
To this aim, they used three eBPF hooks: sched_wakeup, sched_wakeup_new, and sched_switch. The first two are invoked when a process goes from 'sleeping' to 'runnable', i.e., when it is ready to run and waiting for some CPU time. The sched_switch hook is triggered when the CPU is assigned to a different process. Process latency is thus calculated by subtracting the timestamp at which the process first became ready to run from the timestamp at which the CPU was assigned to it.
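The following is a minimal BCC-based sketch of that measurement, not Netflix's production code (which processes events in Go and attributes latency per cgroup): it timestamps tasks on the two wakeup tracepoints and, on sched_switch, records how long the incoming task waited in the run queue.

```python
# Minimal run-queue latency sketch with BCC (requires root and BCC installed).
# Aggregates globally into a log2 histogram rather than per cgroup.
import time

from bcc import BPF

bpf_text = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(runnable_ts, u32, u64);   // pid -> ns timestamp when it became runnable
BPF_HISTOGRAM(runq_latency_us);    // log2 histogram of run-queue latency

static __always_inline void mark_runnable(u32 pid)
{
    u64 ts = bpf_ktime_get_ns();
    runnable_ts.update(&pid, &ts);
}

TRACEPOINT_PROBE(sched, sched_wakeup)     { mark_runnable(args->pid); return 0; }
TRACEPOINT_PROBE(sched, sched_wakeup_new) { mark_runnable(args->pid); return 0; }

TRACEPOINT_PROBE(sched, sched_switch)
{
    // The task being switched in has just left the run queue.
    u32 pid = args->next_pid;
    u64 *tsp = runnable_ts.lookup(&pid);
    if (tsp) {
        u64 delta_ns = bpf_ktime_get_ns() - *tsp;
        runq_latency_us.increment(bpf_log2l(delta_ns / 1000));
        runnable_ts.delete(&pid);
    }
    return 0;
}
"""

b = BPF(text=bpf_text)
print("Sampling run-queue latency for 10 seconds...")
time.sleep(10)
b["runq_latency_us"].print_log2_hist("usecs")
```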
Finally, the events collected by instrumenting the kernel are processed by a Go program that emits metrics to Atlas, Netflix's metrics backend. To pass collected data to the userspace Go program, Netflix engineers decided to use eBPF ring buffers, which provide an efficient, high-performing, and user-friendly mechanism that does not require extra memory copying or syscalls.
Along with the timing information, eBPF also makes it possible to collect additional information about the process, including the process's cgroup ID, which associates it with a container and is key to correctly interpreting preemption. Indeed, detecting noisy neighbors is not just a matter of measuring latency; it also requires tracking how often a process is preempted and which process caused the preemption, whether or not it runs in the same container.
For example, if a container is at or over its cgroup CPU limit, the scheduler will throttle it, resulting in an apparent spike in run queue latency due to delays in the queue. If we were only to consider this metric, we might incorrectly attribute the performance degradation to noisy neighbors when it’s actually because the container is hitting its CPU quota.
To make sure their approach did not hamper the performance of the monitored system, Netflix engineers also created a tool, bpftop, to measure eBPF code overhead. Using this tool, they identified several optimizations that further reduced the overhead, keeping it below a 600-nanosecond threshold for each sched_* hook. This makes it reasonable to run the hooks continuously without fearing an impact on system performance.
If you are interested in this approach to system performance monitoring or understanding better how eBPF works internally, the original article provides much more detail than can be covered here, including useful sample code.
Presentation: Are You Done Yet? Mastering Long-running Processes in Modern Architectures

MMS • Bernd Ruecker
Article originally posted on InfoQ. Visit InfoQ

Transcript
Ruecker: I talk about long running, not so much about exercise, actually. We want to start talking about food first, probably more enjoyable. If you want to order pizza, there are a couple of ways of ordering pizza. You probably have ordered a pizza in the past. If you live in a very small city like I do, if you order pizza, what you do is actually you call the pizza place. That’s a phone call. If I do a phone call, that’s synchronous blocking communication. Because I’m blocked, I pick up the phone, I have to wait for the other person to answer it. I’m blocked until I got my message and whatever, whatnot.
If then the person answers me, I get a direct feedback loop. Normally, that person either tells me they make my pizza or they don’t. They can reject it. I get a direct feedback. I’m also temporarily coupled to the availability of the other side. If the person is currently not available to pick up the phone, if they’re already talking on another line, they might not be able to take my call. Then it’s on me to fix that. I have to probably call them again in 5 minutes, or really stay on the line to do that. Synchronous blocking communication. What would be an alternative? I know you could probably use the app. Again, I can’t do that where I live. You could send an email. An email puts basically a queue in between. It’s asynchronous non-blocking communication, and there’s no temporal coupling.
I can send the email, even if the peer is not available, even if they take other orders. How does it make you feel if you send an email to your pizza place? Exactly that, because there is no feedback loop at all. Do they read my email? I pick up the phone to call them. It could be. It’s not a technical restriction that there is no feedback loop. They could simply answer the email saying, we got your order, and you get your pizza within whatever, 30 minutes. You can do a feedback loop again, asynchronously. It’s not really the focus of today. I have another talk also talking about that this is not the same. You can have those interaction patterns decoupled basically from the technology you’re using for doing that. Synchronous blocking communication, asynchronous non-blocking.
The most important thing is on the next slide. Even if I do that independent of email, or phone, it’s important to distinguish that the feedback loop is not the result. I’m still hungry. They told me they send a pizza, I’m probably even more hungry than before the result of the pizza. The task of pizza making is long running, so it probably goes into a queue for being baked. It goes into the oven. They hopefully take the right time to do that.
They can’t do that in a rush. Then the pizza is ready, and it needs to be delivered to me. It’s always long running, it takes time. It’s inherently there. That’s actually a pattern we see in a lot of interactions, not only for pizza, but for a lot of other things. We have a first step, that synchronous blocking, but we have an asynchronous result later on.
Could you do synchronous blocking behavior for the result, in that case? Probably not such a good idea. If you take the example not of pizza but of coffee. If you go to a small bakery and order coffee, what happens is that the person behind the counter takes your order, takes your money, basically turns around, going for the coffee machine, presses a button, waits for the coffee to come out of that. Going back to you, gives you the cup. It's synchronous blocking. They can't do anything else. I can't do anything else.
We’re waiting for the coffee to get ready. If you have a queue behind you, and if you’re in a good mood to make friends, you probably order 10 coffees. It takes a while. It’s not a good idea. It’s not a good user experience here and it doesn’t scale very well. The coffee making is relatively quick compared to the pizza making and other things. It doesn’t have to be that way. There’s a great article from Gregor Hohpe. He called it, “Starbucks Doesn’t Use Two-Phase Commit.” He talked about scalable coffee making at Starbucks, where you also separate the two things. The first step is the synchronous blocking thing. I go to the counter, order, pay. Then they basically ask for my names or correlation identifier. Then they put me in a queue, saying, to the baristas, make that coffee for Bernd. Then these baristas are scaled independently.
There might be more than one, for example, doing the coffee, and then I get the coffee later on. That scales much better. That’s another thing you can recognize here, it also makes it easier to change the experience of the whole process. A lot of the fast-food chains have started to replace, not fully replaced, but replace some of the counters or the humans behind the counter with simply ordering by the app. Because that’s very easy for the first step, but not so easy for the coffee making. There’s robotics also for that. There are videos on the internet, how you can do that, but it’s not on a big scale. Normally, the baristas are still there, the coffee making itself. We want to distribute those two steps.
With that in mind, if I come back to long running, when I say long running, I don’t refer to any AI algorithm that runs for ages until I get a result. No, I’m basically simply referring to waiting. Long running for me is waiting because I have to wait for certain things, that could be human work, the human in the loop, like we just heard, because somebody has to prove something. Somebody has to decide something that are typically things, or somebody has to do something. Waiting for a response, I sent whatever inquiry to the customer, and they have to give me certain data.
They have to make their decision. They have to sign the document, whatever it is, so I have to wait for that. Both of those things are not within seconds, they can be within hours, days, or even weeks, sometimes even longer. Or I simply want to let some time pass. The pizza baking is one example, but I had a lot of other examples in the past. One of my favorites was a startup. They did a manufactured service, which was completely automated, but they wanted to make the impression to the customer that it’s like a human does it. They waited for a random time, between 10 and 50 minutes, for example, to process a response. There are also more serious examples.
Why Is Waiting a Pain?
Why is waiting a pain? It basically boils down to because we have to remember that we are waiting. It’s important to not forget about waiting. That involves persistent state. Because if I have to wait not only for seconds, but minutes, hours, days, or weeks, or a month, I have to persist it somewhere to still remember it when somebody comes back. Persistent state. Is that a problem? We have databases? We do. There are a lot of subsequent requirements, if you look at that.
For example, you have to have an understanding what you’re waiting for. You probably have to escalate if you’re waiting for too long. You have versioning problems, like if I have a process that runs for a month, and I start at like every day a couple of times, I always have processes in flux. If I want to change the process, I have to think about already running ones, and probably do something different for them than for newer ones, for example. I have to run that at scale. I want to see where I’m at, and a lot of those things.
The big question is, how do I do that? How do I solve those technical challenges without adding accidental complexity? That’s what I’m seeing, actually, quite often. I wrote a blog post, I think, more than 10 years ago, where I said, I don’t want to see any more homegrown workflow engines. Because people stumble into that, like we simply have to write a status flag in the database. Then we wait, that’s fine. Then they start, “We have to remember that we have to have a scheduler. We have to have an escalation. People want to see that.” They start adding stuff. That’s not a good idea to do.
Background
I’m working on workflow engines, process engines, orchestration engines, however you want to call them, for almost all my life, at least my professional life. I co-founded Camunda, a process orchestration company, and wrote a lot of things in the past about it. I’ve worked on a couple of different open source workflow engines as well in the past.
Workflow Engine (Customer Onboarding)
One of the components that can solve these long running issues is a workflow engine. We're currently more going towards naming it an orchestration engine, some call it a process engine. It's all fine. The idea is that you define workflows, which you can run instances of, and then you have all these requirements being settled. I wanted to give you a 2-minute demo, not because I want to show the tool, that's a nice side effect. There are other tools doing the same thing. I want to get everybody to the same page of, what is that? What's a workflow engine? If you want to play around with that yourself, there's a link.
It’s all on GitHub, so you can just run it yourself. What I use as an example is an onboarding process. We see that in basically every company to some extent. You want to open up a new bank account, you go through an onboarding process, as a bank. You want to have a new mobile phone contract, you go through onboarding. If you want to have new insurance contract, onboarding. It’s always the same. This is how it could look like. What I’m using here, it’s called BPMN, it’s an ISO standard, how to define those processes.
You do that graphically. In the background, it’s simply an XML file basically describing that. It’s standardized, ISO standard. That’s not a proprietary thing here. Then you can do things like, I score the customer, then I approve the order. That’s a manual thing. I always like adding things live with the risk of breaking down. We could say, that takes too long. We want to basically escalate that. Let’s just say, escalate. Yes, we keep it like that. We have to say what too long is. That’s a duration with a period, time, 10 seconds should be enough for a person to review it. I just save that.
What I have in the background is a Java application, in this case. It doesn’t have to be Java, but I’m a Java person. It’s a Java Spring Boot application basically that connects to the workflow engine, in this case also deploys the process. Then also provides a small web UI. I can open a new bank account. I don’t even have to type in data because it does know everything. I submit the application. It triggers a REST call basically. The REST call goes into the Spring Boot application. That kicks off a process instance within the workflow engine. I’m using our SaaS service, so you have tools like Operate, where it can look into what’s going on.
There it can see that I have processes running. You see the versioning. I have a new version. I have that instance running. If I kick off another one, I get a second one in a second. I’m currently waiting for approval. I also already have escalated it, at the same time. Then you have tasks list, because I’m now waiting for a human, for example. I have UI stuff. I could also do that via chatbot or teams’ integration, for example. Yes, to automatic processing, please. Complete the task. Then this moves on. I’m seeing that here as well. I’m seeing that this moves on, and also sends an email. I have that one.
Process instance finish, for example. It runs through a couple of steps. Those couple of steps then basically connect to either the last two things I want to show, for example, create customer in CRM system, is, in this case, tied to a bit of Java code where it can do whatever you want to. That’s custom glue code, you simply can program it. Or if you want to send a welcome email, you already see that. That’s a pre-built connector. For example, for SendGrid, I can simply configure. That means in the background, also, my email was sent, which I can also show you hopefully here. Proof done, “Hello, QCon,” in email. We’re good.
That’s a workflow engine running in the background. We are having a workflow model. We have instances running through. We have code attached, or UIs attached to either connect to systems or to the human. Technically, I was using Camunda as a SaaS service here, and I had a Spring Boot application. Sometimes I’m being asked, ok, workflow, isn’t that for these like, I do 10 approvals a day things? No. We’re having customers running that at a huge scale. There’s a link for a blog post where we go into the thousands of process instances per second.
We run that in geographically distributed data centers in the U.S. and UK, for example, and this adds latency, but it doesn’t bring throughput down, for example. We are also working to reduce the latency of certain steps. What I’m trying to say is that that’s not only for I run five workflows a day, you can run that at a huge scale for core things.
When Do Services Need to Wait? (Technical Reasons)
So far, I looked at some business reasons why we want to wait. There are also a lot of technical reasons why you want to wait for things, why things get long running. That could be, first of all, asynchronous communication. If you send a message, you might not know when you get a message back. It might be within seconds in the happy case or milliseconds. What if not, then you have to do something. If you have a failure scenario, you don’t get a message back, you want to probably just stop where you are, and then wait for it to happen.
Then probably you can also notify an operator to resolve that. Or the peer service is not available, so especially if you go into microservices, or generally distributed systems, the peer might not be available, so you probably have to do something about it. You have to wait for that peer to become available. That’s a problem you should solve. Because otherwise, yes, you get to that. You get chain reactions, basically.
The example I always like to use is this one. If you use an airplane, you get an email invitation to check in a day before, 24 hours before that normally. Then you click a link, and you should check in. I did that for a flight actually to London. I think that was pre-pandemic, 2019, or something like that. I flew to London with Eurowings. I wanted to check in, and what it said to me was, “There was an error while sending you your boarding pass.” I couldn’t check in. That’s it. What would you do? Try it again. Yes, of course. I try it again. That’s what I did. Didn’t work. I tried it again 5 minutes later, didn’t work.
What was the next thing I did? I made a calendar entry in my Outlook, to remind me of trying it again in a couple of hours. Because there was still time. It wasn’t the next day. I just wanted to make sure not to forget to check in. That’s what I call a stateful retry. I want to retry but in a long running form, like 4 hours from now because it actually doesn’t work. It doesn’t matter because I don’t need it yet now.
The situation I envision is that, in the background, they had their web interface, they probably had a check-in microservice. They probably had some components downstream required for that to work, for example, the barcode generation, or document output management, or whatever. One of those components did fail. The barcode generation, for example, didn’t work, so they couldn’t check me in. The thing is that the more we distribute our systems into a lot of smaller services, the more we have to accept that certain parts are always broken, or that network to certain parts are always broken.
That’s the whole resiliency thing we’re discussing about. The only thing that we have to make sure, which is really important, that it doesn’t bring down our whole system. In other words, just that the 3D barcode generation, which is probably needed for my PDF boarding pass, I need to print out later, is not working, shouldn’t prevent my check-in. That’s a bad design. That’s not resilient. Because then you get a chain reaction here. The barcode generation is not working, probably not a big deal. It gets to a big deal because nobody can check in anymore. They make it my problem.
They transport the error all the way up to me, for me to resolve because I'm the last one in the chain. Everybody throws the hot potato one step further; I'm the last part in the chain as a user. That makes me responsible for the Outlook entry. The funny part about that story was really, the onwards flight, same trip from London, easyJet, "We are sorry." Same problem, I couldn't check in, but they give you the work instruction. They are better with that. "We're having some technical difficulties, log on again, retry. If that doesn't work, please try again in 5 minutes." I like that, increase the interval. That makes a lot of sense. You could automate that probably.
The next thing, and I love that, “We do actively monitor our site. We’ll be working to resolve the issue. There’s no need to call.” It’s your problem, leave us alone. In this case, it’s very obvious because it’s facing the user. It’s an attitude I’m seeing in a lot of organizations, even internally to other services, their problem, which is, throw an error, we’re good.
The much better situation would be the check-in should probably handle that. They should check me in. They could say, you’re checked in, but we can’t issue the boarding pass right now, we’re sorry, but we send it on time. Or, you get it in the app anyway. I don’t want to print it out, don’t need a PDF. They could handle it in a much more local scope. That’s a better design. It gives you a better designed system. The responsibilities are much cleaner defined, but the thing is now you need long running capabilities within the check-in service. If you don’t have them, that’s why a lot of teams are rethrowing the error.
Otherwise, we have to keep state, we want to be stateless. That’s the other part, which I was discussing with a lot of customers over the last 5 years. The customer wants a synchronous response. They want to see a response in the website where it says you’re checked in, here’s your boarding pass, here’s the PDF, and whatever. We need that. People are used to that experience. I wouldn’t say so. If my decision as the customer is either I get a synchronous error message and have to retry myself or I get some result later on. I know what I’d pick. It’s still a better customer experience. It needs a little bit of rethink, but I find it important.
Let’s extend the example a little bit and add a bit more flavor on a couple of those things. Let’s say you’re still doing flight bookings, but maybe you also want to collect payments for it. That would make sense as a company. The payments might need credit card handling, so they want to take money from the credit card. Let’s look at that. The same thing could happen. You want to charge the credit card. The credit card service at least internally but maybe also on that level will use some SaaS whatever service in the internet. You will probably not do credit card handling yourself unless you’re very big, but normally, you use some Stripe-like mechanism to do that.
You will call an external API, REST, typically, to make the credit card charge. Then you have that availability thing. That service might not be available when you want to charge a credit card. You probably also then have the same thing, you want to charge it and want to probably wait for availability of the credit card service, because you don’t want to tell your customers, we can’t book your flight because our credit card service is currently not available. You probably want to find other ways. That’s not where it stops. It normally then goes beyond that, which is very interesting if you look into all the corner cases.
Let’s say you give up after some time, which makes sense. You don’t want to try to book the flight for tomorrow, for the next 48 hours. It does make sense. You give up at some point in time. You probably say the payment failed, and we probably can’t book your flight, or whatever it is that you do. There’s one interesting thing about distributed systems, if you do a remote call, and you get an exception out of that, you can’t differentiate those three situations. Probably the network was broken, you have not reached the service provider.
Maybe the network was great but the service provider, the thread exploded while you were doing it. It didn’t process it. Did it commit its transaction or not? You have no idea. Or everything worked fine and the response got lost in the network. You can’t know what just happened. That makes it hard in that scenario, because even if you get an exception, you might have charged the credit card, actually.
It might be a corner case, but it's possible. Depending on what you do, you might not want to ignore it. Maybe you can. If that's a conscious decision, that's fine. Maybe you can't, then you have to do something about that. You also can do that in a workflow way. You could also run monthly reconciliation jobs, probably also a good solution. It always depends. If you want to do it in a workflow way, you might even have to check if it was charged and refunded, so it gets more complicated. That's what I'm trying to say.
In order to do these kinds of things, again, embrace asynchronous thinking. Make an API that’s ready to probably not deliver a synchronous result. That’s saying, we try our best, maybe you get something in a good case, but maybe you don’t. Then, that’s HTTP codes. I like to think in HTTP codes, like 202 means we got your request, that’s the feedback loop, we got it, but the result will be later. Now you can make it long running, and that extends your options, what it can do. Speaking of that, one of the core thoughts there is also, if you make APIs like that, make it asynchronous, make it be able to handle long running.
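As an illustration of that contract (not code from the talk), here is a minimal sketch of a payment endpoint that immediately returns HTTP 202 together with a status resource the caller can poll; FastAPI, the endpoint paths, and the in-memory store are illustrative choices.

```python
# Sketch of the "202 Accepted plus status resource" pattern described above.
# Framework, paths, and storage are assumptions for illustration only.
import uuid

from fastapi import BackgroundTasks, FastAPI, Response

app = FastAPI()
payments = {}  # payment id -> status; in-memory for the sketch


def charge_credit_card(payment_id: str) -> None:
    # Placeholder for the potentially long-running part: retries, waiting for
    # the provider to become available, asking the customer for new card details.
    payments[payment_id] = "completed"


@app.post("/payments", status_code=202)
def retrieve_payment(background_tasks: BackgroundTasks, response: Response):
    payment_id = str(uuid.uuid4())
    payments[payment_id] = "pending"
    background_tasks.add_task(charge_credit_card, payment_id)
    # Feedback loop: "we got your request"; the result will arrive later.
    response.headers["Location"] = f"/payments/{payment_id}"
    return {"paymentId": payment_id, "status": "pending"}


@app.get("/payments/{payment_id}")
def payment_status(payment_id: str):
    return {"paymentId": payment_id, "status": payments.get(payment_id, "unknown")}
```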
Within your services, you’re more free to implement requirements the way you want. Let’s say you extend the whole payment thing, not only to credit cards, but probably to also have customer credits on their account. Some companies allow that. If you return goods, for example, you get credits on your account, which you can use for other things, or PayPal has that. If you get money sent via PayPal, it’s on your PayPal account, you can use that first before they deduct it from your bank account, for example. Then you could add that where you say, I first deduct credit and then I charge the credit card, and you get more options of doing that also long running. That poses interesting new problems around really consistency. For example, now we have a situation where we talk to different services, probably for credit handling, or for credit card charging.
All of them have their transactions internally, probably, but you don’t have a technical transaction spawning all of those steps. Where you say, if the credit card charging fails, I also didn’t deduct the customer credit, I just say payment failed. I need to think about these scenarios where a deducted customer credit card charge doesn’t work. I want to fail the payment. Then I have to basically rebook the customer credit. That’s, for example, also what you can do with these kinds of workflows. That’s called compensation. Where you say, I have compensating, like undo activities for activities if something failed. The only thing I’m trying to say here is, it gets more complex very quickly if you think about all the implications of distributed systems here.
Long Running Capabilities (Example)
Going back to the long running capabilities. My view on that is, you need long running capabilities to design good services, good service boundaries. That's a technical capability you should have in your architecture. I made another example to probably also make it easier to grasp. Let's say the booking service basically tells the payment service via a REST call, saying, retrieve payment. I won't discuss orchestration versus choreography, because that could be something you're also interested in. Why doesn't it just emit an event? Booking says, payment, retrieve payment for that flight, for example. Payment charges the credit card. Now let's say the credit card is rejected. Service is available, but the credit card is rejected. That very often happens in scenarios where I store the credit card in my profile, it's expired, and then it gets rejected.
Now the next question is what to do with that. Typically, a requirement could be, if the credit card is rejected, the customer can provide new details. They hopefully still book their flight. We want them to do that. They need to provide new credit card details. You can also think about other scenarios. Somewhere I have the example of GitHub subscriptions, because there, it’s a fully automated process that renews my subscription, uses my credit card. It doesn’t work, they send you an email, “Update your credit card.”
The question is where to implement that requirement. One of the typical reactions I’m seeing in a lot of scenarios is that, as a payment, we’re stateless again. We want to be simple. We can’t do that, because then we have to send the customer an email. We have to wait for the customer to update the credit card details. We have to control that whole process. It gets long running.
They understand it adds complexity, they don’t want to do that. Just hot potato forward to the booking, because the booking is long running anyway, for a couple of reasons. They also have that. They can handle that requirement better, so let’s just throw it over the fence over there. I’m seeing that very often, actually. If you make the same example with order fulfillment, or other things where it’s very clear that that component, like booking, order fulfillment has a workflow mechanism, then this happens. The problem is now you’re leaking a lot of domain concepts, out of payment into booking, because booking shouldn’t know about credit card at all. They want to get money. They want to have the payment. They shouldn’t care about the way of payment. Because that probably also changes over time, and you don’t always want to adjust the booking, just because there’s a new payment method.
It’s a better design to separate that. That’s questionable. If you go into DDD, for example, it also leaks domain language, like, credit card rejected. I don’t care, I wanted to retrieve payment. Either you got my payment or you didn’t. That’s the two results I care about as booking. You want to really put it into the payment service. That makes more sense. Then, get a proper response, like the final thing. In order to do that, you have to deal with long running requirements within payment. That’s the thing. You should make that easy for the teams to do that.
I added potentially on the slide. In such a situation, payment in 99% of the cases might be really super-fast, and could be synchronous. Then there are all these edge cases where it might not be and it’s good to be able to handle that. Then you can still design, for example, an API versus say, in the happy case I get a synchronous result. It’s not an exceptional case. It’s just one case. The other case could be, I don’t get that. I get an HTTP 202, and an asynchronous response. Make your architecture ready for that. Then you could use probably also workflows for implementing that.
Just because there’s a workflow orchestration doesn’t mean it’s a monolithic thing. I would even say, the other way round, if you have long running capabilities available in the different services you might want to do, it gets easier to put the right parts of the process in the right microservices, for example, and it’s not monolithic at all. It gets monolithic if, for example, payment doesn’t have long running capabilities, and you move that logic into the booking service, just because that booking service has the possibility to do long running. I find that important. It’s not that having orchestration, or long running capabilities adds the monolithic thing. It’s the other way round, because not all the services have them at their disposal. Normally, what they do is they push all the long running stuff towards that one service that does, and then this gets monolithic. From my perspective, having long running at the disposal for everybody avoids these, what Sam Newman once called, god services.
Successful Process Orchestration (Centers of Excellence)
Long running capabilities are essential. It makes it easier to distribute all the responsibilities correctly. Also, it makes it easier to embrace asynchronous, non-blocking stuff. You need a process orchestration capability. That’s what I’m convinced of. Otherwise, probably, I wouldn’t do it for all my life. That’s also easy to get as a team. Nowadays, that means as a service, either internally or probably also externally, to create a good architecture. I’m really convinced by that. Looking into that, how can I do that? How can I get that into the organization better? What we’re seeing very successful, all organizations I talk with that use process orchestration to a bigger extent, very successfully, they have some Center of Excellence, organizationally. They not always call it Center of Excellence. Sometimes it’s a digital enabler, or even process ninjas. It might be named very differently. That depends a little bit on company culture and things.
It’s a dedicated team within the organization that cares about long running, if you phrase it more technically, or process orchestration, process automation, these kinds of things. This is the link, https://camunda.com/process-orchestration/automation-center-of-excellence/, for a 40-page article where we collected a lot of the information about Center of Excellence: how to build them, what are best practices to design them, and so on. One of the core ideas there is, a Center of Excellence should focus on enablement, and probably providing a platform.
They should not create solutions. Because sometimes people ask me, but we did that BPM, where we had these central teams doing an ESB and very complicated technology and didn’t work. It didn’t work, because at that time, a lot of those central teams had to be involved in the solution creation. They had to build workflows. It was not possible without them. That’s a very different model nowadays. You normally have a central team that focuses on enabling others that then build the things. Enabling means probably consulting, helping them, building a community, but also providing technology where they can do that.
What I’m discussing very often within the last two or three years is, but we stopped doing central things. We want to be more autonomous. We have the teams, they should be free in their decisions. We don’t want to put too much guardrails on them. Isn’t a central CoE the path? Why do you do that? I discuss that with a lot of organizations actually. I was so happy about the Team Topologies book. That’s definitely a recommendation to look into. The core ideas are very crisp, actually. In order to be very efficient in your development, you have different types of teams. That’s the stream-aligned team that does business logic, that implements business logic, basically. They provide value. That’s very often also value streams and whatever. You want to make them as productive as possible to remove as much friction as possible so they can really provide value, provide features. In order to do that you have other types of teams.
The two important ones are the enabling team, a consulting function, like hopping through the different projects, and the platform team, providing all the technology they need, so they don’t have to figure out everything themselves. The complicated subsystem team is something we don’t focus on too much. It can be some fraud check AI thing somebody does, and then provides an internal as a service thing. You can map that very well. Our customers do that actually very well to having a Center of Excellence around process orchestration, automation, for example.
Where you say they provide the technology. In our case, that’s very often Camunda, but it could be something else. Very often, they also own adjacent tools like RPA tools, robotic process automation, and others. They provide the technology and also the enablement: project templates, and whatnot. That’s very efficient, actually. It frees the teams of figuring out that themselves, because that’s so hard. As a team, if you don’t have an idea how you build your stack, you can go into evaluation mode for two or three months, and you don’t deliver any business value there. That’s actually not new. There are a couple of recommendable blog posts out there also talking about that. One is the thing from Spotify. Spotify published about Golden Path, 2020, where they basically said, we want to have certain defined ways of building a certain solution type. If we build a customer facing web app, this is normally how we do it.
If we build a long running workflow, this is how we do it. They have these kinds of solution templates. The name is good, actually, they name it Golden Path, because it’s golden. They make it so easy to be used. They don’t force teams to use it. That’s the autonomy thing. They don’t force it upon people. They make it desirable to be used. They make it easy. It’s not your fault if it’s not working. Then it’s golden. I like the blog post, actually, I love that quote, because they found that rumor-driven development simply wasn’t scalable. “I heard they do it like that, probably you should do that as well.” Then you end up with quite a slew of technology that doesn’t work. I find this really important that you want to consolidate on certain technologies. You want to make it easy to use them across the whole organization. That makes you efficient. Don’t force it upon the people.
They also have a tool. That’s a big company, they do open source on the side. They made backstage.io. I have no idea if the tool is good. I have not used it at all. I love the starting page of their website, The Speed Paradox, where they said, “At Spotify, we’ve always believed in the speed and ingenuity that comes from having autonomous development teams, but as we learn firsthand, the faster you grow, the more fragmented and complex your software ecosystems become, and then everything slows down again.” The Standards Paradox, “By centralizing services and standardizing your tooling, Backstage streamlines your development environment.
Instead of restricting autonomy, standardization frees your engineers from infrastructure complexity.” I think that’s an important thought. They’re not alone. If you search the internet, you find a couple of other places, for example, Twilio, but also others. Same thing. We’re offering paved path, mature services, pull off the shelf, get up and running super quickly. What you do is create the incentive structure for teams to take the paved path, because it’s a lot easier. If they really have to go a different route, you make it possible. It’s not restricting autonomy, simply helping them. That’s important. I think it’s also important to discuss that internally.
Graphical Models
Last thing, graphical models. That’s the other thing I discuss regularly. Center of Excellence, yes, probably makes sense. Process orchestration, yes, I understand why we have to do that. Graphical models? We’re developers. We write code. Thing is, BPMN, that’s what I showed. It’s an ISO standard. It’s worldwide adopted. It can do pretty complex things. I just scratched the surface. It can express a lot of complex things in relatively simple model, so it’s powerful. It’s living documentation. It’s not a picture that’s requirement, but it’s running code. That’s the model you put into production. It’s running code. That’s so powerful.
This is an example where it’s used for test cases. That’s what the test case tests, for example. You can leverage that as a visual. Or it can use it in operations like, where is it stuck, or what is the typical way it’s going through, or where are typical bottlenecks, and so on? You can use that to discuss that also with different kinds of stakeholders, not only developers, but all of them.
If you discuss a complex algorithm, like a longer process or workflow, you normally go to the whiteboard and sketch it because we’re visual as a human. Just because I’m a programmer doesn’t make me less visual. I want to see it. Very powerful. It’s even more important, because I think a lot of the decisions about long running behavior needs to be elevated to the business level.
They need to understand, why we want to get asynchronous. Why this might take longer. Why we need to change, also customer experience to leverage the architecture. The only way of doing that is to really make it transparent, to make it visual. I think it was a former marketing colleague that worked with me, phrased it like that. What you’re trying to say is that in order to leverage your hipster architecture, you need to redesign the customer journey. That’s exactly that. That’s important to keep in mind.
Example (Customer Experience)
I want to quickly close that with another flight story. The first thing that's happening is you get everything asynchronously. They did change the customer experience a lot. Now I'm working on train companies. That's the same thing. Mobile. You get automatically checked in for flights. You don't even have to do that. Why should I do that? My flight to London was delayed by an hour. Ok, that's delayed. That was canceled. That's not so nice. Then I got, relatively quickly, an automated email, that's the only one in German, which I don't get why. Did I get that one in German? It wasn't German.
I got the link to book my hotel at Frankfurt airport. Why? I don’t want to get a hotel in Frankfurt, I want to get to London. Everything automated, everything pushed. Nice. Then I got, via the app not via email, a link to a chatbot where I should chat about my flight. It says, we rebooked you for tomorrow morning. It didn’t do that completely because it’s not Lufthansa, so you have to see a human colleague. I don’t want to get to London tomorrow, I want to get there today. I basically visit a counter.
The end of the story is they could rebook me to a very late flight to London, Heathrow, which was very late. I hated that. What I still liked: everything was asynchronous. I got notifications of everything in the app and via email. I think there's some good things on the horizon there. The customer experience for airlines at least changed quite a bit over the last 5 years. Funny enough, last anecdote, I read an article about the bad NPS score of Lufthansa, and I probably understand why.
Recap
You need long running capabilities for a lot of reasons. Process orchestration platforms, workflow engines, great technology. You should definitely use that for those, because it allows you to design better service boundaries, implement quicker, less accidental complexity. You can embrace asynchronicity better. Provide a better customer experience. We haven’t even talked about the other stuff like increased operational efficiency, automation, reduce risk, be more compliant, document the process, and so on. In order to do that successfully across the organization, you should organize some central enablement. I’m a big advocate for that, to really adopt that at scale.
See more presentations with transcripts

MMS • Anthony Alford
Article originally posted on InfoQ. Visit InfoQ

Researchers at Apple and the Swiss Federal Institute of Technology Lausanne (EPFL) have open-sourced 4M-21, a single any-to-any AI model that can handle 21 input and output modalities. 4M-21 performs well “out of the box” on several vision benchmarks and is available under the Apache 2.0 license.
4M-21 is a 3B-parameter Transformer-based encoder-decoder model. All 21 input modalities are mapped to discrete tokens using modality-specific tokenizers, and the model can generate any output modality given any input modality. The model was trained on around 500 million samples of multimodal data, including COYO and C4. Out of the box, 4M-21 can perform a wide range of tasks, including steerable image generation and image retrieval. On vision benchmarks including semantic segmentation and depth estimation, it outperformed comparable baseline models. According to Apple:
The resulting model demonstrates the possibility of training a single model on a large number of diverse modalities/tasks without any degradation in performance and significantly expands the out-of-the-box capabilities compared to existing models. Adding all these modalities enables new potential for multimodal interaction, such as retrieval from and across multiple modalities, or highly steerable generation of any of the training modalities, all by a single model.
4M-21 builds on Apple’s earlier model, Massively Multimodal Masked Modeling (4M), which handled only seven modalities. The new model triples the modalities, which include text and pixel data, as well as “multiple types of image, semantic and geometric metadata.” Each modality has a dedicated tokenizer; text modalities use a WordPiece tokenizer, while image modalities use variational auto-encoders (VAE). The model is trained using a single objective: “a per-token classification problem using the cross-entropy loss.”
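The quoted objective amounts to an ordinary per-token classification over the discrete modality tokens. The generic PyTorch sketch below illustrates the shape of such a loss; it is illustrative only and not code from the 4M repository.

```python
# Generic per-token cross-entropy over discrete modality tokens (illustrative).
# Logits come from a decoder over a shared token vocabulary; targets are the
# tokenized outputs. Shapes and vocabulary size are placeholders.
import torch
import torch.nn.functional as F

batch, seq_len, vocab_size = 4, 128, 16_000
logits = torch.randn(batch, seq_len, vocab_size)          # decoder outputs
targets = torch.randint(0, vocab_size, (batch, seq_len))  # discrete target tokens

# Flatten so every token position becomes one classification example.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```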
By allowing inputs with multiple modalities and chaining operations, 4M-21 supports fine-grained image editing and generation. For example, providing a text caption input will prompt the model to generate the described image. Users can control details about the generated image by including geometric input such as bounding boxes, segmentation maps, or human poses along with the caption. The model can also perform image retrieval based on different inputs; for example, by finding images given a caption or a semantic segmentation map.
Research team member Amir Zamir posted about the work in a thread on X. One user asked Zamir why the model does not support audio modalities. Zamir replied that “It’s a matter of data,” and suggested their method should work with audio. He also wrote:
IMO, the multitask learning aspect of multimodal models has really taken a step forward. We can train a single model on many diverse tasks with ~SOTA accuracy. But a long way to go in terms of transfer/emergence.
Andrew Ng’s AI newsletter The Batch also covered 4M-21, saying:
The limits of this capability aren’t clear, but it opens the door to fine control over the model’s output. The authors explain how they extracted the various modalities; presumably users can do the same to prompt the model for the output they desire. For instance, a user could request an image by entering not only a prompt but also a color palette, edges, depth map extracted from another image, and receive output that integrates those elements.
The code and model weights for 4M-21 are available on GitHub.

MMS • RSS
Posted on mongodb google news. Visit mongodb google news
We recently compiled a list titled Jim Cramer’s Top 10 Stocks to Track for Potential Growth. In this article, we will look at where MongoDB, Inc. (NASDAQ:MDB) ranks among Jim Cramer’s top stocks to track for potential growth.
In a recent episode of Mad Money, Jim Cramer points out the surprising strength in the market, noting that many companies are performing better than Wall Street recognizes. He argues that people should stop doubting these companies every time there’s a negative data point. Cramer highlights the impressive management and execution by CEOs, which often goes unnoticed.
“Suddenly, all is forgiven, or if not all, then at least most. I’m talking about the incredible resilience in this market, buoyed by a recognition that many companies are simply better than Wall Street gives them credit for. We need to stop turning against them every time there’s a seemingly bad data point. Every day I come to work, I’m dazzled by the resourcefulness of executives who do their best to create value for you, the shareholder. Lots of stocks went up on days like today when the Dow advanced 335 points, the S&P gained 75%, and the NASDAQ jumped 1.0%, all thanks to good management and excellent execution that often goes unnoticed.”
While Cramer acknowledges that some CEOs deserve skepticism, he emphasizes that many are outstanding and deserve recognition for their hard work. He criticizes the focus on short-term economic indicators and emphasizes that great companies aren’t distracted by minor fluctuations.
“Listen, I’m not a pushover. I can hit CEOs with tough questions when needed, some of them deserve skepticism and scorn. But there are also plenty of brilliant, hardworking CEOs with incredible teams, and you ignore their hustle at your own peril. This often gets lost in the shuffle when we’re focused on the parlor game of guessing the Fed’s next move—a quarter point, half a point, quarter, half. You know what I say? Let’s get serious. Terrific companies don’t get caught up in that quarter-half shuffle.”
Cramer explains how Kroger CEO Rodney McMullen has led the supermarket chain to success despite challenges, including resistance to its acquisition of Albertsons and a tough economic environment. McMullen has managed to keep food costs down and deliver strong results through effective strategies like a superior loyalty program and regional store improvements. Despite high food prices, the company’s stock rose more than 7% following a positive earnings report, showcasing the company’s successful turnaround.
“CEO Rodney McMullen has managed to keep food costs down and deliver fantastic numbers, all while maintaining an expensive, unionized labor force in a very uncertain commodity environment. How? The company confounded critics by developing a superior loyalty program, regionalizing their stores, and creating some of the best private-label products out there, second only to Costco. Food is still expensive, but cooking at home is far cheaper than dining out. McMullen tells us that consumers are no longer flush with cash, especially his most budget-conscious clientele. He notes, “Budget-conscious customers are buying more at the beginning of the month to stock up on essentials, and as the month progresses, they become more cautious with their spending.”
Wow, that’s a tough environment. When I heard this, I thought back to the old company, the one that used to miss its numbers whenever the environment got a little tough. Everybody else remembers the old company too, which is why the stock was just sitting there waiting to be picked up, until this quarter’s report, after which it soared more than 7% in response to the fabulous results. Everyone thought the company would drop the ball, as they used to, but McMullen has finally whipped his supermarket into shape.”
Cramer contrasts this with the tech industry, where complex details often lead Wall Street to misunderstand a company’s true potential. He believes that in tech, analysts frequently overlook the expertise and capabilities of CEOs who have a deep understanding of their businesses.
“We all need to eat, so it’s not hard to understand the grocery business. But it’s quite different when it comes to tech, where analysts constantly doubt the resolve and expertise of CEOs who simply know more about their businesses than the critics. In tech, the complexity often leads Wall Street to conclusions that have little to do with reality.”
Our Methodology
This article reviews a recent episode of Jim Cramer’s Mad Money, where he discussed several stocks. We selected and analyzed ten companies from that episode and ranked them by the level of hedge fund ownership, from the least to the most owned.
At Insider Monkey we are obsessed with the stocks that hedge funds pile into. The reason is simple: our research has shown that we can outperform the market by imitating the top stock picks of the best hedge funds. Our quarterly newsletter’s strategy selects 14 small-cap and large-cap stocks every quarter and has returned 275% since May 2014, beating its benchmark by 150 percentage points (see more details here).
MongoDB, Inc. (NASDAQ:MDB)
Number of Hedge Fund Investors: 54
Jim Cramer believes MongoDB, Inc. (NASDAQ:MDB) is an enterprise software company delivering excellent results, but it isn’t receiving the same level of recognition as competitors like Salesforce.com (NYSE:CRM). He notes that investors generally seem to shy away from enterprise software companies, with the exception of Salesforce.com (NYSE:CRM). However, Cramer feels that MongoDB, Inc. (NASDAQ:MDB) is currently at a good price, suggesting it may be undervalued despite its strong performance. Cramer sees potential in MongoDB, Inc. (NASDAQ:MDB) and implies it deserves more attention in the enterprise software space.
“You know, MongoDB, Inc.(NASDAQ:MDB) is an enterprise software company that put up terrific numbers and isn’t getting credit in the same way Salesforce.com, inc. (NYSE:CRM) and others are. People tend to dislike enterprise software, except for ServiceNow. I think MongoDB, Inc.(NASDAQ:MDB) is at the right price.”
MongoDB, Inc. (NASDAQ:MDB) offers a strong case for long-term growth, driven by its outstanding financial performance and strategic advancements. In Q2 2024, MongoDB, Inc. (NASDAQ:MDB) reported a 40% jump in revenue, reaching $423.8 million, with its cloud-based Atlas platform accounting for 65% of total revenue. This growth exceeded market expectations and demonstrates the growing demand for its flexible database solutions. MongoDB, Inc. (NASDAQ:MDB) also turned its operating loss from the previous year into a profit of $53.6 million, reflecting its ability to grow while controlling costs.
Analysts are optimistic about MongoDB, Inc. (NASDAQ:MDB), with KeyBanc raising its price target to $543, citing MongoDB’s dominant position in the NoSQL database market and its potential to capitalize on rising demand from cloud and AI-driven applications. MongoDB, Inc. (NASDAQ:MDB)’s educational initiatives, such as partnering with India’s Ministry of Education to train 500,000 students, further strengthen its developer community and support future growth.
ClearBridge All Cap Growth Strategy stated the following regarding MongoDB, Inc. (NASDAQ:MDB) in its first quarter 2024 investor letter:
“During the first quarter, we initiated a new position in MongoDB, Inc. (NASDAQ:MDB), in the IT sector. The company offers a leading modern database platform that handles all data types and is geared toward modern Internet applications, which constitute the bulk of new workloads. Database is one of the largest and fastest-growing software segments, and we believe it is early innings in the company’s ability to penetrate this market. MongoDB is actively expanding its potential market by adding ancillary capabilities like vector search for AI applications, streaming and real-time data analytics. The company reached non-GAAP profitability in 2022, and we see significant room for improved margins as revenue scales.”
Overall, MDB ranks 6th on the list of Jim Cramer’s top stocks to track for potential growth. While we acknowledge the potential of MDB as an investment, our conviction lies in the belief that under-the-radar AI stocks hold greater promise for delivering higher returns, and doing so within a shorter timeframe. If you are looking for an AI stock that is more promising than MDB but that trades at less than 5 times its earnings, check out our report about the cheapest AI stock.
READ NEXT: $30 Trillion Opportunity: 15 Best Humanoid Robot Stocks to Buy According to Morgan Stanley and Jim Cramer Says NVIDIA ‘Has Become A Wasteland’.
Disclosure: None. This article was originally published on Insider Monkey.
Article originally posted on mongodb google news. Visit mongodb google news

MMS • RSS
Posted on nosqlgooglealerts. Visit nosqlgooglealerts
By Ben Paul, Solutions Engineer – SingleStore
By Aman Tiwari, Solutions Architect – AWS
By Saurabh Shanbhag, Sr. Partner Solutions Architect – AWS
By Srikar Kasireddy, Database Specialist Solutions Architect – AWS
The fast pace of business today often demands the ability to reliably handle an immense number of requests per day with millisecond response times. Such a requirement calls for a high-performance, schema-flexible database like Amazon DynamoDB. DynamoDB is a serverless NoSQL database that supports key-value and document data models and offers consistent single-digit millisecond performance at any scale. DynamoDB enables customers to offload the administrative burdens of operating and scaling distributed databases to the AWS cloud, so they don’t have to worry about hardware provisioning, setup and configuration, throughput capacity planning, replication, software patching, or cluster scaling.
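For readers new to DynamoDB, the following minimal sketch writes and reads a document-style item with the AWS SDK for Python (Boto3). The `Orders` table, its `order_id` partition key, and the attribute names are illustrative rather than part of any particular workload.

```python
import boto3

# Assumes an existing DynamoDB table named "Orders" with partition key "order_id"
# and that AWS credentials/region are configured in the environment.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Orders")

# Write a document-style item; attribute names and values are purely illustrative.
table.put_item(
    Item={
        "order_id": "o-1001",
        "customer": "acme-corp",
        "total": 129,  # use decimal.Decimal for non-integer numbers
        "items": [{"sku": "A-1", "qty": 2}],
    }
)

# Read the item back by its key.
response = table.get_item(Key={"order_id": "o-1001"})
print(response.get("Item"))
```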
Performing near real-time analytics on DynamoDB data allows customers to respond quickly to changing market conditions, customer behavior, or operational trends. With near real-time data processing, you can identify patterns, detect anomalies, and make timely adjustments to your strategies or operations. This can help you stay ahead of the competition, improve customer satisfaction, and optimize your business processes.
SingleStore is an AWS Data and Analytics Competency Partner and AWS Marketplace Seller. Its SingleStore Helios offering is a fully managed, cloud-native database that powers real-time workloads needing both transactional and analytical capabilities.
By combining DynamoDB with SingleStore, organizations can efficiently capture, process, and analyze DynamoDB data at scale. SingleStore provides high-throughput data ingestion and near real-time analytical query capabilities for both relational and JSON data. This integration empowers businesses to derive actionable insights from their data in near real time, enabling faster decision-making and improved operational efficiency.
SingleStore can stream change data capture (CDC) data from DynamoDB and serve fast analytics on top of its patented Universal Storage. SingleStore has native support for JSON, so DynamoDB items can be stored directly in a JSON column. Alternatively, each key-value pair from a DynamoDB item can be stored in its own column in SingleStore.
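To make the JSON-column approach concrete, the sketch below creates a hypothetical SingleStore table that promotes the DynamoDB partition key to its own column and keeps the full item as JSON. Because SingleStore speaks the MySQL wire protocol, a standard MySQL driver such as PyMySQL can run the DDL; the host, credentials, database, table, and column names are all placeholders, and the key layout should be adapted to your query patterns.

```python
import pymysql

# Connect to SingleStore over the MySQL wire protocol (connection details are placeholders).
conn = pymysql.connect(host="svc-example.singlestore.com", user="admin",
                       password="REPLACE_ME", database="analytics", autocommit=True)

# One possible layout: the DynamoDB partition key as its own column, the full item as JSON.
ddl = """
CREATE TABLE IF NOT EXISTS orders_cdc (
    order_id   VARCHAR(64) NOT NULL,
    doc        JSON NOT NULL,
    updated_at DATETIME(6) NOT NULL,
    PRIMARY KEY (order_id)
)
"""

with conn.cursor() as cur:
    cur.execute(ddl)
```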
In this blog post, we will cover two architectural patterns for integrating DynamoDB with SingleStore in order to perform near real-time analytics on your DynamoDB data:
- Using DynamoDB Streams and AWS Lambda
- Using the Amazon Kinesis Data Streams connector with Amazon MSK
Using DynamoDB Streams and AWS Lambda
Figure 1 – Architecture pattern for Amazon DynamoDB CDC to SingleStore using DynamoDB Streams and AWS Lambda
The design pattern described here leverages DynamoDB Streams and AWS Lambda to enable near real-time data processing and integration. The detailed workflow for this architecture is as follows:
1. Client applications interact with DynamoDB using the DynamoDB API, performing operations such as inserting, updating, deleting, and reading items from DynamoDB tables at scale.
2. DynamoDB Streams is a feature that captures a time-ordered sequence of item-level modifications in a DynamoDB table and durably stores this information for up to 24 hours. Applications can access a series of stream records containing item changes in near real time.
3. The AWS Lambda service polls the stream for new records four times per second. When new stream records are available, your Lambda function is synchronously invoked. You can subscribe up to two Lambda functions to the same DynamoDB stream.
4. Within the Lambda function, you can implement custom logic to handle the changes from the DynamoDB table, such as pushing the updates to a SingleStore table; a sketch follows this list. If your Lambda function requires additional libraries or dependencies, you can create a Lambda Layer to manage them.
5. The Lambda function needs an IAM execution role with appropriate permissions to manage resources related to your DynamoDB stream.
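To make step 4 concrete, here is a minimal sketch of a Lambda handler that applies DynamoDB Streams records to the hypothetical `orders_cdc` table shown earlier. It assumes the stream is configured to include new images (NEW_IMAGE or NEW_AND_OLD_IMAGES), that Lambda has network connectivity to SingleStore, and that PyMySQL is packaged in a Lambda Layer as mentioned in step 4; `REPLACE INTO` is used for idempotent upserts, but it is only one of several reasonable write strategies.

```python
import json
import os
from datetime import datetime

import pymysql  # supplied via a Lambda Layer
from boto3.dynamodb.types import TypeDeserializer

# Reused across warm invocations; connection settings come from environment variables.
conn = pymysql.connect(
    host=os.environ["SINGLESTORE_HOST"],
    user=os.environ["SINGLESTORE_USER"],
    password=os.environ["SINGLESTORE_PASSWORD"],
    database=os.environ["SINGLESTORE_DB"],
    autocommit=True,
)

deserializer = TypeDeserializer()


def _plain(image):
    """Convert DynamoDB-typed attributes ({'S': 'x'}, {'N': '1'}, ...) into plain Python values."""
    return {k: deserializer.deserialize(v) for k, v in image.items()}


def handler(event, context):
    upserts, deletes = [], []
    for record in event["Records"]:
        keys = _plain(record["dynamodb"]["Keys"])
        if record["eventName"] in ("INSERT", "MODIFY"):
            item = _plain(record["dynamodb"]["NewImage"])
            upserts.append((keys["order_id"], json.dumps(item, default=str), datetime.utcnow()))
        else:  # REMOVE
            deletes.append((keys["order_id"],))

    with conn.cursor() as cur:
        if upserts:
            cur.executemany(
                "REPLACE INTO orders_cdc (order_id, doc, updated_at) VALUES (%s, %s, %s)",
                upserts,
            )
        if deletes:
            cur.executemany("DELETE FROM orders_cdc WHERE order_id = %s", deletes)
    return {"processed": len(event["Records"])}
```

In production you would also add error handling and partial-batch failure reporting, which the sketch omits for brevity.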
This design pattern allows for an event-driven architecture, where changes in the DynamoDB table can trigger immediate actions and updates in other systems. It’s a common approach for building real-time data pipelines and integrating DynamoDB with other AWS services or external data stores.
This pattern is suitable for most DynamoDB customers, but it is subject to the throughput quotas for your DynamoDB table and AWS Region. For a higher throughput limit, you can consider provisioned throughput or the following design pattern based on the Amazon Kinesis Data Streams connector.
Using the Amazon Kinesis Data Streams connector with Amazon MSK
Figure 2 – Architecture pattern for Amazon DynamoDB CDC to SingleStore using Amazon Kinesis Data Streams
This design pattern leverages Amazon Kinesis Data Streams and Amazon MSK to enable more flexible data processing and integration. The detailed workflow for this architecture is as follows:
1. Client applications interact with DynamoDB using the DynamoDB API, performing operations such as inserting, updating, deleting, and reading items from DynamoDB tables at scale.
2. Amazon Kinesis Data Streams captures changes from the DynamoDB table asynchronously. Kinesis has no performance impact on the table it streams from. You can take advantage of a longer data retention time, and with the enhanced fan-out capability you can simultaneously serve two or more downstream applications. Other benefits include additional audit and security transparency.
The Kinesis data stream records might appear in a different order than the one in which the item changes occurred. The same item notification might also appear more than once in the stream. You can check the ApproximateCreationDateTime attribute to identify the order in which the item modifications occurred and to identify duplicate records.
3. An open-source Kafka connector from its GitHub repository, deployed to Amazon MSK Connect, replicates the events from the Kinesis data stream to Amazon MSK. With Amazon MSK Connect, a feature of Amazon MSK, you can run fully managed Apache Kafka Connect workloads on AWS. This feature makes it easy to deploy, monitor, and automatically scale connectors that move data between Apache Kafka clusters and external systems.
4. Amazon MSK Connect needs an IAM execution role with appropriate permissions to manage connectivity with Amazon Kinesis Data Streams.
5. Amazon MSK makes it easy to ingest and process streaming data in real time with fully managed Apache Kafka. Once the events are in Amazon MSK, you have the flexibility to retain or process the messages based on your business needs and to bring in various downstream Kafka consumers to process the events. SingleStore’s managed Pipelines feature can continuously load data from Amazon MSK using parallel ingestion as it arrives, without requiring you to manage any code.
6. SingleStore Pipelines also support the connection LINK feature, which provides managed handling of AWS security credentials. A combined sketch of this pattern follows this list.
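As a combined sketch of this pattern, reusing the illustrative names from earlier, the snippet below turns on Kinesis streaming for the DynamoDB table and then creates and starts a SingleStore pipeline that ingests the MSK topic the connector writes to. The stream ARN, broker endpoints, topic name, and JSON field mapping are assumptions, and any security or credential clauses depend on your connector’s output format and cluster authentication, so treat this as a starting point rather than a drop-in configuration.

```python
import boto3
import pymysql

# 1. Route DynamoDB item-level changes to a Kinesis data stream
#    (table name and stream ARN are placeholders).
dynamodb = boto3.client("dynamodb")
dynamodb.enable_kinesis_streaming_destination(
    TableName="Orders",
    StreamArn="arn:aws:kinesis:us-east-1:123456789012:stream/orders-cdc",
)

# 2. Create and start a SingleStore pipeline that continuously loads the MSK topic.
#    The field mapping assumes each Kafka message is a JSON object with these keys;
#    adjust it to match what your Kinesis-to-Kafka connector actually emits.
conn = pymysql.connect(host="svc-example.singlestore.com", user="admin",
                       password="REPLACE_ME", database="analytics", autocommit=True)
create_pipeline = """
CREATE PIPELINE orders_from_msk AS
LOAD DATA KAFKA 'b-1.msk.example.com:9092,b-2.msk.example.com:9092/orders-cdc-topic'
INTO TABLE orders_cdc
FORMAT JSON
(order_id <- order_id, doc <- doc, updated_at <- updated_at)
"""
with conn.cursor() as cur:
    cur.execute(create_pipeline)
    cur.execute("START PIPELINE orders_from_msk")
```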
This pattern gives you the flexibility to use the data change events in MSK for other workloads or to retain the events for longer than 24 hours.
Customer Story
ConveYour, a leading Recruitment Experience Platform, faced a challenge when Rockset, their analytical tool, was acquired by OpenAI and set for deprecation. Demonstrating remarkable agility, ConveYour swiftly transitioned to SingleStore for their analytical needs.
ConveYour CEO, Stephen Rhyne said “Faced with the impending deprecation of Rockset’s service by the end of 2024, ConveYour recognized the urgent need to transition our complex analytical workloads to a new platform. Our decision to migrate to SingleStore proved to be transformative. The performance improvements were remarkable, particularly for our most intricate queries involving extensive data sets. SingleStore’s plan cache feature significantly enhanced the speed of subsequent query executions. Furthermore, the exceptional support provided by SingleStore’s solutions team and leadership was instrumental in facilitating a swift and efficient migration process. This seamless transition not only addressed our immediate needs but also positioned us for enhanced analytical capabilities moving forward.”
Conclusion and Call to Action
In this blog, we have walked through two patterns for setting up a CDC stream from DynamoDB to SingleStore. By leveraging either of these patterns, you can use SingleStore to serve up sub-second analytics on your DynamoDB data.
Start playing around with SingleStore today with their free trial, then chat with a SingleStore Field Engineer to get technical advice on SingleStore and/or code examples to implement the CDC pipeline from DynamoDB.
To get started with Amazon DynamoDB, please refer to the following documentation – https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GettingStartedDynamoDB.html.
SingleStore – AWS Partner Spotlight
SingleStore is an AWS Advanced Technology Partner and AWS Competency Partner that provides a fully managed, cloud-native database powering real-time workloads that need both transactional and analytical capabilities.
Contact SingleStore | Partner Overview | AWS Marketplace
About the Authors
Ben Paul is a Solutions Engineer at SingleStore with over 6 years of experience in the data & AI field.
Aman Tiwari is a General Solutions Architect working with Worldwide Commercial Sales at AWS. He works with customers in the Digital Native Business segment and helps them design innovative, resilient, and cost-effective solutions using AWS services. He holds a master’s degree in Telecommunications Networks from Northeastern University. Outside of work, he enjoys playing lawn tennis and reading books.
Saurabh Shanbhag has over 17 years of experience in solution integration for highly complex enterprise-wide systems. With his deep expertise in AWS services, he has helped AWS Partners seamlessly integrate and optimize their product offerings, enhancing performance on AWS. His industry experience spans telecommunications, finance, and insurance, delivering innovative solutions that drive business value and operational efficiency.
Srikar Kasireddy is a Database Specialist Solutions Architect at Amazon Web Services. He works with our customers to provide architecture guidance and database solutions, helping them innovate using AWS services to improve business value.