Mobile Monitoring Solutions


Presentation: Portfolio Analysis at Scale: Running Risk and Analytics on 15+ Million Portfolios Every Day

MMS Founder
MMS William Chen

Article originally posted on InfoQ. Visit InfoQ

Transcript

Chen: My name is Will. I’m a director and tech fellow at BlackRock. I work on the Aladdin Wealth product. We do portfolio analysis at massive scale. I’m going to talk about some of those key lessons we’ve learned from productizing and from scaling up portfolio analysis as part of Aladdin. First off, just to establish a little bit of context, I want to talk about what we do at BlackRock. BlackRock is an asset manager. Our core business is managing money, and we do it on behalf of clients across the globe. Those assets include retirement funds for teachers, for firefighters, nurses, and also for technology professionals too, like you and me. We also advise global institutions like sovereign wealth funds and central banks too. We also built Aladdin. Aladdin is a comprehensive, proprietary operating system for investment management. It calculates and models sophisticated analytics on all securities, and all portfolios across all asset classes. It’s also a technology product. We provide it as a service to clients and we get paid for it. We have hundreds of client firms that are using it today. It’s a key differentiator for BlackRock within the asset management and finance industry.

Then, finally, we have Aladdin Wealth. Aladdin Wealth is a distinct commercial offering within Aladdin, and it’s targeted towards wealth managers and financial advisors. We operate at massive scale. Every night, we compute a full portfolio analysis on more than 15 million portfolios that are modeled on our system. These portfolios are real portfolios that are held by people like you and me, or your parents and your siblings. They’re held at one of our client firms. These portfolio analyses that we produce, they’re an important part of managing these portfolios, and also helping us achieve our financial goals. On top of those 15 million portfolios, we also process more than 3 million portfolio analyses during the day through API calls. We also get sustained surges of more than 8000 portfolios in a single minute. These API calls are also real portfolios, but we’re getting them through our system integrations with client firms. We’re providing them an instant portfolio analysis on-demand as an API. During the course of this talk, another 300,000 to 500,000 portfolios are going to have gone through our system for analysis.

Example: Portfolio Factor Risk

To give you a sense of what is a portfolio analysis, there’s a lot of stuff that goes into it. It’s a broad range of analytics. It’s a lot of complex calculations. They’re all an important part of understanding what’s in your portfolio, how it’s positioned, and how it’s going to help you achieve your goals. I do want to dive in on one particular number, though, a very important number that we calculate, and that one is portfolio factor risk. I’m going to spend a little bit of time explaining what it is, why it’s important, and also how we actually go about and calculate this thing. Risk is a measure of the uncertainty in your portfolio’s value, and how big the potential losses can be. If your portfolio’s got a risk of 10%, what that means is that we expect it to lose less than 10% of its value over a 1-year period. We have a one standard deviation confidence of that. If you’re going to take your portfolio’s returns on a forward-looking basis, and chart them out over a year, you’ll get that histogram on the left over here. That purple x I’ve got over there, that would be what your portfolio risk is. If you have a lot of risk, that x is going to be big, distribution will be very wide. Whereas if you have very low risk, your portfolio’s distribution is going to be very narrow. Knowing what that x is, is very important. Because if you don’t know what that x is, and the distribution is wider than you think, you could be set up to take on a lot more losses than you’d expect, or that you can tolerate financially given your circumstances.

To calculate that x, we need the stuff on the right, which is factors and factor modeling. Factors are your long-term drivers of returns on securities and portfolios. They’re going to be things like interest rates, spreads, FX rates, oil, gold, all sorts of things like that. When we calculate risk, we use more than 3000 factors to do that. That lets us tell you not just what your portfolio’s total risk is, what that x is, but also how that x is broken up into all these different factors, so we can tell you where it’s coming from. To calculate this at the portfolio level, like I mentioned, there’s a lot of data required, there’s a lot of modeling required. It fundamentally boils down to the equation I put down there, which is a matrix-vector equation: EᵀCE, where E is the portfolio’s factor exposure vector and C is the factor covariance matrix. That covariance matrix in the middle, when you deal with factor risk, is going to have 3000 rows and 3000 columns in it, one for every factor, and it’s going to be a dense matrix. There are about 9 million entries in that matrix. We have to compute this value hundreds, or maybe even thousands, of times to calculate a portfolio analysis for one portfolio.
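To make that concrete, here is a minimal Java sketch of the EᵀCE calculation described above. The class and method names are illustrative only, not Aladdin’s actual code, and a production version would exploit the sparsity and symmetry discussed later in the talk.

```java
// Minimal sketch of the portfolio factor risk calculation: risk^2 = E' * C * E,
// where E is the factor exposure vector and C is the dense factor covariance matrix.
public final class FactorRiskSketch {

    /** Returns portfolio risk (a standard deviation) given exposures e and covariance c. */
    public static double risk(double[] e, double[][] c) {
        int n = e.length;
        double quadraticForm = 0.0;
        for (int i = 0; i < n; i++) {
            if (e[i] == 0.0) {
                continue; // zero exposure contributes nothing to E' * C * E
            }
            double rowDot = 0.0;
            for (int j = 0; j < n; j++) {
                rowDot += c[i][j] * e[j]; // (C * E)_i
            }
            quadraticForm += e[i] * rowDot; // E_i * (C * E)_i
        }
        return Math.sqrt(quadraticForm);
    }
}
```

With a 3000-by-3000 covariance matrix, each evaluation touches on the order of 9 million entries, which is why the rest of the talk focuses on reusing and accelerating this calculation across the hundreds or thousands of evaluations a single portfolio analysis needs.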

1. Trim Your Computational Graph (Key Insights from Scaling Aladdin Wealth Tech)

How have we achieved this level of scale? First of all, languages matter, frameworks matter. Writing good and efficient code is, of course, very important. We’ve also found that other higher-level factors matter more. I’m going to talk about six of them. The first insight we’ve learned is to trim down your computational graph. What I mean here is that if you think about any financial analytic that you may deal with in your day-to-day work, it usually depends on evaluating a complex computational graph, and that graph is going to have a lot of intermediate nodes and values that have to be calculated. It’s also going to depend on a ton of inputs at the very tail end. For the purposes of this example, I’m going to be talking about the calculations we do, so portfolio risk and portfolio scenarios, which I’ve marked in purple over here. This is really a generic approach for a generic problem, because a lot of the business calculations that you’re going to deal with are basically just graphs like this. If it’s not a graph, if it’s just a page or a pile of formulas that have been written out by a research or quant team at some point, you can actually try to write it out as a graph like this so you can apply this technique. Over here we have this simplified graph of how to compute portfolio risk and portfolio scenarios. To actually find out how to get that number, you have to walk backwards from the purple output node, all the way back to your inputs. If you look at portfolio scenarios, for example, at the bottom over there, it’s got two arrows coming out of it: position weights and security scenarios. Then security scenarios itself has two more arrows. One arrow comes from factor modeling at the top in orange, and then another arrow comes from security modeling on the left. Then both of these nodes are going to come from a complex network of calculations. These things are going to have a whole host of inputs and dependencies like prices, reference data, yield curve modeling, risk assumptions, and factors. If you were to sit down with pen and paper, or with a computer, to calculate one of these portfolio scenarios, it would take you minutes, or maybe even hours, to do that even once.

To apply this principle, this insight of trimming down the graph, you look at all the different nodes and see what you can actually lock down, or what you can reuse across your different calculations. In the case of risk, jumping into that other purple node, it’s got three nodes leading into it. The first node is the position weight. That’s something that you can’t lock down, because every portfolio is going to look different. You might not even know the position weights of your portfolio until right before you need to actually calculate it. That’s not something you can do anything about; you have to leave that one as a free variable here. Then you can move over to the orange side at the top. If you can lock down your risk parameters, which are the assumptions, the approaches, the parameters that you use to pin down your factor modeling, and you can lock down your factor levels to say, I know what the interest rates are, I know what the price of gold and all these other things are, then that lets you lock down the covariance between factors, because if you lock down the two input nodes, everything underneath can be locked down. Then you can treat factor covariance as if it’s a constant, and you can just instantly fetch it and reuse it across all of your calculations.

Then the same approach also applies on the left over here with the security exposure. You can basically saw off most of that graph, as long as you can lock down equity ratios, which here are a stand-in for all reference data, prices, and your yield curve, because when those are constants, everything to the right of them just becomes a constant as well. What you’ve done is that you’re left with a much simpler graph, in that there are only three inputs. The maximum calculation depth of this graph, the maximum number of steps you have to walk backwards from a purple node, is just one. When you have a graph like this, that lets you memoize these three intermediate nodes that have locks on them, meaning you can take the results of these calculations, which are expensive to compute, store them in a cache somewhere, and key them based on the inputs that came in at the very tail nodes. Then that lets you reuse them across these calculations. It can reduce the work required to calculate one of these things from minutes and hours down to less than a second. One of the issues with doing this, if you just memoize these nodes on-demand, is that the first time you calculate one of these things is going to be prohibitively slow and lead to a bad experience for everyone.
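As a hedged illustration of that memoization idea, here is a small Java sketch: a cache of the factor covariance matrix keyed by the inputs that were locked down. The key fields and the buildCovariance step are hypothetical placeholders, not Aladdin’s actual code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Memoize a locked-down intermediate node (the factor covariance matrix),
// keyed by the inputs that produced it.
public final class CovarianceCache {

    /** The inputs we chose to lock down (hypothetical key fields). */
    record CovKey(String riskParamsVersion, String factorLevelsDate) {}

    private final Map<CovKey, double[][]> cache = new ConcurrentHashMap<>();

    /** Fetch the covariance matrix, computing it at most once per locked-down key. */
    public double[][] covariance(String riskParamsVersion, String factorLevelsDate) {
        CovKey key = new CovKey(riskParamsVersion, factorLevelsDate);
        // The expensive build runs once; later calls for the same key reuse the result.
        return cache.computeIfAbsent(key, this::buildCovariance);
    }

    private double[][] buildCovariance(CovKey key) {
        // Placeholder for the expensive factor-modeling step; per the talk, this would
        // be produced in the overnight batch rather than on first request.
        return new double[3000][3000];
    }
}
```

As the paragraph above notes, populating a cache like this purely on-demand makes the first call slow, which is why these values end up being produced overnight before the business day starts.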

The other part to think about here is when you can lock these down, not just whether you can lock them down. If you have a reactive system for running your financial models, something that can trigger downstream processes when the price of a security gets updated or when risk assumptions are remodeled, and that can map all of that out, this can be very easy to do. In many cases, and it’s true in our case as well, financial models are run on batch systems that tend to be a little bit older and harder to operate. In these scenarios, you have to be practical about what you can do and try to capture as much of the benefit as you can, with the least amount of operational complexity. For us, it’s easy to just lock things down after the market closes, after the economy is modeled, and then we can run things overnight. Then they can be processed at the security or factor level well before the business day starts.

The last point on the slide is to not be afraid of challenging the status quo when you’re doing this. Because when you’re looking at one of these computational graphs, chances are you didn’t create it. Tons of smart people have worked on making this graph the way it looks today, so that it can calculate everything it needs to, so that it can be as powerful, as fast and efficient, and so on. If your job is to make this scale up by orders of magnitude, if your job is to make this faster, if your job is to make the tail latency a lot better, you fundamentally have to give something up. There’s no free lunch here. For us, the thing to give up is really that variability and flexibility. In our situation, that’s actually a feature, because when we lock down our risk assumptions it lets us compare things apples to apples across portfolios. That variability isn’t something our business users actually needed or wanted. That’s an easy tradeoff for us to make.

2. Store Data in Multiple Formats

The second insight is to store your data in multiple formats. What I mean by this is that you can take copies of the data that you generate from your analytical factory, from your models, and all that stuff, and store multiple copies of it in a bunch of different databases. Then when you actually need to query the data out, you can pick the best database for what you’re trying to do with it, best meaning cheapest, fastest, or otherwise most appropriate. For example, if you need to fetch a single row based on a known key, you could go to your large-scale join engine, like Hive or Snowflake or something like that, get that row back out, and you’ll get the right answer. It’ll be very slow and very inefficient to actually do that, because it’s got to run a table scan in most cases. If you have a second copy of the data inside of a key-value store like Cassandra, you can go straight to Cassandra and get it out at very low cost. It will happen very fast, and it’ll be reliable and cheap as well.

On the flip side, if you need to do a large-scale join across two large datasets, you technically could run that off of your key-value store, as long as you have a way to iterate out the keys that you need. That would be a very foolish way to do it; your application-side join would be brittle. You might as well take advantage of the copy of the data that you have in Hive instead. In Aladdin Wealth, we have progressively added more copies of the data as our use cases have grown. We started off by using Hive as our one database. Hive was great. It was incredibly scalable and could handle a lot of data, but it had a couple of tradeoffs associated with it. One was that it was very slow, and actually unstable, when you had a lot of concurrent usage. Second of all, it was just not as fast as we needed it to be when we were looking at a small slice of the data, or when we wanted to just pull a single row out. Most of our use cases were one of those two things. To accommodate that, we introduced Solr as a replacement. Solr can do both of those things very well using Lucene indices. Things were great. It was handling concurrency very well. Then we also had to introduce Cassandra after that, because we had to deliver an external API with very strict SLAs and financial implications for breaching them. We needed something that was even faster and, more importantly, more reliable, with a much tighter tail. After that, we brought Hive back into the mix when we had to do some large-scale joins. That has really been our journey: progressively adding more copies of the data.
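As a hedged sketch of that “pick the best copy for the query shape” idea, here is a small Java illustration: point lookups go to the key-value copy, large joins go to the warehouse copy. The interfaces and store roles below are hypothetical stand-ins for the Cassandra and Hive/Snowflake copies, not actual Aladdin code.

```java
// Route reads to the copy of the data that is cheapest and fastest for the query shape.
public final class AnalyticsReadRouter {

    interface KeyValueStore { String getByKey(String key); }            // e.g. the Cassandra copy
    interface WarehouseStore { Iterable<String> runQuery(String sql); } // e.g. the Hive/Snowflake copy

    private final KeyValueStore keyValueCopy;
    private final WarehouseStore warehouseCopy;

    public AnalyticsReadRouter(KeyValueStore keyValueCopy, WarehouseStore warehouseCopy) {
        this.keyValueCopy = keyValueCopy;
        this.warehouseCopy = warehouseCopy;
    }

    /** Single analysis by known key: cheap, fast, and reliable from the key-value copy. */
    public String portfolioAnalysis(String portfolioId) {
        return keyValueCopy.getByKey(portfolioId);
    }

    /** Large-scale join across datasets: use the warehouse copy instead of iterating keys. */
    public Iterable<String> firmWideReport(String joinSql) {
        return warehouseCopy.runQuery(joinSql);
    }
}
```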

One of the tradeoffs of this approach is that when you have a lot of different databases, and your data exists in a lot of different places, it can potentially become inconsistent. That sounds bad. It’s not always the end of the world, because there are a couple of things you can do to deal with it. First of all, you can introduce data governance and controls. What I mean by that is you can introduce processes to make sure that you at least attempted to write the data to every database you keep a copy in. You can also introduce reconciliation processes to make sure that what actually ended up in each store eventually converges on the same set. You can also monitor dead letters: if something fails to write, you should make sure somebody is looking at it. There’s all sorts of stuff you can do there.

A second approach is to actually isolate your sensitive use cases, meaning that you should look at all of your workloads and see what really, truly cannot tolerate any degree of inconsistency, even for a moment. In our experience, within our domain, there have been very few of those use cases. If you ask your business users, “Can the data be inconsistent?”, of course they will say no, because it sounds bad to say that data is inconsistent. The real question that you want to ask your business users is, what are the real-world, tangible, commercial implications if I have two APIs and they come up with a different answer over a 1-second period, over a 1-minute period, or over a 1-hour period? When you do that, you might find that it’s actually not as bad as you think. Then you can also start to have the more interesting conversations about, how am I going to find out if one of these things is inconsistent? If I find out one of these things is inconsistent, do I have a way to fix it? We like to call that detection and repair: making sure that you have a detection method and a repair method for these things. Then the final way to deal with inconsistency is to not care at all. I think there can be some cases where that works and is appropriate. In most situations, I think data governance and controls cost very little. They’re very cheap to implement, and they can go a long way towards making things feel more consistent.
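Here is a small, hedged sketch of that detection-and-repair idea for one record that exists in two stores: detect divergence, then repair the copy from whatever you have designated as the source of truth. All interfaces and names are hypothetical.

```java
import java.util.Optional;

// Detection and repair: compare a record across two copies and fix the replica
// from the designated source of truth when they diverge.
public final class DetectAndRepair {

    interface Store {
        Optional<String> read(String key);
        void write(String key, String value);
    }

    private final Store sourceOfTruth;
    private final Store replicaCopy;

    public DetectAndRepair(Store sourceOfTruth, Store replicaCopy) {
        this.sourceOfTruth = sourceOfTruth;
        this.replicaCopy = replicaCopy;
    }

    /** Returns true if the replica had to be repaired for this key. */
    public boolean check(String key) {
        Optional<String> truth = sourceOfTruth.read(key);
        Optional<String> copy = replicaCopy.read(key);
        if (truth.equals(copy)) {
            return false; // consistent, nothing to do
        }
        // Inconsistent or missing copy: repair it, and let monitoring count the repair.
        truth.ifPresent(value -> replicaCopy.write(key, value));
        return true;
    }
}
```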

3. Evaluate New Tech on a Total Cost Basis

The third insight is to evaluate new tech on a total cost basis. What I mean by this is that when you’re looking at the languages, the frameworks, and the infrastructure that are going to make up your software system, make sure to think about it from a total cost of ownership perspective, within the context of your enterprise. If you think about existing tech at your firm, it’s going to be enterprise ready by definition, because it already exists in your enterprise. There’s prior investment in it. It’s integrated with all the other stuff that’s already running at your company. You’ve got operational teams that are staffed and trained to operate and maintain it. Then you have developers that know how to code against it, too. On the flip side, enterprise ready tech tends to be on the older side, and not very cool. By cool, I mean having a lot of developer interest outside of your company, or even within your company. Coolness does matter, because it affects your ability to hire developers, to inspire developers, to retain and train them, and all that stuff. There’s a real consideration here. When you’re thinking about onboarding new tech into your enterprise, you need to weigh that enterprise readiness against the coolness factor and whatever other tangible benefits you believe it’ll bring. It could be faster, or more scalable, or more reliable. You have to weigh that against: I have to put in the work to make this integrate with my systems, I have to get my operational teams onboarded and trained on maintaining this, and so on. Just make sure to weigh both sides carefully.

On the slide here, I’ve laid out the technology landscape from the perspective of my enterprise, or at least within Aladdin Wealth. We have enterprise ready stuff on the right, and then we have less enterprise ready stuff on the left. This picture is going to look very different for every enterprise. For us, we’ve stuck mostly with the enterprise ready side, so the right side of the graph, but we did make a choice to onboard something that wasn’t quite enterprise ready, and that was Spark in the middle here. That was the right choice for us, because at the time, the alternative to Spark to do some distributed computing work was just very hard to code against. It was also that we had a Hadoop enterprise installation at the firm that was geared towards research workloads. We had something existing that we could just invest a little bit to fully operationalize and make it fully production ready to run our stuff. The last benefit of Spark here was also that it’s independent of resource managers. In theory, we could port this workload across a lot of different clusters. This landscape is what it looks like for us. It is going to look different for other enterprises. I do want to highlight, Cassandra is something that I personally think is very cool. It’s also very enterprise ready, because we have DBA teams operating it, and we have years of experience with it. It may not be enterprise ready tech at your enterprise. On the flip side, I put Kubernetes and gRPC there. This isn’t brand new tech either. It could very well be production ready at your enterprise, but for us, it just is not.

4. Use Open Source, and Consider Contributing

The next insight is to use open source and to consider contributing. Many hard problems have already been solved in open source. If you find something in open source that does almost everything you need, but not quite everything, it’s missing a key piece of functionality, you’ve got two choices as an enterprise. Choice one is to get a bunch of developers together and try to build an internally developed equivalent of that open source product. Your development teams will probably have a lot of fun doing this, because developers love to write code, and they love to build complex things. On the flip side, it’s going to take a lot of time and a lot of money. Then, what you end up with is not going to do everything that the open source product does, it’s going to be missing all these triangles and squares of functionality, and it probably just isn’t going to be as good either. On top of that, your enterprise then has to go and maintain this forever.

The other approach, which is path 2 here, is to follow your firm’s open source engagement policy. If you have an OSPO, that would be a good place to start, and work with the community and engage with the maintainers of those projects, to see if you can contribute that missing functionality back in as that purple box over there. This doesn’t always work out. It’s not a guarantee. It could fail for many reasons. It could be that it’s a trade secret. It could be that the purple box doesn’t align with the direction of the open source project. When it does work out, it’s a win-win for everybody involved. The community benefits from your contribution, and your company benefits in so many ways, because it can do this with less time and less money. The product you end up with is even better than what you could have gotten before: the open source community tends to enforce higher standards on code than internal code typically gets. Then your firm doesn’t have to invest in maintaining this by itself forever. It can invest as part of the community and be part of the community in keeping this thing alive. Finally, for the developers themselves, there’s a key benefit here, which is that you get some fame and recognition, because you get a public record of your contribution. It’s good for everybody when you can follow path 2.

In Aladdin Wealth, we of course use tons of open source software: Linux, Java, Spring, Scala, Spark, all sorts of stuff like that. We also tend to follow path 1 more than path 2, but we want to be doing more of path 2 and contributing more to open source. I do want to share one story, which is about matrix-vector multiplication. That is something that’s very much core to our business, like I described earlier. We’ve actually taken both path 1 and path 2 for matrix operations. We started off our business and our product by using netlib BLAS. That is an open source, state-of-the-art library from the ’70s. It’s written in Fortran. Everybody uses it for matrix operations. We used it to bootstrap our product and bootstrap our business. We achieved a significant level of scale with that, and things were great. What we found was that as we expanded to other regions, as we expanded our capabilities, took on more complex portfolios, and added more functionality, BLAS was actually not fast enough for us. It was running out of memory in some cases. That seems like a surprise, given how old and battle tested this library is. That’s fundamentally because we did not have a general-purpose matrix problem, and BLAS is highly optimized for dense vector, dense matrix operations. Our matrix is dense, and it’s also positive definite and symmetric, because it’s a covariance matrix. Our vectors, though, are almost always a little sparse, or extremely sparse. We weren’t able to take advantage of that problem formulation with BLAS. What we did is we gradually unwound more of our usage of BLAS methods and replaced them with internal equivalents that were optimized for our use case of sparse vectors and dense matrices. It got to the point where we were only using the open source library for the interface of a class that defines, what’s a matrix? How do I iterate the entries, those kinds of things? That’s not a great situation to be in, because you’re pulling in a whole library just for an interface.

What we did is we looked at the open source matrix library landscape in Java, and we found that there was one library, EJML, which did most of what we needed. It even had one of the optimizations that we had added internally. It was optimized for a slightly different setup, in that you could have a sparse matrix and a dense vector, but we actually needed the reverse. We also needed one that exploited the symmetry as well. So we worked with the maintainer through GitHub, by raising an issue, by raising a PR, by exchanging comments. We were able to add that optimization in as a very small piece of code, an additional method on top of a very comprehensive, fully featured library in EJML. Then we were able to pull that code back, and we’re going to be able to retire a bunch of internal code as a result. That’s the value of open source. We get to retire our internal code, and we’re also contributing back to the community. The story gets even better, because when we raised our PR with our optimization, the maintainer took a look at it and, while reviewing it, made another optimization on top of that, which yielded a 10% throughput increase in certain cases. That extra 10% is something we would never have benefited from, or even found out about, if we hadn’t first taken the step to contribute our own optimization for our not-that-bespoke use case. We’ve got a post on the BlackRock Engineering blog at engineering.blackrock.com, which talks about this in detail.
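To illustrate the kind of optimization being described, here is a plain-Java sketch of evaluating EᵀCE when the exposure vector is sparse and the covariance matrix is dense and symmetric. This shows the idea only; it is not the actual EJML change referenced above.

```java
// Quadratic form E' * C * E for a sparse exposure vector and a dense, symmetric C.
// Only non-zero exposure entries are visited, and symmetry halves the off-diagonal work.
public final class SparseQuadraticForm {

    /**
     * idx/val hold the indices and values of the non-zero exposures;
     * c is the dense, symmetric factor covariance matrix.
     */
    public static double value(int[] idx, double[] val, double[][] c) {
        double total = 0.0;
        for (int a = 0; a < idx.length; a++) {
            int i = idx[a];
            total += val[a] * val[a] * c[i][i]; // diagonal term, counted once
            for (int b = a + 1; b < idx.length; b++) {
                int j = idx[b];
                // Off-diagonal pair (i, j): c[i][j] == c[j][i], so count it twice.
                total += 2.0 * val[a] * val[b] * c[i][j];
            }
        }
        return total;
    }
}
```

For a vector with only a few hundred non-zero exposures against a 3000-factor matrix, this touches far fewer entries than a fully dense multiply, which is the effect the sparse-vector, dense-matrix formulation is after.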

5. Consider the Dimensions of Modularization

The next insight I want to talk about is about the dimensions of modularization. There are three of them that I want to talk about, there’s change, capabilities, and scale. I’m going to go through these in detail over the next couple of slides. There are a couple of different approaches to drawing module boundaries when you’re designing a software system. One of the very classic approaches dates back to a 1971 paper by David Parnas. His key insight, his conclusion at the end of that paper was that you should consider difficult design decisions or design decisions which are likely to change. Then your design goal becomes to hide those decisions, ideally inside of a single module. Then that lets you change those design decisions later on, just by modifying that one module, and you don’t have to touch everything else. The very common example of this is going to be your database. If your software system lives for a long time, if it’s successful, you may want to change out the database. It could be because of commercial reasons, you want to swap out your vendor. Or it could just be that you want to upgrade to a non-compatible version, or you want to take your code and run it somewhere else with a totally different database. To apply this principle, you would draw boundaries around your database access. That way, you can just swap out that one module later on when you actually need to do this, and you don’t have to rewrite your calculation logic for your pipelines or anything else like that. That was the dimension of change.
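A small, hedged Java illustration of that idea: calculation code depends only on an interface, and the choice of database is hidden inside one implementation module. The interface and class names here are hypothetical.

```java
import java.util.Optional;

/** Calculation code depends only on this interface, never on a specific database. */
interface PortfolioStore {
    Optional<Portfolio> findById(String portfolioId);
    void save(Portfolio portfolio);
}

/** The database decision is hidden in this one module; swapping vendors later
    means writing another implementation of the same interface. */
final class CassandraPortfolioStore implements PortfolioStore {
    @Override
    public Optional<Portfolio> findById(String portfolioId) {
        // vendor-specific lookup lives only here
        return Optional.empty();
    }

    @Override
    public void save(Portfolio portfolio) {
        // vendor-specific write lives only here
    }
}

record Portfolio(String id) {}
```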

The next dimension is capabilities. What I mean here is that you should think about breaking up your modules by their functional capabilities. These define what the modules do. Make sure that those capabilities are high level enough that they’re real, tangible things that product managers and business users can describe. They can tell you in plain language what they do, how they interact, and why they’re important in their daily lives. When you do that, your modules can become products themselves with their own product lifecycle and their own product managers. Even better, they become building blocks that product managers can then combine in a lot of different ways to build many different products, depending on whatever use case they’re trying to target. In the wealth space, our capabilities, our building blocks, are going to be things like portfolios, securities, and portfolio analysis. A product manager could reasonably take those modules, those services and APIs, and say, “I want to build a portfolio supervision app. I want to be able to monitor portfolios, check when they’re within tolerance, make sure that they’re going to achieve my end client’s goals.” They can just combine those building blocks with a compliance alert building block, and they’ve got a product right there. Then another product manager can go and take those same building blocks and build a portfolio construction app, something that’s geared towards making changes to portfolios, analyzing those changes, making sure they’re appropriate, and using optimization to do that. You can do that with those same building blocks, and you can layer on optimization capabilities, householding data about your customers, all sorts of stuff like that. The best part is that, if you’ve done your job right, your product managers don’t need you for that layering, that construction from these little building blocks. Your product managers can do that without you. That frees you up to think about the harder engineering problems that you need to be dealing with.

When you take change and capabilities together as your dimensions, that lets you treat the same capability as batch, long-running, or interactive. What I mean here is that if you’ve taken the business logic of that capability and created it as a pure library that has no dependencies on where it’s running, so it has no idea what the database is or anything like that, you can tailor the deployment of that capability based on what you want. If you take a capability like portfolio analysis, you may want to run it tens of millions of times on batch infrastructure, but you also need to be able to provide it as an API. You need to be able to handle one portfolio very quickly and very reliably; you can’t fail on that one, whereas in a batch you can fail and retry it 30 minutes later or something like that. To do that, you would take your library, bind it to reliable databases, stuff that’s geared towards that reliability, and to RPC or API gateways, and you can deliver it as an API. Then you can take that same library, bind it to Spark, use files in HDFS for I/O, stuff that’s really geared towards throughput, and you can run the same capability across tens of millions of these portfolios. The nice thing about this approach is that because your library is totally stateless, you can use the same code in both places. You can be sure that the numbers you’re calculating are the same, that your analytical consistency is there between the two different deployments. You also get the benefit that if I’m going to add a new feature, a new analytic, I get it for free on both sides, because I’ve created this as a library. If I make things faster, if I swap out this EJML library, it’s going to be faster on both sides.
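Here is a hedged sketch of that “one pure library, multiple deployments” shape: the capability itself knows nothing about databases, RPC, or Spark, and each deployment binds it to its own I/O. All names are illustrative, not Aladdin’s actual code.

```java
import java.util.List;

/** Pure, stateless capability: no database, no RPC, no Spark dependencies. */
final class AnalysisLibrary {
    AnalysisResult analyze(PortfolioInput input) {
        double risk = 0.0; // the real calculation (e.g. E' * C * E) would go here
        return new AnalysisResult(input.id(), risk);
    }
}

record PortfolioInput(String id) {}
record AnalysisResult(String portfolioId, double risk) {}

/** API deployment: one portfolio per call, bound to reliable, low-latency stores. */
final class AnalysisApiService {
    private final AnalysisLibrary lib = new AnalysisLibrary();

    AnalysisResult handleRequest(PortfolioInput input) {
        return lib.analyze(input);
    }
}

/** Batch deployment: the same library mapped over many portfolios; in practice this
    loop would live inside a Spark job reading and writing files for throughput. */
final class AnalysisBatchJob {
    private final AnalysisLibrary lib = new AnalysisLibrary();

    List<AnalysisResult> run(List<PortfolioInput> inputs) {
        return inputs.stream().map(lib::analyze).toList();
    }
}
```

Because both deployments call the same library, any new analytic or performance improvement shows up on both sides, which is the analytical-consistency benefit described above.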

One of the interesting things, though, is that when you’re actually in your IDE, trying to do this, when you’re actually going through your code and modules, you’ll find that you’re adding interfaces that feel like they shouldn’t exist. It feels like a lot of extra work. It’s going to feel like a lot of extra complexity just for the sake of this, just to unchain your logic from your database, from your RPC, and all that stuff. At the end of the day, we’ve found that it’s really worth it to put that extra complexity in there, because it lets you unchain your capability from how it’s running. Then you can stamp out all the deployments you need. We’ve talked here about an API deployment and a batch deployment. You can also have ones that skew even further towards the reliability end, or whatever you need. There could be three or four other deployments here. We actually didn’t follow this guidance to begin with. We did the obvious thing, which was to code everything inside of our Spark job in Scala. We had all of our calculation logic tightly bound to our Spark jobs at first, and we had to rewrite the whole thing when we had to build an API. Now we know, and we’re going to apply that lesson to everything else we do.

The final dimension of modularization is scale requirements. The way I think about this is that if you have two workloads that are fundamentally related, they deal with the same set of data, they deal with the same domain, but one of them has a very different scale profile from the other, you might be tempted to put them in one module, but you should consider drawing the line between them to make them two separate modules. For example, portfolio analysis, the one I keep talking about, is going to be that purple box here. It’s a very calculation-heavy capability, and it needs to run at very large scale, but it fundamentally only ever reads from a database. It’s very lightweight on your database, and you can accelerate that with caches and all that stuff. Then on the other hand, we also have low-scale workloads that get triggered when someone makes a trade on a portfolio or updates a portfolio. It’s still dealing with portfolios, but at a much lower scale. On top of that, it also writes to a database, and writes tend to be more expensive than reads. If you were to couple those two things in one module, then you’d have to scale them up together too. To scale up your purple boxes, you’d have to scale up your orange boxes in the same way. If you draw a line between them, and you have two separate capabilities and two separate modules, you can run them as separate services, and you can scale them up exactly the way you need. You end up with one orange and three purples instead of three oranges and three purples. This isn’t a hard and fast rule. I used the example of reads versus writes from databases, but there are plenty of cases where you actually don’t want to draw that boundary, even if there is a big scale difference. For example, if you have a CRUD service, chances are the read call gets used much more often than the rest, but you’re not necessarily going to split that CRUD service apart just to scale the reads separately. We like to call this splitting between stateless services and stateful services, even though technically they’re all stateless, because the services themselves aren’t maintaining state. As much as possible, we try to carve out the stateless part of a workload so that we can scale it up. To summarize this fifth point, we’ve found that it’s valuable to consider multiple dimensions all at once when you’re drawing your module boundaries. There’s the dimension of change from Parnas’s paper, there’s the dimension of capability, so basically thinking about what your product building blocks are, and there’s also the dimension of scale, thinking about what your scale requirements are. There’s no universal playbook, no universal right answer here. This is just something to think about, considering all these things at once when you’re designing your system.

6. No Backsies Allowed

The final insight I want to talk about now is no backsies allowed. The point here is that you need to be careful about what you expose externally, what you expose outside of your walls. Because once you expose something, your callers, your customers, are going to expect it to always work. They’re going to depend on it. They’re going to build key critical workflows that require it. You’re never going to be able to take that back. A very common way this happens is that you add some features that are hard to maintain, they’re complex, and you don’t expect them to be that useful, but then one person, one client firm, happens to use them, and you’re stuck with that. You have to carry it forever. The other way this takes shape is with SLAs. If you’re overly permissive with your SLAs, if you set them too high, or if you accidentally choose not to enforce them, you’re going to be in a lot of trouble. Being too permissive will reduce your agility and increase the fragility of your system. An example of that would be if you get a calculation request, an analysis request, that’s super large or super complex in some way, and you happen to be able to process it successfully within a certain amount of time. The only reason that happened was because there was no other traffic going on, and so all of the typical bottlenecks were totally free. Your customers don’t care that that was the case. Your callers are going to say, it worked once, it’s going to work a second time, and I’m going to keep going on and on. You’re going to be stuck with that. This can also happen with retention, so data retention requirements. If you said that you’re going to purge data or not make data available after a certain amount of time, and then you don’t do that, and someone is able to fetch out their old data, they’re going to quietly start depending on that too. In both of these cases with SLAs, these are real things that have happened to us, and your only choice is to find out how you’re going to make that work. There are at least two ways you can do that. One way is you can spend money to make it work. You can spend money on hardware, whether it’s memory, or disk, or CPU cores, and you can make sure that you can continue that quality of service. The other way is that your developers can be so inspired by these challenges that they find more optimizations and performance in their code. That’s actually a good thing. You want your engineers to be challenged, and challenged to innovate and deliver for your clients. It’s still a cost that you didn’t have to pay at that time. The key is that you should be conservative about your throttles, and make sure you have throttles on everything you do. That way you don’t get into this situation. If you have service readiness checklists as part of deployments, you should think about adding a throttle check, and reasonable validation of that throttle, to that service readiness check. Also, yes, don’t expose any features that you don’t plan to support forever, because otherwise you’re stuck with these things. They’re going to be like those anchors on the page here. They’re going to prevent you from getting things done and otherwise slow you down.
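A hedged sketch of the throttle idea: cap request size and in-flight concurrency before the expensive work starts, so an oversized request that happened to succeed once on a quiet system never turns into an implicit SLA. The limits below are invented for illustration.

```java
import java.util.concurrent.Semaphore;

// Conservative throttles in front of an expensive analysis endpoint.
public final class AnalysisThrottle {

    private static final int MAX_POSITIONS_PER_REQUEST = 5_000; // illustrative limit
    private final Semaphore inFlight = new Semaphore(64);       // illustrative concurrency cap

    public String analyze(int positionCount) {
        if (positionCount > MAX_POSITIONS_PER_REQUEST) {
            // Reject explicitly rather than quietly succeeding under light load.
            throw new IllegalArgumentException("request exceeds the supported portfolio size");
        }
        if (!inFlight.tryAcquire()) {
            throw new IllegalStateException("too many concurrent requests, retry later");
        }
        try {
            return doExpensiveAnalysis(positionCount);
        } finally {
            inFlight.release();
        }
    }

    private String doExpensiveAnalysis(int positionCount) {
        return "analysis for " + positionCount + " positions";
    }
}
```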

Summary

We’ve achieved a lot of scale with Aladdin Wealth. We’ve got more than 15 million portfolios analyzed on our system every night. We’ve got more than 3 million API calls that we process throughout the day. We’ve got surges of more than 8000 portfolios per minute. To achieve this level of scale, these have been some of the key insights that we’ve learned along the way. I also want to leave you with one other point, which is that writing great code, profiling it with all sorts of tools, and just generally being passionate about the performance of your applications and your customer experience is all going to be very important to being successful. On top of that, people: hiring, retaining, and developing the people on your team is going to be a key part of whether or not you can achieve this.

Questions and Answers

Participant 1: I was curious about when you want to break things out into modules and evolve them into their own products. It sounds like you’ve done this a few times. Have you been taking more of an incubator approach, where you know ahead of time that some service is going to be used by multiple teams, and so you build it upfront with that in mind, not committing to any particular use case? Or are you starting off with one pure use case that seems like it has a lot of potential broad applicability, and then breaking it out into your platform once it’s proven itself in one case?

Chen: I think we try to do the former. We try to build capabilities first, and make sure that we map out what we think a capability needs to be, so that it’s not overly specific to one use case, what I’d call a hero use case, and we try to shape it accordingly. We are also an enterprise B2B business, and so sometimes we do have to really over-index on one use case versus another. It’s a balance, in that you need your product managers to get a little bit more technical to actually do this, and to think with a platform-first approach: to think about the platform as being the product itself, not the end-user experience as being your product.

Participant 2: You mentioned the business problem, saying the business may be OK if data doesn’t match between the formats. There is a lot of data replication here, because you’re showing us four different data streams. As an enterprise, is your policy OK with that kind of data replication?

Chen: I did mention data inconsistency as a tradeoff, but the other one is cost, because there’s going to be a lot of copies of these things there. That is something that you’ll have to weigh in terms of what is the benefit you’re getting out of that. For example, if most of your Solr use cases can be satisfied with Snowflake or Hive, and the reasonable expectations of performance from the user do not demand what you get from Solr, then there’s a cost dimension here that you can say, I choose to remove one copy of this data, save on storage, but then you suffer a little bit somewhere else. That’s the story of that. This is a tradeoff between cost and user experience here.

Participant 2: It’s not so much from the cost perspective: which one is the certified data? If you’re feeding the data to a downstream system or to an external party, what criteria are you using?

Chen: You do have to pick one of those things as your source of truth. When you’re dealing with analytical data, that source of truth can be something like files, something that you’re going to archive. If you have a factory that’s generating the stuff overnight, you can say, I’m going to snapshot at this point, and this is what I’m going to put into files. In our line of business, that’s actually pretty helpful, because we also need to deliver large-scale extracts that say: this is what we actually saw, this is what we computed, this is what the universe looks like. This is the stuff that you can go and warehouse in your data warehouse.

Participant 3: I want to hear, since you talked about a couple of migrations there, just about the role of testing, and how you kind of thought of that, especially dealing with these fairly advanced linear algebra problems and making sure that you’re getting those same numbers before the migration, after the migration, and going into the future. Maybe just hear a little bit about that side of things.

Chen: Like I mentioned, analytical consistency across the deployments is very important, and so is consistency across releases. I think that’s definitely an evolving story for us on an ongoing basis. I will talk about one of those migrations, which was when we moved from our Scala-based risk engine to our Java one. The testing for that took months. We built specialized tooling to do it, to make sure every single number we calculated across all of our client firms was the same afterwards. It actually wasn’t the same all the time. It turns out we had a couple of weird cases, or modeling treatments, that had to get updated. You do have to build tooling and invest in that to make sure that’s the case. Because when you’re dealing with these models, your client firms trust that you’re right. It’s complex to calculate these things, but if they shift and some bias gets introduced, that’s a huge deal. We also have checks and controls in place at all levels. When you think about that factory and those nodes, your yield curve model is going to have its own set of backtesting results after every change, and so on. Every aspect of that graph gets validated.
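For illustration, here is a hedged sketch of the kind of parity-check tooling being described: run both engines over the same portfolios and flag any result that differs beyond a small tolerance. The interfaces are hypothetical, not the actual migration tooling.

```java
import java.util.ArrayList;
import java.util.List;

// Compare an old and a new risk engine over the same portfolios before cutover.
public final class MigrationParityCheck {

    interface RiskEngine { double risk(String portfolioId); }

    public static List<String> compare(List<String> portfolioIds,
                                       RiskEngine oldEngine,
                                       RiskEngine newEngine,
                                       double tolerance) {
        List<String> mismatches = new ArrayList<>();
        for (String id : portfolioIds) {
            double before = oldEngine.risk(id);
            double after = newEngine.risk(id);
            if (Math.abs(before - after) > tolerance) {
                mismatches.add(id + ": " + before + " vs " + after);
            }
        }
        return mismatches; // every entry here needs investigation before cutover
    }
}
```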

Participant 4: I work in AgTech, and so we have, at a smaller scale, a similar scenario to what you outlined, where you’ve got an analysis engine. It’s stateless, so you take some inputs and you generate a whole bunch of analysis outputs. Even though the engine is technically stateless, it does have an extra dependency on some public data that’s published by the government. Some of the parameters that feed into the engine are pulled dynamically by the engine out of that database; they’re not part of the inputs that are going in. If your system has this kind of extra database dependency, is it still viable to have that multi-deployment that you described, where you’ve got the engine here in an API and you’ve got the engine over there in batch? Is it practical and easy to move around that way when you’ve got an external dependency like this? Is that maybe a problem that you have encountered?

Chen: We do have external dependencies. I do think it depends very much on the circumstances and the nature of what that external dependency is. We do have some of those things that we depend on, for example, maybe a security model that is slow and unpredictable, and that we don’t control within the rest of Aladdin. What we do there is we evaluate whether we should either rebuild that thing or have a specialized cache for it, and then take ownership of that, basically creating a very clean boundary. Think about the modularization, like drawing a module around it. You might have to add some code to do that. That’s something that we have done. In some cases we’ve chosen to make a copy of those things and make it available on the outside.

Participant 4: It sounds like you take the dependencies and you just go ahead and copy them over, because as far as you’re concerned, for your use case, it’s worth it. That’s just a choice we have to make.

Participant 5: I imagine that you work with hundreds of thousands of customers that are all different; no two customers are alike. When you’re designing an engineering system, how do you account for that? To what extent do you try to have a system that works for at least most of the customers? Do you treat them as different channels, give them different instances of the pipeline or the architecture, different data storage? I’m really interested in your perspective.

Chen: I think for the most part, we like to be building one platform and one set of technology to back it, and all sorts of stuff like that. I’m going to go back to the computational graph here. We do like to build one product and one thing, but when you have clients across the globe, and you have client firms that are using this for tons of different people with a lot of different circumstances, you’re right, they’re going to have a lot of different needs. Doing business in Europe is very different. It’s not just GDPR, it’s also how they manage money. They don’t even use the term financial advisor over there. They have very different business models. When you’re global, you have to be able to adapt your system to all of these different markets if you want to succeed. I talked about locking these things down. You also have to open up certain points of configuration, but you want to be very intentional about which parts you’re unlocking, because you’re trading off your scale as a result. For example, every client firm is going to have a different set of risk assumptions. They may choose to weight different half-lives. They may choose different overlaps, when you think about some of these data analysis techniques. They may have different wants, and so you have to be able to let them vary those things, but then it’s one set of assumptions for that one enterprise, and that’s one of the beauties of a B2B enterprise business: you can lock it down at that point. In the same way, client firms’ views on what is an opportunity, what is a violation, all these kinds of things like compliance alerts, need to be specific to your customers as well. Also, one key thing about Aladdin is that we don’t have any investment views ourselves; we have to be able to take investment views from our customers. That’s another point of configuration. For each one of those things you do, you have to weigh those tradeoffs again. What is the value, what is the commercial impact of opening that up, versus the cost of how you’re going to be able to scale? It really requires very close partnership.

See more presentations with transcripts



MongoDB has Over 3,000 Customers in India and Growing – Analytics India Magazine

MMS Founder
MMS RSS

Posted on nosqlgooglealerts. Visit nosqlgooglealerts

MongoDB loves India. Boris Bialek, the field CTO of MongoDB, was in Bengaluru earlier this month and in an exclusive interaction with AIM, he said: “India’s market momentum is tremendous. The growth is huge, and we have over 3,000 customers here. We experience millions of downloads every month and have an exceptionally active developer community here.” 

In India, MongoDB, the king of the NoSQL database, serves unicorns, smaller startups, and digital native companies, including those specialising in generative AI and digital transformation initiatives. Some of its notable customers include Zomato, Tata Digital, Canara HSBC Life Insurance, Tata AIG, and Devnagri. 

India cares about data security like no other. With data and AI sovereignty on the rise, there’s a demand for keeping data within a controlled environment, either on-premises or in a cloud, but not over public APIs. This is where MongoDB comes in—it acts as a bridge, integrating various components into a cohesive system. 

“Our approach emphasises simplicity, transparency, and trust. We make things clear around how vectors are used and provide transparency in data usage. This level of clarity addresses concerns about system trustworthiness and what is being built,” he added. 

Other NoSQL databases like Redis and Apache Cassandra are also widely used by Indian developers. The former has over 12 million daily downloads and derives 60% of its revenue from national database projects. Apache Cassandra has a strong presence of 14.95%, with companies like Infosys, Fujitsu, and Panasonic using it. Amazon DynamoDB holds a market share of approximately 9.5%.

Focuses on Real-time

“We no longer see ourselves just as a NoSQL database. We’re part of a larger picture, providing services for banking transactions, multi-document handling, search capabilities, and integrating edge components for global manufacturing lines,” said Bialek. 

This is where the concept of a ‘developer data platform’ emerges, and it’s a significant change the team has observed over recent years. It’s about accelerating integration without the need to maintain multiple disconnected systems.

“We’re discussing real-time data. Everything today demands immediacy, whether it’s a UPI transaction that needs decisions in milliseconds or providing individualised, real-time data like online stock trades,” he added. Behind this is a vast amount of JSON data, which is what the JSON document model in MongoDB is about.

Several companies achieved remarkable improvements in efficiency and customer satisfaction through MongoDB solutions. Bialek spoke about an Italian energy company that was able to slash its service desk response time from a day to two minutes!

Meanwhile, a German company pioneered chatbot systems utilising self-trained LLMs for tailored client interactions, such as rapid insurance claim processing. In the retail sector, data analysis led to a 50% reduction in return shipments for an online shoe retailer by suggesting optimal shoe sizes, demonstrating the mutual benefits of predictive services for customers and businesses.

Generative AI Struggles  

During recent customer interactions, Bialek noted several challenges regarding speeding up and implementing new technology like generative AI.

“No matter if the company is small or large, the main challenge is accelerating business processes. Developers are in high demand, so the focus is on how fast we can create new platforms like payment systems and integrate new technologies like UPI LITE with limited help. We’re collaborating with major payment providers to address these challenges,” Bialek added. 

Secondly, customers are experiencing a stark difference between companies that are aggressively implementing generative AI and have clear directions, vis à vis others that are still figuring it out. So, MongoDB assists them in understanding and applying use cases. 

MongoDB to the Rescue 

In November last year, the company added new features to MongoDB Atlas Vector Search that offers several benefits for generative AI application development.

“The feedback has been overwhelmingly positive, citing its ease of use. Some clients have even built solutions in half a day, a task that would have previously taken six months and a consultant,” Bialek commented. 

Furthermore, LLMs require storage for vectors, and MongoDB simplifies access to these. The emphasis is on delivering a holistic solution by integrating data layers, applications, and underlying technologies. 

In January of this year, MongoDB partnered with California-based Patronus AI to provide automated LLM evaluation and testing for business clients. This partnership merges Patronus AI’s functions with MongoDB’s Atlas Vector Search tool, creating a retrieval system solution for dependable document-based LLM workflows. 

Customers can build these systems through MongoDB Atlas and utilise Patronus AI for evaluation, testing, and monitoring, enhancing accuracy and reliability.

“I would say that MongoDB is a bit like the glue in the system, bringing everything together in a straightforward manner,” said Bialek.

What’s Next? 

Going forward, Bialek noted that the primary goal is to focus on improving user-friendliness through simplicity, real-time data processing, and application development. 

“We are releasing a new version of our product this year, as we do annually. We’re currently showcasing our Atlas Streams in public preview, which is crucial for real-time streams, processing, and vector search integration. There’s a lot more in development, too,” concluded Bialek.

Read more: Is MongoDB Vector Search the Panacea for all LLM Problems? 



Redis Switches to SSPLv1: Restrictive License Sparks Fork by Former Maintainers

MMS Founder
MMS Renato Losio

Article originally posted on InfoQ. Visit InfoQ

Redis has recently announced a change in their license by transitioning from the open-source BSD to the more restrictive Server Side Public License (SSPLv1). The move has promptly led to a fork initiated by former maintainers and reignited discussions surrounding the sustainability of open-source initiatives.

Starting with Redis 7.4, all subsequent versions will be dual-licensed under the Redis Source Available License (RSALv2) and SSPLv1, foregoing distribution under an approved open-source license. Rowan Trollope, CEO of Redis, explains:

The new source-available licenses allow us to sustainably provide permissive use of our source code (…) Under the new license, cloud service providers hosting Redis offerings will no longer be permitted to use the source code of Redis free of charge.

Started in 2009 by Salvatore Sanfilippo, Redis has emerged as the most popular in-memory storage solution, used as a distributed, in-memory key–value database and cache. With the adoption of the more restrictive SSPL license created by MongoDB, Redis will no longer be an open-source project. In a follow-up article, Trollope and Yiftach Shoolman, co-founder and president at Redis, address some of the community's concerns and announce new features:

We’re going back to our roots as the world’s fastest real-time data platform and are proud to announce the acquisition of Speedb, the world’s fastest data storage engine. Over the past two years, we’ve been working directly with Speedb, integrating it as the default storage engine in the Redis Enterprise auto-tiering functionality launched recently in version 7.2.

Reflecting on Redis's journey, Khawaja Shams, co-founder and CEO at Momento, and Tony Valderrama, head of product at Momento, argue that "Redis did not create Redis" and describe "how Garantia Data pulled off the biggest heist in open source history."

The licensing transition has raised concerns in the community, and Madelyn Olson, principal engineer at Amazon ElastiCache and formerly a member of the Redis open-source governance, writes:

I’ve gotten together with various former Redis contributors and we’ve started working on a fork. We are all unhappy with the license change and are looking to build a new truly open community to fill the void left by Redis. Come join us!

Similar license changes by Elastic for Elasticsearch triggered official forks from AWS, with the cloud provider recently building GLIDE, a new Redis client. Olson clarifies the status of the new project:

Is this an AWS fork? AWS employs me, but this is just me trying to keep the continuity with the community. AWS is aware of what I’m doing and is preparing their own response.

Werner Vogels, CTO at Amazon, adds:

I am excited about the actions Madelyn Olson and other core Redis maintainers are taking. BTW, this is Madelyn taking action, not an official AWS announcement. She should get serious credit for her bias for action. Expect more news soon.

The new license applies to the upcoming 7.4 release, with all previous versions continuing to be open-source under the BSD license.




MongoDB has Over 3,000 Customers in India and Growing – Analytics India Magazine

MMS Founder
MMS RSS

Posted on mongodb google news. Visit mongodb google news

Listen to this story

MongoDB loves India. Boris Bialek, the field CTO of MongoDB, was in Bengaluru earlier this month and in an exclusive interaction with AIM, he said: “India’s market momentum is tremendous. The growth is huge, and we have over 3,000 customers here. We experience millions of downloads every month and have an exceptionally active developer community here.” 

In India, MongoDB, the king of NoSQL databases, serves unicorns, smaller startups, and digital native companies, including those specialising in generative AI and digital transformation initiatives. Some of its notable customers include Zomato, Tata Digital, Canara HSBC Life Insurance, Tata AIG, and Devnagri. 

India cares about data security like no other. With data and AI sovereignty on the rise, there’s a demand for keeping data within a controlled environment, either on-premises or in a cloud, but not over public APIs. This is where MongoDB comes in—it acts as a bridge, integrating various components into a cohesive system. 

“Our approach emphasises simplicity, transparency, and trust. We make things clear around how vectors are used and provide transparency in data usage. This level of clarity addresses concerns about system trustworthiness and what is being built,” he added. 

Other NoSQL databases like Redis and Apache Cassandra are also widely used by Indian developers. The former has over 12 million daily downloads and derives 60% of its revenue from national database projects. Apache Cassandra holds a market share of 14.95%, with companies like Infosys, Fujitsu, and Panasonic using it. Amazon DynamoDB holds a market share of approximately 9.5%.

Focuses on Real-time

“We no longer see ourselves just as a NoSQL database. We’re part of a larger picture, providing services for banking transactions, multi-document handling, search capabilities, and integrating edge components for global manufacturing lines,” said Bialek. 

This is where the concept of a ‘developer data platform’ emerges, and it’s a significant change the team has observed over recent years. It’s about accelerating integration without the need to maintain multiple disconnected systems.

“We’re discussing real-time data. Everything today demands immediacy, whether it’s a UPI transaction that needs decisions in milliseconds or providing individualised, real-time data like online stock trades,” he added. Behind this is a vast amount of JSON data, which is what the JSON document model in MongoDB is about.

Several companies achieved remarkable improvements in efficiency and customer satisfaction through MongoDB solutions. Bialek spoke about an Italian energy company that was able to slash its service desk response time from a day to two minutes!

Meanwhile, a German company pioneered chatbot systems utilising self-trained LLMs for tailored client interactions, such as rapid insurance claim processing. In the retail sector, data analysis led to a 50% reduction in return shipments for an online shoe retailer by suggesting optimal shoe sizes, demonstrating the mutual benefits of predictive services for customers and businesses.

Generative AI Struggles  

During recent customer interactions, Bialek noted several challenges around adopting and implementing new technology like generative AI at speed.

“No matter if the company is small or large, the main challenge is accelerating business processes. Developers are in high demand, so the focus is on how fast we can create new platforms like payment systems and integrate new technologies like UPI LITE with limited help. We’re collaborating with major payment providers to address these challenges,” Bialek added. 

Secondly, customers are seeing a stark difference between companies that are aggressively implementing generative AI with clear direction, vis-à-vis others that are still figuring it out. So, MongoDB assists them in understanding and applying use cases. 

MongoDB to the Rescue 

In November last year, the company added new features to MongoDB Atlas Vector Search that offer several benefits for generative AI application development.

“The feedback has been overwhelmingly positive, citing its ease of use. Some clients have even built solutions in half a day, a task that would have previously taken six months and a consultant,” Bialek commented. 

Furthermore, LLMs require storage for vectors, and MongoDB simplifies access to these. The emphasis is on delivering a holistic solution by integrating data layers, applications, and underlying technologies. 

In January of this year, MongoDB partnered with California-based Patronus AI to provide automated LLM evaluation and testing for business clients. This partnership merges Patronus AI’s functions with MongoDB’s Atlas Vector Search tool, creating a retrieval system solution for dependable document-based LLM workflows. 

Customers can build these systems through MongoDB Atlas and utilise Patronus AI for evaluation, testing, and monitoring, enhancing accuracy and reliability.

“I would say that MongoDB is a bit like the glue in the system, bringing everything together in a straightforward manner,” said Bialek. 

What’s Next? 

Going forward, Bialek noted that the primary goal is to focus on improving user-friendliness through simplicity, real-time data processing, and application development. 

“We are releasing a new version of our product this year, as we do annually. We’re currently showcasing our Atlas Streams in public preview, which is crucial for real-time streams, processing, and vector search integration. There’s a lot more in development, too,” concluded Bialek.

Read more: Is MongoDB Vector Search the Panacea for all LLM Problems? 

Article originally posted on mongodb google news. Visit mongodb google news



Public preview: Change partition key of a container in Azure Cosmos DB (NoSQL API)

MMS Founder
MMS RSS

Posted on nosqlgooglealerts. Visit nosqlgooglealerts

We are pleased to announce the introduction of the capability to change the partition key of a container in Azure Cosmos DB for the NoSQL API. You can now efficiently create a new partition key for a container and transfer data from the old to the new container seamlessly through the Azure Portal, utilizing the offline container copy process with only a few clicks. 

Learn more.



Leaked data from Shopify plugins developed by Saara

MMS Founder
MMS RSS

Posted on mongodb google news. Visit mongodb google news


Security

Wednesday, March 27, 2024


The Cybernews research team exposed a data breach from a publicly accessible MongoDB database tied to Saara, a developer of Shopify plugins. The data included over 7.6 million orders and sensitive customer information, remained exposed for eight months, and was held for a ransom of 0.01 bitcoin.

The Cybernews research team discovered that a vast amount of shoppers' sensitive data was exposed to threat actors by Saara, a developer of plugins for the e-commerce giant Shopify, with millions of orders being leaked.

Key findings from the Cybernews report covering the data breach of the Shopify plugins developed by Saara:

  • Researchers discovered a publicly accessible MongoDB database belonging to a US-based company, Saara, that is developing Shopify plugins.
  • The leaked database stored 25GB of data.
  • Leaked data was collected by plugins from over 1,800 Shopify stores using the company’s plugins.
  • It held data from more than 7.6 million individual orders, including sensitive customer data.
  • The data stayed up for grabs for eight months and was likely accessed by threat actors.
  • The database contained a ransom note demanding 0.01 bitcoin (around $640), or the data would be released publicly.

Plugins confirmed as affected by the leak:

  • EcoReturns: for AI-powered returns 
  • WyseMe: to acquire top shoppers

Leaked data included:

  • Customer names 
  • Email addresses 
  • Phone numbers 
  • Addresses 
  • Information about ordered items 
  • Order tracking numbers and links 
  • IP addresses 
  • User agents
  • Partial payment information

Some of the online stores most affected by the leak: 

  • Snitch
  • Bliss Club
  • Steve Madden
  • The Tribe Concepts
  • Scoboo.in
  • OneOne Swimwear


Article originally posted on mongodb google news. Visit mongodb google news



Ridge Holland update – Gerweck.net

MMS Founder
MMS RSS

Posted on mongodb google news. Visit mongodb google news

On last night’s episode of NXT, Ridge Holland addressed the WWE Universe. He acknowledged that his performances over the past few weeks fell short of his usual standards. Holland emphasized the significance of his family in his life. Recognizing that his focus and passion were wavering, he made the difficult decision to step away from in-ring competition indefinitely.

Ridge Holland was then moved to the Alumni section of WWE's official website. Fightful Select's Corey Brennan has since reported that this is indeed a storyline. The report states: "NXT sources have confirmed to Corey Brennan of Fightful Select that this is part of Holland's current storyline and is not an official retirement. The story has been one that Holland has been motivated to do, with the former Brawling Brute being receptive to suggestions."

Article originally posted on mongodb google news. Visit mongodb google news



DigitalOcean Introduces CPU-based Autoscaling for its App Platform

MMS Founder
MMS Sergio De Simone

Article originally posted on InfoQ. Visit InfoQ

DigitalOcean has launched automatic horizontal scaling for its App Platform PaaS, aiming to free developers from the burden of scaling services up or down based on CPU load all by themselves.

This capability helps ensure that your applications can seamlessly handle fluctuating demand while optimizing resource usage and minimizing costs. You can configure autoscaling either through the user interface or via the appspec.

Besides simplifying the creation of scalable services on their managed platform, the new capability should also help optimize performance and cost, says DigitalOcean.

The autoscaling capability continuously collects CPU metrics and compares the average CPU utilization across all containers against a pre-defined threshold. When CPU utilization over a given period exceeds the threshold, the current deployment is cloned to create new container instances. Conversely, when CPU utilization falls below the threshold, the system automatically removes container instances. Users can set the threshold and limit both the maximum and the minimum number of container instances allowed to run at any given time.
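
To make the scaling rule concrete, the decision reduces to comparing average CPU utilization against the threshold and clamping the resulting instance count to the configured bounds. The following Python sketch is purely illustrative and is not DigitalOcean's implementation; the function name and the one-instance-at-a-time adjustment are assumptions.

def desired_instance_count(avg_cpu_percent, threshold, current, min_instances, max_instances):
    """Illustrative threshold-based horizontal scaling decision (not DigitalOcean's code)."""
    if avg_cpu_percent > threshold:
        target = current + 1  # scale out: clone the current deployment
    elif avg_cpu_percent < threshold:
        target = current - 1  # scale in: remove a container instance
    else:
        target = current
    # Never exceed the configured bounds
    return max(min_instances, min(max_instances, target))

# Example: 85% average CPU against an 80% threshold with 3 running instances -> 4
print(desired_instance_count(85.0, 80.0, current=3, min_instances=2, max_instances=10))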

Autoscaling is always available for any component with dedicated instances. To enable it, you only need to define the maximum and minimum instance count and a CPU utilization threshold. The following snippet shows how you can define those values in an appspec file:

...
services:
- autoscaling:
    max_instance_count: 10
    min_instance_count: 2
    metrics:
      cpu:
        percent: 80
...

Once you have defined your appspec file, you can deploy that configuration using the DigitalOcean CLI with doctl apps update. Alternatively, you can use the DigitalOcean API and provide all required configuration parameters in a JSON payload.
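
As a rough sketch of the API route, the snippet below sends an updated app spec with the Python requests library. It assumes the v2 apps update endpoint (PUT /v2/apps/{id}), a personal access token in the DIGITALOCEAN_TOKEN environment variable, and a placeholder app ID; the trimmed-down spec is hypothetical, and a real spec would also include the service's source (for example, a git repository or container image).

import json
import os

import requests

APP_ID = "your-app-id"  # hypothetical placeholder
token = os.environ["DIGITALOCEAN_TOKEN"]  # assumed to hold a DigitalOcean personal access token

# Hypothetical, trimmed-down spec; a real one also needs the service's source definition
spec = {
    "name": "sample-app",
    "services": [
        {
            "name": "web",
            "autoscaling": {
                "min_instance_count": 2,
                "max_instance_count": 10,
                "metrics": {"cpu": {"percent": 80}},
            },
        }
    ],
}

# Assumed endpoint: PUT /v2/apps/{id} with the full app spec in the request body
resp = requests.put(
    f"https://api.digitalocean.com/v2/apps/{APP_ID}",
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    data=json.dumps({"spec": spec}),
)
resp.raise_for_status()
print(resp.status_code)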

If you do not want to use the autoscaling capability, you must provide an instance_count value instead of max_instance_count and min_instance_count.

Appspec is a YAML-based format developers can use to configure apps running on DigitalOcean App Platform, including external resources as well as environment and configuration variables. It is also possible to configure an app using the Web-based interface and then download its appspec as a backup at a specific point in time.

App Platform is DigitalOcean's platform-as-a-service (PaaS) solution that allows developers to create their deployment from a git repository or from pre-built container images, with the platform taking care of the entire application lifecycle.




Ridge Holland announces he is “stepping away from in-ring competition” – Gerweck.net

MMS Founder
MMS RSS

Posted on mongodb google news. Visit mongodb google news

Ridge Holland delivered an emotional speech in the ring, expressing the difficulty of what he was about to say. He acknowledged the perception surrounding him and the impact of his job on his personal life. Holland revealed he has had tough discussions with loved ones and himself, citing a lack of the mental and physical resilience required for rugby and wrestling. In the best interest of his family, he announced his indefinite departure from in-ring competition.

Holland has been moved to the alumni section of WWE’s website.

Article originally posted on mongodb google news. Visit mongodb google news



How to Build Your Own RAG System With LlamaIndex and MongoDB – Built In

MMS Founder
MMS RSS

Posted on mongodb google news. Visit mongodb google news

Large language models (LLMs) have provided new capabilities for modern applications, specifically the ability to generate text in response to a prompt or user query. Common examples of LLMs include GPT-3.5, GPT-4, Bard and Llama, among others. To improve the capabilities of LLMs to respond to user queries with contextually relevant and accurate information, AI engineers and developers have developed two main approaches.

The first is to fine-tune the baseline LLM with proprietary, context-relevant data. The second, and more cost-effective, approach is to connect the LLM to a data store with a retrieval model that extracts semantically relevant information from the database to add context to the LLM user input. This approach is referred to as retrieval augmented generation (RAG). 

Retrieval Augmented Generation (RAG) Explained

Retrieval augmented generation (RAG) systems connect an LLM to a data store with a retrieval model that returns semantically relevant results. They improve responses by adding relevant information from external sources to user queries. RAG systems are a cost-effective approach to developing LLM applications that provide up-to-date and relevant information compared to fine-tuning.

RAG systems are an improvement on fine-tuning models for the following reasons:

  • Fine-tuning requires a substantial amount of proprietary, domain-specific data, which can be resource-intensive to collect and prepare. 
  • Fine-tuning an LLM for every new context or application demands considerable computational resources and time, making it a less scalable solution for applications needing to adapt to various domains or data sets swiftly.

What Is Retrieval Augmented Generation (RAG)?

RAG is a system design pattern that leverages information retrieval techniques and generative AI models to provide accurate and relevant responses to user queries. It does so by retrieving semantically relevant data to supplement the user query with additional context, which is combined with the query as input to the LLM.

Retrieval augmented generation process. | Image: Richmond Alake

The RAG architectural design pattern for LLM applications enhances the relevance and accuracy of LLM responses by incorporating context from external data sources. However, integrating and maintaining the retrieval component can increase complexity, which can add to the system’s overall latency. Also, the quality of the generated responses heavily depends on the relevance and quality of the data in the external source.

RAG pipelines and systems rely on an AI stack, also known as either a modern AI stack or Gen AI stack.

This refers to the composition of models, databases, libraries and frameworks used to build modern applications with generative AI capabilities. The infrastructure leverages parametric knowledge from the LLM and non-parametric knowledge from data to augment user queries.

Components of the AI stack include the following: models, orchestrators or integrators, and operational and vector databases. In this tutorial, MongoDB will act as the operational and vector database.

AI and POLM stack components. | Image: Richmond Alake

A tutorial on the basics of a retrieval augmented generation (RAG) system. | Video: IBM Technology

More on LLMs: What Enterprises Need to Know Before Adopting an LLM

How to Create a RAG System

The tutorial outlines the steps for implementing the standard stages of a RAG pipeline. Specifically, it focuses on creating a chatbot-like system to respond to user inquiries about Airbnb listings by offering recommendations or general information.

1. Install Libraries

The following code snippet installs various libraries that provide functionality to access large language models, embedding models, and database connection methods. These libraries abstract away some of the complexity, condensing what would otherwise be extensive code into just a few lines and method calls.

  • LlamaIndex: This is an LLM/data framework that provides functionalities to connect data sources — files, PDFs and websites, etc. — to both closed (OpenAI or Cohere) and open-source (Llama) large language models. The LlamaIndex framework abstracts complexities associated with data ingestion, RAG pipeline implementation and development of LLM applications.
  • PyMongo: A Python driver for MongoDB that provides functionality to connect to a MongoDB database and query data stored as documents using the various methods offered by the library.
  • Datasets: A Hugging Face library that provides access to a suite of data collections by specifying their path on the Hugging Face platform.
  • Pandas: A library that enables the creation of data structures that facilitate efficient data processing and modification in Python environments.
!pip install llama-index
!pip install llama-index-vector-stores-mongodb
!pip install llama-index-embeddings-openai
!pip install pymongo
!pip install datasets
!pip install pandas

2. Data Loading and OpenAI Key Setup

The command below assigns an OpenAI API key to the environment variable OPENAI_API_KEY. This is required to ensure LlamaIndex creates an OpenAI client with the provided OpenAI API key to access features such as LLM models (GPT-3, GPT-3.5-turbo and GPT-4) and embedding models (text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large).

import os
os.environ["OPENAI_API_KEY"] = ""

The next step is to load the data within the development environment. The data for this tutorial is sourced from the Hugging Face platform, more specifically, the Airbnb data set made available via MongoDB. This data set comprises Airbnb listings, complete with property descriptions, reviews, and various metadata. 

Additionally, it features text embeddings for the property descriptions and image embeddings for the listing photos. The text embeddings have been generated using OpenAI’s text-embedding-3-small model, while the image embeddings are produced with OpenAI’s CLIP-ViT-B/32 model, both accessible on Hugging Face.

from datasets import load_dataset
import pandas as pd

# https://huggingface.co/datasets/MongoDB/airbnb_embeddings
# Make sure you have a Hugging Face token (HF_TOKEN) in your development environment
dataset = load_dataset("MongoDB/airbnb_embeddings")

# Convert the dataset to a pandas dataframe
dataset_df = pd.DataFrame(dataset['train'])

dataset_df.head(5)

To fully demonstrate LlamaIndex's capabilities and utilization, the 'text_embeddings' field must be removed from the original data set. This step enables the creation of a new embedding field tailored to the attributes specified by the LlamaIndex document configuration. 

The following code snippet removes the 'text_embeddings' attribute from every data point in the data set.

dataset_df = dataset_df.drop(columns=['text_embeddings'])

3. LlamaIndex LLM Configuration

The code snippet below is designed to configure the foundational and embedding models necessary for the RAG pipeline. 

Within this setup, the base model selected for text generation is the 'gpt-3.5-turbo' model from OpenAI, which is the default choice within the LlamaIndex library. While the LlamaIndex OpenAIEmbedding class typically defaults to the 'text-embedding-ada-002' model for retrieval and embedding, this tutorial switches to the 'text-embedding-3-small' model, configured here with an embedding dimension of 256.

The embedding and base models are scoped globally using the Settings module from LlamaIndex. This means that downstream processes of the RAG pipeline do not need to specify the models utilized and, by default, will use the globally specified models.

from llama_index.core.settings import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-small", dimensions=256)
llm = OpenAI()

Settings.llm = llm
Settings.embed_model = embed_model

4. Creating LlamaIndex Custom Documents and Nodes

Next, we’ll create custom documents and nodes, which are considered first-class citizens within the LlamaIndex ecosystem. Documents are data structures that reference an object created from a data source, allowing for the specification of metadata and the behavior of data when provided to LLMs for text generation and embedding.

The code snippet provided below creates a list of documents, with specific attributes extracted from each data point in the data set. Additionally, the code snippet demonstrates how the original types of some attributes in the data set are converted into appropriate types recognized by LlamaIndex.

import json
from llama_index.core import Document
from llama_index.core.schema import MetadataMode

# Convert the DataFrame to a JSON string representation
documents_json = dataset_df.to_json(orient='records')

# Load the JSON string into a Python list of dictionaries
documents_list = json.loads(documents_json)

llama_documents = []

for document in documents_list:

  # Value for metadata must be one of (str, int, float, None)
  document["amenities"] = json.dumps(document["amenities"])
  document["images"] = json.dumps(document["images"])
  document["host"] = json.dumps(document["host"])
  document["address"] = json.dumps(document["address"])
  document["availability"] = json.dumps(document["availability"])
  document["review_scores"] = json.dumps(document["review_scores"])
  document["reviews"] = json.dumps(document["reviews"])
  document["image_embeddings"] = json.dumps(document["image_embeddings"])


  # Create a Document object with the text and excluded metadata for llm and embedding models
  llama_document = Document(
      text=document["description"],
      metadata=document,
      excluded_llm_metadata_keys=["_id", "transit", "minimum_nights", "maximum_nights", "cancellation_policy", "last_scraped", "calendar_last_scraped", "first_review", "last_review", "security_deposit", "cleaning_fee", "guests_included", "host", "availability", "reviews", "image_embeddings"],
      excluded_embed_metadata_keys=["_id", "transit", "minimum_nights", "maximum_nights", "cancellation_policy", "last_scraped", "calendar_last_scraped", "first_review", "last_review", "security_deposit", "cleaning_fee", "guests_included", "host", "availability", "reviews", "image_embeddings"],
      metadata_template="{key}=>{value}",
      text_template="Metadata: {metadata_str}\n-----\nContent: {content}",
      )

  llama_documents.append(llama_document)

# Observing an example of what the LLM and Embedding model receive as input
print(
    "\nThe LLM sees this: \n",
    llama_documents[0].get_content(metadata_mode=MetadataMode.LLM),
)
print(
    "\nThe Embedding model sees this: \n",
    llama_documents[0].get_content(metadata_mode=MetadataMode.EMBED),
)

After creating the list of documents and specifying their metadata and LLM behavior, the next step is to create nodes from those documents. Nodes are the objects that are ingested into the MongoDB vector database; they enable the use of both vector data and operational data to conduct searches.

from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import MetadataMode

parser = SentenceSplitter(chunk_size=5000)
nodes = parser.get_nodes_from_documents(llama_documents)

for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode=MetadataMode.EMBED)
    )
    node.embedding = node_embedding

5. MongoDB Vector Database Connection and Setup

MongoDB acts as both an operational and a vector database for the RAG system. MongoDB Atlas specifically provides a database solution that efficiently stores, queries and retrieves vector embeddings.

Creating a database and collection within MongoDB is made simple with MongoDB Atlas.

  1. First, register for a MongoDB Atlas account. For existing users, sign into MongoDB Atlas.
  2. Follow the instructions. Select Atlas UI as the procedure to deploy your first cluster. 
  3. Create the database: `airbnb`.
  4. Within the database `airbnb`, create the collection `listings_reviews`. 
  5. Create a vector search index named vector_index for the ‘listings_reviews’ collection. This index enables the RAG application to retrieve records as additional context to supplement user queries via vector search. Below is the JSON definition of the data collection vector search index. 
{
      "fields": [
        {
          "numDimensions": 256,
          "path": "embedding",
          "similarity": "cosine",
          "type": "vector"
        }
      ]
    }

Below is an explanation of each vector search index JSON definition field.

  • fields: This is a list that specifies the fields to be indexed in the MongoDB collection and defines the characteristics of the index itself.
  • numDimensions: Within each field item, numDimensions specifies the number of dimensions of the vector data. In this case, it's set to 256. This number must match the dimensionality of the vector data stored in the field, and it is one of the output dimensions supported by OpenAI's text-embedding-3-small model.
  • path: The path field indicates the path to the data within the database documents to be indexed. Here, it’s set to embedding.
  • similarity: The similarity field defines the type of similarity distance metric used to compare vectors during the search process. Here, it's set to cosine, which measures the cosine of the angle between two vectors, effectively determining how similar or different the vectors are in their orientation in the vector space. Other supported similarity metrics are Euclidean distance and dot product.
  • type: This field specifies the data type the index will handle. In this case, it is set to vector, indicating that this index is specifically designed for handling and optimizing searches over vector data.
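
If you prefer to create the index from code rather than the Atlas UI, recent PyMongo versions expose Atlas Search index management directly on the collection object. The sketch below is a minimal example under that assumption; it reuses the same JSON definition as above and presumes the `collection` handle created later in this tutorial, plus a cluster tier and driver version that support programmatic vector search index creation.

from pymongo.operations import SearchIndexModel

# Assumes `collection` is the PyMongo collection handle created in the connection step below,
# and that the Atlas cluster and PyMongo version support programmatic vector search indexes.
vector_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "numDimensions": 256,
                "similarity": "cosine",
            }
        ]
    },
    name="vector_index",
    type="vectorSearch",
)
collection.create_search_index(model=vector_index_model)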

By the end of this step, you should have a database with one collection and a defined vector search index. The final step of this section is to obtain the connection string, or uniform resource identifier (URI), of the created Atlas cluster to establish a connection between the database and the current development environment. 

Don't forget to whitelist the IP address of the Python host, or 0.0.0.0/0 for any IP, when creating proofs of concept.

Follow MongoDB’s steps to get the connection string from the Atlas UI. After setting up the database and obtaining the Atlas cluster connection URI, securely store the URI within your development environment.

This guide uses Google Colab, which offers a feature for securely storing environment secrets. These secrets can then be accessed within the development environment. Specifically, the line mongo_uri = userdata.get('MONGO_URI') retrieves the URI from the secure storage.

The following steps utilize PyMongo to create a connection to the cluster and obtain reference objects to both the database and collection.

import pymongo
from google.colab import userdata

def get_mongo_client(mongo_uri):
  """Establish connection to the MongoDB."""
  try:
    client = pymongo.MongoClient(mongo_uri, appname="devrel.content.python")
    print("Connection to MongoDB successful")
    return client
  except pymongo.errors.ConnectionFailure as e:
    print(f"Connection failed: {e}")
    return None

mongo_uri = userdata.get('MONGO_URI')
if not mongo_uri:
  print("MONGO_URI not set in environment variables")

mongo_client = get_mongo_client(mongo_uri)

DB_NAME="airbnb"
COLLECTION_NAME="listings_reviews"

db = mongo_client[DB_NAME]
collection = db[COLLECTION_NAME]

The next step deletes any existing records in the collection to ensure the data ingestion destination is empty.

# Delete any existing records in the collection
collection.delete_many({})

6. Data Ingestion

Using LlamaIndex, vector store initialization and data ingestion are trivial processes that can be accomplished in just two lines of code. The snippet below initializes a MongoDB Atlas vector store object via the LlamaIndex constructor MongoDBAtlasVectorSearch. Using the 'add()' method of the vector store instance, the nodes are directly ingested into the MongoDB database.

It's important to note that in this step, we reference the name of the vector search index that was created earlier via the MongoDB Atlas interface. For this specific use case, the index name is "vector_index".

Vector Store Initialization


from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch

vector_store = MongoDBAtlasVectorSearch(mongo_client, db_name=DB_NAME, collection_name=COLLECTION_NAME, index_name="vector_index")
vector_store.add(nodes)
  • Create an instance of a MongoDB Atlas Vector Store: vector_store using the MongoDBAtlasVectorSearch constructor.
  • db_name="airbnb": This specifies the name of the database within MongoDB Atlas where the documents, along with vector embeddings, are stored.
  • collection_name="listings_reviews": Specifies the name of the collection within the database where the documents are stored.
  • index_name="vector_index": This is a reference to the name of the vector search index created for the MongoDB collection.

Data Ingestion

  • vector_store.add(nodes): Ingests each node into the MongoDB vector database where each node represents a document entry.

7. Querying the Index With User Queries

To utilize the vector store capabilities with LlamaIndex, an index is initialized from the MongoDB vector store, as demonstrated in the code snippet below.

from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_vector_store(vector_store)

The next step involves creating a LlamaIndex query engine. The query engine makes it possible to use natural language to retrieve relevant, contextually appropriate information from a vast index of data. LlamaIndex's as_query_engine method abstracts away the complexity of writing the implementation code to process queries appropriately for extracting information from a data source.

For our use case, the query engine satisfies the requirement of building a question-and-answer application, although LlamaIndex does provide the ability to construct a chat-like application with the Chat Engine functionality.

import pprint
from llama_index.core.response.notebook_utils import display_response

query_engine = index.as_query_engine(similarity_top_k=3)

query = "I want to stay in a place that's warm and friendly, and not too far from restaurants, can you recommend a place? Include a reason as to why you've chosen your selection"

response = query_engine.query(query)
display_response(response)
pprint.pprint(response.source_nodes)

Initialization of Query Engine

  • The process starts with initializing a query engine from an existing index by calling index.as_query_engine(similarity_top_k=3).
  • This prepares the engine to sift through the indexed data to identify the top k (3 in this case) entries that are most similar to the content of the query.
  • For this scenario, the query posed is: “I want to stay in a place that’s warm and friendly and not too far from restaurants, can you recommend a place? Include a reason as to why you’ve chosen your selection.”
  • The query engine processes the input query to find and retrieve relevant responses.

Observing the Source Nodes

The LlamaIndex response object from the query engine provides additional insight into the data points contributing to the generated answer.
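
As a quick illustration of how to inspect that object, the short sketch below loops over the returned source nodes and prints each node's similarity score, a couple of metadata fields and a preview of its text. The metadata keys 'name' and 'property_type' are assumptions based on the Airbnb data set and may differ in your documents.

# Inspect which retrieved nodes contributed to the generated answer
for node_with_score in response.source_nodes:
    node = node_with_score.node
    print("score:", node_with_score.score)
    print("listing name:", node.metadata.get("name"))
    print("property type:", node.metadata.get("property_type"))
    print(node.get_content()[:200])
    print("---")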

More on LLMs: How to Develop Large Language Model Applications

Understanding How to Build a RAG System

In this tutorial, we've developed a straightforward query system that leverages the RAG architectural design pattern for LLM applications, utilizing LlamaIndex as the LLM framework and MongoDB as the vector database. We also demonstrated how to customize metadata for consumption by databases and LLMs using LlamaIndex's primary data structures, documents and nodes. This enables more control over metadata when building LLM applications and data ingestion processes. Additionally, it shows how to use MongoDB as a vector database and carry out data ingestion. 

By focusing on a practical use case, a chatbot-like system for Airbnb listings, this guide walks you through the step-by-step process of implementing a RAG pipeline and highlights the distinct advantages of RAG over traditional fine-tuning methods. Through the POLM AI stack, combining Python, OpenAI models, LlamaIndex for orchestration and MongoDB Atlas as the dual-purpose database, developers are provided with a comprehensive toolkit to innovate and deploy generative AI applications efficiently.

Article originally posted on mongodb google news. Visit mongodb google news
