Presentation: How Green is Green: LLMs to Understand Climate Disclosure at Scale

Leo Browning

Article originally posted on InfoQ.

Transcript

Browning: We’re going to be asking the question, how green is green? We’re going to be answering it using a RAG-based system, and I’ll take you through the journey of doing that over the last year at an early-stage startup. The “how green is green” question is being asked of financial instruments; where that’s important to the problem context, I’ll give you a little bit of background. Mostly we’re motivating the kind of problem that you see when you’re starting up in a deep domain space and looking to apply these kinds of tools to it: where you start, where you make your early gains, where the challenges are, and where you move to from there. There are a couple of things that I’d like you to walk away from this talk with.

The first is, I’d like you to understand RAG system development from zero to, in our case, first users, with a reasonably small team: myself, growing to two other engineers in that time. I’d also like you to understand the constraints placed on developing in that environment, especially when you’re dealing with large language models, which are often associated with needing large amounts of data, large engineering teams, or large compute budgets to get to something valuable. Lastly, I hope you get a little bit of a taste of what it is to work in a climate and AI startup, and understand why that’s a great place to be.

In order to do that, I’m going to give you a little bit of problem context. Like I said, this is in the area of climate finance, so putting money towards climate initiatives at reasonably large scale. This isn’t running a local raffle to plant trees. This is putting nation scale finance towards global scale problems. We’re going to talk about the initial technical problem and the technical solution that we put towards it. Then I’m going to walk you through how we built that up from initial value propositions.

I’ll cover some of the challenges, the ways we overcame them, and how we built toward a more scalable solution. We’re specifically going to be focusing on the human-in-the-loop and data-flywheel components of starting up in these spaces. Then I’m going to talk a little bit about the work that I’m doing currently, or looking to do in the next little while, to continue scaling up, with the aim that these steps, whether you’re in climate finance, a startup, or a large company, do mirror some of the challenges, and hopefully some of the successes, you can take away from this kind of development using these tools.

Really quickly on the problem context, I want to let you know who I am, so you know who’s speaking and where I’m coming from. I’ll talk a little bit about the company, again to motivate the kind of resource constraints that we’re working within, and why we’re choosing to tackle the problem we are. Then I’m going to talk a little bit about what our solution has been, and we’ll move into the technical meat and potatoes of this. My name is Leo. I started off in physics, working in complex systems and nanotechnology, and quickly discovered that the code, the models, and the simulation were more to my taste than the physics, and I have been working in AI and machine learning engineering ever since. I was the first engineering hire at ClimateAligned. ClimateAligned is an early-stage startup.

What that means is we have a reasonably small team, which at the moment sits at 10 members, 3 of whom are technical. That’s the scale of the work that’s possible with this team. We have some funding, which means we’ve got a year to 18 months to work on this problem. In order to deliver the maximum value and have the highest chance of success in that, which I think is probably a good scope for any first project, we are looking to have a real laser focus on proving initial value, while not hamstringing ourselves in being able to demonstrate that what we are building can scale to hundreds, sometimes thousands, of times the amount of data that we need to be working with and how fast it needs to be produced.

A big reason for that is that the market that we’re looking at and the customers that we are engaging with are some of the biggest financial players in the world. They manage an eye-watering amount of the world’s money. That money needs to go to a pretty significant challenge, which most of the world and many of the corporate big players have decided is big enough that they are putting significant targets towards it in monetary terms, and they are big. That’s GDP-scale money that needs to be put towards these very significant challenges. One of the big appetites growing out of that is for credible information on which investments are going to be sound monetary and climate investments, so that you can put your money in the right place.

Problem Context

At the moment, this isn’t a new problem, although these targets, the 2050 targets, are recent, in the last 10 or 15 years, especially at this level of commitment. The problem itself isn’t, which is deciding where to put your money for the maximum impact. At the moment, the industry does that by applying a lot of expert internal hours to combing through the information available for a financial instrument or a company, determining if it’s a sound investment, and making a decision on that. That’s expensive, and as I said, requires a significant amount of expert time. You can outsource that, but at the end of the day, whoever you’re paying is doing that same thing with people. This is hopefully painting a picture that’s familiar to some of you here, where AI is a useful tool for this problem.

In our case, we’re specifically emphasizing the tool component of AI. We’re not looking to have it make decisions, partly because I don’t think the tech is quite there yet, but also because these decisions often involve gray areas where there isn’t quite a right answer, so providing the correct information and a tool capable of aiding decision making is the way that we’re targeting it. That is hopefully familiar to those who have had some exposure to large language models and their applications. Probably the most common is ChatGPT or some chat-like interface. Our initial prototypes were chat-type interfaces, because that’s a useful way to engage with something that people understand.

We’re also dealing with more structured or more system-scale data, rather than a human interacting with it: consuming the kind of documentation that companies put out around their financial and climate statements. We’re looking at both the human-question text input and, I’m going to use the term structured very loosely here because these companies produce some wild documentation, a more structured form of data input. Our solution is to use these LLMs to process at scale. We use a document and data management ingestion system, which we run in-house. I’m not going to focus on that hugely at this stage, and probably won’t touch on it in this talk.

More or less, we’re consuming documents from the internet and storing them in a structured dataset within our system. Because these are financial decisions that need to be made, accuracy is quite important, and trust in that accuracy is quite important, both for our users and for ourselves from a liability perspective. I will be emphasizing the accuracy and accountability side of things quite a bit, which, if I’m honest, should be an emphasis in any work with these systems, even in chat applications. They are very powerful, but their auditability is often low out of the box, and it is one of the areas where RAG, which is one of the things that we focused on, really shines.

Technical Overview

In the technical overview, we’re going to talk a little bit about how this actually works. I’m going to give you a system overview and a brief introduction to RAG. The simple version of the core technical problem that we’re trying to solve is to answer questions that need answers in order for somebody to make decisions on a financial instrument. The answers to those questions are in documents, and they usually come as a large selection of questions that need to be presented in a structured format. You can see a screenshot of the product here: a structured format of a number of hierarchical questions.

All these questions, if you look into them, have answers and, very importantly, have sources, which is a critical component of the RAG process. It’s literally what RAG is, but it’s also one of the reasons we picked it; it’s quite critical to its value proposition. I emphasized it earlier, but the documents that we’re consuming have no consistent structure. They vary wildly across companies, and even within companies year-to-year; they get a different consultant to make them. There’s significant variation in the source data that needs to be accounted for, and which LLMs, at least ideally, are well suited to processing.

Challenges. Accuracy: how do I ensure the answers are correct? How do I know whether they are correct at the time that we’re serving them to a user? Then, as I’ve said, auditability and transparency, which is covered by default in the RAG system. Then also flexibility: how do I ensure that if a user has a different set of these questions, they can get answers in a timely and reliable manner, especially given some of the ways we initially approached things with more human involvement for correctness? I’ll go into some detail there. Then, as I said, ensuring things scale as we look to ask many more questions across the landscape of all documents and all companies. Off the top of my head, there are about 60,000 companies producing data in this space at a reasonable rate, a couple of times a year. That should give you a sense of the scale of the problem.

As I said, for the first component, I’m not going to talk too much about document consumption other than to reference the quality of it. We’re going to focus primarily on the question asking, as well as the things that make that question asking more reliable and more auditable. We’re going to focus on the evolution of our search system and the evolution of how we have done the question answering, then look at the involvement of humans and the need for review when you’re dealing with these LLM-based systems. Now a quick jargon drop: I’m going to reference a couple of technologies. They aren’t rocket science, but they give you a little bit of a picture of what we’re working with, and you can slot them away if they’re unfamiliar to you. I will be conceptually explaining everything; it’s not a big deal. It’s a RAG-based system.

The search, which is the differentiating aspect of RAG versus just asking a question of an LLM, is a hybrid text-based search. I’ll go into the various aspects of the hybridization. We use Postgres for all of our data storage. Especially to start with, we have used OpenAI for all of our large language model usage, although we will talk about moving away from that. We’re running everything on a Python stack, because it tends to be the lingua franca for this workspace.

What is RAG? RAG stands for Retrieval Augmented Generation, where the retrieval component is a search across some sources of information that you know, or hope, actually contain the correct information. In this case, the LLM is used to process and synthesize that information, but is not, in a single call, being used directly to produce an answer from its own internal knowledge. You are using a search system to deliver some sources that get grouped into the initial prompt along with whatever question you might be asking. That provides, in theory, the correct information to then ask the question of an LLM and get an answer that is better, more up to date, or simply has verifiable sources, compared with what you would get from asking the LLM directly.
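To make that flow concrete, here is a minimal sketch in Python. It assumes a hypothetical search_chunks retrieval helper and uses the standard OpenAI chat completions client; it illustrates the pattern rather than our production code.

```python
# Minimal sketch of the RAG flow described above (not ClimateAligned's actual code).
# Assumes a hypothetical search_chunks() helper that returns the top-k source
# chunks for a query; the OpenAI call is the standard chat completions API.
from openai import OpenAI

client = OpenAI()

def answer_with_rag(question: str, search_chunks) -> str:
    # Retrieval: pull the source chunks most likely to contain the answer.
    sources = search_chunks(question, k=5)
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(sources))

    # Generation: the LLM answers only from the retrieved sources,
    # citing them so the answer stays auditable.
    prompt = (
        "Answer the question using only the numbered sources below. "
        "Cite the source numbers you used.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```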

Project Evolution

I’m going to talk a little bit about that evolution. The goal is to talk about my experience, and hopefully that parallels the experience you would have starting a reasonably substantial project within your own organization, or as a side project, or elsewhere, moving beyond a first introduction to RAG or to LLMs. One focus we have is that it has to be AI driven, or some form of automated system, because, as I mentioned, this is already being done slowly by humans. That’s not a solution for us, at least not a long-term one. Because accuracy and competence are very important, I am going to talk about how we have heavily used human-in-the-loop systems, both to ensure correctness in the product and to build datasets that allow us to then build automated systems to speed that process up.

Then we’re going to talk a little bit about search and how that applies to a larger-scale problem. Let’s talk about humans. I think in AI talks in general, and especially talks about LLMs, humans are left out a little bit. Human in the loop is a very powerful way to think about these systems, especially when you’re starting out. This is assuming that you don’t have a large number of correct examples beforehand. In that case, you might reach for a human, who has, out of the box, very good accuracy. Our analyst, Prashant, who has been living and breathing this kind of analysis work for the last 5 years, is about 99% correct, and part of why it’s not higher is that these are tough problems and sometimes gray areas, but that’s very high. We as humans tend to have a high degree of trust in other humans. A user or a customer is always more likely to trust Prashant’s answer than a model’s, just because of that initial trust. He’s highly flexible. He can abstract to other problem spaces or new types of data, but he’s very slow and he’s very expensive.

Even compared with the GPUs that LLMs run on, a human is a very expensive resource. Human in the loop thus has a great initial benefit when you’re starting out with a problem like this, because it lets you build your own internal trust. If you can find a way to demonstrate to your users that the human in the loop gives you that accuracy, and that you’re not going to rely on it forever, it builds that trust with your users as well. Eventually it’s too slow to keep up, and that’s where building trust while scaling, even if the scaling is still down the road, as in our case, where we have made significant improvements and need to make more in the future, is quite a valid approach.

We extended that RAG model to include a human in the loop. In the case of our system, we’re asking these questions to determine the greenness of a document type. We might be asking 100 or so questions about any given financial entity, and you can pick any big nation state or giant corporate; that’s what we’re talking about here. The model itself, the core question-answering LLM with some prompt tuning and a reasonable initial search approximation, and we’ll dive more into the search in a little bit, is about 85% accurate for our system after some effort. That’s pretty good for an LLM. I think Jay mentioned that for a substantive problem, RAG can sometimes be 3 or 6 out of 10 on accuracy, so 30% to 60% is a normal range you would expect, even with a bit of prompt tuning. We’re already pretty happy with that.

A useful thing to think about with your first problem is: where is your happy level of accuracy? You could start out with something quite low, especially in a dense domain area where the models are a little bit less generally trained. Because we put Prashant, our analyst, our expert, insert whatever your expert knowledge source is, between the model-produced answers and our production data, we now have a system, a product, that we can confidently say is much more accurate than the underlying model: 99% as opposed to 85%. In this particular case, even with Prashant looking at every single answer, the time to assess a single company goes from about 2 hours to 20 minutes, because we’re already putting the sources in front of him. We’re giving him proposed answers, and most of the time they’re correct. He’s only having to make some judgment call, some additional assessment, on top of that.

I think that intermediate step, between totally manual and fully automated, is a very powerful one that is often overlooked when you’re thinking about how you can get initial value. That’s already a loose order of magnitude of speed-up, while still having everything human checked. It was certainly very powerful for us. It enabled us to get productionized data out in a full platform far before we were able to produce exceptionally high-quality models internally. The other benefit of having this initial human in the loop is that you build up a fair dataset of correct and incorrect examples produced by your core RAG system. A correct or labeled dataset lends itself to what I often, tongue in cheek, call Trad NLP or Trad ML or Trad AI; people call it classical AI. You need a little bit of training data. When you have that training data, you can then pick from a wide variety of capabilities and specificities of model types, which you can fit to a purpose, and that allows you to not use something massive and general purpose for a specific task.

However, that’s only unlocked with a certain amount of data, and that varies from task to task. Sometimes you might need 50 examples for something like few-shot learning in the NLP space. Or if you were trying to train a model from scratch, or an adapter, you could be looking at 5,000, 10,000, or 100,000 examples. We found that the 500 mark is super useful for having validation data and exploring the few-shot space. We’ve just recently hit the 10,000 mark, where we’re looking at fine-tuning models. That’s a really nice data milestone when you’re looking at the flywheel value that comes out of human in the loop.

Once we have this labeled data and we’re looking at what other kinds of models we can add to the system, note that we’re not actually thinking at this stage about making changes to the core LLM. At this moment, the core LLM that’s being asked the significant question, with the sources, is still GPT-4, and I’ll talk a little bit more about the pros and cons there. We were able to add some traditional AI to our search methodology, and I’ll go through that in greater detail, but also some traditional AI to accelerate, yet again, the process of our analyst looking through the answers that need review. Because we have this 10,000-example dataset, that’s enough to build a simple classifier that’s able to flag the things that do need review, so that Prashant can completely ignore those that do not. We see about another order of magnitude increase in throughput and in our ability to get the data into production, while still getting some extra labeled data out of the system.
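As an illustration of what that review-flagging step could look like, here is a minimal sketch using scikit-learn. The feature choice (embeddings of the question, proposed answer, and sources) and the logistic regression model are assumptions for the example, not a description of our production classifier.

```python
# Illustrative sketch of a "needs review" classifier trained on human-labeled
# RAG outputs; features and model choice are assumptions, not the production setup.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# X: one feature vector per answered question (e.g. embeddings of the question,
#    proposed answer, and retrieved sources concatenated together).
# y: 1 if the analyst corrected the answer, 0 if they accepted it as-is.
def train_review_flagger(X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    clf.fit(X_train, y_train)
    # Report precision/recall on the held-out split so you know how many
    # wrong answers would slip past the analyst.
    print(classification_report(y_test, clf.predict(X_test)))
    return clf

# At serving time, only answers the classifier flags go to the analyst:
# needs_review = clf.predict_proba(features)[0, 1] > 0.5
```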

Again, through this whole arc our system accuracy has stayed at 99%. The underlying model, the core LLM, the RAG component, is still at 85%. That’s not what’s been changed here. We’ve added other components to the system around RAG to increase the throughput and decrease the latency of the system, and thus enable us to move faster and get more into production at scale.

I’ve said at least twice that we would dig a little bit into the hybrid search, or the search aspect of things. We started off, as I think most people do and should, trying out RAG with a vector database, or a similarity search. Essentially, you look at the vector similarity of your query and your sources, and you collect the things which are most similar. We found that, especially because our system is designed to work on a very specific set of domain data, and we know about that domain, a similarity search combined with some heuristics, rules based on our own understanding of what questions might be asked of what kind of data, to narrow down the search space, was very tractable initially.

We added a keyword search using BM25, and then did a little bit of reranking using RRF, which is Reciprocal Rank Fusion. I would recommend both of those as good first passes for keyword search. Combining that ranking with a similarity search, we were able to see significant increases in the quality of the system. This was the state of our search that brought us to that 85% accuracy, and that was most of the improvement, in fact, from the out-of-the-box 30% to 60% that you get. We saw most of our improvement by just tuning up our search, adding in a little bit of domain knowledge, and combining keyword and similarity search. That provided a large amount of initial value, and I would encourage it as a first pass. The first thing off the shelf that I would make better is your search, especially making it more specific to your problem space.
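As a rough sketch of that hybrid approach, the following combines a BM25 keyword ranking with a vector-similarity ranking using Reciprocal Rank Fusion. The rank_bm25 package and the embed function are illustrative choices, not necessarily what we run.

```python
# Sketch of hybrid retrieval: BM25 keyword ranking fused with vector similarity
# via Reciprocal Rank Fusion (RRF). rank_bm25 and embed() are illustrative choices.
import numpy as np
from rank_bm25 import BM25Okapi

def rrf_fuse(rankings, k=60):
    # Reciprocal Rank Fusion: score(doc) = sum over rankings of 1 / (k + rank).
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query, chunks, chunk_embeddings, embed, top_k=5):
    # Keyword ranking with BM25 over whitespace-tokenized chunks.
    bm25 = BM25Okapi([c.split() for c in chunks])
    keyword_rank = list(np.argsort(bm25.get_scores(query.split()))[::-1])

    # Vector-similarity ranking over precomputed chunk embeddings (cosine similarity).
    q = embed(query)
    sims = chunk_embeddings @ q / (
        np.linalg.norm(chunk_embeddings, axis=1) * np.linalg.norm(q)
    )
    vector_rank = list(np.argsort(sims)[::-1])

    # Fuse both rankings and return the top chunks.
    fused = rrf_fuse([keyword_rank, vector_rank])
    return [chunks[i] for i in fused[:top_k]]
```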

Challenges and Future

What are we working on now? What are the challenges in front of us? What do we want to do in the future? Again, we’re at about nine months to a year of technical work on this problem. We have a first user on the platform. We’re staring down the barrel, hopefully, of lots more users coming in and lots more use cases needing to be sold. Through that, we need to make sure that we can scale. There are two bottlenecks at the moment, or rather, maybe two areas for improvement rather than bottlenecks; we like to keep it positive. The search was the biggest improvement that we saw to the underlying model accuracy, which underpins how hard every other problem is. I’m going to talk a little bit about how we can include document context and things like topic or semantic-based tagging in that search system.

Then I’ll finish by talking a little bit about fine-tuning and in-house or open-source models, and how we see that taking us from this initial stage of more general, more expensive, API-based models to more task-specific models as part of the growing constellation of components that builds up the system underpinning our product. Those are the things that we want to do. I’m going to hammer home again that we always need to be considering for ourselves how we can ensure that the answers we’re putting into production are correct. We don’t have Prashant, our analyst, looking at every single one of them anymore.

That’s always a consideration: whether the model that is now flagging things for review is still fit for purpose given, for instance, a new customer’s use case, a new entity’s documentation that is very different from others, or a new type of financial instrument. We also need to consider customers’ own desires for different types of questions to be asked. You can imagine the important information you might get from a document could be anything from important numbers, to summaries of broad stances on things, to comparison with other entities in the same industry space, say if you wanted to compare all car manufacturers.

Those are all slightly different question cases, but they also require a bit of scale. You can’t compare all the different car manufacturers until you have all the different car manufacturers, and you can repeat that for every industry you can think of. These all require a larger scope of work, which is a little bit of a pro and a con: because we’re such a small team, we’ve been working on fast improvements up to this point. I think that’s very important at the early stages of a company, or of a project if you’re in a larger company. We’re now looking at more serious chunks of work to try and knock off that last little bit of accuracy, really reduce the latency, and have a faster time to product for new data.

We’re going to talk a little bit about improving search and document context, then I’ll end on improving core LLM performance. The first two bullet points are the first two sections there, what we talked about earlier. They were the main improvement, as I said, that took us to that 85% accuracy. Something that I’ve been working on very recently has been adding topic-based ranking, which is tractable both because we have some training data, so we now have enough correct examples that there’s something to work with, and because we have some expert knowledge. When you are looking to provide answers for a certain question type, the most general case might be that you ask the question and the sources that you are searching across are something like the internet. It’s very hard to inject domain knowledge into such a large search set.

However, if you were able to, for instance, know that you’re always going to be asking questions about climate, you might narrow down your search set. We will be looking to do that in an automatic way, using topic detection. I’m going to talk a little bit about some initial results that we have there, and some of the initial challenges.
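As a sketch of what that automatic narrowing could look like, the following filters candidate chunks by predicted topic before the main search runs. The predict_topics classifier and the topic names are hypothetical illustrations.

```python
# Sketch of topic-based narrowing of the retrieval set: only chunks whose
# predicted topics overlap with the topics relevant to the question are searched.
# predict_topics() and the topic names are hypothetical illustrations.

def narrow_by_topic(question_topics, chunks, predict_topics):
    narrowed = []
    for chunk in chunks:
        chunk_topics = predict_topics(chunk)  # e.g. a classifier run per chunk
        if question_topics & set(chunk_topics):
            narrowed.append(chunk)
    # Fall back to the full set if the filter is too aggressive.
    return narrowed or chunks

# Usage: search only chunks tagged with climate-relevant topics for this question.
# candidates = narrow_by_topic({"green_energy", "waste_recycling"}, chunks, predict_topics)
# top_sources = hybrid_search(question, candidates, ...)
```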

This is a little bit of a dense graph, but mostly it’s there to show the shape of the data rather than any specific data point. What we’re seeing is two documents for a particular financial entity. You can see them pretty clearly differentiated; I’ve just put them up on one slide so you can see the difference in the shape of the data. The white boxes represent the tag targets that we have from our training data, so for these documents we know how the different chunks of text should be tagged. There are on the order of 500 chunks across these two documents, which is a reasonably arbitrary number; you can slice and dice however you like. You can see the arc of these topics over the course of the document. In the first document, when we’ve labeled things, there’s a clear arc of important topic tags in the body section, plus a little bit at the start that’s obscured by my labels. We’re going to focus on the body section, where there’s a clear arc across a number of different topics.

These might be topics like investment in waste recycling, investment in green energy sources, or investment in more efficient buildings. They’re referenced in sequence through the main body. Then, the rest of the document is filled with what I’m calling admin tags. They’re useful for knowing in what way a text chunk is not useful, but we don’t actually care about them. They might be things like headers and footers, or filler information on methodology. The second document, at least in terms of the targets, doesn’t look very useful. That’s because this second document talks about assessing the company against a different way of thinking about climate analysis, and so isn’t actually useful for the questions that we’re trying to ask in this particular example.

However, you do see the model predictions, noted by these red X’s, actually labeling some significant topic tags up in this top right corner. The reason for that is that, at the text chunk size that we’re talking about, which might be about half a page, a lot of the same language is being used. This latter document considers the case where they talk about how one might assess a company. Over the course of many pages, or over the course of a whole section of the document, it might be clear that that’s the case, but in the individual text chunk, the language being used is the same as the language used elsewhere when they’re talking about how the company actually is assessed. This is quite a common problem with LLMs broadly speaking, and with RAG systems: you have to decide the size of the chunk of information that you give to them.

With some of the latest models you can put in whole documents if you would like, but their ability to determine specific facts within that context is quite reduced. For the most part, at the moment, you’re always trying to chop things up a little bit and serve the right pieces through. In this particular case we’re looking at about half a page, and the problem is that when you serve that half a page, the model doesn’t know where in the document it comes from unless you supply that in addition. While we were working on this topic detection, we realized that the topic detection itself, and most likely the underlying LLM that would be using these source chunks, would benefit greatly from having longer-range context added to the chunks being passed through. That’s something that we’re actively working on, and a lot of it comes down to correctly parsing and consuming messy PDFs, because it’s a problem that’s been solved many times in many different ways, and the likelihood of it being solved in your way is very low.
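A minimal sketch of what adding that longer-range context might look like is below; the metadata fields (document title, section heading, page) are assumptions about what a PDF parser could supply.

```python
# Sketch of attaching longer-range document context to each chunk before it is
# embedded, classified, or passed to the LLM. The metadata fields are illustrative
# assumptions about what the document parser yields.

def contextualize_chunk(chunk_text, doc_title, section_heading, page_number):
    # Prepend where the chunk sits in the document so downstream models can tell
    # "this section describes an assessment methodology" apart from "this section
    # reports the assessment itself".
    header = (
        f"Document: {doc_title}\n"
        f"Section: {section_heading}\n"
        f"Page: {page_number}\n---\n"
    )
    return header + chunk_text
```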

Then, lastly, the next step that we’re looking to make, and this is mostly on the scale and performance side, both in time and in money, is that we want to move away from using GPT-4 for everything. There was a talk that used a great phrase: “GPT-4 is king, but you don’t get the king to do the dishes”. At the moment, we’ve very much got the king up to his elbows doing everything, and that’s ok. When you’re proving initial value, I think that’s a very valid and important way to start. You don’t want to over-engineer yourself into building a lot of in-house custom models right out of the gate; you don’t know if your problem is tractable, and it does require more engineering effort to bring that in. Depending on the scale of the engineering team and the expertise you can bring to this, that could be something you want to consider.

One thing we are moving towards, on this theme of topics and being more specific about how we handle each source, is that we’re looking to split, and potentially fine-tune, the kinds of models we’re using on a topic or document-type basis. We think there’s significant ground to be gained from using different models for different source types. This isn’t our case in particular, but if you were using RAG in a more general context, you might use a different model, fine-tuned or otherwise, to do question answering on web pages versus on code. I think code is probably the best-known example here in terms of fine-tuning models to work on code. Yes, code is still technically natural language, and a general model can operate on it, but fine-tuning on a particular language or domain tends to provide great value, and that’s something that we’re looking at next.
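As a sketch of that kind of routing, you might keep a simple mapping from source type to model and fall back to the general-purpose model otherwise. The source-type names and model identifiers here are hypothetical, not a statement of what we run in production.

```python
# Sketch of routing question answering to different models per source/document type.
# The source-type keys and model names are hypothetical illustrations.

MODEL_BY_SOURCE_TYPE = {
    "annual_report": "gpt-4",              # general-purpose model for messy narrative PDFs
    "green_bond_framework": "ft-climate",  # hypothetical fine-tuned domain model
    "web_page": "gpt-3.5-turbo",           # cheaper model for simpler sources
}

def pick_model(source_type: str) -> str:
    # Fall back to the general-purpose model for unseen source types.
    return MODEL_BY_SOURCE_TYPE.get(source_type, "gpt-4")
```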

ChatGPT and the GPT-4 models are the best in class off the shelf. If people would like to challenge that, I would welcome it, because I would love to hear what the other best-in-class off-the-shelf option is; that would be fantastic. But they’re wrapped behind a pretty slow and unreliable API. If you haven’t worked with it yet, I would encourage you to try it out. If you’re used to APIs in any other context, these are the slowest and most unreliable. That is not to denigrate OpenAI, because the models behind them are enormous, and it’s a miracle that they function at all at the scale that they do, but they are that. When you’re thinking about them just as a service, there’s a risk there, but because of their flexibility off the shelf, they’re a great initial tradeoff. I would recommend that highly.

Then, as I said, open models give you high control. We had a talk about all of the value that you get from open models, along with the engineering setup cost, which I think is a worthwhile exchange once you’re past that initial stage of your project. We are now looking at expanding our data across a number of dimensions: covering financial reporting across time, so historical information on companies and financial instruments; across different kinds of financial instruments, of which there are many, and I couldn’t even list them all because I’m not a financial expert; as well as across industries and applications of that money towards climate goals.

Once you have that large scale of data, you start to unlock things like comparison, the ability to understand differences between the data that your users are interested in, and then, hopefully, the ability for them to do new work, and for you to add new value once you see what those new workflows are. It’s important to put the initial product in front of users so that you can see what those workflows are without having to guess too much. We’re at the stage where we now see customers use the product and reach for certain things like this, although we guessed they would use others that, it seems, they don’t want to.

Not over-engineering that too much is quite important so that you end up with a system where you’re investing the engineering effort that you need to when you know that it’s worthwhile, especially when you’re working with a people constrained development cycle, which I hope mirrors a project team as much as it does a startup. Although sometimes the constraints are a little sharper or more relaxed, depending on your institutional context.

Questions and Answers

Participant 1: There’s a bit of a paradox in using LLMs in financing climate initiatives. Can you elaborate a bit on that? Do you also assess your own CO2 emissions?

Browning: There is always a paradox in using the master’s tools to dismantle the master’s house. That is a core problem, just a core constraint, I think, in using tech to tackle any of these problems. We don’t assess our CO2 on a regular basis, although we did do some back of the napkin calculations in terms of the financial mistakes that one would make investing $10 trillion per annum, which is what’s required to achieve these targets at a very high level.

The mistakes one might make in terms of making bad decisions, versus our very modest GPU costs, at least at this stage, and of course, those numbers don’t stack up, but it is something of primary concern. It’s one of the reasons why I think those open models, if you’re assuming, which we are, that AI is a powerful tool to do this job better and faster, then looking towards smaller models that can be run more efficiently and more effectively is a very practical solution towards it. I think at this stage, that’s probably the only practical solution, at least in this case, because you are looking to answer questions on complex data. At the moment, LLMs, that’s an unlocker or enabler. I think that we’ll see because of the ability to generate data from LLMs, the ability to move to much lighter weight systems in the future.

Participant 2: It seems like a good chunk of the work that you’re doing isn’t actually climate specific, in terms of, you’re figuring out how to parse documents for certain information. I’m curious, A, why you’ve limited yourself to climate for this, and B, how much work you think it would take to transition to a different focus within a company’s finances?

Browning: Especially in light of how the aim of this talk was to think about how you might tackle these kinds of problems in a larger institutional context, you can substitute anything for climate in your question, which I will reiterate: this is a general problem. Why limit yourself to climate if you’re trying to solve a general problem? And the flip side: how do you move out of your specific problem? In a startup’s case, what we’re looking for, and perhaps in any project where you don’t have unlimited money initially, is a small problem that easily scales to a big solution. For climate, the amount of money involved makes it a really strong value proposition for how big it can grow, while, in terms of the money we’re able to address, the scope of the technical problem isn’t a huge limiting factor. In terms of moving from specific to general, that would not be something we’d want to do, because a lot of people are solving RAG generally.

A lot of the value that we have as a smaller team is not that we’re likely to crack the general problem much better than those teams; in fact, that would be very unlikely. Rather, it’s that we are able to inject, sometimes quite literally, human expertise, and also domain-driven development, into those thorny problems that, in my view, will be very tough to solve generally, at least with the state of things at the moment. Every so often, I just try to slam whole documents through whatever the latest and greatest model is and see how it works, or use somebody’s off-the-shelf RAG. While impressive in that they can address many problems, they always come up short of even our initial approaches: a very tightly focused technical solution around which we’ve encoded a lot of very domain-specific knowledge.

In answer to you, as a summary: I think tackling a very specific slice of a general problem is very tractable, unless you believe the general solution is right around the corner and is going to take everything off the board, which I don’t. I used to worry about it, but as soon as I started working with these things, I realized both how magic and how rubbish they are. I wouldn’t be too worried about the general solution wiping you off the board, although, if the problem you have is general, then I think there are some pretty great options around for pulling in a service, or rolling it yourself if you have enough resource.

Participant 3: You said you’re not going to provide a generic SaaS solution with what you did. Do you plan to open source the project so that it could be reused by other businesses, so they can use your approach to accelerate their time to business?

Browning: On the scale of a startup’s lifetime, given that our expertise is one of our significant moats, and there are certainly other people with that domain expertise who are not applying it to this problem, either because they are not able to or haven’t thought about it, we would be unlikely to open source that side of things in the next little while. A much more likely situation is that we start to use some of the better general-purpose system components and contribute in that way, and get a little bit of a bidirectional flow.

At least until we establish ourselves, the more heuristic or domain-specific knowledge components are unlikely to be opened up. Fine-tuning models is still pie in the sky at the moment; I would love to get to it in the next three months. Once I do, I think that’s a really different consideration. There are certainly people who are fine-tuning models on climate-specific or financial-document-specific datasets, and that would be someplace where we would consider something like that. The core heuristic approaches will stay closed for a little while.

Participant 4: I’m trying to understand: so you process climate data and intersect that with financial product information to identify investment opportunities?

Browning: The data that we’re processing is companies actually reporting on their own climate financial data, so it’s all in one; we don’t have to do any intersection there. As part of trying to be an attractive company to invest in, companies will release this data, often poorly. They’re not very good at it yet, and we hope they get better. We’re looking to answer questions across that self-reported data, at least at this stage. There’s no reason why, once the system becomes well-tuned, you couldn’t add in other sources of information, like their non-climate financials, to see if that’s stacking up. Our initial solution is focusing very much on their self-reported climate finance information.

