Podcast: Adam Sandman on Generative AI and the Future of Software Testing


Article originally posted on InfoQ.

Transcript

Shane Hastie: Good day, folks. This is Shane Hastie for the InfoQ Engineering Culture podcast. Today, I’m sitting down with Adam Sandman. Adam, welcome back.

Adam Sandman: Thank you very much, Shane.

What’s changed around software testing trends over the last two years? [00:16]

Shane Hastie: It’s a couple of years since we had you on talking about quality and testing. At that stage, generative AI was brand new and we were talking about what its implications were going to be for testing. Well, we’re two years downstream. What’s changed?

Adam Sandman: Well, I think we’ve seen the power of generative AI, and I think we’ve also seen some of its limits and constraints. We’ve also seen AI evolve in ways that maybe we didn’t expect. Two years ago, we saw a large amount of excitement around the ability to generate things, generate content, whether that was code, blogs, or documentation, things that really have improved the productivity of software engineering, or software production. Some of it is the coding, some of it was the documentation, but I think we’ve also seen some new use cases come along which people didn’t expect, around agentic AI, the new buzzword of the day.

I mean the ability for the AI to actually start to operate applications and provide qualitative feedback on what it is seeing, using vision. Those are things that I don’t think were necessarily forecast for a large language model. We assumed it was more about generation as opposed to insight. I also felt like we might be further along in content generation than we are, and I think the initial use cases that everyone gravitated towards, chatbots and coding assistants and blog writing, haven’t evolved massively in two years. It’s been the same. It’s been more wrappers around the same functionality to make it more usable, maybe more convenient for people. People, I think, have been trying to pair humans and computers together.

So, in the non-development world, whereas people thought we’d just have machines write everything, people now have services where it’s a human and a computer doing it together, with feedback loops around it. So, what’s changed, I think, is that we understand better how AI can help us, but we’re also finding new ways to use AI that weren’t predicted. The other thing I think that’s very interesting is the cost. Scaling is very different than we thought. We assumed that the primary use case would be generation of content, which a large language model does quite inexpensively once it’s been expensively trained and built.

We’re now seeing use cases around what’s called chain of thought, and workflows where the inference, the computing power and the ability to actually query the model in these different ways, is incredibly expensive. So, we’re starting to see the limits and costs of some of these ideas that people have. In some cases, and we’ll talk about testing, some of these LLMs, used for testing in certain ways, can be more expensive than a human tester right now, which I wouldn’t have thought two years ago.

Shane Hastie: So let’s dig into the testing because that’s the field that you are an expert in.

Adam Sandman: Thank you.

Shane Hastie: How are generative AI and large language models supporting testing today, and what’s the implication of that for testing your software products?

Generative AI’s Role in Software Testing [03:11]

Adam Sandman: Right. I’d say we believe testing is a function of quality: testing is the activity, quality should be the outcome. That’s my little soapbox; I just want to mention that, and I know you have InfoQ, so Q for quality always. Where it’s being used today for testing is, first of all, at the developer level. A lot of developers I know are using AI to generate things that they don’t like doing. So, for example, unit tests. We know that writing unit tests is horrible. I am a developer by trade. I used to write code in C# and Java and still do. I may be the CEO, but when they’re not looking, I still dabble in the code and have fun with it. But I hate writing unit tests. Everyone does, and yet we know it’s really powerful.

We know, if you look at the testing pyramid, it’s the classic best way to get coverage cheaply and reliably with the lowest maintenance, and yet we still do a horrible job of it. So, a lot of our clients use AI to do a lot of the hard work of unit testing, particularly input parameter variations. If I want to test this function with 1,000 different parameter combinations, that’s a horrible human task. Developers hate that. It’s the most boring programming work. AI is good at that. So, I think that use case has been well and truly, not to say solved, but improved with the use of AI and Copilots and various other systems like that.
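To make the parameter-variation point concrete, here is a minimal JUnit 5 sketch. The class under test and the input rows are hypothetical, not from the interview, but the shape is exactly the tedious work an LLM can churn out in bulk:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;

// Hypothetical function under test: a simple discount rule.
class DiscountCalculator {
    // 5% discount on orders of 100 or more (illustrative rule)
    static double discountFor(int orderTotal) {
        return orderTotal >= 100 ? orderTotal * 0.05 : 0.0;
    }
}

class DiscountCalculatorTest {

    @ParameterizedTest
    @CsvSource({
        "0,     0.00",   // boundary: empty order
        "99,    0.00",   // just below the discount threshold
        "100,   5.00",   // exactly at the threshold
        "10000, 500.00"  // large order
        // ...an LLM could generate hundreds more rows like these
    })
    void appliesExpectedDiscount(int orderTotal, double expectedDiscount) {
        assertEquals(expectedDiscount,
                DiscountCalculator.discountFor(orderTotal), 0.001);
    }
}
```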

Another thing we’ve found it to be very good at is test data generation. If you need synthetic test data, generating 1,000 phone numbers or usernames, you can tell it, “I want these to have these boundary conditions: make them very long, make them very short, put dollar signs or special characters in, whatever”. It’s good for that use case as well.
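A sketch of that kind of data set in code form, purely illustrative (no real project conventions assumed), shows the mix of plausible values and deliberate boundary cases he describes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative only: plausible phone numbers mixed with deliberate
// boundary cases (empty, very short, very long, special characters),
// the kind of data set an LLM can produce on request.
public class TestDataGenerator {
    private static final Random RANDOM = new Random(42); // fixed seed for repeatability

    public static List<String> phoneNumbers(int count) {
        List<String> numbers = new ArrayList<>();
        // Boundary cases first
        numbers.add("");                        // empty
        numbers.add("5");                       // far too short
        numbers.add("5".repeat(200));           // far too long
        numbers.add("+1 (555) 012-3456$#");     // special characters
        // Then plausible well-formed values
        while (numbers.size() < count) {
            numbers.add(String.format("+1-555-%03d-%04d",
                    RANDOM.nextInt(1000), RANDOM.nextInt(10000)));
        }
        return numbers;
    }

    public static void main(String[] args) {
        phoneNumbers(10).forEach(System.out::println);
    }
}
```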

It’s also good for some of the project management side of testing. We’ve seen a lot of success giving it a user story and using it to give feedback on that user story. It sees that the word “it” is in the story lots of times. What is “it”? Is it the customer? Is it the screen? That makes the requirement or the story better, with fewer mistakes in interpretation by the different people, and that improves quality. Then there are also things like generating test cases, generating test steps, generating automation scripts.

Now we start to get to where I think some of the industry is moving, which is using it to improve some of the UI testing and API testing. UI testing in particular has always been the hardest part of the testing world to do well. Automated tests at the UI level are always disliked because they break very easily. The developers want the freedom to change the application, and the business users certainly want the freedom to improve the user experience. You can destroy your entire test suite in a quarter or in a month very, very easily by making large-scale changes for good reasons, good business reasons at least. So, AI is helping in some of those cases to make the tests more resilient.

It’s starting to be able to help testers take natural language tests and turn them into automation. We’re heading in the direction where, in theory, depending on who you ask and what kind of application, we might be able to have the AI do some of the testing for you, where it can look at an application, begin to interpret the business scenario, and interact with it like a human could, to some degree. Certainly, some of the models now with their vision capabilities can actually see the application and give you real-time, early, qualitative feedback. This button is off the screen; this button is a rounded shape but the requirements say it should be, I don’t know, square or three-dimensional. Those are things that we normally would’ve missed in automated testing.

A human tester might not have enough time to capture those. A good exploratory tester might catch them. So, AI is able to do some of those tasks now, things that developers don’t like doing, things that testers traditionally didn’t like doing as much. In theory, that’s a good thing if we use it alongside what we do. But if we replace everything with it, the danger is we may miss things. So, that’s what we’re seeing in some of the AI use cases at least.

Shane Hastie: So the term I’ve been using a lot is that these tools accelerate us, but we’re certainly not saying get rid of the human in the loop.

Adam Sandman: No. In fact, AI is not just building software from a spec. Depending on the application you’re looking at, it may be building the software from its own learning. So, for example, a whole class of software right now is scanning conversations with human customers and using them to improve the workflow and provide feedback. Or take self-driving cars, which people are probably familiar with: those cars learn by traveling roads, learning road signs. We didn’t write requirements. We didn’t, in some ways, write code. They learned themselves. So, if the machine is learning this mission-critical, highly sensitive, highly safety-conscious functionality that we don’t fully understand, we’re going to need humans in the loop almost to a larger degree.

But I think, as we’ve talked about before, the role of the human and the machine is a little bit different. It’s not that we’re necessarily going to be getting a user story, writing a ton of code, and maybe human-testing it like we may have done before. What the computers do and what the humans do, we’re shifting some of that paradigm a little bit, and I think people have to be comfortable with that change in what their role is and what they’re good at. Computers are really good at certain tasks, LLMs are good at certain tasks, but as we know, they’re very myopic in certain other tasks. The danger right now is, of course, that in the rush to cut costs, managers or well-meaning product owners might decide to try, as you said, cutting that human loop to save money, and then they end up with a quality nightmare, and they won’t realize it until it’s too late.

Shane Hastie: I know that you do have some studies and some stats, and of course there’s the caveat, to misquote: there are lies, there are damned lies, and there are statistics. But what are some of the numbers that you are seeing and that you’re hearing about?

Studies showing challenges with AI-generated code quality [08:20]

Adam Sandman: It’s a great question. First of all, in terms of developers, given the audience that we know we’re talking to, the statistic that was mind-blowing came from a friend of mine, Andrew Palmer from Fujitsu, who’s been studying a lot of AI. The website we all know and love, Stack Overflow, where we all get inspiration on how to write code or solve problems that maybe have been solved before: since November 2022, when ChatGPT was launched, to today, their traffic is down by about 80%. You can see the graph, and other coding websites are equally impacted. It’s not just them. So, where we as developers got our content from, and maybe cut and pasted, maybe edited, we’re now getting it from Copilot, Amazon Q, Gemini, which we use a lot.

So, the question, I suppose, is whether that code is better quality or worse quality than copy and paste from Stack Overflow. That’s a good question, and a separate study, which tries to look into that a little deeper, from a company called Planit in Australia, found that the volume of code has gone up 300%. So, a single developer can write 300% more code when using a good or appropriate automation tool. I’ll talk a little about that next; there are some real benefits and differences in trying them out. But there was a 400% decrease in the quality of that code, such that the unit tests were failing. The testing team found four times as many defects going into testing than they had 18 months earlier, when it was purely coming from humans and human websites and human knowledge.

So, that’s one of the challenges. I’ve seen this when doing coding myself. I have built an integration with a system that doesn’t have good documentation. I was searching the documentation and trying to find API methods, and I couldn’t find them. So, I asked ChatGPT and Gemini independently to generate what the sample code would be, and they quite heroically came up with very reasonable-looking methods. They looked exactly the format they should be, but they bore no relation to anything implemented in the app itself. You run them once and they don’t run. So, in the olden world, I would’ve tried to find the documentation.

I would search Stack Overflow, which I did do, actually. I couldn’t find the answers. I would’ve probably had to then find someone who knew this tool, because it’s more obscure, and ask for help, or I’d give up, or I would have to do a lot of testing myself to get all the JSON and all the different data elements. But the Copilot tools gave me a sample of JSON that looked entirely plausible. It looked like all the other packets that did have documentation. So, it fit the format, it fit the spec, but it was completely bogus, completely made up. I noticed that I think Gemini was less prone to this than, say, ChatGPT for this use case, but other cases were different. So, that’s where I think it can be very dangerous and introduce many quality issues.

Another use case that was very interesting, where I think it really helped us, is we had some SQL. Another thing developers really hate, at least I know as a developer I hate, is performance work. We have code that runs, it’s beautiful, we spent a lot of time on it, and now it’s slow and the client’s not happy. We have to rewrite this thing, and it’s the worst task because it’s not fun. It’s not exciting, it’s not a new user interface; it’s just taking something you’ve done already and redoing it. So, what we did is we actually asked Gemini to rewrite a SQL Server SQL statement, and we gave it the parameters and why it was slow. It did it, and we had no idea if it was good SQL or not, honestly. We didn’t have a lot of DBAs available for this project.

Then when we put it through our unit test harness, it was functionally equivalent. We had good testing, so we could test it. That’s the key here: a good unit test. And it was 30% faster in production, and the client was happy. That to me was a really good use case of AI, where we didn’t have a DBA on staff. It was a singular problem we had to solve on this one stored proc. To hire someone for one stored proc and have them learn your application, a lot of times people wouldn’t bother. They would just tell the client, “Well, it’s as optimized as it can be”, i.e., with our current team, and maybe that would be it.

So I felt like that was a really, really powerful use case, because we had really good automated coverage that the developers had written and spent the time to write, so we had confidence it was functionally equivalent, and it could fix the performance issue, which is harder to measure and less deterministic. It depends a lot on the load and the type of application. We ran it through the production load, where we could see the benefits for that specific client’s use cases.

For other clients, we hadn’t seen a performance issue. So, that’s why it’s hard to frame and test those. With AI, you could try five different combinations, and I think we did try multiple. It wasn’t just the first one that ran. We tried three or four, got three or four answers, tried each one in turn, made sure that the ones we used were all functionally equivalent per the unit tests, and then we took the fastest of, I think, the four. So, to me, that was a really good AI use case and a good stat we had.
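The process he describes, generate several candidate rewrites, keep only those that are functionally equivalent, then pick the fastest, could be harnessed with something like the following sketch. The JDBC plumbing is standard, but the equivalence check is a simplified assumption (sorted full result sets) and the queries would be your own:

```java
import java.sql.*;
import java.util.*;

// Illustrative harness: check that AI-rewritten queries are functionally
// equivalent to the original, then keep the fastest equivalent candidate.
public class QueryEquivalenceCheck {

    // Run a query and materialize its rows for comparison.
    static List<List<Object>> run(Connection conn, String sql) throws SQLException {
        List<List<Object>> rows = new ArrayList<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            int cols = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                List<Object> row = new ArrayList<>();
                for (int i = 1; i <= cols; i++) row.add(rs.getObject(i));
                rows.add(row);
            }
        }
        rows.sort(Comparator.comparing(Object::toString)); // order-insensitive compare
        return rows;
    }

    // Returns the fastest candidate whose results match the original's;
    // falls back to the original if no candidate is equivalent.
    public static String fastestEquivalent(Connection conn, String original,
                                           List<String> candidates) throws SQLException {
        List<List<Object>> expected = run(conn, original);
        String best = original;
        long bestNanos = Long.MAX_VALUE;
        for (String candidate : candidates) {
            long start = System.nanoTime();
            List<List<Object>> actual = run(conn, candidate);
            long elapsed = System.nanoTime() - start;
            if (actual.equals(expected) && elapsed < bestNanos) {
                bestNanos = elapsed;
                best = candidate;
            }
        }
        return best;
    }
}
```

In practice you would run the timing under a representative production load, as described above, rather than a single cold execution.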

Shane Hastie: Again, leveraging the tools for what the tools are good at.

Adam Sandman: Right, exactly. So, trying to write against an API where the documentation doesn’t exist is just literally hallucination paradise. It’s making things up based on the pattern: if all the methods have the same pattern and all the JSON has the same pattern, it can make that up too. It can assume that the method will look the same, but that’s not solving the problem. But something like performance, which is generally quite hard, there’s a lot of documentation out there on SQL Server.

There’s a large amount of ingestible content, but I haven’t got the time as a human to read and understand it all. This is where an LLM was very good at being able to synthesize all that accumulated human knowledge, give me the one nugget, and rewrite the stored proc in a way that applied that knowledge in a really quantifiable and very specific way to this one business problem that we had.

Shane Hastie: Another example you mentioned when we were chatting earlier was just pointing the tool at the product and saying, “Do this”.

Potential and limitations of autonomous testing [13:44]

Adam Sandman: Yes, the autonomous testing. For those of you coming from the development world, if you ever decide to go to a testing conference right now, every vendor, every supplier is going to talk about this. This is the latest buzzword in the industry: autonomous testing. Go search on Google or ChatGPT or Perplexity on this topic. Here’s the practical reality today. If you use some of the most advanced and expensive LLMs, and we’ve tried this out with Claude 3.5 Sonnet and 3.7 Sonnet, they’ve got a feature called computer use. What computer use basically is, if you’re not familiar with it, and I think Microsoft and OpenAI have something similar, is that they’ve trained it to look at all these common applications.

So, it’s learned what a web browser is, it’s learned what Office is, what Excel is. It’s basically learned the sum knowledge of user interface design and modern applications. Now you can point it at your application, and we’ve tried this. We gave it an application it had never seen before, and we asked it to perform some very simple tasks. It was interesting what it did. This was a simple application we built for testing, so it is simple. That’s really important to bear in mind here, and I’ll come back to why that’s important in a minute. There’s a login page with username, password, and a login button; you go to a grid with some data. It has a list of books. You can edit the books, change the genre, change something else, save it. That’s it. Very simple.

With traditional automated testing tools, you could build scripts to automate it. We just told it to log in with the login and password that we gave it, and we said change the genre of a book, I think it was Pride and Prejudice, to detective fiction, which it obviously isn’t, but that’s fine. We didn’t say there was a login button. We didn’t say there was a menu. We didn’t say there was an edit button, which you normally would need. We just told it what I told you, and it was able to log in and perform the task. It performs pretty consistently every time. It’s actually quite amazing to watch. The interesting thing was, on the second go-round of one of the times I was doing this, I forgot to reset the book back to romantic fiction or whatever it really is. I left it at detective fiction.

When I ran the test, it logged in and it said, “I’m done”. It had figured out that the book was already in the right genre, and so it said, I don’t need to change it. So, that was fascinating. Is that a pass or a fail? My test was wrong. That’s where what it can do gets very interesting. The other thing we can have it do is log in, look at the screen, and compare differences simultaneously while doing this. The problem with this, I feel, is that it shows people, especially people who hold budgets: wow, I don’t need any testers. We’ll just have AI do our testing, and magic.
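As a schematic illustration of the test he describes, consider the sketch below. The ComputerUseAgent interface is entirely hypothetical (real vendor offerings, such as Anthropic's computer-use tooling, differ in detail), but it captures the shape of a goal-driven test with no selectors and no scripted steps:

```java
// Entirely hypothetical agent interface, sketching the shape of an
// autonomous UI test: a goal stated in plain language.
public class AutonomousGenreTest {

    /** Stand-in for a vendor computer-use API; real APIs differ. */
    interface ComputerUseAgent {
        /** Drives the application toward the goal; returns a transcript of actions. */
        String perform(String applicationUrl, String goalInPlainLanguage);
    }

    /** Returns true if the agent reports the book ended up in the right genre. */
    static boolean changeGenre(ComputerUseAgent agent) {
        String transcript = agent.perform(
                "https://example.test/library",   // hypothetical app under test
                "Log in with user 'tester' and password 'secret'. "
              + "Find the book 'Pride and Prejudice' and change its "
              + "genre to 'Detective Fiction'. Report what you did.");

        // The ambiguity described above: if the genre is already correct,
        // the agent may report "done" without editing anything. This check
        // treats either path as a pass, which may not be what the test
        // author intended.
        return transcript.toLowerCase().contains("detective fiction");
    }
}
```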

There are two or three problems with this today. Unlike a modern automated testing script, where each step, when you run it, takes a few milliseconds, it takes about five seconds to do every operation, because it’s working through what’s called a decision tree. It’s not a simple LLM call. It’s using chain of thought. So, it goes through every possible probability, and you can even see in the log what it’s doing. It’s doing something like, “Hey, this is a screen, this screen’s got some links. I have to edit this book. Is that an edit? Is that the right one? Let me try it”. You can see it thinking. This is costing thousands and thousands of tokens, maybe millions of tokens, which translates to actually hundreds of dollars.
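A back-of-envelope sketch of that economics might look like the following, with every number an assumption for illustration rather than a figure from the interview:

```java
// Back-of-envelope economics of agentic, chain-of-thought test runs.
// All numbers below are assumptions; the interview only says
// "millions of tokens ... hundreds of dollars" and ~5 s per operation.
public class AgentTestCostEstimate {
    public static void main(String[] args) {
        int stepsPerTest = 40;                 // assumed: logins, navigation, edits, checks
        long tokensPerStep = 100_000;          // assumed: reasoning plus screenshots per step
        double pricePerMillionTokens = 15.0;   // assumed blended $/1M tokens
        double secondsPerStep = 5.0;           // ~5 s per operation, per the interview

        long totalTokens = (long) stepsPerTest * tokensPerStep;          // 4M tokens
        double costPerRun = totalTokens / 1_000_000.0 * pricePerMillionTokens;
        double minutesPerRun = stepsPerTest * secondsPerStep / 60.0;

        System.out.printf("%,d tokens, $%.2f, %.1f minutes per run%n",
                totalTokens, costPerRun, minutesPerRun);
        // At, say, 500 regression tests per nightly run, that is $30,000
        // a night under these assumptions: easy to see how it can exceed
        // the cost of a manual tester.
    }
}
```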

We did a study, and it actually will cost you more than a manual tester right now to use this at scale. Now, the compute cost will change. But the second problem is that this was a very artificial, simple application. What are people testing in reality? Complex business apps. Imagine that was SAP or Salesforce or Microsoft Office, or any of the apps that you are developing and testing. It’s going to have a hard time. Now, what we think the future is going to be, to some degree, is that the magic will be the spec. If we have good requirements and this application understands the requirements and the documentation, it has a better chance of being able to do this.

But that means we need to have better requirements, better quality in what we’re doing, and I think it’s still going to require humans in there to guide it. So, it’s almost like an intern. You’ve got a testing intern which can do a lot of tasks quite well, but it won’t get things right the first time, and you train it. So, I think for testers and developers, a lot of the AI is going to be a very smart intern that works with us, almost like the ultimate pair programming from the agile days. I think that’s where we can see some amazing benefits. Again, like the performance tuning, like the unit testing: laborious, boring, uncreative, mundane tasks that we tend to short-circuit as humans, it loves doing.

The things that we are very good at, pattern recognition, understanding the why behind what we’re doing, are very important. Especially when you’re looking at systems that are AI-generated, there’s also an ethical dimension. As a business analyst, you almost have to have a philosophy degree, and maybe an ethics degree, because a lot of these systems now are learning for themselves, and we as humans have to test: are they fit for purpose in the engineering sense? Fit for purpose not in the software sense, but in the sense that if you build a bridge or a house, it has to be fit for purpose.

What we think of that meaning is: does it solve the business need? Not just did it meet the architect’s drawing, but can you actually walk up the stairs, or are the stairs too steep? So, a lot of what we as software professionals have to think about is that we’re almost like engineers and architects and have to look at the human-factor side of what we’re building, whether we’re developing, testing, or being a business analyst.

Shane Hastie: What about the collapse of roles?

Collapsing roles in software development [18:46]

Adam Sandman: That’s a really good point, and we talked about this a little bit earlier on as well. As a developer, well, as a software professional, let’s think of it that way, I can use a low-code prototype that’s maybe 70% functional and show it to a client. Do I need to be a developer for that? Am I a tester? I’m doing the testing by pointing the AI at the application that I’ve low-code generated. So, maybe the roles of a business analyst, a developer, and a tester collapse into a software professional or software engineer role, where you have to be responsible for that whole piece rather than it being a handoff. The other thing which is interesting, one of the things I remember from the olden days of computing, is that we could do prototyping really fast for clients.

If you look at the Agile Manifesto, which is now of course 24 years old and not so revolutionary, the whole point was getting feedback early. With AI, we can give a customer or a user an application: you write the requirement today, and maybe by the next morning you’ve got not just a mock-up in Photoshop or Figma or something, but an actual working application. My commitment to that was only a prototype, but built using the AI. If they want to change it, I’m not going to spend weeks and weeks rebuilding it. I can change it on the fly. We can build much more real-time, collaborative experiences, because the AI is making my human investment, my emotional investment, in this prototype much less.

Now we get this real-time iteration of feedback, which was always the ideal for Agile but was really limited by the tools we had. So, yet again, Agile with AI is going to involve a blending of roles, and I feel like it can tighten the feedback loop, because now you can see real applications with real data, or real synthetic data, or let’s say realistic synthetic data at the very least. Suddenly the product owner or the client, or even the end users, could try this out. If we’re way off base, we can make changes tomorrow. We’re not waiting a month to change course.

I think that can be a game changer for the efficiency of software startups, and we know it may lower the bar for new entrants. A lot of software players that have been around for a while could be challenged by new entrants with new ideas, because the cost of innovating is going to be much lower for a small, innovative startup company, potentially.

Shane Hastie: What’s in your crystal ball?

Looking to the future [20:51]

Adam Sandman: Oh, well, it depends on whom you ask. One member of my team is quite dystopian and says, “We should all become plumbers. It’s going to do everything for us”. That’s not my crystal ball, that’s his. With software development and engineering, I think we’re going to see a lot more AI being used to help design the systems and take requirements to get feedback. I think we’re going to see potentially a lot more software that’s maybe not written for humans to use the way we do today. If we’re going to interact with AI at a human level, we are using screens and keyboards, and we keep thinking it’s going to stay the current user interface. So, I feel like AI hasn’t dramatically evolved the user interface yet.

But if you look at what agentic workflows and agentic AI can do, which is open up a computer and do things, the whole experience of us interacting with our computer, if this is successful, should be different. The classic example that people in our industry are talking about is, instead of me going on Google trying to find how to book a flight to… Let’s say I’m going to come visit you in New Zealand. I’m going to book a flight. I’m going to book hotels. Maybe I’ve got my itinerary to go see the South Island, and I like Lord of the Rings, so I want to go visit all the sites. Well, the agentic AI could be given that paragraph. It would go off, with human feedback from me, because obviously I might like certain hotels, and it knows I’ve got Marriott Rewards.

Ephemeral applications built on the fly and discarded [22:07]

So, it wants to book a Marriott. We could potentially book that whole trip with it interacting with websites. We’ve even seen examples now where it will build the website on the fly. I saw a really interesting example last year at a conference. Someone wanted to find a restaurant in, I think it was New York City. Instead of going to Google and doing that, the AI built a website on the fly for that user’s question. It took data from Yelp and other sources and various other things it had access to, and built a one-page customized web page for that one user’s query, like an ephemeral application. So, I think you might see more of these ephemeral apps, where the workflow is building an app for you that is then disposed of as soon as the query is done.

So, the user interface becomes non-persistent, and maybe it’s the logic and the knowledge and the data that are persistent. Think about that. That changes software testing and development immensely, and the ethical and privacy concerns too. If I’m a restaurant owner and I don’t appear in the search results, I go to Yelp and I say, “Why am I being discriminated against?”, and maybe bring a lawsuit over it. Well, this ephemeral application presented me with data that’s gone; who knows what it gave me? Who knows if there’s some sponsorship going on behind the scenes that somehow biases the app? How do we know?

So, in my crystal ball, I feel like the whole user interface and the way we interact with computers are going to change in the next three years, and how we develop software is going to change, because there was no human building that web page. It was generated on the fly from existing data. So, as humans, what are we developing? What are we testing? I think our roles as software professionals will be very different, and we’re going to have to figure out how we work in that world, what we’re doing, and what we’re using our tools for. As software professionals, if someone’s coming out of college or going to college today, they’re 18, and they ask me, “What should I learn?”

Because a lot of parents do ask me this question, and four years ago it would’ve been, “Get a computer science degree, become a developer, learn Java, learn Python or something”. Nowadays I would say get a range of skills: learn design, learn software development, learn AI, learn data science, learn history, learn material design. I would say take a UX course. Maybe do something that’s out in the three-dimensional world, like industrial design. Take an architecture course, right?

I think we have to broaden our skill sets, because computers are going to be very good at these narrow, specialized tasks. As you said, there’s the collapsing of roles. Humans are very good generalists, and to be a successful engineer four years from now, or a professional in software or IT, you have to have that more generalist outlook. If you just expect to write code or write tests or write user stories, and that’s all you do, I think the world has changed and will change more.

Shane Hastie: The rise again of the generalist.

The rise again of the generalist [24:41]

Adam Sandman: Right. I’ve always liked that. As someone who started a company 18 years ago, that’s what being an entrepreneur is about. I had to learn tax, accounting, law, software development, marketing, business operations, HR, and I love that. Now, some people don’t like that, but yes, I talk about the solopreneur, the one-person unicorn. Could that be the future? Maybe. But it might be that as companies look to automate more of their back-office and white-collar types of tasks, we may see the rise of the more generalist role, not the craftsman per se, but a move away from the large, white-collar standardized workflows where you go in and do the same task day in, day out, which I think have already been replaced in the factory world.

They’re being replaced to some degree in the information world, so we have to look for those more generalist roles. I think that is the future, and we’re seeing it. People who start companies have to have that, and I think people recognize that with AI you can potentially have a force multiplication of ten when starting a business. You don’t need 1,000 developers in the Bay Area. You can now have a very small team put together something quite credible quite quickly, get real feedback, and even launch it.

Shane Hastie: Adam, a lot to think about there. If people want to continue the conversation, where do they find you?

Adam Sandman: I’d love to have those conversations. I do travel a lot. If you go on LinkedIn, you’ll see where I am. I’ve travelled to most continents for speaking and also go to various events. Always happy to meet in person. Otherwise, LinkedIn is probably the best place. I’m Adam Sandman on LinkedIn. There aren’t many with my name, and I’ve got a pink background, so it’s easy to find me. But always happy to have those conversations.

My email is also on my profile on LinkedIn. But yes, Adam Sandman on LinkedIn is the best. I’ve given up most of the other social channels, I must admit; I find them a little bit too political these days. Again, thanks so much for having me on the show, Shane. It’s been a real pleasure. I learn something every time I come on these shows as well, so I appreciate the opportunity.

Shane Hastie: Thank you. It’s been great to see you again.


