Article originally posted on InfoQ.
Calin: Hello, everyone, and thank you very much for having me here. First of all, I've been upgraded to this massive room because of the interest that was registered, so I wasn't expecting this, and I'm both humbled and grateful that all of you have come.
Imagine that you are a systems engineer at a fintech and you have worked on the infrastructure of a payment services provider product for more than a year. Then, within half an hour of a planned penetration test, your whole production cluster gets compromised. We are talking about firewall rules being bypassed, and we're talking about access being gained to Google Cloud services that should have been off limits.
In a day and age in which no security control is completely impenetrable, I am here today to talk to you about our hurdles as a regulated financial institution, some of the things we've learned, and what you can take away as well.
So who am I? I am Ana Calin, I am the systems engineer in that little story I was telling you about, and I work for Paybase.
Here is what I'm going to cover today. First, I'll give you a tiny bit of context about who Paybase is and what we do. Then we'll look at the things we've achieved so far and some of our tech stack, not the complete stack but some of it. Then I'll give you proper details about this compromise that I keep talking about, some of the things you can take away from a security and resilience point of view, challenges we've encountered on our road to compliance, specifically PCI DSS compliance, and challenges we've managed to avoid specifically because of our cloud-native setup.
Paybase is an API-driven payments services provider. Our business model is B2B, business to business, and our customers are marketplaces, gig and sharing economies, cryptocurrency businesses or, in more general terms, any businesses that connect a buyer with a seller. Our offering is very important because platform businesses such as these face a very tough challenge: not only do they have to solve very complex technical issues, but with the updated regulation they also have to either become a regulated payments institution themselves or integrate with a third party. The current solutions are very costly and very inflexible, so we came up with a product that is less expensive, very flexible, and also less complex. In other words, we make regulation easier for our customers.
What We Have Achieved so Far
We are under two years old, one and a half really. We have built our own processing platform from scratch, completely cloud native and with more than 90% open source software, and we're currently onboarding our first seven clients, with no shortage of demand. We are FCA authorized, we have an electronic money institution license and, in 2017, we received an Innovate UK grant worth 700,000 pounds to democratize eMoney infrastructure for startups and SMEs.
We are also PCI DSS Level 1 compliant, which is the highest level of compliance. In my opinion, having worked directly with it, that is a huge achievement, given that the current state of this standard is such that many very large institutions that need to comply choose to pay annual fines rather than become compliant. That makes financial sense to them because of the technical complexity of their systems, their technical debt and, for many of them, the hundreds of legacy applications they are still running.
Some fun facts about PCI DSS: between 2012 and 2017, there was a 167% increase in companies becoming compliant, and yet in 2017, more than 80% of the businesses that should be compliant remained non-compliant.
Our Tech Stack
We follow a microservices-based infrastructure, but our actual application is a distributed monolith separated by entities. The reason we do this is that it gives us flexibility in how we separate concerns and how we scale. We have built everything on top of Google Cloud Platform. We chose Google Cloud because, at the time, we needed a managed Kubernetes service and Google Cloud had the best offering and, if I am to give my own personal opinion, from a managed-Kubernetes point of view, they still have the best offering in the market today.
Our infrastructure is built with Terraform as infrastructure as code, and our applications are deployed with Kubernetes wrapped in Helm. For our observability stack, which is really important just to have a good view of what's happening in your cluster, we use EFK, so Elasticsearch, Fluentd, and Kibana, for log collection, and Prometheus and Grafana for metric aggregation. If you are thinking, "Why do we need two different observability stacks?", although they are complementary to each other, they give us different things. EFK gives us events, about errors and what happens in the application, and Prometheus and Grafana give us metrics, so for example the error rate of a service, the CPU usage, or whether a service is up or down.
The communication between services is done synchronously via gRPC with HTTP/2 as the transport layer, and we also use NSQ as a distributed messaging broker, which helps us asynchronously process event messages.
Details about the Compromise
Now that you know a bit of our tech stack, I'm going to tell you about our compromise and what happened. First of all, a bit of context. This happened in the scope of a planned internal infrastructure penetration test, in our production cluster, but at the time the production cluster wasn't actively used by customers. The pentester, let's call him Mark because, well, that's his actual name, had access to a privileged container within the cluster, so it wasn't a 100% complete compromise from the outside.
There were a few weak links. First of all, GKE, the managed Kubernetes service from Google Cloud, comes insecure by default, or at least it did at the end of last year. What do I mean by insecure by default? If you provisioned a cluster, you would get the compute engine scope on the cluster, you would get the compute engine default service account, and you would get legacy metadata endpoints enabled. The compute engine scope refers to an OAuth scope, an access scope that allows different Google Cloud services to talk to each other, in this case GKE speaking to GCE, and it was read/write access.
I'm going to be naughty and take the blame and put it on someone else, in this case, Terraform. The first bit on the screen is part of the documentation, taken yesterday, on how to provision a container cluster with Terraform. As you can read, it says, "The following scopes are necessary to ensure the current functionality of the cluster," and the very first one is compute read/write, which is the compute engine scope I'm talking about. I can tell you that this is a complete lie; ever since we did that penetration test, we realized that you don't need it, we took it out and our clusters are functioning just fine.
Next, the compute engine service account. When you provision a cluster in GKE, if you don't specify a specific service account, it uses the default service account associated with your project. You might think that this is not necessarily a problem; well, it's a big problem, because this service account is associated with the editor role in Google Cloud Platform, which has a lot of permissions. I'll talk about the legacy metadata endpoints in a second.
My next point is about querying the metadata endpoints. Within Google Cloud you can query the metadata endpoints of a node and, if the legacy endpoints are enabled, this gives you details about the Kubernetes API, acting as the kubelet. What that means is that if you run this particular command, you will be able to get access to certain secrets that relate to the Kubernetes API from within any pod or node within a GKE cluster and from there on, well, the world is your oyster.
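To make the attack surface concrete, here is a hedged sketch of the kind of query involved. This is an illustration of the well-known legacy metadata issue on GKE, not the exact command shown on the slide: the old v1beta1 endpoint would answer without the `Metadata-Flavor: Google` header that the v1 endpoint requires.

```shell
# From inside any pod or node on an affected GKE cluster, the legacy
# v1beta1 metadata endpoint could be queried without any special header:
curl -s "http://metadata.google.internal/computeMetadata/v1beta1/instance/attributes/kube-env"

# The response contained kubelet bootstrap material (values such as
# KUBELET_CERT and KUBELET_KEY), i.e. credentials that can be used to
# talk to the Kubernetes API as the kubelet.
```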
What can you do to disable this? First of all, a very quick disclaimer: the latest version of GKE, 1.12, which is the current one, comes with this metadata endpoint disabled. But if you're not on that version, you can disable it either by building your cluster with the gcloud CLI and specifying the disable-legacy-endpoints metadata flag, or in Terraform by adding a workload metadata config block to your node config block. The result should be something like this, and I've added some resources at the very end for those of you on GCP who want to check this particular issue.
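As a rough sketch of the two options just described (the cluster name is made up for illustration):

```shell
# Option 1: disable the legacy metadata endpoints at creation time
# with the gcloud CLI.
gcloud container clusters create my-cluster \
    --metadata disable-legacy-endpoints=true

# Option 2: the Terraform equivalent goes inside the node_config block
# of the cluster or node pool resource, roughly:
#
#   node_config {
#     metadata = {
#       disable-legacy-endpoints = "true"
#     }
#     workload_metadata_config {
#       node_metadata = "SECURE"
#     }
#   }
```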
Weak link number two: Tiller, the server-side component of Helm. If you read the documentation of Helm, it says that you should never provision Helm or Tiller in a production cluster with all of the default options. We did this and said we were going to change it later on, but when it came to the penetration test, we decided to live life on the edge, leave it on, and see how far a penetration tester could go. The default options mean that Tiller comes with mTLS disabled and performs no authentication, and this is all very bad because Tiller has permission to create any Kubernetes API resource in the cluster, so it can do anything, or remove anything.
How would you go about getting access to Tiller from a non-privileged pod? I have taken a random cluster that has Tiller installed and I have deployed fluentd in a pod in a different namespace from the kube-system namespace where Tiller lives. It also has the Helm client installed, so you can see that the very first command is helm version, and this gives me nothing. But if I telnet directly to the address and port of Tiller, then all of a sudden I'm connected to Tiller and, well, I can do everything I want from here.
What can you do in terms of mitigations? Ideally, you would enable mTLS or run Tillerless Tiller, but if you are unable to do that, you should at the very minimum bind Tiller to localhost so that it only listens for requests that come from the pod IP that it lives on. The result should look like this bit here: unable to connect to remote host.
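The localhost binding can be sketched like this, assuming Helm v2's `helm init` with a command override (a mitigation documented for Tiller, though the exact flags on the slide may have differed):

```shell
# Deploy Tiller listening only on localhost inside its own pod,
# so probes from other pods in the cluster are refused.
helm init \
  --override 'spec.template.spec.containers[0].command'='{/tiller,--listen=localhost:44134}'

# Afterwards, the probe from the previous demo should fail:
#   telnet tiller-deploy.kube-system 44134
#   -> Unable to connect to remote host
```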
Security and Resilience
Let's have a look at some of the security and resilience notes that we picked up along the way. A secure Kubernetes cluster should, at a minimum, use a dedicated service account with minimal permissions. Here I'm not talking about a service account in the context of Kubernetes, but a service account in the context of GKE; if I were to translate this to the Azure world, it's called a service principal.
A secure Kubernetes cluster should also use minimal scopes, following the least privilege principle. It should use network policies, which are useful for restricting access to certain pods that run in a more privileged mode, or it should use Istio with authorization rules enabled; if they are set up properly, authorization rules achieve the same thing as network policies. It should also have pod security policies, which are useful for preventing any pods that don't meet certain criteria from being created in your cluster, so that in the event an attacker gets into your cluster, they can only deploy pods meeting those criteria. You should use scanned images, because using untrusted images is counterintuitive and you don't want to install vulnerabilities in your cluster without knowing. And you should always have RBAC enabled. A note on RBAC: GKE, AKS and EKS all come with RBAC enabled today, but AKS only enabled it recently, so you should look into this if you don't have RBAC.
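As a minimal sketch of the network-policy point, here is a default-deny ingress policy, assuming a hypothetical `payments` namespace: with this applied, pods in that namespace only receive traffic that a further policy explicitly allows.

```shell
# Apply a default-deny ingress NetworkPolicy to the "payments" namespace.
# An empty podSelector matches every pod in the namespace; listing only
# Ingress in policyTypes means all inbound traffic is denied by default.
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: payments
spec:
  podSelector: {}
  policyTypes:
  - Ingress
EOF
```

Note that network policies are only enforced if the cluster's network plugin supports them (on GKE, network policy enforcement has to be enabled on the cluster).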
A resilient Kubernetes cluster, and here we're talking about resilience rather than security, should first of all be architected with failure and elasticity in mind by default. In other words, no matter what you're building, always assume that it's going to fail or someone is going to break it. You should have a stable observability stack so you get visibility into what's happening, and you should be testing your clusters with Chaos Engineering tools.
I know that Chaos Engineering can be very intimidating for most of us, I personally was intimidated by it as well, but after I played with it, and it's just a matter of running that particular command on the screen to install it in a cluster, it's really easy. It's a great way of testing how resilient your applications are, especially if you are running JVM-based beasts such as Elasticsearch. This particular command says install chaoskube, which is one flavor of Chaos Engineering tool, and make it run every 15 minutes, so every 15 minutes a random pod in your cluster will be killed. You don't necessarily have to install it straight away in your production cluster; start with other clusters, see what the behavior is and move on from there.
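The command described above probably looked something like this, assuming the chaoskube Helm chart that was available at the time (chart name and values are an assumption, not the exact slide):

```shell
# Install chaoskube and have it kill a random pod every 15 minutes.
helm install stable/chaoskube --set interval=15m

# chaoskube defaults to dry-run mode, where it only logs which pod it
# *would* have killed; flipping that off is what arms it for real:
#   helm install stable/chaoskube --set interval=15m --set dryRun=false
```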
The last part of my talk is mostly about challenges in terms of compliance. Please come and talk to me afterwards if you've had similar challenges or if you are looking at moving in the same direction.
Challenge number one, as a PCI compliant payment services provider with many types of databases, I want to be able to query data sets in a secure and database agnostic manner so that engineers and customers can use it easily and so that we are not prone to injections.
This particular challenge has two different points to it. Especially when you're using more than one type of database, you want to make it easy for your customers, particularly if you are API driven, to query your data sets, so this is about customer experience. The second part is about security: PCI DSS requirement 6.5.1 requires you to make sure that your applications are not vulnerable to any injection flaws, whether SQL or any other type.
"So how did we approach this?" I hear you say. Meet PQL. PQL is a domain-specific language that our head of engineering wrote, inspired by SQL. It is injection resistant because it doesn't allow for mutative syntax, it is database agnostic and it adheres to logical operator precedence. How does it look? This is an example of how it can look. We achieved this through lexical analysis and syntactical analysis, so by parsing tokenized input into an AST.
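PQL itself is proprietary and its real syntax isn't shown here, but the technique named above can be illustrated generically. The following is a hypothetical sketch, not Paybase's implementation: a tokenizer (lexical analysis) feeding a parser (syntactical analysis) that builds an AST where AND binds tighter than OR; the grammar simply has no write operations and no escape hatch, which is what makes this style of DSL injection resistant.

```python
import re

# Hypothetical mini-grammar: field comparisons combined with AND/OR.
# Anything outside this grammar is rejected outright.
TOKEN_RE = re.compile(r"\s*(?:(AND|OR)|(\w+)\s*(=|!=)\s*'([^']*)')")

def tokenize(text):
    """Lexical analysis: turn the query string into a token list."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            # No mutative syntax, no comments, no stacked statements:
            # unrecognised input is an error, never passed through.
            raise ValueError(f"bad input at position {pos}")
        if m.group(1):
            tokens.append(("op", m.group(1)))
        else:
            tokens.append(("cmp", (m.group(2), m.group(3), m.group(4))))
        pos = m.end()
    return tokens

def parse(tokens):
    """Syntactical analysis: build an AST; AND has higher precedence than OR."""
    def parse_or(i):
        node, i = parse_and(i)
        while i < len(tokens) and tokens[i] == ("op", "OR"):
            right, i = parse_and(i + 1)
            node = ("OR", node, right)
        return node, i

    def parse_and(i):
        node, i = parse_cmp(i)
        while i < len(tokens) and tokens[i] == ("op", "AND"):
            right, i = parse_cmp(i + 1)
            node = ("AND", node, right)
        return node, i

    def parse_cmp(i):
        kind, value = tokens[i]
        assert kind == "cmp"
        return value, i + 1

    node, i = parse_or(0)
    assert i == len(tokens)  # reject trailing junk
    return node

ast = parse(tokenize("status = 'active' OR role = 'admin' AND mfa = 'on'"))
# AND binds tighter than OR, so the right-hand comparisons group first:
# ('OR', ('status', '=', 'active'),
#        ('AND', ('role', '=', 'admin'), ('mfa', '=', 'on')))
```

A backend per database can then walk this AST and emit a parameterized query for that engine, which is where the database-agnostic part comes in.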
Challenge number two. As a PCI compliant payment services provider, I am required to implement only one primary function per server, to prevent functions that require different security levels from coexisting on the same server; this is requirement 2.2.1 of PCI DSS. This was a difficult one, because we don't have any servers. Yes, Google Cloud has some servers that we use by association, but we don't really have any servers ourselves, we run everything in containers, and note that the actual standard doesn't even say anything about virtual machines, never mind containers.
The way we approached this was by trying to understand what the requirement is really about, which is to prevent functions that require different security levels from coexisting in the same space, in other words, from accessing each other.
We decided to treat a server as a deployable unit, so a pod or a container, and in that case we meet the requirement, because we're using network policies, which restrict traffic between the different services, along with other bits of security as well. We're also using pod security policies to make sure that only pods that meet certain criteria are allowed into the cluster and, a very important one, we're using only internal, trusted, approved and scanned images; I'll come back to the scanning in a second.
Those were specific challenges, some examples; this is not an exhaustive list, but those were challenges where we had to think outside the box. Now let's look at challenges that we didn't have to deal with at all because of our setup.
As a PCI compliant payment services provider, I am required to remove all test data and accounts from system components before the system becomes active or goes into production. This is requirement 6.4.4, and it makes sense to have this requirement.
The normal way organizations split their Google Cloud infrastructure, and I've taken Google Cloud as an example but this applies to other cloud providers too, is by having one massive organization, then a project, and under that project all of the services; companies like AWS actually suggest that you should have two different accounts, one for billing and one for everything else you are doing. If you do this, then you can split your environments within a GKE cluster at the namespace level, so you'd have a production namespace, a quality assurance namespace, and a staging namespace all living within the same cluster, and then you'd be able to access all of the different services.
From a PCI DSS point of view, you have the concept of a cardholder data environment, which basically defines the scope of the audit that you have to perform. If you do it this way, then your scope would be everything within the VPC, but we did it in a different way.
This is the way in which we split our environments: for each environment we have a specific project, and then we have a few extra projects for the more important things that we need to do. For example, we have a dedicated project for our image repo, we have a dedicated project for our Terraform state, and we have one for our backups. This is very important both from a PCI point of view and from a compromise point of view. From a PCI point of view, it's important because we managed to reduce the scope to just the GKE cluster within the production project. From a compromise point of view, most of the compromise happened with that compute engine default service account, which only had the editor role within the project, so yes, it managed to bypass our firewall rules, and yes, it managed to get access to the buckets within the production project. But it didn't manage to get access to our image repo, our backups or any of the other environments, so from that point of view, it was quite contained.
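The project-per-environment split described above could be sketched like this; every project ID here is made up purely for illustration, not Paybase's actual layout:

```shell
# One Google Cloud project per environment, so each environment is its
# own IAM and billing boundary:
gcloud projects create acme-payments-prod
gcloud projects create acme-payments-staging
gcloud projects create acme-payments-qa

# Plus dedicated projects for cross-cutting concerns, so a compromised
# environment cannot reach them with its project-scoped credentials:
gcloud projects create acme-image-repo       # container images (GCR)
gcloud projects create acme-terraform-state  # remote Terraform state
gcloud projects create acme-backups          # database backups
```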
What do you get from this? You get increased security, you get separation of concerns, you get a reduction of scope and, of course, RBAC that is easier to organize. I'm not saying that without doing this you won't be able to get things as secure, I'm sure you will, but it's going to take much more work.
Removing test data and accounts from system components before the system goes live actually doesn't apply to us, because test data would only ever exist in all of the other environments, never in the production environment to begin with.
Challenge number four, which we've avoided. As a PCI compliant payment services provider, I am required to perform quarterly internal vulnerability scans, address vulnerabilities and perform rescans to verify that all high-risk vulnerabilities are resolved in accordance with the entity's vulnerability ranking; this is requirement 11.2.1.
This is a very interesting one. First of all, a random comment: that's just an image that I liked, so I decided to put it there, because it's my talk. This particular challenge is about interpretation. When you are running everything within containers, you don't really have the concept of internal infrastructure, unless you're going to say, "Well, the way to meet this is by doing very frequent penetration tests," which is not necessarily viable for every organization. What we said is that we make sure that all of the images that ever reach our cluster have been scanned, and no image that hasn't been scanned or hasn't passed certain vulnerability ranking requirements will ever get into our cluster.
I've made a diagram. When a developer pushes code to GitHub, some integration tests run. If those integration tests pass, an image build starts, and if the image build is successful, there's another step that inherits from the tag of the first image, takes that image and scans everything in it. If the scan passes, meaning the image doesn't have any vulnerabilities rated higher than low, then the initial image is retrieved and pushed to GCR. If the image scan doesn't pass, then the build fails and our developers have no other way of getting that code into an environment. That's our way of dealing with it.
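The gating step in that diagram could be sketched as a CI script like the one below. The scanner (trivy), image name and severity threshold are all assumptions for illustration; the talk doesn't name the actual tooling. The point is only the shape: a failed scan fails the build, so an unscanned or vulnerable image can never reach the registry.

```shell
#!/bin/sh
# Fail the whole script as soon as any step fails.
set -e

# Hypothetical image name; GIT_SHA would come from the CI environment.
IMAGE="eu.gcr.io/example-project/app:${GIT_SHA}"

# Step 1: build the candidate image.
docker build -t "$IMAGE" .

# Step 2: scan it; exit non-zero if anything above "low" severity is
# found, which (via set -e) fails the build before the push step runs.
trivy image --exit-code 1 --severity MEDIUM,HIGH,CRITICAL "$IMAGE"

# Step 3: only reached when the scan passed.
docker push "$IMAGE"
```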
I hear you ask, "Well, yes, but what if you find a vulnerability that wasn't there three months ago when you last updated your image?" This is both a proactive and a reactive measure. That situation doesn't arise, because we deploy multiple times a day and because we are running a monolith. Every time any change to our code happens, a new image is pushed, and all of our services, although they have different configurations, will always be running on the same image. For databases and other applications, we make sure to check our images on a regular basis and update versions and so on.
Security is not a point in time but an ongoing journey, a never-ending one, and that's important to keep in mind. All of the things that we've done don't mean that we're all of a sudden secure; they just mean we are more secure. We should all think of security as a way to make it so hard for attackers to get in that they would have to use so many resources they won't bother, rather than saying, "Yes, we're secure. We can sleep quietly at night." You can never sleep at night in this job.
You can use open source software and achieve a good level of security. It's a certain amount of work, but it can be done, and we like a challenge because we're all engineers. I want to leave you with the fact that we really need to challenge the PCI DSS status quo. It's really hard for organizations to become compliant, especially organizations that are not in the fintech environment, and that can sometimes make them fail to reach the market.
For us specifically, it was really hard to become compliant, and one of the things we dealt with was the educational process we had to undertake. Our auditor initially said that he had knowledge of containers and Google Cloud services, and it turned out that he didn't know that much, so we spent a lot of time educating him, and this shouldn't be our job. It should be the job of the people who train the QSAs.
If you have a similar setup or if you're looking to go in this direction, please come and talk to me, and hopefully together we can find a way to challenge PCI DSS and make it better for everyone. I have, as promised, some resources if you are interested in reading more or checking your clusters, and I will make sure to make the slides available.