Mobile Monitoring Solutions


Presentation: Software Supply Chain Management with Grafeas and Kritis

MMS Founder
MMS Aysylu Greenberg

Article originally posted on InfoQ. Visit InfoQ

Transcript

Greenberg: I’m going to talk about two really exciting projects called Grafeas and Kritis. Hopefully, you’ll perk up a little bit and enjoy the talk. I’m Aysylu Greenberg, I’m a senior software engineer working at Google. I’ve worked on a variety of infrastructure projects such as search infrastructure, developer tools infrastructure, and drive infrastructure for Google Drive. Currently, I work on cloud infrastructure. I’m the eng lead of the two projects that we’ll talk about today. They’re both open-source projects and they’re called Grafeas and Kritis. I’m also on Twitter, @Aysylu22. If you’re wondering, 22 is not my age.

Let’s dive into this. Today the talk will be in four parts. First, we’ll talk about software supply chain management and make sure that we’re on the same page about what this means. Then we’ll talk about Grafeas and Kritis and how they fit into the software supply chain. Next, we’ll talk about the upcoming release, 0.1.0, which we’re hoping to ship this quarter, and the future of the projects. Let’s get started.

Software Supply Chain Management

As you may have guessed, Google runs on containers, and every week we deploy over two billion containers. We have a pressing need to understand what happens to those containers, what happens to the code that gets deployed and runs, and where the code that we just wrote is. We need a lot of observability around it. Software supply chain management is very similar to the food supply chain. Just like with food, you plant the seeds, you grow them, you harvest them, you make some food, and you deliver it to the dinner table. In the software supply chain, you write some code, then you as a developer check in the code, then you build the image, the containers, the binary, and you test and verify it. Often, this is automated using CI pipelines.

Then you will run some QA testing on it – it could be manual, it could be automated with canary services. Then finally, you’ll deploy it to production. This is often automated using continuous delivery pipelines. Just like with food, we often wonder: where does this food come from? What country? Is it organic? Is it vegan, is it gluten-free? Just as we’re meant to ask all these questions about food, with software we’ll be asking: what happens to the code from the time it’s written and submitted to source control to the time it’s deployed?

What about third-party dependencies? We know even less about them. We just rely on them to work. Are there any vulnerabilities inside them? Can we trust the code that we depend on? What is the overall chain from the time that we add it as a dependency to the time it gets deployed? Are we compliant with regulations, and so on?

We need a central governance process that doesn’t slow down development, the developer velocity. We also need visibility into everything that happens from the time code is written or a dependency is added to the time that we get to production. We use CI/CD pipelines to automate a lot of this, but we also need observability tools around them: not just testing and deploying, but also how it is all related. Who submitted that code? Why do we believe we can trust this build? And so on.

This is where Grafeas and Kritis come in, and the open ecosystem that we’re building around them. We’ll have an engineer who builds and deploys code, and she sends it to the CI/CD pipelines. They do the secure build process, automated tests, scanning for vulnerabilities and analysis, and then the code is deployed. There are a lot of different vendors that offer good solutions for CI/CD pipelines. The problem is that we want a centralized knowledge base for this information, so regardless of which vendors you are using, it would be good to have open metadata standards that define what it means to have build metadata and test metadata. That’s where the Grafeas project comes in. It acts as a centralized metadata knowledge base; it has information about vulnerabilities in your artifacts, build information, and so on.

For deploy checks, we want to make sure that they pass based on our policies. We would rather codify our policies in config so we can control and review the changes as they happen. That’s where Kritis comes in. Kritis is an admission controller: when you’re deploying to Kubernetes specifically, it runs the policy checks that your cluster admin defines, then denies the pod launch if it finds very severe vulnerabilities in your image or it doesn’t trust the image location. If everything is good, it will deploy to production. How many of you use Kubernetes to deploy? Two-thirds of the room. You’re in the right talk; we’ll talk a lot about Kubernetes because that’s what Kritis does: deploy-time checks and policies.

Kritis itself doesn’t store vulnerability information; it just does the policy checking. It has a lot of logic in it, and it talks to the Grafeas metadata API to actually find out, given this container, what vulnerabilities are in it and what severity they have; given this container, where did it come from and do we trust it? That’s how Grafeas and Kritis fit into the overall software supply chain.

These projects have existed for a while and they’re not just toy projects. They are actually being used in production: Google has internal implementations of them that are available on Google Cloud Platform. Grafeas is available as Container Registry vulnerability scanning, and Kritis is available as Binary Authorization. They are being used internally and there are internal implementations of this, so we know that it works in production.

Kritis

We talked about software supply chain management. Now let’s talk more about Grafeas and Kritis and dig into the details of them.

Kritis was developed open source first. All the code and all its history are available on GitHub under the grafeas/kritis repository. In the software supply chain, it fits right at the very end, at deploy time. When you’re deploying to production, it will verify the deployment against the policies that you have, then choose to deploy it or reject the deploy request.

Let’s walk through an example. Since two-thirds of you are familiar with Kubernetes, a lot of these concepts will be familiar. If you’re not, don’t worry, because we’ll walk through this at a pretty high level with some of the details that are interesting to know about how Kritis works. Imagine we’re deploying an eCommerce website, so we’ll run kubectl apply with the site YAML, which actually defines our pod with the image that we’re deploying. Overall, it looks like sending a request to Kubernetes. Kritis is installed inside your Kubernetes cluster; we install it using the helm install command. Now it’s just running inside your Kubernetes cluster.

The request comes in, and an admission request is sent for the pod. Then it gets picked up by the webhook, which is implemented by Kritis, and it reviews this request against a set of policies that we define. Imagine that we have an image security policy. This kind of policy would basically say, “Make sure that the container in the pod that we are trying to launch has had vulnerability checks and satisfies the policy that it doesn’t have any severe vulnerabilities in it.” We’ll define a policy for prod and one for QA. If you were in Katarina’s [Probst] talk, she talked about namespaces, so it’s namespace scoped.

Now we’ll go through the image security validator, which actually validates the request against the policy. Then we’ll fetch the metadata from the Grafeas API, which has some sort of database backing it; we’ll talk more about the pluggable backend storage for Grafeas in a few moments. Imagine the image was built and pushed, but no vulnerability scan was done for it yet. We want to make sure that the container has actually gone through vulnerability scanning, so we’ll reject this pod and not launch it. Some time passes, vulnerability scanning has had a chance to catch up and inspect the container. Then it will say, “I found some CVE, some vulnerability.” This is how we refer to classes of vulnerabilities that we find; there’s an open database of different CVEs. CVE stands for common vulnerabilities and exposures, and they’re basically identified by the year they were found and a number attached to that. We found a vulnerability in our database, so then we can fetch it. The Grafeas API will fetch it from the database and return it, and then again we won’t launch the pod, because we have a vulnerability in it.
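To make that check concrete, here is a minimal Go sketch of the kind of filtering such a validator could perform once vulnerability occurrences come back from the metadata API. This is not the actual Kritis code; the type names, severity ordering, and CVE identifiers are all illustrative.

package main

import "fmt"

// Severity is a simplified ranking of how bad a CVE is.
type Severity int

const (
	SeverityLow Severity = iota
	SeverityMedium
	SeverityHigh
	SeverityCritical
)

// VulnerabilityOccurrence is a trimmed-down view of what a Grafeas-style
// metadata API returns for a scanned container image.
type VulnerabilityOccurrence struct {
	CVE      string // e.g. "CVE-2019-0002"
	Severity Severity
}

// violations returns the occurrences that would cause the pod to be rejected:
// anything above the allowed severity that the admin has not whitelisted.
func violations(maxSeverity Severity, whitelist map[string]bool, occs []VulnerabilityOccurrence) []VulnerabilityOccurrence {
	var bad []VulnerabilityOccurrence
	for _, o := range occs {
		if whitelist[o.CVE] {
			continue // the admin has declared this CVE irrelevant for the app
		}
		if o.Severity > maxSeverity {
			bad = append(bad, o)
		}
	}
	return bad
}

func main() {
	whitelist := map[string]bool{"CVE-2019-0001": true}
	scanResults := []VulnerabilityOccurrence{
		{CVE: "CVE-2019-0001", Severity: SeverityHigh},     // whitelisted, ignored
		{CVE: "CVE-2019-0002", Severity: SeverityCritical}, // blocks the pod
	}
	if bad := violations(SeverityMedium, whitelist, scanResults); len(bad) > 0 {
		fmt.Println("deny pod admission:", bad)
	} else {
		fmt.Println("admit pod")
	}
}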

Then we’ll actually inspect this. Vulnerability analysis is a very hard problem, and often it’s better to have false positives than false negatives – it’s better to be safe than sorry. It’s better for the scanner to find a vulnerability and have you say, “Actually, it doesn’t apply to me,” than for it to miss a serious vulnerability in your container. So we’ll say, “Actually, this vulnerability doesn’t apply to me because of something that I know about my application,” and then we whitelist it.

After we whitelist it, we will admit the pod. Now we have our application running, our eCommerce website is up and running, so let’s scale it up. We’ll scale it with a kubectl scale command to get four replicas. The second replica comes up – ok, good, we’re waiting for the third and fourth to be launched. Then what happens is that a new vulnerability is found. Vulnerability scanning, whether you pay a vendor or do it yourself, is constantly updating, because new vulnerabilities are found on a daily basis, and you’re constantly checking whether they affect your container or not. Does that mean that we can’t launch the third and fourth replica and we’re just stuck? We don’t want that. Because we just confirmed that this pod is fine to run, we just want to scale up our application and then figure out what happened, without disrupting our eCommerce website. What we do here is, Kritis is very clever about using attestations in this case.

Taking a step back: the first time we admit the pod, we say, anytime we admit the image, we’re going to record that we admitted this image. Then anytime you want to scale up, or when the pods get restarted because things happen, it will always get admitted, as opposed to being blocked as soon as vulnerability scanning gets updated. Kritis has an attestor inside it that you specify using the attestation authorities you configured. It will write an attestation through the Grafeas API, which will store it in the database. Anytime a new pod comes up and we find a vulnerability, we just retrieve the attestation from Grafeas and say, “But I did say that this image is admitted, so continue scaling up.” That’s how we are able to scale up, and then later we’ll inspect it.
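A rough Go sketch of that attestation-first decision might look like the following; the store interface, function names, and image names are made up for illustration rather than taken from the real Grafeas client.

package main

import "fmt"

// AttestationStore abstracts the two Grafeas calls this logic needs: look up
// an existing attestation for an image, or record a new one.
type AttestationStore interface {
	HasAttestation(image string) bool
	WriteAttestation(image string)
}

// memStore is an in-memory stand-in for a real Grafeas backend.
type memStore struct{ attested map[string]bool }

func (m *memStore) HasAttestation(image string) bool { return m.attested[image] }
func (m *memStore) WriteAttestation(image string)    { m.attested[image] = true }

// admit decides whether a pod's image may run. Previously attested images are
// admitted immediately, so scaling and restarts keep working even after new
// vulnerabilities are published; otherwise the policy check runs and, on
// success, an attestation is recorded for next time.
func admit(store AttestationStore, image string, passesPolicy func(string) bool) bool {
	if store.HasAttestation(image) {
		return true
	}
	if !passesPolicy(image) {
		return false
	}
	store.WriteAttestation(image)
	return true
}

func main() {
	store := &memStore{attested: map[string]bool{}}
	clean := func(image string) bool { return true } // the scan finds nothing today

	fmt.Println(admit(store, "shop/frontend:v1", clean)) // true: passes policy, gets attested

	// Tomorrow a new CVE is published and the same image would fail the check...
	dirty := func(string) bool { return false }
	fmt.Println(admit(store, "shop/frontend:v1", dirty)) // true: attestation found, can scale up
	fmt.Println(admit(store, "shop/backend:v1", dirty))  // false: no attestation, fails policy
}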

Now, do we not look at new vulnerabilities ever? That would be bad. What if a Heartbleed-style bug is discovered? It’s a vulnerability you want to know about. We have a background cron job that inspects the running pods periodically, and it’s able to say, “For all the running pods, I’m going to check them against the image security policy.” Then it adds labels and annotations to mark that a pod no longer satisfies the policy even though it’s been admitted. Then the cluster admin can react to that.
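Sketched in Go, that background check could be structured like this, with a hypothetical Cluster interface standing in for the real client-go calls and a made-up label key:

package main

import "fmt"

// Pod is a minimal stand-in for the Kubernetes pod objects a real checker
// would read through client-go.
type Pod struct {
	Name  string
	Image string
}

// Cluster abstracts the two operations the background check needs:
// listing running pods and labeling a pod.
type Cluster interface {
	RunningPods() []Pod
	Label(podName, key, value string)
}

// reconcile re-checks every running pod against the image security policy and
// labels the ones that no longer satisfy it, so the cluster admin can react.
// The label key here is invented for the sketch.
func reconcile(c Cluster, stillSatisfiesPolicy func(image string) bool) {
	for _, p := range c.RunningPods() {
		if !stillSatisfiesPolicy(p.Image) {
			c.Label(p.Name, "policy-violation", "true")
		}
	}
}

// fakeCluster lets the sketch run without a real cluster.
type fakeCluster struct{ pods []Pod }

func (f *fakeCluster) RunningPods() []Pod { return f.pods }
func (f *fakeCluster) Label(pod, k, v string) {
	fmt.Printf("labeling %s with %s=%s\n", pod, k, v)
}

func main() {
	c := &fakeCluster{pods: []Pod{{Name: "shop-1", Image: "shop/frontend:v1"}}}
	newScanFailsImage := func(image string) bool { return false } // pretend a new CVE now applies
	// A real checker would run this on a timer; one pass is enough to show the idea.
	reconcile(c, newScanFailsImage)
}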

Let’s talk a little bit about the terminology of Kritis. It uses the Grafeas metadata API, as we saw, to retrieve vulnerability information and to store and retrieve attestations for already-admitted images. It also uses custom resource definitions (CRDs). They are extensions of the Kubernetes open-source API and they are used to store enforcement policies as Kubernetes objects. They’re really cool in how they work, and that’s what allows Kritis to run seamlessly inside your Kubernetes cluster. We’ll take a look at the definitions of the policies in a moment. Another thing that Kritis uses is a validating admission webhook, which is basically an HTTP callback that receives admission requests and then decides whether to accept or reject each request while enforcing custom admission policies.
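As a sketch of what such a webhook looks like in Go, here is a skeletal handler. It uses hand-rolled, trimmed-down AdmissionReview structs rather than the official k8s.io/api/admission types, and the policy check itself is stubbed out, so treat it as a shape rather than working Kritis code.

package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// These structs are a simplified version of the Kubernetes AdmissionReview
// payload; a production webhook would use the official admission API types.
type AdmissionReview struct {
	Request  *AdmissionRequest  `json:"request,omitempty"`
	Response *AdmissionResponse `json:"response,omitempty"`
}

type AdmissionRequest struct {
	UID    string          `json:"uid"`
	Object json.RawMessage `json:"object"` // the pod being created
}

type AdmissionResponse struct {
	UID     string `json:"uid"`
	Allowed bool   `json:"allowed"`
	Message string `json:"message,omitempty"`
}

// podPassesPolicies is where Kritis-style logic would check the pod's images
// against the configured policies via the Grafeas metadata API; stubbed here.
func podPassesPolicies(pod json.RawMessage) (bool, string) {
	return false, "image has unresolved critical vulnerabilities"
}

// handleAdmission is the HTTP callback the API server invokes for each pod
// creation: decode the review, decide, and send the verdict back.
func handleAdmission(w http.ResponseWriter, r *http.Request) {
	var review AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "malformed admission review", http.StatusBadRequest)
		return
	}
	allowed, msg := podPassesPolicies(review.Request.Object)
	review.Response = &AdmissionResponse{UID: review.Request.UID, Allowed: allowed, Message: msg}
	review.Request = nil
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(review)
}

func main() {
	http.HandleFunc("/admit", handleAdmission)
	// The API server only calls webhooks over HTTPS, so a real deployment needs TLS.
	log.Fatal(http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", nil))
}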

Here’s what a generic attestation policy looks like. We have a CRD with the kind GenericAttestationPolicy, where we define separately what that means, and then we have a name for it so we can distinguish it from all the other policies that we might have, because we might have many different types of policies for compliance. Then we have the spec, which says, “These are the attestation authorities I trust.” Now the cluster admin can say, “If I verified this image myself, then just trust it anytime we launch it.”

Attestation authorities look like this. Again we have the kind, AttestationAuthority, then we give it a name, and then we have some private and public key information. Don’t worry about the note reference, it’s an implementation detail. The most important part is that we have the private key and the public key data stored in there, so we can show proof that the image has been admitted by the right person.

Image security policy is what we looked at in the example. It has the kind ImageSecurityPolicy, a name, and then a whitelist: if you are running the nginx image, just whitelist it – we don’t care about the vulnerabilities in there, we trust it to run correctly. Then the maximum severity we’re willing to tolerate is medium, so anything above that we’ll just reject right away. Then we whitelist some specific vulnerabilities, saying, “We know this doesn’t affect us, so it’s ok, we can keep running.”
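Putting the three policy kinds side by side, here is a hedged Go sketch of their rough shapes as they might look after unmarshalling the CRDs; the field names are illustrative, and the actual schemas in the Kritis repository may differ.

package main

import "fmt"

// GenericAttestationPolicy: "admit anything attested by these authorities."
type GenericAttestationPolicy struct {
	Name                   string
	AttestationAuthorities []string
}

// AttestationAuthority: who may sign attestations, and with what keys.
type AttestationAuthority struct {
	Name          string
	NoteReference string // where the attestation note lives in Grafeas
	PrivateKey    string
	PublicKeyData string
}

// ImageSecurityPolicy: the vulnerability rules from the walkthrough.
type ImageSecurityPolicy struct {
	Name            string
	ImageWhitelist  []string // e.g. an nginx image trusted unconditionally
	MaximumSeverity string   // e.g. "MEDIUM"
	WhitelistCVEs   []string // CVEs known not to affect this application
}

func main() {
	policy := ImageSecurityPolicy{
		Name:            "prod-image-policy",
		ImageWhitelist:  []string{"nginx"},
		MaximumSeverity: "MEDIUM",
		WhitelistCVEs:   []string{"CVE-2019-0001"},
	}
	fmt.Printf("%+v\n", policy)
}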

Grafeas

We talked about Kritis, now let’s talk about Grafeas and what it does. Grafeas was also developed open-source first; all the commit history is on GitHub if you’d like to take a look. Where does it fit in the software supply chain? It represents all of the different steps. It’s specifically meant to be a universal metadata API, so it can store information about the source code, the deployments, who submitted code and when, the test results, and so on. It’s able to represent every single stage in the software supply chain.

You’ve heard me say “artifact metadata API” a lot of times, so let’s unpack that a little bit. Artifacts are images, binaries, packages – any of these we’ll just call artifacts: files that are generated as outputs of your build process, for instance. Metadata is build, deployment, or vulnerability information – anything that you care to represent and keep track of in your software supply chain. And the API allows you to store and retrieve metadata about artifacts.

Let’s talk a little bit about the terminology of Grafeas and how we represent and think about this. Notes are high-level descriptions of types of metadata. For instance, we looked at CVEs, common vulnerabilities and exposures; those are represented as vulnerability notes. For every vulnerability that we know about from the open databases, we store a vulnerability note. Occurrences are instances of those notes in a specific artifact. Say you found a vulnerability in an image; you store it as an occurrence of that vulnerability. We also think about providers and consumers, because that allows you to rely on third-party providers to do some analysis for you, and then you just read those results.

Let’s take a closer look. We have Grafeas in the middle, and the provider would be vulnerability scanning. Say you pay a vendor to look through your containers and tell you what vulnerabilities you have. They’ll store vulnerability notes for all the vulnerabilities that are known out there. They’ll also look through your container, tell you what vulnerabilities were found in those images, and store the occurrences for those containers. Kritis would be a consumer in this case: all it does is read the vulnerability occurrences for the container and then decide what to do with that. It doesn’t reason about how bad a vulnerability is; all of that is done by the vulnerability scanning provider. That information is stored through the Grafeas API, from which the consumer can retrieve it.
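Here is a small Go sketch of that provider/consumer split, with a toy in-memory stand-in for the Grafeas API; the method names and resource URL are invented for the example.

package main

import "fmt"

// Note: a high-level description of one kind of metadata, e.g. one entry per
// known CVE. The provider (the vulnerability scanner) creates these.
type Note struct {
	ID          string // e.g. "CVE-2019-0002"
	Kind        string // "VULNERABILITY", "BUILD", "DEPLOYMENT", ...
	Description string
}

// Occurrence: an instance of a note found in a specific artifact, identified
// by a resource URL. Providers write these; consumers like Kritis read them.
type Occurrence struct {
	NoteID      string
	ResourceURL string // e.g. a container image digest URL
	Severity    string
}

// metadataStore is a toy stand-in for the Grafeas API.
type metadataStore struct {
	notes       map[string]Note
	occurrences []Occurrence
}

func (s *metadataStore) CreateNote(n Note)             { s.notes[n.ID] = n }
func (s *metadataStore) CreateOccurrence(o Occurrence) { s.occurrences = append(s.occurrences, o) }

func (s *metadataStore) OccurrencesFor(resource string) []Occurrence {
	var out []Occurrence
	for _, o := range s.occurrences {
		if o.ResourceURL == resource {
			out = append(out, o)
		}
	}
	return out
}

func main() {
	store := &metadataStore{notes: map[string]Note{}}
	image := "https://gcr.io/shop/frontend@sha256:abc123" // made-up resource URL

	// Provider side: the scanner records what it knows and what it found.
	store.CreateNote(Note{ID: "CVE-2019-0002", Kind: "VULNERABILITY", Description: "buffer overflow in libfoo"})
	store.CreateOccurrence(Occurrence{NoteID: "CVE-2019-0002", ResourceURL: image, Severity: "CRITICAL"})

	// Consumer side: Kritis just reads occurrences and decides what to do.
	for _, o := range store.OccurrencesFor(image) {
		fmt.Println("found", o.NoteID, "with severity", o.Severity)
	}
}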

A couple of other terms that are useful: resource URLs are identifiers for artifacts in occurrences that uniquely identify a component within your software supply chain. For instance, Debian packages, Docker images, or generic files will have some sort of resource URL associated with them that you can refer to throughout your system.

We also have kind-specific schemas, which are strict, very structured schemas. They allow us, first of all, to represent the information across all the different vendors in a uniform way. It doesn’t matter if you’re using one continuous integration pipeline vendor and then you switch to another one, you can still represent the metadata. Or if you’re using different vendors for CI/CD pipelines and vulnerability scanning, you can represent it all using Grafeas schemas, using the Grafeas metadata kinds.

For instance, a deployment note will just have a resource URL inside it to represent what is being deployed, and then the occurrence will have the user email of whoever on the team deployed this, the deploy time, the undeploy time, and the resource URI it’s attached to. No matter what delivery system you’re using, any of them can represent it in this way. This is really meant to be an open metadata standard.
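As a sketch, the deployment kind described here might be modeled like this in Go; the real Grafeas kind-specific schema has more fields, so the names are only illustrative.

package main

import (
	"fmt"
	"time"
)

// DeploymentNote describes what can be deployed.
type DeploymentNote struct {
	ResourceURL string
}

// DeploymentOccurrence records one concrete deployment of that resource.
type DeploymentOccurrence struct {
	ResourceURI  string     // the artifact this deployment is attached to
	UserEmail    string     // who on the team deployed it
	DeployTime   time.Time
	UndeployTime *time.Time // nil while the deployment is still running
}

func main() {
	occ := DeploymentOccurrence{
		ResourceURI: "https://gcr.io/shop/frontend@sha256:abc123", // made-up resource URI
		UserEmail:   "dev@example.com",
		DeployTime:  time.Now(),
	}
	fmt.Printf("%+v\n", occ)
}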

If you’re interested in contributing to Grafeas, let’s talk a little bit about the architecture and how we think about the development of the project going forward. We have the Grafeas API in the middle, in green; it provides the note and occurrence kinds, the schemas for them, and also the API methods to store and retrieve them. Below it is the database, the storage backend. Those bindings will live in separate projects; they’re not part of the API itself. For instance, we provide a Postgres backend as an example, but if your team is using MongoDB or MySQL or prefers something else – internally we use Spanner for the internal product – any database you want, the Grafeas API can be backed by it. If you’d like to contribute one, I’m very happy to hear about it and accept it, and it will live outside of the Grafeas project itself. Then clients are used to store and retrieve notes and occurrences; they’re provided by the Grafeas team, currently in a separate GitHub project because, again, they’re not part of the Grafeas API itself. Then the access control system will be provided as part of the core Grafeas project, because we care about strong access controls.
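A minimal Go sketch of the pluggable-storage idea, assuming a much smaller interface than the real Grafeas storage contract: the server talks to the interface, and Postgres, MongoDB, Spanner, or an in-memory store can sit behind it.

package main

import (
	"errors"
	"fmt"
	"sync"
)

// Storage is the seam between the Grafeas-style API server and its backend.
// A Postgres, MongoDB, or Spanner implementation would satisfy the same interface.
type Storage interface {
	CreateNote(id string, payload []byte) error
	GetNote(id string) ([]byte, error)
}

// memStorage is the simplest possible backend, useful for local experiments.
type memStorage struct {
	mu    sync.Mutex
	notes map[string][]byte
}

func newMemStorage() *memStorage { return &memStorage{notes: map[string][]byte{}} }

func (m *memStorage) CreateNote(id string, payload []byte) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, exists := m.notes[id]; exists {
		return errors.New("note already exists")
	}
	m.notes[id] = payload
	return nil
}

func (m *memStorage) GetNote(id string) ([]byte, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	payload, ok := m.notes[id]
	if !ok {
		return nil, errors.New("note not found")
	}
	return payload, nil
}

func main() {
	var store Storage = newMemStorage() // swap in a Postgres-backed implementation here
	_ = store.CreateNote("CVE-2019-0002", []byte(`{"kind":"VULNERABILITY"}`))
	note, _ := store.GetNote("CVE-2019-0002")
	fmt.Println(string(note))
}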

To sum up, Grafeas is an open artifact metadata standard. We’ve had contributions from the industry, from various partners. It’s used to audit and govern your software supply chain without slowing down your development process: you throw in all the different metadata you care about, and then you’re able to go back and look at what happened throughout the whole process. It’s a knowledge base for all your artifact metadata. We specifically focus on hybrid cloud solutions, so that you can use it across on-premises and cloud clusters.

Finally, it’s an API with pluggable storage backend, so it doesn’t matter what your team is most familiar with in terms of storage backend. You can implement the bindings against the API and it would work well, so it’s very universal. If you’d like to ask any questions about Grafeas, we have a Google group, Grafeas-users. If you’d like to contribute, we have a Grafeas dev Google group. I call meetings periodically for us to get together as a community, discuss future releases, discuss prioritization. If you’re interested in contributing, please join and also, we have a Twitter account that we monitor actively, @Grafeasio if you have any questions.

Kritis & Grafeas 0.1.0

We talked about Kritis and Grafeas and how they fit into the software supply chain. Let’s talk about the upcoming release, which I’m very excited about: 0.1.0. What will we add there? It’s coming very soon – we’re hoping to release it in Q2 – and there are three goals for it. The first one is to enable users – you – to start experimenting with Kritis and Grafeas on your desktop or laptop, and to be able to do it on-premises, so that we can gather more community feedback and move it towards a hybrid cloud solution. We really want to make sure that you can run Grafeas and Kritis anywhere, regardless of whether it’s on-premises or combined with any of the cloud providers. It’s meant to be an open standard for the industry. Once you are able to experiment with it, we would love to gather feedback from the community, because we would like the community’s help to prioritize all the necessary features so that it continues to be most useful for the industry.

The scope is to have standalone Kritis on Kubernetes with standalone Grafeas: to bring up Kritis inside the Kubernetes cluster with a standalone Grafeas server, which talks to Postgres, also standalone, on your laptop. Then there are two user journeys that we think about. What can you do with this? One is seeing how a container is deployed to the Kubernetes cluster, and the other is seeing how a container that shouldn’t be deployed, because it violates some policy, is actually blocked from being deployed – that way you know it actually works.

The features that we are going to add to Grafeas are a Helm chart, to be able to bring it up as part of Kubernetes and publish the image; a standalone Grafeas server with a Postgres storage backend; and basic support for a Go client library. Of course, we know that many of you might be using other languages like Python and Java, and we should definitely talk about how to prioritize them, but so far the community feedback we’ve gotten is that Go client support is what’s most needed by the people who voiced their preferences. First we’ll provide a good experience with the basic client library in Go and then expand to other languages; contributions, of course, are very welcome here.

For Kritis, we are adding generic attestation policies, so that as a cluster admin you can just say, “This image is good. Just deploy it and trust it,” which simplifies some things as you are figuring out what to do for vulnerability scanning. We’re also providing a default fallback policy, so that if you don’t have any policies defined, the behavior is well behaved and well defined. Finally, we’re making Kritis configurable – again, to ensure that hybrid cloud support is feasible and it’s easy to use.

If you’d like to learn more and follow along, please take a look at the GitHub repositories for Grafeas and Kritis, and join the Google groups that we have for Grafeas and Kritis users. If you’re interested in contributing, please join grafeas-dev, where we’ll have more information relevant to developers. We are also online on Twitter @grafeasio.

I will end this talk with a couple of questions. How many of you see a potential use for Grafeas and Kritis in your use case? I’m seeing a few hands. And how many of you are interested in contributing to Grafeas or Kritis? A couple of hands. We welcome all your contributions. The goal is to develop this with the industry and make this useful for the whole industry. The community feedback is very important, because some things that I think are important might not be as high priority to other teams and companies. Let’s get together and build the most useful thing and build the open standards for this.

Questions and Answers

Participant 1: It’s about the structure that you have shown us. Which component is the vulnerability scanner in this structure? Can you use any scanner, or is it something inside the architecture? Can I use Nessus or something like that? Or is it something inside these components?

Greenberg: We are thinking about providing a scanning framework. We do have vulnerability scanners, the proprietary products for vulnerability scanning like on GCP, but we don’t have that right now for Grafeas. Providing a scanning framework where any vendor can plug in and use that, or where you can implement your own because you have certain information about the vulnerabilities that matters to you – that’s in the plans for Grafeas. It’s not implemented yet, but it’s definitely something that we are considering for the future.




Microsoft Releases .NET Core 3.0

MMS Founder
MMS Arthur Casals

Article originally posted on InfoQ. Visit InfoQ

Earlier this week, Microsoft announced the release of .NET Core 3.0 simultaneously at .NET Conf 2019 and on their development blog. The new release includes support for Windows Desktop apps using Windows Forms and Windows Presentation Foundation (WPF), new JSON APIs, support for Linux ARM64, and overall performance improvements. F# 4.7 and C# 8.0 are also featured as part of this release.

In this new version, .NET Core completely supports the development of Windows Desktop applications using Windows Forms, WPF, and UWP XAML. This is one of the most important features of this release: Microsoft presented it as the highlight of .NET Core 3 earlier this year, at Microsoft Build Live. Windows Forms and WPF were open-sourced last year, along with the Windows UI XAML Library (WinUI). Since then, the development efforts were focused on ensuring that .NET Framework compatibility was maintained – which included ease of porting from .NET Framework to .NET Core.

While porting desktop applications to .NET Core is an important feature, the new release also includes new templates and tools. The XAML designer in Visual Studio was updated, and it now includes a feature called XAML Hot Reload. This new feature allows a developer to make changes to the XAML code while the application is still running. The Windows Forms designer was also updated, but it is still in preview (available as a separate download for Visual Studio). It is important to note that Windows Forms and WPF applications only work on Windows.

Other important features related to Windows Desktop development concern using and deploying different .NET Core versions. Windows Desktop apps can now be distributed as self-contained applications: they can use their own .NET Core version, independently of the environment in which they are deployed. There is also an option to distribute them as single-file executables, which is important considering that in past releases, desktop applications needed to be launched via the dotnet command. An interesting functionality related to single-file executables is dependency trimming: it removes all assemblies not being used by the application, making the generated file smaller. This functionality, however, is still considered “experimental”: it was showcased at .NET Conf 2019, and generating a single executable while trimming the assemblies from the sample application took a few minutes.

.NET Core 3.0 also includes new JSON APIs targeted at reader/writer scenarios, random access with a document object model (DOM), and a serializer. The new APIs are in line with Microsoft’s plans to remove ASP.NET Core’s dependency on the Json.NET framework. These plans also include creating high-performance JSON APIs, which would ultimately increase the performance of Kestrel (the default web server included in ASP.NET Core templates). According to Immo Landwerth, program manager on the .NET team at Microsoft:

The requirements for the .NET stack have changed a bit since the arrival of .NET Core. Historically, .NET has valued usability and convenience. With .NET Core, we’ve added a focus on performance, and we’ve made significant investments to serve high-performance needs. […] We believe in order to support JSON parsing, we’ll need to expose a new set of JSON APIs that are specifically geared for high-performance scenarios.

A new version of SqlClient (originally part of the System.Data.dll assembly in .NET Framework) was also introduced in the new release. Besides being available in preview as a NuGet package, the new version also features support for Always Encrypted, Data Classification, and UTF-8.

Support for Linux ARM64 comes as part of the IoT development effort. According to Richard Lander, program manager on the .NET Team at Microsoft:

We added support for Linux ARM64 in this release, after having added support for ARM32 for Linux and Windows in the .NET Core 2.1 and 2.2, respectively. While some IoT workloads take advantage of our existing x64 capabilities, many users had been asking for ARM support. That is now in place, and we are working with customers who are planning large deployments.

Finally, the details on all performance improvements can be found here. These improvements include making the Garbage Collector (GC) use less memory by default (by making the heap sizes smaller) and reducing the .NET Core SDK to 25%-30% of its original size (on-disk, depending on the operating system). Other features include support for Docker resource limits and support for TLS 1.3 and OpenSSL 1.1.1 on Linux. C# 8.0 and F# 4.7 are also included as part of the .NET Core 3.0 release due to their significance. The F# Core Library now targets .NET Standard 2.0, and C# 8.0 adds async streams and nullable reference types.

An interesting characteristic of this release is related to the development process. Microsoft describes .NET Core 3.0 as “battle-tested”, since dot.net and Bing.com have been hosted on it for months. This was also noted by the community – as user Manigandham on HackerNews points out:

.Net 3.0 has gone through about a dozen preview releases and the last 4 have a go-live production license. It’s much better tested than the old .NET Framework with its monolithic releases.

According to the official roadmap .NET Core 3.1 Long Term Support (LTS) will be released later this year, in November. .NET Core 3.0 is supported on Windows 7+, Windows Server 2012 R2 SP1+, macOS 10.13+, and different Linux distributions. The latest version of Visual Studio 2019 (16.3 on Windows, 8.3 on macOS) is required for using .NET Core 3.0. Also, Visual Studio App Center was already updated to support applications developed in the new version of .NET Core.



Creating an Online Platform for Refugees Using Lean Startup

MMS Founder
MMS Ben Linders

Article originally posted on InfoQ. Visit InfoQ

What do you do if you want to reach a new user group whose complex needs you need to learn about quickly? At Agile Business Day 2019, Stephanie Gasche shared her experience using lean startup methods to create an online platform, without financial support, for the integration of refugees and asylum seekers.

Gasche started the integration platform “I am Refugee” in early 2017, when she realised that there was no single contact point for a refugee or asylum seeker to receive a holistic, non-political view on integration in Austria:

Every person wishing to live in Austria and to be an active member of society – no matter whether asylum seeker, refugee, expat or immigrant – should be handed the digital services of Was Wie Warum in Österreich (formerly I am Refugee) at the first opportunity in order to better prepare for life in this country. Thanks to all the information found on the website, app and social media, every person should be empowered to have the same chances for personal success in the long-run as someone who has been born in Austria and is thus automatically part of the system.

From 2012 to 2015, Gasche worked as a management consultant in the field of Agile – helping Scrum teams, management and organisations transition towards Agility. She mentioned that lean startup makes the most sense in a complex environment, as its main aspects are customer orientation, fast feedback, continuous learning and improvement, visualisation, small deliverables, pull-principle and interdisciplinary team spirit. Seeing as the integration of refugees and asylum seekers is a highly complex field, this was the only possible approach for “I am Refugee”, she said.

Using Lean Start-Up methods, i.e. the feedback cycle Build – Measure – Learn, was ideal for making quick progress and seeing how the market responded, said Gasche. She mentioned two main differences between the “I am Refugee” initiative and for-profit organisations: the user group of refugees in the digital realm was completely new, and there was no funding.

There were no learnings yet anywhere around the world. After all, there is not just “the refugee”. These are all kinds of humans from all sorts of backgrounds and cultures, with different yet similar needs, said Gasche.

As to the lack of funding, the topic of integration has been a real issue in Austria. Gasche said that “we had to act fast and could not wait to write a business case and present it to a government funding agency and discuss and amend… to then maybe receive the money to start the project a year later.” They had to act quickly, get quick feedback on whether they had hit on a real need, and then build from that. Agile and lean methods were ideal, said Gasche.

A main difference in applying agile for a nonprofit compared to using agile in business organizations is the separation of customers from users. Gasche explained:

If we separate our main stakeholders into customers who pay the bills and users who actually use our digital services, we have to be aware that in the nonprofit sector, the interests of the two are often not the same. Thus, if you look at lean startup and the feedback cycle and how one should pivot when finding out new information about one’s market and users – we would have had to pivot a long time ago. Why? Because the government, city council and also private foundations’ funding – which is what a lot of the nonprofit sector is dependent on – all funded different kinds of integration projects. They supported initiatives that would help refugees find work. Or an apartment. Or German classes. These are all very valuable projects and initiatives, however, there were enough great initiatives out there already. The real problem was that refugees and asylum seekers did not hear about these initiatives, simply for the lack of there not being a single source of integration that would bridge offers with requests. This is the gap that we filled with Was Wie Warum in Österreich – I am Refugee. And since we were unwilling to pivot due to seeing and feeling the need, we did not receive any of the funding.

As the initiative is completely based on volunteer work, the three elements for intrinsic motivation that Dan Pink identified – purpose, autonomy and mastery – are even more true, said Gasche. She mentioned that her friends helped quite a bit for short iterations. However, people who volunteered because they believed in the purpose felt like they could try out something new and learn from it (i.e. facebook advertising); these strangers became friends and stayed on board, said Gasche.

InfoQ interviewed Stephanie Gasche after her talk at Agile Business Day 2019 about the “Was Wie Warum in Österreich – I am Refugee” initiative.

InfoQ: How did you find out what the main problems were that the website iamrefugee.at had to provide solutions for?

Stephanie Gasche: Early 2016, I decided to change my job in order to support the Austrian government in the integration of the many refugees and asylum seekers who had arrived in Austria. Thus, I became the first integration trainer teaching governmental value and orientation courses in Austria. Due to working with refugees every single day, I realised where the real problems lay and what needs the Austrian public as well as government were not addressing: we needed one central point for integration that would get rid of the many misunderstandings and misinformation provided, i.e. in Facebook groups. Since refugees were spread all over Austria – some more remote than others – a digital service would make most sense.

InfoQ: What does “I am Refugee” provide?

Gasche: The platform gives its users the opportunity to learn what the 9 steps towards integration are, why it makes sense to take them and how one can take these steps. It thus enables persons new to Austria to take charge of their own integration process, instead of being pushed to do an integration step that does not make sense to them.

In order to cover as many persons as possible within the diverse user group of “refugees”, we use visualisation and short how-to tutorial videos, and have translated all texts from German into English, Arabic and Dari Farsi. In 2018, we decided to change the name to “Was Wie Warum in Österreich”, as we came to realise that the topic of integration may not only be relevant to refugees.

InfoQ: How have you applied lean startup and agile principles and practices?

Gasche: In addition to the Agile mindset of empowerment and the importance of user centricity that the platform is based on, we also use agile methods to organise ourselves as a team. We use Trello to gather, visualise and prioritise the deliverables, as well as show the progress of the work. From here, team members pull the tasks according to their availability and interest. We used to do one-month iterations with Planning meeting, Review and Retrospectives. However, now we have adapted to regular co-workings, communication via WhatsApp and Show & Tells.

InfoQ: What benefits did you get and what have you learned from using lean startup?

Gasche: So, so much. Apart from having learned a lot about my country Austria, behind-the-scene-politics, the funding world behind startups, and now having a very good overview of integration projects and initiatives happening in Austria, I have learned to work with a truly intercultural team having different intentions, as well as the value of focus in a multidisciplinary team. I can say that whenever we focused on one goal (i.e. getting funding; getting translations ready etc.), we always hit our target by the committed time, because we pushed and supported each other. This was also when the job was the most fun! Especially when working as a distributed team.



Swift 5.1 Brings Module Stability, Opaque Return Types, Property Wrappers and More

MMS Founder
MMS Sergio De Simone

Article originally posted on InfoQ. Visit InfoQ

While module stability is by far the most impactful new feature in Swift 5.1, the latest version of Apple’s language includes a number of new language constructs, such as property wrappers and opaque return types, and a number of standard library extensions.

As you may recall, Swift first reached ABI stability with version 5.0. ABI stability aims to make it possible for code compiled with different Swift compilers to work together, thus bringing some level of binary compatibility to the language in the face of changes to the language syntax. Two major implications of ABI stability are the possibility of not bundling a copy of the Swift runtime with each shipping application, and enabling the distribution of binary frameworks by third-party developers and Apple itself. This goal was accomplished only partially with Swift 5.0: apps did not need to include the Swift runtime anymore starting with iOS 12.2, but developers had to wait until Swift 5.1 and iOS 13 for the creation of distributable binary frameworks to become real.

Apple paid special attention to preventing the requirements of ABI stability from impairing any performance improvements that future, non-ABI-compliant language features might provide. This mechanism is called library evolution in Swift jargon and basically comes down to enabling frameworks to opt-out of ABI stability on a per-type basis. This would make it possible to take advantage of given compile-time optimizations at the expense of binary compatibility. It is also important to notice that non-standalone frameworks, i.e. those that are bundled within an app, are always compiled with the full set of available optimizations, since ABI stability is not relevant for them.

At the language level, Swift 5.1 brings a number of significant new features. Property wrappers aim to generalize the possibility of defining implementation patterns for properties and providing them through libraries. For example, Swift included the lazy and @NSCopying property modifiers, which were embedded into the language itself. The compiler knew how to translate those modifiers into the corresponding property implementation, but this mechanism was not flexible and made the compiler itself more complex. So, Swift 5.1 introduces a special class of generic types that are marked with @propertyWrapper. A property wrapper provides the storage for the property it wraps and a wrappedValue property that provides the actual implementation of the wrapper. For example, this is how the lazy semantics could be implemented using a property wrapper:

@propertyWrapper
enum Lazy<Value> {
  case uninitialized(() -> Value)
  case initialized(Value)

  init(wrappedValue: @autoclosure @escaping () -> Value) {
    self = .uninitialized(wrappedValue)
  }

  var wrappedValue: Value {
    mutating get {
      switch self {
      case .uninitialized(let initializer):
        let value = initializer()
        self = .initialized(value)
        return value
      case .initialized(let value):
        return value
      }
    }
    set {
      self = .initialized(newValue)
    }
  }
}

Opaque return types address the problem of type-level abstraction for return types. This feature makes it possible to return a value whose concrete type is determined by the implementation and not by the caller, as is usually the case with generics. For example, a function’s implementation could return a collection without disclosing which concrete collection type it is. This could be represented by the following code snippet:

protocol Collection {
  func operation1()
  func operation2()
}

protocol ApplicationObject {

  // a collection used by this type
  associatedtype Collection: Collections.Collection

  var collection: Collection { get }
}

struct Invoice: ApplicationObject {
  var collection: some Collection {
    return ...
  }
}

This is similar in nature to what is known in Objective-C as class clusters of which the most compelling example is the class (cluster) NSString.

Swift 5.1 also aims to remove a nuisance when using default values for struct properties. In such cases, the Swift compiler was not able to generate a default initializer that took into account the fact that some properties of the struct had a default value, thus forcing developers to specify all properties in the initializer call. In Swift 5.1, if a struct property has a default value, the compiler will synthesize a default initializer that properly matches those defaults in its signature. For example:

struct Person {
  var age: Int = 0
  var name: String

// default compiler-synthesized initializer:
// init(age: Int = 0, name: String)

}

let p = Person(name: "Bill")

// Swift 5.0 obliged to specify both arguments:
// let p = Person(age: 0, name: "Bill")


A less powerful, although no less useful, new feature in Swift 5.1 is the implicit return in single-expression functions, similar to what was already allowed in closures and in other languages. For example, you can now write:

func sum(_ a: Int, _ b: Int) -> Int { a + b }


Other new language features in Swift 5.1 are key path member lookup, aimed at extending the usual dot syntax to arbitrary names that are resolved at runtime in a type-safe manner, and static and class subscripts, which extend where subscripts can be defined – previously they were only allowed as instance members. Other improvements in Swift 5.1 are found in the Standard Library, including ordered collection diffing (https://github.com/apple/swift-evolution/blob/master/proposals/0240-ordered-collection-diffing.md), contiguous strings, additions to SIMD support, and more.



Presentation: Improving the Quality of Incoming Code

MMS Founder
MMS Naresh Jain

Article originally posted on InfoQ. Visit InfoQ

Bio

Naresh Jain is an internationally recognized Technology & Product Development Expert. Over the last decade, he has helped streamline the product development practices at many Fortune 500 companies like Google, Amazon, JP Morgan, HP, Siemens Medical, GE Energy, Schlumberger, Ford, EMC, CA Technologies, to name a few clients.

About the conference

Agile India is Asia’s Largest & Premier International Conference on Leading Edge Software Development Methods. Agile India is organized by Agile Software Community of India, a non-profit registered society founded in 2004 with a vision to evangelize new, better ways of building products that delights the users. Over the last 15 years, we’ve organized 58 conferences across 13 cities in India. We’ve hosted 1,100+ speakers from 38 countries, who have delivered 1,350+ sessions to 11,500+ attendees. We continue to be a non-profit, volunteer-run community conference.



Presentation: From Developer to Security: How I Broke into Infosec

MMS Founder
MMS Rey Bango

Article originally posted on InfoQ. Visit InfoQ

Transcript

Bango: The reason I did this talk is I really wanted to do it in a developer-centric environment because I wanted to speak with folks who I’ve had this really strong relationship and bond with for almost 30 years. It’s become such an important topic. I think, in many cases, in the security field, we don’t really look at it. We take it for granted that a lot of the things that we do are just going to be secure. As developers, we have a lot of pressures, and those pressures are coming from management, they’re coming from product teams to get the product out. Sometimes, things fall by the wayside. I chose to start trying to address that, in some fashion. This is how I did it, and some of the challenges that I went through, some of the hurdles that I had to face as somebody who didn’t have formal security experience.

Also, I hope this gives you a foundation for maybe getting into security yourself in some fashion, whether it’s just part of your development lifecycle, or whether it’s something that you want to switch to as a full-on career. There are a lot of opportunities there.

My name is Rey Bango, and that’s my Twitter handle – fairly unique, you can tell. I work for Microsoft. When I joined Microsoft, I came from the web developer space. I went in there with all the bright-eyed and bushy-tailed expectations of cross-browser development and making sure developers were all happy and “Hey, Kumbaya,” and all that stuff. I ended up supporting the best browser in the whole entire world – Internet Explorer, it’s awesome. Literally, I was the Internet Explorer evangelist. I don’t even want to call it advocate anymore. I was the evangelist because I had to get IE back into some semblance of love, and it was hard. This was really hard, because I know all of you loved coding for Internet Explorer.

It was a slog, it was really challenging. I had to go and speak with web developers who were very much into open-source, cross-browser development. Some of them were very challenged by the things that Internet Explorer had, some of the quirks. It was rightfully criticized. The developer tooling was challenging; the rendering engine, Trident, God bless it, was good at the time, but it just wasn’t modern. I was trying to find a video of me at one of the events, and I really did find one. This was actually me walking out of one of the web developer events. They set me on fire, it was a rough position. Think of the things that I had to endure. Think about the challenges that I had to face, not only on the web developer side, but also on the infrastructure side, working with enterprise customers, thinking about all the different things that you as professionals have to endure, including security.

Of course, I had to deal with the running jokes. Chrome and Firefox were fantastic browsers; they were modern, they were fast. The rendering engines were amazing. That’s why developers really started moving down in that path and I get it. People want the new shiny, but they also want modern and they want fast. They want features, and these browsers were offering those features. I think we lagged quite a bit behind.

Then we said, “Let’s try to figure this out,” and we switched to Microsoft Edge. It really was way better, but there were still some quirks. Now, thankfully, the team – and I’m no longer on that team – has embraced the Chromium engine. I’m very proud of that team for embracing the Chromium engine. There are a lot of iterations going on. I have the Chromium version of Edge here and I’m really proud of it. I think it’s a great browser, it’s been solid, it’s been consistent. I’m proud of that.

Security

After doing that for so long – maybe eight years – I really started wondering, what’s the next step for me? What are the things that I want to talk with developers about? It’s great to talk about cross-browser development, but after a while that becomes repetitive. I started looking for a new narrative, something that I felt was important, and something that developers would hopefully relate to. That was security. I really felt that there was a need to start talking about how our applications were built from a security perspective.

There were a lot of things going on in the news about vulnerabilities in web apps, whether they’re PHP-based, client-side apps, whatever it might be. There were so many news articles coming out that I felt that we weren’t doing our job to properly secure what they call the new endpoints. Web apps, for all intents and purposes, are the new endpoints. This is what attackers are looking at. During my research and transitioning, it became clear that cloud infrastructures are getting much more robust and secure. If you implement something out in the cloud and you do it the right way, generally it’s going to be solid; there are fewer moving parts to deal with, and you let the infrastructure provider handle a lot of the security. You do have to do some things on your end, but for the most part, once you get it down, it’s pretty solid.

What does that mean? Attackers have to look for new endpoints, whether it’s your APIs, whether it’s the fact that you didn’t sanitize input properly, or maybe it’s cross-site scripting, whatever it might be, but they’re looking at web applications. I remember the first time I saw a web shell appear within an application. I freaked, “How do they do this?” If you’re not familiar with a web shell, think of it like when you open terminal and you can type in commands. Think about doing that within an application – your web application. That freaked me out, I didn’t know that was even possible.

That’s where attackers are now focusing. That’s how these criminals are trying to break into your systems. A lot of times, they’re only compromising it, maybe to install crypto mining. Sometimes they’re doing it for different reasons. You might get somebody who’s a nation-state if you work for a financial institution or government agency. Nation-states are really interested in all the facets of your web applications. They want to know, because that data is really valuable. Has anybody suffered identity theft? I have, it is horrible.

This is why it’s so important to think about the security of your applications, because the impact is greater than you. The impact could be human life. Somebody breaks into your app and manages to steal PII. That’s a big deal. That could mean that somebody’s life now is ruined because that data is out on dark web markets and it’s being used to get everything from loans to just stealing bank accounts. So it’s really important and imperative that we start thinking about security.

When I decided to transition, I decided to go to the conference that had the most black hat hackers in the whole world. That’s how I wanted to learn about security; throw myself into the fire. Isn’t that great? I was really scared to go to this because everything you hear about DEF CON, is, “You’ve got these criminals, and they’re there, and they’re going to hack you if you open your phone – oh, my God.” No, seriously, they recommend you take a burner phone and take a burner laptop, and then take the laptop when you’re done, throw it on the floor, set it on fire. I’m, “Oh, God.”

I did manage to go and it was fun. I tweeted this out as I was getting myself ready, and I really just thought about security from, “Maybe I can do something from a hobby” perspective; get familiar with it, and then try to incorporate it into my work, especially in the application security side. I’ve never thought about security from a professional perspective.

Then WannaCry hit. It’s ransomware that basically encrypts all the files on your PC. Normally, if you personally get hit with ransomware, it’s “All right, big deal.” Your computer’s locked, and it’s maybe bad because you have personal pictures on it; at most, it may be a really horrible inconvenience. If you’re a financial institution, your money is safe, it’s insured. If Chase gets locked, all right, they’re insured. You’re not going to lose your money.

What about human life? This is what impacted me. This is what really was the catalyst for me to start thinking about this as a professional. WannaCry locked the computers of hospitals, and patients were being turned away for dialysis treatment, and chemotherapy, life-saving treatments that they needed. From a personal perspective, this hit me hard because I thought, what if it’s my family? What if it’s one of my children? What would I do? Think about that helplessness that you would feel if this was locked and the doctor says, “Look, I can’t treat your child today,” and that child might need life-saving treatment, and you had to walk away. How would you feel? That’s how I felt.

AppSec is Hard

I really started digging into the security space. The one thing I realized is that, “Yes, application security is really hard.” It is really hard because we have a lot of moving parts. We have a lot of libraries that we use, we have a lot of frameworks, we have a lot of infrastructure. We have to integrate all this into a cohesive thing, and then there’s the human factor. We’re all writing code, and we’re all fallible. I don’t know if any developer, except for myself, writes 100% pure, awesome code. Ok, bad joke.

Seriously, we don’t write perfect code. Invariably, we’re going to make mistakes, and that’s the hard part about it. They say in the cybersecurity world that defenders always have to get it right 100% of the time. Criminals and hackers only need one time, that’s all they need. That’s hard, think about that. How do we do that, especially when we’re trying to work with a security organization that basically thinks this is what we do? This is us, the developers. We just lounge around, getting served stuff, we’re party animals. That’s what the security community thinks we are.

It’s not like that. They don’t understand that we have all these bits of tooling that we need to incorporate into our application to make it successful. Whether it’s the frameworks we reuse or the CI that we have to implement, all these moving parts are critical to the success of modern development workflows. They just don’t get it. Something as simple as saying, “I need to install this,” gets you the look from the security people. When you go in there and say, “Listen, I need admin access on this box, I need to install all these things,” what happens?

Tell me you haven’t been zapped just like this. It’s the God honest truth that security people have this fear of application developers, because we have to have so much permission to do things that security people don’t want. We need to have that access, they want to limit the access to limit their exposure. So who’s right and who’s wrong? It’s neither. Both are doing a job. How do you meet in the middle to actually be successful across the board?

I love this right here, though, because to some extent, I understand what security people are talking about. Developers can be lazy, I'm guilty of that. The number of times security researchers have found passwords or API keys in GitHub is incredible. There are people who will literally scan GitHub repos wholesale to find API keys. If you didn't know that, now you know. You should log into your GitHub account, go back, and start looking through your code, and ask, “Did I leave a password or an API key in there that I shouldn't have?” Seriously, it's really rough.
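To make that concrete, here is a minimal sketch of the kind of scan those researchers run, written in Node.js. The patterns and the file walk are rough illustrations, not a complete ruleset; real tools such as Gitleaks or GitHub's own secret scanning are far more thorough, so treat this purely as a demonstration of the idea.

// scan-secrets.js - a minimal sketch of scanning a checkout for key-looking strings.
// The patterns are rough approximations; real secret scanners use far larger rulesets.
const fs = require('fs');
const path = require('path');

const patterns = [
  { name: 'AWS access key ID', regex: /AKIA[0-9A-Z]{16}/ },
  { name: 'generic secret assignment', regex: /(password|api[_-]?key|secret)\s*[:=]\s*['"][^'"]{8,}['"]/i },
];

// Walk the directory tree, skipping .git and node_modules.
function walk(dir, files = []) {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    if (entry.name === '.git' || entry.name === 'node_modules') continue;
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) walk(full, files);
    else files.push(full);
  }
  return files;
}

for (const file of walk(process.argv[2] || '.')) {
  const text = fs.readFileSync(file, 'utf8');
  for (const { name, regex } of patterns) {
    if (regex.test(text)) {
      console.log(`Possible ${name} in ${file}`);
    }
  }
}

Running something like this against your own checkout (node scan-secrets.js path/to/repo) is a quick way to catch the obvious mistakes before they ever reach a public repo.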

This is why security people get freaked out by developers. We sometimes do these things. On top of that, you have things like the Npm attacks that have been happening recently. These are called supply chain attacks, basically. This is actually really clever, but it’s been predicted for years. The thing that’s happening on a regular basis is that somebody is coming out and creating a really well-liked package, something that people will use on a regular basis.

How many of you have just done npm install package-name without verifying dependencies? I’ve done it. All of us have put ourselves at risk because we did that, we didn’t verify dependencies. We’re making an assumption that all that is great. I think that’s something that in the open-source world is very common. This is why this happens. Because we are to some extent so trusting, it allows us to look at a package name and say, “That looks pretty good. I’m going to include it in my stuff.” But little do we know that somebody could go in there and say, “I’m version 3, I’m going to add that little crypto-mining script,” or, “I’m going to add this script that’s going to steal all the stuff from your Coinbase account.”

That’s exactly what happened here; somebody did a supply chain attack. This was a dependency on some other packages. Thankfully, the Npm team caught this. They’re actually doing a lot of security auditing. Adam Baldwin, who ran the Node Security project, his company was acquired by Npm, and now they have a very dedicated team of talented professionals. They’re auditing codebases, especially on regular and popular dependencies.
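npm itself ships an audit command, which makes this kind of check easy to automate in a pipeline. Here is a minimal sketch that shells out to npm audit --json and fails the build when high or critical issues show up; the JSON layout differs between npm versions, so the field names below are an assumption you should verify against the npm you actually run.

// audit-gate.js - a minimal sketch of failing a CI build on high-severity audit findings.
// Assumes the metadata.vulnerabilities shape produced by npm 6+; check your npm's output.
const { execSync } = require('child_process');

let raw;
try {
  // npm exits non-zero when vulnerabilities are found, so capture stdout from the error too.
  raw = execSync('npm audit --json', { encoding: 'utf8' });
} catch (err) {
  raw = err.stdout;
}

const report = JSON.parse(raw);
const counts = (report.metadata && report.metadata.vulnerabilities) || {};
const serious = (counts.high || 0) + (counts.critical || 0);

if (serious > 0) {
  console.error(`npm audit found ${serious} high/critical issues - failing the build.`);
  process.exit(1);
}
console.log('npm audit: no high/critical issues found.');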

Then this is what I was saying, basically – that this has been known for a while. The problem is that not only has it been known for a while, there’s been a lot of pushback around this. When people push back and say, “We have to be careful about this,” it’s not that people are trying to stop progress. They’re just asking us to think and be thoughtful about the way that we’re building things so that we don’t get burned down the road. I have seen this type of stuff, not in the security space, but in other aspects of web development, where if you push back on anything because you just feel like, “Maybe we should take a couple of minutes to really think it through,” there’s always that comment about, “No. Progress. We need to move forward. Why are you being that guy who’s blocking us?” It shouldn’t be that way. We should be a little bit more thoughtful in the way that we approach things.

I did a search on Npm dependency health. I got this slide, which is great. Look at this, this is real. I'm going to show you why it's real. I said, “I wonder if this is an actual real repo or somebody just faked it.” I actually went to the GitHub repo, went to the dependency graph, and started looking at this for this project. I'm like, “Oh, ok,” and keep going, and keep going. Oh, wait, there are 63 more. The key thing to remember is that each one of these is a dependency for this project. Each one of these is an opportunity to be an attack vector. I know Substack, reputable person, but let's say Substack adds a collaborator, and that collaborator gets trusted privileges to push code and injects something in there that nobody else in the project sees, because a lot of these are one-person projects. What happens then? Downstream projects do an update on their app, that compromised dependency automatically gets pulled in, and now all their customers are at risk.

I don’t want to pick on them, because this is not just them. Here’s a project I worked on called PWABuilder. Let’s look at this. Boom, there’s Mocha, there’s Chai. Oh, wait, there are 39 more. This is a project that I worked on, I’m not picking on anybody. This is an example of a very common type of dependency graph on most modern projects. Even something as simple as a simple project I worked on would have this.

One of the things that I love about GitHub is that they’ve become a little bit more proactive in helping you identify security vulnerabilities. If you didn’t know this, you should really sign up for their security notifications for your projects. This is a silly project that I was working on called Pinteresting, when I was trying to learn Ruby on Rails, and I haven’t updated it in years. The great thing is, I still get security alerts for it. Nobody’s using it, so I’m not worried about it, but it was a great example of how GitHub is being proactive in giving you alerts about security vulnerabilities.

You go through here, and you can see what needs to be updated. For example, Actionpack has a high severity bug. If I drill into it, it’ll tell you what the remediation is, which is great. That’s number one. I think anything you use should help you remediate the issues. Then, it tells you about what the CVEs are, the actual vulnerability reports. It tells you what the problem is. I think the last thing you want is for something to be able to run a remote code execution – kind of a bad thing. So it’s important that all of us start thinking about how these services can help to complement your workflow. Listen, security is tough, and I’m not expecting everybody in this room to all of a sudden become a security expert. Try to supplement the work that you do by leveraging the work of others who are security experts. We’ll talk a little bit more about some services that can help you with that as well.

Stats

I spoke to a company called Snyk, and they gave me some really good stats. This was actually really eye-opening. They asked a bunch of people, especially open-source maintainers, if they audit their code and at what cadence. 26% of them said, “We don't.” Think about this – 26% of open-source maintainers said they don't audit their code. This is the code that you are likely relying on. 21% said at least once a month, which is good, but 10% of them said every couple of years. Every couple of years means that, in the interim, while you're getting these releases, you're just installing whatever is there. You don't know what's being installed.

How many of you have looked at every dependency and audited every piece of code from those dependencies? That’s what I thought. We don’t, because we have too many things on our plate. We have time constraints, we have features, we can’t do this. If you’re not doing it, think about the open-source maintainer that is that one person that’s out there trying to do this. How are they going to go ahead and audit thousands and thousands of lines of code? You hope that maybe they’re doing it from the very beginning and think about it, but many don’t.

The problem is that we have implicit trust. We assume that everything that we install is safe. We love open-source: “Clearly that developer knows what he's doing and is making things safe.” No. A lot of these developers are really good at building JavaScript and Ruby on Rails, and a lot of them are really poor on security, so we have to think about how that plays into it. If you look here, how do you find out about vulnerabilities? This is more, I think, a generalized developer question. 27% said, “I probably won't find vulnerabilities.” That's peachy, that's really reassuring. 36% said, “We use a dependency management scanning tool that notifies us.” At least there are some people who are thinking about that, and I think that number is expanding. I think there's growing awareness around it. A lot of these vendors are actually going out there and being proactive about educating the community about the things that we need to think about.

Dark Reading had a great article about web apps. It said that, on average, each web application that this company tested contained 33 vulnerabilities and 67% of the apps contained critical vulnerabilities such as insufficient authorization errors, arbitrary file upload, path traversal, and SQL injection. The fact that we’re still dealing with SQL injections at this time floors me. Does anybody know how to solve SQL injections? What’s the easiest way to solve SQL injections?

Participant 1: Sanitize inputs.

Bango: That means that somewhere down the line, somebody's not sanitizing an input. Why is that still a thing? Even the frameworks offer that. Here's another one from VeraCode, they polled 400 developers. They found that just 52% update their components when a new vulnerability is announced. Think about that. They don't update their components. That means 48% don't update when a new vulnerability comes out. That's scary. They know it, but they're not doing it. We do have a problem.
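On the SQL injection point specifically, the fix the frameworks offer is parameterized queries instead of string concatenation. Here is a minimal Node.js sketch using the mysql2 package as an example; the package choice and the table and column names are illustrative, not prescriptive.

// queries.js - a minimal sketch of avoiding SQL injection with parameterized queries.
// Uses the mysql2 package purely as an example; table and column names are made up.
const mysql = require('mysql2/promise');

async function findMemberByEmail(email) {
  const conn = await mysql.createConnection({
    host: 'localhost',
    user: 'app',
    database: 'app_db',
  });

  // BAD: string concatenation lets "' OR '1'='1" style input rewrite the query.
  // const [rows] = await conn.query("SELECT * FROM members WHERE email = '" + email + "'");

  // GOOD: the driver sends the value separately, so it is treated as data, not SQL.
  const [rows] = await conn.execute(
    'SELECT id, name, email FROM members WHERE email = ?',
    [email]
  );

  await conn.end();
  return rows;
}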

Web Apps & APIs are the New Attack Endpoints

Web apps and APIs are the new attack endpoints. This is a real thing. I've been speaking to a lot of security people over the last two years as I've worked to transition over. Kevin Johnson wrote the web application pen-testing course for the SANS Institute. I was having a conversation with him, and he said, “The majority of penetration testing engagements I get right now are for application security.” It's because cloud infrastructures are pretty solid. People are getting really good at securing the networks themselves. Endpoint detection systems are getting way better. With AI and machine learning, things are getting way better.

It's not always going to be fail-safe, I'm not saying that, but attackers are now shifting to a different target. They're saying, “How do we go ahead and compromise these systems through web apps?” because it's the human factor; humans are fallible. They're going to make mistakes, that's how it happens, that's how they target it. That's why I would suggest that everybody in this room take a moment and just go look at the OWASP Top 10. It's OWASP's list of the top 10 most common web application vulnerabilities. I'm going to call it application vulnerabilities, I don't even want to say web app.

Do you know what the number one vulnerability is on this latest release? SQL injection. Cross-site scripting is still on there, cross-site request forgery is still on there – all these things we can mitigate as developers. I think a part of that is that we were not really sure how to do it, in many cases. Some of the things are hard. I would urge you to take a moment and go to OWASP, and use that as a reference to get better at securing your applications.

The other thing that’s becoming really important right now is, start thinking differently about the way that you build your applications. In traditional methods, we gather requirements, we do the design, we code, we test, and we release. Then, maybe we’ll think about security around the testing phase, “Yes, we’ll test it out. We’ll get somebody to manually do input testing stuff like that.” That’s great. That really doesn’t do much. You might catch a couple of things, but what you want to do is shift left.

Tanya Janca is my teammate, she’s awesome. She’s one that you should follow as well. She’s really strong on application security. She’s into DevOps. I would say follow her, and I’ll give you her information later on. This was her slide, and I took it from her, I pilfered it – I social engineered it. Basically, we should be thinking about security from the requirements stage. We should be shifting left, and thinking about security from the beginning, and moving forward. As a new feature comes out, we should be thinking about how security comes in. The problem is that we have a disconnect between the way that we as developers work, and how security professionals think. We don’t really talk the same language in many cases, and that’s hard. The communication has always been a challenge between these two, and I call them complementary fields.

Security Champions

The biggest thing that's happening right now is the notion of security champions. This could be a good opportunity for you to start shifting into security: for you to go to your organization and say, “Listen, I'm really interested in security. I want to make our app secure. I'd like to be a security champion and be able to work with the security folks, and have that good dialogue, and that feedback chain going across the board.” You would be the person who's responsible for having those communications at the requirements stage, and ongoing conversations about how security can be best implemented into your applications.

How do you enable developers to have a really strong workflow, while still keeping it secure? This is where a security champion comes in, because you understand that side. On the security side, you also have a security champion; that person who's going to be the advocate, not only for the security side, but also for the developers, because that person is going to take the time to understand your needs as developers. This is a really cool way of thinking. I urge you to go to your management and spend time talking about that. How do you identify two people? Just start with two people who can be those advocates, who can say, “We want to build secure systems and we want to come up with a good method to help everybody be happy.”

Jerry Bell said it great, “I find that developers often lack the perspective on the adversarial mindset.” This is really important, because we don't have that adversarial mindset. Our job is to build features and to build really compelling apps. We need to have this; we need to start thinking about what a malicious actor would do with this feature. Start drilling into it, whether you're using something like Burp Suite or Fiddler to tinker around, add data, and see what's happening, or inspecting how the requests come across the wire to see whether they can be pilfered in some way.

Tools for Learning about Security

We need to start thinking a little bit more like that adversary so that we can identify the vulnerabilities. That's why I wanted to go into the security space. I actually enjoy that little adversarial side. My hacker handle is a little bit evil – I have that little evil bit in me. It is kind of fun to break things. It's also important to fix things. I thought this was going to be me, I really was going to put on the mask. I started looking for security training so I could get up to speed. What do I do? I go to YouTube, of course. What happens on YouTube? You just get flooded with stuff. Where do you find anything? It's impossible. There's just so much, there's a glut of information.

I said, “All right, I’m going to start off with this one, The Complete Ethical Hacking Course.” That was ok, but I think it depends on the type of person that you are. I’m the type of person who actually enjoys going into a classroom, sitting down, being able to talk with somebody, getting that mentorship aspect to it. Some people enjoy online training. That’s actually great, I took an online training class that’s turned out really good for me as well. But I do enjoy being able to sit down and talk with people, and also have that mentor I can bounce questions off of.

I ended up going here to Hacker House. That wasn’t online, that was an in-person class that gave me the foundations of how bad actors actually do their things. It’s incredible the amount of tooling out there. It makes it trivial for anybody to pick up and have some hacking tools, and actually go at stuff. It doesn’t mean they’re doing great. It doesn’t mean they’re going to be quiet, they’re probably going to be very noisy. But the fact that these tools are so readily available, and they can make your life impossible, it’s scary. I got that foundation because that gave me that adversarial mindset. It showed me, “This is what I can do.” I can use this tool to go ahead and probe databases, a tool like Sqlmap, a great tool. It looks into MySQL databases, and will actually return your databases, your data tables, users, data inside of it. That’s just one tool. The point is, it gave me that mindset.

Then I carried on with my education with eLearnSecurity, which gave me something a little more advanced. This was online. I just came back from a course with the SANS Institute. This fried my brain, because buffer overflows are hard as it is – you're dealing with assembly code, and machine code, and memory addresses. By the third or fourth day, I was done, I was fried, I wanted to go home. I curled up into a ball in the room and I was, “No. Please.” It was horrible, but it was a great course and gave me perspective on what's possible in terms of bypassing operating system defenses.

I'm not saying that you have to do all that, but at the very least, taking an ethical hacking course and getting familiar with the tools that give you application security coverage will be really beneficial. For example, there's a free course on YouTube on how to use Burp Suite. It's done by Sunny Wear, who is an expert on application testing. I would tell you, look it up and follow her course, because she actually walks you through how to use Burp Suite to do penetration testing on your app. It's really powerful. That's the type of stuff that all of us can do. You can all download the community version of Burp Suite and poke around in your app to see if you find stuff. It's really helpful.

I also have a ton of books. The one that I would recommend is called, “The Web Application Hacker’s Handbook.” This is widely considered the Bible of penetration testing for web applications. Some of the concepts may be a little dated, but a lot of them are not. I would tell you, “Go buy that book.” If there’s one book that you can buy, as web developers, get that one, and walk through it, and you’re going to be freaked out. The other one you can get is “The Tangled Web.” That’s another really good one, but this is the one that everybody recommends.

It's more than just reading books, it's more than taking a class. You have to practice, you have to go out there and actually try this stuff out. I would say look at OWASP Juice Shop. This is great because it's a Node.js application, it uses Express, uses Angular, and it's purposely vulnerable. You can download it as a Docker image, install it on a VM, and hit it. You can actually tinker with it to get familiar with things like cross-site scripting, cross-site request forgery, SQL injection. This is the type of stuff that allows you to practice in a safe environment, where you won't get in trouble.

This is the one thing I’m going to say. This is my disclaimer to all of you. If you decide that you want to go down the hacking route, please do it in a safe environment. Do it in your own virtual lab. Go to something like Hack The Box, or hack.me. Do not hack production systems, please. You will go to jail. Juice Shop is one of those opportunities for you to be able to hack something in a safe environment, where you can load it into a virtual machine and go at it, and actually get to learn how some of these attacks happened, and how to mitigate them.

The other one is Damn Vulnerable Web Application. This has been around for a long time, it’s another great one. What I love about it is that it gives you increasing levels of difficulty. You start off with the really foundational stuff, then you flip a bit and it keeps getting harder and harder. As you continue on, you can learn different ways of hacking the site. Then, of course, Tanya Janca, my teammate, has something called the OWASP DevSlop project, which is in a similar vein, but more for people who are focused on DevOps. Definitely something to consider if you’re getting into DevOps, to understand more how to protect systems.

Automate Security

Ultimately, we have a lot of work on our plate. Security may not be what we're going to do. Maybe it's going to be something that you're going to get a little bit familiar with, but not your forte. Totally fine, that's acceptable; this is where automation comes in. This is where it's critically important to start thinking about how you can include services that help you solve some complex problems, especially around security. If you're not a security expert, maybe you should be looking at leveraging other people who are. This is where companies like WhiteSource, like Snyk, like VeraCode, and Black Duck come in. I was talking to the WhiteSource folks and they gave me some really great data. Let's read through this. Over 75% of open-source projects were aware of only 50% of their open-source inventory. That's scary, that's inventory management. If you're a company that has to focus on compliance issues, especially around IP, that's really scary.

I know at Microsoft we have very strict policies about open-source because we have to be careful about lawsuits. If somebody creates an open-source project that has leveraged code from another project, and that's not properly licensed, it can come back to us. As an organization, if you just embrace an open-source project and you're not aware of those licensing concerns, that's a big issue. 90% have at least one vulnerability; over 45% have five and up. There's at least one license that doesn't meet company policy, on average. How many of you in this room would be able to tell if the open-source project that you're using would meet licensing requirements across the board, taking into consideration all the dependencies that these open-source projects have? Four people. Let's say there are 100 people in this room, so four people out of a hundred can feel confident that they have awareness of the licenses for the open-source projects they use. Think about the issues that that could cause for your project down the road. When you embrace a vendor, yes, you're going to pay out of pocket for it. This is the type of stuff that you want to pay for. This is the type of stuff that's important, because it protects you not only from a security perspective, but from a compliance perspective.

WhiteSource is great, I love their dashboard. I love the fact that you can drill into their dashboard and it gives you a really good view of the vulnerabilities that your dependencies have. This is stuff that’s really challenging. Remember what I said: if you don’t know security, work with somebody who does. All the vendors will give you some kind of dashboard. You want to have that bird’s eye view that says, “This is a vulnerability. I need to dive in there and solve this problem.” You want that, and doing it by yourself can sometimes be challenging.

I would urge you to look at any of these vendors. I don't work for any of them, obviously. That's why I put several up here. I think it's important for you all to look at the choices, and there are more than that. There are a lot of different vendors; these are just the four that I knew off the top of my head, and I didn't want to have a very cluttered slide. Look at them all, see which one solves your problem, and which one is the one that you feel really strong about. All of them are good vendors. It's up to you to evaluate which one fits for you.

Build a Strong Network

The other thing I’m going to urge you is to start building a strong network. What I mean by that is build a strong network of people who understand security. Get to know people, fellow developers, security professionals, who understand the landscape and understand the threats that are coming out. There are people who are proactively monitoring this constantly. You don’t have to be monitoring it yourself. I don’t think we should, but I think you should make friends with people who are, and ask them to help you stay on top of that.

If you go to your security team and say, “I really want to know when a security vulnerability comes out, because I want to stay safe with our application. I want to build secure software,” I can guarantee you, they’re going to be so excited. They’re probably going to hug you. You’re going to have this weird, awkward hug thing. Tanya Janca, my teammate, she is amazing. I would say follow her, get to know her, say hi to her, she’s really friendly. She is one of the best application security people I know. Of course, you have me. Hopefully, you guys will reach out to me and ask me questions. I would definitely urge you to do that.

The other part is, look at non-traditional talent to help you out in solving some of these problems. We always look for that person who has the 5 or 10 years of experience, whether in application development, or security, or whatever it might be. What about that person who maybe is a really good project manager who wants to be that intermediary between the two teams? Why can’t you leverage that? Does it have to be an application person and a security person talking to each other, or could it be somebody who has this desire to go out there and be that bridge? Look for people who are outside of your normal domain and capitalize on their strength, whether it’s communication, project management, leadership, whatever it might be, to be those advocates for you. Most importantly, do the right thing. I love the saying, “Go hack the planet,” just please don’t hack me.



Article: How to Tell Compelling Stories Using Data: Q&A with Dr. Christine Bailey

MMS Founder
MMS Ben Linders Christine Bailey

Article originally posted on InfoQ. Visit InfoQ

Key Takeaways

  • Every good story has a beginning, a middle and an end
  • Stories influence more than data
  • Use data and insights in addition to emotion to tell stories that move minds
  • Start with why and don't forget to add in curiosity, anticipation, humour
  • Never lose your common sense!

The more evidence we have, the more likely our ideas are believed – or so we're conditioned to think. But data doesn't always engage people; this is where storytelling can help to combine data, insights, and emotion, said Dr. Christine Bailey. At Women in Tech 2019 she presented techniques that can be used to tell compelling stories with data, and showed how that can increase our influence with external and internal stakeholders.

In today’s world, as the sheer volume of data continues to grow, and terms like ‘big data’ and ‘internet of things’ have become ubiquitous, how we communicate insights from data is becoming more and more important, said Bailey.

Bailey explained how a story should look:

“Every good story has a beginning, a middle and an end.  At the beginning, you’re laying out the ‘situation’ – the ‘what is’ state.  The middle of the story describes the ‘complication’ i.e. what needs to be overcome and/or the critical issues.  It also describes possible implications or consequences of not taking action, to add some urgency.  Finally, the end of the story lays out what needs to be done; the action the audience needs to take, the benefits/evidence of taking that action and a picture of the ‘promised land’ or the new/better future.”

Every compelling story starts with WHY?  In this day and age, you’ll get found out if you make it up, so your purpose, or your “why” has to emerge from a deeply-held company value, something in the very DNA of the operation, combined with deep customer insights, argued Bailey.

Bailey stated that at the end of the data there is a human being. She reminded us to be ethical: “Just because you can use data points to influence (manipulate?), that doesn’t mean that you should,” she said. She advised people to be guided by their moral compass and be authentic, in order to prevent being accused of trust-washing or woke-washing.

“It's important to be bold and aspirational, but don't drown in marketing speak as we're talking to humans,” said Bailey. Think about how to add in curiosity, anticipation and humour, she suggested. Do all this and you'll surely have a happy ending, she said.

InfoQ interviewed Dr Christine Bailey, chief marketing officer at Valitor, after her talk at Women in Tech Dublin 2019 and asked her about data storytelling.

InfoQ: How can we tell compelling stories using the data that we already have?

Dr Christine Bailey: Are you sitting comfortably?  Yes?  So now we’ll begin.  Once upon a time … think about a time when you had to influence others in some way. Maybe it was persuading your boss to buy into a brilliant new idea that you knew was going to be unpopular or difficult. Perhaps it was closing a must-win deal with an important customer, or maybe it was convincing your team to try something radically different, when they had a track record of being resistant to change.

Knowing you had to make a case for your ideas, where did you turn to first? Chances are you looked for data points, facts and figures and reached for the old, reliable powerpoint deck.

The more evidence the better, right? We’ve been conditioned to believe that the more points of evidence we have, the more likely we are to influence people. Even if we’ve made them up!  But while data may be factual and accurate, it’s not necessarily engaging.  People tend to start with the evidence, whereas that’s more recommended near the end of a compelling narrative.

You’ll need data points and insights to support all of your story (especially the evidence/benefits) at the end.  In my doctoral thesis published in 2008 titled How Large UK Companies Use Customer Insight for Customer Acquisition, Development and Retention, I identified five different sources of data:

  • Competitors
  • Customers
  • Markets
  • Employees
  • Channel partners

From this data, companies were generating four different types of insights:

  • Market predictions
  • Customer segments
  • Propensity models
  • Customer analytics

Ideally you’ll find those nuggets of difference that are going to help you connect with people and make the story more human.  

InfoQ: What are the challenges that data storytelling brings?

Bailey: Never lose your common sense!  I’m sure I’m not the only one who has cursed themselves for blindly following the satnav, even when your intuition knows it’s wrong.  Don’t ever blindly follow your data – if it doesn’t appear to make sense, ask more questions.

That’s also the danger with machine learning and AI – they’re only as good as the humans that built the models and algorithms behind them and are subject to bias and interpretation.  Last year Amazon famously had to scrap a secret AI recruiting tool that showed bias against women and Microsoft had to apologise for racist and sexist tweets by an AI chatbot.

What happens if you're buying a gift for another person and the AI thinks it's for you?  You've bought a jumper for your rather large granny and suddenly up pops an ad for plus-size clothing.  Or what about if you've bought items for a hen or stag weekend – what kind of future recommendations does that trigger? Personalisation is a double-edged sword.  If you get it right, you can delight your customers.  If you get it wrong, you can really upset people.

InfoQ: How do stories influence people?

Bailey: Here’s the thing: analytics alone don’t actually influence others.  Data points are transactional.  People don’t remember them or respond to them.  As Philip Pullman famously said, “After nourishment, shelter and companionship, stories are the thing we need most in the world”.

Who’s a Game Of Thrones fan?  Did you hear what Tyrion Lannister said near the end of the finale? He asks the rhetorical question, “What unites people?” He goes on to say, “There’s nothing in the world more powerful than a good story. Nothing can stop it. No enemy can defeat it.”

Stories tap into our emotions.  They move minds, not products. We remember them. As poet and civil rights activist Maya Angelou famously said, “People will forget what you said, people will forget what you did, but people will never forget how you made them feel.”

InfoQ: What benefits of storytelling have you seen?

Bailey: Disruptive brands always have a story.  Look at Virgin – one of the most powerful, disruptive brands in the world, transcending a whole multitude of products and services. How does Richard Branson explain this success? His tips are to “tell great stories that help people associate with your brand values”.

Regardless of sector, successful branding today is all about having a clear position, a ‘North Star’ purpose, building trust, creating immersive, innovative and personally engaging experiences and customer journeys in a human way.

Branding today must be rooted in insight. Without insight, you won’t have understanding, and without understanding, you won’t be able to build the kind of compelling and personal storytelling that connects with your customer and communicates your vision.

InfoQ: What tips can you give for telling stories?

Bailey: The next time you need to influence others, stop for a moment and think about how you want people to feel. More often than not, you can answer that question by looking at your own passions and motivations around your project. Harness that positive energy. Think about how you can share that feeling with others.

Then think about how you’re going to build your story, with a beginning, a middle and an end.
By all means use your data and insights as your support structures – you’re going to need them, particularly when you need your evidence.

Every good story starts with WHY, so find your why!  Get inspiration from Simon Sinek!  Go out and conduct some research and talk to customers. Find your purpose.  

As an example of how to do this: when I joined Valitor as CMO in August 2017, the company was operating as five different companies, with five different brands, all with different value propositions.  Valitor had started out in 1983 as Visa Iceland (issuing Visa cards) and was dominant in the Icelandic market, but had recently expanded into the UK and Denmark through acquisitions, and from there was serving customers across Europe. How could we go from being an Icelandic issuer and acquirer to an international payment solutions company?  How could I find our why? Develop a mission statement with a purpose and find the “red threads” that were common differentiators across the whole business? Everyone internally had strong opinions, and they differed widely, so I embarked on a strategic insights study with the help of a company called Context Consulting.

We went through a four-stage process.  First we interviewed fifteen internal stakeholders to get their view of the business and the market.  We identified seven of our top competitors and reviewed their marketing messages.  We didn’t validate their strengths and weaknesses or add our own assessment; we just looked at the messages they were putting out in the marketplace. Then we interviewed thirteen experts from the payments industry.  Finally, we conducted fifty five in-depth telephone interviews with end-users in Iceland, the UK and Denmark, who were either existing customers or prospects who matched our target profile.

It turns out that people don’t really care about payments, but they DO care about buying and selling and it should be as easy, simple, transparent, frictionless – whatever term you prefer – as possible.  So that became our mission, our purpose, our WHY – to make buying and selling easy.

To start with, I got quite a lot of resistance from people arguing, “How can we say that we make buying and selling easy when we only look after the payments piece?”  So that's when you need to add the “how” piece of your story. We take care of our customers' payments, so they can focus on buying and selling.

About the Interviewee

Christine Bailey joined Valitor, an international payment solutions company, as chief marketing officer in August 2017.  She has 25+ years’ experience of B2B marketing in technology/payments, including leading European marketing functions for Hewlett-Packard and Cisco Systems. A respected thought leader and motivational speaker who gave the TEDx Talk ‘Unconventional Career Advice‘ (102k views), Bailey was recently voted #1 Woman in Tech by B2B Marketing, #3 female influencer in UK B2B marketing by Onalytica, and is among the top 20 women leading the charge in revolutionary SaaS marketing by SaaStock. Bailey is an advisor for the European Women Payments Network (EWPN) and a senior fellow of The Conference Board. She has a doctorate (DBA) in customer insight from Cranfield School of Management.



Presentation: The Not-So-Straightforward Road From Microservices to Serverless

MMS Founder
MMS Phil Calcado

Article originally posted on InfoQ. Visit InfoQ

Transcript

Calçado: I have been doing this microservices thing for a while. Most recently I have been at Meetup, which is a company owned by WeWork. You might have heard of them, you might have been to a Meetup. It turns out that I need to make an update – last Friday was my last day at Meetup. If I say anything here that sounds weird, it does not represent Meetup at all, so don't blame them, blame me. Back at SoundCloud, I spent many years there, and prior to that, I was at ThoughtWorks for a few years. I've been through this microservices transition – which really means going from a monolith to a microservices architecture – a few times.

Most recently, when I actually joined Meetup, I faced a new kind of paradigm that some of you might be familiar with, which is all this serverless stuff. I use the word traditional microservices out there on Twitter and people made fun of me, and they should. I feel I’m a very traditional person when it comes to service-oriented architectures, and this was a great opportunity for me to check my biases which are based on experience. Experience and bias are really the same thing at this stage. We’ll work out what’s the new way to build microservices with some of these things that people usually call serverless.

What are Microservices?

The first thing I want to do is to create a shared vocabulary here. I won’t spend too much time on this, but I’ve found it to be important in other talks I have given, which is try to create a working definition of what’s a microservice, and hopefully, other words we might need. The first one is what’s a microservice, or what are microservices? The interesting thing that I found in software engineering in my career is that most things have no actual definition. We reverse engineer some form of definition from what other people have done. Over time, at QCon’s and many other conferences, people have come to say, “At Netflix, SoundCloud, Twitter, whatever, we’ve been trying this kind of stuff and it works,” and somebody else is, “I’m going to call it microservices.” “Dude, you call it whatever you want. It works.” That’s what matters.

We were able to reverse engineer the definition based on a few things that are shared by these companies that have been applying this kind of technology. It’s very similar to how agile came to be a word. How these folks got together, “We are trying to do all these different things, it seems to be working, let’s see what’s common amongst them all and give it a name.” The way I like to refer to microservices is, they seem to be highly distributed application architecture.

There are a few interesting words here. Highly distributed is one of them. By that, I mean that you don't have just one or a few big software systems running; you actually have a boatload of them. You have many different software systems that implement the business systems you might have. This seems to be something common across all these different use cases you might have seen. Something interesting is that I say distributed application architectures, because I'm sure you've been to many talks, at this conference and others, and read articles, and whatnot. People keep talking about distributed systems, which is something everybody is talking about all the time and gets really boring, to be honest.

The main difference I see between what I’m talking about here and what usually is called a distributed system, is that these people often talk about infrastructure pieces. They often talk about databases, consensus, gossip, things like these, all this stuff that nobody really understands but we need to pretend to understand so we get a job at Google. That’s not what I’m talking about; what I’m talking about is actual business logic pieces talking to other business logic pieces. That should be hopefully a good enough definition for what we want to do now.

What is Serverless?

Moving on, what is serverless? That’s an interesting one because, like I said, I’m not really that experienced with this word. There are many people in this very room right now who have been doing this for way longer than me. I’m a biased traditional person on the microservices side of thing. My answer to this is, “Dude, who knows?” I don’t know what is serverless. It seems like every single person has a different definition and it’s weird and it seems to be really whatever a cloud provider wants to do.

One interesting difference I found between what’s generally known as serverless and what became the concepts behind microservices, is that the microservices realm seems to be reverse engineered from practitioners, from people coming to stages like this, writing papers, writing blog posts, saying, “Who’s actually done this? This is how we managed to get Twitter out of whatever it was back to sanity. We’ve done these things in Netflix. We are able to deliver all these stuff.”

Serverless seems to be the other way around. There are a lot of cloud vendors pushing new technologies they want all of us to use and giving it a name and packaging it in a nice format. There's no problem with that, I used to work for a cloud provider, so I know how the market works a little bit, but it's different. I don't know if I'm qualified to give you a definition. I'm going to actually rely on a definition by a couple of smart people who are based in New York, Mike Roberts and John Chapin. They run a company called Symphonia here. Full disclosure, I'm friends with them but also I have hired them many times to do various forms of consulting work. You should totally hire them, but very conveniently, they wrote a book called, “What Is Serverless?”

There is some definition there that’s some kind of litmus test which I like, but we don’t need to go too much into detail to tell the tale I am going to talk you about. Maybe the most interesting definition or part of this definition is that there is a disconnect between the codes you work with, and the concept of servers, hence serverless. There are a few ways that it manifests itself, but one of them is, you don’t really think about CPU or memory or instance size or instance type, things that you might have to do with containers, or VMs, or things like this. I’ll go with this, this is good enough. Go check out their book, it’s available freely online.

Then, as for examples of what this is, a few things come up. Up to yesterday, this slide only had three logos. The first one on the left is Amazon’s Lambda, then there is Google Cloud Functions, and I’ve actually never used that, and the other one is Microsoft Azure Functions – I’ve also never used that, I’ve only used the AWS flavor. I was here at the conference yesterday talking to some folks and they were mentioning that they’re actually using this Kubeless framework or platform, whatever it is. [inaudible 00:07:00] serverless on top of Kubernetes. Never used that, just found it interesting. Those are some of the flavors of what’s generally known as serverless frameworks, platforms, whatever you might want.

Keep Your Monolith for as Long as Possible

Trying to tie it back to the story I’m trying to tell you all here, there’s a general piece of advice you receive all the time, which is, try to keep your monolith for as long as possible. Most people would tell you, “If you start a new company, or if you are starting a new initiative within a bigger company, don’t try to just spread out your systems throughout many different microservices or get all this RPC going on. You should use this. You should use gRPC, HTTP, service mesh, whatever.” This is probably not the right time for your company to be thinking about this.

Instead, what you could be thinking about is how to grow your business. You don’t know if you are going to have a company or this initiative within your company in two months’ time. Don’t focus too much on technology, focus on the business and just keep things on a monolith. Usually, this advice is paired with something like, by the time you need to move away from the monolith, usually what you do is that you have your big piece of code with various different concerns represented by different colors there. You start splitting one over there and one over there, and slowly moving things away from the monolith.

My experience is that you're actually never going to move things away or extract them from the monolith. The code is still going to be there, you just pretend it's not there, and that's often ok. If you think about Twitter, they only turned off their monolith a few years ago. SoundCloud still uses the monolith, and I left that company five years ago, when we were well on our way into the microservices journey. Don't worry too much about it. That's the general framework people use when extracting systems.

What if You Waited So Long That There are Other Alternatives?

What happens if you followed this advice so well that it took you 10, 15 years to actually consider moving away from the monolith? What has happened is that, as I was saying, all this stuff like serverless and various other options came to be. These weren't available when I was first doing migrations from monoliths to microservices at SoundCloud back in 2011, 2012, but they are available now. This is a little bit of the situation Meetup found itself in.

Meetup is a very old company. I didn’t know that before I joined it. It’s 16 plus years old. It’s one of the original New York City startups and it’s mostly based on one big monolithic application, that for various reasons that are not actually funny, we call Chapstick. Chapstick is a big blob of Java, actually Java the platform, because there’s also Jython. There is also Scala. There is probably some stuff that I never really cared to look at. It’s just your regular monolith developed over many years. It’s not the worst thing I’ve seen, but obviously it’s very hard to work with.

Engineers at Meetup decided that at some point they wanted to move away from it. They tried different things, that was way before I joined, so there is only so much I actually know about this firsthand. They tried moving to your typical microservices setup with containers. There was an attempt at using Kubernetes locally, various different attempts and they all failed for various different reasons. Having had some experience with this, I think a lot of it had to do with trying to go from the Monolith straight to Kubernetes. There are other ways to get there.

The point is that there was a big problem. We had this monolith, and we, the company, had just been acquired by WeWork I believe, in 2017. We knew change was coming and we need to get ready for that. What does that mean? One engineering team in particular was really passionate about this new flavor of things, what we usually call serverless. They decided to give it a shot and built a few projects that looked a little bit like this. This is actually a screen-grab from one architecture document we had. It might be a little confusing to explain if you are not used to serverless asynchronous architecture, but I will try my best.

The first thing, far on the left-hand side, is Chapstick, our monolith. Chapstick again, Java code writing to a MySQL database. One thing we've done is that we followed a pattern that was mostly popularized by LinkedIn, to my knowledge, with their database project. One interesting thing about some of these databases like MySQL is that whenever you have a master-replica setup, there is something called the Binlog. The node you send your writes to tells all the read-only replicas about every change that has happened. The Binlog contains every state change within the MySQL setup – all inserts, deletes, updates, whatever. It's all part of the Binlog.

One thing that companies like LinkedIn have done in the past, and we've done at Meetup, was to actually tap into this Binlog, this stream of data, and try to convert it into events. Every time a user was added to the user table, there was some code that would read from the Binlog, emit a “user was created” event, and then write that into a DynamoDB table. DynamoDB is a database by AWS. It's one of the original NoSQL databases, it's very good in my experience, but don't worry too much about it. Just know that it can hold a lot of data, way more than any typical MySQL or Postgres setup.

We keep pushing all this stuff into DynamoDB as events: user was added, user was deleted, user joined group, user left group. Then we had a lot of Lambda functions that would consume from this database using something called DynamoDB Streams, which is actually very similar to the Binlog in that DynamoDB, for every insert, delete, or update, sends this out as a message: “This user was deleted. This row was deleted. This row was added.” That can be consumed by other systems. We would have another Lambda consuming this information, doing some processing with it, and copying it to a different database that would be a very specialized version of what we wanted.

In this particular case here, this is the architecture that used to show the members who belonged to a group. If you go to Meetup right now, /groups, /whatever, /members and you see the list of people who belong to that group – it could be the New York Tech group for example, that’s a big one that we have in New York City – this data comes directly from a database that is specialized for that. It went through all of this processing to generate this materialized view that serves that experience only. That was the general design we had for a while.
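To give a feel for what one of those consuming Lambdas looks like, here is a minimal sketch of a DynamoDB Streams handler that maintains a group-members view, using the AWS SDK for JavaScript. The table names, attribute names, and event types are assumptions for illustration, not Meetup's actual schema.

// group-members-projector.js - a minimal sketch of a Lambda maintaining a materialized
// view from DynamoDB Streams. Table names, attribute names, and event types here are
// illustrative assumptions, not Meetup's real schema.
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

const VIEW_TABLE = 'group-members-view'; // hypothetical materialized-view table

exports.handler = async (event) => {
  for (const record of event.Records) {
    // Stream records arrive in DynamoDB's wire format; unmarshall turns them into plain objects.
    const image = AWS.DynamoDB.Converter.unmarshall(
      record.dynamodb.NewImage || record.dynamodb.OldImage || {}
    );

    if (record.eventName === 'REMOVE' || image.eventType === 'USER_LEFT_GROUP') {
      await docClient.delete({
        TableName: VIEW_TABLE,
        Key: { groupId: image.groupId, memberId: image.memberId },
      }).promise();
    } else if (image.eventType === 'USER_JOINED_GROUP') {
      await docClient.put({
        TableName: VIEW_TABLE,
        Item: { groupId: image.groupId, memberId: image.memberId, name: image.name },
      }).promise();
    }
    // Other event types flowing out of the Binlog-derived stream are ignored by this view.
  }
};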

One thing that happened is that this project wasn't particularly successful. We are going to discuss some of the issues that made it not successful, but generally, we actually had to revert this approach recently. It went back to just having an API talking to the legacy code base. I want to discuss a few of the challenges we had. Some of them are more interesting than others, and some are particular to Meetup, because there are always accidental things – it's an old code base, it's an old problem, the team is formed by members who might or might not have experience with this or that. But some things are probably general enough that we can extrapolate from them a little bit more.

Challenges We Faced

One of the challenges we faced was, first, that solving a bug means reprocessing all the data. Remember that I said the legacy code base writes to MySQL, and then MySQL reports, “There was an insert on this table, there was a delete on that table,” back to this stream that converts the changes into events. There is a user inserted event, a user deleted event, all this kind of stuff. Then another system gets these events and says, “A user has joined the group.” When I am creating my materialized view of all users who belong to a group, I need to add or remove that person depending on what goes on.

One challenge is that all of this is code we wrote. As with every single piece of code we wrote, there are plenty of bugs. Every now and then, we would find something, and one piece of this transformation wasn't working very well. We would patch it. That's great, but as opposed to the online systems you might have, that also meant that we had to run all the data through this pipeline again, because the copy we had here – the last database on my right-hand side here – was not correct anymore, because we had fixed some logic or changed something.

That happened a lot during this project. Every now and then, we would go and find out, “Actually, we thought that this field in the legacy schema meant this, it actually means that, we need to remove that.” Or, “Oh my God, we cannot add accounts that were disabled automatically by the system.” We have a lot of spammers in the platform, as everybody has; we need to make sure they are not there anymore. We add the logic to prevent spam and we have to run the whole data pipeline again. There are a few ways to deal with this in event sourcing and CQRS architectures. We struggled with this a lot.

Another one is the write path issue, as I call it. I'm showing you the read path. I'm showing you how information gets onto the screen that shows you who belongs to this group. I'm not showing you what happens when somebody, maybe an admin of a group, says, “Actually, I want to ban this person from my group.” We struggled with this a lot because it pretty much meant that we had to rewrite all the logic that was in the legacy code. Ultimately, the pragmatic decision was, “You know what? Let's just use the legacy code as an API for writes.” Now you're like, “Ok, but then why did we go through all this work just to get the read path if the write path is going to go through the legacy code base anyway? Are we any more decoupled from what we had?” Those were the questions we had, and these were the challenges we were facing.

Another one, which is a little more common, is that we were actually using Scala. Meetup was heavily a Scala shop, and we had this issue of cold starts for JVM Lambdas. A cold start is what happens when the first request hits a Lambda that you just deployed. Amazon – we know it's magic – needs to get your code from some form of disk and load it into memory, and if you've worked with the JVM for a while, you know that the JVM also does a lot of optimizations in itself. It takes a while, depending on the function, for that particular function to be up and running. If it's steadily receiving traffic, that's fine, but for the first requests it's always a little challenging. We had a lot of problems with JVM Lambdas because of this. This is one of the reasons why we migrated some of our JVM Lambdas to use Amazon Fargate, which is actually a container platform. We had all these complications where some things were Lambdas, some things were containers. It was a bit crazy.

Another one was DynamoDB costs. DynamoDB is an interesting one because we had a lot of copies of the same data. We were paying the price over and over again for storing the same actual data. There was no one canonical model we all used; we had different specialized copies of it. DynamoDB is interesting in the sense that, very similar to anything else on Amazon, by default you do on-demand provisioning. “I want a new table.” “Here is a new table.” “I actually don't need this table anymore.” “Ok, it's fine. Don't worry about that.”

It turns out that if you've ever had to deal with Amazon billing, this is the most expensive thing you can do. It could be ok for many use cases, it's ok for a lot of people, but in our case, this was extremely expensive. The solution was really to do some good old capacity planning and sit down and work out how much we actually needed. The Symphonia guys that I was talking about before helped us with this; we literally saved millions of dollars just by actually working out, “We actually need this much. Hey, Amazon, can we provision this beforehand as opposed to just asking for it on-demand?”
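For a sense of what that change amounts to in practice, here is a minimal sketch of switching a table from on-demand to provisioned capacity with the AWS SDK for JavaScript. The table name and the capacity numbers are placeholders; the real values have to come from the kind of capacity planning described above, and this is not the exact change Symphonia made.

// provision-table.js - a minimal sketch of moving a DynamoDB table from on-demand to
// provisioned capacity after doing capacity planning. The table name and the read/write
// numbers are placeholders, not the figures Meetup actually used.
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB();

async function switchToProvisioned() {
  await dynamodb.updateTable({
    TableName: 'group-members-view',      // hypothetical table from the earlier sketch
    BillingMode: 'PROVISIONED',
    ProvisionedThroughput: {
      ReadCapacityUnits: 500,             // derived from measured traffic, not guessed
      WriteCapacityUnits: 100,
    },
  }).promise();
  console.log('Table switched to provisioned capacity.');
}

switchToProvisioned().catch((err) => {
  console.error('updateTable failed:', err);
  process.exit(1);
});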

To me, as the person who was overseeing the architecture and developer tools teams, the biggest problem was that we had no governance whatsoever. Who is calling what? What function is deployed where? The moment you go to your Amazon console, list all the functions, and see functions named after people, you know you are in trouble. There is absolutely no way to know what's going on. To this day – we turned off some functions that had a VP's name on them, and it turned out they were actually very important for billing. We did not know, nobody knew. She had written them many years ago. Anyway, it was a little hard to manage all this stuff.

Challenges We Did Not Face

Coming in as this very biased, old-school microservices person – I can’t believe I’m saying that – I walked into this scenario expecting to say, “I could see this was going to happen. Told you so.” There were a few things that I 100% thought were going to happen and didn’t. One of them is that getting new engineers productive on Lambda and DynamoDB was actually pretty easy. On top of this, we had hired a lot of new engineers from Node.js front-end backgrounds, and all this stuff was in Scala. I’ve used Scala for a fair amount of time, and I know that Scala is the kind of language that you either hate or love. There is no middle ground; nobody is lukewarm about Scala. I happen to like it, but I’ve managed 200 people writing Scala and swearing at me for a very long time.

The interesting thing is that if you are only writing 200 or 300 lines of Scala, it’s not that hard, especially if you use things like IntelliJ that really help you out. The way I think about this is the same way I personally deal with makefiles. I’ve been in this industry for 16 years and I don’t know how to write makefiles. I have never learnt; I copy, paste, change, and move on. There’s probably been one makefile that somebody gave me, and I’ve kept changing it for 16 years now. I actually have no idea what I am doing, but it works, it’s fine. That’s the kind of approach we were taking there. Obviously, we had tests and other support to make sure people were not doing something completely stupid, as opposed to what I do with makefiles. That was the principle: it’s very discoverable. You can just change something and move on.

Another one relates to the cold starts I was talking about. I mentioned that we moved some of the things to Fargate. Then we moved them back to Lambdas, but we converted them to Node.js Lambdas, and we didn’t see the cold start problems anymore. Obviously, there was still a little bit of an issue, but nothing that would hit us at the level the JVM ones were hitting us. Another one is operations allergy. I spent four years at ThoughtWorks, so I’m very indoctrinated into the right way of doing things, as any ThoughtWorker would tell you. There is only one: it’s the ThoughtWorks way. One of the things we were super big on is the DevOps culture, which is different from having a DevOps team.

The thing here is that I’m used to building engineering teams and trying to get people to own and operate their own systems, and people have been really allergic to this; they really hate it. I expected exactly the same thing to happen here. It turns out that although Lambdas can be really weird and opaque at times, that allergy doesn’t really show up that much. In my experience, people are reasonably happy being on call and operating their own system. I think it has to do with the size of things. If I’m responsible for something that, even as a microservice, is kind of big and I don’t know everything that’s going on in there – because I might be more of a front-end person than a back-end person – it’s complicated. If it’s something that uses the language I’m used to, I have all this tooling, and it’s small, it might be ok.

Another one that I was 100% sure was going to be a problem is local development. If everything is on the cloud, how do you do local development on your machine? What if you take a plane? It turns out most planes have WiFi now, at least in the U.S., so that’s not so much of a problem. One interesting thing about local development is that most of the tooling available right now – and my only firsthand experience is with the SAM framework from Amazon – has enough simulators that you can use for local development. I don’t suggest you do too much with them. One thing we’ve done at Meetup is to make it really easy for people to provision their own accounts so they have a sandboxed space where they can deploy all their Lambdas and do their other stuff without messing with other people’s production systems and things like this. The simulator reminds me a lot of Google App Engine when it first came out: it works 80% of the time, but that 20% is what’s going to kill you.

The last one is developer happiness. I thought this would make no difference, because it’s just back-end technology; most engineers don’t care about that, they want to develop a product. But it turns out that people are really happy with this whole setup. They really wanted to keep using it, and it’s not just because it’s different from what we had as a legacy system, because they could have used something else. They could have used containers, they could have used whatever they wanted, really. They actually wanted to keep using it, and they were asking for more help, tooling, and training on this.

How Can We Keep the Good and Get Rid of the Bad?

The one question that then came to my mind, and that I was working on with my team, was how can we keep the good and get rid of the bad? I was coming in super biased: I had years of microservices behind me – the SoundCloud days, then Digital Ocean, where I went through the same stuff, then a Service Mesh company for a while. So I have my opinions about how things should be done, but it was pretty clear to me that a lot of this tooling was just good; it works. It’s not perfect, but it works. I was thinking about this for a while, and I realized that maybe the easiest way for me to rationalize this whole thing is not too dissimilar from the way I perceived we went through microservices and other transitions.

If you think about it, if you’ve been in this industry for a while, you usually have a few applications within a company. Here’s a silly example where we have a financial reporting application, some form of user management application, and some kind of point-of-sale application. They are all different, they are all independent. If you are talking old school, they might be managed by the same team, but they could be completely different teams – maybe even different consulting companies writing them.

This was the picture that anybody would show you on your first day working for any corporation, maybe 10 or 15 years ago. What they didn’t tell you is that actually it was kind of like this. It’s one company: everybody needs the user data, everybody needs the sales information, whatever it is. There is always some kind of cross-pollination – collaboration is probably a better word – across these different databases. We came to see this as a bit of a problem, because all these systems were coupled to each other. They relied on each other’s schemas and various other things.

One way we managed to get out of this mess was by creating proper services. Service-Oriented Architecture is a very old term in technology now; we’ve been talking about it since the late ’90s. It means different things to different people, but mostly it means that you expose the features and data that other systems on your network need. One thing I have seen happening a lot is that once you did that, you didn’t really have this whole set of independent applications that had their own back-end and their own data. You might have been on projects like this 10 years ago: you’re working on some sort of customer portal or admin portal that ties all these things together. Instead of having one application to do this little thing and another application to do that other thing, we ended up having one front-end that would talk to many different services in the back. This was how the industry was evolving, and it was generally interesting and good.

One interesting way that I see serverless applying to this whole picture is that even with microservices, where you have smaller boxes and smaller services, you still only have so many. What serverless tends to do is something more like this, where you start exploding your features into a million different functions. These functions are 100 lines of code, 200 lines of code, and it’s very easy to make functions talk to each other. It’s very easy for these super-deep-in-your-system features to access data from another super-deep-in-some-other-system feature. The systems themselves don’t really exist anymore. The colored lines there are just to illustrate how these things are kind of related, but not really. That’s a little bit of what we ended up with at Meetup.

It’s funny because a few months ago, I was on Twitter and I saw somebody describing this, I think, perfectly – a guy I used to work with at ThoughtWorks called Chris Ford. It’s a whole paragraph, but essentially he calls it a pinball machine architecture: you send a piece of data here, there is a Lambda there and a Lambda there, something drops it in a bucket over there, and you don’t know what’s going on anymore. This is pretty much what we had. As we progressed towards this kind of architecture from something more like the previous picture, we just didn’t know what was talking to what or what was going on. We were legit victims of this pinball machine architecture syndrome, if you will.

Thinking about this for a little while, here is the challenge we had: we want to keep the good things about serverless architectures we’ve talked about – we have good provisioning systems, people were generally happy to write code for it, we have good support, we have good tooling, we have all this stuff. What are the things we need to do to prevent us from getting to this kind of pinball machine architecture I was talking about?

I think in every presentation I’ve ever given, I’ve always fallen back on something Martin Fowler has written. If you’ve ever worked with me, you are going to receive a copy of this PDF at some point, because this is one of my favorite articles in software engineering of all time; I think it’s super underrated. It’s a piece Martin wrote – I don’t even know when, it was many years ago – about the difference between published and public interfaces. The article is good, read it; he has very good examples. But mostly, think about it this way: most programming languages will offer you keywords like public and private. Java has that, Go has a naming convention that serves the same purpose, C# and Ruby have it too. Either there is a keyword or some kind of convention.

The interesting thing is that, for example, in Java or Ruby, you can go into the runtime library for these languages, find some class deep in God knows where, in some weird package you never heard about, and access it. You can access it because what public and private in a programming language do doesn’t really have to do with security; it’s about different levels of access control. Even languages like C++ and others have package-level protections, and you can always find a way to access this stuff. The point is that semantically these things could be expressed in a better way, but it doesn’t actually matter, because what matters is that the person providing you with the library, or whatever it is, needs to offer you some published interface. A published interface is what you want your users to actually use, what you are willing to support them using to do whatever they want to do.

Just because something is public in a library you are using doesn’t mean you are supposed to use it. Of course, you can use it, and sometimes it’s a very good hack that allows you to do many different things, but it’s not what you should be doing. What you should be doing is using what Martin calls the published interface, which is what the library owner or the service provider offers you as an API. I was reading this and thinking about the challenges we were having, and we found one interesting way to map these concepts to our explosion of Lambdas everywhere.

You remember we had a situation where we had lots of Lambdas talking to lots of Lambdas, and this doesn’t even do justice to the kind of stuff we had, because we had no idea what was calling what. Obviously, yes, there are Amazon tools for this, but as you are writing code, you don’t know; it ends up being a report that somebody generates. One thing we’ve done was to split them out and put some kind of API gateway in front of them. Now that’s interesting because, at the same time, there is a lot of contention and no contention at all about this design. It’s actually recommended by many people that you always use an API gateway in front of your Lambdas.

At the same time, people look at it as, “Ok, but then aren’t you just creating services and using Lambdas as your compute units? Isn’t that just the same?” To be honest, yes, it’s very similar. Again, I’m coming from my bias; I’m trying to adapt this to keep what works and to fix the things I haven’t seen working, and it works well. The thing here is that each one of these Lambdas is still addressable; it could be called by anybody. You can prevent this with some security configuration, but we didn’t even have to do that.

Everybody in the company knows that the only way to access one of these Lambdas is through the API gateway. The API gateway is your typical HTTP interface; it has a well-known domain name that you access to do things, and that’s the only way to go from A to B. Instead of having this kind of peer-to-peer communication network, you start funnelling things through an API, which is what services have always done. One interesting thing is that some of the internal Lambda functions in this diagram – and that is on purpose – call other services, and they can call them directly. You don’t need the boundary object that some people create to make these calls; it’s fine if they call the API gateway of another service directly. This is something that we’ve done.
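
As a rough sketch of that “published interface” rule – assuming the group service sits behind a gateway whose URL is injected through configuration (the URL, route, and environment variable below are hypothetical) – a caller looks something like this, rather than invoking another team’s Lambda by name:

```python
import os
import requests

# The published interface: the group service's API gateway URL, injected as config.
GROUP_API = os.environ.get("GROUP_SERVICE_URL", "https://groups.example.internal")

def get_group_members(group_id: str) -> list:
    """Call the group service through its gateway, never by invoking its Lambdas directly."""
    resp = requests.get(f"{GROUP_API}/groups/{group_id}/members", timeout=2)
    resp.raise_for_status()
    return resp.json()

# What we avoid: reaching into another team's implementation details, e.g.
# boto3.client("lambda").invoke(FunctionName="group-service-get-members-prod", ...)
```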

Use Serverless as Building Blocks for Microservices

What we’ve done was really just start using serverless as building blocks for the microservices we had. It’s really using the platform-as-a-service part of what serverless provides, and not so much some of the more interesting bells and whistles we could be using. One interesting challenge is how you segment – how do you draw these lines between services? One thing we’ve done at Meetup is to use AWS accounts for that, which works well except for one big caveat: AWS is really bad at deleting accounts. You make a request to delete an account and it could take days; it might never be fulfilled. So we have an organizational unit, which is a grouping of accounts, and every time we want to delete an account, we move it into that organizational unit; it’s effectively retired and not active anymore. It’s not perfect, but it works well enough for us.
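
A sketch of what account provisioning and “retirement” can look like with the AWS Organizations API; the naming convention and the idea of a dedicated retired organizational unit are assumptions for illustration, not Meetup’s exact tooling:

```python
import boto3

org = boto3.client("organizations")

def provision_service_account(service_name: str, email: str) -> str:
    """Create a dedicated account for one service; creation is asynchronous, so the
    caller should poll describe_create_account_status with the returned request id."""
    resp = org.create_account(Email=email, AccountName=f"svc-{service_name}")
    return resp["CreateAccountStatus"]["Id"]

def retire_service_account(account_id: str, current_parent_id: str, retired_ou_id: str):
    """AWS is slow at actually deleting accounts, so instead we move the account
    into a 'retired' organizational unit and treat it as inactive."""
    org.move_account(
        AccountId=account_id,
        SourceParentId=current_parent_id,
        DestinationParentId=retired_ou_id,
    )
```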

Another interesting thing we found is that API gateway is actually quite expensive. You are going to see many articles online talking about how it’s an interesting product, but it’s not great. My own experience with AWS is that unless you really need to do something about the cost right now, wait a little longer and talk to your account manager. I don’t know how Google and Microsoft handle this, but talk to your account manager and see if there is any kind of price drop coming.

We didn’t have this with API gateway when I was at Meetup. I mentioned that we moved some of the systems to Fargate. Fargate was very expensive, we just had no option. We knew that we were going to increase our bill a lot by using that. It turns out that one week before we moved some of our systems to Fargate, Amazon just slashed the price in half. That’s the Amazon policy. That worked well. We could actually keep using Fargate because it was ok. I don’t know if that’s going to happen to API gateway. I haven’t seen massive price drops there, but one of the reasons I wasn’t worrying about this is because what API gateway does is actually not a lot for us. If that became a problem, I could actually just get a team on the side and create some load balancing structure that would replace that. It would be a bit ad hoc and not invented here-ish, but we could do that. That was an option for us.

This is the thing that I’m still a grumpy old man about when it comes to serverless architectures. There are a lot of people who tell you that Amazon provides you with everything and it’s all going to be awesome. At Meetup, we went all in on Amazon. Meetup still runs some stuff on Google Cloud, but it’s moving to AWS. We only used CloudFormation – we didn’t even use Terraform, because we wanted to make sure we were using the tools that our AWS people could help us with. We did everything the Amazon way, and I still had about 10% of the engineering team working on tooling. There were a lot of loose ends. Teams that don’t talk to each other – you can clearly see that AWS has many different teams, and they kind of hate each other, because this feature is available there but not integrated with these other things. Why? This happens a lot, and you need to be ready for that.

Is This Really Serverless?

I think ultimately, one interesting question is, is this really serverless? Is this what the future looks like, because it looks very similar to what we do now? Are you sure? My point is that I don’t actually care, because that was one way we managed to get from one point to the other. But more important to me – this is also something that I tend to use as a saying, I tend to repeat a lot – you don’t move from two out of ten to ten out of ten in one jump. I think some of the mistakes we’ve made in the past were exactly trying to do that, moving from a legacy application that was big and complicated, and 16 years old, to the bleeding edge of serverless computing with asynchronous workflows using Kafka and God knows what. This is too much of a jump, the gap is too wide. We need to go inch by inch and try to get somewhere.

This is an approach where I can hire somebody who has experience with microservices and they immediately understand what’s going on, as opposed to what we had before, which was very complicated. Does that mean that CQRS and event sourcing don’t work? I’m not saying that; it’s just that I never actually worked on a project that did them the right way, whatever that may be. I’m sure they can work well; I just haven’t seen it so far.

Going back to the whole idea of coming from a traditional microservices background to the Lambda kind of approach: what does that mean? What is the future like? Not being cynical at all, I actually think that serverless definitely is the future. As I mentioned, I used to work for a company called Buoyant. Oliver Gould is going to give a talk about some of the work we did there, and that work is still there – an amazing piece of technology built on the platform around Kubernetes, mostly, and things like this. But the moment I saw how easy it was for an engineer fresh into the company to deploy things using serverless, I said, “There is definitely something here. There is a future there.”

To be honest, I think AWS is pretty bad at tooling. I worked for a few years at Digital Ocean here in New York City, and the main thing they do is fix the user experience things that Amazon cannot fix. So I’m familiar with some of the challenges there, but once they get this sorted, I legitimately think this is the right way to go.

Questions & Answers

Participant 1: On your last slide, you mentioned that serverless looks like the future but it’s not there. Can you give a list of features that you think are currently missing?

Calçado: I don’t know if there would be features per se, because I think the building blocks are there; it’s how to get these things to talk to each other. We have various different components that work great in isolation, and I mentioned that we have 10% of my team working on tooling. We are not really writing actual software; we are just linking things together, making something that will spit out a CloudFormation template that wires this thing with that thing over there. I don’t know if there is a particular feature I am missing, as much as just the workflow nature of things, which I believe the cloud providers are going to get better at, and some companies offer alternatives on top.

Participant 2: On the slide where you showed how you use an API gateway to essentially gate serverless functions, there was one place where you had a call from a gateway to an actual function. Can you elaborate on when it makes sense to do that and when it makes sense to always go through the gateway?

Calçado: I have two answers to this question. One is that there is actually a mistake on the slide. The second one, though, is that one thing I have seen happening mostly at the edge – it will happen sometimes downstream as well – is that you need the API gateway to do some form of authorization and authentication, something like this. Usually it wouldn’t be such a big jump from one system to the other, but it might be something like this. It’s funny, because I haven’t mentioned this in this presentation, but I used to believe in really dumb pipes, and now I’m becoming more and more a believer that the pipe should be smart enough, including the API gateway piece of infrastructure doing rate limiting, authentication, authorization, and various other things you might need at that layer. But that particular call is simply a mistake on the slide.

Participant 3: You mentioned just briefly about Service Meshes. I wonder, how does that fit into this landscape here? Would the Service Mesh be East-West then the API gateway North-South?

Calçado: The only way I have been thinking about this right now – Service Mesh wasn’t the problem I was thinking about, but I think it’s the same – is that I was sitting down with somebody from one of the monitoring providers who was trying to sell us some things. We already had some of their products, so we were just negotiating contracts and things like this. I said, “By the way, we have decided we are going to invest more in serverless. If I go with you, because I’m signing a two-year contract right now, what does that mean?” Their response was kind of iffy. Eventually they got me to talk to the CTO of the company, and it turned out that all they could do was read the Amazon logs and plot them.

Extrapolating from this to your question: this ecosystem is so closed up that I have no idea if anybody is going to be able to offer something at the Service Mesh level or any of these things. It worries me a lot, because I have been an Amazon customer for many years. SoundCloud was the second largest AWS account in Europe, and I know that AWS is great at some things, but I don’t like the way they develop software. I don’t want them telling me how to develop software. I trust other folks to do that, but not them. So yes, I’m worried about this, but I don’t know how it’s going to pan out.

Participant 4: You’ve got a bunch of lines between Lambdas. My limited experience so far is that that’s problematic and we still end up doing it sometimes, but we are using a lot of step functions and queues and publishing topics and reading them, and so forth. Can you talk about your experiences there?

Calçado: I tried to simplify this diagram a little bit, and it’s funny, because when I was preparing it, I thought it wouldn’t quite match the quote I had from Chris there, because he is talking about exactly what I’m talking about: many different step functions, buckets, whatever it is. In my experience, that is what you end up using to communicate between these things, and you are right. I’m not going to call it wrong, but it’s something I feel probably should be revisited.

One thing I found in this more actual serverless architecture, using different objects or resource types, is that unless you apply this rule of tying them all together with something – it could be tags, but we use accounts specifically – you are going to have a lot of trouble trying to figure out what’s used by what and what’s going on. We have a lot of interfaces that are just databases. They get exposed to other services, and not being an actual published interface, as I was saying, creates a lot of trouble. That’s how we deal with the step functions and various other objects and things. I actually don’t like step functions that much, to be honest, but there are buckets and DynamoDB tables all over the place, and also the Kafka thing from Amazon. If you have too many things fanning out from one function, that function is probably not 200 lines of code anymore.

Participant 5: I wanted to ask in terms of your architecture using multiple AWS accounts, it wasn’t clear to me, what’s the benefit in doing that? Also, in terms of your deployment, how is that done? If you have multiple accounts, how do you get around that kind of dynamic account ID type thing?

Calçado: The abstract benefit is that you need to group your resources by something. Usually people do it by team or by environment, but we decided to group them by service. We could use tags for that, but we figured that an interesting boundary between services would be accounts, because you have to explicitly jump from one to the other; it’s a bit safer in our experience. One challenge we had at Meetup was that we had one account for everybody, all environments, and everybody had access to it. We found that people were really afraid of doing anything in production, because anything wrong they did could impact the whole company, so containing things further and further was a good option. We followed this to the point where we decided that the unit was going to be the service itself. Again, it’s arbitrary; it could be anything, it could be just tags and things like this.

We have a slightly complicated network architecture at Meetup because of the legacy account and the things we need to talk to, because the monolith is still there. Usually for a deploy, you do an assume-role, or whatever the primitive is called, and you deploy to that account. Engineers have direct access to all their accounts, because our main concern is not that people cannot access the other accounts; they just need to say, “I am going to use the user service account now,” as opposed to applying something to a generic, coarse-grained account. Big caveat: a lot of the tooling I mentioned is around this. We have a lot of tooling to create accounts, delete accounts, and jump between accounts as people jump between services. It’s the working context or workspace you are working in.
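
A minimal sketch of the assume-role step for deploying into one of those service accounts; the role name "deployer" is a hypothetical convention:

```python
import boto3

def session_for_service_account(account_id: str, role_name: str = "deployer") -> boto3.Session:
    """Assume a role in the target service account and return a session scoped to it."""
    creds = boto3.client("sts").assume_role(
        RoleArn=f"arn:aws:iam::{account_id}:role/{role_name}",
        RoleSessionName="deploy",
    )["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

# Example: deploy a Lambda into the (hypothetical) user service account.
# lambda_client = session_for_service_account("123456789012").client("lambda")
```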



Presentation: Peloton – Uber's Webscale Unified Scheduler on Mesos & Kubernetes

MMS Founder
MMS Mayank Bansal Min Cai

Article originally posted on InfoQ. Visit InfoQ

Transcript

Bansal: I am Mayank Bansal from the data infra team at Uber, and here is Apoorva from the compute team at Uber. We’ve been working on Uber’s open-source unified resource scheduler, which we call Peloton. Let’s look at what the cluster management story at Uber is today. Currently, we have thousands of microservices running in production, thousands of builds per day, and tens of thousands of instances deployed per day. We have 100K-plus instances running per cluster, more than 10 million batch job containers running per day, thousands of GPUs per cluster, and 25-plus clusters like that.

Right now, all the microservices, which we call stateless jobs, run on their own clusters: stateless services run on Mesos and Aurora. Batch jobs – Hadoop, Spark, TensorFlow – run on Hadoop YARN. We have Cassandra, Redis, MySQL, and other stateful services, which run on bare metal in their own clusters. Then there are daemon jobs, which run on all of these clusters, and there is no resource accounting for those. The vision for Peloton is to combine all these workloads onto the same big cluster. We wanted to do that to improve cluster utilization. The scale we run at is thousands and thousands of machines, and if we co-locate all these workloads, we envision a lot of resource efficiencies, which translate into millions of dollars. The other reason we can co-locate all these jobs is that their profiles are very different. The online services or microservices are very latency-sensitive jobs, which cannot be preempted, because if you preempt them, you impact the business: when you open the Uber app and call for an Uber, if a service gets preempted, the driver is not going to show up. The batch jobs – all the offline training, distributed training, machine learning jobs, spot jobs, all that analytics – can be preempted.

What we are thinking is, we will co-locate them in the same cluster. That way we can preempt batch jobs if one of the other profiles spikes and use that capacity for higher-priority jobs, so we don’t need to hold separate DR capacity. We don’t need to buy extra capacity, which we do currently, because we over-provision the services cluster as well as the batch cluster for spikes and all these DR reasons. And because the different workloads have complementary resource profiles, we can get better cluster utilization.

Existing Cluster Management Solutions

We looked at the existing cluster management solutions back in the day and found there is no silver bullet. We looked primarily at four solutions: Borg, YARN, Mesos, and Kubernetes. Borg is Google’s cluster manager, which runs all the application workloads; job/task lifecycle management, placement, preemption, and allocation happen at the Borgmaster, and the Borglet does the task execution and container orchestration.

YARN does it a little differently: job/task lifecycle management is done through the application master, the rest happens at the resource manager, and execution happens at the node manager. Mesos is not really a scheduler; it’s a resource manager, so orchestration happens at the agent, resource allocation at the Mesos master, and the rest of the scheduling primitives are implemented by the frameworks on top of Mesos. Kubernetes follows the same model: task execution at the kubelet, scheduling at the kube-scheduler, and job/task lifecycle at the controllers.

We looked at all four of these schedulers at the time. We liked Borg, because Borg is the only scheduler, at least as far as we know, that co-locates all the different workloads together, but we cannot use it because it’s not open source. YARN is a batch scheduler which is good for batch jobs, but it’s not good for stateless services. Kubernetes is not good for batch; it does not have elastic resource sharing or elastic resource management, and it cannot scale to the throughput and high churn rate of batch jobs. None of them handles Uber’s scale, which needs 10,000-plus nodes.

Peloton Architecture

Let’s look at the Peloton architecture. Peloton is a resource scheduler for co-locating mixed workloads. It runs on top of Mesos and is integrated with Apache Spark, TensorFlow, YARN, and uDeploy, which is the deployment and orchestration engine for our microservices. It can run on-premise and in the cloud. Overall, how does it work? Mesos just does the resource management piece and gives the resources to Peloton. Peloton manages those resources, and all the scheduling primitives are implemented on the Peloton side. All the orchestration engines – stateless services via uDeploy, Cassandra, MySQL, Spark, Hadoop YARN, and TensorFlow – are integrated through the Peloton APIs.

Compared against the other schedulers we looked at, Peloton works pretty much like Borg, except that resource allocation and task execution happen at the Mesos agent level. This is our overall Peloton architecture: we have the Peloton UI, Peloton CLI, and Peloton extensions, which talk to Peloton through the Peloton API, which is gRPC backed by protobuf. Four major components make up Peloton. One is the job manager: stateless daemons that run as containers in the cluster and manage the lifecycle of all the jobs. We have the resource manager, which does all the elastic resource sharing, preemption, admission control, and so on.

The placement engines are also stateless daemons, and they are different for different types of workloads, because you want different placement logic for different workloads: there is a batch placement engine, a stateless placement engine, and a stateful placement engine. With the batch placement engines, you can run many of them together to increase throughput, and that’s how we scale Peloton horizontally. The host manager is the abstraction layer on top of Mesos. We use Cassandra for state management and ZooKeeper for leader election of all the daemons.

This is the workflow: a Peloton client talks to the job manager to submit a job, the job manager puts the state into Cassandra, the resource manager does the admission control, and the placement engine places the task, which is launched on Mesos through the host manager. Now, I’ll welcome Apoorva [Jindal] to talk about Peloton’s other cool features.

Elastic Resource Management

Jindal: I’m Apoorva [Jindal], and I’m one of the engineers working on the compute team at Uber on Peloton. I would like to thank you all for taking the time to listen to us. Today, I’m going to talk about some of the interesting features which we have implemented in Peloton for some of the unique use cases we have at Uber. I’m also going to talk about some of the integrations which we have done with some of the open source batch platforms.

Let’s start with batch workloads at Uber. The first feature I would like to showcase is elastic resource management. Let’s try to understand why elastic resource management is needed. Say we have a cluster with a finite set of resources, and we want to divide these resources between different organizations. One way to go about it is to let each organization submit their jobs to the cluster, first come, first served: whoever submits first gets to run first, and once they are done, the next organization gets to run, and so on.

The issue we saw with this approach was that if one organization submits a large number of jobs, it pretty much takes up the entire cluster, and jobs from other organizations get starved out, so that didn’t work for us. Next, we tried static quota management. We said, “Each organization has to go and figure out how many resources it needs based on some expected demand.” Each organization comes back and tells us, “This is the demand I’m going to have,” and we gave them a static quota. That becomes a guaranteed quota for them that always remains available. After we implemented that, what we observed is that there were many times when the cluster was mostly idle – there was very little running on it – but there was very high pending demand. That happens because batch jobs are very short-lived, and the request patterns from the different organizations are very dynamic.

It could be because some organizations are distributed geographically over time zones, and also there are some use cases which different organizations have which caused them to create workloads at a certain time. For example, in the cluster which serves our autonomous vehicles group, what we saw was that there was a group which used to process data being collected by the cars being driven on the road. The cars would go out in the morning, they would come back in the evening, and all the jobs would get created at 5:00 p.m. from that organization.

Then there is another organization which submits a lot of jobs to simulate the different vehicle conditions which can happen on the road. All those happen during the daytime when the engineers are in the office, so there is heavy demand from one organization during the daytime, and very high demand from another organization during the nighttime. What’s the solution for this? One way to do it is, use the priority-based quota model, which Borg has. To be able to do this, what we need to do is we need to take every workload which a user can submit, and assign it a global priority across the entire company. Then what we do next is we divide the cluster capacity into different quotas, production and non-production, and hand it out to different teams.

The issue we found is that if a team has production quota but is not using it and has a lot of non-production jobs, then the resource cannot be interchanged between the two quotas. A more structured way to do this is to take the cluster and divide those cluster resources into the different organizations as a reservation or as a guarantee, that if you have demand, then you are guaranteed to get this many resources. Then we ask each organization to divide the workloads into production workloads and non-production workloads. If they submit a production workload, which is essentially a non-preemptible workload, then we place them in their guaranteed reservation. If after placing them there is still some capacity available, then we take the lower priority non-guaranteed workloads and try to place them in the same reservation.

Now, let’s say for a given organization after summing up the demand from production workloads and the non-production workloads, there is still unallocated capacity from that reservation. What we do is we take that unallocated capacity and divide it, and give it to other organization to run their non-production workloads on. If there is one organization which has very high demand, and another organization which has no demand, then we take the resources from the second organization and give it to the first organization.

The way we implement it is using the concept of hierarchical resource pools. The root resource pool is essentially our cluster that gets subdivided into multiple resource pools, one for each organization, and then each organization has the liberty to further subdivide into different teams. Each resource pool has a reservation, which we talked about, then each resource pool also has allocation. If a team has a demand less than its reservation, then it gets an allocation equal to its demand. If it is more, then it gets at least equal to its reservation and then any extra capacity gets subdivided.

Let me illustrate this idea with an example. Say I have three resource pools and a total capacity of 100. There is very low demand in resource pool one, and resource pools two and three have very high demand. Resource pool one gets an allocation equal to its demand, because it’s less than its reservation, and resource pools two and three get their reservation, plus any extra capacity in the cluster divided between them equally. If the demand for resource pool one increases, then we do the same calculation again and take the resources back from resource pools two and three to give back to resource pool one. Basically, since the non-production workloads are preemptible, we go and preempt them from the cluster to release the capacity to give back to resource pool one.
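
A small sketch of that elastic-sharing calculation in Python. It is not Peloton’s actual code; the reservations and demands below are invented numbers chosen to reproduce the shape of the example (pool one under its reservation, pools two and three splitting the leftover evenly):

```python
def allocate(capacity: float, pools: dict) -> dict:
    """Give each pool min(demand, reservation), then split leftover capacity
    evenly among pools that still have unmet demand."""
    alloc = {name: min(p["demand"], p["reservation"]) for name, p in pools.items()}
    leftover = capacity - sum(alloc.values())
    hungry = [n for n, p in pools.items() if p["demand"] > alloc[n]]
    while leftover > 1e-9 and hungry:
        share = leftover / len(hungry)
        for name in list(hungry):
            extra = min(share, pools[name]["demand"] - alloc[name])
            alloc[name] += extra
            leftover -= extra
            if alloc[name] >= pools[name]["demand"]:
                hungry.remove(name)
    return alloc

pools = {
    "rp1": {"reservation": 40, "demand": 10},   # low demand
    "rp2": {"reservation": 30, "demand": 90},   # high demand
    "rp3": {"reservation": 30, "demand": 90},   # high demand
}
print(allocate(100, pools))  # {'rp1': 10, 'rp2': 45.0, 'rp3': 45.0}
```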

Priority Preemption

The second feature I want to highlight is priority preemption. At Uber we use YARN for most of the batch workloads, and one of the issues we face is that when we rolled out YARN, we didn’t assign a priority to each workload. What happens is that, since there is no priority, low priority workloads get submitted and take up the capacity, and higher priority workloads submitted later get starved out. Low priority workloads keep running, and the higher priority ones either have to wait longer or, if it’s a critical workload, it actually requires manual intervention from our SRE team to go and preempt some of the low priority workloads to give the resources to the high priority ones.

What we do in Peloton is as follows. The job manager enqueues each task with a given priority to the resource manager. All the preemption, entitlement, and allocation calculation happens in the resource manager, based on the capacity it gets from the host manager. The job manager submits the task to a pending queue, which is a multi-priority queue. The resource manager takes the highest priority task from the pending queue, looks at the allocation, and sees if it can be admitted. If it can be admitted, great, it goes ahead.

Now say there is no resource available to place that high priority task. Then the resource manager checks whether lower priority tasks in the same resource pool are taking up the pool’s entitled resources. If those resources are being consumed by tasks whose priority is greater than or equal to the priority of the current task, then there is nothing for us to do – the task waits. If they are taken up by lower priority tasks, we go ahead and preempt them. This removes the manual intervention needed to let critical tasks run.
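
In pseudocode-style Python, the admit-or-preempt decision inside a single resource pool looks roughly like this; it is a hedged sketch of the rule just described, not Peloton’s resource manager code:

```python
class ResourcePool:
    def __init__(self, entitlement: float):
        self.entitlement = entitlement      # resources this pool is entitled to right now
        self.running = []                   # list of (priority, task_id, demand)

    def used(self) -> float:
        return sum(demand for _, _, demand in self.running)

    def try_admit(self, priority: int, task_id: str, demand: float):
        """Admit if there is room; otherwise preempt strictly lower-priority tasks.
        Returns the list of preempted task ids, or None if the task must keep waiting."""
        free = self.entitlement - self.used()
        victims = []
        for victim in sorted(self.running):             # lowest priority first
            if free >= demand or victim[0] >= priority:
                break
            victims.append(victim)
            free += victim[2]
        if free < demand:
            return None                                 # stays in the pending queue
        for victim in victims:
            self.running.remove(victim)
        self.running.append((priority, task_id, demand))
        return [victim_id for _, victim_id, _ in victims]
```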

The final requirement, which is very unique to us, is the requirement of large jobs. This is a requirement where we get jobs with tens of thousands and even hundreds of thousands of tasks, and the kicker is that each task has a different configuration. Why is that? The use case comes when we have to process extremely large data sets. For example, our autonomous vehicle maps group had a use case where they broke a street into small segments and processed each segment as a task, so one street is a job, and each segment on the street is a task. Of course, if the street is very big, then it results in a lot of tasks.

What are the potential issues for a scheduler? First, the job configuration becomes huge. If you have 100,000 tasks and each task has around 10 KB of configuration, the overall job configuration becomes a gigabyte. Another issue is that if we launch all the tasks in parallel, they can kill some of the subsystems the job relies on. For example, in most cases we have jobs that go to a storage subsystem to read and write the data they process. If we launch 100,000 tasks that all talk to the same storage subsystem, we pretty much bring that subsystem down.

How do we solve these problems? One thing we do is we always compress configs. What we figured out is that the latency to compress and decompress configuration is always less than the latency to write and read that configuration from any storage system, so always compressing configs is fine. Then we do asynchronous task creation: the user comes in, submits these large jobs, we write to storage and return from the API call, and then our goal state engine picks up that job and starts creating the tasks asynchronously. While it’s creating those tasks, it also controls the number of tasks running in parallel. If a user configures the number of parallel running tasks to be 3,000, because that’s what the storage subsystem can handle, then we make sure that no more than 3,000 are running, and as soon as one task completes, we launch another one.
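
The two ideas – compressing per-task configs before they hit storage and capping how many tasks run at once – can be sketched with standard-library pieces; the 3,000 parallelism limit is the example figure from the talk, and run_task stands in for whatever actually launches a task:

```python
import json
import zlib
from concurrent.futures import ThreadPoolExecutor

def pack_config(cfg: dict) -> bytes:
    """Compress a task config before writing it; 100k tasks x ~10 KB adds up fast."""
    return zlib.compress(json.dumps(cfg).encode())

def unpack_config(blob: bytes) -> dict:
    return json.loads(zlib.decompress(blob))

def run_job(task_configs, run_task, max_parallel=3000):
    """Create and run tasks asynchronously, never more than max_parallel at once,
    so downstream systems (e.g. storage) are not overwhelmed."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(run_task, unpack_config(pack_config(cfg)))
                   for cfg in task_configs]
        return [f.result() for f in futures]
```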

Another issue we saw was that the data we are storing is huge, and if we store everything in memory – which is what a lot of open source schedulers do to get scale – our memory blows up and we keep running out of it. So we built a kind of cache to store only what we use very often, rather than storing everything.

Most of the batch workloads at Uber are submitted to Peloton via Spark. Why Spark on Peloton? There are multiple ways to run Spark. One way is to use YARN, but the version of YARN we have at Uber doesn’t have Docker support or good container support. Another way is to use Mesos, but Mesos doesn’t have any quota management or elastic resource sharing, and there is also a big scale issue, because every Spark job registers as a framework, and Mesos can run maybe 100 frameworks at a time at most, perhaps 200.

Another option is to use Kubernetes, but it has no dynamic resource allocation or elastic resource sharing. Another way is to say, “Anything that requires Docker, let’s run on Kubernetes or Mesos, and anything that doesn’t, let’s run on YARN.” Then we run into the resource partitioning issue: how do we dynamically share resources between YARN and Kubernetes? Plus, to do that, we run into the same problem of needing to define global priorities across workloads, which is nearly impossible at Uber. How does Spark integrate with Peloton? Spark submits a job to Peloton, which launches the driver task, and the driver task then talks to Peloton to launch the executor tasks and uses the Peloton APIs to monitor them as well.

GPUs & Deep Learning

Let’s talk a little bit about deep learning use cases and GPUs. At Uber, we have a very large number of deep learning use cases, and that’s why, as Mayank [Bansal] mentioned when talking about our scale, we have huge GPU clusters. The use cases range from autonomous vehicles to trip forecasting for predicting ETAs, as well as fraud detection and many more. What are the challenges distributed TensorFlow faces? One is elastic GPU resource management; it also requires locality of placement, each task must be able to discover every other task, and it requires gang scheduling and corresponding failure handling.

What is gang scheduling? A subset of tasks in a job can be specified as a gang for scheduling purposes. Note that each task is an independent execution unit and hence runs in its own container, and can complete and fail independently. From a scheduling perspective, though, all the tasks in a gang are admitted, placed, preempted, and killed as a group, so if one of the tasks fails, the entire gang fails. The way distributed TensorFlow integrates with Peloton is that the deep learning service creates the job and creates the tasks in that job as a gang on Peloton. Peloton makes sure all these tasks get admitted and placed as one unit. Then, once the launch succeeds, the individual deep learning containers talk back to Peloton to discover every other task in that job – or in that gang, rather.
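
In sketch form, gang scheduling comes down to all-or-nothing admission and all-or-nothing failure handling; the resource model below (a single CPU count) and the kill callback are simplifications for illustration, not Peloton’s API:

```python
def admit_gang(task_cpu_demands, free_cpus):
    """Admit the gang only if every task in it fits; otherwise admit none of them.
    Returns the remaining free CPUs, or None if the whole gang must keep waiting."""
    needed = sum(task_cpu_demands)
    if needed > free_cpus:
        return None                      # no partial placement of a gang
    return free_cpus - needed

def on_task_failure(gang_task_ids, failed_task_id, kill):
    """If any task in the gang fails, the whole gang is killed and rescheduled."""
    for task_id in gang_task_ids:
        if task_id != failed_task_id:
            kill(task_id)
```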

Horovod is an open-source distributed training framework for TensorFlow which integrates with Peloton, and it is what we run at Uber. We compared plain distributed TensorFlow and Horovod running on Peloton, and for standard benchmarks we see huge gains with Horovod. Let me also spend a few minutes on some of the unique stateless features we have in Peloton. I’m not going to talk about how the federation system interacts with Peloton; we have a Spinnaker-like federation system called uDeploy, which integrates with Peloton, but it’s closed source, not open source.

Stateless Jobs

What’s a stateless workload? Stateless workloads are long-running services which handle user-facing, latency-sensitive requests and do not need any persistent state on local disk. All the microservices which serve Uber requests – the microservice which serves Uber riders, or drivers, or payments, and so on – are essentially stateless jobs running on Peloton.

One of the very important requirements for these workloads is that they have strict SLAs. We cannot bring down a large number of instances without user-visible disruption; if you bring down a large number of instances, Uber goes down.

Let’s talk about the first feature, which is rolling upgrade. An upgrade is a change to the configuration of a service. A service has multiple instances deployed with some configuration; you want to change it, and the most common configuration change is to change the code running for that service. If you bring down all instances to bring up the new configuration, then of course you violate the SLA. What we do is first bring down a small subset of instances and then bring them back up on the same hosts they were running on before. If you’re familiar with some of the other open source schedulers, you will see some subtle differences here. The first is that we don’t do red/black deployment – we don’t bring a new instance up first and then bring the old instance down. The reason is that we run at such high allocation in our clusters that we don’t have the spare capacity to bring another instance up.

The second interesting thing is that we bring the instance back on the same host. We try very hard to do that, for the following reason. One of the unique problems at Uber is that the image sizes for our microservices tend to be really huge, so if you look at our deployment time, most of it is spent downloading the image. Anything we can do to speed up the image download reduces the deployment time significantly.

Being able to come up on the same host lets us do the following. When the deployment starts, we know upfront which hosts the instances are going to come up on, so while the first batch of instances is being upgraded, we go and refresh the images on every other host. When the next batch of instances goes down, the image is already present when they come back up, so they come up really fast.
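
A sketch of that upgrade loop: stop, start, and prefetch_image stand in for the scheduler’s real operations, and the two points being illustrated are same-host placement and warming the new image on the remaining hosts while the first batch upgrades:

```python
def rolling_upgrade(instances, batch_size, stop, start, prefetch_image):
    """Upgrade in place, a small batch at a time, each instance on its original host."""
    batches = [instances[i:i + batch_size] for i in range(0, len(instances), batch_size)]
    for index, batch in enumerate(batches):
        if index == 0:
            # While the first batch is being upgraded, refresh the new image
            # on every other host so later batches come back up quickly.
            for later_batch in batches[1:]:
                for instance in later_batch:
                    prefetch_image(instance.host)
        for instance in batch:
            stop(instance)
            start(instance, host=instance.host)   # no red/black: same host as before
```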

Another thing which it helps with is emergency rollbacks. An emergency rollback is a scenario when a user deployed a new configuration or a new build to every instance already, and through some monitoring or some alerting they figure out that the build is causing problems. Now, they want to roll everything back to the old build, so that’s an emergency roll back. If you roll back again on the same host, then since the image is already present on the host, the emergency roll back goes really fast.

Oversubscription

The next feature is what we refer to as oversubscription. What’s oversubscription? Take a host with two tasks running on it with some allocation. In most cases, the utilization of those tasks is much lower than their allocation, and the difference between utilization and allocation is what we call slack. We have a resource estimator which measures the slack available on each host, and Peloton offers these slack resources as non-guaranteed resources, or spot instances, for non-production batch workloads. Then, if we detect that a host is running hot, we have a QoS controller which can kill these non-production workloads to reduce the resource contention on the host. As you can guess, and as Mayank [Bansal] will talk about later, this is one of the most important features that allows us to co-locate batch and stateless workloads in the same cluster.
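
Conceptually, the slack and QoS pieces look something like this; the 90% hot-host threshold and the dictionary shapes are assumptions for illustration, not Peloton’s actual policy:

```python
def slack(host) -> float:
    """Slack = allocation minus observed utilization; offered as non-guaranteed capacity."""
    return max(0.0, host["allocated_cpus"] - host["used_cpus"])

def qos_control(host, kill) -> None:
    """If the host runs hot, kill non-production tasks to relieve contention."""
    if host["used_cpus"] > 0.9 * host["total_cpus"]:     # threshold is an assumption
        for task in host["tasks"]:
            if not task["production"]:
                kill(task)
```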

The final feature I want to talk about is SLA awareness. As I mentioned, the SLA is something very critical to our users, and it’s defined as the maximum number of unavailable instances. Why can an instance become unavailable? There may be an ongoing update, or there may be maintenance operations in the cluster – say we are trying to upgrade the kernel or do a Docker daemon upgrade – or we may be doing task relocations, for example defragmenting the cluster, or we ingested a bunch of new capacity and want to spread tasks onto the new hosts. It’s also used to reduce host utilization.

What we have built in Peloton is a central SLA evaluator: any operation which can bring down an instance has to go through that central SLA evaluator, and it can reject the operation if the SLA would be violated. Every workflow we’ve built that can cause an instance to be unavailable has to account for the fact that its request can get rejected, and has to handle that by retrying, finding some other instance to operate on, and so on. Now, I’ll hand back to Mayank [Bansal] to talk about what lies ahead for us.
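
A minimal sketch of a central SLA gate of this kind: every workflow that wants to take an instance down asks first, and has to cope with a refusal by retrying or picking another instance:

```python
class SlaEvaluator:
    def __init__(self, max_unavailable: int):
        self.max_unavailable = max_unavailable   # the SLA: max instances down at once
        self.down = set()

    def request_downtime(self, instance_id: str) -> bool:
        """Return True if the caller may take this instance down without breaking the SLA."""
        if instance_id in self.down:
            return True
        if len(self.down) + 1 > self.max_unavailable:
            return False                          # caller must retry or choose another instance
        self.down.add(instance_id)
        return True

    def release(self, instance_id: str) -> None:
        """Called once the instance is healthy and serving again."""
        self.down.discard(instance_id)
```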

Peloton Collocation

Bansal: We have built so many cool features, so now let’s talk about what we are building right now. We have all the batch clusters with all the batch jobs running, and all the stateless clusters with all the stateless microservices running. This is the vision we started Peloton with – co-locating all kinds of workloads – and this is what we are building right now. Why do we want to do co-location? It’s simple: we want efficiency gains in our clusters, and by running the numbers, we figured out that if we do co-location, we can save at least 20% to 25% of compute resources, which translates to millions of dollars.

What are the challenges with co-location? If we try to co-locate on the same machine, there are four challenges. One is disk I/O. The stateless services are very CPU-heavy but write little data to disk, while the batch jobs, which are very high-throughput, write a lot of data to disk, which increases disk contention on the machines. Once we start writing that much data on a machine, the stateless services’ latencies get impacted. We need a very good disk I/O isolation story and, to be frank, this is a very hard problem, and I don’t think anybody has solved it yet.

The second problem with same-machine co-location is the network on the machine. Batch jobs are very network-heavy because they need to read and write a lot of data over the network, while stateless services are latency-sensitive jobs relying on the network to serve traffic. If you run batch jobs on a machine where critical stateless services are running, you are going to impact the stateless latencies. Batch jobs are also by nature very memory-heavy, and because they use a lot of memory, they dirty the kernel caches and the CPU caches, which also impacts the stateless latencies, because the CPU cannot cache effectively.

Once you run stateless and batch instances together, you also need some kind of memory oversubscription, because without it you cannot drive up utilization in the cluster – you would not have many instances running on the same machine. Looking at all these challenges, we thought that running on the same machine was maybe too far-fetched for now. So what we are doing right now is co-locating in the same cluster. We are building something called dynamic partition co-location, where we partition the same cluster into multiple virtual partitions.

We create a virtual partition of stateless jobs and a virtual partition of batch jobs in the same cluster, and they are dynamic: you can move machines between virtual partitions. You oversubscribe the physical resources – you claim more resources than the machines actually have, say 16 CPUs instead of 10 – so you can pack more services or jobs on the same machine, and you move machines when some workload spikes.

This is how it will look: you have the same cluster with the stateless hosts and the batch hosts, and all the services run in the same virtual partition, packed onto each other because of the oversubscription. Right now at Uber, we don’t oversubscribe the machines and we don’t pack instances on top of each other, because that leaves more slack for spikes or failovers. In this case, however, we do pack them, and we preempt jobs from the batch partition if there are spikes in the stateless partition.

As soon as we observe services going from the green to the orange to the red zone, we take machines from the batch virtual partition and place those services on them. This is how it will work: we monitor every host’s utilization level, every host’s load level, and the latencies of the service instances running on those hosts. As soon as we see host latencies being impacted – the P99s and P95s getting worse – we proactively take machines from the batch partition; we preempt the batch hosts and move them back to the stateless virtual partition.

This is how we're going to make it work. Every host runs a host agent, and there is a central daemon called the host advisor. Each host agent monitors its host and reports the necessary utilization levels for CPU and memory and the P99 latencies for each instance. We then do the calculations at the host advisor level, which tells Peloton that if these hosts or services are running hot, it needs more machines. Peloton will then go ahead and move machines across virtual partitions.
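A minimal Go sketch of the host agent and host advisor interaction might look like the following. The struct fields, thresholds, and the advice format are assumptions made for illustration, not Peloton's actual interfaces.

```go
package main

import "fmt"

// HostReport is the kind of sample a host agent might send to the host
// advisor; the field names are assumptions, not Peloton's API.
type HostReport struct {
	Host       string
	CPUUtil    float64 // 0.0 - 1.0
	MemUtil    float64 // 0.0 - 1.0
	WorstP99Ms float64 // worst instance P99 on this host, in milliseconds
}

// Advice is what the advisor hands to Peloton: which hosts are running hot
// and how many extra machines the stateless partition needs.
type Advice struct {
	HotHosts   []string
	ExtraHosts int
}

// advise aggregates agent reports; both thresholds are illustrative.
func advise(reports []HostReport) Advice {
	var a Advice
	for _, r := range reports {
		if r.CPUUtil > 0.85 || r.WorstP99Ms > 100 {
			a.HotHosts = append(a.HotHosts, r.Host)
		}
	}
	a.ExtraHosts = len(a.HotHosts) // naive policy: one batch host reclaimed per hot host
	return a
}

func main() {
	reports := []HostReport{
		{Host: "host-1", CPUUtil: 0.60, WorstP99Ms: 40},
		{Host: "host-2", CPUUtil: 0.92, WorstP99Ms: 130},
	}
	fmt.Printf("%+v\n", advise(reports))
}
```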

Peloton & Kubernetes

The second cool thing we are working on is the Kubernetes and Peloton integration. Kubernetes is a widely adopted container orchestration system; Peloton is an intelligent scheduler built for web-scale workloads. Before doing the integration, we wanted to benchmark Kubernetes and Peloton and see whether it even makes sense to integrate, or whether we should just use plain vanilla Kubernetes. We have something called virtual cluster, built on top of Peloton and open source, which lets you run any cluster on top of Peloton: a Kubernetes cluster on top of Peloton, a Peloton cluster on top of Peloton, or YARN on top of Peloton. We built it because we wanted to run all these benchmarks; getting this many physical machines is very hard, so you can build a virtual cluster on top of a Peloton cluster and do all the benchmarking there.

We ran clusters from 1,000 nodes up to 10,000 nodes, with jobs ranging from 10,000 containers to 100,000 containers. We measured two things: first, how much time it takes to complete a job, and second, how much time it takes to upgrade a job. These are the numbers for 10,000 containers on 1,000 machines up to 10,000 machines, and the Kubernetes numbers are way off from the Peloton numbers. Kubernetes took at least 500 seconds, while Peloton scaled with the number of nodes: as we increased the number of nodes, the Peloton numbers came down. We also tested with 50,000 containers per job, and at that scale Kubernetes couldn't cope; it was not able to finish the job.

Similarly, we tested batch jobs on 2,000 nodes with 10,000, 20,000, and 50,000 containers, where Kubernetes also couldn't scale; the 50,000-container job could not run on the Kubernetes cluster at all. For the stateless deployment on 2,000 nodes, Kubernetes took far more time than Peloton for the 50,000-Pod deployment, and it was the same for the rolling upgrade.

Why do we want to do the Peloton and Kubernetes integration? Kubernetes is where the industry is right now, and there is a very big community building around it, so we wanted to take advantage of Kubernetes and integrate Peloton with it. It will also enable Uber to leverage open source. Why Peloton? Because it meets our scale and customization needs, and it gives us a clear path to move our workloads from Mesos to Kubernetes in the future.

This is how we are going to integrate Peloton on top of Kubernetes. Our Peloton scheduler will be integrated through the API server using a scheduler extension. We'll build our stateless job controller, batch job controller, and stateful job controller, which will expose the same Peloton APIs we have today, and the orchestration engine will run on top of those controllers. We'll use the standard Kubernetes model to plug in our scheduler. With the Peloton placement engine, resource manager, and host manager, we already have elastic resource sharing, higher throughput for batch jobs, and the rest of the cool features Peloton offers, so we can combine the best of both worlds.
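As an illustration of the scheduler-extension approach, here is a small Go sketch of an HTTP filter endpoint in the style of the kube-scheduler extender protocol. The structs below only mirror the general shape of that protocol; a real integration would use the official extender types and Peloton's own placement logic rather than accepting every node.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// These structs only approximate the shape of the kube-scheduler extender
// request and response for illustration purposes.
type filterArgs struct {
	Pod       json.RawMessage `json:"pod"`
	NodeNames []string        `json:"nodenames"`
}

type filterResult struct {
	NodeNames   []string          `json:"nodenames"`
	FailedNodes map[string]string `json:"failedNodes"`
}

// filter is where a Peloton-style placement engine would apply its own
// constraints (elastic resource pools, preemption, batch packing, and so on).
// Here we simply accept every candidate node.
func filter(w http.ResponseWriter, r *http.Request) {
	var args filterArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	res := filterResult{NodeNames: args.NodeNames, FailedNodes: map[string]string{}}
	json.NewEncoder(w).Encode(res)
}

func main() {
	http.HandleFunc("/filter", filter)
	fmt.Println("extender listening on :8888")
	http.ListenAndServe(":8888", nil)
}
```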

Peloton in Production

Let's talk about the status of Peloton in production right now. We have been running in production for more than one and a half years. All our autonomous vehicle map workloads run on top of Peloton. We have integrated with Spark and are moving all the Spark jobs to Peloton. Our distributed TensorFlow deep learning jobs all run on Peloton. We have already started to move stateless services onto Peloton, and once the co-location projects are done, we will start co-locating them together. As for scale, we are running more than 8,000 nodes and more than 2,500 GPUs, with over 3 million jobs and over 36 million containers per month on Peloton in production.

This is an open source project; there is a blog post and an open source repo. Please look into it, give us feedback, and we're very happy to take contributions as well.

Questions and Answers

Participant 1: You're talking about migrating to Kubernetes, and from what it looks like, you've already invested heavily in Mesos, so what would be the actual technical reasons to do this besides the community? Are there actual reasons to migrate from the existing infrastructure, or are some projects just more hyped in open source?

Bansal: There are two reasons. One is the community, definitely: if you look at the Mesos community and the Kubernetes community right now, there is a very big gap. The Mesos community is shrinking and fewer features are coming out of Mesos every day, whereas Kubernetes has a big ecosystem around it, and we want to leverage that. Secondly, we're starting to hit Mesos's scale limits; the event acknowledgement that Mesos does is one example. Since we have started to hit that scale, we want to invest in something more long-term rather than in something people are no longer investing in. Those are a couple of reasons.

Jindal: I can add to this. One of the specific features pushing us away from Mesos is Pod support. We want to run multiple containers in the same Pod; Mesos doesn't have that, and the Kubelet provides it out of the box. There are a couple of other ecosystem-based features, for example SPIRE-based integration. There is already an integration between the API server and SPIRE for identity-based authentication, and our identity system at Uber is going to be SPIRE as well, so we get some of these ecosystem features for free. Those are the two main drivers right now for moving from Mesos to Kubernetes.

Bansal: We could add these features to Mesos as well, but the question is, do you want to invest where you cannot get much out of the community, or do you want to invest somewhere you get a lot of support and a lot of new features? Those are the tradeoffs you need to weigh.

Participant 2: The deployment time comparison that you showed between Peloton and Kubernetes, is the difference due to the image not being available on the node when the new version is coming up?

Bansal: No. We haven't added a pre-fetching feature to Peloton yet.

Jindal: No, that's not the reason. In those benchmarks we eliminated that factor: we used a sleep-zero or echo batch job that didn't require any image download. The difference we're trying to measure is scheduler throughput, not external factors like initializing the sandbox or downloading the image.

Participant 3: You mentioned that Peloton supports multiple cluster technologies. How hard would it be to onboard a new cluster technology, like Service Fabric?

Jindal: If you look at the architecture, there is one component that abstracts the cluster APIs from us. We require some very basic functionality from those APIs: being able to launch a Pod, kill a Pod, and get all the hosts in the cluster along with their available resources, for example the number of CPUs and 128 gigs of memory. For anything that provides those APIs we can write a very simple abstraction, so there's a very simple daemon that abstracts everything out; you just need to replace that daemon to integrate with any new cluster technology.
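A minimal Go sketch of such an abstraction, based on the operations listed above, might look like this. The interface and method names are assumptions made for illustration, not Peloton's actual API.

```go
package main

import (
	"context"
	"fmt"
)

// Resources is the minimal per-host capacity the abstraction needs,
// e.g. 64 CPUs and 128 GiB of memory.
type Resources struct {
	CPUs   float64
	MemGiB float64
}

type Host struct {
	Name      string
	Available Resources
}

// ClusterDriver sketches the small daemon described above: anything that can
// launch and kill a pod and enumerate hosts with their available resources
// can be plugged in behind it.
type ClusterDriver interface {
	LaunchPod(ctx context.Context, host string, podSpec []byte) error
	KillPod(ctx context.Context, podID string) error
	ListHosts(ctx context.Context) ([]Host, error)
}

// fakeDriver is a stand-in implementation used only to show the shape of the
// abstraction; a real driver would talk to Mesos, Kubernetes, or another system.
type fakeDriver struct{ hosts []Host }

func (f *fakeDriver) LaunchPod(ctx context.Context, host string, podSpec []byte) error {
	return nil
}

func (f *fakeDriver) KillPod(ctx context.Context, podID string) error {
	return nil
}

func (f *fakeDriver) ListHosts(ctx context.Context) ([]Host, error) {
	return f.hosts, nil
}

func main() {
	var d ClusterDriver = &fakeDriver{
		hosts: []Host{{Name: "host-1", Available: Resources{CPUs: 64, MemGiB: 128}}},
	}
	hosts, _ := d.ListHosts(context.Background())
	fmt.Printf("%+v\n", hosts)
}
```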

Bansal: There are two dimensions to it. One is the resource management piece: you can add Mesos, Kubernetes, or any other resource management layer to Peloton. The other dimension is how easy it is for any orchestration engine or workload to integrate with Peloton. We provide the APIs, so many orchestration engines go through those APIs to run different workloads on top of these cluster resource managers.

Participant 4: If I'm running only stateless microservices in a cluster and using Kubernetes, would there be any reason to consider Peloton?

Bansal: There are two considerations. One is scale: if you're running microservices at very small scale on top of Kubernetes, then yes, you might be better off not using Peloton right now. However, Peloton has multiple features, such as elastic resource sharing, plus the scale and throughput advantages we see compared to Kubernetes. In the long run, we also want to co-locate batch jobs with the stateless services. If you need features like elastic resource sharing, priority preemption, and other scheduling primitives that are not present in Kubernetes today, or you run at scale, then you might want to consider Peloton. And in the long run, if you want to co-locate all these workloads together, then I would encourage using Peloton.

Jindal: If you are running at low scale in the cloud, then I don't think there's a better solution than Kubernetes. But if you are running on-prem at scale, we found that we ran into multiple challenges with Kubernetes. Some of them we already discussed, and Mayank [Bansal] also mentioned them. We can discuss your specific use case offline and whether Peloton makes sense for it or not.

Participant 5: You mentioned that Peloton can help an organization move from Mesos to Kubernetes. Can we run Mesos and Kubernetes at the same time, with Peloton on top of both, scheduling jobs into both clusters?

Bansal: That's one direction we haven't shown here, but we are actually working on it at the host manager level. If you look at our architecture, the host manager is the abstraction on top of Mesos, so you can plug in many other Mesos clusters as well. In addition, you can also add multiple Kubernetes clusters; we just need to write a different host manager for each Kubernetes cluster, and then you can run them side by side. That's what we are working on right now, along with this architecture, to seamlessly migrate workloads from Mesos to Kubernetes. You can call the hybrid Peloton 1.5, and the Kubernetes one Peloton 2.0.
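As a sketch of what running multiple host managers side by side could look like, here is a small hypothetical Go example. The HostManager interface, its methods, and the cluster names are invented for illustration and are not Peloton's actual types.

```go
package main

import "fmt"

// HostManager is a hypothetical per-cluster abstraction: one implementation
// would talk to a Mesos cluster, another to a Kubernetes cluster.
type HostManager interface {
	Name() string
	Launch(job string) error
}

type mesosManager struct{ cluster string }

func (m *mesosManager) Name() string { return "mesos/" + m.cluster }

func (m *mesosManager) Launch(job string) error {
	fmt.Println("launching", job, "on", m.Name())
	return nil
}

type k8sManager struct{ cluster string }

func (k *k8sManager) Name() string { return "k8s/" + k.cluster }

func (k *k8sManager) Launch(job string) error {
	fmt.Println("launching", job, "on", k.Name())
	return nil
}

func main() {
	// Upper layers could fan out across several host managers, mixing Mesos
	// and Kubernetes clusters during a migration.
	managers := []HostManager{
		&mesosManager{cluster: "cluster-a"},
		&k8sManager{cluster: "cluster-b"},
	}
	for _, m := range managers {
		m.Launch("hello-world-job")
	}
}
```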
