Month: May 2019
Article originally posted on InfoQ.
Thank you very much for coming, everybody. I really appreciate your time. I know there’s an awful lot of much better talks than mine, so if this gets a little bit hazy in about five minutes time, just run for the door. It’s OK, I’m going to speak anyway. It’s such an adrenaline buzz being up here and getting to talk about what Skyscanner has been up to, not because we’re trying to sell a product, like a vendor, but it’s basically therapy for me. I get to come up here and I get to talk about all our mistakes and all our failings and if anybody takes anything from that, then it’s no longer a mistake, so data point. I can go in and feel so much better about it all.
So who am I? I'm Stuart Davidson. I'm a senior engineering manager at Skyscanner and I work for the Developer Enablement Tribe. The Developer Enablement Tribe is an organization within an organization that tries to enable developers. We focus on the services, infrastructure, and platforms that take away a lot of the day-to-day stuff developers have to do to get their product out the door. A lot of that will come up here.
Surviving Uncertainty Zone
I was pretty excited when Ann came and asked me to talk about surviving uncertainty, because it's all about change management, and change management is really exciting. It gets everybody so excited. You'd be surprised how few people are actually excited about change management. It turns out change management isn't the most exciting thing in the world, unless you call it surviving uncertainty, because all of a sudden there's a primal urge there, so visceral. You're surviving, it's life and death, and there's uncertainty. It's like that scene in "Jurassic Park" where the guy is sneaking up on the raptor. He raises his gun and he's thinking, "That's great." And then all of a sudden, the bush next to him rattles and this raptor comes out. I did wonder how well that reference would land. "Jurassic Park" is 25 years old, so I know there are some people in this room who have no idea what it is. If you haven't seen "Jurassic Park," you should. There are lots of pop culture references in it, like "It's a Unix system, I know this" and "You didn't say the magic word". This is me just judging the mood of the room, to see the age and the clientele that I've got and who I'm talking to.
I’m going to make an assumption, everybody here is in technology, has become part of technology, or has stayed with technology because it’s fast, there’s a lot of change going on. In fact, sometimes you get to control that change. You get to code something up, you get to commit it, you get to deploy it into production. You’ve made a change. You’ve made a real change to somebody’s life somewhere around the globe.
There is a sort of enthusiasm in the room for change, but how many of us actually look at change management, and how do we actually manage that change? What effect does it have? Maybe you've got an idea of what the code will do, but what happens to the people that use that code? What happens when that feature rolls out? I would say congratulations to all of you for coming along to this track, because you've taken the right step. You want to find out more about change management. It's not that bad; change management is exciting.
Managing Change In A Large Organization
Skyscanner is going through an awful lot of change at the moment. We have tons of change: we have lots more customers, lots more travelers hitting our website, which is tremendous. We have a lot more people in our company, which is tremendous. We've moved from our data centers into AWS in the last year, and we've changed our source control system, which is something I'll talk about, and that's really only from our area. I complained about this last year: "Oh, no, we're growing again, what a terrible problem to have!" Every company wants that problem, but it is still a problem, and it's my problem because I'm running the Tribe that deals with platforms, infrastructure, and services. As much as I would love to, and in certain circumstances I can, I'd love to just be able to horizontally scale the problem away, to just spin up more servers and it's gone.
Sadly, AWS does have some limits because cloud computing is someone else’s servers. In some circumstances, the things that we have to change are not necessarily how many servers are running, but how we work together. Sometimes, even more difficult than that, it’s how we act as professional engineers to allow Skyscanner to scale, so we’ve got problems technologically and we’ve got problems organizationally, and in some instances we have problems culturally. What I want to do today is talk you through some of the things that Skyscanner does to try and resolve some of that friction, try and help us out.
A caveat on all this: none of the concepts I'm about to talk about are mine. There are far, far cleverer people than I who have written books about this stuff. If you zone out throughout this whole talk, that's absolutely fine. Make sure you get a photo of this list of books and go and read this stuff, because these six books, to me, are the core of how you manage change in a large organization. However, I do promise you that all the anecdotes I tell about how Skyscanner takes the ideas in these books and implements them are factually true.
I can’t really talk about Skyscanner and how it survives uncertainty without talking about how Skyscanner is organized.
Tribes and Squads-Spotify Summary
Can I get a show of hands of who knows about the Spotify model? OK, so it's a pretty common concept. This is the first thing that people think of when they think about the Spotify model. Skyscanner, for reference, has been working with the Spotify model for about four years now, so we've had it for quite a long time, but we have adapted it over time to our own needs. We found over time, again, that there have been some flaws in the model, but let's walk through it very quickly. Squads are groups of 6 to 10 people, so a Squad is basically a team. Tribes are collections of Squads focused on delivery, so you have many Squads within a Tribe. Your line management is held within a Chapter, or this is how I understand it to be: line management is held in a Chapter of subject matter experts in a particular area that cuts across Squads. Then you have Guilds, which are cross-company interest groups, essentially.
I think that's the first thing people think of when I talk about the Spotify model: how people are organized into teams. Actually, it's not the most important part of the Spotify model. I think this is [pointing]: how you empower these teams, these entities, and how they work together. This graph is from another part of the Spotify model video, and if you haven't seen it, it's a good one. It talks about the crossover between alignment and autonomy. You want your teams to be aligned, but you also want your teams to be autonomous. If they are neither, then they just aimlessly wander around. If they have alignment but no autonomy, then they've got a boss, a directive boss, who's pointing at them and saying, "We need to cross the river, build a bridge," regardless of what's going on, and they follow the boss.
In opposition to that, if you have autonomy but no alignment, it's brilliant, it's a holiday. Autonomy without responsibility is a holiday; you can do whatever you want. You can build spaceships, you can use Kotlin, you can do whatever you like, but somewhere there's probably a manager, someone who's held responsible, thinking, "I hope somebody is thinking about the same problems as I am." Alignment and autonomy together are the key, if you can get both at the same time. You've got "We need to cross the river, figure out how," so empower that Squad! If you tell them to build a bridge, and they run over to the shore, and there's a boat there, and they think, "OK, we'd better build the bridge then," that's not what you want. You want these Squads to deal with uncertainty, because you don't know what's at the shore. You don't want to give them one big plan, because you don't want them focused on one path and one path alone. You want to set them up to deal with uncertainty.
Tribes and Squads-Art of Action
Skyscanner took this even further and we follow a book called "Art of Action," written by Stephen Bungay, who's a military historian. You may have known about the Spotify model, but I doubt many people in this room know about "Art of Action." Because he's a military historian, Stephen Bungay starts off by talking about the wars of the 17th and 18th centuries, where everybody was formed up in regimental ranks and it was all about foot drill, just to get people to be centralized and to all move as one from a centralized point. Then the Prussian Army, led by von Moltke the Elder, came in and decentralized it all. He said, "As long as my field commanders follow my intent, they essentially have autonomy. They can do what they want if they're following my intent." And that's what happened: they took the initiative and they routed all these armies. The Prussian army was the best army in Europe. That was chapter one; the book doesn't go on about military history for the whole thing, but I thought it was a great start, I thought it was really engaging. Then he starts talking about how large organizations deal with the same challenges as armies had in the 17th and 18th centuries, and how we're still learning the lessons that were learned back then with blood and steel. We're still making the same mistakes centuries on. It's madness.
He's got this nice model to explain what he's trying to say, and we think it's just great. You've got these outcomes, these things you're trying to accomplish, and to get that outcome, you decide to make a plan. You give that plan to people, they action that plan, and the outcome of those actions is hopefully the outcome you wanted. However, there are gaps between these. The knowledge gap sits between the outcome you want and the plan that you write: there's often a gap between what you know and what you would like to know, and trying to build a plan on something you don't know about leaves a bit of a gap there. Now you've written a plan and you give it to people, and maybe they do something that you didn't expect them to do. The alignment gap is the difference between what you want people to do and what they actually go and do. Finally, these people have carried out your actions, but did it have the effect you wanted? The effects gap is the difference between what you wanted to achieve and what was actually achieved.
These are the three gaps that Stephen Bungay came across; he's identified them as the core parts of how you run a large organization. (This book is massive, and these topics can go on for days; I'm condensing it down, and I really hope that you go and read more about it later on.) Now, your knee-jerk reaction to all of this is normally the following.
If you have a gap in your knowledge, your knee-jerk reaction is to do more study, to go and figure out more before you write your plan. You want this plan to be awesome, you want to detail this plan as much as you can. So for the alignment gap, to prevent people from going off and doing things you didn't want them to do, you make the plan more detailed. You've got all this knowledge, so now you spell the plan out exactly, step by step. Make more detailed plans. For the effects gap, between what you wanted to happen and what actually happened, the knee-jerk reaction is to grab more control, to say, "Oh, you did that wrong. Come here. Let's have a governance meeting. Let's have maybe one every day for an hour, and you can talk me through exactly what you did, and we'll review it and make sure that you've done the right thing." That doesn't sound like the way to run an organization to me. I certainly wouldn't want to be the person doing the governance.
What Stephen Bungay suggests is a different approach to all of this, and this is how we try to do things in Skyscanner. Instead of looking for more information, limit what your knowledge gathering is about, in order to define and communicate your intent. Just like in the previous set of slides: I don't want you to build a bridge, I want you to cross the river. That's the intent. Instead of writing really detailed plans, you delegate: you defer how things are actually implemented as far down as you possibly can, and allow each level to define how they will achieve the intent. Then there's back-briefing; we'll talk very quickly about that in a bit.
Then you've got the effects gap. This is where we survive uncertainty. You allow people to change their minds about what they're going to do. You don't jump on them as soon as they change things, because maybe they know something that you don't. Provided it's aligned to the fundamental intent of what you're trying to do, that's great. That's a great thing to empower your Squads to do.
Strategy Briefs are the way that Skyscanner deals with passing work down from level to level. We talk about intent, and then we empower the Tribes, and then the Squads underneath, to decide what they're going to do. A back-brief is about going back up again and making sure that the intent is correct: we've made the decision about what we're going to do, but we'll take half an hour just to make sure that we're aligned, and that maybe the boss hasn't forgotten to put something in the intent. That little iteration is key, and it has saved us so much wrong work; it has saved us weeks and weeks of wandering off and doing the wrong thing.
We've also found that maintaining line management within Squads is better. It allows teams to form and be long-lived and co-located, with a line manager in them, so that they start to build that psychological safety that Andrea talked about so much in the previous presentations. That psychological safety is key to having really high-performing teams, even though it goes counter to what the Spotify model says about keeping line management in Chapters.
I'm not going to define SLIs, SLOs, and SLAs. If you know what they are, Skyscanner is starting to use them and the initial signs are good. It's a way of being clear about what to expect between services. If Squads are building services in this Tribes and Squads model, Squads will be reasonably autonomous, so they may build services separately. If one service depends on another, it's a good way of defining that relationship and what to expect from one another.
We're seeing really good behaviors from that, and it's taught us a tremendous amount about our dependencies. For the first time in a long while, we've actually looked at what we're dependent on in order to define our SLIs and SLOs, and we've found circular dependencies: we found that the tool we use to deploy our infrastructure was itself deployed on the very infrastructure it was going to deploy. SLIs, SLOs, and SLAs have been really good for us, and the "Site Reliability Engineering" book from Google is the ideal reference.
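To make the vocabulary concrete for readers of this transcript, here is a minimal sketch of an availability SLI measured against an SLO, with an error-budget check. The function names and the 99.9% target are illustrative assumptions, not Skyscanner's actual definitions.

```python
# Hypothetical sketch: an availability SLI checked against an SLO.
# Names and thresholds are illustrative, not any company's real numbers.

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: the measured fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return successful_requests / total_requests

def slo_met(sli: float, slo_target: float = 0.999) -> bool:
    """SLO: the target the SLI must meet, e.g. 99.9% availability."""
    return sli >= slo_target

def remaining_error_budget(sli: float, slo_target: float = 0.999) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched)."""
    budget = 1.0 - slo_target      # how much failure the SLO allows
    spent = 1.0 - sli              # how much failure actually happened
    return max(0.0, 1.0 - spent / budget)

sli = availability_sli(successful_requests=999_500, total_requests=1_000_000)
print(f"SLI: {sli:.4%}, SLO met: {slo_met(sli)}")
print(f"Error budget remaining: {remaining_error_budget(sli):.0%}")
```

The SLA would then be the contractual layer on top: what is promised to the consumer, usually looser than the internal SLO.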
This is what Ann wanted me to talk about, and it's the reason I'm rushing. Continuous Deployment in Skyscanner has been around for over two years. It defines what we do in Skyscanner. We deploy as much of our code as possible using Continuous Deployment: we do config deployments, we do infrastructure deployments, all through Continuous Deployment. Let me very quickly go through the concept of what it is.
Continuous Delivery is where, for each change you want to apply to the platform, you build it, you test it, and then you save it somewhere, and a human decides whether to push it into production or not. That might mean there's a time delay: your engineer might wait for a particular time or event on the platform. You might take each release through a testing phase, you might go into test environments, you might go through your QA team. Continuous Delivery is a very, very good thing because it means you can release any bundle of code, but we've taken it one step further and we follow Continuous Deployment. It means that every single change that's committed or merged into your master branch is deployed into production, provided it passes the automated tests. Once you've merged into master, there's no human interaction. It goes straight into production immediately.
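The distinction can be sketched as a tiny decision function. This is a hedged illustration, assuming hypothetical build, test, and deploy hooks; the point it shows is that under Continuous Deployment there is no human approval step between a passing master build and production.

```python
# Minimal sketch of the Continuous Deployment rule. The build/test/deploy
# hooks are hypothetical stand-ins for a real pipeline's stages.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Change:
    commit_sha: str
    branch: str

def continuous_deployment(change: Change,
                          build: Callable[[Change], bool],
                          run_tests: Callable[[Change], bool],
                          deploy: Callable[[Change], None]) -> str:
    if change.branch != "master":
        return "ignored: not on master"
    if not build(change):
        return "failed: build"
    if not run_tests(change):
        return "failed: automated tests"
    deploy(change)  # straight to production: no manual approval gate
    return "deployed"

# Illustrative run with stub hooks:
deployed = []
result = continuous_deployment(
    Change("abc123", "master"),
    build=lambda c: True,
    run_tests=lambda c: True,
    deploy=lambda c: deployed.append(c.commit_sha),
)
print(result, deployed)  # deployed ['abc123']
```

In Continuous Delivery, the same function would end with a pending-approval state instead of calling `deploy` directly; that single line is the whole difference.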
Who works at a company that does Continuous Deployment? There are a few people that practice it. I can't think of any other way of working; I'm a total convert, and I do realize that there are certain software setups where you can't do Continuous Deployment. But in the vast majority of cases, there must be a way of doing it. I have engineering managers and engineers coming to me and saying, "You're mad. What a terrible risk you're taking, automating this and throwing every change into production like that, you're insane. How can you survive such uncertainty?" Bringing it back to the track.
I would say, "Well, actually, let's look at this in a different way. Would you rather take all of the changes you've done over a week, all your Squad's changes over a week, all your Tribe's changes over a week, and then throw them onto production and see what happens? Or would you like to take each change, put it in production and see what happens, then take the next change, put it in production and see what happens, and so on?" Of course, you want to do the second one: you want to put a single change into production and see what the effect is. You don't want to launch 100 changes into production and just hope that it works.
Yet that's the way that companies are working today; it's madness. The other way you can think about it is this: if you come across a problem whilst deploying into production, would you rather roll back a week's worth of work, or one single change? Seems pretty obvious to me, but that's still the way that some companies work. They would much rather roll back a whole week's worth of work than one change. Talk about uncertainty: you've got a week's worth of changes in a release that's been rolled back. That means your production environment is one week behind your test environment, which, depending on how long it takes you to test, is maybe another week behind what your developers are developing. That's uncertainty. What we do in Continuous Deployment is the sensible approach.
The third thing I really want to talk about, because it's my biggest bugbear. You think I was passionate before, but this is the thing that really gets to me. There are some companies in this world that expect a graduate with an out-of-date Word document to get out of bed at 3:00 in the morning and deploy the code into production. If it goes wrong, they have absolutely no support, because it's 3:00 in the morning, they're tired, and you've asked them to put text in a text box. It's just madness the way that some big organizations try to operate. I would personally rather work in an environment where we can do 20 deploys a day, where it's ubiquitous, where it's just a thing that happens.
We deploy so frequently that it's common. It's not a big deal, and that's a strategic enabler. Not only are we comfortable about deploying, we're comfortable about rolling back as well. We're comfortable about applying changes to get out of a set of problems. We're comfortable about applying them during the business day, so we can do it when everybody's awake and aware and paying attention and can really understand what's going on. That's the way to operate software, so when you say surviving uncertainty, I think Continuous Deployment is certainly the approach I would take if I had the chance.
One of the books that talks about high-performing organizations is "Accelerate"; it's another one that I like. Jimmy and I have already had words about it and we will continue to have words in the bar later. It's based on a survey called "State of DevOps." These very clever people have gone away, taken all the data sets, analyzed them, and compared the capabilities of certain organizations in terms of Continuous Delivery, Continuous Deployment, and all sorts of other things against how well they're performing in the market: how much money they're making, how well their shares are doing.
There's a correlation between how quickly companies can get into production, how quickly they can make changes, how quickly they can recover from changes, and how well they perform. It's a shame that a big study has had to take place for some people to be convinced. Being able to recover from change quickly seems like a good sign; that seems like the thing you'd want a high-performing company to get behind.
Source Control Changes
Migrating a source control system in a company isn't something you do through Continuous Deployment. You have to really think about it, this change in particular, when we moved from GitLab to GitHub in Skyscanner. That's the first and last time I'll mention the vendors. This was it, this was a change that I really wanted to make for a variety of reasons, and I will talk to you in the pub if you want to hear more about it later on, over a drink or a coffee.
We had a license coming up for renewal in August, and this was January. I had a time constraint, and I had to make a change that affected every single engineer in the business. And I had to do it at a time of great change in the organization anyway. Like I said, we'd moved from our data centers into AWS; there were lots of changes ongoing, lots of projects, so it would have been pretty risky to get involved with. And you're playing with source control, so we had the source of Skyscanner in our hands as well. If we didn't migrate it properly, well, my P45 is heading towards me very quickly if I lose the source code of Skyscanner.
I had to take a different approach to how I was going to migrate our source control system in Skyscanner. I went back and looked at all the migrations I'd carried out over the last two to three years, because, like I said, in Skyscanner we change all the time. We change tools all the time, and as we scale, we find things don't work and we change them, so there have been a lot of migrations over the last little while. I went through and picked out the things that really worked, and the things that really didn't, and I prepared a list of things I was going to do in order to make this migration happen. It did happen, and it happened flawlessly, on time, and on budget.
Every engineer in the company was affected, every engineer in the company was delighted, and it worked so well that I got absolutely no praise for it, because everybody just thought it was routine. That is a great place to be, as well as a slightly frustrating place when it comes to review, but it's a great place to be. This is my list, and I'm going to share it with you today, because this is the key to doing large migrations in a company.
Source Control Migration
These are the 10 tips. Let's start with laying the groundwork: leadership and buy-in. You want to talk to the key influencers in the company. I went to every single Tribe Engineering Lead and told them that we were going to migrate our source control. 80% of them didn't really care, but that's fine. I sat down and talked to them; I'd prepared a document that covered, specifically, the management and operational issues at a sort of strategic level. I talked about all the benefits we were going to get, and I put it in front of them and said, "We're going to make the source control change." And they went, "Yep, fine, cool."
Leadership and buy-in mattered. When I got further on in this approach, knowing that the Tribe Engineering Leads had all said, "Yep, that's cool," was a really important part. Not one of them could go, "Oh, there's lots going on. Not sure we could do this." So, prior to it all, I went around the leadership and got buy-in. That's where it started. And I say it so simply because it was pretty simple. I don't know if that's just unusual to Skyscanner, but it did seem pretty simple, provided I put this document in front of them and explained why it was important to them.
Unity of purpose, number two. This is the thing that trips up a lot of people. When you're making such a vast change across the whole organization, you want to make sure that you're serious about it. Don't just muck around, "We'll just change this a little bit and this a little bit," because it gives people an excuse to hope that the old system is going to stay. That way, they don't need to worry about your new system, they don't need to worry about this change that's coming up; they would much rather defer it. So you need unity of purpose. A source control system sits at the nexus of everything that's going on in engineering: you've got your CI, you've got Slack integrations, you've got all sorts of weird and wonderful scripts. Our internationalization was version controlled in our source control system, and a lot of config was version controlled in there as well. There was a lot going on, and we had to have a plan to change every single component of it and make sure people were aware of that, because that unity of purpose, that strong intent, that direction, made the company understand that this change was happening.
Understand the technical landscape. That ties into what I was just talking about. Understanding who's going to be impacted, understanding the technical landscape of where you're working, was key. We could identify, right at the start, the people who were probably going to be our biggest blockers, who were going to push back on it, and get in front of them, and put solutions in front of them before they could say no and give us problems. We understood the technical landscape before we did all this, before we wrote any migration code whatsoever. We were laying the groundwork, we were getting people excited about it. But also, make the migration simple. When we started to write some code, we could have just given people a document, because it's source control, it's Git: "Right, step one, type this command. Step two, type that command." Instead, what we did was write a migration service that took all of the source code from an organization and put it into the corresponding organization in our new source control system, but also updated all of our continuous integration tools and all of our documents. It just took all of the sting out of this change.
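The migration service itself was internal, but its shape can be sketched. This is a hedged, minimal outline with hypothetical stand-in hooks (`mirror_repo`, `rewrite_references`), not Skyscanner's actual tooling; the point it illustrates is that one automated pass both moves the code and fixes everything that points at it.

```python
# Hypothetical sketch of a source control migration service: mirror each
# repository's full history to the new host, then rewrite the CI and
# documentation references that still point at the old URLs.

def migrate_organization(repos, mirror_repo, rewrite_references):
    """Migrate every repo in an organization; return a per-repo report."""
    report = {}
    for repo in repos:
        new_url = mirror_repo(repo)                   # push history to the new host
        updated = rewrite_references(repo, new_url)   # CI configs, docs, scripts
        report[repo] = {"new_url": new_url, "references_updated": updated}
    return report

# Illustrative run with stub implementations and made-up repo names:
report = migrate_organization(
    ["flights-service", "hotels-service"],
    mirror_repo=lambda r: f"https://new-host.example/org/{r}",
    rewrite_references=lambda r, url: 3,  # pretend 3 references were rewritten
)
print(report["flights-service"]["new_url"])
```

The per-repo report is what makes the "hour of your time" promise below credible: the team gets back a record of exactly what moved and what was rewritten.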
It meant we could go to each team and, instead of saying, "Here's some work to do," we would say, "Oh, can we just get an hour of your time? What meetings have you got in the next week?" "Oh, well, we've got this hour-and-a-half design review." "Perfect. We'll take it then. We'll migrate your source control during that meeting, and when you come back out of it, it will all be good to go." No one can get in the way of that, no one can argue with that. Some people tried, but we made it so simple that people couldn't say no.
Neutralizing blockers. People still said no, no matter how easy I made this, so what we did was basically promise to fix it for them. There was one team in particular who said, "We can't possibly make this change, we're so busy, we've got all these things going on. Can't possibly make this change." So we said, "Fine, we'll go into your code base and we will make the change for you. We'll do that, no problem. We will work every hour of every day to make your change for you." I'm hoping some people from that team will watch this. Thanks, though. Yes, we tackled real blockers; we neutralized them with enthusiasm, with effort, with sweat. We got into the challenge, into the problem space, and we made it work.
Often people came up with flimsy excuses not to migrate. The number of times people would say, "Oh, we can't possibly do this because of X." And we'd go, "Oh, well, we've looked at X. That's not a problem at all, it's fine." Then they'd go, "OK, well, what about Y?" "Oh, we looked at Y overnight and we've got the solution." Neutralizing blockers, tackling things.
Gain and maintain support for the migration. Just because you've laid the groundwork doesn't mean that you should stop pushing it along; there's a momentum thing here, an initiative. As soon as we started to migrate people, we did interviews with them. We did one-to-ones with them and said, "So, Mr. X, how did the migration go?" And they'd say, "It was great." We'd take that and put it in a blog post. These little case studies not only kept people supportive, because we put them in a good light: "You've supported a really important initiative in Skyscanner. How does that make you feel?" "Oh, that was great, yes." We'd take that and put it in a blog post, and people started to get the idea that this was a good thing. They started to see the company migrating, and then they started to think, "Oh, wait a minute. I want to be migrated, too."
There was a momentum that began to build. It got to the stage where we couldn't migrate people fast enough, because they all wanted to get over to the new system. That was key, and it was a really simple thing to do: talk to people, write it down. It really, really worked.
Operate in accordance with the production standards. We did talk to the managers, and the managers all went, "Yes, that's fine." But there's also an engineering path of senior and distinguished engineers, and they've set the production standards. There is autonomy in our Squads, but we still have production standards that each Squad should adhere to in order to get code out to production, things like unit tests. They're pretty obvious, but we've written them down so that we all follow them. We operated in accordance with the production standards just in case someone took umbrage at us doing this work, so we could point to the production standards and say, "Look, all of our code is production-standard ready."
We also aligned ourselves to things like SOX compliance, which we're really glad we did, because we didn't really think about it at the start, but then it became this massive thing. Later on down the line, while we were migrating, we were like, "Huh, we're already covered." Think about the laws and the rules that are coming up and really align yourself to them.
Be transparent. Oh, man, the amount of documentation I wrote, the amount of blog posts I did. I was as transparent as I could be, and people were so bored of this thing by the end of it. You're already getting bored of me talking about it; think about people living through 10 weeks of this, of my enthusiasm, of me getting in people's faces saying, "We're migrating people. This is exciting." I had this Confluence document that listed every single Squad and every single repository, when they were going to migrate, and whether they had migrated. The whole thing was transparent to the whole company.
Again, you started to see the shift. The early adopters were a small block, then we had a couple of people that were scheduled, and then all of the people who were unscheduled. As the early adopters started to build and build, and as more and more people got scheduled, the rest of them started to get nervous that they weren't scheduled, and everything started to stack up, so the transparency formed a momentum of its own. I really, really advise that you're as transparent as possible when you're doing these big changes, so that people aren't nervous about it and they can understand what's going on. We also shared the code, to allow all of our engineers to figure out what was going on.
Prepare for the long term. This is a contentious one. I had a license expiry date of the third of August 2018. The easiest way to put teams on the defensive is to say, “You must migrate by the third of August 2018,” because that’s a date. They can always say no to a date. They can always find some reason: “Oh, I’m going on holiday, I can’t possibly make that.” Instead, what I did was hold the narrative of “Yes, the world will continue, but what will the world look like past August the 3rd, 2018?”
I suggested, I didn’t write it down, but to anybody that was giving me any friction, I suggested that on the 4th we would form a guild of people who hadn’t migrated, and they would own the old tool. They would be the ones to run and maintain and manage this tool. We would assist them with advice, and we would pass over the documentation and the sales executive who ran the previous account, and they could figure out budget and how they were going to maintain and operate and backup and restore and all that sort of stuff.
We didn’t set a hard limit, but we did talk about what the long term looked like if they didn’t adhere to the migration. That changed everything: it gave them a decision rather than a point that they could disagree with. Nobody hung around. They were like, “I’m not doing that. Are you kidding?” They didn’t say it like that, but, “Oh, yes, I can see the cost benefit.” Preparing for the long term is a key one. Don’t put dates in front of people; it’s too easy to say no to.
Final point, and this aligns to everything that we do: learn and iterate. We had an approach, it was doing OK, but we weren’t fast enough, so we made some changes to how we migrated people and we got faster. Learn and iterate. This is me learning and iterating on this set of rules, and I will continue to learn and iterate on it.
These are the books again. I really strongly recommend that you look at these books if you can, because surviving uncertainty is, just like Jamie talked about, all about being prepared for change; change is a constant. You’ve deliberately thrown yourself into a realm that’s filled with change. You should be prepared for it rather than trying to avoid it. “The Art of Action” I have talked about. “Site Reliability Engineering” talks about SLIs, SLOs, SLAs and a lot of other really great stuff. “Accelerate” we’ve talked about. These three at the bottom we haven’t.
“The Phoenix Project” is all about the theory of constraints; it’s a great book. I’m not going to spoil it for you, but it does read like a novel. It is really easy to consume and I highly recommend it, because you’re going to learn so much about IT. Next, “The Lean Startup” and MVPs. Our Squad leads are each given a copy of “The Lean Startup” because it talks about an iterative approach to your development. It’s all about doing the bare minimum to get to a level where you can learn something, and then iterating and iterating. It’s a key part of dealing with change: at the end of each iteration you’re at a point where you can deal with change. You don’t have a six-month project that will suddenly die. Finally, “Turn the Ship Around.” It’s all about delegation; it’s a wonderful story about people in a submarine. You’ll love it.
Questions and Answers
Moderator: The timing is perfect. Thank you very much for that, Stuart, very useful insights into how you do things at Skyscanner and how you continue to do that. Do we have any questions for Stuart?
Participant 1: I really liked your presentation, and mostly the approach on how the change related to those Git repositories and source control. How do you tackle a challenge where something gets imposed on you without anything like what you presented?
Davidson: How do I deal with a change that’s imposed on me without having any of the stuff up there? Sadly, no matter what you do, some of the stuff up there are things that you can’t apply. These are books; you can get them on your Kindle. These are ideas, and ideas are the most powerful things. Understanding the context, I think, is the key. Understand why this thing is being imposed on you. It might even be down to the manager: how stressed they’re feeling, how much pressure they’re feeling, why they’re imposing something on you. Understand the context and the why, and that will lead to the what, and that will lead to the how. If you’ve only got the what, or even the how, start working back up the chain: what is being asked of you, and why?
Participant 2: I got a question in regards to your title. You are Senior Engineering Manager and I’m curious to see how it fits into that Spotify model that you explained. How many engineering managers do you actually have and what’s your day-to-day job?
Davidson: Change is the only constant in Skyscanner. My job has changed, I don’t know, two or three times over the last little while. Right now, my job is working for the platform Tribe. There’s myself, there’s a guy called Rob Harrop, who has talked to you various times, and a guy called Paul Gillespie. The three of us are running the platform Tribe, and we have, at the moment, five Squads that work for us, so each Squad has an engineering manager. I’m a senior engineering manager, but I’m relatively junior compared to Rob, who’s a senior director, and Paul Gillespie, who’s our Senior Principal Engineer. We have an IC track and we have a management track, so Paul’s our senior principal engineer, our technical lead, and Rob is our management lead.
My day-to-day is around the Tribe health, so a lot about the psychological safety. Do we have enough people? Are we doing the right things? Just keeping people honest, I’m the Tribe conscience and sometimes the Tribe motivator. I don’t know if you can see that from sort of how I bounced around. That’s my day-to-day job, trying to get people focused on the right things and trying to make sure that people are motivated.
Participant 3: Just for reference, how large is the total population of engineers at Skyscanner roughly?
Davidson: No. The reason I’m so cagey about that is that I see no reason why I can’t tell you, but I have been told in the past that it is not acceptable to talk about it. It is one of the metrics that our competitors look at, and that we look at in our competitors: how many people are employed compared to how many engineers there are in the organization. There are 1,200 employees in Skyscanner. There are an awful lot of engineers.
I have my own slides that I did; it was about a week of preparation, and as usual, I left it to the last minute. I sent Greg my slides last night, and he said, “You have way too many slides and way too much text on them.” So I’m going to have to move fast, but I wanted to see if you guys could help me figure out which content to do. I couldn’t do cases slides because the third slide had a giant Lego penis on it, and I just could not figure out how to work that into my presentation. So we’re going to go with something a bit more corporate feeling, I think. Before I get started, who here by show of hands has worked with blockchain?
A few of you, not many, actually. I think I’m used to audiences that are all blockchain. And of those few, have any of you worked with Corda? A couple. One out of three employees has. That’s good, I’ve got a fairly fresh audience, which is great. I would adjust a little bit what Greg was saying in the sense that, if you know us at all, which I wouldn’t expect you would, our origins are in finance, but we’re finding ourselves moving quite quickly outside of finance into other use cases. And that’s what I was actually going to start with, because I think in the industry we’ve come of age in a very strange time: it’s very press-driven, very social-media-driven, and the perception is actually somewhat ahead of the reality, which I would openly acknowledge.
The first question most people tend to have is: why even use a blockchain? I wanted to spend a couple minutes on that and give you an idea of some of the use cases we’re doing today. So it’s a little out of order, in a sense, but it’s probably more in the order of the questions you might have. Second would be: why build a new blockchain? There are already platforms out there, so I’ll try and answer why R3’s members set out to build our own platform. Then the next is getting a bit more technical for you guys: an overview of the platform. How is it different from other blockchain platforms that are out there? How does it differ in just the basics of how it works? And then last, as promised in the description, what’s the future for Corda, or blockchains in general, as we see it.
Why use a blockchain?
I’m going to jump in quickly. As promised, I’ll move fast. Why use a blockchain? I’m using our corporate slides to tee this up. I’ve got three use cases that we’re looking at today. Of those, two are ISVs, and each has different drivers around why they’re using it. Frankly, some of this just isn’t driven by technology for technology’s sake. I think we’ve seen a lot of that. There are a lot of use cases out there where, when I see them, I think, “Gee, wouldn’t you be better off with a database for that?” And it is kind of true. But in a lot of cases, there are places where a blockchain absolutely does something that no other technology can do.
Finastra is an ISV, if you’re not familiar with them; I think they’re the third largest in the finance space. Finastra is using Corda for a couple of different use cases, the primary one of which is what they call LenderComm. It’s a lending platform, obviously: large corporates go to seek funding, they go to agent banks, and the banks spread out the risk and gather up the loans. Now, this is a great basic example of where blockchain is good, because you have a set of entities who compete with each other, but they need to cooperate around something. Finastra already has their technology in these companies.
They have a version of this, but it doesn’t allow those participants to work together; they effectively have to come back to Finastra, or come back to the existing technologies, to do that. What Finastra is doing is what we see with a lot of ISVs: they have a large customer base who all act on a particular solution for a business problem, and they’re discovering that they can actually start connecting them and doing that new thing. That’s exactly what Finastra started with LenderComm and a couple of other use cases we’re looking at.
The next is B3i. Finastra, by the way, obviously works a lot with financial institutions, a lot of banks, fairly large networks; we’ll talk a bit about the size of the networks towards the end. B3i is a consortium. It was started by a group of reinsurance companies. B3i is interesting for us because, when you see a little bit of our history, their name’s quite similar, and initially we were thought to be competitors. We’re actually very close collaborators today, largely because B3i is trying to solve a very specific problem and we’re trying to build a very abstract platform technology that can solve many problems. B3i’s problem was reinsurance. Again, this is an industry where very large organizations have to collaborate.
Reinsurance is all about taking insurance and spreading out the risk. Reinsurers have to do a lot of things to coordinate with each other. One is they have to balance portfolios so that they don’t double insure something and inadvertently increase their risk; they always have to have an overview of what’s occurring in their portfolio. But they need to work with insurers and other reinsurers in cooperating. A very basic thing: when somebody makes a claim, they have to go figure out who’s actually going to pay for it, and they have to start collaborating or coordinating with each other. For R3 in particular, insurance is probably the area that’s going fastest, actually.
Banking is an area that had early traction; the use cases are fantastic for banking, and finance in general. But insurance missed out on the last technology wave. As I’ve discovered, most industries in the world tend to run on email and spreadsheets, so they have the advantage now of being able to leapfrog into this. B3i is one of the consortia that’s taken this technology; they have a large set of participants, very large organizations, implementing it today.
The last is TradeIX. TradeIX is somebody we work with closely. They’re a startup, so a different breed from the other two. What TradeIX is doing is working in trade finance. Think of supply chains: these long, fragmented chains of companies that have to coordinate with each other. But it’s almost like a baton race: they pass documents, they pass goods, and they don’t have much more than proof that they have passed them and faith that they should trust who they’re passing them to. So there is a lot of trust in these supply chains. Supply chain in general is a fantastic use case for blockchain, because a lot of what blockchain does is give you the ability to trust the information, because you can see the provenance of it. Somewhat cliché, but when you get into something like trade finance, you’re looking at an area that is very inefficient today; it’s very paper-based. One of their big challenges is double financing, as an example. It’s fraudulent, but people go out and get financing from one bank, then go to another bank and finance the same thing again. That right there is an ROI for the banks that are working with TradeIX.
Now beyond that, it’s actually changed the economics of what they can do in trade finance, because it’s made it cheaper, dramatically cheaper, for them to get access to small and medium-sized organizations seeking financing. Those are just three examples of how the technology is being used today; I wanted to stay as much in the technology as possible, considering the audience. These make sense primarily from a business perspective; there’s an ROI that is very apparent for them. I’m going to do my best to keep track of time.
Why build a new blockchain?
Why build a new blockchain then, if you believe that you need a blockchain? When R3 set off on this, we were actually not a technology company. We were more of a services company. We had a group of member banks, brought together in what was externally thought of as a consortium, to evaluate blockchain against finance and to determine: is this a threat or an opportunity, and if it’s an opportunity, what is it good for? We spent a long time running projects, taking use cases from the banking industry and applying the technology to see how well it would work or not. The outcome was that we were fundamentally missing two key characteristics. I’ll run through those. The first: if you think of any distributed ledger, and I use blockchain and distributed ledger interchangeably, mostly in a marketing sense, not a technical one, the idea is that we have a shared ledger. That’s the basic premise, so that when I get something from you, I can look at the global ledger and understand if it’s valid and unique. Those are the two questions you always have to answer for yourself.
There are some obvious problems in this, the first of which is privacy. If I have a copy of all the transactions, which are needed to assure myself that the ledger has integrity and what you’ve handed me actually matches something on that ledger, then I can see everything that’s happened with my competitors. This is a very big problem, obviously. It’s not just a practical problem. If a group of banks got together and they said, “Hey, we’ll just accept the fact that we can see some of each other’s transactions,” it’s actually a regulatory problem. A lot of our use cases could not use a ledger where other transactions were visible. That’s the first problem.
The second problem is finality. If you’re familiar with Bitcoin – is everybody here familiar with Bitcoin? Show of hands? How many of you still have Bitcoin? A few of you. How many of those wish they’d sold last year? So the problem with Bitcoin is finality. You get a group of transactions, effectively batch-processed, put into a block, and that block propagates around the network. There’s a likelihood that that block will conflict with another block that’s been mined. The network has a protocol to determine which of the blocks to drop, but the reality is that one of those blocks could be dropped.
You don’t have certainty that your transaction is final. That’s a very big problem in finance, obviously, because without finality you can’t move on to the next transaction, which is a very basic requirement in financial scenarios, if not other industries. Our members, a group of, I think, about 42 banks initially, kicked this work off and started a technical working group. Each contributed technical members from their organizations, and we kicked off the design of a ledger that would try to meet these criteria. That’s what Corda is. Those are fundamentally the two things Corda did.
One of the other things we attempted to do from the beginning was to aim for productivity, and this is really looking at it from a TCO perspective. We made some very basic decisions, which I’ll cover: if a problem has already been solved in the industry, don’t re-solve it. Don’t try and reinvent everything.
I’m only 10 minutes in; I’m doing quite well, maybe too quick. Let’s jump into Corda itself. If you look at a blockchain network, really any of the blockchain networks, there are a couple of things. One is the diagrams we always see: they look like these neat networks; the hairball is my perspective of it. They’re peer-to-peer systems – anybody can connect to anybody.
And with Corda, by the way – Corda is similar to Fabric in the sense that the boundary of the network is effectively the root certificate authority. These all most likely run on the open internet or leased lines, but that’s effectively the logical boundary of it. There are three areas that I tend to group this collection of technologies into. One is the ledger, the second is the transaction itself, and the third is the network at large. I’ll cover those off quickly in that order. When you look at the ledger, there are a couple of characteristics that are important. Immutability we hear a lot about. Immutability is an interesting one, because that is something you can just get from a database; you would have seen Amazon announce immutability in the database with QLDB, I think it is.
Being somebody who came from the database industry, that always struck me as a fairly simplistic thing to do; I think a lot of people in this audience could go build an immutable database system. However, it is an underpinning requirement of the platform. Shared facts are quite important. This is effectively what the ledger brings you: I have a copy of information and you have a copy of information, and we can assure ourselves that it’s identical. It’s not only identical, it’s legally binding. That information has been signed by you, so I can verify and prove that you accepted a particular transaction, as well as the information in that transaction.
The next one is the transaction itself. If things stored on a ledger are somewhat static, how do entities transact with each other? The transaction has a couple of things. One is shared logic; in a way, it’s the business logic. We actually split that into two components, which I’ll show you in a minute. But you can think of it like this: you have to have the contract, which tells you, as the person operating this node, how something will transform on the ledger. How does the state change from one state to another? What are the rules behind that? That’s quite important, because you’re running it: you’re going to trust the code that executes that transformation. That’s where you bring your trust back to.
The second is universality. I won’t talk too much about that. If you go back to those early examples, like insurance, with all these big networks working together: you don’t just need shared logic around that transformation, you need everybody to use the same logic. If I represent cash in a particular way and my counterparty has a different contract for cash, well, they’re not fungible; they’re not interchangeable with each other. This is, again, less of a technical challenge and more a matter of coming back to standards and other things that exist in a lot of areas today that we can just adopt. Finality we talked about.
The last area is the network. Carolyne [Quinn] is going to talk about the network. There are a lot of characteristics. A really important one is being as open and as big as possible, because otherwise the boundary of what you can transact on is limited to the number of participants in that network. Identity: PKI has always suffered from this. I need to be able to discover a peer on the network and assure myself that that peer is who I believe it to be.
Scalability is a very big challenge for blockchains. I won’t talk too much about that today; you’ll see in a second what Corda’s architecture is and how we approach it. Scalability is a challenge because everybody has a large copy of the data set, and if you have a consensus algorithm, a number of nodes have to re-execute the contracts simultaneously and reach agreement. So performance is always going to be a challenge with blockchains.
The last is uniqueness. This is the double spend problem. I break this into valid and unique, and I’ll show you why in a minute.
Ledger – Critical Insight for Corda
How does Corda differentiate on this? The first, and probably the critical insight for Corda, is at the ledger level. If you think of the Bitcoin ledger, it’s effectively a tree of transactions. Are you guys familiar with UTXO? UTXO, unspent transaction output, is pretty simple: you have an issuance of something on a ledger, and as it gets into a transaction, it gets spent. What you’re really spending are these leaf outputs.
This is all spent information, but you need this history, and you need the full ledger, to know that it has integrity. You need this history of transactions in order to assure yourself not only that that leaf is unique, that somebody hasn’t spent it simultaneously while trying to give it to you, but also that all the contracts that executed previously had the right inputs and outputs, because otherwise somebody could have minted money or done something malicious along the way. You effectively want the full ledger, the leaf, and the contracts, to go back and execute.
Corda’s key differentiation is that we realized you don’t need the full ledger. You actually only need the portion of the history that’s related to the transaction you are entering into. If you just want to make a payment to me with the UTXO you’re presenting, I will take that UTXO, but also the history behind it and the contracts that go with it. I’ll re-execute those to determine they’re valid, so I can assure myself that what you’re trying to give me is valid. I can’t yet assure myself it’s unique; you cannot assure yourself of that, the network has to give you assurance on that. I’ll talk about how we split that out in a minute. You don’t have a copy of the ledger, but you can assure yourself that this is valid. That’s fundamentally how Corda differs. Nobody has a full copy of the ledger; it exists in a logical sense on the network. It would effectively come back to a set of trees that intermesh with each other, because you have different asset types. But fundamentally, Corda is passing branches around, not the entire tree.
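The branch re-execution described above can be sketched as a toy model in Java. This is a hypothetical illustration, not the real Corda API: the names (`Tx`, `verifyBranch`, the value-conservation "contract") are mine, and a real node would also check signatures at every step. The point is that validity is established by walking only the history behind one output, never the whole ledger.

```java
import java.util.List;

// Toy model of validating a transaction branch, Corda-style (hypothetical names).
class Tx {
    final String id;
    final List<Tx> inputs;        // transactions whose outputs this one spends
    final long valueIn, valueOut;

    Tx(String id, List<Tx> inputs, long valueIn, long valueOut) {
        this.id = id; this.inputs = inputs;
        this.valueIn = valueIn; this.valueOut = valueOut;
    }

    // "Contract" rule: a non-issuance transaction must conserve value.
    boolean contractHolds() {
        return inputs.isEmpty() || valueIn == valueOut;
    }

    // Walk only this branch of history, re-executing the contract at each step.
    boolean verifyBranch() {
        if (!contractHolds()) return false;
        for (Tx parent : inputs) {
            if (!parent.verifyBranch()) return false;
        }
        return true;   // valid; uniqueness still needs the network's notary
    }
}
```

A payment whose backchain conserves value verifies; one where an ancestor "minted" value out of nothing fails, even though the leaf itself looks fine.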
Ledger – Privacy by Design
This differs from other platforms. I probably should have put Fabric up here, but primarily: Corda is bilateral, so only the parties to a transaction have a copy of it. You can see the transaction history for the one branch behind you – I’ll talk about how we’re solving for that – but you can’t see ahead of you, obviously. You don’t get copies of later transactions.
It’s UTXO-style; all records are related to each other, and we only hold the smart contract with that information. The shared unit for Corda is the transaction itself; that’s what you share with your counterparties. Ethereum, by contrast, is account-based, not UTXO. If you have a contract, you have to name the parties in the contract, and if you want somebody new to transact on it, they have to be added to that contract, which means they get visibility into the contract’s history. We take some similarities from Bitcoin in the sense that it’s a UTXO-style ledger. However, we don’t share the entire ledger like Bitcoin would, and we have smart contracts that are much more sophisticated, and arbitrary assets, whereas Bitcoin effectively represents one asset type. We talked a bit about that.
Transaction – Smart Contracts
The smart contract is the second area. The key components of the contract, and how Corda differs here: one, they’re written in Java, or any of the JVM languages. We actually chose Kotlin. It was a bit of a risky choice at the beginning because it was a new language, but it was very efficient, which was one of the reasons we chose it, and you can create domain-specific languages. We could ease the developer experience by creating DSLs that developers could leverage, making things syntactically much simpler.
What Corda adds, then: you still have common contracts; most platforms have that. We separated out the workflow, though. If multiple parties are coordinating to build a transaction, we give you the tooling to create that coordination between parties. In a JVM language, in Java, you can write the logic of how you would go about getting signatures, proposing to counterparties, etc., assembling all that, and then submitting it to the ordering service – which brings me to the ordering service.
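The contract/workflow split can be sketched as a minimal Java model. This is illustrative only, under my own names (`PaymentFlow`, `Signer`), and not Corda’s actual flow framework: the flow is the coordination logic that gathers counterparty signatures and submits to the ordering service last.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a Corda-style flow: coordination logic that collects
// signatures from each counterparty, then asks the notary (ordering service).
class PaymentFlow {
    interface Signer {
        String sign(String txId);   // verify the proposal, return a signature
    }

    static List<String> run(String txId, List<Signer> counterparties, Signer notary) {
        List<String> signatures = new ArrayList<>();
        for (Signer party : counterparties) {
            signatures.add(party.sign(txId));   // each party checks the contract, then signs
        }
        signatures.add(notary.sign(txId));      // notary signs last, giving finality
        return signatures;
    }
}
```

The contract stays a pure validation function; everything about who signs, in what order, lives in the flow.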
Network – Ordering Service
Assuring yourself that something is unique has to be done at the network level. The double spend problem still exists, albeit at a network level now. You can assure yourself something is valid, but not unique.
The way Corda treats this: we had originally looked at the mining mechanism, but that’s where we have finality problems, and we didn’t know which algorithm would be best. In this space in particular, BFT is very young and immature in reality. We wanted the ability to swap out algorithms. We also wanted the ability to create very large networks, which means multiple ordering services running. That’s what we have. We call this the notary. It is effectively just an ordering service. You can run multiple ordering services, and they can each run different algorithms.
Anytime two parties transact to move something on the ledger, they have to get a signature from the ordering service noting that those are spent transaction outputs. It only cares about the spends; it’s very simple. Now, it may be running BFT, but each of the nodes in a BFT cluster would simply say, “You have a UTXO at an index; that’s an output. Have I signed on that previously? No? I’ll sign for it, add it to my log and move on.” That’s how we do the ordering service. We’ve done a couple of implementations of algorithms, with more to come actually. I can talk a bit about that in the future part real quickly; I’m going to leave that till the end, just for time.
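That uniqueness check is small enough to sketch directly. Below is a toy notary in Java, with hypothetical names rather than the real Corda notary service: it keeps a set of consumed output references and refuses to sign any transaction that touches one it has seen before.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy notary/ordering service: tracks consumed output references ("txId:index")
// and signs a transaction only if none of its inputs were spent before.
class ToyNotary {
    private final Set<String> spent = new HashSet<>();

    synchronized boolean requestSignature(List<String> inputRefs) {
        for (String ref : inputRefs) {
            if (spent.contains(ref)) {
                return false;            // double spend attempt: refuse to sign
            }
        }
        spent.addAll(inputRefs);         // record the spends, then "sign"
        return true;
    }
}
```

Note the notary never sees the transaction contents, only the input references, which matches the talk’s point that it cares about spends alone.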
In the technology platform itself, we made some choices to aim for two kinds of simplicity: simplicity from a developer experience, but also simplicity from an IT experience. Most of the representatives on our technical committee came out of large organizations, and integration into those organizations was something we were aware of from the very beginning of the design. It’s still challenging, for the record. It’s still very challenging to deploy a point-to-point system inside of a bank, as an example; having a continuously-changing set of network participants behind a firewall is not very easy in a bank. We’ve made some choices to solve those types of things. The first choice was the JVM. We thought that this has the largest developer community and would lead to a lot of productivity. I’ll talk about some of the trade-offs in that momentarily.
We selected AMQP as our messaging protocol. We did not develop our own whisper protocol; it is not a whisper-protocol-based system. It is point to point, so you choose who to message with. AMQP exists in a lot of organizations; it is robust and proven at this point. Last, we started our ledger in a relational database. There are a lot of byproducts of this: you get all the querying capability you would get with your tooling today, around BI and such, when you use a relational database. Most of the existing solutions went with key-value-pair systems, which, oddly, were invented well before relational databases; relational databases were designed to solve the problems of key-value stores. We also get to rely on some of the mission-critical capabilities of these databases.
Future: Expanding Architecture
Where do we go from here? Where would Corda go from here? At this point, I should explain where we are. Today we have live implementations running this solution, with all the things you would have seen. But we’re still left with a couple of problems. If you’ve followed so far: I hand you a transaction branch, and in order to satisfy yourself that what you’re receiving is valid, you have to execute those contracts, and you need all the data that goes in and out of those contracts. What we’ve effectively done is dramatically reduce the surface area of the privacy problem, but we haven’t eliminated it.
Ledger – Privacy via Encrypted Transaction History with SGX
How are we going about eliminating that? The transaction history starts with an issuance and goes through various transactions. Now, keep in mind, if this was cash, as an example, it could be used in various types of transactions: cash versus buying a car title, cash used to pay for a pack of cigarettes. Each of these needs a contract that goes with it, so you have a variety of contracts, potentially. Cash is the hardest one, by the way. What you’re receiving in the end is the unspent transaction output, which is the unspent part of that. It should have a value and a signature associated with it that can unlock it, etc. But when you hand me this, I need to rerun those contracts to make sure that this thing is valid, it was issued by somebody I trust, etc.
Today, we do that on your node. Where we’re going next is encrypting this transaction history. Step one, the first milestone of which we’ve reached – I think next week it will be public – is doing this in a trusted enclave. We’re doing this with Intel’s SGX technology. Are you guys familiar with trusted enclaves? These are the things I take for granted. They’re present primarily on phones today, actually.
A secure enclave is a piece of hardware inside the chip, and when it boots, it bootstraps itself with the belief that the entire system has been compromised. It all runs encrypted, signed with the chip producer’s signature. Anything that runs inside there runs with the belief that the local host can’t see it, because it doesn’t trust the hosting environment; it only trusts the inside of the enclave. That means you can take this transaction history, encrypted with the enclave’s key, and pass it to somebody. They can’t read it, but they can pass it into the secure enclave and execute the contracts, and the enclave will provide them with assurance that it’s valid. It’s their enclave, so they know the parameters around its operation.
The first milestone we have reached is a Java virtual machine running inside the secure enclave. The next step is that you can then execute contracts there. Developers won't need to have any awareness of this; it should be a fairly simple developer experience. Now why did we choose this approach? Why not go with zero-knowledge proofs? Zero-knowledge proofs are quite fashionable and people love to talk about them, but they have a few very big problems. One, they're not at all performant: they're on the order of a minute or so for basic contract validation. A zero-knowledge proof doesn't require an enclave; it can execute just on the machine, using cryptographic procedures to run this same verification process.
Two, they have to be handcrafted today. That means that somebody, a developer, has to craft that zero-knowledge proof, which is quite difficult. There are very few people in the world who can actually do this today. To expect ZKPs to be developed for all the contracts in the world would be somewhat unrealistic in the near term. Last is the trust problem with a lot of ZKP implementations: you have to trust that the initial keys generated with them were not compromised. However, Corda is somewhat agnostic on this. Once ZKPs are viable, this transaction history could be encrypted with zero-knowledge proofs instead. We're not religious about this at all, we just took what was practical. You get near on-chip performance with validation in an enclave, which is much more impressive than you might have thought. That's the first thing.
Determinism via “Sandboxed JVM”
Another thing that's actually shipping with Corda 4 in its first preview is the deterministic JVM. If you look at the Ethereum community, why would they create their own virtual machine? It seems a bit silly, but they actually did it out of necessity, for two reasons. One is that they needed to ensure that contracts run deterministically: every time you execute a contract with the same inputs, you get the same output, which is difficult. In fact, there are a lot of non-deterministic APIs in most of the Java libraries.
The second reason for them, which is a differentiation of the public blockchains, is that they needed to charge for resources. They needed intimate control over what happened on the machines so they could charge you gas appropriately for the usage of somebody else's machine. We don't have that need. What we did instead is take the Java Virtual Machine and sandbox it. Now, that's a bit of a marketing term in reality, because it's a pre-processor that runs through the contracts and evaluates all the API calls to see if they're deterministic. That was actually about a two-year effort of whitelisting deterministic APIs. You still get most of the Java library, but APIs that are known to be non-deterministic are eliminated at pre-processing time, when you load the contract. That way you have assurance that all the contracts that run are deterministic.
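The whitelisting pre-processor described above can be thought of roughly like this sketch. The whitelist entries and function names are invented for illustration; the real deterministic JVM operates at the bytecode level, not on API name strings.

```python
# Rough sketch of whitelist-based determinism checking.
# The real deterministic JVM works on bytecode; this only illustrates
# the idea of rejecting non-deterministic API calls at load time.

DETERMINISTIC_APIS = {
    "java.lang.Math.abs",
    "java.util.Arrays.sort",
    "java.lang.String.length",
}

def check_contract(called_apis):
    """Return the set of disallowed calls; empty means the contract loads."""
    return {api for api in called_apis if api not in DETERMINISTIC_APIS}

# System.currentTimeMillis is a classic source of non-determinism.
bad = check_contract(["java.lang.Math.abs",
                      "java.lang.System.currentTimeMillis"])
print(bad)  # {'java.lang.System.currentTimeMillis'}
```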
The next area we're looking at is tokens. Tokens are quite fashionable. A lot of you had your hands up for Bitcoin, but not so many for using the technology. Tokens have a lot of power and a lot of potential, but they're effectively just a contract. A lot of our work has been looking at how you could make uniform tokens that are fairly fungible, so they're interchangeable. There's a lot of standards work on that. That's the boring part nobody really wants to do, but it has to be done, slowly, one use case at a time.
The other part is how you can make that token more powerful. If you look at ERC-20 on the Ethereum platform, it's really just about creating a standard interface so that when you take a token and give it to an exchange, they can list it. They just call the same APIs on every token; you have to implement a common set of, I think, five or so APIs for your ERC-20. We're looking further at how you can make tokens more applicable to broader scenarios. One of the key scenarios for us lately has been finance: how can you issue something on ledger? How can you then securitize it, so that somebody who receives the token can work their way back to the issued item and assure themselves that it's legitimate, rather than just having to trust the token itself? That's where some of the work is happening today.
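The point about a common interface can be sketched as follows. The method names mirror part of the ERC-20 standard (`totalSupply`, `balanceOf`, `transfer`, and so on); real ERC-20 tokens are Solidity contracts, so this Python class is only a sketch of why a uniform interface lets an exchange list any token the same way.

```python
# Minimal in-memory token with an ERC-20-style method set.
# Illustrative only: real ERC-20 tokens are Solidity contracts.

class Token:
    def __init__(self, supply, owner):
        self.balances = {owner: supply}
        self.supply = supply

    def total_supply(self):
        return self.supply

    def balance_of(self, who):
        return self.balances.get(who, 0)

    def transfer(self, sender, to, amount):
        """Move tokens if the sender has enough balance."""
        if self.balance_of(sender) < amount:
            return False
        self.balances[sender] -= amount
        self.balances[to] = self.balance_of(to) + amount
        return True

t = Token(1000, "alice")
t.transfer("alice", "bob", 250)
print(t.balance_of("bob"))  # 250
```

An exchange that only ever calls this fixed method set can list any token that implements it, which is the whole value of the standard.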
I'm going to use this last part to tee up Carolyne, who is going to talk about the network. One of the big things we talk about with the network is that the boundary of a network is its root certificate authority. That's the way Corda is designed today, and it's also the way Fabric is designed today. Fabric is a bit different, though. Corda is kind of naturally a sharded architecture, in other words, only branches of the transaction graph get passed around and nobody has to have a full copy. Fabric solves that in a very different way: they create mini blockchains within groups. The problem with Fabric is that you then have to be able to take something from one blockchain and move it to another blockchain. That's an interoperability problem, and that's a much different problem.
For us there is still a problem, which is that each one of our customers is going to stand up these networks with their participants, hopefully with many nodes. That would be fantastic. But these networks cannot interoperate with each other. Fundamentally, blockchains can't interoperate with each other because you don't have the trust; you'd have to participate in both networks to assure yourself that you're getting something that's valid and unique. What Carolyne is going to talk about is that, because the boundary of the network is simply the root certificate authority, if we all operated under the same root CA we could operate in effectively the equivalent of a public network. Carolyne will talk about that momentarily.
So my call to action – sorry, I'm moving really quickly and I think I'm just on time – is corda.net. If you'd like to learn more, corda.net is the website dedicated to the open source version. I should have said this in the beginning: Corda is an Apache-licensed open source project. We have contributors from all our member banks, etc. We're always seeking contributions, so it'd be fantastic if you want to participate in that. The simplest way to get started, if you go to corda.net, is to go through the instructions for creating a contract. That will get you the most familiar with how the system works, and it'll pull down the dependencies from Corda itself.
My last slide actually is I have a problem. I run the Developer Relations team also. I didn’t bring the stickers because we’re running out of stickers and I need to develop new ones. I was sick on the weekend, so I came up with some ideas on my own. We’re going to come back to this after Carolyne presents. But I need you guys to clap and vote on which one you’d actually like, because I have to get them printed today because I’m flying next week and need them. So I’m going to take advantage of all the people being in the audience. I’m going to hand over to Carolyne and I’ll be back up for Q&A also at the end.
What is Corda Network?
Quinn: I joined R3 about six months ago. I'm not a software developer, so be kind. What I am working on is governance, working on the operation of Corda Network and promoting it to many different customers. Going back to Mike's points, let me try to break down what Corda Network actually is. There are four or five main components that I think about when trying to get my head around Corda Network. There's a network map, which is essentially a list or a phone book of all identities on the network. An entry has an IP address and also a name, which is a legal entity name, plus details like a country and a location. Anyone who is a member of Corda Network will be part of this network map.
Secondly, there's an identity issuance service. Anyone who's a member of Corda Network will have an identity on the network. This is not an identity that full KYC has been done on, but there is a legal entity name, a country, and the closest city. And before you join Corda Network, there are a couple of very large identity checks that we do to ensure that you can join. The identities are verified to a certain extent, but full KYC hasn't been done on them.
There is also at least one notary service. Corda Network provides a notary service, but anyone who wants to join can also run their own notary. We provide a notary service which, as Mike explained earlier, is basically proving that transactions are unique. That's also an important thing to get your head around. The trust root, as Mike mentioned, is the root of trust for all the transactions that happen on Corda Network; they are validated through this trust root. It's offline, hidden in a secret room. We had a huge ceremony to put this trust root in place. If you're interested in finding out more, I know Mike outlined corda.net for our software, but corda.network is where you can find all the information about these things and also about the new foundation that we've set up for governance. I'll talk about that in a minute.
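The uniqueness job a notary does, refusing to sign a transaction whose input states have already been consumed, can be sketched like this toy model. The class and method names are invented for illustration, not Corda's real notary API.

```python
# Toy notary: tracks consumed input states and rejects double-spends.
# Names are illustrative, not Corda's actual notary API.

class Notary:
    def __init__(self):
        self.consumed = set()

    def notarise(self, input_refs):
        """Sign only if no input state has been consumed before."""
        if any(ref in self.consumed for ref in input_refs):
            return False          # double-spend attempt
        self.consumed.update(input_refs)
        return True

n = Notary()
print(n.notarise(["state-1", "state-2"]))  # True: first spend
print(n.notarise(["state-2"]))             # False: state-2 already spent
```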
All the policies which regulate these different things are outlined on that website, and I encourage you to have a look, should you be interested. Obviously, nodes make up the last part of the network. A node is an identity, is a participant, is a legal entity; these are the same thing to a certain extent. If you're taking part in Corda Network, you will be a legal entity. Individuals currently are not participating; it's just for legal entities.
The way many projects or consortiums are joining Corda network at the moment is through business networks. This is a terminology we use internally. You could call it consortium, you could call it a trading group. Typically, these are groups of 10 to 15 legal entities. There could be way more than that. We have a couple of groups joining who are in the hundreds, but these are groups who have decided before joining Corda network that they need to interact together. They need to transact together. They want to use Corda.
We're seeing reinsurance and syndicated lending. Trade finance are the big joiners at the moment, but there are also customers in many other industries like healthcare, supply chain, and even oil and gas. Business networks are how people tend to join Corda Network. All the rules about who can join a business network are up to that business network itself. They create rules around pricing, around who is allowed on, what kind of legal contracts there are, and how they transact together. We want a lot of the control to go to these business networks to determine their transactions. As Mike mentioned, one of the big advantages of Corda Network is that your node can be a member of 10 or more business networks. Any assets, like cash or other assets, that you have in your business network can also be used in other business networks. The term interoperability is a buzzword, we recognize that, but we do think Corda Network has it in comparison with other networks.
How is Corda Network Different?
I don't want to go into this in too much detail, but on the right-hand side is Corda. It's a global, interoperable network, but it also has privacy and high scalability in terms of transaction throughput in comparison to other leading blockchain networks. I wanted to quickly talk about our foundation. When people hear about Corda Network, they're excited, but until recently they have been a bit reluctant or a little bit suspicious because it was controlled by R3.
Why should I join something that is controlled by a for-profit corporate, essentially? Why would I risk vendor lock-in when there are other networks out there for which this may not apply? In reaction to that, I've spent the last six months helping set up a foundation. It's a completely separate legal entity from R3. It's a Stichting, resident in the Netherlands; Ethereum has a Stiftung, which is the Swiss version of this. It's a governance entity, so it has no shareholders and no owners; it's just there for governing. What's exciting about it is that there are two R3 directors, but there are nine directors from other corporates. These are basically customers of ours, or participants on the network, and they're directing and governing how the network runs. This is anything from pricing to changing the rules about who is allowed on the network to, potentially, the network operator. It can also govern who runs the network.
It has quite a lot of free scope for changing Corda Network. The transition board was internally announced last week, and we'll be doing a press release in the coming weeks about who those members are. They'll be there for the next year and they'll be deciding how the network is run, so it's really taken away from R3. The business has taken a good bit of risk by letting control go to a completely separate legal entity, but it's set up, and it's all on corda.network if you're interested in finding out about the policies and anything more about it.
What Will The Network Be Governing?
R3 is the network operator. We set up the Corda network and we’re running it for the first three years. After that, the foundation can change the network operator. So we really are letting go of control.
The foundation can decide on pricing for the networks. All the pricing is currently publicly available on corda.network. They’re very low prices. It’s a not-for-profit entity, so it’s just there to cover its costs. It’s all available, should you be interested in joining. Any stuff around technology upgrades for Corda network, the foundation will be heavily involved with that. Any changes to the bylaws, the articles of association, the governance guidelines, any of these legal documents that are set in place for the foundation to run, these can be changed by the board of the foundation as well.
Questions and Answers
Participant 1: You were talking about solving the double financing problem. What I understood is that the financing has to be on Corda Network. If it is outside of Corda Network, there's no way to actually detect whether there is double financing or not. Is that correct?
Ward: In this case, that's the way they're attempting to solve it. I wouldn't think of the Corda Network specifically. No offense, Carolyne, but that just makes for a bigger network. In this case, you could run just a financing network, as an example. The double financing would only be visible to the participants of that network. When they issue the request for financing, it can only be consumed one time. If they go to issue something again, well, the others have probably received copies of that issuance. So there's some information that's shared broadly in these networks and some information that's kept private, but something akin to a request to finance something would be shared fairly broadly among the major banks. It's a very big problem, though. Actually, I didn't realize it cuts quite substantially into the profits of trade finance in particular.
Participant 2: You were talking about the governance. If you look at blockchain as a whole, the main thing is decentralization of everything. Having governing bodies sounds like centralized control. How will you solve this dilemma? This obviously comes up when it comes to expanding the network or trusting the network.
Quinn: I think there are two points. Firstly, a lot of the control is actually with the business networks themselves. The foundation is there to govern the boring underlying infrastructure, which is Corda Network. But a lot of the power about who can join business networks, who can transact with each other, and who may be kicked off is down to the business networks themselves. There could be millions of business networks, so in some sense the decentralization still does exist. And on the foundation side, there are 11 directors in its current form. This is at its earliest stage and there could be more directors, but it's the members of the network who are on it. After the transition board, which has just been appointed, there are going to be elections, and anyone who's a member of the network can elect the board or stand for the board as well. So it's representative democracy rather than participatory democracy. There is a certain amount of centralization, but it's still pretty open.
Ward: A lot of that could be put on ledger at some point. But what it's governing is actually very minimal. What it's governing is the root CA and the operator of that CA's ability to add people to a revocation list. So, can you kick somebody off the network, effectively? The only reason you would do that is denial-of-service attacks, in reality. So one thing is the root CA's ability to issue a CRL, and the second thing is what are called the network parameters, which are super basic. They're things like the epoch, or how long a node can be unreachable until it gets removed from the network map because it's considered permanently offline; those types of parameters, which are very generic. If you look at what they're trying to govern, then, it's the ability to add or remove participants, and even a CRL is something you can choose to honor or not.
However, it’s very useful for denial of service because we use TLS as the basic security layer. Certificates are required for that. That’s the fundamental reason we have certificate infrastructure. The next layer is using that certificate for identity. A business network, so a smaller group of participants operating on a particular asset type and using a particular contract, could choose a different means of identity that would be separate from the TLS identity. They could choose a means of recognizing each other that would be completely independent of the network. Therefore, they always have business continuity. These are still peer-to-peer systems, so there’s not much you could actually control in reality. You could very easily change Corda to say ignore the CRL entirely.
Participant 3: Could you please explain a little bit how you solve this problem: if there is, for example, a vulnerability in one of the JVM APIs we're using, how do you update all the different instances of contracts running on all the nodes?
Ward: Sorry, I didn’t really follow that too much. That was a little fast.
Participant 3: If the open source community found that there is a vulnerability in one of the JVM APIs that you are allowing in your smart contracts, how do you change that in every single instance running on every single node?
Ward: If there's a problem with Corda itself, which is kind of an application server in a way, and it's a security problem, we have a process through the open source community of taking those and responding to them as quickly as possible. If it's open source, you could patch it yourself if needed. If the problem is, as you're saying, in a contract using an API that was discovered to have a vulnerability, it would be up to the contract. Unfortunately, that means somebody has developed a contract that has nothing to do with the open source community; they just happen to use the platform for that purpose. So they would have to go through it and make updates.
One of the hardest things in blockchain platforms is actually contract upgrades. I heard a really good interview with the team behind one of the tethered tokens on the Ethereum platform, and they said that shipping an Ethereum token of that caliber, where it's tethered to funds in a bank account somewhere, is like shipping hardware. They can't change the contract after it's shipped, so they've got to get it right the first time, and they had some very notable processes for doing that. But Corda doesn't work that way. We designed it knowing that you'd constantly be upgrading contracts, because they reflect business logic that changes, you may find bugs, or you just want increased capabilities. Corda has built in a means of allowing contracts to be upgraded.
So this isn't the software itself, it's the contract that's executing. The most basic means is, if it's a small network and there's a network operator, they can whitelist which versions are acceptable. Now, that's not great, because it means the operator of that network controls what goes on it, and you don't want that. So the next option is that you can have a consensus mechanism where all the participants of a contract accept a new version. At that point, they would all transact once: they take the states that are on the old version, create a transaction migrating them to the new version, and they all upgrade to the new contract. What we just introduced with Corda 4 is signature constraints. The participants can agree that a signed JAR is something they will automatically upgrade to.
That's when you get into a more controlled network. When you receive a contract from somebody, the contract can propagate with the transaction. If you and I transact, I can send the newer contract attached to that. These are all adjustable things based on how you want your security profile, but I could propose a contract and attach the JAR; you look at the JAR and say, "It's a newer version of mine. The JAR is signed by a producer I recognize. I will automatically take this, upgrade my ledger to the new contract, and transact with you on the new version of the contract." And that becomes an automated process in that way, so it's efficient.
Now, it's always a trade-off of how much risk you introduce or not. That probably depends on what you're transacting over and who you're transacting with. Bitcoin is somewhat arbitrary: I'll pick anybody in the world, give them my address, and transact, because it's one asset type made for payments, effectively. Corda isn't arbitrary like that, so you'll have governance rules that vary a lot depending on who you're working with and what you're working on.
Ameisen: What I want to talk about today is practical NLP for the real world, which is maybe a lofty goal. What I really want to cover are lessons learned from building a lot of NLP projects, and practical tips that can help you succeed and go beyond the standard blogs or papers that you might read.
Why am I talking about this? I work at Insight Data Science, I’ll tell you a little bit more about this, but my role has been mainly in helping a lot of brilliant fellows build applied machine learning projects. A lot of them are on topics that are not NLP, but quite a few of them are in topics that are NLP, so I put a few on the right here. In 2017, we had this automatic review generation for Yelp where we managed to generate reviews that were really hard to detect. We had another project where we were classifying support requests automatically, I’ll tell you a bit more about how these work and what works in practice.
What’s Insight? Basically, all of these projects were done by some of these Insight alumni, which are from all over the U.S. This map is outdated, we now have offices in L.A., in Toronto. These fellows come to Insight, build these projects, and then go on to work at some of the companies listed here and others, so while they’re at Insight working on these projects, the Insight staff and mentors help them.
Let’s dive right into practical NLP. What I want to do is talk very briefly about the “Why?” I think most of you probably have a sense for why you would use NLP, so we won’t spend too much time there. Then talk about the theory, and then talk about where maybe some of that theory breaks down in practice and some of the challenges that are there.
Practical NLP: Why
What's NLP and why would you need it? As practical examples of things you could do with it: we had a project on medical understanding where you automatically extract keywords of diagnoses from recent medical publications and use them to update a database that doctors can then reference when treating patients, to always stay up to date with the newest treatments. We had another project that used NLP on code, where you simply read somebody's code and their answer to a coding screen question and try to automatically assess whether they'll get hired or not. It turns out that a machine learning model is pretty good at that, which, I think, tells you a lot about whether that's a good interview practice or not. Then, support ticket triaging: routing support tickets to the right place. There are many versions of systems like that, but I'd say it's a very common use case of NLP that can also deliver a lot of value.
Why focus on NLP? I think images get all the hype, but when I talk to my friends who are software engineers or data scientists, it's very rare that their day-to-day tasks include identifying stop signs, unless they explicitly work on that problem. Most useful data is in text, whether it's public data like tweets or Reddit, your proprietary data, or a mix, such as reviews or comments about your company. Maybe more importantly, compared to computer vision, it's much easier to deploy NLP models: they're usually shallower, usually easier to debug, and usually more affordable to maintain. That depends, and we'll go into some pretty complicated models a little later, but usually that holds true.
In theory, how will NLP save the world? In theory, it just works, if you read corporate blogs or papers, you do your end-to-end approach. Whatever your NLP problem is, translate this to that, write code automatically, write reviews automatically. You put your inputs, you put your outputs, you train for a couple months on 200 GPUs, and then done. Your data is easy, either you have a standard data set that’s used in academia and you can just use it and see if you can get that extra percent of performance, or you work at a massive company that has absolutely infinite data, just one query away, or you have the money to have somebody label 9 million photos, or sentences, if that’s what you need.
Finally, I’d say that the other in theory is you build your model, you get a really good model, and then, that’s it. That’s where most papers stop, that’s where most blogs stop, but how do you know that your model is actually good? How do you know how to deploy it? How do you know when you update it? How do you know when you change it? These are some of the things that I wanted to talk about.
A lot of this theory, I say, comes from the promise of deep learning, which is that it’s going to automate a lot of this work for you. You’ll be able to just feed in raw data and through some models that will automate feature extraction for you, you won’t have to do a lot of that work. I think in practice, deep learning is very useful, but it doesn’t solve nearly all of the practical problems that come up.
I wanted to illustrate this with an example, so this is another project that we did at Insight. This was by a fellow in the summer of 2018, the project is simple, the idea is you want to learn how to paraphrase sentences, so say the same thing in a different manner. Why you would want to do this? A big example is, let’s say, you’re on the Alexa team or the Google Home Team, and you want to capture all the ways that somebody could ask you, “Hey, Alexa, play this on the TV.”
We built this model that learns to paraphrase, and we gathered this massive corpus of sentences that roughly mean the same thing, that was a whole endeavor in and of itself, and then trained a model that was sort of a simple encoder-decoder, which is a pretty standard deep learning architecture for text and we got some reasonable paraphrases, if you look on the right, that seems reasonable. Here are a few issues with it, the model is powerful, so that’s good. It’s pretty hard to do this, and it generates reasonable suggestions. It’s data hungry, I would say most of the work was just looking at the data, looking at the data again, cleaning the data up, realizing that the sentences weren’t aligned, realizing that there weren’t enough sentences, realizing that there were a few things that were missing.
It is resource intensive: the final model took over a week to train, and it's brittle. I gave you, I think, a decent example of it working pretty well, but it's very brittle, and it's hard to see how brittle it is. What I mean by that is, I was showing this model to a friend after building it. My friend's name is Tosh, he's great. We put in a sentence that said something like, "Oh, Tosh is going to the grocery store." It turns out that the model, because of some data it had seen, was 100% confident that Tosh was a curse word and just spat back insults. No matter what we did, no matter where we put Tosh, it would just spit out insults. My friend was fine with it. He was like, "Oh, that's deep learning." I was like, "Well…", but in the real world, if you were ever to use this, it would be pretty bad: you find yourself insulting somebody in your living room, all of a sudden Alexa starts playing music, and you'll be a little confused. There are a lot of issues that come with what happens before you have a model, getting a data set that works, and what happens after you have your model, in between "Hey, I trained the model, the curve looks really good" and having a product that we can actually use.
The way I like to think of machine learning in practice, it’s just like regular bugs, but worse, because they’re often harder to see. Machine learning code can entirely run, no errors, no warnings, but everything is wrong, and your model is terrible. It can entirely run with no errors, no warnings, good-looking accuracy, but everything is wrong. It’s a challenge, I think that’s a lot of the challenge about this. Some of you have probably seen this quote, “In theory, there is no difference between theory and practice, but in practice there is.” As I was preparing for this talk, I was Googling who gave that quote, in theory, it’s Benjamin Brewster, but in practice, it’s very disputed.
Here are the real challenges and the way that we try to think about them. One, there are very many ways that you could frame an NLP task. NLP is broad. If you think about just understanding text or understanding sequential data in general, that’s a pretty broad domain. There are a few tasks that work really well, you don’t have to transform everything into these tasks, but if you can, you’re in a pretty good spot. Mainly, if you can transform your task into either classification, so you take examples and you give them one or multiple categories, named entity recognition, or information extraction where you take some sequence and you try to extract salient information where you’re like, “Ah, somebody said, ‘I won’t be able to make my appointment tomorrow.'” You say, “Ah, they’re talking about an appointment and the date is tomorrow.”
Those tasks usually work pretty well and are pretty well understood and anything outside that has to do with embeddings, which is finding a good representation of your text so that you can use it later for recommendations, for search, for anything like that. The other thing we talked about already, I’m going to focus most of the talk on, is the debugging step of once you have a model, how do you look at it, how do you validate it, and how do you do a deep dive?
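The information-extraction framing from the appointment example can be sketched with simple patterns. The patterns and categories below are invented for illustration; real systems would use a trained named-entity-recognition model rather than hand-written rules.

```python
import re

# Tiny rule-based information extractor for appointment-style messages.
# Patterns are invented for illustration; a real system would use a
# trained named-entity-recognition model.

def extract(message):
    info = {}
    if re.search(r"\bappointment\b", message, re.I):
        info["topic"] = "appointment"
    m = re.search(r"\b(today|tomorrow|monday|tuesday)\b", message, re.I)
    if m:
        info["date"] = m.group(1).lower()
    return info

print(extract("I won't be able to make my appointment tomorrow"))
# {'topic': 'appointment', 'date': 'tomorrow'}
```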
We'll walk through a practical example, which is actually from a blog post that I wrote over a year ago that was pretty popular. I basically took that blog post, which was a standard machine learning pipeline, and did a deep dive on it, which I hadn't done in the original post, to see exactly what's going on. Here's the practical example: this is a professionally curated dataset where contributors looked at a little over 10,000 tweets, almost 11,000, that contained disaster words. The question is, is this about a real disaster? A disaster that somebody would want to know about if they were emergency responders, or the police, or just generally an actual bad thing happening in the world? Or is it somebody who was just very angry about a sushi restaurant and was using extremely strong language? Can we build a model to separate the two?
The reason I chose this task is because I think it’s really interesting, because by design, you’re trying to separate things which are using the same words, because all this data set was curated by looking for these words here, “ablaze, quarantine, pandemonium, etc.” You’re building your task so that it’s a little harder because you can’t just discriminate on words because a lot of these tweets will share the same vocabulary.
Here’s what we’re really going to talk about. One, vectorization or embeddings, in a way, two, visualizations, three, the model. Because so many resources are dedicated to how you train models, how you train good models, etc, we don’t spend too much time on that. Then four, deep dive into the model to actually analyze it.
How to feed your model data? This is something where I wasn’t sure how familiar all of you would be with this, I’m happy to dive deeper or stay at a high level depending on how people feel. Models can’t usually take in raw strings and train or predict, they need to have, usually, numerical data. Most machine learning models need to have numerical data, so all models that work on text will need to find some ways to represent text as a number or as a set of numbers, as a vector.
You can think of it on the left there: you can simply transform your text to a vector yourself, using some heuristics we can talk about, and then feed it to some simple model, like a logistic regression. Or, if you like really modern NLP, this is a diagram from ULMFiT, which was one of the first papers that started the transfer learning for NLP phase: you can have a super complicated model, but if you think about it fundamentally, all the model does is give you, in a complicated way, a really good vector for a sentence. Once you have your vector, then basically you have what is equivalent to a logistic regression, and you pass it along. In practice there are some nuances, but fundamentally, that’s what’s happening.
Who hasn’t seen this diagram? At this point, I feel like every presentation about NLP has to have the word2vec plot, so there’s my contribution to that rule. I’m just putting it to remind you that the idea here is that ever since 2013, 2014, there are ways to vectorize words, so to find embedding for words, to find ways to represent words and sentences as a set of numbers that are semantically meaningful; meaning that on average, your hope is that words or sentences that talk about the same things, that mean the same things, will have vectors that are close to each other.
How do you do this? I feel like there are enough talks about this; there’s word2vec, GloVe, and there are very recent approaches, BERT, etc. The main takeaway here is that we find a way to take our tweets, our sentences, and make them vectors, and there are a lot of pre-trained models online. For this part, how good the model is, is not that important, so we’re just going to start with something that gives us pretty good vectors.
One simple way to do this is to not even use any of these complex BERT, GPT-2 models. You take your sentence, you take all the words, so, “We love machine learning,” and you transform them into vectors using a pre-trained model that you can find online, which is basically a dictionary mapping each word to a vector. You take the average, you also take the max along each dimension to preserve salient words that have a strong meaning, or you concatenate both, and that gives you a pretty good vector; this is from a paper that you can see at the bottom. This is definitely not the best performing way to embed sentences, but it gives you pretty good results and it’s really simple, so we’re going to do that.
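The mean-plus-max pooling idea described above can be sketched in a few lines. The tiny 3-dimensional word-vector dictionary below is made up purely for illustration; in practice you would load pre-trained GloVe or word2vec vectors from disk.

```python
import numpy as np

# Toy pre-trained word vectors (invented for illustration; real ones would
# be 300-dimensional GloVe/word2vec vectors loaded from a file).
word_vectors = {
    "we":       np.array([0.1, 0.3, -0.2]),
    "love":     np.array([0.8, -0.1, 0.4]),
    "machine":  np.array([-0.3, 0.5, 0.7]),
    "learning": np.array([0.2, 0.6, -0.5]),
}

def embed_sentence(sentence, vectors):
    """Average + per-dimension max of the word vectors, concatenated."""
    words = [w for w in sentence.lower().split() if w in vectors]
    stacked = np.stack([vectors[w] for w in words])
    return np.concatenate([stacked.mean(axis=0), stacked.max(axis=0)])

vec = embed_sentence("We love machine learning", word_vectors)
print(vec.shape)  # 6 dimensions: 3 for the mean, 3 for the max
```

With real 300-dimensional vectors, the result would be a 600-dimensional sentence embedding you can feed to any classifier.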
The traditional machine learning way is somebody will still be like, “Well, I have this great new method. I embed the vectors, then I feed them to my classifier and done, we get 80% accuracy, this is great, we’re ready to deploy.” but let’s actually look at our vectors. Here, this is for the same dataset, I used the dimensionality reduction technique, just PCA. You can think of PCA as just a way to project data, so those vectors are very large, 300 dimensions, let’s say, I wouldn’t be able to show them, so I’m going to show them in a 2D plot. There are a variety of dimensionality reduction techniques, PCA is one of them, it just helps you project from 300 to 2 so that we can actually look at them.
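Projecting high-dimensional embeddings down to two dimensions for plotting is only a few lines with scikit-learn. The random vectors here are stand-ins for real sentence embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for real sentence embeddings: 200 random 300-dimensional vectors.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 300))
labels = rng.integers(0, 2, size=200)  # 0 = irrelevant, 1 = disaster

# Project from 300 dimensions down to 2 so we can actually look at them.
projected = PCA(n_components=2).fit_transform(embeddings)
print(projected.shape)  # (200, 2)

# With matplotlib you would then do, e.g.:
# plt.scatter(projected[:, 0], projected[:, 1], c=labels)
```

Swapping PCA for another dimensionality reduction technique (t-SNE, UMAP) is a one-line change with the same fit/transform interface.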
Here, this is a simple embedding technique that’s even simpler than the one I’ve shown. If our classes are pretty separated, we’d expect the embeddings to sort of spread out, so we have blue on one side, which is disaster, and orange on the other side, which is irrelevant. It isn’t happening here, that’s fine, but we’re going to try a little bit of a better embedding. This is TF-IDF, which normalizes words a little better, this is starting to look a little more separated. Then this is using the word vectors from that famous slide. This is looking a lot better, we can have hope that our vectors are pretty reasonable, maybe if we feed this to a classifier, we’ll be in a good spot, so we do.
This was a very long blog post that I’m summarizing in a few slides, but essentially, we get a classifier, we use a simple one, logistic regression, 77% accuracy on relatively balanced classes. It’s basically a two-class problem; there’s a third category here, you can see, “Unsure,” but there are five such examples in the whole dataset, so we sort of ignore it, and we get good results, so again, 77% accuracy. In the blog post, I go on to try more complex models, CNNs and RNNs, etc., and that gets us up to 80% accuracy. Now we’re done, we have our model, we have 80% accuracy, we’re ready to deploy it, we’re going to give it to the FBI and all of the police crawling Twitter and just be like, “Yes, here, just use this. It’s great,” or not, or we’re going to dive a little deeper and see what’s actually going on.
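A baseline like the one described can be sketched in a few lines with scikit-learn. The four tweets and labels below are invented stand-ins for the roughly 11,000-tweet dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up stand-ins for the real labeled tweets.
tweets = [
    "forest fire near la ronge sask canada",
    "residents asked to evacuate after flooding",
    "this sushi place is an absolute disaster lol",
    "my inbox is on fire with so much work",
]
labels = [1, 1, 0, 0]  # 1 = real disaster, 0 = irrelevant

# TF-IDF features feeding a logistic regression, end to end.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(tweets, labels)
preds = model.predict(tweets)
print(preds)
```

On the real dataset you would of course hold out a validation split and measure accuracy there rather than on the training tweets.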
What we’re going to do is we’re going to inspect the model, and we’re going to inspect the data, and then we’re going to inspect the errors, which I like to think of as the combination of the model and the data. The way we’re going to do this is first we’re going to just look at the model. Because we used a simple logistic regression, we can just look at the coefficients of the logistic regression and see what words it finds important. Here, for disaster, the first few words are “Hiroshima, fires, bombing,” which seems pretty relevant; for irrelevant, “See, this, never.” It seems like the disaster words that the model picked up on to use for its decisions are pretty relevant. That’s looking good so far, but let’s dive a little deeper.
One thing that I found super useful, and I recommend to fellows at Insight all the time, is to use vectorization, so again, just transforming your data into a vector, and dimensionality reduction techniques to inspect your data and validate it. Oftentimes, you’ll hear, especially at companies that have the resources, “Just label some data,” and that’s often uttered by somebody who never had to label any data. If you work in machine learning, I would say one of the most instructive experiences you could have is to spend three hours labeling data; it will change your life, maybe not in a good way, but it’s really enlightening. There are very many things that are extremely hard about labeling data; one of them is just the numbness that comes with doing it a lot. The other is that once you’ve labeled 100 examples, whether the 101st tweet is about a disaster or not becomes a very uncertain concept, and so you just start guessing. You get into a flow state, but a very weird one of just guessing left and right. This was a professionally labeled dataset, so it’s easy to sometimes treat it as, “Yes, it’s ground truth; if we have a model that performs 100% on this dataset, then we have a perfect model.”
What I’m going to do is I’m going to do a deep dive on these labels. Here, similar to the plots I was showing before, we have a plot of all the labels with the relevant ones and the not relevant ones. This is a UMAP plot, which is a different dimensionality reduction technique. I chose a different one just to show you that these techniques are great at giving you a view of what your data looks like, but they make a lot of approximations. In fact, this looks very different from the other ones, but it’s the same data. You just want to be a little careful about making too many assumptions, but they allow you to actually look at different parts of your dataset and actually look at the individual examples.
What we’re going to start with is finding outliers. What are these points that are super far from the center? Is there a reason they’re super far? Are these tweets maybe messing with our model, or maybe they’re really complicated? At this point, I spent about 45 minutes trying to debug this visualization tool because I kept seeing the same thing. I was like, “Ugh, the thing that shows the text when I hover is obviously wrong, because it keeps showing me 20 of the same tweet instead of one, so I must have a loop that’s wrong somewhere.”
It turns out that 10% of the data is basically duplicates or near duplicates, and usually duplicates or near duplicates of the worst things, because a lot of the duplicates are people that tweet things for contests. They’ll tweet the exact same sentence with some extra stuff. This one says, “One Direction is my pick for Army Directioners.” The idea being that you have dozens and dozens of these repeated tweets that are going to be super heavily weighted by your model, but maybe not for a good reason; maybe this is not really what you care about. Then there are questionable labeling decisions. This person says, “I got drowned five times in the game today,” which is unfortunate, but probably not a natural disaster, yet it’s labeled as one. Then “China’s stock market crash” is labeled as irrelevant. That one, I think, is actually even more interesting, because is a stock market crash a disaster? Maybe, maybe not. You can see how you’d get into that situation after a lot of labeling, where you’d be like, “Well, I don’t know.”
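A first pass at finding exact and near duplicates can be as simple as counting tweets after crude normalization; the examples here are paraphrased stand-ins:

```python
from collections import Counter

tweets = [
    "I just entered the contest! #OneDirection is my pick for #ArmyDirectioners",
    "I just entered the contest!! #OneDirection is my pick for #ArmyDirectioners",
    "Forest fire near La Ronge Sask Canada",
]

def normalize(text):
    """Crude canonical form: lowercase, alphanumeric characters only."""
    return " ".join("".join(ch for ch in w if ch.isalnum())
                    for w in text.lower().split())

counts = Counter(normalize(t) for t in tweets)
duplicates = [t for t, n in counts.items() if n > 1]
print(len(duplicates))  # 1 group of near-duplicates found
```

Fuzzier duplicates (a retweet with an extra word) need something stronger, such as comparing embedding distances or shingled n-gram overlap, but even this crude pass catches the contest-tweet pattern described above.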
Then there’s the even better version, where this is about the movie about the Chilean miners that were trapped, so you have two tweets about them: one is labeled relevant, the other is labeled not relevant. They’re the same tweet, so you can imagine how feeding that to a machine learning model would be extremely risky. What I wanted to show is that by just removing these duplicates, so removing a thousand duplicates and cleaning up the data, we get a much better model. In fact, this model performs on par with, if not slightly better than, the most complicated models I used in the blog post, even though this is the simplest model, used on just cleaner data.
There’s a little thing that I want to ask you about. If you look here, we have a better model, the metrics are better, the confusion matrix looks better, so we have cleaner data, a better model. It seems reasonable, but our accuracy only increased a little, which I was a little saddened by initially. After thinking about it, our accuracy should have dropped, does anyone know why? What I’m saying is, after we cleaned our data set, our new model on cleaner data, removing all these examples that I showed you, should be performing more poorly.
Participant 1: Typically, it should not treat the label the same way, therefore, correcting these labels. If you remove them, that means that you’re losing part of the labeled data that should have been labeled correctly, therefore, you’re losing things that you assumed were correct.
Ameisen: Yes, that’s exactly right. All of the duplicates were actually really easy cases, especially because if you’ve seen two, then you can guess the next 20. Even more, we had severe data leakage because we weren’t controlling for the duplicates, so if you have 30 of the same example, you probably had five in your training set and then 25 in the set that you used to validate. We would actually expect, since we removed all these duplicates, to have a much, much harder task. The fact that our model’s metrics have actually improved shows that our model is not just a little better, it’s much better, because it’s doing something much harder; the metric is only as good as the data.
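The leakage just described is easy to guard against: deduplicate before splitting, so the same tweet can never land in both the training and validation sets. A minimal sketch with made-up tweets:

```python
# Toy illustration of avoiding duplicate-driven leakage.
tweets = [
    "win tickets! retweet this to enter",
    "win tickets! retweet this to enter",
    "forest fire near la ronge",
    "win tickets! retweet this to enter",
    "flooding reported downtown",
]

# Deduplicate first (dict.fromkeys preserves order, drops exact repeats),
# then split into train/validation.
unique = list(dict.fromkeys(tweets))
split = int(0.8 * len(unique))
train, val = unique[:split], unique[split:]
print(len(unique), len(train), len(val))
assert not set(train) & set(val)  # no example appears on both sides
```

For near duplicates, you would deduplicate on a normalized form (as in the earlier sketch) rather than on the raw string.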
How can you find out about the quality of your data? The easy way is to inspect it like we did; the hard way is to just deploy to production. If you deployed that model to production, it would fail horribly, because it wouldn’t be dealing with this easy dataset full of data leakage, and then you’d know you did something wrong, but you’d have to go back to the first step anyway to look at it.
I want to talk about how you would do that for complex models, we did this for a simple model, but is there a way that we can go deeper where we basically look at the data to see what’s mislabeled? Once we’ve trained the model, can we see what’s particularly tripping our model up? Complex models are better, but they’re harder to debug and validate, oftentimes.
Here, I’ll skip a little bit of the details of complex models; I’m happy to talk about that in the questions. These are just a few that I’ve seen work well: CNNs, language models, and transfer learning. Once we have a complex model, as you now know, it’s not the end; there are a few things that we can do to debug them. I’d like to narrow it down to two things. One is LIME. LIME is one framework out of many that’s basically a black-box explainer. A black-box explainer makes no assumptions about what model you’re using; it tweaks the input in various ways, sees how your model responds, and then fits a surrogate model around that, which tries to give you explanations. In this example of something that was relevant, it removes words and says, “Oh, well, when I remove this word, your model says that it’s not relevant. When I remove that one, it says it’s more relevant, etc.,” and so it gives you an explanation. No matter what your model is, you can get an explanation; here it’s for a pretty complicated model, and we have a reasonable explanation of what’s going on. You can then use LIME on, let’s say, 1,000 examples picked at random in your dataset, and average which words it thinks are important, and that will give you important words for any model. That’s model-agnostic explanations; it’s really useful for high-level debugging.
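LIME itself ships as the `lime` package, but the underlying perturbation idea can be sketched by hand: drop one word at a time and watch how the model’s probability moves. The model and sentences below are toy stand-ins, not the talk’s actual models:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy model trained on made-up examples.
tweets = [
    "fires bombing evacuation now",
    "lovely dinner with friends",
    "bombing reported evacuation underway",
    "dinner with lovely friends now",
]
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(tweets, [1, 0, 1, 0])

def word_importance(sentence, model):
    """LIME-style probe: drop each word, see how P(disaster) changes."""
    base = model.predict_proba([sentence])[0][1]
    words = sentence.split()
    scores = {}
    for i, w in enumerate(words):
        perturbed = " ".join(words[:i] + words[i + 1:])
        # Positive score: removing the word lowered P(disaster),
        # so the word was pushing toward "disaster".
        scores[w] = base - model.predict_proba([perturbed])[0][1]
    return scores

scores = word_importance("bombing near my dinner", model)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Real LIME samples many perturbations and fits a weighted local surrogate rather than this one-word-at-a-time probe, but the intuition is the same.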
This is a trick that I don’t see done enough in practice, so this is really the one that I want to share: visualize the topology of your errors. You have a model, you’ve trained it, and then you’ll have this confusion matrix where you’ll say, “Ah, here’s our true positive, here’s our false positive, here’s our accuracy,” but what’s actually happening? What are these errors? Is there any rhyme or reason to them? What you can do is the same plot; here it’s the plot that we did before, and you’ll notice that it looks pretty different, because [inaudible 00:28:46] here in orange, it’s the predictions that our model got right, and in blue, the ones that it got wrong. Taking all your data, taking all your model’s predictions, seeing what it gets right, what it gets wrong, and seeing if there’s any structure.
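The error-topology plot boils down to projecting the embeddings to 2D and coloring each point by whether the prediction matched the label. A minimal sketch with random stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-ins: embeddings plus true labels and model predictions.
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(100, 50))
y_true = rng.integers(0, 2, size=100)
y_pred = rng.integers(0, 2, size=100)

projected = PCA(n_components=2).fit_transform(embeddings)
correct = y_true == y_pred  # the orange/blue split from the talk's plot

# With matplotlib:
# plt.scatter(projected[correct, 0], projected[correct, 1], c="orange")
# plt.scatter(projected[~correct, 0], projected[~correct, 1], c="blue")
print(correct.sum(), (~correct).sum())
```

If the blue (wrong) points cluster together, that cluster is worth reading example by example: it usually points at conflicting labels, leftover duplicates, or a data gap.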
By looking at that and zooming in, here we zoom in on the bottom right side, you can see that there are labels that are in conflict. Here there’s one pair that’s basically the same thing, where they’re similar sentences: one is labeled as irrelevant, the other as relevant. You can see that there are still even more duplicates; apparently, our duplicate finder was not perfect. Then there are data gaps, and that one is, I thought, a little cheeky, so I added it.
There are a lot of examples of a joke that was apparently going around Twitter, which is, “I just heard a loud bang nearby. What appears to be a blast of wind from my neighbor’s behind,” so sort of a crass joke. Then there’s, “Man hears a loud bang, finds a newborn baby in a dumpster.” One of them is really bad, the other is just a joke, but our model has seen so much of that joke that it has associated a loud bang with just, “Ah, certainly, that’s about a joke.” and so it said that this horrible baby story was a joke and it’s fine.
Looking at the errors of your model can also make you see, “Ah, this is very clear because we have a gap in our data,” and this is oftentimes a lot of what happens when either a recommender system has gone wrong or search has gone wrong: sometimes a malicious actor has found some gap where your model isn’t good and has exploited it. Using this visualization of what our model actually gets wrong is super helpful for finding these.
After seeing hundreds of these projects, what I’ve learned, and I know it saddens most data scientists I talk to, is that, basically, your priorities in order should be to resolve all the duplicates and the conflicting data, then fix all the inaccurate labels that you learn about while doing this. After you’ve done this about 30 times, you can usually look at your errors and more definitively say, “Ah, perhaps, to understand the context of the joke, we should use a language model.” That’s usually the step that is useful, but only after you’ve done all of these many steps before. A better dataset solves more problems than any model.
In practice, what I recommend is just to find a way to vectorize your data. This seems like a simple tip, but it’s so important, because debugging a dataset by just looking at thousands of examples, especially for NLP, is extremely hard and mind-numbing. Vectorize it, organize it in some fashion, use different methods, then visualize and inspect each part of it, and iterate. That’s usually the fastest way that we’ve seen to build models that work in practice.
You can find me on Twitter @empowered, if you want to know more about these projects, there are a bunch of ones on our blog. You can apply to Insight, or you can come partner with us.
Questions and Answers
Participant 2: This is more of a curiosity question based on the example you had given saying it took your friend’s name Tosh and kept thinking it is a bad word, or something of that sort. What was the reasoning behind that using the same techniques? What was it that led to that conclusion for the model?
Ameisen: The more complicated a model is, the harder it is to say, so that one was actually quite puzzling to me. I don’t know that I have a good answer even now; my best answer right now is that the model wasn’t working on words, it was working on sub-word units called byte pair encodings. If you think of a few curse words that I won’t mention here, that are a few letters away from Tosh, those curse words were in the dataset, and so that’s what I think happened.
Participant 3: Very interesting talk, thanks very much. One of the questions I have has to do with the models you’ve used in your career. I’m new to the NLP field, I’ve used Naïve Bayes a couple of times, I haven’t gone as far as to use recurrent neural networks. Would you say that neural networks or recurrent neural networks are the best model to try every time, or do you think it’s a case-by-case decision?
Ameisen: The best model to just try every time is a bit of a complex question, because there’s what the best model will be at the end. If your model’s implementation is perfect and your data is perfect, then usually a Transformer or an RNN, if you have enough data, will give you the best results. In practice, that never happens, even if you have a large dataset. In practice, what happens is that for whatever task you want, your dataset is imperfect in some way. The best model to use, in my opinion, is actually even the scikit-learn tutorial of count vectors plus logistic regression, or word2vec plus logistic regression, and then doing at least one pass of this deep dive, because you know you’ll be able to code that model in 10 minutes. Then, looking at the data, you’ll get much, much more value than anything else, by simply being able to say, “Oh, 10% of my examples are mislabeled; no matter what model I use, it’ll be wrong in the end.” So usually the best model is the one that you’ll implement in five minutes and be able to explore. Once you’ve done that a few times, you can go for the artillery of models, RNNs and Transformers.
Participant 4: More curiosity: a lot of people type emojis, and in a lot of our data you end up having a million smiley faces and sad faces. Is that part of the vectorization? How do you deal with unknown words like that?
Ameisen: It depends on your dataset and on what you want to keep. For this one, the emojis were removed from the dataset, which I personally think is a terrible decision for this dataset, because you probably want to capture them; if there are 12 angry faces in one of those tweets, it’s probably a big indicator. You can sometimes pre-process them away; if you were just trying to see whether a person ordered a pizza or a burger, you don’t need emojis, or maybe you do, actually, for a pizza or a burger, but it depends on the use case, as far as how you would represent them and how you’d recognize them. I don’t know if they are in a lot of pre-trained word vector models, because those are based on Wikipedia and Google News, and those don’t have that many emojis, but you can train your own word vectors.
I skipped over that slide, but fastText is actually a really good solution to train your own word vectors and then you can define any set of characters. You can use emojis, as long as there’s enough in your dataset to learn what a frowny face maps to, you’ll be able to just use them as a regular word.
Participant 5: Thanks for the talk, I’m wondering regarding conflicting labels or conflicting interpretations, have you explored whether you can leverage the conflicting interpretation and maybe serve different models based on whether the consumer of the output aligns more with one labeler or versus labeler?
Ameisen: For a lot of these conflicting labeling cases, as you said, it’s hard to determine whether it goes one way or the other, so using the user preference is a good idea. For this particular project, no, but here’s how that’s done a lot in practice. What we essentially have here is vectorized representations of all these tweets, and from those, we tell you whether it’s relevant or not. If you wanted to take user input into account, what you could do is vectorize the user as well, so find a vector representation of a user, and then use it as an input to your model.
That’s what YouTube does, or at least, according to their 2016 paper, that’s what they used to do, where based on what you’ve watched, etc. they have a vector that represents you. Then based on what you search, they have another vector, then, they feed both of those to their models. That means that when I search the same thing that you search, maybe because our viewing history is entirely different, we get different results, and so that allows you to incorporate some of that context. That’s one of the ways, there are other ways as well.
Participant 6: Great talk. I was curious if you have a task classification. By default, your instinct is to use something like word embeddings, but is there any use for character-level embeddings, in your experience, or sentence-level embeddings?
Ameisen: It always depends on your dataset, basically, is the answer; maybe with a lot of emojis, you do something slightly different. If it’s really important that you keep the order of the words in a sentence, because you have a dataset where it’s very much about how the sentences are formulated, then you want to use something like an LSTM or BERT, something a little more complicated. Essentially, all of those boil down to finding a vector for your sentence, and what the best vector is, is task-dependent. However, what we found in practice is that for your initial exploration phase, usually using some pre-trained simple word vectors works. I would say the point here is just to find something where you can get vectors that are reasonable really quickly, so that you can inspect your data, and then worry about what the best implementation is. I wouldn’t say there’s an overall best one; it depends really on the task and what your data looks like. Sometimes you want character-level, because just capturing a vocabulary won’t work when there’s a lot of variance; sometimes you do want word-level, especially if you have small datasets, so it really depends.
Article originally posted on Data Science Central. Visit Data Science Central
This article was originally published on OpenDataScience.com, written by Daniel Gutierrez.
For many new data scientists transitioning into AI and deep learning, the Keras framework is an efficient tool.
Keras is a powerful and easy-to-use Python library for developing and evaluating deep learning models. In this article, we’ll lay out the welcome mat to the framework. You should walk away with a handful of useful features to keep in mind as you get up to speed.
In the words of the developers, “Keras is a high-level neural networks API, written in Python and developed with a focus on enabling fast experimentation.” It has been open sourced since its initial release in March 2015. Its documentation can be found on keras.io with source code on GitHub.
The Genesis of Keras
Although there are many quality deep learning frameworks to choose from, the Keras framework makes it easy to get started because of its design as an intuitive high-level API. It is attractive to new data scientists because with it, they can quickly prototype and develop new models.
Keras was built with modular building blocks, and programmers can easily extend it with new custom layers. Considered a deep learning “front end” with a choice of “back ends,” it is an important tool for any data scientist working with neural networks and deep learning. It is particularly useful for training convolutional neural networks that have small training data sets.
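To make that modularity concrete, here is a minimal Sequential model, assuming TensorFlow's bundled Keras; the random training data is a stand-in for a real dataset:

```python
import numpy as np
from tensorflow import keras

# A small fully connected binary classifier built from layer "building blocks".
model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Train briefly on random stand-in data just to show the workflow.
X = np.random.rand(64, 8)
y = np.random.randint(0, 2, size=64)
model.fit(X, y, epochs=2, batch_size=16, verbose=0)
print(model.predict(X[:1], verbose=0).shape)  # (1, 1)
```

Swapping in a custom layer means subclassing `keras.layers.Layer` and dropping it into the same list, which is the extension point the article refers to.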
François Chollet, a deep learning researcher at Google, developed the framework as part of project ONEIROS (Open-ended Neuro-Electronic Intelligent Robot Operating System). In 2017, Google’s TensorFlow team decided to support Keras in TensorFlow’s core library. Microsoft added a CNTK back end to the framework, which was available as of CNTK v2.0.
Keras is based on object-oriented design principles. This characteristic was described by its author François Chollet in the following way:
Another important decision was to use an object-oriented design. Deep learning models can be understood as chains of functions, thus making a functional approach look potentially interesting. However, these functions are heavily parameterized, mostly by their weight tensors, and manipulating these parameters in a functional way would just be impractical. So in Keras, everything is an object: layers, models, optimizers, etc. All parameters of a model can be accessed as object properties: e.g. `model.layers[3].output` is the output tensor of the 3rd layer in the model, `model.layers[3].weights` is the list of symbolic weight tensors of the layer, and so on.
Choice of Back Ends
Though developers initially built Keras on top of Theano, its abstraction ability made it easy for them to add TensorFlow shortly after Google released the back end. Eventually, the Keras API was implemented as part of Google TensorFlow.
Now, the deep learning front end supports a number of back end implementations: TensorFlow, Theano, Microsoft Cognitive Toolkit (CNTK), Eclipse Deeplearning4J, and Apache MXNet.
Seamless Python Integration
As a native Python package, Keras offers seamless Python integration. This includes simple access to the complete Python data science feature set, and framework extension using Python. The Python Scikit-learn API can also use Keras models. There are great tutorials to integrate these two tools and develop simple neural networks.
Porting Between Frameworks
Industry experts regard the Keras framework as the accepted tool to use to migrate between deep learning frameworks. Developers can migrate deep learning neural network algorithms and models along with pre-trained networks and weights.
Runs on CPUs and GPUs
The framework runs on both CPUs and GPUs. It can use single or multiple GPUs to train deep neural networks, using the NVIDIA CUDA Deep Neural Network library (cuDNN) for GPU acceleration. GPU training is typically much faster than training on a CPU because Keras was designed to take advantage of parallel computation.
Keras in the Cloud
There are many options to run the framework on a cloud service, including Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, and IBM Cloud.
Graph Data Structure
Rather than relying on graph data structures from underlying back end frameworks, Keras has its own structures to handle computational graphs. This approach frees new data scientists from being forced to learn to program the back end framework. This feature is also what prompted Google to add the API to its TensorFlow core.
Deep Learning Education
Keras is a common choice for deep learning education because it allows beginners and seasoned practitioners alike to quickly build and train neural networks. What’s more, it allows users to do so without wrestling with low-level details, which makes it easier to begin to understand the mechanics of deep learning.
The Keras framework is used to teach deep learning on popular online platforms, including on Coursera with Andrew Ng and fast.ai with Jeremy Howard. (The latter has been Kaggle’s number 1 competitor for two years.)
As I’ve stressed in this article, Keras is a popular, high-level, deep learning API that helps data scientists rapidly build neural networks using a modular approach. It provides support for multiple back ends and allows for training on CPUs or GPUs. With it, data scientists can iterate machine learning hypotheses and move from experiments to production in a seamless fashion.
Article originally posted on InfoQ. Visit InfoQ
AWS announced the release of their Open Distro for Elasticsearch back in March. However, the release has not come with support from all members of the community. While AWS state that they have released Open Distro in order to ensure that Elasticsearch remains fully open source, other members of the tech community claim this is another move by Amazon to further solidify their strong customer base.
The Open Distro for Elasticsearch is, according to AWS, a value-added distribution of Elasticsearch licensed fully under the Apache 2.0 license. This release leverages the open source code from Elasticsearch and Kibana. According to Jeff Barr, chief evangelist for AWS, “this is not a fork; we will continue to send our contributions and patches upstream to advance these projects.”
The first release contains a number of new features including advanced security, event monitoring, alerting, performance analysis, and SQL query features. However, as Itamar Syn-Hershko, CTO at BigData Boutique, notes, these features align closely to the Elastic X-Pack feature-set. Elastic open-sourced the previously proprietary X-Pack code in 2018. However, in doing so they put the X-Pack under a new Elastic License which prevents re-selling or re-distributing the code to any third-party. This therefore prevents AWS from using the open sourced X-Pack code in their own AWS Elasticsearch offering. In doing so, Elastic moved their previously 100% Apache 2.0 licensed open-source repositories into a mixture of Apache 2.0 and Elastic licensed code. Elastic noted the following on their blog post sharing that the X-Pack code was being open-sourced:
We did not change the license of any of the Apache 2.0 code of Elasticsearch, Kibana, Beats, and Logstash – and we never will. We created a new X-Pack folder in each of these repositories that is licensed under the Elastic License, which allows for some derivative works and contribution.
However, AWS views this as a negative move away from a pure open-source model. According to AWS, they have received feedback from their customers and partners that these changes are concerning, and present uncertainty about the longevity of the Elastic open-source projects. Adrian Cockcroft, VP cloud architecture strategy at AWS, notes that:
When important open source projects that AWS and our customers depend on begin restricting access, changing licensing terms, or intermingling open source and proprietary software, we will invest to sustain the open source project and community.
Cockcroft continues by explaining that AWS responded similarly when Oracle indicated they would make significant changes to how they support Java. In that instance, AWS released the Corretto project providing a multi-platform distribution of OpenJDK. Cockcroft further explains that in his opinion:
The maintainers of open source projects have the responsibility of keeping the source distribution open to everyone and not changing the rules midstream.
According to Cockcroft, AWS has discussed these concerns with Elastic and even offered to provide resources to support a community-driven, single-licensed version of Elasticsearch. However, as Cockcroft states, “[Elastic] has made it clear that they intend to continue on their current path.” Shay Banon, CEO of Elastic, expressed a differing viewpoint in an article he recently published. In the article, Banon states: “Companies have falsely claimed that they work in collaboration with our company, topically Amazon.”
This move was not met with support by a number of members of the community. Sharone Zitzman, head of developer relations at AppsFlyer, was critical of how AWS presented their decision. She expresses her disdain with AWS in her recent post:
Preaching open source to a vibrant open source company with deep roots in the OSS values – that has been fully transparent about their needs to monetize and maintain a stellar product, and make dubious claims about its authenticity is simply disingenuous. This is Amazon seeing someone’s shiny toy, and just wanting it for themselves. This is called a fork.
However, Adam Jacob, CTO at Chef, disagrees with Zitzman and feels that this move by AWS is a positive one for open-source software in general. He explains that the primary winners in this are the values of Free Software:
Let me be 100% clear: this is not a failure of Open Source. This is the deepest, most fundamental truth about Open Source and Free Software in action. That you, as a user, have rights. That those rights extend to everyone, including AWS – or they don’t exist at all.
DigitalOcean’s survey found a strong belief that AWS does not support open source: only 4% of respondents answered that AWS “embraces open source the most,” compared with Google at 53%, Microsoft at 23%, and Apple at 1%. Joe Brockmeier, editorial director for Red Hat blogs, notes that while Amazon uses Linux to power its servers and its Kindle devices, it doesn’t appear in the top 20 kernel contributors.
While the response to the release of AWS’ Open Distro for Elasticsearch is heavily mixed, it appears that this pattern of AWS producing its own versions of open-source products will continue.
Where do you stand on this issue? Do you feel that this move by AWS is in the best interest of the open source community? Share with the community in the comments below.
Article originally posted on InfoQ. Visit InfoQ
CircleCI has announced new partner integrations as part of their Technology Partner program. CircleCI previously introduced a package management solution called Orbs. Orbs bundle common CI/CD tasks into reusable, shareable packages. With this announcement, CircleCI has added partner-supported orbs for AWS, Azure, VMware, Red Hat, Kublr and Helm.
Orbs are shareable components that combine commands, executors, and jobs into a single, reusable block. This allows organizations to share their preferred CI/CD workflows across teams and projects. It also simplifies the integration of third-party solutions into the CI/CD pipeline with a minimal amount of code.
CircleCI provides an orbs registry allowing for the sharing of official orbs, partner-provided orbs, and orbs contributed by the community. Tom Trahan, head of business development at CircleCI, indicated in a conversation with InfoQ that since the launch of Orbs back in November there are now over 700 orbs available in the registry. Within the registry, orbs are marked as Certified if they are provided by CircleCI, or Partner if supplied by a partner integration.
This latest announcement sees additional orbs added from both CircleCI and partners to support and manage Kubernetes services and environments. This includes orbs for Google Kubernetes Engine, Amazon Elastic Container Service for Kubernetes (EKS), Azure Kubernetes Service, and Red Hat OpenShift. Orbs were also added to facilitate interacting with container registry services from Amazon, Google, Docker, and Azure.
Nathan Dintenfass, product manager with CircleCI, shared that orbs are meant to solve three key problems. The first is providing better DRY support within CircleCI configuration. Secondly, the team looked to make code reuse possible between projects. Finally, they looked to provide easier paths to common configuration and to reduce the amount of boilerplate common in setting up CI/CD pipelines. In preparing the new package management solution, the team looked to maintain existing decisions made with how CircleCI configuration is structured: preferring the configuration to be in code, providing deterministic builds, and storing configuration as data.
As Dintenfass explains, the team made the decision that orb revisions are immutable. This ensures that no changes are shipped without first adjusting the version. For version tracking, orbs must follow a strict semantic versioning approach. To allow for development versions, orbs can be published with a version such as dev:foo. Development orbs are mutable by anyone on the team and expire 90 days after the last publish date. Orbs published with a semver version are considered production orbs and are then immutable and durable.
Orb dependencies are locked at the time of publishing. For example, if Orb-A has a dependency on Orb-B, that dependency will be version-locked when Orb-A is added into the registry. If a new version of Orb-B is shipped to the registry, it will not be incorporated into Orb-A until a new version of Orb-A is published. Dintenfass indicates that this choice aligns with the decision to provide deterministic builds.
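To make the locking behavior concrete, here is a toy model of publish-time dependency resolution in plain Python. The class and method names are hypothetical illustrations of the idea, not CircleCI's actual API:

```python
# Toy model of publish-time dependency locking. The class and method
# names here are hypothetical illustrations, not CircleCI's API.

class OrbRegistry:
    def __init__(self):
        self._orbs = {}  # (name, version) -> {"deps": {name: version}}

    def latest(self, name):
        """Latest published version of an orb (naive string ordering,
        good enough for this sketch)."""
        return max(v for (n, v) in self._orbs if n == name)

    def publish(self, name, version, deps=None):
        """Publishing resolves each dependency to its latest version
        at publish time and freezes that choice."""
        locked = {d: self.latest(d) for d in (deps or [])}
        self._orbs[(name, version)] = {"deps": locked}

    def resolved_deps(self, name, version):
        return self._orbs[(name, version)]["deps"]


reg = OrbRegistry()
reg.publish("orb-b", "1.0.0")
reg.publish("orb-a", "1.0.0", deps=["orb-b"])  # locks orb-b@1.0.0
reg.publish("orb-b", "1.1.0")                  # a later orb-b release

# orb-a still resolves the version locked when it was published:
print(reg.resolved_deps("orb-a", "1.0.0"))  # {'orb-b': '1.0.0'}
```

Because orb-a locked its dependency at publish time, the later orb-b release has no effect until orb-a itself is republished, which is exactly the determinism described above.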
Each orb lives in a unique namespace. As Dintenfass describes, “There is no ’empty’ namespace, nor are there reserved special defaults like _ for CircleCI or ‘official’ orbs. We decided that we didn’t want orbs we author to be considered the default set or have special significance in our namespacing scheme.” They have introduced a Certified Orbs program, where certified orbs are treated as part of the platform by the CircleCI team. At the time of writing, only orbs in the circleci namespace are certified.
At this time, all orbs are open-source. Dintenfass indicates that: “If you can execute an orb you can see the source of that orb. This is so you are never executing a black box inside your jobs, where your code and secrets are present.” While there is no automatic static scanning of orbs at this time, Trahan has shared that this is on the roadmap for an upcoming release. He also added that all certified and partner orbs go through a review process by CircleCI to ensure they follow best-practices.
Trahan provided details that CircleCI has themed releases planned as part of their roadmap. The team reviews common use cases from their clients and looks to provide improvements to simplify those cases. According to Trahan, future themes will include security improvements (vulnerability scanning, secrets management, and policy compliance), tools to provide management of open-source projects, and automated testing solutions. In addition to future feature improvements, the team is also adding new partners on a monthly basis.
CircleCI has provided documentation on both how to use and publish orbs. The current listing of orbs can be viewed in the Orb Registry. Orbs are currently available within both the free and paid tiers of the cloud offering. While not currently available on the self-hosted version, Trahan has indicated this is on their roadmap. For more information about the new partners and orbs added to the registry, please review the official announcement on the CircleCI blog.
Article originally posted on Data Science Central. Visit Data Science Central
I pulled out a dusty copy of Think Stats by Allen Downey the other day. I highly recommend this terrific little read that teaches statistics with easily understood examples using Python. When I purchased the book eight years ago, the Python code proved invaluable as well.
Downey also regularly posts informative blogs. One that I came across recently is There is Still Only One Test, which explains statistical testing through a computational lens rather than a strictly mathematical one. The computational angle has always made more sense to me than the mathematical one, and in this blog Downey clearly articulates the computational approach.
The point of departure for a significance test is the assumption that the difference between observed and expected is due to chance. A statistic such as the mean absolute difference or the mean square difference between observed and expected is then computed. In the simulated case, data are randomly generated in which the “observed” are sampled from the “expected” distribution (where, by design, there is no statistical difference between observed and expected), the same comparison statistic is calculated, and the list of all such computations is stored and sorted. The actual observed-expected statistic is then contrasted with those in the simulated list. The p-value represents how extreme that value appears in the ordered list. If it falls in the middle, we’d accept the null hypothesis that the difference is simply chance. If, on the other hand, it lies outside 99% of the simulated calculations, the p-value would be < .01, and we’d be inclined to reject the null hypothesis and conclude there’s a difference between observed and expected.
The remainder of the blog addresses the question of whether 60 rolls of a “fair” 6-sided die could reasonably yield the distribution of frequencies (8,9,19,6,8,10), where 19 represents the number of 3’s, 8 denotes the number of 1’s and 5’s, etc. The expected counts for a fair die would be 10 for each of the 6 sides. Three comparison functions are considered: the first is simply the max across all side frequencies; the second is the mean square difference between observed and expected side frequencies; and the third is the mean absolute difference between observed and expected side frequencies.
The technology used is JupyterLab 0.32.1 with Python 3.6.5. The simulations are showcased using Python’s functional list comprehensions. The trial frequencies are tabulated using the Counter function in Python’s nifty collections library.
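A minimal sketch of that simulation for the die-roll question, using only the standard library (this illustrates the approach; it is not Downey's exact code):

```python
import random
from collections import Counter

observed = [8, 9, 19, 6, 8, 10]  # counts of sides 1..6 over 60 rolls
expected = [10] * 6              # expected counts for a fair die

def mean_abs_diff(obs, exp):
    return sum(abs(o - e) for o, e in zip(obs, exp)) / len(obs)

def simulate_counts(n_rolls=60):
    """Roll a fair die n_rolls times and return per-side counts."""
    counts = Counter(random.randint(1, 6) for _ in range(n_rolls))
    return [counts.get(side, 0) for side in range(1, 7)]

random.seed(0)
actual = mean_abs_diff(observed, expected)
sims = sorted(mean_abs_diff(simulate_counts(), expected)
              for _ in range(10_000))

# p-value: fraction of fair-die statistics at least as extreme.
p_value = sum(s >= actual for s in sims) / len(sims)
print(f"statistic={actual:.2f}, p-value={p_value:.4f}")
```

With the observed counts (8,9,19,6,8,10), the mean absolute difference works out to 3.0, and the p-value is simply the fraction of fair-die simulations that are at least that extreme.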
Read the entire post here.
Article originally posted on Data Science Central. Visit Data Science Central
The energy industry has undergone a rapid transformation in the recent past, owing to the enhanced role of renewables and to data-driven models making the value chain smarter. Across the primary constituents of this sector, comprising coal, power, renewables, solar energy, oil, and gas, there is a huge role AI can play.
We illustrate some key use cases below:
1. Smart Grid
The biggest disruption in power in recent times is the smart grid, which is far more flexible than the traditional grid. AI can be a huge enabler here, for example by providing optimal configurations, to create a truly smart and efficient grid.
2. Distribution Losses
By thorough analysis of loss-related data, AI can help prevent transmission and distribution losses.
3. Fine Tune Supply
In the case of the smart grid, there are large amounts of consumer data available, such as consumption patterns, which can help fine-tune supply from the smart grid.
4. Conserve Consumption
Precise load-pattern data from the smart grid ecosystem can enable dynamic configuration of the grid. For example, lighter load during afternoon hours can help conserve energy by switching off capacity during those lean times.
5. Consumption Production
Via microgrids, houses can today become net suppliers to the grid and make money. Proper analytics of the consumption-production tradeoff can help homes maximize those earnings.
6. Controlling & Optimization
Via powerful visualization and analytics tools, power grid configurations can be controlled and optimized efficiently.
7. Preventive Maintenance
Across the energy sector, there is a need to adopt the practice of preventive maintenance. Machinery such as turbines and windmills frequently requires repairs and maintenance. With an AI-enabled preventive maintenance platform, this equipment can give early indications of faults, thereby avoiding reactive maintenance. This is a huge cost saving for the energy industry. It could even prevent major disasters by flagging problems early.
8. Image Classification
In the context of the oil, gas, and coal industries, AI techniques such as image classification can play a huge role in the analysis of mine-earth or seismic data, highlighting potential areas for digging. This is a huge cost saver.
9. Optimal Placements
In the case of assets such as solar cells, optimal placements can be obtained by the use of AI algorithms.
10. Optimal Supply Chain
By the use of sensors combined with AI, an optimal supply chain can be designed for the transportation of coal, oil, gas, and other commodities.
11. Predicting Movements
AI can be used to better predict potential movements in stock indices based on commodities.
12. Autonomous Transportation
Via the use of autonomous trucks, several mining companies are optimizing transportation costs from remote mines.
13. Chatbot for Customer Service
Last but not least, AI can be used in customer service; for example, a power company can use a chatbot to resolve customer queries.
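As a flavour of how simple the preventive-maintenance idea in point 7 can be to prototype, the sketch below flags sensor readings that deviate sharply from their recent baseline. The turbine temperatures are made up, and a real system would use far richer models:

```python
from statistics import mean, stdev

def flag_anomalies(readings, window=5, threshold=3.0):
    """Flag indices whose reading deviates more than `threshold`
    standard deviations from the mean of the preceding window."""
    flags = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(readings[i] - mu) / sigma > threshold:
            flags.append(i)
    return flags

# Hypothetical turbine bearing temperatures (deg C); index 8 spikes.
temps = [70.1, 70.4, 69.9, 70.2, 70.0, 70.3, 70.1, 70.2, 84.5, 70.1]
print(flag_anomalies(temps))  # [8]
```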
Article originally posted on InfoQ. Visit InfoQ
Shifting the test team to the left brought the whole team closer together, enabled faster learning, and improved collaboration, claimed Neven Matas, QA team lead at Infinum. He spoke at TestCon Moscow 2019 where he shared the lessons learned from building a QA team in a growing organization.
One of the challenges Infinum had to deal with was adapting to a growing influx of projects, each of which had its own set of demands. The differences in team size, project length, technologies, and architecture usually meant that they needed to approach each project from a different angle, all the while trying to welcome and mentor new people into the QA team, said Matas.
As Infinum is a software agency, client demands also tended to vary wildly from project to project. While some just needed routine exploratory testing from time to time, others wanted to invest in a full-on software testing experience, said Matas. This meant that on one project they might have to make brief inquiries into new features, and on others maintain extensive test documentation, automate test cases, and involve themselves in the project at all levels. “Any semblance of a silver bullet wasn’t in sight”, said Matas.
The one thing all Infinum projects have in common is a perspective of shifting the test team to the left, involving testers (junior or senior) in all project-related matters with an emphasis on continuous participation as a key to good QA, said Matas.
Simply delivering compiled code at the end of an iteration to a person unaware of the thought process behind it usually kept them from doing an excellent job, argued Matas. He stated: “I firmly believe bugs can be avoided way before a single line of code is written.” By intervening at an earlier point in the proceedings, they brought the team closer together, enabled faster learning, and more collaboration from the get-go.
Matas also mentioned that they started by putting much more effort into employing and mentoring more testers in an attempt to keep pace with the growing number of developers that came with the company’s rapid expansion.
Matas presented his vision for success in building a good QA team, which is built around four key areas:
Context – Approaching each software project differently, by examining the risks, recognizing the key features and developing a tailor-made testing strategy that will eke out the highest testing ROI.
Variety – Employing people with various backgrounds who will help make the team heterogeneous, bring valuable domain knowledge, and help grow the team in multiple directions.
Knowledge – Consistently working on developing their team members’ skills, not failing to recognize the fact that technical and people skills often bear equal importance.
Collaboration – No non-trivial software project can succeed in the long term without positive and powerful collaboration between team members. You should empower your employees to work with others, push for improvements, converse with clients, and leave the comfort zone often. If you do that well, you will never hear the dreaded “this is not my job” again.
InfoQ spoke with Neven Matas after his talk at TestCon Moscow 2019.
InfoQ: You mentioned in your talk that the attention of quality has shifted over time. Can you elaborate?
Neven Matas: In the beginning, not every project had a specialized tester; testing and quality workloads were distributed amongst the team. With increasing complexity, we figured out that having a dedicated quality person on each project brought tangible benefits. Even though quality is always a team effort, this sort of dedication gave the QA team a chance to find novel ways of pushing projects closer and closer to perfection. No one knows the app as well as a software tester, since that person gets to live with it day in, day out on a level which is only really familiar to end users. Not only that, but those “big picture seers” can, in the end, become the “big picture influencers” if they pinpoint bottlenecks and find ways to fix problems in the entire software development process, not just the end product.
InfoQ: How do you support your team to keep improvement ongoing?
Matas: There are several things we implemented in our process that are helping us to become continuously self-improving:
- Workshops – Each week the QA team holds a two-hour meeting where we come together as a team to share knowledge on either a theoretical or a practical topic. The workshops range from “how to be more assertive” to “writing assertions in unit tests.”
- One-on-One meetings – Each week I, as a team lead, sit down with members of my team to go through issues each person might be having in their particular projects and ways of tackling them in the short term. We also discuss things they managed to do well in order to trickle down that knowledge to the entire team later on.
- QA Buddy – We recognized the need for early feedback, so you get paired with a different person every couple of weeks. This means that you will jointly work on sharing experiences, examining each others’ test cases and approaches, and do some pair testing to help each other minimize the effects of tester’s fatigue.
- Educational budget – Each team member gets a yearly educational budget which they can spend on workshops, online tutorials, software, conferences, books, etc. We also curate a small library of QA and software development books which is open to everyone.
- Switching projects – A significant thing is “tester’s fatigue” – the phenomenon where you, as a tester, can no longer see the forest for the trees due to over-familiarity with the project. A welcome relief is switching projects for a while to gain a fresh perspective and reboot your testing senses.
Article originally posted on Data Science Central. Visit Data Science Central
Data science is a multidisciplinary blend of data inference, algorithm development, and technology used to solve analytically complex problems. At the core is data: troves of raw information, streaming in and stored in enterprise data warehouses. There is much to learn by mining it, and advanced capabilities we can build with it. Data science is ultimately about using this data in creative ways to generate business value.
The broader field of data science encompasses mathematics, statistics, computer science, and information science. For a career as a data scientist, you need a strong background in statistics and mathematics. Big companies will always give preference to candidates with good analytical and statistical skills.
In this blog, we will be looking at the basic statistical concepts which every data scientist must know. Let’s understand them one by one in the next section.
Role of Statistics in Data Science
Before beginning with the five most important statistical concepts, let us try to understand the importance of statistics in data science first!
The role of statistics in data science is as important as that of computer science. This holds, in particular, for the areas of data acquisition and enrichment, as well as for the advanced modelling needed for prediction.
Only by complementing and combining mathematical methods and computational algorithms with statistical reasoning, particularly for Big Data, will we arrive at scientific results based on suitable approaches. Ultimately, only a balanced interplay of all the sciences involved will lead to successful solutions in data science.
Important Concepts in Data Science
1. Probability Distributions
A probability distribution is a function that describes the likelihood of obtaining the possible values that a random variable can assume. In other words, the values of the variable vary based on the underlying probability distribution.
Suppose you draw a random sample and measure the heights of the subjects. As you measure heights, you can create a distribution of heights. This type of distribution is useful when you need to know which outcomes are most likely, the spread of potential values, and the likelihood of different results.
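A quick sketch of that heights example, drawing a sample from an assumed normal distribution with a made-up mean and spread:

```python
import random
from statistics import mean, stdev

random.seed(42)

# Draw a sample of heights (cm) from an assumed normal distribution.
heights = [random.gauss(170, 8) for _ in range(10_000)]

print(f"sample mean:  {mean(heights):.1f}")
print(f"sample stdev: {stdev(heights):.1f}")

# Share of the sample within one standard deviation of the mean
# (about 68% for a normal distribution).
within_1sd = sum(162 <= h <= 178 for h in heights) / len(heights)
print(f"within +/-1 sd: {within_1sd:.0%}")
```

For a normal distribution, roughly 68% of values fall within one standard deviation of the mean, and the sampled distribution reflects that.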
2. Dimensionality Reduction
In machine learning classification problems, there are often too many factors on the basis of which we do the final classification. These factors are basically variables or features. The higher the number of features, the harder it gets to visualize the training set and then work on it. Sometimes, most of these features are correlated, and hence redundant. This is where dimensionality reduction algorithms come into play. Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It consists of feature selection and feature extraction.
An intuitive example of dimensionality reduction is a simple e-mail classification problem, where we need to classify whether the e-mail is spam or not. This can involve a large number of features, such as whether or not the e-mail has a generic title, the content of the e-mail, whether the e-mail uses a template, etc. However, some of these features may overlap. In another case, a classification problem that relies on both humidity and rainfall, we can club them into just one underlying feature, since the two are correlated to a high degree. Hence, we can reduce the number of features in such problems. A 3-D classification problem can be hard to visualize, whereas a 2-D problem can be mapped to a simple 2-dimensional space and a 1-D problem to a simple line.
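The humidity/rainfall case can be sketched as a tiny principal component analysis, projecting two correlated features onto one derived feature using only the standard library (the numbers are invented for illustration):

```python
from math import atan2, cos, sin

# Two highly correlated features (values invented for illustration).
humidity = [30, 45, 55, 60, 75, 85]
rainfall = [10, 20, 26, 30, 40, 46]

def pca_1d(xs, ys):
    """Project 2-D points onto their leading principal component."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cx = [x - mx for x in xs]
    cy = [y - my for y in ys]
    # Scatter-matrix entries (covariance up to a constant factor).
    sxx = sum(a * a for a in cx)
    syy = sum(b * b for b in cy)
    sxy = sum(a * b for a, b in zip(cx, cy))
    # Orientation of the leading eigenvector of [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * atan2(2 * sxy, sxx - syy)
    ux, uy = cos(theta), sin(theta)
    return [a * ux + b * uy for a, b in zip(cx, cy)]

component = pca_1d(humidity, rainfall)
print([round(c, 1) for c in component])  # one derived value per row
```

The single projected value per observation retains most of the variance that the two correlated features carried separately, which is exactly the point of feature extraction.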
3. Over and Under-Sampling
Oversampling and undersampling are data mining and data analytics techniques used to adjust imbalanced classes and create more uniform data sets. Together, they are also known as resampling.
When one class of data is under-represented (the minority class) in the data sample, oversampling techniques may be useful to duplicate those results for a more uniform number of positive results in training. Oversampling is useful when the data at hand is insufficient. A popular oversampling technique is SMOTE (Synthetic Minority Over-sampling Technique), which creates synthetic samples by randomly sampling the characteristics from occurrences in the minority class.
Conversely, if a class of data is over-represented (the majority class), undersampling may be useful to balance it against the minority class. Undersampling is appropriate when the data at hand is sufficient. Common methods of undersampling include cluster centroids and Tomek links, both of which target potential overlapping characteristics within the collected data sets to reduce the amount of majority data.
In both oversampling and undersampling, simple data duplication or deletion is rarely ideal on its own. Generally, oversampling is preferable, as undersampling can result in the loss of important data. Undersampling is suggested when the amount of data collected is larger than ideal and can help data mining tools stay within the limits of what they can effectively process.
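A minimal sketch of random over- and undersampling on a toy imbalanced data set (note that SMOTE synthesises new minority points by interpolation rather than duplicating them, which this sketch does not attempt):

```python
import random

random.seed(1)

# Imbalanced toy labels: 90 negatives, 10 positives.
data = [("neg", i) for i in range(90)] + [("pos", i) for i in range(10)]
majority = [d for d in data if d[0] == "neg"]
minority = [d for d in data if d[0] == "pos"]

# Random oversampling: draw minority samples with replacement
# until the classes are the same size.
oversampled = majority + random.choices(minority, k=len(majority))

# Random undersampling: keep only a majority subset the size
# of the minority class.
undersampled = random.sample(majority, k=len(minority)) + minority

print(len(oversampled), len(undersampled))  # 180 20
```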
4. Bayesian Statistics
Bayesian statistics is a particular approach to applying probability to statistical problems. It provides us with mathematical tools to update our beliefs about random events in light of new data or evidence about those events.
In particular, Bayesian inference interprets probability as a measure of the believability or confidence that an individual may possess about the occurrence of a particular event.
We may have a prior belief about an event, but our beliefs are likely to change when evidence is brought to light. Bayesian statistics gives us a mathematical means of incorporating our prior beliefs, and evidence, to produce new posterior beliefs.
Bayesian statistics provides us with mathematical tools to rationally update our beliefs in light of new data or evidence.
This is in contrast to another form of statistical inference, known as classical or frequentist statistics, which assumes that probabilities are the frequencies of particular random events occurring in the long run of repeated trials.
For example, as we roll a fair (i.e. unweighted) six-sided die repeatedly, we would see that each number on the die tends to come up 1/6 of the time.
Frequentist statistics assumes that probabilities are the long-run frequency of random events in repeated trials.
When carrying out statistical inference, that is, inferring statistical information from probabilistic systems, the two approaches — frequentist and Bayesian — have very different philosophies.
Frequentist statistics tries to eliminate uncertainty by providing estimates. Bayesian statistics tries to preserve and refine uncertainty by adjusting individual beliefs in light of new evidence.
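The prior-to-posterior update can be made concrete with the standard Beta-Binomial coin example (a textbook setup, not tied to any particular library):

```python
# Beta(a, b) prior over a coin's heads probability; observing
# h heads and t tails gives a Beta(a + h, b + t) posterior.

def update(prior, heads, tails):
    a, b = prior
    return (a + heads, b + tails)

def beta_mean(params):
    a, b = params
    return a / (a + b)

prior = (1, 1)                # uniform prior: no opinion yet
print(beta_mean(prior))       # 0.5

posterior = update(prior, heads=7, tails=3)
print(posterior, beta_mean(posterior))  # (8, 4) 0.666...

# More evidence sharpens the belief further.
posterior = update(posterior, heads=70, tails=30)
print(posterior, round(beta_mean(posterior), 3))  # (78, 34) 0.696
```

Each batch of evidence shifts the posterior mean toward the observed frequency of heads, which is precisely the rational belief-updating described above.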
5. Descriptive Statistics
This is the most common of all forms. In business, it provides the analyst with a view of key metrics and measures within the business. Descriptive statistics include exploratory data analysis, unsupervised learning, clustering and basic data summaries. Descriptive statistics have many uses, most notably helping us get familiar with a data set. It is the starting point for any analysis. Often, descriptive statistics help us arrive at hypotheses to be tested later with more formal inference.
Descriptive statistics are very important because if we simply presented our raw data it would be hard to visualise what the data was showing, especially if there was a lot of it. Descriptive statistics, therefore, enable us to present the data in a more meaningful way, which allows simpler interpretation of the data. For example, if we had the marks of 1,000 students for the SAT exam, we might be interested in the overall performance of those students. We would also be interested in the distribution or spread of the marks. Descriptive statistics allow us to do this.
Let’s take another example: a data analyst could have data on a large population of customers. Understanding demographic information about those customers (e.g. 20% of our customers are self-employed) would be categorized as “descriptive analytics”. Utilizing effective visualization tools enhances the message of descriptive analytics.
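Python's standard library covers these basic summaries directly; here is a short sketch with made-up marks (the same summaries scale to 1,000 SAT results or a customer database):

```python
from statistics import mean, median, mode, stdev
from collections import Counter

# Hypothetical marks for a small class.
marks = [62, 74, 74, 81, 55, 90, 68, 74, 79, 85]

print(f"mean:   {mean(marks):.1f}")
print(f"median: {median(marks):.1f}")
print(f"mode:   {mode(marks)}")
print(f"stdev:  {stdev(marks):.1f}")

# A simple frequency table of grade bands as a distribution summary.
bands = Counter(10 * (m // 10) for m in marks)
print(sorted(bands.items()))
```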
We had a look at important statistical concepts in data science. Statistics is one of the important components in data science. There is a great deal of overlap between the fields of statistics and data science, to the point where many definitions of one discipline could just as easily describe the other discipline. However, in practice, the fields differ in a number of key ways. Statistics is a mathematically-based field which seeks to collect and interpret quantitative data. In contrast, data science is a multidisciplinary field which uses scientific methods, processes, and systems to extract knowledge from data in a range of forms. Data scientists use methods from many disciplines, including statistics. However, the fields differ in their processes, the types of problems studied, and several other factors.
If you want to read more about data science, read our Data Science Blogs