Mobile Monitoring Solutions


Facebook Open-Sources RoBERTa: an Improved Natural Language Processing Model

MMS Founder
MMS Anthony Alford

Article originally posted on InfoQ. Visit InfoQ

Facebook AI open-sourced a new deep-learning natural-language processing (NLP) model, robustly-optimized BERT approach (RoBERTa). Based on Google’s BERT pre-training model, RoBERTa includes additional pre-training improvements that achieve state-of-the-art results on several benchmarks, using only unlabeled text from the world-wide web, with minimal fine-tuning and no data augmentation.

The Facebook team announced their work in a recent blog post as “part of Facebook’s ongoing commitment to advancing the state-of-the-art in self-supervised systems that can be developed with less reliance on time- and resource-intensive data labeling.” The team re-implemented Google’s BERT neural-network architecture in PyTorch, made several changes to the model’s hyperparameters, and trained the network with an order of magnitude more data and for more iterations. The model was evaluated on three common NLP benchmarks: General Language Understanding Evaluation (GLUE), Stanford Question Answering Dataset (SQuAD), and ReAding Comprehension from Examinations (RACE). RoBERTa outperformed BERT on these tests, and in some cases also outperformed the current leading model, XLNet.

Many machine-learning tasks require a labeled dataset, which consists of input examples tied to correct output values, against which the training process checks the AI’s answers. Because they often require human work, very large labeled datasets are relatively rare, especially compared to the wealth of unlabeled data for NLP that exists on the internet; for example, the contents of Wikipedia or Google News. Pre-training is an NLP strategy that uses large unlabeled datasets to create “general purpose language representation models”, which can then be “fine-tuned” for a specific NLP task on smaller labeled datasets. Open-sourced in late 2018, BERT, or Bidirectional Encoder Representations from Transformers, is an NLP architecture that uses pre-training to learn relationships between words, by predicting masked words in input sentences. BERT is based on the Transformer architecture, and was the first bi-directional deep-learning NLP model, meaning it could use words after the masked word, as well as those preceding it, as context for predicting the answer. BERT also models relationships between sentences by training on next-sentence prediction (NSP): given two sentences, does the second truly follow the first in the original text?

In creating RoBERTa, the Facebook team first ported BERT from Google’s TensorFlow deep-learning framework to their own framework, PyTorch. Next, they modified the word-masking strategy; BERT used a static mask, where the words were masked from sentences during pre-processing. RoBERTa uses dynamic masking, with a new masking pattern generated each time a sentence is fed into training. Next, RoBERTa eliminated the NSP training, as Facebook’s analysis showed that it actually hurt performance. Finally, RoBERTa was trained using larger mini-batch sizes: 8K sequences compared to BERT’s 256.
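
To make the masking difference concrete, here is a minimal Python sketch; the 15% masking rate matches the published BERT setup, but the toy sentence, function names, and three-epoch loop are invented for illustration:

    import random

    MASK, RATE = "[MASK]", 0.15  # BERT and RoBERTa mask roughly 15% of tokens

    def mask_tokens(tokens):
        """Return a copy of tokens with a fresh random selection masked."""
        return [MASK if random.random() < RATE else t for t in tokens]

    sentence = "the quick brown fox jumps over the lazy dog".split()

    # Static masking (BERT): the mask is chosen once during pre-processing,
    # so every epoch trains on the identical masked sentence.
    static = mask_tokens(sentence)
    for epoch in range(3):
        print("static :", static)

    # Dynamic masking (RoBERTa): a new masking pattern is generated each
    # time the sentence is fed into training.
    for epoch in range(3):
        print("dynamic:", mask_tokens(sentence))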

RoBERTa was evaluated against common NLP benchmarks and compared to the original BERT results and to XLNet, another transformer-based architecture that currently holds the top scores on several of the benchmarks. RoBERTa outscored BERT and XLNet on both the RACE benchmark and GLUE’s single-task benchmark. GLUE also has a public leaderboard for its ensemble benchmark, and RoBERTa achieved the “highest average score to date” on it. On the SQuAD v2.0 “dev” benchmark, RoBERTa set a new high score, and on SQuAD’s public leaderboard it is the top system that does not rely on training-data augmentation.

RoBERTa’s technical details and experiments are described more fully in a paper published on arXiv. Paper co-author Myle Ott joined a Reddit comment thread about the paper, providing more context and answering several questions. Ott said that “more data isn’t as important as training longer,” and continued:

Even training for significantly more epochs than past work, we still couldn’t overfit the BERT objective and consistently saw improved end-task results each time we trained for longer.

One commenter pointed out that the comparison with XLNet was not quite “apples-to-apples.” Ott agreed, saying:

Another difference, in addition to the ones you noted, is the data size and composition are different between XLNet and RoBERTa. We ultimately abandoned doing a direct comparison to XLNet-large for this work, since we wouldn’t be able to control for the data unless we retrained XLNet on our data.

RoBERTa’s pre-trained models and training code are available on GitHub.
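
As a quick usage sketch (assuming a working PyTorch install; the torch.hub entry point below follows the example documented in the fairseq repository, but check the README for the current interface):

    import torch

    # Download a pre-trained RoBERTa model through PyTorch Hub (fairseq).
    roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
    roberta.eval()  # disable dropout for deterministic inference

    # Masked-token prediction - the self-supervised task RoBERTa is
    # pre-trained on; returns the top-k fillers with their probabilities.
    print(roberta.fill_mask('RoBERTa was open-sourced by <mask> AI.', topk=3))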
 



Podcast: Judy Rees on Effective Remote Meetings

MMS Founder
MMS Judy Rees

Article originally posted on InfoQ. Visit InfoQ

This is the Engineering Culture Podcast, from the people behind InfoQ.com and the QCon conferences.

In this podcast Shane Hastie, Lead Editor for Culture & Methods, spoke to Judy Rees about making remote meetings effective, clean language, the series of articles she is curating for InfoQ, and the upcoming remote meeting that our listeners/readers are invited to participate in.



Presentation: Modern WAF Bypass Scripting Techniques for Autonomous Attacks

MMS Founder
MMS Johnny Xmas

Article originally posted on InfoQ. Visit InfoQ

Transcript

[Note: please be advised that this transcript contains strong language]

Xmas: I am Johnny Xmas. I have done pretty much everything there is to do in the security space. I have a rule: whenever I switch companies, I want to move into a role that’s something I’ve never done before, so that I can come out of it with a way more well-rounded understanding of what the boots on the ground are doing in all these places. I started out as a systems engineer and then moved into network engineering because I was really drawn to that. I’ve been a hobbyist hacker all of my life. I’ve gotten in plenty of trouble for it – I have whole other talks on that.

I was an information security engineer at a Fortune 500 company, Global 1000. That was amazing, I really got to cut my teeth in an amazing space when you’re dealing with something that’s at that level of an enterprise that really gave me a great understanding of what’s going on on the business side as well as the IT side. From there I was able to move into information consultancy and help other Fortune 500s etc. deal with the security problems in a way that would help IT be able to talk to the business and really cobble things together and remediate things in a way that everyone liked and everyone could understand.

From there, I moved into penetration testing specifically. That’s where all the fun stuff starts. Those are the guys who break into your company and hopefully give you a really lengthy report on what they did to break into your company, what you need to do to fix it, instead of just running in and hacking everything up and giving you the finger and leaving. Who’s had penetration tests done at their company? Who’s had application testers come in? Good. Every year I see that hand count go up and that’s very pleasing, because when I first started doing that, it was one hand in the back and it was terrifying.

Then in my last position, I was an industrial security researcher. That was super fun; the industrial security scene is terrifying. Anyone here working in the ICS space? Probably not, it’s very empty. It’s like going back into the past. The average staying time for a piece of hardware in industrial IT is 19 years, because they’re very much into stability over anything else: don’t fix it if it ain’t broke. That means you’re still dealing with 20-year-old software out there in the field, you’re dealing with 20-year-old hardware, you’re dealing with firmware that hasn’t had an update in 20 years. The security implications of that are terrifying. It’s a huge problem and we’re not focusing enough on it as a country, and that might just be to keep the fear and the raving masses at bay. But yes, industrial security is a terrifying place to be.

I moved out of there because this company, Kasada, had developed a really freaking cool product for defending web applications from automated attacks. I’ll touch base on that maybe towards the end of the talk if I have time. I am the director of field engineering for Kasada. I’m also a blade runner, meaning I spend a good chunk of my time just hunting bots and doing research on the bots that are out there on the Internet attacking all of your stuff. The product we have sucks up all of that general white noise that you get whenever you connect anything to the Internet. Because of that, we have a lot of really cool data to play with to see what’s going on at the Internet as a whole.

This talk here is for developers. I try to bring something that would be for this audience. Normally, I’m talking to hackers at hacker conferences. I’m a hacker, I know how to code, I know how to write scripts, but I’m not a dev. Everything I do is for speed and not for stability. I don’t do unit testing when I’m hammering out my scripts to get through a situation. Especially as a penetration tester, when I had a very limited time limit to break into things, I had to just write stuff and go. Speaking here makes me really nervous, because you’re the ultimate coders. You are the people who do everything right, and I’m the people who do everything wrong. I’m the one who wrote that goofy script you found in a Github repo that doesn’t even have comments in it and didn’t properly use classes and functions, and you’re, “What five-year-old wrote this?” That was me, and it was a five-minute script I needed to hack through and get a job done.

One of the reasons we as hackers write a lot of these scripts is to automate something we’re trying to do, because normally, if we were doing these attacks by hand, it would take us potentially hundreds of years. I’m talking about trying to brute force a login where I have huge word lists of usernames and passwords and I have to find the right combination. I’m not going to sit there and do that by hand; I’m going to hammer this out. I’m going to have Python do it, I’m going to have JavaScript do it. I don’t have time for this crap; my computer has time for this crap.

Over the years of doing penetration testing and specifically web application testing, I developed this bag of tricks for getting past a lot of the defenses that exist out there that are stopping us from doing these brute forcings, from doing our scraping. I guess everyone here probably has a need for what they think this talk is about, which is very interesting to me. I didn’t realize how big that need was until I started working for Kasada and really realized that there is a massive amount of what I’ve always called corporate espionage going on. That’s the, “I work for one company and I need to figure out what another company is doing, but that company won’t just tell me, so I’m going to have to go find out, sometimes based on publicly available information, such as aggregating pricing data for all my competitors, or figuring out what they’re selling all of their stuff at so I can make sure that I’m competitive in my pricing.”

There are a lot of defenses out there to try and stop this sort of stuff from happening, and a lot of them are really bad, and a lot of them are really old, and that’s why they’re really bad.

Web Application Firewalls

There are two main things out there. There are these basic WAFs. WAF means Web Application Firewall. The reason it’s called a firewall comes from the fact that, for the longest time, the majority of them worked based off of IP address-based rules. The most they could do, as a defensive mechanism, was blacklist your IP. Some of them, later on down the road, and we’re talking mid-2000s, started to be able to do some very basic behavioral analysis, and we’re talking about timing. If something’s coming in, if you get X number of requests in X number of seconds, well that’s clearly not a human and we’re going to shut that down. That’s about the max you can get out of a basic WAF. It’s super easy to bypass these, simply even just by rotating your IP address. The fact that they blacklist your IP as their only real mechanism for defense is, again, trivial to bypass. I put this up here twice because these don’t do much and there’s not much to say about getting around them, it’s all right there. Simply rotating your IPs is really all you need to get around these basic WAFs.

These basic WAFs are what’s out there in most cases as well. These are usually on-prem. These are usually appliances you have in your data center. You set them up so your network routes all of your incoming web traffic to the WAF first; it takes a look at what’s going on, and that determines whether or not it’s going to send those requests over to your origin server. In nearly all cases, these aren’t in-line like that. In nearly all cases, these corporations are putting these in a monitor mode where they’ll alert if something wacky is going on, but they still won’t block. The business never lets these block because their false positive rate is so freaking high that it’s near impossible to convince the business to allow you to actually block with these.

Your security folks will tell you, “Yes, the business never lets us block in-line. We always just have to alert and then it floods our inboxes with these alerts and then we don’t care because we have a trillion alerts a day and we’re overwhelmed and here we are.” So even tripping the alerts on these things often doesn’t get you caught before you’ve accomplished what you’re trying to accomplish because it takes so long for that security team to come through and actually take a look at what’s going on and do something about it.

SQLMap

One of my favorite ways of bypassing those old-style WAFs is a tool called SQLMap. You can take a look at that repo to see the exact mechanisms it uses. The other thing that these basic WAFs are doing is POST data inspection: they’re trying to identify specific types of attacks that come in through that HTTP POST data. They do that by looking for specific groupings of characters or groupings of words. That’s all they can do. SQLMap has a really cool obfuscation technique that it uses to still be able to send in injection commands while fooling the thing into not seeing the characters that it’s sending through. That’s what you can see going on here. It’s effectively putting a lot of these null characters in place while still doing UNION SQL calls. If you want to see some cool obfuscation, just check out the SQLMap repo.
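
To give a flavor of that kind of obfuscation, here is a toy re-implementation of one simple trick in the spirit of SQLMap’s tamper scripts; this is illustrative only, not SQLMap’s actual code:

    # Replace spaces with inline comments: signature rules scanning for
    # patterns like "UNION SELECT" no longer match, but the database still
    # treats /**/ as whitespace, so the injection remains valid SQL.
    def space2comment(payload: str) -> str:
        return payload.replace(" ", "/**/")

    print(space2comment("' UNION SELECT username, password FROM users--"))
    # '/**/UNION/**/SELECT/**/username,/**/password/**/FROM/**/users--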

Between that and just rotating your IPs, getting past these basic WAFs – and the fact that nobody responds to those alerts anyway – is actually really trivial, and you’re going to see that that triviality is a common theme in what we’re doing here. You’re going to be really surprised that most of the things I recommend are going to be simple things that you can add in one, two, three lines into any of your scripts that you’re writing. There’s not going to be any devastating, “Holy crap, this is a really complex attack, I can’t believe somebody thought of this” kind of stuff in this talk. It’s going to be surprisingly basic and you’re going to be really upset at the state of things in the defense universe right now. Forgive me if I’m also flying through this – there is a lot of data we have here.

There are these more modern WAFs coming out now that I call sophisticated WAFs. These often exist in the cloud as a kind of reverse proxy. They operate similar to how a CDN operates: you’ll have your DNS send all the traffic up to them first. They’ll figure out what requests are good, what requests are bad. They’ll send the good requests to your origin, and they’ll do something about the bad requests, and that something varies based on whichever sophisticated WAF you’re using. These often partially rely on JavaScript execution. This is usually to fingerprint the client environment.

What we’re doing here is we’re actually taking a look at the connecting client and not just the POST data. We’re not just looking at HTTP headers, we’re not just looking at POST data, we’re not just looking at IP address information. We’re actually seeing what’s going on in that client environment. Is this a real browser that’s actually trying to connect to my server? Or is it somebody pretending to be a real browser? Unfortunately, in most cases, that fingerprinting is still not very good. What I’m going to do with the majority of this talk is tell you how to get past this more sophisticated stuff, because the first stuff’s really easy.

Bare Minimums

This is the bare minimum stuff that you’re going to do. The next few things I’m going to tell you in this section are going to be like, “Please at least do this.” This is the, “At least you showed up to work today.” At least you tried, you’re going through the motions. Everything you write should, at the very least, be doing these few things. Rotate your IP – you should be rotating your IP all the time. When to rotate your IP is going to vary based on what you’re attacking. There is an art to this; you may have to rotate it with every single GET request. If you’re bypassing something that’s really good at doing behavioral analysis, you might get flagged after a single GET request. And if you’re trying to scrape data off a webpage, getting stopped after one GET is super irritating. But if you can rotate your IP after each GET, that’s devastating. That’s going to bypass so much stuff out there, and all you’re doing is that one thing, and it’s something that’s super easily scriptable.

Obtaining IP addresses for this is really easy. There is no end to free proxy sites out there. Just Google free proxies; you can look up VPNs as well, as they sometimes call themselves. The paid services of course are going to be much more reliable, whereas the free ones often get blacklisted by IP reputation services relatively quickly, but the paid services are pretty cheap. We’re talking 15 bucks a month for thousands of IPs that you can use. There are really cool services out there that will let you lease residential IPs. This is super devastating if you’re dealing with the old-style WAFs, which, like I said, is most WAFs out there, in that no business is going to allow their security team to block residential ASNs. Whose company is going to let you block all of Comcast? Especially if you’re an eCommerce site. If you’re a company where a significant portion of your business comes from access to your website, or you simply run customer portals – if you’re, say, a health insurance website, anything like that where you absolutely need individual people with residential IPs to access your website – these will never get blocked. There are a lot of really cool services that’ll rent these to you relatively cheap for what we’re trying to do here.
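
To show how little code IP rotation takes, here is a minimal sketch using the Python requests library; the proxy addresses and target URL are placeholders, and a real script would also handle dead proxies and retries:

    import itertools
    import requests

    # Placeholder pool; in practice this comes from a free or paid proxy service.
    PROXIES = [
        "http://203.0.113.10:8080",
        "http://203.0.113.11:8080",
        "http://203.0.113.12:8080",
    ]
    proxy_pool = itertools.cycle(PROXIES)

    def fetch(url):
        # A different exit IP for every single GET request.
        proxy = next(proxy_pool)
        return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

    print(fetch("https://example.com/pricing").status_code)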

Where they get these IPs is super sketchy, because normally, where do you get a residential IP? You call up Comcast or AT&T, whoever, and you get your one IP address. If you need a second IP address, what’s that cost? What’s that cost in New York? Probably 10 bucks? 10 bucks an IP, if you can even get another static IP from them. What these companies do is they run these side hustles. Luminati is my personal favorite. Who knows Hola VPN? Maybe you use it to watch TV in other countries – that’s the most common use it pushes; it’s a free VPN service. Have you read the terms of service for Hola VPN? No, of course not, nobody does that. If you skim through there – it’s not that long of a terms of service – you’ll see that Hola VPN says if you’re using their free service, you are agreeing to also allow them to use your home IP address as an exit node for other services.

This other company they run is called Luminati. If you check out luminati.com, they lease residential IP addresses. By using this free software, this Hola VPN, you’re an exit node for botnets. You are literally hosting malware from your house traceable to your IP, mind you. But, as people who need some residential IPs, that’s a great place to go get them.

Monkey Socks is another one. Monkey Socks leases mobile IPs, cellular network IP addresses. It gets those from an SDK that it offers to anyone writing basically any mobile app capable of establishing a network connection. Same thing: it ties in there and says, “Use our SDK. Throw our little blurb in your terms of service that says, by using our free app that we wrote for Android, you also agree to let us use your network connections for whatever the hell we see fit.” That’s terrifying. For any free app that you’re using on your phone, take a look at the terms of service. Especially if it’s not ad-supported, they’re getting money from somewhere – this is where they’re getting that money from. But you, as the attacker, can go ahead and use an entire ASN full of mobile IPs, which nobody’s going to block.

Aside from rotating your IP addresses, we’re going to start getting into what the more complicated, more sophisticated WAFs out there are looking for. Again, they can really only rely on the data that you send them. This is the medium-sophistication stuff. Make sure when you’re writing HTTP scripts that you’re sending the usual HTTP headers that any browser always sends. You can take a look; just go into Chrome or Firefox or whatever, go into the inspect panel, and look at the normal request headers that get sent in there. Specifically, I call out the Accept: */* header – that bypasses so many bot detections, it’s hilarious. Most scripts, most binaries that do HTTP – your curl, your wget – don’t send that header, so these rules that exist on the defensive side just go, “Does it have that ‘accept anything’?” If that Accept header is missing, it goes, “Nope, that’s a bot. Shut it down.” You can bypass that rule just by adding your Accept: */* and you get right in. Sometimes that’s the only bot detection going on in these mid-grade WAFs.
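
A minimal sketch of sending browser-like headers with Python requests; the user-agent string was copied from a Chrome build of the era and should be treated as a placeholder for whatever current browsers send:

    import requests

    # Mimic the headers a real browser always sends; the Accept header alone
    # defeats many naive bot checks.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/76.0.3809.100 Safari/537.36",
        "Accept": "*/*",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",  # Do Not Track: weights the "real browser" score
    }

    resp = requests.get("https://example.com/", headers=headers, timeout=10)
    print(resp.status_code)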

Do Not Track is another one that they’ll use. It’s more of a weighted thing, since it’s not always sent in the first place, but you can use it to weight things in your favor, because they’re going to go, “All right, we also saw the Do Not Track header, so probably a real browser.” Sometimes, depending on the type of communications you’re doing and what you’re up to, there’ll be X headers. You guys know what the X headers are – X-Forwarded-For is one of the most common ones. These are optional headers where you can pretty much invent any kind of header you want; by spec, you just start it with an X and you add whatever data. Look at the X headers that are being sent when you’re doing normal communications on that website and figure out if this is something that you should be adding to your script or not. Watch out for X-Forwarded-For if you’re using free proxies, because some of those free proxies will add an X-Forwarded-For that will put your source IP, the actual IP you’re coming from, in there, and then that just completely blows your cover. Most of the paid proxies don’t do that. You want to look for something called a transparent proxy or an invisible proxy. Most of those are going to not add any header data and will also forward any header data that you add. That’s really critical for this.

Your user agents – definitely send a valid user agent, something from a modern browser, and you can just go look at what the current Chrome one is. Look at what the current Firefox one is. When you’re copying and pasting your user agents, don’t include the quotes. Everyone includes the quotes and that’s the easiest way to detect if somebody’s up to something sketchy, because their user agent still has those single quotes in there, and you go, “No, no browser actually sends the single quotes,” but they copied it and pasted it right from Chrome because when you view it in the inspect panel, it’s got those single quotes and they just pasted it and it’s a dead giveaway that this is not somebody using a real browser. Watch the quotes.

Sometimes you want to use session cookies. This is something you’re going to have to experiment with. Really, this is all stuff you’re going to have to experiment with: try it on, try it off. Generally, the top two are something you’re always going to want in there. The other three – the user agent, X headers, and session cookies – you want to play with. Session cookies often allow you free access to everything. They often eliminate throttling, especially in an authenticated scenario. Sometimes you’re going to go in by hand, authenticate to the website, then grab your session cookie and add that to your script; then the remote server is going to have no rules throttling authenticated people, because no way they’re fake, they’re authenticated.

Sometimes they’re even just non-authenticated session cookies, but because, by default, things like Python requests, or curl, or wget don’t even deal with those cookies, the fact that they aren’t there is a great tip-off, and these mid-grade systems will use that to just block you right out of the gate. Again, real simple stuff. We’re not hacking anything; all we’re doing at this point is just abiding by the HTTP protocol. That’s going to be the overarching theme of what we’re doing here. Just make sure that you’re mimicking a real browser as much as you possibly can by hand.

There’s a really cool tool called Postman. The code option in Postman, in the upper right, is the tiniest thing on the planet. Whenever you send a request out, there are a bunch of links just above the window that has all your data in it, and one of them says Code. Click that Code link. It gives you a drop-down that lets you pick the language that you’re scripting in – I just picked Node.js here – and it will generate the request that you just sent in whatever language you’re scripting in. You can copy and paste this, and it includes all your really cool stuff. There are your cookies right down there, and there’s your Accept: */*. This is just right out of a request I dumped in – just go to google.com/maps. It’s going to take a lot of that default stuff that the remote server is expecting and throw it right into the script for you, so you don’t have to spend a ton of time doing this by hand. Definitely check out Postman, even though you guys already are.

Rotate your user agents. This gets past so many things. A combination of changing your UA and your IP address is one of the most devastating things you can do. Rotating your user agent is often a great way of getting more usage out of a single IP address. A lot of requests may come in from a single IP address for an organization, such as a university or workplace. You could have 10,000 people all using the same public IP because they all come out that same exit. Rotating these user agents really gives the look of it actually being an organization with a ton of different users. You can find lists of every user agent in existence anywhere on the Internet. If you look in my Github, in the scripts I wrote from hacking Venmo last year, I have a flat file in the Venmo script. It’s like 4,000 user agents, and you can just grab that and use it. Again, super simple: you’re just taking a flat file with the user agents in it and telling your script, “Go get the next one,” or, “Go get a random one,” and rotating through this. It looks like it’s a bunch of people at some company, or some university campus, or something, at a bunch of different computers because, again, the defenses for this stuff aren’t that complex in nearly all cases.
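
A sketch of that flat-file rotation; the file name and its one-user-agent-per-line format are assumptions:

    import random
    import requests

    # Load a flat file of user-agent strings, one per line.
    with open("user_agents.txt") as f:
        USER_AGENTS = [line.strip() for line in f if line.strip()]

    def fetch(url):
        # A random UA per request looks like many different users
        # sharing one organizational exit IP.
        headers = {"User-Agent": random.choice(USER_AGENTS), "Accept": "*/*"}
        return requests.get(url, headers=headers, timeout=10)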

You can also do this if you’re trying to fuzz a whitelist. A lot of WAFs, and the more sophisticated WAFs, they’ll have whitelists that people will make the poor decision of just basing off of a user agent. They’ll say, “All right, if we see this specific user agent come in, that’s fine. Just let it through.” You can go through and fuzz every single user agent until you get one that actually lets you in. Definitely make sure when you’re writing these scripts and you’re trying this stuff out, you’re logging what’s going on, you’re seeing what’s working and you’re seeing what’s not, because there’s a lot more going on in the background than just a binary, “Let this person in”, “Don’t let this person in.” You’re going to be fuzzing cookies. You’re going to be fuzzing user agents, things like that. Again, cookies are the same thing.

Sometimes you have to provide a session cookie to even get where you’re trying to go. Sometimes you can eliminate the session cookie in order to eliminate the throttling. You can write your script so that every time it gets a cookie, it doesn’t provide that cookie on the next GET request, because the remote server is watching how many times that session has made a request. They’ll do that to try to get around you rotating your IP, because if you’re rotating IPs and still providing the same session cookie, you’re literally providing the same ID to them over and over and saying, “It’s the same person still.” Make sure you’re not doing that. Sometimes you actually do have to provide that cookie, so it’s an art; it’s something you’re going to want to try both ways and see what works for you.

Watch out for sneaky WAF cookies. The more sophisticated WAFs will often drop identifier cookies that you have to provide. These are the ones that will run a bit of Javascript on your end, do that fingerprinting, post that telemetry up to their server and then they’ll respond with a cookie that definitely IDs you as you. Sometimes, you have to provide that every single time or it’s going to block you outright, or you’re going to get caught in this fingerprinting circle and not actually get anywhere. Or, sometimes you can get fingerprinted in a real browser. Get that fingerprinting cookie out of the way manually and then just copy and paste that fingerprinted authorization cookie into your script as a cookie replay attack. That’s a super common one.

If you’re having trouble getting your script to convince the remote server to generate the necessary cookies you need, do it in a real browser, and then just copy and paste that cookie and see what happens. A lot of products are susceptible to that. That’s an old hack, it’s called Cookie Replay, it works against a ridiculous number of things. Go ahead and try that. In fact, everything you’re doing here, you should be doing manually in a real browser first to understand how the application you’re attacking works, and then you’re going to write a script to do whatever you need to do. Don’t be afraid to copy and paste as much crap as possible.
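
A sketch of the cookie-replay move in Python requests; the cookie names and values are placeholders for whatever you lift out of the browser’s dev tools after passing the fingerprinting check manually:

    import requests

    # Cookies captured from a real, already-fingerprinted browser session.
    REPLAYED_COOKIES = {
        "waf_clearance": "d41d8cd98f00b204e9800998ecf8427e",  # placeholder
        "session_id": "abc123",                               # placeholder
    }

    session = requests.Session()
    session.cookies.update(REPLAYED_COOKIES)

    # Every request now carries the browser-earned cookie, so the WAF
    # treats the script as the client it already fingerprinted.
    resp = session.get("https://example.com/protected", timeout=10)
    print(resp.status_code)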

Let’s talk about the super serious stuff now. This is when we’re dealing with the modern sophisticated WAFs. This is where we’re dealing with the really expensive bot defense stuff – the stuff that’s able to effectively force your browser to fingerprint itself, and does a really good job at making sure your browser is in fact a real browser, and not just a super snarky Python script that you spent three months writing to really mimic a real browser.

Edge Enumeration

Occam’s razor with this one. Try to bypass that WAF entirely, try to find another way into the website that you’re attacking. I wrote a script – it’s on my Github and there’s a link at the end of this talk to where my Github is – called Scan Canon. It’s like a hundred-line bash script, and all it does is enumerate the edge of whatever ASN you give it. It finds all the servers that are running out there, it’ll find a bunch of other stuff, but for this purpose it finds all the web servers that are running. Hopefully, that’s going to find some of the edge servers, and then hopefully those edge servers have crappy firewall rules around them that let you connect directly to them instead of going through this cloud WAF they have set up. This sounds dumb, I see this all the time. You’re literally just bypassing the WAF because normally you’ll have this scenario where you punch in the web address, the domain you’re trying to get to, and the DNS says, “Go to this IP address,” and that IP address is the cloud WAF. Then you have to try and figure out how to bypass this cloud WAF before it’ll let you get to the actual origin server.

What we’re doing here is just finding the IP address of that origin server and just connecting directly to that IP. Now we’re not dealing with tricking this WAF; we’re literally bypassing this WAF. This is really common because people are really bad at writing firewall rules. I don’t know why; firewalls are not complicated, but it’s in the top three number of things I exploited as a penetration tester, was just these bad firewall rules.

ARIN is the American Registry for Internet Numbers. This is the full public list of who owns what IP addresses for American companies – InterNIC takes care of other international ones – and all of these registries are public. You can go in and you can say, “What IP space does QCon own?” You can get that full ASN, that full list. Then you can go to town and say, “Let’s see what’s out there. Let’s see who’s running web servers on each of these IP addresses.” This can take you a while, which is why I wrote a script to do it.
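
Scan Canon itself is a bash script, but the core idea fits in a few lines of Python; the netblock below is a placeholder for whatever ranges the ARIN lookup returns:

    import socket
    from ipaddress import ip_network

    NETBLOCK = "198.51.100.0/28"  # placeholder; use the target's ASN ranges

    def port_open(ip, port, timeout=1.0):
        try:
            with socket.create_connection((str(ip), port), timeout=timeout):
                return True
        except OSError:
            return False

    # Find hosts serving HTTP(S) directly - candidate origin servers that
    # may be reachable without going through the cloud WAF.
    for ip in ip_network(NETBLOCK).hosts():
        for port in (80, 443):
            if port_open(ip, port):
                print(f"{ip}:{port} open - check whether it's the origin")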

Sometimes you’ll come across the website you’re looking for, or the web application you’re looking for. Sometimes you’ll come across a dev instance that has different firewall rules, but in the end still gets you access to the data and the backend that you were looking for. That’s really common. They’ll have prod to go through the complicated cloud WAF, but dev doesn’t have to because it’s just dev, but they screwed up on the backend and let dev still access whatever you’re trying to get. Or the dev site will use live data. We as devs know that this is a stupid idea, it happens all the time. It’s common, don’t be afraid to go look for it. You can save yourself a lot of time.

If you’re forced to go through this cloud WAF, a lot of times they’re using a CDN in front of it, and CDNs have path rules that will pass certain requests for certain paths through the WAF, while other requests will bypass it. Start fuzzing the paths. That can be hard; you’re literally just punching random words into the URL path. See if you can find a URL path that gets you to the place you’re trying to get to and doesn’t force you to go through that WAF. This is some more advanced stuff, a lot of it last-resort things, but that’s where we’re at at this point in the game. A lot of times you’ll find that accessing an application via a different URL path or different means has different rules associated with it, because somebody forgot to add them into the CDN. The CDN is 6,000 rules as it is, and nobody knows how it works, and now you have found a way in that they weren’t aware of.
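
A bare-bones sketch of that path fuzzing; the wordlist is a placeholder, and per the earlier advice a real run would log everything and rotate IPs and user agents as it goes:

    import requests

    TARGET = "https://example.com"  # placeholder
    PATHS = ["api", "internal", "origin", "dev", "v1/pricing", "admin"]

    for path in PATHS:
        url = f"{TARGET}/{path}"
        resp = requests.get(url, timeout=10, allow_redirects=False)
        # Log everything: a shift in the 200/301/403 pattern can reveal a
        # path the CDN routes around the WAF.
        print(resp.status_code, url)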

Start smashing their DNS, find other domains that that company owns. Do you guys know what DNS zone transfers are? That allows you to dump every domain name that’s registered within their DNS server. Look through those domain names. Does anything have the word “origin” in it? Here’s a really common one. Look up if a company has www-origin.companydomain.com or whatever the TLD is – that’s a freebie. There’s someone very popular out there using that as a way of hiding the origin servers. Look for the word “origin” or something that looks like an origin server just within all the DNS names that their DNS server has and just start hitting that. It might work, you’d be surprised. I told you this wasn’t going to be any devastating, insane, complicated hacks, this is simple stuff. You just have to think outside the box and find other ways in.
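
A sketch of the zone-transfer-and-grep-for-“origin” idea using the dnspython library; most name servers refuse AXFR these days, so expect failures, and the domain is a placeholder:

    import dns.query
    import dns.resolver
    import dns.zone

    DOMAIN = "example.com"  # placeholder target

    # Find the authoritative name servers, then attempt a zone transfer
    # (AXFR) against each one.
    for ns in dns.resolver.resolve(DOMAIN, "NS"):
        ns_host = str(ns.target).rstrip(".")
        try:
            zone = dns.zone.from_xfr(dns.query.xfr(ns_host, DOMAIN))
        except Exception:
            continue  # transfer refused - typical for hardened servers
        for name in zone.nodes:
            label = str(name)
            # Names containing "origin" often point straight at origin servers.
            marker = "  <-- candidate origin" if "origin" in label else ""
            print(f"{label}.{DOMAIN}{marker}")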

If you’re able to get in, but you’re being throttled and you can’t figure out how to defeat this throttling that’s happening to you, find all the edge nodes, find all the IPs that are hosting whatever it is you’re attacking and attack only one. Attack only one until it stops you, then try attacking the next one. There are certain WAFs out there that have a really long window where blocked IP info gets synced up. Some of them are as long as 15 minutes, and so you can attack one and then you’ve got 15 minutes before that one tells all the other ones to start blocking you. Then you can just wreck the next one, and then wreck the next one, and then rotate your IP and start all over again.

Look for an API that hosts the data that you’re looking for. Oftentimes this is just the domain name followed by /api. Go look – just type that in your browser and see if it exists. APIs are almost always less protected than the actual websites themselves. That’s often because the systems that need to interact with the API can’t interact with it in the same way that a browser provides; you’re not able to fingerprint the connecting system because maybe it’s a mobile application, so you can’t protect the API in the same manner. Look for that API, because that API may be providing the pricing data you’re looking for, may be providing whatever you’re trying to aggregate. Maybe that API will let you authenticate: if you’re trying to brute force some login credentials, you could possibly brute force against that API, and it will have completely different rules associated with it. Don’t assume that everything on their website has the same rules. Don’t assume that every URL path has the same rules associated with it. Every single page and every single means of getting to that page can have its own rule applied. Don’t assume anything, always try everything.

Sophisticated WAFS

Look for UUIDs or really complicated DNS names. Look for something that’s super long and you go, “What is this?” – especially in a scenario where it stands out, where you go, “These three names are really long hashes, and then everything else is very obviously named.” Those are probably obfuscated origin servers. They’re definitely obfuscated, and they’re obfuscated for some reason. See what they are, check those IPs. Use a tool like Nmap to see what ports are hosting services, and that’ll help you figure out what that server is. Just connect to it on 80 and 443. It might be what you’re looking for. There’s a good chance that’s an origin server, and that might be what your target is for this particular situation.

We talked about the WAF cookies. These more complicated WAFs are going to throw down JavaScript snippets at you. Sometimes these snippets are in-line in the page template itself and every page has the same one. Sometimes, you’ll send a GET request and your response will be just a blank page with a JavaScript fingerprinting snippet in it, and then you have to process that JavaScript, send back whatever response it’s looking for, and then you’ll get a valid fingerprinting cookie that lets you continue. Take a look, see if those are happening. Sometimes it’s as simple as just not running that JavaScript. Block that specific snippet and it’ll fail open. There’s definitely a pair of products that just simply fail open if you don’t run their JavaScript. That seems ridiculous to me, but that’s a thing that happens. Or, run that JavaScript in a regular browser, take the resulting cookie, dump it in your script – Cookie Replay. Say, “Thanks for the cookie. I’m going to put this here now.” Everything works fine a lot of the time.

Automate a Real Browser

Failing all of that, this is where it gets devastating. Just automate an actual real browser. Like I’ve been saying, what you should always be doing when you’re trying to do anything that interacts with a website is you go interact with the website yourself manually. You see how everything works, you make some notes and then you write a script that’s going to accurately recreate what you’re trying to do to make this website work for the computer. Why don’t you automate that? That’s where things get really fun. That’s what bypasses so much stuff. It is, however, a bit more complex and there is a bit of a learning curve to it.

ZombieJS and PhantomJS are more or less deprecated, and neither of those is being maintained by the original creators anymore. I think Phantom is at least still community-maintained; I believe Zombie is totally dead. Anyone heard of Arachni? It’s a web application vulnerability scanner. That one, I believe, runs PhantomJS in the background and uses that to increase its ability to access more pages within the browser by actively running the JavaScript. Those used to be the way to go for a long time, because they were tools that would run JavaScript and pretend to be browsers by the mere fact of, “I run JavaScript, therefore I must be a browser.” When we’re dealing with these modern WAFs now, they’re actually doing a lot more hardcore fingerprinting than just, “Can you run JavaScript or not?” Well, some of them are. There are still a ton that are just seeing if you can run JavaScript or not, and that’s a terrible way of fingerprinting anything to determine if it’s a real human in a real browser.

These days, the things you want to be using are Selenium – Selenium is super popular in QA testing – and Puppeteer with Headless Chrome. That’s my all-time favorite; I absolutely love Puppeteer with Headless Chrome. I’m not going to get into a super lengthy how-to on how to use Puppeteer here. It’s got a learning curve to it, but honestly, you can watch an hour-long YouTube video that you can Google up and get the gist of how it operates. The deal is that Selenium and Puppeteer are running Headless Chrome, and Headless Chrome is Chrome, more or less. Headless Chrome claims that it’s a clone of Chrome, that it does all the same things Chrome does and looks like Chrome. That’s not exactly true; there are a lot of things going on under the hood of Headless Chrome that kind of give it away.

If you’re dealing with something super complex that’s really digging deep into that browser, it’s going to identify the stock Puppeteer, Selenium, Headless Chrome setup right out of the gate. But it’s kind of rare that you’re going to come up against that. Definitely try even just the stock config and see where you get. Automating it is super easy; it’s very similar to JavaScript. That’s going to get you a really long way. It looks like human activity because you’re using a real browser – you’re using Chrome – and you can set your timing in there. You can set your throttling just like with any other method of scripting.

It executes JavaScript to the fullest extent, because it uses the same JavaScript engine that Chrome uses. It properly leverages cookies; it does that whole exchange the way it should. It stores cookies, it does sessions properly. You can run multiple instances of it per IP, because it’s just a browser, like you can open multiple browsers on your computer.
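
A minimal sketch of driving Headless Chrome from Python with Selenium (the talk names both Selenium and Puppeteer; chromedriver being on the PATH and the form-field names here are assumptions):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--window-size=1920,1080")  # a believable desktop size

    driver = webdriver.Chrome(options=options)
    try:
        # Full JavaScript execution, real cookie handling, real sessions -
        # because this is Chrome's actual engine.
        driver.get("https://example.com/login")
        driver.find_element(By.NAME, "username").send_keys("testuser")
        driver.find_element(By.NAME, "password").send_keys("hunter2")
        driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
        print(driver.title)
    finally:
        driver.quit()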

Realistic WebDriver

If you’re going to be doing this, go into your WebDriver settings for Selenium or for Headless Chrome or Puppeteer. Change at least these aspects of it to make them look more realistic. Depending on what you’re using, sometimes these will be inherently discerned by the WebDriver – and the WebDriver is this automation tool, literally someone driving your web browser. If you’re running this on your AWS instance that has 12 cores in it, your WebDriver is going to report that you’re running a 12-core CPU. The average user of a website probably isn’t running 12 cores on their desktop computer, so you want to fix that.

These are the things that a lot of those complex WAFs out there look for, even in automated browsers like these. Change these things to be more reasonable for what an average user would have, like screen resolution: a lot of these default to 320 by 240 for the screen resolution, and some of them default to 1024 by 768. Nobody’s running that these days; that’s an insanely tiny screen resolution, change that. Go through these, and once you see them, it’s obvious what they are. Set these to normal, what I call normal human values. Set these to what your mom would use. That’s going to bypass a ton of the most sophisticated WAFs that are out there today.
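
A sketch of that “normal human values” hardening with Selenium and Chrome; the CDP call is a real Selenium facility, but exactly which properties a given WAF inspects is guesswork, so the overridden values are illustrative:

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--window-size=1366,768")  # a common laptop resolution
    options.add_argument(
        "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
    )

    driver = webdriver.Chrome(options=options)

    # Override properties the WebDriver would otherwise report truthfully
    # (e.g. a 12-core cloud VM); this runs before any page script.
    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
            Object.defineProperty(navigator, 'hardwareConcurrency', {get: () => 4});
            Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
            Object.defineProperty(navigator, 'platform', {get: () => 'Win32'});
        """
    })

    driver.get("https://example.com/")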

Now that you at least have that in your bag of tricks, you’re going to get past a lot – like 90% of the things that have been stopping you to date. Again, I’m Johnny Xmas, I’m super active on Twitter. If you want to ping me and ask any questions, there’s my company, Kasada. We at Kasada do sell a product that is not susceptible to literally anything I’ve discussed in this talk, because that’d be weird if it was. If you’re at a company that’s dealing with people launching these more sophisticated attacks, go ahead and give me a call; we can talk about that as well.

See more presentations with transcripts



InfoQ: Highlights from JAFAC 2019 – Day 2 – Leadership, Cultural Readiness, Self Care and Growth Mindset

MMS Founder
MMS Shane Hastie

Article originally posted on InfoQ. Visit InfoQ

This is a summary of the highlights from day 2 of JAFAC (Just Another F&#k!ng Agile Conference) 2019, held in Wellington, New Zealand on 5 & 6 September 2019. Hosted by Nomad8, the conference aims to bring new and different voices to the fore, avoiding the “usual suspects” and highlighting ways that agile ideas are being applied in a wide variety of contexts. Important themes that emerged on day two were leadership, cultural readiness for change, the importance of self-care and the need for a growth mindset at all levels of an organisation.

There is no central theme for the event, rather the organisers invite “people we would like to hear from” and give them carte-blanche to present on whatever topic they are passionate about. The conference was structured with invited talks in the mornings and Open Space talks in the afternoons on both days.

InfoQ’s Coverage of day one can be found here.

Charlotte Walshe, CEO of Jade Software, opened the second day sharing her experiences as a leader in agile environments. She humorously used the example of the Borg from Star Trek as an exemplar of agile teams and teamwork, saying that they are holacratic, self-organising, obsessive learners with self-ownership and a fully distributed organisation.

She discussed leading the transition to “all in” agile adoption at her previous organisation, and the changes they needed to make as a leadership team to enable the transformation. She emphasized the need to move away from command and control towards a coaching style of leadership, which was very hard for many of the executive team as they struggled to find their own purpose in the new organisational style. They went so far as to replace their Executive team with a fully cross-functional senior leadership group working in two week sprints using the agile principles to guide their way of working, doing what is right, not blindly following an agile dogma.

An important lesson she took from that experience is the importance of cultural readiness for successful adoption of new ways of working. In order to lead and guide a transition it is important to have a clear understanding of what values-based leadership really entails and what the impact will be on the organisation. For instance: to enable high performance, is there readiness to have some people “on the bench”, not working at full capacity but ready to step in where needed (as is done with sports teams)?

Sharesies co-founder Sonya Williams told the story of how they founded a financial services company with a clear goal “to create the most financially empowered generation”. They identified the challenges that exist to investing and designed the platform to cater for the needs of the Millennial generation who often have little spare cash for investment and are not well educated about what is needed to invest.

They took a very design centered approach to building the platform, starting with deep ethnographic research to identify the barriers that prevented people from investing. Things they identified were:

  • Starting to invest needs too much money
  • Available solutions are not suited to the millennial demographic
  • Barriers to entry are high due to jargon and poor communication

They built a platform focused on overcoming these impediments and now have over 60,000 people investing with them.

Key messages she shared from their experience are:

  • There are no shortcuts to understanding your customer
  • Ensure that everybody in the organisation is very aware of and aligned with the purpose and goals – what are we doing and why are we doing it
  • Focus on solving problems over being perfect – have a “minimum loveable product” and grow that incrementally

James Magill, Executive General Manager, Retail Markets at Genesis Energy, told the story of how they transformed a “trusted but old and staid” energy retailer from an environment of releasing one new product in two years to a dynamic, customer-focused provider of digital services to electricity consumers. Important aspects of the journey were moving to cross-functional “squads” with all the disciplines needed to deliver customer-facing products.

Sam Laing changed the audience’s focus from looking at how other organisations were doing things to challenging their own personal vision and self-care. She told her own story of building a successful consulting business, moving countries, changing jobs and getting burned out, and what she learned from working through those experiences.

She introduced the idea of crafting a personal vision and using the concept of “Minimum Viable Feedback” to design experiments to validate that vision and adapt it based on what you learn from the experiments. For example, she and her wife have a goal of living on a farm, but had never done so. To run the experiment they rented a lifestyle property outside of Auckland and lived there for 6 months to learn what the farming lifestyle really meant. She explored the idea of “limiting beliefs” – the things we tell ourselves and that others tell us that become barriers to our goals.

She left the audience with two key pointers:

  1. When you look in the mirror, who do you see? Be kind, be curious and be grateful to that person – they’ve brought you to where you are today and will be with you for the rest of the journey
  2. Shift your power, shift your energy towards who you really want to be, run the experiments and make the changes you need

Sandy Davey, founder of ProductSpace and board chair of CHOICE, a consumer advocacy organisation in Australia, told the story of how the board of CHOICE adopted an agile approach to their governance of the organisation. She started by explaining the history of the organisation, a 60-year-old advocacy group which conducts product research and publishes a (now digital) magazine. This single focus was recognised as a significant risk (she referenced Jason Fox and the inevitable Kraken of doom which feeds upon the sweet nectar of your impending irrelevance) and the board needed to challenge the organisation to adapt to modern realities and change the business model.

She explained how they led the changes from the board level, prepared to take on risk and trade off short term gains for long term sustainability. The directors clearly exhibited a growth mindset and the ability to lead the changes. However “the legacy machine is designed to stamp out anything not like itself” and the resistance was substantial.

They brought in a number of changes which supported and enabled genuine transformation of the ways of working across the whole organisation, and she identified specific tools that have helped with the changes.

She concluded by talking about the benefits the organisation has achieved from the new ways of working:

  • Better discussions leading to better decisions
  • Maximizing the work not done, enabling them to spend more time on what matters
  • Speed which contributes to faster learning cycles
  • Focus on outcomes resulting in being more responsive to user needs
  • More transparency which allows people to be more open about the real challenges facing them

The afternoon of both conference days was run as an Open Space event. The summary posters from those sessions are available here.



Article: Q&A on the Book Level Up Agile With Toyota Kata

MMS Founder
MMS Ben Linders Jesper Boeg

Article originally posted on InfoQ. Visit InfoQ

In the book Level Up Agile With Toyota Kata, Jesper Boeg explores how to apply Toyota Kata to drive improvement in organizations that are using or striving to use agile ways of working. He shares his experience from combining agile with Toyota Kata to enable organizations to keep improving towards their goals.

By Ben Linders, Jesper Boeg



Presentation: The Trouble with Memory

MMS Founder
MMS Kirk Pepperdine

Article originally posted on InfoQ. Visit InfoQ

Transcript

Pepperdine: I’m going to tunnel through a very specific problem: memory. Did you guys get to the keynote this morning? I thought that keynote rocked. It is probably one of the better wake-up calls I’ve seen at a keynote in quite some time. Everything this guy was saying was spot on. There are a few areas I wished he had gone deeper into, but I’m not going to complain about that. It was really awesome. If you were at Brian Goetz’s talk, you could see that a portion of his talk was also really about this memory issue. The memory issue is actually very broad. I’m just going to narrow in on one part of it, but that’s generally what we’re going to talk about here.

That’s my marketing slide that we have to put in here. We’ve co-founded jClarity, where we’re producing what we call a diagnostic engine. We’re just trying to bring predictability into the performance diagnostic process. I co-founded JCrete which we call the “hottest unconference on the planet.” It’s every July in Crete, and you can imagine the weather there is nice and hot. Yes, we do have sessions on the beach in case you’re interested.

What is Your Performance Trouble Spot?

The question is, what is your performance trouble spot? What do you guys think is your primary source of performance problems? Just think about that yourself. This is a survey that a friend of ours did a couple of years ago; it’s probably still valid. What are we looking at here? We’re looking at the answers to that question: what is your performance trouble spot? If you look at it, you’re probably going to recognize things. Slow database queries – everybody who thinks slow database queries are the problem? I guess you have to ask first, who’s using databases still? They’re still useful, apparently.

You see there are three big bubbles really about inefficient application code. The point is, there are all these really big bubbles about things; bubbles around all these different performance problems. We get data from tons of systems across all industries from all corners of the planet. What do we find as the biggest problem? It’s this thing down here, excessive memory churn. We actually see that 70% of Java applications are bottlenecked on memory churn.

The question is, if my application is bottlenecked on memory churn, why is it that it’s not showing up in this chart? This is really a statement on the state of observability of our systems today. The observability in our systems is highly biased. We’ve always had problems with database queries, so what do we do with this problem? We put a lot of instrumentation around it, and we log. We use these wonderful things called logs, and then the monitors we’re using, “We got a slow database query,” so we need to go back in there and retune that bit.

Then maybe something else happens, and you get a regression, and after however long it takes you to go through the whole process, you end up going, “That’s the problem. Let’s put logging around that.” Then you find the next problem, “Ok. Let’s put logging around that.” Then you end up with this highly biased view of your system which is based on what your historical performance problems have been, but as you can notice, the observability doesn’t actually give you that much insight into what’s happening in the here and now. That’s a real problem.

The other problem is that if you apply an instrument to a system, it’s always going to tell you something, and you’re probably going to act on whatever it is that it’s telling you. The question is, is that really the problem? The other problem in this case, as you can probably notice, is with memory churn. Does anyone have an idea of how you might measure memory churn in your application?

Participant 1: Page fault delta.

Participant 2: Profiling in a production system.

Participant 3: On Sirius.

Pepperdine: Profiling for what, though? You’re going to profile for problems that you think you have. You’re not generally profiling for problems you don’t believe you have. That’s another issue.

Participant 4: Profiling for target performance.

Pepperdine: Yes.

Participant 5: What you think the system should be getting.

Pepperdine: We’re getting a lot of answers here. I’ll give you one of our key performance indicators in this particular area: it’s simply garbage collection logs. That’s it – just doing a simple analysis on the garbage collection logs will give you an inference. It’s not an exact figure, but it’ll give you some idea of what’s happening in terms of memory churn inside your application. It’s as simple as that; it’s an analysis people don’t do, so they absolutely don’t see it.
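For readers who want to try this at home: GC logging can be switched on with -Xlog:gc* on JDK 9+ (or -XX:+PrintGCDetails -Xloggc:gc.log on JDK 8), and an allocation rate can be inferred from how quickly Eden refills between young collections. As a rough in-process alternative, the sketch below uses HotSpot’s com.sun.management.ThreadMXBean extension; the class name and the placeholder workload are ours, not from the talk.

    import java.lang.management.ManagementFactory;

    public class AllocationRateSketch {
        public static void main(String[] args) {
            // HotSpot-specific cast; this extension is not available on all JVMs.
            com.sun.management.ThreadMXBean tmx =
                    (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
            long id = Thread.currentThread().getId();
            long bytesBefore = tmx.getThreadAllocatedBytes(id);
            long t0 = System.nanoTime();

            // Placeholder workload: ~100-byte allocations, echoing the talk's example.
            byte[] sink = null;
            for (int i = 0; i < 1_000_000; i++) {
                sink = new byte[100];
            }

            long bytes = tmx.getThreadAllocatedBytes(id) - bytesBefore;
            double seconds = (System.nanoTime() - t0) / 1e9;
            System.out.printf("~%.0f MB/s allocated (last array length %d)%n",
                    bytes / 1e6 / seconds, sink.length);
        }
    }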

The point is, this is what we’re seeing. When we get garbage collection logs, since I tune a lot of garbage collectors – I think this year was low, it only got to about 1,000; in previous years, I’d go through 2,000 to 3,000 different JVMs a year – you get a lot of data from these things. This is a problem that once you start seeing it, you just can’t stop seeing it. It just shows up absolutely everywhere.

Garbage collection is not really at fault here; this is really just your application code. Then you get a lot of contributions from different things. For instance, you use Spring Boot, Cassandra or any other big NoSQL solution, Spark and all the derivatives from that, or frameworks like Log4J, or any Java logging framework for that matter. Of course, I bet nobody is using this [JSON] – or basically any marshaling protocol that you’re going to use, and I’m going to include SQL statements and things like that as a marshaling protocol, albeit an accidental one. Also caching products, Hibernate, and so on.

You can see there are a lot of these things that we actually just embed into our application that are just going to magically drop you into the 70% without you actually even realizing what’s going on. These things can actually make a difference. We all have war stories – here are the war stories, you can read them. I don’t really want to go through them too much, the point is that we’ve gotten some tremendous performance improvements simply by looking at the memory efficiency, the algorithms that people are using, identifying hot allocation sites in the code, and doing something to basically destroy them, or reduce them, or remove them from the code.

It’s like anything else. Once you can see these things, then you can design some tests to figure out where this stuff is happening. After that, we can flip from diagnostic mode into thinking creatively about how we can solve these problems. Generally, a lot of these problems are very easily solved. A lot of the work where we got massive improvements in performance happened within a period of less than eight hours of work – real work – believe it or not. They can happen very quickly once you see where the problems are.

Allocation Site

The question is, what does an allocation site look like? If I have this piece of code here and I decompile it, you can actually see the bytecode that’s generated. These allocations are mostly going to occur in Java heap, and they can occur in a couple of different ways. Generally, what we have in the JVM are these things called allocators, and the allocators are going to look at the allocation itself and make some decisions as to how I’m actually going to perform that allocation. I can go down a slow path or a fast path. It’s even possible, if I have small objects of the right shape, that the optimizers are going to come along and just basically get rid of the allocation site on their own. We don’t even have to do anything. What’s going to happen is that it’s just going to say, “I know what to do with this, an on-stack allocation,” and then we’re not doing an allocation out in heap.
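To make “allocation site” concrete, here is what the bytecode for a single allocation looks like when disassembled with javap -c; the class name Foo and the constant-pool indexes are illustrative:

    Foo foo = new Foo();          // the allocation site in source code

    // javap -c output for that line (indexes will vary):
    //   new           #2        // class Foo
    //   dup
    //   invokespecial #3        // Method Foo."<init>":()V
    //   astore_1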

Since we’re talking about Java heap, I just want to very briefly give you a review of what this tends to look like. Of course, this does change with Shenandoah GC and G1GC, but it’s approximately the same. We’re going to have some nursery where the allocations are going to occur. We’re going to have some intermediate area where we can just store things temporarily so that we can figure out if they’re going to be long-lived or short-lived. Then we’re going to have this tenured space where we’re going to put the data that is going to hang around for a long period of time.

Each of these spaces contributes to a different set of problems, and all are going to affect GC overheads. That’s one of the problems that we’re going to have. When we actually do an allocation here, we’re going to have this thing called a top of heap pointer. When we go to execute some code, what we’re going to see is we are going to create a new Foo, a new Bar, and a new byte array. Essentially, what we’re doing is we’re just going to say, “Foo is this big” plus a header. We’re going to grab that much space by doing a bump and run on this top of heap pointer. We just move the pointer forward and there is where our allocations occur.
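Conceptually, the bump-the-pointer allocation described above reduces to a compare and an add. This is a sketch of the idea in Java-like pseudocode, not actual HotSpot code:

    // Conceptual bump-the-pointer allocator (sketch only, not real JVM code).
    long allocate(int sizeInBytes) {
        long newTop = top + sizeInBytes;
        if (newTop > end) {
            return slowPath(sizeInBytes);  // no room left: take the slow path
        }
        long objectAddress = top;          // the object starts at the old top
        top = newTop;                      // "bump" the top-of-heap pointer
        return objectAddress;
    }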

This creates a hot pointer. What we’ve tended to do here is we said, “Instead of just getting enough memory for the single allocation, let’s go out and get a large chunk of memory, and then I’m just going to do my allocations in this buffer. Then once I filled the buffer, then I’ll go get another one.” This is known as a thread-local allocation block. These things start off as being one megabyte in size, and they’ll be adjusted depending upon some heuristics about how your thread is allocating or not allocating, as the case may be.

If you look at it this way, pretty much the same thing is happening. You have a TLAB and you have a TLAB pointer. Then as we allocate, and we have more threads allocating, you can see the Foo and the Bar are going to go into the TLAB. The byte buffer doesn’t fit into a TLAB, so that goes into a global heap allocation. This does affect the speed of the allocation, so the Foo is going to go in quickly, the Bar is going to go in quickly, and the byte array is going to take a bit longer. There are also some failure conditions here that you have to consider, like what happens when we get to the end of the buffer.

We’re going to have a TLAB waste percentage, which says, “If I’ve allocated beyond that line, don’t bother trying to use the rest of the buffer. Just go and get a new one.” That adds a little bit of overhead, but that’s not as bad as, “We’re below the line and the next allocation is going to basically give us a buffer overflow,” which means now I have to go and do more work. It’s a case of: if we can predict something, that’s going to be cheaper than if we have to react to something. It’s always cheaper to predict than to react. This is a somewhat expensive failure path; it means, “Go get another TLAB and start allocating in the new TLAB.”
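TLAB behavior is observable and tunable in HotSpot. The flags below are a hedged starting point – the names are HotSpot-specific, defaults can change between releases, and MyApp is a placeholder:

    # TLABs are on by default in HotSpot (-XX:+UseTLAB).
    # TLABWasteTargetPercent controls the "waste percentage" mentioned above.
    # On JDK 9+, unified logging can show TLAB sizing and refill activity.
    java -XX:+UseTLAB -XX:TLABWasteTargetPercent=1 -Xlog:gc+tlab=debug MyApp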

When we get into the generational collectors – probably not as much in use anymore as people have moved on to G1, and hopefully we’ll leave all of these older collectors behind – everything tends to happen in the free list. The allocation is going to happen from the free list, recovery is going to happen to the free list, and there’s going to be a lot of free-list management here. I don’t want to go too deeply into it, but essentially the garbage collector threads, when they copy data into tenured space from the young generational space, are going to use these things called PLABs, which effectively work the same way as TLABs. So they’re going to have multiple threads acting in the space by basically segregating the space up between the threads.

The problem with the free-list maintenance that you have to do when you’re garbage collecting is that it increases the cost of the allocation and the memory management by a factor of approximately 10x. You don’t really want to be working in this space too frequently, which is why we have a young generational space.

Problems

Now, we get into the problems that we see when we have high memory churn rates. We’re going to get many temporary objects, and that’s going to quickly fill Eden. That’s going to increase the frequency of the young GC cycles. It actually has this really strange effect: if you look at data and ask, “What is the age of this data?” – the age of data right now is calculated by the number of GC cycles it survives. If your GC cycle frequency has increased, all of a sudden the aging of your data has increased. Now you have this other problem, which is almost orthogonal: your application is going to create a piece of data and hold on to it for a period of time – wall-clock time.

You see where we’re going with this, I’m still probably going to hold on to the data for approximately the same amount of time, but since the GC frequency is increased, all of a sudden, that data which may have been garbage collected in young generational space is now going to end up in tenured space more frequently. You can see it has this whole funny effect, strange effect, on what your garbage collection overheads are going to look like. You’re going to get premature promotion as another problem, because the data is aging too quickly. Also, you’re creating more of it, so you’re going to fill the buffers. You have to clean them out more, which means more data moves into tenured. You get more activity up into the expensive spaces to maintain. You get all of that added cost, plus you get all the copy costs and everything like that, and also increased heap fragmentation.

You get all of these funny downstream costs that you don’t even think about. In terms of the allocation, it’s still quick. If the objects die very quickly, there’s zero cost to collect them, so that’s true. That’s what garbage collection people have been telling you all the time, “Go, don’t worry about it. Just create objects. It’s free to collect them.” It may be free to collect them, but quick times a large number does equal slow. If you have high creation rates, it’s not free to create. It may be free to collect, but it’s not free to create at the higher rate.

We actually have this curve here, and we’ve worked this out by looking at a lot of systems over time. I have these bands here; the red band is approximately one gigabyte per second. Think of it this way: I’m consuming memory at approximately 100 bytes at a time. If I can reduce that allocation rate, you can see that we have all kinds of benefits. First off, we get rid of the allocation costs altogether, and we basically reduce the costs from garbage collection. These are the bands that we’d like to hit. Ideally, I’d like to be below 300 megabytes per second. If I’m anywhere below a gig, though, I’m probably happy; I’m not going to really spend much time in this area. If I’m above a gig, I know that I can get significant improvements in performance just by looking for allocation hotspots and trying to improve the memory efficiency, or reduce the memory complexity, of this particular application.

Another problem that we run into sometimes is just simply a large live data set size. What’s a live data set size? That’s the amount of data that’s consistently live in your heap. What we actually see is that with high allocation rates, we’re going to get inflated live data set sizes. There’s also another reason we can have large live data set sizes, and that’s loitering. Think of things like HTTP session objects: people attach data to them, and they hang around for 30 minutes. That’s data that might not be semantically useful to the application, but it’s just cached there, sitting in memory, doing nothing. That’s the term I call loitering.

You have these things loitering around for longer, and they’re going to carry a lot of costs with that. You get inflated scan-for-roots times, because we need to find the things that are, by definition, live. When we start the garbage collection cycle, we need to do the scan for roots. Having all of this extra live data actually means that our scan-for-roots times can sometimes be the dominant cost in the collection. We get reduced page locality, which comes with its own costs, inflated compaction times, increased copy times, and you’re likely to have less space to copy to and from. If you’ve done a Windows defrag on a half-empty disk, it’s easy; when the disk gets full, it gets harder and takes much longer. Same type of problem here.

If you look at pause time versus occupancy, there’s a lot of noise and stuff in the chart on the left, but just look at the red dots and imagine the line that they form. That’s the first thing we’re looking at: the heap occupancy. The dots on the other chart are pause times. You can see the clear correlation between heap occupancy and pause times. This makes sense: the more live data I have, the more work the garbage collector has to do. Since the dead stuff is free and there’s no free lunch, the cost has to come from someplace, and so it’s coming from the live stuff in this case.

Of course, just for slight completeness, not complete completeness: an unstable live data set size, or what we call a memory leak, obviously leads to the bad result of out of memory errors, where the JVM will terminate, if you’re lucky. Does anyone know the definition of an out of memory error? It’s 98% of recent execution runtime spent in garbage collection with less than 2% of heap recovered. You can end up in these weird conditions where your JVM is just chugging along really slowly. You don’t get an out of memory error, and you don’t get it because maybe you’re collecting 3% of heap, not 2%. In some cases, we just said, “It’s just better to throw the out of memory error, so we’ll just go and tweak the definition.” And we’ll just say, “If you don’t get 5%, then we’re going to throw the out of memory error.” You can do that; you can even have fun with your colleagues.
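That 98%/2% definition maps directly onto real HotSpot flags, so “tweaking the definition” is literal. A hedged sketch (the values shown are the documented defaults; MyApp is a placeholder):

    # -XX:+UseGCOverheadLimit is on by default: an OutOfMemoryError is thrown when
    # more than GCTimeLimit percent of time is spent in GC while less than
    # GCHeapFreeLimit percent of the heap is recovered.
    java -XX:GCTimeLimit=98 -XX:GCHeapFreeLimit=2 MyApp

    # The "have fun with your colleagues" version: demand 5% instead of 2%.
    java -XX:GCHeapFreeLimit=5 MyApp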

Eventually, you run out of heap space, and that’s a completely different talk – how to diagnose that and so on – so we’re really not going to cover much of it here. I’ll just mention that it’s another issue. Loitering actually gets you closer to these limits. If you have data that you just can’t get rid of in heap, then it’s going to fill the heap up, obviously, and then you’re just going to have less space to work with. Reducing these cached objects and the like can have some really huge benefits.

Escape Analysis

Now I’m going to talk about this little thing about escape analysis, mostly because it gets important when we start looking for hot allocation sites. Does anyone have an idea of what escape analysis is?

Participant 6: The lifetime of variables.

Pepperdine: Lifetime, yes. We’ll just go over this very quickly. Escape analysis is a test I’m going to apply to a piece of data, and I’m going to say, “What is the visibility of this data? How is it scoped?” If it’s scoped locally, that means that only the thread that created it can actually see it. If I happen to pass that data off into a method call, then that’s going to be called a partial escape. Of course, a full escape is if I’m in a situation where the data is scoped in a way that multiple threads can actually see it. Static is an obvious example; if I declare things static, then I’m in a situation where multiple threads have visibility to this particular piece of data.

The partial escape is the interesting one, because you might think, “Too bad. I just pass it off as a method call into this other thing, and it comes back, and it’s still all local.” Unfortunately, the JIT compiler hasn’t really figured that out. If you want to see a better escape analysis, then I would suggest you go see the Graal talk. Graal actually goes the extra step and says, “If it’s only one method down, and it looks like it passes the visibility test, I’m going to say there’s no escape here.” The point is, in order for me to apply an optimization to how a variable is allocated, I have to know what its scoping is; I have to know what its thread visibility is. Escape analysis is going to give me that information. As I said, this will become important when we actually look at finding the hotspots and doing the diagnostics to figure out what’s going on.
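A minimal sketch of the three escape states just described – the class and method names here are illustrative, not from the talk:

    // Illustrative sketch of no escape, partial escape, and full escape.
    class EscapeDemo {
        static Point shared;                // full escape: any thread can reach it

        int noEscape() {
            Point p = new Point(1, 2);      // visible only to this thread and method;
            return p.x + p.y;               // a candidate for scalar replacement
        }

        int partialEscape() {
            Point p = new Point(1, 2);      // passed into another method: a partial
            return sum(p);                  // escape (inlining may still rescue it)
        }

        void fullEscape() {
            shared = new Point(1, 2);       // stored in a static field: full escape
        }

        static int sum(Point p) { return p.x + p.y; }
    }

    class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }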

Demo

Let’s have some fun with demos. I have this application here; it’s just a funny little MasterMind kids’ game, you might have played it. Instead of using colors, I use numbers. Indirectly I gave it three numbers, and I’m asking the program to guess what numbers I gave it. What’s happening along the way is that I’m scoring each of the guesses, and as I score the guess, the computer will use that information to refine its guess. If you played the MasterMind game as a kid, then you know it’s like a color-guessing game, so you and the partner you’re playing with have to score your guesses to give you more information to refine your guess. This is the same idea here.

In this case, I gave it the answer 0, 1, 50,000, and 3, which runs in approximately 12 seconds. The question is, what’s the bottleneck in this application now? What’s the thing that’s preventing us from finding the answer faster? If you look at the GC log, there’s not a lot of data here. I don’t want to go through a whole analysis here, but I’m going to look at the GC log. Look at this, application throughput 98.9%, that means GC is taking 1.1% of the time. You might look at this and say, “The garbage collector is not really an issue here.” You can look at the overall pause times and the pause times are all well below 10 milliseconds. If I was to tune all of the GC out of this problem, I would save 22.6 milliseconds. Not a lot of time.

The point is, there’s not really an indication here that we have a problem until we actually get into allocation rates. If we look at the allocation rates and you use the guideline that I gave before, the magic numbers, you can see that we have this downward trend, so as things warm up, there’s something happening here, but essentially we settle out at about 2.5 gigs per second. That’s what I call a hot allocating application. The question is, what are we going to do to find it?

We could use some sort of memory profiler, so let’s just bring that to bear. Here’s our MasterMindGUI. I’m going to go over to this funky little profiler here, I’m going to go and check the filters and all the settings. There’s a reason why you want to check the filters on your profilers. Do you know what the best use of the filter is? Filter the bottleneck out of your profile. If you want to say, “There’s no bottlenecks in my code,” make sure you set your filter sets properly. Make sure you understand how your profiler works, make sure that you understand what your filter sets are, what’s going to be profiled, what’s not, and everything like that.

Since I’m clueless, I’m going to set my filter set up in the Clueless mode. If you’re not clueless, then you can take a chance and set it up in the Not Clueless mode if you like, but I’m just going to say I’m clueless. Then it’s really easy – this is just profiling, which is really boring. Now we’ve got all these columns and numbers, and it just looks like a bad pizza night. How are we going to make sense of this?

Hot allocation rates – a rate is a frequency – so I want a frequency measurement out of this bloody thing. What I’m going to do is take allocated objects; that’s my frequency measure. I’m just going to focus on that one. I don’t care about size, I do care about frequency. I could allocate 100 bytes, I could allocate a megabyte – it’s the same amount of time. This is not a size issue, it really is frequency. Now we can go back to our application. This is a good thing about performance tuning: you don’t have to do a lot. It looks like we’ve got a runaway train here on Score. Let’s take a snapshot and dig in. There’s our Score constructor, Board.isPossibleSolution.

Let’s take a look at the Score object and see what’s going on here. You can see it’s got a couple of ints and it’s got this jackpot boolean thing. In other words, we won. It all looks pretty simple, we just go over to this Board thing again. We look at it and say, “Wait a second. Something’s going on here if I look at this code.” You probably have to do something like a score.reset here. There you go, we got rid of the allocation, but have we gotten rid of the allocation here fully? Or did we actually need to do that? I think this is a better question. Let’s answer that question by applying an escape analysis test on the original code. How many threads can actually see the Score object?

Participant 7: One.

Pepperdine: One. Is it being passed to another method? No. Those are the conditions under which it passes the escape analysis test. So, that allocation should actually be eliminated by the compiler, because the object doesn’t escape. It’s one of these small objects, two ints and a boolean, which means that we can actually do an on-stack allocation of this really quickly. The question is, why is it showing up in the profiler then? Maybe it’s not being eliminated?
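For reference, the reset-and-reuse change discussed a moment ago would look roughly like this. The field names and the reset() method are a hypothetical reconstruction; all the transcript tells us is that Score holds a couple of ints and a jackpot boolean:

    // Hypothetical reconstruction of the Score reuse pattern from the demo.
    class Score {
        int exact;          // field names are guesses; the talk only says "a couple of ints"
        int partial;
        boolean jackpot;

        void reset() {      // reuse one instance instead of allocating per guess
            exact = 0;
            partial = 0;
            jackpot = false;
        }
    }

    // Before: Score score = new Score();   // a fresh allocation on every evaluation
    // After:  score.reset();               // one long-lived instance, zeroed each time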

I’ll bring another tool to bear on this one, which is called JITWatch. It was written by Chris Newland, and it’s an awesome program. He’s got all kinds of material on the web to describe it. We don’t have enough time to redo what he’s done; it’s really awesome. Essentially, what I did is I told the JIT compilers to tell me what they’re doing, we load that up in this tool, and one of the things it does is pick out allocations. If I go to com.kodework.mastermind, JITWatch is telling me that the Score allocation has actually been eliminated; that allocation site has been eliminated from the code. I can view the bytecode, and we can go through it.

You can see it’s saying, “There’s our allocation site. It’s been eliminated; it doesn’t exist anymore.” Why is the profiler telling us that we have this hot allocation site when the JIT compiler is telling us that we don’t have it? This is one of the ways that profilers actually lie to you: they go in and they disturb the runtime. What happens is that the profiler is going to do bytecode injection, and it’s going to wrap the allocation site in a method call so that it can track it, and it’ll probably put it into something like a phantom reference, which disturbs the runtime even more. The point is, I can adjust the code and I can run it, and you’re going to see no difference in the runtime, because that allocation site is already going to be eliminated by the optimizers, by the JIT compiler.
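“Telling the JIT compilers to tell me what they’re doing” corresponds to HotSpot’s compilation logging, which is the input JITWatch reads. A hedged invocation (flag names are HotSpot’s; MyApp is a placeholder):

    # Produce a JIT compilation log that JITWatch can load and analyze.
    java -XX:+UnlockDiagnosticVMOptions \
         -XX:+TraceClassLoading \
         -XX:+LogCompilation \
         MyApp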

Participant 8: The profiler will tell you it doesn’t know that.

Pepperdine: Actually, no, it won’t. I can run it and show you, but the point is the profiler is picking up this allocation site because it does a partial escape. When it does a partial escape, it also exposes it to another thread. Now, the effect of profiling has changed the runtime, so the optimizers have to react differently to this particular situation. Therefore, this is lying to you. If you want to know the real hot allocation site from the profile, it’s actually not here. It’s actually going to be this object which is going to be wrapped in this little int array down here.

The long and short of it is that when we actually do this work – we’ve recognized, “OK, we’ve got this problem,” and we come around to actually fix it – we also have to understand how our tools are interfering with our ability to understand what’s happening in the runtime, so that when we go around to make the fixes, we’re fixing the right stuff. If you look at the deployment cycles to get this back out into production, that’s where you’re actually going to see the real picture; you’re not going to see it in your test environment. Forget that, that’s not happening. When we spin around and see what’s happening in the real environment, then it’s like, “No difference,” and you’re left there scratching your head going, “Why?”

Questions and Answers

Participant 9: Can you explain how you get to the point when you know that it’s the integer array?

Pepperdine: If you look at allocated objects and stuff like that, we’re probably going to go after this object array, and we’ll just do a study on it. I’m just going to go top to bottom down the list and try to figure out basically what this thing is involved with. Is this allocation actually being eliminated or not? Just answering those types of questions. We can probably take a look at what this is involved with already in our snapshot.

This looks like an artifact of profiling to me. You can’t hide from the profiler in this case. That’s what that actually looks like. If we get into the next one, indexing, we can see that now we’re back into application code, and we can actually take a look at what this particular piece of logic does to see what we can do to change it. In this case, I think I was using BigInteger, which is absolutely horrible. If you do the math, you can see that I could just replace this with a primitive long, and of course, you’re going to get the corresponding huge performance improvements when you make that change.
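The BigInteger-to-long swap is the classic version of this fix. The indexing code itself isn’t shown in the transcript, so the sketch below is illustrative only; it contrasts the allocation behavior of the two approaches:

    import java.math.BigInteger;

    public class IndexingSketch {
        public static void main(String[] args) {
            int n = 1_000_000;

            // Each BigInteger.add() allocates a new object: n allocations.
            BigInteger slow = BigInteger.ZERO;
            for (int i = 0; i < n; i++) {
                slow = slow.add(BigInteger.ONE);
            }

            // The primitive long version allocates nothing on this path.
            long fast = 0;
            for (int i = 0; i < n; i++) {
                fast++;
            }

            System.out.println(slow.longValue() == fast);  // true: same result
        }
    }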

Participant 10: You gave an absolute number for garbage collection.

Pepperdine: The allocation rates?

Participant 10: Yes. If you’re talking about something like Spark, and very large machines, do you have any comments on how big a JVM should run?

Pepperdine: No, I don’t. These numbers seem to hold no matter what piece of hardware we’re running on. We’ve actually used some machine learning, or AI, or whatever you want to call it. That’s not my specialty, I have a guy who has actually been doing that for us. He’s validated these numbers running a whole bunch of different experiments, and this is just numbers that we’ve pulled out of production environments in general.

Participant 10: It doesn’t matter if you have a 32 CPU, or…?

Pepperdine: No. Actually, I think the message from the keynote and what Brian [Goetz] said today is that with CPUs, it doesn’t really matter if we make them faster or not. They’re saying that they can’t make them faster; that’s not the issue. The issue is that if I need to go fetch data from memory, I have stalled the processor for up to 300 clocks to wait for that particular data. A typical application is going to go 40 clocks active, 300 clocks wait, 40 clocks active, 300 clocks wait, and so on – something along those lines. If you think about where we’re going to get the biggest performance gains going forward, it’s filling in that 300 clocks.

Now, we used to have speculative execution, but we had a conversation about that: the side-channel attacks and everything are slowly wiping that off the table. The gap between CPU speed and memory speed has been increasing at about 8% per year, so some problems are only getting worse over time. Fortunately, as we’ve been told, that should tail off, and hopefully we can start focusing on peripherals, the things that are much slower than the CPU. We can fill in the gaps naturally by having faster memory, faster buses, and things like that; we can just feed the CPU. I think that’s where we need to focus our energies going forward. Then these problems become less apparent, I think. We didn’t see this problem so much 10 or 15 years ago, mostly because the difference in speed between CPUs and memory was not as great. You just didn’t pick it up as much, but over the years it’s become a bigger problem.

Participant 11: When I think about one of my applications, I try to make it so that the application state, the objects, are immutable. Then when we receive some sort of event that requires us to mutate the state, we copy-on-write. We’re necessarily producing a lot of garbage, and we do a lot of allocations there. How do you reconcile that with the desire for immutability?

Pepperdine: Mutable shared state is evil. Immutability is evil, as you can see. Really, you need to find some place where you have mutable state that’s not shared. That’d be a sweet spot, wouldn’t it? If you make the design decision to automatically make everything immutable for me, then you’ve handcuffed me. I’ve got two choices. I can either not use your stuff, or I’m just going to take the memory penalty for using it. That’s really the only choice I have.

If you make the state mutable but not shareable, then I think you are probably in a better world. That just comes from better design – understanding how to design things so that instead of exposing mutable state, we’re actually containing it properly, the way things like actor models do. We use Vert.x in our projects; Vert.x is great for doing this. We maintain stateless parsers, imagine that, as an example. We don’t have to worry about it.

See more presentations with transcripts



Article Series – .NET Core 3

MMS Founder
MMS Chris Woodruff

Article originally posted on InfoQ. Visit InfoQ

In this series, we explore the benefits of .NET Core and how it can help not only traditional .NET developers, but all technologists who need to bring robust, performant and economical solutions to market.

By Chris Woodruff



MongoDB Achieves Independent Validation for PCI DSS Compliance

MMS Founder
MMS RSS

Article originally posted on MongoDB. Visit MongoDB

 

NEW YORK, Sept. 23, 2019 /PRNewswire/ — MongoDB, Inc. (NASDAQ: MDB), the leading, modern, general purpose database platform, today announced that its global cloud database, MongoDB Atlas, has been independently validated as a Payment Card Industry Data Security Standard (PCI DSS) certified service provider. Following an extensive audit process, the certification was issued by an independent Qualified Security Assessor (QSA).

“At MongoDB, we’re committed to providing built-in, best-in-class security features to our customers,” said Lena Smart, Chief Information Security Officer, MongoDB. “We’ve placed a premium on making MongoDB Atlas as secure as possible so customers are fully confident in running mission-critical workloads in the public cloud.”

PCI DSS is an information security standard developed by the PCI Security Standards Council which applies to all entities that store, process and/or transmit cardholder data. The PCI Standard was created to increase baseline technical, physical and operational security controls necessary for protecting payment card account data. The PCI DSS requirements apply to all system components included in or connected to the cardholder data environment. Along with certification, MongoDB is now a member of the PCI Security Standards Council with the ability to review and provide feedback on future standards.

“MongoDB Atlas is being used across the globe for business-critical applications in the most demanding industries, and providing best-in-class security capabilities and compliance certifications is a major part of that,” said Sahir Azam, SVP Cloud Products & GTM. “Securely handling credit card payments is foundational for online business worldwide, and our PCI compliance certification expands MongoDB’s mission in providing the leading global cloud database across all major cloud providers.”

Attaining PCI compliance is the latest in a series of global information security standards that MongoDB Atlas complies with, meeting the criteria for stringent workloads. These global information security standards include SOC2 and ISO27001:2013. Additionally, MongoDB Atlas assists customers with GDPR compliance and is HIPAA ready.

This spring, MongoDB received Security Technical Implementation Guide (STIG) approval from the Defense Information Systems Agency (DISA), making it the first non-relational database to do so. This approval allows U.S. Department of Defense (DoD) agencies to deploy MongoDB within certain DoD networks.

In June, MongoDB announced its new client-side field-level encryption capability in version 4.2. Most databases handle encryption on the server-side, but client-side field-level encryption changes that by providing automatic, transparent encryption, separation of duties and regulatory compliance.


About MongoDB
MongoDB is the leading modern, general purpose database platform, designed to unleash the power of software and data for developers and the applications they build. Headquartered in New York, MongoDB has more than 15,000 customers in over 100 countries. The MongoDB database platform has been downloaded over 70 million times and there have been more than one million MongoDB University registrations.

Investor Relations
Brian Denyeau
ICR for MongoDB
646-277-1251
ir@mongodb.com

Media Relations
Mark Wheeler
MongoDB, North America
866-237-8815 x7186
communications@mongodb.com

View original content: http://www.prnewswire.com/news-releases/mongodb-achieves-independent-validation-for-pci-dss-compliance-300922846.html

SOURCE MongoDB

 



What Tech for Good is and Why it Matters

MMS Founder
MMS Ben Linders

Article originally posted on InfoQ. Visit InfoQ

Tech for Good groups provide opportunities to connect with people who share a positive vision of the future and look for ways to use technology in order to have a positive impact. Ellen Ward spoke about Tech for Good Dublin at Women in Tech Dublin 2019; she presented what Tech for Good looks like in reality, why it matters, and how people can get involved.

Tech for Good Dublin is a group that believes in the power of technology to positively impact people, communities and the planet. They care about the future and want to help ensure that tech is used as a force for positive change. It is part of the global Tech for Good movement, which seeks to share the message that technology is a tool for everyone, and that we are all stakeholders in the future of tech.

Technology should be designed by and available to everyone, so they have aimed to be as inclusive as possible from the start, Ward said. Tech for Good Dublin aims to find inspiring topics that will encourage creative thinking and collaboration, so they have looked at technologies being used with children, older people and people with autism or dyslexia, for example. By sharing good news stories, their events and communications aim to amplify unheard voices and promote the important work of changemakers in our communities.

Ward explained how Tech for Good acts like an antidote to the negative news we hear every day about the impacts of technology which may cause us harm now or in the future. People love trying out new technology and asking questions about how it works, she said. They also like to talk about how ideas can be turned into reality, and the challenges that can occur along the way. It’s a fantastic forum for learning and connecting with other people.

InfoQ interviewed Ellen Ward, head of information systems (interim) at Concern Worldwide and co-founder of Tech for Good Dublin, after her talk at Women in Tech Dublin 2019.

InfoQ: How are you involved in Tech for Good Dublin?

Ellen Ward: I’m a co-founder, together with Máirín Murray. We design and run the events (unpaid and in our own time) to focus on projects with purpose and showcase how people are using technology to solve everyday problems.

By highlighting Tech for Good projects we can influence the evolution of technology and encourage people to take back some technological power. Our group is open to anyone, we welcome non-technical people especially, and we encourage open and inclusive discussions as well as member networking and spin-off volunteer projects.

Since we started in March 2017 we have gathered more than 1,700 online members, and are proud to say that we have become an enthusiastic and diverse group of people who share a common mission.

InfoQ: What has Tech for Good accomplished?

Ward: Tech for Good Dublin has run 21 events so far, and given away more than 1,080 free tickets to people who want to learn more about Tech for Good. We have zero budget, and rely on donations in kind (room space, donuts etc.) and volunteer speakers. There has been incredible goodwill for our group which shows that people believe it is important.

We are part of a global movement. We know of groups popping up all over the world, such as Auckland, Denver, LA, Orlando, Cardiff, Birmingham, Brighton, Naples and Lisbon, plus Kenya, Nepal and Madagascar. There is no set model to follow; the groups interact with and respond to their own community interests and needs.

In Dublin we do this by sharing stories of real people making change happen. Examples we have featured include a geography teacher who introduced Virtual Reality into classrooms to help students get engaged with the natural world, a team who built a platform to help charity shops sell their many donated books, an app that helps children with dyslexia with their education and with connecting with other people, and an intelligent bike light which is helping city planners to improve cycling infrastructure.

InfoQ: What kind of topics does Tech for Good look for?

Ward: We are always finding new topics to talk about. For example, a new app is being released this month (October) to assist people learning and using Irish Sign Language and we have the creators speaking at our next event. They will explain how the app came about, and demonstrate how it works with a hands-on exercise.

Then for our November event, I have found speakers who are using technology like sensors, crowd-sourced data and AI to manage and protect bees in hives and the wild. Once I started researching I found quite a lot of bee projects using tech, and I think that will be a very interesting session.

InfoQ: What do you do to build inclusive groups?

Ward: Some of the simple things we have done from the start to encourage inclusive groups include ensuring we have 50% women speakers, and we seek out people whose voices may not ordinarily be heard. We do not charge for events as we don’t want anyone to feel that money is a barrier to participation. There are zero prerequisites to getting involved; people don’t have to have technology skills, just an interest in what we are talking about. We provide a relaxed and friendly forum where no one is turned away and everyone can contribute if they wish. At the end of each event, we ask if anyone in the room wants to talk about their own work or ask for help, share their CV, invite people to another event etc., and this is a great way to learn more about what is happening in our city.

We also ask for feedback constantly and try to improve and evolve with the group as it grows. The Women in Tech event is a great way to talk to more people about what we could do with the group in the future, so I will be asking the attendees for their feedback too!

InfoQ: Where should people go if they want to join Tech for Good?

Ward: Well, in Dublin, come to us! The Tech for Good Dublin MeetUp group is the best place to go if you want to hear about all our future events, and you can follow us on Twitter @tech4gooddublin.

You can watch our TEDx talk about Tech for Good: Community+Technology=Positive Social Change.

If you are interested in starting a Tech for Good group somewhere else in Ireland or further afield, we can strongly recommend it as a great way to meet fascinating people and start interesting conversations. All you need is a bit of free time and a room you can use. Everything else you can find online, or by talking to one of the groups already created in cities around the world.



Google Releases Cloud Dataproc for Kubernetes in Alpha

MMS Founder
MMS Steef-Jan Wiggers

Article originally posted on InfoQ. Visit InfoQ

Google Cloud Dataproc is a managed data and analytics processing service based on the open-source Hadoop and Spark ecosystems. Google has now announced the alpha availability of Cloud Dataproc for Kubernetes to provide customers with more efficiency to process data across platforms.

The Cloud Dataproc service has been generally available for over three years and now offers alpha access to Spark jobs on Google Kubernetes Engine (GKE) – meaning developers and data scientists can now run Apache Spark jobs on GKE clusters. Typically, Spark applications run on Hadoop YARN clusters; however, with Cloud Dataproc for Kubernetes, users will have one central view that can span both YARN and Kubernetes clusters and do not need to manage them separately. Furthermore, according to the announcement blog post, the support for both clusters will give enterprises more flexibility to modernize specific hybrid workloads while continuing to monitor YARN-based workloads.
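For context, this is what submitting a Spark job to an existing (YARN-based) Cloud Dataproc cluster looks like today; the GKE-backed flow is gated behind the alpha access request mentioned below, so only the established command is shown, and the cluster name and region are placeholders:

    # Submit the bundled SparkPi example to an existing Cloud Dataproc cluster.
    gcloud dataproc jobs submit spark \
        --cluster=my-cluster \
        --region=us-central1 \
        --class=org.apache.spark.examples.SparkPi \
        --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
        -- 1000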

Running Apache Spark on Kubernetes differs from running it on virtual machine-based Hadoop clusters, as on the Cloud Dataproc service or competitive offerings like Amazon Web Services (AWS) Elastic MapReduce (EMR) and Microsoft’s Azure HDInsight (HDI). Apache Spark is the first open-source processing engine Google brings to Cloud Dataproc on Kubernetes, and the tech giant is planning to bring other open-source analytics components to Kubernetes as well, such as Apache Flink, Presto and Apache Druid. Furthermore, products like Anthos – which now makes GKE available virtually anywhere – allow customers to take Cloud Dataproc to their own data centers, or eventually to Amazon Elastic Kubernetes Service (EKS) and Azure Kubernetes Service (AKS).

In the same Google announcement blog post, Matt Aslett, research vice president at 451 Research, said:

Enterprises are increasingly looking for products and services that support data processing across multiple locations and platforms. The launch of Cloud Dataproc on Kubernetes is significant in that it provides customers with a single control plane for deploying and managing Apache Spark jobs on Google Kubernetes Engine in both public cloud and on-premises environments.

Customers who want to try out Cloud Dataproc for Kubernetes will have to apply for access by emailing Google. Furthermore, the alpha release is intended for testing and experimentation purposes only. More details on Cloud Dataproc for Kubernetes are available on the How to Get Started blog post.
