Mobile Monitoring Solutions


Uber Open-Sources Ludwig Code-Free Deep-Learning Toolkit

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

Uber Engineering is open-sourcing Ludwig, a deep-learning toolkit that allows users to experiment with a variety of neural network structures without writing code.

Ludwig is built on top of Google's TensorFlow deep-learning library. Other high-level interfaces, such as Keras or Gluon, make deep-learning frameworks friendlier to use, but they still require users to define their neural networks in code (usually Python). Ludwig instead pre-packages a large number of popular deep-learning patterns, which can be combined and configured using a YAML file.

A large class of deep-learning solutions for vision and speech problems follows an "encoder/decoder" pattern. In this pattern, the input is converted from raw data into a tensor representation, which is then fed into one or more layers of a neural network. The layer types depend on the input data. For example, image data is often fed into a convolutional neural network (CNN), while text data is fed into a recurrent neural network (RNN). Likewise, the output of the network is converted from tensors back into output data, often passing through RNN layers (if the output is text) or some other common layer type.

Ludwig simplifies the construction of complex deep-learning solutions by prepackaging encoders and decoders for common data types, along with common pre-processing steps that must be applied. Using a YAML file, users specify the data types in their feature set and configure the encoders for the input and decoders for the output. The Ludwig website provides several examples of models that can be implemented with just a few lines of YAML configuration, including:

  • natural-language understanding
  • machine translation
  • sentiment analysis
  • image captioning
  • visual-question answering
  • time-series forecasting

Training can be monitored via TensorBoard, and training time can be reduced by using the Horovod distributed training framework. Ludwig also provides commands for visualization of training and test results. Ludwig ships with a command-line interface; training and prediction can also be invoked from code using Ludwig’s API.
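
For readers who want a feel for the API route, the sketch below shows roughly how a model could be defined and trained from Python. It is a hedged illustration rather than code from the Ludwig documentation: the feature names and file paths are invented, the model definition mirrors what would otherwise live in the YAML file, and exact argument names may differ between Ludwig releases.

```python
from ludwig.api import LudwigModel

# Model definition mirroring the YAML structure: one text input encoded with an
# RNN, one categorical output (a sentiment-analysis-style model).
model_definition = {
    "input_features": [{"name": "review_text", "type": "text", "encoder": "rnn"}],
    "output_features": [{"name": "sentiment", "type": "category"}],
}

model = LudwigModel(model_definition)
train_stats = model.train(data_csv="reviews.csv")        # hypothetical training file
predictions = model.predict(data_csv="new_reviews.csv")  # hypothetical scoring file
model.close()
```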

The Ludwig toolkit is available on GitHub under the Apache 2.0 license. The website provides a developer guide for extending Ludwig by adding data types, encoders, and decoders.



Podcast: Grady Booch on Today’s Artificial Intelligence Reality and What it Means for Developers

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

Today on The InfoQ Podcast, Wes Reisz speaks with Grady Booch. Booch is well known as the co-creator of UML, an original member of the design patterns movement, and, more recently, for his work on Artificial Intelligence. On the podcast today, the two discuss what today's reality is for AI. Booch answers questions such as what AI means for the practice of writing software and how it is affecting software delivery. In addition, Booch talks about AI surges (and winters) over the years, the importance of ethics in software, and a host of other related questions.



RunC Bug Enables Malicious Containers to Gain Root Access on Hosts

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

Security researchers have discovered a critical bug in runC – a lightweight CLI tool for spawning containers according to the OCI specification – which allows attackers to escape the container and gain administrative privileges on the host.

The bug (CVE-2019-5736), discovered by Adam Iwaniuk and Borys Popławski of Dragon Sector, a Polish security Capture The Flag team, "allows a malicious container to (with minimal user interaction) overwrite the host runc binary and thus gain root-level code execution on the host," explained Aleksa Sarai, a senior software engineer at SUSE Linux who is also one of the maintainers of runC. Sarai added that "It is quite likely that most container runtimes are vulnerable to this flaw, unless they took very strange mitigations beforehand." He noted, however, that correct use of user namespaces "where the host root is not mapped into the container's user namespace" blocks the flaw.

The maintainers of runC have already made a fix available to resolve the security flaw. Docker has released version 18.09.2 addressing the issue and recommends immediately applying the update to avoid any potential security threats. Many vendors and cloud service providers, including Google, Amazon, and Kubernetes, have issued security bulletins on mitigating the issue.

The researchers who discovered the flaw have explained it in detail in their blog post. The proof-of-concept exploit code for the vulnerability is now available on GitHub.

The runC team has assigned the bug a CVSSv3 score of 7.2. Red Hat Product Security gave it a severity rating of "important impact", which "is given to flaws that can easily compromise the confidentiality, integrity, or availability of resources."

According to Scott McCarty, senior principal product manager for containers at Red Hat, the disclosure of this flaw “illustrates a bad scenario for many IT administrators, managers, and CxOs.” He notes in his blog that “Exploiting this vulnerability means that malicious code could potentially break containment, impacting not just a single container, but the entire container host, ultimately compromising the hundreds-to-thousands of other containers running on it. A cascading set of exploits affecting a wide range of interconnected production systems qualifies as a difficult scenario for any IT organization and that’s exactly what this vulnerability represents.” But he also mentions that SELinux in targeted enforcing mode mitigates this vulnerability for most Red Hat technologies.

The bug has captured the attention of developers working in the container virtualization sphere. Several of them have taken to forums such as Twitter, Reddit, and Hacker News to discuss the implications of the bug as well as the underlying bad practices it exposes, such as running privileged containers and running processes as root. The general advice for developers is to run processes under their own user and to use only verified images to spawn containers.

The topic of container security has been in the spotlight of late. Early last year, 17 malicious Docker images were pulled from the Docker Hub image repository after reports that they were being actively used for illegal cryptomining. In December 2018, a critical flaw was discovered in Kubernetes which could allow "a malicious user to gain full administrator privileges on any computer node in a Kubernetes pod". A patch was immediately released in response.



Eclipse Releases MicroProfile 2.2 for Java Microservices

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

The Eclipse Foundation recently released MicroProfile 2.2, helping developers create microservices on top of Java EE 8. This release comes at the same time that Eclipse is taking over as steward of Java EE and rebranding it as Jakarta EE.

This release enhances support for the OpenTracing API, helping developers create legible log statements that track interactions between different microservices. Additionally, it improves interoperability with other frameworks such as OpenAPI and RestClient, which work in tandem to help build and consume RESTful services. Using these capabilities, developers can leverage Java's static typing system and turn the runtime errors that come from piecing JSON together into compilation errors that can be found through automated tooling.

Keshav Vasudevan from Swagger’s OpenAPI group elaborates in his blog post, “The Benefits of OpenAPI-Driven API Development“.

The [OpenAPI Specification] is to REST what WSDL was to SOAP. It provides a common framework of designers, developers, testers, and devops to build and maintain APIs. Think of the specification as a set of rules to build and implement a REST API. The OAS is language agnostic, and is both human and machine readable, which allows both people and computers to discover and understand the capabilities of a service without requiring access to source code, additional documentation, or inspection of network traffic.

OpenAPI’s focus on human readable APIs and small MicroProfile services that work together is a combination that reduces the amount of work needed to understand what programs actually do, applying the famous Donald Knuth quote, “programs are meant to be read by humans and only incidentally for computers to execute.”

In a previous interview, Uber's chief system architect Matt Ranney explains the role of type-safe interfaces.

Microservices have a lot of trade-offs, not all of which are obvious… A lot of early code at Uber was using JSON over HTTP which makes it hard to validate those interfaces… Moving towards type safe interfaces between services; one of the biggest lessons was the unexpected cost of using type unsafe JSON strings for exchanging data between services.

Adam Bien, freelance developer and author of "Real World Java EE Night Hacks — Dissecting The Business Tier", recently did a two-minute productivity tip explaining how to use "Thin Wars, MicroProfile, and Docker together" to streamline application development.

The new MicroProfile framework is compatible with Payara Fish, JBoss, WildFly, and IBM’s OpenLiberty project.

Developers looking to try and test MicroProfile services can leverage the new MicroProfile Starter Beta. They are also able to deploy MicroProfile applications in standalone mode through the Thorntail framework, which, similar to Spring Boot, embeds the components necessary to produce a standalone executable JAR file.

 



19 Great Articles About Natural Language Processing (NLP)

MMS Founder
MMS RSS

Article originally posted on Data Science Central. Visit Data Science Central

This resource is part of a series on specific topics related to data science: regression, clustering, neural networks, deep learning, Hadoop, decision trees, ensembles, correlation, outliers, regression, Python, R, TensorFlow, SVM, data reduction, feature selection, experimental design, time series, cross-validation, model fitting, dataviz, AI and many more. To keep receiving these articles, sign up on DSC.




Presentation: Using Technology to Protect Against Online Harassment Panel

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

Transcript

Leong: Today, we’re doing a panel on stopping online harassment, or using tech to stop online harassment. Basically, what the hell do we do about online harassment? So today I have gathered a group of lovely ladies here to talk about the very subject. First of all, I would like to apologize to Leigh Honeywell whose name I mispronounced this morning, so it was very embarrassing. So Leigh is the CEO of Tall Poppy, which is a very exciting startup, which is going to be providing protection from online harassment to employees as a benefit. Is that correct? Cool. Next up we have Kat Fukui who is a product designer on GitHub’s Community and Safety team. Small plug, our team is pretty great. So she helps me design safety from the beginning to the end of a feature, making sure that all products that we make are consensual. Last up, we have Sri Ponnada who is an engineer at Microsoft. This is not a Microsoft talk, I swear. And she is an engineer. You are a little bit outnumbered, I am sorry. And Sri is an engineer at Microsoft who made an app to help communities get a little bit closer to their [inaudible 00:01:23] and to give them a little bit more of a voice rather than Google saying, “This is what your page is going to look like.”

What Is Your Definition of Online Harassment?

So with that, I’d like to start off the first question which is for anybody who would like to answer, what is your definition of online harassment?

Fukui: Hello. I guess I have the mic, so I’ll start. My personal definition of online harassment is a targeted abuse of features that are on a platform using technology that may have not been accounted for. Usually, there’s a pattern to it, whether it’s the method of abuse or the type of people that are being abused with the tools and the technology. Yes, I think I’ll let others speak too.

Honeywell: That’s a super good and very like nuanced definition. Wow. So I think of a couple of things when I think of defining online harassment. One of the recurring experiences in working with people who’ve experienced online harassment is that they don’t necessarily call it harassment. A couple of terms that have been useful are just negative online interactions or negative social experiences. One of my favorite retorts when someone is being a jerk to me on Twitter is telling them to go away and then when they don’t saying, “Wow, you have really poor interpersonal boundaries.” So I think just thinking of online interactions that are not positive; there’s a smooth gradient from, “Ooh, someone was mean to me on Twitter” all the way into actual threats of violence and other things that potentially fall outside of the realm of like protected free speech. But I think thinking of it in terms of features and misuse of features is a really good perspective.

Ponnada: Yes. I guess, first, both of what you ladies touched on, and also lately thinking about what does being online mean? Your phone, what if you were dating someone and then they just consistently texted you because now they can. It’s not like back in the day where they had to write a letter, they’re not going to write a billion letters in a day, but they whip out their phone and they’re like, “Let me send you, bombard you with texts”. So like that, or also how are companies harassing others? Harassing me with ads. I try to go on Instagram to see what my friends are up to and then I just see a ton of ads and I’m like, “Why is this here?” I never asked to see this or just taking our data, selling it and I don’t know, doing things… the people putting the technology out there themselves, abusing their power. So yes.

Honeywell: I think that connects it to- has anyone here read Sarah Jeong’s book, ”The Internet of Garbage?” A few people. It’s really good if you haven’t read it yet. It’s, I think, a dollar on the Kindle store or you can download it from The Verge, I believe. Anyway, one of the salient points in that book is making a comparison between the technological evolution of anti-spam and how today we’re thinking about anti-harassment. And I think there’s often situations that blur the boundaries between those two things, right? There was a Twitter thread going around this morning where someone was complaining about having been sent recruiter spam that was very clearly scraped from their GitHub account. You know all about this problem. It’s both kind of like harassy and misusing features, but it’s also like commercial spam.

I think there’s a lot of things that straddle that line between just straight up commercial spam and you see that a lot with the enterprise salesy, “I’m going to follow up in automated fashion with you every two weeks until you tell me to screw off,” which is the standard enterprise sales playbook. So figuring out what that line of- I like to call it the line between appropriate hustle and inappropriate thirst.

Leong: #thoughtleadership right there. So you mentioned that this is becoming normal practice now, which is somebody will non-consensually take your contact information and then do something which you don’t expect it to. And then it crosses some boundaries within you as a personal person, you as a digital citizen. Why do you think that there is such a problem with harassment online?

Boundaries

Ponnada: I want to start this one if that’s okay. Yes. So you talked about boundaries, right? I feel like with the rise of technology and that having become our primary form of communication, for the most part people don’t really talk to each other anymore. And I noticed that in this room this morning, I was chatting with a gentleman and I asked him, ”Is this seacliff room?” And he was like, “You know, I don’t know.” And then he asked one of the volunteers and he was like, ”Do you know if this is seacliff?” And the volunteer said, ”Let me look it up on my phone.” Or was like, “Oh, you can find it on the app”. And that’s fine, whatever, no harm, no foul. But why is that the first thought, to think that, “Oh, why don’t you just look it up on your phone”, rather than just talk to this person. Maybe they’re just not developing those kinds of interpersonal relationships in real life or setting those boundaries for themselves. And then that just kind of translates online.

Fukui: Oh yes, sure. Piggybacking off of boundaries, when I think of online harassment, I think it’s still harassment. Online communities are still communities and in real life we have those boundaries pretty set. If I called Danielle a jerk or something, I would never say that …

Leong: I probably deserve it.

Fukui: … then like that’s not cool. And we have ways to talk about that and we have ways to combat that, especially if it’s physical violence. But online communities, we don’t have these standardized open frameworks for how to deal with that kind of stuff. If I called someone a jerk online, we wouldn’t treat it the same way online as we do in real life. And we haven’t come up with those boundaries in a standard way to create those boundaries across the technology that we build. So that’s how I’ve been seeing it lately, that online harassment is real harassment and it’s just the same in person and on the internet.

Honeywell: I literally have a slide in our startup pitch deck about how the Internet is real life. It’s one of the bullet points. In having done longitudinal work over the past decade with people facing online harassment, one of the things that comes up a lot in these conversations is, is this some kind of new flavor of bigotry or new flavor of misogyny that we’re seeing? And I think fundamentally it’s not. I think what we’re seeing with online harassment is the same negative interpersonal interactions and biased interactions that have always existed in our culture. They’re just in public; they’re just happening in a visible way.

I think of all of the stories of men not understanding how much street harassment was a problem until they walked at a distance from a female partner and observed what was happening to her, kind of thing. It’s sort of that effect, except for the entire internet when it comes to gendered harassment and racialized harassment and all of these kinds of things where this shit has always existed. Excuse my French. We can just see it now and it’s not deniable.

Although, which is sort of funny because one of the great themes in many harassment campaigns, one of the sets of tactics that I’ve seen used is people will be experiencing threats and harassment and stuff. And when they talk about it publicly, like, “Wow, I had to cancel this talk because of bomb threats” or whatever, people will be like, “It’s a false flag, you’re just making shit up.” So even though it’s happening in public, there’s this real crazy making thing of people still try to deny it even when it’s super visible. So it’s like Sandy Hook Truthers except for harassment.

Ponnada: But what you said about communities and us building communities online in this entire world, I feel like it’s also – I don’t want to say empowered because it just doesn’t feel right- but yes, given these hateful groups opportunities to connect …

Honeywell: Emboldened.

Ponnada: … Yes, embolden them, so that they can connect with each other. And maybe if you lived somewhere super liberal in Seattle you’re like, “Oh, I can’t really voice these thoughts in person because people are going to shame me,” like you were talking about. Then you go online and you’re like, “Yeah, there’s 10,000 of these folks online and I can just say whatever I want.” I think there was a veteran’s group or something back on Facebook and they were posting- yes, the Marines. And just these kinds of incidents that you find out about and it’s just scary.

Marginalized Communities

Leong: So then that brings me to my next question, which is, you mentioned, I think Sarah Jeong actually talks about this where a lot of harassment tends to be gendered and it tends to be against women, against non-binary people, against LGBT folks. It’s also very racialized as well. So if you are a black trans person on Twitter, it is going to be terrible because we don’t have systems in place for this. So what are some marginalized communities that tech leaves behind and how are these communities impacted by the role of tech in their lives?

Honeywell: I think one of the current situations that comes to mind around this is there’s currently a major issue where Twitter is banning trans people for using the term TERF, which for those who aren’t familiar stands for Trans Exclusionary Radical Feminist. There’s a segment of folks who call themselves feminists who are super not cool with trans people, and harass and stalk and target them online. And there have been coordinated flagging and reporting campaigns against trans people, particularly trans people in the UK because this TERF-dom is very part of the mainstream politics in the UK, including some prominent politicians and opinion writers and stuff.

So the thing that this brings up is, these abuse reporting mechanisms that do exist are being weaponized against a marginalized population who are simply using a descriptive term to describe people who are advocating against their rights. These trans exclusionary so-called feminists are basically weaponizing the reporting systems in order to target trans people. And I think, it brings up some of the nuances of abuse reporting. Why can’t Twitter solve the Nazi problem, right? They’ve solved it for Germany and France where they’re legally obligated to. All of these systems are double-edged swords and can be weaponized against marginalized groups too.

It’s this tricky needle that everyone is trying to figure out how to thread and people are trying to sprinkle some machine learning on it. But a lot of it actually comes down to human judgment. And sometimes those human judgments are not transparent and end up with stuff like what’s happening to trans people on Twitter.

Fukui: Yes, I can speak next. So actually before working at GitHub, my first job out of school was working on a platform to raise the visibility of precarious workers. Precarious laborers? So for example, people who work on farms. The Marriott strikes are what I would consider precarious laborers. And I think when we talked about groups that are being left behind, I also want to highlight the socio-economic disparity.

Leong: [inaudible 00:14:44] words. What does that mean?

Fukui: There is a huge gap in income in Silicon Valley and across the United States and internationally, but I think it’s very, very glaring when you come to San Francisco. We have a huge homeless problem that is not being addressed. I also live in Oakland and that is also extremely problematic. When I think of that kind of technology, I think we have to be thinking about how do we make our technology inclusive for people who can’t exactly afford it or have the time to?

So when I was working on that platform called Stories of Solidarity, the technology that we were leveraging was actually SMS texting because most people who are still precarious laborers have some sort of phone that can text. And it is used pretty commonly as a way to spread information really quickly. Like, “Oh my gosh, there’s going to be a raid this Monday at the farm, we need to go for undocumented workers.” So whenever I’m thinking of technology that we build, I still want to think about the accessibility and what kind of tiers of technology we can build for others who may not have access to it. So beyond people who experience harassment in their real life, they’re most likely going to experience it online. And I definitely want to think about our responsibility in tech to accommodate people with the income disparity.

Ponnada: Wow, tough act to follow, tough acts. But, yes, you’re absolutely right. Talking about the divide that exists between the kinds of technology that’s available to people and then, are we focusing our work on building websites or are we just making IoS apps? Are we making things for Android? How are we bridging these cultural and knowledge gaps and allowing people to access the internet and to access the knowledge and information that we have so that they can also be part of this new digital society?

And the other thing is, even women in the tech industry face this being left behind experience because it’s in the news. HR tools have bias against women. I’m not going to name names, but we all know and it’s just amazing. There is no one that isn’t being left behind. The people creating this, this kind of environment, the people that are building these tools and not thinking about what they’re doing, in a way are being left behind too because they don’t know what they’re doing. How do we bridge that? How do we let them know that, “Hey, you’re a douchebag but also, yes, not.”

Honeywell: Just to link it back to specifically online harassment. I’m really grateful that you brought up the socioeconomic issues. In the work that I’ve done over the years with people facing online harassment, either in the internet jerks kind of environment or as a result of domestic violence or intimate partner violence, the dollar cost of this experience of being harassed online in terms of, say it’s a domestic violence situation and your ex has put spyware on your computers and you don’t have the technical capabilities to get that removed, you need to just go and buy a new computer or buy a new phone, and that’s a significant dollar cost as a result of online harassment. And you see it also with the various privacy services, or even stuff as simple as like when I work with people who are coming forward as #MeToo whistleblowers or stuff like that, I want to get them all set up with a hardware security key. Those are not free, right? If we don’t design safety into these systems, safety ends up becoming a tax on marginalized people. As we’re thinking about the big picture and how it connects into other social issues, I think it’s the opposite of cui bono, right? It’s like, who pays the cost?

Safer Products

Leong: And that segues into my next question. If you could do anything, how would you make safer products? If you could take any tech product and make it safer for marginalized people and which then makes it safer for everybody, what would you do? Is it too broad?

Honeywell: I mean, I have like 10 answers. Sorry, I’m giving it to someone else so I can come up with which one I want to go for here, if you know.

Ponnada: Maybe. So I guess a project that I did at Buzzfeed was, well, you know, that’s one of the platforms that I first became aware of how much online harassment there exists. And people share their stories and that’s awesome and use is spreading like crazy around the world. But if you read the comments, they’re so hateful. And I’ve had a Twitter thing that went viral about my immigration story and just the reactions from people, the kind of emotional impact that has on the individual that’s trying to spread awareness and create a change by using online platforms as an avenue to spread awareness, it’s hard. It’s draining. And when you asked about marginalized people that are being left behind, it’s preventing us from sharing our stories. It’s preventing us from building communities, right?

I worked on this Hackathon project for our customer support team that essentially just loads all the comments from top trending articles and runs a machine learning algorithm on them to identify if there’s any kind of hateful or homophobic, whatever kind of speech, and then allows this person to look at the context of what has been said so that they can see, “Oh, is this just somebody like joking around? And they might’ve said the f-bomb or is this actually something that we need to take action on?”

Just thinking about how can we build these tools, I think that’s what I would like to see more of on Twitter. And that’s something I noticed too – why does that exist? And I think LinkedIn has done a fairly decent job of that. Maybe I just don’t see it. I don’t really see those kinds of comments on there. And while we’re building this technology, also remembering that we’re not building technology to replace people, so how can we empower others to be part of this revolution and help create safe spaces online?

Honeywell: I have a very concrete and specific thing that I want to exist in the world. I’m throwing it out here for everyone. I’ve been starting to socialize this with different tech companies. But it would be really cool if there was an OAuth grant, a standard, that said, “These are the security properties of this account.” So I’m trying to help Danielle with securing her accounts and she grants me permission to see when did she last change her password? Does she have 2FA turned on? But not anything else. I don’t want to see your DMs, nor do you want me to see those things. But to be able to introspect the security properties of your accounts in order to help give you guidance without granting any other permissions. Let’s be real, there has not been a lot of innovation in the consumer security space in our lifetimes, basically. Remember antivirus? That was the last major innovation in consumer security, which is fairly tragic.

Leong: Twenty five years ago.

Honeywell: At least. Yes. Oh gosh. Let’s not talk about McAfee please. Speaking of online harassment.

Ponnada: So many notifications.

Fukui: Pop-ups.

Honeywell: Oh, my gosh. Anyway, so that’s my big idea that I’m putting out in the world. If anyone wants to talk to me about it afterwards, come find me.

Fukui: I don’t think I have as specific of an answer. I guess I love putting out dumpster fires, but I think I’m tired of doing that. I want to be more proactive in how we encourage and empower people to make or to become better online citizens and make the people around them better as well. So a lot of the work that Danielle and I have been working on is rehabilitation. I think we’ve seen some interesting studies that people, if they have nefarious behavior or content or something, if they feel like they are being rehabilitated in a fair way, they know what content was nefarious, why it’s not allowed here, the enforcement is fair, then they will correct their behavior and that has a platform effect. Like if we can rehabilitate on GitHub, we found that people will actually correct their behavior in other places like Twitter, Slack, Discord, which is extremely fascinating. So I would love to see tech figure out those ways of encouraging people to be better and making it easier to be better, like letting bystanders help create safer online spaces.

What Can Social Platforms Do to Build Healthy Communities?

Leong: Which then segues into my next question. Thank you. Convenient, just totally made up based on the context of this. When we’re talking about online harassment, it’s really easy to be like, “Well, we should just stop doing this. We should ban hammer all of this. This entire country is just garbage, let’s just get rid of it and not allow them on our platform.” But as Kat says, you then miss out on a lot of different conversations. You lose that ability to rehabilitate. So what are some things that social platforms can do to build healthy communities and have these kinds of nuanced discussions? Leigh is laughing at me. So I’m going to give it to you first.

Honeywell: I think one of my favorite blog posts of all time dates from approximately 2007 and is Anil Dash’s ”If your website is full of assholes, it’s your fault.” It’s just such a good title. But the fundamental thing is, it made my heart grow three sizes to hear that you guys are actively working on rehabilitating shit lords because I think there’s just so much important, even low-hanging fruit work to be done there of gently course-correcting when people start to engage in negative behaviors. I think some of the video game platforms have been trying different ways of banning or time outing people to encourage better behavior over time. I think there is a large, hopefully very fruitful, set of low-hanging fruit to be working on in that space. So I’m really, really excited to hear that you’re working on it.

I think the cautionary note is we need to be careful to not put that rehabilitative work on people who are already targets and are already marginalized. I think a lot of efforts around online harassment over the years have been very, what’s the term I’m looking for? Not predatory, but sort of exploitative. A little bit exploitative from the platforms where they’re like, “Let’s get this army of volunteers to do volunteery things” versus actually paying and training professionals to do moderation, to do content aggregation, stuff like that. And I think particularly when we’ve seen a lot of stuff in the content moderation space, there’s currently at least one lawsuit around the actual PTSD that content moderators have gotten from doing content moderation.

So I think when we’re figuring out what are the boundaries of what we get paid staff to do, versus what we get volunteers and community members to do, being thoughtful about what is the long term impact? Are we engaging in proper burnout prevention and secondary trauma prevention tactics? I see this a lot within the computer security incident response. Part of the field where there’s this machismo, like “We’re hackers and we’re going to be fine and we don’t ever talk about our feelings.” But people get freaking real PTSD from doing security incident response and nobody talks about that. I’m excited, but also caveats.

Fukui: Yes, I totally agree that we should make sure that we’re not putting the onus on marginalized folks who already have to deal with their identity and their spaces. It’s usually two jobs, right? Being yourself and doing the job that you went onto a technology platform to do. So we should try and absorb that burden as much as we can. And something that definitely comes to mind is understanding what those pain points are. Plug for my talk later: I will be talking about how to create user stories and understand those stressful cases that happen to your users. And those are ways that you can unite a team on a vision. We need to solve this and we need to do it quickly. Because when negative interactions happen, we found that swift action is the best way to …

Leong: Visible.

Fukui: … visible and swift action is what’s going to show communities that this behavior is not okay. It will not be tolerated. And it’s way better for people’s mental health when someone else is taking on that burden and they don’t have to defend themselves.

Ponnada: Yes. Adding to that, building allies. So I really liked that you brought up how rehabilitation can course-correct people and just continuing to do that emotional labor in real life, because who we are as a person then translates to what we put online. Yes. I don’t really have much to add, you guys covered it all.

Honeywell: I think the emotional labor piece just brought up something for me which is…

Leong: What is emotional labor?

Honeywell: Oh yes, sorry. Emotional labor is the caring — I’m not explaining this very well. Caring labor, it’s having to be thoughtful and perceive other people’s feelings about a particular situation, and it’ll maybe make more sense in the context that I’m about to give. Which is, I think it’s really important when we think about building anti-abuse into platforms, that we know the history of trust and safety and how it’s often perceived as a cost center and marginalized and underfunded. And again, that’s where you get the burnout and PTSD from content moderation and stuff because it’s seen as just being a cost, versus a necessary feature of a healthy platform.

The thing that emotional labor brought up for me is making that case; that is its own emotional labor, continually having to justify the existence of this function, and it wears a lot of people out. And when we can shift into this is how we make our platform actually good for everyone, or at least how we retain people.

Ponnada: Yes. What you said just now made me think of it’s not just the moderators that experience this kind of stress or burnout, right? Even the people that are actively building these tools, us, and talking about it, thinking about it all the time, we have that burnout, right? I have that burnout sometimes and I need to kind of disconnect because then it’s just like technology everywhere. And remember that there is a world outside of this black hole on Twitter where people in my life love me and care about me and I need to go there and reconnect so that I can continue pushing the boundaries.

Leong: Cool. So that is all of the official questions that I have now. I am now opening it up to the floor. Please remember if you have a question, it has to start with a question. Yes, I know it is revolutionary, but, a very long manifesto, I will cut you off if you do that. But if you have a question, please raise your hand. I will bring the mic to you.

Tools Used for Abuse

Participant 1: I’m just going to use this time to talk about myself. I’m kidding. So Leigh, you mentioned that a lot of the tools that are used to prevent abuse can also be used to incur abuse. And I think we’ve seen that a lot and I wonder, how do you adjust for that? Because it seems like no matter what you do, especially at scale, those tools will be used for abuse. What are some tips, tricks, techniques, how do you account for that?

Honeywell: I think there’s a couple of ways to think about that. One of them is to, as you’re designing the tools, be red teaming yourself; be constantly reevaluating what are the ways – I’m approaching this as a good person who wants to do right in the world, but how might I think otherwise? And if that’s not your mode of thinking, then engaging with someone for whom that is their mode of thinking, whether it’s your company’s red team or an external contractor, a specialist. I think the other piece of it is wherever possible building transparency into the tools, so that people can know what were the criteria? And obviously in any sort of anti-abuse system, there’s a tradeoff between transparency and gaming of the system. So there’s no silver bullet on this stuff, but balancing that transparency and the necessary secrecy of your anti-abuse rule set is one of the important things to strike the right tone on.

Participant 2: One of the things that you brought up really briefly was awareness and the lack thereof and especially how your app or technology could be used to marginalize others. So my question is, what are techniques or tools you found that have been effective to see things through a new lens?

Leong: Deep soul searching. Damn. That was deep. Kat?

Fukui: Actually a really good workshop that I’ve done with the community safety team at GitHub is just drawing together and understanding user stories. So one that we added recently was a user who’s trying to escape an abusive relationship. And what are the problems that end up inevitably happening, what part of the technology is holding them back from success? What does success look like for them? What are the stressful feelings they’re experiencing? And those are the ways that we can highlight cases that we may have not thought of. If you think of something simple as like leaving a comment, how could that be used to abuse someone that we may have not thought about? So user stories for us and we’ve just been collecting them pretty much …

A user story, at least in the context of our team and how we do that, is you define a user. So in this case, someone who is escaping an abusive relationship. And literally draw the course of their journey on your platform, either a specific workflow or something more general. What are their problems? That can be literally just drawing it out. I made the engineers draw, I gave them markers. It was super fun. So what are the problems they’re facing? How are they feeling in this moment, and what does success look like for them? And I think what’s really helpful is that a lot of people on our team have experienced this, either in real life or on the internet. So it’s easier for us to empathize. But I think if you do not have that context, it’s really important to talk to people who have, and gather that research and those resources, and compensate them because that is emotional labor. So pay people for that kind of work.

Leong: I’d also like to point out, tiny plug for our team, that we are a team of largely marginalized people, and it is because of our lived experiences that means we are able to catch abuse factors a lot sooner. Because if I’m using a service and it automatically broadcasts my location, for example, I’m not going to use it. I’ve had a stalker before, I don’t want my location out there. And so that’s an easy-to-close abuse factor because I have these lived experiences. If you are trying to close a lot of these abuse factors, ask the people around you. Ask marginalized people in their communities and pay them and say, “What do you see in this app?” “Do you feel safe using this?” is an excellent question. Do you feel safe using this? I guarantee you somebody with a different lived experience from you is going to be like, “No, no. This doesn’t strip the location data from this photo, I don’t want to use this, it’s creepy.” So asking people in their communities, “Would you use this, do you feel safe using this?” is extremely important when you’re doing this.

The Balance

Participant 3: You spoke a little bit about the importance of not placing the onus of rehabilitation on marginalized people. The particular context I took to be content moderation. It seems to me that with a lot of marginalized communities, encouraging them or empowering them to be their own advocates is an important component somewhere along the line. I guess the question that I have is, how do you walk the balance between encouraging them being their own advocate, without inadvertently placing the onus of rehabilitation on those communities?

Honeywell: That’s a super good question. I think it is fundamentally like a needle that you have to thread because you want to do nothing about us without us, but you also want to not place the entire burden on the set of people who are already marginalized. One of the ways you do that is by compensating people for their work. So supporting creators and thinkers and writers and advocates and activists who are members of marginalized groups, whether that’s because they have a Patreon or they work with a nonprofit, or otherwise signal boosting their work. I think one of the sets of experiences that this comes from is the constant “educate me” requests that people often have an entitled attitude around. “Well if you want me to support your cause, you should put all this energy into like educating me.” So figuring out how to not do that. In my own personal experience, I get a lot of, “How do I fix diversity at my company?” requests. “Can I pick your brain over a coffee about how to hire more women?”

Leong: $800 an hour.

Honeywell: I mean, no. I started reconfiguring how I interacted with these requests while I was still on an H-1B. So I unfortunately couldn’t charge people $800 an hour because I was on a visa. But what I ended up doing was I would answer patiently and comprehensively a number of requests. And I basically collected all of my answers into a page on my website that was like, “I get asked this question all the time, here’s a bunch of resources, please go read it.” And it’s actually been really interesting. When I do get one of these questions now and I’m like now I have a green card and I can do that consulting, but I just don’t have time.

What I get back when I get these requests and I say, “Hey, thanks for asking. I get asked this quite a bit so I have assembled some resources. And there’s also some consultants linked to from there that you can engage to work in this with more depth.” I’ve been pleasantly surprised. I haven’t gotten any like, “Well, can you just explain it to me in little words” lately. I think I hit exactly the right note on the page. It’s hypatia.ca/diversity if you want to see how it’s set up.

Leong: I’ll tweet that out later.

Honeywell: Yes. How do you effect the change you want to be in the world, but also do it in a way that’s scalable, right? And having those one-on-one conversations over coffee about, “Yes, you should have a replicable hiring process that is as blinded as possible and not only recruit from Stanford.” I don’t need to have that conversation ever again basically, because I’ve written it down, it’s written. I’ve had positive experiences so far with the gentle pushback towards resources. So when you can build that kind of thing into the system, I think it can help with threading that needle.

Leong: Signal boosting writers of color is also pretty important too, because then it’s not just, “Well, here’s my take on this,” but somebody who has the lived experience, who can speak about this topic better than me. Another way that I’ve found is giving somebody a platform to help explain this very nuanced topic that you’re discussing. So that’s another thing I found that helps. Someone over here.

Participant 4: What would be the short rebuttal to the Napster defense of online platforms? The defense that, “Oh no, we’re not a content provider. We’re providing a way for people who want to share content to connect to those who want to view it,” in the same way that Uber doesn’t provide a taxi company or Airbnb doesn’t provide a hotel chain. How do you counteract that argument?

Honeywell: There are a couple of different pieces of this. CDA 230 is the relevant legislation in the US that governs how content plat- and I’m not a lawyer, but this is my best understanding- it governs how … I’m not a lawyer, I just worked with them at the ACLU. It governs how platforms can or can’t preemptively filter on content, and it’s a lot of why you see, rather than preemptive filtering, reporting and takedown practices on various websites. Fundamentally, I think the Napster example is really interesting because it surfaces the sort of subtlety to all of these hate speech and free speech and first amendment and platform moderation arguments, which is that there is no one singular definition of hate speech or free speech. There are various things enshrined in American law, there are various things in international law that vary from country to country. I’m Canadian and we have hate speech laws there.

We do very frequently set all kinds of content-based boundaries in the platforms that we operate. You can’t post child porn, you can’t post copyrighted content. What other boundaries do we set? Those are choices that platforms make and those choices have consequences. But I think a lot of it is just hand-waving: “we’re a neutral platform.” That’s never been true because you can’t be neutral on a moving train and this train is moving rather quickly.

Subtle Abuses

Participant 5: There are lots of examples we can see for ourselves on Twitter and Facebook. But I wondered if you had any examples of more subtle abuses of features at home about comments in social media that would be hard for us to think of as nice people that we’d never think anyone would do that with our systems.

Fukui: Man, I just feel like that’s a lot of GitHub. People go on GitHub to collaborate on software together. It’s just code. We’re just coding together. Everyone’s just coding. Yes. But turns out there’s still conversation and human interaction that goes around code; it’s still written by humans, meaning you’ve got to be social. And more and more we’re seeing that people view GitHub as a social network now which is really strange. Yes. So that means that our tools really need to scale for that. And unfortunately when GitHub was built, that wasn’t the intention and you could clearly see that in the actions that we took. So I think even if you think, “Oh, this piece of technology is just for professionals, there’s no way it could be abused.” Anything with user to user interaction will be abused, and you need to accept that and make sure that you’re hiring the right people to tackle that from the ground up and build that framework. And I’d love to, in general, see companies be more open about that conversation, especially ones that aren’t purely social. So hoping that we can continue that conversation.

Honeywell: Two subtle examples that come to mind from my experience and just that I’ve seen publicly: I’m not going to name him because he’s like Voldemort, but a particular internet shitlord, he harassed a woman writer online by sending her $14.88 on PayPal. Those two numbers are significant in Neo-Nazi numerology and she’s Jewish. So that’s literally just using the number of the transaction as a harassment vector. Another example along those lines of subtle misuse, although this is a little more in spam territory, but when I worked at Slack- the reason you can’t put invite text in Slack anymore is that people were using it to spam, and even when we would block URLs, people would write the sketchy phishing site, D-O-T com spelled out. So there was always that arms race, and any service that allows people to stuff text in a field and send it to each other will at some point …

Leong: Or images.

Honeywell: Or images … Oh goodness. Phone numbers will end up getting abused. Yes. You have one as well.

Ponnada: Yes. I’ve heard this from a lot of women that they’ve had people hit on them on LinkedIn. And it’s like, “Dude, this is my LinkedIn, I’m here to get a job not a husband, come on, or a wife or whatever.” That is just so weird. Even on monster.com, where you post your resume on there. So this happened to me when I shared my story of immigration online and that went viral, and then I checked my email one day and I got an email from this guy that found my email address, I’m guessing from monster, somewhere, and sent me a wedding proposal with photos of himself and his kid and was like, “Hey, I saw your story. You seem like a really great girl. I want you to stay in the United States. I’m 40 something, I got divorced and I’m a really nice guy. Let me know if you’re interested, I can get you a green card.” And I’m like, “Wow.” Yes, right? So just the many ways in which women, in particular, have been stalked or harassed in real life is also translating into these professional networks as well.

Leong: All right, we have time for one more question.

Machine Learning-Based Solutions to Content Moderation

Participant 6: I remember, one of you talked earlier about machine learning-based solutions to scaling, content moderation and stuff like that. As companies rely more and more on machine learning to scale these types of solutions, the models themselves may internalize systemic biases and then produce biased outcomes. So what are your thoughts on what companies can do to ensure that as we kind of offload more of our moderation to these algorithms, to maintain fair outcomes?

Honeywell: I think that comes back to that same question around striking the balance between transparency and keeping the secret sauce secret, where there has to be a certain amount of transparency so that people can feel like the system is fair. And there also has to be an ability to request a review, basically. And that review-requesting workflow is part of what’s really difficult to scale: we can do a pretty broad pass filter on internet garbage, as Sarah Jeong would say, with machine learning. But where it’s tricky is that sort of crossover error rate of the false positives and the false negatives.

Ponnada: So one thing that I feel very passionately about is hiring people from diverse backgrounds. Thinking about, are we hiring people with a psychology degree? My major in college was gender studies and I feel like coming into this industry, I’ve brought up points that people haven’t really thought about. So just thinking about those things as well.




Presentation: The Whys and Hows of Database Streaming

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

Transcript

I’m Joy [Gao]. I’m a senior software engineer at WePay. If you haven’t heard about WePay, we provide payment solutions for platform businesses through our API. For this talk, I’m going to be talking about database streaming. We live in a world where we expect everything to be streamed: our music is streamed, our TV shows are streamed. I want to argue that the data in your data warehouse should not be treated as a second-class citizen. We should allow everything to be streamed in real time so we can access the data as soon as it arrives in the database.

This talk is about our journey at WePay going from an ETL data pipeline into a streaming based real-time pipeline. The talk is going to be broken down into three sections. We’re first going to go over what our current ETL or our previous ETL process looked like and what are some of the pain points that we’re going through. We’re also going to introduce change data capture, which is the mechanism that we use to stream data from our database. Next, we’re going to take a look at a real-world example which is how we’re actually streaming data from MySQL into our data warehouse. And finally, we are going to go a little experimental and take a look at some of the ongoing work we’re doing with streaming Cassandra into BigQuery which is our data warehouse as I mentioned.

The Beauty of Change Data Capture

Let’s get started. At WePay, we use BigQuery. For those of you who are in the AWS world, this is the equivalent of Redshift. It’s basically Google’s cloud data warehouse. It uses ANSI-compliant SQL as its core language, which makes it really easy for developers and engineers to pick up. It supports nested and repeated data structures for things like lists or structs, and even geospatial data types, which is actually something very useful for CDC as you will see later on. And it has a virtual view feature, where you can create views on top of the base tables. And because these views are not materialized, when you’re querying the view, you’re essentially querying the underlying table. This will allow you to access real-time data even through views. And that’s another feature that we’re leveraging very heavily at WePay for our streaming pipeline, which we’ll also go into later on.
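
As a rough illustration of that virtual view feature (a generic sketch, not WePay's pipeline code; the project, dataset, and table names are invented), creating a view with the google-cloud-bigquery Python client looks roughly like this:

```python
from google.cloud import bigquery

client = bigquery.Client()

# A BigQuery view is a saved query. Nothing is materialized, so querying the
# view always reads the current contents of the underlying table, which is why
# views can still serve real-time data.
view = bigquery.Table("my-project.analytics.payments_current")  # invented names
view.view_query = """
    SELECT id, state, modified_time
    FROM `my-project.raw.payments`
"""
client.create_table(view)
```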

So at WePay, we use a microservice architecture. Most of our microservices are stateful and the state is typically stored in a MySQL database. And we use Airflow as a tool to orchestrate our data pipelines. For anyone who hasn’t heard about Airflow, you can think of it as cron on steroids, designed for data pipelines and complex workflows. And the way we’re using Airflow is basically by periodically polling the MySQL database for changes. The way we detect these changes is by looking at the modified-time column in each table, and if the modified time has changed in the most recent interval, we upload that information into BigQuery. It’s pretty standard.
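
To make the polling pattern concrete, here is a minimal sketch of that kind of Airflow job, not WePay's actual code: the table, connection ID, and the BigQuery loading step are invented, and the import paths assume an Airflow 1.x-era installation.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.hooks.mysql_hook import MySqlHook
from airflow.operators.python_operator import PythonOperator

def upload_changed_rows(**context):
    # Airflow hands each run the start of its schedule interval.
    start = context["execution_date"]
    end = start + timedelta(minutes=15)
    hook = MySqlHook(mysql_conn_id="payments_db")  # invented connection ID
    rows = hook.get_records(
        "SELECT * FROM payments WHERE modified_time >= %s AND modified_time < %s",
        parameters=(start, end),
    )
    # Load `rows` into the corresponding BigQuery table (omitted here).

dag = DAG(
    dag_id="payments_mysql_to_bigquery",
    start_date=datetime(2019, 1, 1),
    schedule_interval=timedelta(minutes=15),  # one such DAG per table, per the talk
)

PythonOperator(
    task_id="upload_changed_rows",
    python_callable=upload_changed_rows,
    provide_context=True,
    dag=dag,
)
```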

With this approach though, we started to hit a lot of limitations and operational overhead. The first problem, which ties back to the introduction of this talk, is that it has very high latency. The data won’t actually arrive in BigQuery until much later. For some of our jobs we tried to push the limit to once every 15 minutes, so the job runs in 15-minute intervals. But then we get this inconsistency where an analyst may be trying to do a join in BigQuery, and one of the tables is being uploaded on an hourly or daily basis while another table is being uploaded every 15 minutes, and then the data becomes inconsistent, so it’s like, why is it not in this other table but it’s here?

The second problem is that, because of the way we use Airflow, we’re creating one job for every single table. In Airflow, a job is called a DAG, or a directed acyclic graph. So we have basically hundreds of DAGs. Each of them is responsible for a table. And this is a whole lot of configuration as well as operational overhead when it comes to monitoring. So it’s not quite ideal. Another problem is hard deletes. We can’t allow hard deletes in our database, because when you’re polling the database you’re running these select queries, and they’re not going to tell you which data has been deleted. They’re only going to show you what’s in the database.

We basically have to tell our microservice owners, “Hey, just don’t delete anything in these tables,” which is pretty error-prone. And that leads to the next point: it is very error-prone. We are relying on our microservice owners to do the right thing. Not only must they not delete rows from these tables, they must also guarantee that they’re always updating the modified time column every time, because otherwise we’ll still get into data inconsistency issues, since we won’t be able to detect those changes.

Finally, the schema management is actually manual, because if a DBA decides to go into the database and, say, add a column to a table, Airflow doesn’t know about it. So now we have to go into Airflow, and we have to manage every single one of those tables, or whichever table needs to be modified, and update the schema so that it propagates to BigQuery and so on. On top of all of these problems, our data ecosystem is constantly evolving. We’re adding new tools that are optimized for different jobs. We may introduce Redis, which is optimized for key-value caching. We may introduce Elasticsearch to do full-text search. We may want to add a graph database for fraud detection. Or we may want to add some live dashboards and alerting and monitoring systems that help us understand how our business is doing right now.

And Airflow, being a batch-oriented tool, is not meant for streaming. So we needed a better tool for this job. And as many of you probably already guessed, or already read in the summary of this talk, we use Kafka. With Kafka, every single downstream derived application can now just listen to the Kafka log and apply the changes at its own pace, which is really nice, and because Kafka is designed for streaming, this solves the streaming problem.

The next question is, how are we getting the data from these databases into Kafka? There are a couple of options. The first one: we can just double-write from the application, right? Every time we’re updating the database, we make sure we’re also sending a message to Kafka. Then the question is, should we do this synchronously or asynchronously? If we do it asynchronously, we’ll again get into data inconsistency issues, because we don’t know whether the data has been successfully written to Kafka when we’re updating the database. If we do this synchronously, it means that every time we successfully send to Kafka, we commit the change, and every time we fail to send to Kafka, we abort the change. But we’re talking about distributed systems here, and failures are frequent. The problem is timeouts.

With a timeout, we don’t quite know what happened. It could be a network glitch that caused the timeout, and the data could have been successfully written to Kafka or it could have not. So we wouldn’t know what to do. To solve that properly requires a distributed transaction, which means something like two-phase commit. And two-phase commit is not trivial to implement and get right. It requires a set of interfaces and tools to actually implement it, and vanilla Kafka doesn’t support it. Not to mention that two-phase commit requires multiple round trips to reach consensus in order to have each write committed, and that’s going to take a lot of time; a lot of production databases cannot tolerate that kind of latency.
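
As a rough illustration of why the synchronous double-write is fragile, here is a sketch using kafka-python. The db handle and topic name are hypothetical; the interesting part is the timeout branch, where neither committing nor aborting is guaranteed to be consistent.

    from kafka import KafkaProducer
    from kafka.errors import KafkaTimeoutError

    producer = KafkaProducer(bootstrap_servers="kafka:9092")

    def update_with_double_write(db, sql: str):
        db.begin()                    # hypothetical transaction handle
        db.execute(sql)               # apply the change locally
        try:
            # Block until Kafka acknowledges the message.
            producer.send("payments-changes", sql.encode("utf-8")).get(timeout=5)
            db.commit()               # Kafka has the event, so commit
        except KafkaTimeoutError:
            # The message may or may not have reached Kafka; without a
            # distributed transaction (e.g. two-phase commit) there is no
            # safe choice here.
            db.rollback()
            raise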

There’s the second option; this is the cool kid on the block. It’s event sourcing, which means we’re using Kafka as the source of truth: we only write the data into Kafka, and we’re going to treat the database just like any other derived system. The database is just going to be pulling changes from this Kafka log and applying them to the database one by one. This looks much cleaner and it would solve a lot of headaches. However, there is one problem with this for some use cases, and it’s read-your-writes consistency.

Read-your-writes consistency is the idea that when you’re updating some data and you’re trying to read what you’ve just updated, you expect to get what you just wrote. But with this setup, we may actually be reading stale data, because say we have a traffic spike and a bunch of data is being sent into Kafka while the database is slow at catching up; at that point, if we’re trying to do a read, we’re going to be reading stale data. That’s really bad when you’re building something like an account balance application, where you need to guarantee that your users are not withdrawing money into a negative balance, and this setup is problematic for that.

Then there is the third option, which is change data capture using the write-ahead log. Change data capture is a design pattern in databases that basically captures every single database change into a stream of change events. Anyone that is interested in these change events can listen to the stream and react accordingly. And we mentioned we’re going to do this with the write-ahead log. The write-ahead log is implemented in pretty much every single database out there; it’s kind of an implementation detail of each database rather than an API. And the idea of the write-ahead log is that before we write the data into the storage file, we first write it into the write-ahead log, just like the name sounds.

There are some benefits to this approach. The first is crash recovery. If the database crashes halfway while writing the data into the storage file, the database upon restart can look at the commit log, replay the changes, and restore the corrupted data, so that’s great. The second benefit is improving write performance in certain scenarios. This is the case where you have a single transaction but you’re updating a lot of tables, and these tables probably reside in different storage files. So instead of updating each of those storage files individually, it first sequentially writes all of those changes into this log; that’s only a single fsync versus an fsync on each of those individual storage files, so it’s much faster.

The third benefit is streaming replication. A lot of databases already do this, like MySQL, where all the replicas are just tailing the commit log, applying the changes, and updating themselves asynchronously. One other detail that’s worth mentioning about the write-ahead log, specifically for MySQL, is that it gives you two options: you can either do statement-based logging or row-based logging. Statement-based logging means you’re logging the queries, and row-based logging means you’re actually logging the data after the change has been applied. In terms of change data capture, row-based logging is very useful, since now you have the data for the entire row, not just the column you’ve updated.
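
Because CDC needs the full row image, the binlog has to be in row-based mode. Here is a small sketch that checks this setting; the connection details are placeholders.

    import pymysql

    conn = pymysql.connect(host="mysql-replica", user="cdc", password="...")
    with conn.cursor() as cur:
        cur.execute("SHOW VARIABLES LIKE 'binlog_format'")
        _, value = cur.fetchone()
        # ROW logs the data after the change; STATEMENT only logs the query.
        assert value == "ROW", f"binlog_format is {value}, expected ROW for CDC"
    conn.close()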

So by using change data capture with the write-ahead log, we get the best of both worlds. We don’t have to worry about implementing distributed transactions, but we get all of the transactional guarantees. And because we’re asynchronously tailing the MySQL binlog, or some other kind of write-ahead log, we don’t have to worry about impacting performance when we’re writing the data into the database, because it’s asynchronous.

Real-World Example: Streaming MySQL

Now, let’s take a look at how exactly we’re using CDC at WePay to stream data from MySQL into BigQuery. Under the hood, we leverage the Kafka Connect framework for this job. The source connector is responsible for getting data from external sources and publishing it into Kafka. The sink connector is responsible for reading from Kafka and storing the data in external sinks. Applied at WePay: our data source is MySQL, our data sink is BigQuery, our source connector is Debezium, which is an open-source project, and our sink connector is KCBQ, which stands for Kafka Connect BigQuery; it’s something we named ourselves because we wrote it.

We’re going to break this up into two sections and talk about each part separately. So first, MySQL into Kafka, and I definitely have to talk about Debezium before going into any details. Debezium is an open-source project. It’s basically meant for CDC and it’s built on top of the Kafka Connect framework. The way it works is, just like CDC, by reading the write-ahead log, converting it into individual change events, and recording them on a row-level basis. Debezium guarantees at-least-once semantics, which is the same guarantee as Kafka. This means we don’t have to worry about ever losing data, but we may potentially get duplicates. And finally, Debezium currently supports MySQL, MongoDB, PostgreSQL, Oracle, and SQL Server.
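
For illustration, this is roughly what registering a Debezium MySQL connector against a Kafka Connect worker looks like. Host names, credentials, and the connector name are placeholders, and the exact configuration keys can vary between Debezium versions.

    import json

    import requests

    connector = {
        "name": "payments-db-connector",
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": "mysql-replica",
            "database.port": "3306",
            "database.user": "debezium",
            "database.password": "...",
            "database.server.id": "5400",
            "database.server.name": "payments",
            "database.history.kafka.bootstrap.servers": "kafka:9092",
            "database.history.kafka.topic": "schema-changes.payments",
        },
    }

    resp = requests.post("http://kafka-connect:8083/connectors",
                         headers={"Content-Type": "application/json"},
                         data=json.dumps(connector))
    resp.raise_for_status()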

So what does Debezium look like in action? Before we start the Debezium connector, we probably already have some database running in production. It’s probably already replicating to some replica. So when we first start the connector, it’s going to ask the database for the binlog filename and the position of the most recent write, and it’s going to record that information.

Next, it’s going to run a select star on every single table in the database, and it’s going to convert the result sets into individual create events and publish those events into Kafka. Because some tables are huge, this could potentially take a couple of hours. During this time the database may be taking additional writes that replicate to the replica, and Debezium is just going to temporarily ignore them. Once the snapshotting is complete, Debezium is going to start to catch up, and it knows where to catch up from because it recorded the filename and the position of the most recent write. And once it’s finally caught up, it will start streaming the data in real time just like any other replica, except instead of storing the information, it’s sending that information to Kafka.

Let’s take a look at what a Debezium event looks like. The before section is what the data looked like before the change; the after section is what the data looks like after the change. The source section provides a bunch of metadata about the data source, like the server ID, the filename and position, as well as the database and the table it’s coming from. And if you’re familiar with MySQL, since 5.6 it has had GTIDs, so Debezium is able to support GTIDs as well, instead of using the filename and position. The op section represents the type of operation: U is for update, C is for create, and D is for delete. And the timestamp is when this event was created in Debezium. If it’s a create event, before will be null; if it’s a delete event, after will be null.
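
A hand-written example of the event shape just described (field names and values are illustrative, not a captured event):

    debezium_event = {
        "before": {"id": 42, "state": "pending", "amount": 1000},
        "after":  {"id": 42, "state": "captured", "amount": 1000},
        "source": {
            "server_id": 184054,
            "file": "mysql-bin.000003",   # binlog filename
            "pos": 805,                   # binlog position
            "gtid": None,                 # populated when GTIDs are enabled
            "db": "payments_db",
            "table": "payments",
        },
        "op": "u",                        # c = create, u = update, d = delete
        "ts_ms": 1551200000000,
    }
    # A create event has "before": None; a delete event has "after": None.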

The original pipeline I showed you at the very start is pretty different from what we’re actually running in production; the real one is a little bit more complicated. Let’s take a look at why. We’re not going to read directly from the master of the MySQL instance, because snapshots can potentially take hours and we don’t want to impact its performance. So we set up a MySQL replica that is dedicated to Debezium, and we’re just going to be tailing from that replica. But having just one replica is not enough, because what if it goes down? So we set up a secondary replica, which takes over in case the primary one is down. In order to handle failover, we add a proxy in front, so that if the primary replica is down, we read from the secondary instead.

But, of course, we don’t just have a single microservice; we have many microservices, and each one of them will be replicating to the same primary and secondary MySQL replicas. The reason we’re using just a single cluster of primary and secondary replicas for Debezium is operational cost. We know that as we add more microservices this could potentially become problematic, and we may add additional clusters as well, but for now this is sufficient for us because we’re a startup. Even though we only have a single Debezium-dedicated MySQL cluster, we do have an individual Debezium connector that corresponds to every single one of those microservices. This is important because it allows us to configure each microservice’s Debezium connector based on what works for that particular service, and it also allows us to bring a specific connector up or down when we’re doing any kind of troubleshooting, without affecting the rest of the streaming pipeline. We run these connectors in distributed mode for fault tolerance. So this is what it actually looks like in production, just a little bit more complicated.

Now that we’ve got our data into Kafka, the next question is, how are we getting the data from Kafka into BigQuery? The reason we built KCBQ is that at the time, there was no existing Kafka to BigQuery connector. We have open-sourced it, so if you’re interested, it’s there on the WePay GitHub. There are a couple of nice features about this connector. First of all, it has configurable retry logic: BigQuery will sometimes give you retryable transient errors, and the connector is intelligent enough to know about them and retry so that it doesn’t drop any messages. But because this kind of error can last for a while, we’ve implemented the retry logic with exponential backoff so that it won’t hit the API too frequently in case BigQuery is down for a long time.
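
KCBQ itself is written in Java, but the retry idea is simple enough to sketch generically. The error class below is a stand-in for whatever retryable error the BigQuery client surfaces.

    import random
    import time

    class TransientBigQueryError(Exception):
        """Stand-in for a retryable error returned by BigQuery."""

    def insert_with_backoff(insert_fn, rows, max_retries=8, base_delay=1.0):
        """Retry on transient errors, roughly doubling the wait each time."""
        for attempt in range(max_retries):
            try:
                return insert_fn(rows)
            except TransientBigQueryError:
                # Exponential backoff with jitter so retries spread out.
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
        raise RuntimeError("giving up after repeated transient errors")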

Secondly, KCBQ is capable of lazily updating the schema for our tables. What lazily means here is that the connector caches the schema for every single table. When a new message arrives, it’s going to use the version in that cache and try to send the message to BigQuery with it. In the case where it gets a schema error back, it knows that the cached schema is outdated. It will then go fetch the latest schema from the schema registry and retry with that latest schema. That helps us deal with automatic schema evolution.
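
The lazy schema refresh can be sketched like this; the helper functions and the error class are hypothetical placeholders for the schema registry lookup and the BigQuery write.

    class SchemaMismatchError(Exception):
        """Stand-in for the schema error BigQuery returns on a stale schema."""

    def fetch_latest_schema(table):
        """Placeholder: would ask the schema registry for the latest version."""
        raise NotImplementedError

    def send_to_bigquery(table, record, schema):
        """Placeholder: would call the BigQuery streaming insert API."""
        raise NotImplementedError

    schema_cache = {}

    def write_record(table, record):
        # Try the cached schema first; only refresh when BigQuery rejects it.
        schema = schema_cache.get(table)
        if schema is None:
            schema = schema_cache[table] = fetch_latest_schema(table)
        try:
            send_to_bigquery(table, record, schema)
        except SchemaMismatchError:
            schema_cache[table] = fetch_latest_schema(table)  # cache was stale
            send_to_bigquery(table, record, schema_cache[table])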

And finally, KCBQ supports both batch and streaming-based uploading; it basically uses BigQuery’s batch API and BigQuery’s streaming insert API. The benefit of the batch API is that when you’re doing snapshotting, it’s the faster option. When the snapshotting is complete, you can then flip the switch to use the streaming API, which allows you to access data in real time.

There is one additional piece of information that we had to add to the KCBQ event, and that’s the Kafka offset. I’ll explain why in a second. The Kafka offset, if you’re not familiar with it, is essentially the position of the record in Kafka. So here is what an example table looks like when we’re querying for all the fields in the table, and I’ve also included the Kafka offset there as well. Notice that this is actually not very useful. We’re getting every single record, every single change event. What we really want is just the final change. So we leverage Kafka offsets to do deduplication and compaction and determine what we actually need to show to the user. The reason we can trust the Kafka offset is that the data is partitioned by primary key, and in Kafka, anything in a partition is guaranteed to be ordered. So we know that any data with a larger offset arrived at a later time. With the Kafka offset, we can now dedup our data by primary key, and we have a version that mirrors what’s in MySQL.
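
In production this deduplication happens inside a BigQuery view, but the logic itself is small; here it is sketched in Python with made-up rows.

    def latest_by_primary_key(rows):
        """Keep only the record with the highest Kafka offset per primary key,
        which is the same deduplication the BigQuery view performs."""
        latest = {}
        for row in rows:
            key = row["id"]
            if key not in latest or row["kafka_offset"] > latest[key]["kafka_offset"]:
                latest[key] = row
        return list(latest.values())

    rows = [
        {"id": 1, "state": "pending",  "kafka_offset": 3},
        {"id": 1, "state": "captured", "kafka_offset": 7},
        {"id": 2, "state": "pending",  "kafka_offset": 4},
    ]
    print(latest_by_primary_key(rows))   # id=1 keeps the offset-7 version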

An additional benefit of using BigQuery views is that we can mask any column that we don’t want people to see. For example, an email address is PII, sensitive data, and we don’t want most of the users to see that information. We create another view on top of the view I showed you guys earlier, and this view does not have the email information. And because BigQuery has access control configuration, we can give different users different permissions on different tables.

There’s one final piece in this pipeline that I briefly mentioned but didn’t really get into, and that’s the schema registry. At WePay, we use the Confluent Schema Registry, and this is basically a registry that stores a version history of all of the data schemas. What’s really cool about the Confluent Schema Registry is that it dog-foods on Kafka. What that means is that it uses Kafka as its underlying storage for all of the schemas, so you don’t have to spin up new storage or engineer a database of some sort to handle schemas. And the schema registry supports Apache Avro as its serialization format, which guarantees both forward and backward compatibility, which is always a good thing.
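
For reference, fetching the latest registered schema for a subject is a single REST call against the Confluent Schema Registry; the subject name below is a made-up example.

    import requests

    REGISTRY = "http://schema-registry:8081"
    subject = "payments.payments_db.payments-value"   # hypothetical subject

    resp = requests.get(f"{REGISTRY}/subjects/{subject}/versions/latest")
    resp.raise_for_status()
    latest = resp.json()
    # The response carries the schema id, the version number, and the Avro schema.
    print(latest["id"], latest["version"])
    print(latest["schema"])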

And finally, we don’t want our schema registry to become a single point of failure, because that defeats the whole purpose of a resilient pipeline. The schema registry is designed to be distributed with a single master. It leverages ZooKeeper to handle failover, so essentially it is resilient to failure.

To put it all together, here is what schema evolution looks like. Before that, one thing worth mentioning is that the MySQL binlog doesn’t just store data changes; it also stores every single schema change. This is really useful because now Debezium, upon receiving a schema change, is going to cache it and update this information in the schema registry. For any following data change event it receives, it can use this new cached version of the schema instead.

So by the time the data gets into KCBQ, KCBQ doesn’t know about the schema change yet so it’s just going to send the data with its older cached version. But BigQuery is going to give us an error saying the schema is wrong and KCBQ can now fetch the latest schema from the schema registry and then send the data to BigQuery using this new schema. So that completes this automatic schema evolution, which is really useful.

Future Challenge: Streaming Cassandra

As I mentioned, this final part is going to be a little bit experimental, as it’s something we’re currently working on, but it’s interesting enough and relevant enough to CDC that I’m really excited to share it with you guys. At WePay, as our company grew, we began to see a need for a NoSQL database that’s optimized for high write throughput, for horizontal scalability, and for high availability. Cassandra became the obvious top contender. By introducing Cassandra to our stack, though, we also needed to figure out how we want to do CDC for Cassandra.

At first we thought, we figured this out for MySQL, how hard could it be? It turns out that it’s a little bit more complicated. And because this talk is not a Cassandra-focused talk, I’m going to be skipping over a lot of details on Cassandra. I’m only going to talk about the Cassandra aspects that are directly related to CDC. The thing that makes Cassandra really difficult for change data capture is its replication model. Unlike MySQL, which uses a primary-replica replication model, Cassandra uses a peer-to-peer replication model. This means that all the nodes are equal. It also means that every single node is able to handle both reads and writes. And it also means that if we look at the data in a single node, it only contains a subset of the entire cluster’s data, which makes sense, because that’s how you get horizontal scalability. You don’t want one node to contain all the data.

So the next question is, how exactly does Cassandra determine which node each piece of data goes to? The way Cassandra handles this is that it divides the data across a cluster of nodes, typically visualized as a ring. Each of the nodes in this ring is responsible for a subset of all the data, called its token range. So in this naive example, we have possible token values from 0 to 19, and each node is responsible for a quarter of them.

When a request comes in, it’s going to have a primary key or partition key value. The reason there is always going to be a partition key is that Cassandra schemas require you to specify a partition key for every single table. In this case, the partition key is foo, and one of the nodes is going to be picked as the coordinator node. The job of the coordinator node is to hash this partition key and convert it into a token value, and depending on what this token value is, the coordinator is going to forward the request to the node that is responsible for writing this change.

But what if node C dies? Then this is no longer fault tolerant. The way Cassandra solves this is by increasing the replication factor. This example here has a replication factor of one; in reality, you typically have a replication factor of three. With a replication factor of three, the way Cassandra distributes a token range is by walking along the token ring and replicating the range to its neighbors until the replication factor is reached. There are more sophisticated ways of distribution, but this is just a naive example. With this approach, when the coordinator is forwarding the data, three of these four nodes are actually going to store it. So now we don’t have to worry about not being able to write when one of the nodes is down.
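
Here is a toy model of that naive example, with token values 0 to 19, four nodes, and replicas chosen by walking the ring. Real Cassandra uses Murmur3 hashing and virtual nodes, so this is only meant to illustrate the idea.

    NODES = ["A", "B", "C", "D"]
    RING_SIZE = 20
    RANGE = RING_SIZE // len(NODES)       # each node owns 5 token values

    def token_for(partition_key: str) -> int:
        # Python's built-in hash stands in for Cassandra's partitioner.
        return hash(partition_key) % RING_SIZE

    def replicas_for(partition_key: str, replication_factor: int = 3):
        token = token_for(partition_key)
        primary = token // RANGE          # node that owns this token range
        # Walk clockwise around the ring until we have enough replicas.
        return [NODES[(primary + i) % len(NODES)]
                for i in range(replication_factor)]

    print(replicas_for("foo"))            # e.g. ['C', 'D', 'A']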

How does this relate to CDC? Well, there is actually also a write-ahead log in every single one of these nodes in the cluster, and it’s called the commit log in Cassandra. This commit log only records the writes that are specific to that node. The way we can handle CDC is to put a CDC agent on each of these nodes, and this agent is going to be responsible for reading the data from the commit log and sending it off to Kafka. In fact, since Cassandra 3.0, there is actually a CDC feature, and this feature provides us with a file reader and a file-read handler, where the handler has already deserialized the information from the commit log. So we thought all we have to do is take this mutation, which is what they call a change event, extract the data that we care about, convert it to Avro, package it, and send it off to Kafka.

But as you’re probably already thinking, there are a couple of problems with this approach. First of all, we get duplicated change events. Because we have a replication factor of three, when we’re reading from all of these logs, we’re going to get three copies of all the data. So somewhere down in our pipeline, we need to figure out how to do the deduplication.

The second problem is out-of-order events. This one is a little bit more subtle. Because we’re dealing with distributed systems here, it is possible that two different clients are writing to the same row at the same time with two different values. In this case, maybe one client is changing the first name to Anne and the other client is changing the first name to Alice, and node one and node two receive Alice first and then Anne, while node three receives Anne first and then Alice. Now these three nodes actually have a different understanding of what the most recent data is.

The way Cassandra cleverly handles this is using the concept of last-write-wins. When a client sends a request, it generates a client-side timestamp, and this timestamp gets propagated into every single column of that row of data. This way, when the client is reading the data from these nodes, if it sees a discrepancy between two or more nodes, it’s always going to pick the row with the latest timestamp. But because our CDC pipeline is outside of Cassandra’s read path, we have to figure out how to do this ourselves.
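
The last-write-wins rule itself is tiny; this is roughly what a pipeline outside Cassandra’s read path has to re-implement, shown here with made-up timestamps.

    def last_write_wins(copies):
        """copies is a list of (value, client_timestamp) pairs, one per replica;
        the value with the largest timestamp wins."""
        return max(copies, key=lambda pair: pair[1])[0]

    copies = [("Anne", 1551200000001), ("Alice", 1551200000002),
              ("Anne", 1551200000001)]
    print(last_write_wins(copies))        # -> Alice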

The third problem is incomplete change events. Cassandra is optimized for writes, so unlike MySQL, which does a read before every single write, Cassandra just blindly writes the data into the database. Because of this, we’re only going to know the columns that have changed; we’re not going to know the rest of the columns of that row. So our change events are incomplete, and we need to somehow figure out how to piece this information together in our pipeline.

The fourth problem is unlogged schema changes. You can modify your schema in Cassandra; however, it uses a completely different read-write path from a data change event. It uses the gossip protocol to propagate a schema change. This means that schema changes are never recorded in the commit log. So if we’re only listening to the commit log, we’re not going to know about any schema change.

Our current solution that we’re working on, which we call the bridging-the-gap solution, is that we’re going to ignore these problems at least until the data gets into BigQuery. So basically, the agent is just going to parse all of this data and send it off to Kafka, and Kafka is going to send all of this data into BigQuery. Everything in BigQuery is unordered, duplicated, and incomplete. But then we’re going to heavily leverage BigQuery views to handle all of this. In order for BigQuery to know how to do this, it needs a little bit more information. It’s not only going to store the value of every single column, it’s also going to record the timestamp of when the data was updated. It’s going to record a deletion timestamp in case the data is deleted, and it’s also going to record a boolean field that represents whether this column is part of the primary key or not.

Let’s take a look at the data now that we have it stored in BigQuery. This query specifically looks at the first name column, and if we also query for the last name column, notice that the second row is null; that’s because the second event is just an update, so we only updated first name. This is not quite useful, because what we actually want is the second event for first name but the first event for last name. The way we can handle this is by looking at the timestamp fields, comparing them, and finding the one with the latest timestamp. Then we can create a view that returns the user with data that is deduplicated, ordered, and complete. In order to do this, we have to heavily leverage BigQuery, including its UDFs as well as a lot of GROUP BYs and so on.
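
Expressed in Python rather than SQL, the reconstruction the view performs looks roughly like this: for each column, take the value from the most recent event in which that column was actually set. The real implementation is a BigQuery view with UDFs and GROUP BYs, and it also has to honour the deletion timestamp, which this sketch ignores.

    def reconstruct_row(events, columns):
        """Pick, per column, the value from the latest event where it was set."""
        row = {}
        for col in columns:
            candidates = [e for e in events if e.get(col) is not None]
            if candidates:
                row[col] = max(candidates, key=lambda e: e["ts"])[col]
        return row

    events = [
        {"id": 1, "first_name": "Anne",  "last_name": "Smith", "ts": 100},
        {"id": 1, "first_name": "Alice", "last_name": None,    "ts": 200},
    ]
    print(reconstruct_row(events, ["id", "first_name", "last_name"]))
    # -> {'id': 1, 'first_name': 'Alice', 'last_name': 'Smith'}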

There are some advantages to this approach. The first advantage is quick iteration, because we basically didn’t change anything in our pipeline; we’re doing all the heavy lifting in BigQuery, and BigQuery views are very cheap to create, modify, and delete, so as we experiment with Cassandra, we can modify the view as necessary. The second benefit is that there’s very little operational overhead. Notice that aside from the Cassandra CDC agent, we didn’t introduce anything new. So while we’re solving this problem, we don’t have to think about the uptime of other services or applications that support this pipeline. And finally, because we’re leveraging the base table in BigQuery, we’re not going to impact Cassandra in production, because we don’t have to read back from Cassandra on every write to get the full row; all of our data is already in BigQuery. But, of course, it comes at a cost.

The first cost is that it’s very expensive, because every time a user queries this view, we have to do all of this piecing-together of the data, and that’s going to get very expensive. On top of this, we’re recording at a replication factor of three, which means the data is amplified and the table is going to get really big really fast, so we have to do some maintenance work in order to keep this view manageable. The way to do compaction is to just materialize this view periodically, but that is going to be another overhead. And finally, notice that we’ve only solved this problem for BigQuery, which means that if any other downstream derived system is trying to read this data, they are out of luck. They basically have to re-implement all of this themselves, and that’s not quite ideal.

For the sake of completeness, I’ve included a potential future solution that we’re considering. It’s a little bit more complicated, because it introduces a stream processing engine, a cache, a database, and a second Kafka. Let’s go through how this would work. The messages are still going to arrive in Kafka, duplicated and out of order. The first thing the stream processing engine is going to do is check against the cache to see whether this data has been processed or not. If it hasn’t been processed, we process it; otherwise, we can drop the message.

Next, the stream processing engine is going to check against the database. It’s going to compare the timestamp of what’s in the database against the timestamp of this event, and in the case where we have an older timestamp in our event, we can drop it as well. And finally, because we’ve done a read on this database, we now have both the before and the after of every single event, so we can send this complete information into our second Kafka. Kafka can then send this information into KCBQ, which can propagate it into BigQuery, and the benefit here is that if we have any other derived system reading from Kafka, it now gets a much nicer stream to work with.
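
Putting the per-event logic of that proposed pipeline into rough Python, where the cache, database, and producer objects are hypothetical interfaces rather than an existing framework:

    def process_event(event, cache, db, producer):
        """Drop duplicates via the cache, drop out-of-order events via the
        stored timestamp, then emit a complete before/after record."""
        event_id = (event["key"], event["ts"])
        if cache.seen(event_id):              # duplicate from another replica
            return
        cache.mark(event_id)

        current = db.get(event["key"])        # last known full row, or None
        if current is not None and current["ts"] >= event["ts"]:
            return                            # out-of-order: newer write already applied

        before = current
        after = {**(current or {}), **event["columns"], "ts": event["ts"]}
        db.put(event["key"], after)           # keep the full row for later events
        producer.send("cassandra-changes-complete",
                      {"before": before, "after": after})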

In summary, there are three things I’m trying to get across in this talk. The first is that the database as a stream of change events is a really natural and useful concept. It would make a lot of sense for every single database out there to provide a CDC interface so the data can be sent to other derived systems, because otherwise we’re talking about a very closed system, where the database expects to be the final destination and the data is not going to go anywhere else, and that’s kind of selfish.

The second point is that log-centric architecture is at the heart of a streaming data pipeline. It helps us solve a lot of problems when it comes to distributed transactions, and it’s very simple to implement and understand. And finally, CDC for peer-to-peer databases is not trivial, as you’ve probably already noticed. However, we’re hoping that as the tools get better, and as our understanding of these databases gets better, it will become easier over time.

Some additional information: if you’re interested in the MySQL to BigQuery pipeline, there is a blog post on our website that explains it in a little more detail. I’ve also included the KCBQ GitHub link in case you’re interested in using it. And finally, the last piece is a blog post that my colleague published this morning. It talks about schema evolution in the case where it breaks backwards compatibility. We can use Avro’s forward and backward compatibility to deal with schemas that are compatible, but what happens if you make a change in the database that is not compatible? The last post goes a little more into that, and I think it’s super interesting and relevant for CDC. And that’s the end of my talk. Thank you, guys.

Questions & Answers

Moderator: So for the future work, you have a database sitting on the side. I have two questions. One is, you could hydrate by just reading directly from one of the source databases, couldn’t you?

Gao: Yes. The reason we don’t want to read directly from the source is because we don’t want to impact the production of the source database. It is possible to create a second cluster, a Cassandra cluster that is made specifically for the CDC purpose which is kind of what this could potentially be as well, so.

Moderator: But you have to keep it in sync with the sources? That’s the challenge, right?

Gao: It’s okay if it’s asynchronous because for every single table, it’s essentially serialized so then we know that it’s going to be in order.

Participant 1: What do you think about writing the event from the application itself? Because we’ve implemented the future solution that you showed here, using events in the application write to Cassandra, write to Kafka, and then we have two events before and after into Kafka. And then we use the stream processing engine within [inaudible 00:42:53] to handle [inaudible 00:42:55] and then distribute the updates to multiple databases.

Gao: Yes. One of the problems that we’re seeing is the distributed transaction problem. If you guys have potentially solved that, then that’s great. But the other problem is that we do want to be able to get the before and after. At least in the MySQL case, if we were to use this event sourcing approach, then we only get the columns that have changed. So that was something we were trying to avoid. But in this case with Cassandra, it’s simply not possible, because the database itself doesn’t do a read before the write. That’s also why this is kind of an event sourcing approach. Okay, so now I kind of understand what you’re saying; you’re talking about why not just update Kafka first and then basically read from it.

Participant 1: That approach works with Cassandra, you don’t have to worry about multiple copies and stuff.

Participant 2: So can you go into the detail, you read from Cassandra, send it to Kafka, you write to Cassandra and then write to Kafka, so you get a bit of an update.

Participant 3: Before you write to Cassandra, you write an event to Kafka and then once it’s successful, you write another event to Kafka.

Participant 2: It’s kind of but it kind of gets grounded.

Gao: And another thing is if you are using Cassandra for any kind of transactional things where you care about read-your-write consistency, that could potentially become a problem I think. Where you need to guarantee that every time you’re reading from the database, it’s the latest thing, it has what you’ve written already.

Participant 4: Yes, I know. The read or the write in Cassandra helps with this, but when you’re replicating that information [inaudible 00:44:36].

Gao: It’s okay. Yes. We can take it offline after. Thank you.

Participant 5: What’s the motivation for going to Cassandra and given that it sounds like this is quite an effort to go to it? Is BigQuery not sufficient? Or what’s the limitation there?

Gao: We want to use Cassandra more as a production database, in the sense of an OLTP database, whereas BigQuery is more of an analytical, OLAP database. So we want to optimize for writes, and Cassandra is our best contender for that.

Participant 6: So I had a quick question about caching. That essentially materializes the view of the row within the cache, right? So how do you know how long to keep that cache before you throw it out?

Gao: The cache is going to be an optimization, but it’s not going to be the source of truth, because it is possible – say you’ve set your TTL to 30 minutes, but for whatever reason one of the nodes is down for a longer amount of time – that you get the data later. But the database can then catch those problems by the time the data gets there.

Participant 7: Hi. Just a quick question. I was just listening through the future solution for Cassandra where you will get three copies of the data and now to quarter as well at times. Could you not just use one of the replicas as master and you could use Zookeeper for keeping that state and just have that one replica push it out?

Gao: That’s actually something we considered, where we would coordinate the different agents so that only one of them is sending the message. The problem is that Cassandra is meant to be a peer-to-peer database where all the nodes should be equal. If we start to introduce Zookeeper into the picture, it’s a little bit against Cassandra’s philosophy, which is that any node can be taken up or down, because now we’re basically setting one of the nodes to be the master. It’s definitely possible, but the reason we’re not considering it is that we kind of want to follow what Cassandra is known for, which is the peer-to-peer model where everyone is equal.

Participant 8: Do you write three? Write equals three on all writes?

Gao: Right now, we do, yes.

Participant 9: Thanks, it’s a great talk. Just curious if you think there’ll be people building downstream applications off some of these streams, or if it all goes to analytics? I’ve seen some interesting use cases of using this to actually generate other applications.

Gao: I think it’s definitely possible which is why we want to have this future solution that allows other systems to be able to read from Kafka. If the only thing we cared about was analytics, then our existing pipeline could kind of work for a while, so yes.

Participant 10: I’m curious if you ever end up missing events due to a network failure talking to Kafka or something like that, and if so, how do you deal with that?

Gao: That’s a good question. A lot of things that we’re dealing with right now are all in POC, so we haven’t had to spend a lot of time and effort in terms of guaranteeing that our messages are not lost and whatnot. I think it’s potentially possible, as I’ve heard about similar scenarios in other pipelines before with Kafka, but we’ll probably get a clear answer as we experiment more with this pipeline.

Participant 11: My question is, you’re paying an awful lot for the data duplication with a view on BigQuery. How much of a performance penalty are you incurring by lazily evaluating the data this way?

Gao: That’s something that’s just going to get worse over time. We are currently in the process of building this, so I don’t have a lot of numbers for you, but it is definitely pretty expensive. BigQuery is great because it’s able to do parallel execution, but even with that, it is a concern that a single query could potentially take way too long, which is why we’re hoping compaction can help us down the road. Sorry, I don’t have a good number for that.

Participant 11: Is that a business tradeoff, you think, whether or not to throw away the old data …?

Gao: The business tradeoff is expense. BigQuery cost is based off of execution, so this is going to get expensive, yes.

Participant 12: One question is, if you do, and this [inaudible 00:49:51] but if you use a cache and you’re just windowing in the stream processor, you could probably dedup there, at least for data within a time window.

Gao: Cassandra itself actually has a TTL feature. If we want to use Cassandra as this intermediary database, that’s possible as well.

Participant 13: This is more in line with what someone had asked before. If data consistency is such a big requirement here, what is the rationale for using an eventually consistent database like Cassandra?

Gao: That’s a really good question. Because we’re building a data pipeline, we want to optimize for different use cases. So maybe for one user, their goal is to use Cassandra for write-only workloads, and another might be using Cassandra for something that’s a little bit more consistent. We’re trying to make a more generic solution, essentially, that covers all these cases. But it’s true, though.

Participant 13: In this picture, are you envisioning heterogeneous sources like Cassandra and MySQL and things like that?

Gao: Oh, no. So this pipeline is specifically for Cassandra only.


Should CEOs Learn To Code? Yes!

MMS Founder
MMS RSS

Article originally posted on Data Science Central. Visit Data Science Central

What makes a great CEO?

First of all, the answer to that question depends on who raises it: An investor will likely come up with a different answer than an employee. What if you asked a CEO? He might give you his definition. Ask another CEO, and he will likely give you a slightly or even completely different view on this subject.

Ask five people, and you will get five answers

It seems nearly impossible to find an answer that would be satisfactory to all the audiences above. Why? Because each person you ask has his own motivations: An investor might expect a CEO to cut costs and shrink the organization to fit. An employee is scared of precisely that happening. Last but not least, the CEO could have a vision of how he wants to shape the company. Investors, employees or both might support that vision, or they might not — conflicts of interest everywhere.

What hard skills should a CEO have?

Surprisingly, it’s a greatly overlooked question. Forbes published a thoughtful article on this subject back in 2012: Great CEOs Must be Either Technical or Financial. Venkatesh Rao, the author, made an interesting observation:

“I am struck by a major disconnect between the stories of real-life CEOs and theories of leadership. The theories of leadership never seem to talk about necessary expertise in running companies. Most of the thinking seems to revolve around people skills. Two skillsets in particular, financial and technical, are barely ever talked about.”

Keep up with the Gates, Musks, and Zuckerbergs

The founders and CEOs of Microsoft, Tesla, SpaceX, and Facebook set the bar higher than ever for all other CEOs out there. It’s not only Musk & Co’s enormous accomplishments that set them apart. It’s the skills that got them there, one of which is a profound understanding of technology.

What does this mean to the brick-and-mortar CEO?

Most CEOs are not involved in ventures that electrify the public, such as sending spaceships to Mars or reinventing the automotive industry. Their companies might be involved in the creation and delivery of products and services most of us take for granted. Just think of electricity, food, gasoline or insurance. All of these industries are presently being digitally disrupted. How well are the majority of CEOs prepared for what’s coming?

Does your company’s CEO grasp technology? Presumably not

I don’t possess numbers to support my case. However, based on dozens of one-on-one interviews with CEOs across multiple industries, with their employees, and with consultancies, I came to a shocking conclusion: Technology illiteracy is a widespread phenomenon amongst CEOs.

Can technology illiterate CEOs digitize their companies and make them data-centric?

A senior partner at a multinational consultancy, who wishes to remain anonymous, gave me a blunt answer to that question: “No way.” In his experience, large digital transformation projects fail primarily because of technology-illiterate CEOs in the driver’s seat. Surprisingly, on both sides of the table.

Source: code.org

What’s the solution to the problem? You guessed it. Learn to code

I learned to code in my 40s. My language of choice was Python. Although I do not code for a living, nor do I have any intent to do so, this learning experience taught me some priceless lessons. All of them are of great value for my business and my customers.

Every CEO I shared these lessons with got excited about them; some are even following in my footsteps. Let me elaborate on the value of learning to code. What do you actually learn? It’s more than “just” writing some lines of code.

Breaking big problems into small, solvable ones

The most important lesson that coding taught me is breaking problems into smaller chunks one can chew on. If a CEO decides to learn to code, it will not make all his more significant problems miraculously go away. However, code is the alphabet of a software-centric world. The ability to read and write code empowers people. Those CEOs who decided to learn coding start seeing the world around them in entirely new ways: “Now I start getting an idea of what is really going on out there.”

Understanding the complex world of open source

Marc Andreessen, entrepreneur and investor, coined the phrase: “Software is eating the world.” Soon after, an alternate version started circulating on the internet: “Software is eating the world, and open source is eating software.”

For a CEO in a highly competitive industry such as insurance, for example, joining forces with their worst competitors to create software that could be beneficial for the entire industry is by and large unthinkable.

“All the major innovations happening right now are with open source platforms, and yet there are still a lot of people … reinventing wheels.” so said John Mark Walker in an InfoWorld article titled 20 years on, open source hasn’t changed the world as promised.

Due to a lack of even the most basic understanding of open source, technology-illiterate CEOs keep everyone busy reinventing wheels. By learning an open-source language such as Python, a CEO gets some first-hand experience with a key driver in today’s world: open source.

Knowledge beats hearsay

Those CEOs who decide to learn programming gain a lot of credibility amongst tech-savvy peers, staff, and advisors. It is well perceived to have at least some limited understanding of technology, as opposed to solely relying on second- and third-hand knowledge. As one CEO I spoke to put it bluntly: “I was once told right in my face that now, that I have some basic programming skills under my belt, I started talking sense when it comes to technology.”

Learn to learn from millennials

We are living in a world in which knowledge has become ubiquitous. Nobody understands this better than the generation of millennials. “It makes this world a dangerous place to live in for dinosaurs like us,” as a CEO friend of mine told me. He’s in his late sixties.

Ask the average middle-aged CEO what he thinks about YouTube, and he will likely respond: “Entertainment. Cat videos. Those things.” Ask a millennial the same question, and you might get two different answers: a) entertainment and b) education.

Those CEOs who start learning to code quickly realize that most of the educational content is available to anyone at any time at zero cost. This realization is quite shocking: “Today, a bunch of guys can just learn about and break into virtually any industry and turn everything upside down.”

As digitalization progresses, the entry barriers into most industries are vanishing. When asked how he acquired the expertise to break into NASA’s territory, Elon Musk replied: “I read books and talk to people.” A millennial-flavored response would likely be: “I watch videos on YouTube and snapchat with people.”

For me personally, the story of a 12-year-old girl who learned dubstep solely via YouTube was one of the most inspirational ones of recent years.


You will find a wealth of testimonies by people who learned Mandarin Chinese, programming, or how to build irrigation systems in their villages in Africa via YouTube. We are living in an open world.

Dear CEOs, please, learn to code. Don’t get left behind.

I run a data literacy consultancy, and we cater to large companies in Europe and the US. Do you have questions I didn’t answer in my write-up, or do you want to share your experiences? Please leave a comment or reach out to me via email rafael@knuthconcepts.com or LinkedIn. Thank you!


Data Anonymisation Software – Differences Between Static and Interactive Anonymisation

MMS Founder
MMS RSS

Article originally posted on Data Science Central. Visit Data Science Central

This article first appeared on the Aircloak Blog – feel free to pay us a visit! Aircloak is a company that develops software solutions, enabling the immediate, safe, and legal sharing or monetization of sensitive datasets

Which company would not like to get significantly improved insights into the market with detailed data and thus optimize its products and services? In most cases, the corresponding data is already available, but its use is often restricted for data protection reasons.

Anonymisation is used when you want to use a data set with sensitive information for analysis without compromising the privacy of the user. Moreover, the principles of data protection do not apply to anonymised data. This means that this information can be analysed without risk. Therefore, anonymisation is an enormously helpful tool for drawing conclusions, for example, from bank transaction data or patient medical records.

Retaining Data Quality and the Misconception of Pseudonymisation

In practice, however, things are less simple: Data records in companies are often anonymised manually. Although special anonymisation tools or anonymisation software are used, many parameters in the corresponding tools must first be determined by experts and then entered manually. This process is laborious, time-consuming, error-prone and can usually only be handled by experts. At the same time, one must choose the lowest possible level of anonymisation, which nevertheless reliably protects the data. Whenever information is removed, the quality of the data record deteriorates accordingly.

Due to the data quality problem, data records in companies are often only supposedly anonymised when they are actually pseudonymised. Pseudonymisation means that personal identification features are replaced by other unique features. This often enables attackers to re-identify the pseudonymised data set with the help of external information (Your Apps Know Where You Were Last Night, and They’re Not Keeping It Secret, NYTimes). For example, pseudonymised data records can be combined with additional information from third-party sources to make re-identification possible (linkability). Likewise, the enrichment of data records classified “unproblematic” can lead to the exposure of sensitive data when you cross-reference it with additional information. [1] Netflix is one company that has had to struggle with such a case: In a public competition, they sought a new algorithm for improved film recommendations. A database with Netflix customers was anonymised and published. However, researchers were able to link it with data from IMDB rankings and thus de-anonymise the users.

Static Anonymisation – The One-Way- or “Release-and-Forget”-Approach

Almost all anonymisation tools can generally be classified into two categories: either they use static anonymisation or dynamic anonymisation. With static anonymisation, the publisher anonymises the database and then publishes it. Third parties can then access the data; there is no need for further action on the part of the publisher himself. As already described above, the following things must be carefully considered:

  • The list of attributes that must be anonymised (data protection/privacy) and the utility of the data must both be precisely defined.
  • In order to ensure the latter, it is crucial to understand the use case for the anonymised data set in detail at this point, because this may better preserve the information quality of the decisive dimensions by accepting a restriction for other, less decisive, data.
  • Data is often only pseudonymised. Attackers can thus draw conclusions from the data records by linking them with other information.

One example of a current static (and free) anonymisation tool is the ARX Anonymization Tool; its website gives a list of further anonymisation software that works according to the static principle.

Interactive Anonymisation – Where the Anonymisation is Tailored to the Query

With interactive anonymisation, anonymisation is applied dynamically to the results of a query and not to the entire data set (that’s why we call this procedure “dynamic anonymisation” at Aircloak). The analyst can access a database via an interface and make queries. The anonymisation happens automatically during the query, and the analyst receives only anonymised results. This dynamic approach is used for Differential Privacy as well as for Aircloak. Since the results of the queries are changed by noise addition, the number of queries must be limited, at least for differential privacy. Otherwise, it would be possible for attackers to calculate the noise by using simple statistical methods and thus de-anonymise the data set. Aircloak avoids the need for a privacy budget by adding noise tailored to the queries with the framework Diffix.
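
As a minimal illustration of the noise addition used in differential privacy, here is the classic Laplace mechanism applied to a count query. This is generic differential privacy, not the query-tailored noise that Diffix/Aircloak applies.

    import numpy as np

    def noisy_count(true_count, epsilon=1.0, sensitivity=1.0):
        """Release a count with Laplace(sensitivity / epsilon) noise added."""
        scale = sensitivity / epsilon
        return true_count + np.random.laplace(0.0, scale)

    print(noisy_count(1234, epsilon=0.5))   # smaller epsilon -> more noise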

The great advantage of interactive anonymisation is that the anonymisation is performed automatically and the analyst does not need to have any knowledge of data protection or data anonymisation. This allows the analyst to fully concentrate on evaluating the data.

Of course, it is also possible to create a static dataset from an interactive process: the publisher formulates and executes queries for this purpose, and the anonymised results are then published. Examples of current anonymisation tools based on the dynamic approach are:

Differential Privacy:

  • Google’s RAPPOR
  • GUPT
  • PINQ (and wPINQ)
  • Harvard Privacy Tools Project: Private Data Sharing Interface
  • FLEX
  • Diffpriv R toolbox [2]

Diffix:

  • Aircloak Insights

Which method is better?

Some experts have the opinion that interactive anonymisation tools protect privacy better than non-interactive tools. [3] “Another lesson is that an interactive, query-based approach is generally superior from the privacy perspective to the “release-and-forget” approach”. [4] The probability of poorly anonymised data sets is much higher with static anonymisation due to the very complex selection of data usability/data protection parameters.

Nevertheless, static anonymisation is often sufficient in one-off projects with clearly defined frameworks, preferably in conjunction with further organisational measures. Dynamic anonymisation is particularly suitable in larger projects and with regular use of anonymised data records, where uniform processes and data protection compliance are the highest priority. Ultimately, it always depends on the use case and on how important the privacy of the data is.

Addendum:

“Interactive anonymisation” is not clearly defined. For example, the research paper Interactive Anonymization for Privacy aware Machine Learning describes how an interactive machine learning algorithm interacts with the help of an external “oracle” (e.g., a data protection expert) that evaluates the result of the anonymisation performed by an algorithm and thus makes continuous improvement possible.

By contrast, the Cornell Anonymization Toolkit claims that it is “designed for interactively anonymising published datasets to limit identification disclosure of records under various attacker models”. However, the interactivity here is limited to the fact that the anonymisation parameters can be adjusted in the tool. The result is nevertheless a statically anonymised data set.

[1] Leitfaden: Anonymisierungstechniken, Smart Data Forum
[2] Differential privacy: an introduction for statistical agencies, Dr. Hector Page, Charlie Cabot, Prof. Kobbi Nissim, December 2018
[3] Leitfaden: Anonymisierungstechniken, Smart Data Forum
[4] Privacy and Security Myths and Fallacies of “Personally Identifiable Information”, Arvind Narayanan and Vitaly Shmatikov, 2010


Katherine Kirk on Dealing with Teamwork Hell

MMS Founder
MMS RSS

Article originally posted on InfoQ. Visit InfoQ

Dysfunction in teams can truly feel like “being in hell”, according to Katherine Kirk, co-founder of the Inclusive Collaboration movement. In her talk How to Navigate Out of Hell at the upcoming Aginext.io conference, she describes work-hell as feeling confined within an endless loop of unhappiness, feeling out of control and being unable to influence one’s own destiny for sustained periods of time.

Drawing from her studies of ancient eastern philosophical practices, she advises teams and individuals to deliberately look for and see the bigger picture, actively manage their own responses to stressful situations, maintain their own integrity and ethical standards, and diligently take small steps rather than trying to address every aspect of the situation at once.

The Aginext conference runs in London on 21-22 March and focuses on the future of agile, lean, CI/CD and DevOps transformations.

InfoQ spoke to Katherine about her talk and the practical ideas she will present for dealing with dysfunction in teams.

InfoQ: Why is it that some office situations feel like being in hell?

Work-hell is when you feel confined within an endless loop of unhappiness

  • Difficulty isn’t necessarily hell. Overall, human beings don’t mind situations that are difficult, as long as we believe we are empowered to do something to make them better, or can see that we will eventually get things the way we would like them to be.
  • The reason some situations feel like hell and others don’t is primarily about whether or not we feel trapped indefinitely.

There is actually a pattern in how work-hell arises, which is often helpful to understand if you are trying to work your way out of it. Here is an excerpt from the book I am currently writing that explains it a little:

  • Consider that the nature of business is that all that is within it (e.g. projects, products and programs) will eventually degrade, dysfunction and expire. Just like human beings age, get ill and die. It’s the nature of our universe (check out the scientific concepts of entropy and the arrow of time).
  • Fundamentally humans don’t like being out of control of that. We sometimes want things to ‘last forever’ or ‘stay as they are’ longer than they do.
  • In light of this human beings are experts at working very hard to prevent degradation, dysfunction and expiry so that they get what they want. And have even learned to turn it to advantage. If a product expires, we create a new one, if a project degrades we transform it, if a situation dysfunctions we innovate. That’s the basis of success of humanity.
  • So – with projects, products and programs – we can resist entropy and are often successful. We become pleased with ourselves, our teams, our divisions and our company – for a while.
  • But eventually degradation, dysfunction and expiry begin bothering us again. Because that’s the nature of business. And entropy appears to be ‘winning’. Consider: what project, program or product won’t eventually degrade, dysfunction and expire? If we don’t want it to happen – we get upset.
  • Examples of entropy upsetting us are things like: wanting profit to still increase as it always used to even though the business context has changed, or wanting a defunct product to last for just one more year, or wanting a once-successful project to continue to be productive even when key individuals and demands from the customer have changed.
  • If there is nothing we can do to prevent what we don’t want to happen or it is far more difficult than we imagine to make our ‘wants come true’ then our human reaction of disappointment, anxiety and frustration increases.
  • During that process we begin to take it out on each other with behaviours like blame, shame, hierarchy, rules and manipulation. All because we are trying to stop the degradation, dysfunction and expiry that we dislike.
  • That still won’t necessarily create work-hell if we eventually feel we can change the situation into something we prefer, but it does become work-hell when it seems like it can’t be changed – no matter what we do… and that cycle continues far beyond our ability to cope.
  • That’s when we get intensely stressed, fatigued and arrogant – and, over time, that turns office situations that might have just been difficult into full-blown, horrible work-hell.

InfoQ: What are some examples of situations that cause the behaviours that put people into that hellish space?

Work-hell situations are created by locking people into the same difficulty (no matter what you do) over sustained periods of time, such as:

  • Constantly pushing beyond capacity and capability – creating stress and exhaustion
  • Continually disengaging individuals from being empowered to influence and participate – creating the feeling of being trapped
  • Persistent political quagmire – creating the feeling of being manipulated
  • Always forcing same-ness – creating exclusive, rigid factions, boundaries and monocultures

InfoQ: Aside from just quitting and finding another job, what are some ways that people can cope with, and change, the situation they find themselves in?

I’ve found 5 ways that never fail me (or, from what I’ve seen, teams, divisions, companies and leaders) – especially when you do them all together at the same time:

  • Continually develop your contextual understanding – the bigger picture helps create much more effective action
  • Utilize your reaction to determine outcome – ‘judo’ difficulty – you might not be able to change what is being done to you, but you can certainly change your reaction
  • Create meaningfulness out of difficulty through learning loops – use difficult situations to build a deeper level of wisdom about yourself, others and the alternatives around you – this will lead to you making much wiser decisions
  • Gentle diligent persistence forward – small steps for a longer period will preserve your precious energy and outlast drama
  • Keeping to ethics and principles – don’t make your situation any worse than it has to be, don’t let others cause you to behave in ways you regret

InfoQ: Where have you drawn your inspiration for these ideas from?

  • 10+ years adapting and applying ancient eastern philosophical patterns into ways of reducing difficulty and increasing effectiveness in business and tech
  • Specialising in turning hell scenarios around nearly all of my career
  • Being a student of difficulty rather than a victim of it
