Mobile Monitoring Solutions

How GitHub Partitioned its Relational Database to Improve Reliability at Scale

MMS Founder
MMS Sergio De Simone

Article originally posted on InfoQ. Visit InfoQ

GitHub has been working for the last couple of years to partition their relational database and move the data to multiple, independent clusters. This effort led to a 50% load reduction and a significant reduction of database-related incidents, explains GitHub engineer Thomas Maurer.

GitHub’s architecture originally relied on a single MySQL database cluster, known as mysql1.

With GitHub’s growth, this inevitably led to challenges. We struggled to keep our database systems adequately sized, always moving to newer and bigger machines to scale up. Any sort of incident negatively affecting mysql1 would affect all features that stored their data on this cluster.

To improve this situation, GitHub engineers defined a strategy to partition their relational database without impairing the level of service. Key to this was the definition of virtual partitions of database schemas and the use of SQL linters.

Virtual partitions are groups of tables used together in queries and transactions. Identifying table groups that could be separated in the application layer was the first step to a smooth transition. To codify the relation among tables, GitHub engineers introduced the notion of schema domains. Below, you can see an example of schema domains, defined in a simple YAML file:

gists:
  - gist_comments
  - gists
  - starred_gists

repositories:
  - issues
  - pull_requests
  - repositories

users:
  - avatars
  - gpg_keys
  - public_keys
  - users

This notation was the basis for the application of SQL linters.

[Linters] identify any violating queries and transactions that span schema domains by adding a query annotation and treating them as exemptions. If a domain has no violations, it is virtually partitioned and ready to be physically moved to another database cluster.

Linters were applied to both queries and transactions. Query linters ensured that only tables belonging to the same schema domain were referenced in the same query. Transaction linters were used to identify transactions that needed to be reviewed or required a data model update.
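As a rough illustration of the query-linter idea, the sketch below flags a query whose tables belong to more than one schema domain. It is not GitHub's linter: the domain names (gists, repositories, users), the regex-based table extraction, and the function names are all assumptions made for this example, and a production linter would rely on a real SQL parser.

```python
import re

# Hypothetical schema domains (domain name -> tables), modelled on the
# table list shown in the article. The domain names are assumed.
SCHEMA_DOMAINS = {
    "gists": ["gist_comments", "gists", "starred_gists"],
    "repositories": ["issues", "pull_requests", "repositories"],
    "users": ["avatars", "gpg_keys", "public_keys", "users"],
}

# Reverse index: table name -> owning domain.
TABLE_TO_DOMAIN = {
    table: domain
    for domain, tables in SCHEMA_DOMAINS.items()
    for table in tables
}

# Naive extraction: identifiers that follow FROM, JOIN, INTO or UPDATE.
# This only copes with simple statements; it is a stand-in for a SQL parser.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN|INTO|UPDATE)\s+`?(\w+)`?", re.IGNORECASE)


def domains_in_query(sql: str) -> set:
    """Return the set of schema domains touched by the tables named in sql."""
    tables = (name.lower() for name in TABLE_REF.findall(sql))
    return {TABLE_TO_DOMAIN[t] for t in tables if t in TABLE_TO_DOMAIN}


def violates_partitioning(sql: str) -> bool:
    """True when one query references tables from more than one domain."""
    return len(domains_in_query(sql)) > 1
```

With these assumed domains, a join between gists and gist_comments passes, while a join between repositories and users is reported as a violation that would need an exemption annotation or a rewrite.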

Once virtual partitions were identified through schema analysis, related tables could be physically moved to different database clusters. For this final step, it was paramount for GitHub to avoid any downtime.

After an initial investigation of Vitess, a scaling layer running on top of MySQL, GitHub engineers opted to implement their own approach, dubbed write-cutover. This consisted of adding a new replica to a cluster, then running a script to stop replication and make that replica independent from the original cluster. Using this approach, GitHub moved 130 of their most critical tables at once.

Thanks to this effort, explains Maurer, GitHub’s database was spread across several clusters. This made it possible to increase handled queries by over 30% while the average load on each host halved. Additionally, GitHub engineers observed a significant reduction in the number of database-related incidents.

Maurer’s article contains much more detail than can be provided here, especially concerning their write-cutover procedure. Do not miss his write-up to get the full picture.

Presentation: Optimizing Your Web Performance: Separating the Signals from the Noise

MMS Founder
MMS Carl Anderson

Article originally posted on InfoQ. Visit InfoQ


Anderson: Welcome to this talk about optimizing your web frontend performance. How to measure web performance is a broad, well-known topic. Over the years, while we’ve been cracking at it at Trainline, we’ve scored some wins, but also made a number of mistakes, many of which are common, even considered best practices in the industry, and which I’ll share with you towards the end.

My name is Carl Anderson. I’m the Director of Engineering for frontends at Trainline. That means that I’m responsible for our applications, iOS and Android, and our websites, our APIs, pretty much everything that’s externally facing at Trainline.

Trainline’s Purpose and Vision

Trainline’s mission is to empower people to make greener travel choices, that is, hopping in trains rather than taking the car or the plane. To do that, we’re building the world’s number one rail platform, which basically takes form in a website and applications to which we’ve connected over 270 rail and coach carriers. That’s in the UK, but it’s all across Europe, serving 45 countries. We got roughly 30,000 stations that we’re serving. To give you an idea of scale, pre-COVID, we had roughly 90 million visitors per month, and our apps were downloaded over 30 million times. That’s what we do. That’s for context.

Why Speed Matters

The first one is why speed matters. I think it obviously matters because it’s good karma. As a developer, I think none of us wake up in the morning, thinking, “I’m going to do a slow website.” As developers, we’re all striving for performance. We’re all striving to create some sleek, nice UIs and UX for users. Beyond that, from a personal standpoint, we all know that the internet consumes a lot of energy. Actually, today, powering the internet consumes more than the energy consumed by the UK. This electricity consumption actually has an enormous carbon footprint because all that electricity is not entirely green. The less energy we consume, the faster our product is, the less time the users use it, and the faster our server actually serves back the results to it, the smaller our footprint. That’s good karma. If we take a step back, and this ties back to our mission at Trainline, transportation is the fourth emitter of greenhouse gases in the world. If you look at that category, half of it is cars, 10% is airplanes, and trains are actually the greenest means of transportation in the world right now. They account for less than 1%. The more we can get people to forget about air travel, or their car and actually commute by train, the better it is for the planet. We can do that at Trainline by just providing a really smooth, really sleek user experience that just makes using trains as easy as hopping in your car. That’s what we’re aiming for.

That’s not just good karma to create a fast and sleek website, what’s great is that it’s also aligned with your business goals. I think one of the first things that we did when we started working actively on performance a couple of years ago, is we started looking at different metrics. One of them was the time to first byte. That’s what you’re seeing in the graph right there. You see the green line and the trend, it’s indexed on the time that it takes, and the spike of the best step conversion on that page is roughly around 200 milliseconds. As your time to first byte gets longer to get your data across the wire, actually, your step conversion drastically reduces, and actually, it gets divided by two. Because here, at the beginning, the step conversion is at 40%, but if your time to first byte is around 2 seconds, you’re now down to 20%. People just don’t go, just don’t convert, just go back because your website is too slow.

You get the same idea here with another performance metric named the cumulative layout shift, CLS, which is basically the amount of change in your page. It’s a score. The amount of change in your page where, as things load, the layout changes and things move, which can be quite annoying as a user. We’ve all been faced with those moments where you’re trying to click on something, and that thing changes and/or moves, and then you end up clicking on something else, which is quite annoying. That’s what the cumulative layout shift score measures. Same with that one. It starts at roughly 6% step conversion, but as the cumulative layout score increases, which means that more things are moving on the page, you can see the step conversion drastically reducing as well. Creating a fast and great page experience in your product is just good for your business as well.

Very recently, because actually, this is happening right now, there’s the web vitals introduced by Google, which is a Google initiative. I think it was introduced probably last year. I think the difference is that this year, what they call the core web vitals is now becoming a ranking factor in the Google search, so that if two websites provide the same relevance of information, and one has a better page experience than the other one, it is going to have a higher ranking in the Google search algorithm. This is actually an opportunity for those that have fast websites to actually gain in page ranking. It’s more important than ever. This is actually rolling out right now, in May 2021.

How to Get Started

I think we all bought into the fact that we want to create fast websites. I think the first thing to get started with is that you need to measure. There’s a famous saying, I think it’s Peter Drucker, the management consultant, said, “If you can’t measure it, you can’t improve it.” Actually, you can even go further that whatever you measure, you improve, or, we optimize what we measure, is really the idea behind it.

Build a Baseline: Google Core Web Vitals

The first thing is to build a baseline, but building a baseline is super hard with web performance, because there’s not a single speed metric. It’s a mix of lots of different signals, and lots of different metrics actually go into it. I think if you’re starting on this journey, the best way to get started is obviously the Google Core Web Vitals, because they’re so important in your ranking algorithm at the minute, or it’s becoming increasingly important. It already was. Basically, the way Google’s structured it, again, is measuring the page experience. It’s a balancing act in between three pillars: loading, so how fast you’re loading your content, interactivity, and then visual stability. Because if you only take one of these, if you optimize for the load time, you can go super-fast to actually show content, but then it might not be interactive. This is where the first input delay comes in: they’re going to measure when you tap or click on an item. If your CPU is still running and still processing other things, it’s not idle yet, it cannot take your interaction into account. It’s very annoying as a customer, you tap on the thing and it doesn’t take that into account.

Then the opposite, if you go for interactivity where you want to build everything and make sure that everything is ready before you actually paint anything on the screen, you end up having your customers just staring at a blank page for seconds, which is not a great experience. You need to find the right balance. The same with visual stability. It’s in between the two. You got the interactivity ready. You load all the content. If you keep moving things as the content pops in the page, because it’s always asynchronous, each and every individual component is actually interactive, but they pop and they move stuff, it’s hard for the user to actually start interacting with your page as it builds, which then becomes really annoying. Building a baseline with the Google Core Web Vitals is a great start to drive your page experience.
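To make the balancing act concrete, here is a minimal sketch that rates one measurement per pillar. The thresholds are the ones Google publishes for Largest Contentful Paint, First Input Delay and Cumulative Layout Shift; they are not quoted in the talk, and the metric keys and function names are assumptions made for this example.

```python
# Google's published Core Web Vitals thresholds: (good up to, poor above).
# The dictionary keys are illustrative names chosen for this sketch.
THRESHOLDS = {
    "lcp_ms": (2500, 4000),  # Largest Contentful Paint: loading
    "fid_ms": (100, 300),    # First Input Delay: interactivity
    "cls": (0.1, 0.25),      # Cumulative Layout Shift: visual stability
}


def rate(metric, value):
    """Classify one measurement as good, needs improvement, or poor."""
    good, poor = THRESHOLDS[metric]
    if value <= good:
        return "good"
    if value <= poor:
        return "needs improvement"
    return "poor"


def rate_page(measurements):
    """Rate every metric of a page; measurements maps metric name to value."""
    return {metric: rate(metric, value) for metric, value in measurements.items()}


def passes(measurements):
    """Simplification for this sketch: pass only if every pillar is good."""
    return all(r == "good" for r in rate_page(measurements).values())


# A page that loads fast but is slow to respond and shifts a lot.
example = {"lcp_ms": 1800, "fid_ms": 250, "cls": 0.3}
print(rate_page(example))
```

The example page is good on loading yet fails overall, which is the point made above: optimizing one pillar in isolation does not give a good page experience.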

Find Your Own Metrics

The next step, I think, is you shouldn’t just stop there. They’re great to start, but there are lots of other metrics that you can use to drive great performance within your web application. An example that we’d already talked about a bit earlier was the time to first byte: the first byte that you send over the wire, you want that to be as fast as possible. If you look at that curve, you can see that the sweet spot seems to be around 200 milliseconds. If you look at the histogram, you want to shift that curve to that sweet spot of 200 milliseconds here. Same with the first contentful paint, so that’s the first thing that a user sees, so going from a blank screen to something. You want that something to be meaningful, first of all. Then, same, you want to get it to where your step conversion is best on that example, so you want to shift it all to the left. Measuring is the first key thing.

Then, I think what’s also very interesting with what we just talked about, and with those graphs is that, it doesn’t just measure performance and pure tech performance, it actually links it to business metrics, so you can see the impact. I think really the key thing, what we started in our journey was really about measuring all the things that we can measure, let’s link them to different business metrics, such as the conversion rates, such as the bounce rate. Then just really see, is there any correlation, and let’s find the right tech metrics that have a direct business impact, because those are the ones you want to go after. I think another quick tip that we’ve learned from going onto that journey and starting doing that is, there’s lots of different tools to actually measure your performance. They don’t all have the same definition of the same metric, or they measure it slightly differently. Really, to compare apples with apples, just try to use the same tool for everything so you can have that overlay of your business metrics and your tech metrics. Then, it’s all comparable in between the different graphs that you have, because it all uses the same way of measuring a specific metric.
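One simple way to link a tech metric to a business metric is to bucket sessions by the tech metric and compute the conversion rate per bucket. The sketch below does this for time to first byte. The session format, the bucket width and the sample data are made up for illustration and do not come from Trainline.

```python
from collections import defaultdict


def conversion_by_bucket(sessions, bucket_ms=500):
    """Group sessions by TTFB bucket and return the conversion rate per bucket.

    sessions: iterable of (ttfb_ms, converted) pairs, converted being a bool.
    Returns a dict mapping the bucket's lower bound in ms to its conversion rate.
    """
    totals = defaultdict(int)
    conversions = defaultdict(int)
    for ttfb_ms, converted in sessions:
        bucket = int(ttfb_ms // bucket_ms) * bucket_ms
        totals[bucket] += 1
        if converted:
            conversions[bucket] += 1
    return {b: conversions[b] / totals[b] for b in sorted(totals)}


# Invented sample: (time to first byte in ms, did the user convert?)
sample_sessions = [
    (180, True), (220, False), (450, True),
    (900, False),
    (1900, False),
    (2100, True), (2300, False),
]

for bucket, rate in conversion_by_bucket(sample_sessions).items():
    print(f"TTFB {bucket}-{bucket + 499} ms: {rate:.0%} conversion")
```

Plotting these per-bucket rates is what produces a curve like the one described in the talk, where conversion falls as time to first byte grows.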

I think it’s also very important to record as much as possible, metrics, waterfalls, and do that continuously over time, so that you can see the trend. I think a key thing that we had at Trainline, when we started this, is that on those graphs, we started to set markers each time we were doing a release. You would see the graph, you would see the release, and you could see the curve then either go up or down so you can see the impact of what you just released. That was a very insightful thing, because then we would build our hypothesis. We’d go, “Let’s see what impact it will have.” That was also very interesting.
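The release-marker idea can be approximated in a few lines: split a time series of measurements at the release timestamp and compare the medians on either side. This is only a sketch under assumed inputs; the sample format and the numbers are invented, and a real setup would annotate releases in the monitoring tool itself.

```python
from statistics import median


def release_impact(samples, release_time):
    """Compare the median of a metric before and after a release marker.

    samples: list of (timestamp, value_ms) pairs.
    Returns (median_before, median_after, delta); a negative delta is faster.
    """
    before = [value for ts, value in samples if ts < release_time]
    after = [value for ts, value in samples if ts >= release_time]
    if not before or not after:
        raise ValueError("need samples on both sides of the release marker")
    m_before, m_after = median(before), median(after)
    return m_before, m_after, m_after - m_before


# Invented data: page-ready time in ms, with a release shipped at timestamp 4.
page_ready = [(1, 800), (2, 820), (3, 790), (4, 700), (5, 690), (6, 710)]
print(release_impact(page_ready, release_time=4))  # (800, 700, -100)
```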

Create Your Own

We started with our own metrics, and the Google Core Web Vitals came in. That was quite interesting. Still, as we started going into optimizing for it, we actually realized that there’s still some gaps, because those metrics are generic and standard. Google needs a standard way to measure the performance of websites from various companies, but no one knows better than us how our product works. If you can see on the left, there’s quite a gap. That’s the waterfall of a web page loading. There’s quite a gap in between the early metrics and then the later metrics that you get when a web page gets constructed. What we started doing is we started increasing granularity with our own metrics, such as the time to boot our React.JS application, the time for the single page application to be loaded, the time to hydration, and all those metrics. That was quite interesting, because those are all timings where, as an engineer, you can actually dig in and try various things to have an impact on them.

It’s Not About Speed, It’s About User Experience

Then I’ll talk a little bit about home ready metrics. That was also a key finding for us was, again, nobody knows your product as well as you do. To measure your page experience the best that you can, what we soon started to do is identify with our product teams, what’s the key action of that page. Then, the tech teams, what we started doing is measuring when that action actually was ready. By ready, I mean when it’s ready to interact with, so it’s painted. The JS events are there, just ready for the user to consume it. That was a critical one, because then it gave us the actual moment when the page is just ready to go, and is what the user expects in there. This created the graph that you’re seeing here, is that then we have that booking flow, where we go from loading the application, like a hard navigation going when we load the single page application, all the way to when the home is ready, which basically means that the search widget is interactable. It can still mean that we’re still loading asynchronously in the background, extra things, but at least we can now search from London to Cambridge, for instance.

Then you type in your stuff. You hit search. You arrive on the search results page. Then that next marker is when all the results are actually displayed on the screen and ready to interact. Then that’s the next marker. Then we’re going to measure from that moment you click search to that moment that those are ready. That’s an extra wait time. We’re going to do that for every step of your flow. At the end, it allows us to build that graph that you’re seeing at the bottom of the screen, which is the total wait time that a user sees, or experiences, after going through a flow. Obviously, our goal is to make that as small as possible. It also actually highlights the fact that performance is a very cumulative sport. It’s not just the frontend side, not just the backend, you want to have that entire stack to be as fast as possible.
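The total wait time described here is a sum of per-step waits, each measured from the user's action to the moment the next page's key action is ready. A minimal sketch follows; the step names and timestamps are hypothetical, chosen only to show the arithmetic.

```python
def total_wait_ms(steps):
    """Sum the wait experienced at each step of a flow.

    steps: list of (step_name, action_ms, ready_ms), where action_ms is when
    the user triggered the step and ready_ms is when its key action was ready.
    Returns (per_step_waits, total).
    """
    waits = {}
    for name, action_ms, ready_ms in steps:
        if ready_ms < action_ms:
            raise ValueError(f"{name}: ready marker precedes the action")
        waits[name] = ready_ms - action_ms
    return waits, sum(waits.values())


# Hypothetical booking flow, timestamps in ms since the first navigation.
booking_flow = [
    ("home_ready", 0, 1200),                # hard navigation -> search widget usable
    ("search_results_ready", 15000, 16800), # click on search -> results displayed
    ("checkout_ready", 40000, 41100),       # pick a ticket -> checkout usable
]

waits, total = total_wait_ms(booking_flow)
print(waits)  # per-step waits: 1200, 1800 and 1100 ms
print(total)  # 4100
```

The time the user spends typing or reading between steps is deliberately excluded; only the time spent waiting for the product counts.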

Synthetics or RUM

At that stage, you’ve got a set of metrics that you’re monitoring to measure performance. We talked about Google Web Vitals, and other technical ones that are interesting to you, such as time to first byte, or first contentful paint. We’ve also talked about your own custom metrics to measure, really, the moment where a user can actually interact with the web page. Now, the question is, how do you monitor those? There’s always that debate of, shall we go synthetics, shall we go real user monitoring, RUM? Do we need both? The answer is definitely both, because they both serve a very specific purpose. Synthetics is lab data. It’s the one where you’ve got the waterfalls, you’ve got all your profiler information. It’s a controlled test environment. It’s very consistent, so you can run the same test over and over. You’re really testing your actual code in a controlled environment. You can see a trend, you can see progression through the updates that you’re doing in your code.

Real user monitoring is the key one, because that’s field data. That’s what users really experience, so that’s the end goal. The end goal is to improve and to create the best experience possible for these users. Those are great to use as KPIs. Then what you’re doing is that you’re using synthetics to test your hypothesis, to build a backlog, then you ship those. Then you’re monitoring in production. See, did it have the impact that I thought it was going to have? Really, you need a combination of both.

How to Measure

Now you have all these metrics, you’re measuring synthetics, you’re measuring in production. That’s a lot of data. How do you look at that data? I think, to that, the first thing, the first trap, don’t use averages ever, because it just paints a crooked picture of the world. If you compute an average, it will paint a picture of the world where all your requests run at the same speed. That’s obviously not true. I think one mistake that we’ve made at the very beginning when we started that journey of optimizing our web performance, we started at the 95th percentile to scrape off oddities or the outliers at the very end of the scope where they have network issues or connectivity issues. We said, let’s just drop the 5 last percent, but then let’s just focus on that value, which 95% of our customers are actually experiencing or below.

That was a mistake, retrospectively, because if you look at the median where most of our users are, and you have a significant degradation, that’s the example on the right, often, you won’t actually see anything in the 90th percentile, because the value is so high compared to where the bulk of your users are. We actually found out that it was better to have two separate buckets, and always look at your median. That’s what most of your customers are experiencing, and really monitor that space. Then have a separate one for the 90th percentile, because usually at the 90th percentile it’s so high, so slow compared to the median, that if you lose 1 second out of a 10-second wait, you sure got to look into it, but that’s not the top priority. Whereas if you’re losing, as in the example on the slide, 100 milliseconds, for half of your customers for the bulk of your requests, then you definitely should hop on to that one and identify what’s going on in there. Having a mix of the 50th and the 90th percentiles definitely was a big win for us to help us improve performance.
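A small worked example shows why the median and a high percentile need to be watched separately, and why the average hides the problem. The data below is invented: 80 fast requests and 20 slow ones, with a 100 ms regression that only affects the fast majority. The nearest-rank percentile helper is one common definition; monitoring tools may interpolate differently.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Invented response times in ms: the bulk of users plus a slow tail.
baseline = [500] * 80 + [10_000] * 20
# After a regression, the fast requests get 100 ms slower; the tail is unchanged.
regressed = [600] * 80 + [10_000] * 20

for name, data in (("baseline", baseline), ("regressed", regressed)):
    print(name, "p50:", percentile(data, 50),
          "p90:", percentile(data, 90), "mean:", mean(data))
```

Here the median moves from 500 ms to 600 ms, a 20% degradation for most users, while the 90th percentile stays at 10,000 ms and the mean only shifts from 2,400 ms to 2,480 ms.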

Key Learnings – Do’s and Don’ts

What we could do now is talk about key learnings that we got on top of that, because right now you’re in a world where you’re monitoring all those metrics, you get the right signals, and not the noise, but then, how do you optimize? I think one of the first mistakes we’ve made at Trainline, before we had the ready metrics, involved a metric named time to interactive, which just sounds like the ready metric. That moment where the website is interactive for your users. That sounds great. Actually, in practice, that metric doesn’t really work. It doesn’t work for optimization, because it doesn’t paint what the user actually experiences. That’s due to the fact of how it’s calculated. You’ve got the definition on the right. Actually, I like the original name of that metric, which was the time to consistently interactive, which is basically that moment where your website is interactive, and nothing else happens. It literally is idle. The trick with the TTI is that if ever there is an asynchronous operation still running in the background, the TTI is not reached yet, even though you don’t really care because your search widget is already there, loaded, and the customer is already able to interact with it. It doesn’t really matter, actually. This is how we went to dropping that metric, and focusing more on the page ready metrics.

Another key learning that we got from that journey was to not aim for a 100% Lighthouse score. As a developer, that one is very tempting. You got that great lab tool that tells you, “You got to do this, and that. Then you’re going to be perfect from a performance perspective.” It’s very interesting. It’s all lab data in Lighthouse, so that might not be what your users experience. Actually, you might have in production, some other pitfalls, or some other things that you can optimize that are actually of a higher value than going for the ones in Lighthouse. In terms of prioritization, that’s not the best one either. Then, also, it’s lab data. You’re really focusing on the first experience, whereas in the real world, if you have a lot of repeat customers, for instance, that just come back to your website, they have a lot of things already cached in their browser, so it might not be as bad as Lighthouse paints.

Another mistake we’ve learnt from was focusing on the size of our uncompressed JavaScript. Actually, when we looked at it at first it was like, actually, we have this big load time. Obviously, on the web, users have to download the code before actually being able to use the web application. It sounded like a great idea to actually look at the uncompressed JavaScript size, and then try to swap all those dependencies for lighter weight ones. Then just reduce that uncompressed size. We’ve gone through that work. We’ve done that. Then, when we released it to production, we were so disappointed, because in practice compression plus the use of CDNs is just so great that it didn’t really matter. There was a small improvement. I don’t really actually remember the number of milliseconds we gained out of it, but it was super small. That wasn’t the best one. What we did get out of this, though, is that measuring the size does matter. Actually, that went great as a guardrail. Having in your pipeline some automatic guardrails that look at the size of your bundles to ensure that no one merges a pull request that drastically raises the size of your JS is actually a good idea. When we set this up, we actually caught some mistakes where a developer in the team would have significantly grown the size of a web application without doing it knowingly. That was actually a good takeaway and a good learning.
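A pipeline guardrail of this kind can be as small as a script that compares each bundle's size with a budget and fails the build when the budget is exceeded. The sketch below is an assumption about how such a check might look, not Trainline's setup: it measures gzipped size, since the talk notes that compression changes the picture, and the bundle names and budget values are invented.

```python
import gzip


def gzipped_size(data: bytes) -> int:
    """Size in bytes of the data after gzip compression."""
    return len(gzip.compress(data))


def check_budgets(bundles, budgets):
    """Return a message for every bundle whose gzipped size exceeds its budget.

    bundles: dict of bundle name -> file content as bytes.
    budgets: dict of bundle name -> maximum allowed gzipped size in bytes.
    Bundles without a budget entry are not checked.
    """
    violations = []
    for name, content in bundles.items():
        limit = budgets.get(name)
        if limit is None:
            continue
        size = gzipped_size(content)
        if size > limit:
            violations.append(f"{name}: {size} bytes gzipped, budget is {limit}")
    return violations


# Invented bundle content standing in for a built JavaScript file.
bundles = {"main.js": b"console.log('hello');" * 1000}

print(check_budgets(bundles, {"main.js": 10_000}))  # within budget: []
print(check_budgets(bundles, {"main.js": 10}))      # over budget: one message
```

In a CI job, the bundles would be read from the build output and the script would exit with a non-zero status when the list of violations is not empty, which blocks the pull request.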

Do the Obvious Stuff

Obviously, we focused on the good practice and how you set up your monitoring so that you pick up the signal and focus on the experience. Out of this, you’re going to pick up a lot of stuff. All these are actually fairly no-brainers. It’s just in between all these feature development. What I want to say and what I was trying to convey, is rather than giving one-size-fits-all solutions, just set the right monitoring, and then you’ll find those out yourself. Then there’s just the obvious stuff as always, that performance is cumulative. It’s pointless to have a great, fast, sleek website if you have a very slow backend. That’s what you want to tackle. If you can skip entire steps of a flow to optimize for specific users that have some details or travel options with their profile, you should definitely do so. This is where you’re going to shave off more wait time for your users, rather than just go full technical.

Size does matter. Look at all your assets. Really, don’t load what you don’t need to, I’m thinking all those third-party dependencies. Lazy loading is always a good thing. Then, when you load something, you really only load what you use. If you can split your bundles so you don’t have to load the entire application on your first page, because most of the users are actually not going to use the rest of the application, it’s always a win. I’ve just put a few best practices in there like connection pooling, caching. When I talk about caching, it’s everything, at the platform level or within the application, and you optimize the payloads. All that basic, obvious stuff, definitely do so. Having the right monitoring is going to make sure that you keep the attention on all those little details that actually do matter in the end.

Then another one that was actually quite interesting is by setting this monitoring, sometimes in between releases, we get the spikes or this performance degradation, and we didn’t understand where it was coming from, and it was an advertising campaign, or it was the marketing team that introduced yet another tracking pixel with a synchronous external dependency on another server. Watch out for those. Definitely look at it. Have a process so that you’re working jointly with the marketing team to make sure that those don’t happen, because again, if you shave off a second out of there, and then they introduce a new tracking pixel that just wipes it off, it’s just such a bummer.

Don’t Focus on a Metric, Focus on User Experience

If there’s one key thing to take away, it’s just don’t focus on a single metric. A single speed metric does not exist. What we should be focusing on is really the user experience. To do that you’re going to need multiple metrics to find the right balance in between the load time, interactivity, layout shifts, but also making sure that the key action of each and every one of your pages becomes interactive as fast as possible.


What we’ve seen and talked about in here was about creating fast websites. It’s actually good karma for devs, users, and the planet. You should definitely spend some time having the right monitoring and focus to make it happen. Page experience is becoming a ranking factor in Google search, so it’s important to you, and it actually creates new opportunities for us. Start measuring your website performance because we optimize what we measure. There isn’t a single speed metric, find yours. It’s about finding balance. Then, I think the key one really, you have to collect data continuously, and link it to releases to be able to compare and iterate over your performance backlog, globally. The gist of it is focusing on the user experience. The way we did it and what worked best for us really was with the ready metrics. Then, in terms of how you monitor them, you need both synthetics and real user monitoring, and a mix of the 50th and 90th percentiles is what worked best for Trainline.

Questions and Answers

Humble: Can you give us a high level overview of what Trainline’s architecture looks like?

Anderson: It’s a microservice architecture. We have about 350, 400 services in the background. Those can be of multiple languages, either .NET, or Ruby, mostly. Some Node.js as well. Then on top of that, we got various frontends consuming either .NET APIs that serve as an orchestration layer, or Node.js APIs doing the same work, then our React.JS single page applications or mobile applications.

Humble: Would you expect the method you were describing to optimize web performance, would you expect that to work with any architecture?

Anderson: Yes, absolutely. Here, what we talked about really was, I think, a recipe to measure performance in a frontend application, which is very different from a backend application where you can just look at the transaction time, the request time, the time for a request that comes in and out of the box, whereas here, in the frontend, there are numerous signals. We talked about, don’t just optimize for the load time. Having a fast and nice web experience is much more than that.

Humble: You mentioned you were using Lighthouse, but obviously not to optimize for 100% Lighthouse score. Is there other tooling that you’re using specifically around optimizing for the core web vitals ranking trend that’s coming in?

Anderson: Yes, absolutely. I think the key tool that we’re using is SpeedCurve. The first graph that you’ve seen with the black background, and orange, that’s from SpeedCurve, which we’re using. What’s great about SpeedCurve is that it provides synthetics checks as well as LUX. It has a very nice waterfall. You can compare between two deployments, two versions of your websites with the waterfall, and you can get videos of how fast they load, and compare side by side. That’s actually quite nice. I really like that one. I think the other graphs that you’ve seen at the top, actually just came out from New Relic. It’s us having our own metrics and raising that through New Relic events, so that we can create those wait time graphs that we had. It’s a mix of standard tooling and in-house.

Humble: Could adding that much monitoring lead to performance issues in itself?

Anderson: The first one is, in the monitoring, you’re going to add probably a JavaScript script that is going to provide you with this monitoring tool. We talked about SpeedCurve. In real user monitoring, if you want to do that, you’re going to have to download that JavaScript. The SpeedCurve one is actually very lightweight, very small, so it’s not going to impact you so much. Yes, be careful about that. If you create your own or if you’re using a tool from a third party, just check how big that JavaScript is, because that adds weight to the page and weight to your application, so it could add to your load time. Then there’s several ways to do it. If you think it’s not primordial, you can actually load it asynchronously in the background a bit later, if that’s ok with you. It depends on what you have in your web page, but it’s also a possibility. Yes, it could impact your performance if it’s really big, but honestly, it’s a good investment because without that you’re blind. Pick a tool that’s lightweight, and then definitely add it.

Humble: When you talk about caching as a good practice, what kinds of caching are you referring to?

Anderson: Here I was talking about caching generally. The challenge with caching is always the same thing, is, when do you invalidate your cache? You got to be always super careful with that one. It’s at all layers, basically. You got platform caching, where you cache your content on your CDN, so it doesn’t go straight to your server, so you improve your time to first byte. It’s also caching of data within your application. Whenever you can spare yourself a round trip to the backend, it’s always a good thing. Again, be careful with the invalidation. Because definitely, our users expect to have the real-time, correct price on the tickets, so, today, if they click on something and then it turns out that it’s 20 quid more expensive later on, it’s not going to be a good experience. You got to be careful with those, because you cannot cache all the data. The more you do, and the more round trips you spare your users, the better it is for sure. It’s the same for assets and pretty much everything.
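The trade-off between sparing round trips and serving stale data can be illustrated with a small time-to-live cache that also supports explicit invalidation. This is a generic sketch, not Trainline's implementation; the class name, the 30-second TTL and the loader function are assumptions made for the example.

```python
import time


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}     # key -> (stored_at, value)

    def get(self, key, loader):
        """Return the cached value, calling loader() only on a miss or expiry."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        value = loader()  # the round trip we are trying to spare
        self._store[key] = (now, value)
        return value

    def invalidate(self, key):
        """Drop an entry immediately, e.g. when the underlying data changed."""
        self._store.pop(key, None)


# Hypothetical usage: station lists change rarely, so a TTL is acceptable.
def fetch_stations():
    return ["London", "Cambridge"]  # stands in for a backend call


stations_cache = TTLCache(ttl_seconds=30)
print(stations_cache.get("stations", fetch_stations))
```

Data such as ticket prices, which users expect to be correct in real time, would need a very short TTL, explicit invalidation, or no caching at all.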

Humble: Do you have a sense of how much impact you think core web vitals will have in terms of a ranking factor? I remember the SSL one which I have sites that are still not running on SSL, and to be honest, I can’t measure the difference. I’m curious to know what you think of this one. Do you think it will actually have a significant impact?

Anderson: It’s hard to say. I don’t work at Google. I think the answer is, I don’t know yet. What’s pretty clear is that they are going to be rolling it out. I think it got postponed a little bit in June now. The rollout is going to be gradual. We’re going to see. My thinking is that it’s so impactful that they’re going to take it slowly. They’re probably going to start ramping up on English languages first, and then Europe usually comes a few weeks after. It’s usually two to three weeks behind. I think over the summer, we’re going to start seeing how impactful it is. I think it’s fair to say that I think the example that you took with HTTPS is probably a right one in the beginning. Then we’ll see, because it’s just one indicator out of many others. Relevancy is still going to be the top main thing that you expect as a ranking factor.



Database Management Software Market to Witness Huge Growth by 2028 – Bulk Solids Handling

MMS Founder

Posted on mongodb google news. Visit mongodb google news

JCMR recently announced a Database Management Software study with 200+ market data tables and figures and an easy-to-understand, detailed TOC on “Database Management Software.” The Database Management Software industry report sets out different methods for maximizing your profit. The research study provides estimates for the Database Management Software forecast through 2029*. Some of the leading key companies covered in this research are IBM, SolarWinds, Oracle, SAP, Microsoft, Teradata, ADABAS, MySQL, FileMaker, Informix, SQLite, PostgreSQL, Amazon RDS, MongoDB, Redis, DbVisualizer.

Our report will be revised to address Pre/Post COVID-19 effects on the Database Management Software industry.

Click to get Database Management Software Research Sample PDF Copy Here @:

Database Management Software industry research for a leading company is an intelligent process of gathering and analyzing the numerical data related to services and products. This Database Management Software research aims to give an idea of your targeted customers’ understanding, needs and wants. It also reveals how effectively a company can meet their requirements. The Database Management Software market research collects data about customers, Database Management Software marketing strategy, and Database Management Software competitors. The Database Management Software manufacturing industry is becoming increasingly dynamic and innovative, with a greater number of private players entering the Database Management Software industry.

Important Features that are under offering & key highlights of the Database Management Software report:

1) Who are the leading key companies in the Global Database Management Software Data Survey Report?

– The following players are currently profiled in the report: IBM, SolarWinds, Oracle, SAP, Microsoft, Teradata, ADABAS, MySQL, FileMaker, Informix, SQLite, PostgreSQL, Amazon RDS, MongoDB, Redis, DbVisualizer

** List of companies mentioned may vary in the final Database Management Software report subject to Name Change / Merger etc.

2) What will the Database Management Software industry market size be in 2029 and what will the growth rate be?

In 2021, the Global Database Management Software Market size was xx million USD and it is expected to reach USD xx million by the end of 2029, with a CAGR of xx% during 2019-2029.

3) What are the Market Applications & Types:

The Database Management Software study is segmented by the following product types and major applications/end-user industries:

 Segment by Type
– Cloud Based
– On-premises

Segment by Application
– Large Enterprises
– SMEs

**The Database Management Software market is valued based on weighted average selling price (WASP) and includes any applicable taxes on manufacturers. All currency conversions used in the creation of this report have been calculated using constant annual average 2021 currency rates.

To comprehend Global Database Management Software Market dynamics in the world mainly, the worldwide Database Management Software Market is analyzed across major regions. JCMR also provides customized specific regional and country-level reports for the following areas.

• Database Management Software industry North America: United States, Canada, and Mexico.

• Database Management Software industry South & Central America: Argentina, Chile, and Brazil.

• Database Management Software industry Middle East & Africa: Saudi Arabia, UAE, Turkey, Egypt and South Africa.

• Database Management Software industry Europe: UK, France, Italy, Germany, Spain, and Russia.

• Database Management Software industry Asia-Pacific: India, China, Japan, South Korea, Indonesia, Singapore, and Australia.

Enquire for Database Management Software industry Segment Purchase@

Find more research reports on Database Management Software Industry. By JC Market Research.

Competitive Analysis:

The Database Management Software key players are focusing heavily on innovation in production technologies to improve efficiency and shelf life. The best long-term growth opportunities for this sector can be captured by ensuring ongoing process improvements and financial flexibility to invest in the optimal Database Management Software industry strategies. The company profile section for players such as IBM, SolarWinds, Oracle, SAP, Microsoft, Teradata, ADABAS, MySQL, FileMaker, Informix, SQLite, PostgreSQL, Amazon RDS, MongoDB, Redis, DbVisualizer includes basic information such as legal name, website, headquarters, market position, historical background and top 10 closest competitors by Database Management Software market capitalization / revenue, along with contact information. Each Database Management Software player’s revenue figures, growth rate and gross profit margin are provided in an easy-to-understand tabular format for the past 5 years, with a separate section on recent developments such as mergers, acquisitions or new product/service launches, including a SWOT analysis of each key player.

Database Management Software industry Research Parameter/ Research Methodology

Database Management Software industry Primary Research:

The primary sources involve the industry experts from the Database Management Software industry including the management organizations, Database Management Software related processing organizations, Database Management Software analytics service providers of the industry’s value chain. All primary sources were interviewed to gather and authenticate qualitative & quantitative information and determine the Database Management Software future prospects.

In the extensive Database Management Software primary research process undertaken for this study, the primary sources – Database Management Software industry experts such as CEOs, Database Management Software vice presidents, Database Management Software marketing director, technology & Database Management Software related innovation directors, Database Management Software related founders and related key executives from various key companies and organizations in the Global Database Management Software in the industry have been interviewed to obtain and verify both qualitative and quantitative aspects of this Database Management Software research study.

Database Management Software industry Secondary Research:

Secondary research provided crucial information about the Database Management Software industry’s value chain, the total pool of Database Management Software key players, and Database Management Software industry application areas. It also assisted in Database Management Software market segmentation according to industry trends to the bottom-most level, Database Management Software geographical markets and key developments from both Database Management Software market and technology-oriented perspectives.

Buy Full Copy with Exclusive Discount on Global Database Management Software Market Survey @

In this Database Management Software study, the years considered to estimate the market size of Database Management Software are as follows:

Database Management Software industry History Year: 2013-2019

Database Management Software industry Base Year: 2020

Database Management Software industry Estimated Year: 2021

Database Management Software industry Forecast Year 2021 to 2029

Key Stakeholders in Global Database Management Software Market:

Database Management Software Manufacturers

Database Management Software Distributors/Traders/Wholesalers

Database Management Software Subcomponent Manufacturers

Database Management Software Industry Association

Database Management Software Downstream Vendors

**Actual Numbers & In-Depth Analysis, Business opportunities, Database Management Software Market Size Estimation Available in Full Report.

Purchase Most Recent Database Management Software Research Report Directly Instantly @

Thanks for reading this article; you can also get individual chapter wise section or region wise Database Management Software report version like North America, Europe or Asia.

About Author:

JCMR global research and market intelligence consulting organization is uniquely positioned to not only identify growth opportunities but to also empower and inspire you to create visionary growth strategies for futures, enabled by our extraordinary depth and breadth of thought leadership, research, tools, events and experience that assist you for making goals into a reality. Our understanding of the interplay between industry convergence, Mega Trends, technologies and market trends provides our clients with new business models and expansion opportunities. We are focused on identifying the “Accurate Forecast” in every industry we cover so our clients can reap the benefits of being early market entrants and can accomplish their “Goals & Objectives”.

Contact Us:


Mark Baxter (Head of Business Development)

Phone: +1 (925) 478-7203


Connect with us at – LinkedIn


SQL Market Global Analysis 2021-2028: MongoDB, MariaDB, MySQL, Microsoft, MarkLogic …

MMS Founder

Posted on mongodb google news. Visit mongodb google news

SQL market research report provides conclusive results followed by an accurate data extraction process and compartmentalised study representation. The market study primarily targets to derive the SQL market size, volume and overall market share. Besides the statistical aspects of the SQL market, the research article also delivers factual and valuable data sourced from market participants such as the vendors, suppliers and providers. The market study displays an illustrative forecast with an agglomerated data representing the scope for business expansion along with the growth prospects. It identifies the growth fluctuations in the present scenario as well as the predictions during the forecast of the SQL market.

Vendor Profiling: SQL Market, 2020-28:

Major Companies Covered
MarkLogic Corporation
Oracle Database
Basho Technologies

We Have Recent Updates of SQL Market in Sample [email protected]

In addition, the research article defines the major causes fuelling the growth of the SQL market with a list of growth inducing variables and the inhibitors. The study identifies factors emerging from different industrial as well as non-industrial ecosystem to influence the growth of the SQL market. It assesses multiple demographic, economic, political, technological as well as factors associated with overall infrastructure to have either a positive or negative impact on the SQL industry. An array of drivers and restrains coupled with the opportunities and challenges are studied in-depth offering an accurate SQL market analysis.

Analysis by Type:

Major Types Covered

Analysis by Application:

Major Applications Covered
Online Game Development
Social Network Development
Web Applications Management

Major economies in certain geographic regions controlling the SQL market are analyzed. The geographic regions and countries covered in the study include:

• North America: Canada, U.S., and Mexico
• South America: Brazil, Ecuador, Argentina, Venezuela, Colombia, Peru, Costa Rica
• Europe: Italy, the U.K., France, Belgium, Germany, Denmark, Netherlands, Spain
• APAC: Japan, China, South Korea, Malaysia, Australia, Taiwan, India, and Hong Kong
• Middle East and Africa: Saudi Arabia, Israel, South Africa

Browse Full Report with Facts and Figures of SQL Market Report at @

An overview of the exact impact of the recent events following the evolution of novel COVID-19 is assessed in the research article. It presents the differing market scenarios pre-pandemic and post-pandemic, evaluating the disruptions and adversities caused as a result of the unprecedented crisis. The critical changes in intrinsic operations and other functions of the SQL market are evaluated thoroughly with the novel disease in mind. The study exposes the vulnerabilities and pitfalls of the SQL industry while also introducing new challenges for the market. It also covers the foreseeable opportunities that could encourage SQL market growth after the temporary halt.

Do You Have Any Query or Specific Requirement? Ask Our Industry [email protected]

Features of the Report
• The SQL market report offers a comparative analysis of industry.
• The performance analysis of all the industry segments, leading market bodies and influential regions in the SQL industry is included in the report along with market statistics.
• The record based on the study of market offers in-depth study of all the news, plans, investments, policies, innovations, events, product launches, developments, etc.

Finally, the report takes the readers through a complete qualitative and quantitative assessment, highlighting the competitive scope with primary focus on the chief competitors and their investment policies. It highlights the key players contributing substantial revenue along with their efforts toward the extensive development of the SQL market, studying the inclusion of advanced technologies and novel strategies to enhance the traction which will ultimately accelerate generation of revenue. The research article also covers recent activities including mergers, collaborations and acquisitions boosting the growth of the SQL market during the forecast period.

About Us:
Orbis Research is a single point aid for all your market research requirements. We have a vast database of reports from the leading publishers and authors across the globe. We specialize in delivering customized reports as per the requirements of our clients. We have complete information about our publishers and hence are sure about the accuracy of the industries and verticals of their specialization. This helps our clients to map their needs and we produce the perfect required market research study for our clients.

Contact Us:
Hector Costello
Senior Manager Client Engagements
4144N Central Expressway,
Suite 600, Dallas,
Texas 75204, U.S.A.
Phone No.: USA: +1 (972)-362-8199 | IND: +91 895 659 5155


Spark Troubleshooting, Part 1 – Ten Challenges

MMS Founder

Article originally posted on Data Science Central. Visit Data Science Central

“The most difficult thing is finding out why your job is failing, which parameters to change. Most of the time, it’s OOM errors…” Jagat Singh, Quora

Spark has become one of the most important tools for processing data – especially non-relational  data – and deriving value from it. And Spark serves as a platform for the creation and delivery of analytics, AI, and machine learning applications, among others. But troubleshooting Spark applications is hard – and we’re here to help. 

In this blog post, we’ll describe ten challenges that arise frequently in troubleshooting Spark applications. We’ll start with issues at the job level, encountered by most people on the data team – operations people/administrators, data engineers, and data scientists, as well as analysts. Then, we’ll look at problems that apply across a cluster. These problems are usually handled by operations people/administrators and data engineers. 

For more on Spark and its use, please see this piece in InfoWorld. And for more depth about the problems that arise in creating and running Spark jobs, at both the job level and the cluster level, please see the links below. There is also a good introductory guide here.

Five Reasons Why Troubleshooting Spark Applications is Hard

Some of the things that make Spark great also make it hard to troubleshoot. Here are some key Spark features, and some of the issues that arise in relation to them:

  1. Memory-resident. Spark gets much of its speed and power by using memory, rather than disk, for interim storage of source data and results. However, this can cost a lot of resources and money, which is especially visible in the cloud. It can also make it easy for jobs to crash due to lack of sufficient available memory. And it makes problems hard to diagnose – only traces written to disk survive after crashes. 
  2. Parallel processing. Spark takes your job and applies it, in parallel, to all the data partitions assigned to your job. (You specify the data partitions, another tough and important decision.) But when a processing workstream runs into trouble, it can be hard to find and understand the problem among the multiple workstreams running at once. 
  3. Variants. Spark is open source, so it can be tweaked and revised in innumerable ways. There are major differences among the Spark 1 series, Spark 2.x, and the newer Spark 3. And Spark works somewhat differently across platforms – on-premises; on cloud-specific platforms such as AWS EMR, Azure HDInsight, and Google Dataproc; and on Databricks, which is available across the major public clouds. Each variant offers some of its own challenges, and a somewhat different set of tools for solving them. 
  4. Configuration options. Spark has hundreds of configuration options. And Spark interacts with the hardware and software environment it’s running in, each component of which has its own configuration options. Getting one or two critical settings right is hard; when several related settings have to be correct, guesswork becomes the norm, and over-allocation of resources, especially memory and CPUs (see below) becomes the safe strategy.  
  5. Trial and error approach. With so many configuration options, how to optimize? Well, if a job currently takes six hours, you can change one, or a few, options, and run it again. That takes six hours, plus or minus. Repeat this three or four times, and it’s the end of the week. You may have improved the configuration, but you probably won’t have exhausted the possibilities as to what the best settings are. 
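To put rough numbers on the trial-and-error problem, here is a minimal Python sketch. The six-hour run time comes from the example above; the choice of settings, the candidate values, and the function name are all made up for illustration, and a real search space would be far larger.

```python
from math import prod


def exhaustive_search_hours(candidates, hours_per_run):
    """Return (runs, hours) needed to try every combination, one run at a time."""
    runs = prod(len(values) for values in candidates.values())
    return runs, runs * hours_per_run


# Hypothetical search space: three settings, three candidate values each.
candidates = {
    "spark.executor.instances": [4, 8, 16],
    "spark.executor.cores": [2, 3, 5],
    "spark.executor.memory": ["8g", "16g", "32g"],
}

runs, hours = exhaustive_search_hours(candidates, hours_per_run=6)
print(f"{runs} runs, about {hours} hours of cluster time")
```

Even this tiny grid works out to 27 runs and 162 hours, which is why nobody exhausts the possibilities by hand.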

Sparkitecture diagram – the Spark application is the Driver Process, and the job is split up across executors. (Source: Apache Spark for the Impatient on DZone.)

Three Issues with Spark Jobs, On-Premises and in the Cloud

Spark jobs can require troubleshooting against three main kinds of issues: 

  • Failure. Spark jobs can simply fail. Sometimes a job will fail on one try, then work again after a restart. Just finding out that the job failed can be hard; finding out why can be harder. (Since the job is memory-resident, failure makes the evidence disappear.) 
  • Poor performance. A Spark job can run slower than you would like it to; slower than an external service level agreement (SLA); or slower than it would do if it were optimized. It’s very hard to know how long a job “should” take, or where to start in optimizing a job or a cluster. 
  • Excessive cost or resource use. The resource use or, especially in the cloud, the hard dollar cost of a job may raise concern. As with performance, it’s hard to know how much the resource use and cost “should” be, until you put work into optimizing and see where you’ve gotten to. 

All of the issues and challenges described here apply to Spark across all platforms, whether it’s running on-premises, in Amazon EMR, or on Databricks (across AWS, Azure, or GCP). However, there are a few subtle differences:

  • Move to cloud. There is a big movement of big data workloads from on-premises (largely running Spark on Hadoop) to the cloud (largely running Spark on Amazon EMR or Databricks). Moving to cloud provides greater flexibility and faster time to market, as well as access to built-in services found on each platform. 
  • Move to on-premises. There is a small movement of workloads from the cloud back to on-premises environments. When a cloud workload “settles down,” such that flexibility is less important, then it may become significantly cheaper to run it on-premises instead. 
  • On-premises concerns. Resources (and costs) on-premises tend to be relatively fixed; there can be a lead time of months to years to significantly expand on-premises resources. So the main concern on-premises is maximizing the existing estate: making more jobs run in existing resources, and getting jobs to complete reliably and on time, to maximize the pay-off from the existing estate. 
  • Cloud concerns. Resources in the cloud are flexible and “pay as you go” – but as you go, you pay. So the main concern in the cloud is managing costs. (As AWS puts it, “When running big data pipelines on the cloud, operational cost optimization is the name of the game.”) This concern increases because reliability concerns in the cloud can often be addressed by “throwing hardware at the problem” – increasing reliability, but at greater cost. 
  • On-premises Spark vs Amazon EMR. When moving to Amazon EMR, it’s easy to do a “lift and shift” from on-premises Spark to EMR. This saves time and money on the cloud migration effort, but any inefficiencies in the on-premises environment are reproduced in the cloud, increasing costs. It’s also fully possible to refactor before moving to EMR, just as with Databricks. 
  • On-premises Spark vs Databricks. When moving to Databricks, most companies take advantage of Databricks’ capabilities, such as ease of starting/shutting down clusters, and do at least some refactoring as part of the cloud migration effort. This costs time and money in the cloud migration effort, but results in lower costs and, potentially, greater reliability for the refactored job in the cloud. 

All of these concerns are accompanied by a distinct lack of needed information. Companies often make crucial decisions – on-premises vs. cloud, EMR vs. Databricks, “lift and shift” vs. refactoring – with only guesses available as to what different options will cost in time, resources, and money. 

Ten Spark Challenges

Many Spark challenges relate to configuration, including the number of executors to assign, memory usage (at the driver level, and per executor), and what kind of hardware/machine instances to use. You make configuration choices per job, and also for the overall cluster in which jobs run, and these are interdependent – so things get complicated, fast. 

Some challenges occur at the job level; these challenges are shared right across the data team. They include: 

  1. How many executors should each job use?
  2. How much memory should I allocate for each job?
  3. How do I find and eliminate data skew? 
  4. How do I make my pipelines work better?
  5. How do I know if a specific job is optimized? 

Other challenges come up at the cluster level, or even at the stack level, as you decide what jobs to run on what clusters. These problems tend to be the remit of operations people and data engineers. They include:

  6. How do I size my nodes, and match them to the right servers/instance types?
  7. How do I see what’s going on across the Spark stack and apps? 
  8. Is my data partitioned correctly for my SQL queries? 
  9. When do I take advantage of auto-scaling? 
  10. How do I get insights into jobs that have problems? 

Section 1: Five Job-Level Challenges

These challenges occur at the level of individual jobs. Fixing them can be the responsibility of the developer or data scientist who created the job, or of operations people or data engineers who work on both individual jobs and at the cluster level. 

However, job-level challenges, taken together, have massive implications for clusters, and for the entire data estate. One of our Unravel Data customers has undertaken a right-sizing program for resource-intensive jobs that has clawed back nearly half the space in their clusters, even though data processing volume and jobs in production have been increasing. 

For these challenges, we’ll assume that the cluster your job is running in is relatively well-designed (see next section); that other jobs in the cluster are not resource hogs that will knock your job out of the running; and that you have the tools you need to troubleshoot individual jobs. 

1. How many executors and cores should a job use?

One of the key advantages of Spark is parallelization – you run your job’s code against different data partitions in parallel workstreams, as in the diagram below. The number of workstreams that run at once is the number of executors, times the number of cores per executor. So how many executors should your job use, and how many cores per executor – that is, how many workstreams do you want running at once?

A Spark job using three cores to parallelize output. Up to three tasks run simultaneously, and seven tasks are completed in a fixed period of time. (Source: Lisa Hua, Spark Overview, Slideshare.) 

You want high usage of cores, high usage of memory per core, and data partitioning appropriate to the job. (Usually, that means partitioning on the field or fields you’re querying on.) This beginner’s guide for Hadoop suggests two to three cores per executor, but not more than five; this expert’s guide to Spark tuning on AWS suggests that you use three executors per node, with five cores per executor, as your starting point for all jobs. (!) 
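The arithmetic behind that question is simple enough to sketch in Python. The three executors per node and five cores per executor are the starting point quoted above; the node count, the partition count, and the function names are assumptions chosen purely for illustration.

```python
from math import ceil


def task_slots(nodes, executors_per_node, cores_per_executor):
    """Number of tasks (workstreams) that can run at the same time."""
    return nodes * executors_per_node * cores_per_executor


def waves(partitions, slots):
    """How many rounds of tasks are needed to process every partition once."""
    return ceil(partitions / slots)


# Hypothetical four-node cluster using the 3 x 5 starting point.
slots = task_slots(nodes=4, executors_per_node=3, cores_per_executor=5)
print(slots)              # 60
print(waves(200, slots))  # 4
```

With 60 slots and 200 partitions, the job needs four waves of tasks, and the last wave fills only a third of the slots, which is one way low core usage creeps in. On YARN, the executor and core counts would typically be passed to spark-submit as --num-executors and --executor-cores.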

You are likely to have your own sensible starting point for your on-premises or cloud platform, the servers or instances available, and experience your team has had with similar workloads. Once your job runs successfully a few times, you can either leave it alone, or optimize it. We recommend that you optimize it, because optimization:

  • Helps you save resources and money (not over-allocating)
  • Helps prevent crashes, because you right-size the resources (not under-allocating)
  • Helps you fix crashes fast, because allocations are roughly correct, and because you understand the job better

2. How much memory should I allocate for each job?

Memory allocation is per executor, and the most you can allocate is the total available in the node. If you’re in the cloud, this is governed by your instance type; on-premises, by your physical server or virtual machine. Some memory is needed for your cluster manager and system resources (16GB may be a typical amount), and the rest is available for jobs. 

If you have three executors on a 128GB node, and 16GB is taken up by the cluster manager and system resources, that leaves about 37GB per executor. However, a few GB will be required for executor overhead; the remainder is your per-executor memory. You will want to partition your data so it can be processed efficiently in the available memory. 
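Here is the 128GB example worked through as a small Python sketch. The overhead rule is an assumption: it follows what we understand to be Spark's default for spark.executor.memoryOverhead (10% of executor memory, with a 384MB floor), which you should verify for your Spark version and platform. The function name is our own.

```python
def per_executor_memory_gb(node_gb, reserved_gb, executors_per_node,
                           overhead_fraction=0.10, min_overhead_gb=0.375):
    """Split one node's memory into a per-executor share, heap, and overhead.

    Assumed rule: overhead = max(min_overhead_gb, overhead_fraction * heap),
    with heap + overhead equal to the executor's share of the node.
    """
    share = (node_gb - reserved_gb) / executors_per_node
    heap = share / (1 + overhead_fraction)
    overhead = heap * overhead_fraction
    if overhead < min_overhead_gb:
        # The floor applies, so the heap is whatever remains of the share.
        overhead = min_overhead_gb
        heap = share - overhead
    return share, heap, overhead


share, heap, overhead = per_executor_memory_gb(128, 16, 3)
print(f"share {share:.1f}GB = heap {heap:.1f}GB + overhead {overhead:.1f}GB")
```

Under these assumptions, roughly 34GB of the 37GB share is left as executor memory, with a little over 3GB going to overhead.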

This is just a starting point, however. You may need to be using a different instance type, or a different number of executors, to make the most efficient use of your node’s resources against the job you’re running. As with the number of executors (see previous section), optimizing your job will help you know whether you are over- or under-allocating memory, reduce the likelihood of crashes, and get you ready for troubleshooting when the need arises. 

For more on memory management, see this widely read article, Spark Memory Management, by our own Rishitesh Mishra.

3. How do I handle data skew and small files?

Data skew and small files are complementary problems. Data skew tends to describe large files – where one key value, or a few, have a large share of the total data associated with them. This can force Spark, as it’s processing the data, to move data around in the cluster, which can slow down your task, cause low utilization of CPU capacity, and cause out-of-memory errors which abort your job. Several techniques for handling very large files which appear as a result of data skew are given in the popular article, Data Skew and Garbage Collection, by Rishitesh Mishra of Unravel. 
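A quick way to check whether a key is skewed is to count rows per key and compare the heaviest key with a typical one. The sketch below does this in plain Python on made-up data; the function name, the sample keys, and the use of the median as the "typical" size are all assumptions for illustration.

```python
from collections import Counter
from statistics import median


def skew_report(keys, top_n=3):
    """Return (ratio of heaviest key to median key, the top_n heaviest keys)."""
    counts = Counter(keys)
    if not counts:
        raise ValueError("no keys to analyze")
    typical = median(counts.values())
    heaviest = counts.most_common(top_n)
    return heaviest[0][1] / typical, heaviest


# Hypothetical join key where one value dominates.
keys = ["US"] * 9000 + ["DE"] * 400 + ["FR"] * 350 + ["JP"] * 250
ratio, heaviest = skew_report(keys)
print(ratio)     # 24.0
print(heaviest)  # [('US', 9000), ('DE', 400), ('FR', 350)]
```

In Spark itself, the same per-key counts would usually come from a groupBy on the key followed by count, ideally run on a sample of the data rather than the full set.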

Small files are partly the other end of data skew – a share of partitions will tend to be small. And Spark, since it is a parallel processing system, may generate many small files from parallel processes. Also, some processes you use, such as file compression, may cause a large number of small files to appear, causing inefficiencies. You may need to reduce parallelism (undercutting one of the advantages of Spark), repartition (an expensive operation you should minimize), or start adjusting your parameters, your data, or both (see details here). 
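If you do decide to repartition, you need a target partition count. One common rule of thumb is to aim for output files of a fixed size; the minimal sketch below assumes a 128 MiB target, which is our illustrative choice rather than a figure from this article, and the file sizes are invented.

```python
from math import ceil

MIB = 1024 ** 2


def target_partitions(total_bytes, target_file_bytes=128 * MIB):
    """Partition count that yields files of roughly target_file_bytes each."""
    return max(1, ceil(total_bytes / target_file_bytes))


# Hypothetical dataset: 5,000 files of 2 MiB each.
file_sizes = [2 * MIB] * 5000
total = sum(file_sizes)
print(target_partitions(total))  # 79
```

Here 5,000 small files collapse to 79 partitions. That number is what you would hand to a repartition or coalesce call before writing, keeping in mind that a full repartition shuffles the data.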

Both data skew and small files incur a meta-problem that’s common across Spark – when a job slows down or crashes, how do you know what the problem was? We will mention this again, but it can be particularly difficult to know this for data-related problems, as an otherwise well-constructed job can have seemingly random slowdowns or halts, caused by hard-to-predict and hard-to-detect inconsistencies across different data sets. 

4. How do I optimize at the pipeline level? 

Spark pipelines are made up of DataFrames, connected by Transformers (which calculate new data from existing data) and Estimators (which are fitted to data to produce a model). Pipelines are widely used for all sorts of processing, including extract, transform, and load (ETL) jobs and machine learning. Spark makes it easy to combine jobs into pipelines, but it does not make it easy to monitor and manage jobs at the pipeline level. So it’s easy for monitoring, managing, and optimizing pipelines to appear as an exponentially more difficult version of optimizing individual Spark jobs. 

Existing Transformers create new DataFrames, with an Estimator producing the final model. (Source: Spark Pipelines: Elegant Yet Powerful, InsightDataScience.) 

Many pipeline components are “tried and trusted” individually, and are thereby less likely to cause problems than new components you create yourself. However, interactions between pipeline steps can cause novel problems. 

Just as job issues roll up to the cluster level, they also roll up to the pipeline level. Pipelines are increasingly the unit of work for DataOps, but it takes truly deep knowledge of your jobs and your cluster(s) for you to work effectively at the pipeline level. This article, which tackles the issues involved in some depth, describes pipeline debugging as an “art.”
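One modest way to get some visibility at the pipeline level is to time each stage and see where the run spends its time. The sketch below uses plain Python functions as stand-ins for pipeline stages; it is not the Spark ML Pipeline API, and the stage names and sample rows are made up.

```python
import time


def run_pipeline(stages, data):
    """Run (name, function) stages in order, recording seconds spent in each."""
    timings = []
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings.append((name, time.perf_counter() - start))
    return data, timings


# Stand-in stages: parse CSV-like rows, drop empty values, convert to floats.
stages = [
    ("parse", lambda rows: [r.split(",") for r in rows]),
    ("filter", lambda rows: [r for r in rows if r[1] != ""]),
    ("to_float", lambda rows: [(r[0], float(r[1])) for r in rows]),
]

result, timings = run_pipeline(stages, ["a,1.5", "b,", "c,2.0"])
slowest = max(timings, key=lambda t: t[1])
print(result)      # [('a', 1.5), ('c', 2.0)]
print(slowest[0])  # name of the stage that took longest
```

In real Spark code, transformations are lazy, so wall-clock timing like this only means something when it wraps an action that forces the work to happen.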

5. How do I know if a specific job is optimized? 

Neither Spark nor, for that matter, SQL is designed for ease of optimization. Spark comes with a monitoring and management interface, Spark UI, which can help. But Spark UI can be challenging to use, especially for the types of comparisons – over time, across jobs, and across a large, busy cluster – that you need to really optimize a job. And there is no “SQL UI” that specifically tells you how to optimize your SQL queries.

There are some general rules. For instance, a “bad” – inefficient – join can take hours. But it’s very hard to find where your app is spending its time, let alone whether a specific SQL command is taking a long time, and whether it can indeed be optimized.  

Spark’s Catalyst optimizer, described here, does its best to optimize your queries for you. But when data sizes grow large enough, and processing gets complex enough, you have to help it along if you want your resource usage, costs, and runtimes to stay on the acceptable side. 

Section 2: Cluster-Level Challenges

Cluster-level challenges are those that arise for a cluster that runs many (perhaps hundreds or thousands of) jobs. They span cluster design (how to get the most out of a specific cluster), cluster distribution (how to create a set of clusters that best meets your needs), and allocation across on-premises resources and one or more public, private, or hybrid cloud resources. 

The first step toward meeting cluster-level challenges is to meet job-level challenges effectively, as described above. A cluster that’s running unoptimized, poorly understood, slowdown-prone and crash-prone jobs is impossible to optimize. But if your jobs are right-sized, cluster-level challenges become much easier to meet. (Note that Unravel Data, as mentioned in the previous section, helps you find your resource-heavy Spark jobs, and optimize those first. It also does much of the work of troubleshooting and optimization for you.) 

Meeting cluster-level challenges for Spark may be a topic better suited for a graduate-level computer science seminar than for a blog post, but here are some of the issues that come up, and a few comments on each:

6. Are Nodes Matched Up to Servers or Cloud Instances?

A Spark node – a physical server or a cloud instance – will have an allocation of CPUs and physical memory. (The whole point of Spark is to run things in actual memory, so this is crucial.) You have to fit your executors and memory allocations into nodes that are carefully matched to existing resources, on-premises or in the cloud. (You can allocate more or fewer Spark cores than there are available CPUs, but matching them makes things more predictable, uses resources better, and may make troubleshooting easier.) 
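
To make the matching exercise concrete, here is a minimal sketch of the arithmetic for fitting executors into one node. It assumes a common rule of thumb (reserve one core and 1 GB for the operating system and daemons, and use about five cores per executor) and models Spark's per-executor memory overhead as the larger of 10% of the heap or 384 MiB. The function and class names are hypothetical, and the defaults are starting points rather than recommendations for your cluster.

```python
from dataclasses import dataclass


@dataclass
class ExecutorPlan:
    executors_per_node: int
    cores_per_executor: int
    heap_gb: float  # candidate value for spark.executor.memory


def plan_executors(node_cores: int, node_mem_gb: float,
                   cores_per_executor: int = 5,     # assumed rule of thumb
                   reserved_cores: int = 1,         # assumed OS/daemon reservation
                   reserved_mem_gb: float = 1.0,    # assumed OS/daemon reservation
                   overhead_fraction: float = 0.10,
                   min_overhead_gb: float = 0.375) -> ExecutorPlan:
    """Fit whole executors into one node, leaving room for memory overhead."""
    usable_cores = node_cores - reserved_cores
    usable_mem_gb = node_mem_gb - reserved_mem_gb
    executors = usable_cores // cores_per_executor
    if executors < 1 or usable_mem_gb <= 0:
        raise ValueError("node is too small for even one executor")

    # Each executor's share of memory must hold heap + overhead, where
    # overhead = max(min_overhead_gb, overhead_fraction * heap).
    share_gb = usable_mem_gb / executors
    heap_gb = share_gb / (1 + overhead_fraction)
    if heap_gb * overhead_fraction < min_overhead_gb:
        heap_gb = share_gb - min_overhead_gb
    if heap_gb <= 0:
        raise ValueError("not enough memory per executor")
    return ExecutorPlan(executors, cores_per_executor, round(heap_gb, 2))


# A 16-core, 64 GB node: 15 usable cores give 3 executors of 5 cores each.
print(plan_executors(16, 64))
```

Under these assumptions, a 16-core, 64 GB node holds three five-core executors with roughly 19 GB of heap each; changing any of the reserved amounts shifts the result, which is why the inputs are exposed as parameters.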

On-premises, poor matching between nodes, physical servers, executors and memory results in inefficiencies, but these may not be very visible; as long as the total physical resource is sufficient for the jobs running, there’s no obvious problem. However, issues like this can cause datacenters to be very poorly utilized, meaning there’s big overspending going on – it’s just not noticed. (Ironically, the impending prospect of cloud migration may cause an organization to freeze on-prem spending, shining a spotlight on costs and efficiency.) 

In the cloud, “pay as you go” pricing shines a different type of spotlight on efficient use of resources – inefficiency shows up in each month’s bill. You need to match nodes, cloud instances, and job CPU and memory allocations very closely indeed, or incur what might amount to massive overspending. This article gives you some guidelines for running Apache Spark cost-effectively on AWS EC2 instances, and is worth a read even if you’re running on-premises, or on a different cloud provider. 

You still have big problems here. In the cloud, with costs both visible and variable, cost allocation is a big issue. It’s hard to know who’s spending what, let alone what the business results that go with each unit of spending are. But tuning workloads against server resources and/or instances is the first step in gaining control of your spending, across all your data estates. 

7. How Do I See What’s Going on in My Cluster?

“Spark is notoriously difficult to tune and maintain,” according to an article in The New Stack. Clusters need to be “expertly managed” to perform well, or all the good characteristics of Spark can come crashing down in a heap of frustration and high costs. (In people’s time and in business losses, as well as direct, hard dollar costs.) 

Key Spark advantages include accessibility to a wide range of users and the ability to run in memory. But the most popular tool for Spark monitoring and management, Spark UI, doesn’t really help much at the cluster level. You can’t, for instance, easily tell which jobs consume the most resources over time. So it’s hard to know where to focus your optimization efforts.  And Spark UI doesn’t support more advanced functionality – such as comparing the current job run to previous runs, issuing warnings, or making recommendations, for example. 

Logs on cloud clusters are lost when a cluster is terminated, so problems that occur in short-running clusters can be that much harder to debug. More generally, managing log files is itself a big data management and data accessibility issue, making debugging and governance harder. This occurs in both on-premises and cloud environments. And, when workloads are moved to the cloud, you no longer have a fixed-cost data estate, nor the “tribal knowledge” accrued from years of running a gradually changing set of workloads on-premises. Instead, you have new technologies and pay-as-you-go billing. So cluster-level management, hard as it is, becomes critical. 

8. Is my data partitioned correctly for my SQL queries? (and other inefficiencies) 

Operators can get quite upset, and rightly so, over “bad” or “rogue” queries that can cost way more, in resources or cost, than they need to. One colleague describes a team he worked on that went through more than $100,000 of cloud costs in a weekend of crash-testing a new application – a discovery made after the fact. (But before the job was put into production, where it would have really run up some bills.) 

SQL is not designed to tell you how much a query is likely to cost, and more elegant-looking SQL queries (i.e., those with fewer statements) may well be more expensive. The same is true of all kinds of code you have running. 

So you have to do some or all of three things:

  • Learn something about SQL, and about coding languages you use, especially how they work at runtime
  • Understand how to optimize your code and partition your data for good price/performance
  • Experiment with your app to understand where the resource use/cost “hot spots” are, and reduce them where possible

All this fits in the “optimize” recommendations from 1. and 2. above. We’ll talk more about how to carry out optimization in Part 2 of this blog post series. 

9. When do I take advantage of auto-scaling?

The ability to auto-scale – to assign resources to a job just while it’s running, or to increase resources smoothly to meet processing peaks – is one of the most enticing features of the cloud. It’s also one of the most dangerous; there is no practical limit to how much you can spend. You need some form of guardrails, and some form of alerting, to remove the risk of truly gigantic bills. 

The need for auto-scaling might, for instance, determine whether you move a given workload to the cloud, or leave it running, unchanged, in your on-premises data center. But to help an application benefit from auto-scaling, you have to profile it, then cause resources to be allocated and de-allocated to match the peaks and valleys. And you have some calculations to make, because cloud providers charge you more for on-demand resources – those you grab and let go of, as needed – than for reserved resources that you commit to and keep running for a long time. On-demand resources may cost two or three times as much as reserved ones. 

The first step, as you might have guessed, is to optimize your application, as in the previous sections. Auto-scaling is a price/performance optimization, and a potentially resource-intensive one. You should do other optimizations first. 

Then profile your optimized application. You need to calculate ongoing and peak memory and processor usage, figure out how long you need each, and the resource needs and cost for each state. And then decide whether it’s worth auto-scaling the job, whenever it runs, and how to do that. You may also need to find quiet times on a cluster to run some jobs, so the job’s peaks don’t overwhelm the cluster’s resources. 
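
The calculation described above can be sketched as a comparison between keeping peak capacity running at a lower committed rate and paying a higher as-needed rate only for what the profile uses. All numbers, rates, and function names below are hypothetical and exist only to show the shape of the arithmetic.

```python
from typing import List, Tuple

# A usage profile is a list of (nodes_needed, hours_at_that_level) entries.
Profile = List[Tuple[int, float]]


def fixed_cost(profile: Profile, committed_rate: float) -> float:
    """Cost of keeping peak capacity running for the whole period."""
    peak_nodes = max(nodes for nodes, _ in profile)
    total_hours = sum(hours for _, hours in profile)
    return peak_nodes * total_hours * committed_rate


def autoscaled_cost(profile: Profile, as_needed_rate: float) -> float:
    """Cost of paying the higher as-needed rate only for nodes actually used."""
    return sum(nodes * hours * as_needed_rate for nodes, hours in profile)


# Hypothetical per-node-hour rates: as-needed costs 2.5x the committed rate.
COMMITTED_RATE = 1.0
AS_NEEDED_RATE = 2.5

spiky = [(4, 20), (20, 4)]    # quiet most of the day, short peak
steady = [(18, 20), (20, 4)]  # close to peak all day

for name, profile in [("spiky", spiky), ("steady", steady)]:
    print(name, fixed_cost(profile, COMMITTED_RATE),
          autoscaled_cost(profile, AS_NEEDED_RATE))
```

With these made-up figures, the spiky job is cheaper when auto-scaled (400 versus 480), while the steady job is much cheaper on fixed capacity (480 versus 1,100), which is why profiling comes before the decision.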

To help, Databricks has two types of clusters, and the second type works well with auto-scaling. Most jobs start out in an interactive cluster, which is like an on-premises cluster; multiple people use a set of shared resources. It is, by definition, very difficult to avoid seriously underusing the capacity of an interactive cluster. 

So you are meant to move each of your repeated, resource-intensive, and well-understood jobs off to its own, dedicated, job-specific cluster. A job-specific cluster spins up, runs its job, and spins down. This is a form of auto-scaling already, and you can also scale the cluster’s resources to match job peaks, if appropriate. But note that you want your application profiled and optimized before moving it to a job-specific cluster. 

10. How Do I Find and Fix Problems?

Just as it’s hard to fix an individual Spark job, there’s no easy way to know where to look for problems across a Spark cluster. And once you do find a problem, there’s very little guidance on how to fix it. Is the problem with the job itself, or the environment it’s running in? For instance, over-allocating memory or CPUs for some Spark jobs can starve others. In the cloud, the noisy neighbors problem can slow down a Spark job run to the extent that it causes business problems on one outing – but leaves the same job to finish in good time on the next run. 

The better you handle the other challenges listed in this blog post, the fewer problems you’ll have, but it’s still very hard to know how to most productively spend Spark operations time. For instance, a slow Spark job on one run may be worth fixing in its own right, and may be warning you of crashes on future runs. But it’s very hard just to see what the trend is for a Spark job in performance, let alone to get some idea of what the job is accomplishing vs. its resource use and average time to complete. So Spark troubleshooting ends up being reactive, with all too many furry, blind little heads popping up for operators to play Whack-a-Mole with. 

Impacts of these Challenges

If you meet the above challenges effectively, you’ll use your resources efficiently and cost-effectively. However, our observation here at Unravel Data is that most Spark clusters are not run efficiently. 

What we tend to see most are the following problems – at a job level, within a cluster, or across all clusters:

  • Under-allocation. It can be tricky to allocate your resources efficiently on your cluster, partition your datasets effectively, and determine the right level of resources for each job. If you under-allocate (either for a job’s driver or the executors), a job is likely to run too slowly, or to crash. As a result, many developers and operators resort to…
  • Over-allocation. If you assign too many resources to your job, you’re wasting resources (on-premises) or money (cloud). We hear about jobs that need, for example, 2GB of memory, but are allocated much more – in one case, 85GB. 
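
A simple way to surface both problems is to compare what each job was allocated against the peak it actually used. The sketch below is illustrative only: the function name and the `waste_factor` threshold are assumptions, and the peak-usage figures would have to come from your own monitoring.

```python
def classify_allocation(allocated_gb: float, peak_used_gb: float,
                        waste_factor: float = 2.0) -> str:
    """Label a job by comparing allocated memory to observed peak usage.

    waste_factor is an assumed threshold: allocations above
    peak_used_gb * waste_factor are flagged as over-allocated.
    """
    if peak_used_gb <= 0:
        raise ValueError("peak_used_gb must be positive")
    if allocated_gb < peak_used_gb:
        return "under-allocated"
    if allocated_gb > peak_used_gb * waste_factor:
        return "over-allocated"
    return "ok"


# The case mentioned above: a job that needs 2 GB but was given 85 GB.
print(classify_allocation(allocated_gb=85, peak_used_gb=2))  # over-allocated
```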

Applications can run slowly, because they’re under-allocated – or because some apps are over-allocated, causing others to run slowly. Data teams then spend much of their time fire-fighting issues that may come and go, depending on the particular combination of jobs running that day. With every level of resource in shortage, new, business-critical apps are held up, so the cash needed to invest against these problems doesn’t show up. IT becomes an organizational headache, rather than a source of business capability. 


To jump ahead to the end of this series a bit, our customers here at Unravel are easily able to spot and fix over-allocation and inefficiencies. They can then monitor their jobs in production, finding and fixing issues as they arise. Developers even get on board, checking their jobs before moving them to production, then teaming up with Operations to keep them tuned and humming. 

One Unravel customer, Mastercard, has been able to reduce usage of their clusters by roughly half, even as data sizes and application density have moved steadily upward during the global pandemic. And everyone gets along better, and has more fun at work, while achieving these previously unimagined results. 

So, whether you choose to use Unravel or not, develop a culture of right-sizing and efficiency in your work with Spark. It will seem to be a hassle at first, but your team will become much stronger, and you’ll enjoy your work life more, as a result. 

You need a sort of X-ray of your Spark jobs, better cluster-level monitoring, environment information, and a way to correlate all of these sources into recommendations. In Troubleshooting Spark Applications, Part 2: Solutions, we will describe the most widely used tools for Spark troubleshooting – including the Spark Web UI and our own offering, Unravel Data – and how to assemble and correlate the information you need. If you would like to know more about Unravel Data now, you can download a free trial or contact Unravel.

Stanford Research Center Studies Impacts of Popular Pretrained Models

MMS Founder
MMS Sabri Bolkar

Article originally posted on InfoQ. Visit InfoQ

Stanford University recently announced a new research center, the Center for Research on Foundation Models (CRFM), devoted to studying the effects of large pretrained deep networks (e.g. BERT, GPT-3, CLIP) in use by a surge of machine-learning research institutions and startups.

As a multi-disciplinary research center, it includes 32 faculty members from computer science, law, psychology, and political science departments. The main goal of CRFM is to initiate studies of such foundation models and to develop new strategies for the future of responsible machine learning.

Along with the announcement, the CRFM team also published an in-depth report describing the pros and cons of using foundation models as backbone deep networks for large-scale applications such as image and natural language understanding. These downstream applications are created by fine-tuning the base network’s weights. Foundation models are trained by self-supervision at a massive scale, mostly using open data from different sources, and are deployed as few-shot learners.

The paper states that this situation creates homogeneity, as applications employ the same base models. Although the use of homogeneous high-capacity networks simplifies fine-tuning, the homogeneity also carries potential dangers, such as ethical and social inequalities, over to all downstream tasks. The paper emphasizes that fairness studies of such models deserve a special multi-disciplinary effort.

Another issue the report covers is the loss of accessibility. In the last decade, the deep-learning research community has favored open source as it leads to improved reproducibility and fast-paced development while propagating novel ideas. Open-source deep-network development frameworks such as Caffe, TensorFlow, PyTorch, and MXNet have had a major impact in popularizing and democratizing deep learning. However, as deep-network size goes well beyond a billion parameters, industry-led research code repositories and datasets are kept private (e.g. GPT-2) or commercialized by API endpoints (e.g. GPT-3). CRFM researchers underline the dangers of this barrier and point to the importance of government funding for possible resolution.

As applications of deep networks increase, research into the theory and understanding of deep learning is gaining attention. Direct usage of deep networks without proper analysis has previously triggered discussions in machine learning conferences. Deep neural networks consist of cascaded nonlinear functions that limit their interpretability. The main problem is the mathematical difficulty of analyzing such cascaded functions; hence, most research has focused on the analysis of simpler fully connected models. CRFM aims to go beyond simplified models and propose practical ideas for the commonly used pretrained networks.

Article: How to Decide in Self-Managed Projects – a Lean Approach to Governance

MMS Founder
MMS Ted Rau

Article originally posted on InfoQ. Visit InfoQ

Key Takeaways

  • Starting a new project requires having clarity on the basic decision-making processes and governance of the project. 
  • Instead of hierarchical or unstructured processes, a lean, flexible and robust set of governance decisions is encouraged; self-managed projects can set themselves up to organize their work. Self-governed projects go one step further and have authority to set their own purpose. 
  • To decide who decides, the first steps have to be clear as the project starts: the aim, the default decision-making method (e.g. consent), and a defined membership for the project. 
  • Members of the project can define the basic infrastructure that supports accountability: a lead role and a meeting facilitator role; basic meeting agreements to make sure meetings are balanced and intentional; and note-taking plus a parking log for future agenda items to connect meetings with each other. 
  • Once this basic system is established, the project members can mold the structure as needed by adding sub-teams and roles. 

Introduction: what’s the problem? 

When starting a new project, it’s often all honeymoon and best intentions. Everyone throws themselves into it. Starting new things is engaging and fun, and building a simple product together can be easy with just work and very little structure. 

But if it’s too little structure, then, all of a sudden, lack of alignment begins to creep up. “The others on the team didn’t want to slow down. They wanted to do things, to not have another project get eaten up by endless meetings,” the people who wanted to work more on alignment claimed. 

Yet, that lack of coordination wears people out as well. Mike builds something incompatible with what Celia built. Parvin wants to set project values first. And the leadership outside the team is worried about mission creep and cost. 

If the people in the project can make decisions themselves, we can call it self-managed. By “self-managed” (or self-organized), I mean that the project members can make decisions about the content of the work, and also who does what and by when. Self-managed groups have the advantage that those who do the work are closer to the decisions: decisions are better grounded in operations, and there is more buy-in and deeper insight from those who are going to carry out the tasks into how the tasks fit into the bigger picture. 

One step further would be to have a self-governed project. In a self-governed project, even the highest decisions and the overall direction (purpose) and goals can be modified or set together by the people in the project, giving the project members maximum buy-in, contribution and say, as no other institution or individual rules the project. 

Two ways to organize – and a third option

Whether self-managed or self-governed as a project, the power still needs to be distributed internally. If the project is open to decide how things are done, how do we decide? There are typically two ways to go about new projects.

  • Top-down approach: priorities and key decisions are made by the project lead.
    Advantages: clear and efficient.
    Disadvantages: as with many top-down structures, it’s easy to lack input from the team members, and engagement in the project often diminishes. 
  • Horizontal approach: everyone somehow decides together. 
    Advantages: inclusive (at first sight), engaging (for many).
    Disadvantages: often unstructured and inefficient.

Even worse, some projects oscillate between the two styles by sending mixed messages such as “Do it your way… but your way should be what I want to see”. Or the project is left alone for a while, only to then get overruled or “rescued.” 

The questions that come up time and again for every project are always the same:

  • Who decides in our project? For example, who sets the timeline, or is it set together?  Who has the last word on the budget? Who decides what tools we’ll use? 

If we don’t know who decides – or how to decide who decides – these follow-up questions become almost impossible to answer: 

  • What are we building together? 
  • Who is part of the project? 
  • How should we divide up the work? 

These questions are going to be answered no matter whether we decide them explicitly or not. If no process is followed, these decisions will simply fall into place. Without intentionality, we often fall into unhealthy patterns – for example, some might assume that their ideas are worth hearing but not others’. Or some might dominate meetings, while others stay quiet – and relevant information never gets shared. And once set, it becomes harder and harder to change the rules of the game in our project because unacknowledged practices are more challenging to address and change. 

The third option: an intentional process

The way out of this lies in governance. While some have an aversion to governance and structure – associating governance with bureaucracy and constraints – lean governance is very natural. Every group decision requires governance, whether we see it or not. A group of friends making plans to see a movie uses decision-making. How do we decide what movie to see? Whose voices are heard? Who makes the final call? If you’ve ever felt left out in such a situation – or sent way too many text messages to figure out which show to go to – you know that these processes aren’t always easy. What you don’t notice – because it feels smooth – or what you do notice – because it’s tedious – is governance. 

The trick is to use lean governance, intentionally and in our favor. The goal of governance in a new project is to provide just enough structure to operate well. Just enough team structure to have a clear division of labor. Just enough meeting structure to use our time well. Not more but also not less. That level of “just enough,” of course, depends on the phase of the project. For example, some in the project might want to establish a lot of roles in great detail that we might not even need, or design very detailed spreadsheets to support workflows we haven’t tried out a single time yet – that’s how structure can get overbuilt and become stifling for the project. Better to work incrementally, starting with little structure and refining it over time. I admit, it’s not easy at all to gauge, but it’s a clarifying question to ask ourselves: what are the first things one absolutely needs to decide in the beginning? And what can wait? The litmus test here is operations: do we have enough clarity to act for a while? If not, create clarity (I see that a lot in self-managed teams where roles are underspecified and people don’t carry out tasks because they are not sure what exactly they are asked to do). 

The process described in this article builds on the self-management system sociocracy, a combination of decentralized, nested decision-making and consent as a basis for decisions. The advantage of sociocracy is that it’s highly flexible and robust at the same time. That means we can use precisely the tools we need at the moment. We’re not stuck with a heavy, overbuilt system, and we’re also not lost in a laissez-faire approach. Instead, the process outlined in this article introduces the tools from sociocracy in a linear, logical order that builds over time just as a new project needs them.

Having seen dozens and dozens of groups in this stage, helping to set up their governance systems with sociocracy, here is what you need to know. 

Must-haves from the get-go

These are the decisions you need to decide before you do anything else. 

  • Aim: what’s the overall goal of the project? What are we signing up to do together? Typically, the project purpose or goal might be set from the outside; in this case, just having clarity on what is being asked is enough. Yet, in a self-governed project, the people in the project have a say in the project goal. Either way, it has to be clear and understood. 
    To be clear, it has to be specific. For example, “changing the world” is not specific. We all want to change the world but we all put our time into different things. The more specific that original aim is, the more likely it will be that people can do things together, which is what every project should be about. Bring a proposed aim to the first meeting, so you don’t have to waste time word-smithing together. 
  • Decision-making method: how do we make decisions? 
    Our choice of the decision-making method is key. Do we vote? Do we talk until we all agree (consensus)? Do we talk until half of us have had enough and leave? Do we talk until there’s no objection? 
    The tricky thing about decision-making methods is that they’re fundamental and axiomatic. Imagine you don’t have an agreed-upon decision-making method and want to adopt one. To agree on your decision-making method, what decision-making method will you use? Will you vote or aim for consensus? This is a chicken-and-egg problem many get trapped in. It remains an unsolvable issue of legitimacy – who decides how we decide? – but it’s much easier to address if we make that very decision first. 
  • Who is part of the project – for now? The point here is not to exclude people but to know who the founding members are. That way, we have a basis for decision-making to invite others into the group as desired. 

On each of these choices, it helps tremendously to propose something as a starting point. Then do a round of reactions and then make a decision. To use an example for the last point, groups often struggle to make decisions on project membership. Groups can easily spend 20 minutes in a meandering discussion. To catalyse the discussion, work with a proposal instead: let’s say you have a sense that the members you have right now are the right people to get started on the project with. Then just propose to keep the membership as is right now. Do one round of reactions where everyone expresses their thoughts, then ask if there’s any way the project will be harmed if the membership stays as is for now. That kind of decision-making is called consent, and it’s the default decision-making method in sociocracy. Consent is easy to do because you only need to make sure together that your decision doesn’t create harm. That way, not everyone needs to love it, but a warning voice from a trusted team member won’t get lost in the process. Once consent is in place, you can use it for all group decisions. To speed things up, you can define operational roles, find someone to fill each role, and thereby empower role holders by collective consent to make a defined range of decisions on their own.

Once the group has established consent, every decision is easy. Consent simply means: a decision moves forward if there’s no reason to stop it. Make a suggestion, allow for questions, do one round of reactions and potential amendments and then ask for consent. Don’t go down rabbit holes or engage in endless discussions about preferences. Move forward with something good enough and then improve and fine-tune over time. 

Back to our example of deciding about the membership, people might have differing opinions on the question, which they can express in the reactions. But then, as we move to consent, only objections can hold back the decision to approve the proposal to keep membership as is. That way, we don’t discuss it endlessly but the path becomes clear: if there’s no harm (= consent), we move on. If someone does object, they explain their objection. For example, they might say that currently no one in the project has enough experience on the financial side of running the project. To integrate the information the objection brings, we crowdsource ideas. Shall we get outside help on that topic as needed? Should we ask an extra person with those skills to join our team? Once everyone has added their ideas, make a new proposal (e.g. “We will keep membership as is and ask XX for support on the financial planning”) and check for consent again. Consent is pragmatic because we focus on the project goals, not our personal preferences. But the process is considerate because everyone is heard. 

Must-haves early on

From a governance point of view, it’s pretty clear what a project needs to run effectively, no matter what the project is even about. We need someone who “herds the cats” and pays attention to the project overall. In sociocracy, that’s the leader. The leader doesn’t dominate or boss people around. It’s about servant leadership in support of the project and its members, not autocracy. Having a facilitator is useful to make better use of our time in meetings, keep them short, focused, and make sure the voices in the room are heard. Then, in order to have transparency within our project or organization, good meeting notes are key, which means we need a secretary. 

Even the decision of who fills what role can be made by consent using the sociocratic selection process. Just recently, I was selected as the lead for a restructuring project out of the four people in the project, based on my experiences and the nominations by the other project members. In a different context, I was nominated as the leader of a circle at a moment that was a surprise to me. I had joined the meeting knowing there would be a selection, but not expecting to become leader. I was nominated by others (but I myself nominated other people). When the proposal was to give me the role, I was torn. Was I able to fill the role with attention and integrity? It was the transparent feedback and reasons that people had mentioned when they nominated me which convinced me. I could see that, given the moment in time for the team as well as my strengths and skills, it made perfect sense to have me be the leader. Since I was worried about the time commitment, I objected and asked to shorten the term to six months to try out whether I could handle it. Everyone consented and the decision was made!

For a successful project, there are a few more boxes to check:

  • How do we communicate with each other? Let’s not have loose emails and long reply-all threads flying around. Make a plan. Whether your answer is Google Groups, Slack, or any other tool, the point is to be on the same page and avoid chaos. Keep it simple and lean, and add complexity only if it’s needed for operations.
  • How do you track topics you need to address? Make a “parking log” (backlog in sociocracy jargon) and ensure you don’t lose items. If a project member cares about a topic, it builds trust if you track it well and undermines trust if you don’t. That one is a really easy tool to adopt. We often simply have a living list of future topics in our notes that we add to or remove from. Of course, Trello or similar tools work as well. Find the tool that works best for you. 
  • Do you need/want additional members in the group? Whatever your answer is, decide it together and avoid surprises or friction. Again, keep it simple – you can keep the team small and ask outsiders for advice on certain topics. 

Tools for connection and review

Collaboration is for people, and people are more than their work. The better we know each other, the better we can create working relationships and working conditions that make it engaging and pleasant to work together. Sometimes friction in teams discharges over what look like content questions (e.g. “Should we use software tool A or B?”), but the true underlying conflict is whether we feel respected. Knowing each other reduces those frictions and increases psychological safety, which will positively impact our collaboration and the quality of our work. Instead of working with strangers and doing group-building exercises once every six months, spend a few minutes here and there to get to know each other, or build connection and review moments into your processes.

  • Rounds – the practice of speaking one by one – in parts of your meetings help us listen to each contribution better and to get more of a sense of the people on our team. Also, it feels really good to be heard without interruptions. 
  • Do a round of check-ins at the beginning, and do a round of meeting evaluations at the end. How are you doing coming in here? How are you doing as you leave? Did this work for everyone? 
  • Review your decisions from time to time in a quick go-around: how is this working for you? What could be improved to make it work better? 
  • In new groups, invest 15 minutes and let everyone share their story and hear what interested them in the project. This only takes a few minutes, but it’s a good investment. The better you understand each other’s motivations and skills, the better you can work with their energy in the project.


The processes described here work best for a one-group project. But group projects can scale by “budding out” into sub-teams and can grow beyond one core group. You can define those sub-teams in your main team and define one or two people who serve on the main group and the sub-team to keep them well-aligned. 

When is it time to form sub-groups? That’s easy to answer: as soon as you talk about a topic in the main group and only some of the group members are involved in it. Why talk about a topic in a group of eight if it only affects the work of three of us? It’s much more efficient to “bud out”! With that approach, it’s much more likely that you will not overbuild structure.

With nested, linked sociocratic circles and roles, we can build a system where every discussion is directly related to the work of each person in the room. A sub-group can grow when there’s energy and “fold” when a sub-project is completed – giving us a structure that fits like a glove every step of the way. Who makes decisions together directly impacts the quality and ease of decision-making. If the people who are making a decision together are the people who actually know about the project first-hand and will carry those decisions out together, it will be much easier to find common ground. If people are not actually involved and just have opinions, decisions can be slowed down. So the better the match between those who are directly involved and the decisions they need to make together, the smoother decision-making will be.

Learning more

Starting a project isn’t hard – most of the steps are always the same once you see the governance patterns. A solid but flexible set of tools and practices like sociocracy is a great starting point to have clear but lean processes that can grow as we grow.  Then you can lean back, leave the drama and endless discussions behind, and focus on the project’s work – enjoy the ride!

The key governance ingredients of a successful project as described in this article are taken from the book Who Decides Who Decides. How to start a group so everyone can have a voice – a book that lays out each process in more detail. Additionally, it arranges the decisions to be made across the first three meetings of any group. The book also includes an outlook on how to grow and “bud out.” You can see the meeting templates for all three meetings on the Who Decides Who Decides page. Its accompanying resource page provides demo videos of the processes it describes.

About the Author

Ted Rau is a trainer, consultant and author. He grew up in suburban Germany and studied linguistics, literature and history in Tübingen before earning his PhD in linguistics there in 2010. He moved to the USA and fulfilled a long-held desire to live in an intentional community. He studied self-management and now works full time teaching/consulting, writing, and as executive director of the non-profit movement support organization Sociocracy For All.

Subscribe for MMS Newsletter

By signing up, you will receive updates about our latest information.

  • This field is for validation purposes and should be left unchanged.

NoSQL Database Market 2021 Latest Insights, Growth Rate, Future Trends and Forecast …

MMS Founder

Posted on nosqlgooglealerts. Visit nosqlgooglealerts

Reports Globe offers a research-based global study and analysis of the Global NoSQL Database Market. This report provides an in-depth overview of the drivers and limitations in the market. The NoSQL Database market report also provides historical data and five-year forecasts for the industry and contains socio-economic data from around the world. Key stakeholders can review the statistics, tables, and figures in this strategic planning report to support the success of their organization. It highlights strategic production, revenue and consumption trends that players can use to increase sales and growth in the global NoSQL Database market. It focuses on the latest developments, sales, market value, production, gross margin and other important business factors of major players operating in the global NoSQL Database market. Players can use the market facts, figures and statistical studies provided in the report to understand the current and future growth of the global NoSQL Database market.

This report provides an assessment of various drivers, government policies, technological innovations, emerging technologies, opportunities, market risks, constraints, market barriers, challenges, trends, competitive landscapes and segments that provide a true picture of growth in the global NoSQL Database market.

The segmentation chapters enable readers to understand aspects of the market such as its products, available technology and applications. These chapters are written to describe their development over the years and the course they are likely to take in the coming years. The research report also provides detailed information on new trends that may define the development of these segments in the coming years.

NoSQL Database Market Segmentation:

NoSQL Database Market, By Application (2016-2027)

  • E-Commerce
  • Social Networking
  • Data Analytics
  • Data Storage
  • Others

NoSQL Database Market, By Product (2016-2027)

  • Column
  • Document
  • Key-value
  • Graph

Major Players Operating in the NoSQL Database Market:

  • DynamoDB
  • ObjectLabs Corporation
  • Skyll
  • MarkLogic
  • InfiniteGraph
  • Oracle
  • MapR Technologies
  • The Apache Software Foundation
  • Basho Technologies
  • Aerospike

Company Profiles – This is a very important section of the report that contains accurate and detailed profiles for the major players in the global NoSQL Database market. It provides information on the main business, markets, gross margin, revenue, price, production and other factors that define the market development of the players studied in the NoSQL Database market report.

Global NoSQL Database Market: Regional Segments

A separate section on regional segmentation covers the regional aspects of the worldwide NoSQL Database market. This chapter describes the regulatory structure that is likely to impact the complete market. It highlights the political landscape in the market and predicts its influence on the NoSQL Database market globally.

  • North America (US, Canada)
  • Europe (Germany, UK, France, Rest of Europe)
  • Asia Pacific (China, Japan, India, Rest of Asia Pacific)
  • Latin America (Brazil, Mexico)
  • Middle East and Africa

The Study Objectives are:

  1. To analyze global NoSQL Database status, future forecast, growth opportunity, key market and key players.
  2. To present the NoSQL Database development in North America, Europe, Asia Pacific, Latin America & Middle East and Africa.
  3. To strategically profile the key players and comprehensively analyze their development plan and strategies.
  4. To define, describe and forecast the market by product type, market applications and key regions.

This report includes the estimation of market size by value (million USD) and volume (K Units). Both top-down and bottom-up approaches have been used to estimate and validate the size of the NoSQL Database market and to estimate the size of various other dependent submarkets in the overall market. Key players in the market have been identified through secondary research, and their market shares have been determined through primary and secondary research. All percentage shares, splits, and breakdowns have been determined using secondary sources and verified primary sources.

Some Major Points from Table of Contents:

Chapter 1. Research Methodology & Data Sources

Chapter 2. Executive Summary

Chapter 3. NoSQL Database Market: Industry Analysis

Chapter 4. NoSQL Database Market: Product Insights

Chapter 5. NoSQL Database Market: Application Insights

Chapter 6. NoSQL Database Market: Regional Insights

Chapter 7. NoSQL Database Market: Competitive Landscape

How Reports Globe is different than other Market Research Providers:

The inception of Reports Globe has been backed by providing clients with a holistic view of market conditions and future possibilities/opportunities to reap maximum profits out of their businesses and assist in decision making. Our team of in-house analysts and consultants works tirelessly to understand your needs and suggest the best possible solutions to fulfill your research requirements.

Our team at Reports Globe follows a rigorous process of data validation, which allows us to publish reports from publishers with minimum or no deviations. Reports Globe collects, segregates, and publishes more than 500 reports annually that cater to products and services across numerous domains.

Contact us:

Mr. Mark Willams

Account Manager

US: +1-970-672-0390



Dedicated ML Track at QCon Plus Nov: Learn All About the Latest ML Innovations

MMS Founder
MMS Adelina Turcu

Article originally posted on InfoQ. Visit InfoQ

Dio Synodinos, President of C4media (creators of InfoQ and QCon), recently spoke with Frank Greco, Senior Technology Consultant, Chairman at NYJavaSIG, and QCon Plus November 2021 Committee Member, to discuss the topics and track he’s looking forward to attending this November at the QCon Plus online software conference:

QCon is practitioner-based, so you leave with a long list of things you want to try. Since it’s an early adopter audience, you learn what not to try, which is just as valuable as things to try. Also, you’re targeting senior architects and senior developers so it’s nice to be part of that crowd.

There are so many cool tracks this November. I’m quite Java-centric so I’m pretty excited about that. I’ve also been focused on machine learning for the past few years. There are many interesting talks about putting machine learning into production. And as I said before, there are things that you learned that work and certain things that may not work. I’m looking forward to hearing the speakers in the machine learning track and learn some of their best practices.

Frank Greco, Senior Technology Consultant, Chairman at NYJavaSIG, and QCon Plus November 2021 Committee Member

[embedded content]

Frank’s Highlight Track for QCon Plus: ML Everywhere

Machine Learning (ML) is pervasive and it affects different verticals in the economy, from sales operations to forecasting applications, marketing campaigns, and healthcare systems. ML is really transforming the world around us, creating an avenue to innovation across all sectors of the global economy.

In this track you will learn all about the latest ML innovations and the multiple fields where and how ML is being applied and deployed; in particular, you will learn how ML operations (MLOps) can accelerate AI adoption, and how to use the latest open-source frameworks to improve real-world ML applications.

The ML Everywhere track is hosted by Francesca Lazzeri, principal data scientist manager at Microsoft, and the first confirmed speaker is Chip Huyen, founder at stealth startup & teaching ML Sys at Stanford.

MLOps is not a static set of tools that defines the way you operationalize your machine learning models: it is more about your organization’s culture and the capability of sharing the AI vision across different teams and roles! Join me to learn more at QCon Plus.

Francesca Lazzeri, principal data scientist manager at Microsoft and QCon Plus November 2021 Track Host

Attend QCon Plus this November to:

  • Validate which technologies, trends, and best practices should be on your radar. And identify which shouldn’t.
  • Explore the use cases that innovative software development teams are focusing on to help you improve your competitive edge.
  • Plan which skills you and your team should be investing in to be better prepared for the future.
  • Identify actionable insights from 64+ world-class domain experts that you can start working on right now.
  • Connect with a global software engineering community.

Book your spot at QCon Plus happening this November 1-12. 

Database Market Global Analysis 2021-2028: Rackspace Hosting, Oracle, Cassandra …

MMS Founder

Posted on mongodb google news. Visit mongodb google news

This study of the Database market presents a strategic evaluation of integral aspects of the market, including business growth and development strategies, sales and marketing, supply chain, and cost structure, along with the revenue trajectory of the Database market. The revenue trajectory covers sales and profit records for the present Database market scenario as well as the highest levels achieved in the past. The Database market report meets clients’ needs for the latest industry updates, recent events, influential changes and the extent of the industry’s growth prospects. It derives market data with a particular focus on the forecast period.

Vendor Profiling: Database Market, 2020-28:

Rackspace Hosting
Amazon Web Services

The forecast is composed of multiple predictions associated with the rise in demand and the overall revenue growth the Database market is anticipated to achieve. The growth predictions are based on an accurate understanding of the current Database market scenario, identifying crucial growth-altering factors such as drivers and restraints. An efficient forecast is primarily driven by the evaluation of growth opportunities and challenges, indicating what the Database market would look like in the near future. The study also analyses some significant market trends influencing the growth of the Database market and widening the opportunities during the forecast period.
Analysis by Type:


Analysis by Application:

Small and Medium Business
Large Enterprises

Major economies in certain geographic regions controlling the Database market are analyzed. The geographic regions and countries covered in the study include:

• North America: Canada, U.S., and Mexico
• South America: Brazil, Ecuador, Argentina, Venezuela, Colombia, Peru, Costa Rica
• Europe: Italy, the U.K., France, Belgium, Germany, Denmark, Netherlands, Spain
• APAC: Japan, China, South Korea, Malaysia, Australia, Taiwan, India, and Hong Kong
• Middle East and Africa: Saudi Arabia, Israel, South Africa

A detailed representation of all the major challenges is crucial for the Database market report suggesting the equivalent importance of assessing major obstructive factors. Furthermore, the market study evaluates the significant changes in the Database market dynamics with the emergence of novel COVID-19. Besides assessing the changes in business models and methodologies, the Database market study report also thoroughly evaluates the impact of the pandemic on the metric aspects of the industry including sales and profit. Followed by the instability proposed by the crisis, the market study intently studies the major business initiatives resuming the growth of the Database market.

Features of the Report
• The Database market report offers a comparative analysis of industry.
• The performance analysis of all the industry segments, leading market bodies and influential regions in the Database industry is included in the report along with market statistics.
• The record based on the study of market offers in-depth study of all the news, plans, investments, policies, innovations, events, product launches, developments, etc.

Business initiatives of major players in the Database market are assessed with a view to anticipating the potential status of the Database market in the near future. The assessment involves an in-depth analysis of the fiercely competitive environment and the competing players of the Database market, providing clients with a list of the leading players in the market along with their ranking based on revenue generation. It also includes a brief overview of each competitor profile, studying the core strengths and development plans of players from the most to the least revenue generating in the Database market.

About Us:
Orbis Research is a single point aid for all your market research requirements. We have a vast database of reports from the leading publishers and authors across the globe. We specialize in delivering customized reports as per the requirements of our clients. We have complete information about our publishers and hence are sure about the accuracy of the industries and verticals of their specialization. This helps our clients to map their needs and we produce the perfect required market research study for our clients.

Contact Us:
Hector Costello
Senior Manager Client Engagements
4144N Central Expressway,
Suite 600, Dallas,
Texas 75204, U.S.A.
Phone No.: USA: +1 (972)-362-8199 | IND: +91 895 659 5155
