For Review The Cloud Show Recording April 26
Nov 1, 2023
0:00
Hello everyone and welcome to the Cloud Show
0:03
It is the show where we talk about cloudy things or everything that is related to Cloud projects
0:09
We're going to have some interesting guests on our show that have some subject matter expertise
0:16
about various things about Cloud. This show doesn't have to be technical
0:20
but it certainly can be technical. Today, we're going to open up
0:24
with a topic that is quite technical. We're going to talk about how to choose various databases and why to do that
0:31
To help me with that conversation, I have two guests in the audience today or in the waiting room today
0:37
I want to welcome in my two guests, Mark Brown and Charles. Charles, welcome
0:46
You guys are both with Microsoft, but you work on databases and in different teams
0:52
or near each other, I guess. Behind Mark Brown, we see a great big Cosmos DB logo
0:59
You're some principal PM manager or something on the Cosmos DB team
1:06
Charles, you are a principal group product manager? That's right. On the Postgres team, right
1:14
That's right. I run the product management for Postgres at Microsoft. Brilliant. Welcome
1:19
We're largely on the same team. Yeah. You're on the same team
1:24
Yeah. Except you, Mark, you're down in California. And when you're not on the air and not running Cosmos stuff, you're chasing your two sons around
1:34
Pretty much, right? Right. That's right. Charles, are you up in Redmond
1:41
Yep. I'm up in sunny Redmond. So what do you do when you're not working
1:47
Oh, I'm always working. Postgres waits for nobody, but I like to get out for exercise sometimes as well
1:55
Brilliant. So the reason I wanted to call you guys to the show today
1:59
was that there is this burning question that technologists want to choose
2:05
maybe some cool database storage technology and bring that into the project
2:10
And they're all excited about it and they're coming to talk to maybe their manager
2:14
or maybe some business person about, you know, oh, we should use a graph database
2:18
or oh, we should use a so-and-so database. Honestly, a lot of people don't even know what the difference is
2:27
between different database technologies. That's why I wanted to talk to you guys today
2:32
Does that sound good? Yeah. Absolutely. One of my favorite topics. That's right. Let's jump into it and let's think about this with databases
2:45
First of all, not too long ago, databases were essentially just SQL databases
2:51
relational databases, right? And then something happened. A lot of stuff has happened since
2:58
So could you guys talk a little bit about that, the explosion of various kinds of databases? Sure
3:07
So, I mean, we're all familiar with the relational databases, the RDBMS that's out there
3:13
everybody's used them for years and years and years. In fact, for decades
3:18
they've been around since the seventies. NoSQL databases, however, I mean
3:25
while it's popular to think they've been around for not so long
3:30
they've actually had kind of flavors of these things going back into the nineties. Like Lotus Notes was like a document database kind of before they really called them
3:37
document databases. You had like network databases and other things that were similar to what a graph is today
3:43
They've kind of gotten, they've fallen out of favor and then reformulated in new ways
3:47
But the latest emergence that we see today kind of happened kind of in the
3:51
early two thousands when you had companies that were generating data, just enormous scale companies like Yahoo and Google
4:01
Facebook, Amazon. And kind of what happened is, you know, many of these companies were using relational databases like MySQL
4:09
as well. And they quickly realized that it wouldn't keep up, right? It wouldn't work with the scale
4:16
neither the data, nor the fact that it changed rapidly as well. Like things like Facebook
4:22
I mean, there's no rhyme or reason to the data that they store in there. A lot of these companies
4:27
were generating logs that themselves just can change at a moment's notice for no reason at all
4:32
and relational databases just aren't built for that rapid change in the schema
4:39
In terms of different kinds of databases, there really are a couple of
4:44
large groups or variants of databases. You have your relational databases and then you
4:49
have a couple of other options that you might hear. Tell us which major kinds of databases do we see
4:58
I think as Mark started to allude to, the explosion really of the internet is what drove these massive volumes of data
5:08
And really what happened with NoSQL is they were geared towards virtually infinite scale
5:13
if you will, at the expense of a lot of the fundamentals that we're familiar with relational
5:18
databases because these workloads simply didn't need them. Now what we're observing is that
5:24
distributed SQL is becoming a real requirement for customers as well. And so whereas NoSQL was very semi-structured and enabled you to use really variable schemas and different design approaches
5:39
Relational or distributed relational is really interesting to a lot of people
5:44
who know relational databases, right? And as Mark mentioned, everyone has used one
5:50
but it gives you effectively that scalability to run a cluster. So multiple machines horizontally scaled, which blows through a lot of the traditional limits of relational databases as we previously knew them running on a single machine
6:06
And then, of course, when you run them in the cloud, you get the benefits of that on-demand elasticity. So the ability to scale these workloads up and down
6:12
Now, typically, these operational workloads that run on these platforms are not hyper-elastic in the sense that you probably don't turn them off for hours on end as you may do for analytics
6:26
But there are still scenarios where operational workloads can burst. Common examples in retail is things like seasonal sales or promotions where you expect to grow your traffic for a period of days, for example
6:39
And so in those scenarios where you really want the relational properties of the database, where you're familiar with the technology, using things like constraints and referential integrity can be really beneficial to your application, but you get the scalability to work across multiple machines
7:00
The largest machine in the cloud today is probably a couple hundred virtual cores, I think maybe a little more than that
7:07
But once you start adding machines together, obviously that's an entirely new level of compute scale for running big and variable sized apps
7:18
Yeah, definitely. So, okay. So the scale of it is the driver and also the fact that we have a big ball that we are revolving around
7:29
We are on a globe and there's lots of distance between different ends
7:34
So when I'm talking about distributed databases, what's the key kind of values that you want to get from that
7:43
And sort of, I know that it has been, only the biggest corporations were able to run these in the past
7:51
And now when we have cloud offerings from various cloud companies, I think that essentially what has happened is
7:58
that has been a little bit democratized that now everyone can run these because they are fully managed
8:03
You just say, I want to have a database and I want it to be global, thank you
8:08
And then it is. So tell us more about that because that is such a game changer
8:14
for driving any kind of a global presence. I think I'll give it to Mark in a sec
8:21
just to talk about the global scalability. I think that the point that you make
8:26
around democratizing the database applies broadly to everything in the cloud. There were definitely technologies
8:34
that were limited to the biggest companies, the biggest workloads, traditionally when you were running on-premise
8:43
And really that was tied to the need to have a significant upfront capital investment
8:48
The fundamental business model of the cloud makes all technology available to everybody
8:54
Things like AI are really hot topics at the moment with VMs that are carrying huge numbers of GPUs
9:01
Again, it's an expensive resource to purchase yourself, but to take it in the cloud makes it completely accessible as well
9:09
So when we think about that, we recognize that customers can now have access
9:14
all customers have access to effectively all of this technology, but it's still important to be able to deliver it at a price point
9:21
at a scale point that meets their needs. A pay-as-you-go model is really of little value to most people
9:28
if the entry point is still very, very high. And so that full spectrum of scale
9:34
being able to start very cost-effectively, regardless of who you are, and then grow into something as big as you need it to be
9:40
is a really important facet of distributed systems in particular. Mark, do you want to add to the global part
9:48
Yeah, so that is kind of one of the key features for a database like us is the ability to do kind of very easy replication or at least easy for a
9:58
user into frankly as many regions as they want and it takes kind of a distributed system to kind
10:05
of another level where it's not just distributed across the machines now it's distributed across
10:11
the planet so you can run you know workloads that are kind of not possible in really using
10:18
any other type of database where you can have low latency for users that are on different continents
10:25
and have all that data be replicated and be consistent on a global basis
10:32
Those are new types of challenges that are kind of the next stage, if you will
10:38
from growing from massive scale, internet-sized companies to, as Charles put it perfectly
10:45
now the cloud has democratized this. Anybody can run applications at this scale
10:51
and can grow their business to meet needs where they got customers located
10:56
on all corners of the planet because they can get a database close to them. And that's fundamentally why you would do it: the closer you get the data, the lower the latency and the faster the performance. Not only can you get the database close to you
11:11
if there are changes anywhere on the planet, all of those changes are just transparently moved to
11:19
the other place and you don't have to think about it. It's just present where you are with your data
11:25
That's right. That's really cool. Tell me about the difference between, I know you can get a SQL database or a relational database
11:34
distributed around the globe as well as a NoSQL database distributed in the same way
11:41
So SQL, NoSQL, compare and contrast. When do you choose one or the other
11:47
What's the difference? I think with respect to distributed SQL, I can talk a little bit about what they do well
11:56
I think I mentioned a little earlier, if you've built on a relational database, and it doesn't
12:01
matter which one, it could be any vendor, the fundamental concepts and object model are the
12:07
same. The language is fundamentally the same. There are some slightly different keywords used
12:14
amongst different SQL dialects, but if you've written a select statement for one relational
12:19
database, you've written close enough to a select statement for all. And so that familiarity of
12:24
you know, tables, views, you know, types, understanding the value of types for enforcing data quality
12:32
defining constraints on data, defining indexes of various types for performance benefits
12:38
All of those have always been applicable to relational and they're equally applicable in distributed systems
12:45
distributed SQL. When you think about sort of the scenarios that these play into
12:51
Mark kind of alluded to this with the ability to put things closer to the user
13:00
The need for doing that is because I don't think anyone will
13:05
solve the speed of light problem, where you introduce latency when you start sending traffic across oceans
13:12
And users notice that. If you're using a mobile app that has very fast refresh rates, you're using an app on your PC
13:20
You don't want to be waiting for every refresh to circumnavigate the globe
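To put rough numbers on the speed of light point, here is a minimal sketch of the physical lower bound on round-trip time. The distances are hypothetical, and the factor of two thirds for light in optical fiber is a common approximation, not a measurement of any particular network. Real latency is higher once routing, queuing, and processing are added.

```python
# Back-of-the-envelope lower bound on round-trip time over a fiber path.
# All figures are illustrative assumptions, not measurements.

SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum, km/s
FIBER_FACTOR = 2 / 3               # light in fiber travels at roughly 2/3 of c (approximation)


def min_round_trip_ms(distance_km: float) -> float:
    """Return the physical lower bound on round-trip time in milliseconds."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_s * 1000


if __name__ == "__main__":
    # Hypothetical path lengths: a nearby region versus a cross-ocean hop.
    for label, km in [("nearby region", 100), ("across an ocean", 9_000)]:
        print(f"{label}: at least {min_round_trip_ms(km):.1f} ms per round trip")
```

Under these assumptions a 9,000 km path costs about 90 ms per round trip before any work is done, which is why an app that makes several sequential calls per refresh feels slow when the data lives on another continent.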
13:27
And so that ability to co-locate the set of data for the user is very important for the experience of running the app
13:36
But the challenge is you don't want to, you could say, okay, well, that's easy in the cloud
13:40
I'll just deploy a database into every region. That starts to incur operational and management overhead
13:45
And so the ability to have effectively a single logical database that spans the globe
13:51
means that you maintain that lower operational cost for running it, and you're giving the users the best latency experience
13:57
And that's really the best of both. And there's a tradeoff, like with anything in a distributed system, right
14:01
For us, we have very small, relatively small shard sizes at 20 gig
14:05
and that's basically running on a slice of a VM. That's tough to manage because you have to figure out how to separate or partition your data
14:14
But the trade-off or the upside is that you can scale out really, really well and basically get equivalent performance, whether it's a megabyte size database or even a petabyte
14:25
It just doesn't matter when you're in a scale out scenario like that. Right, right
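The idea of separating data across many small shards can be sketched as hash partitioning. This is a generic illustration only: the shard count, the key format, and the `shard_for` helper are hypothetical, and this is not a description of how Cosmos DB or any other product actually places data.

```python
import hashlib
from collections import Counter

SHARD_COUNT = 8  # hypothetical number of shards


def shard_for(partition_key: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a partition key to a shard number using a stable hash."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then wrap to the shard count.
    return int.from_bytes(digest[:8], "big") % shard_count


if __name__ == "__main__":
    # Hypothetical keys: one partition key per customer.
    keys = [f"customer-{i}" for i in range(10_000)]
    spread = Counter(shard_for(key) for key in keys)
    for shard in sorted(spread):
        print(f"shard {shard}: {spread[shard]} keys")
```

The hard part the speaker describes is choosing the partition key: a key with many distinct, evenly used values spreads load across shards, while a key with few values concentrates it on one.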
14:32
So talking about upsides to these databases and the different qualities and values
14:40
And on the flip side of it, when would you not choose a relational database over, for example, a graph database or a key-value store
14:57
Like, what is the difference there in terms of your business? You want to take that one, Mark
15:03
I mean, the downsides with NoSQL are that they can be less forgiving in some respects
15:15
They are not well understood in terms of how to model and partition data
15:19
These are still really unknown to most users who are designing systems for data or just developers in general
15:30
It's just, there are some conceptual things that are a barrier to them that you have to kind of overcome and understand
15:40
You are in some cases forced to make very complicated decisions very early on when you're designing systems like this. That's getting better
15:52
You know, at some level NoSQL systems are at a disadvantage because you've got
15:58
50 plus years of computer science behind relational databases and the query engines
16:04
and indexing and lock mechanisms and all of the technology that's built within
16:09
relational database management systems that doesn't exist yet for the NoSQL world. It's
16:14
still relatively new and raw compared to that. So that's a good point. That's a very good point, yeah
16:20
It's for better or worse, the SQL database, the relational model, that's the one we know as an industry
16:30
That's why it is such a challenge, I believe, for a lot of companies, or rather for less technical people, the business people, the business side of the house
16:43
to understand when technologists in your company come running, metaphorically speaking, with their new choice of database
16:53
that you've never heard about. You don't know what it is. You don't know what it's capable of. And you certainly don't know what the difference is
16:59
This is the beauty of the distributed SQL model is that it took what was a major disadvantage 20 years ago and adopted kind of that capability in terms of the scale-out mechanism in there
17:12
Now, and even with stuff like JSONB, where you can take and index JSON, you can even handle scenarios where you have highly mutable schemas within your data
17:24
So they've really, the thing has come full circle almost, if you will
17:30
Now, there are still trade-offs to be made, but this is an evolution we're witnessing in terms of kind of these systems
17:38
Yeah, it really is. So we're about to round off here, but let's see if we can boil it down towards the end here
17:45
When would you choose, for what kind of a problem would you choose a graph database where the data is a graph of nodes connected by edges
17:58
When would you choose that? I mean, when you have highly relational data, frankly
18:04
In fact, many graph workloads can be satisfied by using relational databases because you're, in fact, describing a relationship within there
18:13
The difference between graph and a relational database is the relationship is materialized with the data versus it being handled at runtime, which is what you do in a relational database, right
18:25
That's functionally a key difference in that. And that allows you to have some performance that you cannot get with a relational database and its ability to do very large graph traversals that bog down a relational database when you get into the higher end of that
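The contrast between a relationship resolved at runtime and one materialized with the data can be sketched as follows. This is a simplified illustration, not how any particular engine works: the edge rows and node names are hypothetical, and a real relational database would use indexes rather than a full scan on every hop.

```python
from collections import deque

# Hypothetical edge rows, the shape a relational "edges" table might have.
EDGE_ROWS = [
    ("plant", "substation_a"),
    ("plant", "substation_b"),
    ("substation_a", "house_1"),
    ("substation_a", "house_2"),
    ("substation_b", "house_3"),
]


def reachable_by_scanning(start, rows):
    """Relational style: every hop re-examines the edge rows, like a join done at query time."""
    seen, frontier = {start}, [start]
    while frontier:
        next_frontier = []
        for src, dst in rows:  # one pass over all rows per hop
            if src in frontier and dst not in seen:
                seen.add(dst)
                next_frontier.append(dst)
        frontier = next_frontier
    return seen - {start}


def build_adjacency(rows):
    """Graph style: store each node's neighbors alongside the node itself."""
    adjacency = {}
    for src, dst in rows:
        adjacency.setdefault(src, []).append(dst)
    return adjacency


def reachable_by_adjacency(start, adjacency):
    """Follow the stored neighbor lists directly, touching only connected nodes."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen - {start}


if __name__ == "__main__":
    adjacency = build_adjacency(EDGE_ROWS)
    print(sorted(reachable_by_scanning("plant", EDGE_ROWS)))
    print(sorted(reachable_by_adjacency("plant", adjacency)))
```

Both functions return the same set of nodes. The difference is the work per hop: the scanning version revisits every edge row at each level, while the adjacency version only touches the neighbors of nodes it has already reached, which matters as traversals get deep and the graph gets large.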
18:44
So stuff that is inherently networks of things, could be people. I've heard about people using them for the electrical grid. Talk about a network, right? That's exactly what that is, to model an electrical grid
18:58
or some network. Supply chain is another huge use case because you can do bill of
19:04
materials. Like, whether it's a carton of milk, if you use a graph to manage your supply chain
19:10
and bill of materials, I can tell you what cow that milk came from, right? That's the kind of
19:14
thing you can do. All right, I don't know if any dairy farmers are using
19:19
a graph, but you could literally apply it in that type of fashion. That's good
19:23
to know. And so can I just ask also about document databases? Because again, managers know relational
19:31
databases, they don't know when a document, when they would use a document database
19:38
The typical scenario, quite often, is when you have unstructured data. That is why it is there
19:45
You have data that either operates like a document, so in and of itself is kind of a
19:52
totally owned entity, if you will, of data that always shifts with itself. But that's kind of the
19:58
typical scenario is that there's a flexible or mutable schema involved and necessary there
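A flexible or mutable schema can be illustrated with plain JSON documents. This is a minimal sketch with hypothetical data and a hypothetical `find` helper, not the API of any document database: each document is self-contained and carries whatever fields it needs, and queries simply skip documents that lack the field.

```python
import json

# Hypothetical documents: no two are required to share the same fields.
RAW_DOCUMENTS = [
    '{"id": "1", "type": "order", "customer": "ada", "items": ["milk", "eggs"]}',
    '{"id": "2", "type": "order", "customer": "bob", "items": ["milk"], "gift_note": "Enjoy!"}',
    '{"id": "3", "type": "log", "level": "warn", "message": "disk almost full"}',
]


def load(raw_documents):
    """Parse each JSON string into a dictionary."""
    return [json.loads(raw) for raw in raw_documents]


def find(documents, field, value):
    """Return documents whose field equals value; documents without the field are skipped."""
    return [doc for doc in documents if field in doc and doc[field] == value]


if __name__ == "__main__":
    documents = load(RAW_DOCUMENTS)
    print([doc["id"] for doc in find(documents, "type", "order")])
    print([doc["id"] for doc in find(documents, "level", "warn")])
```

Adding the `gift_note` field to one order required no change to the other documents, which is the property that a fixed relational schema does not give you without a migration.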
20:05
Brilliant. All right. Any last words from either of you or both? I'll just add one quickly. I think with respect to the distributed SQL, one of the common myths that
20:17
I'll take the opportunity to dispel quickly is that they are simply big versions of single box machines
20:26
That's not entirely the case. There are still data modeling considerations to ensure that a data model is optimized
20:33
for a distributed system. And commonly we see that when customers are moving large monolithic models onto distributed SQL
20:41
that does require a little bit of rework. But new apps, maybe multi-tenant SaaS apps
20:46
IoT apps, things that have data models that actually shard very cleanly across a cluster
20:54
It's really where customers should start. It gives you a lot of headway to grow into the future
20:58
But just a word of warning that those monolithic models do need a little bit of work
21:02
And that really applies to any distributed system. Once you introduce relationships or constraints
21:08
being checked across nodes in the cluster, same speed of light problem occurs
21:12
Less time than transiting the ocean, but still noticeable when you've been doing these things
21:17
in a single box machine previously. So, good best practice. Just to reinforce that
21:23
this is one of the biggest challenges I face when I talk to developers when they're coming onto the cloud
21:27
is they need to reframe how they think because you're working in a distributed systems environment
21:32
and in a distributed systems world, there's always going to be some refactoring necessary
21:35
You cannot lift and shift what sits in your data center, your colo, or under your desk
21:40
and shove it into the cloud and expect it to work. It just doesn't work like that
21:44
It doesn't work like that. All right. And if you get it right, the scalability's magic
21:48
Yep. It's worth the design for it. Yeah, you got to design for it, and it's a big world out there
21:54
So with those words, I think it's about time to wrap this topic up
21:59
I really want to thank both of you guys for being with me today on the Cloud Show
22:04
and I hope to talk to you again sometime. Have a great evening. Thank you
22:07
Thank you, Magnus