0:03
I'm Nikolay, CEO and co-founder at Coroot, and today I'm going to talk about observability. A few words about me: I work for Coroot, where we built an open source observability platform, and we focus on turning telemetry data into answers, actionable insights, and so on. I've noticed that a lot of other speakers, when talking about observability, focus on storage, on particular platforms, on how to gather data for troubleshooting and how to store it, but almost nobody talks about how to use all this data to find answers. Today I'll try to focus on this particular question.

0:57
Let's start from the very beginning: what is observability? In simple words, observability is being able to answer questions about your systems. When you deploy your applications into a production environment, you should be able to answer questions like: how is the app performing right now? Are there any failed requests, and why do they fail? Other questions can be related to latency, and so on.

1:36
Why do we need to answer these questions? First, we should keep our system available for users: if you run a business based on a website or a mobile app, your application can make money only while it's available to users. To be honest, having all that observability stuff doesn't guarantee you 100% availability, but all kinds of bad situations happen in production, so you should be ready to mitigate these issues. The second reason is that you should keep your application's performance as high as possible, because lower application latency means better conversions, and that's good for business. The last one is costs: the better your application performs, the less money you spend on the cloud, so optimizing the application can lead to reduced cloud costs.

2:56
These days, every problem starts from data: we collect data, analyze it, and draw conclusions. Observability is no different, so before anything else we need to talk about data.

3:17
Let's start from a really simple example. We have a trivial "hello world" web application, we deployed it in production, and we know nothing about its performance because it is not instrumented with any logging, tracing, or metrics. We have some questions about how it behaves in production, so let's think about how to improve our application to make it observable.

3:57
The most obvious way to add some observability is logging, because from our code we can add a few lines and write anything we want to a journal: the number of requests, their latency, and so on, to be able to see application performance. Logging, from my perspective, is the simplest way to make your application observable.
4:30
But logging is not without its costs: it's pretty expensive to store and analyze logs at scale. The second limitation is that you cannot analyze a really intensive log stream in real time: our brain is not able to read 100 messages per second, so to understand what's happening using logs, you need some processing to make them more digestible.

5:16
For example, we can transform our logs into metrics. Metrics, from my perspective, are aggregations over raw events: when you have three events in your log, you can calculate their count and get a metric, events per second for example, and you can put it on a chart that's easy to digest. On the other hand, at the point where we aggregated our events, we lost the ability to see any particular event; this is the primary limitation of metrics. Still, working with metrics is much cheaper than with logs and traces, and we understand how to work with them: we can build dashboards, we can set up alerting rules, and it's super easy.
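As a sketch of that aggregation step (the event shape with a `ts` field is an assumption for illustration):

```python
from collections import Counter

def events_per_second(events):
    """Aggregate raw log events (each carrying a 'ts' unix timestamp in
    seconds) into a per-second count: a metric derived from the log."""
    counts = Counter(int(event["ts"]) for event in events)
    return dict(sorted(counts.items()))

events = [
    {"ts": 100.1, "msg": "GET /"},
    {"ts": 100.7, "msg": "GET /"},
    {"ts": 101.2, "msg": "GET /cart"},
]
print(events_per_second(events))  # {100: 2, 101: 1}
```

Note how the individual messages disappear in the output: exactly the trade-off described above.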
6:26
The next observability signal is traces. A trace is a set of events grouped by a trace ID, which means you can gather all the data relevant to a particular request and see the whole picture of how, for example, a particular user request has been processed through the system. Here is a sample request trace: we see the duration of each stage, we see the services involved in processing this particular request, and we see errors. It's super cool; it makes your application's workflow much more visible.
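A toy sketch of the grouping idea, assuming flat span records with hypothetical `trace_id`, `start_ms`, and `duration_ms` fields:

```python
from collections import defaultdict

def group_spans(spans):
    """Group flat span records by trace_id so each trace can be viewed whole."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    return traces

def trace_duration_ms(trace):
    """A trace's wall-clock duration: earliest start to latest end."""
    start = min(s["start_ms"] for s in trace)
    end = max(s["start_ms"] + s["duration_ms"] for s in trace)
    return end - start

spans = [
    {"trace_id": "t1", "service": "frontend", "start_ms": 0, "duration_ms": 120},
    {"trace_id": "t1", "service": "cart", "start_ms": 10, "duration_ms": 40},
    {"trace_id": "t1", "service": "redis", "start_ms": 15, "duration_ms": 5},
]
traces = group_spans(spans)
print(trace_duration_ms(traces["t1"]))  # 120
```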
workflow application workflow more uh visible but uh traces are as logs uh
7:27
visible but uh traces are as logs uh
7:27
visible but uh traces are as logs uh more expensive uh comparing to Matrix
7:31
more expensive uh comparing to Matrix
7:31
more expensive uh comparing to Matrix for example and you should find the
7:33
for example and you should find the
7:33
for example and you should find the balance usually it is done by adding
7:36
balance usually it is done by adding
7:36
balance usually it is done by adding sample sampling when you store only the
7:39
sample sampling when you store only the
7:39
sample sampling when you store only the fraction of all your data and the S uh
7:43
fraction of all your data and the S uh
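One common flavor is head-based sampling that is deterministic on the trace ID, so every service keeps or drops the same traces and sampled traces stay complete; a sketch, not any particular SDK's algorithm:

```python
import hashlib

def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Hash the trace ID into [0, 1) and keep the trace if it falls below
    the sampling rate. The decision depends only on the ID, so every
    service that sees the same trace makes the same choice."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

kept = sum(keep_trace(f"trace-{i}", 0.10) for i in range(10_000))
print(kept)  # roughly 1,000 of 10,000 traces survive a 10% rate
```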
7:43
The fourth telemetry signal is profiles. Profiles can help you understand, precisely down to a particular line of code, why at some point in time your application consumed more CPU time or more memory. Usually profiles are represented as flame graphs, where a wider area means more resource consumption. We can select some area on the chart and understand which functions and calls in our code are responsible for that consumption, or we can compare a CPU or memory usage spike with a baseline from a previous period and understand what changed.
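Under the hood, a flame graph is built from aggregated stack samples. A minimal sketch of collapsing samples into the "folded stacks" text format that flame graph tooling typically consumes (the sample data is made up):

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse raw stack samples into 'folded' form (frames joined by ';').
    The count next to each folded stack becomes the width of its area in
    the flame graph."""
    return Counter(";".join(stack) for stack in samples)

samples = [
    ["main", "handle_request", "render"],
    ["main", "handle_request", "render"],
    ["main", "handle_request", "query_db"],
]
folded = fold_stacks(samples)
print(folded["main;handle_request;render"])  # 2
```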
and understand what was changed uh my co-founder Peter zv uh
8:49
changed uh my co-founder Peter zv uh
8:50
changed uh my co-founder Peter zv uh asked his network uh on what is the most
8:54
asked his network uh on what is the most
8:54
asked his network uh on what is the most important pillar of observability and
8:56
important pillar of observability and
8:56
important pillar of observability and the majority of people ask answer like
8:59
the majority of people ask answer like
8:59
the majority of people ask answer like magx because everyone knows how to work
9:03
magx because everyone knows how to work
9:03
magx because everyone knows how to work with magx from my perspective the
9:06
with magx from my perspective the
9:06
with magx from my perspective the correct answers is you need all of them
9:09
correct answers is you need all of them
9:09
correct answers is you need all of them and next we will see why but before
9:14
and next we will see why but before
9:14
and next we will see why but before let's talk um about how uh we gather all
9:18
let's talk um about how uh we gather all
9:18
let's talk um about how uh we gather all that tetri
9:20
that tetri data there are like really quick uh
9:23
data there are like really quick uh
9:23
data there are like really quick uh really quickly uh we
9:25
We have a de facto standard way to instrument our applications: OpenTelemetry. OpenTelemetry is a vendor-neutral project that provides you with SDKs for many programming languages, Java, .NET, almost everything. It supports gathering metrics, logs, and traces, and now profiles are being added to the supported signals. But it requires some changes on your side: on some platforms, like Java, you can just attach the agent and all your code will be instrumented automatically, but in languages like Go you need to import the relevant SDK and redeploy your applications to production.
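On the receiving end of that pipeline usually sits a collector. A minimal OpenTelemetry Collector configuration might look roughly like this; treat it as a sketch, with the `debug` exporter standing in for whatever backend you actually ship data to (exporter names vary by collector version):

```yaml
receivers:
  otlp:                 # applications send OTLP over gRPC or HTTP
    protocols:
      grpc:
      http:

processors:
  batch:                # batch telemetry before export to reduce overhead

exporters:
  debug:                # stand-in; replace with your backend's exporter

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
```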
applications into production an alternative method to
10:24
production an alternative method to
10:24
production an alternative method to gather
10:25
gather telemetric Telemetry data is ebpf ebpf
10:29
telemetric Telemetry data is ebpf ebpf
10:30
telemetric Telemetry data is ebpf ebpf is a feature of the Linux kernel that
10:32
is a feature of the Linux kernel that
10:32
is a feature of the Linux kernel that allows you to run small programs in
10:35
allows you to run small programs in
10:35
allows you to run small programs in kernel space and instrument various uh
10:40
kernel space and instrument various uh
10:40
kernel space and instrument various uh kernel space or user space function
10:43
kernel space or user space function
10:43
kernel space or user space function calls especially coru um you actively
10:47
calls especially coru um you actively
10:47
calls especially coru um you actively use uses ebpf to provide users with zero
10:51
use uses ebpf to provide users with zero
10:51
use uses ebpf to provide users with zero instrumentation experience when you set
10:53
instrumentation experience when you set
10:53
instrumentation experience when you set up the agents and that's it you it it
10:57
up the agents and that's it you it it
10:57
up the agents and that's it you it it provides you with insights in in
11:00
provides you with insights in in
11:00
provides you with insights in in minutes and in case of ebpf uh Linux Kel
11:04
minutes and in case of ebpf uh Linux Kel
11:04
minutes and in case of ebpf uh Linux Kel has provides some guarantees that there
11:08
has provides some guarantees that there
11:08
has provides some guarantees that there will be not uh high like it it will not
11:13
will be not uh high like it it will not
11:13
will be not uh high like it it will not affect your application latency uh
11:16
affect your application latency uh
11:16
affect your application latency uh because it has uh the complexity limit
11:19
because it has uh the complexity limit
11:20
because it has uh the complexity limit it ensures that every BPF
11:22
it ensures that every BPF
11:22
it ensures that every BPF program like uh should be
11:26
program like uh should be
11:26
program like uh should be super uh optimized for per
11:30
super uh optimized for per
11:30
super uh optimized for per performance and the third source of
11:34
performance and the third source of
11:34
performance and the third source of telemetric data is infrastructure Matrix
11:36
telemetric data is infrastructure Matrix
11:36
telemetric data is infrastructure Matrix and database matrics because we run our
11:38
and database matrics because we run our
11:38
and database matrics because we run our applications on nodes we use uh networks
11:43
applications on nodes we use uh networks
11:43
applications on nodes we use uh networks for uh app to app Communications we also
11:47
for uh app to app Communications we also
11:47
for uh app to app Communications we also use databases to store our data and uh
11:50
use databases to store our data and uh
11:50
use databases to store our data and uh we should also gather all the relevant
11:53
we should also gather all the relevant
11:53
we should also gather all the relevant Matrix logs and some sometimes traces to
11:58
Matrix logs and some sometimes traces to
11:58
Matrix logs and some sometimes traces to understand the performance of network
12:00
understand the performance of network
12:00
understand the performance of network databases and uh our
12:03
databases and uh our
12:03
databases and uh our infrastructure and the most interesting
12:06
infrastructure and the most interesting
12:06
infrastructure and the most interesting part of my talk is what we can do with
12:09
part of my talk is what we can do with
12:09
part of my talk is what we can do with all that data because it's like pretty
12:13
all that data because it's like pretty
12:13
all that data because it's like pretty easy to collect a lot of data but we
12:15
easy to collect a lot of data but we
12:15
easy to collect a lot of data but we need to figure out uh how to extract
12:21
need to figure out uh how to extract
12:21
need to figure out uh how to extract answers and this is the biggest from my
12:24
answers and this is the biggest from my
12:24
answers and this is the biggest from my perspective the biggest challenge in the
12:26
perspective the biggest challenge in the
12:26
perspective the biggest challenge in the observability space our approach uh cot
12:30
observability space our approach uh cot
12:30
observability space our approach uh cot is we built a model of each system
12:33
is we built a model of each system
12:33
is we built a model of each system distributed system we understand the
12:36
distributed system we understand the
12:36
distributed system we understand the applications their instances no they run
12:39
applications their instances no they run
12:39
applications their instances no they run on and uh we understand how they
12:42
on and uh we understand how they
12:42
on and uh we understand how they communicate with each other and we as
12:45
communicate with each other and we as
12:45
communicate with each other and we as Engineers we start from an overview of
12:48
Engineers we start from an overview of
12:48
Engineers we start from an overview of the system and how it performs and then
12:53
the system and how it performs and then
12:53
the system and how it performs and then we like uh we Traverse dependency graph
12:57
we like uh we Traverse dependency graph
12:57
we like uh we Traverse dependency graph and analyze the performance of each
12:59
and analyze the performance of each
12:59
and analyze the performance of each component and we understand
13:02
component and we understand
13:02
component and we understand where uh the problem comes from
13:06
where uh the problem comes from
13:07
where uh the problem comes from uh then uh if you have to deal with
13:11
uh then uh if you have to deal with
13:11
uh then uh if you have to deal with thousands hundreds of or even thousands
13:14
thousands hundreds of or even thousands
13:14
thousands hundreds of or even thousands of services you should add some
13:16
of services you should add some
13:16
of services you should add some automation because it's no longer
13:19
automation because it's no longer
13:19
automation because it's no longer possible to to handle it manually so
13:23
possible to to handle it manually so
13:23
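That traversal can be sketched as a breadth-first walk over a dependency graph, flagging every component whose error rate is above a threshold; the graph, rates, and threshold here are all made-up illustration data, not Coroot's actual model:

```python
from collections import deque

# Hypothetical service dependency graph and per-service error rates.
deps = {
    "frontend": ["cart", "catalog"],
    "cart": ["redis"],
    "catalog": ["postgres"],
    "redis": [],
    "postgres": [],
}
error_rate = {"frontend": 0.05, "cart": 0.0, "catalog": 0.06,
              "redis": 0.0, "postgres": 0.07}

def find_suspects(root, threshold=0.01):
    """Walk the graph breadth-first from the user-facing service and keep
    every unhealthy component; the deepest one is the likeliest root cause,
    since errors tend to propagate upward to callers."""
    suspects, queue, seen = [], deque([root]), {root}
    while queue:
        svc = queue.popleft()
        if error_rate[svc] > threshold:
            suspects.append(svc)
        for dep in deps[svc]:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return suspects

print(find_suspects("frontend"))  # ['frontend', 'catalog', 'postgres']
```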
possible to to handle it manually so let's start from an overview we I
13:26
let's start from an overview we I
13:26
let's start from an overview we I believe that every troubleshooting
13:29
believe that every troubleshooting
13:29
believe that every troubleshooting process should start from uh metrics
13:32
process should start from uh metrics
13:32
process should start from uh metrics related to user experience because it's
13:35
related to user experience because it's
13:35
related to user experience because it's like the most important metrics in your
13:37
like the most important metrics in your
13:37
like the most important metrics in your system we
13:41
should uh we should check them before we we
13:44
uh we should check them before we we
13:44
uh we should check them before we we think about not CPU usage or or network
13:47
think about not CPU usage or or network
13:47
think about not CPU usage or or network issues we should uh understand uh is
13:51
issues we should uh understand uh is
13:51
issues we should uh understand uh is there any issues affecting our users or
13:54
there any issues affecting our users or
13:54
there any issues affecting our users or not and for an no better to use traces
13:58
not and for an no better to use traces
13:58
not and for an no better to use traces and metrics because like logs and
14:01
and metrics because like logs and
14:01
and metrics because like logs and profiles are uh almost useless in this
14:05
profiles are uh almost useless in this
14:05
profiles are uh almost useless in this case and let's uh look at some examples
14:10
case and let's uh look at some examples
14:10
case and let's uh look at some examples here we have uh uh tracing uh aggregated
14:15
here we have uh uh tracing uh aggregated
14:15
here we have uh uh tracing uh aggregated tracing view uh each point on this chart
14:19
tracing view uh each point on this chart
14:19
tracing view uh each point on this chart is uh some number of requests uh they
14:23
is uh some number of requests uh they
14:23
is uh some number of requests uh they are distributed over time and uh they
14:28
are distributed over time and uh they
14:28
are distributed over time and uh they are group by their latency and also we
14:31
are group by their latency and also we
14:31
are group by their latency and also we have red points uh that sign that uh
14:36
have red points uh that sign that uh
14:36
have red points uh that sign that uh there are some
14:38
there are some errors we can select this area and
14:41
errors we can select this area and
14:41
errors we can select this area and understand that some uh we have some
14:44
understand that some uh we have some
14:44
understand that some uh we have some errors in um in our system and we can uh
14:48
errors in um in our system and we can uh
14:48
errors in um in our system and we can uh ask this system uh analyze uh tra only
14:53
ask this system uh analyze uh tra only
14:53
ask this system uh analyze uh tra only error traces and group them by the
14:57
error traces and group them by the
14:57
error traces and group them by the origin error and and in this case we can
15:00
origin error and and in this case we can
15:00
origin error and and in this case we can see that product catalog service um
15:04
see that product catalog service um
15:04
see that product catalog service um affected all that
15:06
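That "group error traces by originating error" step boils down to a simple aggregation. A sketch over made-up trace records, where the hypothetical `origin` field names the service of the deepest failing span:

```python
from collections import Counter

# Hypothetical error traces, each annotated with the service where the
# failure originated.
error_traces = [
    {"trace_id": "t1", "origin": "product-catalog"},
    {"trace_id": "t2", "origin": "product-catalog"},
    {"trace_id": "t3", "origin": "payments"},
]

by_origin = Counter(t["origin"] for t in error_traces)
print(by_origin.most_common(1))  # [('product-catalog', 2)]
```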
affected all that requests then we can see some latency
15:09
requests then we can see some latency
15:09
requests then we can see some latency anomaly we can see that there are a
15:12
anomaly we can see that there are a
15:12
anomaly we can see that there are a number of requests that uh that were
15:15
number of requests that uh that were
15:15
number of requests that uh that were handled longer than usual in some
15:17
handled longer than usual in some
15:17
handled longer than usual in some anomaly we can select the slow requests
15:20
anomaly we can select the slow requests
15:20
anomaly we can select the slow requests and understand uh which
15:23
and understand uh which
15:23
and understand uh which which service call uh caused this uh
15:27
which service call uh caused this uh
15:27
which service call uh caused this uh situation in this case
15:29
situation in this case
15:29
situation in this case uh it's because of uh we slow down
15:33
uh it's because of uh we slow down
15:33
uh it's because of uh we slow down Network between some servers and its
15:35
Network between some servers and its
15:35
Network between some servers and its radius database pretty easy but we we
15:39
radius database pretty easy but we we
15:39
radius database pretty easy but we we can understand the state of our
15:41
can understand the state of our
15:41
can understand the state of our application using only one chart it's
15:44
application using only one chart it's
15:44
application using only one chart it's like we just
15:46
like we just uh uh picked the right representation of
15:50
uh uh picked the right representation of
15:50
uh uh picked the right representation of all our data but uh we cannot use only
15:55
all our data but uh we cannot use only
15:55
all our data but uh we cannot use only traces to understand what's happening
15:59
traces to understand what's happening
15:59
traces to understand what's happening with our application right now because
16:01
with our application right now because
16:01
with our application right now because in in case of traces you you will see
16:04
in in case of traces you you will see
16:04
in in case of traces you you will see only like errors or uh spans that uh
16:10
only like errors or uh spans that uh
16:10
only like errors or uh spans that uh took longer than usual but uh tracing
16:13
took longer than usual but uh tracing
16:13
took longer than usual but uh tracing data cannot answerer questions like why
16:17
data cannot answerer questions like why
16:17
data cannot answerer questions like why the application performs slower than
16:19
the application performs slower than
16:19
the application performs slower than before for example uh tracing cannot
16:22
before for example uh tracing cannot
16:22
before for example uh tracing cannot cover issues related to Java garbage
16:25
cover issues related to Java garbage
16:25
cover issues related to Java garbage collector activity or some internal post
16:29
collector activity or some internal post
16:29
collector activity or some internal post gql uh log contention or something
16:34
gql uh log contention or something
16:34
gql uh log contention or something else so and the the second limitation is
16:38
else so and the the second limitation is
16:38
else so and the the second limitation is that uh you cannot instrument all your
16:43
that uh you cannot instrument all your
16:43
that uh you cannot instrument all your application stack because of third party
16:46
application stack because of third party
16:46
application stack because of third party Services Legacy services and so on so
16:49
Services Legacy services and so on so
16:49
Services Legacy services and so on so and this why we need uh not only tracing
16:53
and this why we need uh not only tracing
16:53
and this why we need uh not only tracing data uh but other we should look at uh
16:57
data uh but other we should look at uh
16:57
data uh but other we should look at uh matrics logs and so on
17:00
matrics logs and so on
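The grouping described above, taking only the error traces and attributing each one to the service where the error originated, can be sketched roughly like this (the span structure and field names here are simplified assumptions for illustration, not Coroot's actual data model):

```python
from collections import Counter

# A span is a simplified dict: trace_id, span_id, parent_id, service, error flag.
# These field names are illustrative assumptions, not a real tracing schema.
def origin_errors(spans):
    """Group error traces by the service where the error originated.

    The originating error is taken to be an error span that has no
    error spans among its children (i.e. the deepest failing span).
    """
    by_trace = {}
    for s in spans:
        by_trace.setdefault(s["trace_id"], []).append(s)

    origins = Counter()
    for trace in by_trace.values():
        errors = [s for s in trace if s["error"]]
        if not errors:
            continue  # keep only error traces
        error_parents = {s["parent_id"] for s in errors}
        # leaf error spans: no other error span points to them as parent
        leaves = [s for s in errors if s["span_id"] not in error_parents]
        for s in leaves:
            origins[s["service"]] += 1
    return origins

spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "frontend", "error": True},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "product-catalog", "error": True},
    {"trace_id": "t2", "span_id": "c", "parent_id": None, "service": "frontend", "error": False},
]
print(origin_errors(spans))  # the error is attributed to product-catalog, not frontend
```

Even though the frontend span also failed, the failure is attributed to the deepest failing service, which matches what the chart shows for the product catalog service.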
17:00
Okay, let's assume we understood that we now have an open issue that's affecting our users. We need to understand how the components communicate with each other, and then we can drill down into any particular service and troubleshoot it.

Here is a service map. At Coroot we use eBPF to understand the communication between services: the latency between them, the network performance between them. So we can easily see how our components communicate with each other, and thanks to eBPF it can be done without any blind spots, because you don't need to instrument your applications to get this service map.

Using this service map you can drill down into any particular application and see only the metrics related to that one application: its clients, its dependency services like the database. We can analyze each subsystem separately to pinpoint other issues: CPU-related, memory-related, and so on. We use this approach to make troubleshooting easier, because we don't have to dig through countless dashboards.
18:33
countless dashboards and so on and what
18:33
countless dashboards and so on and what we should know about each uh each
18:38
we should know about each uh each
18:38
we should know about each uh each component uh I believe that uh over 80%
18:43
component uh I believe that uh over 80%
18:43
component uh I believe that uh over 80% of
18:45
of applications uh application issues are
18:47
applications uh application issues are
18:47
applications uh application issues are the same so we can cover the most common
18:51
the same so we can cover the most common
18:51
the same so we can cover the most common pit holes and uh highlight them
18:54
pit holes and uh highlight them
18:54
pit holes and uh highlight them automatically in our case from about
18:57
automatically in our case from about
18:57
automatically in our case from about every application we know it's uh
19:00
every application we know it's uh
19:00
every application we know it's uh service level indicators such as uh
19:03
service level indicators such as uh
19:03
service level indicators such as uh errors and application latency uh we
19:07
errors and application latency uh we
19:07
errors and application latency uh we understand CPU consumption memory
19:08
understand CPU consumption memory
19:09
understand CPU consumption memory consumption all resource consumptions in
19:10
consumption all resource consumptions in
19:10
consumption all resource consumptions in some cases we can drill down into
19:13
some cases we can drill down into
19:13
some cases we can drill down into runtime environments like gvm
19:15
runtime environments like gvm
19:15
runtime environments like gvm environments uh net on time we see all
19:19
environments uh net on time we see all
19:19
environments uh net on time we see all the relevant notes uh that
19:22
the relevant notes uh that
19:22
the relevant notes uh that are uh that host uh our application
19:26
are uh that host uh our application
19:26
are uh that host uh our application instances we understand Connections in
19:28
instances we understand Connections in
19:28
instances we understand Connections in all dependency services so we can
19:30
all dependency services so we can
19:30
all dependency services so we can Traverse the graph and find the that
19:34
Traverse the graph and find the that
19:34
Traverse the graph and find the that like this application is performing
19:36
like this application is performing
19:36
like this application is performing slowly because it's database performing
19:39
slowly because it's database performing
19:39
slowly because it's database performing slowly we also can track uh Network
19:42
slowly we also can track uh Network
19:42
slowly we also can track uh Network latency issues network connectivity
19:44
latency issues network connectivity
19:44
latency issues network connectivity issues and so on and finally we have uh
19:48
issues and so on and finally we have uh
19:48
issues and so on and finally we have uh logs of error applications and or we can
19:52
logs of error applications and or we can
19:52
logs of error applications and or we can Jill down into CPU profiles for example
19:56
Jill down into CPU profiles for example
19:56
Jill down into CPU profiles for example and here we have some Prett for
19:59
and here we have some Prett for
19:59
and here we have some Prett for inspections uh about each subsystem of
20:01
inspections uh about each subsystem of
20:01
inspections uh about each subsystem of each application so we can easily
20:04
each application so we can easily
20:04
each application so we can easily highlight the issues so you don't need
20:07
highlight the issues so you don't need
20:07
highlight the issues so you don't need to analyze each chart uh manually so uh
20:10
to analyze each chart uh manually so uh
20:10
to analyze each chart uh manually so uh here is basic slos inspection we expect
20:15
here is basic slos inspection we expect
20:15
here is basic slos inspection we expect that our application will serve uh over
20:18
that our application will serve uh over
20:18
that our application will serve uh over 99% of request without errors and we
20:21
99% of request without errors and we
20:21
99% of request without errors and we expect some latency it's like
20:24
expect some latency it's like
20:24
expect some latency it's like adjustable parameters but uh this is the
20:28
adjustable parameters but uh this is the
20:28
adjustable parameters but uh this is the mo the most valuable Matrix of each
20:31
mo the most valuable Matrix of each
20:31
mo the most valuable Matrix of each application so and we should alert only
20:34
application so and we should alert only
20:35
application so and we should alert only on SLO violations so uh until
20:39
on SLO violations so uh until
20:40
on SLO violations so uh until application uh is not violating its slos
20:44
application uh is not violating its slos
20:44
application uh is not violating its slos we should not care about CPU usage on
20:46
we should not care about CPU usage on
20:46
we should not care about CPU usage on the relevant nodes or
20:48
the relevant nodes or
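The basic SLO inspection described above boils down to simple checks over request counters; this is a minimal sketch with invented numbers and thresholds, not Coroot's actual implementation:

```python
def check_availability_slo(total_requests, failed_requests, objective=0.99):
    """Return (ok, success_ratio): does the service meet its availability SLO?"""
    if total_requests == 0:
        return True, 1.0  # no traffic, nothing violated
    success_ratio = (total_requests - failed_requests) / total_requests
    return success_ratio >= objective, success_ratio

def check_latency_slo(latencies_ms, threshold_ms=500, objective=0.99):
    """Check that at least `objective` of requests are faster than the threshold."""
    if not latencies_ms:
        return True, 1.0
    fast = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    ratio = fast / len(latencies_ms)
    return ratio >= objective, ratio

# Example: 1000 requests, 15 failed -> 98.5% success, below the 99% objective.
ok, ratio = check_availability_slo(1000, 15)
print(ok, ratio)  # False 0.985
```

The key point from the talk is that these two checks are the only ones worth paging a human for; everything else is context.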
20:48
the relevant nodes or something uh then we should check that
20:51
something uh then we should check that
20:51
something uh then we should check that we have enough application instances uh
20:54
we have enough application instances uh
20:54
we have enough application instances uh up and running so we ask corot tracks
20:58
up and running so we ask corot tracks
20:58
up and running so we ask corot tracks that uh we have uh over 75 available
21:02
that uh we have uh over 75 available
21:02
that uh we have uh over 75 available application instances and also we Track
21:05
application instances and also we Track
21:05
application instances and also we Track application restarts because uh each
21:08
application restarts because uh each
21:08
application restarts because uh each restart each restart can affect uh
21:12
restart each restart can affect uh
21:12
restart each restart can affect uh inflight request so we can have
21:16
inflight request so we can have
21:16
inflight request so we can have application errors because of restarts
21:19
application errors because of restarts
21:19
application errors because of restarts then we analyze CPU usage uh like not
21:22
then we analyze CPU usage uh like not
21:22
then we analyze CPU usage uh like not only not CPU usage but also container
21:25
only not CPU usage but also container
21:26
only not CPU usage but also container CPU usage uh comparing to its limit and
21:29
CPU usage uh comparing to its limit and
21:29
CPU usage uh comparing to its limit and highlight situations where
21:31
highlight situations where
21:31
highlight situations where application uh has no uh experienced a
21:34
application uh has no uh experienced a
21:34
application uh has no uh experienced a lack of CPU
21:36
lack of CPU time then uh we track uh memory leaks
21:41
time then uh we track uh memory leaks
21:41
time then uh we track uh memory leaks and the ACT activity of the H killer if
21:45
and the ACT activity of the H killer if
21:45
and the ACT activity of the H killer if our application were uh terminated by
21:50
our application were uh terminated by
21:50
our application were uh terminated by the H killer because of you know
21:52
the H killer because of you know
21:52
the H killer because of you know reaching CPU limit or reaching U node
21:55
reaching CPU limit or reaching U node
21:55
reaching CPU limit or reaching U node memory capacity we can easily understand
21:58
memory capacity we can easily understand
21:58
memory capacity we can easily understand that it was restart uh by the
22:02
that it was restart uh by the
22:02
that it was restart uh by the application by the own
22:04
application by the own
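Detecting a lack of CPU time usually comes down to the cgroup throttling counters the kernel maintains when a container has a CPU limit. Here's a sketch over cpu.stat-style numbers; the sample values are invented, and a real collector would read them from /sys/fs/cgroup:

```python
def cpu_throttle_ratio(nr_periods, nr_throttled):
    """Fraction of scheduler periods in which the container was throttled.

    With cgroup CPU limits, the kernel exposes counters like these in
    cpu.stat; a high ratio means the app regularly ran out of CPU quota.
    """
    if nr_periods == 0:
        return 0.0
    return nr_throttled / nr_periods

# Invented sample: throttled in 300 of 1000 enforcement periods.
ratio = cpu_throttle_ratio(nr_periods=1000, nr_throttled=300)
print(f"throttled in {ratio:.0%} of periods")
if ratio > 0.25:  # illustrative threshold, not a Coroot default
    print("container is experiencing a lack of CPU time")
```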
22:04
application by the own killer and uh also we when we look at uh
22:09
killer and uh also we when we look at uh
22:09
killer and uh also we when we look at uh the particular application we we see
22:11
the particular application we we see
22:11
the particular application we we see only the relevant nodes so we don't need
22:14
only the relevant nodes so we don't need
22:14
only the relevant nodes so we don't need to check one note by node and we not to
22:18
to check one note by node and we not to
22:18
to check one note by node and we not to uh we don't have
22:21
uh we don't have to remember the mopping applications not
22:24
to remember the mopping applications not
22:24
to remember the mopping applications not and it works uh also for a dynamic
22:28
and it works uh also for a dynamic
22:28
and it works uh also for a dynamic environment such as kubernetes then we
22:30
environment such as kubernetes then we
22:30
environment such as kubernetes then we can check all Communications of each
22:33
can check all Communications of each
22:33
can check all Communications of each application with dependency Services we
22:36
application with dependency Services we
22:36
application with dependency Services we can measure uh Network latency between
22:38
can measure uh Network latency between
22:38
can measure uh Network latency between them we can track fail TCP connections
22:42
them we can track fail TCP connections
22:42
them we can track fail TCP connections we can uh see any um Network
22:46
we can uh see any um Network
22:46
we can uh see any um Network partitioning uh so uh we can easily
22:49
partitioning uh so uh we can easily
22:49
partitioning uh so uh we can easily understand that uh these two Services
22:52
understand that uh these two Services
22:52
understand that uh these two Services could not communicate with each other
22:54
could not communicate with each other
22:54
could not communicate with each other because of network issues then we can
22:58
because of network issues then we can
22:58
because of network issues then we can see all Ds request by each application
23:01
see all Ds request by each application
23:01
see all Ds request by each application so if something goes wrong and
23:03
so if something goes wrong and
23:03
so if something goes wrong and application cannot resolve uh each
23:06
application cannot resolve uh each
23:06
application cannot resolve uh each database oral uh we can easily identify
23:10
database oral uh we can easily identify
23:10
database oral uh we can easily identify uh such
23:11
uh such scenarios and the most valuable thing is
23:15
scenarios and the most valuable thing is
23:15
scenarios and the most valuable thing is that uh as you remember I mentioned that
23:19
that uh as you remember I mentioned that
23:19
that uh as you remember I mentioned that uh we need some postprocessing for logs
23:22
uh we need some postprocessing for logs
23:22
uh we need some postprocessing for logs to be able to quickly understand what's
23:24
to be able to quickly understand what's
23:24
to be able to quickly understand what's happening and kurut has uh super
23:27
happening and kurut has uh super
23:27
happening and kurut has uh super interesting features
23:29
interesting features
23:29
interesting features uh it can extract repeated patterns from
23:31
uh it can extract repeated patterns from
23:31
uh it can extract repeated patterns from log so we have a lot of events here but
23:35
log so we have a lot of events here but
23:35
log so we have a lot of events here but kurut understood that there are only
23:38
kurut understood that there are only
23:38
kurut understood that there are only five types of Errors they group the
23:40
five types of Errors they group the
23:40
five types of Errors they group the similar messages and allows you in
23:43
similar messages and allows you in
23:43
similar messages and allows you in seconds understand what's happening like
23:45
seconds understand what's happening like
23:45
seconds understand what's happening like uh if we have new type of Errors you
23:48
uh if we have new type of Errors you
23:48
uh if we have new type of Errors you will see here Spike of errors and
23:51
will see here Spike of errors and
23:51
will see here Spike of errors and quickly understand its type and uh then
23:55
quickly understand its type and uh then
23:55
quickly understand its type and uh then you can jump into uh samples and see the
23:58
you can jump into uh samples and see the
23:58
you can jump into uh samples and see the the particular line uh log lines then as
24:02
the particular line uh log lines then as
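Pattern extraction can be approximated by masking the variable parts of each message and grouping by the resulting template. Here's a toy sketch of that idea; Coroot's real algorithm is more sophisticated than a couple of regexes:

```python
import re
from collections import Counter

def template(message):
    """Mask numbers and hex-like tokens so similar messages collapse together."""
    masked = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)
    masked = re.sub(r"\d+", "<num>", masked)
    return masked

logs = [
    "timeout talking to redis after 500 ms",
    "timeout talking to redis after 1500 ms",
    "user 42 not found",
    "user 7 not found",
    "timeout talking to redis after 300 ms",
]

# Five log lines collapse into two patterns with counts.
patterns = Counter(template(line) for line in logs)
for pattern, count in patterns.most_common():
    print(count, pattern)
```

A spike in the count of one template is exactly the "new type of error" signal described above.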
24:02
the particular line uh log lines then as I mentioned we can drill down into
24:05
I mentioned we can drill down into
24:05
I mentioned we can drill down into application run time uh in case of java
24:08
application run time uh in case of java
24:09
application run time uh in case of java can check uh stop the world poses caused
24:12
can check uh stop the world poses caused
24:12
can check uh stop the world poses caused by garbage collector or other reasons
24:15
by garbage collector or other reasons
24:15
by garbage collector or other reasons but it's super valuable when
24:19
but it's super valuable when
24:19
but it's super valuable when you uh like think of it when somebody
24:24
you uh like think of it when somebody
24:24
you uh like think of it when somebody already analyzed all application
24:26
already analyzed all application
24:26
already analyzed all application subsystems and highlighted
24:29
subsystems and highlighted
24:29
subsystems and highlighted uh which application subsystems needs
24:32
uh which application subsystems needs
24:32
uh which application subsystems needs your attention so in this case if
24:35
your attention so in this case if
24:35
your attention so in this case if everything is okay with your GBM Matrix
24:39
everything is okay with your GBM Matrix
24:39
everything is okay with your GBM Matrix you will
24:40
you will not you don't have to look at this uh
24:44
not you don't have to look at this uh
24:44
not you don't have to look at this uh and it's getting drastically safe uh
24:47
and it's getting drastically safe uh
24:47
and it's getting drastically safe uh time uh needed
24:49
time uh needed to understand what's happening and uh
24:53
to understand what's happening and uh
24:53
to understand what's happening and uh the last one inspection is K can cheack
24:59
the last one inspection is K can cheack
24:59
the last one inspection is K can cheack uh application roll outs and compare uh
25:03
uh application roll outs and compare uh
25:03
uh application roll outs and compare uh application new application release
25:05
application new application release
25:05
application new application release performance with the previous one it's
25:07
performance with the previous one it's
25:07
performance with the previous one it's super it's super helpful when you deploy
25:12
super it's super helpful when you deploy
25:12
super it's super helpful when you deploy many times a day so and you don't have
25:15
many times a day so and you don't have
25:15
many times a day so and you don't have to check Matrix every time when you roll
25:18
to check Matrix every time when you roll
25:18
to check Matrix every time when you roll uh new version out so kurot can in the
25:22
uh new version out so kurot can in the
25:22
uh new version out so kurot can in the background check every release and
25:24
background check every release and
25:24
background check every release and notify if uh new version has
25:29
notify if uh new version has
25:29
notify if uh new version has some issues or it's like uh there are
25:33
some issues or it's like uh there are
25:33
some issues or it's like uh there are new type of Errors uh in the
25:36
new type of Errors uh in the
25:36
new type of Errors uh in the logs
25:38
Having such automation allows us to sort our services by the severity of their issues. For example, if some services are violating their SLOs, we can highlight them; if some services have errors in their logs, we can highlight that too. We don't just show you dashboards with beautiful charts; we give you answers about where the problem is. In this case, for example, we can see that there are no available instances of this application, and you don't need to navigate to some particular dashboard to understand that.
26:32
dashboard to understand this and few words about alerts alerts is uh you need
26:36
words about alerts alerts is uh you need
26:36
words about alerts alerts is uh you need alerts to be able to understand that
26:39
alerts to be able to understand that
26:39
alerts to be able to understand that something goes wrong uh and without like
26:43
something goes wrong uh and without like
26:43
something goes wrong uh and without like when you not uh look at uh your
26:47
when you not uh look at uh your
26:47
when you not uh look at uh your dashboards like our approach is that we
26:50
dashboards like our approach is that we
26:50
dashboards like our approach is that we should alert only on SLO violations and
26:53
should alert only on SLO violations and
26:53
should alert only on SLO violations and we should not alert you on
26:56
we should not alert you on
26:56
we should not alert you on infrastructure problems or some errors
26:58
infrastructure problems or some errors
26:58
infrastructure problems or some errors in the logs because if application uh is
27:02
in the logs because if application uh is
27:02
in the logs because if application uh is performing as expected we don't need to
27:05
performing as expected we don't need to
27:05
performing as expected we don't need to dig into any underlying problems
27:10
dig into any underlying problems
27:10
dig into any underlying problems because it's not necessary it doesn't
27:13
because it's not necessary it doesn't
27:13
because it's not necessary it doesn't like in any distributed system uh like
27:18
like in any distributed system uh like
27:18
like in any distributed system uh like uh not an availability is is a expected
27:22
uh not an availability is is a expected
27:22
uh not an availability is is a expected Behavior so and uh we believe that uh
27:25
Behavior so and uh we believe that uh
27:25
Behavior so and uh we believe that uh it's not important to to track that
27:27
it's not important to to track that
27:27
it's not important to to track that stuff
27:31
until your applications uh are meeting their slos
27:35
applications uh are meeting their slos
27:35
applications uh are meeting their slos and we try
27:36
and we try to provide you with Rich alerts which
27:40
to provide you with Rich alerts which
27:40
to provide you with Rich alerts which means uh we know that this application
27:44
means uh we know that this application
27:44
means uh we know that this application is violating its SL slos but we know uh
27:47
is violating its SL slos but we know uh
27:47
is violating its SL slos but we know uh that there are some relevant errors in
27:50
that there are some relevant errors in
27:50
that there are some relevant errors in the logs or it's the same time the
27:53
the logs or it's the same time the
27:53
the logs or it's the same time the application has network issues and we
27:56
application has network issues and we
27:56
application has network issues and we can send you single alert with all this
27:58
can send you single alert with all this
27:59
can send you single alert with all this data so you don't need to analyze it
28:01
data so you don't need to analyze it
28:01
data so you don't need to analyze it manually or correlate manual alerts in
28:05
manually or correlate manual alerts in
28:05
manually or correlate manual alerts in to understand what's
28:08
to understand what's
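A rich alert like that is essentially the SLO violation bundled with whatever correlated findings are active at the same time. A minimal sketch; the fields and findings here are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class RichAlert:
    """One notification that carries the SLO violation plus related context."""
    service: str
    slo_violation: str
    findings: list = field(default_factory=list)  # correlated context

    def render(self):
        lines = [f"[ALERT] {self.service}: {self.slo_violation}"]
        lines += [f"  - {f}" for f in self.findings]
        return "\n".join(lines)

alert = RichAlert(
    service="cart",
    slo_violation="error rate 2.3% over the last 10m (objective: <1%)",
)
# Attach correlated findings instead of firing a separate alert for each.
alert.findings.append("log pattern spike: 'timeout talking to redis after <num> ms'")
alert.findings.append("network: elevated latency cart -> redis")
print(alert.render())
```

One message with the violation and its likely causes replaces three separate pages that the on-call engineer would otherwise have to correlate by hand.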
28:08
to understand what's happening uh and like I almost done but
28:13
happening uh and like I almost done but
28:14
happening uh and like I almost done but uh it's like uh some summary of my talk
28:20
uh it's like uh some summary of my talk
28:20
uh it's like uh some summary of my talk I want you
28:22
I want you to think about observability because uh
28:26
to think about observability because uh
28:26
to think about observability because uh in devops philosophy we
28:29
in devops philosophy we
28:29
in devops philosophy we should understand that developers should
28:33
should understand that developers should
28:33
should understand that developers should understand how their applications run in
28:36
understand how their applications run in
28:36
understand how their applications run in production
28:38
production uh I want to highlight that I believe
28:42
uh I want to highlight that I believe
28:42
uh I want to highlight that I believe that uh observability is not about
28:46
that uh observability is not about
28:46
that uh observability is not about storing your telemeter data or uh
28:49
storing your telemeter data or uh
28:49
storing your telemeter data or uh compressing your telemeter data it's
28:51
compressing your telemeter data it's
28:51
compressing your telemeter data it's more about uh understanding your data
28:55
more about uh understanding your data
28:55
more about uh understanding your data and being able to extract
28:58
and being able to extract
28:58
and being able to extract and actionable insights and uh I want
29:03
and actionable insights and uh I want
29:03
and actionable insights and uh I want you to uh to give a try to karud because
29:08
you to uh to give a try to karud because
29:08
you to uh to give a try to karud because it's like open source free of charge and
29:11
it's like open source free of charge and
29:11
it's like open source free of charge and you can use it and you will have all
29:14
you can use it and you will have all
29:14
you can use it and you will have all that all these practices out of the box
29:17
that all these practices out of the box
29:17
that all these practices out of the box and I believe kurut can identify over 8%
29:22
and I believe kurut can identify over 8%
29:22
and I believe kurut can identify over 8% of issues
29:23
of issues automatically and like uh reuse
29:27
automatically and like uh reuse
29:27
automatically and like uh reuse observability
29:28
observability as you reuse your codee so that's it uh
29:34
as you reuse your codee so that's it uh
29:34
as you reuse your codee so that's it uh if you have any questions
29:36
if you have any questions
29:36
if you have any questions [Music]