Help! I cannot find out where is the bug by Jose Manuel Jurado Diaz || Lightup Conference
11K views
Nov 16, 2023
In this session, we are covering the troubleshooting steps that, as support engineers in Microsoft, we're doing to detect a performance issue in a .NET application. We are going to cover the two main steps. • First using WinDbg and other debugging techniques we are going to debug the application to identify the module that is causing the performance issue. • Second as our application is connecting to Azure SQL Database, we are going to identify what is the query that is causing this performance issue. Conference Website: https://www.2020twenty.net/lightup #lightup #2020twenty
View Video Transcript
0:00
OK, so thank you so much
0:04
Thank you so much for attending this session. A decision about help
0:11
I cannot find out where is the bug. Remember to donate in support of UNICEF
0:21
Thank you so much also for all the sponsors that we have that make real this type of events
0:30
Remember that about the it's very, very important about the feedback about this event and speaker
0:37
It's I think that providing you feedback and at the end, we'll try to grow up in as a speaker events, etc
0:47
So out me, my name is Jose Manuel Jurado. I am escalation engineer in SQL Azure working in Microsoft
0:56
Also, such a master expert in this technology in SQL database, Manage Instance, and the new databases that we have in
1:05
as open source and Azure Pass as MySQL, Progress and MariaDB. I am a speaker in different events
1:13
let's say SQL Saturday, TechReady, SQL Nexus, SQL Pass. And also, you have all the contacts about Twitter
1:21
LinkedIn, and my email. With me, it will be my colleague Vitor Pombayton
1:28
Unfortunately, he is not available right now, so I'm going to deliver a session alone
1:33
But for sure that this session also, the owner of this session is my colleague Vitor
1:40
As you can see, it's also another escalation engineer based on Portugal
1:44
I'm based in Spain that you can notice by my accent, but you have also the taste of context
1:50
So what is the reason about this session? Well, it's about three things
1:59
First of all is try to find a performance problem. It's very typical in support
2:06
We receive calls from the customer that we want to know what happened
2:11
how to find the performance problem. So the idea of this session is try to help you in
2:16
order to find a performance problem with your application. Second, the idea is try to find what is
2:23
the query that is taking too much time because also, it's based on real example
2:27
Because it's a real example working with a customer, I'm going to show you how it's possible to find
2:33
a query that is taking too much time. Finally, I'm going to show you how to find
2:39
after we found the performance problem, the query, what is the bug that we have with this code
2:46
So let me introduce about the background, about what is the reason that we receive a call
2:56
that mentioned that my application is running slow, but I don't know why
3:01
So let me explain about the background. Right now, we receive a call that the customer mentioned
3:07
that we don't have any updates in the code for a long time
3:11
We don't have any new deployment, but it's running very slow. It's an expected issue because the customer didn't know what happened without any modification in the production environment
3:22
Also, the business was seriously impacted and around 1,000 users impacted. The customer adds a lot of data from the last month without any issues
3:37
Finally, if they try to run this process in on-premise, this process took around 20 percent less in this point
3:50
So what is the reason that we have? I'm going to stop here
3:55
One second. I'm going to show you what is happening
4:07
This is the application that our customer has. They mentioned that, okay, I received a call
4:16
They show me that what the application, as you could see, the application is running this way
4:22
One second. They expect that the application is running
4:33
faster more than right now is running. is running very slow but they don't know what happened in this point
4:40
So in this situation, we need to find what happened in this application
4:46
So the first point that we ask to the customer is try to verify what happened
4:54
So as you can see, it's running very slow, the customer didn't know what happened, etc
4:59
So let's go to the presentation. So what is the process that we are going to follow up in order to
5:10
identify the part of the code that is the problem? I would like to suggest, for example
5:16
one thing that we are going to show you about stopwatch. It's a very useful tool that we have in order to detect
5:23
what is the time that is spend my functions. We are going to show you how we are going to
5:29
obtain the connectivity of execution statistics. Also, we are going to use one thing that is called Visual Studio Profiler
5:37
using one part that is called Interimestation, that is possible to know what is the part of
5:42
the code that is taking too much time. So I'm going back to the application
5:51
One second here. The first point is I'm going to suggest to the customer
5:59
run the Visual Studio Profiler. I'm going to click here in the back
6:04
choose the performance profiler. I'm going to choose a performance wizard, I'm going to click Start
6:13
I'm going to choose one thing that is called inter-ministration. I'm going to go here and click Next
6:20
and click Next, and finish. Let's try it right now
6:32
Right now, as you could see here, the application is running. We are going to wait a couple of executions
6:38
let's say until 20 executions. As you could see, it's taking too much time
6:48
9, 10, 11. So as you can see more or less is taking
6:54
around one to seconds per executions. I'm going to stop the profiling
7:07
Right now, we have a special report that is using the instrumentation profiling report
7:15
As you can see here, the part of the code that is taking too much time is two parts
7:20
The first is the execution of this common executed scalar is taking the 51 percent of the CPU let say the time let say Another one it a free to sleep I going to explain what is the reasons
7:36
So right now we know that something happened, executing some SQL statement
7:44
So I'm going to, customer, we know that the part of the code that is facing this problem
7:50
let's try to verify what happened in your code. I'm going to the code
8:00
As you can see, this is the process. This is the customer is using right now
8:08
the X screen Scalar, and we have the thread. I'm going to show you one thing that is very
8:15
important about the Stodwatch. As you can see, this is a Stodwatch that is a common
8:23
provide us the way that how much time from how much time passed from
8:28
running this command until we obtain the lapse. So at the end, we could see how much data
8:35
or how much time we pause in order to execute this query
8:38
So other things that is very important that I'm going to suggest to the customer is to use
8:44
one thing that is called statistics. You know that the SQL connection that we have in.net
8:51
it's possible to obtain some statistics about, for example, the execution time or the network server time
8:59
So when I enable these statistics, and then I'm going to see how much time we took for
9:06
every execution that we have. Also, as you can see, this customer is running one query
9:12
very simple query that select a count, it changes the value of this parameter
9:18
every time that they run the process. So I'm going to enable the application
9:24
I'm going to enable the data. One second, let me change this value
9:28
and I'm going to run again. So right now, you can see that we have more useful information
9:43
about what happened for every executions that we have. Let me stop here
9:50
Go back to application and explain about it. So every time that our customer run
9:58
a select count from the performance, this table from test to search this value
10:05
As you can see, the execution time is around 1400 milliseconds every time
10:12
Also, the network server time is 1,400 milliseconds. Also, as you could see
10:22
the execution from this part to this part is taking two seconds
10:29
So what is the reason this is taking two seconds? Just only for demonstration
10:33
I add this slip in this process because it's very important to try to
10:39
this thing what is the execution of the query, or what is the execution of the complete process that you have in the functions
10:47
Let me stop here and remove, for example, the thread. As you can see right now
11:01
the execution of the process is more or less, Second. It's more or less the same time that we have
11:13
We don't have any big difference between the execution of the query and the execution of the
11:18
of the store procedure that we, sorry, the function that we have here in the shop
11:24
So that's mean that the time that I expend in this process is
11:31
just solely execution of the query. Right now, we identify that something happened in the database
11:37
Because as you can see, the time this time is very similar that we have in this
11:44
So let me go to my presentations. So right now, we identify what happened about execution
12:00
So we identified three things. The start watch, we use the SQL connectivity execution statistics
12:07
using the Studio Profiler. So in the end, we found what happened
12:12
So what is the next step that we need to do? Is the phase two
12:16
So the idea is to identify the query. What is the tool that we are using in this process
12:24
We are going to use the Azure portal in order to identify what is the query that is taking too much time
12:30
because it seems based on the telemetry that we have, or the option that we have
12:36
the statistics that we have is taking around one or two seconds
12:41
We are going to review also using the query that I store because it's very important
12:47
It's a very useful tool, a very important tool because with this we are able to
12:52
see everything that is happening in my database. Also, we are going to review
12:58
one thing that is called the execution tab. the execution plan. So let's go for the demo
13:09
I'm going to use the portal. I'm going to run the process mean well
13:22
This process is running. Yeah, the process is running. Go to the portal again
13:30
Okay. But the first thing that we need to review is review what happened in my database
13:38
As you could see here, when you choose in a portal, you choose your database
13:43
you could see one thing that is called this graphic that is computerization
13:47
But if we click here, we have a lot of metrics that we could
13:54
find what happened in my database. Right now, I don't know what happened is CPU
13:58
I don't know what is the time spending this point. So for the reason I'm going to add a new metric in the portal
14:05
For example, I'm going to choose CPU. I'm going to choose, for example
14:13
that IO and I'm going to add a new metric that is called LogIO
14:23
Let me remove this because it's not yet. So something happened in terms of CPU
14:30
As you could see here, we don't have high data.io, we don't have high log.io
14:37
but it's clear that something happened here about the execution of my CPU
14:45
So even one thing that's very important when you are using this type of tools is you could add a new chart
14:54
Also for example we could review how many connections that we have because you know that sometimes we could have a pilot connections or we could have a lot of new connections
15:07
So I'm going to choose this one. Sexual connections and as you can see here
15:15
just only we have one connection per time. So if I review what happened
15:21
just only one connection or around two connections is causing right now 100 percent all the time
15:28
So I would like to understand that, okay, it's not too much connections
15:34
We have a problem in terms of CPU. Also, if we choose, for example
15:40
the workers, that is another point that we have. This is very important about the worker percentage
15:46
As you can see, it's not too much. So I don't know what happened
15:50
So this is the first point that we found right now, that the application is running some SQL statements
15:59
This SQL statement is using a lot of CPU. So I'm going to back to the database
16:06
and I'm going to review if we have something more about intelligent performance
16:12
That in SQL Azure, we have the chance to have some options about intelligent performance
16:19
For example, performance recommendation. SQL Azure by default is yzing all the requests
16:28
all the process or the workload that you have. We could provide some information about
16:33
okay, you need to create an index, you need to create or delete this index
16:38
or do in this way, try to do this way or try to verify some details
16:46
So if you don't have enabled, this thing is very simple. You need to go to the automatic tuning
16:54
Enable right now, it's by default, but you could enable, for example
17:00
force plan and you could enable the option to create index. For example, you could specify in that
17:08
if you find something, please create index. And it's something, it's the improvement
17:13
it's not that we expect, try to remove the index. So this is the option that we have for Azure
17:20
that they will create for you. And also a drop in this
17:26
You know that sometimes there is no need to create too much indexes
17:31
Sometimes to have a lot of indexes, they are causing a lot of problems in terms of a IO
17:37
So for the reason we have this option in automatic tuning. So I'm going to review the performance recommendation
17:43
that I don't have any performance recommendation. So I don't know what happened right now
17:48
So I'm going to go to the queries performance inside and I'm going to review what happened
17:56
As you can see, we have on this graphic that contains a lot of information
18:01
but it seems that we have something about, as you can see, we have this CPU
18:08
you could choose a Data IOD, you could choose a Log IOD in terms of the top five queries
18:14
that is using too much CPU. As you can see here, we have right now a distribution
18:21
that CPU for the query 348 is not too much, but it seems that we have a query that with a number 632
18:32
that is using 70 percent of the CPU, and also the CPU is 90 percent
18:36
So right now, I'm going to click here in the query ID
18:41
As you can see right now, we identify the query. So we identified that this query is causing this problem
18:51
but I don't know what happened because it's a simple query. It's a rerun the search from the table from this query
19:01
I don't know. So in this case, I'm going to review what is happening
19:11
So for example, let's try to use another tool. For example, as you can see here in SQL Management Studio
19:24
we have other way in order to detect the data. For example, if I run a select start from SIS resources that I could see the same metric
19:34
Right now, every 15 seconds, I have the authorization in terms of CPU, IO, etc
19:42
So other things that's very important, it's about query store as I mentioned before
19:48
As I mentioned before, here we have several reports that provide us a lot of information
19:55
For example, I'm going to open the query with the statistics. As you can see, the total time in this query is CPU
20:07
I'm going to click here. As you can see, we have this time
20:15
This is the query that is taking too much time in terms of CPU. You could click here in Configure
20:20
you could change the data if you want to review what happened this last day, this last two days, etc
20:27
But right now, this is happening. Also, you can see one thing that is called execution plan
20:33
So it's very important that right now, as you can see every time that I run this query
20:39
we're running an index scan. So that's mean that SQL is reading all the data
20:45
all the rows that we have in this table, that it seems that we miss much probably an index
20:52
Also, you have the information about what is happening. So for example, other information that we have
21:02
you can see the query or other thing that you can see in the properties
21:08
It's a very good thing, it's about the parameter list. So this is the parameter that we compile the query
21:18
So you can see here the set options about the environment, the statement, etc
21:25
And also we have a warning that is very important about, This is the real problem that we have
21:30
So it's very important about other things that you have these cycles
21:34
that's mean that the execution of the query took this time. Also, this square means that the execution was canceled
21:42
So using query the test store, we could see when our query was executed
21:49
it's executed correctly or was canceled. So other things that we have
21:53
I'm not going to go more than this, but for example, you know that in query performance inside
22:00
we have the chance in order to review the query. In this way, we have also the same operation
22:06
If I type here the number of the query, you could see for example, how much data
22:16
for example, here these executions take this time, take this time, was canceled here, was canceled here
22:22
So you could see what happened during this time the execution and also you could see the plan ID You know that sometime we have a regression that you could have different plans for the same executions
22:35
Right now, the execution plan means that we have an index scan
22:40
So that means that it seems that this table has not right now an index
22:46
When I search for this test to search, I'm going to click here and go into this table
22:55
Let me stop because it's taking 99 percent of the CPU, much probably not able to do a lot of things
23:03
One second. Stop here, the application. Okay. So I'm going to the table is this
23:21
This is the query and this is the index. So as you can see here, we have an index
23:28
If I type, for example, the definition of the index. It's taking time
23:54
Okay. Anyway, it's taking time
24:04
We have the index, we are using the field that we are searching the data
24:12
So as you can see, there's a search we are using this time
24:16
And so we have an index. So I don't know what happened
24:19
So we have an index, we have the data. I don't know what happened
24:25
So it seems that we have a problem with database or the code is doing something that is strained
24:32
Let's try to put more focus about execution plan. As you can see, we have a warning
24:38
So let me go to the presentation again, in order to explain with more details
24:47
about how we found the backup. So what happened in this situation
25:03
We have an issue, but we don't know what happened. So the background is
25:09
this query took one second per executions. It seems that they are using
25:18
index scan against indexseq. and also how much data you add, this query will take more time
25:28
So I'm going back to the execution, and I'm going to review what happened
25:34
So as you can see, here we have in this scan, that's mean that for every executions
25:43
the number, SQL is reading 400,000 rows for every executions. They run the aggregate
25:54
so they try to calculate how many rows that we have. Also, we have the computer scalar
26:07
but we have here one thing that is very important. It's about we have and convert implicit
26:14
So as you can see here, SQL is converting, it seems that is converting mVarCar to VARCar
26:24
Because VARCar, as you can see here, is VARCar and the parameter
26:32
of the application is using is mVarCar. As you can see here, if I go back to the properties again
26:44
So we have this thing, one second
26:55
Okay, one second is here. The parameter list that we are using
27:04
we are using one thing, it's this. This letter means that SQL is using
27:11
or SQL application is using nvarchar. So let's try to review what happened
27:15
I'm going to obtain a query, and I'm going to run the query because I will like what happened
27:24
You know that in order to obtain the execution plan, it's very simple
27:28
You could click here to obtain the current execution plan, and I'm going to run the query
27:38
I'm going to, for example, is specified in this case. One second, I'm going to copy a page
27:45
That's probably here, parameter, this value. So if I run the query
28:04
it's not taking too much time, is taking one second, more or less is the time that
28:11
the query running the application is taking for the customer side. We have the execution here
28:20
As you can see, we have in this scan. So you know that if the definition of the table
28:27
right now it's VARCAR. I'm going to modify it right now, this parameter in order to specify
28:37
VARCAR and state to use MVARCAR. I'm going to do again the same
28:45
but removing right now the letter. I'm going to again to use the execution plan
28:54
I'm going to execute the query again. So as you can see here
29:04
the Dynas pen is zero. Let's say we don't have any warning
29:10
Also, if you see the difference between these two executions, is we are using index scan
29:18
reading every execution around 400,000 rows, Here, we are using index seek that the number of
29:30
the execution will be just only or the number of queries will be just only one
29:36
So in this situation, it's clear that something happened in the code that it seems that
29:44
the parameter that this application is using is not the expected parameter value or
29:50
type that we have for this table. So in this case, the idea is I'm going to go to Applications
29:59
I'm going to review what happened in the code. As you can see, it's very important
30:10
For example, in SQL Azure, it's very important to use Retrialogic. Please remember, before we go deeper about it
30:19
it's very important about to use Retrialogic. You can see that we are using a StodWatch
30:25
As you can see here, the application is using the data type and VARCAR
30:31
So I'm going to modify the value, I'm going to modify the parameter to use
30:38
the right parameter that we have from these tables. You can see I'm going to use
30:47
the same VARCAR with this parameter. So I'm going to modify this
30:56
and I'm going to run. So as you can see here
31:12
right now it's taking a few milliseconds in order to complete the process
31:17
As you can see, the difference is very huge. Right now all the executions
31:22
it's taking a few milliseconds. For example, remember that previously we have
31:28
one second per executions right now is no more than 35, 36 milliseconds
31:36
Also remember that I'm working in Spain, or I run in this application from Spain
31:42
So it's taking some time in order to reach the North Europe region in Azure
31:49
So it's going very fast. It's amazing about the time that just only
31:58
modifying the parameters, the data types. So as you can see here
32:02
the byte received is 84. The bytes that I sent is to the SQL server, it's 81
32:12
I receive one row, the execution time is 36, The network server time, that's mean how much time I waiting for
32:23
another reply from the server more or less is the same. And also as you can see
32:28
the text is just only without using Unicode or VARCARD. And also as you can see here
32:38
the execution is taking a few milliseconds in order to complete the process
32:43
So also, I'm going to run again, the instrumentation about the performance profiler of Visual Studio
33:08
Let's wait for a couple of seconds. There we have the applications
33:18
Let's stop. Also right now, as you can see right now
33:33
all the time it running this process The 75 percent of the time is running the computer scala that is taking a few milliseconds So this is that we found this back in this code using this option
33:51
So what is the lesson learned that we found in this
34:04
The most important thing is how to use the portal. It's very important to use the portal query store
34:12
because they will help you when you have any performance problem. So remember that we use the portal
34:22
Thanks to the portal, we found what is the query that is taking too much time
34:27
because as you could see here, it's taking the CPU. Also, you could see here the CPU time for
34:37
every hour that took the percentage. For example, 70 percent of the CPU during this time was
34:45
executed around 1,700 operations per time. You could see the recommendations
34:54
Remember that is very important that SQL, we yze the workload and provide indexes
35:00
As you can see, we don't have any problems with indexes because we have this index created
35:07
Also, in the performance inside, we saw the query that is taking this process
35:23
the execution, etc. So it's very important to use the portal. It also is very important about to use the Query Store
35:31
Because using the Query Store, as you can see, we have the way to yze all the data in a very best way
35:39
in order to find multiple things about, not just only this time, about it
35:44
For example, all the things that I will like to show with you
35:48
Before go to another point, it's also you could see in the top resource consuming queries
35:53
you could see for example, how much time, the duration of the query
35:58
or how much time in terms of CPU, wrong count, logical writes
36:03
radical reads and also you could see in terms of average, in terms of max, mean, values, etc
36:10
One thing that it's very important if you don't need to use the graphics
36:16
also you could see the data. Just only with the query, the time that we have
36:26
So in this grid, you have the chance to review all the queries
36:30
what happened with these queries. Every time that I click in the query, you are going to receive the query
36:38
you are going to see the plan ID, etc. So it's very important to use this one
36:45
Other things that I will like in terms of executions, I will like to show you one thing that also is very
36:54
important about the alerts and the agnostic settings. You know that other things that you could beside
37:04
the query store and the portal is to use, for example, diagnostic settings
37:10
So what does mean diagnostic settings? Another way to monitor all the queries you could send all the information that we have about query store runtime statistics weather statistics errors that you have in your database
37:26
database weather statistics, sign out, blocks, dead blocks, etc. You could send this information to
37:34
one thing that is called Log ytics Workspace. So if I open this Log ytics Workspace
37:43
Example here. We have another way in order to detect the data
37:54
to detect what happened with my database. For example, here, you could see that other report that we have is
38:01
okay, this database is taking the 90 percent of the CPU. You have several reports
38:10
about the query duration in seconds, the number of errors that we have
38:15
We don't have any that blocks. We don't have any timeouts. We don't have any blockings
38:19
So at the end, that's mean that it's just only a pure execution
38:23
So if I click here, as you can see, you have more information about
38:30
okay, this query is taking this time. This is the way to start that you have
38:38
This is the query. the query ID that you have and also
38:42
report all the information that we have here. For example, if you want to have more data or do you want to
38:53
create your personal reports, I'm going to show you right now. But also as you can see here
39:00
if you have a lot of our reports in order to monitor your database
39:03
without using any third-party tool, just only enable the log diagnostics, the ysis settings or this option
39:12
you're going to have multiple things. If you want to review the data by yourself
39:19
one thing that we could use is, for example, I'll show you
39:26
Second, I don't know what happened. It's about to retrieve the data directly from
39:37
the tables that we are using in order to obtain the data. So right now, when you enable the diagnostic settings
39:46
all the information that we have, it's going to this table. This is, let's say
39:54
the syntax or the query is using Kusto. You could find Kusto the documentation about how to query
40:03
the data using, for example, PluralSite. you have this information about it
40:08
But right now, as you can see here, for example, these queries to obtain
40:17
how much time in terms of execution took my query. So as you can see, you can query Azure Diagnostics
40:25
you could retrieve about the category, query store room time statistics about my resources ID
40:33
Was this my subscription server, etc. You can see all the information that you need
40:41
Also, if I want, for example, instead you have a metric every
40:45
for example, every 30 minutes, we can modify it, for example, for every five minutes
40:52
So if I run the query I going to have more information in in intervals about every five minutes so it a very important to to know this because using this option you could have
41:06
a lot of information about it so let me go back to the presentation again
41:13
and also about you other things the most important thing we found back we found that
41:20
that please review your code. Because as you could see, using a parameter that you
41:28
use in nbarcar and the table is barcar. Yes, the time is not too much
41:33
It's just only one second. But if you run this operation multiple times
41:36
at the end the process will take too much time. I want to interrupt and let you know that we are at
41:44
40 minutes past and there are no post questions, but you have five minutes
41:48
Just only I'm going to finish because this is the last slide
41:52
As I mentioned, testing in advance, you query, try to, if your table will grow, that's it
42:01
Also, one thing that is very important, it's not always the bigger database tier helps because in this case
42:08
we could increase the database tier with more powerful and you are going to have the same problem
42:15
From my recommendation, if you want to save your money, invest your time in diagnostic and coding
42:21
I think that is very important. If you code, if your source code is good
42:26
you review the code and you tune in the code, much probably you are going to save you money and you are going to have
42:35
a very good performance. Finally, all the information that you could see
42:42
you could click here in GsgHub in order to obtain how to monitor the code
42:47
as I mentioned using Powercell, and again, provide you code. It's very important because we'll start thinking that
42:56
the problem is the database and the end was the application. So right now it's time for questions
43:03
As this is the detail of contacts, so I'm waiting for if you have any questions
43:17
You know that we want to post our questions in the chat window so that we can get them
43:32
to Jose. There are no posted right now. It seems that we don't have too many questions, right
43:57
It was very clear. There was a lot of information and delivered it very well
44:02
Thank you so much. People will probably reach out to you when they have time to digest
44:07
all of this information and try to use this. Then they'll be calling you and say
44:11
wait a minute, tell me again. Okay. This is terrific. Thank you so much for your time
44:17
and effort in putting this together. Thank you so much for giving me this opportunity
44:22
I'm very happy, excited to stay here. Thank you so much. So terrific event. Thank you
44:28
Yeah, thank you so much