0:00
This is called the four big software challenges for financial services
0:06
In reality, it's just macro trends that we see in the industry overall today
0:11
When we frame it in terms of financial services, it's mostly because of the impact: as these applications move things faster
0:19
and faster, the impact to the global economy as well as to the businesses themselves is really tremendous
0:25
So that's really what our focus is. A little bit about me, just real quick, as to why you should listen to me, if my slides will work
0:36
I've worked in the enterprise software business for a very long time at this point
0:40
Dave, I was looking at your bio, and I think I've got you beat by a handful of years. I started my life as a software engineer working with Windows apps
0:48
I actually used to write video BIOSes and video drivers for a PC manufacturer
0:52
and I worked with Windows in the Windows 1.0 days, when we were doing both OS/2 and Windows
0:56
So I've been sort of around the entire incarnation of the Microsoft stack since really the late 80s, early 90s
1:04
I've held leadership positions in a lot of large and small companies, from the inception stage of five to 10 people all the way up to thousands of people, and watched some of these companies get built
1:16
All of these have something in common, which is really about building software and applications for the enterprise, for enterprise software companies or for enterprise businesses
1:28
And most of our customers at OverOps right now really tend to be in the financial services space or in the large enterprise, you know, in these big, complex distributed applications
1:40
So these big distributed apps really, you know, they do everything they possibly can on code quality
1:46
And that's sort of the focus across all of these guys is, you know, how do I sort of get high quality product out
1:53
But we're all straining these systems terribly. And a lot of what you were talking about with Application Insights, a lot of this tooling that we're seeing coming up these days
2:03
really, this tooling is really about how do we do a better job at monitoring things in production
2:09
How do we make sure that when we get code changes and we propagate them across the pipeline
2:14
you'll hear things like shift left and shift right, which is really about continuous delivery
2:19
continuous integration, and where we find things in the stack. Because I agree with sort of the
2:26
commentary you made earlier, which is your profiling and those things, people aren't doing
2:31
enough of that in production in particular. And the reality is that as much as we do with unit
2:37
tests, as much as we do with integration tests, really until you're under load, until you're in
2:42
those production environments, there are scenarios that you just can't anticipate. Software tends to only survive until it hits the first user
2:50
and the user starts playing with it and next thing you know, God only knows what's gonna happen
2:56
So at OverOps, our whole business is really driven around, similar to when Praveen was talking about profiling
3:03
it's about running under your applications when they run in production. So in .NET, we run directly within the CLR
3:11
We analyze the bytecode that runs in the CLR, and we capture any exceptions as they occur
3:18
So logging only gets you to a certain point. Logging is really about telling, you know
3:23
the developer put log statements in that tell you the state of the system. But the challenge with that is that if the log statements aren't where you thought they were
3:31
then you really still don't have enough information for the developers to actually reproduce those cases and make them possible
3:38
OverOps' goal is really about how do we get you to that root cause when that exception occurred or when that error happens in your code
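To make the idea of capturing variable state at the moment of an exception concrete, here is a minimal sketch in Python. It is only an illustration of the general technique, not how OverOps or the CLR does it; `capture_exception_state` and `charge` are hypothetical names invented for this example.

```python
def capture_exception_state(exc: BaseException) -> dict:
    """Return the exception type, message, and locals of the frame that raised it."""
    tb = exc.__traceback__
    # Walk to the innermost traceback entry, which is where the exception was raised.
    while tb is not None and tb.tb_next is not None:
        tb = tb.tb_next
    if tb is None:
        return {"type": type(exc).__name__, "message": str(exc),
                "function": None, "line": None, "locals": {}}
    return {
        "type": type(exc).__name__,
        "message": str(exc),
        "function": tb.tb_frame.f_code.co_name,
        "line": tb.tb_lineno,
        # repr() keeps the snapshot serializable even for arbitrary objects.
        "locals": {name: repr(value) for name, value in tb.tb_frame.f_locals.items()},
    }


def charge(amount_cents: int, installments: int) -> int:
    # Hypothetical business function; it fails when installments is 0.
    per_installment = amount_cents // installments
    return per_installment


try:
    charge(1999, 0)
except ZeroDivisionError as err:
    snapshot = capture_exception_state(err)
    print(snapshot["function"], snapshot["locals"])
```

The point of the snapshot is that it records the argument values at the failing frame without anyone having written a log statement there in advance.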
3:47
How do we get it back into the developer's hand as quickly as possible? And that really is part of what we see when we talk about the main challenges for the financial services space
3:58
So if you think about in the financial services, right, there's two terms that are really probably not well known in the development community
4:07
But they're super well known in the site reliability engineers, the guys that run these large scale websites and the large scale applications for their customers
4:16
And those two terms are MTTI and MTTR. And in net, what they really mean is mean time to identify and mean time to resolve issues
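As a rough illustration of how those two numbers could be computed from incident records, here is a small Python sketch. The `Incident` fields and the sample timestamps are made up for the example, and it assumes time to resolve is measured from identification to recovery; some teams measure it from the start of the fault instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass
class Incident:
    started: datetime     # when the fault began
    identified: datetime  # when someone recognized it was happening
    resolved: datetime    # when service was restored


def mean_delta(deltas: Sequence[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty sequence of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mtti(incidents: Sequence[Incident]) -> timedelta:
    """Mean time to identify: fault start until it is recognized."""
    return mean_delta([i.identified - i.started for i in incidents])


def mttr(incidents: Sequence[Incident]) -> timedelta:
    """Mean time to resolve: recognition until recovery (assumed definition)."""
    return mean_delta([i.resolved - i.identified for i in incidents])


# Illustrative data only.
incidents = [
    Incident(datetime(2020, 1, 6, 10, 0), datetime(2020, 1, 6, 10, 20), datetime(2020, 1, 6, 11, 20)),
    Incident(datetime(2020, 1, 7, 14, 0), datetime(2020, 1, 7, 14, 10), datetime(2020, 1, 7, 14, 40)),
]
print("MTTI:", mtti(incidents))
print("MTTR:", mttr(incidents))
```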
4:26
So one of the challenges in these financial applications and really in anybody that's hosting these more complex apps is the time to identify an issue
4:35
knowing whether one is even occurring, and recognizing that it happened, is extremely valuable to these users
4:43
The sooner they can identify an issue, the better, and that becomes more
4:47
important it is, particularly in these financial services areas. Something as trivial as a slowdown that occurs that's causing a transaction to time out
4:56
When you're in a financial services business and you're dealing with a credit card transaction
5:00
that could be the difference between whether or not that customer actually buys or they
5:05
walk away because they got frustrated because they were timing out. So they can't afford for the timeline for
5:12
identifying issues in production environments to be very long. And a lot of times these things may happen silently. There may be silent errors. Your throughput may be gradually degrading. One of our customers, we saw, was having a problem with their cache, and they were getting a tremendous number of cache misses, which was causing transactions to slow down
5:32
In a financial services aspect, that timeline is extremely important where you've got 500 millisecond response time requirements in some of these areas
5:41
Similarly, mean time to resolve is: once you identify an issue, how fast can you recover
5:46
In some cases, it might be as simple as throwing sort of new hardware at the issue
5:50
In others, it's I need to turn this around. I've got to get a fix out of my development group
5:54
It may be I have to roll back an application if I just did an upgrade or it may be that I've got to turn around fixes
6:00
The more that you can give to the developers, the less trial and error there is
6:05
I think we've all, you know, anybody that's been in development for more than a few years has experienced that, you know, well, I can't reproduce it in the lab. Okay, let me add some
6:13
more log statements. You run it, you get a new set of logs. That timeline can be extremely lengthy
6:20
for trying to get things promoted back and forth between production. So the more that we can do
6:24
to help these companies reduce MTTI and MTTR, the better, because that is really what they care about
6:31
And the reason they care about them are the bullets on the right. It's not just the fact
6:36
that they can lose money. It's reputation, right? We all hear about it. When you see a big website
6:41
go down, when you hear about sort of a major outage or a major transaction that a bank couldn't
6:49
process financial payments for a period of time, that damages their reputation. It could damage
6:53
their stock price. So the ripple effects across a company for any kind of a significant application
7:00
outage are huge. And in reality, if you look at some of these places, particularly like some of
7:06
the big financial services, they have more resources allocated to their development and
7:11
to their engineering operations than they do towards some of the basic financial services
7:17
cases. In some cases, these customers have 10,000 or 20,000 developers out there
7:23
they've got to make sure that those developers have the tools that they need to solve these issues
7:27
The ripple effects of these Sev1 bugs, they also tend to propagate through and they can hurt your overall timeline
7:34
If I'm pulling developers off of new development to fix bugs, then the reality is that that ripple effect throughout my company winds up being extremely high
7:45
And the third thing that we see quite a bit is really about who figures out what the problem is
7:51
You know, the challenge is that, you know, right now in these microservices, just identifying the right team and the right person who can actually help resolve and fix that problem, that alone can tend to be a very time consuming effort
8:07
And so what winds up happening is that they pull these war rooms together of 30 or 40 people
8:11
They're all sitting around trying to analyze a project. There's probably one person who knows what that fix is
8:17
So their big challenge right now is how do they reduce that MTTI and MTTR
8:21
The second thing is really about avoiding downtime overall. Now, let's be honest with ourselves
8:30
Nobody writes perfect software. Nothing survives the first interaction in the wild
8:36
If you think about a system that is now composed of a set of microservices and each one of them has a 99 percent reliability
8:44
they say, yeah, so my system should be up a lot. Well, if I've got 25 microservices, my total product has a lifecycle of about 93% or 92%
8:53
And that means that 8% of the time, your system's going down, right
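The multiplication behind that point can be sketched in a few lines of Python. This is a back-of-the-envelope model that assumes every request depends on all of the services in series and that failures are independent, which real systems rarely satisfy exactly; the function names are illustrative. Under those assumptions, 25 services at 99% each work out to roughly 78%, a steeper drop than the 92 to 93% quoted, which is closer to what about eight such services would give.

```python
def compound_availability(per_service: float, services: int) -> float:
    """Availability of a chain in which every service must be up at once."""
    return per_service ** services


def required_per_service(target: float, services: int) -> float:
    """Per-service availability needed for the whole chain to hit the target."""
    return target ** (1.0 / services)


for count in (1, 8, 25):
    overall = compound_availability(0.99, count)
    print(f"{count:>2} services at 99% each -> {overall * 100:.1f}% overall")

# What each of 25 services would need in order to reach five nines overall.
print(f"{required_per_service(0.99999, 25):.7f}")
```

This is why a target like five nines for the whole system forces each individual service to be considerably better than five nines on its own.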
8:57
So when you hear about people talking about, hey, I need to get to five nines
9:02
it's because of the multiplying effect of where that downtime comes into play
9:07
Now, unplanned downtime versus planned downtime are two totally different things, right? There is planned downtime for system maintenance
9:14
You need to minimize that. That needs to be mitigated as much as possible
9:18
I can't take an entire credit card processing system offline. And that's where you see things like people doing A/B and canary-type tests, where people are running applications
9:29
as a test in a production environment
9:33
They're gradually swapping users over to it versus, you know, doing a wholesale system upgrade
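One common way to swap users over gradually is to hash each user into a stable bucket and send only the buckets below a rollout percentage to the new version. The Python sketch below illustrates that idea under its own assumptions; `in_canary` and the `user-N` identifiers are hypothetical, and a real rollout would normally sit behind a load balancer or a deployment tool rather than application code.

```python
import hashlib


def in_canary(user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets and compare to the rollout."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable value in 0..99
    return bucket < rollout_percent


users = [f"user-{n}" for n in range(1000)]
for percent in (0, 10, 50, 100):
    routed = sum(in_canary(u, percent) for u in users)
    print(f"rollout {percent:>3}% -> {routed} of {len(users)} users on the new version")
```

Because the bucket depends only on the user ID, a user who is moved to the new version at 10% stays on it as the rollout widens, and dropping the percentage back to zero acts as a rollback.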
9:39
Unfortunately, a lot of customers really can't do that. They just don't have the scope or the breadth to do it
9:43
So we really need to do everything as an industry to make sure that downtime, even planned downtime, is as minimal as possible
9:50
System failures and unplanned downtime really is the stuff that's an ultimate killer, though
9:54
It's things that they didn't have an ability to prep for at a customer. Errors and failures, they're just going to happen, people
10:02
I've been working in this business for a long time. I've never seen bug-proof software, and I don't think we ever will
10:08
Progress and perfection are sometimes at odds with each other
10:15
And you really don't want to strive for perfection in your software at the expense of progress
10:19
You have to accept that certain bugs are going to get out there in the field. So it's really about understanding what your failure strategy is when these things inevitably go wrong
10:28
You know, are you doing things using feature flags? Are you dealing with things relative to, you know, rollback capabilities
10:35
Can I enable or quickly turn on and off specific features when things fail? Do I have built-in redundancy, so if a certain node goes down, I can fail over to other ones? The cloud does a lot of that stuff for us, but it doesn't come for free
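A feature flag used as a kill switch can be sketched very simply. The Python below is an in-memory illustration only: `FeatureFlags`, the flag name `new_pricing_engine`, and both pricing functions are hypothetical, and a production setup would typically read flags from a shared configuration service so that operators can flip them without a deploy.

```python
class FeatureFlags:
    """Minimal in-memory flag store; unknown flags default to off."""

    def __init__(self, defaults: dict):
        self._flags = dict(defaults)

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled


flags = FeatureFlags({"new_pricing_engine": True})


def legacy_price(amount_cents: int) -> int:
    # Existing, well-tested code path.
    return amount_cents


def new_price(amount_cents: int) -> int:
    # Newly released code path: applies a 10% discount using integer math.
    return amount_cents * 90 // 100


def price(amount_cents: int) -> int:
    """Route to the new path only while its flag is on."""
    if flags.is_enabled("new_pricing_engine"):
        return new_price(amount_cents)
    return legacy_price(amount_cents)


print(price(1000))                       # new path while the flag is on
flags.set("new_pricing_engine", False)   # operator flips the kill switch
print(price(1000))                       # falls back to the legacy path
```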
10:50
There are costs associated with active-active deployment. So it's really a lot of things that those customers care deeply about
10:57
particularly in the financial services space. And then really most of these customers and a lot of these enterprises
11:04
are evolving from these monolithic architectures into a set of microservices and lambda functions
11:10
understanding what those communications are and how they work when a service goes away
11:15
or if a service comes up with a different version and it suddenly fails on you
11:19
How do you handle that? Can you roll back the microservice quickly? We had one customer, they were getting an error deep down in a component and a module that they
11:28
had been using literally for five years. And it suddenly started throwing null pointer exceptions on
11:34
them. They were trying to figure out what happened. They
11:37
were swearing that nothing changed there. The reality is that upstream on a website
11:42
somebody wasn't scrubbing an input parameter the same way they used to be, but it was propagating
11:46
all the way down in the system. You know, and it was just as they were swapping into and building
11:51
these things more as services and using them in ways that they hadn't used them in the past
11:55
They were exposing errors that they had never seen before. All of these things, any downtime, it costs real money and it costs real reputation
12:04
So anything they can do to avoid downtime is stuff that we in the software industry really need to be particularly sensitive towards for these people
12:14
Ultimately, it's all about the same thing that we all face every day, which is this speed versus stability
12:20
Right. There's a trade-off that you need to make
12:25
I want to get releases out as frequently as I can. Years ago, if you go back into some of these monolithic applications and sort of even the early versions of Windows and stuff, you had two year to three year release cycles
12:37
At this point, you know, people are looking at daily updates in a lot of these sites
12:42
Right. You don't think about updating your website or your iPhone apps. They just update. Right. Those updates just get pushed out
12:49
So you really want to drive that balance between speed and stability
12:55
And how do you balance that? That creates a lot of challenges across the spectrum in here
13:01
One is that we tend to generate a lot of inter-departmental and cross-departmental communication throughout the software lifecycle
13:10
We've got much more distributed teams these days. We've got different lines of communication
13:16
We really have to sort of focus on, are these teams coordinating and communicating well together
13:21
So if I change an interface, you know, is the consumer expecting that interface to actually travel with it
13:28
And, you know, we've got bad logging that pops out of these systems, right
13:32
You know, Dave, I know you were talking earlier about logging and your logging framework
13:37
And the answer is logging is great. The challenge is that some of these sites get overwhelmed with too much logging
13:42
And one of the things that drives me crazy is that I've been out to a number of these larger customers in production where their ops people have literally turned off logging in its entirety, or they've set it only to error, because the amount of noise in the logs that the developers were pushing out there was actually filling up their hard drives
14:03
And they just literally couldn't deal with their logging anymore. The other is that the logging gets stale
14:09
Right. So you wind up with: I put a really robust set of log messages around some set of capabilities
14:17
And then I added three extra parameters into the function. But I didn't update the log statement to include that variable state within there
14:23
So you wind up with logging that is either incomplete or is creating just a tremendous amount of noise
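One way to keep a log statement from going stale when parameters are added is to derive the logged values from the function signature instead of listing them by hand. The Python sketch below shows that idea with the standard `inspect` and `logging` modules; `log_call`, `authorize`, and the `payments` logger name are hypothetical, and real payment data would need redaction before being logged.

```python
import functools
import inspect
import logging

logger = logging.getLogger("payments")


def log_call(func):
    """Log every argument of func, including ones added after the decorator was applied."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()  # include parameters the caller left at their defaults
        logger.debug("%s called with %s", func.__name__, dict(bound.arguments))
        return func(*args, **kwargs)

    return wrapper


@log_call
def authorize(card_id, amount_cents, currency="USD", retries=0):
    # currency and retries stand in for parameters added later; they are
    # logged without anyone having to touch the log statement.
    return amount_cents > 0


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    authorize("card-123", 500)
```

Logging at debug level keeps this from adding to production noise unless the level is deliberately raised for the component under investigation.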
14:30
What that means is that it's got to go back and forth in these iterations between the developers and the site reliability engineers and the production people
14:38
And that is just a velocity killer. That totally will ruin any ability for these customers to actually get any significant velocity in their release cycles
14:51
And the fourth big challenge for them is this distributed workforce. Right. You know, we're global at this point. Right
15:01
That's great. Right. We've gained such tremendous benefit in the industry by
15:08
going to this global distributed workforce. We've got pockets of expertise across the globe
15:15
We're able to take advantage of intellectual capital and of resources in any place where
15:22
those skills are available. And so that's all a great thing. The challenge is that that creates
15:28
time zone challenges. It creates communication challenges where we've got language barriers
15:33
where people just use a lot of colloquialisms
15:39
I know I do. I use a lot of sort of U.S. slang that doesn't necessarily translate well
15:46
One of our core development sites is over in Tel Aviv. Their workday is Sunday to Thursday
15:51
Our workday here in the States is Monday to Friday. So we've got several days where we just don't even have a staff overlap for us to communicate through. The other is that we've got tribal knowledge that exists in the Tel Aviv site that they all talk to each other about, but we don't always replicate that information to our group here in Orlando, and vice versa. So you've got to get really, really good in these larger enterprises
16:16
about how do I distribute that tribal knowledge, whether or not it's through Microsoft Teams
16:22
whether or not it's doing better jobs on commentary, whether or not you're sharing
16:26
things in Git or Slack and the various channels out there, we really need to make sure that those
16:31
pieces are through there. We also need to look at the more that we want to do agile development
16:37
you know, how are you coordinating your agile delivery with these global teams
16:43
right? They're on different timelines, they're on different schedules. There's things like SAFe
16:48
the scaled agile framework and the like that are geared to deal with those. But a lot of companies
16:52
don't really handle those. And then the final part in the distributed
16:57
workforce challenge really tends to be in the cases where when I have these site outages
17:03
where I have these challenges on the site, and I'm pulling developers on a regular basis off of
17:08
their product, it kills the employee morale. The morale of the development team, when they're
17:14
constantly firefighting and they're not doing new features, that tends to drag down and really
17:20
sort of can slow down your entire engineering activity. So, you know, like I said, I mean
17:25
the challenge really is that on the financial services side, you know, the issues for those
17:31
companies are really the same as they are for most enterprises across the space. The difference is
17:35
more what the impact is, right? It may not be a big deal if you've got a minor site outage on
17:41
something that's lightly used. But when you're a large distributed global provider of, you know
17:47
credit card services and credit card processing, you can't afford any outage, but you still have
17:52
to get your software rolled out as fast as you possibly can. And you have to have the tools in
17:56
place. One of the things that we've done a great job on in the engineering and in the technology
18:01
industry over the years is we've done a great job of providing tools for the developers all the way
18:07
up to the line of production. But we've never really focused well at the engineering side about
18:13
how do I bridge between these developers and the complex production environments that are out there
18:19
to make lives simpler for the developers? And I think that's really part of what
18:23
you know, things like Application Insights, some of the new observability platforms
18:27
Wavefront, you know, some of those places are really focused on. It's also things that, you know
18:32
tools like OverOps and tools like, you know, some of the profilers, that's really what we focus on
18:37
It's about how do we enable, you know, customers to roll things into their production environment to get their products out the door as fast as they possibly can and still do it with a high degree of quality and a way to fix as quickly as they possibly can so that when the inevitable goes wrong, you actually can recover from it quickly
18:59
And I think that's the majority of what I really wanted to sort of talk about today
19:03
I did see a couple of quick comments on there about big data and hybrid cloud
19:10
I will tell you, for all of the financial services, they're heavily invested in both
19:14
You can't be in the financial services space without dealing with big data in some fashion
19:21
In my prior life, we were all big data oriented. But if you think about the sheer number of transactions, if I'm a credit card processing company
19:30
the number of transactions that they have streaming in on a regular basis that they need to respond to quickly is enormous
19:40
Some of these guys, I was dealing at one point with FINRA, which is the Financial Industry Regulatory Authority in the United States
19:47
And they can get anywhere upwards from three to 10 trillion records a day that are popping in across all the global financial trades
19:56
And they need to process all of those. They need to be able to store them. They need to be able to monitor them. They need to be able to analyze those for regulatory purposes and really on that watchdog side
20:07
And so really, they can't do that without the cloud. But again, I think Sarva had a question on hybrid cloud
20:14
And the reality is that almost all of them are hybrid cloud, particularly in the financial services space
20:20
A lot of them are not going to a full public cloud. They're not comfortable with putting all that information, even with high security that you have around AWS or around Azure or Google Cloud
20:32
In reality, what they're doing is they're building private clouds that they then will branch out processing nodes into public cloud capabilities
20:40
and they'll tend to blend their on-prem data rooms with a private cloud as well as a public cloud strategy
20:52
for offloading heavy processing tasks and for heavy load tasks. And again, this is where tools like distributed tracing need to come into play
21:02
This is where profiling needs to come into play. It's about understanding how these moving parts
21:08
all impact each other, because if one piece goes out, it really truly is a weakest-link problem, unfortunately