0:03
All right, well, thank you very much for having me. This is exciting. This is a topic that is both old and new. Data quality is nothing new. We've been dealing with it for years and years, even before databases existed. People had filing cabinets full of paper, and there were sure to be mistakes inside of them. So data quality is an eternal issue. It doesn't stop being an issue, but the more downstream things we do, and the more complex our usage of that data gets, the more ways we have to muck it up and make our lives very difficult.
0:30
So our goal here is to talk about data quality through the lens of AI. That is, we're going to throw complicated algorithms at our data, so what does that mean for data quality? I'm going to move kind of quickly. My goal is to save some time at the end for questions. Questions are great, but if there aren't any, we can also just chat and have fun as well. So I'm just going to dive right on in.
0:50
Very quickly: I've already gotten the intro, and this is just more of the same, so I will skip it because we already got a cool intro. The agenda is straightforward. Really, we're going to talk about data quality: the classic problem, how to solve it, why it matters, and then all the things we do in AI that make our lives more difficult because of it. We'll bring it all together at the end and save a little bit of time for questions.
1:18
So this is an example of a software development life cycle. There are many out there. Organizations of different sizes have different variations of this. If you're a small organization you have a little less; if you're big you may have more. But the basic idea is there's design, there's architecture, you build some software, you test a whole bunch, iterate, and eventually deploy, maintain, and so on and so forth. There are a lot of arrows and a lot of boxes, and these are all here primarily so that we make good decisions. We add quality new features, make good decisions, make few mistakes, and then release something that users love, right? That's the whole point of software development.
1:56
But that is kind of how a mature product would behave in terms of how we go from an idea to code in a production environment. What very often happens in AI is a little bit like this. Not exactly, I'm exaggerating, but it's funny, right? Somebody comes and says, "We want the bright shiny new thing now," and you say, "Okay, when do you want it by?" and they say, "Next week." You work really fast and a new feature goes out, and it's cool, because the new shiny thing's worth a lot, right? And so very often we skip steps, we hurry, we do everything we can to get things out, and that's dangerous, right? We're bypassing a software design life cycle.
2:32
So part of this entire conversation is that we should treat anything we develop in AI very much like any other software development. It's just features. In the same way that database development is just development and software development is development, AI development, machine learning algorithms, analytics, dashboarding, all those things that we do with our data later, it's just development. So treat it like development. If you do what I show on the screen right now, things may not work the way we want them to. You'll get some surprises. Surprise cake for your birthday: good. Surprise bugs: not so good, right?
3:09
There are more challenges, though. What are the challenges we face with data and AI? These are common challenges. For example, data grows, right? Data gets bigger. It always gets bigger. We have more databases, more data, maybe more data sources, and it can be from anywhere. Data begins somewhere. It begins in a transactional database, in edge devices, in some data source, someplace where it's being created. And then it exists and it's used, right? It's used by applications, used in UIs, used by users, used by people.
3:41
But then more things happen to that data. We begin getting bigger and asking more questions about our data, and so we begin copying it. We make replicas of it. Maybe we'll do some transformations of that data to make it easier to report on. We want to provide dashboards and reports to users, or internally to an organization, or to executives, or who knows who else: shareholders. There are a million places that data can be used. And so after transforming it and playing with it a bit, we may eventually put it into other places: into Redshift or Snowflake, a data lake, a warehouse, a lakehouse, whatever. Data goes somewhere else and we work with it downstream.
4:19
And so because data is growing and evolving over time and moving a lot, we have a way of making mistakes along the way, and those mistakes get copied. As soon as a mistake is made with data, you have a data quality issue. It just passes through this process, whatever it looks like, and continues its way to the end, which isn't very good.
4:38
And, you know, an additional downside to AI processes is that they like returning answers. They don't give results; they give answers. We want our AI to tell people what they want to know. I want to know what a plane ticket costs: give me the price. I want to know what car you'd recommend for my lifestyle: go and give me the right answer. So they're very authoritative in nature. If you have bad data, then you're going to authoritatively give bad answers, which is a challenge. You want to be able to authoritatively give good answers, and data quality is the linchpin to all of this.
5:14
So much of this is going to be a classic answer: how do we prevent bad data? This is going to begin in the land of classic data problems and end in the land of AI. We're going to start with things you may have heard before, maybe things you do already, maybe things you don't do already.
5:30
The simplest way to prevent bad data is to prevent it at the source. You're at the beginning of the data life cycle. You're in an application, you're pulling in data from edge devices, people are processing orders or buying things or making bank transactions or whatever. You could be anywhere data is being created for the first time. It's being stored somewhere and we want to make sure it's correct. This is where data begins. So how do we do that?
5:55
There are different ways to do it. You can be strict about it and have things like check constraints, data constraints, foreign keys: database constructs that ensure that if you try to put bad data in there, it will fail. Those are good things to have if they're available to you, because they ensure you can't have bad data. If your database says you can't do this, then you can't do this. Problem solved, right?
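To make the idea of database constructs that reject bad data concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `customers` and `orders` tables, their columns, and the `try_insert` helper are assumptions made up for illustration; they are not from the talk.

```python
import sqlite3

# Hypothetical schema for illustration; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        quantity    INTEGER NOT NULL CHECK (quantity > 0)
    );
""")
conn.execute("INSERT INTO customers (customer_id) VALUES (1)")
conn.commit()


def try_insert(customer_id, quantity):
    """Attempt an insert and report whether the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO orders (customer_id, quantity) VALUES (?, ?)",
                (customer_id, quantity),
            )
        return True
    except sqlite3.IntegrityError:
        return False


good_row = try_insert(1, 5)           # valid customer, positive quantity
negative_qty = try_insert(1, -3)      # violates the CHECK constraint
unknown_customer = try_insert(99, 2)  # violates the foreign key
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Only the first insert is stored. The other two fail inside the database itself, regardless of what the calling application did or did not check.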
6:18
We can't do that in every system, but when we can, it's nice. Some systems put data in out of order and do other things where you just can't do it, or you're on a platform where those sorts of constraints don't make sense. You can still validate data, though. The application can validate data, the database could have a process to validate data, or you can simply run reports or validation processes after the fact that check data and say, is it good, is it bad, and then deal with it in whatever the appropriate way is.
6:48
It's important to note that any bad data created here lasts forever. Anything else you do with it from this point on will have mistakes in it, and it's going to stay there. So if we can fix it at the source, that's a very good thing.
7:02
And note that I mentioned validating data at multiple levels. There's value in that. Sometimes we like to think to ourselves, hey, if the app is checking our data for us, we're good, right? We don't have to worry about anything else. The database can be kept simple, straightforward, small, and compact. But we're assuming the app is perfect. If you've built an app where there are no bugs and that's the way it's been for its whole life cycle, that's awesome. But most apps don't behave that way. Things go wrong, we have bugs, we fix bugs, bad data is created. So when you're validating data in its creation state, consider having it checked in multiple places. In the same way you layer security, layer data validation and data integrity checking, because this is the beginning of data. If it's bad here, it's bad forever.
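Layered validation can be sketched as the same rule enforced twice, once in the application and once in the database. This is only an illustration under assumed names: `validate_quantity`, `save_quantity`, the `order_lines` table, and the `skip_app_check` flag (which stands in for an application bug) are all hypothetical.

```python
import sqlite3


def validate_quantity(quantity):
    """Application-layer check: return a list of problems (empty means OK)."""
    problems = []
    if not isinstance(quantity, int):
        problems.append("quantity must be an integer")
    elif quantity <= 0:
        problems.append("quantity must be positive")
    return problems


conn = sqlite3.connect(":memory:")
# Database-layer check: the same rule, enforced again as a backstop.
conn.execute(
    "CREATE TABLE order_lines (quantity INTEGER NOT NULL CHECK (quantity > 0))"
)


def save_quantity(quantity, skip_app_check=False):
    """Return which layer rejected the value, or 'stored' if it got through.

    skip_app_check simulates a bug that lets bad input past the application.
    """
    if not skip_app_check and validate_quantity(quantity):
        return "rejected by application"
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO order_lines (quantity) VALUES (?)", (quantity,)
            )
    except sqlite3.IntegrityError:
        return "rejected by database"
    return "stored"


normal_case = save_quantity(3)
caught_by_app = save_quantity(-1)
caught_by_db = save_quantity(-1, skip_app_check=True)
```

When the application check works, the database constraint never fires. When the application check is bypassed, the constraint is what keeps the bad row out.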
7:46
From here, where does data go? We start with transactional data, with application data, and it begins moving along. It may be copied. It may end up in an analytic environment because we want to run reports, whether they're inward-facing or outward-facing. If there's any sort of analytics going on at all, decisions are being made based on it, and so we want those decisions to be made correctly. So how do we do that?
8:06
Validating data is not hard. Honestly, you don't have to check every single value to make sure everything is correct. Very often, simply checking the general data size and data shape after data gets moved, copied, or transformed is good enough. If the amount of data that I have changes by more than X percent day over day, or like over like, we want to know. We want to know if the data returned is zero for a day, if the number of data sources goes up or down by a lot, if the byte count goes up or down by a lot. Things like that are very, very easy to check, and it's very easy to tell that something's wrong. If we get a thousand rows from somewhere every single day and suddenly it's a million, that's unusual, and there's a reason, right?
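A size-and-shape check of this kind might look like the following sketch. The `shape_alerts` function, the metric names, and the 50 percent default threshold are assumptions chosen for illustration; a real threshold would depend on how much the data normally varies.

```python
def shape_alerts(previous, current, max_change_pct=50.0):
    """Compare two metric snapshots, for example yesterday's load and today's.

    Flags any metric that dropped to zero or moved by more than
    max_change_pct percent in either direction.
    """
    alerts = []
    for name, old in previous.items():
        new = current.get(name, 0)
        if new == 0:
            alerts.append(f"{name}: nothing received")
        elif old > 0:
            change_pct = abs(new - old) / old * 100
            if change_pct > max_change_pct:
                alerts.append(f"{name}: changed by {change_pct:.0f}%")
    return alerts


# Hypothetical snapshots: a thousand rows yesterday, a million today.
yesterday = {"rows": 1_000, "sources": 12, "bytes": 250_000}
today = {"rows": 1_000_000, "sources": 12, "bytes": 240_000}
alerts = shape_alerts(yesterday, today)
empty_day = shape_alerts(yesterday, {"rows": 0, "sources": 12, "bytes": 0})
```

Here only the row count is flagged for the jump from a thousand to a million, while the small change in byte count stays under the threshold.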
8:44
Similarly, we can look at values as well. Uniqueness of values: are there certain values that should always be unique? Are nulls and blanks allowed or not? Are there values that are invalid? Very often we'll take, say, an integer to store a value that we know can't be negative, but we're storing it as an int anyway. You can check and see: are there negatives? You may say that can't possibly happen, but we know how these things work. If you allow a negative, somebody will put a negative in. If you allow it to happen, it's just what QA does, right? It's how we do things. Negative a million is just as valid as positive a million if you allow it, right?
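The value checks described here (uniqueness, nulls and blanks, negatives in a field that should never go below zero) can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample rows, the column names `order_id` and `quantity`, and the function `check_values` are all hypothetical and do not come from the talk.

```python
from collections import Counter

def check_values(rows, key="order_id", amount="quantity"):
    """Return (row_index, problem) pairs for a list of hypothetical order rows."""
    problems = []
    # Count every non-missing key so duplicates can be flagged.
    key_counts = Counter(
        r.get(key) for r in rows if r.get(key) not in (None, "")
    )
    for i, row in enumerate(rows):
        k = row.get(key)
        if k in (None, ""):
            problems.append((i, "missing key"))
        elif key_counts[k] > 1:
            problems.append((i, "duplicate key"))
        value = row.get(amount)
        if value is None:
            problems.append((i, "missing value"))
        elif value < 0:
            # The column type allows negatives, the business rule does not.
            problems.append((i, "negative value"))
    return problems

rows = [
    {"order_id": 1, "quantity": 5},
    {"order_id": 2, "quantity": -1_000_000},
    {"order_id": 2, "quantity": 3},
    {"order_id": None, "quantity": None},
]
for index, problem in check_values(rows):
    print(index, problem)
```

A check like this could run right after a load, copy, or transform, before the rows join the rest of the data.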
9:16
Missing data is important, as is duplicate data. These are key validations that are not hard to check for. If you have some key value that should be unique, make sure it only appears once per value. Again, in the analytic world we very often have fewer constraints, less validation built into data structures, but we can check it, right? You can load data, check it. You can copy data, check it. You can transform it, check it. It's very fast, it's very easy, and once you've checked any data you've loaded, you can put it onto your giant mountain of existing data and not worry about it again.
9:47
Also consider edge cases. Edge cases happen a lot in data. Are they valid or not valid edge cases? Do they mean something, and we should have them, or are they crazy? If I see in my order processing system that somebody placed an order for 10 trillion dollars, I would assume that it is probably invalid. Nobody is making an order that large, right? On the other hand, if I saw an order for a million dollars, I would say, all right, that number is really big, probably somebody should look at it, but it could be valid, right? Somebody could spend a million dollars on something. As improbable as it may be, that's at least feasible. Negative a million dollars: not possible. You can't spend a negative number of dollars, right? So we can check for things like that and understand: is it correct or not?
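The three order examples (10 trillion dollars, a million dollars, negative a million dollars) map naturally onto a small classification rule. The sketch below is hypothetical: the function name `classify_order_total` and both thresholds are assumptions picked for illustration, and real limits would depend on the business.

```python
def classify_order_total(total, review_above=100_000, reject_above=1_000_000_000):
    """Classify a hypothetical order total as 'ok', 'review', or 'invalid'."""
    if total < 0:
        return "invalid"  # you can't spend a negative number of dollars
    if total > reject_above:
        return "invalid"  # e.g. a 10 trillion dollar order
    if total > review_above:
        return "review"   # big but feasible, somebody should look at it
    return "ok"

for total in (49.99, 1_000_000, 10_000_000_000_000, -1_000_000):
    print(total, classify_order_total(total))
```

Keeping "review" separate from "invalid" reflects the distinction made above: an improbable value is flagged for a person, an impossible one is rejected.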
10:32
Importantly, this validation should all occur before the data is moved again, before it's used for anything else. This is your second chance. So you check it at its source, you check it after it's been moved and copied, and each time it gets moved and copied there needs to be validation.
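One simple way to validate data each time it is moved or copied is to compare a row count and a content fingerprint between the source and the copy. The sketch below assumes rows are small dictionaries held in memory; `fingerprint` and `validate_copy` are hypothetical helper names, and hashing with SHA-256 is just one possible choice, not a method named in the talk.

```python
import hashlib

def fingerprint(rows):
    """Return (row_count, digest) for a list of dict rows, independent of row order."""
    digests = sorted(
        hashlib.sha256(repr(sorted(row.items())).encode("utf-8")).hexdigest()
        for row in rows
    )
    combined = hashlib.sha256("".join(digests).encode("utf-8")).hexdigest()
    return len(rows), combined

def validate_copy(source_rows, copied_rows):
    """Raise ValueError if the copy does not match its source."""
    if fingerprint(source_rows) != fingerprint(copied_rows):
        raise ValueError("copied data does not match its source")

source = [{"id": 1, "total": 10}, {"id": 2, "total": 25}]
good_copy = [{"id": 2, "total": 25}, {"id": 1, "total": 10}]  # same rows, new order
bad_copy = [{"id": 1, "total": 10}]                           # a row went missing

validate_copy(source, good_copy)
try:
    validate_copy(source, bad_copy)
except ValueError as err:
    print("validation failed:", err)
```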
10:47
It amazes me how many systems I've worked with over my lifetime, and seen and heard about, where there's no validation. Data simply moves all over the place and people just use it, and that's it. Keep in mind that if you do that, you're riding on hopes and dreams, but hopes and dreams aren't very scientific, nor do they really help you get good data. I've hoped and dreamed a lot, and my data quality has never gone up as a result. So I highly recommend validation as opposed to hopes. They don't really help as much.
11:16
And of course, one big piece that doesn't always get considered is what happens when things change. Software releases happen all the time. In applications, in the analytic world, in the AI world, things change, right? You decide to go from, you know, ChatGPT to a new version of ChatGPT. We're going to upgrade and try a new version, we're going to move everything to that, but that changes things, right? Or maybe a software release goes into a transactional application. Well, check and make sure that any data impacted by the release is checked, validated, looks good. Also ensure that new data being created is also checked, validated, good. This is really what QA is for, right? Quality assurance is there to ensure that changes happen well. That's important. If we don't do that, there's always a chance that things will go wrong going forward, and now you're in a world where some of your data is good and some of your data is bad, and we have to come back and deal with that later. Which I suppose is better than all bad data. But just keep in mind that changes to any tier of your application, whether it's a transactional app, a production database, ETL, ELT, data movement, data copying, analytics, AI, machine learning, whatever, are all changes that should be QA'd, tested, data validated. It's important.
12:30
Something else, too, that's kind of a fundamental in database design and often gets thrown away when we start talking about things so far downstream: what do our names of things look like? What do our data types look like? These are classic problems we've dealt with that are very often seen as the beginning of the lifeline for our data, but they matter more now than ever, because we're going to take data and throw it into an AI algorithm and say, hey, look at my data, process it, understand it, then answer my questions about it. If you have metadata problems, not just data problems but metadata problems, then there is a chance the AI will make a mistake based on them. This is really common. We've all probably run into metadata problems like this, and if you haven't, you will, but you're lucky so far, I suppose.
13:18
For example, I have a couple of fun examples here, just, you know, to throw some ideas out. What if you saw a data element that was an integer named invoice? What would that be? If you're an AI algorithm consuming this, what do you think it would be? Does it mean there is an invoice, yes or no? Is it the invoice amount? Is it an invoice number? I don't really know what it is. I can guess. I can look at the data and guess, but there's no guarantee I'll be right. I'll be right sometimes and I might be wrong sometimes. So look at any data you have, look at the names, the data types, the sizes, and validate: is this correct or not?
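One way to review names and types like this is to keep a small data dictionary and flag every column whose meaning is not written down. The sketch below is an assumption about how that could look, not something described in the talk: the `Column` class, the helpers `undocumented` and `describe`, and the sample columns are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Column:
    name: str
    data_type: str
    description: str = ""

def undocumented(columns):
    """Return names of columns with no description to disambiguate them."""
    return [c.name for c in columns if not c.description.strip()]

def describe(columns):
    """Render documented columns as plain text that could accompany the data."""
    return "\n".join(
        f"{c.name} ({c.data_type}): {c.description}"
        for c in columns
        if c.description.strip()
    )

columns = [
    Column("invoice", "int"),  # flag, amount, or number? The name alone cannot say.
    Column("invoice_number", "int", "Sequential number printed on the invoice."),
    Column("entry_time", "datetime", "Date and time the row was entered, in UTC."),
]
print(undocumented(columns))
print(describe(columns))
```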
13:56
For example, if I had a datetime named entry time, is it an entry time, or is it really a date and time? Sometimes we store dates or times as datetime, or even as a string. Strings are dangerous because they could have bad values, right? February 31st is not possible in a datetime that's typed correctly, but if it was a string, you could have a February 31st that never happened. Or you just mix up European and American formats. For example, some of us will write 2/29 for February 29th, but what about 29/2, or 29th of February, or year-month-day, year-day-month, and so on and so forth? Those things matter here. AI is going to consume this, look at it, make decisions based on it. It needs to be clear.
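Both problems, impossible dates hiding in strings and ambiguous day/month order, are easy to demonstrate with Python's standard `datetime` module. This is a minimal sketch: the helper name `parse_iso_date` and the choice of year-month-day as the single accepted format are assumptions made for illustration.

```python
from datetime import datetime

def parse_iso_date(text):
    """Return a date for a 'YYYY-MM-DD' string, or None if it is not a real date."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None

# Impossible or differently formatted dates are rejected instead of stored.
for raw in ("2024-02-29", "2023-02-31", "2/29", "29/2"):
    print(raw, parse_iso_date(raw))

# The same string means two different days depending on the assumed convention.
us = datetime.strptime("02/03/2024", "%m/%d/%Y").date()  # month first
eu = datetime.strptime("02/03/2024", "%d/%m/%Y").date()  # day first
print(us, eu)
```

Storing the value in a real date type, or at least validating strings against one agreed format on the way in, removes the guesswork for whatever consumes the data later.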
14:39
Similarly, if I had a column, and this is kind of a fun one: very often we soft delete data because we want to keep it around for posterity, for reference purposes, for compliance purposes. So we'll soft delete data, and we'll have some column or element like "is deleted" or "is archived," whatever, and we'll have that out there. Then at some point AI comes in, consumes this, and the algorithm will maybe say, well, hey, it's deleted. Okay, does that mean I should use this data as I crunch and return responses, or should I skip the ones that are deleted because they're deleted? Realistically, it doesn't really know what the right answer is. You and I may, but it won't.
15:17
So consider that as well: if there's any kind of ambiguous metadata, or metadata that may influence decisions, you may have to train it or give it additional information to understand what it means, or just remove it altogether. If a column is something you can simply resolve and say, you know what, I just don't want deleted data in there at all, then remove all the rows where "is deleted" is one and call it a day. Never train with it, don't bring it in as RAG data, just leave it out. Don't even give it to the algorithm. Don't let it think about it. It's just noise at that point, and providing noise to an AI algorithm is only going to create bad responses, or just, you know, things people don't expect. The last thing you want to do is get information on an order you placed last year that you deleted and never actually placed, right? I filled my cart up, I went to place the order, I said, oh, I don't want that stuff now, just start over, and then it's giving me answers back about it later. That's not so good.
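Removing soft-deleted rows before they ever reach training or RAG data can be a very small filtering step. The sketch below assumes rows are dictionaries with a flag column named `is_deleted` that holds 1 for deleted rows; the function `rows_for_ai` and the sample orders are hypothetical.

```python
def rows_for_ai(rows, flags=("is_deleted",)):
    """Drop soft-deleted rows and strip the flag columns before handing data to a model."""
    kept = []
    for row in rows:
        if any(row.get(flag) for flag in flags):
            continue  # soft-deleted: treat it as noise and leave it out
        # Remove the flag column too, so the model never has to interpret it.
        kept.append({k: v for k, v in row.items() if k not in flags})
    return kept

orders = [
    {"order_id": 1, "item": "headphones", "is_deleted": 0},
    {"order_id": 2, "item": "abandoned cart", "is_deleted": 1},
]
print(rows_for_ai(orders))
```

Other flags, such as an archived marker, could be passed through the `flags` parameter if they should be excluded as well.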
16:12
One note is we have different kinds of data. These are two examples. You may have other data that goes into your algorithms, goes into your machine learning. Keep in mind that each of these is a different data set. So you have training data. The purpose of training data is to ultimately tell your AI how to behave: what is its purpose, what is it going to do, how is it going to respond, what is it trying to do? And if it's bad, what happens? Well, if you have bad training data, your model will simply not behave the way it should. It may return the right answers, potentially, but it may not do it in the right way, or it may give irrelevant answers that are still correct. You ask me what the color of the sky is and I say 42. That's not a very helpful response. Maybe I was thinking of a wavelength instead, or something else, but that isn't relevant or correct. So I can give the right answer to the wrong question. That's often what happens here.
17:02
On the other hand, you have retrieval-augmented generation, that is, RAG data. That's the data you bring. So you train a model, it's behaving the way you want, and now you swap in your data, the actual data you care about. You've trained it to be, you know, a bot that will answer questions about an airline, for example. Now you're going to bring your airline data in so it can use current information to do what it has to do. If your RAG data is incorrect, what do you get? You have invalid responses, bad answers. So it's important to know you have different sets of data, and bad data in any of these areas will lead to different results at the end of the line. You can have bad training data, which will provide one set of problems and headaches, and bad RAG data, which will result in other headaches, different headaches. And if you have bad data in both, you may have a hard time diagnosing where the problem is.
17:59
All right, so now we're going to talk about AI specifics. How do we cheat? How do we do things we're not supposed to do to try to fix the problem? Because we have an app, we're close to releasing some new stuff, it's not behaving correctly, and we've got to fix it. What are the things we do, and then don't look back on, that we probably shouldn't do?
18:22
probably we shouldn't do all right this sounds fun right this is the most common
18:22
This is the most common one I see. Very often we're close: we have an interface that works nicely, and it's almost correct all the time. Sometimes you get bad responses, sometimes we get answers that aren't correct. So what do we do? We take our prompt and try to engineer it further. We say, hey, you know what, we're really close, we'll just adjust it a little bit so that it doesn't give the wrong responses anymore. And this goes down a rabbit hole of cause and effect, cause and effect.
18:50
The purpose of prompt engineering is to provide purpose to an algorithm, so that it knows what it is, what its role is, what it's supposed to be doing, how it answers, things like that. That's what we're trying to do. It keeps things relevant; it makes sure that it's doing what you want it to do.
19:07
If you have bad data coming in from anywhere in your process and it consumes that bad data, you really can't prompt your way out of it. You can try. You can include details in your prompt to try to get around it, but the problem is you're changing its purpose now. You're trying to prompt your way out of bad data, but as you do so, you're going to make it behave differently in an effort to avoid bad data.
19:30
Think of it from a human perspective. Say you were going to ask somebody to answer questions for you on the phone, and you went to them and said, "Listen, this is really important: when you're answering questions about the airline, don't respond with that flight going to Bangkok tomorrow. Just don't mention it at all. And if they ask about it, don't say anything. It's not a thousand dollars." You begin giving these weird pointers, and a human being would say, "What are you talking about? That doesn't make any sense."
19:56
But the algorithm has to consume whatever you give it. So you prompt engineer, you put details in, and you might start getting correct answers to the problem you found, but I guarantee you will introduce complexities and wrong answers elsewhere. You can't prompt your way out of bad data, and if you try, you will prompt yourself into other bad responses. Not a good way to go.
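A minimal sketch of the alternative to prompt patching, under stated assumptions: the `Flight` record, the price bounds in `is_plausible`, and the flight numbers are all hypothetical and not from the talk. The idea it illustrates is that the system prompt only states the assistant's purpose, while implausible records are held back before they ever reach the model.

```python
from dataclasses import dataclass

# The system prompt states the assistant's purpose and nothing else.
# It carries no "don't mention flight X" patches.
SYSTEM_PROMPT = (
    "You are an airline assistant. Answer questions about upcoming flights "
    "using only the flight data provided."
)


@dataclass
class Flight:
    number: str
    destination: str
    price_usd: float


def is_plausible(flight: Flight, min_price: float = 20.0,
                 max_price: float = 20_000.0) -> bool:
    """Hypothetical sanity check; the bounds are illustrative assumptions."""
    return min_price <= flight.price_usd <= max_price


def build_context(flights: list[Flight]) -> tuple[str, list[Flight]]:
    """Return context text built from plausible records, plus the rejected records."""
    good = [f for f in flights if is_plausible(f)]
    rejected = [f for f in flights if not is_plausible(f)]
    context = "\n".join(
        f"{f.number} to {f.destination}: ${f.price_usd:,.2f}" for f in good
    )
    return context, rejected


flights = [
    Flight("XX100", "Bangkok", 1.00),   # made-up bad record: the price is clearly wrong
    Flight("XX200", "Tokyo", 850.00),
]
context, rejected = build_context(flights)
print(context)
print("Needs fixing at the source:", [f.number for f in rejected])
```

The rejected list is the useful part: it is a to-do list for whoever owns the data, instead of a growing pile of exceptions in the prompt.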
20:17
Similarly, we have the world of the data we bring: retrieval-augmented generation. RAG data is the data we bring. You train your model on one data set, and then when you're happy with what you have, you bring your data and continue working with it. You can't fix bad training with this, though. The purpose of RAG data is to be your data. For the airline example, I will have flight numbers and prices and layovers and durations and first class, economy, whatever. I have all that data coming in for upcoming flights. People will ask questions; they'll get answers.
20:50
The nice thing about this, though, is that if you do have bad data in the RAG world, that's going to mean bad responses, just inaccurate answers. That's all. It's also easy to fix, because this is the data set you're bringing. You simply identify where the bad data is and fix it, and then after you fix it you can go back, find the source of the bad data, and fix the source as well. And then we're good, and you move on with life.
21:11
Don't make it more complicated than it has to be. Don't try to work around it, don't try to train around it, don't try to fine-tune around it, don't try to unlearn stuff. Anything else you do to try to fix your bad RAG data is just going to make things more complicated. There really is a beauty in simplicity in this world: the more complicated your code gets for algorithms, the harder it is to troubleshoot and to get back what you really want.
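The "identify the bad data, fix it, then fix the source" loop can be sketched roughly as follows. The field names, the `source` tag on each record, and the specific checks are assumptions made for illustration; a real feed would have its own schema and rules.

```python
from collections import Counter

# Hypothetical schema for one RAG record about an upcoming flight.
REQUIRED_FIELDS = ("flight_number", "price_usd", "duration_minutes", "source")


def find_problems(record: dict) -> list[str]:
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    price = record.get("price_usd")
    if isinstance(price, (int, float)) and price <= 0:
        problems.append("non-positive price")
    duration = record.get("duration_minutes")
    if isinstance(duration, (int, float)) and duration <= 0:
        problems.append("non-positive duration")
    return problems


def audit(records: list[dict]):
    """Split records into clean and bad, and count bad records per source."""
    clean, bad = [], []
    for record in records:
        problems = find_problems(record)
        if problems:
            bad.append((record, problems))
        else:
            clean.append(record)
    # Counting by source points at the upstream feed that needs fixing.
    by_source = Counter((r.get("source") or "unknown") for r, _ in bad)
    return clean, bad, by_source


records = [
    {"flight_number": "XX200", "price_usd": 850.0, "duration_minutes": 690, "source": "internal"},
    {"flight_number": "XX100", "price_usd": -1.0, "duration_minutes": 700, "source": "partner_feed"},
    {"flight_number": "", "price_usd": 420.0, "duration_minutes": 95, "source": "partner_feed"},
]
clean, bad, by_source = audit(records)
print(len(clean), "clean;", dict(by_source))
```

Only the clean records would be indexed; the per-source counts are what you take back to the owner of the feed.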
21:36
Semantic search is fun because there's a lot of math involved here. Personally, I love math. Math is fun to me. Not everyone believes that, but I enjoy it. The purpose of semantic search is to take data and create associations in it. This helps us create more qualitative analysis: you have quantitative, mathematical analysis that results in our ability to make things associate to each other and answer questions. So if I ask a question of an algorithm and it doesn't recognize all the words or sentences, it can figure out what I'm talking about based on meaning and sort it out: synonyms, things like that.
22:09
Under the covers, it's going to vectorize data: break it into chunks, create associations, assign numbers of similarity or dissimilarity, and go from there. This is really cool, but if you have bad data in here, you'll get bad associations.
22:26
This is a very basic example I like to use to show what vectorization looks like. You're basically taking this many-dimensional model of relationships. Obviously this is smaller than any data you have, right? You have more than eight things and seven things you associate them with; you'll probably have hundreds, thousands, millions, whatever. But to simplify it, you look at this here and you see positive and negative numbers that indicate similarity or dissimilarity.
22:49
If I were talking to an algorithm, a chatbot, whatever, and getting back wrong answers because of this, I can trace it back. Tracing back bad data from semantic search is actually not that bad, because if I see that cat is associated with canine and there's a high relevance factor for it, I'm going to say, all right, there's something wrong in my data. There's no way that would happen otherwise. Let me go dig into it, figure out where that came from, and solve it.
23:17
So the good news here is that if you see bad data here, you can reverse engineer it, solve it, and be done. But you can't work around it. If there's bad data here, it will continue to associate badly in the future. And canine doesn't just mean dog; it can mean wolf, it can mean something else, so there's potential for more bad associations that I simply haven't hit on yet. Just because I found one does not mean that more aren't there somewhere.
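To make the cat and canine check concrete, here is a small sketch using cosine similarity, one common way to score how close two vectors are. The three-dimensional vectors are invented toy values (real embeddings have hundreds or thousands of dimensions), and the `EXPECTED_UNRELATED` list and the 0.9 threshold are assumptions, not something shown on the slide.

```python
import math
from itertools import combinations


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors, in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only.
EMBEDDINGS = {
    "cat": [0.90, 0.10, 0.80],
    "kitten": [0.85, 0.15, 0.75],
    "canine": [0.88, 0.12, 0.79],    # simulates bad data: far too close to "cat"
    "airplane": [-0.70, 0.90, -0.20],
}

# Pairs a domain expert says should NOT be strongly associated (an assumption).
EXPECTED_UNRELATED = {
    frozenset(("cat", "canine")),
    frozenset(("kitten", "canine")),
}


def suspicious_pairs(embeddings, expected_unrelated, threshold=0.9):
    """Flag pairs that should be unrelated but score at or above the threshold."""
    flagged = []
    for a, b in combinations(sorted(embeddings), 2):
        if frozenset((a, b)) in expected_unrelated:
            score = cosine_similarity(embeddings[a], embeddings[b])
            if score >= threshold:
                flagged.append((a, b, round(score, 3)))
    return flagged


for a, b, score in suspicious_pairs(EMBEDDINGS, EXPECTED_UNRELATED):
    print(f"check the source data behind {a!r} and {b!r}: similarity {score}")
```

Note that the toy data flags two pairs, not one, which mirrors the warning above: finding one bad association is a reason to look for others that share the same root cause.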
23:43
A few more fun ones here before I leave some time for questions, because questions are fun. Fine-tuning is something that we often misuse. Fine-tuning lets you take an algorithm and make it specific to a certain use case. For example, let's say that I have my airline model and I like it, but I want to make a couple of specific use cases out of it: one for cargo and one for first class, because those just tend to be very different from everything else. I want them to be tailored specifically to the crowds that will be interested in paying money for those services. To really do this correctly, though, you have to spend a little bit of time understanding the use cases, the business model, and what you're trying to do.
24:24
An important key is that, just like you can't RAG your way out of bad data, can't unlearn or untrain it, and can't prompt engineer your way out of it, you can't fine-tune your way out of bad data. If you have a data problem, fine-tuning is not going to fix it. The purpose of fine-tuning is to make a model more specific. That's all it's supposed to do: allow it to handle a subset of use cases that are important to you under certain circumstances. Fine-tuning for that purpose is easy, but it takes time and effort to do it correctly. It's not going to solve bad data.
24:57
If you try to solve bad data here, again, you're saying, "Hey, model, I'm giving you a purpose, but by the way, as part of that purpose, avoid this, and don't do this, and this should really be this." That's not a purpose. That isn't providing a use case. That's confusing. Whenever you're not really sure if something you're doing is correct, think about asking a human being to do these things. Would it make sense or not? If it's total nonsense, then it's probably nonsense to the algorithm as well. So fine-tuning is to create specific use cases; it is not meant to solve bad data.
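As a rough illustration of fine-tuning data that only describes a use case, the sketch below groups examples into one set per use case and sets incomplete examples aside instead of adding "avoid this" corrections. The record layout (`use_case`, `question`, `answer`) and the prompt/completion JSONL output are assumptions; the format your fine-tuning tool expects may differ.

```python
import json

USE_CASES = ("cargo", "first_class")  # the two use cases from the airline example


def build_finetune_sets(examples: list[dict], use_cases=USE_CASES):
    """Group complete examples by use case; each value is a list of JSONL lines."""
    sets = {name: [] for name in use_cases}
    skipped = []
    for example in examples:
        name = example.get("use_case")
        complete = bool(example.get("question")) and bool(example.get("answer"))
        if name in sets and complete:
            line = json.dumps(
                {"prompt": example["question"], "completion": example["answer"]}
            )
            sets[name].append(line)
        else:
            # Incomplete or out-of-scope examples go back to the data owner.
            skipped.append(example)
    return sets, skipped


examples = [
    {"use_case": "cargo", "question": "Can I ship a pallet?", "answer": "Yes, up to the posted weight limit."},
    {"use_case": "first_class", "question": "Is lounge access included?", "answer": "Yes."},
    {"use_case": "cargo", "question": "What is the rate to Tokyo?", "answer": ""},  # incomplete
]
sets, skipped = build_finetune_sets(examples)
print({name: len(lines) for name, lines in sets.items()}, "skipped:", len(skipped))
```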
25:35
data here's a new one that's been coming
25:35
data here's a new one that's been coming up more and more and this has been
25:37
up more and more and this has been
25:37
up more and more and this has been proven through recent studies that are
25:39
proven through recent studies that are
25:39
proven through recent studies that are really interesting um many models have
25:42
really interesting um many models have
25:42
really interesting um many models have way many apps have ways to unlearn data
25:45
way many apps have ways to unlearn data
25:45
way many apps have ways to unlearn data basically say all right you know what I
25:47
basically say all right you know what I
25:47
basically say all right you know what I have an algorithm is doing great but I
25:48
have an algorithm is doing great but I
25:48
have an algorithm is doing great but I want to forget certain things there's
25:51
want to forget certain things there's
25:51
want to forget certain things there's bad data with pii there's copyrighted
25:53
bad data with pii there's copyrighted
25:53
bad data with pii there's copyrighted material there's there's something I
25:54
material there's there's something I
25:54
material there's there's something I just don't want it to ever talk about so
25:57
just don't want it to ever talk about so
25:57
just don't want it to ever talk about so remove it and the reality of this right
26:00
remove it and the reality of this right
26:00
remove it and the reality of this right now is that all the unlearning methods
26:01
now is that all the unlearning methods
26:01
now is that all the unlearning methods that are out there right now are pretty
26:02
that are out there right now are pretty
26:02
that are out there right now are pretty rudimentary um they might kind of work
26:05
rudimentary um they might kind of work
26:05
rudimentary um they might kind of work but there's more of a chance they're
26:06
but there's more of a chance they're
26:06
but there's more of a chance they're gonna harm your model than help it it's
26:08
gonna harm your model than help it it's
26:08
gonna harm your model than help it it's very much like me saying one day like
26:10
very much like me saying one day like
26:10
very much like me saying one day like you know what last week was horrible I
26:12
you know what last week was horrible I
26:12
you know what last week was horrible I had a terrible week everything went
26:13
had a terrible week everything went
26:13
had a terrible week everything went wrong I want to go in my brain and pull
26:15
wrong I want to go in my brain and pull
26:15
wrong I want to go in my brain and pull out all the neurons from last week and
26:17
out all the neurons from last week and
26:17
out all the neurons from last week and make all those memories go away um can
26:19
make all those memories go away um can
26:19
make all those memories go away um can you try to do that I'm sure there's some
26:21
you try to do that I'm sure there's some
26:21
you try to do that I'm sure there's some way to try to do that in science right
26:24
way to try to do that in science right
26:24
way to try to do that in science right now I'm sure it'll have negative
26:26
now I'm sure it'll have negative
26:26
now I'm sure it'll have negative repercussions that I really really
26:27
repercussions that I really really
26:27
repercussions that I really really wouldn't like
26:28
wouldn't like so it's this is just technology that's
26:31
so it's this is just technology that's
26:31
so it's this is just technology that's getting there it's not there yet if you
26:34
getting there it's not there yet if you
26:34
getting there it's not there yet if you try to do unlearning be very cautious
26:37
try to do unlearning be very cautious
26:37
try to do unlearning be very cautious check save back up everything you're
26:39
check save back up everything you're
26:39
check save back up everything you're doing before you make anything permanent
26:41
doing before you make anything permanent
26:41
doing before you make anything permanent change to it and be cautious unlearning
26:44
change to it and be cautious unlearning
26:44
change to it and be cautious unlearning especially you unlearn more data will
26:46
especially you unlearn more data will
26:47
especially you unlearn more data will very often result in bad things
26:49
very often result in bad things
26:49
very often result in bad things happening if you need to get rid of data
26:51
happening if you need to get rid of data
26:51
happening if you need to get rid of data remove it from the data source if your
26:53
remove it from the data source if your
26:53
remove it from the data source if your rag data has stuff in it you don't want
26:55
rag data has stuff in it you don't want
26:55
rag data has stuff in it you don't want to bring anymore um remove the data
26:58
to bring anymore um remove the data
26:58
to bring anymore um remove the data present a new set of data and go from
27:01
present a new set of data and go from
27:01
present a new set of data and go from there that's far better than trying to
27:02
there that's far better than trying to
27:02
there that's far better than trying to tell it after the fact exclude data
27:04
tell it after the fact exclude data
27:04
tell it after the fact exclude data unlearn data make it go away for
27:06
unlearn data make it go away for
27:06
unlearn data make it go away for training data this is even more
27:08
training data this is even more
27:08
training data this is even more important uh you've built this model you
27:11
important uh you've built this model you
27:11
important uh you've built this model you built the way it behaves you've worked
27:13
built the way it behaves you've worked
27:13
built the way it behaves you've worked with it you like it you you've adjusted
27:15
with it you like it you you've adjusted
27:15
with it you like it you you've adjusted it now to go and begin ripping stuff out
27:17
it now to go and begin ripping stuff out
27:17
it now to go and begin ripping stuff out after the fact it may have unintended
27:20
after the fact it may have unintended
27:20
after the fact it may have unintended consequences so I highly
27:22
So I highly recommend caution about unlearning data, because you don't really know what the results are going to be like, and with current methods and current models a lot of harm can be done. So I really just urge caution here before doing it.
27:38
These will get better with time, I'm sure they will, but there's always a danger in telling a model to forget stuff, because it may forget more than you want it to, or it may forget less, or it may forget who knows what. You may try to get rid of all the PII and accidentally get rid of additional information you need, or you may not get rid of all of it, and then you'll still give it to people inadvertently, but confidently, after the fact.
28:04
So I want to wrap up. I have a few minutes for questions at the end. This presentation really had a lot of information in it that was pre-existing, that existed long before AI was talked about in public. It's been around for years. And it had a lot of newer things as well. And it's all the same; it all relates. We've been using data for analysis for a long time. It's not going to change. All the new algorithms out there are really just an extension of things we've already had, and this will keep happening and keep going on.
28:32
So just keep in mind: the best place to solve bad data is at the source. Solve it early; cut it off before it can go downstream. Messing with your AI models can do great things for them, but it won't be a substitute for good data. It can't solve bad data. It can't make your data better. You can try, but you're far better off solving it earlier in your processes.
28:56
Once you're into a model, once you're testing it, once you're working with it, once you're QA'ing it, once you're letting people use it for real, keep testing carefully and check the responses. Whenever bad responses happen, find the source. Was it bad training? Was it bad RAG data? Was it some changes you made that backfired? Figure out what the source is and resolve the source. This is the best way of solving problems in your models; that really will make things better.
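One way to make "find the source" routine is to record a root cause for every reviewed bad response and then count them. The Python sketch below assumes a hypothetical incident log with a `cause` field; the three labels simply mirror the three questions above and are not an established taxonomy.

```python
from collections import Counter

# Hypothetical root-cause labels, one per question asked in the talk.
CAUSES = {"training_data", "rag_data", "model_change"}


def tally_root_causes(incidents):
    """Count reviewed bad responses by the source they were traced to."""
    counts = Counter()
    for incident in incidents:
        cause = incident["cause"]
        if cause not in CAUSES:
            raise ValueError(f"unknown cause: {cause}")
        counts[cause] += 1
    return counts


# Illustrative incident log; in practice this would come from review notes.
incidents = [
    {"id": 1, "cause": "rag_data"},
    {"id": 2, "cause": "rag_data"},
    {"id": 3, "cause": "model_change"},
]

counts = tally_root_causes(incidents)
print(counts.most_common(1))
```

The most common cause in the tally points to where a fix at the source is likely to pay off first.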
29:23
So I want to stop here. I'm going to provide a little bit of information about me. This will be in the slides that get shared later; I'll share them afterwards so you have them. Are there any questions?