0:00
Hi, good morning, good afternoon and good evening everyone, depending on where you are
0:29
you join us from today. Maybe you could share with us in the chat where you are based right now
0:37
And with that, I want to say hi to our beautiful speaker, Sine. Hi, and welcome back
0:46
It is really nice to see you again. And I also wanted to say hi to my awesome team member or
0:54
my best friend, maybe, Håkon. And I think it's a good time to talk about how do you feel after the last session, Sine
1:05
Yeah, yeah, I'm feeling great. I mean, the problem is there is so much I want to tell you about R, but we have to not
1:15
overwhelm too much. But I'm okay. Still no corona, but I'm ready for
1:23
my jab here. Just go with it. We can also say for the viewers
1:31
that if you missed out on the last session you can see the link in
1:35
the description here, so you can have a look at it at a later time. Yeah. Yes, and
1:44
sorry, yeah, because this one is building a bit upon the last one, I must say
1:49
Yeah. Yes. And it would also be nice to know how many of you guys
1:57
I'm asking now the audience, how many of you have been with us
2:02
at the last session when Sine gave an introduction to R. Maybe you can put in the chat
2:09
whether you managed to see us two weeks ago. And I think with that
2:15
we will have a chat about what we talked about last time as well, right, Sine
2:20
Maybe during the session or yes. And I think with that we can get started
2:28
So maybe we give a bit of an introduction to what AI42 is, and then we give the stage to Sine
2:36
See you soon back. Hi, welcome back everyone. Hi Håkan. Hello. So I hope you are doing great and you're ready to share
2:58
some insights about the motivation of AI42. Yes. So the reason why we started AI42 is because we
3:09
would like to give everyone a chance to get into this interesting field. So what we're doing is
3:14
we're inviting industry experts and recognized speakers. So they will have sessions that we will be streaming two times a month
3:21
on Wednesday at five o'clock Central European time. And what we've done is we've started out with mathematics
3:28
and statistics and probability theory. And then after that, now we're moving into languages
3:34
So we've had sessions on Python, and now we have R
3:39
and then we'll end with SQL. And then furthermore, we will also talk about tools like Databricks and Power BI
3:46
We will also look into how can you actually set up your own machine learning pipeline
3:51
And finally, we will go through some more advanced topics like reinforcement learning and explainable AI
3:57
And in addition to these more theoretical subjects, we will also have practical workshops
4:03
so that you can put this theory into practice. Yes, and the idea is really that we would like to connect you with the best in class
4:14
experts from all around the world. So you will have the chance to extend your network towards the network of AI and data
4:23
science communities. And we would also like to share with you that you can follow us on Instagram, Twitter and Facebook
4:33
If you want to get started from the beginning or see our previous lectures, you can go to our YouTube page by scanning this QR code
4:42
And if you want to see our upcoming events, you can just go and scan the QR code for the meetup page of AI42
4:53
Yes. And we are also associated with the global AI community, so you can see our sessions on that YouTube channel as well
5:05
and also on C Sharp Corner and on our own YouTube channel. I would like to thank our sponsors, which are Microsoft and Miles
5:15
And we are also supported by C Sharp Corner, so you can actually see our videos at their live TV show as well
5:24
And hereby, we want to say a big, big, big thank you to Marina Marie
5:28
for composing and performing our music that we use in our streams
5:33
And for the graphic designer, Levente Pongor, who built all the graphics that you can see throughout our streams
5:41
And with that, we want to say thank you, everyone, who supported us in any other ways
5:48
Yes, so should we get back to Sine? Yes, I think so
5:58
Hi, welcome back Sine. Thank you. So do you think you're ready for the session
6:12
Yeah, definitely. And could you give us a recap before we get started on where are we going to continue from
6:19
Yeah, yeah, sure. I think it's my first slide actually. So, yeah, I have been teaching a lot the last month here
6:32
So I started as a postdoc at the Copenhagen Business School. So I'm kind of used to the routine
6:40
And of course, I have a recap in the beginning of my session. So, yeah
6:46
All right. Okay. And I think we just give the stage over to you, Sine
6:51
Okay. Thanks a lot. Yeah, so thanks a lot
7:07
Well, last time I started out by introducing myself
7:12
I'll not do that very much here. So I work at Copenhagen Business School and I have written the book
7:18
or a woman's code book or whatever. And I have a PhD within the field of AI and bioinformatics, right. So just to recap from last time
7:35
we should have a short chat on that. And that's really fine because I also built upon
7:40
what I did last time, and my code will also be a bit of a development of last time's code
7:49
So those of you who were here last time will get something out of it
7:54
So we talked about how R is great for, for example, visualizations and for statistics and so on
8:00
It's really developed for that. And it can do a lot of machine learning, too, if you want
8:08
And quite often it's a bit slower than, for example, Python and other tools
8:17
But people are working on R to bring it up to speed
8:22
So last time we talked, of course, a bit about that
8:26
but also a bit about some of the basics: how to calculate things in R
8:31
and what a data frame is, how to read in files, and a little bit on functions as well
8:37
So the plan for today is that we will look a bit more
8:46
into data frames and wrangling. I'll also introduce modeling and visualization in R
8:56
where we will try to build some histograms and bar plots and so on
9:02
which is you could say some of the things that R is really good for
9:06
And I'll also briefly introduce the pipe. That's particularly because I know the one
9:12
who is going to talk about machine learning later on in AI42
9:22
He'll love the pipe. Or he loves the pipe. So I'll have to tell a bit about that
9:30
And you might wonder what is the pipe. But yeah, I'll tell you. So first of all, we are going to look a bit on
9:38
oh, sorry, data frames and wrangling. And I'm not going to do the presentation view here
9:48
So you have to look at this screen because then it's much easier for me
9:52
to shift back and forth from this to the RStudio that I'm using
9:58
And you can already now maybe start up your RStudio because then I think there's time
10:04
we should try to do the code together, and then you can fill out some of the code
10:12
So here we have different, last time we talked about different ways
10:15
to inspect our data frame with View, structure (str) and summary. We have the accessing of data frames
10:24
because there are more ways you can kind of access them. You can access them through this square
10:32
full square bracket format, or you can instead take a vector
10:39
which is a column from the data frame, and then you can kind of call the rows
10:44
from the square bracket, for example. There might be other notations that I'm not aware of
10:50
but that's the way it is. There are many ways to do it. And it's also an open source language
10:57
so there are also many packages that have been written for it
11:01
and where people have kind of maybe made functionalities in a bit different way
11:10
All right. So we'll look at this again in a minute, but I think I should just start with some of the basic wranglings
11:24
introduce some basic wranglings before. So first for wrangling and visualizations, we need some more packages
11:36
We need tidyverse and we need ggplot2. So I think we should go to RStudio
11:43
And I have, this is a lot of, this is the same code as last time
11:50
We are reading in data frames. We are looking at them. And then we have also talked about reading in libraries, right
12:05
So the new library you need is tidyverse
12:16
So you can wrangle data without tidyverse, but tidyverse is just a really nice library
12:23
So you can run this. If you don't have it already, you can use this
12:33
You can click here on Packages and then Install and then write tidyverse
12:40
Then you'll get it from there. And it's a huge package with a lot of functionalities
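For viewers who prefer typing over clicking, the same step can be sketched in the console. This is a minimal sketch, assuming a working internet connection and a default CRAN mirror; the `requireNamespace` guard is an addition of this write-up, not something shown in the session.

```r
# Install tidyverse only if it is missing, then attach it for this session.
# (The guard is an assumption of this sketch, not part of the talk.)
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Attaching tidyverse also attaches its core packages,
# including dplyr (wrangling) and ggplot2 (plotting).
library(tidyverse)
```

Because ggplot2 is one of the core tidyverse packages, a separate `library(ggplot2)` call is only needed when tidyverse itself is not loaded.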
12:46
And there are many people these days that are working within the tidyverse universe
12:58
So that's a good package. And there would be many references to it
13:04
And one of the great books on R, data science in R, I think it's called by Hadley Wickham, is also using this tidyverse functionality, right
13:20
All right. So that was for the first part here, the tidyverse
13:27
Then there are some basic wranglings I'll just introduce here. We have left join, which
13:36
some of you have already been at SQL. You had a bit of SQL, right
13:43
And there you probably learned the left join. If you have been working with Excel
13:48
you might have used VLOOKUP, which is a bit the same. Or yeah, it can be the same at least
13:58
so there are some ways where you can for example count where you count instances in a data frame
14:06
We can mutate, which creates a new column, and group by is a great functionality
14:15
We can make grouped summaries, so you can have something
14:23
a bit like a pivot table if you're working with Excel. You can kind of cluster your mean and your median and so on with regard to certain groups, which are also variables in your data frame
14:38
All right. But let's go to the code. So the first thing is, last time we talked a bit about what is actually the difference between a matrix and a data frame in R
14:51
And I think here I will try to illustrate it. So here is the code that I had last time. And I think it's maybe a bit small for you to see here. So I'll try to zoom in a bit. Right, so now it's probably easier for you to see this
15:13
So here we have created a matrix that we read into a data frame
15:21
And we try to look at what is the structure here, as we can see
15:30
We can also look at it. So here you can see that the columns are called X1, X2, X3
15:35
and we have these values within it. But you can also see when we look at the structure that
15:41
all the values here are now characters. And that's the thing about matrices in R
15:47
that they need to have the same type, which was also something we talked about last time, right
15:56
So they are all the character type, although they are numbers, and although this row is actually only numbers
16:04
So it's a bit weird. So if you read it in as a data frame
16:08
there is a bit more work in it, you could say, because then you need to take every column here as a vector
16:17
which is a small c here. That's something you do when you're having a vector, right
16:24
Then you can read that into a data frame. Then we'll rename the columns
16:32
so that they have the same name as the matrix just before
16:37
Now we look at the structure here, and we can see if we just compare
16:41
that now we haven't said anything about what type it is, right
16:49
But we can see here that X1 is numeric, X2 is numeric
16:53
and X3 is a character type, right? So that's more what we want
16:59
And that's a property about data frames that you can have, the columns can have different data types and so on
17:06
A matrix, by contrast, is more of a whole. So I hope that helps
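The comparison just described can be reproduced with a small sketch. The values below are hypothetical, since the numbers on screen are not in the transcript; only the column names X1, X2, X3 and the resulting types come from the session.

```r
# Hypothetical values; the numbers used in the session are not in the transcript.
# A matrix holds a single type, so mixing numbers and text makes everything character.
m <- matrix(c(1, 2, "a",
              3, 4, "b"),
            ncol = 3, byrow = TRUE)
df_from_matrix <- data.frame(m, stringsAsFactors = FALSE)
str(df_from_matrix)   # X1, X2 and X3 are all of type chr

# Building the data frame column by column, each column as a vector made with c(),
# lets every column keep its own type.
df <- data.frame(c(1, 3), c(2, 4), c("a", "b"), stringsAsFactors = FALSE)
names(df) <- c("X1", "X2", "X3")   # same names as the matrix version
str(df)               # X1 and X2 are num, X3 is chr
```

`stringsAsFactors = FALSE` is spelled out so the sketch behaves the same on R versions before 4.0, where text columns would otherwise become factors.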
17:15
And I don't know if the person who asked about data frames versus matrices is there
17:20
but here is the point. Also, we'll just look at how we can access the same element here in a data frame
17:29
because there are more ways. You can either do it where you are using indexes in a square bracket here
17:42
where row and columns, you have the rows and columns here. You can also say, take the first column as a vector
17:51
and then take the second element in that vector, and you get the same value
17:56
Or you can use much the same notation as up here
18:02
but instead of using column, this column, you say the name of the column, which is, as you remember, X1
18:12
You can also see it here. It's X1, yes. And then you get the same value out of it
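The access styles mentioned here can be put side by side. This is a sketch on a hypothetical two-row data frame; the exact notation typed in the session is not visible in the transcript, so the four variants below are common base R forms rather than a record of what was on screen.

```r
# Hypothetical values, with the column names used in the session.
df <- data.frame(X1 = c(1, 3), X2 = c(2, 4), X3 = c("a", "b"),
                 stringsAsFactors = FALSE)

by_index  <- df[2, 1]      # row 2, column 1, both given as indexes
by_vector <- df[, 1][2]    # first column as a vector, then its second element
by_name   <- df$X1[2]      # column picked by name, then its second element
by_mixed  <- df[2, "X1"]   # row index combined with the column name

# All four expressions return the same value.
c(by_index, by_vector, by_name, by_mixed)
```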
18:19
Perfect. So we need to read in some more data frames such that we can have some more fun here today
18:27
because this data frame is really boring. So to do that, we use the library here
18:37
And then now I can read in these two files. Probably when you have, if you are following this
18:46
you might not be able to read them in like this. You have to actually import the data sets
18:54
as I showed last time to find the right path. But we'll need these two data sets
19:06
We can inspect them by using the view. Then we have it open here
19:13
and you can even kind of filter and do stuff with them
19:17
So there are a lot of options here, sorry. We can also inspect them here in the console, so to say
19:29
So we can see down here how many rows are there in the file. What is the structure
19:34
Like we looked at the data frame and the matrix before. So here we can see that there is, for example, number of cases that's numeric
19:46
all these countries and disease names. They are characters. And then year is, here is integer
20:00
Right. We can also write, have a summary here where we can get more information from this
20:08
And we can, there are a lot of more functionalities and I'll go through them
20:13
And sometimes, if you have a problem finding out what function you should use in R
20:18
you can kind of Google 'how do I get the length of a vector in R'
20:22
and then you'll get the results. All right. But we're going to wrangle, and it's okay
20:30
Now I just read in the library again, but you can see it since it's already there
20:36
nothing happens this time when I run it. So here I have, I don't know if it's necessary, but I'll just ensure sometimes it's read in as a factor, so I'll be sure that it's a character
20:59
It shouldn't be necessary, because now I'm going to show the first function here, left join
21:07
And what we're going to join is our BOO, which is bag of oranges
21:18
So this is five bag of oranges. This is the bag number
21:24
This is the weight. This is the price of the bag. So there is a spelling error, but we shouldn't focus more on that
21:35
We have the origin of our oranges, and then we have whether it's organic or conventional
21:44
So that's what we have in this data frame. Over here we have diseases where we have country, country code, region, and so on
21:53
Right. So what we want to join now is that we would like to have regions in this table as well
22:02
So we join, like we glue region from our other data file where we have the map between country and region with regard to diseases
22:14
And the thing is that with this part of the statement we are creating a smaller data frame from the WHO data here, where we have only one...
22:37
We have the mapping between country and region. Right. So we are now
22:43
So this is creating a data frame with only country and region
22:48
and by using unique, we don't have redundancy in it. So we don't have the same country and region several times
23:00
like we have when we look at the full data frame. If it wasn't so slow, I don't know what's going on here
23:07
Thanks. For example, here, we have the same country a lot of times, right
23:13
And we only want this region and country mapping once. So, and then in the end here, we have, we want, since it's not called the same thing
23:29
we want to be sure that it understands that it should map origin and country
23:33
Sometimes it knows how to do it anyway, but it's nice to be able to control it
23:39
So therefore, we have used this as part of the function, of the left join function
23:45
So this statement is a kind of argument to the function, right
23:54
So we try to run it, see if it works. It does. Great
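The join described above can be sketched with dplyr, the tidyverse package that provides `left_join`. The two data frames below are small hypothetical stand-ins for the bag-of-oranges and WHO files, and the region codes are invented for illustration; only the column roles (origin, country, region) come from the session.

```r
library(dplyr)

# Hypothetical stand-ins for the two data sets used in the session.
boo <- data.frame(bag    = 1:4,
                  origin = c("Spain", "Spain", "Brazil", "California"),
                  stringsAsFactors = FALSE)
who <- data.frame(country = c("Spain", "Spain", "Brazil", "Brazil"),
                  region  = c("EUR", "EUR", "AMR", "AMR"),
                  year    = c(2015, 2016, 2015, 2016),
                  stringsAsFactors = FALSE)

# Keep one row per country/region pair so the join cannot duplicate bags.
country_region <- unique(who[, c("country", "region")])

# The key is named differently in the two tables, so the mapping is spelled out.
boo_region <- left_join(boo, country_region, by = c("origin" = "country"))
boo_region
```

As in the session, the number of rows is unchanged and an origin with no matching country (California here) gets `NA` as its region.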
24:00
And now we want to look at this again. And here we actually have region now
24:06
So we don't have a region on California, and that's because California is actually not a country
24:12
and it's also called origin and not country. So that's just because when you buy oranges, usually it doesn't say oranges from the United States, but it says oranges from California
24:27
So I'm not going to do anything about this now, but that's just a funny fact
24:33
But as you can see here, we have regions added on and we still have 30 rows
24:39
So you can also see some information on the different data frames out here in your global environment, right
24:45
We still have 30 rows. So we have basically just glued information on our old bag of oranges table
24:57
So boo is not boo, but it's bag of oranges, right? So that's one way to do it
25:07
So now we have added it to the new data frame. Another way, and here is the pipe that I'll just introduce it a few times
25:15
So instead of putting it into a new, if we actually just want to look at it
25:25
and not necessarily put it into a new data frame or something, we can just do like this, where we first do the left join
25:32
and then view the results here. So that's what this pipe can be used for, for example
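A sketch of the same idea with the pipe: the result of the join is handed straight to the next function instead of being stored first. The data is again a hypothetical stand-in, and the `interactive()` guard is an addition so the snippet also runs outside RStudio, where `View` has no window to open.

```r
library(dplyr)

# Hypothetical stand-ins for the session's data.
boo <- data.frame(bag = 1:3,
                  origin = c("Spain", "Brazil", "California"),
                  stringsAsFactors = FALSE)
country_region <- data.frame(country = c("Spain", "Brazil"),
                             region  = c("EUR", "AMR"),
                             stringsAsFactors = FALSE)

# The pipe passes the left-hand result in as the first argument of the next call.
n_rows <- boo %>%
  left_join(country_region, by = c("origin" = "country")) %>%
  nrow()

# In an interactive session the same chain can end in View() instead.
if (interactive()) {
  boo %>%
    left_join(country_region, by = c("origin" = "country")) %>%
    View()
}
```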
25:49
Great. So that was our data merging example here. And there are, of course, a lot of other things you can do
26:00
If you are more into SQL, there is also right join and full join
26:04
and stuff like that. And there is also a function called merge
26:08
I think it comes more from maybe Python. So it also depends on which language you're used to
26:18
before working with R. That can also influence a bit what kind of functions
26:25
you like to work with. Then there is another function here called count
26:34
which I love because that's also you can see here for example
26:38
your bag of oranges you have 30 observations, 6 variables that's really nice but maybe
26:44
we want to know how many instances from each origin which we know is a country
26:50
so we can count it using the count function here. We can also
26:56
make a count of something else. Let's see, maybe we should look at the boo here to see if we could count with
27:11
regard to food label, maybe. Oh, sorry
27:24
And here we see that there are more conventional bags of oranges than organic bags of oranges
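The two counts can be sketched like this with dplyr's `count`. The rows and the column name `label` are hypothetical; the session only tells us that one column holds the origin and another says organic or conventional.

```r
library(dplyr)

# Hypothetical bag-of-oranges data; `label` is an assumed column name.
boo <- data.frame(
  origin = c("Spain", "Spain", "Brazil", "California", "Brazil", "Spain"),
  label  = c("conventional", "organic", "conventional",
             "conventional", "organic", "conventional"),
  stringsAsFactors = FALSE
)

by_origin <- count(boo, origin)  # one row per origin, with the count in column n
by_label  <- count(boo, label)   # one row per label, with the count in column n
by_origin
by_label
```

The result of `count` is itself a data frame, so it can be stored under a new name and used like any other table.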
27:30
So that's also nice. And sometimes if you have a transaction table with one line per something
27:44
maybe what you need is you might need to kind of cluster it with regard to something
27:52
For example, the country. And then you'll have a new table here
27:59
you could call it country where you put in your count
28:08
now you have this table with your different countries and how many
28:13
bags of oranges there are from each country so it's of course a toy example so
28:20
it's a very silly toy example, but I like it. So we also have another function here called mutate
28:35
And what it does, it creates a new column. And what we do here, we actually make a new column based on an old column
28:53
So this mutate, what it does here, it takes the data, boo, then it creates this new column
29:04
where the name will be new price and the values will be old price plus three
29:16
Let's see what happens. Here is our new data frame. And we can see here we have our price and we have our new price, and it looks very much like it did add three to the old price
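A minimal sketch of that `mutate` call, assuming the columns are named `price` and `new_price` (the transcript notes the real price column has a spelling error, which is not reproduced here) and using invented prices.

```r
library(dplyr)

# Hypothetical prices; column names are assumptions of this sketch.
boo <- data.frame(bag = 1:3, price = c(10, 12.5, 8))

# mutate() adds a column computed from an existing one and keeps the rest.
boo_new <- mutate(boo, new_price = price + 3)
boo_new
```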
29:30
So that was nice. I'll just remove some of this so it's not so messy
29:40
Right. So that was, particularly, this count and this mutate. They are short
29:50
and relatively reviewable. I'll now show you a function which is very
29:57
useful but also a bit more complex. So here we are using something called group by and summarize. So what we
30:13
do here is we start to group this data frame by origin. And then from this, if we just wanted to
30:24
have the count, we could use a count function. But now we want to have more than just count
30:29
We also want mean of the old price and we want the mean of the new price
30:35
And to do so, we can use the group by and then the summarize
30:40
And here we get the mean, which is here. It's not a median, but the arithmetic mean, the average per country of the new and the old prices
30:54
And as you can see, the new one is three higher than the old one
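The grouped summary can be sketched as follows. The data and the column names `price` and `new_price` are hypothetical; the structure (group by origin, then a count and two means) follows what is described in the session.

```r
library(dplyr)

# Hypothetical data; new_price is price plus three, as in the mutate step.
boo <- data.frame(origin = c("Spain", "Spain", "Brazil", "Brazil"),
                  price  = c(10, 12, 8, 9),
                  stringsAsFactors = FALSE) %>%
  mutate(new_price = price + 3)

summary_by_origin <- boo %>%
  group_by(origin) %>%                        # one group per origin
  summarize(n_bags         = n(),             # number of bags in the group
            mean_price     = mean(price),     # arithmetic mean of the old price
            mean_new_price = mean(new_price)) # arithmetic mean of the new price
summary_by_origin
```

Since every new price is the old price plus three, each group's mean new price is also exactly three above its mean old price.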
30:59
And you can also, so here I used the pipe because now I just want to show you
31:04
but we could also do it in another way. Let's see, we could say, for example, A
31:10
start to group the group by A, and then we say B
31:29
summarize and then we have to say the data, what we want to summarize
31:36
So I'll just call it B. So this is another way to write it
31:48
Now we have our new data frame with our two means just like before
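The version without the pipe, with the intermediate results stored as `a` and `b`, might look like the sketch below. The data is hypothetical, and the piped version is included only to check that both routes give the same table.

```r
library(dplyr)

# Hypothetical data with assumed column names.
boo <- data.frame(origin    = c("Spain", "Spain", "Brazil", "Brazil"),
                  price     = c(10, 12, 8, 9),
                  new_price = c(13, 15, 11, 12),
                  stringsAsFactors = FALSE)

# Step by step: first the grouping, then the summary of the grouped data.
a <- group_by(boo, origin)
b <- summarize(a,
               mean_price     = mean(price),
               mean_new_price = mean(new_price))

# The same computation written as one piped chain.
piped <- boo %>%
  group_by(origin) %>%
  summarize(mean_price     = mean(price),
            mean_new_price = mean(new_price))
```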
31:56
So that's how you can do it. And of course, from the pipe, you can also save it
32:09
Then I'd like to... Do you have any questions? Maybe I should ask if you have any questions to this wrangling
32:26
The reason for doing this wrangling is that quite often when you want to do some fun modeling or other kind of stuff
32:33
you might need to recalculate things a bit. Maybe you need a mean, for example, or you need a median or you need a standard deviation and so on
32:43
And you maybe need to cluster it on something. And then it's really nice to be able to wrangle and clean data
32:52
And so you use a lot of these functions. Yes? Hi, Sine. Yes, we actually have a question: is this group by function the same as it is in SQL
33:06
as well? So I'm not really trained in SQL. So actually, I don't know. Maybe if there are
33:16
any of you who are good at both R and SQL might know if it's the same. I know a lot
33:22
of these functions from Tidyverse are inspired by SQL. So probably yes
33:28
Yeah, I would think and expect that the functionality itself is probably similar at least
33:40
But I can see from this that I can already see that the syntax is totally different
33:47
And while it's not actually totally different, but like the syntax is maybe a bit different
33:52
but I think the functionality should be the same. Yeah, yeah. Yeah, I think left join syntax is also probably, sorry, here, left join syntax is probably also different from SQL
34:06
And you can see there are even different syntaxes within R. But, yeah, they're probably similar
34:16
Yeah, I think they are. So the functionality itself should be the same
34:21
possibly the way of using it is different, but yes. Yeah, and particularly in R, I mean, it has been developed a lot
34:30
And then there are people who come from the SQL world, from the Python world, from other worlds
34:35
they come into R and then they make packages and environments that kind of suit their way of thinking
34:43
And you kind of think differently when you are working with databases
34:47
than when you work with, for example, procedural languages like Python. So you have to squeeze your brain every time you work with a new language
34:56
And R, for example, usually you call it a function-based language, which is that instead of writing a lot of loops and so on
35:10
you can actually write functions, and that will be much faster. So you can do a lot of operations on one column or one row, for example
35:20
that's quite fast and easy because you have your data up here in the memory
35:30
Yeah. All right. Thank you. So I'll go on to, I think
35:38
I'm not stressed yet, but it takes a while, of course. And we are going to do a bit of plotting
35:45
And for this, I'm expanding this window because as you might figure out in a while we'll look at plots and we need to
35:55
use this library, ggplot2, which you read in. If you don't have it, you install it
36:01
like this, right, ggplot2. I'm not going to install it now, but you can do that, and after you have installed
36:13
it you can just write like this, library(ggplot2), and you have read it in. So
36:18
I actually have some slides for this. So I go back. There is some basic ggplot grammar, as we usually call it
36:28
I'll just say ggplot2, because you almost forget that the package is actually called ggplot2
36:35
but the function is called ggplot. So that can be confusing. Right
36:43
So we have this function ggplot that starts a plot
36:49
And then usually in the ggplot, you put in your data frame name here
36:54
Oops, sorry. So you have your data in here. Then you have some aesthetics
37:00
where the aesthetics, as I say down here, are where you're mapping the variables
37:04
So you say x is equal to var1, y is equal to var2
37:10
So if you have two variables, it will understand this is x and this is y
37:15
But you can also have more, and then you maybe need to say: this is x, this is y, this is the fill
37:21
And I'll come back to that in a minute. So there are more aesthetics. Then there is something which we call geoms, which are those that actually create the chart types
37:37
So geom_point, geom_col, geom_histogram, and so on. So these create the chart types
37:48
and there are also other functionalities, of course, and you can add a lot of layers to one plot
37:57
So one plot code can be really long because you want to have both
38:03
maybe if you have three or four variables in your plot, maybe that's not so good practice
38:09
but sometimes you maybe want to put in more variables. Then you might need to visualize this in different ways
38:20
you maybe want to add some text in the plot or some labels and so on for the x-axis and the y-axis and so on
38:29
So there is a lot of things you can do. And I cannot show you all today
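The grammar just outlined (data, aesthetics, a geom, then further layers such as labels) can be sketched in a few lines. The built-in `mtcars` data set is used here purely as a stand-in, since the slide's own example is not in the transcript.

```r
library(ggplot2)

# mtcars ships with R and is only a stand-in for the session's data.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +   # data plus the aesthetic mapping
  geom_point() +                              # the geom decides the chart type
  labs(title = "Fuel economy against weight", # an extra layer with title
       x = "Weight (1000 lbs)",               # and axis labels
       y = "Miles per gallon")

print(p)  # draws the plot
```

Each `+` adds one more layer, which is why the code for a single plot can grow quite long.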
38:34
So I have cheated and made a link here for the cheat sheet of ggplot2
38:45
You can also find it just by Googling cheat sheet ggplot2, and you'll get something like that
38:51
I can, sorry, we'll just write here, cheat sheet ggplot
39:06
Right, and then we get here. Now we have to find it
39:11
No, it's a direct link, great. So, and here you can see there is a lot of
39:16
what we just talked about, geoms, here. You can see all the geoms, and here you have some kind of guide: for example, if both X and Y are continuous, you can do like this
39:32
You can use these functions. If one is discrete and the other one is continuous, for example, country and number of diseases, then you might need some of these plots, chart types
39:48
If you only have one variable, you might use a geom_bar or a geom_histogram, as you can see here
40:02
So that's one of the main things. And then, as you can see here, you can add a lot of layers
40:08
So you can have both a geom_point and a geom_smooth, which gives a smoothed line with a kind of confidence interval
40:16
and there is, you can also choose different colors, color scales, different themes
40:30
So there are a lot of options here. You can see there are even more
40:37
So the cheat sheet is not really one page, but two pages. Yeah, with a lot of extra information here
40:46
And if you are more the reading type, you can, Hadley Wickham has made
40:53
he actually has made a book on ggplot2 that is open. You just Google Wickham and ggplot2
41:02
but also his book on data science in R has some basic ggplot functionalities
41:12
Great. So I'll just show you here. we can start. Now I'll start out as if I haven't written the whole plot. So here we have our
41:25
WHO disease frame here. We have the aesthetics, we have region and cases. So you see that it's
41:39
region that's over here and that's you could say a discrete variable and cases that's a continuous
41:50
variable. So if you try to run this, as I've just done, then you just get the coordinate system
42:03
With respect to how many, you can see it has kind of made it ready for something. Here are the regions
42:11
here are the cases, and the scale of this is kind of
42:15
the size of number of cases. So you can add a new layer
42:23
And we could say that our new layer should be that we want to make some columns
42:30
and what we want to fill the colors with is disease. So let's see what happens now
42:42
And again, this aesthetic. So it's not an X or a Y, but it's a fill
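The two steps shown here, first the empty coordinate system and then the column layer filled by disease, can be sketched like this. The data frame is a tiny hypothetical stand-in for the WHO file, with invented case numbers and assumed column names `region`, `disease` and `cases`.

```r
library(ggplot2)

# Hypothetical stand-in for the WHO disease data.
who <- data.frame(
  region  = c("AFR", "AFR", "EUR", "EUR"),
  disease = c("measles", "pertussis", "measles", "rubella"),
  cases   = c(500, 200, 50, 300)
)

# Step 1: data and aesthetics only. This draws the axes but no chart yet.
base_plot <- ggplot(who, aes(x = region, y = cases))

# Step 2: add a column layer and map the fill colour to disease.
col_plot <- base_plot + geom_col(aes(fill = disease))

print(col_plot)
```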
42:48
There are other aesthetics as well. So here you can say, so it's actually a nice picture here
42:55
You can see different regions. for example you can see in Africa
43:00
that a lot of actually measles seems to be a huge thing there
43:03
maybe there are not so many vaccinated in Europe rubella is actually
43:10
a big thing which is not so much present in the other
43:16
parts of the world. Then there are some of these diseases, I don't know what pertussis is
43:23
It's also big in many countries, in many regions and so on
43:32
So now we get a larger overview. So what we want to know here, we know that this is all for maybe 30 years
43:44
So maybe we want to actually look at one graph a year
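One chart per year is typically done with a facet layer. The sketch below uses `facet_wrap` on a hypothetical stand-in for the WHO data; the column name `year` and all numbers are assumptions.

```r
library(ggplot2)

# Hypothetical stand-in for the WHO disease data, now with a year column.
who <- data.frame(
  region  = rep(c("AFR", "EUR"), times = 4),
  disease = rep(c("measles", "rubella"), each = 2, times = 2),
  year    = rep(c(2015, 2016), each = 4),
  cases   = c(500, 50, 20, 300, 450, 40, 25, 280)
)

yearly_plot <- ggplot(who, aes(x = region, y = cases)) +
  geom_col(aes(fill = disease)) +
  facet_wrap(~ year)   # one small panel per year

print(yearly_plot)
```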
43:49
Sorry, one chart a year. It takes a while. And now we have something that doesn't look really nice here. We can do a lot
44:00
of more to make it prettier and maybe turn the region names and so on. But just as a general
44:12
thing here, we can see it's interesting that, you know, for example, Rubella, as we could see
44:18
It was a big thing in Europe, but it actually took a while before it grew
44:24
And then it was actually just between 2006 and 2010 that it was a big thing
44:32
And you can also see mumps suddenly. It seems to be, I cannot see what this country is, right
44:39
But you can see that from having no mumps for a while, then suddenly it became a thing
44:45
and still is in 2016. Right, so that's a ggplot where we have a geom_col layer and a wrap layer. And of course we can also put in a layer where we rename
45:05
if we want to rename our X and Y axis here, or maybe we want to have a title and so on
45:12
we can also add a layer with that. I'll just show you something which looks slightly similar here
45:20
but still a bit different. So this is in a sense the same thing, but here we have instead, you can see we have the aesthetic fill equals disease
45:38
but we also have an argument attached to the geom_col, which is position fill, which makes stacked bars here
45:51
So, the rest of the code is the same, but we have just decided that we want stacked bars
45:58
Now, the y-axis is no longer number of cases, but relative cases
46:04
So, we can more see what's a relative amount of measles and mumps and so on for each region in each year
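A sketch of the relative version: the only change from an ordinary column chart is `position = "fill"`, which stretches every bar to the same height so the y-axis shows shares instead of counts. Data and column names are hypothetical stand-ins.

```r
library(ggplot2)

# Hypothetical stand-in for the WHO disease data.
who <- data.frame(
  region  = rep(c("AFR", "EUR"), times = 4),
  disease = rep(c("measles", "rubella"), each = 2, times = 2),
  year    = rep(c(2015, 2016), each = 4),
  cases   = c(500, 50, 20, 300, 450, 40, 25, 280)
)

share_plot <- ggplot(who, aes(x = region, y = cases)) +
  geom_col(aes(fill = disease), position = "fill") +  # bars scaled to sum to 1
  facet_wrap(~ year)

print(share_plot)
```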
46:13
Right. So that's, but it becomes, but suddenly in this, in this chart, we actually see we have many features
46:24
Suddenly we have disease feature. We have the amount of diseases too
46:31
We have like the cases, right? We have the region and we have the years
46:37
So suddenly we actually have quite a lot of features to abstract from
46:45
And really, you should be careful not to put too many different features into a plot
46:55
Sometimes, I mean, people can succeed in making a good plot with a chart, sorry, with a lot of features
47:02
But it's not easy. All right. And I'll just... show a last type here where we actually have two geoms
47:12
So maybe I should have started here. Just start again. Pretend we didn't see the other one
47:21
So here we have all the... We have it all in points here
47:32
but it's just, instead of, you know, it's difficult to watch if there are many points in each, or many dots in one point, rather
47:43
Sorry. And so we can kind of jitter it here where we have, we can spread it out
47:51
So this is not so good because here we actually have both our points
47:55
in the middle here and then our jitter. So I think what we should do here is actually we try
48:02
this is something I just tried out I hope it works see maybe this one can we see yeah that was good it worked
48:13
can't always be sure right so here we have instead of having points
48:19
we now have a jitter where we kind of can get some kind of intuition
48:25
on the density of the dots in the graph, in the chart
48:32
So it's not that many points. So it doesn't matter that much, right
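The difference between plain points and jittered points can be sketched like this. The data frame `obs` is invented for illustration, and the `width` and `height` values are arbitrary choices; it assumes ggplot2 is installed.

```r
library(ggplot2)

set.seed(42)  # jitter and the sample below are random; fix the seed

# Hypothetical data: many observations share the same (group, score) pair,
# so plain points are drawn on top of each other.
obs <- data.frame(
  group = rep(c("a", "b", "c"), each = 20),
  score = sample(1:5, 60, replace = TRUE)
)

# Overplotted: one visible dot can hide many observations.
p_points <- ggplot(obs, aes(x = group, y = score)) +
  geom_point()

# Jittered: each dot is nudged by a small random amount, which gives
# an impression of how many observations sit at each position.
p_jitter <- ggplot(obs, aes(x = group, y = score)) +
  geom_jitter(width = 0.2, height = 0.1)

p_jitter
```

Using `geom_point()` and `geom_jitter()` in the same plot draws every observation twice, once in place and once nudged, which is the problem seen in the first attempt above.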
48:36
But it can just be nice. Great. So I have taken a lot of time on this ggplot
48:46
but I think it's important because it's one of the really good things about R
48:53
is it's extremely flexible with respect to visualization. So it's really a thing
49:02
There's a lot of functions here. So I'll go through a bit of test and modeling
49:07
because that will, when you start to learn, do machine learning and so on, you need to
49:16
I mean, maybe the statistical tests are not that important for you, but at least you need to know
49:21
So how do we make a model in R and what is the grammar of that one
49:27
So we just go back to this. We have these functions. They are having these arguments
49:33
We looked a bit on the t-test last time. So I'll walk you through t-tests again, just really briefly
49:46
I'll also say a bit about Fisher's test because they are kind of complementary to the T-test
49:52
as one of the very simple tests. Maybe you know the chi-square test
49:57
So Fisher's exact test is a bit like the chi-square test. It's just more exact
50:03
So it can be... In the old days, it took a long time to calculate Fisher's exact test
50:11
but it doesn't anymore. So I just use that one. So then there are some models and it can be kind of regression, logistic regressions, random forest and so on
50:24
So what I look at here is regression with multiple variables
50:34
So how do we do that? Right. And last time you saw we had a t-test where we had two vectors here with numeric variable values
50:47
We can also make a t-test where we have a vector with values and a vector with categories
50:54
And this is actually kind of a model notation already. So we have here what is the price kind of given the food label
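The two ways of calling the t-test mentioned here, two numeric vectors versus a value column "given" a category column, can be sketched as below. The `oranges` data frame and its column names are hypothetical and the numbers are simulated, so the result says nothing about real orange prices.

```r
set.seed(1)

# Hypothetical bag-of-oranges data: one price per bag plus a food label.
oranges <- data.frame(
  food_label = rep(c("conventional", "organic"), each = 10),
  price      = c(rnorm(10, mean = 20, sd = 3), rnorm(10, mean = 21, sd = 3))
)

# Form 1: two numeric vectors, one per group.
res_two_vectors <- t.test(
  oranges$price[oranges$food_label == "organic"],
  oranges$price[oranges$food_label == "conventional"]
)

# Form 2: formula notation, read as "price given food_label".
res_formula <- t.test(price ~ food_label, data = oranges)

# Both forms run the same Welch two-sample test, so the p-values agree.
res_two_vectors$p.value
res_formula$p.value
```

The formula form is the one that carries over directly to models such as `lm()`.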
51:06
I'll show you what it shows in a minute. We have Fisher's test, which is a bit like a t-test
51:15
but it's across two matrices where we look at, typically you look at if you have two different groups
51:22
and you treat them with, for example, different medicine
51:31
Then you want to look at if women behave differently than men on some kind of medicine, it could be, for example
51:40
And there is a lot of tests that are like that. So that's a very standard test, too
51:50
That's why I also brought it in. Then we have models, as I said
51:55
So the grammar of the model is typically here. We have the name of the model
51:59
Then we have the output variable, which is also called the response or regressand
52:10
And we have here input variables. So here is only an X, but it can also be a lot of different variables
52:18
Down here I have shown two variables. We can have all the rest of the variables in the data frame
52:24
we do that by a dot in there
52:35
Then we have specified the data, and then we can specify some more if we want to
52:42
if we don't want to use the default values of the function
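The model grammar described above (model name, output variable, input variables, data) looks like this for a linear model. This is a sketch on simulated data; the `bags` data frame and its columns are hypothetical names, not the data set from the session.

```r
set.seed(2)

# Hypothetical data: price is simulated to depend on weight only.
bags <- data.frame(
  weight     = runif(30, min = 1, max = 3),
  origin     = factor(rep(c("Brazil", "Spain", "California"), times = 10)),
  food_label = factor(rep(c("conventional", "organic"), each = 15))
)
bags$price <- 5 + 4 * bags$weight + rnorm(30)

# model_name(output ~ input, data = ...)
fit_one <- lm(price ~ weight, data = bags)

# Several input variables are joined with "+".
fit_two <- lm(price ~ weight + food_label, data = bags)

# A dot means "all the other columns in the data frame".
fit_all <- lm(price ~ ., data = bags)

summary(fit_all)
```

Further arguments, for example `subset` or `na.action` for `lm()`, only need to be given when the defaults are not what you want.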
52:47
And the example of today is a linear model. And I, for example, I have used my bag of oranges thing here to explore that
52:59
So I can see the time is running. So I'll not go too much into..
53:04
We should run our linear model. Yeah. We have our t-test from last time
53:14
We just want to look at it. Yeah, here we got it. We can see that the two variables here are not different
53:30
You can see the p-value is 0.1. If we run it maybe 20 times something, we would get one which has a p-value of 0.05 or below
53:45
but most likely it's not different. Then we have another example of a t-test here
54:01
And that's here where we have the price. And we have the food label
54:15
So one should think that organic oranges are more expensive than conventional oranges
54:30
But we can see here that it doesn't seem to be the case if you just make a simple t-test
54:37
And I'll get back to that because maybe our linear regression model will show something else
54:44
But let's look at this: we have created this tea-tasting matrix
54:57
This is a matrix, right? Where all the values, they have the same form
55:03
So we have, so it's like, we have two groups here and one thinks it's milk
55:13
is milk, and then there is one who thinks the tea is milk
55:17
Then there is one, then there are three here from this group says that tea is tea
55:22
and one who thinks it's milk. So it's of course really a toy example
55:26
We want to look at Fisher's test on this. And we can see here that the p-value is above
55:42
so we cannot say that the true odds ratio is greater than one
55:52
And they are not, sorry, they might be similar, actually. That's the point from here
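A sketch of Fisher's exact test on a small two-by-two matrix is given below. The counts follow the tea-tasting example from the help page of R's `fisher.test()`; it is an assumption that these are the same numbers shown on screen in the session.

```r
# 2 x 2 matrix of counts: what was guessed versus what was actually poured first.
tea_tasting <- matrix(
  c(3, 1, 1, 3),
  nrow = 2,
  dimnames = list(Guess = c("Milk", "Tea"), Truth = c("Milk", "Tea"))
)

# One-sided test: is the odds ratio greater than one, that is,
# are the guesses better than chance?
res <- fisher.test(tea_tasting, alternative = "greater")

res$p.value  # about 0.243, so no evidence of an association
```

With only eight observations, even three correct guesses out of four in each row are not enough to reject the null hypothesis.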
55:59
But that's a very small data set and so on. Great. So, but let's go back to our bag of oranges
56:14
because we saw that it seems like the food label was not, you know, just before we looked at it
56:26
we saw that organic is not significantly more expensive than conventional
56:37
Then we are thinking maybe if we make a model where we control for where it comes from
56:43
So maybe it's both important if it's organic or not, but also where it comes from
56:53
And here we have the summary of our little small regression where we have two input variables
57:03
and we can see here that so it compares here the origin with Brazil
57:13
and we can see here that there are a few countries that are significantly higher in price than Brazil
57:20
that's true but still you know the food label whether it's organic or not
57:26
you can see maybe a bit more but it's not statistically significant
57:36
You can also see here the model is not that good. The adjusted R-squared needs to be close to one
57:42
if it's natural science. In social science and so on,
57:47
I have seen much lower adjusted R-squared values. So apparently it's not a rule cast in iron,
57:54
also depending on which field you come from. So the p-value here is, on the other hand, it's significant, but not very much 0.2, right
58:06
It's below 0.05, sorry. All right. But maybe we should include, now we just try to include everything we have in this whole
58:21
in this whole data frame. And it says that it's probably unreliable
58:35
There is something wrong here, right? We can see there are more things in it
58:40
You can see here you have a bag number, and we don't want to put the bag number
58:45
in our model because, I mean, it doesn't make sense. Then we also know there is something with the regions
58:52
First of all, they correlate with the origin. So we want to take them out
58:57
We also want to take the bag number out. So just a moment, I'll just do it like this
59:03
So here we have... We're taking out... I think you can do it like this
59:15
So first we could see... So what we did here was saying we still have the price as the thing we're measuring
59:28
but what we put into our model is just all the variables here. Now we try here that's not maybe
59:37
it's like this. Sorry, I should have..
1:00:00
Don't know if this will work. No. It was such a beautiful example, but I completely forgot how to remove one
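For reference, R's formula syntax can drop variables from "all the rest" with a minus sign. The sketch below uses a simulated stand-in for the bag-of-oranges data; the column names (`bag_number`, `origin`, `region`, `food_label`, `weight`, `price`) are assumptions about how that data frame is laid out.

```r
set.seed(3)

# Hypothetical bag-of-oranges data. region is fully determined by origin.
bags <- data.frame(
  bag_number = 1:30,
  origin     = rep(c("Brazil", "Spain", "California"), times = 10),
  food_label = rep(c("conventional", "organic"), each = 15),
  weight     = runif(30, min = 1, max = 3)
)
region_of <- c(Brazil = "South America", Spain = "Europe",
               California = "North America")
bags$region <- unname(region_of[bags$origin])
bags$price  <- 5 + 4 * bags$weight + rnorm(30)

# "." means every other column; "- name" takes a column out again.
fit_reduced <- lm(price ~ . - bag_number - region, data = bags)

names(coef(fit_reduced))
```

An alternative with the same effect is to drop the columns from the data frame before fitting, for example with `subset(bags, select = -c(bag_number, region))`.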
1:00:17
But now we just remove the bag number as a start. It still warns us
1:00:25
It gives a very high R-squared. And now we also have this new price
1:00:33
Ah, now I remember. Sorry, just a moment. That's because I have done a lot of things to our bag of oranges
1:00:44
Now I try again from our pure start here. We can see we get something which is more nice here
1:00:58
Yes, we have this. It makes a bit more sense the contents here
1:01:04
We can see suddenly now food label organic is actually the most, is significantly more expensive
1:01:12
than the conventional. And this is the value here. We can also see, again, that all the countries that are not Brazil are more expensive than Brazil, with California and Spain in front
1:01:32
And then we also see weight correlates with price. And maybe it's adding this weight into the model that kind of does the trick, right
1:01:44
So maybe the weight of the organic bags were a bit lower
1:01:50
Still, we want to remove our bag number because it doesn't make sense to have a bag number within it
1:01:59
So you can see here the R square is still, it's actually a bit smaller now
1:02:07
But then the p-value of the model is also smaller. And additionally you could say that the bag number, I mean it doesn't make sense to bring it in
1:02:26
But that's something you can discuss and that's something that's also interesting in general when you model
1:02:31
that there are these things that can kind of trick you. And something you should be aware of is
1:02:44
which variables do you put into the model? Because some models, they are very sensitive to
1:02:51
if you put both country and region into them. For example, in a linear model that's a problem, because they will of course correlate, because they are kind of the same thing
1:03:03
that they are measuring. While in other models, it doesn't matter that much
1:03:12
that you apparently put in some variables that are kind of redundant
1:03:17
So, yeah, so, so you should also always be a bit aware
1:03:22
when you are modeling, of the features that you're putting into your model here
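The point about redundant variables can be made concrete with a small simulated example. The data frame and column names below are hypothetical; the only thing that matters is that `region` is fully determined by `origin`.

```r
set.seed(4)

# Hypothetical data in which each origin belongs to exactly one region.
bags <- data.frame(
  origin = rep(c("Brazil", "Spain", "California"), times = 10)
)
region_of <- c(Brazil = "South America", Spain = "Europe",
               California = "North America")
bags$region <- unname(region_of[bags$origin])
bags$price  <- rnorm(30, mean = 20, sd = 3)

# Both variables carry the same information, so the linear model
# cannot estimate separate effects for them.
fit_redundant <- lm(price ~ origin + region, data = bags)

# The region coefficients come back as NA ("not defined because of
# singularities" in the summary output).
coef(fit_redundant)
```

Tree-based methods such as random forests are usually less sensitive to this kind of redundancy, which matches the remark above that in other models it matters less.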
1:03:29
Great. That was what I had for today. I had wanted to say a bit about the pipe here, but
1:03:37
I didn't get that much into it. But it's also, I mean, it's not something that I use a lot. I just
1:03:42
know the guy who is going to tell you more about machine learning in R, he loves it. So I thought
1:03:47
It would be nice just to introduce briefly here. But I have also used it a bit here
1:03:55
So because that saves, you can see here, it saves some space now and then to put your results into a summary function instead of renaming, adding, you know, making a new data frame
1:04:09
and then you have to rename the data frame or put old data into the new data frame
1:04:17
and then maybe you make an error and override something and you have to begin from the start again and so on
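The argument for the pipe can be illustrated with a small sketch. It uses R's native pipe `|>` (available from R 4.1), which is an assumption: the session may have used `%>%` from the magrittr package, which behaves the same way in this example. The `bags` data frame is made up.

```r
# Hypothetical data.
bags <- data.frame(
  food_label = rep(c("conventional", "organic"), each = 5),
  price      = c(18, 20, 19, 22, 21, 23, 20, 24, 22, 21)
)

# Without the pipe: an intermediate data frame that has to be named
# and kept in sync with the original.
organic        <- subset(bags, food_label == "organic")
summary_nested <- summary(organic$price)

# With the pipe: each result is passed on as the first argument of the
# next function, so no intermediate object is created or overwritten.
summary_piped <- bags |>
  subset(food_label == "organic") |>
  with(price) |>
  summary()

summary_piped
```

Both versions give the same summary; the piped one reads top to bottom in the order the steps happen.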
1:04:23
So that's some of the argument for using this pipe function. But that was all I had for today
1:04:35
and I hope there is a bit of time for questions too. Hi, welcome back everyone and Sina, what an amazing session again
1:04:56
Thank you. I'll just stop sharing. And we can also say to the audience who are watching, if you have any questions, you can
1:05:08
And still, last questions, it can be both about, you know, the modeling part
1:05:11
but it can also be about, you know, RStudio and things that we talked about
1:05:16
in the previous session. So, yes, ask any questions that you want
1:05:21
Yes, and also, if you don't have a question right now, because you could follow Sina and everything was clear
1:05:28
but you get stuck somewhere later on, don't worry. Go ahead and ask us outside of the stream
1:05:34
You can catch us on Twitter, Facebook, and so go ahead and ask us questions there
1:05:45
Yeah, definitely. And I'm actually also giving a course, an introduction to programming in R and data science, at Copenhagen Business School next semester. But that's mostly for those of you who are enrolled in Copenhagen Business School, of course. Yes
1:06:02
Actually, I had a question to you, Sine, which I think our viewers would find interesting
1:06:08
Could you tell us briefly how you've used R or machine learning in your PhD or in your postdoc
1:06:15
Have you been using that? Yeah, I use it now a lot and I used it in my last postdoc as well
1:06:23
So my first postdoc was in, it was at actually a mental health institution where we looked at genetics of patients with schizophrenia
1:06:35
where we looked at if there are any genetic variants that had an influence on the risk of getting schizophrenia
1:06:49
because it's really small odds ratios and a lot of stochastic randomness that leads to that
1:07:03
And so we made some models there where we used it on genetic data
1:07:08
It was huge amounts of data. And we had to, and if I look back to it, I'm thinking maybe actually Python or something else would have been better for these huge amounts of data
1:07:19
But we used it, we did so that we actually cleaned the data before putting it into R so that we have created a smaller data frame so that it was easier to work with, right
1:07:33
because it's in the memory. So if you have a, yeah, so we need to have enough memory
1:07:40
to actually be able to work with these huge amounts of data. So now I'm also working with it more on social science methods
1:07:50
So I'm analyzing business data combined with the survey data, for example, and so on, on education
1:07:58
and actually on online teaching and learning. And that started actually just before COVID-19
1:08:07
I started on my postdoc where I should look at it and then came COVID-19
1:08:12
and I thought that it would be interesting to look both because now it's possible to measure
1:08:17
what people are doing and at the same time we can also ask them
1:08:22
through a survey so how do you feel and what's your IT skills and so on
1:08:29
so we have tried that both with teachers and students and so on. Yeah
1:08:36
But the models, sorry, that was not so much about which models we used
1:08:41
I simply cannot. Yeah, the model I used at my first postdoc was Plink, which is
1:08:47
I think it's basically, it's actually just statistics, a huge statistical model. But now we are experimenting a bit more with also using some random forest and some other machine learning methods
1:09:04
But actually, it's still a small amount of data. So sometimes a weighted linear model can work better than the random forest
1:09:19
So it's also about choosing the right method sometimes. Yes. Well, thank you a lot for that answer
1:09:29
Håkan, do we have other questions? No, not at the moment. and we should also mention that
1:09:36
we are monitoring the YouTube channel so if you ask any questions here on YouTube
1:09:40
we will be able to follow them up and answer them and I would love also if you have some comments
1:09:47
to my teaching and so on please write me an email or on Twitter or wherever
1:09:53
with the comments because I would love to improve my skills in teaching
1:10:02
and so on Yes, thank you a lot. I think with that, we want to say thank you for joining us today
1:10:12
And also to Sina for the beautiful lesson again about R. It was really, really, really exciting
1:10:19
And I actually learned a lot about R today. And I hope the audience felt like the same
1:10:26
And I would like you to provide feedback about our session today
1:10:31
and share this with us. I'm going to share a link about it
1:10:36
where you can go and give us feedback so we can improve for the next time
1:10:42
And yes, and I think with that, we can say a big thank you for joining us today
1:10:48
and take care everyone. And maybe one thing before we finish here
1:10:53
is that we will be back on May 19th and then we will be talking about T-SQL in practice
1:11:01
So if you look back on YouTube, you can find the first session about T-SQL
1:11:06
and Jean-Pierre or JP, JP Vogt, he will come back and have a little bit more advanced session
1:11:12
So we hope that you will be able to join us then
1:11:16
Yes, and I think it's also possible to follow along with him too
1:11:20
So it's going to be really exciting. So thank you Sine for today
1:11:26
and thank you everyone for joining us and see you soon the next time