Jendela Statistika

Seeing the World Through Data as an Investment



Good evening, blogger friends!
Welcome back to Jendela Statistik.

Tonight, Jendela Statistik continues the discussion of linear regression with the next part: multiple linear regression.

What is multiple linear regression?

Multiple linear regression is an extension of simple linear regression.
In simple linear regression we have one predictor and one response variable,
but in multiple regression we have more than one predictor variable and one response variable.

Put simply, multiple linear regression is used to explain the relationship between
one dependent variable and two or more independent variables.
The general form of the multiple linear regression model is

yi = β0 + β1x1i + β2x2i + ... + βkxki + εi

for i = 1, 2, ..., n, where k is the number of predictors.
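As a sketch of this general form, we can simulate data from a two-predictor model and recover the coefficients with lm(). All the numbers and variable names below are made up for illustration:

```r
# Simulate data from y = B0 + B1*x1 + B2*x2 + e and fit it with lm().
set.seed(42)
n  <- 100
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)
e  <- rnorm(n, mean = 0, sd = 1)   # random error term
y  <- 3 + 2 * x1 - 1.5 * x2 + e    # true B0 = 3, B1 = 2, B2 = -1.5

fit <- lm(y ~ x1 + x2)
coef(fit)   # the estimates should land close to 3, 2, and -1.5
```
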

Before we move on to the next step, we need to understand several assumptions that the data must meet for multiple regression to give valid results.
So, let's discuss these assumptions.

Okay, what are the assumptions in multiple regression?

The previous article already mentioned the multiple regression assumptions briefly. This time we will discuss them in more detail.

Let's take a look at these assumptions:

Assumption number 1:

Our dependent variable should be measured on a continuous scale (either an interval or a ratio variable).
Examples of variables that meet this criterion include revision time (measured in hours), intelligence (measured using an IQ score),
exam performance (measured from 0 to 100), and weight (measured in kg).

Assumption number 2:

We have two or more independent variables, which can be either continuous (i.e., interval or ratio variables) or categorical (i.e., ordinal or nominal variables).
Examples of nominal variables include gender (e.g., 2 groups: male and female), ethnicity (e.g., 3 groups: Caucasian, African American and Hispanic), physical activity level (e.g., 4 groups: sedentary, low, moderate and high), profession (e.g., 5 groups: surgeon, doctor, nurse, dentist, therapist).

Assumption number 3:

We should have independence of observations (i.e., independence of residuals), which we can easily check using the Durbin-Watson statistic.

Assumption number 4:

There needs to be a linear relationship between the dependent variable and each of the independent variables.
We can check this using scatterplots, visually inspecting each one for linearity.
If the relationships displayed in our scatterplots are not linear, we will have to either run a non-linear regression analysis or transform our data.
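A minimal sketch of this visual check, using simulated data (the variable names here are made up):

```r
# Visually checking linearity with scatterplots.
set.seed(1)
x1 <- runif(50, 0, 10)
x2 <- runif(50, 0, 10)
y  <- 1 + 0.5 * x1 + 2 * x2 + rnorm(50)
d  <- data.frame(y, x1, x2)

pairs(d)   # scatterplot matrix: y against each predictor

# or one predictor at a time:
plot(d$x1, d$y, xlab = "x1", ylab = "y", main = "Check linearity")
```
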

Assumption number 5:

The data should show homoscedasticity.
Simply put, this means the variances along the line of best fit remain similar as you move along the line.
To see this, we can plot the studentized residuals against the unstandardized predicted values.
In the plot, the residuals should be spread evenly around the zero line.

Alternatively, we can use statistical tests such as the Park, Glejser, or White test.
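As a sketch of the Glejser test idea, we regress the absolute residuals on the predictors; significant coefficients suggest heteroscedasticity. The data below are simulated with homoscedastic errors for illustration:

```r
# Glejser test sketch: regress |residuals| on the predictors.
set.seed(7)
x1 <- runif(80, 1, 10)
x2 <- runif(80, 1, 10)
y  <- 2 + x1 + x2 + rnorm(80)   # homoscedastic errors by construction
fit <- lm(y ~ x1 + x2)

glejser <- lm(abs(resid(fit)) ~ x1 + x2)
summary(glejser)   # p-values above 0.05 would support homoscedasticity
```
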

Assumption number 6:

The data must show no multicollinearity, which occurs when two or more independent variables are highly correlated with each other.
This leads to problems with understanding which independent variable contributes to the variance explained in the dependent variable.

Assumption number 7:

We need to check that the residuals (errors) are approximately normally distributed.
Two common methods to check this assumption are:


(a) a histogram (with a superimposed normal curve) and a Normal P-P Plot; or
(b) a Normal Q-Q Plot of the studentized residuals.


We can also use statistical tests such as the Kolmogorov-Smirnov or Shapiro-Wilk test.

Okay, now that we have studied the material above,
let's learn how to run a multiple regression analysis in R.

The first step of a multiple regression analysis is to prepare the data and input it into R. Our dataset has three columns: EKS (exports), HRG (export price), and KURS (exchange rate):

EKS	HRG	KURS
3678.8 248.48 5.65
4065.3 331.48 10.23
8431.4 641.88 13.5
15718 100.8 13.84
11891 536.69 12.66
9349.7 332.25 13.98
14561 657.6 15.69
20148 928.1 16.62
26776 1085.5 18.96
43501 1912.2 22.05
49223 2435.8 22.5
65076 6936.7 20.6
54941 3173.14 43
58097 2107.7 70.67
112871 2935.7 71.2
108280 3235.8 84

by using the code:

DataMLR = read.delim("clipboard")  # input the data into R
DataMLR                            # print the data

Next, we form the multiple regression model with the code:

# Fit the multiple linear regression model
multi.fit = lm(EKS ~ HRG + KURS, data = DataMLR)

# print the results of the multiple linear regression model
summary(multi.fit)

The summary of the linear regression model reports four performance measures:

  1. Residual Standard Error: the standard deviation of the residuals. Smaller is better.
  2. Multiple / Adjusted R-squared: R-squared shows the amount of variance explained by the model. Adjusted R-squared takes into account the number of variables and is the more useful of the two for multiple regression (with one predictor the distinction doesn't really matter).
  3. F-statistic: the F-test checks whether at least one coefficient is significantly different from zero. This is a global test to help assess the model. If the p-value is not significant (e.g. greater than 0.05), your model is essentially not doing anything.
  4. t-statistics: the t-test interprets the coefficient of each independent variable individually, and is also called a partial test. If the p-value is significant (e.g. smaller than 0.05), that variable has a significant partial effect on the dependent variable.
Interpreting R's Regression Output

The model feasibility test, more popularly known as the F-test (also called the simultaneous model test), is an early stage of checking whether the estimated regression model is worthy or not.
The F-test gives a calculated F of 66.82 with a p-value of 1.445e-07.
Because the p-value is smaller than 0.05, the regression model can be used to predict EKS; in other words, KURS and HRG together have a significant effect on EKS.

The Multiple R-squared (0.9113) and Adjusted R-squared (0.8977) are both exceptionally high.
The rest (100% − 91.13% = 8.87%) is explained by causes outside the model.

Residual standard error: 11180. The smaller the standard error, the more precise the regression model is in predicting the dependent variable.

The t-test in multiple linear regression checks whether the estimated parameters (regression coefficients and constant) of the linear regression equation are right or not; "right" here means able to explain how the independent variables affect the dependent variable. The parameters estimated in linear regression are the intercept (constant) and the slopes (the coefficients in the linear equation). In this section, the t-test focuses on the slope parameters only, so it is a test of the regression coefficients.

Both independent variables entered into the regression model are significant (t-test) at the 0.05 level. From here it can be concluded that EKS is influenced by the variables KURS and HRG, with the mathematical equation:

EKS = -4067.496 + 7.815 HRG + 1001.855 KURS

The constant of -4,067.496 states that if the independent variables are held at zero, the predicted export is -4,067.496.

The regression coefficient of HRG is positive, meaning that when the export price goes up, the number of exports (EKS) also rises, and when the price goes down, exports drop as well. An increase of one unit in the export price increases exports by 7.815 tons, and vice versa.

The regression coefficient of KURS is also positive and has the same interpretation as the coefficient of HRG.
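To illustrate, we can plug values of HRG and KURS into the estimated equation (the input values below are hypothetical; with the fitted lm object, predict() gives the same result):

```r
# Predicting exports from the estimated equation
# EKS = -4067.496 + 7.815*HRG + 1001.855*KURS
b0 <- -4067.496
b1 <- 7.815
b2 <- 1001.855

HRG  <- 500   # hypothetical export price
KURS <- 12    # hypothetical exchange rate

EKS_hat <- b0 + b1 * HRG + b2 * KURS
EKS_hat   # predicted export volume

# With the fitted model object, the equivalent call would be:
# predict(multi.fit, newdata = data.frame(HRG = 500, KURS = 12))
```
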

Okay, the next step is the assumption tests.
The first step is the multicollinearity test.

We can check it by looking at the VIF values.
Because neither variable's VIF is greater than 10 (or 5; some books set the threshold at 10, others at 5), we can say there is no multicollinearity between the two independent variables.
Based on the classical assumptions of linear regression with OLS, a good model is free from multicollinearity.
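As a sketch, the VIF can be computed by hand as VIF = 1/(1 − R²), where R² comes from regressing one predictor on the other; the car package's vif() function gives the same result. Using the HRG and KURS columns from the data above:

```r
# VIF by hand: VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing
# predictor j on the other predictor(s).
HRG  <- c(248.48, 331.48, 641.88, 100.8, 536.69, 332.25, 657.6, 928.1,
          1085.5, 1912.2, 2435.8, 6936.7, 3173.14, 2107.7, 2935.7, 3235.8)
KURS <- c(5.65, 10.23, 13.5, 13.84, 12.66, 13.98, 15.69, 16.62,
          18.96, 22.05, 22.5, 20.6, 43, 70.67, 71.2, 84)

vif_HRG  <- 1 / (1 - summary(lm(HRG ~ KURS))$r.squared)
vif_KURS <- 1 / (1 - summary(lm(KURS ~ HRG))$r.squared)
c(vif_HRG, vif_KURS)   # with two predictors the two VIFs are identical
```
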

The second step is the autocorrelation test.

We can check it by looking at the Durbin-Watson test value.
The Durbin-Watson value is compared against acceptance/rejection criteria based on the number of independent variables in the regression model (k) and the sample size (n). The critical values dL and dU can be found in the Durbin-Watson table at the 5% significance level (α = 0.05).

Number of independent variables: k = 2
Sample size: n = 16
The Durbin-Watson table gives dL = 0.982 and dU = 1.539, so
the criteria for deciding whether autocorrelation occurs can be determined.
The calculated DW value of 2.162 lies between dU = 1.539 and 4 − dU = 2.461, which is the region of no autocorrelation. Thus it can be concluded that the linear regression model has no autocorrelation.
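As a sketch, the Durbin-Watson statistic can be computed by hand from the residuals as DW = Σ(e_t − e_{t−1})² / Σe_t²; lmtest::dwtest() reports the same statistic together with a p-value. Using the data columns from the table above:

```r
# Durbin-Watson statistic computed by hand from the fitted model.
EKS  <- c(3678.8, 4065.3, 8431.4, 15718, 11891, 9349.7, 14561, 20148,
          26776, 43501, 49223, 65076, 54941, 58097, 112871, 108280)
HRG  <- c(248.48, 331.48, 641.88, 100.8, 536.69, 332.25, 657.6, 928.1,
          1085.5, 1912.2, 2435.8, 6936.7, 3173.14, 2107.7, 2935.7, 3235.8)
KURS <- c(5.65, 10.23, 13.5, 13.84, 12.66, 13.98, 15.69, 16.62,
          18.96, 22.05, 22.5, 20.6, 43, 70.67, 71.2, 84)

multi.fit <- lm(EKS ~ HRG + KURS)
e  <- resid(multi.fit)
DW <- sum(diff(e)^2) / sum(e^2)
DW   # compare against dL and dU from the Durbin-Watson table
```
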

The third step is the homoscedasticity test.
We can check it by visualising a scatterplot:

plot(multi.fit$fitted, rstudent(multi.fit),
     main="Multi Fit Studentized Residuals",
     xlab="Predictions", ylab="Studentized Resid")
abline(h=0, lty=2)

The plot shows that the points do not form any specific pattern, so we can infer that heteroscedasticity does not occur (i.e., homoscedasticity holds). The classical assumption about heteroscedasticity is met: the model is free from heteroscedasticity.

The fourth step is the normality test.
We can check it by visualising a plot.

The result of the normality test can be seen from a Normal P-P Plot. Keep in mind that the normality assumption among the classical OLS assumptions is that the residuals of the linear regression model are normally distributed, not the independent or dependent variables themselves. With the P-P Plot approach, we judge whether the residuals are normally distributed by looking at the distribution of the points: if the points lie on or close to the straight (diagonal) line, the residuals are said to be normally distributed; if they spread far from the line, they are not.

In the Normal P-P Plot, the points lie relatively close to a straight line, so it can be inferred that the residuals are normally distributed. This result is in line with the classical assumptions of linear regression with the OLS approach.
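As a sketch, a Q-Q plot of the studentized residuals and the Shapiro-Wilk test can be produced as follows, refitting the model from the data columns in the table above:

```r
# Checking residual normality with a Q-Q plot and the Shapiro-Wilk test.
EKS  <- c(3678.8, 4065.3, 8431.4, 15718, 11891, 9349.7, 14561, 20148,
          26776, 43501, 49223, 65076, 54941, 58097, 112871, 108280)
HRG  <- c(248.48, 331.48, 641.88, 100.8, 536.69, 332.25, 657.6, 928.1,
          1085.5, 1912.2, 2435.8, 6936.7, 3173.14, 2107.7, 2935.7, 3235.8)
KURS <- c(5.65, 10.23, 13.5, 13.84, 12.66, 13.98, 15.69, 16.62,
          18.96, 22.05, 22.5, 20.6, 43, 70.67, 71.2, 84)

multi.fit <- lm(EKS ~ HRG + KURS)
r <- rstudent(multi.fit)        # studentized residuals
qqnorm(r); qqline(r)            # points near the line suggest normality
shapiro.test(resid(multi.fit))  # p-value above 0.05 supports normality
```
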

Since all the assumption tests are met, the model can be said to be BLUE (Best Linear Unbiased Estimator).
This means the regression model is good to use in this case.

Happy learning and trying.
Best regards,
Jendela Statistik.
Suggestions and corrections are welcome.
