
Correlation and regression analysis in Excel: instructions for execution. Parabolic and polynomial regression

The following data are available for different countries on the retail food price index (x) and the industrial production index (y).

No.  Retail food price index (x)  Industrial production index (y)
1 100 70
2 105 79
3 108 85
4 113 84
5 118 85
6 118 85
7 110 96
8 115 99
9 119 100
10 118 98
11 120 99
12 124 102
13 129 105
14 132 112

Required:

1. To characterize the dependence of y on x, calculate the parameters of the following functions:

a) linear;

b) power;

c) equilateral hyperbola.

3. Assess the statistical significance of regression and correlation parameters.

4. Make a forecast of the value of the industrial production index y with the forecast value of the retail food price index x=138.

Solution:

1. To calculate the linear regression parameters, we solve the system of normal equations for a and b:

Let's build a table of calculated data, as shown in Table 1.

Table 1. Calculated data for estimating the linear regression

No.  x  y  xy  x²  y²  ŷ  |y − ŷ|/y
1 100 70 7000 10000 4900 74,26340 0,060906
2 105 79 8295 11025 6241 79,92527 0,011712
3 108 85 9180 11664 7225 83,32238 0,019737
4 113 84 9492 12769 7056 88,98425 0,059336
5 118 85 10030 13924 7225 94,64611 0,113484
6 118 85 10030 13924 7225 94,64611 0,113484
7 110 96 10560 12100 9216 85,58713 0,108467
8 115 99 11385 13225 9801 91,24900 0,078293
9 119 100 11900 14161 10000 95,77849 0,042215
10 118 98 11564 13924 9604 94,64611 0,034223
11 120 99 11880 14400 9801 96,91086 0,021102
12 124 102 12648 15376 10404 101,4404 0,005487
13 129 105 13545 16641 11025 107,1022 0,020021
14 132 112 14784 17424 12544 110,4993 0,013399
Total: 1629 1299 152293 190557 122267 1299,001 0,701866
Average value: 116,3571 92,78571 10878,07 13611,21 8733,357 X X
σ: 8,4988 11,1431 X X X X
σ²: 72,23 124,17 X X X X X

The average value is determined by the formula:

We calculate the standard deviation using the formula:

and enter the result in Table 1.

By squaring the resulting value we get the variance:

The parameters of the equation can also be determined using the formulas:

So the regression equation is:

Therefore, with an increase in the retail food price index by 1, the industrial production index increases on average by 1.13.

Let's calculate the linear pair correlation coefficient:

The connection is direct and quite close.

Let's determine the coefficient of determination:

The variation in the result is 74.59% explained by the variation in the x factor.

Substituting the actual values of x into the regression equation, we determine the theoretical (calculated) values.

Therefore, the parameters of the equation are determined correctly.

Let's calculate the average approximation error, i.e. the average deviation of the calculated values from the actual ones:

On average, the calculated values deviate from the actual ones by 5.01%.
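As a cross-check of the figures above, the parameters, the correlation coefficient, and the approximation error can be recomputed directly from the raw data of Table 1; a minimal sketch (variable names are illustrative):

```python
# Raw data from the problem statement (Table 1)
x = [100, 105, 108, 113, 118, 118, 110, 115, 119, 118, 120, 124, 129, 132]
y = [70, 79, 85, 84, 85, 85, 96, 99, 100, 98, 99, 102, 105, 112]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi * xi for xi in x)
syy = sum(yi * yi for yi in y)

# Least-squares parameters of y = a + b*x
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
a = sy / n - b * sx / n

# Pair correlation coefficient and coefficient of determination
r = (n * sxy - sx * sy) / ((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5
r2 = r ** 2

# Average approximation error, % (mean relative deviation of fitted values)
A = 100 / n * sum(abs(yi - (a + b * xi)) / yi for xi, yi in zip(x, y))

# Point forecast for x = 138
y138 = a + b * 138
```

Running this reproduces b ≈ 1.13, r² ≈ 0.7459, the 5.01% approximation error, and a forecast of about 117.3 for x = 138.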

We will assess the quality of the regression equation using the F-test.

The F-test consists of testing the hypothesis H 0 about the statistical insignificance of the regression equation and of the indicator of the closeness of the relationship. To do this, the actual value F fact is compared with the critical (tabulated) value F table of Fisher's F-criterion.

F fact is determined by the formula:

where n is the number of population units;

m is the number of parameters for variables x.
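In code, F fact for the simple linear case (m = 1) reduces to one line; a sketch using the r² found above and the tabulated critical value F(0.05; 1; 12) = 4.75 from Appendix 1:

```python
n, m = 14, 1      # number of observations; number of parameters for x
r2 = 0.7459       # coefficient of determination found earlier

# F_fact = (r^2 / (1 - r^2)) * (n - m - 1) / m
f_fact = r2 / (1 - r2) * (n - m - 1) / m

f_table = 4.75    # critical value, alpha = 0.05, df = (1, 12)
significant = f_fact > f_table   # H0 (insignificance) is rejected if True
```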

The obtained estimates of the regression equation allow it to be used for forecasting.

If the forecast value of the retail food price index is x = 138, then the forecast value of the industrial production index will be:

2. Power regression has the form:

To determine the parameters, the power function is linearized by taking logarithms:

To determine the parameters of the logarithmic function, a system of normal equations is constructed using the least squares method:

Let's build a table of calculated data, as shown in Table 2.

Table 2. Calculated data for estimating the power regression

No.  x  y  lg x  lg y  lg x·lg y  (lg x)²  (lg y)²
1 100 70 2,000000 1,845098 3,690196 4,000000 3,404387
2 105 79 2,021189 1,897627 3,835464 4,085206 3,600989
3 108 85 2,033424 1,929419 3,923326 4,134812 3,722657
4 113 84 2,053078 1,924279 3,950696 4,215131 3,702851
5 118 85 2,071882 1,929419 3,997528 4,292695 3,722657
6 118 85 2,071882 1,929419 3,997528 4,292695 3,722657
7 110 96 2,041393 1,982271 4,046594 4,167284 3,929399
8 115 99 2,060698 1,995635 4,112401 4,246476 3,982560
9 119 100 2,075547 2,000000 4,151094 4,307895 4,000000
10 118 98 2,071882 1,991226 4,125585 4,292695 3,964981
11 120 99 2,079181 1,995635 4,149287 4,322995 3,982560
12 124 102 2,093422 2,008600 4,204847 4,382414 4,034475
13 129 105 2,110590 2,021189 4,265901 4,454589 4,085206
14 132 112 2,120574 2,049218 4,345518 4,496834 4,199295
Total 1629 1299 28,90474 27,49904 56,79597 59,69172 54,05467
Average value 116,3571 92,78571 2,064624 1,964217 4,056855 4,263694 3,861048
σ: 8,4988 11,1431 0,031945 0,053853 X X X
σ²: 72,23 124,17 0,001021 0,0029 X X X

Table 2 (continued). Calculated data for estimating the power regression

No.  x  y  ŷ  (y − ŷ)²  |y − ŷ|/y  (y − ȳ)²
1 100 70 74,16448 17,34292 0,059493 519,1886
2 105 79 79,62057 0,385112 0,007855 190,0458
3 108 85 82,95180 4,195133 0,024096 60,61728
4 113 84 88,59768 21,13866 0,054734 77,1887
5 118 85 94,35840 87,57961 0,110099 60,61728
6 118 85 94,35840 87,57961 0,110099 60,61728
7 110 96 85,19619 116,7223 0,11254 10,33166
8 115 99 90,88834 65,79901 0,081936 38,6174
9 119 100 95,52408 20,03384 0,044759 52,04598
10 118 98 94,35840 13,26127 0,037159 27,18882
11 120 99 96,69423 5,316563 0,023291 38,6174
12 124 102 101,4191 0,337467 0,005695 84,90314
13 129 105 107,4232 5,872099 0,023078 149,1889
14 132 112 111,0772 0,85163 0,00824 369,1889
Total 1629 1299 1296,632 446,4152 0,703074 1738,357
Average value 116,3571 92,78571 X X X X
σ: 8,4988 11,1431 X X X X
σ²: 72,23 124,17 X X X X

By solving a system of normal equations, we determine the parameters of the logarithmic function.

We get a linear equation:

Potentiating it (taking antilogarithms), we obtain:

Substituting the actual values of x into this equation, we obtain the theoretical values of the result. Based on them, we calculate the indicators of the closeness of the connection (the correlation index) and the average approximation error.

The connection is quite close.

On average, the calculated values ​​deviate from the actual ones by 5.02%.
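The power fit can be reproduced by running the same least-squares machinery on base-10 logarithms and then potentiating; a minimal sketch (the coefficient values are computed here, since the worked figures are not reproduced in the text, and the fitted value at x = 100 can be compared with the ŷ column of Table 2):

```python
from math import log10

x = [100, 105, 108, 113, 118, 118, 110, 115, 119, 118, 120, 124, 129, 132]
y = [70, 79, 85, 84, 85, 85, 96, 99, 100, 98, 99, 102, 105, 112]
n = len(x)

lx = [log10(v) for v in x]
ly = [log10(v) for v in y]

sx, sy = sum(lx), sum(ly)
sxy = sum(a * b for a, b in zip(lx, ly))
sxx = sum(a * a for a in lx)

# Linear fit in log coordinates: lg y = lg a + b * lg x
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
a = 10 ** (sy / n - b * sx / n)   # potentiation

y100 = a * 100 ** b   # fitted value at x = 100 (compare with Table 2)
y138 = a * 138 ** b   # forecast for x = 138
```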

Thus, the hypothesis H 0 about the random nature of the estimated characteristics is rejected, and their statistical significance and reliability are recognized.

The obtained estimates of the regression equation allow it to be used for forecasting. If the forecast value of the retail food price index is x = 138, then the forecast value of the industrial production index will be:

3. The equilateral hyperbola has the form y = a + b/x. To determine the parameters of this equation, a system of normal equations is used:

Let us make the change of variables z = 1/x and obtain the following system of normal equations:

By solving a system of normal equations, we determine the parameters of the hyperbola.

Let's create a table of calculated data, as shown in Table 3.

Table 3. Calculated data for assessing the hyperbolic dependence

No.  x  y  z = 1/x  zy  z²  y²
1 100 70 0,010000000 0,700000 0,0001000 4900
2 105 79 0,009523810 0,752381 0,0000907 6241
3 108 85 0,009259259 0,787037 0,0000857 7225
4 113 84 0,008849558 0,743363 0,0000783 7056
5 118 85 0,008474576 0,720339 0,0000718 7225
6 118 85 0,008474576 0,720339 0,0000718 7225
7 110 96 0,009090909 0,872727 0,0000826 9216
8 115 99 0,008695652 0,860870 0,0000756 9801
9 119 100 0,008403361 0,840336 0,0000706 10000
10 118 98 0,008474576 0,830508 0,0000718 9604
11 120 99 0,008333333 0,825000 0,0000694 9801
12 124 102 0,008064516 0,822581 0,0000650 10404
13 129 105 0,007751938 0,813953 0,0000601 11025
14 132 112 0,007575758 0,848485 0,0000574 12544
Total: 1629 1299 0,120971823 11,13792 0,0010510 122267
Average value: 116,3571 92,78571 0,008640844 0,795566 0,0000751 8733,357
σ: 8,4988 11,1431 0,000640820 X X X
σ²: 72,23 124,17 0,000000411 X X X
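Since the worked parameter values for the hyperbola are not shown in the text, they can be recovered from the raw data with the substitution z = 1/x; a minimal sketch:

```python
x = [100, 105, 108, 113, 118, 118, 110, 115, 119, 118, 120, 124, 129, 132]
y = [70, 79, 85, 84, 85, 85, 96, 99, 100, 98, 99, 102, 105, 112]
n = len(x)
z = [1 / v for v in x]   # change of variables for y = a + b/x

sz, sy = sum(z), sum(y)
szy = sum(zi * yi for zi, yi in zip(z, y))
szz = sum(zi * zi for zi in z)

# Least squares for the linearized model y = a + b*z
b = (n * szy - sz * sy) / (n * szz - sz ** 2)
a = sy / n - b * sz / n

y138 = a + b / 138   # forecast for x = 138
```

Note that b comes out negative: y falls as z = 1/x grows, i.e. y still rises with x, consistent with the direct connection found for the other two functions.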

Continuation of Table 3 Calculated data for assessing the hyperbolic dependence

Another type of one-factor regression is approximation by power polynomials of the form:

It is natural to want to obtain the simplest possible dependence, limiting ourselves to power polynomials of the second degree, i.e. parabolic dependence:
(5.5.2)

Let's calculate the partial derivatives with respect to the coefficients b 0 , b 1 and b 2 :



(5.5.3)

Equating the derivatives to zero, we obtain the system of normal equations:

(5.5.4)

Solving the system of normal equations (5.5.4) for a specific set of values x i , y i , we obtain the optimal values of b 0 , b 1 and b 2 . For approximation by dependence (5.5.2), and even more so by (5.5.1), simple formulas for calculating the coefficients have not been obtained; as a rule, they are computed by standard procedures in matrix form:

(5.5.5)
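A sketch of this procedure: assemble the sums of the normal system (5.5.4) and solve the resulting 3×3 system by Gaussian elimination. The data below are illustrative (generated from a known parabola so the recovered coefficients can be checked against the true values), not taken from the text:

```python
def fit_parabola(xs, ys):
    """Solve the normal equations for y = b0 + b1*x + b2*x^2."""
    s = [sum(x ** k for x in xs) for k in range(5)]                   # sums of x^0 .. x^4
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]   # sums of y*x^k
    # Augmented matrix of the normal system
    m = [[s[0], s[1], s[2], t[0]],
         [s[1], s[2], s[3], t[1]],
         [s[2], s[3], s[4], t[2]]]
    for i in range(3):                       # forward elimination with partial pivoting
        p = max(range(i, 3), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(i + 1, 3):
            f = m[r][i] / m[i][i]
            for c in range(i, 4):
                m[r][c] -= f * m[i][c]
    b = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):                      # back substitution
        b[i] = (m[i][3] - sum(m[i][c] * b[c] for c in range(i + 1, 3))) / m[i][i]
    return b

# Illustrative data lying exactly on y = 1 + 0.5*x + 2*x^2
xs = [1, 2, 3, 4, 5]
ys = [1 + 0.5 * x + 2 * x * x for x in xs]
b0, b1, b2 = fit_parabola(xs, ys)
```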

Figure 5.5.1 shows a typical example of approximation by a parabolic dependence, with experimental points from (1; 1) to (5; 9).

Fig. 5.5.1. Coordinates of the experimental points and the parabolic dependence approximating them

Example 5.1. Approximate the experimental results given in Table 5.1.1 with a linear regression equation.

Table 5.1.1

Let's construct experimental points according to the coordinates indicated in Table 5.1.1 on the graph presented in Fig. 5.1.1.

Fig. 5.1.1. Experimental points plotted from the coordinates in Table 5.1.1

According to Fig. 5.1.1, on which we draw a straight line for a preliminary assessment, we conclude that there is a clearly expressed nonlinearity in the location of the experimental points, but it is not very significant, so it makes sense to approximate them with a linear dependence. Note that to obtain a correct mathematical conclusion, the straight line must be constructed by the least squares method.

Before performing regression analysis, it is advisable to calculate the linear correlation coefficient between the variables x and y:

The significance of the correlation relationship is determined by the critical value of the linear correlation coefficient, calculated using the formula:

The critical value of Student's test, t crit , is found from statistical tables for the recommended significance level α = 0.05 and n − 2 degrees of freedom. If the calculated value r xy is not less than the critical value r crit , the correlation between the variables x and y is considered significant. Let's do the calculations:










Due to the fact that
we conclude that the correlation between the variables x and y is significant, and it can be taken as linear.
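The critical value used in this check follows from the tabulated Student's t; a sketch assuming the five experimental points x = 1 … 5 of Table 5.1.1 (as the 3 degrees of freedom cited in this example imply), with t crit = 3.182 for α = 0.05, two-sided:

```python
n = 5            # experimental points, so df = n - 2 = 3
t_crit = 3.182   # Student's t, alpha = 0.05 (two-sided), df = 3

# Critical value of the linear correlation coefficient
r_crit = t_crit / (t_crit ** 2 + n - 2) ** 0.5
```

Any computed r xy at or above this r crit (about 0.88) is treated as a significant correlation.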

Let's calculate the coefficients of the regression equation:

Thus, we obtained a linear regression equation:

Using the regression equation, we draw a straight line in Fig. 5.1.2.

Fig. 5.1.2. Coordinates of the experimental points and the linear dependence approximating them (the regression line passes through (0; −0.2) and (5; 9.8))

Using the regression equation, we calculate the values of the function at the experimental points of Table 5.1.1 and the differences between the experimental and calculated values of the function, which we present in Table 5.1.2.

Table 5.1.2


Let's calculate the mean square error and its ratio to the average value:

In terms of the ratio of the standard error to the mean value, an unsatisfactory result was obtained, since the recommended value of 0.05 was exceeded.

Let's evaluate the significance level of the regression equation coefficients using the Student's t-test:


From the statistical table for 3 degrees of freedom, we write the rows with the significance levels and the corresponding values of the Student's criterion t into Table 5.1.3.

Table 5.1.3

Significance level of regression equation coefficients:


Note that, by significance level, a satisfactory result was obtained for one coefficient and an unsatisfactory one for the other.

Let us evaluate the quality of the resulting regression equation using indicators calculated on the basis of analysis of variance:

Examination:

The result of the check is positive, which indicates the correctness of the calculations performed.

Let's calculate the Fisher criterion:

with two degrees of freedom:

Using statistical tables, we find the critical values of the Fisher criterion for the two recommended gradations of the significance level:


Since the calculated value of the Fisher test exceeds the critical value for the significance level of 0.01, we will assume that the significance level according to the Fisher test is less than 0.01, which will be considered satisfactory.

Let's calculate the coefficient of multiple determination:

for two degrees of freedom

Using the statistical table for the recommended significance level of 0.05 and two degrees of freedom found, we find the critical value of the coefficient of multiple determination:

Since the calculated value of the coefficient of multiple determination exceeds the critical value for the significance level of 0.05, the significance level by this indicator is below 0.05, and the result obtained for it will be considered satisfactory.

Thus, the obtained calculated parameters in terms of the ratio of the standard error to the mean value and the level of significance according to the Student’s test are unsatisfactory, therefore it is advisable to select another approximating dependence for approximation.

Example 5.2. Approximation of the experimental distribution of random numbers by a mathematical dependence

The experimental distribution of random numbers given in Table 5.1.1, when approximated by a linear dependence, did not lead to a satisfactory result, in part because of the insignificance of the free-term coefficient of the regression equation. Therefore, to improve the quality of the approximation, we will try a linear dependence without a free term:

Let's calculate the value of the coefficient of the regression equation:

Thus, we obtained the regression equation:

Using the resulting regression equation, we calculate the values of the function and the differences between the experimental and calculated values of the function, which we present in Table 5.2.1.

Table 5.2.1

Using the regression equation, we draw a straight line in Fig. 5.2.1.

Fig. 5.2.1. Coordinates of the experimental points and the linear dependence approximating them (the line passes through (0; 0) and (5; 9.73))

To assess the quality of the approximation, we will carry out calculations of quality indicators similar to the calculations given in example 5.1.

(remains old);

with 4 degrees of freedom;

For

Based on the results of the approximation, we note that in terms of the significance level of the coefficient of the regression equation, a satisfactory result was obtained; The ratio of the standard error to the mean has improved, but is still above the recommended value of 0.05, so it is recommended to repeat the approximation with a more complex mathematical relationship.

Example 5.3. To improve the quality of approximation of examples 5.1 and 5.2, we will carry out a nonlinear approximation by the dependence
. To do this, we will first make intermediate calculations and place their results in table 5.3.1.

Table 5.3.1. Values of X², (ln X)² and ln X · ln Y

Let's additionally calculate:

Let us approximate the dependence. Using formulas (5.3.7) and (5.3.8), we calculate the coefficients b 0 and b 1 :

Using formulas (5.3.11), we calculate the coefficients A 0 and A 1 :


To calculate the standard error, intermediate calculations were carried out, presented in Table 5.3.2.

Table 5.3.2

Columns: Y i , y i . Sum: 7.5968

The standard error of approximation turned out to be much larger than in the two previous examples, so we consider the approximation results unusable.

Example 5.4. Let's try to approximate with another nonlinear dependence. Using formulas (5.3.9) and (5.3.10) and the data of Table 5.3.1, we calculate the coefficients b 0 and b 1 :

We got an intermediate dependence:

Using formulas (5.3.13) we calculate the coefficients C 0 and C 1 :


We got the final dependency:

To calculate the standard error, we will carry out intermediate calculations and place them in table 5.4.1.

Table 5.4.1

Columns: Y i , y i . Sum: 21.83152

Let's calculate the standard error:

The standard error of approximation turned out to be much larger than in the previous example, so we consider the approximation results unusable.

Example 5.5. Approximation of the experimental distribution of random numbers by a mathematical dependence y = b · lnx

The initial data, as in the previous examples, are shown in Table 5.4.1 and Fig. 5.4.1.

Table 5.4.1

Based on the analysis of Fig. 5.4.1 and Table 5.4.1, we note that the function changes more at smaller values of the argument (at the beginning of the table) than at larger ones (at the end). It therefore seems advisable to change the scale of the argument by introducing its logarithm into the regression equation and to approximate with the following mathematical dependence:

. Using formula (5.4.3) we calculate the coefficient b:

To assess the quality of the approximation, we will carry out intermediate calculations presented in Table 5.4.2, from which we will calculate the magnitude of the error and the ratio of the standard error to the average value.

Table 5.4.2


Since the ratio of the standard error to the mean value exceeds the recommended value of 0.05, the result is considered unsatisfactory. In particular, note that the greatest deviation occurs at x = 1, since at this value ln x = 0. Therefore, we will approximate with the dependence y = b 0 + b 1 ln x

We present auxiliary calculations in the form of table 5.4.3.

Table 5.4.3

Using formulas (5.4.6) and (5.4.7) we calculate the coefficients b 0 and b 1 :

The fitted logarithmic curve passes through the points (1; 0.93) and (5; 9.12).

To assess the quality of the approximation, we will carry out auxiliary calculations and determine the level of significance of the found coefficients and the ratio of the standard error to the average value.

The significance level is slightly above the recommended value of 0.05.

Although the main indicator, the ratio of the standard error to the mean value, exceeds the recommended level of 0.05 almost twofold, we will consider the results acceptable. Note that the calculated value of the Student's test t b 0 = 2.922 differs from the critical value by a relatively small amount.

Example 5.6. Let us approximate the experimental data of Example 5.1 by a hyperbolic dependence. To calculate the coefficients b 0 and b 1 , we carry out the preliminary calculations given in Table 5.6.1.

Table 5.6.1

Columns: X i ; x i = 1/X i ; x i ²; x i y i

Based on the results of Table 5.6.1 using formulas (5.4.8) and (5.4.9), we calculate the coefficients b 0 and b 1 :

Thus, a hyperbolic regression equation is obtained

.

The results of auxiliary calculations for assessing the quality of approximation are given in Table 5.6.2.

Table 5.6.2


Based on the results of Table 5.6.2, we calculate the standard error and the ratio of the standard error to the mean value:


Due to the fact that the ratio of the standard error to the mean value exceeds the recommended value of 0.05, we conclude that the approximation results are unsuitable.

Example 5.7.

To calculate specific values ​​of income from the operation of jib cranes depending on the time of maintenance work, it is necessary to obtain a parabolic dependence.

Let us calculate the coefficients of this dependence b 0 , b 1 , b 11 in matrix form according to the formula:

Nonlinear regression equations connecting the effective indicator with the optimal values for carrying out preventive maintenance of tower cranes were obtained using the multiple regression procedure of the Statistica 6.0 application package. Next, we present the results of the regression analysis for the effective performance indicator according to Table 5.7.1.

Table 5.7.1

Table 5.7.2 shows the results of nonlinear regression for the effective performance indicator and Table 5.7.3 shows the results of the analysis of residuals.

Table 5.7.2

Table 5.7.3

Fig. 3.7.36. Residual analysis.

Thus, we obtained a multiple regression equation for the variable
:

Ratio of standard error to mean:

14780/1017890=0,0145 < 0,05.

Since the ratio of the standard error to the mean value does not exceed the recommended value of 0.05, the approximation results can be considered acceptable. As a drawback according to Table 5.7.2, it should be noted that all calculated coefficients exceed the recommended significance level of 0.05.

Let's consider constructing a regression equation of the form .

The system of normal equations for finding the parabolic regression coefficients is compiled in the same way as the normal equations of linear regression.

After transformations we get:

.

By solving a system of normal equations, the coefficients of the regression equation are obtained.

,

where … , and … .

The second degree equation describes the experimental data significantly better than the first degree equation if the decrease in variance compared to the linear regression variance is significant (non-random). The significance of the difference between and is assessed by the Fisher criterion:

where the number is taken from reference statistical tables (Appendix 1) according to the degrees of freedom and the selected significance level.
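As a sketch of this check, using the residual variances reported for the water-vapour example later in this text (D²1ost = 0.0085 for the straight line, D²2ost = 0.0002 for the parabola, with a tabulated value of 4.39):

```python
d1_ost = 0.0085   # residual variance of the first-degree (linear) equation
d2_ost = 0.0002   # residual variance of the second-degree (parabolic) equation

f = d1_ost / d2_ost   # Fisher criterion for the significance of the variance reduction
f_table = 4.39        # tabulated value for the chosen significance level

adequate = f > f_table   # the quadratic term is justified if True
```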

The procedure for performing calculation work:

1. Familiarize yourself with the theoretical material set out in the guidelines or in additional literature.

2. Calculate the coefficients of the linear regression equation. To do this, you need to compute the sums; it is convenient to compute at once the sums that will also be needed for the coefficients of the parabolic equation.

3. Compute the calculated values of the output parameter using the equation.

4. Calculate the total and residual variances, as well as Fisher's criterion.

where – is the matrix whose elements are the coefficients of the system of normal equations;

– is the vector whose elements are the unknown coefficients;

– is the matrix of the right-hand sides of the system of equations.

7. Compute the calculated values of the output parameter using the equation.

8. Calculate the residual variance, as well as Fisher's criterion.



9. Draw conclusions.

10. Construct graphs of regression equations and initial data.

11. Complete settlement work.

Calculation example.

Using experimental data on the dependence of water vapor density on temperature, obtain regression equations of the form and . Conduct statistical analysis and draw a conclusion about the best empirical relationship.

0,0512 0,0687 0,081 0,1546 0,2516 0,3943 0,5977 0,8795

Processing of experimental data was carried out in accordance with the recommendations for the work. Calculations to determine the parameters of the linear equation are given in Table 1.

Table 1 - Finding the parameters of a linear dependence of the form ρ = b 0 + b 1 ∙t
Water vapor density at the saturation line
ρ i | t i ρ i | ρ(t i ) calc | Δρ i = ρ(t i ) calc − ρ i | (Δρ i )² | (ρ i − ρ̄)²
0,0512 2,05 -0,0403 -0,0915 0,0084 0,0669
0,0687 3,16 0,0248 -0,0439 0,0019 0,0582
0,0811 4,22 0,0899 0,0089 0,0001 0,0523
0,1546 9,9 0,2202 0,06565 0,0043 0,0241
0,2516 19,12 0,3505 0,09894 0,0098 0,0034
0,3943 34,70 0,4808 0,08654 0,0075 0,0071
0,5977 59,77 0,6111 0,01344 0,0002 0,0829
0,8795 98,50 0,7414 -0,13807 0,0191 0,3245
Sum: Σρ i = 2,4786; Σt i ρ i = 231,41; Σ(Δρ i )² = 0,0512; Σ(ρ i − ρ̄)² = 0,6194
Average: t̄ = 72,25; ρ̄ = 0,3098; mean t i ² = 5822,5; mean t i ρ i = 28,93
b 0 = -0,4747; D 1 ost ² = 0,0085
b 1 = 0,0109; D y ² = 0,0885
F = 10,368
F T = 3,87; F > F T , the model is adequate

.
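The parameters b 0 and b 1 in Table 1 can be verified directly. The temperature column is not reproduced in the text, so the t i values below are an assumption, recovered by dividing the t i ρ i column by ρ i :

```python
# Density of water vapour at the saturation line (problem data)
rho = [0.0512, 0.0687, 0.0811, 0.1546, 0.2516, 0.3943, 0.5977, 0.8795]
# Temperatures in degrees C (assumption: recovered from the t*rho column)
t = [40, 46, 52, 64, 76, 88, 100, 112]
n = len(t)

st, sr = sum(t), sum(rho)
strho = sum(ti * ri for ti, ri in zip(t, rho))
stt = sum(ti * ti for ti in t)

# Least squares for rho = b0 + b1*t
b1 = (n * strho - st * sr) / (n * stt - st ** 2)
b0 = sr / n - b1 * st / n
```

The recovered sums match the table (Σtρ = 231.41, mean t² = 5822.5), and the fit reproduces b 0 ≈ -0.4747 and b 1 ≈ 0.0109.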

To determine the parameters of parabolic regression, the elements of the coefficient matrix and the matrix of the right sides of the system of normal equations were first determined. Then the coefficients were calculated in the MathCad environment:

The calculation data are given in Table 2.

Designations in table 2:

.

Conclusions

The parabolic equation describes the experimental data on the dependence of vapor density on temperature significantly better, since the calculated value of the Fisher criterion (42.5) substantially exceeds the table value of 4.39. Therefore, including a quadratic term in the polynomial equation makes sense.

The results obtained are presented in graphical form (Fig. 3).

Figure 3 – Graphic interpretation of calculation results.

The dotted line is the linear regression equation; solid line – parabolic regression, points on the graph – experimental values.

Table 2. – Finding the parameters of the dependence y(t) = a 0 + a 1 ∙x + a 2 ∙x², water vapor density on the saturation line ρ = a 0 + a 1 ∙t + a 2 ∙t², points i = 1 … 8:

t i , °C: 40 46 52 64 76 88 100 112
ρ i : 0,0512 0,0687 0,0811 0,1546 0,2516 0,3943 0,5977 0,8795; sum 2,4786; average 0,3098
t i ρ i : 2,05 3,16 4,22 9,89 19,12 34,70 59,77 98,50; sum 231,41
t i ²ρ i : 81,84 145,33 219,21 633,24 1453,2 3053,4 5977,00 11032,45; sum 22595,77
ρ(t i ) calc : 0,0631 0,0642 0,0773 0,1394 0,2493 0,4070 0,6126 0,8660; sum 2,4788
Δρ i = ρ(t i ) calc – ρ i : 0,01194 –0,00446 –0,00377 –0,01524 –0,00235 0,01270 0,011489 –0,01348
(Δρ i )²: 0,0001 0,0000 0,0000 0,0002 0,0000 0,0002 0,0002 0,0002; sum 0,0010
(ρ i – ρ̄)²: 0,0669 0,0582 0,0523 0,0241 0,0034 0,0071 0,0829 0,3245; sum 0,6194
D 1 ² ost = 0,0085; D 2 ² ost = 0,0002; D 1 ² y = 0,0885; F = 42,5
a 0 = 0,36129; a 1 = –0,0141; a 2 = 1,6613E-04

Appendix 1

Fisher distribution table for q = 0,05

f 1 (columns): 1, 2, 3, 4, 5, 6, 8, 12, 24, ∞
f 2 (rows): 1–30, 40, 60, 120
161,40 199,50 215,70 224,60 230,20 234,00 238,90 243,90 249,00 254,30
18,51 19,00 19,16 19,25 19,30 19,33 19,37 19,41 19,45 19,50
10,13 9,55 9,28 9,12 9,01 8,94 8,84 8,74 8,64 8,53
7,71 6,94 6,59 6,39 6,26 6,16 6,04 5,91 5,77 5,63
6,61 5,79 5,41 5,19 5,05 4,95 4,82 4,68 4,53 4,36
5,99 5,14 4,76 4,53 4,39 4,28 4,15 4,00 3,84 3,67
5,59 4,74 4,35 4,12 3,97 3,87 3,73 3,57 3,41 3,23
5,32 4,46 4,07 3,84 3,69 3,58 3,44 3,28 3,12 2,93
5,12 4,26 3,86 3,63 3,48 3,37 3,24 3,07 2,90 2,71
4,96 4,10 3,71 3,48 3,33 3,22 3,07 2,91 2,74 2,54
4,84 3,98 3,59 3,36 3,20 3,09 2,95 2,79 2,61 2,40
4,75 3,88 3,49 3,26 3,11 3,00 2,85 2,69 2,50 2,30
4,67 3,80 3,41 3,18 3,02 2,92 2,77 2,60 2,42 2,21
4,60 3,74 3,34 3,11 2,96 2,85 2,70 2,53 2,35 2,13
4,54 3,68 3,29 3,06 2,90 2,79 2,64 2,48 2,29 2,07
4,49 3,63 3,24 3,01 2,82 2,74 2,59 2,42 2,24 2,01
4,45 3,59 3,20 2,96 2,81 2,70 2,55 2,38 2,19 1,96
4,41 3,55 3,16 2,93 2,77 2,66 2,51 2,34 2,15 1,92
4,38 3,52 3,13 2,90 2,74 2,63 2,48 2,31 2,11 1,88
4,35 3,49 3,10 2,87 2,71 2,60 2,45 2,28 2,08 1,84
4,32 3,47 3,07 2,84 2,68 2,57 2,42 2,25 2,05 1,81
4,30 3,44 3,05 2,82 2,66 2,55 2,40 2,23 2,03 1,78
4,28 3,42 3,03 2,80 2,64 2,53 2,38 2,20 2,00 1,76
4,26 3,40 3,01 2,78 2,62 2,51 2,36 2,18 1,98 1,73
4,24 3,38 2,99 2,76 2,60 2,49 2,34 2,16 1,96 1,71
4,22 3,37 2,98 2,74 2,59 2,47 2,32 2,15 1,95 1,69
4,21 3,35 2,96 2,73 2,57 2,46 2,30 2,13 1,93 1,67
4,20 3,34 2,95 2,71 2,56 2,44 2,29 2,12 1,91 1,65
4,18 3,33 2,93 2,70 2,54 2,43 2,28 2,10 1,90 1,64
4,17 3,32 2,92 2,69 2,53 2,42 2,27 2,09 1,89 1,62
4,08 3,23 2,84 2,61 2,45 2,34 2,18 2,00 1,79 1,52
4,00 3,15 2,76 2,52 2,37 2,25 2,10 1,92 1,70 1,39
3,92 3,07 2,68 2,45 2,29 2,17 2,02 1,88 1,61 1,25

Regression and correlation analysis are statistical research methods. These are the most common ways to show the dependence of a parameter on one or more independent variables.

Below, using specific practical examples, we will consider these two very popular analyzes among economists. We will also give an example of obtaining results when combining them.

Regression Analysis in Excel

It shows the influence of some values (the independent, explanatory variables) on the dependent variable. For example: how does the number of the economically active population depend on the number of enterprises, wages and other parameters? Or: how do foreign investments, energy prices, etc. affect the level of GDP?

The result of the analysis allows you to highlight priorities. And based on the main factors, predict, plan the development of priority areas, and make management decisions.

Regression happens:

  • linear (y = a + bx);
  • parabolic (y = a + bx + cx 2);
  • exponential (y = a * exp(bx));
  • power (y = a*x^b);
  • hyperbolic (y = b/x + a);
  • logarithmic (y = b * ln(x) + a);
  • exponential (y = a * b^x).

Let's look at an example of building a regression model in Excel and interpreting the results. Let's take the linear type of regression.

Task. At 6 enterprises, the average monthly salary and the number of quitting employees were analyzed. It is necessary to determine the dependence of the number of quitting employees on the average salary.

The linear regression model looks like this:

Y = a 0 + a 1 x 1 + … + a k x k ,

where a are the regression coefficients, x are the influencing variables, and k is the number of factors.

In our example, Y is the indicator of quitting employees. The influencing factor is wages (x).

Excel has built-in functions that can be used to calculate the parameters of a linear regression model. But the Analysis ToolPak add-in (the “Analysis Package”) will do it faster.

We activate a powerful analytical tool:

Once activated, the add-on will be available in the Data tab.

Now let's do the regression analysis itself.



First of all, we pay attention to R-squared and coefficients.

R-squared is the coefficient of determination. In our example – 0.755, or 75.5%. This means that the calculated parameters of the model explain 75.5% of the relationship between the studied parameters. The higher the coefficient of determination, the better the model. Good - above 0.8. Bad – less than 0.5 (such an analysis can hardly be considered reasonable). In our example – “not bad”.

The coefficient 64.1428 shows what Y will be if all variables in the model under consideration are equal to 0. That is, the value of the analyzed parameter is also influenced by other factors not described in the model.

The coefficient -0.16285 shows the weight of variable X on Y. That is, the average monthly salary within this model affects the number of quitters with a weight of -0.16285 (this is a small degree of influence). The “-” sign indicates a negative impact: the higher the salary, the fewer people quit. Which is fair.
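The worksheet data behind this example are not reproduced in the article, but the same three outputs, intercept, slope and R², can be computed for any two-column data set; a sketch on hypothetical salary/quit figures (all numbers below are illustrative, not the article's):

```python
def linreg(x, y):
    """Return intercept, slope and R^2 for simple linear regression."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = sy / n - slope * sx / n
    r = (n * sxy - sx * sy) / ((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5
    return intercept, slope, r ** 2

# Hypothetical data: average monthly salary vs. number of employees who quit
salary = [250, 300, 350, 400, 450, 500]
quit_count = [28, 25, 21, 20, 16, 14]
a0, a1, r2 = linreg(salary, quit_count)
```

As in the article's output, a negative slope here means the higher the salary, the fewer people quit, and R² measures how much of that variation the model explains.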



Correlation Analysis in Excel

Correlation analysis helps determine whether there is a relationship between indicators in one or two samples. For example, between the operating time of a machine and the cost of repairs, the price of equipment and the duration of operation, the height and weight of children, etc.

If there is a connection, then does an increase in one parameter lead to an increase (positive correlation) or a decrease (negative) of the other. Correlation analysis helps the analyst determine whether the value of one indicator can be used to predict the possible value of another.

The correlation coefficient is denoted by r. Varies from +1 to -1. The classification of correlations for different areas will be different. When the coefficient is 0, there is no linear relationship between samples.

Let's look at how to find the correlation coefficient using Excel.

To find paired coefficients, the CORREL function is used.

Objective: Determine whether there is a relationship between the operating time of a lathe and the cost of its maintenance.

Place the cursor in any cell and press the fx button.

  1. In the “Statistical” category, select the CORREL function.
  2. Argument “Array 1” - the first range of values ​​– machine operating time: A2:A14.
  3. Argument “Array 2” - second range of values ​​– repair cost: B2:B14. Click OK.

To determine the type of connection, you need to look at the absolute value of the coefficient (each field of activity has its own scale).

For correlation analysis of several parameters (more than two), it is more convenient to use “Data Analysis” (the Analysis ToolPak add-in): select “Correlation” from the list and specify the data array. That's all.

The resulting coefficients will be displayed in the correlation matrix. Like this:
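Outside Excel, the same pairwise matrix can be sketched in Python by applying the Pearson formula to every pair of columns (the first two columns echo the price/production indices from the task above; the third column is made up for illustration):

```python
from math import sqrt

def correl(xs, ys):
    """Pearson correlation coefficient for two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))

def corr_matrix(columns):
    """Full symmetric matrix of pairwise coefficients, as in the add-in output."""
    return [[round(correl(a, b), 3) for b in columns] for a in columns]

cols = [
    [100, 105, 108, 113, 118, 120, 124],  # price index
    [70, 79, 85, 84, 85, 99, 102],        # production index
    [5.1, 4.8, 4.6, 4.7, 4.2, 4.0, 3.8],  # hypothetical third indicator
]
for row in corr_matrix(cols):
    print(row)
```

The diagonal is always 1.0 (each variable correlates perfectly with itself), and the matrix is symmetric, just as in the Excel output.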

Correlation and regression analysis

In practice, these two techniques are often used together.

Example:


Now the regression analysis data has become visible.

Let's consider a paired linear regression model of the relationship between two variables, for which the regression function φ(x) is linear. Denote by y_x the conditional average of the characteristic Y in the population at a fixed value x of the variable X. Then the regression equation takes the form:

y_x = ax + b, where a is the regression coefficient (the slope of the linear regression line). The regression coefficient shows by how many units the variable Y changes on average when the variable X changes by one unit. Using the least squares method, formulas are obtained for calculating the linear regression parameters:

Table 1. Formulas for calculating linear regression parameters

Free term b: b = ȳ − a·x̄

Regression coefficient a: a = (x̄y − x̄·ȳ) / (x̄² − (x̄)²), where x̄y is the mean of the products xy and x̄² is the mean of the squares x²

Determination coefficient: R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²

Testing the hypothesis about the significance of the regression equation:

H₀: a = 0 (the equation is not significant)

H₁: a ≠ 0

F = R² / (1 − R²) · (n − p − 1) / p, compared with the critical value F_cr(α; p; n − p − 1) from Appendix 7 (for linear regression p = 1)
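The least-squares formulas above can be checked numerically; a minimal sketch in pure Python, with illustrative data chosen so the answer is obvious:

```python
def linear_regression(xs, ys):
    """Least-squares estimates for y_x = a*x + b:
    a = (mean(xy) - mean(x)*mean(y)) / (mean(x^2) - mean(x)^2),
    b = mean(y) - a*mean(x)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    xy_mean = sum(x * y for x, y in zip(xs, ys)) / n
    x2_mean = sum(x * x for x in xs) / n
    a = (xy_mean - x_mean * y_mean) / (x2_mean - x_mean ** 2)
    b = y_mean - a * x_mean
    return a, b

# Points lying exactly on y = 2x + 1 recover a = 2, b = 1.
a, b = linear_regression([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # → 2.0 1.0
```

For noisy data the same formulas return the best-fitting line in the least-squares sense.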

The direction of the relationship between variables is determined based on the sign of the regression coefficient. If the sign of the regression coefficient is positive, the relationship between the dependent variable and the independent variable will be positive. If the sign of the regression coefficient is negative, the relationship between the dependent variable and the independent variable is negative (inverse).

To analyze the overall quality of the regression equation, the coefficient of determination R² is used, also called the square of the multiple correlation coefficient. The coefficient of determination (a measure of certainty) always lies in the interval [0; 1]. If the value of R² is close to one, the constructed model explains almost all of the variability in the corresponding variables. Conversely, a value of R² close to zero means the constructed model is of poor quality.

The determination coefficient R² shows what percentage of the relationship between the original values of Y and X is described by the found regression function. Fig. 3 shows the variation explained by the regression model and the total variation. Accordingly, the value 1 − R² shows what percentage of the variation of the parameter Y is due to factors not included in the regression model.
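The split into explained and residual variation can be illustrated directly; a sketch with hypothetical data scattered around the line y = 2x:

```python
def r_squared(xs, ys, a, b):
    """R^2 = 1 - SS_res / SS_tot: the share of the variation of Y
    explained by the linear model y = a*x + b."""
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
    ss_tot = sum((y - y_mean) ** 2 for y in ys)                   # total variation
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical observations near y = 2x
r2 = r_squared(xs, ys, a=2.0, b=0.0)
print(round(r2, 4))      # close to 1: the line explains almost all variation
print(round(1 - r2, 4))  # share attributable to factors outside the model
```

The two printed values always sum to 1, mirroring the decomposition of total variation described above.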

With a high value of the coefficient of determination (greater than 75%), a forecast can be made for a specific value within the range of the initial data. When predicting values outside that range, the validity of the resulting model cannot be guaranteed, because new factors that the model does not take into account may come into play.

The significance of the regression equation is assessed using the Fisher criterion (see Table 1). Provided that the null hypothesis is true, the criterion has a Fisher distribution with k₁ = p and k₂ = n − p − 1 degrees of freedom (for paired linear regression p = 1). If the null hypothesis is rejected, the regression equation is considered statistically significant. If the null hypothesis is not rejected, the regression equation is considered statistically insignificant or unreliable.
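The Fisher criterion reduces to a one-line computation; in the sketch below the sample size and the critical value are assumed for illustration (the critical value would normally be looked up in a table such as Appendix 7):

```python
def fisher_f(r2, n, p=1):
    """Observed F = R^2 / (1 - R^2) * (n - p - 1) / p for a regression with p factors."""
    return (r2 / (1 - r2)) * (n - p - 1) / p

# Assumed example: R^2 = 0.92, n = 10 observations, paired regression (p = 1).
f = fisher_f(0.92, 10)
f_critical = 5.32  # F(alpha = 0.05; k1 = 1; k2 = 8) from a standard table
print(round(f, 1), f > f_critical)  # F far exceeds F_cr: H0 is rejected
```

Since the observed F exceeds the critical value, the null hypothesis is rejected and the equation would be declared statistically significant.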

Example 1. In the machine shop, the structure of product costs and the share of purchased components are analyzed. It was noted that the cost of components depends on the time of their delivery. Distance traveled was selected as the most important factor affecting delivery time. Conduct regression analysis of supply data:

Distance, miles

Time, min

To perform regression analysis:

    construct a graph of the initial data, approximately determine the nature of the dependence;

    choose the type of regression function and determine the numerical coefficients of the model using the least squares method and the direction of the relationship;

    evaluate the strength of regression dependence using the coefficient of determination;

    evaluate the significance of the regression equation;

    make a forecast (or a conclusion about the impossibility of forecasting) using the adopted model for a distance of 2 miles.

2. Calculate the sums needed for the coefficients of the linear regression equation and the coefficient of determination R²: Σx, Σy, Σxy and Σx².

The required regression dependence has the form ŷ = 2.6597x + 5.9135. We determine the direction of the relationship between the variables: the sign of the regression coefficient is positive; therefore, the relationship is also positive, which confirms the graphical assumption.

3. Let's calculate the coefficient of determination: R² = 0.9183, or 92%. Thus, the linear model explains 92% of the variation in delivery time, which means that the factor (distance) was chosen correctly. The remaining 8% of the time variation is unexplained and is due to other factors that influence delivery time but are not included in the linear regression model.

4. Let's check the significance of the regression equation: since the calculated value of the Fisher criterion F exceeds the critical value F_cr at significance level α = 0.05, the regression equation (linear model) is statistically significant.

5. Let's solve the forecasting problem. Since the coefficient of determination R² has a high enough value and the 2 mile distance for which the prediction is to be made is within the range of the input data, the prediction can be made: ŷ(2) = 2.6597 · 2 + 5.9135 ≈ 11.23 min.
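The point forecast in step 5 is just a substitution into the fitted equation from Example 1:

```python
a, b = 2.6597, 5.9135  # coefficients of the fitted line from Example 1

def predict(x):
    """Point forecast of delivery time (min) for a given distance (miles)."""
    return a * x + b

print(round(predict(2), 2))  # → 11.23
```

As noted above, this forecast is trustworthy only because x = 2 lies inside the range of the original data and R² is high.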

Regression analysis can be conveniently carried out using the capabilities of Excel. The “Regression” mode is used to calculate the parameters of the linear regression equation and check its adequacy for the process under study. In the dialog box, fill in the following parameters:

Example 2. Complete the task of Example 1 using the “Regression” mode of Excel.

SUMMARY OUTPUT

Regression statistics

Multiple R

R Square

Adjusted R Square

Standard Error

Observations

Coefficients

Standard Error

t Stat

P-value

Intercept

Variable X 1

Let's look at the results of the regression analysis presented in the table.

The value of R Square, also called the measure of certainty, characterizes the quality of the resulting regression line. This quality is expressed by the degree of correspondence between the source data and the regression model (calculated data). In our example, the measure of certainty is 0.91829, which indicates a very good fit of the regression line to the original data and coincides with the coefficient of determination R² calculated by the formula.

Multiple R, the multiple correlation coefficient R, expresses the degree of dependence between the independent variable (X) and the dependent variable (Y) and is equal to the square root of the coefficient of determination. In simple linear regression analysis the Multiple R coefficient equals the linear correlation coefficient (r = 0.958).

Linear model coefficients: Intercept gives the value of the free term b, and Variable X 1 gives the regression coefficient a. Then the linear regression equation is:

y = 2.6597x + 5.9135 (which agrees well with the calculation results in Example 1).

Next, let's check the significance of the regression coefficients a and b. Comparing the values in the Coefficients and Standard Error columns of the table, we see that the absolute values of the coefficients are greater than their standard errors. In addition, these coefficients are significant, as can be judged by the values of the P-value indicator, which are less than the specified significance level α = 0.05.

Observation

Predicted Y

Residuals

Standard Residuals

The table shows the residual output. Using this part of the report, we can see the deviation of each point from the constructed regression line. The largest residual in absolute value in this case is 1.89256, the smallest is 0.05399. To interpret these data better, plot the original data and the constructed regression line. As the plot shows, the regression line fits the values of the initial data well, and the deviations are random.